EZ Study

Actuarial Biology Chemistry Economics Calculators Confucius Engineer

SAS Tutorial in handling large datasets
Tip-II: Netezza/Hadoop/Teradata Applications

Question 1: What is Netezza/Hadoop/Teradata access? Why do we need Netezza/Hadoop/Teradata? How to access Netezza/Hadoop/Teradata from SAS? How to create table in Netezza? How to distribut datasets in Netezza/Hadoop/Teradata?

Answer: In layman language, Netezza/SAS access provides a faster way to extra subset of data from a huge dataset(say, over billions of records, each record with hundreds of variables).

For the reason why Netezza is faster, inside the Netezza performance server(NPS),
through the netezza streaming analytic applicance it facilities some cool netezza tips e.g. storing, filtering, and processing terabytes of data within a single unit by way of analyzing only the relevant information for each query.

        /* create table on Netezza server, make the connection by libname */
        libname netezza_lib Netezza/Hadoop/Teradata server='netezza server name'
        database=database_name user=username Password=******;

        data dataout;         set netezza_lib.datain;         run;

Note: you need to know the table name(datain) inside that datatbase, otherwise, in the previous SAS version(9.1.3), there is no way you can click that library to see which table they have.

        /***** Using bulkload method to upload, faster *******/
        /***** Advantage: log file summaries to check the uploading data **/
        /***** Disadvantage: easy to break by some tiny data error *******/
        proc sql;
        create table netezza_lib.dataout(bulkload=yes bl_delimiter='|'
        bl_options= "logdir 'the directory to write the log file' ")
        as select *         from work.datain;         quit;

In fact, if you change to the following, it's even more faster!

        proc sql;
        create table netezza_lib.dataout(bulkload=yes bl_delimiter='|'
        bl_options= "logdir '\the directory to write the log file\' ")
        as select * from work.datain(obs=0); quit;

        proc append base=netezza_lib.dataout( bulkload=YES
        bl_use_pipe=NO bl_delete_datafile=YES
        bl_options=' ctrlchars true
        crInString true skipRows 0 maxerrors 100
        logdir "\the directory to write the log file\"')
        data=work.datain force; run;

Most of time, the default setting for delimiter is the pipe sign "|", which we can change to other delimiter. You can also use the following regular but more slow approach:
        proc sql;
        create table netezza_lib.dataout
        (distribute_on = 'random' (or change to 'keyvar' ))
        as select * from work.datain; quit;

You may also have learnt some other pupular data management tools, like Teradata, or Oracle, IBM(IBM bought Netezza), Sybase IQ, which are all based on a "massively parallel processing / shared nothing architecture". Netezza seems more popular and compactable very well with SAS.

What if you still get those ECL errors when uploading onto netezza, either because of some columns used some "dafault" restricted columns names, or some columns have speical delimiters. How do we know which one to delete/detect? We can actually have a small macro to check each variable.

data  Sample;
set database.Original_data(obs=10000);
/*where datepart(datecreated)>'01Jul2016'd;*/

 proc contents noprint data=Sample
      out=vars(keep=name type length varnum format label) varnum; run; 
 proc sql noprint; 
       select count(*) into: num_of_num_vars from vars where type=1; 
 proc sql noprint; 
        select name into: var1-:var%eval(&num_of_num_vars.) 
        from vars where type=1; quit; 

option mprint=no symbolgen=yes;
%macro up;
%do i=1 %to &num_of_num_vars.;

data  Sample1;
set Sample;
keep &&var&i..;


%mend up;

Related links:

Continue to next: Tips III: Various Netezza In-house Analytics Functions   SAS tutorial home
Back to: Tips to handle large datasets-I: by using Index   Statistics tutorial home