EZ Study

Actuarial Biology Chemistry Economics Calculators Confucius Engineer

SAS Tutorial in handling large datasets
Tip-I: Apply index to speed up large data process
Tip-II: Aplly Proc Datasets to Label, Rename, Informat

Question: When we deal with a table of millions/billions of records, and we need to do some table joins, say, inner join with a table of just 10,000 records by some key variables, or even just want to extract part of the data by some key variables, it may
take a while(sometimes several hours) if we don't handle it carefully. How can we process the large datasets as fast as possible?

Answer: SAS indexes can drastically improve the performance of programs that access small subsets of observations from large SAS data sets.

Example: Here is a SAS example to create index on large datasets, and application:

        /******** Create two simple indexs on dataset data1 *********/
        proc sql;
        create index visitid on data1;
        Create a composite index for subsetting using two variables
        create index car on data1(make model);
        /************ apply index to extract data **************/
        proc sql;
        create table table1 as
        select *
        from data1(where=(zipcode='94305'));
        /************* apply index to join data *****************/
        proc sql;
        create table table2 as
        select a.*
        from data1 as a inner join data2 as b
        on a.zipcode=b.zipcode; quit;

Something to Note:

Aplly Proc Datasets to Label, Rename, Informat

If we want to label,rename, format or informat some big dateset, the 2nd way by Proc datasets is much more efficient. The followying code are summarized from the SAS paper: PROC DATASETS: Managing Data Efficiently by Daphne Ewing.

        data dataout1(label='Adverse Events');
        set datain;
        rename newvar=oldvar;
        format date1 mmddyy10.;     informat date2 date7.;
        label var1 = 'Subject Number'
        var2 = 'Site Number'; run;

        proc datasets lib=work nolist;
        modify datain (label='Adverse Events');
        modify datain; label var1 = 'Subject Number'
        var2 = 'Site Number';
        rename newvar=oldvar;
        format date1 mmddyy10.;     informat date2 date7.;
        quit; run;

Comparing the CPU TIME for the above two tasks that are doing exactly the same thing, the DATASETS procedure completes more quickly. The difference between these two methods is that when the Data Step is used, every record contained in the data set has to be processed. When using the DATASETS procedure, SAS is manipulating the header information of the data set(descriptor) and therefore doesn't process each observation making it far more efficient.

Related links:

Continue to next: Tips to handle large datasets-II: Netezza/SAS Access   SAS tutorial home
Back to: Tips for survey data: Remove carriage returns   Statistics tutorial home