EZ Study

Actuarial Biology Chemistry Economics Calculators Confucius Engineer
Physics
C.S.


How to replace substring in SAS/SQL/Netezza?
How to grab web content from page url by SAS?

In this tutorial we are using the prx(Perl Regular Expression) SAS function to grab web contents. Those SAS Perl function includes: prxmatch, prxchange etc, very effevtive in text extraction.

Here is some example, say, you want to replace the text string 'World' by 'Planet' in the expression: 'Hello World',
new_expression=prxchange('s/World/Planet/',1,'Hello World');
s is for substitution, 1 means the first match.

You can also use Tranwrd function to replace all the substring in a string by something else:
new_text=tranwrd(old_text,'old_string_to_replace','new_string_replaced_with');

        /*********Example to use Tranwrd*****************/
        data _null_;
        new_text=tranwrd("Hello World","World","Planet");
        put new_text=;
        run;


you may want to replace all "(" or ")" or "&" by nothing.
new_expression=prxchnage('s/\)//',1,old_expression);
Notice the special "\" before ")" to indicate the special characters, same as "+";

or you may want to replace all the phone number in a string new_text =
prxchange('s/\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d/*Num Removed*/', -1, text);
An expression like (801)443-9876 will be replaced by "Num Removed"

or youwant to find all records with digits/numbers inside var1:
where prxmatch('/\d{1}/',var_name)>0 ;

In SQL insisde Netezza, we can use the function replace or regexp_replace to replace substring by something else. You need to Download and Install Netezza SQL extensions toolkit.

select sql_functions..replace('123135', '12', 'ab') ==> 'ab3135'
select sql_functions..regexp_replace('123135', '12', 'ab') ==> 'ab3135'

Replace all the non-alphabet characters by _ in the string inside Netezza/SQL/SAS:
EDW_UTILITY_1_4..regexp_replace(var, '[^:abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:]', '_') as ... A good resource of can be found at here.

Replace all the non-alphabet characters by _ in the string in R: gsub("[^:abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:]","_",)

Here is an example to read the table from the NBA stats webpage for Kobe:

grab read load content from webpage in SAS

        /******************************************/
        filename etf url
    'http://www.basketball-reference.com/players/b/bryanko01/gamelog/2010/#pgl_advanced::none'
        RECFM=v lrecl=1000;

        /*ready all the url content into file with varing length*/
        data temp1;
        infile etf length=len;
        length content $32766. ;
        input content $varying32766. len;
        run;

        /*you can use the following code to check the data structure in temp1*/
        proc contents data=temp1 position; run;
        /* Use "position" options intead of "varnum" */
        /* to get both alphabetical order output and creation order output. */

        data work.temp2(drop=content);
        retain min;
        length content_new $50;
        set work.temp1;
        /*get rid of all the content between <> and spaces*/
        content_new= prxchange('s/<.+?>//',-1,content);

        /* "-1" means to the end of the expression, not the 1st, 2nd,...*/
        row=_n_;
        if index(content_new,'Age is Yea')>0 then min=row;
        if min^=.; run;

        proc sql;         select min(min) into: min from temp2;
        quit;
        data temp3(drop=row lag_content) ; set temp2 ;
        lag_content=lag(content_new);
        row_num=ceil((_n_)/30);
        if lag_content='' and content_new=''
        then delete;
        if min=&min. and _n_>8;
        run;

        proc transpose data=temp3 out=temp4;
        var content_new;
        by row_num; run;

        data temp5(drop=row_num _name_) ; set temp4 ;
        if compress(col2)='G' or col2=''
        then delete; run;



Related links:

Continue to next: Keep temporary files and re-direct work folder in SAS   SAS tutorial home
Back to: How to get beautiful geographical picture in SAS?   Statistics tutorial home