Logistic Regression Analysis Study Notes
Data Preparation Check

Before we run any logistic regression, there are a few things to check:
1) What is the % of bad cases vs. the % of good cases?
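A quick way to get these percentages is a one-way frequency table. This is only a sketch: it assumes the data set is called data0 and the 0/1 flag is called bad, the names used in the later steps of these notes.

```sas
/* Sketch: share of bad vs. good records.                      */
/* Assumes data0 holds a 0/1 flag named bad.                   */
proc freq data=data0;
   tables bad / missing;   /* MISSING also counts untagged records */
run;
```

If bad cases turn out to be a very small share, that is a signal to think about the sampling discussed in item 4.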
2) Grab a few sample records from the bad cases and check whether they are really bad.
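proc sort data=test1a; by level_id; run;   /* STRATA needs sorted input */
proc surveyselect data=test1a seed=411792001
   method=srs n=20 out=test1b noprint;     /* 20 records per level_id */
   strata level_id;
run;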
3) If you find that some cases tagged as bad are not really bad, then reconsider the definition.

4) If computational power is the limit, e.g. a logistic regression on 100 million records with more than 200 variables may take forever to run, consider sampling the good cases and keeping all the bad ones. To keep particular types from dominating the data (e.g. big auto-makers or big brands with a large number of records), you may want to apply a different sampling ratio to each type, so that the resulting data is more evenly spread across them.
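One way to apply a different sampling ratio per type is stratified sampling of the good cases only. The sketch below is an illustration, not the method used in these notes: the variable type, the three rates, and all output data set names are hypothetical, and the rate list assumes exactly three distinct values of type.

```sas
/* Keep every bad case; put good cases aside for sampling.     */
/* Hypothetical names: data0 has a 0/1 flag bad and a grouping */
/* variable type with three distinct values.                   */
data bad_all good_all;
   set data0;
   if bad=1 then output bad_all;
   else if bad=0 then output good_all;
run;

proc sort data=good_all; by type; run;

/* One rate per stratum, listed in the sorted order of type.   */
/* Large types get a small rate, small types a large one.      */
proc surveyselect data=good_all seed=411792001
   method=srs samprate=(0.02 0.10 0.50) out=good_samp noprint;
   strata type;
run;

/* Modeling data: all bad cases plus the sampled good cases.   */
data model_in;
   set bad_all good_samp;
run;
```

PROC SURVEYSELECT adds the selection probability and sampling weight to good_samp, which is handy if the estimates need to be re-weighted back to the full population later.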

5) Watch out for double counting: the same id can show up with both bad=0 and bad=1.
proc sql;
   create table data1a as
   select distinct id
      from data0 as a
      where bad=1;
quit;
proc sql;
   create table data1b as
   select distinct id
      from data0 as a
      where bad=0;
quit;
data data1b;
   set data1b;
   rannum=ranuni(411792001);   /*uniform random number between 0 and 1*/
   if rannum<=0.1;             /*say take 10% of good ones*/
   drop rannum;
run;
proc sql; drop table data1; quit;
proc append base=data1 data=data1a force; run;
proc append base=data1 data=data1b force; run;
proc sql;                      /*remove double count cases*/
   create table data2 as
   select distinct id
      from data1 as a;
quit;
data data3;
   set data2;
run; /*split into testing and validation*/
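The split mentioned in that last comment is not spelled out in these notes. A minimal sketch, assuming a random 70/30 split and a flag variable named split (the PROC REG call further down filters on split>0, which suggests a positive value marks the modeling sample; the ratio and the seed are assumptions):

```sas
data data3;
   set data2;
   /* split=1: modeling sample, split=0: validation sample.    */
   /* The 70/30 ratio is an assumed value, not from the notes. */
   split = (ranuni(411792001) <= 0.7);
run;
```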
6) In some complex situations the raw dollar amount is not comparable: a purchase price of $20,000 means something quite different for an economy car than for a luxury car, and returning a $50 item to Best Buy is quite different from returning a $50 item to Walmart. If you want to compare them apples to apples, you might want to use a percentile variable instead of the absolute raw dollars.
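A percentile variable of this kind can be built with PROC RANK run separately within each group. The data set and variable names below (purchases, car_class, price) are hypothetical and only illustrate the idea.

```sas
/* Hypothetical names: purchases, car_class, price.            */
proc sort data=purchases; by car_class; run;

proc rank data=purchases out=purchases_pct groups=100;
   by car_class;        /* rank within each class, not overall   */
   var price;
   ranks price_pct;     /* 0 = lowest 1% of the class, 99 = top  */
run;
```

With this, a $20,000 economy car and a $20,000 luxury car land at very different values of price_pct, which is what the model should see.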

7) Back to defining "bad" cases: this is actually the most important part, since it directly affects the accuracy of your outcome. Besides picking a few samples to double check line by line, also think about whether those variables are appropriate for the definition. To be accurate, we should use variables observed in the future to define whether a case is bad, rather than variables from the past; if you also put those past variables into the model, they might be perfectly (or nearly perfectly) correlated with the outcome.

8) Before we run the logistic model, spend some time checking the distribution of the explanatory variables. Run a simple proc means to see the basic statistics, just to make sure there are no silly data mistakes.
proc means data=datain;
var var1 var2 ... ;
output out= modout1; run;  
9) If the model output gives you something unexpected or counterintuitive, for example a parameter that common sense says should be negative gets a positive estimate, then:

a) Check the distribution of that variable across the bad cases: break it into 10 ranked deciles and see whether the plot shows a trend, e.g. the 10th decile has a significantly higher % of bad cases. This gives some intuitive understanding of the variable.
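The decile check can be sketched as follows. It assumes the data set datain from item 8 also carries the 0/1 flag bad, and uses var1 as the variable under suspicion; ranked and var1_decile are made-up names.

```sas
/* Cut var1 into 10 ranked groups.                             */
proc rank data=datain out=ranked groups=10;
   var var1;
   ranks var1_decile;   /* 0 = lowest decile, 9 = highest        */
run;

/* Bad rate per decile: the mean of a 0/1 flag is the share of */
/* bad cases in that group.                                    */
proc means data=ranked n mean;
   class var1_decile;
   var bad;
run;
```

A bad rate that rises or falls steadily across the deciles tells you which sign to expect for the coefficient.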

b) Run the regular linear regression PROC REG to check the VIF (variance inflation factor).
If it is bigger than 5 (or 10, under a looser rule of thumb), then strong collinearity exists in the model.
 proc reg data=modout(where=(split>0));
        model pred=var1 var2 ...
        / vif;
     output out=Preddat1 student=stdres;
   run; quit;  
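For reference, the standard definition of the VIF for predictor j explains where these cutoffs come from:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Here R_j^2 is the R-squared from regressing predictor j on all the other predictors. A VIF of 5 corresponds to R_j^2 = 0.8 and a VIF of 10 to R_j^2 = 0.9, i.e. 80% or 90% of the variation in that predictor is already explained by the others.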
