EZ Study


Logistic Regression Analysis Study Notes 2-I
ROC/AUC Interpretation in SAS Logistic regression
Classic Approach in Credit Risk Analysis

SAS is used by many banks and other financial institutions to perform the statistical analysis that turns a company's data into business intelligence. Running logistic regression in SAS is therefore a very common task in credit risk analytics.

From the previous tutorial, we know how to generate the predicted probability for each observation (via the output statement, together with the outroc= option) and how to rescale those predicted probabilities into a score. For example, if we want to normalize the score to the range 0 to 800, we can simply use 800 as a multiplier on p_hat.
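To make the rescaling concrete, here is a minimal Python sketch (illustrative only; the function name `prob_to_score` is not from the tutorial) of mapping a predicted probability onto a 0-to-800 score scale:

```python
def prob_to_score(p_hat, max_score=800):
    """Scale a predicted probability in [0, 1] to a score in [0, max_score]."""
    return round(max_score * p_hat)

# A probability of 0.5 maps to the midpoint of the score range.
print(prob_to_score(0.5))   # 400
print(prob_to_score(1.0))   # 800
```

In SAS this is just `score = 800 * p_hat;` in a data step after the `output out=` dataset is created.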

Before you run logistic regression, you may want to split the dataset into training and validation parts; for example, you can randomly select 75% of the observations as training data and the remaining 25% as validation data, as shown in the code below.

Area Under the Curve (AUC)
Simply put, the area under the curve (AUC) of a receiver operating characteristic (ROC) curve is a way to reduce ROC performance to a single value representing expected performance.

To explain in a little more detail, a ROC curve plots the true positive rate (sensitivity) vs. the false positive rate (1 - specificity) for a binary classifier as its discrimination threshold is varied. Since a random classifier traces the diagonal of the unit square, it has an AUC of 0.5. At a minimum, classifiers should perform better than this, and the higher one scores than another (meaning the larger the area under its ROC curve), the better its expected performance.

For other evaluation methods, a user has to choose a cut-off point above which the target variable is assigned to the positive class (e.g., a logistic regression model returns a real number between 0 and 1; the modeler might decide that predictions greater than 0.5 mean a positive class prediction while predictions less than 0.5 mean a negative class prediction). AUC evaluates the classifier at all cut-off points, giving better insight into how well it separates the two classes.
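To make the "all cut-off points" idea concrete, here is a minimal Python sketch (illustrative, not the SAS implementation) that sweeps every score as a threshold to build the ROC points and integrates the area with the trapezoidal rule:

```python
def roc_auc(y_true, y_score):
    """AUC via trapezoidal integration of the ROC curve.

    y_true: 0/1 labels; y_score: predicted probabilities (no tied scores
    assumed in this simple sketch)."""
    pos = sum(y_true)              # number of 1s (events)
    neg = len(y_true) - pos        # number of 0s (non-events)
    # Sort by descending score, then treat each score as a cut-off in turn.
    pairs = sorted(zip(y_score, y_true), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]          # (false positive rate, true positive rate)
    for _, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    # Trapezoidal rule over consecutive ROC points.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2
    return auc

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A perfectly separating classifier yields an AUC of 1.0, and reversing the scores of that classifier yields 0.0.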

Pay special attention to the following code right before the logistic modeling: we used ods trace on; to list the names of all the output objects available, which is very helpful when you are looking for a particular piece of output.

We can also use ods exclude ParameterEstimates(persist); to suppress a particular table from the HTML or listing output. Sometimes you have too much output in the SAS HTML/listing window and just want to stop some tables from showing up there, while still capturing them as datasets via ods output. If you used the noprint option instead, it would also prevent the ODS output datasets from being created.

```
data data1;
set data0;
split=(ranuni(123456789) le 0.75); /* 75% training data, 25% validation data */
run;

ods graphics on;
ods trace on; /* lists the names of all output objects available */
ods output ParameterEstimates=Parameter1;
/* ods exclude ParameterEstimates(persist); */

proc logistic data=data1 descending namelen=100 plots=roc
              outest=Cov_betas covout;
   /* namelen= can also be applied in other procedures, e.g. proc glm */
   model dep_var (event='1') = &vars.
         / ctable pprob=(0 to 1 by 0.10)
           outroc=ROC1       /* outputs the sensitivity and specificity */
           rsq lackfit;
   oddsratio Heat / at(Soak=1 2 3 4); /* assumes Heat and Soak are model effects */
   weight split;   /* validation rows get weight 0, so only split=1 is fitted */
   output out=Data2 predicted=p_hat l=lower u=upper xbeta=logit;
   contrast '1 vs 4' A 2 1 1; /* assumes A is a class effect in the model */
run;
ods graphics off;

/* Note: the following KS calculation is based on split=1 (training data) */
data ROC2;
set ROC1; /* the outroc= data is automatically sorted by the cutoff _PROB_ */
retain ks_stat 0;
score_cutoff=floor(_PROB_/0.1)*100;
total_good=_NEG_+_FALPOS_;             /* all 0s */
total_bad =_POS_+_FALNEG_;             /* all 1s */
CDF_pcnt_good_catch=_NEG_/total_good;  /* pct of 0s below the cutoff */
CDF_pcnt_bad_catch =_POS_/total_bad;   /* pct of 1s at or above the cutoff */
ks_stat=max(ks_stat, abs(_SENSIT_-_1MSPEC_)); /* KS = max separation */
run;
```