ROC/AUC Interpretation in SAS Logistic Regression

Classic Approach in Credit Risk Analysis

From the previous tutorial, we know how to use PROC LOGISTIC to fit a logistic regression model in SAS.

Before you run the logistic regression, you may want to split the dataset into training and validation parts. For example, you can randomly select 75% of the observations as training data and the remaining 25% as validation data, as shown in the code below.
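As a language-neutral illustration, the same kind of 75/25 split can be sketched in Python. The data here are hypothetical, and SAS's ranuni and Python's random use different generators, so the assignments will not match the SAS run record for record:

```python
import random

# Hypothetical example data: each record is a (features, label) pair.
records = [({"x": i}, i % 2) for i in range(1000)]

random.seed(123456789)  # fixed seed for reproducibility, like ranuni(123456789)
train, valid = [], []
for rec in records:
    # Each record independently lands in training with probability 0.75,
    # mirroring split = (ranuni(123456789) le 0.75) in the SAS code.
    (train if random.random() <= 0.75 else valid).append(rec)

print(len(train), len(valid))  # roughly 750 vs. 250
```

Because each record is assigned independently, the split is only approximately 75/25; for an exact split you would shuffle and slice instead.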

Simply put, the area under the curve (AUC) of a receiver operating characteristic (ROC) curve is a way to reduce ROC performance to a single value representing expected performance.

To explain in a little more detail, a ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) of a binary classifier as its discrimination threshold is varied. Since a random classifier traces the diagonal of the unit square, it has an AUC of 0.5. At a minimum, a classifier should perform better than this, and the larger the area under its ROC curve, the better its expected performance.
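The threshold sweep described above can be made concrete. A minimal Python sketch with hypothetical scores and labels builds the ROC points and computes the trapezoidal area; a perfect ranker scores 1.0, while a ranker whose errors straddle the diagonal lands at the chance level of 0.5:

```python
def roc_points(scores, labels):
    """Sweep the discrimination threshold and collect (FPR, TPR) pairs."""
    thresholds = sorted(set(scores), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical scores: every positive outranks every negative -> AUC 1.0.
perfect = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
# One positive above both negatives, one below both -> chance-level 0.5.
chance = auc(roc_points([0.4, 0.3, 0.2, 0.1], [1, 0, 0, 1]))
print(perfect, chance)  # 1.0 0.5
```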

For other evaluation methods, the user has to choose a cut-off point above which an observation is assigned to the positive class (e.g., a logistic regression model returns a probability between 0 and 1; the modeler might decide that predictions greater than 0.5 mean a positive-class prediction, while predictions of 0.5 or less mean a negative-class prediction). AUC evaluates the classifier at all cut-off points, giving better insight into how well it separates the two classes.
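This "all cut-off points" view has a useful equivalent reading: the AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative case (ties counted as one half). A small Python sketch with hypothetical scores:

```python
def auc_rank(scores, labels):
    """AUC as the rank statistic: fraction of positive/negative pairs
    in which the positive case receives the higher score."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

# A single 0.5 cutoff would call the 0.4-scored positive a negative,
# but the AUC still credits it for outranking two of the three negatives.
print(auc_rank(scores, labels))  # 8/9 ~ 0.889
```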

Pay special attention to the code right before the logistic modeling step. We used:

We can use the following code:

```sas
data data1;
   set data0;
   split = (ranuni(123456789) le 0.75);  /* 75% training data, 25% testing data */
run;

ods graphics on;
ods trace on;   /* tells you all the datasets available for output */
ods output ParameterEstimates=Parameter1;
/* ods exclude ParameterEstimates(persist); */

proc logistic data=data1 descending namelen=100 plots=roc
              outest=Cov_betas covout;
   /* namelen= can also be used in other procedures, e.g. proc glm */
   model dep_var (event='1') = &vars.
         / ctable pprob=(0 to 1 by .10)
           outroc=ROC1    /* outputs the sensitivity and specificity */
           rsq lackfit;
   weight split;          /* fit on the training part only (split=1) */
   output out=Data2 predicted=p_hat l=lower u=upper xbeta=logit;
   /* illustrative statements: Heat, Soak and A would need to be model effects */
   oddsratio Heat / at(Soak=1 2 3 4);
   contrast '1 vs 4' A 2 1 1;
run;

ods graphics off;

/* Note: the following KS calculation is based on split=1 */
data ROC2;
   set ROC1;   /* the ROC output dataset is automatically sorted by the predicted probability */
   retain ks_stat;
   format CDF_pcnt_bad_catch CDF_pcnt_good_catch percent8.1;
   score_cutoff = floor(_PROB_/0.1)*100;
   total_bad  = _POS_ + _FALNEG_;
   total_good = _NEG_ + _FALPOS_;
   CDF_pcnt_bad_catch  = _FALNEG_/total_bad;   /* pcnt of 1s */
   CDF_pcnt_good_catch = _NEG_/total_good;     /* pcnt of 0s */
   ks_stat = max(ks_stat, CDF_pcnt_good_catch - CDF_pcnt_bad_catch);
run;

proc sql;
   select max(_NEG_/(_NEG_+_FALPOS_) - _FALNEG_/(_POS_+_FALNEG_))
   into :KS
   from ROC1;
quit;

title "The KS for training data = &KS.";
```

If you get an error message like "ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: Java heap space.", try putting the NOPRINT option on the PROC LOGISTIC statement; the error should go away.
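The KS computation in the PROC SQL step can be cross-checked outside SAS. A minimal Python sketch, using hypothetical counts in place of the _POS_, _NEG_, _FALPOS_ and _FALNEG_ columns of the outroc= dataset:

```python
# Hypothetical rows of the outroc= dataset, one per probability cutoff:
# (_POS_, _NEG_, _FALPOS_, _FALNEG_) = (TP, TN, FP, FN) at that cutoff.
roc1 = [
    (10, 85, 5, 0),   # low cutoff: catch every bad, flag some goods
    (8, 88, 2, 2),
    (5, 90, 0, 5),    # high cutoff: miss half the bads, flag no goods
]

# KS = max over cutoffs of (pct of goods below the cutoff
#                           - pct of bads below the cutoff),
# mirroring max(_NEG_/(_NEG_+_FALPOS_) - _FALNEG_/(_POS_+_FALNEG_)).
ks = max(tn / (tn + fp) - fn / (tp + fn)
         for tp, tn, fp, fn in roc1)
print(round(ks, 3))  # 0.944
```

A KS near 1 means some cutoff separates the goods and bads almost perfectly; a KS near 0 means no cutoff does better than chance.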