EZ Study

Robust Estimates for Detecting Outliers in SAS

Real data almost always contain some extremely noisy records. As Rousseeuw & Hubert put it:
"In real datasets, it often happens some observations are different from the majority.
Such observations are called outliers. ... They do not fit the model well. It is very important to be able to detect these outliers."

Robust Estimates in the UNIVARIATE Procedure
The UNIVARIATE procedure supports several robust estimators of location. The trimmed and Winsorized means are computed by using the TRIM= and WINSOR= options, respectively. PROC UNIVARIATE not only computes the robust estimates, it also computes their standard errors, as shown in the following example.
    data a;
    input x @@;
    datalines;
    6.25 6.27 6.28 6.34 63.1
    ;
    run;
    proc univariate data=a trim=0.2 winsor=1;
    var x;
    ods select BasicMeasures TrimmedMeans WinsorizedMeans;
    run;
The trimmed mean is computed by excluding the k smallest and k largest values, and computing the mean of the remaining values. The Winsorized mean is computed by replacing the k smallest values with the (k+1)st smallest, and replacing the k largest values with the (k+1)st largest.

The mean of the resulting values is the Winsorized mean. For both estimators, you can specify either the number of observations to trim or Winsorize in each tail, or a proportion of the values. Formulas for the trimmed and Winsorized means are included in the documentation of the UNIVARIATE procedure.
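The two definitions above can be checked by hand on the five example values. The following is a minimal sketch in PROC IML, not the method PROC UNIVARIATE uses internally; the names k, w, trimMean, and winsMean are introduced here only for illustration, and k=1 is assumed to match the TRIM=0.2 and WINSOR=1 settings for five observations.

```sas
proc iml;
x = {6.25, 6.27, 6.28, 6.34, 63.1};
n = nrow(x);
k = 1;                      /* assumed: one value handled in each tail */
call sort(x);               /* sort ascending, in place */

/* trimmed mean: drop the k smallest and k largest values */
trimMean = mean( x[(k+1):(n-k)] );

/* Winsorized mean: replace the k extreme values in each tail
   with the nearest value that is kept */
w = x;
w[1:k]       = x[k+1];
w[(n-k+1):n] = x[n-k];
winsMean = mean(w);

print trimMean winsMean;    /* about 6.297 and 6.30 for these data */
quit;
```

Both estimates stay close to the bulk of the data, whereas the ordinary mean of the same five values is about 17.65.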

Robust Estimates in PROC IML
proc iml;
x = {6.25, 6.27, 6.28, 6.34, 63.1};
mad = mad(x, "NMAD");
print mad;
z = (x - mean(x)) / std(x); /* rules to detect outliers */
print z;
/* Because the mean and standard deviation are both influenced by the outlier,
   no observation has a large z-score, so none is flagged as an outlier.
   However, using robust estimators in the z-score formula does successfully
   identify the outlier, as shown in the following statements. */
zRobust = (x - median(x)) / mad(x, "NMAD");
print zRobust;
/* The outlier has a HUGE "robust score." Of course, you don't need to print out
   the scores and inspect them. The following statements use the LOC function
   (the most useful function that you've never heard of!) to find all the data
   for which the robust z-score exceeds 2.5, and print only the outliers. */
outIdx = loc(abs(zRobust)>2.5);
if ncol(outIdx)>0 then
   outliers = x[outIdx];
else
   outliers = .;
print outliers;
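To see why the classical z-score misses the outlier while the robust score catches it, the arithmetic for these five values can be written out. This is a worked sketch, assuming that the "NMAD" option scales the median absolute deviation by the usual consistency constant 1.4826; the subscript 5 refers to the fifth observation, 63.1.

```latex
\begin{align*}
\bar{x} &= 17.648, \qquad s \approx 25.41, \qquad
  z_5 = \frac{63.1 - 17.648}{s} \approx 1.79 \\
\operatorname{median}(x) &= 6.28 \\
|x_i - 6.28| &= 0.03,\ 0.01,\ 0,\ 0.06,\ 56.82 \\
\mathrm{MAD} &= \operatorname{median}_i |x_i - 6.28| = 0.03 \\
\mathrm{NMAD} &= 1.4826 \times 0.03 \approx 0.0445 \\
z^{\mathrm{robust}}_5 &= \frac{63.1 - 6.28}{\mathrm{NMAD}} \approx 1277
\end{align*}
```

The single extreme value inflates the standard deviation to about 25, so its own classical z-score stays below 2, while the MAD is determined by the four typical values and leaves the outlier far beyond the 2.5 threshold.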
Robust Estimates in PROC IML for Multivariate Data
proc iml;
x = { 80  27  89,  80  27  88,  75  25  90, 
      62  24  87,  62  22  87,  62  23  87, 
      62  24  93,  62  24  93,  58  23  87, 
      58  18  80,  58  18  89,  58  17  88, 
      58  18  82,  58  19  93,  50  18  89, 
      50  18  86,  50  19  72,  50  19  79, 
      50  20  80,  56  20  82,  70  20  91 };
labl = {"x1" "x2" "x3"};
mean = mean(x);     /* classical estimates */
cov = cov(x);
print mean[c=labl format=5.2], cov[r=labl c=labl format=5.2];
/* robust estimates */
N = nrow(x);   /* 21 observations */
p = ncol(x);   /*  3 variables */
optn = j(8,1,.); /* default options for MCD */
optn[1] = 0;     /* =1 if you want printed output */
optn[4]= floor(0.75*N); /* h = 75% of obs */
call MCD(sc, est, dist, optn, x);
RobustLoc = est[1, ];     /* robust location */
RobustCov = est[3:2+p, ]; /* robust scatter matrix */
print RobustLoc[c=labl format=5.2], RobustCov[r=labl c=labl format=5.2];

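/* rules to detect outliers */
cutoff = sqrt( quantile("chisquare", 0.975, p) ); /* dist^2 ~ chi-square */
outIdx = loc(dist[3,]=0); /* flag is 0 where RD > cutoff */
print cutoff, outIdx;
/* save the data as data set X (variables x1-x3) for PROC ROBUSTREG below */
create X from x[colname=labl]; append from x; close X;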
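The outlier rule in the last statements can be stated as a formula. In the sketch below, T and C are shorthand introduced here for the robust location (RobustLoc) and robust scatter matrix (RobustCov) returned by the MCD call, and x_i is the i-th row of the data; the rule assumes the squared robust distances are approximately chi-square with p degrees of freedom.

```latex
\[
RD_i = \sqrt{(x_i - T)^{\top} C^{-1} (x_i - T)}, \qquad
\text{cutoff} = \sqrt{\chi^2_{p,\,0.975}} = \sqrt{\chi^2_{3,\,0.975}}
  \approx \sqrt{9.348} \approx 3.06
\]
```

An observation is flagged when its robust distance exceeds the cutoff, which for the three variables here is about 3.06.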
The ROBUSTREG procedure can also compute MCD estimates. ROBUSTREG is usually used as a regression procedure, but you can also use it to obtain the MCD estimates by "inventing" a response variable. The MCD estimates are produced for the explanatory variables, so the choice of a response variable is unimportant. The example below uses random values for the response variable.

In a regression context, the word "outlier" is reserved for an observation for which the value of the response variable is far from the predicted value. In other words, in regression an outlier means "far away (from the model) in the Y direction." In contrast, the ROBUSTREG procedure uses the MCD algorithm to identify influential observations in the space of explanatory (that is, X) variables. These are also called high-leverage points. They are observations that are far from the center of the X variables. High-leverage points are very influential in ordinary least squares regression, and that is why it is important to identify them.
data X; set X;
   y = rannor(1); /* random response variable */
run;
proc robustreg data=X method=lts;
   model y = x1 x2 x3 / diagnostics leverage(MCDInfo);
   ods select MCDCenter MCDCov Diagnostics;
   ods output diagnostics=Diagnostics(where=(leverage=1));
run;
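Once the ODS OUTPUT statement has saved the flagged rows, they can be listed directly. The following is a minimal sketch that assumes the PROC ROBUSTREG step above has run and created the Diagnostics data set; it prints every column rather than assuming particular column names.

```sas
/* list the high-leverage observations saved by ODS OUTPUT above */
proc print data=Diagnostics noobs;
run;
```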
Acknowledgement: This tutorial is adapted from the work of the famous SAS blogger Dr. Rick. Two other good tutorials for PROC IML: Good Tutorial for Proc IML-1 and Good Tutorial for Proc IML-2.
