"In real datasets, it often happens some observations are different from the majority.

Such observations are called outliers. ...They do not fit the model well. It is very important to be able to detect these outliers.""

The trimmed mean is computed by excluding the k smallest and k largest values and computing the mean of the remaining values. The Winsorized mean is computed by replacing the k smallest values with the (k+1)st smallest value and the k largest values with the (k+1)st largest value.

data a;
input x @@;
datalines;
6.25 6.27 6.28 6.34 63.1
;

proc univariate data=a trimmed=0.2 winsorized=1;
   var x;
   ods select BasicMeasures TrimmedMeans WinsorizedMeans;
run;

For both of these statistics, you can specify either a number of observations to trim or Winsorize, or a percentage of values. Formulas for the trimmed and Winsorized means are included in the documentation of the UNIVARIATE procedure.
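To make the arithmetic concrete, here is a short Python sketch (an illustration, not the UNIVARIATE procedure) that computes both statistics for the five data values above with k = 1:

```python
def trimmed_mean(xs, k):
    """Drop the k smallest and k largest values, then average the rest."""
    s = sorted(xs)
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

def winsorized_mean(xs, k):
    """Replace the k smallest values with the (k+1)st smallest and the
    k largest with the (k+1)st largest, then average all n values."""
    s = sorted(xs)
    w = [s[k]] * k + s[k:len(s) - k] + [s[len(s) - k - 1]] * k
    return sum(w) / len(w)

x = [6.25, 6.27, 6.28, 6.34, 63.1]
print(trimmed_mean(x, 1))     # about 6.2967: the outlier 63.1 is excluded
print(winsorized_mean(x, 1))  # 6.30: the outlier is replaced by 6.34
```

Both robust estimates land near 6.3, whereas the ordinary mean of these data is about 17.6 because of the single outlier.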

proc iml;
x = {6.25, 6.27, 6.28, 6.34, 63.1};
mad = mad(x, "NMAD");
print mad;

/* rules to detect outliers */
z = (x - mean(x)) / std(x);
print z;

Because the mean and standard deviation are both influenced by the outlier, no observation has a large z-score, so none is flagged as an outlier. However, using robust estimators in the z-score formula does successfully identify the outlier, as the following statements show:

zRobust = (x - median(x)) / mad(x, "NMAD");
print zRobust;

The outlier has a HUGE "robust score." Of course, you don't need to print the scores and inspect them. The following statements use the LOC function (the most useful function that you've never heard of!) to find every observation whose robust z-score exceeds 2.5 and to print only the outliers:

outIdx = loc(abs(zRobust) > 2.5);
if ncol(outIdx) > 0 then outliers = x[outIdx];
else outliers = .;
print outliers;
quit;

proc iml;
x = { 80 27 89,  80 27 88,  75 25 90,  62 24 87,  62 22 87,
      62 23 87,  62 24 93,  62 24 93,  58 23 87,  58 18 80,
      58 18 89,  58 17 88,  58 18 82,  58 19 93,  50 18 89,
      50 18 86,  50 19 72,  50 19 79,  50 20 80,  56 20 82,
      70 20 91 };
labl = {"x1" "x2" "x3"};

/* classical estimates */
mean = mean(x);
cov = cov(x);
print mean[c=labl format=5.2], cov[r=labl c=labl format=5.2];

/* robust estimates */
N = nrow(x);               /* 21 observations */
p = ncol(x);               /* 3 variables */
optn = j(8,1,.);           /* default options for MCD */
optn[1] = 0;               /* =1 if you want printed output */
optn[4] = floor(0.75*N);   /* h = 75% of obs */
call MCD(sc, est, dist, optn, x);
RobustLoc = est[1, ];      /* robust location */
RobustCov = est[3:2+p, ];  /* robust scatter matrix */
print RobustLoc[c=labl format=5.2], RobustCov[r=labl c=labl format=5.2];

/* rules to detect outliers */
cutoff = sqrt( quantile("chisquare", 0.975, p) ); /* dist^2 ~ chi-square */
outIdx = loc(dist[3,] = 0);                       /* RD > cutoff */
print outIdx;
quit;
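To see why the robust (MCD) estimates matter, here is a hypothetical pure-Python sketch, with made-up two-variable data (p = 2, not the three-variable data above), of what happens when the same chi-square distance cutoff is applied to classical Mahalanobis distances. This is not the MCD algorithm: it only demonstrates "masking," where a gross outlier inflates the classical mean and covariance so much that its own distance stays below the cutoff. For 2 degrees of freedom the chi-square quantile has the closed form -2 ln(1 - q), so no statistics library is needed:

```python
import math

p = 2
cutoff = math.sqrt(-2.0 * math.log(1.0 - 0.975))  # about 2.72

pts = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.9), (1.1, 2.0), (8.0, 9.0)]
n = len(pts)
mx = sum(x for x, _ in pts) / n                       # classical center
my = sum(y for _, y in pts) / n
sxx = sum((x - mx) ** 2 for x, _ in pts) / (n - 1)    # classical covariance
syy = sum((y - my) ** 2 for _, y in pts) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in pts) / (n - 1)
det = sxx * syy - sxy * sxy

def mahalanobis(x, y):
    """Classical Mahalanobis distance, with the 2x2 inverse written out."""
    dx, dy = x - mx, y - my
    return math.sqrt((syy*dx*dx - 2*sxy*dx*dy + sxx*dy*dy) / det)

dists = [mahalanobis(x, y) for x, y in pts]
flagged = [i for i, d in enumerate(dists) if d > cutoff]
print(flagged)  # []: the gross outlier (8, 9) is masked
```

The point (8, 9) is obviously far from the other four, yet its classical distance does not exceed the cutoff. The MCD call above avoids this by estimating the center and scatter from the most concentrated h observations, so the robust distances of outliers remain large.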

In a regression context, the word "outlier" is reserved for an observation for which the value of the response variable is far from the predicted value. In other words, in regression an outlier means "far away (from the model) in the Y direction." In contrast, the ROBUSTREG procedure uses the MCD algorithm to identify influential observations in the space of explanatory (that is, X) variables. These are also called high-leverage points. They are observations that are far from the center of the X variables. High-leverage points are very influential in ordinary least squares regression, and that is why it is important to identify them.
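The idea that leverage depends only on the X values can be sketched in a few lines of Python (a hypothetical one-predictor example with made-up x values, not the ROBUSTREG diagnostics). For simple regression the hat value has the closed form h_i = 1/n + (x_i - x̄)² / Sxx, and a common rule of thumb flags h_i > 2p/n:

```python
x = [1, 2, 3, 4, 10]
n, p = len(x), 2   # p = number of parameters (intercept + slope)
xbar = sum(x) / n
sxx = sum((v - xbar) ** 2 for v in x)
# Hat (leverage) values depend only on x, never on the response y:
h = [1 / n + (v - xbar) ** 2 / sxx for v in x]
high = [v for v, hi in zip(x, h) if hi > 2 * p / n]  # rule of thumb: h > 2p/n
print(high)  # [10]
```

The observation at x = 10 is flagged regardless of its response value, which is exactly the sense in which a high-leverage point differs from a regression outlier.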

data X;
set X;
y = rannor(1); /* random response variable */
run;

proc robustreg data=X method=lts;
   model y = x1 x2 x3 / diagnostics leverage(MCDInfo);
   ods select MCDCenter MCDCov Diagnostics;
   ods output Diagnostics=Diagnostics(where=(leverage=1));
run;

Acknowledgement: this tutorial is from the famous SAS blogger Dr. Rick. See also Good Tutorial for Proc IML-1 and Good Tutorial for Proc IML-2.
