Imputation Techniques in SAS
• Missing Data
In statistical data analysis, we often have to deal with missing data:
-- What should be done about missing data?
-- What is the risk of doing nothing (simply deleting or ignoring it)?
-- What type of bias is introduced by imputation?
Before doing any imputation, we recommend some exploratory analysis of the data.
For instance, examine the records with missing values: is there a pattern showing that the missingness
is predictable from other variables in the dataset? If so, the problem becomes much simpler. Also check whether
some variable tends to be missing at the same time as certain other variables.
If we don't handle missing data carefully, we may face these consequences:
-- Bias arising from differences between cases with complete data and cases with missing data.
-- Overlooking the unbalanced data patterns that incomplete data create.
-- Fewer options for data analysis and data management.
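The exploratory check recommended above can be sketched as follows. This is a minimal illustration, not a complete diagnostic: the data set work.demo, its values, and the flag variable miss_var3 are all hypothetical. It counts missing values per variable, then compares the other variables between rows where var3 is missing and rows where it is observed.

```sas
/* Hypothetical example data: var3 tends to be missing when var1 is large */
data work.demo;
   input id var1 var2 var3;
   datalines;
1 10 5.1 20
2 12 4.8 22
3 30 5.5 .
4 11 .   21
5 28 5.0 .
6 13 4.9 23
;
run;

/* Count non-missing (N) and missing (NMISS) values for each variable */
proc means data=work.demo n nmiss;
   var var1 var2 var3;
run;

/* Flag the rows where var3 is missing */
data work.demo_flag;
   set work.demo;
   miss_var3 = missing(var3);   /* 1 if var3 is missing, 0 otherwise */
run;

/* Compare var1 and var2 between the two groups defined by the flag */
proc means data=work.demo_flag n mean min max;
   class miss_var3;
   var var1 var2;
run;
```

If the group means differ clearly, as var1 does in this toy data, the missingness of var3 is likely related to the observed variables.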
• Patterns of Missing Data
As shown in the following graph, the pattern we handle most often is called "monotone missing". In other words,
if the variables are ordered var1, var2, var3, ..., there are more and more missing values toward the right,
and the data look like an upper triangle: if var2 is missing in some rows, then var3 is also missing in those rows.
In general, if a variable Yj is missing for a particular individual, then all subsequent variables Yk, k>j,
are also missing for that individual.
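One way to inspect the missing pattern of a data set is to run PROC MI with NIMPUTE=0, which skips the imputation itself but still prints the Missing Data Patterns table. The sketch below assumes a small hypothetical data set, work.mono, built to be monotone in the order var1, var2, var3.

```sas
/* Hypothetical data with a monotone missing pattern */
data work.mono;
   input var1 var2 var3;
   datalines;
1.2 3.4 5.6
2.1 4.3 6.5
1.8 3.9 .
2.5 4.1 .
1.6 .   .
2.2 .   .
;
run;

/* NIMPUTE=0: no imputations are created, only the descriptive tables */
proc mi data=work.mono nimpute=0;
   var var1 var2 var3;   /* the order listed here defines the pattern */
run;
```

In the resulting table each row is one distinct pattern. A monotone data set shows no pattern in which a variable is observed to the right of a missing one.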
• Simple Imputation:
By default, simple imputation replaces missing values as follows:
-- For interval variables, with the mean of the non-missing values.
-- For categorical variables, with the most frequent category.
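The two rules above can be sketched in Base SAS and SAS/STAT as shown below. This is only one possible way to do it, not necessarily the tool the defaults above refer to. The data set work.have, the variables income and region, and the macro variable mode_region are hypothetical names chosen for illustration.

```sas
/* Hypothetical data: income is interval, region is categorical */
data work.have;
   input income region $;
   datalines;
50 East
60 West
.  East
70 East
80 .
;
run;

/* Interval variable: REPONLY replaces only the missing values, here with the mean */
proc stdize data=work.have out=work.step1 reponly method=mean;
   var income;
run;

/* Categorical variable: count the non-missing categories */
proc freq data=work.have noprint;
   where not missing(region);
   tables region / out=work.freq_region;
run;

/* Put the most frequent category first */
proc sort data=work.freq_region;
   by descending count;
run;

/* Store the most frequent category in a macro variable */
data _null_;
   set work.freq_region(obs=1);
   call symputx('mode_region', region);
run;

/* Fill the missing categories with the mode */
data work.want;
   set work.step1;
   if missing(region) then region = "&mode_region";
run;
```

With this toy data the missing income is filled with the mean of the four observed values, and the missing region is filled with East, the most frequent category.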
• Imputation Approaches:
We highly recommend the multiple imputation approach.
Simple imputation is oversimplified in most cases. Replacing missing values with the overall mean, a group mean, or the mode can give misleading information, because the filled-in values are then treated as if they had been observed and the uncertainty about them is ignored.
The multiple imputation approach usually has three steps:
PROC MI --> analysis procedures (regression, etc.) --> PROC MIANALYZE
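In SAS code, the three steps look like this:
proc mi data=data out=mi_out;
   var var1 var2 var3;   /* the order of the variables matters */
run;
proc reg data=mi_out outest=reg1 covout;
   model var3 = var1 var2;
   by _Imputation_;      /* fit the model separately for each imputed data set */
run;
proc mianalyze data=reg1;
   modeleffects Intercept var1 var2;   /* one combined set of estimates */
run;
• Statistical Assumptions for Multiple Imputation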
The MI procedure assumes that the data are from a continuous multivariate distribution and contain missing values that can occur for any of the variables.
It also assumes that the data are from a multivariate normal distribution when either the regression method or the MCMC method is used.
Both the MI and MIANALYZE procedures assume that the missing data are missing at random (MAR);
that is, the probability that an observation is missing can depend on the observed values, but not on the missing values themselves.
For example, consider a trivariate data set with variables Y1 and Y2 fully observed, and a variable Y3 that has missing values.
MAR assumes that the probability that Y3 is missing for an individual can be related to the individual's values of variables Y1 and Y2,
but not to its value of Y3. On the other hand, if a complete case and an incomplete case for Y3 with exactly the same values for variables Y1 and Y2 have systematically different values of Y3,
then there exists a response bias for Y3, and MAR is violated.
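The trivariate example above can be simulated to see what MAR looks like in practice. The sketch below is an illustration under assumed settings: the data set names, the sample size, the regression coefficients, and the missingness model are all hypothetical. Y3 is set to missing with a probability that depends only on the fully observed Y1, so the data are MAR by construction. Because only Y3 has gaps, the pattern is monotone and the regression method can be requested with the MONOTONE statement.

```sas
/* Simulate Y1 and Y2 fully observed, Y3 partly missing (hypothetical settings) */
data work.mar_demo;
   call streaminit(2024);
   do id = 1 to 500;
      y1 = rand('normal', 0, 1);
      y2 = rand('normal', 0, 1);
      y3 = 1 + 0.5*y1 + 0.3*y2 + rand('normal', 0, 1);
      /* MAR: the chance that y3 is missing depends only on the observed y1 */
      p_miss = 1 / (1 + exp(-(-1 + y1)));
      if rand('bernoulli', p_miss) = 1 then y3 = .;
      output;
   end;
   drop p_miss;
run;

/* Impute y3 from y1 and y2 with the regression method, five imputations */
proc mi data=work.mar_demo out=work.mar_imputed nimpute=5 seed=12345;
   monotone reg;
   var y1 y2 y3;
run;
```

If the missingness probability were computed from y3 itself instead of y1, the data would no longer be MAR and the assumption described above would be violated.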
For more details, and to understand the method better, we recommend working through these examples yourself.
You can also read Multiple Imputation for Missing Data
and the paper Multiple Imputation for Missing Data,
which discuss the MI and MIANALYZE procedures in detail.