Regression Analysis Study Notes 1
Confounding and Collinearity in Multiple Linear Regression
Confounder: A third variable, not the dependent (outcome) or main independent
(exposure) variable of interest, that distorts the observed relationship between
the exposure and the outcome. Confounding complicates analyses owing to the
presence of a third factor that is associated with both the putative risk factor
and the outcome.
Proc Reg Data=A;
  Model Y = X1 X2 X3 X4 / VIF Collin ColliNoInt;
Run;
For example, consider the mortality rate in Florida, which is much higher than
in Michigan. Before concluding that Florida is a riskier place to live, one needs
to consider confounding factors such as age. Florida has a higher proportion
of people of retirement age and older than does Michigan, and older people are
more likely to die in any given interval of time. Therefore, one must "adjust"
for age before drawing any conclusions.
One of the epidemiologist's tools for discovering and correcting confounding is
stratification, which in the preceding example would have the epidemiologist
compare mortality rates in Florida and Michigan separately for people across
a range of age groups. Indeed, such stratified analyses should often be a first
step towards investigating confounding.
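A stratified comparison like the one just described can be sketched in a few lines of Python. The counts below are entirely hypothetical (they are not from these notes); they are chosen so that the crude rates differ between states while the age-specific rates do not, i.e. age fully explains the difference:

```python
# Hypothetical (deaths, population) counts per state and age stratum,
# invented purely to illustrate stratification.
counts = {
    ("FL", "young"): (200, 400_000),
    ("FL", "old"):   (3_000, 100_000),
    ("MI", "young"): (225, 450_000),
    ("MI", "old"):   (1_500, 50_000),
}

def crude_rate(state):
    """Overall mortality rate for a state, ignoring age."""
    d = sum(v[0] for (s, _), v in counts.items() if s == state)
    n = sum(v[1] for (s, _), v in counts.items() if s == state)
    return d / n

def stratum_rate(state, age):
    """Mortality rate within one age stratum."""
    d, n = counts[(state, age)]
    return d / n
```

With these numbers the crude rate is higher in Florida, but within each age group the two states have identical rates: the apparent state effect is confounding by age.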
Another way would be to use multiple regression to derive mortality rates for
Florida compared to Michigan, adjusted for any differences in age (and possibly
for other confounding factors).
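The adjustment idea can be demonstrated with simulated data (a sketch, not the notes' own analysis): make the state indicator shift the age distribution, make risk depend only on age, and compare the state coefficient with and without age in the model.

```python
import numpy as np

# Simulated data: "state" confounded with age; risk depends on age only.
rng = np.random.default_rng(0)
n = 5000
state = rng.integers(0, 2, n)                   # 1 = hypothetical "Florida"
age = 40 + 15 * state + rng.normal(0, 10, n)    # state 1 is older on average
risk = 0.05 * age + rng.normal(0, 1, n)         # no direct state effect

ones = np.ones(n)
# Unadjusted model: risk ~ state
b_unadj = np.linalg.lstsq(np.column_stack([ones, state]), risk, rcond=None)[0]
# Adjusted model: risk ~ state + age
b_adj = np.linalg.lstsq(np.column_stack([ones, state, age]), risk, rcond=None)[0]
```

The unadjusted state coefficient is large (it absorbs the age difference), while the adjusted one is close to zero, mirroring what "adjusting for age" accomplishes in the mortality example.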
Note: a confounder must be associated with the main independent variable of
interest. That is, the confounder must be unevenly distributed across levels
of the independent variable, as in the smoking and occupation example from
last class. Smoking was a confounder for the outcome of cancer because
smoking is associated with cancer and was unevenly distributed among
occupation categories.
In practice, collinearity or high correlations among independent variables will generally
have the following effects:
• Regression coefficients will change dramatically according to whether other variables
are included or excluded from the model.
• The standard errors of the regression coefficients will tend to be large, since the
beta coefficients will not be accurately estimated. In extreme cases, regression
coefficients for collinear variables will be large in magnitude with signs that
seem to be assigned at random. If you see nonsensical coefficients and standard
errors, collinearity should be immediately suspected as a possible cause.
• Predictors with known, strong relationships to the response will not necessarily
have their regression coefficients accurately estimated.
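The first two effects can be reproduced in a small simulation (hypothetical data, not from the notes): add a near-duplicate of a predictor and watch the coefficient estimates and their standard errors destabilize.

```python
import numpy as np

def ols(X, y):
    """OLS fit: returns coefficients and their standard errors."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return b, se

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, size=n)   # nearly a perfect copy of x1
y = 2.0 * x1 + rng.normal(size=n)       # y truly depends on x1 only

ones = np.ones(n)
b1, se1 = ols(np.column_stack([ones, x1]), y)       # x1 alone
b2, se2 = ols(np.column_stack([ones, x1, x2]), y)   # x1 and x2 together
```

Fitting x1 alone recovers a coefficient near 2 with a small standard error; adding the collinear x2 inflates the standard errors by orders of magnitude and splits the effect arbitrarily between the two predictors, although their sum stays near 2 (only that combination is well identified).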
• If variables are perfectly collinear, the coefficient of determination
R2 will be 1 when any one of them is regressed on the others. This is the
motivation behind calculating a variable's "tolerance", a measure of collinearity.
• Each predictor can be regressed on the other predictors, and its tolerance is defined
as 1 - R2. A small tolerance indicates that the variable under consideration
is almost a perfect linear combination of the independent variables already in
the equation, so not all of these variables need to be included. Some
statisticians suggest that a tolerance less than 0.1 deserves attention,
although this cutoff is somewhat arbitrary.
• The tolerance is sometimes reexpressed as the Variance Inflation Factor (VIF),
the inverse of the tolerance (VIF = 1/tolerance). Tolerances of 0.10 or less become
VIFs of 10 or more.
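The tolerance/VIF recipe above (regress each predictor on the others, take 1 - R2 and its inverse) can be written out directly; this Python sketch mirrors what SAS's VIF option reports, on hypothetical data:

```python
import numpy as np

def tolerance_and_vif(X):
    """For each column of X, regress it on the remaining columns
    (plus an intercept); return a list of (tolerance, VIF) pairs,
    where tolerance = 1 - R^2 and VIF = 1 / tolerance."""
    n, p = X.shape
    results = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ b
        centered = target - target.mean()
        r2 = 1 - (resid @ resid) / (centered @ centered)
        tol = 1 - r2
        results.append((tol, 1 / tol))
    return results

# Hypothetical predictors: z3 is almost a linear combination of z1 and z2.
rng = np.random.default_rng(2)
z1 = rng.normal(size=300)
z2 = rng.normal(size=300)
z3 = z1 + z2 + rng.normal(0, 0.05, size=300)
res = tolerance_and_vif(np.column_stack([z1, z2, z3]))
```

Here z3's tolerance falls far below the 0.1 rule of thumb (equivalently its VIF is well above 10), while independent predictors have tolerances near 1.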
Acknowledgement: The tutorial is based on the notes from: www.ats.ucla.edu