Confounding and Collinearity in Multiple Linear Regression

For example, consider the mortality rate in Florida, which is much higher than in Michigan. Before concluding that Florida is a riskier place to live, one needs to consider confounding factors such as age. Florida has a higher proportion of people of retirement age and older than does Michigan, and older people are more likely to die in any given interval of time. Therefore, one must "adjust" for age before drawing any conclusions.

Proc Reg Data=A;
  Model Y=X1 X2 X3 X4 / VIF Collin ColliNoInt;
Run;

One of the epidemiologist's tools for discovering and correcting confounding is stratification, which in the preceding example would have the epidemiologist compare mortality rates in Florida and Michigan separately for people across a range of age groups. Indeed, such
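A minimal Python sketch of stratification follows. The counts are invented purely for illustration and are not real Florida or Michigan data; the names state_a, state_b, crude_rate, stratum_rates and standardized_rate are hypothetical. The two populations are given identical age-specific death rates but different age structures, so the crude rates differ while the stratified and age-standardized rates agree.

```python
# Hypothetical counts, invented for illustration only.
# Each entry maps an age group to (deaths, population).
state_a = {  # older population, standing in for "Florida"
    "0-39": (40, 40_000),
    "40-64": (150, 30_000),
    "65+": (1_200, 30_000),
}
state_b = {  # younger population, standing in for "Michigan"
    "0-39": (60, 60_000),
    "40-64": (150, 30_000),
    "65+": (400, 10_000),
}


def crude_rate(strata):
    """Deaths per 1,000 people, ignoring age."""
    deaths = sum(d for d, _ in strata.values())
    population = sum(n for _, n in strata.values())
    return 1000 * deaths / population


def stratum_rates(strata):
    """Deaths per 1,000 people within each age group."""
    return {age: 1000 * d / n for age, (d, n) in strata.items()}


def standardized_rate(strata, standard_population):
    """Directly age-standardized rate: weight each age-specific rate
    by that age group's share of a common standard population."""
    rates = stratum_rates(strata)
    total = sum(standard_population.values())
    return sum(rates[age] * standard_population[age] for age in rates) / total


# Use the combined population of both states as the standard (an assumption).
standard = {age: state_a[age][1] + state_b[age][1] for age in state_a}

print("crude:", crude_rate(state_a), crude_rate(state_b))
print("by age, A:", stratum_rates(state_a))
print("by age, B:", stratum_rates(state_b))
print("standardized:", standardized_rate(state_a, standard),
      standardized_rate(state_b, standard))
```

With these made-up numbers the crude rate of the older population is more than double that of the younger one, yet every age-specific rate is the same, which is the pattern one would expect if age alone explained the difference.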

Another way would be to use

Note: a confounder must be associated with the main independent variable of interest. For example,

In practice, collinearity or high correlations among independent variables will generally have the following effects:

• Regression coefficients will change dramatically according to whether other variables are included or excluded from the model.

• The standard errors of the regression coefficients will tend to be large, since the beta coefficients will not be accurately estimated. In extreme cases, regression coefficients for collinear variables will be large in magnitude with signs that seem to be assigned at random. If you see nonsensical coefficients and standard errors, collinearity should be immediately suspected as a possible cause.

• Predictors with known, strong relationships to the response will not necessarily have their regression coefficients accurately estimated.


• Each predictor can be regressed on the other predictors, and its tolerance is defined as 1 - R^2, where R^2 is the coefficient of determination from that regression. A small value of the tolerance indicates that the variable under consideration is almost a perfect linear combination of the independent variables already in the equation, and so not all of these variables need to be added to the equation. Some statisticians suggest that a tolerance less than 0.1 deserves attention, although this is somewhat arbitrary.

• The tolerance is sometimes reexpressed as the
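The tolerance calculation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure SAS uses internally: the data are made up, the helper names solve, r_squared and tolerances are hypothetical, and the regression is fitted by solving the normal equations, which is adequate for a small example but not a numerically robust general method. Here x2 is constructed to be roughly twice x1, while x3 is only weakly related to either.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, xs):
    """R^2 from a least-squares regression of y on the columns in xs,
    with an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in xs] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def tolerances(predictors):
    """Tolerance (1 - R^2) of each predictor regressed on all the others."""
    result = {}
    for name, column in predictors.items():
        others = [c for other, c in predictors.items() if other != name]
        result[name] = 1.0 - r_squared(column, others)
    return result


# Made-up data: x2 is approximately 2 * x1; x3 is only weakly related.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1],
    "x3": [5, 3, 8, 1, 7, 2, 6, 4],
}

for name, tol in tolerances(predictors).items():
    flag = "  <- below 0.1" if tol < 0.1 else ""
    print(f"{name}: tolerance = {tol:.4f}{flag}")
```

In this example x1 and x2 fall well below the 0.1 guideline mentioned above, while x3 does not. The reciprocal of the tolerance is the variance inflation factor, which is the quantity the VIF option of Proc Reg reports.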

Acknowledgement: The tutorial is based on the notes from: www.ats.ucla.edu.