EZ Study

Regression Analysis Study Notes 1
Confounding and Collinearity in Multiple Linear Regression

Confounding: The distortion of the observed relationship between an exposure and an outcome by a third variable (a confounder) that is neither the dependent (outcome) variable nor the main independent (exposure) variable of interest. Confounding complicates analyses because this third factor is associated with both the putative risk factor and the outcome.
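/* Collinearity diagnostics in SAS: VIF, plus condition indices with and without the intercept */
Proc Reg Data=A;
  Model Y = X1 X2 X3 X4 / VIF Collin CollinoInt;
Run;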
For example, consider the mortality rate in Florida, which is much higher than in Michigan. Before concluding that Florida is a riskier place to live, one needs to consider confounding factors such as age. Florida has a higher proportion of people of retirement age and older than does Michigan, and older people are more likely to die in any given interval of time. Therefore, one must "adjust" for age before drawing any conclusions.

One of the epidemiologist's tools for discovering and correcting confounding is stratification, which in the preceding example would have the epidemiologist compare mortality rates in Florida and Michigan separately within each of a range of age groups. Indeed, such stratified analyses should often be a first step towards investigating confounding.
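The stratified comparison described above can be sketched in a few lines of Python. The counts below are invented purely for illustration (they are not real Florida or Michigan figures), and the helper names `crude_rate` and `stratum_rates` are hypothetical. The point is only to show how a crude difference can arise entirely from different age mixes.

```python
# Hypothetical counts, invented for illustration only (not real state data).
# Each entry maps an age group to (deaths, population).
florida = {"under 65": (200, 100_000), "65 and over": (4_000, 100_000)}
michigan = {"under 65": (360, 180_000), "65 and over": (800, 20_000)}


def crude_rate(strata):
    """Overall deaths per 1,000 people, ignoring age."""
    deaths = sum(d for d, _ in strata.values())
    population = sum(n for _, n in strata.values())
    return 1000 * deaths / population


def stratum_rates(strata):
    """Deaths per 1,000 people within each age group."""
    return {age: 1000 * d / n for age, (d, n) in strata.items()}


print("Crude rate, Florida: ", crude_rate(florida))
print("Crude rate, Michigan:", crude_rate(michigan))
print("Age-specific, Florida: ", stratum_rates(florida))
print("Age-specific, Michigan:", stratum_rates(michigan))
```

In this made-up data the crude rate is far higher in Florida, yet the age-specific rates are identical in the two states: the whole crude difference comes from Florida's larger share of older residents.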

Another approach is multiple regression, which yields mortality rates for Florida compared to Michigan adjusted for any differences in age (and possibly for other confounding factors).
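As a rough sketch of regression adjustment, the Python example below fits an outcome on a state indicator alone and then on state plus age. Everything here is an assumption made for illustration: the ages are invented, the outcome is a hypothetical risk index constructed to depend on age only, and `solve` and `ols` are small hand-written helpers (standard library only) rather than any package's API. In SAS, the analogous step would be adding age to the Model statement.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: state = 1 for Florida, 0 for Michigan; Florida sample is older.
state = [1.0] * 5 + [0.0] * 5
age = [30.0, 50.0, 70.0, 80.0, 90.0, 20.0, 30.0, 40.0, 50.0, 70.0]
# Hypothetical risk index built to depend on age only (no true state effect).
risk = [1.0 + 0.5 * a for a in age]

crude = ols([state], risk)          # [intercept, state]
adjusted = ols([state, age], risk)  # [intercept, state, age]

print("Crude state coefficient:   ", round(crude[1], 6))
print("Adjusted state coefficient:", round(adjusted[1], 6))
print("Age coefficient:           ", round(adjusted[2], 6))
```

Because the outcome was built with no state effect, the crude state coefficient reflects only the age difference between the two samples, and it falls to essentially zero once age enters the model.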

Note: a confounder must be associated with the main independent variable of interest as well as with the outcome. That is, the confounder must be unevenly distributed across the levels of the independent variable, as in the smoking and occupation example of last class. Smoking was a confounder for the outcome of cancer because smoking is associated with cancer and was unevenly distributed among occupation categories.

In practice, collinearity or high correlations among independent variables will generally have the following effects:

    • Regression coefficients will change dramatically according to whether other variables are included or excluded from the model.

    • The standard errors of the regression coefficients will tend to be large, since the beta coefficients will not be accurately estimated. In extreme cases, regression coefficients for collinear variables will be large in magnitude with signs that seem to be assigned at random. If you see nonsensical coefficients and standard errors, collinearity should be immediately suspected as a possible cause.

    • Predictors with known, strong relationships to the response will not necessarily have their regression coefficients accurately estimated.

    • Tolerance: If variables are perfectly collinear, the coefficient of determination R² will be 1 when any one of them is regressed upon the others. This is the motivation behind calculating a variable's "tolerance", a measure of collinearity.

    • Each predictor can be regressed on the other predictors, and its tolerance is defined as 1 - R² from that regression. A small value of the tolerance indicates that the variable under consideration is almost a perfect linear combination of the independent variables already in the equation, and so not all of these variables need to be added to the equation. Some statisticians suggest that a tolerance less than 0.1 deserves attention, although this cutoff is somewhat arbitrary.

    • The tolerance is sometimes reexpressed as the Variance Inflation Factor (VIF), the inverse of the tolerance (= 1/tolerance). Tolerances of 0.10 or less become VIFs of 10 or more.
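The tolerance and VIF definitions above can be illustrated with a short Python sketch that regresses each predictor on the others. The three predictors are invented for illustration (x2 is deliberately built as x1 plus a small perturbation, x3 is built to be uncorrelated with both), and `solve`, `ols`, `r_squared`, and `tolerance_and_vif` are hypothetical helper names using only the standard library. This is a sketch of the definitions, not a reproduction of what Proc Reg computes internally.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(target, others):
    """R-squared from regressing one predictor on the remaining predictors."""
    coefs = ols(others, target)
    fitted = [coefs[0] + sum(b * v for b, v in zip(coefs[1:], vals))
              for vals in zip(*others)]
    mean = sum(target) / len(target)
    sse = sum((t - f) ** 2 for t, f in zip(target, fitted))
    sst = sum((t - mean) ** 2 for t in target)
    return 1 - sse / sst


def tolerance_and_vif(predictors):
    """Map each predictor name to (tolerance, VIF), with VIF = 1 / tolerance."""
    result = {}
    for name, values in predictors.items():
        others = [v for other, v in predictors.items() if other != name]
        tol = 1 - r_squared(values, others)
        result[name] = (tol, 1 / tol)
    return result


# Hypothetical predictors: x2 nearly duplicates x1; x3 is unrelated to both.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5, 7.5],
    "x3": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0],
}

diagnostics = tolerance_and_vif(predictors)
for name, (tol, vif) in diagnostics.items():
    flag = "  <- tolerance below 0.1" if tol < 0.1 else ""
    print(f"{name}: tolerance = {tol:.4f}, VIF = {vif:.2f}{flag}")
```

With this constructed data, x1 and x2 each have a tolerance of about 0.048 (VIF of about 21) and are flagged, while x3 has a tolerance of about 1 (VIF of about 1).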

Acknowledgement: The tutorial is based on the notes from: www.ats.ucla.edu.
