
Multivariate Logistic Regression in R
Logistic regression with multiple predictors

We gave a brief introduction to logistic regression in R with a single numerical predictor.
Now what if we have multiple numerical predictors?

We are using the following dataset:
seen: coded as 0 = no and 1 = yes.
W: score derived from reading a list of color words.
C: score derived from naming the color.
CW: Stroop task; score derived from the subject's attempt to name the ink color in which a conflicting color word is printed.

   seen   W   C CW
1     0 126  86 64
2     0 118  76 54
...
48    1 120  87 54
49    1  97  82 41
To get them into R, try this first...
> file = "http://ww2.coastal.edu/kingw/statistics/R-tutorials/text/gorilla.csv"
> read.csv(file) -> gorilla
> str(gorilla)
'data.frame':   49 obs. of  4 variables:
 $ seen: int  0 0 0 0 0 0 0 0 0 0 ...
 $ W   : int  126 118 61 69 57 78 114 81 73 93 ...
 $ C   : int  86 76 66 48 59 64 61 85 57 50 ...
 $ CW  : int  64 54 44 32 42 53 41 47 33 45 ...

### Begin copying here.
gorilla = data.frame(rep(c(0,1),c(30,19)),
            c(126,118,61,69,57,78,114,81,73,93,116,156,90,120,99,113,103,123,
              86,99,102,120,128,100,95,80,98,111,101,102,100,112,82,72,72,
              89,108,88,116,100,99,93,100,110,100,106,115,120,97),
            c(86,76,66,48,59,64,61,85,57,50,92,70,66,73,68,110,78,61,65,
              77,77,74,100,89,61,55,92,90,85,78,66,78,84,63,65,71,46,70,
              83,69,70,63,93,76,83,71,112,87,82),
            c(64,54,44,32,42,53,41,47,33,45,49,45,48,49,44,47,52,28,42,51,54,
              53,56,56,37,36,51,52,45,51,48,55,37,46,47,49,29,49,67,39,43,36,
              62,56,36,49,66,54,41))
colnames(gorilla) = c("seen","W","C","CW")
str(gorilla)                                  
### End copying here.
And if that doesn't work, well, you know what you have to do!

We might begin like this...

> cor(gorilla)                         ### a correlation matrix
            seen           W          C         CW
seen  1.00000000 -0.03922667 0.05437115 0.06300865
W    -0.03922667  1.00000000 0.43044418 0.35943580
C     0.05437115  0.43044418 1.00000000 0.64463361
CW    0.06300865  0.35943580 0.64463361 1.00000000
...or like this...
> with(gorilla, tapply(W, seen, mean))
        0         1 
100.40000  98.89474 
> with(gorilla, tapply(C, seen, mean))
       0        1 
73.76667 75.36842 
> with(gorilla, tapply(CW, seen, mean))
       0        1 
46.70000 47.84211
The Stroop scale scores are moderately positively correlated with each other, but none of them appears to be related to the "seen" response variable, at least not to any impressive extent. There doesn't appear to be much here to look at. Let's have a go at it anyway.
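
As a side note, the three tapply() calls above can be collapsed into a single aggregate() call; a minimal sketch that should reproduce the same group means...

> aggregate(cbind(W, C, CW) ~ seen, data=gorilla, FUN=mean)   ### all three group means at once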

Since the response is a binomial variable, a logistic regression can be done as follows. Note that the formula W * C * CW expands to all three main effects, all two-way interactions, and the three-way interaction...

> glm.out = glm(seen ~ W * C * CW, family=binomial(logit), data=gorilla)
> summary(glm.out)

Call:
glm(formula = seen ~ W * C * CW, family = binomial(logit), data = gorilla)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8073  -0.9897  -0.5740   1.2368   1.7362  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept) -1.323e+02  8.037e+01  -1.646   0.0998 .
W            1.316e+00  7.514e-01   1.751   0.0799 .
C            2.129e+00  1.215e+00   1.753   0.0797 .
CW           2.206e+00  1.659e+00   1.329   0.1837  
W:C         -2.128e-02  1.140e-02  -1.866   0.0621 .
W:CW        -2.201e-02  1.530e-02  -1.439   0.1502  
C:CW        -3.582e-02  2.413e-02  -1.485   0.1376  
W:C:CW       3.579e-04  2.225e-04   1.608   0.1078  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 65.438  on 48  degrees of freedom
Residual deviance: 57.281  on 41  degrees of freedom
AIC: 73.281

Number of Fisher Scoring iterations: 5

> anova(glm.out, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: seen

Terms added sequentially (first to last)


       Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                      48     65.438          
W       1    0.075        47     65.362     0.784
C       1    0.310        46     65.052     0.578
CW      1    0.106        45     64.946     0.745
W:C     1    2.363        44     62.583     0.124
W:CW    1    0.568        43     62.015     0.451
C:CW    1    1.429        42     60.586     0.232
W:C:CW  1    3.305        41     57.281     0.069
Two different extractor functions have been used to see the results of our analysis. What do they mean?

The first gives us what amount to regression coefficients, with standard errors and a z-test, as we saw in the single-predictor example. None of the coefficients is significantly different from zero (though a few are close). The deviance was reduced by 8.157 points on 7 degrees of freedom, for a p-value of...

> 1 - pchisq(8.157, df=7)
[1] 0.3189537
Overall, the model appears to have performed poorly, showing no significant reduction in deviance (no significant difference from the null model).
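
If you'd rather not type the deviances in by hand, the same overall test can be pulled straight from the fitted model object. A minimal sketch, assuming the glm.out object from above...

> with(glm.out, null.deviance - deviance)   ### 65.438 - 57.281 = 8.157
> with(glm.out, df.null - df.residual)      ### 48 - 41 = 7
> with(glm.out, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail=FALSE))   ### same p-value as above, about 0.319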


The second printout shows the same overall reduction in deviance, from 65.438 to 57.281 on 7 degrees of freedom. In this printout, however, the reduction in deviance is shown for each term, added sequentially first to last. Of note is the three-way interaction term, which produced a nearly significant reduction in deviance of 3.305 on 1 degree of freedom (p = 0.069).
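
If none of the interactions is of real interest, one might also refit with main effects only and ask whether the four interaction terms buy anything as a group. A minimal sketch, assuming the glm.out object from above (this comparison is not part of the original analysis)...

> glm.main = update(glm.out, . ~ W + C + CW)   ### drop all interaction terms
> anova(glm.main, glm.out, test="Chisq")       ### joint test of the interaction terms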

In the event you are encouraged by any of this, the following graph might be revealing...

> plot(glm.out$fitted)           ### fitted probability for each of the 49 subjects
> abline(v=30.5,col="red")       ### subjects 1-30 had seen=0; subjects 31-49 had seen=1
> abline(h=.3,col="green")       ### reference lines at fitted probabilities of 0.3...
> abline(h=.5,col="green")       ### ...and 0.5
> text(15,.9,"seen = 0")         ### label the two groups
> text(40,.9,"seen = 1")
(Plot: fitted values by subject index, with the seen = 0 and seen = 1 groups separated by the red line.)
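
To put a rough number on what the plot suggests, the fitted probabilities can be cut at a threshold and cross-tabulated against the actual responses. A minimal sketch, assuming the glm.out object from above; the 0.5 cutoff is our own choice, not part of the original analysis...

> table(actual = gorilla$seen, predicted = as.numeric(fitted(glm.out) > .5))

Given how poorly the model performed overall, don't expect this table to look much better than simply predicting the majority class.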
Related links:
Continue to Financial Data Visualization in R: QuantMod

Back to An Almost Perfect Simple Example for Logistic Regression