Introduction to Logistic Regression in R
An almost perfect, simple example of logistic regression

For an introduction to logistic regression in SAS, see the following tutorial.

In the "MASS" library there is a data set called "menarche", there are 3 variables:
1) "Age" (average age of age homogeneous groups of girls),
2) "Total" (number of girls in each group),
3) "Menarche" (number of girls in the group who have reached menarche)
Don't be awkward for this data, most of us are over 18 to use R/SAS.
> library("MASS")
> data(menarche)
> str(menarche)
'data.frame':   25 obs. of  3 variables:
 $ Age     : num   9.21 10.21 10.58 10.83 11.08 ...
 $ Total   : num  376 200 93 120 90 88 105 111 100 93 ...
 $ Menarche: num  0 0 0 2 2 5 10 17 16 29 ...
> summary(menarche)
      Age            Total           Menarche      
 Min.   : 9.21   Min.   :  88.0   Min.   :   0.00  
 1st Qu.:11.58   1st Qu.:  98.0   1st Qu.:  10.00  
 Median :13.08   Median : 105.0   Median :  51.00  
 Mean   :13.10   Mean   : 156.7   Mean   :  92.32  
 3rd Qu.:14.58   3rd Qu.: 117.0   3rd Qu.:  92.00  
 Max.   :17.58   Max.   :1049.0   Max.   :1049.00
> plot(Menarche/Total ~ Age, data=menarche)
From the resulting plot, it appears a logistic fit is called for here. The fit would be done this way...
> glm.out = glm(cbind(Menarche, Total-Menarche) ~ Age,
+               family=binomial(logit), data=menarche)
1) glm( ) is the function used to fit generalized linear models. With "family=" set to "binomial" with a "logit" link, glm( ) produces a logistic regression. Because we are using glm( ) with binomial errors in the response variable, the ordinary assumptions of least-squares linear regression (normality and homoscedasticity) don't apply.

2) Our data frame does not contain a row for every case (i.e., every girl upon whom data were collected), so we do not have a binary (0,1) coded response variable. No problem! If we feed glm( ) a table (or matrix) in which the first column is the number of successes and the second column is the number of failures, R will take care of the coding for us. In the above analysis, we made that table on the fly inside the model formula by binding "Menarche" and "Total − Menarche" into the columns of a table with cbind( ).
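As an aside, the same model can be fit by giving glm( ) the observed proportion of successes as the response and the group sizes as prior weights. This is a standard R idiom; the sketch below should reproduce the fit above (glm.out2 is just an illustrative name):

> glm.out2 = glm(Menarche/Total ~ Age, weights=Total,
+               family=binomial(logit), data=menarche)
> all.equal(coef(glm.out), coef(glm.out2))   # same coefficients either way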

Let's look at how closely the fitted values from our logistic regression match the observed values...

> plot(Menarche/Total ~ Age, data=menarche)
> lines(menarche$Age, glm.out$fitted, type="l", col="red")
> title(main="Menarche Data with Fitted Logistic Regression Line")
[Figure: menarche data with the fitted logistic regression line overlaid in red]
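Note that glm.out$fitted gives fitted probabilities only at the 25 observed ages. If you want a smoother curve, you can evaluate the model on a fine grid with predict( ) and type="response" (a small optional sketch; the names ages and p.hat are ours):

> ages = seq(9, 18, by=0.1)
> p.hat = predict(glm.out, newdata=data.frame(Age=ages), type="response")
> lines(ages, p.hat, col="blue")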
I'm impressed! I don't know about you. The numerical results are extracted like this...
> summary(glm.out)

Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, 
    family = binomial(logit), data = menarche)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0363  -0.9953  -0.4900   0.7780   1.3675  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -21.22639    0.77068  -27.54   <2e-16 ***
Age           1.63197    0.05895   27.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3693.884  on 24  degrees of freedom
Residual deviance:   26.703  on 23  degrees of freedom
AIC: 114.76

Number of Fisher Scoring iterations: 4
The following requests also produce useful results: glm.out$coef, glm.out$fitted, glm.out$resid, glm.out$effects, and anova(glm.out).

Recall that the model is linear in the log odds, so the coefficient of "Age" can be interpreted as "for every one-year increase in age, the odds of having reached menarche increase by a factor of exp(1.632) = 5.11."
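To pull that multiplier (and a confidence interval for it) straight out of R, a quick sketch using the fitted glm.out:

> exp(coef(glm.out)["Age"])   # odds multiplier per year of age, about 5.11
> exp(confint(glm.out))       # profile-likelihood intervals on the odds scale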

To evaluate the overall performance of the model, look at the null deviance and residual deviance near the bottom of the printout. The null deviance shows how well the response is predicted by a model with nothing but an intercept (the grand mean). It is essentially a chi-square value on 24 degrees of freedom; the p-value will be very small due to the large deviance, indicating very little fit (a highly significant difference between fitted and observed values). In other words, the null model is not good.
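You can confirm that tiny p-value yourself with pchisq( ):

> 1 - pchisq(3693.884, df=24)   # effectively 0: the intercept-only model fits badly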

Adding in our predictor (just "Age" in this case) decreased the deviance by 3667 points on 1 degree of freedom. Again, this is interpreted as a chi-square value and indicates a highly significant decrease in deviance. The residual deviance is 26.7 on 23 degrees of freedom; we use it to test the overall fit of the model by once again treating it as a chi-square value. A chi-square of 26.7 on 23 degrees of freedom yields a p-value of 0.269, so the null hypothesis (that the model fits) is not rejected: the fitted values are not significantly different from the observed values.
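Both tests in this paragraph are easy to reproduce with pchisq( ), or all at once with anova( ); a short sketch:

> 1 - pchisq(3693.884 - 26.703, df=1)   # drop in deviance from adding Age: essentially 0
> 1 - pchisq(26.703, df=23)             # goodness of fit: about 0.269
> anova(glm.out, test="Chisq")          # the same tests in one analysis-of-deviance table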

This tutorial is adapted from material by Prof. King.

Related links:
Continue to Logistic regression with multiple predictors
Back to Confounding and collinearity in R regression analytics