> library("MASS")
> data(menarche)
> str(menarche)
'data.frame':   25 obs. of  3 variables:
 $ Age     : num  9.21 10.21 10.58 10.83 11.08 ...
 $ Total   : num  376 200 93 120 90 88 105 111 100 93 ...
 $ Menarche: num  0 0 0 2 2 5 10 17 16 29 ...
> summary(menarche)
      Age            Total           Menarche      
 Min.   : 9.21   Min.   :  88.0   Min.   :   0.00  
 1st Qu.:11.58   1st Qu.:  98.0   1st Qu.:  10.00  
 Median :13.08   Median : 105.0   Median :  51.00  
 Mean   :13.10   Mean   : 156.7   Mean   :  92.32  
 3rd Qu.:14.58   3rd Qu.: 117.0   3rd Qu.:  92.00  
 Max.   :17.58   Max.   :1049.0   Max.   :1049.00  
> plot(Menarche/Total ~ Age, data=menarche)
From the graph at right, it appears a logistic fit is called for here. The fit would be done this way...
> glm.out = glm(cbind(Menarche, Total-Menarche) ~ Age,
+               family=binomial(logit), data=menarche)
glm( ) is the function used to fit generalized linear models. With "family=" set to "binomial" with a "logit" link, glm( ) produces a logistic regression. Because we are using glm( ) with binomial errors in the response variable, the ordinary assumptions of least squares linear regression (normality and homoscedasticity) don't apply.
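As an aside (not shown in the transcript above), glm( ) also accepts the response as a proportion of successes, with the group sizes supplied through the "weights=" argument. A minimal sketch of this equivalent call, which should reproduce the same coefficients as the cbind( ) form:

```r
library(MASS)
data(menarche)

# Equivalent fit: proportion response plus group sizes as weights
glm.out2 <- glm(Menarche/Total ~ Age, weights = Total,
                family = binomial(logit), data = menarche)
coef(glm.out2)  # same estimates as the cbind() version
```

Either form is fine for grouped binomial data; the cbind( ) form is the one used in the rest of this tutorial.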
Let's look at how closely the fitted values from our logistic regression
match the observed values...
> plot(Menarche/Total ~ Age, data=menarche)
> lines(menarche$Age, glm.out$fitted, type="l", col="red")
> title(main="Menarche Data with Fitted Logistic Regression Line")
> summary(glm.out)

Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit), 
    data = menarche)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0363  -0.9953  -0.4900   0.7780   1.3675  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -21.22639    0.77068  -27.54   <2e-16 ***
Age           1.63197    0.05895   27.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3693.884  on 24  degrees of freedom
Residual deviance:   26.703  on 23  degrees of freedom
AIC: 114.76

Number of Fisher Scoring iterations: 4

The following requests also produce useful results: glm.out$coef, glm.out$fitted, glm.out$resid, glm.out$effects, and anova(glm.out).
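Since MASS is already loaded, one useful derived quantity is the age at which the fitted probability of menarche is 0.5, which on the logit scale is simply -intercept/slope. MASS's dose.p( ) computes this (with a standard error) directly; a short sketch:

```r
library(MASS)
data(menarche)
glm.out <- glm(cbind(Menarche, Total - Menarche) ~ Age,
               family = binomial(logit), data = menarche)

# Age at which the fitted probability is 0.5, i.e. 21.22639/1.63197,
# about age 13
dose.p(glm.out, p = 0.5)
```

This agrees with the plot, where the fitted curve crosses 0.5 at roughly age 13.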
Recall that the model is linear on the log-odds scale, so the coefficient of "Age" can be interpreted as "for every one-year increase in age, the odds of having reached menarche increase by a factor of exp(1.632) = 5.11."
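That odds-ratio interpretation can be checked directly by exponentiating the fitted coefficients:

```r
library(MASS)
data(menarche)
glm.out <- glm(cbind(Menarche, Total - Menarche) ~ Age,
               family = binomial(logit), data = menarche)

# Multiplicative change in the odds per one-year increase in age
exp(coef(glm.out)["Age"])  # about 5.11
```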
To evaluate the overall performance of the model, look at the null deviance
and residual deviance near the bottom of the print out. Null deviance shows how
well the response is predicted by a model with nothing but an intercept (grand
mean). This is essentially a chi-square value on 24 degrees of freedom. Here the deviance is so large that the corresponding p-value is vanishingly small, indicating a highly significant difference between fitted values and observed values; in other words, the intercept-only (null) model fits very poorly.
Adding in our predictor--just "Age" in this case--decreased the deviance by 3667 points on 1 degree of freedom. Again, this is interpreted as a chi-square value and indicates a highly significant decrease in deviance.

The residual deviance is 26.7 on 23 degrees of freedom. We use this to test the overall fit of the model by once again treating it as a chi-square value. A chi-square of 26.7 on 23 degrees of freedom yields a p-value of 0.269, so the null hypothesis--that the model fits--is not rejected. The fitted values are not significantly different from the observed values.

This tutorial is copyright Prof. King.
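The deviance calculations above can be reproduced in R with pchisq( ); anova( ) with test="Chisq" gives the change-in-deviance test in one step:

```r
library(MASS)
data(menarche)
glm.out <- glm(cbind(Menarche, Total - Menarche) ~ Age,
               family = binomial(logit), data = menarche)

# Decrease in deviance due to Age, on 1 df (about 3667)
glm.out$null.deviance - glm.out$deviance

# Goodness-of-fit: residual deviance against chi-square (p about 0.269)
pchisq(glm.out$deviance, glm.out$df.residual, lower.tail = FALSE)

# The same change-in-deviance test as an analysis of deviance table
anova(glm.out, test = "Chisq")
```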