The "MASS" library contains a data set called "menarche", which has 3 variables:

1) "Age" (average age of age-homogeneous groups of girls),

2) "Total" (number of girls in each group),

3) "Menarche" (number of girls in the group who have reached menarche)

Don't feel awkward about this data; most R/SAS users are over 18 anyway.

```
> library("MASS")
> data(menarche)
> str(menarche)
'data.frame':   25 obs. of  3 variables:
 $ Age     : num  9.21 10.21 10.58 10.83 11.08 ...
 $ Total   : num  376 200 93 120 90 88 105 111 100 93 ...
 $ Menarche: num  0 0 0 2 2 5 10 17 16 29 ...
> summary(menarche)
      Age            Total           Menarche      
 Min.   : 9.21   Min.   :  88.0   Min.   :   0.00  
 1st Qu.:11.58   1st Qu.:  98.0   1st Qu.:  10.00  
 Median :13.08   Median : 105.0   Median :  51.00  
 Mean   :13.10   Mean   : 156.7   Mean   :  92.32  
 3rd Qu.:14.58   3rd Qu.: 117.0   3rd Qu.:  92.00  
 Max.   :17.58   Max.   :1049.0   Max.   :1049.00  
> plot(Menarche/Total ~ Age, data=menarche)
```

From the plot, it appears a logistic fit is called for here. The fit would be done this way...

```
> glm.out = glm(cbind(Menarche, Total-Menarche) ~ Age,
+               family=binomial(logit), data=menarche)
```

1) glm( ) is the function used to fit generalized linear models. With "family=" set to "binomial" with a "logit" link, glm( ) produces a logistic regression. Because we are using glm( ) with binomial errors in the response variable, the ordinary assumptions of least squares linear regression (normality and homoscedasticity) don't apply.

2) Our data frame does not contain a row for every case (i.e., every girl on whom data were collected), so we do not have a binary (0,1) coded response variable. No problem! If we feed glm( ) a table (or matrix) in which the first column is the number of successes and the second column is the number of failures, R will take care of the coding for us. In the analysis above, we made that table on the fly inside the model formula by binding "Menarche" and "Total − Menarche" into the columns of a table using cbind( ).
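As an aside (not in the original tutorial), R also accepts a grouped response written as a proportion, with the group sizes supplied as prior "weights". A quick sketch confirming that this gives the same fit as the cbind( ) form:

```r
library(MASS)  # for the menarche data
data(menarche)

# Form 1: two-column matrix of (successes, failures), as in the tutorial
fit.matrix <- glm(cbind(Menarche, Total - Menarche) ~ Age,
                  family = binomial(logit), data = menarche)

# Form 2: response as a proportion of successes, group sizes as weights
fit.prop <- glm(Menarche / Total ~ Age, weights = Total,
                family = binomial(logit), data = menarche)

# Both specifications yield identical coefficient estimates
coef(fit.matrix)
coef(fit.prop)
```

Which form to use is a matter of taste; the cbind( ) form has the advantage of keeping the raw counts visible in the model call.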

Let's look at how closely the fitted values from our logistic regression
match the observed values...

```
> plot(Menarche/Total ~ Age, data=menarche)
> lines(menarche$Age, glm.out$fitted, type="l", col="red")
> title(main="Menarche Data with Fitted Logistic Regression Line")
```

```
> summary(glm.out)

Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age,
    family = binomial(logit), data = menarche)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0363  -0.9953  -0.4900   0.7780   1.3675  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -21.22639    0.77068  -27.54   <2e-16 ***
Age           1.63197    0.05895   27.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3693.884  on 24  degrees of freedom
Residual deviance:   26.703  on 23  degrees of freedom
AIC: 114.76

Number of Fisher Scoring iterations: 4
```

The following requests also produce useful results: glm.out$coef, glm.out$fitted, glm.out$resid, glm.out$effects, and anova(glm.out).

Recall that the response variable is the log odds, so the coefficient of "Age" can be interpreted as "for every one-year increase in age, the odds of having reached menarche increase by a factor of exp(1.632) = 5.11."
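A quick sketch of this interpretation in R (refitting the model so the snippet stands alone). As a bonus, the median age at menarche, i.e., the age at which the predicted probability is 0.5, falls out of the same coefficients:

```r
library(MASS)
data(menarche)
glm.out <- glm(cbind(Menarche, Total - Menarche) ~ Age,
               family = binomial(logit), data = menarche)

# Odds ratio for a one-year increase in Age
exp(coef(glm.out)["Age"])                              # about 5.11

# Age at which the predicted probability is 0.5
# (where the linear predictor equals zero): -intercept / slope
-coef(glm.out)["(Intercept)"] / coef(glm.out)["Age"]   # about 13.0
```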

To evaluate the overall performance of the model, look at the null deviance
and the residual deviance near the bottom of the printout. The null deviance shows how
well the response is predicted by a model with nothing but an intercept (the grand
mean). It is treated as a chi square value on 24 degrees of freedom (the p-value is
vanishingly small given the huge deviance), and it indicates very little fit (**a highly
significant difference between fitted values and observed values**). In other words, the
null model is a poor fit.
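You can verify this with pchisq( ); the upper-tail p-value for a deviance this large on 24 degrees of freedom is effectively zero:

```r
# Upper-tail p-value for the null deviance, treated as chi square on 24 df
pchisq(3693.884, df = 24, lower.tail = FALSE)  # essentially 0
```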

Adding in our predictor (just "Age" in this case) **decreased
the deviance by 3667 points on 1 degree of freedom**. Again, this is interpreted
as a chi square value and indicates a highly significant decrease in deviance.
The residual deviance is 26.7 on 23 degrees of freedom. We use it to test the
overall fit of the model, once again treating it as a chi square value. **A
chi square of 26.7 on 23 degrees of freedom yields a p-value of 0.269**. The null
hypothesis (i.e., that the model fits) is not rejected: the fitted values are not
significantly different from the observed values.
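Both chi square tests described above can be reproduced directly in R; a short sketch (refitting the model first so it runs on its own):

```r
library(MASS)
data(menarche)
glm.out <- glm(cbind(Menarche, Total - Menarche) ~ Age,
               family = binomial(logit), data = menarche)

# Goodness of fit: residual deviance on its degrees of freedom
pchisq(deviance(glm.out), df = df.residual(glm.out),
       lower.tail = FALSE)       # about 0.269, so the model is not rejected

# Significance of the 3667-point deviance drop from adding Age
anova(glm.out, test = "Chisq")
```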
This tutorial is copyright Prof. King.