Analysis of covariance
The obvious next step is to combine regression and analysis of variance, i.e. understand how a dependent variable depends on both categorical and continuous variables. This is called analysis of co-variance, or ANCOVA.
Let x be category with two levels, z is continuous, fit is y=x1*(a1+b1z) +x2*(a2+b2z). Or can think of as doing two regressions, one for each variable.
For example, say we wanted to model weight as a function of sex and age, then we have:
weightmale=amale+bmale x age,
weightfemale=afemale+bfemale x age
A model with four parameters would be required if both intercept and slope differ for males and females. Sketch for yourself what this would look like on a graph, and then sketch what it would look like if you only needed three parameters, either different intercepts but common slope, or the same intercept with different slopes. What about if only one intercept and slope is needed for both groups, i.e. no effect of gender? Or if age had no effect? What is the null hypothesis graph in which neither variable has an effect?
We’ll go through an example from Crawley: the data is an experiment on how much plants regrow after grazing (sorry for all the botanical examples, but the data was conveniently to hand!). The initial size of the plant at the top of its rootstock is recorded, and the weight of seeds it produces at the end of the season is the response or dependent variable. You can download the textfile of data here: compensation.txt
What would you conclude so far about the effect of initial size on regrowth? And of grazing on regrowth? Does this suggest grazing is good or bad for seed production?
Fit the full model:
> model<-lm(Fruit~Grazing*Root) # the order matters, this separates by grazing then fits fruit to root
This suggests that the interaction of grazing and root is not significant, so we could fit a simpler model:
And test if this is significantly different:
It is not, but in fact you already had that information in the F-value for the interaction in the ANOVA table.
Look at the parameter estimates:
How much explanatory power does this model have ( r2)?
The equation fitted without the interaction is y=x1*a1+x2*a2+b1z, where x is the levels of grazing, and z the initial size of the root. Looking at the output of summary.lm(model2) , the first parameter (Intercept) is the intercept (a1) for the first ‘level’ of grazing. By default R uses alphabetical order, so this is for ‘grazed’ - it might have been better to enforce the opposite order (see last session) because ‘ungrazed’ is the control. The second parameter is the difference in the intercept from the ‘grazed’ to ‘ungrazed’, so the intercept for ‘ungrazed’ (a2) is the sum of the first two parameters. The third parameter is the slope (b1) of fruit against root; as there is no interaction, this is the same for both levels of the factor.
To see what this looks like graphically we will split the data then plot, and fit the relevant lines.
> plot(Root,Fruit,type="n",ylab="Seed production",xlab="Initial Root Diameter")
> points(sr[],sf[],pch=16) # these are the grazed plants (first level of grazing)
> abline(-127.829,23.56) # this is the intercept and slope for grazed
> points(sr[],sf[]) # these are the ungrazed plants (second level of grazing)
> abline(-127.829+36.103,23.56) # this is the intercept and slope for ungrazed
You should be able to see that there is a strong effect of initial root size to resulting seed production, but grazing reduces the seed production. The reason our first boxplot seemed to show a positive grazing effect is that plant size was not randomly distributed between the two conditions, most of the larger plants were in the grazed group, and most smaller ones in the ungrazed group. Hence a naive analysis, e.g.:
would have led us to the wrong conclusion.
Three other points to note: