Analysis of covariance

The obvious next step is to combine regression and analysis of variance, i.e. to model how a dependent variable depends on both categorical and continuous predictors. This is called analysis of covariance, or ANCOVA.

Let x be a categorical variable with two levels (coded as indicator variables x1 and x2), and let z be continuous. The fitted model is y = x1*(a1 + b1*z) + x2*(a2 + b2*z). Equivalently, you can think of this as doing two regressions, one for each level of the factor.
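As a concrete sketch of that equivalence (the factor levels and numbers below are invented for illustration, not from any real dataset), fitting the interaction model in one go recovers the same intercepts and slopes as two separate regressions:

```r
# Simulated sketch: a two-level factor x and a continuous covariate z.
set.seed(1)
x <- factor(rep(c("A", "B"), each = 50))
z <- runif(100, 0, 10)
y <- ifelse(x == "A", 2 + 1.5 * z, 5 + 0.5 * z) + rnorm(100, sd = 0.1)

full <- lm(y ~ x * z)                   # one model: y = x1*(a1 + b1*z) + x2*(a2 + b2*z)
sepA <- lm(y[x == "A"] ~ z[x == "A"])   # two separate regressions,
sepB <- lm(y[x == "B"] ~ z[x == "B"])   # one per level of the factor

# coef(full): "(Intercept)" and "z" are the intercept and slope for level A;
# "xB" and "xB:z" are the DIFFERENCES in intercept and slope for level B.
coef(full)
```

Checking `coef(full)` against `coef(sepA)` and `coef(sepB)` confirms the two ways of thinking about the model give identical intercepts and slopes.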

For example, say we wanted to model weight as a function of sex and age, then we have:

weight_male = a_male + b_male × age,

weight_female = a_female + b_female × age

A model with four parameters would be required if both intercept and slope differ between males and females. Sketch for yourself what this would look like on a graph, then sketch what it would look like if you only needed three parameters: either different intercepts with a common slope, or the same intercept with different slopes. What if only one intercept and one slope are needed for both groups, i.e. there is no effect of sex? Or if age had no effect? What is the null-hypothesis graph, in which neither variable has an effect?

We’ll go through an example from Crawley: the data come from an experiment on how much plants regrow after grazing (sorry for all the botanical examples, but the data were conveniently to hand!). The initial size of the plant at the top of its rootstock is recorded, and the weight of seeds it produces at the end of the season is the response or dependent variable. You can download the text file of data here: compensation.txt

> compensation<-read.table("compensation.txt",header=TRUE,stringsAsFactors=TRUE) # read the data in first; stringsAsFactors makes Grazing a factor (needed in R 4.0 and later)

> attach(compensation)

> str(compensation)

> plot(Root,Fruit)

> plot(Grazing,Fruit)

What would you conclude so far about the effect of initial size on regrowth? And of grazing on regrowth? Does this suggest grazing is good or bad for seed production?

Fit the full model:

> model<-lm(Fruit~Grazing*Root) # Grazing*Root expands to Grazing + Root + their interaction; the order of terms matters for the sequential sums of squares

> summary.aov(model)

This suggests that the interaction of grazing and root is not significant, so we could fit a simpler model:

> model2<-lm(Fruit~Grazing+Root)

And test if this is significantly different:

> anova(model,model2)

It is not, but in fact you already had that information in the F-value for the interaction in the ANOVA table.

Look at the parameter estimates:

> summary.lm(model2)

How much explanatory power does this model have (r²)?

The equation fitted without the interaction is y = x1*a1 + x2*a2 + b1*z, where x codes the levels of grazing, and z is the initial size of the root. Looking at the output of summary.lm(model2), the first parameter (Intercept) is the intercept (a1) for the first ‘level’ of grazing. By default R orders levels alphabetically, so this is for ‘grazed’ - it might have been better to enforce the opposite order (see last session), because ‘ungrazed’ is the control. The second parameter is the difference in intercept between ‘grazed’ and ‘ungrazed’, so the intercept for ‘ungrazed’ (a2) is the sum of the first two parameters. The third parameter is the slope (b1) of fruit against root; as there is no interaction, this is the same for both levels of the factor.
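As an aside, relevel() is one way to enforce that opposite order: it changes which level of a factor is the reference. A minimal sketch with a toy factor (the level names are chosen to echo this example, not taken from the data file):

```r
# By default, factor() orders levels alphabetically.
g <- factor(c("Grazed", "Grazed", "Ungrazed", "Ungrazed"))
levels(g)                       # "Grazed" comes first, so it is the reference

# relevel() moves the chosen level to the front, making it the reference,
# so model coefficients are then expressed relative to the control.
g2 <- relevel(g, ref = "Ungrazed")
levels(g2)                      # now "Ungrazed" is the reference level
```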

To see what this looks like graphically we will split the data then plot, and fit the relevant lines.

> sf<-split(Fruit,Grazing)

> sr<-split(Root,Grazing)

> plot(Root,Fruit,type="n",ylab="Seed production",xlab="Initial Root Diameter")

> points(sr[[1]],sf[[1]],pch=16) # these are the grazed plants (first level of grazing)

> abline(-127.829,23.56) # this is the intercept and slope for grazed

> points(sr[[2]],sf[[2]]) # these are the ungrazed plants (second level of grazing)

> abline(-127.829+36.103,23.56) # this is the intercept and slope for ungrazed

You should be able to see that there is a strong effect of initial root size on resulting seed production, but that grazing reduces seed production. The reason our first boxplot seemed to show a positive grazing effect is that plant size was not randomly distributed between the two conditions: most of the larger plants were in the grazed group, and most of the smaller ones in the ungrazed group. Hence a naive analysis, e.g.:

> t.test(Fruit~Grazing)

would have led us to the wrong conclusion.
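To see how this reversal arises, here is a small simulation (invented numbers, not the compensation data) in which the larger plants happen to be in the grazed group: the raw group means point one way, while the size-adjusted comparison points the other:

```r
# Simulated sketch of the confounding described above.
set.seed(42)
root  <- c(runif(30, 7, 10), runif(30, 4, 7))       # grazed plants start larger
graze <- factor(rep(c("Grazed", "Ungrazed"), each = 30))
# True effect: fruit rises with root size, and grazing REDUCES fruit by 36
fruit <- 20 * root - 36 * (graze == "Grazed") + rnorm(60, sd = 5)

# Naive comparison: the grazed group mean is HIGHER, wrongly suggesting
# grazing helps
tapply(fruit, graze, mean)

# Adjusting for initial size recovers the true, harmful grazing effect:
# the grazeUngrazed coefficient is positive (~+36), i.e. ungrazed plants
# produce more fruit than grazed plants of the same initial size
coef(lm(fruit ~ graze + root))
```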

Three other points to note:

• Using a covariate is often a good way to deal with something in your experiment that you would like to control for, because it is likely to increase the variance, but cannot control directly. E.g. you might use an IQ score as a covariate in a memory task. If there is any correlation between your covariate and your response measure, including it in the analysis will substantially improve your chances of detecting the effect of the factor you are interested in, because you have removed the variance due to the covariate.
• It can sometimes be ambiguous whether a variable should be treated as a factor or as a ‘continuous’ numeric variable. If the latter is an option it may be preferable, as it reduces the number of parameters (from one per level of the factor to a single slope) and hence gives a simpler model.
• In this experiment, we were not actually interested in showing that larger plants at the start of the season have more seeds at the end of the season (we could have guessed this relationship exists) but rather were trying to stop that effect from interfering with our assessment of the effect of grazing. In some cases of ANCOVA, it is the categorical variable we want to control for, to test a scientific hypothesis about the regression relationship. A good example is our iris data: it is not surprising that irises of different species have different petal lengths, but having measurements from different species added noise to our attempt to fit the relationship of sepal length to petal length – we probably need a different intercept and slope for each species. To test your understanding, you should try running an ANCOVA on the iris data.
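As a starting point for that exercise, one possible sketch using the iris dataset built into R (so this runs as-is), modelling petal length from sepal length while controlling for species:

```r
# Full model: separate intercept and slope for each of the three species
m.int <- lm(Petal.Length ~ Species * Sepal.Length, data = iris)
# Simpler model: separate intercepts but a common slope
m.add <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)

anova(m.add, m.int)        # is the interaction (separate slopes) needed?
summary(m.add)$r.squared   # explanatory power of the common-slope model
```

As with the grazing example, compare the two models before interpreting the parameter estimates, and note which species R has taken as the reference level.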