Brief notes on ANOVA, model fitting and experimental design

A quick recap:

·        We fit two models to the data, where one is simpler than the other

·        Our null hypothesis is that they explain the data (account for the variance in the data) equally well. We would need to reject this hypothesis to justify using the more complex model.

·        We test the null hypothesis by constructing a test statistic (e.g. F) which has a known distribution under the null hypothesis, and calculating the probability of obtaining a value of the statistic at least as extreme as the one actually observed. If this probability is below some threshold we reject the null hypothesis.

So far we have mostly focussed on a very simple experimental design in which there is one factor (e.g. ‘drug’) with several levels (e.g. placebo, drugA, drugB) and we measure one outcome (e.g. health) after administering different levels of the factor to different groups of subjects.

·        In simple (one-way) ANOVA, the statistic F is the ratio of SSM/dfM to SSR/dfR (a worked numerical sketch follows this list)

·        SSR is the sum of squares of residuals for the complex model y = b0 + b1*x1 + b2*x2… (i.e. assuming group differences, so including a parameter for each group)

·        SSM is the sum of squares explained by the complex model (the ‘model’ sum of squares)

·        SST is the sum of squares of residuals for the simple model y = b0 (i.e. no difference between the groups)

·        SSM = SST−SSR, which is the difference in the fit of the simple and complex models

·        Under the null hypothesis, the F statistic will have an F distribution, parameterised by two ‘degrees of freedom’ values:

o   the difference in the number of parameters in the complex and simple models (number of groups − 1) = dfM,

o   the degrees of freedom after fitting the complex model (number of data points – number of groups) = dfR
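
To make this model-comparison view concrete, here is a minimal sketch in Python (the ‘health’ scores for the three drug groups are made up purely for illustration; numpy and scipy are assumed to be available). It computes SST, SSR, SSM and the F statistic by hand and checks the result against scipy’s built-in one-way ANOVA:

    import numpy as np
    from scipy import stats

    # made-up 'health' scores for three groups (placebo, drugA, drugB)
    groups = [np.array([4.1, 5.0, 3.8, 4.6, 4.4]),
              np.array([5.9, 6.3, 5.5, 6.8, 6.1]),
              np.array([5.2, 4.9, 5.6, 5.1, 5.4])]
    y = np.concatenate(groups)
    n, k = len(y), len(groups)

    sst = np.sum((y - y.mean()) ** 2)                       # residuals of the simple model y = b0
    ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups)  # residuals of the complex model (one mean per group)
    ssm = sst - ssr                                         # improvement in fit from the group parameters

    df_m = k - 1        # extra parameters in the complex model
    df_r = n - k        # degrees of freedom left after fitting the complex model

    F = (ssm / df_m) / (ssr / df_r)
    p = stats.f.sf(F, df_m, df_r)   # P(F at least this extreme) under the null
    print(F, p)
    print(stats.f_oneway(*groups))  # scipy's one-way ANOVA gives the same F and p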

Let’s now consider a very slightly more complex design. Let’s say you want to use transcranial magnetic stimulation (TMS) in two different brain areas, and test its effect on cognitive ability. You set up an experiment with two groups who will get TMS in different areas (TMSA and TMSB), and you measure cognitive ability by their scores on a simple puzzle, which they do twice, at time 1 and time 2. The data look something like this:

 

Group     Subject     Score time 1     Score time 2
TMS A     1
          2
          3
          4
          5
TMS B     6
          7
          8
          9
          10

N.B. for simplicity I’ve used just two groups and two measures, but all that follows applies if you have more than two; where I describe t-tests below, you could use ANOVA instead.

There are at least two wrong ways to analyse this data (you can probably think of more):

Pseudo-replication: Run a t-test comparing the 10 scores for group A against the 10 scores for group B (Lazic (2010) discusses recent examples of this error in neuroscience papers). The problem is that the 10 observations per group are not independent: you only have 5 subjects per group, each measured twice.

Separate significance tests: Using a paired t-test (avoiding pseudo-replication!), look for a significant change between the scores at time 1 and time 2 for group A. Do a second t-test comparing scores at time 1 and time 2 for group B. Report a difference between the groups if you find p<.05 for group A and p>.05 for group B (or vice versa). Nieuwenhuis et al. (2011) discuss the frequency of this error in high-level neuroscience publications. The problem is that you could easily get this pattern of results without the difference between group A and group B itself being significant: the comparison you actually care about is the group × time interaction, which these two separate tests never assess.

So what is the right way? It depends on the logic of your experimental design. Why did you collect puzzle scores at two different times?

Situation One: The first score is meant to be a baseline; in fact, the subjects did the puzzle at time 1 before you randomly assigned them to the groups. The simplest approach is to take differences (time 2 minus time 1), giving you 5 scores for each group, and then do a t-test between A and B. Another approach is to use the score at time 1 as a covariate and do an Analysis of Covariance: ANCOVA. That is, you treat the baseline score as an independent continuous variable that is adding noise to the scores, so you include it in the model: y = b0 + b1*group + b2*baseline
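
A minimal sketch of both approaches in Python, assuming a hypothetical data frame with one row per subject; the column names (group, baseline, score) and all values are made up, and pandas, scipy and statsmodels are assumed to be available:

    import pandas as pd
    from scipy import stats
    import statsmodels.formula.api as smf

    # hypothetical data: one row per subject; columns and values are made up
    df = pd.DataFrame({
        "group":    ["A"] * 5 + ["B"] * 5,
        "baseline": [10, 12, 11,  9, 13, 11, 10, 12, 13,  9],  # score at time 1
        "score":    [14, 15, 13, 12, 16, 11, 10, 13, 12, 10],  # score at time 2
    })

    # approach 1: difference scores, then an unpaired t-test between the groups
    diff = df["score"] - df["baseline"]
    print(stats.ttest_ind(diff[df.group == "A"], diff[df.group == "B"]))

    # approach 2: ANCOVA, i.e. y = b0 + b1*group + b2*baseline
    ancova = smf.ols("score ~ C(group) + baseline", data=df).fit()
    print(ancova.summary())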

Situation Two: You are interested in how the effect of TMS in the two areas differs over time, e.g. predicting that TMS A will have a quick and long-lasting effect while TMS B will act more slowly but ultimately have a larger effect. You should do a factorial ANOVA. In fact it might have been better to use a ‘simpler’ design, with different subjects tested at the two different times, because otherwise doing the puzzle for the second time might be a confound. The simpler design would be a standard 2-way ANOVA, fitting y = b0 + b1*group + b2*time + b3*group*time.
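
As a sketch of this ‘simpler’ between-subjects version, here is a 2-way factorial ANOVA in Python; the long-format layout (20 different subjects, one score each), the column names and all values are made up, and pandas and statsmodels are assumed to be available:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # hypothetical long-format data: crossed factors group (TMS A/B) and time (1/2)
    df = pd.DataFrame({
        "group": ["A"] * 10 + ["B"] * 10,
        "time":  ([1] * 5 + [2] * 5) * 2,
        "score": [14, 15, 13, 12, 16, 15, 17, 16, 18, 15,
                  11, 10, 13, 12, 10, 16, 15, 17, 18, 16],
    })

    # y = b0 + b1*group + b2*time + b3*group*time
    model = smf.ols("score ~ C(group) * C(time)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))  # F tests for the two main effects and the interaction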

Situation Three: You are not sure which time is better for seeing the effect, and/or are using the second test to try to ‘boost’ the signal. The simplest approach would be to add or average the scores at the two times, to get 5 scores per group, and do a t-test between A and B. You could just do one test for time 1, and another for time 2, but you would need to reduce your α-level and this ignores the fact that the tests are related (done by the same subjects). Another approach is to treat the scores as two different, but likely correlated, dependent variables (y1 and y2), and do a multivariate ANOVA: MANOVA. This fits [y1,y2] = b0+b1*group.
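
A sketch of both of these approaches in Python, with a hypothetical wide-format data frame (one row per subject); the column names (group, y1, y2) and all values are made up, and pandas, scipy and statsmodels are assumed to be available:

    import pandas as pd
    from scipy import stats
    from statsmodels.multivariate.manova import MANOVA

    # hypothetical wide-format data: one row per subject; values are made up
    df = pd.DataFrame({
        "group": ["A"] * 5 + ["B"] * 5,
        "y1":    [14, 15, 13, 12, 16, 11, 10, 13, 12, 10],  # score at time 1
        "y2":    [15, 17, 14, 13, 18, 12, 11, 14, 13, 11],  # score at time 2
    })

    # approach 1: average the two scores, then an unpaired t-test between the groups
    avg = (df["y1"] + df["y2"]) / 2
    print(stats.ttest_ind(avg[df.group == "A"], avg[df.group == "B"]))

    # approach 2: MANOVA, i.e. [y1, y2] = b0 + b1*group
    manova = MANOVA.from_formula("y1 + y2 ~ C(group)", data=df)
    print(manova.mv_test())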

Situation Four: You want to reduce variance by having each subject act as their own control. Here you might have considered an alternative design in which each subject is scored once after each TMS application, so that all subjects experience both TMS conditions (in randomised order, to control for time effects). This is a classic ‘repeated measures’ design. You cannot use a standard one-way ANOVA (or unpaired t-test) because that violates the assumption of independence. You can use a repeated measures ANOVA (or paired t-test), which fits y = b0 + b1*group + b2*subject. This is sometimes described as partitioning the residual sum of squares: RSS = SS-subject + SS-error. You can then use the (smaller) SS-error term in the F statistic. However, this still assumes that the ‘extra variance’ caused by the subjects is equal across all group differences (the sphericity assumption).
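
A sketch of this repeated measures analysis in Python, with hypothetical crossover data in which every subject is scored once under each TMS condition; the column names and all values are made up, and pandas, scipy and statsmodels are assumed to be available:

    import pandas as pd
    from scipy import stats
    from statsmodels.stats.anova import AnovaRM

    # hypothetical crossover data: each subject is scored once under each condition
    df = pd.DataFrame({
        "subject":   list(range(1, 11)) * 2,
        "condition": ["TMS_A"] * 10 + ["TMS_B"] * 10,
        "score":     [14, 15, 13, 12, 16, 15, 17, 16, 18, 15,
                      11, 10, 13, 12, 10, 12, 13, 14, 15, 12],
    })

    # paired t-test: each subject acts as their own control
    a = df[df.condition == "TMS_A"].sort_values("subject")["score"].values
    b = df[df.condition == "TMS_B"].sort_values("subject")["score"].values
    print(stats.ttest_rel(a, b))

    # repeated measures ANOVA: the subject term soaks up between-subject variance,
    # leaving a smaller error term for the F statistic
    print(AnovaRM(df, depvar="score", subject="subject", within=["condition"]).fit())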

The best approach is to treat it as a mixed (hierarchical) model: see this tutorial.
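
For the original design (both scores from each subject, TMS group varying between subjects), a minimal mixed-model sketch in Python could look like the following, with a random intercept for each subject; the long-format layout, column names and values are made up, and statsmodels is assumed to be available:

    import pandas as pd
    import statsmodels.formula.api as smf

    # hypothetical long-format data for the original design: two scores per subject
    df = pd.DataFrame({
        "subject": list(range(1, 11)) * 2,
        "group":   (["A"] * 5 + ["B"] * 5) * 2,
        "time":    [1] * 10 + [2] * 10,
        "score":   [14, 15, 13, 12, 16, 11, 10, 13, 12, 10,
                    15, 17, 14, 13, 18, 12, 11, 14, 13, 11],
    })

    # fixed effects for group, time and their interaction; random intercept per subject
    mixed = smf.mixedlm("score ~ C(group) * C(time)", data=df, groups=df["subject"]).fit()
    print(mixed.summary())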




Unless explicitly stated otherwise, all material is copyright The University of Edinburgh