Statistical models

Simple regression

In session one, we did a simple regression, y=a+bx. E.g. using the built in data set ‘iris’:

 > x=iris$Sepal.Length

> y=iris$Petal.Length

To look at the data (always start with descriptive statistics):

> plot(x,y)

Let’s fit a simple linear model, to see how y depends on x.

>model = lm(y~x)


You can look at the fit with:

> abline(model$coefficients)

We could do an ANOVA on this regression, i.e. compare the variance explained by the model to the residual variance; and this gives us an F-ratio: does the regression explain the data better than the null hypothesis that there is no relation?


The F-value is the ratio of SSM/dfM to SSR/dfR. An alternative description of the goodness of fit of the model is to take the ratio SSR/SST, i.e. the fraction of the variation explained by the regression. This is equal to r2 – the correlation coefficient squared. You can check:

> 352.9/(352.9+111.5) # SS taken from the summary.aov output

> cor(x,y)^2

Can also get a standard error for the slope and the intercept, and use these to do a t-test against 0.

>summary.lm(model) # Note this also gives you the R-square and F value output discussed above

E.g. if the intercept, a, is not significantly different from zero, we could explain the data with the simpler equation y=bx. If the slope b is not significantly different from zero, we could explain the data with the simpler equation y=a.

Simple 1-way ANOVA

We’re going to use a data set called InsectSprays. 6 different insect sprays (1 Independent Variable with 6 levels) were tested to see if there was a difference in the number of insects found in the field after each spraying (Dependent Variable).

> attach(InsectSprays)

> data(InsectSprays)

> str(InsectSprays)

'data.frame':   72 obs. of  2 variables:

 $ count: num  10 7 20 14 14 12 10 23 17 20 ...

 $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...


1.      Descriptive statistics

a.      Mean, variance, number of elements in each cell

b.      Visualise the data – boxplot; look at distribution, look for outliers


We’ll use the tapply() function which is a helpful shortcut in processing data, basically allowing you to specify a response variable, a factor (or factors) and a function that should be applied to each subset of the response variable defined by each level of the factor. I.e. Instead of doing: 


> mean(count[spray=="A"])   # and the same for B, C, D etc.


We use tapply(response,factor,function-name) as follows


Let’s look at the means:

> tapply(count, spray, mean) 

        A         B         C         D         E         F

14.500000 15.333333  2.083333  4.916667  3.500000 16.666667


The variances:

> tapply(count, spray, var)

 A B C D E F

22.272727 18.242424 3.901515 6.265152 3.000000 38.606061

And sample sizes

> tapply(count, spray, length)

 A B C D E F

12 12 12 12 12 12

And a boxplot:

> boxplot(count ~ spray)

N.B. Default order for factors in a dataframe is alphabetical. Often, there may be a more logical order, and you would like boxplots and other output to use that order. More importantly, when it comes to interpreting the output of ANOVA and particularly contrasts between means (see below) it is important to have the control condition as the first factor. If factors are not in the correct order – i.e., if there was logical order but this is garbled when listed alphabetically (e.g.  "Very.short","Short","Long","Very.long" would end up as “Long”, “Short”, ”Very.long”, ”Very.short”)  you need to explicitly re-order the variables. E.g. in the current case to change to, for example, F < B < C < D < E < A, use:


> spray=factor(spray,levels=c("F","B","C","D","E","A"))

Check it:
> tapply(count,Photoperiod,mean)

        F         B         C         D         E         A

16.666667 15.333333  2.083333  4.916667  3.500000 14.500000


If you want to check that a variable is a factor (especially for variables with numbers as factor levels), use:

> is.factor(spray)


[1] TRUE


How does the data look? We can test for homogeneity of variance for more than two groups using:

> bartlett.test(count ~ spray)


In this case the significant result suggests variances are not homogeneous, which violates one assumption of ANOVA.

For the purposes of today’s session we will go ahead with an ANOVA. This is not entirely unjustified, as ANOVA is moderately robust against the violation of equal variances if the number in each group is equal; and it is possible to include a correction.

You can consult to read more about how to test different assumptions in R. Also, contains a discussion of how violations of assumptions can affect your results.


2.      Run 1-way ANOVA


We can use:

> oneway.test(count~spray)


        One-way analysis of means (not assuming equal variances)

data:  count and spray

F = 36.0654, num df = 5.000, denom df = 30.043, p-value = 7.999e-12


Alternatively we can treat this as a regression, and use (as we did in regression above):

> model = lm(count~spray)

> model

A one-way ANOVA has a factor x, with k levels. So we are fitting the equation y=b1x1+b2x2+b3x3… where xi=1 if the factor is at level i, and 0 otherwise. But note that by default R will actually fit y=a+b2x2+b3x3…, i.e. the parameters bi i=2,3… are the difference of each level from the first level. So here the intercept “a” is the mean of the first spray type, and the co-efficients give the difference to the mean for each of the other spray types. You can see why it becomes important to enforce a sensible order, e.g. with the control variable first, rather than letting R use alphabetical order. However you can also force the analysis to fit y=b1x1+b2x2+b3x3 by using:

> model = lm(count~spray-1)

> model

We now want to look at the statistical significance:

>  summary.aov(model)


However this seems to give us a slightly different F value to the oneway.test we just did. The reason is that oneway.test uses a default setting where equal variances (i.e. homogeneity of variance) are not assumed and Welch’s correction is applied (and this also explains why the denom d.f. (which is normally k*(n-1)) is not a whole number in the output). To change this, set "var.equal=" option to TRUE.


> oneway.test(count~spray, "var.equal="T)



summary.lm() shows some of the same information, but also a separate t-value for each of the co-efficients. The significance of the coefficient is a measure of whether the model including that coefficient is a better fit than the model without that co-efficient; or equivalently here can be thought of as a test of each level of the factor against 0.

In fact there is a wrapper function for lm that does essentially the same analysis but changes the default data displayed to be in the language of an ANOVA rather than a regression and allows some different arguments to be used, and relevant functions to be applied. 

> model.aov = aov(count~spray)

> model.aov

Accessing numerical values programmatically

Sometimes you might want to access specific values from the model or from the summary of the statistical test.

To figure out which fields are accessible we can call names on the model or the summary (in case of summary variable we need to use summary[[1]] as an argument).

> names(model.aov)

 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"      

 [7] "qr"            "df.residual"   "contrasts"     "xlevels"       "call"          "terms"       

[13] "model"       

> summary.aov = summary(model.aov)

> names(summary.aov[[1]])

[1] "Df"      "Sum Sq"  "Mean Sq" "F value" "Pr(>F)" 

Then, we can access the fields in two ways, either with $ operator or with the name of the field as a string in square brackets. For example to get the number of residual degrees of freedom we need to type:

> model.aov$df.residual

[1] 66


> model.aov['df.residual']


[1] 66

In a similar way, if we want to extract F statistic values from the summary we can type:

> summary.aov[[1]]$F

[1] 34.70228       NA

However, this returns a full row from the table, to obtain just the F statistic we need to specify index of the required number in the vector:

> summary.aov[[1]]$F[1]

[1] 34.70228

3. Planned Contrasts

In general, the F-ratio its p-value for an ANOVA tell us that the groups differ, but not which groups differ. In some experimental designs we might have particular expectations about which group differences will be meaningful.

First, we need to specify contrasts attribute for our predictor variable (spray type). We can use one of the default contrast matrices , e.g. contr.helmert() is an orthogonal contrast matrix where each category (except the last) is compared to the mean effect of all subsequent categories – a standard procedure for studies along the lines of control vs. drug A vs. drug B.

> contrasts(InsectSprays$spray)=contr.helmert(6)

> contrasts(InsectSprays$spray)

  [,1] [,2] [,3] [,4] [,5]

A   -1   -1   -1   -1   -1

B    1   -1   -1   -1   -1

C    0    2   -1   -1   -1

D    0    0    3   -1   -1

E    0    0    0    4   -1

F    0    0    0    0    5

In the case of this example it might not be the most appropriate set of contrasts though, but if we want to define our own contrasts it is possible too.

Let's say that first we want to compare first 3 conditions to the last 3 and then run more contrasts within those two subgroups of conditions.

First, we define the weights

> con1=c(1,1,1,-1,-1,-1)

> con2=c(2,-1,-1,0,0,0)

> con3=c(0,1,-1,0,0,0)

> con4=c(0,0,0,2,-1,-1)

> con5=c(0,0,0,0,1,-1)

> our.contrasts=cbind(con1,con2,con3,con4,con5)

> our.contrasts

     con1 con2 con3 con4 con5

[1,]    1    2    0    0    0

[2,]    1   -1    1    0    0

[3,]    1   -1   -1    0    0

[4,]   -1    0    0    2    0

[5,]   -1    0    0   -1    1

[6,]   -1    0    0   -1   -1


Then we attach contrasts attribute.

> contrasts(InsectSprays$spray)=our.contrasts

Finally, once we obtain our model with aov() function we use summary.lm() instead of standard summary() to observe the results of the contrasts.

> model.aov=aov(count~spray,data=InsectSprays)

> summary.lm(model.aov)



aov(formula = count ~ spray, data = InsectSprays)



   Min     1Q Median     3Q    Max

-8.333 -1.958 -0.500  1.667  9.333



            Estimate Std. Error t value Pr(>|t|)   

(Intercept)   9.5000     0.4622  20.554  < 2e-16 ***

spraycon1     1.1389     0.4622   2.464 0.016350 * 

spraycon2     1.9306     0.4622   4.177 8.86e-05 ***

spraycon3     6.6250     0.8006   8.276 8.51e-12 ***

spraycon4    -1.7222     0.4622  -3.726 0.000405 ***

spraycon5    -6.5833     0.8006  -8.223 1.05e-11 ***


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Residual standard error: 3.922 on 66 degrees of freedom

Multiple R-squared:  0.7244,   Adjusted R-squared:  0.7036

F-statistic:  34.7 on 5 and 66 DF,  p-value: < 2.2e-16

This output shows t-test statistic value and significance for each of our planned contrasts, from spraycon1 to spraycon5.

This way we can code orthogonal or non-orthogonal contrasts.

4. Post Hoc Tests

A common approach alternative approach to planned contrasts is simply to carry out a set of t-tests between various groups, potentially whatever looks interesting in the data (hence the name, post-hoc). It is necessary to control for the overall type I error rate, α, given the number of comparisons, c. The simplest approach is using the simple Bonferonni correction – set the significance level at α/c, e.g. here, if comparing each group against the first group we would be doing 5 comparisons, so would need each t-test to have a p<.01 to be significant to ensure overall α<0.05. More precise is to use Dunn-Sidak’s correction:


There are various other methods for multiple comparisons, each of which makes slightly different assumptions in attempting to control for both type I and type II error (power), including: Neuman-Keuls (liberal on Type I); Scheffé (strict on Type I but bad for Type II); Dunnett’s (optimised for comparing all treatments to a single control); Games-Howell (doesn’t assume equal variances); and Tukey HSD (optimised for when all possible pairwise comparisons are being made). Try:


> TukeyHSD(model.aov) #note can apply to model produced by lm using TukeyHSD(aov(model)) but this function does not handle the model without the intercept


5. Model checking plots

We can also look at some plots that R prepares automatically when fitting a model


> plot(model)      

You need to step through the plots by hitting the return key.

The first shows if there is any pattern in the residuals by plotting them against the fitted values from the model. Ideally this should show similar scatter for each condition. Here there is a worrying effect of larger residuals for larger fitted values. This is called ‘heteroscedascity’ meaning that not only is variance in the response not equal across groups, but that the variance has some specific relationship with the size of the response. In fact you could see this in the original boxplots. It contradicts assumptions made when doing an ANOVA.

The second plot is a test for normality of the residuals; if they are not normal, the assumptions of ANOVA are potentially violated.


The third is like the first plot but now to specifically test if the residuals increase with the fitted values, which they do.

The fourth gives the residuals by factor levels, suggesting which factors have been best fitted.

The non-normality of the residuals suggests this is a situation where we might want to use a non-parametric alternative to the ANOVA:

> kruskal.test(count ~ spray, data=InsectSprays)

        Kruskal-Wallis rank sum test

data:  count by spray

Kruskal-Wallis chi-squared = 54.6913, df = 5, p-value = 1.511e-10


As for the Wilcoxon test (or Mann-Whitney test) with two samples, this test converts the response values to ranks, and tests whether the ranks are distributed equally across the conditions, as would be expected under the null hypothesis. Note that it does still assume similar distribution shapes in each group, an assumption violated in our earlier test for homogeneity of variance.


Beyond simple regression

Let’s go back and look at the same types of plots for the regression analysis we did at the start:

> model.iris = lm(y~x)    # recall this did a regression of petal length on sepal length for the Iris data set

> plot(model.iris)

Note on first plot the data points with highest residuals are labelled with numbers, indicating their row in the data frame – these may be outliers. There’s also a lot of structure in this plot, which we don’t want to see. The second shows the residuals are roughly normally distributed. The fourth gives an idea of which data points are having the most effect (leverage) on the fit.

A linear fit is not entirely convincing so you could try polynomial regression y=a+b1x+b2x2

> xsquared = x^2

> model.poly=lm(y~x+xsquared)

> summary(model.poly)

Have a look at the fit:

> z=-17.4+5.4*x-0.3*x^2 # coefficients taken from the summary output

> lines(sort(x),sort(z))

Could compare linear and quadratic models with:


This suggests the polynomial fit is significantly better.

Can also do non-linear models with nls() funnction but won’t discuss further here.

Can do multiple regression, e.g. fit y=a+b1x1+b2x2

> x1=iris$Sepal.Length

> x2=iris$Petal.Width


Or with interaction, i.e. y=a+b1x1+b2x2+b3x1x2