Statistical models

Simple regression

In session one, we did a simple regression, y = a + bx. E.g. using the built-in data set 'iris':
> x=iris$Sepal.Length
> y=iris$Petal.Length
To look at the data (always start with descriptive statistics):
> plot(x,y)
Let's fit a simple linear model, to see how y depends on x.
> model = lm(y~x)
> model
You can look at the fit with:
> abline(model$coefficients)
We could do an ANOVA on this regression, i.e. compare the variance explained by the model to the residual variance; this gives us an F-ratio: does the regression explain the data better than the null hypothesis that there is no relation?
> summary.aov(model)
The F-value is the ratio of SSM/dfM to SSR/dfR. An alternative description of the goodness of fit of the model is the ratio SSM/SST, i.e. the fraction of the total variation explained by the regression. This is equal to r² – the correlation coefficient squared. You can check:
> 352.9/(352.9+111.5) # SS values taken from the summary.aov output
> cor(x,y)^2
We can also get a standard error for the slope and the intercept, and use these to do a t-test against 0.
> summary.lm(model) # note this also gives the R-squared and F value discussed above
E.g. if the intercept, a, is not significantly different from zero, we could explain the data with the simpler equation y = bx. If the slope, b, is not significantly different from zero, we could explain the data with the simpler equation y = a.
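One way to formalise those simplifications (a sketch using the x, y and model above) is to fit each reduced model and compare it against the full model with anova():
> model.noint = lm(y ~ x - 1) # y = bx : drop the intercept
> model.null  = lm(y ~ 1)     # y = a  : drop the slope
> anova(model.noint, model)   # does adding the intercept improve the fit?
> anova(model.null, model)    # does adding the slope improve the fit?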
Simple 1-way ANOVA
We're going to use a data set called InsectSprays: 6 different insect sprays (1 independent variable with 6 levels) were tested to see if there was a difference in the number of insects found in the field after each spraying (dependent variable).
> data(InsectSprays)
> attach(InsectSprays)
> str(InsectSprays)
'data.frame':   72 obs. of  2 variables:
 $ count: num  10 7 20 14 14 12 10 23 17 20 ...
 $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
1. Descriptive statistics
a. Mean, variance, number of elements in each cell
b. Visualise the data – boxplot; look at the distribution, look for outliers
We'll use the tapply() function, a helpful shortcut for processing data: you specify a response variable, a factor (or factors) and a function, and the function is applied to each subset of the response variable defined by each level of the factor. I.e. instead of doing:
> mean(count[spray=="A"]) # and the same for B, C, D etc.
we use tapply(response, factor, function-name), as follows.
Let's look at the means:
> tapply(count, spray, mean)
        A         B         C         D         E         F 
14.500000 15.333333  2.083333  4.916667  3.500000 16.666667 
The variances:
> tapply(count, spray, var)
        A         B         C         D         E         F 
22.272727 18.242424  3.901515  6.265152  3.000000 38.606061 
And the sample sizes:
> tapply(count, spray, length)
 A  B  C  D  E  F 
12 12 12 12 12 12 
And a boxplot:
> boxplot(count ~ spray)
N.B. The default order for factors in a dataframe is alphabetical. Often there may be a more logical order, and you would like boxplots and other output to use that order. More importantly, when it comes to interpreting the output of ANOVA, and particularly contrasts between means (see below), it is important to have the control condition as the first factor level. If the factor levels are not in the correct order – i.e. if there is a logical order but it is garbled when listed alphabetically (e.g. "Very.short", "Short", "Long", "Very.long" would end up as "Long", "Short", "Very.long", "Very.short") – you need to explicitly re-order the levels. E.g. in the current case, to change to F < B < C < D < E < A, use:
> spray = factor(spray, levels=c("F","B","C","D","E","A"))
Check it:
> tapply(count, spray, mean)
        F         B         C         D         E         A 
16.666667 15.333333  2.083333  4.916667  3.500000 14.500000 
If you want to check that a variable is a factor (especially for variables with numbers as factor levels), use:
> is.factor(spray)
[1] TRUE
How does the data look? We can test for homogeneity of variance across more than two groups using:
> bartlett.test(count ~ spray)
In this case the significant result suggests the variances are not homogeneous, which violates one assumption of ANOVA.
For the purposes of today's session we will go ahead with an ANOVA. This is not entirely unjustified, as ANOVA is moderately robust against the violation of equal variances if the number in each group is equal; and it is possible to include a correction.
You can consult http://www.statmethods.net/stats/anovaAssumptions.html to read more about how to test different assumptions in R. Also, http://www.basic.northwestern.edu/statguidefiles/oneway_anova_ass_viol.html contains a discussion of how violations of assumptions can affect your results.
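One such option (a sketch, assuming the car package is installed – it is not part of base R) is Levene's test, which is less sensitive to non-normality than Bartlett's test:
> library(car) # install.packages("car") first if needed
> leveneTest(count ~ spray, data=InsectSprays)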
2. Run 1-way ANOVA
We can use:
> oneway.test(count~spray)
        One-way analysis of means (not assuming equal variances)
data:  count and spray
F = 36.0654, num df = 5.000, denom df = 30.043, p-value = 7.999e-12
Alternatively we can treat this as a regression, and use (as we did in the regression above):
> model = lm(count~spray)
> model
A one-way ANOVA has a factor x with k levels. So we are fitting the equation y = b1x1 + b2x2 + … + bkxk, where xi = 1 if the factor is at level i, and 0 otherwise. But note that by default R will actually fit y = a + b2x2 + b3x3 + …, i.e. the parameters bi, i = 2, 3, …, are the differences of each level from the first level. So here the intercept "a" is the mean of the first spray type, and the coefficients give the difference from that mean for each of the other spray types. You can see why it becomes important to enforce a sensible order, e.g. with the control condition first, rather than letting R use alphabetical order. However, you can also force the analysis to fit y = b1x1 + b2x2 + … + bkxk (no intercept) by using:
> model = lm(count~spray-1)
> model
We now want to look at the statistical significance:
> summary.aov(model)
However, this seems to give a slightly different F value to the oneway.test we just did. The reason is that oneway.test uses a default setting where equal variances (i.e. homogeneity of variance) are not assumed and Welch's correction is applied (this also explains why the denominator d.f., which is normally k*(n-1), is not a whole number in the output). To change this, set the var.equal argument to TRUE:
> oneway.test(count~spray, var.equal=TRUE)
> summary.lm(model)
summary.lm() shows some of the same information, but also a separate t-value for each of the coefficients. The significance of a coefficient is a measure of whether the model including that coefficient is a better fit than the model without it; equivalently, here it can be thought of as a test of each level of the factor against 0.
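You can check this interpretation directly: in the no-intercept model the coefficients are simply the group means we computed earlier with tapply():
> coef(lm(count ~ spray - 1)) # one coefficient per spray type
> tapply(count, spray, mean)  # the same values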
In fact there is a wrapper function for lm that does essentially the same analysis, but presents the default output in the language of an ANOVA rather than a regression, accepts some different arguments, and allows the relevant ANOVA functions to be applied.
> model.aov = aov(count~spray)
> model.aov
Accessing numerical values programmatically
Sometimes you might want to access specific values from the model, or from the summary of the statistical test. To figure out which fields are accessible, we can call names() on the model or on the summary (in the case of the summary we need to pass summary[[1]] as the argument).
> names(model.aov)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"       
 [7] "qr"            "df.residual"   "contrasts"     "xlevels"       "call"          "terms"        
[13] "model"        
> summary.aov = summary(model.aov)
> names(summary.aov[[1]])
[1] "Df"      "Sum Sq"  "Mean Sq" "F value" "Pr(>F)" 
Then we can access the fields in two ways: either with the $ operator, or with the name of the field as a string in square brackets. For example, to get the number of residual degrees of freedom we type:
> model.aov$df.residual
[1] 66
or:
> model.aov['df.residual']
$df.residual
[1] 66
Similarly, if we want to extract the F statistic from the summary we can type:
> summary.aov[[1]]$F
[1] 34.70228       NA
However, this returns the whole 'F value' column of the ANOVA table (including the NA for the residuals row); to obtain just the F statistic we need to index the required element of the vector:
> summary.aov[[1]]$F[1]
[1] 34.70228
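The p-value can be pulled out the same way; because its column name contains special characters, it must be quoted:
> summary.aov[[1]]$"Pr(>F)"[1] # p-value for the spray effect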
3. Planned Contrasts
In general, the F-ratio and its p-value for an ANOVA tell us that the groups differ, but not which groups differ. In some experimental designs we might have particular expectations about which group differences will be meaningful.
First, we need to specify the contrasts attribute for our predictor variable (spray type). We can use one of the default contrast matrices, e.g. contr.helmert() is an orthogonal contrast matrix where each category (except the first) is compared to the mean effect of all preceding categories – a standard procedure for studies along the lines of control vs. drug A vs. drug B.
> contrasts(InsectSprays$spray)=contr.helmert(6)
> contrasts(InsectSprays$spray)
  [,1] [,2] [,3] [,4] [,5]
A   -1   -1   -1   -1   -1
B    1   -1   -1   -1   -1
C    0    2   -1   -1   -1
D    0    0    3   -1   -1
E    0    0    0    4   -1
F    0    0    0    0    5
For this example, though, this might not be the most appropriate set of contrasts; if we want to define our own contrasts, that is possible too.
Let's say that first we want to compare the first 3 conditions to the last 3, and then run further contrasts within those two subgroups of conditions.
First, we define the weights:
> con1=c(1,1,1,-1,-1,-1)
> con2=c(2,-1,-1,0,0,0)
> con3=c(0,1,-1,0,0,0)
> con4=c(0,0,0,2,-1,-1)
> con5=c(0,0,0,0,1,-1)
> our.contrasts=cbind(con1,con2,con3,con4,con5)
> our.contrasts
     con1 con2 con3 con4 con5
[1,]    1    2    0    0    0
[2,]    1   -1    1    0    0
[3,]    1   -1   -1    0    0
[4,]   -1    0    0    2    0
[5,]   -1    0    0   -1    1
[6,]   -1    0    0   -1   -1
Then we attach the contrasts attribute:
> contrasts(InsectSprays$spray) = our.contrasts
Finally, once we obtain our model with the aov() function, we use summary.lm() instead of the standard summary() to see the results of the contrasts.
> model.aov=aov(count~spray,data=InsectSprays)
> summary.lm(model.aov)
Call:
aov(formula = count ~ spray, data = InsectSprays)

Residuals:
   Min     1Q Median     3Q    Max 
-8.333 -1.958 -0.500  1.667  9.333 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.5000     0.4622  20.554  < 2e-16 ***
spraycon1     1.1389     0.4622   2.464 0.016350 *  
spraycon2     1.9306     0.4622   4.177 8.86e-05 ***
spraycon3     6.6250     0.8006   8.276 8.51e-12 ***
spraycon4    -1.7222     0.4622  -3.726 0.000405 ***
spraycon5    -6.5833     0.8006  -8.223 1.05e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.922 on 66 degrees of freedom
Multiple R-squared: 0.7244, Adjusted R-squared: 0.7036 
F-statistic: 34.7 on 5 and 66 DF, p-value: < 2.2e-16
This output shows the t statistic and its significance for each of our planned contrasts, from spraycon1 to spraycon5. In this way we can code orthogonal or non-orthogonal contrasts.
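To check whether a set of contrasts is orthogonal, take the pairwise dot products of the weight vectors; off-diagonal zeros in the cross-product matrix mean every pair is orthogonal (a quick base-R check):
> crossprod(our.contrasts) # t(our.contrasts) %*% our.contrasts; off-diagonal zeros => orthogonal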
4. Post Hoc Tests
A common alternative to planned contrasts is simply to carry out a set of t-tests between various groups – potentially whatever looks interesting in the data (hence the name, post hoc). It is then necessary to control the overall type I error rate, α, given the number of comparisons, c. The simplest approach is the Bonferroni correction: set the significance level at α/c. E.g. here, if comparing each group against the first group we would be doing 5 comparisons, so each t-test would need p < .01 to be significant, to ensure an overall α < 0.05.
More precise is Dunn-Sidak's correction: set the significance level at 1 − (1 − α)^(1/c).
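As a quick sketch of both corrected levels, together with base R's pairwise.t.test() helper, which applies a chosen correction to every pairwise comparison for you:
> alpha = 0.05; ncomp = 5
> alpha/ncomp               # Bonferroni-corrected significance level
[1] 0.01
> 1 - (1 - alpha)^(1/ncomp) # Dunn-Sidak-corrected level (slightly less strict)
[1] 0.01020622
> pairwise.t.test(count, spray, p.adjust.method="bonferroni")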
There are various other methods for multiple comparisons, each of which makes slightly different assumptions in attempting to control both type I and type II error (power), including: Newman-Keuls (liberal on Type I); Scheffé (strict on Type I but bad for Type II); Dunnett's (optimised for comparing all treatments to a single control); Games-Howell (doesn't assume equal variances); and Tukey's HSD (optimised for when all possible pairwise comparisons are being made). Try:
> TukeyHSD(model.aov) # note: this can also be applied to a model produced by lm, using TukeyHSD(aov(model)), but it does not handle the model without the intercept
5. Model checking plots
We can also look at some plots that R prepares automatically when fitting a model:
> plot(model)
You need to step through the plots by hitting the return key.
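Alternatively, you can lay out all four diagnostic plots in a single window by setting the plotting grid first:
> par(mfrow=c(2,2)) # 2 x 2 grid of panels
> plot(model)
> par(mfrow=c(1,1)) # back to one plot per window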
The first shows whether there is any pattern in the residuals, by plotting them against the fitted values from the model. Ideally this should show similar scatter for each condition. Here there is a worrying effect of larger residuals for larger fitted values. This is called 'heteroscedasticity', meaning that not only is the variance in the response not equal across groups, but the variance has some specific relationship with the size of the response. In fact you could see this in the original boxplots. It contradicts assumptions made when doing an ANOVA.
The second plot is a test for normality of the residuals; if they are not normal, the assumptions of ANOVA are potentially violated.
The third is like the first plot, but now specifically tests whether the residuals increase with the fitted values, which they do.
The fourth gives the residuals by factor levels, suggesting which factor levels have been best fitted.
The non-normality of the residuals suggests this is a situation where we might want to use a non-parametric alternative to the ANOVA:
> kruskal.test(count ~ spray, data=InsectSprays)
        Kruskal-Wallis rank sum test
data:  count by spray
Kruskal-Wallis chi-squared = 54.6913, df = 5, p-value = 1.511e-10
As with the Wilcoxon test (or Mann-Whitney test) for two samples, this test converts the response values to ranks, and tests whether the ranks are distributed equally across the conditions, as would be expected under the null hypothesis. Note that it does still assume similar distribution shapes in each group – an assumption called into question by our earlier test for homogeneity of variance.
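If the Kruskal-Wallis test is significant, a non-parametric follow-up for locating the differences is the pairwise Wilcoxon test (in base R; shown here with a Bonferroni correction as one option):
> pairwise.wilcox.test(count, spray, p.adjust.method="bonferroni")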
Beyond simple regression
Let's go back and look at the same types of plots for the regression analysis we did at the start:
> model.iris = lm(y~x) # recall this regressed petal length on sepal length for the iris data set
> plot(model.iris)
Note that on the first plot the data points with the highest residuals are labelled with numbers indicating their row in the data frame – these may be outliers. There's also a lot of structure in this plot, which we don't want to see. The second shows the residuals are roughly normally distributed. The fourth gives an idea of which data points are having the most effect (leverage) on the fit.
A linear fit is not entirely convincing, so you could try a polynomial regression, y = a + b1x + b2x²:
> xsquared = x^2
> model.poly = lm(y~x+xsquared)
> summary(model.poly)
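Equivalently, you can avoid the helper variable by using I() inside the formula, since ^ would otherwise be interpreted as formula syntax rather than arithmetic:
> model.poly = lm(y ~ x + I(x^2)) # same model, no helper variable needed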
Have a look at the fit:
> z = -17.4 + 5.4*x - 0.3*x^2 # coefficients taken from the summary output
> lines(x[order(x)], z[order(x)]) # order both by x so the curve is drawn left to right
We could compare the linear and quadratic models with:
> anova(model.iris, model.poly) # note: 'model' was re-used for the ANOVA above, so use model.iris here
This suggests the polynomial fit is significantly better.
We can also do non-linear models with the nls() function, but we won't discuss these further here.
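For illustration only – a minimal sketch, not a model chosen for these data – nls() takes a formula with named parameters plus starting values, e.g. an exponential curve:
> # hypothetical model y = a*exp(b*x); the start values are rough guesses
> model.nls = nls(y ~ a*exp(b*x), start=list(a=1, b=0.1))
> summary(model.nls)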
We can also do multiple regression, e.g. fit y = a + b1x1 + b2x2:
> x1 = iris$Sepal.Length
> x2 = iris$Petal.Width
> model2 = lm(y~x1+x2)
Or with an interaction, i.e. y = a + b1x1 + b2x2 + b3x1x2:
> model3 = lm(y~x1*x2)
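As with the polynomial fit earlier, you can test whether the interaction term improves the model by comparing the nested fits:
> anova(model2, model3) # a significant result favours the interaction model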