Statistical inference – the basics

The basic logic of orthodox statistical inference is as follows: you try to determine the probability that the structure of your data can be explained by a null model; if this seems improbable, you infer that something more interesting is going on. We will return in the last session to a discussion of whether this is really the best approach, and what alternatives there are.

A simple example:

At a reception in the Informatics forum you notice that of the first 10 people to take wine, 8 choose the red. Is this evidence of a significant preference for red wine in the Informatics population?

Rephrased: if there was no real preference (i.e. choice was random) what is the probability of observing at least 8 out of 10 people choosing the red?

Can describe this probability using the binomial distribution:

for n trials, each with probability p for outcome X, the probability of getting k occurrences of X =

In this case n=10, p=0.5 (if random) and k=8,9,10 (i.e. the chances for at least 8):

As probability > 0.05, more than 1 in 20, this is (by convention) not considered significantly different from chance

To do this in R, you can use the distribution function binom, where dbinom(x,n,p) gives the probability density:

> dbinom(8,10,0.5)+dbinom(9,10,0.5)+dbinom(10,10,0.5)

Or use the binom.test() function

> binom.test(2,10,p=0.5) #actually the default is p=0.5 so you could just use binom.test(2,10)

But note this gives different probability (often abbreviated as p-value, don’t confuse with the parameter p=proportion), equal to twice the statistic we just calculated. Why? Above, our null hypothesis was:

H0: there is no population preference for red or white wine

and we considered this against the alternative:

H1: there is a population preference for red wine

This is often called a 'one-tailed' hypothesis because it assumes the deviation from chance could only be occurring in one direction. However, in advance of seeing the data (and hence knowing that more people chose red) we might have no reason to expect the deviation from the null hypothesis to be towards a preference for red or white (we would have thought either preference to be equally noteworthy) and our new alternative hypothesis would be:

H1: there is a population preference for one colour of wine over the other.

Therefore a more neutral test would ask ‘what is the probability of observing a split at least as large as 8 out of 10 in favour of one or other colour of wine?’ This means the cases we added up should also have included k=0,1,2; that is, the cases where red wine was unpopular and the preference for white wine was at least as strong as the preference we observed for red. By symmetry these are equal to k=8,9,10, hence the correct test gives twice the p-value. Try it out explicitly if you want to check. You can also explicitly do a one-tailed binomial test;

> binom.test(8,10,0.5,"greater")

Note that a possible objection to this entire procedure is that ‘the first 10 people who took wine’ is not a random sample from the population of Informatics – we can at best conclude something about the people who are generally first in the queue for wine…

Exercise: if the proportion of left-handed people in the population is 10%, and you observe in your group of subjects who experience synaesthesia that 4 out of 10 are left-handed, should you conclude left handedness is more common than usual in synaesthetes?

Note the same method can be used on numerical data to do a sign test, a quick and simple test which makes minimal assumptions about the population distribution (unlike some of the tests below). E.g.:

· Twelve (N = 12) rats are weighed before and after being subjected to a regimen of forced exercise.

· Each weight change (g) is the weight after exercise minus the weight before:

1.7, 0.7, -0.4, -1.8, 0.2, 0.9, -1.2, -0.9, -1.8, -1.4, -1.8, -2.0

> rats = c(1.7, 0.7, -0.4, -1.8, 0.2, 0.9, -1.2, -0.9, -1.8, -1.4, -1.8, -2.0)

Test whether the proportion that has decreased weight is higher than expected by chance:

> binom.test(sum(rats<0),length(rats))

But note that this test has thrown away a lot of information in the data, i.e. the actual size of the weight gains and losses, and as a consequence is less powerful at detecting real differences. E.g. if the data looked like this:

1.7, 0.7, -40.0, -18.8, 0.2, 0.9, -16.2, -10.9, -15.8, -19.4, -21.8, -25.0

You would have the same conclusion from the sign test even though the data appears to show much stronger evidence of significant weight loss. We will come back to the issue of power later.

Frequency testing

Now, let's deal with the situation when we have categorical data to analyse, in this simple example let's assume we collected information about squirrels in the UK and in Poland.

Our raw data is in http://homepages.inf.ed.ac.uk/bwebb/statistics/squirrels.csv. First, download it to your working directory, then, let's load it into R and see the first few rows:

> membership_data=read.csv(file='squirrels.csv')

> attach(membership_data)

> head(membership_data)

Then we can count representatives of 4 combinations of country x fur colour and present the counts in a contingency table:

> contingency=table(country,fur)

> contingency

We want to test whether red squirrels are underrepresented in the UK compared to Poland, i.e. how likely are the frequencies in the contingency table to be due to chance.

First solution (which is appropriate for 2x2 contingency tables and relatively small samples) is to use the Fisher exact test. This will calculate exact probability of these frequencies occurring given the marginals, i.e. total number of red squirrels in the sample, total number of British squirrels in the sample and total number of squirrels in the sample.

We can do it manually, by reading the value directly from the hypergeometric distribution

> phyper(contingency['UK','red'],sum(contingency[,'red']),sum(contingency[,'grey']),sum(contingency['UK',]))
[1] 0.1761937

Or we can use the built-in function fisher.test, which takes the entire contingency table as an argument and does the same calculation under the hood.

> fisher.test(contingency, alternative='less')

Fisher's Exact Test for Count Data

data: contingency
p-value = 0.1762
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
0.000000 1.522656
sample estimates:
odds ratio
0.4743571

The alternative is chi square test, which approximates probability with chi square distribution and is a preferred method for bigger samples, and can be used bigger contingency tables.

Chi square statistic is calculated from the equation

where O stands for observed value and E stands for expected value in each cell (based on marginal counts) and we sum over all cells.

Then, for any value of chi square statistic corresponding probability can be read from a curve for a given number of degrees of freedom. These curves represent 1 – F(x)

where F(x) is a cumulative chi square distribution function.

We could derive the expected values, perform the summation and use the test statistic value in a cumulative chi-square distribution function. However, we can let R do the work:

> chisq.test(contingency)

Pearson's Chi-squared test with Yates' continuity correction

data: contingency
X-squared = 0.8668, df = 1, p-value = 0.3518

Interestingly, on default, Yates' continuity correction is applied, it means that each O-E difference in the test statistic summation is decreased by 0.5, i.e. the formula becomes

Hence the test statistic becomes smaller and the resulting p-value is higher than if we run the test without the correction:

> chisq.test(contingency, correct=FALSE)

Pearson's Chi-squared test

data: contingency
X-squared = 1.5198, df = 1, p-value = 0.2177

Parametric testing

Note that the basic idea above was that we compared the data to an expected distribution, such the binomial distribution with a parameter given by the expected proportion (e.g. p=0.5 under equal chance assumption). The idea can obviously be generalised.

A simple example:

If the height of men in Britain is normally distributed with mean μ=175cm and standard deviation σ=8cm, is a height of 185cm unusual?

We could ask what is the probability of someone randomly chosen from this population being exactly 185cm tall, using the probability density dnorm(x,mean,s.d.) - recall we used ‘rnorm’ in the previous session to get a random variable from a normal distribution.

> dnorm(185,175,8)

But as before, the question is really about the probability of someone being at least 185cm tall, or taller. So we want to know the probability of falling in the top tail of the distribution, which we can calculate as 1 - the cumulative probability of the distribution up to 185, using the function pnorm:

> 1- pnorm(185,175,8)

An alternative way of considering a height of 185 compared to a mean of 175 is to ask about the probability of deviating 10cm from the mean. This is the two-tailed rather than one-tailed question, as discussed above. To answer it you need to add in the other end of the distribution, i.e. for heights below 165:

> 1- pnorm(185,175,8)+pnorm(165,175,8)

(Or more simply, due to symmetry of the normal distribution, just double your previous p-value). More typically, we compare the value of some statistic to the standardized normal distribution μ=1, σ=0, by first doing a Z-transform on the value (expressing each score in terms of how much it deviates from the mean, in multiples of sd.)

> z=(185-175)/8

And compare it to the standardised normal distribution (the default for the norm functions).

> 1-pnorm(z)

Above we mentioned that by convention, something that has a probability <.05 is considered significantly different from expectation. You should keep in mind that this means (a rather unimpressive) 1 in 20 – e.g., for heights, you are going to encounter plenty of men in the UK who are ‘significantly’ different from average, but it is unlikely someone’s unusual height would lead you to reject the assumption that they come from the UK. So when scientific reports say something is significant, you should not be too convinced. We will come back to this point later. Meanwhile, it is handy to know that the z-value which cuts off 2.5% at each end of the standardized normal distribution is +/-1.96. This is often called the ‘critical value’ and can be obtained using:

>qnorm(.975) # gives the 97.5 percentile for the normal distribution.

You can check:

> pnorm(-1.96)+1-pnorm(1.96)

So in fact, once you calculate a z-score, you can immediately see whether or not it falls above or below the ‘critical value’ z=+/-1.96, and thus whether it is significant at p <.05. E.g. for the height 185, the z-score was 1.25, which is<1.96 so p>.05. We can also calculate what deviation from average height would be considered ‘significant’, i.e. one that is 1.96σ from the mean, or in this case 175+/-1.96*8, i.e. taller than 190.6cm or shorter than 159.3cm. Another way to describe this is to say that 95% of the UK male population have a height that falls in the range 159.3 to 190.6 cm.

Single sample tests

Let’s go back to our previous simple example of data:

• Twelve (N = 12) rats are weighed before and after being subjected to a regimen of forced exercise.

• Each weight change (g) is the weight after exercise minus the weight before:

1.7, 0.7, -0.4, -1.8, 0.2, 0.9, -1.2, -0.9, -1.8, -1.4, -1.8, -2.0

What we want to do is use this sample to estimate the weight change that might be expected in the population (an interesting question to also consider here is ‘what is the population’? Strictly speaking we should only generalise to the rats from which these were a random sample, e.g. the particular strain of rats in this particular lab, but the motivation for the experiment is likely to be an interest is generalising from rats to humans), and to say whether this is significantly different from zero. A possible estimate for the average weight change is the sample mean:

> mean(rats)

-0.65

(Though you might want to check the distribution and decide if the mean is really representative. For the moment we will proceed as if it is). We want to compare the sample mean we have obtained to the expected distribution of sample means under the null hypothesis that the population mean is zero. If we took repeated independent random samples of size n from a population of mean μ and standard deviation σ:

[1] The mean of the population of sample means is equal to the mean of the parent population from which the population samples were drawn.

[2] The standard deviation of the population of sample means is equal to the standard deviation of the parent population divided by the square root of the sample size: σ/√n. This is sometimes called the standard error of the mean, or s.e.

[3] The distribution of sample means will increasingly approximate a normal distribution as the sample size increases, whatever the underlying population distribution. This is the Central Limit Theorem.

In this case we have a (null) hypothesis for the population mean, i.e. μ =0, but we don’t know σ. Let’s imagine for a moment that we do happen to have this, say σ=0.8. Then for our sample size, n=12, the standard error of the mean is 0.8/√12=0.23. We could then calculate the probability of the observed mean just as we did above for heights. We are comparing the sample mean =-0.65 to a normal distribution (of sample means) with mean 0 and s.d. = 0.23

> pnorm(-0.65,0,0.23)

Note again this is one-tailed – it assumes we’re only interested in the probability of values of the mean that are at least 0.65 below zero (weight decreases) but not equally large values above zero (weight increases). We could double the value to get the two-tailed result:

> 2*pnorm(-0.65,0,0.23)

This seems pretty improbable, so we might want to ‘reject’ our null hypothesis, and conclude our mean from the sample is a better estimate of the true value of the population mean. In fact, we could put some probability bounds on this estimate using the critical z-value 1.96 as before, by calculating that 95% of the distribution of sample means will be within +/-1.96*s.e. of the observed mean.

Thus in the case of rat weights we have a lower and upper bound of the ‘likely’ value of the mean:

> -0.65-1.96*0.23

> -0.65+1.96*0.23

This is called the 95% confidence interval (C.I.). Strictly speaking it says that if we kept taking samples of size n from the population with μ = -0.65, σ=0.8, 95% of the means would fall between -1.1g and -0.19g. It is quite useful as a summary statistic (which can be easily plotted on a graph) because it essentially combines an indication of the size of the observed effect with a significance test: if the value of our null hypothesis (here that μ=0) is not contained within this interval then we know the observed mean is significantly different from 0, p<0.05. By extrapolation we could calculate a 99% C.I., or any other percentage we think appropriate, but as the percentage goes up (so our ‘confidence’ that we’ve bracketed the true mean increases) the interval will get wider. Hopefully obvious is that the C.I. will get smaller as our sample size n gets larger – the accuracy of our estimate will improve as we collect more data.

The t-test

By now you should be worried by the fact that we don’t actually know the population standard deviation σ, we just used a number I made up. An obvious thing to do is instead to estimate it from the sample standard deviation:

and divide by √n to get the standard error of the mean:

> se.rats=sd(rats)/sqrt(length(rats))

We could now just plug that into our method above to get a p-value and C.I. However, because we are using an estimate, we need to make an adjustment, one that depends on the size of our sample: we would expect the estimate to become closer and closer to the real population standard deviation as we increase our sample size. Without going into detail, instead of using a standard normal distribution Z, we use the T-distribution:

Where V is the chi-squared distribution with υ degrees of freedom (d.f.). The degrees of freedom is the sample size minus the number of parameters estimated from the data, in this case having estimated the mean, d.f.=n-1. The effect of this adjustment is to increase the size of the tails of the Z distribution, gradually approaching the normal distribution as υ increases (a general rule of thumb is that for d.f.>30 there is no practical difference in the resulting statistical calculations).

Using R, we have the t() function which works like the binom() and norm(), e.g. pt(x,df) is the cumulative probability. However note that this is ‘standardised’ i.e. to get the probability of a particular value we need to give it as input a t-score:

And (unlike the Z distribution) it needs the d.f. parameter. In our example so far, we want to find the t-value of the sample mean, compared to a mean of zero, which is the sample mean divided by the standard error of the mean.

> t.rats = mean(rats)/se.rats

> df = length(rats)-1

> 2*pt(t.rats,df)

And in a similar way one can calculate a confidence interval, using qt() to obtain the critical value of the t distribution.

> qt(.975,11)

But to save time, you might guess that R already has a function to do a t-test based directly on the data vector, which also produces a 95% C.I.:

>t.test(rats) # you can specify a value other than zero for the population mean to test against with t.test(x,mu=y)

As a matter of interest you might like to compare the p-value result here to that we would estimate from the normal distribution (using pnorm), but now using the estimated s.d. from the sample (=1.25) compared to the hypothetical s.d. of 0.8 that we used above. The result should be an underestimate of the p-value obtained from the t-test.

Alternatives to the t-test

Above, we used the Central Limit Theorem to justify the assumption that our sample mean comes from a normal distribution (or adjusted t-distribution). But the CLT holds in the limit; if we have a small sample, we can’t make this assumption. However, the distribution of sample means from a normally distributed population will always be normal; in this case we are safe. Hence, it is often stated that we can only apply a t-test if our data comes from a normal distribution. This is probably better stated as:

If the data do not appear normally distributed (see last session) and the sample size is small, results from a t-test may be misleading, and an alternative ‘non-parametric’ test should be considered.

On the other hand, for a small sample, it may be hard to judge whether the distribution is normal. For rat weights in particular, we might argue that each measured weight difference is already the sum of a wide number of biochemical random factors, and hence we have a prior expectation (via the CLT) that they should be normally distributed.

We’ve already looked at one non-parametric test above, the sign test, which compared the values in our rat data to 0 by simply counting how many data points fell each side of zero.

A second alternative that uses a bit more information from the data is the Wilcoxon test. The basic idea (common to many non-parametric tests) is to use the ranks of the data rather than the numerical values themselves. As a consequence it is also appropriate to use this test when the data themselves are, in fact, ranks, e.g. as may typically result from survey data. The procedure is actually simple enough to do by hand, see e.g. here: but can also be done in R:

> wilcox.test(rats,mu=0)

However note this test still assumes the data is symmetric (around the median) so may not be suitable for certain kinds of departure from a normal distribution.

A third alternative is to use bootstrapping. The simple idea here is that our best estimate of the population distribution is that it is the same as the sample distribution. So we use that sample data to generate a distribution of sample means, by taking repeated samples with replacement from our original data and calculating the mean.

> boot.rat =numeric(1000)

> for (i in 1:1000) boot.rat[i] = mean(sample(rats,replace=T))

> hist(boot.rat) # have a look at this distribution

> sd(boot.rat) # compare this to the standard error of the mean calculated above

We can then calculate a 95% confidence interval directly from this distribution, from the .025 and .975 percentiles

> boot.rat=sort(boot.rat)

> boot.rat[c(25,975)]

Note that we also have an estimate of the mean:

> boot.rat[500]

And that the bootstrapped C.I. is potentially assymmetric around the mean, unlike the standard forms of C.I. calculated above.

Tests with two samples

A common experimental design will have a dependent variable, or response, measured under two (or more) conditions, for example, with one group getting treatment and the other acting as a control. So we are interested in whether there is a significant difference between the response of the groups.

Crucially there can be two different experimental designs with two groups: paired and unpaired. E.g. paired design occurs when the same participant is tested under both conditions (sometimes called ‘repeated measures’); but it includes any design where there is a logical matching between each score in one group with a corresponding score in the other group. Unpaired designs are sometimes called independent designs.

The easy way to avoid the mistake of analysing a paired design as an unpaired design, or vice versa, is to realise that the two samples of a paired design can be made into one sample by taking the difference of each pair, and then testing if the mean difference is equal to zero (using the single sample methods discussed above). The only reason not to do this would be if your data values are not interval data, e.g. represent ranks, in which case taking differences is misleading – then you should do a non-parametric test anyway (e.g. Wilcox.test() with the flag paired=T).

Variance of 2 samples

For an unpaired, or independent design, most statistical tests to compare the 2 means assume equal variance within each sample. We can test this assumption by comparing the ratio of the variances to the F distribution, which is defined as the ratio of two independent Chi-square distributions divided by their degrees of freedom. The Chi-square with k d.f. is the distribution of the sum of squares of k independent random normally distributed variables. N.B. the F-ratio will become key when we get to ANOVA tests.

Fisher’s F Test

· Divide the larger variance by the smaller one - variance ratio

· Determine the degrees of freedom, which will be [(n1-1),(n2-1)] where n1 and n2 are the sample sizes for the samples with larger and smaller variances, respectively

· Find p-value for F statistic using pf() function

e.g.

> f.test.data =read.table("garden.txt",header=T) ## find f.test.data

> attach(f.test.data) ## attach f.test.data

> names(f.test.data) ## list the names of f.test.data

[1] "gardenB" "gardenC"

> var(gardenB) ## Compute variance for gardenB

[1] 1.333333

> var(gardenC) ## Compute variance for garden C

[1] 14.22222

Variance gardenC is much bigger than var gardenB, so:

> F.ratio=var(gardenC)/var(gardenB)

> F.ratio

[1] 10.66667

1 or 2-tailed? 2 in this case – no expectation as to which would have higher variance

> 2*(1-pf(F.ratio,length(gardenC)-1,length(gardenA)-1))

[1] 0.001624199

This is a small probability, so the hypothesis that the variances are equal is rejected. There’s a quicker version:

> var.test(gardenB,gardenC)

T-test

The test statistic is the number of standard errors by which the 2 sample means are separated:

For independent variables, variance of the difference of means is sum of separate variances. Not true if variables are correlated (e.g. unlikely to be true in paired design, hence mistakenly analysing a paired design this way will overestimate the s.e., and underestimate t).

Carrying out the t-test

We’ll use the garden data in this example – however note (relative to last week’s discussion) this means that we are assuming the ten measurements for each garden are independent (e.g. they were not taken on the same days). We are interested in whether the ozone levels differ between garden A and garden B.

1. Set out hypotheses

H₀: There is no difference between the two means

H₁: There is a significant difference between the 2 means

2. Decide whether one or two-tailed, decide significance value, and establish degrees of freedom in each sample; use these to find a critical value for t

e.g. 2-tailed test; p=.05; for garden data, d.f. in each = n-1=9, total df = 18

> qt(.975, 18)

[1] 2.100922

So in this case the test statistic (t) must exceed 2.1 in order to reject H₀ – that the 2 means are not significantly different at the .05 level of significance.

3. Attach data frame (if you haven’t already)

> t.test.data=read.table("gardens.txt",header=T)

> attach(t.test.data)

> names(t.test.data)

[1] "gardenA" "gardenB"

4. Visualise the data

> ozone=c(gardenA,gardenB)

> label=factor(c(rep("A",10),rep("B",10)))

> boxplot(ozone~label,notch=T,xlab="Garden",ylab="Ozone")

Variability appears similar in both samples – range (whiskers), inter-quartile range (IQR, size of boxes) are fairly similar. The notches show +/-1.58*IQR/sqrt(n) which is an approximate 95% confidence interval for the median. If the notches don’t overlap, the medians are probably significantly different.

5. Long-hand t-test

a. Calculate variability for each sample (if they look very different, should check equality with F test, as above, and should not proceed if this is significant)

> varA=var(gardenA)

> varB=var(gardenB)

6. Calculate the value of t:

Student’s t is difference between 2 means divided by the square root of the sum of (the variance of each sample divided by its sample size)

> (mean(gardenA)-mean(gardenB))/sqrt(varA/10+varB/10)

[1] -3.872983

7. Interpret the value of t.

a. Ignore the minus sign – look at the absolute value of the test statistic

- A value of 3.87 exceeds the critical value of 2.10(above)

- Therefore we can reject the null hypothesis in favour of the alternative hypothesis: there is a significant difference between the 2 means

8. Calculate an exact p-value – the probability that this, or more extreme, data / differences would be observed under the null hypothesis

> 2*pt(-3.872983,18)

[1] 0.00111454

P < .001; it is highly unlikely these differences occurred by chance

Or – use the built-in t-test function:

> t.test(gardenA,gardenB)

Welch Two Sample t-test

data: gardenA and gardenB

t = -3.873, df = 18, p-value = 0.001115

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-3.0849115 -0.9150885

sample estimates:

mean of x mean of y

3 5

Garden B was found to have a significantly higher mean ozone concentration than garden A (means: 5.0, 3.0 p.p.h.m, respectively). t(18)=-3.873; p<.001

Note that the “Welch” test applies a correction for potentially unequal variance.

Power, effect size and sample size

So far we’ve mentioned several times that by convention, a probability <.05 for observing data under the null hypothesis is taken as significant evidence against the null hypothesis. More formally this is called the ‘Type I error rate’ or ‘α’ level, where a ‘Type I’ error would be to reject the null hypothesis when it is, in fact, true. ‘Type 2’ error would be to fail to reject the null hypothesis when it is, in fact, false. This is called the ‘β’ level, and often discussed as power=1-β. I.e. the power of a statistical test is its ability to correctly detect a false null hypothesis. In general there will be a trade-off between α and β, which depends on the size of the sample, the size of the effect (known as δ), and the variance in the data.

What this means is that if you know 4 of these 5 variables, you can determine the 5^th. E.g. after doing a test, you can calculate what its power was (see here for a discussion of how (and how not) to interpret this). If you have failed to reject the null hypothesis, a low power might suggest that you simple failed to detect a difference that does, in fact, exist – because the effect is small, or the variance is large, or you did not collect enough data, or all of these. Perhaps more usefully, before doing an experiment, you can estimate the sample size required to observe, at a given α and β level, an effect of a certain size, given an estimate of the likely variance. For example, if you expect the difference between the means to be at least 2, and the variance around 10, and want to ensure you have power=0.8 (again this is the conventional figure) using a conventional α=0.05, then you can use:

> power.t.test(type="two.sample",delta=2,sd=sqrt(10),power=0.8, sig.level=0.05)

-to find out that you need n=40. Note it is much better to do this before running your experiment: to make sure you collect enough data to see the effect if it exists; to not waste time collecting a lot more data than necessary to see the effect exists; or to redesign your experiment (to reduce variance or increase effect size) so that you don’t have to collect so much data. However it does require an estimate of the effect size and variance, before you run the experiment. Sometimes you may be able to estimate these from previous experiments in the literature. Sometimes a pilot experiment may help (using bootstrapping on pilot data to estimate the variance is a useful technique). What is really not a good idea is to start an experiment not knowing how much data you should collect, running statistical tests as the data comes in, and then stopping when the result is significant: see here for a quick demonstration and discussion of the misleading results this can produce.

Note also that there may be good reason for varying from the conventions for α and β, depending on the potential consequences of making a Type I or Type II mistake.

Effect size is a measure of the strength of the difference expected, or retrospectively, observed. An absolute effect size (e.g. difference between the means) might sometimes determine the practical (rather than statistical) significance of your result. E.g. if you were looking at the effects of various diets, and found one diet with radically reduced calories produced a 10 gram weight loss over a year compared to a control diet, that might be robustly statistically significant, but not actually of much practical or scientific significance. A standardised effect size measure, such as Cohen’s d, compares the difference in the means to the variance in the data, and is sometimes used to loosely classify an observed effect as ‘small’, ‘medium’ or ‘large’.

Where s is the pooled standard deviation of the samples (or if we have assumed these are equal, the s.d. for either). So for example in our garden data above, the effect size is

> d = (mean(gardenA)-mean(gardenB))/sqrt(varA)

which comes out as -1.7, meaning the means differ by 1.7 standard deviations. Cohen suggests that d=0.2 be considered a 'small' effect size (an effect that probably means something is really happening in the world but it takes careful study to see it), 0.5 represents a 'medium' effect size and 0.8 a 'large' effect size (an effect big or consistent enough that you could probably see it without statistics).

Although this is still somewhat arbitrary, note that this approach is better justified than trying to conclude you have a ‘strong’ effect from the size of the p-value – the logic of the orthodox statistical test implies that however small the difference between the means, a large enough sample size will result in rejecting the null hypothesis for any level of α you choose. We will come back to this point in session 5.

To get a better feel for the trade-off between power, sample size and effect size try this demonstration:

First we'll set up vectors with a range of effect sizes (the difference between the means relative to the variance in the data) and power values

> d = seq(0.25,1.5,.05)

> nr = length(d)

> p = seq(.4,.9,.1)

> np = length(p)

Now we can compute required sample size for all values of d and power

> samsize = array(numeric(nr*np), dim=c(nr,np))

> for (i in 1:np){

for (j in 1:nr){

result = power.t.test(type="two.sample",delta=d[j],

sd=1,power=p[i], sig.level=0.05)

samsize[j,i] = ceiling(result$n)

}

Finally, we set up a graph to visualise the results
> xrange = range(d)
> yrange = round(range(samplesize))
> colors = rainbow(length(p))
> plot(xrange, yrange, type="n", xlab="Difference between means delta", ylab="Sample Size n" )

For each value of power we add a curve in a different colour
> for (i in 1:np){
lines(d, samplesize[,i], type="l", lwd=2, col=colors[i])
}

Also, we can add a legend
> legend("topright", title="Power", as.character(p), fill=colors)

Now you can see how required sample size depends on the difference between means which we expect and the power value which we're aiming to achieve.

NOTE ABOUT INSTALLING PACKAGES

There is a package which extends power analysis to many other statistical tests, however, it is not supplied in the base installation of R. (This is a common situation with R)

There is a central repository of R packages (CRAN) which allows us to install and load packages in a standardised way.

Basic command for that is install.packages('package_name') to install it, and library(package_name) to load it, dependencies are resolved automatically.

However, if you're working on a machine where you don't have admin rights (e.g. DICE), you will need to install packages to a local folder, in order to set it up type:

> system('mkdir ${HOME}/localRlibraries')
> system('export R_LIBS=${HOME}/localRlibraries')

These two UNIX commands executed through R will create a folder for local R libraries and then point an environmental variable to this folder.

Then we can install a package and specify the repository to use (this only needs to be done once, subsequently you can omit repos argument)

> install.packages('pwr', repos='http://star-www.st-andrews.ac.uk/cran/')

Finally, we load packages with library command

> library(pwr)

You can explore contents of this package if you have time.