Why statistics?

There are several inter-related reasons for using statistics. Basically, we want to interpret data: to present or process it in a way that allows us to see patterns or structures; to draw conclusions about a general property of the world from a limited sample; to understand the causes of variation in our data and which of these are due to something scientifically interesting; to test hypotheses. You can probably think of more.

In a sense, all of these can be interpreted as problems of model fitting, where the model should have a simpler structure than the data itself (as simple as possible, but no simpler). We want to estimate the parameters (to conclude something about the general properties) as well as evaluate the fit obtained (to test how well the hypothesis embodied in the model explains the data). We'll come back to this point in several different sections of the workshop.

As a researcher you may be faced with two different situations: you are given data and have to try to find interesting structure in it; or you have a hypothesis and design an experiment to gather data to test it. It is important to avoid letting the second situation default to the first, i.e., running an experiment and getting some data without first thinking through how you will be able to use that data to test your hypothesis. Indeed, the ideal experiment is one where the answer from the data is so clear you don't really need statistics. In other cases, a poor experimental design might violate the assumptions needed to do any valid statistical test, whereas an improved experimental design may greatly simplify the statistical methods needed, or increase their power to provide a clear conclusion.

Example for discussion in workshop: You have measured time to complete the task and number of errors for 20 subjects and have the following results:
Times   Errors
 31       6
 30       1
 28       0
 30       0
 26       2
 28       1
 30       5
 30       6
 29       1
 31       5
 33       5
 10       5
 29       1
 30       0
 32       6
 30       0
 29       1
 31       5
 31       6
 32       4

What are you going to do next?

The best first step is to 'eyeball' the distributions, e.g. using a histogram. (Don't make the mistake of plotting the observations against the order of the observations - unless you know this order is actually meaningful, e.g. in a time series.) This will alert you to anything odd, such as improbable outliers, or bimodal or highly skewed distributions. In particular, you should do this before calculating measures of central tendency (such as the mean or median), which may be quite unrepresentative of some distributions.
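As a minimal sketch of this first step (assuming Python with matplotlib is available, and with the example data above typed in by hand), you might plot one histogram per variable and then compare mean and median:

    import statistics
    import matplotlib.pyplot as plt

    # Example data from the table above.
    times = [31, 30, 28, 30, 26, 28, 30, 30, 29, 31,
             33, 10, 29, 30, 32, 30, 29, 31, 31, 32]
    errors = [6, 1, 0, 0, 2, 1, 5, 6, 1, 5,
              5, 5, 1, 0, 6, 0, 1, 5, 6, 4]

    # Eyeball the distributions first: one histogram per variable.
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(times, bins=range(8, 36, 2))
    ax1.set_xlabel("Time to complete task")
    ax2.hist(errors, bins=range(0, 8))
    ax2.set_xlabel("Number of errors")
    plt.tight_layout()
    plt.show()

    # The mean of the times is pulled down by the improbable outlier (10),
    # while the median is largely unaffected.
    print(statistics.mean(times), statistics.median(times))    # 29 vs 30.0
    print(statistics.mean(errors), statistics.median(errors))

With this data the times histogram makes the single outlier (the value 10) obvious, and the errors look roughly bimodal rather than bell-shaped - exactly the kind of structure a mean alone would hide.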

What types of data are there?

Qualitative or Nominal: can be categorical (each data point falls into one of a clear set of classes), but could be more complex. The essential point is that it has no inherent ordering. But sometimes numbers are used as labels. This does not make the data numeric!

Ordinal: data represents ranks. The important point is that taking differences between ranks is potentially misleading, e.g. the gap from first to second could be much larger or smaller than the gap from second to third. Hence be wary of procedures, like calculating the mean, that assume distances between numbers represent distances between data.
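As a toy illustration (hypothetical finishing times, not taken from the example data above), equal rank differences can hide very unequal underlying differences:

    # Hypothetical race finishing times (seconds) and their ranks.
    finish_times = {"A": 50, "B": 52, "C": 90}
    ranks = {"A": 1, "B": 2, "C": 3}

    # The rank differences 1 -> 2 and 2 -> 3 are both 1,
    # but the underlying gaps are 2 s versus 38 s.
    print(ranks["B"] - ranks["A"], ranks["C"] - ranks["B"])    # 1 1
    print(finish_times["B"] - finish_times["A"],
          finish_times["C"] - finish_times["B"])               # 2 38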

Quantitative: Can be discrete or continuous. Can be bounded or unbounded (certain types of data, such as counts, cannot go below zero). Can be 'interval' or 'ratio': for interval data, differences are meaningful but ratios are not - the classic example is temperature on the Celsius or Fahrenheit scale (a "doubling" of the temperature is ambiguous as the zero point is arbitrary).
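To make the temperature example concrete, here is a small sketch showing that 'doubling' a Celsius value does not double the same temperature expressed in Fahrenheit, because each scale places its zero arbitrarily:

    def c_to_f(c):
        # Convert degrees Celsius to degrees Fahrenheit.
        return c * 9 / 5 + 32

    # 'Doubling' 15 degrees C to 30 degrees C...
    print(c_to_f(15))  # 59.0
    print(c_to_f(30))  # 86.0 - not 118.0, so the ratio is not preserved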

Statistics texts often refer to four 'levels of measurement': nominal, ordinal, interval and ratio. There is some debate around this (e.g. should labelling or ordering be called 'measurement'?) but they are still useful concepts to keep in mind.