**Brief explanation of the use of the F-ratio to test the significance of model fit.**

**Start with regression**:
linear model of $y \sim x$: $y = a + bx$

For any specific data point $(x_i, y_i)$, the model predicts $\hat{y}_i = a + b x_i$.

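The difference of any data point $y_i$ from the overall mean $\bar{y}$ can be divided into two parts, the residual from the model prediction $\hat{y}_i$ and the deviation of that prediction from the mean:

$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$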

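Taking the sum of squares over all the data, it can be shown that:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$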

In other words, the total sum of squares (SST) equals the residual sum of squares (SSR) plus the sum of squares of the model (SSM):[1]

$$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSM}$$
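As a quick numerical check of this decomposition, here is a minimal sketch in plain Python. The five data points are made up purely for illustration, and the variable names (`sst`, `ssr`, `ssm`) are just this example's choices.

```python
# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Ordinary least-squares estimates of the slope b and intercept a.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx
a = y_bar - b * x_bar

# Model predictions for each data point.
y_hat = [a + b * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                # total sum of squares
ssr = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # residual sum of squares
ssm = sum((yh - y_bar) ** 2 for yh in y_hat)           # model sum of squares

print(f"SST={sst:.3f}  SSR={ssr:.3f}  SSM={ssm:.3f}")
# prints: SST=6.000  SSR=2.400  SSM=3.600
```

The identity holds (up to rounding) only when `a` and `b` are the least-squares estimates; for an arbitrary line the two parts would not generally add up to the total.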

The squared correlation coefficient is $r^2 = \mathrm{SSM}/\mathrm{SST}$, that is, the fraction of the total variance explained by the model. But this does not directly take into account the number of data points $n$ used to estimate the parameters.

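In the previous session we used the F statistic to test the hypothesis of equal variance. In general, the F statistic is a ratio of two sums of squares, each scaled by its degrees of freedom. Each sum of squares follows a scaled chi-square distribution (exactly if the data are independent and normally distributed). To test the correlation we use:

$$F = \frac{\mathrm{SSM}/1}{\mathrm{SSR}/(n-2)}$$

which is compared with the F distribution on 1 and $n-2$ degrees of freedom.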

Under H0: no correlation, $y = a$, our best prediction for every $y_i$ is the mean, $\hat{y}_i = \bar{y}$, so SSM goes to zero.
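The regression F statistic can be sketched in a few lines of plain Python. This assumes the standard degrees of freedom for simple linear regression (1 for the model, $n-2$ for the residuals); the function name `regression_f` and the toy data are hypothetical, chosen only for this illustration.

```python
def regression_f(xs, ys):
    """Return (F, r_squared) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Least-squares slope and intercept.
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    # Residual and model sums of squares.
    ssr = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ssm = sum((a + b * x - y_bar) ** 2 for x in xs)
    # Each sum of squares is scaled by its degrees of freedom.
    f = (ssm / 1) / (ssr / (n - 2))
    r2 = ssm / (ssm + ssr)
    return f, r2


# Toy data, invented for illustration only.
f_stat, r2 = regression_f([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
print(f"F = {f_stat:.2f}, r^2 = {r2:.2f}")
# prints: F = 4.50, r^2 = 0.60
```

Unlike $r^2$ alone, F depends on $n$: the same $r^2$ gives a larger F when it is based on more data points, since $F = r^2 (n-2) / (1 - r^2)$. If SciPy is available, `scipy.stats.f.sf(f_stat, 1, 3)` should give the corresponding p-value for this five-point example.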

**Generalise to any pair of models**

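If we are trying to fit model 1 (with $p_1$ parameters) and model 2 (with $p_2$ parameters) to $n$ data points, where model 2 has more parameters, $p_2 > p_1$, and model 1 is ‘nested’ in model 2 (i.e. any fit to data by model 1 can be obtained for some parameter setting of model 2, so model 2 can always fit the data at least as closely as model 1), then we can compare them using the F-statistic:

$$F = \frac{(\mathrm{SSR}_1 - \mathrm{SSR}_2)/(p_2 - p_1)}{\mathrm{SSR}_2/(n - p_2)}$$

where $\mathrm{SSR}_1$ and $\mathrm{SSR}_2$ are the residual sums of squares of the two fitted models.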

Think of how
this applies to the H0 and H1 models for the simple regression above.
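One way to see the connection is a small sketch of the nested-model comparison, assuming the usual form of the statistic: the drop in residual sum of squares per extra parameter, divided by the residual variance of the larger model. The function name `nested_f` and the numbers passed to it are hypothetical, made up for illustration.

```python
def nested_f(ssr1, p1, ssr2, p2, n):
    """F statistic comparing nested models.

    ssr1, p1: residual sum of squares and parameter count of the smaller model.
    ssr2, p2: the same for the larger model, which contains the smaller one.
    n: number of data points.
    """
    if p2 <= p1:
        raise ValueError("model 2 must have more parameters than model 1")
    if n <= p2:
        raise ValueError("need more data points than parameters")
    return ((ssr1 - ssr2) / (p2 - p1)) / (ssr2 / (n - p2))


# Simple regression as a special case (illustrative numbers):
# H0 is y = a, one parameter, and its residual sum of squares is SST = 6.0.
# H1 is y = a + b*x, two parameters, with residual sum of squares SSR = 2.4.
f_stat = nested_f(ssr1=6.0, p1=1, ssr2=2.4, p2=2, n=5)
print(f"F = {f_stat:.2f}")
# prints: F = 4.50
```

Because the H0 model predicts the mean everywhere, its residual sum of squares is SST, so the numerator becomes SST − SSR = SSM and the general statistic reduces to the regression one, SSM divided by SSR/(n − 2).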

**Apply to factorial data – ANOVA (“analysis of variance”)**

If $x$ is factorial, with $k$ levels, then we can try to fit the model $y = \sum_{j=1}^{k} a_j x_j$, where $x_j = 1$ if the factor is at level $j$, 0 otherwise. This is hypothesis H1 with $k$ parameters. The null hypothesis H0 is $y = a$, i.e. that the level of the factor makes no difference.

For H0 our best estimate of any observation is simply the mean of all the observations, $\bar{y}$. For H1 our best prediction is the mean of the observations for that level of the factor, $\bar{y}_j$.

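As before we have SST = SSR + SSM, this time calculated as:

$$\sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2 = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 + \sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2$$

where $y_{ij}$ is the $i$-th observation at level $j$, $\bar{y}_j$ is the mean of the observations at level $j$, $n_j$ is the number of data points at level $j$, and the total number of data points is $n = \sum_{j=1}^{k} n_j$.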

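Our test statistic is:

$$F = \frac{\mathrm{SSM}/(k-1)}{\mathrm{SSR}/(n-k)}$$

which is compared with the F distribution on $k-1$ and $n-k$ degrees of freedom.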
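The one-way ANOVA calculation can be sketched as follows, assuming the standard degrees of freedom ($k-1$ between levels, $n-k$ within levels). The function name `one_way_anova_f` and the three small groups are hypothetical, invented for illustration.

```python
def one_way_anova_f(groups):
    """Return (F, SSM, SSR) for a list of groups, one list of values per level."""
    k = len(groups)                          # number of factor levels
    n = sum(len(g) for g in groups)          # total number of data points
    grand_mean = sum(sum(g) for g in groups) / n
    level_means = [sum(g) / len(g) for g in groups]

    # Model (between-level) sum of squares: level means around the grand mean,
    # weighted by the number of data points at each level.
    ssm = sum(len(g) * (m - grand_mean) ** 2
              for g, m in zip(groups, level_means))
    # Residual (within-level) sum of squares: observations around their level mean.
    ssr = sum((y - m) ** 2
              for g, m in zip(groups, level_means) for y in g)

    f = (ssm / (k - 1)) / (ssr / (n - k))
    return f, ssm, ssr


# Toy data: three levels with three observations each (illustration only).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, ssm, ssr = one_way_anova_f(groups)
print(f"SSM={ssm:.2f}  SSR={ssr:.2f}  F={f_stat:.2f}")
# prints: SSM=26.00  SSR=6.00  F=13.00
```

This is the nested-model comparison with one parameter under H0 and $k$ parameters under H1. If SciPy is available, `scipy.stats.f_oneway(*groups)` should report the same F value together with a p-value.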

[1] Confusingly, an alternative terminology is sometimes used: SSR for regression (i.e. model) sum of squares, and SSE for error sum of squares.