A brief explanation of the use of the F-ratio to test the significance of a model fit.

For any specific data point $y_i$, let $\hat{y}_i$ be the model's prediction and $\bar{y}$ the overall mean.

The difference of any data point from the overall mean can be divided into two parts:

$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$

Taking the sum of squares over all $n$ data points, it can be shown (the cross term vanishes for a least-squares fit) that:

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$

In other words, the total sum of squares (SST) equals the residual sum of squares (SSR) plus the sum of squares of the model (SSM):[1]

$$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSM}$$

The squared correlation coefficient is $r^2 = \mathrm{SSM}/\mathrm{SST}$, that is, the fraction of the total variance explained by the model. But this does not directly take into account the number of data points $n$ used to estimate the parameters.
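The decomposition above is easy to check numerically. The sketch below (made-up data, not from the notes) fits a least-squares line, computes the three sums of squares directly, and confirms that SST = SSR + SSM and that SSM/SST equals the squared correlation coefficient:

```python
import numpy as np

# Illustrative sketch: verify SST = SSR + SSM and r^2 = SSM/SST
# for a least-squares straight-line fit to simulated data.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

b, a = np.polyfit(x, y, 1)           # slope and intercept of the fit
y_hat = a + b * x                    # model predictions
y_bar = y.mean()                     # overall mean

sst = np.sum((y - y_bar) ** 2)       # total sum of squares
ssr = np.sum((y - y_hat) ** 2)       # residual sum of squares
ssm = np.sum((y_hat - y_bar) ** 2)   # model sum of squares

print(np.isclose(sst, ssr + ssm))    # the decomposition holds
r2 = ssm / sst                       # fraction of variance explained
print(np.isclose(r2, np.corrcoef(x, y)[0, 1] ** 2))
```

The cross term only vanishes because the fit is least-squares; for an arbitrary (non-fitted) line the decomposition would not hold exactly.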

In the previous session we used the F statistic to test the hypothesis of equal variance. In general, the F statistic is a ratio of two sums of squares, each scaled by its degrees of freedom, and each following a chi-square distribution (exactly so if the data are independent and normally distributed). To test the correlation we use:

$$F = \frac{\mathrm{SSM}/1}{\mathrm{SSR}/(n-2)}$$

which has $1$ and $n-2$ degrees of freedom, since the regression line $y = a + bx$ has two fitted parameters.

Under $H_0$ (no correlation, $y = a$), our best prediction for any observation is the overall mean $\bar{y}$, so SSM goes to zero.
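A minimal sketch of this test, again with made-up data: compute $F$ from the sums of squares, get the upper-tail probability from the F distribution, and check against `scipy.stats.linregress`, whose two-sided slope t-test satisfies $F = t^2$ and gives the same p-value:

```python
import numpy as np
from scipy import stats

# Sketch: F-ratio for a simple regression, F = (SSM/1) / (SSR/(n-2)),
# compared against scipy's t-test on the slope.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 25)
y = 1.0 + 0.3 * x + rng.normal(scale=1.5, size=x.size)
n = x.size

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
ssr = np.sum((y - y_hat) ** 2)
ssm = np.sum((y_hat - y.mean()) ** 2)

F = (ssm / 1) / (ssr / (n - 2))
p = stats.f.sf(F, 1, n - 2)          # upper-tail probability of F(1, n-2)

res = stats.linregress(x, y)         # two-sided t-test on the slope
print(np.isclose(F, res.rvalue**2 / (1 - res.rvalue**2) * (n - 2)))
print(np.isclose(p, res.pvalue))
```

Note that $F = \frac{r^2}{1-r^2}(n-2)$ follows directly from $\mathrm{SSM} = r^2\,\mathrm{SST}$ and $\mathrm{SSR} = (1-r^2)\,\mathrm{SST}$.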

Generalise to any pair of models

If we are trying to fit model 1 and model 2 to $n$ data points, where model 2 has more parameters ($p_2 > p_1$), and model 1 is ‘nested’ in model 2 (i.e. any fit to the data by model 1 can be obtained for some parameter setting of model 2, so model 2 can always fit the data at least as closely as model 1), then we can compare them using the F-statistic:

$$F = \frac{(\mathrm{SSR}_1 - \mathrm{SSR}_2)/(p_2 - p_1)}{\mathrm{SSR}_2/(n - p_2)}$$

which under the null hypothesis (model 1 is adequate) has $p_2 - p_1$ and $n - p_2$ degrees of freedom.

Think of how this applies to the H0 and H1 models for the simple regression above.

Apply to factorial data – ANOVA (“analysis of variance”)

If $x$ is factorial, with $k$ levels, then we can try to fit the model $y = \sum_{j=1}^{k} a_j x_j$, where $x_j = 1$ if the factor is at level $j$ and 0 otherwise. This is hypothesis $H_1$, with $k$ parameters. The null hypothesis $H_0$ is $y = a$, i.e. that the level of the factor makes no difference.

For $H_0$ our best estimate of any observation is simply the mean of all the observations, $\bar{y}$. For $H_1$ our best prediction is the mean of the observations at that level of the factor, $\bar{y}_j$.

As before we have SST = SSR + SSM, this time calculated as:

$$\sum_{j=1}^{k}\sum_{i=1}^{n_j}(y_{ji}-\bar{y})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(y_{ji}-\bar{y}_j)^2 + \sum_{j=1}^{k} n_j\,(\bar{y}_j-\bar{y})^2$$

where $n_j$ is the number of data points at level $j$, and the total number of data points is $n = \sum_{j=1}^{k} n_j$.

Our test statistic is:

$$F = \frac{\mathrm{SSM}/(k-1)}{\mathrm{SSR}/(n-k)}$$

which under $H_0$ has $k-1$ and $n-k$ degrees of freedom.
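A sketch of this one-way ANOVA computed by hand (made-up data), checked against `scipy.stats.f_oneway`, which implements exactly this test:

```python
import numpy as np
from scipy import stats

# Sketch: one-way ANOVA by hand, F = (SSM/(k-1)) / (SSR/(n-k)),
# checked against scipy.stats.f_oneway.
rng = np.random.default_rng(3)
groups = [rng.normal(loc=m, scale=1.0, size=12) for m in (0.0, 0.5, 1.5)]
k = len(groups)
n = sum(g.size for g in groups)

grand_mean = np.concatenate(groups).mean()
ssm = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)  # between levels
ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups)            # within levels

F = (ssm / (k - 1)) / (ssr / (n - k))
p = stats.f.sf(F, k - 1, n - k)

F_scipy, p_scipy = stats.f_oneway(*groups)
print(np.isclose(F, F_scipy), np.isclose(p, p_scipy))
```

This is again the nested-model comparison of the previous section, with $\mathrm{SSR}_1 = \mathrm{SST}$ ($p_1 = 1$) and $\mathrm{SSR}_2 = \mathrm{SSR}$ ($p_2 = k$).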

[1] Confusingly, an alternative terminology is sometimes used: SSR for regression (i.e. model) sum of squares, and SSE for error sum of squares.