Model Fitting using the Bhattacharyya Measure

Rhodri H Davies, Neil A Thacker and Chris J Taylor

Introduction

These notes contain an overview of the B-fitting method proposed by Thacker et al. [1]. The method is a model-selection criterion that explicitly calculates the generalisation ability of the model.

MATLAB code that demonstrates B-Fitting polynomials to some 1D data can be found at:

http://www.isbe.man.ac.uk/~rhd/code/Bhatt/

The `README.m' file contains instructions on how to run the demos.

Model Fitting

Model-fitting involves estimating (i.e. modelling) the process that generated a set of data. We usually use a subset of the entire dataset (the training set) to estimate the model.

Model Complexity

We must find a suitable compromise for the model's complexity; this trade-off is often referred to as the principle of parsimony. If the model is too complex, it will capture the errors in the training set and be unable to generalise to unseen data. If it is too simple, it will be unable to represent the intrinsic process that generated the data.

Ideally, the model should be chosen so that it encodes the meaningful, systematic aspects of the data and discards the unstructured portion.

Chi-squared Fitting

The chi-squared metric is a `goodness of fit' measure that is often used to assess the suitability of a given model. It is defined as:

\[ \chi^2(\mathbf{a}) = \sum_i \frac{\left( y_i - f(x_i; \mathbf{a}) \right)^2}{\sigma_i^2} \]

where \(\mathbf{a}\) is the set of model parameters, \(y_i\) is the true value at point \(i\), \(f(x_i; \mathbf{a})\) is the predicted value at that point and \(\sigma_i^2\) is the data measurement accuracy (the probable variance of the error).

The drawback of using the \(\chi^2\) metric is that it places no restriction on the complexity of the model, making it prone to overfitting the training set.
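As a concrete illustration (in MATLAB, to match the demo code linked above, though none of this is taken from those demos), a chi-squared evaluation of a least-squares polynomial fit might look like the following sketch; the data-generating process and noise level are invented for the example:

    % Illustrative sketch: chi-squared for a least-squares polynomial fit.
    x = linspace(0, 1, 20)';                 % sample positions
    y = sin(2*pi*x) + 0.1*randn(size(x));    % noisy observations of a known process
    sigma = 0.1;                             % assumed measurement accuracy

    k = 3;                                   % polynomial order
    a = polyfit(x, y, k);                    % least-squares parameter estimate
    f = polyval(a, x);                       % model predictions f(x_i; a)

    chi2 = sum(((y - f) ./ sigma).^2);       % goodness-of-fit statistic
    fprintf('chi-squared = %.2f over %d points\n', chi2, numel(x));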

Penalising Complexity

Many methods have been proposed to prevent overfitting by penalising model complexity (e.g. Akaike's information criterion, AIC). These typically take the form:

Goodness of fit = Training error + Complexity penalty

It is not, however, entirely clear how the complexity term should be chosen.
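For instance, under Gaussian errors with known variance, AIC reduces (up to an additive constant) to the chi-squared training error plus twice the number of free parameters. A minimal MATLAB sketch, reusing x, y and sigma from the example above:

    % Illustrative sketch: ranking polynomial orders with an AIC-style penalty.
    for k = 1:8
        a = polyfit(x, y, k);
        chi2 = sum(((y - polyval(a, x)) ./ sigma).^2);
        aic = chi2 + 2*(k + 1);              % a polynomial of order k has k+1 coefficients
        fprintf('order %d: chi2 = %6.2f, AIC = %6.2f\n', k, chi2, aic);
    end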

B-Fitting

B-Fitting is a method that gives an unbiased measure of the generalisation ability of a model.

Consider a least-squares fit. It attempts to find the model that minimises the squared distances between the data values \(y_i\) and the model predictions \(f(x_i; \mathbf{a})\). See the diagram in figure 1 - here, the solution is constrained to lie on a manifold (illustrated as a line). A least-squares fit will select the point on the manifold that is closest to the correct solution.

[Figure 1: A least squares fit in hyperspace]

B-fitting, however, not only attempts to find the shortest distance but also considers the stability of the model. A set of solutions and probable errors can be described by a multivariate probability density function (pdf). We already have this information for the data (from the training set) and, assuming we can also obtain it for the model, B-fitting states that the optimal solution is the one at which the two pdfs have maximum overlap. See figures 2 and 3 - again, the solution must lie along the constraint manifold, but this time we attempt to overlap the pdfs as much as possible.

[Figure 2: B-fitting in a hyperspace]

[Figure 3: The optimal solution]

B-fitting makes use of the Bhattacharyya metric to measure the similarity (amount of overlap) of two pdfs.
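Concretely, if \(B_i\) denotes the overlap between the data pdf and the model pdf at point \(i\), one natural reading of this criterion (an assumption on my part; see [1] for the exact formulation) is to choose the parameters that maximise the total log overlap:

\[ \hat{\mathbf{a}} = \arg\max_{\mathbf{a}} \sum_i \log B_i(\mathbf{a}) \]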

The Bhattacharyya Overlap

The Bhattacharyya metric has many desirable statistical properties [2] that make it a suitable measure of the divergence of two pdfs. For two 1D Gaussian distributions, the Bhattacharyya measure has the analytical form:

\[ B = \sqrt{\frac{2\,\sigma_1 \sigma_2}{\sigma_1^2 + \sigma_2^2}} \, \exp\!\left( -\frac{d^2}{4\left(\sigma_1^2 + \sigma_2^2\right)} \right) \]

where \(d\) is the distance between the data and the estimate (made by the model), \(\sigma_1^2\) is the (error) variance of the data pdf and \(\sigma_2^2\) is the probable variance of the model pdf.
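A direct MATLAB transcription of this formula (my own helper function, not part of the linked demos; saved as bhatt_overlap.m) is:

    function B = bhatt_overlap(d, var1, var2)
    % Bhattacharyya overlap of two 1D Gaussians whose means differ by d,
    % with variances var1 (data pdf) and var2 (model pdf).
    % B = 1 when the two pdfs coincide and falls towards 0 as they separate.
        B = sqrt(2*sqrt(var1.*var2) ./ (var1 + var2)) ...
            .* exp(-d.^2 ./ (4*(var1 + var2)));
    end

Note that the first factor penalises a mismatch in the widths of the two pdfs even when d = 0, which is what lets the measure respond to model instability as well as to fitting error.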

Estimating the Model PDF

We will now consider how we can estimate the error on the model.



Error Propagation

  Given a change in the input data \(\Delta x\), we want to know how it affects the output of the function \(f(x)\). If the change is small enough (so that the propagated change is linear) then the following equation holds:

\[ \Delta f = \frac{\partial f}{\partial x} \, \Delta x \]

where \(\Delta f = f(x + \Delta x) - f(x)\).

For the n-dimensional case, if we assume the errors are normally distributed and independent, we can propagate the errors through the model with:

\[ \sigma_y^2 = \sum_i \left( \frac{\partial f}{\partial x_i} \right)^2 \sigma_{x_i}^2 \]

where \(y = f(\mathbf{x})\).
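Putting the pieces together, the sketch below propagates the data errors through a polynomial model (which is linear in its parameters, so the propagation is exact) and scores the fit by its total log Bhattacharyya overlap. The design-matrix route to the prediction variance and the final objective are my own assumptions; the algorithm in [1] may differ in detail.

    % Illustrative sketch: error propagation for a polynomial model, then a
    % Bhattacharyya score. Assumes x, y, sigma, k and bhatt_overlap from above.
    X = x .^ (k:-1:0);                   % Vandermonde design matrix (f = X*a)
    C = sigma^2 * inv(X' * X);           % parameter covariance from the data errors
    var_model = sum((X * C) .* X, 2);    % propagated per-point prediction variance

    a = polyfit(x, y, k);
    d = y - polyval(a, x);               % residuals between data and model
    score = sum(log(bhatt_overlap(d, sigma^2, var_model)));
    fprintf('order %d: total log overlap = %.3f\n', k, score);

In a model-selection loop, one would compute this score for each candidate order and keep the order with the largest total overlap.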

References

[1] Thacker et al., B-Fitting: An Estimation Technique with Automatic Parameter Selection, BMVC, 1996.

[2] Aherne et al., The Bhattacharyya Metric as an Absolute Similarity Measure for Frequency Coded Data.