Model-Fitting Using the Minimum Description Length Principle

Rh H Davies, T F Cootes and C J Taylor


Introduction

These notes give a brief introduction to the minimum description length (MDL) principle [2]. The approach is simple and intuitive and is applicable to many model-fitting problems.

Model Fitting

Model-fitting involves estimating (i.e. modelling) the process that generated a set of data. We usually use a subset of the entire dataset (the training set) to estimate the model.

Model Complexity

We must find a suitable compromise for the model's complexity; this is often referred to as the principle of parsimony. If the model is too complex, it will capture the errors in the training set and be unable to generalise to unseen data. If it is too simple, it will not adequately capture the intrinsic process that generated the data.

Ideally, the model should be chosen so that it encodes the meaningful, systematic aspects of the data and discards the unstructured portion.
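As an illustration of this trade-off, the sketch below fits polynomials of increasing degree to a small, noisy training set. The quadratic generating process, noise level and degrees tried are illustrative assumptions, not taken from these notes; the point is that the most complex fit drives the training error towards zero while its error on unseen data grows.

    # A minimal sketch of the complexity trade-off; the generating process,
    # noise level and degrees tried are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0.0, 1.0, 10)
    x_test = np.linspace(0.0, 1.0, 100)
    process = lambda x: 1.0 + 2.0 * x - 3.0 * x ** 2    # assumed intrinsic process
    y_train = process(x_train) + rng.normal(0.0, 0.1, x_train.size)
    y_test = process(x_test)                            # noise-free targets

    for degree in (1, 2, 9):
        coeffs = np.polyfit(x_train, y_train, degree)   # fit the model
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

Degree 1 typically underfits, degree 9 interpolates the noise, and the quadratic strikes the compromise.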

Generalisation Ability

We would like to select the model with the best generalisation ability, that is, the ability to predict unseen data generated by the same process as the training set. To achieve this, we can follow the principle of Occam's razor: 'Given a set of hypotheses, the simplest description generalises best'.

Minimum Description Length

Suppose we want to transmit some information (i.e. our training data) across a wire to a receiver. We could send the raw data, but we can reduce the amount of information by encoding the data using a model. Now, however, we need to send the model as well as the encoded data.

The information required to 'send' the data is now:
I(Total) = I(Model) + I(Model Parameters)

We can also add an extra term, I(Residual), to the above equation:

I(Total) = I(Model) + I(Model Parameters) + I(Residual)
This allows us to encode only a subset of the data and to send the residual data in un-encoded form. It is this term that allows the complexity of the model to be optimised.

Following the principle of Occam's razor, the MDL principle states that the best model is the one that minimises I(Total).
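As a hedged sketch of how this selection rule can be applied, the code below scores polynomial models by an approximate two-part description length: Rissanen's (k/2) log2 n bits for k parameters plus a Gaussian code for the residual. This particular coding scheme is a common textbook approximation, not necessarily the one used in [1], and I(Model) is dropped as a constant shared by all candidate models.

    # A minimal two-part MDL sketch over polynomial degree; the coding
    # approximations below are assumptions, not taken from these notes.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    x = np.linspace(0.0, 1.0, n)
    y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0.0, 0.1, n)  # assumed process

    def description_length(degree):
        k = degree + 1                                  # number of parameters
        coeffs = np.polyfit(x, y, degree)
        rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
        i_params = 0.5 * k * np.log2(n)                 # I(Model Parameters)
        i_residual = 0.5 * n * np.log2(rss / n)         # I(Residual), Gaussian code
        return i_params + i_residual                    # I(Model): constant, dropped

    best = min(range(1, 10), key=description_length)
    print("degree chosen by MDL:", best)                # typically 2, the true order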

Quantising the Data

Representing a real number (e.g. pi) to arbitrary accuracy may require an infinite amount of information. We therefore need to truncate each number so that only k bits are used. By truncating a number, however, we lose information, effectively adding extra error; these errors must be added to the residual data. There is, however, a problem here: the accuracies of the different parameters can be traded off against each other, making I(Total) very difficult to optimise.

By using error propagation, we can estimate how truncating each element affects the size of the residual error. By taking the resulting expression and setting the derivatives with respect to each accuracy parameter to zero, we can express each accuracy in terms of a single quantisation parameter. This should be chosen to reflect the measurement accuracy of the data.
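One way to make this concrete, under two assumed approximations (a second-order expansion of the residual description length, and uniform quantisation noise of variance delta^2/12): sending parameter j to accuracy delta_j costs about -log2 delta_j bits, so, writing H_jj for the second derivative of the residual description length with respect to parameter j,

    \[
      I(\mathrm{Total}) \;\approx\; C \;-\; \sum_j \log_2 \delta_j
        \;+\; \frac{1}{2} \sum_j H_{jj}\,\frac{\delta_j^{2}}{12},
      \qquad
      \frac{\partial I(\mathrm{Total})}{\partial \delta_j}
        \;=\; -\frac{1}{\delta_j \ln 2} \;+\; \frac{H_{jj}\,\delta_j}{12} \;=\; 0
      \;\Longrightarrow\;
      \delta_j \propto \frac{1}{\sqrt{H_{jj}}} .
    \]

Each accuracy is then a common quantisation scale divided by the square root of H_jj, and that single scale is what should be chosen from the measurement accuracy of the data.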

Conclusions

The MDL principle states that the simplest description of a training set will be the one that generalises best to unseen examples. We have used the MDL principle to build statistical shape models [1]. We also intend to investigate using B-fitting [3] for the same purpose.

References

[1] Davies et al., "A MDL Approach to Statistical Shape Modelling", in Proc. IPMI'01, pp. 50-63, 2001.
[2] Rissanen, J., "Modeling by Shortest Data Description", Automatica, 14, pp. 465-471, 1978.
[3] Thacker et al., "B-Fitting: An Estimation Technique with Automatic Parameter Selection", in Proc. BMVC'96, 1996.