
Model selection criteria

Several model selection criteria have been used in computer vision, and many more appear in the statistics literature. Criteria based on the Chi-square test and the F-test have been used in both the vision and statistics literature [5,19]. More recently, information-theoretic model selection criteria have gained increasing popularity.

Information-theoretic criteria are mainly based on Bayes rule, the Kullback-Leibler (K-L) distance, and minimum description length (MDL). Criteria based on Bayes rule choose the model that maximizes the probability of the data, D, given the model m and prior information I. This probability is given by

\begin{displaymath}
P(D \mid m, I) = \int\!\!\int P(D \mid \boldsymbol{\theta}_{m}, \sigma, m, I)\, P(\boldsymbol{\theta}_{m}, \sigma \mid m, I)\, d\boldsymbol{\theta}_{m}\, d\sigma,
\end{displaymath}

where $\boldsymbol{\theta}_{m}$ is the $d_{m} \times 1$ parameter vector of model m, and $\sigma$ is the standard deviation of the sensor noise. The first factor in the integrand is just the likelihood $L(\boldsymbol{\theta}_{m})$, and $P(\boldsymbol{\theta}_{m}, \sigma \mid m, I)$ is the prior probability of $\boldsymbol{\theta}_{m}$ and $\sigma$. Several Bayesian criteria have been derived, depending on the choice of priors [10,11,13]. One such criterion chooses the model that maximizes [6,5]
\begin{displaymath}
P(D \mid m, I) \approx (2\pi)^{d_{m}/2}\, L(\hat{\boldsymbol{\theta}}_{m})\, P(\hat{\boldsymbol{\theta}}_{m}, \hat{\sigma} \mid m, I)\, \left\vert \boldsymbol{H}(\hat{\boldsymbol{\theta}}_{m}) \right\vert^{-1/2},
\end{displaymath} (1)
where $\boldsymbol{H}(\cdot)$ is the Hessian of $\log L(\cdot)$.
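
For concreteness, the following is a minimal sketch of how criterion (1) might be evaluated for a degree-p polynomial model with Gaussian noise of known standard deviation sigma (so the prior reduces to one over the parameters alone). The names log_evidence and log_prior are illustrative, not from the paper, and the log-determinant is taken of the negative Hessian, which is positive definite at the maximum.

import numpy as np

# Sketch only: Laplace-style approximation to P(D|m) for a degree-p
# polynomial fit to data (x, y) with known noise std. dev. sigma.
# log_prior is a caller-supplied function returning the log prior
# probability of the estimated parameters.
def log_evidence(x, y, p, sigma, log_prior):
    X = np.vander(x, p + 1)                            # design matrix; d_m = p + 1
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # ML estimate
    r = y - X @ theta_hat                              # residuals
    n, d_m = X.shape
    # Gaussian log-likelihood at the ML estimate
    log_L = -0.5 * n * np.log(2 * np.pi * sigma**2) - (r @ r) / (2 * sigma**2)
    # Hessian of log L w.r.t. theta is -X^T X / sigma^2; take the
    # log-determinant of its negation
    _, logdet = np.linalg.slogdet(X.T @ X / sigma**2)
    return 0.5 * d_m * np.log(2 * np.pi) + log_L + log_prior(theta_hat) - 0.5 * logdet

Comparing log_evidence across candidate degrees p and keeping the maximum then implements the Bayesian selection rule above.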

Another set of model selection criteria minimizes the K-L distance between the candidate model's fit and the generating model's fit, given by

\begin{displaymath}
d(\hat{\boldsymbol{\theta}}_{m}, \boldsymbol{\theta}_{\ast}) = \left. E_{\boldsymbol{\theta}_{\ast}}\!\left[ -2 \log L(\boldsymbol{\theta}_{m}) \right] \right\vert_{\boldsymbol{\theta}_{m} = \hat{\boldsymbol{\theta}}_{m}}.
\end{displaymath}

Evaluating this distance directly is not possible, however, since it requires knowledge of the true model parameters $\boldsymbol{\theta}_{\ast}$. A number of model selection criteria have been derived from approximations to this distance. The most common of these are the AIC and CAIC, given by

\begin{displaymath}
d(\hat{\boldsymbol{\theta}}_{m}, \boldsymbol{\theta}_{\ast}) \approx -2 \log L(\hat{\boldsymbol{\theta}}_{m}) + 2 d_{m}, \;\;\; \mbox{and}
\end{displaymath}

\begin{displaymath}
d(\hat{\boldsymbol{\theta}}_{m}, \boldsymbol{\theta}_{\ast}) \approx -2 \log L(\hat{\boldsymbol{\theta}}_{m}) + d_{m} (\log{n} + 1),
\end{displaymath}

respectively, where n is the number of data points.
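
Once each candidate model has been fit and its maximized log-likelihood computed, evaluating these two criteria is straightforward. A minimal sketch (the helper names are ours, not from the paper); the model minimizing either criterion is preferred:

import math

def aic(log_L, d_m):
    # AIC for a model with maximized log-likelihood log_L and d_m parameters
    return -2.0 * log_L + 2.0 * d_m

def caic(log_L, d_m, n):
    # CAIC additionally penalizes by the sample size n
    return -2.0 * log_L + d_m * (math.log(n) + 1.0)

Selection then amounts to, e.g., min(candidates, key=lambda c: caic(c.log_L, c.d_m, n)) over the fitted candidates.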

Finally, MDL-based criteria minimize the number of bits required to represent the data using model m, given by

\begin{displaymath}
\mathit{len}_{m} = \mathit{len}(\hat{\boldsymbol{e}}_{m}) + \mathit{len}(\hat{\boldsymbol{\theta}}_{m}),
\end{displaymath}

where the two terms give the number of bits required to encode the residuals and the estimated parameter vector, respectively. A popular MDL criterion, due to Rissanen, is given by

\begin{displaymath}
\mathit{len}_{m} = -\log_{2} L(\hat{\boldsymbol{\theta}}_{m}) + \log_{2}^{\ast}\!\left( \hat{\boldsymbol{\theta}}_{m}^{T} \boldsymbol{H}(\hat{\boldsymbol{\theta}}_{m})\, \hat{\boldsymbol{\theta}}_{m} \right) + \log_{2} V_{d_{m}},
\end{displaymath}

where $\log_{2}^{\ast}(t) = \log_{2} t \; + \; \log_{2}\log_{2} t \; + \; \ldots$, including only its positive terms, and $V_{d_{m}}$ is the volume of the $d_{m}$-dimensional unit hypersphere (see [7, page 24]).
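
Both quantities are easy to compute. The following sketch (function names ours) evaluates $\log_{2}^{\ast}(t)$ by summing iterated logarithms while they remain positive, and the hypersphere volume from the closed form $V_{d} = \pi^{d/2} / \Gamma(d/2 + 1)$:

import math

def log2_star(t):
    # log2*(t) = log2 t + log2 log2 t + ..., keeping only the
    # positive terms; assumes t > 0
    total = 0.0
    t = math.log2(t)
    while t > 0:
        total += t
        t = math.log2(t)
    return total

def log2_unit_ball_volume(d):
    # log2 of V_d = pi^(d/2) / Gamma(d/2 + 1), the volume of the
    # d-dimensional unit hypersphere
    return (0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1)) / math.log(2)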

An advantage of these criteria is their ease of use: they require no empirical thresholds, no significance levels, and no look-up tables. However, they require fitting all candidate models to the data, which may be expensive and unnecessary in many applications.

Although these criteria start from different premises, interestingly, they all take the form of a penalized likelihood, optimizing

\begin{displaymath}
\log L(\hat{\boldsymbol{\theta}}_{m}) \; + \; \mbox{stability or complexity term}.
\end{displaymath}
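For instance, dividing the AIC and CAIC expressions above by $-2$ puts them in exactly this form: AIC maximizes $\log L(\hat{\boldsymbol{\theta}}_{m}) - d_{m}$, while CAIC maximizes $\log L(\hat{\boldsymbol{\theta}}_{m}) - d_{m}(\log{n} + 1)/2$.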

Further, criteria formulated from one premise have often been derived later from others. However, all of these criteria assume data contaminated only by small-scale random errors, whereas vision data also contain outliers. Several modifications to the above criteria have recently been proposed to accommodate outliers [3,8,14,15,16,17,18].

