Hannes Kruppa and Bernt Schiele
Perceptual Computing and Computer Vision Group
ETH Zurich, Switzerland
Mutual information has been used previously in computer vision, for example in image registration or in audio-visual speech acquisition. As detailed below, mutual information can be used to measure the mutual agreement between two object models. In order to combine multiple models a hierarchy of pairwise model combinations is used.
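The mutual information $I(X;Y)$ of two random variables $X$ and $Y$ with a joint probability mass function $p(x,y)$ and marginal probability mass functions $p(x)$ and $p(y)$ is defined as:
\[
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \tag{1}
\]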
Here, the probabilities in expression 1 can be directly derived from a pair of distinct visual object models. To underline the relevance of mutual information in the context of object model combination, we briefly refer to the well-known Kullback-Leibler divergence. The KL-divergence between a probability mass function $p(x,y)$ and a distinct probability mass function $q(x,y)$ is defined as:
\[
D\bigl(p(x,y)\,\|\,q(x,y)\bigr) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{q(x,y)} \tag{2}
\]
Although the Kullback-Leibler divergence (also called relative entropy or information divergence) is not symmetric and does not satisfy the triangle inequality, it is often useful to think of it as a ``distance'' between distributions. By defining $q(x,y) = p(x)\,p(y)$, the mutual information can be written as the KL-divergence between $p(x,y)$ and $p(x)\,p(y)$:
\[
I(X;Y) = D\bigl(p(x,y)\,\|\,p(x)\,p(y)\bigr) \tag{3}
\]
Mutual information therefore measures the ``distance'' between the joint probability $p(x,y)$ and the probability $q(x,y) = p(x)\,p(y)$, which is the joint probability under the assumption of independence. Conversely, it measures mutual dependency or the amount of information one object model contains about another. As a result, mutual information can be used to measure mutual agreement between object models.
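The relation between mutual information and the KL-divergence can be checked numerically. The following is a minimal sketch, not the implementation used in the experiments: the function names and the two example tables are hypothetical, and the joint probability is assumed to be given as a small nested list whose entries sum to one.

```python
import math


def marginals(joint):
    """Return the marginals p(x) (row sums) and p(y) (column sums) of a joint pmf."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return p_x, p_y


def kl_divergence(p, q):
    """D(p || q) for two pmfs given as flat lists; terms with p = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(X;Y) computed as the KL-divergence between p(x,y) and p(x)p(y), in nats."""
    p_x, p_y = marginals(joint)
    flat_joint = [p_xy for row in joint for p_xy in row]
    flat_indep = [px * py for px in p_x for py in p_y]
    return kl_divergence(flat_joint, flat_indep)


# Hypothetical example tables: two models that mostly agree, and two
# models whose outputs are statistically independent.
agreeing = [[0.45, 0.05], [0.05, 0.45]]
independent = [[0.25, 0.25], [0.25, 0.25]]

print(mutual_information(agreeing))     # strictly positive
print(mutual_information(independent))  # zero: joint equals product of marginals
```

For the independent table the joint probability equals the product of the marginals, so the ``distance'' vanishes; the more the two models agree, the larger the value becomes.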
In the following we assume that for each subregion of the input image, each model determines the probability that the object of interest is either present or absent. This representation is very general and can be satisfied by nearly any object model. The probability $p(x)$ is calculated based on the first object model and covers two cases, namely the presence of the object or its absence, respectively. The probability $p(y)$ is derived from the second object model analogously, also with the two described cases. Finally, for the joint probability $p(x,y)$ both models and all four cases are taken into account.
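One possible way to obtain the four joint cases from two per-subregion probability maps is sketched below. This is an illustration under stated assumptions, not necessarily the estimator used in this work: it assumes that the two models are combined independently within each subregion and that the resulting products are averaged over all subregions. The function name and the table layout are hypothetical.

```python
def joint_from_maps(map_a, map_b):
    """Estimate a 2x2 joint table from two lists of presence probabilities.

    joint[x][y] uses index 1 for "object present" and 0 for "object absent";
    x refers to the first model, y to the second.
    Assumption: within one subregion the two models are multiplied as if
    independent, and the products are averaged over all subregions.
    """
    if len(map_a) != len(map_b) or not map_a:
        raise ValueError("both maps must be non-empty and of equal length")
    joint = [[0.0, 0.0], [0.0, 0.0]]
    for a, b in zip(map_a, map_b):
        joint[1][1] += a * b                  # both models: present
        joint[1][0] += a * (1.0 - b)          # first: present, second: absent
        joint[0][1] += (1.0 - a) * b          # first: absent, second: present
        joint[0][0] += (1.0 - a) * (1.0 - b)  # both models: absent
    n = len(map_a)
    return [[value / n for value in row] for row in joint]


# Two hypothetical maps over four subregions that agree perfectly.
print(joint_from_maps([1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]))
```

The marginals $p(x)$ and $p(y)$ then follow as the row and column sums of the table.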
Typically, the object of interest can be associated with a characteristic parameter range of an object model. For example, in the case of a color model, the parameter range may be given by a particular subspace of the total color-space. Note that each parameter configuration results in distinct probabilities $p(x)$, $p(y)$ and $p(x,y)$ and, consequently, in a distinct mutual information value. Therefore, one can determine a configuration which maximizes mutual agreement between the employed models and the input data by maximizing the mutual information over the object-specific joint parameter space:
\[
(\hat{\theta}_X, \hat{\theta}_Y) = \arg\max_{\theta_X,\,\theta_Y} I(X;Y) \tag{4}
\]
with $\theta_X$ and $\theta_Y$ describing the object-specific parameter space of a model pair. For example, the parameters of the facial shape model in the experiments described below are the size and the location of the face within the image. By maximizing mutual information with a second, complementary face model, the algorithm detects and locates faces in the image. Figure 1 illustrates this concept.
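The maximization can be illustrated with a deliberately small toy problem. The sketch below makes several assumptions that are not taken from the experiments: the image is a one-dimensional row of 16 subregions, the shape model is a simple window parameterized by location and size, the joint probability is estimated by averaging per-subregion products, and the parameter space is searched exhaustively. All function names are hypothetical.

```python
import math


def joint_from_maps(map_a, map_b):
    """2x2 joint table (index 1 = present, 0 = absent), averaged over subregions."""
    joint = [[0.0, 0.0], [0.0, 0.0]]
    for a, b in zip(map_a, map_b):
        joint[1][1] += a * b
        joint[1][0] += a * (1.0 - b)
        joint[0][1] += (1.0 - a) * b
        joint[0][0] += (1.0 - a) * (1.0 - b)
    return [[value / len(map_a) for value in row] for row in joint]


def mutual_information(joint):
    """I(X;Y) in nats for a joint pmf given as a nested list."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:
                mi += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return mi


def window_map(n, location, size, inside=0.9, outside=0.1):
    """Toy probability map: high inside [location, location + size), low outside."""
    return [inside if location <= i < location + size else outside
            for i in range(n)]


def best_configuration(reference_map, sizes):
    """Exhaustively search (location, size) for the window maximizing I(X;Y)."""
    n = len(reference_map)
    best, best_mi = None, -1.0
    for size in sizes:
        for location in range(n - size + 1):
            shape_map = window_map(n, location, size)
            mi = mutual_information(joint_from_maps(reference_map, shape_map))
            if mi > best_mi:
                best, best_mi = (location, size), mi
    return best, best_mi


# Hypothetical color-model output: the object occupies subregions 6..9.
color_map = window_map(16, location=6, size=4)
best, best_mi = best_configuration(color_map, sizes=range(2, 7))
print(best, best_mi)
```

In this toy setting the search returns location 6 and size 4, i.e. the window that coincides with the region favored by the color map. Exhaustive search is used here only for clarity; it is not a statement about how the parameter space is explored in the experiments.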
In order to combine multiple models a hierarchy of pairwise model combinations is used. At each stage the algorithm computes a ranking of parameter configurations which maximize mutual information. This ranking is then used as input for the next stage in the hierarchy, where mutual information can be used again to find the best combined parameter configurations. The resulting algorithm is modular and can be easily extended to new object models. The hierarchical concept is depicted in figure 2, which shows the architecture used for face detection (see figures 3, 4, 5). In this case study, the following three object models are combined pairwise in order to detect human faces: a skin color model, a shape model and a template matcher. In stage one, the probability maps are calculated based on the color model and the template matcher. Stage two combines the color model with the facial shape model by maximizing mutual information. The template matcher and the facial shape model are also combined in stage two. Finally, stage three combines both results, again by maximizing mutual information.
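The flow of rankings through the hierarchy can be sketched as follows. This is only a structural illustration: the scoring functions are hypothetical stand-ins for the mutual information of each model pair, the configurations are reduced to a single integer location, and the shortlist length `k` is an assumed parameter.

```python
from itertools import product


def rank_configurations(configs, score, k):
    """Return the k best-scoring configurations, best first."""
    return sorted(configs, key=score, reverse=True)[:k]


def combine_hierarchically(configs, score_pair_a, score_pair_b, score_joint, k=3):
    """Two pairwise rankings (stage two) feed one joint ranking (stage three)."""
    # Stage two: each model pair ranks its own parameter configurations.
    shortlist_a = rank_configurations(configs, score_pair_a, k)
    shortlist_b = rank_configurations(configs, score_pair_b, k)
    # Stage three: only combinations of shortlisted configurations are scored.
    candidates = list(product(shortlist_a, shortlist_b))
    best = rank_configurations(candidates, lambda pair: score_joint(*pair), 1)[0]
    return shortlist_a, shortlist_b, best


# Hypothetical stand-in scores over locations 0..9 (higher is better).
locations = range(10)
shortlist_a, shortlist_b, best = combine_hierarchically(
    locations,
    score_pair_a=lambda loc: -abs(loc - 4),  # e.g. color + shape pair
    score_pair_b=lambda loc: -abs(loc - 5),  # e.g. template + shape pair
    score_joint=lambda a, b: -abs(a - b),    # agreement of both hypotheses
)
print(shortlist_a, shortlist_b, best)
```

Because stage three only evaluates combinations of the two shortlists, the number of scored candidates is at most `k * k` rather than the size of the full joint parameter space.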
Other groupings would be meaningful as well. The proposed grouping, however, ensures that the combined hypotheses at stage two can be represented as a single condensed region of probabilities. This will be further explained in the next sections. It would also be possible to combine all models in a single maximization step. However, using pairwise combinations enables the definition of separate and independent parameter constraints for each pair, which reduces the size of the joint parameter space and therefore speeds up sampling.