The method

If the two distributions are related functionally (in our case through a geometrical transformation), and we only have an estimate of this transformation, the mutual information depends on:
  • The first distribution.
  • The second distribution.
  • The estimated transformation that maps one onto the other.

    The exact transformation that maps the first image u (the model) onto the second image v (the image) should therefore give rise to the largest mutual information. Mutual information then becomes an optimisation criterion, optimised w.r.t. T:

        MI(T) = H( u(X) ) + H( v(T(X)) ) - H( u(X), v(T(X)) )
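
    To make the criterion concrete, here is a minimal sketch of evaluating MI(T): u_samples holds the intensities u(X) sampled in the model, and v_samples the intensities v(T(X)) sampled in the image at the transformed coordinates. For brevity the entropies are estimated from a joint histogram rather than from the Parzen-window densities described further down; the names u_samples, v_samples and bins are illustrative, not part of the original method.

        import numpy as np

        def entropy_from_counts(counts):
            """Entropy (in nats) of a histogram given as an array of counts."""
            p = counts[counts > 0] / counts.sum()
            return -np.sum(p * np.log(p))

        def mutual_information(u_samples, v_samples, bins=32):
            """MI(T) = H(u(X)) + H(v(T(X))) - H(u(X), v(T(X)))."""
            joint, _, _ = np.histogram2d(u_samples, v_samples, bins=bins)
            h_u  = entropy_from_counts(joint.sum(axis=1))   # H( u(X) )
            h_v  = entropy_from_counts(joint.sum(axis=0))   # H( v(T(X)) )
            h_uv = entropy_from_counts(joint.ravel())       # H( u(X), v(T(X)) )
            return h_u + h_v - h_uv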
     

    The method then proceeds by a classic gradient-based optimisation technique: it searches for the transformation that gives the largest mutual information by taking small steps in the "direction" of the derivative of the criterion. Since H( u(X) ) does not depend on T, its derivative vanishes and only two terms remain:

        d/dT[ MI(T) ] = d/dT[ H( v(T(X)) ) ] - d/dT[ H( u(X), v(T(X)) ) ]
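
    As an illustration of such an optimisation loop (not the exact scheme of the method), the sketch below parameterises T by a vector theta (e.g. rotation angle and translation), approximates the derivative of MI by central finite differences, and takes small steps uphill. The routine mi_of, the step size and the number of iterations are assumptions.

        import numpy as np

        def maximise_mi(mi_of, theta0, step=1e-2, eps=1e-3, n_iters=200):
            """Climb MI(T_theta) by small steps along a finite-difference gradient.

            mi_of(theta) is assumed to evaluate the mutual information for the
            transformation parameterised by theta (e.g. using the sketch above).
            """
            theta = np.asarray(theta0, dtype=float)
            for _ in range(n_iters):
                grad = np.zeros_like(theta)
                for i in range(theta.size):
                    d = np.zeros_like(theta)
                    d[i] = eps
                    # central finite difference for d/dtheta_i MI(T_theta)
                    grad[i] = (mi_of(theta + d) - mi_of(theta - d)) / (2.0 * eps)
                theta = theta + step * grad   # step "uphill": MI is maximised
            return theta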
    The probability densities of both images are estimated using the Parzen window technique. This is a classical technique, also used in neural networks, for estimating a probability density function (pdf) from a sample: the pdf is estimated as an average of radial basis functions centred on the sample points.
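
    A minimal sketch of such a Parzen estimate, assuming Gaussian radial basis functions of width sigma (the kernel shape and width are not specified in the text):

        import numpy as np

        def parzen_density(x, data_points, sigma=1.0):
            """Parzen-window estimate of the pdf at x from a 1-D sample of data points."""
            z = (x - np.asarray(data_points)) / sigma
            kernels = np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))
            return kernels.mean()   # average of the radial basis functions centred on the data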
     
    Since we need to estimate the probability density function at some point, say x, we draw A points at random in the model (the data points) and use them as the centres of the radial basis functions. We then evaluate the resulting Parzen density at a different point of the model, and repeat this evaluation at B different points, always using the same A data points.
    Taking minus the mean of the log of these B density values directly gives an estimate of the entropy of the model. To obtain the entropy of the image and the joint entropy of the model and the image, we apply the same evaluation scheme to v(T(X)) and to the pairs (u(X), v(T(X))), having first computed the transforms of the random points.
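
    Putting the two previous steps together, here is a hedged sketch of this entropy estimate: A randomly drawn intensities serve as kernel centres, the Parzen density is evaluated at B other randomly drawn intensities, and minus the mean log density is the entropy estimate. The sample sizes A and B, the Gaussian kernel and its width sigma are all assumptions.

        import numpy as np

        def parzen_entropy(samples, A=50, B=50, sigma=1.0, seed=None):
            """Estimate H(X) = -E[ log p(X) ] with a Parzen-window density."""
            rng = np.random.default_rng(seed)
            centres  = rng.choice(samples, size=A, replace=False)   # the A data points (kernel centres)
            eval_pts = rng.choice(samples, size=B, replace=False)   # the B evaluation points
            z = (eval_pts[:, None] - centres[None, :]) / sigma
            kernels = np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))
            densities = kernels.mean(axis=1)          # Parzen estimate at each of the B points
            return -np.log(densities).mean()          # entropy = minus the mean log density

    The joint entropy H( u(X), v(T(X)) ) is obtained in the same way, with two-dimensional kernels over the pairs of intensities, once the random points have been mapped through the current estimate of T.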