We now return to image formation and camera geometry in a bit
more detail, to see how one *calibrates* a camera, that is, how one
determines the relationship between what appears on the image (or
retinal) plane and where it is located in the 3D world.

Imagine we have a three dimensional coordinate system whose origin is
at the centre of projection and whose *Z* axis is along the optical
axis, as shown in figure 1. This coordinate system is called the
*standard coordinate system* of the camera. A point *M* on an object
with coordinates (*X*,*Y*,*Z*) will be imaged at some point *m* = (*x*, *y*)
in the image plane. These coordinates are with respect to a coordinate
system whose origin is at the intersection of the optical axis and the
image plane, and whose *x* and *y* axes are parallel to the *X* and
*Y* axes. The relationship between the two coordinate systems (*c*,*x*,*y*)
and (*C*,*X*,*Y*,*Z*) is given by

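$$x = \frac{fX}{Z}, \qquad y = \frac{fY}{Z} \tag{1}$$

In pixel coordinates (*u*, *v*), writing *w*_{p} for the pixel width and *h*_{p}
for the pixel height, the image point lies at

$$u = u_c + \frac{x}{w_p}, \qquad v = v_c + \frac{y}{h_p} \tag{2}$$

We can express the transformation from three dimensional world coordinates
to image pixel coordinates using a 3 × 4 matrix. This is done by
substituting equation (1) into equation (2) and multiplying through by *Z*
to obtain

$$Zu = Z u_c + \frac{fX}{w_p}, \qquad Zv = Z v_c + \frac{fY}{h_p}$$

or, in homogeneous coordinates with scale factor *s* = *Z*,

$$\begin{pmatrix} su \\ sv \\ s \end{pmatrix} = \begin{pmatrix} f/w_p & 0 & u_c & 0 \\ 0 & f/h_p & v_c & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

We denote this 3 × 4 matrix by **P**.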

There are five camera parameters, namely the focal length *f*, the pixel width *w*_{p},
the pixel height *h*_{p}, the parameter *u*_{c} which is the *u* pixel coordinate
at the optical centre, and the parameter *v*_{c} which is the *v* pixel
coordinate at the optical centre. However, only four separable parameters
can be solved for, as there is an arbitrary scale factor involved
in *f* and in the pixel size. Thus we can only solve for the ratios
*α*_{u} = *f*/*w*_{p} and *α*_{v} = *f*/*h*_{p}. The
parameters *α*_{u}, *α*_{v}, *u*_{c} and *v*_{c} do not depend on the
position and orientation of the camera in space, and are thus called
the *intrinsic* parameters.
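
The projection and the scale ambiguity can be illustrated with a short Python sketch. It assumes the perspective relations *x* = *fX*/*Z*, *y* = *fY*/*Z* and pixel coordinates obtained by dividing *x* and *y* by the pixel width and height and offsetting by (*u*_{c}, *v*_{c}). The function name `project_to_pixels` and all numerical values are hypothetical, chosen only for illustration.

```python
def project_to_pixels(point, f, pixel_width, pixel_height, u_c, v_c):
    """Project a point (X, Y, Z), given in the camera's standard coordinate
    system, to pixel coordinates (u, v). Assumes the point has Z > 0."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    # Perspective projection onto the image plane.
    x = f * X / Z
    y = f * Y / Z
    # Convert image-plane coordinates to pixel coordinates.
    u = u_c + x / pixel_width
    v = v_c + y / pixel_height
    return u, v


# Made-up example values; all lengths are in millimetres.
M = (100.0, 50.0, 1000.0)
uv_a = project_to_pixels(M, f=8.0, pixel_width=0.01, pixel_height=0.01,
                         u_c=320.0, v_c=240.0)
# Doubling f and the pixel size together leaves both ratios
# f / pixel_width and f / pixel_height unchanged.
uv_b = project_to_pixels(M, f=16.0, pixel_width=0.02, pixel_height=0.02,
                         u_c=320.0, v_c=240.0)
print(uv_a, uv_b)
```

Both calls give the same pixel coordinates up to rounding, which is why image measurements alone can fix only the two ratios and not *f* and the pixel size separately.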

In general, the three dimensional world coordinates of a point will not
be specified in a frame whose origin is at the centre of projection and
whose *Z* axis lies along the optical axis. Some other, more convenient frame
will more likely be specified, and then we have to include a change of
coordinates from this other frame to the standard coordinate system. Thus
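we have

$$\begin{pmatrix} su \\ sv \\ s \end{pmatrix} = \mathbf{P}\,\mathbf{K} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

where (*X*, *Y*, *Z*) are now the coordinates of the point in the new frame and *s* is a scale factor.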

The camera matrix **P** and the homogeneous
transform **K** combine to form a single matrix **C**,
called the *camera calibration matrix*. We can write the general
form of **C** as a function of the intrinsic and extrinsic parameters:

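$$\mathbf{C} = \begin{pmatrix} \alpha_u \mathbf{r}_1 + u_c \mathbf{r}_3 & \alpha_u t_x + u_c t_z \\ \alpha_v \mathbf{r}_2 + v_c \mathbf{r}_3 & \alpha_v t_y + v_c t_z \\ \mathbf{r}_3 & t_z \end{pmatrix} \tag{3}$$

Here **r**_{1}, **r**_{2}, **r**_{3} denote the rows of the rotation part of **K**,
(*t*_{x}, *t*_{y}, *t*_{z}) denotes its translation part, and *α*_{u}, *α*_{v} are the
ratios of the focal length to the pixel width and pixel height.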
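
As a rough numerical illustration of how **P** and **K** combine, the sketch below builds both matrices and multiplies them. It assumes that **P** is the 3 × 4 matrix of intrinsic parameters, written in terms of the ratios focal length / pixel width and focal length / pixel height, and that **K** is a 4 × 4 homogeneous transform made of a rotation and a translation. The helper names and all numbers are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def intrinsic_matrix(alpha_u, alpha_v, u_c, v_c):
    """3x4 matrix P; alpha_u and alpha_v are focal length over pixel size."""
    return [[alpha_u, 0.0, u_c, 0.0],
            [0.0, alpha_v, v_c, 0.0],
            [0.0, 0.0, 1.0, 0.0]]


def rigid_transform(R, t):
    """4x4 homogeneous transform K from a 3x3 rotation R and translation t."""
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]],
            [0.0, 0.0, 0.0, 1.0]]


def project(C, point):
    """Apply a 3x4 matrix C to (X, Y, Z, 1) and divide by the scale s."""
    X, Y, Z = point
    su, sv, s = (r[0] * X + r[1] * Y + r[2] * Z + r[3] for r in C)
    return su / s, sv / s


# Made-up extrinsics: a 90 degree rotation about Z, then a shift along Z.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 500.0]

P = intrinsic_matrix(alpha_u=800.0, alpha_v=800.0, u_c=320.0, v_c=240.0)
K = rigid_transform(R, t)
C = matmul(P, K)  # single 3x4 matrix combining intrinsics and extrinsics

world_point = (50.0, -100.0, 500.0)
u, v = project(C, world_point)
print(C)
print(u, v)
```

Multiplying once and reusing **C** avoids applying the change of frame and the projection separately for every point.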

Consider a translation of -*f* along the *Z* axis of the standard
coordinate frame, so that the focal plane and the image plane
are now coincident. Since there is no rotation involved in this
transformation, it is easy to see that the camera calibration matrix
is just