STEREOSCOPIC VISION AND PERSPECTIVE PROJECTION

David Young, February 1993 revised Jan 1994

This teach file gives an introduction to stereoscopic vision. It also describes perspective projection.

Please read the introduction giving general information about the nature of this document.

Preliminaries
Introduction
Image and camera coordinates
Perspective projection
Stereo geometry for parallel cameras
Stereo geometry for converging cameras
Three-D layout from a stereo pair
The correspondence problem
- Matching using feature similarity
- Other matching methods
Summary

Preliminaries

You need to have read and understood TEACH VISION1. It will be helpful to have looked at VISION2, VISION3 and VISION4.

This is a long teach file, but it has not been split as the material belongs together. You could split it yourself in two ways. The first is to take it in two parts, perhaps breaking after the section on "Stereo geometry for converging cameras". The second way, which is possibly better, is to go through the file a first time without paying too much attention to the equations and Pop-11 code, but executing the latter so as to see the demonstrations; then on a second pass you could study the equations and look at the code more carefully to see how the ideas are put into practice.

Before proceeding, execute the following chunk of code to avoid having to load the libraries later.

    uses popvision          ;;; search vision libraries
    uses rci_show           ;;; image display
    uses rc_context         ;;; allows multiple windows for graphics
    uses rc_graphplot       ;;; graph drawing
    uses arrayfile          ;;; array storage
    uses float_byte         ;;; type conversion
    uses float_arrayprocs   ;;; array arithmetic
    uses canny              ;;; Canny edge detection

We will need two images of a scene, from different viewpoints, so execute this now:

    vars imageR, imageL;
    arrayfile(popvision_data dir_>< 'stereo1.pic') -> imageR;
    float_byte(imageR, false, false, 0, 255) -> imageR;
    arrayfile(popvision_data dir_>< 'stereo2.pic') -> imageL;
    float_byte(imageL, false, false, 0, 255) -> imageL;

Introduction

Viewing a scene from two (or more) different positions simultaneously allows us to make inferences about 3-D structure, provided that we can match up corresponding points in the images. The visual systems of humans and some other animals make use of this, and it is very important in attempts to develop practical computer vision systems.

Display the stereo pair of images which we read in above, with:

    1 -> rci_show_scale;        ;;; keep them small
    500 -> rci_show_x;         ;;; set the position for the L window
    350 -> rci_show_y;
    rci_show(imageL) -> ;
    rci_show_x + 120 -> rci_show_x;  ;;; position for the R window
    rci_show(imageR) -> ;

(This code carefully positions the windows beside one another. If you want to move the windows, do it by changing the values for rci_show_x and rci_show_y and re-executing the code, to keep the relative positions the same. Remember that you can get rid of unwanted image windows just by clicking on the image.)

You may find it possible to fuse these images in your own visual system. I did it by iconifying all the other windows so that only the stereo pair was visible against a plain background. (If you have a patterned X background enlarge an xterm or other window to cover most of the screen, and put it behind the images by selecting "back" from its titlebar menu.) I then looked at the images from a distance of about 10 inches (25 cm). So that I could only see the left image with my left eye and the right image with my right eye, I held a piece of A4 card up so that one long edge was against the screen between the two images, and the opposite edge was touching my nose. I consciously relaxed my eyes as if I was looking at something in the distance; although this made the images go out of focus, after a while I could get them to fuse as if I was looking at a single picture. I found I could then get this back into focus without breaking the fusion. At that point it looked as if there was a tiny tripod clearly standing out in front of the screen, with the door handle and other objects in the background further away.

In fact, the Victorians invented technology for presenting stereo pairs more conveniently than this, and were enthusiastic about viewing them. You can still find Victorian stereoscopes and collections of stereo pairs in antique shops. Every so often, film-makers discover they can use stereoscopic presentation to dramatic effect, though the audience has to wear glasses with coloured or polarising filters to ensure that a different image goes to each eye. Map-makers use stereo presentation of aerial photographs to estimate ground topography.

The stereo pair of images on the screen was obtained using two camera positions separated horizontally. Viewing the images normally, you can see how the change in camera position has resulted in a change in the position of the tripod's image relative to the images of other objects. There are two computational issues to address: first, how these image differences can be translated into 3-D information; and second, how the images can be matched to one another so that the differences between them can be measured.

Image and camera coordinates

Before we can look at stereo geometry, we need to introduce the relationship between image and object positions for a single camera.

Positions in the image will be expressed using a coordinate system with its origin at the centre of the image, x running from left to right, and y running from bottom to top:

          -----------------------------------------------
          |                      ^                      |
          |               y-axis |                      |
          |                      |                      |
          |                      |                      |
          |                      |                      |
          |                      |                      |
          |                      |                      |
          |                      |                      |
          |  --------------------+------------------->  |
          |                      |             x-axis   |
          |                      |                      |
          |                      |                      |
          |                      |                      |
          |                      |                      |
          |                      |                      |
          |                      |                      |
          |                                             |
          -----------------------------------------------

It is easy to convert from x and y to column and row indices in the array representing the image. If the image centre is at column c0 and row r0, then column = x + c0 and row = r0 - y.

Positions in space will be expressed using 3-D coordinates X, Y and Z (not to be confused with the X and Y used in some earlier files as synonyms for column and row). The (X, Y, Z) system has its origin at the camera lens, with Z pointing out along the optic axis of the camera (i.e. at right angles to the image plane), X parallel to x, and Y parallel to y.

Suppose we have a situation where Y is actually vertical - as is the case for the tripod test images. Then looking down on the camera and scene from above gives this layout of the X and Z coordinate axes :

                                 ^
                          Z-axis |
                                 |
                                 |
                     Objects being viewed in this
                             general area
                                 |
                                 |
                                 |
                                 |
                                 |
                                 |
                           lens ---
                               -- -- ------------------>
                               |   |             X-axis
                       Camera  |   |
                               -----

The Y axis is pointing at you out of the screen. The intersection of the axes is at the optical centre of the camera lens. The (X, Y, Z) system is called camera coordinates because it is fixed with respect to the camera, centred on it, and aligned with it. In robotics, other coordinate systems fixed with respect to an object being manipulated, part of a robot arm, or some mechanical framework in which the robot is working, are often employed as well.

Perspective projection

What is the relationship between the camera coordinates of a point in space and the coordinates of its image? If we assume a camera with a planar image surface and no lens distortions (and this camera model is usually quite close to reality), the mathematical form of the relation is quite simple:

These are the equations of perspective projection. The constant f is the focal length of the camera, and is the distance from the optical centre of the lens to the image plane. X, Y and Z specify the position of a point in space; x and y specify the position of its image in the image plane. These equations allow us to work out x and y if we know X, Y and Z, but do not allow us to go in the other direction, because knowing x and y is not sufficient on its own to pin down the position of the point in 3-D.

There is a simple geometrical derivation of these formulae. A lens behaves approximately like the pinhole of a pinhole camera (except it produces a much brighter image), and so we can draw a straight line from a point in space, through the lens centre, to its image. For a point on the same horizontal level as the camera (Y = 0), the resulting diagram, looking down on the scene as in the diagram above, is:

                                 ^
                          Z-axis |
                                 |         . <- Position of point
                                 |        /|    in space
                                 |       / |
                                 |      /  |
                                 |     /   |
                                 |    /   Z|
                                 |   /     |
                                 |  /      |
                                 | /       |
                    Camera lens  |/   X    |
                    centre ->  --+--------------------->
                                /|                 X-axis
                               / |f
          Position of image \ /  |
          of point          ---------- <- Image plane in camera
                               x

The relation between x and X now follows from similar triangles (via x/f = X/Z). Since a movement along the Y axis will not affect x, this relation continues to hold even if Y is non-zero. The corresponding relation between y and Y is obtained by drawing the side-view diagram.

It looks from the diagram above as if the x-axis has to run from right to left, in the opposite direction to the X-axis. However, if you look at the image actually formed in the camera it is left-to-right inverted and upside-down compared with the way we display the digitised image on the screen. (You can see this with an ordinary camera by putting tracing paper where the film should go, and keeping the back and the lens open.) When we display the image "right way round" on the screen, the x and y axes get flipped, so that they run in the same directions as the X and Y axes. The easiest way to deal with this is to imagine that the camera forms a virtual image which is in front of the lens instead of behind it, and to pretend that this is what gets digitised. The diagram then looks like this:

                                 ^
                          Z-axis |
                                 |         . <- Position of point
                                 |        /     in space
                                 |       /
                                 |      /
                                 |     /
                                 |    /
  Virtual image plane  -> -------+---+---->
                                 |  /   x-axis
                                f| /
                    Camera lens  |/
                    centre ->  --+--------------------->
                                                     X-axis

This is just a different way of modelling the camera; it is less like the reality in terms of physical layout, but it is easier to work with than the image-behind-the-lens model because the virtual image is "right way round". The similar triangles argument still applies, and gives the same mathematical relationships between x and X, and y and Y, so the models are formally equivalent.

The projection equations obviously work if all the quantities are measured in the same units. However, we would normally like to measure distances in the image in pixel units, where 1 pixel unit is just the distance between the centres of two neighbouring pixels, because then array indices correspond to position. Distances in the scene will usually be measured in something like inches or millimetres. It turns out that the x equation is still valid as long as x and f are in the same units as one another, and X and Z are in the same units as one another (and similarly for the y equation). Thus if we know f in pixel units, everything works out conveniently. Finding the value of f in pixel units is basic camera calibration, and is usually best done by making measurements of the image of an object of known size and distance. It is also possible to estimate f in pixel units from the focal length of the lens (usually stamped on it in millimetres) and the number of pixels per millimetre in the camera detector.

Very high accuracy requires a camera model which takes into account lens distortions and detector irregularities. Such models require more elaborate calibration.

Stereo geometry for parallel cameras

Now suppose we have two cameras imaging the scene. It will be simplest to start by assuming that their optical axes are lined up parallel to one another. We also assume that they are side by side - or more exactly, that the line joining their optical centres is parallel to their x-axes. This means that the image of a point will have the same y coordinate for the two cameras. The top view is something like:

                        Objects being viewed

                       in this general region





                        ---                ---
                       -- --              -- --
               Left    |   |      Right   |   |
               Camera  |   |      Camera  |   |
                       -----              -----

Positions in space could be described in either of the two camera coordinate systems, but it is convenient to describe positions relative to an imaginary third camera half-way between the two real cameras. We can call this the cyclopean coordinate system (the cyclops only had one eye in the middle of its head). Schematically, looking down on the scene:

                           ^
                    Z-axis |
                           |
                           |
                           |
              Objects out here specified
                using (X, Y, Z) coords
                           |
                           |
                           |
                           |
                           |
 Left                      |   Right
 image -----+----->        |   image -----+----->
                 xL        |                   xR
                           |
                           |
    --------+--------------+--------------+--------------->
           /                \              \            X-axis
        Left camera          \              \
        optical centre        \              Right camera optical
                               \             centre
                                \
                                 Origin of cyclopean coordinates

            <------------- D ------------->

Here xL and xR are the x coordinates for an image point in the left and right images respectively. D is the separation between the cameras' centres.

Now if a point at (X, Y, Z) in cyclopean coordinates produces images at (xL, y) and (xR, y) in the two cameras, we can find its position in space from the image positions using the formulae:

                          D (xL + xR)
                    X  =  -----------
                          2 (xL - xR)

                            D y
                    Y  =  -------
                          xL - xR

                            D f
                    Z  =  -------
                          xL - xR

The quantity xL-xR which appears in all three formulae is called the stereo disparity (or just the disparity) of the point. It is the difference in the horizontal positions of the images of a point.

D must be measured in the same units as X, Y and Z, and xL, xR, y and f can all be in pixel units.

Deriving these equations is fairly straightforward. The main thing to use is that the perspective projection equations in x for the left and right cameras become xL = (X + D/2)f/Z and xR = (X - D/2)f/Z. This is because to get from X in cyclopean coordinates to X in left or right camera coordinates you simply adjust for the offsets of the cameras from the cyclopean origin, whilst Z is the same in all three coordinate systems. A little algebra then gives the expressions above for X and Z, and that for Y follows easily.

Note the structure of the Z equation. Z is often termed the depth of an object relative to the cameras. A small disparity yields a large depth, and vice versa.

Stereo geometry for converging cameras

It is not usually convenient to set up the cameras with their axes parallel, because this limits the region of space in which objects are visible in both images. It is more normal to aim the cameras so that their axes are angled inwards, and converge on the objects of interest. The point in space where the optical axes intersect might be described as the fixation point for the cameras. The image of an object at the fixation point is centred for both cameras, and so has zero disparity.

         ^                      . <- Fixation point
         |                     / \
         |                    /   \
         |                   /     \
         |                  /       \
         |                 /         \
         Z0               /     ^     \   <-- Optic axes of cameras
         |               /     Z|      \
         |              /       |       \
         |             /        |        \
         |            /         |         \
         |           /          |          \
         |          /           |           \
         V         O       -----+----->      O
                 Left                X      Right
                 Camera                     Camera

It turns out that this makes the geometry considerably more complicated, and the exact equations for getting from image points to space points are much less straightforward than for parallel cameras. Matching points will not even generally lie at the same y coordinate. Provided the amount of convergence is small, though, we can ignore this vertical disparity, and there is a reasonable approximation which can be used for the equations. Call the depth of the fixation point Z0. Then the disparities can be adjusted by adding the quantity fD/Z0, which is the disparity an object at the fixation point would have if the axes were parallel. The new equations become:

                          D (xL + xR)
                    X  =  -----------
                              2 p

                          D y
                    Y  =  ---
                           p

                          D f
                    Z  =  ---
                           p
                                    f  D
    where           p  =  xL - xR + ----
                                     Z0

Z0 is another parameter that must be found by calibration.

It will be convenient to have the formulae available as a Pop-11 procedure, which we define next.

    define images_to_world(colL, colR, row, c0, r0, D, f, p0)
            -> (X, Y, Z);
        ;;; Calculates the cyclopean coordinates of a point, using
        ;;; the near-parallel approximation. Arguments are:
        ;;;     cL - the column in the left image
        ;;;     cR - the column in the right image
        ;;;     row - the row, assumed same for both images
        ;;;     c0 - the column at the origin of image coords
        ;;;     r0 - the row at the origin of image coords
        ;;;     D - the camera separation in world units
        ;;;     f - the focal length in pixel units
        ;;;     p0 - the disparity shift in pixel units.
        lvars colL, colR, row, c0, r0, D, f, p0, X, Y, Z;
        ;;; Convert from col/row to image coords
        lvars xL, xR, y;
        colL - c0 -> xL;
        colR - c0 -> xR;
        r0 - row -> y;
        ;;; Get adjusted disparity
        lvars p;
        xL - xR + p0 -> p;
        ;;; and apply stereo formulae
        D * (xL + xR) / (2 * p) -> X;
        D * y / p -> Y;
        D * f / p -> Z
    enddefine;

This procedure implements our simplified model of stereo geometry.

Three-D layout from a stereo pair

(This section and the following one incorporate a fair amount of Pop-11 code. This is to make it easy for you to see what is being done at the programming level, if you want to, and to experiment. However, as pointed out at the start of the file, you can follow the main points simply by executing the Pop-11 sections when you get to them, and reading the text between - you can always return to look at the programs later.)

We are now in a position to work out some of the layout of the tripod stereo pair. We will assume the following parameters for the camera set-up:

    vars
        c0 = 128,    ;;; Column at x-origin, c0   =  128
        r0 = 128,    ;;; Row at y-origin, r0      =  128
        D  =  20,    ;;; Separation, D            =   20 cm
        f  = 200,    ;;; Focal length, f          =  200 pixel units
        Z0 = 100;    ;;; Fixation distance, Z0    =  100 cm

We can define a procedure that converts from image positions to cyclopean coordinates for this specific set-up by making a closure (see HELP *CLOSURES) of our general procedure. We do this with:

    define im_to_w =
            images_to_world(% c0, r0, D, f, D*f/Z0 %)
    enddefine;

Now we need to get some matching points. We will take the edges from the Canny edge detector (see TEACH VISION3) as the features to match, so first apply the Canny operation, and threshold the results to give binary edge maps:

    vars cannL, cannR;
    vars sigma = 1, t1 = 5, t2 = 10;    ;;; Canny parameters
    canny(imageL, sigma, t1, t2) -> ( , , cannL);
    float_threshold(0, t1/2, 1, cannL, cannL) -> cannL; ;;; threshold
    canny(imageR, sigma, t2, t2) -> ( , , cannR);
    float_threshold(0, t1/2, 1, cannR, cannR) -> cannR; ;;; threshold

Display the edge maps with:

    2 -> rci_show_scale;                    ;;; bigger is easier to see
    vars winCL, winCR;                      ;;; to save the windows
    500 -> rci_show_x;                     ;;; position the windows
    rci_show_y + 170 -> rci_show_y;
    rci_show(cannL) -> winCL;
    rci_show_x + 240 -> rci_show_x;
    rci_show(cannR) -> winCR;

(Output is displayed a little further on.)

We need a procedure for getting the features in a given row. This is simple enough:

    define features(image, cstart, cend, row) -> list;
        ;;; Returns a list of the column numbers for which there are
        ;;; non-zero pixels in the given row of the image, between
        ;;; columns cstart and cend.
        lvars image, cstart, cend, row, list;
        lvars col;
        [%                  ;;; start building list
            for col from cstart to cend do
                if image(col, row) /= 0 then
                    col                 ;;; put this column in the list
                endif
            endfor
        %] -> list
    enddefine;

Now we choose a row - say 120, which goes through the tripod head. We'll display the position of this row on the screen:

    vars row = 120;         ;;; set up the row
    winCL -> rc_window;     ;;; draw on the left edge map
    rci_show_setcoords(cannL);  ;;; set the scales correctly
    XpwSetColor(rc_window, 'red') -> ;
    rc_jumpto(83, row);     ;;; draw a line along one row
    rc_drawto(173, row);
    winCR -> rc_window;     ;;; draw on the right edge map
    XpwSetColor(rc_window, 'red') -> ;
    rc_jumpto(83, row);
    rc_drawto(173, row);

and collect the right and left features for it, for all columns from 83 to 173 (i.e. the whole width of the image):

    vars featuresL, featuresR;
    features(cannL, 83, 173, row) -> featuresL;
    features(cannR, 83, 173, row) -> featuresR;
    featuresL ==>
prints:
    ** [100 120 133 159 163 167]
    featuresR ==>
prints:
    ** [90 93 99 112 158 162 166]

The list for the right image has one more entry than that for the left image. Inspecting the edge maps shows that this is due to the fact that the second edge from the left, at column 93 in the right image, is not represented in the left image. We can therefore make the two lists correspond by omitting this feature - of course, this is cheating, and a real stereo system will have to do this automatically. For now, though, do:

    vars reduced_featuresR;
    delete(93, featuresR) -> reduced_featuresR;

Now we can look at what layout the stereo procedure gives. We need a diagram looking down on the tripod - this will produce it, showing the cyclopean coordinate system:

    false -> rc_window; ;;; avoid destroying present rc_window
    rc_new_window(200, 600, 10, 10, false); ;;; start a new window
    [-20 20 0 120] -> rcg_usr_reg;           ;;; set the axis limits
    rc_graphplot([], 'X', [], 'Z') -> ;

Move the window with the mouse if it is not in a convenient place. We can mark the positions of the two cameras, remembering that their separation is D, so they are D/2 either side of the origin:

    XpwSetColor(rc_window, 'red') -> ;
    rcg_plt_square(-D/2, 0);                ;;; mark X = -D/2, Z = 0
    rcg_plt_square( D/2, 0);

Now we can apply our stereo procedure to each pair of features, and plot the corresponding positions in the X-Z plane:

    vars colL, colR, X, Y, Z;
    for colL, colR in featuresL, reduced_featuresR do
        im_to_w(colL, colR, row) -> (X, Y, Z);
        rcg_plt_cross(X, Z)
    endfor;

We see the position of the two sides of the tripod head shown at about Z = 66, the object at the left further away at about Z = 80, and three points on the door frame to the right at about Z = 98.

You can easily repeat this with other rows from the image, but it is tedious manually deciding which features to delete in order to make the others match up.

The correspondence problem

The problem of pairing up the features is known as the correspondence problem. If we apply no constraints, any feature in the left image could match any feature in the right image. A small modification of the code above allows you to display the positions of all the possible matches between the two lists of features.

We have already used the most basic constraint, when we looked for matching features in a single row of the image. That is, we have assumed that a feature with a particular y value can only match a feature in the other image with the same y value. This is known as the epipolar constraint. For converging cameras, using the same y value is actually an approximation, and to be exact we should search for matching features on epipolar lines, which are curves close to the rows of the image. We will continue to use the approximation that the epipolar lines are the same as the image rows.

There remains the problem of matches between all the features with the same y. There are three main constraints that can help:

A given feature can match at most one feature from the other image.
Similar features match each other.
Features close together in the image should have similar disparities.

The bases of these constraints are discussed in various books - those by Frisby and Marr (see TEACH *VISION) are probably best. The first constraint is straightforward to apply, but there is considerable flexibility in the way the other two can be used. We will initially explore the use of the second constraint.

Matching using feature similarity

There are many ways of measuring feature similarity. For example, some systems make use of the orientation of the edge. Here we will adopt a measure based on the grey-levels in the image: we will add up the differences in grey-levels in a patch of pixels round the feature, on the assumption that matching regions should have a similar local structure and overall grey-level. To avoid negative and positive differences cancelling out, we will square each difference before doing the addition. Measures like this, based on sums of squares, are extremely common. Since our features are likely to correspond to the boundaries of objects, we will look at two patches, one on each side of the feature, and accept a match if either the grey-level differences to the left of the feature are small, or the differences to the right of the feature are small. (Horizontal edges are no good for matching anyway, so taking left-right patches is reasonable, though one could undoubtedly improve on this using edge orientation information.)

This is implemented in the following procedure:

    define mismatch(col1, col2, row, image1, image2) -> result;
        ;;; Returns a measure of the mismatch between a feature at
        ;;; (col1, row) in image1 and a feature at (col2, row) in
        ;;; image2.
        lvars col1, col2, row, image1, image2, q;
        ;;; These constants set the size of the patches at 5x5
        lconstant rsize = 2, csize = 5;
        lvars r, c, d1, d2, s1=0, s2=0;     ;;; initialise sums
        for r from row-rsize to row+rsize do
            for c from 1 to csize do
                ;;; Get-grey-level differences to left and right
                image1(col1-c, r) - image2(col2-c, r) -> d1;
                image1(col1+c, r) - image2(col2+c, r) -> d2;
                ;;; Add up sums of squares.
                s1 + d1 * d1 -> s1;
                s2 + d2 * d2 -> s2;
            endfor
        endfor;
        min(s1, s2) -> result       ;;; take the smaller mismatch
    enddefine;

The procedure is called mismatch because the larger the result it returns, the worse the match between the two features. To keep the argument lists short, mismatch has the parameters controlling the patch size (5 x 5) built into it. These values happen to be OK for this demonstration, but of course a more general library routine might well have them as arguments.

For a given feature in the left image, we can now pick the feature in the right image which matches it best. We can also do the same thing in reverse, starting from a feature in the right image and looking for the best match in the left image. If the two agree (i.e. two features select each other as closest matches), then it is reasonable to accept the pairing as a definite match. The following procedures implement this. Any feature that is only visible in one image should be weeded out. The first constraint (only one match for any given feature) is built in. In addition, to avoid having to check for wild matches, a disparity limit of 40 pixels is rather arbitrarily built in too - this would, of course, be unacceptable in a library procedure.

    define bestmatch(col1, f2, row, image1, image2) -> b;
        ;;; Returns the best match for the feature at (col1, row) in
        ;;; image1 from the list of features at (col2, row) in image2
        ;;; where col2 is an element of f2.
        ;;; Ignores features with a disparity greater than this limit.
        lconstant disp_limit = 40;
        lvars col1, f2, row, image1, image2, b;
        lvars col2, q, best = false;
        for col2 in f2 do
            if abs(col1 - col2) <= disp_limit then
                mismatch(col1, col2, row, image1, image2) -> q;
                if not(best) or q < best then
                    q -> best;
                    col2 -> b
                endif
            endif
        endfor
    enddefine;

    define findmatches(fL, fR, row, imageL, imageR) -> (newfL, newfR);
        ;;; Given a list of features fL in the left image and fR in the
        ;;; right image, returns two lists of the same length which
        ;;; contain features in one-to-one correspondence.
        lvars fL, fR, row, imageL, imageR, newfL = [], newfR = [];
        lvars colL, colR;
        for colL in fL do
            ;;; Find best match for left feature
            bestmatch(colL, fR, row, imageL, imageR) -> colR;
            ;;; Find best match for right feature, and see if it agrees
            if colR     ;;; test there is some matching feature
            and bestmatch(colR, fL, row, imageR, imageL) == colL then
                colL :: newfL -> newfL;
                colR :: newfR -> newfR
            endif
        endfor;
        ;;; Make the lists run left to right (not essential)
        ncrev(newfL) -> newfL;
        ncrev(newfR) -> newfR
    enddefine;

That then provides the basis of a simple stereo matching program, using grey-level matching starting from edge features. Applied to the features we had before, it gives:

    findmatches(featuresL, featuresR, row, imageL, imageR)
        -> (featuresL, featuresR);
    featuresL =>
prints:
    ** [100 120 133 159 163 167]
    featuresR =>
prints:
    ** [90 99 112 158 162 166]

- the feature at column 93 in the right image has been omitted, as required for a correct set of matches.

We can check how this behaves on the complete images. We need a procedure to iterate over rows and get all the matches. We will store each match as a vector containing the column and row numbers, and place all of them in one long vector.

    define findallmatches(imageL, imageR, edgeL, edgeR) -> matchvec;
        ;;; Returns the columns and row for each pair of matching
        ;;; features in the two images. edgeL and edgeR must be the edge
        ;;; maps for imageL and imageR.
        lvars imageL, imageR, edgeL, edgeR, matchvec;
        lvars row, featuresL, featuresR, colL, colR,
            (ic0, ic1, ir0, ir1) = explode(boundslist(imageL)),
            (ec0, ec1, er0, er1) = explode(boundslist(edgeL));
        ;;; Must avoid pixels adjacent to the boundaries to ensure
        ;;; matching procedure does not try to go outside the array.
        ;;; Next constants must correspond to those in mismatch
        lconstant rsize = 2, csize = 5;
        lvars
            cstart = max(ec0, ic0 + csize),
            cend = min(ec1, ic1 - csize);
        {%      ;;; start building big vector
            for row from max(er0, ir0+rsize) to min(er1, ir1-rsize) do
                ;;; Get the features.
                features(edgeL, cstart, cend, row) -> featuresL;
                features(edgeR, cstart, cend, row) -> featuresR;
                ;;; Find one-to-one matches
                findmatches(featuresL, featuresR, row, imageL, imageR)
                    -> (featuresL, featuresR);
                ;;; Get the matches for this row into vectors
                for colL, colR in featuresL, featuresR do
                    {% colL, colR, row %}  ;;; left on stack
                endfor
            endfor
            %} -> matchvec;     ;;; finish building big vector
    enddefine;

This involves a fair amount of computation, so the next step will take some time:

    vars matchvec;
    findallmatches(imageL, imageR, cannL, cannR) -> matchvec;

(The program could be made more efficient, since it calculates many match strengths twice.)

We now have all the matches. First, we can use them to plot all the matches in the (Z, Y) plane, superimposing different X values. This gives a side view of the scene, with the camera positions both at the origin. The cameras are looking left to right in the diagram drawn by the code that follows.

    rc_new_window(300, 300, rci_show_x, rci_show_y, false);
    [0 120 -60 60] -> rcg_usr_reg;      ;;; set coord region
    rc_graphplot([], 'Z', [], 'Y') -> ; ;;; show coords, set scales
    XpwSetColor(rc_window, 'red') -> ;

    appdata(matchvec,
        procedure(match);
            lvars match, X, Y, Z;
            ;;; Use the 2-D to 3-D procedure.
            im_to_w(explode(match)) -> (X, Y, Z);
            rc_drawpoint(Z, Y);
        endprocedure);

The results are far from perfect, but the tripod at about Z = 65, the object at the left at Z = 80, and the background objects at Z = 90 to 110 show up clearly, with a few false matches scattered around. The results can also be projected onto the (X, Z) plane, which you can try, or indeed plotted as if they were viewed from any given direction, though that requires coordinate rotations which cannot be covered here.

Second, we can look at the matches superimposed on the edge maps, by colour-coding the features according to disparity.

    vars colours = {'red' 'blue' 'green' 'yellow' 'purple' 'orange'
        'pink' 'firebrick' 'magenta' 'brown'};
    rci_show_setcoords(cannL);  ;;; Set coord system
    appdata(matchvec,
        procedure(match);
            lvars match, colL, colR, row, colour;
            dlocal rc_window;
            explode(match) -> (colL, colR, row);
            ;;; arbitrary mapping from disparities to colours
            colours((colL-colR+40) div 8 + 1) -> colour;
            winCL -> rc_window;
            XpwSetColor(rc_window, colour) -> ;
            rci_drawpoint(colL, row);
            winCR -> rc_window;
            XpwSetColor(rc_window, colour) -> ;
            rci_drawpoint(colR, row);
        endprocedure);

This makes it easy to see where incorrect matches have occurred, and also provides a form of depth map of the scene, with different colours indicating different distances away from the camera.

Other matching methods

Although matching based on grey-level least-squares differences worked reasonably well, it is computationally quite expensive, and there were some erroneous matches. One way round these problems might be to match higher-level features, such as straight line segments, and this kind of approach is sometimes used in practice.

Another set of techniques involves using the third constraint listed above: that points near each other in the image will usually have similar disparities, because the scene is made of coherent surfaces This constraint has been formalised in various ways, for example in terms of the disparity gradient limit used in the system developed by a group at Sheffield, described in detail in TEACH *PMF. Prior to that, the constraint was discussed by Marr and used in an interesting way in his cooperative algorithm for finding correspondences. Using disparity smoothness is probably particularly appropriate when trying to match images with a high density of features, as is the case with random dot stereograms. It would be less suitable for the rather sparse features we have been working with.

Marr also proposed a method of stereo matching based on scale space ideas. In this, the zero-crossings of large-scale DoG filters are matched initially, giving rough estimates for the disparities; these are used as a guide to initial disparities for matching at a smaller scale, and so on until matching is done on the smallest visible details at high accuracy. Nishihara developed this approach in a program intended for real-time stereo matching, which works on the kind of binary images displayed in the zero-crossing section of TEACH VISION3.

A rather different way of estimating stereo disparities has recently become prominent. In this, the outputs of a set of convolution masks, sensitive to particular spatial frequencies (see TEACH VISION3), are used to estimate local phase shifts between the left and right images. This kind of method is probably best suited to images with small disparities.

Summary

On a first reading, you should:

know what the terms stereoscopic vision, perspective projection, stereo disparity, fixation point and correspondence problem mean;
have some idea how the pinhole camera model allows us to relate the positions of points in 3-D to the positions of their images in 2-D;
have seen how 3-D information is implicit in a stereo pair of images;
understand how grey-level matching can help solve the correspondence problem;
be aware of other approaches to solving the correspondence problem.

When you have followed the file in detail, you should also:

understand the image, camera and cyclopean coordinate systems;
know the equations for perspective projection, and understand how they were derived using similar triangles;
understand the image-in-front-of-the-lens camera model;
have looked at, though not necessarily derived, the stereo position recovery equations for parallel and converging cameras;
understand the term epipolar constraint;
understand the sum-of-squares match quality measure, and how it was applied;
understand the Pop-11 code used;
be in a position to try out, if you wish, alternative approaches to stereo matching, and investigate their consequences.

Here are links to:

STEREOSCOPIC VISION AND PERSPECTIVE PROJECTION

Contents