Next: Making Complete Surface Hypotheses Up: Surface Data as Input Previous: The Labeled Segmented Surface

Why Use Surfaces for Recognition?

It was Marr who, in advocating the $2\frac{1}{2}{\rm D}$ sketch as an intermediate representation [112], brought surfaces into focus. Vision is obviously a complicated process, and most computer-based systems have been incapable of coping with both the system and scene complexity. The importance of Marr's proposal lies in having a reconstructed surface representation as a significant intermediate entity in the image understanding process. This decision laid the foundation for a theory of vision that splits vision into those processes that contribute to the creation of the $2\frac{1}{2}{\rm D}$ sketch and those that use its information.

A considerable proportion of vision research is currently involved in generating the $2\frac{1}{2}{\rm D}$ sketch (or equivalent surface representations). This work addresses the problem of what to do after the sketch is produced. Some justifications for, and implications of, using this surface information are discussed in the sections below.

From What Sources Can We Expect to Get Surface Information?

The research presented here is based on the assumption that there will soon be practical means for producing surface images. Several promising research areas suggest that this is likely, though none of the processes described here are "perfected" yet.

Direct laser ranging (e.g. [122,98,143]) computes surface depth by measuring time of flight of a laser pulse or by signal phase shift caused by path length differences. The laser is scanned over the entire scene, producing a depth image. Sonar range finding gives similar results in air, but has lower resolution and has problems because of surface specularity.
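The two ranging principles just described reduce to short formulas. The following is a minimal sketch, assuming propagation at the vacuum speed of light and, for the phase method, a single amplitude-modulation frequency. The function names are illustrative and are not taken from any of the cited systems.

```python
import math

C = 299_792_458.0  # speed of light in vacuum (m/s); air is slightly slower


def depth_from_time_of_flight(round_trip_s: float) -> float:
    """Range to the surface from the round-trip time of a laser pulse."""
    # The pulse travels to the surface and back, hence the factor of 2.
    return C * round_trip_s / 2.0


def depth_from_phase_shift(phase_rad: float, mod_freq_hz: float) -> float:
    """Range from the phase shift of an amplitude-modulated beam.

    The result is only unambiguous within ambiguity_interval(mod_freq_hz),
    because the measured phase wraps around every 2*pi.
    """
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)


def ambiguity_interval(mod_freq_hz: float) -> float:
    """Largest range that can be measured before the phase wraps."""
    return C / (2.0 * mod_freq_hz)
```

Under these assumptions, a 10 MHz modulation frequency gives an ambiguity interval of roughly 15 m, which suggests why phase-based sensors trade range against depth resolution.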

Structured illumination uses controlled stimulation of the environment to produce less ambiguous interpretations. One well known technique traces the scene with parallel light stripes ([147,5,133,150,127,36]). The three dimensional coordinates of individual points on the stripes can be found by triangulation along the baseline between the stripe source and sensor. Additionally, this technique highlights distinct surfaces, because all stripes lying on the same surface will have a similar character (e.g. all lines parallel), and usually the character will change radically at occlusion or orientation discontinuity boundaries. Analysis of stripe patterns may give further information about the local surface shape. Variations of this technique use a Gray-code set of patterns (e.g. [97]) or color coding ([99,37]) to reduce the number of projected patterns.

A second technique uses one or more remotely sensed light spots. By knowing the emitted and received light paths, the object surface can be triangulated, giving a depth image ([101,132,63]). The advantage of using a spot is that there is no feature correspondence problem.
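The triangulation step shared by the stripe and spot techniques can be sketched in a single plane. This is a simplified illustration, assuming that the source sits at the origin, the sensor sits at distance `baseline` along the x axis, and both ray angles are measured from the baseline. The names are hypothetical and do not describe any particular cited sensor.

```python
import math


def triangulate(baseline: float, source_angle: float, sensor_angle: float):
    """Locate a lit surface point from the two ray angles (radians).

    Returns (x, z): x is measured along the baseline from the source,
    z is the perpendicular distance of the point from the baseline.
    """
    if not (0.0 < source_angle and 0.0 < sensor_angle
            and source_angle + sensor_angle < math.pi):
        raise ValueError("rays do not intersect in front of the baseline")
    # baseline = z * (cot(source_angle) + cot(sensor_angle))
    z = baseline / (1.0 / math.tan(source_angle) + 1.0 / math.tan(sensor_angle))
    x = z / math.tan(source_angle)
    return x, z
```

Because the emitted ray direction is known, each sensed spot yields one depth value without any search for matching features.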

Besl [25] gives a thorough and up-to-date survey of the techniques and equipment available for range data acquisition.

Stereo is becoming a more popular technique, because it is a significant biological process ([112,113]) and its sensor system is simple and passive. The process is based on finding common features in a pair of images taken from different locations. Given the relationship between the camera coordinate frames, the feature's absolute location can be calculated by triangulation.
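For the simplest camera arrangement the triangulation has a closed form. The sketch below assumes two identical pinhole cameras with parallel optical axes, separated by `baseline` along the x axis, with image coordinates measured in pixels from each principal point. These are assumptions made for illustration; the text allows any known relationship between the camera frames.

```python
def triangulate_rectified(x_left, x_right, y, focal_px, baseline):
    """3D position (in the left camera frame) of a matched feature.

    x_left, x_right: horizontal image coordinates of the same feature.
    y: vertical image coordinate (equal in both images for this geometry).
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("feature must have positive disparity")
    z = focal_px * baseline / disparity  # depth is inversely related to disparity
    return (x_left * z / focal_px, y * z / focal_px, z)
```

Since depth varies as the reciprocal of disparity, a fixed matching error of one pixel causes a depth error that grows with distance.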

One major difficulty with this technique is finding the common feature in both images. Biological systems are hypothesized to use (at least) paired edges with the same sign, and from the same spatial frequency channel ([112,113]). Other systems have used detected corners or points where significant intensity changes take place ([54,118]). More recently, researchers have started using trinocular (etc.) stereo (e.g. [125]), exploiting a second epipolar constraint to reduce the search for corresponding features and thus produce more reliable pairings.

Another difficulty arises because stereo generally only gives sparse depth values, which necessitates surface reconstruction. This topic has only recently entered investigation, but some work has been done using a variety of assumptions (e.g. [77,158,31]).

The relative motion of the observer and the observed objects causes characteristic flow patterns in an intensity image. These patterns can be interpreted to acquire relative scene distance, surface orientation, rigid scene structure and obscuring boundaries (in the viewer's coordinate system), though there is ambiguity between an object's distance and velocity. From local changing intensity patterns, it is possible to estimate the optic flow (e.g. following [94,120,85]), from which one can then estimate information about the objects in the scene and their motion (e.g. [136], [49,106,138]).
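The ambiguity between distance and velocity can be seen in the simplest case. The sketch below assumes a pinhole camera and a point moving parallel to the image plane at constant depth; the function name and parameters are hypothetical.

```python
def image_flow(focal_px, lateral_speed, depth):
    """Image velocity (pixels/s) of a point translating parallel to the image.

    For a pinhole camera x = f * X / Z, so with Z constant
    dx/dt = f * (dX/dt) / Z.
    """
    return focal_px * lateral_speed / depth


# A near, slow point and a far, fast point produce the same image flow.
near_slow = image_flow(500.0, 1.0, 10.0)
far_fast = image_flow(500.0, 2.0, 20.0)
```

Only the ratio of speed to depth is observable here, which is why flow alone yields relative rather than absolute scene distance.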

Shading, a more esoteric source of shape information well known to artists, is now being exploited by visual scientists. Horn [93] elaborated the theoretical structure for solving the "shape from shading" problem, and others (e.g. [167,129]) successfully implemented the theory for reasonably simple, uniform surfaces. The method generally starts from a surface function that relates reflectance to the relative orientations of the illumination, viewer and surface. From this, a system of partial differential equations is derived showing how local intensity variation is related to local shape variation. With the addition of boundary, surface continuity and singular point (e.g. highlight) constraints, solutions can be determined for the system of differential equations.
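As an illustration of the surface function from which the method starts, the sketch below evaluates a Lambertian reflectance map in gradient space. It assumes an ideal matte surface, a distant point light source and a distant viewer; the names `p`, `q`, `ps` and `qs` are notation chosen here, and real surfaces generally need a measured or assumed reflectance function instead.

```python
import math


def lambertian_reflectance(p, q, ps, qs):
    """Normalized brightness of a Lambertian patch.

    (p, q):   surface gradient (dz/dx, dz/dy) of the patch.
    (ps, qs): gradient of a surface that faces the light source directly.
    Returns the cosine of the angle between surface normal and source
    direction, clamped to 0 for patches facing away from the light.
    """
    numerator = 1.0 + p * ps + q * qs
    denominator = (math.sqrt(1.0 + p * p + q * q)
                   * math.sqrt(1.0 + ps * ps + qs * qs))
    return max(0.0, numerator / denominator)
```

With the source at the viewer (`ps = qs = 0`), the gradients `(1, 0)`, `(-1, 0)` and `(0, 1)` all give the same brightness. A single intensity value therefore supplies one equation in two unknowns, which is why the additional boundary, continuity and singular point constraints are needed.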

A major problem is that the solution relies on a global resolution of constraints, which seems to require a characterized reflectance function for the whole surface in question. Unfortunately, few surfaces have a reflectance function that meets this requirement, though Pentland [129] has shown reasonable success with some natural objects (e.g. a rock and a face), through making some assumptions about the surface reflectance. There is also a problem with the global convex/concave ambiguity of the surface, which arises when only shading information is available, though Blake [32] has shown how stereo correspondence on a nearby specular point can resolve the ambiguity. For these reasons, this technique may be best suited to only qualitative or rough numerical analyses.

Variations of this technique have used multiple light sources ([50,163]) or polarized light [105]. Explicit surface descriptions (e.g. planar, cylindrical) have been obtained by examining iso-intensity contours [161] and fitting quadratic surfaces [45] to intensity data.

Texture gradients are another source of shape information. Assuming that the texture structure remains constant over the surface, all variation in either scale ([152,130]) or statistics ([164,124]) can be ascribed to surface slant distortion. The measure of compression gives local slant and the direction of compression gives local tilt; together they estimate local surface orientation.
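The relation between compression and orientation can be made concrete for one idealized case. The sketch below assumes circular texture elements viewed under approximately orthographic projection, so that each appears as an ellipse whose axis ratio is the cosine of the slant. The function name and this texture model are assumptions for illustration only.

```python
import math


def orientation_from_compression(major_axis, minor_axis, compression_dir_rad):
    """Local surface orientation from one foreshortened circular texel.

    Returns (slant, tilt, normal): angles in radians and a unit normal in
    viewer coordinates (z toward the viewer).
    """
    slant = math.acos(minor_axis / major_axis)  # amount of compression
    tilt = compression_dir_rad                  # direction of compression
    normal = (math.sin(slant) * math.cos(tilt),
              math.sin(slant) * math.sin(tilt),
              math.cos(slant))
    return slant, tilt, normal
```

The tilt is only determined up to a reversal of direction, because an ellipse looks the same whether the surface leans toward or away from the viewer along the compression axis.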

The final source of orientation and depth information comes from global shape deformation. The technique relies on knowledge of how appearance varies with surface orientation, how certain patterns create impressions of three dimensional structure, and what constraints are needed to reconstruct that structure. Examples of this include reconstructing surface orientation from the assumption that skew symmetry is slant distorted true symmetry [100], from maximizing the local ratio of the area to the square of the perimeter [38], from families of space curves interpreted as geodesic surface markings [154], from space curves as locally circular arcs [21], and from characteristic distortions in known object surface boundary shapes [66]. Because this information relies on higher level knowledge of the objects, these techniques probably would not help the initial stages of analysis much. However, they may provide supporting evidence at the later stages.

There are variations in the exact outputs of each of these techniques, but many provide the required data and all provide some form of useful three dimensional shape information. Further, some attributes may be derivable from measured values (e.g. orientation by locally differentiating depth).
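As an example of deriving one attribute from another, surface orientation can be estimated by differentiating a depth image. The following is a minimal sketch, assuming depth samples on a regular grid with uniform spacing and the sign convention that the normal of a surface z(x, y) is proportional to (-dz/dx, -dz/dy, 1). Real range data would normally be smoothed first, since differentiation amplifies noise.

```python
import math


def surface_normals(depth, spacing=1.0):
    """Estimate unit normals at the interior pixels of a depth grid.

    depth[r][c] is the depth at row r, column c (x along columns,
    y along rows). Returns a dict mapping (r, c) to (nx, ny, nz).
    """
    normals = {}
    rows, cols = len(depth), len(depth[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences approximate dz/dx and dz/dy.
            p = (depth[r][c + 1] - depth[r][c - 1]) / (2.0 * spacing)
            q = (depth[r + 1][c] - depth[r - 1][c]) / (2.0 * spacing)
            norm = math.sqrt(p * p + q * q + 1.0)
            normals[(r, c)] = (-p / norm, -q / norm, 1.0 / norm)
    return normals


# Example: a tilted plane z = 2x + 3y + 1 has the same normal everywhere.
plane = [[2.0 * c + 3.0 * r + 1.0 for c in range(4)] for r in range(4)]
plane_normals = surface_normals(plane)
```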

How Can We Use Surface Information?

As the segmentation criteria are object-centered, the segments of the surface image will directly correspond to model segments, especially since boundaries are mainly used for rough surface shape alignments. Making these symbolic correspondences means that we can directly instantiate and verify object models, which then facilitates geometric inversions of the image formation process.

Surface orientation can then be used to simplify the estimation of the object's orientation. Given the model-to-data patch correspondences, pairing surface normals leaves only a single rotational degree-of-freedom about the aligned normals. Estimating the object's translation is also much simpler because the three dimensional coordinates of any image point (relative to the camera) can be deduced from its pixel position and its depth value. Hence, the translation of a corresponding model point can be directly calculated.
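The three steps in this paragraph can be sketched as follows: back-project a pixel using its depth, rotate the model so that a model normal lies along the paired data normal, and read off the translation from one point correspondence. The code assumes a pinhole camera with hypothetical intrinsics (`focal_px`, `cx`, `cy`) and unit-length normals, and it uses Rodrigues' formula as one convenient way to align two vectors. It is an illustration, not the estimation procedure developed later in this work.

```python
def backproject(u, v, depth, focal_px, cx, cy):
    """Camera-frame 3D point for pixel (u, v) with a known depth value."""
    return ((u - cx) * depth / focal_px, (v - cy) * depth / focal_px, depth)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(m, v):
    return tuple(dot(row, v) for row in m)


def rotation_aligning(model_normal, data_normal):
    """Rotation matrix taking unit model_normal onto unit data_normal.

    Any further rotation about data_normal keeps the normals aligned;
    that is the one rotational degree of freedom left undetermined.
    """
    v = cross(model_normal, data_normal)
    c = dot(model_normal, data_normal)
    if c < -1.0 + 1e-12:
        raise ValueError("normals are opposite; rotation axis is not unique")
    k = [[0.0, -v[2], v[1]],
         [v[2], 0.0, -v[0]],
         [-v[1], v[0], 0.0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    # Rodrigues' formula: R = I + K + K^2 / (1 + c)
    return [[(1.0 if i == j else 0.0) + k[i][j] + k2[i][j] / (1.0 + c)
             for j in range(3)] for i in range(3)]


def translation(rotation, model_point, data_point):
    """Translation that maps the rotated model point onto the data point."""
    rotated = matvec(rotation, model_point)
    return tuple(d - r for d, r in zip(data_point, rotated))
```

A second, non-parallel normal pairing (or a boundary direction) would be one way to fix the remaining rotation about the aligned normals.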

All together, the surface information makes explicit or easily obtainable five of the six degrees-of-freedom associated with an individual surface point (though perhaps imprecisely because of noise in the surface data and problems with making precise model-to-data feature correspondences).

Depth discontinuities make explicit the ordering relationships with more distant surfaces. These relationships help group features to form contexts for model invocation and matching, and help verify potential matches, hence simplifying and strengthening the recognition process.

Surface orientation allows calculation of the surface shape class (e.g. planar, singly-curved or doubly-curved), and correction of slant and curvature distortions of perceived surface area and elongation. The relative orientations between structures give strong constraints on the identity of their superobject and its orientation. Absolute distance measurements allow the calculation of absolute sizes.
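Two of these calculations are simple enough to sketch directly. The code below assumes that the two principal curvatures of a patch have already been estimated from the variation of surface orientation, and that a planar patch is viewed approximately orthographically at a known slant. The tolerance `tol` is a hypothetical threshold that would need tuning to the noise level of real data.

```python
import math


def shape_class(k1, k2, tol=1e-3):
    """Classify a patch by how many principal curvatures are non-zero."""
    nonzero = sum(1 for k in (k1, k2) if abs(k) > tol)
    return ("planar", "singly-curved", "doubly-curved")[nonzero]


def slant_corrected_area(image_area, slant_rad):
    """Undo the foreshortening of a planar patch seen at the given slant.

    slant_rad is the angle between the surface normal and the line of
    sight; the projected area shrinks by cos(slant).
    """
    return image_area / math.cos(slant_rad)
```

For instance, a cylinder wall has one curvature near zero and is classed as singly-curved, and a planar patch slanted by 60 degrees has twice the area that it appears to have in the image.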

How Can Surface Data Help Overcome Current Recognition Problems?

Recognizing real three dimensional objects is still difficult, in part because of difficulties with selecting models, matching complicated shapes, dealing with occlusion, feature description, reducing matching complexity and coping with noise. Surface information helps overcome all of these problems, as discussed below.

Selecting the correct model from the model database requires a description of the data suitable for triggering candidates from the model base. Since descriptions of the model features are given in terms of their three dimensional shapes, describing data features based on their three dimensional shapes reduces the difficulties of comparison, and hence leads to more successful model invocation.

Most current recognition programs have not progressed much beyond recognizing objects whose shapes are largely block-like. One cause of this has been a preoccupation with orientation discontinuity boundaries, which, though easier to detect in intensity images, are noticeably lacking on many real objects. Using the actual surfaces as primitives extends the range of recognizable objects. Faugeras and Hebert [63] demonstrated this by using planar patch primitives to successfully detect and orient an irregularly shaped automobile part.

Viewed objects are often obscured by nearer objects and the ensuing loss of data causes recognition programs to fail. Surface images provide extra information that helps overcome occlusion problems. For example, occlusion boundaries are explicit, and thus denote where relevant information stops. Moreover, the presence of a closer surface provides evidence for why the information is missing, and hence where it may re-appear (i.e. on the "other side" of the obscuring surface).

Surface patches are less sensitive to fragmentation, because pixel connectivity extends in two dimensions. Describing the surfaces is also more robust, because (usually) more pixels are involved and come from a more compact area. Globally, connectivity (i.e. adjacency) of surfaces is largely guaranteed, and slight variations will not affect description. Hence, the topology of the major patches should be reliably extracted.

Establishing model-to-data correspondences can be computationally expensive. Using surface patches helps reduce the expense in two ways: (1) there are usually fewer surface patches than, for example, the boundary segments between them and (2) patch descriptions are richer, which leads to selecting fewer locally suitable, but globally incorrect model-to-data feature pairings (such as those that occur in edge-based matching algorithms).

Noise is omnipresent. Sensor imperfections, quantization errors, random fluctuations, surface shape texture and minor object imperfections are typical sources of data variation. A surface image segment is a more robust data element, because its size leads to reduced data variation (assuming O($n^{2}$) data values as compared with O($n$) for linear features). This contrasts with linear feature detection and description processes, in which noise can cause erroneous parameter estimates, loss of connectivity or wandering.
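The benefit of the larger sample can be quantified in the simplest case. The sketch below assumes independent, identically distributed noise of standard deviation `sigma` on each pixel, and compares the uncertainty of a mean taken over a linear feature of about n pixels with that over a surface patch of about n squared pixels. Real sensor noise is often correlated, so this is an idealized bound rather than a prediction.

```python
import math


def standard_error(sigma, num_samples):
    """Standard deviation of the mean of num_samples independent values."""
    return sigma / math.sqrt(num_samples)


def patch_vs_line_improvement(n):
    """Ratio of the line's standard error (n samples) to the patch's (n*n)."""
    return standard_error(1.0, n) / standard_error(1.0, n * n)
```

Under these assumptions the patch estimate is better by a factor of the square root of n, so a patch about 100 pixels across averages out noise roughly ten times more effectively than a 100 pixel edge.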

Why Not Use Other Representations?

There are three alternative contenders for the primary input data representation: edges, image regions and volumes.

Edges have been used extensively in previous vision systems. The key limitations of their use are:

ambiguous scene interpretation (i.e. whether caused by occlusion, shadows, highlights, surface orientation discontinuities or reflectance changes),
ambiguous model interpretation (i.e. which straight edge of length 10 could it be?),
loss of data because of noise or low contrast, and
loss of the information contained in image areas free from edges (e.g. shading).

While these limitations have not deterred research using edge-based recognition, considerable difficulties have been encountered.

Image regions are bounded segments of an intensity image. Their meaning, however, is ambiguous and their description is not sufficiently related to three dimensional objects. For example, Hanson and Riseman [84] and Ohta [123] segmented green image regions for tree boughs using color, yet there is no reason to assume that trees are the only green objects in the scene nor that contiguous green regions belong to the same object. Further, the segmentations lose all the detailed structure of the shape of the bough, which may be needed to identify the type of tree. They augmented the rough classification with general context relations, which assisted in the interpretation of the data. While this type of general information is important and useful for scene analysis, it is often insufficiently precise and object-specific for identification, given current theories of image interpretation.

Volumetric primitives seem to be useful, as discussed by Marr [112] and Brooks [42] in their advocacy of generalized cylinders. These solids are formed by sweeping a cross-section along an axis and represent elongated structures well. For volumes with shapes other than something like generalized cylinders (e.g. a head), the descriptions are largely limited to explicit space-filling primitives, which are insufficiently compact and lack the power to easily support appearance deductions.

The generalized cylinder approach also leads to problems with relating volumetric features to observed visible surface data, because there is no simple transformation from the surface to the solid under most representations. Marr [112] showed that generalized cylinders were a logical primitive because these are the objects with planar contour generators from all points of view (along with a few other conditions) and so are natural interpretations for pairs of extended obscuring boundaries. Unfortunately, few objects meet the conditions. Moreover, this transformation ignored most of the other information available in the $2\frac{1}{2}{\rm D}$ sketch, which is too useful to be simply thrown away.

Final Comments

This completes a quick review of why surface information is useful, how one might obtain the data, and how it might be segmented for use. The next chapter starts to use the data for the early stages of scene analysis.


Bob Fisher 2004-02-26