Though pattern recognition techniques are widespread and may give rough information about an object's image position, they do not usually provide precise placement, description or feature correspondences. Another technique, graph matching (e.g. [16]), typifies topological matching methods that make correspondences between image and model features, but again gives neither scene placement nor a precise image description. The graph arcs can include rough spatial relations (e.g. above, left, near) between image features and an image model (e.g. [83], [119], [123,1]). They can also include environmental relations, such as the sky being at the top of an image and above roofs [123], or most scene lines being vertical [102], which allow rough correspondences and object placement in the image. These classes of techniques do not strongly exploit the geometric structure of objects and scenes (which often provides the best constraints on the identity and position of objects), though the algorithms do offer simplicity and well-defined decision procedures.
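The relational style of matching described above can be sketched as a small constraint check over graph arcs. The node names, regions and relations below are purely illustrative, not drawn from any of the cited systems:

```python
from itertools import permutations

# Toy relational match: model arcs carry rough spatial relations
# ('above' here) and we search for an assignment of image regions
# to model nodes that satisfies every arc.
model_nodes = ['sky', 'roof', 'wall']
model_arcs = [('sky', 'above', 'roof'), ('roof', 'above', 'wall')]

# Image regions with centroid (x, y); y grows downward.
regions = {'r1': (50, 10), 'r2': (52, 40), 'r3': (51, 80)}

def holds(rel, a, b):
    (ax, ay), (bx, by) = a, b
    return ay < by if rel == 'above' else ax < bx  # 'left_of'

def match():
    """Return the first node-to-region assignment satisfying all arcs."""
    for perm in permutations(regions, len(model_nodes)):
        assign = dict(zip(model_nodes, perm))
        if all(holds(rel, regions[assign[m]], regions[assign[n]])
               for m, rel, n in model_arcs):
            return assign
    return None

print(match())  # -> {'sky': 'r1', 'roof': 'r2', 'wall': 'r3'}
```

As the text notes, such relations only place objects roughly: any assignment preserving the vertical ordering would pass, with no precise geometric localization.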

Geometric scene understanding systems can be classified by the dimensionality of their model features (i.e. point, curve, region or volume) and their image data (i.e. two or three dimensional). All of the key research discussed below used three dimensional geometric models.

Early scene understanding systems (e.g. [139,60]) used two dimensional point or corner correspondences to solve for object location and projection relationships. Later, Turner [161] located objects by using three dimensional points found from stereo triangulation of paired two dimensional image points.

Three influential edge-based recognition systems are:

- Brooks' ACRONYM system [42] matched pairs of nearly parallel two dimensional lines ("ribbons") to the extremal boundaries of generalized cylinders, thus instantiating model primitives. Larger objects were found by a graph matching technique, where the arcs in the graphs represented two dimensional projections of three dimensional geometric relationships between the generalized cylinders (e.g. relative orientation). Three dimensional object position was found by using the two dimensional image measurements (feature sizes and positions, etc.) to back-constrain the range of position parameters.
- Lowe's SCERPO system [108] used groupings of straight two dimensional line features to suggest and orient model instances, which were sets of three dimensional lines. To help avoid combinatorial matching problems, he used a "perceptual organization" technique that grouped edges by collinearity, parallelism and endpoint connectivity and then formed larger features, like parallelograms. A measure of significance, related to the probability of the segment groupings occurring at random, was calculated for both model and data segments and then used to help order the search. Given a pairing of three or more segments, three dimensional position was found using an iterative least-squared error algorithm. Hypotheses were then verified and refined by collecting additional line evidence, using direct feature predictions from the initial position estimates.
- The University of Sheffield TINA system [134] matched three dimensional lines derived from binocular stereo to a three dimensional wire frame object model (which was itself derived empirically from observed instances of the objects). The use of edge information implied the objects needed to be largely polyhedral. A least-squared error matching process deduced the position of the object from the three dimensional feature correspondences.
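Lowe's significance ordering can be illustrated with a toy scoring function. The exact formula below is an assumption for illustration, not Lowe's published measure: pairings of nearly parallel, nearby segments score as less likely to arise at random, so they are tried first in the search.

```python
def parallel_significance(theta1, theta2, sep, length):
    """Toy significance score in the spirit of Lowe's perceptual
    organization: a pairing of nearly parallel segments is significant
    when its chance of arising at random is low, so score inversely to
    the angular difference (radians) and the separation relative to
    segment length.  All quantities and the formula are illustrative."""
    dtheta = abs(theta1 - theta2)
    return 1.0 / (1e-6 + dtheta * (sep / max(length, 1e-6)))

# Nearly parallel, close segments rank above divergent, distant ones:
close_pair = parallel_significance(0.0, 0.01, sep=1.0, length=10.0)
far_pair = parallel_significance(0.0, 0.3, sep=5.0, length=10.0)
print(close_pair > far_pair)  # -> True
```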

There has been little geometric matching of two dimensional image regions to three dimensional model surfaces. Fisher [66] used heuristics based on the deformations of three dimensional model surface patches when seen in two dimensions to estimate three dimensional surface patch position. This supported a hierarchical synthesis matching process that recognized larger objects. Ballard and Sabbah [13] used a variety of Hough transformation techniques to estimate the six positional parameters sequentially. This uniform mechanism is more robust to noise, but is likely to suffer when the object's shape varies dramatically with viewpoint. Turner [161] attempted a more symbolically descriptive matching technique by using surface patches classified according to the patterns of iso-intensity curves. The elementary recognition operation used property and relation matching; more complicated objects were recognized by aggregating subcomponents in a hierarchical synthesis process.
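The sequential Hough-style estimation used by Ballard and Sabbah can be sketched for a single rotation parameter: each model-to-data feature pairing votes for the transform that would explain it, and the accumulator peak is taken as the estimate. The bin size and the one-parameter restriction below are illustrative assumptions:

```python
from collections import Counter

def hough_rotation_estimate(model_dirs, data_dirs, bin_deg=5):
    """Vote for the in-plane rotation (degrees) aligning model feature
    directions with observed data directions.  Every model/data pairing
    casts one vote; correct pairings pile up in one accumulator bin
    while incorrect pairings scatter, which is the source of the
    method's noise robustness."""
    votes = Counter()
    for m in model_dirs:
        for d in data_dirs:
            angle = (d - m) % 360
            votes[round(angle / bin_deg) * bin_deg % 360] += 1
    return votes.most_common(1)[0][0]

# Model edge directions 0, 45, 90 degrees, observed rotated by 30:
print(hough_rotation_estimate([0, 45, 90], [30, 75, 120]))  # -> 30
```

A full implementation would repeat such voting for each of the six positional parameters in turn, fixing earlier estimates while solving later ones.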

Three dimensional surfaces have been used for recognition since the early 1970s. Several researchers (e.g. [147,133]) collected surface data from a structured light system, where configurations of light stripes characterized regular surface shapes. This method of data collection has again become popular (e.g. [127,36]). A particularly significant result was obtained by Faugeras and Hebert [63], who recognized an irregular part using locally planar patches, by matching to an empirically derived model (although the matcher only found a few correspondences). Grimson and Lozano-Perez [78] extended this work by developing a set of heuristics that eliminate many spurious initial hypotheses. Their features were three dimensional image points with attached vectors (such as surface normals). Grimson [79] later extended this work to recognizing families of scaled, stretched or jointed two dimensional piecewise linear objects, by propagating and refining estimates of the scale or stretch factor down the model-to-data segment pairing search tree.
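The flavor of Grimson and Lozano-Perez's pruning heuristics can be shown with the simplest of them, a pairwise-distance consistency test over a candidate model-to-data pairing. This is a sketch with assumed point features and tolerance, not their full constraint set (which also used the attached vectors):

```python
import itertools
import math

def consistent(pairing, model_pts, data_pts, tol=0.1):
    """Pairwise distance constraint in the spirit of Grimson and
    Lozano-Perez: every two data points assigned to model points must
    be separated by (nearly) the same distance as the model points.
    `pairing` is a list of (model_index, data_index) tuples; a pairing
    failing this test can be pruned before the search tree below it is
    expanded."""
    for (mi, di), (mj, dj) in itertools.combinations(pairing, 2):
        if abs(math.dist(model_pts[mi], model_pts[mj])
               - math.dist(data_pts[di], data_pts[dj])) > tol:
            return False
    return True

model = [(0, 0), (1, 0)]
print(consistent([(0, 0), (1, 1)], model, [(2, 2), (3, 2)]))  # -> True
print(consistent([(0, 0), (1, 1)], model, [(0, 0), (2, 0)]))  # -> False
```

Because rigid motion preserves distances, the test rejects only genuinely impossible pairings, which is why such heuristics eliminate spurious hypotheses without losing correct ones.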

In a "continuous" version of surface matching, Ikeuchi [96] used an extended Gaussian image method to reduce the object description to a sphere with quills representing the sizes of areas with the corresponding orientations. Matching used three dimensional data and was largely a constrained correlation. His method was successful with some curved objects, but ignored the object's structural features, and might fail for complicated or non-convex objects.
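The extended Gaussian image can be approximated by accumulating facet area into quantized normal-direction bins; each accumulated value is one "quill" length. The azimuth/elevation binning below is a crude assumption for illustration (real implementations use a more uniform sphere tessellation):

```python
import math
from collections import Counter

def extended_gaussian_image(facets, bins=8):
    """Crude extended Gaussian image: each (unit_normal, area) facet
    contributes its area to an orientation bin on the sphere, here
    indexed by quantized azimuth and elevation.  For a non-convex
    object distinct surface regions can fall into one bin, which is
    one reason the representation loses structural information."""
    egi = Counter()
    for (nx, ny, nz), area in facets:
        az = math.atan2(ny, nx)                    # azimuth in [-pi, pi]
        el = math.asin(max(-1.0, min(1.0, nz)))    # elevation
        key = (int((az + math.pi) / (2 * math.pi) * bins) % bins,
               int((el + math.pi / 2) / math.pi * bins) % bins)
        egi[key] += area
    return egi

# Two unit-area facets with orthogonal normals give two quills:
egi = extended_gaussian_image([((1, 0, 0), 1.0), ((0, 0, 1), 1.0)])
print(len(egi), sum(egi.values()))  # -> 2 2.0
```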

In general, recognition results have been limited to complete image understanding of simple geometric objects (e.g. [139]) or partial understanding of complex assemblies of simple objects, such as airplanes [42]. Irregular objects are not well understood at this level, in part because of problems with object modeling and in part because of the difficulty in obtaining useful image data.

Once an initial object position has been estimated, object models can be used to predict the location and appearance of the remaining image features. Falk [60] predicted lines in a blocks world domain, and Freuder [75] predicted image region locations in two dimensions with procedural models of hammers. More recently, Brooks [42] showed how a range of image positions could be predicted using partial constraints on object location. Hogg [91] used edge point information to verify the positional parameters of a generalized cylinder human model over time in a natural scene. Individual evidence was weak, but requiring evidence for the whole complex model led to good results. Aylett et al. [11] used constraints similar to Grimson and Lozano-Perez [78] to match stereo-derived three dimensional edges to predicted model edges. The model edges were predicted from a constructive solid geometry object model in a known position and the goal was to eliminate known features from a scene.

Understanding occlusion in three dimensions has produced few results to date. Blocks world scenes have been successfully analyzed by Guzman's heuristics [80], which included paired-TEE occlusion identification and image region pairing. Fisher [66] and Adorni and Trucco [4] have extended and applied these ideas to three dimensional scene analysis. Koenderink and van Doorn [104] characterized occlusion on curved surfaces by their local surface relationships, and showed how the occlusion signatures progressively vary as viewpoints change. This micro-level occlusion understanding could help predict local surface shape for the verification of hypothesized occlusion.

Most model-based three dimensional vision systems are slow. Goad [76] described a very efficient algorithm for matching three dimensional model edges to two dimensional straight data edges, achieving recognition in a few seconds. He represented a discrete set of object orientations (a tessellation of the viewsphere) as a boolean bit string, thus allowing fast geometric operations. Matching used a tree search, and Goad pre-expanded the tree, branching on the outcomes of visibility prediction and feature detection. Impossible branches were pruned at compile time using a combination of geometric, reliability and plausibility analyses. Relative feature positions could be pre-computed for the tessellation cells, allowing fast run-time absolute feature prediction.
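Goad's bit-string representation of orientations can be sketched directly: each bit marks a viewsphere cell that remains a feasible object orientation, so intersecting two constraints is a single bitwise AND. Cell indices and set sizes below are illustrative:

```python
# Each bit i marks whether viewsphere cell i is still a feasible
# object orientation.  Constraint intersection is then one bitwise
# AND, which is what makes pre-compiled matching of this style fast.
N_CELLS = 32

def cell_set(cells):
    """Pack a list of viewsphere cell indices into a bit string."""
    bits = 0
    for c in cells:
        bits |= 1 << c
    return bits

visible_from = cell_set([1, 2, 3, 10, 11])   # cells where a feature is visible
consistent_with_match = cell_set([2, 3, 4])  # cells consistent with a pairing

feasible = visible_from & consistent_with_match
print([c for c in range(N_CELLS) if feasible >> c & 1])  # -> [2, 3]
```

At compile time, sets like `visible_from` can be tabulated once per model feature, so the run-time search only performs the cheap AND and emptiness tests.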