Next: Recognition Approaches Up: Object Recognition from Surface Previous: The Nature of Recognition

Some Previous Object Recognition Systems

Three dimensional object recognition is still largely limited to blocks world scenes. Only simple, largely polyhedral objects can be fully identified, while more complicated objects can only be tentatively recognized (i.e. evidence for only a few features can be found). There are several pieces of research that deserve special mention.

Roberts [139] was the founder of three dimensional model-based scene understanding. Using edge detection methods, he analyzed intensity images of blocks world scenes containing rectangular solids, wedges and prisms. The two key descriptions of a scene were the locations of vertices in its edge description and the configurations of the polygonal faces about the vertices. The local polygon topology indexed into the model base, and suggested initial model-to-data point correspondences. Using these correspondences, the geometric relationship between the model, scene and image was computed. A least-squares solution accounted for numerical errors. Object scale and distance were resolved by assuming the object rested on a ground plane or on other objects. Recognition of one part of a configuration introduced new edges to help segment and recognize the rest of the configuration.
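The role of the least-squares step can be illustrated with a small sketch. It is not Roberts's formulation, which related three dimensional models to a perspective image; as a simplifying assumption it fits a two dimensional similarity transform (scale, rotation, translation) to paired points in closed form. The function names `fit_similarity_2d`, `apply_similarity` and `rms_residual` are hypothetical.

```python
import math


def fit_similarity_2d(model_pts, image_pts):
    """Least-squares fit of image ~ s*R*model + t for paired 2D points.

    Returns (a, b, tx, ty) with a = s*cos(theta) and b = s*sin(theta).
    Illustrative stand-in only: a 2D similarity, not a 3D-to-image mapping.
    """
    n = len(model_pts)
    if n < 2 or n != len(image_pts):
        raise ValueError("need at least two paired points")
    # Centroids of both point sets.
    mx = sum(p[0] for p in model_pts) / n
    my = sum(p[1] for p in model_pts) / n
    ix = sum(q[0] for q in image_pts) / n
    iy = sum(q[1] for q in image_pts) / n
    dot = cross = norm = 0.0
    for (px, py), (qx, qy) in zip(model_pts, image_pts):
        px, py, qx, qy = px - mx, py - my, qx - ix, qy - iy
        dot += px * qx + py * qy      # correlates with s*cos(theta)
        cross += px * qy - py * qx    # correlates with s*sin(theta)
        norm += px * px + py * py
    if norm == 0.0:
        raise ValueError("model points must not all coincide")
    a, b = dot / norm, cross / norm
    # Translation maps the model centroid onto the image centroid.
    tx = ix - (a * mx - b * my)
    ty = iy - (b * mx + a * my)
    return a, b, tx, ty


def apply_similarity(params, pt):
    """Map one model point into the image using fitted parameters."""
    a, b, tx, ty = params
    x, y = pt
    return (a * x - b * y + tx, b * x + a * y + ty)


def rms_residual(params, model_pts, image_pts):
    """Root-mean-square distance between mapped model and image points."""
    total = 0.0
    for p, q in zip(model_pts, image_pts):
        u, v = apply_similarity(params, p)
        total += (u - q[0]) ** 2 + (v - q[1]) ** 2
    return math.sqrt(total / len(model_pts))
```

With more correspondences than unknowns, the residual gives a simple measure of how well a hypothesized pairing of model and image vertices agrees, which is the sense in which a least-squares solution absorbs numerical error.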

Hanson and Riseman's VISIONS system [83] was proposed as a complete vision system. It was a schema-driven natural scene recognition system acting on edge and multi-spectral region data [82]. It used a blackboard system with levels for: vertices, segments, regions, surfaces, volumes, objects and schemata. Various knowledge sources made top-down or bottom-up additions to the blackboard. For the identification of objects (road, tree, sky, grass, etc.) a confidence value was used, based on property matching. The properties included: spectral composition, texture, size and two dimensional shape. Rough geometric scene analysis estimated the base plane and then object distances knowing rough object sizes. Use of image relations to give rough relative scene ordering was proposed. Besides the properties, schemata were the other major object knowledge source. They organized objects likely to be found together in generic scenes (e.g. a house scene) and provided conditional statistics used to direct the selection of new hypotheses from the blackboard to pursue.

As this system was described early in its development, a full evaluation cannot be made here. Its control structure was general and powerful, but its object representations were weak and dependent mainly on a few discriminating properties, with little spatial understanding of three dimensional scenes.

Marr [112] hypothesized that humans use a volumetric model-based object recognition scheme that:

took edge data from a $2\frac{1}{2}{\rm D}$ sketch,
isolated object regions by identifying obscuring contours,
described subelements by their elongation axes, and objects by the local configuration of axes,
used the configurations to index into and search in a subtype/subcomponent network representing the objects, and
used image axis positions and model constraints for geometric analysis.

His proposal was outstanding in the potential scope of recognizable objects, in defining and extracting object independent descriptions directly matchable to three dimensional models (i.e. elongation axes), in the subtype and subcomponent model refinement, and in the potential of its invocation process. It suffered from the absence of a testable implementation, from being too serial in its view of recognition, from being limited to only cylinder-like primitives, from not accounting for surface structure and from not fully using the three dimensional data in the $2\frac{1}{2}{\rm D}$ sketch.

Brooks [42], in ACRONYM, implemented a generalized cylinder based recognizer using similar notions. His object representation had both subtype and subcomponent relationships. From its models, ACRONYM derived visible features and relationships, which were then graph-matched to edge data represented as ribbons (parallel edge groups). ACRONYM deduced object position and model parameters by back constraints in the prediction graph, where constraints were represented by algebraic inequalities. These symbolically linked the model and position parameters to the model relationships and image geometry, and could be added to incrementally as recognition proceeded. The algebraic position-constraint and incremental evidence mechanism was powerful, but the integration of the constraints was a time-consuming and imperfect calculation.
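The idea of accumulating inequality constraints as evidence arrives can be sketched in a much reduced form. ACRONYM manipulated symbolic, generally nonlinear inequalities; the sketch below assumes instead that every constraint is a simple lower or upper bound on a single named quantity, so that adding a constraint only narrows an interval. The class `ConstraintSet` and the quantity names are hypothetical.

```python
import math


class ConstraintSet:
    """Interval bounds on named quantities, narrowed incrementally.

    A deliberately simplified stand-in for symbolic inequality
    reasoning: each constraint is lo <= quantity <= hi.
    """

    def __init__(self):
        self.bounds = {}  # quantity name -> (lo, hi)

    def add(self, name, lo=-math.inf, hi=math.inf):
        """Intersect a new bound with the existing one.

        Returns True if the quantity still has a non-empty interval.
        """
        old_lo, old_hi = self.bounds.get(name, (-math.inf, math.inf))
        new_lo, new_hi = max(old_lo, lo), min(old_hi, hi)
        self.bounds[name] = (new_lo, new_hi)
        return new_lo <= new_hi

    def consistent(self):
        """True if no quantity has been narrowed to an empty interval."""
        return all(lo <= hi for lo, hi in self.bounds.values())


# Usage: a model class bounds a length, then image evidence narrows it.
cs = ConstraintSet()
cs.add("fuselage_length", 20.0, 40.0)   # hypothetical model constraint
cs.add("fuselage_length", lo=30.0)      # hypothetical image evidence
print(cs.bounds["fuselage_length"], cs.consistent())
```

An empty interval signals that the current interpretation contradicts the model, so the hypothesis can be rejected. Integrating coupled nonlinear constraints is far harder than this interval intersection, which is consistent with the cost noted above.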

This well developed project demonstrated the utility of explicit geometric and constraint reasoning, and introduced a computational model for generic identification based on nested sets of constraints. Its weaknesses were that it used only edge data as input, that it had a relatively incomplete understanding of the scene, and that it did not really demonstrate three dimensional understanding (the main example was an airplane viewed from a great perpendicular height).

Faugeras and his group [63] researched three dimensional object recognition using direct surface data acquired by a laser triangulation process. Their main example was an irregular automobile part. The depth values were segmented into planar patches using region growing and Hough transform techniques. These data patches were then combinatorially matched to model patches, constrained by needing a consistent model-to-data geometric transformation at each match. The transformation was calculated using several error minimization methods, and consistency was checked first by a fast heuristic check and then by error estimates from the transformation estimation. Their recognition models were directly derived from previous views of the object and recorded the parameters of the planar surface patches for the object from all views.

Key problems here were the undirected matching, the use of planar patches only, and the relatively incomplete nature of their recognition: pairing of a few patches was enough to claim recognition. However, they succeeded in the recognition of a complicated real object.

Bolles et al. [36] used light stripe and laser range finder data. Surface boundaries were found by linking corresponding discontinuities in groups of stripes, and by detecting depth discontinuities in the range data. Matching to models was done by using edge and surface data to predict circular and hence cylindrical features, which were then related to the models. The key limitation of these experiments was that only large (usually planar) surfaces could be detected, and so object recognition could depend on only these features. This was adequate in the limited industrial domains. The main advantages of the surface data were that it was absolute and unambiguous, and that planar (etc.) model features could be matched directly to other planar (etc.) data features, thus saving on matching combinatorics.

The TINA vision system, built by the Sheffield University Artificial Intelligence Vision Research Unit [134], was a working stereo-based three dimensional object recognition and location system. Scene data was acquired in three stages: (1) subpixel "Canny" detected edges were found for a binocular stereo image pair, (2) these were combined using epipolar, contrast gradient and disparity gradient constraints and (3) the three dimensional edge points were grouped to form straight lines and circular arcs. These three dimensional features were then matched to a three dimensional wire frame model, using a local feature-focus technique [35] to cue the initial matches. They eliminated the incorrect model-to-data correspondences using pairwise constraints similar to those of Grimson and Lozano-Perez [78] (e.g. relative orientation). When a maximal matching was obtained, a reference frame was estimated, and then improved by exploiting object geometry constraints (e.g. that certain lines must be parallel or perpendicular).
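How a pairwise constraint such as relative orientation prunes correspondences can be shown with a small sketch. It is not the TINA implementation: it assumes each feature is reduced to an undirected three dimensional line direction, uses a brute-force search suited only to a handful of candidates, and all names (`angle_between`, `compatible`, `largest_consistent_set`) are hypothetical.

```python
import math


def angle_between(u, v):
    """Angle in radians (0..pi/2) between two undirected line directions."""
    dot = abs(sum(a * b for a, b in zip(u, v)))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(min(1.0, dot / (nu * nv)))


def compatible(pair_a, pair_b, model, data, tol):
    """Relative-orientation test for two (model, data) pairings."""
    (ma, da), (mb, db) = pair_a, pair_b
    if ma == mb or da == db:
        return False  # each feature may be used only once
    model_angle = angle_between(model[ma], model[mb])
    data_angle = angle_between(data[da], data[db])
    return abs(model_angle - data_angle) <= tol


def largest_consistent_set(candidates, model, data, tol=math.radians(5)):
    """Depth-first search for the largest mutually compatible pairing set."""
    best = []

    def extend(chosen, remaining):
        nonlocal best
        if len(chosen) > len(best):
            best = list(chosen)
        for i, cand in enumerate(remaining):
            if all(compatible(cand, c, model, data, tol) for c in chosen):
                extend(chosen + [cand], remaining[i + 1:])

    extend([], list(candidates))
    return best


# Usage: three model lines and the same lines rotated 90 degrees about z.
model = {"a": (1, 0, 0), "b": (0, 1, 0), "c": (2, 1, 0)}
data = {0: (0, 1, 0), 1: (-1, 0, 0), 2: (-1, 2, 0)}
candidates = [(m, d) for m in model for d in data]
print(sorted(largest_consistent_set(candidates, model, data)))
```

Because the test compares only angles between pairs of features, it needs no pose estimate, which is why such constraints are useful before a reference frame has been computed.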

A particularly notable achievement of this project was their successful inference of the wire frame models from multiple known views of the object. Although the stereo and wire frame-based techniques were suited mainly for polyhedral objects, this well-engineered system was successful at building models that could then be used for object recognition and robot manipulation.

More recently, Fan et al. [61] described range-data based object recognition with many similarities to the work in this book. Their work initially segmented the range data into surface patches at depth and orientation discontinuities. Then, they created an attributed graph with nodes representing surface patches (labeled by properties like area, orientation and curvature) and arcs representing adjacency (labeled by the type of discontinuity and estimated likelihood that the two surfaces are part of the same object). The whole scene graph was partitioned into likely complete objects (similar to our surface clusters) using the arc likelihoods. Object models were represented by multiple graphs for the object as seen from topologically distinct viewpoints. The first step of model matching was a heuristic-based preselection of likely model graphs. Then, a search tree was formed, pairing compatible model and data nodes. When a maximal set was obtained, an object position was estimated, which was used to add or reject pairings. Consistent pairings then guided re-partitioning of the scene graph, subject to topological and geometric consistency.
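The partitioning of an attributed scene graph by arc likelihoods can be sketched as follows. The actual partitioning rule of Fan et al. is not given here, so the sketch assumes the simplest plausible one: arcs whose same-object likelihood falls below a threshold are cut, and the connected components that remain are taken as candidate objects. The function name, the threshold value and the patch attributes are hypothetical.

```python
def partition_scene_graph(nodes, arcs, threshold=0.5):
    """Group surface patches into candidate objects.

    nodes: dict mapping patch name -> attribute dict (area, curvature, ...)
    arcs:  dict mapping (patch_a, patch_b) -> likelihood that both patches
           belong to the same object
    Returns a sorted list of sorted patch-name lists.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), likelihood in arcs.items():
        if likelihood >= threshold:      # keep only the likely arcs
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


# Usage with hypothetical patches and likelihoods.
nodes = {
    "p1": {"area": 12.0, "curvature": 0.0},
    "p2": {"area": 8.5, "curvature": 0.0},
    "p3": {"area": 20.0, "curvature": 0.1},
    "p4": {"area": 5.0, "curvature": 0.1},
}
arcs = {("p1", "p2"): 0.9, ("p2", "p3"): 0.2, ("p3", "p4"): 0.8}
print(partition_scene_graph(nodes, arcs))
```

A fixed threshold is the crudest possible choice; the re-partitioning step described above suggests that the grouping was revised once model matches provided stronger evidence.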

Rosenfeld [142] proposed an approach to fast recognition of unexpected (i.e. fully data driven) generic objects, based on five assumptions:

objects are represented in characteristic views,
the key model parts are regions and boundaries,
features are characterized by local properties,
relational properties are expressed in relative form (i.e. "greater than") and
all properties are unidimensional and unimodal.

A consequence of these assumptions is that most of the recognition processes are local and distributed, and hence can be implemented on a parallel architecture such as a pyramidal processor.

This concludes a brief discussion of some prominent three dimensional object recognition systems. Other relevant research is discussed where appropriate in the main body of the book. Besl and Jain [23] gave a thorough review of techniques for both three dimensional object representation and recognition.


Bob Fisher 2004-02-26