The following definition is proposed:
Three dimensional object recognition is the identification of a model structure with a set of image data, such that geometrically consistent model-to-data correspondences are established and the object's three dimensional scene position is known. All model features should be fully accounted for - by having consistent image evidence either supporting their presence or explaining their absence.
Hence, recognition produces a symbolic assertion about an object, its location and the use of image features as evidence. The matched features must have the correct types, be in the right places and belong to a single, distinct object. Otherwise, though the data might resemble those from the object, the object is improperly assumed and is not at the proposed location.
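The definition above can be made concrete as a data structure. The following is a minimal sketch (the names `ObjectHypothesis` and `FeatureEvidence` are illustrative assumptions, not structures from this work) of a recognition output: a symbolic assertion naming the object, its three dimensional pose, and per-feature evidence that either supports a model feature's presence or explains its absence.

```python
# Sketch of a recognition hypothesis as defined in the text: every model
# feature must be accounted for, either by a matched image feature or by
# an explanation of why it is not visible. Names here are hypothetical.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FeatureEvidence:
    model_feature: str                 # e.g. a named model surface
    image_feature: Optional[str]       # matched data feature, or None
    explanation: Optional[str] = None  # e.g. "self-obscured"

@dataclass
class ObjectHypothesis:
    object_name: str
    pose: tuple                        # 3D position and orientation parameters
    evidence: list = field(default_factory=list)

    def fully_accounted_for(self) -> bool:
        # each model feature needs image support or an absence explanation
        return all(e.image_feature is not None or e.explanation is not None
                   for e in self.evidence)
```

A hypothesis with a matched seat surface and a self-obscured back would be fully instantiated; one with an unexplained, unmatched feature would not.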
Traditional object recognition programs satisfy weaker versions of the above definition. The most common simplification comes from the assumption of a small, well-characterized object domain. There, identification can be achieved by discrimination using simply measured image features, such as object color, two dimensional perimeter or the position of a few linear features. This is identification, but not true recognition (i.e. image understanding).
Recognition based on direct comparison between two dimensional image and model structures - notably through matching boundary sections - has been successful with both grey scale and binary images of flat, isolated, moderately complicated industrial parts. It is simple, allows geometric prediction and derivation of object location and orientation, and tolerates a limited amount of noise. This method achieves true recognition of the objects - all features of the model are accounted for and the object's spatial location is determined.
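The boundary-section matching idea can be illustrated by a small sketch (this is a generic directed-distance score, not the cited systems' algorithms): place the model boundary at a candidate translation and rotation, then score how well it accounts for the image boundary by the mean distance from each transformed model point to its nearest data point.

```python
# Illustrative 2D boundary matching score: a perfect placement of the model
# boundary on the data boundary scores 0; misplacement raises the score.
import math

def transform(points, angle, tx, ty):
    """Rotate boundary points by angle (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for (x, y) in points]

def boundary_score(model_pts, data_pts, angle, tx, ty):
    """Mean distance from each placed model point to its nearest data point."""
    placed = transform(model_pts, angle, tx, ty)
    total = 0.0
    for (px, py) in placed:
        total += min(math.hypot(px - qx, py - qy) for (qx, qy) in data_pts)
    return total / len(placed)
```

In practice such a score would be minimized over the pose parameters, directly yielding the object's image location and orientation, which is why the 2D method supports the geometric predictions mentioned above.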
Some research has started on recognizing three dimensional objects, but with less success. Model edges have been matched to image edges (with both two and three dimensional data) while simultaneously extracting the position parameters of the modeled objects. In polyhedral scenes, recognition is generally complete, but otherwise only a few features are found. The limits of the edge-based approach are fourfold:
In the last decade, low-level vision research has been working towards direct deduction and representation of scene properties - notably surface depth and orientation. The sources include stereo, optical flow, laser or sonar range finding, surface shading, surface or image contours and various forms of structured lighting.
The most well-developed of the surface representations is the 2½D sketch advocated by Marr. The sketch represents local depth and orientation for the surfaces, and labels detected surface boundaries as arising from shape or depth discontinuities. The exact details of this representation and its acquisition are still being researched, but its advantages seem clear enough. Results suggest that surface information reduces data complexity and interpretation ambiguity, while increasing the structural matching information.
The richness of the data in a surface representation, as well as its imminent availability, offers hope for real advances beyond the current practical understanding of largely polyhedral scenes. Distance, orientation and image geometry enable a reasonable reconstruction of the three dimensional shape of the object's visible surfaces, and the boundaries lead to a figure/ground separation. Because it is possible to segment and characterize the surfaces, more compact symbolic representations are feasible. These symbolic structures have the same relation to the surface information as edges currently do to intensity information, except that their scene interpretation is unambiguous. If there were:
The goal of object recognition, as defined above, is the complete matching of model to image structures, with the concomitant extraction of position information. Hence, the output of recognition is a set of fully instantiated or explained object hypotheses positioned in three dimensions, which are suitable for reconstructing the object's appearance.
The research described here tried to attain these goals for moderately complex scenes containing multiple self-obscuring objects. To fully recognize the objects, it was necessary to develop criteria and practical methods for:
The approach requires object models composed of a set of surfaces geometrically related in three dimensions (either directly or through subcomponents). For each model SURFACE, recognition finds those image surfaces that consistently match it, or evidence for their absence (e.g. obscuring structure). The model and image surfaces must agree in location and orientation, and have about the same shape and size, with variations allowed for partially obscured surfaces. When surfaces are completely obscured, evidence for their existence comes either from predicting self-occlusion from the location and orientation of the model, or from finding closer, unrelated obscuring surfaces.
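The model-to-data surface consistency test can be sketched as follows (the helper and its thresholds are assumptions for illustration; they are not this work's implementation): orientations must agree, and the data surface's area must be compatible with the model's, allowing a shortfall for partial occlusion but not a large excess.

```python
# Hedged sketch of the consistency test described in the text: a data
# surface matches a model SURFACE when their unit normals agree and the
# data area does not exceed the model area (obscured surfaces may be
# smaller, never much larger). Thresholds are illustrative.
import math

def surfaces_consistent(model_normal, data_normal, model_area, data_area,
                        max_angle_deg=20.0, area_tolerance=0.1):
    # orientation: angle between unit normals must be small
    dot = sum(m * d for m, d in zip(model_normal, data_normal))
    dot = max(-1.0, min(1.0, dot))      # clamp against rounding error
    if math.degrees(math.acos(dot)) > max_angle_deg:
        return False
    # size: partial occlusion can shrink the data surface, not enlarge it
    return data_area <= model_area * (1.0 + area_tolerance)
```

A partially obscured surface (smaller data area, agreeing normal) passes; a surface of the wrong orientation or an implausibly large one fails.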
The object representation requires the complete object surface to be segmentable into what would intuitively be considered distinct surface regions. These are what will now be generally called surfaces (except where there is confusion with the whole object's surface). When considering a cube, the six faces are logical candidates for the surfaces; unfortunately, most natural structures are not so simple. The segmentation assumption presumes that the object can be decomposed into rigid substructures (though possibly non-rigidly joined), and that the rigid substructures can be uniquely segmented into surfaces of roughly constant character, defined by their two principal curvatures. It is also assumed that the image surfaces will segment in correspondence with the model SURFACEs; if the segmentation criteria are object-based, then the model and data segmentations should be similar. (The SURFACE is the primitive model feature, and represents a surface patch.) Of course, these assumptions are simplistic because surface deformation and object variations lead to alternative segmentations, but a start must be made somewhere.
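The phrase "surfaces of roughly constant character, defined by their two principal curvatures" can be illustrated with the standard differential-geometry labels (this classifier is illustrative, not the thesis's segmentation procedure):

```python
# Classify a surface patch by its two principal curvatures k1 and k2.
# Near-zero curvature in both directions gives a planar patch; in one
# direction, a singly curved patch (e.g. cylindrical); in neither, a
# doubly curved patch (e.g. ellipsoidal or saddle-shaped).
def surface_character(k1, k2, eps=1e-3):
    flat1, flat2 = abs(k1) < eps, abs(k2) < eps
    if flat1 and flat2:
        return "planar"
    if flat1 or flat2:
        return "singly curved"
    return "doubly curved"
```

Under the segmentation assumption, each model SURFACE and each corresponding data surface would receive the same character label.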
The three models used in the research are: a trash can, a classroom chair, and portions of a PUMA robot. The major common feature of these objects is the presence of regular distinct surfaces uncluttered by shape texture, when considered at a "human" interpretation scale. The objects were partly chosen for experimental convenience, but also to test most of the theories proposed here. The models are shown in typical views in Chapter 7. Some of the distinctive features of each object and their implications on recognition are:
These objects were viewed in semi-cluttered laboratory scenes that contained both obscured and unobscured views (an example appears in the next section). Using an intensity image to register all data, nominal depth and surface orientation values were measured by hand at about one hundred points, and values at other nearby image points were calculated by interpolation. Obscuring and shape segmentation boundaries were selected by hand, to avoid unresolved research problems of segmentation, scale and data errors. No fully developed processes yet produce these segmentations, but several are likely to do so in the near future, and assuming that such segmentations were available allowed us to concentrate on the primary issues of representation and recognition.
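The interpolation step can be sketched with a simple scheme (the text does not specify the interpolant used; inverse distance weighting is an assumption for illustration): depth at an unmeasured image point is a distance-weighted average of the hand-measured samples.

```python
# Illustrative inverse distance weighting over the sparse hand-measured
# points: nearby measurements dominate the interpolated depth value.
import math

def interpolate_depth(x, y, samples, power=2.0):
    """samples: list of (xi, yi, depth_i) hand-measured points."""
    num = den = 0.0
    for (xi, yi, zi) in samples:
        d = math.hypot(x - xi, y - yi)
        if d < 1e-9:
            return zi               # exactly at a measured point
        w = 1.0 / d ** power
        num += w * zi
        den += w
    return num / den
```

The same scheme applies componentwise to surface orientation values, with the registered intensity image supplying the common coordinate frame.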