The following definition is proposed:
Three dimensional object recognition is the identification of a model structure with a set of image data, such that geometrically consistent model-to-data correspondences are established and the object's three dimensional scene position is known. All model features should be fully accounted for - by having consistent image evidence either supporting their presence or explaining their absence.
Hence, recognition produces a symbolic assertion about an object, its location and the use of image features as evidence. The matched features must have the correct types, be in the right places and belong to a single, distinct object. Otherwise, though the data might resemble those from the object, the object is improperly assumed and is not at the proposed location.
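The definition above can be made concrete as a data structure. The following is a minimal sketch (the names `ObjectHypothesis` and `FeatureEvidence` are illustrative assumptions, not structures from this work) of a recognition output: a symbolic assertion naming the object, its three dimensional pose, and per-feature evidence that either supports a model feature's presence or explains its absence.

```python
# Sketch of a recognition hypothesis as defined in the text: every model
# feature must be accounted for, either by a matched image feature or by
# an explanation of why it is not visible. Names here are hypothetical.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FeatureEvidence:
    model_feature: str                 # e.g. a named model surface
    image_feature: Optional[str]       # matched data feature, or None
    explanation: Optional[str] = None  # e.g. "self-obscured"

@dataclass
class ObjectHypothesis:
    object_name: str
    pose: tuple                        # 3D position and orientation parameters
    evidence: list = field(default_factory=list)

    def fully_accounted_for(self) -> bool:
        # each model feature needs image support or an absence explanation
        return all(e.image_feature is not None or e.explanation is not None
                   for e in self.evidence)
```

A hypothesis with a matched seat surface and a self-obscured back would be fully instantiated; one with an unexplained, unmatched feature would not.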
Traditional object recognition programs satisfy weaker versions of the above definition. The most common simplification comes from the assumption of a small, well-characterized object domain. There, identification can be achieved by discrimination using simply measured image features, such as object color, two dimensional perimeter or the position of a few linear features. This is identification, but not true recognition (i.e. image understanding).
Recognition based on direct comparison between two dimensional image and model structures - notably through matching boundary sections - has been successful with both grey scale and binary images of flat, isolated, moderately complicated industrial parts. It is simple, allows geometric prediction and derivation of object location and orientation, and tolerates a limited amount of noise. This method achieves true recognition of the objects - all features of the model are accounted for and the object's spatial location is determined.
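The boundary-section matching idea can be illustrated by a small sketch (this is a generic directed-distance score, not the cited systems' algorithms): place the model boundary at a candidate translation and rotation, then score how well it accounts for the image boundary by the mean distance from each transformed model point to its nearest data point.

```python
# Illustrative 2D boundary matching score: a perfect placement of the model
# boundary on the data boundary scores 0; misplacement raises the score.
import math

def transform(points, angle, tx, ty):
    """Rotate boundary points by angle (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for (x, y) in points]

def boundary_score(model_pts, data_pts, angle, tx, ty):
    """Mean distance from each placed model point to its nearest data point."""
    placed = transform(model_pts, angle, tx, ty)
    total = 0.0
    for (px, py) in placed:
        total += min(math.hypot(px - qx, py - qy) for (qx, qy) in data_pts)
    return total / len(placed)
```

In practice such a score would be minimized over the pose parameters, directly yielding the object's image location and orientation, which is why the 2D method supports the geometric predictions mentioned above.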
Some research has started on recognizing three dimensional objects, but with less success. Model edges have been matched to image edges (with both two and three dimensional data) while simultaneously extracting the position parameters of the modeled objects. In polyhedral scenes, recognition is generally complete, but otherwise only a few features are found. The limits of the edge-based approach are fourfold:
In the last decade, low-level vision research has been working towards direct deduction and representation of scene properties - notably surface depth and orientation. The sources include stereo, optical flow, laser or sonar range finding, surface shading, surface or image contours and various forms of structured lighting.
The most well-developed of the surface representations is the 2½D sketch advocated by Marr. The sketch represents local depth and orientation for the surfaces, and labels detected surface boundaries as arising from shape or depth discontinuities. The exact details of this representation and its acquisition are still being researched, but its advantages seem clear enough. Results suggest that surface information reduces data complexity and interpretation ambiguity, while increasing the structural matching information.
The richness of the data in a surface representation, as well as its imminent availability, offers hope for real advances beyond the current practical understanding of largely polyhedral scenes. Distance, orientation and image geometry enable a reasonable reconstruction of the three dimensional shape of the object's visible surfaces, and the boundaries lead to a figure/ground separation. Because it is possible to segment and characterize the surfaces, more compact symbolic representations are feasible. These symbolic structures have the same relation to the surface information as edges currently do to intensity information, except that their scene interpretation is unambiguous. If there were:
The goal of object recognition, as defined above, is the complete matching of model to image structures, with the concomitant extraction of position information. Hence, the output of recognition is a set of fully instantiated or explained object hypotheses positioned in three dimensions, which are suitable for reconstructing the object's appearance.
The research described here tried to attain these goals for moderately complex scenes containing multiple self-obscuring objects. To fully recognize the objects, it was necessary to develop criteria and practical methods for:
The approach requires object models composed of a set of surfaces geometrically related in three dimensions (either directly or through subcomponents). For each model SURFACE, recognition finds those image surfaces that consistently match it, or evidence for their absence (e.g. obscuring structure). The model and image surfaces must agree in location and orientation, and have about the same shape and size, with variations allowed for partially obscured surfaces. When surfaces are completely obscured, evidence for their existence comes either from predicting self-occlusion from the location and orientation of the model, or from finding closer, unrelated obscuring surfaces.
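The model-to-data surface consistency test can be sketched as follows (the helper and its thresholds are assumptions for illustration; they are not this work's implementation): orientations must agree, and the data surface's area must be compatible with the model's, allowing a shortfall for partial occlusion but not a large excess.

```python
# Hedged sketch of the consistency test described in the text: a data
# surface matches a model SURFACE when their unit normals agree and the
# data area does not exceed the model area (obscured surfaces may be
# smaller, never much larger). Thresholds are illustrative.
import math

def surfaces_consistent(model_normal, data_normal, model_area, data_area,
                        max_angle_deg=20.0, area_tolerance=0.1):
    # orientation: angle between unit normals must be small
    dot = sum(m * d for m, d in zip(model_normal, data_normal))
    dot = max(-1.0, min(1.0, dot))      # clamp against rounding error
    if math.degrees(math.acos(dot)) > max_angle_deg:
        return False
    # size: partial occlusion can shrink the data surface, not enlarge it
    return data_area <= model_area * (1.0 + area_tolerance)
```

A partially obscured surface (smaller data area, agreeing normal) passes; a surface of the wrong orientation or an implausibly large one fails.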
The object representation requires the complete object surface to be segmentable into what would intuitively be considered distinct surface regions. These are what will now be generally called surfaces (except where there is confusion with the whole object's surface). When considering a cube, the six faces are logical candidates for the surfaces; unfortunately, most natural structures are not so simple. The segmentation assumption presumes that the object can be decomposed into rigid substructures (though possibly non-rigidly joined), and that the rigid substructures can be uniquely segmented into surfaces of roughly constant character, defined by their two principal curvatures. It is also assumed that the image surfaces will segment in correspondence with the model SURFACEs; if the segmentation criteria are object-based, then the model and data segmentations should be similar. (The SURFACE is the primitive model feature, and represents a surface patch.) Of course, these assumptions are simplistic because surface deformation and object variations lead to alternative segmentations, but a start must be made somewhere.
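The phrase "surfaces of roughly constant character, defined by their two principal curvatures" can be illustrated with the standard differential-geometry labels (this classifier is illustrative, not the thesis's segmentation procedure):

```python
# Classify a surface patch by its two principal curvatures k1 and k2.
# Near-zero curvature in both directions gives a planar patch; in one
# direction, a singly curved patch (e.g. cylindrical); in neither, a
# doubly curved patch (e.g. ellipsoidal or saddle-shaped).
def surface_character(k1, k2, eps=1e-3):
    flat1, flat2 = abs(k1) < eps, abs(k2) < eps
    if flat1 and flat2:
        return "planar"
    if flat1 or flat2:
        return "singly curved"
    return "doubly curved"
```

Under the segmentation assumption, each model SURFACE and each corresponding data surface would receive the same character label.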
The three models used in the research are: a trash can, a classroom chair, and portions of a PUMA robot. The major common feature of these objects is the presence of regular distinct surfaces uncluttered by shape texture, when considered at a "human" interpretation scale. The objects were partly chosen for experimental convenience, but also to test most of the theories proposed here. The models are shown in typical views in Chapter 7. Some of the distinctive features of each object and their implications on recognition are:
These objects were viewed in semi-cluttered laboratory scenes that contained both obscured and unobscured views (an example appears in the next section). Using an intensity image to register all data, nominal depth and surface orientation values were measured by hand at about one hundred points, and values at other nearby image points were calculated by interpolation. Obscuring and shape segmentation boundaries were selected by hand, to avoid unresolved research problems of segmentation, scale and data errors. No fully developed processes yet produce these segmentations, but several are likely to do so in the near future, and assuming that such segmentations were available allowed us to concentrate on the primary issues of representation and recognition.
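The interpolation step can be sketched with a simple scheme (the text does not specify the interpolant used; inverse distance weighting is an assumption for illustration): depth at an unmeasured image point is a distance-weighted average of the hand-measured samples.

```python
# Illustrative inverse distance weighting over the sparse hand-measured
# points: nearby measurements dominate the interpolated depth value.
import math

def interpolate_depth(x, y, samples, power=2.0):
    """samples: list of (xi, yi, depth_i) hand-measured points."""
    num = den = 0.0
    for (xi, yi, zi) in samples:
        d = math.hypot(x - xi, y - yi)
        if d < 1e-9:
            return zi               # exactly at a measured point
        w = 1.0 / d ** power
        num += w * zi
        den += w
    return num / den
```

The same scheme applies componentwise to surface orientation values, with the registered intensity image supplying the common coordinate frame.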