Other Object Representation Approaches

Marr [112] proposed five criteria for evaluating object representations:

  1. accessibility - needed information should be directly available from the model rather than derivable through heavy computation,
  2. scope - a wide range of objects should be representable,
  3. uniqueness - an object should have a unique representation,
  4. stability - small variations in an object should not cause large variations in the model and
  5. sensitivity - detailed features should be represented as needed.
We will use these criteria to help understand the relationships between different model schemes.

The model representation scheme defined above generally satisfies these criteria, except that the uniqueness criterion is weakened to become: an object should have only a few representations and these should be easily derivable from each other. The problem with the uniqueness criterion appears when one considers the subcomponent hierarchy. At present, there is no strong understanding of when to describe a set of features at one level in a model hierarchy versus creating a separate subcomponent for them. Further, given that the image data is typically seen from only one viewpoint, the properties that were used in organizing the model may not be observed in the image data. An example is a teapot modeled with (a) the body and spout grouped at one level of description and the handle added at a higher level, versus (b) all three features grouped at the same level.

The problems of model hierarchies suggest that another criterion of a good representation is conceptual economy, and I have identified two justifications for it. First, economy dictates that there should be only a single representation of any particular shape, and multiple instances of that shape should refer to the single representation. Second, features that are distinctly characterized as a whole, irrespective of their substructures, should be represented simply by reference to that whole, with the details of that feature represented elsewhere.

Brady [39] proposed three additional criteria for analyzing representation schemes: rich local support; smooth extension and subsumption of representations; and reference frame propagation. Since surface patch types are defined by the principal curvatures, the first additional criterion is partially satisfied. The second does not seem to apply (Brady investigated merging local descriptions to form larger descriptions). The reference frame propagation criterion also seems to be satisfied because, here, all model features have well-defined local reference frames and data descriptions also have local reference frames (though there may be degrees-of-freedom, as with a cylindrical patch).
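As an aside on the curvature-based patch types mentioned above, the sketch below (in Python) illustrates one possible sign-based classification of a patch from its two principal curvatures; the class labels and tolerance are illustrative assumptions, not the exact scheme used in the SMS models.

  # Illustrative sign-based patch classification from principal curvatures.
  # The class names and the tolerance eps are assumptions for this sketch only.
  def classify_patch(k1, k2, eps=1e-3):
      s1 = 0 if abs(k1) < eps else (1 if k1 > 0 else -1)
      s2 = 0 if abs(k2) < eps else (1 if k2 > 0 else -1)
      if s1 == 0 and s2 == 0:
          return "planar"
      if s1 == 0 or s2 == 0:
          return "cylindrical"          # one flat direction: ridge or valley
      if s1 == s2:
          return "elliptic"             # convex or concave (peak or pit)
      return "hyperbolic"               # saddle-shaped

  print(classify_patch(0.0, 0.2))       # cylindrical
  print(classify_patch(-0.1, 0.3))      # hyperbolic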

Modeling schemes in computer vision mainly belong to two families: property (or feature) representations and geometric representations.

The representations may be expressed implicitly in a computer program or explicitly as an identifiable defined model. The implicit model is not different in competence from the explicit model, but is ignored here because of its lack of generality. We discuss the two families in greater detail in the following subsections. (These representations are closely related to the matching algorithms discussed in Chapter 2.)

Property Representations

A typical property representation associates lists of expected properties with each object. One can also include relationships that must hold with other structures (e.g. [19]), such as in Adler's program [3], which interpreted two dimensional Peanuts cartoon figure scenes.

Property and relationship representations often take the form of a graph. Here, object features become nodes in the graph, relationships between the features become the arcs and properties of the features are labels on the nodes. For example, Barrow and Popplestone [16] represented visible object regions and their interrelationships (such as adjacency and relative size).
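To make this concrete, the following sketch (in Python) shows one way such a property-and-relationship graph could be stored, with nodes carrying property labels and arcs carrying relation labels; the feature names and values are hypothetical, not taken from Barrow and Popplestone's system.

  # A minimal sketch of a property/relationship graph: nodes are object
  # features with property labels, arcs are labelled relations such as
  # adjacency or relative size.  All names and values are hypothetical.
  model_graph = {
      "nodes": {
          "region_1": {"shape": "elongated", "relative_area": 0.6},
          "region_2": {"shape": "compact", "relative_area": 0.2},
      },
      "arcs": [
          ("region_1", "region_2", "adjacent"),
          ("region_1", "region_2", "larger_than"),
      ],
  }

  def related_pairs(graph, relation):
      """Return all pairs of features linked by the given relation label."""
      return [(a, b) for a, b, r in graph["arcs"] if r == relation]

  print(related_pairs(model_graph, "adjacent"))   # [('region_1', 'region_2')]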

Graph representations have the advantage of adding some structure to the object properties and providing a common representation method for many problems. One problem is that all object details tend to be represented at the same level, so the graphs can become large without benefit. Adding more detail increases the computational difficulties of matching rather than easing them. Barrow et al. [17] investigated hierarchical graph representations in matching to try to overcome this computational complexity.

Various researchers ([83,119,123]) have augmented property representations with weak geometric shape (e.g. parallel, square) and relationships (e.g. above, near).

Property representations mainly satisfy Marr's scope and accessibility criteria. Graph and property representations are, however, usually two dimensional, whereas we are interested in three dimensional objects and scenes, where changes in viewpoint cause drastic changes in the representation. Property representations offer simplicity at the expense of weak descriptive power and no support for active deduction. However, it is still difficult to represent natural objects geometrically, so their recognition must depend heavily on these property representations.

Geometric Representations

Early geometric models were based on three dimensional point or line descriptions. Point models (e.g. [139]) specify the location of significant points relative to the whole object. This method is simple but problematic, because of difficulties in correctly establishing model-to-data correspondences. Edge models (e.g. [60]) specify the location, orientation and shape of edges (typically orientation discontinuities). These characterize the wire-frame shape of an object better than the point models and have stronger correspondence power, but lead to difficulties because of the ambiguity of scene edges and the difficulty of reliably extracting the edges. Further, it is difficult to define and extract intrinsic linear features on curved surfaces. Point and edge models have trouble meeting Marr's scope and sensitivity criteria.

Surface models describe the shape of observable surface regions and their relationship to the whole object (and sometimes to each other). Key questions include how to describe the shape of the surfaces and what constitutes an identifiable surface primitive.

Surface regions can be represented by their bounding space curves in wire frame models ([36,14], page 291). Planar surfaces are most easily represented, but curved surfaces can also be represented by judicious placement of lines. While such models are useful for computer-aided design and graphics, the surface information needed for recognition tends to be hard to access.

Surface patch models can give arbitrarily accurate representations of object surfaces. One approach to surface representation is by bicubic spline patches ([171,14], page 269), where cubic polynomials approximate the surface between fixed control points, giving both positional and derivative continuity. These models are weak on the uniqueness and stability criteria. Also, these modeling approaches ignore the problem of shape discontinuities (i.e. surface orientation discontinuities).
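As an illustration of the idea (though not of any particular modeling system), the sketch below evaluates a bicubic Bézier patch, one common form of bicubic patch, from a 4x4 grid of control points; the control point values are hypothetical.

  import numpy as np

  def bernstein3(t):
      """The four cubic Bernstein basis functions evaluated at t in [0, 1]."""
      return np.array([(1 - t) ** 3,
                       3 * t * (1 - t) ** 2,
                       3 * t ** 2 * (1 - t),
                       t ** 3])

  def bicubic_patch(control, u, v):
      """Evaluate a bicubic Bezier patch (4x4x3 control array) at (u, v)."""
      return np.einsum("i,j,ijk->k", bernstein3(u), bernstein3(v), control)

  # A nearly flat patch with one raised control point (hypothetical data).
  P = np.zeros((4, 4, 3))
  P[..., 0], P[..., 1] = np.meshgrid(np.arange(4.0), np.arange(4.0), indexing="ij")
  P[1, 2, 2] = 1.0
  print(bicubic_patch(P, 0.5, 0.5))     # a point near the centre of the patch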

A second popular approach uses polygonal patches (e.g. [34]), with subdivision of the patches to achieve the required accuracy. These represent surfaces well, but give no conceptual structure to the surface and also fall short on the stability criterion. Faugeras and Hebert [63] used planar patches derived from depth data to partially bound a three dimensional rigid object. Here, the model did not characterize the full object well; rather, it concentrated on nearly planar regions. Other researchers have created planar and cylindrical surface models from light-stripe range data (e.g. [133,52]).

On the whole, surfaces represent the actual visibility of an object well and allow direct comparison of appearance, but do not easily characterize the mass distribution of an object. Further, techniques for describing curved or naturally varying surfaces for recognition have not yet been well formulated.

Volumetric models represent the solid components of an object in relation to each other or the whole object. Examples include space filling models (e.g. [14], page 280), which represent objects by denoting the portions of space in which the object is located, and constructive solid geometry (CSG) models, which start from geometric primitives like cubes, cylinders or half-spaces (e.g. [137,44]) and then form more complex objects by union, difference and intersection operations. With these, three dimensional character is directly accessible, but appearance is hard to deduce without the addition of surface shape and reflectance information. Matching with solids requires finding properties of images and solids that are directly comparable, such as obscuring boundaries and axes of elongation. These volumetric representations tend to meet only Marr's scope and sensitivity criteria.
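A minimal sketch of the CSG idea, assuming solids are represented by implicit inside/outside functions (f(p) <= 0 meaning inside), is given below; the sphere and box primitives and the numerical values are illustrative only.

  import numpy as np

  # CSG combination of implicit solids: union, intersection and difference
  # reduce to min/max combinations of the primitives' inside/outside functions.
  def sphere(centre, r):
      return lambda p: np.linalg.norm(p - centre) - r

  def box(lo, hi):
      return lambda p: float(np.max(np.maximum(lo - p, p - hi)))

  def union(f, g):        return lambda p: min(f(p), g(p))
  def intersection(f, g): return lambda p: max(f(p), g(p))
  def difference(f, g):   return lambda p: max(f(p), -g(p))

  # A unit cube with a spherical bite taken out of one corner.
  solid = difference(box(np.zeros(3), np.ones(3)), sphere(np.ones(3), 0.4))
  print(solid(np.array([0.5, 0.5, 0.5])) <= 0)      # True: inside the solid
  print(solid(np.array([0.95, 0.95, 0.95])) <= 0)   # False: carved away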

Recently, Pentland [131] introduced a superquadric-based volumetric model representation. The advantage of the superquadrics was that a great variety of primitive volumes could be described using only a few parameters. Natural appearance was improved by adding fractal-based surface texture afterwards. Larger objects and scenes were formed by scaling and positioning instances of the primitives. Pentland demonstrated several techniques for locally estimating a global superquadric decomposition (from local shape analysis and global subsumption) that was particularly effective on elongated features. He also demonstrated estimation of fractal parameters from local shape spatial frequency spectra and showed how these could be used to segment texture regions. However, as with the CSG primitives, superquadric models seem to be more useful for graphics, and appear to have difficulties with uniqueness and with accessing the correct feature for model matching. This remains an open research area.
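For reference, the standard superquadric (superellipsoid) inside-outside function needs only five shape parameters (three semi-axis lengths and two exponents); the sketch below is illustrative and the parameter values are hypothetical.

  import numpy as np

  def superquadric_F(p, a, e1, e2):
      """Inside-outside function: F < 1 inside, F = 1 on the surface, F > 1 outside."""
      x, y, z = np.abs(np.asarray(p, dtype=float)) / np.asarray(a, dtype=float)
      return (x ** (2.0 / e2) + y ** (2.0 / e2)) ** (e2 / e1) + z ** (2.0 / e1)

  # e1 = e2 = 1 gives an ellipsoid; exponents near 0 give box-like shapes.
  print(superquadric_F((0.5, 0.0, 0.0), (1.0, 1.0, 2.0), 1.0, 1.0))   # 0.25: inside
  print(superquadric_F((1.2, 0.0, 0.0), (1.0, 1.0, 1.0), 0.2, 0.2))   # > 1: outside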

Another promising volumetric model for computer vision is the generalized cylinder (or cone) ([27,111,6,91]), which has had its most significant use in the ACRONYM system [42].

The primitive unit of representation is a solid specified by a cross-sectional shape, an axis along which to sweep the cross-section and a sweeping rule describing how the shape and orientation of the cross-section vary along the axis.
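The sketch below makes this definition concrete by sampling surface points of a simple generalized cylinder with a circular cross-section, a straight axis and a linear taper as the sweeping rule; these particular choices are illustrative simplifications (real systems allow curved axes and more general cross-sections and sweeping rules).

  import numpy as np

  def generalized_cylinder(length=2.0, base_radius=0.5,
                           sweep_rule=lambda s: 1.0 - 0.5 * s,
                           n_axis=20, n_cross=36):
      """Sample points on a tapering tube: sweep a circle along a straight axis."""
      points = []
      for i in range(n_axis):
          s = i / (n_axis - 1)               # normalised position along the axis
          r = base_radius * sweep_rule(s)    # cross-section size at this position
          for j in range(n_cross):
              t = 2 * np.pi * j / n_cross
              points.append((r * np.cos(t), r * np.sin(t), s * length))
      return np.array(points)

  surface_points = generalized_cylinder()
  print(surface_points.shape)   # (720, 3): a tube that narrows towards one end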

The axis was the key feature because of its relation to axes directly derivable from image data. Many "growth"-based natural structures (e.g. tree branches, human limbs) have an axis of elongation, so generalized cylinders make good models for them. The representation also captures many simple man-made objects well, because the manufacturing operations of extrusion, shaping and turning create reasonably regular, nearly symmetric elongated solids.

In Marr's proposal [112], objects were described by the dominant model axis and the names and placement of subcomponents about the axis. Subcomponents could be refined hierarchically to provide greater detail. The specification used dimensionless units, which allowed scale invariance, and the relative values were represented by quantized value ranges that provided both the symbolic and approximate representation needed for stability to variation and error.

Brooks [42] used generalized cylinders in the ACRONYM system, where they could be directly matched to image boundary pairs (i.e. ribbons). Subcomponents were attached by specifying the rotation and translation relating the object and subcomponent reference frames. All primitive or affixment relationship specifications could contain variables, but the subcomponents used in ACRONYM's airplane example were all rigidly connected to the main model. Hogg [91] used variable attachments to represent joint variation in a human model, with a posture function constraining relative joint position over time.
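A minimal sketch of such an affixment, assuming the attachment is expressed as a 4x4 homogeneous transform (a rotation plus a translation) relating the subcomponent frame to the object frame, is given below; the joint angle and offsets are hypothetical, and the angle could itself be a constrained variable, as in Hogg's model.

  import numpy as np

  def affixment(angle_z, translation):
      """Homogeneous transform: rotate about z by angle_z, then translate."""
      c, s = np.cos(angle_z), np.sin(angle_z)
      T = np.eye(4)
      T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
      T[:3, 3] = translation
      return T

  # Map a point expressed in the subcomponent frame into the object frame.
  T_obj_sub = affixment(np.pi / 4, [1.0, 0.0, 0.5])
  p_sub = np.array([0.2, 0.0, 0.0, 1.0])    # homogeneous point in subcomponent frame
  print(T_obj_sub @ p_sub)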

For matching, ACRONYM's geometric models were compiled into a prediction graph, where key generalized cylinders became the nodes and placement relationships between the cylinders defined the relations. Because the image data being matched was two dimensional, the prediction graphs represented typical two dimensional views of the objects, derived from the full three dimensional geometric model. The advantage of this was that the full constraints of the geometric model could be employed in the uniform graph matching method. Substantial reasoning was needed to derive the prediction graph from the three dimensional models.

The most important contribution of ACRONYM's modeling was the use of algebraic constraints that limit the range of model variables in relation to either fixed values or other variables. An example would be: "the length of a desk top is greater than the width, but less than twice the width", where both length and width are variable parameters.
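A minimal sketch of this kind of constraint, using the desk-top example and representing each constraint as a predicate over the variable values (an illustrative encoding, not ACRONYM's actual constraint machinery), is shown below.

  # Desk-top example: the length is greater than the width but less than twice the width.
  constraints = [
      lambda v: v["length"] > v["width"],
      lambda v: v["length"] < 2 * v["width"],
  ]

  def satisfies(values):
      """True if the variable assignment meets every constraint."""
      return all(c(values) for c in constraints)

  print(satisfies({"length": 1.5, "width": 1.0}))   # True: a plausible desk top
  print(satisfies({"length": 2.5, "width": 1.0}))   # False: violates the upper bound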

Variables and constraints together support generic class models, at least in structural terms (as opposed to functional). The structural aspects of the model define the essential components and their attachments, symbolic parameters denote the type of variation and the constraints specify the range of variation. The effect of the algebraic constraints is to structure the space of all possible models with the same logical part relationships into a generalization hierarchy, where more restrictive constraints define generic specializations of the model. Subclasses are defined by more tightly constrained (or constant) parameters, or additional constraints.

A criticism of the generalized cylinder/cone representation concerns its choice of primitive element. Many natural and man-made objects do not have vaguely cylindrical components: a leaf, a rock, the moon, a crumpled newspaper, a tennis shoe. Though some aspects of these objects could be reasonably represented, the representation would omit some relevant aspects (e.g. the essential two dimensionality of the leaf), or introduce other irrelevant ones (e.g. the axis of a sphere). Hence, other primitives should at least be included to increase its scope.

Secondly, what is perceived is the surface of objects. Hence, it seems reasonable that the preferential representation for object recognition should make surface-based information explicit. The near-parallel, tangential obscuring boundaries of a generalized cylinder ("ribbon") are reasonable features for detection, and the orientation of the figure's spine constrains its three dimensional location, but this is about all the information that is easily derivable from the cylinder representation. Surface shape comparisons are non-trivial, because it is difficult to determine the visible surfaces of the cylinder and what it will look like from a given viewpoint. It is often hard to decide with what model feature an image feature should correspond.

These models are structured, so that they meet the sensitivity criterion and give unique representations. When modeling objects formed from hierarchical collections of elongated primitive shapes, the generalized cylinder method also meets the scope and stability criteria. The key problem of these models is accessibility - it can be hard to deduce much about the appearance of an object from its volumetric description and much about which model to use from the observed appearance.

These considerations, however, are most relevant when recognizing objects whose three dimensional surface shape and structure is apparent at the scale of analysis. The ACRONYM examples, aerial photographs of airport scenes, were largely two dimensional as almost all objects were reduced to laminar surfaces viewed perpendicularly. Hence, treating nearly parallel intensity boundaries as potentially observable extremal boundaries of generalized cylinders was appropriate.

The reduction of generics to numerical ranges of parameter values is simplistic, although an important first step. Sometimes it is inappropriate: a model adequate for recognizing a particular type of office chair probably would not specialize to any other chair. Any chair model that includes most office chairs would probably require a functional definition: seating surface meets appropriate back support surface.

Brooks attempted to introduce structural variation through parameter variation, but his solution seems inappropriate. For example, an integer variable ranging from 3 to 6 was used to state that an electric motor had 3, 4, 5 or 6 flanges, and a second variable stated whether a motor had a base by constraining its value to 0 or 1. More complicated algebraic inequalities stated that motors with bases have no flanges. Uniform representation is a laudable goal, but these examples suggest that a more powerful representation should be considered. Hence, the physical variation within a class (i.e. size and affixment variations), which constraints represent well, should be separated from the conceptual relationships involved in generalization, where logical/relational constraints would be better.

There has been little work on object representations for non-rigid objects, other than ACRONYM, as just described. Grimson [79] defined and recognized piecewise linear two dimensional models that could scale, stretch along one axis or have subcomponents joined by a revolute joint. Terzopoulos et al. [159] described symmetry-seeking models based on deformable parametric tube and spine primitives. The shapes of the models were controlled by energy constraints on the deformation of the spine, the symmetry of the tube and the relative position of the two. These models allowed the successful reconstruction of three dimensional natural shapes (e.g. a crook-neck squash) from only two dimensional image contours. It is unclear, however, whether this approach is suitable for object recognition, because of the many parameters required.

Most geometric representations are object-centered, though Minsky [116] proposed a frame representation for recording features visible from typical distinct viewpoints, thus adding viewer-centered information. Chakravarty [47] applied this notion in classifying the topologically distinct views of polyhedral scenes, for use in object recognition. Koenderink and van Doorn [104] considered the application of such ideas to generic smooth objects, by analyzing the discontinuities of occlusion and visibility relationships.

Final Comments

This completes our discussion of object representation. Other discussions of three dimensional object representation techniques can be found in [14,23,71]. A particularly thorough review was given by Besl [26], covering geometric techniques for space-curves, surfaces, volumes and four dimensional (i.e. moving) objects. His review also summarized well the basic geometric matching techniques that are typically used with each class of object.

With the models defined in this chapter, we can now start to recognize the objects in the scene. The recognition process will make heavy use of the surface-based geometric models, and of the subcomponent hierarchy. The first step of recognition is to select potential models for the image features, which is the topic of the next chapter.

