
Related Work

There is little work on model invocation in the context of three dimensional vision. The most common technique is to compare every model to the data, which is practical only when few possibilities exist.

A second level of technique uses a few easily measured object (or image) properties to select a subset of potential models for complete matching. Roberts [139] used configurations of approved polygons in the line image to index models directly according to viewpoint. Nevatia and Binford [121] used an indexing scheme that compared the number of generalized cylinders connecting at the two ends of a distinguished cylinder.
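As a rough illustration of this kind of property-based indexing, the sketch below keys a table on the pair of connection counts at the two ends of a distinguished cylinder. It is not Nevatia and Binford's implementation: the function names, the model names and the counts are all hypothetical, and the key is made order-independent on the assumption that either end may be encountered first.

```python
from collections import defaultdict

def connectivity_key(count_end_a, count_end_b):
    """Order-independent key: which end is 'first' is assumed to be arbitrary."""
    return tuple(sorted((count_end_a, count_end_b)))

def build_index(models):
    """Map each connectivity key to the list of models that exhibit it."""
    index = defaultdict(list)
    for name, counts in models.items():
        index[connectivity_key(*counts)].append(name)
    return index

# Hypothetical model base: model name -> connection counts at the two ends.
MODELS = {"model_a": (2, 2), "model_b": (1, 3), "model_c": (3, 1), "model_d": (0, 1)}
INDEX = build_index(MODELS)

# Only the models sharing the observed key go forward to complete matching.
candidates = INDEX.get(connectivity_key(3, 1), [])
```

The lookup returns a subset of the model base, which is the point of the second-level techniques: complete matching is attempted only on the selected models.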

Object properties have often been used to discriminate between potential objects in the domain using tabular and decision-tree techniques. Examples include the early SRI work (e.g. [19,156,157]), which recognized objects in office scenes using constraints that held between objects. This work did not distinguish invocation from verification, and was successful because the model bases were small, the domain simple and the objects easily discriminable. If model bases are large, then there are likely to be many objects with similar properties. Further, data errors and occlusion will make the choice of initial index property difficult, or require vast duplication of index links.

Bolles et al. [35] implemented a powerful and practical indexing method, the local feature focus method (for use in a two dimensional silhouette industrial domain). The method used key features (e.g. holes and corners) as the primary indices (focus features), which were then supported by locating secondary features at given distances from the first.
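The focus-then-support idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of [35]: the feature tuples, the `MODEL` layout, the distance tolerance `tol` and the support count are all hypothetical choices made for the example.

```python
import math

# Hypothetical model: a focus feature type plus the secondary feature
# types expected at given distances from the focus feature.
MODEL = {"focus": "hole", "secondary": [("corner", 5.0), ("hole", 10.0)]}

def support(focus, features, secondary, tol=0.5):
    """Count the expected secondary features found at the right distance."""
    found = 0
    for kind, dist in secondary:
        for f in features:
            if f is focus or f[0] != kind:
                continue
            d = math.hypot(f[1] - focus[1], f[2] - focus[2])
            if abs(d - dist) <= tol:
                found += 1
                break  # one match per expected secondary feature
    return found

def invoke(features, model, tol=0.5):
    """Score every image feature of the focus type by its secondary support."""
    focus_candidates = [f for f in features if f[0] == model["focus"]]
    return [(f, support(f, features, model["secondary"], tol))
            for f in focus_candidates]

# Hypothetical image features as (type, x, y).
FEATURES = [("hole", 0, 0), ("corner", 3, 4), ("hole", 10, 0), ("corner", 20, 20)]
results = invoke(FEATURES, MODEL)
```

In this example the hole at the origin finds both of its expected secondary features, while the hole at (10, 0) finds only one, so the first is the better-supported focus feature.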

Key properties are clearly needed for this task, so these were good advances. However, property-based discrimination methods are sensitive to property estimation errors. Moreover, there are other classes of evidence and object relationships. Property-based indexing often makes subfeatures unusable, because they are either too complex to calculate everywhere or too object specific. Alternatively, the properties are too simple, invoke everywhere, and do not properly account for commonality of substructures.

When it comes to sophisticated three dimensional vision, Marr stated:

"Recognition involves two things: a collection of stored 3-D model descriptions, and various indexes into the collection that allow a newly derived description to be associated with a description in the collection." ([112], page 318)

He advocated a structured object model base linked and indexed on three types of links: the specificity, adjunct and parent indices, which correspond to the subclass, subcomponent and supercomponent link types used here. He assumed that the image structures are well described and that model invocation is based on searching the model base using constraints on the relative sizes, shapes and orientations of the object axes. Recognized structures lead to new possibilities by following the indices. The ACRONYM system [42] implemented a similar notion.

Direct indexing will work for the highest levels of invocation, assuming perfect data from perfectly formed objects. However, it is probably inadequate for more realistic situations. Further, there remains the problem of locating the point from which to start the search, particularly in a large model base.

Arbib [9] also proposed an invocation process that takes place in a schematic context. In his view, schemata have three components:

i. Input-matching routines which test for evidence that that which the schema represents is indeed present in the environment.
ii. Action routines - whose parameters may be tuned by parameter-fitting in the input-matching routines.
iii. Competition and cooperation routines which, for example, use context (activation levels and spatial relations of other schemas) to lower or raise the schema's activation level.

His point (i) requires each schema to be an active matching process, but is similar, in principle, to the evidence accumulation process discussed here. His point (ii) corresponds to the hypothesis construction and verification processes (Chapters 9 and 10) and point (iii) corresponds closely to the inhibition and relation evidence types used here. His schema invocation process was not defined in detail; it considered mainly the highest levels of description (e.g. of objects) and only weakly the types of visual evidence or the actual invocation computation.

Hoffman and Jain [90] described an evidence-based object recognition process that is similar to the one described in this chapter. Starting from a set of surface patches segmented from range data, they estimated a set of unary and binary properties. Evidence conditions were formulated as conjunctions of property requirements. The degree to which an evidence condition supported or contradicted each model was also specified. When an unknown object was observed, a similarity measure was computed for each model. This approach can use object-specific evidence conditions, leading to more precise identification, but at the cost of evaluating the conditions for all objects. Good results were demonstrated with a modest model-base.
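A minimal sketch of this style of evidence-based scoring is given below. The property names, ranges, weights and model names are hypothetical, and summing the weights of the satisfied conditions is an assumption made for illustration; it is not claimed to be the similarity measure used in [90].

```python
# Each evidence condition is a conjunction of range requirements on
# properties, paired with a weight per model: positive weights support
# the model, negative weights contradict it. All values are hypothetical.
CONDITIONS = [
    ({"area": (10.0, 20.0), "curvature": (0.0, 0.1)},
     {"block": 0.8, "cylinder": -0.5}),
    ({"curvature": (0.2, 0.5)},
     {"block": -0.6, "cylinder": 0.9}),
]

def holds(requirements, observed):
    """A condition holds only if every property requirement is satisfied."""
    return all(
        prop in observed and lo <= observed[prop] <= hi
        for prop, (lo, hi) in requirements.items()
    )

def similarity(observed, conditions):
    """Sum, for each model, the weights of the conditions that hold."""
    scores = {m: 0.0 for _, weights in conditions for m in weights}
    for requirements, weights in conditions:
        if holds(requirements, observed):
            for model, weight in weights.items():
                scores[model] += weight
    return scores

flat_patch = similarity({"area": 15.0, "curvature": 0.05}, CONDITIONS)
curved_patch = similarity({"curvature": 0.3}, CONDITIONS)
```

Note that every condition is tested against the observed properties regardless of which model it favours, which reflects the cost mentioned above of evaluating the conditions for all objects.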

Binford et al. [30] also described a similar (though probability based) invocation approach. The scheme uses a "Bayesian network", where probabilities accumulate from subcomponent relationships and originate from the likelihood that an observed two dimensional image feature is the projection of a three dimensional scene feature. Reliability is enhanced by only using "quasi-invariant" features (i.e. those nearly constant over a large range of viewing positions), such as the fixed relative orientation between two axes. The network formulation used alternating layers of object and relative placement relationships ("joints"). Subcomponent a priori distributions were generated heuristically from occurrences in the model base.

There are obvious links between the network approach described here and the connectionist approach. In the latter, the domain knowledge is implicit in the weighted interconnections between simple identical processing units, which represent the interobject relationships. These machines can converge to a fixed output state for a given input state, and can learn network connection weights (e.g. [92,88,2]). Many of the computations proposed for invocation can probably be implemented using such devices.

Hinton proposed and evaluated ([87,89]) a connectionist model of invocation that assigns a reference frame as well as invoking the model. The model uses connections between retinotopic feature units, orientation mapping units, object feature (subcomponent) units and object units. This model requires duplicated connections for each visible orientation, but expresses them through a uniform mapping method. Consistent patterns of activity between the model and data features reinforce the activation of the mapping and model units. The model was proposed only for two dimensional patterns (letters) and required many heuristics for weight selection and convergence.

Feldman and Ballard [64] proposed a connectionist model indexing scheme using spatial coherence (coincidence) of properties to gate integration of evidence. This helps overcome inappropriate invocation due to coincidentally related features in separate parts of the image. The properties used in their example are simple discriminators: "circle, baby-blue and fairly-fast" for a frisbee.

This proposal did not have a rich representation of the types of knowledge useful for invocation nor the integration of different types of evidence, but did propose a detailed computational model for the elements and their connections.
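The gating role of spatial coincidence can be sketched with sets of image locations. This is only an illustration under assumptions: the property maps, coordinates and function name are hypothetical, and the set intersection stands in for the connectionist units of [64].

```python
def invoke_with_coincidence(property_maps, required):
    """Return the image locations at which every required property is present."""
    location_sets = [property_maps.get(p, set()) for p in required]
    return set.intersection(*location_sets) if location_sets else set()

FRISBEE = ["circle", "baby-blue", "fairly-fast"]

# Hypothetical property maps: property name -> locations where it was detected.
coincident = {
    "circle": {(1, 1), (5, 5)},
    "baby-blue": {(1, 1)},
    "fairly-fast": {(1, 1), (9, 9)},
}
scattered = {
    "circle": {(0, 0)},
    "baby-blue": {(3, 3)},
    "fairly-fast": {(7, 7)},
}

hits = invoke_with_coincidence(coincident, FRISBEE)
no_hits = invoke_with_coincidence(scattered, FRISBEE)
```

In the second case all three properties occur somewhere in the image, but never at the same location, so the gating suppresses the invocation.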

Feldman [65] later refined this model. It starts with spatially co-located conjunctions of pairs of properties connected in parallel with the feature plane (descriptions of image properties). Complete objects are activated for the whole image based on conjunctions of activations of these spatially coincident pairs. The advantage of complete image activation is that it is not necessary to connect new objects in each image location. The disadvantage is an increased likelihood of spurious invocations arising from cross-talk (i.e. unrelated, spatially separated features invoking the model). Top-down priming of the model holds when other knowledge (e.g. world knowledge) is available. Structured objects are represented by linkage to the subcomponents in the distinct object viewpoints. Multiple instances of the objects use "instance" nodes, but little information is given to suggest how the whole image model can activate separate instances.
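The cross-talk problem can be made concrete with a small sketch. It assumes, for illustration only, that an object is activated when every pair of its required properties coincides at some image location; the property maps and function names are hypothetical and are not taken from [65].

```python
from itertools import combinations

def active_pairs(property_maps):
    """Property pairs that coincide at one or more image locations."""
    pairs = set()
    for a, b in combinations(sorted(property_maps), 2):
        if property_maps[a] & property_maps[b]:
            pairs.add((a, b))
    return pairs

def object_active(required, property_maps):
    """Whole-image activation: every required pair is active somewhere."""
    needed = set(combinations(sorted(required), 2))
    return needed <= active_pairs(property_maps)

# Hypothetical maps: each pair coincides somewhere, but at a different
# location, so no single location carries all three properties.
maps = {
    "circle": {(0, 0), (5, 5)},
    "baby-blue": {(0, 0), (9, 9)},
    "fairly-fast": {(5, 5), (9, 9)},
}

spurious = object_active(["circle", "baby-blue", "fairly-fast"], maps)
all_three_together = maps["circle"] & maps["baby-blue"] & maps["fairly-fast"]
```

Here the object is activated even though no location has all three properties, which is the kind of spurious invocation that whole-image activation risks.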

These approaches are similar to those of this chapter: direct property evidence triggers structurally decomposed objects seen from given viewpoints. The network formulation for invocation proposed here has a parallel structure for two reasons: (1) fast retrieval is needed and (2) it is a convenient formalism for expressing the computational relationships between evidence types.

A key difference between the connectionist work reviewed above and the work described in this book is the use of dedicated network structures, as specified by the evidence type's constraints, etc. There is also an implementation difference, in that many of the connectionist networks express their results as states or configurations of activity of the network, rather than as the activity at a single node, which is the approach here.

Other Potential Techniques

Little Artificial Intelligence research has treated model invocation as a specific issue. Work (e.g. [116,144]) has focused more on the contents and use of models or schemas than on how a schema is selected.

The NETL formalism of Fahlman ([58,59]) is a general indexing approach to invocation. This approach creates a large net-like database, with generalization/specialization type links. One function of this structure is to allow fast parallel search for concepts based on intersections of properties. For example, an elephant node is invoked by intersection of the "large", "grey" and "mammal" properties. The accessing is done by way of passing markers about the network (implemented in parallel), which is a discrete form of evidence passing. The few links used in this approach make it difficult to implement suggestiveness, as all propagated values must be based on certain properties.
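Marker passing by property intersection can be sketched serially as below. The network contents (including the "clyde", "mouse" and "battleship" nodes) and the function names are hypothetical, and the loop only simulates what NETL would do in parallel hardware.

```python
# Hypothetical network: each node lists the nodes it links up to
# (its properties and generalizations).
LINKS = {
    "elephant": {"large", "grey", "mammal"},
    "mouse": {"small", "grey", "mammal"},
    "battleship": {"large", "grey"},
    "clyde": {"elephant"},
}

def marked_by(prop, links):
    """Nodes reached by a marker that starts at prop and moves down the links."""
    marked = set()
    frontier = {prop}
    while frontier:
        reached = {n for n, ups in links.items() if ups & frontier} - marked
        marked |= reached
        frontier = reached
    return marked

def invoke(props, links):
    """Concepts holding one marker from every property."""
    marker_sets = [marked_by(p, links) for p in props]
    return set.intersection(*marker_sets) if marker_sets else set()

invoked = invoke(["large", "grey", "mammal"], LINKS)
```

Each marker is either present or absent at a node, so a concept is invoked only when every property is asserted. This discreteness is why graded suggestiveness is difficult to express in such a scheme.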

General pattern recognition/classification techniques are also of some use in suggesting potential models. A multi-variate classifier (e.g. [56]) could be used to assign initial direct evidence plausibility to structures based on observed evidence. Unfortunately, this mechanism works well with property evidence, but not for integrating evidence from other sources, such as subcomponent or generic relationships. Further, it is hard to provide the a priori occurrence and property statistics needed for the better classifiers.

The relaxation-based vision processes are also similar to the plausibility computation. Each image structure has a set of possible labels that must be consistent with the input data and related structure labels. Applications have tended to use the process for either image modification [141], pixel classification [82], structure detection, or discrete consistency maintenance [162]. Most of the applications modify the input data to force interpretations that are consistent with some criterion rather than to suggest interpretations that are verified in another manner. Unfortunately, invocation must allow multiple labels (generics) and has a non-linear and non-probabilistic formulation that makes it difficult to apply previous results about relaxation computations.
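For comparison, the discrete consistency form of relaxation can be sketched as a label-discarding loop. The structures, labels and compatibility table below are hypothetical, and the sketch illustrates the general technique rather than any of the cited systems or the plausibility computation of this chapter.

```python
def discrete_relaxation(labels, neighbours, compatible):
    """Repeatedly discard labels that have no compatible label at some neighbour."""
    labels = {s: set(ls) for s, ls in labels.items()}
    changed = True
    while changed:
        changed = False
        for s, nbrs in neighbours.items():
            for lab in list(labels[s]):
                unsupported = any(
                    not any(compatible(lab, other) for other in labels[t])
                    for t in nbrs
                )
                if unsupported:
                    labels[s].discard(lab)
                    changed = True
    return labels

# Hypothetical compatibility table between labels of adjacent structures.
ALLOWED = {("wheel", "axle"), ("axle", "wheel")}

final = discrete_relaxation(
    labels={"a": {"wheel", "plate"}, "b": {"axle"}},
    neighbours={"a": ["b"], "b": ["a"]},
    compatible=lambda lab, other: (lab, other) in ALLOWED,
)
```

The loop only ever removes labels, which is what makes it a consistency maintenance process rather than a process that suggests interpretations.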


Bob Fisher 2004-02-26