Visual recognition (and visual perception) has received considerable philosophical investigation. Three key results are mentioned here as an introduction to this section.
(1) Perception interprets raw sensory data. For example, we interpret a particular set of photons hitting our retina as "green". As a result, perception is an internal phenomenon caused by external events. It transforms the sensory phenomena into a reference to a symbolic description. Hence, there is a strong "linguistic" element to recognition - a vocabulary of interpretations. The perception may be directly related to the source, but it may also be a misinterpretation, as with optical illusions.
(2) Interpretations are directly dependent on the theories about what is being perceived. Hence, a theory that treats all intensity discontinuities as instances of surface reflectance discontinuities will interpret shadows as unexplained or reflectance discontinuity phenomena.
(3) Identity is based on conceptual relations, rather than purely physical ones. An office chair with all atoms replaced by equivalent atoms or one that has a bent leg is still a chair. Hence, any object with the appropriate properties could receive the corresponding identification.
So, philosophical theory implies that recognition has many weaknesses: the interpretations may be fallacious, non-absolute and reductive. In practice, however, humans can effectively interpret unnatural or task-specific scenes (e.g. x-ray interpretation for tuberculosis detection) as well as natural and general ones (e.g. a tree against the sky). Moreover, almost all humans are capable of visually analyzing the world and producing largely similar descriptions of it. Hence, there must be many physical and conceptual constraints that restrict both the interpretation of raw data as image features and the relation of these features to objects. This chapter investigates the role of the second category in visual interpretation.
How is recognition understood here? Briefly, recognition is the production of symbolic descriptions. A description is an abstraction, as is stored object knowledge. The production process transforms sets of symbols to produce other symbols. The transformations are guided (in practice) by physical, computational and efficiency constraints, as well as by observer history and by perceptual goals.
Transformations are implementation dependent, and may be erroneous, as when using a simplified version of the ideal transformation. They can also make catastrophic errors when presented with unexpected inputs or when affected by distorting influences (e.g. optical, electrical or chemical). The notion of "transformation error" is not well founded, as the emphasis here is not on objective reality but on perceptual reality, and the perceptions now exist, "erroneous" or otherwise. The perceptions may be causally initiated by a physical world, but they may also be internally generated: mental imagery, dreams, illusions or "hallucinations". These are all legitimate perceptions that can be acted on by subsequent transformations; they are merely not "normal" or "well-grounded" interpretations.
Normal visual understanding is mediated by different description types over a sequence of transformations. The initial representation of a symbol may be by a set of photons; later channels may be explicit (value, place or symbol encoded), implicit (connectionist) or external. The communication of symbols between processes (or back into the same process) is also subject to distorting transformations.
In part, identity is linguistic: a chair is whatever is called a chair. It is also functional - an object has an identity only by virtue of its role in the human world. Finally, identity implies that it has properties, whether physical or mental. Given that objects have spatial extension, and are complex, some of the most important properties are linked to object substructures, their identity and their placement.
An identification is the attribution of a symbol whose associated properties are similar to those of the data, and is the output of a transformation. The properties (also symbols) compared may come from several different processes at different stages of transformation. Similarity is not a well defined notion, and seems to relate to a conceptual distance relationship in the space of all described objects. The similarity evaluation is affected by perceptual goals.
This is an abstract view of recognition. The practical matters are now discussed: what decisions must be made for an object recognition system to function in practice.
Descriptions and Transformations
This research starts at the sketch, so this will be the first description encountered. Later transformations infer complete surfaces, surface clusters, object properties and relationships and object hypotheses, as summarized in Chapter 1.
As outlined above, each transformation is capable of error, such as incorrectly merging two surfaces behind an obscuring object, or hypothesizing a non-existent object. Moreover, the model invocation process is designed to allow "errors" to occur, as such a capability is needed for generic object recognition, wherein only "close" models exist. Fortunately, there are many constraints that help prevent the propagation of errors.
Some symbols are created directly from the raw data (e.g. surface properties), but most are created by transforming previously generated results (e.g. using two surfaces as evidence for hypothesizing an object containing both).
Recognition involves structure isolation, as well as identification, because naming requires objects to be named. This includes denoting what constitutes the object, where it is and what properties it has. Unfortunately, the isolation process depends on what is to be identified, in that what is relevant can be object-specific. However, this problem is mitigated because the number of general visual properties seems to be limited and there is hope of developing "first pass" grouping techniques that could be largely autonomous and model independent. (These may not always be model independent, as, for example, the constellation Orion can be found and declared as distinguished in an otherwise random and overlapping star field.) So, part of a sound theory of recognition depends on developing methods for isolating specific classes of objects. This research inferred surface groupings from local intersurface relationships.
The Basis for Recognition
Having the correct properties and relationships is the traditional basis for recognition, with the differences between approaches lying in the types of evidence used, the modeling of objects, the assumptions about what constitutes adequate recognition and the algorithms for performing the recognition.
Here, surface and structure properties are the key types of evidence, and they were chosen to characterize a large class of everyday objects. As three dimensional input data is used, a full three dimensional description of the object can be constructed and directly compared with the object model. All model feature properties and relationships should be held by the observed data features, with geometric consistency as the strongest constraint. The difficulty then arises in the construction of the three dimensional description. Fortunately, various constraints exist to help solve this problem.
This research investigates recognizing "human scale" rigidly and non-rigidly connected solids with uniform, large surfaces, including classroom chairs, most of a PUMA robot and a trash can. The scenes in which these objects appear are normal, somewhat cluttered indoor work areas, with objects at various depths obscuring portions of other objects.
Given these objects and scenes, four groups of physical constraints are needed:
Definition of Recognition
Recognition produces a largely instantiated, spatially located, described object hypothesis with direct correspondences to an isolated set of image data. "Largely instantiated" means that most object features predicted by the model have been accounted for, either with directly corresponding image data or with explanations for their absence.
What distinguishes recognition, in the sense used in this book, is that it labels the data, and hence is able to reconstruct the image. While the object description may be compressed (e.g. a "head"), there will be an associated prototypical geometric model (organizing the properties) that could be used to recreate the image to the level of the description. This requires that identification be based on model-to-data correspondences, rather than on summary quantities such as volume or mass distribution.
One problem with three dimensional scenes is incomplete data. In particular, objects can be partially obscured. But, because of redundant features, context and limited environments, identification is still often possible. On the other hand, there are also objects that cannot be distinguished without a more complete examination - such as an opened versus unopened soft drink can. If complete identification requires all properties to be represented in the data, any missing ones will need to be justified. Here, it is assumed that all objects have geometric models that allow appearance prediction. Then, if the prediction process is reasonable and understands physical explanations for missing data (e.g. occlusion, known defects), the model will be consistent with the observed data, and hence be an acceptable identification.
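The notion of a "largely instantiated" hypothesis, with missing features justified by a physical explanation, can be illustrated by a minimal Python sketch. This is not the implementation used in this research: the names FeatureStatus, instantiation_fraction and is_largely_instantiated are hypothetical, and the numerical threshold is an assumption, as the text only requires that "most" model features be accounted for.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FeatureStatus:
    """One model feature and how (if at all) it was accounted for."""
    name: str
    matched: bool = False              # directly corresponding image data was found
    explanation: Optional[str] = None  # e.g. "occluded" or "known defect"


def instantiation_fraction(features: List[FeatureStatus]) -> float:
    """Fraction of model features accounted for by data or by an explanation."""
    if not features:
        return 0.0
    accounted = sum(
        1 for f in features if f.matched or f.explanation is not None
    )
    return accounted / len(features)


def is_largely_instantiated(features: List[FeatureStatus],
                            threshold: float = 0.75) -> bool:
    """True if enough features are accounted for (the threshold is an assumption)."""
    return instantiation_fraction(features) >= threshold


# Hypothetical chair hypothesis: one leg is hidden, one leg is simply missing.
chair = [
    FeatureStatus("seat", matched=True),
    FeatureStatus("back", matched=True),
    FeatureStatus("front leg", explanation="occluded"),
    FeatureStatus("rear leg"),  # neither matched nor explained
]
print(instantiation_fraction(chair))   # 0.75
print(is_largely_instantiated(chair))  # True
```

The occluded leg counts towards the hypothesis because its absence is explained, whereas the unexplained missing leg counts against it.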
Criteria for Identification
The proposed criterion is that the object has all the right properties and none of the wrong ones, as specified in the object model. The properties will include local and global descriptions (e.g. surface curvatures and areas), subcomponent existence, global geometric consistency and visibility consistency (i.e. what is seen is what is expected).
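As an illustration only, the "all the right properties and none of the wrong ones" criterion might be sketched in Python as below. The function satisfies_model, the representation of a property as an allowed numerical range, and all property names and values are assumptions made for this example; they are not taken from the object models used in this research.

```python
from typing import Dict, Iterable, Tuple


def satisfies_model(observed: Dict[str, float],
                    required: Dict[str, Tuple[float, float]],
                    disallowed: Iterable[str] = ()) -> bool:
    """Check an observation against a model's property constraints.

    observed   : property name -> measured value
    required   : property name -> (low, high) allowed range, inclusive
    disallowed : property names that must not appear in the observation
    """
    for name, (low, high) in required.items():
        if name not in observed:
            return False          # a required property is missing
        if not (low <= observed[name] <= high):
            return False          # a required property has the wrong value
    # None of the "wrong" properties may be present.
    return not any(name in observed for name in disallowed)


# Hypothetical model with two local/global properties.
model = {"length_cm": (15.0, 40.0), "area_cm2": (100.0, 400.0)}

print(satisfies_model({"length_cm": 20.0, "area_cm2": 150.0}, model))  # True
print(satisfies_model({"length_cm": 10.0, "area_cm2": 150.0}, model))  # False
print(satisfies_model({"length_cm": 20.0}, model))                     # False
```

The disallowed argument is empty by default, so the sketch reduces to checking only the required properties unless negative conditions are supplied.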
Perceptual goals determine the properties used in identification. Unused information may allow distinct objects to acquire the same identity. If the generic chair were the only chair modeled, then all chairs would be classified as the generic chair.
The space of all objects does not seem to be sufficiently disjoint for the detection of only a few properties to characterize them uniquely. In some model bases, efficient recognition may be possible by a parsimonious selection of properties, but redundancy adds the certainty needed to cope with missing or erroneous data, much as the extra data bits in an error correcting code help disperse the code space.
Conversely, a set of data might implicate several objects related through a relevant common generalization, such as similar yellow cars. Or, there may be no physical generalization between alternative interpretations (e.g. as in the children's joke Q: "What is grey, has four legs and a trunk?" A: "A mouse going on a holiday!").
Though the basic data may admit several interpretations, further associated properties may provide finer identifications, much as ACRONYM used additional constraints for class specialization.
While not all properties will be needed for a particular identification, some will be essential and recognition should require these when identifying an object. One could interpret a picture of a soft drink can as if it were the original, but this is just a matter of choosing what properties are relevant. An observation that is missing some features, such as one without the label on the can, may suggest the object, but would not be acceptable as a proper instance.
There may also be properties that the object should not have, though this is a more obscure case. In part, these properties may contradict the object's function. Some care has to be applied here, because there are many properties that an object does not have, and they should not all have to be made explicit. No "disallowed" properties were used here.
Most direct negative properties, like "the length cannot be less than 15 cm", can be rephrased as "the length must be at least 15 cm". Properties without natural complements are less common, but exist: "adjacent to" and "subcomponent of" are two such properties. One might discriminate between two types of objects by stating that one has a particular subcomponent, and that the other does not and is otherwise identical. Failure to include the "not subcomponent of" condition would reduce the negative case to a generalization of the positive case, rather than an alternative. Examples of this are: a nail polish dot that distinguishes "his and her" toothbrushes or a back support as the discriminator between a chair and a stool.
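The chair and stool example can be made concrete with a small Python sketch, given here only as an illustration under stated assumptions. The function matching_models, the dictionary layout and the part names are hypothetical, and objects are reduced to sets of named subcomponents. The sketch shows that, without the "not subcomponent of" condition, the stool model becomes a generalization of the chair model rather than an alternative to it.

```python
from typing import Dict, List, Set


def matching_models(observed_parts: Set[str],
                    models: Dict[str, Dict[str, Set[str]]]) -> List[str]:
    """Return the names of all models consistent with the observed parts."""
    matches = []
    for name, spec in models.items():
        has_required = spec["required"] <= observed_parts
        has_forbidden = bool(spec.get("forbidden", set()) & observed_parts)
        if has_required and not has_forbidden:
            matches.append(name)
    return matches


# Stool model WITH the negative condition: it must not have a back support.
WITH_NEGATIVE = {
    "chair": {"required": {"seat", "legs", "back"}},
    "stool": {"required": {"seat", "legs"}, "forbidden": {"back"}},
}

# Stool model WITHOUT the negative condition.
WITHOUT_NEGATIVE = {
    "chair": {"required": {"seat", "legs", "back"}},
    "stool": {"required": {"seat", "legs"}},
}

observed_chair = {"seat", "legs", "back"}
print(matching_models(observed_chair, WITH_NEGATIVE))     # ['chair']
print(matching_models(observed_chair, WITHOUT_NEGATIVE))  # ['chair', 'stool']
```

In the second case every chair is also accepted as a stool, which is the reduction to a generalization described above.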
Recognition takes place in a context - each perceptual system will have its own set of properties suitable for discriminating among its range of objects. In the toothbrushes example, the absence of the mark distinguished one toothbrush in the home, but would not have been appropriate when still at the factory (among the other identical, unmarked, toothbrushes). The number and sensitivity of the properties affect the degree to which objects are distinguished. For example, the area-to-perimeter ratio distinguishes some objects in a two dimensional vision context, even though it is an impoverished representation. This work did not explicitly consider any context-specific identification criteria.
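The area-to-perimeter ratio mentioned above can be computed for simple polygons as in the following Python sketch. It is offered only as an illustration: the function names are hypothetical, shapes are assumed to be simple (non-self-intersecting) polygons given as vertex lists, and the area is obtained with the standard shoelace formula.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(vertices: List[Point]) -> float:
    """Area of a simple polygon by the shoelace formula."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def polygon_perimeter(vertices: List[Point]) -> float:
    """Sum of the edge lengths, closing the polygon back to the first vertex."""
    n = len(vertices)
    total = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += math.hypot(x1 - x0, y1 - y0)
    return total


def area_to_perimeter(vertices: List[Point]) -> float:
    return polygon_area(vertices) / polygon_perimeter(vertices)


square_2x2 = [(0, 0), (2, 0), (2, 2), (0, 2)]
strip_4x1 = [(0, 0), (4, 0), (4, 1), (0, 1)]
print(area_to_perimeter(square_2x2))  # 0.5
print(area_to_perimeter(strip_4x1))   # 0.4

# Two different shapes with the same ratio.
square_4x4 = [(0, 0), (4, 0), (4, 4), (0, 4)]
rect_6x3 = [(0, 0), (6, 0), (6, 3), (0, 3)]
print(area_to_perimeter(square_4x4), area_to_perimeter(rect_6x3))  # 1.0 1.0
```

The first pair has equal areas but different ratios, so the property discriminates between them. The second pair shows why the representation is impoverished: distinct shapes can share the same ratio, and the ratio also changes with scale.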
The above discussion introduces most of the issues behind recognition, and is summarized here: