The data used in this research were unrealistic in several respects.
Because the depth and orientation values and the segmentation
boundaries were hand-derived, they had few of the errors
likely to be present in real data.
The segmentations also corresponded almost perfectly with the models,
and thus ignored problems of data variation and scale.
Data variations, particularly for objects with curved surfaces,
cause shape segmentation boundaries to shift.
Further, as the analytic scale changes, segmentation boundaries also move,
and segments may appear or disappear.
The object representation was too literal: it should not always be based on exact sizes and feature placement. The object surfaces could be more notional, designating surface class, curvature, orientation and placement while largely ignoring extent. The representation could also have a more conceptual character that emphasizes key distinguishing features and rough geometric placement, without a literal CAD-like model (as used here). Finally, the models could allow alternative, overlapping representations, such as two surfaces used both individually and as part of a connecting orientation discontinuity.
As data occurs at unpredictable scales, the models might
record the features at a variety of scales.
The models should also include other data elements such as references to solids
(e.g. generalized cylinders), reflectance, surface shape texture
and distinguished axes (e.g. symmetry and elongation), etc.
The representation could have used a more
constraint-like formulation, as in ACRONYM, which would
allow inequality relationships among features and also allow easier use
of model variables.
Many of these inadequacies were subsequently overcome in the SMS
representation approach described in Chapter 7.
An open question about the surface reconstruction process is
whether to replace the representation of two partially obscured surfaces by
a single merged surface (as was done) or to keep both alternatives.
Keeping the extra hypotheses causes redundant processing and may lead to
duplicated invocation and hypothesis construction, but allows correct processing
if the merging was inappropriate.
Keeping only the merged surface may cause invocation and matching failures,
or require a more intelligent hypothesis construction process that uses
the model to decide if the two surfaces were incorrectly merged.
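The trade-off can be sketched as a choice over which hypotheses survive reconstruction. The `Surface` and `HypothesisSet` types below are hypothetical illustrations, not part of the implementation described:

```python
# Illustrative sketch of the merge-versus-keep-both trade-off.
# All names here are invented, not from the implementation described.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Surface:
    name: str

@dataclass
class HypothesisSet:
    """Alternative surface hypotheses competing to explain the same data."""
    alternatives: list = field(default_factory=list)

def reconstruct(partial_a, partial_b, keep_both):
    merged = Surface(f"{partial_a.name}+{partial_b.name}")
    if keep_both:
        # Redundant downstream processing, but recovers if the merge was wrong.
        return HypothesisSet([merged, partial_a, partial_b])
    # Cheaper, but invocation and matching fail if the merge was inappropriate.
    return HypothesisSet([merged])

hs = reconstruct(Surface("front"), Surface("rear"), keep_both=True)
print([s.name for s in hs.alternatives])  # → ['front+rear', 'front', 'rear']
```

Keeping both branches makes the cost of the extra hypotheses explicit: every downstream stage must now process three surfaces instead of one.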
Surface Cluster Formation
The surface cluster formation process has a similar problem.
When one surface cluster overlaps another, a third
surface cluster merging the two is created as well.
This was to provide a context within which all components of a self-obscured
object would appear.
The problem is how to control the surface cluster merging process when multiple
surface clusters overlap (as is likely in a real scene), which causes a
combinatorial growth of surface clusters.
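The growth is easy to quantify: if every subset of n mutually overlapping primitive clusters gave rise to a merged cluster, the number of extra clusters would be 2^n − n − 1. A minimal sketch of this worst case (the function name is invented):

```python
# Worst-case count of merged surface clusters when every subset (size >= 2)
# of n mutually overlapping primitive clusters is merged -- an illustrative
# bound, equal to 2**n - n - 1.

from math import comb

def merged_cluster_count(n_overlapping):
    """Number of merged clusters created from n mutually overlapping ones."""
    return sum(comb(n_overlapping, k) for k in range(2, n_overlapping + 1))

for n in (2, 3, 5, 8):
    print(n, merged_cluster_count(n))  # 2→1, 3→4, 5→26, 8→247
```

Even eight mutually overlapping clusters would already spawn 247 merged contexts, which is why the merging process needs control.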
While shape is very informative, many additional description
types could be added to help characterize objects: texture in
both reflectance and shape (random and patterned), reflectance itself,
translucency, surface finish, etc.
Invocation evaluated a copy of the network in every image context. This is computationally expensive, considering the likely number of contexts (e.g. 100) and the number of models (e.g. 50,000) in a realistic scene. Parallel processing might eliminate the raw computational cost, but there remains the problem of investigating just the relevant contexts. There should probably be a partitioning of the models according to the size of the context, and some attention-focusing mechanism should limit the contexts within which invocation takes place. This mechanism might apply a rough high-level description of the entire scene, followed by a coarse-to-fine scale analysis focusing attention on particular regions of interest.
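A back-of-envelope sketch using only the figures quoted above; the ten-way partition is an invented assumption:

```python
# Rough cost of evaluating a copy of the invocation network in every
# image context, with the figures from the text (~100 contexts, ~50,000
# models). The 10-bin partition by context size is a made-up assumption.

def invocation_evaluations(contexts, models):
    # One network copy per context, each consulting every model.
    return contexts * models

naive = invocation_evaluations(100, 50_000)
print(naive)  # prints 5000000

# If models were partitioned by context size into, say, 10 bins and each
# context consulted only its own bin, the work would drop proportionally:
partitioned = invocation_evaluations(100, 50_000 // 10)
print(partitioned)  # prints 500000
```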
Redundant processing might arise because an object will invoke all of its generalizations. The invocations are correct, but the duplication of effort seems wasteful when a direct method could instead pursue models up and down the generalization hierarchy.
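A minimal sketch of that direct alternative, walking explicit generalization links rather than invoking each level separately; the hierarchy contents are invented:

```python
# Sketch: invoke the most specific matching model once, then follow
# generalization links directly. The hierarchy below is illustrative only.

GENERALIZATION = {            # child -> parent
    "office chair": "chair",
    "chair": "furniture",
}

def generalizations(model):
    """Yield the model and its chain of generalizations, most specific first."""
    while model is not None:
        yield model
        model = GENERALIZATION.get(model)

print(list(generalizations("office chair")))
# → ['office chair', 'chair', 'furniture']
```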
As currently formulated, the invocation network must be created symbolically for each new scene analyzed, as a function of the scene, model base and evidence computations. It would be interesting to investigate how the network might re-organize itself as the image changed, maintaining the fixed model dependent relationships, but varying the image dependent ones.
The variety of objects in the natural world suggests that there may not be
a "rich" description hierarchy, nor a deep subcomponent hierarchy for
most objects, nor a general subclass hierarchy.
Though these factors contribute substantially, it appears that there are
relatively few object types in our everyday experience.
Instead, there are many individuals and considerable variation among them.
Thus, the most important aspect of object representation may be the
direct property and primitive description evidences, which would then
carry most of the burden of discrimination.
The major criticism of the hypothesis construction process is its literality. In particular, it tried to find evidence for all features, which is probably neither fully necessary, nor always possible.
Literality also appeared in the dependence on the metrical
relationships in the geometric model (e.g. the surface sizes and placements).
These were used for predicting self-occlusion and for spatially
registering the object.
While these tasks are important, and are part of a
general vision system, they should have a more
conceptual and less analytic formulation.
This would provide a stronger symbolic aspect to the computation and
should also make the process more capable of handling imperfect
or generic objects.
The Recognition Approach as a Whole
One significant limitation of the recognition approach is the absence of scale analysis. Objects should have different conceptual descriptions according to the relevance of a feature at a given scale, and recognition then has to match data within a scale-dependent range of models.
A more relational formulation would help, but there does not seem to be a matching method that neatly combines the tidiness and theoretical strengths of graph matching with the efficiency and prediction capability of model-based geometric matching.
The proposed recognition model ignored the question of when enough evidence was accumulated. Recognition need not require complete evidence or satisfaction of all constraints, provided none actually fail, and the few observed features are adequate for unique identification in a particular context. However, the implementation here plodded along trying to find as much evidence as possible. An object should be recognizable using a minimal set of discriminating features and, provided the set of descriptions is powerful enough to discriminate in a large domain, the recognition process will avoid excessive simplification. Recognition (here) has no concept of context, and so cannot make these simplifications. On the other hand, the additional evidence provides the redundancy needed to overcome data and segmentation errors.
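The idea of stopping once a minimal discriminating feature set has been observed can be sketched as follows; the model base and feature names are invented:

```python
# Sketch: recognition terminates as soon as the observed features are
# consistent with exactly one model. The models and features are invented
# examples, not from the implementation described.

MODELS = {
    "mug":    {"cylinder", "handle", "flat-top"},
    "pan":    {"cylinder", "handle", "open-top"},
    "bottle": {"cylinder", "neck"},
}

def identify(observed, candidates=MODELS):
    """Return a model name once the evidence is uniquely consistent, else None."""
    consistent = [m for m, feats in candidates.items() if observed <= feats]
    return consistent[0] if len(consistent) == 1 else None

print(identify({"cylinder"}))          # ambiguous, prints None
print(identify({"cylinder", "neck"}))  # prints bottle
```

Of course, as the text notes, this minimalism forfeits the redundancy needed to overcome data and segmentation errors; a real process would trade the two off.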
The evaluation on hand-collected and segmented data did not adequately test the methods, but research using this approach is continuing and some simpler genuine range data scenes have been successfully analyzed.