The strategy presented here is appearance-based: object representations are viewer-centered, and the data/model relationship is made explicit through empirical evidence gathered during a training process. Specifically, an appearance-based object recognition strategy is devised that builds on the work of Nayar et al. [2] and Turk et al. [1], chosen for its fast indexing time and its ability to deal with complex scenes where model-building becomes infeasible. These methods build a lower-dimensional subspace from the entire set of raw images acquired, using Principal Components Analysis (PCA) [1]. The lower-dimensional space in which the images are represented is referred to as an appearance manifold. Recognition (or rather, indexing) is based on projecting images acquired on-line onto the manifold and finding the closest stored image.
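As a concrete sketch of this offline subspace construction and on-line indexing (random stand-in images; the dimensions and the choice of k are illustrative, not values from the paper):

```python
import numpy as np

# Sketch of appearance-manifold indexing. The training images, their size,
# and the subspace dimension k are stand-ins, not values from the paper.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 256))   # 50 hypothetical 16x16 images, flattened

# Offline: centre the data and keep the top-k principal directions
# (computed via SVD of the centred data for numerical stability).
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
k = 8
basis = Vt[:k]                       # rows span the appearance manifold

# Project every training image onto the manifold once.
coords = (X - mean) @ basis.T        # shape (50, k)

def index(image):
    """Project an on-line image onto the manifold and return the
    index of the closest stored training image."""
    c = (image - mean) @ basis.T
    return int(np.argmin(np.linalg.norm(coords - c, axis=1)))
```

A stored image indexes to itself, since its manifold coordinates match exactly; a novel view maps to its nearest neighbour on the manifold.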
The majority of the existing appearance-based strategies focus on applications such as face recognition [1,3-6], lip-reading [7], and others [2,8,12]. For most of these applications, using only the raw grey-scale images as input to an appearance strategy has worked quite well. The major drawback is that tight control over the image formation parameters, such as lighting and background, has to be enforced in order to ensure repeatable appearance. One way to avoid some of these difficulties is to make direct surface measurements, i.e. range measurements, of the object to be recognized. This is precisely the strategy taken by Campbell and Flynn [9].
Difficulties regarding sensitivity to lighting and background are overcome here through the development of an appearance-based object recognition strategy that exploits differential image properties. Here, the inputs for recognition are the optical flow images induced on the image plane by the relative motion between camera and object. Optical flow images used in this appearance-based strategy will be referred to as appearance flows. Using flow images as input offers several advantages over using the original grey-scale images. First, the resulting flow field is somewhat invariant to scene illumination (assuming that it is constant during acquisition). This lends itself to a more robust strategy in which lighting conditions need not be identical to those used during training. Second, flow provides a means of figure/ground separation in the case of a stationary observer. This is a major advantage, as appearance-based strategies that use raw grey-scale images must ensure the exact same background during recognition and training; here, arbitrary stationary backgrounds are possible because the algorithm can focus on the moving object in the scene. Furthermore, once the moving object has been detected, it can be centered within the image. This type of position normalization is not feasible using traditional inputs. Third, the flow field itself will be shown to provide coarse information regarding the object's 3D structure. A final motivation for the approach has its roots in biology: a great deal of "wetware" in the mammalian brain is devoted to the computation of motion, which motivates exploiting its capabilities in solving many visual tasks.
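The figure/ground separation and centering arguments can be sketched as follows, assuming a stationary camera and a dense flow field; the synthetic object, threshold, and image size are illustrative:

```python
import numpy as np

# A stationary background induces (near-)zero flow, so a simple threshold
# on the flow magnitude segments the moving object. Sizes and the threshold
# factor are illustrative.
H, W = 64, 64
u = np.zeros((H, W))
v = np.zeros((H, W))
u[20:40, 10:30] = 1.5                # synthetic moving object
mag = np.hypot(u, v)                 # flow magnitude per pixel

mask = mag > 0.5 * mag.max()         # figure/ground separation

# Position normalization: shift the image so the object's centroid
# lands at the image centre.
ys, xs = np.nonzero(mask)
dy, dx = H // 2 - int(ys.mean()), W // 2 - int(xs.mean())
centered = np.roll(np.roll(mag, dy, axis=0), dx, axis=1)
```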
Appearance manifolds can be created for each object in a database by inducing flows corresponding to the set of expected motions. The resulting set of optical flow fields is used to build an appearance manifold using standard appearance-based methods [1,2]. What makes this problem difficult is the confounding of motion, structure, and imaging geometry such that the appearances of different objects are often indistinguishable from one another. Similar problems are inherent in recognizing particular motions, e.g. facial expressions, gestures [3-6], where the problem is often made tractable by the dominance of the motion component of the optical flow field. In fact, this is precisely what we wish to avoid in the object recognition context, hence some control of imaging parameters is required to ensure that a reasonable component of the flow corresponds to structure. Unfortunately, this leads to a much more difficult factorization problem. To get around this difficulty one can invoke a temporal variant of the general position assumption:
In the case of recognition a reasonable strategy is to maintain a list of plausible hypotheses and accumulate evidence for each one through a sequence of observations, which we refer to as temporal regularization (another example of a sequential strategy can be found in [7]). We will show how this can be precisely formulated in probabilistic terms, using a Bayesian chaining strategy to effect regularization. The hope is that experimental results will confirm the validity of the general position assumption by showing that in most cases the list of plausible hypotheses quickly diminishes to a single confident assertion.
The foregoing raises the important question of whether it is possible to create an appearance manifold for sufficiently general motions in the first place. Clearly this is possible in restricted contexts such as gesture recognition. What we seek is a means of generating a set of canonical motions, i.e. a motion basis, in different viewpoints that can generalize to a sufficiently wide range of appearances to be of use in practical recognition tasks. Again, we invoke a second general assumption regarding the motion of physical objects:
This suggests that a motion basis can be constructed by presenting an object to a stationary observer and sweeping it in different directions in the plane tangent to the current viewing direction. The basis is constructed by repeating this process for each of the object's canonical viewpoints. (Equivalent motions can be induced by moving the camera about a stationary object.) Additionally, we need to assume that the object moves at constant velocity relative to the camera, which is in general not a severe constraint. In the next section, we discuss precisely how this appearance manifold can be built.
Constraint 1 is required to ensure that a sufficient component of the optical flow magnitude is due to the structure of the object. Together with Constraint 2, it then becomes possible to associate the magnitude of the optical flow vector with distance to the camera. Taken as a whole, the magnitude of the optical flow field can be viewed as a rough kinetic depth map. This is by no means adequate for quantitative recovery of structure, due, among other things, to confounding with motion, but it can provide a basis for recognition according to the general position assumption A1. Constraint 3 implies that the structure component induced by camera motion about a stationary object is locally indistinguishable from that induced by the motion of the object about a stationary observer. This permits the construction of a motion basis using the more tractable approach of a mobile observer moving about a stationary object. The question of how to generate this basis from sensor motion is discussed next.
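A toy check of the kinetic-depth relation implied by Constraints 1 and 2: under pure lateral translation at constant speed, perspective projection gives a flow magnitude proportional to inverse depth, so nearer points flow faster (the focal length and speed below are made-up values):

```python
# For pure lateral translation at constant speed T, perspective projection
# gives flow magnitude |v| = f * T / Z. The focal length and speed are
# illustrative, not from the paper.
f, T = 500.0, 0.1                    # focal length (pixels), speed (m/frame)
depths = [1.0, 2.0, 4.0]             # scene depths in metres
mags = [f * T / Z for Z in depths]   # magnitude halves as depth doubles
```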
By assuming that objects can be differentiated on the basis of their local structure over time, the range of motions that needs to be generated during training is reduced significantly (Assumption A2). However, this range is still considerable, encompassing 4 degrees of freedom: viewsphere position, direction of motion (component parallel to the plane tangent to the surface), height above the surface, and curvature of the arc (direction orthogonal to the surface). Prior knowledge about how objects are likely to interact with the observer can be used to further restrict the range of motions that need to be sampled. For example, the object can be kept at a fixed canonical pose with respect to the camera (i.e., the camera is kept upright) during the acquisition of training images. Each object is then sampled using only 184 trajectories, corresponding to 2 orthogonal sweeps at a fixed distance from the object, with the same radius of curvature as the surrounding viewsphere, at each of 92 viewsphere directions. Experimental results would be required to show that this basis is sufficient to generalize to a very broad range of novel motions.
Each of the 184 trajectories gives rise to a distinct motion sequence, s_{i,j,t}, where the indices i, j, and t reference a particular object, trajectory, and image within the sequence respectively. Figure 1 shows a sequence of l images resulting from motion in a vertical arc on a viewsphere surrounding one of the test objects (a horizontal arc can be seen as well). An optical flow algorithm is used to estimate, for each s_{i,j,t}, t = 1..l, a vector field, v_{i,j,t}, t = 2..l-1, corresponding to the optical flow field induced by the camera motion. Only the magnitudes of v_{i,j,t} are of interest here (for the reasons cited earlier). For the particular optical flow algorithm chosen [15], three images were required to estimate a single optical flow vector. As three samples along each trajectory were sufficient to characterize each curvilinear sweep, each object i gives rise to a set of 184 scalar images. For brevity we refer to the latter as flow images. Standard PCA techniques [1] are used to construct an eigenbasis for the entire training set of 4600 flow images.
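The specific flow algorithm [15] is not reproduced here; as an illustrative stand-in, a minimal Lucas-Kanade-style least-squares estimate, with a central temporal difference over three frames, shows why three images yield one flow estimate:

```python
import numpy as np

# Three frames of a texture translating 1 px/frame to the left (synthetic).
rng = np.random.default_rng(1)
base = rng.standard_normal((32, 34))
f0, f1, f2 = base[:, 0:32], base[:, 1:33], base[:, 2:34]

# Spatial gradients from the middle frame; temporal derivative from a
# central difference over the first and last frames -- hence three images
# per flow estimate.
Ix = np.gradient(f1, axis=1)
Iy = np.gradient(f1, axis=0)
It = (f2 - f0) / 2.0

# Solve the brightness-constancy system [Ix Iy][u v]^T = -It in the
# least-squares sense over the whole window (one flow vector, for brevity).
A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
(u, v), *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
speed = float(np.hypot(u, v))        # only the magnitude is retained
```

The recovered horizontal component is close to -1 px/frame, matching the synthetic leftward shift; retaining only the magnitude mirrors the use of scalar flow images above.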
where p(O_i) is the prior probability of object hypothesis i, and N(mu_i, C_i)|_{m=m(d)} is the multivariate normal distribution representing the object hypothesis, evaluated at the projected parameters of the measurement, m = m(d).
As each of the image flow vectors is computed from a set of intensity images, evidence can be accumulated inexpensively and efficiently at the level of the probabilities over time, using a Bayesian chaining strategy that assigns the posterior probability distribution at time t, p(O|d_t), as the prior at time t+1. In this fashion, probabilistic evidence is cascaded until a clear winner emerges. Substituting the posterior density function from one view (given in the previous equation) as the prior for the next view leads to the following updating function for p(O|d) [16,17]:
The hypothesis is that confounding information in the optical flow signatures can be resolved by accumulating support for different object hypotheses over a sequence of observations in this manner, in accordance with Assumption A1. Presenting the system with all prior evidence should resolve ambiguities and lead to a winning solution in a small number of views. The entire system can be seen in Figure 2.
Figure 2: As each flow image in the sequence is introduced to the system, probabilistic evidence is cascaded until a clear winner emerges. The idea is that strong prior evidence should resolve ambiguities and lead to a winning solution in a small number of views.
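The chaining scheme of Figure 2 can be sketched numerically as follows; the eigenspace dimension, object means, covariance, and noise level are illustrative placeholders rather than trained values:

```python
import numpy as np

# Bayesian chaining: the posterior over object hypotheses at time t is
# reused as the prior at time t+1. Means, covariance, and noise level are
# illustrative placeholders, not trained values.
rng = np.random.default_rng(2)
k, n_obj = 4, 3
mus = 2.0 * np.eye(n_obj, k)         # well-separated per-object means
C = np.eye(k)                        # shared covariance, for simplicity

def likelihood(m, mu):
    d = m - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(C, d)))

prior = np.full(n_obj, 1.0 / n_obj)  # uniform initial prior p(O_i)
true_obj = 1
for _ in range(10):                  # a sequence of flow observations
    m = mus[true_obj] + 0.5 * rng.standard_normal(k)  # noisy measurement m(d)
    post = prior * np.array([likelihood(m, mu) for mu in mus])
    prior = post / post.sum()        # posterior becomes the next prior

map_hypothesis = int(np.argmax(prior))
```

Renormalizing at each step keeps the recursion numerically stable while leaving the MAP hypothesis unchanged.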
The next task is to determine a suitable convergence criterion for the system. It can be shown that a metric predicting the likelihood of ambiguous recognition results as a function of the measurement can be derived from Shannon's entropy [18]:
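A minimal sketch of such an entropy-based convergence test over the posterior (the example posteriors and any threshold are illustrative):

```python
import numpy as np

# Shannon entropy of the posterior over object hypotheses: high entropy
# flags an ambiguous result, low entropy a confident one. The example
# posteriors below are illustrative.
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log(0) taken as 0
    return float(-np.sum(p * np.log2(p)))

ambiguous = [0.25, 0.25, 0.25, 0.25]   # 2.0 bits: no hypothesis favoured
confident = [0.97, 0.01, 0.01, 0.01]   # near 0 bits: safe to assert the MAP
```

Recognition can be asserted once the entropy falls below a chosen threshold; the 2.0 bits of the uniform case is simply log2 of the number of hypotheses.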
To test this hypothesis we have constructed in our laboratory a robot-mounted camera system that can generate the requisite sensor trajectories on a viewsphere surrounding the object of interest. This apparatus is used to automatically generate motion bases (training) for a set of approximately 25 standard household objects. On-line, objects from the database are presented to a stationary camera by subjecting each to a set of curvilinear motions generated by a precessing pendulum (using the object as the mass). This approach allows for a wide range of sample trajectories that are clearly outside of the motion basis used for training (see Figure 4). In addition, we performed testing on free rotations and translations generated by hand.
Figure 4: Setup consists of a person swinging an object in front of a stationary camera.
Through application of the sequential recognition strategy, the system is shown to converge to a correct assertion in terms of its MAP (maximum a posteriori) solution in the majority of cases. This lends support to our contentions regarding the generalizability of the motion basis and the disambiguation of competing hypotheses via temporal regularization. Figure 5 plots the entropy over time for the case of a panda bear moved in front of the camera. The system starts with an incorrect assessment of the object's identity. Notice that as the object continues to swing in front of the camera, the system becomes more certain of its identity. In time, it converges to the correct solution.
Figure 5: On-line entropy of recognition results over time for panda bear. The MAP solution is shown above the curve at each iteration.
A presentation on this topic was given at the Eleventh British Machine Vision Conference, Bristol, UK, 11-14 September 2000.