
View-Based Object Recognition

Ming-Hsuan Yang Dan Roth Narendra Ahuja
Department of Computer Science and Beckman Institute
University of Illinois at Urbana-Champaign, Urbana, IL 61801
mhyang@vision.ai.uiuc.edu danr@cs.uiuc.edu ahuja@vision.ai.uiuc.edu

View-based object recognition has attracted much attention in recent years. In contrast to methods that rely on pre-defined geometric (shape) models for recognition, view-based methods learn a model of the object's appearance in two-dimensional images taken under different poses and illumination conditions. Often, each image is represented by a raster scan of its pixels, i.e., a vector of intensity values. At evaluation time, given a two-dimensional image, the learned model is used to determine whether the target object is present in the image.
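As a concrete illustration of the pixel-based representation just described, the following sketch (using NumPy; the toy image is an illustrative assumption, not COIL data) flattens a small intensity image into a raster-scan vector:

```python
import numpy as np

# A toy 4x4 "image" of intensity values (0-255); a stand-in for a real view.
image = np.array([
    [ 10,  20,  30,  40],
    [ 50,  60,  70,  80],
    [ 90, 100, 110, 120],
    [130, 140, 150, 160],
], dtype=np.uint8)

# Raster-scan (row-major) flattening turns the 2D image into a single
# vector of intensity values, the representation used by the
# view-based methods discussed in this section.
pixel_vector = image.flatten()
```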

A number of view-based schemes have been developed to recognize 3D objects. Poggio and Edelman [4] show that 3D objects can be recognized from the raw intensity values in 2D images (a representation we refer to here as pixel-based) using a network of generalized radial basis functions. They argue and demonstrate that the full 3D structure of an object can be estimated if enough 2D views of the object are provided.

Turk and Pentland [8] demonstrate that human faces can be represented and recognized by ``eigenfaces.'' Representing each face image as a vector of pixel values, the eigenfaces are the eigenvectors associated with the largest eigenvalues of the covariance matrix of the sample vectors. An attractive feature of this method is that the eigenfaces can be learned from the sample images, in pixel representation, without any feature selection. The eigenspace approach has since been used in vision tasks ranging from face recognition to object tracking.

Murase and Nayar [2] [3] develop a parametric eigenspace method to recognize 3D objects directly from their appearance. For each object of interest, a set of images in which the object appears in different poses is obtained as training examples. Next, the eigenvectors are computed from the covariance matrix of the training set. The set of images is projected onto a low-dimensional subspace spanned by a subset of the eigenvectors, in which the object is represented as a manifold. A compact parametric model is constructed by interpolating the points in the subspace. At recognition time, the image of a test object is projected onto the subspace and the object is recognized based on the manifold it lies on. Using a subset of the Columbia Object Image Library (COIL-100), they show that 3D objects can be recognized accurately from their appearance in real time.
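The eigenspace construction used by these methods can be sketched as follows. The toy data, subspace dimension, and variable names below are illustrative assumptions, not the actual setup of [2] [3] or [8]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: 20 sample "images", each flattened to a 64-dim pixel
# vector (an 8x8 image). In the parametric eigenspace method these would
# be views of one object under different poses.
X = rng.normal(size=(20, 64))

# Center the samples and compute the eigenvectors of the covariance
# matrix of the training set.
mean = X.mean(axis=0)
Xc = X - mean
cov = Xc.T @ Xc / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

# Keep the k eigenvectors with the largest eigenvalues (the "eigenfaces"
# in Turk-Pentland's terminology) and project the samples onto the
# low-dimensional subspace they span.
k = 5
basis = eigvecs[:, -k:]                  # 64 x k orthonormal basis
projected = Xc @ basis                   # 20 x k low-dimensional points
```

The projected points are what the parametric eigenspace method interpolates into a manifold; a test image is projected the same way before comparison.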

In contrast to these algebraic methods, general-purpose learning methods such as support vector machines (SVMs) have also been applied to this problem. Schölkopf [7] was the first to apply SVMs to recognizing 3D objects from 2D images and demonstrated the potential of this approach to visual learning. Pontil and Verri [5] also used SVMs for 3D object recognition and experimented with a subset of the COIL-100 data set. Their training set consisted of 36 images (one for every 10°) for each of the 32 objects they chose, and the test set consisted of the remaining 36 images of each object. For 20 random selections of 32 objects from the COIL-100, their system achieves a perfect recognition rate. A subset of the COIL-100 has also been used by Roobaert and Van Hulle [6] to compare the performance of SVMs with different pixel-based input representations.
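A rough sketch of training an SVM directly on pixel vectors, in the spirit of these experiments (assuming scikit-learn is available; the toy two-object data is an illustrative assumption, not the actual experimental setup of [5]):

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is installed

rng = np.random.default_rng(1)

# Toy two-object problem: each "view" is a 16-dim pixel vector. Views of
# object 0 cluster around one intensity pattern, views of object 1
# around another (clearly separable by construction).
views_a = rng.normal(loc=0.0, scale=0.2, size=(30, 16))
views_b = rng.normal(loc=1.0, scale=0.2, size=(30, 16))
X = np.vstack([views_a, views_b])
y = np.array([0] * 30 + [1] * 30)

# Train a linear SVM directly on the pixel vectors and measure its
# accuracy on the training views.
clf = SVC(kernel="linear").fit(X, y)
train_accuracy = clf.score(X, y)
```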

Recently, SNoW has been applied to view-based object recognition [9]. SNoW is a sparse network of linear functions that utilizes the Winnow update rule [1]. It is specifically tailored to learning in domains in which the potential number of features taking part in decisions is very large (and may be unknown a priori), although only a small number of them are typically relevant to any decision. Characteristics of this learning architecture include its sparsely connected units, the allocation of features and links in a data-driven way, its decision mechanism, and its use of a feature-efficient update rule. An additional property of the SNoW architecture that makes it attractive for learning in vision is that it learns a representation for each object rather than a discrimination rule for each pair of objects, as other methods do. This allows for more appealing evaluation schemes and for the incorporation of external information sources into the process of learning a representation and recognizing an object. See [9] for a comparison of view-based object recognition using SNoW and SVMs.
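A minimal sketch of the Winnow update rule that SNoW builds on, using the multiplicative promotion/demotion variant. The toy disjunction target, epoch count, and threshold choice are assumptions for illustration; the full SNoW architecture additionally allocates features in a data-driven way and maintains a network of such units, one per target:

```python
import numpy as np

def winnow_train(examples, labels, n_features, alpha=2.0, epochs=50):
    """Train a single Winnow linear unit over Boolean features.

    Predict positive when w . x >= theta; on a mistake, multiplicatively
    promote (false negative) or demote (false positive) the weights of
    the active features by alpha. Only features active in a mistaken
    example are updated, which is what makes the rule feature-efficient.
    """
    w = np.ones(n_features)
    theta = float(n_features)  # a common choice of threshold
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            pred = 1 if w @ x >= theta else 0
            if pred != y:
                if y == 1:
                    w[x == 1] *= alpha   # promote active features
                else:
                    w[x == 1] /= alpha   # demote active features
    return w, theta

# Toy target concept: a two-literal disjunction (feature 0 OR feature 1),
# the kind of sparse function Winnow learns with few mistakes.
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(100, 10))
y = np.maximum(X[:, 0], X[:, 1])
w, theta = winnow_train(X, y, n_features=10)
preds = (X @ w >= theta).astype(int)
```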

One commonly used benchmark data set is the Columbia Object Image Library (COIL-100) database (available at http://www.cs.columbia.edu/CAVE). The COIL-100 data set consists of color images of 100 objects, with the images of each object taken at pose intervals of 5°, i.e., 72 poses per object. The images were also normalized so that the larger of the two object dimensions (height and width) fits the image size of 128 × 128 pixels. Figure 1 shows the images of the 100 objects taken in frontal view, i.e., at zero pose angle. The 32 highlighted objects in Figure 1 are considered more difficult to recognize in [5]; we use all 100 objects, including these, in our experiments. Each color image is converted to a gray-scale image of 32 × 32 pixels for our experiments.
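The preprocessing described above can be sketched as follows. The random array stands in for an actual COIL-100 frame, and the luminance weights and block-averaging downsampling are illustrative assumptions; the exact conversion used in the experiments is not specified here:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for a 128x128 RGB COIL-100 frame (channel-last layout).
rgb = rng.integers(0, 256, size=(128, 128, 3)).astype(np.float64)

# Standard luminance weighting for gray-scale conversion (an assumed
# choice; any reasonable RGB-to-gray mapping would do here).
gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

# Downsample 128x128 -> 32x32 by averaging non-overlapping 4x4 blocks.
gray32 = gray.reshape(32, 4, 32, 4).mean(axis=(1, 3))
```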

 

Figure: Columbia Object Image Library (COIL-100) consists of 100 objects in varying poses (5° apart). The objects are shown in row order; the highlighted ones are considered more difficult to recognize in [5].





Bob Fisher
Wednesday March 14 14:53:24 GMT 2001