
Support Vector Machines for Vision

Ming-Hsuan Yang Dan Roth Narendra Ahuja
Department of Computer Science and Beckman Institute
University of Illinois at Urbana-Champaign, Urbana, IL 61801
mhyang@vision.ai.uiuc.edu danr@cs.uiuc.edu ahuja@vision.ai.uiuc.edu

A Support Vector Machine (SVM) is a learning algorithm for pattern classification and regression [13] [1] [3] [12]. The basic training principle behind SVMs is to find the optimal linear hyperplane such that the expected classification error for unseen test samples is minimized, i.e., the classifier generalizes well. According to the structural risk minimization inductive principle [13], a function that classifies the training data accurately and that belongs to a set of functions with the lowest VC dimension [1] will generalize best, regardless of the dimensionality of the input space. Based on this principle, a linear SVM uses a systematic approach to find a linear decision function with the lowest VC dimension. For linearly non-separable data, SVMs can (nonlinearly) map the input to a high dimensional feature space where a linear hyperplane can be found. Although there is no guarantee that a linear solution will always exist in the high dimensional space, in practice it is usually feasible to find a working solution.
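
As a concrete illustration (not from the original text), the following sketch trains a linear SVM on a toy two-dimensional problem; it assumes NumPy and scikit-learn, whose SVC solver computes a maximum-margin hyperplane of the kind described above. The data and the parameter C are invented for illustration only.

import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters labeled -1 and +1 (synthetic toy data).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2.0, rng.randn(20, 2) + 2.0])
y = np.hstack([-np.ones(20), np.ones(20)])

# A large C approximates the hard-margin formulation for separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# The separating hyperplane is w . x + b = 0; the margin width is 2 / ||w||.
w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane normal w:", w, "offset b:", b)
print("margin width:", 2.0 / np.linalg.norm(w))
print("number of support vectors:", clf.n_support_.sum())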

Given a labeled set of $M$ training samples $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, M$, where $\mathbf{x}_i \in \mathbb{R}^N$ and $y_i$ is the associated label ($y_i \in \{-1, +1\}$), an SVM classifier finds the optimal hyperplane that correctly separates (classifies) the largest fraction of data points while maximizing the distance of either class from the hyperplane (the margin). Vapnik [13] shows that maximizing the margin is equivalent to minimizing the VC dimension in constructing an optimal hyperplane. Computing the best hyperplane is posed as a constrained optimization problem and solved using quadratic programming techniques. The discriminant hyperplane is defined by the level set of
$$ f(\mathbf{x}) = \sum_{i=1}^{M} \alpha_i y_i \, k(\mathbf{x}, \mathbf{x}_i) + b $$
where $k(\cdot,\cdot)$ is a kernel function and the sign of $f(\mathbf{x})$ determines the class membership of $\mathbf{x}$. Constructing an optimal hyperplane is equivalent to finding all the nonzero $\alpha_i$. Any training vector $\mathbf{x}_i$ that corresponds to a nonzero $\alpha_i$ is a support vector (SV) of the optimal hyperplane. A desirable feature of SVMs is that the number of training points which are retained as support vectors is usually quite small, thus providing a compact classifier.
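
A minimal sketch of this decision rule, continuing the toy example above (NumPy and scikit-learn are assumptions, not part of the original text): the stored support vectors and their coefficients suffice to evaluate $f(\mathbf{x})$ and hence classify new points.

import numpy as np

# clf is the linear SVM fitted in the previous snippet.
sv = clf.support_vectors_        # the x_i with nonzero alpha_i
alpha_y = clf.dual_coef_[0]      # scikit-learn stores alpha_i * y_i here
b = clf.intercept_[0]

def f(x, kernel=lambda u, v: u @ v):   # linear kernel: a simple dot product
    return sum(ay * kernel(x, s) for ay, s in zip(alpha_y, sv)) + b

x_test = np.array([1.5, 1.5])
print("manual  f(x):", f(x_test))
print("library f(x):", clf.decision_function([x_test])[0])
print("predicted label:", np.sign(f(x_test)))

Only the support vectors enter the sum, which is what makes the resulting classifier compact.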

For a linear SVM, the kernel function is just a simple dot product in the input space while the kernel function in a nonlinear SVM effectively projects the samples to a feature space of higher (possibly infinite) dimension via a nonlinear mapping function:
$$ \Phi : \mathbb{R}^N \rightarrow F $$
and then constructs a hyperplane in $F$. The motivation behind this mapping is that it is more likely to find a linear hyperplane in the high dimensional feature space. Using Mercer's theorem [2], the expensive calculations required in projecting samples into the high dimensional feature space can be replaced by a much simpler kernel function satisfying the condition
$$ k(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) $$
where $\Phi$ is the nonlinear projection function. Several kernel functions, such as polynomials and radial basis functions, have been shown to satisfy Mercer's theorem and have been used successfully in nonlinear SVMs. In fact, by using different kernel functions, SVMs can implement a variety of learning machines, some of which coincide with classical architectures. Nevertheless, automatic selection of the ``right'' kernel function and its associated parameters remains problematic and in practice one must resort to trial and error for model selection.
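
To make the kernel trick concrete, the short sketch below (an illustration, not from the original text; it assumes NumPy) checks that the degree-2 polynomial kernel $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^2$ equals the dot product of the explicit feature map $\Phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$, so the kernel evaluates the inner product in $F$ without ever projecting the samples explicitly.

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2D inputs.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k_poly(x, y):
    # Degree-2 polynomial kernel evaluated directly in the input space.
    return (x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print("phi(x) . phi(y):", phi(x) @ phi(y))   # explicit projection, then dot product
print("k(x, y):        ", k_poly(x, y))      # same value, no projection needed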

Schölkopf [11] was the first to apply SVMs to recognize 3D objects from 2D images and demonstrated the potential of this approach in visual learning. Pontil and Verri [9] also used SVMs for 3D object recognition and experimented with a subset of the COIL-100 dataset. Their training set consisted of 36 images (one for every $10^\circ$ of rotation) for each of the 32 objects they chose, and the test set consisted of the remaining 36 images for each object. A subset of COIL-100 was also used by Roobaert and Van Hulle [10] to compare the performance of SVMs with different pixel-based input representations. Recently, SVMs have been applied to several visual learning tasks including face detection [7] [4], pedestrian detection [6] [8], and gender classification [5].




