Ming-Hsuan Yang Dan Roth Narendra Ahuja
Department of Computer Science and Beckman Institute
University of Illinois at Urbana-Champaign, Urbana, IL 61801
mhyang@vision.ai.uiuc.edu danr@cs.uiuc.edu ahuja@vision.ai.uiuc.edu
A Support Vector Machine is a learning algorithm for pattern classification and regression [13] [1] [3] [12]. The basic training principle behind SVMs is finding the optimal linear hyperplane such that the expected classification error for unseen test samples is minimized -- i.e., good generalization performance. According to the structural risk minimization inductive principle [13], a function that classifies the training data accurately and which belongs to a set of functions with the lowest VC dimension [1] will generalize best regardless of the dimensionality of the input space. Based on this principle, a linear SVM uses a systematic approach to find a linear function with the lowest VC dimension. For linearly non-separable data, SVMs can (nonlinearly) map the input to a high dimensional feature space where a linear hyperplane can be found. Although there is no guarantee that a linear solution will always exist in the high dimensional space, in practice it is feasible to find a working solution.
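For reference, the bound underlying this principle can be stated in its commonly quoted form: with probability at least $1 - \eta$, the expected risk $R$ of a classifier with empirical risk $R_{\mathrm{emp}}$ and VC dimension $h$ trained on $M$ samples satisfies
\[
R \;\le\; R_{\mathrm{emp}} + \sqrt{\frac{h\left(\ln\frac{2M}{h} + 1\right) - \ln\frac{\eta}{4}}{M}},
\]
so that, for a fixed empirical error, a smaller VC dimension yields a tighter guarantee on the test error.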
Given a labeled set of $M$ training samples $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, M$, where $\mathbf{x}_i \in \mathbb{R}^N$ and $y_i \in \{-1, +1\}$ is the associated label, an SVM classifier finds the optimal hyperplane that correctly separates (classifies) the largest fraction of data points while maximizing the distance of either class from the hyperplane (the margin).
Vapnik [13] shows that maximizing the margin distance is
equivalent to minimizing the VC dimension in constructing an optimal
hyperplane. Computing the best hyperplane is posed as a constrained
optimization problem and solved using quadratic programming
techniques. The discriminant hyperplane is defined by the level set of
\[
f(\mathbf{x}) = \sum_{i=1}^{M} y_i \alpha_i \, k(\mathbf{x}, \mathbf{x}_i) + b,
\]
where $k(\cdot,\cdot)$ is a kernel function and the sign of $f(\mathbf{x})$ determines the membership of $\mathbf{x}$. Constructing an optimal hyperplane is equivalent to finding all the nonzero $\alpha_i$. Any vector $\mathbf{x}_i$ that corresponds to a nonzero $\alpha_i$ is a support vector (SV) of the optimal hyperplane. A desirable feature of SVMs is that the number of training points which are retained as support vectors is usually quite small, thus providing a compact classifier.
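Concretely, the quadratic program takes the standard soft-margin dual form, stated here for reference (the penalty parameter $C$ belongs to the generic formulation and is not a value specific to this work):
\[
\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{M} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i \alpha_j y_i y_j \, k(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{subject to} \quad \sum_{i=1}^{M} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C,
\]
and the training samples with $\alpha_i > 0$ at the optimum are exactly the support vectors described above.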
For a linear SVM, the kernel function is just a simple dot product in
the input space while the kernel function in a nonlinear SVM
effectively projects the samples to a feature space of higher
(possibly infinite) dimension via a nonlinear mapping function $\Phi: \mathbb{R}^N \rightarrow F$,
and then constructs a hyperplane in $F$. The motivation behind this
mapping is that it is more likely to find a linear hyperplane in the
high dimensional feature space. Using Mercer's theorem
[2], the expensive calculations required in
projecting samples into the high dimensional feature space can be
replaced by a much simpler kernel function satisfying the condition
$k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$, where $\Phi$ is the nonlinear projection function. Several kernel
functions, such as polynomials and radial basis functions, have been
shown to satisfy Mercer's theorem and have been used successfully in
nonlinear SVMs. In fact, by using different kernel functions, SVMs
can implement a variety of learning machines, some of which coincide
with classical architectures. Nevertheless, automatic selection of the
``right'' kernel function and its associated parameters remains
problematic and in practice one must resort to trial and error for
model selection.
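To make the kernel trick concrete, the short sketch below (an illustrative example; the degree-2 polynomial kernel and the toy vectors are assumptions made only for this sketch, not a configuration used in this work) verifies that a polynomial kernel equals a dot product in an explicit feature space and evaluates the decision function $f(\mathbf{x})$ from kernel values alone:
\begin{verbatim}
import numpy as np

# Toy check: for the degree-2 polynomial kernel k(x, y) = (x . y)^2 on R^2,
# the explicit feature map is Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2),
# so that k(x, y) = Phi(x) . Phi(y).

def poly2_kernel(x, y):
    return np.dot(x, y) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
assert np.isclose(poly2_kernel(x, y), np.dot(phi(x), phi(y)))

# The decision function f(x) needs only kernel evaluations against the
# support vectors; the mapping Phi is never computed explicitly.
def decision(x, support_vectors, labels, alphas, b, kernel=poly2_kernel):
    return sum(a * yi * kernel(x, sv)
               for sv, yi, a in zip(support_vectors, labels, alphas)) + b
\end{verbatim}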
Schölkopf [11] was the first to apply SVMs to
recognize 3D objects from 2D images and has demonstrated the
potential of this approach in visual learning.
Pontil and Verri [9] also used SVMs for 3D object
recognition and experimented with a subset of the COIL-100 dataset.
Their training set consisted of 36 images (one for every $10^\circ$ of rotation) for each of the 32 objects they chose, and the test set consisted of the remaining 36 images for each object.
A subset of the COIL-100 dataset has also been used by Roobaert
and Van Hulle [10] to compare the performance of SVMs
with different pixel-based input representations.
Recently, SVMs have been applied to several visual learning tasks
including face detection [7] [4],
pedestrian detection [6] [8], and
gender classification [5].