As shown in Fig. 2, a spatiotemporal event includes two kinds of information: the object of interest, and the movement of the object. The movement of the object can be further decomposed into two components: global and local motions. The global motion captures gross motion in terms of position change. The local motion characterizes deformation, orientation and gesture changes. In the case of sign language, the hand is the object of interest. The position change of the hand is a global movement and the change of the hand gesture and orientation is a local movement.
Figure 3:
The three-stage framework for spatiotemporal event recognition
In this paper, we propose a three-stage framework for spatiotemporal event recognition, as illustrated in Fig. 3. The first stage, sequence acquisition, acquires image sequences representing the event. This involves motion detection and motion-based visual attention. The start and end of motion identify the temporal attention window in which the event occurs. We map this temporal window to a standard temporal length (e.g., 5) to form what is called motion clip, while the speed information is available from the mapping performed in this stage. In a motion clip, only the temporal dimension is normalized. Here, we assume that a hand sign starts and ends with a static hand.
The second stage is visual attention and object segmentation. This stage directs the system to focus on the object of interest in the image. If we assume that the object of interest is moving in a stationary environment, it is not very difficult to roughly determine the position of a moving object in the image using motion information. However, it is not simple if the task is to extract the contour of the object from various backgrounds. In [17], we presented a prediction-and-verification segmentation scheme to locate hands from complex backgrounds. The scheme uses the information learned from past experience to predict the valid segmentation based on particular views. It is more efficient and effective than other stochastic approaches such as simulated annealing.
After Stage 2, the object of interest in each image of a sequence is segmented and mapped to a fovea image of a standard fixed size. Segmented fovea images at different times form a standard spatiotemporal fovea sequence, in which both temporal and spatial dimensions are normalized. The global motion information of the object of interest is placed in a global motion vector, which records the size and position information of the segmented object in the original image. This vector is necessary because once the object is segmented and mapped to a fovea sequence with a standard spatiotemporal size, the global motion information is lost.
Let a fovea image f with m rows and n columns be an (mn)-dimensional vector. For example, the set of image pixels can be written as a vector where and d=mn. Note that although pixels in an image are lined up to form a 1-D vector this way, the 2-D neighborhood information between pixels will be characterized by the scatter matrix of to be discussed later. Let p be the standard temporal length and be the hand fovea image corresponding to the frame i. Then we create a new vector , called the fovea vector, which consists of the hand fovea subimages and global motion vector G,
The third stage, which is the focus of this paper, is to recognize the spatiotemporal event from the fovea vector.
An automatic hand gesture recognition system accepts an input fovea vector and outputs the recognition result which indicates the class to which belongs. Thus, a recognition system can be denoted by a function f that maps each element in the space of to an element in the space of . Our objective of constructing a recognition system is equivalent to approximating function by another function . The error of a approximation can be indicated by certain measure of the error . One such measure is the mean square error:
where is the probability distribution function in S. In other words, can defer a lot from f in parts where never occurs, without affecting the error measure. Another measure is the pointwise absolute error for any point in S', where is a subset of S that is of interest to a certain problem.
Of course, f is typically high-dimensional and highly complex. A powerful method of constructing is using learning. Specifically, a series of cases is acquired as the learning data set:
Then, the learning task is to construct based on L. For notational convenience, the sample points in L is denoted by X(L):
X(L) should be drawn from the real situation so that the underlying distribution of X(L) is as close to the real distribution as possible.
One popular solution to approximating f is to use the nearest neighbor approximator in the eigenspace based on Karhunen-Loeve projection [32]. Following the notation of our earlier paper [15], we refer the features extracted by Karhunen-Loeve projection as the most expressive features (MEFs). The approach which uses the nearest neighbor approximator in the MEF space has been applied to the problems of face recognition [41] and 3D object recognition [35]. Weng & Chen have shown that the nearest neighbor approximator in the eigen subspace approaches f pointwise in probability. However, the features extracted by Karhunen-Loeve projection, in general, not the best ones for classification, because the features that describe some major variations in the class are typically irrelevant to how the subclasses are divided as illustrated in Fig. 4.
Figure 4:
A 2D illustration of the most discriminating features (MDF).
The MDF is projection along . The Karhunen-Loeve projection
along can not
separate the two subclasses.
In this paper, we use the multiclass, multivariate discriminant analysis [42] to select the most discriminating features (MDF's). In the MDF space, we build a recursive partition tree to approximate the function f. The details of the algorithm are in the following section. In the next section, we also show the convergence property of our new recursive partition tree approximator. Our experiments illustrate the better performance of the recursive partition tree approximator in the MDF space than the nearest neighbor approximator in the eigenspace.