
Recognition of spatiotemporal patterns

As shown in Fig. 2, a spatiotemporal event includes two kinds of information: the object of interest and the movement of that object. The movement can be further decomposed into two components: global motion and local motion. The global motion captures gross motion in terms of position change, while the local motion characterizes deformation, orientation, and gesture changes. In the case of sign language, the hand is the object of interest: the position change of the hand is global motion, and the change of hand gesture and orientation is local motion.

  
Figure 3: The three-stage framework for spatiotemporal event recognition

In this paper, we propose a three-stage framework for spatiotemporal event recognition, as illustrated in Fig. 3. The first stage, sequence acquisition, acquires the image sequences representing the event. This involves motion detection and motion-based visual attention. The start and end of motion identify the temporal attention window in which the event occurs. We map this temporal window to a standard temporal length (e.g., five frames) to form what is called a motion clip; the speed information remains available from the mapping performed in this stage. In a motion clip, only the temporal dimension is normalized. Here, we assume that a hand sign starts and ends with a static hand.
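The temporal normalization in this stage can be sketched as follows. This is a minimal illustration assuming nearest-frame resampling; the function name and the choice of interpolation are ours, not details of the original system:

```python
import numpy as np

def normalize_temporal_length(frames, p=5):
    """Map a variable-length motion clip to the standard temporal length p
    by sampling frame indices uniformly, and return the speed factor so the
    speed information removed by normalization is kept alongside the clip."""
    frames = np.asarray(frames)
    n = len(frames)
    # Uniformly spaced positions in the original clip, rounded to frames.
    idx = np.round(np.linspace(0, n - 1, p)).astype(int)
    speed = n / p  # how much faster (or slower) than the standard length
    return frames[idx], speed

clip = np.arange(12)  # a 12-frame clip; indices stand in for images
std_clip, speed = normalize_temporal_length(clip, p=5)
```

The returned speed factor is what keeps the mapping invertible in time: the normalized clip alone no longer says how fast the sign was performed.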

The second stage is visual attention and object segmentation. This stage directs the system to focus on the object of interest in the image. If we assume that the object of interest is moving in a stationary environment, it is not very difficult to roughly determine the position of a moving object using motion information. Extracting the contour of the object from varied backgrounds, however, is considerably harder. In [17], we presented a prediction-and-verification segmentation scheme to locate hands against complex backgrounds. The scheme uses information learned from past experience to predict the valid segmentation based on particular views. It is more efficient and effective than stochastic approaches such as simulated annealing.

After Stage 2, the object of interest in each image of a sequence is segmented and mapped to a fovea image of a standard fixed size. Segmented fovea images at different times form a standard spatiotemporal fovea sequence, in which both temporal and spatial dimensions are normalized. The global motion information of the object of interest is placed in a global motion vector, which records the size and position information of the segmented object in the original image. This vector is necessary because once the object is segmented and mapped to a fovea sequence with a standard spatiotemporal size, the global motion information is lost.
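A minimal sketch of the fovea mapping described above. The bounding-box representation, fovea size, and nearest-neighbor resampling here are illustrative assumptions, not details of the original system:

```python
import numpy as np

def to_fovea(image, box, size=32):
    """Crop the segmented object (given by its bounding box) and map it to a
    fixed-size fovea image; record position and size as this frame's entry
    in the global motion vector. box is (row, col, height, width)."""
    r, c, h, w = box
    patch = image[r:r + h, c:c + w]
    # Nearest-neighbor resampling of the crop to the standard fovea size.
    rows = np.clip((np.arange(size) * h / size).astype(int), 0, h - 1)
    cols = np.clip((np.arange(size) * w / size).astype(int), 0, w - 1)
    fovea = patch[np.ix_(rows, cols)]
    g = np.array([r, c, h, w], dtype=float)  # global motion entry
    return fovea, g

img = np.arange(100 * 100, dtype=float).reshape(100, 100)
fovea, g = to_fovea(img, box=(10, 20, 40, 50), size=32)
```

Keeping `g` separately is the point made in the text: once the crop is rescaled to the standard fovea size, position and size can no longer be recovered from the fovea sequence itself.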

Let a fovea image f with m rows and n columns be an (mn)-dimensional vector. For example, the set of image pixels {f_1, f_2, ..., f_d} can be written as a vector F = (f_1, f_2, ..., f_d)^T, where f_i is the value of the i-th pixel and d = mn. Note that although the pixels of an image are lined up to form a 1-D vector F this way, the 2-D neighborhood information between pixels will be characterized by the scatter matrix of F, to be discussed later. Let p be the standard temporal length and F_i be the hand fovea image corresponding to frame i. Then we create a new vector X, called the fovea vector, which consists of the hand fovea subimages and the global motion vector G:

  X = (F_1^T, F_2^T, ..., F_p^T, G^T)^T.
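Concretely, the fovea vector X is assembled by concatenating the p flattened fovea images with the global motion vector G. The sizes below are arbitrary placeholders:

```python
import numpy as np

p, m, n = 5, 32, 32  # standard temporal length and fovea image size
foveas = [np.random.rand(m, n) for _ in range(p)]   # stand-ins for F_1..F_p
G = np.array([12.0, 34.0, 40.0, 50.0])              # illustrative G

# Stack the p flattened fovea images and the global motion vector G
# into a single fovea vector X of dimension p*m*n + len(G).
X = np.concatenate([f.ravel() for f in foveas] + [G])
```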

The third stage, which is the focus of this paper, is to recognize the spatiotemporal event from the fovea vector.

An automatic hand gesture recognition system accepts an input fovea vector X and outputs the recognition result C, which indicates the class to which X belongs. Thus, a recognition system can be denoted by a function f that maps each element in the space of X to an element in the space of C. Our objective in constructing a recognition system is then equivalent to approximating the function f by another function f̂. The error of an approximation can be indicated by a certain measure of the error f - f̂. One such measure is the mean square error:

  E ||f(X) - f̂(X)||^2 = ∫_S ||f(X) - f̂(X)||^2 dF(X),

where F(X) is the probability distribution function of X in S. In other words, f̂ can differ a lot from f in parts of the space where X never occurs without affecting the error measure. Another measure is the pointwise absolute error ||f(X) - f̂(X)|| for any point X in S', where S' is a subset of S that is of interest for a particular problem.
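As a toy illustration of the mean square error measure (a hypothetical one-dimensional decision task; none of these quantities come from the actual system), the expectation can be estimated by Monte Carlo sampling from the distribution of X:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # samples drawn from X's distribution
f_true = (X[:, 0] > 0).astype(float)      # hypothetical true classes f(X)
f_hat = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(float)  # approximator

# Monte Carlo estimate of E||f(X) - f_hat(X)||^2. Because the samples come
# from the actual distribution of X, regions where X never occurs contribute
# nothing -- exactly the property of the integral form noted in the text.
mse = np.mean((f_true - f_hat) ** 2)
```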

Of course, f is typically high-dimensional and highly complex. A powerful method of constructing f̂ is learning. Specifically, a series of cases is acquired as the learning data set:

  L = {(X_1, C_1), (X_2, C_2), ..., (X_n, C_n)}.

Then, the learning task is to construct f̂ based on L. For notational convenience, the set of sample points in L is denoted by X(L):

  X(L) = {X_1, X_2, ..., X_n}.

X(L) should be drawn from the real situation so that the underlying distribution of X(L) is as close to the real distribution as possible.

One popular approach to approximating f is to use the nearest neighbor approximator in the eigenspace obtained by the Karhunen-Loeve projection [32]. Following the notation of our earlier paper [15], we refer to the features extracted by the Karhunen-Loeve projection as the most expressive features (MEFs). The nearest neighbor approximator in the MEF space has been applied to face recognition [41] and 3-D object recognition [35]. Weng & Chen have shown that the nearest neighbor approximator in the eigen subspace approaches f pointwise in probability. However, the features extracted by the Karhunen-Loeve projection are, in general, not the best ones for classification, because the features that describe the major variations in the data are typically irrelevant to how the subclasses are divided, as illustrated in Fig. 4.
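The MEF approach can be sketched as follows: a Karhunen-Loeve (principal component) projection followed by a nearest-neighbor decision in the subspace. This is a generic sketch under our own toy data, not the implementation of [41] or [35]:

```python
import numpy as np

def kl_project(X_train, k):
    """Karhunen-Loeve projection onto the k most expressive features (MEFs):
    the leading eigenvectors of the scatter matrix, computed via SVD of the
    mean-centered data."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:k]  # basis rows span the MEF subspace

def nn_classify(x, mean, basis, Y_train, labels):
    """Nearest-neighbor approximator in the MEF subspace."""
    y = basis @ (x - mean)
    d = np.linalg.norm(Y_train - y, axis=1)
    return labels[np.argmin(d)]

# Toy data: two well-separated clusters in 10-D.
rng = np.random.default_rng(1)
A = rng.normal(0.0, 0.5, size=(20, 10))   # class 0
B = rng.normal(5.0, 0.5, size=(20, 10))   # class 1
X_train = np.vstack([A, B])
labels = np.array([0] * 20 + [1] * 20)
mean, basis = kl_project(X_train, k=2)
Y_train = (X_train - mean) @ basis.T      # training samples in MEF space
```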

Figure 4: A 2-D illustration of the most discriminating features (MDFs). The MDF is the projection along the direction that separates the two subclasses; the Karhunen-Loeve projection, along the direction of largest variance, cannot separate them.

In this paper, we use multiclass, multivariate discriminant analysis [42] to select the most discriminating features (MDFs). In the MDF space, we build a recursive partition tree to approximate the function f. The details of the algorithm are given in the following section, where we also show the convergence property of our new recursive partition tree approximator. Our experiments show that the recursive partition tree approximator in the MDF space outperforms the nearest neighbor approximator in the eigenspace.
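A minimal sketch of the discriminant analysis step, using the standard Fisher scatter-matrix formulation (the ridge term and toy data below are our additions, not from [42]). The example reproduces the situation of Fig. 4: the class-separating direction is not the direction of largest variance:

```python
import numpy as np

def mdf_projection(X, y, k):
    """Multiclass discriminant analysis: the k most discriminating features
    (MDFs) are the leading eigenvectors of inv(Sw) @ Sb, where Sw and Sb
    are the within-class and between-class scatter matrices."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)        # within-class scatter
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)      # between-class scatter
    # A small ridge keeps Sw invertible when samples are scarce.
    vals, vecs = np.linalg.eig(np.linalg.inv(Sw + 1e-6 * np.eye(d)) @ Sb)
    order = np.argsort(-vals.real)
    return vecs.real[:, order[:k]]           # columns are MDF directions

# Two classes separated along axis 0, with much larger (but irrelevant)
# variance along axis 1 -- the Fig. 4 situation.
rng = np.random.default_rng(2)
x0 = np.concatenate([rng.normal(0.0, 0.2, 50), rng.normal(2.0, 0.2, 50)])
x1 = rng.normal(0.0, 5.0, 100)
X = np.column_stack([x0, x1])
y = np.array([0] * 50 + [1] * 50)
w = mdf_projection(X, y, k=1)[:, 0]
```

Here the Karhunen-Loeve projection would pick axis 1 (largest variance), while the MDF direction aligns with axis 0, where the classes actually differ.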



Yuntao Cui
Wed Jun 25 16:00:42 EDT 1997