
Relation to Previous Work

Recently, there has been a significant amount of research on vision-based hand gesture recognition (see [27] for a survey). A vision-based approach acquires visual information about a gesture using a single video camera or a pair of cameras.

Existing approaches typically comprise two parts: hand modeling and hand motion analysis. Models of the human hand include three-dimensional models (e.g., Downton & Drouet [20] and Etoh et al. [24]: generalized cylindrical models; Kuch & Huang [31]: a NURBS-based hand model), region-based models (e.g., Darrell & Pentland [18], Bobick & Wilson [4], and Cui et al. [15]), two-dimensional shape models (e.g., Starner & Pentland [39]: elliptic shape models; Cho & Dunn [10]: a line-segment shape model; Kervrann & Heitz [29] and Blake & Isard [3]: contour shape models), and fingertip models (e.g., Cipolla et al. [12] and Davis & Shah [21]).

Different hand models lead to different models of hand motion. A system that uses a three-dimensional hand model can capture the true kinematics of the hand (e.g., [31]). For a system that uses a two-dimensional hand model, the motion is described as two-dimensional rotation, translation, and scaling in the image plane [29]. When fingertips are used to model the hand, the trajectory of each fingertip is a natural representation of the motion [21]. Typically, these motion parameters have been used explicitly to perform gesture classification.
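
For concreteness, such a two-dimensional motion model can be written as a similarity transform of the image plane; the notation below is generic rather than taken from [29]. A point x on the hand maps to

    x' = s R(θ) x + t,

where s is the scale factor, R(θ) the 2 x 2 in-plane rotation matrix, and t the translation vector, so the motion has four degrees of freedom.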

The vision-based approach is one of the most unobtrusive ways to let users interact with computers in a natural fashion. However, it faces several difficulties. Among them, segmenting a moving hand from a possibly complex background is perhaps the most difficult. Many current systems rely on markers or marked gloves (e.g., [12, 21, 39]); others simply assume a uniform background (e.g., [4, 15, 18]). Recently, Moghaddam and Pentland [34] used a maximum-likelihood decision rule, based on the estimated probability density of the hand and its 2D contour, to detect hands in intensity images. The segmentation scheme in our framework uses attention images from multiple fixations: the search for a valid segmentation is guided by predictions from the training samples and verified by a learning-based interpolation scheme.
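
The following is a minimal sketch of such a maximum-likelihood decision rule, assuming (purely for illustration) Gaussian density estimates over a low-dimensional feature vector; the densities and features actually estimated in [34] are richer:

    import numpy as np

    def log_gaussian(x, mean, cov):
        # Log-density of a multivariate Gaussian evaluated at x.
        d = x - mean
        sign, logdet = np.linalg.slogdet(cov)
        return -0.5 * (d @ np.linalg.inv(cov) @ d
                       + logdet + x.size * np.log(2.0 * np.pi))

    # Hypothetical parameters; in practice they are estimated from
    # training images of hands and of backgrounds.
    hand_mean, hand_cov = np.array([0.7, 0.2]), 0.01 * np.eye(2)
    bg_mean, bg_cov = np.array([0.3, 0.5]), 0.05 * np.eye(2)

    def is_hand(x):
        # Maximum-likelihood rule: label the feature vector "hand" if
        # its likelihood under the hand density exceeds its likelihood
        # under the background density.
        return (log_gaussian(x, hand_mean, hand_cov)
                > log_gaussian(x, bg_mean, bg_cov))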

Another major difference between our framework and most existing approaches is that motion understanding is tightly coupled with spatial recognition. We do not separate hand modeling and motion recognition into two different processes; instead, both types of information are used fully in an integrated classification stage.
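
Schematically, such coupling amounts to classifying one joint feature vector rather than recognizing shape and motion in separate stages. The sketch below illustrates the idea with a nearest-prototype rule; all names are hypothetical, and the rule itself is only a placeholder, not the classifier of our framework:

    import numpy as np

    def classify_gesture(spatial_feats, motion_feats, prototypes):
        # Integrated classification: spatial appearance and motion
        # parameters form a single joint feature vector, so neither
        # cue is committed to before the final decision.
        x = np.concatenate([spatial_feats, motion_feats])
        # 'prototypes' maps each gesture label to a joint feature
        # vector; return the label of the nearest prototype.
        return min(prototypes,
                   key=lambda label: np.linalg.norm(x - prototypes[label]))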

