3D Hand Pose Reconstruction Using ISOSOM

Haiying Guan, Matthew Turk


We present an appearance-based 3D hand posture estimation method that determines a ranked set of possible hand posture candidates from an unmarked hand image, based on an analysis-by-synthesis method and an image retrieval algorithm. We formulate posture estimation as a nonlinear, many-to-many mapping problem in a high-dimensional space. We propose a general algorithm for nonlinear dimension reduction, called ISOSOM, and apply it to 3D hand pose reconstruction to establish the mapping between hand poses and image features. To interpolate intermediate posture values from the sparse sampling of ground-truth training data, we generate the geometric map structure of the samples' manifold. The experimental results show that the ISOSOM algorithm performs better than traditional image retrieval algorithms for hand pose estimation.


1.1 Motivation

Non-intrusive hand gesture interpretation plays a crucial role in a wide range of applications, such as automatic sign language understanding, entertainment, and human-computer interaction (HCI). Because hand gestures are natural and intuitive, and provide rich information to computers without cumbersome extra devices, they offer great potential for next-generation user interfaces. They are especially suitable for large-scale displays, 3D volumetric displays, and wearable devices such as PDAs or cell phones.

1.2 Introduction

In this research, we take an image retrieval approach based on the analysis-by-synthesis method. The system diagram is shown in Figure 1. The system uses a realistic 3D hand model and renders it from different viewpoints to generate synthetic hand images. A set of possible candidates is found by comparing the real hand image with the synthetic images. The ground-truth labels of the retrieved matches are used as hand pose candidates.

Figure 1. System diagram of the appearance-based hand posture recognition system
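
The retrieval step described above can be sketched as a nearest-neighbor search over labeled synthetic images. This is a minimal illustration only: the feature vectors, the Euclidean distance, the pose label strings, and the function name `retrieve_pose_candidates` are all assumptions made for the example, since the summary does not specify the image features or the similarity measure.

```python
import heapq
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_pose_candidates(query_feature, database, top_n):
    """Return the pose labels of the top_n database entries closest to the query.

    database is a list of (feature_vector, pose_label) pairs, one per
    synthetic image; the labels play the role of ground-truth poses.
    """
    nearest = heapq.nsmallest(
        top_n, database, key=lambda item: euclidean(query_feature, item[0])
    )
    return [pose for _, pose in nearest]

# Toy database with made-up 2D features and hypothetical pose labels.
database = [
    ((0.0, 0.0), "open_palm_front"),
    ((1.0, 0.0), "open_palm_side"),
    ((5.0, 5.0), "fist_front"),
    ((0.9, 0.2), "pointing_side"),
]

candidates = retrieve_pose_candidates((1.0, 0.1), database, top_n=2)
print(candidates)
```

The result is a ranked list of candidates rather than a single answer, which matches the idea of returning several plausible poses for an ambiguous image.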

The hand is modeled as a 3D articulated object with 21 DOF for the joint angles (the hand configuration) and 6 DOF for global rotation and translation. A hand pose is defined as a hand configuration augmented by the 3 DOF of global rotation. The main problem of analysis by synthesis is the complexity of such a high-dimensional space: the size of the synthetic database grows exponentially with the required accuracy of the parameters. Even though the articulation of the hand is highly constrained, the complexity remains intractable for both database processing and image retrieval.
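
To give a feel for this growth, the short calculation below counts the images a brute-force database would need if each of the 24 pose parameters (21 joint angles plus 3 global rotations) were sampled on a uniform grid. Uniform grid sampling and the sample counts are assumptions chosen for illustration; they ignore the joint constraints mentioned above and are not the sampling scheme of the actual system.

```python
def grid_database_size(num_dof: int, samples_per_dof: int) -> int:
    """Images needed if every degree of freedom is sampled on a uniform grid."""
    return samples_per_dof ** num_dof

# 21 joint-angle DOF plus 3 global rotation DOF, as in the hand pose definition.
POSE_DOF = 21 + 3

# The samples_per_dof values are illustrative assumptions, not paper settings.
for samples_per_dof in (2, 3, 5):
    size = grid_database_size(POSE_DOF, samples_per_dof)
    print(f"{samples_per_dof} samples per DOF -> {size:,} images")
```

Even two samples per parameter already yields more than sixteen million images, which is why a sparse set of samples with interpolation is attractive.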

We formulate hand pose reconstruction as a nonlinear mapping problem between the angle vectors (hand configurations) and the images. In general, this is a many-to-many mapping in a high-dimensional space. Because of occlusions, different hand poses can be rendered to the same image. Conversely, the same hand configuration rendered from different viewpoints generates many images. To simplify the problem, we eliminate the second case by augmenting the hand configuration vector with the 3 global rotation parameters. The mapping from images to augmented hand configurations then becomes a one-to-many mapping between the image space and the augmented hand configuration space (the hand pose space). We establish this one-to-many mapping between the feature space and the hand pose space with the proposed ISOSOM algorithm. The experimental results show that our algorithm performs better than traditional image retrieval algorithms.


Instead of representing each synthetic image as an isolated item in the database, we cluster the similar vectors generated by similar poses and use the ground-truth samples to generate an organized structure in a low-dimensional space. With such a structure, we can interpolate the intermediate vectors, which greatly reduces the complexity. Based on Kohonen's Self-Organizing Map (SOM) and Tenenbaum's ISOMAP algorithm, we propose the ISOmetric Self-Organizing Mapping algorithm (ISOSOM). Instead of organizing the samples on a 2D grid by Euclidean distance, it uses the topological graph and geometric distance of the samples' manifold to define the metric relationships between samples, enabling the SOM to better follow the topology of the underlying data set. The ISOSOM algorithm compresses information and automatically clusters the training samples in a low-dimensional space efficiently. Figure 2 gives an intuitive depiction of the ISOSOM map.

Figure 2. The ISOSOM for Hand Posture Estimation
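
The difference between Euclidean distance and distance measured along the samples' manifold can be illustrated with the neighborhood-graph construction that ISOMAP uses: connect each sample to its nearest neighbors, then take shortest paths through the graph. The sketch below covers only that distance computation, not the ISOSOM algorithm itself; the function name `manifold_distances`, the neighborhood size `k`, and the toy data are assumptions for the example.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manifold_distances(points, k):
    """Approximate along-manifold distances with a k-nearest-neighbor graph.

    Each point is linked to its k nearest neighbors (edges are undirected),
    and all-pairs shortest paths are computed with Floyd-Warshall.
    """
    n = len(points)
    dist = [[float("inf")] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: euclidean(points[i], points[j]),
        )
        for j in others[:k]:
            d = euclidean(points[i], points[j])
            dist[i][j] = min(dist[i][j], d)
            dist[j][i] = min(dist[j][i], d)
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = dist[i][m] + dist[m][j]
                if via < dist[i][j]:
                    dist[i][j] = via
    return dist

# Toy data: five points on a unit half circle, a curved 1D manifold in 2D.
points = [
    (math.cos(math.radians(a)), math.sin(math.radians(a)))
    for a in (0, 45, 90, 135, 180)
]
dist = manifold_distances(points, k=2)
straight = euclidean(points[0], points[4])
print(f"straight-line: {straight:.3f}, along graph: {dist[0][4]:.3f}")
```

The two end points are close in a straight line but farther apart along the curve, so a map organized by graph distance keeps them apart where a purely Euclidean one would pull them together.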

1.4 Experimental Results

The retrieval results of ISOSOM are shown in Figure 3. The first image is the query image, the second image is the pseudo ground truth, and the remaining 20 images are the retrieval results from the ISOSOM neurons.

Figure 3. The ISOSOM Hand Posture Retrieval Results

We compare the hand pose estimation performance of the traditional image retrieval algorithm (IR), SOM, and ISOSOM in Table 1. IR is implemented by simply comparing the query image with each image in the synthetic dataset and retrieving the top N best matches. The results show that the correct match rate of ISOSOM is around 16.5% higher than that of IR and around 5.6% higher than that of SOM. N is determined by the application requirements. To achieve a success rate of more than 85%, IR must retrieve more than 280 images, which yields more than 280 hand pose possibilities. ISOSOM requires fewer than half as many and also has a higher precision rate. The retrieval rate curve is shown in Figure 4.

Figure 4. The retrieval rate curve
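
A retrieval rate curve of this kind can be computed by checking, for each N, how often the true pose appears among the top N retrieved candidates. The sketch below is a hedged illustration: the summary does not define when a retrieved pose counts as a correct match, so exact label equality is assumed here, and the ranked lists are made-up toy data rather than experimental results. The function names are hypothetical.

```python
def success_rate_at_n(ranked_results, ground_truth, top_n):
    """Fraction of queries whose true pose is among the first top_n candidates."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_results, ground_truth)
        if truth in ranked[:top_n]
    )
    return hits / len(ground_truth)

def smallest_n_for_rate(ranked_results, ground_truth, target_rate):
    """Smallest N whose success rate reaches target_rate, or None if never."""
    max_n = max(len(ranked) for ranked in ranked_results)
    for n in range(1, max_n + 1):
        if success_rate_at_n(ranked_results, ground_truth, n) >= target_rate:
            return n
    return None

# Toy data: one ranked candidate list per query, plus the true label per query.
ranked_results = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["a", "c", "b"],
]
ground_truth = ["a", "a", "a", "b"]

curve = [success_rate_at_n(ranked_results, ground_truth, n) for n in (1, 2, 3)]
print(curve)
print(smallest_n_for_rate(ranked_results, ground_truth, 0.85))
```

Comparing two methods then amounts to comparing the N each needs to reach the same target rate, as in the 85% comparison above.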


Haiying Guan, Rogerio S. Feris, and Matthew Turk, "The Isometric Self-Organizing Map for 3D Hand Pose Estimation," accepted by Proc. 7th International Conference on Automatic Face and Gesture Recognition, Southampton, UK, Apr. 10-12, 2006.

Haiying Guan and Matthew Turk, "3D Hand Pose Reconstruction with ISOSOM," Proc. First International Symposium on Advances in Visual Computing, pp. 630-635, Lake Tahoe, NV, USA, Dec. 5-7, 2005.

Haiying Guan and Matthew Turk, "3D Hand Pose Reconstruction with ISOSOM," Technical Report 2005-15, Department of Computer Science, University of California, Santa Barbara.