3D Hand Pose Retrieval Using ISOSOM

Haiying Guan, Feris S. Rogerio, and Matthew Turk


We propose the Isometric Self-Organizing Map (ISOSOM) and Hierarchical ISOSOM (H-ISOSOM) algorithms for manifold learning and representation, and we apply them to the 3D hand pose estimation problem.

First, we describe an appearance-based 3D hand posture estimation approach. Using a realistic 3D hand model, we generate a large synthetic image data set by analysis by synthesis, with ground truth for both the hand configurations and the camera poses. The data set can be used for exemplar-based hand posture and pose retrieval. When a real query image is compared with the synthetic images, the best match is not necessarily the correct solution. Because of occlusion, the same hand image can, in most cases, be generated by several different hand configurations. In addition, when a feature is used to represent the hand image, different hand images may share the same feature. To handle this ambiguity, the algorithm uses image retrieval to output a ranked set of candidate hand postures, together with the relative camera rotation parameters, rather than a single solution.

Second, because the synthetic database is large, an exhaustive search over the whole data set is inefficient. To improve retrieval efficiency, we model the hand manifold with a concise, organized representation using the ISOSOM and H-ISOSOM algorithms. This manifold is complex, highly nonlinear, large-scale, and high-dimensional, and it is composed of three components: hand image features, hand postures, and hand poses.

Third, to represent the hand image robustly, we use a depth edge detector with a multi-flash camera to obtain the hand shape against complex backgrounds. Finally, we compare our approach with well-known existing approaches on synthetic and real hand images. In these experiments, ISOSOM and H-ISOSOM achieve higher retrieval rates than Nearest Neighbor search, SOM, and GHSOM.


1.1 Motivation

Many researchers and inventors working on human-computer interface design are looking for interaction modalities that transfer human ideas and intentions to the computer directly and effortlessly. Compared with other high degree-of-freedom (DOF) 3D input devices, hand gestures provide abundant DOF and are natural and intuitive for communication. Hands can perform countless actions, such as pointing, grabbing, throwing, reaching, and pushing. They also play an important role in interactions with objects and people. Hand gestures are used to express ideas, communicate with others, and show intentions and feelings. For example, we grasp objects we want to explore; we wave at familiar faces; we express happiness and the satisfaction of achievement with "OK" or "Victory" gestures; and we use the "Call me" gesture (bending the three middle fingers and extending the thumb and little finger) to ask someone to keep in touch. Gesture is a deeply rooted part of human communication. These advantages make gesture a strong candidate for interactive interfaces and a potential input modality for the next generation of 3D (or even higher-dimensional) free-form input interfaces.

Visual sensing provides a passive and non-intrusive way for computers to acquire gesture input. The visual modality is natural, untethered, and inexpensive. To make gesture interfaces widely applicable, we must explore how a vision system should be designed to model, detect, analyze, and recognize gestures. Visually modeling the hand manifold and estimating hand posture and pose are vital tasks for gesture interpretation. Exemplar-based posture and pose estimation goes beyond a multi-class classification task and remains an unsolved problem. The core task is to learn the highly nonlinear mapping between the hand configuration space and the image feature space, which is difficult and challenging.

1.2 Introduction

In this research, we take an image retrieval approach based on the analysis-by-synthesis method. The system diagram is shown in Figure 1. The system uses a realistic 3D hand model and renders it from different viewpoints to generate synthetic hand images. A set of possible candidates is found by comparing the real hand image with the synthetic images. The ground truth labels of the retrieved matches are used as hand pose candidates.

Figure 1. The system diagram for appearance-based hand posture recognition system

The hand is modeled as a 3D articulated object with 21 DOF for the joint angles (the hand configuration) and 6 DOF for global rotation and translation. A hand pose is defined by a hand configuration augmented by the 3 DOF global rotation parameters. The main problem of analysis by synthesis is the complexity of such a high-dimensional space. The size of the synthetic database grows exponentially with the sampling resolution of the parameters. Even though the articulation of the hand is highly constrained, the complexity is still intractable for both database processing and image retrieval.
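To make the exponential growth concrete, the sketch below counts the grid points obtained when every parameter is sampled uniformly. The 21 joint-angle DOF and 3 rotation DOF come from the text; the angular ranges (90° per joint, 360° per rotation) and the step sizes are illustrative assumptions, and the count ignores the articulation constraints mentioned above.

```python
def grid_size(num_dof: int, range_deg: float, step_deg: float) -> int:
    """Number of grid points when each DOF is sampled every step_deg over range_deg."""
    per_dof = int(range_deg // step_deg) + 1  # both endpoints included
    return per_dof ** num_dof

# Assumed ranges: 90 degrees per joint angle, 360 degrees per global rotation angle.
for step in (45.0, 30.0, 15.0):
    joints = grid_size(21, 90.0, step)
    rotations = grid_size(3, 360.0, step)
    print(f"step {step:4.0f} deg: {joints * rotations:.3e} synthetic images")
```

Even at a coarse 45° step, the unconstrained grid is far too large to render or search, which is why a compact manifold representation is needed.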

We formulate hand pose reconstruction as a nonlinear manifold representation problem. Next, we present the ISOSOM algorithm and its hierarchical version (H-ISOSOM), and we use them to learn the complex, high-dimensional manifold.


In this section, we propose the Isometric Self-Organizing Map (ISOSOM) method for nonlinear manifold representation. It integrates a Self-Organizing Map (SOM) model, the ISOMAP nonlinear dimension reduction algorithm, and a local linear interpolation algorithm, and it represents large-scale, high-dimensional data with a small set of representatives in a low-dimensional organized structure. We then construct a hierarchical version of ISOSOM to address the complexity and accuracy problems. H-ISOSOM learns an organized structure of a non-convex, large-scale manifold and represents it by a set of hierarchically organized maps. The hierarchical structure follows a coarse-to-fine strategy. Guided by the coarse global structure, it "unfolds" the manifold at the coarse level and decomposes the sample data into small patches, then iteratively learns the nonlinearity of each patch at finer levels. The algorithm simultaneously reorganizes and clusters the data samples in a low-dimensional space to obtain a concise representation.

1.3.1 ISOSOM

Based on SOM, ISOMAP, and local linear interpolation, we propose the Isometric Self-Organizing Map (ISOSOM) algorithm for manifold learning. Figure 2 gives an intuitive depiction of the ISOSOM map. ISOSOM uses the topological graph and the geometric distance on the sample manifold to define the metric relationships between samples. With this geometric distance, the SOM follows the topology of the underlying data set more closely and preserves the spatial relationships of the samples in the low-dimensional space. At the same time, it self-organizes the samples in the low-dimensional space. Finally, we use local linear interpolation to take both global and local relationships into account and to make the whole mapping pseudo-invertible.
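The geometric distance that ISOSOM borrows from ISOMAP can be illustrated on a toy curve. The following is a minimal sketch of that one ingredient only (a neighborhood graph plus shortest paths), not the full ISOSOM algorithm; the function names, the choice of k, and the arc data are assumptions made for illustration.

```python
import heapq
import math

def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph: node -> {neighbor: Euclidean edge length}."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph

def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from source along the graph edges."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return best

# Toy manifold: 50 points on a three-quarter arc of the unit circle (a curve in 2D).
n = 50
angles = [1.5 * math.pi * i / (n - 1) for i in range(n)]
points = [(math.cos(a), math.sin(a)) for a in angles]

graph = knn_graph(points, k=2)
geo = geodesic_distances(graph, source=0)

euclid_end = math.dist(points[0], points[-1])
geo_end = geo[n - 1]
print(f"Euclidean distance between the arc ends: {euclid_end:.3f}")
print(f"Graph (geodesic) distance along the arc:  {geo_end:.3f}")
```

The two ends of the arc are close in the ambient space but far apart along the curve. A map trained with the graph distance keeps them apart, which is the sense in which the geometric distance helps the SOM follow the topology of the data.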

Figure 2. The ISOSOM Algorithm

1.3.2 H-ISOSOM

ISOMAP is computationally expensive: in practice, the MATLAB package provided by the ISOMAP authors can handle only around 20,000 training samples before exhausting the 2 GB of memory of a personal computer. ISOSOM therefore cannot handle large-scale data sets (for example, more than one million training samples). To solve this problem, we present a hierarchical version of ISOSOM, depicted in Figure 3. H-ISOSOM aims to reduce the computational complexity for large-scale data sets, to improve accuracy, and to construct a hierarchical structure for fast retrieval or indexing.


Figure 3. The H-ISOSOM Algorithm

To make training on large-scale data sets tractable, H-ISOSOM follows a coarse-to-fine strategy. Let Nmax be the upper limit on the sample size that an algorithm of O(N^3) complexity can handle in practice. H-ISOSOM first randomly samples the whole data set to obtain a subset of m samples, where m < Nmax, and trains the first layer of the H-ISOSOM map on it using the ISOSOM algorithm. Then, for each representative on the map, it collects the associated training samples from the original data set and sub-samples them to train the next layer. The algorithm iteratively trains the sub-maps with ISOSOM until a stopping criterion is reached.
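The coarse-to-fine procedure above can be sketched as a recursive tree construction. This is a simplified illustration under stated assumptions: `train_map` is a hypothetical stand-in that merely picks random samples as representatives (it does not implement ISOSOM), samples are assigned to their nearest representative by Euclidean distance, and the stopping criterion (patch size or depth limit) is assumed, since the text does not specify one.

```python
import math
import random

def train_map(samples, num_reps, rng):
    """Hypothetical stand-in for ISOSOM training: pick num_reps samples as representatives."""
    return rng.sample(samples, min(num_reps, len(samples)))

def nearest(reps, x):
    """Index of the representative closest to x (Euclidean distance)."""
    return min(range(len(reps)), key=lambda i: math.dist(reps[i], x))

def build_hierarchy(samples, n_max, num_reps, min_patch, rng, depth=0, max_depth=3):
    """Coarse-to-fine tree: each node holds representatives and one child per representative."""
    # Train on a random subset of m < n_max samples when the node's data is too large.
    subset = rng.sample(samples, n_max - 1) if len(samples) >= n_max else list(samples)
    reps = train_map(subset, num_reps, rng)
    # Collect, for each representative, its samples from the full data at this node.
    patches = [[] for _ in reps]
    for x in samples:
        patches[nearest(reps, x)].append(x)
    children = []
    for patch in patches:
        # Assumed stopping criterion: small patch or depth limit reached.
        if len(patch) <= min_patch or depth + 1 >= max_depth:
            children.append(None)
        else:
            children.append(
                build_hierarchy(patch, n_max, num_reps, min_patch, rng, depth + 1, max_depth)
            )
    return {"reps": reps, "children": children, "size": len(samples)}

def retrieve(node, query):
    """Descend the hierarchy by following the nearest representative at each level."""
    while True:
        i = nearest(node["reps"], query)
        child = node["children"][i]
        if child is None:
            return node["reps"][i]
        node = child

rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(2000)]
tree = build_hierarchy(data, n_max=200, num_reps=8, min_patch=50, rng=rng)
match = retrieve(tree, (0.5, 0.5))
print(match)
```

Each level only ever trains on fewer than `n_max` samples, so the expensive step stays within the practical limit, and a query touches only one small map per level instead of the whole data set.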

1.4 Experimental Results

1.4.1 ISOSOM Experimental Results

In our experiments, we use depth edges and a variant of the shape context descriptor to represent the given image. Each hand shape is represented by a 256-component vector. A match is counted as correct if the hand posture of the real image is the same as one of the hand postures in the top K retrieved images (with small changes in the hand configuration) and, in addition, the three global hand rotation parameters (roll, pitch, yaw) of the test image are within the 40° range of those of the retrieved hand image. Using this criterion, we compare the hand pose estimation performance of Nearest Neighbor (NN), SOM, and ISOSOM in Table 1. NN is implemented by comparing the query image with every image in the synthetic data set and retrieving the top K nearest matches. The results show that the correct match rate of ISOSOM is around 16.5% higher than that of NN and around 5.6% higher than that of SOM. K is determined by the application requirements. To achieve a success rate above 85%, NN must retrieve more than 280 images, which yields more than 280 hand pose possibilities. ISOSOM requires fewer than half as many and also has a higher precision rate. ISOSOM with 1512 neurons needs 0.036 seconds to retrieve the top 400 images, more than 12 times faster than NN, which needs 0.453 seconds.
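The NN baseline and the correct-match criterion described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the record layout, the Euclidean feature distance, the toy 2-component features (instead of 256), and the interpretation of "within the 40° range" as an absolute per-angle difference of at most 40° are all assumptions.

```python
import heapq
import math

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def top_k(query_feat, database, k):
    """Exhaustive NN search: the k entries with the smallest feature distance."""
    return heapq.nsmallest(k, database, key=lambda e: math.dist(e["feat"], query_feat))

def is_correct(query, retrieved, max_rot_deg=40.0):
    """True if some retrieved entry has the query's posture and roll, pitch, yaw within range."""
    return any(
        e["posture"] == query["posture"]
        and all(angle_diff(a, b) <= max_rot_deg for a, b in zip(e["rot"], query["rot"]))
        for e in retrieved
    )

# Toy synthetic database with ground-truth posture and (roll, pitch, yaw) labels.
database = [
    {"feat": (0.0, 0.0), "posture": "fist", "rot": (0, 0, 0)},
    {"feat": (0.1, 0.0), "posture": "open", "rot": (10, 350, 20)},
    {"feat": (0.9, 0.9), "posture": "open", "rot": (180, 0, 0)},
]
query = {"feat": (0.04, 0.0), "posture": "open", "rot": (0, 10, 0)}

for k in (1, 2):
    print(k, is_correct(query, top_k(query["feat"], database, k)))
```

In this toy case the single best match has the wrong posture, and the query only counts as correctly retrieved once K is raised to 2. This mirrors why the method returns a ranked candidate set rather than one answer.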

Number of retrieved images (table columns): Top 40, Top 80, Top 120, Top 160, Top 200, Top 240, Top 280

Table 1. The comparison of retrieval rates

The ISOSOM retrieval results are shown in Figure 4. The first image in the first row is the query image. The second image in the first row is the model generated by its pseudo-ground truth. The remaining 20 images are the retrieval results from the ISOSOM representatives.

Figure 4. The ISOSOM Hand Posture Retrieval Results

Figure 5 compares the match rates as a function of the number of retrieved images. The results indicate that ISOSOM performs better overall than the traditional image retrieval algorithm and SOM. SOM is slightly better than ISOSOM around K=20, but the retrieval rate at that point is too low for it to be a good choice of K in most applications.

Figure 5. The retrieval rate curve

1.4.2 H-ISOSOM Experimental Results

In the experiments, we generated a synthetic hand image data set for the "Pick Up" posture. It is sampled from 15376 viewpoints on a 3D view sphere (the viewpoint interval for each DOF is 12°). Instead of focusing on feature extraction, we aim to improve the retrieval accuracy given a commonly used feature, Hu moments.

First, we test the algorithms on another, denser synthetic data set whose pitch and yaw camera viewpoints are sampled at 9° intervals. Figure 6 shows hand images randomly picked from the test data set. The test data set contains 861 images for each posture. Hu moment features, which are invariant to in-plane rotation (corresponding to the roll parameter of the camera), are computed for each image. We compare the performance of H-ISOSOM, GHSOM, SOM, and Nearest Neighbor (NN) for pose estimation with a single posture. The recall-precision graph for the "pick" posture is shown in Figure 6. The correct retrieval rate chart for the same posture with the same training results is shown in Figure 7. All 861 test samples are tested and the percentage of correct retrievals is calculated. Both figures show that H-ISOSOM performs better than GHSOM, SOM, and NN.
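The viewpoint counts quoted above (15376 and 861) are consistent with uniform angular sampling that includes both endpoints of each range. The assignment of ranges in the sketch below (two angles over 360° and one over 180° for the training set, 360° and 180° for the test pitch and yaw) is an assumption chosen because it reproduces the counts; the text does not state the ranges.

```python
def samples_per_axis(range_deg: int, step_deg: int) -> int:
    """Grid points on one angular axis, counting both endpoints."""
    return range_deg // step_deg + 1

# Training set: three rotation DOF at 12-degree intervals (assumed ranges).
train_views = (
    samples_per_axis(360, 12) * samples_per_axis(360, 12) * samples_per_axis(180, 12)
)
# Test set: pitch and yaw at 9-degree intervals (assumed ranges).
test_views = samples_per_axis(360, 9) * samples_per_axis(180, 9)

print(train_views, test_views)  # 15376 861
```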

Figure 6. The Recall-Precision Graph

Figure 7. The Correct Retrieval Chart


Haiying Guan and Matthew Turk, "The Hierarchical Isometric Self-Organizing Map for Manifold Representation," IEEE Workshop on Component Analysis Methods for Classification, Clustering, Modeling, and Estimation Problems in Computer Vision (in conjunction with CVPR07), Minneapolis, Minnesota, June 18-23, 2007.

Haiying Guan, Feris S. Rogerio, and Matthew Turk, "The Isometric Self-Organizing Map for 3D Hand Pose Estimation," The 7th International Conference on Automatic Face and Gesture Recognition, pages 263-268, Southampton, UK, Apr. 10-12, 2006.

Haiying Guan and Matthew Turk, "3D Hand Pose Reconstruction with ISOSOM," The First International Symposium on Advances in Visual Computing, pages 630-635, Lake Tahoe, NV, USA, Dec. 5-7, 2005.

Haiying Guan and Matthew Turk, "3D Hand Pose Reconstruction with ISOSOM," Technical Report 2005-15, Department of Computer Science, University of California, Santa Barbara.