Overview
We propose the Isometric Self-Organizing Map (ISOSOM) and Hierarchical ISOSOM (HISOSOM) algorithms for manifold learning and representation, and apply them to the 3D hand pose estimation problem.
First, we describe an appearance-based 3D hand posture estimation approach. Using a realistic 3D model of the hand, we generate a large synthetic image data set, with ground-truth hand configurations and camera poses, by the analysis-by-synthesis method. The data set can be used for exemplar-based hand posture and pose retrieval. Given a query hand image, the best match found by comparing the real hand image with the synthetic images may not be the correct solution: because of occlusion, a hand image can in most cases be generated by several different hand configurations, and when a feature descriptor is used to represent the hand image, different hand images may map to the same feature. Thus, to resolve this ambiguity, the algorithm outputs a set of possible candidates instead of a single solution. Specifically, it determines a ranked set of possible hand posture candidates, along with the relative camera rotation parameters, based on an image retrieval algorithm. Second, because of the large scale of the synthetic database, exhaustive search over the whole data set is inefficient. To improve retrieval efficiency, we model the hand manifold, a complex, extremely nonlinear, large-scale, high-dimensional manifold composed of three components (hand image features, hand postures, and hand poses), with a concise, organized representation learned by the ISOSOM and HISOSOM algorithms.
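The exemplar-based retrieval step described above amounts to a top-K nearest-neighbor search in feature space. The following is a minimal sketch of that idea; the feature vectors and the Euclidean metric here are placeholders for illustration, not the actual descriptors or matching used in our system:

```python
import numpy as np

def retrieve_top_k(query, database, k=5):
    """Return indices of the k database feature vectors closest to the query."""
    dists = np.linalg.norm(database - query, axis=1)  # distance to each exemplar
    return np.argsort(dists)[:k]                      # nearest exemplars first

# Toy example: 4 synthetic exemplars in a 3-D feature space.
db = np.array([[0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 2.0, 0.0],
               [5.0, 5.0, 5.0]])
print(retrieve_top_k(np.array([0.9, 0.1, 0.0]), db, k=2))  # -> [1 0]
```

The ground-truth labels attached to the retrieved exemplars then serve as the ranked pose candidates.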
Third, to represent the hand image robustly, we use a depth edge detector with a multi-flash camera to obtain the hand shape against complex backgrounds. Finally, in the experiments we compare our approach with current well-known approaches on both synthetic and real hand images. The results show that our approach performs best.
Details
1.1 Motivation
Numerous researchers and inventors dedicated to human-computer interface design are seeking good interaction candidates that directly and effortlessly transfer human ideas and intentions to the computer. Compared with other high-DOF 3D input devices, hand gestures provide abundant degrees of freedom (DOF) and are natural and intuitive for communication. Hands are capable of performing countless actions, such as pointing, grabbing, throwing, reaching, and pushing. Hands also play a very important role in interactions with objects and people. Hand gestures are used to express ideas, communicate with others, and show people's intentions and feelings. For example, we grasp objects we want to explore; we wave at familiar faces; we express happiness and the satisfaction of achievement with "OK" or "Victory" gestures; we use the "Call me" gesture (bending the three middle fingers and extending the thumb and little finger) to signal keeping in touch. Gesture is a deeply rooted part of human communication. Such advantages make gesture a most remarkable candidate for interactive interfaces.

Hand gesture is a potential input modality for the next generation of 3D (or even higher-dimensional) free-form input interfaces. Visual sensing provides a passive and non-intrusive way for computers to acquire gesture input information. The visual modality is natural, untethered, and inexpensive. To make gesture interfaces widely applicable, we must explore how a vision system should be designed to model, detect, analyze, and recognize gestures. Visually modeling the hand manifold and estimating hand posture and pose is a vital task for gesture interpretation. Exemplar-based posture and pose estimation, which goes beyond the scope of a multi-class classification task, remains an unsolved problem. The core task is to learn the highly nonlinear mapping between the hand configuration space and the image feature space, which is very difficult and challenging.
1.2 Introduction
In this research, we take an image retrieval approach based on the analysis-by-synthesis method. The system diagram is shown in Figure 1. It utilizes a realistic 3D hand model and renders it from different viewpoints to generate synthetic hand images. A set of possible candidates is found by comparing the real hand image with the synthetic images. The ground-truth labels of the retrieved matches are used as hand pose candidates.
Figure 1. The system diagram for the appearance-based hand posture recognition system
The hand is modeled as a 3D articulated object with 21 DOF of joint angles (the hand configuration) and 6 DOF of global rotation and translation. A hand pose is defined by a hand configuration augmented by the 3 DOF global rotation parameters. The main problem of analysis by synthesis is the complexity of such a high-dimensional space: the size of the synthetic database grows exponentially with respect to the parameters' accuracy. Even though hand articulation is highly constrained, the complexity remains intractable for both database processing and image retrieval.
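As a minimal sketch of this parameterization (the layout and helper name are illustrative, not taken from our implementation), a hand pose is the 21 joint angles concatenated with the 3 global rotation angles:

```python
import numpy as np

N_JOINT_DOF = 21   # joint angles: the hand configuration
N_GLOBAL_ROT = 3   # global rotation (roll, pitch, yaw)

def make_hand_pose(joint_angles, global_rotation):
    """A 'hand pose' as defined above: configuration + 3 global rotation DOF."""
    joint_angles = np.asarray(joint_angles, dtype=float)
    global_rotation = np.asarray(global_rotation, dtype=float)
    assert joint_angles.shape == (N_JOINT_DOF,)
    assert global_rotation.shape == (N_GLOBAL_ROT,)
    return np.concatenate([joint_angles, global_rotation])  # 24-D pose vector

pose = make_hand_pose(np.zeros(N_JOINT_DOF), [0.0, np.pi / 2, 0.0])
print(pose.shape)  # -> (24,)

# Exponential growth of the synthetic database: even a coarse 5 samples
# per joint DOF would already require 5**21 configurations.
print(5 ** N_JOINT_DOF)
```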
We formulate hand pose reconstruction as a nonlinear manifold representation problem. Next, we present the ISOSOM algorithm and its hierarchical version (HISOSOM), and learn the complex, high-dimensional manifold with the proposed algorithms.
1.3 ISOSOM and HISOSOM
In this section, we propose an Isometric Self-Organizing Map (ISOSOM) method for nonlinear manifold representation, which integrates the Self-Organizing Map model, the ISOMAP nonlinear dimensionality reduction algorithm, and a local linear interpolation algorithm, representing large-scale, high-dimensional data with a small set of representatives in a low-dimensional organized structure. Then, we construct a hierarchical version of ISOSOM to resolve the complexity and accuracy problems. HISOSOM learns an organized structure of a non-convex, large-scale manifold and represents it by a set of hierarchically organized maps. The hierarchical structure follows a coarse-to-fine strategy. Guided by the coarse global structure, it "unfolds" the manifold at the coarse level and decomposes the sample data into small patches, then iteratively learns the nonlinearity of each patch at finer levels. The algorithm simultaneously reorganizes and clusters the data samples in a low-dimensional space to obtain a concise representation.
1.3.1 ISOSOM
Based on SOM, ISOMAP, and a local linear interpolation algorithm, we propose an Isometric Self-Organizing Mapping algorithm (ISOSOM) for manifold learning. Figure 2 gives an intuitive depiction of the ISOSOM map. It utilizes the topological graph and geodesic distances of the samples' manifold to define the metric relationships between samples. These geodesic distances enable the SOM to better follow the topology of the underlying data set and to preserve the spatial relationships of the samples in the low-dimensional space. Meanwhile, it self-organizes the samples in the low-dimensional space. Finally, we utilize local linear interpolation techniques to take both global and local relationships into account and to make the whole mapping pseudo-invertible.
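The geodesic metric that ISOSOM borrows from ISOMAP can be sketched as follows: build a k-nearest-neighbor graph over the samples and take shortest-path distances through it. This is a toy illustration only; the real algorithm feeds these distances into SOM training rather than just computing them:

```python
import numpy as np

def geodesic_distances(points, k=2):
    """ISOMAP-style geodesic distances: k-NN graph + Floyd-Warshall
    shortest paths. This is the metric ISOSOM substitutes for plain
    Euclidean distance when organizing the map."""
    n = len(points)
    euclid = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(n):
        for j in np.argsort(euclid[i])[1:k + 1]:  # edges to the k nearest neighbors
            graph[i, j] = graph[j, i] = euclid[i, j]
    for m in range(n):                            # Floyd-Warshall relaxation
        graph = np.minimum(graph, graph[:, m:m + 1] + graph[m:m + 1, :])
    return graph

# Points along a curved arc: the geodesic distance between the endpoints
# follows the arc (~pi), not the straight-line chord (2).
theta = np.linspace(0, np.pi, 8)
arc = np.stack([np.cos(theta), np.sin(theta)], axis=1)
d = geodesic_distances(arc, k=2)
print(d[0, -1] > np.linalg.norm(arc[0] - arc[-1]))  # -> True
```

This is why the map can "unfold" a curved manifold instead of cutting across it the way a Euclidean metric would.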
Figure 2. The ISOSOM Algorithm
1.3.2 HISOSOM
Due to the computational complexity of ISOMAP (in practice, using the MATLAB package provided by the ISOMAP authors, the algorithm can handle only around 20,000 training samples before running out of the 2 GB of memory on a personal computer), ISOSOM cannot handle large-scale data sets (for example, more than one million training samples). To solve this problem, we present a hierarchical version of ISOSOM. An intuitive depiction of the Hierarchical ISOSOM is illustrated in Figure 3. HISOSOM aims to defuse the computational complexity problem for large-scale data sets, to improve accuracy, and to construct a hierarchical structure for fast retrieval and indexing.
Figure 3. The HISOSOM Algorithm
To make training on large-scale data sets tractable, HISOSOM follows a coarse-to-fine strategy. Let N_max be the upper limit of the sample size that an algorithm of O(N^3) complexity can handle in practice. HISOSOM first randomly samples the whole data set to obtain a subset of m samples, where m <= N_max, and uses it to train the first layer of the HISOSOM map with the ISOSOM algorithm. Then, for each representative on the map, it collects the corresponding training samples from the original data set and subsamples them to train the next layer. The algorithm iteratively trains the sub-maps with ISOSOM until a stopping criterion is reached.
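The coarse-to-fine recursion above can be sketched as follows. This is a toy illustration: `train_layer` is a stand-in for actual ISOSOM training (here it just picks a few representatives), and the constants are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
N_MAX = 200    # assumed practical limit for the O(N^3) inner step
N_NODES = 4    # map nodes per layer (placeholder for the real map size)

def train_layer(samples):
    """Placeholder for one ISOSOM layer: N_NODES representatives."""
    return samples[rng.choice(len(samples), N_NODES, replace=False)]

def train_hisosom(data, depth=0, max_depth=3):
    """Coarse-to-fine recursion: subsample if the set exceeds N_MAX, train a
    layer, then recurse on the samples assigned to each representative."""
    subset = data if len(data) <= N_MAX else data[rng.choice(len(data), N_MAX, replace=False)]
    nodes = train_layer(subset)
    if depth + 1 >= max_depth or len(data) <= N_MAX:   # stopping criterion
        return {"nodes": nodes, "children": []}
    # Collect each representative's training samples from the original data.
    owner = np.argmin(np.linalg.norm(data[:, None] - nodes[None], axis=-1), axis=1)
    children = [train_hisosom(data[owner == i], depth + 1, max_depth)
                for i in range(N_NODES) if (owner == i).sum() > N_NODES]
    return {"nodes": nodes, "children": children}

tree = train_hisosom(rng.normal(size=(5000, 3)))
print(len(tree["children"]) > 0)  # coarse layer spawned finer sub-maps
```

Each layer only ever trains on at most N_MAX samples, so the cubic-cost inner step stays tractable regardless of the total data set size.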
1.4 Experimental Results
1.4.1 ISOSOM Experimental Results
In our experiments, we use depth edges and a variant of the shape context descriptor to represent the given image. Each hand shape is represented by a 256-component vector. We define a correct match as follows: the hand posture of the real image is the same as one of the hand postures in the top K retrieved images (allowing small changes in the hand configuration), and, in addition, the three global hand rotation parameters (roll, pitch, yaw) of the test image are within a 40° range of those of the retrieved hand image. Using this criterion, we compare the hand pose estimation performance of Nearest Neighbor (NN), SOM, and ISOSOM in Table 1. NN is implemented by simply comparing the query image with each image in the synthetic data set and retrieving the top K nearest matches. The results show that the correct match rate of ISOSOM increases by around 16.5% compared to NN, and by around 5.6% compared to SOM. K is determined by the application requirements. To achieve more than an 85% success rate, NN requires retrieving more than 280 images, which yields more than 280 hand pose possibilities; ISOSOM requires less than half of that, and also has a higher precision rate. The ISOSOM with 1512 neurons needs 0.036 seconds to retrieve the top 400 images, more than 12 times faster than NN, which needs 0.453 seconds.
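The correct-match criterion above can be written down directly. This is a sketch with illustrative data; for simplicity it compares angles by absolute difference and ignores wrap-around at ±180°:

```python
import numpy as np

def is_correct_match(query_posture, query_rot, retrieved, angle_tol=40.0):
    """A retrieval counts as correct if some top-K result shares the query's
    posture label and each of its three global rotation angles
    (roll, pitch, yaw) is within angle_tol degrees of the query's."""
    for posture, rot in retrieved:
        if posture == query_posture and np.all(np.abs(np.asarray(rot) - query_rot) <= angle_tol):
            return True
    return False

# Illustrative top-K results: (posture label, (roll, pitch, yaw) in degrees).
top_k = [("fist", (10.0, 0.0, 0.0)), ("pick_up", (30.0, 20.0, -10.0))]
print(is_correct_match("pick_up", np.array([0.0, 0.0, 0.0]), top_k))  # -> True
```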
Number     NN        SOM       ISOSOM
Top 40     44.25%    62.39%    65.93%
Top 80     55.75%    72.12%    77.43%
Top 120    64.60%    78.76%    85.40%
Top 160    70.80%    80.09%    88.50%
Top 200    76.99%    81.86%    91.59%
Top 240    81.42%    85.84%    92.48%
Top 280    82.30%    87.17%    94.69%

Table 1. The comparison of retrieval rates
The ISOSOM retrieval results are shown in Figure 4. The first image in the first row is the query image. The second image in the first row is the model generated from its pseudo-ground truth. The remaining 20 images are the retrieval results from the ISOSOM representatives.
Figure 4. The ISOSOM Hand Posture Retrieval Results
Figure 5 shows the comparison of the match rates with respect to the number of retrieved images. The results indicate that the ISOSOM algorithm is overall better than the traditional image retrieval algorithm and SOM. (SOM is slightly better than ISOSOM around K = 20, but the retrieval rate around that point is too low for K = 20 to be a good choice in most applications.)
Figure 5. The retrieval rate curve
1.4.2 HISOSOM Experimental Results
In the experiments, we generated a synthesized hand image data set with the "Pick Up" posture. It is sampled from 15,376 viewpoints on a 3D view sphere (the viewpoint interval for each DOF is 12°). In this experiment, instead of focusing on feature extraction, we aim at improving the retrieval accuracy given a commonly used feature, Hu moments. We first test the algorithms with another, denser synthesized data set, whose pitch and yaw camera viewpoints are sampled at 9° intervals; the testing data set contains 861 images for each posture. Hu moment features are computed, which are in-plane rotation invariant (corresponding to the roll parameter of the camera). We compare the performance of HISOSOM, GHSOM, SOM, and Nearest Neighbor (NN) for pose estimation with a single posture. The Recall-Precision graph of the "Pick Up" posture is shown in Figure 6, and the correct retrieval rate chart for the same posture with the same training results is shown in Figure 7. All 861 test samples are tested and the percentage of correct retrievals is calculated. Both figures show that HISOSOM performs better than GHSOM, SOM, and NN.
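For reference, the first two Hu moment invariants can be computed from normalized central moments. The sketch below verifies their in-plane rotation invariance on a toy binary shape; it is an illustration of the feature's defining property, not our feature extraction code:

```python
import numpy as np

def hu_first_two(img):
    """First two Hu moment invariants of a grayscale image, computed from
    normalized central moments (invariant to in-plane rotation, i.e. the
    camera roll parameter)."""
    y, x = np.mgrid[:img.shape[0], :img.shape[1]].astype(float)
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00
    def eta(p, q):  # normalized central moment of order (p, q)
        mu = ((x - xc) ** p * (y - yc) ** q * img).sum()
        return mu / m00 ** (1 + (p + q) / 2)
    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2

img = np.zeros((32, 32))
img[8:20, 10:14] = 1.0                   # simple binary "shape"
a = hu_first_two(img)
b = hu_first_two(np.rot90(img))          # rotate the shape by 90 degrees
print(np.allclose(a, b))                 # -> True: invariant under rotation
```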
Figure 6. The Recall-Precision Graph
Figure 7. The Correct Retrieval Chart
Publications
Haiying Guan and Matthew Turk, "The Hierarchical Isometric Self-Organizing Map for Manifold Representation," IEEE Workshop on Component Analysis Methods for Classification, Clustering, Modeling, and Estimation Problems in Computer Vision (in conjunction with CVPR'07), Minneapolis, Minnesota, June 18-23, 2007. [PDF]
Haiying Guan, Rogerio S. Feris, and Matthew Turk, "The Isometric Self-Organizing Map for 3D Hand Pose Estimation," The 7th International Conference on Automatic Face and Gesture Recognition, pages 263-268, Southampton, UK, Apr. 10-12, 2006. [PDF][Poster]
Haiying Guan and Matthew Turk, "3D Hand Pose Reconstruction with ISOSOM," The First International Symposium on Advances in Visual Computing, pages 630-635, Lake Tahoe, NV, USA, Dec. 5-7, 2005. [PDF]
Haiying Guan and Matthew Turk, "3D Hand Pose Reconstruction with ISOSOM," Technical Report 2005-15, Department of Computer Science, University of California, Santa Barbara. [PDF]
