3D Hand Pose Retrieval Using ISOSOM
Haiying Guan, Feris S. Rogerio, and Matthew Turk
Overview

We propose the Isometric Self-Organizing Map (ISOSOM) and Hierarchical ISOSOM (H-ISOSOM) algorithms for manifold learning and representation, and apply them to the 3D hand pose estimation problem. First, we describe an
appearance-based 3D hand posture estimation approach. With a realistic 3D model
of the hand, we use the analysis-by-synthesis method to generate a large
synthetic image data set with ground truth for the hand configurations and
camera poses. The data set can be used for exemplar-based hand posture and pose
retrieval. Second, when a query hand image is compared with the synthetic
images, the best match may not be the correct solution. Because of occlusion,
a hand image can, in most cases, be generated by several different hand
configurations. In addition, when a feature is used to represent the hand
image, different hand images may have the same feature. To handle this
ambiguity, the algorithm outputs a set of possible candidates instead of a
single solution: it determines a ranked set of possible hand posture
candidates, along with the relative camera rotation parameters, based on an
image retrieval algorithm. Third, in order to represent the hand image
robustly, we use a depth edge detector with a multi-flash camera to obtain the
hand shape against complex backgrounds. Finally, we compare our approach with
current well-known approaches on synthetic and real hand images. The results
show that our approach outperforms the approaches it is compared against.

Details

1.1 Motivation

Numerous researchers and
inventors dedicated to human-computer interface design are seeking
good interaction candidates that transfer human ideas and intentions to the
computer directly and effortlessly. Compared with other 3D input devices with
many degrees of freedom (DOF), hand gestures provide abundant DOF and are natural
and intuitive for communication. Hands are capable of performing countless
actions, such as pointing, grabbing, throwing, reaching, and pushing. Hands also
play a very important role in interactions with objects and people. Hand
gestures are used to express ideas, communicate with others, and show people's
intentions and feelings. For example, we grasp objects we want to explore; we
wave at familiar faces; we express happiness and the satisfaction of
achievement with "OK" or "Victory" gestures; we use the "Call me" gesture
(bending the three middle fingers and extending the thumb and little finger) to
keep in touch. Gesture is a deeply rooted part of human communication. Such
advantages make gesture a remarkable candidate for interactive interfaces. Hand
gesture is a potential input modality for next-generation free-form input
interfaces in 3D (or even higher dimensions).

1.2 Introduction

In this research, we take an image retrieval approach based on the analysis-by-synthesis method. The system diagram is shown in Figure 1. The system uses a realistic 3D hand model and renders it from different viewpoints to generate synthetic hand images. A set of possible candidates is found by comparing the real hand image with the synthetic images. The ground truth labels of the retrieved matches are used as hand pose candidates.
Figure 1. The system diagram for the appearance-based hand posture recognition system

The hand is modeled as a 3D articulated object with 21 DOF for the joint angles (hand configuration) and 6 DOF for global rotation and translation. A hand pose is defined by a hand configuration augmented by the 3 DOF global rotation parameters. The main problem of analysis by synthesis is the complexity of such a high-dimensional space. The size of the synthetic database grows exponentially with respect to the required accuracy of the parameters. Even though the articulation of the hand is highly constrained, the complexity is still intractable for both database processing and image retrieval. We therefore formulate hand pose reconstruction as a nonlinear manifold representation problem. Next, we present the ISOSOM algorithm and its hierarchical version (H-ISOSOM), and use them to learn the complex, high-dimensional manifold.

1.3 ISOSOM and H-ISOSOM

In this section, we propose an Isometric Self-Organizing Map (ISOSOM) method for nonlinear manifold representation. It integrates a Self-Organizing Map (SOM) model, the ISOMAP nonlinear dimension reduction algorithm, and a local linear interpolation algorithm, and represents large-scale, high-dimensional data with a small set of representatives in a low-dimensional organized structure. We then construct a hierarchical version of ISOSOM to address the complexity and accuracy problems. H-ISOSOM learns an organized structure of a non-convex, large-scale manifold and represents it by a set of hierarchically organized maps. The hierarchical structure follows a coarse-to-fine strategy. Following the coarse global structure, it "unfolds" the manifold at the coarse level and decomposes the sample data into small patches, then iteratively learns the nonlinearity of each patch at finer levels. The algorithm simultaneously reorganizes and clusters the data samples in a low-dimensional space to obtain a concise representation.
1.3.1 ISOSOM

Based on SOM, ISOMAP, and a local linear interpolation algorithm, we propose an Isometric Self-Organizing Map algorithm (ISOSOM) for manifold learning. Figure 2 gives an intuitive depiction of the ISOSOM map. ISOSOM uses the topological graph and the geometric distance on the samples' manifold to define the metric relationships between samples. This geometric distance lets the SOM follow the topology of the underlying data set more closely and preserves the spatial relationships of the samples in the low-dimensional space. At the same time, ISOSOM self-organizes the samples in the low-dimensional space. Finally, we use local linear interpolation to take both global and local relationships into account and to make the whole mapping pseudo-invertible.
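
The distance along the manifold that ISOSOM inherits from ISOMAP can be illustrated with a small sketch. ISOMAP approximates such distances by shortest paths on a nearest-neighbor graph. The code below is a toy illustration under that reading, not the authors' implementation. The function names `knn_graph` and `geodesic_distances`, the neighborhood size `k=2`, and the half-circle test data are all assumptions made for this example.

```python
import heapq
import math


def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbor graph with Euclidean edge weights."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph undirected
    return graph


def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from `source` along the graph edges."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Toy manifold (an assumption for illustration): 21 points on a unit half circle.
arc = [(math.cos(math.pi * t / 20), math.sin(math.pi * t / 20)) for t in range(21)]
graph = knn_graph(arc, k=2)
geo = geodesic_distances(graph, 0)
euclid_end_to_end = math.dist(arc[0], arc[-1])
```

On this toy curve the straight-line distance between the two end points is the diameter of the circle, while the graph distance `geo[20]` follows the arc and is therefore longer. A map trained with the graph distance treats the two end points as far apart, which is the behavior the text describes as following the topology of the data.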
Figure 2. The ISOSOM Algorithm

1.3.2 H-ISOSOM

Due to the computational complexity of ISOMAP, ISOSOM cannot handle large-scale data sets (for example, more than one million training samples). In practice, using the MATLAB package provided by the authors of ISOMAP, the algorithm can handle only around 20,000 training samples before running out of the 2 GB of memory of a personal computer. To solve this problem, we present a hierarchical version of ISOSOM, illustrated in Figure 3. H-ISOSOM aims to reduce the computational complexity for large-scale data sets, to improve the accuracy, and to construct a hierarchical structure for fast retrieval or indexing.
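
A coarse-to-fine hierarchy of this kind can be sketched in a few lines. The sketch is only an illustration of the control flow (train on a bounded subsample, split the data by nearest representative, recurse into each patch). It is not the authors' algorithm: `train_layer` is a hypothetical placeholder that picks random samples instead of running ISOSOM, and the parameters `m`, `n_reps`, and `min_size`, as well as the stopping rule, are assumptions.

```python
import math
import random


def train_layer(samples, n_reps, rng):
    """Placeholder for ISOSOM training: returns n_reps samples as representatives."""
    return rng.sample(samples, min(n_reps, len(samples)))


def build_hierarchy(samples, m, n_reps, min_size, rng):
    """Train one layer on at most m samples, split all samples by nearest
    representative, and recurse into each patch that is still large."""
    subset = samples if len(samples) <= m else rng.sample(samples, m)
    reps = train_layer(subset, n_reps, rng)
    patches = [[] for _ in reps]
    for s in samples:
        nearest = min(range(len(reps)), key=lambda i: math.dist(s, reps[i]))
        patches[nearest].append(s)
    children = [
        # Recurse only if the patch is large enough and strictly smaller
        # than its parent, which guarantees termination.
        build_hierarchy(p, m, n_reps, min_size, rng)
        if min_size < len(p) < len(samples)
        else None
        for p in patches
    ]
    return {
        "reps": reps,
        "patch_sizes": [len(p) for p in patches],
        "children": children,
    }


def depth(node):
    """Number of layers in the hierarchy rooted at node."""
    if node is None:
        return 0
    return 1 + max(depth(c) for c in node["children"])


rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(500)]  # toy 2-D samples
tree = build_hierarchy(data, m=100, n_reps=4, min_size=20, rng=rng)
```

Each layer only ever trains on `m` samples, so the expensive step stays below the practical size limit no matter how large the full data set is. A query can then descend the tree by choosing the nearest representative at each layer, which is one way to read the "fast retrieval or indexing" goal.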
Figure 3. The H-ISOSOM Algorithm

To make the training of large-scale data sets tractable, H-ISOSOM follows a coarse-to-fine strategy. Let Nmax be the upper limit of the sample size that an algorithm of O(N^3) complexity can handle in practice. H-ISOSOM first randomly samples the whole data set to obtain a subset of m samples, where m < Nmax, and trains the first layer of the H-ISOSOM map on it using the ISOSOM algorithm. After that, for each representative on the map, it collects the corresponding training samples from the original data set and sub-samples them to train the next layer. The algorithm iteratively trains the sub-maps with ISOSOM until a stopping criterion is reached.

1.4 Experimental Results

1.4.1 ISOSOM Experimental Results

In our experiments, we use depth edges and a variant of the shape context descriptor to represent the given image. Each hand shape is represented by a 256-component vector. We count a retrieval as a correct match if the hand posture of the real image is the same as one of the hand postures in the top K retrieved images (with small changes in the hand configuration) and, in addition, the three global hand rotation parameters (roll, pitch, yaw) of the test image are within a 40° range of those of that retrieved hand image. Using the same criterion, we compare the hand pose estimation performance of Nearest Neighbor (NN), SOM, and ISOSOM in Table 1. NN is implemented by simply comparing the query image with each image in the synthetic data set and retrieving the top K nearest matches. The results show that the correct match rate of ISOSOM is around 16.5% higher than that of NN and around 5.6% higher than that of SOM. K is decided by the application requirements. To achieve a success rate of more than 85%, NN requires retrieving more than 280 images, which provide more than 280 hand pose possibilities. ISOSOM requires fewer than half as many, and also has a higher precision rate.
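
The NN baseline and the correct-match criterion described above can be sketched as follows. This is a minimal illustration, not the authors' code. The record layout (`feature`, `posture`, `rotation`), the Euclidean feature distance, and the reading of "within a 40° range" as "each of roll, pitch, and yaw differs by at most 40°, with wrap-around" are assumptions, and the 2-component features stand in for the 256-component vectors.

```python
import heapq
import math


def top_k_matches(query, database, k):
    """Return the k database entries whose features are closest to the query."""
    return heapq.nsmallest(
        k, database, key=lambda entry: math.dist(query, entry["feature"])
    )


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def is_correct_match(truth, candidates, max_angle=40.0):
    """True if some candidate has the same posture as the ground truth and
    all three rotation angles within max_angle of it (assumed criterion)."""
    return any(
        c["posture"] == truth["posture"]
        and all(
            angle_diff(x, y) <= max_angle
            for x, y in zip(truth["rotation"], c["rotation"])
        )
        for c in candidates
    )


# Toy synthetic database (illustrative values only).
database = [
    {"feature": (0.0, 0.0), "posture": "fist", "rotation": (0, 0, 0)},
    {"feature": (1.0, 0.0), "posture": "point", "rotation": (10, 350, 20)},
    {"feature": (5.0, 5.0), "posture": "point", "rotation": (180, 0, 0)},
]
truth = {"posture": "point", "rotation": (0, 0, 0)}

hit = is_correct_match(truth, top_k_matches((0.9, 0.1), database, k=1))
miss = is_correct_match(truth, top_k_matches((0.1, 0.0), database, k=1))
recovered = is_correct_match(truth, top_k_matches((0.1, 0.0), database, k=2))
```

The last two lines show why K matters: a query whose nearest entry has the wrong posture fails at K=1 but succeeds at K=2, at the cost of returning more candidates to the later stages of the application.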
ISOSOM with 1512 neurons needs 0.036 seconds to retrieve the top 400 images. This is more than 12 times faster than NN, which needs 0.453 seconds.
Table 1. The comparison of retrieval rates

The ISOSOM retrieval results are shown in Figure 4. The first image in the first row is the query image. The second image in the first row is the model generated from its pseudo-ground truth. The remaining 20 images are the retrieval results from the ISOSOM representatives.
Figure 4. The ISOSOM Hand Posture Retrieval Results

Figure 5 shows the match rates as a function of the number of retrieved images.
The results indicate that the ISOSOM algorithm is better overall than both the
traditional image retrieval algorithm and SOM. SOM is slightly better than
ISOSOM around K=20, but the retrieval rate around that point is too low for it
to be a good choice of K in most applications.
Figure 5. The retrieval rate curve

1.4.2 H-ISOSOM Experimental Results

In the experiments, we generated a synthetic hand image data set for the
"Pick Up" posture. It is sampled from 15376 viewpoints on a 3D view sphere (the
viewpoint interval for each DOF is 12°). Instead of focusing on feature
extraction, we aim to improve the retrieval accuracy given a commonly used
feature, Hu moments.
Figure 6. The Recall-Precision Graph
Figure 7. The Correct Retrieval Chart

Publications

Haiying Guan and Matthew Turk, "The Hierarchical Isometric Self-Organizing Map for Manifold Representation," IEEE Workshop on Component Analysis Methods for Classification, Clustering, Modeling, and Estimation Problems in Computer Vision (in conjunction with CVPR07), Minneapolis, Minnesota, June 18-23, 2007.

Haiying Guan, Feris S. Rogerio, and Matthew Turk, "The Isometric Self-Organizing Map for 3D Hand Pose Estimation," The 7th International Conference on Automatic Face and Gesture Recognition, pp. 263-268, Southampton, UK, Apr. 10-12, 2006.

Haiying Guan and Matthew Turk, "3D Hand Pose Reconstruction with ISOSOM," The First International Symposium on Advances in Visual Computing, pp. 630-635, Lake Tahoe, NV, USA, Dec. 5-7, 2005.

Haiying Guan and Matthew Turk, "3D Hand Pose Reconstruction with ISOSOM," Technical Report 2005-15, Department of Computer Science, University of California, Santa Barbara.