Location-based augmented reality on cell phones
Here is an overview of the application on the preprocessing side:
Here is an overview of the application on the user side:
The pose estimation between the two images is the main step. It has to be very precise for the projection of the 3D objects to match the image content. We use the SURF algorithm to extract interest points and compute descriptors. However, because the images we are dealing with are noisy and the matching of SURF descriptors is error-prone, we get many wrong matches.
Here is the pose computation process:
- To remove the outliers, we first fit a homography between the two sets of points using RANSAC. We then remove the matches that deviate strongly from this homography: a match is kept only if the point lies less than 30 pixels from the image of its counterpart under the homography. This removes the most obvious wrong matches.
- We then use the 8-point algorithm inside a RANSAC loop to remove outliers that do not follow the epipolar criterion.
- Now that most of the outliers have been removed, we refine the pose by re-estimating it from all the inliers with the 8-point algorithm. That removes the remaining outliers.
- The next step is the non-linear optimization, which is performed directly on the pose parameters. We search for the translation and the rotation that minimize the sum of the geometric distances over all the points. To do that, we use the fact that we know the intrinsic parameters and that, in our case, the fundamental matrix only has 5 degrees of freedom: the only unknowns are the rotation and the translation, which have 3 and 2 degrees of freedom respectively. In the optimization, we use the angle/axis representation for the rotation, with the axis expressed in spherical coordinates (unit norm), and the translation is also expressed in spherical coordinates (unit norm as well).
Let us note that the norm of the translation cannot be retrieved by this optimization process.
- Then we can compute the norm of the translation, using the depths of the interest points from the database. To estimate this depth robustly, we compute it for each match and take the median of the resulting values. This is more robust to outliers than taking the average.
- The pose estimation is still not accurate enough, mainly because of the noise in the matches. The final step uses the 3D point coordinates in the database to minimize the reprojection distance. We project each 3D point of the database into the cellphone image and minimize the distance between this projected point and the point of the cellphone image it should match. This is exactly the criterion we want to minimize for augmented reality: the goal is that the projection of the 3D objects into the image matches the image content.
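Three of the steps above can be illustrated with a minimal, standard-library-only sketch: the 30-pixel homography test, the median as a robust estimate, and the reprojection error. The function names, the list-of-lists matrix layout and the assumption that the last row of `K` is `(0, 0, 1)` are illustrative choices, not taken from the original implementation. The homography `H` and the pose `(R, t)` are assumed to have been estimated already.

```python
import math
from statistics import median

def apply_homography(H, pt):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def homography_inliers(H, matches, max_dist=30.0):
    """Keep matches (p_db, p_phone) whose transfer error is below max_dist pixels."""
    kept = []
    for p_db, p_phone in matches:
        hx, hy = apply_homography(H, p_db)
        if math.hypot(hx - p_phone[0], hy - p_phone[1]) < max_dist:
            kept.append((p_db, p_phone))
    return kept

def robust_depth(depths):
    """Median of per-match depth estimates (less sensitive to outliers than the mean)."""
    return median(depths)

def project(K, R, t, X):
    """Pinhole projection of a 3D point X; assumes the last row of K is (0, 0, 1)."""
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    u = K[0][0] * Xc[0] + K[0][1] * Xc[1] + K[0][2] * Xc[2]
    v = K[1][1] * Xc[1] + K[1][2] * Xc[2]
    return (u / Xc[2], v / Xc[2])

def mean_reprojection_error(K, R, t, points_3d, points_2d):
    """Average pixel distance between projected 3D points and their 2D matches."""
    total = 0.0
    for X, (u, v) in zip(points_3d, points_2d):
        pu, pv = project(K, R, t, X)
        total += math.hypot(pu - u, pv - v)
    return total / len(points_3d)
```

In a full pipeline, `mean_reprojection_error` (or its sum-of-squares variant) would be the cost handed to a non-linear optimizer over the pose parameters. The optimizer itself is left out of this sketch.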
The main problem with this algorithm is that the 8-point algorithm is extremely sensitive to noise and gives very poor pose estimates. Even with very good matches, it often produces a totally wrong result. In this application, the 8-point algorithm is therefore only used to remove outliers. Instead of using the pose estimated by the 8-point algorithm to initialize the non-linear minimization, we use the cellphone sensors to estimate the rotation. The translation is initialized by solving the least-squares problem with fixed rotation (estimated by the cellphone sensors):
The global minimum can be obtained linearly thanks to the SVD algorithm.
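The formula itself is not reproduced here. As a hedged reconstruction, assuming the standard epipolar constraint and normalized homogeneous image coordinates $x_i$ (database image) and $x'_i$ (cellphone image), symbols that are introduced only for this illustration, the fixed-rotation problem can be written as:

```latex
% Epipolar constraint with known rotation R and unknown translation t:
%   x'_i^T [t]_x R x_i = 0   <=>   t^T ( (R x_i) \times x'_i ) = 0
\[
  \hat{t} \;=\; \arg\min_{\|t\| = 1} \; \sum_i \bigl( a_i^{\top} t \bigr)^2
          \;=\; \arg\min_{\|t\| = 1} \; \| A\, t \|^2 ,
  \qquad a_i = (R\, x_i) \times x'_i ,
  \qquad A = \begin{pmatrix} a_1^{\top} \\ \vdots \\ a_n^{\top} \end{pmatrix}.
\]
```

Under these assumptions the minimizer is the right singular vector of $A$ associated with its smallest singular value, which is consistent with the remark that the global minimum is obtained linearly with the SVD. The sign of $\hat{t}$ is not fixed by this criterion and has to be chosen separately, for example by requiring the points to lie in front of both cameras.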
Descriptor matching is the most critical step in terms of computation time. Since a naive nearest-neighbor search would take too long, we put all the descriptors inside a KD-tree. The nearest-neighbor search uses the Best Bin First technique, which returns an approximate nearest neighbor.
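As an illustration of the idea, and not the implementation used in this project, here is a minimal pure-Python sketch of a KD-tree with a Best-Bin-First style search. The names `build_kdtree`, `bbf_nearest` and the `max_checks` budget are hypothetical. Pending branches are kept in a priority queue ordered by their squared distance to the splitting plane, and the search stops once the budget of visited nodes is spent.

```python
import heapq
from itertools import count

def build_kdtree(points, indices=None, depth=0):
    """Recursively build a KD-tree; each node is (point_index, axis, left, right)."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])
    indices.sort(key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return (indices[mid], axis,
            build_kdtree(points, indices[:mid], depth + 1),
            build_kdtree(points, indices[mid + 1:], depth + 1))

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bbf_nearest(tree, points, query, max_checks=200):
    """Approximate nearest neighbour: visit at most max_checks nodes, always
    resuming from the pending branch whose splitting plane is closest to the query.
    Returns (index, squared_distance)."""
    best_idx, best_d = None, float("inf")
    tie = count()                      # tie-breaker so the heap never compares nodes
    heap = [(0.0, next(tie), tree)]
    checks = 0
    while heap and checks < max_checks:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_d:
            break                      # no pending branch can contain a closer point
        while node is not None and checks < max_checks:
            idx, axis, left, right = node
            checks += 1
            d = sq_dist(points[idx], query)
            if d < best_d:
                best_idx, best_d = idx, d
            diff = query[axis] - points[idx][axis]
            near, far = (left, right) if diff < 0 else (right, left)
            if far is not None:
                # diff**2 is a lower bound on the distance to any point in `far`
                heapq.heappush(heap, (diff * diff, next(tie), far))
            node = near
    return best_idx, best_d
```

With `max_checks` at least equal to the number of stored points the search is exact; smaller budgets trade accuracy for speed, which is the point of the approximate search for high-dimensional descriptors such as SURF.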
However, the database can contain many images, so searching the whole database still takes a lot of time. We therefore use the cellphone's sensors to restrict the search to a much smaller number of images. The Nokia sensors give us an estimate of the pose: by integrating the data from the accelerometer and the magnetometer, we can estimate the orientation of the cellphone. The 3D position, however, cannot be determined accurately enough because of the noise from the accelerometer. Assuming the user does not move much from one frame to the next, we can tell for each 3D point of the database whether it is likely to be seen by the cellphone. We then search only among the images that have most of their 3D points inside the viewed area. To avoid testing this for every point of every database image, we search only the images whose centers are inside the approximate field-of-view cone.
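The cone test can be sketched as follows. This is a minimal illustration under stated assumptions: each database image is summarized by one 3D center, the camera position and viewing direction come from the previous frame and the sensors, and the half-angle of the cone is a free parameter. The names `in_view_cone` and `candidate_images` and the default of 30 degrees are hypothetical.

```python
import math

def in_view_cone(center, cam_pos, view_dir, half_angle_deg):
    """True if `center` lies inside the cone with apex `cam_pos` and axis `view_dir`."""
    v = [c - p for c, p in zip(center, cam_pos)]
    norm_v = math.sqrt(sum(x * x for x in v))
    norm_d = math.sqrt(sum(x * x for x in view_dir))
    if norm_d == 0.0:
        raise ValueError("view_dir must be a non-zero vector")
    if norm_v == 0.0:
        return True  # the center coincides with the camera position
    cos_angle = sum(a * b for a, b in zip(v, view_dir)) / (norm_v * norm_d)
    return cos_angle >= math.cos(math.radians(half_angle_deg))

def candidate_images(image_centers, cam_pos, view_dir, half_angle_deg=30.0):
    """Names of the database images whose center falls inside the view cone."""
    return [name for name, center in image_centers.items()
            if in_view_cone(center, cam_pos, view_dir, half_angle_deg)]
```

Only the descriptors of the returned images would then be searched, which reduces the cost to one dot product per database image instead of one visibility test per 3D point.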
Here is how the search is restricted:
In this test, the 3D model is composed of three rectangles representing the PC screen, the desk and a poster on the wall. Because the pose of the database camera is known, the 3D points correspond perfectly to the content of the database image (left image). Once the pose between the database image and the cellphone image has been computed, the 3D objects are projected according to this pose estimate (middle and right images). The average error is about 4 pixels and is partly due to the imprecision of the 3D points computed by the stereo camera.
When the rotation between the cellphone and the database, the precision is even better. In this image, the reprojection accuracy is a bit more than 2 pixels.