Location-based augmented reality on cell phones



Rémi Paucher, Matthew Turk


This project performs augmented reality on cell phones in known environments by localizing the user within them. For example, the user may be in a museum: he starts the application and sees the scene in front of him on his cell phone, annotated with information about the paintings. There are two main steps in developing the program:

  • The first step is learning the environment. This is a preprocessing step run on a computer: pictures are taken at different locations with a known camera, then interest points are extracted, and descriptors computed at each of these points are stored in the database. The 3D position and the orientation of each camera must also be known and stored in the database. For this we use the Bumblebee 2 stereo camera from Point Grey Research.
  • The second step is the application used by the user. It localizes the user in the environment and computes the orientation of the cell phone. To do so, it extracts interest points in the camera image and looks for the most similar image in the database by finding matches. The relative pose between the two images is then computed by stereovision, from which we get the 3D position and the orientation of the cell phone. The 3D objects from the database are then projected and drawn in the image.

The database contains 3D objects and the following information for each picture:

  • 3D point of the image center in absolute coordinates
  • Camera orientation
  • Camera intrinsic parameters
  • Features
    • 2D position in image
    • 3D position in local coordinates
    • Descriptor

Here is an overview of the application on the preprocessing side:

Here is an overview of the application on the user side:

User-side application overview

Pose estimation

The pose estimation between the two images is the main step. It has to be very precise for the projected 3D objects to match the image content. We use the SURF algorithm to extract interest points and compute descriptors. However, since the images we are dealing with are noisy, and because of matching errors between SURF descriptors, we get many wrong matches.
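Descriptor matching of this kind is usually done by nearest-neighbor search with a ratio test to discard ambiguous matches. The sketch below is a brute-force stand-in (the ratio test and the 0.8 threshold are common practice, not taken from the source); as noted above, many wrong matches still survive and must be removed geometrically.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Brute-force nearest-neighbor matching with a ratio test.

    desc_a, desc_b: (N, D) arrays of feature descriptors.
    Returns index pairs (i, j): descriptor i in image A matched to j in B.
    A match is kept only if the best distance is clearly smaller than
    the second-best, which rejects ambiguous descriptors.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j, k = np.argsort(dists)[:2]          # best and second-best neighbor
        if dists[j] < ratio * dists[k]:
            matches.append((i, j))
    return matches
```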

Here is the pose computation process:

  • To remove the outliers, we first fit a homography between the two sets of points using RANSAC. We then remove the points that deviate strongly from the homography (each point must be less than 30 pixels from its image under the homography). This removes the most obvious wrong matches.
  • Homography-based outlier removal
  • We then use the 8-point algorithm inside a RANSAC loop to remove outliers that do not follow the epipolar criterion.
  • Now that most of the outliers are removed, we refine the pose estimate by running the 8-point algorithm on all the inliers. This removes the remaining outliers.
  • The next step is the non-linear optimization, which is performed directly on the pose parameters: we search for the translation and the rotation that minimize the sum of the geometric distances over all the points. To do that, we use the fact that the intrinsic parameters are known, so we can parametrize the essential matrix directly. In our case it has only 5 degrees of freedom: the only unknown parameters are the rotation and the translation, which have 3 and 2 degrees of freedom respectively. In the optimization process, we choose the angle/axis representation for the rotation, the axis being expressed in spherical coordinates (unit norm), and the translation is also expressed in spherical coordinates (unit norm).
    Note that the norm of the translation cannot be recovered by this optimization process.
  • Then we can compute the norm of the translation, using the depths of the interest points from the database. To estimate this scale robustly, we compute the depth for each match and take the median of these depths, which is more robust to outliers than the average.
  • The pose estimate is still not accurate enough, mainly because of noise in the matches. The final step uses the 3D point coordinates from the database to minimize the reprojection distance: each 3D point in the database is projected into the cellphone image, and we minimize the distance between this projected point and the image point it should match. This is in fact exactly the criterion we care about for augmented reality: the goal is that the projection of 3D objects into the image matches the image content.
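The homography-based filtering in the first step can be sketched as follows, assuming a homography H has already been fit by RANSAC (e.g. with OpenCV). The 30-pixel threshold is the one stated above; the function names are illustrative.

```python
import numpy as np

def homography_inliers(pts_a, pts_b, H, max_dist=30.0):
    """Keep matches whose transfer error under H is below max_dist pixels.

    pts_a, pts_b: (N, 2) arrays of matched pixel coordinates.
    H: 3x3 homography mapping image A to image B.
    Returns a boolean inlier mask.
    """
    ones = np.ones((len(pts_a), 1))
    proj = (H @ np.hstack([pts_a, ones]).T).T   # transfer in homogeneous coords
    proj = proj[:, :2] / proj[:, 2:3]           # back to pixel coordinates
    dist = np.linalg.norm(proj - pts_b, axis=1)
    return dist < max_dist
```

A scene with depth variation does not follow a single homography exactly, which is why the threshold is deliberately loose: the goal here is only to discard the grossest wrong matches before the epipolar-geometry steps.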

The main problem with this algorithm is that the 8-point algorithm is extremely sensitive to noise and gives very poor pose estimates; even with very good matches, it often produces a totally wrong result. In this application, the 8-point algorithm is therefore only used to remove outliers. Instead of using the pose estimated by the 8-point algorithm to initialize the non-linear minimization, we use the cellphone sensors to estimate the rotation. The translation is initialized by solving the least-squares problem with the rotation fixed (as estimated by the cellphone sensors):

Translation computation with rotation fixed
The global minimum can be obtained linearly thanks to the SVD algorithm.
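With R fixed, each match gives one linear constraint on the translation: from the epipolar constraint x2ᵀ[t]×R x1 = 0, we get t · ((R x1) × x2) = 0. Stacking these rows, the unit translation minimizing the residual is the smallest right singular vector of the constraint matrix, which is the linear global minimum via SVD mentioned above. This is a sketch under that standard formulation; the function name is illustrative.

```python
import numpy as np

def translation_from_rotation(x1, x2, R):
    """Estimate the translation direction given a fixed rotation R.

    x1, x2: (N, 3) calibrated (normalized) homogeneous image points.
    Each match contributes one row (R x1) x x2 of the constraint matrix A;
    the unit t minimizing ||A t|| is the smallest right singular vector.
    Note: only the direction of t is recoverable, not its norm.
    """
    A = np.cross((R @ x1.T).T, x2)   # one epipolar constraint row per match
    _, _, vt = np.linalg.svd(A)
    t = vt[-1]
    return t / np.linalg.norm(t)
```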

Database Search

This step is the most critical in terms of computation time. Since a naive nearest-neighbor search would take too long, we put all the descriptors into a KD-tree. The nearest-neighbor search is implemented with the Best Bin First technique, which returns an approximate nearest neighbor.
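Best Bin First explores KD-tree bins in order of their distance to the query (via a priority queue) and stops after a fixed number of node visits, trading exactness for speed. The following is a compact sketch of the idea, not the implementation used in this project.

```python
import heapq
import numpy as np

class KDNode:
    __slots__ = ("point", "index", "axis", "left", "right")
    def __init__(self, point, index, axis, left, right):
        self.point, self.index, self.axis = point, index, axis
        self.left, self.right = left, right

def build_kdtree(points, indices=None, depth=0):
    """Build a KD-tree by median split, cycling through the axes."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % points.shape[1]
    indices.sort(key=lambda i: points[i][axis])
    mid = len(indices) // 2
    i = indices[mid]
    return KDNode(points[i], i, axis,
                  build_kdtree(points, indices[:mid], depth + 1),
                  build_kdtree(points, indices[mid + 1:], depth + 1))

def bbf_nearest(root, query, max_checks=200):
    """Best Bin First: visit bins in order of distance to the query,
    stopping after max_checks node visits -> approximate nearest neighbor."""
    best_i, best_d = -1, np.inf
    heap = [(0.0, 0, root)]   # (bin distance, tiebreaker, node)
    counter, checks = 1, 0
    while heap and checks < max_checks:
        _, _, node = heapq.heappop(heap)
        while node is not None:
            d = np.linalg.norm(node.point - query)
            checks += 1
            if d < best_d:
                best_i, best_d = node.index, d
            diff = query[node.axis] - node.point[node.axis]
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            if far is not None:
                heapq.heappush(heap, (abs(diff), counter, far))
                counter += 1
            node = near              # descend the near side first
    return best_i, best_d
```

With `max_checks` large enough to visit every node, the search degenerates to an exact nearest-neighbor query; small values give the fast approximate behavior used at matching time.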

However, the database can contain many images, so searching the whole database still takes a lot of time. We therefore use the cellphone sensors to restrict the search to a much smaller number of images. The Nokia sensors provide an estimate of the pose: by integrating the data from the accelerometer and the magnetometer, we can estimate the cell phone orientation. The 3D position, however, cannot be determined accurately enough, because of the noise from the accelerometer. Assuming the user does not move much from one frame to the next, we can tell for each 3D point in the database whether it is likely to be visible to the cellphone. We then search only among the images that have most of their 3D points inside the viewed area. To avoid testing this for every point of every database image, we only search the images whose centers lie inside the approximate field-of-view cone.
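The cone test on image centers amounts to an angle check between the sensor-estimated viewing direction and the direction to each center. A minimal sketch, assuming the phone position from the previous frame and a cone half-angle parameter (the 40-degree default is an illustrative value, not from the source):

```python
import numpy as np

def images_in_view_cone(centers, phone_pos, view_dir, half_angle_deg=40.0):
    """Return indices of database images whose 3D centers lie inside the
    approximate field-of-view cone (apex at phone_pos, axis view_dir
    estimated from the orientation sensors)."""
    view_dir = view_dir / np.linalg.norm(view_dir)
    v = centers - phone_pos                       # directions to image centers
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    cos_angle = v @ view_dir                      # cosine of angle to cone axis
    return np.nonzero(cos_angle > np.cos(np.radians(half_angle_deg)))[0]
```

Only the descriptors of the returned images then need to be searched in the KD-tree, which is what makes the database search tractable on the phone.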

Restricted search
Only the images with the large red centers are searched, because they intersect the field-of-view cone

Here is how the search is restricted:

Restricted search & Equivalent camera


In this test, the 3D model is composed of three rectangles representing the PC screen, the desk, and a poster on the wall. Because the pose of the database camera is known, the 3D points correspond perfectly to the database image content (left image). Once the pose between the database image and the cellphone has been computed, the 3D objects are projected according to this pose estimate (middle and right images). The average error is about 4 pixels, and is partly due to the imprecision of the 3D points computed by the stereo camera.
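The pixel error quoted here is the mean reprojection distance: project the known 3D points with the estimated pose and intrinsics, and compare against where they appear in the cellphone image. A sketch of that measurement (the function name and argument layout are illustrative):

```python
import numpy as np

def reprojection_error(points_3d, observed_px, R, t, K):
    """Project 3D points with the estimated pose (R, t) and intrinsics K,
    and return the mean pixel distance to the observed image points."""
    cam = (R @ points_3d.T).T + t        # world -> cellphone camera frame
    proj = (K @ cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]    # perspective division to pixels
    return np.mean(np.linalg.norm(proj - observed_px, axis=1))
```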

Augmented reality on cellphone

When the rotation between the cellphone and the database image is small, the precision is even better. In this image, the reprojection error is a bit more than 2 pixels.

Augmented reality on cellphone