Assisted, Embedded Pose Estimator for 3D Reconstruction

Eyrun Eyjolfsdottir, Nikolas Hansen and Matthew Turk

Image 1: Left: Nokia N900 and a sensorbox, provided by Nokia.  Right: A screenshot of the application running on the phone.


We are building a system that uses the phone's inertial sensors to improve the process of extracting 3D points from 2D images on a mobile platform. Computer vision techniques that are widely used for this task suffer from being slow and sensitive to image noise. Using accelerometers and gyroscopes, we are able to get an estimate of the phone's pose for each frame, which allows us to limit the search space for matching feature points, speeding up the algorithm and decreasing the chance of outliers.

Inertial based tracking

Based on the discrete sensor values from the inertial sensors we are able to estimate the phone's pose (orientation and position). The phone has three accelerometers, one measuring acceleration along each axis. By assuming an initial position and velocity, we can obtain the position at each point in time by double integrating the acceleration.
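As an illustration, here is a minimal dead-reckoning sketch of the double integration (function and variable names are our own, not part of the system; it assumes the samples are already world-frame and gravity-free):

```python
import numpy as np

def integrate_position(accel, dt, v0=None, p0=None):
    """Double integrate world-frame, gravity-free acceleration samples."""
    accel = np.asarray(accel, dtype=float)
    v = np.zeros(3) if v0 is None else np.asarray(v0, dtype=float)
    p = np.zeros(3) if p0 is None else np.asarray(p0, dtype=float)
    positions = [p]
    for k in range(1, len(accel)):
        # first integration: acceleration -> velocity (trapezoidal rule)
        v_next = v + 0.5 * (accel[k - 1] + accel[k]) * dt
        # second integration: velocity -> position
        p = p + 0.5 * (v + v_next) * dt
        v = v_next
        positions.append(p)
    return np.array(positions)

# Constant 1 m/s^2 along x for one second should travel ~0.5 m.
samples = np.tile([1.0, 0.0, 0.0], (101, 1))
path = integrate_position(samples, dt=0.01)
```

With perfect samples this recovers p = at²/2 exactly; with real sensor noise the error grows quadratically with time, which is exactly the drift discussed below.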

The acceleration due to gravity needs to be subtracted, since we are only interested in the acceleration due to translation of the phone. But because the phone can have any orientation, and can rotate during translation, it is not obvious which part of the acceleration vector is due to gravity. That is where knowing the orientation of the phone comes in handy.

We have attached an external sensorbox to the phone, which the phone connects to over Bluetooth. The sensorbox has three gyroscopes, one measuring the angular velocity around each axis. Assuming an initial orientation of the phone, we can estimate the orientation at each point in time by integrating the angular velocity. Knowing the orientation of the phone lets us easily determine which part of the acceleration vector is caused by gravity.
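A sketch of how the orientation integration and gravity removal might look (the gravity sign convention and all names here are assumptions, and a real implementation would use a better integrator than this first-order one):

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, 9.81])  # world frame, z axis pointing up

def skew(w):
    """Cross-product (skew-symmetric) matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def update_orientation(R, omega, dt):
    """One first-order step of dR/dt = R [omega]_x, re-orthonormalized."""
    R = R @ (np.eye(3) + skew(omega) * dt)
    U, _, Vt = np.linalg.svd(R)  # project back onto the rotation group
    return U @ Vt

def linear_acceleration(R, accel_body):
    """Rotate a body-frame accelerometer reading into the world frame and
    remove gravity (convention: a device at rest reads +9.81 on world z)."""
    return R @ accel_body - GRAVITY

# Integrate a quarter turn about z, then check that a resting device's
# reading reduces to zero linear acceleration.
R = np.eye(3)
for _ in range(1000):
    R = update_orientation(R, np.array([0.0, 0.0, np.pi / 2]), 0.001)
resting_reading = R.T @ GRAVITY        # what the accelerometers would sense
residual = linear_acceleration(R, resting_reading)
```

The residual is (numerically) zero for a resting device, which is the point of knowing the orientation: whatever remains after subtracting gravity is acceleration due to translation.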

Using both of these sensors we are able to infer the pose of the phone at each point in time. However, due to small measurement errors in the sensors, this estimate is not completely accurate, and the error accumulates with each integration step.


Image 2: A trajectory of a star and a trajectory of a heart (seen from the side), produced by waving the phone in space.

Vision based pose estimation

In order to get a 3D representation of a scene based on 2D images, we need to find matching points between the images, and the transformation of the camera between them. We can then triangulate the points to find the corresponding 3D points. These are the required steps:


Calibrate the camera The camera is calibrated to obtain its intrinsic parameters (the focal length, the pixel size and the principal point) as well as the distortion parameters that model lens distortion and misalignment of the image plane and the lens. This only needs to be done once per device.

Detect and match keypoints Finding the keypoints in the images can be done fairly quickly. However, to match those points we need to calculate a descriptor for each keypoint and compare each point in one image to every point in the other, assuming we have no knowledge of how the camera moved. Since we are working with images that are not always temporally close, the camera may have moved considerably between two frames, so we need features that are robust to changes in scale, rotation and lighting. For those reasons we have chosen SURF as the keypoint descriptor. Calculating the descriptors is by far the most time consuming part of the process.
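For illustration, a brute-force matcher over descriptor vectors might look like this (the nearest-vs-second-nearest ratio test for rejecting ambiguous matches is a common heuristic; the threshold and all names here are our own):

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    """Match each descriptor in desc1 to its nearest neighbour in desc2,
    keeping a match only if it is clearly better than the second-nearest."""
    matches = []
    for i, d in enumerate(desc1):
        dist = np.linalg.norm(desc2 - d, axis=1)   # distance to every candidate
        order = np.argsort(dist)
        best, second = order[0], order[1]
        if dist[best] < ratio * dist[second]:      # unambiguous match only
            matches.append((i, int(best)))
    return matches

# Tiny synthetic example with 2-D "descriptors" (SURF would give 64-D ones).
desc1 = np.array([[1.0, 0.0], [0.0, 1.0]])
desc2 = np.array([[0.0, 1.01], [1.0, 0.02], [5.0, 5.0]])
matches = match_descriptors(desc1, desc2)
```

The quadratic cost of this comparison is what the sensor-based search-space pruning described later is designed to cut down.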

Find the fundamental matrix Once we have the possible point matches (some of which will be outliers), we can find the fundamental matrix. We have chosen to use RANSAC with the 8 point algorithm. RANSAC iteratively estimates the fundamental matrix and returns the one that satisfies the fundamental matrix constraint, x' F x = 0, for the largest number of corresponding points. That way we can also determine which of the point matches are inliers and which are outliers. Since the camera can move arbitrarily between two frames, it is not guaranteed that the matrix with the most inliers is the best one (for example when the overlapping area of the two images is small and we have many mismatches). This can lead to bad results, as the RANSAC assumption then fails. Other methods are very sensitive to outliers and are therefore not a preferred option.
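A condensed sketch of this step follows (our own minimal version: no Hartley normalization, and the algebraic residual |x'ᵀFx| instead of the Sampson distance a production system would use):

```python
import numpy as np

def eight_point(x1, x2):
    """Estimate F from >= 8 homogeneous correspondences (x2^T F x1 = 0)."""
    A = np.column_stack([
        x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0] * x1[:, 2],
        x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1] * x1[:, 2],
        x2[:, 2] * x1[:, 0], x2[:, 2] * x1[:, 1], x2[:, 2] * x1[:, 2],
    ])
    _, _, Vt = np.linalg.svd(A)          # least-squares null vector of A
    F = Vt[-1].reshape(3, 3)
    U, s, Vt = np.linalg.svd(F)          # enforce the rank-2 constraint
    return U @ np.diag([s[0], s[1], 0.0]) @ Vt

def ransac_fundamental(x1, x2, iters=200, thresh=1e-2, seed=0):
    """RANSAC loop: keep the F that satisfies x2^T F x1 = 0 for most points."""
    rng = np.random.default_rng(seed)
    best_F, best_inliers = None, np.zeros(len(x1), dtype=bool)
    for _ in range(iters):
        sample = rng.choice(len(x1), 8, replace=False)
        F = eight_point(x1[sample], x2[sample])
        residual = np.abs(np.sum(x2 * (x1 @ F.T), axis=1))  # |x2^T F x1|
        inliers = residual < thresh
        if inliers.sum() > best_inliers.sum():
            best_F, best_inliers = F, inliers
    return best_F, best_inliers

# Noise-free synthetic data: two views related by a pure x-translation,
# with the first five matches deliberately corrupted.
rng = np.random.default_rng(1)
X = rng.uniform([-1, -1, 4], [1, 1, 6], size=(40, 3))
x1 = np.column_stack([X[:, :2] / X[:, 2:3], np.ones(40)])
X2 = X + np.array([1.0, 0.0, 0.0])
x2 = np.column_stack([X2[:, :2] / X2[:, 2:3], np.ones(40)])
x2[:5, :2] = rng.uniform(-1, 1, size=(5, 2))
F, inliers = ransac_fundamental(x1, x2, thresh=1e-6)
```

On this synthetic data the recovered inlier mask rejects exactly the corrupted matches; with real, noisy matches the threshold must be tuned to the image coordinate scale.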

Find the camera pose Given the fundamental matrix and the calibrated camera, we can calculate the essential matrix by E = K^T F K, where K is the matrix containing the camera's intrinsic parameters and F is the fundamental matrix. By applying SVD to E we obtain two possible rotation matrices and two possible translation vectors, so four configurations in total. To determine which of those is the correct configuration, we take an inlier and triangulate it for each of the possibilities. The configuration that yields a 3D point with positive depth as seen from both cameras, and with the lowest reprojection error, is chosen.
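The decomposition can be sketched as follows (this mirrors the standard textbook construction; variable names are ours):

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def decompose_essential(E):
    """Return the four candidate (R, t) pairs encoded by an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:    # keep the factors proper rotations
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]                 # translation, up to scale and sign
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# Round trip: build E from a known pose and check it is among the candidates.
angle = 0.3
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([1.0, 0.0, 0.0])
poses = decompose_essential(skew(t_true) @ R_true)
```

Only one of the four candidates places the triangulated point in front of both cameras, which is the cheirality test described above.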

Triangulate Knowing the transformation between the two cameras, we can triangulate each pair of corresponding points in the images to get a 3D point cloud representing the scene.
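A minimal linear (DLT) triangulation sketch, with made-up example cameras in normalized coordinates:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: each view contributes two equations of the
    homogeneous system A X = 0, which is solved by SVD."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]      # de-homogenize

# Two cameras one unit apart along x, observing a point at depth 5.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, -0.2, 5.0])
x1 = X_true[:2] / X_true[2]
x2 = (X_true[:2] + np.array([-1.0, 0.0])) / X_true[2]
X_est = triangulate(P1, P2, x1, x2)
```

With noisy matches the same linear system is solved in a least-squares sense, which is why accurate correspondences and an accurate relative pose matter so much for the quality of the point cloud.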

Camera + inertial sensors  =  faster and more robust estimation

Now we have two methods for estimating the phone's pose. One is very fast but not very accurate; the other is accurate (when it succeeds) but slow. We would like to combine the two to get a faster and more robust estimate. We propose two ways to improve the performance:

Limit search space based on frustum intersection

Based on the intrinsics of the camera we can build a frustum within which all 3D points captured by the camera must lie. For two images, the points captured in both frames will lie in the intersection of the two frusta corresponding to the camera at the times the images were taken. Using the pose estimate from the inertial sensors we are able to find this intersection. If we additionally know the approximate distance to the scene being reconstructed, we can limit the search even further. That distance can be obtained from user input, or from our own estimate after comparing the first two images. Limiting the search space speeds up the algorithm, as there are fewer keypoints to compare, and it minimizes the chance of mismatching points.
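One way to sketch the idea in code (a simplified version that only projects the second camera's view at the assumed scene distance into the first image; the names and intrinsics below are made up):

```python
import numpy as np

def roi_in_first_image(K, R, t, depth, width, height):
    """Project the corners of image 2, back-projected to the assumed scene
    depth, into image 1 to bound the area where shared points can appear.
    Assumes camera 2 is P2 = K [R | t] relative to camera 1."""
    corners = np.array([[0.0, 0.0, 1.0], [width, 0.0, 1.0],
                        [0.0, height, 1.0], [width, height, 1.0]]).T
    rays = np.linalg.inv(K) @ corners      # back-project the corner pixels
    X2 = rays * depth / rays[2]            # 3D corners in the camera-2 frame
    X1 = R.T @ (X2 - t[:, None])           # move them into the camera-1 frame
    x1 = K @ X1                            # and project into image 1
    x1 = x1[:2] / x1[2]
    return x1.min(axis=1), x1.max(axis=1)  # axis-aligned bounding box

# Sanity check: with an identical pose the region is the whole image.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
lo, hi = roi_in_first_image(K, np.eye(3), np.zeros(3), 5.0, 640, 480)
```

As the estimated pose moves away from identity, the box shrinks toward the overlapping region, and keypoints outside it can be skipped entirely.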

This technique can also be used to have the system automatically snap an image once the camera has moved to a previously unseen area. That would be particularly useful for applications such as panorama building, which require only a small overlap between images.


Image 4: Left: Intersection of the camera's frusta at the two points in time, together with the estimated distance to the scene, can significantly limit the area of interest in each image.  Right: The red boxes represent the area of interest in both images. The multi-colored lines in the lower image are epipolar lines corresponding to the points in the upper image, based on the inertial sensors' pose estimate. Limiting the search for matching points to the red box and the area around the epipolar lines speeds up the process and decreases the chance of mismatching.

Limit search space based on estimated epipolar lines

We can also use the pose estimate from the inertial sensors to predict epipolar lines. For each keypoint in one image we can then search for the matching keypoint in the second image along, and around, these lines. To calculate the epipolar lines we reverse the steps outlined in the vision based part. We start with a rotation matrix, R, and a translation vector, t. The essential matrix can be calculated by E = [t]x R, and the fundamental matrix by F = K^-T E K^-1. The epipolar line in the second image corresponding to a point x in the first image is then given by l = F x.
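Those two formulas translate directly into code (K below is a made-up intrinsic matrix for illustration):

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def epipolar_line(R, t, K, x):
    """Predict the epipolar line in image 2 for pixel x (homogeneous) in
    image 1, given the relative pose (R, t) from the inertial sensors."""
    E = skew(t) @ R                  # E = [t]x R
    K_inv = np.linalg.inv(K)
    F = K_inv.T @ E @ K_inv          # F = K^-T E K^-1
    return F @ x                     # l = F x (homogeneous line coefficients)

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                        # sensors report a pure sideways motion
t = np.array([0.1, 0.0, 0.0])
x1 = np.array([400.0, 260.0, 1.0])   # a pixel in the first image
line = epipolar_line(R, t, K, x1)
```

The matching point x2 in the second image then satisfies x2 · l ≈ 0, so the descriptor comparison can be restricted to a narrow band around this line.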


Both of the proposed methods speed up the detection and matching of keypoints, which is the most time consuming part of the vision based algorithm, and minimize the chance of matching outliers, making the search for the fundamental matrix more robust.


Image 5: Left: Epipolar lines based on visual features.  Right: Epipolar lines based on the inertial sensors.  Notice that the green correspondence in the left image is an outlier. In the right image you can see that the green circles are the only ones that are not close to their predicted positions. Had we limited the search space to small intervals around the estimated epipolar lines, this outlier would not have been accepted as a match.