Four Eyes Lab, Department of Computer Science, University of California, Santa Barbara


Visual Tracking Dataset

This website accompanies our IJCV paper "Evaluation of Interest Point Detectors and Feature Descriptors for Visual Tracking" and makes the dataset described therein available for download.

If you use this dataset in academic work, please cite our paper:

Steffen Gauglitz, Tobias Höllerer, and Matthew Turk. Evaluation of Interest Point Detectors and Feature Descriptors for Visual Tracking. International Journal of Computer Vision, Volume 94, Issue 3 (2011), Pages 335-360.

BibTeX:

@article{Gauglitz_IJCV2011,
  author = {Steffen Gauglitz and Tobias H\"ollerer and Matthew Turk},
  title = {Evaluation of Interest Point Detectors and Feature Descriptors for
	Visual Tracking},
  journal = {International Journal of Computer Vision},
  year = {2011},
  volume = {94},
  pages = {335-360},
  number = {3},
  doi = {10.1007/s11263-011-0431-5}
}

The dataset includes 96 videos with a total of 6889 frames, featuring six different planar textures with 16 different camera paths each. Please refer to the paper for more details. If you have any questions, feel free to contact me (see email address at bottom of page).

Videos

All 96 videos are encoded with the lossless HUFFYUV codec.
The file names follow the schema fi-[texture]-[motion path].avi, where fi stands for the camera used (Unibrain Fire-i), and [texture] and [motion path] are two-letter acronyms for the respective texture and motion path:

[texture]: br = bricks, bu = building, mi = mission, pa = Paris, su = sunset, wd = wood

[motion path]: uc = unconstrained, pn = panning, rt = rotation, pd = perspective distortion, zm = zoom, mX = motion blur level X (X = 1...9), ls = static lighting, ld = dynamic lighting

Please refer to the paper for a description of each texture and motion path.
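For convenience, the complete list of the 96 file names can be generated from this schema. A short Python sketch (purely illustrative; the strings simply restate the acronyms above):

# Generate all 96 video file names from the schema above.
textures = ["br", "bu", "mi", "pa", "su", "wd"]
motions = ["uc", "pn", "rt", "pd", "zm"] + [f"m{x}" for x in range(1, 10)] + ["ls", "ld"]
filenames = [f"fi-{t}-{m}.avi" for t in textures for m in motions]
assert len(filenames) == 96   # 6 textures x 16 motion paths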

Note that the videos show not only the texture itself, but also a distinctive black-and-white pattern around the border, the frame, and some background. For evaluating tracking, one should either remove any keypoints detected outside the texture area (this is what we did), or manipulate the video frame to make the background unusable (e.g., replace it with a uniform color or random noise). The area of the texture in each frame can be extracted using the ground truth data below.
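As a minimal illustration of the first option, the following Python/OpenCV sketch (not part of the dataset) discards keypoints outside the texture area. It assumes H is the per-frame ground-truth homography described under "Ground truth data" below, and that textureROI is available as (x, y, width, height) in reference-frame coordinates; the exact layout in groundtruth_coordinateframe.h may differ.

import numpy as np
import cv2

def keypoints_on_texture(keypoints, H, texture_roi):
    """Keep only keypoints whose reference-frame position lies inside the texture area.

    keypoints:   list of cv2.KeyPoint detected in the video frame
    H:           3x3 homography warping this frame to the reference frame
    texture_roi: (x, y, width, height) in reference-frame coordinates (assumed layout)
    """
    if not keypoints:
        return []
    x0, y0, w, h = texture_roi
    pts = np.float32([kp.pt for kp in keypoints]).reshape(-1, 1, 2)
    # Map keypoint positions from the video frame into the canonical reference frame.
    ref = cv2.perspectiveTransform(pts, H).reshape(-1, 2)
    inside = ((ref[:, 0] >= x0) & (ref[:, 0] <= x0 + w) &
              (ref[:, 1] >= y0) & (ref[:, 1] <= y0 + h))
    return [kp for kp, keep in zip(keypoints, inside) if keep]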

Download:

Camera calibration

We used OpenCV's camera model, and the calibration coefficients are as follows:

Camera (intrinsic) matrix [ f_x 0 c_x ; 0 f_y c_y ; 0 0 1 ]:

  869.57     0          299.748
  0          867.528    237.284
  0          0          1

Distortion coefficients (k1, k2, p1, p2):

  -0.0225415   -0.259618   0.00320736   -0.000551689

These are the first two parameters for (e.g.) OpenCV's cvInitUndistortMap function.

UPDATE: Apparently, the OpenCV version that I used to create the ground truth data below had a bug in the cvInitUndistortMap function that was fixed in later releases. As a result, the data below is not compatible with recent versions of cvInitUndistortMap. Instead, please use this legacy version of cvInitUndistortMap. Please note the OpenCV copyright. Many thanks to Konstantin Schauwecker and Stefan Welker for pointing this out.
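For reference, here is a minimal Python/OpenCV sketch of plugging these numbers into OpenCV, e.g. to undistort a frame (the image file name is a placeholder). As noted above, for exact compatibility with the ground truth warps you should use the legacy cvInitUndistortMap rather than a recent undistortion routine.

import numpy as np
import cv2

# Intrinsic matrix and distortion coefficients from the calibration above.
K = np.array([[869.57,    0.0,   299.748],
              [  0.0,   867.528, 237.284],
              [  0.0,     0.0,     1.0  ]])
dist = np.array([-0.0225415, -0.259618, 0.00320736, -0.000551689])  # k1, k2, p1, p2

frame = cv2.imread("frame.png")              # placeholder: any decoded video frame
undistorted = cv2.undistort(frame, K, dist)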

Ground truth data

The reference coordinates and dimensions are contained in this header file: groundtruth_coordinateframe.h.

For each video, there is a corresponding file [name of video].warps, with one line per video frame. Each line contains nine floating point values which, written as a 3x3 matrix, describe the homography that warps this video frame to the canonical reference frame. More precisely, it warps the centers of the four red balls to the coordinates dst_corners described in groundtruth_coordinateframe.h. The header file also contains textureROI, which specifies the area of the texture (in reference frame coordinates), along with several other coordinates and dimensions that might be useful.
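As a minimal sketch (Python/NumPy/OpenCV, not part of the dataset), the following loads a .warps file and warps one frame into the reference frame. The row-major reading of the nine values, the file names, and the reference-frame size are assumptions; take the actual dimensions from groundtruth_coordinateframe.h.

import numpy as np
import cv2

def load_warps(path):
    """One line per frame, nine floats per line; assumed row-major 3x3 homography."""
    return np.loadtxt(path).reshape(-1, 3, 3)

warps = load_warps("fi-wd-zm.avi.warps")   # placeholder file name following the schema above
frame = cv2.imread("frame_0000.png")       # placeholder: first frame of that video, already decoded
ref_size = (800, 600)                      # placeholder; use the dimensions from groundtruth_coordinateframe.h
in_reference_frame = cv2.warpPerspective(frame, warps[0], ref_size)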

Ground truth warps for all videos (.zip, 215 KB)

Given the camera calibration and the (world) positions of the red balls (also contained in the header file), one can reconstruct the camera path from the given homographies.
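One possible way to do this is sketched below (this is not the procedure used in the paper): map dst_corners back into the frame with the inverse homography to obtain the image positions of the four ball centers, then solve a PnP problem against their known world positions. All variable names are placeholders, and whether distortion should be taken into account depends on whether the warps refer to undistorted frames (see the calibration note above).

import numpy as np
import cv2

def camera_pose(H, dst_corners, ball_world_pts, K):
    """Recover the camera pose for one frame from its ground-truth homography.

    H:              3x3 homography warping the frame to the reference frame
    dst_corners:    reference-frame coordinates of the four ball centers (from the header file)
    ball_world_pts: 3D world positions of the four balls (from the header file)
    K:              intrinsic camera matrix from the calibration above
    """
    # Image positions of the ball centers in this frame: apply the inverse homography.
    img_pts = cv2.perspectiveTransform(
        np.float32(dst_corners).reshape(-1, 1, 2), np.linalg.inv(H))
    # PnP from four 3D-2D correspondences (with only four points, solvePnP's default
    # method assumes an approximately planar configuration). Distortion is omitted here
    # on the assumption that the warps refer to undistorted frames.
    ok, rvec, tvec = cv2.solvePnP(np.float32(ball_world_pts), img_pts, K, None)
    return rvec, tvec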

Random frame indices

The following data is optional. As described in the paper, the video frames can be used either in consecutive order or, to simulate larger baseline distances, in randomized (or otherwise permuted) order. If results are to be compared directly to the graphs in our paper, one should use the same random frame pairs. This file (.zip, 10 KB) contains five lists of random frame indices.

Wherever frame pairs are needed (i.e., to evaluate repeatability or precision), we used all consecutive pairs in the above lists, i.e., (frame[random-index[0]], frame[random-index[1]]), (frame[random-index[1]], frame[random-index[2]]), and so on.
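For example, a few lines of Python that turn one such list into these pairs (the file name and the one-index-per-line format are assumptions):

with open("random_indices_1.txt") as f:   # placeholder name; one frame index per line assumed
    idx = [int(line) for line in f if line.strip()]
pairs = list(zip(idx[:-1], idx[1:]))      # [(idx[0], idx[1]), (idx[1], idx[2]), ...]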

These frame pairs can be grouped according to the rotation, baseline distance, or scale change, respectively, between the two frames. Note that, since the motion paths for the different textures are not exactly the same, these groups are specific to each video. This file (.zip, 242 KB) contains the grouping we used. It contains 30 files, with one line per frame pair (i.e., 499 entries for 500 random frame indices paired as described above). Each line contains two values: first the specific numeric value (in degrees, millimeters, or scale change), then the number of the bin to which the pair was assigned. Bin numbers are 1-based.
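A minimal way to read one of these grouping files in Python (the file name is a placeholder; whitespace-separated values per line are assumed):

values, bins = [], []
with open("grouping_file.txt") as f:      # placeholder: one of the 30 files in the zip
    for line in f:
        if line.strip():
            value, bin_id = line.split()
            values.append(float(value))   # rotation / baseline distance / scale change
            bins.append(int(bin_id))      # 1-based bin number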



maintained by Steffen Gauglitz, sgauglitz [at] cs.ucsb.edu