Tracking an object’s motion is a common challenge for human and machine vision. Many tracking algorithms work by recognizing the same object in successive static video frames, rather than by following that object’s unique spatiotemporal trajectory across a video sequence. Such strategies succeed when the target object exhibits unique features in most frames, but fail when the target is surrounded by many similar-looking items. We introduce a new tracking challenge based on short video sequences, in which a dot moves among multiple similar-looking distractor dots, starting from a marked location. The goal is to determine whether the dot that emerged from the marked starting location ends at another marked location or follows some other trajectory. The challenge is inspired by tracking tasks used in cognitive psychology. Using our video dataset, we are systematically evaluating state-of-the-art deep learning-based vision algorithms. We hope that our novel benchmark will encourage the development of neural networks with human-like tracking capabilities that reduce reliance on inefficient visual strategies, such as repeated recognition of an object’s within-frame featural signatures. Such networks could in turn inform the design of algorithms that depend less on the extraordinarily large datasets that most current neural networks need to learn visual tasks.

[NeurIPS] [arXiv] [Website] [Code] [Dataset] [Human data collection]

Example task with 14 distractor squares.

Positive instance with 2 target squares and 14 distractor squares. A small filled white target square starts from a larger open red square at the start of the video, and another small filled white square enters a larger open blue square at the end of the video. In a positive instance, the target white square that starts from the larger red square follows an arbitrary trajectory and ends in the larger blue square. The open squares are colored red (start) and blue (end) for demonstration purposes only; this coloring is not part of the videos used in the experiments. In the actual experimental datasets, both open squares are gray.

Negative instance with 2 target squares and 14 distractor squares. A small filled white target square starts from a larger open red square at the start of the video, and another small filled white square enters a larger open blue square at the end of the video. In a negative instance, the target white square that starts from the larger red square follows an arbitrary trajectory and ends at a random location, while another target square, which starts at a different random location, ends in the larger blue square. The open squares are colored red (start) and blue (end) for demonstration purposes only; this coloring is not part of the videos used in the experiments. In the actual experimental datasets, both open squares are gray.
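
To make the difference between positive and negative instances concrete, the following is a minimal sketch of how such an instance could be laid out and labeled. It is not the generator used to build the dataset: the function names (`random_walk`, `make_instance`, `same_dot_reaches_end`), the bounded random-walk motion, and the frame count and arena size are all assumptions chosen for illustration. Only the counts of 2 target squares and 14 distractor squares come from the captions above.

```python
import random


def random_walk(rng, n_frames, size, step=1.0):
    """One square's trajectory: a bounded random walk of (x, y) positions.

    The motion model is an assumption; the captions only say that
    trajectories are arbitrary.
    """
    x, y = rng.uniform(0, size), rng.uniform(0, size)
    path = [(x, y)]
    for _ in range(n_frames - 1):
        x = min(max(x + rng.uniform(-step, step), 0.0), size)
        y = min(max(y + rng.uniform(-step, step), 0.0), size)
        path.append((x, y))
    return path


def make_instance(positive, n_distractors=14, n_frames=64, size=32.0, seed=0):
    """Build one hypothetical instance with 2 targets plus distractors.

    paths[0] is the target that leaves the start marker; paths[1] is the
    second target. n_frames and size are illustrative values only.
    """
    rng = random.Random(seed)
    n_squares = 2 + n_distractors
    while True:
        paths = [random_walk(rng, n_frames, size) for _ in range(n_squares)]
        # Resample in the unlikely case that the two targets share a start
        # or end position, so that the label is unambiguous.
        if paths[0][0] != paths[1][0] and paths[0][-1] != paths[1][-1]:
            break
    start_marker = paths[0][0]
    # Positive: the end marker sits where the tracked target stops.
    # Negative: the end marker sits where the *other* target stops.
    end_marker = paths[0][-1] if positive else paths[1][-1]
    return {"paths": paths, "start_marker": start_marker, "end_marker": end_marker}


def same_dot_reaches_end(instance):
    """Ground-truth answer: does the square that leaves the start marker
    finish at the end marker?"""
    tracked = next(p for p in instance["paths"] if p[0] == instance["start_marker"])
    return tracked[-1] == instance["end_marker"]


if __name__ == "__main__":
    print(same_dot_reaches_end(make_instance(positive=True, seed=1)))
    print(same_dot_reaches_end(make_instance(positive=False, seed=1)))
```

In both cases a square leaves the start marker and a square arrives at the end marker, so the first and last frames alone do not reveal the label. Under these assumptions, the answer depends only on whether the two events belong to the same trajectory, which is the property the task is meant to probe.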