We thank R1, R2, and R3 for carefully reading our paper. All reviewers
appreciate the novelty and significance of our contributions. Although we
rigorously evaluate across four datasets and demonstrate substantial
performance gains on very difficult videos, the reviewers request additional
evaluation details, which we clarify below.

All reviewers note that our method performs better on VIRAT than on basketball.
This is simply a consequence of basketball being the harder dataset. The
basketball game is extremely challenging due to cluttered backgrounds, drastic
pose changes, frequent occlusions, motion blur, and low resolution. Even the
state-of-the-art in computer vision is unable to automatically track the
players in Fig.7 correctly, so the basketball game requires more labeling
effort.

R2 argues that the basketball result is more important than VIRAT. While
basketball is more challenging, VIRAT is a high-value, practical dataset of
significant interest to the surveillance community and the military. VIRAT
originally cost the research community tens of thousands of dollars to
annotate, while our algorithm can obtain equivalent labels for one tenth of
the cost.

R1 and R2 want more information about the error measure used in Fig.8. We
consider a predicted bounding box to be correct if it overlaps the ground truth
by at least 30%. The vertical axis in Fig.8 is the fraction of correct bounding
boxes. We chose 30% overlap because it agrees with our qualitative inspection
of the results. When we evaluated our approach with other metrics, predictions
that looked correct were unfairly counted as incorrect. 
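
As a sketch (not our released evaluation code), assuming "overlap" is measured
as intersection-over-union, the quantity plotted in Fig.8 could be computed as
follows; the function names and box format are illustrative:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def fraction_correct(predictions, ground_truth, threshold=0.3):
    """Fraction of predicted boxes overlapping ground truth by >= threshold."""
    hits = sum(iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truth))
    return hits / float(len(predictions))
```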

R1 wants details on failure cases. One potential pitfall is a "self-fulfilling
prophecy", in which the model becomes so sure of an erroneous track that it
does not query for additional annotations. Empirically, we do not observe this.
Rather, for hard videos, we see errors arising from a poor appearance model,
e.g. the active learner believes that (in the extreme case) all locations in a
frame are equally likely to be annotated. In this case, the algorithm will
provably request the midway frame between two keyframes, causing a "graceful
degradation" into the fixed key-frame baseline for difficult videos. We will
add a discussion of these points.
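
The graceful-degradation behavior can be sketched as follows; this is a
simplified illustration with hypothetical names, not our exact query rule.
When the appearance model is uninformative, every candidate frame scores
identically, and the query falls back to the midpoint between keyframes, i.e.
the fixed key-frame baseline:

```python
def next_query_frame(keyframe_a, keyframe_b, frame_scores=None):
    """Pick the next frame to annotate between two labeled keyframes.

    frame_scores: optional dict mapping frame index -> expected error
    reduction from the active learner. If absent, or flat because the
    appearance model treats all locations as equally likely, fall back
    to the midpoint (the fixed key-frame baseline).
    """
    candidates = range(keyframe_a + 1, keyframe_b)
    if frame_scores and len(set(frame_scores[f] for f in candidates)) > 1:
        # Informative scores: query the most uncertain frame.
        return max(candidates, key=lambda f: frame_scores[f])
    # Uninformative scores: query the midway frame.
    return (keyframe_a + keyframe_b) // 2
```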

R1 questions the robustness of our tracker, and R3 wants a more sophisticated
tracker. Fig.8b shows our model is capable of tracking basketball players.
While we could substitute any particular tracking algorithm, we chose the same
tracker used in [14] so we can compare fairly against the baseline and
conclusively demonstrate the power of adaptive keyframes. Indeed, our active
learning formulation can be applied to any dynamic-programming-based tracker.

R1 requests clarification on the tracking algorithm. In high-resolution video,
the number of candidate spatial locations K can be in the billions, mandating
our dynamic programming solution. We support bounding boxes with varying
dimensions by adding more spatial locations. The tracker can occasionally
recover from a lost target
(Fig.4), but our active learning algorithm recovers with less annotation effort
(Fig.5).
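
As a rough illustration (not our exact implementation), a
dynamic-programming tracker of this kind computes a Viterbi-style best path
over the K spatial locations per frame, combining appearance and motion
scores; the interface below is hypothetical:

```python
def dp_track(unary, pairwise):
    """Viterbi-style tracking over T frames and K spatial locations.

    unary[t][k]: appearance score of location k in frame t.
    pairwise(j, k): motion-smoothness score for moving from j to k.
    Returns the highest-scoring sequence of locations, one per frame.
    """
    T, K = len(unary), len(unary[0])
    score = [list(unary[0])] + [[0.0] * K for _ in range(T - 1)]
    back = [[0] * K for _ in range(T)]
    for t in range(1, T):
        for k in range(K):
            # Best predecessor location for reaching k at frame t.
            best_j = max(range(K),
                         key=lambda j: score[t - 1][j] + pairwise(j, k))
            back[t][k] = best_j
            score[t][k] = (score[t - 1][best_j]
                           + pairwise(best_j, k) + unary[t][k])
    # Backtrack from the best final location.
    path = [max(range(K), key=lambda k: score[T - 1][k])]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```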

R1 asks about code. We will release all code if accepted. The datasets are
already available.

Although R1 is concerned about the suitability for NIPS, R2 says our paper
addresses a core problem that will interest a larger audience. Our paper is
ideal for NIPS since it will interest both the vision and active learning
communities. Our ELC is not limited to video and can be applied to other
structured labeling tasks, such as labeling human body joints.

All reviewers agree that our paper is "a significant contribution" that
"outperforms current best practices" [R2]. Our novel algorithm, clear
theoretical formulation, and significant performance gains demonstrate our
paper's value and impact. Indeed, our paper enables the construction of massive
video datasets that were previously economically infeasible.
