The paper originally received fairly positive ratings from the reviewers: 6, 6, 6, 7. Reviewers thought that the paper addressed an important problem [R1], that the core idea of contextualized object representations was interesting and well motivated [R2], and that the approach was intuitive and straightforward [R1,R2] and outperformed the baselines [R1,R2]. At the same time, reviewers raised a number of concerns, mainly: (1) limited methodological novelty [R1,R3,R4], (2) shortcomings in the experimental design [R2,R3] and the need for additional baselines [R3], and (3) the reliance on pre-trained object detectors [R1,R4].

The authors provided a rebuttal to address these concerns, which included the additional baselines requested by the reviewers. Reviewers found the rebuttal nearly unanimously compelling, with R2 raising their score from a 6 to a 7, leading to final scores of 6, 7, 6, 7.

The AC has read the reviews, the rebuttal, and the paper itself, and largely agrees with the assessment provided by the reviewers. While the technical novelty of the approach may indeed be somewhat limited, the core idea is interesting and effective, and the application domain is challenging and novel; the paper is also well written, and the experimental results are robust. Therefore, the final decision is Acceptance.

NOTE FROM PROGRAM CHAIRS: For the camera-ready version, please expand your broader impact statement to discuss the potential negative impacts of your work. As one reviewer notes, "this algorithm is learning about human actions from an uncurated web video dataset. The potential for learning problematic biases is enormous, considering that web videos tend to depict different races/genders doing different things, and algorithms are well known to reproduce and even amplify these biases. Furthermore, any algorithm that's relevant to action recognition--and to retrieving videos based on arbitrary language queries--has applications in authoritarian surveillance."