Summary and Contributions: This paper describes a new large-scale video dataset for action recognition. The dataset is designed to resolve the diversity, privacy, and expiration issues present in existing datasets such as Kinetics [Kay 2017] or Moments in Time [Monfort 2018]. The paper describes the procedure used to build the fully-annotated dataset, along with several studies examining its statistics and baseline performance.
Strengths: The proposed dataset is a clear contribution to the computer vision community; as described in Sec 1, existing datasets have had various issues involving diversity, privacy, and availability / licensing, which are the major reasons why there has been no ImageNet-like common benchmark in video action recognition. This work potentially resolves a majority of these problems and can serve as an important common resource for future studies in video recognition. As a dataset paper, this work presents several convincing studies showing that the proposed dataset resolves the diversity and other issues present in previous work.
Weaknesses: I do not find any major issues in this paper. This is probably unavoidable, but the long-tailed distribution (Fig 5 and 6) seems to pose a challenge for rare-category recognition. It would be good if the authors could offer technical suggestions for handling the less frequent labels.
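One common mitigation the authors might consider for the long tail is class-balanced (inverse-frequency) resampling. A minimal sketch, assuming per-clip labels are available; the label names and counts below are hypothetical, not from AViD:

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Inverse-frequency weight per clip, so that weighted sampling
    draws each class roughly equally often during training."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Hypothetical long-tailed label list: 2 rare-class clips vs. 8 frequent-class clips.
labels = ["juggling"] * 2 + ["walking"] * 8
weights = balanced_sample_weights(labels)
# Rare-class clips receive weight 0.5; frequent-class clips receive 0.125,
# so each class contributes equal total sampling mass (1.0).
```

In a framework like PyTorch these weights would typically feed a weighted random sampler; the sketch only shows the weighting itself.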
Correctness: The dataset construction process and the empirical evaluation protocol look reasonable and appropriate.
Clarity: The paper is clearly written and overall easy to follow. (Minor) There is one sentence I could not parse: > (Sec 1) Experimentally, we show diversity and lack of diversity affects the recognition.
Relation to Prior Work: The paper shows sufficient empirical studies to contrast the previous datasets, and sufficiently discusses the motivational distinctions throughout the paper.
Additional Feedback: This work tackles the difficult dataset issues in video recognition research, which I really appreciate. UPDATE: Thanks for the author response; this work definitely makes a contribution. My final rating is unchanged.
Summary and Contributions: The authors collect AViD, a video dataset for action recognition, by gathering videos from different countries, in contrast with previous datasets, which are mainly from North America. Faces in the videos are blurred, and the authors also make sure that the collected videos are appropriately licensed so that the dataset remains static. Update: The authors partly addressed my concerns, and I am raising my rating from 4 to 5.
Strengths: This paper considers the data imbalance issue in terms of countries and cultures, which I am really glad to see; I think this is an important problem in fair AI. Secondly, it has been a long-standing problem that original videos on the Internet (e.g. YouTube) included in CV datasets get deleted over time, which is cumbersome for both dataset authors and other researchers, since usually the only option is to ask the original authors for the raw videos; I am also glad to see this factor being considered during data collection.
Weaknesses: 1) The same action could mean very different, even opposite, things in different cultures (such as nodding and shaking your head). The paper does not discuss how often this happens in the dataset, or how it affects the performance of models trained or tested on it. 2) The results in Tables 5 and 6 show that models trained on AViD surpass those trained on other datasets. However, it is unclear where the improvements come from (the larger training size of AViD, country diversity, or better video quality). 3) I feel that the paper does not have sufficient evidence to support the claim that models trained on AViD are more diverse in terms of countries; only Table 4 is related to this topic. Since this is the paper's main contribution, I feel the authors need more comprehensive experiments to support it. 4) From Table 3, AViD is still quite unbalanced (even under the simplest metric). As a benchmark dataset that claims to draw from diverse countries, AViD might lead other researchers to claim their models are diverse and fair simply because they are trained on AViD, which is still inherently biased. I am not an expert on this, but I think there should be a discussion.
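Regarding point 4, the "simplest metric" of imbalance is presumably something like the max-to-min class-size ratio. A minimal sketch with hypothetical per-class counts, for illustration only (not the actual AViD or Table 3 numbers):

```python
def imbalance_ratio(class_counts):
    """Largest class size divided by smallest; 1.0 means perfectly balanced."""
    return max(class_counts) / min(class_counts)

# Hypothetical per-class video counts for four action classes.
counts = [5000, 1200, 300, 40]
ratio = imbalance_ratio(counts)  # 125.0
```

Even under this crude measure, a large ratio signals that accuracy averaged over clips can hide poor performance on the smallest classes.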
Clarity: The paper is not perfectly written, but I can understand most of it.
Relation to Prior Work: Yes.
Additional Feedback: See the weaknesses section.
Summary and Contributions: The authors propose a new video action recognition dataset whose videos are collected from diverse countries, kept static thanks to Creative Commons licensing, and blurred to protect privacy. The authors demonstrate the necessity of creating such a dataset through comprehensive benchmark experiments. This work will be impactful for the community.
Strengths: 1. A new diverse, static, and privacy-protected video action recognition dataset. 2. It is large compared to existing datasets. 3. The authors demonstrate good generalization of models pretrained on the proposed dataset, and sufficiently detailed analysis is provided.
Weaknesses: 1. Currently, all the videos are short. However, it is unlikely that all the source videos are already well-trimmed; if they are not, how are the temporal boundaries determined, and will these annotations be released? I would like to invite the authors to provide more clarification on this. 2. Is there a human verification or voting mechanism to make sure the human annotations are accurate?
Relation to Prior Work: Yes
Additional Feedback: Final rating: The authors' response partially addresses my concerns. However, I still feel the authors could also release the temporal annotations, which would be of great importance for understanding the temporal boundaries of those actions. As for the concerns of R2, I agree they are reasonable, but I feel some of them are actually open research problems, e.g., how to make sure the actions have no ambiguity, or how to make sure the videos collected from different countries are really diverse in terms of content. Therefore, I decide to maintain my rating.