NeurIPS 2020

Learning Representations from Audio-Visual Spatial Alignment

Review 1

Summary and Contributions: This paper investigates the power of learning self-supervised audio-visual representations from 360 degree video with spatial audio. In particular, the authors compare learning audio-visual spatial correspondences (AVSA) with the previously introduced AV tasks of clip-level correspondence (AVC) and temporal correspondence (AVTS). Superiority of AVSA over AVC and AVTS is demonstrated on the pretext tasks themselves, as well as on downstream applications.

Strengths: + The introduced self-supervised task of AV spatial correspondence is interesting and novel, and its potential to facilitate better representations based on richer information is well motivated. + The transformer-based architecture seems to do a good job of leveraging multiple viewpoints, and the changes made to the original transformer architecture make sense. + Evaluation of AVSA-based features vs AVC and AVTS is thorough, and demonstrates clear superiority of the AVSA features over the others in the presented setting.

Weaknesses: - The main weakness is in the significance of the contribution of this paper: While I am convinced that AVSA does indeed learn better features than the other tasks on this dataset, I'm not sure what the practical significance of this is. I would have liked to see novel applications and/or state-of-the-art downstream task results that are **enabled** by the proposed model. Without such objective indication of significance, we are left with only relative significance vs. previously proposed tasks. - While it does make sense that context-aware processing of multiple viewpoints is more effective, I'm uncertain of the need for the more sophisticated transformer-based modelling of the task as "translation" between video and audio. I wonder if a sufficient context-aware baseline would have been to think of the problem as a "puzzle", similar to [1,2] where the network predicts the correct arrangement of the a_i, v_j pairs by solving a supervised classification problem. [1] Doersch, Carl, Abhinav Gupta, and Alexei A. Efros. "Unsupervised visual representation learning by context prediction." ICCV 2015. [2] Kim, Dahun, Donghyeon Cho, and In So Kweon. "Self-supervised video representation learning with space-time cubic puzzles." AAAI 2019

Correctness: The paper does appear to be correct.

Clarity: The paper is well written, although one needs to read the supplement in order to learn about some important ablation studies.

Relation to Prior Work: In short: Yes, it is clear how the task and proposed model in this work differ from those in previous works. However, the significance of the proposed solution is not clear in relation to previous works: the reader does not know whether the learned features are *objectively* good for action recognition or semantic segmentation, only how good they are in relation to the provided baseline tasks. See Weaknesses section for more on this.

Reproducibility: Yes

Additional Feedback: To summarize the above: The proposed task of learning representations from 360 degree audio-visual spatial correspondences is interesting, and I believe should enable some novel and interesting applications. However, until these are shown, I think this paper's contribution is lacking in significance. UPDATE: My point was (obviously) *not* that self-supervised learning is worthless, and clearly the fact that it underperforms full supervision is not a reason to kill it. The point I *was* trying to make is that I find new SSL work interesting if (A) it pushes the boundary of SSL results on interesting/important tasks, and/or (B) it *enables* solving new tasks that were unfeasible before. While I completely agree that 360 SSL should be able to do at least one of either A or B, this paper does not show that. While you do show that on a small dataset your method is superior, I don't find the claim of "360 SSL will be SOTA in due time" to be a strong enough argument. Some (B) applications (some of which you mentioned) would be sound localization / separation / stereo or ambisonic generation, etc.

Review 2

Summary and Contributions: The authors present an approach to learn representations from 360 audio-video streams using contrastive learning. This is done in two phases: first, the model is trained to identify audio-visual correspondences at the video level and then the model is tasked with identifying the correspondence only from different crops of the video. The authors claim that this results in better features in the learned model and validate this over multiple downstream tasks (semantic segmentation and action recognition).

Strengths: 1. The new dataset based on 360 degree audio and video from YouTube would be very interesting for the community. 2. The proposed training method brings a boost to a variety of downstream tasks (semantic segmentation and action recognition).

Weaknesses: 1. While the motivation of the paper is to leverage spatial information in representation learning, the model is not evaluated on an audio-visual spatial task (such as localizing the sources of sounds in videos). The architecture also uses a max-pool, making the learned model similar to previous work in the literature [1]. The authors do validate on a semantic segmentation task, but the pixelwise labeling seems to be based purely on visual object categories; that is, a person is labeled even if they are silent. The title of the paper does not seem justified, as it is not clear where the "spatial alignment" happens in the architecture or the evaluation. 2. The boost in performance on the semantic segmentation task is marginal over the baselines. How are the baselines able to get such good performance when localization is not directly involved in the training? How are the 4 crops chosen during training? Does the manner of selecting crops affect performance? 3. What is the performance on the semantic segmentation task of an ImageNet-pretrained model and of a randomly initialized model with a segmentation head? 4. Are there any experiments using the Regressive AVSA? How does it compare to Contrastive AVSA? Can the rotation space be made discrete so that the loss is based on classification as in [2], or could a von Mises kernel [3] be used? [1] Objects that Sound. Relja Arandjelović, Andrew Zisserman. [2] Convolutional Neural Networks for joint object detection and pose estimation: A comparative study. Francisco Massa, Mathieu Aubry, Renaud Marlet. [3] How useful is photo-realistic rendering for visual learning? Yair Movshovitz-Attias, Takeo Kanade, Yaser Sheikh.
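To make the suggestion in point 4 concrete, here is a minimal sketch of what discretizing a rotation angle could look like. It assumes a single yaw angle in radians and a hypothetical choice of 8 bins and concentration `kappa`; none of these names or values come from the paper, and the kernel shown is one common construction rather than necessarily the exact formulation used in [3].

```python
import math

def bin_centers(num_bins):
    """Evenly spaced yaw angles in [0, 2*pi) used as class centers (assumed layout)."""
    return [2.0 * math.pi * k / num_bins for k in range(num_bins)]

def hard_label(angle, num_bins):
    """Index of the bin center closest to `angle`, with wrap-around at 2*pi."""
    width = 2.0 * math.pi / num_bins
    return int(round((angle % (2.0 * math.pi)) / width)) % num_bins

def von_mises_soft_label(angle, num_bins, kappa=4.0):
    """Soft target over bins, weighted by exp(kappa * cos(angle - center))."""
    weights = [math.exp(kappa * math.cos(angle - c)) for c in bin_centers(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)
```

With a hard label, the loss is ordinary classification over rotation bins. With the soft label, neighbouring bins (including the one across the 0 / 2π boundary) receive partial credit, so a prediction that is off by one bin is penalized less than one that is off by half a turn.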

Correctness: The method is correct.

Clarity: The paper is well written.

Relation to Prior Work: Related work is discussed properly.

Reproducibility: Yes

Additional Feedback: 1. "they completely disregard the spatial cues of audio and visual signals naturally occurring in the real world." Saying the models completely disregard spatial information is too strong a statement, as these models can easily be repurposed to localize sound sources to some extent. [1,2] 2. "This is accomplished by using a translation network similar the transformer of" Typo - similar to. [1] Objects that Sound. Relja Arandjelović, Andrew Zisserman [2] Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon. ********POST REBUTTAL******** I appreciate the authors answering my questions in the rebuttal. I believe there is some miscommunication. I meant using the model for a downstream task that requires audio-visual spatial alignment. The authors report results of the AVSA self-supervision task and compare it to other methods like AVC. But that is the self-supervision task or pretext task setup rather than an actual downstream task. The authors state themselves "Not surprisingly, AVSA performs best". "Quantitative comparison to prior work is infeasible, since localization ability has historically been shown qualitatively" I am not sure I agree with this. If a dataset with annotations of spatial sound sources is collected, the authors can show quantitatively how their method compares to other methods. The authors can use their own implementation of previous work (like AVC) to do this. However, I feel the paper has 2 strong contributions: a 360 degree audio-video dataset and a self-supervision approach for audio-visual spatial alignment. For this reason, I'll update my score. But the evaluation method focuses more on action recognition and semantic segmentation, and arguably the true potential of this kind of self-supervision may lie elsewhere.

Review 3

Summary and Contributions: The paper provides a new pretext task for self-supervised representation learning using 360 video data. The authors collected a new dataset named YouTube-360 for this task. The proposed method was tested on several downstream tasks such as segmentation and action recognition. The method significantly outperforms the baseline (AVC) and shows that audio-visual spatial alignment leads to better representation learning.

Strengths: The proposed method is sound, builds on prior work on self-supervised audio-visual representation learning, and provides good results in comparison to the baseline. The authors argue that audio-visual spatial alignment on 360 video data is very useful and introduce a model for it. Illustrations are carefully designed.

Weaknesses: 1- While the experimental results suggest that the proposed approach is valuable for self-supervised learning on 360 video data with spatial audio, little insight is given into why we need self-supervised learning on this kind of data. In particular, 1) there are currently several large audio-video datasets such as HowTo100M and VIOLIN, and 2) there is not much 360 video data on YouTube in comparison to normal data. 2- For the experimental comparisons, the authors should at least report performance using other self-supervised learning losses, for instance masking features, predicting the next video/audio feature, or reconstructing a feature. This would be very useful for understanding the importance of the introduced loss in comparison with previous ones. 3- How are the videos divided into 10s segments? 4- It would be interesting to see how this spatial alignment works, for example by aligning an audio clip to the video and visualizing the corresponding visual region. 5- What is the impact of batch size on performance? A batch size of 28 seems too small to cover enough positive and negative samples. In this case, wouldn't using a MoCo loss instead of InfoNCE help?
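The concern in point 5 can be illustrated with a small sketch of the InfoNCE loss. It assumes, purely for illustration, that each positive pair is contrasted against the other 27 items in a batch of 28, and that a MoCo-style queue would append stored negatives from earlier batches; the review does not say how the paper samples negatives, and the queue size of 4096 below is a hypothetical value.

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.07):
    """InfoNCE loss for one positive score against a list of negative scores.

    Computes -log(exp(pos/t) / (exp(pos/t) + sum(exp(neg/t)))) using a
    max-shifted log-sum-exp for numerical stability.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    shift = max(logits)
    log_denominator = shift + math.log(sum(math.exp(l - shift) for l in logits))
    return log_denominator - logits[0]

# Hypothetical setup: 27 in-batch negatives, optionally extended by a queue.
in_batch_negatives = [0.0] * 27
queued_negatives = [0.0] * 4096  # assumed queue size, not from the paper

chance_loss_batch = info_nce(0.0, in_batch_negatives)
chance_loss_queue = info_nce(0.0, in_batch_negatives + queued_negatives)
```

When the scores carry no information, the loss equals log(K + 1) for K negatives, so the in-batch setting saturates at log(28) while the queued setting saturates at log(4124). A larger pool of negatives therefore makes the contrastive task harder, which is the intuition behind asking whether a queue would help at this batch size.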

Correctness: The paper clearly explains the modules and losses. Therefore, it seems correct from the reviewer's point of view.

Clarity: The paper is well-written and easy to follow.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: UPDATE: I checked the other reviews and the authors' response. The authors addressed my concerns and I think the paper has enough contributions to be accepted. The dataset itself is an interesting contribution to the video-audio community. However, I still think that the authors did not clearly motivate why their proposed self-supervised learning framework is better than other recent works. Also, as other reviewers mentioned, the evaluation of the method is not complete. These are important questions, but not critical enough to reject the paper. Therefore, I keep my score.

Review 4

Summary and Contributions: This paper investigates a novel self-supervised learning objective based upon predicting the spatial alignment between 360-degree audio-visual signals. For this purpose, the paper introduces a new 360-degree video dataset (YouTube-360) with Ambisonics spatial audio collected from YouTube (5.5k videos split into 88k clips, covering 246 hours of video). Similar to other contemporary works in audio-visual representation learning, a two-stream convolutional neural network model (one video stream, one audio stream) is used to produce encodings for a given pair of audio and video inputs, and the model is trained to predict whether the input pair is matched using a contrastive loss. The key algorithmic contribution of this paper is the framing of this task around predicting whether the video and audio signals are spatially aligned; previous work on self-supervised cross-modal learning from videos typically predicts whether the audio and video inputs come from the same source video, or whether they are temporally aligned (in the case that the inputs are sampled from the same source video). Experimental results are presented for the model's audio-visual spatial alignment accuracy, as well as using the learned representations for semantic segmentation (on the YouTube-360 dataset) and action recognition (on UCF and HMDB). The authors demonstrate that predicting spatial alignment outperforms predicting video-level correspondence or temporal alignment.

Strengths: Audio-visual spatial alignment is a very natural self-supervised learning objective which has eluded prior work due to the lack of appropriate datasets. This paper introduces a dataset specifically geared for this task, which is a strong contribution to the community. The paper also provides experimental confirmation that audio-visual spatial alignment is a powerful learning objective compared to temporal alignment or video-level correspondence. Finally, the paper is extremely well written and was a pleasure to read.

Weaknesses: I think that this is already a strong paper, but could be stronger if the experiments were pushed a little further. Specifically, for each potential rotation within a video clip, only a single monaural audio signal was generated from the Ambisonics sound field. It would have been straightforward to generate a stereo audio signal (or even experiment with 3 or more channels), and I expect that this would provide an even stronger learning signal for the spatial alignment prediction task. It also would have been interesting to see whether the AVSA objective is complementary with the AVC and AVTS objectives. Finally, it would have been interesting to see at least a partial confusion matrix for the segmentation task to see how the errors made by AVC/AVTS compare to AVSA. One of the motivations for AVSA stated in the introduction is that AVC/AVTS face ambiguity when confronted with frequently co-occurring objects (such as cars and roads), and that AVSA has the potential to resolve these ambiguities. It would have been nice to see verification that these kinds of errors are in fact reduced when using AVSA.

Correctness: The methodology and claims are correct.

Clarity: The paper is very well-written throughout and easy to read.

Relation to Prior Work: The paper provides a thorough overview of prior work in the area, and clearly positions itself relative to these works.

Reproducibility: Yes

Additional Feedback: Post-rebuttal feedback: In their rebuttal, the authors state: "Why using mono? Audio input ablation (R4) There is a misunderstanding here. We use full ambisonics aligned with each viewpoint by a 3D rotation (L195-196), and compare to mono/stereo inputs in suppl. Table 3a" I think the author(s) have misunderstood my question here. I understand that the full ambisonics are used to produce an audio waveform aligned with each viewpoint. However, the resulting aligned audio still appears to be a mono signal (meaning a single spectrogram is computed for each rotation), according to Table 1 in the supplementary material. This is analogous to attaching a single, forward-facing directional microphone to the front of the camera. The authors are effectively using the ambisonics to perform beamforming, attenuating the volume of sounds originating from outside the current FOV while boosting the volume of sounds in the current FOV. However, for a given FOV, that single spectrogram doesn't preserve direction of arrival information for the different sound sources that are present. The proposed approach is actually under-utilizing the ambisonics; I imagine the method would be even more powerful if a stereo audio signal (or even using a greater number of channels) was computed for each rotation. I am inclined to keep my originally assessed score. I really appreciate the novelty of this paper and think that using 360 degree audio-video data is going to be important in the future. I agree with the other reviewers that a more thorough exploration of novel tasks enabled by this approach and/or demonstration of improvements over the current SotA on downstream tasks would have made this paper much stronger, but given the novelty here I don't think the lack of those things is reason enough to reject this paper.
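The beamforming analogy above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes first-order ambisonics stored as a (W, X, Y, Z) tuple with SN3D-style gains, azimuth measured counter-clockwise in radians, and a cardioid virtual microphone. All function names and the 45 degree stereo spread are hypothetical.

```python
import math

def encode_foa(sample, azimuth, elevation=0.0):
    """Encode a mono sample arriving from (azimuth, elevation) as (W, X, Y, Z)."""
    return (
        sample,
        sample * math.cos(azimuth) * math.cos(elevation),
        sample * math.sin(azimuth) * math.cos(elevation),
        sample * math.sin(elevation),
    )

def virtual_mic(foa, azimuth, elevation=0.0, pattern=0.5):
    """Point a virtual microphone at (azimuth, elevation).

    pattern = 1.0 is omnidirectional, 0.5 is cardioid, 0.0 is figure-of-eight.
    """
    w, x, y, z = foa
    directional = (
        x * math.cos(azimuth) * math.cos(elevation)
        + y * math.sin(azimuth) * math.cos(elevation)
        + z * math.sin(elevation)
    )
    return pattern * w + (1.0 - pattern) * directional

def mono_view(foa, view_azimuth):
    """Single forward-facing cardioid aligned with the viewpoint."""
    return virtual_mic(foa, view_azimuth)

def stereo_view(foa, view_azimuth, spread=math.pi / 4):
    """Two cardioids angled to the left and right of the viewpoint (assumed spread)."""
    left = virtual_mic(foa, view_azimuth + spread)
    right = virtual_mic(foa, view_azimuth - spread)
    return left, right
```

For a viewpoint at azimuth 0, a source 30 degrees to the left and a source 30 degrees to the right produce the same mono gain, so the single channel cannot tell them apart. The stereo pair yields a louder left channel for the first source and a louder right channel for the second, which is the direction-of-arrival cue the review argues is being discarded.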