Summary and Contributions: The authors propose a two-stage framework to tackle the task of discriminatively localising sounding objects in the cocktail party scenario without manual annotations. Robust object representations are first learned in a single source scenario, before expanding to a multi-source scenario where sounding object localisation is formulated as a self-supervised audiovisual consistency problem, which is solved through object category (audio and visual) distribution matching.
Strengths: + Curriculum learning from a simple scenario with a single sound source to a complex scenario with multiple sounding sources, i.e. the cocktail party scenario. + The proposed method addresses natural audiovisual scenarios, which consist of multiple sounding and silent objects (unlike some prior works, which do not address silent objects). Matching audio and visual object category distributions is a nice idea. + Ablations are performed (sup mat) to show the benefit of alternating between localisation and classification in the first stage (single source). + In the single-source scenario, the proposed method achieves either better or comparable results to Sound of pixels (the top-performing baseline out of several shown) on the MUSIC-solo and AudioSet-instrument-solo dataset splits. + In the multi-source scenario, the proposed method outperforms all baselines on the MUSIC-Synthetic, MUSIC-Duet and AudioSet-Multi dataset splits for the CIoU and AUC metrics (although not NSA).
Weaknesses: 1. Some more ablations would be nice. For example, the authors could investigate the impact of removing, in the second stage (multi-source scenario), the sounding area (l_i) as well as the sounding object location (s_i). 2. Some implementation details are missing, e.g. how long do the authors train the first stage vs the second stage? More details would help reproducibility.
Correctness: Yes, the method seems correct and the experimental section is well executed.
Clarity: The paper is relatively well written. The problem, proposed framework and experiments are clearly explained.
Relation to Prior Work: Comparison to prior work is fairly well discussed. Unlike previous work, this method uses an established dictionary of object representations to predict class-aware object localisation maps. This work addresses mixed-sound localisation, whereas previous methods mainly assume single-source scenes.
Additional Feedback: For the video in the sup mat, it would be clearer if the authors' method were compared to the baselines on the same slide. Typos: L.190: "two learning objectives"; Sound-of-pixel -> Sound of pixels; Object-that-sound -> Objects that sound. ########### POST REBUTTAL ################ I thank the authors for their feedback. I agree with some of the issues raised by the other reviewers: for example, the requirement of knowing the number of sources in a video, as well as the availability of videos with a single sound source for the first part of the curriculum, are legitimate concerns. It is also true that the music-instruments setting is a relatively simple one; however, there has been a lot of previous work on this kind of data, so I think it is fair as a benchmark. Overall, I believe the work is good, the novelty is sufficient, and it merits acceptance. I therefore keep my initial score and recommendation. A further comment: could the authors give some more details on which datasets the models have been trained on for each experiment (e.g. are the models that are evaluated on MUSIC trained on MUSIC only, AudioSet only, or both?)
Summary and Contributions: The paper proposes methods to localize objects producing sounds in a given audiovisual scene. This is done in an unsupervised setting where manual annotations are not available. The framework first tries to learn robust object representations and then uses audiovisual consistency to train the networks to localize the sounding objects.
Strengths: Localizing sounding objects in a given audiovisual scene is an interesting problem. The paper presents a novel approach to this problem which does not require manual semantic labeling: training is largely self-supervised and relies on inherent audiovisual consistencies. The overall approach is nice. Comparison with several prior methods is provided to show the superiority of the proposed method.
Weaknesses: One major limitation of the work is that only music-related objects and sounds are used. This does not give a good idea of how well the method generalizes to everyday objects and the sounds they produce. It would have been nice if the paper had considered this more general condition in its datasets. There are a few other concerns w.r.t. how the method will generalize. Please see the detailed comments below.
Correctness: Yes, the method and the empirical methodology seem correct.
Clarity: The paper is well written and clear. There are a few sentences here and there which I feel can be restructured. For example, the last line of paragraph 1.
Relation to Prior Work: Yes
Additional Feedback: The paper presents a framework for localizing sounding objects in an audiovisual scene. Overall, I liked the paper. The proposed approach is neat and makes sense for the most part. I have a few points of concern and would like to see the authors' responses to them. I would be happy to raise my overall score if the responses are satisfactory. -- Post rebuttal -- Most of my concerns have been addressed and I think the paper should be accepted. The score has been updated to reflect that. 1. The method relies on knowing which videos have a single source and which ones don't. Where is this information coming from? For the datasets used, this seems to be known a priori, implying some sort of manual labeling. Is that the case? 2. Is the number of object categories, K in Eq. 3, known a priori? Is it equal to the number of instruments in each dataset? This seems a strong assumption, especially given the claims around self-supervised learning. What would the authors do if this information were not known? 3. The output s_i^k gives the location of sounding objects for the k-th object category. GAP(s_i^k) averages this across all locations, giving a score that estimates the probability of the sounding object's presence in the scene. Given that in multi-source conditions two objects might be making sound, why take the softmax over all object categories? 4. I believe the authors were just following prior works, but I am surprised that the "balanced" set of AudioSet is used for testing; AudioSet comes with an "Eval" set for that purpose. 5. How is the binarization threshold for the mask set to 0.05? Is it a factor in performance? 6. As mentioned, one major limitation of this work is that it is empirically studied only in a music setting. I think this is a relatively easy condition and feels very synthetic compared to a more realistic one where different types of objects producing different types of sounds are considered.
I understand some prior works have done the same. 7. I am glad that the authors commented on 3 failure cases. What about situations where objects are partially occluded? How well do you think the method will work? --------------- While this paper does have limitations -- reliance on the availability of solo videos, knowledge of the number of sound sources in the dataset, and scaling to general everyday objects -- overall I think it is still a good paper. Several of my concerns have been addressed in the rebuttal and I have updated my score to reflect that.
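To make the softmax concern in point 3 concrete, here is a small illustrative sketch (NumPy, with hypothetical shapes and made-up values, not the authors' code): with two simultaneously sounding categories, a softmax over the GAP scores forces them to split probability mass, while independent sigmoids would let both score high.

```python
import numpy as np

# Hypothetical class-aware localization maps s_i^k for K=3 categories
# over an 8x8 spatial grid; categories 0 and 1 are strongly active
# (two sounding objects), category 2 is silent.
rng = np.random.default_rng(0)
s = rng.normal(0.0, 0.1, size=(3, 8, 8))
s[0] += 3.0  # sounding object, category 0
s[1] += 3.0  # sounding object, category 1

# GAP: average each map over all spatial locations -> one score per category
scores = s.mean(axis=(1, 2))

softmax = np.exp(scores) / np.exp(scores).sum()
sigmoid = 1.0 / (1.0 + np.exp(-scores))

# Softmax makes the two active categories share probability mass
# (each near 0.5), whereas independent sigmoids let both approach 1,
# which is the multi-source issue raised in point 3.
print("softmax:", softmax.round(2))
print("sigmoid:", sigmoid.round(2))
```

The sketch only illustrates the reviewer's question; whether the distribution-matching objective in the paper compensates for this is what the question asks the authors to clarify.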
Summary and Contributions: This paper addresses the problem of sound source localization in video frames. The authors propose a two-stage approach that first learns representations of sounding objects and then performs class-aware object localization based on the learned object representations. Experiments demonstrate that the proposed approach leads to some accuracy gains on this task.
Strengths: Nice motivation to consider a two-stage framework that first learns object representations in the single-source scenario and then predicts class-aware object localization maps in multi-source scenarios. Good results on sound source localization compared to prior methods in Table 2. Nice qualitative results on sound source localization.
Weaknesses: - It is claimed that the proposed method aims to discriminatively localize the sounding objects from their mixed sound without any manual annotations. However, the method also aims to do class-aware localization. As shown in Figure 4, the object categories are labeled for the localized regions for the proposed method. It is unclear to this reviewer whether the labels there are only for illustrative purposes. - Even though the proposed method doesn't rely on any class labels, it needs the number of categories of potential sound sources in the data to build the object dictionary. - Though the performance of the method is pretty good, especially in Table 2, the novelty/contribution of the method is somewhat incremental. The main contribution of the work is a new network design drawing inspiration from prior work on the sound source localization task. - The method assumes single-source videos are available for training in the first stage, which is also a strong assumption even though class labels are not used. Most in-the-wild videos are noisy and multi-source. It would be desirable to have some analysis showing how robust the system is to noise in videos, or how the system could learn without clean single-source videos to build the object dictionary.
Correctness: The claims and the proposed method are correct as well as the empirical methodology.
Clarity: The paper is generally nicely written and easy to follow.
Relation to Prior Work: The relations to prior work are well discussed and this work has major differences compared to previous contributions.
Additional Feedback: - It would be useful to show which categories the different colors represent. - The phrase audio/visual "message" sounds very strange. #######################AFTER REBUTTAL#################### I concur with the other reviewers on the merits of the paper, especially the nice quantitative results and the design of the self-supervised training paradigm. The rebuttal has addressed most of my questions. However, I still have the following concerns: 1) Although the authors claim that the paper doesn't need any manual annotations, they still need a dataset of solo videos (which doesn't come for free) and a rough estimate of the number of sources in the dataset. From this perspective, the Sound-of-Pixels baseline also doesn't need any manual annotations. 2) The essential step of building an object dictionary makes the method hard to scale to more objects or more categories (e.g., general AudioSet videos instead of just instruments, as also pointed out by R2). Nevertheless, based on the merits of the paper, I would be fine to see it accepted if the authors incorporate the proposed changes / additional results into the camera ready, and make the limitations/distinctions clear.
Summary and Contributions: This paper proposes to tackle sounding object localization in a cocktail party scenario, where the sounds are mixed and there might be silent objects. It also proposes a two-stage learning framework by first training an audiovisual localization network in single-sound scenarios and then using audiovisual consistency to match the distribution of visual objects and sounding objects.
Strengths: The proposed task is interesting and closer to realistic, real-life scenes. Their two-stage learning framework also has good quantitative results and beats other methods on most metrics.
Weaknesses: My biggest concern is that there is no quantitative ablation study on the effect of the audiovisual consistency objective in Equation 7. Although the t-SNE plot shows that alternative learning generates better visual features, there are no quantitative studies on how each stage affects the final results. The lack of this ablation weakens the second (technical) contribution, because the novel part clearly comes from using audiovisual consistency for category distribution matching. I also find it interesting that related work, including this work, doesn't employ temporal information from the video for localization; for example, finger movement is one obvious visual cue as to whether an instrument is making sound.
Correctness: Yes. The claims are supported by experiment results.
Clarity: The writing is clear and easy to understand.
Relation to Prior Work: Yes. This paper has clearly discussed the difference with previous work.