NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 9009
Title: Learning Representations by Maximizing Mutual Information Across Views

Reviewer 1

The authors introduce AMDIM, whose aim is to maximize the mutual information between representations of different 'views' of the same 'context'; they demonstrate the benefits of such a procedure on multiple tasks. The work is inspired by Deep InfoMax (DIM), which the authors claim to extend by using independently augmented versions of each input, by generating features across multiple scales, and by using a more powerful encoder. They present an NCE-based setup, demonstrate a mechanism for maximizing multi-scale mutual information, and end with mixture-based representations and experiments on a variety of data sets. Their main contributions seem to be the multi-scale setup and the use of data augmentation to generate multiple features of the same input; the mutual information between these features is then maximized, and the f_1 features from the trained model are then used as inputs to linear and MLP classifiers. (A toy sketch of this cross-view setup follows this review.)

The extensions to DIM introduced by the authors are interesting and seem to be useful; they are a logical step forward in the space of procedures that maximize mutual information for generating better data representations. There are, however, some questions that come to mind:

- It might be beneficial to add more details/explanations about the procedures discussed in this work. For example, an explanation of the different steps in Fig. 1c would make it easier to read.
- It might be beneficial to add a pictorial representation of the network architecture used to represent m_k.
- Figure 3 might need a more detailed explanation. In, say, Fig. 3a, what do the different rows represent? Furthermore, the similarity between two entities of interest, say A and B, is often a single number; how is it being represented in Fig. 3a?
- The multi-layer structure of a neural network often helps in capturing different levels of information at different layers. From this perspective, what exactly does it mean to have high mutual information between, say, f_5 and f_7 (or any such pair of hidden-layer features)?
- In principle, DIM can be used with the same encoder as AMDIM. It would be very interesting to see the performance of DIM in such a setup as a baseline result; this would help gauge the importance/influence of the augmentation/multi-scale setup used in AMDIM.
- In the context of views, what are the authors' thoughts on views generated through, say, a different camera angle, instead of processing applied to an image to produce augmented versions of it?

Originality: The work is an extension of a known technique (DIM), but it is a logical step forward in the space of procedures that maximize mutual information for generating better data representations; the main contributions are the multi-scale setup and the cross-view use of data augmentation summarized above.

Quality: The authors do give the mathematical formulations behind their work and also provide empirical evaluations.

Clarity: The manuscript is written reasonably well, but there is room for improvement, with certain places perhaps being a bit too concise.

Significance: The work shows useful insights into the usefulness of modern mutual-information maximization and can perhaps encourage more work in using such techniques for learning representations.
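To make the cross-view setup described above concrete, here is a minimal, hypothetical PyTorch-style sketch; `ToyEncoder` and all names are illustrative assumptions, not the paper's actual architecture or objective.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Illustrative stand-in encoder: returns a local feature map
    (an intermediate layer, cf. f_5) and a global summary (cf. f_1)."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(dim, dim)

    def forward(self, x):
        f_local = self.conv(x)                          # (B, dim, H', W')
        f_global = self.head(f_local.mean(dim=(2, 3)))  # (B, dim)
        return f_global, f_local

enc = ToyEncoder()
# x1, x2 stand for two independently augmented views of the same batch.
x1, x2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
g1, l1 = enc(x1)
g2, l2 = enc(x2)
# Cross-view, cross-scale pairing: g1 is scored against the local features
# l2 of the other view (and g2 against l1), so matching must survive both
# the augmentation gap and the scale gap before any NCE loss is applied.
```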

Reviewer 2

The authors present a self-supervised approach for visual representation learning. The approach is based on maximizing the mutual information between different "views" of the same data. In particular, the model is tasked with predicting features across augmented variations of each input, and across multiple scales. Training is based on a contrastive loss (InfoNCE), which is itself a lower bound on the mutual information (if the positive and negative instances to contrast are sampled carefully); a minimal sketch of this loss is given after the review. Using a large number of stabilization and regularization tricks, the model can outperform existing algorithms on a variety of standard benchmarks. Most notably, it outperforms AlexNet trained end-to-end (under the linear evaluation protocol) and sets a new state of the art on ImageNet.

The paper is in general well written, but the clarity of exposition can be improved (details below). In contrast to CPC, the model can compute all necessary feature vectors in one forward pass, which makes it more computationally attractive. On the other hand, the number of tricks required to make the training stable is stunning. Nevertheless, my score is based on the extremely strong empirical performance; I view this work as an instance of "move fast and break things", and the necessary understanding of the success of such methods will follow at a later point. My score is not higher because this work doesn't even cite several highly relevant, information-theoretically backed frameworks for multi-view learning (e.g., [1, 2]). I have several questions and suggestions in the "improvements" section, on the basis of which I will consider updating the score.

[1] https://homes.cs.washington.edu/~sham/papers/ml/info_multi.pdf
[2] https://homes.cs.washington.edu/~sham/papers/ml/regression_cca.pdf

========

Thanks a lot for the strong rebuttal. The new results are impressive, and the added ablation tests quantify the importance of each method. I would urge the authors to establish a connection to [1, 2], as they provide some theoretical understanding behind the proposed multi-view approach.
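As a reference point for the InfoNCE loss the reviewer mentions, here is a minimal sketch of its common batch-wise form (an assumed standard formulation, not code from the paper): diagonal pairs are positives, all other rows in the batch serve as negatives, and minimizing the loss maximizes a lower bound on the mutual information between the two feature sets.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE over a batch: row i of z_a should match row i of z_b;
    every other row in z_b acts as a negative sample."""
    z_a = F.normalize(z_a, dim=1)            # cosine-similarity scoring
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature     # (batch, batch) score matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)  # positives on the diagonal
```

The bound is only as tight as the negatives allow: if positives and negatives are not sampled carefully (as the review notes), the estimator can be loose or biased.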

Reviewer 3

In the local DIM method, the mutual information is maximized between a global summary feature vector, which depends on the full input, and a collection of local feature vectors pulled from an intermediate layer in the encoder. This paper extends this method in three ways:

1. It applies data augmentation to generate different views of an image, and then maximizes the mutual information between the two views instead of within a single view.
2. It proposes to maximize the mutual information between features from any pair of layers.
3. It uses a different, more powerful encoder.

Issues:

- My main issue is the motivation behind each of the novelties of the paper. In subsections 3.4 and 3.5, the paper starts explaining the proposed approach without giving an explanation of why the proposed modifications to local DIM will be helpful.
- Are these proposed modifications to local DIM really important? To get the answer, we need to look at the experimental results section, but there we find several issues:
  1) We do not see the results of local DIM (Hjelm et al., 2019) in the experiments. (A sketch of the local DIM pairing, for reference, follows this review.)
  2) There is no explanation of the results in Table 1 in the text of the paper. This table and its results should be explained in enough detail that we know which approach is better.
  3) The caption of Table 1 states that "Data augmentation had the strongest effect by a large margin", yet what we see in Table 1 is that multiscale has the largest effect, by a large margin.

---------------

After reading the authors' response: The authors presented a new set of results that shows the method works well, but my main issue is still clarity and the intuition behind each step of the proposed method. This issue has also been mentioned by Reviewer 2. It is also crucial to see the results of local DIM in all the tables (properly cited), not just in fig (b) of the response. For these reasons, I keep my score "below the threshold".
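For reference, here is a hypothetical sketch of the single-view local DIM pairing the reviewer asks to see as a baseline: the global summary of each image is scored against every spatial location in the batch, with locations from the same image as positives. This is a simplified NCE-style variant for illustration, not Hjelm et al.'s exact objective.

```python
import torch
import torch.nn.functional as F

def local_dim_loss(f_global, f_local, temperature=0.1):
    """Single-view global-to-local scoring: positives are the spatial
    locations of the same image; all other locations are negatives."""
    b, c, h, w = f_local.shape
    locs = f_local.permute(0, 2, 3, 1).reshape(b * h * w, c)  # (B*H*W, C)
    logits = f_global @ locs.t() / temperature                # (B, B*H*W)
    # Boolean mask marking, for each image's global feature (row),
    # the columns that come from that same image's feature map.
    pos = (torch.arange(b).repeat_interleave(h * w).unsqueeze(0)
           == torch.arange(b).unsqueeze(1)).to(logits.device)
    log_probs = F.log_softmax(logits, dim=1)
    return -log_probs[pos].mean()
```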