Review for NeurIPS paper: Deep Metric Learning with Spherical Embedding

NeurIPS 2020

Deep Metric Learning with Spherical Embedding

Review 1

Summary and Contributions: The paper discusses deep metric learning methods that use L2-normalized embeddings. The authors demonstrate the impact of the embedding norm by showing its effect on the gradients of cosine and Euclidean distance losses. They claim that these methods can be further improved by regularizing the embeddings to reside on the same sphere, adding a loss term that keeps each embedding's norm close to the average norm across the batch. They go on to show that this modification improves the performance of metric learning methods and reduces the variance of the embedding norm.
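The regularizer summarized above can be sketched in a few lines. This is only an illustration of the idea as the review describes it: the squared-deviation form and the names `l2_norm` and `sec_penalty` are assumptions made here, not the paper's code.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def sec_penalty(embeddings):
    """Mean squared deviation of each embedding norm from the batch-average norm.

    Assumed form of the penalty; the embeddings are the un-normalized network outputs.
    """
    norms = [l2_norm(e) for e in embeddings]
    mu = sum(norms) / len(norms)  # average norm over the batch
    return sum((n - mu) ** 2 for n in norms) / len(norms)

# A batch with widely varying norms versus one whose embeddings share a sphere.
spread_batch = [[3.0, 4.0], [0.6, 0.8], [6.0, 8.0]]
same_sphere = [[3.0, 4.0], [5.0, 0.0], [0.0, -5.0]]

print(sec_penalty(spread_batch))  # large: norms differ a lot
print(sec_penalty(same_sphere))   # zero: all norms are equal
```

In training, such a term would presumably be added to the base metric learning loss with a weighting coefficient, so that it vanishes exactly when all embeddings in the batch lie on one sphere.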

Strengths: The work is written clearly, with good motivation and simple explanations. The main idea, requiring the embeddings to reside on the same sphere, is intuitive yet novel (to the best of my knowledge). Furthermore, the authors give empirical evidence that this may be an issue (showing the variance of the norm in Figure 1) and later demonstrate how their method resolves it (Figure 4). The introduced method is clearly explained, comes with robustness measurements (with respect to hyperparameters), and is shown to provide significant improvements on the chosen metric learning task. Overall, this work provides a novel method that can be easily incorporated into several tasks and domains, with a clear explanation and good evidence to support it.

Weaknesses: Some formulations may be a bit confusing, such as dealing with both cosine and Euclidean distance losses on normalized embeddings. As the two are similar in the normalized setting (up to an additive constant and scaling), I feel that using both of them in Section 3 is unnecessary. I also think that the authors should add applications and use cases beyond metric learning, as their method is applicable to other methods where representations are constrained to a sphere (see the Additional Feedback section).
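The similarity the reviewer refers to can be made explicit for the squared Euclidean distance. For unit-norm embeddings a and b with angle θ between them (the symbols are chosen here for illustration):

```latex
\lVert a - b \rVert_2^2
  = \lVert a \rVert_2^2 + \lVert b \rVert_2^2 - 2\, a^\top b
  = 2 - 2\cos\theta ,
\qquad \text{when } \lVert a \rVert_2 = \lVert b \rVert_2 = 1 .
```

So the squared distance is an affine function of the cosine similarity, which is the "additive constant and scaling" relationship mentioned above.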

Correctness: I did not find any faults with the claims or the method. The empirical methodology is on par with similar works.

Clarity: The paper is well written.

Relation to Prior Work: The paper discusses its relation to previous works in the deep metric learning literature. It is missing some discussion of other L2-normalized methods, such as contrastive learning for unsupervised/self-supervised learning.

Reproducibility: Yes

Additional Feedback: I think the authors could benefit from applying their method to other norm-constrained methods, for example: "A Simple Framework for Contrastive Learning of Visual Representations" (Chen et al.) and "Supervised Contrastive Learning" (Khosla et al.).

Review 2

Summary and Contributions: The paper introduces a new training method for deep metric learning. To measure the similarity between two feature embeddings, angular distance, which disentangles the norm and direction of embeddings, is widely adopted. However, most existing methods remove the magnitude of each embedding by simply dividing it by its norm. The paper shows empirically and theoretically that this normalization strategy can make the update gradient unstable in batch optimization and hinder quick convergence of the embedding learning. The paper addresses this problem by adding to the original training objective a penalty term that directly regularizes the norm of the embeddings. The paper claims that the proposed method can improve the baseline methods by a large margin on both deep metric learning and face recognition tasks.
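The dependence of the gradient on the embedding norm can be checked numerically. The sketch below is an illustration written for this summary, not the paper's analysis: it uses a finite-difference gradient of the cosine similarity, and the helper names and example vectors are arbitrary assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def grad_norm_wrt_first(a, b, h=1e-6):
    """L2 norm of the central finite-difference gradient of cosine(a, b) w.r.t. a."""
    grads = []
    for i in range(len(a)):
        plus = list(a)
        plus[i] += h
        minus = list(a)
        minus[i] -= h
        grads.append((cosine(plus, b) - cosine(minus, b)) / (2 * h))
    return math.sqrt(sum(g * g for g in grads))

anchor = [1.0, 2.0, 2.0]   # un-normalized embedding
other = [2.0, -1.0, 0.5]   # second embedding, held fixed

small = grad_norm_wrt_first(anchor, other)
large = grad_norm_wrt_first([10 * x for x in anchor], other)

# Same direction, ten times the norm: the gradient shrinks by about a factor of ten.
print(small / large)
```

Two embeddings pointing in the same direction but with different norms therefore receive updates of different sizes under the same angular loss, which is the imbalance the penalty term is meant to reduce.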

Strengths: Soundness: The claim is well formulated mathematically, and the paper provides convincing empirical evidence (Figure 4 and Figure 5) to support it. Significance: One great advantage of the method is that it is complementary to other existing methods. Performance: The method consistently improves performance on all the datasets by a significant margin.

Weaknesses: Soundness: The proposed method lacks algorithmic novelty. The idea of regularizing the \ell_2 norm of embedding vectors has already been proposed in [1] and [2]. It should be clearly stated how the proposed method differs from these previous regularization methods. Also, the paper states that the traditional angular loss functions result in unstable update gradients (lines 137-147) but does not provide any empirical results on that. Performance: The paper does not provide code for reproducibility. Also, hyper-parameters for training a network for the deep metric learning task, such as the learning rate and number of epochs, are not listed in the paper. Furthermore, it does not provide any references for the choice of hyper-parameters. [1] Kihyuk Sohn, Improved Deep Metric Learning with Multi-class N-pair Loss Objective, NeurIPS 2016. [2] Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy, Deep Metric Learning via Facility Location, CVPR 2017.

Correctness: I checked the correctness of all the propositions.

Clarity: Yes. The paper is well written and easy to follow.

Relation to Prior Work: The paper does not mention some previous works on regularizing the \ell_2 norm of embedding vectors, which are similar to the proposed method.

Reproducibility: No

Additional Feedback: Related works: Please write a brief description of each prior work, instead of just listing the names (lines 65-67). Hyper-parameters: Are the hyper-parameters borrowed from the original papers or tuned by grid search? Please state clearly how the hyper-parameters and network structures were chosen. Performance: For an ablation study, please conduct an experiment with \mu=0 instead of the average embedding norm and show how the proposed method differs from the previous regularization methods. Also, the phrase 'normalized n-pair loss' is somewhat ambiguous. Does it stand for 'Tuplet Margin Loss' in [3]? If not, you should include [3] as a baseline. [3] Baosheng Yu and Dacheng Tao, Deep Metric Learning with Tuplet Margin Loss, ICCV 2019 ------- Thank you for doing new experiments and making changes to take the feedback into account. However, I disagree with your claim that the original N-pair loss uses the inner product without l2-normalization (see Section 3.2.2 in the original paper). So, I leave my rating unchanged.
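The requested \mu=0 ablation can be read as follows, assuming the penalty is a mean squared deviation of the embedding norms from \mu (an assumed form; the symbols f_i, N, and \bar{n} are introduced here for illustration only):

```latex
\mathcal{L}_{\text{reg}}(\mu) = \frac{1}{N}\sum_{i=1}^{N}\bigl(\lVert f_i\rVert_2 - \mu\bigr)^2,
\qquad
\mathcal{L}_{\text{reg}}(0) = \frac{1}{N}\sum_{i=1}^{N}\lVert f_i\rVert_2^2
  = \underbrace{\frac{1}{N}\sum_{i=1}^{N}\bigl(\lVert f_i\rVert_2 - \bar{n}\bigr)^2}_{\text{variance of the norms}} + \bar{n}^2,
\qquad
\bar{n} = \frac{1}{N}\sum_{i=1}^{N}\lVert f_i\rVert_2 .
```

Under this assumption, setting \mu to the batch-average norm penalizes only the spread of the norms, while \mu=0 additionally shrinks the average norm itself, which is closer to a plain \ell_2-norm penalty on the embeddings.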

Review 3

Summary and Contributions: In this paper, the authors propose a deep metric learning method with spherical embedding. Existing angular-distance-based methods ignore the norms of the learned features, which may lead to unstable gradients in batch optimization. This paper analyzes the effect of the embedding norms and also proposes a spherical embedding constraint (SEC) to minimize it. Experimental results show the effectiveness of the proposed method.

Strengths: 1) The problem of the embedding norms in angular-based methods is interesting and is ignored by most existing works. 2) The paper is easy to read. 3) Extensive experiments on various applications.

Weaknesses: 1) I think the main drawback of this paper is that the design of the SEC loss is somewhat too straightforward and trivial. It simply minimizes the distance between the norm of each feature and the average norm, and acts as an additional term on top of the existing losses. A more elegant design would be expected at the NeurIPS level. 2) It is not clear to the reviewer whether \mu is the average norm *of the batch* or *of all the features*. It seems to be the latter, but in this case how is \mu updated during training? Whenever the parameters of the metric are updated, all the features change, and \mu should be re-calculated. This detail is very important for the paper. 3) An important baseline to compare against is the following: after the last layer, we use an additional layer to normalize the output feature (e.g., to make the norms of all the features equal to 1). Then we use the normalized features as the output instead of the original ones, and the losses operate on the normalized features. The paper needs to show the advantages of the proposed SEC over this intuitive design. 4) In the experiments, we observe that some methods have a relatively large improvement after using SEC, while some do not. The reviewer wants to know whether the larger improvements in accuracy come from larger reductions in the variance of the feature norms.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: No

Additional Feedback: The rebuttal partly addresses my concerns. However, the reviewer still considers that the technical contributions are not significant enough, and also the experiments need improvement. Thus the reviewer keeps the original rating of the paper.

Review 4

Summary and Contributions: This paper proposes a regularizer to encourage the underlying embeddings (before normalization) to have similar norms. Based on the training dynamics of SGD, this results in more balanced updates of the angles between the embeddings.

Strengths: It is shown that the proposed method can be used to improve many existing methods such as those in [4, 10, 17, 11].

Weaknesses: The results shown in the analysis/theory part of this work are known. Proposition 1 was shown in Wang et al., Deep Metric Learning with Angular Loss, Section 3, Figure 3. Proposition 2 was shown in Zhang et al., Heated-Up Softmax Embedding, Section 3.3. The motivation of this work is based on vanilla SGD. It is unclear whether the problem exists with other widely used optimizers. In particular, the problems stated may not hold for Adam (the optimizer used in the experiments). In Adam, the gradient of each parameter is rescaled using exponential moving averages of that parameter's past gradients and squared gradients, and this may make the magnitude of the update similar regardless of the underlying norm. Some more analysis is needed, especially because the experiments are done with Adam.
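The optimizer concern can be illustrated with a single scalar parameter. The sketch below follows the standard Adam update rule with its usual default hyper-parameters and looks only at the first step from zero-initialized moments; it is a toy illustration of the reviewer's point, with function names chosen here, and says nothing about how the paper's experiments actually behave.

```python
import math

def first_adam_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Parameter update produced by the first Adam step (moments start at zero)."""
    m = (1 - beta1) * grad          # first-moment estimate
    v = (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1)         # bias correction at step t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)

def sgd_step(grad, lr=1e-3):
    """Parameter update produced by one vanilla SGD step."""
    return -lr * grad

# Gradients that differ by four orders of magnitude.
for g in (0.01, 1.0, 100.0):
    print(g, sgd_step(g), first_adam_step(g))
```

The SGD update grows linearly with the gradient, whereas the first Adam update stays close to the learning rate in magnitude. Note that Adam rescales per network parameter rather than per embedding, so whether the norm-dependent imbalance between embeddings survives under Adam is exactly the open question raised above.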

Correctness: The claims are correct but known.

Clarity: Yes.

Relation to Prior Work: I am wondering about the interplay between the proposed method and batch norm/weight norm. These methods can also bring the embedding norms close together. Is Figure 1 generated with batch norm?

Reproducibility: Yes

Additional Feedback: [I have read the authors' feedback and respectfully disagree with the authors that the results are novel in comparison with the two previous works. I did not see a reply to the SGD vs. other optimizer concern I raised.]