NeurIPS 2020

Learning to Approximate a Bregman Divergence


Review 1

Summary and Contributions: Edit: I read the rebuttal and the other reviews; my score has not changed. The paper proposes to learn an approximation of a Bregman divergence. Every Bregman divergence can be written in terms of a strictly convex differentiable function. The paper proposes to approximate this strictly convex differentiable function by a non-differentiable convex function written as a maximum of linear (hence convex) functions. The proposed framework is general and can be applied to different types of losses (e.g., quadruplet-wise constraints in the main paper).
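
To make the construction concrete, here is a minimal sketch (my own illustration, not the authors' implementation) of a max-affine surrogate phi(x) = max_k (a_k . x + b_k) and the Bregman divergence it induces; the slopes A and offsets b are placeholder parameters, and the maximizing piece's slope is used as a subgradient of phi.

    import numpy as np

    def make_max_affine(A, b):
        """A: (K, d) slopes, b: (K,) offsets -- placeholder parameters."""
        def phi(x):
            return np.max(A @ x + b)
        def subgrad(x):
            # slope of the affine piece attaining the maximum at x
            return A[np.argmax(A @ x + b)]
        return phi, subgrad

    def bregman(phi, subgrad, x, y):
        # D_phi(x, y) = phi(x) - phi(y) - <subgrad phi(y), x - y>
        return phi(x) - phi(y) - subgrad(y) @ (x - y)

    rng = np.random.default_rng(0)
    A = rng.normal(size=(10, 3))
    b = rng.normal(size=10)
    phi, g = make_max_affine(A, b)
    x, y = rng.normal(size=3), rng.normal(size=3)
    print(bregman(phi, g, x, y))  # non-negative, since phi is convex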

Strengths: Although the approach is simple, the theory of the paper is solid: Rademacher complexity and generalization error bounds are studied. Bregman divergences form a general framework that encompasses many divergences used in the machine learning literature (e.g., squared Euclidean distance, KL divergence, etc.). The experimental evaluation is also solid, although most baselines were published about a decade ago (or more). The theoretical study is significant and might help future work on this kind of generalized divergence. It is also relevant to the NeurIPS community.

Weaknesses: The main limitation/weakness of the paper is that it does not consider more recent baselines and/or deep models. There seems to be an ICML 2020 paper about deep extensions of this approach (line 265).

Correctness: The empirical methodology seems correct.

Clarity: The paper is well written and easy to understand although I have not checked the supplementary material.

Relation to Prior Work: Yes. The main novelties compared to previous work are the theoretical generalization error bounds.

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: The paper studies the problem of approximating arbitrary Bregman divergences. The underlying idea is to approximate the underlying convex function of the divergence by a piecewise linear function (constructed by selecting, for a given example x, the linear function with maximal value among the set of linear functions used for the approximation). The Rademacher complexity of the class of approximated Bregman divergences using piecewise linear functions is derived. Finally, an algorithm to learn an approximated Bregman divergence in a Metric Learning setting is proposed. This algorithm comes with generalization guarantees and is shown to be competitive on several tasks.
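
To illustrate how such a learned divergence plugs into relative (quadruplet-wise) constraints, here is a minimal sketch of a hinge penalty on quadruplets; this is my own illustration with an assumed margin parameter, not the paper's exact objective, and D stands for any divergence function such as the max-affine one described above.

    def quadruplet_hinge_loss(D, X, quadruplets, margin=1.0):
        """quadruplets: iterable of (i, j, k, l) index tuples meaning that
        the pair (i, j) should be closer than the pair (k, l) under D."""
        total = 0.0
        for i, j, k, l in quadruplets:
            # penalize violations of D(x_i, x_j) + margin <= D(x_k, x_l)
            total += max(0.0, margin + D(X[i], X[j]) - D(X[k], X[l]))
        return total / max(1, len(quadruplets))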

Strengths: - The idea of using piecewise linear functions to approximate Bregman divergences seems original and is of interest to the Metric Learning community, but also to a wider audience given the popularity of Bregman divergences. - The proposed algorithm is sound and backed by theoretical and empirical evidence of its interest. - Theoretically, under reasonable assumptions on the convex function under consideration, it is shown that there exists a piecewise linear function that is a tight approximation, provided that the number of linear functions considered is large enough. - The Rademacher complexity of Bregman divergences parametrized by piecewise linear functions is investigated.

Weaknesses: - There is a gap between the setting studied in the theoretical analysis and the one considered in the experiments. Indeed, in the theoretical analysis, pairs or quadruplets of points are assumed to be drawn i.i.d. from a joint distribution, while, in the experiments, the points themselves are assumed to be drawn i.i.d. (and, thus, the pairs and quadruplets are no longer i.i.d.). - The generalization guarantees presented in this paper are not directly comparable to standard metric learning results due to the aforementioned i.i.d. issue.

Correctness: The claims, the method, and the empirical methodology appear to be correct.

Clarity: The paper is mostly clear. Nevertheless, some parts could be improved: - Line 27: a_k and b_k are never introduced, and it is not clear how they can be obtained at this stage. - Line 143: \phi(x_i) should be equal to z_i using (3) rather than (2). - The first constraint in Optimization Problem (7) is hard to parse as it spans two lines. - Line 241: Table 4 is in fact Table 1.

Relation to Prior Work: The relation to prior work appears to be clearly discussed.

Reproducibility: Yes

Additional Feedback: 1) The gap between theory and practice is a bit disappointing. The footnote at the bottom of page 6 gives a good example of a practical setting aligned with the theory, and it would have been interesting to design some experiments in this direction. 2) In lines 260-264 of the main paper, preliminary experiments for image retrieval using deep features as input are mentioned. The results are deferred to the supplementary. However, I could not find them there. 3) The regression example in the supplementary shows that the proposed approach is indeed more flexible than Mahalanobis metric learning approaches. This is an interesting experiment that should be mentioned in the main paper. 4) The proposed approach performs quite well on the clustering and ranking experiments. However, its results are slightly lower in terms of KNN accuracy. Similarly, the results obtained on ionosphere in the supplementary are far less convincing. Is there any explanation for these differences? On the one hand, the paper presents an interesting and sound approach to approximate Bregman divergences using piecewise linear functions. On the other hand, there is a small gap between the setting considered in theory and the one considered in practice. Overall, I think that this paper could be accepted. --- The rebuttal did not convince me that there is no gap between theory and practice and, thus, my comment 1 still holds. My comments 2 and 4 were not addressed.


Review 3

Summary and Contributions: A method for metric learning based on Bregman divergences is proposed. The metric, i.e., the Bregman divergence, is adaptively optimized on a given dataset using a piece-wise linear function. The validity of the approximation is theoretically investigated, and its performance is numerically verified.

Strengths: Using the piece-wise linear function, the method can adaptively approximate the generating function of the Bregman divergence, and its performance is comparable to or better than that of other metric learning methods.

Weaknesses: The approximation ability of the piece-wise linear function depends mainly on the number K of linear functions. In Theorem 2, the generalization error bound increases with K, which seems strange. Is this correct? In the experiments, it is unclear how K is determined.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 4

Summary and Contributions: This paper proposes to fit a Bregman divergence to a given dataset, where the convex function generating the divergence is restricted to be a max-affine function. The learning algorithm is formulated as a linear program. Bounds on the approximation error of the max-affine function and on the generalization error of the metric learning problem with the learned Bregman divergence are derived. Experimental results support the effectiveness of the proposed metric learning method compared to classical ones.

Strengths: Approximating a convex function by max-affine functions is reasonable. Approximation and generalization error bounds are derived, which seem sound. Experimental results show an advantage over conventional methods.

Weaknesses: The problem statement is not satisfactory; hence I am not confident that I understand the problem and the authors' claims properly. For example, the assumption K = n (even for the sake of simplicity) seems to make the derived bound useless.

Correctness: I followed proofs in supplementary material, and they seem to be correct.

Clarity: There are many unclear points in the formulation and presentation. The optimization problem (7) is ambiguous: what is the input data, and what is a pre-determined parameter? I believe the a_i are optimized in problem (7), but this should be clearly stated. If a_i, i = 1, ..., K, is a set of vectors to be optimized, the number of parameters to be optimized grows with the number of training samples. The relationship between "n" and "m" is not explained. If "m" is the number of constraints and is proportional to "n", and K = n, the generalization bound is meaningless. I guess, partly for that reason, it is important to consider the K < n case; K = n is an oversimplification. It is stated that in reality the authors use a K smaller than n, but I cannot find an explanation of how to determine an appropriate K. Symbols are not used in a consistent manner. Lines 54-56 state: "Discuss a generalization error bound for metric learning in the Bregman setting of O_p(n^{-1/2}), where n is the number of training points; this matches the bound known for the strictly less general Mahalanobis setting." Here "n" is said to be the number of training points, but it does not appear in the derived generalization bound.
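
To make the concern about K = n concrete, here is a sketch of a generic convex-regression-style fit in which one value z_i and one slope a_i are optimized per training point under convexity constraints; it is my own illustration of why the parameter count grows with n, not a reproduction of the paper's optimization problem (7), and the absolute-error objective and cvxpy solver are assumptions.

    import cvxpy as cp
    import numpy as np

    def fit_max_affine(X, y):
        """Fit phi(x_i) ~ y_i with one affine piece per point (K = n)."""
        n, d = X.shape
        z = cp.Variable(n)        # fitted values phi(x_i)
        A = cp.Variable((n, d))   # subgradients a_i: O(n * d) parameters
        # convexity: z_i >= z_j + a_j . (x_i - x_j) for all i, j
        cons = [z[i] >= z[j] + A[j] @ (X[i] - X[j])
                for i in range(n) for j in range(n) if i != j]
        prob = cp.Problem(cp.Minimize(cp.sum(cp.abs(z - y))), cons)
        prob.solve()
        return z.value, A.value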

Relation to Prior Work: I'm not very familiar with recent work on metric learning. Classical and representative methods are properly cited and compared with the proposed method. However, I feel the cited metric learning papers are relatively old. I understand that recent developments in metric learning are concentrated on deep-learning-based methods, and a nearly linear method like the one in this submission is not actively studied. So, the relatively old references do not affect my rating of this paper much.

Reproducibility: Yes

Additional Feedback: Please explicitly write the optimization objectives in the mathematical programs. The first term on the r.h.s. of the bound in Theorem 2 is strange: "n", "i", and "t" seem to be confused. Are the \deltas in Eq. (11) and Theorem 2 the same thing? I guess not. I understand that \delta is a standard notation for PAC bounds, but the same symbol should not be used for two different quantities in the manuscript. ------------ My main concerns are the usefulness of the bound (particularly the relationship between n and K) and presentation issues. The confusion caused by the misuse of symbols is addressed by the authors' rebuttal. I understand the rationale behind the seemingly useless generalization bound. I like the main idea of learning a Bregman divergence with a piece-wise linear approximation, and I raised my score.