NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 2825 Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance

### Reviewer 1

Edit after rebuttal: Given the comparison with automatic differentiation in the rebuttal and the additional experiments, I am keen to change my vote to 6. I think the authors did a good job arguing and showing that there is a place for the gradient formula to be used in practice, although more experimentation will be required to support its use.

Summary: An in-depth theoretical analysis of Sinkhorn distances is given. In particular, differentiability and gradients are studied and used in simple experiments on finding Wasserstein barycenters and learning to reconstruct missing parts of images.

Detailed comments: I like Proposition 1, as it is a solid argument for using the “sharp” Sinkhorn distance instead of the regularized distance. I have concerns about the direction the paper takes from there on. While I appreciate the technical analysis of differentiability and gradients of Sinkhorn distances, I struggle to find any clear practical utility.

Differentiability: when implementing the Sinkhorn algorithm as an approximation of the Wasserstein distance, it is immediately obvious that it amounts to stacking multiple differentiable operations (matrix multiplication, element-wise division, and exponentiation at the end). In fact, several papers have already noticed and exploited this property in order to differentiate through the Sinkhorn algorithm in combination with neural networks.

Explicit gradient formula: in the era of deep learning, with automatic differentiation available in any library, I don’t see the use for that formula (whose implementation will need to be debugged!). One can simply plug the solution of the Sinkhorn algorithm back into the Frobenius product, as done in (7), then auto-diff that expression.

Appendix: I have only checked sections A and B, and they seemed sound.

I am open to being convinced otherwise by the authors, in case I am missing something. But I believe my concerns disqualify the paper for publication.
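To make the differentiability remark in the detailed comments concrete, here is a minimal NumPy sketch of the construction described there: Sinkhorn iterations written as a stack of smooth operations, with the resulting plan plugged back into the Frobenius product. The function names, the value of `lam`, the iteration count, and the toy histograms are all assumptions made for illustration and are not taken from the paper. The entropic term is written as `exp(-lam * M)`, which is one common convention and may differ from the paper's.

```python
import numpy as np

def sinkhorn_plan(a, b, M, lam=5.0, n_iters=1000):
    """Approximate transport plan between histograms a and b (hypothetical helper).

    Every step is a smooth operation: an element-wise exponential,
    matrix-vector products, and element-wise divisions.
    """
    K = np.exp(-lam * M)            # element-wise exponentiation of the cost
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)           # rescale to match the column marginal b
        u = a / (K @ v)             # rescale to match the row marginal a
    return u[:, None] * K * v[None, :]

def sharp_sinkhorn(a, b, M, lam=5.0, n_iters=1000):
    """Plug the Sinkhorn plan back into the Frobenius product <T, M>."""
    T = sinkhorn_plan(a, b, M, lam, n_iters)
    return np.sum(T * M)

# Toy problem (assumed data): two histograms on a 1-D grid, squared-distance cost.
x = np.linspace(0.0, 1.0, 5)
M = (x[:, None] - x[None, :]) ** 2
a = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
b = np.array([0.3, 0.1, 0.2, 0.1, 0.3])

T = sinkhorn_plan(a, b, M)
value = sharp_sinkhorn(a, b, M)
```

Because the loop only uses these operations, rewriting the same function in an automatic differentiation framework would give gradients with respect to `a` and `b` by backpropagating through the iterations. That is the auto-diff route referred to above, as opposed to an explicit gradient formula.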

### Reviewer 2

The paper focuses on the Sinkhorn approximation of the Wasserstein distance, which is used to compare probability distributions and finds applications in, e.g., image analysis. In particular, the authors derive methods for efficiently computing the “sharp” Sinkhorn approximation in place of its regularized version, which the authors show to be inferior to the sharp version. The authors provide a theoretical analysis of the two approximations, including a proof of consistency of supervised learning with the two Sinkhorn losses. In experiments, the authors demonstrate the benefits of the sharp Sinkhorn in providing better classification results.

Update: The author response is convincing, especially the comparison to the AD approach. A good job!