Paper ID: 754
Title: Similarity Component Analysis
Reviews

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a model of similarity in which a binary similarity score between two vectors (u, v) is modeled as a noisy-or over a set of (learned) Mahalanobis distances. The motivation is that abstract "similarity" is often non-transitive and is not well modeled by a single, fixed distance metric over a vector space. Generally speaking, the paper is clearly written and the method makes intuitive sense. The algorithm seems to work in practice, although the experimental results could be expanded a bit to better illustrate the method. Overall, I enjoyed reading this paper.
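For concreteness, here is a minimal numpy sketch of the model as I understand it (the function names and the exact sigmoid parameterization of the component probabilities are my guesses, not the authors' code):

    import numpy as np

    def component_prob(u, v, M, b):
        # One latent component: probability of "similar" as a sigmoid of a
        # learned (squared) Mahalanobis distance; closer pairs score higher.
        d = (u - v) @ M @ (u - v)
        return 1.0 / (1.0 + np.exp(d - b))

    def sca_similarity_prob(u, v, Ms, bs, thetas):
        # Noisy-or combination: the pair is "similar" unless every
        # component independently fails to assert similarity.
        p_none = 1.0
        for M, b, theta in zip(Ms, bs, thetas):
            p_none *= 1.0 - theta * component_prob(u, v, M, b)
        return 1.0 - p_none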

A few comments:

- In the introduction, the authors stress the need for supporting multi-faceted models of similarity, but the wording of this section seems a bit careless. The centaur example (line 42) illustrates that "similarity" is non-transitive, but the notion of transitivity does not directly carry over to distances, as some discretization must be applied in order to make a hard similarity assessment from distances. It is easy to construct an example in which a distance metric produces a non-transitive nearest-neighbor connectivity graph (for instance, three collinear points x, y, z with consecutive gaps just under the similarity threshold: x is similar to y and y to z, yet x is not similar to z). The authors should be more careful in their statement of what exactly is to be gained by using multiple metrics.

- A similar method to model multi-faceted similarity was recently proposed by [1]. Although the context there was visualization (not similarity prediction), the authors should probably cite this work as well.

- Line 125--126: Seems like a typo here. The [k] rows/columns should be the only ones allowed to be NON-zero. It is also unclear whether the sparsity constraint is enforced in this work, or proposed as a future extension.

- Section 3.1: in generating the synthetic dataset, there does not appear to be any processing to ensure consistency in the training set similarities. As a result, the comparison to LMNN may not be entirely fair, as it requires transitive similarity structure (labels).

- Section 3.1: How is a similarity score predicted with LMNN? Is a distance threshold estimated from the training set?

- Table 1: should this table also include accuracy on the second synthetic dataset (Figure 2b)?

- Figure 2b: the visualization of the recovered metrics is nice, but it would be good to include a quantitative measure as well, such as the angle/alignment between the true and recovered metrics (see the sketch after this list of comments).

- Section 3.2: in the MNIST example, is there much diversity in the recovered metrics (e.g., as measured by alignment)? What does the theta vector look like? Are the multiple metrics actually used?

- How does SCA-Diag perform on the full (uncompressed) MNIST data?

- Table 3: there are no results for ITML/LMNN/SCA on the BoW/ToW representations. Is this because the data was too high-dimensional? Why not compress by PCA as was done in the MNIST experiment? For completeness, it would be nice to see full metric learning (after PCA) as well as diagonal metric learning (without PCA) on both datasets.
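As a concrete instance of the alignment measure suggested in the Figure 2b comment above (the particular choice of a normalized Frobenius inner product is my suggestion, not something from the paper):

    import numpy as np

    def metric_alignment(M_true, M_learned):
        # Cosine of the "angle" between two metrics: the Frobenius inner
        # product divided by the Frobenius norms; 1.0 means perfectly aligned.
        inner = np.sum(M_true * M_learned)
        return inner / (np.linalg.norm(M_true) * np.linalg.norm(M_learned))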
Q2: Please summarize your review in 1-2 sentences
This paper proposes a probabilistic model of pairwise similarity as noisy-or of Mahalanobis distances between two data points. The paper is interesting, well-motivated and clearly written. The experiments could be a bit more thorough.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a graphical model for measuring similarity, based on the assumption that two objects are similar because some (instead of all) of their components are similar. Following this design, the authors show empirically that their model performs better than recent metric learning approaches when the number of components (K) is larger than 1.

Overall, I like this paper. The model design captures one of the fundamental issues of similarity measures and differentiates the approach from the metric learning framework. Despite the fact that it is a graphical model, the approach still cleverly leverages some aspects of metric learning. Specifically, the component similarity is essentially a Mahalanobis distance (passed through a sigmoid function to turn it into a probability). This design ensures that the model generalizes the metric learning approaches, which is reflected in the experimental results as well (when K=1). For some applications/tasks, the performance does improve as K increases, which demonstrates that this model fits those cases better.

On the other hand, since the model can be viewed as an extension of Mahalanobis metric learning, the scalability of the learning process still seems restricted. For instance, PCA needs to be employed to reduce the dimensionality of the raw data before applying this algorithm (though the same applies to ITML and LMNN). Special constraints on M_k need to be enforced (SCA-DIAG, where the component similarity becomes a weighted Euclidean distance [1]) to reduce the complexity; a sketch of this special case is given below. Such issues are not critical, but it would be nice if the authors could discuss them further in the paper. Another issue I wish the authors would discuss further is the sparseness of M_k. The paper mentions that M_k is naturally sparse after learning. Why this happens is nevertheless not very clear, since there seems to be no special regularization on M_k. It would be interesting if the authors could provide some insight.
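For concreteness, a minimal sketch of the SCA-DIAG special case mentioned above (variable names mine): with M_k = diag(w_k), the quadratic form collapses to a weighted squared Euclidean distance, dropping the per-pair cost from O(D^2) to O(D):

    import numpy as np

    def diag_component_distance(u, v, w):
        # SCA-Diag: with M = diag(w), the Mahalanobis quadratic form
        # (u-v)^T M (u-v) reduces to sum_i w_i * (u_i - v_i)^2.
        diff = u - v
        return np.sum(w * diff * diff)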

Other comments:

1. The example of CENTAUR, MAN and HORSE is in fact the canonical example showing that similarity need not obey the triangle inequality, although the authors did not make this argument explicit. See Figure 3 in [2].
2. The component-similarity design is especially suitable for word similarity, where each word can have multiple senses. For instance, JAGUAR and AUTO are similar, and JAGUAR and PUMA are also similar; however, the two pairs are similar because of different components (e.g., senses) of the word JAGUAR. One recent paper that specifically addresses this issue is [3]. It would be interesting to know whether the proposed model can subsume their approach.
3. Perhaps it is out of the scope of this paper, but the model design bears some resemblance to the Siamese neural network approach [4,5,6]. It should be fairly straightforward to design an NN architecture with component-similarity subnetworks, and it would thus be interesting to see how it compares to the graphical model approach (a rough sketch follows below).
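A rough numpy sketch of what item 3 envisions (an entirely hypothetical architecture, shown only to make the analogy concrete): each component becomes a small Siamese subnetwork with a shared nonlinear projection, and the component outputs are combined with a noisy-or as in the paper:

    import numpy as np

    def siamese_component_similarity(u, v, Ls, bs, thetas):
        # Each component projects both inputs with the same weights L_k
        # (Siamese weight sharing) through a tanh hidden layer, scores the
        # squared distance in that space, and the component probabilities
        # are combined with a noisy-or.
        p_none = 1.0
        for L, b, theta in zip(Ls, bs, thetas):
            d = np.sum((np.tanh(L @ u) - np.tanh(L @ v)) ** 2)
            p_k = 1.0 / (1.0 + np.exp(d - b))  # component similarity prob.
            p_none *= 1.0 - theta * p_k
        return 1.0 - p_none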

[1] Schultz and Joachims. Learning a distance metric from relative comparisons. NIPS-2004.
[2] Lin. An information-theoretic definition of similarity. ICML-1998.
[3] Reisinger and Mooney. Multi-Prototype Vector-Space Models of Word Meaning. NAACL-2010.
[4] Bromley, Bentz, Bottou, Guyon, LeCun, Moore, Sackinger and Shah. Signature verification using a “Siamese” time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):669–688, 1993.
[5] Yih, Toutanova, Platt and Meek. Learning Discriminative Projections for Text Similarity Measures. CoNLL-2011.
[6] Weston, Bengio, and Usunier. Large scale image annotation: Learning to rank with joint word-image embeddings. ECML-2010.

---

Thanks for your response to the comments.
Q2: Please summarize your review in 1-2 sentences
This paper proposes a graphical model that provides a principled way to learn a similarity measure, which can be viewed as a generalization of the common metric learning approach. Experiments show convincing results over the metric learning approaches, although scalability could still be an issue for high-dimensional data.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a new metric learning method that learns multiple latent components of a similarity measure; the final similarity is, in effect, the maximal similarity over all the latent components (each latent component is a separate similarity measure, and if any of them says “similar”, the final similarity measure outputs “similar” for a pair of subjects).

Many metric learning methods use a real-valued similarity measure that outputs a score measuring the magnitude of the similarity, such as a Mahalanobis distance or a probabilistic distance. Hence, even if man and centaur, or centaur and horse, have relatively high similarity scores, man and horse can have a lower score. However, I agree that it can be important to recognize and learn that there are different latent components (specific ways of comparing subjects) among different labelers, and to learn what they are, since these eventually make human annotators label the subjects the way they do. The proposed method is interesting and can be useful.

On page 3, under “latent components”, M_k is sparse, but according to the context, the model should use the features at the [k]-th rows and columns, so why are those entries zeros?

The paper omits a very important derivation on page 4, namely how to learn \theta, and focuses on the learning of M_k and b_k; however, the formula in Eq. (8) used to calculate them depends on the omitted part.

Further, the method switches to a symmetric decomposition of M_k, and the authors state that they observe no significant detrimental effect from the solutions arrived at in this way. It would be better to include some justification and evidence.

If M_k degenerates into a diagonal matrix, the method is similar to methods that perform feature selection for metric learning (although the proposed method may produce multiple subsets of features), and it should be compared with some of those methods in terms of classification performance on real data.
Q2: Please summarize your review in 1-2 sentences
This paper presents a new metric learning method that learns multiple latent components of a similarity measure, with the final similarity taken to be the maximal similarity over all the latent components. It can be a useful method.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all the reviewers for their comments. We are especially pleased that they found the proposed method interesting and clearly motivated, and the experimental results convincing.

We first address several common comments, followed by individual ones.

The example of CENTAUR, MAN and HORSE

We appreciate all the suggestions (including the extra example and references) and will rephrase this part more precisely and clearly. The example is meant to show that similarity is a broad notion (allowing, for example, non-transitivity and non-metricity), and thus modeling it with a single metric function is limiting.

Sparsity of parameters

We will fix the typo in lines 125-126: “We restrict M_k to be sparse, in particular, the corresponding [k]-th rows and columns are zeroes”. The “zeroes” should be “non-zeroes”.

Further clarification/discussion of why M_k is naturally sparse after learning: we believe there are two forces at play. First, much of the existing work on metric learning has shown that the metrics being learned are often low-rank, even without explicit regularization; our work leverages this by casting components as metric-based models. Second, we use the (noisy-)OR gate to combine local similarity values. Thus, if one component is able to predict similarity, the other components are “suppressed”: they receive little learning signal, since their similarity values are less relevant to the final outcome. This is evidenced by the component-wise activation patterns in Fig. 3, where we show that each component is often activated selectively.
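To make the suppression argument precise: writing the marginal probability in the standard noisy-or form, P(s=1) = 1 - \prod_j (1 - \theta_j p_j), where p_j is component j's similarity probability, the gradient with respect to p_k is \theta_k \prod_{j != k} (1 - \theta_j p_j). This product shrinks toward zero as soon as any other component j already asserts similarity (\theta_j p_j close to 1), so components that are not needed to explain a pair receive little learning signal.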


===R6 ===

Comparison to LMNN in Sec. 3.1: yes, we threshold the distance computed with the LMNN metric; the threshold is tuned on a development dataset. The synthetic dataset was generated without using labels, so its similarity would be considered “noisy” when compared to label-induced similarity.
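In code, this evaluation protocol is roughly the following (a sketch with illustrative names, not the actual script):

    import numpy as np

    def predict_similar(pairs, M, tau):
        # Declare a pair similar when its Mahalanobis distance under the
        # LMNN-learned metric M falls below the threshold tau.
        return [(u - v) @ M @ (u - v) < tau for u, v in pairs]

    def tune_threshold(dev_pairs, dev_labels, M, candidates):
        # Pick the threshold with the best accuracy on the development set.
        def accuracy(tau):
            preds = predict_similar(dev_pairs, M, tau)
            return np.mean([p == y for p, y in zip(preds, dev_labels)])
        return max(candidates, key=accuracy)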

Experiments: we appreciate the suggestions and will incorporate them in future texts.

=== R8 ===

Eq. (8) describes how M_k and b_k are estimated, which is decoupled from how the \theta_k are estimated (please refer to supplementary material S1.1). We will make this clear in the text.

Decomposition of M_k: this is often used in existing metric learning work to speed up optimization. The intuition is that in most cases the local optimum is actually globally optimal (cf. Burer and Monteiro, Math. Programming, 2003).
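For illustration, the reparameterization amounts to the following (a sketch; names are illustrative): we optimize an unconstrained factor L in place of the PSD-constrained matrix M, since M = L L^T is positive semidefinite by construction:

    import numpy as np

    def distance_via_factor(u, v, L):
        # (u-v)^T (L L^T) (u-v) = ||L^T (u-v)||^2, so gradient updates on
        # the unconstrained factor L never leave the PSD cone and no
        # projection step is needed during optimization.
        z = L.T @ (u - v)
        return z @ z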

Feature selection: this is an interesting angle to pursue. Thank you for suggesting it.