NeurIPS 2020

Multi-label classification: do Hamming loss and subset accuracy really conflict with each other?


Meta Review

This is an interesting theoretical paper that performs a cross-analysis of three popular loss functions used in multi-label classification. Although some reviewers found the analysis straightforward, as it is mainly based on known results from multi-class or multi-label classification, the interpretation of the results from the multi-label perspective is very interesting. It is worth underlining that there is still a gap in the theory of multi-label classification, and this paper tries to fill it.

Nevertheless, the paper has several flaws that make it a borderline case. The current discussion of the frameworks used in [11] and in this submission is misleading (the rebuttal does a slightly better job in this regard). Both frameworks have their own limitations, and those limitations should be clearly stated for both. The submission derives upper bounds, but these are only upper bounds, and moreover for a specific, very constrained model. This kind of analysis usually misses the aspect of consistency, which is a central point of [11]. The results therein clearly indicate that minimization of the two metrics can lead to completely different solutions, each having large regret with respect to the other metric; this behavior was also confirmed empirically.

The authors claim that "algorithms aiming to optimize HL often have better performance on the SA measure than the algorithms that optimize SA directly". Unfortunately, no reference is given that would justify this claim. I have quickly checked several papers and have not observed that phenomenon. In turn, there are papers that claim the opposite: algorithms designed for SA perform well for HL (see Classifier Chains: A Review and Perspectives, Read et al.). There are results across different papers showing very good performance of deep networks on both metrics when trained with different types of surrogate losses. These results, however, are not so surprising, as [11] identifies specific situations in which optimization of the two metrics coincides; one of them is low noise. So, if the model is expressive enough and well trained, we can expect good results for both metrics.

The authors do not discuss state-of-the-art methods for optimizing SA. These methods are indeed usually very complex, as they need to model conditional label dependencies in order to estimate (at least implicitly) the joint distribution of labels, since the Bayes decision for SA is the joint mode of the conditional distribution, not the marginal modes as for HL (the two decision rules are recalled below). The situation analyzed by the authors is different: they use the same model for both metrics, but with a different loss being optimized. Therefore, the authors should discuss the relation between the proposed algorithm and algorithms that model conditional label dependence, for example Struct-SVMs, which incorporate label dependencies into a joint feature space defined on y and x and use a loss function defined over all 2^c label vectors. It would be very interesting to learn what the differences are between the structured hinge loss and the hinge loss introduced by the authors, and whether the specific surrogate loss used by the authors is sufficient to find the joint mode of the distribution. The empirical study should also be extended with a well-controlled experiment on synthetic data and a comparison to other state-of-the-art methods.
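For concreteness, the two Bayes decision rules behind this distinction can be recalled as follows (the notation, with c binary labels and conditional distribution P(y | x), is generic rather than the submission's):

% Bayes-optimal predictions for Hamming loss (HL) and for subset 0/1 loss (the complement of SA);
% c binary labels, conditional distribution P(y | x); cf. [11].
\begin{align*}
  \text{HL:} \quad & h_j^*(x) = \mathbb{1}\big[\, P(y_j = 1 \mid x) \ge \tfrac{1}{2} \,\big], \quad j = 1, \dots, c && \text{(marginal modes)} \\
  \text{SA:} \quad & h^*(x) = \operatorname*{arg\,max}_{y \in \{0,1\}^c} P(y \mid x) && \text{(joint mode)}
\end{align*}
% The two rules coincide, e.g., under conditional independence of the labels or when the joint
% mode carries more than half of the conditional probability mass (a low-noise condition).

The marginal rule requires only c marginal estimates, whereas the joint mode in general requires modeling label dependence, which is precisely what makes SA-oriented methods complex; whether the hinge-type surrogate proposed by the authors can recover this joint mode is the open question raised above.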
The paper also lacks a discussion of similar results for multi-class classification and structured-output prediction. For example, the authors should relate the bound obtained for SA to bounds for multi-class classification, since multi-label classification under SA can be viewed as multi-class classification with an exponential number of classes. Moreover, the algorithm introduced by the authors could be used for multi-class classification as well: assume that the original class labels are represented by binary codes; then each bit of the code can be treated as a label of a multi-label problem, and solving this problem with the introduced algorithm would yield a solution to the multi-class problem (a concrete encoding is sketched below). I suppose that this is not a promising approach to multi-class classification, so why is it good for multi-label classification?
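As a minimal sketch of that reduction (the particular code assignment below is arbitrary and chosen only for illustration):

% Hypothetical encoding of an 8-class problem as a multi-label problem with c = 3 labels.
\[
  \{1, \dots, 8\} \longrightarrow \{0,1\}^3, \qquad \text{e.g.} \quad 1 \mapsto (0,0,0), \;\; 2 \mapsto (0,0,1), \;\; \dots, \;\; 8 \mapsto (1,1,1).
\]
% On the induced problem, subset accuracy coincides with multi-class 0/1 accuracy, whereas
% Hamming loss only counts wrong bits of the code: a prediction close in Hamming distance
% may still decode to a wrong class.

Unless the code has error-correcting properties, optimizing the bit-wise, HL-type loss on such a code seems a weak route to the multi-class solution, which is what motivates the question above.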