{"title": "Maximizing Subset Accuracy with Recurrent Neural Networks in Multi-label Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 5413, "page_last": 5423, "abstract": "Multi-label classification is the task of predicting a set of labels for a given input instance. Classifier chains are a state-of-the-art method for tackling such problems, which essentially converts this problem into a sequential prediction problem, where the labels are first ordered in an arbitrary fashion, and the task is to predict a sequence of binary values for these labels. In this paper, we replace classifier chains with recurrent neural networks, a sequence-to-sequence prediction algorithm which has recently been successfully applied to sequential prediction tasks in many domains. The key advantage of this approach is that it allows to focus on the prediction of the positive labels only, a much smaller set than the full set of possible labels. Moreover, parameter sharing across all classifiers allows to better exploit information of previous decisions. As both, classifier chains and recurrent neural networks depend on a fixed ordering of the labels, which is typically not part of a multi-label problem specification, we also compare different ways of ordering the label set, and give some recommendations on suitable ordering strategies.", "full_text": "Maximizing Subset Accuracy with Recurrent Neural\n\nNetworks in Multi-label Classi\ufb01cation\n\nJinseok Nam1, Eneldo Loza Menc\u00eda1, Hyunwoo J. Kim2, and Johannes F\u00fcrnkranz1\n\n1Knowledge Engineering Group, TU Darmstadt\n\n2Department of Computer Sciences, University of Wisconsin-Madison\n\nAbstract\n\nMulti-label classi\ufb01cation is the task of predicting a set of labels for a given input\ninstance. 
Classifier chains are a state-of-the-art method for tackling such problems, which essentially converts this problem into a sequential prediction problem: the labels are first ordered in an arbitrary fashion, and the task is to predict a sequence of binary values for these labels. In this paper, we replace classifier chains with recurrent neural networks, a sequence-to-sequence prediction algorithm which has recently been applied successfully to sequential prediction tasks in many domains. The key advantage of this approach is that it allows focusing on the prediction of the positive labels only, a much smaller set than the full set of possible labels. Moreover, parameter sharing across all classifiers allows information from previous decisions to be exploited more effectively. As both classifier chains and recurrent neural networks depend on a fixed ordering of the labels, which is typically not part of a multi-label problem specification, we also compare different ways of ordering the label set, and give some recommendations on suitable ordering strategies.

1 Introduction

There is a growing need for developing scalable multi-label classification (MLC) systems, which, e.g., allow assigning multiple topic terms to a document or identifying objects in an image. While the simple binary relevance (BR) method approaches this problem by treating multiple targets independently, current research in MLC has focused on designing algorithms that exploit the underlying label structures. More formally, MLC is the task of learning a function f that maps inputs to subsets of a label set L = {1, 2, ..., L}. Consider a set of N samples D = {(x_n, y_n)}_{n=1}^N, each of which consists of an input x ∈ X and its target y ∈ Y, where the (x_n, y_n) are assumed to be i.i.d. following an unknown distribution P(X, Y) over a sample space X × Y. We let T_n = |y_n| denote the size of the label set associated to x_n, and C = (1/N) Σ_{n=1}^N T_n the cardinality of D, which is usually much smaller than L. Often, it is convenient to view y not as a subset of L but as a binary vector of size L, i.e., y ∈ {0, 1}^L. Given a function f parameterized by θ that returns predicted outputs ŷ for inputs x, i.e., ŷ ← f(x; θ), and a loss function ℓ : (y, ŷ) → R which measures the discrepancy between y and ŷ, the goal is to find an optimal parametrization f* that minimizes the expected loss on an unknown sample drawn from P(X, Y):

    f* = argmin_f E_X [ E_{Y|X} [ ℓ(Y, f(X; θ)) ] ].

While the expected risk minimization over P(X, Y) is intractable, for a given observation x it can be simplified to

    f*(x) = argmin_f E_{Y|X} [ ℓ(Y, f(x; θ)) ].

A natural choice for the loss function is the subset 0/1 loss, defined as ℓ_{0/1}(y, f(x; θ)) = I[y ≠ ŷ], which is a generalization of the 0/1 loss in binary classification to multi-label problems. It can be interpreted as an objective to find the mode of the joint probability of label sets y given instances x: E_{Y|X} [ ℓ_{0/1}(Y, ŷ) ] = 1 − P(Y = y | X = x). Conversely, 1 − ℓ_{0/1}(y, f(x; θ)) is often referred to as subset accuracy in the literature.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Subset Accuracy Maximization in Multi-label Classification

For maximizing subset accuracy, there are two principled ways of reducing an MLC problem to multiple subproblems. The simplest method, label powerset (LP), defines the set of all possible label combinations S_L = {{1}, {2}, ..., {1, 2, ..., L}}, from which a new class label is assigned to each label subset consisting of positive labels in D.
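As an illustration, the LP reduction just described can be sketched as follows; the helper names and toy data are ours, not from the paper:

```python
# Toy sketch of the label-powerset (LP) reduction: every distinct set of
# positive labels observed in the training data becomes one atomic class,
# so at most min(N, 2^L) classes can occur.

def fit_label_powerset(label_sets):
    """Map each distinct frozenset of positive labels to a class id."""
    classes = {}
    for labels in label_sets:
        key = frozenset(labels)
        if key not in classes:
            classes[key] = len(classes)
    return classes

def lp_transform(label_sets, classes):
    """Replace each label subset by its LP class id."""
    return [classes[frozenset(labels)] for labels in label_sets]

train_y = [{1, 3}, {2}, {1, 3}, {1, 2, 3}]
classes = fit_label_powerset(train_y)   # 3 distinct subsets -> 3 classes
targets = lp_transform(train_y, classes)
```

Note how the repeated subset {1, 3} maps to a single class, while every new combination creates a new one, which is exactly why LP suffers from data scarcity when label combinations are rare.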
LP, then, addresses MLC as a multi-class classification problem with min(N, 2^L) possible labels such that

    P(y_1, y_2, ..., y_L | x)  --LP-->  P(y_LP = k | x),    (1)

where k ∈ {1, 2, ..., min(N, 2^L)}. While LP is appealing because most methods well studied in multi-class classification can be used, training LP models becomes intractable for large-scale problems with an increasing number of labels in S_L. Even if the number of labels L is small enough, the problem is still prone to suffer from data scarcity because each label subset in LP will in general only have a few training instances. An effective solution to these problems is to build an ensemble of LP models learning from randomly constructed small label subset spaces [29].

An alternative approach is to learn the joint probability of labels, which is prohibitively expensive due to the 2^L label configurations. To address this problem, Dembczyński et al. [3] have proposed the probabilistic classifier chain (PCC), which decomposes the joint probability into L conditional probabilities:

    P(y_1, y_2, ..., y_L | x) = Π_{i=1}^L P(y_i | y_{<i}, x),    (2)

where y_{<i} = {y_1, ..., y_{i−1}} denotes the set of labels that precede a label y_i in computing conditional probabilities, and y_{<i} = ∅ if i = 1. For training PCCs, L functions need to be learned independently to construct a probability tree with 2^L leaf nodes. In other words, PCCs construct a perfect binary tree of height L in which every node except the root corresponds to a binary classifier. Obtaining the exact solution of such a probabilistic tree therefore requires finding an optimal path from the root to a leaf node. A naïve approach for doing so requires 2^L path evaluations in the inference step, and is therefore also intractable.
However, several approaches have been proposed to reduce the computational complexity [4, 13, 24, 19].

Apart from the computational issue, PCC also has a few fundamental problems. One of them is the cascading of errors as the length of a chain gets longer [25]. During training, the classifiers f_i in the chain are trained to reduce the errors E(y_i, ŷ_i) by enriching the input vectors x with the corresponding previous true targets y_{<i} as additional features. In contrast, at test time, f_i generates samples ŷ_i or estimates P(ŷ_i | x, ŷ_{<i}), where ŷ_{<i} are obtained from the preceding classifiers f_1, ..., f_{i−1}.

Another key limitation of PCCs is that the classifiers f_i are trained independently according to a fixed label order, so that each classifier is only able to make predictions with respect to a single label in the chain. Regardless of the order of labels, the product of conditional probabilities in Eq. (2) represents the joint probability of labels by the chain rule, but in practice the label order in a chain has an impact on estimating the conditional probabilities. This issue has been addressed in the past by ensemble averaging [23, 3], ensemble pruning [17], or by a prior analysis of the label dependencies, e.g., by Bayes nets [27], and selecting the ordering accordingly. Similar methods learning a global order over the labels have been proposed by [13], who use kernel target alignment to order the chain according to the difficulty of the single-label problems, and by [18], who formulate the problem of finding the globally optimal label order as a dynamic programming problem. Aside from PCC, there is another family of probabilistic approaches to maximizing subset accuracy [9, 16].

3 Learning to Predict Subsets as Sequence Prediction

In the previous section, we have discussed LP and PCC as means of subset accuracy maximization. Note that y_LP in Eq.
(1) denotes a set of positive labels. Instead of solving Eq. (1) using a multi-class classifier, one can consider predicting all labels in y_LP individually, and interpret this approach as a way of maximizing the joint probability of a label subset given the number of labels T in the subset. Similar to PCC, the joint probability can be computed as a product of conditional probabilities, but unlike PCC, only T ≪ L terms are needed. Therefore, maximizing the joint probability of positive labels can be viewed as subset accuracy maximization in the manner of LP, but carried out sequentially in the way PCC works. To be more precise, y can be represented as a set of 1-of-L vectors y = {y_{p_i}}_{i=1}^T with y_{p_i} ∈ R^L, where T is the number of positive labels associated with an instance x. The joint probability of positive labels can be written as

    P(y_{p_1}, y_{p_2}, ..., y_{p_T} | x) = Π_{i=1}^T P(y_{p_i} | y_{<p_i}, x).    (3)

Note that Eq. (3) has the same form as Eq. (2) except for the number of output variables. While Eq. (2) is meant to maximize the joint probability over the entire 2^L configurations, Eq. (3) represents the probability of sets of positive labels and ignores negative labels. The subscript p is omitted unless it is needed for clarity. A key advantage of Eq. (3) over the traditional multi-label formulation is that the number of conditional probabilities to be estimated is dramatically reduced from L to T, improving scalability. Also note that each estimate itself again depends on the previous estimates. Reducing the length of the chain might be helpful in reducing the cascading errors, which is particularly relevant for labels at the end of the chain. Having said that, computations over the L^T search space of Eq. (3) remain infeasible even though this search space is much smaller than the search space of PCC in Eq.
(2), 2^L, since the label cardinality C is usually very small, i.e., C ≪ L.

As each instance has a different value for T, we need MLC methods capable of dealing with a different number of output targets across instances. In fact, the idea of predicting only the positive labels has been explored for MLC before. Recurrent neural networks (RNNs) have been successful in solving complex output space problems. In particular, Wang et al. [31] have demonstrated that RNNs provide a competitive solution on MLC image datasets. Doppa et al. [6] propose multi-label search, where a heuristic function and a cost function are learned to iteratively search for elements to be chosen as positive labels on a binary vector of size L. In this work, we make use of RNNs to compute Π_{i=1}^T P(y_{p_i} | y_{<p_i}, x), for which the order of labels in a label subset y_{p_1}, y_{p_2}, ..., y_{p_T} needs to be determined a priori, as in PCC. In the following, we explain possible ways of choosing label permutations, and then present three RNN architectures for MLC.

3.1 Determining Label Permutations

We hypothesize that some label permutations make it easier to estimate Eqs. (2) and (3) than others. However, as no ground truth such as a relevance score of each positive label to a training instance is given, we need a way to prepare fixed label permutations before training.

The most straightforward approach is to order positive labels simply by frequency, either in descending (from frequent to rare labels) or ascending (from rare to frequent ones) order. Although this type of label permutation may break up label correlations in a chain, Wang et al. [31] have shown that the descending label ordering achieves decent performance on multi-label image datasets.
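A minimal sketch of these frequency-based permutations (function names and toy labels are ours):

```python
from collections import Counter

# Order each instance's positive labels from frequent to rare (f2r) or
# rare to frequent (r2f), using frequencies counted on the training set.

def label_frequencies(train_label_sets):
    counts = Counter()
    for labels in train_label_sets:
        counts.update(labels)
    return counts

def order_labels(labels, counts, ascending=False):
    # Ties are broken by label id so the permutation is deterministic.
    if ascending:                                   # rare-to-frequent (r2f)
        return sorted(labels, key=lambda l: (counts[l], l))
    return sorted(labels, key=lambda l: (-counts[l], l))   # f2r

train_y = [{'acq'}, {'earn'}, {'earn', 'crude'}, {'earn'}]
counts = label_frequencies(train_y)
f2r = order_labels({'crude', 'earn', 'acq'}, counts)
r2f = order_labels({'crude', 'earn', 'acq'}, counts, ascending=True)
```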
As an alternative, if additional information such as label hierarchies is available\nabout the labels, we can also take advantage of such information to determine label permutations.\nFor example, assuming that labels are organized in a directed acyclic graph (DAG) where labels are\npartially ordered, we can obtain a total order of labels by topological sorting with depth-\ufb01rst search\n(DFS), and given that order, target labels in the training set can be sorted in a way that labels that have\nsame ancestors in the graph are placed next to each other. In fact, this approach also preserves partial\nlabel orders in terms of the co-occurrence frequency of a child and its parent label in the graph.\n\n3.2 Label Sequence Prediction from Given Label Permutations\n\nA recurrent neural network (RNN) is a neural network (NN) that is able to capture temporal\ninformation. RNNs have shown their superior performance on a wide range of applications where\ntarget outputs form a sequence. In our context, we can expect that MLC will also bene\ufb01t from the\nreformulation of PCCs because the estimation of the joint probability of only positive labels as in\nEq. (3) signi\ufb01cantly reduces the length of the chains, thereby reducing the effect of error propagation.\nA RNN architecture that learns a sequence of L binary targets can be seen as a NN counterpart of\nPCC because its objective is to maximize Eq. (2), just like in PCC. We will refer to this architecture\nas RNNb (Fig. 1b). One can also come up with a RNN architecture maximizing Eq. (3) to take\nadvantage of the smaller label subset size T than L, which shall be referred to as RNNm (Fig. 1c). For\nlearning RNNs, we use gated recurrent units (GRUs) which allow to effectively avoid the vanishing\ngradient problem [2]. Let \u00afx be the \ufb01xed input representation computed from an instance x. 
We shall explain how to determine x̄ in Sec. 4.2.

[Figure 1: Illustration of PCC and RNN architectures for MLC: (a) PCC, (b) RNNb, (c) RNNm, (d) EncDec. For the purpose of illustration, we assume T = 3 and x consists of 4 elements.]

Given an initial state h_0 = f_init(x̄), at each step i, both RNNb and RNNm compute a hidden state h_i by taking x̄ and a target (or predicted) label from the previous step as inputs: h_i = GRU(h_{i−1}, V_{y_{i−1}}, x̄) for RNNb and h_i = GRU(h_{i−1}, V_{y_{p_{i−1}}}, x̄) for RNNm, where V is the matrix of d-dimensional label embeddings. In turn, RNNb computes the conditional probabilities P_θ(y_i | y_{<i}, x) in Eq. (2) by f(h_i, V_{y_{i−1}}, x̄), consisting of a linear projection followed by the softmax function. Likewise, we consider f(h_i, V_{y_{p_{i−1}}}, x̄) for RNNm. Note that the key difference between RNNb and RNNm is whether the targets are binary targets y_i or 1-of-L targets y_{p_i}. Under the assumption that the hidden states h_i preserve the information on all previous labels y_{<i}, learning RNNb and RNNm can be interpreted as learning classifiers in a chain. Whereas in PCCs an independent classifier is responsible for predicting each label, both proposed types of RNNs maintain a single set of parameters to predict all labels.

The input representations x̄ to both RNNb and RNNm are kept fixed after the preprocessing of inputs x is completed.
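As a concrete illustration of such a decoder step, the following toy sketch (tiny dimensions, random weights; ours, not the paper's implementation) updates a GRU hidden state from the previous label's embedding and a fixed representation x̄, and maps it to a distribution over L labels:

```python
import numpy as np

# One RNNm-style step: h_i = GRU(h_{i-1}, V[y_{i-1}], x_bar), followed by a
# linear projection and softmax giving P(y_i | y_<i, x). All weights are
# random stand-ins; only the shapes and the data flow are the point here.

rng = np.random.default_rng(0)
L, d, h_dim, x_dim = 5, 4, 8, 6

V  = rng.normal(size=(L, d))                            # label embeddings
W  = rng.normal(size=(h_dim, h_dim + d + x_dim)) * 0.1  # candidate weights
Wz = rng.normal(size=(h_dim, h_dim + d + x_dim)) * 0.1  # update gate
Wr = rng.normal(size=(h_dim, h_dim + d + x_dim)) * 0.1  # reset gate
Wo = rng.normal(size=(L, h_dim)) * 0.1                  # output projection

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, y_prev, x_bar):
    inp = np.concatenate([h_prev, V[y_prev], x_bar])
    z = sigmoid(Wz @ inp)
    r = sigmoid(Wr @ inp)
    cand = np.tanh(W @ np.concatenate([r * h_prev, V[y_prev], x_bar]))
    return (1 - z) * h_prev + z * cand

def label_distribution(h):
    logits = Wo @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()                                  # softmax over L labels

x_bar = rng.normal(size=x_dim)
h = gru_step(np.zeros(h_dim), 0, x_bar)   # condition on previous label id 0
p = label_distribution(h)
```

Because the same weight matrices are reused at every step, this single parameter set plays the role of all L classifiers of a chain.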
Recently, the encoder-decoder (EncDec) framework, also known as sequence-to-sequence (Seq2Seq) learning [2, 28], has drawn attention for modeling both input and output sequences, and has been applied successfully to various applications in natural language processing and computer vision [5, 14]. EncDec is composed of two RNNs: an encoder network captures the information in the entire input sequence, which is then passed to a decoder network which decodes this information into a sequence of labels (Fig. 1d). In contrast to RNNb and RNNm, which only use fixed input representations x̄, EncDec makes use of context-sensitive input vectors from x. We describe how EncDec computes Eq. (3) in the following.

Encoder. An encoder takes x and produces a sequence of D-dimensional vectors x = {x_1, x_2, ..., x_E}, where E is the number of encoded vectors for a single instance. In this work, we consider documents as input data. For encoding documents, we use words as atomic units. Consider a document as a sequence of E words such that x = {w_1, w_2, ..., w_E} and a vocabulary V of words. Each word w_j ∈ V has its own K-dimensional vector representation u_j. The set of these vectors constitutes a matrix of word embeddings defined as U ∈ R^{K×|V|}. Given this word embedding matrix U, the words in a document are converted to a sequence of K-dimensional vectors u = {u_1, u_2, ..., u_E}, which is then fed into the RNN to learn the sequential structure of the document:

    x_j = GRU(x_{j−1}, u_j),    (4)

where x_0 is the zero vector.

Decoder. After the encoder computes x_j for all elements in x, we set the initial hidden state of the decoder h_0 = f_init(x_E), and then compute hidden states h_i = GRU(h_{i−1}, V_{y_{i−1}}, c_i), where c_i = Σ_j α_ij x_j is the context vector, i.e., the sum of the encoded input vectors weighted by attention scores α_ij = f_att(h_{i−1}, x_j), α_ij ∈ R.
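A minimal numpy sketch of this context-vector computation, with a bilinear scoring function standing in for f_att (the weights and dimensions are illustrative assumptions):

```python
import numpy as np

# Attention step of the decoder: scores between the previous decoder state
# and each encoded vector x_j are normalized with a softmax, and the context
# vector c_i is the resulting weighted sum of the encoder outputs.

rng = np.random.default_rng(1)
E, D, h_dim = 4, 3, 3          # 4 encoded vectors of dimension D

X  = rng.normal(size=(E, D))   # encoder outputs x_1..x_E
Wa = rng.normal(size=(h_dim, D))

def attention_context(h_prev, X):
    scores = X @ (Wa.T @ h_prev)        # one score per encoder position j
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                # attention weights sum to 1
    return alpha @ X, alpha             # context c_i and weights alpha_ij

c, alpha = attention_context(rng.normal(size=h_dim), X)
```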
Then, as shown in [1], the conditional probability P_θ(y_i | y_{<i}, x) for predicting a label y_i can be estimated by a function of the hidden state h_i, the previous label y_{i−1} and the context vector c_i:

    P_θ(y_i | y_{<i}, x) = f(h_i, V_{y_{i−1}}, c_i).    (5)

Indeed, EncDec is potentially more powerful than RNNb and RNNm because each prediction is determined based on the dynamic context of the input x, unlike the fixed input representation x̄ used in PCC, RNNb and RNNm (cf. Figs. 1a to 1d). The differences in computing hidden states and conditional probabilities among the three RNNs are summarized in Table 1.

Table 1: Comparison of the three RNN architectures for MLC.

                         RNNb                            RNNm                            EncDec
hidden states            GRU(h_{i−1}, V_{y_{i−1}}, x̄)    GRU(h_{i−1}, V_{y_{i−1}}, x̄)    GRU(h_{i−1}, V_{y_{i−1}}, c_i)
prob. of output labels   f(h_i, V_{y_{i−1}}, x̄)          f(h_i, V_{y_{i−1}}, x̄)          f(h_i, V_{y_{i−1}}, c_i)

Unlike in the training phase, where we know the size T of the positive label set, this information is not available during prediction. Whereas this is typically solved using a meta learner that predicts a threshold in the ranking of labels, EncDec follows a similar approach as [7] and directly predicts a virtual label that indicates the end of the sequence.

4 Experimental Setup

In order to see whether solving MLC problems using RNNs can be a good alternative to classifier chain (CC)-based approaches, we will compare traditional multi-label learning algorithms such as BR and PCC with the RNN architectures (Fig. 1) on multi-label text classification datasets. For a fair comparison, we will use the same fixed label permutation strategies in all compared approaches if necessary.
As it has already been demonstrated in the literature that label permutations may affect\nthe performance of classi\ufb01er chain approaches [23, 13], we will evaluate a few different strategies.\n\n4.1 Baselines and Training Details\n\nWe use feed-forward NNs as a base learner of BR, LP and PCC. For PCC, beam search with beam\nsize of 5 is used at inference time [13]. As another NN baseline, we also consider a feed-forward NN\nwith binary cross entropy per label [21]. We compare RNNs to FastXML [22], one of state-of-the-arts\nin extreme MLC.1 All NN based approaches are trained by using Adam [12] and dropout [26]. The\ndimensionality of hidden states of all the NN baselines as well as the RNNs is set to 1024. The size\nof label embedding vectors is set to 256. We used the NVIDIA Titan X to train NN models including\nRNNs and base learners. For FastXML, a machine with 64 cores and 1024GB memory was used.\n\n4.2 Datasets and Preprocessing\n\nWe use three multi-label text classi\ufb01cation datasets for which we had access to the full text as it is\nrequired for our approach EncDec, namely Reuters-21578,2 RCV1-v2 [15] and BioASQ,3 each of\nwhich has different properties. Summary statistics of the datasets are given in Table 2. For preparing\nthe train and the test set of Reuters-21578 and RCV1-v2, we follow [21]. We split instances in\nBioASQ by year 2014, so that all documents published in 2014 and 2015 belong to the test set. For\ntuning hyperparameters, we set aside 10% of the training instances as the validation set for both\nReuters-21578 and RCV1-v2, but chose randomly 50 000 documents for BioASQ.\nThe RCV1-v2 and BioASQ datasets provide label relationships as a graph. Speci\ufb01cally, labels in\nRCV1-v2 are structured in a tree. The label structure in BioASQ is a directed graph and contains\ncycles. 
We removed all edges pointing to nodes which had already been visited while traversing the graph using DFS, which results in a DAG of labels.

Document Representations. For all datasets, we replaced numbers with a special token and then built a word vocabulary for each dataset. The sizes of the vocabularies for Reuters-21578, RCV1-v2 and BioASQ are 22 747, 50 000 and 30 000, respectively. Out-of-vocabulary (OOV) words were also replaced with a special token, and we truncated the documents after 300 words.[4]

[1] Note that FastXML optimizes a top-k ranking of labels, unlike our approaches, and assigns a confidence score to each label. We set a threshold of 0.5 to convert rankings of labels into bipartition predictions.
[2] http://www.daviddlewis.com/resources/testcollections/reuters21578/
[3] http://bioasq.org
[4] One may worry that the truncation loses information related to some specific labels. As the average length of documents in the datasets is below 300 words, the effect should be negligible.

Table 2: Summary of datasets. # training documents (Ntr), # test documents (Nts), # labels (L), label cardinality (C), # label combinations (LC), type of label structure (HS).

DATASET        | Ntr        | Nts     | L      | C     | LC         | HS
Reuters-21578  | 7770       | 3019    | 90     | 1.24  | 468        | -
RCV1-v2        | 781 261    | 23 149  | 103    | 3.21  | 14 921     | Tree
BioASQ         | 11 431 049 | 274 675 | 26 970 | 12.60 | 11 673 800 | DAG

We trained word2vec [20] on an English Wikipedia dump to get 512-dimensional word embeddings u. Given the word embeddings, we created the fixed input representations x̄ to be used for all of the baselines in the following way: each word in the document except for numbers and OOV words is converted into its corresponding embedding vector, and these word vectors are then averaged, resulting in a document vector x̄.
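The averaging step just described can be sketched as follows; the toy embeddings stand in for the pre-trained word2vec vectors:

```python
import numpy as np

# Fixed document representation x_bar used by the baselines, RNNb and RNNm:
# look up an embedding for each in-vocabulary word and average, skipping
# numbers and OOV words. The embedding table here is a random toy.

rng = np.random.default_rng(2)
K = 4
embeddings = {w: rng.normal(size=K) for w in ['oil', 'price', 'rises']}

def doc_vector(tokens, embeddings, K):
    vecs = [embeddings[t] for t in tokens if t in embeddings]  # drop OOV/numbers
    if not vecs:
        return np.zeros(K)
    return np.mean(vecs, axis=0)

x_bar = doc_vector(['oil', 'price', 'rises', '1987', 'unkword'], embeddings, K)
```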
For EncDec, which learns hidden states of word sequences using an encoder RNN, all words are converted to vectors using the pre-trained word embeddings, and we feed these vectors as inputs to the encoder. In this case, unlike during the preparation of x̄, we do not ignore OOV words and numbers. Instead, we initialize the vectors for those tokens randomly. For a fair comparison, we do not update the word embeddings of the encoder in EncDec.

4.3 Evaluation Measures

MLC algorithms can be evaluated with multiple measures which capture different aspects of the problem. We evaluate all methods in terms of both example-based and label-based measures.

Example-based measures are defined by comparing the target vector y = {y_1, y_2, ..., y_L} to the prediction vector ŷ = {ŷ_1, ŷ_2, ..., ŷ_L}. Subset accuracy (ACC) is very strict regarding incorrect predictions in that it does not allow any deviation in the predicted label sets: ACC(y, ŷ) = I[y = ŷ]. Hamming accuracy (HA) computes how many labels are correctly predicted in ŷ: HA(y, ŷ) = (1/L) Σ_{j=1}^L I[y_j = ŷ_j]. ACC and HA are used for datasets with moderate L. If C as well as L is higher, entirely correct predictions become increasingly unlikely, and therefore ACC often approaches 0. In this case, the example-based F1-measure (ebF1), defined by Eq. (6), can be considered a good compromise.

Label-based measures are based on treating each label y_j as a separate two-class prediction problem, and computing the number of true positives (tp_j), false positives (fp_j) and false negatives (fn_j) for this label. We consider two label-based measures, namely the micro-averaged F1-measure (miF1) and the macro-averaged F1-measure (maF1), which are defined by Eq. (7) and Eq. (8), respectively:

    ebF1(y, ŷ) = 2 Σ_{j=1}^L y_j ŷ_j / (Σ_{j=1}^L y_j + Σ_{j=1}^L ŷ_j),    (6)

    miF1 = Σ_{j=1}^L 2 tp_j / Σ_{j=1}^L (2 tp_j + fp_j + fn_j),    (7)

    maF1 = (1/L) Σ_{j=1}^L 2 tp_j / (2 tp_j + fp_j + fn_j).    (8)

miF1 favors a system yielding good predictions on frequent labels, whereas higher maF1 scores are usually attributed to superior performance on rare labels.

5 Experimental Results

In the following, we show results of various versions of RNNs for MLC on three text datasets which span a wide variety of input and label set sizes. We also evaluate different label orderings, such as frequent-to-rare (f2r) and rare-to-frequent (r2f), as well as topological sorting (when applicable).

5.1 Experiments on Reuters-21578

Figure 2 shows the negative log-likelihood (NLL) of Eq. (3) on the validation set during the course of training. Note that as RNNb attempts to predict binary targets, whereas RNNm and EncDec make predictions on multinomial targets, the results of RNNb are plotted separately, with a different scale of the y-axis (top half of the graph). Compared to RNNm and EncDec, RNNb converges very slowly. This can be attributed to the length of the label chain and the sparse targets in the chain, since RNNb is trained to make correct predictions over all 90 labels, most of them being zero.
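Before turning to the detailed results, the measures of Sec. 4.3 used throughout this section can be made concrete with an illustrative implementation (function names are ours; scoring empty sets as perfect is one common convention):

```python
# Evaluation measures on binary label vectors: subset accuracy (ACC),
# Hamming accuracy (HA), example-based F1 (ebF1), micro- and macro-F1.

def subset_accuracy(Y, Y_hat):
    return sum(y == yh for y, yh in zip(Y, Y_hat)) / len(Y)

def hamming_accuracy(Y, Y_hat):
    L = len(Y[0])
    return sum(sum(a == b for a, b in zip(y, yh)) / L
               for y, yh in zip(Y, Y_hat)) / len(Y)

def example_based_f1(Y, Y_hat):
    def f1(y, yh):
        denom = sum(y) + sum(yh)
        return 2 * sum(a * b for a, b in zip(y, yh)) / denom if denom else 1.0
    return sum(f1(y, yh) for y, yh in zip(Y, Y_hat)) / len(Y)

def micro_macro_f1(Y, Y_hat):
    L = len(Y[0])
    tp, fp, fn = [0] * L, [0] * L, [0] * L
    for y, yh in zip(Y, Y_hat):
        for j in range(L):
            tp[j] += y[j] and yh[j]
            fp[j] += (not y[j]) and yh[j]
            fn[j] += y[j] and (not yh[j])
    micro = 2 * sum(tp) / (2 * sum(tp) + sum(fp) + sum(fn))
    macro = sum(2 * tp[j] / (2 * tp[j] + fp[j] + fn[j])
                if tp[j] + fp[j] + fn[j] else 1.0
                for j in range(L)) / L
    return micro, macro

Y     = [(1, 0, 1), (0, 1, 0)]
Y_hat = [(1, 0, 0), (0, 1, 0)]
```

On this tiny example, one of two predictions is an exact match, so ACC is 0.5 even though most individual labels are correct, which is precisely the strictness discussed above.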
In other words, the length of the target sequences of RNNb is 90 and fixed, regardless of the content of the training documents.

[Figure 2: Negative log-likelihood of RNNs on the validation set of Reuters-21578.]

Table 3: Performance comparison on Reuters-21578.

                              ACC      HA       ebF1     miF1     maF1
No label permutations
BR(NN)                        0.7685   0.9957   0.8515   0.8348   0.4022
LP(NN)                        0.7837   0.9941   0.8206   0.7730   0.3505
NN                            0.7502   0.9952   0.8396   0.8183   0.3083
Frequent labels first (f2r)
PCC(NN)                       0.7844   0.9955   0.8585   0.8305   0.3989
RNNb                          0.6757   0.9931   0.7180   0.7144   0.0897
RNNm                          0.7744   0.9942   0.8396   0.7884   0.2722
EncDec                        0.8281   0.9961   0.8917   0.8545   0.4567
Rare labels first (r2f)
PCC(NN)                       0.7864   0.9956   0.8598   0.8338   0.3937
RNNb                          0.0931   0.9835   0.1083   0.1389   0.0102
RNNm                          0.7744   0.9943   0.8409   0.7864   0.2699
EncDec                        0.8261   0.9962   0.8944   0.8575   0.4365

[Figure 3: Performance of RNN models on the validation set of Reuters-21578 during training. Note that the x-axis denotes # epochs and we use different scales on the y-axis for each measure.]

In particular, RNNb has trouble with the r2f label ordering, where training is unstable. The reason is presumably that the predictions for later labels depend on sequences that are mostly zero when rare labels occur at the beginning. Hence, the model sees only few examples of non-zero targets in a single epoch. On the other hand, both RNNm and EncDec converge relatively faster than RNNb and obviously do not suffer from the r2f ordering. Moreover, there is not much difference between both strategies, since the length of the sequences is often 1 for Reuters-21578 and hence often the same.

Figure 3 shows the performance of RNNs in terms of all evaluation measures on the validation set. EncDec performs best for all the measures, followed by RNNm.
There is no clear difference between the same type of model trained on different label permutations, except for RNNb in terms of NLL (cf. Fig. 2). Note that although it takes more time to update the parameters of EncDec than those of RNNm, EncDec ends up with better results. RNNb performs poorly, especially in terms of maF1, regardless of the label permutation, suggesting that RNNb would need more parameter updates for predicting rare labels. Notably, the advantage of EncDec is most pronounced for this specific task.

Detailed results of all methods on the test set are shown in Table 3. Clearly, EncDec performs best across all measures. LP works better than BR and NN in terms of ACC, as intended, but falls behind them on the other measures. The reason is that LP, by construction, is able to hit the exact label set more accurately, but, on the other hand, produces more false positives and false negatives than BR and NN in our experiments when it misses the correct label combination. As shown in the table, RNNm performs better than its counterpart, RNNb, in terms of ACC, but has clear weaknesses in predicting rare labels (cf. especially maF1). For PCC, our two permutations of the labels do not affect ACC much, due to the low label cardinality.

5.2 Experiments on RCV1-v2

In comparison to Reuters-21578, RCV1-v2 consists of a considerably larger number of documents. Though the number of unique labels (L) is similar in both datasets (103 vs. 90), RCV1-v2 has a higher C, and LC is greatly increased from 468 to 14 921. Moreover, this dataset has the interesting property that all labels from the root to a relevant leaf label in the label tree are also associated to the document.
In this case, we can also test a topological ordering of labels, as described in Section 3.1.

[Figure: learning curves per training epoch — negative log-likelihood, subset accuracy, Hamming accuracy, example-based F1, micro-averaged F1 and macro-averaged F1 — for RNNb, RNNm and EncDec under the f2r and r2f label permutations.]

Table 4: Performance comparison on RCV1-v2.

                              ACC      HA      ebF1    miF1    maF1
No label permutations
  BR(NN)                    0.5554   0.9904  0.8376  0.8349  0.6376
  LP(NN)                    0.5149   0.9767  0.6696  0.6162  0.4154
  NN                        0.5837   0.9907  0.8441  0.8402  0.6573
  FastXML                   0.5953   0.9910  0.8409  0.8470  0.5918
Frequent labels first (f2r)
  PCC(NN)                   0.6211   0.9904  0.8461  0.8324  0.6404
  RNNm                      0.6218   0.9903  0.8578  0.8487  0.6798
  EncDec                    0.6798   0.9925  0.8895  0.8838  0.7381
Rare labels first (r2f)
  PCC(NN)                   0.6300   0.9904  0.8493  0.8395  0.6376
  RNNm                      0.6216   0.9898  0.8556  0.8525  0.6583
  EncDec                    0.6767   0.9924  0.8884  0.8817  0.7413
Topological sorting
  PCC(NN)                   0.6257   0.9906  0.8463  0.8364  0.6486
  RNNm                      0.6072   0.9903  0.8525  0.8437  0.6578
  EncDec                    0.6761   0.9925  0.8888  0.8808  0.7220
Reverse topological sorting
  PCC(NN)                   0.6267   0.9902  0.8444  0.8346  0.6497
  RNNm                      0.6232   0.9904  0.8561  0.8496  0.6535
  EncDec                    0.6781   0.9925  0.8899  0.8797  0.7258

Table 5: Performance comparison on BioASQ.

                              ACC      HA      ebF1    miF1    maF1
No label permutations
  FastXML                   0.0001   0.9996  0.3585  0.3890  0.0570
Frequent labels first (f2r)
  RNNm                      0.0001   0.9993  0.3917  0.4088  0.1435
  EncDec                    0.0004   0.9995  0.5294  0.5634  0.3211
Rare labels first (r2f)
  RNNm                      0.0001   0.9994  0.4188  0.4534  0.1801
  EncDec                    0.0006   0.9953  0.5531  0.5943  0.3363
Topological sorting
  RNNm                      0.0001   0.9995  0.4087  0.4402  0.1555
  EncDec                    0.0006   0.9996  0.5311  0.5919  0.3459
Reverse topological sorting
  RNNm                      0.0001   0.9994  0.4210  0.4508  0.1646
  EncDec                    0.0007   0.9996  0.5585  0.5961  0.3427

As RNNb takes a long time to train and did not show good results on the small dataset, we no longer considered it in these experiments. We instead include FastXML as a baseline.
Table 4 shows the performance of the methods with different label permutations. These results again demonstrate the superiority of PCC, RNNm and EncDec over BR and NN in maximizing ACC. Another interesting observation is that LP performs much worse than the other methods even in terms of ACC, due to the data scarcity problem caused by the higher LC. RNNm and EncDec, which also predict label subsets but in a sequential manner, do not suffer from the larger number of distinct label combinations. As in the previous experiment, we found no meaningful differences between the RNNm and EncDec models trained on different label permutations on RCV1-v2. FastXML also performs well, except for maF1, which indicates that it focuses more on frequent labels than on rare labels. As noted, this is because FastXML is designed to maximize top-k ranking measures such as prec@k, for which the performance on frequent labels is important.

5.3 Experiments on BioASQ

Compared to Reuters-21578 and RCV1-v2, BioASQ has an extremely large number of instances and labels, and LC is close to Ntr + Nts. In other words, nearly all distinct label combinations appear only once in the dataset, and some label subsets can only be found in the test set. Table 5 shows the performance of FastXML, RNNm and EncDec on the test set of BioASQ. EncDec clearly outperforms RNNm by a large margin.
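The evaluation measures reported in Tables 4 and 5 (ACC, HA, ebF1, miF1, maF1) can all be computed from binary indicator matrices. The following minimal sketch, on hypothetical toy predictions, illustrates the definitions; labels or instances with neither true nor predicted positives receive an F1 of 0 here, which is one of several conventions.

```python
# Hypothetical toy data: 3 instances, 4 labels, as binary indicator matrices.
# Y[i][j] = 1 iff label j is relevant for instance i; P holds the predictions.
Y = [[1, 0, 1, 0],
     [0, 1, 0, 0],
     [1, 1, 0, 0]]
P = [[1, 0, 1, 0],   # exact match
     [0, 1, 1, 0],   # one false positive
     [1, 0, 0, 0]]   # one false negative

N, L = len(Y), len(Y[0])

def counts(y, p):
    """True positives, false positives, false negatives of two binary vectors."""
    tp = sum(a and b for a, b in zip(y, p))
    fp = sum(b and not a for a, b in zip(y, p))
    fn = sum(a and not b for a, b in zip(y, p))
    return tp, fp, fn

def f1(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn); defined as 0 when there are no positives."""
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

# Subset accuracy (ACC): exact match of the full predicted label set.
acc = sum(y == p for y, p in zip(Y, P)) / N
# Hamming accuracy (HA): fraction of correct individual label decisions.
ha = sum(a == b for y, p in zip(Y, P) for a, b in zip(y, p)) / (N * L)
# Example-based F1 (ebF1): F1 per instance, averaged over instances.
ebf1 = sum(f1(*counts(y, p)) for y, p in zip(Y, P)) / N
# Micro-averaged F1 (miF1): pool tp/fp/fn over all labels, then one F1.
per_label = [counts(cy, cp) for cy, cp in zip(zip(*Y), zip(*P))]
mif1 = f1(*(sum(c[i] for c in per_label) for i in range(3)))
# Macro-averaged F1 (maF1): F1 per label, averaged over labels.
maf1 = sum(f1(*c) for c in per_label) / L

print(f"ACC={acc:.3f} HA={ha:.3f} ebF1={ebf1:.3f} miF1={mif1:.3f} maF1={maf1:.3f}")
# ACC=0.333 HA=0.833 ebF1=0.778 miF1=0.800 maF1=0.583
```

The toy numbers make the contrast in the text concrete: only one of three label sets is hit exactly (low ACC), while most individual decisions are correct (high HA), and the rare all-zero label drags maF1 below miF1.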
Making predictions over several thousand labels is a particularly difficult task because MLC methods must not only learn label dependencies, but also capture the contextual information in documents that allows finding word-label dependencies and improving generalization performance.
We can observe a consistent benefit from using the reverse label ordering for both approaches. Note that EncDec showed reliable performance on the two relatively small benchmarks regardless of the choice of label permutation. Moreover, EncDec with reverse topological sorting of the labels achieves the best performance, except for maF1. Note that we observed similar effects with RNNm in our preliminary experiments on RCV1-v2, but the impact of the label permutations disappeared once we tuned RNNm with dropout. This indicates that the label ordering does not affect the final performance of the models much if they are trained well enough with proper regularization techniques.
To understand the effectiveness of each model with respect to the size of the positive label set, we split the test set into five almost equally-sized partitions based on the number of target labels in the documents, and evaluated the models separately on each partition, as shown in Fig. 4. The first partition (P1) contains test documents associated with 1 to 9 labels. Similarly, the other partitions, P2, P3, P4 and P5, contain documents with cardinalities of 10∼12, 13∼15, 16∼18 and 19 or more, respectively. As expected, the performance of all models in terms of ACC and HA decreases as the number of positive labels increases.

Figure 4: Comparison of RNNm and EncDec w.r.t. the number of positive labels T of the test documents. The test set is divided into 5 partitions according to T. The x-axis denotes partition indices; tps and tps_rev stand for the label permutation ordered by topological sorting and its reverse.
The other measures increase, since the classifiers have potentially more possibilities to match positive labels. We can further confirm the observations from Table 5 w.r.t. the different label-set sizes: the margin of FastXML to RNNm and EncDec increases further, and its poor performance on rare labels confirms again the focus of FastXML on frequent labels. Regarding computational complexity, we observed an inverse relation between the resources used: whereas we ran EncDec on a single GPU with 12 GB of memory for 5 days, FastXML took only 4 hours to complete (on 64 CPU cores), but, on the other hand, required a machine with 1024 GB of memory.

6 Conclusion

We have presented an alternative formulation of learning the joint probability of labels given an instance, which exploits the generally low label cardinality in multi-label classification problems. Instead of having to iterate over each of the labels, as in the traditional classifier chains approach, the new formulation allows us to focus directly on the positive labels only. We provided an extension of the formal framework of probabilistic classifier chains, contributing to the understanding of the theoretical background of multi-label classification. Our approach based on recurrent neural networks, especially encoder-decoders, proved to be effective, highly scalable, and robust towards different label orderings on both small- and large-scale multi-label text classification benchmarks. However, some aspects of the presented work deserve further consideration.
When considering MLC problems with extremely large numbers of labels, a problem often referred to as extreme MLC (XMLC), F1-measure maximization is often preferred to subset accuracy maximization because it is less susceptible to the very large number of label combinations and imbalanced label distributions.
One can exploit the General F-Measure Maximizer (GFM) [30] to maximize the example-based F1-measure by drawing samples from P(y|x) at inference time. Although it is easy to draw samples from P(y|x) as approximated by RNNs, and the calculation of the necessary quantities for GFM is straightforward, the use of GFM would be limited to MLC problems with a moderate number of labels because of its quadratic computational complexity O(L^2).
We used a fixed threshold of 0.5 for all labels when making predictions with BR, NN and FastXML. In fact, such a fixed thresholding technique performs poorly on large label spaces. Jasinska et al. [10] present an efficient macro-averaged F1-measure (maF1) maximization approach that tunes the threshold for each label, relying on the sparseness of y. We believe that FastXML could be further improved by this maF1 maximization approach on BioASQ. However, we would like to remark that the RNNs, especially EncDec, perform well without any F1-measure maximization at inference time. Nevertheless, maF1 maximization for RNNs might be interesting for future work.
In light of the experimental results in Table 5, learning from raw inputs instead of using fixed input representations plays a crucial role for achieving good performance in our XMLC experiments. As the training costs of the encoder-decoder architecture used in this work depend heavily on the input sequence lengths and the number of unique labels, it will be necessary to consider more efficient neural architectures [8, 11], which we also plan to do in future work.

Acknowledgments

The authors would like to thank the anonymous reviewers for their thorough feedback. Computations for this research were conducted on the Lichtenberg high-performance computer of the Technische Universität Darmstadt.
The Titan X used for this research was donated by the NVIDIA Corporation. This work has been supported by the German Institute for Educational Research (DIPF) under the Knowledge Discovery in Scientific Literature (KDSL) program, and the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1.

References

[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, 2015.

[2] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN Encoder–Decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.

[3] K. Dembczyński, W. Cheng, and E. Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning, pages 279–286, 2010.

[4] K. Dembczyński, W. Waegeman, and E. Hüllermeier. An analysis of chaining in multi-label classification. In Frontiers in Artificial Intelligence and Applications, volume 242, pages 294–299. IOS Press, 2012.

[5] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.

[6] J. R. Doppa, J. Yu, C. Ma, A. Fern, and P. Tadepalli. HC-Search for multi-label prediction: An empirical study. In Proceedings of the AAAI Conference on Artificial Intelligence, 2014.

[7] J. Fürnkranz, E. Hüllermeier, E.
Loza Menc\u00eda, and K. Brinker. Multilabel classi\ufb01cation via calibrated label\n\nranking. Machine Learning, 73(2):133\u2013153, 2008.\n\n[8] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence\nlearning. In Proceedings of the International Conference on Machine Learning, pages 1243\u20131252, 2017.\n\n[9] N. Ghamrawi and A. McCallum. Collective multi-label classi\ufb01cation. In Proceedings of the 14th ACM\n\nInternational Conference on Information and Knowledge Management, pages 195\u2013200, 2005.\n\n[10] K. Jasinska, K. Dembczynski, R. Busa-Fekete, K. Pfannschmidt, T. Klerx, and E. Hullermeier. Extreme\nF-measure maximization using sparse probability estimates. In Proceedings of the International Conference\non Machine Learning, pages 1435\u20131444, 2016.\n\n[11] A. Joulin, M. Ciss\u00e9, D. Grangier, H. J\u00e9gou, et al. Ef\ufb01cient softmax approximation for GPUs. In Proceedings\n\nof the International Conference on Machine Learning, pages 1302\u20131310, 2017.\n\n[12] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the International\n\nConference on Learning Representations, 2015.\n\n[13] A. Kumar, S. Vembu, A. K. Menon, and C. Elkan. Beam search algorithms for multilabel learning. Machine\n\nLearning, 92(1):65\u201389, 2013.\n\n[14] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher.\nAsk me anything: Dynamic memory networks for natural language processing. In Proceedings of The\n33rd International Conference on Machine Learning, pages 1378\u20131387, 2016.\n\n[15] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization\n\nresearch. Journal of Machine Learning Research, 5(Apr):361\u2013397, 2004.\n\n[16] C. Li, B. Wang, V. Pavlu, and J. Aslam. 
Conditional Bernoulli mixtures for multi-label classification. In Proceedings of the 33rd International Conference on Machine Learning, pages 2482–2491, 2016.

[17] N. Li and Z.-H. Zhou. Selective ensemble of classifier chains. In Z.-H. Zhou, F. Roli, and J. Kittler, editors, Multiple Classifier Systems, volume 7872, pages 146–156. Springer Berlin Heidelberg, 2013.

[18] W. Liu and I. Tsang. On the optimality of classifier chain for multi-label classification. In Advances in Neural Information Processing Systems 28, pages 712–720. 2015.

[19] D. Mena, E. Montañés, J. R. Quevedo, and J. J. Del Coz. Using A* for inference in probabilistic classifier chains. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pages 3707–3713, 2015.

[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. 2013.

[21] J. Nam, J. Kim, E. Loza Mencía, I. Gurevych, and J. Fürnkranz. Large-scale multi-label text classification—revisiting neural networks. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 437–452, 2014.

[22] Y. Prabhu and M. Varma. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 263–272, 2014.

[23] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.

[24] J. Read, L. Martino, and D. Luengo. Efficient Monte Carlo methods for multi-dimensional learning with classifier chains.
Pattern Recognition, 47(3):1535–1546, 2014.

[25] R. Senge, J. J. Del Coz, and E. Hüllermeier. On the problem of error propagation in classifier chains for multi-label classification. In M. Spiliopoulou, L. Schmidt-Thieme, and R. Janning, editors, Data Analysis, Machine Learning and Knowledge Discovery, pages 163–170. Springer International Publishing, 2014.

[26] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[27] L. E. Sucar, C. Bielza, E. F. Morales, P. Hernandez-Leal, J. H. Zaragoza, and P. Larrañaga. Multi-label classification with Bayesian network-based chain classifiers. Pattern Recognition Letters, 41:14–22, 2014. ISSN 0167-8655.

[28] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. 2014.

[29] G. Tsoumakas, I. Katakis, and I. Vlahavas. Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079–1089, July 2011.

[30] W. Waegeman, K. Dembczyński, A. Jachnik, W. Cheng, and E. Hüllermeier. On the Bayes-optimality of F-measure maximizers. Journal of Machine Learning Research, 15(1):3333–3388, 2014.

[31] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu. CNN-RNN: A unified framework for multi-label image classification.
In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 2285–2294, 2016.