{"title": "Tighter Bounds for Structured Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 281, "page_last": 288, "abstract": "Large-margin structured estimation methods work by minimizing a convex upper bound of loss functions. While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. We show that a small modification of existing optimization algorithms suffices to solve this modified problem. On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy.", "full_text": "Tighter Bounds for Structured Estimation\n\nChuong B. Do, Quoc Le\n\n{chuongdo,quocle}@cs.stanford.edu\n\nStanford University\n\nChoon Hui Teo\n\nAustralian National University and NICTA\n\nchoonhui.teo@anu.edu.au\n\nOlivier Chapelle, Alex Smola\n\nYahoo! Research\n\nchap@yahoo-inc.com,alex@smola.org\n\nAbstract\n\nLarge-margin structured estimation methods minimize a convex upper bound of\nloss functions. While they allow for ef\ufb01cient optimization algorithms, these con-\nvex formulations are not tight and sacri\ufb01ce the ability to accurately model the true\nloss. We present tighter non-convex bounds based on generalizing the notion of\na ramp loss from binary classi\ufb01cation to structured estimation. We show that a\nsmall modi\ufb01cation of existing optimization algorithms suf\ufb01ces to solve this mod-\ni\ufb01ed problem. 
On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy.\n\n1 Introduction\n\nStructured estimation [18, 20] and related techniques have proven very successful in many areas, ranging from collaborative filtering to optimal path planning, sequence alignment, graph matching, and named entity tagging.\nAt the heart of those methods is an inverse optimization problem, namely that of finding a function f(x, y) such that the prediction y\u2217 which maximizes f(x, y\u2217) for a given x minimizes some loss \u2206(y, y\u2217) on a training set. Typically x \u2208 X is referred to as a pattern, whereas y \u2208 Y is a corresponding label. Y can represent a rich class of possible data structures, ranging from binary sequences (tagging), to permutations (matching and ranking), to alignments (sequence matching), to path plans [15]. To make such inherently discontinuous and nonconvex optimization problems tractable, one applies a convex upper bound on the incurred loss. This has two benefits: firstly, the problem has no local minima, and secondly, the optimization problem is continuous and piecewise differentiable, which allows for effective optimization [17, 19, 20]. This setting, however, exhibits a significant problem: the looseness of the convex upper bounds can sometimes lead to poor accuracy.\nFor binary classification, [2] proposed to switch from the hinge loss, a convex upper bound, to a tighter nonconvex upper bound, namely the ramp loss. Their motivation, however, was not accuracy but faster optimization due to the decreased number of support vectors. The resulting optimization uses the convex-concave procedure of [22], which is well known in optimization as the DC-programming method [9].\nWe extend the notion of ramp loss to structured estimation. 
We show that with some minor modifications, the DC algorithms used in the binary case carry over to the structured setting. Unlike the binary case, however, we observe that for structured prediction problems with noisy data, DC programming can lead to improved accuracy in practice. This is due to increased robustness. Effectively, the algorithm discards observations which it labels incorrectly if the error is too large. This ensures that one ends up with a lower-complexity solution while ensuring that the \u201ccorrectable\u201d errors are taken care of.\n\n2 Structured Estimation\n\nDenote by X the set of patterns and let Y be the set of labels. We will denote by X := {x1, . . . , xm} the observations and by Y := {y1, . . . , ym} the corresponding set of labels. Here the pairs (xi, yi) are assumed to be drawn from some distribution Pr on X \u00d7 Y.\nLet f : X \u00d7 Y \u2192 R be a function defined on the product space. Finally, denote by \u2206 : Y \u00d7 Y \u2192 R\u207a\u2080 a loss function which maps pairs of labels to nonnegative numbers. This could be, for instance, the number of bits in which y and y\u2032 differ, i.e. \u2206(y, y\u2032) = \u2016y \u2212 y\u2032\u2016\u2081, or considerably more complicated loss functions, e.g., for ranking and retrieval [21]. We want to find f such that for\n\ny\u2217(x, f) := argmax_{y\u2032} f(x, y\u2032)   (1)\n\nthe loss \u2206(y, y\u2217(x, f)) is minimized: given X and Y we want to minimize the regularized risk,\n\nRreg[f, X, Y] := (1/m) \u2211_{i=1}^{m} \u2206(yi, y\u2217(xi, f)) + \u03bb\u2126[f].   (2)\n\nHere \u2126[f] is a regularizer, such as an RKHS norm \u2126[f] = \u2016f\u2016\u00b2_H, and \u03bb > 0 is the associated regularization constant, which safeguards us against overfitting. Since (2) is notoriously hard to minimize, several convex upper bounds have been proposed to make \u2206(yi, y\u2217(xi, f)) tractable in f. 
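As a toy illustration of the prediction rule (1) and the regularized risk (2), a minimal sketch in Python follows; all names here (`predict`, `regularized_risk`, the lookup-table scorer) are ours for illustration and are not part of the paper:

```python
# Toy illustration of equations (1) and (2).

def predict(f, x, labels):
    # y*(x, f) := argmax_{y'} f(x, y')   -- equation (1)
    return max(labels, key=lambda y: f(x, y))

def regularized_risk(f, X, Y, labels, loss, lam, omega):
    # (1/m) sum_i Delta(y_i, y*(x_i, f)) + lambda * Omega[f]   -- equation (2)
    m = len(X)
    emp = sum(loss(y, predict(f, x, labels)) for x, y in zip(X, Y)) / m
    return emp + lam * omega(f)

# Tiny example: two patterns, two labels, scores from a lookup table, 0/1 loss.
labels = [0, 1]
scores = {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -0.5, (1, 1): 0.5}
f = lambda x, y: scores[(x, y)]
risk = regularized_risk(f, X=[0, 1], Y=[0, 1], labels=labels,
                        loss=lambda y, yp: float(y != yp),
                        lam=0.1, omega=lambda g: 1.0)
print(risk)  # prints 0.1: both points are predicted correctly, only the regularizer remains
```

The empirical term vanishes here because both training points are labeled correctly, which is exactly the regime where (2) reduces to pure regularization.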
The following lemma, which is a generalization of a result of [20], provides a strategy for convexification:\n\nLemma 1 Denote by \u0393 : R\u207a\u2080 \u2192 R\u207a\u2080 a monotonically increasing nonnegative function. Then\n\nl(x, y, y\u2033, f) := sup_{y\u2032} \u0393(\u2206(y, y\u2032)) [f(x, y\u2032) \u2212 f(x, y\u2033)] + \u2206(y, y\u2032) \u2265 \u2206(y, y\u2217(x, f))\n\nfor all y, y\u2033 \u2208 Y. Moreover, l(x, y, y\u2033, f) is convex in f.\n\nProof Convexity follows immediately from the fact that l is the supremum over linear functions in f. To see the inequality, plug y\u2032 = y\u2217(x, f) into the LHS of the inequality: by construction f(x, y\u2217(x, f)) \u2265 f(x, y\u2033) for all y\u2033 \u2208 Y.\n\nIn regular convex structured estimation, l(x, y, y, f) is used. Methods in [18] choose the constant function \u0393(\u03b7) = 1, whereas methods in [20] choose margin rescaling by means of \u0393(\u03b7) = \u03b7. This also shows why both formulations lead to convex upper bounds of the loss. It depends very much on the form of f and \u2206 which choice of \u0393 is easier to handle. Note that the inequality holds for all y\u2033 rather than only for the \u201ccorrect\u201d label y\u2033 = y. We will exploit this later.\n\n3 A Tighter Bound\n\nFor convenience denote by \u03b2(x, y, y\u2032, f) the relative margin between y and y\u2032 induced by f via\n\n\u03b2(x, y, y\u2032, f) := \u0393(\u2206(y, y\u2032))[f(x, y\u2032) \u2212 f(x, y)].   (3)\n\nThe loss bound of Lemma 1 suffers from a significant problem: for large values of f the loss may grow without bound, provided that the estimate is incorrect. This is not desirable since in this setting even a single observation may completely ruin the quality of the convex upper bound on the misclassification error.\nAnother case where the convex upper bound is not desirable is the following: imagine that there are a lot of y which are as good as the label in the training set; this happens frequently in ranking where there are ties between the optimal permutations. 
Let us denote by Yopt := {y\u2033 such that \u2206(y, y\u2032) = \u2206(y\u2033, y\u2032) \u2200y\u2032} this set of equally good labels. Then one can replace y by any element of Yopt in the bound of Lemma 1. Minimization over y\u2033 \u2208 Yopt leads to a tighter non-convex upper bound:\n\nl(x, y, y, f) \u2265 inf_{y\u2033\u2208Yopt} sup_{y\u2032} \u03b2(x, y\u2033, y\u2032, f) + \u2206(y\u2033, y\u2032) \u2265 \u2206(y, y\u2217(x, f)).\n\nIn the case of binary classification, [2] proposed the following non-convex loss that can be minimized using DC programming:\n\nl(x, y, f) := min(1, max(0, 1 \u2212 yf(x))) = max(0, 1 \u2212 yf(x)) \u2212 max(0, \u2212yf(x)).   (4)\n\nWe see that (4) is the difference between a soft-margin loss and a hinge loss. That is, the difference between a loss using a large-margin related quantity and one using simply the violation of the margin. This difference ensures that l cannot increase without bound, since in the limit the derivative of l with respect to f vanishes. The intuition for extending this to structured losses is that the generalized hinge loss underestimates the actual loss whereas the soft-margin loss overestimates the actual loss. Taking the difference removes the linear scaling behavior while retaining the continuity properties.\n\nLemma 2 Denote as follows the rescaled estimate and the margin violator\n\n\u02dcy(x, y, f) := argmax_{y\u2032} \u03b2(x, y, y\u2032, f) and \u00afy(x, y, f) := argmax_{y\u2032} \u03b2(x, y, y\u2032, f) + \u2206(y, y\u2032).   (5)\n\nMoreover, denote by l(x, y, f) the following loss function\n\nl(x, y, f) := sup_{y\u2032} [\u03b2(x, y, y\u2032, f) + \u2206(y, y\u2032)] \u2212 sup_{y\u2032} \u03b2(x, y, y\u2032, f).   (6)\n\nThen under the assumptions of Lemma 1 the following bound holds\n\n\u2206(y, \u00afy(x, y, f)) \u2265 l(x, y, f) \u2265 \u2206(y, y\u2217(x, f)).   (7)\n\nThis loss is a difference between two convex functions, hence it may be (approximately) minimized by a DC programming procedure. 
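The identity in (4) — the ramp loss as a difference of two convex hinge-type losses — can be checked numerically (a minimal sketch; the function names are ours):

```python
# Numerical check of the identity in (4).

def ramp(y, fx):
    # min(1, max(0, 1 - y f(x)))
    return min(1.0, max(0.0, 1.0 - y * fx))

def ramp_as_difference(y, fx):
    # max(0, 1 - y f(x)) - max(0, -y f(x)): convex minus convex,
    # i.e. the convex + concave decomposition used for DC programming.
    return max(0.0, 1.0 - y * fx) - max(0.0, -y * fx)

for y in (-1.0, 1.0):
    for fx in (-2.5, -1.0, -0.3, 0.0, 0.4, 1.0, 3.0):
        assert abs(ramp(y, fx) - ramp_as_difference(y, fx)) < 1e-12
print("identity in (4) holds on the test grid")
```

The decomposition matters because each of the two terms is convex in f(x), so standard convex machinery applies to each half even though their difference is not convex.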
Moreover, it is easy to see that for \u0393(\u03b7) = 1, f(x, y) = (1/2) yf(x), and y \u2208 {\u00b11} we recover the ramp loss of (4).\n\nProof Since \u00afy(x, y, f) maximizes the first term in (6), replacing y\u2032 by \u00afy(x, y, f) in both terms yields\n\nl(x, y, f) \u2264 \u03b2(x, y, \u00afy, f) + \u2206(y, \u00afy) \u2212 \u03b2(x, y, \u00afy, f) = \u2206(y, \u00afy).\n\nTo show the lower bound, we distinguish the following two cases:\nCase 1: y\u2217 is a maximizer of sup_{y\u2032} \u03b2(x, y, y\u2032, f). Replacing y\u2032 by y\u2217 in both terms of (6) leads to l(x, y, f) \u2265 \u2206(y, y\u2217).\nCase 2: y\u2217 is not a maximizer of sup_{y\u2032} \u03b2(x, y, y\u2032, f). Let \u02dcy be any maximizer. Because f(x, y\u2217) \u2265 f(x, \u02dcy), we have \u0393(\u2206(y, \u02dcy)) [f(x, y\u2217) \u2212 f(x, y)] \u2265 \u0393(\u2206(y, \u02dcy)) [f(x, \u02dcy) \u2212 f(x, y)] > \u0393(\u2206(y, y\u2217)) [f(x, y\u2217) \u2212 f(x, y)] and thus \u0393(\u2206(y, \u02dcy)) > \u0393(\u2206(y, y\u2217)). Since \u0393 is non-decreasing this implies \u2206(y, \u02dcy) > \u2206(y, y\u2217). On the other hand, plugging \u02dcy into (6) gives l(x, y, f) \u2265 \u2206(y, \u02dcy). Combining both inequalities proves the claim.\n\nNote that the main difference between the cases of constant \u0393 and monotonic \u0393 is that in the latter case the bounds are not quite as tight as they could potentially be, since we still have some slack with respect to \u2206(y, \u02dcy). Monotonic \u0393 tends to overscale the margin such that more emphasis is placed on avoiding large deviations from the correct estimate rather than on restricting small deviations.\nNote that this nonconvex upper bound is not likely to be Bayes consistent. 
However, it will generate solutions which have a smaller model complexity, since it is never larger than the convex upper bound on the loss; hence the regularizer on f plays a more important role in regularized risk minimization. As a consequence one can expect better statistical concentration properties.\n\n4 DC Programming\n\nWe briefly review the basic template of DC programming, as described in [22]. For a function\n\nf(x) = fcave(x) + fvex(x)\n\nwhich can be expressed as the sum of a convex fvex and a concave fcave function, we can find a convex upper bound by fcave(x\u2080) + \u27e8x \u2212 x\u2080, f\u2032cave(x\u2080)\u27e9 + fvex(x). This follows from the first-order Taylor expansion of the concave part fcave at the current value of x. Subsequently, this upper bound is minimized, a new Taylor approximation is computed, and the procedure is repeated. This will lead to a local minimum, as shown in [22].\nWe now proceed to deriving an explicit instantiation for structured estimation. To keep things simple, in particular the representation of the functional subgradients of l(x, y, f) with respect to f, we assume that f is drawn from a Reproducing Kernel Hilbert Space H.\n\nAlgorithm 1 Structured Estimation with Tighter Bounds\n  Using the loss of Lemma 1, initialize f = argmin_{f\u2032} \u2211_{i=1}^{m} l(xi, yi, yi, f\u2032) + \u03bb\u2126[f\u2032]\n  repeat\n    Compute \u02dcyi := \u02dcy(xi, yi, f) for all i.\n    Using the tightened loss bound, recompute f = argmin_{f\u2032} \u2211_{i=1}^{m} \u02dcl(xi, yi, \u02dcyi, f\u2032) + \u03bb\u2126[f\u2032]\n  until converged\n\nDenote by k the kernel associated with H, defined on (X \u00d7 Y) \u00d7 (X \u00d7 Y). In this case for f \u2208 H we have by the reproducing property that f(x, y) = \u27e8f, k((x, y), \u00b7)\u27e9 and the functional derivative is given by \u2202_f f(x, y) = k((x, y), \u00b7). 
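The generic DC template just described — linearize the concave part at the current iterate, minimize the resulting convex upper bound, repeat — can be sketched on a one-dimensional toy objective (our own illustration, unrelated to the structured solver):

```python
# One-dimensional sketch of the convex-concave (DC) procedure:
#   f(x) = fvex(x) + fcave(x),  fvex(x) = (x - 3)^2,  fcave(x) = -x^2/2.
# At each step, fcave is replaced by its tangent at the current point and
# the resulting convex upper bound is minimized exactly.

def fcave_grad(x):
    return -x  # derivative of the concave part -x^2/2

x = 0.0
for _ in range(100):
    g = fcave_grad(x)
    # Upper bound at x0 = x:  (x' - 3)^2 + fcave(x) + g * (x' - x).
    # Setting its derivative to zero: 2(x' - 3) + g = 0, so x' = 3 - g/2.
    x = 3.0 - g / 2.0

# The full objective is 0.5 x^2 - 6x + 9, minimized at x = 6; the iterates
# x_{k+1} = 3 + x_k / 2 converge there geometrically.
print(round(x, 6))  # prints 6.0
```

In the structured problem the same one-step linearization is applied to the concave part of (6), as the text describes next.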
Likewise we may perform the linearization in (6) as follows:\n\n\u2212 sup_{y\u2032} \u03b2(x, y, y\u2032, f) \u2264 \u2212\u03b2(x, y, \u02dcy, f).\n\nIn other words, we use the rescaled estimate \u02dcy to provide an upper bound on the concave part of the loss function. This leads to the following instantiation of the standard convex-concave procedure: instead of the structured estimation loss it uses the loss bound \u02dcl(x, y, \u02dcy, f)\n\n\u02dcl(x, y, \u02dcy, f) := sup_{y\u2032\u2208Y} [\u03b2(x, y, y\u2032, f) + \u2206(y, y\u2032)] \u2212 \u03b2(x, y, \u02dcy, f).\n\nIn the case of \u0393(\u03b7) = 1 this can be simplified significantly: the terms in f(x, y) cancel and \u02dcl becomes\n\n\u02dcl(x, y, \u02dcy, f) = sup_{y\u2032\u2208Y} [f(x, y\u2032) \u2212 f(x, \u02dcy)] + \u2206(y, y\u2032).\n\nIn other words, we replace the correct label y by the rescaled estimate \u02dcy. Such modifications can be easily implemented in bundle method solvers and related algorithms which only require access to the gradient information (and the function value). In fact, the above strategy follows directly from Lemma 1 when replacing y\u2033 by the rescaled estimate \u02dcy.\n\n5 Experiments\n\n5.1 Multiclass Classification\n\nIn this experiment, we investigate the performance of convex and ramp loss versions of Winner-Takes-All multiclass classification [1] when the training data is noisy. We performed the experiments on several UCI/Statlog datasets: DNA, LETTER, SATIMAGE, SEGMENT, SHUTTLE, and USPS, with fixed percentages of the labels shuffled. Note that we reshuffled the labels in a stratified fashion. That is, we chose a fixed fraction from each class and permuted the label assignment subsequently.\nTable 1 shows the results (average accuracy \u00b1 standard deviation) on several datasets with different percentages of labels shuffled. 
We used nested 10-fold crossvalidation to adjust the regularization\nconstant and to compute the accuracy. A linear kernel was used. It can be seen that ramp loss\noutperforms the convex upper bound when the datasets are noisy. For clean data the convex upper\nbound is slightly superior, albeit not in a statistically signi\ufb01cant fashion. This supports our conjecture\nthat, compared to the convex upper bound, the ramp loss is more robust on noisy datasets.\n\n5.2 Ranking with Normalized Discounted Cumulative Gains\n\nRecently, [12] proposed a method for learning to rank for web search. They compared several meth-\nods showing that optimizing the Normalized Discounted Cumulative Gains (NDCG) score using\na form of structured estimation yields best performance. The algorithm used a linear assignment\nproblem to deal with ranking.\nIn this experiment, we perform ranking experiments with the OHSUMED dataset which is publicly\navailable [13]. The dataset is already preprocessed and split into 5 folds. We \ufb01rst carried out the\nstructured output training algorithm which optimizes the convex upper bound of NDCG as described\nin [21]. Unfortunately, the returned solution was f = 0. 
The convex upper bounds led to the undesirable situation where no nonzero solution would yield any improvement, since the linear function class was too simple.\n\nTable 1: Average accuracy for multiclass classification using the convex upper bound and the ramp loss. The third through fifth columns represent results for datasets with none, 10%, and 20% of the labels randomly shuffled, respectively.\n\nDataset | Method | 0% | 10% | 20%\nDNA | convex | 95.2 \u00b1 1.1 | 88.9 \u00b1 1.5 | 83.1 \u00b1 2.4\nDNA | ramp loss | 95.1 \u00b1 0.8 | 89.1 \u00b1 1.3 | 83.5 \u00b1 2.2\nLETTER | convex | 76.8 \u00b1 0.9 | 64.6 \u00b1 0.7 | 50.1 \u00b1 1.4\nLETTER | ramp loss | 78.6 \u00b1 0.8 | 70.8 \u00b1 0.8 | 63.0 \u00b1 1.5\nSATIMAGE | convex | 85.1 \u00b1 0.9 | 77.0 \u00b1 1.6 | 66.4 \u00b1 1.3\nSATIMAGE | ramp loss | 85.4 \u00b1 1.2 | 78.1 \u00b1 1.6 | 70.7 \u00b1 1.0\nSEGMENT | convex | 95.4 \u00b1 0.9 | 84.8 \u00b1 2.3 | 73.8 \u00b1 2.1\nSEGMENT | ramp loss | 95.2 \u00b1 1.0 | 85.9 \u00b1 2.1 | 77.5 \u00b1 2.0\nSHUTTLE | convex | 97.4 \u00b1 0.2 | 89.5 \u00b1 0.2 | 83.8 \u00b1 0.2\nSHUTTLE | ramp loss | 97.1 \u00b1 0.2 | 90.6 \u00b1 0.8 | 88.1 \u00b1 0.3\nUSPS | convex | 95.1 \u00b1 0.7 | 85.3 \u00b1 1.3 | 76.5 \u00b1 1.4\nUSPS | ramp loss | 95.1 \u00b1 0.9 | 86.1 \u00b1 1.6 | 77.6 \u00b1 1.1\n\nFigure 1: NDCG comparison against ranking SVM and RankBoost. We report the NDCG computed at various truncation levels. Our non-convex upper bound consistently outperforms other rankers. In the context of web page ranking an improvement of 0.01\u20130.02 in the NDCG score is considered substantial.\n\nThis problem is related to the fact that there are a lot of rankings which are equally good because of the ties in the editorial judgments (see beginning of Section 3). 
As a result, there is no w that learns the data well, and for each w the associated max_{y\u2032} f(x, y\u2032) \u2212 f(x, y) + \u2206(y, y\u2032) causes either the first part or the second part of the loss to be large, such that the total value of the loss function always exceeds max_{y\u2032} \u2206(y, y\u2032).\nWhen using the non-convex formulation the problem can be resolved, because we do not rely entirely on the y given in the training set but instead find the y that minimizes the loss. We compared the results of our method with two standard methods for ranking, ranking SVM [10, 8] and RankBoost [6] (the baselines for OHSUMED are shown in [13]), and used NDCG as the performance criterion. We report the aggregate performance in Figure 1.\nAs can be seen from the figure, the results from the new formulation are better than those of standard ranking methods. It is worth emphasizing that the new formulation not only gives results comparable to the state-of-the-art algorithms for ranking but also provides useful solutions when the convex structured estimation setting provides only useless results (obviously f = 0 is highly undesirable).\n\n5.3 Structured classification\n\nWe also assessed the performance of the algorithm on two different structured classification tasks for computational biology, namely protein sequence alignment and RNA secondary structure prediction.\n\nProtein sequence alignment is the problem of comparing the amino acid sequences corresponding to two different proteins in order to identify regions of the sequences which have common ancestry or biological function. 
\n\nTable 2: Protein pairwise sequence alignment results, stratified by reference alignment percentage identity. The second through fifth columns refer to the four non-overlapping reference alignment percentage identity ranges described in the text, and the sixth column corresponds to overall results, pooled across all four subsets. Each value represents the average test set recall for a particular algorithm on alignments from the corresponding subset. The numbers in parentheses indicate the total number of sequences in each subset.\n\nMethod | 0-10% (324) | 11-20% (793) | 21-30% (429) | 31-40% (239) | Overall (1785)\nCRF | 0.111 | 0.316 | 0.634 | 0.877 | 0.430\nconvex | 0.116 | 0.369 | 0.699 | 0.891 | 0.472\nramp loss | 0.138 | 0.387 | 0.708 | 0.905 | 0.488\n\nTable 3: RNA secondary structure prediction results. The second through fifth columns represent subsets of the data stratified by sequence length. The last column presents overall results, pooled across all four subsets. Each pair of numbers indicates the sensitivity / selectivity for structures in the two-fold cross-validation. The numbers in parentheses indicate the total number of sequences in each subset.\n\nMethod | 1-50 (118) | 51-100 (489) | 101-200 (478) | 201+ (274) | Overall (1359)\nCRF | 0.546 / 0.862 | 0.586 / 0.727 | 0.467 / 0.523 | 0.414 / 0.472 | 0.505 / 0.614\nconvex | 0.690 / 0.755 | 0.664 / 0.629 | 0.571 / 0.501 | 0.542 / 0.484 | 0.608 / 0.565\nramp loss | 0.725 / 0.708 | 0.705 / 0.602 | 0.612 / 0.489 | 0.569 / 0.461 | 0.646 / 0.542\n\nIn the pairwise sequence alignment task, the elements of the input space X consist of pairs of amino acid sequences, represented as strings of approximately 100-1000 characters in length. 
The output space Y contains candidate alignments, which identify the corresponding positions in the two sequences that are hypothesized to be evolutionarily related.\nWe developed a structured prediction model for pairwise protein sequence alignment, using the types of features described in [3, 11]. For the loss function, we used \u2206(y, y\u2032) = 1 \u2212 recall (where recall is the proportion of aligned amino acid matches in the true alignment y that appear in the predicted alignment y\u2032). For each inner optimization step, we used a fast-converging subgradient-based optimization algorithm with an adaptive Polyak-like step size [23].\nWe performed two-fold cross-validation over a collection of 1785 pairs of structurally aligned protein domains [14]. All hyperparameters were selected via holdout cross-validation on the training set, and we pooled the results from the two folds. For evaluation, we used recall, as described previously, and compared the performance of our algorithm to a standard conditional random field (CRF) model and a max-margin model using the same features. 
The percentage identity of a reference alignment is defined as the proportion of aligned residue pairs corresponding to identical amino acids. We partitioned the alignments in the testing collection into four subsets based on percent identity (0-10%, 11-20%, 21-30%, and 31+%), and we show the recall of the algorithm for each subset in addition to overall recall (see Table 2).\nHere, it is clear that our method obtains better accuracy than both the CRF and max-margin models.\u00b9 We note that the accuracy differences are most pronounced at the low percentage identity ranges, the \u2018twilight zone\u2019 regime where better alignment accuracy has far-reaching consequences in many other computational biology applications [16].\n\nRNA secondary structure prediction Ribonucleic acid (RNA) refers to a class of long linear polymers composed of four different types of nucleotides (A, C, G, U). Nucleotides within a single RNA molecule base-pair with each other, giving rise to a pattern of base-pairing known as the RNA\u2019s secondary structure. In the RNA secondary structure prediction problem, we are given an RNA sequence (a string of approximately 20-500 characters) and are asked to predict the secondary structure that the RNA molecule will form in vivo. Conceptually, an RNA secondary structure can be thought of as a set of unordered pairs of nucleotide indices, where each pair designates two\n\n\u00b9We note that the results here are based on using the Viterbi algorithm for parsing, which differs from the inference method used in [3]. In practice this is preferable to posterior decoding, as it is significantly faster, which is crucial in applications to large amounts of data.\n\nFigure 2: Tightness of the nonconvex bound. Figures (a) and (b) show the value of the nonconvex loss, the convex loss and the actual loss as a function of the number of iterations when minimizing the nonconvex upper bound. 
At each relinearization, which occurs every 1000 iterations, the nonconvex upper bound decreases. Note that the convex upper bound increases in the process, as the convex and nonconvex bounds diverge further from each other. We chose \u03bb = 2\u207b\u2076 in Figure (a) and \u03bb = 2\u2077 for Figure (b). Figure (c) shows the tightness of the final nonconvex bound at the end of optimization for different values of the regularization parameter \u03bb.\n\nnucleotides in the RNA molecule which base-pair with each other. Following convention, we take the structured output space Y to be the set of all possible pseudoknot-free structures.\nWe used a max-margin model for secondary structure prediction. The features of the model were chosen to match the energetic terms in standard thermodynamic models for RNA folding [4]. As our loss function, we used \u2206(y, y\u2032) = 1 \u2212 recall (where recall is the proportion of base-pairs in the reference structure y that are recovered in the predicted structure y\u2032). We again used the subgradient algorithm for optimization.\nTo test the algorithm, we performed two-fold cross-validation over a large collection of 1359 RNA sequences with known secondary structures from the RFAM database (release 8.1) [7]. We evaluated the methods using two standard metrics for RNA secondary structure prediction accuracy known as sensitivity and selectivity (which are the equivalents of recall and precision, respectively, for this domain). For reporting, we binned the sequences in the test collection by length into four ranges (1-50, 51-100, 101-200, 201+ nucleotides), and evaluated the sensitivity and selectivity of the algorithm for each subset in addition to overall accuracy (see Table 3).\nAgain, our algorithm consistently outperforms an equivalently parameterized CRF and max-margin model in terms of sensitivity.\u00b2 The selectivity of the predictions from our algorithm is often worse than that of the other two models. 
This is likely because we opted for a loss function that penalizes\nfor \u201cfalse negative\u201d base-pairings but not \u201cfalse-positives\u201d since our main interest is in identifying\ncorrect base-pairings (a harder task than predicting only a small number of high-con\ufb01dence base-\npairings). An alternative loss function that chooses a different balance between penalizing false\npositives and false negatives would achieve a different trade-off of sensitivity and selectivity.\nTightness of the bound: We generated plots of the convex, nonconvex, and actual losses (which\ncorrespond to l(x, y, y, f), l(x, y, f), and \u2206(y, y\u2217(x, f)), respectively, from Lemma 2) over the\ncourse of optimization for our RNA folding task (see Figure 2). From Figures 2a and 2b, we see\nthat the nonconvex loss provides a much tighter upper bound on the actual loss function. Figure 2c\nshows that the tightness of the bound decreases for increasing regularization parameters \u03bb.\nIn summary, our bound leads to improvements whenever there is a large number of instances (x, y)\nwhich cannot be classi\ufb01ed perfectly. This is not surprising as for \u201cclean\u201d datasets even the convex\nupper bound vanishes when no margin errors are encountered. Hence noticeable improvements can\nbe gained mainly in the structured output setting rather than in binary classi\ufb01cation.\n\n2Note that the results here are based on using the CYK algorithm for parsing, which differs from the infer-\n\nence method used in [4].\n\n7\n\n\f6 Summary and Discussion\n\nWe proposed a simple modi\ufb01cation of the convex upper bound of the loss in structured estimation\nwhich can be used to obtain tighter bounds on sophisticated loss functions. 
The advantage of our\napproach is that it requires next to no modi\ufb01cation of existing optimization algorithms but rather\nrepeated invocation of a structured estimation solver such as SVMStruct, BMRM, or Pegasos.\nIn several applications our approach outperforms the convex upper bounds. This can be seen both\nfor multiclass classi\ufb01cation, for ranking where we encountered under\ufb01tting and undesirable trivial\nsolutions for the convex upper bound, and in the context of sequence alignment where in particular\nfor the hard-to-align observations signi\ufb01cant gains can be found.\nFrom this experimental study, it seems that the tighter non-convex upper bound is useful in two\nscenarios: when the labels are noisy and when for each example there is a large set of labels which\nare (almost) as good as the label in the training set. Future work includes studying other types\nof structured estimation problems such as the ones encountered in NLP to check if our new upper\nbound can also be useful for these problems.\n\nReferences\n[1] K. Crammer, and Y. Singer. On the Learnability and Design of Output Codes for Multiclass Problems. In\n\nCOLT 2000, pages 35\u201346. Morgan Kaufmann, 2000.\n\n[2] R. Collobert, F.H. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In ICML 2006, pages\n\n201\u2013208. ACM, 2006.\n\n[3] C. B. Do, S. S. Gross, and S. Batzoglou. CONTRAlign: discriminative training for protein sequence\n\nalignment. In RECOMB, pages 160\u2013174, 2006.\n\n[4] C. B. Do, D. A. Woods, and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without\n\nphysics-based models. Bioinformatics, 22(14):e90\u2013e98, 2006.\n\n[5] S. R. Eddy. Non-coding RNA genes and the modern RNA world. Nature Reviews Genetics, 2(12):919\u2013\n\n929, 2001.\n\n[6] Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An ef\ufb01cient boosting algorithm for combining prefer-\n\nences. In ICML 1998, pages 170\u2013178., 1998.\n\n[7] S. 
Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman. Rfam: annotating non-coding RNAs in complete genomes. Nucl. Acids Res., 33:D121\u2013D124, 2005.\n[8] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115\u2013132. MIT Press, 2000.\n[9] T. Hoang. DC optimization: Theory, methods, and applications. In R. Horst and P. Pardalos, editors, Handbook of Global Optimization. Kluwer.\n[10] T. Joachims. Optimizing search engines using clickthrough data. In KDD. ACM, 2002.\n[11] T. Joachims, T. Galor, and R. Elber. Learning to align sequences: A maximum-margin approach. In New Algorithms for Macromolecular Simulation, LNCS 49, pages 57\u201368. Springer, 2005.\n[12] Q. Le and A. J. Smola. Direct optimization of ranking measures. NICTA-TR, 2007.\n[13] T.-Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In LR4IR, 2007.\n[14] J. Pei and N. V. Grishin. MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucl. Acids Res., 34(16):4364\u20134374, 2006.\n[15] N. Ratliff, J. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In AISTATS, 2007.\n[16] B. Rost. Twilight zone of protein sequence alignments. Protein Eng., 12(2):85\u201394, 1999.\n[17] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. Intl. Conf. Machine Learning, 2007.\n[18] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS 16, pages 25\u201332. MIT Press, 2004.\n[19] C. H. Teo, Q. Le, A. J. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In KDD. ACM, 2007.\n[20] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. 
Large margin methods for structured and\n\ninterdependent output variables. J. Mach. Learn. Res., 6:1453\u20131484, 2005.\n\n[21] M. Weimer, A. Karatzoglou, Q. Le, and A. Smola. Co\ufb01 rank - maximum margin matrix factorization for\n\ncollaborative ranking. In NIPS 20. MIT Press, 2008.\n\n[22] A.L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915\u2013936, 2003.\n[23] A. Nedic and D. P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. Siam\n\nJ. on Optimization, 12:109\u2013138, 2001.\n\n8\n\n\f", "award": [], "sourceid": 736, "authors": [{"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Chuong", "family_name": "B.", "institution": null}, {"given_name": "Choon", "family_name": "Teo", "institution": null}, {"given_name": "Quoc", "family_name": "Le", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}