{"title": "Mistake Bounds for Binary Matrix Completion", "book": "Advances in Neural Information Processing Systems", "page_first": 3954, "page_last": 3962, "abstract": "We study the problem of completing a binary matrix in an online learning setting. On each trial we predict a matrix entry and then receive the true entry. We propose a Matrix Exponentiated Gradient algorithm [1] to solve this problem. We provide a mistake bound for the algorithm, which scales with the margin complexity [2, 3] of the underlying matrix. The bound suggests an interpretation where each row of the matrix is a prediction task over a finite set of objects, the columns. Using this we show that the algorithm makes a number of mistakes which is comparable up to a logarithmic factor to the number of mistakes made by the Kernel Perceptron with an optimal kernel in hindsight. We discuss applications of the algorithm to predicting as well as the best biclustering and to the problem of predicting the labeling of a graph without knowing the graph in advance.", "full_text": "Mistake Bounds for Binary Matrix Completion\n\nMark Herbster\n\nUniversity College London\n\nDepartment of Computer Science\n\nLondon WC1E 6BT, UK\n\nm.herbster@cs.ucl.ac.uk\n\nStephen Pasteris\n\nUniversity College London\n\nDepartment of Computer Science\n\nLondon WC1E 6BT, UK\n\ns.pasteris@cs.ucl.ac.uk\n\nMassimiliano Pontil\n\nIstituto Italiano di Tecnologia\n\n16163 Genoa, Italy\n\nand\n\nUniversity College London\n\nDepartment of Computer Science\n\nLondon WC1E 6BT, UK\nm.pontil@cs.ucl.ac.uk\n\nAbstract\n\nWe study the problem of completing a binary matrix in an online learning setting.\nOn each trial we predict a matrix entry and then receive the true entry. We propose\na Matrix Exponentiated Gradient algorithm [1] to solve this problem. We provide a\nmistake bound for the algorithm, which scales with the margin complexity [2, 3] of\nthe underlying matrix. 
The bound suggests an interpretation where each row of\nthe matrix is a prediction task over a finite set of objects, the columns. Using this\nwe show that the algorithm makes a number of mistakes which is comparable up\nto a logarithmic factor to the number of mistakes made by the Kernel Perceptron\nwith an optimal kernel in hindsight. We discuss applications of the algorithm to\npredicting as well as the best biclustering and to the problem of predicting the\nlabeling of a graph without knowing the graph in advance.\n\n1 Introduction\n\nWe consider the problem of predicting online the entries in an m × n binary matrix U. We formulate\nthis as the following game: nature queries an entry (i_1, j_1); the learner predicts ŷ_1 ∈ {−1, 1} as the\nmatrix entry; nature presents a label y_1 = U_{i_1,j_1}; nature queries the entry (i_2, j_2); the learner predicts\nŷ_2; and so forth. The learner’s goal is to minimize the total number of mistakes M = |{t : ŷ_t ≠ y_t}|.\nIf nature is adversarial, the learner will always mispredict, but if nature is regular or simple, there is\nhope that a learner may make only a few mispredictions.\nIn our setting we are motivated by the following interpretation of matrix completion. Each of the\nm rows represents a task (or binary classifier) and each of the n columns is associated with an\nobject (or input). A task is the problem of predicting the binary label of each of the objects. For a\nsingle task, if we were given a kernel matrix between the objects in advance we could then use the\nKernel Perceptron algorithm to sequentially label the objects and this algorithm would incur O(1/γ²)\nmistakes, where γ is the margin of the best linear classifier in the inner product space induced by\nthe kernel. Unfortunately, in our setup, we do not know a good kernel in advance. 
However, we will\nshow that a remarkable property of our algorithm is that it enjoys, up to logarithmic factors, a mistake\nbound of O(1/γ²) per task, where γ is the largest possible margin (over the choice of the kernel)\nwhich is achieved on all tasks.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThe problem of predicting online the labels of a finite set of objects under the assumption that the\nsimilarity between objects can be described by a graph was introduced in [4], building upon earlier\nwork in the batch setting [5, 6]. In this and later research the common assumption is that two objects\nare similar if there is an edge in the graph connecting them and the aim is to predict well when there\nare few edges between objects with disagreeing labels. Lower bounds and an optimal algorithm (up\nto logarithmic factors) for this problem were given in [7, 8]. The problem of predicting well when\nthe graph is unknown was previously addressed in [9, 10]. That research took the approach that when\nreceiving a vertex to predict, edges local to that vertex were then revealed. In this paper we take a\ndifferent approach: the graph structure is never revealed to the learner. Instead, we have a number of\ntasks over the same unknown graph, and the hope is to perform comparably to the case in which the\ngraph is known in advance.\nThe general problem of matrix completion has been studied extensively in the batch statistical i.i.d.\nsetting, see for example [11, 12, 13] and references therein. These studies are concerned either with\nRademacher bounds or statistical oracle inequalities, both of which are substantially different from the\nfocus of the present paper. In the online mistake-bound setting a special form of matrix completion\nwas previously considered as the problem of learning a binary relation [14, 15] (see Section 5). 
In\na more general online setting, with minimal assumptions on the loss function, [16, 17] bounded the\nregret of the learner in terms of the trace-norm of the underlying matrix. Instead our bounds are\nwith respect to the margin complexity of the matrix. As a result, although our bounds have a more\nrestricted applicability they have the advantage that they become non-trivial after only Θ̃(n) matrix\nentries¹ are observed as opposed to the required Θ̃(n^{3/2}) in [16] and Θ̃(n^{7/4}) in [17]. The notion\nof margin complexity in machine learning was introduced in [2] where it was used to study the\nlearnability of concept classes via linear embeddings and further studied in [3], where it was linked\nto the γ₂ norm. Here we adopt the terminology in [11] and refer to the γ₂ norm as the max-norm.\nThe margin complexity seems to be a more natural parameter as opposed to the trace-norm for the\n0-1 loss as it only depends on the signs of the underlying comparator matrix. To the best of our\nknowledge the bounds contained herein are the first online matrix completion bounds in terms of the\nmargin complexity.\nTo obtain our results, we use an online matrix multiplicative weights algorithm, e.g., see [1, 18, 17, 19]\nand references therein. These kinds of algorithms have been applied in a number of learning\nscenarios, including online PCA [20], online variance minimization [21], solving SDPs [18], and\nonline prediction with switching sequences [22]. These algorithms update a new hypothesis matrix\non each trial by trading off fidelity to the previous hypothesis and the incorporation of the new label\ninformation. The tradeoff is computed as an approximate spectral regularization via the quantum\nrelative entropy (see [1, Section 3.1]). 
The particular matrix multiplicative weights algorithm we\napply is Matrix Winnow [19]; we adapt this algorithm and its mistake bound analysis for our purposes\nvia selection of comparator, threshold, and appropriate “progress inequalities.”\nThe paper is organized as follows. In Section 2 we introduce basic notions used in the paper. In\nSection 3 we present our algorithm and derive a mistake bound, also comparing it to related bounds\nin the literature. In Section 4 we observe that our algorithm is able to exploit matrix structure to\nperform comparably to the Kernel Perceptron with the best kernel known in advance. Finally, in\nSection 5 we discuss the example of biclustered matrices, and argue that our bound is optimal up to a\npolylogarithmic factor. The appendix contains proofs of the results only stated in the main body of\nthe paper, and other auxiliary results.\n\n¹For simplicity we assume m ∈ Θ(n).\n\n2 Preliminaries\n\nWe denote the set of the first m positive integers as N_m = {1, . . . , m}. We denote the inner product of vectors x, w ∈ R^n as ⟨x, w⟩ = Σ_{i=1}^{n} x_i w_i and the norm as |w| = √⟨w, w⟩. We let R^{m×n} be the set of all m × n real-valued matrices. If X ∈ R^{m×n} then X_i denotes the i-th n-dimensional row vector and the (i, j) entry in X is X_{ij}. The trace of a square matrix X ∈ R^{n×n} is Tr(X) = Σ_{i=1}^{n} X_{ii}. The trace norm of a matrix X ∈ R^{m×n} is ‖X‖₁ = Tr(√(XᵀX)), where √· indicates the unique positive square root of a positive semi-definite matrix. For every matrix U ∈ {−1, 1}^{m×n}, we define SP(U) = {V ∈ R^{m×n} : ∀ij V_{ij}U_{ij} > 0}, the set of matrices which are sign consistent with U. 
We also define SP₁(U) = {V ∈ R^{m×n} : ∀ij V_{ij}U_{ij} ≥ 1}, that is the set of matrices which are sign consistent with U with a margin of at least one.\nThe max-norm (or γ₂ norm [3]) of a matrix U ∈ R^{m×n} is defined by the formula\n\n‖U‖_max := inf_{PQᵀ=U} { max_{1≤i≤m} |P_i| · max_{1≤j≤n} |Q_j| },\n\nwhere the infimum is over all matrices P ∈ R^{m×k} and Q ∈ R^{n×k} and every integer k. The margin complexity of a matrix U ∈ R^{m×n} is\n\nmc(U) := inf_{PQᵀ∈SP(U)} max_{ij} |P_i||Q_j| / |⟨P_i, Q_j⟩|.\n\n
This quantity plays a central role in the analysis of our algorithm. If we interpret the rows of U as m different binary classification tasks, and the columns as a finite set of objects which we wish to label, the “min-max” margin with respect to an embedding is the smallest of the m maximal margins over the tasks. The quantity 1/mc(U) is then the maximum “min-max” margin with respect to all possible embeddings. Specifically, the rows of matrix P represent the “weights” of the binary classifiers and the rows of matrix Q the “input vectors” associated with the objects. The quantity |⟨P_i, Q_j⟩| / (|P_i||Q_j|) is the margin of the i-th classifier on the j-th input. Observe that margin complexity depends only on the sign pattern of the matrix and not the magnitudes. The margin complexity is equivalently mc(U) = min_{V∈SP₁(U)} ‖V‖_max, see e.g., [3, Lemma 3.1].\n
In our online setting we are concerned with predicting an (example) sequence ((i_1, j_1), y_1), . . . , ((i_T, j_T), y_T) ∈ (N_m × N_n) × {−1, 1}. A sequence must be consistent, that is, given examples ((i, j), y) and ((i′, j′), y′) if (i, j) = (i′, j′) then y = y′. We define the set of sign-consistent matrices with a sequence S as cons(S) := {M ∈ R^{m×n} : 0 < yM_{ij}, ((i, j), y) ∈ S}. We extend the notion of margin complexity to sequences via mc(S) := inf_{U∈cons(S)} mc(U).\nThe number of margin violations in a sequence S at complexity γ is defined to be,\n\nmerr(S, γ) := inf_{PQᵀ∈cons(S)} |{((i, j), y) ∈ S : |P_i||Q_j| / |⟨P_i, Q_j⟩| > 1/γ}|.   (2)\n\nIn particular, note that merr(S, γ) = 0 if γ ≤ 1/mc(S).\n
Finally, we introduce the following quantity, which plays a central role in the amortized analysis of our algorithm.\nDefinition 2.1. The quantum relative entropy of symmetric positive semidefinite square matrices A and B is\n\nΔ(A, B) := Tr(A log(A) − A log(B) + B − A).\n\n3 Algorithm and Analysis\n\nAlgorithm 1 presents an adaptation of the Matrix Exponentiated Gradient algorithm [1, 17, 18, 19] to our setting. This algorithm is a matrix analog of the Winnow algorithm [19]; we refer to the above papers for more insights into this family of algorithms.\n\n
Algorithm 1 Predicting a binary matrix.\nParameters: Learning rate 0 < γ ≤ 1.\nInitialization: W^(0) ← I/(m + n), where I is the (m + n) × (m + n) identity matrix.\nFor t = 1, . . . , T\n• Get pair (i_t, j_t) ∈ N_m × N_n.\n• Define X^(t) := ½(e_{i_t} + e_{m+j_t})(e_{i_t} + e_{m+j_t})ᵀ, where e_k is the k-th basis vector of R^{m+n}.\n• Predict ŷ_t = 1 if Tr(W^(t−1)X^(t)) ≥ 1/(m + n), and ŷ_t = −1 otherwise.\n• Receive label y_t ∈ {−1, 1} and if ŷ_t ≠ y_t update W^(t) ← exp(log(W^(t−1)) + (γ/2)(y_t − ŷ_t)X^(t)).\n\n
The following theorem provides a mistake bound for the algorithm.\nTheorem 3.1. The number of mistakes, M, on sequence S made by Algorithm 1 with parameter 0 < γ ≤ 1 is upper bounded by\n\nM ≤ c [ (m + n) log(m + n) (1/γ²) + merr(S, γ) ],   (3)\n\nwhere c = 1/(3 − e) ≤ 3.55 and the quantity merr(S, γ) is given in equation (2).\n
Proof. Given U ∈ R^{m×n}, let P ∈ R^{m×k} and Q ∈ R^{n×k} be such that PQᵀ = U. For every i ∈ N_m, we denote by P_i the i-th row vector of P and for every j ∈ N_n, we denote by Q_j the j-th row vector of Q. We construct the (m + n) × k matrix\n\nR := diag(1/|P_1|, . . . , 1/|P_m|, 1/|Q_1|, . . . , 1/|Q_n|) [P; Q],\n\nwhere [P; Q] denotes P stacked above Q, and construct Ũ := (1/(m + n)) RRᵀ. Define matrix X^(t) := ½(e_{i_t} + e_{m+j_t})(e_{i_t} + e_{m+j_t})ᵀ, where e_k is the k-th basis vector of R^{m+n}.\nNote that Tr(X^(t)) = 1, Tr(Ũ) = 1 (since every row of R is normalized) and\n\n
Tr(ŨX^(t)) = (1/(n + m)) Tr((RRᵀ) ½(e_{i_t} + e_{m+j_t})(e_{i_t} + e_{m+j_t})ᵀ)\n= (1/(2(n + m))) (e_{i_t} + e_{m+j_t})ᵀ RRᵀ (e_{i_t} + e_{m+j_t})\n= (1/(2(n + m))) (Rᵀ(e_{i_t} + e_{m+j_t}))ᵀ (Rᵀ(e_{i_t} + e_{m+j_t}))\n= (1/(2(n + m))) (P_{i_t}/|P_{i_t}| + Q_{j_t}/|Q_{j_t}|)(P_{i_t}/|P_{i_t}| + Q_{j_t}/|Q_{j_t}|)ᵀ\n= (1/(n + m)) (1 + ⟨P_{i_t}, Q_{j_t}⟩ / (|P_{i_t}||Q_{j_t}|)).\n\n
For a trial t we say there is a margin violation if |P_{i_t}||Q_{j_t}| / |⟨P_{i_t}, Q_{j_t}⟩| > 1/γ. Let M⁻ denote the number of mistakes made in trials with margin violations and let M⁺ denote the number of mistakes made in trials without margin violations.\nFrom Lemma A.3 in the appendix we have\n\nΔ(Ũ, W^(t−1)) − Δ(Ũ, W^(t)) ≥ (γ/2)(y_t − ŷ_t) Tr(ŨX^(t)) + (1 − e^{(γ/2)(y_t − ŷ_t)}) Tr(W^(t−1)X^(t)),\n\nthen substituting in the above we have that\n\nΔ(Ũ, W^(t−1)) − Δ(Ũ, W^(t)) ≥ (γ/2)(y_t − ŷ_t) (1/(n + m)) (1 + ⟨P_{i_t}, Q_{j_t}⟩ / (|P_{i_t}||Q_{j_t}|)) + (1 − e^{(γ/2)(y_t − ŷ_t)}) Tr(W^(t−1)X^(t)).\n\n
To further simplify the above we use Lemma A.4 presented in the appendix, which gives\n\nΔ(Ũ, W^(t−1)) − Δ(Ũ, W^(t)) ≥ (c′ − 1) γ²/(n + m) if there is a margin violation, and ≥ c′ γ²/(n + m) otherwise,\n\nwhere c′ = 3 − e.\nUsing a telescoping sum, this gives\n\nΔ(Ũ, W^(0)) ≥ Δ(Ũ, W^(0)) − Δ(Ũ, W^(T)) ≥ M⁺ c′ γ²/(n + m) + M⁻ (c′ − 1) γ²/(n + m) = (c′M⁺ − (1 − c′)M⁻) γ²/(n + m)\n\nand hence\n\nM⁺ ≤ ((n + m)/(c′γ²)) Δ(Ũ, W^(0)) + ((1 − c′)/c′) M⁻.\n\nWe conclude that\n\nM = M⁺ + M⁻ ≤ ((n + m)/(c′γ²)) Δ(Ũ, W^(0)) + (1/c′) M⁻.\n\n
We also have that\n\nΔ(Ũ, W^(0)) = Tr(Ũ log(Ũ)) − Tr(Ũ log(W^(0))) + Tr(W^(0)) − Tr(Ũ)\n= Tr(Ũ log(Ũ)) − Tr(Ũ log(W^(0))) + 1 − 1\n= Tr(Ũ log(Ũ)) − Tr(Ũ log(W^(0))).\n\nWrite the eigen-decomposition of Ũ as Σ_{i=1}^{m+n} λ_i α_i α_iᵀ. Now we have Σ_{i=1}^{m+n} λ_i = Tr(Ũ) = 1 so all eigenvalues λ_i are in the range [0, 1] meaning log(λ_i) ≤ 0 so λ_i log(λ_i) ≤ 0 which are the eigenvalues of Ũ log(Ũ) meaning that Tr(Ũ log(Ũ)) ≤ 0. Also, log(W^(0)) = log(1/(n + m)) I so Ũ log(W^(0)) = log(1/(n + m)) Ũ and hence −Tr(Ũ log(W^(0))) = −log(1/(n + m)) Tr(Ũ) = log(m + n).\nSo by the above we have\n\nΔ(Ũ, W^(0)) ≤ log(m + n)\n\nand hence putting together we get\n\nM ≤ ((m + n)/(c′γ²)) log(m + n) + (1/c′) M⁻.\n\n
Observe that in the simplifying case when we have no margin errors (merr(S, γ) = 0) and the learning rate is γ := 1/mc(S) we have that the number of mistakes of Algorithm 1 is bounded by Õ((n + m) mc²(S)). More generally although the learning rate is fixed in advance, we may use a “doubling trick” to avoid the need to tune γ.\nCorollary 3.2. For any value of γ* the number of mistakes M made by the following algorithm:\nDOUBLING ALGORITHM:\nSet κ ← √2 and loop over\n1. Run Algorithm 1 with γ = 1/κ until it has made ⌈2c(m + n) log(m + n)κ²⌉ mistakes\n2. Set κ ← κ√2\nis upper bounded by\n\nM ≤ 12c [ (m + n) log(m + n) (1/(γ*)²) + merr(S, γ*) ],\n\nwith c = 1/(3 − e) ≈ 3.55.\n
See the appendix for a proof. We now compare our bound to other online learning algorithms for matrix completion. The algorithms of [16, 17] address matrix completion in a significantly more general setting. Both algorithms operate with weak assumptions on the loss function, while our algorithm is restricted to the 0–1 loss (mistake counting). Those papers present regret bounds, whereas we apply the stronger assumption that there exists a consistent predictor. As a regret bound is not possible for a deterministic predictor with the 0–1 loss, we compare Theorem 3.1 to their bound when their algorithm is allowed to predict ŷ ∈ [−1, 1] and uses absolute loss. For clarity in our discussion we will assume that m ∈ Θ(n).\n
Under the above assumptions, the regret bound in [17, Corollary 7] becomes 2√(‖U‖₁ (m + n)^{1/2} log(m + n) T). For simplicity we consider the simplified setting in which each entry is predicted, that is T = mn; then absorbing polylogarithmic factors, their bound is Õ(n^{5/4} ‖U‖₁^{1/2}). From Theorem 3.1 we have a bound of Õ(n mc²(U)). Using [11, Theorem 10], we may upper bound the margin complexity in terms of the trace norm,\n\nmc(U) ≤ 3 min_{V∈SP₁(U)} ‖V‖₁^{1/3} ≤ 3‖U‖₁^{1/3}.   (4)\n\nSubstituting this into Theorem 3.1 our bound is Õ(n ‖U‖₁^{2/3}). Since the trace norm may be bounded as n ≤ ‖U‖₁ ≤ n^{3/2}, both bounds become vacuous when ‖U‖₁ = n^{3/2}, however if the trace norm is bounded away from n^{3/2}, the bound of Theorem 3.1 is smaller by a polynomial factor. An aspect of the bounds which this comparison fails to capture is the fact that since [17, Corollary 7] is a regret bound it will degrade more smoothly under adversarial noise than Theorem 3.1.\n
The algorithm in [16] is probabilistic and the regret bound is of Õ(‖U‖₁√n). Unlike [17], the setting of [16] is transductive, that is each matrix entry is seen only once, and thus less general. If we use the upper bound from [11, Theorem 10] as in the discussion of [17] then [16] improves uniformly on our bound and the bound in [17]. However, using this upper bound oversimplifies the comparison as 1 ≤ mc²(U) ≤ n while n ≤ ‖U‖₁ ≤ n^{3/2} for U ∈ {−1, 1}^{m×n}. In other words we have been very conservative in our comparison; the bound (4) may be loose and our algorithm may often have a much smaller bound. 
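\n\nFor concreteness, the prediction and update steps of Algorithm 1 (Section 3) can be sketched in Python. This is a minimal sketch under stated assumptions rather than the authors' implementation: it assumes NumPy, uses 0-based indices, stores log W instead of W, and the names sym_expm, OnlineBinaryMatrixLearner and run_epochs are hypothetical. Replaying all entries of a rank-one sign matrix is only an illustration of a consistent sequence with margin complexity 1, for which gamma = 1 is admissible.

```python
import numpy as np


def sym_expm(g):
    # Matrix exponential of a symmetric matrix via its eigen-decomposition.
    vals, vecs = np.linalg.eigh(g)
    return (vecs * np.exp(vals)) @ vecs.T


class OnlineBinaryMatrixLearner:
    # Hypothetical wrapper around the prediction and update rule of Algorithm 1.
    def __init__(self, m, n, gamma):
        self.m, self.n, self.gamma = m, n, gamma
        # We store log W. Since W^(0) = I / (m + n), log W^(0) = -log(m + n) I.
        self.log_w = -np.log(m + n) * np.eye(m + n)

    def _x(self, i, j):
        # X = (1/2) (e_i + e_{m+j}) (e_i + e_{m+j})^T, with 0-based i and j.
        v = np.zeros(self.m + self.n)
        v[i] = 1.0
        v[self.m + j] = 1.0
        return 0.5 * np.outer(v, v)

    def predict(self, i, j):
        # Predict +1 when Tr(W X) reaches the threshold 1 / (m + n).
        w = sym_expm(self.log_w)
        score = np.trace(w @ self._x(i, j))
        return 1 if score >= 1.0 / (self.m + self.n) else -1

    def learn(self, i, j, y):
        # Predict first, then update only if the prediction was a mistake.
        y_hat = self.predict(i, j)
        if y_hat != y:
            self.log_w += 0.5 * self.gamma * (y - y_hat) * self._x(i, j)
        return y_hat


def run_epochs(u, gamma, epochs):
    # Replay every entry of u in row-major order; return mistakes per epoch.
    m, n = u.shape
    learner = OnlineBinaryMatrixLearner(m, n, gamma)
    per_epoch = []
    for _ in range(epochs):
        mistakes = 0
        for i in range(m):
            for j in range(n):
                y = int(u[i, j])
                if learner.learn(i, j, y) != y:
                    mistakes += 1
        per_epoch.append(mistakes)
    return per_epoch


# Rank-one sign matrix: margin complexity 1, so gamma = 1 is admissible.
a = np.array([1, -1, 1])
b = np.array([1, 1, -1])
u = np.outer(a, b)
history = run_epochs(u, gamma=1.0, epochs=40)
print(history)
```

With m = n = 3 and gamma = 1, Theorem 3.1 bounds the total number of mistakes by c (m + n) log(m + n), which is about 38, so the per-epoch counts must reach zero and stay there. Note that the very first prediction sits exactly at the threshold, so floating-point rounding may break that tie either way.\n\n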
A specific example is provided by the class of (k, ℓ)-biclustered matrices (see\nalso the discussion in Section 5 below) where mc²(U) ≤ min(k, ℓ), in which case the bound becomes\nnontrivial after Θ̃(min(k, ℓ) n) examples while the bounds in [16] and [17] become nontrivial after\nat least Θ̃(n^{3/2}) and Θ̃(n^{7/4}) examples, respectively.\nWith respect to computation our algorithm on each trial requires a single eigenvalue decomposition\nof a PSD matrix, whereas the algorithm of [17] requires multiple eigenvalue decompositions per trial.\nAlthough [16] does not discuss the complexity of their algorithm beyond the fact that it is polynomial,\nin [17] it is conjectured that it requires at a minimum Θ(n⁴) time per trial.\n\n4 Comparison to the Best Kernel Perceptron\n\nIn this section, we observe that Algorithm 1 has a mistake bound that is comparable to Novikoff’s\nbound [23] for the Kernel Perceptron with an optimal kernel in hindsight. To explain our observation,\nwe interpret the rows of matrix U as m different binary classification tasks, and the columns as a finite\nset of objects which we wish to label; think for example of a users/movies matrix in recommendation\nsystems. If we solve the tasks independently using a Kernel Perceptron algorithm, we will make\nO(1/γ²) mistakes per task, where γ is the largest margin of a consistent hypothesis. If every task has\na margin larger than γ we will make O(m/γ²) mistakes in total. This algorithm and the parameter γ\ncrucially depend on the kernel used: if there exists a kernel which makes γ large for all (or most of)\nthe tasks, then the Kernel Perceptron will incur a small number of mistakes on all (or most of) the\ntasks. We now argue that our bound mimics this “oracle”, without knowing in advance the kernel.\nWithout loss of generality, we assume m ≥ n (otherwise apply the same reasoning below to matrix\nUᵀ). 
In this scenario, Theorem 3.1 upper bounds the number of mistakes as\n\nO(m log m / γ²)\n\nwhere γ is chosen so that merr(S, γ) = 0. To further illustrate our idea, we define the task complexity of a matrix U ∈ R^{m×n} as\n\nτ(U) = min{h(V) : V ∈ SP₁(U)}\n\nwhere\n\nh(V) = inf_{K≻0} max_{1≤i≤m} V_i K⁻¹ V_iᵀ max_{1≤j≤n} K_{jj}.   (5)\n\nNote that the quantity V_i K⁻¹ V_iᵀ max_{1≤j≤n} K_{jj} is exactly the bound in Novikoff’s Theorem on the number of mistakes of the Kernel Perceptron on the i-th task with kernel K. Hence the quantity h(V) represents the best upper bound on the number of mistakes made by a Kernel Perceptron on the worst (since we take the maximum over i) task.\n
Proposition 4.1. For every U ∈ R^{m×n}, it holds that mc²(U) = τ(U).\nProof. The result follows by Lemma A.6 presented in the appendix and by the formula mc(U) = min_{V∈SP₁(U)} ‖V‖_max, see, e.g., [3, Lemma 3.1].\nReturning to the interpretation of the bound in Theorem 3.1, we observe that if no more than r out of the m tasks have margin smaller than a threshold γ, then setting the parameter of Algorithm 1 to this threshold, Theorem 3.1 gives a bound of\n\nO((m − r) log m / γ² + rn).\n\nThus we essentially “pay” linearly for every object in a difficult task. Since we assume n ≤ m, provided r is small the bound is “robust” to the presence of bad tasks.\n
We specialize the above discussion to the case that each of the m tasks is a binary labeling of an unknown underlying connected graph G := (V, E) with n vertices and assume that m ≥ n. We let U ∈ {−1, 1}^{m×n} be the matrix, the rows of which are different binary labelings of the graph. For every i ∈ N_m, we interpret U_i, the i-th row of matrix U, as the i-th labeling of the graph and let Φ_i be the corresponding cutsize, namely, Φ_i := |{(j, j′) ∈ E : U_{ij} ≠ U_{ij′}}| and define Φ_max := max_{1≤i≤m} Φ_i. 
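\n\nThe cutsize of a labeling, as just defined, can be computed either by counting disagreeing edges or through the graph Laplacian L, since for labels in {-1, 1} the quadratic form u L u^T equals four times the number of cut edges. The following is a small illustrative sketch assuming NumPy and an unweighted, undirected graph; the function names and the path-graph example are hypothetical and not taken from the paper.

```python
import numpy as np


def graph_laplacian(n, edges):
    # L = D - A for an unweighted, undirected graph on vertices 0..n-1.
    lap = np.zeros((n, n))
    for j, k in edges:
        lap[j, j] += 1.0
        lap[k, k] += 1.0
        lap[j, k] -= 1.0
        lap[k, j] -= 1.0
    return lap


def cutsize(labels, edges):
    # Number of edges whose two endpoints carry different labels.
    return sum(1 for j, k in edges if labels[j] != labels[k])


# Hypothetical example: the path graph 0 - 1 - 2 - 3 with two labelings (rows).
edges = [(0, 1), (1, 2), (2, 3)]
lap = graph_laplacian(4, edges)
u = np.array([[1, 1, -1, -1],
              [1, -1, 1, -1]])
cuts = [cutsize(row, edges) for row in u]
quad = [float(row @ lap @ row) / 4.0 for row in u]
print(cuts, quad)
```

Both computations agree on each row, and the largest entry of cuts plays the role of the maximum cutsize over tasks.\n\n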
In order to apply Theorem 3.1, we need to bound the margin complexity of U. Using the above analysis (Proposition 4.1), this quantity is upper bounded by\n\nmc²(U) ≤ max_{1≤i≤m} U_i K⁻¹ U_iᵀ max_{1≤j≤n} K_{jj}.   (6)\n\nWe choose the kernel K := L⁺ + (R 11ᵀ), where L is the graph Laplacian of G, the vector 1 has all components equal to one, and R = max_j L⁺_{jj}. Since the graph is connected then 1 is the only eigenvector of L with zero eigenvalue. Hence K is invertible and K⁻¹ = L + (R 11ᵀ)⁺ = L + (R n (1/√n) 11ᵀ (1/√n))⁺ = L + (1/(Rn²)) 11ᵀ. Then using the formula Φ_i = ¼ U_i L U_iᵀ we obtain from (6) that\n\nmc²(U) ≤ max_{1≤i≤m} (4Φ_i + 1/R) R.\n\nTheorem 3.1 then gives a bound of M ≤ O((1 + Φ_max R) m log m). The quantity R may be further upper bounded by the graph resistance diameter, see for example [24].\n\n
5 Biclustering and Near Optimality\n\nThe problem of learning a (k, ℓ)-binary-biclustered matrix corresponds to the assumption that the row indices and column indices represent k and ℓ distinct object types and that there exists a binary relation on these objects which determines the matrix entry. Formally we have the following\nDefinition 5.1. The class of (k, ℓ)-binary-biclustered matrices is defined as\n\nB^{m,n}_{k,ℓ} = {U ∈ R^{m×n} : r ∈ N_k^m, c ∈ N_ℓ^n, F ∈ {−1, 1}^{k×ℓ}, U_{ij} = F_{r_i c_j}, i ∈ N_m, j ∈ N_n}.\n\nThe intuition is that a matrix is (k, ℓ)-biclustered if after a permutation of the rows and columns the resulting matrix is a k × ℓ grid of rectangles and all entries in a given rectangle are either 1 or −1. The problem of determining a (k, ℓ)-biclustered matrix with a minimum number of “violated” entries given a subset of entries was shown to be NP-hard in [25]. Thus although we do not give an algorithm that provides a biclustering, we provide a bound in terms of the best consistent biclustering.\n
Lemma 5.2. If U ∈ B^{m,n}_{k,ℓ} then mc²(U) ≤ min(k, ℓ).\nProof. We use Proposition 4.1 to upper bound mc²(U) by h(U), where the function h is given in equation (5). We further upper bound h(U) by choosing a kernel matrix in the underlying optimization problem. By Definition 5.1, there exists r ∈ N_k^m, c ∈ N_ℓ^n and F ∈ {−1, 1}^{k×ℓ} such that U_{ij} = F_{r_i c_j}, for every i ∈ N_m and every j ∈ N_n. Then we choose the kernel matrix K = (K_{jj′})_{1≤j,j′≤n} such that\n\nK_{jj′} := δ_{c_j c_{j′}} + ε δ_{jj′}.\n\nOne verifies that U_i K⁻¹ U_iᵀ ≤ ℓ for every i ∈ {1, . . . , m}, hence by taking the limit for ε → 0 Proposition 4.1 gives that mc²(U) ≤ ℓ. By the symmetry of our construction we can swap ℓ with k, giving the bound.\n\n
Using this lemma with Theorem 3.1 gives us the following upper bound on the number of mistakes.\nCorollary 5.3. The number of mistakes of Algorithm 1 applied to sequences generated by a (k, ℓ)-binary-biclustered matrix is upper bounded by O(min(k, ℓ)(m + n) log(m + n)).\nA special case of the setting in this corollary was first studied in the mistake bound setting in [14]. In [15] the bound was improved and generalized to include robustness to noise (for simplicity we do not compare in the noisy setting). In both papers the underlying assumption is that there are k distinct row types and no restrictions on the number of columns thus ℓ = n. In this case they obtained an upper bound of kn + min((m²/(2e)) log₂ e, m√(3n log₂ k)). Comparing the two bounds we can see that when k < n^{1/2−ε} the bound in Corollary 5.3 improves over [15, Corollary 1] by a polynomial factor and on the other hand when k ≥ n^{1/2} we are no worse than a polylogarithmic factor.\n
We now establish that the mistake bound (3) is tight up to a poly-logarithmic factor.\nTheorem 5.4. Given an online algorithm A that predicts the entries of a matrix U ∈ {−1, 1}^{m×n} and given an ℓ ∈ N_n there exists a sequence S constructed by an adversary with margin complexity mc(S) ≤ √ℓ. On this sequence the algorithm A will make at least ℓ × m mistakes.\n\nSee the appendix for a proof.\n\n6 Conclusion\n\n
In this paper, we presented a Matrix Exponentiated Gradient algorithm for completing the entries of a binary matrix in an online learning setting. We established a mistake bound for this algorithm, which is controlled by the margin complexity of the underlying binary matrix. We discussed improvements of the bound over related bounds for matrix completion. Specifically, we noted that our bound requires fewer examples before it becomes non-trivial, as compared to the bounds in [16, 17]. Here we require only Θ̃(m + n) examples as opposed to the required Θ̃((m + n)^{3/2}) in [16] and Θ̃((m + n)^{7/4}) in [17]. Thus although our bound is more sensitive to noise, it captures structure more quickly in the underlying matrix. When interpreting the rows of the matrix as binary tasks, we argued that our algorithm performs comparably (up to logarithmic factors) to the Kernel Perceptron with the optimal kernel in retrospect. Finally, we highlighted the example of completing a biclustered matrix and noted that this is instrumental in showing the optimality of the algorithm in Theorem 5.4.\n
We observed that Algorithm 1 has a per trial computational cost which is smaller than currently available algorithms for matrix completion with online guarantees. In the future it would be valuable to study if improvements in this computation are possible by exploiting the special structure in our algorithm. Furthermore, it would be very interesting to study a modification of our analysis to the case in which the tasks (rows of matrix U) grow over time, a setting which resembles the lifelong learning frameworks in [26, 27].\n\nAcknowledgements. 
We wish to thank the anonymous reviewers for their useful comments. This work was\nsupported in part by EPSRC Grants EP/P009069/1, EP/M006093/1, and by the U.S. Army Research Laboratory\nand the U.K. Defence Science and Technology Laboratory and was accomplished under Agreement Number\nW911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should\nnot be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research\nLaboratory, the U.S. Government, the U.K. Defence Science and Technology Laboratory or the U.K. Government.\nThe U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes\nnotwithstanding any copyright notation herein.\n\nReferences\n
[1] K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. Journal of Machine Learning Research, 6:995–1018, 2005.\n\n[2] S. Ben-David, N. Eiron, and H. U. Simon. Limitations of learning via embeddings in Euclidean half spaces. Journal of Machine Learning Research, 3:441–461, 2003.\n\n[3] N. Linial, S. Mendelson, G. Schechtman, and A. Shraibman. Complexity measures of sign matrices. Combinatorica, 27(4):439–463, 2007.\n\n[4] M. Herbster, M. Pontil, and L. Wainer. Online learning over graphs. In Proceedings of the 22nd International Conference on Machine Learning, pages 305–312, 2005.\n\n[5] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. 20th International Conference on Machine Learning, pages 912–919, 2003.\n\n[6] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56:209–239, 2004.\n\n[7] N. Cesa-Bianchi, C. Gentile, and F. Vitale. Fast and optimal prediction of a labeled tree. 
In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.\n\n[8] N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. Random spanning trees and the prediction of weighted graphs. Journal of Machine Learning Research, 14(1):1251–1284, 2013.\n\n[9] N. Cesa-Bianchi, C. Gentile, and F. Vitale. Predicting the labels of an unknown graph via adaptive exploration. Theoretical Computer Science, 412(19):1791–1804, 2011.\n\n[10] C. Gentile, M. Herbster, and S. Pasteris. Online similarity prediction of networked data from known and unknown graphs. In Proceedings of the 26th Annual Conference on Learning Theory, 2013.\n\n[11] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, pages 545–560, 2005.\n\n
[12] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inf. Theor., 56(5):2053–2080, May 2010.\n\n[13] A. Maurer and M. Pontil. Excess risk bounds for multitask learning with trace norm regularization. In Proceedings of the 26th Conference on Learning Theory (COLT), pages 55–76, 2013.\n\n[14] S. A. Goldman, R. L. Rivest, and R. E. Schapire. Learning binary relations and total orders. SIAM J. Comput., 22(5), 1993.\n\n[15] S. A. Goldman and M. K. Warmuth. Learning binary relations using weighted majority voting. In Proceedings of the 6th Annual Conference on Computational Learning Theory, pages 453–462, 1993.\n\n
[16] N. Cesa-Bianchi and O. Shamir. Efficient online learning via randomized rounding. In Advances in Neural Information Processing Systems 24, pages 343–351, 2011.\n\n[17] E. Hazan, S. Kale, and S. Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. In Proc. 25th Annual Conference on Learning Theory, volume 23, pages 38.1–38.13. JMLR W&CP, 2012.\n\n[18] S. Arora and S. Kale. A combinatorial, primal-dual approach to semidefinite programs. 
In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, pages 227–236, 2007.\n\n[19] M. K. Warmuth. Winnowing subspaces. In Proceedings of the 24th International Conference on Machine Learning, pages 999–1006, 2007.\n\n[20] J. Nie, W. Kotłowski, and M. K. Warmuth. Online PCA with optimal regrets. In Proceedings of the 24th International Conference on Algorithmic Learning Theory, pages 98–112, 2013.\n\n[21] M. K. Warmuth and D. Kuzmin. Online variance minimization. Machine Learning, 87(1):1–32, 2012.\n\n[22] M. Herbster, S. Pasteris, and M. Pontil. Predicting a switching sequence of graph labelings. Journal of Machine Learning Research, 16:2003–2022, 2015.\n\n[23] A. B. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, pages 615–622, 1962.\n\n[24] M. Herbster and M. Pontil. Prediction on a graph with a perceptron. In Advances in Neural Information Processing Systems 19, pages 577–584, 2006.\n\n[25] S. Wulff, R. Urner, and S. Ben-David. Monochromatic bi-clustering. In Proc. 30th International Conference on Machine Learning, volume 28, pages 145–153. JMLR W&CP, 2013.\n\n[26] P. Alquier, T.-T. Mai, and M. Pontil. Regret bounds for lifelong learning. Preprint, 2016.\n\n[27] M.-F. Balcan, A. Blum, and S. Vempala. Efficient representations for lifelong learning and autoencoding. In Proc. 28th Conference on Learning Theory, pages 191–210, 2015.\n\n[28] R. Bhatia. Matrix Analysis. Springer Verlag, New York, 1997.\n", "award": [], "sourceid": 1965, "authors": [{"given_name": "Mark", "family_name": "Herbster", "institution": "University College London"}, {"given_name": "Stephen", "family_name": "Pasteris", "institution": "UCL"}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": "University College London & Italian Institute of Technology"}]}