{"title": "Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 1193, "page_last": 1200, "abstract": null, "full_text": "Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds\n\nBenjamin I. P. Rubinstein Computer Science Division University of California, Berkeley Berkeley, CA 94720-1776, U.S.A. benr@cs.berkeley.edu\n\nPeter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu\n\nJ. Hyam Rubinstein Department of Mathematics & Statistics The University of Melbourne Parkville, Victoria 3010, Australia rubin@ms.unimelb.edu\n\nAbstract\nUnder the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of $n - 1$ points in $X$ and corresponding labels from a concept $f \\in F$, and aims to minimize the worst-case probability of erring on an $n$th point. By exploiting the structure of $F$, Haussler et al. achieved a $\\mathrm{VC}(F)/n$ bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a $O(\\log n)$ factor. The key data structure in their result is the natural subgraph of the hypercube--the one-inclusion graph; the key step is a $d = \\mathrm{VC}(F)$ bound on one-inclusion graph density. The first main result of this paper is a density bound of $n \\binom{n-1}{\\le d-1} / \\binom{n}{\\le d} < d$, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all $d$ (for $d \\approx \\Theta(n)$). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a $k$-class analogue of the $d/n$ mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. 
This bound on expected risk improves on known PAC-based results by a factor of $O(\\log n)$ and is shown to be optimal up to a $O(\\log k)$ factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout.\n\n1 Introduction\n\nIn [4, 3] Haussler, Littlestone and Warmuth proposed the one-inclusion prediction strategy as a natural approach to the prediction (or mistake-driven) model of learning, in which a prediction strategy maps a training sample and test point to a test prediction with hopefully guaranteed low probability of erring. The significance of their contribution was two-fold. On the one hand the derived $\\mathrm{VC}(F)/n$ upper-bound on the worst-case expected risk of the one-inclusion strategy learning from $F \\subseteq \\{0, 1\\}^X$ improved on the PAC-based previous-best by an order of $\\log n$. This was achieved by taking the structure of the underlying $F$ into account--which had not been done in previous work--in order to break ties between hypotheses consistent with the training set but offering contradictory predictions on a given test point. At the same time Haussler [3] introduced the idea of shifting subsets of the $n$-cube down around the origin--an idea previously developed in Combinatorics--as a powerful tool for learning-theoretic results. In particular, shifting admitted deeply insightful proofs of Sauer's Lemma and a VC-dimension bound on the density of the one-inclusion graph--the key result needed for the one-inclusion strategy's expected risk bound. Recently shifting has influenced work towards the sample compressibility conjecture of [7], e.g. in [5]. Here we continue to study the one-inclusion graph--the natural graph structure induced by a subset of the $n$-cube--and its related prediction strategy under the lens of shifting. After the necessary background, we develop the technique of shatter-invariant shifting in Section 3. 
While a subset's VC-dimension cannot be increased by shifting, shatter-invariant shifting guarantees a finite sequence of shifts to a fixed-point under which the shattering of a chosen set remains invariant, thus preserving VC-dimension throughout. In Section 4 we apply a group-theoretic symmetrization to tighten the mistake bound--the worst-case expected risk bound--of the deterministic (randomized) one-inclusion strategy from $d/n$ to $\\lceil D_n^d \\rceil / n$ ($D_n^d / n$), where $D_n^d < d$ for all $n, d$. The derived $D_n^d$ density bound positively resolves a conjecture of Kuzmin & Warmuth which was suggested as a step towards a correctness proof of the Peeling compression scheme [5]. Finally we generalize the prediction model, the one-inclusion strategy and its bounds from binary to $k$-class learning in Section 5. Where $\\Psi_G\\text{-dim}(F)$ and $\\Psi_P\\text{-dim}(F)$ denote the Graph and Pollard dimensions of $F$, the best bound on expected risk for $k \\in \\mathbb{N}$ to date is $O(\\alpha \\log \\alpha)$ for $\\alpha = \\Psi_G\\text{-dim}(F)/n$, for consistent learners [8, 1, 2, 4]. For large $n$ this is $O(\\log n \\, \\Psi_G\\text{-dim}(F)/n)$; we derive an improved bound of $\\Psi_P\\text{-dim}(F)/n$ which we show is at most a $O(\\log k)$ factor from optimal. Thus, as in the binary case, exploiting class structure enables significantly better bounds on expected risk for multiclass prediction. As always some proofs have been omitted in the interest of flow or space. In such cases see [8].\n\n2 Definitions & background\n\nIn this paper sets/random variables, scalars and vectors will be written in uppercase, lowercase and bolded typeface as in $C$, $x$, $\\mathbf{v}$. We define $\\binom{n}{\\le r} = \\sum_{i=0}^{r} \\binom{n}{i}$, $[n] = \\{1, \\ldots, n\\}$ and $S_n$ to be the set of permutations on $[n]$. We write the density of graph $G = (V, E)$ as $\\mathrm{dens}(G) = |E|/|V|$, the indicator of $A$ as $\\mathbf{1}[A]$, and $\\exists! x \\in X, P(x)$ to mean \"there exists a unique $x \\in X$ satisfying $P$.\"\n\n2.1 The prediction model of learning\n\nDefinition 2.1 The prediction model of learning concerns the following scenario. 
Given full knowledge of strategy $Q$, an adversary picks a distribution $P$ on $X$ and concept $f \\in F$ so as to maximize the probability of $\\{Q(\\mathrm{sam}((X_1, \\ldots, X_{n-1}), f), X_n) \\neq f(X_n)\\}$ where $X_i \\overset{\\text{i.i.d.}}{\\sim} P$. Thus the measure of performance is the worst-case expected risk\n$$\\hat{M}_{Q,F}(n) = \\sup_{f \\in F} \\sup_{P} \\mathbb{E}_{X \\sim P^n} \\left[ \\mathbf{1}\\left[ Q(\\mathrm{sam}((X_1, \\ldots, X_{n-1}), f), X_n) \\neq f(X_n) \\right] \\right].$$\n\nWe begin with the basic setup of [4]. Set $X$ is the domain and $F \\subseteq \\{0, 1\\}^X$ is a concept class on $X$. For notational convenience we write $\\mathrm{sam}(x, f) = ((x_1, f(x_1)), \\ldots, (x_n, f(x_n)))$ for $x \\in X^n$, $f \\in F$. A prediction strategy is a mapping of the form $Q : \\bigcup_{n > 1} (X \\times \\{0, 1\\})^{n-1} \\times X \\to \\{0, 1\\}$.\n\nA mistake bound for $Q$ with respect to $F$ is an upper-bound on $\\hat{M}_{Q,F}$. In contrast to Valiant's PAC model, the prediction learning model is not interested in approximating $f$ given an $f$-labeled sample, but instead in predicting $f(X_n)$ with small worst-case probability of erring. The following allows us to derive mistake-bounds by bounding a worst-case average.\n\nLemma 2.2 (Corollary 2.1 [4]) For any $n > 1$, concept class $F$ and prediction strategy $Q$,\n$$\\hat{M}_{Q,F}(n) \\le \\sup_{f \\in F} \\sup_{x \\in X^n} \\frac{1}{n!} \\sum_{g \\in S_n} \\mathbf{1}\\left[ Q(\\mathrm{sam}((x_{g(1)}, \\ldots, x_{g(n-1)}), f), x_{g(n)}) \\neq f(x_{g(n)}) \\right] = \\hat{\\hat{M}}_{Q,F}(n).$$\n\nA permutation mistake bound for $Q$ with respect to $F$ is an upper-bound on $\\hat{\\hat{M}}_{Q,F}$.\n\n2.2 The capacity of function classes contained in $\\{0, \\ldots, k\\}^X$\n\nWe denote by $\\Pi_x(F) = \\{(f(x_1), \\ldots, f(x_n)) \\mid f \\in F\\}$ the projection of $F \\subseteq Y^X$ on $x \\in X^n$.\n\nDefinition 2.3 The Vapnik-Chervonenkis dimension of concept class $F$ is defined as $\\mathrm{VC}(F) = \\sup\\{n \\mid \\exists x \\in X^n, \\Pi_x(F) = \\{0, 1\\}^n\\}$. An $x$ witnessing $\\mathrm{VC}(F)$ is said to be shattered by $F$.\n\nLemma 2.4 (Sauer's Lemma [9]) For any $n \\in \\mathbb{N}$ and $V \\subseteq \\{0, 1\\}^n$, $|V| \\le \\binom{n}{\\le \\mathrm{VC}(V)}$.\n\nA subset $V$ meeting this with equality is called maximum. It is well-known that the VC-dimension is an inappropriate measure of capacity when $|Y| > 2$. The following unifying framework of class capacities for $|Y| < \\infty$ is due to [1].\n\nDefinition 2.5 Let $k \\in \\mathbb{N}$, $F \\subseteq$ {0, . . 
. , k }$^X$ and $\\Psi$ be a family of mappings $\\psi : \\{0, \\ldots, k\\} \\to \\{0, 1, *\\}$ called translations. For $x \\in X^n$, $v \\in \\Pi_x(F) \\subseteq \\{0, \\ldots, k\\}^n$ and $\\boldsymbol{\\psi} \\in \\Psi^n$ we write $\\boldsymbol{\\psi}(v) = (\\psi_1(v_1), \\ldots, \\psi_n(v_n))$ and $\\boldsymbol{\\psi}(\\Pi_x(F)) = \\{\\boldsymbol{\\psi}(v) : v \\in \\Pi_x(F)\\}$. $x \\in X^n$ is $\\Psi$-shattered by $F$ if there exists a $\\boldsymbol{\\psi} \\in \\Psi^n$ such that $\\{0, 1\\}^n \\subseteq \\boldsymbol{\\psi}(\\Pi_x(F))$. The $\\Psi$-dimension of $F$ is defined by $\\Psi\\text{-dim}(F) = \\sup\\{n \\mid \\exists x \\in X^n, \\boldsymbol{\\psi} \\in \\Psi^n \\text{ s.t. } \\{0, 1\\}^n \\subseteq \\boldsymbol{\\psi}(\\Pi_x(F))\\}$. We next describe three important translation families used in this paper.\n\nExample 2.6 The families $\\Psi_P = \\{\\psi_{P,i} : i \\in [k]\\}$, $\\Psi_G = \\{\\psi_{G,i} : i \\in \\{0, \\ldots, k\\}\\}$ and $\\Psi_N = \\{\\psi_{N,i,j} : i, j \\in \\{0, \\ldots, k\\}, i \\neq j\\}$, where $\\psi_{P,i}(a) = \\mathbf{1}[a < i]$, $\\psi_{G,i}(a) = \\mathbf{1}[a = i]$ and $\\psi_{N,i,j}(a)$ equals $1, 0, *$ if $a = i$, $a = j$, $a \\notin \\{i, j\\}$ respectively, define the Pollard pseudo-dimension $\\Psi_P\\text{-dim}(V)$, the Graph dimension $\\Psi_G\\text{-dim}(V)$ and the Natarajan dimension $\\Psi_N\\text{-dim}(V)$.\n\n2.3 The one-inclusion prediction strategy\n\nA subset of the $n$-cube--the projection of some $F$--induces the one-inclusion graph, which underlies a natural prediction strategy. The following definition generalizes this to a subset of $\\{0, \\ldots, k\\}^n$.\n\nDefinition 2.7 The one-inclusion hypergraph $\\mathcal{G}(V) = (V, E)$ of $V \\subseteq \\{0, \\ldots, k\\}^n$ is the undirected graph with vertex-set $V$ and hyperedge-set $E$ of maximal (with respect to inclusion) sets of pairwise Hamming-1 separated vertices.\n\nAlgorithm 1 The deterministic multiclass one-inclusion prediction strategy $Q_{\\mathcal{G},F}$\nGiven: $F \\subseteq \\{0, \\ldots, k\\}^X$, $\\mathrm{sam}((x_1, \\ldots, x_{n-1}), f) \\in (X \\times \\{0, \\ldots, k\\})^{n-1}$, $x_n \\in X$\nReturns: a prediction of $f(x_n)$\n$V \\leftarrow \\Pi_x(F)$;\n$G \\leftarrow \\mathcal{G}(V)$;\n$\\overrightarrow{G} \\leftarrow$ orient $G$ to minimize the maximum outdegree;\n$V_{\\mathrm{space}} \\leftarrow \\{v \\in V \\mid v_1 = f(x_1), \\ldots, v_{n-1} = f(x_{n-1})\\}$;\nif $V_{\\mathrm{space}} = \\{v\\}$ then return $v_n$;\nelse return the $n$th component of the head of hyperedge $V_{\\mathrm{space}}$ in $\\overrightarrow{G}$;\n\nThe one-inclusion graph's prediction strategy $Q_{\\mathcal{G},F}$ [4] immediately generalizes to the multiclass prediction strategy described by Algorithm 1. 
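As a minimal sketch of Definition 2.7 in the binary case k = 1, the one-inclusion graph of a small vertex set can be built by brute force and its density compared with the VC-dimension. The function names and the example family are illustrative assumptions and do not come from the paper; the orientation step of Algorithm 1 is not attempted here.

```python
from itertools import combinations

def one_inclusion_edges(V):
    # Edges join vertices at Hamming distance exactly 1 (binary case, k = 1).
    vertices = sorted(set(V))
    return [(u, v) for u, v in combinations(vertices, 2)
            if sum(a != b for a, b in zip(u, v)) == 1]

def density(V):
    # dens(G(V)) = |E| / |V|
    return len(one_inclusion_edges(V)) / len(set(V))

def vc_dimension(V):
    # Largest |I| whose projection of V is the whole cube {0,1}^|I| (brute force).
    # Subsets of shattered sets are shattered, so we may stop at the first size
    # for which no index set is shattered.
    V = set(V)
    n = len(next(iter(V)))
    d = 0
    for size in range(1, n + 1):
        if not any(len({tuple(v[i] for i in I) for v in V}) == 2 ** size
                   for I in combinations(range(n), size)):
            break
        d = size
    return d

# Hypothetical example: all vectors in {0,1}^3 with at most one 1.
V = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(density(V), vc_dimension(V))  # prints 0.75 1
```

On this family the density 3/4 sits below the VC-dimension 1, which is the kind of gap the density bounds in the following sections quantify.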
For the remainder of this and Section 4 we will restrict our discussion to the $k = 1$ case, on which the following main result of [4] focuses.\n\nTheorem 2.8 (Theorem 2.3 [4]) $\\hat{M}_{Q_{\\mathcal{G},F},F}(n) \\le \\frac{\\mathrm{VC}(F)}{n}$ for every concept class $F$ and $n > 1$.\n\nA lower bound in [6] showed that the one-inclusion strategy's performance is optimal within a factor of $1 + o(1)$. Replacing orientation with a distribution over each edge induces a randomized strategy $Q_{\\mathcal{G}_{\\mathrm{rand}},F}$. The key to proving Theorem 2.8 is the following.\n\nLemma 2.9 (Lemma 2.4 [4]) For any $n \\in \\mathbb{N}$ and $V \\subseteq \\{0, 1\\}^n$, $\\mathrm{dens}(\\mathcal{G}(V)) \\le \\mathrm{VC}(V)$.\n\nAn elegant proof of this deep result, due to Haussler [3], uses shifting. Consider any $s \\in [n]$, $v \\in V$ and let $S_s(v; V)$ be $v$ shifted along $s$: if $v_s = 0$, or if $v_s = 1$ and there exists some $u \\in V$ differing from $v$ only in the $s$th coordinate, then $S_s(v; V) = v$; otherwise $v$ shifts down--its $s$th coordinate is decreased from 1 to 0. The entire family $V$ can be shifted to $S_s(V) = \\{S_s(v; V) \\mid v \\in V\\}$ and this shifted vertex-set induces $S_s(E)$, the edge-set of $\\mathcal{G}(S_s(V))$, where $(V, E) = \\mathcal{G}(V)$.\n\nDefinition 2.10 Let $I \\subseteq [n]$. We call a subset $V \\subseteq \\{0, 1\\}^n$ $I$-closed-below if $S_s(V) = V$ for all $s \\in I$. If $V$ is $[n]$-closed-below then we call it closed-below.\n\nA number of properties of shifting follow relatively easily:\n$|S_s(V)| = |V|$, by the injectivity of $S_s(\\cdot\\,; V)$ (1)\n$\\mathrm{VC}(S_s(V)) \\le \\mathrm{VC}(V)$, as $S_s(V)$ shatters $I \\subseteq [n]$ $\\Rightarrow$ $V$ shatters $I$ (2)\n$|E| \\le |V| \\, \\mathrm{VC}(V)$, as $V$ closed-below $\\Rightarrow$ $\\max_{v \\in V} \\|v\\|_{l_1} \\le \\mathrm{VC}(V)$ (3)\n$|S_s(E)| \\ge |E|$, by cases (4)\n$\\exists T \\in \\mathbb{N}, s \\in [n]^T$ s.t. $S_{s_T}(\\ldots S_{s_1}(V))$ is closed-below (a fixed-point). (5)\n\nProperties (1-2) and the justification of (3) together imply Sauer's lemma; Properties (1-5) lead to\n$$\\frac{|E|}{|V|} \\le \\frac{|S_{s_T}(\\ldots S_{s_1}(E))|}{|S_{s_T}(\\ldots S_{s_1}(V))|} \\le \\mathrm{VC}(S_{s_T}(\\ldots S_{s_1}(V))) \\le \\ldots \\le \\mathrm{VC}(V).$$\n\n3 Shatter-invariant shifting\n\nWhile [3] shifts to bound density, the number of edges can increase and the VC-dimension can decrease--both contributing to the observed gap between graph density and capacity. 
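The shift operator S_s from Section 2.3 can be sketched on small binary families as follows. This is an illustrative brute-force sketch under the assumption that vertices are 0/1 tuples; the helper names are hypothetical. It shows a shift that preserves the number of vertices while increasing the number of edges.

```python
def shift(V, s):
    # S_s(V): lower coordinate s of v from 1 to 0 unless the lowered vertex is already in V.
    V = set(V)
    shifted = set()
    for v in V:
        below = v[:s] + (0,) + v[s + 1:]
        shifted.add(v if (v[s] == 0 or below in V) else below)
    return shifted

def shift_to_closed_below(V):
    # Apply shifts until no coordinate changes the family (a fixed point).
    # Each non-trivial shift removes at least one 1, so the loop terminates.
    V = set(V)
    n = len(next(iter(V)))
    changed = True
    while changed:
        changed = False
        for s in range(n):
            W = shift(V, s)
            if W != V:
                V, changed = W, True
    return V

def num_edges(V):
    # Number of pairs at Hamming distance exactly 1.
    V = list(V)
    return sum(1 for a in range(len(V)) for b in range(a + 1, len(V))
               if sum(x != y for x, y in zip(V[a], V[b])) == 1)

V = {(0, 1), (1, 0)}
W = shift(V, 0)
print(sorted(W), num_edges(V), num_edges(W))  # prints [(0, 0), (0, 1)] 0 1
```

Here the family keeps its two vertices but gains an edge after one shift, in line with properties (1) and (4).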
The next result demonstrates that shifting can in fact be controlled to preserve VC-dimension.\n\nLemma 3.1 Consider arbitrary $n \\in \\mathbb{N}$, $I \\subseteq [n]$ and $V \\subseteq \\{0, 1\\}^n$ that shatters $I$. There exists a finite sequence $s_1, \\ldots, s_T$ in $[n]$ such that each $V_t = S_{s_t}(\\ldots S_{s_1}(V))$ shatters $I$ and $V_T$ is closed-below. In particular $\\mathrm{VC}(V_T) = \\mathrm{VC}(V_{T-1}) = \\ldots = \\mathrm{VC}(V)$.\n\nProof: $\\Pi_I(\\cdot)$ is invariant to shifting on $\\bar{I} = [n] \\setminus I$. So some finite number of shifts on $\\bar{I}$ will produce a $\\bar{I}$-closed-below family $W$ that shatters $I$. Hence $W$ must contain representatives for each element of $\\{0, 1\\}^{|I|}$ (embedded at $I$) with components equal to 0 outside $I$. Thus the shattering of $I$ is invariant to the shifting of $W$ on $I$, so that a finite number of shifts on $I$ produces an $I$-closed-below $W'$ that shatters $I$. Repeating the process a finite number of times until no non-trivial shifts are made produces a closed-below family that shatters $I$. The second claim follows from (2).\n\n4 Tightly bounding graph density by symmetrization\n\nKuzmin and Warmuth [5] introduced $D_n^d$ as a potential bound on the graph density of maximum classes. We begin with properties of $D_n^d$, a technical lemma and then proceed to the main result.\n\nDefinition 4.1 Define $D_n^d = n \\binom{n-1}{\\le d-1} / \\binom{n}{\\le d}$ for all $n \\in \\mathbb{N}$ and $d \\in [n]$. Denote by $V_n^d$ the VC-dimension $d$ closed-below subset of $\\{0, 1\\}^n$ equal to the union of all $\\binom{n}{d}$ closed-below embedded $d$-cubes.\n\nLemma 4.2 $D_n^d$ (i) equals the graph density of $V_n^d$ for each $n \\in \\mathbb{N}$ and $d \\in [n]$; (ii) is strictly upper-bounded by $d$, for all $n$; (iii) equals $d/2$ for all $n = d \\in \\mathbb{N}$; (iv) is strictly monotonic increasing in $d$ (with $n$ fixed); (v) is strictly monotonic increasing in $n$ (with $d$ fixed); and (vi) limits to $d$ as $n \\to \\infty$.\n\nProof: By counting, for each $d \\le n < \\infty$, the density of $\\mathcal{G}(V_n^d)$ equals $D_n^d$:\n$$\\frac{|E(\\mathcal{G}(V_n^d))|}{|V_n^d|} = \\frac{\\sum_{i=1}^{d} i \\binom{n}{i}}{\\binom{n}{\\le d}} = \\frac{\\sum_{i=0}^{d-1} (i+1) \\binom{n}{i+1}}{\\binom{n}{\\le d}} = \\frac{n \\sum_{i=0}^{d-1} \\binom{n-1}{i}}{\\binom{n}{\\le d}} = D_n^d,$$\nproving (i). Since for all $A, B, C, D > 0$, $\\frac{A}{B} < \\frac{A+C}{B+D}$ iff $\\frac{A}{B} < \\frac{C}{D}$, it is sufficient for (iv) to prove that $D_n^{d-1} < n \\binom{n-1}{d-1} / \\binom{n}{d}$. By (i) and Lemma 2.9 $D_n^{d-1} \\le d - 1$, and so\n$$D_n^{d-1} \\le d - 1 < d = \\frac{n \\, (n-1)!}{(n-d)! \\, (d-1)!} \\cdot \\frac{(n-d)! \\, d!}{n!} = \\frac{n \\binom{n-1}{d-1}}{\\binom{n}{d}}.$$\nMonotonicity in $d$, (i) and Lemma 2.9 together prove (ii). Properties (iii, v-vi) are proven in [8].\n\nLemma 4.3 For arbitrary $U, V \\subseteq \\{0, 1\\}^n$ with $\\mathrm{dens}(\\mathcal{G}(V)) \\ge \\rho > 0$, $|U| \\le |V|$ and $|E(\\mathcal{G}(U))| \\ge |E(\\mathcal{G}(V))|$, if $\\mathrm{dens}(\\mathcal{G}(U \\cap V)) < \\rho$ then $\\mathrm{dens}(\\mathcal{G}(U \\cup V)) > \\rho$.\n\nProof: If $\\mathcal{G}(U \\cap V)$ has density less than $\\rho$ then\n$$\\frac{|E(\\mathcal{G}(U \\cup V))|}{|U \\cup V|} \\ge \\frac{|E(\\mathcal{G}(U))| + |E(\\mathcal{G}(V))| - |E(\\mathcal{G}(U \\cap V))|}{|U| + |V| - |U \\cap V|} \\ge \\frac{2|E(\\mathcal{G}(V))| - |E(\\mathcal{G}(U \\cap V))|}{2|V| - |U \\cap V|} > \\frac{2\\rho|V| - \\rho|U \\cap V|}{2|V| - |U \\cap V|} = \\rho.$$\n\nFigure 1: The improved graph density bound of Theorem 4.4 (density plotted against $n$). The density bounding $D_n^d$ is plotted (dotted solid) alongside the previous best $d$ (dashed), for each $d \\in \\{1, 2, 10\\}$.\n\nTheorem 4.4 Every family $V \\subseteq \\{0, 1\\}^n$ with $d = \\mathrm{VC}(V)$ has $(V, E) = \\mathcal{G}(V)$ with graph density\n$$\\frac{|E|}{|V|} \\le D_n^d < d. \\quad (6)$$\nFor $n \\in \\mathbb{N}$ and $d \\in [n]$, $V_n^d$ is the unique closed-below VC-dimension $d$ subset of $\\{0, 1\\}^n$ meeting (6) with equality. A VC-dimension $d$ family $V \\subseteq \\{0, 1\\}^n$ meets (6) with equality only if $V$ is maximum.\n\nProof: Allow a permutation $g \\in S_n$ to act on vector $v \\in \\{0, 1\\}^n$ and family $V \\subseteq \\{0, 1\\}^n$ by $g(v) = (v_{g(1)}, \\ldots, v_{g(n)})$ and $g(V) = \\{g(v) \\mid v \\in V\\}$; and define $S_n(V) = \\bigcup_{g \\in S_n} g(V)$. Note that a closed-below VC-dimension $d$ family $V \\subseteq \\{0, 1\\}^n$ satisfies $S_n(V) = V$ iff $V = V_n^d$, as $\\mathrm{VC}(V) \\ge d$ implies $V$ contains an embedded $d$-cube, invariance to $S_n$ implies further that $V$ contains all $\\binom{n}{d}$ such cubes, and $\\mathrm{VC}(V) \\le d$ implies that $V \\subseteq V_n^d$. Consider now any\n$$V^*_{n,d} \\in \\arg\\min \\left\\{ |U| \\;:\\; U \\in \\arg\\max_{\\{U \\subseteq \\{0,1\\}^n \\mid \\mathrm{VC}(U) \\le d, \\, U \\text{ closed-below}\\}} \\mathrm{dens}(\\mathcal{G}(U)) \\right\\}.$$\nFor the purposes of contradiction assume that $V^*_{n,d} \\neq g(V^*_{n,d})$ for some $g \\in S_n$. 
Then if $\\mathrm{dens}(\\mathcal{G}(V^*_{n,d} \\cap g(V^*_{n,d}))) \\ge \\mathrm{dens}(\\mathcal{G}(V^*_{n,d}))$ then $V^*_{n,d}$ would not have been selected above (i.e. a closed-below family at least as small and dense as $V^*_{n,d} \\cap g(V^*_{n,d})$ would have been chosen). Thus $\\mathrm{dens}(\\mathcal{G}(V^*_{n,d} \\cup g(V^*_{n,d}))) > \\mathrm{dens}(\\mathcal{G}(V^*_{n,d}))$ by Lemma 4.3. But then again $V^*_{n,d}$ would not have been selected (i.e. a distinct family at least as dense as $V^*_{n,d} \\cup g(V^*_{n,d})$ would have been selected instead, since every vector in this union contains no more than $d$ 1's). Hence $V^*_{n,d} = S_n(V^*_{n,d})$ and so $V^*_{n,d} = V_n^{d'}$ and by Lemma 4.2.(i) $\\mathrm{dens}(\\mathcal{G}(V^*_{n,d})) = D_n^{d'}$, for $d' = \\mathrm{VC}(V^*_{n,d}) \\le d$. But by Lemma 4.2.(iv) this implies that $d' = d$ and (6) is true for all closed-below families; $V_n^d$ uniquely maximizes density amongst all closed-below VC-dimension $d$ families in the $n$-cube.\n\nFor an arbitrary $V \\subseteq \\{0, 1\\}^n$ with $d = \\mathrm{VC}(V)$ consider any of its closed-below fixed-points (cf. (5)), $W \\subseteq \\{0, 1\\}^n$. Noting that $\\mathrm{VC}(W) \\le d$ and $\\mathrm{dens}(\\mathcal{G}(V)) \\le \\mathrm{dens}(\\mathcal{G}(W))$ by (2) and (1) & (4) respectively, the bound (6) follows directly for $V$. Furthermore if we shift to preserve VC-dimension then $\\mathrm{VC}(W) = d$ while still $|V| = |W|$. And since $\\mathrm{dens}(\\mathcal{G}(W)) = D_n^d$ only if $W = V_n^d$, it follows that $V$ maximizes density amongst all VC-dimension $d$ families in the $n$-cube, with $\\mathrm{dens}(\\mathcal{G}(V)) = D_n^d$, only if it is maximum.\n\nTheorem 4.4 improves on the VC-dimension density bound of Lemma 2.9 for low sample sizes (see Figure 1). This new result immediately implies the following one-inclusion mistake bounds.\n\nTheorem 4.5 Consider any $n \\in \\mathbb{N}$ and $F \\subseteq \\{0, 1\\}^X$ with $\\mathrm{VC}(F) = d < \\infty$. Then $\\hat{M}_{Q_{\\mathcal{G},F},F}(n) \\le \\lceil D_n^d \\rceil / n$ and $\\hat{M}_{Q_{\\mathcal{G}_{\\mathrm{rand}},F},F}(n) \\le D_n^d / n$.\n\nFor small $d$, $n^*(d) = \\min\\{n \\ge d \\mid d = \\lceil D_n^d \\rceil\\}$--the first $n$ for which the new and old deterministic one-inclusion mistake bounds coincide--appears to remain very close to $2.96d$. 
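The density bound D_n^d = n * C(n-1, <=d-1) / C(n, <=d), where C(m, <=r) sums the binomial coefficients C(m, i) for i from 0 to r, can be evaluated exactly with rational arithmetic. The sketch below is illustrative only: the function names are hypothetical, and first_coincidence assumes that n(d) denotes the smallest n >= d at which the ceiling of D_n^d equals d.

```python
from fractions import Fraction
from math import ceil, comb

def binom_upto(n, r):
    # C(n, <=r) = sum of C(n, i) for i = 0..r
    return sum(comb(n, i) for i in range(r + 1))

def density_bound(n, d):
    # D_n^d = n * C(n-1, <=d-1) / C(n, <=d), kept as an exact fraction
    return Fraction(n * binom_upto(n - 1, d - 1), binom_upto(n, d))

def first_coincidence(d):
    # Assumed reading of n(d): smallest n >= d with ceil(D_n^d) == d.
    # Terminates because D_n^d increases towards d as n grows (Lemma 4.2).
    n = d
    while ceil(density_bound(n, d)) != d:
        n += 1
    return n

print(density_bound(3, 2))   # prints 9/7
print(first_coincidence(2))  # prints 3
```

For example, density_bound(4, 4) returns 2, matching property (iii) of Lemma 4.2, and every value computed this way for small n and d stays strictly below d.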
The randomized strategy's mistake bound of Theorem 4.5 offers a strict improvement over that of [4].\n\n5 Bounds for multiclass prediction\n\nAs in the $k = 1$ case, the key to developing the multiclass one-inclusion mistake bound is in bounding hypergraph density. We proceed by shifting a graph induced by the one-inclusion hypergraph.\n\nTheorem 5.1 For any $k, n \\in \\mathbb{N}$ and $V \\subseteq \\{0, \\ldots, k\\}^n$, the one-inclusion hypergraph $(V, E) = \\mathcal{G}(V)$ satisfies $\\frac{|E|}{|V|} \\le \\Psi_P\\text{-dim}(V)$.\n\nProof: We begin by replacing the hyperedge structure $E$ with a related edge structure $E'$. Two vertices $u, v \\in V$ are connected in the graph $(V, E')$ iff there exists an $i \\in [n]$ such that $u, v$ differ only at $i$ and no $w \\in V$ exists such that $u_i < w_i < v_i$ and $w_j = u_j = v_j$ on $[n] \\setminus \\{i\\}$. Trivially\n$$\\frac{|E|}{|V|} \\le \\frac{|E'|}{|V|} \\le \\frac{k|E|}{|V|}. \\quad (7)$$\nConsider now shifting vertex $v \\in V$ at shift label $t \\in [k]$ along shift coordinate $s \\in [n]$ by\n$$S_{s,t}(v; V) = \\begin{cases} v_s\\left(\\min\\{x \\in \\{0, \\ldots, v_s\\} \\mid v_s(x) \\notin V \\text{ or } x = v_s\\}\\right) & \\text{if } v_s = t \\\\ v & \\text{o.w.} \\end{cases}$$\nwhere $v_s(i) = (v_1, \\ldots, v_{s-1}, i, v_{s+1}, \\ldots, v_n)$ for $i \\in \\{0, \\ldots, k\\}$.\n\nWe shift $V$ on $s$ at $t$ as usual; we shift $V$ on $s$ alone by bubbling vertices down to fill gaps below:\n$$S_{s,t}(V) = \\{S_{s,t}(v; V) \\mid v \\in V\\}, \\qquad S_s(V) = S_{s,k}(S_{s,k-1}(\\ldots S_{s,1}(V))).$$\nLet $S_s(E')$ denote the edge-set induced by $S_s(V)$. $S_s$ on a vertex-set is injective implying that\n$$|S_s(V)| = |V|. \\quad (8)$$\nConsider any $\\{u, v\\} \\in E'$ with $i \\in [n]$ denoting the index on which $u, v$ differ. If $i = s$ then no other vertex $w \\in V$ can come between $u$ and $v$ during shifting by construction of $E'$, so $\\{S_s(u; V), S_s(v; V)\\} \\in S_s(E')$. Now suppose that $i \\neq s$. If both vertices shift down by the same number of labels then they remain connected in $S_s(E')$. Otherwise assume WLOG that $S_s(u; V)_s < S_s(v; V)_s$; then the shifted vertices will lose their edge. However, since $v_s$ did not shift down to $S_s(u; V)_s$ there must have been some $w \\in V$ different from $u$ on $\\{i, s\\}$ such that $w_s < v_s$ with $S_s(w; V)_s = S_s(u; V)_s$. 
Thus $S_s(w; V), S_s(u; V)$ differ only on $\\{i\\}$ and a new edge $\\{S_s(w; V), S_s(u; V)\\}$ is in $S_s(E')$ that was not in $E'$ (otherwise $u$ would not have shifted). Thus\n$$|S_s(E')| \\ge |E'|. \\quad (9)$$\nSuppose that $I \\subseteq [n]$ is $\\Psi_P$-shattered by $S_s(V)$. If $s \\notin I$ then $\\Pi_I(S_s(V)) = \\Pi_I(V)$ and $I$ is $\\Psi_P$-shattered by $V$. If $s \\in I$ then $V$ $\\Psi_P$-shatters $I$. Witnesses of $S_s(V)$'s $\\Psi_P$-shattering of $I$ equal to 1 at $s$, taking each value in $\\{0, 1\\}^{|I|-1}$ on $I \\setminus \\{s\\}$, were not shifted and so are witnesses for $V$; since these vertices were not shifted they were blocked by vertices of $V$ of equal values on $I \\setminus \\{s\\}$ but equal to 0 at $s$, and these are the remaining half of the witnesses of $V$'s $\\Psi_P$-shattering of $I$. Thus\n$$S_s(V) \\ \\Psi_P\\text{-shatters } I \\subseteq [n] \\ \\Rightarrow \\ V \\ \\Psi_P\\text{-shatters } I. \\quad (10)$$\nIn a finite number of shifts starting from $(V, E')$, a closed-below family $W$ with induced edge-set $F$ will be reached. If $I \\subseteq [n]$ is $\\Psi_P$-shattered by $W$ and $|I| = d = \\Psi_P\\text{-dim}(W)$, then since $W$ is closed-below the translation vector $(\\psi_{P,1}, \\ldots, \\psi_{P,1})(\\cdot) = (\\mathbf{1}[\\cdot < 1], \\ldots, \\mathbf{1}[\\cdot < 1])$ must witness this shattering. Hence each $w \\in W$ has at most $d$ non-zero components. Counting edges in $F$ by upper-adjoining vertices we have proved that\n$$(V, E') \\text{ finitely shifts to closed-below graph } (W, F) \\text{ s.t. } |F| \\le |W| \\, \\Psi_P\\text{-dim}(W). \\quad (11)$$\nCombining properties (7)-(11) we have that\n$$\\frac{|E|}{|V|} \\le \\frac{|E'|}{|V|} \\le \\frac{|F|}{|W|} \\le \\Psi_P\\text{-dim}(W) \\le \\Psi_P\\text{-dim}(V).$$\n\nThe remaining arguments from the $k = 1$ case of [4, 3] now imply the multiclass mistake bound.\n\nTheorem 5.2 Consider any $k, n \\in \\mathbb{N}$ and $F \\subseteq \\{0, \\ldots, k\\}^X$ with $\\Psi_P\\text{-dim}(F) < \\infty$. The multiclass one-inclusion prediction strategy satisfies $\\hat{M}_{Q_{\\mathcal{G},F},F}(n) \\le \\Psi_P\\text{-dim}(F)/n$.\n\n5.1 A lower bound\n\nWe now show that the preceding multiclass mistake bound is optimal to within a $O(\\log k)$ factor, noting that $\\Psi_N$ is smaller than $\\Psi_P$ by at most such a factor [1, Theorem 10].\n\nDefinition 5.3 We call a family $F \\subseteq$ {0, . . . 
, k }$^X$ trivial if either $|F| = 1$ or there exist no $x_1, x_2 \\in X$ and $f_1, f_2 \\in F$ such that $f_1(x_1) = f_2(x_1)$ and $f_1(x_2) \\neq f_2(x_2)$.\n\nTheorem 5.4 Consider any deterministic or randomized prediction strategy $Q$ and any $F \\subseteq \\{0, \\ldots, k\\}^X$ that has $2 \\le \\Psi_N\\text{-dim}(F) < \\infty$ or is non-trivial with $\\Psi_N\\text{-dim}(F) < 2$. Then for all $n > \\Psi_N\\text{-dim}(F)$, $\\hat{M}_{Q,F}(n) \\ge \\max\\{1, \\Psi_N\\text{-dim}(F) - 1\\}/(2en)$.\n\nProof: Following [2], we use the probabilistic method to prove the existence of a target in $F$ for which prediction under a distribution $P$ supported by a $\\Psi_N$-shattered subset is hard. Consider $d = \\Psi_N\\text{-dim}(F) \\ge 2$ with $n > d$. Fix a $Z = \\{z_1, \\ldots, z_d\\}$ $\\Psi_N$-shattered by $F$ and then a subset $F_Z \\subseteq F$ of $2^d$ functions that $\\Psi_N$-shatters $Z$. Define a distribution $P$ on $X$ by $P(\\{z_i\\}) = n^{-1}$ for each $i \\in [d - 1]$, $P(\\{z_d\\}) = 1 - (d - 1)n^{-1}$ and $P(\\{x\\}) = 0$ for all $x \\in X \\setminus Z$. Observe that\n$$\\Pr\\nolimits_{P^n}(\\forall i \\in [n - 1], X_n \\neq X_i) \\ge \\Pr\\nolimits_{P^n}(X_n \\neq z_d, \\forall i \\in [n - 1], X_n \\neq X_i) = \\frac{d-1}{n} \\left(1 - \\frac{1}{n}\\right)^{n-1} \\ge \\frac{d-1}{en}.$$\nFor any $f \\in F_Z$ and $x \\in Z^n$ with $x_n \\neq x_i$ for all $i \\in [n - 1]$, exactly half of the functions in $F_Z$ consistent with $\\mathrm{sam}((x_1, \\ldots, x_{n-1}), f)$ output some $i \\in \\{0, \\ldots, k\\}$ on $x_n$ and the remaining half output some $j \\in \\{0, \\ldots, k\\} \\setminus \\{i\\}$. Thus $\\mathbb{E}_{\\mathrm{Unif}(F_Z)}[\\mathbf{1}[Q(\\mathrm{sam}((x_1, \\ldots, x_{n-1}), F), x_n) \\neq F(x_n)]] = 0.5$ for such an $x$ and so\n$$\\hat{M}_{Q,F} \\ge \\hat{M}_{Q,F_Z} \\ge \\mathbb{E}_{\\mathrm{Unif}(F_Z) \\times P^n}[\\mathbf{1}[Q(\\mathrm{sam}((X_1, \\ldots, X_{n-1}), F), X_n) \\neq F(X_n)]] \\ge \\frac{d-1}{2en}.$$\nThe similar case of $d < 2$ is omitted here and shows that there is a distribution $P$ on $X$ and function $f \\in F$ such that $\\mathbb{E}_{P^n}[\\mathbf{1}[Q(\\mathrm{sam}((X_1, \\ldots, X_{n-1}), f), X_n) \\neq f(X_n)]] \\ge (2en)^{-1}$.\n\n6 Conclusions and open problems\n\nIn this paper we have developed new shifting machinery and tightened the binary one-inclusion mistake bound from $d/n$ to $D_n^d/n$ ($\\lceil D_n^d \\rceil / n$ for the deterministic strategy), representing a solid improvement for $d \\approx n$. 
We have described the multiclass generalization of the prediction learning model and derived a mistake bound for the multiclass one-inclusion prediction strategy that improves on previous PAC-based expected risk bounds by $O(\\log n)$ and that is within $O(\\log k)$ of optimal. Here shifting with invariance to the shattering of a single set was described; however, we are aware of invariance to more complex shatterings. Another serious application of shatter-invariant shifting, to appear in a sequel to this paper, is to the study of the cubical structure of maximum and maximal classes with connections to the compressibility conjecture of [7]. While Theorem 4.4 resolves one conjecture of Kuzmin & Warmuth [5], the remainder of the conjectured correctness proof for the Peeling compression scheme is known to be false [8]. The symmetrization method of Theorem 4.4 can be extended over subgroups $G \\subset S_n$ to gain tighter density bounds. Just as the $S_n$-invariant $V_n^d$ is the maximizer of density among all closed-below $V \\subseteq V_n^d$, there exist $G$-invariant families that maximize the density over all of their sub-families. In addition to Theorem 5.2 we have also proven the following special case in terms of $\\Psi_G$; it is open as to whether this generalizes to $n \\in \\mathbb{N}$. While a general $\\Psi_G$-based bound would allow direct comparison with the PAC-based expected risk bound, it should also be noted that $\\Psi_P$ and $\\Psi_G$ are in fact incomparable--neither $\\Psi_G \\le \\Psi_P$ nor $\\Psi_P \\le \\Psi_G$ singly holds for all classes [1, Theorem 1].\n\nLemma 6.1 ([8]) For any $k \\in \\mathbb{N}$ and family $V \\subseteq \\{0, \\ldots, k\\}^2$, $\\mathrm{dens}(\\mathcal{G}(V)) \\le \\Psi_G\\text{-dim}(V)$.\n\nAcknowledgments\nWe gratefully acknowledge the support of the NSF under award DMS-0434383.\n\nReferences\n[1] Ben-David, S., Cesa-Bianchi, N., Haussler, D., Long, P. M.: Characterizations of learnability for classes of {0, . . . , n}-valued functions. 
Journal of Computer and System Sciences, 50(1) (1995) 74-86\n[2] Ehrenfeucht, A., Haussler, D., Kearns, M., Valiant, L.: A general lower bound on the number of examples needed for learning. Information and Computation, 82(3) (1989) 247-261\n[3] Haussler, D.: Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory (A) 69(2) (1995) 217-232\n[4] Haussler, D., Littlestone, N., Warmuth, M. K.: Predicting {0, 1} functions on randomly drawn points. Information and Computation, 115(2) (1994) 284-293\n[5] Kuzmin, D., Warmuth, M. K.: Unlabeled compression schemes for maximum classes. Journal of Machine Learning Research (2006) to appear\n[6] Li, Y., Long, P. M., Srinivasan, A.: The one-inclusion graph algorithm is near optimal for the prediction model of learning. IEEE Transactions on Information Theory, 47(3) (2002) 1257-1261\n[7] Littlestone, N., Warmuth, M. K.: Relating data compression and learnability. Unpublished manuscript, http://www.cse.ucsc.edu/~manfred/pubs/lrnk-olivier.pdf (1986)\n[8] Rubinstein, B. I. P., Bartlett, P. L., Rubinstein, J. H.: Shifting: One-Inclusion Mistake Bounds and Sample Compression. Technical report, EECS Department, UC Berkeley (2007) to appear\n[9] Sauer, N.: On the density of families of sets. Journal of Combinatorial Theory (A), 13 (1972) 145-147\n", "award": [], "sourceid": 2982, "authors": [{"given_name": "Benjamin", "family_name": "Rubinstein", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "J.", "family_name": "Rubinstein", "institution": null}]}