{"title": "Multi-Objective Non-parametric Sequential Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 3372, "page_last": 3380, "abstract": "Online-learning research has mainly focused on minimizing a single objective function. In many real-world applications, however, several objective functions have to be considered simultaneously. Recently, an algorithm for dealing with several objective functions in the i.i.d. case has been presented. In this paper, we extend the multi-objective framework to the case of stationary and ergodic processes, thus allowing dependencies among observations. We first identify an asymptotic lower bound for any prediction strategy and then present an algorithm whose predictions achieve the optimal solution while fulfilling any continuous and convex constraining criterion.", "full_text": "Multi-Objective Non-parametric Sequential Prediction\n\nGuy Uziel\nComputer Science Department\nTechnion - Israel Institute of Technology\nguziel@cs.technion.ac.il\n\nRan El-Yaniv\nComputer Science Department\nTechnion - Israel Institute of Technology\nrani@cs.technion.ac.il\n\nAbstract\n\nOnline-learning research has mainly focused on minimizing a single objective function. In many real-world applications, however, several objective functions have to be considered simultaneously. Recently, an algorithm for dealing with several objective functions in the i.i.d. case has been presented. In this paper, we extend the multi-objective framework to the case of stationary and ergodic processes, thus allowing dependencies among observations. 
We first identify an asymptotic lower bound for any prediction strategy and then present an algorithm whose predictions achieve the optimal solution while fulfilling any continuous and convex constraining criterion.\n\n1 Introduction\n\nIn the traditional online learning setting, and in particular in sequential prediction under uncertainty, the learner is evaluated by a single loss function that is not completely known at each iteration [6]. When dealing with multiple objectives, since it is impossible to simultaneously minimize all of the objectives, one objective is chosen as the main function to minimize, leaving the others to be bounded by pre-defined thresholds. Methods for dealing with one objective function can be transformed to deal with several objective functions by giving each objective a pre-defined weight. The difficulty, however, lies in assigning an appropriate weight to each objective so as to keep the objectives below the given thresholds. This approach is very problematic in real-world applications, where the player is required to satisfy certain constraints. For example, in online portfolio selection [4], the player may want to maximize wealth while keeping the risk (i.e., variance) contained below a certain threshold. Another example is the Neyman-Pearson (NP) classification paradigm (see, e.g., [19]), which extends the objective in classical binary classification: the goal is to learn a classifier achieving low classification error whose type I error is kept below a given threshold.\n\nIn the adversarial setting it is known that multiple-objective optimization is generally impossible when the constraints are unknown a priori [18]. In the stochastic setting, Mahdavi et al. [17] proposed a framework for dealing with multiple objectives in the i.i.d. case. 
They proved that if there exists a solution that minimizes the main objective function while keeping the other objectives below given thresholds, then their algorithm will converge to the optimal solution.\n\nIn this work, we study online prediction with multiple objectives, but now consider the challenging general case where the unknown underlying process is stationary and ergodic, thus allowing observations to depend on each other arbitrarily. (Single-objective) sequential prediction under stationary and ergodic sources has been considered in many papers and in various application domains. For example, in online portfolio selection, [12, 9, 10] proposed non-parametric online strategies that guarantee, under mild conditions, the best possible outcome. Another interesting example in this regard is the work on time-series prediction by [2, 8, 3]. A common theme to all these results is that the asymptotically optimal strategies are constructed by combining the predictions of many simple experts. The above strategies use a countably infinite set of experts, and the guarantees provided for these strategies are always asymptotic. This is no coincidence, as it is well known that finite-sample guarantees for these methods cannot be achieved without additional strong assumptions on the source distribution [7, 16]. Approximate implementations of non-parametric strategies (which apply only a finite set of experts), however, turn out to work exceptionally well and, despite the inevitable approximation, are reported [11, 10, 9] to significantly outperform strategies designed to work in an adversarial, no-regret setting, in various domains.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe algorithm presented in this paper utilizes as a sub-routine the Weak Aggregating Algorithm (WAA) of [21] and [13] to handle multiple objectives. 
While we discuss here the case of only two objective functions, our theorems can be extended easily to any fixed number of functions.\n\n2 Problem Formulation\n\nWe consider the following prediction game. Let X ≜ [−D, D]^d ⊂ R^d be a compact observation space, where D > 0. At each round, n = 1, 2, . . ., the player is required to make a prediction y_n ∈ Y, where Y ⊂ R^m is a compact and convex set, based on past observations X_1^{n−1} ≜ (x_1, . . . , x_{n−1}), x_i ∈ X (X_1^0 is the empty observation). After making the prediction y_n, the observation x_n is revealed and the player suffers two losses, u(y_n, x_n) and c(y_n, x_n), where u and c are real-valued continuous functions, convex w.r.t. their first argument. We view the player's prediction strategy as a sequence S ≜ {S_n}_{n=1}^∞ of forecasting functions S_n : X^{(n−1)} → Y; that is, the player's prediction at round n is given by S_n(X_1^{n−1}) (for brevity, we denote S(X_1^{n−1})). Throughout the paper we assume that x_1, x_2, . . . are realizations of random variables X_1, X_2, . . . such that the stochastic process (X_n)_{−∞}^∞ is jointly stationary and ergodic and P(X_i ∈ X) = 1. The player's goal is to play the game with a strategy that minimizes the average u-loss, (1/N) ∑_{i=1}^N u(S(X_1^{i−1}), x_i), while keeping the average c-loss, (1/N) ∑_{i=1}^N c(S(X_1^{i−1}), x_i), bounded below a prescribed threshold γ. Formally, we define the following:\n\nDefinition 1 (γ-bounded strategy). A prediction strategy S will be called γ-bounded if\n\nlim sup_{N→∞} (1/N) ∑_{i=1}^N c(S(X_1^{i−1}), X_i) ≤ γ\n\nalmost surely. The set of all γ-bounded strategies will be denoted S_γ.\n\nThe well-known result of [1] states that for the single-objective case the best possible outcome is E[ max_{y ∈ Y(F_∞)} E_{P_∞}[u(y, X_0)] ], where P_∞ is the regular conditional probability distribution of X_0 given F_∞ (the σ-algebra generated by the infinite past X_{−1}, X_{−2}, . . .) and the maximization is over the F_∞-measurable functions. This motivates us to define the following:\n\nDefinition 2 (γ-feasible process). We say that the stationary and ergodic process {X_i}_{−∞}^∞ is γ-feasible w.r.t. the functions u and c, if for a threshold γ > 0, there exists some y′ ∈ Y(F_∞) such that E[c(y′, X_0)] < γ.\n\nIf γ-feasibility holds, then we will denote by y*_∞ (y*_∞ is not necessarily unique) the minimizer of the following minimization problem:\n\nminimize_{y ∈ Y(F_∞)} E[u(y, X_0)] subject to E[c(y, X_0)] ≤ γ,   (1)\n\nand we define the γ-feasible optimal value as\n\nV* = E[ E_{P_∞}[u(y*_∞, X_0)] ].\n\nNote that problem (1) is a convex minimization problem over Y(F_∞). Therefore, the problem is equivalent to finding the saddle point of the Lagrangian function [15], namely,\n\nmin_{y ∈ Y(F_∞)} max_{λ ∈ R_+} L(y, λ),\n\nwhere the Lagrangian is\n\nL(y, λ) ≜ E[u(y, X_0)] + λ(E[c(y, X_0)] − γ).\n\nWe denote the optimal dual by λ*_∞ and assume that L can be maximized by a unique F_∞-measurable function, λ*_∞(·), P_∞-a.s. Moreover, we set a constant λ_max such that λ_max > λ*_∞(·) P_∞-a.s., and set Λ ≜ [0, λ_max]. We also define the instantaneous Lagrangian function as\n\nl(y, λ, x) ≜ u(y, x) + λ(c(y, x) − γ).   (2)\n\nIn brief, we are seeking a strategy S ∈ S_γ that is as good as any other γ-bounded strategy, in terms of the average u-loss, when the underlying process is γ-feasible. Such a strategy will be called γ-universal.\n\n3 Optimality of V*\n\nIn this section, we show that the average u-loss of any γ-bounded prediction strategy cannot be smaller than V*, the γ-feasible optimal value. This result is a generalization of the well-known result of [1] regarding the best possible outcome under a single objective. Before stating and proving this optimality result, we state three lemmas that will be used repeatedly in this paper. The first lemma is known as Breiman's generalized ergodic theorem. The second and third lemmas concern the continuity of the saddle point w.r.t. the probability distribution; their proofs appear in the supplementary material.\n\nLemma 1 (Ergodicity, [5]). Let X = {X_i}_{−∞}^∞ be a stationary and ergodic process. For each positive integer i, let T^i denote the operator that shifts any sequence by i places to the left. Let f_1, f_2, . . . be a sequence of real-valued functions such that lim_{n→∞} f_n(X) = f(X) almost surely, for some function f. Assume that E sup_n |f_n(X)| < ∞. Then, almost surely,\n\nlim_{n→∞} (1/n) ∑_{i=1}^n f_i(T^i X) = E f(X).\n\nLemma 2 (Continuity and Minimax). Let Y, Λ, X be compact real spaces, and let l : Y × Λ × X → R be a continuous function. Denote by P(X) the space of all probability measures on X (equipped with the topology of weak convergence). Then the following function L* : P(X) → R is continuous:\n\nL*(Q) = inf_{y ∈ Y} sup_{λ ∈ Λ} E_Q[l(y, λ, x)].   (3)\n\nMoreover, for any Q ∈ P(X),\n\ninf_{y ∈ Y} sup_{λ ∈ Λ} E_Q[l(y, λ, x)] = sup_{λ ∈ Λ} inf_{y ∈ Y} E_Q[l(y, λ, x)].\n\nLemma 3 (Continuity of the optimal selection). Let Y, Λ, X be compact real spaces. Then, there exist two measurable selection functions h_y, h_λ such that\n\nh_y(Q) ∈ arg min_{y ∈ Y} ( max_{λ ∈ Λ} E_Q[l(y, λ, x)] ),   h_λ(Q) ∈ arg max_{λ ∈ Λ} ( min_{y ∈ Y} E_Q[l(y, λ, x)] )\n\nfor any Q ∈ P(X). Moreover, let L* be as defined in Equation (3). Then, the set\n\nGr(L*) ≜ {(u*, v*, Q) | u* ∈ h_y(Q), v* ∈ h_λ(Q), Q ∈ P(X)}\n\nis closed in Y × Λ × P(X).\n\nThe importance of Lemma 3 stems from the fact that it establishes the continuity properties of the multi-valued correspondences Q → h_y(Q) and Q → h_λ(Q). In particular, if for the limiting distribution Q_∞ the optimal set is a singleton, then Q → h_y(Q) and Q → h_λ(Q) are continuous at Q_∞. We are now ready to prove the optimality of V*.\n\nTheorem 1 (Optimality of V*). Let {X_i}_{−∞}^∞ be a γ-feasible process. Then, for any strategy S ∈ S_γ, the following holds a.s.:\n\nlim inf_{N→∞} (1/N) ∑_{i=1}^N u(S(X_1^{i−1}), X_i) ≥ V*.\n\nProof. 
For any given strategy S ∈ S_γ, we will look at the following sequence:\n\n(1/N) ∑_{i=1}^N l(S(X_1^{i−1}), λ̃*_i, X_i),   (4)\n\nwhere λ̃*_i ∈ h_λ(P_{X_i | X_1^{i−1}}). Observe that\n\n(4) = (1/N) ∑_{i=1}^N ( l(S(X_1^{i−1}), λ̃*_i, X_i) − E[ l(S(X_1^{i−1}), λ̃*_i, X_i) | X_1^{i−1} ] ) + (1/N) ∑_{i=1}^N E[ l(S(X_1^{i−1}), λ̃*_i, X_i) | X_1^{i−1} ].\n\nSince A_i = l(S(X_1^{i−1}), λ̃*_i, X_i) − E[ l(S(X_1^{i−1}), λ̃*_i, X_i) | X_1^{i−1} ] is a martingale difference sequence, the first summand converges to 0 a.s. by the strong law of large numbers (see, e.g., [20]). Therefore,\n\nlim inf_{N→∞} (1/N) ∑_{i=1}^N l(S(X_1^{i−1}), λ̃*_i, X_i) = lim inf_{N→∞} (1/N) ∑_{i=1}^N E[ l(S(X_1^{i−1}), λ̃*_i, X_i) | X_1^{i−1} ]\n≥ lim inf_{N→∞} (1/N) ∑_{i=1}^N min_y E[ l(y, λ̃*_i, X_i) | X_1^{i−1} ],   (5)\n\nwhere the minimum is taken w.r.t. all the σ(X_1^{i−1})-measurable functions. Because the process is stationary, we get, for λ̂*_i ∈ h_λ(P_{X_0 | X_{1−i}^{−1}}),\n\n(5) = lim inf_{N→∞} (1/N) ∑_{i=1}^N min_y E[ l(y, λ̂*_i, X_0) | X_{1−i}^{−1} ] = lim inf_{N→∞} (1/N) ∑_{i=1}^N L*(P_{X_0 | X_{1−i}^{−1}}).   (6)\n\nUsing Levy's zero-one law, P_{X_0 | X_{1−i}^{−1}} → P_∞ weakly as i approaches ∞, and from Lemma 2 we know that L* is continuous. Therefore, we can apply Lemma 1 and get that a.s.\n\n(6) = E[ L*(P_∞) ] = E[ E_{P_∞}[ l(y*_∞, λ*_∞, X_0) ] ].   (7)\n\nNote also that, due to the complementary slackness condition of the optimal solution, i.e., E[ λ*_∞ ( E_{P_∞}[c(y*_∞, X_0)] − γ ) ] = 0, we get\n\n(7) = E[ E_{P_∞}[u(y*_∞, X_0)] ] = V*.\n\nFrom the uniqueness of λ*_∞, and using Lemma 3, λ̂*_i → λ*_∞ as i approaches ∞. Moreover, since l is continuous on a compact set, l is also uniformly continuous. Therefore, for any given ε > 0, there exists δ > 0 such that if |λ′ − λ| < δ, then |l(y, λ′, x) − l(y, λ, x)| < ε for any y ∈ Y and x ∈ X. Therefore, there exists i_0 such that if i > i_0 then |l(y, λ̂*_i, x) − l(y, λ*_∞, x)| < ε for any y ∈ Y and x ∈ X. Thus,\n\nlim inf_{N→∞} (1/N) ∑_{i=1}^N l(S(X_1^{i−1}), λ*_∞, X_i) − lim inf_{N→∞} (1/N) ∑_{i=1}^N l(S(X_1^{i−1}), λ̂*_i, X_i)\n= lim inf_{N→∞} (1/N) ∑_{i=1}^N l(S(X_1^{i−1}), λ*_∞, X_i) + lim sup_{N→∞} (1/N) ∑_{i=1}^N ( −l(S(X_1^{i−1}), λ̂*_i, X_i) )\n≥ lim inf_{N→∞} ( (1/N) ∑_{i=1}^N l(S(X_1^{i−1}), λ*_∞, X_i) − (1/N) ∑_{i=1}^N l(S(X_1^{i−1}), λ̂*_i, X_i) ) ≥ −ε a.s.,\n\nand since ε is arbitrary,\n\nlim inf_{N→∞} (1/N) ∑_{i=1}^N l(S(X_1^{i−1}), λ*_∞, X_i) ≥ lim inf_{N→∞} (1/N) ∑_{i=1}^N l(S(X_1^{i−1}), λ̂*_i, X_i).\n\nTherefore we can conclude that\n\nlim inf_{N→∞} (1/N) ∑_{i=1}^N l(S(X_1^{i−1}), λ*_∞, X_i) ≥ V* a.s.\n\nWe finish the proof by noticing that since S ∈ S_γ, then by definition\n\nlim sup_{N→∞} (1/N) ∑_{i=1}^N c(S(X_1^{i−1}), X_i) ≤ γ a.s.,\n\nand since λ*_∞ is non-negative, we get the desired result.\n\nThe above theorem also provides the motivation to find the saddle point of the Lagrangian L. Therefore, for the remainder of the paper we will use the loss function l as defined in Equation (2).\n\n4 Minimax Histogram Based Aggregation\n\nWe are now ready to present our algorithm, Minimax Histogram Based Aggregation (MHA), and prove that its predictions are as good as those of the best strategy.\n\nAlgorithm 1 Minimax Histogram Based Aggregation (MHA)\nInput: countable set of experts {H_{k,h}}, y_0 ∈ Y, λ_0 ∈ Λ, initial probabilities {α_{k,h}}.\nFor n = 0 to ∞:\n  Play y_n, λ_n.\n  Nature reveals x_n.\n  Suffer loss l(y_n, λ_n, x_n).\n  Update the cumulative losses of the experts:\n    l^{k,h}_{y,n} ≜ ∑_{i=0}^n l(y^i_{k,h}, λ_i, x_i),   l^{k,h}_{λ,n} ≜ ∑_{i=0}^n l(y_i, λ^i_{k,h}, x_i).\n  Update the experts' weights:\n    w^{y,(k,h)}_{n+1} ≜ α_{k,h} exp( −(1/√n) l^{k,h}_{y,n} ),   w^{λ,(k,h)}_{n+1} ≜ α_{k,h} exp( (1/√n) l^{k,h}_{λ,n} ),\n  and normalize them:\n    p^{y,(k,h)}_{n+1} ≜ w^{y,(k,h)}_{n+1} / ∑_{k,h} w^{y,(k,h)}_{n+1},   p^{λ,(k,h)}_{n+1} ≜ w^{λ,(k,h)}_{n+1} / ∑_{k,h} w^{λ,(k,h)}_{n+1}.\n  Choose y_{n+1} and λ_{n+1} as follows:\n    y_{n+1} = ∑_{k,h} p^{y,(k,h)}_{n+1} y^{n+1}_{k,h},   λ_{n+1} = ∑_{k,h} p^{λ,(k,h)}_{n+1} λ^{n+1}_{k,h}.\nEnd For\n\nBy Theorem 1 we can restate our goal: find a prediction strategy S ∈ S_γ such that for any γ-feasible process {X_i}_{−∞}^∞ the following holds:\n\nlim_{N→∞} (1/N) ∑_{i=1}^N u(S(X_1^{i−1}), X_i) = V* a.s.\n\nSuch a strategy will be called γ-universal. We do so by maintaining a countable set of experts {H_{k,h}}, k, h = 1, 2, . . ., which are constructed in a similar manner to the experts used in [10]. Each expert is defined using a histogram which gets finer as h grows, allowing us to construct an empirical measure on X. An expert H_{k,h} therefore outputs a pair (y^i_{k,h}, λ^i_{k,h}) ∈ Y × Λ at round i. This pair is the minimax pair w.r.t. its empirical measure. We show that these empirical measures converge weakly to P_∞; thus, the experts' predictions will converge to V*. 
Our algorithm outputs at round i a pair (y_i, λ_i) ∈ Y × Λ, where the sequence of predictions y_1, y_2, . . . tries to minimize the average loss (1/N) ∑_{i=1}^N l(y, λ_i, x_i) and the sequence of predictions λ_1, λ_2, . . . tries to maximize the average loss (1/N) ∑_{i=1}^N l(y_i, λ, x_i). Each of y_i and λ_i is the aggregation of the predictions y^i_{k,h} and λ^i_{k,h}, k, h = 1, 2, . . ., respectively. In order to ensure that the performance of MHA will be as good as that of any expert, for both the y and the λ predictions, we apply the Weak Aggregating Algorithm of [21] and [13] twice alternately. Theorem 2 states that the selection of points made by the experts above converges to the optimal solution; the proof of Theorem 2 and the explicit construction of the experts appear in the supplementary material. Then, in Theorem 3, we prove that MHA applied on the experts defined in Theorem 2 generates a sequence of predictions that is γ-bounded and as good as any other strategy w.r.t. any γ-feasible process.\n\nTheorem 2. Assume that {X_i}_{−∞}^∞ is a γ-feasible process. Then, it is possible to construct a countable set of experts {H_{k,h}} for which\n\nlim_{k→∞} lim_{h→∞} lim_{N→∞} (1/N) ∑_{i=1}^N l(y^i_{k,h}, λ^i_{k,h}, X_i) = V* a.s.,\n\nwhere (y^i_{k,h}, λ^i_{k,h}) are the predictions made by expert H_{k,h} at round i.\n\nBefore stating the main theorem regarding MHA, we state the following lemma (the proof appears in the supplementary material), which is used in the proof of the main result regarding MHA.\n\nLemma 4. Let {H_{k,h}} be a countable set of experts as defined in the proof of Theorem 2. Then, the following relation holds a.s.:\n\ninf_{k,h} lim sup_{N→∞} (1/N) ∑_{i=1}^N l(y^i_{k,h}, λ_i, X_i) ≤ V* ≤ sup_{k,h} lim inf_{N→∞} (1/N) ∑_{i=1}^N l(y_i, λ^i_{k,h}, X_i),\n\nwhere (y_i, λ_i) are the predictions of MHA when applied on {H_{k,h}}.\n\nWe are now ready to state and prove the optimality of MHA.\n\nTheorem 3 (Optimality of MHA). Let (y_i, λ_i) be the predictions generated by MHA when applied on {H_{k,h}} as defined in the proof of Theorem 2. Then, for any γ-feasible process {X_i}_{−∞}^∞, MHA is a γ-bounded and γ-universal strategy.\n\nProof. We first show that\n\nlim_{N→∞} (1/N) ∑_{i=1}^N l(y_i, λ_i, X_i) = V* a.s.   (8)\n\nApplying Lemma 5 in [13], we know that the weight updates guarantee that for every expert H_{k,h},\n\n(1/N) ∑_{i=1}^N l(y_i, λ_i, x_i) ≤ (1/N) ∑_{i=1}^N l(y^i_{k,h}, λ_i, x_i) + C_{k,h}/√N,   (9)\n\n(1/N) ∑_{i=1}^N l(y_i, λ_i, x_i) ≥ (1/N) ∑_{i=1}^N l(y_i, λ^i_{k,h}, x_i) − C′_{k,h}/√N,   (10)\n\nwhere C_{k,h}, C′_{k,h} > 0 are some constants independent of N. In particular, using Equation (9),\n\n(1/N) ∑_{i=1}^N l(y_i, λ_i, x_i) ≤ inf_{k,h} ( (1/N) ∑_{i=1}^N l(y^i_{k,h}, λ_i, x_i) + C_{k,h}/√N ).\n\nTherefore, we get\n\nlim sup_{N→∞} (1/N) ∑_{i=1}^N l(y_i, λ_i, x_i) ≤ lim sup_{N→∞} inf_{k,h} ( (1/N) ∑_{i=1}^N l(y^i_{k,h}, λ_i, x_i) + C_{k,h}/√N )\n≤ inf_{k,h} lim sup_{N→∞} ( (1/N) ∑_{i=1}^N l(y^i_{k,h}, λ_i, x_i) + C_{k,h}/√N ) ≤ inf_{k,h} lim sup_{N→∞} (1/N) ∑_{i=1}^N l(y^i_{k,h}, λ_i, x_i),   (11)\n\nwhere in the last inequality we used the fact that lim sup is sub-additive. Using Lemma 4, we get that\n\n(11) ≤ V* ≤ sup_{k,h} lim inf_{N→∞} (1/N) ∑_{i=1}^N l(y_i, λ^i_{k,h}, X_i).   (12)\n\nUsing similar arguments and Equation (10), we can show that\n\n(12) ≤ lim inf_{N→∞} (1/N) ∑_{i=1}^N l(y_i, λ_i, x_i).\n\nSummarizing, we have\n\nlim sup_{N→∞} (1/N) ∑_{i=1}^N l(y_i, λ_i, x_i) ≤ V* ≤ lim inf_{N→∞} (1/N) ∑_{i=1}^N l(y_i, λ_i, x_i).\n\nTherefore, we can conclude that a.s. lim_{N→∞} (1/N) ∑_{i=1}^N l(y_i, λ_i, X_i) = V*.\n\nTo show that MHA is indeed a γ-bounded strategy, we use two special experts H_{0,0}, H_{−1,−1} whose predictions are λ^n_{0,0} = λ_max and λ^n_{−1,−1} = 0 for every n, and to shorten the notation, we denote\n\ng(y, λ, x) ≜ λ(c(y, x) − γ).\n\nFirst, from Equation (10) applied on the expert H_{0,0}, we get that\n\nlim sup_{N→∞} (1/N) ∑_{i=1}^N g(y_i, λ_max, x_i) ≤ lim sup_{N→∞} (1/N) ∑_{i=1}^N g(y_i, λ_i, x_i).   (13)\n\nMoreover, since l is uniformly continuous, for any given ε > 0 there exists δ > 0 such that if |λ′ − λ| < δ, then |l(y, λ′, x) − l(y, λ, x)| < ε for any y ∈ Y and x ∈ X. We also know from the proof of Theorem 2 that lim_{k→∞} lim_{h→∞} lim_{i→∞} λ^i_{k,h} = λ*_∞. Therefore, there exist k_0, h_0, i_0 such that |λ^i_{k_0,h_0} − λ*_∞| < δ for any i > i_0. Therefore,\n\nlim sup_{N→∞} ( (1/N) ∑_{i=1}^N l(y_i, λ*_∞, x_i) − (1/N) ∑_{i=1}^N l(y_i, λ_i, x_i) ) ≤\nlim sup_{N→∞} ( (1/N) ∑_{i=1}^N l(y_i, λ*_∞, x_i) − (1/N) ∑_{i=1}^N l(y_i, λ^i_{k_0,h_0}, x_i) )\n+ lim sup_{N→∞} ( (1/N) ∑_{i=1}^N l(y_i, λ^i_{k_0,h_0}, x_i) − (1/N) ∑_{i=1}^N l(y_i, λ_i, x_i) ).   (14)\n\nFrom the uniform continuity we learn that the first summand is bounded above by ε, and from Equation (10) we get that the last summand is bounded above by 0. Thus,\n\n(14) ≤ ε,\n\nand since ε is arbitrary, we get that\n\nlim sup_{N→∞} ( (1/N) ∑_{i=1}^N l(y_i, λ*_∞, x_i) − (1/N) ∑_{i=1}^N l(y_i, λ_i, x_i) ) ≤ 0.\n\nThus, lim sup_{N→∞} (1/N) ∑_{i=1}^N l(y_i, λ*_∞, X_i) ≤ V*, and from Theorem 1 we can conclude that lim_{N→∞} (1/N) ∑_{i=1}^N l(y_i, λ*_∞, X_i) = V*. Therefore, we can deduce that\n\nlim sup_{N→∞} (1/N) ∑_{i=1}^N g(y_i, λ_i, x_i) − lim sup_{N→∞} (1/N) ∑_{i=1}^N g(y_i, λ*_∞, x_i)\n= lim sup_{N→∞} (1/N) ∑_{i=1}^N g(y_i, λ_i, x_i) + lim inf_{N→∞} (1/N) ∑_{i=1}^N ( −g(y_i, λ*_∞, x_i) )\n≤ lim sup_{N→∞} ( (1/N) ∑_{i=1}^N g(y_i, λ_i, x_i) − (1/N) ∑_{i=1}^N g(y_i, λ*_∞, x_i) )\n= lim sup_{N→∞} ( (1/N) ∑_{i=1}^N l(y_i, λ_i, x_i) − (1/N) ∑_{i=1}^N l(y_i, λ*_∞, x_i) ) = 0,\n\nwhich results in\n\nlim sup_{N→∞} (1/N) ∑_{i=1}^N g(y_i, λ_i, x_i) ≤ lim sup_{N→∞} (1/N) ∑_{i=1}^N g(y_i, λ*_∞, x_i).\n\nCombining the above with Equation (13), we get that\n\nlim sup_{N→∞} (1/N) ∑_{i=1}^N g(y_i, λ_max, x_i) ≤ lim sup_{N→∞} (1/N) ∑_{i=1}^N g(y_i, λ*_∞, x_i).\n\nSince 0 ≤ λ*_∞ < λ_max, we get that MHA is γ-bounded. This also implies that\n\nlim sup_{N→∞} (1/N) ∑_{i=1}^N λ_i(c(y_i, x_i) − γ) ≤ 0.\n\nNow, if we apply Equation (10) on the expert H_{−1,−1}, we get that\n\nlim inf_{N→∞} (1/N) ∑_{i=1}^N λ_i(c(y_i, x_i) − γ) ≥ 0.\n\nThus,\n\nlim_{N→∞} (1/N) ∑_{i=1}^N λ_i(c(y_i, x_i) − γ) = 0,\n\nand using Equation (8), we get that MHA is also γ-universal.\n\n5 Concluding Remarks\n\nIn this paper, we introduced the Minimax Histogram Aggregation (MHA) algorithm for multiple-objective sequential prediction. 
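The expert-aggregation step of MHA (Algorithm 1) follows the Weak Aggregating Algorithm: each side weights the experts by the exponent of its signed cumulative loss scaled by 1/√n, then plays the weighted average of the experts' predictions. The following is a minimal, illustrative Python sketch of that step; the two-expert setup, prior weights, cumulative losses, and predictions are made up for illustration and are not from the paper.

```python
import math

# Sketch of one weight-update round of MHA's expert aggregation
# (the Weak Aggregating Algorithm step of Algorithm 1).

def waa_weights(alphas, cum_losses, n, sign=-1.0):
    """Normalized weights p_k proportional to alpha_k * exp(sign * L_k / sqrt(n)).

    sign=-1.0 for the minimizing (y) side, sign=+1.0 for the
    maximizing (lambda) side, as in Algorithm 1."""
    raw = [a * math.exp(sign * L / math.sqrt(n)) for a, L in zip(alphas, cum_losses)]
    total = sum(raw)
    return [w / total for w in raw]

# Two hypothetical experts with uniform prior weights after n = 100 rounds.
alphas = [0.5, 0.5]
cum_losses_y = [3.0, 1.0]          # expert 2 has accumulated a smaller loss
p_y = waa_weights(alphas, cum_losses_y, 100, sign=-1.0)

expert_preds = [0.2, 0.8]          # the experts' y-predictions this round
y_next = sum(p * y for p, y in zip(p_y, expert_preds))

assert abs(sum(p_y) - 1.0) < 1e-9  # weights form a probability vector
assert p_y[1] > p_y[0]             # smaller cumulative loss -> larger weight
```

For the λ side the same routine would be called with sign=+1.0, favoring experts with larger cumulative loss, which matches the maximization over λ in Algorithm 1.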
We considered the general setting where the unknown underlying process is stationary and ergodic. Given that the underlying process is γ-feasible, we extended the well-known result of [1] regarding the asymptotic lower bound of prediction with a single objective to the case of multiple objectives. We proved that MHA is a γ-bounded strategy whose predictions also converge to the optimal solution in hindsight.\n\nIn the proofs of the theorems and lemmas above, we used the fact that the initial weights of the experts, α_{k,h}, are strictly positive, thus implying a countably infinite expert set. In practice, however, one cannot maintain an infinite set of experts. Therefore, it is customary to apply such algorithms with a finite number of experts (see [12, 9, 10]). Despite the fact that in the proofs we assumed that the observation set X is known a priori, the algorithm can also be applied in the case where X is unknown, by applying the doubling trick. For a further discussion on this point, see [8]. In our proofs, we relied on the compactness of the set X. It would be interesting to see whether the universality of MHA can be sustained under unbounded processes as well. A very interesting open question would be to identify conditions allowing for finite-sample bounds when predicting with multiple objectives.\n\nAcknowledgments\n\nWe would like to thank the anonymous reviewers for providing helpful comments. This research was supported by The Israel Science Foundation (grant No. 1890/14).\n\nReferences\n\n[1] P.H. Algoet. The strong law of large numbers for sequential decisions under uncertainty. IEEE Transactions on Information Theory, 40(3):609–633, 1994.\n\n[2] G. Biau, K. Bleakley, L. Györfi, and G. Ottucsák. Nonparametric sequential prediction of time series. Journal of Nonparametric Statistics, 22(3):297–317, 2010.\n\n[3] G. Biau and B. Patra. Sequential quantile prediction of time series. 
IEEE Transactions on Information Theory, 57(3):1664–1674, 2011.\n\n[4] A. Borodin and R. El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 2005.\n\n[5] L. Breiman. The individual ergodic theorem of information theory. The Annals of Mathematical Statistics, 28(3):809–811, 1957.\n\n[6] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.\n\n[7] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.\n\n[8] L. Györfi and G. Lugosi. Strategies for sequential prediction of stationary time series. In Modeling Uncertainty, pages 225–248. Springer, 2005.\n\n[9] L. Györfi, G. Lugosi, and F. Udina. Nonparametric kernel-based sequential investment strategies. Mathematical Finance, 16(2):337–357, 2006.\n\n[10] L. Györfi and D. Schäfer. Nonparametric prediction. In Advances in Learning Theory: Methods, Models and Applications, pages 339–354, 2003.\n\n[11] L. Györfi, F. Udina, and H. Walk. Nonparametric nearest neighbor based empirical portfolio selection strategies. Statistics & Decisions, International Mathematical Journal for Stochastic Methods and Models, 26(2):145–157, 2008.\n\n[12] L. Györfi, A. Urbán, and I. Vajda. Kernel-based semi-log-optimal empirical portfolio selection strategies. International Journal of Theoretical and Applied Finance, 10(03):505–516, 2007.\n\n[13] Y. Kalnishkan and M. Vyugin. The weak aggregating algorithm and weak mixability. In International Conference on Computational Learning Theory, pages 188–203. Springer, 2005.\n\n[14] D. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, 1997.\n\n[15] U.V. Luxburg and B. Schölkopf. Statistical learning theory: Models, concepts, and results. arXiv preprint arXiv:0810.4752, 2008.\n\n[16] M. Mahdavi, T. 
Yang, and R. Jin. Stochastic convex optimization with multiple objectives. In Advances in Neural Information Processing Systems, pages 1115–1123, 2013.\n\n[17] S. Mannor, J. Tsitsiklis, and J.Y. Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10(Mar):569–590, 2009.\n\n[18] P. Rigollet and X. Tong. Neyman-Pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research, 12(Oct):2831–2855, 2011.\n\n[19] W. Stout. Almost Sure Convergence, volume 24 of Probability and Mathematical Statistics, 1974.\n\n[20] V. Vovk. Competing with stationary prediction strategies. In International Conference on Computational Learning Theory, pages 439–453. Springer, 2007.\n", "award": [], "sourceid": 1915, "authors": [{"given_name": "Guy", "family_name": "Uziel", "institution": "Technion - Israel Institute of Technology"}, {"given_name": "Ran", "family_name": "El-Yaniv", "institution": "Technion"}]}