{"title": "Risk-Aversion in Multi-armed Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 3275, "page_last": 3283, "abstract": "In stochastic multi--armed bandits the objective is to solve the exploration--exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk--aversion where the objective is to compete against the arm with the best risk--return trade--off. This setting proves to be intrinsically more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we introduce two new algorithms, we investigate their theoretical guarantees, and we report preliminary empirical results.", "full_text": "Risk\u2013Aversion in Multi\u2013armed Bandits\n\nAmir Sani\n\nAlessandro Lazaric\n\nR\u00e9mi Munos\n\nINRIA Lille - Nord Europe, Team SequeL\n\n{amir.sani,alessandro.lazaric,remi.munos}@inria.fr\n\nAbstract\n\nStochastic multi\u2013armed bandits solve the Exploration\u2013Exploitation dilemma and\nultimately maximize the expected reward. Nonetheless, in many practical prob-\nlems, maximizing the expected reward is not the most desirable objective. In this\npaper, we introduce a novel setting based on the principle of risk\u2013aversion where\nthe objective is to compete against the arm with the best risk\u2013return trade\u2013off. This\nsetting proves to be more dif\ufb01cult than the standard multi-arm bandit setting due\nin part to an exploration risk which introduces a regret associated to the variability\nof an algorithm. 
Using variance as a measure of risk, we define two algorithms,\ninvestigate their theoretical guarantees, and report preliminary empirical results.\n\n1 Introduction\nThe multi–armed bandit [13] elegantly formalizes the problem of on–line learning with partial feedback,\nwhich encompasses a large number of real–world applications, such as clinical trials, online\nadvertisements, adaptive routing, and cognitive radio. In the stochastic multi–armed bandit model,\na learner chooses among several arms (e.g., different treatments), each characterized by an independent\nreward distribution (e.g., the treatment effectiveness). At each point in time, the learner selects\none arm and receives a noisy reward observation from that arm (e.g., the effect of the treatment on\none patient). Given a finite number n of rounds (e.g., patients involved in the clinical trial), the\nlearner faces a dilemma between repeatedly exploring all arms and collecting reward information\nversus exploiting current reward estimates by selecting the arm with the highest estimated reward.\nRoughly speaking, the learning objective is to solve this exploration–exploitation dilemma and accumulate\nas much reward as possible over n rounds. Multi–arm bandit literature typically focuses\non the problem of finding a learning algorithm capable of maximizing the expected cumulative reward\n(i.e., the reward collected over n rounds averaged over all possible observation realizations),\nthus implying that the best arm returns the highest expected reward. Nonetheless, in many practical\nproblems, maximizing the expected reward is not always the most desirable objective. For instance,\nin clinical trials, the treatment which works best on average might also have considerable variability,\nresulting in adverse side effects for some patients. 
In this case, a treatment which is less effective\non average but consistently effective on different patients may be preferable to an effective but risky\ntreatment. More generally, some applications require an effective trade–off between risk and reward.\nThere is no agreed-upon definition for risk. A variety of behaviours result in an uncertainty which\nmight be deemed unfavourable for a specific application and referred to as a risk. For example, an\nalgorithm which is consistent over multiple runs may not satisfy the desire for a solution with low\nvariability in every single realization of the algorithm. Two foundational risk modeling paradigms\nare Expected Utility theory [12] and the historically popular and accessible Mean-Variance paradigm\n[10]. A large part of decision–making theory focuses on defining and managing risk (see e.g., [9]\nfor an introduction to risk from an expected utility theory perspective).\nRisk has mostly been studied in on–line learning within the so–called expert advice setting (i.e.,\nadversarial full–information on–line learning). In particular, [8] showed that in general, although\nit is possible to achieve a small regret w.r.t. the expert with the best average performance, it is\nnot possible to compete against the expert which best trades off between average return and risk.\nOn the other hand, it is possible to define no–regret algorithms for simplified measures of risk–return.\n[16] studied the case of pure risk minimization (notably variance minimization) in an on-line\nsetting where at each step the learner is given a covariance matrix and must choose a weight vector\nthat minimizes the variance. The regret is then computed over the horizon and compared to the fixed\nweights minimizing the variance in hindsight. In the multi–arm bandit domain, the most interesting\nresults are by [5] and [14]. 
[5] introduced an analysis of the expected regret and its distribution,\nrevealing that an anytime version of UCB [6] and UCB-V might have large regret with some non-negligible\nprobability.1 This analysis is further extended by [14] who derived negative results which\nshow no anytime algorithm can achieve a regret with both a small expected regret and exponential\ntails. Although these results represent an important step towards the analysis of risk within bandit\nalgorithms, they are limited to the case where an algorithm’s cumulative reward is compared to the\nreward obtained by pulling the arm with the highest expectation.\nIn this paper, we focus on the problem of competing against the arm with the best risk–return trade–off.\nIn particular, we refer to the popular mean–variance model introduced by [10]. In Sect. 2 we\nintroduce notation and define the mean–variance bandit problem. In Sect. 3 and 4 we introduce two\nalgorithms and study their theoretical properties. In Sect. 5 we report a set of numerical simulations\naiming at validating the theoretical results. Finally, in Sect. 7 we conclude with a discussion on\npossible extensions. The proofs and additional experiments are reported in the extended version [15].\n2 Mean–Variance Multi–arm Bandit\nIn this section we introduce the notation and define the mean–variance multi–arm bandit problem.\nWe consider the standard multi–arm bandit setting with K arms, each characterized by a distribution\n$\\nu_i$ bounded in the interval [0, 1]. Each distribution has a mean $\\mu_i$ and a variance $\\sigma_i^2$. The bandit\nproblem is defined over a finite horizon of n rounds. We denote by $X_{i,s} \\sim \\nu_i$ the s-th random\nsample drawn from the distribution of arm i. All arms and samples are independent. 
In the multi–arm bandit protocol, at each round t, an algorithm selects arm $I_t$ and observes sample $X_{I_t,T_{i,t}}$, where $T_{i,t}$ is the number of samples observed from arm i up to time t (i.e., $T_{i,t} = \\sum_{s=1}^{t} \\mathbb{I}\\{I_s = i\\}$).\nWhile in the standard bandit literature the objective is to select the arm leading to the highest reward in expectation (the arm with the largest expected value $\\mu_i$), here we focus on the problem of finding the arm which effectively trades off between its expected reward (i.e., the return) and its variability (i.e., the risk). Although a large number of models for risk–return trade–off have been proposed, here we focus on the most historically popular and simple model: the mean–variance model proposed by [10], where the return of an arm is measured by the expected reward and its risk by its variance.\nDefinition 1. The mean–variance of an arm i with mean $\\mu_i$, variance $\\sigma_i^2$ and coefficient of absolute risk tolerance $\\rho$ is defined as2 $\\mathrm{MV}_i = \\sigma_i^2 - \\rho\\mu_i$.\nThus the optimal arm is the arm with the smallest mean-variance, that is $i^* = \\arg\\min_i \\mathrm{MV}_i$. We notice that we can obtain two extreme settings depending on the value of risk tolerance $\\rho$. As $\\rho \\to \\infty$, the mean–variance of arm i tends to the opposite of its expected value $\\mu_i$ and the problem reduces to the standard expected reward maximization traditionally considered in multi–arm bandit problems. With $\\rho = 0$, the mean–variance reduces to $\\sigma_i^2$ and the objective becomes variance minimization.\nGiven $\\{X_{i,s}\\}_{s=1}^{t}$ i.i.d. samples from the distribution $\\nu_i$, we define the empirical mean–variance of an arm i with t samples as $\\widehat{\\mathrm{MV}}_{i,t} = \\hat{\\sigma}_{i,t}^2 - \\rho\\hat{\\mu}_{i,t}$, where\n\n$$\\hat{\\mu}_{i,t} = \\frac{1}{t}\\sum_{s=1}^{t} X_{i,s}, \\qquad \\hat{\\sigma}_{i,t}^2 = \\frac{1}{t}\\sum_{s=1}^{t} \\big(X_{i,s} - \\hat{\\mu}_{i,t}\\big)^2. \\tag{1}$$\n\nWe now consider a learning algorithm $\\mathcal{A}$ and its corresponding performance over n rounds. Similar to a single arm i we define its empirical mean–variance as\n\n$$\\widehat{\\mathrm{MV}}_n(\\mathcal{A}) = \\hat{\\sigma}_n^2(\\mathcal{A}) - \\rho\\hat{\\mu}_n(\\mathcal{A}), \\tag{2}$$\n\nwhere\n\n$$\\hat{\\mu}_n(\\mathcal{A}) = \\frac{1}{n}\\sum_{t=1}^{n} Z_t, \\qquad \\hat{\\sigma}_n^2(\\mathcal{A}) = \\frac{1}{n}\\sum_{t=1}^{n} \\big(Z_t - \\hat{\\mu}_n(\\mathcal{A})\\big)^2, \\tag{3}$$\n\nwith $Z_t = X_{I_t,T_{i,t}}$, that is the reward collected by the algorithm at time t. This leads to a natural definition of the (random) regret at each single run of the algorithm as the difference in the mean–variance performance of the algorithm compared to the best arm.\nDefinition 2. The regret for a learning algorithm $\\mathcal{A}$ over n rounds is defined as\n\n$$R_n(\\mathcal{A}) = \\widehat{\\mathrm{MV}}_n(\\mathcal{A}) - \\widehat{\\mathrm{MV}}_{i^*,n}. \\tag{4}$$\n\nGiven this definition, the objective is to design an algorithm whose regret decreases as the number of rounds increases (in high probability or in expectation).\n\n1 The analysis is for the pseudo–regret but it can be extended to the true regret (see Remark 2 at p.23 of [5]).\n2 The coefficient of risk tolerance is the inverse of the more popular coefficient of risk aversion $A = 1/\\rho$.\n\nWe notice that the previous definition actually depends on unobserved samples. In fact, $\\widehat{\\mathrm{MV}}_{i^*,n}$ is computed on n samples from $i^*$ which are not actually observed when running $\\mathcal{A}$. This matches the definition of true regret in standard bandits (see e.g., [5]). Thus, in order to clarify the main components characterizing the regret, we introduce additional notation. Let\n\n$$Y_{i,t} = \\begin{cases} X_{i^*,t} & \\text{if } i = i^* \\\\\\\\ X_{i^*,t'} \\text{ with } t' = T_{i^*,n} + \\sum_{j<i,\\, j \\ne i^*} T_{j,n} + t & \\text{otherwise} \\end{cases}$$\n\nbe a renaming of the samples from the optimal arm, such that while the algorithm was pulling arm i for the t-th time, $Y_{i,t}$ is the unobserved sample from $i^*$. The corresponding mean and variance are\n\n$$\\tilde{\\mu}_{i,T_{i,n}} = \\frac{1}{T_{i,n}}\\sum_{t=1}^{T_{i,n}} Y_{i,t}, \\qquad \\tilde{\\sigma}_{i,T_{i,n}}^2 = \\frac{1}{T_{i,n}}\\sum_{t=1}^{T_{i,n}} \\big(Y_{i,t} - \\tilde{\\mu}_{i,T_{i,n}}\\big)^2. \\tag{5}$$\n\nGiven these additional definitions, we can rewrite the regret as (see App. A.1 in [15])\n\n$$R_n(\\mathcal{A}) = \\frac{1}{n}\\sum_{i \\ne i^*} T_{i,n}\\Big[\\big(\\hat{\\sigma}_{i,T_{i,n}}^2 - \\rho\\hat{\\mu}_{i,T_{i,n}}\\big) - \\big(\\tilde{\\sigma}_{i,T_{i,n}}^2 - \\rho\\tilde{\\mu}_{i,T_{i,n}}\\big)\\Big] + \\frac{1}{n}\\sum_{i=1}^{K} T_{i,n}\\big(\\hat{\\mu}_{i,T_{i,n}} - \\hat{\\mu}_n(\\mathcal{A})\\big)^2 - \\frac{1}{n}\\sum_{i=1}^{K} T_{i,n}\\big(\\tilde{\\mu}_{i,T_{i,n}} - \\hat{\\mu}_{i^*,n}\\big)^2. \\tag{6}$$\n\nSince the last term is always negative and small3, our analysis focuses on the first two terms which reveal two interesting characteristics of $\\mathcal{A}$. First, an algorithm $\\mathcal{A}$ suffers a regret whenever it chooses a suboptimal arm $i \\ne i^*$ and the regret corresponds to the difference in the empirical mean–variance of i w.r.t. the optimal arm $i^*$. Such a definition has a strong similarity to the standard definition of regret, where $i^*$ is the arm with highest expected value and the regret depends on the number of times suboptimal arms are pulled and their respective gaps w.r.t. the optimal arm $i^*$. 
In contrast to the standard formulation of regret, $\\mathcal{A}$ also suffers an additional regret from the variance $\\hat{\\sigma}_n^2(\\mathcal{A})$, which depends on the variability of pulls $T_{i,n}$ over different arms. Recalling the definition of the mean $\\hat{\\mu}_n(\\mathcal{A})$ as the weighted mean of the empirical means $\\hat{\\mu}_{i,T_{i,n}}$ with weights $T_{i,n}/n$ (see eq. 3), we notice that this second term is a weighted variance of the means and illustrates the exploration risk of the algorithm. In fact, if an algorithm simply selects and pulls a single arm from the beginning, it would not suffer any exploration risk (secondary regret) since $\\hat{\\mu}_n(\\mathcal{A})$ would coincide with $\\hat{\\mu}_{i,T_{i,n}}$ for the chosen arm and all other components would have zero weight. On the other hand, an algorithm accumulates exploration risk through this second term as the mean $\\hat{\\mu}_n(\\mathcal{A})$ deviates from any specific arm, and the exploration risk peaks when the mean $\\hat{\\mu}_n(\\mathcal{A})$ is furthest from all arm means.\nThe previous definition of regret can be further elaborated to obtain the upper bound (see App. A.1)\n\n$$R_n(\\mathcal{A}) \\le \\frac{1}{n}\\sum_{i \\ne i^*} T_{i,n}\\widehat{\\Delta}_i + \\frac{1}{n^2}\\sum_{i=1}^{K}\\sum_{j \\ne i} T_{i,n}T_{j,n}\\widehat{\\Gamma}_{i,j}^2, \\tag{7}$$\n\nwhere $\\widehat{\\Delta}_i = (\\hat{\\sigma}_{i,T_{i,n}}^2 - \\tilde{\\sigma}_{i,T_{i,n}}^2) - \\rho(\\hat{\\mu}_{i,T_{i,n}} - \\tilde{\\mu}_{i,T_{i,n}})$ and $\\widehat{\\Gamma}_{i,j}^2 = (\\hat{\\mu}_{i,T_{i,n}} - \\hat{\\mu}_{j,T_{j,n}})^2$. Unlike the definition in eq. 6, this upper bound explicitly illustrates the relationship between the regret and the number of pulls $T_{i,n}$, suggesting that a bound on the pulls is sufficient to bound the regret.\nFinally, we can also introduce a definition of the pseudo-regret.\n\n3 More precisely, it can be shown that this term decreases with rate $O(K \\log(1/\\delta)/n)$ with probability $1-\\delta$.\n\nInput: Confidence $\\delta$\nfor t = 1, . . . , n do\n    for i = 1, . . . , K do\n        Compute $B_{i,T_{i,t-1}} = \\widehat{\\mathrm{MV}}_{i,T_{i,t-1}} - (5 + \\rho)\\sqrt{\\frac{\\log 1/\\delta}{2T_{i,t-1}}}$\n    end for\n    Return $I_t = \\arg\\min_{i=1,\\dots,K} B_{i,T_{i,t-1}}$\n    Update $T_{i,t} = T_{i,t-1} + 1$\n    Observe $X_{I_t,T_{i,t}} \\sim \\nu_{I_t}$\n    Update $\\widehat{\\mathrm{MV}}_{i,T_{i,t}}$\nend for\n\nFigure 1: Pseudo-code of the MV-LCB algorithm.\n\nDefinition 3. The pseudo regret for a learning algorithm $\\mathcal{A}$ over n rounds is defined as\n\n$$\\widetilde{R}_n(\\mathcal{A}) = \\frac{1}{n}\\sum_{i \\ne i^*} T_{i,n}\\Delta_i + \\frac{2}{n^2}\\sum_{i=1}^{K}\\sum_{j \\ne i} T_{i,n}T_{j,n}\\Gamma_{i,j}^2, \\tag{8}$$\n\nwhere $\\Delta_i = \\mathrm{MV}_i - \\mathrm{MV}_{i^*}$ and $\\Gamma_{i,j} = \\mu_i - \\mu_j$.\nIn the following, we denote the two components of the pseudo–regret as\n\n$$\\widetilde{R}_n^{\\Delta}(\\mathcal{A}) = \\frac{1}{n}\\sum_{i \\ne i^*} T_{i,n}\\Delta_i, \\qquad \\text{and} \\qquad \\widetilde{R}_n^{\\Gamma}(\\mathcal{A}) = \\frac{2}{n^2}\\sum_{i=1}^{K}\\sum_{j \\ne i} T_{i,n}T_{j,n}\\Gamma_{i,j}^2, \\tag{9}$$\n\nwhere $\\widetilde{R}_n^{\\Delta}(\\mathcal{A})$ constitutes the standard regret derived from the traditional formulation of the multi-arm bandit problem and $\\widetilde{R}_n^{\\Gamma}(\\mathcal{A})$ denotes the exploration risk. This regret can be shown to be close to the true regret up to small terms with high probability.\nLemma 1. Given definitions 2 and 3,\n\n$$R_n(\\mathcal{A}) \\le \\widetilde{R}_n(\\mathcal{A}) + (5 + \\rho)\\sqrt{\\frac{2K \\log(6nK/\\delta)}{n}} + 4\\sqrt{2}\\,\\frac{K \\log(6nK/\\delta)}{n},$$\n\nwith probability at least $1 - \\delta$.\nThe previous lemma shows that any (high–probability) bound on the pseudo–regret immediately translates into a bound on the true regret. Thus, we report most of the theoretical analysis according to $\\widetilde{R}_n(\\mathcal{A})$. Nonetheless, it is interesting to notice the major difference between the true and pseudo–regret when compared to the standard bandit problem. In fact, it is possible to show in the risk–averse case that the pseudo–regret is not an unbiased estimator of the true regret, i.e., $\\mathbb{E}[R_n] \\ne \\mathbb{E}[\\widetilde{R}_n]$. Thus, to bound the expectation of $R_n$ we build on the high–probability result from Lemma 1.\n3 The Mean–Variance Lower Confidence Bound Algorithm\nIn this section we introduce a risk–averse bandit algorithm whose objective is to identify the arm which best trades off risk and return. The algorithm is a natural extension of UCB1 [6] and we report a theoretical analysis of its mean–variance performance.\n\n3.1 The Algorithm\n\nWe propose an index–based bandit algorithm which estimates the mean–variance of each arm and selects the optimal arm according to the optimistic confidence–bounds on the current estimates. A sketch of the algorithm is reported in Figure 1. For each arm, the algorithm keeps track of the empirical mean–variance $\\widehat{\\mathrm{MV}}_{i,s}$ computed according to s samples. We can build high–probability confidence bounds on empirical mean–variance through an application of the Chernoff–Hoeffding inequality (see e.g., [1] for the bound on the variance) on terms $\\hat{\\mu}$ and $\\hat{\\sigma}^2$.\nLemma 2. Let $\\{X_{i,s}\\}$ be i.i.d. random variables bounded in [0, 1] from the distribution $\\nu_i$ with mean $\\mu_i$ and variance $\\sigma_i^2$, and the empirical mean $\\hat{\\mu}_{i,s}$ and variance $\\hat{\\sigma}_{i,s}^2$ computed as in Equation 1, then\n\n$$\\mathbb{P}\\Bigg[\\exists i = 1, \\dots, K,\\ s = 1, \\dots, n,\\ \\big|\\widehat{\\mathrm{MV}}_{i,s} - \\mathrm{MV}_i\\big| \\ge (5 + \\rho)\\sqrt{\\frac{\\log 1/\\delta}{2s}}\\Bigg] \\le 6nK\\delta.$$\n\nThe algorithm in Figure 1 implements the principle of optimism in the face of uncertainty used in many multi–arm bandit algorithms. 
On the basis of the previous confidence bounds, we define a\nlower–confidence bound on the mean–variance of arm i when it has been pulled s times as\n\n$$B_{i,s} = \\widehat{\\mathrm{MV}}_{i,s} - (5 + \\rho)\\sqrt{\\frac{\\log 1/\\delta}{2s}}, \\tag{10}$$\n\nwhere $\\delta$ is an input parameter of the algorithm. Given the index of each arm at each round t, the algorithm\nsimply selects the arm with the smallest mean–variance index, i.e., $I_t = \\arg\\min_i B_{i,T_{i,t-1}}$.\nWe refer to this algorithm as the mean–variance lower–confidence bound (MV-LCB) algorithm.\nRemark 1. We notice that MV-LCB reduces to UCB1 for $\\rho \\to \\infty$. This is coherent with the fact\nthat for $\\rho \\to \\infty$ the mean–variance problem reduces to expected reward maximization, for which\nUCB1 is known to be nearly-optimal. On the other hand, for $\\rho = 0$ (variance minimization), the\nalgorithm plays according to a lower–confidence–bound on the variances.\nRemark 2. The MV-LCB algorithm has a parameter $\\delta$ defining the confidence level of the bounds\nemployed in (10). In Thm. 1 we show how to optimize the parameter when the horizon n is known\nin advance. On the other hand, if n is not known, it is possible to design an anytime version of\nMV-LCB by defining a non-decreasing exploration sequence $(\\varepsilon_t)_t$ instead of the term $\\log 1/\\delta$.\n3.2 Theoretical Analysis\nIn this section we report the analysis of the regret $R_n(\\mathcal{A})$ of MV-LCB (Fig. 1). As highlighted in\neq. 7, it is enough to analyze the number of pulls for each of the arms to recover a bound on the\nregret. The proofs (reported in [15]) are mostly based on similar arguments to the proof of UCB.\nWe derive the following regret bound in high probability and expectation.\nTheorem 1. 
Let the optimal arm $i^*$ be unique and $b = 2(5 + \\rho)$, the MV-LCB algorithm achieves a pseudo–regret bounded as\n\n$$\\widetilde{R}_n(\\mathcal{A}) \\le \\frac{b^2 \\log 1/\\delta}{n}\\Bigg(\\sum_{i \\ne i^*} \\frac{1}{\\Delta_i} + 4\\sum_{i \\ne i^*} \\frac{\\Gamma_{i^*,i}^2}{\\Delta_i^2} + \\frac{2b^2 \\log 1/\\delta}{n}\\sum_{i \\ne i^*}\\sum_{j \\ne i,\\, j \\ne i^*} \\frac{\\Gamma_{i,j}^2}{\\Delta_i^2\\Delta_j^2}\\Bigg) + \\frac{5K}{n},$$\n\nwith probability at least $1 - 6nK\\delta$. Similarly, if MV-LCB is run with $\\delta = 1/n^2$ then\n\n$$\\mathbb{E}[\\widetilde{R}_n(\\mathcal{A})] \\le \\frac{2b^2 \\log n}{n}\\Bigg(\\sum_{i \\ne i^*} \\frac{1}{\\Delta_i} + 4\\sum_{i \\ne i^*} \\frac{\\Gamma_{i^*,i}^2}{\\Delta_i^2} + \\frac{4b^2 \\log n}{n}\\sum_{i \\ne i^*}\\sum_{j \\ne i,\\, j \\ne i^*} \\frac{\\Gamma_{i,j}^2}{\\Delta_i^2\\Delta_j^2}\\Bigg) + (17 + 6\\rho)\\frac{K}{n}.$$\n\nRemark 1 (the bound). Let $\\Delta_{\\min} = \\min_{i \\ne i^*} \\Delta_i$ and $\\Gamma_{\\max} = \\max_i |\\Gamma_i|$, then a rough simplification of the previous bound leads to\n\n$$\\mathbb{E}[\\widetilde{R}_n(\\mathcal{A})] \\le O\\Bigg(\\frac{K}{\\Delta_{\\min}}\\frac{\\log n}{n} + K^2\\frac{\\Gamma_{\\max}^2}{\\Delta_{\\min}^4}\\frac{\\log^2 n}{n}\\Bigg).$$\n\nFirst we notice that the regret decreases as $O(\\log^2 n/n)$, implying that MV-LCB is a consistent algorithm. As already highlighted in Def. 2, the regret is mainly composed by two terms. The first term is due to the difference in the mean–variance of the best arm and the arms pulled by the algorithm, while the second term denotes the additional variance introduced by the exploration risk of pulling arms with different means. In particular, this additional term depends on the squared difference of the arm means $\\Gamma_{i,j}^2$. Thus, if all the arms have the same mean, this term would be zero.\nRemark 2 (worst–case analysis). We can further study the result of Thm. 1 by considering the worst–case performance of MV-LCB, that is the performance when the distributions of the arms are chosen so as to maximize the regret. In order to illustrate our argument we consider the simple case of K = 2 arms, $\\rho = 0$ (variance minimization), $\\mu_1 \\ne \\mu_2$, and $\\sigma_1^2 = \\sigma_2^2 = 0$ (deterministic arms).4 In this case we have a variance gap $\\Delta = 0$ and $\\Gamma^2 > 0$. According to the definition of MV-LCB, the index $B_{i,s}$ would simply reduce to $B_{i,s} = \\sqrt{\\log(1/\\delta)/s}$, thus forcing the algorithm to pull both arms uniformly (i.e., $T_{1,n} = T_{2,n} = n/2$ up to rounding effects). Since the arms have the same variance, there is no direct regret in pulling either one or the other. Nonetheless, the algorithm has an additional variance due to the difference in the samples drawn from distributions with different means. In this case, the algorithm suffers a constant (true) regret\n\n$$R_n(\\text{MV-LCB}) = 0 + \\frac{T_{1,n}T_{2,n}}{n^2}\\Gamma^2 = \\frac{1}{4}\\Gamma^2,$$\n\nindependent from the number of rounds n. This argument can be generalized to multiple arms and $\\rho \\ne 0$, since it is always possible to design an environment (i.e., a set of distributions) such that $\\Delta_{\\min} = 0$ and $\\Gamma_{\\max} \\ne 0$.5 This result is not surprising. 
In fact, two arms with the same mean\u2013\nvariance are likely to produce similar observations, thus leading MV-LCB to pull the two arms\nrepeatedly over time, since the algorithm is designed to try to discriminate between similar arms.\nAlthough this behavior does not suffer from any regret in pulling the \u201csuboptimal\u201d arm (the two arms\nare equivalent), it does introduce an additional variance, due to the difference in the means of the\narms (\u0393 (cid:54)= 0), which \ufb01nally leads to a regret the algorithm is not \u201caware\u201d of. This argument suggests\nthat, for any n, it is always possible to design an environment for which MV-LCB has a constant\nregret. This is particularly interesting since it reveals a huge gap between the mean\u2013variance and\nthe standard expected regret minimization problem and will be further investigated in the numerical\nsimulations in Sect. 5. In fact, UCB is known to have a worst\u2013case regret of \u2126(1/\u221an) [3], while\nin the worst case, MV-LCB suffers a constant regret. In the next section we introduce a simple\nalgorithm able to deal with this problem and achieve a vanishing worst\u2013case regret.\n4 The Exploration\u2013Exploitation Algorithm\nThe ExpExp algorithm divides the time horizon n into two distinct phases of length \u03c4 and n \u2212 \u03c4\nrespectively. During the \ufb01rst phase all the arms are explored uniformly, thus collecting \u03c4 /K samples\neach 6. 
Once the exploration phase is over, the mean–variance of each arm is computed and the arm\nwith the smallest estimated mean–variance $\\widehat{\\mathrm{MV}}_{i,\\tau/K}$ is repeatedly pulled until the end.\nThe MV-LCB is specifically designed to minimize the probability of pulling the wrong arms, so\nwhenever there are two equivalent arms (i.e., arms with the same mean–variance), the algorithm\ntends to pull them the same number of times, at the cost of potentially introducing an additional\nvariance which might result in a constant regret. On the other hand, ExpExp stops exploring the\narms after $\\tau$ rounds and then elects one arm as the best and keeps pulling it for the remaining $n - \\tau$\nrounds.\nIntuitively, the parameter $\\tau$ should be tuned so as to meet different requirements. The\nfirst part of the regret (i.e., the regret coming from pulling the suboptimal arms) suggests that the\nexploration phase $\\tau$ should be long enough for the empirically best arm $\\hat{i}^*$ selected\nat $\\tau$ to coincide with the actual optimal arm $i^*$ with high probability, and at the same time as short as\npossible to reduce the number of times the suboptimal arms are explored. On the other hand, the\nsecond part of the regret (i.e., the variance of pulling arms with different means) is minimized by\ntaking $\\tau$ as small as possible (e.g., $\\tau = 0$ would guarantee a zero regret). The following theorem\nillustrates the optimal trade-off between these contrasting needs.\nTheorem 2. Let ExpExp be run with $\\tau = K(n/14)^{2/3}$, then for any choice of distributions $\\{\\nu_i\\}$\nthe expected regret is $\\mathbb{E}[\\widetilde{R}_n(\\mathcal{A})] \\le 2\\frac{K}{n^{1/3}}$.\n\nRemark 1 (the bound). We first notice that this bound suggests that ExpExp performs worse than\nMV-LCB on easy problems. In fact, Thm. 1 demonstrates that MV-LCB has a regret decreasing as\n$O(K \\log(n)/n)$ whenever the gaps $\\Delta$ are not small compared to n, while in the remarks of Thm. 
1\nwe highlighted the fact that for any value of n, it is always possible to design an environment which leads MV-LCB to suffer a constant regret. On the other hand, the previous bound for ExpExp is distribution independent and indicates the regret is still a decreasing function of n even in the worst case. This opens the question whether it is possible to design an algorithm which works as well as MV-LCB on easy problems and as robustly as ExpExp on difficult problems.\nRemark 2 (exploration phase). The previous result can be improved by changing the exploration strategy used in the first $\\tau$ rounds. Instead of a pure uniform exploration of all the arms, we could adopt a best–arm identification algorithm such as Successive Reject or UCB-E, which maximize the probability of returning the best arm given a fixed budget of rounds $\\tau$ (see e.g., [4]).\n\n4 Note that in this case (i.e., $\\Delta = 0$), Thm. 1 does not hold, since the optimal arm is not unique.\n5 Notice that this is always possible for a large majority of distributions with independent mean and variance.\n6 In the definition and in the following analysis we ignore rounding effects.\n\n5 Numerical Simulations\nIn this section we report numerical simulations aimed at validating the main theoretical findings reported in the previous sections. In the following graphs we study the true regret $R_n(\\mathcal{A})$ averaged over 500 runs. We first consider the variance minimization problem ($\\rho = 0$) with K = 2 Gaussian arms set to $\\mu_1 = 1.0$, $\\mu_2 = 0.5$, $\\sigma_1^2 = 0.05$, and $\\sigma_2^2 = 0.25$ and run MV-LCB.7 In Figure 2 we report the true regret $R_n$ (as in the original definition in eq. 4) and its two components $R_n^{\\widehat{\\Delta}}$ and $R_n^{\\widehat{\\Gamma}}$ (these two values are defined as in eq. 9 with $\\widehat{\\Delta}$ and $\\widehat{\\Gamma}$ replacing $\\Delta$ and $\\Gamma$). As expected (see e.g., Thm. 1), the regret is characterized by the regret realized from pulling suboptimal arms and arms with different means (Exploration Risk) and tends to zero as n increases. Indeed, if we considered two distributions with equal means ($\\mu_1 = \\mu_2$), the average regret coincides with $R_n^{\\widehat{\\Delta}}$. Furthermore, as shown in Thm. 1 the two regret terms decrease with the same rate $O(\\log n/n)$.\n\nFigure 2: Regret of MV-LCB and ExpExp in different scenarios. First panel: MV-LCB Regret Terms vs. n (mean regret against n × 10^3; curves: Regret, Regret Δ, Regret Γ). Second panel: Worst Case Regret vs. n (mean regret × 10^−2 against n × 10^3; curves: MV-LCB, ExpExp).\n\n7 Notice that although in the paper we assumed the distributions to be bounded in [0, 1] all the results can be extended to sub-Gaussian distributions.\n\nA detailed analysis of the impact of $\\Delta$ and $\\Gamma$ on the performance of MV-LCB is reported in App. D in [15]. Here we only compare the worst–case performance of MV-LCB to ExpExp (see Figure 2). In order to have a fair comparison, for any value of n and for each of the two algorithms, we select the pair $\\Delta_w$, $\\Gamma_w$ which corresponds to the largest regret (we search in a grid of values with $\\mu_1 = 1.5$, $\\mu_2 \\in [0.4; 1.5]$, $\\sigma_1^2 \\in [0.0; 0.25]$, and $\\sigma_2^2 = 0.25$, so that $\\Delta \\in [0.0; 0.25]$ and $\\Gamma \\in [0.0; 1.1]$). As discussed in Sect. 4, while the worst–case regret of ExpExp keeps decreasing over n, it is always possible to find a problem for which the regret of MV-LCB stabilizes to a constant. For numerical results with multiple values of $\\rho$ and 15 arms, see App. D in [15].\n6 Discussion\nIn this paper we evaluate the risk of an algorithm in terms of the variability of the sequences of samples that it actually generates. Although this notion might resemble other analyses of bandit algorithms (see e.g., the high-probability analysis in [5]), it captures different features of the learning algorithm. Whenever a bandit algorithm is run over n rounds, its behavior, combined with the arms’ distributions, generates a probability distribution over sequences of n rewards. While the quality of this sequence is usually defined by its cumulative sum (or average), here we say that a sequence of rewards is good if it displays a good trade-off between its (empirical) mean and variance. The variance of the sequence does not coincide with the variance of the algorithm over multiple runs. Let us consider a simple case with two arms that deterministically generate 0s and 1s respectively, and two different algorithms. Algorithm $\\mathcal{A}_1$ pulls the arms in a fixed sequence at each run (e.g., arm 1, arm 2, arm 1, arm 2, and so on), so that each arm is always pulled n/2 times. Algorithm $\\mathcal{A}_2$ chooses one arm uniformly at random at the beginning of the run and repeatedly pulls this arm for n rounds. Algorithm $\\mathcal{A}_1$ generates sequences such as 010101... which have high variability within each run, incurs a high regret (e.g., if $\\rho = 0$), but has no variance over multiple runs because it always generates the same sequence. On the other hand, $\\mathcal{A}_2$ has no variability in each run, since it generates sequences with only 0s or only 1s, suffers no regret in the case of variance minimization, but has high variance over multiple runs since the two completely different sequences are generated with equal probability. This simple example shows that an algorithm with small standard regret (e.g., $\\mathcal{A}_1$), might generate at each run sequences with high variability, while an algorithm with small mean-variance regret (e.g., $\\mathcal{A}_2$) might have a high variance over multiple runs.\n7 Conclusions\nThe majority of multi–armed bandit literature focuses on the problem of minimizing the regret w.r.t. the arm with the highest return in expectation. 
In this paper, we introduced a novel multi\u2013armed\nbandit setting where the objective is to perform as well as the arm with the best risk\u2013return trade\u2013off.\nIn particular, we relied on the mean\u2013variance model introduced in [10] to measure the performance\nof the arms and define the regret of a learning algorithm. We showed that defining the risk of a learning\nalgorithm as the variability (i.e., empirical variance) of the sequence of rewards generated at each\nrun leads to an interesting effect on the regret, in which an additional algorithm variance term appears. We\nproposed two novel algorithms to solve the mean\u2013variance bandit problem and we reported their\ncorresponding theoretical analysis. To the best of our knowledge this is the first work introducing\nrisk\u2013aversion in the multi\u2013armed bandit setting and it opens a series of interesting questions.\nLower bound. As discussed in the remarks of Thm. 1 and Thm. 2, MV-LCB has a regret of order\nO(\u221a(K/n)) on easy problems and O(1) on difficult problems, while ExpExp achieves the same\nregret O(K/n^{1/3}) over all problems. The primary open question is whether O(K/n^{1/3}) is actually\nthe best possible achievable rate (in the worst\u2013case) for this problem. This question is of particular\ninterest since the standard reward expectation maximization problem has a known lower\u2013bound of\n\u2126(\u221a(1/n)), and a minimax rate of \u2126(1/n^{1/3}) for the mean\u2013variance problem would imply that the\nrisk\u2013averse bandit problem is intrinsically more difficult than standard bandit problems.\nDifferent measures of return\u2013risk. Considering alternative notions of risk is a natural extension\nto the previous setting. In fact, over the years the mean\u2013variance model has often been criticized.\nFrom the point of view of expected utility theory, the mean\u2013variance model is only justified under a\nGaussianity assumption on the arm distributions. 
It also violates the monotonicity condition due to\nthe different orders of the mean and variance and is not a coherent measure of risk [2]. Furthermore,\nthe variance is a symmetric measure of risk, while it is often the case that only one\u2013sided deviations\nfrom the mean are undesirable (e.g., in finance only losses w.r.t. the expected return are considered\nas a risk, while any positive deviation is not considered as a real risk). Popular replacements for the\nmean\u2013variance are the \u03b1 value\u2013at\u2013risk (i.e., the quantile) or Conditional Value at Risk (CVaR, otherwise\nknown as average value at risk, tail value at risk, expected shortfall and lower tail risk) or other\ncoherent measures of risk [2]. While the estimation of the \u03b1 value\u2013at\u2013risk might be challenging\u2078,\nconcentration inequalities exist for the CVaR [7]. Another issue in moving from variance to other\nmeasures of risk is whether single-period or multi-period risk evaluation should be used. While\nthe single-period risk of an arm is simply the risk of its distribution, in a multi-period evaluation\nwe consider the risk of the sum of rewards obtained by repeatedly pulling the same arm over n\nrounds. Unlike the variance, for which the variance of a sum of n i.i.d. samples is simply n times\nthe variance of a single sample, for other measures of risk (e.g., \u03b1 value\u2013at\u2013risk) this is not necessarily the case. As a\nresult, an arm with the smallest single-period risk might not be the optimal choice over a horizon of\nn rounds. Therefore, the performance of an algorithm should be compared to the smallest risk that\ncan be achieved by any sequence of arms over n rounds, thus requiring a new definition of regret.\nSimple regret. Finally, an interesting related problem is the simple regret setting where the learner\nis allowed to explore over n rounds and it only suffers a regret defined on the solution returned at\nthe end. 
It is known that it is possible to design algorithms able to effectively estimate the mean of\nthe arms and finally return the best arm with high probability. In the risk-return setting, the objective\nwould be to return the arm with the best risk-return tradeoff.\nAcknowledgments This work was supported by the Ministry of Higher Education and Research, the Nord-Pas de Calais\nRegional Council and FEDER through the \u201ccontrat de projets \u00e9tat region 2007\u20132013\u201d,\nthe European Community\u2019s Seventh Framework Programme (FP7/2007-2013) under grant agreement\nn\u00b0 270327, and the PASCAL2 European Network of Excellence.\n\n\u2078While the cumulative distribution of a random variable can be reliably estimated (see e.g., [11]), estimating the quantile might be more difficult.\n\nReferences\n[1] Andr\u00e1s Antos, Varun Grover, and Csaba Szepesv\u00e1ri. Active learning in heteroscedastic noise. Theoretical Computer Science, 411:2712\u20132728, June 2010.\n\n[2] P Artzner, F Delbaen, JM Eber, and D Heath. Coherent measures of risk. Mathematical Finance, (June 1996):1\u201324, 1999.\n\n[3] Jean-Yves Audibert and S\u00e9bastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2785\u20132836, 2010.\n\n[4] Jean-Yves Audibert, S\u00e9bastien Bubeck, and R\u00e9mi Munos. Best arm identification in multi-armed bandits. In Proceedings of the Twenty-third Conference on Learning Theory (COLT\u201910), 2010.\n\n[5] Jean-Yves Audibert, R\u00e9mi Munos, and Csaba Szepesv\u00e1ri. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 410:1876\u20131902, 2009.\n\n[6] Peter Auer, Nicol\u00f2 Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235\u2013256, 2002.\n\n[7] David B. Brown. Large deviations bounds for estimating conditional value-at-risk. 
Operations Research Letters, 35:722\u2013730, 2007.\n\n[8] Eyal Even-Dar, Michael Kearns, and Jennifer Wortman. Risk-sensitive online learning. In Proceedings of the 17th international conference on Algorithmic Learning Theory (ALT\u201906), pages 199\u2013213, 2006.\n\n[9] Christian Gollier. The Economics of Risk and Time. The MIT Press, 2001.\n\n[10] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77\u201391, 1952.\n\n[11] Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269\u20131283, 1990.\n\n[12] J Neumann and O Morgenstern. Theory of games and economic behavior. Princeton University, Princeton, 1947.\n\n[13] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the AMS, 58:527\u2013535, 1952.\n\n[14] Antoine Salomon and Jean-Yves Audibert. Deviations of stochastic bandit regret. In Proceedings of the 22nd international conference on Algorithmic learning theory (ALT\u201911), pages 159\u2013173, 2011.\n\n[15] Amir Sani, Alessandro Lazaric, and R\u00e9mi Munos. Risk-aversion in multi-arm bandit. Technical Report hal-00750298, INRIA, 2012.\n\n[16] Manfred K. Warmuth and Dima Kuzmin. Online variance minimization. In Proceedings of the 19th Annual Conference on Learning Theory (COLT\u201906), pages 514\u2013528, 2006.\n", "award": [], "sourceid": 1514, "authors": [{"given_name": "Amir", "family_name": "Sani", "institution": null}, {"given_name": "Alessandro", "family_name": "Lazaric", "institution": null}, {"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}]}