{"title": "Adaptive Stratified Sampling for Monte-Carlo integration of Differentiable functions", "book": "Advances in Neural Information Processing Systems", "page_first": 251, "page_last": 259, "abstract": "We consider the problem of adaptive stratified sampling for Monte Carlo integration of a differentiable function given a finite number of evaluations of the function. We construct a sampling scheme that samples more often in regions where the function oscillates more, while allocating the samples such that they are well spread on the domain (a notion that shares similarities with low discrepancy). We prove that the estimate returned by the algorithm is almost as accurate as the estimate that an optimal oracle strategy (that would know the variations of the function everywhere) would return, and we provide a finite-sample analysis.", "full_text": "Adaptive Stratified Sampling for Monte-Carlo integration of Differentiable functions

Alexandra Carpentier
Statistical Laboratory, CMS
Wilberforce Road, Cambridge CB3 0WB, UK
a.carpentier@statslab.cam.ac.uk

Rémi Munos
INRIA Lille - Nord Europe
40, avenue Halley
59000 Villeneuve d'Ascq, France
remi.munos@inria.fr

Abstract

We consider the problem of adaptive stratified sampling for Monte Carlo integration of a differentiable function given a finite number of evaluations of the function. We construct a sampling scheme that samples more often in regions where the function oscillates more, while allocating the samples such that they are well spread on the domain (a notion that shares similarities with low discrepancy).
We prove that the estimate returned by the algorithm is almost as accurate as the estimate that an optimal oracle strategy (that would know the variations of the function everywhere) would return, and we provide a finite-sample analysis.

1 Introduction

In this paper we consider the problem of numerical integration of a differentiable function f : [0,1]^d → R given a finite budget n of evaluations of the function that can be allocated sequentially. A usual technique for reducing the mean squared error (w.r.t. the integral of f) of a Monte-Carlo estimate is the so-called stratified Monte Carlo sampling, which samples in a set of strata, or regions of the domain, that form a partition, i.e. a stratification, of the domain (see [10, Subsection 5.5] or [6]). It is efficient (up to rounding issues) to stratify the domain: when each stratum receives a number of samples proportional to its measure, the mean squared error of the resulting estimate is always smaller than or equal to that of the crude Monte-Carlo estimate (which samples the domain uniformly).

Since the considered functions are differentiable, if the domain is stratified in K hyper-cubic strata of same measure and if one assigns uniformly at random n/K samples per stratum, the mean squared error of the resulting stratified estimate is in O(n^{-1} K^{-2/d}). We deduce that if the stratification is built independently of the samples (before collecting the samples), and if n is known from the beginning (which is assumed here), the minimax-optimal choice of stratification is to build n strata of same measure and minimal diameter, and to assign only one sample per stratum uniformly at random. We refer to this sampling technique as Uniform stratified Monte-Carlo. The resulting estimate has a mean squared error of order O(n^{-(1+2/d)}).
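The O(1/n) versus O(n^{-(1+2/d)}) comparison is easy to observe numerically. The following sketch (our illustration, not code from the paper) contrasts crude Monte-Carlo with Uniform stratified Monte-Carlo in dimension d = 1:

```python
import numpy as np

def crude_mc(f, n, rng):
    # Crude Monte-Carlo: average of n i.i.d. uniform samples on [0, 1].
    return f(rng.random(n)).mean()

def uniform_stratified_mc(f, n, rng):
    # Uniform stratified Monte-Carlo: n strata of equal measure 1/n,
    # one uniform sample per stratum (the minimax-optimal static scheme).
    return f((np.arange(n) + rng.random(n)) / n).mean()

f = np.sin                      # reference: integral of sin on [0, 1] is 1 - cos(1)
truth = 1.0 - np.cos(1.0)
rng = np.random.default_rng(0)
n, reps = 256, 100
mse_crude = np.mean([(crude_mc(f, n, rng) - truth) ** 2 for _ in range(reps)])
mse_strat = np.mean([(uniform_stratified_mc(f, n, rng) - truth) ** 2 for _ in range(reps)])
```

With n = 256 the stratified estimate's mean squared error should sit orders of magnitude below the crude one, reflecting the n^{-3} versus n^{-1} rates in d = 1.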
The arguments that advocate for stratifying in strata of same measure and minimal diameter are closely linked to the reasons why quasi Monte-Carlo methods, or low discrepancy sampling schemes, are efficient techniques for integrating smooth functions. See [9] for a survey on these techniques.

It is minimax-optimal to stratify the domain in n strata and sample one point per stratum, but it would also be interesting to adapt the stratification of the space to the function f. For example, if the function has larger variations in a region of the domain, we would like to discretize the domain into smaller strata in this region, so that more samples are assigned there. Since f is initially unknown, it is not possible to design a good stratification before sampling. However, an efficient algorithm should allocate the samples so as to estimate online the variations of the function in each region of the domain while, at the same time, allocating more samples in regions where f has larger local variations.

The papers [5, 7, 3] provide algorithms for solving a similar trade-off when the stratification is fixed: these algorithms allocate more samples to strata in which the function has larger variations. It is, however, clear that the larger the number of strata, the more difficult it is to allocate the samples almost optimally within them.

Contributions: We propose a new algorithm, Lipschitz Monte-Carlo Upper Confidence Bound (LMC-UCB), for tackling this problem. It is a two-layered algorithm. It first stratifies the domain in K ≪ n strata, and then allocates uniformly to each stratum an initial small number of samples in order to roughly estimate the variations of the function per stratum.
Then our algorithm sub-stratifies each of the K strata according to the estimated local variations, so that there are in total approximately n sub-strata, and allocates one point per sub-stratum. In that way, our algorithm discretizes the domain into more refined strata in regions where the function has larger variations. It combines the advantages of quasi Monte-Carlo and adaptive strategies. More precisely, our contributions are the following:

• We prove an asymptotic lower bound on the mean squared error of the estimate returned by an optimal oracle strategy that has access to the variations of the function f everywhere and would use the best stratification of the domain with hyper-cubes (possibly of heterogeneous sizes). Being a lower bound on any oracle strategy, this quantity is smaller than the mean squared error of the estimate provided by Uniform stratified Monte-Carlo (which is the non-adaptive minimax-optimal strategy on the class of differentiable functions), and also smaller than that of crude Monte-Carlo.

• We introduce the algorithm LMC-UCB, which sub-stratifies the K strata into hyper-cubic sub-strata and samples one point per sub-stratum. The number of sub-strata per stratum is linked to the variations of the function in the stratum. We prove that algorithm LMC-UCB is asymptotically as efficient as the optimal oracle strategy. We also provide finite-time results when f admits a Taylor expansion of order 2 at every point. By tuning the number of strata K wisely, it is possible to build an algorithm that is almost as efficient as the optimal oracle strategy.

The paper is organized as follows. Section 2 defines the notations used throughout the paper.
Section 3 states the asymptotic lower bound on the mean squared error of the optimal oracle strategy. In this Section, we also provide an intuition on how the number of samples in each stratum should be linked to the variation of the function in the stratum in order for the mean squared error of the estimate to be small. Section 4 presents the algorithm LMC-UCB and a first Lemma on how many sub-strata are built in the initial strata. Section 5 finally states that the algorithm LMC-UCB is almost as efficient as the optimal oracle strategy. We finally conclude the paper. Due to the lack of space, we provide experiments and proofs in the Supplementary Material (see also [2]).

2 Setting

We consider a function f : [0,1]^d → R. We want to estimate as accurately as possible its integral with respect to the Lebesgue measure, i.e. ∫_{[0,1]^d} f(x)dx. To this end, we consider algorithms that stratify the domain in two layers of strata, one more refined than the other. The strata of the refined layer are referred to as sub-strata, and we sample in the sub-strata. We compare the performance of the algorithms we construct with the performance of the optimal oracle algorithm that has access to the variations ||∇f(x)||_2 of the function f everywhere in the domain, and is allowed to sample the domain wherever it wishes.

The first step is to partition the domain [0,1]^d into K measurable strata. In this paper, we assume that K^{1/d} is an integer¹. This enables us to partition, in a natural way, the domain into K hyper-cubic strata (Ω_k)_{k≤K} of same measure w_k = 1/K. Each of these strata is a region of the domain [0,1]^d, and the K strata form a partition of the domain. We write µ_k = (1/w_k) ∫_{Ω_k} f(x)dx the mean and σ_k² = (1/w_k) ∫_{Ω_k} (f(x) − µ_k)² dx the variance of a sample of the function f when sampling f at a point chosen at random according to the Lebesgue measure conditioned to stratum Ω_k.

¹This is not restrictive in small dimension, but it may become more constraining for large d.

We possess a budget of n samples (which is assumed to be known in advance), which means that we can sample the function n times at any points of [0,1]^d. We denote by A an algorithm that sequentially allocates the budget by sampling at round t in the stratum indexed by k_t ∈ {1,...,K}, and that returns, after all n samples have been used, an estimate µ̂_n of the integral of the function f.

We consider strategies that sub-partition each stratum Ω_k into hyper-cubes of same measure within Ω_k, but of heterogeneous measure across the Ω_k. In this way, the number of sub-strata in each stratum Ω_k can adapt to the variations of f within Ω_k. The algorithms that we consider return a sub-partition of each stratum Ω_k into S_k sub-strata. We call N_k = (Ω_{k,i})_{i≤S_k} the sub-partition of stratum Ω_k. In each of these sub-strata, the algorithm allocates at least one point². We write X_{k,i} for the first point sampled uniformly at random in sub-stratum Ω_{k,i}, and w_{k,i} for the measure of the sub-stratum Ω_{k,i}. Let us write µ_{k,i} = (1/w_{k,i}) ∫_{Ω_{k,i}} f(x)dx the mean and σ_{k,i}² = (1/w_{k,i}) ∫_{Ω_{k,i}} (f(x) − µ_{k,i})² dx the variance of a sample of f in sub-stratum Ω_{k,i} (e.g. of X_{k,i} = f(U_{k,i}) where U_{k,i} ∼ U_{Ω_{k,i}}).

²This implies that Σ_k S_k ≤ n.

This class of 2-layered sampling strategies is rather large. In fact it contains strategies that are similar to low discrepancy strategies, and also to any stratified Monte-Carlo strategy. For example, consider that all K strata are hyper-cubes of same measure 1/K and that each stratum Ω_k is partitioned into S_k hyper-rectangles Ω_{k,i} of minimal diameter and same measure 1/(K S_k). If the algorithm allocates one point per sub-stratum, its sampling scheme shares similarities with quasi Monte-Carlo sampling schemes, since the points at which the function is sampled are well spread.

Let us now consider an algorithm that first chooses the sub-partition (N_k)_k and then deterministically allocates 1 sample uniformly at random in each sub-stratum Ω_{k,i}. We consider the stratified estimate µ̂_n = Σ_{k=1}^K Σ_{i=1}^{S_k} w_{k,i} X_{k,i} of µ. We have

E(µ̂_n) = Σ_{k=1}^K Σ_{i=1}^{S_k} w_{k,i} µ_{k,i} = Σ_{k≤K} Σ_{i=1}^{S_k} ∫_{Ω_{k,i}} f(x)dx = ∫_{[0,1]^d} f(x)dx = µ,

and also

V(µ̂_n) = Σ_{k≤K} Σ_{i=1}^{S_k} w_{k,i}² E(X_{k,i} − µ_{k,i})² = Σ_{k≤K} Σ_{i=1}^{S_k} w_{k,i}² σ_{k,i}².

For a given algorithm A that builds for each stratum k a sub-partition N_k = (Ω_{k,i})_{i≤S_k}, we call pseudo-risk the quantity

L_n(A) = Σ_{k≤K} Σ_{i=1}^{S_k} w_{k,i}² σ_{k,i}².   (1)

Some further insight on this quantity is provided in the paper [4].

Consider now the uniform strategy, i.e. a strategy that divides the domain into K = n hyper-cubic strata. This strategy is a fairly natural, minimax-optimal static strategy on the class of differentiable functions defined on [0,1]^d, when no information on f is available. We will prove in the next Section that its asymptotic mean squared error is equal to

(1/12) (∫_{[0,1]^d} ||∇f(x)||₂² dx) · 1/n^{1+2/d}.

This quantity is of order n^{−1−2/d}, which is, as expected, smaller than 1/n: this strategy is more efficient than crude Monte-Carlo.

We will also prove in the next Section that the minimum asymptotic mean squared error of an optimal oracle strategy (we call it "oracle" because it builds the stratification using the information about the variations ||∇f(x)||₂ of f at every point x) is larger than

(1/12) (∫_{[0,1]^d} (||∇f(x)||₂)^{d/(d+1)} dx)^{2(d+1)/d} · 1/n^{1+2/d}.

This quantity is always smaller than the asymptotic mean squared error of the Uniform stratified Monte-Carlo strategy, which makes sense since the oracle strategy assumes knowledge of the variations of f everywhere, and can thus adapt the number of samples in each region accordingly. We define

Σ = (1/12) (∫_{[0,1]^d} (||∇f(x)||₂)^{d/(d+1)} dx)^{2(d+1)/d}.   (2)

Given this minimum asymptotic mean squared error of an optimal oracle strategy, we define the pseudo-regret of an algorithm A as

R_n(A) = L_n(A) − Σ · 1/n^{1+2/d}.   (3)

This pseudo-regret is the difference between the pseudo-risk of the estimate provided by algorithm A and the lower bound on the optimal oracle mean squared error. In other words, this pseudo-regret is the price an adaptive strategy pays for not knowing the function f in advance, and thus not having access to its variations.
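As a quick numerical sanity check (ours, not from the paper), one can verify in dimension d = 1 that the oracle constant Σ of Equation (2) is indeed no larger than the Uniform stratified Monte-Carlo constant (1/12) ∫ ||∇f(x)||₂² dx, here for the test function f(x) = sin(3x):

```python
import numpy as np

# Check (d = 1) that Sigma = (1/12) (∫ ||∇f||^{d/(d+1)} dx)^{2(d+1)/d} is no
# larger than the Uniform stratified constant (1/12) ∫ ||∇f||^2 dx, for
# f(x) = sin(3x), so ||∇f(x)||_2 = |3 cos(3x)|.

def integrate(y, x):
    # Trapezoidal rule on a grid (avoids depending on np.trapz availability).
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

d = 1
x = np.linspace(0.0, 1.0, 100_001)
g = np.abs(3.0 * np.cos(3.0 * x))        # ||∇f(x)||_2

c_uniform = integrate(g ** 2, x) / 12.0                               # Lemma 2 constant
sigma = integrate(g ** (d / (d + 1)), x) ** (2 * (d + 1) / d) / 12.0  # Equation (2)
```

Equality of the two constants would require ||∇f(x)||₂ to be constant over the domain; any genuinely heterogeneous function leaves room for adaptation.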
An efficient adaptive strategy should aim at minimizing this gap, which comes from the lack of information.

3 Discussion on the optimal asymptotic mean squared error

3.1 Asymptotic lower bound on the mean squared error, and comparison with the Uniform stratified Monte-Carlo

A first part of the analysis of the exposed problem consists in finding a good point of comparison for the pseudo-risk. The following Lemma states an asymptotic lower bound on the mean squared error of the optimal oracle sampling strategy.

Lemma 1 Assume that f is such that ∇f is continuous and ∫ ||∇f(x)||₂² dx < ∞. Let ((Ω^n_k)_{k≤n})_n be an arbitrary sequence of partitions of [0,1]^d into n strata such that all the strata are hyper-cubes, and such that the maximum diameter of each stratum goes to 0 as n → +∞ (but the strata are allowed to have heterogeneous measures). Let µ̂_n be the stratified estimate of the function for the partition (Ω^n_k)_{k≤n} when one point is pulled at random per stratum. Then

lim inf_{n→∞} n^{1+2/d} V(µ̂_n) ≥ Σ.

The full proof of this Lemma is in the Supplementary Material, Appendix B (see also [2]).

We also have the following equality for the asymptotic mean squared error of the uniform strategy.

Lemma 2 Assume that f is such that ∇f is continuous and ∫ ||∇f(x)||₂² dx < ∞. For any n = l^d such that l is an integer (and thus such that it is possible to partition the domain into n hyper-cubic strata of same measure), define ((Ω^n_k)_{k≤n})_n as the sequence of partitions into hyper-cubic strata of same measure 1/n. Let µ̂_n be the stratified estimate of the function for the partition (Ω^n_k)_{k≤n} when one point is pulled at random per stratum. Then

lim_{n→∞} n^{1+2/d} V(µ̂_n) = (1/12) (∫_{[0,1]^d} ||∇f(x)||₂² dx).

The proof of this Lemma is substantially similar to the proof of Lemma 1 in the Supplementary Material, Appendix B (see also [2]). The only difference is that the measure of each stratum Ω^n_k is 1/n, and that in Step 2 the dominated convergence Theorem is required instead of Fatou's Lemma.

The optimal rate for the mean squared error, which is also the rate of the Uniform stratified Monte-Carlo in Lemma 2, is n^{−1−2/d} and is attained with ideas of low discrepancy sampling. The constant can however be improved (with respect to the constant in Lemma 2) by adapting to the specific shape of each function. In Lemma 1, we exhibit a lower bound for this constant (and, without surprise, (1/12) (∫_{[0,1]^d} ||∇f(x)||₂² dx) ≥ Σ). Our aim is to build an adaptive sampling scheme, also sharing ideas with low discrepancy sampling, that attains this lower bound.

There is one main restriction in both Lemmas: we impose that the sequence of partitions ((Ω^n_k)_{k≤n})_n be composed only of strata that have the shape of a hyper-cube. This assumption is in fact reasonable: indeed, if the shape of the strata could be arbitrary, one could take the level sets (or approximate level sets, as the number of strata is limited by n) as strata, and this would lead to lim_{n→∞} inf_Ω n^{1+2/d} V(µ̂_{n,Ω}) = 0. But this is not a fair competition, as the function is unknown, and determining these level sets is actually a much harder problem than integrating the function. The fact that the strata are hyper-cubes appears, in fact, in the bound. If we had chosen other shapes, e.g. l₂ balls, the constant 1/12 in front of the bounds in both Lemmas would change³. It is however not possible to make a finite partition of [0,1]^d into l₂ balls, and we chose hyper-cubes since it is quite easy to stratify [0,1]^d into hyper-cubic strata.

³The 1/12 comes from computing the variance of a uniform random variable on [0,1].

The proof of Lemma 1 makes the quantity

s*(x) = (||∇f(x)||₂)^{d/(d+1)} / ∫_{[0,1]^d} (||∇f(u)||₂)^{d/(d+1)} du

appear. This quantity is proposed as "asymptotic optimal allocation", i.e. the asymptotically optimal number of sub-strata one would ideally create in any small sub-stratum centered at x. This is however not very useful for building an algorithm. The next Subsection provides an intuition on this matter.

3.2 An intuition of a good allocation: piecewise linear functions

In this Subsection, we (i) provide an example where the asymptotic optimal mean squared error is also the optimal mean squared error at finite distance, and (ii) provide explicitly what is, in that case, a good allocation. We do this in order to give an intuition for the algorithm that we introduce in the next Section.

We consider a partition into K hyper-cubic strata Ω_k. Let us assume that the function f is affine on all strata Ω_k, i.e. on stratum Ω_k we have f(x) = (⟨θ_k, x⟩ + ρ_k) I{x ∈ Ω_k}. In that case µ_k = f(a_k), where a_k is the center of stratum Ω_k. We then have

σ_k² = (1/w_k) ∫_{Ω_k} (f(x) − f(a_k))² dx = (1/w_k) ∫_{Ω_k} ⟨θ_k, x − a_k⟩² dx = (1/w_k) (||θ_k||₂²/12) w_k^{1+2/d} = (||θ_k||₂²/12) w_k^{2/d}.

We also consider a sub-partition of Ω_k into S_k hyper-cubes of same size (we assume that S_k^{1/d} is an integer), and we assume that one point is sampled in each sub-stratum Ω_{k,i}. We then have σ_{k,i}² = (||θ_k||₂²/12) (w_k/S_k)^{2/d} for sub-stratum Ω_{k,i}. For given k and S_k, all the σ_{k,i} are equal. The pseudo-risk of an algorithm A that divides each stratum Ω_k into S_k sub-strata is thus

L_n(A) = Σ_{k≤K} Σ_{i≤S_k} w_{k,i}² σ_{k,i}² = Σ_{k≤K} (||θ_k||₂²/12) (w_k^{2+2/d} / S_k^{1+2/d}) = Σ_{k≤K} (w_k² / S_k^{1+2/d}) σ_k².

If an unadaptive algorithm A* has access to the variances σ_k² of the strata, it can allocate the budget so as to minimize the pseudo-risk. After solving the simple optimization problem of minimizing L_n(A) with respect to (S_k)_k, we deduce that an optimal oracle strategy on this stratification would divide each stratum k into

S*_k = [(w_k σ_k)^{d/(d+1)} / Σ_{i≤K} (w_i σ_i)^{d/(d+1)}] · n

sub-strata⁴. The pseudo-risk of this strategy is then

L_{n,K}(A*) = (Σ_{k≤K} (w_k σ_k)^{d/(d+1)})^{2(d+1)/d} · 1/n^{1+2/d} = Σ_K^{2(d+1)/d} / n^{1+2/d},   (4)

where we write Σ_K = Σ_{i≤K} (w_i σ_i)^{d/(d+1)}. We will call in the paper optimal proportions the quantities

λ_{K,k} = (w_k σ_k)^{d/(d+1)} / Σ_{i≤K} (w_i σ_i)^{d/(d+1)}.   (5)

In the specific case of piecewise linear functions, we have

Σ_K = Σ_{k≤K} (w_k σ_k)^{d/(d+1)} = Σ_{k≤K} (w_k^{1+1/d} ||θ_k||₂ / (2√3))^{d/(d+1)} = ∫_{[0,1]^d} (||∇f(x)||₂ / (2√3))^{d/(d+1)} dx.

We thus have

L_{n,K}(A*) = Σ · 1/n^{1+2/d}.   (6)

This optimal oracle strategy attains the lower bound of Lemma 1. We will thus construct, in the next Section, an algorithm that learns and adapts to the optimal proportions defined in Equation 5.

⁴We deliberately forget about rounding issues in this Subsection. The allocation we provide might not be realizable (e.g.
if S*_k is not an integer), but plugging it into the bound provides a lower bound on any realizable performance.

4 The Algorithm LMC-UCB

4.1 Algorithm LMC-UCB

We present the algorithm Lipschitz Monte Carlo Upper Confidence Bound (LMC-UCB). It takes as parameter a partition (Ω_k)_{k≤K} into K ≤ n hyper-cubic strata of same measure 1/K (which is possible since we assume that ∃l ∈ N such that l^d = K). It also takes as parameter a uniform upper bound L on ||∇f(x)||₂², and δ, a (small) probability. The aim of algorithm LMC-UCB is to sub-stratify each stratum Ω_k into λ_{K,k} n = [(w_k σ_k)^{d/(d+1)} / Σ_{i=1}^K (w_i σ_i)^{d/(d+1)}] n hyper-cubic sub-strata of same measure and to sample one point per sub-stratum. An intuition on why this target is relevant was provided in Section 3.

Algorithm LMC-UCB starts by sub-stratifying each stratum Ω_k into

S̄ = ⌊((n/K)^{d/(d+1)})^{1/d}⌋^d

hyper-cubic strata of same measure. This is possible since, by definition, S̄^{1/d} is an integer. We write this first sub-stratification N'_k = (Ω'_{k,i})_{i≤S̄}. The algorithm then pulls one sample per sub-stratum in N'_k for each Ω_k.

It then sub-stratifies again each stratum Ω_k using the information collected. It sub-stratifies each stratum Ω_k into

S_k = max( ⌊[ (w_k^{d/(d+1)} (σ̂_{k,KS̄} + A (w_k/S̄)^{1/d} √(1/S̄))^{d/(d+1)}) / (Σ_{i=1}^K w_i^{d/(d+1)} (σ̂_{i,KS̄} + A (w_i/S̄)^{1/d} √(1/S̄))^{d/(d+1)}) · (n − K S̄) ]^{1/d} ⌋^d , S̄ )   (7)

hyper-cubic strata of same measure (see Figure 1 for the definition of A). This is possible because, by definition, S_k^{1/d} is an integer.
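For intuition on the target allocation that Equation (7) plugs empirical estimates into, the oracle proportions of Subsection 3.2 can be checked numerically. The sketch below (our illustration; the slopes and sizes are arbitrary) computes the optimal proportions of Equation (5) for a piecewise-linear function in d = 1 and verifies the pseudo-risk identity of Equation (4):

```python
import numpy as np

# Numerical check (ours) of the oracle allocation in d = 1 for a
# piecewise-linear f with slope theta_k on stratum k:
#   sigma_k = |theta_k| w_k / sqrt(12),
#   lambda_k ∝ (w_k sigma_k)^(d/(d+1))        (Equation (5)),
#   pseudo-risk L_n = sum_k w_k^2 sigma_k^2 / S_k^(1+2/d).

d = 1
K, n = 10, 10_000
w = np.full(K, 1.0 / K)                     # equal-measure strata
theta = np.linspace(1.0, 20.0, K)           # steeper slopes in later strata
sigma = np.abs(theta) * w / np.sqrt(12.0)   # per-stratum std dev (d = 1)

def pseudo_risk(S):
    return np.sum(w ** 2 * sigma ** 2 / S ** (1 + 2 / d))

lam = (w * sigma) ** (d / (d + 1))
lam /= lam.sum()                            # optimal proportions, Equation (5)
sigma_K = np.sum((w * sigma) ** (d / (d + 1)))

risk_oracle = pseudo_risk(lam * n)          # allocation S_k* = lambda_k n
risk_uniform = pseudo_risk(np.full(K, n / K))
```

The oracle allocation strictly beats the uniform split whenever the slopes are heterogeneous, and its pseudo-risk matches the closed form Σ_K^{2(d+1)/d} / n^{1+2/d} of Equation (4).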
We call this sub-stratification of stratum Ω_k the stratification N_k = (Ω_{k,i})_{i≤S_k}. In Equation (7), σ̂_{k,KS̄} denotes the empirical standard deviation in stratum Ω_k at time KS̄, computed as

σ̂_{k,KS̄} = √[ (1/(S̄−1)) Σ_{i=1}^{S̄} (X_{k,i} − (1/S̄) Σ_{j=1}^{S̄} X_{k,j})² ].   (8)

Algorithm LMC-UCB then samples one point in each sub-stratum Ω_{k,i}. This is possible since, by definition of S_k, Σ_k S_k + K S̄ ≤ n.

The algorithm outputs an estimate µ̂_n of the integral of f, computed with the first point in each sub-stratum of partition N_k. We present in Figure 1 the pseudo-code of algorithm LMC-UCB.

Input: Partition (Ω_k)_{k≤K}, L, δ; set A = 2L√d √(log(2K/δ)).
Initialize: ∀k ≤ K, sample 1 point in each stratum of partition N'_k.
Main algorithm:
  Compute S_k for each k ≤ K (Equation 7).
  Create partition N_k for each k ≤ K.
  Sample a point in each Ω_{k,i} ∈ N_k, for i ≤ S_k.
Output: Return the estimate µ̂_n computed with the first point X_{k,i} in each sub-stratum Ω_{k,i} of N_k, that is to say µ̂_n = Σ_{k=1}^K w_k Σ_{i=1}^{S_k} X_{k,i}/S_k.

Figure 1: Pseudo-code of LMC-UCB. The definitions of N'_k, S̄, N_k, Ω_{k,i} and S_k are in the main text.

4.2 High probability lower bound on the number of sub-strata of stratum Ω_k

We first state an assumption on the function f.

Assumption 1 The function f is such that ∇f exists and ∀x ∈ [0,1]^d, ||∇f(x)||₂² ≤ L.

The next Lemma states that, with high probability, the number S_k of sub-strata of stratum Ω_k, in which there is at least one point, adjusts "almost" to the unknown optimal proportions.

Lemma 3 Let Assumption 1 be satisfied and let (Ω_k)_{k≤K} be a partition into K hyper-cubic strata of same measure. If n ≥ 4K, then with probability at least 1 − δ, for every k the number of sub-strata satisfies

S_k ≥ max( λ_{K,k} (n − 7(L+1) d^{3/2} √(log(K/δ)) (1 + 1/Σ_K) K^{1/(d+1)} n^{d/(d+1)}), S̄ ).

The proof of this result is in the Supplementary Material, Appendix C (see also [2]).

4.3 Remarks

A sampling scheme that shares ideas with quasi Monte-Carlo methods: Algorithm LMC-UCB almost manages to divide each stratum Ω_k into λ_{K,k} n hyper-cubic strata of same measure, each of them containing at least one sample. It is thus possible to build a learning procedure that, at the same time, estimates the empirical proportions λ_{K,k} and allocates the samples proportionally to them.

The error terms: There are two reasons why we are not able to divide each stratum Ω_k into exactly λ_{K,k} n hyper-cubic strata of same measure. The first reason is that the true proportions λ_{K,k} are unknown and must be estimated. The second reason is that we want to build strata that are hyper-cubes of same measure: the number of strata S_k must be such that S_k^{1/d} is an integer, so we also lose efficiency because of rounding issues.

5 Main results

5.1 Asymptotic convergence of algorithm LMC-UCB

By combining the result of Lemma 1 with the result of Lemma 3, it is possible to show that algorithm LMC-UCB is asymptotically (when K goes to +∞ and n ≥ K) as efficient as the optimal oracle strategy of Lemma 1.

Theorem 1 Assume that ∇f is continuous and that Assumption 1 is satisfied.
Let (Ω^n_k)_{n,k≤K_n} be an arbitrary sequence of partitions such that all the strata are hyper-cubes, such that 4K_n ≤ n, such that the diameter of each stratum goes to 0, and such that

lim_{n→+∞} (1/n) K_n (log(K_n n²))^{(d+1)/2} = 0.

The regret of LMC-UCB with parameter δ_n = 1/n² on this sequence of partitions, where for the partition (Ω^n_k)_{k≤K_n} it disposes of n points, is such that

lim_{n→∞} n^{1+2/d} R_n(A_{LMC-UCB}) = 0.

The proof of this result is in the Supplementary Material, Appendix D (see also [2]).

5.2 Under a slightly stronger Assumption

We introduce the following Assumption, namely that f admits a Taylor expansion of order 2.

Assumption 2 f admits a Taylor expansion at the second order at any point a ∈ [0,1]^d, and this expansion is such that ∀x, |f(x) − f(a) − ⟨∇f(a), x − a⟩| ≤ M ||x − a||₂², where M is a constant.

This is a slightly stronger assumption than Assumption 1, since it imposes, in addition to Assumption 1, that the variations of ∇f(x) are uniformly bounded for any x ∈ [0,1]^d. Assumption 2 implies Assumption 1 since | ||∇f(x)||₂ − ||∇f(0)||₂ | ≤ M ||x − 0||₂, which implies that ||∇f(x)||₂ ≤ ||∇f(0)||₂ + M√d. This implies in particular that we can consider L = ||∇f(0)||₂ + M√d. We however do not need M to tune the algorithm LMC-UCB, as long as we have access to L (although M appears in the bound of the next Theorem).

We can now prove a bound on the pseudo-regret.

Theorem 2 Under Assumptions 1 and 2, if n ≥ 4K, the estimate returned by algorithm LMC-UCB is such that, with probability 1 − δ, we have

R_n(A_{LMC-UCB}) ≤ (M (L+1)^4 (1 + 3Md/Σ)^4 / n^{1+2/d}) (650 d^{3/2} √(log(K/δ)) K^{1/(d+1)} n^{−1/(d+1)} + 25 d (1/K)^{1/d}).

A proof of this result is in the Supplementary Material, Appendix E (see also [2]).

Now we can choose the number of strata optimally so as to minimize the regret.

Theorem 3 Under Assumptions 1 and 2, the algorithm LMC-UCB launched on K_n = ⌊(√n)^{1/d}⌋^d hyper-cubic strata is such that, with probability 1 − δ, we have

R_n(A_{LMC-UCB}) ≤ (700 M (L+1)^4 d^{3/2} (1 + 3Md/Σ)^4 √(log(n/δ))) / n^{1+2/d+1/(2(d+1))}.

5.3 Discussion

Convergence of the algorithm LMC-UCB to the optimal oracle strategy: When the number of strata K_n grows to infinity, but such that lim_{n→+∞} (1/n) K_n (log(K_n n²))^{(d+1)/2} = 0, the pseudo-regret of algorithm LMC-UCB converges to 0. It means that this strategy is asymptotically as efficient as (the lower bound on) the optimal oracle strategy. When f admits a Taylor expansion at the first order at every point, it is also possible to obtain a finite-time bound on the pseudo-regret.

A new sampling scheme: The algorithm LMC-UCB samples the points in a way that takes advantage of both stratified sampling and quasi Monte-Carlo.
Indeed, LMC-UCB is designed to combine (i) the advantages of quasi Monte-Carlo, by spreading the samples over the domain, and (ii) the advantages of stratified, adaptive sampling, by allocating more samples where the function has larger variations. For these reasons, this technique is efficient on differentiable functions. We illustrate this assertion with numerical experiments in the Supplementary Material, Appendix A (see also [2]).

In high dimension: The bound on the pseudo-regret in Theorem 3 is of order n^{−1−2/d} × poly(d) n^{−1/(2(d+1))}. In order for the pseudo-regret to be negligible compared to the optimal oracle mean squared error of the estimate (which is of order n^{−1−2/d}), it is necessary that poly(d) n^{−1/(2(d+1))} be negligible compared to 1. In particular, this says that n should scale exponentially with the dimension d. This is unavoidable, since stratified sampling shrinks the approximation error to the asymptotic oracle only if the diameter of each stratum is small, i.e. if the space is stratified in every direction (and thus if n is exponential in d). However, Uniform stratified Monte-Carlo shares this problem, for the same reasons⁵.

We emphasize however that a (slightly modified) version of our algorithm is more efficient than crude Monte-Carlo, up to a negligible term that depends only on poly(log(d)). The bound in Lemma 3 depends on poly(d) only because of rounding issues, coming from the fact that we aim at dividing each stratum Ω_k into hyper-cubic sub-strata. The whole budget is thus not completely used, and only Σ_k S_k + K S̄ samples are collected.
By modifying LMC-UCB so that it allocates the remaining budget uniformly at random on the domain, it is possible to prove that the (modified) algorithm is always at least as efficient as crude Monte-Carlo.

Conclusion

This work provides an adaptive method for estimating the integral of a differentiable function f. We first proposed a benchmark for measuring efficiency: we proved that the asymptotic mean squared error of the estimate output by the optimal oracle strategy is lower bounded by Σ/n^{1+2/d}. We then proposed an algorithm called LMC-UCB, which manages to learn the amplitude of the variations of f, to sample more points where these variations are larger, and to spread these points in a way that is related to quasi Monte-Carlo sampling schemes. We proved that algorithm LMC-UCB is asymptotically as efficient as the optimal oracle strategy. Under the assumption that f admits a Taylor expansion at each point, we also provide a finite-time bound for the pseudo-regret of algorithm LMC-UCB. We summarize in Table 1 the rates and finite-time bounds for crude Monte-Carlo, Uniform stratified Monte-Carlo and LMC-UCB.
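The rate gap between crude and Uniform stratified Monte-Carlo summarized in Table 1 is easy to observe numerically. The following sketch (our own illustration, not the paper's experiments) compares the empirical mean squared error of the two schemes on a smooth one-dimensional function:

```python
import numpy as np

def crude_mc(f, n, d, rng):
    """Crude Monte-Carlo: n i.i.d. uniform samples on [0,1]^d."""
    return np.mean([f(x) for x in rng.random((n, d))])

def uniform_stratified_mc(f, k_per_dim, d, rng):
    """Uniform stratified Monte-Carlo: one uniform sample in each of the
    k_per_dim**d equal-measure hyper-cubic strata (so n = k_per_dim**d)."""
    grid = np.array(np.meshgrid(*[np.arange(k_per_dim)] * d)).reshape(d, -1).T
    xs = (grid + rng.random(grid.shape)) / k_per_dim
    return np.mean([f(x) for x in xs])

# empirical MSE of both schemes for f(x) = sin(pi x) on [0,1]
# (true integral 2/pi), with n = 256 samples each, over 100 runs
rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x[0])
truth = 2 / np.pi
mse_crude = np.mean([(crude_mc(f, 256, 1, rng) - truth) ** 2 for _ in range(100)])
mse_strat = np.mean([(uniform_stratified_mc(f, 256, 1, rng) - truth) ** 2
                     for _ in range(100)])
# stratification improves the MSE from O(1/n) to O(n^{-(1+2/d)})
```

On this example the stratified MSE is orders of magnitude below the crude one, consistent with the n^{−(1+2/d)} versus n^{−1} rates in Table 1.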
An interesting extension of this work would be to adapt it to α-Hölder functions that admit a Riemann-Liouville derivative of order α. We believe that similar results could be obtained, with an optimal constant and a rate of order n^{−(1+2α/d)}.

Sampling scheme        | Rate         | Asymptotic constant                                        | Finite-time bound
Crude MC               | 1/n          | ∫_{[0,1]^d} (f(x) − ∫_{[0,1]^d} f(u)du)² dx                | +0
Uniform stratified MC  | 1/n^{1+2/d}  | (1/12) ∫_{[0,1]^d} ||∇f(x)||₂² dx                          | +O(d / n^{1+2/d+1/(2d)})
LMC-UCB                | 1/n^{1+2/d}  | (1/12) (∫_{[0,1]^d} ||∇f(x)||₂^{2d/(d+1)} dx)^{(d+1)/d}    | +O(d^{11/2} / n^{1+2/d+1/(2(d+1))})

Table 1: Rate of convergence plus finite-time bounds for Crude Monte-Carlo, Uniform stratified Monte-Carlo (see Lemma 2) and LMC-UCB (see Theorems 1 and 3).

Acknowledgements This research was partially supported by Nord-Pas-de-Calais Regional Council, French ANR EXPLO-RA (ANR-08-COSI-004), the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 270327 (project CompLACS), and by Pascal-2.

⁵When d is very large and n is not exponential in d, second-order terms, depending on the dimension, take over the bound in Lemma 2 (which is an asymptotic bound), and poly(d) appears in these negligible terms.

References
[1] J.Y. Audibert, R. Munos, and Cs. Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.
[2] A. Carpentier and R. Munos. Adaptive stratified sampling for Monte-Carlo integration of differentiable functions. Technical report, arXiv:0575985, 2012.
[3] A. Carpentier and R. Munos. Finite-time analysis of stratified sampling for Monte Carlo. In Neural Information Processing Systems (NIPS), 2011a.
[4] A. Carpentier and R.
Munos. Finite-time analysis of stratified sampling for Monte Carlo. Technical report, INRIA-00636924, 2011b.
[5] P. Etoré and B. Jourdain. Adaptive optimal allocation in stratified sampling methods. Methodol. Comput. Appl. Probab., 12(3):335–360, September 2010.
[6] P. Glasserman. Monte Carlo Methods in Financial Engineering. Springer Verlag, 2004. ISBN 0387004513.
[7] V. Grover. Active learning and its application to heteroscedastic problems. MSc thesis, Department of Computing Science, Univ. of Alberta, 2009.
[8] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In Proceedings of the Twenty-Second Annual Conference on Learning Theory, pages 115–124, 2009.
[9] H. Niederreiter. Quasi-Monte Carlo methods and pseudo-random numbers. Bull. Amer. Math. Soc., 84(6):957–1041, 1978.
[10] R.Y. Rubinstein and D.P. Kroese. Simulation and the Monte Carlo Method. Wiley-Interscience, 2008. ISBN 0470177942.