{"title": "Bayesian Dyadic Trees and Histograms for Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 2089, "page_last": 2099, "abstract": "Many machine learning tools for regression are based on recursive partitioning of the covariate space into smaller regions, where the regression function can be estimated locally. Among these, regression trees and their ensembles have demonstrated impressive empirical performance. In this work, we shed light on the machinery behind Bayesian variants of these methods. In particular, we study Bayesian regression histograms, such as Bayesian dyadic trees, in the simple regression case with just one predictor. We focus on the reconstruction of regression surfaces that are piecewise constant, where the number of jumps is unknown. We show that with suitably designed priors, posterior distributions concentrate around the true step regression function at a near-minimax rate. These results do not require the knowledge of the true number of steps, nor the width of the true partitioning cells. Thus, Bayesian dyadic regression trees are fully adaptive and can recover the true piecewise regression function nearly as well as if we knew the exact number and location of jumps. Our results constitute the first step towards understanding why Bayesian trees and their ensembles have worked so well in practice. As an aside, we discuss prior distributions on balanced interval partitions and how they relate to an old problem in geometric probability. 
Namely, we relate them to the probability of covering the circumference of a circle with random arcs whose endpoints are confined to a grid, a new variant of the original problem.", "full_text": "Bayesian Dyadic Trees and Histograms for Regression\n\nSt\u00e9phanie van der Pas\nMathematical Institute\nLeiden University\nLeiden, The Netherlands\nsvdpas@math.leidenuniv.nl\n\nVeronika Ro\u010dkov\u00e1\nBooth School of Business\nUniversity of Chicago\nChicago, IL, 60637\nVeronika.Rockova@ChicagoBooth.edu\n\nAbstract\n\nMany machine learning tools for regression are based on recursive partitioning of the covariate space into smaller regions, where the regression function can be estimated locally. Among these, regression trees and their ensembles have demonstrated impressive empirical performance. In this work, we shed light on the machinery behind Bayesian variants of these methods. In particular, we study Bayesian regression histograms, such as Bayesian dyadic trees, in the simple regression case with just one predictor. We focus on the reconstruction of regression surfaces that are piecewise constant, where the number of jumps is unknown. We show that with suitably designed priors, posterior distributions concentrate around the true step regression function at a near-minimax rate. These results do not require the knowledge of the true number of steps, nor the width of the true partitioning cells. Thus, Bayesian dyadic regression trees are fully adaptive and can recover the true piecewise regression function nearly as well as if we knew the exact number and location of jumps. Our results constitute the first step towards understanding why Bayesian trees and their ensembles have worked so well in practice. As an aside, we discuss prior distributions on balanced interval partitions and how they relate to an old problem in geometric probability. 
Namely, we relate them to the probability of covering the circumference of a circle with random arcs whose endpoints are confined to a grid, a new variant of the original problem.\n\n1 Introduction\n\nHistogram regression methods, such as regression trees [1] and their ensembles [2], have an impressive record of empirical success in many areas of application [3, 4, 5, 6, 7]. Tree-based machine learning (ML) methods build a piecewise constant reconstruction of the regression surface based on ideas of recursive partitioning. Perhaps the most popular partitioning schemes are the ones based on parallel-axis splits. One recent example is the Mondrian process [8], which was introduced to the ML community as a prior over tree data structures with interesting self-consistency properties. Many efficient algorithms exist that can be deployed to fit regression histograms underpinned by some partitioning scheme. Among these, Bayesian variants, such as Bayesian CART [9, 10] and BART [11], have appealed to many practitioners. There are several reasons why. Bayesian tree-based regression tools (a) can adapt to regression surfaces without any need for pruning, (b) are reluctant to overfit, and (c) provide an avenue for uncertainty statements via posterior distributions. While practical success stories abound [3, 4, 5, 6, 7], the theoretical understanding of Bayesian regression tree methods has been lacking. In this work, we study the quality of posterior distributions with regard to the three properties mentioned above. We provide first theoretical results that contribute to the understanding of Bayesian Gaussian regression methods based on recursive partitioning.\n\nOur performance metric will be the speed of posterior concentration/contraction around the true regression function. This is ultimately a frequentist assessment, describing the typical behavior of the posterior under the true generative model [12]. 
Posterior concentration rate results are now slowly entering the machine learning community as a tool for obtaining more insights into Bayesian methods [13, 14, 15, 16, 17]. Such results quantify not only the typical distance between a point estimator (posterior mean/median) and the truth, but also the typical spread of the posterior around the truth. Ideally, most of the posterior mass should be concentrated in a ball centered around the true value with a radius proportional to the minimax rate [12, 18]. Being inherently a performance measure of both location and spread, optimal posterior concentration provides a necessary certificate for further uncertainty quantification [19, 20, 21]. Beyond uncertainty assessment, theoretical guarantees that describe the average posterior shrinkage behavior have also been a valuable instrument for assessing the suitability of priors. As such, these results can often provide useful guidelines for the choice of tuning parameters, e.g. in the latent Dirichlet allocation model [14].\n\nDespite the rapid growth of this frequentist-Bayesian theory field, posterior concentration results for Bayesian regression histograms/trees/forests have, so far, been unavailable. Here, we adopt this theoretical framework to get new insights into why these methods work so well.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nRelated Work\n\nBayesian density estimation with step functions is a relatively well-studied problem [22, 23, 24]. The literature on Bayesian histogram regression is a bit less crowded. Perhaps the closest to our conceptual framework is the work by Coram and Lalley [25], who studied Bayesian non-parametric binary regression with uniform mixture priors on step functions. The authors focused on $L_1$ consistency. Here, we focus on posterior concentration rather than consistency. 
We are not aware of any other related theoretical study of Bayesian histogram methods for Gaussian regression.\n\nOur Contributions\n\nIn this work we focus on a canonical regression setting with merely one predictor. We study hierarchical priors on step functions and provide conditions under which the posteriors concentrate optimally around the true regression function. We consider the case when the true regression function itself is a step function, i.e. a tree or a tree ensemble, where the number and location of jumps are unknown.\n\nWe start with a very simple space of approximating step functions, supported on equally sized intervals where the number of splits is equipped with a prior. These partitions include dyadic regression trees. We show that for a suitable complexity prior, all relevant information about the true regression function (jump sizes and the number of jumps) is learned from the data automatically. During the course of the proof, we develop a notion of the complexity of a piecewise constant function relative to its approximating class.\n\nNext, we take a larger approximating space consisting of functions supported on balanced partitions that do not necessarily have to be of equal size. These correspond to more general trees with splits at observed values. With a uniform prior over all balanced partitions, we are able to achieve a nearly ideal performance (as if we knew the number and the location of jumps). As an aside, we describe the distribution of interval lengths obtained when the splits are sampled uniformly from a grid. We relate this distribution to the probability of covering the circumference of a circle with random arcs, a problem in geometric probability that dates back to [26, 27]. Our version of this problem assumes that the splits are chosen from a discrete grid rather than from a unit interval.\n\nNotation\n\nWith $\\propto$ and $\\lesssim$ we will denote equality and inequality up to a constant. 
The $\\varepsilon$-covering number of a set $\\Omega$ for a semimetric $d$, denoted by $N(\\varepsilon, \\Omega, d)$, is the minimal number of $d$-balls of radius $\\varepsilon$ needed to cover the set $\\Omega$. We denote by $\\phi(\\cdot)$ the standard normal density and by $P_f^n = \\bigotimes P_{f,i}$ the $n$-fold product measure of the $n$ independent observations under (1) with a regression function $f(\\cdot)$. By $P_n^x = \\frac{1}{n} \\sum_{i=1}^n \\delta_{x_i}$ we denote the empirical distribution of the observed covariates, by $\\|\\cdot\\|_n$ the norm on $L_2(P_n^x)$ and by $\\|\\cdot\\|_2$ the standard Euclidean norm.\n\n2 Bayesian Histogram Regression\n\nWe consider a classical nonparametric regression model, where response variables $Y^{(n)} = (Y_1, \\dots, Y_n)'$ are related to input variables $x^{(n)} = (x_1, \\dots, x_n)'$ through the function $f_0$ as follows\n\n$$Y_i = f_0(x_i) + \\varepsilon_i, \\quad \\varepsilon_i \\sim N(0, 1), \\quad i = 1, \\dots, n. \\qquad (1)$$\n\nWe assume that the covariate values $x_i$ are one-dimensional, fixed and have been rescaled so that $x_i \\in [0, 1]$. Partitioning-based regression methods are often invariant to monotone transformations of observations. In particular, when $f_0$ is a step function, standardizing the distance between the observations, and thereby the split points, has no effect on the nature of the estimation problem. Without loss of generality, we will therefore assume that the observations are aligned on an equispaced grid.\n\nAssumption 1. (Equispaced Grid) We assume that the scaled predictor values satisfy $x_i = i/n$ for each $i = 1, \\dots, n$.\n\nThis assumption implies that partitions that are balanced in terms of the Lebesgue measure will be balanced also in terms of the number of observations. 
A similar assumption was imposed by Donoho [28] in his study of Dyadic CART.\n\nThe underlying regression function $f_0 : [0, 1] \\to \\mathbb{R}$ is assumed to be a step function, i.e.\n\n$$f_0(x) = \\sum_{k=1}^{K_0} \\beta_k^0 I_{\\Omega_k^0}(x),$$\n\nwhere $\\{\\Omega_k^0\\}_{k=1}^{K_0}$ is a partition of $[0, 1]$ into $K_0$ non-overlapping intervals. We assume that $\\{\\Omega_k^0\\}_{k=1}^{K_0}$ is minimal, meaning that $f_0$ cannot be represented with a smaller partition (with fewer than $K_0$ pieces). Each partitioning cell $\\Omega_k^0$ is associated with a step size $\\beta_k^0$, determining the level of the function $f_0$ on $\\Omega_k^0$. The entire vector of $K_0$ step sizes will be denoted by $\\beta^0 = (\\beta_1^0, \\dots, \\beta_{K_0}^0)'$.\n\nOne might like to think of $f_0$ as a regression tree with $K_0$ bottom leaves. Indeed, every step function can be associated with an equivalence class of trees that live on the same partition but differ in their tree topology. The number of bottom leaves $K_0$ will be treated as unknown throughout this paper. Our goal will be designing a suitable class of priors on step functions so that the posterior concentrates tightly around $f_0$. Our analysis with a single predictor has served as a precursor to a full-blown analysis for high-dimensional regression trees [29].\n\nWe consider an approximating space of all step functions (with $K = 1, 2, \\dots$ bottom leaves)\n\n$$\\mathcal{F} = \\cup_{K=1}^{\\infty} \\mathcal{F}_K, \\qquad (2)$$\n\nwhich consists of smaller spaces (or shells) of all $K$-step functions\n\n$$\\mathcal{F}_K = \\left\\{ f_\\beta : [0, 1] \\to \\mathbb{R};\\ f_\\beta(x) = \\sum_{k=1}^{K} \\beta_k I_{\\Omega_k}(x) \\right\\},$$\n\neach indexed by a partition $\\{\\Omega_k\\}_{k=1}^K$ and a vector of $K$ step heights $\\beta$. The fundamental building block of our theoretical analysis will be the prior on $\\mathcal{F}$. This prior distribution has three main ingredients, described in detail below: (a) a prior on the number of steps $K$, (b) a prior on the partitions $\\{\\Omega_k\\}_{k=1}^K$ of size $K$, and (c) a prior on step sizes $\\beta = (\\beta_1, \\dots, \\beta_K)'$.\n\n2.1 Prior $\\pi_K(\\cdot)$ on the Number of Steps $K$\n\nTo avoid overfitting, we assign an exponentially decaying prior distribution that penalizes partitions with too many jumps.\n\nDefinition 2.1. (Prior on K) The prior on the number of partitioning cells $K$ satisfies\n\n$$\\pi_K(k) \\equiv \\Pi(K = k) \\propto \\exp(-c_K\\, k \\log k) \\quad \\text{for } k = 1, 2, \\dots. \\qquad (3)$$\n\nThis prior is no stranger to non-parametric problems. It was deployed for stepwise reconstructions of densities [24, 23] and regression surfaces [25]. When $c_K$ is large, this prior is concentrated on models with small complexity where overfitting should not occur. Decreasing $c_K$ leads to the smearing of the prior mass over partitions with more jumps. This is illustrated in Figure 1, which depicts the prior for various choices of $c_K$. We provide recommendations for the choice of $c_K$ in Section 3.1.\n\n2.2 Prior $\\pi_\\Omega(\\cdot \\mid K)$ on Interval Partitions $\\{\\Omega_k\\}_{k=1}^K$\n\nAfter selecting the number of steps $K$ from $\\pi_K(k)$, we assign a prior over interval partitions $\\pi_\\Omega(\\cdot \\mid K)$. We will consider two important special cases.\n\nFigure 1: (Left) Prior on the tree size for several values of $c_K$, (Right) Best approximations of $f_0$ (in the $\\ell_2$ sense) by step functions supported on equispaced blocks of size $K \\in \\{2, 5, 10\\}$.\n\n2.2.1 Equivalent Blocks\n\nPerhaps the simplest partition is based on statistically equivalent blocks [30], where all the cells are required to have the same number of points. This is also known as the $K$-spacing rule that partitions the unit interval using order statistics of the observations.\n\nDefinition 2.2. (Equivalent Blocks) Let $x_{(i)}$ denote the $i$th order statistic of $x = (x_1, \\dots, x_n)'$, where $x_{(n)} \\equiv 1$ and $n = Kc$ for some $c \\in \\mathbb{N} \\setminus \\{0\\}$. 
Denote by $x_{(0)} \\equiv 0$. A partition $\\{\\Omega_k\\}_{k=1}^K$ consists of $K$ equivalent blocks, when $\\Omega_k = (x_{(j_k)}, x_{(j_{k+1})}]$, where $j_k = (k-1)c$.\n\nA variant of this definition can be obtained in terms of interval lengths rather than numbers of observations.\n\nDefinition 2.3. (Equispaced Blocks) A partition $\\{\\Omega_k\\}_{k=1}^K$ consists of $K$ equispaced blocks $\\Omega_k$, when\n\n$$\\Omega_k = \\left(\\frac{k-1}{K}, \\frac{k}{K}\\right] \\quad \\text{for } k = 1, \\dots, K.$$\n\nWhen $K = 2^s$ for some $s \\in \\mathbb{N} \\setminus \\{0\\}$, the equispaced partition corresponds to a complete binary tree with splits at dyadic rationals. If the observations $x_i$ lie on a regular grid (Assumption 1), then Definitions 2.2 and 2.3 are essentially equivalent. We will therefore focus on equivalent blocks (EB) and denote such a partition (for a given $K > 0$) with $\\Omega_K^{EB}$. Because there is only one such partition for each $K$, the prior $\\pi_\\Omega(\\cdot \\mid K)$ has a single point mass at $\\Omega_K^{EB}$. With $\\Omega^{EB} = \\cup_{K=1}^{\\infty} \\Omega_K^{EB}$ we denote the set of all EB partitions for $K = 1, 2, \\dots$. We will use these partitioning schemes as a jump-off point.\n\n2.2.2 Balanced Intervals\n\nEquivalent (equispaced) blocks are deterministic and, as such, do not provide much room for learning about the actual location of jumps in $f_0$. Balanced intervals, introduced below, are a richer class of partitions that tolerate a bit more imbalance. First, we introduce the notion of cell counts $\\mu(\\Omega_k)$. For each interval $\\Omega_k$, we write\n\n$$\\mu(\\Omega_k) = \\frac{1}{n} \\sum_{i=1}^n I(x_i \\in \\Omega_k), \\qquad (4)$$\n\nthe proportion of observations falling inside $\\Omega_k$. Note that for equivalent blocks, we can write $\\mu(\\Omega_1) = \\dots = \\mu(\\Omega_K) = c/n = 1/K$.\n\nDefinition 2.4. (Balanced Intervals) A partition $\\{\\Omega_k\\}_{k=1}^K$ is balanced if\n\n$$\\frac{C_{\\min}^2}{K} \\le \\mu(\\Omega_k) \\le \\frac{C_{\\max}^2}{K} \\quad \\text{for all } k = 1, \\dots, K \\qquad (5)$$\n\nfor some universal constants $C_{\\min} \\le 1 \\le C_{\\max}$ not depending on $K$.\n\n(a) $K = 2$\n\n(b) $K = 3$\n\nFigure 2: Two sets $E_K$ of possible stick lengths that satisfy the minimal cell-size condition $|\\Omega_k| \\ge C$ with $n = 10$, $C = 2/n$ and $K = 2, 3$.\n\nThe following variant of the balancing condition uses interval widths rather than cell counts: $\\widetilde{C}_{\\min}^2/K \\le |\\Omega_k| \\le \\widetilde{C}_{\\max}^2/K$. Again, under Assumption 1, these two definitions are equivalent. In the sequel, we will denote by $\\Omega_K^{BI}$ the set of all balanced partitions consisting of $K$ intervals and by $\\Omega^{BI} = \\cup_{K=1}^{\\infty} \\Omega_K^{BI}$ the set of all balanced intervals of sizes $K = 1, 2, \\dots$. It is worth pointing out that the balance assumption on the interval partitions can be relaxed, at the expense of a log factor in the concentration rate [29].\n\nWith balanced partitions, the $K$th shell $\\mathcal{F}_K$ of the approximating space $\\mathcal{F}$ in (2) consists of all step functions that are supported on partitions $\\Omega_K^{BI}$ and have $K - 1$ points of discontinuity $u_k \\in I_n \\equiv \\{x_i : i = 1, \\dots, n - 1\\}$ for $k = 1, \\dots, K - 1$. For equispaced blocks in Definition 2.3, we assumed that the points of subdivision were deterministic, i.e. $u_k = k/K$. For balanced partitions, we assume that $u_k$ are random and chosen amongst the observed values $x_i$. The order statistics of the vector of splits $u = (u_1, \\dots, u_{K-1})'$ uniquely define a segmentation of $[0, 1]$ into $K$ intervals $\\Omega_k = (u_{(k-1)}, u_{(k)}]$, where $u_{(k)}$ designates the $k$th smallest value in $u$ and $u_{(0)} \\equiv 0$, $u_{(K)} = x_{(n)} \\equiv 1$.\n\nOur prior over balanced intervals $\\pi_\\Omega(\\cdot \\mid K)$ will be defined implicitly through a uniform prior over the split vectors $u$. 
Namely, the prior over balanced partitions $\\Omega_K^{BI}$ satisfies\n\n$$\\pi_\\Omega(\\{\\Omega_k\\}_{k=1}^K \\mid K) = \\frac{1}{\\mathrm{card}(\\Omega_K^{BI})} I\\left(\\{\\Omega_k\\}_{k=1}^K \\in \\Omega_K^{BI}\\right). \\qquad (6)$$\n\nIn the following Lemma, we obtain upper bounds on $\\mathrm{card}(\\Omega_K^{BI})$ and discuss how they relate to an old problem in geometric probability. In the sequel, we denote with $|\\Omega_k|$ the lengths of the segments defined through the split points $u$.\n\nLemma 2.1. Assume that $u = (u_1, \\dots, u_{K-1})'$ is a vector of independent random variables obtained by uniform sampling (without replacement) from $I_n$. Then under Assumption 1, we have for $1/n < C < 1/K$\n\n$$\\Pi\\left(\\min_{1 \\le k \\le K} |\\Omega_k| \\ge C\\right) = \\frac{\\binom{\\lfloor n(1 - KC) \\rfloor + K - 1}{K - 1}}{\\binom{n - 1}{K - 1}} \\qquad (7)$$\n\nand\n\n$$\\Pi\\left(\\max_{1 \\le k \\le K} |\\Omega_k| \\le C\\right) = 1 - \\sum_{k=1}^{\\widetilde{n}} (-1)^k \\binom{n - 1}{k} \\frac{\\binom{\\lfloor n(1 - kC) \\rfloor + K - 1}{K - 1}}{\\binom{n - 1}{K - 1}}, \\qquad (8)$$\n\nwhere $\\widetilde{n} = \\min\\{n - 1, \\lfloor 1/C \\rfloor\\}$.\n\nProof. The denominator of (7) follows from the fact that there are $n - 1$ possible splits for the $K - 1$ points of discontinuity $u_k$. The numerator is obtained after adapting the proof of Lemma 2 of Flatto and Konheim [31]. Without loss of generality, we will assume that $C = a/n$ for some $a = 1, \\dots, \\lfloor n/K \\rfloor$ so that $n(1 - KC)$ is an integer. Because the jumps $u_k$ can only occur on the grid $I_n$, we have $|\\Omega_k| = j/n$ for some $j = 1, \\dots, n - 1$. It follows from Lemma 1 of Flatto and Konheim [31] that the set $E_K = \\{|\\Omega_k| : \\sum_{k=1}^K |\\Omega_k| = 1 \\text{ and } |\\Omega_k| \\ge C \\text{ for } k = 1, \\dots, K\\}$ lies in the interior of a convex hull of $K$ points $v_r = (1 - KC)e_r + C\\sum_{k=1}^K e_k$ for $r = 1, \\dots, K$, where $e_r = (e_{r1}, \\dots, e_{rK})'$ are unit base vectors, i.e. $e_{rj} = I(r = j)$. Two examples of the set $E_K$ (for $K = 2$ and $K = 3$) are depicted in Figure 2. In both figures, $n = 10$ (i.e. 9 candidate split points) and $a = 2$. With $K = 2$ (Figure 2(a)), there are only $7 = \\binom{n(1 - KC) + K - 1}{K - 1}$ pairs of interval lengths $(|\\Omega_1|, |\\Omega_2|)'$ that satisfy the minimal cell condition. These points lie on a grid between the two vertices $v_1 = (1 - C, C)$ and $v_2 = (C, 1 - C)$. With $K = 3$, the convex hull of points $v_1 = (1 - 2C, C, C)'$, $v_2 = (C, 1 - 2C, C)'$ and $v_3 = (C, C, 1 - 2C)'$ corresponds to a diagonal dissection of a cube of a side length $(1 - 3C)$ (Figure 2(b), again with $a = 2$ and $n = 10$). The number of lattice points in the interior (and on the boundary) of such a tetrahedron corresponds to an arithmetic sum $\\frac{1}{2}(n - 3a + 2)(n - 3a + 1) = \\binom{n - 3a + 2}{2}$. So far, we showed (7) for $K = 2$ and $K = 3$. To complete the induction argument, suppose that the formula holds for some arbitrary $K > 0$. Then the size of the lattice inside (and on the boundary) of a $(K + 1)$-tetrahedron of a side length $[1 - (K + 1)C]$ can be obtained by summing lattice sizes inside $K$-tetrahedrons of increasing side lengths $0, \\sqrt{2}/n, 2\\sqrt{2}/n, \\dots, [1 - (K + 1)C]\\sqrt{2}/n$, i.e.\n\n$$\\sum_{j=K-1}^{n[1 - (K+1)C] + K - 1} \\binom{j}{K - 1} = \\binom{n[1 - (K + 1)C] + K}{K},$$\n\nwhere we used the fact $\\sum_{j=K}^{N} \\binom{j}{K} = \\binom{N + 1}{K + 1}$. The second statement (8) is obtained by writing the event as a complement of the union of events and applying the method of inclusion-exclusion.\n\nRemark 2.1. Flatto and Konheim [31] showed that the probability of covering a circle with random arcs of length $C$ is equal to the probability that all segments of the unit interval, obtained with iid random uniform splits, are smaller than $C$. Similarly, the probability (8) could be related to the probability of covering the circle with random arcs whose endpoints are chosen from a grid of $n - 1$ equidistant points on the circumference.\n\nThere are $\\binom{n - 1}{K - 1}$ partitions of size $K$, of which $\\binom{\\lfloor n(1 - \\widetilde{C}_{\\min}^2) \\rfloor + K - 1}{K - 1}$ satisfy the minimal cell width balancing condition (where $\\widetilde{C}_{\\min}^2 > K/n$). This number gives an upper bound on the combinatorial complexity of balanced partitions $\\mathrm{card}(\\Omega_K^{BI})$.\n\n2.3 Prior $\\pi(\\beta \\mid K)$ on Step Heights $\\beta$\n\nTo complete the prior on $\\mathcal{F}_K$, we take independent normal priors on each of the coefficients. Namely\n\n$$\\pi(\\beta \\mid K) = \\prod_{k=1}^K \\phi(\\beta_k), \\qquad (9)$$\n\nwhere $\\phi(\\cdot)$ is the standard normal density.\n\n3 Main Results\n\nA crucial ingredient of our proof will be understanding how well one can approximate $f_0$ with other step functions (supported on partitions $\\Omega$, which are either equivalent blocks $\\Omega^{EB}$ or balanced partitions $\\Omega^{BI}$). We will describe the approximation error in terms of the overlap between the true partition $\\{\\Omega_k^0\\}_{k=1}^{K_0}$ and the approximating partitions $\\{\\Omega_k\\}_{k=1}^K \\in \\Omega$. More formally, we define the restricted cell count (according to Nobel [32]) as\n\n$$m\\left(V; \\{\\Omega_k^0\\}_{k=1}^{K_0}\\right) = \\left|\\{\\Omega_k^0 : \\Omega_k^0 \\cap V \\neq \\emptyset\\}\\right|,$$\n\nthe number of cells in $\\{\\Omega_k^0\\}_{k=1}^{K_0}$ that overlap with an interval $V \\subset [0, 1]$. Next, we define the complexity of $f_0$ as the smallest size of a partition in $\\Omega$ needed to completely cover $f_0$ without any overlap.\n\nDefinition 3.1. (Complexity of $f_0$ w.r.t. $\\Omega$) We define $K(f_0, \\Omega)$ as the smallest $K$ such that there exists a $K$-partition $\\{\\Omega_k\\}_{k=1}^K$ in the class of partitions $\\Omega$ for which\n\n$$m\\left(\\Omega_k; \\{\\Omega_k^0\\}_{k=1}^{K_0}\\right) = 1 \\quad \\text{for all } k = 1, \\dots, K.$$\n\nThe number $K(f_0, \\Omega)$ will be referred to as the complexity of $f_0$ w.r.t. $\\Omega$.\n\nThe complexity number $K(f_0, \\Omega)$ indicates the optimal number of steps needed to approximate $f_0$ with a step function (supported on partitions in $\\Omega$) without any error. It depends on the true number of jumps $K_0$ as well as the true interval lengths $|\\Omega_k^0|$. If the minimal partition $\\{\\Omega_k^0\\}_{k=1}^{K_0}$ resided in the approximating class, i.e. $\\{\\Omega_k^0\\}_{k=1}^{K_0} \\in \\Omega$, then we would obtain $K(f_0, \\Omega) = K_0$, the true number of steps. On the other hand, when $\\{\\Omega_k^0\\}_{k=1}^{K_0} \\notin \\Omega$, the complexity number $K(f_0, \\Omega)$ can be much larger. This is illustrated in Figure 1 (right), where the true partition $\\{\\Omega_k^0\\}_{k=1}^{K_0}$ consists of $K_0 = 4$ unequal pieces and we approximate it with equispaced blocks with $K = 2, 5, 10$ steps. Because the intervals $\\Omega_k^0$ are not equal and the smallest one has a length $1/10$, we need $K(f_0, \\Omega^{EB}) = 10$ equispaced blocks to perfectly approximate $f_0$. For our analysis, we do not need to assume that $\\{\\Omega_k^0\\}_{k=1}^{K_0} \\in \\Omega$ (i.e. $f_0$ does not need to be inside the approximating class) or that $K(f_0, \\Omega)$ is finite. 
The complexity number can increase with $n$, where sharper performance is obtained when $f_0$ can be approximated error-free with some $f \\in \\Omega$, where $f$ has a small number of discontinuities relative to $n$.\n\nAnother way to view $K(f_0, \\Omega)$ is as the ideal partition size on which the posterior should concentrate. If this number were known, we could achieve a near-minimax posterior concentration rate $n^{-1/2}\\sqrt{K(f_0, \\Omega) \\log[n/K(f_0, \\Omega)]}$ (Remark 3.3). The actual minimax rate for estimating a piece-wise constant $f_0$ (consisting of $K_0 > 2$ pieces) is $n^{-1/2}\\sqrt{K_0 \\log(n/K_0)}$ [33]. In our main results, we will target the nearly optimal rate expressed in terms of $K(f_0, \\Omega)$.\n\n3.1 Posterior Concentration for Equivalent Blocks\n\nOur first result shows that the minimax rate is nearly achieved, without any assumptions on the number of pieces of $f_0$ or the sizes of the pieces.\n\nTheorem 3.1. (Equivalent Blocks) Let $f_0 : [0, 1] \\to \\mathbb{R}$ be a step function with $K_0$ steps, where $K_0$ is unknown. Denote by $\\mathcal{F}$ the set of all step functions supported on equivalent blocks, equipped with priors $\\pi_K(\\cdot)$ and $\\pi(\\beta \\mid K)$ as in (3) and (9). Denote with $K_{f_0} \\equiv K(f_0, \\Omega^{EB})$ and assume $\\|\\beta^0\\|_\\infty^2 \\lesssim \\log n$ and $K_{f_0} \\lesssim \\sqrt{n}$. Then, under Assumption 1, we have\n\n$$\\Pi\\left(f \\in \\mathcal{F} : \\|f - f_0\\|_n \\ge M_n n^{-1/2} \\sqrt{K_{f_0} \\log(n/K_{f_0})} \\mid Y^{(n)}\\right) \\to 0 \\qquad (10)$$\n\nin $P_{f_0}^n$-probability, for every $M_n \\to \\infty$ as $n \\to \\infty$.\n\nBefore we proceed with the proof, a few remarks ought to be made. 
First, it is worthwhile to emphasize that the statement in Theorem 3.1 is a frequentist one as it relates to an aggregated behavior of the posterior distributions obtained under the true generative model $P_{f_0}^n$.\n\nSecond, the theorem shows that the Bayesian procedure performs an automatic adaptation to $K(f_0, \\Omega^{EB})$. The posterior will concentrate on EB partitions that are fine enough to approximate $f_0$ well. Thus, we are able to recover the true function as well as if we knew $K(f_0, \\Omega^{EB})$.\n\nThird, it is worth mentioning that, under Assumption 1, Theorem 3.1 holds for equivalent as well as equispaced blocks. In this vein, it describes the speed of posterior concentration for dyadic regression trees. Indeed, as mentioned previously, with $K = 2^s$ for some $s \\in \\mathbb{N} \\setminus \\{0\\}$, the equispaced partition corresponds to a full binary tree with splits at dyadic rationals.\n\nAnother interesting insight is that the Gaussian prior (9), while selected for mathematical convenience, turns out to be sufficient for optimal recovery. In other words, despite the relatively large amount of mass near zero, the Gaussian prior does not rule out optimal posterior concentration. Our standard normal prior is a simpler version of the Bayesian CART prior, which determines the variance from the data [9].\n\nLet $K_{f_0} \\equiv K(f_0, \\Omega^{EB})$ be as in Definition 3.1. Theorem 3.1 is proved by verifying the three conditions of Theorem 4 of [18], for $\\varepsilon_n = n^{-1/2}\\sqrt{K_{f_0} \\log(n/K_{f_0})}$ and $\\mathcal{F}_n = \\bigcup_{K=0}^{k_n} \\mathcal{F}_K$, with $k_n$ of the order $K_{f_0} \\log(n/K_{f_0})$. The approximating subspace $\\mathcal{F}_n \\subset \\mathcal{F}$ should be rich enough to approximate $f_0$ well and it should receive most of the prior mass. 
The conditions for posterior contraction at the rate $\\varepsilon_n$ are:\n\n(C1) $\\sup_{\\varepsilon > \\varepsilon_n} \\log N\\left(\\frac{\\varepsilon}{36}, \\{f \\in \\mathcal{F}_n : \\|f - f_0\\|_n < \\varepsilon\\}, \\|\\cdot\\|_n\\right) \\le n\\varepsilon_n^2$,\n\n(C2) $\\frac{\\Pi(\\mathcal{F} \\setminus \\mathcal{F}_n)}{\\Pi(f \\in \\mathcal{F} : \\|f - f_0\\|_n^2 \\le \\varepsilon_n^2)} = o(e^{-2n\\varepsilon_n^2})$,\n\n(C3) $\\frac{\\Pi(f \\in \\mathcal{F}_n : j\\varepsilon_n < \\|f - f_0\\|_n \\le 2j\\varepsilon_n)}{\\Pi(f \\in \\mathcal{F} : \\|f - f_0\\|_n^2 \\le \\varepsilon_n^2)} \\le e^{\\frac{j^2}{4} n\\varepsilon_n^2}$ for all sufficiently large $j$.\n\nThe entropy condition (C1) restricts attention to EB partitions with small $K$. As will be seen from the proof, the largest allowed partitions have at most (a constant multiple of) $K_{f_0} \\log(n/K_{f_0})$ pieces. Condition (C2) requires that the prior does not promote partitions with more than $K_{f_0} \\log(n/K_{f_0})$ pieces. This property is guaranteed by the exponentially decaying prior $\\pi_K(\\cdot)$, which penalizes large partitions. The final condition, (C3), requires that the prior charges a $\\|\\cdot\\|_n$ neighborhood of the true function. In our proof, we verify this condition by showing that the prior mass on step functions of the optimal size $K_{f_0}$ is sufficiently large.\n\nProof. We verify the three conditions (C1), (C2) and (C3).\n\n(C1) Let $\\varepsilon > \\varepsilon_n$ and $K \\in \\mathbb{N}$. For $f_\\alpha, f_\\beta \\in \\mathcal{F}_K$, we have $K^{-1}\\|\\alpha - \\beta\\|_2^2 = \\|f_\\alpha - f_\\beta\\|_n^2$ because $\\mu(\\Omega_k) = 1/K$ for each $k$. We now argue as in the proof of Theorem 12 of [18] to show that $N\\left(\\frac{\\varepsilon}{36}, \\{f \\in \\mathcal{F}_K : \\|f - f_0\\|_n < \\varepsilon\\}, \\|\\cdot\\|_n\\right)$ can be bounded by the number of $\\sqrt{K}\\varepsilon/36$-balls required to cover a $\\sqrt{K}\\varepsilon$-ball in $\\mathbb{R}^K$. This number is bounded above by $108^K$. Summing over $K$, we recognize a geometric series. Taking the logarithm of the result, we find that (C1) is satisfied if $\\log(108)(k_n + 1) \\le n\\varepsilon_n^2$.\n\n(C2) We bound the denominator by:\n\n$$\\Pi(f \\in \\mathcal{F} : \\|f - f_0\\|_n^2 \\le \\varepsilon^2) \\ge \\pi_K(K_{f_0})\\, \\Pi\\left(\\beta \\in \\mathbb{R}^{K_{f_0}} : \\|\\beta - \\beta_0^{ext}\\|_2^2 \\le \\varepsilon^2 K_{f_0}\\right),$$\n\nwhere $\\beta_0^{ext} \\in \\mathbb{R}^{K_{f_0}}$ is an extended version of $\\beta^0 \\in \\mathbb{R}^{K_0}$, containing the coefficients for $f_0$ expressed as a step function on the partition $\\{\\Omega_k^0\\}_{k=1}^{K_{f_0}}$. This can be bounded from below by\n\n$$\\frac{\\pi_K(K_{f_0})}{e^{\\|\\beta_0^{ext}\\|_2^2/2}}\\, \\Pi\\left(\\beta \\in \\mathbb{R}^{K_{f_0}} : \\|\\beta\\|_2^2 \\le \\varepsilon^2 K_{f_0}/2\\right) > \\frac{\\pi_K(K_{f_0})}{e^{\\|\\beta_0^{ext}\\|_2^2/2}} \\int_0^{\\varepsilon^2 K_{f_0}/2} \\frac{x^{K_{f_0}/2 - 1} e^{-x/2}}{2^{K_{f_0}/2}\\, \\Gamma(K_{f_0}/2)}\\, dx.$$\n\nWe bound this from below by bounding the exponential at the upper integration limit, yielding:\n\n$$\\frac{\\pi_K(K_{f_0})}{e^{\\|\\beta_0^{ext}\\|_2^2/2}}\\, \\frac{e^{-\\varepsilon^2 K_{f_0}/4}}{2^{K_{f_0}}\\, \\Gamma(K_{f_0}/2 + 1)}\\, \\varepsilon^{K_{f_0}} K_{f_0}^{K_{f_0}/2}. \\qquad (11)$$\n\nFor $\\varepsilon = \\varepsilon_n \\to 0$, we thus find that the denominator in (C2) can be lower bounded with $e^{K_{f_0} \\log \\varepsilon_n - c_K K_{f_0} \\log K_{f_0} - \\|\\beta_0^{ext}\\|_2^2/2 - K_{f_0}/2\\,[\\log 2 + \\varepsilon_n^2/2]}$. We bound the numerator:\n\n$$\\Pi(\\mathcal{F} \\setminus \\mathcal{F}_n) = \\Pi\\left(\\bigcup_{k=k_n+1}^{\\infty} \\mathcal{F}_k\\right) \\propto \\sum_{k=k_n+1}^{\\infty} e^{-c_K k \\log k} \\le e^{-c_K (k_n+1) \\log(k_n+1)} + \\int_{k_n+1}^{\\infty} e^{-c_K x \\log x}\\, dx,$$\n\nwhich is of order $e^{-c_K (k_n+1) \\log(k_n+1)}$. 
Combining this bound with (11), we find that (C2) is met if:\n\n$$e^{-K_{f_0}\\log\\varepsilon_n + (c_K+1)K_{f_0}\\log K_{f_0} + K_{f_0}\\|\\beta_0\\|_\\infty^2 - c_K(k_n+1)\\log(k_n+1) + 2n\\varepsilon_n^2} \\to 0 \\quad \\text{as } n \\to \\infty.$$\n\n(C3) We bound the numerator by one, and use the bound (11) for the denominator. As $\\varepsilon_n \\to 0$, we\nobtain the condition $-K_{f_0}\\log\\varepsilon_n + (c_K+1)K_{f_0}\\log K_{f_0} + K_{f_0}\\|\\beta_0\\|_\\infty^2 \\le \\frac{j^2}{4} n\\varepsilon_n^2$ for all sufficiently\nlarge $j$.\n\nConclusion With $\\varepsilon_n = n^{-1/2}\\sqrt{K_{f_0}\\log(n/K_{f_0})}$, letting $k_n \\propto n\\varepsilon_n^2 = K_{f_0}\\log(n/K_{f_0})$, the\ncondition (C1) is met. With this choice of $k_n$, the condition (C2) holds as well as long as $\\|\\beta_0\\|_\\infty^2 \\lesssim \\log n$\nand $K_{f_0} \\lesssim \\sqrt{n}$. Finally, the condition (C3) is met for $K_{f_0} \\lesssim \\sqrt{n}$.\nRemark 3.1. It is worth pointing out that the proof will hold for a larger class of priors on $K$,\nas long as the prior shrinks at least exponentially fast (meaning that it is bounded from above by\n$ae^{-bK}$ for constants $a, b > 0$). However, a prior at this exponential limit will require tuning, because\nthe optimal $a$ and $b$ will depend on $K(f_0, \\Omega^{EB})$. We recommend using the prior (2.1) that prunes\nsomewhat more aggressively, because it does not require tuning by the user. Indeed, Theorem 3.1\nholds regardless of the choice of $c_K > 0$. We conjecture, however, that values $c_K \\ge 1/K(f_0, \\Omega^{EB})$\nlead to a faster concentration speed and we suggest $c_K = 1$ as a default option.\nRemark 3.2. When $K_{f_0}$ is known, there is no need for assigning a prior $\\pi_K(\\cdot)$ and the conditions\n(C1) and (C3) are verified similarly as before, fixing the number of steps at $K_{f_0}$.\n\n3.2 Posterior Concentration for Balanced Intervals\n\nAn analogue of Theorem 3.1 can be obtained for balanced partitions from Section 2.2.2 that correspond\nto regression trees with splits at actual observations. 
Now, we assume that $f_0$ is $\\Omega^{BI}$-valid and carry\nout the proof with $K(f_0, \\Omega^{BI})$ instead of $K(f_0, \\Omega^{EB})$. The posterior concentration rate is only\nslightly worse.\nTheorem 3.2. (Balanced Intervals) Let $f_0 : [0, 1] \\to \\mathbb{R}$ be a step function with $K_0$ steps, where $K_0$\nis unknown. Denote by $\\mathcal{F}$ the set of all step functions supported on balanced intervals equipped with\npriors $\\pi_K(\\cdot)$, $\\pi_\\Omega(\\cdot \\mid K)$ and $\\pi(\\beta \\mid K)$ as in (3), (6) and (9). Let $K_{f_0} \\equiv K(f_0, \\Omega^{BI})$ and\nassume $\\|\\beta_0\\|_\\infty^2 \\lesssim \\log^{2\\beta} n$ and $K(f_0, \\Omega^{BI}) \\lesssim \\sqrt{n}$. Then, under Assumption 1, we have\n\n$$\\Pi\\left(f \\in \\mathcal{F} : \\|f - f_0\\|_n \\ge M_n n^{-1/2}\\sqrt{K_{f_0}\\log^{2\\beta}(n/K_{f_0})} \\,\\Big|\\, Y^{(n)}\\right) \\to 0 \\tag{12}$$\n\nin $P_{f_0}^n$-probability, for every $M_n \\to \\infty$ as $n \\to \\infty$, where $\\beta > 1/2$.\nProof. All three conditions (C1), (C2) and (C3) hold if we choose $k_n \\propto K_{f_0}[\\log(n/K_{f_0})]^{2\\beta - 1}$. The\nentropy condition will be satisfied when $\\log\\left(\\sum_{k=1}^{k_n} C^k\\, \\mathrm{card}(\\Omega_k^{BI})\\right) \\lesssim n\\varepsilon_n^2$ for some $C > 0$, where\n$\\varepsilon_n = n^{-1/2}\\sqrt{K_{f_0}\\log^{2\\beta}(n/K_{f_0})}$. Using the upper bound $\\mathrm{card}(\\Omega_k^{BI}) < \\binom{n-1}{k-1} < \\binom{n-1}{k_n-1}$ (because\n$k_n < \\frac{n-1}{2}$ for large enough $n$), the condition (C1) is verified. 
Using the fact that $\\mathrm{card}(\\Omega_{K_{f_0}}) \\lesssim K_{f_0}\\log(n/K_{f_0})$, the condition (C2) will be satisfied when, for some $D > 0$, we have\n\n$$e^{-K_{f_0}\\log\\varepsilon_n + (c_K+1)K_{f_0}\\log K_{f_0} + D K_{f_0}\\log(n/K_{f_0}) + K_{f_0}\\|\\beta_0\\|_\\infty^2 - c_K(k_n+1)\\log(k_n+1) + 2n\\varepsilon_n^2} \\to 0. \\tag{13}$$\n\nThis holds for our choice of $k_n$ under the assumption $\\|\\beta_0\\|_\\infty^2 \\lesssim \\log^{2\\beta} n$ and $K_{f_0} \\lesssim \\sqrt{n}$. These\nchoices also yield (C3).\nRemark 3.3. When $K_{f_0} \\gtrsim \\sqrt{n}$, Theorem 3.1 and Theorem 3.2 still hold, only with the slightly\nslower concentration rate $n^{-1/2}\\sqrt{K_{f_0}\\log n}$.\n\n4 Discussion\n\nWe provided the first posterior concentration rate results for Bayesian non-parametric regression with\nstep functions. We showed that under suitable complexity priors, the Bayesian procedure adapts to\nthe unknown aspects of the target step function. Our approach can be extended in three ways: (a)\nto smooth functions $f_0$, (b) to dimension reduction with high-dimensional predictors, (c) to more\ngeneral partitioning schemes that correspond to methods like Bayesian CART and BART. These three\nextensions are developed in our followup manuscript [29].\n\n5 Acknowledgment\n\nThis work was supported by the James S. Kemper Foundation Faculty Research Fund at the University\nof Chicago Booth School of Business.\n\nReferences\n[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees.\nStatistics/Probability Series. Wadsworth Publishing Company, Belmont, California, U.S.A., 1984.\n\n[2] L. Breiman. Random forests. Mach. Learn., 45:5\u201332, 2001.\n\n[3] A. Berchuck, E. S. Iversen, J. M. Lancaster, J. Pittman, J. Luo, P. Lee, S. Murphy, H. K. Dressman, P. G.\nFebbo, M. West, J. R. Nevins, and J. R. Marks. Patterns of gene expression that characterize long-term\nsurvival in advanced stage serous ovarian cancers. Clin. Cancer Res., 11(10):3686\u20133696, 2005.\n\n[4] S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair. A comparison of machine learning techniques for phishing\ndetection. In Proceedings of the Anti-phishing Working Groups 2nd Annual eCrime Researchers Summit,\neCrime \u201907, pages 60\u201369, New York, NY, USA, 2007. ACM.\n\n[5] M. A. Razi and K. Athappilly. 
A comparative predictive analysis of neural networks (NNs), nonlinear\nregression and classification and regression tree (CART) models. Expert Syst. Appl., 29(1):65\u201374, 2005.\n\n[6] D. P. Green and J. L. Kern. Modeling heterogeneous treatment effects in survey experiments with Bayesian\nAdditive Regression Trees. Public Opin. Q., 76(3):491, 2012.\n\n[7] E. C. Polly and M. J. van der Laan. Super learner in prediction. Available at:\nhttp://works.bepress.com/mark_van_der_laan/200/, 2010.\n\n[8] D. M. Roy and Y. W. Teh. The Mondrian process. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou,\neditors, Advances in Neural Information Processing Systems 21, pages 1377\u20131384. Curran Associates,\nInc., 2009.\n\n[9] H. A. Chipman, E. I. George, and R. E. McCulloch. Bayesian CART model search. JASA, 93(443):935\u2013948,\n1998.\n\n[10] D. Denison, B. Mallick, and A. Smith. A Bayesian CART algorithm. Biometrika, 95(2):363\u2013377, 1998.\n\n[11] H. A. Chipman, E. I. George, and R. E. McCulloch. BART: Bayesian Additive Regression Trees. Ann.\nAppl. Stat., 4(1):266\u2013298, 03 2010.\n\n[12] S. Ghosal, J. K. Ghosh, and A. W. van der Vaart. Convergence rates of posterior distributions. Ann. Statist.,\n28(2):500\u2013531, 04 2000.\n\n[13] T. Zhang. Learning bounds for a generalized family of Bayesian posterior distributions. In S. Thrun,\nL. K. Saul, and P. B. Sch\u00f6lkopf, editors, Advances in Neural Information Processing Systems 16, pages\n1149\u20131156. MIT Press, 2004.\n\n[14] J. Tang, Z. Meng, X. Nguyen, Q. Mei, and M. Zhang. Understanding the limiting factors of topic\nmodeling via posterior contraction analysis. In T. Jebara and E. P. Xing, editors, Proceedings of the\n31st International Conference on Machine Learning (ICML-14), pages 190\u2013198. JMLR Workshop and\nConference Proceedings, 2014.\n\n[15] N. Korda, E. Kaufmann, and R. Munos. Thompson sampling for 1-dimensional exponential family bandits.\nIn C. J. C. Burges, L. 
Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in\nNeural Information Processing Systems 26, pages 1448\u20131456. Curran Associates, Inc., 2013.\n\n[16] F.-X. Briol, C. Oates, M. Girolami, and M. A. Osborne. Frank-Wolfe Bayesian quadrature: Probabilistic\nintegration with theoretical guarantees. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and\nR. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1162\u20131170. Curran\nAssociates, Inc., 2015.\n\n[17] M. Chen, C. Gao, and H. Zhao. Posterior contraction rates of the phylogenetic indian buffet processes.\nBayesian Anal., 11(2):477\u2013497, 06 2016.\n\n[18] S. Ghosal and A. van der Vaart. Convergence rates of posterior distributions for noniid observations. Ann.\nStatist., 35(1):192\u2013223, 02 2007.\n\n[19] B. Szab\u00f3, A. W. van der Vaart, and J. H. van Zanten. Frequentist coverage of adaptive nonparametric\nBayesian credible sets. Ann. Statist., 43(4):1391\u20131428, 08 2015.\n\n[20] I. Castillo and R. Nickl. On the Bernstein von Mises phenomenon for nonparametric Bayes procedures.\nAnn. Statist., 42(5):1941\u20131969, 2014.\n\n[21] J. Rousseau and B. Szabo. Asymptotic frequentist coverage properties of Bayesian credible sets for sieve\npriors in general settings. ArXiv e-prints, September 2016.\n\n[22] I. Castillo. Polya tree posterior distributions on densities. Preprint available at\nhttp://www.lpma-paris.fr/pageperso/castillo/polya.pdf, 2016.\n\n[23] L. Liu and W. H. Wong. Multivariate density estimation via adaptive partitioning (ii): posterior\nconcentration. arXiv:1508.04812v1, 2015.\n\n[24] C. Scricciolo. On rates of convergence for Bayesian density estimation. Scand. J. Stat., 34(3):626\u2013642,\n2007.\n\n[25] M. Coram and S. Lalley. Consistency of Bayes estimators of a binary regression function. Ann. Statist.,\n34(3):1233\u20131269, 2006.\n\n[26] L. Shepp. Covering the circle with random arcs. 
Israel J. Math., 34(11):328\u2013345, 1972.\n\n[27] W. Feller. An Introduction to Probability Theory and Its Applications, Vol. 2, 3rd Edition. Wiley, 3rd\n\nedition, January 1968.\n\n[28] D. L. Donoho. CART and best-ortho-basis: a connection. Ann. Statist., 25(5):1870\u20131911, 10 1997.\n\n[29] V. Rockova and S. L. van der Pas. Posterior concentration for Bayesian regression trees and their ensembles.\n\narXiv:1708.08734, 2017.\n\n[30] T. Anderson. Some nonparametric multivariate procedures based on statistically equivalent blocks. In P.R.\n\nKrishnaiah, editor, Multivariate Analysis, pages 5\u201327. Academic Press, New York, 1966.\n\n[31] L. Flatto and A. Konheim. The random division of an interval and the random covering of a circle. SIAM\n\nRev., 4:211\u2013222, 1962.\n\n[32] A. Nobel. Histogram regression estimation using data-dependent partitions. Ann. Statist., 24(3):1084\u20131105,\n\n1996.\n\n[33] C. Gao, F. Han, and C.H. Zhang. Minimax risk bounds for piecewise constant models. Manuscript, pages\n\n1\u201336, 2017.\n", "award": [], "sourceid": 1269, "authors": [{"given_name": "St\u00e9phanie", "family_name": "van der Pas", "institution": "Leiden University"}, {"given_name": "Veronika", "family_name": "Ro\u010dkov\u00e1", "institution": "University of Chicago"}]}