{"title": "Random Tessellation Forests", "book": "Advances in Neural Information Processing Systems", "page_first": 9575, "page_last": 9585, "abstract": "Space partitioning methods such as random forests and the Mondrian process are powerful machine learning methods for multi-dimensional and relational data, and are based on recursively cutting a domain. The flexibility of these methods is often limited by the requirement that the cuts be axis aligned. The Ostomachion process and the self-consistent binary space partitioning-tree process were recently introduced as generalizations of the Mondrian process for space partitioning with non-axis aligned cuts in the plane. Motivated by the need for a multi-dimensional partitioning tree with non-axis aligned cuts, we propose the Random Tessellation Process, a framework that includes the Mondrian process as a special case. We derive a sequential Monte Carlo algorithm for inference, and provide random forest methods. Our methods are self-consistent and can relax axis-aligned constraints, allowing complex inter-dimensional dependence to be captured. We present a simulation study and analyze gene expression data of brain tissue, showing improved accuracies over other methods.", "full_text": "Random Tessellation Forests\n\nShufei Ge1\n\nshufei_ge@sfu.ca\n\nShijia Wang2,1\n\nshijia_wang@sfu.ca\n\nYee Whye Teh3\n\ny.w.teh@stats.ox.ac.uk\n\nLiangliang Wang1\n\nliangliang_wang@sfu.ca\n\nLloyd T. Elliott1\n\nlloyd_elliott@sfu.ca\n\n1Department of Statistics and Actuarial Science, Simon Fraser University, Canada\n\n2School of Statistics and Data Science, LPMC & KLMDASR, Nankai University, China\n\n3Department of Statistics, University of Oxford, UK\n\nAbstract\n\nSpace partitioning methods such as random forests and the Mondrian process are\npowerful machine learning methods for multi-dimensional and relational data, and\nare based on recursively cutting a domain. 
The flexibility of these methods is often limited by the requirement that the cuts be axis aligned. The Ostomachion process and the self-consistent binary space partitioning-tree process were recently introduced as generalizations of the Mondrian process for space partitioning with non-axis aligned cuts in the two dimensional plane. Motivated by the need for a multi-dimensional partitioning tree with non-axis aligned cuts, we propose the Random Tessellation Process (RTP), a framework that includes the Mondrian process and the binary space partitioning-tree process as special cases. We derive a sequential Monte Carlo algorithm for inference, and provide random forest methods. Our process is self-consistent and can relax axis-aligned constraints, allowing complex inter-dimensional dependence to be captured. We present a simulation study, and analyse gene expression data of brain tissue, showing improved accuracies over other methods.

1 Introduction

Bayesian nonparametric models provide flexible and accurate priors by allowing the dimensionality of the parameter space to scale with dataset sizes [12]. The Mondrian process (MP) is a Bayesian nonparametric prior for space partitioning and provides a Bayesian view of decision trees and random forests [29, 19]. Inference for the MP is conducted by recursive and random axis-aligned cuts in the domain of the observed data, partitioning the space into a hierarchical tree of hyper-rectangles. The MP is appropriate for multi-dimensional data, and it is self-consistent (i.e., it is a projective distribution), meaning that the prior distribution it induces on a subset of a domain is equal to the marginalisation of the prior over the complement of that subset. Self-consistency is required in Bayesian nonparametric models in order to ensure correct inference, and prevent any bias arising from sampling and sample population size. 
Recent advances in MP methods for Bayesian nonparametric space partitioning include online methods [21], and particle Gibbs inference for MP additive regression trees [22]. These methods achieve high predictive accuracy, with improved efficiency. However, the axis-aligned nature of the decision boundaries of the MP restricts its flexibility, which could lead to failure in capturing inter-dimensional dependencies in the domain.
Recently, advances in Bayesian nonparametrics have been developed to allow more flexible non-axis aligned partitioning. The Ostomachion process (OP) was introduced to generalise the MP and allow non-axis aligned cuts. The OP is defined for two dimensional data domains [11]. In the OP, the angle

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: A draw from the uRTP prior with domain given by a four dimensional hypercube (x, y, z, w) ∈ [−1, 1]⁴. Intersections of the draw and the three dimensional cube are shown for w = −1, 0, 1. Colours indicate polytope identity, and are randomly assigned.

and position of each cut is randomly sampled from a specified distribution. However, the OP is not self-consistent, and so the binary space partitioning-tree (BSP) process [9] was introduced to modify the cut distribution of the OP in order to recover self-consistency. The main limitation of the OP and the BSP is that they are not defined for dimensions larger than two (i.e., they are restricted to data with two predictors). To relax this constraint, in [10] a self-consistent version of the BSP was extended to arbitrarily dimensioned space (called the BSP-forest). But for this process each cutting hyperplane is axis-aligned in all but two dimensions (with non-axis alignment allowed only in the remaining two dimensions, following the specification of the two dimensional BSP). 
Alternative constructions of non-axis aligned partitioning for two dimensional spaces and non-Bayesian methods involving sparse linear combinations of predictors or canonical correlation have also been proposed as random forest generalisations [14, 30, 28].
In this work, we propose the Random Tessellation Process (RTP), a framework for describing Bayesian nonparametric models based on cutting multi-dimensional Euclidean space. We consider four versions of the RTP, including a generalisation of the Mondrian process with non-axis aligned cuts (a sample from this prior is shown in Figure 1), a formulation of the Mondrian process as an RTP, and weighted versions of these two methods (shown in Figure 2). By virtue of their construction, all versions of the RTP are self-consistent, and are based on the theory of stable iterated tessellations in stochastic geometry [27]. The partitions induced by the RTP prior are described by a set of polytopes. We derive a sequential Monte Carlo (SMC) algorithm [8] for RTP inference which takes advantage of the hierarchical structure of the generating process for the polytope tree. We also propose a random forest version of RTPs, which we refer to as Random Tessellation Forests (RTFs). We apply our proposed model to simulated data and several gene expression datasets, and demonstrate its effectiveness compared to other modern machine learning methods.

2 Methods

Suppose we observe a dataset (v1, z1), . . . , (vn, zn), for a classification task in which vi ∈ Rd are predictors and zi ∈ {1, . . . , K} are labels (with K levels, K ∈ N>1). Bayesian nonparametric models based on partitioning the predictors proceed by placing a prior on aspects of the partition, and associating likelihood parameters with the blocks of the partition. Inference is then done on the joint posterior of the parameters and the structure of the partition. 
In this section, we develop the RTP: a unifying framework that covers and extends such Bayesian nonparametric models through a prior on partitions of (v1, z1), . . . , (vn, zn) induced by tessellations.

2.1 The Random Tessellation Process

A tessellation Y of a bounded domain W ⊂ Rd is a finite collection of closed polytopes such that the union of the polytopes is all of W, and such that the polytopes have pairwise disjoint interiors [6]. We denote tessellations of W by Y(W) or the symbol 𝒴. A polytope is an intersection of finitely many closed half-spaces. In this work we will assume that all polytopes are bounded and have nonempty interior. An RTP Yt(W) is a tessellation-valued right-continuous Markov jump process (MJP) defined on [0, τ] (we refer to the t-axis as time), in which events are cuts (specified by hyperplanes) of the tessellation's polytopes, and τ is a prespecified budget [2]. In this work we assume that all hyperplanes are affine (i.e., they need not pass through the origin).
The initial tessellation Y0(W) contains a single polytope given by the convex hull of the observed predictors in the dataset: W = hull{v1, . . . , vn} (the operation hull A denotes the convex hull of the set A). In the MJP for the random tessellation process, each polytope has an exponentially distributed lifetime, and at the end of a polytope's lifetime, the polytope is replaced by two new polytopes. The two new polytopes are formed by drawing a hyperplane that intersects the interior of the old polytope, and then intersecting the old polytope with each of the two closed half-spaces bounded by the drawn hyperplane. We refer to this operation as cutting a polytope according to the hyperplane. These cutting events continue until the prespecified budget τ is reached.
Let H be the set of hyperplanes in Rd. 
Every hyperplane h ∈ H can be written uniquely as the set of points {P : ⟨n⃗, P − u n⃗⟩ = 0}, such that n⃗ ∈ Sd−1 is a normal vector of h, and u ∈ R≥0 (u ≥ 0). Here Sd−1 is the unit (d − 1)-sphere (i.e., Sd−1 = {n⃗ ∈ Rd : ‖n⃗‖ = 1}). Thus, there is a bijection ϕ : Sd−1 × R≥0 → H given by ϕ(n⃗, u) = {P : ⟨n⃗, P − u n⃗⟩ = 0}, and therefore a measure Λ on H is induced by any measure Λ ∘ ϕ on Sd−1 × R≥0 through this bijection [20, 6].
In [27] Section 2.1, Nagel and Weiss describe a random tessellation associated with a measure Λ on H through a tessellation-valued MJP Yt such that the rate of the exponential distribution for the lifetime of a polytope a ∈ Yt is Λ([a]) (here and throughout this work, [a] denotes the set of hyperplanes in Rd that intersect the interior of a), and the hyperplane for the cutting event for a polytope a is sampled according to the probability measure Λ(· ∩ [a])/Λ([a]). We use this construction as the prior for RTPs, and describe their generative process in Algorithm 1. This algorithm is equivalent to the first algorithm listed in [27].

Algorithm 1 Generative Process for RTPs
1: Inputs: a) Bounded domain W, b) RTP measure Λ on H, c) prespecified budget τ.
2: Outputs: A realisation of the Random Tessellation Process (Yt)0≤t≤τ.
3: τ0 ← 0.
4: Y0 ← {W}.
5: while τ0 ≤ τ do
6:   Sample τ′ ∼ Exp(Σ_{a∈Yτ0} Λ([a])).
7:   Set Yt ← Yτ0 for all t ∈ (τ0, min{τ, τ0 + τ′}].
8:   Set τ0 ← τ0 + τ′.
9:   if τ0 ≤ τ then
10:    Sample a polytope a from the set Yτ0 with probability proportional to (w.p.p.t.) Λ([a]).
11:    Sample a hyperplane h from [a] according to the probability measure Λ(· ∩ [a])/Λ([a]).
12:    Yτ0 ← (Yτ0/{a}) ∪ {a ∩ h−, a ∩ h+}. (h− and h+ are the h-bounded closed half-spaces.)
13:  else
14:    return the tessellation-valued right-continuous MJP sample (Yt)0≤t≤τ.

2.1.1 Self-consistency of Random Tessellation Processes

From Theorem 1 in [27], if the measure Λ is invariant with respect to translation (i.e., Λ(A) = Λ({h + x : h ∈ A}) for all measurable subsets A ⊂ H and x ∈ Rd), and if a set of d hyperplanes with orthogonal normal vectors is contained in the support of Λ, then for all bounded domains W′ ⊆ W, Yt(W′) is equal in distribution to Yt(W) ∩ W′. This means that self-consistency holds for the random tessellations associated with such Λ. (Here, for a hyperplane h, h + x refers to the set {y + x : y ∈ h}, and for a tessellation Y and a domain W′, Y ∩ W′ is the tessellation {a ∩ W′ : a ∈ Y}.) In [27], such tessellations are referred to as stable iterated tessellations.
If we assume that Λ ∘ ϕ is the product measure λd × λ+, with λd symmetric (i.e., λd(A) = λd(−A) for all measurable sets A ⊆ Sd−1) and further that λ+ is given by the Lebesgue measure on R≥0, then Λ is translation invariant (a proof of this statement is given in Appendix A, Lemma 1 of the Supplementary Material). 
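As a sanity check on Algorithm 1, the process can be specialised to d = 1: a polytope is an interval, Λ([a]) reduces to the interval's length, and the cut distribution is uniform on the interval (the one dimensional Mondrian/MRTP case). The sketch below is our own illustration, not the paper's released implementation:

```python
import random

def rtp_1d(domain, budget, rng=random.Random(0)):
    """Algorithm 1 specialised to d = 1: a 'polytope' is an interval
    (lo, hi), Lambda([a]) is its length, and a cut is a uniform point."""
    t = 0.0
    tess = [domain]                       # Y_0 <- {W}
    while True:
        total_rate = sum(hi - lo for lo, hi in tess)
        t += rng.expovariate(total_rate)  # exponential lifetime of next event
        if t > budget:                    # budget exhausted: return Y_tau
            return tess
        # choose a polytope w.p.p.t. Lambda([a]) = its length
        r = rng.uniform(0.0, total_rate)
        for i, (lo, hi) in enumerate(tess):
            if r <= hi - lo:
                break
            r -= hi - lo
        cut = rng.uniform(lo, hi)         # cut measure is uniform on the interval
        tess[i:i + 1] = [(lo, cut), (cut, hi)]

parts = rtp_1d((0.0, 1.0), budget=3.0)
```

Because the interval lengths always sum to the domain length here, cut events arrive as a constant-rate Poisson process on [0, τ].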
So, through Algorithm 1 and Theorem 1 in [27], any distribution λd on the sphere Sd−1 whose support contains a set of d hyperplanes with orthogonal normal vectors gives rise to a self-consistent random tessellation, and we refer to models based on this self-consistent prior as Random Tessellation Processes (RTPs). We refer to any product measure Λ ∘ ϕ = λd × λ+ such that these conditions hold (i.e., λd symmetric, Λ supported on d orthogonal hyperplanes and λ+ given by the Lebesgue measure) as an RTP measure.

2.1.2 Relationship to cutting nonparametric models

Figure 2: Draws from priors Left) wuRTP, and Right) wMRTP, for the domain given by the rectangle W = [−1, 1] × [−1/3, 1/3]. Weights are given by ωx = 14, ωy = 1, leading to horizontal (x-axis heavy) structure in the polygons. Colours are randomly assigned and indicate polygon identity, and black delineates polygon boundaries.

If λd is the measure associated with a uniform distribution on the sphere (with respect to the usual Borel sets on Sd−1 [18]), then the resulting RTP is a generalisation of the Mondrian process with non-axis aligned cuts. We refer to this RTP as the uRTP (for uniform RTP). In this case, λd is a probability measure and a normal vector n⃗ may be sampled according to λd by sampling ni ∼ N(0, 1) for 1 ≤ i ≤ d and then setting n⃗ = n/‖n‖. A draw from the uRTP prior supported on a four dimensional hypercube is displayed in Figure 1.
We consider a weighted version of the uniform RTP found by setting λd to the measure associated with the distribution on n⃗ induced by the scheme ni ∼ N(0, ωi²), n⃗ = n/‖n‖. We refer to this RTP
We refer to this RTP\nas the wuRTP (weighted uniform RTP), and the wuRTP is parameterised by d weights \u03c9i \u2208 R>0.\nNote that the isotropy of the multivariate Gaussian n implies symmetry of \u03bbd. Setting the weight\n\u03c9i increases the prior probability of cuts orthogonal to the i-th predictor dimension, allowing prior\ninformation about the importance of each of the predictors to be incorporated. Figure 2(left) shows a\ndraw from the wuRTP prior supported on a rectangle.\n\nThe Mondrian process itself is an RTP with \u03bbd =(cid:80)\nof the Mondrian process as the MRTP. If \u03bbd =(cid:80)\n\nv\u2208 poles(d) \u03b4v. Here, \u03b4x is the Dirac delta supported\non x, and poles(d) is the set of normal vectors with zeros in all coordinates, except for one of the\ncoordinates (the non-zero coordinate of these vectors can be either \u22121 or +1). We refer to this view\nv\u2208 poles(d) \u03c9i(v)\u03b4v, where \u03c9i are d axis weights, and\ni(v) is the index of the nonzero element of v, then we arrive at a weighted version of the MRTP, which\nwe refer to as the wMRTP. Figure 2(right) displays a draw from the wMRTP prior. The horizontal\norganisation of the lines in Figure 2 arise from the uneven axis weights.\nOther nonparametric models based on cutting polytopes may also be viewed in this way. For example,\nBinary Space Partitioning-Tree Processes [9] are uRTPs and wuRTPs restricted to two dimensions,\nand the generalization of the Binary Space Partitioning-Tree Process in [10] is an RTP for which \u03bbd\nis a sum of delta functions convolved with smooth measures on S1 projected onto pairs of axes. The\nOstomachion process [11] does not arise from an RTP: it is not self-consistent, and so by Theorem 1\nin [27], the OP cannot arise from an RTP measure.\n\n2.1.3 Likelihoods for Random Tessellation Processes\n\nIn this section, we illustrate how RTPs can be used to model categorical data. 
For example, in gene expression data of tissues, the predictors are the amounts of expression of each gene in the tissue (the vector vi for sample i), and our goal is to predict disease condition (labels zi). Let 𝒴t be an RTP on the domain W = hull{v1, . . . , vn}. Let Jt denote the number of polytopes in 𝒴t. We let h(vi) denote a mapping function, which matches the i-th data item to the polytope in the tessellation containing that data item. Hence, h(vi) takes a value in the set {1, . . . , Jt}. We will consider the likelihood arising at time t from the following generative process:

𝒴t ∼ Yt(W), φj ∼ Dirichlet(α) for 1 ≤ j ≤ Jt,   (1)
zi | 𝒴t, φh(vi) ∼ Multinomial(φh(vi)) for 1 ≤ i ≤ n.   (2)

Here φj = (φj1, . . . , φjK) are parameters of the multinomial distribution with a Dirichlet prior with hyperparameters α = (αk)1≤k≤K. The likelihood function for Z = (zi)1≤i≤n conditioned on the tessellation 𝒴t and V = (vi)1≤i≤n and given the hyperparameter α is as follows:

P(Z | 𝒴t, V, α) = ∫···∫ P(Z, φ | 𝒴t, α) dφ1 ··· dφJt = ∏_{j=1}^{Jt} B(α + mj)/B(α).   (3)

Here B(·) is the multivariate beta function, mj = (mjk)1≤k≤K and mjk = Σ_{i:h(vi)=j} δ(zi = k), and δ(·) is an indicator function with δ(zi = k) = 1 if zi = k and δ(zi = k) = 0 otherwise. We refer to Appendix A, Lemma 2 of the Supplementary Material for the derivation of (3).

2.2 Inference for Random Tessellation Processes

We wish to infer the posterior of the tessellation at time t, denoted π(𝒴t | V, Z, α). We let π0(𝒴t) denote the prior distribution of 𝒴t, and let P(Z | 𝒴t, V, α) denote the likelihood (3) of the data given the tessellation 𝒴t and the hyperparameter α. 
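The likelihood (3) factorises over polytopes into Dirichlet-multinomial terms, which is convenient to evaluate in log space. A minimal sketch (our own code, with labels coded 0, . . . , K−1):

```python
from math import lgamma

def log_beta(alpha):
    """Log of the multivariate beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))

def log_likelihood(labels_by_polytope, alpha):
    """log P(Z | tessellation, V, alpha) as in Eq. (3):
    sum_j [log B(alpha + m_j) - log B(alpha)], where m_jk counts the
    training labels of class k falling in polytope j."""
    K = len(alpha)
    total = 0.0
    for labels in labels_by_polytope:   # one list of labels z_i per polytope
        m = [sum(1 for z in labels if z == k) for k in range(K)]
        total += log_beta([a + c for a, c in zip(alpha, m)]) - log_beta(alpha)
    return total
```

Label-pure polytopes score higher than mixed ones under this likelihood, which is what drives cuts toward class boundaries.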
By Bayes' rule, the posterior of the tessellation at time t is π(𝒴t | V, Z, α) = π0(𝒴t)P(Z | 𝒴t, V, α)/P(Z | V, α). Here P(Z | V, α) is the marginal likelihood given data Z. This posterior distribution is intractable, and so in Section 2.2.3 we introduce an efficient SMC algorithm for conducting inference on π(𝒴t). The proposal distribution for this SMC algorithm involves draws from the RTP prior, and so in Section 2.2.1, we describe a rejection sampling scheme for drawing a hyperplane from the probability measure Λ(· ∩ [a])/Λ([a]) for a polytope a. We provide some optimizations and approximations used in this SMC algorithm in Section 2.2.2.

2.2.1 Sampling cutting hyperplanes for Random Tessellation Processes

Suppose that a is a polytope, and Λ ∘ ϕ is an RTP measure such that Λ(ϕ(·)) = (λd × λ+)(·). We wish to sample a hyperplane according to the probability measure Λ(· ∩ [a])/Λ([a]). We note that if Br(x) is the smallest closed d-ball containing a (with radius r and centre x), then [a] is contained in [Br(x)]. If we can sample a normal vector n⃗ according to λd, then we can sample a hyperplane according to Λ(· ∩ [a])/Λ([a]) through the following rejection sampling scheme.

• Step 1) Sample n⃗ according to λd.
• Step 2) Sample u ∼ Uniform[0, r].
• Step 3) If the hyperplane h = x + {P : ⟨n⃗, P − u n⃗⟩ = 0} intersects a, then RETURN h. Otherwise, GOTO Step 1).

Note that in Step 3, the set {P : ⟨n⃗, P − u n⃗⟩ = 0} is a hyperplane intersecting the ball Br(0) centred at the origin, and so translation of this hyperplane by x yields a hyperplane intersecting the ball Br(x). 
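A sketch of this rejection sampler for the uRTP measure, with the polytope represented by a finite point set. For simplicity our enclosing ball is centred at the point-cloud mean rather than being the smallest enclosing ball, which only loosens the proposal; the helper name is ours:

```python
import math
import random

def sample_cut(points, rng=random.Random(2)):
    """Rejection-sample a cutting hyperplane for hull(points) under the
    uRTP measure (Section 2.2.1); returns (n, u, x) with ball centre x."""
    d = len(points[0])
    x = [sum(p[i] for p in points) / len(points) for i in range(d)]  # ball centre
    r = max(math.dist(p, x) for p in points)                         # ball radius
    while True:
        # Step 1: uniform direction on the sphere via a normalised Gaussian
        n = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in n))
        n = [c / norm for c in n]
        u = rng.uniform(0.0, r)                                      # Step 2
        # Step 3: the hyperplane {P : <n, P - x - u n> = 0} cuts hull(points)
        # iff the signed offsets <n, p - x> - u take both signs over the points
        offs = [sum(ni * (pi - xi) for ni, pi, xi in zip(n, p, x)) - u
                for p in points]
        if min(offs) < 0.0 < max(offs):
            return n, u, x
```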
When this scheme is applied to the uRTP or wuRTP, in Step 1 n⃗ is sampled from the uniform distribution on the sphere or the appropriate multivariate Gaussian. And for the MRTP or wMRTP, n⃗ is sampled from the discrete distributions given in Section 2.1.2.

2.2.2 Optimizations and approximations

We use three methods to decrease the computational requirements and complexity of inference based on RTP posteriors. First, we replace all polytopes with convex hulls formed by intersecting the polytopes with the dataset predictors. Second, in determining the rates of the lifetimes of polytopes, we approximate Λ([a]) with the measure Λ([·]) applied to the smallest closed ball containing a. Third, we use a pausing condition so that no cuts are proposed for polytopes for which the labels of all predictors in that polytope are the same label.
Convex hull replacement. In our posterior inference, if a ∈ Y is cut according to the hyperplane h, we consider the resulting tessellation to be Y/{a} ∪ {hull(a ∩ V ∩ h+), hull(a ∩ V ∩ h−)}. Here h+ and h− are the two closed half-spaces bounded by h, and / is the set minus operation. This requires a slight loosening of the definition of a tessellation Y of a bounded domain W to allow the union of the polytopes of a tessellation Y to be a strict subset of W such that V ⊆ ∪a∈Y a.
In our computations, we do not need to explicitly compute these convex hulls, and instead for any polytope b, we store only b ∩ V, as this is enough to determine whether or not a hyperplane h intersects hull(b ∩ V). This membership check is the only geometric operation required to sample hyperplanes intersecting b according to the rejection sampling scheme from Section 2.2.1. By the self-consistency of RTPs, this has the effect of marginalizing out MJP events involving cuts that do not further separate the predictors in the dataset. 
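Under this convex hull replacement, a cut is just a sign-based partition of the stored point set, and the pausing condition of Section 2.2.2 is a one-line check. A small sketch (helper names are ours):

```python
def cut_point_set(block, normal, offset, centre):
    """Convex-hull replacement: a polytope is stored only as the labelled
    data points it contains, so cutting by a hyperplane (normal n, offset u,
    ball centre x) partitions the set by the sign of <n, p - x> - u."""
    left, right = [], []
    for point, label in block:
        side = sum(n * (p - c) for n, p, c in zip(normal, point, centre)) - offset
        (left if side <= 0.0 else right).append((point, label))
    return left, right

def paused(block):
    """Pausing condition: propose no further cuts once all labels agree."""
    return len({label for _, label in block}) <= 1
```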
This also obviates the need for explicit computation of the facets of polytopes, significantly simplifying the codebase of our implementation of inference. After this convex hull replacement operation, a data item in the test dataset may not be contained in any polytope, and so to conduct posterior inference we augment the training dataset with a version of the testing dataset in which the label is missing, and then marginalise the missing label in the likelihood described in Section 2.1.3.
Spherical approximation. Every hyperplane intersecting a polytope a also intersects a closed ball containing a. Therefore, for any RTP measure Λ, Λ([a]) is upper bounded by Λ([B(ra)]). Here ra is the radius of the smallest closed ball containing a. We approximate Λ([a]) ≈ Λ([B(ra)]) for use in polytope lifetime calculations in our uRTP inference and we do not compute Λ([a]) exactly. For the uRTP and wuRTP, Λ([B(ra)]) = ra. A proof of this is given in Appendix A, Lemma 3 of the Supplementary Material. For the MRTP and wMRTP, Λ([a]) can be computed exactly [29].
Pausing condition. In our posterior inference, if zi = zj for all i, j such that vi, vj ∈ a, then we pause the polytope a and no further cuts are performed on this polytope. This improves computational efficiency without affecting inference, as cutting such a polytope cannot further separate labels. This was done in recent work for Mondrian processes [21] and was originally suggested in the Random Forest reference implementation [4].

2.2.3 Sequential Monte Carlo for Random Tessellation Process inference

Algorithm 2 SMC for inferring RTP posteriors
1: Inputs: a) Training dataset V, Z, b) RTP measure Λ on H, c) prespecified budget τ, d) likelihood hyperparameter α.
2: Outputs: Approximate RTP posterior Σ_{m=1}^{M} wm δ𝒴τ,m at time τ. (wm are particle weights.)
3: Set τm ← 0, for m = 1, . . . , M.
4: Set 𝒴0,m ← {hull V}, wm ← 1/M, for m = 1, . . . , M.
5: while min{τm}_{m=1}^{M} < τ do
6:   Resample 𝒴′τm,m from {𝒴τm,m}_{m=1}^{M} w.p.p.t. {wm}_{m=1}^{M}, for m = 1, . . . , M.
7:   Set 𝒴τm,m ← 𝒴′τm,m, for m = 1, . . . , M.
8:   Set wm ← 1/M, for m = 1, . . . , M.
9:   for m ∈ {m : m = 1, . . . , M and τm < τ} do
10:    Sample τ′ ∼ Exp(Σ_{a∈𝒴τm,m} ra). (ra is the radius of the smallest closed ball containing a.)
11:    if τm + τ′ ≤ τ then
12:      Set 𝒴t,m ← 𝒴τm,m, for all t ∈ (τm, min{τ, τm + τ′}].
13:      Sample a from the set 𝒴τm,m w.p.p.t. ra.
14:      Sample h from [a] according to Λ(· ∩ [a])/Λ([a]) using Section 2.2.1.
15:      Set 𝒴′τm,m ← (𝒴τm,m/{a}) ∪ {hull(V ∩ a ∩ h−), hull(V ∩ a ∩ h+)}.
16:      Set wm ← wm P(Z | 𝒴′τm,m, V, α)/P(Z | 𝒴τm,m, V, α) according to (3), and set 𝒴τm,m ← 𝒴′τm,m.
17:    else
18:      Set 𝒴t,m ← 𝒴τm,m, for t ∈ (τm, τ].
19:    Set τm ← τm + τ′.
20:  Set Z ← Σ_{m=1}^{M} wm.
21:  Set wm ← wm/Z, for m = 1, . . . , M.
22: return the particle approximation Σ_{m=1}^{M} wm δ𝒴τ,m.

We provide an SMC method (Algorithm 2) with M particles, to approximate the posterior distribution π(𝒴t) of an RTP conditioned on Z, V, and given an RTP measure Λ and a hyperparameter α and a prespecified budget τ. Our algorithm iterates between three steps: resampling particles (Algorithm 2, line 7), propagation of particles (Algorithm 2, lines 13-15) and weighting of particles (Algorithm 2, line 16). 
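The resampling and weight-normalisation steps of Algorithm 2 (lines 6-8 and 20-21) are standard SMC bookkeeping. A minimal multinomial-resampling sketch (our own helpers, not the released implementation):

```python
import random

def resample(particles, weights, rng=random.Random(3)):
    """Multinomial resampling (Algorithm 2, lines 6-8): draw M particle
    indices w.p.p.t. the weights, then reset all weights to 1/M."""
    M = len(particles)
    idx = rng.choices(range(M), weights=weights, k=M)
    return [particles[i] for i in idx], [1.0 / M] * M

def normalise(weights):
    """Weight normalisation (Algorithm 2, lines 20-21)."""
    Z = sum(weights)
    return [w / Z for w in weights]
```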
At each SMC iteration, we sample the next MJP events using the spherical approximation of Λ([·]) described in Section 2.2.2. For brevity, the pausing condition described in Section 2.2.2 is omitted from Algorithm 2.
In our experiments, after using Algorithm 2 to yield a posterior estimate Σ_{m=1}^{M} wm δ𝒴τ,m, we select the tessellation 𝒴τ,m with the largest weight wm (i.e., we do not conduct resampling at the last SMC iteration). We then compute posterior probabilities of the test dataset labels using the particle 𝒴τ,m. This method of not resampling after the last SMC iteration is recommended in [7] for lowering asymptotic variance in SMC estimates.
The computational complexity of Algorithm 2 depends on the number of polytopes in the tessellations, and the organization of the labels within the polytopes. The more linearly separable the dataset is, the sooner the pausing conditions are met. The complexity of computing the spherical approximation in Section 2.2.2 (the radius ra) for a polytope a is O(|V ∩ a|²), where |·| denotes set cardinality.

2.2.4 Prediction with Random Tessellation Forests

Random forests are commonly used in machine learning for classification and regression problems [3]. A random forest is represented by an ensemble of decision trees, and predictions of test dataset labels are combined over all decision trees in the forest. To improve the performance of our methods, we consider random forest versions of RTPs (which we refer to as RTFs: uRTF, wuRTF, MRTF, wMRTF are random forest versions of the uRTP, MRTP and their weighted versions resp.). We run Algorithm 2 independently T times, and predict labels using the modes. Differing from [3], we do not use bagging.
In [22], Lakshminarayanan, Roy and Teh consider an efficient Mondrian forest in which likelihoods are dropped from the SMC sampler and cutting is done independent of likelihood. 
This method follows recent theory for random forests [15]. We consider this method (by dropping line 16 of Algorithm 2) and refer to the implementations of this method as the uRTF.i and MRTF.i (i for likelihood independence).

3 Experiments

In Section 3.1, we explore a simulation study that shows differences among the uRTP and MRTP, and some standard machine learning methods. Variations in gene expression across tissues in brain regions play an important role in disease conditions. In Section 3.2, we examine predictions of a variety of RTF models for gene expression data. For all our experiments, we set the likelihood hyperparameters for the RTPs and RTFs to the empirical estimates αk = nk/1000. Here nk = Σ_{i=1}^{n} δ(zi = k). In all of our experiments, for each train/test split, we allocate 60% of the data items at random to the training set.
An implementation of our methods (released under the open source BSD 2-clause license) and a software manual are provided in the Supplementary Material.

Figure 3: Left) A view of the Mondrian cube, with cyan indicating label 1, magenta indicating label 2, and black delineating label boundaries. Right) Percent correct versus number of cuts for predicting the Mondrian cube test dataset, with uRTP, MRTP and a variety of baseline methods.

3.1 Simulations on the Mondrian cube

We consider a simulated three dimensional dataset designed to exemplify the difference between axis-aligned and non-axis aligned models. We refer to this dataset as the Mondrian cube, and we investigate the performance of the uRTP and the MRTP on this dataset, along with some standard machine learning approaches, varying the number of cuts in the processes. The Mondrian cube dataset is simulated as follows: first, we sample 10,000 points uniformly in the cube [0, 1]³. 
Points falling in the cube [0, 0.25]³ or the cube [0.25, 1]³ are given label 1, and the remaining points are given label 2. Then, we centre the points and rotate all of the points by the angles π/4 and −π/4 about the x-axis and y-axis respectively, creating a dataset organised on diagonals. In Figure 3(left), we display a visualization of the Mondrian cube dataset, wherein points are colored by their label. We apply the SMC algorithm to the Mondrian cube data, with 50 random train/test splits. For each split, we run 10 independent copies of the uRTP and MRTP and take the mode of their results, and we also examine the accuracy of logistic regression (LR), a decision tree (DT) and a support vector machine (SVM). Figure 3(right) shows that the percent correct for the uRTP and MRTP both increase as the number of cuts increases, and plateau when the number of cuts becomes larger (greater than 25). Even though the uRTP has lower accuracy at the first cut, it starts dominating the MRTP after the second cut. Overall, in terms of percent correct, with any number of cuts > 105, a sign test indicates that the uRTP performs significantly better than all other methods at nominal significance, and the MRTP performs significantly better than DT and LR for any number of cuts > 85 at nominal significance.

3.2 Experiment on gene expression data in brain tissue

We evaluate a variety of RTPs and some standard machine learning methods on a glioblastoma tissue dataset GSE83294 [13], which includes 22,283 gene expression profiles for 85 astrocytomas (26 diagnosed as grade III and 59 as grade IV). 
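The Mondrian cube simulation of Section 3.1 can be reproduced in a few lines; the sketch below (rotation helpers are ours, not from the released implementation) labels points by sub-cube membership and then rotates the centred cloud off-axis:

```python
import math
import random

def mondrian_cube(n=10000, rng=random.Random(4)):
    """Simulate the Mondrian cube (Section 3.1): uniform points in [0, 1]^3,
    label 1 inside [0, 0.25]^3 or [0.25, 1]^3 (else label 2), then centre
    and rotate by pi/4 about the x-axis and -pi/4 about the y-axis."""
    def rot_x(p, t):
        x, y, z = p
        return (x, y * math.cos(t) - z * math.sin(t),
                y * math.sin(t) + z * math.cos(t))
    def rot_y(p, t):
        x, y, z = p
        return (x * math.cos(t) + z * math.sin(t), y,
                -x * math.sin(t) + z * math.cos(t))
    points, labels = [], []
    for _ in range(n):
        p = tuple(rng.random() for _ in range(3))
        in_small = all(c <= 0.25 for c in p)
        in_large = all(c >= 0.25 for c in p)
        labels.append(1 if in_small or in_large else 2)
        p = tuple(c - 0.5 for c in p)   # centre the cube at the origin
        points.append(rot_y(rot_x(p, math.pi / 4), -math.pi / 4))
    return points, labels

X, z = mondrian_cube(1000)
```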
We also examine schizophrenia brain tissue datasets: GSE21935 [1], in which 54,675 gene expressions in the superior temporal cortex are recorded for 42 subjects, with 23 cases (with schizophrenia) and 19 controls (without schizophrenia), and dataset GSE17612 [24], a collection of 54,675 gene expressions from samples in the anterior prefrontal cortex (i.e., a different brain area from GSE21935) with 28 schizophrenic subjects and 23 controls. We refer to these datasets as GL85, SCZ42 and SCZ51, respectively.
We also consider a combined version of SCZ42 and SCZ51 (in which all samples are concatenated), which we refer to as SCZ93. For GL85 the labels are the astrocytoma grade, and for SCZ42, SCZ51 and SCZ93 the labels are schizophrenia status. We use principal components analysis (PCA) in preprocessing to replace the predictors of each data item (a set of gene expressions) with its scores on a full set of principal components (PCs): i.e., 85 PCs for GL85, 42 PCs in SCZ42 and 51 PCs in SCZ51. We then scale the PCs. We consider 200 test/train splits for each dataset. These datasets were acquired from NCBI's Gene Expression Omnibus¹ and are released under the Open Data Commons Open Database License. We provide test/train splits of the PCA preprocessed datasets in the Supplementary Material.
Through this preprocessing, the j-th predictor is the score vector of the j-th principal component. For the weighted RTFs (the wuRTF and wMRTF), we set the weight of the j-th predictor to be proportional to the variance explained by the j-th PC (σ_j²): ω_j = σ_j². We set the number of trees in all of the random forests to 100, which is the default in R's randomForest package [23]. For all RTFs, we set the budget τ = ∞, as is done in [21].

Figure 4: Box plot showing wuRTF and wMRTF improvements for GL85, and generally best performance for the wuRTF method (with sign test p-value of 3.2 × 10⁻⁹ vs wMRTF). Reduced performance of SVM indicates structure in GL85 that is not linearly separable. Medians, quantiles and outliers beyond 99.3% coverage are indicated.

¹Downloaded from https://www.ncbi.nlm.nih.gov/geo/ in Spring 2019.

Dataset   BL      LR      SVM     RF      MRTF.i  uRTF.i  MRTF    uRTF    wMRTF   wuRTF
GL85      70.34   58.13   70.34   73.01   70.74   70.06   77.09   70.60   80.57   84.90
SCZ42     46.68   57.65   46.79   51.76   49.56   48.50   49.91   47.71   53.12   53.97
SCZ51     46.55   51.15   46.67   57.38   52.55   48.58   57.95   44.70   58.12   49.05
SCZ93     48.95   53.05   50.15   52.45   50.23   50.24   51.80   50.34   53.12   54.99

Table 1: Comparison of mean percent correct on gene expression datasets over 200 random train/test splits across different methods. Nominal statistical significance (p-value < 0.05) is ascertained by a sign test in which ties are broken in a conservative manner. Tests are performed between the top method and each other method. Bold values indicate the top method and all methods statistically indistinguishable from the top method according to nominal significance. The largest nominally significant improvement is seen for the wuRTF on GL85, and the wuRTF is significantly better than the other methods for this dataset. The wMRTF and wuRTF have the largest mean percent correct for SCZ51 and SCZ93 but are not statistically distinguishable from RF or LR for those datasets.

4 Results

We compare percent correct for the wuRTF, uRTF and uRTF.i, the Mondrian Random Tessellation Forests (wMRTF, MRTF and MRTF.i), a random forest (RF), logistic regression (LR), a support vector machine (SVM) and a baseline (BL) in which the mode of the training set label is always predicted [25, 23].
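The PCA preprocessing and predictor weighting of Section 3.2 can be sketched as follows (a NumPy sketch on synthetic stand-in data; the function name is ours, normalising the weights is our choice since the text only states proportionality, and scaling each PC to unit standard deviation is one reading of "we then scale the PCs"):

```python
import numpy as np

def pca_scores_and_weights(G):
    """Replace each data item's predictors (gene expressions) with its
    scores on a full set of principal components, and weight the j-th
    predictor in proportion to the variance explained by the j-th PC."""
    n = G.shape[0]
    Gc = G - G.mean(axis=0)                       # centre the expressions
    U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
    scores = Gc @ Vt.T                            # n x n matrix of PC scores
    var_explained = s**2 / (n - 1)                # sigma_j^2 for each PC
    weights = var_explained / var_explained.sum() # omega_j, normalised (our choice)
    # Scale the PCs to unit standard deviation, guarding near-zero columns
    # (a centred n x p matrix has rank at most n - 1).
    sd = scores.std(axis=0, ddof=1)
    scores = scores / np.where(sd > 1e-12, sd, 1.0)
    return scores, weights

# Synthetic stand-in for a gene expression matrix (n items, p genes).
rng = np.random.default_rng(1)
G = rng.normal(size=(85, 500))
scores, w = pca_scores_and_weights(G)
```

Because singular values are returned in decreasing order, the weights are non-increasing in j, so the weighted RTFs favour cuts along the leading principal components.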
Mean percent correct and sign tests for all of these experiments are reported in Table 1, and box plots for the GL85 experiment are reported in Figure 4. We observe that the increase in accuracy of the wuRTF achieves nominal significance over the other methods on GL85. For datasets SCZ42, SCZ51 and SCZ93, the performance of the RTFs is comparable to that of logistic regression and random forests. For all datasets we consider, RTFs have higher accuracy than SVMs (with nominal significance). Boxplots with accuracies for the datasets SCZ42, SCZ51 and SCZ93 are provided in Appendix B, Supplementary Figure 1 of the Supplementary Material. Results of a conservative sign test performed between each pair of methods on each dataset, the standard deviation of the percent correct, and the mean runtime across different methods for all four datasets are reported in Supplementary Tables 1, 2, and 3 in Appendix B of the Supplementary Material.

5 Discussion

The spherical approximation introduced in Section 2.2.2 can lead to inexact inference. In Supplementary Algorithm 1, we introduce a new algorithm based on Poisson thinning that recovers exact inference. This algorithm may improve accuracy and allow hierarchical likelihoods.
There are many directions for future work in RTPs including improved SMC sampling for MJPs as in [17], hierarchical likelihoods and online methods as in [21], analysis of minimax convergence rates [26], and extensions using Bayesian additive regression trees [5, 22]. We could also consider applications of RTPs to data that naturally displays tessellation and cracking, such as sea ice [16].

6 Conclusion

We have described a framework for viewing Bayesian nonparametric methods based on space partitioning as Random Tessellation Processes. This framework includes the Mondrian process as a special case, and includes extensions of the Mondrian process allowing non-axis aligned cuts in high dimensional space.
The processes are self-consistent, and we derive inference using sequential Monte Carlo and random forests. To our knowledge, this is the first work to provide self-consistent Bayesian nonparametric hierarchical partitioning with non-axis aligned cuts that is defined for more than two dimensions. As demonstrated by our simulation study and experiments on gene expression data, these non-axis aligned cuts can improve performance over the Mondrian process and other machine learning methods such as support vector machines and random forests.

Acknowledgments

We are grateful to Kevin Sharp, Frauke Harms, Maasa Kawamura, Ruth Van Gurp, Lars Buesing, Tom Loughin and Hugh Chipman for helpful discussion, comments and inspiration. We would also like to thank Fred Popowich and Martin Siegert for help with computational resources at Simon Fraser University. YWT's research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071. This research was also funded by NSERC grant numbers RGPIN/05484-2019, DGECR/00118-2019 and RGPIN/06131-2019.

References

[1] M. R. Barnes, J. Huxley-Jones, P. R. Maycox, M. Lennon, A. Thornber, F. Kelly, S. Bates, A. Taylor, J. Reid, N. Jones, and J. Schroeder. Transcription and pathway analysis of the superior temporal cortex and anterior prefrontal cortex in schizophrenia. Journal of Neuroscience Research, 89(8), 2011.

[2] M. A. Berger. An Introduction to Probability and Stochastic Processes. Springer Texts in Statistics, 2012.

[3] L. Breiman. Random forests. Machine Learning, 45(1), 2001.

[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman and Hall/CRC, 1984.

[5] H. A. Chipman, E. I. George, and R. E. McCulloch. BART: Bayesian additive regression trees.
Annals of Applied Statistics, 4(1), 2010.

[6] S. N. Chiu, D. Stoyan, W. S. Kendall, and J. Mecke. Stochastic Geometry and its Applications. Wiley Series in Probability and Statistics, 2013.

[7] N. Chopin. Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. The Annals of Statistics, 32(6), 2004.

[8] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3), 2000.

[9] X. Fan, B. Li, and S. Sisson. The binary space partitioning-tree process. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, 2018.

[10] X. Fan, B. Li, and S. Sisson. Binary space partitioning forests. arXiv preprint 1903.09348, 2019.

[11] X. Fan, B. Li, Y. Wang, Y. Wang, and F. Chen. The Ostomachion process. In Proceedings of the Thirtieth Conference of the Association for the Advancement of Artificial Intelligence, 2016.

[12] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2), 1973.

[13] W. A. Freije, F. E. Castro-Vargas, Z. Fang, S. Horvath, T. Cloughesy, L. M. Liau, P. S. Mischel, and S. F. Nelson. Gene expression profiling of gliomas strongly predicts survival. Cancer Research, 64(18), 2004.

[14] E. I. George. Sampling random polygons. Journal of Applied Probability, 24(3):557–573, 1987.

[15] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1), 2006.

[16] D. Godlovitch. Idealised models of sea ice thickness dynamics. PhD thesis, University of Victoria, 2011.

[17] M. Hajiaghayi, B. Kirkpatrick, L. Wang, and A. Bouchard-Côté. Efficient continuous-time Markov chain estimation. In Proceedings of the 31st International Conference on Machine Learning, 2014.

[18] P. Halmos. Measure Theory. Springer, 1974.

[19] C. Kemp, J. B. Tenenbaum, T. L.
Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 20th Conference on the Association for the Advancement of Artificial Intelligence, 2006.

[20] J. F. Kingman. Poisson Processes. Oxford University Press, 1996.

[21] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh. Mondrian forests: Efficient online random forests. In Proceedings of the 28th Conference on Neural Information Processing Systems, 2014.

[22] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh. Particle Gibbs for Bayesian additive regression trees. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.

[23] A. Liaw and M. Wiener. Classification and Regression by randomForest. R News, 2(3), 2002.

[24] P. R. Maycox, F. Kelly, A. Taylor, S. Bates, J. Reid, R. Logendra, M. R. Barnes, C. Larminie, N. Jones, M. Lennon, C. Davies, J. J. Hagan, C. A. Scorer, C. Angelinetta, M. T. Akbar, S. Hirsch, A. M. Mortimer, T. R. Barnes, and J. de Belleroche. Analysis of gene expression in two large schizophrenia cohorts identifies multiple changes associated with nerve terminal function. Molecular Psychiatry, 14(12), 2009.

[25] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch. e1071: Misc Functions of the Department of Statistics, Probability Theory Group, Technische Universität Wien, 2019.

[26] J. Mourtada, S. Gaïffas, and E. Scornet. Minimax optimal rates for Mondrian trees and forests. arXiv preprint 1803.05784, 2018.

[27] W. Nagel and V. Weiss. Crack STIT tessellations: Characterization of stationary random tessellations stable with respect to iteration. Advances in Applied Probability, 37(4), 2005.

[28] T. Rainforth and F. Wood. Canonical correlation forests. arXiv preprint 1507.05444, 2015.

[29] D. M. Roy and Y. W. Teh. The Mondrian process.
In Proceedings of the 22nd Conference on Neural Information Processing Systems, 2008.

[30] T. M. Tomita, J. Browne, C. Shen, J. L. Patsolic, J. Yim, C. E. Priebe, R. Burns, M. Maggioni, and J. T. Vogelstein. Random projection forests. arXiv preprint 1506.03410, 2015.