{"title": "Learning Minimum Volume Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 1209, "page_last": 1216, "abstract": "", "full_text": "Learning Minimum Volume Sets\n\nClayton Scott\n\nStatistics Department\n\nRice University\n\nHouston, TX 77005\n\ncscott@rice.edu\n\nRobert Nowak\n\nElectrical and Computer Engineering\n\nUniversity of Wisconsin\n\nMadison, WI 53706\nnowak@engr.wisc.edu\n\nAbstract\n\nGiven a probability measure P and a reference measure \u00b5, one is\noften interested in the minimum \u00b5-measure set with P -measure at\nleast \u03b1. Minimum volume sets of this type summarize the regions of\ngreatest probability mass of P , and are useful for detecting anoma-\nlies and constructing con\ufb01dence regions. This paper addresses the\nproblem of estimating minimum volume sets based on independent\nsamples distributed according to P . Other than these samples, no\nother information is available regarding P , but the reference mea-\nsure \u00b5 is assumed to be known. We introduce rules for estimating\nminimum volume sets that parallel the empirical risk minimization\nand structural risk minimization principles in classi\ufb01cation. As\nin classi\ufb01cation, we show that the performances of our estimators\nare controlled by the rate of uniform convergence of empirical to\ntrue probabilities over the class from which the estimator is drawn.\nThus we obtain \ufb01nite sample size performance bounds in terms of\nVC dimension and related quantities. We also demonstrate strong\nuniversal consistency and an oracle inequality. Estimators based\non histograms and dyadic partitions illustrate the proposed rules.\n\n1 Introduction\n\nGiven a probability measure P and a reference measure \u00b5, the minimum volume\nset (MV-set) with mass at least 0 < \u03b1 < 1 is\n\nG\u2217\n\n\u03b1 = arg min{\u00b5(G) : P (G) \u2265 \u03b1, G measurable}.\n\nMV-sets summarize regions where the mass of P is most concentrated. 
For example, if P is a multivariate Gaussian distribution and \u00b5 is the Lebesgue measure, then the MV-sets are ellipsoids (see also Figure 1). Applications of minimum volume sets include outlier/anomaly detection, determining highest posterior density or multivariate con\ufb01dence regions, tests for multimodality, and clustering. In comparison to the closely related problem of density level set estimation [1, 2], the minimum volume approach seems preferable in practice because the mass \u03b1 is more easily speci\ufb01ed than a level of a density. See [3, 4, 5] for further discussion of MV-sets.\n\nThis paper considers the problem of MV-set estimation using a training sample drawn from P, which in most practical settings is the only information one has about P. The speci\ufb01cations to the estimation process are the signi\ufb01cance level \u03b1, the reference measure \u00b5, and a collection of candidate sets G. All proofs, as well as additional results and discussion, may be found in [6]. To our knowledge, ours is the \ufb01rst work to establish \ufb01nite sample bounds, an oracle inequality, and universal consistency for the MV-set estimation problem.\n\nFigure 1: Gaussian mixture data, 500 samples, \u03b1 = 0.9. (Left and Middle) Minimum volume set estimates based on recursive dyadic partitions, discussed in Section 6. (Right) True MV-set.\n\nThe methods proposed herein are primarily of theoretical interest, although they may be implemented e\ufb03ciently for certain partition-based estimators as discussed later. As a more practical alternative, the MV-set problem may be reduced to Neyman-Pearson classi\ufb01cation [7, 8] by simulating realizations from \u00b5.\n\n1.1 Notation\n\nLet (X, B) be a measure space with X \u2282 R^d. Let X be a random variable taking values in X with distribution P. Let S = (X1, . . . , Xn) be an independent and identically distributed (IID) sample drawn according to P. 
Let G denote a subset of X, and let G be a collection of such subsets. Let bP denote the empirical measure based on S: bP(G) = (1/n) \u2211_{i=1}^{n} I(Xi \u2208 G). Here I(\u00b7) is the indicator function. Set\n\n\u00b5\u2217\u03b1 = inf_G {\u00b5(G) : P(G) \u2265 \u03b1},   (1)\n\nwhere the inf is over all measurable sets. A minimum volume set, G\u2217\u03b1, is a minimizer of (1), when it exists. Let G be a class of sets. Given \u03b1 \u2208 (0, 1), denote G\u03b1 = {G \u2208 G : P(G) \u2265 \u03b1}, the collection of all sets in G with mass at least \u03b1. De\ufb01ne \u00b5G,\u03b1 = inf{\u00b5(G) : G \u2208 G\u03b1} and GG,\u03b1 = arg min{\u00b5(G) : G \u2208 G\u03b1} when it exists. Thus GG,\u03b1 is the best approximation to the MV-set G\u2217\u03b1 from G. Existence and uniqueness of these and related quantities are discussed in [6].\n\n2 Minimum Volume Sets and Empirical Risk Minimization\n\nIn this section we introduce a procedure inspired by the empirical risk minimization (ERM) principle for classi\ufb01cation. In classi\ufb01cation, ERM selects a classi\ufb01er from a \ufb01xed set of classi\ufb01ers by minimizing the empirical error (risk) of a training sample. Vapnik and Chervonenkis established the basic theoretical properties of ERM (see [9, 10]), and we \ufb01nd similar properties in the minimum volume setting. In this and the next section we do not assume P has a density with respect to \u00b5.\n\nLet \u03c6(G, S, \u03b4) be a function of G \u2208 G, the training sample S, and a con\ufb01dence parameter \u03b4 \u2208 (0, 1). Set bG\u03b1 = {G \u2208 G : bP(G) \u2265 \u03b1 \u2212 \u03c6(G, S, \u03b4)} and\n\nbGG,\u03b1 = arg min{\u00b5(G) : G \u2208 bG\u03b1}.   (2)\n\nWe refer to the rule in (2) as MV-ERM because of the analogy with empirical risk minimization in classi\ufb01cation. The quantity \u03c6 acts as a kind of \u201ctolerance\u201d by which the empirical mass estimate may deviate from the targeted value of \u03b1. Throughout this paper we assume that \u03c6 satis\ufb01es the following.\n\nDe\ufb01nition 1. We say \u03c6 is a (distribution free) complexity penalty for G if and only if for all distributions P and all \u03b4 \u2208 (0, 1),\n\nP^n({S : sup_{G \u2208 G} (|P(G) \u2212 bP(G)| \u2212 \u03c6(G, S, \u03b4)) > 0}) \u2264 \u03b4.\n\nThus, \u03c6 controls the rate of uniform convergence of bP(G) to P(G) for G \u2208 G. It is well known that the performance of ERM (for binary classi\ufb01cation) relative to the performance of the best classi\ufb01er in the given class is controlled by the uniform convergence of true to empirical probabilities. A similar result holds for MV-ERM.\n\nTheorem 1. If \u03c6 is a complexity penalty for G, then\n\nP^n((P(bGG,\u03b1) < \u03b1 \u2212 2\u03c6(bGG,\u03b1, S, \u03b4)) or (\u00b5(bGG,\u03b1) > \u00b5G,\u03b1)) \u2264 \u03b4.\n\nProof. Consider the sets\n\n\u0398P = {S : P(bGG,\u03b1) < \u03b1 \u2212 2\u03c6(bGG,\u03b1, S, \u03b4)},\n\u0398\u00b5 = {S : \u00b5(bGG,\u03b1) > \u00b5(GG,\u03b1)},\n\u2126P = {S : sup_{G \u2208 G} (|P(G) \u2212 bP(G)| \u2212 \u03c6(G, S, \u03b4)) > 0}.\n\nThe result follows easily from the following lemma.\n\nLemma 1. With \u0398P, \u0398\u00b5, and \u2126P de\ufb01ned as above and bGG,\u03b1 as de\ufb01ned in (2) we have \u0398P \u222a \u0398\u00b5 \u2282 \u2126P.\n\nThe proof of this lemma (see [6]) follows closely the proof of Lemma 1 in [7]. This result may be understood by analogy with the result from classi\ufb01cation that says R(bf) \u2212 inf_{f \u2208 F} R(f) \u2264 2 sup_{f \u2208 F} |R(f) \u2212 bR(f)| (see [10], Ch. 8). 
Here R and bR are the true and empirical risks, bf is the empirical risk minimizer, and F is a set of classi\ufb01ers. Just as this result relates uniform convergence bounds to empirical risk minimization in classi\ufb01cation, so does Lemma 1 relate uniform convergence to the performance of MV-ERM.\n\nThe theorem above allows direct translation of uniform convergence results into performance guarantees for MV-ERM. Fortunately, many penalties (uniform convergence results) are known. We now give two important examples, although many others, such as the Rademacher penalty, are possible.\n\n2.1 Example: VC Classes\n\nLet G be a class of sets with VC dimension V, and de\ufb01ne\n\n\u03c6(G, S, \u03b4) = \u221a( 32 (V log n + log(8/\u03b4)) / n ).   (3)\n\nBy a version of the VC inequality [10], we know that \u03c6 is a complexity penalty for G, and therefore Theorem 1 applies. To view this result in perhaps a more recognizable way, let \u03b5 > 0 and choose \u03b4 such that 2\u03c6(G, S, \u03b4) = \u03b5. By inverting the relationship between \u03b4 and \u03b5, we have the following.\n\nCorollary 1. With the notation de\ufb01ned above,\n\nP^n((P(bGG,\u03b1) < \u03b1 \u2212 \u03b5) or (\u00b5(bGG,\u03b1) > \u00b5G,\u03b1)) \u2264 8n^V e^{\u2212n\u03b5^2/128}.\n\nThus, for any \ufb01xed \u03b5 > 0, the probability of being within \u03b5 of the target mass \u03b1 and not exceeding the target volume \u00b5G,\u03b1 approaches one exponentially fast as the sample size increases. This result may also be used to calculate a distribution free upper bound on the sample size needed to be within a given tolerance \u03b5 of \u03b1 and with a given con\ufb01dence 1 \u2212 \u03b4. In particular, the sample size will grow no faster than a polynomial in 1/\u03b5 and 1/\u03b4, paralleling results for classi\ufb01cation.\n\n2.2 Example: Countable Classes\n\nSuppose G is a countable class of sets. 
Assume that to every G \u2208 G a number \u27e6G\u27e7 is assigned such that \u2211_{G \u2208 G} 2^{\u2212\u27e6G\u27e7} \u2264 1. In light of the Kraft inequality for pre\ufb01x codes, \u27e6G\u27e7 may be de\ufb01ned as the codelength of a codeword for G in a pre\ufb01x code for G. Let \u03b4 > 0 and de\ufb01ne\n\n\u03c6(G, S, \u03b4) = \u221a( (\u27e6G\u27e7 log 2 + log(2/\u03b4)) / (2n) ).   (4)\n\nBy Cherno\ufb00\u2019s bound together with the union bound, \u03c6 is a penalty for G. Therefore Theorem 1 applies and we have obtained a result analogous to the Occam\u2019s Razor bound for classi\ufb01cation.\n\nAs a special case, suppose G is \ufb01nite and take \u27e6G\u27e7 = log_2 |G|. Setting 2\u03c6(G, S, \u03b4) = \u03b5 and inverting the relationship between \u03b4 and \u03b5, we have\n\nCorollary 2. For the MV-ERM estimate bGG,\u03b1 from a \ufb01nite class G,\n\nP^n((P(bGG,\u03b1) < \u03b1 \u2212 \u03b5) or (\u00b5(bGG,\u03b1) > \u00b5G,\u03b1)) \u2264 2|G|e^{\u2212n\u03b5^2/2}.\n\n3 Consistency\n\nA minimum volume set estimator is consistent if its volume and mass tend to the optimal values \u00b5\u2217\u03b1 and \u03b1 as n \u2192 \u221e. Formally, de\ufb01ne the error quantity\n\nM(G) := (\u00b5(G) \u2212 \u00b5\u2217\u03b1)+ + (\u03b1 \u2212 P(G))+,\n\nwhere (x)+ = max(x, 0). (Note that without the (\u00b7)+ operator, this would not be a meaningful error since one term could be negative and cause M to tend to zero, even if the other error term does not go to zero.) We are interested in MV-set estimators such that M(bGG,\u03b1) tends to zero as n \u2192 \u221e.\n\nDe\ufb01nition 2. A learning rule bGG,\u03b1 is strongly consistent if lim_{n\u2192\u221e} M(bGG,\u03b1) = 0 with probability 1. If bGG,\u03b1 is strongly consistent for every possible distribution of X, then bGG,\u03b1 is strongly universally consistent.\n\nTo see how consistency might result from MV-ERM, it helps to rewrite Theorem 1 as follows. 
Let G be \ufb01xed and let \u03c6(G, S, \u03b4) be a penalty for G. Then with probability at least 1 \u2212 \u03b4, both\n\n\u00b5(bGG,\u03b1) \u2212 \u00b5\u2217\u03b1 \u2264 \u00b5(GG,\u03b1) \u2212 \u00b5\u2217\u03b1   (5)\n\nand\n\n\u03b1 \u2212 P(bGG,\u03b1) \u2264 2\u03c6(bGG,\u03b1, S, \u03b4)   (6)\n\nhold. We refer to the left-hand side of (5) as the excess volume of the class G and the left-hand side of (6) as the missing mass of bGG,\u03b1. The upper bounds on the right-hand sides are an approximation error and a stochastic error, respectively. The idea is to let G grow with n so that both errors tend to zero as n \u2192 \u221e. If G does not change with n, universal consistency is impossible.\n\nTo have both stochastic and approximation errors tend to zero, we apply MV-ERM to a class Gk from a sequence of classes G1, G2, . . ., where k = k(n) grows with the sample size. Consider the estimator bGGk,\u03b1.\n\nTheorem 2. Choose k = k(n) and \u03b4 = \u03b4(n) such that k(n) \u2192 \u221e as n \u2192 \u221e and \u2211_{n=1}^{\u221e} \u03b4(n) < \u221e. Assume the sequence of sets Gk and penalties \u03c6k satisfy\n\nlim_{k\u2192\u221e} inf_{G \u2208 Gk\u03b1} \u00b5(G) = \u00b5\u2217\u03b1   (7)\n\nand\n\nlim_{n\u2192\u221e} sup_{G \u2208 Gk\u03b1} \u03c6k(G, S, \u03b4(n)) = 0.   (8)\n\nThen bGGk,\u03b1 is strongly universally consistent.\n\nThe proof combines the Borel-Cantelli lemma and the distribution-free result of Theorem 1 with the stated assumptions. Examples satisfying the hypotheses of the theorem include families of VC classes with arbitrary approximating power (e.g., generalized linear discriminant rules with appropriately chosen basis functions and neural networks), and histogram rules. 
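The histogram case just mentioned admits a particularly simple MV-ERM computation: since every bin of a k-by-k histogram on the unit square has equal volume, minimizing volume subject to the empirical mass constraint amounts to keeping the highest-mass bins first. The following is an illustrative sketch only (not the authors' code); the sampling distribution and parameters are assumptions loosely modeled on the paper's experiments, and the penalty is the finite-class penalty of Corollary 2 with codelength k^2 bits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: Gaussian truncated to the unit square (rejection sampling).
n, k, alpha, delta = 10000, 20, 0.8, 0.05
X = rng.normal(0.5, 0.15, size=(4 * n, 2))
X = X[np.all((X >= 0) & (X <= 1), axis=1)][:n]

# Empirical mass of each of the k*k histogram bins.
idx = np.minimum((X * k).astype(int), k - 1)
counts = np.zeros((k, k))
np.add.at(counts, (idx[:, 0], idx[:, 1]), 1)
p_hat = counts.ravel() / n

# Finite-class penalty: |G| = 2^(k^2) unions of bins, codelength k^2 bits.
phi = np.sqrt((k * k * np.log(2) + np.log(2 / delta)) / (2 * n))

# MV-ERM: all bins have volume 1/k^2, so the minimum-volume union of bins
# with empirical mass >= alpha - phi keeps the highest-mass bins first.
order = np.argsort(p_hat)[::-1]
cum = np.cumsum(p_hat[order])
m = np.searchsorted(cum, alpha - phi) + 1   # number of bins kept
G_hat = order[:m]
print(f"bins kept: {m} of {k*k}, volume: {m/(k*k):.3f}, mass: {cum[m-1]:.3f}")
```

The greedy step is exact here only because all candidate bins share the same volume; for unequal cell volumes the minimization is no longer a simple sort.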
See [6] for further discussion.\n\n4 Structural Risk Minimization and an Oracle Inequality\n\nIn the previous section the rate of convergence of the two errors to zero is determined by the choice of k = k(n), which must be chosen a priori. Hence it is possible that the excess volume decays much more quickly than the missing mass, or vice versa. In this section we introduce a new rule called MV-SRM, inspired by the principle of structural risk minimization (SRM) from the theory of classi\ufb01cation [11, 12], that automatically balances the two errors.\n\nThe result in this section is not distribution free. We assume\n\nA1 P has a density f with respect to \u00b5.\nA2 G\u2217\u03b1 exists and P(G\u2217\u03b1) = \u03b1.\n\nUnder these assumptions (see [6]) there exists \u03b3\u03b1 > 0 such that for any MV-set G\u2217\u03b1, {x : f(x) > \u03b3\u03b1} \u2282 G\u2217\u03b1 \u2282 {x : f(x) \u2265 \u03b3\u03b1}.\n\nLet G be a class of sets. Conceptualize G as a collection of sets of varying capacities, such as a union of VC classes or a union of \ufb01nite classes. Let \u03c6(G, S, \u03b4) be a penalty for G. The MV-SRM principle selects the set\n\nbGG,\u03b1 = arg min_{G \u2208 G} {\u00b5(G) + \u03c6(G, S, \u03b4) : bP(G) \u2265 \u03b1 \u2212 \u03c6(G, S, \u03b4)}.   (9)\n\nNote that MV-SRM is di\ufb00erent from MV-ERM because it minimizes a complexity penalized volume instead of simply the volume. We have the following.1\n\n1Although the value of 1/\u03b3\u03b1 is in practice unknown, it can be bounded by 1/\u03b3\u03b1 \u2264 (1 \u2212 \u00b5\u2217\u03b1)/(1 \u2212 \u03b1) \u2264 1/(1 \u2212 \u03b1). This follows from the bound 1 \u2212 \u03b1 \u2264 \u03b3\u03b1 \u00b7 (1 \u2212 \u00b5\u2217\u03b1) on the mass outside the minimum volume set.\n\nTheorem 3. Let bGG,\u03b1 be the MV-set estimator in (9). 
With probability at least 1 \u2212 \u03b4 over the training sample S,\n\nM(bGG,\u03b1) \u2264 (1 + 1/\u03b3\u03b1) inf_{G \u2208 G\u03b1} { \u00b5(G) \u2212 \u00b5\u2217\u03b1 + 2\u03c6(G, S, \u03b4) }.   (10)\n\nSketch of proof: The proof is similar in some respects to oracle inequalities for classi\ufb01cation. The key di\ufb00erence is in the form of the error term M(G) = (\u00b5(G) \u2212 \u00b5\u2217\u03b1)+ + (\u03b1 \u2212 P(G))+. In classi\ufb01cation both approximation and stochastic errors are positive, whereas with MV-sets the excess volume \u00b5(G) \u2212 \u00b5\u2217\u03b1 or missing mass \u03b1 \u2212 P(G) could be negative. This necessitates the (\u00b7)+ operators, without which the error would not be meaningful as mentioned earlier. The proof considers three cases separately: (1) \u00b5(bGG,\u03b1) \u2265 \u00b5\u2217\u03b1 and P(bGG,\u03b1) < \u03b1, (2) \u00b5(bGG,\u03b1) \u2265 \u00b5\u2217\u03b1 and P(bGG,\u03b1) \u2265 \u03b1, and (3) \u00b5(bGG,\u03b1) < \u00b5\u2217\u03b1 and P(bGG,\u03b1) < \u03b1. In the \ufb01rst case, both volume and mass errors are positive and the argument follows standard lines. The second case can be seen to follow easily from the \ufb01rst. The third case (which occurs most frequently in practice) is most involved and requires use of the fact that \u00b5\u2217\u03b1 \u2212 \u00b5\u2217\u03b1\u2212\u03b5 \u2264 \u03b5/\u03b3\u03b1 for \u03b5 > 0, which can be deduced from basic properties of MV and density level sets.\n\nThe oracle inequality says that MV-SRM performs about as well as the set chosen by an oracle to optimize the tradeo\ufb00 between the stochastic and approximation errors. 
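Concretely, when G is a union of histogram classes indexed by bin-width, the per-class minimizer of the MV-SRM objective (9) is the MV-ERM estimate, so MV-SRM reduces to comparing the penalized volumes of the MV-ERM estimates across bin-widths. The sketch below illustrates this; it is a simplified illustration under assumed parameters (including an even delta split across the classes as the union-bound adjustment), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, delta = 5000, 0.8, 0.05
X = rng.normal(0.5, 0.15, size=(4 * n, 2))
X = X[np.all((X >= 0) & (X <= 1), axis=1)][:n]

def mv_erm_hist(X, k, alpha, phi):
    """MV-ERM over the k-by-k histogram class: every bin has volume 1/k^2,
    so keep highest-mass bins until empirical mass reaches alpha - phi."""
    idx = np.minimum((X * k).astype(int), k - 1)
    counts = np.zeros((k, k))
    np.add.at(counts, (idx[:, 0], idx[:, 1]), 1)
    p = np.sort(counts.ravel() / len(X))[::-1]
    cum = np.cumsum(p)
    m = np.searchsorted(cum, alpha - phi) + 1
    return m / (k * k), cum[m - 1]

ks = range(2, 41)
best = None
for k in ks:
    # Finite-class penalty, with delta split evenly across the |ks| classes.
    phi = np.sqrt((k * k * np.log(2) + np.log(2 * len(ks) / delta)) / (2 * n))
    vol, mass = mv_erm_hist(X, k, alpha, phi)
    score = vol + phi               # MV-SRM criterion: penalized volume
    if best is None or score < best[0]:
        best = (score, k, vol, mass)
score, k_star, vol, mass = best
print(f"MV-SRM selects bin-width 1/{k_star}: volume {vol:.3f}, mass {mass:.3f}")
```

Coarse partitions incur large approximation error (excess volume) while fine partitions incur a large penalty; the selected bin-width balances the two, which is exactly the tradeoff the oracle inequality quantifies.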
To illustrate the power of the oracle inequality, in [6] we demonstrate that\nMV-SRM applied to recursive dyadic partition-based estimators adapts optimally\nto the number of relevant features (unknown a priori).\n\n5 Damping the Penalty\n\nIn Theorem 1, the reader may have noticed that MV-ERM does not equitably bal-\n\nance the volume error with the mass error. Indeed, with high probability, \u00b5(bGG,\u03b1)\nis less than \u00b5(GG,\u03b1), while P (bGG,\u03b1) is only guaranteed to be within \u03c6(bGG,\u03b1) of\n\n\u03b1. The net e\ufb00ect is that MV-ERM (and MV-SRM) underestimates the MV-set.\nExperimental comparisons have con\ufb01rmed this to be the case [6] .\n\nA minor modi\ufb01cation of MV-ERM and MV-SRM leads to a more equitable distribu-\ntion of error between the volume and mass, instead of having all the error reside in\nthe mass term. The idea is simple: scale the penalty in the constraint by a damping\nfactor \u03bd < 1. In the case of MV-SRM, the penalty in the objective function also\nneeds to be scaled by 1 + \u03bd. Moreover, the theoretical properties of these estimators\nstated above are retained (the statements, omitted here, are slightly more involved\n[6] ). Notice that in the case \u03bd = 1 we recover the original estimators. Also note\nthat the above theorem encompasses the generalized quantile estimate of [3], which\ncorresponds to \u03bd = 0. Thus we have \ufb01nite sample size guarantees for that estimator\nto match Polonik\u2019s asymptotic analysis.\n\n6 Experiments: Histograms and Trees\n\nTo gain some insight into the basic properties of our estimators, we devised some\nsimple numerical experiments. In the case of histograms, MV-SRM can be imple-\nmented in a two step process. First, compute the MV-ERM estimate (a very simple\nprocedure) for each Gk, k = 1, . . . , K, where 1/k is the bin-width. 
Second, choose the \ufb01nal estimate by minimizing the penalized volume of the MV-ERM estimates.\n\nFigure 2: Results for histograms. (Left) A typical MV-ERM estimate with bin-width 1/20, \u03bd = 0, and based on 10000 points. True MV-set indicated by solid line. (Right) The error M(bGG,\u03b1) of the MV-SRM estimate as a function of sample size when \u03bd = 0, for the Occam and Rademacher penalties. The results indicate that the Occam\u2019s Razor bound is tighter and yields better performance than Rademacher.\n\nWe consider two penalties: one based on an Occam style bound, the other on the (conditional) Rademacher average. As a data set we consider X = [0, 1]^2, the unit square, and data generated by a two-dimensional truncated Gaussian distribution, centered at the point (1/2, 1/2) and having spherical variance with parameter \u03c3 = 0.15. Other parameter settings are \u03b1 = 0.8, K = 40, and \u03b4 = 0.05. All experiments were conducted at nine di\ufb00erent sample sizes, logarithmically spaced from 100 to 1000000, and repeated 100 times. Results are summarized in Figure 2.\n\nTo illustrate the potential improvement o\ufb00ered by spatially adaptive partitioning methods, we consider a minimum volume set estimator based on recursive dyadic (quadsplit) partitions. We employ a penalty that is additive over the cells A of the partition. The precise form of the penalty \u03c6(A) for each cell is given in [6], but loosely speaking it is proportional to the square-root of the ratio of the empirical mass of the cell to the sample size n. 
In this case, MV-SRM with \u03bd = 0 is\n\nmin_{G \u2208 GL} \u2211_A [\u00b5(A)\u2113(A) + \u03c6(A)]  subject to  \u2211_A bP(A)\u2113(A) \u2265 \u03b1   (11)\n\nwhere GL is the collection of all partitions with dyadic cell sidelengths no smaller than 2^{\u2212L} and \u2113(A) = 1 if A belongs to the candidate set and \u2113(A) = 0 otherwise (see [6] for further details). Although direct optimization appears formidable, an e\ufb03cient alternative is to consider the Lagrangian and conduct a bisection search over the Lagrange multiplier until the mass constraint is nearly achieved with equality (10 iterations are su\ufb03cient in practice). For each iteration, minimization of the Lagrangian can be performed very rapidly using standard tree pruning techniques.\n\nAn experimental demonstration of the dyadic partition estimator is depicted in Figure 1. In the experiments we employed a dyadic quadtree structure with L = 8 (i.e., cell sidelengths no smaller than 2^{\u22128}) and pruned according to the theoretical penalty \u03c6(A) formally de\ufb01ned in [6] weighted by a factor of 1/30 (in practice the optimal weight could be found via cross-validation or other techniques). Figure 1 shows the results with data distributed according to a two-component Gaussian mixture distribution. This \ufb01gure (middle image) additionally illustrates the improvement possible by \u201cvoting\u201d over shifted partitions, which in principle is equivalent to constructing 2^L \u00d7 2^L di\ufb00erent trees, each based on a partition o\ufb00set by an integer multiple of the base sidelength 2^{\u2212L}, and taking a majority vote over all the resulting set estimates to form the \ufb01nal estimate. This strategy mitigates the \u201cblocky\u201d structure due to the underlying dyadic partitions, and can be computed almost as rapidly as a single tree estimate (within a factor of L) due to the large amount of redundancy among trees. 
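The Lagrangian bisection described above can be sketched for a single fixed partition (a toy stand-in for the pruned quadtree; data and parameters are assumptions): for a fixed multiplier lambda, minimizing the per-cell Lagrangian includes exactly the cells with mu(A) < lambda * bP(A), and lambda is bisected until the mass constraint is (nearly) met. The full method additionally optimizes over partitions by tree pruning, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, alpha = 10000, 32, 0.9
X = rng.normal(0.5, 0.15, size=(4 * n, 2))
X = X[np.all((X >= 0) & (X <= 1), axis=1)][:n]

# Empirical mass and (equal) volume of each cell of a fixed k-by-k partition.
idx = np.minimum((X * k).astype(int), k - 1)
counts = np.zeros((k, k))
np.add.at(counts, (idx[:, 0], idx[:, 1]), 1)
p_hat = counts.ravel() / n
mu = np.full(k * k, 1.0 / (k * k))

def mass_at(lam):
    # Minimizing mu(A)*l(A) - lam * p_hat(A)*l(A) cell by cell includes
    # exactly the cells whose volume-to-mass tradeoff favors inclusion.
    keep = mu < lam * p_hat
    return p_hat[keep].sum(), keep

lo, hi = 0.0, 1e6
for _ in range(50):                 # bisection over the Lagrange multiplier
    mid = 0.5 * (lo + hi)
    m, _ = mass_at(mid)
    if m >= alpha:
        hi = mid
    else:
        lo = mid
mass, keep = mass_at(hi)
print(f"lambda={hi:.3f}, mass={mass:.3f}, volume={keep.sum()/(k*k):.3f}")
```

Because the included-cell mass is monotone in the multiplier, the bisection converges to the smallest multiplier whose estimate satisfies the mass constraint, mirroring the 10-iteration search described in the text.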
The actual running time was one to two seconds.\n\n7 Conclusions\n\nIn this paper we propose two rules, MV-ERM and MV-SRM, for estimation of minimum volume sets. Our theoretical analysis is made possible by relating the performance of these rules to the uniform convergence properties of the class of sets from which the estimate is taken. Ours are the \ufb01rst known results to feature \ufb01nite sample bounds, an oracle inequality, and universal consistency.\n\nAcknowledgements\n\nThe authors thank Ercan Yildiz and Rebecca Willett for their assistance with the experiments involving dyadic trees.\n\nReferences\n\n[1] I. Steinwart, D. Hush, and C. Scovel, \u201cA classi\ufb01cation framework for anomaly detection,\u201d J. Machine Learning Research, vol. 6, pp. 211\u2013232, 2005.\n\n[2] S. Ben-David and M. Lindenbaum, \u201cLearning distributions by their density levels \u2013 a paradigm for learning without a teacher,\u201d Journal of Computer and Systems Sciences, vol. 55, no. 1, pp. 171\u2013182, 1997.\n\n[3] W. Polonik, \u201cMinimum volume sets and generalized quantile processes,\u201d Stochastic Processes and their Applications, vol. 69, pp. 1\u201324, 1997.\n\n[4] G. Walther, \u201cGranulometric smoothing,\u201d Ann. Stat., vol. 25, pp. 2273\u20132299, 1997.\n\n[5] B. Sch\u00f6lkopf, J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson, \u201cEstimating the support of a high-dimensional distribution,\u201d Neural Computation, vol. 13, no. 7, pp. 1443\u20131472, 2001.\n\n[6] C. Scott and R. Nowak, \u201cLearning minimum volume sets,\u201d UW-Madison, Tech. Rep. ECE-05-2, 2005. [Online]. Available: http://www.stat.rice.edu/\u223ccscott\n\n[7] A. Cannon, J. Howse, D. Hush, and C. Scovel, \u201cLearning with the Neyman-Pearson and min-max criteria,\u201d Los Alamos National Laboratory, Tech. Rep. LA-UR 02-2951, 2002. [Online]. Available: http://www.c3.lanl.gov/\u223ckelly/ml/pubs/2002 minmax/paper.pdf\n\n[8] C. Scott and R. 
Nowak, \u201cA Neyman-Pearson approach to statistical learning,\u201d IEEE Trans. Inform. Theory, 2005, (in press).\n\n[9] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.\n\n[10] L. Devroye, L. Gy\u00f6rfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer, 1996.\n\n[11] V. Vapnik, Estimation of Dependencies Based on Empirical Data. New York: Springer-Verlag, 1982.\n\n[12] G. Lugosi and K. Zeger, \u201cConcept learning using complexity regularization,\u201d IEEE Trans. Inform. Theory, vol. 42, no. 1, pp. 48\u201354, 1996.\n", "award": [], "sourceid": 2851, "authors": [{"given_name": "Clayton", "family_name": "Scott", "institution": null}, {"given_name": "Robert", "family_name": "Nowak", "institution": null}]}