{"title": "Nearest-Neighbor-Based Active Learning for Rare Category Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 633, "page_last": 640, "abstract": null, "full_text": "Nearest-Neighbor-Based Active Learning for Rare\n\nCategory Detection\n\nJingrui He\n\nSchool of Computer Science\nCarnegie Mellon University\njingruih@cs.cmu.edu\n\nJaime Carbonell\n\nSchool of Computer Science\nCarnegie Mellon University\n\njgc@cs.cmu.edu\n\nAbstract\n\nRare category detection is an open challenge for active learning, especially in\nthe de-novo case (no labeled examples), but of signi\ufb01cant practical importance for\ndata mining - e.g. detecting new \ufb01nancial transaction fraud patterns, where normal\nlegitimate transactions dominate. This paper develops a new method for detecting\nan instance of each minority class via an unsupervised local-density-differential\nsampling strategy. Essentially a variable-scale nearest neighbor process is used to\noptimize the probability of sampling tightly-grouped minority classes, subject to\na local smoothness assumption of the majority class. Results on both synthetic\nand real data sets are very positive, detecting each minority class with only a frac-\ntion of the actively sampled points required by random sampling and by Pelleg\u2019s\nInterleave method, the prior best technique in the sparse literature on this topic.\n\n1 Introduction\n\nIn many real world problems, the proportion of data points in different classes is highly skewed:\nsome classes dominate the data set (majority classes), and the remaining classes may have only a\nfew examples (minority classes). However, it is very important to detect examples from the minority\nclasses via active learning. For example, in fraud detection tasks, most of the records correspond to\nnormal transactions, and yet once we identify a new type of fraud transaction, we are well on our\nway to stopping similar future fraud transactions [2]. 
Another example is in astronomy. Most of the objects in sky survey images are explainable by current theories and models. Only 0.001% of the objects are truly beyond the scope of current science and may lead to new discoveries [8]. Rare category detection is also a bottleneck in reducing the sampling complexity of active learning [1, 5]. The difference between rare category detection and outlier detection is that in rare category detection, the examples from one or more minority classes are often self-similar, potentially forming compact clusters, while in outlier detection, the outliers are typically scattered.\n
Currently, only a few methods have been proposed to address this challenge. For example, in [8], the authors assumed a mixture model to fit the data, and selected examples for labeling according to different criteria; in [6], the authors proposed a generic consistency algorithm, and proved upper and lower bounds for this algorithm in some specific situations. Most of the existing methods require that the majority classes and the minority classes be separable, or work best in the separable case. However, in real applications, the support regions of the majority and minority classes often overlap, which negatively affects the performance of these methods.\n
In this paper, we propose a novel method for rare category detection in the context of active learning. We typically start de novo, with no category labels, though our algorithm makes no such assumption. Different from existing methods, we aim to solve the hard case, i.e. we do not assume separability or near-separability of the classes. Intuitively, the method makes use of nearest neighbors to measure the local density around each example. In each iteration, the algorithm selects the example with the maximum change in local density on a certain scale, and asks the oracle for its label. 
The method stops once it has found at least one example from each class (given knowledge of the number of classes). When the minority classes form compact clusters and the majority class distribution is locally smooth, the method will select examples both on the boundary and in the interior of the minority classes, and it is proved to be effective theoretically. Experimental results on both synthetic and real data sets show the superiority of our method over existing methods.\n
The rest of the paper is organized as follows. In Section 2, we introduce our method and provide theoretical justification, first for binary classes and then for multiple classes. Section 3 gives experimental results. Finally, we conclude the paper in Section 4.\n\n
2 Rare category detection\n\n
2.1 Problem definition\n
Given a set of unlabeled examples S = {x1, . . . , xn}, xi \u2208 Rd, which come from m distinct classes, i.e. yi \u2208 {1, . . . , m}, the goal is to find at least one example from each class by requesting as few total labels as possible. For the sake of simplicity, assume that there is only one majority class, which corresponds to yi = 1, and that all the other classes are minority classes.\n\n
2.2 Rare category detection for the binary case\n
First let us focus on the simplest case where m = 2, and Pr[yi = 1] \u226b Pr[yi = 2] = p, i.e. p \u226a 1. Here, we assume that we have an a priori estimate of the value of p. Next, we introduce our method for rare category detection based on nearest neighbors, which is presented in Algorithm 1. The basic idea is to find maximum changes in local density, which might indicate the location of a rare category.\n
The algorithm works as follows. Given the unlabeled set S and the prior p of the minority class, we first estimate the number K of minority class examples in S. Then, for each example, we record its distance to its Kth nearest neighbor, which can be computed efficiently with kd-trees [7]. 
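As an illustration of the two steps just described, here is a brute-force sketch (our own rendering, not the authors' code; the function name and the use of Euclidean distance are assumptions, and a kd-tree [7] would replace the O(n^2) scan):

```python
import math

def kth_nn_distances(S, p):
    # For each example, compute the distance to its Kth nearest neighbor,
    # where K = round(n * p) is the estimated minority-class count.
    n = len(S)
    K = max(1, round(n * p))
    out = []
    for i, xi in enumerate(S):
        # sorted distances from xi to every other example (self excluded)
        d = sorted(math.dist(xi, xj) for j, xj in enumerate(S) if j != i)
        out.append(d[K - 1])  # distance to the Kth nearest neighbor
    return out
```

The minimum of this list (taken next in the text) pins the scale to the tightest K-point cluster, which is what makes the density counts sensitive to a compact minority class.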
The minimum of these distances over all the examples is assigned to r'. Next, we draw a hyper-ball of radius r' centered at each example, and count the number of examples enclosed by this hyper-ball, which is denoted ni; ni is roughly proportional to the local density. To measure the change of local density around a certain point xi, in each iteration of Step 3 we subtract the nj values of neighboring points from ni, and let the maximum difference be the score of xi. The example with the maximum score is selected for labeling by the oracle. If the example is from the minority class, we stop the iteration; otherwise, we enlarge the neighborhood within which the scores of the examples are re-calculated and continue.\n
Before giving the theoretical justification, we first give an intuitive explanation of why the algorithm works. Assume that the minority class is concentrated in a small region and that the probability density function (pdf) of the majority class is locally smooth. Firstly, since the support region of the minority class is very small, it is important to find its scale. The r' value obtained in Step 1 will be used to calculate the local density ni. Since r' is based on the minimum Kth nearest neighbor distance, it is never so large as to smooth out changes of local density, and thus it is a good measure of the scale. Secondly, the score of a certain point, corresponding to the change in local density, is the maximum of the difference in local density between this point and each of its neighboring points. In this way, we are able to select not only points on the boundary of the minority class, but also points in the interior, given that the region is small. 
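Putting these steps together, the following is a minimal brute-force sketch of the whole selection loop (our own illustration; the oracle callback, which returns label 1 or 2, and all names are assumptions, not the authors' implementation):

```python
import math

def nndb(S, p, oracle):
    # Brute-force sketch of the NNDB loop; the minority label is assumed to be 2.
    n = len(S)
    K = max(1, round(n * p))
    # Step 1: r' is the minimum Kth-nearest-neighbor distance over all examples.
    rp = min(sorted(math.dist(a, b) for j, b in enumerate(S) if j != i)[K - 1]
             for i, a in enumerate(S))
    # Step 2: local density n_i = number of examples within radius r'.
    dens = [sum(1 for b in S if math.dist(a, b) <= rp) for a in S]
    selected = set()
    for t in range(1, n + 1):  # Step 3: gradually enlarge the neighborhood.
        scores = [float('-inf')] * n
        for i, a in enumerate(S):
            if i in selected:
                continue
            nbrs = [j for j, b in enumerate(S) if math.dist(a, b) <= t * rp]
            scores[i] = max(dens[i] - dens[j] for j in nbrs)  # max density change
        q = max(range(n), key=scores.__getitem__)
        selected.add(q)
        if oracle(S[q]) == 2:  # stop once a minority-class example is found
            return S[q]
```

On data where a tight minority cluster sits inside a sparse majority, the density spike at the cluster's edge dominates the scores and the minority class is queried within a few labels.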
Finally, by gradually enlarging the neighborhood where the scores are calculated, we can further explore the interior of the support region, and increase our chance of finding a minority class example.\n\n
2.3 Correctness\n\n
In this subsection, we prove that if the minority class is concentrated in a small region and the pdf of the majority class is locally smooth, the proposed algorithm will repeatedly sample, with high probability, in the region where minority class examples occur.\n
Let f1(x) and f2(x) denote the pdf of the majority and minority classes respectively, where x \u2208 Rd. To be precise, we make the following assumptions.\n\n
Algorithm 1 Nearest-Neighbor-Based Rare Category Detection for the Binary Case (NNDB)\n
Require: S, p\n
1: Let K = np. For each example, calculate the distance to its Kth nearest neighbor. Set r' to be the minimum value among all the examples.\n
2: \u2200xi \u2208 S, let NN(xi, r') = {x | x \u2208 S, \u2016x \u2212 xi\u2016 \u2264 r'}, and ni = |NN(xi, r')|.\n
3: for t = 1 : n do\n
4: \u2200xi \u2208 S, if xi has not been selected, then si = max_{xj \u2208 NN(xi, tr')} (ni \u2212 nj); otherwise, si = \u2212\u221e.\n
5: Query x = arg max_{xi \u2208 S} si.\n
6: If the label of x is 2, break.\n
7: end for\n\n
Assumptions\n
1. f2(x) is uniform within a hyper-ball B of radius r centered at b, i.e. f2(x) = 1/V(r) if x \u2208 B, and 0 otherwise, where V(r) \u221d r^d is the volume of B.\n
2. f1(x) is bounded and positive in B, i.e. f1(x) \u2265 c1\u00b7p/((1 \u2212 p)V(r)), \u2200x \u2208 B, and f1(x) \u2264 c2\u00b7p/((1 \u2212 p)V(r)), \u2200x \u2208 Rd, where c1, c2 > 0 are two constants. (Notice that here we are only dealing with the hard case where f1(x) is positive within B. In the separable case where the support regions of the two classes do not overlap, we can use other methods to detect the minority class, such as the one proposed in [8].)\n\n
With the above assumptions, we have the following claim and theorem. Note that variants of the following proof apply if we assume a different minority class distribution, such as a tight Gaussian.\n\n
Claim 1. \u2200\u03b5, \u03b4 > 0, if n \u2265 max{ (1/(2c1^2 p^2)) log(3/\u03b4), (1/(2(1 \u2212 2^(\u2212d))^2 p^2)) log(3/\u03b4), (1/(\u03b5^4 V(r2/2)^4)) log(3/\u03b4) }, where r2 = r/(1 + c2)^(1/d), then with probability at least 1 \u2212 \u03b4, r2/2 \u2264 r' \u2264 r and |ni/n \u2212 E(ni/n)| \u2264 \u03b5V(r'), 1 \u2264 i \u2264 n, where V(r') and V(r2/2) are the volumes of hyper-balls with radius r' and r2/2 respectively.\n
Proof. First, notice that the expected proportion of points falling inside B satisfies E(|NN(b, r)|/n) \u2265 (c1 + 1)p, and that the maximum expected proportion of points falling inside any hyper-ball of radius r2/2 satisfies max_{x \u2208 Rd} E(|NN(x, r2/2)|/n) \u2264 2^(\u2212d) p. Then\n
Pr[r' > r or r' < r2/2 or \u2203xi \u2208 S s.t. |ni/n \u2212 E(ni/n)| > \u03b5V(r')]\n
\u2264 Pr[r' > r] + Pr[r' < r2/2] + Pr[\u2203xi \u2208 S s.t. |ni/n \u2212 E(ni/n)| > \u03b5V(r') | r' \u2265 r2/2]\n
\u2264 Pr[|NN(b, r)| < K] + Pr[max_{x \u2208 Rd} |NN(x, r2/2)| > K] + n\u00b7Pr[|ni/n \u2212 E(ni/n)| > \u03b5V(r') | r' \u2265 r2/2]\n
= Pr[|NN(b, r)|/n < p] + Pr[max_{x \u2208 Rd} |NN(x, r2/2)|/n > p] + n\u00b7Pr[|ni/n \u2212 E(ni/n)| > \u03b5V(r') | r' \u2265 r2/2]\n
\u2264 e^(\u22122n c1^2 p^2) + e^(\u22122n(1 \u2212 2^(\u2212d))^2 p^2) + 2n\u00b7e^(\u22122n \u03b5^2 V(r')^2)\n
where the last inequality is based on the Hoeffding bound. Letting e^(\u22122n c1^2 p^2) \u2264 \u03b4/3, e^(\u22122n(1 \u2212 2^(\u2212d))^2 p^2) \u2264 \u03b4/3, and 2n\u00b7e^(\u22122n \u03b5^2 V(r')^2) \u2264 2n\u00b7e^(\u22122n \u03b5^2 V(r2/2)^2) \u2264 \u03b4/3, we obtain the three lower bounds on n above. \u25a1\n\n
Based on Claim 1, we get the following theorem, which shows the effectiveness of the proposed method.\n\n
Main Theorem. Assume that:\n
1. The minimum distance between the points inside B and the ones outside the hyper-ball B2 centered at b with radius 2r is not too large, i.e. min{\u2016xi \u2212 xj\u2016 | xi, xj \u2208 S, \u2016xi \u2212 b\u2016 \u2264 r, \u2016xj \u2212 b\u2016 > 2r} \u2264 \u03b1, where \u03b1 is a positive parameter.\n
2. f1(x) is locally smooth, i.e. \u2200x, y \u2208 Rd, |f1(x) \u2212 f1(y)| \u2264 \u03b2\u2016x \u2212 y\u2016/\u03b1, where \u03b2 \u2264 p^2\u00b7OV(r2/2, r)/(2^(d+1) V(r)^2), and OV(r2/2, r) is the volume of the overlapping region of two hyper-balls: one of radius r, the other of radius r2/2 with its center on the sphere of the bigger one.\n
3. The number of examples is sufficiently large, i.e. n \u2265 max{ (1/(2c1^2 p^2)) log(3/\u03b4), (1/(2(1 \u2212 2^(\u2212d))^2 p^2)) log(3/\u03b4), (1/((1 \u2212 p)^4 \u03b2^4 V(r2/2)^4)) log(3/\u03b4) }.\n
Then with probability at least 1 \u2212 \u03b4, after at most \u23082\u03b1/r2\u2309 iterations, NNDB will query at least one example whose probability of coming from the minority class is at least 1/3, and it will continue querying such examples until the \u230a(2^d/(p(1 \u2212 p)) \u2212 2)\u00b7\u03b1/r\u230bth iteration.\n
Proof. Based on Claim 1 and condition 3, if the number of examples is sufficiently large, then with probability at least 1 \u2212 \u03b4, r2/2 \u2264 r' \u2264 r and |ni/n \u2212 E(ni/n)| \u2264 (1 \u2212 p)\u03b2V(r'), 1 \u2264 i \u2264 n. According to condition 2, \u2200xi, xj \u2208 S s.t. \u2016xi \u2212 b\u2016 > 2r, \u2016xj \u2212 b\u2016 > 2r and \u2016xi \u2212 xj\u2016 \u2264 \u03b1, E(ni/n) and E(nj/n) will not be affected by the minority class, and |E(ni/n) \u2212 E(nj/n)| \u2264 (1 \u2212 p)\u03b2V(r') \u2264 (1 \u2212 p)\u03b2V(r). Note that \u03b1 is always bigger than r. Based on the above inequalities, we have\n
|ni/n \u2212 nj/n| \u2264 |ni/n \u2212 E(ni/n)| + |nj/n \u2212 E(nj/n)| + |E(ni/n) \u2212 E(nj/n)| \u2264 3(1 \u2212 p)\u03b2V(r)  (1)\n
From inequality (1), it is not hard to see that \u2200xi, xj \u2208 S s.t. \u2016xi \u2212 b\u2016 > 2r and \u2016xi \u2212 xj\u2016 \u2264 \u03b1, ni/n \u2212 nj/n \u2264 3(1 \u2212 p)\u03b2V(r), i.e. when tr' = \u03b1,\n
si/n \u2264 3(1 \u2212 p)\u03b2V(r)  (2)\n
This is because if \u2016xj \u2212 b\u2016 \u2264 2r, the minority class may also contribute to nj/n, and thus the score may be even smaller.\n
On the other hand, based on condition 1, there exist two points xk, xl \u2208 S s.t. \u2016xk \u2212 b\u2016 \u2264 r, \u2016xl \u2212 b\u2016 > 2r, and \u2016xk \u2212 xl\u2016 \u2264 \u03b1. Since the contribution of the minority class to E(nk/n) is at least p\u00b7OV(r2/2, r)/V(r), we have E(nk/n) \u2212 E(nl/n) \u2265 p\u00b7OV(r2/2, r)/V(r) \u2212 (1 \u2212 p)\u03b2V(r). Since for any example xi \u2208 S, |ni/n \u2212 E(ni/n)| \u2264 (1 \u2212 p)\u03b2V(r') \u2264 (1 \u2212 p)\u03b2V(r), therefore\n
nk/n \u2212 nl/n \u2265 p\u00b7OV(r2/2, r)/V(r) \u2212 3(1 \u2212 p)\u03b2V(r) \u2265 p\u00b7OV(r2/2, r)/V(r) \u2212 3(1 \u2212 p)p^2\u00b7OV(r2/2, r)/(2^(d+1) V(r))\n
Since p is very small, p \u226b 3(1 \u2212 p)p^2/2^(d+1); therefore, nk/n \u2212 nl/n > 3(1 \u2212 p)\u03b2V(r), i.e. when tr' = \u03b1,\n
sk/n > 3(1 \u2212 p)\u03b2V(r)  (3)\n
In Step 4 of the proposed method, we gradually enlarge the neighborhood used to calculate the change of local density. When tr' = \u03b1, based on inequalities (2) and (3), \u2200xi \u2208 S with \u2016xi \u2212 b\u2016 > 2r, we have sk > si. Therefore, in this round of iteration, we will pick an example from B2. In order for tr' to be equal to \u03b1, the value of t would be \u2308\u03b1/r'\u2309 \u2264 \u23082\u03b1/r2\u2309.\n
If we further increase t so that tr' = c\u03b1, where c > 1, we have the following conclusion: \u2200xi, xj \u2208 S s.t. \u2016xi \u2212 b\u2016 > 2r and \u2016xi \u2212 xj\u2016 \u2264 c\u03b1, ni/n \u2212 nj/n \u2264 (c + 2)(1 \u2212 p)\u03b2V(r), i.e. si/n \u2264 (c + 2)(1 \u2212 p)\u03b2V(r). As long as p \u2265 (c + 2)(1 \u2212 p)p^2/2^d, i.e. c \u2264 2^d/(p(1 \u2212 p)) \u2212 2, then \u2200xi \u2208 S with \u2016xi \u2212 b\u2016 > 2r, sk > si, and we will pick examples from B2. Since r' \u2264 r, the method will continue querying examples in B2 until the \u230a(2^d/(p(1 \u2212 p)) \u2212 2)\u00b7\u03b1/r\u230bth iteration.\n
Finally, we show that the probability of picking a minority class example from B2 is at least 1/3. To this end, we need to calculate the maximum probability mass of the majority class within B2. Consider the case where the maximum value of f1(x) occurs at b, and this pdf decreases by \u03b2 every time x moves away from b in the direction of the radius by \u03b1, i.e. the shape of f1(x) is a cone in (d + 1)-dimensional space. Since f1(x) must integrate to 1, i.e. V(\u03b1f1(b)/\u03b2)\u00b7f1(b)/(d + 1) = 1, where V(\u03b1f1(b)/\u03b2) is the volume of a hyper-ball with radius \u03b1f1(b)/\u03b2, we have f1(b) = ((d + 1)/V(\u03b1))^(1/(d+1)) \u03b2^(d/(d+1)). Therefore, the probability mass of the majority class within B2 is:\n
V(2r)(f1(b) \u2212 (2r/\u03b1)\u03b2) + (2r/\u03b1)\u00b7(\u03b2/(d + 1))\u00b7V(2r) < V(2r)f1(b) = V(2r)\u00b7((d + 1)/V(\u03b1))^(1/(d+1))\u00b7\u03b2^(d/(d+1)) = 2^d V(r)\u00b7(d + 1)^(1/(d+1))\u00b7\u03b2^(d/(d+1))/V(\u03b1)^(1/(d+1)) < (d + 1)^(1/(d+1))\u00b7(2^(d+1) V(r)\u03b2)^(d/(d+1)) \u2264 (d + 1)^(1/(d+1))\u00b7(p^2\u00b7OV(r2/2, r)/V(r))^(d/(d+1)) < 2p\n
where V(2r) is the volume of a hyper-ball with radius 2r. Therefore, if we select a point at random from B2, the probability that this point is from the minority class is at least p/(p + (1 \u2212 p)\u00b72p) \u2265 p/(p + 2p) = 1/3. \u25a1\n\n
2.4 Rare category detection for multiple classes\n\n
In subsection 2.2, we have discussed rare category detection for the binary case. In this subsection, we focus on the case where m > 2. To be specific, let p1, . . . , pm be the priors of the m classes, with p1 \u226b pi, i \u2260 1. 
Our goal is to use as few label requests as possible to find at least one example from each class.\n
The method proposed in subsection 2.2 can be easily generalized to multiple classes, as presented in Algorithm 2. In this algorithm, we are given the priors of all the minority classes. Using each pi, we estimate the number Ki of examples from this class, and calculate the corresponding r'i value in the same manner as NNDB. Then, we calculate the local density at each example based on the different scales r'i. In the outer loop of Step 9, we compute the r' value, which is the minimum of all the r'i whose corresponding classes have not been discovered yet, together with its index. In the inner loop of Step 11, we gradually enlarge the neighborhood used to calculate the score of each example. This is the same as NNDB, except that we preclude the examples that are within a certain distance of any selected example from being selected. This heuristic avoids repeatedly selecting examples from the same discovered class. The inner loop stops when we find an example from an undiscovered class; then we update the r' value and resume the inner loop. If the minority classes form compact clusters and are far apart from each other, NNDM is able to detect examples from each minority class with a small number of label requests.\n\n
Algorithm 2 Nearest-Neighbor-Based Rare Category Detection for Multiple Classes (NNDM)\n
Require: S, p2, . . . , pm\n
1: for i = 2 : m do\n
2: Let Ki = npi.\n
3: For each example, calculate the distance between this example and its Ki-th nearest neighbor. Set r'i to be the minimum value among all the examples.\n
4: end for\n
5: Let r'1 = max_{2 \u2264 i \u2264 m} r'i.\n
6: for i = 1 : m do\n
7: \u2200xj \u2208 S, let NN(xj, r'i) = {x | x \u2208 S, \u2016x \u2212 xj\u2016 \u2264 r'i}, and n^i_j = |NN(xj, r'i)|.\n
8: end for\n
9: while not all the classes have been discovered do\n
10: Let r' = min{r'i | 1 \u2264 i \u2264 m, and class i has not been discovered}, and let s be the corresponding index, i.e. r' = r's.\n
11: for t = 1 : n do\n
12: For each xi that has been selected and labeled yi: \u2200x \u2208 S s.t. \u2016x \u2212 xi\u2016 \u2264 r'yi, set the score of x to \u2212\u221e; for all the other examples, si = max_{xj \u2208 NN(xi, tr')} (n^s_i \u2212 n^s_j).\n
13: Query x = arg max_{xi \u2208 S} si.\n
14: If x belongs to a class that has not been discovered, break.\n
15: end for\n
16: end while\n\n
In NNDB and NNDM, we need the priors of the minority classes as input. As we will see in the next section, our algorithms are robust against small perturbations in the priors.\n\n
3 Experimental results\n\n
In this section, we compare our methods (NNDB and NNDM) with the best method proposed in [8] (Interleave) and with random sampling (RS) on both synthetic and real data sets. In Interleave, we use the number of classes as the number of components in the mixture model. For both Interleave and RS, we run the experiment multiple times and report the average results.\n\n
3.1 Synthetic data sets\n\n
Figure 1(a) shows a synthetic data set where the pdf of the majority class is Gaussian and the pdf of the minority class is uniform within a small hyper-ball. There are 1000 examples from the majority class and only 10 examples from the minority class. 
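As a concrete aside, the multi-class procedure of Algorithm 2 (NNDM) can be sketched as follows (a brute-force illustration under our own naming, with an assumed oracle callback; class 1 is the majority class and this is not the authors' implementation):

```python
import math

def nndm(S, priors, oracle, m):
    # Sketch of NNDM: per-class scales r'_i, per-scale densities, and a query
    # loop that masks the neighborhood of every already-labeled example.
    n = len(S)
    dist = [[math.dist(a, b) for b in S] for a in S]
    r = {}
    for c in range(2, m + 1):                       # minority classes 2..m
        K = max(1, round(n * priors[c]))
        r[c] = min(sorted(row)[K] for row in dist)  # row[0] is the point itself
    r[1] = max(r[c] for c in range(2, m + 1))       # scale for the majority class
    dens = {c: [sum(1 for d in row if d <= rc) for row in dist]
            for c, rc in r.items()}
    discovered, labeled = set(), []                 # labeled: (index, class) pairs
    while len(discovered) < m:
        # smallest scale among the classes not yet discovered
        s = min((c for c in r if c not in discovered), key=r.get)
        for t in range(1, n + 1):
            scores = [float('-inf')] * n
            for i in range(n):
                # mask points close to any already-labeled example
                if any(dist[i][j] <= r[y] for j, y in labeled):
                    continue
                nbrs = [j for j in range(n) if dist[i][j] <= t * r[s]]
                scores[i] = max(dens[s][i] - dens[s][j] for j in nbrs)
            q = max(range(n), key=scores.__getitem__)
            y = oracle(S[q])
            labeled.append((q, y))
            if y not in discovered:
                discovered.add(y)
                break
    return labeled
```

The masking step is the heuristic mentioned above: once a class has been found, its neighborhood is excluded so the next queries move on to undiscovered regions.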
Using Interleave, we need to label 35 examples; using RS, we need to label 101 examples; and using NNDB, we only need to label 3 examples in order to sample one from the minority class. The selected examples are denoted as 'x' in Figure 1(b). Notice that the first 2 examples that NNDB selects are not from the correct region. This is because the number of examples from the minority class is very small, and the local density may be affected by the randomness in the data.\n\n
(a) Data Set\n
(b) Examples Selected by NNDB, denoted as 'x'\n
Figure 1: Synthetic Data Set 1.\n\n
In Figure 2(a), the X-shaped data consisting of 3000 examples corresponds to the majority class, and the four characters 'NIPS' correspond to four minority classes, which consist of 138, 79, 118, and 206 examples respectively. Using Interleave, we need to label 1190 examples; using RS, we need to label 83 examples; and using NNDM, we only need to label 5 examples in order to get one from each of the minority classes. The selected examples are denoted as 'x' in Figure 2(b). Notice that in this example, Interleave is even worse than RS. This might be because some minority classes are located in the region where the density of the majority class is not negligible, and thus may be 'explained' by the majority-class mixture-model component.\n\n
3.2 Real data sets\n\n
In this subsection, we compare the different methods on two real data sets: Abalone [3] and Shuttle [4]. The first data set consists of 4177 examples, described by 7-dimensional features. The examples come from 20 classes: the proportion of the largest class is 16.50%, and the proportion of the smallest class is 0.34%. For the second data set, we sub-sample the original training set to produce a smaller data set with 4515 examples, described by 9-dimensional features. 
The examples come from 7 classes: the proportion of the largest class is 75.53%, and the proportion of the smallest class is 0.13%.\n
The comparison results are shown in Figure 3(a) and Figure 3(b) respectively. From these figures, we can see that NNDM is significantly better than Interleave and RS: on the Abalone data set, to find all the classes, Interleave needs 280 label requests, RS needs 483 label requests, and NNDM only needs 125 label requests; on the Shuttle data set, to find all the classes, Interleave needs 140 label requests, RS needs 512 label requests, and NNDM only needs 87 label requests. This is because as the number of components becomes larger, the mixture model generated by Interleave is less reliable due to the lack of labeled examples, and thus we need to select more examples. Furthermore, the majority and minority classes may not be near-separable, which is a disaster for Interleave. On the other hand, NNDM does not assume a generative model for the data, and focuses only on the change in local density, which is more effective on these two data sets.\n\n
(a) Data Set\n
(b) Examples Selected by NNDM, denoted as 'x'\n
Figure 2: Synthetic Data Set 2.\n\n
(a) Abalone\n
(b) Shuttle\n
Figure 3: Learning Curves for Real Data Sets\n\n
3.3 Imprecise priors\n\n
The proposed algorithms need the priors of the minority classes as input. In this subsection, we test the robustness of NNDM against modest mis-estimations of the class priors. The performance of NNDB is similar to that of NNDM, so we omit those results here. In the experiments, we use the same data sets as in subsection 3.2, and add/subtract 5%, 10%, and 20% from the true priors of the minority classes. The results are shown in Figure 4. From these figures, we can see that NNDM is very robust to small perturbations in the priors. 
For example, on the Abalone data set, if we subtract 10% from the true priors, only one more label request is needed in order to find all the classes.\n\n
(a) Abalone\n
(b) Shuttle\n
Figure 4: Robustness Study\n\n
4 Conclusion\n\n
In this paper, we have proposed a novel method for rare category detection, useful for de-novo active learning in serious applications. Different from existing methods, our method does not rely on the assumption that the data is near-separable. It works by selecting examples corresponding to regions with the maximum change in local density, and depending on the scaling, it will select class-boundary or class-internal samples of the minority classes. The method could be scaled up using kd-trees [7]. The effectiveness of the proposed method is guaranteed by theoretical justification, and its superiority over existing methods is demonstrated by extensive experimental results on both synthetic and real data sets. Moreover, it is very robust to modest perturbations in the estimated class priors.\n\n
Acknowledgments\n\n
This paper is based on work in part supported by the Defense Advanced Research Projects Agency (DARPA) under contract number NBCHD030010.\n\n
References\n
[1] M. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proc. of the 23rd Int. Conf. on Machine Learning, pages 65\u201372, 2006.\n
[2] S. Bay, K. Kumaraswamy, M. Anderle, R. Kumar, and D. Steier. Large scale detection of irregularities in accounting data. In Proc. of the 6th Int. Conf. on Data Mining, pages 75\u201386, 2006.\n
[3] C. Blake and C. 
Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/ machine/MLRepository.html, 1998.\n
[4] P. Brazdil and J. Gama. Statlog repository. http://www.niaad.liacc.up.pt/old/statlog/datasets/shuttle/shuttle.doc.html, 1991.\n
[5] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 19, 2005.\n
[6] S. Fine and Y. Mansour. Active sampling for multiple output identification. In The 19th Annual Conf. on Learning Theory, pages 620\u2013634, 2006.\n
[7] A. Moore. A tutorial on kd-trees. Technical report, University of Cambridge Computer Laboratory, 1991.\n
[8] D. Pelleg and A. Moore. Active learning for anomaly and rare-category detection. In Advances in Neural Information Processing Systems 18, 2004.\n", "award": [], "sourceid": 51, "authors": [{"given_name": "Jingrui", "family_name": "He", "institution": null}, {"given_name": "Jaime", "family_name": "Carbonell", "institution": null}]}