{"title": "Rapid Distance-Based Outlier Detection via Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 467, "page_last": 475, "abstract": "Distance-based approaches to outlier detection are popular in data mining, as they do not require to model the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search.", "full_text": "Rapid Distance-Based Outlier Detection via Sampling\n\nMahito Sugiyama1 Karsten M. Borgwardt1;2\n\n1Machine Learning and Computational Biology Research Group, MPIs T\u00a8ubingen, Germany\n\n2Zentrum f\u00a8ur Bioinformatik, Eberhard Karls Universit\u00a8at T\u00a8ubingen, Germany\nfmahito.sugiyama,karsten.borgwardtg@tuebingen.mpg.de\n\nAbstract\n\nDistance-based approaches to outlier detection are popular in data mining, as they\ndo not require to model the underlying probability distribution, which is particu-\nlarly challenging for high-dimensional data. We present an empirical comparison\nof various approaches to distance-based outlier detection across a large number\nof datasets. We report the surprising observation that a simple, sampling-based\nscheme outperforms state-of-the-art techniques in terms of both ef\ufb01ciency and ef-\nfectiveness. 
To better understand this phenomenon, we provide a theoretical analysis of why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search.

1 Introduction

An outlier, "an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins [10]), appears in many real-life situations. Examples include intrusions in network traffic, credit card fraud, defective products in industry, and misdiagnosed patients. To discriminate such outliers from normal observations, machine learning and data mining research has produced numerous outlier detection methods: traditional model-based approaches using statistical tests, convex hull layers, or changes of variances, and more recent distance-based approaches using k-nearest neighbors [18], clusters [23], or densities [7] (for reviews, see [1, 13]).

We focus in this paper on the latter, the distance-based approaches, which define outliers as objects located far away from the remaining objects. More specifically, given a metric space (M, d), each object x ∈ M receives a real-valued outlierness score q(x) via a function q : M → ℝ, where q(x) depends on the distances between x and the other objects in the dataset. Then the top-κ objects with maximum outlierness scores are reported as outliers. To date, this approach has been successfully applied in various situations due to its flexibility: it does not require determining or fitting an underlying probability distribution, which is often difficult, in particular in high-dimensional settings.
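As a concrete illustration of this scoring framework (our own sketch, not code from the paper), the classic kth-NN distance score of [18] together with top-κ reporting can be written in a few lines; the function and variable names here are our illustrative choices:

```python
import math

def kth_nn_score(data, x, k=1):
    """Outlierness q(x): distance from x to its kth nearest neighbor in data."""
    dists = sorted(math.dist(x, y) for y in data if y != x)
    return dists[k - 1]

def top_outliers(data, kappa, k=1):
    """Report the top-kappa objects with the largest outlierness scores."""
    ranked = sorted(data, key=lambda x: kth_nn_score(data, x, k), reverse=True)
    return ranked[:kappa]

# A tight cluster plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(points, kappa=1))  # [(5.0, 5.0)]
```

The naive nested search above already exhibits the quadratic cost discussed next, which is exactly what the speed-up techniques and the sampling scheme of this paper try to avoid.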
For example, LOF (Local Outlier Factor) [7] has become one of the most popular outlier detection methods; it measures the outlierness of each object by the difference in local density between the object and its neighbors.

The main challenge, however, is scalability, since this approach potentially requires computing all pairwise distances between objects in a dataset. This quadratic time complexity leads to runtime problems on the massive datasets that emerge across application domains. To avoid this high computational cost, a number of techniques have been proposed, which can be roughly divided into two strategies: indexing of objects, using tree-based [5] or projection-based [9] structures, and partial computation of the pairwise distances to compute scores only for the top-κ outliers, first introduced by Bay and Schwabacher [4] and improved in [6, 16]. Unfortunately, neither strategy is sufficient nowadays: index structures are often not efficient enough for high-dimensional data [20], and the number of outliers often increases in direct proportion to the size of the dataset, which significantly deteriorates the efficiency of partial computation techniques.

Here we show, by conducting an extensive empirical analysis, that a surprisingly simple and rapid sampling-based outlier detection method outperforms state-of-the-art distance-based methods in terms of both efficiency and effectiveness. The proposed method behaves as follows: it takes a small set of samples from a given set of objects and then measures the outlierness of each object by the distance from the object to its nearest neighbor in the sample set. Intuitively, the sample set is employed as a telltale set, that is, it serves as an indicator of outlierness, as outliers should by definition be significantly different from almost all objects, including the objects in the sample set.
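A minimal sketch of this scheme (illustrative Python under our own naming, not the authors' C implementation):

```python
import math
import random

def one_time_sampling_scores(data, s, seed=0):
    """Sketch of the proposed scheme: draw one random sample of size s,
    then score every object by its distance to the nearest sampled object."""
    rng = random.Random(seed)
    sample = rng.sample(data, s)   # only the sample set is kept in memory
    return [min(math.dist(x, y) for y in sample) for x in data]

rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
data.append((50.0, 50.0))          # one far-away object
scores = one_time_sampling_scores(data, s=20)
# The far-away object either lands in the sample itself (score 0)
# or receives the largest score of all objects.
assert scores[-1] == 0.0 or scores[-1] == max(scores)
```

Each object is compared against the sample only, which is what makes the single pass and the constant memory footprint discussed next possible.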
The time complexity is therefore linear in the number of objects, dimensions, and samples. In addition, this method can be implemented in a one-pass manner with constant space complexity, as we only have to store the sample set, which is ideal for analyzing massive datasets.

This paper is organized as follows: In Section 2, we describe our experimental design for the empirical comparison of different outlier detection strategies. In Section 3, we review a number of state-of-the-art outlier detection methods used in our experiments, including our own proposal. We present experimental results in Section 4 and analyze them theoretically in Section 5.

2 Experimental Design

We present an extensive empirical analysis of state-of-the-art approaches for distance-based outlier detection and of our new approach, all of which are introduced in Section 3. They are evaluated in terms of both scalability and effectiveness on synthetic and real-world datasets. All parameters are set by referring to the original literature or to popular values, as also given in Section 3. Note that these parameters have to be chosen heuristically in distance-based approaches; even so, these approaches still outperform others, such as statistical approaches [3].

Environment. We used Ubuntu version 12.04.3 with a single 2.6 GHz AMD Opteron CPU and 512 GB of memory. All C code was compiled with gcc 4.6.3. All experiments were performed in the R environment, version 3.0.1.

Evaluation criterion. To evaluate the effectiveness of each method, we used the area under the precision-recall curve (AUPRC; equivalent to the average precision), which is a typical criterion for measuring the success of outlier detection methods [1]. It takes values between 0 and 1, where 1 is the best score, and quantifies whether the algorithm is able to retrieve outliers correctly. These values were calculated with the R ROCR package.

Datasets.
We collected 14 real-world datasets from the UCI machine learning repository [2], with a wide range of sizes and dimensions, whose properties are summarized in Table 1. Most of them have been used intensively in the outlier detection literature. In particular, KDD1999 is one of the most popular benchmark datasets in outlier detection; it was originally used for the KDD Cup 1999. The task is to detect intrusions in network traffic data, and as in [22], objects whose attribute logged_in is positive were chosen as outliers. In every dataset, we first excluded all categorical attributes and missing values, since some methods cannot handle categorical attributes. For all datasets except KDD1999, we assume that objects from the smallest class are outliers, as these datasets were originally designed for classification rather than outlier detection. Three datasets, Mfeat, Isolet, and Optdigits, were prepared in exactly the same way as in [17], where only two similar classes were used as inliers. All datasets were normalized beforehand, that is, in each dimension the feature values were divided by their standard deviation [1, Chapter 12.10].

In addition, we generated two synthetic datasets (Gaussian) using exactly the same procedure as [14, 17], of which one is high-dimensional (1000 dimensions) and the other is large (10,000,000 objects). For each dataset, inliers (non-outliers) were generated from a Gaussian mixture model with five equally weighted processes, resulting in five clusters. The mean and the variance of each cluster were randomly drawn from the Gaussian distribution N(0, 1), and 30 outliers were generated from a uniform distribution ranging from the minimum to the maximum values of the inliers.

3 Methods for Outlier Detection

In the following, we introduce the state-of-the-art methods in distance-based outlier detection, including our new sampling-based method. Every method is formalized as a scoring function q : M →
ℝ on a metric space (M, d), which assigns a real-valued outlierness score to each object x in a given set of objects X. We denote by n the number of objects in X. If X is multivariate, the number of dimensions is denoted by m. The number of samples (sample size) is denoted by s.

3.1 The kth-nearest neighbor distance

Knorr and Ng [11, 12] were the first to formalize a distance-based outlier detection scheme, in which an object x ∈ X is said to be a DB(α, δ)-outlier if |{x′ ∈ X | d(x, x′) > δ}| ≥ αn, where α, δ ∈ ℝ with 0 ≤ α ≤ 1 are parameters specified by the user. This means that at least a fraction α of all objects have a distance from x that is larger than δ. This definition has two significant drawbacks: the difficulty of determining the distance threshold δ in practice and the lack of a ranking of outliers. To overcome these drawbacks, Ramaswamy et al. [18] proposed to measure the outlierness by the kth-nearest neighbor (kth-NN) distance. The score qkthNN(x) of an object x is defined as

qkthNN(x) := d_k(x, X),

where d_k(x, X) is the distance between x and its kth-NN in X. Notice that if we set α = (n − k)/n, the set of Knorr and Ng's DB(α, δ)-outliers coincides with the set {x ∈ X | qkthNN(x) ≥ δ}. We employ qkthNN(x) as a baseline for distance-based methods in our comparison.

Since the naïve computation of the scores qkthNN(x) for all x requires quadratic computational cost, a number of studies have investigated speed-up techniques [4, 6, 16]. We used Bhaduri's algorithm (called iORCA) [6] and implemented it in C, since it is the latest technique in this branch of research. It has a parameter k to specify the kth-NN and an additional parameter κ to retrieve the top-κ objects with the largest outlierness scores. We set k = 5, which is a default setting used in the literature [4, 6, 15, 16], and set κ to twice the number of outliers for each dataset. Note that in practice we usually do not know the exact number of outliers and have to set κ large enough.

3.2 Iterative sampling

Wu and Jermaine [21] proposed a sampling-based approach to efficiently approximate the kth-NN distance score qkthNN. For each object x ∈ X, define

qkthSp(x) := d_k(x, S_x(X)),

where S_x(X) is a subset of X that is randomly and iteratively sampled anew for each object x. In addition, they introduced a random variable N = |O ∩ O′| with the two sets of top-κ outliers O and O′ with respect to qkthNN and qkthSp, and analyzed its expectation E(N) and variance Var(N). The time complexity is Θ(nms). We implemented this method in C and set k = 5 and the sample size s = 20 unless stated otherwise.

3.3 One-time sampling (our proposal)

Here we present a new sampling-based method. We randomly and independently sample a subset S(X) ⊂ X only once and define

qSp(x) := min_{x′ ∈ S(X)} d(x, x′)

for each object x ∈ X. Although this definition is closely related to Wu and Jermaine's method qkthSp in the case of k = 1, our method performs sampling only once, while their method performs sampling for each object. We empirically show that this leads to significant differences in outlier detection accuracy (see Section 4). We also theoretically analyze this phenomenon to get a better understanding of its cause (see Section 5). The time complexity is Θ(nms) and the space complexity is Θ(ms) for s samples, as the score can be obtained in a one-pass manner. We implemented this method in C. We set s = 20 for the comparison with the other methods.

3.4 Isolation forest

Liu et al.
[15] proposed a random forest-like method, called isolation forest. It uses random recursive partitions of the objects, which are assumed to be m-dimensional vectors, and hence is also based on the concept of proximity. From a given set X, we construct an iTree in the following manner. First a sample set S(X) ⊂ X is chosen. Then this sample set is partitioned into two non-empty subsets S(X)_L and S(X)_R such that S(X)_L = {x ∈ S(X) | x_q < v} and S(X)_R = S(X) \ S(X)_L, where the split value v and the split attribute q are randomly chosen. This process is applied recursively to each subset until it becomes a singleton, resulting in a proper binary tree with 2s − 1 nodes. The outlierness of an object x is measured by its path length h(x) in the tree, normalized and averaged over t iTrees. Finally, the outlierness score qtree(x) is defined as

qtree(x) := 2^(−h̄(x)/c(s)),

where h̄(x) is the average of h(x) over t iTrees and c(s) := 2H(s − 1) − 2(s − 1)/n, with H denoting the harmonic number. The average and worst-case time complexities are O((s + n)t log s) and O((s + n)ts), respectively. We used the official R IsolationForest package¹, whose core process is implemented in C. We set t = 100 and s = 256, the same setting as in [15].

3.5 Local outlier factor (LOF)

While LOF [7] is often referred to as density-based rather than distance-based, we still include this method, as it is also based on pairwise distances and is known to be a prominent outlier detection method. Let N_k(x) be the set of k-nearest neighbors of x. The local reachability density of x is defined as

ρ(x) := |N_k(x)| ( Σ_{x′ ∈ N_k(x)} max{ d_k(x′, X), d(x, x′) } )^(−1).

Then the local outlier factor (LOF) qLOF(x) is defined as the ratio of the average of the local reachability densities of the k-nearest neighbors of x to the local reachability density of x itself, that is,

qLOF(x) := ( |N_k(x)|^(−1) Σ_{x′ ∈ N_k(x)} ρ(x′) ) ρ(x)^(−1).

The time complexity is O(n²m), which is known to be the main disadvantage of this method. We implemented this method in C and used the common setting k = 10.

3.6 Angle-based outlier factor (ABOF)

Kriegel et al. [14] proposed to use angles instead of distances to measure outlierness. Let c(x, x′) be a similarity between vectors x and x′, for example the cosine similarity; then c(y − x, y′ − x) is correlated with the angle between the two vectors y and y′ with respect to the coordinate origin x. The insight of Kriegel et al. is that if x is an outlier, the variance of the angles between pairs of the remaining objects becomes small. Formally, for an object x ∈ X define

qABOF(x) := Var_{y,y′ ∈ X} c(y − x, y′ − x).

Note that the smaller qABOF(x), the more likely x is to be an outlier, in contrast to the other methods. This method was originally introduced to overcome the "curse of dimensionality" in high-dimensional data. However, Zimek et al. [24] recently showed that distance-based methods such as LOF also work in high dimensions if the attributes carry information relevant to the outliers. We include several high-dimensional datasets in our experiments and check whether distance-based methods work effectively on them. Although this method is attractive as it is parameter-free, its computational cost is cubic in n. Thus we use the near-linear approximation algorithm proposed by Pham and Pagh [17].
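Before turning to the approximation, the exact (cubic-time) score can be sketched as follows; this toy version is our own illustration of the definition above, not the FastVOA algorithm:

```python
import math
from itertools import combinations

def diff(u, v):
    return tuple(a - b for a, b in zip(u, v))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def abof(data, x):
    """Variance of the cosine similarities c(y - x, y' - x) over all
    pairs of remaining objects; a small value flags x as an outlier."""
    vals = [cosine(diff(y, x), diff(z, x))
            for y, z in combinations([p for p in data if p != x], 2)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.2), (10.0, 10.0)]
scores = {p: abof(points, p) for p in points}
# The isolated point sees all other objects under a narrow angle,
# so its variance of similarities is the smallest.
assert min(scores, key=scores.get) == (10.0, 10.0)
```

The double loop over pairs for every x is what makes the exact score cubic in n and motivates the randomized approximation discussed next.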
Their algorithm, called FastVOA, estimates the first and second moments of the variance Var_{y,y′ ∈ X} c(y − x, y′ − x) independently, using two techniques: random projections and AMS sketches. The latter is a randomized technique for estimating the second frequency moment of a data stream. The resulting time complexity is O(tn(m + log n + c₁c₂)), where t is the number of hyperplanes for the random projections and c₁, c₂ are the numbers of repetitions for the AMS sketches. We implemented this algorithm in C. We set t = log n, c₁ = 1600, and c₂ = 10, as these values were shown to be empirically sufficient in [17].

3.7 One-class SVM

The one-class SVM, introduced by Schölkopf et al. [19], classifies objects into inliers and outliers by introducing a hyperplane between them. This classification can be turned into a ranking of outlierness by considering the signed distance to the separating hyperplane: the further an object is located in the outlier half space, the more likely it is to be a true outlier. Let X = {x₁, …, xₙ}. Formally, the score of a vector x with a feature map Φ is defined as

qSVM(x) := ρ − (w · Φ(x)),   (1)

¹http://sourceforge.net/projects/iforest/

Table 1: Summary of datasets.
Gaussian is synthetic (marked by *); the other datasets are collected from the UCI repository (n = number of objects, m = number of dimensions).

Dataset       n           # of outliers   m
Ionosphere    351         126             34
Arrhythmia    452         207             274
Wdbc          569         212             30
Mfeat         600         200             649
Isolet        960         240             617
Pima          768         268             8
Gaussian*     1000        30              1000
Optdigits     1688        554             64
Spambase      4601        1813            57
Statlog       6435        626             36
Skin          245057      50859           3
Pamap2        373161      125953          51
Covtype       286048      2747            10
Kdd1999       4898431     703067          6
Record        5734488     20887           7
Gaussian*     10000000    30              20

Figure 1: Average of the area under the precision-recall curve (AUPRC) over all datasets with respect to the number of samples s for qSp (one-time sampling; our proposal) and qkthSp (iterative sampling by Wu and Jermaine [21]). Note that the x-axis has a logarithmic scale.

where the weight vector w and the offset ρ are optimized by the following quadratic program:

min_{w ∈ F, ξ ∈ ℝⁿ, ρ ∈ ℝ}  (1/2)‖w‖² + (1/(νn)) Σ_{i=1}^n ξᵢ − ρ   subject to   (w · Φ(xᵢ)) ≥ ρ − ξᵢ,  ξᵢ ≥ 0,

with a regularization parameter ν. The term w · Φ(x) in equation (1) can be replaced with Σ_{i=1}^n αᵢ k(xᵢ, x) using a kernel function k, where α = (α₁, …, αₙ) is obtained from the dual problem. We tried ten different values of ν from 0 to 1 and picked the one maximizing the margin between negative and positive scores. We used a Gaussian RBF kernel and set its parameter σ by the popular heuristic [8]. The R kernlab package was used, whose core process is implemented in C.

4 Experimental Results

4.1 Sensitivity to sample size and sampling scheme

We first analyze the parameter sensitivity of our method qSp with respect to changes in the sample size s.
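This protocol can be mimicked in miniature (an illustrative sketch on synthetic data under our own naming, not the actual benchmark code; the average precision is computed directly from the ranking):

```python
import math
import random

def average_precision(scores, labels):
    """AUPRC as average precision: mean of precision@rank over the
    ranks at which true outliers appear (higher score = more outlying)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / sum(labels)

def nn_dist(x, pts):
    return min(math.dist(x, y) for y in pts)

rng = random.Random(0)
inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
outliers = [(rng.uniform(6, 8), rng.uniform(6, 8)) for _ in range(5)]
data = inliers + outliers
labels = [0] * 300 + [1] * 5

s = 20
sample = rng.sample(data, s)                               # one-time sampling
q_sp = [nn_dist(x, sample) for x in data]
q_kthsp = [nn_dist(x, rng.sample(data, s)) for x in data]  # re-sample per object
print(average_precision(q_sp, labels), average_precision(q_kthsp, labels))
```

With k = 1 the two scorers differ only in the sampling scheme, which is exactly the comparison carried out in this section.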
In addition, for each sample size we compare our qSp (one-time sampling) to Wu and Jermaine's qkthSp (iterative sampling). We set k = 1 in qkthSp, so that the only difference between the two is the sampling scheme. Each method was applied to each dataset listed in Table 1, the AUPRC (area under the precision-recall curve) was averaged over 10 trials, and these values were again averaged over all datasets. The resulting scores for varying sample sizes are plotted in Figure 1.

Our method shows robust performance over all sample sizes from 5 to 1000, and the average AUPRC varies by less than 2%. Interestingly, the score is maximized at a rather small sample size (s = 20) and (slightly) monotonically decreases with increasing sample size. Moreover, for every sample size, one-time sampling qSp significantly outperforms iterative sampling qkthSp (Wilcoxon signed-rank test, α = 0.05). We checked that this behavior is independent of dataset size.

4.2 Scalability and effectiveness

Next we evaluate the scalability and effectiveness of the approaches introduced in Section 3 by systematically applying them to every dataset. Running times and AUPRCs are shown in Table 2 and Table 3, respectively. As we can see, our method qSp is the fastest among all methods; it can score more than five million objects within a few seconds. Although the time complexity of Wu and Jermaine's qkthSp is the same as that of qSp, our method is empirically much faster, especially on large datasets. The different costs of the two processes, sampling once and performing nearest neighbor search versus re-sampling per object and performing kth-NN search, cause this difference. The baseline qkthNN shows acceptable runtimes on large data only if the number of outliers is small.

Table 2: Running time (in seconds). Averages over 10 trials are shown for the four probabilistic methods qkthSp, qSp, qtree, and qABOF. "—" means that the computation did not complete within 2 months. [Per-dataset entries not recoverable from the extracted text.]

Table 3: Area under the precision-recall curve (AUPRC). Averages ± SEMs over 10 trials are shown for the four probabilistic methods. Note that the root mean square deviation (RMSD) rewards methods that are always close to the best result on each dataset.

Dataset      qkthNN   qkthSp        qSp           qtree         qLOF    qABOF         qSVM
Ionosphere   0.931    0.762±0.007   0.899±0.032   0.740±0.022   0.864   0.871±0.002   0.794
Arrhythmia   0.701    0.674±0.008   0.711±0.005   0.697±0.005   0.673   0.681±0.004   0.707
Wdbc         0.607    0.226±0.001   0.667±0.036   0.490±0.014   0.428   0.595±0.018   0.556
Mfeat        0.217    0.293±0.002   0.245±0.031   0.211±0.003   0.369   0.270±0.009   0.257
Isolet       0.380    0.175±0.001   0.535±0.138   0.520±0.034   0.274   0.328±0.011   0.439
Pima         0.519    0.608±0.007   0.512±0.010   0.461±0.008   0.406   0.441±0.003   0.461
Gaussian     1.000    1.000±0.000   1.000±0.000   0.994±0.005   0.904   0.934±0.036   1.000
Optdigits    0.204    0.319±0.001   0.233±0.021   0.255±0.006   0.361   0.295±0.010   0.266
Spambase     0.395    0.418±0.001   0.422±0.011   0.398±0.002   0.354   0.419±0.011   0.399
Statlog      0.057    0.058±0.000   0.082±0.008   0.054±0.000   0.093   0.060±0.002   0.056
Skin         0.195    0.146±0.000   0.353±0.058   0.258±0.006   0.130   0.242±0.003   0.213
Pamap2       0.249    0.328±0.000   0.268±0.009   0.231±0.002   0.338   0.252±0.001   0.235
Covtype      0.016    0.058±0.001   0.075±0.034   0.087±0.005   0.010   0.017±0.001   0.095
Kdd1999      0.768    0.081±0.000   0.611±0.098   0.389±0.007   —       0.539±0.020   —
Record       0.002    0.411±0.000   0.933±0.013   0.976±0.004   —       0.658±0.106   —
Gaussian     1.000    0.999±0.000   1.000±0.000   0.890±0.022   —       0.893±0.003   —
Average      0.453    0.410         0.534         0.479         0.400   0.468         0.421
Avg. Rank    3.750    3.875         2.188         3.875         4.538   4.563         4.000
RMSD         0.259    0.274         0.068         0.133         0.152   0.140         0.094

In terms of effectiveness, qSp shows the best performance on seven of the sixteen datasets, including the high-dimensional ones, resulting in the best average AUPRC score, which is significantly higher than that of every other method except qLOF (Wilcoxon signed-rank test, α = 0.05). The method qSp also shows the best performance in terms of the average rank and the RMSD (root mean square deviation) to the best result on each dataset. Moreover, qSp is inferior to the baseline qkthNN on only three datasets. It is interesting that qtree, which also uses one-time sampling like our method, shows better performance than the exhaustive methods on average. In contrast, qkthSp with iterative sampling is the worst of all methods in terms of RMSD.

Based on these observations we can conclude that (1) small sample sizes lead to the maximum average precision for qSp; (2) one-time sampling leads to better results than iterative sampling; (3) one-time sampling leads to better results than exhaustive methods and is also much faster.

5 Theoretical Analysis

To understand why our new one-time sampling method qSp shows better performance than the other methods, we present a theoretical analysis that answers the following four questions: (1) What is the probability that qSp will correctly detect outliers? (2) Why do small sample sizes lead to better results for qSp? (3) Why is qSp superior to qkthSp? (4) Why is qSp superior to qkthNN?
Here we use the notion of Knorr and Ng's DB(α, δ)-outliers [11, 12] and denote the set of DB(α, δ)-outliers by X(α, δ); that is, an object x ∈ X(α, δ) if |{x′ ∈ X | d(x, x′) > δ}| ≥ αn holds. We also define X̄(α, δ) = X \ X(α, δ) and, for simplicity, we call an element of X(α, δ) an outlier and an element of X̄(α, δ) an inlier unless otherwise noted. Our method requires as input only the sample size s in practice, whereas the parameters δ and α are used only in our theoretical analysis. In the following, we always assume that s ≪ n; hence the sampling process is treated as sampling with replacement.

Probabilistic analysis of qSp. First we introduce a partition of the inliers into subsets (clusters) using the threshold δ. A δ-partition P_δ of X̄(α, δ) is defined as a set of non-empty disjoint subsets of X̄(α, δ) such that each element (cluster) C ∈ P_δ satisfies max_{x,x′ ∈ C} d(x, x′) < δ and ∪_{C ∈ P_δ} C = X̄(α, δ). If we focus on a single cluster C ∈ P_δ, the probability of discriminating an outlier from the inliers contained in C can be bounded from below. Remember that s is the number of samples.

Theorem 1. For an outlier x ∈ X(α, δ) and a cluster C ∈ P_δ, we have

Pr( ∀x′ ∈ C, qSp(x) > qSp(x′) ) ≥ α^s (1 − β^s)  with  β = (n − |C|)/n.   (2)

Proof. We have the probability Pr(qSp(x) > δ) = α^s from the definition of outliers. Moreover, if at least one object is sampled from the cluster C, then qSp(x′) < δ holds for all x′ ∈ C. Thus Pr(∀x′ ∈ C, qSp(x′) < δ) = 1 − β^s. Inequality (2) therefore follows.

For instance, if we assume that 5% of our data are outliers and fix α to be 0.95, we have (maximum δ, mean of β) = (10.51, 0.50), (44.25, 2.23×10⁻³), (10.93, 0.67), (37.10, 0.75), and (36.37, 0.80) on the first five datasets from Table 1 to achieve this 5% rate of outliers. These β were obtained by greedily searching each cluster in P_δ under α = 0.95 and the respective maximum δ.

Next we consider the task of correctly discriminating an outlier from all inliers. This can be achieved if for each cluster C ∈ P_δ at least one object x ∈ C is chosen in the sampling process. Thus a lower bound can be derived directly from the multinomial distribution as follows.

Theorem 2. Let P_δ = {C₁, …, C_l} with l clusters and p_i = |C_i|/n for each i ∈ {1, …, l}. For every outlier x ∈ X(α, δ) and every sample size s ≥ l, we have

Pr( ∀x′ ∈ X̄(α, δ), qSp(x) > qSp(x′) ) ≥ α^s Σ_{∀i, s_i > 0} f(s₁, …, s_l; s, p₁, …, p_l),

where f is the probability mass function of the multinomial distribution, defined as

f(s₁, …, s_l; s, p₁, …, p_l) := ( s! / Π_{i=1}^l s_i! ) Π_{i=1}^l p_i^{s_i}  with  Σ_{i=1}^l s_i = s.

Furthermore, let I(α, δ) be a subset of X̄(α, δ) such that min_{x′ ∈ I(α,δ)} d(x, x′) > δ for every outlier x ∈ X(α, δ), and assume that P_δ is a δ-partition of I(α, δ) instead of all inliers X̄(α, δ). If
If\nS(X ) (cid:18) I((cid:11); (cid:14)) and at least one object is sampled from each cluster C 2 P (cid:14), qSp(x) > qSp(x\n\u2032\nholds for all pairs of an outlier x and an inlier x\n\u2211\nTheorem 3 Let P (cid:14) = fC1; : : : ;Clg be a (cid:14)-partition of I((cid:11); (cid:14)) and (cid:13) = jI((cid:11); (cid:14))j = n, and assume\nthat pi = jCij =jI((cid:11); (cid:14))j for each i 2 f1; : : : ; lg. For every s (cid:21) l,\n\n\u2032.\n\n)\n\n(8x 2 X ((cid:11); (cid:14));8x\n\nPr\n\n) (cid:21) (cid:13)s\n\n\u2032\n\n)\n\nf (s1; : : : ; sl; s; p1; : : : ; pl):\n\n8i;si\u2a880\n\nFrom the fact that this theorem holds for any (cid:14)-partition, we automatically have the maximum lower\nbound over all possible (cid:14)-partitions.\n\n\u2032 2 X ((cid:11); (cid:14)); qSp(x) > qSp(x\n\u2211\n\nCorollary 1 Let \u03c6(s) =\n\n8i;si\u2a880 f (s1; : : : ; sl; s; p1; : : : ; pl) given in Theorem 3. We have\n\n\u2032 2 X ((cid:11); (cid:14)); qSp(x) > qSp(x\n\u2032\n\n)\n\n\u03c6(s):\n\n(3)\n\n(8x 2 X ((cid:11); (cid:14));8x\n\nPr\n\n) (cid:21) (cid:13)s maxP (cid:14)\n\n7\n\n\fLet B((cid:13); (cid:14)) be the right-hand side of Inequality (3) above. This bound is maximized for equally\nsized clusters when l is \ufb01xed and it shows high probability for large (cid:13). For example if (cid:13) = 0:99,\nwe have (l; optimal s; B((cid:13); (cid:14))) = (2; 7; 0:918), (3; 12; 0:866), and (4; 17; 0:818). It is notable that\nthe bound B((cid:13); (cid:14)) is independent of the actual number of outliers and inliers, which is a desirable\nproperty when analyzing large datasets. Although it is dependent on the number of clusters l, the\nbest (minimum) l which maximizes B((cid:13); (cid:14)) with the simplest clustering is implicitly chosen in qSp.\nTheoretical support for small sample sizes. Let g(s) = (cid:11)s(1 (cid:0) (cid:12)s), which is the right-hand side\nof Inequality (2). 
From the derivative $dg/ds$, we can see that this function is maximized at
\[
s = \log_\beta\!\left(\frac{\log \alpha}{\log \alpha + \log \beta}\right)
\]
under the natural assumption $0 < \beta < \alpha < 1$, and this optimal sample size $s$ is small for large $\alpha$ and small $\beta$; for example, $s = 6$ for $(\alpha, \beta) = (0.99, 0.5)$ and $s = 24$ for $(\alpha, \beta) = (0.999, 0.8)$. Moreover, as we already saw above, the bound $B(\gamma, \delta)$ is also maximized at such small sample sizes for large $\gamma$. Since these are common values of $\alpha$, $\beta$, and $\gamma$ in outlier detection, this could be the reason why $q_{\mathrm{Sp}}$ works well with small sample sizes.
Comparison with $q_{\mathrm{kthSp}}$. Define $Z(x, x') := \Pr(q_{\mathrm{kthSp}}(x) > q_{\mathrm{kthSp}}(x'))$ for the iterative sampling method $q_{\mathrm{kthSp}}$. Since we repeat the sampling for each object in $q_{\mathrm{kthSp}}$, the probabilities $Z(x, x')$ for the inliers $x' \in \bar{X}(\alpha, \delta)$ are independent with respect to a fixed outlier $x \in X(\alpha, \delta)$. We therefore have
\[
\Pr\bigl(\forall x \in X(\alpha, \delta),\ \forall x' \in \bar{X}(\alpha, \delta),\ q_{\mathrm{kthSp}}(x) > q_{\mathrm{kthSp}}(x')\bigr) \le \min_{x \in X(\alpha, \delta)} \prod_{x' \in \bar{X}(\alpha, \delta)} Z(x, x').
\]
Although $Z(x, x')$ is typically close to 1 in outlier detection, this overall probability decreases rapidly as $n$ grows, so the performance suffers on large datasets. In contrast, our one-time sampling $q_{\mathrm{Sp}}$ does not have this independence, which yields our lower bounds (Theorems 1, 2, and 3 and Corollary 1) instead of this upper bound, and these often lead to a higher probability. This fact might be the reason why $q_{\mathrm{kthSp}}$ empirically performs significantly worse than $q_{\mathrm{Sp}}$ and shows the worst RMSD.
Comparison with $q_{\mathrm{kthNN}}$. Finally, let us consider the situation in which there exists a set of "true" outliers $O \subset X$ given by an oracle.
Let $\Lambda = \{k \in \mathbb{N} \mid q_{\mathrm{kthNN}}(x) > q_{\mathrm{kthNN}}(x') \text{ for all } x \in O \text{ and } x' \in X \setminus O\}$, the set of values of $k$ with which we can detect all outliers, and assume that $\Lambda \neq \emptyset$. Then
\[
\Pr\bigl(\forall x \in O,\ \forall x' \in X \setminus O,\ q_{\mathrm{Sp}}(x) > q_{\mathrm{Sp}}(x')\bigr) \ge \max_{k \in \Lambda,\ \delta \in \Delta(k)} B(\gamma, \delta)
\]
with $\Delta(k) = \{\delta \in \mathbb{R} \mid X(\alpha, \delta) = O\}$ if we set $\alpha = (n - k)/n$. Notice that $\gamma$ is determined by $\alpha$ (i.e., by $k$) and $\delta$. Thus both $k$ and $\delta$ are implicitly optimized in $q_{\mathrm{Sp}}$. In contrast, in $q_{\mathrm{kthNN}}$ the number $k$ is specified by the user. If $\Lambda$ is small, for example, it is hardly possible to choose $k \in \Lambda$ without any prior knowledge, so some outliers will be overlooked, whereas $q_{\mathrm{Sp}}$ always has the possibility of detecting them without knowing $\Lambda$, provided $I(\alpha, \delta)$ is non-empty for some $\alpha$. This difference in detection ability could be a reason why $q_{\mathrm{Sp}}$ significantly outperforms $q_{\mathrm{kthNN}}$ on average.

6 Conclusion

In this study, we have performed an extensive set of experiments comparing current distance-based outlier detection methods. We have observed that a surprisingly simple sampling-based approach, which we have newly proposed here, outperforms other state-of-the-art distance-based methods. Since the approach reaches its best performance with small sample sizes, it achieves dramatic speed-ups over exhaustive methods and is faster than all state-of-the-art methods for distance-based outlier detection.
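As a rough illustration of how simple such a one-time-sampling scheme is, the following Python sketch scores every object by its distance to the nearest member of a single random sample. This is our own minimal reconstruction in the spirit of the approach, not the paper's exact implementation; the function names and toy data are ours.

```python
import random

def qsp_scores(X, sample, dist):
    # One-time-sampling outlierness score (our sketch): each object's score
    # is its distance to the nearest sampled object, so the whole dataset
    # needs only O(n * s) distance computations.
    return [min(dist(x, y) for y in sample) for x in X]

def top_outlier(X, s, dist, seed=0):
    # Draw the sample once for the entire dataset, then rank all objects.
    sample = random.Random(seed).sample(X, s)
    scores = qsp_scores(X, sample, dist)
    return max(range(len(X)), key=lambda i: scores[i])

# Toy 1-D example: five clustered points and one far-away point.
data = [0.0, 0.1, 0.2, 0.15, 0.05, 9.0]
d = lambda a, b: abs(a - b)

# With a sample drawn from the clustered points, the far-away point
# receives by far the largest score (about 8.8 here).
print(qsp_scores(data, [0.0, 0.2], d))
```

Note that if the far-away point itself happens to be sampled, its own score in this sketch drops to zero; the analysis above is precisely about why, for small $s$, a single random sample is overwhelmingly likely to cover the inlier clusters instead.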
We have also presented a theoretical analysis to understand why such a simple strategy works well and outperforms the popular approach based on $k$th-NN distances.
To summarize, our contribution is not only to overcome the scalability issue of the distance-based approach to outlier detection by means of a sampling strategy, but also, to the best of our knowledge, to give the first thorough experimental comparison of a broad range of recently proposed distance-based outlier detection methods. We are optimistic that these results will contribute to further improvements of outlier detection techniques.
Acknowledgments. M.S. is funded by the Alexander von Humboldt Foundation. The research of Professor Dr. Karsten Borgwardt was supported by the Alfried Krupp Prize for Young University Teachers of the Alfried Krupp von Bohlen und Halbach-Stiftung.

References
[1] Aggarwal, C. C. Outlier Analysis. Springer, 2013.
[2] Bache, K. and Lichman, M. UCI Machine Learning Repository, 2013.
[3] Bakar, Z. A., Mohemad, R., Ahmad, A., and Deris, M. M. A comparative study for outlier detection techniques in data mining. In Proceedings of the IEEE International Conference on Cybernetics and Intelligent Systems, 1–6, 2006.
[4] Bay, S. D. and Schwabacher, M. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 29–38, 2003.
[5] Berchtold, S., Keim, D. A., and Kriegel, H.-P. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22nd International Conference on Very Large Data Bases, 28–39, 1996.
[6] Bhaduri, K., Matthews, B. L., and Giannella, C. R. Algorithms for speeding up distance-based outlier detection. In Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 859–867, 2011.
[7] Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 93–104, 2000.
[8] Caputo, B., Sim, K., Furesjo, F., and Smola, A. Appearance-based object recognition using SVMs: Which kernel should I use? In Proceedings of the NIPS Workshop on Statistical Methods for Computational Experiments in Visual Processing and Computer Vision, 2002.
[9] de Vries, T., Chawla, S., and Houle, M. E. Density-preserving projections for large-scale local anomaly detection. Knowledge and Information Systems, 32(1):25–52, 2012.
[10] Hawkins, D. Identification of Outliers. Chapman and Hall, 1980.
[11] Knorr, E. M. and Ng, R. T. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, 392–403, 1998.
[12] Knorr, E. M., Ng, R. T., and Tucakov, V. Distance-based outliers: Algorithms and applications. The VLDB Journal, 8(3):237–253, 2000.
[13] Kriegel, H.-P., Kröger, P., and Zimek, A. Outlier detection techniques. Tutorial at the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2010.
[14] Kriegel, H.-P., Schubert, M., and Zimek, A. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 444–452, 2008.
[15] Liu, F. T., Ting, K. M., and Zhou, Z. H. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data, 6(1):3:1–3:39, 2012.
[16] Orair, G. H., Teixeira, C. H. C., Wang, Y., Meira Jr., W., and Parthasarathy, S. Distance-based outlier detection: Consolidation and renewed bearing. PVLDB, 3(2):1469–1480, 2010.
[17] Pham, N. and Pagh, R. A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In Proceedings of the 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 877–885, 2012.
[18] Ramaswamy, S., Rastogi, R., and Shim, K. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 427–438, 2000.
[19] Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
[20] Weber, R., Schek, H.-J., and Blott, S. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Data Bases, 194–205, 1998.
[21] Wu, M. and Jermaine, C. Outlier detection by sampling with accuracy guarantees. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 767–772, 2006.
[22] Yamanishi, K., Takeuchi, J., Williams, G., and Milne, P. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery, 8(3):275–300, 2004.
[23] Yu, D., Sheikholeslami, G., and Zhang, A. FindOut: Finding outliers in very large datasets. Knowledge and Information Systems, 4(4):387–412, 2002.
[24] Zimek, A., Schubert, E., and Kriegel, H.-P. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5):363–387, 2012.