{"title": "Model Based Population Tracking and Automatic Detection of Distribution Changes", "book": "Advances in Neural Information Processing Systems", "page_first": 1345, "page_last": 1352, "abstract": "", "full_text": "Model Based Population Tracking and Automatic Detection of Distribution Changes

Igor V. Cadez*
Dept. of Information and Computer Science, University of California, Irvine, CA 92612
icadez@ics.uci.edu

P. S. Bradley
digiMine, Inc., 10500 NE 8th Street, Bellevue, WA 98004-4332
paulb@digimine.com

Abstract

Probabilistic mixture models are used for a broad range of data analysis tasks such as clustering, classification, and predictive modeling. Due to their inherent probabilistic nature, mixture models can easily be combined with other probabilistic or non-probabilistic techniques, thus forming more complex data analysis systems. In the case of online data (where there is a stream of data available), models can be constantly updated to reflect the most current distribution of the incoming data. However, in many business applications the models themselves represent a parsimonious summary of the data, and therefore it is not desirable to change models frequently, much less with every new data point. In such a framework it becomes crucial to track the applicability of the mixture model and detect the point in time when the model fails to adequately represent the data. In this paper we formulate the problem of change detection and propose a principled solution. Empirical results over both synthetic and real-life data sets are presented.

1 Introduction and Notation

Consider a data set D = {x_1, x_2, ..., x_n} consisting of n independent, identically distributed (iid) data points. In the context of this paper the data points could be vectors, sequences, etc.
Further, consider a probabilistic mixture model that maps each data set to a real number, the probability of observing the data set:

P(D|Θ) = ∏_{i=1}^{n} P(x_i|Θ) = ∏_{i=1}^{n} ∑_{k=1}^{K} π_k P(x_i|θ_k),   (1)

where the model is parameterized by Θ = {π_1, ..., π_K, θ_1, ..., θ_K}. Each P(·|θ_k) represents a mixture component, while π_k represents the corresponding mixture weight. It is often more convenient to operate with the log of the probability and define the log-likelihood function as:

l(Θ|D) = log P(D|Θ) = ∑_{i=1}^{n} log P(x_i|Θ) = ∑_{i=1}^{n} LogP_i,

which is additive over data points rather than multiplicative. The LogP_i terms we introduce in the notation represent each data point's contribution to the overall log-likelihood and therefore describe how well a data point fits under the model. For example, Figure 3 shows a distribution of LogP scores using a mixture of conditionally independent (CI) models.

Maximizing the probability¹ of the data with respect to the parameters Θ can be accomplished by the Expectation-Maximization (EM) algorithm [6] in time linear in both data complexity (e.g., number of dimensions) and data set size (e.g., number of data points). Although EM guarantees only local optimality, it is a preferred method for finding good solutions in linear time. We consider an arbitrary but fixed parametric form of the model, and therefore sometimes refer to a specific set of parameters Θ as the model. Note that since the logarithm is a monotonic function, the optimal set of parameters is the same whether we use likelihood or log-likelihood.

Consider an online data source where data sets D_t become available at certain time intervals t (not necessarily equal time periods or numbers of data points).

*Work was done while the author was at digiMine, Inc., Bellevue, WA.
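In code, the per-point LogP scores and their additive log-likelihood can be sketched as follows. This is a minimal illustration with a hypothetical two-component 1-D Gaussian mixture; the parameter values and function names are ours, not the paper's:

```python
import math

def logp_score(x, weights, means, variances):
    """LogP_i = log sum_k pi_k * N(x | mu_k, sigma_k^2) for one data point."""
    terms = []
    for pi_k, mu, var in zip(weights, means, variances):
        log_norm = -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        terms.append(math.log(pi_k) + log_norm)
    # log-sum-exp for numerical stability
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def log_likelihood(data, weights, means, variances):
    """l(Theta|D): additive over the per-point LogP scores."""
    return sum(logp_score(x, weights, means, variances) for x in data)

# Hypothetical two-component mixture (illustrative parameters only).
w, mu, var = [0.6, 0.4], [0.0, 5.0], [1.0, 2.0]
data = [0.1, -0.3, 4.8, 5.2]
scores = [logp_score(x, w, mu, var) for x in data]
total = log_likelihood(data, w, mu, var)
```

The `scores` list is exactly the per-point "LogP signature" the paper tracks; `total` is the log-likelihood, i.e. their sum.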
For example, a data set could be generated on a daily basis, or it could represent a constant stream of data from a monitoring device. In addition, we assume that we have an initial model Θ_0 that was built (optimized, fitted) on some in-sample data D_0 = {D_1, D_2, ..., D_{t_0}}. We would like to be able to detect a change in the underlying distribution of data points within data sets D_t that would be sufficient to require building a new model Θ_1. The criterion for building a new model is loosely defined as "the model does not adequately fit the data anymore".

2 Model Based Population Similarity

In this section we formulate the problem of model-based population similarity and tracking. In the case of mixture models we start with the following observations:

• The mixture model defines the probability density function (PDF) that is used to score each data point (LogP scores), leading to the score for the overall population (log-likelihood, or sum of LogP scores).

• The optimal mixture model puts more PDF mass over dense regions in the data space. Different components allow the mixture model to distribute its PDF over disconnected dense regions in the data space. More PDF mass in a portion of the data space implies higher LogP scores for the data points lying in that region of the space.

• If the model is to generalize well (e.g., there is no significant overfitting), it cannot put significant PDF mass over regions of data space that are populated by data points solely due to the details of the specific data sample used to build the model.

• Dense regions in the data space discovered by a non-overfitting model are an intrinsic property of the true data-generating distribution, even if the functional form of the model is not well matched with the true data-generating distribution.
In the latter case, the model might not be able to discover all dense regions or might not model the correct shape of the regions, but the regions that are discovered (if any) are intrinsic to the data.

• If there is confidence that the model is not overfitting and that it generalizes well (e.g., cross-validation was used to determine the optimal number of mixture components), new data from the same distribution as the in-sample data should be dense in the same regions that are predicted by the model.

Given these observations, we seek to define a measure of data-distribution similarity based on how well the dense regions of the data space are preserved when new data is introduced. In model-based clustering, dense regions are equivalent to higher LogP scores, hence we cast the problem of determining data distribution similarity into one of determining LogP distribution similarity (relative to the model). For example, Figure 3 (left) shows a histogram of one such distribution.

¹This approach is called maximum-likelihood estimation. If we included parameter priors, we could equally well apply the results in this paper to maximum a posteriori estimation.
It is important to note several properties of Figure 3: 1) there are several distinct peaks from which the distribution tails off toward smaller LogP values, so simple summary scores fail to efficiently summarize the LogP distribution. For example, the log-likelihood is proportional to the mean of the LogP distribution in Figure 3, and the mean is not a very useful statistic when describing such a multimodal distribution (also confirmed experimentally); 2) the histogram itself is not a truly non-parametric representation of the underlying distribution, given that the results depend on the bin width. In passing we also note that the shape of the histogram in Figure 3 is a consequence of the CI model we use: different peaks come from different discrete attributes, while the tails come from the continuous Gaussians. It is a simple exercise to show that LogP scores for a 1-dimensional data set generated by a single Gaussian have an exponential distribution with a sharp cutoff on the right and a tail toward the left.

To define the similarity of the data distributions based on LogP scores in a purely non-parametric way, we have at our disposal the powerful formalism of Kolmogorov-Smirnov (KS) statistics [7]. KS statistics use empirical cumulative distribution functions (CDFs) to estimate the distance between two empirical 1-dimensional distributions, in our case distributions of LogP scores. In principle, we could compare the LogP distribution of the new data set D_t to that of the training set D_0 and obtain the probability that the two came from the same distribution. In practice, however, this approach is not feasible, since we do not assume that the estimated model and the true data-generating process share the same functional form (see Section 3). Consequently, we need to consider the specific KS score in relation to the natural variability of the true data-generating distribution.
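The two-sample KS machinery on LogP scores can be sketched in a self-contained way as follows. Libraries such as SciPy provide `scipy.stats.ks_2samp` for this; here we spell it out, with the asymptotic significance formula following [7]. The sample sizes, seed, and names are illustrative choices of ours:

```python
import math
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample KS distance: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        # advance past ties so both CDFs jump together before measuring
        while i < na and a[i] == x:
            i += 1
        while j < nb and b[j] == x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d

def ks_probability(d, na, nb):
    """Asymptotic significance of d (the P_KS of the text, following [7]):
    small values mean the two samples are unlikely to share a distribution."""
    ne = na * nb / (na + nb)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 1e-9:
        return 1.0
    q = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                  for j in range(1, 101))
    return min(max(q, 0.0), 1.0)

# Two same-distribution "LogP samples" should yield a non-trivial P_KS.
random.seed(0)
day1 = [random.gauss(-4.0, 0.5) for _ in range(1000)]
day2 = [random.gauss(-4.0, 0.5) for _ in range(1000)]
d_same = ks_statistic(day1, day2)
```

The quantity L_KS(i, j) used below is simply the log of `ks_probability` evaluated at the KS distance between the LogP samples of D_i and D_j.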
In the situation with streaming data, the model is estimated over the in-sample data D_0. Then the individual in-sample data sets D_1, D_2, ..., D_{t_0} are used to estimate the natural variability of the KS statistics. This variability needs to be quantified because the model may not truly match the data distribution. Once the natural variance of the KS statistics over the in-sample data has been determined, the LogP scores for a new data set D_t, t > t_0, are computed. Using principled heuristics, one can then determine whether or not the LogP signature for D_t is significantly different from the LogP signatures for the in-sample data. To clarify the various steps, we provide an algorithmic description of the change detection process.

Algorithm 1 (Quantifying Natural Variance of KS Statistics):
Given an "in-sample" dataset D_0 = {D_1, D_2, ..., D_{t_0}}, proceed as follows:

1. Estimate the parameters Θ_0 of the mixture model P(D|Θ) over D_0 (see equation (1)).

2. Compute

LogP(D_i) = ∑_{j=1}^{n_i} log P(x_j|Θ_0),  x_j ∈ D_i,  n_i = |D_i|,  i = 1, ..., t_0.   (2)

3. For 1 ≤ i, j ≤ t_0, compute L_KS(i, j) = log[P_KS(D_i, D_j)]. See [7] for details on the P_KS computation.

4. For 1 ≤ i ≤ t_0, compute the KS measure M_KS(i) as

M_KS(i) = (∑_{j=1}^{t_0} L_KS(i, j)) / t_0.   (3)

5. Compute μ_M = Mean[M_KS(i)] and σ_M = STD[M_KS(i)] to quantify the natural variability of M_KS over the "in-sample" data.

Algorithm 2 (Evaluating New Data):
Given a new dataset D_t, t > t_0, and μ_M, σ_M, proceed as follows:

1. Compute LogP(D_t) as in (2).
2. For 1 ≤ i ≤ t_0, compute L_KS(i, t).
3. Compute M_KS(t) as in (3).
4. Apply decision criteria using M_KS(t), μ_M, and σ_M to determine whether or not Θ_0 is a good fit for the new data.
For example, if

|M_KS(t) − μ_M| / σ_M > 3,   (4)

then Θ_0 is no longer a good fit.

Note, however, that the 3σ interval can be interpreted as a confidence interval only in the limit as the number of data sets goes to infinity. In the applications presented in this paper we certainly do not have that condition satisfied, and we consider this approach an "educated heuristic" (gaining firm statistical grounds in the limit).

2.1 Space and Time Complexity of the Methodology

The proposed methodology was motivated by a business application with large data sets, hence it must have time complexity that is close to linear in order to scale well. To assess the time complexity, we use the following notation: n_t = |D_t| is the number of data points in the data set D_t; t_0 is the index of the last in-sample data set, but also the number of in-sample data sets; n_0 = |D_0| = ∑_{t=1}^{t_0} n_t is the total number of in-sample data points (in all the in-sample data sets); n = n_0/t_0 is the average number of data points in the in-sample data sets. For simplicity of argument, we assume that all the data sets are approximately of the same size, that is, n_t ≈ n.

The analysis presented here does not take into account the time and space complexity needed to estimate the parameters Θ of the mixture model (1). In the first phase of the methodology, we must score each of the in-sample data points under the model (to obtain the LogP distributions), which has time complexity O(n_0). Calculation of the KS statistic for two data sets is done in one pass over the LogP distributions, but it requires that the LogP scores be sorted, hence it has time complexity 2n + 2·O(n log n) = O(n log n). Since we must calculate all the pairwise KS measures, this step has time complexity t_0(t_0 − 1)/2 · O(n log n) = O(t_0² n log n). The in-sample mean and variance of the KS measure are obtained in time linear in t_0, hence the asymptotic time complexity does not change. In order to evaluate out-of-sample data sets we must keep the LogP distributions for each of the in-sample data sets, as well as several scalars (e.g., the mean and variance of the in-sample KS measure), which requires O(n_0) memory.

To score an out-of-sample data set D_t, t > t_0, we must first obtain the LogP distribution of D_t, which has time complexity O(n), and then calculate the KS measure relative to each of the in-sample data sets, which has time complexity O(n log n) per in-sample data set, or t_0·O(n log n) = O(t_0 n log n) for the full in-sample period. The LogP distribution for D_t can be discarded once the pairwise KS measures are obtained.

Figure 1: Histograms of LogP scores for two data sets generated from the first model (top row) and two data sets generated from the second model (bottom row). Each data set contains 50,000 data points. All histograms are obtained from the model fitted on the in-sample period.

Overall, the proposed methodology requires O(n_0) memory, O(t_0² n log n) time for preprocessing, and O(t_0 n log n) time for out-of-sample evaluation.
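Algorithms 1 and 2 can be combined into a rough end-to-end sketch. This is not the authors' code: the mixture-model scoring step is replaced by precomputed LogP samples, the KS helpers are repeated so the sketch stands alone, and the data sizes, seed, and names are our assumptions:

```python
import math
import random

def ks_statistic(a, b):
    """Two-sample KS distance between two score samples."""
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        while i < na and a[i] == x:
            i += 1
        while j < nb and b[j] == x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d

def log_pks(a, b):
    """L_KS(i, j) = log P_KS(D_i, D_j), asymptotic formula following [7]."""
    ne = len(a) * len(b) / (len(a) + len(b))
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * ks_statistic(a, b)
    if lam < 1e-9:  # identical samples: P_KS = 1
        return 0.0
    q = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                  for j in range(1, 101))
    return math.log(min(max(q, 1e-300), 1.0))

def natural_variability(in_sample_logp):
    """Algorithm 1, steps 3-5: mu_M and sigma_M of the in-sample KS measure."""
    t0 = len(in_sample_logp)
    m = [sum(log_pks(di, dj) for dj in in_sample_logp) / t0
         for di in in_sample_logp]
    mu = sum(m) / t0
    sigma = math.sqrt(sum((v - mu) ** 2 for v in m) / t0)
    return mu, sigma

def is_change(new_logp, in_sample_logp, mu, sigma, threshold=3.0):
    """Algorithm 2: flag D_t when |M_KS(t) - mu_M| / sigma_M > threshold."""
    t0 = len(in_sample_logp)
    m_t = sum(log_pks(new_logp, di) for di in in_sample_logp) / t0
    return abs(m_t - mu) / sigma > threshold

# Stand-in LogP samples: 7 in-sample "days", then one similar and one shifted day.
random.seed(1)
in_sample = [[random.gauss(-4.0, 0.5) for _ in range(300)] for _ in range(7)]
mu_m, sigma_m = natural_variability(in_sample)
same = [random.gauss(-4.0, 0.5) for _ in range(300)]
shifted = [random.gauss(-3.0, 0.5) for _ in range(300)]
```

Under these assumptions, `is_change(shifted, in_sample, mu_m, sigma_m)` flags the shifted day, while the unchanged day is judged against the same in-sample variability.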
Further, since t_0 is typically a small constant (e.g., t_0 = 7 or t_0 = 30), the out-of-sample evaluation practically has time complexity O(n log n).

3 Experimental Setup

The experiments presented consist of two parts: experiments on synthetic data and experiments on aggregations over real web-log data.

3.1 Experiments on Synthetic Data

Synthetic data is a valuable tool when determining both the applicability and the limitations of the proposed approach. Synthetic data was generated by sampling from a two-component CI model (the true model is not used in the evaluations). The data consist of a two-state discrete dimension and a continuous dimension. The first 100 data sets were generated by sampling from a mixture model with parameters [π_1, π_2] = [0.6, 0.4] as weights, θ_1 = [0.8, 0.2] and θ_2 = [0.4, 0.6] as discrete state probabilities, and [μ_1, σ_1²] = [10, 5] and [μ_2, σ_2²] = [0, 7] as means and variances (Gaussian) for the continuous variable. Then the discrete-dimension probability of the second cluster was changed from θ_2 = [0.4, 0.6] to θ_2' = [0.5, 0.5], keeping the remaining parameters fixed, and an additional 100 data sets were generated by sampling from this altered model. This is a fairly small change in the distribution, and the underlying LogP scores appear very similar, as can be seen in Figure 1. The figure shows LogP distributions for the first two data sets generated from the first model (top row) and the first two data sets generated from the second model (bottom row). Plots within each row should be more similar than plots from different rows, but this is difficult to discern by visual inspection.

Figure 2: Average log(KS probability) over the in-sample period for four experiments on synthetic data, varying the number of data points per data set: a) 1,000; b) 5,000; c) 10,000; d) 50,000. The dotted vertical line separates in-sample and out-of-sample periods. Note that the y-axes have different scales in order to show the full variability of the data.

Algorithms 1 and 2 were evaluated by using the first 10 data sets to estimate a two-component model. Then pairwise KS measures were calculated between all possible data set pairs relative to the estimated model. Figure 2 shows average KS measures over the in-sample data sets (the first 10) for four experiments varying the number of data points per data set. Note that the vertical axes are different in each of the plots to better show the range of values.
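The synthetic generator described here can be sketched as follows, using the parameter values quoted in the text; the helper name `sample_ci` and the seed are assumptions of ours:

```python
import random

def sample_ci(n, weights, discrete_probs, means, variances, rng):
    """Sample n points from a two-component CI model: a two-state discrete
    attribute and a Gaussian attribute, independent given the component."""
    data = []
    for _ in range(n):
        k = 0 if rng.random() < weights[0] else 1
        state = 0 if rng.random() < discrete_probs[k][0] else 1
        value = rng.gauss(means[k], variances[k] ** 0.5)
        data.append((state, value))
    return data

rng = random.Random(42)
weights, means, variances = [0.6, 0.4], [10.0, 0.0], [5.0, 7.0]
before = sample_ci(50000, weights, [[0.8, 0.2], [0.4, 0.6]], means, variances, rng)
after = sample_ci(50000, weights, [[0.8, 0.2], [0.5, 0.5]], means, variances, rng)

# Marginal P(state = 0) moves from 0.6*0.8 + 0.4*0.4 = 0.64
# to 0.6*0.8 + 0.4*0.5 = 0.68 after the change in theta_2.
frac_before = sum(1 for s, _ in before if s == 0) / len(before)
frac_after = sum(1 for s, _ in after if s == 0) / len(after)
```

The small 0.04 shift in the marginal of the discrete attribute illustrates why the change is hard to see in the raw LogP histograms yet detectable via the KS measure.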
As the number of data points in the data set increases, the change that occurs at t = 101 becomes more apparent. At 50,000 data points (bottom right plot of Figure 2) the change in the distribution becomes easily detectable. Since this number of data points is typically considered small compared to the number of data points in our real-life applications, we expect to be able to detect such slight distribution changes.

3.2 Experiments on Real Life Data

Figure 3 shows a distribution for a typical day from a content web-site. There are almost 50,000 data points in the data set, with over 100 dimensions each. The LogP score distribution is similar to that of the synthetic data in Figure 1, which is a consequence of the CI model used. Note, however, that in this data set the true generating distribution is not known and is unlikely to be purely a CI model. Therefore, the average log KS measure over the in-sample data has much lower values (see Figure 3, right, and the plots in Figure 2). Another way to phrase this observation is to note that since the true generating data distribution is most likely not CI, the observed similarity of LogP distributions (the KS measure) is much lower, since there are two factors of dissimilarity: 1) different data sets; 2) the inability of the CI model to capture all aspects of the true data distribution.

Figure 3: Left: distribution of 42,655 LogP scores from a mixture of conditional independence models. The data is a single day of click-stream data from a commercial web site. Right: Average log(KS probability) over the 31-day in-sample period for a content web-site, showing a glitch on day 27 and a permanent change on day 43, both detected by the proposed methodology.

Nonetheless, the first 31 data sets (one month of data) that were used to build the initial model Θ_0 can be used to define the natural variability of the KS measures against which additional data sets can be compared. The result is that in Figure 3 we clearly see a problem with the distribution on day 27 (a glitch in the data) and a permanent change in the distribution on day 43. Both of the detected changes correspond to real changes in the data, as verified by the commercial website operators. Automatic description of changes in the distribution and criteria for automatic rebuilding of the model are beyond the scope of this paper.

4 Related Work

Automatic detection of various types of data changes appears in the literature in several different flavors. For example, novelty detection ([4], [8]) is the task of determining unusual or novel data points relative to some model. This is closely related to the outlier detection problem ([1], [5]), where the goal is not only to find unusual data points, but the ones that appear not to have been generated by the data-generating distribution. A related problem has been addressed by [2] in the context of time series modeling, where outliers and trends can contaminate the model estimation. More recently, mixture models have been applied more directly to outlier detection [3].

The method proposed in this paper addresses a different problem. We are not interested in new and unusual data points; on the contrary, the method is quite robust with respect to outliers. An outlier or two do not necessarily mean that the underlying data distribution has changed.
Also, some of the distribution changes we are interested in detecting might be considered uninteresting and/or not novel; for example, a slight shift of the population as a whole is something that we certainly detect as a change, but it is rarely considered novel unless the shift is drastic.

There is also a set of online learning algorithms that update model parameters as new data becomes available (for variants and additional references, see, e.g., [6]). In that framework there is no such concept as a data distribution change, since the models are constantly updated to reflect the most current distribution. For example, instead of detecting a slight shift of the population as a whole, online learning algorithms update the model to reflect the shift.

5 Conclusions

In this paper we introduced a model-based method for automatic distribution change detection in an online data environment. Given the LogP distribution data signature, we further showed how to compare different data sets relative to the model using KS statistics and how to obtain a single measure of similarity between the new data and the model. Finally, we discussed heuristics for change detection that become principled in the limit as the number of possible data sets increases.

Experimental results over synthetic and real online data indicate that the proposed methodology is able to alert the analyst to slight distributional changes. This methodology may be used as the basis of a system to automatically re-estimate the parameters of a mixture model on an "as-needed" basis, i.e., when the model fails to adequately represent the data after a certain point in time.

References

[1] V. Barnett and T. Lewis. Outliers in Statistical Data. Wiley, 1984.

[2] A. G. Bruce, J. T. Conor, and R. D. Martin. Prediction with robustness towards outliers, trends, and level shifts.
In Proceedings of the Third International Conference on Neural Networks in Financial Engineering, pages 564-577, 1996.

[3] I. V. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 37-46. ACM, 2001.

[4] C. Campbell and K. P. Bennett. A linear programming approach to novelty detection. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 395-401. MIT Press, 2001.

[5] T. Fawcett and F. J. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53-62, 1999.

[6] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355-368. Kluwer Academic Publishers, 1998.

[7] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Cambridge University Press, Cambridge, UK, 1992.

[8] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 582-588. MIT Press, 2000.
", "award": [], "sourceid": 2008, "authors": [{"given_name": "Igor", "family_name": "Cadez", "institution": null}, {"given_name": "P. S.", "family_name": "Bradley", "institution": null}]}