{"title": "A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters", "book": "Advances in Neural Information Processing Systems", "page_first": 651, "page_last": 658, "abstract": "", "full_text": "A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters\n\nDaniel B. Neill\nDepartment of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213\nneill@cs.cmu.edu\n\nAndrew W. Moore\nDepartment of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213\nawm@cs.cmu.edu\n\nAbstract\n\nGiven an N × N grid of squares, where each square has a count and an underlying population, our goal is to find the square region with the highest density, and to calculate its significance by randomization. Any density measure D, dependent on the total count and total population of a region, can be used. For example, if each count represents the number of disease cases occurring in that square, we can use Kulldorff's spatial scan statistic D_K to find the most significant spatial disease cluster. A naive approach to finding the maximum density region requires O(N^3) time, and is generally computationally infeasible. We present a novel algorithm which partitions the grid into overlapping regions, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufficiently dense regions, this method finds the maximum density region in optimal O(N^2) time, in practice resulting in significant (10-200x) speedups.\n\n1 Introduction\nThis paper develops fast methods for detection of spatial overdensities: discovery of spatial regions with high scores according to some density measure, and statistical significance testing in order to determine whether these high-density regions can reasonably have occurred by chance. 
A major application is in identifying clusters of disease cases, for purposes ranging from detection of bioterrorism (e.g. anthrax) to identification of environmental risk factors for diseases such as childhood leukemia ([1]-[3]). [4] discusses many other applications, including astronomy (identifying star clusters), reconnaissance, and medical imaging.\n\nConsider the case in which counts are aggregated to a uniform 2-d grid. Assume an N × N grid of squares G, where each square s_ij ∈ G is associated with a count c_ij and an underlying population p_ij. For example, a square's count may be the number of disease cases in that geographical region in a given time period, while its population may be the total number of people \u201cat-risk\u201d for the disease. Our goal is to find the square region S* ⊆ G with the highest density according to a density measure D: S* = arg max_S D(S). We use the abbreviations mdr for the \u201cmaximum density region\u201d S*, and mrd for the \u201cmaximum region density\u201d D(S*), throughout. The density measure D must be an increasing function of the total count of the region, C(S) = Σ_S c_ij, and a decreasing function of the total population of the region, P(S) = Σ_S p_ij. In the case of a uniform underlying population, P(S) ∝ k^2, where k is the size of region S. But we focus on the more interesting case: non-uniform populations.\n\nThe problem of finding significant spatial overdensities is distinct from that solved by grid-based hierarchical methods such as CLIQUE [5], MAFIA [6], and STING [7], which also look for \u201cdense clusters.\u201d There are three main differences:\n\n1. Our method is applicable to any density measure D, while the other algorithms are specific to the \u201cstandard\u201d density measure D1(S) = C(S)/P(S). 
The D1 measure is the number of cases per unit population; for example, the region maximizing D1 is the region with the highest observed disease rate. Unlike many other density measures, D1 is monotonic: if a region S with density d is partitioned into any set of disjoint subregions, at least one subregion will have density d' ≥ d. Thus it is not particularly useful to find the \u201cregion\u201d with maximum D1, since this will be the single square with highest c_ij/p_ij. Instead, the other algorithms search for maximally sized regions with D1 greater than some threshold, relying on the monotonicity of D1 by first finding dense units (1 × 1 squares), then merging adjacent units in bottom-up fashion. For a non-monotonic measure such as Kulldorff's, it is possible to have a large dense region where none of its subregions are themselves dense, so bottom-up search can fail. Here, we will optimize with respect to arbitrary non-monotonic density measures, and thus use a different approach from CLIQUE, MAFIA, or STING.\n\n2. Our method deals with non-uniform underlying populations: this is particularly important for real-world epidemiological applications, in which an overdensity of disease cases is more significant if the underlying population is large.\n\n3. Our goal is not only to find the highest scoring region, but also to test whether that region is a true cluster or whether it is likely to have occurred by chance.\n\n1.1 The spatial scan statistic\nA non-monotonic density measure which is of great interest to epidemiologists is Kulldorff's spatial scan statistic [8], which we denote by D_K. This assumes that counts c_ij are generated by an inhomogeneous Poisson process with mean q·p_ij, where q is the underlying \u201cdisease rate\u201d (or expected value of the D1 density). We then calculate the log of the likelihood ratio of two possibilities: that the disease rate q is higher in the region than outside the region, and that the disease rate is identical inside and outside the region. For a region with count C and population P, in a grid with total count C_tot and population P_tot, we can calculate D_K = C log(C/P) + (C_tot − C) log((C_tot − C)/(P_tot − P)) − C_tot log(C_tot/P_tot) if C/P > C_tot/P_tot, and 0 otherwise. [8] proved that the spatial scan statistic is individually most powerful for finding a significant region of elevated disease rate: it is more likely to detect the overdensity than any other test statistic. Note, however, that our algorithm is general enough to use any density measure, and in some cases we may wish to use measures other than Kulldorff's. For instance, if we have some idea of the size of the maximum density region, we can use the D_r measure, D_r(S) = C(S)/P(S)^r, 0 < r < 1, with larger r corresponding to tests for smaller clusters.\n\nOnce we have found the maximum density region (mdr) of grid G according to our density measure, we must still determine the statistical significance of this region. Since the exact distribution of the test statistic is only known in special cases (such as D1 density with a uniform underlying population), in general we must perform Monte Carlo simulation for our hypothesis test. To do so, we run a large number R of random replications, where a replica has the same underlying populations p_ij as G, but assumes a uniform disease rate q_rep = C_tot(G)/P_tot(G) for all squares. For each replica G', we first generate all counts c_ij randomly from an inhomogeneous Poisson distribution with mean q_rep·p_ij, then compute the maximum region density (mrd) of G' and compare this to mrd(G). The number of replicas G' with mrd(G') ≥ mrd(G), divided by the total number of replications R, gives us the p-value for our maximum density region. If this p-value is less than .05, we can conclude that the discovered region is statistically significant (unlikely to have occurred by chance) and is thus a \u201cspatial overdensity.\u201d If the test fails, we have still discovered the maximum density region of G, but there is not sufficient evidence that this is an overdensity.\n\n1.2 The naive approach\nThe simplest method of finding the maximum density region is to compute the density of all square regions of sizes k = k_min, ..., N.¹ Since there are (N − k + 1)^2 regions of size k, there are a total of O(N^3) regions to examine. We can compute the density of any region S in O(1), by first finding the count C(S) and population P(S), then applying our density measure D(C, P).² This allows us to compute the mdr of an N × N grid G in O(N^3) time. However, significance testing by Monte Carlo replication also requires us to find the mrd for each replica G', and compare this to mrd(G). Since calculation of the mrd takes O(N^3) time for each replica, the total complexity is O(RN^3), and R is typically large (we assume R = 1000). Several simple tricks may be used to speed up this procedure for cases where there is no significant spatial overdensity. First, we can stop examining a replica G' immediately if we find a region with density greater than mrd(G). Second, we can use the Central Limit Theorem to halt our Monte Carlo testing early if, after a number of replications R' < R, we can conclude with high confidence that the region is not significant. 
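As a concrete reference point, the naive baseline just described can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names (`cumulative`, `region_sum`, `naive_mdr`) are our own, and `kulldorff` implements the D_K formula of Section 1.1 directly, with the cumulative-count trick of footnote 2 giving O(1) region sums.

```python
import numpy as np

def cumulative(grid):
    # cum[i][j] holds the sum of grid[:i, :j]; O(N^2) to build.
    return np.pad(np.cumsum(np.cumsum(grid, axis=0), axis=1), ((1, 0), (1, 0)))

def region_sum(cum, x, y, k):
    # Total of the k x k square with upper-left corner (x, y), in O(1).
    return cum[x + k, y + k] - cum[x, y + k] - cum[x + k, y] + cum[x, y]

def kulldorff(C, P, C_tot, P_tot):
    # Log-likelihood-ratio score D_K; zero unless the region's rate exceeds the global rate.
    if P <= 0 or C <= 0 or C / P <= C_tot / P_tot:
        return 0.0
    score = C * np.log(C / P) - C_tot * np.log(C_tot / P_tot)
    if C_tot > C:
        score += (C_tot - C) * np.log((C_tot - C) / (P_tot - P))
    return score

def naive_mdr(counts, pops, k_min=3):
    # O(N^3) exhaustive scan over all square regions of sizes k_min..N.
    N = counts.shape[0]
    cc, cp = cumulative(counts), cumulative(pops)
    C_tot, P_tot = cc[N, N], cp[N, N]
    best_score, best_region = 0.0, None
    for k in range(k_min, N + 1):
        for x in range(N - k + 1):
            for y in range(N - k + 1):
                s = kulldorff(region_sum(cc, x, y, k), region_sum(cp, x, y, k),
                              C_tot, P_tot)
                if s > best_score:
                    best_score, best_region = s, (x, y, k)
    return best_score, best_region
```

Running `naive_mdr` on each Poisson replica, and counting replicas whose best score beats the original grid's, gives the Monte Carlo p-value described above.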
For cases where there is a significant spatial overdensity, the naive approach is still extremely computationally expensive, and this motivates our search for a faster algorithm.\n\n¹We assume that a region must have size at least k_min to be significant: here k_min = 3.\n²An old trick allows us to compute the count of any k × k region in O(1): we first form a matrix of the cumulative counts, then compute each region's count by adding at most four cumulative counts.\n\n2 Overlap-multires partitioning\nSince the problem of detection of spatial overdensities is closely related to problems such as kernel density estimation and kernel regression, this suggests that multi-resolution partitioning techniques such as kd-trees [9] and mrkd-trees [10] may be useful in speeding up our search. The main difference of our problem from kernel density estimation, however, is that we are only interested in the maximum density region; thus, we do not necessarily need to build a space-partitioning tree at all resolutions. Also, the assumption that counts are aggregated to a uniform grid simplifies and speeds up partitioning, eliminating the need for a computationally expensive instance-based approach. These observations suggest a top-down multi-resolution partitioning approach, in which we search first at coarse resolutions (large regions), then at successively finer resolutions as necessary. One option would be to use a \u201cquadtree\u201d [11], a hierarchical data structure in which each region is recursively partitioned into its top left, top right, bottom left, and bottom right quarters. However, a simple partitioning approach fails because of the non-monotonicity of our density measure: a dense region may be split into two or more separate subregions, none of which is as dense as the original region. This problem can be prevented by a partitioning approach in which adjacent regions partially overlap, an approach we call \u201coverlap-multires partitioning.\u201d\n\nTo explain how this method works, we first define some notation. We denote a region S by an ordered triple (x, y, k), where (x, y) is the upper left corner of the region and k is its size. Next, we define the w-children of a region S = (x, y, k) as the four overlapping subregions of size k − w corresponding to the top left, top right, bottom left, and bottom right corners of S: (x, y, k − w), (x + w, y, k − w), (x, y + w, k − w), and (x + w, y + w, k − w). Next, we define a region as \u201ceven\u201d if its size is 2^k for some k ≥ 2, and \u201codd\u201d if its size is 3 × 2^k for some k ≥ 0. We define the \u201cgridded children\u201d (g-children) of an even region S = (x, y, k) as its w-children for w = k/4. Thus the four g-children of an even region are odd, and each overlaps 2/3 with the directly adjacent child regions. Similarly, we define the g-children of an odd region S = (x, y, k) as its w-children for w = k/3. Thus the four g-children of an odd region are even, and each overlaps 1/2 with the directly adjacent child regions. Note that even though a region has four g-children, and each of its g-children has four g-children, it has only nine (not 16) distinct grandchildren, several of which are the child of multiple regions. Figure 1 shows the first two levels of such a tree.\n\nFigure 1: The first two levels of the overlap-multires tree. Each node represents a gridded region (denoted by a thick square) of the entire dataset (thin square and dots).\n\nNext, we assume that the size of the entire grid is a power of two: thus the entire grid G = (0, 0, N) is an even region. 
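The region notation above is easy to make concrete. The following sketch (our own illustrative code, with hypothetical helper names) enumerates w-children and g-children, and can be used to check the claim that a region has only nine distinct grandchildren.

```python
def is_even_region(k):
    # "Even" gridded regions have size 2^m (m >= 2); the other gridded sizes
    # are 3 * 2^m ("odd").
    return k >= 4 and (k & (k - 1)) == 0

def w_children(region, w):
    # Four overlapping subregions of size k - w, one per corner of (x, y, k).
    x, y, k = region
    return [(x, y, k - w), (x + w, y, k - w), (x, y + w, k - w), (x + w, y + w, k - w)]

def g_children(region):
    # Gridded children: w = k/4 for an even region, w = k/3 for an odd region.
    x, y, k = region
    return w_children(region, k // 4 if is_even_region(k) else k // 3)
```

For a size-16 grid, the four g-children are size-12 (odd) regions, and their g-children land on only the nine distinct size-8 positions, as in Figure 1.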
We define the set of \u201cgridded\u201d regions of G as G and all of its \u201cgridded descendants\u201d (its g-children, g-grandchildren, etc.). Our algorithm focuses its search on the set of gridded regions, only searching non-gridded regions when necessary. This technique is useful because the total number of gridded regions is O(N^2), as in the simple quadtree partitioning method. This implies that, if only gridded regions need to be searched, our total time to find the mdr of a grid is O(N^2). Since it takes Ω(N^2) time to generate the grid, this time bound is optimal.\n\n2.1 Top-down pruning\nSo when can we search only gridded regions, or alternatively, when does a given non-gridded region need to be searched? Our basic method is branch-and-bound: we perform a top-down search, and speed up this search by pruning regions which cannot possibly contain the mdr. Our first step is to derive an upper bound D_max(S, k) on the density of subregions of minimum size k contained in a given region S (Section 2.2). Then we can compare D_max(S, k) to the density D(S*) of the best region found so far: if D_max(S, k) < D(S*), we know that no subregion of S with size k or more can be the mdr.\n\nWe can use this information for two types of pruning. First, if D_max(S, k_min) < D(S*), we know that no subregion of S can be optimal; we can prune the region completely, and not search its (gridded or non-gridded) children. Second, we can show that (for 0 < k < n) any region of size 2^k + 1 or less is contained entirely in an odd gridded region of size (3/2) × 2^k. Thus, if D_max(G, 2^(n−1) + 2) < D(S*) for the entire grid G, any optimal non-gridded region must be contained in an odd gridded region. Similarly, if D_max(S, 2^k + 2) < D(S*) for an odd gridded region S of size 3 × 2^k, any optimal non-gridded subregion of S must be within an odd gridded subregion of S. 
Thus we can search only gridded regions if two conditions hold: 1) no subregion of G of size 2^(n−1) + 2 or more can be optimal, and 2) for each odd gridded region of size 3 × 2^k, no subregion of size 2^k + 2 or more can be optimal.\n\n2.2 Bounding subregion density\nTo bound the maximum subregion density D_max(S, k), we must find the highest possible score D(S') of a subregion S' ⊆ S of size k or more. Let C = C(S), P = P(S), and K = size(S). We assume that these are known, as well as lower and upper bounds [d_min, d_max] on the D1 density of subregions of S. Let c = C(S') and p = P(S'); these are presently unknown. We can prove that, if D(S') > D(S), the maximum value of D(S') occurs when S' has the maximum allowable D1 density d_max, and S − S' has the minimum allowable D1 density d_min: this gives us p·d_max + (P − p)·d_min = C. Thus p = (C − P·d_min)/(d_max − d_min) and c = d_max·p = (C − P·d_min)/(1 − d_min/d_max). Then computing D(c, p) gives us a guaranteed upper bound on D_max(S, k).\n\nWe can place tighter bounds on D_max(S, k) if we also have a lower bound p_min(S, k) on the population of a size-k subregion S' ⊆ S: in this case, if the value calculated for p in the equation above is less than p_min, we know that D(c', p_min), where c' = C − (P − p_min)·d_min, is a tighter upper bound for D_max. We can bound p_min in several ways. First, if we know the minimum population p_s,min of a single square s ∈ S, then p_min ≥ k^2·p_s,min. Second, if we know the maximum population p_s,max of a single square s ∈ S, then p_min ≥ P − (K^2 − k^2)·p_s,max. At the beginning of our algorithm, we calculate p_s,max(S) = max p_ij and p_s,min(S) = min p_ij (where s_ij ∈ S) for each gridded region S. This calculation can be done recursively (bottom-up) in O(N^2) time. The resulting population statistics are used for the original grid and for all replicas. 
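The bound computation just described can be sketched directly from the algebra; this is our own illustrative code (names `d_max_bound` and `p_min_bound` are hypothetical), assuming the density measure is passed in as a function D(c, p) and that d_max > d_min.

```python
def p_min_bound(P, K, k, p_sq_min, p_sq_max):
    # Lower bound on the population of any k x k subregion of a size-K region
    # with total population P, from per-square population extremes.
    return max(k * k * p_sq_min, P - (K * K - k * k) * p_sq_max)

def d_max_bound(D, C, P, d_min, d_max, p_min=None):
    # Upper bound on D over subregions: place the subregion at density d_max and
    # the remainder at d_min, so that p*d_max + (P - p)*d_min = C.
    p = (C - P * d_min) / (d_max - d_min)
    p = min(max(p, 0.0), P)  # clamp to the feasible range (our own safeguard)
    if p_min is not None and p < p_min:
        # The subregion population cannot fall below p_min; tighten the bound.
        return D(C - (P - p_min) * d_min, p_min)
    return D(d_max * p, p)
```

For example, with D = D1, C = 100, P = 1000 and D1-density bounds [0.05, 0.2], the bound is d_max itself (0.2); supplying p_min = 400 forces the subregion to absorb more low-density population and tightens the bound to 0.175.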
For non-gridded regions, we use the population statistics of the region's gridded parent (either an odd gridded region or the entire grid G); these bounds will be looser for the child region than for the parent, but are still correct. We also initially calculate d_max and d_min. This is done simply by finding the global maximum and minimum values of the D1 density: d_max = max C(S')/P(S') (where S' ⊆ G and size(S') = k_min), and d_min = min c_ij/p_ij (where s_ij ∈ G).³ Alternatively, we could compute d_max and d_min recursively (bottom-up) for each gridded region S, but in practice we find that the global values are sufficient for good performance on most test cases.\n\n2.3 The algorithm\nOur algorithm, based on the overlap-multires partitioning scheme above, is a top-down, best-first search of the set of gridded regions, followed by a top-down, best-first search of any non-gridded regions as necessary. We use priority queues (q1, q2) for our search: each step of the algorithm takes the \u201cbest\u201d (i.e. highest density) region from a queue, examines it, and (if necessary) adds its children to queues. The w-children and g-children of a region S are defined above; note that the 1-children of S are its w-children with w = 1. We also assume that regions are \u201cmarked\u201d once added to a queue, so that a region will not be searched more than once. Finally, we use the rules and density bounds derived above to speed up our search, by pruning subregions when D_max(S, k) ≤ D(S*). 
The basic pseudocode outline of our method is as follows:\n\nAdd G to q1.\nIf D_max(G, N/2+2) > mrd, add 1-children(G) to q2 with k1 = N/2+2.\nWhile q1 not empty:\n  Get best region S from q1.\n  If D(S) > mrd, set mdr = S and mrd = D(S).\n  If D_max(S, k_min) > mrd, add g-children(S) to q1.\n  If size(S) = 3(2^k) and D_max(S, 2^k+2) > mrd, add 1-children(S) to q2 with k1 = 2^k+2.\nWhile q2 not empty:\n  Get best region S and value k1(S) from q2.\n  If D(S) > mrd, set mdr = S and mrd = D(S).\n  If D_max(S, k1(S)) > mrd, add 1-children(S) to q2 with same k1.\n\nThese steps are first performed for the original grid, allowing us to calculate its mdr and mrd. We then perform these steps to calculate the mrd of each replica; however, several techniques allow us to reduce the amount of computation necessary for a replica. First, we can stop examining a replica G' immediately if we find a region with density greater than mrd(G). This is especially useful in cases where there is no significant spatial overdensity in G. Second, we can use mrd(G) for pruning our search on a replica G': if D_max(S, k) < mrd(G) for some S ⊆ G', we know that no subregion of S of size k or more can have a greater density than the mdr of the original grid, and thus we do not need to examine any of those subregions. This is especially useful where there is a significant spatial overdensity in G: a high mrd will allow large amounts of pruning on the replica grids.\n\n³We can use the tighter bound for d_max since we are using it to bound the density of a square region S' of size at least k_min; we cannot use the tighter bound for d_min since S − S' is not square.\n\n3 Improving the algorithm\nThe exact version of the algorithm uses conservative estimates of the D1 densities of S' and S − S' (d_max and d_min respectively), and a loose lower bound on the population of S', to calculate D_max(S, k). 
This results in a loose upper bound on D_max which is guaranteed to be correct, but allows little pruning to be done. We can derive tighter bounds on D_max in two ways: by using a closer approximation to the D1 density of S − S', and by using a tighter lower bound on the population of S'. These improvements are discussed below.\n\n3.1 The outer density approximation\nTo derive tighter bounds on the maximum density of a subregion S' contained in a given region S, we first note that (under both the null hypothesis and the alternative hypothesis) we assume that at most one disease cluster S_dc exists, and that the disease rate q is expected to be uniform outside S_dc (or uniform everywhere, if no disease cluster exists). Thus, if S_dc is contained entirely in the region under consideration S, we would expect that the maximum density subregion S' of S is S_dc, and that the disease rate of S − S' is equal to the disease rate outside S: E[(C − c)/(P − p)] = (C_tot − C)/(P_tot − P) = d_out. Assuming that the D1 density of S − S' is equal to its expected value d_out, we obtain the equation p·d_max + (P − p)·d_out = C. Solving for p, we find p = (C − P·d_out)/(d_max − d_out). Then D_max(S, k) = D(c, p), where c = d_max·p.\n\nThe problem with this approach is that we have not compensated for the variance in densities: our calculated value of D_max is an upper bound for the maximum subregion density D(S') only in the most approximate probabilistic sense. We would expect the D1 density of S − S' to be less than its expected value half the time, and thus we would expect D(S') to be less than D_max at least half the time; in practice, our bound will be correct more often, since we are still using a conservative approximation of the D1 density of S'. 
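The outer density approximation amounts to solving a single linear equation for p. A minimal sketch (our own naming): the optional b and sigma arguments implement the d_out − b·σ variance adjustment discussed below, with σ supplied by the caller.

```python
def outer_approx_population(C, P, d_max, d_out, b=0.0, sigma=0.0):
    # Solve p*d_max + (P - p)*(d_out - b*sigma) = C for p.
    # b = 0 recovers the plain outer-density approximation.
    d_lo = d_out - b * sigma
    return (C - P * d_lo) / (d_max - d_lo)
```

Lowering the assumed outer density (b > 0) allocates more count and population to the candidate subregion, yielding a higher, more conservative D_max.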
Note also that we expect to underestimate D_max if the disease cluster S_dc is not contained entirely in S: this is acceptable (and desirable) since a region not containing S_dc does not need to be expanded.\n\nWe can improve the correctness of our probabilistic bound by also considering the variance of (C − c)/(P − p) − (C_tot − C)/(P_tot − P). Assuming that all counts outside S_dc are generated by an inhomogeneous Poisson distribution with parameter q·p_ij, we obtain: σ^2[(C − c)/(P − p) − (C_tot − C)/(P_tot − P)] = σ^2[Po(q(P − p))/(P − p) − Po(q(P_tot − P))/(P_tot − P)] = q/(P − p) + q/(P_tot − P) = q(P_tot − p)/((P − p)(P_tot − P)). Since the actual value of the parameter q is not known, we use a conservative empirical estimate: q = C_tot/(P_tot − p). From this, we obtain σ[(C − c)/(P − p) − (C_tot − C)/(P_tot − P)] = √(C_tot/((P − p)(P_tot − P))). Then we can compute p by solving p·d_max + (P − p)(d_out − b·σ) = C, and obtain c = d_max·p and D_max = D(c, p) as before.\n\nBy adjusting our approximation of the minimum density in this manner, we compute a higher score D_max, reducing the likelihood that we will underestimate the maximum subregion density and prune a region that should not necessarily be pruned. Given a constant b, the D1 density of S − S' will be greater than d_out − b·σ with probability P(Z < b), where Z is chosen randomly from the unit normal. For b = 2, there is a 98% chance that we will underestimate D1(S − S'), giving a guaranteed correct upper bound for the maximum subregion density. In practice, the maximum subregion density will be lower than our computed value of D_max more often, since our estimates for d_max and q are conservative. Thus, though our algorithm is approximate, it is very likely to converge to the globally optimal mdr. In fact, our experiments demonstrate that b = 1 is sufficient to obtain the correct region with over 90% probability, approaching 100% for sufficiently dense regions.\n\n3.2 Cached population statistics\nA final step in making the algorithm tractable is to cache certain statistics about the minimum populations of subsquares of gridded regions. This is only performed once: it need not be repeated for each replica (since populations need not be randomized). Although there is no room to describe it here, we have empirically shown it to give an important acceleration if populations are highly non-uniform. The results below make use of this technique.\n\n4 Results\nWe first describe results with artificially generated grids, and then real-world case data. An artificial grid is generated from a set of parameters (N, k, µ, σ, q', q''). The grid generator first creates an N × N grid, and randomly selects a k × k \u201ctest region.\u201d Then the population of each square is chosen randomly from a normal distribution with mean µ and standard deviation σ (populations less than zero are set to zero). Finally, the count of each square is chosen randomly from a Poisson distribution with parameter q·p_ij, where q = q' inside the test region and q = q'' outside the test region.\n\nWe tested three different adjustments for density variance (b = 0, 1, 2). The approximate algorithm was tested for grids of size N = 512; test region sizes of k = 16 and k = 4 were used, and the disease rate q was set to .002 inside the test region and .001 outside the test region. 
We used three different population distributions for testing: the \u201cstandard\u201d distribution (µ = 10^4, σ = 10^3), and two types of \u201chighly varying\u201d populations. For the \u201ccity\u201d distribution, we randomly selected a \u201ccity region\u201d of size 16: square populations were generated with µ = 10^7 and σ = 10^6 inside the city, and µ = 10^4 and σ = 10^3 outside the city. For the \u201chigh-σ\u201d distribution, we generated all square populations with µ = 10^4 and σ = 5 × 10^3. We first compared the performance of each variant of the algorithm to the naive approach for the three test cases; see Table 1 for results. For large test regions (k = 16), all variants of the algorithm had runtimes of ~20 minutes, as compared to 44 hours for the naive approach, a speedup of 122-155x. For small test regions (k = 4), we observed that performance generally decreased with increasing b: the algorithm achieved average speedups of 133x for b = 0, 61x for b = 1, and 18x for b = 2.\n\nNext, we tested accuracy by generating 50 artificial grids for each population distribution, and computing the percentage of test grids on which the algorithm was able to find the correct mdr (see Table 2). For the large test region (k = 16), all variants were able to find the correct mdr with high (96-100%) accuracy. For the small test region, accuracy improved significantly with increasing b: the non-variance-adjusted version (b = 0) achieved only 45% accuracy, while the variance-adjusted versions (b = 1 and b = 2) achieved 89% and 99% accuracy respectively. 
These results demonstrate that the approximate algorithm (with variance adjustment and cached population statistics) is able to achieve high performance and accuracy even for very small test regions and highly non-uniform populations.\n\nFinally, we measured the performance of the approximate algorithm on a grid generated from real-world data. We used a database of (anonymized) Emergency Department data collected from Western Pennsylvania hospitals in the period 1999-2002. This dataset contained a total of 630,000 records, each representing a single ED visit and giving the latitude and longitude of the patient's home location to the nearest .005 degrees (~1/3 mile, a sufficiently low resolution to ensure anonymity). For each record, the latitude L and longitude l were converted to a grid square s_ij by i = (L − L_min)/.005 and j = (l − l_min)/.005; this created a 512 × 512 grid. We tested for spatial clustering of \u201crecent\u201d disease cases: the \u201ccount\u201d of each square was the number of ED visits in that square in the last two months, and the \u201cpopulation\u201d of that square was the total number of ED visits in that square. See Figure 2 for a picture of this dataset, including the highest scoring region. We tested six variants of the approximate algorithm on the ED dataset; the presence/absence of cached population statistics did not significantly affect the performance or accuracy for this test, so we focus on the variation in b. All three variants (b = 0, 1, 2), as well as the naive algorithm, found the maximum density region (of size 101) and found it statistically significant (p-value 0/1000). 
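The latitude/longitude binning used to build the 512 × 512 grid can be sketched as follows. This is our own illustrative code (the paper does not specify its implementation, and the function name is hypothetical); we round to the nearest cell index to avoid floating-point truncation artifacts, and drop records outside the grid.

```python
import numpy as np

def aggregate_to_grid(lats, lons, lat_min, lon_min, N=512, res=0.005):
    # i = (L - L_min)/res, j = (l - l_min)/res, rounded to the nearest cell.
    i = np.rint((np.asarray(lats) - lat_min) / res).astype(int)
    j = np.rint((np.asarray(lons) - lon_min) / res).astype(int)
    keep = (i >= 0) & (i < N) & (j >= 0) & (j < N)  # drop out-of-range records
    grid = np.zeros((N, N), dtype=int)
    np.add.at(grid, (i[keep], j[keep]), 1)  # unbuffered add handles repeated cells
    return grid
```

Calling this once over the last two months of visits gives the "count" grid, and once over all visits gives the "population" grid described above.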
The major difference, of course, was in runtime and number of regions searched (see Table 3). The naive algorithm took 2.7 days to find the mdr and perform 1000 Monte Carlo replications, while each of the variants of the approximate algorithm performed the same task in ~2 hours or less. The approximate algorithm took 19 minutes (a speedup of 209x) for b = 0, 47 minutes (a speedup of 85x) for b = 1, and 126 minutes (a speedup of 31x) for b = 2. Thus we can see that all three variants find the correct region in much less time than the naive approach. This is very important for applications such as real-time detection of disease outbreaks: if a system is able to detect an outbreak in minutes rather than days, preventive measures or treatments can be administered earlier, possibly saving many lives.\n\nThus we have presented a fast overlap-multires partitioning algorithm for detection of spatial overdensities, and demonstrated that this method results in significant (10-200x) speedups on real and artificially generated datasets. We are currently applying this algorithm to national-level hospital and pharmacy data, attempting to detect statistically significant indications of a disease outbreak based on changes in the spatial clustering of disease cases. Application of a fast partitioning method using the techniques presented here may allow us to achieve the difficult goal of automatic real-time detection of disease outbreaks.\n\nFigure 2: The left picture shows the \u201cpopulation\u201d distribution within Western PA and the right picture shows the \u201ccounts\u201d distribution. The winning region is shown as a square.\n\nTable 1: Performance of algorithm, N = 512\nmethod | test | time (orig + 1000 reps) | speedup\nnaive | all | 2:37 + 43:36:40 | x1\nb = 0 | std, k = 16 | 0:42 + 16:40 | x151\nb = 1 | std, k = 16 | 0:43 + 16:20 | x154\nb = 2 | std, k = 16 | 0:41 + 17:00 | x148\nb = 0 | std, k = 4 | 0:41 + 17:00 | x148\nb = 1 | std, k = 4 | 0:41 + 29:10 | x88\nb = 2 | std, k = 4 | 0:42 + 1:13:00 | x36\nb = 0 | city, k = 16 | 0:42 + 16:30 | x153\nb = 1 | city, k = 16 | 0:46 + 20:40 | x122\nb = 2 | city, k = 16 | 0:41 + 18:40 | x135\nb = 0 | city, k = 4 | 0:43 + 24:30 | x104\nb = 1 | city, k = 4 | 0:44 + 2:11:00 | x20\nb = 2 | city, k = 4 | 0:47 + 7:06:50 | x6.1\nb = 0 | high-σ, k = 16 | 0:41 + 17:00 | x148\nb = 1 | high-σ, k = 16 | 0:41 + 16:40 | x151\nb = 2 | high-σ, k = 16 | 0:41 + 17:00 | x148\nb = 0 | high-σ, k = 4 | 0:44 + 17:15 | x146\nb = 1 | high-σ, k = 4 | 0:45 + 34:10 | x75\nb = 2 | high-σ, k = 4 | 1:08 + 3:20:00 | x13\n\nTable 2: Accuracy of algorithm\nmethod | test | accuracy (k = 16) | accuracy (k = 4)\nb = 0 | standard | 96% | 52%\nb = 0 | city | 98% | 36%\nb = 0 | high-σ | 98% | 46%\nb = 1 | standard | 100% | 90%\nb = 1 | city | 100% | 88%\nb = 1 | high-σ | 100% | 90%\nb = 2 | standard | 100% | 98%\nb = 2 | city | 100% | 98%\nb = 2 | high-σ | 100% | 100%\n\nTable 3: Emergency Dept. dataset\nmethod | time (orig + 1000 reps) | speedup\nnaive | 4:05 + 65:50:00 | x1\nb = 0 | 4:20 + 14:36 | x209\nb = 1 | 4:22 + 42:20 | x85\nb = 2 | 4:36 + 2:01:12 | x31\n\nReferences\n[1] S. Openshaw, et al. 1988. Investigation of leukemia clusters by use of a geographical analysis machine. Lancet 1, 272-273.\n[2] L. A. Waller, et al. 1994. Spatial analysis to detect disease clusters. In N. Lange, ed. Case Studies in Biometry. Wiley, 3-23.\n[3] M. Kulldorff and N. Nagarwalla. 1995. Spatial disease clusters: detection and inference. Statistics in Medicine 14, 799-810.\n[4] M. Kulldorff. 1999. Spatial scan statistics: models, calculations, and applications. In Glaz and Balakrishnan, eds. Scan Statistics and Applications. Birkhauser: Boston, 303-322.\n[5] R. 
Agrawal, et al. 1998. Automatic subspace clustering of high dimensional data for data mining applications. Proc. ACM-SIGMOD Intl. Conference on Management of Data, 94-105.\n[6] S. Goil, et al. 1999. MAFIA: efficient and scalable subspace clustering for very large data sets. Northwestern University, Technical Report No. CPDC-TR-9906-010.\n[7] W. Wang, et al. 1997. STING: a statistical information grid approach to spatial data mining. Proc. 23rd Conference on Very Large Databases, 186-195.\n[8] M. Kulldorff. 1997. A spatial scan statistic. Communications in Statistics: Theory and Methods 26(6), 1481-1496.\n[9] F. P. Preparata and M. I. Shamos. 1985. Computational Geometry: An Introduction. Springer-Verlag: New York.\n[10] K. Deng and A. W. Moore. 1995. Multiresolution instance-based learning. Proc. 12th Intl. Joint Conference on Artificial Intelligence, 1233-1239.\n[11] H. Samet. 1990. The Design and Analysis of Spatial Data Structures. Addison-Wesley: Reading.\n", "award": [], "sourceid": 2529, "authors": [{"given_name": "Daniel", "family_name": "Neill", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}]}