{"title": "The Robustness of Estimator Composition", "book": "Advances in Neural Information Processing Systems", "page_first": 929, "page_last": 937, "abstract": "We formalize notions of robustness for composite estimators via the notion of a breakdown point.  A composite estimator successively applies two (or more) estimators: on data decomposed into disjoint parts, it applies the first estimator on each part, then the second estimator on the outputs of the first estimator. And so on, if the composition is of more than two estimators. Informally, the breakdown point is the minimum fraction of data points which if significantly modified will also significantly modify the output of the estimator, so it is typically desirable to have a large breakdown point. Our main result shows that, under mild conditions on the individual estimators, the breakdown point of the composite estimator is the product of the breakdown points of the individual estimators. We also demonstrate several scenarios, ranging from regression to statistical testing, where this analysis is easy to apply, useful in understanding worst case robustness, and sheds powerful insights onto the associated data analysis.", "full_text": "The Robustness of Estimator Composition\n\nPingfan Tang\n\nSchool of Computing\nUniversity of Utah\n\nSalt Lake City, UT 84112\ntang1984@cs.utah.edu\n\nJeff M. Phillips\n\nSchool of Computing\nUniversity of Utah\n\nSalt Lake City, UT 84112\n\njeffp@cs.utah.edu\n\nAbstract\n\nWe formalize notions of robustness for composite estimators via the notion of\na breakdown point. A composite estimator successively applies two (or more)\nestimators: on data decomposed into disjoint parts, it applies the first estimator on\neach part, then the second estimator on the outputs of the first estimator. And so\non, if the composition is of more than two estimators. 
Informally, the breakdown\npoint is the minimum fraction of data points which if significantly modified will\nalso significantly modify the output of the estimator, so it is typically desirable to\nhave a large breakdown point. Our main result shows that, under mild conditions\non the individual estimators, the breakdown point of the composite estimator is the\nproduct of the breakdown points of the individual estimators. We also demonstrate\nseveral scenarios, ranging from regression to statistical testing, where this analysis\nis easy to apply, useful in understanding worst case robustness, and sheds powerful\ninsights onto the associated data analysis.\n\n1 Introduction\n\nRobust statistical estimators [5, 7] (in particular, resistant estimators), such as the median, are an\nessential tool in data analysis since they are provably immune to outliers. Given data with a large\nfraction of extreme outliers, a robust estimator guarantees the returned value is still within the\nnon-outlier part of the data. In particular, the role of these estimators is quickly growing in importance\nas the scale and automation associated with data collection and data processing become more\ncommonplace. Artisanal data (hand crafted and carefully curated), where potential outliers can be\nremoved, is becoming proportionally less common. Instead, important decisions are being made\nblindly based on the output of analysis functions, often without looking at individual data points\nand their effect on the outcome. 
Thus, pipelines that use non-robust estimators are\nsusceptible to erroneous and dangerous decisions as the result of a few extreme and rogue data points.\nAlthough other approaches like regularization and pruning a constant number of obvious outliers\nare common as well, they do not come with the important guarantees that ensure these unwanted\noutcomes absolutely cannot occur.\nIn this paper we initiate the formal study of the robustness of composition of estimators through the\nnotion of breakdown points. These are especially important with the growth of data analysis pipelines\nwhere the final result or prediction is the result of several layers of data processing. When each layer\nin this pipeline is modeled as an estimator, our analysis provides the first general robustness\nanalysis of these processes.\nThe breakdown point [4, 3] is a basic measure of robustness of an estimator. Intuitively, it describes\nhow many outliers can be in the data without the estimator becoming unreliable. However, the\nliterature is full of slightly inconsistent and informal definitions of this concept. For example:\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\u2022 Aloupis [1] writes \u201cthe breakdown point is the proportion of data which must be moved to\ninfinity so that the estimator will do the same.\u201d\n\n\u2022 Huber and Ronchetti [8] write \u201cthe breakdown point is the smallest fraction of bad observations\nthat may cause an estimator to take on arbitrarily large aberrant values.\u201d\n\n\u2022 Dasgupta, Kumar, and Srikumar [14] write \u201cthe breakdown point of an estimator is the\nlargest fraction of the data that can be moved arbitrarily without perturbing the estimator to\nthe boundary of the parameter space.\u201d\n\nAll of these definitions have similar meanings, and they are typically sufficient for the purpose of\nunderstanding a single estimator. 
However, they are not mathematically rigorous, and it is difficult to\nuse them to discuss the breakdown point of composite estimators.\nComposition of Estimators. In a bit more detail (we give formal definitions in Section 2.1), an\nestimator E maps a data set to a single value in another space, sometimes the same space as the data\npoints. For instance, the mean and the median are simple estimators on one-dimensional data. A\ncomposite E1-E2 estimator applies two estimators E1 and E2 on data stored in a hierarchy. Let\nP = {P1, P2, . . . , Pn} be a set of subdata sets, where each subdata set Pi = {pi,1, pi,2, . . . , pi,k}\nhas individual data readings. Then the E1-E2 estimator reports E2(E1(P1), E1(P2), . . . , E1(Pn)),\nthat is, the estimator E2 applied to the output of estimator E1 on each subdata set.\n\n1.1 Examples of Estimator Composition\n\nComposite estimators arise in many scenarios in data analysis.\nUncertain Data. For instance, in the last decade there has been increased focus on the study\nof uncertain data [10, 9, 2] where, instead of analyzing a data set, we are given a model of the\nuncertainty of each data point. Consider tracking a summary of a group of n people based\non noisy GPS measurements. For each person i we might get k readings of their location Pi, and\nuse these k readings as a discrete probability distribution of where that person might be. Then in\norder to represent the center of this set of people a natural thing to do would be to estimate the\nlocation of each person as xi \u2190 E1(Pi), and then use these estimates to summarize the entire group as\nE2(x1, x2, . . . , xn). Using the mean as E1 and E2 would be easy, but would be susceptible to even\na single outrageous outlier (all people are in Manhattan, but a spurious reading was at (0, 0) lat-long,\noff the coast of Africa). An alternative is to use the L1-median for E1 and E2, which is known to have\nan optimal breakdown point of 0.5. 
But what is the breakdown point of the E1-E2 estimator?\nRobust Analysis of Bursty Behavior. Understanding the robustness of estimators can also be\ncritical to understanding how much one can \u201cgame\u201d a system. For instance, consider a start-up media website\nthat gets bursts of traffic from memes they curate. They publish a statistic showing the median of the\ntop half of traffic days each month, and aggregate these by taking the median of such values over the\ntop half of all months. This is a composite estimator, and they proudly claim that, even though they have\nbursty traffic, it is robust (each estimator has a breakdown point of 0.25). If this composite estimator\nshows large traffic, should a potential buyer of this website be impressed? Is there a better, more\nrobust estimator the potential buyer could request? If the media website can stagger the release of its\ncontent, how should they distribute it to maximize this composite estimator?\nPart of the Data Analysis Pipeline. This process of estimator composition is very common in\nthe broad data analysis literature. This arises from the idea of an \u201canalysis pipeline\u201d where at several\nstages estimators or analyses are performed on data, and then further estimators and analyses are\nperformed downstream. In many cases a robust estimator like the median is used, specifically for its\nrobustness properties, but there is no analysis of how robust the composition of these estimators is.\n\n1.2 Main Results\nThis paper initiates the formal and general study of the robustness of composite estimators.\n\n\u2022 In Subsection 2.1, we give two formal definitions of breakdown points which are both\nrequired to prove the composition theorem. 
One variant of the definition closely aligns with\nother formalizations [4, 3], while the other is fundamentally different.\n\n\u2022 The main result provides general conditions under which an E1-E2 estimator, whose components have\nbreakdown points \u03b21 and \u03b22, has a breakdown point of \u03b21\u03b22 (Theorem 2 in Subsection 2.2).\n\n\u2022 Moreover, by showing examples where our conditions do not strictly apply, we gain an\nunderstanding of how to circumvent the above result. An example is in composite percentile\nestimators (e.g., E1 returns the 25th percentile, and E2 the 75th percentile of a ranked set).\nThese composite estimators have a larger breakdown point than \u03b21 \u00b7 \u03b22.\n\n\u2022 The main result can be extended to multiple compositions, under suitable conditions, so for\ninstance an E1-E2-E3 estimator has a breakdown point of \u03b21\u03b22\u03b23 (Theorem 3 in Subsection\n2.3). This implies that long analysis chains can be very susceptible to a few carefully placed\noutliers since the breakdown point decays exponentially in the length of the analysis chain.\n\n\u2022 In Section 3, we highlight several applications of this theory, including robust regression,\nrobustness of p-values, a depth-3 composition, and how to advantageously manipulate the\nobservation about percentile estimator composition. We demonstrate a few more applications\nwith simulations in Section 4.\n\n2 Robustness of Estimator Composition\n\n2.1 Formal Definitions of Breakdown Points\nIn this paper, we give two definitions for the breakdown point: Asymptotic Breakdown Point and\nAsymptotic Onto-Breakdown Point. The first definition, Asymptotic Breakdown Point, is similar\nto the classic formal definitions in [4] and [3] (including their highly technical nature), although\ntheir definitions of the estimator are slightly different, leading to some minor differences in special\ncases. 
However, our second definition, Asymptotic Onto-Breakdown Point, is a structurally new\ndefinition, and we illustrate how it can result in significantly different values on some common and\nuseful estimators. Our main theorem will require both definitions, and the differences in performance\nwill lead to several new applications and insights.\nWe define an estimator E as a function from the collection of some finite subsets of a metric space\n(\ud835\udcb3, d) to another metric space (\ud835\udcb3\u2032, d\u2032):\n\nE : \ud835\udc9c \u2282 {X \u2282 \ud835\udcb3 | 0 < |X| < \u221e} \u21a6 \ud835\udcb3\u2032,   (1)\n\nwhere X is a multiset. This means if x \u2208 X then x can appear more than once in X, and the\nmultiplicity of elements will be considered when we compute |X|.\n\nFinite Sample Breakdown Point. For estimator E defined in (1) and positive integer n we define\nits finite sample breakdown point gE(n) over a set M as\n\ngE(n) = max(M) if M \u2260 \u2205, and gE(n) = 0 if M = \u2205,   (2)\n\nwhere \u03c1(x\u2032, X) = max_{x\u2208X} d(x\u2032, x) is the distance from x\u2032 to the furthest point in X, and\n\nM = {m \u2208 [0, n] | \u2200 X \u2208 \ud835\udc9c, |X| = n, \u2200 G1 > 0, \u2203 G2 = G2(X, G1) s.t. \u2200 X\u2032 \u2208 \ud835\udc9c,\nif |X\u2032| = n and |{x\u2032 \u2208 X\u2032 | \u03c1(x\u2032, X) > G1}| \u2264 m then d\u2032(E(X), E(X\u2032)) \u2264 G2}.   (3)\n\nFor an estimator E in (1) and X \u2208 \ud835\udc9c, the finite sample breakdown point gE(n) means if the number\nof unbounded points in X\u2032 is at most gE(n), then E(X\u2032) will be bounded. Let\u2019s break this definition\ndown a bit more. The definition holds over all data sets X \u2208 \ud835\udc9c of size n, and for all values G1 > 0\nand some value G2 defined as a function G2(X, G1) of the data set X and value G1. 
Then gE(n) is\nthe maximum value m (over all X, G1, and G2 above) such that, for all X\u2032 \u2208 \ud835\udc9c with |X\u2032| = n and\n|{x\u2032 \u2208 X\u2032 | \u03c1(x\u2032, X) > G1}| \u2264 m (that is, at most m points are further than G1 from X), the\nestimates remain close: d\u2032(E(X), E(X\u2032)) \u2264 G2.\nFor example, consider a point set X = {0, 0.15, 0.2, 0.25, 0.4, 0.55, 0.6, 0.65, 0.72, 0.8, 1.0} with\nn = 11 and median 0.55. If we set G1 = 3, then we consider sets X\u2032 of size 11 in which at most m\npoints are either greater than 3 or less than \u22122, and all other n \u2212 m points are in [\u22122, 3]. Under these\nconditions, we can (conservatively) set G2 = 4, and know that for m = 1, 2, 3, 4, or 5\nthe median of X\u2032 must be between \u22123.45 and 4.55; this holds no matter where we place those m\npoints (e.g., at 20 or at 1000). This does not hold for m \u2265 6, so gE(11) = 5.\n\nAsymptotic Breakdown Point. If the limit lim_{n\u2192\u221e} gE(n)/n exists, then we define this limit\n\n\u03b2 = lim_{n\u2192\u221e} gE(n)/n   (4)\n\nas the asymptotic breakdown point, or breakdown point for short, of the estimator E.\nRemark 1. It is not hard to see that for many common estimators this limit exists. For example, the\nmedian, L1-median [1], and Siegel estimators [11] all have asymptotic breakdown points of 0.5.\nAsymptotic Onto-Breakdown Point. For an estimator E given in (1) and positive integer n, if\n\nM\u0303 = {0 \u2264 m \u2264 n | \u2200 X \u2208 \ud835\udc9c, |X| = n, \u2200 y \u2208 \ud835\udcb3\u2032, \u2203 X\u2032 \u2208 \ud835\udc9c s.t. |X\u2032| = n, |X \u2229 X\u2032| = n \u2212 m, E(X\u2032) = y}\n\nis not empty, we define\n\nfE(n) = min(M\u0303).   (5)\n\nThe definition of fE(n) implies that, if we change fE(n) elements in X, we can make E become any\nvalue in \ud835\udcb3\u2032: it is onto. In contrast, gE(n) only requires E(X\u2032) to become far from E(X), perhaps\nonly in one direction. Then the asymptotic onto-breakdown point is defined as the following limit, if\nit exists:\n\nlim_{n\u2192\u221e} fE(n)/n.   (6)\n\nRemark 2. For a quantile estimator E that returns a percentile other than the 50th,\nlim_{n\u2192\u221e} gE(n)/n \u2260 lim_{n\u2192\u221e} fE(n)/n. For instance, if E returns the 25th percentile of a ranked set,\nsetting only 25% of the data points to \u2212\u221e causes E to return \u2212\u221e; hence lim_{n\u2192\u221e} gE(n)/n = 0.25.\nAnd while any value less than the original 25th percentile can also be obtained this way, to return a value\nlarger than the largest element in the original set, at least 75% of the data must be modified; thus\nlim_{n\u2192\u221e} fE(n)/n = 0.75.\nAs we will observe in Section 3, this nuance in definition regarding percentile estimators will allow\nfor some interesting composite estimator design.\n\n2.2 Definition of E1-E2 Estimators, and their Robustness\nWe consider the following two estimators:\n\nE1 : \ud835\udc9c1 \u2282 {X \u2282 \ud835\udcb31 | 0 < |X| < \u221e} \u21a6 \ud835\udcb32,   (7)\nE2 : \ud835\udc9c2 \u2282 {X \u2282 \ud835\udcb32 | 0 < |X| < \u221e} \u21a6 \ud835\udcb3\u20322,   (8)\n\nwhere any finite subset of E1(\ud835\udc9c1), the range of E1, belongs to \ud835\udc9c2. Suppose Pi \u2208 \ud835\udc9c1, |Pi| = k for\ni = 1, 2, \u00b7\u00b7\u00b7, n and Pflat = \u228e_{i=1}^{n} Pi, where \u228e means if x appears n1 times in X1 and n2 times in X2\nthen x appears n1 + n2 times in X1 \u228e X2. We define\n\nE(Pflat) = E2(E1(P1), E1(P2), \u00b7\u00b7\u00b7, E1(Pn)).   (9)\n\nTheorem 1. Suppose gE1(k) and gE2(n) are the finite sample breakdown points of estimators E1\nand E2 which are given by (7) and (8) respectively. If gE(nk) is the finite sample breakdown\npoint of E given by (9), then we have gE2(n)gE1(k) \u2264 gE(nk). If \u03b21 = lim_{k\u2192\u221e} gE1(k)/k, \u03b22 =\nlim_{n\u2192\u221e} gE2(n)/n and \u03b2 = lim_{n,k\u2192\u221e} gE(nk)/(nk) all exist, then we have \u03b21\u03b22 \u2264 \u03b2.\nThe proof of Theorem 1 and other theorems can be found in the full version of this paper [12].\nRemark 3. Under the condition of Theorem 1, we cannot guarantee \u03b2 = \u03b21\u03b22. For example, suppose\nE1 and E2 take the 25th percentile and the 75th percentile of a ranked set of real numbers respectively.\nSo, we have \u03b21 = \u03b22 = 1/4. However, \u03b2 = 1/4 \u00b7 3/4 = 3/16.\nIn fact, the limit of gE(nk)/(nk) as n, k \u2192 \u221e may not even exist. For example, suppose E1 takes the 25th\npercentile of a ranked set of real numbers. When n is odd E2 takes the 25th percentile of a ranked\nset of n real numbers, and when n is even E2 takes the 75th percentile of a ranked set of n real\nnumbers. Thus, \u03b21 = \u03b22 = 1/4, but gE(nk) \u2248 1/4 \u00b7 1/4 nk if n is odd, and gE(nk) \u2248 1/4 \u00b7 3/4 nk if n is even,\nwhich implies lim_{n,k\u2192\u221e} gE(nk)/(nk) does not exist.\nTherefore, to guarantee that \u03b2 exists and \u03b2 = \u03b21\u03b22, we introduce the definition of asymptotic\nonto-breakdown point in (6). As shown in Remark 2, the values of (4) and (6) may not be equal. However,\nwith the condition of the asymptotic breakdown point and asymptotic onto-breakdown point of E1\nbeing the same, we can finally state our desired clean result.\n\nTheorem 2. For estimators E1, E2 and E given by (7), (8) and (9) respectively, suppose gE1(k),\ngE2(n) and gE(nk) are defined by (2), and fE1(k) is defined by (5). Moreover, E1 is an onto function\nand for any fixed positive integer n we have\n\n\u2203 X \u2208 \ud835\udc9c2, |X| = n, G1 > 0, s.t. \u2200 G2 > 0, \u2203 X\u2032 \u2208 \ud835\udc9c2 satisfying\n|X\u2032| = n, |X\u2032 \\ X| = gE2(n) + 1, and d\u20322(E2(X), E2(X\u2032)) > G2,   (10)\n\nwhere d\u20322 is the metric of space \ud835\udcb3\u20322. If \u03b21 = lim_{k\u2192\u221e} gE1(k)/k = lim_{k\u2192\u221e} fE1(k)/k and \u03b22 =\nlim_{n\u2192\u221e} gE2(n)/n both exist, then \u03b2 = lim_{n,k\u2192\u221e} gE(nk)/(nk) exists, and \u03b2 = \u03b21\u03b22.\nRemark 4. Without the introduction of fE(n), we cannot even guarantee \u03b2 \u2264 \u03b21 or \u03b2 \u2264 \u03b22\nunder the condition of Theorem 1 alone, even if E1 and E2 are both onto functions. For example, for any\nP = {p1, p2, \u00b7\u00b7\u00b7, pk} \u2282 R and X = {x1, x2, \u00b7\u00b7\u00b7, xn} \u2282 R, we define E1(P) = 1/median(P)\n(if median(P) \u2260 0, otherwise define E1(P) = 0) and E2(X) = median(y1, y2, \u00b7\u00b7\u00b7, yn), where yi\n(1 \u2264 i \u2264 n) is given by yi = 1/xi (if xi \u2260 0, otherwise define yi = 0). Since gE1(k) = gE2(n) = 0\nfor all n, k, we have \u03b21 = \u03b22 = 0. However, in order to make E2(E1(P1), E1(P2), \u00b7\u00b7\u00b7, E1(Pn)) \u2192\n+\u221e, we need to make about n/2 elements in {E1(P1), E1(P2), \u00b7\u00b7\u00b7, E1(Pn)} go to 0+. To make\nE1(Pi) \u2192 0+, we need to make about k/2 points in Pi go to +\u221e. Therefore, we have gE(nk) \u2248 n/2 \u00b7 k/2,\nand \u03b2 = 1/4.\n\n2.3 Multi-level Composition of Estimators\nTo study the breakdown point of composite estimators with more than two levels, we introduce the\nfollowing estimator:\n\nE3 : \ud835\udc9c3 \u2282 {X \u2282 \ud835\udcb3\u20322 | 0 < |X| < \u221e} \u21a6 \ud835\udcb3\u20323,   (11)\n\nwhere any finite subset of E2(\ud835\udc9c2), the range of E2, belongs to \ud835\udc9c3. Suppose Pi,j \u2208 \ud835\udc9c1, |Pi,j| = k\nfor i = 1, 2, \u00b7\u00b7\u00b7, n, j = 1, 2, \u00b7\u00b7\u00b7, m and P^j_flat = \u228e_{i=1}^{n} Pi,j, Pflat = \u228e_{j=1}^{m} P^j_flat. We define\n\nE(Pflat) = E3(E2(P\u0303^1_flat), E2(P\u0303^2_flat), \u00b7\u00b7\u00b7, E2(P\u0303^m_flat)),   (12)\n\nwhere P\u0303^j_flat = {E1(P1,j), E1(P2,j), \u00b7\u00b7\u00b7, E1(Pn,j)}, for j = 1, 2, \u00b7\u00b7\u00b7, m.\n\nFrom Theorem 2, we can obtain the following theorem about the breakdown point of E in (12).\nTheorem 3. For estimators E1, E2, E3 and E given by (7), (8), (11) and (12) respectively, suppose\ngE1(k), gE2(n), gE3(m) and gE(mnk) are defined by (2), and fE1(k), fE2(n) are defined by (5).\nMoreover, E1 and E2 are both onto functions, and for any fixed positive integer m we have\n\n\u2203 X \u2208 \ud835\udc9c3, |X| = m, G1 > 0, s.t. \u2200 G2 > 0, \u2203 X\u2032 \u2208 \ud835\udc9c3\nsatisfying |X\u2032| = m, |X\u2032 \\ X| = gE3(m) + 1, and d\u20323(E3(X), E3(X\u2032)) > G2,\n\nwhere d\u20323 is the metric of space \ud835\udcb3\u20323. If \u03b21 = lim_{k\u2192\u221e} gE1(k)/k = lim_{k\u2192\u221e} fE1(k)/k, \u03b22 =\nlim_{n\u2192\u221e} gE2(n)/n = lim_{n\u2192\u221e} fE2(n)/n and \u03b23 = lim_{m\u2192\u221e} gE3(m)/m all exist, then \u03b2 =\nlim_{m,n,k\u2192\u221e} gE(mnk)/(mnk) exists, and \u03b2 = \u03b21\u03b22\u03b23.\n\n3 Applications\n\n3.1 Application 1 : Balancing Percentiles\n\nFor n companies, for simplicity, assume each company has k employees. We are interested in the\nincome of the regular employees of all companies, not the executives who may have much higher pay.\nLet pi,j represent the income of the jth employee in the ith company. 
Set Pflat = \u228e_{i=1}^{n} Pi, where the\nith company has a set Pi = {pi,1, pi,2, \u00b7\u00b7\u00b7, pi,k} \u2282 R and for notational convenience pi,1 \u2264 pi,2 \u2264\n\u00b7\u00b7\u00b7 \u2264 pi,k for i \u2208 {1, 2, \u00b7\u00b7\u00b7, n}. Suppose the income data Pi of each company is preprocessed by a\n45-percentile estimator E1 (median of lowest 90% of incomes), with breakdown point \u03b21 = 0.45. In\ntheory E1(Pi) can better reflect the income of regular employees in a company, since there may be\nabout 10% of employees in the management of a company and their incomes are usually much higher\nthan those of common employees. So, the preprocessed data is X = {E1(P1), E1(P2), \u00b7\u00b7\u00b7, E1(Pn)}.\nIf we define E2(X) = median(X) and E(Pflat) = E2(X), then the breakdown point of E2 is\n\u03b22 = 0.5, and the breakdown point of E is \u03b2 = \u03b21\u03b22 = 0.225.\nHowever, if we use another E2, then E can be more robust. For example, for X = {x1, x2, \u00b7\u00b7\u00b7, xn}\nwhere x1 \u2264 x2 \u2264 \u00b7\u00b7\u00b7 \u2264 xn, we can define E2 as the 55-percentile estimator (median of largest\n90% of incomes). In order to make E(Pflat) = E2(X) = E2(E1(P1), E1(P2), \u00b7\u00b7\u00b7, E1(Pn)) go to\ninfinity, we need to either move 55% of the points of X to \u2212\u221e or move 45% of the points of X to +\u221e. In either\ncase, we need to move about 0.45 \u00b7 0.55nk points of Pflat to infinity. This means the breakdown point\nof E is \u03b2 = 0.45 \u00b7 0.55 = 0.2475, which is greater than 0.225.\nThis example implies that if we know how the raw data is preprocessed by estimator E1, we can choose a\nproper estimator E2 to make the E1-E2 estimator more robust.\n\n3.2 Application 2 : Regression of L1 Medians\n\nSuppose we want to use linear regression to robustly predict the weight of a person from his or\nher height, and we have multiple readings of each person\u2019s height and weight. 
The raw data is\nPflat = \u228e_{i=1}^{n} Pi, where for the ith person we have a set Pi = {pi,1, pi,2, \u00b7\u00b7\u00b7, pi,k} \u2282 R^2 and\npi,j = (xi,j, yi,j) for i \u2208 {1, 2, \u00b7\u00b7\u00b7, n}, j \u2208 {1, 2, \u00b7\u00b7\u00b7, k}. Here, xi,j and yi,j are the height and\nweight respectively of the ith person in their jth measurement.\nOne \u201crobust\u201d way to process this data is to first pre-process each Pi with its L1-median [1]:\n(x\u0304i, y\u0304i) \u2190 E1(Pi), where E1(Pi) = L1-median(Pi) has breakdown point \u03b21 = 0.5. Then we could\ngenerate a linear model to predict weight \u0177i = ax + b from the Siegel Estimator [11]: E2(Z) = (a, b),\napplied to Z = {(x\u03041, y\u03041), \u00b7\u00b7\u00b7, (x\u0304n, y\u0304n)}, with breakdown point \u03b22 = 0.5. From Theorem 2 we immediately know the breakdown point of\nE(Pflat) = E2(E1(P1), E1(P2), \u00b7\u00b7\u00b7, E1(Pn)) is \u03b2 = \u03b21\u03b22 = 0.5 \u00b7 0.5 = 0.25.\nAlternatively, taking the Siegel estimator of Pflat (i.e., returning E2(Pflat)) would have a much larger\nbreakdown point of 0.5. So a seemingly harmless operation of normalizing the data with a robust\nestimator (with optimal 0.5 breakdown point) drastically decreases the robustness of the process.\n\n3.3 Application 3 : Significance Thresholds\n\nSuppose we are studying the distribution of the wingspread of fruit flies. There are n = 500 flies,\nand the variance of the true wingspread among these flies is on the order of 0.1 units. 
Our goal is to\nestimate the 0.05 significance level of this distribution of wingspread among normal flies.\nTo obtain a measured value of the wingspread of the ith fly, denoted Fi, we measure the wingspread\nof the ith fly k = 100 times independently, and obtain the measurement set Pi = {pi,1, pi,2, \u00b7\u00b7\u00b7, pi,k}.\nThe measurement is carried out by a machine automatically and quickly, which implies the variance\nof each Pi is typically very small, perhaps only 0.0001 units, but there are outliers in Pi with small\nprobability due to possible machine malfunction. This malfunction may be correlated to individual\nflies because of anatomical issues, or it may have autocorrelation (the machine jams for a series of\nconsecutive measurements).\nTo perform hypothesis testing we desire the 0.05 significance level, so we are interested in the 95th\npercentile of the set F = {F1, F2, \u00b7\u00b7\u00b7, Fn}. So a post-processing estimator E2 returns the 95th\npercentile of F and has a breakdown point of \u03b22 = 0.05 [6]. Now, we need to design an estimator E1 to\nprocess the raw data Pflat = \u228e_{i=1}^{n} Pi to obtain F = {F1, F2, \u00b7\u00b7\u00b7, Fn}. For example, we can define E1\nas Fi = E1(Pi) = median(Pi) and estimator E as E(Pflat) = E2(E1(P1), E1(P2), \u00b7\u00b7\u00b7, E1(Pn)).\nThen, the breakdown point of E1 is 0.5. Since the breakdown point of E2 is 0.05, the breakdown point\nof the composite estimator E is \u03b2 = \u03b21\u03b22 = 0.5 \u00b7 0.05 = 0.025. This means if the measurement\nmachine malfunctioned only 2.5% of the time, we could have an anomalous significance level, leading\nto false discovery. Can we make this process more robust by adjusting E1?\nActually, yes! We can use another pre-processing estimator to get a more robust E. 
Since the variance\nof each Pi is only 0.0001, we can let E1 return the 5th percentile of a ranked set of real numbers; then\nthere is not much difference between E1(Pi) and the median of Pi. (Note: this introduces a small\namount of bias that can likely be accounted for in other ways.) In order to make E(Pflat) = E2(F)\ngo to infinity we need to move 5% of the points of F to +\u221e (each requiring 95% of the corresponding Pi to move) or\n95% of the points of F to \u2212\u221e (each requiring 5% of the corresponding Pi to move). In either\ncase, we need to move about 5% \u00b7 95% of the points of Pflat to infinity. So, the breakdown point of E is\n\u03b2 = 0.05 \u00b7 0.95 = 0.0475, which is greater than 0.025. That is, we can now sustain up to 4.75% of\nthe measurement machine\u2019s readings being anomalous, almost double the previous value, without leading to\nan anomalous significance threshold value.\nThis example implies that if we know the post-processing estimator E2, we can choose a proper method\nto preprocess the raw data to make the E1-E2 estimator more robust.\n\n3.4 Application 4 : 3-Level Composition\n\nSuppose we want to use a single value to represent the temperature of the US on a certain day.\nThere are m = 50 states in the country. Suppose each state has n = 100 meteorological stations,\nand the station i in state j measures the local temperature k = 24 times to get the data Pi,j =\n{ti,j,1, ti,j,2, \u00b7\u00b7\u00b7, ti,j,k}. We define P^j_flat = \u228e_{i=1}^{n} Pi,j, Pflat = \u228e_{j=1}^{m} P^j_flat and\n\nE1(Pi,j) = median(Pi,j), E2(P^j_flat) = median(E1(P1,j), E1(P2,j), \u00b7\u00b7\u00b7, E1(Pn,j)),\nE(Pflat) = E3(E2(P^1_flat), E2(P^2_flat), \u00b7\u00b7\u00b7, E2(P^m_flat)) = median(E2(P^1_flat), E2(P^2_flat), \u00b7\u00b7\u00b7, E2(P^m_flat)).\n\nSo, the breakdown points of E1, E2 and E3 are \u03b21 = \u03b22 = \u03b23 = 0.5. From Theorem 3, we know\nthe breakdown point of E is \u03b2 = \u03b21\u03b22\u03b23 = 0.125. Therefore, we know the estimator E is not very\nrobust, and it may not be a good choice to use E(Pflat) to represent the temperature of the US on a\ncertain day.\nThis example illustrates how the more times the raw data is aggregated, the more unreliable the final\nresult can become.\n\n4 Simulation: Estimator Manipulation\nIn this simulation we actually construct a method to relocate an estimator by modifying the smallest\nnumber of points possible. We specifically target the L1-median of L1-medians since it is somewhat\nnon-trivial to solve for the new location of data points.\nIn particular, given a target point p0 \u2208 R^2 and a set of nk points Pflat = \u228e_{i=1}^{n} Pi,\nwhere Pi = {pi,1, pi,2, \u00b7\u00b7\u00b7, pi,k} \u2282 R^2, we use simulation to show that we only need\nto change n\u0303k\u0303 points of Pflat to get a new set P\u0303flat = \u228e_{i=1}^{n} P\u0303i such that\nmedian(median(P\u03031), median(P\u03032), \u00b7\u00b7\u00b7, median(P\u0303n)) = p0. Here, \"median\" means L1-median, and\n\nn\u0303 = n/2 if n is even, (n + 1)/2 if n is odd;   k\u0303 = k/2 if k is even, (k + 1)/2 if k is odd.\n\nTo do this, we first show that, given k points S = {(xi, yi) | 1 \u2264 i \u2264 k} in R^2, and a target point\n(x0, y0), we can change k\u0303 points of S to make (x0, y0) the L1-median of the new set. 
As $n$ and $k$ grow, $\\tilde{n}\\tilde{k}/(nk)$ approaches 0.25, which is the asymptotic breakdown point of this estimator as a consequence of Theorem 2, and thus we may need to move this many points to get the result.\nIf $(x_0, y_0)$ is the L1-median of the set $\\{(x_i, y_i) \\mid 1 \\le i \\le k\\}$, then we have [13]:\n\n$\\sum_{i=1}^{k} \\frac{x_i - x_0}{\\sqrt{(x_i - x_0)^2 + (y_i - y_0)^2}} = 0, \\qquad \\sum_{i=1}^{k} \\frac{y_i - y_0}{\\sqrt{(x_i - x_0)^2 + (y_i - y_0)^2}} = 0.$ (13)\n\nWe define $\\vec{x} = (x_1, x_2, \\cdots, x_{\\tilde{k}})$, $\\vec{y} = (y_1, y_2, \\cdots, y_{\\tilde{k}})$ and\n\n$h(\\vec{x}, \\vec{y}) = \\left( \\sum_{i=1}^{k} \\frac{x_i - x_0}{\\sqrt{(x_i - x_0)^2 + (y_i - y_0)^2}} \\right)^2 + \\left( \\sum_{i=1}^{k} \\frac{y_i - y_0}{\\sqrt{(x_i - x_0)^2 + (y_i - y_0)^2}} \\right)^2.$\n\nSince (13) is the necessary and sufficient condition for the L1-median, if we can find $\\vec{x}$ and $\\vec{y}$ such that $h(\\vec{x}, \\vec{y}) = 0$, then $(x_0, y_0)$ is the L1-median of the new set.\nSince\n\n$\\partial_{x_i} h(\\vec{x}, \\vec{y}) = 2 \\left( \\sum_{j=1}^{k} \\frac{x_j - x_0}{\\sqrt{(x_j - x_0)^2 + (y_j - y_0)^2}} \\right) \\frac{(y_i - y_0)^2}{\\left( (x_i - x_0)^2 + (y_i - y_0)^2 \\right)^{3/2}} - 2 \\left( \\sum_{j=1}^{k} \\frac{y_j - y_0}{\\sqrt{(x_j - x_0)^2 + (y_j - y_0)^2}} \\right) \\frac{(x_i - x_0)(y_i - y_0)}{\\left( (x_i - x_0)^2 + (y_i - y_0)^2 \\right)^{3/2}},$\n\n$\\partial_{y_i} h(\\vec{x}, \\vec{y}) = -2 \\left( \\sum_{j=1}^{k} \\frac{x_j - x_0}{\\sqrt{(x_j - x_0)^2 + (y_j - y_0)^2}} \\right) \\frac{(x_i - x_0)(y_i - y_0)}{\\left( (x_i - x_0)^2 + (y_i - y_0)^2 \\right)^{3/2}} + 2 \\left( \\sum_{j=1}^{k} \\frac{y_j - y_0}{\\sqrt{(x_j - x_0)^2 + (y_j - y_0)^2}} \\right) \\frac{(x_i - x_0)^2}{\\left( (x_i - x_0)^2 + (y_i - y_0)^2 \\right)^{3/2}},$\n\nwe can use gradient descent to compute $\\vec{x}$, $\\vec{y}$ to minimize $h$. For the input $S = \\{(x_i, y_i) \\mid 1 \\le i \\le k\\}$, we choose the initial value $\\vec{x}_0 = (x_1, x_2, \\cdots, x_{\\tilde{k}})$, $\\vec{y}_0 = (y_1, y_2, \\cdots, y_{\\tilde{k}})$, and then update $\\vec{x}$ and $\\vec{y}$ along the negative gradient direction of $h$ until the Euclidean norm of the gradient is less than 0.00001.\nThe algorithm framework is then as follows, using the above gradient descent formulation at each step. We first compute the L1-median $m_i$ for each $P_i$, and then change $\\tilde{n}$ points in $\\{m_1, m_2, \\cdots, m_n\\}$ to obtain $\\{m'_1, m'_2, \\cdots, m'_{\\tilde{n}}, m_{\\tilde{n}+1}, \\cdots, m_n\\}$ such that $\\text{median}(m'_1, m'_2, \\cdots, m'_{\\tilde{n}}, m_{\\tilde{n}+1}, \\cdots, m_n) = p_0$. For each $m'_i$, we change $\\tilde{k}$ points in $P_i$ to obtain $\\tilde{P}_i = \\{p'_{i,1}, p'_{i,2}, \\cdots, p'_{i,\\tilde{k}}, p_{i,\\tilde{k}+1}, \\cdots, p_{i,k}\\}$ such that $\\text{median}(\\tilde{P}_i) = m'_i$. Thus, we have\n\n$\\text{median}(\\text{median}(\\tilde{P}_1), \\cdots, \\text{median}(\\tilde{P}_{\\tilde{n}}), \\text{median}(P_{\\tilde{n}+1}), \\cdots, \\text{median}(P_n)) = p_0.$ (14)\n\nTo show a simulation of this process, we use a uniform distribution to randomly generate $nk$ points in the region $[-10, 10] \\times [-10, 10]$, generate a target point $p_0 = (x_0, y_0)$ in the region $[-20, 20] \\times [-20, 20]$, and then use our algorithm to change $\\tilde{n}\\tilde{k}$ points in the given set so that the new set satisfies (14). Table 1 shows the result of running this experiment for different $n$ and $k$, where $(x'_0, y'_0)$ is the median of medians for the new set obtained by our algorithm. It lists the various values $n$ and $k$, the corresponding numbers $\\tilde{n}$ and $\\tilde{k}$ of points modified, and the target point and result of our algorithm. If we tighten the terminating condition, which means increasing the number of iterations, we can obtain a more accurate result; but even when only requiring the Euclidean norm of the gradient to be less than 0.00001, we get very accurate results, typically within about 0.01 in each coordinate.\n\n$n$ | $k$ | $\\tilde{n}$ | $\\tilde{k}$ | $(x_0, y_0)$ | $(x'_0, y'_0)$\n5 | 8 | 3 | 4 | (0.99, 1.01) | (0.99, 1.01)\n5 | 8 | 3 | 4 | (10.76, 11.06) | (10.70, 11.06)\n10 | 5 | 5 | 3 | (-13.82, -4.74) | (-13.83, -4.74)\n50 | 20 | 25 | 10 | (-14.71, -13.67) | (-14.72, -13.67)\n100 | 50 | 50 | 25 | (-14.07, 18.36) | (-14.07, 18.36)\n500 | 100 | 250 | 50 | (-15.84, -6.42) | (-15.83, -6.42)\n1000 | 200 | 500 | 100 | (18.63, -12.10) | (18.78, -12.20)\n\nTable 1: The running result of the simulation.\n\nWe illustrate the results of this process graphically for one example in Table 1: the case $n = 5$, $k = 8$, $(x_0, y_0) = (0.99, 1.01)$, which is shown in Figure 1. In this figure, the green star is the target point. Since $n = 5$, we use five different markers (circle, square, upward-pointing triangle, downward-pointing triangle, and diamond) to distinguish the points of the five subsets. The given data $P_{\\text{flat}}$ are shown by black points and unfilled points. Our algorithm changes the unfilled points to the blue ones, and the green points are the medians of the new subsets. The red star is the median of medians for $P_{\\text{flat}}$, and the other red points are the medians of the old subsets. So, we only changed 12 points out of 40, and the median of medians for the new data set is very close to the target point.\n\nFigure 1: The running result for the case $n = 5$, $k = 8$, $(x_0, y_0) = (0.99, 1.01)$ in Table 1.\n\n5 Conclusion\nWe define the breakdown point of the composition of two or more estimators. These definitions are technical but necessary to understand the robustness of composite estimators. 
Generally, the composition of two or more estimators is less robust than each individual estimator. We highlight a few applications and believe many more exist. These results already provide important insights for complex data analysis pipelines common to large-scale automated data analysis.\n\n[Figure 1 legend: the given points that are not changed; the given points that are changed; the new locations for those changed points; the medians of old subsets; the medians of new subsets; the median of medians for the given points; the target point.]\n\nReferences\n[1] G. Aloupis. Geometric measures of data depth. In Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications. AMS, 2006.\n\n[2] G. Cormode and A. McGregor. Approximation algorithms for clustering uncertain data. In PODS, 2008.\n\n[3] P. Davies and U. Gather. The breakdown point: Examples and counterexamples. REVSTAT \u2013 Statistical Journal, 5:1\u201317, 2007.\n\n[4] F. R. Hampel. A general qualitative definition of robustness. Annals of Mathematical Statistics, 42:1887\u20131896, 1971.\n\n[5] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, 1986.\n\n[6] X. He, D. G. Simpson, and S. L. Portnoy. Breakdown robustness of tests. Journal of the American Statistical Association, 85:446\u2013452, 1990.\n\n[7] P. J. Huber. Robust Statistics. Wiley, 1981.\n\n[8] P. J. Huber and E. M. Ronchetti. Breakdown point. In Robust Statistics, page 8. John Wiley & Sons, Inc., 2009.\n\n[9] A. G. J\u00f8rgensen, M. L\u00f6ffler, and J. M. Phillips. Geometric computation on indecisive points. In WADS, 2011.\n\n[10] A. D. Sarma, O. Benjelloun, A. Halevy, S. Nabar, and J. Widom. Representing uncertain data: models, properties, and algorithms. VLDBJ, 18:989\u20131019, 2009.\n\n[11] A. F. Siegel. Robust regression using repeated medians. Biometrika, 69:242\u2013244, 1982.\n\n[12] P. Tang and J. M. 
Phillips. The robustness of estimator composition. Technical report, arXiv:1609.01226, 2016.\n\n[13] E. Weiszfeld and F. Plastria. On the point for which the sum of the distances to n given points is minimum. Annals of Operations Research, 167:7\u201341, 2009.\n\n[14] A. H. Welsh. The standard deviation. In Aspects of Statistical Inference, page 245. Wiley-Interscience, 1996.\n", "award": [], "sourceid": 568, "authors": [{"given_name": "Pingfan", "family_name": "Tang", "institution": "University of Utah"}, {"given_name": "Jeff", "family_name": "Phillips", "institution": "University of Utah"}]}