The Robustness of Estimator Composition

Pingfan Tang, School of Computing, University of Utah, Salt Lake City, UT 84112, tang1984@cs.utah.edu
Jeff M. Phillips, School of Computing, University of Utah, Salt Lake City, UT 84112, jeffp@cs.utah.edu

We formalize notions of robustness for composite estimators via the notion of a breakdown point. A composite estimator successively applies two (or more) estimators: on data decomposed into disjoint parts, it applies the first estimator on each part, then the second estimator on the outputs of the first estimator, and so on if the composition is of more than two estimators. Informally, the breakdown point is the minimum fraction of data points which, if significantly modified, will also significantly modify the output of the estimator, so it is typically desirable to have a large breakdown point. Our main result shows that, under mild conditions on the individual estimators, the breakdown point of the composite estimator is the product of the breakdown points of the individual estimators. We also demonstrate several scenarios, ranging from regression to statistical testing, where this analysis is easy to apply, useful in understanding worst-case robustness, and sheds insight on the associated data analysis.

1 Introduction

Robust statistical estimators [5, 7] (in particular, resistant estimators), such as the median, are an essential tool in data analysis since they are provably immune to outliers. Given data with a large fraction of extreme outliers, a robust estimator guarantees the returned value is still within the non-outlier part of the data. In particular, the role of these estimators is quickly growing in importance as the scale and automation associated with data collection and data processing become more commonplace. Artisanal data (hand-crafted and carefully curated), where potential outliers can be removed, is becoming proportionally less common. Instead, important decisions are being made blindly based on the output of analysis functions, often without looking at individual data points and their effect on the outcome. Thus, estimators used in this pipeline that are not robust are susceptible to erroneous and dangerous decisions as the result of a few extreme and rogue data points. Although other approaches, like regularization and pruning a constant number of obvious outliers, are common as well, they do not come with the important guarantees that ensure these unwanted outcomes absolutely cannot occur.

In this paper we initiate the formal study of the robustness of composition of estimators through the notion of breakdown points. These are especially important with the growth of data analysis pipelines where the final result or prediction is the result of several layers of data processing. When each layer in this pipeline is modeled as an estimator, our analysis provides the first general robustness analysis of these processes.

The breakdown point [4, 3] is a basic measure of robustness of an estimator. Intuitively, it describes how many outliers can be in the data without the estimator becoming unreliable. However, the literature is full of slightly inconsistent and informal definitions of this concept. For example:

Aloupis [1] writes "the breakdown point is the proportion of data which must be moved to infinity so that the estimator will do the same."
Huber and Ronchetti [8] write "the breakdown point is the smallest fraction of bad observations that may cause an estimator to take on arbitrarily large aberrant values."

Dasgupta, Kumar, and Srikumar [14] write "the breakdown point of an estimator is the largest fraction of the data that can be moved arbitrarily without perturbing the estimator to the boundary of the parameter space."

All of these definitions have similar meanings, and they are typically sufficient for the purpose of understanding a single estimator. However, they are not mathematically rigorous, and it is difficult to use them to discuss the breakdown point of composite estimators.

Composition of Estimators. In a bit more detail (we give formal definitions in Section 2.1), an estimator E maps a data set to a single value in another space, sometimes the same as a single data point. For instance, the mean and the median are simple estimators on one-dimensional data. A composite E1-E2 estimator applies two estimators E1 and E2 on data stored in a hierarchy. Let P = {P1, P2, ..., Pn} be a set of subdata sets, where each subdata set Pi = {pi,1, pi,2, ..., pi,k} has individual data readings. Then the E1-E2 estimator reports E2(E1(P1), E1(P2), ..., E1(Pn)), that is, the estimator E2 applied to the outputs of estimator E1 on each subdata set.

1.1 Examples of Estimator Composition

Composite estimators arise in many scenarios in data analysis.

Uncertain Data. In the last decade there has been increased focus on the study of uncertain data [10, 9, 2], where instead of analyzing a data set directly, we are given a model of the uncertainty of each data point. Consider tracking the summarization of a group of n people based on noisy GPS measurements. For each person i we might get k readings of their location Pi, and use these k readings as a discrete probability distribution of where that person might be. Then, in order to represent the center of this set of people, a natural approach is to estimate the location of each person as xi = E1(Pi), and then use these estimates to summarize the entire group as E2(x1, x2, ..., xn). Using the mean as E1 and E2 would be easy, but would be susceptible to even a single outrageous outlier (all people are in Manhattan, but a spurious reading was at (0, 0) lat-long, off the coast of Africa). An alternative is to use the L1-median for E1 and E2, which is known to have an optimal breakdown point of 0.5. But what is the breakdown point of the E1-E2 estimator? (A small code sketch of this scenario follows this section.)

Robust Analysis of Bursty Behavior. Understanding the robustness of estimators can also be critical to understanding how much one can game a system. For instance, consider a start-up media website that gets bursts of traffic from memes it curates. It publishes a statistic showing the median of the top half of traffic days each month, and aggregates these by taking the median of such values over the top half of all months. This is a composite estimator, and the website proudly claims that, even though it has bursty traffic, the statistic is robust (each estimator has a breakdown point of 0.25). If this composite estimator shows large traffic, should a potential buyer of this website be impressed? Is there a better, more robust estimator the potential buyer could request? If the media website can stagger the release of its content, how should it distribute it to maximize this composite estimator?

Part of the Data Analysis Pipeline. This process of estimator composition is very common in the broad data analysis literature. It arises from the idea of an analysis pipeline, where at several stages estimators or analysis are performed on data, and then further estimators and analysis are performed downstream. In many cases a robust estimator like the median is used specifically for its robustness properties, but there is no analysis of how robust the composition of these estimators is.
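To make the composition concrete, the following is a minimal sketch of an E1-E2 estimator, for illustration only and not part of the paper's analysis: it works in one dimension with the plain median standing in for the L1-median, and the data model is an invented stand-in for the uncertain-data scenario above. A single spurious reading drags the mean-of-means far away, while the median-of-medians is unaffected.

```python
import numpy as np

def composite(parts, E1, E2):
    """The E1-E2 estimator: apply E1 to each subdata set, then E2 to the outputs."""
    return E2([E1(P) for P in parts])

rng = np.random.default_rng(0)
# n = 100 "people", each with k = 20 noisy readings of a location near 5.0
parts = [5.0 + 0.1 * rng.standard_normal(20) for _ in range(100)]
parts[0][0] = 1e9  # one wildly spurious reading

print(composite(parts, np.mean, np.mean))      # ~5e5: dragged far from 5.0
print(composite(parts, np.median, np.median))  # ~5.0: unaffected
```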
1.2 Main Results

This paper initiates the formal and general study of the robustness of composite estimators. In Subsection 2.1, we give two formal definitions of breakdown points, both of which are required to prove our composition theorem. One variant of the definition closely aligns with other formalizations [4, 3], while the other is fundamentally different. The main result provides general conditions under which an E1-E2 estimator with breakdown points β1 and β2 has a breakdown point of β1β2 (Theorem 2 in Subsection 2.2). Moreover, by showing examples where our conditions do not strictly apply, we gain an understanding of how to circumvent the above result. An example is in composite percentile estimators (e.g., E1 returns the 25th percentile, and E2 the 75th percentile of a ranked set). These composite estimators have a larger breakdown point than β1β2. The main result can be extended to multiple compositions, under suitable conditions, so for instance an E1-E2-E3 estimator has a breakdown point of β1β2β3 (Theorem 3 in Subsection 2.3). This implies that long analysis chains can be very susceptible to a few carefully placed outliers, since the breakdown point decays exponentially in the length of the analysis chain.

In Section 3, we highlight several applications of this theory, including robust regression, robustness of p-values, a depth-3 composition, and how to advantageously manipulate the observation about percentile estimator composition. We demonstrate a few more applications with simulations in Section 4.

2 Robustness of Estimator Composition

2.1 Formal Definitions of Breakdown Points

In this paper, we give two definitions for the breakdown point: the asymptotic breakdown point and the asymptotic onto-breakdown point. The first definition, the asymptotic breakdown point, is similar to the classic formal definitions in [4] and [3] (including their highly technical nature), although their definitions of the estimator are slightly different, leading to some minor differences in special cases. Our second definition, the asymptotic onto-breakdown point, is a structurally new definition, and we illustrate how it can result in significantly different values on some common and useful estimators. Our main theorem will require both definitions, and the differences between them will lead to several new applications and insights.

We define an estimator E as a function from the collection of some finite subsets of a metric space $(\mathcal{X}, d)$ to another metric space $(\mathcal{X}', d')$:

$$E : \mathcal{A} \subset \{X \subset \mathcal{X} \mid 0 < |X| < \infty\} \mapsto \mathcal{X}', \qquad (1)$$

where X is a multiset. This means if $x \in X$ then x can appear more than once in X, and the multiplicity of elements is counted when we compute |X|.

Finite Sample Breakdown Point. For the estimator E defined in (1) and a positive integer n, we define its finite sample breakdown point $g_E(n)$ over a set M as

$$g_E(n) = \begin{cases} \max(M) & \text{if } M \neq \emptyset, \\ 0 & \text{if } M = \emptyset, \end{cases} \qquad (2)$$

where $\rho(x', X) = \max_{x \in X} d(x', x)$ is the distance from $x'$ to the furthest point in X, and

$$M = \big\{ m \in [0, n] \,\big|\, \forall X \in \mathcal{A}, |X| = n, \ \forall G_1 > 0, \ \exists G_2 = G_2(X, G_1) \text{ s.t. } \forall X' \in \mathcal{A}, \text{ if } |X'| = n \text{ and } |\{x' \in X' \mid \rho(x', X) > G_1\}| \le m, \text{ then } d'(E(X), E(X')) \le G_2 \big\}. \qquad (3)$$

For an estimator E in (1) and $X' \in \mathcal{A}$, the finite sample breakdown point $g_E(n)$ means that if the number of unbounded points in $X'$ is at most $g_E(n)$, then $E(X')$ remains bounded. Let us break this definition down a bit more. The definition quantifies over all data sets $X \in \mathcal{A}$ of size n, all values $G_1 > 0$, and some value $G_2 = G_2(X, G_1)$ depending on the data set X and the value $G_1$. Then $g_E(n)$ is the maximum value m (over all X, $G_1$, and $G_2$ above) such that for all $X' \in \mathcal{A}$ with $|X'| = n$, if $|\{x' \in X' \mid \rho(x', X) > G_1\}| \le m$ (that is, at most m points are further than $G_1$ from X), then the estimators are close: $d'(E(X), E(X')) \le G_2$.

For example, consider the point set X = {0, 0.15, 0.2, 0.25, 0.4, 0.55, 0.6, 0.65, 0.72, 0.8, 1.0} with n = 11 and median 0.55. If we set $G_1 = 3$, then we can consider sets $X'$ of size 11 with at most m points whose distance to the furthest point of X exceeds 3, that is, points that are either greater than 3 or less than -2; all other n - m points lie in [-2, 3]. Under these conditions, we can (conservatively) set $G_2 = 4$ and know that for values of m of 1, 2, 3, 4, or 5, the median of $X'$ must be between -3.45 and 4.55; and this holds no matter where we place those m points (e.g., at 20 or at 1000). This does not hold for m = 6, so $g_E(11) = 5$.
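As a sanity check of this worked example (our own snippet, not part of the paper), the code below plays the adversary: it replaces the m smallest points of X with a huge value, which is the placement that moves the median upward the most, and confirms that the median stays within $G_2 = 4$ of 0.55 for m up to 5 but escapes for m = 6.

```python
X = [0, 0.15, 0.2, 0.25, 0.4, 0.55, 0.6, 0.65, 0.72, 0.8, 1.0]

def adversarial_median(m, far=1e6):
    # Replace the m smallest points with the huge value `far`; for the median,
    # pushing all m corrupted points to the same side is the worst case.
    Xp = sorted(X[m:] + [far] * m)
    return Xp[len(Xp) // 2]  # median of the 11 values

for m in range(7):
    print(m, adversarial_median(m))
# m = 0..5: the median stays within [-3.45, 4.55]; m = 6: it jumps to 1e6,
# matching g_E(11) = 5.
```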
Asymptotic Breakdown Point. If the limit $\lim_{n\to\infty} g_E(n)/n$ exists, then we define

$$\beta = \lim_{n\to\infty} \frac{g_E(n)}{n} \qquad (4)$$

as the asymptotic breakdown point, or breakdown point for short, of the estimator E.

Remark 1. It is not hard to see that many common estimators satisfy these conditions. For example, the median, the L1-median [1], and the Siegel estimator [11] all have asymptotic breakdown points of 0.5.

Asymptotic Onto-Breakdown Point. For an estimator E given in (1) and a positive integer n, if

$$\widetilde{M} = \big\{ 0 \le m \le n \,\big|\, \forall X \in \mathcal{A}, |X| = n, \ \forall y \in \mathcal{X}', \ \exists X' \in \mathcal{A} \text{ s.t. } |X'| = n, \ |X \cap X'| = n - m, \ E(X') = y \big\}$$

is not empty, we define

$$f_E(n) = \min(\widetilde{M}). \qquad (5)$$

The definition of $f_E(n)$ implies that by changing $f_E(n)$ elements of X we can make E take any value in $\mathcal{X}'$: it is onto. In contrast, $g_E(n)$ only requires $E(X')$ to become far from $E(X)$, perhaps only in one direction. The asymptotic onto-breakdown point is then defined as the following limit, if it exists:

$$\lim_{n\to\infty} \frac{f_E(n)}{n}. \qquad (6)$$

Remark 2. For a quantile estimator E that returns a percentile other than the 50th, $\lim_{n\to\infty} g_E(n)/n \neq \lim_{n\to\infty} f_E(n)/n$. For instance, if E returns the 25th percentile of a ranked set, setting only 25% of the data points to $-\infty$ causes E to return $-\infty$; hence $\lim_{n\to\infty} g_E(n)/n = 0.25$. And while any value less than the original 25th percentile can also be obtained this way, to return a value larger than the largest element of the original set, at least 75% of the data must be modified; thus $\lim_{n\to\infty} f_E(n)/n = 0.75$. As we will observe in Section 3, this nuance in definition regarding percentile estimators will allow for some interesting composite estimator design.

2.2 Definition of E1-E2 Estimators, and their Robustness

We consider the following two estimators:

$$E_1 : \mathcal{A}_1 \subset \{X \subset \mathcal{X}_1 \mid 0 < |X| < \infty\} \mapsto \mathcal{X}_2, \qquad (7)$$
$$E_2 : \mathcal{A}_2 \subset \{X \subset \mathcal{X}_2 \mid 0 < |X| < \infty\} \mapsto \mathcal{X}'_2, \qquad (8)$$

where any finite subset of $E_1(\mathcal{A}_1)$, the range of $E_1$, belongs to $\mathcal{A}_2$. Suppose $P_i \in \mathcal{A}_1$ with $|P_i| = k$ for $i = 1, 2, \ldots, n$, and $P_{\text{flat}} = \biguplus_{i=1}^n P_i$, where $\uplus$ means that if x appears $n_1$ times in $X_1$ and $n_2$ times in $X_2$, then x appears $n_1 + n_2$ times in $X_1 \uplus X_2$. We define

$$E(P_{\text{flat}}) = E_2\big(E_1(P_1), E_1(P_2), \ldots, E_1(P_n)\big). \qquad (9)$$

Theorem 1. Suppose $g_{E_1}(k)$ and $g_{E_2}(n)$ are the finite sample breakdown points of the estimators $E_1$ and $E_2$ given by (7) and (8), respectively. If $g_E(nk)$ is the finite sample breakdown point of E given by (9), then $g_{E_2}(n)\, g_{E_1}(k) \le g_E(nk)$. Moreover, if $\beta_1 = \lim_{k\to\infty} g_{E_1}(k)/k$, $\beta_2 = \lim_{n\to\infty} g_{E_2}(n)/n$, and $\beta = \lim_{n,k\to\infty} g_E(nk)/(nk)$ all exist, then $\beta_1\beta_2 \le \beta$.

The proof of Theorem 1 and of the other theorems can be found in the full version of this paper [12].
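Before the remarks below, here is a small numerical illustration (ours, under invented data) of Theorem 1 for the median-of-medians, where $\beta_1 = \beta_2 = 0.5$: corrupting a bare majority of points in a bare majority of the subdata sets, just over 25% of all points, moves the composite output arbitrarily, while spending a similar budget on a minority of whole subdata sets does not.

```python
import numpy as np

def median_of_medians(parts):
    return np.median([np.median(P) for P in parts])

rng = np.random.default_rng(1)
n, k = 101, 101
parts = [rng.standard_normal(k).tolist() for _ in range(n)]
print(median_of_medians(parts))   # near 0

# Corrupt 51 points in each of 51 groups: 51*51/101^2, about 25.5% of all points.
big = [[1e9] * (k // 2 + 1) + P[k // 2 + 1:] for P in parts[: n // 2 + 1]] + parts[n // 2 + 1:]
print(median_of_medians(big))     # 1e9: the composite has broken down

# A similar budget spent on 25 whole groups corrupts only 25 of 101 group medians.
spread = [[1e9] * k for _ in range(25)] + parts[25:]
print(median_of_medians(spread))  # still near 0
```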
Remark 3. Under the conditions of Theorem 1, we cannot guarantee $\beta = \beta_1\beta_2$. For example, suppose $E_1$ and $E_2$ take the 25th percentile and the 75th percentile of a ranked set of real numbers, respectively, so $\beta_1 = \beta_2 = 1/4$. To drive the composite output to $+\infty$ we must drive more than 25% of the $E_1$-outputs to $+\infty$, each of which requires moving 75% of the points of its subdata set; driving the output to $-\infty$ symmetrically costs $0.75 \times 0.25$. Hence $\beta = 3/16$, which is larger than $\beta_1\beta_2 = 1/16$.

In fact, the limit of $g_E(nk)/(nk)$ as $n, k \to \infty$ may not even exist. For example, suppose $E_1$ takes the 25th percentile of a ranked set of real numbers, while $E_2$ takes the 25th percentile of a ranked set of n real numbers when n is odd, and the 75th percentile when n is even. Then $\beta_1 = \beta_2 = 1/4$, but $g_E(nk) \approx nk/16$ when n is odd and $g_E(nk) \approx 3nk/16$ when n is even, which implies $\lim_{n,k\to\infty} g_E(nk)/(nk)$ does not exist.

Therefore, to guarantee that $\beta$ exists and $\beta = \beta_1\beta_2$, we introduce the asymptotic onto-breakdown point in (6). As shown in Remark 2, the values of (4) and (6) may not be equal. However, under the condition that the asymptotic breakdown point and the asymptotic onto-breakdown point of $E_1$ coincide, we can finally state our desired clean result.

Theorem 2. For estimators $E_1$, $E_2$, and E given by (7), (8), and (9) respectively, suppose $g_{E_1}(k)$, $g_{E_2}(n)$, and $g_E(nk)$ are defined by (2), and $f_{E_1}(k)$ is defined by (5). Moreover, suppose $E_1$ is an onto function, and for any fixed positive integer n we have

$$\forall X \in \mathcal{A}_2, |X| = n, \ \forall G_1 > 0, \text{ s.t. } \forall G_2 > 0, \ \exists X' \in \mathcal{A}_2 \text{ satisfying } |X'| = n, \ |X' \setminus X| = g_{E_2}(n) + 1, \text{ and } d'_2(E_2(X), E_2(X')) > G_2, \qquad (10)$$

where $d'_2$ is the metric of the space $\mathcal{X}'_2$. If $\beta_1 = \lim_{k\to\infty} g_{E_1}(k)/k = \lim_{k\to\infty} f_{E_1}(k)/k$ and $\beta_2 = \lim_{n\to\infty} g_{E_2}(n)/n$ both exist, then $\beta = \lim_{n,k\to\infty} g_E(nk)/(nk)$ exists, and $\beta = \beta_1\beta_2$.

Remark 4. Without the introduction of $f_E(n)$, we cannot even guarantee $\beta \le \beta_1$ or $\beta \le \beta_2$ under only the conditions of Theorem 1, even if $E_1$ and $E_2$ are both onto functions. For example, for any $P = \{p_1, p_2, \ldots, p_k\} \subset \mathbb{R}$ and $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}$, define $E_1(P) = 1/\text{median}(P)$ (if $\text{median}(P) \neq 0$; otherwise $E_1(P) = 0$) and $E_2(X) = \text{median}(y_1, y_2, \ldots, y_n)$, where $y_i = 1/x_i$ (if $x_i \neq 0$; otherwise $y_i = 0$) for $1 \le i \le n$. Since $g_{E_1}(k) = g_{E_2}(n) = 0$ for all n, k, we have $\beta_1 = \beta_2 = 0$. However, in order to make $E_2(E_1(P_1), E_1(P_2), \ldots, E_1(P_n)) \to +\infty$, we need to make about $n/2$ of the elements of $\{E_1(P_1), E_1(P_2), \ldots, E_1(P_n)\}$ go to $0^+$, and to make $E_1(P_i) \to 0^+$, we need to make about $k/2$ points of $P_i$ go to $+\infty$. Therefore, $g_E(nk) \approx (n/2)(k/2) = nk/4$ and $\beta = 1/4$.

2.3 Multi-level Composition of Estimators

To study the breakdown point of composite estimators with more than two levels, we introduce a third estimator:

$$E_3 : \mathcal{A}_3 \subset \{X \subset \mathcal{X}'_2 \mid 0 < |X| < \infty\} \mapsto \mathcal{X}'_3, \qquad (11)$$

where any finite subset of $E_2(\mathcal{A}_2)$, the range of $E_2$, belongs to $\mathcal{A}_3$. Suppose $P_{i,j} \in \mathcal{A}_1$ with $|P_{i,j}| = k$ for $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$, and $P^j_{\text{flat}} = \biguplus_{i=1}^n P_{i,j}$, $P_{\text{flat}} = \biguplus_{j=1}^m P^j_{\text{flat}}$. We define

$$E(P_{\text{flat}}) = E_3\big(E_2(\widetilde{P}^1_{\text{flat}}), E_2(\widetilde{P}^2_{\text{flat}}), \ldots, E_2(\widetilde{P}^m_{\text{flat}})\big), \qquad (12)$$

where $\widetilde{P}^j_{\text{flat}} = \{E_1(P_{1,j}), E_1(P_{2,j}), \ldots, E_1(P_{n,j})\}$ for $j = 1, 2, \ldots, m$. From Theorem 2, we can obtain the following theorem about the breakdown point of E in (12).

Theorem 3. For estimators $E_1$, $E_2$, $E_3$, and E given by (7), (8), (11), and (12) respectively, suppose $g_{E_1}(k)$, $g_{E_2}(n)$, $g_{E_3}(m)$, and $g_E(mnk)$ are defined by (2), and $f_{E_1}(k)$, $f_{E_2}(n)$ are defined by (5). Moreover, suppose $E_1$ and $E_2$ are both onto functions, and for any fixed positive integer m we have

$$\forall X \in \mathcal{A}_3, |X| = m, \ \forall G_1 > 0, \text{ s.t. } \forall G_2 > 0, \ \exists X' \in \mathcal{A}_3 \text{ satisfying } |X'| = m, \ |X' \setminus X| = g_{E_3}(m) + 1, \text{ and } d'_3(E_3(X), E_3(X')) > G_2,$$

where $d'_3$ is the metric of the space $\mathcal{X}'_3$. If $\beta_1 = \lim_{k\to\infty} g_{E_1}(k)/k = \lim_{k\to\infty} f_{E_1}(k)/k$, $\beta_2 = \lim_{n\to\infty} g_{E_2}(n)/n = \lim_{n\to\infty} f_{E_2}(n)/n$, and $\beta_3 = \lim_{m\to\infty} g_{E_3}(m)/m$ all exist, then $\beta = \lim_{m,n,k\to\infty} g_E(mnk)/(mnk)$ exists, and $\beta = \beta_1\beta_2\beta_3$.

3 Applications

3.1 Application 1: Balancing Percentiles

Consider n companies and, for simplicity, assume each company has k employees. We are interested in the income of the regular employees of all companies, not the executives, who may have much higher pay. Let $p_{i,j}$ represent the income of the jth employee of the ith company. Set $P_{\text{flat}} = \biguplus_{i=1}^n P_i$, where the ith company has a set $P_i = \{p_{i,1}, p_{i,2}, \ldots, p_{i,k}\} \subset \mathbb{R}$, and for notational convenience $p_{i,1} \le p_{i,2} \le \ldots \le p_{i,k}$ for $i \in \{1, 2, \ldots, n\}$.

Suppose the income data $P_i$ of each company is preprocessed by a 45th-percentile estimator $E_1$ (the median of the lowest 90% of incomes), with breakdown point $\beta_1 = 0.45$. In theory $E_1(P_i)$ better reflects the income of the regular employees of a company, since about 10% of the employees of a company may be in management, and their incomes are usually much higher than those of common employees. So, the preprocessed data is $X = \{E_1(P_1), E_1(P_2), \ldots, E_1(P_n)\}$. If we define $E_2(X) = \text{median}(X)$ and $E(P_{\text{flat}}) = E_2(X)$, then the breakdown point of $E_2$ is $\beta_2 = 0.5$, and the breakdown point of E is $\beta = \beta_1\beta_2 = 0.225$.

However, with a different $E_2$, E can be more robust. For example, for $X = \{x_1, x_2, \ldots, x_n\}$ where $x_1 \le x_2 \le \ldots \le x_n$, we can define $E_2$ as the 55th-percentile estimator (the median of the largest 90% of values). In order to make $E(P_{\text{flat}}) = E_2(E_1(P_1), E_1(P_2), \ldots, E_1(P_n))$ go to infinity, we need to either move 55% of the points of X to $-\infty$ or move 45% of the points of X to $+\infty$. In either case, we need to move about $0.45 \times 0.55 \cdot nk$ points of $P_{\text{flat}}$ to infinity. This means the breakdown point of E is $\beta = 0.45 \times 0.55 = 0.2475$, which is greater than 0.225. This example shows that if we know how the raw data is preprocessed by the estimator $E_1$, we can choose a suitable estimator $E_2$ to make the E1-E2 estimator more robust.
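A quick sketch of this balancing effect follows; the synthetic salary model (90 regular earners near 50, 10 managers near 500, per company) is our assumption, not from the paper. On clean data the two composites give similar estimates; the payoff of the 55th-percentile $E_2$ is the larger corruption budget it tolerates (24.75% versus 22.5% of all points).

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 100
companies = [np.concatenate([50 + 5 * rng.standard_normal(90),     # regular staff
                             500 + 100 * rng.standard_normal(10)]) # management
             for _ in range(n)]

X = [np.percentile(P, 45) for P in companies]  # E1: 45th percentile per company
print(np.median(X))            # E2 = median:          beta = 0.45 * 0.50 = 0.225
print(np.percentile(X, 55))    # E2 = 55th percentile: beta = 0.45 * 0.55 = 0.2475
```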
3.2 Application 2: Regression of L1 Medians

Suppose we want to use linear regression to robustly predict the weight of a person from his or her height, and we have multiple readings of each person's height and weight. The raw data is $P_{\text{flat}} = \biguplus_{i=1}^n P_i$, where for the ith person we have a set $P_i = \{p_{i,1}, p_{i,2}, \ldots, p_{i,k}\} \subset \mathbb{R}^2$ and $p_{i,j} = (x_{i,j}, y_{i,j})$ for $i \in \{1, 2, \ldots, n\}$, $j \in \{1, 2, \ldots, k\}$. Here, $x_{i,j}$ and $y_{i,j}$ are the height and weight, respectively, of the ith person in their jth measurement.

One robust way to process this data is to first pre-process each $P_i$ with its L1-median [1]: $(\bar{x}_i, \bar{y}_i) = E_1(P_i)$, where $E_1(P_i) = \text{L1-median}(P_i)$ has breakdown point $\beta_1 = 0.5$. Then we could generate a linear model predicting weight as $\hat{y} = ax + b$ from the Siegel estimator [11] applied to $Z = \{(\bar{x}_1, \bar{y}_1), \ldots, (\bar{x}_n, \bar{y}_n)\}$: $E_2(Z) = (a, b)$, with breakdown point $\beta_2 = 0.5$. From Theorem 2 we immediately know the breakdown point of $E(P_{\text{flat}}) = E_2(E_1(P_1), E_1(P_2), \ldots, E_1(P_n))$ is $\beta = \beta_1\beta_2 = 0.5 \times 0.5 = 0.25$. Alternatively, taking the Siegel estimator of $P_{\text{flat}}$ directly (i.e., returning $E_2(P_{\text{flat}})$) would have a much larger breakdown point of 0.5. So the seemingly harmless operation of normalizing the data with a robust estimator (one with an optimal 0.5 breakdown point) drastically decreases the robustness of the process.
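The following is our sketch of this pipeline, using Weiszfeld's algorithm for the L1-median and the repeated-median slope of Siegel; the synthetic height/weight model (weight roughly 0.9 * height - 90) is an invented assumption for illustration.

```python
import numpy as np

def l1_median(pts, iters=100):
    # Weiszfeld's algorithm for the L1 (geometric) median of 2D points.
    z = pts.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(pts - z, axis=1), 1e-12)
        z = (pts / d[:, None]).sum(axis=0) / (1.0 / d).sum()
    return z

def siegel(x, y):
    # Siegel's repeated-median line: a = median_i of median_{j != i} pairwise slopes.
    a = np.median([np.median([(y[j] - y[i]) / (x[j] - x[i])
                              for j in range(len(x)) if j != i])
                   for i in range(len(x))])
    return a, np.median(y - a * x)

rng = np.random.default_rng(3)
n, k = 50, 20
true_h = rng.uniform(150, 200, n)
people = [np.column_stack([h + rng.standard_normal(k),             # noisy heights
                           0.9 * h - 90 + rng.standard_normal(k)]) # noisy weights
          for h in true_h]

centers = np.array([l1_median(p) for p in people])  # E1 per person
print(siegel(centers[:, 0], centers[:, 1]))         # E2: roughly (0.9, -90)
```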
3.3 Application 3: Significance Thresholds

Suppose we are studying the distribution of the wingspread of fruit flies. There are n = 500 flies, and the variance of the true wingspread among these flies is on the order of 0.1 units. Our goal is to estimate the 0.05 significance level of this distribution of wingspread among normal flies.

To obtain a measured value of the wingspread of the ith fly, denoted $F_i$, we measure the wingspread of the ith fly k = 100 times independently, and obtain the measurement set $P_i = \{p_{i,1}, p_{i,2}, \ldots, p_{i,k}\}$. The measurement is carried out by a machine automatically and quickly, which implies the variance within each $P_i$ is typically very small, perhaps only 0.0001 units, but with small probability $P_i$ contains outliers due to possible machine malfunction. This malfunction may be correlated within individual flies because of anatomical issues, or it may have autocorrelation (the machine jams for a series of consecutive measurements).

To perform hypothesis testing we desire the 0.05 significance level, so we are interested in the 95th percentile of the set $F = \{F_1, F_2, \ldots, F_n\}$. So a post-processing estimator $E_2$ returns the 95th percentile of F and has a breakdown point of $\beta_2 = 0.05$ [6]. Now, we need to design an estimator $E_1$ to process the raw data $P_{\text{flat}} = \biguplus_{i=1}^n P_i$ to obtain $F = \{F_1, F_2, \ldots, F_n\}$. For example, we can define $E_1$ by $F_i = E_1(P_i) = \text{median}(P_i)$, and the estimator E as $E(P_{\text{flat}}) = E_2(E_1(P_1), E_1(P_2), \ldots, E_1(P_n))$. The breakdown point of $E_1$ is then 0.5, and since the breakdown point of $E_2$ is 0.05, the breakdown point of the composite estimator E is $\beta = \beta_1\beta_2 = 0.5 \times 0.05 = 0.025$. This means that if the measurement machine malfunctioned only 2.5% of the time, we could have an anomalous significance level, leading to false discovery.

Can we make this process more robust by adjusting $E_1$? Actually, yes! We can use another pre-processing estimator to get a more robust E. Since the variance within each $P_i$ is only 0.0001, we can let $E_1$ return the 5th percentile of a ranked set of real numbers; then there is not much difference between $E_1(P_i)$ and the median of $P_i$. (Note: this introduces a small amount of bias that can likely be accounted for in other ways.) In order to make $E(P_{\text{flat}}) = E_2(F)$ go to infinity, we need to move 5% of the points of F to $+\infty$ (causing $E_2$ to give an anomalous value) or 95% of the points of F to $-\infty$ (causing many, 95%, of the $E_1$ values to give anomalous values). In either case, we need to move about 5% x 95% of the points of $P_{\text{flat}}$ to infinity. So, the breakdown point of E is $\beta = 0.05 \times 0.95 = 0.0475$, which is greater than 0.025. That is, we can now sustain up to 4.75% of the measurement machine's readings being anomalous, almost double the previous amount, without producing an anomalous significance threshold value. This example shows that if we know the post-processing estimator $E_2$, we can choose a suitable method of preprocessing the raw data to make the E1-E2 estimator more robust.
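Below is our simulation of this design; the fly counts and noise scales follow the numbers above, while the malfunction pattern (just over half the readings corrupted for 6% of the flies, about 3% of all data) is our assumption. That corruption blows up the median-based threshold, whose breakdown point is 0.025, but leaves the 5th-percentile-based one, whose breakdown point is 0.0475, intact.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 500, 100
flies = [rng.normal(10.0, 0.3) + 0.01 * rng.standard_normal(k) for _ in range(n)]

def threshold(parts, E1):
    return np.percentile([E1(P) for P in parts], 95)  # E2: 95th percentile

E1_5th = lambda P: np.percentile(P, 5)
print(threshold(flies, np.median), threshold(flies, E1_5th))  # both near 10.5

# Malfunction: 51 of 100 readings corrupted for 30 of 500 flies (~3% of data).
bad = [np.concatenate([np.full(51, 1e6), P[51:]]) for P in flies[:30]] + flies[30:]
print(threshold(bad, np.median))  # ~1e6: an anomalous significance threshold
print(threshold(bad, E1_5th))     # still near 10.5
```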
3.4 Application 4: 3-Level Composition

Suppose we want to use a single value to represent the temperature of the US on a certain day. There are m = 50 states in the country. Suppose each state has n = 100 meteorological stations, and station i in state j measures the local temperature k = 24 times to get the data $P_{i,j} = \{t_{i,j,1}, t_{i,j,2}, \ldots, t_{i,j,k}\}$. We define $P^j_{\text{flat}} = \biguplus_{i=1}^n P_{i,j}$, $P_{\text{flat}} = \biguplus_{j=1}^m P^j_{\text{flat}}$, and

$$E_1(P_{i,j}) = \text{median}(P_{i,j}), \qquad E_2(P^j_{\text{flat}}) = \text{median}\big(E_1(P_{1,j}), E_1(P_{2,j}), \ldots, E_1(P_{n,j})\big),$$
$$E(P_{\text{flat}}) = E_3\big(E_2(P^1_{\text{flat}}), E_2(P^2_{\text{flat}}), \ldots, E_2(P^m_{\text{flat}})\big) = \text{median}\big(E_2(P^1_{\text{flat}}), E_2(P^2_{\text{flat}}), \ldots, E_2(P^m_{\text{flat}})\big).$$

So, the breakdown points of $E_1$, $E_2$, and $E_3$ are $\beta_1 = \beta_2 = \beta_3 = 0.5$. From Theorem 3, we know the breakdown point of E is $\beta = \beta_1\beta_2\beta_3 = 0.125$. Therefore, the estimator E is not very robust, and it may not be a good choice to use $E(P_{\text{flat}})$ to represent the temperature of the US on a certain day. This example illustrates how the more times the raw data is aggregated, the more unreliable the final result can become.
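The same three-level median written as code (our sketch; the temperature model is invented): corrupting a bare majority of readings at a bare majority of stations in a bare majority of states, here about 14% of all readings and approaching the 12.5% bound as m, n, and k grow, fully controls the national value.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 50, 100, 24
usa = [[(rng.normal(15, 8) + rng.standard_normal(k)).tolist() for _ in range(n)]
       for _ in range(m)]

def national(states):
    # median of per-state medians of per-station medians: beta = 0.5^3 = 0.125
    return np.median([np.median([np.median(st) for st in state]) for state in states])

print(national(usa))  # a plausible national temperature

# Corrupt 13/24 readings at 51/100 stations in 26/50 states (~14% of all data).
bad = [[[1e6] * 13 + st[13:] for st in state[:51]] + state[51:]
       for state in usa[:26]] + usa[26:]
print(national(bad))  # 1e6
```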
4 Simulation: Estimator Manipulation

In this simulation we construct a method to relocate an estimator by modifying the smallest possible number of points. We specifically target the L1-median of L1-medians, since it is somewhat non-trivial to solve for the new locations of the data points. In particular, given a target point $p_0 \in \mathbb{R}^2$ and a set of nk points $P_{\text{flat}} = \biguplus_{i=1}^n P_i$, where $P_i = \{p_{i,1}, p_{i,2}, \ldots, p_{i,k}\} \subset \mathbb{R}^2$, we use simulation to show that we only need to change $\tilde{n}\tilde{k}$ points of $P_{\text{flat}}$ to obtain a new set $\widetilde{P}_{\text{flat}} = \biguplus_{i=1}^n \widetilde{P}_i$ such that $\text{median}(\text{median}(\widetilde{P}_1), \text{median}(\widetilde{P}_2), \ldots, \text{median}(\widetilde{P}_n)) = p_0$. Here, "median" means the L1-median, and

$$\tilde{n} = \begin{cases} n/2 & \text{if } n \text{ is even} \\ (n+1)/2 & \text{if } n \text{ is odd,} \end{cases} \qquad \tilde{k} = \begin{cases} k/2 & \text{if } k \text{ is even} \\ (k+1)/2 & \text{if } k \text{ is odd.} \end{cases}$$

As n and k grow, $\tilde{n}\tilde{k}/(nk) \to 0.25$, which is the asymptotic breakdown point of this estimator as a consequence of Theorem 2; thus we may need to move this many points to achieve the result.

To do this, we first show that, given k points $S = \{(x_i, y_i) \mid 1 \le i \le k\}$ in $\mathbb{R}^2$ and a target point $(x_0, y_0)$, we can change $\tilde{k}$ points of S to make $(x_0, y_0)$ the L1-median of the new set. If $(x_0, y_0)$ is the L1-median of the set $\{(x_i, y_i) \mid 1 \le i \le k\}$, then we have [13]:

$$\sum_{i=1}^{k} \frac{x_i - x_0}{\sqrt{(x_i-x_0)^2+(y_i-y_0)^2}} = 0, \qquad \sum_{i=1}^{k} \frac{y_i - y_0}{\sqrt{(x_i-x_0)^2+(y_i-y_0)^2}} = 0. \qquad (13)$$

We define $\tilde{x} = (x_1, x_2, \ldots, x_{\tilde{k}})$, $\tilde{y} = (y_1, y_2, \ldots, y_{\tilde{k}})$ and, writing $r_i = \sqrt{(x_i-x_0)^2+(y_i-y_0)^2}$,

$$h(\tilde{x}, \tilde{y}) = \Big(\sum_{i=1}^{k} \frac{x_i - x_0}{r_i}\Big)^2 + \Big(\sum_{i=1}^{k} \frac{y_i - y_0}{r_i}\Big)^2.$$

Since (13) is the sufficient and necessary condition for the L1-median, if we can find $\tilde{x}$ and $\tilde{y}$ such that $h(\tilde{x}, \tilde{y}) = 0$, then $(x_0, y_0)$ is the L1-median of the new set. Using the partial derivatives, for $1 \le i \le \tilde{k}$,

$$\frac{\partial h}{\partial x_i} = 2\Big(\sum_{j=1}^{k} \frac{x_j-x_0}{r_j}\Big)\frac{(y_i-y_0)^2}{r_i^3} - 2\Big(\sum_{j=1}^{k} \frac{y_j-y_0}{r_j}\Big)\frac{(x_i-x_0)(y_i-y_0)}{r_i^3},$$

$$\frac{\partial h}{\partial y_i} = -2\Big(\sum_{j=1}^{k} \frac{x_j-x_0}{r_j}\Big)\frac{(x_i-x_0)(y_i-y_0)}{r_i^3} + 2\Big(\sum_{j=1}^{k} \frac{y_j-y_0}{r_j}\Big)\frac{(x_i-x_0)^2}{r_i^3},$$

we can use gradient descent to compute $\tilde{x}, \tilde{y}$ that minimize h. For the input $S = \{(x_i, y_i) \mid 1 \le i \le k\}$, we choose the initial values $\tilde{x}^0 = (x_1, x_2, \ldots, x_{\tilde{k}})$, $\tilde{y}^0 = (y_1, y_2, \ldots, y_{\tilde{k}})$, and then update $\tilde{x}$ and $\tilde{y}$ along the negative gradient direction of h until the Euclidean norm of the gradient is less than 0.00001.

The algorithm framework is then as follows, using the above gradient descent formulation at each step. We first compute the L1-median $m_i$ of each $P_i$, and then change $\tilde{n}$ points of $\{m_1, m_2, \ldots, m_n\}$ to obtain $\{m'_1, m'_2, \ldots, m'_{\tilde{n}}, m_{\tilde{n}+1}, \ldots, m_n\}$ such that $\text{median}(m'_1, m'_2, \ldots, m'_{\tilde{n}}, m_{\tilde{n}+1}, \ldots, m_n) = p_0$. Then, for each $m'_i$, we change $\tilde{k}$ points of $P_i$ to obtain $\widetilde{P}_i = \{p'_{i,1}, p'_{i,2}, \ldots, p'_{i,\tilde{k}}, p_{i,\tilde{k}+1}, \ldots, p_{i,k}\}$ such that $\text{median}(\widetilde{P}_i) = m'_i$. Thus, we have

$$\text{median}\big(\text{median}(\widetilde{P}_1), \ldots, \text{median}(\widetilde{P}_{\tilde{n}}), \text{median}(P_{\tilde{n}+1}), \ldots, \text{median}(P_n)\big) = p_0. \qquad (14)$$

To show a simulation of this process, we use a uniform distribution to randomly generate nk points in the region $[-10, 10] \times [-10, 10]$ and a target point $p_0 = (x_0, y_0)$ in the region $[-20, 20] \times [-20, 20]$, and then use our algorithm to change $\tilde{n}\tilde{k}$ points of the given set so that the new set satisfies (14). Table 1 shows the results of running this experiment for different n and k, where $(x'_0, y'_0)$ is the median of medians of the new set obtained by our algorithm. It lists the various values n and k, the corresponding numbers $\tilde{n}$ and $\tilde{k}$ of points modified, and the target point and the result of our algorithm. If we tighten the terminating condition, which means increasing the number of iterations, we can obtain an even more accurate result; but already with the requirement that the Euclidean norm of the gradient be less than 0.00001, we get very accurate results, within about 0.01 in each coordinate.

| n | k | $\tilde{n}$ | $\tilde{k}$ | $(x_0, y_0)$ | $(x'_0, y'_0)$ |
|------|-----|-----|-----|------------------|------------------|
| 5 | 8 | 3 | 4 | (0.99, 1.01) | (0.99, 1.01) |
| 5 | 8 | 3 | 4 | (10.76, 11.06) | (10.70, 11.06) |
| 10 | 5 | 5 | 3 | (-13.82, -4.74) | (-13.83, -4.74) |
| 50 | 20 | 25 | 10 | (-14.71, -13.67) | (-14.72, -13.67) |
| 100 | 50 | 50 | 25 | (-14.07, 18.36) | (-14.07, 18.36) |
| 500 | 100 | 250 | 50 | (-15.84, -6.42) | (-15.83, -6.42) |
| 1000 | 200 | 500 | 100 | (18.63, -12.10) | (18.78, -12.20) |

Table 1: The results of the simulation: the data sizes n and k, the numbers $\tilde{n}$ and $\tilde{k}$ of points modified, the target point, and the median of medians of the modified set.

[Figure 1: The result for the case n = 5, k = 8, $(x_0, y_0)$ = (0.99, 1.01) in Table 1, showing the given points that are not changed, the given points that are changed, the new locations for those changed points, the medians of the old and new subsets, the median of medians for the given points, and the target point.]

We illustrate the results of this process graphically for one example from Table 1: the case n = 5, k = 8, $(x_0, y_0)$ = (0.99, 1.01), which is shown in Figure 1. In this figure, the green star is the target point. Since n = 5, we use five different markers (circle, square, upward-pointing triangle, downward-pointing triangle, and diamond) to represent the five subsets. The given data $P_{\text{flat}}$ are shown by black points and unfilled points. Our algorithm changes the unfilled points to the blue ones, and the green points are the medians of the new subsets. The red star is the median of medians of $P_{\text{flat}}$, and the other red points are the medians of the old subsets. So, we changed only 12 points out of 40, and the median of medians of the new data set is very close to the target point.
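The following is a compact re-implementation (ours, not the paper's code) of the per-subset relocation step: gradient descent on h over the positions of the $\tilde{k}$ movable points, stopping when the gradient norm drops below 0.00001. The step size, the gradient clipping, and the guard against tiny $r_i$ are ad hoc choices for this sketch; the paper does not specify them, and a line search may be needed in general.

```python
import numpy as np

def relocate(S, target, k_tilde, step=0.05, tol=1e-5, iters=500000):
    """Move the first k_tilde points of S so that `target` becomes the
    L1-median of the whole set, by gradient descent on h from Section 4."""
    S = np.asarray(S, float)
    x0, y0 = target
    free, fixed = S[:k_tilde].copy(), S[k_tilde:]
    for _ in range(iters):
        pts = np.vstack([free, fixed])
        dx, dy = pts[:, 0] - x0, pts[:, 1] - y0
        r = np.maximum(np.hypot(dx, dy), 1e-9)   # guard against r_i -> 0
        A, B = (dx / r).sum(), (dy / r).sum()    # the two sums inside h
        fx, fy, r3 = dx[:k_tilde], dy[:k_tilde], r[:k_tilde] ** 3
        gx = 2 * A * fy ** 2 / r3 - 2 * B * fx * fy / r3    # dh/dx_i
        gy = -2 * A * fx * fy / r3 + 2 * B * fx ** 2 / r3   # dh/dy_i
        if np.sqrt((gx ** 2).sum() + (gy ** 2).sum()) < tol:
            break
        free[:, 0] -= step * np.clip(gx, -1, 1)  # clipped steps: an ad hoc
        free[:, 1] -= step * np.clip(gy, -1, 1)  # safeguard for this sketch
    return np.vstack([free, fixed])

rng = np.random.default_rng(6)
S = rng.uniform(-10, 10, (8, 2))              # k = 8 points
new_S = relocate(S, (0.99, 1.01), k_tilde=4)  # move k_tilde = 4 of them

# Verify the optimality condition (13) at the target: the residual should be
# near 0, certifying that (0.99, 1.01) is the L1-median of the new set.
d = new_S - (0.99, 1.01)
r = np.linalg.norm(d, axis=1)
print(np.linalg.norm((d / r[:, None]).sum(axis=0)))
```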
5 Conclusion

We define the breakdown point of the composition of two or more estimators. These definitions are technical, but necessary to understand the robustness of composite estimators. Generally, the composition of two or more estimators is less robust than each individual estimator. We highlight a few applications and believe many more exist. These results already provide important insights for complex data analysis pipelines common to large-scale automated data analysis.

References

[1] G. Aloupis. Geometric measures of data depth. In Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications. AMS, 2006.
[2] G. Cormode and A. McGregor. Approximation algorithms for clustering uncertain data. In PODS, 2008.
[3] P. Davies and U. Gather. The breakdown point: Examples and counterexamples. REVSTAT Statistical Journal, 5:1-17, 2007.
[4] F. R. Hampel. A general qualitative definition of robustness. Annals of Mathematical Statistics, 42:1887-1896, 1971.
[5] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, 1986.
[6] X. He, D. G. Simpson, and S. L. Portnoy. Breakdown robustness of tests. Journal of the American Statistical Association, 85:446-452, 1990.
[7] P. J. Huber. Robust Statistics. Wiley, 1981.
[8] P. J. Huber and E. M. Ronchetti. Breakdown point. In Robust Statistics, page 8. John Wiley & Sons, Inc., 2009.
[9] A. G. Jørgensen, M. Löffler, and J. M. Phillips. Geometric computation on indecisive points. In WADS, 2011.
[10] A. D. Sarma, O. Benjelloun, A. Halevy, S. Nabar, and J. Widom. Representing uncertain data: models, properties, and algorithms. VLDBJ, 18:989-1019, 2009.
[11] A. F. Siegel. Robust regression using repeated medians. Biometrika, 82:242-244, 1982.
[12] P. Tang and J. M. Phillips. The robustness of estimator composition. Technical report, arXiv:1609.01226, 2016.
[13] E. Weiszfeld and F. Plastria. On the point for which the sum of the distances to n given points is minimum. Annals of Operations Research, 167:7-41, 2009.
[14] A. H. Welsh. The standard deviation. In Aspects of Statistical Inference, page 245. Wiley-Interscience, 1996.