# Minimax M-estimation under Adversarial Corruption

Sujay Bhatt, Guanhua Fang, Ping Li (Cognitive Computing Lab, Baidu Research, 10900 NE 8th St., Bellevue, WA 98004, USA; {sujaybhatt.hr, fanggh2018, pingli98}@gmail.com)
Gennady Samorodnitsky (School of ORIE, Cornell University, 220 Frank T. Rhodes Hall, Ithaca, NY 14853, USA; gs18@cornell.edu)

Footnote 1: The work of Gennady Samorodnitsky was conducted as a consulting researcher at Baidu Research, Bellevue, WA 98004.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract. We present a new finite-sample analysis of Catoni's M-estimator under adversarial contamination, where an adversary is allowed to corrupt a fraction of the samples arbitrarily. We make minimal assumptions on the distribution of the uncorrupted random variables: we only assume the existence of a known upper bound on the $(1+\varepsilon)$-th central moment. We provide a lower bound on the minimax error rate for the mean estimation problem under adversarial corruption under this weak assumption, and establish that the proposed M-estimator achieves this lower bound (up to multiplicative constants). When the variance is infinite, the tolerance to contamination of any estimator reduces as $\varepsilon \to 0$. We establish a tight upper bound that characterizes this trade-off. To illustrate the usefulness of the derived robust M-estimator in an online setting, we present a bandit algorithm for the partially identifiable best arm identification problem that improves upon the sample complexity of the state-of-the-art algorithms.

1. Introduction

Univariate mean estimation plays an important role in many statistical learning problems, ranging from classification and regression (James et al., 2013) to online learning (Lattimore and Szepesvári, 2020; Agarwal et al., 2019). A fundamental challenge in machine learning, which uses diverse data drawn from heterogeneous sources, is that outliers and adversarial contamination are unavoidable. For example, the presence of outliers is the primary source of invalid inferences in fMRI (Eklund et al., 2016). In addition to dealing with heavy tails, modern machine learning also has to deal with malicious noise (Auer and Cesa-Bianchi, 1998; Diakonikolas et al., 2018), where an adversary can arbitrarily corrupt a fraction of the data to disrupt inference. These challenges motivate the study of robust mean estimation in this paper, with a focus on providing estimators that can tolerate a large amount of adversarial corruption, possibly in the presence of heavy-tailed data.

Standard mean estimation based on mean squared error is insufficient to deal with such diverse data (Catoni, 2012); mean estimation based on deviations, namely the estimation of confidence intervals, offers a much better alternative (Catoni, 2012; Brownlees et al., 2015). Here the fundamental problem is to design an estimator $\hat\mu_n = \hat\mu_n(\{X_i\}_{i=1}^n)$ that, for a given confidence $\delta \in (0,1)$, has the smallest possible $\varrho = \varrho(n,\delta)$ such that
$$\mathbb{P}\big\{|\hat\mu_n - \mu| > \varrho\big\} \le \delta.$$
In the next two subsections, we briefly review the standard results (and key challenges) for robust mean estimation under the finite and infinite variance settings, separately.

1.1. Finite Variance, ε = 1

In the case of finite variance, the classical empirical mean is a poor estimator of the mean due to the presence of outliers (Catoni, 2012; Devroye et al., 2016; Oliveira and Orenstein, 2019). In fact, it was shown in Catoni (2012) that, if $\upsilon$ is the variance of the observations, then the best possible error $\varrho(n,\delta)$ for the empirical mean of $\{X_i\}_{i=1}^n$ is of the order of $\sqrt{\upsilon/(\delta n)}$, which is far from the best possible, as informed by the central limit theorem (Lugosi and Mendelson, 2019a).
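For concreteness, the following minimal snippet contrasts the polynomial $1/\sqrt{\delta}$ dependence of the empirical mean with the logarithmic $\sqrt{\log(2/\delta)}$ dependence of a sub-Gaussian estimator. It is an illustration of the rates quoted above with constants omitted, not an exact restatement of the theorems.

```python
import math

# Chebyshev-type deviation of the empirical mean vs. the sub-Gaussian rate.
def empirical_mean_rate(v, n, delta):
    return math.sqrt(v / (delta * n))        # order sqrt(v/(delta*n))

def subgaussian_rate(v, n, delta):
    return math.sqrt(2 * v * math.log(2 / delta) / n)

v, n = 1.0, 10_000
for delta in (1e-2, 1e-4, 1e-6):
    print(f"delta={delta:g}: empirical={empirical_mean_rate(v, n, delta):.4f}, "
          f"sub-Gaussian={subgaussian_rate(v, n, delta):.4f}")
```

As $\delta$ shrinks, the empirical-mean deviation blows up polynomially while the sub-Gaussian deviation grows only logarithmically; this gap is what the estimators below are designed to close.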
Devroye et al. (2016) exhibit several sub-Gaussian estimators (see Footnote 2) of the mean, such as median-of-means, trimmed mean, and Catoni's M-estimator. In the presence of malicious noise, however, every robust estimator has an asymptotic bias of $O(\sqrt{\upsilon\eta})$, where $\eta$ is the fraction of samples corrupted by the adversary (Lai et al., 2016; Hopkins and Li, 2018; Lugosi and Mendelson, 2021). In the univariate setting, such estimators have been derived using the well-known estimators for finite variance:

(i) Using the empirical mean: The main idea here is based on the robust estimator proposed in Lai et al. (2016). Essentially, the data is split into two halves: one half is used to construct bounded intervals that trap many samples from the uncorrupted distribution with high confidence, while an empirical estimate over the samples of the other half contained in the constructed interval is returned as the robust estimate of the mean. It is shown that this estimator has the (minimax) asymptotic bias of $O(\sqrt{\upsilon\eta})$ (Prasad et al., 2020a, Lemma 3).

(ii) Using the trimmed mean: The main idea is similar to that in Lai et al. (2016), except that, instead of the intervals, order statistics of one half of the data are used to compute the minimum and maximum truncation levels, while a smoothed estimate of the other half of the data is returned as the robust estimate of the mean. A subtle difference here is that the samples in the second half outside the truncation levels are not discarded but are instead set to the truncation level. It has been shown that this estimator again achieves the (minimax) asymptotic bias of $O(\sqrt{\upsilon\eta})$; see Lugosi and Mendelson (2021, Theorem 1).

It is clear that the above methods, while achieving the minimax error up to constants, do not make use of the data effectively: one half of the data is used only to extract periphery information. This, in part, motivates robust estimation using M-estimators that effectively utilize the entire data to return a robust estimate of the mean, while achieving a high tolerance to adversarial contamination. For example, when $\varepsilon = 1$, we establish that, due to the inherent robustness of the proposed M-estimator, it tolerates up to 36% arbitrary contamination and outperforms the state-of-the-art estimators based on the trimmed mean (Lugosi and Mendelson, 2021).

There is related work using the median-of-means estimator (Laforgue et al., 2021), where the number of groups is modulated using the fraction of outliers ($\eta$), and the authors obtain L-sub-Gaussian-like estimators with a very large $L(\eta)$. The asymptotic bias is not characterized, and it is unclear whether the proposed estimator achieves the minimax error bound. In contrast, the focus of this paper is to obtain sharp constants with minimax asymptotic bias. L-sub-Gaussian estimators are optimal up to constants, with $L = \sqrt{2} + o(1)$ identified as nearly optimal (Devroye et al., 2016). Also see Buldygin and Kozachenko (2000) and Li (2007, Chapter 4) for essentially equivalent definitions (and more applications) of sub-Gaussian estimators.

Footnote 2: An estimator $\hat\mu_n$ is L-sub-Gaussian for a constant $L > 0$ if, for random variables with variance $\upsilon$ and (any) sample size $n$, with probability at least $1-\delta$, $|\hat\mu_n - \mu| \le L\sqrt{\upsilon\log(2/\delta)/n}$.
1.2. Infinite Variance, ε < 1

When $\varepsilon < 1$, Devroye et al. (2016) establish that the achievable $\varrho(n,\delta)$ is no longer sub-Gaussian and is in fact given by the following result.

Theorem (Lower Bound). There exists a distribution with mean $\mu$ and $(1+\varepsilon)$-th central moment $\upsilon_\varepsilon$ such that for any mean estimator $\hat\mu_n$ and $\delta \in (2e^{-n/4}, 1/2)$,
$$\mathbb{P}\Big\{|\hat\mu_n - \mu| > \upsilon_\varepsilon^{1/(1+\varepsilon)}\Big(\frac{\log(2/\delta)}{n}\Big)^{\varepsilon/(1+\varepsilon)}\Big\} \ge \delta.$$
This is established in Devroye et al. (2016, Theorem 3.1).

There are estimators that attain the optimal order, such as median-of-means (Bubeck et al., 2013; Minsker, 2019; Lecué and Lerasle, 2020) and the trimmed mean (Oliveira and Orenstein, 2019; Lugosi and Mendelson, 2021), each with their own merits and shortcomings. See Lugosi and Mendelson (2019a) for an excellent summary and related literature. In the presence of malicious noise, however, there are no simple estimators that achieve the minimax error rate. Cherapanamjeri et al. (2020) consider a similar problem over random vectors and propose a two-stage algorithm that is inspired by semi-definite relaxations of the results derived in Lugosi and Mendelson (2019b). However, for univariate mean estimation, the proposed algorithm is neither efficient nor obtains sharp error bounds. The M-estimators derived in this paper alleviate both these shortcomings and achieve the minimax rate in univariate settings.

1.3. Main Results

Catoni's estimator (Catoni, 2012) is a nearly optimal estimator ($L = \sqrt{2} + o(1)$) in the absence of adversarial contamination under finite variance. This motivates extending the analysis to deal with adversarial contamination under weaker moment assumptions. The main technical contributions are highlighted below.

1. We provide an M-estimator based on Catoni's influence functions (Catoni, 2012) that can deal with a fraction $\eta$ of arbitrary adversarial contamination. The error bounds are order optimal in $\eta$, $\delta$ and $n$, and achieve the best minimax error obtainable under the finite variance assumption. While the tight non-asymptotic bounds under contamination are of independent interest, an interesting feature of the proposed M-estimator is that the inherent robustness offered by the influence functions facilitates a large (compared to the state of the art) tolerance to adversarial contamination.

2. The finite variance assumption is not always valid in machine learning applications, where the existence of a bound on the $(1+\varepsilon)$-th central moment is all that is available. We extend the analysis to deal with adversarial contamination under weak moment assumptions. We first derive the minimax rate achievable in this setting, and establish that the derived Catoni's estimator achieves this rate. The novel analysis also offers insights into the nature of the tolerance to contamination of any minimax estimator as $\varepsilon \to 0$. In line with intuition, the tolerance reduces, as it becomes increasingly difficult to distinguish contaminated samples from the underlying data. We also explicitly characterize the rate of this gradual decline.

3. As an application of the developed error bounds, we demonstrate their usefulness in a multi-armed bandit application. We propose a novel best arm identification algorithm under adversarial contamination that outperforms the state-of-the-art algorithms in terms of sample complexity and exhibits excellent empirical performance.

1.4. Other Related Work

Mean estimation under adversarial contamination is a well-researched topic in the robust statistics community.
The contamination model introduced by Huber (1964), called Huber's contamination model, provides a solid framework to design optimal estimators that achieve statistical efficiency and robustness simultaneously (Huber, 2004). It is not surprising that it has been widely used to model adversarial contamination in statistical learning (Chen et al., 2016; 2018; Prasad et al., 2020b; Sun et al., 2020; Laforgue et al., 2021; Bhatt et al., 2022a) and online learning (Chen et al., 2021b; Prasad et al., 2020a; Zhu et al., 2020; Bhatt et al., 2022b). In the context of bandit applications, best arm identification under adversarial contamination based on Huber's contamination model was first considered in Altschuler et al. (2019), where several abstract frameworks were introduced to address this challenging setting. We adopt one such framework, called the Partially Identifiable Best Arm Identification (PIBAI) framework, to propose algorithms using the developed estimators. While Altschuler et al. (2019) used the median to define and identify the best arm, Mukherjee et al. (2021) consider the traditional setting using the mean instead. In this work, we consider the PIBAI problem in the fixed confidence setting, and seek to identify the arm having the largest mean using as few samples as possible.

2. Contamination Model

Let $\{X_i\}_{i \le n}$ be i.i.d. observations with $\mathbb{E}X_1 = \mu$. We assume that for some corruption rate $0 < \eta < 1$, an adversary may change at most $\eta n$ of these observations to arbitrary values. The resulting set of observations will be $\tilde X_1, \tilde X_2, \ldots, \tilde X_n$, so that
$$\sum_{i=1}^n \mathbb{1}\{X_i \ne \tilde X_i\} \le \eta n. \quad (1)$$
The task is to estimate the true mean $\mu$ based on the observations $\tilde X_1, \tilde X_2, \ldots, \tilde X_n$. This strong adversarial contamination model is well studied in machine learning; see Charikar et al. (2017); Hopkins and Li (2018); Lugosi and Mendelson (2021) and the related references for existing results. It should be noted that the well-known $\epsilon$-Huber contamination model (Huber, 2004) is just a special case of Eq. (1); see Lai et al. (2016) and Prasad et al. (2020a) for robust mean estimation procedures using this model. We will develop M-estimators to deal with the contamination model given in Eq. (1).

The contamination model Eq. (1), while allowing arbitrary contamination, is retrospective, in that the adversary corrupts a fraction of the sample possibly with knowledge of the whole data set. This is more than what the adversary can do in an online corruption setting, where the adversary can choose to contaminate based only on the observed data. So hereafter, we assume the corruption model Eq. (1) with parameter $\eta$ for online settings as well, without loss of generality (see Appendix B).

3. Robust M-estimator for Finite Variance

We use a special version of the estimator of the mean that was proposed by Catoni (2012). In the context of L-sub-Gaussian estimators, the estimator proposed in Catoni (2012), henceforth referred to as Catoni's estimator, is significant owing to the fact that it is a (nearly) optimal sub-Gaussian estimator of the mean with $L = \sqrt{2} + o(1)$ (see Footnote 3). This motivates extending this estimator to deal with adversarial contamination. In the following, the terminology "robust" is used to collectively describe the stability of the estimator with respect to the tail behavior of the data, and how well it tolerates adversarial corruption.

Start with a non-decreasing function $\psi: \mathbb{R} \to \mathbb{R}$ such that
$$-\log(1 - x + x^2/2) \le \psi(x) \le \log(1 + x + x^2/2) \quad (2)$$
for all $x \in \mathbb{R}$. One can choose such a function that is bounded (see Footnote 4): specifically, we assume that for some $0 < A < \infty$,
$$|\psi(x)| \le A \text{ for all } x \in \mathbb{R}. \quad (3)$$

Robust Catoni's Estimator: We define the estimator of $\mu$, denoted by $\hat\mu$, as the solution of the equation in the variable $\theta$
$$\sum_{i=1}^n \psi\big(\alpha(\tilde X_i - \theta)\big) = 0. \quad (4)$$

Footnote 3: Here "optimal" is to be understood in the sense that Catoni's estimator comes close to the best possible $L (= \sqrt{2})$.
Footnote 4: See Eq. (67) in Appendix C.1 for an explicit representation.
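To make the construction concrete, the following minimal Python sketch implements the narrowest bounded influence function of Appendix C.1 and solves Eq. (4) by bisection; the left-hand side of Eq. (4) is non-increasing in $\theta$, so a root is bracketed by the sample extremes. The choice of $\alpha$ follows the parameterization of Theorem 3.1 below. This is an illustrative sketch, not the authors' reference implementation.

```python
import numpy as np

def psi(x):
    """Narrowest bounded Catoni influence function (Appendix C.1), |psi| <= log 2."""
    x = np.asarray(x, dtype=float)
    ax = np.minimum(np.abs(x), 1.0)
    core = -np.log1p(-ax + 0.5 * ax**2)        # -log(1 - |x| + |x|^2/2) on [0, 1]
    out = np.where(np.abs(x) >= 1.0, np.log(2.0), core)
    return np.sign(x) * out                    # odd extension for x <= 0

def robust_catoni(x_tilde, v, eta, delta, omega):
    """Solve sum_i psi(alpha*(x_i - theta)) = 0 by bisection (Eq. (4)).

    alpha follows Theorem 3.1: alpha = sqrt((omega*A*eta + 2*log(2/delta)/n) / v),
    with A = log 2 for the influence function above.
    """
    x_tilde = np.asarray(x_tilde, dtype=float)
    n, A = len(x_tilde), np.log(2.0)
    alpha = np.sqrt((omega * A * eta + 2.0 * np.log(2.0 / delta) / n) / v)
    f = lambda theta: psi(alpha * (x_tilde - theta)).sum()  # non-increasing in theta
    lo, hi = x_tilde.min(), x_tilde.max()                   # f(lo) >= 0 >= f(hi)
    for _ in range(100):                                    # bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)
```

Because $\psi$ is bounded by $\log 2$, no single corrupted sample can move the root of Eq. (4) by more than a controlled amount; this is the mechanism behind the contamination tolerance quantified next.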
Clearly, $\psi(x) = x$ corresponds to the empirical mean, though it does not satisfy Eq. (2). Catoni (2012) notes that the empirical mean is unduly influenced by large values when the distribution tails are not themselves sub-Gaussian. When instead $\psi(x)$ satisfies Eq. (2), it is similar to the linear function for small and moderate values of $x$, but its logarithmic rate of growth reduces the effect of large values. Unlike the classical M-estimation literature (Huber, 2004), Catoni (2012) uses unbounded influence functions to obtain sharp confidence bounds. However, we will show that, when we use a bounded function instead, Catoni's estimator can deal with adversarial contamination as well.

Theorem 3.1. Let $\delta \in (0,1)$ be such that $\delta \ge 2e^{-n/4}$. Let $\{X_i\}_{i=1}^n$ be i.i.d. random variables with mean $\mu$ and $\mathbb{E}|X_1-\mu|^2 \le \upsilon$. Let the corruption parameter $\eta \in [0, \frac{1}{4A})$ for $A > 0$ in Eq. (3). Let $\Omega \in (0, \frac{1}{A\eta} - 4)$ be such that
$$n \ge \frac{4\log(2/\delta)}{1 - (\Omega+4)A\eta}.$$
The robust Catoni's M-estimator $\hat\mu$ with parameter
$$\alpha = \sqrt{\frac{\Omega A\eta + 2\log(2/\delta)/n}{\upsilon}}$$
satisfies, with probability at least $1-\delta$,
$$|\hat\mu - \mu| < \upsilon^{1/2}\,\frac{\frac{\Omega+4}{2\Omega^{1/2}}A^{1/2}\eta^{1/2} + \big(2\log(2/\delta)/n\big)^{1/2}}{1 - (\Omega+4)A\eta/2 - 2\log(2/\delta)/n}. \quad (5)$$
Moreover, with probability at least $1 - 2\exp\!\big(-\frac{\eta A}{4}n\big)$, we have that
$$|\hat\mu - \mu| < \tilde C\sqrt{\upsilon\eta} \quad (6)$$
for some constant $\tilde C$.

Theorem 3.1 provides a non-asymptotic error bound for the robust mean estimator in the presence of adversarial contamination. Clearly, the asymptotic bias is $O(\sqrt{\upsilon\eta})$, which is information-theoretically optimal (Hopkins and Li, 2018). The error bound obtained in Eq. (5) is not the tightest possible. A slightly sharper error bound, less explicit in its dependence on the parameters, can be obtained similarly.

Corollary 3.2. Under the same assumptions as in Theorem 3.1, with probability at least $1-\delta$,
$$|\hat\mu - \mu| < \frac{(\Omega+4)A\eta + 4\log(2/\delta)/n}{2\alpha\big(1 - (\Omega+4)A\eta/2 - 2\log(2/\delta)/n\big)}.$$

3.1. Minimax Lower Bound

The minimax error bound for estimators under the bounded-variance assumption has been derived in Diakonikolas et al. (2017); Lugosi and Mendelson (2021). In this section, we recall the key ideas that will enable the proof of the minimax lower bound for variables with bounded $(1+\varepsilon)$-th moments in Sec. 4.3. Since at most an $\eta$ fraction of the samples can be corrupted by the adversary, Lugosi and Mendelson (2021) argue that the adversary can arbitrarily corrupt the values in the tail quantiles (see Footnote 5) beyond $Q_{\eta/2}(X-\mu)$ and $Q_{1-\eta/2}(X-\mu)$ to introduce the worst possible error. Then, with high probability, no estimator can perform better than
$$\max\Big\{\mathbb{E}\big[|X - \mu - Q_{\eta/2}|\,\mathbb{1}\{X - \mu \ge Q_{\eta/2}\}\big],\; \mathbb{E}\big[|X - \mu - Q_{1-\eta/2}|\,\mathbb{1}\{X - \mu \le Q_{1-\eta/2}\}\big]\Big\}.$$
Combined with the sub-Gaussian error bound under no contamination, this yields the best minimax error
$$c\,\upsilon^{1/2}\max\Big\{\eta^{1/2},\; \big(\log(2/\delta)/n\big)^{1/2}\Big\}, \quad (7)$$
where $c$ is an absolute constant. By considering a threshold on the range of $n$ as a function of $\delta$ and $\eta$, it follows that Eq. (5) switches between a sub-Gaussian bound and the order-optimal corruption bias of Eq. (6), essentially matching the lower bound.

Footnote 5: Here the quantile is defined as $Q_p(X-\mu) := \sup\{M \in \mathbb{R} : \mathbb{P}\{X-\mu \le M\} \le 1-p\}$.
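The switching behavior in Eq. (7) can be seen numerically: below roughly $n^\star = \log(2/\delta)/\eta$ pulls, the statistical term dominates; above it, the corruption bias does. The snippet below evaluates both terms of Eq. (7) (constants omitted) around this crossover; it is an illustration of the regimes, not a statement of the exact threshold.

```python
import math

# Two regimes of the minimax bound (7): statistical error vs. corruption bias.
v, eta, delta = 1.0, 0.01, 1e-3
n_star = math.log(2 / delta) / eta          # approximate crossover sample size
for n in (int(n_star // 10), int(n_star), int(n_star * 10)):
    stat = math.sqrt(v * math.log(2 / delta) / n)
    bias = math.sqrt(v * eta)
    dominant = "statistical" if stat > bias else "corruption bias"
    print(f"n={n:7d}  statistical={stat:.4f}  bias={bias:.4f}  dominant={dominant}")
```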
3.2. Comparison with Lugosi and Mendelson (2021)

Lugosi and Mendelson (2021) develop a mean estimator based on the trimmed mean that is robust to adversarial contamination. The corruption parameter $\eta$ is required to satisfy $8\eta + 12\log(4/\delta)/n \le 1/2$, so that an upper bound on the corruption fraction tolerated by the robust trimmed mean estimator is $\eta < 1/16$, or around 6%. In contrast, for the robust Catoni's estimator in Eq. (5), the upper bound on $\eta$ is $1/(4A)$. For $A = \log 2$, as in Eq. (67), the robust estimator can tolerate up to 36% contamination, about six times higher!

3.3. Comparison with Robust Empirical Mean

Prasad et al. (2020a) develop an interval estimator based on the empirical mean that is robust to adversarial contamination. The corruption parameter $\eta$ is required to satisfy a condition involving $\log(2/\delta)/n$ under which, at best, $\eta < 1/4$. An upper bound on the corruption fraction tolerated by this robust empirical mean estimator is therefore $\eta < 1/4$, or at most 25%, compared with the robust Catoni's estimator in Eq. (5), which can tolerate up to 36% contamination.

Remark 1. Here we would like to point out that Lugosi and Mendelson (2021) and Prasad et al. (2020a) did not push the requirement on $\eta$ to its limit. The constants cannot be easily improved using their current proof techniques, however. So there might be room for improvement in each of the estimators, which could be worthwhile to explore as future work.

4. Robust M-estimator for Infinite Variance

Let $\varepsilon \in (0,1]$ and let $\{X_i\}_{i=1}^n$ be a sequence of i.i.d. random variables such that $\mathbb{E}(X_1) = \mu$ and $\mathbb{E}|X_1-\mu|^{1+\varepsilon} \le \upsilon_\varepsilon$. For $C_\varepsilon > 0$, let $\psi: \mathbb{R} \to \mathbb{R}$ be a nondecreasing influence function such that for all $x \in \mathbb{R}$
$$-\log\big(1 - x + C_\varepsilon|x|^{1+\varepsilon}\big) \le \psi(x) \le \log\big(1 + x + C_\varepsilon|x|^{1+\varepsilon}\big). \quad (8)$$
Chen et al. (2021a), inspired by Taylor-like expansions, and Minsker (2018, Section 3.4) make other choices of $C_\varepsilon$. Motivated by the choice of the coefficient in Eq. (2), we choose a value that satisfies
$$\big(1 - x + C_\varepsilon x^{1+\varepsilon}\big)\big(1 + x + C_\varepsilon x^{1+\varepsilon}\big) \ge 1, \quad x \ge 0,$$
namely
$$C_\varepsilon = \frac{\varepsilon^{\varepsilon}(1-\varepsilon)^{(1-\varepsilon)/2}}{(1+\varepsilon)^{(1+\varepsilon)/2}}.$$
When $\varepsilon = 1$, we also recover the coefficient in Catoni (2012), namely $C_1 = 1/2$. One can choose a bounded function satisfying Eq. (8) (see Footnote 6): specifically, we assume that for some $0 < A_\varepsilon < \infty$,
$$|\psi(x)| \le A_\varepsilon \text{ for all } x \in \mathbb{R}. \quad (9)$$

Robust Catoni's Estimator for ε < 1: As before, we define the Catoni's M-estimator $\hat\mu_\varepsilon$ as a solution of the equation $\sum_{i=1}^n \psi(\alpha(\tilde X_i - \theta)) = 0$ in the variable $\theta$, using an influence function $\psi$ satisfying Eq. (8). If the solution is not unique, choose $\hat\mu_\varepsilon$ to be the median of the solutions. Again, the motivation is to reduce the effect of large values using a logarithmic function.

Theorem 4.1. Let $\{X_i\}_{i=1}^n$ be i.i.d. random variables with mean $\mu$ and $\mathbb{E}|X_1-\mu|^{1+\varepsilon} \le \upsilon_\varepsilon$. Fix $\delta \in (0,1)$, $\tau > 0$, $0 < h < 1$ and $B > 0$, and let the corruption parameter satisfy
$$\eta \in \Big[0,\; \frac{\tau^{1/\varepsilon}}{2A_\varepsilon(1+\tau)^{(1+\varepsilon)/\varepsilon}C_\varepsilon^{1/\varepsilon}}\Big)$$
for $\varepsilon \in (0,1]$. Let $n$ be such that
$$n \ge \frac{\log(2/\delta)\,\big(1 + h^{-\varepsilon}B^{1+\varepsilon}C_\varepsilon\big)}{\frac{\tau^{1/\varepsilon}}{(1+\tau)^{(1+\varepsilon)/\varepsilon}}(1-h)C_\varepsilon^{-1/\varepsilon} - 2A_\varepsilon\eta\big(1 + h^{-\varepsilon}B^{1+\varepsilon}C_\varepsilon\big)}.$$
The Catoni's M-estimator $\hat\mu_\varepsilon$ with parameter
$$\alpha = B\,\upsilon_\varepsilon^{-1/(1+\varepsilon)}\Big(2A_\varepsilon\eta + \frac{\log(2/\delta)}{n}\Big)^{1/(1+\varepsilon)}$$
satisfies, with probability at least $1-\delta$,
$$|\hat\mu_\varepsilon - \mu| < (1+\tau)\,\upsilon_\varepsilon^{1/(1+\varepsilon)}\Big(2A_\varepsilon\eta + \frac{\log(2/\delta)}{n}\Big)^{\varepsilon/(1+\varepsilon)}\big(h^{-\varepsilon}B^\varepsilon C_\varepsilon + 1/B\big). \quad (10)$$

Note that the error obtained in Eq. (10) has an $\eta^{\varepsilon/(1+\varepsilon)}$ dependence on the corruption rate, a $(\log(2/\delta)/n)^{\varepsilon/(1+\varepsilon)}$ dependence on the number of observations, and a $\upsilon_\varepsilon^{1/(1+\varepsilon)}$ dependence on the centered moment. We will see below that this order of magnitude is optimal.

Footnote 6: See Eq. (69) in Appendix C.2 for an explicit function representation.
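The bounded influence function of Appendix C.2 and the constants $C_\varepsilon$, $A_\varepsilon$ can be implemented directly; the sketch below uses the expressions as reconstructed above and can be plugged into the same bisection solver shown after Eq. (4). It is illustrative only.

```python
import numpy as np

def c_eps(eps):
    """Coefficient C_eps as chosen above; c_eps(1.0) == 0.5."""
    return eps**eps * (1 - eps)**((1 - eps) / 2) / (1 + eps)**((1 + eps) / 2)

def a_eps(eps):
    """Bound A_eps on |psi| from Eq. (68); a_eps(1.0) == log 2."""
    x_star = ((1 + eps) * c_eps(eps)) ** (-1.0 / eps)
    return -np.log(1 - (eps / (1 + eps)) * x_star)

def psi_eps(x, eps):
    """Bounded influence function of Eq. (69), capped at +/- A_eps."""
    C = c_eps(eps)
    x = np.asarray(x, dtype=float)
    x_star = ((1 + eps) * C) ** (-1.0 / eps)   # argmax of -log(1 - x + C x^{1+eps})
    ax = np.minimum(np.abs(x), x_star)         # clipping realizes the cap
    core = -np.log(1 - ax + C * ax ** (1 + eps))
    return np.sign(x) * core
```

The clipping at $x^\star = ((1+\varepsilon)C_\varepsilon)^{-1/\varepsilon}$ is exactly where the core function attains its maximum $A_\varepsilon$, so the piecewise definition of Eq. (69) is realized without branches.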
4.1. Choice of τ, h and B in Eq. (10)

Choose $\tau_{\varepsilon,\eta} > 0$, $0 < h_{\varepsilon,\eta} < 1$, and $B > 0$ such that
$$(1 - h_{\varepsilon,\eta})\,\frac{\tau_{\varepsilon,\eta}^{1/\varepsilon}}{(1+\tau_{\varepsilon,\eta})^{(1+\varepsilon)/\varepsilon}} > \big(1 + h_{\varepsilon,\eta}^{-\varepsilon}B^{1+\varepsilon}C_\varepsilon\big)\,C_\varepsilon^{1/\varepsilon}\,2A_\varepsilon\eta.$$
As $\eta \to 0$, we can choose $\tau_{\varepsilon,\eta}$ arbitrarily close to 0, $h_{\varepsilon,\eta}$ arbitrarily close to 1, and then choose the $B$ that minimizes the expression on the right-hand side of Eq. (10) for $h$ arbitrarily close to 1, i.e., $B = (\varepsilon C_\varepsilon)^{-1/(1+\varepsilon)}$. This gives the following result.

Corollary 4.2. For large $n$ and small $\eta$, with probability at least $1-\delta$,
$$|\hat\mu_\varepsilon - \mu| \le (1+o(1))\,\upsilon_\varepsilon^{1/(1+\varepsilon)}\,(1+\varepsilon)\,\varepsilon^{-\varepsilon/(1+\varepsilon)}\,C_\varepsilon^{1/(1+\varepsilon)}\Big(2A_\varepsilon\eta + \frac{\log(2/\delta)}{n}\Big)^{\varepsilon/(1+\varepsilon)}.$$
Note that for $\varepsilon = 1$, the result recovers the near-optimal asymptotic constant for $\eta \to 0$ as in Catoni (2012) (see also Theorem 3.1): since $C_1^{1/2} = 1/\sqrt{2}$, the constant equals $\sqrt{2}$. This indicates that the error bounds are nearly tight for $\varepsilon < 1$ as well in the absence of contamination. This is not the case, for example, with the estimator proposed in Chen et al. (2021a) for $\eta = 0$.

4.2. Corruption Tolerance under ε < 1

Below we present a tight upper bound on the tolerance to contamination when $\varepsilon \in (0,1]$.

Corollary 4.3. For any $\varepsilon \in (0,1]$, the robust M-estimator for infinite variance can tolerate a corruption level $\eta \in [0, \Lambda(\varepsilon))$ with
$$\Lambda(\varepsilon) := \frac{\Omega(\varepsilon)}{2\log\frac{1}{1-\Omega(\varepsilon)}}, \quad \text{where } \Omega(\varepsilon) := \Big[(1+\varepsilon)^{\frac{1+\varepsilon}{2\varepsilon}}(1-\varepsilon)^{\frac{1-\varepsilon}{2\varepsilon}}\Big]^{-1}.$$

It turns out that, for a fixed value of the parameter $\tau > 0$, an upper bound on the contamination rate that our estimator can handle is
$$2A_\varepsilon\eta < \frac{\tau^{1/\varepsilon}}{(1+\tau)^{(1+\varepsilon)/\varepsilon}}\,C_\varepsilon^{-1/\varepsilon}.$$
The value $\tau = 1/\varepsilon$ maximizes the expression on the right-hand side, so, assuming that $A_\varepsilon$ is given by Eq. (68), the upper bound $\Lambda(\varepsilon)$ on the contamination rate given in the corollary is obtained.

Table 1. Percentage corruption tolerance.

| $\varepsilon$ | 0.001 | 0.1 | 0.3 | 0.5 | 0.9 | 1 |
|---|---|---|---|---|---|---|
| $\Lambda(\varepsilon)$ (as %) | 7% | 16% | 22% | 26% | 34% | 36% |

As seen in Table 1, as $\varepsilon$ decreases from 1 to 0, $\Omega(\varepsilon)$ increases from 1/2 to 1, and the upper bound on the contamination rate decreases from $1/(4\log 2)$ to 0. Remarkably, even in the challenging setting $\varepsilon = 0.001$, the proposed M-estimator tolerates around 7% contamination.

4.3. Minimax Lower Bound for ε < 1

We now establish a lower bound on the best minimax error rate achievable under adversarial contamination in the weak moment setting.

Theorem 4.4. Let $n > 0$ and $\delta \in (0,1)$ be such that $\delta \ge 2e^{-n/4}$. Let $\varepsilon \in (0,1]$ and let the distribution $D$ be such that $\mathbb{E}_{X_1\sim D}|X_1-\mu|^{1+\varepsilon} = \upsilon_\varepsilon$. The error of any estimator $\hat\mu_n(\{X_i\}_{i\le n})$ in the adversarial contamination setting with corruption parameter $0 < \eta < 1$ is at least
$$q\,\upsilon_\varepsilon^{1/(1+\varepsilon)}\max\Big\{\eta^{\frac{\varepsilon}{1+\varepsilon}},\; \Big(\frac{\log(2/\delta)}{n}\Big)^{\frac{\varepsilon}{1+\varepsilon}}\Big\}$$
for a suitable absolute constant $q$.

Clearly, the error we obtain in Eq. (10) is minimax optimal in the sense of Theorem 4.4, by arguments similar to those in Sec. 3.1. Essentially, there is a threshold for $n$, depending on $\varepsilon$, $\delta$ and $\eta$, beyond which the bias dominates and below which the natural variability of the data dominates.

Remark 2. All results derived in the previous sections require knowledge of $\upsilon_\varepsilon$. The extension to an unknown moment parameter $\upsilon_\varepsilon$ can be achieved using Lepskii's method (Lepskii, 1992), which adapts to any unknown moment of the problem at the cost of the tightness of the deviations.
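As a sanity check on Corollary 4.3, the following snippet evaluates $\Lambda(\varepsilon)$ at the values of Table 1, using the expressions for $\Omega(\varepsilon)$ as reconstructed above; it is a verification aid, not part of the estimator.

```python
import math

def tolerance(eps):
    """Corruption tolerance Lambda(eps) of Corollary 4.3."""
    omega = ((1 + eps) ** ((1 + eps) / (2 * eps)) *
             (1 - eps) ** ((1 - eps) / (2 * eps))) ** -1
    return omega / (2 * math.log(1 / (1 - omega)))

for eps in (0.001, 0.1, 0.3, 0.5, 0.9, 1.0):
    print(f"eps={eps:5.3f}  Lambda = {100 * tolerance(eps):4.1f}%")
```

Running this reproduces the entries of Table 1 (e.g., about 36.1% at $\varepsilon = 1$ and about 6.6% at $\varepsilon = 0.001$, rounded in the table to 36% and 7%).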
5. Application: Best Arm Identification

Consider the best arm identification problem on a multi-armed bandit with $[K] := \{1, 2, \ldots, K\}$ arms in the fixed confidence setting. The goal is to identify the best arm with high probability, while providing qualitative guarantees. A primary difficulty in the contaminated setting, as opposed to the classical setting (Even-Dar et al., 2006), is that the true parameters can only be partially identified (Altschuler et al., 2019), even with infinite samples. That is, there is an inherent asymptotic bias, which needs to be taken into account while identifying the best arm. This motivates the partially identifiable best arm identification (PIBAI) framework of Altschuler et al. (2019), summarized as follows.

With each arm $i \in [K]$, associate a family of reward distributions $D_i = \{D_i(\mu_i, \eta)\}_{\eta\in E}$, where $\eta$ represents a corruption and $E$ is some space. The uncorrupted reward associated with arm $i$ has mean $\mu_i$ and a centered $(1+\varepsilon)$-moment not exceeding $\upsilon_\varepsilon$, $0 < \varepsilon \le 1$. Let $\star = \arg\max_{i\in[K]}\mu_i$ denote the best arm. To simplify notation, rewrite Eq. (5) and Eq. (10) as follows: for any arm $i \in [K]$, after $t$ pulls,
$$|\hat\mu_i(t) - \mu_i| \le H_i(\eta) + G_{\varepsilon,\eta}(\delta)\,t^{-\frac{\varepsilon}{1+\varepsilon}}, \quad (11)$$
where $H_i(\eta)$ and $G_{\varepsilon,\eta}(\delta)$ absorb the remaining terms (see Appendix C.3).

PIBAI Model Assumptions (Altschuler et al., 2019):

A1. Even with infinitely many samples $X_t(i) \sim D_i(\mu_i, \eta_t)$ for unknown $\eta_t$, $t = 1, 2, \ldots$, it is impossible to estimate $\mu_i$ more precisely than the region $[\mu_i \pm H_i]$.

A2. The unavoidable biases $\{H_i\}_{i\in[K]}$ are such that the effective gaps
$$\Delta_i := (\mu_\star - H_\star) - (\mu_i + H_i) \quad (12)$$
are strictly positive for each sub-optimal arm $i \ne \star$.

A3. There exists an algorithm that is $\delta$-PAC (see Footnote 7) for the given contaminated bandit instance $(D)$.

The above assumptions are motivated by the fact that, even if an estimator computes the means of the individual arms with large tolerance in the presence of contamination, it is not guaranteed that the relative ordering between the estimated means remains the same. In fact, if Eq. (12) does not hold, it can be shown that no algorithm can distinguish between the best and the second best arms, even with access to infinitely many samples (Altschuler et al., 2019).

Footnote 7: A PIBAI algorithm is said to be $\delta$-PAC if it outputs an arm $\hat I$ that satisfies $\mathbb{P}\big\{\mu_{\hat I} + H_{\hat I} < \mu_\star - H_\star\big\} \le \delta$.

5.1. Adversarial Elimination with Catoni

A naive approach to identify the best arm that attains the optimal order of sample complexity up to logarithmic terms is based on successive elimination. However, the standard algorithms achieve the optimal order only with doubly-logarithmic terms. When $\varepsilon = 1$, Altschuler et al. (2019) argue that the most common approaches (Karnin et al., 2013; Jamieson et al., 2013) to improve the order to $O\big(\Delta_i^{-2}\log\frac{K\log(1/\Delta_i)}{\delta}\big)$ are unsuccessful in the presence of contamination. The key issue that contributes to this shortcoming, unlike in a classical successive elimination algorithm (Mukherjee et al., 2021), is that these algorithms heavily rely on the additive property of suboptimality gaps for successful identification: when the biases $H_i = 0$, $\Delta_i = \Delta_j + \mu_j - \mu_i$. Clearly this is not true when $H_i \ne 0$. We next provide an improvement of the successive elimination algorithm that alleviates this shortcoming and achieves better sample complexity in terms of $\Delta_i$ for $\varepsilon \in (0,1]$ in the contamination setting.

For $i \in S$, let the elimination criterion for Algorithm 1 be: there exists $j \in S$ such that
$$\hat\mu_i(t_m) + G_{\varepsilon,\eta}\Big(\frac{\delta}{2K(m+1)^2}\Big)\,t_m^{-\frac{\varepsilon}{1+\varepsilon}} \;\le\; \hat\mu_j(t_m) - G_{\varepsilon,\eta}\Big(\frac{\delta}{2K(m+1)^2}\Big)\,t_m^{-\frac{\varepsilon}{1+\varepsilon}}, \quad (13)$$
where $m$ is the phase index.

Theorem 5.1 (Sample Complexity). Let $\varepsilon \in (0,1]$ and suppose Assumptions A1-A3 hold. With probability at least $1-\delta$, Algorithm 1 outputs $S = \{\star\}$ after pulling at most
$$\sum_{i \ne \star} O\Big(\Delta_i^{-\frac{1+\varepsilon}{\varepsilon}}\,\log\frac{K\log_\gamma(1/\Delta_i)}{\delta}\Big)$$
samples, where $\Delta_i$ is defined as in Eq. (12) using $\eta = \eta_{\text{init}}$ and $t_{\text{init}}$ as defined in Algorithm 1.
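Anticipating the phase analysis detailed next, the phase at which a sub-optimal arm leaves $S$ scales as $\log_\gamma(4/\Delta_i)$; the tiny helper below makes that arithmetic concrete (illustrative only, under the phase schedule of Algorithm 1).

```python
import math

# Phase by which arm i is eliminated (cf. the proof of Theorem 5.1):
# elimination occurs once gamma^m exceeds 4/Delta_i, i.e. m ~ log_gamma(4/Delta_i).
def elimination_phase(gap, gamma=1.01):
    return math.ceil(math.log(4 / gap, gamma))

for gap in (0.5, 0.1, 0.02):
    print(f"Delta_i={gap:4.2f}: eliminated by phase ~{elimination_phase(gap)}")
```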
The key intuition for the reduction in sample complexity is as follows. Various successive elimination algorithms were initially proposed in Even-Dar et al. (2006) and later modified for the regret minimization setting in Auer and Ortner (2010). There, the length of a phase is modulated by a parameter representing the unknown sub-optimality gaps ($\tilde\Delta$), and arm $i$ is eliminated when $\tilde\Delta < \Delta_i/2$. While this is sufficient for the purposes of regret minimization, it falls short in terms of reducing the number of samples for best arm identification. The elimination parameter $\gamma$, barely greater than 1, in Algorithm 1 balances this trade-off between elimination and sample complexity. We establish that once the phase index $m$ is such that
$$\Big(\gamma^m\,G_{\varepsilon,\eta}\big(\delta/2Km^2\big)\Big)^{\frac{1+\varepsilon}{\varepsilon}} \ge \Big\lceil\Big(\frac{4\,G_{\varepsilon,\eta}(\delta/4Km^2)}{\Delta_i}\Big)^{\frac{1+\varepsilon}{\varepsilon}}\Big\rceil,$$
then arm $i$ will be eliminated before phase $m$, whence we obtain $m = O(\log_\gamma(4/\Delta_i))$.

Algorithm 1: Adversarial Elimination with Catoni (AECat)
1: Input: $\delta, K, \sigma, \varepsilon, \upsilon_\varepsilon, \gamma\,(>1)$
2: Initialization: Set $S := [K]$, phase index $m = 0$
3: Set $t_0 = 0$ and $t_1 := t_{\text{init}} = \max\big\{\big(\gamma\,G_{\varepsilon,\eta}(\delta/2K)\big)^{\frac{1+\varepsilon}{\varepsilon}},\; T_\varepsilon(\sigma, \delta)\big\}$
4: while $|S| > 1$ do
5:   Increase the phase index $m$ by 1
6:   Sample every arm in $S$ for $\max\{t_m - t_{m-1}, 0\}$ times
7:   Compute $\hat\mu_i(t)$ with the appropriate $\alpha$
8:   Remove from $S$ all arms $i$ which satisfy Eq. (13)
9:   Update $S$ as the remaining arms and set $m = m + 1$
10:  Set $t_m = \big(\gamma^m\,G_{\varepsilon,\eta}(\delta/2K(m+1)^2)\big)^{\frac{1+\varepsilon}{\varepsilon}}$
11: end while
12: Output: $S$

In Algorithm 1, the Catoni's estimate is computed at the end of each phase. To simplify notation, let $T_\varepsilon(\sigma, \delta) = \sigma\log\frac{2}{\delta}$ denote the minimum number of samples required in Theorem 3.1 and Theorem 4.1, where
$$\sigma = \begin{cases} \dfrac{4}{1 - (\Omega+4)A\eta}, & \varepsilon = 1,\\[2mm] \dfrac{1 + h^{-\varepsilon}B^{1+\varepsilon}C_\varepsilon}{\frac{\tau^{1/\varepsilon}}{(1+\tau)^{(1+\varepsilon)/\varepsilon}}(1-h)C_\varepsilon^{-1/\varepsilon} - 2A_\varepsilon\eta\big(1 + h^{-\varepsilon}B^{1+\varepsilon}C_\varepsilon\big)}, & \varepsilon < 1.\end{cases}$$
A few parameters appearing in $T_\varepsilon(\sigma, \delta)$ are suppressed in the algorithm inputs for simplicity. The choices of the missing parameters are guided by the discussion in Sec. 4.1 for implementation purposes. The choice of $\alpha$ in Step 7 is further guided by Theorem 3.1 and Theorem 4.1 for the $\varepsilon = 1$ and $\varepsilon < 1$ cases, respectively.

There are two main drawbacks of the traditional successive elimination-type algorithm when using Catoni's estimator: (i) Unlike the proposed AECat method, which is phase-based, the mean estimation by root finding needs to be performed at every time step for all the non-eliminated arms. We found this extremely computationally inefficient in our experiments. (ii) The sample complexity is at best $O\big(\Delta_i^{-\frac{1+\varepsilon}{\varepsilon}}\log\frac{K}{\delta\Delta_i}\big)$, as opposed to $O\big(\Delta_i^{-\frac{1+\varepsilon}{\varepsilon}}\log\frac{K\log(1/\Delta_i)}{\delta}\big)$ using AECat, and this translates to better empirical performance. When $\Delta_i$ is sufficiently small, the successive elimination-type method exhibits very poor empirical performance.
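The compact Python sketch below mirrors the structure of Algorithm 1. The helpers `pull`, `catoni_estimate`, and `G` are hypothetical stand-ins for the bandit environment, the M-estimator of Eq. (4), and the confidence width $G_{\varepsilon,\eta}(\cdot)$ of Appendix C.3; it is a structural sketch under those assumptions, not the authors' implementation.

```python
import math

def aecat(pull, catoni_estimate, G, K, delta, eps, gamma=1.01, t_init=1):
    """Phase-based adversarial elimination (sketch of Algorithm 1).

    pull(i, n)          -- n (possibly corrupted) rewards of arm i; pull(i, 0) == [].
    catoni_estimate(xs) -- robust M-estimate of Eq. (4) on the samples xs.
    G(delta_m)          -- confidence width G_{eps,eta}(delta_m).
    All three are assumed, user-supplied callables.
    """
    S = set(range(K))
    samples = {i: [] for i in S}
    t_prev, m = 0, 0
    while len(S) > 1:
        m += 1
        t_m = max(t_init, math.ceil((gamma**m * G(delta / (2 * K * (m + 1) ** 2)))
                                    ** ((1 + eps) / eps)))
        for i in S:                               # top up each surviving arm to t_m pulls
            samples[i].extend(pull(i, t_m - t_prev))
        mu = {i: catoni_estimate(samples[i]) for i in S}
        width = G(delta / (2 * K * (m + 1) ** 2)) * t_m ** (-eps / (1 + eps))
        best_lcb = max(mu[j] - width for j in S)
        S = {i for i in S if mu[i] + width > best_lcb}   # elimination rule Eq. (13)
        t_prev = t_m
    return S.pop()
```

The phase schedule makes the per-arm pull counts grow geometrically, so the root-finding estimate is recomputed only $O(\log_\gamma(1/\Delta_i))$ times per arm, which is the source of the computational advantage noted above.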
[Figure 1. Average sample complexity (number of pulls, in units of 10^4, versus confidence δ ∈ [0.01, 0.04]) over 50 iterations for the two algorithms: SE-CBAI, as proposed in Mukherjee et al. (2021), and Algorithm 1, shortened as AECat. The true (uncontaminated) distribution is taken to be Gaussian, and the sensitivity with respect to the contamination distribution and the fraction of contamination is shown. As expected, the sample complexity decreases with lower contamination levels. True mean values are chosen as $\mu_i = 2(i/K)^{0.1}$ for $0 \le i < K$. The top six panels correspond to $\varepsilon = 1$. The last two panels correspond to best arm identification in heavy-tailed settings under contamination, and use $\varepsilon = 0.85$ with a bound $\upsilon_\varepsilon = 50$. The panels indicate how the number of arms affects identification as a function of δ.]

5.2. Experimental Results

In this section, we describe the experimental setup designed to evaluate the performance of the proposed algorithm against the existing baseline. Mukherjee et al. (2021) propose a successive elimination algorithm (SE-CBAI) for contaminated best arm identification in the sub-Gaussian setting, using a suitable trimmed mean estimator for robustness and confidence bounds adjusted to provide good sample complexity. Since SE-CBAI is proposed in the sub-Gaussian setting, we provide the performance comparison in this setting, while allowing the contamination distribution to be selected from common models of noise. As the implementation details of SE-CBAI and the technical details of another gap-based algorithm (G-CBAI) for the asymptotic setting ($\delta \to 0$) are not clear, we tuned the parameters to reflect the performance obtained in the paper and used those settings throughout for comparison.

The hyper-parameters $(\Omega, \tau, B, h, \gamma)$ in Algorithm 1 are tuned as follows. To compute Catoni's estimator Eq. (4) for $\varepsilon = 1$, we need to calculate $\alpha$, which depends on $\Omega \in (0, 1/(A\eta) - 4)$. A smaller $\Omega$ results in smaller initial exploration, while increasing the magnitude of $H(\eta)$ in Eq. (5), a quantity of interest in Assumption A2. We choose $\Omega = 0.25\,(1/(A\eta) - 4)$. The factor $\gamma > 1$ in Algorithm 1 that controls the exploration should be chosen barely greater than 1 for good performance, and we choose $\gamma = 1.01$. From Eq. (10), for $\varepsilon < 1$, we can choose $h \ge 0.5$ and a $\tau \in (0,2)$ that yields a valid but large $B$ (note that this affects the error bound in Eq. (10)). We take $h = 0.5$, $\tau = 1.2$ and use $B = 0.8$. While the algorithm has more inputs than a typical successive elimination algorithm, it should be noted that tuning is straightforward here, as we know the trade-offs. Results under different settings are summarized in Figure 1. Our method is uniformly better than SE-CBAI under various scenarios.

6. Generalization to Unknown η

Suppose the true contamination level $\eta_{\text{true}}$ is significantly smaller than the upper bound $\eta$; then the upper bounds given in the main theorems may be larger than desirable. We use the idea proposed in Jain et al. (2022) to deal with an unknown or loose $\eta$. Jain et al. (2022) do not address probabilistic statements, and we contribute to this literature. In this section, we only consider the finite variance case, i.e., $\varepsilon = 1$. The extension to the infinite variance case is easily derived using similar constructions.

Let $0 < \eta_{\min} < \eta < \frac{1}{4A}$, with the threshold $\eta_{\min}$ to be decided upon in the sequel. Choose a number $\theta \in (0,1)$ and consider the following sequence of contamination bounds:
$$\eta_k = \eta\theta^k, \quad k = 0, 1, \ldots, J, \quad (14)$$
where
$$J = \min\{k \ge 1 : \eta\theta^k \le \eta_{\min}\}. \quad (15)$$
Let $\delta_k$, $k = 0, 1, \ldots, J$, be confidence levels. For every $k$, we compute the estimator $\hat m(\eta_k, \delta_k)$ in the same way as in Theorem 3.1 to obtain a confidence interval
$$I_k = \big[\hat m(\eta_k, \delta_k) - B(\eta_k, \delta_k),\; \hat m(\eta_k, \delta_k) + B(\eta_k, \delta_k)\big], \quad (16)$$
where
$$B(\eta_k, \delta_k) = \upsilon^{1/2}\,\frac{\frac{\Omega+4}{2\Omega^{1/2}}A^{1/2}\eta_k^{1/2} + \big(2\log(2/\delta_k)/n\big)^{1/2}}{1 - (\Omega+4)A\eta_k/2 - 2\log(2/\delta_k)/n}.$$
Using the interval sequence $\{I_k\}$, we can construct a tighter estimator as follows. We first define the index $J_0$ as
$$J_0 := \max\big\{k = 0, \ldots, J : \cap_{j=0}^k I_j \ne \emptyset\big\}. \quad (17)$$
The desired estimator is defined as
$$\hat\mu := \hat m(\eta_{J_0}, \delta_{J_0}). \quad (18)$$
In particular, we can choose the confidence sequence $\{\delta_k\} = \{2^{-k}\delta\}$ for convenience, and we can set $\theta = 1/4$.
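The construction in Eqs. (14)-(18) is a Lepskii-style intersection of confidence intervals and is easy to implement given black-box access to the estimator and its width. The sketch below is generic: `estimate` and `width` are assumed stand-ins for $\hat m(\eta_k, \delta_k)$ and $B(\eta_k, \delta_k)$, not the paper's exact routines.

```python
def unknown_eta_estimate(estimate, width, eta_max, eta_min, delta, theta=0.25):
    """Adapt to an unknown corruption level by intersecting intervals (Section 6).

    estimate(eta_k, delta_k) -- robust M-estimate as in Theorem 3.1 [assumed callable]
    width(eta_k, delta_k)    -- confidence radius B(eta_k, delta_k) of Eq. (16) [assumed]
    """
    # Geometric grid eta_k = eta_max * theta^k down to eta_min (Eqs. (14)-(15)).
    etas, eta_k = [], eta_max
    while eta_k > eta_min:
        etas.append(eta_k)
        eta_k *= theta
    etas.append(eta_k)

    lo, hi, chosen = float("-inf"), float("inf"), None
    for k, e in enumerate(etas):                 # delta_k = delta * 2^{-k}
        d_k = delta * 2.0 ** (-k)
        m, b = estimate(e, d_k), width(e, d_k)
        new_lo, new_hi = max(lo, m - b), min(hi, m + b)
        if new_lo > new_hi:                      # intersection empty: J0 reached
            break
        lo, hi, chosen = new_lo, new_hi, m       # Eq. (18): last feasible estimate
    return chosen
```

The loop stops at the first empty intersection, so the returned value is exactly $\hat m(\eta_{J_0}, \delta_{J_0})$ for $J_0$ as in Eq. (17).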
In terms of $\eta_{\min}$, we can set
$$\eta_{\min} = \min\Big\{\frac{2\log(2/\delta)/n}{(\Omega+4)^2A/(4\Omega)},\; \eta\Big\},$$
which balances the corruption bias term against the statistical term. The choice of such an $\eta_{\min}$ has the following advantages.

1. If $\eta_{\text{true}} \ge \eta_{\min}$, we can guarantee that $\eta_{J_0} \le 4\eta_{\text{true}}$ with high probability. This implies an error of order $O(\eta_{\text{true}}^{1/2})$.

2. If $\eta_{\text{true}} < \eta_{\min}$, we know that $\hat\mu$ will have an error of order $O(\eta_{\min}^{1/2})$. Thanks to the choice of $\eta_{\min}$, the error will never exceed $O\big((\log(2/\delta)/n)^{1/2}\big)$.

The above arguments imply that $\hat\mu$ in Eq. (18) enjoys an error bound of $O\big(\max\{\eta_{\text{true}}^{1/2}, (\log(2/\delta)/n)^{1/2}\}\big)$. Formally, we have the following result.

Theorem 6.1. Let $\delta \in (0,1)$ be such that $\delta \ge 2e^{-n/4}$. Let $\{X_i\}_{i=1}^n$ be i.i.d. random variables with mean $\mu$ and $\mathbb{E}|X_1-\mu|^2 \le \upsilon$. Suppose the true corruption parameter $\eta_{\text{true}} \in [0, \frac{1}{4A})$. Let $\Omega$ satisfy
$$(\Omega+4)A\eta + \frac{4\log(2/\delta)}{n} + \frac{4\log 2}{n}\cdot\frac{\log\big((\Omega+4)^2A\eta/(4\Omega)\big) + \log n}{\log 4} \le 1. \quad (19)$$
Then the estimator defined in Eq. (18) satisfies
$$|\hat\mu - \mu| \le \upsilon^{1/2}\,\frac{\frac{\Omega+4}{\Omega^{1/2}}A^{1/2}\eta_{\text{true}}^{1/2} + \big(\frac{2J_0\log 2}{n}\big)^{1/2} + 2\big(\frac{2\log(2/\delta)}{n}\big)^{1/2}}{1 - (\Omega+4)A\eta_{\text{true}}/2 - 2J_0\log(2)/n - 2\log(2/\delta)/n} \quad (20)$$
with probability at least $1-2\delta$.

The error bound in Eq. (20) is free of the choice of $\eta$, which makes our theoretical results more appealing in practical problems.

7. Conclusion

We provided a minimax M-estimator based on influence functions inspired by Catoni (2012), which is known to be nearly optimal in the absence of contamination. In the adversarial contamination setting, the proposed M-estimator tolerates more corruption than the state-of-the-art estimators and achieves the minimax error rate in both the finite ($\varepsilon = 1$) and infinite variance ($\varepsilon < 1$) settings. We also explicitly characterized the maximum tolerance as a function of $\varepsilon$ for the proposed estimators. We then proposed a novel best arm identification algorithm in the contaminated setting that works in both bounded and heavy-tailed settings, and achieves better theoretical sample complexity and empirical performance than the state of the art. Finally, we extended the minimax estimation procedure to the challenging setting where a tight upper bound on the corruption level $\eta$ is unknown, greatly improving the applicability of the proposed minimax M-estimators. Currently, we require knowledge of $\varepsilon$ for obtaining the bounds. A recent work (Ashutosh et al., 2021) discusses ad hoc ways of choosing $\varepsilon$ that obtain a decent compromise. Extending these ideas to minimax M-estimation might be a worthwhile avenue for further exploration.

References

Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. 2019.

Jason Altschuler, Victor-Emmanuel Brunel, and Alan Malek. Best arm identification for contaminated bandits. J. Mach. Learn. Res., 20(91):1-39, 2019.

Kumar Ashutosh, Jayakrishnan Nair, Anmol Kagrecha, and Krishna P. Jagannathan. Bandit algorithms: Letting go of logarithmic regret for statistical robustness. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 622-630, Virtual Event, 2021.

Peter Auer and Nicolò Cesa-Bianchi. On-line learning with malicious noise and the closure algorithm. Annals of Mathematics and Artificial Intelligence, 23(1):83-99, 1998.

Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55-65, 2010.

Sujay Bhatt, Guanhua Fang, and Ping Li. Offline change detection under contamination. In Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI), Eindhoven, The Netherlands, 2022a.
Sujay Bhatt, Guanhua Fang, Ping Li, and Gennady Samorodnitsky. Nearly optimal Catoni's M-estimator for infinite variance. In Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, 2022b.

Christian Brownlees, Emilien Joly, and Gábor Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507-2536, 2015.

Sébastien Bubeck, Nicolò Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Trans. Inf. Theory, 59(11):7711-7717, 2013.

Valerii Vladimirovich Buldygin and Yu. V. Kozachenko. Metric Characterization of Random Variables and Random Processes, volume 188. American Mathematical Society, 2000.

Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l'IHP Probabilités et Statistiques, volume 48, pages 1148-1185, 2012.

Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 47-60, Montreal, Canada, 2017.

Mengjie Chen, Chao Gao, and Zhao Ren. A general decision theory for Huber's ϵ-contamination model. Electronic Journal of Statistics, 10(2):3752-3774, 2016.

Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under Huber's contamination model. The Annals of Statistics, 46(5):1932-1960, 2018.

Peng Chen, Xinghu Jin, Xiang Li, and Lihu Xu. A generalized Catoni's M-estimator under finite α-th moment assumption with α ∈ (1, 2). Electronic Journal of Statistics, 15(2):5523-5544, 2021a.

Sitan Chen, Frederic Koehler, Ankur Moitra, and Morris Yau. Online and distribution-free robustness: Regression and contextual bandits with Huber contamination. In Proceedings of the 62nd IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 684-695, Denver, CO, 2021b.

Yeshwanth Cherapanamjeri, Nilesh Tripuraneni, Peter L. Bartlett, and Michael I. Jordan. Optimal mean estimation without a variance. arXiv preprint arXiv:2011.12433, 2020.

Luc Devroye, Matthieu Lerasle, Gábor Lugosi, and Roberto I. Oliveira. Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695-2725, 2016.

Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 999-1008, Sydney, Australia, 2017.

Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1061-1073, Los Angeles, CA, 2018.

Anders Eklund, Thomas E. Nichols, and Hans Knutsson. Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences, 113(28):7900-7905, 2016.

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res., 7:1079-1105, 2006.

Samuel B. Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1021-1034, Los Angeles, CA, 2018.

Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics, 49(2):1055-1080, 2021.
Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73-101, 1964.

Peter J. Huber. Robust Statistics, volume 523. John Wiley & Sons, 2004.

Ayush Jain, Alon Orlitsky, and Vaishakh Ravindrakumar. Robust estimation algorithms don't need to know the corruption level. arXiv preprint arXiv:2202.05453, 2022.

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, volume 112. Springer, 2013.

Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. On finding the largest mean among many. arXiv preprint arXiv:1306.3917, 2013.

Zohar Shay Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 1238-1246, Atlanta, GA, 2013.

Pierre Laforgue, Guillaume Staerman, and Stéphan Clémençon. Generalization bounds in the presence of outliers: a median-of-means study. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 5937-5947, Virtual Event, 2021.

Kevin A. Lai, Anup B. Rao, and Santosh S. Vempala. Agnostic estimation of mean and covariance. In Proceedings of the IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665-674, New Brunswick, NJ, 2016.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Guillaume Lecué and Matthieu Lerasle. Robust machine learning by median-of-means: theory and practice. The Annals of Statistics, 48(2):906-931, 2020.

O. V. Lepskii. Asymptotically minimax adaptive estimation. I: Upper bounds. Optimally adaptive estimates. Theory of Probability & Its Applications, 36(4):682-697, 1992.

Ping Li. Stable random projections and conditional random sampling, two sampling techniques for modern massive datasets. PhD Thesis, 2007. URL https://hastie.su.domains/THESES/pingli_thesis.pdf.

Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Found. Comput. Math., 19(5):1145-1190, 2019a.

Gábor Lugosi and Shahar Mendelson. Sub-Gaussian estimators of the mean of a random vector. The Annals of Statistics, 47(2):783-794, 2019b.

Gábor Lugosi and Shahar Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean. The Annals of Statistics, 49(1):393-410, 2021.

Stanislav Minsker. Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. The Annals of Statistics, 46(6A):2871-2903, 2018.

Stanislav Minsker. Distributed statistical estimation and rates of convergence in normal approximation. Electronic Journal of Statistics, 13(2):5213-5252, 2019.

Arpan Mukherjee, Ali Tajer, Pin-Yu Chen, and Payel Das. Best arm identification in contaminated stochastic bandits. In Advances in Neural Information Processing Systems (NeurIPS), pages 9651-9662, virtual, 2021.

Roberto I. Oliveira and Paulo Orenstein. The sub-Gaussian property of trimmed means estimators. Technical report, IMPA, 2019.

Adarsh Prasad, Sivaraman Balakrishnan, and Pradeep Ravikumar. A robust univariate mean estimator is all you need. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 4034-4044, Online [Palermo, Sicily, Italy], 2020a.

Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):601-627, 2020b.

Qiang Sun, Wen-Xin Zhou, and Jianqing Fan. Adaptive Huber regression. Journal of the American Statistical Association, 115(529):254-265, 2020.

Yinglun Zhu, Sumeet Katariya, and Robert D. Nowak. Robust outlier arm identification. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 11566-11575, Virtual Event, 2020.

A. Proofs of Main Results

A.1. Proof of Theorem 3.1

For $\alpha > 0$, define two functions of $\theta \in \mathbb{R}$:
$$r(\theta) = \frac{1}{n\alpha}\sum_{i=1}^n \psi\big(\alpha(X_i - \theta)\big), \qquad \tilde r(\theta) = \frac{1}{n\alpha}\sum_{i=1}^n \psi\big(\alpha(\tilde X_i - \theta)\big). \quad (21)$$
Note that, by Eq. (1) and Eq. (3),
$$|\tilde r(\theta) - r(\theta)| \le 2A\eta/\alpha \quad (22)$$
for any $\theta \in \mathbb{R}$. As in Catoni (2012), for $\theta \in \mathbb{R}$, we define
$$B_+(\theta) = \mu - \theta + \frac{\alpha}{2}\big(\upsilon + (\mu-\theta)^2\big) + \frac{\log(2/\delta)}{n\alpha},$$
$$B_-(\theta) = \mu - \theta - \frac{\alpha}{2}\big(\upsilon + (\mu-\theta)^2\big) - \frac{\log(2/\delta)}{n\alpha}.$$
Let $\mathbb{P}$ denote a probability measure that encompasses the randomness in the observations $X_1, \ldots, X_n$ as well as any randomness that may be used by the adversary in corrupting the sample. It follows from Eq. (22) that for any $\theta$,
$$\mathbb{P}\big\{\tilde r(\theta) \ge B_+(\theta) + 2A\eta/\alpha\big\} \le \mathbb{P}\big\{r(\theta) \ge B_+(\theta)\big\} \le \delta/2, \quad (23)$$
$$\mathbb{P}\big\{\tilde r(\theta) \le B_-(\theta) - 2A\eta/\alpha\big\} \le \mathbb{P}\big\{r(\theta) \le B_-(\theta)\big\} \le \delta/2.$$
Let $\theta_+$ be the smallest solution of the equation
$$0 = B_+(\theta) + 2A\eta/\alpha = \mu - \theta + \frac{\alpha}{2}\big(\upsilon + (\mu-\theta)^2\big) + \frac{\log(2/\delta)}{n\alpha} + \frac{2A\eta}{\alpha}, \quad (24)$$
and let $\theta_-$ be the largest solution of the equation
$$0 = B_-(\theta) - 2A\eta/\alpha = \mu - \theta - \frac{\alpha}{2}\big(\upsilon + (\mu-\theta)^2\big) - \frac{\log(2/\delta)}{n\alpha} - \frac{2A\eta}{\alpha}, \quad (25)$$
provided, of course, that these solutions exist. Let us concentrate first on $\theta_+$. Denoting $x = \theta - \mu$, the equation Eq. (24) becomes
$$\frac{\alpha}{2}x^2 - x + \frac{\alpha}{2}\upsilon + \frac{\log(2/\delta)}{n\alpha} + \frac{2A\eta}{\alpha} = 0, \quad (26)$$
and, if real solutions exist, the smallest such solution is given by
$$x_+ = \frac{1 - \sqrt{1 - \alpha^2\upsilon - 4A\eta - 2\log(2/\delta)/n}}{\alpha}. \quad (27)$$
Of course, for a real solution to exist, the expression under the square root must be non-negative. This immediately implies that our approach may work only if the corruption level satisfies
$$\eta < 1/(4A). \quad (28)$$
With $A = \log 2$, the limit on the corruption level becomes $\eta < 0.36$. Assuming that Eq. (28) holds, a real solution to Eq. (26) exists whenever
$$\alpha^2\upsilon + 2\log(2/\delta)/n \le 1 - 4A\eta. \quad (29)$$
In that case $x_+$ in Eq. (27) is real, and
$$0 \le x_+ = \frac{\alpha\upsilon + 4A\eta/\alpha + 2\log(2/\delta)/(\alpha n)}{1 + \sqrt{1 - \alpha^2\upsilon - 4A\eta - 2\log(2/\delta)/n}} \le \frac{\alpha\upsilon/2 + 2A\eta/\alpha + \log(2/\delta)/(\alpha n)}{1 - \alpha^2\upsilon/2 - 2A\eta - \log(2/\delta)/n}.$$
Hence
$$\theta_+ \le \mu + \frac{\alpha\upsilon/2 + 2A\eta/\alpha + \log(2/\delta)/(\alpha n)}{1 - \alpha^2\upsilon/2 - 2A\eta - \log(2/\delta)/n}. \quad (30)$$
We now choose $\alpha$. Choose $\Omega$ so that $0 < \Omega < 1/(A\eta) - 4$, and set
$$\alpha = \sqrt{\frac{\Omega A\eta + 2\log(2/\delta)/n}{\upsilon}}.$$
Then Eq. (29) holds whenever
$$(\Omega+4)A\eta + 4\log(2/\delta)/n \le 1. \quad (31)$$
It follows from Eq. (30) that
$$\theta_+ \le \mu + \frac{(\upsilon^{1/2}/2)\,\big((\Omega+4)A\eta + 4\log(2/\delta)/n\big)\big/\sqrt{\Omega A\eta + 2\log(2/\delta)/n}}{1 - (\Omega+4)A\eta/2 - 2\log(2/\delta)/n} \quad (32)$$
$$\le \mu + \upsilon^{1/2}\,\frac{\frac{\Omega+4}{2\Omega^{1/2}}A^{1/2}\eta^{1/2} + \big(2\log(2/\delta)/n\big)^{1/2}}{1 - (\Omega+4)A\eta/2 - 2\log(2/\delta)/n}.$$
It follows by the monotonicity of the function $\psi$ that, with probability at least $1-\delta/2$,
$$\hat\mu - \mu \le \upsilon^{1/2}\,\frac{\frac{\Omega+4}{2\Omega^{1/2}}A^{1/2}\eta^{1/2} + \big(2\log(2/\delta)/n\big)^{1/2}}{1 - (\Omega+4)A\eta/2 - 2\log(2/\delta)/n}. \quad (33)$$
Performing the same analysis with $\theta_-$, we conclude that, with probability at least $1-\delta/2$,
$$\mu - \hat\mu \le \upsilon^{1/2}\,\frac{\frac{\Omega+4}{2\Omega^{1/2}}A^{1/2}\eta^{1/2} + \big(2\log(2/\delta)/n\big)^{1/2}}{1 - (\Omega+4)A\eta/2 - 2\log(2/\delta)/n}. \quad (34)$$
Combining Eq. (33) with Eq. (34), we conclude that, with probability at least $1-\delta$,
$$|\hat\mu - \mu| \le \upsilon^{1/2}\,\frac{\frac{\Omega+4}{2\Omega^{1/2}}A^{1/2}\eta^{1/2} + \big(2\log(2/\delta)/n\big)^{1/2}}{1 - (\Omega+4)A\eta/2 - 2\log(2/\delta)/n}. \quad (35)$$
The second part of the theorem follows by using $n \ge \frac{4}{A\eta}\log(2/\delta)$, so that $2\log(2/\delta)/n \le A\eta/2$. By routine algebraic manipulations, we obtain $|\hat\mu - \mu| < \tilde C\sqrt{\eta\upsilon}$, where
$$\tilde C := \frac{\frac{\Omega+4}{2\Omega^{1/2}}A^{1/2} + \sqrt{A/2}}{1 - (\Omega+4)A\eta/2 - A\eta/2}.$$

A.2. Proof of Corollary 3.2

The result follows from Eq. (32) in the proof of Theorem 3.1 and the definition of $\alpha$.

A.3. Proof of Theorem 4.1

The initial analysis is similar to that of Theorem 3.1. We use the following convexity upper bound.
For $a, b \ge 0$ and $0 < h < 1$,
$$(a+b)^{1+\varepsilon} = \Big(h\,\frac{a}{h} + (1-h)\,\frac{b}{1-h}\Big)^{1+\varepsilon} \le h\Big(\frac{a}{h}\Big)^{1+\varepsilon} + (1-h)\Big(\frac{b}{1-h}\Big)^{1+\varepsilon} = \frac{a^{1+\varepsilon}}{h^\varepsilon} + \frac{b^{1+\varepsilon}}{(1-h)^\varepsilon}. \quad (36)$$
Therefore, for any $0 < h < 1$,
$$\mathbb{E}|X_1 - \theta|^{1+\varepsilon} \le h^{-\varepsilon}\,\mathbb{E}|X_1-\mu|^{1+\varepsilon} + (1-h)^{-\varepsilon}\,|\mu-\theta|^{1+\varepsilon}. \quad (37)$$
For $0 < h < 1$, define
$$B_+(\theta) = (\mu-\theta) + h^{-\varepsilon}C_\varepsilon\alpha^\varepsilon\upsilon_\varepsilon + (1-h)^{-\varepsilon}C_\varepsilon\alpha^\varepsilon|\mu-\theta|^{1+\varepsilon} + \frac{\log(2/\delta)}{n\alpha},$$
$$B_-(\theta) = (\mu-\theta) - h^{-\varepsilon}C_\varepsilon\alpha^\varepsilon\upsilon_\varepsilon - (1-h)^{-\varepsilon}C_\varepsilon\alpha^\varepsilon|\mu-\theta|^{1+\varepsilon} - \frac{\log(2/\delta)}{n\alpha}. \quad (38)$$
Observe that the bounds in Eq. (23) still hold. Now let $\theta_+$ be the smallest solution of the equation
$$0 = B_+(\theta) + 2A_\varepsilon\eta/\alpha = \mu - \theta + h^{-\varepsilon}C_\varepsilon\alpha^\varepsilon\upsilon_\varepsilon + (1-h)^{-\varepsilon}C_\varepsilon\alpha^\varepsilon|\mu-\theta|^{1+\varepsilon} + \frac{\log(2/\delta)}{n\alpha} + \frac{2A_\varepsilon\eta}{\alpha}, \quad (39)$$
and let $\theta_-$ be the largest solution of the equation
$$0 = B_-(\theta) - 2A_\varepsilon\eta/\alpha = \mu - \theta - h^{-\varepsilon}C_\varepsilon\alpha^\varepsilon\upsilon_\varepsilon - (1-h)^{-\varepsilon}C_\varepsilon\alpha^\varepsilon|\mu-\theta|^{1+\varepsilon} - \frac{\log(2/\delta)}{n\alpha} - \frac{2A_\varepsilon\eta}{\alpha}, \quad (40)$$
assuming such solutions exist. Concentrating first on $\theta_+$, we note that, if the equation in Eq. (39) has real roots, they are larger than $\mu$. Setting $x = \theta - \mu \ge 0$, the equation Eq. (39) can be written in the form
$$Kx^{1+\varepsilon} - x + M = 0, \quad (41)$$
with
$$K = (1-h)^{-\varepsilon}\alpha^\varepsilon C_\varepsilon, \qquad M = h^{-\varepsilon}\alpha^\varepsilon C_\varepsilon\upsilon_\varepsilon + \frac{\log(2/\delta)}{n\alpha} + \frac{2A_\varepsilon\eta}{\alpha}.$$
As in the main body of the paper, we denote $D = K^{1/\varepsilon}M$ and $y = K^{1/\varepsilon}x$, so the equation Eq. (41) becomes
$$y^{1+\varepsilon} - y + D = 0, \quad y \ge 0.$$
Assuming that for some $\tau > 0$ the condition
$$D \le \frac{\tau^{1/\varepsilon}}{(1+\tau)^{(1+\varepsilon)/\varepsilon}} \quad (42)$$
holds, we know that
$$\theta_+ - \mu \le (1+\tau)M = (1+\tau)\Big(h^{-\varepsilon}\alpha^\varepsilon C_\varepsilon\upsilon_\varepsilon + \frac{\log(2/\delta)}{n\alpha} + \frac{2A_\varepsilon\eta}{\alpha}\Big), \quad (43)$$
and the condition Eq. (42) has the following explicit form:
$$h^{-\varepsilon}\alpha^{1+\varepsilon}C_\varepsilon\upsilon_\varepsilon + \frac{\log(2/\delta)}{n} + 2A_\varepsilon\eta \le \frac{\tau^{1/\varepsilon}}{(1+\tau)^{(1+\varepsilon)/\varepsilon}}(1-h)\,C_\varepsilon^{-1/\varepsilon}. \quad (44)$$
Note that this puts an upper bound on the contamination rate that our estimator can tolerate:
$$2A_\varepsilon\eta < \frac{\tau^{1/\varepsilon}}{(1+\tau)^{(1+\varepsilon)/\varepsilon}}\,C_\varepsilon^{-1/\varepsilon}.$$
The value $\tau = 1/\varepsilon$ maximizes the expression on the right-hand side, so, assuming that $A_\varepsilon$ is given by Eq. (68), the bound on the contamination rate is
$$\eta < \frac{\varepsilon}{(1+\varepsilon)^{(1+\varepsilon)/\varepsilon}\,2A_\varepsilon C_\varepsilon^{1/\varepsilon}} = \frac{K(\varepsilon)}{2\log\frac{1}{1-K(\varepsilon)}}, \quad (45)$$
where
$$K(\varepsilon) = \Big[(1+\varepsilon)^{\frac{1+\varepsilon}{2\varepsilon}}(1-\varepsilon)^{\frac{1-\varepsilon}{2\varepsilon}}\Big]^{-1}. \quad (46)$$
Assuming that the contamination rate $\eta$ satisfies Eq. (45), condition Eq. (44) holds for some choices of $\alpha$, $h$ and $\tau$, and for $n$ large enough. In that case, as above, we conclude that, with probability at least $1-\delta/2$,
$$\hat\mu_\varepsilon - \mu \le (1+\tau)\Big(h^{-\varepsilon}\alpha^\varepsilon C_\varepsilon\upsilon_\varepsilon + \frac{\log(2/\delta)}{n\alpha} + \frac{2A_\varepsilon\eta}{\alpha}\Big). \quad (47)$$
Performing the same analysis with $\theta_-$, we conclude that, with probability at least $1-\delta/2$,
$$\mu - \hat\mu_\varepsilon \le (1+\tau)\Big(h^{-\varepsilon}\alpha^\varepsilon C_\varepsilon\upsilon_\varepsilon + \frac{\log(2/\delta)}{n\alpha} + \frac{2A_\varepsilon\eta}{\alpha}\Big). \quad (48)$$
It follows from Eq. (47) and Eq. (48) that, with probability at least $1-\delta$,
$$|\hat\mu_\varepsilon - \mu| \le (1+\tau)\Big(h^{-\varepsilon}\alpha^\varepsilon C_\varepsilon\upsilon_\varepsilon + \frac{\log(2/\delta)}{n\alpha} + \frac{2A_\varepsilon\eta}{\alpha}\Big) \quad (49)$$
whenever Eq. (44) holds. Next, we address the choice of $\alpha$, $\tau$, $h$. For $B > 0$ we can choose
$$\alpha = B\,\upsilon_\varepsilon^{-1/(1+\varepsilon)}\Big(2A_\varepsilon\eta + \frac{\log(2/\delta)}{n}\Big)^{1/(1+\varepsilon)}. \quad (50)$$
Then the bound Eq. (49) becomes
$$|\hat\mu_\varepsilon - \mu| \le (1+\tau)\,\upsilon_\varepsilon^{1/(1+\varepsilon)}\Big(2A_\varepsilon\eta + \frac{\log(2/\delta)}{n}\Big)^{\varepsilon/(1+\varepsilon)}\big(h^{-\varepsilon}B^\varepsilon C_\varepsilon + 1/B\big), \quad (51)$$
while the constraint Eq. (44) takes the form
$$\big(1 + h^{-\varepsilon}B^{1+\varepsilon}C_\varepsilon\big)\Big(2A_\varepsilon\eta + \frac{\log(2/\delta)}{n}\Big) \le \frac{\tau^{1/\varepsilon}}{(1+\tau)^{(1+\varepsilon)/\varepsilon}}(1-h)\,C_\varepsilon^{-1/\varepsilon}. \quad (52)$$

A.4. Proof of Corollary 4.3

The proof follows from Eq. (45) in the proof of Theorem 4.1.

A.5. Proof of Theorem 4.4

Arguing as in Lugosi and Mendelson (2021), the lower bound on the error one can get with probability at least $1-\delta$ is
$$\mathrm{const}\cdot\max\Big\{\mathbb{E}\big[|X-\mu-Q_{\eta/2}|\mathbb{1}\{X-\mu \ge Q_{\eta/2}\}\big],\; \mathbb{E}\big[|X-\mu-Q_{1-\eta/2}|\mathbb{1}\{X-\mu \le Q_{1-\eta/2}\}\big],\; \upsilon_\varepsilon^{1/(1+\varepsilon)}\big(\log(2/\delta)/n\big)^{\varepsilon/(1+\varepsilon)}\Big\}, \quad (53)$$
where for $0 < p < 1$, $Q_p$ is a $p$-th quantile of the distribution of $X - \mu$. Therefore, we only need to show that there is a constant $c$ and a random variable $X$ with mean $\mu$ and $\mathbb{E}|X-\mu|^{1+\varepsilon} \le \upsilon_\varepsilon$ such that the maximum of the first two terms on the right-hand side of Eq. (53) is at least $c\,\upsilon_\varepsilon^{1/(1+\varepsilon)}\eta^{\varepsilon/(1+\varepsilon)}$. We can simply choose $\mu = 0$ and set
$$X = \begin{cases} 0 & \text{with probability } 1-\eta,\\ \pm\upsilon_\varepsilon^{1/(1+\varepsilon)}\eta^{-1/(1+\varepsilon)}/2 & \text{with probability } \eta/4 \text{ each},\\ \pm\upsilon_\varepsilon^{1/(1+\varepsilon)}\eta^{-1/(1+\varepsilon)} & \text{with probability } \eta/4 \text{ each}.\end{cases}$$
Then $\mathbb{E}X = 0$, $\mathbb{E}|X|^{1+\varepsilon} \le \upsilon_\varepsilon$, and the quantile $Q_{\eta/2} = \upsilon_\varepsilon^{1/(1+\varepsilon)}\eta^{-1/(1+\varepsilon)}/2$. Therefore,
$$\mathbb{E}\big[|X - Q_{\eta/2}|\mathbb{1}\{X \ge Q_{\eta/2}\}\big] = \frac{1}{8}\,\upsilon_\varepsilon^{1/(1+\varepsilon)}\eta^{\varepsilon/(1+\varepsilon)},$$
as required.
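The three-point construction above can be checked numerically; the snippet below verifies the mean, the $(1+\varepsilon)$-th moment bound, and the tail-expectation identity for one parameter setting (a verification aid only).

```python
import math

# Numerical check of the lower-bound construction in the proof of Theorem 4.4.
eps, eta, v = 0.6, 0.05, 1.0
q = v ** (1 / (1 + eps)) * eta ** (-1 / (1 + eps)) / 2   # Q_{eta/2}
support = [(0.0, 1 - eta), (q, eta / 4), (-q, eta / 4),
           (2 * q, eta / 4), (-2 * q, eta / 4)]

mean = sum(x * p for x, p in support)
moment = sum(abs(x) ** (1 + eps) * p for x, p in support)
tail = sum(abs(x - q) * p for x, p in support if x >= q)
predicted = v ** (1 / (1 + eps)) * eta ** (eps / (1 + eps)) / 8

print(f"EX = {mean:.2e} (should be 0)")
print(f"E|X|^(1+eps) = {moment:.4f} (should be <= v = {v})")
print(f"tail term = {tail:.6f}, predicted = {predicted:.6f}")
```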
A.6. Proof of Theorem 5.1

Rewriting the error bound of Theorem 4.1 as in Eq. (11), with probability at least $1 - \frac{\delta}{Km^2}$ we have each of the following events:
$$\hat\mu_i(t) \le \mu(i) + H_i(\eta) + G_{\varepsilon,\eta}\Big(\frac{\delta}{2Km^2}\Big)\,t^{-\frac{\varepsilon}{1+\varepsilon}}, \quad (54)$$
and
$$\hat\mu_\star(t) \ge \mu(\star) - H_\star(\eta) - G_{\varepsilon,\eta}\Big(\frac{\delta}{2Km^2}\Big)\,t^{-\frac{\varepsilon}{1+\varepsilon}}. \quad (55)$$
We first establish that if the number of pulls of a sub-optimal arm $i$ is larger than
$$n_{mi} := \Big\lceil\Big(\frac{4\,G_{\varepsilon,\eta}(\delta/2Km^2)}{\Delta_i}\Big)^{\frac{1+\varepsilon}{\varepsilon}}\Big\rceil$$
at phase $m$ ($m$ will be determined later, i.e., in Eq. (56)), then arm $i$ will be eliminated. Essentially, this means that
$$\Delta_i \ge 4\,G_{\varepsilon,\eta}\Big(\frac{\delta}{2Km^2}\Big)\,n_{mi}^{-\frac{\varepsilon}{1+\varepsilon}}.$$
Therefore, writing $G$ for $G_{\varepsilon,\eta}(\delta/2Km^2)$, by (54) and (55) we have
$$\hat\mu_i(n_{mi}) + G\,n_{mi}^{-\frac{\varepsilon}{1+\varepsilon}} + H_i(\eta) \le \mu(i) + 2G\,n_{mi}^{-\frac{\varepsilon}{1+\varepsilon}} + 2H_i(\eta) \le \mu(i) + \Delta_i - 2G\,n_{mi}^{-\frac{\varepsilon}{1+\varepsilon}} + 2H_i(\eta)$$
$$= \mu(\star) - H_\star(\eta) + H_i(\eta) - 2G\,n_{mi}^{-\frac{\varepsilon}{1+\varepsilon}} \le \hat\mu_\star(n_{mi}) - G\,n_{mi}^{-\frac{\varepsilon}{1+\varepsilon}} + H_i(\eta),$$
so the criterion Eq. (13) triggers (the $H_i(\eta)$ terms on the two extremes cancel). Note that the optimal arm will not be eliminated while (54) and (55) hold: the relation
$$\hat\mu_i(n_{mi}) - G\,n_{mi}^{-\frac{\varepsilon}{1+\varepsilon}} > \hat\mu_\star(n_{mi}) + G\,n_{mi}^{-\frac{\varepsilon}{1+\varepsilon}}$$
required to eliminate the best arm leads to $\mu(i) + H_i(\eta) > \mu(\star) - H_\star(\eta)$ for a sub-optimal arm $i$, which violates Assumption A2 of the PIBAI framework, Eq. (12).

This implies that once the phase index reaches
$$m_i^0 := \min\Big\{m : \big(\gamma^m\,G_{\varepsilon,\eta}(\delta/2Km^2)\big)^{\frac{1+\varepsilon}{\varepsilon}} > n_{mi}\Big\}, \quad (56)$$
arm $i$ will be eliminated before phase $m_i^0$ with probability at least $1 - \sum_{m=2}^{m_i^0}\frac{\delta}{Km^2} \ge 1 - \frac{\delta}{K}$. Solving Eq. (56), we need
$$\big(\gamma^m\,G_{\varepsilon,\eta}(\delta/2Km^2)\big)^{\frac{1+\varepsilon}{\varepsilon}} > \Big\lceil\Big(\frac{4\,G_{\varepsilon,\eta}(\delta/2Km^2)}{\Delta_i}\Big)^{\frac{1+\varepsilon}{\varepsilon}}\Big\rceil,$$
and it suffices to have $m > \log_\gamma(4/\Delta_i)$. We then know that $m_i^0$ is bounded by $\lceil\log_\gamma(4/\Delta_i)\rceil$. Therefore, the number of pulls of arm $i$ is bounded by
$$\big(\gamma^{m_i^0}\,G_{\varepsilon,\eta}(\delta/2K(m_i^0)^2)\big)^{\frac{1+\varepsilon}{\varepsilon}} = O\Big(\Delta_i^{-\frac{1+\varepsilon}{\varepsilon}}\,\big(G_{\varepsilon,\eta}(\delta/2K(m_i^0)^2)\big)^{\frac{1+\varepsilon}{\varepsilon}}\Big) = O\Big(\Delta_i^{-\frac{1+\varepsilon}{\varepsilon}}\log\frac{K\log_\gamma(4/\Delta_i)}{\delta}\Big).$$
The result follows by substituting the expression for $G_{\varepsilon,\eta}(\cdot)$.

A.7. Proof of Theorem 6.1

According to Theorem 3.1, for any $\Omega$ satisfying $0 < \Omega < 1/(A\eta) - 4$ and $(\Omega+4)A\eta + 4\log(2/\delta)/n \le 1$, the estimator $\hat m(\eta, \delta)$ produced by our algorithm satisfies
$$|\hat m(\eta, \delta) - \mu| \le \upsilon^{1/2}\,\frac{\frac{\Omega+4}{2\Omega^{1/2}}A^{1/2}\eta^{1/2} + \big(2\log(2/\delta)/n\big)^{1/2}}{1 - (\Omega+4)A\eta/2 - 2\log(2/\delta)/n} \quad (58)$$
with probability at least $1-\delta$. By construction, we know that
$$\mathbb{P}(\mu \in I_k) \ge 1 - \delta_k \quad (59)$$
for any $k$ such that $\eta_{\text{true}} \le \eta_k$. Thus,
$$\mathbb{P}\big(\mu \in I_k \text{ for every } k \text{ such that } \eta_{\text{true}} \le \eta_k\big) \ge 1 - \sum_{k=0}^{J}\delta_k. \quad (60)$$
By the definition of $J_0$, on this event $J_0 \ge \max\{k : \eta_{\text{true}} \le \eta_k\}$, and, by the choice $\delta_k = 2^{-k}\delta$, we have $\sum_{k=0}^{J}\delta_k \le 2\delta$. (61) Consequently, (61) implies
$$|\hat m - \mu| \le \upsilon^{1/2}\,\frac{\frac{\Omega+4}{2\Omega^{1/2}}A^{1/2}\eta_{J_0}^{1/2} + \big(2J_0\log 2/n\big)^{1/2} + \big(2\log(2/\delta)/n\big)^{1/2}}{1 - (\Omega+4)A\eta/2 - 2J_0\log 2/n - 2\log(2/\delta)/n} \quad (62)$$
with probability at least $1-2\delta$, where we used $\log(2/\delta_{J_0}) = J_0\log 2 + \log(2/\delta)$ and $\sqrt{a+b} \le \sqrt{a}+\sqrt{b}$. If $\eta_{\text{true}} \ge \eta_{\min}$, then on the above event of probability at least $1-2\delta$ we have $\eta_{J_0} \le 4\eta_{\text{true}}$, so that
$$|\hat\mu - \mu| \le \upsilon^{1/2}\,\frac{\frac{2(\Omega+4)}{2\Omega^{1/2}}A^{1/2}\eta_{\text{true}}^{1/2} + \big(2J_0\log 2/n\big)^{1/2} + \big(2\log(2/\delta)/n\big)^{1/2}}{1 - (\Omega+4)A\eta/2 - 2J_0\log 2/n - 2\log(2/\delta)/n}.$$
On the other hand, if $\eta_{\text{true}} < \eta_{\min}$, then on the same event
$$|\hat\mu - \mu| \le \upsilon^{1/2}\,\frac{\frac{\Omega+4}{2\Omega^{1/2}}A^{1/2}\eta_{\min}^{1/2} + \big(2J_0\log 2/n\big)^{1/2} + \big(2\log(2/\delta)/n\big)^{1/2}}{1 - (\Omega+4)A\eta/2 - 2J_0\log 2/n - 2\log(2/\delta)/n} \le \upsilon^{1/2}\,\frac{\big(2J_0\log 2/n\big)^{1/2} + 2\big(2\log(2/\delta)/n\big)^{1/2}}{1 - (\Omega+4)A\eta/2 - 2J_0\log 2/n - 2\log(2/\delta)/n},$$
by the choice of $\eta_{\min}$. Therefore, in either case,
$$|\hat\mu - \mu| \le \upsilon^{1/2}\,\frac{\frac{2(\Omega+4)}{2\Omega^{1/2}}A^{1/2}\eta_{\text{true}}^{1/2} + \big(2J_0\log 2/n\big)^{1/2} + 2\big(2\log(2/\delta)/n\big)^{1/2}}{1 - (\Omega+4)A\eta/2 - 2J_0\log 2/n - 2\log(2/\delta)/n} \quad (63)$$
with probability at least $1-2\delta$. This, of course, holds assuming that Eq. (31) holds for every $k = 0, 1, \ldots, J_0$. Since $J_0 \le J$, this will hold if
$$(\Omega+4)A\eta + \frac{4\log(2/\delta)}{n} + \frac{4\log 2}{n}\cdot\frac{\log\big((\Omega+4)^2A\eta/(4\Omega)\big) + \log n}{\log 4} \le 1. \quad (64)$$
B. ϵ-Huber Contamination Model

There is a related contamination model considered in Lai et al. (2016) and Prasad et al. (2020a), known as the ϵ-Huber contamination model (Huber, 2004):
$$\tilde P = (1-\epsilon)P + \epsilon Q. \quad (65)$$
Here $\{\tilde X_i\}_{i\le n}$ are drawn i.i.d. from the mixture model $\tilde P$, with the uncontaminated distribution $P$ and an arbitrary contamination distribution $Q$ selected based on a Bernoulli($\epsilon$) flip, possibly in an online fashion. Proposition B.1 provides a high-confidence bound on the empirical fraction $\hat\epsilon_n = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{X_i \ne \tilde X_i\}$.

Proposition B.1. Let $\hat\epsilon_n$ denote the empirical fraction of observations drawn from $Q$ up to time $n$. With probability at least $1-\beta$, for all $n \ge 1$,
$$\hat\epsilon_n \le \epsilon + \underbrace{1.7\,\sqrt{\frac{\log\log(2n) + 0.72\log(10.4/\beta)}{n}}}_{:=f(\beta,\epsilon,n)}.$$

Proof of Proposition B.1. In the empirical fraction $\hat\epsilon_n = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{X_i \ne \tilde X_i\}$, the indicator random variable $\mathbb{1}\{X_i \ne \tilde X_i\}$ has a sub-Gaussian distribution. The result follows using Howard et al. (2021, Theorem 1).

The transition from Eq. (1) to Eq. (65) is as follows: fix $n_0 > 1$ and set
$$\epsilon + f(\beta, \epsilon, n_0) = \eta. \quad (66)$$
Clearly, for all $n \ge n_0$, the corruption fraction $\hat\epsilon_n$ is at most $\eta$ with very high probability.

C. Detailed Expressions for Suppressed Notation

C.1. Influence function when ε = 1

As explained in Catoni (2012), the narrowest possible $\psi$ that satisfies Eq. (2) has $A = \log 2$ and is given by
$$\psi(x) = \begin{cases} -\log(1 - x + x^2/2), & 0 \le x \le 1,\\ \log 2, & x \ge 1,\\ -\psi(-x), & x \le 0.\end{cases} \quad (67)$$

C.2. Influence function when ε < 1

Influence Function: Note that the function $-\log(1 - x + C_\varepsilon x^{1+\varepsilon})$, $x \ge 0$, achieves its maximum at the point $\big((1+\varepsilon)C_\varepsilon\big)^{-1/\varepsilon}$, and its maximal value is
$$-\log\Big(1 - \frac{\varepsilon}{1+\varepsilon}\big((1+\varepsilon)C_\varepsilon\big)^{-1/\varepsilon}\Big).$$
Similarly, the function $\log(1 + x + C_\varepsilon|x|^{1+\varepsilon})$, $x \le 0$, achieves its minimum at the point $-\big((1+\varepsilon)C_\varepsilon\big)^{-1/\varepsilon}$, and its minimal value is
$$\log\Big(1 - \frac{\varepsilon}{1+\varepsilon}\big((1+\varepsilon)C_\varepsilon\big)^{-1/\varepsilon}\Big).$$
Therefore, we can choose a specific bounded function $\psi(x)$ satisfying
$$|\psi(x)| \le A_\varepsilon := -\log\Big(1 - \frac{\varepsilon}{1+\varepsilon}\big((1+\varepsilon)C_\varepsilon\big)^{-1/\varepsilon}\Big). \quad (68)$$
Note that $A_1 = \log 2$, and $A_\varepsilon \to \infty$ as $\varepsilon \to 0$. We assume, therefore, that the function $\psi$ is bounded by some $A_\varepsilon$, and we know that $A_\varepsilon$ can be chosen as in Eq. (68). A bounded influence function $\psi(x)$ satisfying Eq. (68) is given as follows:
$$\psi(x) = \begin{cases} -A_\varepsilon, & x \le -\big((1+\varepsilon)C_\varepsilon\big)^{-1/\varepsilon},\\ \log\big(1 + x + C_\varepsilon|x|^{1+\varepsilon}\big), & -\big((1+\varepsilon)C_\varepsilon\big)^{-1/\varepsilon} \le x \le 0,\\ -\log\big(1 - x + C_\varepsilon x^{1+\varepsilon}\big), & 0 \le x \le \big((1+\varepsilon)C_\varepsilon\big)^{-1/\varepsilon},\\ A_\varepsilon, & x \ge \big((1+\varepsilon)C_\varepsilon\big)^{-1/\varepsilon}.\end{cases} \quad (69)$$

C.3. Compact error bounds

To simplify notation, we rewrote Eq. (5) and Eq. (10) in the form of Eq. (11). When $\varepsilon = 1$, for any arm $i \in [K]$ pulled $t$ times,
$$|\hat\mu_i(t) - \mu_i| \le H_i(\eta) + G_{\varepsilon,\eta}(\delta)\,t^{-1/2},$$
where
$$H_i(\eta) := \frac{\upsilon^{1/2}(\Omega+4)}{2}\sqrt{\frac{\eta A}{\Omega}}\Big/\Big(1 - \frac{(\Omega+4)A\eta}{2} - \frac{2\log(2/\delta)}{n}\Big), \qquad G_{\varepsilon,\eta}(\delta) := \frac{\upsilon^{1/2}\sqrt{2\log(2/\delta)}}{1 - (\Omega+4)A\eta/2 - 2\log(2/\delta)/n},$$
with $n$ the number of pulls of arm $i$, so that $G_{\varepsilon,\eta}(\delta)\,t^{-1/2}$ recovers the deviation term of Eq. (5) with $n = t$. Similarly, when $\varepsilon < 1$, using the concavity (subadditivity) of $x \mapsto x^{\varepsilon/(1+\varepsilon)}$ to split the two terms of Eq. (10), we can take
$$H_i(\eta) := (1+\tau)\,\upsilon_\varepsilon^{1/(1+\varepsilon)}\big(h^{-\varepsilon}B^\varepsilon C_\varepsilon + 1/B\big)\big(2A_\varepsilon\eta\big)^{\frac{\varepsilon}{1+\varepsilon}}, \qquad G_{\varepsilon,\eta}(\delta) := (1+\tau)\,\upsilon_\varepsilon^{1/(1+\varepsilon)}\big(h^{-\varepsilon}B^\varepsilon C_\varepsilon + 1/B\big)\big(\log(2/\delta)\big)^{\frac{\varepsilon}{1+\varepsilon}}.$$