# Minimax-Optimal Location Estimation

Shivam Gupta (The University of Texas at Austin, shivamgupta@utexas.edu) · Jasper C.H. Lee (University of Wisconsin–Madison, jasper.lee@wisc.edu) · Eric Price (The University of Texas at Austin, ecprice@cs.utexas.edu) · Paul Valiant (Purdue University, pvaliant@gmail.com)

Location estimation is one of the most basic questions in parametric statistics. Suppose we have a known distribution density f, and we get n i.i.d. samples from f(x − µ) for some unknown shift µ. The task is to estimate µ to high accuracy with high probability. The maximum likelihood estimator (MLE) is known to be asymptotically optimal as n → ∞, but what is possible for finite n? In this paper, we give two location estimators that are optimal under different criteria: 1) an estimator that has minimax-optimal estimation error subject to succeeding with probability 1 − δ, and 2) a confidence interval estimator which, subject to its output interval containing µ with probability at least 1 − δ, has the minimum expected squared interval width among all shift-invariant estimators. The latter construction can be generalized to minimizing the expectation of any loss function on the interval width.

## 1 Introduction

We revisit one of the most basic questions in parametric statistics: location estimation. Suppose there is a shift-invariant parametric model {f^(θ)}_{θ∈ℝ}, namely that f^(θ)(x) = f(x − θ) for some distribution f, and we get n i.i.d. samples from f^(µ) for some unknown parameter µ. The task is to estimate µ as accurately as possible, succeeding with probability at least 1 − δ. This problem includes Gaussian mean estimation as a special case. Compared to general mean estimation, where we know nothing about the distribution beyond mild moment assumptions, here we are given the shape of the distribution, and only the shift is unknown.
This allows us to aim for better performance than in mean estimation, and even to handle cases where the mean of the distribution does not exist. The problem has been widely studied as a special case of parametric estimation. In particular, the classic asymptotic theory [Vaa00] recommends using the maximum likelihood estimator (MLE), which, when we fix a distribution f = f^(0) and take the number of samples n to ∞, satisfies µ̂_MLE − µ → N(0, 1/(nI)) for any parameter µ. Here, the variance of the asymptotic Gaussian is 1/(nI), where I is the Fisher information of the distribution f, defined by I = ∫_ℝ (f′(x))²/f(x) dx. It is a standard fact that 1/I ≤ σ² for any variance-σ² distribution, meaning that the asymptotic performance of the MLE is always at least as good as that of the sample mean. Conversely, the well-known Cramér–Rao theorem states that, for any shift-equivariant estimator µ̂, its variance must be at least 1/(nI) as well.

However, the classic asymptotic guarantee can be misleading, since the convergence to the asymptotic Gaussian may be arbitrarily slow in the number of samples, depending on the underlying distribution f. Indeed, recent work by Gupta et al. [GLPV22] showed that the issue is in fact information-theoretic: no estimator can converge to the Fisher information rate in a distribution-independent manner as n → ∞. They [GLPV22; GLP23] also studied the finite-sample algorithmic theory of location estimation, with the goal of yielding finite-sample estimation error tight to within a 1 + o(1) multiplicative factor.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Figure 1: Example comparing our algorithms. (a) The input distribution has a rare but very informative spike. (b) The minimax interval estimator from 10 samples finds the smallest possible worst-case interval, but at the cost of much worse typical interval size.
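As a quick numerical illustration (ours, not from the paper), the Fisher information I = ∫ f′(x)²/f(x) dx can be estimated on a grid. The sketch below checks 1/I ≤ σ² for the Gaussian and Laplace densities, and previews the smoothing sensitivity discussed next: a near-Dirac spike has enormous I, but after convolution with N(0, r²) its Fisher information collapses to about 1/r². The grid bounds, resolution, and cutoff are arbitrary choices of ours.

```python
import numpy as np

def fisher_information(pdf, lo=-20.0, hi=20.0, n_grid=400001):
    """Estimate I = int f'(x)^2 / f(x) dx by finite differences on a uniform grid."""
    x = np.linspace(lo, hi, n_grid)
    dx = x[1] - x[0]
    fx = pdf(x)
    dfx = np.gradient(fx, dx)   # finite-difference approximation of f'
    mask = fx > 1e-12           # avoid dividing by ~0 in the far tails
    return float(np.sum(dfx[mask] ** 2 / fx[mask]) * dx)

def gauss_pdf(x, sigma=1.0):
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Gaussian, sigma = 1: I = 1, so 1/I = sigma^2 (equality in 1/I <= sigma^2).
print(fisher_information(gauss_pdf))                             # ~ 1.0

# Laplace, variance 2: I = 1, so 1/I = 1 < sigma^2 = 2 (the MLE beats the sample mean).
print(fisher_information(lambda x: 0.5 * np.exp(-np.abs(x))))    # ~ 1.0

# Near-Dirac "spike" (Gaussian with sigma = 1e-3): I = 1e6.  Smoothing by
# N(0, r^2) with r = 0.1 turns it into N(0, 1e-6 + r^2), with I ~ 1/r^2.
print(fisher_information(lambda x: gauss_pdf(x, 1e-3)))          # ~ 1e6
print(fisher_information(lambda x: gauss_pdf(x, np.hypot(1e-3, 0.1))))  # ~ 100
```

For Gaussians the convolution is computed analytically (variances add), which keeps the sketch short; for a general f one would convolve numerically before integrating.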
To circumvent the impossibility of converging to the Fisher information in a distribution-independent finite-sample manner, they instead proposed a theory based on the smoothed Fisher information I_r, that is, the Fisher information of the distribution f after convolving with N(0, r²). They showed that applying the MLE to perturbed samples and the smoothed distribution yields an estimation error that converges to N(0, 1/(nI_r)) in a distribution-independent fashion, for some smoothing radius r that vanishes as n → ∞.

The key issue with the above theory, however, is that the Fisher information can be extremely sensitive to smoothing. As a simple example, a Dirac-δ spike has infinite Fisher information, yet smoothing it by N(0, r²) reduces its Fisher information to 1/r². Thus, to get 1 + o(1) tightness in estimation error, one would also need to completely optimize away the smoothing. This is highly challenging, and not yet achieved by these prior works on a distribution-by-distribution basis.

In this work, we instead directly construct minimax-optimal location estimators, under two different criteria. The first construction is a point estimator µ̂ which, for a given failure probability δ*, attains the minimum estimation error ϵ such that |µ̂ − µ| ≤ ϵ with probability at least 1 − δ*. The second construction is a confidence interval estimator [µℓ, µr]. Consider a loss function L(µr − µℓ) on the size of the output interval that is monotonically increasing. The second estimator is one such that, subject to the correctness constraint that the true parameter µ lies in the output interval [µℓ, µr] with probability at least 1 − δ*, the estimator minimizes the mean loss.

The distinction is whether we want to minimize the worst-case or the average confidence interval. For example, consider the distribution (1 − 1/n)N(θ, 1) + (1/n)N(θ + n, r²) for r = 1/n² (see Figure 1).
There is about a 1/e chance that you don't sample the narrow spike, in which case your uncertainty is effectively the Gaussian N(0, 1/n); but if you do sample the spike k > 0 times, it gives a much more precise N(0, r²/k) uncertainty. That is, some observed samples x⃗ could give very precise estimates, while other observed x⃗ cannot. The second algorithm gives worse worst-case behavior in exchange for much better average-case behavior.

Our first algorithm only cares about the maximum-size confidence interval it returns. That will be dominated by the 1/e chance that the narrow spike isn't sampled; so to get a 95% confidence interval, it will always return essentially a (1 − e/20)-confidence interval for the N(0, 1/n) error, which is [−1.49/√n, 1.49/√n]. Our second algorithm could instead reason: by returning the slightly larger interval [−1.50/√n, 1.50/√n] in the cases where the narrow spike is not sampled, it can return much smaller Θ(r)-size intervals when the spike is seen. It will optimize this tradeoff according to the given loss function; for square loss, this will be with Θ(r log n) in the (more common!) good case, and a 1 + 1/poly(n) factor worse than the optimal interval in the bad case.

### 1.1 Our results and techniques

**Minimax-optimal point estimation.** In the first setting we consider in this paper: given a point estimator µ̂ and a failure probability δ*, we measure its estimation error ϵ_{µ̂,δ*} by its worst-case error over parameters µ, namely

ϵ_{µ̂,δ*} = inf{ϵ > 0 : Pr_{x1,…,xn∼f^(µ)}(|µ̂(x1, …, xn) − µ| ≤ ϵ) ≥ 1 − δ* for all µ ∈ ℝ}.

Our first main contribution is that the estimator defined in Algorithm 2 achieves (essentially) minimal estimation error.

Theorem 1.1. For any small κ > 0, the estimator defined in Algorithm 2 has estimation error within a (1 + κ) factor of minimax optimal.

In order to construct this estimator, we solve the dual problem: fixing an estimation accuracy ϵ, find an estimator that minimizes the failure probability δ.
The optimal estimator for this dual problem can be defined as the limit of a family of Bayes-optimal estimators. Specifically, fixing estimation error ϵ, the estimator A_t defined in Algorithm 1 attains the minimum failure probability when the true parameter µ is drawn from the prior Unif[−t, t]. The following theorem shows that the estimator in the limit as t → ∞, A_∞, attains the minimax-optimal failure probability.

Theorem 1.2. The algorithm A_∞ (the limit of Algorithm 1 as t → ∞) is minimax-optimal with respect to the 0-1 loss L(µ̂, µ) = 1[|µ̂ − µ| > ϵ], namely, with respect to the failure probability.

The estimator A_∞ induces the failure probability δ as a function of the estimation error ϵ. Thus, given a desired failure probability δ*, the δ(ϵ) function can be inverted to define the optimal estimator in Algorithm 2 in Section 3.2. Since δ(ϵ) is monotonic, we can use binary search to approximately compute the optimal ϵ. The binary search is explicitly given and analyzed in Appendix A.1.

**Minimax-optimal confidence-interval estimation.** The second setting we consider in this paper concerns confidence-interval estimators. Suppose we are given an increasing loss function L : ℝ⁺ → ℝ⁺ mapping an interval width to a loss. Given an estimator A whose output interval we denote by [µℓ, µr], we measure its mean loss by

sup_{µ∈ℝ} E_{x1,…,xn∼f^(µ)}[L(µr − µℓ)].

Our main result for this second setting is that the estimator defined in Algorithm 4 achieves minimax-optimal mean loss, subject to the constraint that for all parameters µ ∈ ℝ, the probability that µ ∈ [µℓ, µr] is at least 1 − δ* for a given failure probability δ*.

Theorem 1.3. Consider the set Alg_{δ*} of estimators A whose output we denote by [µℓ, µr], such that for all parameters µ,

Pr_{x1,…,xn∼f^(µ)}(µ ∈ [µℓ, µr]) ≥ 1 − δ*.

Consider the estimator defined in Algorithm 4 using failure probability δ*, and denote its mean loss by R.
Then,

R ≤ inf_{A∈Alg_{δ*}} sup_{µ∈ℝ} E_{x1,…,xn∼f^(µ)}[L(µr − µℓ)].

The construction of Algorithm 4 and the proof of Theorem 1.3 come in two parts. First, we show in Section 4.1 that Algorithm 4 is minimax-optimal among equivariant estimators, that is, estimators for which the probability (density) of outputting the interval [µℓ, µr] on input (x1, …, xn) is equal to the probability of outputting [µℓ + θ, µr + θ] on input (x1 + θ, …, xn + θ). Second, we show in Section 4.2 that, for any estimator, equivariant or not, its performance can be arbitrarily well-approximated by an equivariant estimator. Together, these show Theorem 1.3, that Algorithm 4 is minimax optimal.

The bulk of the technical work comes from the first step, constructing Algorithm 4 and showing that it is minimax-optimal among equivariant estimators. The challenge is that we need to optimize the mean loss of an estimator subject to the constraint on its probability of success, which is the event that its output interval does in fact include the true parameter. Our approach is to use convex optimization techniques: for each equivariant estimator A, we consider its mean loss R_A and its failure probability δ_A, both of which are independent of the true parameter due to the equivariance of A. It is easy to check that the set {(R_A, δ_A)} over equivariant estimators A is a convex set. We give Algorithm 3, which, for any angle ρ ∈ [0, π/2), is an equivariant estimator attaining the minimum possible R_A cos ρ + δ_A sin ρ among equivariant estimators. We show the optimality of Algorithm 3 in Theorem 4.3. This estimator thus defines a failure probability δ_ρ and a supporting hyperplane of angle ρ for the {(R_A, δ_A)} set. Noting that δ_ρ is monotonic in ρ, this again means we can binary search on ρ to yield the desired failure probability δ*. Appendix B.1 gives the explicit binary search algorithm and its correctness guarantees.
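The calibration just described (binary-search the angle ρ, then randomize between the two tiebreakings so the failure probability is exactly δ*) can be sketched as follows. This is an illustrative sketch, not the paper's algorithm: `delta_l` and `delta_u` are hypothetical callables standing in for the failure probabilities δℓ(ρ), δu(ρ) of the two tiebreakings, assumed non-increasing in ρ; in the actual construction these are not available in closed form and must themselves be estimated.

```python
import math, random

def calibrate_rho(delta_l, delta_u, delta_star, iters=60, seed=None):
    """Binary-search rho in [0, pi/2] until delta_star lies in
    [delta_l(rho), delta_u(rho)], then compute the mixing weight lambda with
    delta_star = lambda * delta_l(rho) + (1 - lambda) * delta_u(rho)."""
    lo, hi = 0.0, math.pi / 2
    for _ in range(iters):
        mid = (lo + hi) / 2
        if delta_l(mid) > delta_star:
            lo = mid        # even the smaller failure prob. is too big: put more weight on delta
        elif delta_u(mid) < delta_star:
            hi = mid        # even the larger failure prob. is too small: put less weight on delta
        else:
            lo = hi = mid   # delta_star is bracketed at this angle
            break
    rho = (lo + hi) / 2
    dl, du = delta_l(rho), delta_u(rho)
    lam = 1.0 if du == dl else (du - delta_star) / (du - dl)
    run_lower = random.Random(seed).random() < lam  # True: run the min-delta tiebreak
    return rho, lam, run_lower
```

By construction, mixing the two tiebreakings with weight λ gives average failure probability exactly δ*, which is the role the randomization plays in Algorithm 4.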
## 2 Related work

Location estimation, as a special case of parametric estimation, has a well-established asymptotic theory [Vaa00] concerning the MLE, whose asymptotic performance is captured by the Fisher information. There have also been works on the finite-sample performance of the MLE; see for example [Spo11; Pin17; Van00; Mia10]. However, these works impose strong regularity conditions on the parametric model, and lose at least constant multiplicative factors in their error guarantees. More recently, Gupta et al. [GLPV22; GLP23] focused on location estimation, and aimed to characterize the finite-sample theory to within 1 + o(1)-tightness in the estimation error. They proposed a theory in terms of the smoothed Fisher information. Yet, there are examples of distributions whose Fisher information decreases significantly even after a little smoothing, meaning that their theory does not always capture the optimal finite-sample bounds up to a 1 + o(1) factor. By contrast, in this work, we directly construct minimax-optimal estimators.

The related problem of mean estimation has also seen renewed interest in its finite-sample theory, also with the aim of yielding 1 + o(1)-tightness in the estimation error. Catoni [Cat12] showed in his seminal work that, if the variance of the underlying distribution is known, then it is possible to achieve 1 + o(1)-tight estimation error. Catoni [Cat12] and Devroye et al. [DLLO16] also showed that, as long as the kurtosis (normalized central 4th moment) is bounded, knowledge of the variance is unnecessary. Recently, Lee and Valiant [LV22] proposed an estimator also yielding 1 + o(1)-tight error, but removing the knowledge assumption on the variance entirely, and instead assuming only its existence and finiteness.

## 3 Minimax-optimal point estimator

The goal of this section is to construct a point estimator µ̂ that yields minimax-optimal estimation error, where the error of µ̂ is measured by inf{ϵ > 0 : Pr_{x1,…,xn∼f^(µ)}(|µ̂(x1, …, xn) − µ| ≤ ϵ) ≥ 1 − δ* for all µ}. Section 3.1 first gives and analyzes an algorithm which takes as input an estimation accuracy ϵ and returns an estimate µ̂ whose failure probability Pr(|µ̂ − µ| > ϵ) is minimax-optimal. Section 3.2 then uses binary search to construct an estimator that instead takes in a desired failure probability, and outputs an estimate that (almost) optimally minimizes the estimation error.

### 3.1 Estimator with minimax failure probability

The goal of this section is to construct an algorithm that takes in a desired estimation accuracy ϵ and optimizes for the failure probability δ. The key intuition is that, since we do not know what the true parameter θ is, we might as well start by constructing an estimator that is Bayes optimal with respect to the uniform prior over the real line, Unif(ℝ). However, there is no such thing as a uniform prior over the real line. So instead, Algorithm 1 below is constructed to be Bayes optimal with respect to the prior Unif[−t, t], and we take the limit as t → ∞ to get a translation-equivariant estimator, whose risk (failure probability) is independent of the underlying parameter θ. We then use standard tools to relate Bayes optimality to minimax optimality, under such translation equivariance.

Algorithm 1: The algorithm A_t for a fixed estimation accuracy ϵ
Input: Distribution shape f, samples x1, …, xn, and estimation accuracy ϵ
1. Let l_x(θ) be the likelihood of f^(θ) yielding the observed samples x1, …, xn.
2. Return µ̂ = argmax_{θ∈[−t+ϵ, t−ϵ]} ∫_{θ−ϵ}^{θ+ϵ} l_x(θ′) dθ′

As mentioned earlier, the optimal algorithm we propose is the limit A_∞; namely, in Step 2 of Algorithm 1, the argmax over θ is taken over the entire real line. Note, as a basic observation, that A_∞ is an equivariant estimator. To show that A_∞ is minimax optimal in failure probability, we will use the standard technique of relating it to (a sequence of) Bayes-optimal estimators for some (sequence of) Bayesian priors. Let us define the relevant notation here.

Definition 3.1.
Throughout this section, we consider the 0-1 loss function L(µ, µ̂) = 1[|µ̂ − µ| > ϵ] for some fixed ϵ. The failure probability of an n-sample estimator µ̂ with respect to a parameter µ can then be re-expressed as the risk R(µ, µ̂) = E_{x1,…,xn∼f^(µ)}[L(µ, µ̂(x1, …, xn))]. Consider a prior π on the parameter µ over ℝ. We denote the Bayes risk of an estimator µ̂ with respect to the prior π as R(π, µ̂) = E_{µ∼π}[R(µ, µ̂)], under the usual abuse of notation.

To show that a given estimator µ̂ has minimax-optimal risk (namely, failure probability), the following standard fact states that it suffices to demonstrate a sequence of priors such that the worst-case risk of µ̂ is the limit of the optimal Bayes risk of these priors. We prove this fact in Appendix A.

Fact 3.2. Given an n-sample estimator µ̂, suppose there exists a sequence of priors {π_i} such that the worst-case risk of µ̂ is equal to the limit of the optimal Bayes risk of π_i, namely

sup_{µ∈ℝ} R(µ, µ̂) = lim_{i→∞} inf_{µ̂′} R(π_i, µ̂′).

Then µ̂ is minimax optimal.

To show the minimax optimality of A_∞, we will consider the sequence of priors π_t = Unif([−t, t]). The first step is to characterize the Bayes-optimal estimators for the prior π_t.

Lemma 3.3. The algorithm A_t is Bayes-optimal for the prior π_t.

Proof. Given the samples x1, …, xn, the likelihood function l_x(θ) constructed in Step 1 of A_t, restricted to the range [−t, t], is equal to the posterior distribution of θ given x up to a normalization factor. Thus, by construction, Step 2 finds the µ̂ that maximizes the posterior success probability, or equivalently, the µ̂ that minimizes the expected posterior loss (the posterior failure probability) given x. Hence A_t is a Bayes-optimal estimator.

With Fact 3.2 and Lemma 3.3, we can now state and prove that the algorithm A_∞ (Algorithm 1) is minimax-optimal with respect to the 0-1 loss, namely that it has minimax-optimal failure probability.

Theorem 1.2.
The algorithm A_∞ (the limit of Algorithm 1 as t → ∞) is minimax-optimal with respect to the 0-1 loss L(µ̂, µ) = 1[|µ̂ − µ| > ϵ], namely, with respect to the failure probability.

Proof. Fact 3.2 implies that it suffices for us to show

sup_µ R(µ, A_∞) − inf_{µ̂′} R(π_i, µ̂′) → 0 as i → ∞.

Since A_∞ is equivariant, the risk R(µ, A_∞) is independent of µ. Also, recall by Lemma 3.3 that A_t is Bayes-optimal for the prior π_t. The above claim is thus equivalent to

R(π_i, A_∞) − R(π_i, A_i) → 0.

Since A_i is Bayes-optimal for π_i, the left hand side is always nonnegative. Also observe that the left hand side is upper bounded by the probability that A_∞ has a different output from A_i. We will show that this probability, explicitly Pr_{µ∼π_i; x1,…,xn∼f^(µ)}(A_∞(x) ≠ A_i(x)), tends to 0 as i → ∞.

We first claim that, fixing a distribution shape f, for any small probability κ > 0, there exists a sufficiently large i such that, if the true parameter µ ∈ [−i + √i, i − √i], then there is at most κ/2 probability that A_∞ has a different output from A_i when given i.i.d. samples from f^(µ). To see this, note that A_∞ outputting differently from A_i implies that the output of A_∞ is outside of [−i + ϵ, i − ϵ]. Furthermore, since A_∞ is equivariant, we have Pr_{x1,…,xn∼f^(µ)}(A_∞(x) ∉ [µ − t, µ + t]) → 0 as t → ∞ for any fixed µ. For µ ∈ [−i + √i, i − √i], the locations −i + ϵ and i − ϵ are at least t = Ω(√i) away from µ. Thus we can set i sufficiently large such that A_∞ outputs outside of [−i + ϵ, i − ϵ] with probability at most κ/2, for any fixed µ ∈ [−i + √i, i − √i].

Further observe that, under the prior π_i, there is only O(1/√i) probability that the parameter µ is sampled to be outside of [−i + √i, i − √i]. By setting i sufficiently large, this probability is at most κ/2, for any given κ.

In summary, for any given small probability κ, there exists a sufficiently large i such that A_∞ outputs differently from A_i under the prior π_i with probability at most κ. We have thus shown that R(π_i, A_∞) − R(π_i, A_i) → 0 as i → ∞, completing the proof.
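For intuition, A_∞ can be approximated numerically by restricting θ to a large finite grid (an illustrative sketch of ours, not the paper's implementation; the grid bounds and resolution are arbitrary choices): compute the likelihood in log-space, and return the window center whose ±ϵ window captures the most likelihood mass.

```python
import numpy as np

def a_infinity_grid(log_f, xs, eps, lo, hi, n_grid=20001):
    """Grid approximation of A_infty: return the theta maximizing the
    likelihood mass of the window [theta - eps, theta + eps].
    log_f(z): log-density of the shape f; xs: observed samples (1-D array)."""
    theta = np.linspace(lo, hi, n_grid)
    dtheta = theta[1] - theta[0]
    # l_x(theta) = prod_i f(x_i - theta), computed in log-space for stability
    ll = np.sum(log_f(xs[None, :] - theta[:, None]), axis=1)
    lik = np.exp(ll - ll.max())
    cdf = np.concatenate([[0.0], np.cumsum(lik) * dtheta])
    k = int(round(eps / dtheta))
    # window mass via CDF differences, with indices clipped at the grid edges
    idx = np.arange(n_grid)
    a = np.clip(idx - k, 0, n_grid)
    b = np.clip(idx + k + 1, 0, n_grid)
    mass = cdf[b] - cdf[a]
    return float(theta[int(np.argmax(mass))])

rng = np.random.default_rng(0)
xs = rng.normal(2.0, 1.0, size=50)
# For a Gaussian shape, the window-mass maximizer is the posterior mean under
# a flat prior, i.e. the sample mean, so the grid estimate lands there.
print(a_infinity_grid(lambda z: -z**2 / 2, xs, eps=0.5, lo=-5.0, hi=10.0))
```

For non-Gaussian shapes the maximizer generally differs from both the mean and the MLE, which is exactly where the finite-sample optimality of A_∞ matters.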
### 3.2 Estimator with (almost) optimal estimation error

In Section 3.1, we presented an algorithm which, when given the distribution shape f and a fixed estimation error ϵ, finds an estimate which minimizes the failure probability δ. Observing that δ is a monotonic function of ϵ, we can also interpret ϵ as a monotonic function of δ. Thus, if we are given a fixed δ* (and distribution f), there is an infimum over achievable ϵ. Algorithm 1 therefore also induces an optimal algorithm that optimizes for ϵ when given δ*, as follows.

Algorithm 2. Consider the optimal failure probability δ as a function of the estimation accuracy ϵ. Given a desired failure probability δ*, define ϵ* = inf{ϵ > 0 | δ(ϵ) ≤ δ*}. Using Algorithm 1 with estimation accuracy (1 + κ)ϵ* for κ > 0 results in an estimator whose failure probability is by definition at most δ*. By construction, this estimator has accuracy within a (1 + κ) factor of minimax optimal.

The optimal estimation error ϵ* can be computed using binary search. We defer the details to Appendix A.1.

## 4 Minimax-optimal confidence-interval estimator

Recall the second problem setting of this paper, concerning confidence-interval estimators. Fix a loss function L : ℝ⁺ → ℝ⁺ mapping an interval width to a loss, and assume it is increasing. Given an estimator A whose output interval we call [µℓ, µr], its mean loss is the worst-case expected loss

sup_{µ∈ℝ} E_{x1,…,xn∼f^(µ)}[L(µr − µℓ)].

Consider the set Alg_{δ*} of estimators A that are correct with probability 1 − δ*; that is, for all parameters µ,

Pr_{x1,…,xn∼f^(µ)}(µ ∈ [µℓ, µr]) ≥ 1 − δ*.

The goal of this section is then to find an estimator in Alg_{δ*} that minimizes its mean loss, for any desired failure probability δ* ∈ (0, 1]. Section 4.1 shows how to define the optimal equivariant estimator subject to the 1 − δ* probability correctness constraint. Appendix B.1 gives a binary search procedure for approximately computing this optimal equivariant estimator.
Finally, Section 4.2 shows that the optimal equivariant estimator is also optimal across all estimators.

Figure 2: Illustration of the feasible set F_f in Definition 4.1, and of Algorithm 3 and Algorithm 4.

### 4.1 Optimality among equivariant estimators

We use convex optimization techniques to find the best equivariant estimator. To do so, consider the following feasible set of estimator performances, where for each equivariant estimator A, we consider the pair (R_A, δ_A), which is respectively its mean loss and its failure probability. This set is always convex.

Definition 4.1. Fix a distribution f = f^(0). Given a (potentially) randomized equivariant n-sample estimator A with output [µℓ, µr], we say that A achieves the performance pair (R_A, δ_A) if E_{x1,…,xn∼f}[(µr − µℓ)²] ≤ R_A and Pr_{x1,…,xn∼f}(0 ∉ [µℓ, µr]) ≤ δ_A. Define the feasible set for distribution f to be the set

F_f = {(R_A, δ_A) achieved by some equivariant n-sample estimator A}.

See Figure 2 for a pictorial illustration of the set.

Fact 4.2. Given an arbitrary distribution f, the feasible set F_f is convex. As a corollary, the boundary R*(δ) = inf{R_A | (R_A, δ) ∈ F_f} is a non-increasing convex function.

Proof. Given two algorithms A0 and A1 achieving (R_{A0}, δ_{A0}) and (R_{A1}, δ_{A1}) respectively, consider a new randomized algorithm which runs A0 with probability λ and A1 otherwise, for an arbitrary λ ∈ [0, 1]. The new algorithm achieves (λR_{A0} + (1 − λ)R_{A1}, λδ_{A0} + (1 − λ)δ_{A1}). Thus, F_f is a convex set. Since F_f is a convex set, the boundary R* is also a convex function. The definition of F_f also directly implies that R* is non-increasing.

Our goal then is to define the equivariant algorithm argmin_A{R_A | (R_A, δ*) ∈ F_f}, showing in fact that the boundary R*, despite being defined as an infimum, is actually achievable. The key idea is to consider the (sub-)derivative (or just "slope", from now on) of the boundary function R*(δ), which is always non-positive and also increasing as a function of δ, since R* is non-increasing and convex.
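To preview the construction, the single-slope optimization performed by Algorithm 3 below can be sketched on a grid (an illustrative sketch of ours, using the squared interval width as the loss, as in Definition 4.1, and a brute-force scan over endpoint pairs): normalize the likelihood into a posterior p_x, then pick the interval minimizing R cos ρ + δ sin ρ.

```python
import math
import numpy as np

def algorithm3_grid(log_f, xs, rho, lo, hi, n_grid=401):
    """Grid sketch of the slope-rho optimization: return [mu_l, mu_r] minimizing
    (mu_r - mu_l)^2 * cos(rho) + (1 - p_x([mu_l, mu_r])) * sin(rho),
    where p_x is the normalized likelihood of the samples xs."""
    theta = np.linspace(lo, hi, n_grid)
    dtheta = theta[1] - theta[0]
    ll = np.sum(log_f(xs[None, :] - theta[:, None]), axis=1)
    p = np.exp(ll - ll.max())
    p /= p.sum() * dtheta                     # normalized density p_x on the grid
    cdf = np.concatenate([[0.0], np.cumsum(p) * dtheta])
    c, s = math.cos(rho), math.sin(rho)
    best, best_obj = (lo, hi), math.inf
    for i in range(n_grid):                   # brute force over endpoint pairs
        for j in range(i, n_grid):
            width = theta[j] - theta[i]
            delta = 1.0 - (cdf[j + 1] - cdf[i])   # posterior mass outside [theta_i, theta_j]
            obj = width**2 * c + delta * s
            if obj < best_obj:
                best_obj, best = obj, (float(theta[i]), float(theta[j]))
    return best

rng = np.random.default_rng(1)
xs = rng.normal(1.0, 1.0, size=25)
mu_l, mu_r = algorithm3_grid(lambda z: -z**2 / 2, xs, rho=math.pi / 4,
                             lo=xs.mean() - 2, hi=xs.mean() + 2)
print(mu_l, mu_r)   # a short interval around the sample mean
```

Small ρ weights the width term and shrinks the interval toward a point; ρ near π/2 weights coverage and widens it, tracing out the boundary of the feasible set.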
We will first show how, given a slope, we can find an estimator whose (R_A, δ_A) pair lies on the boundary of F_f and has a supporting hyperplane with the desired slope (see "Generic Alg 3" in Figure 2). Having defined this estimator, its failure probability δ_A can be evaluated. Since the failure probability is an increasing function of the slope, we will then do a binary search on the value of the slope to yield the correct value δ_A = δ* (see "Alg 4" in Figure 2).

Specifically, we will parameterize the slope by its angle ρ ∈ [0, π/2]. Given an angle ρ, we will define an estimator A in argmin{R_A cos ρ + δ_A sin ρ | (R_A, δ_A) ∈ F_f}. Additionally, since the argmin set may contain many estimators, we will also have the choice to find the algorithm in the argmin with the minimum δ_A (equivalently, maximizing R_A) or the largest δ_A.

Algorithm 3: Estimator minimizing R_A cos ρ + δ_A sin ρ for a given angle ρ ∈ [0, π/2]
1. Input: samples x1, …, xn, distribution f, Boolean flag smallestDelta indicating whether we tiebreak for the smallest δ_A or the largest δ_A
2. Compute the likelihood function l_x(θ) of f^(θ) yielding the observed samples x1, …, xn.
3. Normalize l_x into the distribution p_x(θ) = l_x(θ)/∥l_x∥₁.
4. For each interval [µ̃ℓ, µ̃r], define its risk R_{[µ̃ℓ,µ̃r]} as (µ̃r − µ̃ℓ)², and its corresponding δ_{[µ̃ℓ,µ̃r]} as the probability 1 − p_x([µ̃ℓ, µ̃r]) under the distribution p_x.
5. Return the [µℓ, µr] that minimizes R_{[µℓ,µr]} cos ρ + δ_{[µℓ,µr]} sin ρ, tiebreaking according to the input Boolean flag smallestDelta.

As we show in the following theorem, the quantity ∥l_x∥₁ in Step 3 above is in fact the probability (density) of observing any shifted version of x, and hence exists and is finite. Thus the normalization step is well-defined.

The following theorem shows that Algorithm 3 is indeed an optimal estimator minimizing R_A cos ρ + δ_A sin ρ among all equivariant estimators.

Theorem 4.3 (Correctness of Algorithm 3).
Suppose Algorithm 3 with angle ρ and flag smallestDelta set to true achieves the pair (R_A, δ_A). Then, for an arbitrary equivariant estimator A′ whose performance is (R_{A′}, δ_{A′}), we have that

R_A cos ρ + δ_A sin ρ ≤ R_{A′} cos ρ + δ_{A′} sin ρ.

Furthermore, if R_A cos ρ + δ_A sin ρ = R_{A′} cos ρ + δ_{A′} sin ρ, then δ_A ≤ δ_{A′}. The analogous statement holds for Algorithm 3 with the flag smallestDelta set to false. Moreover, as a corollary, we have R*(δ_A) = R_A by the convexity of F_f.

Proof. For the rest of the proof, we use the notation v⃗ for the unit vector (cos ρ, sin ρ). On the set of n-tuples of samples (namely ℝⁿ), define the equivalence relation under which two n-tuples are shifts of each other; that is, x⃗ = (x1, …, xn) is in the same class as y⃗ = (y1, …, yn) if and only if y1 − x1 = … = yn − xn. Given a tuple x⃗, we use the notation [x⃗] to denote the corresponding equivalence class. We will also abuse notation and use [x⃗] to denote a unique representative chosen for that equivalence class, and use x⃗ − [x⃗] to denote the scalar shift between x⃗ and the representative [x⃗] of its equivalence class.

Then, for an equivariant algorithm A whose output is denoted by [µℓ, µr], we can write R_A as

R_A = E_{[x⃗]} E_{x⃗|[x⃗]} E_{A|x⃗}[L(µr − µℓ)],

meaning that there are three levels of randomness: first draw an equivalence class [x⃗] (according to the distribution f = f^(0)), then draw a random shift according to the conditional distribution x⃗ | [x⃗], and finally, the algorithm A can itself be random conditioned on its input x⃗.
Similarly, we can write δ_A as

δ_A = E_{[x⃗]} E_{x⃗|[x⃗]} E_{A|x⃗}[1{0 ∉ [µℓ, µr]}].

Given that the goal of Algorithm 3 is to optimize v⃗ · (R_A, δ_A), linearity implies that it suffices to optimize, for each equivalence class [x⃗], the inner product

v⃗ · ( E_{x⃗|[x⃗]} E_{A|x⃗}[L(µr − µℓ)], E_{x⃗|[x⃗]} E_{A|x⃗}[1{0 ∉ [µℓ, µr]}] ).

To see that Algorithm 3 optimizes the above inner product among all equivariant algorithms, first observe that by linearity, it suffices to consider deterministic algorithms A which output a deterministic interval on a given input n-tuple of samples. Moreover, observe that the conditional distribution x⃗ | [x⃗] is, up to translation, equal to the normalized likelihood l_y(θ)/∥l_y∥₁ for any n-tuple y⃗ ∈ [x⃗]. This is because ∥l_y∥₁ is exactly the probability (density) of the equivalence class [x⃗]. Therefore, in Step 4 of Algorithm 3, the probability δ_{[µ̃ℓ,µ̃r]} is equal to the probability, under the conditional distribution x⃗ | [x⃗], that an equivariant algorithm whose output is [µ̃ℓ − (x⃗ − [x⃗]), µ̃r − (x⃗ − [x⃗])] under input [x⃗] fails to output an interval that contains the parameter 0. That is,

δ_{[µ̃ℓ,µ̃r]} = E_{x⃗|[x⃗]}[1{0 ∉ [µℓ, µr]}],

where [µℓ, µr] is the output of a deterministic equivariant algorithm that outputs [µ̃ℓ − (x⃗ − [x⃗]), µ̃r − (x⃗ − [x⃗])] under input [x⃗].

To conclude, Algorithm 3 by construction chooses a deterministic interval to output so as to minimize the inner product

v⃗ · ( E_{x⃗|[x⃗]} E_{A|x⃗}[L(µr − µℓ)], E_{x⃗|[x⃗]} E_{A|x⃗}[1{0 ∉ [µℓ, µr]}] )

among all equivariant algorithms. The same reasoning also shows that tiebreaking per input n-tuple x⃗ implies tiebreaking on δ_A, which is an expectation over the distribution of the n-tuple sample input.

Given Algorithm 3, we can now define the optimal estimator for a given confidence parameter δ*, which is by construction optimal given Theorem 4.3.

Algorithm 4 (Minimax-optimal confidence-interval estimator). Fix a distribution f.
For an angle ρ, let δℓ(ρ) be the failure probability of Algorithm 3 with the flag smallestDelta set to true, and denote this estimator by Aℓ(ρ); let δu(ρ) be the failure probability of Algorithm 3 with the flag set to false, similarly denoting this estimator by Au(ρ). Given a failure probability δ*, by the supporting hyperplane theorem there exists an angle ρ (e.g., the arctangent of a sub-derivative of R*(δ) at δ = δ*) such that δ* ∈ [δℓ(ρ), δu(ρ)]. Let δ* = λδℓ(ρ) + (1 − λ)δu(ρ). Then, define the optimal MSE estimator for failure probability δ* to be the randomized algorithm which runs Aℓ(ρ) with probability λ and runs Au(ρ) otherwise. By definition, this estimator achieves (R*(δ*), δ*).

The optimal slope angle ρ can be approximately computed using binary search. We give the details in Appendix B.1.

### 4.2 Minimax optimality of Algorithm 4

In Section 4.1, we showed that Algorithm 4 is optimal among equivariant estimators. We now show that it is in fact optimal among all estimators. We first extend Definition 4.1 to also cover non-equivariant estimators.

Definition 4.4. Given a (potentially) randomized n-sample estimator A, whose output interval we denote by [µℓ, µr], we say that A achieves the (R_A, δ_A) pair if, for all parameters µ ∈ ℝ, E_{x1,…,xn∼f^(µ)}[L(µr − µℓ)] ≤ R_A and Pr_{x1,…,xn∼f^(µ)}(µ ∉ [µℓ, µr]) ≤ δ_A.

The key step in the proof of minimax optimality of Algorithm 4 is to show that the performance of any estimator can be well-approximated by an equivariant estimator.

Proposition 4.5. Fix an arbitrary distribution f and an arbitrarily small approximation parameter b ∈ (0, 1/2). Suppose there is an estimator A (not necessarily equivariant) achieving (R_A, δ_A); then there exists an equivariant estimator A′ that achieves ((1 + b)R_A, δ_A + 2b).

We prove Proposition 4.5 in Appendix B. The main idea for constructing A′ from A is that, on input an n-tuple of samples (x1, . . .
, xn) whose sample median is s, we 1) shift the samples by a random shift drawn from Unif[−t − s, t − s] for sufficiently large t, 2) call A on the shifted samples, and 3) output the returned interval after subtracting off the random shift we applied to the samples. This estimator is equivariant by construction, and we relate its execution, with high probability, to applying A to samples drawn according to the prior Unif[−t, t]. This allows us to bound the performance guarantees of A′ by those of A.

With Proposition 4.5, we can then show that Algorithm 4 is in fact minimax optimal among all estimators, not just equivariant ones.

Theorem 4.6. Fix a distribution f. Recall the definition of R*(δ) from Fact 4.2, which is the infimum of all achievable (R_A, δ_A = δ) over all equivariant estimators A. Now consider an arbitrary (potentially non-equivariant) estimator A′. If A′ achieves (R_{A′}, δ_{A′}), then R_{A′} ≥ R*(δ_{A′}). As a result, since (R*(δ*), δ*) is achievable by Theorem 4.3 and Algorithm 4 for any δ* > 0, we have that Algorithm 4 is in fact minimax optimal among all estimators.

The proof is in Appendix B.

Figure 3: Example distribution and CDF of confidence interval widths. (a) Distribution f, original and smoothed (r = 0.3); (b) CDF of the point error of the MSE estimator, MLE, interval MLE, and smoothed MLE; (c) CDF of the interval radius.

## 5 Experimental results

We compare various location estimation methods on synthetic data from a fairly simple, but irregular, piecewise linear distribution (Figure 3(a)). We set n = 10 and aim for 90% confidence intervals. In Figure 3(b), we plot the CDF of the point error produced by the MLE, the 0.3-smoothed MLE, and our two algorithms (Algorithm 1 and Algorithm 4). We find that our algorithms get 25% smaller error than the MLE at the target 90% error rate.
The smoothed MLE and Algorithm 4 error distributions both dominate the MLE's, with Algorithm 4 performing better close to its target and the smoothed MLE better further away. Algorithm 1 does 2% better than Algorithm 4 at the target 90%, but worse over most of the range. In Figure 3(c), we compare the confidence intervals produced by our algorithms. Algorithm 1 uses a fixed interval radius regardless of the samples it sees, while Algorithm 4 has a distribution over interval widths that is usually smaller but occasionally bigger.

Acknowledgements

We thank the anonymous reviewers for insightful comments and suggestions on this work. Shivam Gupta and Eric Price are supported by NSF awards CCF-2008868, CCF-1751040 (CAREER), and the NSF AI Institute for Foundations of Machine Learning (IFML). Jasper C.H. Lee is supported in part by the generous funding of a Croucher Fellowship for Postdoctoral Research, NSF award DMS-2023239, NSF Medium Award CCF-2107079 and NSF AitF Award CCF-2006206. Paul Valiant is supported by NSF award CCF-2127806.

[Cat12] Olivier Catoni. "Challenging the empirical mean and empirical variance: A deviation study". In: Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 48.4 (2012), pp. 1148–1185. DOI: 10.1214/11-AIHP454.
[DLLO16] Luc Devroye, Matthieu Lerasle, Gábor Lugosi, and Roberto I. Oliveira. "Sub-Gaussian mean estimators". In: Ann. Stat. 44.6 (2016), pp. 2695–2725.
[GLP23] Shivam Gupta, Jasper C.H. Lee, and Eric Price. "High-dimensional Location Estimation via Norm Concentration for Subgamma Vectors". In: ICML. 2023.
[GLPV22] Shivam Gupta, Jasper C.H. Lee, Eric Price, and Paul Valiant. "Finite-Sample Maximum Likelihood Estimation of Location". In: NeurIPS. 2022.
[LV22] Jasper C.H. Lee and Paul Valiant. "Optimal Sub-Gaussian Mean Estimation in ℝ". In: Proc. FOCS'21. 2022, pp. 672–683.
[Mia10] Yu Miao. "Concentration inequality of maximum likelihood estimator". In: Applied Mathematics Letters 23.10 (2010), pp. 1305–1309.
ISSN: 0893-9659. DOI: 10.1016/j.aml.2010.06.019.
[Pin17] Iosif Pinelis. "Optimal-order uniform and nonuniform bounds on the rate of convergence to normality for maximum likelihood estimators". In: Electronic Journal of Statistics 11.1 (2017), pp. 1160–1179. DOI: 10.1214/17-EJS1264.
[Spo11] Vladimir Spokoiny. "Parametric estimation. Finite sample theory". In: The Annals of Statistics 40.6 (2012), pp. 2877–2909. DOI: 10.1214/12-AOS1054.
[Vaa00] A.W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000. ISBN: 9780521784504.
[Van00] Sara A. van de Geer. Empirical Processes in M-estimation. Vol. 6. Cambridge University Press, 2000.

A Missing details from Section 3

Fact 3.2. Given an n-sample estimator µ̂, suppose there exists a sequence of priors {πi} such that the worst-case risk of µ̂ equals the limit of the optimal Bayes risks of the πi, namely

sup_{µ∈ℝ} R(µ, µ̂) = lim_{i→∞} inf_{µ̂′} R(πi, µ̂′).

Then µ̂ is minimax optimal.

Proof. Consider any other estimator θ̃. Then sup_θ R(θ, θ̃) ≥ R(πi, θ̃) ≥ inf_{θ̂′} R(πi, θ̂′) for all πi, meaning that

sup_θ R(θ, θ̃) ≥ lim_{i→∞} inf_{θ̂′} R(πi, θ̂′) = sup_θ R(θ, µ̂).

A.1 Binary search for approximately computing the optimal point estimator

In this section, we address the (approximate) computation of the optimal ϵ when given a desired failure probability δ∗, using binary search. We explicitly give the (straightforward) binary search below in Algorithm 5, which accounts for the error in computing δ(ϵ) for a given value of ϵ. We also state the guarantees of Algorithm 5 in Theorem A.1 and prove the theorem below.

Algorithm 5 Approximately computing ϵ from δ∗ using binary search
Assume: Oracle for estimating δ(ϵ) to additive error ξ for any given ϵ. Call this oracle δ̂(ϵ).
1. Input: Distribution f = f^(0), desired failure probability δ∗. Initialize ϵmax ← 1.
2. While δ̂(ϵmax) + ξ > δ∗, double ϵmax.
3. ϵmin ← 0.
4. While ϵmax − ϵmin > η:
 (a) ϵmid ← (ϵmax + ϵmin)/2
 (b) If δ̂(ϵmid) + ξ < δ∗, then ϵmax ← ϵmid
 (c) Otherwise, ϵmin ← ϵmid
5. Return ϵmax

Theorem A.1.
Fix a distribution shape f, which implicitly defines a function ϵ∗(δ) mapping a failure probability δ to the optimal estimation error ϵ∗(δ). Algorithm 5, on input a failure probability δ∗ and the distribution shape f, takes poly(1/ξ, |log ϵ∗(δ∗ − ξ)|, log(1/τ), log(ϵ∗(δ∗ − ξ)/η)) many executions of A and returns ϵ ∈ [ϵ∗(δ∗), ϵ∗(δ∗ − ξ) + η] with probability 1 − τ.

Proof. For now we analyze Algorithm 5 assuming the oracle; we will discuss the time complexity of the oracle and the success probability of the whole algorithm at the end of the proof. Given the guarantee that |δ̂ − δ| ≤ ξ, at the end of Step 2 we have ϵmax ≥ ϵ∗(δ∗). Due to the one-sided test in Step 4(b), by induction we have that ϵmin ≤ ϵ∗(δ∗ − ξ) throughout the iterations of Step 4. Since we return ϵmax in Step 5, when ϵmax − ϵmin ≤ η, we can guarantee that the returned value of ϵ is at most ϵ∗(δ∗ − ξ) + η.

The above analysis assumes the success of every call to the oracle δ̂. Observe that δ is a failure probability, namely the expectation of a Bernoulli coin. We can estimate δ by running A on phantom samples we generate ourselves from f^(0). Via the standard Hoeffding bound, to yield an estimate of δ(ϵ) to within additive error ξ with probability 1 − τ, we need O(log(1/τ)/ξ²) samples, namely that many independent runs of A. Since we make many calls to the oracle throughout Algorithm 5, it suffices for us to upper bound the number of oracle calls, and multiplicatively adjust τ appropriately such that the total failure probability is τ by a union bound across all the oracle calls. For Step 2, we can upper bound the number of iterations by log max(1, ϵ∗(δ∗ − ξ)), since we always terminate once ϵmax ≥ ϵ∗(δ∗ − ξ), by the guarantee that |δ̂ − δ| ≤ ξ. Step 4 is just a binary search, halving the interval [ϵmin, ϵmax] every time, from size max(1, ϵ∗(δ∗ − ξ)) down to η, thus requiring log(max(1, ϵ∗(δ∗ − ξ))/η) many iterations. Note also that log(max(1, ϵ∗(δ∗ − ξ))/η) ≤ |log ϵ∗(δ∗ − ξ)| + log(1/η). The runtime claim of Theorem A.1 thus follows.
B Missing details from Section 4

Proposition 4.5. Fix an arbitrary distribution f and an arbitrarily small approximation parameter b ∈ (0, 1/2). Suppose there is an estimator A (not necessarily equivariant) achieving (RA, δA); then there exists an equivariant estimator A′ that achieves ((1 + b)RA, δA + 2b).

Proof. Given the estimator A, first construct an intermediate estimator Ã such that, if A outputs an interval of length at most L⁻¹(RA/b), then Ã outputs the identical interval; otherwise Ã outputs an arbitrary interval of length at most L⁻¹(RA/b). The claim is that Ã achieves (RA, δA + b). It is immediate that Ã achieves mean loss at most RA for any parameter µ, by construction. As for its failure probability, for an arbitrary parameter µ, the probability that A outputs an interval [µℓ, µr] of length > L⁻¹(RA/b) is at most b, since the mean loss of A is at most RA for parameter µ. Thus, the failure probability of Ã at parameter µ is at most δA + b, which holds for any µ.

We now construct the equivariant estimator A′. On input an n-tuple x, let the sample median be s. Sample a number w ∼ Unif[−t − s, t − s] for some fixed parameter t chosen later, where t depends only on b and the distribution f, and is independent of x. Then, A′ calls Ã on input x + (w, . . . , w) and obtains an output interval [µ̃ℓ, µ̃r]. Finally, A′ outputs [µ̃ℓ − w, µ̃r − w]. Observe that the new estimator A′ is by construction equivariant.

It remains to choose the parameter t as a function of b (and the underlying distribution f). Consider drawing n samples from f = f^(0); then there must exist some distance smax such that the sample median is within smax of the origin with probability at least 1 − b²/2. We choose t = smax/b². Now we need to upper bound RA′ and δA′ in terms of RÃ and δÃ respectively.
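The wrapper construction of A′ just described can be sketched as follows; `A` is any interval estimator (here a stand-in for Ã), and `t` is the truncation parameter from the proof. This is a minimal illustration of the shift-and-undo mechanics only:

```python
import random
import statistics

def make_equivariant(A, t):
    """Wrap an arbitrary interval estimator A into a shift-equivariant one:
    apply a random shift centered around minus the sample median, run A,
    then undo the shift on the returned interval."""
    def A_prime(samples):
        s = statistics.median(samples)
        # Random shift drawn from Unif[-t - s, t - s], as in the proof.
        w = random.uniform(-t - s, t - s)
        lo, hi = A([x + w for x in samples])
        # Undo the shift on the output interval.
        return lo - w, hi - w
    return A_prime
```

Shifting all inputs by c shifts the sample median by c, so (coupling the random draws) the shifted inputs seen by `A` are identical and the output interval moves by exactly c, which is the equivariance property.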
Specifically, we claim that there is a coupling, succeeding with probability 1 − b², between the execution of A′ at parameter µ = 0 and the following process: 1) draw a parameter µ′ ∼ Unif[−t, t], 2) draw an n-tuple of samples y from f^(µ′), 3) run Ã on y to get an interval [µ̃ℓ, µ̃r] and output [µ̃ℓ − µ′, µ̃r − µ′]. We note that the latter process achieves (RÃ, δÃ) with respect to the parameter µ = 0.

To show this, observe that with probability at least 1 − b²/2, drawing an n-tuple of samples x from f = f^(0) yields a sample median with |s| ≤ smax, by the definition of smax. Then, observe that since t = smax/b², the total variation distance between Unif[−t − s, t − s] and Unif[−t, t] is |s|/(2t) ≤ b²/2. Thus, with probability at least 1 − b², the execution of A′ on input samples drawn from f^(0) coincides with a version of A′ where the shift w is instead always drawn from Unif[−t, t]. However, this variant of A′ is exactly equivalent to the aforementioned process of 1) drawing a parameter w ∼ Unif[−t, t], 2) running Ã on samples drawn from f^(w), and 3) outputting the interval after correcting for the shift w.

Therefore, given this coupling, we conclude that δA′ ≤ δÃ + b² ≤ δA + b + b² ≤ δA + 2b, as b < 1/2. As for RA′, since Ã always returns intervals whose loss is at most RA/b, in the probability-b² part of the coupling where the two processes behave differently, the contribution to the expected loss of A′ is at most b² · RA/b = b·RA. Thus we can bound RA′ by RÃ + b·RA ≤ (1 + b)RA, as desired.

Theorem 4.6. Fix a distribution f. Recall the definition of R∗(δ) from Fact 4.2, which is the infimum of all achievable (RA, δA = δ) over all equivariant estimators A. Now consider an arbitrary (potentially non-equivariant) estimator A′. If A′ achieves (RA′, δA′), then RA′ ≥ R∗(δA′). As a result, since (R∗(δ∗), δ∗) is achievable by Theorem 4.3 and Algorithm 4 for any δ∗ > 0, Algorithm 4 is in fact minimax optimal among all estimators.

Proof of Theorem 4.6. Suppose for the sake of contradiction that RA′ < R∗(δA′).
By the supporting hyperplane theorem, there exists an angle ρ ∈ [0, π/2] such that, for all equivariant estimators A, we have δA cos ρ + RA sin ρ ≥ δA′ cos ρ + R∗(δA′) sin ρ. However, by Proposition 4.5, for all b ∈ (0, 1/2], there is an equivariant estimator Ab that achieves ((1 + b)RA′, δA′ + 2b). Consider a value of b < min(1/2, (R∗(δA′) − RA′) sin ρ / (2 cos ρ + RA′ sin ρ)), noting that the upper bound is positive by our contradiction assumption. Then, it is straightforward to check that Ab violates the supporting hyperplane inequality δAb cos ρ + RAb sin ρ ≥ δA′ cos ρ + R∗(δA′) sin ρ. Thus, it must be the case that RA′ ≥ R∗(δA′).

B.1 Computing an optimal slope angle

In order to apply the construction of Algorithm 4, when given a distribution f and a desired failure probability δ∗, we need to compute an optimal slope angle ρ ∈ [0, π/2] to use in Algorithm 3. Since the failure probability is a monotonic function of ρ, the obvious strategy is again binary search. Algorithm 6 returns an (approximately) optimal value of ρ for Algorithm 3, as well as a probability for randomly setting the flag smallestDelta to true or false.

Algorithm 6 Binary search for the optimal estimator through the slope angle ρ
Assume: Oracle for computing the failure probability of Algorithm 3 to additive error ξ for any given angle ρ and Boolean value of the smallestDelta flag. We use δ̂ℓ(ρ) to denote the oracle approximation for Algorithm 3 with smallestDelta set to true, and δ̂r(ρ) for smallestDelta set to false.
1. Input: distribution f, failure probability δ∗
2. Set ρmin ← 0, ρmax ← π/2.
3. Repeat until δ̂ℓ(ρmin) − δ̂r(ρmax) ≤ √ξ − 4ξ:
 (a) Set ρmid ← (ρmin + ρmax)/2.
 (b) Use the oracle to evaluate the failure probabilities δ̂ℓ(ρmid) and δ̂r(ρmid) for Algorithm 3.
 (c) If δ∗ < δ̂ℓ(ρmid) + ξ, set ρmin ← ρmid.
 (d) If δ∗ > δ̂r(ρmid) + ξ, set ρmax ← ρmid.
 (e) If δ∗ ∈ [δ̂ℓ(ρmid) + ξ, δ̂r(ρmid) + ξ]:
 i.
If δ̂r(ρmid) − δ̂ℓ(ρmid) ≤ √ξ − 4ξ, terminate and return the algorithm that runs Algorithm 3 with ρ = ρmid and the flag smallestDelta always set to true.
 ii. Otherwise, terminate and return the algorithm that runs Algorithm 3 with ρ = ρmid and the flag smallestDelta set to true with probability (1 + 3√ξ)(δ̂r(ρmid) + ξ − δ∗)/(δ̂r(ρmid) − δ̂ℓ(ρmid)).
4. Return the algorithm that runs Algorithm 3 with ρ = ρmax and smallestDelta always set to false.

The following theorem captures the correctness of Algorithm 6.

Theorem B.1. Fix an arbitrary distribution f and failure probability δ∗. If, for a parameter ξ smaller than some absolute constant, all the oracle calls in Algorithm 6 are correct, then, upon termination of Algorithm 6, the returned algorithm achieves a mean loss RA ∈ [R∗(δ∗), R∗(δ∗ − O(√ξ))] and succeeds with probability at least 1 − δ∗.

Proof. For this proof, denote by δℓ(ρ) the actual failure probability of Algorithm 3 using slope angle ρ with smallestDelta set to true, and similarly δr(ρ) for the flag set to false. First, by induction, throughout the execution of the algorithm we have δ∗ ∈ [δr(ρmax), δℓ(ρmin) + 2ξ]. This can be checked straightforwardly from Steps 3(c,d) and the fact that δ̂(ρ) is within additive error ξ of δ(ρ) by assumption.

There are two ways the algorithm terminates: either in Step 4 or in Step 3(e). We first analyze termination in Step 4. From the above paragraph, we have δ∗ ∈ [δr(ρmax), δℓ(ρmin) + 2ξ]. The returned estimator from Step 4 has failure probability δr(ρmax) by definition. In order to reach Step 4, it must be the case that δ̂ℓ(ρmin) − δ̂r(ρmax) ≤ √ξ − 4ξ, which implies δℓ(ρmin) − δr(ρmax) ≤ √ξ − 2ξ. Therefore, we have δr(ρmax) ∈ [δ∗ − √ξ, δ∗]. By Fact 4.2, we conclude that RA ∈ [R∗(δ∗), R∗(δ∗ − √ξ)].

Next, we analyze the two possible termination conditions in Step 3(e). We know that δ∗ ∈ [δ̂ℓ(ρmid) + ξ, δ̂r(ρmid) + ξ], which implies δ∗ ∈ [δℓ(ρmid), δr(ρmid) + 2ξ]. If we terminate according to Step 3(e)i, then we know that δ̂r(ρmid) − δ̂ℓ(ρmid) ≤ √ξ − 4ξ.
This implies δr(ρmid) − δℓ(ρmid) ≤ √ξ − 2ξ. Combining with the last paragraph, we get δℓ(ρmid) ∈ [δ∗ − √ξ, δ∗]. Also note that the returned estimator has failure probability δℓ(ρmid) by definition. Again using Fact 4.2, we conclude that the returned estimator satisfies RA ∈ [R∗(δ∗), R∗(δ∗ − √ξ)].

Lastly, we check the case where we terminate according to Step 3(e)ii. Writing p = (1 + 3√ξ)(δ̂r(ρmid) + ξ − δ∗)/(δ̂r(ρmid) − δ̂ℓ(ρmid)) for the probability of setting smallestDelta to true, the failure probability of the returned algorithm is

p · δℓ(ρmid) + (1 − p) · δr(ρmid).

We will show that this failure probability is between δ∗ − O(√ξ) and δ∗, which (again by Fact 4.2) lets us conclude that RA ∈ [R∗(δ∗), R∗(δ∗ − O(√ξ))].

To show that the failure probability is at most δ∗, observe that δ̂r(ρmid) − δ̂ℓ(ρmid) ≤ δr(ρmid) − δℓ(ρmid) + 2ξ ≤ (1 + 3√ξ)(δr(ρmid) − δℓ(ρmid)), since δ̂r(ρmid) − δ̂ℓ(ρmid) > √ξ − 4ξ by the condition of Step 3(e)ii. Furthermore, δ̂r(ρmid) + ξ − δ∗ ≥ δr(ρmid) − δ∗. Thus,

p ≥ (δr(ρmid) − δ∗)/(δr(ρmid) − δℓ(ρmid)),

implying that the failure probability is at most δ∗.

Finally, we need to show that the failure probability is at least δ∗ − O(√ξ). To see this, we upper bound

p ≤ (1 + 3√ξ)(δr(ρmid) − δ∗ + 2ξ)/(δr(ρmid) − δℓ(ρmid) − 2ξ) ≤ (1 + O(√ξ))(δr(ρmid) − δ∗ + 2ξ)/(δr(ρmid) − δℓ(ρmid)),

again since δ̂r(ρmid) − δ̂ℓ(ρmid) > √ξ − 4ξ. Therefore, the failure probability is lower bounded by

p · δℓ(ρmid) + (1 − p) · δr(ρmid) = δr(ρmid) − p · (δr(ρmid) − δℓ(ρmid)) ≥ δr(ρmid) − (1 + O(√ξ))(δr(ρmid) − δ∗ + 2ξ) ≥ δ∗ − O(√ξ),

since ξ is sufficiently small.

Since the failure probability δ is not a Lipschitz function of the slope angle ρ, we cannot in theory bound the runtime of Algorithm 6. However, we can nonetheless implement the probability-estimation oracle so as to ensure a total failure probability of at most τ.
As in Section 3.2, in order to estimate the failure probability of an algorithm to accuracy ξ with probability 1 − τ, it suffices to use O(log(1/τ)/ξ²) executions of Algorithm 3 over simulated samples (say, from f^(0)). In order to make sure that all oracle calls are correct with a total failure probability of at most τ, we simply decrease the allowed failure probability of each subsequent oracle call by a factor of 2: the first call is allowed to fail with probability τ/2, the second with probability τ/4, and so on, which sums to a total failure probability of strictly less than τ. We emphasize again that these executions of Algorithm 3 use only simulated samples that we generate ourselves from a known distribution with a known shift, not the real samples we receive for the estimation task.
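The oracle and the halving schedule above can be sketched as follows; `run_once` is a hypothetical closure that performs one simulated run of Algorithm 3 on samples drawn from f^(0) and reports whether it failed:

```python
import math

def estimate_failure_prob(run_once, xi, tau):
    """Monte Carlo oracle: average n Bernoulli outcomes, with n chosen via the
    Hoeffding bound so the estimate is within xi of the truth w.p. >= 1 - tau."""
    n = math.ceil(math.log(2.0 / tau) / (2.0 * xi ** 2))
    return sum(1 for _ in range(n) if run_once()) / n

def halving_schedule(tau, num_calls):
    """Per-call failure budgets tau/2, tau/4, ..., summing to strictly less than tau."""
    return [tau / 2.0 ** (k + 1) for k in range(num_calls)]
```

Passing `halving_schedule(tau, K)[k]` as the per-call budget of the k-th oracle call gives, by a union bound, a total failure probability strictly below τ no matter how many calls the binary search ends up making.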