# instability_computational_efficiency_and_statistical_accuracy__2713e75a.pdf

Journal of Machine Learning Research 26 (2025) 1-68 Submitted 3/22; Revised 1/25; Published 3/25

Instability, Computational Eﬃciency and Statistical Accuracy

Nhat Ho minhnhat@utexas.edu Department of Statistics and Data Sciences University of Texas, Austin

Koulik Khamaru kk1241@stat.rutgers.edu Department of Statistics Rutgers University

Raaz Dwivedi rd597@cornell.edu Department of Operations Research & Information Engineering Cornell

Martin J. Wainwright wainwrigwork@gmail.com Department of EECS, Department of Mathematics Massachusetts Institute of Technology

Michael I. Jordan jordan@cs.berkeley.edu Department of EECS, Department of Statistics University of California, Berkeley INRIA, Paris

Bin Yu binyu@berkeley.edu Department of EECS, Department of Statistics University of California, Berkeley

Editor: Prateek Jain

Many statistical estimators are deﬁned as the ﬁxed point of a data-dependent operator, with estimators based on minimizing a cost function being an important special case. The limiting performance of such estimators depends on the properties of the population-level operator in the idealized limit of inﬁnitely many samples. We develop a general framework that yields bounds on statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level, and its degree of (in)stability when applied to an empirical object based on n samples. Using this framework, we analyze both stable forms of gradient descent and some higher-order and unstable algorithms, including Newton s method and its cubic-regularized variant, as well as the EM algorithm. We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative nonresponse models. We exhibit cases in which an unstable algorithm can achieve the same statistical accuracy as a stable algorithm in exponentially fewer steps namely, with the number of iterations being reduced from polynomial to logarithmic in sample size n.

. Raaz Dwivedi, Nhat Ho, and Koulik Khamaru contributed equally to this work.

c 2025 Nhat Ho, Koulik Khamaru, Raaz Dwivedi, Martin J. Wainwright, Michael I. Jordan, and Bin Yu.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/22-0300.html.

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

1. Introduction

The interplay between the stability and computational eﬃciency of optimization algorithms has long been a fundamental problem in statistics and machine learning. The stability of the algorithm, a classical desideratum, is often believed to be a necessity for obtaining eﬃcient statistical estimators. Such a belief rules out the use of a variety of faster algorithms due to their instability. This paper shows that this popular belief can be misleading: the situation is more subtle in that there are various settings in which unstable algorithms may be preferable to their stable counterparts. Recent years have seen a signiﬁcant body of work involving performance of various machine-learning algorithms when applied to statistical estimation problems. Examples include sparse signal recovery (Hale et al., 2008; Garg and Khandekar, 2009; Beck and Teboulle, 2009; Becker et al., 2011), more general forms of M-estimation (Agarwal et al., 2012; Zhang and Zhang, 2012; Loh and Wainwright, 2015), principal component analysis (Amini and Wainwright, 2008; Ma, 2013; Yuan and Zhang, 2013), regression with concave penalties (Loh and Wainwright, 2015; Wang et al., 2014), phase retrieval problems retrieval (e.g., (Cand es et al., 2012, 2015; Chen and Wainwright, 2015; Zhang et al., 2017; Chen et al., 2018b)), and mixture model estimation (Balakrishnan et al., 2017; Yang et al., 2017; Cai et al., To Appear; Yi and Caramanis, 2015). A unifying theme in these works is to study, in a ﬁnite-sample setting, the computational eﬃciency of diﬀerent algorithms and the statistical accuracy of the resulting estimates. For estimators based on solving optimization problems that are convex, standard algorithms and theory can be applied. However, many modern estimators arise from non-convex optimization problems, in which case the associated algorithms become more complex to understand. But evidence is accumulating for the practical and theoretical advantages of such algorithms. For instance, the paper (Agarwal et al., 2012) established the fast convergence of projected gradient descent (GD) for high-dimensional signal recovery in a weakly convex setting, whereas the papers (Loh and Wainwright, 2015; Wang et al., 2014) provided similar guarantees for a class of non-convex learning problems. Other work has demonstrated fast convergence of the truncated power method for PCA (Yuan and Zhang, 2013), analyzed the behavior of projected gradient methods for low-rank matrix recovery (Chen and Wainwright, 2015), and characterized the behavior of gradient descent for phase-retrieval problems (Chen et al., 2018b). Additionally, there is also a recent line on work on the fast convergence of EM for various types of mixture models (Balakrishnan et al., 2017; Yang et al., 2017; Cai et al., To Appear). Finally, there is a line of work (Hardt et al., 2016; Chen et al., 2018a; Kuzborskij and Lampert, 2018; Charles and Papailiopoulos, 2018) that provides statistical error bounds for generic machine learning problems (with certain assumptions on loss functions) in terms of estimators obtained via iterative optimization algorithms (e.g., stochastic gradient methods).

1.1 Population-to-sample or stability-based analysis

The analysis in these works falls into two distinct categories. The ﬁrst is a direct analysis, in which one directly characterizes the behavior of the iterates of the algorithm on the ﬁnite-sample objective. A long line of papers has used the direct approach (e.g., (Agarwal et al., 2012; Loh and Wainwright, 2015; Wang et al., 2014; Zhang and Zhang, 2012; Yuan

Instability, Computational Efficiency and Statistical Accuracy

and Zhang, 2013)) to demonstrate that certain optimization algorithms converge at geometric rates to a local neighborhood of the true parameter, with the radius proportional to the statistical minimax risk. The second kind of analysis is more indirect and can be referred to as population-to-sample analysis or stability-based analysis where one analyzes the algorithmic convergence of population-level iterates, and derives statistical errors for the sample-level updates via uniform laws for stability/perturbation bounds. These approaches have been used to analyze the performance of EM and its variants in several statistical settings, see the papers (Balakrishnan et al., 2017; Cai et al., To Appear; Yang et al., 2017; Yi and Caramanis, 2015; Dwivedi et al., 2020a,b) and the references therein. In general settings, it has been used to derive statistical errors for iterates from stochastic optimization methods (Hardt et al., 2016; Chen et al., 2018a; Kuzborskij and Lampert, 2018; Charles and Papailiopoulos, 2018). The contributions of this paper build upon the stability-based analysis, so let us discuss it in a little more detail. Let F and Fn denote the operators that deﬁne the iterates at the population level, corresponding to the idealized limit of an inﬁnite sample size, and samplelevel based on a dataset of size n. Suppose θ denotes the parameter of interest, such that the population-level iterates deﬁned as θt = F(θt 1) for t = 1, 2, . . . with initialization θ0, i.e., θt = F t(θ0), converge to θ as t . Of interest is to characterize the best possible estimate of θ obtained from the sample-based (noisy) iterates, deﬁned as θt n = F t n(θ0) (with initialization θ0), and possibly characterize the change in the error F t n(θ0) θ as a function of the iteration t and the sample size n. The population-to-sample or the stability analysis proceeds by using the following decomposition:

F t n(θ0) θ = F t(θ0) θ | {z } =:εt opt

+ F t n(θ0) F t(θ0) | {z } =:εt stab

Given this decomposition, the analysis proceeds in two steps:

The ﬁrst step is a deterministic convergence analysis of the algorithm to the true parameter at the population-level, namely, obtain a control on the optimization error εt opt as a function of t.

The second step is to perform a stability analysis of the diﬀerence between the population and the sample-based iterates, namely, obtain a control on the perturbation/stability error εt stab as a function of t.

The ultimate convergence guarantee what statistical error can be achieved with the samplebased operator Fn, and in how many iterations is then derived based on the interplay between the two errors in equation (1), namely, εt opt and εt stab.

The ERM-based approach: We remark that the decomposition in equation (1) is diﬀerent from that used when invoking the uniform laws for the empirical risk minimizer (ERM). Assuming the sample-based iterates converges to the ERM, i.e., limt F t n(θ0) = bθERM,

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

the typical decomposition in the ERM-based approach is given by

F t n(θ0) θ = F t n(θ0) bθERM | {z } =:εt opt-sample

+ bθERM θ | {z } =:εunif-gen

Here the ﬁrst term in the RHS corresponds to the optimization error at the sample-level at iteration t and the second term corresponds to the (iteration-independent) uniform generalization bound. Depending on the application, a precise characterization of either of these terms can be non-trivial; moreover, applying uniform bounds to control the term εunif-gen may lead to bounds that are overly loose. In such settings, the population-to-sample or stability-based analysis can prove to be a useful alternative.

1.2 Past works focus on stable methods

Most of the past work with the population-to-sample analysis has focused on algorithms whose updates are stable, meaning that the perturbation error between sample-level and population-level iterates decays to zero as the iterates approach the true parameter. For example, the papers (Balakrishnan et al., 2017; Cai et al., To Appear; Yang et al., 2017; Yi and Caramanis, 2015) used this framework for problems where the population updates converge at a geometric rate to the true parameter, and iterates based on n samples yield an estimate within n 1/2 of the true parameter. On the other hand, other papers (Dwivedi et al., 2020a,b) have shown that with over-speciﬁed Gaussian mixtures, the EM algorithm, which is a stable algorithm, takes a large number of steps to ﬁnd an estimate whose statistical error is of order n 1/4 or n 1/8. Although for those problems the larger ﬁnal statistical error of EM is minimax optimal, several natural questions remained unanswered: Can an algorithm converge to a statistically optimal estimate in signiﬁcantly fewer steps than EM for over-speciﬁed mixtures? Moreover, will the faster algorithm continue to be stable? Besides the analysis in recent works (Dwivedi et al., 2020a,b) relied heavily on the facts that the EM updates had closed-form analytical expressions. To our best knowledge, general statistical guarantees for a generic stable or unstable algorithm (without a closed-formed expression) when the algorithmic convergence is slow, are not present in the literature. In past work, Chen et al. (2018b) provided a trade-oﬀbetween stability and number of iterations to converge. In particular, they showed that the minimax error of a problem class forces a trade-oﬀbetween the two errors in equation (1), εt opt and εt stab, for any iterative algorithm used for solving it. In simple words, given the minimax error, an algorithm that converges quickly is necessarily unstable, and conversely, a stable algorithm cannot converge quickly. Their work, however, did not address the following converse questions: Under what conditions does an algorithm, either stable or unstable, achieve a statistically optimal rate? When is an unstable algorithm to be preferred to a stable counterpart? Such questions about the trade-oﬀbetween stability, computational eﬃciency and the statistical error upon convergence are of special interest for singular problems in which the

. There is a subtle diﬀerence in the deﬁnition of (in)stability used in Chen et al. s work (Chen et al., 2018a) compared to ours. In their work, stability refers to a slow growth in the error F t(θ) F t n(θ) with number of iterations t, where slow is deﬁned in a relative sense with other methods. In our case, we use stability for the settings when F(θ) Fn(θ) decreases with θ θ as θ θ .

Instability, Computational Efficiency and Statistical Accuracy

Fisher information matrices are degenerate. Singular problems appear in a wide range of statistical settings, including mode estimation (Chernoﬀ, 1964), robust regression (Rousseeuw, 1984), stochastic utility models (Manski, 1975), informative non-response in missing data (Heckman, 1976; Diggle and Kenward, 1994), high-dimensional linear regression (Hastie et al., 2015), and over-speciﬁed mixture models (Chen, 1995; Rousseau and Mengersen, 2011; Nguyen, 2013). Several papers have shown that maximum likelihood estimates for singular problems have much lower accuracy than the classical parametric rate n 1/2; problems that exhibit slow rates of this type include stochastic frontier models (Lee and Chesher, 1986; Lee, 1993), certain classes of parametric models (Rotnitzky et al., 2000), and in strongly or weakly identiﬁable mixture models (Chen, 1995; Nguyen, 2013; Ho and Nguyen, 2016). Nevertheless, the computational aspects of parameter estimation and the trade-oﬀs with stability in such models are not well understood at the current time.

1.3 Our contributions

This paper lays out a general framework to address the questions raised above. Making use of the population-to-sample approach and a generalization of the localization argument from our previous works (Dwivedi et al., 2020a,b), we derive tight bounds on the statistical error of the ﬁnal iterate produced by an algorithm. The ﬁnal error and the number of steps taken depend on two things: (i) the rate of convergence of the corresponding populationlevel iterates, and (ii) the (in)stability of the sample-level iterates with respect to that at the population-level. As a ﬁrst contribution, our statistical guarantees for slowly converging stable algorithms and (fast/slow converging) unstable algorithms complement the ﬁndings of Balakrishnan et al. (2017) for fast converging stable algorithms (Theorems 5 and 6). We provide an overview of these general results in Table 1. The second contribution extends the work of Chen et al. (2018a) by showing how the ﬁnal statistical errors achieved by stable and unstable algorithms can be used to directly compare and contrast the (dis)advantages between the two (Section 4). Our third contribution is an explicit demonstration of the fact that unstable methods can converge in signiﬁcantly fewer steps when compared to stable methods, while still yielding statistically optimal estimates (Corollaries 7, 8 and 9). In particular, applying our framework to three estimation problems single index models with known link, informative non-response models, and Gaussian mixture models we show that while the (unstable) Newton method converges after on the order of log n steps, there is some q > 0 such that gradient descent which we show to be a stable method takes on the order of nq steps. Finally, we also establish that our guarantees are unimprovable in general on both statistical accuracy, and the iteration complexity.

Organization: The remainder of our paper is organized as follows. We begin in Section 2 with simulations that illustrate the phenomena to be investigated in this paper. We then introduce some deﬁnitions and discuss diﬀerent properties of the sample and population operators. Section 3 is devoted to statements of our general computational and statistical guarantees with detailed proofs presented in Appendix A. In Section 4, we apply our general results to demonstrate the trade-oﬀbetween stable and unstable methods for several examples. We conclude with a discussion of potential future work in Section 5. Proofs of supporting lemmas and technical results are provided in the appendices.

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

Operator Properties Rate Optimization Stability convergence Iterations for

on convergence Statistical error

General expressions

Fast, stable (Balakrishnan et al., 2017) FAST(κ) STA(0) log(1/ε(n, δ)) ε(n, δ)

Slow, stable (Thm. 5) SLOW(β) STA(γ) ε(n, δ ) 1 1+β γβ [ε(n, δ )]

Fast, unstable (Thm. 6) FAST(κ) UNS(γ) log(1/ε(n, δ)) [ε(n, δ)]

Slow, unstable (Thm. 6) SLOW(β) UNS(γ) [ε(n, δ)] 1 1+β [ε(n, δ)]

ε(n, δ) = log(1/δ)/ n

Speciﬁc examples

FAST(κ), STA(0) e κt 1 n log n n 1/2

2), STA(1) 1

t r n n1/2 n 1/4

FAST(κ), UNS(-1) e κt 1 r n log n n 1/4

2), UNS(-1) 1

t 1 r n n1/3 n 1/8

Table 1. A high-level overview of our results. The notation in the problem set-up (columns 2 and 3) is formalized in Section 2.2, and the formal results (columns 4 and 5) are discussed in Section 3. In the top panel, we provide general expressions from our results, and in the bottom panel, we provide some explicit expressions for few speciﬁc settings. The second and third columns respectively denote the optimization and stability properties of the operator, and the last two columns provide the expressions for iterations for convergence, and the ﬁnal statistical errors of the estimate returned the sample-based (noisy) operator (see equations (14) for the deﬁnition of δ ). For the bottom panel, we use β = 1

2, γ = 0, 1 with the noise function ε(n, δ) = log(1/δ)/ n. For brevity, we omit log-factors (in n, δ) and universal constants for the expressions in the bottom panel.

Notation: A few remarks on notation: for a pair of sequences {an}n 1 and {bn}n 1, we write an bn or an = Ω(bn) to mean that there is a universal constant c such that an cbn for all n 1. We write an bn if both an bn and an bn hold. We use x to denote the smallest integer greater than or equal to x for any x R. In the paper, we use c, c , ci, c i when i 1 to denote the universal constants. Note that the values of universal constants may change from line to line. Finally, for our operator notation, we use the subscript n to distinguish a sample-based operator (e.g., Fn, GNM n , MGA n ) from its corresponding population-based analog (respectively F, GNM, MGA).

2. Motivation and problem set-up

We begin in Section 2.1 by motivating the analysis to follow by showing and discussing the results of some computational studies for the class of non-linear regression models. These results demonstrate a wide range of possible convergence rates, and associated stability (or

Instability, Computational Efficiency and Statistical Accuracy

instability) of the operator to perturbations. With this intuition in hand, we then turn to Section 2.2, in which we set up the deﬁnitions that underlie our analysis. In particular, we state the (i) local Lipschitz condition, and (ii) local convergence behavior for the populationlevel operator F, and (iii) the stability and instability condition of the sample-level operator Fn with respect to F.

2.1 A vignette on non-linear regression

We ﬁrst consider a certain class of statistical estimation problems in which there are interesting diﬀerences between algorithms. Here we keep the discussion very brief; see Section 4.3 for a more detailed discussion. We consider a simple type of non-linear regression model, one based on a function f : Rd R that can be written in the form f(x) = g ( x, θ ) for some parameter vector θ Rd, and some univariate function g : R R. In the simplest setting considered here, the univariate function g is known, and we have a parametric family of functions as θ ranges over Rd; when g is unknown, we have a semi-parametric family. Now suppose that we are given a collection of pairs {(Xi, Yi)}n i=1, generated from a noisy regression model of the form

Yi = g ( Xi, θ ) + ξi, for i = 1, . . . , n. (2)

Here ξi is a zero-mean noise variable with variance σ2, which we assume to be independent of Xi. The single index regression model (2) has been studied extensively in the literature (e.g., (Carroll et al., 1997; Ichimura, 1993)). When g is known, a natural procedure for estimating θ is based on minimizing the least squares objective function

i=1 {Yi g ( Xi, θ )}2 . (3)

When the variables ξi are Gaussian, then this objective coincides (up to scaling and constant factors) with the negative log-likelihood function, so that minimizing it yields the maximum likelihood estimate. Under suitable regularity conditions on g in a neighborhood of θ , it is known that it is possible to estimate θ at the usual parametric rate of n 1/2. However, problems can arise when the signal-to-noise ratio (SNR), as measured by the ratio θ 2/σ, tends to zero. In particular, consider a function g whose derivative vanishes at zero that is, g (0) = 0. For instance, the function g(t) = t2, which arises in the application of the non-linear regression framework to the problem of phase retrieval, has this property. Taking the limit of low SNR amounts to trying to estimate the vector θ = 0 based on observations from the model (2). For this type of singular statistical model, we see many interesting diﬀerences between algorithms that might be used to minimize the least-squares criterion (3). More concretely, let us consider three standard optimization algorithms that might be applied to the objective (3): (i) gradient descent; (ii) Newton s method, and; (iii) cubicregularized Newton s method. See Appendix D.3 for a precise description of these algorithms and the associated updates in application to this model.

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

102 103 104 Number of samples n

Statistical error vs n

alg = GD: slope = 0.23 alg = CNM: slope = 0.24 alg = NM: slope = 0.24

102 103 104 Number of samples n

# Iterations to converge

Iteration complexity vs. sample size

alg = GD: slope = 0.44 alg = CNM: slope = 0.10 alg = NM: slope = 0.06

Figure 1. Plots characterizing the behavior of diﬀerent algorithms, namely gradient descent (GD), cubic-regularized Newton s method (CNM), and the vanilla Newton s method (NM) for the non-linear regression model when θ = 0. (a) Log-log plots of the Euclidean distance bθn θ 2 versus the sample size. It shows that all the algorithms converge to an estimate at Euclidean distance of the order n 1/4 from the true parameter θ . (b) Log-log plots for the number of iterations taken by diﬀerent algorithms to converge to the ﬁnal estimate. Newton s method takes the least number of steps. On the other hand, gradient descent takes signiﬁcantly larger number of steps, with an empirical scaling close to n.

Statistical and iteration complexity of optimization algorithms: For each procedure, we are interested both in the associated statistical error that is, the Euclidean distance between their output and the true parameter θ and their iteration complexity, meaning the number of iterations required to converge. In order to gain some understanding, we performed some simulations for non-linear regression based on the function g(t) = t2 in dimension d = 1, over a range of sample sizes n. Figure 1 provides some plots that summarize some results from these simulations. Panel (a) plots the Euclidean error associated with the estimate versus the sample size n on a log-log plot, along with associated least-squares ﬁts to these data. As can be seen, all three methods lie upon a line with slope 1/4 on the log-log scale, showing that the statistical error decays at the rate n 1/4. This slow rate to be contrasted with the usual n 1/2 parametric rate is a consequence of the singularity in the model. Panel (b) plots the iteration complexity of the three algorithms versus the sample sizes, again on a log-log plot. For a given problem based on n samples, the iteration complexity is the number of iterations required for the distance between the iterate and θ

to drop below n 1/4. Here we see some interesting diﬀerences, with the gradient method having an empirical iteration complexity that grows as n0.44, based on our ﬁts, with the two forms of Newton s method having much milder growth in iteration complexity. In the theory to follow, we will prove that iteration complexity for the gradient method scales at most like n, that of the cubic-regularized Newton method scales as n1/6, whereas that of Newton s method scales only as log n. (See Corollary 9 for a precise statement.)

Behavior of optimization operators: The plots in Figure 1 all concern the behavior of algorithms in practice, as applied to the empirical objective function, and our ultimate goal is to provide a theoretical explanation of phenomena of these types. In order to do so, our

Instability, Computational Efficiency and Statistical Accuracy

analysis makes use of the population-level algorithms obtained in the limit of inﬁnite sample size; i.e., n . In the special case of the non-linear regression model considered here, we refer the Appendix D.3 for the precise forms of these operators (cf. equations (107a) (107c)). The plots in Figure 2 illustrate the two properties of the operators that underlie our theoretical analysis: convergence rate of the population operators (panel (a)), and the stability of the empirical operators relative to the population version (panel (b)).

0 10 20 Iteration number t

Convergence rates of population operators

θt+1 = F GD(θt)

θt+1 = F CNM(θt)

θt+1 = F NM(θt)

0.10 0.15 0.20 0.25 0.30 Perturbation

F alg n (θ + ) F alg(θ + )

Stability of empirical operator

alg = GD alg = CNM alg = NM

Figure 2. Exploration of the population level updates, and their connection to the empirical updates for the non-linear regression problem. (a) Plots showing the convergence rate of the error θt θ for diﬀerent algorithms namely gradient descent (GD), standard Newton s method (NM), and cubic-regularized Newton s method (CNM) applied at the population level (limit of inﬁnite sample size). Notice the log-scale on the y-axis. The sequence from the Newton s method converges a geometric rate to θ , whereas the gradient method converges at a sub-linear rate. (b) Plots showing the scaling of the perturbation error Fn(θ + ) F(θ + ) versus the perturbation . For an unstable operator, the perturbation error can increase as 0, with Newton s method showing a strong version of such instability. In contrast, the gradient descent method is a stable procedure in this setting.

The plots in panel (a) reveal that the three algorithms diﬀer dramatically in their convergence rate at the population level. The ordinary Newton updates converge at a geometric rate, with the distance to the optimum θ decreasing as κt with the number of iterations t, where κ (0, 1) is a contraction coeﬃcient. In contrast, the other two algorithms exhibit an inverse polynomial rate of convergence, with the distance to optimality decreasing at the rate 1/tβ for some exponent β > 0. In the analysis to follow, we prove that gradient descent has inverse polynomial decay with exponent β = 1/2, whereas the cubic-regularized Newton updates exhibit inverse polynomial decay with exponent β = 2. In Corollary 9 and its proof, we characterize the optimization rate (algorithmic rate of convergence), the stability and the ﬁnal statistical error obtained by these three methods. For reader s convenience, we summarize these results in Table 2.

2.2 Problem set-up

Having provided a high-level overview of the phenomena that motivate our analysis, let us now set up the problem more abstractly, and introduce some key deﬁnitions. Consider an operator F that maps a space Θ to itself; typical examples of the space Θ that we consider

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

Algorithm Optimization Rate Stability Iterations for convergence Statistical error on convergence

Gradient descent 1

t r n n1/2 n 1/4

Newton s method e κt 1 r n log n n 1/4

Cubic-regularized Newton s method 1 t2 1 r n n1/6 n 1/4

Table 2. Overview of results illustrated in Figures 1 and 2 for non-linear regression model with the link function g(t) = t2 and θ = 0. By characterizing the optimization rate and stability precisely, and invoking our general theory (summarized in Table 1), we establish that while the three methods diﬀer signiﬁcantly in terms of their optimization rate and stability, they achieve the same statistical error upon convergence, albeit by taking diﬀerent number of iterations to converge. We omit logarithmic factors and universal constants for brevity. See Corollary 9 and its proof for precise details.

are subsets of the Euclidean space Rd, and subsets of symmetric matrices. Let θ be a ﬁxed point of the operator i.e., an element θ Θ such that F(θ ) = θ . The challenge is that we do not have access to the operator F directly, but rather can observe only a random operator Fn that can be understood as a noisy estimate of F. Throughout, we call F the population operator and Fn the empirical operator. Using the empirical operator, we generate a sequence of iterates via the ﬁxed-point updates

θt+1 n = Fn(θt n) for t = 1, 2, . . ., (4)

with a suitable initialization θ0 n Θ. Our goal is to determine conditions under which the sequence {θt n}t 0 approaches a suitably deﬁned neighborhood of θ . More precisely, for any given triple (F, Fn, t) we provide a sharp characterization of the optimality gap θt n θ 2 as a function of the iteration count t and the error F Fn 2 of the empirical operator Fn. One interesting class of problems where the operators F and Fn arise naturally is estimation problems in statistics and machine learning. More concretely, consider the problem of ﬁnding the unique minimizer θ of an objective function L : Θ R. In practice, we do not know the true objective function L, instead we have access to an approximate (random) objective function Ln, which is an unbiased estimate of the true objective function L. Given the pair (L, Ln), we can obtain diﬀerent operators F by applying various optimization algorithms to minimize L, including gradient methods, proximal methods, the EM algorithm and related majorization-minimization algorithms, as well as Newton and other higher-order methods. The noisy operators Fn are obtained by applying the same optimization algorithms to the approximate objective function Ln.

. For ease of exposition, going forward the index n is synonymous with the sample size that deﬁnes the operator Fn; while our general results, namely, Theorems 5 and 6 do not rely on use of this simpliﬁcation.

Instability, Computational Efficiency and Statistical Accuracy

2.2.1 Properties of the operator F

We begin by formalizing some properties of the operator F. We assume that the operator F has a unique ﬁxed point θ in the local neighborhood of the Euclidean ball

B(θ , ρ) := n θ Θ | θ θ 2 ρ o (5)

centered at θ and we study its behavior in that local neighborhood. Our ﬁrst condition is a standard Lipschitz condition on the operator F. In particular, we say that the operator F is 1-Lipschitz in norm over the ball B(θ , ρ) if

F(θ1) F(θ2) θ1 θ2 for all θ1, θ2 B(θ , ρ). (6)

In words, the 1-Lipschitz condition guarantees that the operator F is non-expansive with respect to perturbations of its argument. Our next two deﬁnitions distinguish between fast and slow rates of convergence. The ﬁrst deﬁnition captures an especially favorable property of operator F; namely, it is locally contractive around the ﬁxed point θ . The second deﬁnition considers a substantially slower (sub-linear) rate of convergence of the operator F.

Deﬁnition 1 (Fast convergence) For a contraction coeﬃcient κ (0, 1), the operator F is FAST(κ)-convergent on the ball B(θ , ρ) if

F t(θ0) θ κt θ0 θ for all iterations t = 1, 2, . . ., (7)

and for all θ0 B(θ , ρ).

Deﬁnition 2 (Slow convergence) Given an exponent β > 0, the operator F is SLOW(β)- convergent over the ball B(θ , ρ) means that

F t(θ0) θ c

tβ for all iterations t = 1, 2, . . ., (8)

and for all θ0 B(θ , ρ), where c is a universal constant.

These notions of fast and slow convergence are ubiquitous in analysis of iterative methods, especially in the optimization literature. For example, when the operator F corresponds to gradient descent for some objective L, a suﬃcient condition for fast convergence is local strong convexity of the objective L, and if L is just convex, F satisﬁes slow convergence. Let us now illustrate these deﬁnitions with a simple example.

Example 1 (Fast versus slow convergence) Consider the function L(θ) = θ2p

2p for some positive integer p 1. Note that for any p 1, the function L( ) has a unique global minimum at θ = 0. The ﬁrst two derivatives of L( ) are given by

L (θ) = θ2p 1, and L (θ) = (2p 1)θ2p 2.

Consequently, a gradient descent update with a constant stepsize h > 0 takes the form

F GRD(θ) = θ h L (θ) = θ 1 hθ2p 2 . (9)

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

Thus, when p = 1, for any h (0, 1), this gradient descent update is a FAST(κ)-convergent algorithm with κ = 1 h. On the other hand, for any p 2, it can be shown that gradient descent is SLOW(β)-convergent with parameter β = 1 2p 2 in the ball B(θ , ρ) with θ = 0 and

ρ = h 1 2p 2 . Now, let us consider Newton s method with step size one, namely the update

F NWT(θ) = θ L (θ) 1L (θ) = θ θ2p 1

(2p 1)θ2p 2 = θ 1 1 2p 1

For p = 1, this update converges in a single step (simply because the quadratic approximation that underlies Newton s method is exact in this special case). For p 2, the pure Newton update is FAST(κ)-convergent with κ = 1 1 2p 1 for all θ R.

2.2.2 From the empirical operator Fn to the population operator F

In this section, we introduce some key concepts that characterize the (in)-stability of the sample operator Fn with respect to the population operator F. Given a pair of operators (Fn, F) and a tolerance parameter δ (0, 1), our deﬁnitions involve a perturbation function ε( ) that maps the triple (Fn, F, δ) to a positive (deterministic) scalar ε(n, δ). In general, we impose the following conditions on the perturbation function ε( ):

It is decreasing in n for any ﬁxed δ, and is monotonically increasing in δ for any ﬁxed n.

For any ﬁxed δ (0, 1), we have ε(n, δ) 0 as n , and similarly, for any ﬁxed n > 0, we have ε(n, δ) as δ 0.

Note that ε(n, δ) = p

log(1/δ)/n would satisfy these requirements. Given some choices of perturbation function, we can deﬁne our ﬁrst stability condition as follows:

Deﬁnition 3 (STA(γ)-Stability) For a given parameter γ 0, the operator Fn is STA(γ)- stable over B(θ , ρ) with noise function ε( ) means that, for any radius r (0, ρ) and tolerance δ (0, 1), we have

P h sup θ B(θ ,r) Fn(θ) F(θ) c2 min n rγε(n, δ), r oi 1 δ, (11)

for some positive universal constant c2.

Informally, given the randomness of the data associated with the operator Fn the stability condition (11) guarantees that with high probability, the error Fn(θ) F(θ) is upper bounded by c2 min{rγε(n, δ), r} uniformly over a disk of radius r. Note moreover that the upper bound decays to 0 as the radius r 0+. Next we consider the case when γ < 0, i.e., the perturbation error Fn(θ) F(θ) blows up as θ gets close to θ . We refer to such operators as unstable operators. Given radii r1, r2 such that r2 > r1 0, let A(θ , r1, r2) = B(θ , r2)\B(θ , r1) denote the annulus around θ

with inner and outer radii r1 and r2 respectively.

Instability, Computational Efficiency and Statistical Accuracy

Deﬁnition 4 (UNS(γ)-Instability) For a given parameter γ < 0 and radii 0 < ρin < ρout, we say that the operator Fn is UNS(γ)-unstable over the annulus A(θ , ρin, ρout) with noise function ε( ) if

sup θ A(θ ,r,ρout) Fn(θ) F(θ) ε(n, δ) max 1

r|γ| , ρout

for any radius r [ρin, ρout] and any tolerance δ (0, 1).

Two remarks are in order: First, note that the main diﬀerence between STA(γ)and UNS(γ)is how the error scales with the radius r as it gets smaller. For stable operators, the error decreases with scaling rγ, while for unstable operators the error blows up as r |γ|

(where we use |γ| for clarity). There is another subtle diﬀerence: the condition (12) deﬁnes the instability of the perturbation error Fn(θ) F(θ) in an annulus with the inner radius bounded below by ρin, and does not characterize the behavior as the distance θ θ 0. Let us now illustrate these deﬁnitions by following up on Example 1.

Example 2 (Stable versus unstable updates) Consider an empirical function of the form

2 nθ2, where w N(0, 1). (13)

Here p 2 is a positive integer. Note that E[Ln(θ)] = 1 2pθ2p, which is equivalent to the population likelihood function considered in Example 1. A gradient update with stepsize h > 0 on the empirical objective leads to the empirical gradient operator

F GRD n (θ) = θ 1 hθ2p 2 h σw n

Comparing with equation (9), we obtain that |F GRD n (θ) F GRD(θ)| = σ n|w| |θ|. Since

log(1/δ) with probability at least 1 δ, we see for any ρ > 0 and n 16σ2 log(1/δ), the operator F GRD n is STA(γ)-stable with parameter γ = 1, with respect to the noise function

ε(n, δ) = 4σ

As for the Newton update for the problem (13), we have

F NWT n (θ) = θ θ2p 1 + σwθ/ n (2p 1)θ2p 2 + σw/ n,

|F NWT(θ) F NWT n (θ)| = (2p 2)

(2p 1) σ |w| |θ| / n (2p 1)θ2p 2 + σw/ n.

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

Recall that |w| 4 p

log(1/δ) with probability at least 1 δ. Plugging in w > 4 p

log(1/δ) in the denominator and w < 4 p

log 1/δ of the RHS, and doing some algebra yields that

|F NWT(θ) F NWT n (θ)| cp

n for |θ| > c pσ

where cp = 16(p 1)

2p 1 and c p = 8 2p 1 . Thus, we conclude that the operator F NWT n is UNS(γ)- unstable with parameter γ = 1 over the annulus A(θ , ρin, ρout) with noise function ε where

1 2p 2 , ρout = , and ε(n, δ) = cpσ

2.2.3 Comparison of our assumptions with empirical process literature

We note that while the deﬁnition of STA(γ) is reminiscent of the typical assumptions in empirical process literature, there is a subtle diﬀerence in our set-up. In a typical statistical learning problem, the following assumptions are commonly made on: (a) the local curvature of the population objective function (e.g., the expected negative log-likelihood L), and (b) bounds on the perturbation error between the population and sample objective functions (e.g., supθ B(θ ,r) |L(θ) Ln(θ)|). With these assumptions, the statistical guarantees for the critical points (e.g., the maximum likelihood estimate (MLE)) are then established. See, e.g., Theorem 3.2.5 (van der Vaart, 1998). Such a framework is oblivious about any computational aspect of the problem (e.g., how the MLE is computed), which is one of the key focus in our work. Our goal is to study the interplay of computational-statistical tradeoﬀs between various algorithms that are used to solve these learning problems. In particular, our aim is to identify the number of iterations taken by an algorithm, and the ﬁnal statistical accuracy of the estimate returned by it. Consequently, our conditions are deﬁned in terms of operators that correspond to the algorithm employed by the user to solve the problem at hand, rather than the landscape of the objective itself. In particular, in place of the curvature condition (a) on L, we make assumptions on the convergence rate of the population operator F (SLOW(β)/FAST(κ)). And, in place of the perturbation bounds on the objective functions (L, Ln), we make assumptions on the operator perturbation errors between F and Fn as in equations (11) and (12) (STA(γ)/UNS(γ)).

2.2.4 Further discussion on our definitions

In several cases (also applicable to all examples in this paper), the user often knows (by design) the explicit relationship between the operators F and Fn and the corresponding objectives L and Ln, e.g., when F and Fn correspond to gradient ascent (GA) or Newton s method (NM). In these situations, often it is possible to derive whether STA(γ) or UNS(γ) conditions are satisﬁed given the assumptions on the curvature of L and the perturbation error between L and Ln as in the empirical process literature. Our framework allows the user to simultaneously study the tradeoﬀs between the ﬁnal statistical error and the computational budget needed between several algorithms at once. For example, we show in several settings that NM while being unstable provides computational beneﬁts compared

Instability, Computational Efficiency and Statistical Accuracy

to its stable counterpart GA, since both NM and GA yield an estimate with comparable statistical error upon convergence while the former takes very few steps (although such a condition is not guaranteed always). We remark that the property UNS(γ) of the operators F and Fn as introduced above has not been commonly used in prior work, while STA(γ) has appeared often in prior works (albeit in slightly diﬀerent forms, (Balakrishnan et al., 2017; Chen et al., 2018a; Dwivedi et al., 2020a)). The condition STA(0) is perhaps the most common, which holds for most well-conditioned problems (when the Fisher information matrix is invertible). In such settings, the commonly used methods like GA and NM are also FAST(κ) operators so that the ﬁnal statistical error is of order ε(n, δ) which is obtained in roughly log(1/ε(n, δ)) steps (Balakrishnan et al., 2017). In simple words, the statistical-computational tradeoﬀs across several algorithms are fairly similar for such cases. On the other hand, operators with SLOW(β), and STA(γ) with γ 1, would typically arise when the log-likelihood is not well-conditioned and one uses methods like GA, and UNS(γ) with γ < 0 would appear in such a setting when one uses a higher-order optimization scheme like NM to solve these ill-conditioned problems. So far, there is a limited understanding of the statistical-computational tradeoﬀfor slowly converging stable algorithms as well as any unstable algorithm that can arise in such settings. When the population operator has a unique ﬁxed point, our main results provide a fairly comprehensive understanding towards this end. Let us revisit Examples 1 and 2 which serve as motivating examples for the various ill conditioned settings: Suppose that the population negative log-likelihood is given by L(θ) = θ2p/(2p) (and the true parameter is θ = 0), and the sample negative log-likelihood is given by Ln(θ) = θ2p/(2p)+σwθ2/(2 n). For this setting, the population Fisher information (the second derivative of L) is given by (2p 1)θ2p 2, and the perturbation term σwθ2/(2 n) between the two objectives mimics the intuition that the ﬁnite sample Fisher information would typically have 1/ n ﬂuctuations around its population-level objective with n samples. With p = 1, it is easy to establish that the operators F and Fn corresponding to both GD and NM are FAST(κ) and STA(0) operators. However, for p 2, as our earlier computations illustrated, we observe a more interesting set of behaviors with GD and NM. In particular, the operators corresponding to GA are SLOW(β) and STA(γ) with γ > 1, and the NM operators exhibit a FAST(κ) and UNS(γ) with γ < 0 behavior. The theory to follow provides a precise characterization of the statistical error achievable and the computational budget needed for convergence in such settings.

3. General convergence results

With the deﬁnitions from the previous section in place, we are now ready to state our main results. In Section 3.1, we consider the case when Fn is a stable perturbation of F, and in Section 3.2, we consider the case when it is an unstable perturbation of F. We summarize our ﬁndings in Table 1.

3.1 Results for slowly converging but stable operators

We ﬁrst consider the setting in which the sample-based operator Fn is a stable perturbation of the population-level operator F. If, in addition, we assume that the operator F has

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

fast convergence (cf. equation (7)), then past work is applicable. In particular, Theorem 2 of Balakrishnan et al. (2017) provides a precise characterization of the convergence behavior of iterates from the empirical operator Fn. Here we instead consider the more challenging setting in which the operator F exhibits slow convergence to θ . Analysis of this slow convergence case requires rather diﬀerent techniques than those used to analyze the fastconvergent case. Let us collect the assumptions needed to state our ﬁrst result. The ﬁrst two assumptions involve the Euclidean ball B(θ , ρ) centered at θ of some ﬁxed radius ρ > 0.

(A) The population operator F is 1-Lipschitz (6) and is SLOW(β)-convergent (8) over the ball B(θ , ρ).

(B) There is some γ [0, β 1] such that the empirical operator Fn is STA(γ)-stable (11) over B(θ , ρ).

(C) The tolerance parameters δ (0, 1) and α (0, β 1+β γβ) are ﬁxed and the sample size is large enough such that

ε(n, δ ) c where δ := δ log(1+β

8 log( β α(1+β γβ)) , (14)

and c (0, 1) is a suﬃciently small constant.

Assumptions (A) and (B) quantify, respectively, the convergence behavior of the operator F and the stability of the operator Fn; Assumption (C) is a book-keeping device needed to state our results cleanly. Given the above conditions, we now state our ﬁrst main result.

Theorem 5 Under Assumptions (A), (B), and (C), consider the sequence θt+1 n = Fn(θt n) generated from an initialization θ0 n B(θ , ρ/2). Then there is a universal constant c such that for any ﬁxed α (0, β 1+β γβ) and uniformly for all iterations t c 1/ε(n, δ )) 1 1+β γβ log 1

θt n θ 2[ε(n, δ )]

β 1+β γβ α with probability at least 1 δ. (15)

Let us make some comments on this result (see Appendix A.1 for a detailed proof).

Tightness of Theorem 5: Disregarding the term α and constants, the bound (15) guarantees that the sequence θt+1 n = Fn(θt n) converges to a statistical tolerance of order

β 1+β γβ with respect to θ in order [ε(n, δ )] 1 1+β γβ step. This guarantee turns out to be unimprovable under the given assumptions. In Appendix B (see Proposition 12), we construct a family of examples with the operators F, Fn and noise functions εn (constant with respect to δ) satisfying the assumptions for Theorem 5, such that the following additional results hold:

β 1+β γβ n for all t 1,

β 1+β γβ n for all t c ε 1 1+β γβ n .

As a result, we conclude that the results of Theorem 5 are tight for both statistical accuracy and the number of iterations needed for convergence.

Instability, Computational Efficiency and Statistical Accuracy

Relation to prior work: As noted earlier, prior work (Balakrishnan et al., 2017) shows that when the operator is F is FAST(κ)-convergent, and Fn is a STA(0) perturbation of Fn with error ε(n, δ), we have θT n θ ε(n, δ) for T log(1/ε(n, δ)). On the other hand, Chen et al. (2018a) argued that given the minimax error of a problem class, a fast converging algorithm can not be too stable. Neither of these works provided a precise characterization of what statistical errors are achievable when dealing with a slow converging stable operator, which is the focus of our Theorem 5.

A direct sub-optimal proof argument: Let us ﬁrst illustrate how a naive argument that tries to directly tradeoﬀthe perturbation error of Fn with the convergence rate of F leads to a sub-optimal guarantee. Let the assumptions in Theorem 5 remain in force. Roughly speaking, one can show that (cf. Lemma 10), the operator F t n is also STA(γ) perturbation of F t with the noise function t ε(n, δ), so that we can bound the error at iteration t as follows:

θt n θ = F t n(θ0 n) θ F t n(θ0 n) F t(θ0 n) + F t(θ0 n) θ t Cρ ε(n,δ)+ 1

where Cρ = ργ denotes a constant corresponding to the radius ρ of initialization. Minimizing the last bound in the display above over the iteration index t, we ﬁnd that the best possible error is of order (ε(n, δ))β/(1+β). This rate is clearly sub-optimal when compared to the statistical error of order (ε(n, δ))β/(1+β γβ) (unless γ = 0) guaranteed by display (15) from Theorem 5. The reason for sub-optimality of this bound is our failure to localize the argument with the perturbation error as the iterates θt n converge closer to θ .

Outline of proof: In order to derive the sharp guarantee, we need to establish a more reﬁned tradeoﬀthan that in equation (16). To this end, we generalize and reﬁne the annulusbased localization argument introduced in our prior work on the EM algorithm (Dwivedi et al., 2020a,b). In the past work (Dwivedi et al., 2020a,b), we studied particular instantiations of the EM algorithm, for which the operators F and Fn had closed-form solutions. Here in the absence of closed-form expressions, the argument is necessarily more abstract to establish a sharp guarantee under the more general Assumptions (A), (B), and (C), which also handles the previous analysis as a special case (as illustrated in Section 4.2). At a high-level, the proof proceeds by decomposing the total collection of iterations {1, 2, . . . , t} into a disjoint partition of subsets {Tℓ}ℓ 0, referred to as epochs, where the nonnegative integers ℓand Tℓrespectively denote the index of a given epoch and the number of iterations in that epoch. We use Sℓ:= Pℓ i=0 Ti to denote the total number of iterations up to epoch ℓ. By carefully choosing the sequence {Tℓ}ℓ 0, we ensure that at the end of a given epoch ℓ, the error θSℓ n θ has decreased to a prescribed threshold. More precisely,

. In several statistical settings, the error function satisﬁes ε(n, δ) = c p

log(1/δ)/n. When the problems are well-conditioned, e.g., if the Fisher information matrix is invertible while estimating MLE, we typically have FAST(κ) and STA(0) condition for commonly used algorithms like gradient ascent, and Newton s method. In general, we do not expect a setting where the operators are FAST(κ) and STA(γ) with γ 1. Although, one can construct pathological examples, in which case, our localization argument augmented with the earlier proofs by Balakrishnan et al. (2017) would yield that θt n t θ , i.e., the statistical error converges to zero as the number of iterations goes to (even with ﬁnite samples).

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

using an inductive argument on ℓ, we show that

θSℓ n θ ε(n, δ )λℓ for all epoch ℓ 1, (17)

where the sequence {λℓ}ℓ 0 is deﬁned via the recursion

λ0 = 0 and λℓ+1 = νλℓ+ ν , for all ℓ 1, (18)

with the scalars ν (0, 1) and ν > 0 determined by the problem parameters β and γ. We show that the sequence {λℓ}ℓ 0 converges to ν := β 1+β γβ fast enough and we have |λℓ ν | α for all ℓ O (log(1/α)). Deriving a suitable upper bound on Tmax on the epoch size Ti, we then put the pieces together to (roughly) conclude that

θt n θ cε(n, δ )ν α for t c Tmax log 1

As expected, much of the technical work is required to establish the inductive step. The full proof of the theorem is given in Appendix A.1. We also illustrate the high-level ideas of the epoch-based localization argument in Figure 3.

3.2 Results for unstable operators

We now turn to our next main result which characterizes the convergence when the operator Fn is an unstable perturbation of the operator F. We consider two distinct cases depending on whether the operator F is (a) FAST(κ)-convergent or (b) SLOW(β)-convergent. In order to obtain sharp upper bounds ones that depend purely on the noise function ε the inner radius for the instability of the operator Fn around the operator F, which we refer to as eρn, must be sample size-dependent and chosen suitably.

Theorem 6 For a given parameter δ (0, 1), consider the sequence θt+1 n = Fn(θt n) for some initial point θ0 n in the ball B(θ , ρ/2). Suppose that for some γ < 0, the empirical operator Fn is UNS(γ)-unstable over the annulus A(θ , eρn, ρ) with respect to the noise function ε.

(a) Suppose that the operator F is FAST(κ)-convergent over the ball B(θ , ρ), and the sample size n is suﬃciently large so as to ensure that

1 1+|γ| (1 κ)ρ. (19a)

Then with probability at least 1 δ, for any iteration t log( ρ ε(n,δ))

(1+|γ|) log 1

κ , we have

min k {0,1,...,t} θk n θ max (2 κ)

(1 κ) [ε(n, δ)]

1 1+|γ| , eρn

(b) Suppose that the operator F is 1-Lipschitz and SLOW(β)-convergent for some β > 0, and that the sample size n is large enough to ensure that

β 1+β γβ ρ. (20a)

Then with probability at least 1 δ, for any iteration t 1

[ε(n,δ)] 1 1+β , we have

min k {0,1,...,t} θk n θ max n [ε(n, δ)]

β 1+β γβ , eρn o . (20b)

Instability, Computational Efficiency and Statistical Accuracy

F 2( 0) F t( 0)

Proof sketch for epoch

n( 0) ?k k F t

n( 0) F t( 0)k + k F t( 0) ?k trγ" + 1

tβ = t"γλ +1 + 1

min over t "λ +1

where λ +1 = λ + 0 =) lim

!1 λ = ? = β 1 + β γβ , and ? λ for all O(log(1/ ))

Figure 3. An illustration of the epoch-based argument when the population operator F is SLOW(β)-convergent, and the noisy operator is STA(γ)-stable (Theorem 5). In order to simplify the visualization, we use the shorthand ε = ε(n, δ ). Moreover, here θ0 denotes the starting point for a given epoch ℓ(assumed to be at distance r = ελℓfrom θ ), and the iterations 1, 2, . . . , t denote the iteration count in that epoch. The population iterates F 1(θ0), F 2(θ0), . . . converge towards to θ at the rate t β (shown in blue), and their distance from the noisy iterates F 1 n(θ0), F 2 n(θ0), . . . grows at the rate at a distance of trγε. Trading-oﬀ the two errors, we can show that at the end of epoch ℓ(denoted by a suitable choice of t), the distance F t n(θ0) θ ελℓ+1. By establishing that λℓconverges to ν exponentially fast, and that similar arguments can be made for suﬃciently many epochs, we obtain the result in Theorem 5. See Appendix A.1 for a formal argument.

Let us make a few comments about these bounds. (See Appendix A.2 for a detailed proof.)

Choice of the inner radius eρn: Focusing on part (a), if we ensure that eρn [ε(n, δ)] 1 1+γ , then we obtain an upper bound on the error that involves only the noise function. We show how to make such choices in our applications of this general theorem. A similar statement applies to part (b) of the theorem.

Tightness of Theorem 6: In Appendix B, we construct examples of the operators F and Fn which satisfy the assumptions of Theorem 6, and with the inner radius satisfying the bound eρn [ε(n, δ)]τ, τ = 1 1+γ for part (a) or τ = β 1+β γβ for part (b). For each of these examples, we show that the sequence θt+1 n = Fn(θt n) satisﬁes the lower bound

θt n θ [ε(n, δ)]τ for all t 0,

with constant probability. Thus, we conclude that the results of Theorem 6 are tight and not improvable in general.

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

Necessity of the minimum: Note that both of the bounds (19b) and (20b) apply to the minimum over all iterates k {1, 2, . . . , t}, as opposed to the ﬁnal iterate t. For this reason, our results only guarantee that the iterates produced by an unstable operator Fn converge at least once to a vicinity of the parameter θ , but not that they necessarily stay there for all the future iterations. In fact, such escape behavior for an unstable algorithm is unavoidable in the absence of any additional regularity assumptions. In particular, we provide a simple example in Appendix B.4 that illustrates this unavoidability.

Additional regularity condition: If we impose an additional regularity condition, then we can remove the minimum from the guarantee. In particular, consider the condition:

(D) There exists a universal constant C such that for a given initialization θ0 n, the sequence θt n = F t n(θ0 n) has the following property:

θt+1 n θ Ceρ whenever θt n θ eρ, (21)

where the radius eρ corresponds to equation (19b) or (20b) as relevant to F.

Under this condition, it is straightforward to modify the proof of Theorem 6 to show that the bounds in both parts (a) and (b) can be sharpened by replacing the term mink {0,1,...,t} θk n θ with θt n θ . In Section 4 to follow, we provide a number of examples for which Assumption (D) is satisﬁed.

4. Some concrete results for speciﬁc models

In this section, we study three interesting classes of statistical problems that fall within the framework of the paper. We also discuss various consequences of Theorems 5 and Theorem 6 when applied to these problems.

4.1 Informative non-response model

In our ﬁrst example, let us consider the problem of biased or informative non-response in sample surveys. In certain settings, the chance of a response to not be observed depends on the value of the response. This form of non-response introduces systematic biases in the survey and associated conclusions (Heckman, 1976). Some examples where this issue arises include longitudinal data (Diggle and Kenward, 1994), housing surveys and election polls (Shaiko et al., 1991). In such settings, it is common practice to estimate the nonresponsive behavior in order to correct for the bias. We now describe one simple formulation of such a setting. Suppose that we have n i.i.d. values Y1, . . . , Yn for the response variable Y N(µ, σ2), where for each Yi there is a chance that the value is not observed. To account for such a possibility, we deﬁne {0, 1}-valued random variables Ri for i = 1, . . . , n as follows:

Ri = 1 if Yi is observed, and Ri = 0 otherwise. (22a)

We assume that the conditional distribution Ri|Yi takes the form

Pθ(Ri = 1|Yi = y) = exp H θ(y µ)/σ , (22b)

Instability, Computational Efficiency and Statistical Accuracy

where H is a known function and θ is an unknown parameter which controls the dependence of the probability of non-response on the observation Y = y. In a general setting, all the parameters µ, σ and θ are unknown and are estimated jointly from the data. However, to simplify our presentation, we assume that the parameters (µ, σ) are known and only θ needs to estimated. In particular, we consider the case when the response variable Y N(µ, σ2) N(0, 1) and H(x) = x2 log 2. Under these assumptions, simple algebra yields that

Pθ(Ri = 1|Yi = y) = exp θ2y2

2 log 2 and Pθ(Ri = 1) = 1 2

θ2 + 1 . (22c)

Given n i.i.d. samples {Ri, Yi}n i=1, where we note that Yi is not observed when Ri = 0, the log-likelihood is given by

i=1 Ri Y 2 i (θ2 + 1) + 2 log 2

2 + (1 Ri) log 1 1 2

Note that the likelihood above does not depend on the unobserved Yi since Ri = 0 makes the contribution of the corresponding term 0. In the remainder of this section, we focus on the singular regime, i.e., when the true parameter θ = 0 and consequently the probability of observing any sample Yi = y is always 1/2 (independent of the value y). For such a setting, the results of Rotnitzky et al. (2000) imply that the statistical error of the MLE is larger than the parametric rate n 1

2 . In particular, they showed that |bθn,MLE θ | = O(n 1

4 ). However, with high probability, the log-likelihood Ln is non-concave and thereby a closed-form for the maximum-likelihood estimate is not available. Thus a theoretical analysis of the estimates obtained via diﬀerent optimization algorithms (that can be used to maximize the log-likelihood Ln) can be of signiﬁcant interest. We now apply our general theory to analyze two optimization methods: (i) gradient ascent, and (ii) Newton s method.

4.1.1 Theoretical guarantees

We now state a theoretical guarantee on the behavior of the optimization algorithms in practice with the informative non-response model (22) that is, when applied to the sample log likelihood (23). We analyze the gradient ascent updates for a step-size η (0, 8

3), and the pure Newton updates. We use MGA n and MNM n respectively to denote the sample-based operators for gradient ascent and Newton s method (see Appendix D.1 for the precise form of these operators). The following statement also involves other universal constants c, ci, c i, c i etc.

Corollary 7 For the singular setting of informative non-response model (θ = 0) and given some δ (0, 1), the following properties hold with probability at least 1 δ:

. For instance, when Pn i=1 Ri(Y 2 i + 1) < n, the sample log-likelihood function is bimodal and symmetric around 0.

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

(a) For any ﬁxed α (0, 1/4) and initialization θ0 B(θ , 1/2), the sequence θt := (MGA n )t(θ0) of gradient iterates satisﬁes the bound

log(log(1/α)

4 α for all iterates t c 1 n log 1

as long as n c 1 log log(1/α)

(b) For any initialization θ0 A(θ ,

2c (log(1/δ)/n)1/4 , 1/2), the sequence of Newton iterates θt := (MNM n )t(θ0) satisﬁes the bound

4 for all iterates t c 2 log n, (24b)

as long as n c 2 log(1/δ).

See Appendix D.1 for the proof of this corollary (and below for the proof sketch).

Corollary 7 shows that given n samples, (i) the ﬁnal statistical errors achieved by the iterates generated by the gradient descent and the Newton s method are similar (of order n 1

4 ), and (ii) the Newton s method takes a considerably smaller number (of order log n) of steps in comparison to that taken by gradient ascent (of order n). Finally, in Appendix D.1, we show that all the non-zero ﬁxed points of the considered operators have a magnitude of the order n 1

4 with constant probability. Therefore, the statistical radius achieved by the given optimization methods are optimal.

4.1.2 Proof sketch for Corollary 7

Our proof of Corollary 7 starts with an analysis of the gradient ascent and Newton iterates on the population-level analog of the problem. In particular, taking expectations in equation (23), we obtain the following population-level optimization problem

max θ R L(θ) where L(θ) = 1

2 log 1 1 2

Let MGA denote the gradient update operator applied to this objective with a given stepsize η, and let MNM denote the Newton update. In Appendix D.1 (where we also provide explicit forms of these operators), we show that with θ = 0, the population-level operators have the following properties:

(P1) The gradient operator MGA is SLOW(β)-convergent with parameter β = 1

2 over the Euclidean ball B(θ , 1

2), i.e., for the sequence θt = (MGA)t(θ0) with θ0 B(θ , 1

2), we have θt θ c t1/2 .

(P2) The Newton operator MNM is FAST(κ)-convergent with parameter κ = 4

5 over the Euclidean ball B(θ , 1

2), i.e., for the sequence θt = (MNM)t(θ0) with θ0 B(θ , 1

2), we have θt θ c e κt.

Instability, Computational Efficiency and Statistical Accuracy

Moreover in the same Appendix D.1, we show that with the noise function ε(n, δ) = q

n , the sample-level operators satisfy the following properties:

(S1) The sample-based gradient ascent operator MGA n is STA(γ)-stable with parameter γ = 1 over the ball B(θ , 1

(S2) the operator MNM n is UNS(γ)-unstable with parameter γ = 1 over the annulus A(θ , eρn, ρ) with eρn = c[ε(n, δ)] 1 2 and ρ = 1

2 where c denotes some universal positive constant.

Given these properties, we now show how our general theory yields the results stated in Corollary 7. To simplify the following discussion, we omit the universal constants and a few-logarithmic terms, and track the dependency only on the sample size n.

Results for gradient ascent: The items (P1) and (S1) establish that the gradient operators are slow-convergent and stable, and thus we can apply our general result from Theorem 5. In particular, plugging β = 1

2, and γ = 1 in Theorem 5, we ﬁnd that the statistical error for the gradient iterates θt = (MGA n )t(θ0) satisﬁes

θt θ [ε(n, δ)]

β 1+β γβ [n 1

1/2 1+1/2 1/2 = n 1

for t [ε(n, δ)] 1 1+β γβ [n 1

2 ] 1 1+1/2 1/2 = n 1 2 . (26b)

Results for Newton s method: The items (P2) and (S2) establish that the Newton operators are fast-convergent but unstable, and as a consequence our general result from Theorem 6(a) can be applied. In particular, plugging γ = 1 in Theorem 6(a), we ﬁnd that the Newton iterates θt = (MNM n )t(θ0) satisfy

θt θ max n [ε(n, δ)]

1 1+|γ| , eρn o

2 ] 1 1+1 = n 1

4 for t log(1/ε(n, δ)) log n. (27)

Moreover, we show that (see the discussion around equation (87)) Assumption (D) holds for the Newton iterates with an initialization outside the ball B(θ , eρn), and hence part (b) of the Corollary 7 states that the Newton iterates stay in a close vicinity of θ for all future iterations.

4.2 Over-speciﬁed Gaussian mixture models

We now consider the problem of parameter estimation in Gaussian mixture models; and analyze the behavior of two popular algorithms namely (a) Expectation-Maximization (EM) algorithm (Dempster et al., 1997), and (b) Newton s method. We note that EM is arguably the most widely used algorithm for parameter estimation in mixture models and other missing data problems (Dempster et al., 1997). Here we study the problem of estimating the parameters of a Gaussian mixture model given n i.i.d. samples from the model. When the number of components in the mixture is known, prior works (Balakrishnan et al., 2017; Daskalakis et al., 2017; Cai et al., To Appear) have shown that (i) the mixture parameters can be estimated at the parametric rate n 1

2 with the EM algorithm and (ii) the algorithm

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

takes at most log n steps to converge. In the over-speciﬁed setting, i.e., when the ﬁtted model has more components than the true model, recent works (Dwivedi et al., 2020a,b; Wu and Zhou, 2019) have established the slow convergence of EM on both the statistical and algorithmic fronts. For example, for over-speciﬁed Gaussian-location mixtures EM takes n 1 2 log n steps (where denotes much greater than) to converge and produces an estimate for the mean parameter that has a statistical error of order n 1

2 . In the sequel, we apply our general theory to study the behavior of EM and Newton s method for parameter estimation in over-speciﬁed Gaussian-location mixtures. First, we recover the slow convergence of EM as derived in prior works (Dwivedi et al., 2020a). Second, we prove that the Newton s method although an unstable algorithm in this setting achieves a similar statistical accuracy as EM albeit in an exponentially fewer number of steps. We now formalize the details. Let φ( ; θ, σ2) denote the density of N(θ, σ2) random variable, i.e.,

φ(x; θ, σ2) = (2πσ2) 1/2e (x θ)2

and let X1, . . . , Xn be n i.i.d. draws from the standard normal distribution (density φ( ; 0, 1)). Given this data, we ﬁt an over-speciﬁed mixture model namely, a two-component symmetric Gaussian mixture with equal ﬁxed weights whose density is given by

2φ(x; θ, 1) + 1

2φ(x; θ, 1), (28b)

where θ is the parameter to be estimated. In such a setting, the true parameter is unique and given by θ = 0 since f0( ) = φ( ; 0, 1). However, the fact that we ﬁt a mixture that has one extra component than the true model (which has just one component) leads to interesting consequences as we now elaborate. Using Ln to denote the log-likelihood function, the MLE estimate is given by

bθn,MLE arg max θ R Ln(θ) where Ln(θ) := 1

i=1 log fθ(Xi). (28c)

On one hand, it is known (Chen, 1995) that the over-speciﬁcation in such a setting leads to a slower than n 1

2 statistical rate for the MLE, i.e., |bθn,MLE θ | = O(n 1

4 ). On the other hand, MLE does not admit a closed-form expression and thus it is of signiﬁcant interest to understand the behavior of iterative algorithms that are used to estimate the MLE. Next, we use our general framework to provide a precise characterization of two algorithms namely, EM, and Newton s method on maximizing the log-likelihood Ln (28c).

4.2.1 Theoretical guarantees

The next corollary provides a precise characterization of EM and Newton s method for the over-speciﬁed setting described in the previous section. We analyze the EM updates and the pure Newton updates. Moreover, we use GEM n and GNM n respectively to denote the samplebased operators for EM and Newton s method (see Appendix D.2 for the precise form of these operators). Finally, the scalars c, ci, c i, c i denote some positive universal constants.

Instability, Computational Efficiency and Statistical Accuracy

Corollary 8 For the over-speciﬁed Gaussian mixture model (28) with θ = 0, given some δ (0, 1), the following properties hold with probability at least 1 δ:

(a) For any ﬁxed α (0, 1/4) and initialization θ0 B(θ , 1), the sequence θt := (GEM n )t(θ0) of EM iterates satisﬁes the bound

log(log(1/α)

4 α for all iterates t c 1 n log 1

as long as n c 1 log log(1/α)

(b) For any initialization θ0 A(θ ,

2c log2(3n/δ)

n1/4 , 1/3), the sequence of Newton iterates θt := (GNM n )t(θ0) satisﬁes the bound

4 for all iterates t c 2 log n, (29b)

as long as n c 2 log(1/δ).

See Appendix D.2 for the proof (and below for the proof sketch).

Corollary 8 establishes that the Newton EM is signiﬁcantly faster than EM for the model setup (28). More precisely, it reaches ball around θ with a statistical radius of order n 1

4 within log n steps, which is much smaller than the number of steps taken by EM. Moreover, the updates from Newton s method do not escape this ball for future iterations. This behavior is a consequence of the fact that under the assumed initialization condition, the (cubic-regularized) Newton EM sequence satisﬁes assumption (D).

Multivariate settings: In Figure 4, we discuss the performance of EM and Newton s method under the multivariate setting of the over-speciﬁed Gaussian mixture model (28b). Similar to the univariate setting, both algorithms converge to a statistical error of order (d/n)1/4 around the true parameter θ . Furthermore, the EM algorithm takes p

n/d number of iterations to converge to the ﬁnal estimate (see Appendix E.1 for a formal result) while the Newton s method takes much fewer number of iterations (which seems in agreement with the log n scaling suggested by our theory). Given that each iteration of the EM algorithm takes order n d arithmetic operations, the computational complexity for the EM algorithm to reach the ﬁnal estimate is of order n3/2d1/2. On the other hand, each iteration of the Newton s method takes an order of n d + d3 arithmetic operations where d3 is computational complexity of computing inverse of an d d matrix via Gauss-Jordan elimination approach. It leads to the computational complexity at the order (nd + d3) log n for the Newton s method to reach to the ﬁnal estimate. Thus, when d5/3 n, Newton s method is computationally more eﬃcient than the EM algorithm.

4.2.2 Proof sketch for Corollary 8

The proof strategy for this case is similar to that laid out in Section 4.1.2 for informative nonresponse model. First, to study this problem in our framework, we consider the population

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

102 103 104 Number of samples n

Statistical error vs n

EM, d=2: slope= 0.24 NM, d=2: slope= 0.23 EM, d=8: slope= 0.26 NM, d=8: slope= 0.25

102 103 104 Number of samples n

# Iterations to converge

Iteration complexity vs. sample size

EM, d=2: slope=0.51 NM, d=2: slope=0.09 EM, d=8: slope=0.52 NM, d=8: slope=0.06

Figure 4. Plots characterizing the behavior of Expectation-Maximization (EM) and Newton s method (NM) for two Gaussian mixture models in d = 2 and d = 8 dimensions. (a) Log-log plots of the Euclidean distance bθn θ 2 versus the sample size. It shows that all the algorithms converge to an estimate at Euclidean distance of the order n 1/4 from the true parameter θ . (b) Log-log plots for the number of iterations taken by diﬀerent algorithms to converge to the ﬁnal estimate. While EM takes roughtly n iterations, the scaling of iterations taken by Newton s method is signiﬁcantly slower.

level objective L by replacing the sum over samples in equation (28c) with the corresponding expectation:

L(θ) := EX N(0,1) [log fθ(X)] = EX

2φ(X; θ, 1) + 1

2φ(X; θ, 1) . (30)

Second, we use GEM and GNM respectively to denote the corresponding population-level EM and Newton s method operators (see Appendix D.2 for the precise expressions).

Results for EM: For the case of θ = 0, Theorem 2 and Lemma 1 of our prior works (Dwivedi et al., 2020a) show that, for any initialization θ0, the EM operators GEM and GEM n satisfy

(GEM)t(θ0) θ c

sup θ B(θ ,r)

GEM(θ) GEM n (θ) c1r

where the second bound holds with probability at least 1 δ for any ﬁxed radius r > 0. In the framework of our current work, the bounds (31) imply that the operator GEM

exhibits SLOW(1

2)-convergence, and the operator GEM n is STA(1)-stable with the noise function q

n . Thus a direct application of Theorem 5 of this paper (in a fashion similar to that of equations (26a) and (26b)), recovers the main result of our prior work (Dwivedi et al., 2020a) (Theorem 3). That is, with high probability, the sequence θt+1 n = GEM n (θt n) satisﬁes

1/2 1+1/2 1/2 = n 1

4 for t [n 1

2 ] 1 1+1/2 1/2 = n 1 2 . (32)

Instability, Computational Efficiency and Statistical Accuracy

Results for Newton s method: In Appendix D.2, we demonstrate the following properties of Newton s method operators:

(M1) the Newton operator GNM is FAST(7

9)-convergent over the ball B(θ , 1

(M2) the operator GNM n is UNS( 1)-unstable over the annulus A(θ , eρn, 1/3) with noise function ε(n, δ) = log(n/δ) n where eρn = c log2(3n/δ)

Based on the results of Theorem 6(a) with κ = 7

9 and γ = 1, the items (M1) and (M2) suggest that the Newton updates θt = (MNM n )t(θ0) satisfy θt θ max n [ε(n, δ)] 1 1+1 , eρn o n 1

4 for t log(1/ε(n, δ)) log n. (33)

Furthermore, we prove that the Newton iterates satisfy Assumption (D) (see the argument with equation (97)). Therefore, the Newton iterates stay in a close vicinity of θ for all future iterations.

4.3 Non-linear regression model

In our third example, we consider a non-linear regression model (Carroll et al., 1997) with a known link function g. Models of this type have proven useful for applications in signal processing, econometrics, statistics, and machine learning (Ichimura, 1993; Horowitz and H ardle, 1996). For simplicity, we brieﬂy summarize the one-dimensional version of this problem. The multivariate setting of the problem is considered in Appendix E.2. We observe the pairs of data (Xi, Yi) R2 that are generated from the model

Yi = g (Xiθ ) + ξi for i = 1, . . . , n. (34a)

Here Yi denotes the response variable, Xi corresponds to the covariate and ξi denotes the additive noise assumed to have a standard Gaussian distribution, i.e., ξi i.i.d. N(0, 1). Note that, the Gaussianity of the additive noise is for the simplicity of the proof, and the results can be extended to sub-Gaussian errors. In this example, we consider the case of random design for the covariates, i.e., the covariates {Xi}n i=1 are independent and Xi N(0, 1). Given the samples {(Xi, Yi), i [n]}, we want to estimate the unknown parameter θ . A popular choice is the maximum-likelihood estimate (MLE):

bθmle n arg min θ R e Ln(θ) where e Ln := 1

i=1 (Yi g (Xiθ))2 . (34b)

Generally, the loss-function e Ln is non-convex and hence the MLE does not admit a closedform expression. Consequently, one needs to make use of certain optimization algorithms to compute an estimate bθn, which need not be the same as bθmle n . In the remainder of this section, we study the case when the SNR degenerates to zero. Speciﬁcally, we consider θ = 0 and a link function of the form g(x) = x2p with p 1. For such a setting, the optimization problem (34b) takes the following form:

bθn arg min θ R e Ln(θ) where e Ln := 1

Yi (Xiθ)2p 2 . (34c)

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

4.3.1 Theoretical guarantees

For the non-linear regression model described above with the link function g(x) = x2p, we consider three iterative optimization methods: (a) gradient descent with a step size η (0, 1 (4p 1)!!(2p)], (b) (pure) Newton s method, and (c) cubic-regularized Newton s method with Lipschitz constant L := (4p 1)!!(4p 1)p/3. We denote the updates for these three methods via the operators FGD n , FNM n , and FCNM n respectively (see Appendix D.3 for the precise expressions of these operators). The next result characterizes the behavior of these three methods:

Corollary 9 For the non-linear regression model (34) with link function g(x) = x2p for p 1 and true parameter θ = 0, given some δ (0, 1), the following properties hold with probability at least 1 δ:

(a) For any ﬁxed α (0, 1/4) and initialization θ0 B(θ , 1), the sequence θt := (FGD n )t(θ0) of gradient iterates satisﬁes the bound

log4p(n log(1/α)

4p α for all iterates t c 1n

as long as n c 1 log log(1/α)

(b) For any initialization θ0 A(θ , clogp/(2p 1)(n/δ)

n1/4(2p 1) , 1), the sequence of Newton iterates θt := (FNM n )t(θ0) satisﬁes the bound

4p for all iterates t c 2 log n, (35b)

as long as n c 2 log(1/δ).

(c) The sequence of cubic-regularized Newton iterates θt := (FCNM n )t(θ0) with initialization

θ0 A(θ , clogp/(2p 1)(n/δ)

n1/4(2p 1) , 1) satisﬁes the bound

4p for all iterates t c 3n

4p 3 2(4p 1) , (35c)

as long as n c 3 log(1/δ).

See Appendix D.3 for the proof (and below for the proof sketch).

This corollary shows that the ﬁnal statistical errors achieved by gradient descent and the (cubic-regularized) Newton s method have the same scaling. Moreover, Newton s method, while unstable, converges to the correct statistical radius in a signiﬁcantly smaller log n number of steps when compared to gradient descent, which takes n

2p steps and cubic-

regularized Newton s method, which takes n

4p 3 2(4p 1) steps. Moreover, we also show that

Instability, Computational Efficiency and Statistical Accuracy

assumption (D) holds for the iterates from the (cubic-regularized) Newton method s and hence we obtain that these iterates not only converge to a ball of radius n 1

4p around θ , but also that they stay there for all the future iterations. Finally, in Appendix D.3 (see equation (116)) we also establish that the statistical radius n 1/(4p) achieved by the considered optimization methods is tight. When g(x) = x2, the model (34a) corresponds to a phase retrieval problem. In the regime of large signal-to-noise ratio (SNR), i.e., |θ | 1, and with the link function g(x) = x2, there are eﬃcient algorithms which produce an estimate bθn satisfying a bound |bθn θ | n 1

2 (Eldar and Mendelson, 2013; Cand es et al., 2015; Tan and Vershynin, 2018). However, as the SNR approaches zero these parametric rates do not apply and precise statistical behavior of these estimates are not known.

4.3.2 Proof sketch for Corollary 9

In order to study these updates using our framework, we need to consider the populationlevel version of the optimization problem (34c), which is given by

min θ R e L(θ) where e L(θ) := 1

Y (Xθ)2p 2 ,

where the expectation is taken with respect to X N(0, 1), Y N(0, 1) as θ = 0. Direct computation yields that

2 + (4p 1)!!θ4p

2 and arg min θ e L(θ) = 0 = θ . (36)

Like the previous proof sketches, we let FGD, FNM and FCNM denote the population operators corresponding to the algorithms, gradient descent, Newton s method and cubicregularized Newton s method, for the problem (36) (for a given p). See Appendix D.3 for the precise deﬁnitions of these operators. In Appendix D.3, we show that with θ = 0, these population-level operators satisfy the following properties over the ball B(θ , 1):

(e P1) the gradient operator FGD is SLOW( 1 4p 2)-convergent for step size η (0, 1 (4p 1)!!(2p)],

(e P2) the Newton operator FNM is FAST(4p 2

4p 1)-convergent, and

(e P3) the cubic-regularized Newton operator FCNM is SLOW( 2 4p 3)-convergent.

Moreover in the Appendix D.3, we show that with the noise function ε(n, δ) = q

n , the sample-level operators satisfy the following properties:

(e S1) the operator FGD n is STA(2p 1)-stable over the ball B(θ , 1),

(e S2) the operator FNM n is UNS( (2p 1))-unstable over the annulus A(θ , eρn, 1) with inner radius eρn = c logp/(2p 1)(n/δ)/n1/4(2p 1), and

(e S3) the operator FCNM n is UNS( 1

2)-unstable over the annulus A(θ , eρn, 1).

. See the proofs of equations (111) and (117) in Appendix D.3 for more details.

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

These properties show that the gradient descent is a slow-converging stable method and we can apply Theorem 5. On the other hand, Newton s method is a fast-converging unstable method, and Theorem 6(a) can be applied. Finally, cubic-regularized Newton s method is a slow-converging unstable method and Theorem 6(b) can be applied. In the subsequent proof-sketch, we track the dependency only on the sample size n and ignore logarithmic factors and universal constants. Moreover, since the computations here mimic the discussion from Section 4.1.2, we keep the discussion briefer.

Results for gradient descent: Applying Theorem 5 with β = 1 4p 2, and γ = 2p 1

(items (e P1) and (e S1) respectively), we ﬁnd that the statistical error for the gradient iterates θt = (FGD n )t(θ0) satisfy θt θ [ε(n, δ)]

β 1+β γβ n 1

2p for t [ε(n, δ)] 1 1+β γβ n

Results for Newton s method: Next applying Theorem 6(a) for the Newton s method with κ = 4p 2

4p 1, and γ = (2p 1) (see items (e P2) and (e S2)), we conclude that the updates θt = (FNM n )t(θ0) from the Newton s method have the following property: θt θ max n [ε(n, δ)]

1 1+|γ| , eρn o n 1

2p for t log(1/ε(n, δ)) log n. (38)

Results for cubic-regularized Newton s method: Finally by using Theorem 6(b) for the cubic-regularized Newton s method with β = 2 4p 3, and γ = 1

2 (see items (e P3)

and (e S3)), the following results hold for the cubic-regularized Newton iterates θt = (FCNM n )t(θ0): θt θ max n [ε(n, δ)]

β 1+β γβ , eρn o

2p for t [ε(n, δ)] 1 1+β n

4p 3 2(4p 1) . (39)

5. Discussion

In this paper, we established several results characterizing the statistical radius achieved by a sequence of updates {F t n(θ0 n)}t 0, induced by an operator Fn and a given initial point θ0 n. We established these results by analyzing the interplay between (in)-stability of the operator Fn for its population operator F and the local convergence of F around its ﬁxed point θ . We then applied our general theory to derive sharp algorithmic and statistical guarantees for several iterative algorithms by analyzing the corresponding sample and population operators, in three diﬀerent statistical settings. In particular, we studied the behavior of gradient methods and higher-order (cubic-regularized) Newton s method for parameter estimation in the weak signal-to-noise ratio regime in Gaussian mixture models, nonlinear regression models, and informative non-response models. We showed that for such models, despite instability, fast algorithms like Newton s method may still be preferred over a stable one like gradient descent since they achieve the same statistical accuracy as that of the stable counterpart in exponentially fewer steps. We now discuss a few questions that arise naturally from our work. First, our results, as stated, are not directly applicable to the settings of accelerated optimization methods or quasi-Newton methods, e.g., accelerated gradient descent (Nesterov, 2013) and LBFGS (Fletcher, 1987). On the one hand, the updates from an accelerated gradient descent

Instability, Computational Efficiency and Statistical Accuracy

method require that the operators Fn and F to change with each iteration. On the other hand, the updates from the L-BFGS method would require additional machinery to deal with the preconditioning matrices in each step. Developing a general theory to characterize the statistical performance of algorithms associated with a time-varying operator Fn is an interesting direction for future research. Secondly, it is desirable to understand the behavior of optimization methods to a wider range of statistical problems. In the context of mixture models, recent work by Dwivedi et al. (Dwivedi et al., 2020b) established that for over-speciﬁed mixtures with both location and scale parameter unknown, EM takes an O(n 3 4 ) steps to return estimates with minimax statistical error of order n 1

4 for the location and scale parameter, respectively. Whether an unstable method like (cubic-regularized) Newton s EM proves computationally advantageous (without losing statistical accuracy) in such more challenging non-convex landscapes remains an open problem. Finally, our theory does not easily extend to the settings with dependent data, such as time series. When the samples are (time) dependent, taking the limit of inﬁnite sample size does not yield a natural population-level operator. One possible ﬁx is to borrow the technique of truncating the sample operator from the analysis of the Baum-Welch algorithm for hidden Markov models (Yang et al., 2017). However, even with the help of such a technique, ample technical challenges remain towards developing a general theory for such non-i.i.d. settings. In this supplementary material, we provide the details of proofs and results that were deferred from the main paper. Appendices A and C contain the proofs of Theorems 5 and 6, respectively, including all the details of the localization argument and the proofs of all auxiliary technical lemmas. In Appendix B, we construct a simple class of problems to demonstrate that the guarantees Theorems 5 and 6 are unimprovable in general. Finally, in Appendix D, we collect the proofs of several corollaries stated in the paper. Finally, we discuss an extension of the theoretical results in the main text to multivariate settings in Appendix E.

Appendix A. Proofs of main results

In this section, we provide the proofs of our main results, namely Theorems 5 and 6.

A.1 Proof of Theorem 5

The reader should recall the proof outline provided following the statement of the theorem. Our proof here follows this outline, making each step precise. For the remainder of the proof, we assume without loss of generality that θ = 0 and r0 = 1. Proofs for the cases θ = 0 or r0 > 1 can be reduced to this case in a straightforward fashion and are thereby omitted.

A.1.1 Notation for stable case

For each positive integer ℓ= 1, 2, . . ., let Tℓdenote the number of iterations during the ℓ-th epoch, and let Sℓdenote the total number of iterations taken up to the completion of epoch

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

ℓ. In order to describe some recursions satisﬁed by these quantities, we deﬁne

T (1) ℓ := Cε(n, δ ) λℓ 1(γ)+1

1+β and T (2) ℓ := C ε(n, δ ) λℓ(γ)+1

for C := (c22γ) 1 (1+β) and C := C(c )

γ 1+β , (40a)

where c := (c22γ)

β 1+β = C β and hence we have C = C

1+β . Here the constant c2 is the constant from the the stability deﬁnition (11). The sequences {Tℓ} and {Sℓ} have the following properties: with the initialzation T0 := 0, we have

Tℓ:= l T (1) ℓ + T (2) ℓ m and Sℓ:=

j=0 Tj for ℓ= 1, 2, . . .. (40b)

Our proof is based on studying the sequence of real-numbers {λℓ}ℓ 0 given by

λ0 = 0 and λℓ+1 = λℓν + ν , where ν = βγ 1+β and ν = β 1+β. (40c)

Note that Assumption (B) implies that ν (0, 1) and hence

λℓ= ν (1 νℓ) ν where ν := β 1 + β γβ , (40d)

where λℓ ν means that the sequence λℓis monotonically increasing and converges to ν . In the epoch-based argument, we need to control the deviation sup θ r F(θ) Fn(θ) uniformly for each radii r R . To this end, for any tolerance δ (0, 1), we deﬁne the event E by

sup θ B(θ ,r) F(θ) Fn(θ) c2rγε(n, δ ) uniformly for all r R )

where δ = δ log( 1+β

8 log( β α(1+β γβ )) was deﬁned in equation (14) and the radii-set R is deﬁned as

R := R 2R, with

R := n ε(n, δ )λ0, . . . , ε(n, δ )λℓα, c ε(n, δ )λ0, . . . , c ε(n, δ )λℓα o ,

ℓα = log(1/α) and c = (c22γ)

Combining the STA(γ)-stability assumption (11) with a standard application of union bound we conclude that

P(E) 1 δ. (43)

Before we start the main argument, we state a lemma useful in the proof of our theorem:

Lemma 10 Assume that the assumptions of Theorem 5 are in force. Then conditioned on the event E (41. 43), for all radius r in the set R (42), we have

sup θ B(θ ,r) F t(θ) F t n(θ) c2(2r)γε(n, δ ) t for all t e T (r), (44)

Instability, Computational Efficiency and Statistical Accuracy

where e T (r) := r1 γ 2γc2ε(n,δ ). Furthermore, for all ℓ ℓα we have

T (1) ℓ+1 e T (ε(n, δ )λℓ) and T (2) ℓ+1 e T (c ε(n, δ )λℓ+1). (45)

See Appendix C.1 for the proof of this lemma.

A.1.2 Main argument

We claim that the sequence {θt n}t 1 satisﬁes

θSℓ n 2 ε(n, δ )λℓ uniformly for all ℓ {0, 1, . . . , ℓα} , and (46a)

θSℓα+t n 2ε(n, δ )ν α uniformly for all t {0, 1, 2, . . .}, (46b)

with probability at least 1 δ. The quantities λℓ, Sℓand ℓα are deﬁned in equations (40a) through equation (40c). With these claims at our disposal, it remains to prove an upper bound on the scalar Sℓα. Towards this end, doing some straightforward algebra we ﬁnd that

Tℓ Tℓα c ε(n, δ ) ν

β for any 0 ℓ ℓα. (47)

Combining the above bounds on Tℓwith the deﬁnition of Sℓfrom equation (40b) yields an upper bound on Sℓα. Substituting the upper bound on Sℓα in inequality (46b) yields the claimed bound (15) of Theorem 5. We now prove the claims (46a) and (46b) using induction.

A.1.3 Proof of claim (46a)

We condition on the event E deﬁned in the equation (41), which occurs with probability at least 1 δ, and establish the claim using induction on the epoch index ℓ. The base case ℓ= 0 is immediate. We now establish the inductive step, i.e., given θSℓ n ε(n, δ )λℓfor some ℓ ℓα 1, we show that θSℓ+1 n ε(n, δ )λℓ+1. We split the proof in two parts (primarily to handle the constants):

θ Sℓ+T (1) ℓ+1 n c ε(n, δ )λℓ+1 and (48a)

θ Sℓ+T (1) ℓ+1+T (2) ℓ+1 n ε(n, δ )λℓ+1, (48b)

where c > 1 is a universal constant. These claims together imply the induction hypothesis and thereby the claim (46a).

Proof of claim (48a) Inequality (45) implies that T (1) ℓ+1 e T (ε(n, δ )λℓ), and hence we can

apply the bound (44) from Lemma 10 with r = ε(n, δ )λℓ R for any t T (1) ℓ+1. Applying the triangle inequality yields

θt+Sℓ n = F t n(θSℓ n ) F t(θSℓ n ) + F t(θSℓ n ) F t n(θSℓ n )

tβ + F t(θSℓ n ) F t n(θSℓ n ) (49)

tβ + c2(2ε(n, δ )λℓ)γε(n, δ )t, (50)

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

for any t T (1) ℓ+1; where step (i) follows from the SLOW(β)-convergence (8) of the operator F along with the assumption that θ = 0, and step (ii) follows by using the inductive hypothesis θSℓ n ε(n, δ )λℓand applying Lemma 10 with r = ε(n, δ )λℓ. Note that in the ﬁnal bound (50) the ﬁrst term decreases with iteration t while the second term increases with t. In order to trade oﬀthese two terms, we set t = T (1) ℓ+1 (40a) in the bound (50) and ﬁnd that

θ Sℓ+T (1) ℓ+1 n 1

(T (1) ℓ+1) β + c2(2ε(n, δ )λℓ)γε(n, δ )T (1) ℓ+1

β 1+β | {z } =:c ε(n, δ )1 λℓγ+1

= c ε(n, δ )

= c ε(n, δ )λℓ+1,

where the last equality follows from the relation (40c) between λℓand λℓ+1. The claim (48a) now follows.

Proof of claim (48b) For any t e T (c ε(n, δ )λℓ+1), we have

θ t+Sℓ+T (1) ℓ+1 n F t(θ Sℓ+T (1) ℓ+1 n ) + F t(θ Sℓ+T (1) ℓ+1 n ) F t n(θ Sℓ+T (1) ℓ+1 n )

tβ + c2(2c ε(n, δ )λℓ+1)γε(n, δ )t,

where the last inequality follows from arguments similar to those used to establish the inequalities (49) and (50) above. Next, recalling the inequality T (2) ℓ+1 e T (c ε(n, δ )λℓ+1)

from equation (45) and plugging t = T (2) ℓ+1 (40a) in the above inequality, we ﬁnd that

θSℓ+1 n 2(c22γ)

1+β | {z } =: e C

1+β = e Cε(n, δ )λℓ+2.

In order to complete the proof, it remains to show that last quantity is upper bounded by ε(n, δ )λℓ+1; equivalently, we need to verify the following upper bound

ε(n, δ ) 1 e Cλℓ+2 λℓ+1 , (51)

which is equivalent to the large sample-size assumption (C) (see condition (79) for a more precise statement) if we establish that

λℓ+2 λℓ+1 α := α(1 + β βγ)

1 + β . (52)

In order to do so, we use the fact (40d) that λℓ= ν (1 νℓ) and obtain that

λℓ ν α and consequently that ν νℓ α

. We ignore the eﬀect of the ceiling function to simplify the computations

Instability, Computational Efficiency and Statistical Accuracy

for all ℓ {0, 1, . . . , ℓα}. Putting together the pieces we have

λℓ+2 λℓ+1 = ν νℓ+1(1 ν) α(1 ν) = α ,

which yields the claimed bound (52) and we are done.

A.1.4 Proof of claim (46b)

The proof of this claim follows a similar road-map as that in the previous Section, and hence we simply sketch it. Conditional on the event E, we claim that

θSℓα+k Tℓα n ε(n, δ )ν α uniformly for all k {0, 1, 2, . . .}. (53)

Assuming this bound is given for now, we complete the proof. Invoking inequality (75) from the proof of Lemma 10, we obtain that

θSℓα+k Tℓα+t n 2ε(n, δ )ν α (54)

for all k {1, 2, . . .} and t e T (ε(n, δ )ν α). Mimicking the arguments from claims (48a) and (48b), and using the large sample-size assumption (C) (condition (79)) yields the claim (54) for any t ε(n, δ ) ν

β . Putting this together with the fact (47) that Tℓα ε(n, δ ) ν

β implies the claim (46b). Turning to the proof of claim (53), we note that the base case k = 0 follows from the claim (46a) by plugging in ℓ= ℓα. For the inductive step, assuming θSℓα+k Tℓα n ε(n, δ )ν α, arguments similar to that in the proof of claims (48a) and (48b) yield

θ Sℓα+k Tℓα+T (1) ℓα n c ε(n, δ )ν α and,

θ Sℓα+k Tℓα+T (1) ℓα +T (2) ℓα n | {z }

θ Sℓα +(k+1)Tℓα n

ε(n, δ )ν α,

thereby establishing the induction hypothesis.

A.2 Proof of Theorem 6

We divide the proof into two subsections, corresponding to parts (a) and (b) of Theorem 6.

A.2.1 Proof of part (a)

We introduce the shorthands eε(n, δ) = (ε(n, δ)) 1 1+γ and Tf = 1 (1+γ) log(ρ/ε(n,δ))

log(1/κ) . Without loss of generality, we can assume that

θt n θ > (2 κ)

(1 κ)eε(n, δ) for all t {0, . . . , Tf 1} . (55)

Otherwise, the claim is immediate. Given the condition (55), we prove the following two claims:

θt n A(θ , eε(n, δ), ρ) for all t {0, . . . , Tf 1} , (56a)

and θTf n θ (2 κ)

(1 κ)eε(n, δ). (56b)

The latter claim (56b) completes the proof of part (a) of the theorem.

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

Proof of claim (56a) With the condition (55) in hand, it remains to prove that θt n θ ρ. The base case of t = 0 is immediate from the initialization conditions. For the induction step, assuming θt n A(θ , eε(n, δ), ρ), we have

θt+1 n θ = Fn(θt n) θ Fn(θt n) F(θt n) + F(θt n) θ

(i) sup θ A(θ ,eε(n,δ),ρ) Fn(θ) F(θ) + κ θt n θ

(ii) ε(n, δ) max 1 eε(n, δ)γ , ρ + κρ (57)

eε(n, δ)γ + κρ

= ε(n, δ) 1 1+γ + κρ (iii) ρ,

where the inequality (i) follows from the induction hypothesis that θt n A(θ , eε(n, δ), ρ) and the fact that operator F is κ-contractive in the ball B(θ , ρ); inequality (ii) follows from the ﬁrst inequality from condition (19a) that implies that eε(n, δ) = ε(n, δ) 1 1+γ eρ and then invoking the instability condition (12) with r = eε(n, δ) and ρ2 = ρ. Finally, the last inequality (iii) follows from the second bound of the condition (19a). The inductive step is thus established.

Proof of claim (56b) We observe that

θTf n θ = Fn(θTf 1 n ) θ (58)

Fn(θTf 1 n ) F(θTf 1 n ) + F(θTf 1 n ) θ

(i) sup θ A(θ ,eε(n,δ),ρ) Fn(θTf 1 n ) F(θTf 1 n ) + κ θTf 1 n θ

(ii) ε(n, δ) max 1 eε(n, δ)γ , ρ + κ θTf 1 n , (59)

where inequality (i) follows from our earlier claim (56a) and the κ-contractivity of the operator F on the ball B(θ , ρ); inequality (ii) follows from an argument similar to the one used to establish the inequality (57). Finally, recursing equation (59) Tf times, we obtain that

θTf n θ ε(n, δ) max 1 eε(n, δ)γ , ρ (1 + κ + . . . + κTf 1) + κTf θ0 n θ

(1 κ) max 1 eε(n, δ)γ , ρ + κTfρ

(1 κ) + eε(n, δ) = (2 κ)

(1 κ)eε(n, δ),

where the last step follows from the upper bound on iteration Tf, which in turn implies that κTfρ eε(n, δ). The proof is now complete.

Instability, Computational Efficiency and Statistical Accuracy

A.2.2 Proof of part (b)

The proof for Theorem 6(b) borrows ideas from the proof of Theorem 5 as well as the proof of part (a) of Theorem 6. We introduce the following deﬁnitions:

Ts := [ε(n, δ)] 1 |γ|ν

1+β , where ν := β 1 + β γβ .

In order to prove the result (20b), we can, without loss of generality, assume that

θt n θ > 2[ε(n, δ)]ν for all t {0, . . . , Ts 1} , (60)

and show that θTs n θ 2[ε(n, δ)]ν . We only prove the result for θ = 0 as the more general case can be derived in a similar fashion. In order to proceed further, we make use of a result similar to Lemma 10 adapted to the unstable case. Given two positive scalars r1 < r2, we deﬁne

e T (r1, r2) := r2r|γ| 1 ε(n, δ). (61)

Lemma 11 Suppose that the assumptions for part (b) of Theorem 6 hold. Further, suppose that the operator Fn satisﬁes F t n(θ) r1 for any point θ such that θ [r1, r2] and for all t e T (r1, r2), where eρ r1 r2 ρ/2. Then with probability at least 1 δ, we have

sup θ A(θ ,r1,r2) F t(θ) F t n(θ) t ε(n, δ)

r|γ| 1 for all t e T (r1, r2). (62)

See Appendix C.2 for its proof. We are now ready for the main argument. We have

θt n = F t n(θ0 n) F t(θ0 n) + F t(θ0 n) F t n(θ0 n)

tβ + F t(θ0 n) F t n(θ0 n) (63)

tβ + t ε(n, δ) [ε(n, δ)]ν |γ| , for all t e T ([ε(n, δ)]ν , ρ), (64)

with probability at least 1 δ. Here, inequality (i) follows from the SLOW(β)-convergence condition (8) of the operator F along with the assumptions that θ = 0 and θ0 n ρ; inequality (ii) follows by applying Lemma 11 with r1 = [ε(n, δ)]ν and r2 = ρ in light of the condition (60). In the ﬁnal bound (64), the ﬁrst term decreases with iteration t while the second term increases with t. In order to trade oﬀthe two terms, we plug

in t = Ts ( ) e T ([ε(n, δ)]ν , ρ) (where the inequality ( ) holds due to the second bound in assumption (20a)), and perform some algebra to obtain that

T β s + Ts ε(n, δ) [ε(n, δ)]ν |γ| 2[ε(n, δ)]ν ,

which yields the claim.

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

Appendix B. Tightness of general results

In this appendix, we construct a simple class of problems to demonstrate that the guarantees Theorems 5 and 6 in this paper are unimprovable in general.

B.1 Constructing the family of operators

We establish our lower bounds by considering the following pairs of optimization problems:

min θ Rd f(θ), where f(θ) := θ p

p , and (65)

min θ Rd fn(θ), where fn(θ) := f(θ) εn θ q

where p, q are positive reals satisfying q 2, and p > q+1, and the scalar εn is a perturbation term. The perturbation εn is any non-increasing function in n, that decays to zero as the sample size n increases, so that the problem (66) can be seen as a noisy ﬁnite-sample instantiation of the population-level problem (65). To study the tightness of our general results, we study the guarantees for three diﬀerent algorithms: (a) gradient descent method, (b) Newton s method, and (c) cubic-regularized Newton s method (for d = 1). We note that for the population-level updates, there is a unique global optimal θ = 0, and for the sample-level objective, the global minima θ n satisﬁes ϵstat n := θ n = ε

1 p q n . For our lower bounds, we analyze the behavior of three diﬀerent algorithms: (a) gradient descent method, (b) Newton s method, and (c) cubic-regularized Newton s method (for d = 1), with the population-level operators QGD n , QNM n , and QCNM n deﬁned as follows:

QGD(θ) = θ η f(θ) = θ 1 η θ q 2 , (67a)

QNM(θ) = θ 2f(θ) 1 f(θ) = 1 1 p 1

θ, and (67b)

QCNM(θ) = arg min y R

f(θ)(y θ) + 1

2 2f(θ)(y θ)2 + cp |y θ|3 , (67c)

where cp := 1

6(p 1)(p 2), and η > 0 denotes the step-size of gradient descent algorithm. The corresponding sample-level updates are generated by the operators QGD n , QNM n , and QCNM n , as follows:

QGD n (θ)=θ η fn(θ) = θ η θ p 2 ε θ q 2 θ, (68a)

QNM n (θ) = θ 2fn(θ) 1 fn(θ) = (p 2) θ p 2 (q 2)ε θ q 2

(p 1) θ p 2 ε(q 1) θ q 2 θ, (68b)

QCNM n (θ) = arg min y R

fn(θ)(y θ) + 1

2 2fn(θ)(y θ)2+cp |y θ|3 . (68c)

Standard algebra with the update equations (67a)-(67c) yields the following properties with the population-level operators:

(b P1) the operator QGD is SLOW( 1 p 2)-convergent on the ball B(θ , 1) for small enough η > 0,

Instability, Computational Efficiency and Statistical Accuracy

(b P2) the operator QNM is FAST(p 2

p 1)-convergent towards θ = 0, and

(b P3) the operator QCNM is SLOW( 2 p 3)-convergent on the ball B(θ , 1).

Moving to the (in)-stability of sample-level operators, we can verify that:

(b S1) the operator QGD n is STA(q 1)-stable over the Euclidean ball B(θ , 1);

(b S2) the operator QNM n is UNS( p + q + 1)-unstable over the annulus A(θ , c1r , 1), and

(b S3) the operator QCNM n is UNS( p+1

2 + q)-unstable over the annulus A(θ , c2r , 1) (d = 1)

with respect to the corresponding population-level operators, and the noise function εn.

B.2 Lower bounds showing sharpness

In this section, we demonstrate the our general upper bounds on statistical accuracy and the iteration count, when specialized to the set-up above, are unimprovable. More precisely, the following result applies to the gradient descent updates (68a) with step size η 0, 1

2 , along with the cubic regularized and standard Newton updates.

Proposition 12 Let p > q +1 and q 2, and deﬁne ϵstat n := ε

1 p q n , then for the set-up (66), given an initialization θ0 with θ0 = 1, we have

(QGD n )t(θ0) θ

2ϵstat n for all t c1ε p 2

p q n , ϵstat n for all t 1,

2ϵstat n for all t c 1ε p 2

(QNM n )t(θ0) θ

2ϵstat n for all t c2 log(ε 1 n )), ϵstat n for all t 1, 2ϵstat n for all t c 2 log(ε 1 n )),

(QCNM n )t(θ0) θ

2ϵstat n for all t c3ε p 3

p 1 n , ϵstat n for all t 1, (71)

where θ = 0 denotes the ﬁxed point of the operators QGD, QNM, and QCNM, and c1 > c 1, c2 > c 2, and c3 denote universal constants depending on p, q and independent of n.

It is worth understanding how Proposition 12 establishes the tightness of the general upper bounds given in Theorems 5 and 6. Note that the properties (b P1) (b P3) and (b S1) (b S3), in conjunction with our general results in Theorems 5 and 6, provide an upper bound on the statistical error given suﬃciently many iterations as summarized in bounds (69), (70) and (71), e.g., for the GD iterates, substituting β = 1 p 2 and γ = q 1 in Theorem 5, we conclude that

(QGD n )t(θ0) θ 2ε

1 p q n = 2ϵstat n for all t c1ε p 2

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

Furthermore, Proposition 12 guarantees that, up to a the constant pre-factor 2, this statistical error is the best possible since

(QGD n )t(θ0) θ ϵstat n for all t 1.

Finally, the proposition also asserts that the GD updates take at least order ε p 2

q 2 n iterations to converge by the following additional bounds, as we also have the following bound from the display (69):

(QGD n )t(θ0) θ 2ϵstat n for all t = c 2ε p 2

A similar tightness of the statistical and computational guarantee can be argued for fast unstable methods stated in Theorem 6(a) via the guarantee (70) for the Newton s method. Finally, for slow unstable operators, we establish the tightness for the statistical error guarantee of Theorem 6(b) via the bound (71) for the CNM algorithm. (Showing the tightness of iteration complexity for this case requires fairly involved technical analysis, and is left for future work.) In a nutshell, Proposition 12 shows that the upper bound on the ﬁnal statistical errors, and the lower bound on the number of iterations needed to obtain that ﬁnal estimate, as stated in Theorems 5 and 6 are tight for the class of problems (66).

B.3 Proof of Proposition 12

As noted earlier, the upper bounds on the statistical error, and the corresponding lower bound on the number of iterations follow directly by substituting appropriate β and γ parameters from the properties listed above in Theorems 5 and 6. Since the arguments are very similar to those in Appendix D, we omit a detailed derivation. In order to see that the statistical error cannot decrease any further, we note that in our example the iterates from gradient descent and (cubic-regularized) Newton s methods always converge to the global minima θ n of fn. Thus, we also have

QGD n (θ) ϵstat n , QNM n (θ) ϵstat n , and QCNM n (θ) ϵstat n

for all θ ϵstat n . Consequently, we conclude that the error for all iterations are lower bounded by ϵstat n .

Next, we establish the lower bounds on the number of iterations to converge to within 2ϵstat n for Gradient descent and Newton s method. Introducing the shorthand ε := εn, and rearranging terms in equations (68a) and (68b), we ﬁnd that

QGD n (θ) = 1 η θ p 2 + ηε θ q 2 θ, (72)

QNM n (θ) = 1 θ p 2 θ q 2ε (p 1) θ p 2 (q 1) θ q 2ε

Proof for gradient descent iterates: Recursing the update (72), we ﬁnd that

1 η θt+j p 2 + ηε θt+j q 2 . (74)

Instability, Computational Efficiency and Statistical Accuracy

Note that it suﬃces to show that with θt = 2 := 4ϵstat n = 4ε 1 p q , the smallest T such that θT +t satisﬁes T = Ω( 2 p). Since the sequence { θt+j }T j=1 is a decreasing sequence, we ﬁnd that

θt+j 2 for all j = 1, . . . T .

Using := 2 ε 1 p q and the update (74), we have

θt+T +1 1 cη p 2 T θt = 2 1 cη 2 T .

where c = 22p 4 2q p > 0 under the assumptions p > q + 1 and q 2. In order to ensure that θt+T , we need to have

1 cη p 2 T 1

Rearranging the last equation yields T c p 2 c ε p 2

p q , where c is a universal constant which depends only on the pair (p, q). This completes the proof.

Proof for Newton s method iterates: Following an argument similar to the last paragraph and using θt = 2 2 ε 1 p q , we ﬁnd that

θt+T 1 1 p q

T θt = 2 1 1 p q

Recalling that p q 2, we have that T log 2

log p q p q 1 . Consequently, in order to achieve

an accuracy of 1 εp q , we need at least log 2

log p q p q 1 log(ε (p q)) = c log(1/ε) steps. Here,

the universal constant c only depends on (p, q). This completes the proof of the sharpness of the Newton s method.

B.4 Undesirable behavior of unstable operators

In this appendix, we prove that the minimum over all iterates k {1, 2, . . . , t} in Theorem 6 is necessary. In particular, we consider the following example

L(θ) = θ4(θ 2)2 and Ln(θ) = θ4 θ2

We let F and Fn denote the operators corresponding to the Newton s method as applied to the functions L and Ln, respectively (Consequently, the operator F has three ﬁxed points). Following some simple algebra, it can be veriﬁed there are universal constants (c1, c2) such that that the operators F and Fn deﬁned above satisfy the conditions of Theorem 6 (a) with θ = 0 for some κ < 1, γ = 1, ε(n, δ) = n 1

2 , eρ = c1n 1

4 and ρ = c2. In panel (a) of Figure 5, we plot the two functions L and Ln and illustrate the radii eρ, ρ (for a ﬁxed n). Some additional algebra shows that there exists θ0 n B(θ , eρ) such that the iterates

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

corresponding to the sequence θt+1 n = Fn(θt n) satisfy θt n θ 1 n 1

4 for all iterations t = 1, 2, . . .. See, in particular, the red (diamond) iterates in panel (b) of Figure 5 which are generated with a starting point θ0 n = c3n 1

4 (which is below the controlled instability threshold eρ). Clearly, we see that the ﬁrst iterate produced by Newton s method escapes the local basin of attraction and the subsequent iterates converge to a very diﬀerent ﬁxed point of the function Ln. On the other hand, when the Newton s method is initialized in the annulus A(θ , eρ, ρ), the sequence θt n (blue circles) converges quickly to the vicinity of θ as guaranteed by Theorem 6. Furthermore, the iterates do not escape this local neighborhood. Via this simple example, we have demonstrated that if no further regularity assumptions

θ100 θ0 θ10

L(θ) Newton iterates Newton iterates

Figure 5. Instability of Newton s method for the example discussed above (ﬁgure best viewed in color). When the algorithm is initialized too close to θ (red diamonds), the instability of Newton s method forces the iterates to jump too far away from θ and converge to another ﬁxed point. On the other hand, if the initial point is initialized in the annulus A(θ , eρ, ρ), the Newton iterates (blue circles), do not leave this annulus and converge monotonically to a small neighborhood of θ .

are made, then starting an unstable algorithm from a point that is too close to θ , the subsequent iterates can be quite far from the true parameter.

Appendix C. Proofs of auxiliary results

In this appendix, we collect the proofs of Lemmas 10 and 11 that are central to the proofs of our main theorems.

C.1 Proof of Lemma 10

We ﬁx a radius r R. Our proof is based on the following auxiliary claim: conditioned on the event E from equation (41), we have

sup θ B(θ ,r) F t n(θ) 2r for all t e T (r) = r1 γ

2γc2ε(n, δ ). (75)

Taking this claim as given for the moment, we now establish the bound (44) claimed in the lemma. We do so via induction on the iteration t {0, 1, . . . , e T (r)}. Note that the

Instability, Computational Efficiency and Statistical Accuracy

base-case t = 0 holds trivially, since F 0(θ) F 0 n(θ) = θ θ = 0. Given the induction hypothesis for t, we establish the claim for t = t + 1. For any θ B(θ , r), we have

F t (θ) F t n (θ) = F t+1(θ) F t+1 n (θ) (76)

F(F t(θ)) F(F t n(θ)) + F(F t n(θ)) Fn(F t n(θ))

(i) F t(θ) F t n(θ) + sup eθ B(θ ,2r) F(eθ) Fn(eθ)

(ii) sup θ B(θ ,r) F t(θ) F t n(θ) + c2(2r)γε(n, δ )

(iii) c2(2r)γε(n, δ )t + c2(2r)γε(n, δ )

= (t + 1)c2(2r)γε(n, δ ).

In the above sequence of inequalities, we have made use of the following facts. In step (i), we have used the 1-Lipschitzness (6) of the operator F for the ﬁrst term and the bound (75) on F t n(θ) for the second term. In order to establish step (ii), we have used the fact that θ B(θ , r) for the ﬁrst term, while for the second term we have invoked the deﬁnition of the event E in equation (41) with radius 2r (note that 2R R and the event E is deﬁned for all r R ). Finally step (iii) follows directly from the induction hypothesis. Noting that the bound (75) holds for any t e T (r) and taking supremum over θ B(θ , r) on the LHS of equation (81), we obtain the desired proof of the inductive step.

C.1.1 Proof of claim (75)

We establish the claim (75) by proving the following stronger result: For any ﬁxed r R, and any θ B(θ , r), we have

F t n(θ) r + c2(2r)γε(n, δ ) t for all iterations t = 0, 1, . . . , e T (r). (77)

We note that the claim (75) is a direct application of this result along with the deﬁnition e T (r) = r1 γ 2γc2ε(n,δ ). We now use an induction argument on the iteration t (similar to the ones used in the paragraph above) to establish the claim (77). The base-case t = 0 holds trivially. Let us assume that F t n(θ) r + c2(2r)γε(n, δ ) t and establish the claim (77) for t = t + 1. Note that since t e T (r), this assumption trivially yields that F t n(θ) 2r. We have

F t+1 n (θ) F(F t n(θ)) + F(F t n(θ)) Fn(F t n(θ))

(i) F t n(θ) + sup eθ B(θ ,2r) F(eθ) Fn(eθ)

(ii) (r + c2(2r)γε(n, δ ) t) + c2(2r)γε(n, δ )

= r + c2(2r)γε(n, δ )(t + 1),

where in step (i), we have used the 1-Lipschitzness (6) of the operator F for the ﬁrst term and the observation that F t n(θ) 2r for the second term. On the other hand, in step (ii),

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

we have used the induction hypothesis to bound the ﬁrst term, and invoked the deﬁnition of the event E in equation (41) with radius 2r to bound the second term. Taking supremum over θ B(θ , r) completes the proof.

C.1.2 Proof of claim (45)

Combining the relation λℓ= ν (1 νℓ) with the two inequalities in equation (45), we ﬁnd that it suﬃces to prove the following two bounds:

ε(n, δ ) βνℓ

β 1+β and ε(n, δ ) βνℓ+1

β 1+β (c ) β ν (1+β) . (78)

Observe that λℓ ν α/4; consequently, we ﬁnd that 1/νℓ 4ν /α for all ℓ ℓα. Finally, invoking assumption (14) we ﬁnd that

α max n 1, (c ) 4 α o. (79)

The rest of the proof follows by noting that the upper bound (79) implies the bounds in equation (78).

C.2 Proof of Lemma 11

Fix an arbitrary pair of radii r1, r2 R. Our proof is based on the following intermediate claim

F t n(θ) 2r2 for all t e T (r1, r2). (80)

We prove this claim at the end of this appendix. Assuming that this claim is given at the moment, we now establish the bound (62) claimed in the lemma. We do so by using induction on the iteration t {0, 1, . . . , e T (r1, r2)} where we note that the base-case t = 0 holds trivially, since F 0(θ) F 0 n(θ) = θ θ = 0. Turning to the induction step (with t = t + 1), for any θ with θ [r1, r2], we have

F t (θ) F t n (θ) = F t+1(θ) F t+1 n (θ) (81)

F(F t(θ)) F(F t n(θ)) + F(F t n(θ)) Fn(F t n(θ))

(i) F t(θ) F t n(θ) + sup r1 eθ 2r2 F(eθ) Fn(eθ)

(ii) sup r1 θ 2r2 F t(θ) F t n(θ) + ε(n, δ )

r|γ| 1 (iii) tε(n, δ )

r|γ| 1 + ε(n, δ )

r|γ| 1 = (t + 1) ε(n, δ )

In step (i), we have used the 1-Lipschitzness (6) of the operator F for the ﬁrst term and the upper bound (75) on F t n(θ) for the second term. In step (ii), the upper bound for the ﬁrst term follows from the sequence of inequalities

eρ r1 θ r2 2r2 ρ,

Instability, Computational Efficiency and Statistical Accuracy

whereas for the second term we have invoked the bound eθ := F t n (θ) 2r2 (80) and applied the instability condition (12). Finally, step (iii) follows from a direct application of the induction hypothesis. Note that the bound (75) holds for any t e T (r). By taking supremum over θ B(θ , r) on the LHS of equation (81), we obtain the desired proof of the inductive step.

C.2.1 Proof of bound (80)

We use an inductive argument to show that

F t n(θ) t ε(n, δ )

r|γ| 1 + r2 for all 1 t e T (r1, r2), (82)

which immediately implies the claim (80) once we plug in the deﬁnition of e T (61). For the base-case t = 0, invoking the properties of the operators F and Fn we have

Fn(θ) Fn(θ) F(θ) + F(θ) (i) sup r1 θ r2 Fn(θ) F(θ) + θ

(ii) ε(n, δ )

r|γ| 1 + r2,

where step (i) follows since θ [r1, r2] and the operator F is 1-Lipschitz, and step (ii) follows from the instability condition (12). This proves the base case of the induction hypothesis (82). Now we prove the inductive step. In particular, we assume that the induction hypothesis (82) holds for t e T (r1, r2) 1 and show that the upper bound (82) holds for t = t + 1. Towards this end, unwrapping the expression for F t+1 n (θ) we have

F t n (θ) F t+1 n (θ) F(F t n(θ)) + F(F t n(θ))

(iii) sup r1 θ 2r2 Fn(θ) F(θ) + F t n(θ)

(iv) ε(n, δ )

r|γ| 1 + tε(n, δ )

r|γ| 1 + r2

= (t + 1)ε(n, δ )

r|γ| 1 + r2.

Here, step (iii) follows from the fact that F t n(θ) r1 and the LL(ρ) condition (6); step (iv) stems from the instability condition (12) and the induction hypothesis. This completes the proof of the intermediate claim (82).

Appendix D. Proofs of corollaries

We now collect the proofs of several corollaries stated in the paper. As a high-level summary, our analysis in all three examples in Section 4 involves applying Theorem 5 to analyze gradient descent/ascent and EM, both of which are stable algorithms and exhibit slow

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

convergence for the considered examples. We invoke Theorem 6(b) to characterize the cubic-regularized Newton algorithm, a slowly convergent and unstable algorithm. Finally, the analysis of Newton s method in all the examples relies on Theorem 6(a). Appendices D.1 and D.2 are devoted to the proofs of Corollaries 7 and 8, respectively. We then prove Corollary 9 in Appendix D.3. In this section, the values of universal constants (e.g., c, c

etc.) can change from line-to-line.

D.1 Proof of Corollary 7

In this appendix, we demonstrate the convergence and stability of the gradient and Newton methods. The operators for the gradient method and Newton s method take the following forms

MGA(θ) = θ + η L (θ), and MGA n (θ) = θ + η L n(θ), (83a)

MNM(θ) = θ L (θ)

, and MNM n (θ) = θ L n(θ) L n(θ)

D.1.1 Proofs for the gradient operators

In lieu of the discussion around Corollary 7 it remains to establish that (a) the operator MGA exhibits a slow convergence condition SLOW(1

2) over the Euclidean ball B(θ , 1/2) and (b) the operator MGA n satisﬁes a stability condition STA(1) over the Euclidean ball B(θ , 1/2) with noise function ε(n, δ) = p

log(1/δ)/n when n c log(1/δ) for some universal constant c > 0.

Slow convergence of MGA Direct computation with the gradient of population loglikelihood function L leads to

L (θ) := θ 2(θ2 + 1)(2

1 + θ2 1) θ

= MGA(θ) = θ 1 η 1

2 1 2(θ2 + 1)(2

Noting that the ﬁxed point of the population operator is θ = 0 and that η 8/3, we ﬁnd that

MGA(θ) θ = |θ| 1 η 1

2 1 2(θ2 + 1)(2

2 1 2(θ2 + 1)

for all |θ| [0, 1/2].

Thus the population operator MGA satisﬁes a slow convergence condition SLOW(1

2) over the ball B(θ , 1/2).

Instability, Computational Efficiency and Statistical Accuracy

Stability of the sample operator MGA n We have MGA n (θ) MGA(θ) = η L(θ) Ln(θ)

2(θ2 + 1) 2

i=1 (1 Ri) 1

i=1 Ri Y 2 i

Recall that, R1, . . . , Rn are i.i.d. samples from Bernoulli distribution with probability 1/2. Invoking Hoeﬀding s inequality yields that 2 n

i=1 (1 Ri) 1

with probability at least 1 δ. Additionally, as Y1, . . . , Yn are i.i.d. samples from standard Gaussian distribution N(0, 1) and R1, . . . , Rn are independent of Y1, . . . , Yn, by following the same argument as that in the proof of Lemma 1 from the paper (Dwivedi et al., 2020a), we can demonstrate that 1 n

i=1 Ri Y 2 i 1

as long as the sample size n c2 log(1/δ) with probability at least 1 δ where c1 and c2 are some universal constants. Combining the inequalities (85) and (86) yields the following bound

sup θ B(θ ,r)

MGA n (θ) MGA(θ)

n sup θ B(θ ,r)

2(θ2 + 1) 2

1 + θ2 1 + |θ|

with probability at least 1 2δ for any r > 0. Here, the second inequality in the above display follows from the fact that (θ2 + 1) 2

1 + θ2 1 1 for all θ R. Thus, the

sample-level operator MGA n is STA(1)-stable over the Euclidean ball B(θ , 1/2) with noise function ε(n, δ) = p

log(1/δ)/n when n c log(1/δ) for some universal constant c > 0.

D.1.2 Proof for the Newton operators

Similar to the proof for Newton operators in over-speciﬁed Gaussian mixtures (see Appendix D.2.1), we ﬁrst verify the geometric convergence of population operator MNM and the instability condition of sample operator MNM n . Then, we validate Assumption (D) by

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

showing that the Newton updates are monotone decreasing and satisfy the following lower bound MNM(θ) |θ n| , (87)

for all |θ| [|θ n| , 1/2] for any global maxima θ n of the sample log-likelihood function Ln in equation (23).

Geometric convergence of MNM We can verify that L (θ) < 0 for all θ R. Additionally, we have the following equation

MNM(θ) θ = |θ θ | θ2T2(θ) T1(θ) + θ2T2(θ),

where the functions T1 and T2 are deﬁned as

2 1 2(θ2 + 1)(2

θ2 + 1 1) , and

T2(θ) := 1 2(θ2 + 1)2(2

From the earlier proof argument for slow convergence of MGA, we have T1(θ) θ2

8 for all |θ| [0, 1/2]. Given the above lower bound of T1, we directly obtain that

MNM(θ) θ |θ θ | T2(θ) 1/8 + T2(θ) |θ θ | T2(1/2) 1/8 + T2(1/2) 4

for all |θ| [0, 1/2] where the last inequality is due to the fact that T2(θ)/(c + T2(θ)) achieves its maximum value at |θ| = 1/2. Therefore, the population operator MNM is FAST(4/5)-convergent on the ball B(θ , 1/2).

Instability of the sample Newton operator MNM n Given the formulations of population operator MNM and sample operator MNM n from Newton s method, we have the following inequality

MNM n (θ) MNM(θ) L (θ) L n(θ) L (θ)

| {z } :=J1

+ L n(θ) 1 L (θ) 1 L n(θ)

| {z } :=J2

We claim the following upper bounds of J1 and J2:

J1 c1 1 |θ|

with probability at least 1 2δ as long as |θ| [0, 1/2] and n c log(1/δ), and

Instability, Computational Efficiency and Statistical Accuracy

with probability at least 1 6δ when |θ|

2c (log(1/δ)/n)1/4. With the upper bounds (89) and (90) of J1 and J2 respectively, we arrive at the following inequality

MNM n (θ) MNM(θ) c |θ| 1 p

log(1/δ)/n,

with probability at least 1 8δ as long as

2c (log(1/δ)/n)1/4 |θ| 1/2. As a consequence, the sample operator MNM n satisﬁes instability condition UNS(1) over the annu-

2c (log(1/δ)/n)1/4 , 1/2) with noise function ε(n, δ) =

n as long as

n c log(1/δ).

Proof for the upper bound of J1 When n c log(1/δ), we can validate that

L (θ) L n(θ) c |θ|

for any |θ| [0, 1/2] with probability at least 1 2δ where c and c are some universal constants. Furthermore, based on the computations in Appendix D.1.2, we ﬁnd that

L (θ) = T1(θ) + θ2T2(θ) θ2

8 + θ2T2(1/2) 11θ2

for any |θ| [0, 1/2]. Combining the previous inequalities, we have the following upper bound with J1:

J1 c1 1 |θ|

with probability at least 1 2δ as long as |θ| [0, 1/2] and n c log(1/δ).

Proof for the upper bound of J2 In order to derive an upper bound for J2, we make use of the following bounds:

L n(θ) L (θ) c2

for all |θ| [0, 1/2] with probability at least 1 2δ when n c log(1/δ). Here, c, c1, c2, c3 in the above bounds are universal constants independent of δ. Deferring the proofs of these claims to later, we now proceed to give an upper bound for J2 based on the given bounds in the above display. In particular, from the formulation

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

of J2, we achieve that

n + |θ|3 ! q

with probability at least 1 6δ when |θ|

2c (log(1/δ)/n)1/4 where C is some univer-

sal constant. Here, the last inequality is due to |θ| q

n + |θ|3 |θ|3 1 + 1

n |θ|2 /2 as long as we have |θ|

2c (log(1/δ)/n)1/4.

Proof of claim (92a) Invoking triangle inequality, when n c log(1/δ) we have

L n(θ) c |θ|

2(θ2 + 1) 2

with probability at least 1 2δ for any |θ| [0, 1/2] where the inequality in the above display is due to the inequalities (85) and (86). Furthermore, we can validate that

2(θ2 + 1) 2

θ2 + 1 1 3θ2

for any |θ| [0, 1/2]. In light of the previous inequalities, we arrive at the following inequality

L n(θ) 3c |θ|

with probability at least 1 2δ for all |θ| [0, 1/2]. As a consequence, we reach the conclusion of claim (92a).

Proof of claims (92b) and (92c) The proof of claim (92b) is a direct application of triangle inequality and the fact that |θ| [0, 1/2]. In addition, we have

L n(θ) L (θ) L n(θ) L (θ) c

with probability at least 1 2δ for any |θ| [0, 1/2] where c, c are universal constants independent of δ and the last inequality in the above display is due the results from equation (91) and claim (92b). As a consequence, we achieve the conclusion of claim (92c).

Instability, Computational Efficiency and Statistical Accuracy

Lower bound and monotonicity of Newton updates Now, we proceed to verify the lower bound of Newton updates in claim (87). In order to ease the ensuing presentation, we denote f(θ) := 1 (θ2+1)(2

θ2+1 1) for all θ. The global maxima θ n of the sample log-likelihood

function Ln are the solutions of the following equation

i=1 Ri Y 2 i

The speciﬁc forms of θ n depend on the values of Ri, Yi for i [n]. In particular, when Pn i=1 Ri Y 2 i < Pn i=1(1 Ri), namely, the Hessian of sample likelihood function Ln at 0 is positive, the function Ln is bimodal and symmetric around 0. Additionally, θ n are diﬀerent from 0 and become the solution of the following equation

i=1 Ri Y 2 i

On the other hand, when Pn i=1 Ri Y 2 i > Pn i=1(1 Ri), the function Ln is unimodal and symmetric around 0. Under this case, θ n = 0 is the unique global maximum. Without loss of generality, we assume that θ > 0 and the global maxima are solutions of equation (93). From the formulation of MNM n , the inequality MNM n (θ) > 0 is equivalent to

θf (θ) + f(θ) < f(θ n),

which holds for all θ |θ n| since f(θ) < f(θ n) and f (θ) < 0 as θ |θ n|. Therefore, we have MNM n (θ) > 0 for all θ |θ n|. Now, in order to demonstrate that MNM n (θ) |θ n| for θ |θ n|, it is equivalent to

(|θ n| θ) θf (θ) + |θ n| (f(θ) f(θ n)) 0. (94)

Invoking mean value theorem, we can ﬁnd some constant θ (|θ n| , θ) such that

f(θ) f(θ n) = f(θ) f(|θ n|) = f ( θ)(θ |θ n|).

Given the above equation, the inequality (94) can be rewritten as

|θ n| f ( θ) θf (θ) (95)

for all θ |θ n|. Since the function θf (θ) is a decreasing function in (0, 1/2], we have θf (θ) θf ( θ) for any θ < θ. Since f ( θ) < 0 and θ > |θ n|, we ﬁnd that θf ( θ) |θ n| f ( θ). In light of these two inequalities, we achieve the inequality (95). As a consequence, we reach the conclusion of claim (87).

D.2 Proof of Corollary 8

Under the model (28b), the sample EM operator takes the form

GEM n (θ) = 1

i=1 Xi tanh(θXi),

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

where tanh(x) = exp(x) exp( x)

exp(x)+exp( x) is the hyperbolic tangent. In our prior work (cf. Theorem 3 in the paper (Dwivedi et al., 2020a)), we studied the sample EM operator for this model. Accordingly, in this paper, we limit our analysis to the Newton updates; see Appendix D.2.1 for the details. The sample and population Newton updates take the form

GNM(θ) = θ L (θ) 1 L (θ) = θ + E [X tanh(Xθ)] θ

E X2 tanh2(Xθ) , and (96a)

GNM n (θ) = θ L n(θ) 1 L n(θ)

n Pn i=1 Xi tanh(Xiθ) θ

1 n Pn i=1 X2 i tanh2(Xiθ) + 1 1

n Pn i=1 X2 i . (96b)

D.2.1 Proofs for Newton operators

We begin by verifying the fast convergence of the operator GNM and then the instability of the operator GNM n with respect to GNM in Theorem 6. Then, we demonstrate that the Newton updates satisfy Assumption (D). Noting that it can be done by establishing that the Newton updates are monotone decreasing and admit the following lower bound

GNM n (θ) |θ n| (97)

for all |θ| [|θ n| , 1/3] for any global maximum θ n of Ln.

Fast convergence of the population-level operator GNM We provide the full proof for the case θ (0, 1

3]; the proof for the case θ [ 1

3, 0) is analogous. We make use of the following known bounds (Dwivedi et al., 2020b) on the hyperbolic function x 7 x tanh(x):

3 x tanh(x) x2 x4

15 for all x R. (98)

Applying this bound, we ﬁnd that

E [X tanh(Xθ)] 1

θE (Xθ)2 (Xθ)4/3 + 2(Xθ)6/15 = θ θ3 + 2θ5, as well as

E X2 tanh2(Xθ) 1

θ2 E (Xθ)4 = 3θ2,

and consequently that

θ E [X tanh(Xθ)]

E X2 tanh2(Xθ) θ (θ θ3 + 2θ5)

3θ2 = θ 2θ3

Noting that GNM(θ) = θ θ E[X tanh(Xθ)]

E[X2 tanh2(Xθ)] and θ = 0, we conclude that the population

Newton operator GNM is FAST(7

9)-convergent over the ball B(θ , 1

Instability, Computational Efficiency and Statistical Accuracy

Instability of the sample-level operator GNM n Let us introduce the shorthand

i=1 Xi tanh(Xiθ), and Bn := 1

i=1 X2 i tanh2(Xiθ) + 1 1

Using the deﬁnitions (96b) of the operators GNM n and GNM, we ﬁnd that GNM n (θ) GNM(θ) (99)

E [X tanh(Xθ)] θ

E X2 tanh2(Xθ) An θ

|E [X tanh(Xθ)] An|

E X2 tanh2(Xθ)

| {z } :=J1

1 E X2 tanh2(Xθ) 1

| {z } :=J2

Thus, in order to bound the diﬀerence GNM n (θ) GNM(θ) , it suﬃces to derive bounds for the terms J1 and J2.

Upper bound for J1 For a given δ (0, 1), as long as the sample size n C log(1/δ) for some universal constant C, we can apply Lemma 1 from the paper (Dwivedi et al., 2020a) to assert that

|E [X tanh(Xθ)] An| c |θ|

n for all |θ| (0, 1

with probability 1 δ. Moreover, the bound (98) implies that

E X2 tanh2(Xθ) 1

θ2 E (Xθ)2 (Xθ)4

3 2 = 3θ2 10θ4 + 35θ6

3]. Combining the above inequalities yields

J1 = |E [X tanh(Xθ)] An|

E X2 tanh2(Xθ) c |θ| q

for all |θ| (0, 1/3) with probability at least 1 δ.

Upper bound for J2 In order to obtain an upper bound for J2, we claim the following key bounds appearing in its formulation:

θ2 clog4(3n/δ) n

E X2 tanh2(Xθ) Bn c3 log(n/δ) n , (102c)

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

for all |θ| (0, 1/3] with probability at least 1 2δ as long as the sample size n c log(1/δ). Here, c, c1, c2, c3 in the above probability bounds are universal constants independent of δ. Assume that the above claims are given at the moment. The results in these claims lead to

E X2 tanh2(Xθ) Bn Bn E X2 tanh2(Xθ)

n + |θ|3 ! log(n/δ) n

θ2 θ2 clog4(3n/δ) n

|θ| log(n/δ) n (103)

with probability at least 1 5δ. Here, the last inequality is due to the facts that

n + |θ|3 |θ|3 1 + 1

and θ2 clog4(3n/δ) n |θ|2 /2,

as long as |θ|

2c log2(3n/δ)/n1/4. Plugging the bounds (101) and (103) into equa-

tion (99), the operator GNM n is UNS( 1)-unstable over the annulus A(θ ,

2c log2(3n/δ)

n1/4 , 1/3)

with noise function ε(n, δ) = log(n/δ) n as long as the sample size n C log8(3n/δ)

Proof of claim (102a) Invoking the concentration bound (100) and applying the triangle inequality, we ﬁnd that

i=1 Xi tanh(Xiθ) E [X tanh(Xθ)]

+ |E [X tanh(Xθ)] θ|

E [Xθ tanh(Xθ)] θ2 !

for all |θ| (0, 1/3] with probability 1 δ. Next, taking expectation on both sides in the bounds (98), we ﬁnd that

E [Xθ tanh(Xθ)] θ2 E (Xθ)2 (Xθ)4

= θ4 + 2θ6 7θ4

E [Xθ tanh(Xθ)] θ2 E (Xθ)2 (Xθ)4

Putting these pieces together yields the claim (102a).

Instability, Computational Efficiency and Statistical Accuracy

Proof of claim (102b) Invoking standard chi-squared concentration bounds and applying triangle inequality, we obtain that

i=1 X2 i tanh2(Xiθ)

i=1 X2 i tanh2(Xiθ)

with probability at least 1 δ. Using the lower bound from inequality (98), we ﬁnd that

i=1 X2 i tanh2(Xiθ) 1

θX2 i θ3X4 i 3

(i) θ2 3 c log2(3n/δ) n

15 + c log3(3n/δ) n

105 c log4(3n/δ) n

θ2 c log4(3n/δ) n ,

with probability at least 1 δ for some universal constant c. Here step (i) makes use of the following concentration bound for higher moments of Gaussian random variables (Lemma 5 (Dwivedi et al., 2020b)):

i=1 X2k i E h X2ki c logk(3n/δ)

3 for k {2, 4, 6}

with probability at least 1 δ/3 for k {2, 4, 6}. Putting together the pieces yields the claim (102b).

Proof of claim (102c) Applying the triangle inequality yields E X2 tanh2(Xθ) Bn (104)

i=1 X2 i tanh2(Xiθ) E X2 tanh2(Xθ) +

i=1 X2 i tanh2(Xiθ) E X2 tanh2(Xθ) + c

with probability at least 1 δ. By adapting the truncation argument from the proof of Lemma 5 in the paper (Dwivedi et al., 2020b) for the random variable X tanh(X) with X N(0, 1), it follows that 1 n

i=1 X2 i tanh2(Xiθ) E X2 tanh2(Xθ) c log(n/δ) n ,

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

for all |θ| (0, 1/3] with probability at least 1 δ. Putting the results together yields the claim (102c).

Lower bound and monotonicity of Newton updates We ﬁrst make some observations about the structure of the log-likelihood function Ln. Deﬁne

f(θ) := θ 1

i=1 Xi tanh(Xiθ).

When Pn i=1 X2 i > n, it can be shown (by computing the gradient and Hessian) that the log-likelihood Ln is bimodal and symmetric around 0. It has multiple global maxima θ n that are non-zero, and are solutions of the equation f(θ) = 0. On the other hand, when Pn i=1 X2 i n, the function Ln is unimodal and symmetric around 0, and the point θ n = 0 is the unique global maximum of the log likelihood. Next we verify the lower bound of Newton updates GNM n (θ) in claim (97); the proof of monotonicity can be argued similarly. Without loss of generality, we only consider the setting when the global maxima θ n are diﬀerent from 0 and θ > 0. Under that case, the Hessian of the function Ln at |θ n| is negative. A direct computation with the gradient of the function f leads to

f (θ) = 1 1

i=1 X2 i sech2(Xiθ) = 1 1

i=1 X2 i sech2(|Xi| |θ|)

i=1 X2 i sech2(|Xi| |θ n|)

= 2Ln(θ n) > 0

for any θ > |θ n|. Therefore, the function f is a strictly increasing function when θ > |θ n|. It leads to the inequality f(θ) f(θ n) = 0 for all θ |θ n|. Further computation with second derivative of f yields that

i=1 X3 i tanh(Xiθ)sech2(Xiθ) > 0

for all θ > 0. The above inequality is due to Xi tanh(Xiθ) > 0 for all θ > 0 and i [n]. Thus, the function f is strictly increasing when θ > 0. Now the inequality GNM n (θ) |θ n| for all θ |θ n| is equivalent to

f (θ)(θ |θ n|) f(θ) f(θ n). (105)

Invoking the mean value theorem, we ﬁnd that

f(θ) f(θ n) = f(θ) f(|θ n|) = f ( θ)(θ |θ n|)

for some θ (|θ n| , θ). Given that equality, the equality (105) can be rewritten as f (θ) f ( θ) for all θ |θ n|. This inequality is true since f is an increasing function when θ > 0. As a consequence, we achieve the conclusion of claim (97).

Instability, Computational Efficiency and Statistical Accuracy

D.3 Proof of Corollary 9

In this appendix, we demonstrate the convergence and stability properties of operators from gradient descent and (cubic-regularized) Newton s methods in the non-linear regression model. The sample operators of these methods take the following forms

FGD n (θ) = θ η e L n(θ) = θ η 2p

i=1 X4p i θ4p 1 2p

i=1 Yi X2p i θ2p 1 , (106a)

FNM n (θ) = θ h e L n(θ) i 1 e L n(θ)

1 n Pn i=1 X4p i θ2p+1 1 n Pn i=1 Yi X2p i θ 4p 1

n Pn i=1 X4p i θ2p 2p 1

n Pn i=1 Yi X2p i , and (106b)

FCNM n (θ) = arg min y R

e L n(θ)(y θ) + 1

2 e L n(θ)(y θ)2 + L |y θ|3 , (106c)

where L := (4p 1)!!(4p 1)p/3. Noting that the speciﬁc choice of L in the formulation of the cubic-regularized Newton operator FCNM n arises because the second-order derivative of e Ln is Lipschitz continuous with constant L. Similarly, the population-level operators are given by

FGD(θ) = θ η e L (θ) = θ 1 (4p 1)!!(2p)ηθ4p 2 , (107a)

FNM(θ) = θ h e L (θ) i 1 e L (θ) = (4p 2)

4p 1 θ, and (107b)

FCNM(θ) = arg min y R

e L (θ)(y θ) + 1

2 e L (θ)(y θ)2 + L |y θ|3 . (107c)

D.3.1 Proofs for the gradient descent operators

In order to achieve the conclusion of the corollary with convergence rate of updates from gradient descent method, it is suﬃcient to demonstrate that the sample gradient operator FGD n is STA(2p 1)-stable over the Euclidean ball B(θ , 1) with noise function ε(n, δ) = log2p(n/δ) n . By using the similar truncation argument as that in equation (104), we can verify the following concentration bound

i=1 Yi X2p i

c log2p(n/δ)/ n, (108)

with probability 1 δ where c is some universal constant. An application of triangle inequality yields

FGA(θ) FGA n (θ)

i=1 X4p i (4p 1)!!

|θ|4p 1 + clog2p(n/δ) n |θ|2p 1 . (109)

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

Based on known concentration bounds for moments of Gaussian random variables (cf. Lemma 5 in (Dwivedi et al., 2020b)), we have

i=1 X4p i (4p 1)!!

c log2p(n/δ)/ n (110)

with probability 1 δ where c is some universal constant. Substituting the inequality (110) into equation (109) yields the above claim with the stability of FGD n .

D.3.2 Proofs for the Newton operators

Moving to the convergence rates of updates from Newton s method, it is suﬃcient to establish the instability of FNM n with respect to FNM, and moreover that, for any global minimum θ n of the sample least-squares function e Ln in equation (34b), we have

FNM n (θ) |θ n| , (111)

for all |θ| [|θ n| , 1].

Instability of the sample Newton operator FNM n Let us introduce the following shorthand notation:

i=1 Yi X2p i

i=1 Yi X2p i

Applying the triangle inequality yields

FNM n (θ) FNM(θ)

(4p 1)!!(2p)θ4p 1 An

(4p 1)!!(2p)(4p 1)θ4p 2 | {z } :=J1

+ |An| 1 (4p 1)!!(2p)(4p 1)θ4p 2 1

| {z } :=J2

Upper bound for J1 Invoking triangle inequality, we obtain that

An (4p 1)!!(2p)θ4p 1

i=1 X4p i (4p 1)!!

i=1 Yi X2p i

clog2p(n/δ) n

|θ|4p 1 + |θ|2p 1 ,

Instability, Computational Efficiency and Statistical Accuracy

where the last inequality is due to concentration bounds for moments of Gaussian random variables (108). With the above inequality, we have

J1 c log2p(n/δ) |θ|4p 1 + |θ|2p 1

(4p 1)!!(2p)(4p 1) n |θ|4p 2 2c |θ|2p 1 log2p(n/δ) n , (112)

for all |θ| 1 with probability at least 1 2δ.

Upper bound for J2 In order to obtain an upper bound for J2, we exploit the following concentration bounds

|θ|4p 1 + log2p(n/δ) n |θ|2p 1 , (113a)

Bn (4p 1)!!(2p)(4p 1)θ4p 2 c2 log2p(n/δ) n , (113b)

(4p 1)!!(2p)(4p 1)θ4p 2 clog2p(n/δ) n

for all |θ| 1 with probability at least 1 2δ. Here, c, c1, c2, c3 are universal constants independent of δ. The proofs of the above claims are direct applications of triangle inequalities and concentration bounds we utilized earlier with gradient descent operators in Appendix D.3.1; therefore, they are omitted. In light of the above bounds, we can bound J2 as follows:

|θ|4p 1 + log2p(n/δ) n |θ|2p 1

log2p(n/δ) n

θ4p 2 (4p 1)!!(2p)(4p 1)θ4p 2 clog2p(n/δ) n

c3c 1 |θ|2p 1 log2p(n/δ) n , (114)

for all |θ| [C logp/(2p 1)(n/δ)/n1/4(2p 1), 1] with probability 1 6δ where C is solution of the equation (4p 1)!!(2p)(4p 1)θ4p 2 = 2clog2p(n/δ) n . Combining the results from equations (112) and (114), we achieve that

FNM n (θ) FNM(θ) c 1 |θ|2p 1 log2p(n/δ) n (115)

for all |θ| [C logp/(2p 1)(n/δ)/n1/4(2p 1), 1] with probability 1 8δ where c is some universal constant. As a consequence, the sample operator FNM n is UNS( 2p + 1)-unstable over the annulus

A(θ , c1 logp/(2p 1)(n/δ)/n1/4(2p 1), 1) with noise function ε(n, δ) = log2p(n/δ) n .

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

Lower bound and monotonicity of Newton updates Moving to the claim (111), we ﬁrst study the global minima θ n of the sample least-squares function e Ln in equation (34b). In particular, they satisfy the equation e Ln(θ n) = 0, which is equivalent to

i=1 Yi X2p i

(θ n)2p 1 = 0.

Given the above equation, the speciﬁc form of θ n depends on the sign of second derivative of e Ln at 0. In particular, when Pn i=1 Yi X2p i > 0, the function e Ln is bimodal and symmetric around 0. Additionally, global mimima θ n have the form

i=1 Yi X2p i

On the other hand, when 1

n Pn i=1 Yi X2p i 0, the function e Ln is unimodal and symmetric around 0. Furthermore, it has only global minimum θ n = 0. Now, we focus on the case θ > 0 and Pn i=1 Yi X2p i > 0, i.e., the global minima θ n are diﬀerent from 0 and the solutions of equation (116). A simple calculation demonstrates that Bn > 0 and FNM n (θ) > 0 as long as θ > |θ n|. Now, the inequality FNM n (θ) |θ n| is equivalent to

i=1 Yi X2p i

θ2p |θ n| +

i=1 Yi X2p i

for θ |θ n|. In light of the closed form expression of |θ n| in equation (116), a simple algebra with the above inequality leads to the inequality

(4p 2)θ2p+1 + (2p 1) |θ n|2p+1 (2p 2) (θ n)2p θ + (4p 1) |θ n| θ2p,

which holds true due to AM-GM inequality. Thus, we have established the claim (111).

D.3.3 Proofs for the cubic-regularized Newton operators

Our proof is divided into three separate steps. First, we establish the slow convergence of operator FCNM. Then, we proceed to establishing the instability of operator FCNM n . Finally, we demonstrate the monotonicity of cubic-regularized Newton updates and their lower bound

FCNM n (θ) |θ n| , (117)

for all |θ| [|θ n| , 1] for any global minima θ n of the sample least-squares function e Ln in equation (34b).

Instability, Computational Efficiency and Statistical Accuracy

Slow convergence of FCNM Without loss of generality, we assume that θ (0, 1]. Direct computation leads to

FCNM(θ) = θ + θ4p 2 r

θ8p 4 + 2 4p 1θ4p 1

2 4p 1θ4p 1

θ8p 4 + 2 4p 1θ4p 1 θ 1 c1θ(4p 3)/2 ,

for any θ (0, 1] where c1 < 1 is some universal constant. As a consequence, the operator FCNM satisﬁes slow convergence condition SLOW(2/(4p 3)) over the Euclidean ball B(θ , 1).

Instability of the sample operator FCNM n Suppose that θ > |θ n|, where θ n are global minima of the sample least-squares function e Ln. With this condition, direct computation of FCNM n (θ) leads to

FCNM n (θ) = θ 2 e L n(θ)

r e L n(θ) 2 + 12L e L n(θ)

:= θ 2 e L n(θ) Tn .

Similar to the previous proofs with cubic-regularized Newton operators, we ﬁnd that

FCNM(θ) FCNM n (θ) 2 e L (θ) |Tn T| + T e L n(θ) e L (θ)

where T := e L (θ)+

r e L (θ) 2 + 12L e L (θ) q

12L e L (θ) C θ(4p 1)/2 for some universal

constant C > 0. Additionally, we have

|Tn T| c θ 1/2 log2p(n/δ) n

when θ c max n |θ n| , logp/(2p 1)(n/δ)

n1/4(2p 1) o with probability 1 10δ for some universal constants

c and c . Furthermore, we can check that Tn q

12L e L n(θ) c θ(4p 1)/2 as long as

θ c max n |θ n| , logp/(2p 1)(n/δ)

n1/4(2p 1) o with probability 1 2δ for some universal constant c . These inequalities guarantee that

FCNM(θ) FCNM n (θ) c1θ 1/2 log2p(n/δ) n

for all θ c max n |θ n| , logp/(2p 1)(n/δ)

n1/4(2p 1) o with probability 1 14δ. As a consequence, we con-

clude that the operator FCNM n is UNS( 1/2)-unstable over the annulus A(θ , clogp/(2p 1)(n/δ)

n1/4(2p 1) , 1)

with noise function ε = log2p(n/δ) n where c is some universal constant.

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

Lower bound and monotonicity of cubic-regularized Newton updates To simplify the presentation, we only consider θ > 0 and the setting when global minima θ n are diﬀerent from 0. As θ |θ n|, the inequality FCNM n (θ) |θ n| is equivalent to

r e L n(θ) 2 + 12L e L n(θ) > 2 e L n(eθ)

for some eθ (|θ n| , θ). This inequality holds since e L n and e L n are positive and strictly increasing when θ > |θ n|, thereby completing the proof of claim (117).

Appendix E. Extension to multivariate settings

In this appendix, we discuss some extensions of the theoretical results in Section 4 to multivariate settings. Here we state detailed theoretical results for the EM algorithm and gradient descent for multivariate versions of the over-speciﬁed mixture model and the nonlinear regression model. We explore the behavior of Newton s method via experimental studies in both Figures 4 and 6.

E.1 Over-speciﬁed mixture model

We denote by φ( ; θ, σ2Id) the density of N(θ, σ2Id) random variable, i.e.,

φ(x; θ, σ2Id) = (2πσ2) d/2e x θ 2

Assume that X1, . . . , Xn be n are i.i.d. samples from N(0, Id). We then ﬁt a two-component symmetric Gaussian mixture with equal ﬁxed weights whose density is given by:

2φ(x; θ, Id) + 1

2φ(x; θ, Id), (118)

where θ Rd is the parameter to be estimated. Given the model, the true parameter is unique and given by θ = 0. Similar to the univariate setting in Section 4.2, we also use the EM algorithm to estimate θ = 0. Direct calculation of the sample EM operator yields that

GEM n (θ) = 1

i=1 Xi tanh(θ Xi),

where tanh(x) = exp(x) exp( x) exp(x)+exp( x) for all x R. The result characterizing the behavior of sample EM operator in the multivariate setting (118) is already proven in our prior work (Dwivedi et al., 2020a) (see Theorem 3 in that paper). So as to keep our discussion self-contrained, we restate it here:

Corollary 13 For the over-speciﬁed Gaussian mixture model (118) with θ = 0, given some δ (0, 1) and for any ﬁxed α (0, 1/4) and initialization θ0 B(θ , 1), with probability at least 1 δ the sequence θt := (GEM n )t(θ0) of EM iterates satisﬁes the bound

d + log(log(1/α)

4 α for all iterates t c 1 pn

as long as n c 1(d + log log(1/α)

Instability, Computational Efficiency and Statistical Accuracy

The result of Corollary 13 shows that the EM iterates converge to a radius of convergence (d/n)1/4 around the true parameter θ = 0 after p

n/d number of iterations. Note that our simulation results for EM, as shown in Figure 4, are consistent with this theoretical prediction.

E.2 Non-linear regression model

We now turn to the multivariate instantiation of the non-linear regression model considered in the main text. Suppose that we observe pairs (Xi, Yi) Rd R generated from the model

Yi = g X i θ + ξi for i = 1, . . . , n, (119)

where ξi N(0, 1). We assume that the covariate vectors Xi are drawn i.i.d. from the multivariate Gaussian N(0, Id). As in our study of the univariate case, we consider the family of link functions g(x) = x2p for p 1 and the unknown parameter θ = 0. With this set-up, the maximum likelihood estimate for θ is based on the minimization problem

min θ Rd e Ln(θ) where e Ln(θ) := 1

Yi X i θ 2p 2 . (120)

By taking the expectation of e Ln with respect to X1, . . . , Xn N(0, Id), we ﬁnd that the corresponding population version of e L takes the form

e L(θ) := 1

" Y X θ 2p 2#

= 1 + (4p 1)!! θ θ 4p

The sample operator for the gradient method is given by

FGD n (θ) = θ η e Ln(θ)

i=1 Xi(X i θ)4p 1 2p

i=1 Yi Xi(X i θ)2p 1 , (122)

whereas the population level operator corresponding to the operator FGD n takes the form

FGD(θ) = θ η e L(θ) = θ 1 (4p 1)!!2pη θ 4p 2 , (123)

where Id denotes the identity matrix in d dimension. We ﬁrst state a result concerning the contraction and stability properties of the population and sample operators FGD and FGD n .

Lemma 14 (a) For any step size η (0, 1 (4p 1)!!(2p)], the gradient operator FGD is SLOW( 1 4p 2)- convergent over the ball B(θ , 1). (b) The operator FGD n is STA(2p 1)-stable over the ball B(θ , 1) with noise function ε(n, δ) = q

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

102 103 104 Number of samples n

Statistical error vs n

GD, d=2: slope= 0.23 NM, d=2: slope= 0.26 GD, d=4: slope= 0.26 NM, d=4: slope= 0.24

102 103 104 Number of samples n

# Iterations to converge

Iteration complexity vs. sample size

GD, d=2: slope=0.46 NM, d=2: slope=0.07 GD, d=4: slope=0.48 NM, d=4: slope=0.05

Figure 6. Plots characterizing the behavior of Gradient Descent (GD) and Newton s method (NM) for the non-linear regression with p = 1 for d = 2 and d = 4 dimensions. (a) Log-log plots of the Euclidean distance bθn θ 2 versus the sample size. It shows that all the algorithms converge to an estimate at Euclidean distance of the order n 1/4 from the true parameter θ . (b) Log-log plots for the number of iterations taken by diﬀerent algorithms to converge to the ﬁnal estimate.

The proof of Lemma 14 is deferred to the end of this appendix. Based on the result of that lemma, we have the following result characterizing the behavior of the updates from the gradient descent algorithm for solving e Ln.

Corollary 15 For the non-linear regression model (119) with θ = 0, given some δ (0, 1) and for any ﬁxed α (0, 1/4) and initialization θ0 B(θ , 1), with probability at least 1 δ the sequence θt := (FGD n )t(θ0) generated by gradient descent satisﬁes the bound

d + log(log(1/α)

4p α for all iterates t c 1 n

as long as n c 1(d + log log(1/α)

Based on the result of Corollary 15, the updates from the gradient method converge to a ball of radius of the order of (d/n)1/4p around the true parameter θ = 0 after an order of (n/d)(2p 1)/2p number of iterations. We further illustrate these behaviors of the gradient method when p = 1 in Figure 6. Based on these results, the computational complexity of the gradient method is at the order of n

2p d 1 2p . For the Newton s method, the experimental results in Figure 6 show that the Newton iterates also converge to the similar radius of convergence (d/n)1/4p after log(n) number of iterations. Since each iteration of the Newton s method takes an order of n d + d3 arithmetic operations where d3 is computational complexity of computing inverse of an d d matrix via Gauss-Jordan elimination, the overall complexity required to reach to the ﬁnal estimate scales as (nd + d3) log n. Thus, when

6p 1 4p 1 n, Newton s method is computationally more eﬃcient than the gradient descent method.

Instability, Computational Efficiency and Statistical Accuracy

E.2.1 Proof of Lemma 14

The slow contraction of the population gradient operator FGD follows immediately from its deﬁnition. Furthermore, the proof of the stability of the sample operator FGD n follows from the concentration bound in Corollary 3 in (Mou et al., 2019). In fact, from the proof of Corollary 3 in (Mou et al., 2019), as long as r 1 we have

sup θ B(θ ,r)

FGD n (θ) FGD(θ) Cr2p 1 r

d + log(1/δ)

as long as n C (d + log(d/δ))4p where C and C are some universal constants. As a consequence, we obtain the conclusion of the lemma with the contraction and stability of the operators FGD and FGD n .

A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Annals of Statistics, 40(5):2452 2482, 2012.

Arash A Amini and Martin J Wainwright. High-dimensional analysis of semideﬁnite relaxations for sparse principal components. In 2008 IEEE International Symposium on Information Theory, pages 2454 2458. IEEE, 2008.

S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Annals of Statistics, 45:77 120, 2017.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183 202, 2009.

Stephen Becker, J erˆome Bobin, and Emmanuel J Cand es. Nesta: A fast and accurate ﬁrst-order method for sparse recovery. SIAM Journal on Imaging Sciences, 4(1):1 39, 2011.

T. T. Cai, J. Ma, and L. Zhang. CHIME: Clustering of high-dimensional Gaussian mixtures with EM algorithm and its optimality. Annals of Statistics, To Appear.

E. J. Cand es, T. Strohmer, and V. Voroninski. Phase Lift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66:1241 1274, 2012.

E. J. Cand es, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger ﬂow: Theory and algorithms. IEEE Transactions on Information Theory, 61:1985 2007, 2015.

R. J. Carroll, J. Fan, I. Gijbels, and M. P. Wand. Generalized partially linear single-index models. Journal of the American Statistical Association, 92:477 489, 1997.

Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning, pages 745 754, 2018.

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

J. Chen. Optimal rate of convergence for ﬁnite mixture models. Annals of Statistics, 23(1): 221 233, 1995.

Y. Chen and M. J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. Technical report, UC Berkeley, September 2015. arxiv:1509.03025.pdf.

Yuansi Chen, Chi Jin, and Bin Yu. Stability and convergence trade-oﬀof iterative optimization algorithms. ar Xiv preprint ar Xiv:1804.01619, 2018a.

Yuxin Chen, Yuejie Chi, Jianqing Fan, and Cong Ma. Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming, pages 1 33, 2018b.

H. Chernoﬀ. Estimation of the mode. Annals of the Institute of Statistical Mathematics, 16:31 41, 1964.

C. Daskalakis, C. Tzamos, and M. Zampetakis. Ten steps of EM suﬃce for mixtures of two Gaussians. In Proceedings of the 2017 Conference on Learning Theory, 2017.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 39:1 38, 1997.

P. Diggle and M. G. Kenward. Informative drop-out in longitudinal data analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 43:49 93, 1994.

R. Dwivedi, N. Ho, K. Khamaru, M. J. Wainwright, M. I. Jordan, and B. Yu. Singularity, misspeciﬁcation, and the convergence rate of EM. To appear, Annals of Statistics, 2020a.

R. Dwivedi, N. Ho, K. Khamaru, M. J. Wainwright, M. I. Jordan, and B. Yu. Sharp analysis of expectation-maximization for weakly identiﬁable models. AISTATS, 2020b.

Y. C. Eldar and S. Mendelson. Phase retrieval: Stability and recovery guarantees. Applied and Computational Harmonic Analysis, 36:473 494, 2013.

R. Fletcher. Practical methods of optimization. John Wiley & Sons, 1987.

Rahul Garg and Rohit Khandekar. Gradient descent with sparsiﬁcation: an iterative algorithm for sparse recovery with restricted isometry property. In ICML, volume 9, pages 337 344, 2009.

Elaine T Hale, Wotao Yin, and Yin Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19(3):1107 1130, 2008.

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1225 1234, New York, New York, USA, 20 22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/hardt16.html.

Instability, Computational Efficiency and Statistical Accuracy

T. Hastie, R. Tibshrani, and M. J. Wainwright. Statistical Learning with Sparsity: The Lasso and generalizations. CRC Press, 2015.

J. J. Heckman. The common structure of statistical models of truncation, sample selection and limited dependent variables, and a simple estimator for such models. Annals of Economic and Social Measurement, 5:475 492, 1976.

N. Ho and X. Nguyen. Convergence rates of parameter estimation for some weakly identiﬁable ﬁnite mixtures. Annals of Statistics, 44:2726 2755, 2016.

Joel Horowitz and Wolfgang H ardle. Direct semiparametric estimation of single-index models with discrete covariates. Journal of the American Statistical Association, 91(436): 1632 1640, 1996.

H. Ichimura. Semiparametric least squares (SLS) and weighted (SLS) estimation of single index models. Journal of Econometrics, 58:71 120, 1993.

Ilja Kuzborskij and Christoph Lampert. Data-dependent stability of stochastic gradient descent. In International Conference on Machine Learning, pages 2815 2824, 2018.

L. F. Lee. Asymptotic distribution of the maximum likelihood estimator for a stochastic frontier function model with a singular information matrix. Econometric Theory, 9:413 430, 1993.

L. F. Lee and A. Chesher. Speciﬁcation testing when score test statistic are identically zero. Journal of Econometrics, 31:121 149, 1986.

Po-Ling Loh and Martin J Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559 616, 2015.

Z. Ma. Sparse principal component analysis and iterative thresholding. Annals of Statistics, 41(2):772 801, 2013.

C. F. Manski. Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics, 3:205 228, 1975.

Wenlong Mou, Nhat Ho, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan. A diﬀusion process perspective on the posterior contraction rates for parameters. ar Xiv preprint ar Xiv:1909.00966, 2019.

Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

X. Nguyen. Convergence of latent mixing measures in ﬁnite and inﬁnite mixture models. Annals of Statistics, 4(1):370 400, 2013.

A. Rotnitzky, D. R. Cox, M. Bottai, and J. Robins. Likelihood-based inference with singular information matrix. Bernoulli, 6:243 284, 2000.

Ho, Khamaru, Dwivedi, Wainwright, Jordan, Yu

J. Rousseau and K. Mengersen. Asymptotic behaviour of the posterior distribution in overﬁtted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73:689 710, 2011.

P. J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79:871 880, 1984.

Ronald G Shaiko, Diana Dwyre, Mark O Gorman, Jeﬀrey M Stonecash, and James Vike. Pre-election political polling and the non-response bias issue. International Journal of Public Opinion Research, 3(1):86 99, 1991.

Y. S. Tan and R. Vershynin. Phase retrieval via randomized Kaczmarz: Theoretical guarantees. Information and Inference: A journal of the IMA, 2018.

A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.

Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Annals of statistics, 42(6):2164, 2014.

Yihong Wu and Harrison H Zhou. Randomly initialized EM algorithm for twocomponent gaussian mixture achieves near optimality in O( n) iterations. ar Xiv preprint ar Xiv:1908.10935, 2019.

F. Yang, S. Balakrishnan, and M. Wainwright. Statistical and computational guarantees for the Baum-Welch algorithm. Journal of Machine Learning Research, 18:1 53, 2017.

Xinyang Yi and Constantine Caramanis. Regularized EM algorithms: A uniﬁed framework and statistical guarantees. In Advances in Neural Information Processing Systems, pages 1567 1575, 2015.

Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(Apr):899 925, 2013.

C.-H Zhang and T. Zhang. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576 593, 2012.

Huishuai Zhang, Yi Zhou, Yingbin Liang, and Yuejie Chi. A nonconvex approach for phase retrieval: Reshaped Wirtinger ﬂow and incremental algorithms. The Journal of Machine Learning Research, 18(1):5164 5198, 2017.