Journal of Machine Learning Research 17 (2016) 1-53. Submitted 7/15; Revised 10/15; Published 4/16.

Convergence of an Alternating Maximization Procedure

Andreas Andresen (andreas.andresen@posteo.de)
Weierstrass Institute, Mohrenstr. 39, 10117 Berlin, Germany

Vladimir Spokoiny (spokoiny@wias-berlin.de)
Weierstrass Institute and Humboldt University Berlin; Higher School of Economics, IITP RAN, MIPT Moscow

Editor: Jie Peng

Abstract

We derive two convergence results for a sequential alternating maximization procedure that approximates the maximizer of random functionals such as the realized log-likelihood in maximum likelihood estimation. We show that the sequence attains the same deviation properties as shown for the profile M-estimator by Andresen and Spokoiny (2013), that is, a finite sample Wilks and Fisher theorem. Further, under slightly stronger smoothness constraints on the random functional, we show nearly linear convergence to the global maximizer if the starting point of the procedure is well chosen.

Keywords: alternating maximization, alternating minimization, profile maximum likelihood, EM algorithm, M-estimation, local linear approximation, local concentration, semiparametric

© 2016 Andreas Andresen and Vladimir Spokoiny.

Contents

1 Introduction
  1.1 Relation to the EM Algorithm
  1.2 Linear Series Estimators
  1.3 Finite Sample Wilks and Fisher Theorems
2 Main Results
  2.1 Conditions
    2.1.1 Smoothness
    2.1.2 Complexity
    2.1.3 Moments
    2.1.4 Conditions for the Full Model
    2.1.5 Quadratic Drift Beats Linear Fluctuation
  2.2 Dependence on Initial Guess
  2.3 Introduction of Important Objects
  2.4 Statistical Properties of the Alternating Sequence
  2.5 Convergence to the ME
  2.6 Critical Dimension
3 Application to Single Index Model
4 Proof of Theorem 7
  4.1 Idea of the Proof
  4.2 A Desirable Set
  4.3 Probability of Desirable Set
  4.4 Proof of Convergence
  4.5 Result after Convergence
5 Proof of Corollary 13
6 Proof of Theorem 14
A Deviation Bounds for Quadratic Forms
B A Uniform Bound for the Norm of a Random Process
C A Bound for the Spectral Norm of a Random Matrix Process

1. Introduction

This paper presents two convergence results for an alternating maximization procedure to approximate M-estimators. Let $Y \in \mathcal{Y}$ denote some observed random data and let $\mathbb{P}$ denote the data distribution. In the semiparametric profile M-estimation framework the target of analysis is
$$\theta^* = \Pi_\theta \upsilon^* = \Pi_\theta \operatorname{argmax}_{\upsilon \in \Upsilon} \mathbb{E}_{\mathbb{P}}\mathcal{L}(\upsilon, Y), \tag{1}$$
where $\mathcal{L}: \Upsilon \times \mathcal{Y} \to \mathbb{R}$ is an appropriate functional, $\Pi_\theta: \Upsilon \to \mathbb{R}^p$ is a projection and $\Upsilon$ is some high-dimensional or even infinite-dimensional parameter space. A prominent way of estimating $\theta^*$ is the profile M-estimator (pME)
$$\tilde\theta \stackrel{\text{def}}{=} \Pi_\theta\tilde\upsilon \stackrel{\text{def}}{=} \Pi_\theta \operatorname{argmax}_{(\theta,\eta)} \mathcal{L}(\theta,\eta) = \operatorname{argmax}_{\theta}\,\max_{\eta}\,\mathcal{L}(\theta,\eta). \tag{2}$$
This paper focuses on finite-dimensional parameter spaces $\Upsilon \subseteq \mathbb{R}^{p^*}$ with $p^* = p + m \in \mathbb{N}$ the full dimension, as infinite-dimensional maximization problems are computationally not feasible. This is motivated by the sieve M-estimation technique, which projects the estimation problem onto a finite-dimensional submodel; see Section 1.2 for details. The alternating maximization procedure is used in situations where a direct computation of the full maximum estimator (ME) $\tilde\upsilon \in \mathbb{R}^{p^*}$ is not feasible or simply very difficult to implement. Consider for example the task of calculating the pME with scalar random observations $\mathbf{Y} = (y_i)_{i=1}^n \subset \mathbb{R}$, parameter $\upsilon = (\theta,\eta) \in \mathbb{R}^p \times \mathbb{R}^m$ and a function basis $(e_k) \subset L^2(\mathbb{R})$:
$$\mathcal{L}(\theta,\eta) = -\frac{1}{2}\sum_{i=1}^{n}\Big(y_i - \sum_{k=1}^{m}\eta_k\, e_k(X_i^\top\theta)\Big)^2.$$
In this case the maximization problem is high dimensional and non-convex (see Section 3 for more details). But for fixed $\theta \in S_1 \subset \mathbb{R}^p$ the maximization with respect to $\eta \in \mathbb{R}^m$ is rather simple, while for fixed $\eta \in \mathbb{R}^m$ the maximization with respect to $\theta \in \mathbb{R}^p$ can be feasible for low $p \in \mathbb{N}$. This motivates the following iterative procedure. Given some (data dependent) functional $\mathcal{L}: \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}$ and an initial guess $\tilde\upsilon_0 \in \mathbb{R}^{p+m}$, set for $k \in \mathbb{N}$
$$\tilde\upsilon_{k,k+1} \stackrel{\text{def}}{=} (\tilde\theta_k, \tilde\eta_{k+1}) = \Big(\tilde\theta_k,\ \operatorname{argmax}_{\eta\in\mathbb{R}^m}\mathcal{L}(\tilde\theta_k,\eta)\Big),$$
$$\tilde\upsilon_{k,k} \stackrel{\text{def}}{=} (\tilde\theta_k, \tilde\eta_k) = \Big(\operatorname{argmax}_{\theta\in\mathbb{R}^p}\mathcal{L}(\theta,\tilde\eta_k),\ \tilde\eta_k\Big). \tag{3}$$
The so-called alternating maximization (or minimization) procedure is a widely applied algorithm in many parameter estimation tasks (see Jain, Netrapalli, and Sanghavi, 2013; Netrapalli, Jain, and Sanghavi, 2013; Keshavan, Montanari, and Oh, 2010; Yi, Caramanis, and Sanghavi, 2013). Some natural questions arise: Does the sequence $(\tilde\theta_k)$ converge to a limit that satisfies the same statistical properties as the profile estimator? If the answer is yes, after how many steps does the sequence acquire these properties? And under what circumstances does the sequence actually converge to the global maximizer $\tilde\upsilon$?
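The alternating updates just described are easiest to see in action on a toy example. The following sketch (a hypothetical concave quadratic functional with $p = m = 1$, not the paper's model) iterates the two exact partial maximizations and compares the result with the global maximizer; the contraction factor of the $\theta$-iteration here is $a^2/(dh) < 1$, playing the role of the quantity $\nu$ introduced later in condition (I).

```python
# Toy concave functional L(theta, eta) = -(d*theta^2 + 2*a*theta*eta + h*eta^2)/2
#                                        + g1*theta + g2*eta.
# Both partial maximizations have closed forms, so the alternating
# procedure can be iterated exactly.  (Hypothetical example values.)
d, a, h = 2.0, 0.8, 1.5
g1, g2 = 1.0, -0.5

def argmax_eta(theta):
    # dL/d_eta = -a*theta - h*eta + g2 = 0
    return (g2 - a * theta) / h

def argmax_theta(eta):
    # dL/d_theta = -d*theta - a*eta + g1 = 0
    return (g1 - a * eta) / d

# Global maximizer from the 2x2 linear system, for comparison.
det = d * h - a * a
theta_star = (h * g1 - a * g2) / det
eta_star = (d * g2 - a * g1) / det

theta, eta = 5.0, 5.0          # crude initial guess
for k in range(50):
    eta = argmax_eta(theta)    # update the nuisance given theta
    theta = argmax_theta(eta)  # update the target given the nuisance

# The theta-iteration contracts with factor nu = a^2/(d*h) < 1.
nu = a * a / (d * h)
print(nu, abs(theta - theta_star))
```

With these values the contraction factor is about 0.21, so 50 iterations bring the iterate numerically onto the global maximizer.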
This problem is hard because the behavior of each step of the sequence is determined by the actual finite sample realization of the functional $\mathcal{L}(\cdot, Y)$. To the authors' knowledge no general convergence result is available that answers the questions above, except for treatments of specific models (see Jain, Netrapalli, and Sanghavi, 2013; Netrapalli, Jain, and Sanghavi, 2013; Keshavan, Montanari, and Oh, 2010; Yi, Caramanis, and Sanghavi, 2013) or of variants of the procedure (Cheng, 2013). We address this difficulty by employing new finite sample techniques (Andresen and Spokoiny, 2014; Spokoiny, 2012), which allow us to answer the above questions: with growing iteration number $k \in \mathbb{N}$ the estimators $\tilde\theta_k$ attain the same statistical properties as the profile M-estimator, and Theorem 7 provides a choice of the necessary number of steps $K \in \mathbb{N}$. Under slightly stronger conditions on the structure of the model we can give a convergence result to the global maximizer that does not rely on unimodality. Further, we can address the important question under which ratio of full dimension $p^* = p + m \in \mathbb{N}$ to sample size $n \in \mathbb{N}$ the sequence behaves as desired. For instance, for smooth $\mathcal{L}$ our results become sharp if $p^*/\sqrt{n}$ is small, and convergence to the full maximizer already occurs if $p^*/n$ is small.

The alternating maximization procedure can be understood as a special case of the Expectation Maximization algorithm (EM algorithm), as we illustrate in Section 1.1. The EM algorithm itself was derived in a work of Dempster, Laird, and Rubin (1977), where particular versions of this approach are generalized. This paper (Dempster, Laird, and Rubin, 1977) also contains a variety of problems where an application of the EM algorithm can be fruitful; for a brief history of the EM algorithm see (McLachlan and Krishnan, 1997, Section 1.8). We briefly explain the EM algorithm in Section 1.1. Since the EM algorithm is very popular in applications, a lot of research on its behavior has been done.
We are only dealing with a special case of this procedure, so we restrict ourselves to citing the well-known convergence result by Wu (1983), which is still state of the art in most settings. Unfortunately Wu's result - as most convergence results on these iterative procedures - only ensures convergence to some set of local maximizers or fixed points of the procedure. Only in very special cases like unimodality can actual convergence to the maximizer be ensured. In a recent work (Balakrishnan, Wainwright, and Yu, 2014) a new way is presented of addressing the properties of the EM sequence in a very general i.i.d. setting, based on concavity of $\mathcal{L}$. They assume that the functional $\mathcal{L}$ is concave and smooth enough (first-order stability) and that for a sample $(Y_i)_{i=1,\dots,n}$ with high probability a uniform bound of the following kind is satisfied:
$$\max_{\theta \in B_r(\theta^*)}\Big\|\operatorname{argmax}_{\theta'}\mathcal{L}(\theta', \tilde\eta_\theta) - \operatorname{argmax}_{\theta'}\mathbb{E}\mathcal{L}(\theta', \tilde\eta_\theta)\Big\| \le \epsilon_n. \tag{4}$$
Under these assumptions they show that with high probability and some $\nu < 1$
$$\|\tilde\theta_k - \theta^*\| \le \nu^k\,\|\tilde\theta_0 - \theta^*\| + C\epsilon_n. \tag{5}$$
Unfortunately this does not answer our two questions to full satisfaction. First, the bound (4) is rather high level and has to be checked for each model, while we seek (and find) properties of the functional - such as smoothness and bounds on the moments of its gradient - that lead to comparably desirable behavior. Further, with (5) it remains unclear whether for large $k \in \mathbb{N}$ the alternating sequence satisfies a Fisher expansion or whether a Wilks type phenomenon occurs. In particular it remains open which ratio of dimension to sample size ensures good performance of the procedure. Also the actual convergence $\tilde\theta_k \to \tilde\theta$ is not addressed. These results apply to our problem if the involved regularity criteria are met. But as noted, these results do not tell us whether the limit of the sequence $(\tilde\theta_k)$ actually is the profile M-estimator, and the statistical properties of limit points are not clear without too restrictive assumptions on $\mathcal{L}$ and the data.
Another new work (Cheng, 2013) contains the analysis of a slightly altered algorithm in a very general semiparametric asymptotic framework. Instead of alternatingly maximizing the functional $\mathcal{L}$, a kind of gradient descent procedure for the profile likelihood $\mathrm{pl}(\theta) \stackrel{\text{def}}{=} \max_\eta \mathcal{L}(\theta,\eta)$ is analyzed, i.e. they define
$$\hat\theta_k \stackrel{\text{def}}{=} \hat\theta_{k-1} + \hat D(\hat\theta_{k-1})^{-2}\,\nabla\hat\ell(\hat\theta_{k-1}), \tag{6}$$
where $\hat\eta(\cdot)$ is an estimator of $\operatorname{argmax}_\eta \mathbb{E}\mathcal{L}(\cdot,\eta)$, $\hat D(\cdot)^2$ is an estimator of $-\nabla^2\mathbb{E}\max_\eta\mathcal{L}(\cdot,\eta)$ and $\hat\ell(\cdot)$ is an estimator of $\max_\eta\mathcal{L}(\cdot,\eta)$. Under common regularity conditions it is shown that $\|\hat\theta_k - \tilde\theta\| = o_P(1/\sqrt n)$ if $k(n) \in \mathbb{N}$ is chosen such that the rate of the initial guess $\hat\theta_0$ - obtained via a stochastic grid search - and the rate of the estimator of the nuisance parameter are taken into account. These results resemble very much what is aimed for in this work, but it is important to note a series of differences between the results of that work and the present paper. First and most importantly, the treated algorithm in that paper (Cheng, 2013) is similar in spirit to the alternating procedure, but in fact is a different procedure. It is a gradient descent scheme and involves a very careful data driven choice of step sizes when carrying out the estimations necessary in (6), and in that sense differs substantially from the simple and direct alternating maximization. Also the estimating step of the nuisance component is not the object of the analysis but is assumed to be sufficiently good for the arguments to go through. Finally, the results of (Cheng, 2013) are purely asymptotic.

In this work we carry out a finite sample analysis for the alternating maximization procedure in (3). Instead of a general semiparametric framework we address sieve profile estimators, also called finite dimensional linear series estimation (see Chen, 2007; Andresen and Spokoiny, 2014), see Section 1.2 for more details.
In this setting the bias of estimation - induced by projecting the full model onto a finite dimensional submodel - can be treated separately, and the model becomes finite dimensional as far as the algorithm is concerned. This allows a very careful and explicit analysis of the behavior of the procedure. In particular, the speed of convergence can be linked to characteristics of the information matrix $-\nabla^2\mathbb{E}\mathcal{L}(\upsilon^*)$ - namely to the constant $\nu < 1$ in (20) - and to the full dimension of the projected parameter space. The resulting number of iterations necessary for efficient estimation can be given in a rather simple and closed form. Finally, our results are nonasymptotic, which in this context is crucial, as a clear comparison of the computational and the estimation error for finite samples is needed for reasonable inference.

Our main result can be summarized as follows: under a set of regularity conditions on the data and the functional $\mathcal{L}$, points of the sequence $(\tilde\theta_k)$ behave for large iteration number $k \in \mathbb{N}$ like the pME. To be more precise, we show in Theorem 7 that if the initial guess $\tilde\upsilon_0 \in \Upsilon$ is good enough, the step estimator sequence $(\tilde\theta_k)$ satisfies with high probability
$$\big\|\breve D(\tilde\theta_k - \theta^*) - \breve\xi\big\| \le \epsilon\,(p^* + \nu^k R_0), \tag{7}$$
$$\Big|\max_\eta\mathcal{L}(\tilde\theta_k,\eta) - \max_\eta\mathcal{L}(\theta^*,\eta) - \|\breve\xi\|^2/2\Big| \le (p + x)^{1/2}\,\epsilon\,(p^* + \nu^k R_0), \tag{8}$$
where $\nu < 1$ is introduced in (20) and $\epsilon > 0$ is some small number, for example $\epsilon = C/\sqrt n$ in the smooth i.i.d. setting. Further, $R_0 > 0$ is a bound related to the quality of the initial guess. Generally it is proportional to the full dimension, and in this way the rate with which the full nuisance can be estimated affects the speed of the convergence of the procedure. The random variable $\breve\xi \in \mathbb{R}^p$ and the matrix $\breve D \in \mathbb{R}^{p\times p}$ are related to the efficient influence function in semiparametric models and its covariance. These are, up to the term $\nu^k R_0$, the same properties as those proven for the pME by Andresen and Spokoiny (2014) under nearly the same set of conditions.
Up to the finite sample bounds on the right hand sides this means that the estimating points of the procedure admit a Fisher expansion - in other words, are asymptotically normal - and a Wilks expansion. Consequently, the usual inference procedures based on confidence and concentration sets can be applied to these estimators. Further, in our second main result we manage to show, under slightly stronger smoothness conditions, that $(\tilde\theta_k, \tilde\eta_k)$ approaches the ME $(\tilde\theta,\tilde\eta) = \operatorname{argmax}_{(\theta,\eta)}\mathcal{L}(\theta,\eta)$ with nearly linear convergence speed, i.e.
$$\big\|\mathcal{D}\big((\tilde\theta_k,\tilde\eta_k) - \tilde\upsilon\big)\big\| \le \tau^{k/\log(k)}$$
with some $0 < \tau < 1$ and $\mathcal{D}^2 = -\nabla^2\mathbb{E}\mathcal{L}(\upsilon^*)$ (see Theorem 14).

To clarify, we want to mention that the term convergence refers to the behavior of the sequence $(\tilde\theta_k,\tilde\eta_k)$ when the number of iterations $k \in \mathbb{N}$ tends to infinity. We show $(\tilde\theta_k,\tilde\eta_k) \to (\tilde\theta,\tilde\eta)$. This has to be distinguished from the usual stochastic convergence results of the M-estimator $(\tilde\theta,\tilde\eta)$ towards the target $(\theta^*,\eta^*)$ or the weak convergence to a normal distribution as the sample size increases. Our setup assumes finite sample size, such that even with $k \to \infty$ there remains a gap between $(\tilde\theta_k,\tilde\eta_k)$ and $(\theta^*,\eta^*)$ and between $\tilde\theta_k$ and $\theta^*$ that is related to the parametric and semiparametric Cramér-Rao lower bounds, respectively. This is why in the finite sample setting we can only hope to obtain convergence of the alternating procedure to $(\tilde\theta,\tilde\eta)$ but not to $(\theta^*,\eta^*)$. But for a growing sample size, (7) implies the weak convergence results also for the estimator $\tilde\theta_k$ when $k(n) \in \mathbb{N}$ is large enough and $\epsilon p^*$ vanishes (see Section 1.3).

In the following we write $\tilde\upsilon_{k,k(+1)}$ in statements that are true for both $\tilde\upsilon_{k,k+1}$ and $\tilde\upsilon_{k,k}$. Also, we do not specify whether the elements of the resulting sequence are sets or single points. All statements made about properties of $\tilde\upsilon_{k,k(+1)}$ are to be understood in the sense that they hold for every point of $\tilde\upsilon_{k,k(+1)}$.
1.1 Relation to the EM Algorithm

In the introduction we claimed that the alternating procedure analyzed in this work is related to the EM algorithm. In this section we want to elaborate on that. First we explain the EM algorithm. Consider data $X \sim \mathbb{P}_\theta$ for some parametric family $(\mathbb{P}_\theta, \theta \in \Theta)$. Assume that a parameter $\theta \in \Theta$ is to be estimated as maximizer of the functional $\mathcal{L}_c(\theta, X) \in \mathbb{R}$, but that only $Y \in \mathcal{Y}$ is observed, where $Y = f_Y(X)$ is the image of the complete data set $X \in \mathcal{X}$ under some map $f_Y: \mathcal{X} \to \mathcal{Y}$. Prominent examples for the map $f_Y$ are projections onto subspaces of $\mathcal{X}$ if both $\mathcal{Y}, \mathcal{X}$ are vector spaces. The information lost under the map can be regarded as missing data or latent variables. As a direct maximization of the functional is impossible without knowledge of $X$, the EM algorithm serves as a workaround. It consists of the iteration of two steps: starting with some initial guess $\tilde\theta_0$, the $k$th Expectation step derives the functional $Q$ via
$$Q(\theta, \theta_k) = \mathbb{E}_{\theta_k}\big[\mathcal{L}_c(\theta, X)\,\big|\,Y\big],$$
which means that on the right hand side the conditional expectation is calculated under the distribution $\mathbb{P}_{\theta_k}$. The $k$th Maximization step then simply locates the maximizer $\theta_{k+1}$ of $Q(\cdot, \theta_k)$.

Now we can present the convergence result of Wu (1983) in more detail. Wu presents regularity conditions that ensure $L(\theta_{k+1}|Y) \ge L(\theta_k|Y)$, where
$$L(\theta|Y) \stackrel{\text{def}}{=} \mathbb{E}\big[\mathcal{L}_c(\theta, X)\,\big|\,Y = f_Y(X)\big],$$
such that $L(\theta_k|Y) \to L^*$ for some limit value $L^*(Y) > 0$ that may depend on the starting point $\theta_0$. Additionally, Wu gives conditions that guarantee that the sequence $\theta_k$ (possibly a sequence of sets) converges to
$$\mathcal{C}(L^*) \stackrel{\text{def}}{=} \{\theta\,|\, L(\theta|Y) = L^*(Y)\}.$$
Dempster, Laird, and Rubin (1977) show that the speed of convergence is linear in the case of point valued $\theta_k$ and some differentiability criterion being met. A limitation of these results is that it is not clear whether $L^*(Y) = \sup_\theta L(\theta|Y)$, and thus it is not guaranteed that $\mathcal{C}(L^*)$ contains the desired MLE and not just some local maximum.
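To make the E- and M-steps concrete, here is a minimal EM sketch for the textbook case of a two-component Gaussian mixture with known unit variances and equal weights (a hypothetical example, not a model treated in this paper): the E-step computes the posterior responsibilities under the current means, and the M-step maximizes $Q$ by taking responsibility-weighted means.

```python
import math
import random

random.seed(0)

# Complete data X = (component, y); only y is observed.
# Two-component Gaussian mixture, unit variances, equal weights;
# EM estimates the two means (mu0, mu1).  (Hypothetical example.)
truth = (-2.0, 2.0)
data = [random.gauss(truth[random.random() < 0.5], 1.0) for _ in range(2000)]

def em_step(mu0, mu1, ys):
    # E-step: posterior responsibility of component 1 for each observation.
    w0_sum = w1_sum = s0 = s1 = 0.0
    for y in ys:
        p0 = math.exp(-0.5 * (y - mu0) ** 2)
        p1 = math.exp(-0.5 * (y - mu1) ** 2)
        r1 = p1 / (p0 + p1)
        w1_sum += r1
        s1 += r1 * y
        w0_sum += 1 - r1
        s0 += (1 - r1) * y
    # M-step: maximize Q, i.e. take responsibility-weighted means.
    return s0 / w0_sum, s1 / w1_sum

mu0, mu1 = -0.5, 0.5   # rough initial guess
for _ in range(100):
    mu0, mu1 = em_step(mu0, mu1, data)

print(mu0, mu1)
```

With well-separated components and a reasonable starting point the iteration settles near the data-generating means; with a poor starting point it can stall at a local maximum, which is exactly the limitation of Wu's result discussed above.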
This problem disappears if $L(\cdot|Y)$ is unimodal and the regularity conditions are met, but this assumption may be too restrictive. To see that the procedure (3) is a special case of the EM algorithm we have to find the right triplet $(X, f_Y, \mathcal{L}_c)$. For this we take $X = (Z, Y)$ with $Z = \operatorname{argmax}_\eta \mathcal{L}\{(\theta,\eta), Y\}$ under $\mathbb{P}_\theta$. Further we set $f_Y(X) = Y$ and $\mathcal{L}_c(\theta, X) \stackrel{\text{def}}{=} \mathcal{L}(\theta,\eta,Y)$, where $X = (\eta, Y)$. Then we find
$$Q(\theta, \tilde\theta^{(k-1)}) = \mathbb{E}_{\tilde\theta^{(k-1)}}\big[\mathcal{L}_c(\theta, X)\,\big|\,Y\big] = \mathcal{L}_c\Big(\theta,\ \big(\operatorname{argmax}_\eta\mathcal{L}\{(\tilde\theta^{(k-1)},\eta), Y\},\ Y\big)\Big) = \mathcal{L}(\theta, \tilde\eta^{(k)}, Y),$$
and thus the resulting sequence is the same as in (3).

1.2 Linear Series Estimators

In semiparametric models the profile M-estimator $\tilde\theta \in \mathbb{R}^p$ from Equation (2) cannot be calculated in practice if the full model is infinite dimensional. There are various ways to circumvent this problem. Next to nonparametric estimation and plug-in of the nuisance $\eta \in \mathbb{X}$, a prominent approach is the so-called sieve technique that motivates the setting in this work. The sieve approach was systematically introduced by Grenander (see Grenander, 1981, Chapter 8) and consists in choosing a suitable sequence of subsets $(\Upsilon_m)_{m=1}^\infty \subseteq \Upsilon$ such that for each $\upsilon \in \Upsilon$ there exists a sequence $\Pi_m(\upsilon) \in \Upsilon_m$ with $\|\upsilon - \Pi_m(\upsilon)\| \to 0$. Furthermore, the sets $\Upsilon_m \subseteq \Upsilon$ have to be such that $\sup_{\upsilon \in \Upsilon_m}\mathcal{L}(\upsilon)$ can be calculated in practice. In the setting of semiparametric M-estimation we assume $\Upsilon = \Upsilon_\theta \times \Upsilon_\eta \subseteq \mathbb{R}^p \times \mathbb{X}$ with some infinite dimensional separable Hilbert space $\mathbb{X}$ and countable basis $\{e_1, e_2, \dots\} \subset \mathbb{X}$. We set $\Upsilon_m = \Upsilon_\theta \times \Pi_m\Upsilon_\eta$, where $\Pi_m: \mathbb{X} \to \mathbb{X}_m$ denotes the orthogonal projection onto $\mathbb{X}_m \stackrel{\text{def}}{=} \operatorname{span}\{e_1,\dots,e_m\}$. For each $m \in \mathbb{N}$ the sieve profile M-estimator is defined as
$$\tilde\theta_m \stackrel{\text{def}}{=} \Pi_\theta\tilde\upsilon_m \stackrel{\text{def}}{=} \Pi_\theta\operatorname{argmax}_{\theta\in\mathbb{R}^p,\ \eta\in\mathbb{R}^m}\mathcal{L}(\theta,\eta).$$
This means that for the calculation of the estimator $\tilde\theta_m$ only a finite dimensional setting has to be considered. In our analysis we will focus on the behavior of the alternating maximization procedure in that case.
But of course the projection onto a finite dimensional submodel induces an approximation bias $\upsilon^* - \upsilon^*_m$, where
$$(\theta^*_m, \eta^*_m) \stackrel{\text{def}}{=} \upsilon^*_m \stackrel{\text{def}}{=} \operatorname{argmax}_{\theta\in\mathbb{R}^p,\ \eta\in\mathbb{R}^m}\mathbb{E}\mathcal{L}(\theta,\eta).$$
In (Andresen and Spokoiny, 2014) it is explained in detail how this bias can be treated. Once the bias is controlled, this leads for each $m \in \mathbb{N}$ to bounds of the kind
$$\big\|\breve D(\tilde\theta_m - \theta^*) - \breve\xi_m\big\| \le \diamond(x) + \alpha(m),$$
$$\Big|\max_{\eta\in\Pi_m\Upsilon_\eta}\mathcal{L}(\tilde\theta_m,\eta) - \max_{\eta\in\Pi_m\Upsilon_\eta}\mathcal{L}(\theta^*,\eta) - \|\breve\xi_m\|^2/2\Big| \le \sqrt{p}\,\big(\diamond(x) + \alpha(m)\big),$$
where $\alpha(m) \to 0$ quantifies the impact of the bias $\upsilon^* - \upsilon^*_m$. The choice of $m \in \mathbb{N}$ then has to balance the two terms $\diamond(x)$ and $\alpha(m)$, which leads to common optimal choices for the dimension based on the smoothness of the nuisance component $\eta \in \mathbb{X}$. To ease notation we drop the $m$ in the following, as the treatment of the bias can be done separately (see Andresen and Spokoiny, 2014).

1.3 Finite Sample Wilks and Fisher Theorems

Before we present our main results we want to explain what type of results we aim at and how they can be interpreted. Hopefully this will ease the understanding and make some of the apparently cumbersome notation more intelligible. Usually in asymptotic treatments of semiparametric M-estimators like $\tilde\theta$ in (2), the aim is to derive statements of the kind
$$\big\|\sqrt n(\tilde\theta - \theta^*) - \breve d^{-1}\breve\xi\big\| = o_P(1), \tag{9}$$
$$\Big|\max_\eta\mathcal{L}(\tilde\theta,\eta) - \max_\eta\mathcal{L}(\theta^*,\eta) - \|\breve\xi\|^2/2\Big| = o_P(1), \tag{10}$$
$$\breve\xi \stackrel{w}{\longrightarrow} N\big(0,\ \breve d^{-1}\breve v^2\breve d^{-1}\big),$$
where $n \in \mathbb{N}$ denotes the sample size. The random variable $\breve\xi \in \mathbb{R}^p$ is called the semiparametric score. Below we briefly explain its derivation along with the matrices $\breve v^2, \breve d^2 \in \mathbb{R}^{p\times p}$. But first we sketch how (9) and (10) can be used for the construction of asymptotic confidence sets that yield statistical tests. Given the matrices $\breve v^2, \breve d^2$ the construction works as follows. Let $q_\alpha^2 > 0$ be an $\alpha$-level quantile of a $\chi^2_p(\breve d^{-1}\breve v^2\breve d^{-1})$-distribution. Set
$$\mathcal{E}(q_\alpha) = \big\{\theta:\ \|\sqrt n\,\breve d\,(\tilde\theta - \theta)\| \le q_\alpha\big\}; \tag{11}$$
then one can use (9) to show
$$\mathbb{P}\{\theta^* \notin \mathcal{E}(q_\alpha)\} = \mathbb{P}\big\{\|\sqrt n\,\breve d\,(\tilde\theta - \theta^*)\| \ge q_\alpha\big\} \to 1 - \alpha.$$
Similarly one can exploit (10). Now we explain the definition of $\breve v^2$ and $\breve d^2$.
Introduce
$$n\,\breve v^{-2}(\upsilon) \stackrel{\text{def}}{=} \Pi_\theta\big(\operatorname{Cov}(\nabla\mathcal{L}(\upsilon))\big)^{-1}\Pi_\theta^*,\qquad n\,\breve d^{-2}(\upsilon) \stackrel{\text{def}}{=} \Pi_\theta\big(-\nabla^2\mathbb{E}\mathcal{L}(\upsilon)\big)^{-1}\Pi_\theta^*,$$
where $\Pi_\theta$ is the orthogonal projection onto the $\theta$-component in $\mathbb{R}^{p^*}$ and $\Pi_\theta^*$ is its dual operator. Then $\breve v^2 = \breve v^2(\upsilon^*)$ and $\breve d^2 = \breve d^2(\upsilon^*)$.

Remark 1 Note that these two matrices coincide if the functional $\mathcal{L}$ is the complete log-likelihood of the observations, and that then $\breve d^2$ equals the covariance of the efficient influence function (see Kosorok, 2005, for more details).

For the definition of the semiparametric score $\breve\xi \in \mathbb{R}^p$ consider the block decomposition
$$n\,\mathfrak{d}^2(\upsilon) \stackrel{\text{def}}{=} \begin{pmatrix} d^2(\upsilon) & a(\upsilon)\\ a^\top(\upsilon) & h^2(\upsilon)\end{pmatrix} \stackrel{\text{def}}{=} -\nabla^2\mathbb{E}\mathcal{L}(\upsilon) \in \mathbb{R}^{p^*\times p^*}$$
and set
$$\breve\xi \stackrel{\text{def}}{=} \frac{1}{\sqrt n}\,\breve d(\upsilon^*)\,\Pi_\theta\,\mathfrak{d}^{-2}(\upsilon^*)\,\nabla\mathcal{L}(\upsilon^*) = \frac{1}{\sqrt n}\,\breve d^{-1}(\upsilon^*)\Big(\nabla_\theta\mathcal{L}(\upsilon^*) - a\,h^{-2}(\upsilon^*)\,\nabla_\eta\mathcal{L}(\upsilon^*)\Big).$$
This random variable is related to the efficient influence function in semiparametric estimation, and it plays the role that the usual score $\nabla\mathcal{L}(\upsilon^*)$ plays in the setting of parametric M-estimation.

In this work we derive (9) and (10) for $\tilde\theta_k$ instead of $\tilde\theta$, but with finite sample bounds for the terms on the right-hand sides of (9) and (10). To be more precise, we derive statements of the following kind. With probability greater than $1 - Ce^{-x}$,
$$\big\|\sqrt n(\tilde\theta_k - \theta^*) - \breve d^{-1}\breve\xi\big\| \le \epsilon\,(p^* + \nu^k R_0), \tag{12}$$
$$\Big|\max_\eta\mathcal{L}(\tilde\theta_k,\eta) - \max_\eta\mathcal{L}(\theta^*,\eta) - \|\breve\xi\|^2/2\Big| \le (p + x)^{1/2}\,\epsilon\,(p^* + \nu^k R_0), \tag{13}$$
with some small value $\epsilon > 0$ as in (7) and (8). Note that for vanishing right hand sides these equations imply (9) and (10) when $n$ tends to infinity. Using the scheme in (11), the bounds (12) and (13) allow the construction of (conservative) finite sample confidence sets. Assume that (approximate) quantiles $q_\alpha$ for $\|\breve\xi\|$ are available, i.e. that with some small $\delta > 0$ and any $\alpha \in [0,1]$
$$\mathbb{P}\big(\|\breve\xi\| \le q_\alpha\big) \in (\alpha - \delta,\ \alpha + \delta);$$
then with some generic constant $C > 0$ (see Andresen and Spokoiny, 2014, Remark 2.13)
$$\mathbb{P}\Big\{\theta^* \in \mathcal{E}\big(q_\alpha + \epsilon(p^* + \nu^k R_0)\big)\Big\} \ge \alpha - \delta - Ce^{-x},$$
$$\mathbb{P}\Big\{\theta^* \in \mathcal{E}\big(q_\alpha - \epsilon(p^* + \nu^k R_0)\big)\Big\} \le \alpha + \delta + Ce^{-x}.$$
The important achievement is that one can make approximate confidence statements for the estimators of the alternating procedure, and this even in the finite sample case, without ignoring hopefully small enough terms.
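The construction around (11) can be illustrated in the simplest possible setting (a hypothetical one-dimensional Gaussian-mean example where $\breve d = \breve v = 1$, not an example from the paper): approximate the quantile $q_\alpha$ of $\|\breve\xi\|$ by Monte Carlo, build the set $\mathcal{E}(q_\alpha)$, and check its coverage by simulation.

```python
import math
import random

random.seed(1)

# 1-D illustration: theta is a Gaussian mean, L the log-likelihood,
# so breve-d = breve-v = 1 and the score xi is standard normal.
# The quantile q_alpha of |xi| is approximated by Monte Carlo, as one
# would do in general via plug-in/simulation.  (Hypothetical example.)
n, alpha = 400, 0.90
sims = sorted(abs(random.gauss(0.0, 1.0)) for _ in range(100000))
q_alpha = sims[int(alpha * len(sims))]   # MC alpha-quantile of |xi|

theta_star = 0.7
covered = 0
reps = 2000
for _ in range(reps):
    ys = [random.gauss(theta_star, 1.0) for _ in range(n)]
    theta_hat = sum(ys) / n              # the M-estimator (here: the MLE)
    # E(q_alpha) = {theta : sqrt(n) * |theta_hat - theta| <= q_alpha}
    if math.sqrt(n) * abs(theta_hat - theta_star) <= q_alpha:
        covered += 1

print(covered / reps)   # empirical coverage, should be close to alpha
```

In this exactly Gaussian toy case the empirical coverage matches $\alpha$ up to Monte Carlo error; the point of the finite sample bounds (12) and (13) is to control the additional slack $\epsilon(p^* + \nu^k R_0)$ that appears when $\tilde\theta_k$ replaces the exact estimator.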
As remarked above, such approximate quantiles could be attained via a plug-in estimation of $\breve d^{-1}\breve v^2\breve d^{-1}$ combined with a Gaussian approximation or a bootstrap.

2. Main Results

This section contains the thorough presentation of our main convergence results. It involves the introduction of various technicalities and may appear a bit cumbersome on first reading. We recommend carefully reading Section 1.3 first to ease understanding.

2.1 Conditions

This section collects the conditions imposed on the model. We use the same set of assumptions as Andresen and Spokoiny (2014), and this section closely follows Section 2.1 of that paper. Let the full dimension of the parameter space be finite, i.e. $p^* < \infty$. Our conditions involve the symmetric positive definite information matrix $\mathcal{D}_0^2 \in \mathbb{R}^{p^*\times p^*}$ and a central point $\upsilon^* \in \mathbb{R}^{p^*}$. To ease presentation, in this paper we identify $\upsilon^*$ with the true point from (1) and define
$$\mathcal{D}_0^2 \stackrel{\text{def}}{=} -\nabla^2\mathbb{E}\mathcal{L}(\upsilon^*),$$
where we assume that the second derivative exists. In the context of semiparametric estimation it is convenient to represent the information matrix in block form:
$$\mathcal{D}_0^2 = \begin{pmatrix} D_0^2 & A_0\\ A_0^\top & H_0^2\end{pmatrix}.$$
First we state an identifiability condition, which basically imposes that the efficient information matrix is positive definite. Note that in this work $\|\cdot\|$ always denotes the spectral norm when its argument is a matrix.

(I) It holds for some $\nu < 1$
$$\big\|H_0^{-1}A_0^\top D_0^{-1}\big\| \le \nu. \tag{14}$$

The condition (I) allows us to introduce the important $p\times p$ efficient information matrix $\breve D_0^2$, which is defined as the inverse of the $\theta$-block of the inverse of the full dimensional matrix $\mathcal{D}_0^2$. The exact formula is given by
$$\breve D_0^2 \stackrel{\text{def}}{=} D_0^2 - A_0 H_0^{-2}A_0^\top,$$
and (I) ensures that the matrix $\breve D_0^2$ is positive definite, such that $\breve D_0$ is well defined. At the same time, $\nu < 1$ ensures that the alternating sequence actually converges. As can be seen in Theorem 7, the speed of convergence is linear in $\nu$.
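As a numerical sanity check (a hypothetical $2\times 2$ example with $p = m = 1$, not from the paper), the following sketch verifies that the Schur complement $D_0^2 - A_0 H_0^{-2}A_0^\top$ indeed equals the inverse of the $\theta$-block of $(\mathcal{D}_0^2)^{-1}$, and computes the quantity $\nu$ from (14):

```python
import math

# Hypothetical 2x2 block example (p = m = 1): D0^2 = [[d2, a], [a, h2]].
d2, a, h2 = 4.0, 1.2, 2.0

# Condition (I): nu = ||H0^-1 A0^T D0^-1|| must be < 1 (scalars here).
nu = abs(a) / (math.sqrt(h2) * math.sqrt(d2))

# Efficient information: Schur complement of the eta-block,
# breve-D0^2 = D0^2 - A0 H0^-2 A0^T.
breve_d2 = d2 - a * a / h2

# It should equal the inverse of the theta-block of the inverse matrix.
det = d2 * h2 - a * a
theta_block_of_inverse = h2 / det        # top-left entry of (D0^2)^-1
print(nu, breve_d2, 1.0 / theta_block_of_inverse)
```

The last two printed numbers agree, which is the standard block-inversion identity behind the definition of $\breve D_0^2$.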
Using the matrix $\mathcal{D}_0^2$ and the central point $\upsilon^* \in \mathbb{R}^{p^*}$, we define the local set $\Upsilon(r) \subseteq \Upsilon \subseteq \mathbb{R}^{p^*}$ with some $r \ge 0$:
$$\Upsilon(r) \stackrel{\text{def}}{=} \big\{\upsilon = (\theta,\eta) \in \Upsilon:\ \|\mathcal{D}_0(\upsilon - \upsilon^*)\| \le r\big\}. \tag{15}$$

2.1.1 Smoothness

Usually in the context of regular M-estimation one assumes local quadraticity, i.e. that
$$\mathrm{pl}(\theta) = \mathrm{pl}(\theta^*) + \nabla\mathrm{pl}(\theta^*)^\top(\theta - \theta^*) + \frac12(\theta - \theta^*)^\top\nabla^2\mathrm{pl}(\theta^*)(\theta - \theta^*) + o_P(1)$$
for $\theta \in \mathbb{R}^p$ close enough to $\theta^*$. The functional $\mathrm{pl}(\cdot)$ in this context denotes the profile functional and is defined as $\mathrm{pl}(\theta) \stackrel{\text{def}}{=} \max_\eta\mathcal{L}(\theta,\eta)$; see for instance (Cheng, 2013). In our setting we need a precise bound for the accuracy of a quadratic approximation of the expected profile functional $\mathbb{E}\,\mathrm{pl}$. We want to bound the error of a local linear approximation of the projected gradient $\breve\nabla_\theta\mathbb{E}\mathcal{L}(\upsilon)$, which is defined via
$$\breve\nabla_\theta \stackrel{\text{def}}{=} \nabla_\theta - A_0 H_0^{-2}\nabla_\eta.$$
Instead of local quadraticity of the profile functional we impose that for $(\theta,\eta) = \upsilon \in \Upsilon(r)$ - with the local set $\Upsilon(r)$ defined in (15) - with some small $\epsilon > 0$
$$\Big\|\breve D^{-1}\big(\breve\nabla_\theta\mathbb{E}\mathcal{L}(\upsilon) - \breve\nabla_\theta\mathbb{E}\mathcal{L}(\upsilon^*)\big) + \breve D(\theta - \theta^*)\Big\| \le \epsilon\, r^2.$$
The following condition serves as a less high-level way of checking that such an approximation holds (see Andresen and Spokoiny, 2014, Lemma B.1):

(L̆0) For each $r \le r_0$ there is a constant $\epsilon > 0$ such that it holds on the set $\Upsilon(r)$:
$$\big\|\breve D^{-1}\breve D^2(\upsilon)\breve D^{-1} - I_p\big\| \le \epsilon\,r,\qquad \big\|\breve D^{-1}\big(A(\upsilon) - A\big)H^{-1}\big\| \le \epsilon\,r,$$
$$\big\|\breve D^{-1}AH^{-1}\big(I_m - H^{-1}H^2(\upsilon)H^{-1}\big)\big\| \le \epsilon\,r.$$
If $\mathbb{E}\mathcal{L}$ is three times continuously differentiable one obtains $\epsilon \le C\|\mathcal{D}^{-1}\|$. In i.i.d. models one usually has $\|\mathcal{D}^{-1}\| = O(1/\sqrt n)$.

Remark 2 Here and in what follows we implicitly assume that the function $\mathcal{L}(\upsilon): \mathbb{R}^{p^*} \to \mathbb{R}$ is sufficiently smooth in $\upsilon \in \mathbb{R}^{p^*}$; $\nabla\mathcal{L}(\upsilon) \in \mathbb{R}^{p^*}$ stands for the gradient, and $\nabla^2\mathbb{E}\mathcal{L}(\upsilon) \in \mathbb{R}^{p^*\times p^*}$ for the Hessian of the expectation $\mathbb{E}\mathcal{L}: \mathbb{R}^{p^*} \to \mathbb{R}$ at $\upsilon \in \mathbb{R}^{p^*}$. By smooth enough we mean that we can interchange $\nabla\mathbb{E}\mathcal{L} = \mathbb{E}\nabla\mathcal{L}$ on $\Upsilon(R_0)$, where $\Upsilon(r)$ is defined in (15) and $R_0 > 0$ in (19). It is worth mentioning that $\mathcal{D}_0^2 = \mathcal{V}_0^2 \stackrel{\text{def}}{=} \operatorname{Cov}(\nabla\mathcal{L}(\upsilon^*))$ if the model $Y \sim \mathbb{P}_{\upsilon^*} \in (\mathbb{P}_\upsilon)$ is correctly specified and sufficiently regular; see e.g. (Ibragimov and Khas'minskij, 1981).
2.1.2 Complexity

The usual approach to gain asymptotic control of profile M-estimators would be to assume that the class $\{\nabla\mathcal{L}(\upsilon),\ \upsilon \in \Upsilon(r)\}$ is $\mathbb{P}$-Donsker (see Cheng, 2013). To pin down the estimator sequence $(\tilde\theta_k, \tilde\eta_k)_{k\in\mathbb{N}}$ and to obtain finite sample results we use a more specific approach, which is based on a new finite sample technique (Spokoiny, 2012; Andresen and Spokoiny, 2014). First note that we assume that - as far as the alternating procedure is concerned - the model is finite dimensional, which means that $\Upsilon(r) \subseteq \Upsilon \subseteq \mathbb{R}^{p^*}$ in (15) is automatically compact for finite radius $r > 0$. If we can ensure the right smoothness and moment conditions on $\nabla\mathcal{L}(\upsilon)$, we automatically obtain that the above class is $\mathbb{P}$-Donsker. But using the new techniques (Spokoiny, 2012; Andresen and Spokoiny, 2014) we manage to obtain slightly stronger bounds that are useful in a finite sample setting.

To understand the next condition, consider first the definition of a subgaussian random vector. A random vector $X \in \mathbb{R}^{p^*}$ is called subgaussian if for any $\mu \in \mathbb{R}$ and some $\nu > 0$
$$\sup_{\|\gamma\|\le 1}\log\mathbb{E}\exp\big\{\mu\,\gamma^\top X\big\} \le \nu^2\mu^2/2. \tag{16}$$
Obviously this is a strong condition. As it turns out, in many situations it is sufficient to assume subexponentiality, which simply relaxes subgaussianity by demanding that (16) is met only for $|\mu| \le g$ with some $g > 0$. In this way many distributions that would be excluded by assuming subgaussianity can still be treated. Our next condition combines subexponentiality with a smoothness constraint on the stochastic component of $\mathcal{L}(\upsilon)$. It assumes that - with some $\epsilon > 0$ - the random vector
$$\frac{1}{\epsilon\,\|\mathcal{D}(\upsilon - \upsilon')\|}\,\breve D^{-1}\big(\breve\nabla_\theta\zeta(\upsilon) - \breve\nabla_\theta\zeta(\upsilon')\big) \in \mathbb{R}^p$$
is uniformly subexponential for $\upsilon, \upsilon' \in \Upsilon(r)$, with the stochastic component defined as $\zeta(\upsilon) \stackrel{\text{def}}{=} \mathcal{L}(\upsilon) - \mathbb{E}\mathcal{L}(\upsilon)$.
It reads:

(ĔD1) For all $0 < r < r_0$ there exists a constant $\epsilon \le 1/2$ such that for all $|\mu| \le g$
$$\sup_{\upsilon,\upsilon'\in\Upsilon(r)}\ \sup_{\|\gamma\|\le 1}\ \log\mathbb{E}\exp\Bigg\{\frac{\mu\,\gamma^\top\breve D^{-1}\big(\breve\nabla_\theta\zeta(\upsilon) - \breve\nabla_\theta\zeta(\upsilon')\big)}{\epsilon\,\|\mathcal{D}(\upsilon - \upsilon')\|}\Bigg\} \le \frac{\nu_0^2\mu^2}{2}.$$

To convey more intuition, consider for some pair $\upsilon, \upsilon' \in \Upsilon(r)$ the function
$$\psi_\gamma(t) \stackrel{\text{def}}{=} \gamma^\top\breve D^{-1}\big(\breve\nabla_\theta\zeta(\upsilon) - \breve\nabla_\theta\zeta(\upsilon + t(\upsilon' - \upsilon))\big).$$
Then (ĔD1) is met with $\epsilon \simeq \|\mathcal{D}^{-1}\|$ if - uniformly in $\gamma$ - for any pair $\upsilon, \upsilon' \in \Upsilon(r)$ the corresponding function $\psi_\gamma: [0,1] \to \mathbb{R}$ is Lipschitz continuous and the Lipschitz constant is a subexponential random variable.

2.1.3 Moments

We need another condition that allows us to control the deviation behavior of $\breve D^{-1}\breve\nabla_\theta\zeta(\upsilon^*)$. To present this condition, define the covariance matrices $\mathcal{V}_0^2 \in \mathbb{R}^{p^*\times p^*}$ and $\breve V^2 \in \mathbb{R}^{p\times p}$:
$$\mathcal{V}_0^2 \stackrel{\text{def}}{=} \operatorname{Cov}\big(\nabla\mathcal{L}(\upsilon^*)\big),\qquad \breve V^2 = \operatorname{Cov}\big(\breve\nabla_\theta\zeta(\upsilon^*)\big).$$
We impose subexponential moments on $\breve V^{-1}\breve\nabla_\theta\zeta(\upsilon^*)$:

(ĔD) There exist constants $\nu_0 > 0$ and $g > 0$ such that for all $|\mu| \le g$
$$\sup_{\|\gamma\|\le 1}\log\mathbb{E}\exp\big\{\mu\,\gamma^\top\breve V^{-1}\breve\nabla_\theta\zeta(\upsilon^*)\big\} \le \frac{\nu_0^2\mu^2}{2}.$$

2.1.4 Conditions for the Full Model

So far we only presented conditions that allow us to treat the properties of $\tilde\theta_k$ on local sets $\Upsilon(r_k)$, for some sequence $(r_k)_{k\in\mathbb{N}}$. To show that the sequence of estimators $(\tilde\upsilon_k)_{k\in\mathbb{N}}$ satisfies $\tilde\upsilon_k \in \Upsilon(r_k)$ for an appropriately decreasing sequence $(r_k)_{k\in\mathbb{N}}$, the following stronger conditions are employed, which can be interpreted just as the previous ones.

(L0) For each $r \le r_0$ there is a constant $\epsilon > 0$ such that it holds on the set $\Upsilon(r)$:
$$\big\|\mathcal{D}_0^{-1}\big(-\nabla^2\mathbb{E}\mathcal{L}(\upsilon)\big)\mathcal{D}_0^{-1} - I_{p^*}\big\| \le \epsilon\,r.$$

(ED1) There exists a constant $\epsilon \le 1/2$ such that for all $|\mu| \le g$ and all $0 < r < r_0$
$$\sup_{\upsilon,\upsilon'\in\Upsilon(r)}\ \sup_{\|\gamma\|=1}\log\mathbb{E}\exp\Bigg\{\frac{\mu\,\gamma^\top\mathcal{D}_0^{-1}\big(\nabla\zeta(\upsilon) - \nabla\zeta(\upsilon')\big)}{\epsilon\,\|\mathcal{D}_0(\upsilon - \upsilon')\|}\Bigg\} \le \frac{\nu_0^2\mu^2}{2}.$$

(ED) There exist constants $\nu_0 > 0$ and $g > 0$ such that for all $|\mu| \le g$
$$\sup_{\|\gamma\|\le 1}\log\mathbb{E}\exp\big\{\mu\,\gamma^\top\mathcal{V}_0^{-1}\nabla\zeta(\upsilon^*)\big\} \le \frac{\nu_0^2\mu^2}{2}.$$

It is important to note that the constants $\epsilon, \nu_0$ in the respective weak and strong versions can differ substantially and may depend on the full dimension $p^* \in \mathbb{N}$ in more or less severe ways ($AH^{-2}\nabla_\eta\mathcal{L}$ might be quite smooth while $\nabla_\eta\mathcal{L}$ could be less regular).
For the convergence statement in Theorem 14 we additionally need the following condition, which controls the moments and the smoothness of the process $\nabla^2(\mathcal{L} - \mathbb{E}\mathcal{L})$:

(ED2) There exists a constant $\epsilon_2 \le 1/2$ such that for all $|\mu| \le g$ and all $0 < r < r_0$
$$\sup_{\upsilon,\upsilon'\in\Upsilon(r)}\ \sup_{\|\gamma_1\|=1}\ \sup_{\|\gamma_2\|=1}\log\mathbb{E}\exp\Bigg\{\frac{\mu\,\gamma_1^\top\mathcal{D}_0^{-1}\big(\nabla^2\zeta(\upsilon) - \nabla^2\zeta(\upsilon')\big)\gamma_2}{\epsilon_2\,\|\mathcal{D}_0(\upsilon - \upsilon')\|}\Bigg\} \le \frac{\nu_0^2\mu^2}{2}.$$

2.1.5 Quadratic Drift Beats Linear Fluctuation

Finally we present two conditions that ensure that with high probability the sequence $(\tilde\upsilon_{k,k(+1)})$ stays close to $\upsilon^*$ if the initial guess $\tilde\upsilon_0$ lands close to $\upsilon^*$. These conditions have to be satisfied on the whole set $\Upsilon \subseteq \mathbb{R}^{p^*}$. The first condition imposes that $\mathbb{E}\mathcal{L}(\upsilon)$ decreases nearly quadratically as the distance of $\upsilon$ to $\upsilon^*$ grows.

(Lr) For any $r > r_0$ there exists a value $b > 0$ such that
$$\mathbb{E}\big[\mathcal{L}(\upsilon) - \mathcal{L}(\upsilon^*)\big] \le -b\,\|\mathcal{D}_0(\upsilon - \upsilon^*)\|^2.$$

The next condition bounds the moments of the full gradient $\nabla\mathcal{L}(\upsilon)$ - again via subexponentiality.

(Er) For any $r \ge r_0$ there exists a constant $g(r) > 0$ such that
$$\sup_{\upsilon\in\Upsilon(r)}\ \sup_{\mu\le g(r)}\ \sup_{\|\gamma\|\le 1}\log\mathbb{E}\exp\big\{\mu\,\gamma^\top\mathcal{D}_0^{-1}\nabla\zeta(\upsilon)\big\} \le \frac{\nu_r^2\mu^2}{2}.$$

We impose one further, merely technical condition:

(B1) We assume for all $r$
$$6\nu_r\sqrt{x + 4p^*} + \frac{3\nu_r^2\, r}{b} \le g(r).$$

Remark 3 Without this condition the calculation of $R_0(x)$ in Section 4.3 would become technically more involved, without yielding further insight.

Remark 4 The condition (Lr) can be substantially relaxed to $b = b(r) > 0$ that decreases to 0 as $r \to \infty$. We avoid the resulting technicalities and refer the reader to the original publication for the non-constant case (see Spokoiny, 2012, Theorem 4.1).

2.2 Dependence on Initial Guess

Our main theorem is only valid under the conditions from Section 2.1 and under some constraints on the quality of the initial guess $\tilde\upsilon_0 \in \mathbb{R}^{p^*}$, which we denote by (A1), (A2) and (A3):

(A1) With probability greater than $1 - \beta$ the initial guess satisfies $\mathcal{L}(\tilde\upsilon_0) - \mathcal{L}(\upsilon^*) \ge -K_0$ for some $K_0 \ge 0$.

(A2) The conditions (ĔD1), (L̆0), (ED1) and (L0) from Section 2.1 hold for all $r \le R_0(x)$, where $R_0(x)$ can be bounded by a multiple of $\sqrt{x + p^* + K_0}$ (see (19)).
(A3) K0 ∈ R and ϵ > 0 are small enough to ensure

  ϵ C(ν) R0 < 1,  (17)

  C(ν) := 16√2 (1 + ν)/(1 − ν)²,  (18)

where ν > 0 is defined in (14).

Condition (A1) allows us to concentrate the analysis on a local set {‖𝒟(υ − υ*)‖ ≤ R0(x)} ∩ Υ with dominating probability (see Theorem 28). Conditions (A2) and (A3) ensure that this neighborhood is small enough to imply convergence of the procedure. They impose a bound on R0(x) and thus on K0 from (A1). These conditions boil down to ϵ√K0 being significantly smaller than 1, which is a quantification of the quality of the first guess. There are numerous ways to initialize the procedure. In Section 3 we use a grid search and show that for a sufficiently fine grid these conditions can be met in the treated model.

Remark 5 One way of obtaining condition (A1) is to show that ‖𝒟(υ̃0 − υ*)‖ ≤ R with probability greater than 1 − β for some finite R > 0 and 0 ≤ β(R) < 1. Then one can use, with some constant C > 0,

  K0 ≤ (1/2 + ϵ(1 + 12ν₀))(R + C√p*)²,

as we show in Lemma 31.

Remark 6 The precise definition of R0(x) > 0 reads

  R0(x) := z(x) ∨ { (6ν_r/(b(1 − ν))) √( x + 2.4p* + (b/(9ν_r²)) K0 ) },  (19)

with the term z(x) defined in (30).

2.3 Introduction of Important Objects

In this section we collect the most important objects and bounds that are relevant for Theorem 7. Remember the p*×p* information matrix 𝒟² from Section 2.1, which is defined similarly to the Fisher information matrix:

  𝒟² := −∇²EL(υ*) = [ D²  A ; Aᵀ  H² ].

A crucial object is the constant ν ≥ 0 defined by

  ‖D⁻¹AH⁻¹‖ =: ν,  (20)

which we assume with condition (I) to be smaller than 1. It determines the speed of convergence of the alternating procedure (see Theorem 7). Further introduce the p×p matrix D̆ and the p-vectors ∇̆θL and ξ̆ as

  D̆² = D² − AH⁻²Aᵀ,  ξ̆ = D̆⁻¹∇̆θL(υ*),  ∇̆θL = ∇θL − AH⁻²∇ηL.

The random variable ξ̆ ∈ R^p is related to the efficient influence function in semiparametric models.
If the model is regular and correctly specified, D̆⁻² is the covariance of the efficient influence function and its inverse the semiparametric Cramér-Rao lower bound for regular estimators. We define the semiparametric uniform spread

  ♦̆(r, x) := (16ϵ̆/(1 − ν²)²) ( r² + z_Q(x, 2p + 2p*)² ) (1 + 6ν₁²).  (21)

This object is central for our analysis as it describes the accuracy of our main results. It is small if ϵ̆(r² + x + p*) is small, since z_Q(x, p*) ≈ √(p* + x) (see its definition in Equation (42)).

2.4 Statistical Properties of the Alternating Sequence

In this section we present our main theorem in full rigor, i.e. we show that the limit of the alternating sequence satisfies a finite sample Wilks theorem and Fisher expansion.

Theorem 7 Assume that the conditions (Lr), (Er) and (B1) of Section 2.1 are met. Further assume (A1), (A2) and (A3) of Section 2.2. Then it holds with probability greater than 1 − 8e^{−x} − β for all k ∈ N

  ‖D̆(θ̃k − θ*) − ξ̆‖ ≤ ♦̆(r_k, x),  (22)

  | max_η L(θ̃k, η) − max_η L(θ*, η) − ‖ξ̆‖²/2 | ≤ 5( ‖ξ̆‖ + ♦̆(r_k, x) ) ♦̆(r_k, x),  (23)

where r_k ≤ C( √(p* + x) + ν^k R0 ), with a constant C that depends on ν < 1 and on 1 − C(ν)ϵR0 > 0. In particular this means that if k ≥ log( R0/√(p* + x) ) / log(1/ν), then ♦̆(r_k, x) ≤ ♦̆(C√(p* + x), x).

Remark 8 Note that with linear convergence speed this leads to statements about θ̃k that are very similar to those in (Andresen and Spokoiny, 2014) for the profile M-estimator θ̃.

Remark 9 Concerning the properties of ξ̆ ∈ R^p we repeat Remark 2.1 of (Andresen and Spokoiny, 2014). In case of correct model specification the deviation properties of the quadratic form ‖ξ̆‖² = ‖D̆⁻¹∇̆θL(υ*)‖² are essentially the same as those of a chi-square random variable with p degrees of freedom; see Theorem 39 in the appendix. In the case of a possible model misspecification the behavior of the quadratic form ‖ξ̆‖² will depend on the characteristics of the matrix IB := Cov(ξ̆); see again Theorem 39. Moreover, in the asymptotic setup the vector ξ̆ is asymptotically standard normal; see Section 2.2 of (Andresen and Spokoiny, 2014) for the i.i.d.
case.

Remark 10 These results allow us to derive some important corollaries like concentration and confidence sets (see Section 1.3).

Remark 11 In general an exact numerical computation of

  θ(η) := argmax_{θ∈R^p} L(θ, η),  or  η(θ) := argmax_{η∈R^m} L(θ, η),

is not possible. Define θ̂(η) and η̂(θ) as the numerical approximations to θ(η) and η(θ) and assume that, with the local set Υ◦(r) defined in (15),

  ‖D(θ̂(η) − θ(η))‖ ≤ τ  for all η ∈ Υ◦,η(R0) := {υ ∈ Υ◦(R0): Πη υ = η},
  ‖H(η̂(θ) − η(θ))‖ ≤ τ  for all θ ∈ Υ◦,θ(R0) := {υ ∈ Υ◦(R0): Πθ υ = θ}.

Then we can easily modify the proof of Theorem 7 by adding C(ν)τ to the error terms and the radii r_k, where C(ν) is some rational function of ν.

2.5 Convergence to the ME

Even though Theorem 7 tells us that the statistical properties of the alternating sequence resemble those of its target, the profile ME θ̃, it is an interesting question whether the underlying approach allows one to identify conditions under which the sequence actually attains the maximizer υ̃. Define the radius r0(x) > 0 to be the smallest radius r > 0 such that

  P( υ̃, υ̃_{θ*} ∈ Υ◦(r0) ) ≥ 1 − e^{−x},  where υ̃_{θ*} := argmax_{υ∈Υ, Πθυ=θ*} L(υ).

Remark 12 This radius can be determined using conditions (Lr) and (Er) of Section 2.1 and Theorem 28, which would yield r0(x) = O(√(x + p*)).

Without further assumptions Theorem 7 yields the following corollary:

Corollary 13 Under the assumptions of Theorem 7 it holds with probability greater than 1 − 8e^{−x} − β

  ‖D̆(θ̃ − θ̃k)‖ ≤ ♦̆(r_k, x) + ♦̆(r0, x).

Corollary 13 is a first step in the direction of an actual convergence result, but the gap ♦̆(r_k, x) + ♦̆(r0, x) is not a zero sequence in k ∈ N. It turns out that it is possible to prove convergence to the ME at the cost of assuming more smoothness of the functional L and using the right bound for the maximal eigenvalue of the Hessian ∇²L(υ*).
Define z₂(x, ∇²L(υ*)) via

  P( ‖𝒟⁻¹∇²ζ(υ*)‖ ≥ z₂(x, ∇²L(υ*)) ) ≤ e^{−x},

and κ(x, R0) as

  κ(x, R0) := ( 2√2(1 + ν)/(1 − ν) ) { ϵR0 + 9ν₂ϵ₂ ‖𝒟⁻¹‖ z₁(x, 6p*) R0 + ‖𝒟⁻¹‖ z₂(x, ∇²L(υ*)) },

where z₁(x, ·) ≈ √(x + p*), see (46). With these definitions we can prove the following theorem:

Theorem 14 Let the conditions (Lr), (Er) and (B1) be met. Further suppose (A1) and (A2), with (ED2) instead of (ED1). Assume that κ(x, R0) < 1 − ν. Then with probability greater than 1 − 3e^{−x} − β

  ‖𝒟(υ̃_{k,k(+1)} − υ̃)‖ ≤ r_k ≤ { ( 2√2 ν^k / (1 − κ(x, R0)k) ) R0,  if κ(x, R0)k ≤ 1;
                                 ν^{ (k/log(k)) log( (1−ν)/κ(x,R0) ) } c_k R0,  otherwise, }  (24)

with some sequence (c_k) ⊂ R, where 0 < c_k ≤ 2√2.

Remark 15 This means that we obtain nearly linear convergence to the global maximizer υ̃.

Remark 16 As in Remark 11, if no exact numerical computation of the stepwise maximizers is possible, we can easily modify the proof of Theorem 14 by adding C(ν)τ to κ(x, R0) to address that case.

Remark 17 For the case that L(υ) = Σ_{i=1}^n ℓ_i(υ) is a sum of independent marginal functionals ℓ_i: Υ → R, we can use Corollary 3.7 of (Tropp, 2012) to obtain z₂(x, ∇²L(υ*)) ≈ √(x + log(p*)), if for some sequence of matrices (A_i) ⊂ R^{p*×p*}

  log E exp{ λ∇²ℓ_i(υ*) } ≼ ν²λ²/2 · A_i.

In the case of smooth i.i.d. models this means that κ(x, R0) ≤ (C/√n)(x + R0 + log(p*)) if p* + x = o(n).

Remark 18 It may happen that κ(x, R0)/(1 − ν) is very close to or even larger than 1. But a close look at the proof of Theorem 14 reveals that this can be improved using Lemma 33. For this purpose bound r*_k ≤ C( z(x) + ν^k R0 ), with r*_k defined in (33) and with some constant C > 0. Then the result of Theorem 14 is true with κ(x, C√(p* + x)) instead of κ(x, R0) and with probability greater than 1 − 10e^{−x}. See Remark 38 for more details.

2.6 Critical Dimension

We want to address the issue of critical parameter dimensions when the full dimension p* grows with the sample size n. We write p* = p*_n. The results of Theorem 7 are accurate if the spread function ♦̆(r_k, x) from (21) is small. The critical size of p*_n then depends on the exact bounds on ϵ, ϵ̆.
In the i.i.d. setting we have ϵ ≈ ϵ̆ ≈ 1/√n, such that ♦̆(r_k, x) ≈ p*_n/√n for large k ∈ N. In other words, one needs that p*_n²/n is small to obtain an accurate non-asymptotic version of the Wilks phenomenon and the Fisher theorem for the limit of the alternating sequence. This is not surprising, because good performance of the ME itself can only be guaranteed if p*_n²/n is small, as is shown by Andresen and Spokoiny (2014). There are examples where the pME only satisfies a Wilks or Fisher result if p*_n²/n is small, such that in any of those settings the alternating sequence started in the global maximizer does not admit an accurate Wilks or Fisher expansion. Interestingly enough, the constraint κ(x, R0) < 1 − ν of Theorem 14 for the convergence of the sequence to the global maximizer means that one needs p*_n/√n ≪ 1 in the smooth i.i.d. setting if R0 ≤ C_{R0}√(p*_n + x). Further, Theorem 14 states a lower bound for the speed of convergence that in the smooth i.i.d. setting decreases if p*_n/√n grows. Unfortunately we were unable to find an example that meets the conditions of Section 2.1 and where no convergence occurs if p*_n/√n tends to infinity. So whether this dimension effect on the convergence is an artifact of our proofs or indeed a property of the alternating procedure remains an open question.

3. Application to Single Index Model

We illustrate how the results of Theorem 7 and Theorem 14 can be applied in single index modeling. This section is based on (Andresen, 2015); see that paper for a more detailed presentation. Consider the following model

  y_i = f(X_iᵀθ*) + ε_i,  i = 1, ..., n,

for some f: R → R and θ* ∈ S^{p,+}_1 ⊂ R^p, with i.i.d. errors ε_i ∈ R, Var(ε_i) = σ², and i.i.d. random variables X_i ∈ R^p with distribution denoted by P_X. The single-index model is widely applied in statistics. For example, in econometric studies it serves as a compromise between too restrictive parametric models and flexible but hardly estimable purely nonparametric models.
Usually the statistical inference focuses on estimating the index vector θ*. A lot of research has already been done in this field. For instance, (Delecroix et al., 1997) show the asymptotic efficiency of the general semiparametric maximum-functional estimator for particular examples, and in (Haerdle et al., 1993) the right choice of bandwidth for the nonparametric estimation of the link function is analyzed. We want to use this model to illustrate our theoretical results. To ensure identifiability of θ* ∈ R^p we assume that it lies in the half sphere S^{p,+}_1 := {θ ∈ R^p: ‖θ‖ = 1, θ1 > 0} ⊂ R^p. For simplicity we assume that the support of the X_i ∈ R^p is contained in the ball of radius s > 0. This allows us to approximate f ∈ {f: [−s, s] → R} by an orthonormal C³-Daubechies-wavelet basis (e_k)_{k∈N} on the interval. A candidate to estimate θ* is the sieve profile ME

  θ̃_m := Πθ argmax_{(θ,η)∈Υ_m} L_m(θ, η),  L_m(θ, η) = −(1/2) Σ_{i=1}^n ( y_i − Σ_{k=0}^{m−1} η_k e_k(X_iᵀθ) )²,

where Υ_m = S^{p,+}_1 × R^m. Again we will suppress the subindex m in the following. In this setting a direct computation of υ̃ becomes involved, as the maximization problem is high dimensional and not convex. But as noted in the introduction, the maximization with respect to η for given θ is high dimensional but convex and consequently feasible. Further, for moderate p ∈ N the maximization with respect to θ for fixed η is computationally realistic. So an alternating maximization procedure is applicable. To show that it behaves in the desired way we apply the technique presented above. For the initial guess υ̃0 ∈ Υ one can use a simple grid search. For this take a uniform grid G_N := (θ_1, ..., θ_N) ⊂ S^{+}_1 and define

  υ̃0 := argmax_{(θ,η)∈Υ, θ∈G_N} L(θ, η).  (25)

Note that given the grid the above maximizer is easily obtained. Simply calculate

  η̃_{0,k} := argmax_η L(θ_k, η) = ( Σ_{i=1}^n e eᵀ(X_iᵀθ_k) )⁻¹ Σ_{i=1}^n y_i e(X_iᵀθ_k) ∈ R^m,  (26)

where by abuse of notation e = (e_1, ..., e_m)ᵀ ∈ R^m. Now observe that υ̃0 = argmax_{k=1,...,N} L(θ_k, η̃_{0,k}).
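To make the scheme concrete, the following minimal numpy sketch runs the alternating procedure with the grid-search initialization and the closed-form η-step (26) on a toy single-index data set. It is only an illustration of the idea, not the authors' implementation: the wavelet basis is replaced by a small polynomial basis, the link is f(t) = sin(t), and the θ-step is a search over a fine grid on the half sphere in R².

```python
import numpy as np

rng = np.random.default_rng(0)

# toy single-index data: y_i = f(X_i' theta*) + eps_i with f(t) = sin(t)
n, p, m = 500, 2, 4                          # sample size, index dim, basis size
theta_star = np.array([np.cos(0.7), np.sin(0.7)])   # unit vector, theta_1 > 0
X = rng.uniform(-1, 1, size=(n, p))
y = np.sin(X @ theta_star) + 0.1 * rng.standard_normal(n)

def basis(t):
    # polynomial basis standing in for the wavelet basis (e_1, ..., e_m)
    return np.vander(t, m, increasing=True)  # shape (n, m)

def eta_step(theta):
    # closed-form least-squares maximizer in eta for fixed theta, cf. (26)
    E = basis(X @ theta)
    return np.linalg.solve(E.T @ E, E.T @ y)

def loss(theta, eta):
    # L(theta, eta) up to a constant: negative residual sum of squares
    return -np.sum((y - basis(X @ theta) @ eta) ** 2)

# uniform grid on the half sphere {theta: ||theta|| = 1, theta_1 >= 0}
grid = [np.array([np.cos(a), np.sin(a)])
        for a in np.linspace(-np.pi / 2, np.pi / 2, 361)]

# initial guess by grid search: profile out eta at every grid point
theta = max(grid, key=lambda th: loss(th, eta_step(th)))

# alternating maximization: convex eta-step, low-dimensional theta-step
for _ in range(10):
    eta = eta_step(theta)
    theta = max(grid, key=lambda th: loss(th, eta))

err = np.arccos(np.clip(theta @ theta_star, -1.0, 1.0))
print(f"angle between estimate and theta*: {err:.3f} rad")
```

With a grid fine enough relative to the noise level, the recovered direction agrees with θ* up to the grid resolution; refining the grid around the current iterate would play the role of the exact θ-step.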
Define the mesh size of the grid τ := sup_{θ∈S^{p,+}_1} inf_{θ′∈G_N} ‖θ − θ′‖. To apply the results presented in Theorem 7 and Theorem 14 we need a list of assumptions, denoted by (A).

(CondX) The random variables (X_i)_{i=1,...,n} ⊂ R^p are i.i.d. and bounded with distribution denoted by P_X and independent of (ε_i)_{i=1,...,n} ⊂ R. The measure P_X is absolutely continuous with respect to the Lebesgue measure. The Lebesgue density p_X of P_X is Lipschitz continuous and positive on B_s(0) ⊂ R^p. For any pair θ, θ′ ∈ S^{+,p}_1 with θ ≠ θ′ we have almost surely

  Var( Xᵀθ | Xᵀθ′ ) > 0.

(Condf) For some η* ∈ B_{r*}(0) ⊂ l₂ := {(u_k)_{k∈N}: Σ_{k=1}^∞ u_k² < ∞}, where ‖f‖_∞ < ∞ and ‖f′‖_∞ < ∞ and where with some α > 2

  Σ_{k=0}^∞ k^{2α} η*_k² < ∞.

On some interval [t0 − h, t0 + h] ⊆ [−s, s] with h > 0 it holds true that |f′(t)| > 0.

(Condε) The errors (ε_i) ⊂ R are i.i.d. with E[ε_i] = 0, Var(ε_i) = σ², and satisfy for all |µ| ≤ g̃, for some g̃ > 0 and some ν̃ > 0,

  log E[exp{µε₁}] ≤ ν̃²µ²/2.

Remark 19 Note that our assumptions in terms of moments and smoothness are quite common in this model. For instance, (Haerdle et al., 1993) assume that the density p_X of the regressors (X_i) is twice continuously differentiable, that f has two bounded derivatives and that the errors (ε_i) are centered with bounded polynomial moments of arbitrary degree.

Remark 20 Var(Xᵀθ | Xᵀθ′) = 0 would mean that Xᵀθ = a(Xᵀθ′) for some measurable function a: R → R. But then we would have for any (α, β) ∈ R² with α² + β² = 1 that

  f( Xᵀ(αθ + βθ′) ) = f( α a(Xᵀθ′) + β Xᵀθ′ ) =: f_{α,β}(Xᵀθ′),

such that the problem would no longer be identifiable. Also |f′(t)| > 0 on some interval is necessary for identifiability of θ*.

Proposition 21 Let τ = o(p^{−3/2}) and p⁵/n → 0. With initial guess given by Equation (25) and for not too large x > 0, the alternating sequence satisfies (22) and (23) with probability greater than 1 − 9 exp{−x}, where with some constant C ∈ R

  ♦̆(r, x) ≤ C (p + x)^{3/2}/√n ( r² + p + x ).

Proposition 22 Take the initial guess given by Equation (25). Assume (A).
Further assume that p⁴/n → 0 and τ = o(p^{−3/2}). Then we get the claim of Theorem 14 with β = e^{−x} and

  κ(x, R0) = O( τp^{3/2} + τxp^{3/2}/n^{1/4} ) + O( p²/√n ) → 0,

for moderate choice of x > 0.

Remark 23 The constraint τ = o(p^{−3/2}) implies that for the calculation of the initial guess the vector η̃_{0,l} of (26) and the functional L(·) have to be evaluated N = p^{3(p−1)/2} times. For details and proofs see (Andresen, 2015).

4. Proof of Theorem 7

In this section we present the proof of Theorem 7. As the proof is quite technical and complex, we first explain its basic ideas. In a second section we outline the steps of the proof more clearly. Finally we carry out each of these steps, which combine to yield the proof.

4.1 Idea of the Proof

To ease the understanding of what follows in the subsequent sections, we want to illustrate the central ideas with a simple model. Consider for some positive definite matrix 𝒟² ∈ R^{p*×p*} and some vector υ* = (θ*, η*) ∈ R^{p+m} = R^{p*} the model

  Y = υ* + ε ∈ R^{p*},  where ε ~ N(0, 𝒟⁻²),  𝒟² = [ D²  A ; Aᵀ  H² ].

Set L to be the true log likelihood of the observations, i.e.

  L(υ, Y) = −‖𝒟(υ − Y)‖²/2.

With any initial guess υ̃0 ∈ R^{p+m} we obtain from (3), for k ∈ N and the usual first order criterion of maximality, the following two equations:

  D(θ̃k − θ*) = Dε_θ − D⁻¹A(η̃k − η*),
  H(η̃_{k+1} − η*) = Hε_η − H⁻¹Aᵀ(θ̃k − θ*).

Combining these two equations we derive, assuming ‖D⁻¹AH⁻²AᵀD⁻¹‖ = ‖M₀‖ = ν < 1,

  D(θ̃k − θ*) = D⁻¹(D²ε_θ − Aε_η) + D⁻¹AH⁻²AᵀD⁻¹ D(θ̃_{k−1} − θ*)
            = Σ_{l=1}^{k} M₀^{k−l} D⁻¹(D²ε_θ − Aε_η) + M₀^k D(θ̃0 − θ*)
            → Σ_{l=0}^{∞} M₀^{l} D⁻¹(D²ε_θ − Aε_η) =: D(θ̂ − θ*).

Because the limit θ̂ is independent of the initial point υ̃0 and because the profile θ̃ is a fixed point of the procedure, the unique limit satisfies θ̂ = θ̃. This argument is based on the fact that in this setting the functional is quadratic, such that the gradient satisfies

  ∇L(υ) = −𝒟²(υ − υ*) + 𝒟²ε.
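The geometric convergence in this toy model can be checked numerically. The sketch below alternates the two first-order equations for the quadratic functional L(υ) = −‖𝒟(υ − Y)‖²/2, whose global maximizer is Y itself; the blocks D, A, H are arbitrary matrices made up for the demonstration, chosen so that ν = ‖D⁻¹AH⁻¹‖ < 1.

```python
import numpy as np

p, m = 3, 4
D = np.diag([2.0, 1.5, 1.0])          # theta-block of the information matrix
H = np.eye(m)                          # eta-block
A = 0.2 * np.ones((p, m))              # off-diagonal block (made up for the demo)
nu = np.linalg.norm(np.linalg.inv(D) @ A @ np.linalg.inv(H), 2)  # condition (I)

rng = np.random.default_rng(1)
ups_star = rng.standard_normal(p + m)          # true parameter (theta*, eta*)
Y = ups_star + rng.standard_normal(p + m)      # toy observation Y = ups* + eps

# L(ups) = -||Dcal(ups - Y)||^2 / 2 with Dcal^2 = [[D^2, A], [A^T, H^2]];
# alternating the first-order conditions gives the two update equations below
D2, H2 = D @ D, H @ H
theta, eta = np.zeros(p), np.zeros(m)          # arbitrary starting point
gaps = []
for _ in range(30):
    # theta-step: grad_theta L = 0  <=>  D^2 (theta - Y_th) + A (eta - Y_et) = 0
    theta = Y[:p] - np.linalg.solve(D2, A @ (eta - Y[p:]))
    # eta-step:   grad_eta L = 0    <=>  H^2 (eta - Y_et) + A^T (theta - Y_th) = 0
    eta = Y[p:] - np.linalg.solve(H2, A.T @ (theta - Y[:p]))
    gaps.append(np.hypot(np.linalg.norm(theta - Y[:p]),
                         np.linalg.norm(eta - Y[p:])))

gap = gaps[-1]
print(f"nu = {nu:.3f}, gap to the maximizer after 30 sweeps = {gap:.2e}")
```

Per full sweep the θ-error is multiplied by the matrix D⁻²AH⁻²Aᵀ, whose norm is at most ν², so the gap to the maximizer decays geometrically, as in the displayed recursion.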
Any smooth function is approximately quadratic around its maximizer, which motivates a local linear approximation of the gradient of the functional L to derive our results with similar arguments. This is done in the proof of Theorem 7. First it is ensured that the whole sequence (υ̃_{k,k(+1)})_{k∈N₀} satisfies, for some R0(x) > 0 and with probability greater than 1 − e^{−x},

  { υ̃_{k,k(+1)}, k ∈ N₀ } ⊆ { ‖𝒟(υ − υ*)‖ ≤ R0(x) },  (27)

where 𝒟² := −∇²EL(υ*) (see Theorem 28). In the second step we approximate, with ζ = L − EL,

  L(υ) − L(υ*) = ∇ζ(υ*)ᵀ(υ − υ*) − ‖𝒟(υ − υ*)‖²/2 + α(υ, υ*),  (28)

where α(υ, υ*) is defined by (28). Similar to the toy case above, this allows us, using the first order criterion of maximality and (27), to obtain a bound of the kind

  ‖𝒟(υ̃_{k,k} − υ*)‖ ≤ C Σ_{l=0}^{k} ν^l ( ‖𝒟⁻¹∇ζ(υ*)‖ + |α(υ̃_{l,l}, υ*)| ) ≤ C₁( ‖𝒟⁻¹∇ζ(υ*)‖ + ϵ(R0) ) + ν^k R0 =: r_k.

This is done in Lemma 32, using results from (Andresen and Spokoiny, 2014) to show that ϵ(R0) is small. Finally the same arguments as in (Andresen and Spokoiny, 2014) allow us to obtain our main result, using that with high probability for all k ∈ N₀

  υ̃_{k,k} ∈ { ‖𝒟(υ − υ*)‖ ≤ r_k }.

For the convergence result similar arguments are used. The only difference is that instead of (28) we use the approximation

  L(υ) − L(υ̃) = −‖𝒟(υ − υ̃)‖²/2 + α′(υ, υ̃),

exploiting that ∇L(υ̃) = 0, which allows us to obtain actual convergence to the ME.

4.2 A Desirable Set

In this section we explain the agenda of the proof. The first step of the proof is to find a desirable set Ω(x) ⊆ Ω of high probability on which a linear approximation of the gradient of the functional L(υ) can be carried out with sufficient accuracy. Once this set is found, all subsequent analysis concerns events in Ω(x) ⊆ Ω.
For this purpose define, with the local set Υ◦(r) defined in (15), for some K ∈ N the set

  Ω(x) := ∩_{k=0}^{K} ( C_{k,k} ∩ C_{k,k+1} ) ∩ C(x) ∩ { L(υ̃0) − L(υ*) ≥ −K0 },  (29)

where

  C_{k,k(+1)} = { ‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ R0(x), ‖D(θ̃k − θ*)‖ ≤ R0(x), ‖H(η̃_{k(+1)} − η*)‖ ≤ R0(x) },

and C(x) denotes the intersection of the sets

  { sup_{υ∈Υ◦(r)} (6ϵν₁)⁻¹ Y(υ) ≤ 2r² + z_Q(x, 4p*)² },
  { sup_{υ∈Υ◦(r)} (6ϵ̆ν₁)⁻¹ Y̆(υ) ≤ 2r² + z_Q(x, 2p + 2p*)² },
  { max{ ‖𝒟⁻¹∇L(υ*)‖, ‖D̆⁻¹∇̆θL(υ*)‖, ‖H⁻¹∇ηL(υ*)‖ } ≤ z(x) },
  { υ̃, υ̃_{θ*} ∈ Υ◦(r0(x)) }.

For ζ(υ) = L(υ) − EL(υ) the semiparametric normalized stochastic gradient gap is defined as

  Y̆(υ) = ‖D̆⁻¹( ∇̆θζ(υ) − ∇̆θζ(υ*) )‖,

the parametric normalized stochastic gradient gap Y(υ) is defined as

  Y(υ) = ‖𝒟⁻¹( ∇ζ(υ) − ∇ζ(υ*) )‖,

and r0(x) > 0 is chosen such that P(υ̃, υ̃_{θ*} ∈ Υ◦(r0)) ≥ 1 − e^{−x}, where υ̃_{θ*} := argmax_{υ∈Υ, Πθυ=θ*} L(υ). The constant z(x) in the definition of C(x) is only introduced for ease of notation. This makes some bounds less sharp but allows us to address all terms that are of order √(p* + x) with one symbol. It is defined as

  z(x) := z(x, 𝒟⁻¹V²𝒟⁻¹) ∨ z_Q(x, 4p*) ≈ √(p* + x),  (30)

where V² ∈ R^{p*×p*} denotes the covariance matrix from Section 2.1

  V² := Cov(∇L(υ*)).

Remark 24 z(x, ·) is explained in more detail in Section A and z_Q(x, ·) is defined in Equation (42). The constant z(x, IB) is comparable to the 1 − e^{−x}-quantile of the norm of X, where X ~ N(0, IB), i.e. it is of the order of the trace of IB. The constant z_Q(x, Q) arises as an exponential deviation bound for the supremum of a smooth process over a set with complexity described by Q.

Remark 25 We intersect the set with the event {υ̃, υ̃_{θ*} ∈ Υ◦(r0)}, where we a priori demand r0(x) > 0 to be chosen such that P(υ̃, υ̃_{θ*} ∈ Υ◦(r0)) ≥ 1 − e^{−x}. Note that condition (Er) together with (Lr) allows us to set √(p* + x) ≈ r0 ≤ R0 (see Theorem 28). In Section 4.3 we show that this set is of probability greater than 1 − 8e^{−x} − β(A). We want to explain the purpose of this set along the architecture of the proof of our main theorem.
{L(υ̃0) − L(υ*) ≥ −K0}: This set ensures that the first guess satisfies L(υ̃0) − L(υ*) ≥ −K0, which means that it is close enough to the target υ* ∈ R^{p*}. This fact allows us to obtain an a priori bound for the deviation of the sequence (υ̃_{k,k(+1)}) ⊂ Υ from υ* ∈ Υ with Theorem 28.

{‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ R0(x)}: As just mentioned, this event is of high probability due to L(υ̃0) − L(υ*) ≥ −K0 and Theorem 28. This allows us to concentrate the analysis on the set Υ◦(R0), on which Taylor expansions of the functional L: R^{p*} → R become accurate.

C(x): This set ensures that on Ω(x) ⊆ Ω all occurring random quadratic forms and stochastic errors are controlled by z(x) ∈ R. Consequently we can derive in the proof of Lemma 32 an a priori bound of the form ‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ r_k for a decreasing sequence of radii (r_k) ⊂ R₊ satisfying lim sup_{k→∞} r_k = Cz(x). Further, this set allows us to obtain in Lemma 34 the bounds for all k ∈ N. On Ω(x) ⊆ Ω we find υ̃_{k,k(+1)} ∈ Υ◦(r_k), such that we can follow the arguments of Theorem 2.2 of (Andresen and Spokoiny, 2014) to obtain the desired result with accuracy measured by ♦̆(r_k, x).

The sketch in Figure 1 illustrates the behavior of the first steps of the procedure. The axes correspond to the θ- and η-subspaces respectively. The two ellipsoids with center υ* and solid frame represent the local sets Υ◦(R_K) ⊆ Υ◦(R0), with R_K > 0 from Remark 5. We see that the initial guess υ̃0 lies in Υ◦(R). The elements (υ̃_{k,k(+1)}) of the alternating sequence all land inside the respective Υ◦(r_k), which are represented by shrinking ellipsoids centered in υ* with dotted frames. Note that not the set Υ◦(R_K) but Υ◦(R0) contains all points of the sequence.

4.3 Probability of the Desirable Set

Here we show that the set Ω(x) actually is of probability greater than 1 − 8e^{−x} − β. We prove the following two lemmas, which together yield the claim.

Lemma 26 The set C(x) satisfies

  P(C(x)) ≥ 1 − 7e^{−x}.

Figure 1: The behavior of the procedure for the first 4 steps of the alternating algorithm.
Proof The proof is similar to the proof of Theorem 3.1 in (Spokoiny, 2012). Denote

  A := { sup_{υ∈Υ◦(r)} (6ϵν₁)⁻¹ Y(υ) ≤ 2r² + z_Q(x, 4p*)² },
  B := { sup_{υ∈Υ◦(r)} (6ϵ̆ν₁)⁻¹ Y̆(υ) ≤ 2r² + z_Q(x, 2p + 2p*)² },
  C := { max{ ‖𝒟⁻¹∇L(υ*)‖, ‖D̆⁻¹∇̆θL(υ*)‖, ‖H⁻¹∇ηL(υ*)‖ } ≤ z(x) }.

We estimate

  P(C(x)) ≥ 1 − P(Aᶜ) − P(Bᶜ) − P(Cᶜ) − P( υ̃, υ̃_{θ*} ∉ Υ◦(r0) ) − P( ‖ξ̆‖ > z(x, Cov(ξ̆)) ).

We bound, using for both terms Theorem 42, which is applicable due to (ED1) and (ĔD1),

  P(Aᶜ) ≤ e^{−x},  P(Bᶜ) ≤ e^{−x}.

For the set C ⊆ Ω observe that we can use (I) and Lemma 27 to find ‖H⁻¹∇η‖ ∨ ‖D̆⁻¹∇̆θ‖ ≤ ‖𝒟⁻¹∇‖. This implies that

  { ‖𝒟⁻¹∇‖ ≤ z(x, IB) } ⊆ { ‖D̆⁻¹∇̆θ‖ ∨ ‖H⁻¹∇η‖ ≤ z(x, IB) }.

Using the deviation properties of quadratic forms as sketched in Section A we find

  P( ‖𝒟⁻¹∇‖ > z(x, IB) ) ≤ 2e^{−x},  P( ‖D̆⁻¹∇̆θ‖ > z(x, Cov(ξ̆)) ) ≤ 2e^{−x}.

By the choice of z(x) > 0 and r0 > 0 this gives the claim.

We cite Lemma B.2 of (Andresen and Spokoiny, 2014):

Lemma 27 Let

  𝒟² = [ D²  A ; Aᵀ  H² ] ∈ R^{(p+m)×(p+m)},  D ∈ R^{p×p}, H ∈ R^{m×m} invertible,  ‖D⁻¹AH⁻¹‖ < 1.

Then for any υ = (θ, η) ∈ R^{p+m} we have ‖H⁻¹η‖ ∨ ‖D⁻¹θ‖ ≤ ‖𝒟⁻¹υ‖.

The next step is to show that the set ∩_{k=1}^{K}(C_{k,k} ∩ C_{k,k+1}) has high probability that is independent of the number of necessary steps. A close look at the proof of Theorem 4.1 of (Spokoiny, 2012) shows that it actually yields the following modified version:

Theorem 28 ((Spokoiny, 2012), Theorem 4.1) Suppose (Er) and (Lr) with b(r) ≡ b. Further define the following random set

  Υ(K) := { υ ∈ Υ : L(υ) − L(υ*) ≥ −K }.

If for a fixed r0 and any r ≥ r0 the following conditions are fulfilled

  √(x + 2p*) ≤ 3ν_r g(r)/b,  x + 2p* + (b/(9ν_r²)) K ≤ r b,

then

  P( Υ(K) ⊆ Υ◦(r0) ) ≥ 1 − e^{−x}.

Note that with (I)

  ‖D(θ̃k − θ*)‖ ∨ ‖H(η̃_{k(+1)} − η*)‖ ≤ (1 − ν)⁻¹ ‖𝒟(υ̃_{k,k(+1)} − υ*)‖.

With assumption (B1) and

  R0(x) = ( 6ν_r / (b(1 − ν)) ) √( x + 2.4p* + (b/(9ν_r²)) K0 ),

this implies the desired result, as L(υ̃_{k,k(+1)}) − L(υ*) ≥ L(υ̃0) − L(υ*), such that with Theorem 28

  P( ∩_{k=0}^{K} (C_{k,k} ∩ C_{k,k+1}) ) ≥ P( ∩_{k=0}^{K} (C_{k,k} ∩ C_{k,k+1}) ∩ { L(υ̃0) − L(υ*) ≥ −K0 } )
      ≥ P( Υ(K0) ⊆ Υ◦( (1 − ν)R0(x) ) ) − β(A) ≥ 1 − e^{−x} − β(A).
Remark 29 This also shows that the sets of maximizers (υ̃_{k,k(+1)}) are nonempty and well defined, since the maximization always takes place on compact sets of the form {θ ∈ R^p: (θ, η) ∈ Υ◦(R0)} or {η ∈ R^m: (θ, η) ∈ Υ◦(R0)}.

Remark 30 To address the claim of Remark 5 we present the following lemma:

Lemma 31 On the set C(x) ∩ {υ̃0 ∈ Υ◦(R_K)} it holds

  L(υ̃0) − L(υ*) ≥ −(1/2 + ϵ(1 + 12ν₀))( R + z(x) )².

Proof With similar arguments as in the proof of Lemma 32 we have on C(x) ⊆ Ω that

  L(υ̃0) − L(υ*) ≥ E[L(υ̃0) − L(υ*)] − ‖𝒟⁻¹∇ζ(υ*)‖ R − |{ ∇ζ(υ̂) − ∇ζ(υ*) }ᵀ(υ̃0 − υ*)|
      ≥ −‖𝒟(υ̃0 − υ*)‖²/2 − ‖𝒟⁻¹∇ζ(υ*)‖ R − ‖𝒟⁻¹( ∇L(υ̂) − ∇L(υ*) )‖ R − ϵR²
      ≥ −R²/2 − z(x)R − ϵ(1 + 12ν₀)( R² + z(x)² )
      ≥ −(1/2 + ϵ(1 + 12ν₀))( R + z(x) )².

4.4 Proof of Convergence

We derive the a priori bound υ̃_{k,k(+1)} ∈ Υ◦(r_k) with an adequately decreasing sequence (r_k) ⊂ R₊ using the argument of Section 4.1, where lim sup r_k ≈ z(x). For this purpose we define the parametric uniform spread

  ♦(r, x) := ϵ( 2r² + z_Q(x, 4p*)² )(1 + 6ν₁²).  (31)

Lemma 32 Assume that for some sequence (r^{(l)}_k)_{k∈N} we have Ω(x) ⊆ {υ̃_{k,k(+1)} ∈ Υ◦(r^{(l)}_k)}. Then under the assumptions of Theorem 7 we get on Ω(x) for all k ∈ N₀

  ‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ 2√2(1 − ν)⁻¹ { z(x) + (1 + ν)ν^k R0(x) + Σ_{r=0}^{k} ν^r ♦(r^{(l)}_{k−r}, x) } =: r^{(l+1)}_k.

Proof 1. We first show that on Ω(x)

  D(θ̃k − θ*) = D⁻¹∇θL(υ*) − D⁻¹A(η̃k − η*) + τ(r^{(l)}_k),  (32)
  H(η̃k − η*) = H⁻¹∇ηL(υ*) − H⁻¹Aᵀ(θ̃_{k−1} − θ*) + τ(r^{(l)}_k),

where, with ♦(r, x) defined in (31), ‖τ(r)‖ ≤ ♦(r, x). The proof is the same in each step for both statements, such that we only prove the first one. The arguments presented here are similar to those of Theorem D.1 in (Andresen and Spokoiny, 2014). By assumption on Ω(x) we have υ̃_{k,k(+1)} ∈ Υ◦(r^{(l)}_k). Define with ζ = L − EL

  α(υ, υ*) := L(υ) − L(υ*) − { ∇ζ(υ*)ᵀ(υ − υ*) − ‖𝒟(υ − υ*)‖²/2 }.

Then

  L(υ) − L(υ*) = ∇ζ(υ*)ᵀ(υ − υ*) − ‖𝒟(υ − υ*)‖²/2 + α(υ, υ*)
      = ∇θζ(υ*)ᵀ(θ − θ*) − ‖D(θ − θ*)‖²/2 − (θ − θ*)ᵀA(η − η*) + ∇ηζ(υ*)ᵀ(η − η*) − ‖H(η − η*)‖²/2 + α(υ, υ*).

Setting ∇θL(θ̃k, η̃k) = 0 we find

  D(θ̃k − θ*) − D⁻¹∇θζ(υ*) + D⁻¹A(η̃k − η*) = D⁻¹∇θα(υ̃_{k,k}, υ*).
As we assume that υ̃_{k,k} ∈ Υ◦(R0), it suffices to show that with dominating probability

  sup_{(θ,η̃k)∈Υ◦(R0)} U_θ(θ, η̃k) ≤ ♦(r^{(l)}_k, x),  where  U_θ(θ, η̃k) := ‖D⁻¹{ ∇θL(υ̃_{k,k}) − ∇θL(υ*) + D²(θ − θ*) + A(η̃k − η*) }‖.

To see this note first that with Lemma 27, ‖D⁻¹Πθ𝒟υ‖ ≤ ‖𝒟⁻¹𝒟υ‖. This gives by condition (L0), Lemma 27 and a Taylor expansion

  sup_{(θ,η̃k)∈Υ◦(r)} EU_θ(θ, η̃k) ≤ sup_{υ∈Υ◦(r)} ‖D⁻¹Πθ{ ∇EL(υ) − ∇EL(υ*) + 𝒟²(υ − υ*) }‖
      ≤ sup_{υ∈Υ◦(r)} ‖D⁻¹Πθ𝒟‖ ‖𝒟⁻¹∇²EL(υ)𝒟⁻¹ − I_{p*}‖ ‖𝒟(υ − υ*)‖ ≤ ϵr².

For the remainder note that, again with Lemma 27,

  ‖D⁻¹( ∇θζ(υ) − ∇θζ(υ*) )‖ ≤ ‖𝒟⁻¹( ∇ζ(υ) − ∇ζ(υ*) )‖.

This yields that on Ω(x)

  sup_{(θ,η̃k)∈Υ◦(r)} | U_θ(θ, η̃k) − EU_θ(θ, η̃k) | ≤ sup_{υ∈Υ◦(r)} ‖D⁻¹( ∇θζ(υ) − ∇θζ(υ*) )‖ ≤ sup_{υ∈Υ◦(r)} Y(υ) ≤ 6ν₁ϵ( z_Q(x, 4p*)² + 2r² ).

Using the same argument for η̃k gives the claim.

2. We prove the a priori bound for the distance of the k-th estimator to the oracle,

  ‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ r^{(l+1)}_k.

To see this we first use the inequality

  ‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ √2 ‖D(θ̃k − θ*)‖ + √2 ‖H(η̃_{k(+1)} − η*)‖.

Now we find with (32)

  ‖D(θ̃k − θ*)‖ ≤ ‖D⁻¹∇θL(υ*)‖ + ‖D⁻¹A(η̃k − η*)‖ + ♦(r^{(l)}_k, x) ≤ ‖D⁻¹∇θL(υ*)‖ + ‖D⁻¹AH⁻¹‖ ‖H(η̃k − η*)‖ + ♦(r^{(l)}_k, x).

Next we use that on Ω(x)

  ‖D⁻¹AH⁻¹‖ ≤ ν,  ‖D⁻¹∇θL(υ*)‖ ≤ z(x),  ‖H⁻¹∇ηL(υ*)‖ ≤ z(x),
  ‖H(η̃k − η*)‖ ≤ ‖H⁻¹∇ηL(υ*)‖ + ‖H⁻¹Aᵀ(θ̃_{k−1} − θ*)‖ + ♦(r^{(l)}_k, x),

to derive the recursive formula

  ‖D(θ̃k − θ*)‖ ≤ (1 + ν){ z(x) + ♦(r^{(l)}_k, x) } + ν ‖D(θ̃_{k−1} − θ*)‖.

Deriving the analogous formula for ‖H(η̃k − η*)‖ and solving the recursion gives the claim.

Lemma 33 Assume the same as in Theorem 7. Further assume that (17) is met with C(ν) defined in (18). Then

  Ω(x) ⊆ { υ̃_{k,k(+1)} ∈ Υ◦(r*_k) },

where, with C(ν) ≥ 0 defined in (18),

  r*_k ≤ ( C(ν) + 4ϵC(ν)⁴ z(x) ) z(x) / ( 1 − ϵC(ν)z(x) ) + ν^k ( C(ν) + 4ϵC(ν)⁴ R0 ) R0.  (33)

Proof We prove this claim via induction. On Ω(x) we have υ̃_{k,k(+1)} ∈ Υ◦(R0); set r^{(0)}_k := R0. Now with Lemma 32 we find that if Ω(x) ⊆ { υ̃_{k,k(+1)} ∈ Υ◦(r^{(l)}_k) }, then Ω(x) ⊆ { υ̃_{k,k(+1)} ∈ Υ◦(r^{(l+1)}_k) }, with

  r^{(l+1)}_k = 2√2(1 − ν)⁻¹ { z(x) + (1 + ν)ν^k R0(x) + Σ_{r=0}^{k} ν^r ♦(r^{(l)}_{k−r}, x) }.

Setting l = 1 this gives

  r^{(1)}_k ≤ 2√2(1 − ν)⁻¹ { z(x) + ♦(R0, x) + (1 + ν)ν^k R0(x) }.

We show that lim sup_{l→∞} r^{(l)}_k ≤ r*_k, which implies Ω(x) ⊆ { υ̃_{k,k(+1)} ∈ Υ◦(r*_k) }.
So we have to show that lim sup_{l→∞} r^{(l)}_k ≤ r*_k in (33). For this we estimate further

  r^{(l)}_k ≤ 2√2(1 − ν)⁻¹ { z(x) + (1 + ν)ν^k R0(x) + ϵ Σ_{r=0}^{k} ν^r ( (r^{(l−1)}_{k−r})² + z(x)² ) }
      ≤ C(ν) { z(x) + ϵz(x)² + ν^k R0 + ϵ Σ_{r=0}^{k} ν^r (r^{(l−1)}_{k−r})² },

where C(ν) > 0 is defined in (18). We set

  A^{(l)}_{s,k} := Σ_{r_1=0}^{k−1} ν^{r_1} . . . Σ_{r_s=0}^{k−r_1−...−r_{s−1}−1} ν^{r_s} ( r^{(l−1)}_{k−r_1−...−r_s} )^{2^s}

and claim that

  A^{(l)}_{s,k} ≤ 4^{Σ_{t=0}^{s−1} 2^t} C(ν)^{2^s} { (1 − ν)^{−Σ_{t=0}^{s−1} 2^t} ( z(x) + ϵz(x)² )^{2^s} + ν^k (1 − ν)^{−Σ_{t=0}^{s−1} 2^t} R0^{2^s} } + 4^{Σ_{t=0}^{s−1} 2^t} (C(ν)ϵ)^{2^s} A^{(l−1)}_{s+1,k}.  (34)

We prove this claim via induction. Clearly

  A^{(l)}_{1,k} = Σ_{r_1=0}^{k−1} ν^{r_1} ( r^{(l−1)}_{k−r_1} )²
      ≤ 4C(ν)² Σ_{r_1=0}^{k−1} ν^{r_1} { ( z(x) + ϵz(x)² )² + ν^{2(k−r_1)} R0² } + 4C(ν)²ϵ² Σ_{r_1=0}^{k−1} ν^{r_1} ( Σ_{r_2=0}^{k−r_1−1} ν^{r_2} ( r^{(l−2)}_{k−r_1−r_2} )² )²
      ≤ 4C(ν)² { (1 − ν)⁻¹ ( z(x) + ϵz(x)² )² + ν^k (1 − ν)⁻¹ R0² } + 4C(ν)²ϵ² A^{(l−1)}_{2,k}.

Furthermore

  A^{(l)}_{s,k} = Σ_{r_1=0}^{k−1} ν^{r_1} A^{(l)}_{s−1,k−r_1}.

Plugging in (34) we get for s ≥ 2

  A^{(l)}_{s,k} ≤ Σ_{r_1=0}^{k−1} ν^{r_1} [ 4^{Σ_{t=0}^{s−2} 2^t} C(ν)^{2^{s−1}} { (1 − ν)^{−Σ_{t=0}^{s−2} 2^t} ( z(x) + ϵz(x)² )^{2^{s−1}} + ν^{k−r_1} (1 − ν)^{−Σ_{t=0}^{s−2} 2^t} R0^{2^{s−1}} } + 4^{Σ_{t=0}^{s−2} 2^t} (C(ν)ϵ)^{2^{s−1}} A^{(l−1)}_{s,k−r_1} ].

Shifting the index and a direct calculation then lead to

  A^{(l)}_{s,k} ≤ 4^{Σ_{t=0}^{s−1} 2^t} C(ν)^{2^s} { (1 − ν)^{−Σ_{t=0}^{s−1} 2^t} ( z(x) + ϵz(x)² )^{2^s} + ν^k (1 − ν)^{−Σ_{t=0}^{s−1} 2^t} R0^{2^s} } + 4^{Σ_{t=0}^{s−1} 2^t} (C(ν)ϵ)^{2^s} Σ_{r_1=0}^{k−1} ν^{r_1} ( A^{(l−1)}_{s,k−r_1} )²,

which gives (34) with (35). Similarly we can prove A^{(1)}_{s,k} = (1 − ν)^{−(2^s−1)} R0^{2^s}. Define

  λ_s := 4^{2^s−1} C(ν)^{2^s},  β_s := 4^{2^s−1} (C(ν)ϵ)^{2^s},  z_s(x) := (1 − ν)^{−(2^s−1)} ( z(x) + ϵz(x)² )^{2^s},  R_s := (1 − ν)^{−(2^s−1)} R0^{2^s}.

Then

  r^{(l)}_k ≤ C(ν) { z(x) + ϵz(x)² + ν^k R0 + ϵ A^{(l)}_{1,k} } ≤ Σ_{s=0}^{l−1} β_s z_s(x) + ν^k Σ_{s=0}^{l−1} β_s R_s + β_l R_l.  (36)

We estimate further

  Σ_{s=0}^{l−1} β_s z_s(x) − C(ν)( z(x) + ϵz(x)² ) ≤ ϵ 4² C(ν)⁴ (1 − ν)⁻¹ ( z(x) + ϵz(x)² )² Σ_{s≥1} ( 4ϵC(ν)(1 − ν)⁻¹ ( z(x) + ϵz(x)² ) )^{2^s − 2}.
Assuming (17) and the definition of R0 ≥ z(x), this gives

  Σ_{s=0}^{l−1} β_s z_s(x) ≤ ( C(ν) + 4ϵC(ν)⁴ z(x) ) z(x) / ( 1 − ϵC(ν)z(x) ).

With the same argument we find under (17) that

  Σ_{s=0}^{l−1} β_s R_s ≤ ν^k ( C(ν) + 4ϵC(ν)⁴ R0 ) R0.

Additionally (17) implies

  β_l R_l ≤ ( 4ϵC(ν)(1 − ν)⁻¹ )^{2^l − 1} R0^{2^l} → 0 (l → ∞).

Plugging these bounds into (36) and letting l → ∞ gives the claim.

4.5 Result after Convergence

In the previous section we showed that

  Ω(x) ⊆ ∩_{k∈N} { υ̃_{k,k} ∈ Υ◦(r*_k), υ̃_{k,k+1} ∈ Υ◦(r*_k) } ∩ { υ̃, υ̃_{θ*} ∈ Υ◦(r0) },

where r*_k is defined in (33). The claim of Theorem 7 follows with the following lemma:

Lemma 34 Assume (ĔD1), (L̆0) and (I). Then it holds on Ω(x) for all k ∈ N that

  ‖D̆(θ̃k − θ*) − ξ̆‖ ≤ ♦̆(r_k, x),  (37)

  | 2{ max_η L(θ̃k, η) − max_η L(θ*, η) } − ‖ξ̆‖² | ≤ 5( ‖ξ̆‖ + ♦̆(r_k, x) ) ♦̆(r_k, x),  (38)

where the spread ♦̆(r, x) is defined in (21) and where r_k := r*_k ∨ r0.

Proof The proof is nearly the same as that of Theorem 2.2 of (Andresen and Spokoiny, 2014), which is inspired by the proof of Theorem 1 of (Murphy and van der Vaart, 2000). So we only sketch it and refer the reader to (Andresen and Spokoiny, 2014) for the skipped arguments. We define

  l: R^p × Υ → R,  (θ1, θ2, η) ↦ L(θ1, η + H⁻²Aᵀ(θ2 − θ1)).

Then

  ∇_{θ1} l(θ1, θ2, η) = ∇̆θL(θ1, η + H⁻²Aᵀ(θ2 − θ1)),  θ̃k = argmax_θ l(θ, θ̃k, η̃k),

such that ∇̆θL(θ̃k, η̃k) = 0. This gives

  D̆(θ̃k − θ*) − ξ̆ = D̆⁻¹{ ∇̆θL(θ̃k, η̃k) − ∇̆θL(υ*) } + D̆(θ̃k − θ*).

Now the right hand side can be bounded just as in the proof of Theorem 2.2 of (Andresen and Spokoiny, 2014). This gives (37). For (38) we can represent

  max_η L(θ̃k, η) − max_η L(θ*, η) = l(θ̃k, θ̃k, η̃_{k+1}) − l(θ*, θ*, η̃_{θ*}),

where

  η̃_{θ*} := Πη argmax_{υ∈Υ, Πθυ=θ*} L(υ).

Due to the definition of θ̃k and η̃_{k+1}

  l(θ̃k, θ*, η̃_{θ*}) − l(θ*, θ*, η̃_{θ*}) ≤ max_η L(θ̃k, η) − max_η L(θ*, η) ≤ l(θ̃k, θ̃k, η̃_{k+1}) − l(θ*, θ̃k, η̃_{k+1}).

Again the remaining steps are exactly the same as in the proof of Theorem 2.2 of (Andresen and Spokoiny, 2014).

5. Proof of Corollary 13

Proof Note that with the argument of Section 4.3, P(Ω̆(x)) ≥ 1 − 8e^{−x} − β, where with Ω(x) from (29)

  Ω̆(x) = Ω(x) ∩ { υ̃ ∈ Υ◦(r0) }.
On Ω̆(x) it holds, due to Theorem 7 and due to Theorem 2.1 of (Andresen and Spokoiny, 2014),

  ‖D̆(θ̃k − θ*) − ξ̆‖ ≤ ♦̆(r_k, x),  ‖D̆(θ̃ − θ*) − ξ̆‖ ≤ ♦(r0, x).

Now the claim follows with the triangle inequality, noting that ♦(r0, x) ≤ ♦̆(r0, x).

6. Proof of Theorem 14

We prove this theorem in a similar manner to the convergence result in Lemma 32. Redefine the set

  Ω(x) := ∩_{k=0}^{K} ( C_{k,k} ∩ C_{k,k+1} ) ∩ C(x) ∩ { L(υ̃0) − L(υ*) ≥ −K0 } ∩ { sup_{υ∈Υ◦(R0(x))} Y^{(∇²)}(υ) ≤ 9ν₂ϵ₂ z₁(x, 6p*) R0(x) } ∩ { ‖𝒟⁻¹∇²ζ(υ*)‖ ≤ z(x, ∇²ζ(υ*)) },  (39)

where

  C_{k,k(+1)} = { ‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ R0(x), ‖D(θ̃k − θ*)‖ ≤ R0(x), ‖H(η̃_{k(+1)} − η*)‖ ≤ R0(x) },

and the normalized stochastic Hessian gap is measured in spectral norm,

  Y^{(∇²)}(υ) := ‖𝒟⁻¹( ∇²ζ(υ) − ∇²ζ(υ*) )‖.

We see that on Ω(x)

  υ̃_{k,k(+1)} ∈ Υ̃◦(R0) := { ‖𝒟(υ − υ̃)‖ ≤ R0 + r0 } ∩ Υ◦(R0).

Lemma 35 Under the conditions of Theorem 14

  P(Ω(x)) ≥ 1 − 3e^{−x} − β.

Proof The proof is very similar to the one presented in Section 4.3, so we only give a sketch. By assumption

  P( ‖𝒟⁻¹∇²ζ(υ*)‖ ≤ z(x, ∇²ζ(υ*)) ) ≥ 1 − e^{−x},

and due to (ED2) with Theorem 47

  P( sup_{υ∈Υ◦(R0(x))} Y^{(∇²)}(υ) ≤ 9ν₂ϵ₂ z₁(x, 6p*) R0(x) ) ≥ 1 − e^{−x}.

Lemma 36 Assume for some sequence (r^{(l)}_k) that Ω(x) ⊆ { ‖𝒟(υ̃_{k,k(+1)} − υ̃)‖ ≤ r^{(l)}_k }. Then we get on Ω(x)

  ‖𝒟(υ̃_{k,k(+1)} − υ̃)‖ ≤ √2 Σ_{r=0}^{k} ν^r τ(r^{(l)}_{k−r}) + 2√2 ν^k (R0 + r0) =: r^{(l+1)}_k,  (40)

where

  τ(r) ≤ { ϵR0 + 9ν₂ϵ₂ ‖𝒟⁻¹‖ z₁(x, 6p*) R0 + ‖𝒟⁻¹‖ z(x, ∇²ζ(υ*)) } r.

Proof 1. We first show that on Ω(x)

  D(θ̃k − θ̃) = −D⁻¹A(η̃k − η̃) + τ(r^{(l)}_k),
  H(η̃k − η̃) = −H⁻¹Aᵀ(θ̃_{k−1} − θ̃) + τ(r^{(l)}_k).

The proof is very similar to that of Lemma 32. Define

  α(υ, υ̃) := L(υ) − L(υ̃) + ‖𝒟(υ − υ̃)‖²/2.

Then

  L(υ) − L(υ̃) = −‖𝒟(υ − υ̃)‖²/2 + α(υ, υ̃) = −‖D(θ − θ̃)‖²/2 − (θ − θ̃)ᵀA(η − η̃) − ‖H(η − η̃)‖²/2 + α(υ, υ̃).

Setting ∇θL(θ̃k, η̃k) = 0 we find

  D(θ̃k − θ̃) = −D⁻¹A(η̃k − η̃) + D⁻¹∇θα(υ̃_{k,k}, υ̃).

We want to show

  sup_{(θ,η̃k)∈Υ̃◦(r^{(l)}_k)∩Υ◦(R0)} ‖D⁻¹∇θα((θ, η̃k), υ̃)‖ ≤ τ(r^{(l)}_k),  where  ‖D⁻¹∇θα(υ, υ̃)‖ = ‖D⁻¹{ ∇θL(υ) + D²(θ − θ̃) + A(η̃k − η̃) }‖.

To see this note that by assumption we have Ω(x) ⊆ {υ̃ ∈ Υ◦(r0)} ⊆ {υ̃ ∈ Υ◦(R0)}.
By condition $(L_0)$, Lemma 27, and a Taylor expansion we have
$$\sup_{(\theta,\tilde\eta_k)\in\tilde\Upsilon_\circ(r^{(l)}_k)\cap\Upsilon_\circ(R_0)}\|\mathbb E U(\theta,\tilde\eta_k)\| \le \sup_{\upsilon\in\tilde\Upsilon_\circ(r^{(l)}_k)\cap\Upsilon_\circ(R_0)}\big\|D^{-1}\Pi_\theta\big(\nabla\mathbb EL(\upsilon) - \nabla\mathbb EL(\tilde\upsilon)\big) - D(\upsilon - \upsilon^*)\big\| \le \sup_{\upsilon\in\Upsilon_\circ(R_0)}\big\|D^{-1}\Pi_\theta D\big(D^{-1}\nabla^2\mathbb EL(\upsilon)D^{-1} - I_{p^*}\big)\big\|\, r^{(l)}_k.$$
For the remainder note that, with $\zeta = L - \mathbb EL$, on $\Omega(x)$, using Lemma 27, we can bound
$$\sup_{(\theta,\tilde\eta_k)\in\tilde\Upsilon_\circ(r^{(l)}_k)\cap\Upsilon_\circ(R_0)}\|U_\theta(\theta,\tilde\eta_k) - \mathbb EU_\theta(\theta,\tilde\eta_k)\| \le \sup_{\upsilon\in\tilde\Upsilon_\circ(r^{(l)}_k)\cap\Upsilon_\circ(R_0)}\big\|D^{-1}\big(\nabla_\theta\zeta(\upsilon) - \nabla_\theta\zeta(\tilde\upsilon)\big)\big\| \le \sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\big\|D^{-1}\nabla^2\zeta(\upsilon)D^{-1}\big\|\, r^{(l)}_k$$
$$\le \Big(\sup_{\upsilon\in\Upsilon_\circ(R_0)}\big\|D^{-1}\big(\nabla^2\zeta(\upsilon) - \nabla^2\zeta(\upsilon^*)\big)D^{-1}\big\| + \big\|D^{-1}\nabla^2\zeta(\upsilon^*)D^{-1}\big\|\Big) r^{(l)}_k \le \big(9\nu_2\epsilon^2\|D^{-1}\|z_1(x, 6p^*)R_0 + \|D^{-1}\|z(x, \nabla^2\zeta(\upsilon^*))\big)\, r^{(l)}_k.$$
Using the same argument for $\tilde\eta_k$ gives the claim. Now the claim follows as in the proof of Lemma 32.

Lemma 37  Assume that $\kappa(x, R_0) < 1 - \nu$, where
$$\kappa(x, R_0) \stackrel{\rm def}{=} \frac{2\sqrt2(1+\nu)}{1-\nu}\big(\epsilon R_0 + 9\epsilon^2\nu_2\|D^{-1}\|z_1(x, 6p^*)R_0 + \|D^{-1}\|z_2(x, \nabla^2 L(\upsilon^*))\big).$$
Then
$$\Omega(x) \subseteq \big\{\tilde\upsilon_{k,k(+1)} \in \tilde\Upsilon_\circ(\mathrm r_k)\big\},$$
where $(\mathrm r_k)_{k\in\mathbb N}$ satisfies the bound (24).

Proof  Define for all $k \in \mathbb N_0$ the sequence $r^{(0)}_k \equiv R_0$. We estimate
$$\tau\big(r^{(l)}_k\big) \le \frac{1}{1-\nu}\big(\epsilon R_0 + 6\nu_1\epsilon^2\|D^{-1}\|z_1(x, 6p^*)R_0 + \|D^{-1}\|z(x, I\!B^{(\nabla 2)})\big)\, r^{(l)}_k,$$
such that by definition
$$2\sum_{r=0}^{k-1}\nu^r\,\tau\big(r^{(l)}_{k-r}\big) \le \kappa(x, R_0)\sum_{r=0}^{k-1}\nu^r r^{(l)}_{k-r}.$$
Plugging in the recursive formula for $r^{(l)}_k$ from (40) and denoting $\tilde R_0 \stackrel{\rm def}{=} R_0 + r_0$ we find
$$r^{(l)}_k \le \kappa(x,R_0)\sum_{r=0}^{k-1}\nu^r r^{(l-1)}_{k-r} + 2\sqrt2\,\nu^k\tilde R_0 \le \kappa(x,R_0)^2\sum_{r=0}^{k-1}\nu^r\sum_{s=0}^{k-r-1}\nu^s r^{(l-2)}_{k-r-s} + 2\sqrt2\,\nu^k\tilde R_0\big(\kappa(x,R_0)k + 1\big)$$
$$\le \kappa(x,R_0)^3\sum_{r=0}^{k-1}\nu^r\sum_{s=0}^{k-r-1}\nu^s\sum_{t=0}^{k-r-s-1}\nu^t r^{(l-3)}_{k-r-s-t} + 2\sqrt2\,\nu^k\tilde R_0\big(\kappa(x,R_0)^2k^2 + \kappa(x,R_0)k + 1\big).$$
By induction this gives for $l \in \mathbb N$
$$r^{(l)}_k \le \kappa(x,R_0)^l\sum_{r_1=0}^{k-1}\nu^{r_1}\sum_{r_2=0}^{k-r_1-1}\nu^{r_2}\cdots\sum_{r_l=0}^{k-\sum_{s=1}^{l-1}r_s-1}\nu^{r_l}\,\tilde R_0 + 2\sqrt2\,\nu^k\tilde R_0\sum_{s=0}^{l-1}\big(\kappa(x,R_0)k\big)^s$$
$$\le \begin{cases}\Big(\dfrac{\kappa(x,R_0)}{1-\nu}\Big)^l\tilde R_0 + \dfrac{2\sqrt2\,\nu^k}{1-\kappa(x,R_0)k}\,\tilde R_0, & \kappa(x,R_0)k \le 1,\\[6pt] \Big(\dfrac{\kappa(x,R_0)}{1-\nu}\Big)^l\tilde R_0 + \dfrac{2\sqrt2\,\nu^k k^l}{\kappa(x,R_0)k - 1}\,\tilde R_0, & \text{otherwise}.\end{cases}$$
By Lemma 36, $\Omega(x) \subseteq \{\tilde\upsilon_{k,k(+1)} \in \tilde\Upsilon_\circ(r^{(l)}_k)\}$ for every $l$. Set, provided $\kappa(x, R_0)/(1-\nu) < 1$,
$$l(k) \stackrel{\rm def}{=}\begin{cases}\infty, & \kappa(x,R_0)k \le 1,\\[4pt] \Big\lfloor\dfrac{k\log(\nu) + \log(2\sqrt2) - \log(\kappa(x,R_0)k - 1)}{\log(1-\nu) - \log(k)}\Big\rfloor, & \text{otherwise}.\end{cases}$$
Then with $\mathrm r^*_k \stackrel{\rm def}{=} r^{(l(k))}_k$ we get
$$\Omega(x) \subseteq \big\{\tilde\upsilon_{k,k(+1)} \in \tilde\Upsilon_\circ(\mathrm r^*_k)\big\}, \qquad \mathrm r^*_k \le \begin{cases}\dfrac{2\sqrt2\,\nu^k}{1-\kappa(x,R_0)k}\,\tilde R_0, & \kappa(x,R_0)k \le 1,\\[6pt] (1-\nu)^{\frac{k}{\log(k)}L(k) - 1}\,\tilde R_0, & \text{otherwise}.\end{cases}$$
The sequence $L(k) > 0$ is defined as
$$L(k) \stackrel{\rm def}{=} \Big\lfloor\frac{\log(2\sqrt2) - \log(\kappa(x,R_0)k - 1)}{\log(1/\nu)}\Big(1 + \frac{\log(1-\nu)}{\log(k)}\Big)\Big\rfloor,$$
where $\lfloor x\rfloor$ denotes the largest natural number smaller than $x > 0$. To ensure that $L(k) > 0$ we assume that $k\log(1/\nu) - \log(2\sqrt2) > 0$. Further, as $\kappa(x, R_0) < 1-\nu$ and $L(k)$ is only relevant once $\kappa(x, R_0)k > 1$, it follows that
$$0 < 1 + \frac{\log(1-\nu)}{\log(k)} < 1, \qquad L(k)\,\frac{\log(1/\nu)}{\log(2\sqrt2) - \log(\kappa(x,R_0)k - 1)} \le 1.$$
Consequently
$$(1-\nu)^{\frac{k}{\log(k)}L(k)} \le \kappa(x,R_0)^{\frac{1}{\log(k)}\left(\log(2\sqrt2) - \log(\kappa(x,R_0)k - 1)\right)}\,\nu^{\frac{k}{\log(k)}\log\frac{1-\nu}{\kappa(x,R_0)}} \le c_k^{\,k/\log(k)}, \qquad c_k \le \frac{\kappa(x,R_0)}{1-\nu} < 1.$$
Finally note that $\tilde R_0 \le 2R_0$ and the proof is complete.

Remark 38  As pointed out in Remark 18 the above result can be improved. Redefine $\Omega(x)$ as the intersection of the two sets in (29) and (39). Then $\mathbb P(\Omega(x)) \ge 1 - 10e^{-x}$. Also redefine
$$\kappa(x, \mathrm r) \stackrel{\rm def}{=} \frac{2\sqrt2(1+\nu)}{1-\nu}\big(\epsilon\mathrm r + 3\epsilon^2\|D^{-1}\|z_1(x, 6p^*)\mathrm r + \|D^{-1}\|z_2(x, \nabla^2 L(\upsilon^*))\big).$$
By the arguments of the proof of Theorem 7 we find, with $\mathrm r^*_k$ defined in (33),
$$\Omega(x) \subseteq \bigcap_{k\in\mathbb N}\big\{\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(\mathrm r^*_k)\big\}.$$
Using this in Lemma 36 instead of $\bigcap_{k\in\mathbb N}\{\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(R_0)\}$ we can bound
$$\tau\big(r^{(l)}_k\big) \le \frac{1}{1-\nu}\big(\epsilon\mathrm r^*_k + 6\nu_1\epsilon^2\|D^{-1}\|z_1(x, 6p^*)\mathrm r^*_k + \|D^{-1}\|z(x, I\!B^{(\nabla 2)})\big)\, r^{(l)}_k.$$
Consequently, representing $\mathrm r^*_k = C\big(z(x) + \nu^k R_0\big)$ we find
$$r^{(l)}_k \le \kappa(x, Cz(x))\sum_{r=0}^{k-1}\nu^r r^{(l-1)}_{k-r} + 2\sqrt2\,\nu^k(R_0 + r_0) + C\epsilon\big(1 + \|D^{-1}\|z_1(x, 6p^*)\big)\sum_{r=0}^{k-1}\nu^k R_0\, r^{(l-1)}_{k-r}$$
$$\le \kappa(x, Cz(x))\sum_{r=0}^{k-1}\nu^r r^{(l-1)}_{k-r} + C_1\epsilon R_0 k\nu^k(R_0 + r_0), \qquad C_1 \stackrel{\rm def}{=} 2\sqrt2 + C\big(1 + \|D^{-1}\|z_1(x, 6p^*)\big).$$
With the same arguments as in the proof of Lemma 37 we infer
$$r^{(l)}_k/(R_0 + r_0) \le \begin{cases}\Big(\dfrac{\kappa(x,Cz(x))}{1-\nu}\Big)^l + \dfrac{k\nu^k C_1\epsilon R_0}{1-\kappa(x,Cz(x))k}, & \kappa(x,Cz(x))k \le 1,\\[6pt] \Big(\dfrac{\kappa(x,Cz(x))}{1-\nu}\Big)^l + \dfrac{\nu^k k^{l+1}C_1\epsilon R_0}{\kappa(x,Cz(x))k - 1}, & \text{otherwise},\end{cases}$$
and set
$$l(k) \stackrel{\rm def}{=}\begin{cases}\infty, & \kappa(x,Cz(x))k \le 1,\\[4pt] \Big\lfloor\dfrac{k\log(\nu) + \log(C_1\epsilon R_0) - \log(\kappa(x,Cz(x))k - 1) - \log(k)}{\log(1-\nu) - \log(k)}\Big\rfloor, & \text{otherwise}.\end{cases}$$
Then with $\mathrm r^*_k \stackrel{\rm def}{=} r^{(l(k))}_k$ we get, with a slight adaptation of $L(k)$,
$$\Omega(x) \subseteq \big\{\tilde\upsilon_{k,k(+1)} \in \tilde\Upsilon_\circ(\mathrm r^*_k)\big\}, \qquad \mathrm r^*_k \le \begin{cases}\dfrac{2}{1-\kappa(x,Cz(x))k}(R_0 + r_0), & \kappa(x,Cz(x))k \le 1,\\[6pt] 2\Big(\dfrac{\kappa(x,Cz(x))}{1-\nu}\Big)^{\frac{k}{\log(k)}L(k) - 1}(R_0 + r_0), & \text{otherwise}.\end{cases}$$
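The recursive bound in Lemma 37 has the generic form $r^{(l+1)} \le \kappa\, r^{(l)} + c\,\nu^k$ with contraction factor $\kappa < 1$, whose fixed point is $c\nu^k/(1-\kappa)$. A small numeric sketch (with hypothetical values for $\kappa$, $\nu$, $c$ and the initial radius) illustrates why iterating the bound improves the crude radius $R_0$ until only the geometrically decaying term is left:

```python
# Numeric sketch of a contraction-type radius recursion, as in Lemma 37:
#   r_{l+1} = kappa * r_l + c * nu**k,  with kappa < 1.
# The constants below are hypothetical; they only illustrate the mechanism.
kappa, nu, c, k = 0.3, 0.5, 1.0, 10
r = 10.0  # crude initial radius bound, playing the role of R0
for _ in range(200):
    r = kappa * r + c * nu**k
fixed_point = c * nu**k / (1 - kappa)
# r has contracted from the crude bound down to (approximately) the
# fixed point c * nu**k / (1 - kappa), which decays geometrically in k.
```

The same mechanism underlies the choice of the iteration depth $l(k)$ above: once the $(\kappa/(1-\nu))^l$ term is negligible, only the $\nu^k$-term determines $\mathrm r^*_k$.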
This gives the claim.

Acknowledgments

The first author was supported by Research Unit 1735 "Structural Inference in Statistics: Adaptation and Efficiency". The research was partly supported by the Russian Science Foundation grant (project 14-50-00150). Financial support by the German Research Foundation (DFG) through the Collaborative Research Center 649 "Economic Risk" is gratefully acknowledged.

Appendix A. Deviation Bounds for Quadratic Forms

This section is the same as Section A of Andresen and Spokoiny (2014). The following general result from Spokoiny (2012) helps to control the deviations of quadratic forms of the type $\|I\!B\xi\|^2$ for a given positive matrix $I\!B$ and a random vector $\xi$. It will be used several times in our proofs. Suppose that
$$\log\mathbb E\exp\{\gamma^\top\xi\} \le \|\gamma\|^2/2, \quad \gamma\in\mathbb R^p,\ \|\gamma\| \le g.$$
For a symmetric matrix $I\!B$, define
$$p_{I\!B} = \operatorname{tr}(I\!B^2), \qquad v_{I\!B}^2 = 2\operatorname{tr}(I\!B^4), \qquad \lambda_{I\!B} \stackrel{\rm def}{=} \|I\!B^2\| \stackrel{\rm def}{=} \lambda_{\max}(I\!B^2).$$
For ease of presentation, suppose that $g^2 \ge 2p_{I\!B}$. The other case only changes the constants in the inequalities. Note that $\|I\!B\xi\|^2 = \xi^\top I\!B^2\xi$. Define $\mu_c = 2/3$ and
$$2(x_c + 2) \stackrel{\rm def}{=} (g^2/\mu_c - p_{I\!B})/\lambda_{I\!B} + \log\det\big(I_p - \mu_c I\!B^2/\lambda_{I\!B}\big).$$

Proposition 39  Let $(ED_0)$ hold with $\nu_0 = 1$ and $g^2 \ge 2p_{I\!B}$. Then for each $x > 0$
$$\mathbb P\big(\|I\!B\xi\| \ge z(x, I\!B)\big) \le 2e^{-x},$$
where $z(x, I\!B)$ is defined by
$$z^2(I\!B, x) \stackrel{\rm def}{=}\begin{cases}p_{I\!B} + 2v_{I\!B}(x+1)^{1/2}, & x+1 \le v_{I\!B}/(18\lambda_{I\!B}),\\ p_{I\!B} + 6\lambda_{I\!B}(x+1), & v_{I\!B}/(18\lambda_{I\!B}) < x+1 \le x_c + 2,\\ \big(y_c + 2\lambda_{I\!B}(x - x_c + 1)/g_c\big)^2, & x > x_c + 1,\end{cases}$$
with $y_c^2 = p_{I\!B} + 6\lambda_{I\!B}(x_c + 2)$.

Appendix B. A Uniform Bound for the Norm of a Random Process

We want to derive for a random process $Y(\upsilon)\in\mathbb R^p$ a bound of the kind
$$\sup_{\mathrm r\le\mathrm r^*}\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\frac{1}{\epsilon}\|Y(\upsilon)\| \le 2\mathrm r^2\, C\, z_Q(x, p^*)^2.$$
This is a slightly stronger result than the one derived in Section D of Andresen and Spokoiny (2014), but the ideas employed here are very similar. We want to apply Corollary 2.5 of the supplement of Spokoiny (2012), which we cite here as a theorem. Note that we slightly generalized the formulation of the theorem to make it applicable in our setting. The proof remains the same.
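Proposition 39's quantile function $z(x, I\!B)$ is piecewise in $x$, switching from a $\sqrt{x}$ regime to a linear and finally a large-deviation regime. The following sketch evaluates the reconstructed piecewise formula for a given symmetric matrix; the constants `g_c` and `x_c` are hypothetical stand-ins for the quantities $g_c$ and $x_c$ defined in the text, and the large-deviation branch is a reconstruction from the garbled statement, so treat this as an illustration rather than the definitive formula.

```python
import numpy as np

def z_squared(x, B, g_c=4.0, x_c=3.0):
    """Sketch of the piecewise quantile z^2(B, x) of Proposition 39.

    B is a symmetric matrix; g_c and x_c are hypothetical constants
    standing in for the quantities g and x_c defined in the text.
    """
    B2 = B @ B
    p_B = np.trace(B2)                      # p_B  = tr(B^2)
    v_B = np.sqrt(2.0 * np.trace(B2 @ B2))  # v_B^2 = 2 tr(B^4)
    lam = np.linalg.eigvalsh(B2).max()      # largest eigenvalue of B^2
    if x + 1 <= v_B / (18.0 * lam):
        return p_B + 2.0 * v_B * np.sqrt(x + 1)       # sqrt(x) zone
    if x + 1 <= x_c + 2:
        return p_B + 6.0 * lam * (x + 1)              # linear zone
    y_c = np.sqrt(p_B + 6.0 * lam * (x_c + 2))        # continuity anchor y_c
    return (y_c + 2.0 * lam * (x - x_c + 1) / g_c) ** 2  # large deviations

B = np.diag([1.0, 0.5, 0.25])
vals = [z_squared(x, B) for x in (0.5, 2.0, 10.0)]
```

As expected of a deviation quantile, the sketch is increasing in $x$, and $z^2(I\!B, x) \ge p_{I\!B} = \mathbb E\|I\!B\xi\|^2$ for a standard Gaussian $\xi$.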
Theorem 40  Let $(\mathcal U(\mathrm r))_{0\le\mathrm r\le\mathrm r^*} \subset \mathbb R^{p^*}$ be a sequence of balls around $\upsilon^*$ induced by the metric $d(\cdot,\cdot)$. Let a random real-valued process $\mathcal U(\mathrm r, \upsilon)$ fulfill for any $0 \le \mathrm r \le \mathrm r^*$ that $\mathcal U(\mathrm r, \upsilon^*) = 0$ and

$(Ed)$  For any $\upsilon, \upsilon' \in \mathcal U(\mathrm r)$
$$\log\mathbb E\exp\Big\{\lambda\,\frac{\mathcal U(\mathrm r, \upsilon) - \mathcal U(\mathrm r, \upsilon')}{d(\upsilon, \upsilon')}\Big\} \le \frac{\nu_0^2\lambda^2}{2}, \quad |\lambda| \le g. \tag{41}$$

Finally assume that $\sup_{\upsilon\in\mathcal U(\mathrm r)}\mathcal U(\mathrm r, \upsilon)$ increases in $\mathrm r$. Then with probability greater than $1 - e^{-x}$
$$\frac{1}{3\nu_1}\,\mathcal U(\mathrm r, \upsilon) \le d(\upsilon, \upsilon^*)^2\, z_Q(x, \mathbb Q)^2,$$
where $\mathbb Q \stackrel{\rm def}{=} \mathbb Q(\mathcal U(\mathrm r^*))$ denotes the entropy of the set $\mathcal U(\mathrm r^*) \subset \mathbb R^{p^*}$ and where, with $g_0 = \nu_0 g$ and for some $\mathbb Q > 0$,
$$z_Q(x, \mathbb Q)^2 \stackrel{\rm def}{=}\begin{cases}\big(1 + \sqrt{x + \mathbb Q}\big)^2 & \text{if } 1 + \sqrt{x + \mathbb Q} \le g_0,\\ 1 + \big\{2g_0^{-1}(x + \mathbb Q) + g_0\big\}^2 & \text{otherwise}.\end{cases} \tag{42}$$

To use this result let $Y(\upsilon)$ be a smooth centered random vector process with values in $\mathbb R^p$ and let $D: \mathbb R^{p^*} \to \mathbb R^{p^*}$ be some linear operator. We aim at bounding the maximum of the norm $\|Y(\upsilon)\|$ over a vicinity $\Upsilon_\circ(\mathrm r) \stackrel{\rm def}{=} \{\|D(\upsilon - \upsilon^*)\| \le \mathrm r\}$ of $\upsilon^*$. Suppose that $Y(\upsilon)$ satisfies for each $0 < \mathrm r < \mathrm r^*$ and for all pairs $\upsilon, \upsilon' \in \Upsilon_\circ(\mathrm r) = \{\upsilon \in \Upsilon: \|D(\upsilon - \upsilon')\| \le \mathrm r\} \subset \mathbb R^{p^*}$
$$\sup_{\|u\|\le1}\log\mathbb E\exp\Big\{\lambda\,\frac{u^\top\big(Y(\upsilon) - Y(\upsilon')\big)}{\epsilon\|D(\upsilon - \upsilon')\|}\Big\} \le \frac{\nu_0^2\lambda^2}{2}, \quad |\lambda| \le g. \tag{43}$$

Remark 41  In the setting of Theorem 7 we have $Y(\upsilon) = D^{-1}\big(\nabla\zeta(\upsilon) - \nabla\zeta(\upsilon^*)\big)$, and condition (43) becomes $(ED_1)$ from Section 2.1.

Theorem 42  Let a random $p$-vector process $Y(\upsilon)$ fulfill $Y(\upsilon^*) = 0$ and let condition (43) be satisfied. Then for each $0 \le \mathrm r \le \mathrm r^*$, on a set of probability greater than $1 - e^{-x}$,
$$\sup_{\mathrm r\le\mathrm r^*}\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\frac{1}{6\epsilon\nu_1}\|Y(\upsilon)\| \le 2\mathrm r^2\, z_Q(x, 2p^* + 2p)^2,$$
with $g_0 = \nu_0 g$.

Remark 43  Note that the entropy of the original set $\Upsilon_\circ(\mathrm r) \subset \mathbb R^{p^*}$ is equal to $2p^*$. So in order to control the norm $\|Y(\upsilon)\|$ one only pays with the additional summand $2p$.

Proof  In what follows, we use the representation
$$\|Y(\upsilon)\| = \epsilon\sup_{\|u\|\le\|D(\upsilon - \upsilon^*)\|}\frac{1}{\epsilon\|D(\upsilon - \upsilon^*)\|}\,u^\top Y(\upsilon).$$
This implies
$$\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\|Y(\upsilon)\| = \epsilon\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\sup_{\|u\|\le\|D(\upsilon - \upsilon^*)\|}\frac{1}{\epsilon\|D(\upsilon - \upsilon^*)\|}\,u^\top Y(\upsilon).$$
Due to Lemma 44 the process
$$\mathcal U(\mathrm r, \upsilon, u) \stackrel{\rm def}{=} \frac{1}{\epsilon\|D(\upsilon - \upsilon^*)\|}\,u^\top Y(\upsilon)$$
satisfies condition $(Ed)$ (see (41)) as a process on $\mathcal U(\mathrm r^*)$, where
$$\mathcal U(\mathrm r) \stackrel{\rm def}{=} \Upsilon_\circ(\mathrm r) \times B_{\mathrm r}(0). \tag{44}$$
Further, $\sup_{(\upsilon,u)\in\mathcal U(\mathrm r)}\mathcal U(\mathrm r, \upsilon, u)$ is increasing in $\mathrm r$. This allows to apply Theorem 40 to obtain the desired result. Set $d\big((\upsilon, u), (\upsilon', u')\big)^2 = \|D(\upsilon - \upsilon')\|^2 + \|u - u'\|^2$.
We get on a set of probability greater than $1 - e^{-x}$
$$\sup_{(\upsilon,u)\in\mathcal U(\mathrm r^*)}\frac{1}{6\epsilon\nu_1}\,\frac{\frac{1}{\|D(\upsilon - \upsilon^*)\|}u^\top Y(\upsilon)}{\|D(\upsilon - \upsilon^*)\|^2 + \|u\|^2} \le z_Q\big(x, \mathbb Q(\mathcal U(\mathrm r^*))\big)^2.$$
The constant $\mathbb Q(\mathcal U(\mathrm r^*)) > 0$ quantifies the complexity of the set $\mathcal U(\mathrm r^*) \subset \mathbb R^{p^*}\times\mathbb R^p$. We point out that for compact $M \subset \mathbb R^p$ we have $\mathbb Q(M) = 2p$ (see the supplement of Spokoiny (2012), Lemma 2.10). This gives $\mathbb Q(\mathcal U) = 2p^* + 2p$. Finally observe that
$$\sup_{\mathrm r\le\mathrm r^*}\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\frac{1}{6\epsilon\nu_1}\frac{\|Y(\upsilon)\|}{2\mathrm r^2} \le \sup_{\mathrm r\le\mathrm r^*}\sup_{(\upsilon,u)\in\mathcal U(\mathrm r)}\frac{1}{6\epsilon\nu_1}\,\frac{\frac{1}{\|D(\upsilon - \upsilon^*)\|}u^\top Y(\upsilon)}{\|D(\upsilon - \upsilon^*)\|^2 + \|u\|^2} = \sup_{(\upsilon,u)\in\mathcal U(\mathrm r^*)}\frac{1}{6\epsilon\nu_1}\,\frac{\frac{1}{\|D(\upsilon - \upsilon^*)\|}u^\top Y(\upsilon)}{\|D(\upsilon - \upsilon^*)\|^2 + \|u\|^2}.$$

Lemma 44  Suppose that $Y(\upsilon)$ satisfies for each $\|u\| \le 1$ and $|\lambda| \le g$ the inequality (43). Then the process
$$\mathcal U(\upsilon, u) = \frac{1}{2\epsilon\|D(\upsilon - \upsilon^*)\|}\,u^\top Y(\upsilon)$$
satisfies $(Ed)$ from (41) with $|\lambda| \le g/2$, $d\big((\upsilon, u), (\upsilon', u')\big)^2 = \|D(\upsilon - \upsilon')\|^2 + \|u - u'\|^2$, $\nu = 2\nu_0$ and $\mathcal U \subset \mathbb R^{p^* + p}$ defined in (44), i.e. for any $(\upsilon, u_1), (\upsilon', u_2) \in \mathcal U$
$$\log\mathbb E\exp\Big\{\lambda\,\frac{\mathcal U(\upsilon, u_1) - \mathcal U(\upsilon', u_2)}{d\big((\upsilon, u_1), (\upsilon', u_2)\big)}\Big\} \le \frac{(2\nu_0)^2\lambda^2}{2}, \quad |\lambda| \le g/2.$$

Proof  Let $(\upsilon, u_1), (\upsilon', u_2) \in \mathcal U$ and w.l.o.g. $\|u_1\| \le \|D(\upsilon - \upsilon^*)\| \wedge \|D(\upsilon' - \upsilon^*)\|$. By the Hölder inequality and (43), we find
$$\log\mathbb E\exp\Big\{\lambda\,\frac{\mathcal U(\upsilon, u_1) - \mathcal U(\upsilon', u_2)}{d\big((\upsilon, u_1), (\upsilon', u_2)\big)}\Big\} = \log\mathbb E\exp\Big\{\lambda\,\frac{\big[\mathcal U(\upsilon, u_1) - \mathcal U(\upsilon', u_1)\big] + \big[\mathcal U(\upsilon', u_1) - \mathcal U(\upsilon', u_2)\big]}{d\big((\upsilon, u_1), (\upsilon', u_2)\big)}\Big\}$$
$$\le \frac12\log\mathbb E\exp\Big\{2\lambda\, u_1^\top\Big(\frac{Y(\upsilon)}{\epsilon\|D(\upsilon - \upsilon^*)\|} - \frac{Y(\upsilon')}{\epsilon\|D(\upsilon' - \upsilon^*)\|}\Big)\Big\} + \frac12\log\mathbb E\exp\Big\{2\lambda\,\frac{(u_1 - u_2)^\top Y(\upsilon')}{\epsilon\|u_1 - u_2\|\,\|D(\upsilon' - \upsilon^*)\|}\Big\}$$
$$\le \frac12\log\mathbb E\exp\Big\{2\lambda\,\frac{u^\top\big(Y(\upsilon) - Y(\upsilon')\big)}{\epsilon\|D(\upsilon - \upsilon')\|}\Big\} + \frac12\log\mathbb E\exp\Big\{2\lambda\,\frac{u^\top\big(Y(\upsilon') - Y(\upsilon^*)\big)}{\epsilon\|D(\upsilon' - \upsilon^*)\|}\Big\} \le \frac{(2\nu_0)^2\lambda^2}{2}.$$

Appendix C. A Bound for the Spectral Norm of a Random Matrix Process

We want to derive for a random process $Y(\upsilon) \in \mathbb R^{p^*\times p^*}$ a bound of the kind
$$\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\|Y(\upsilon)\| \le C\epsilon^2 z_1(x, p^*)\,\mathrm r.$$
We derive such a bound in a very similar manner to Theorem E.1 of Andresen and Spokoiny (2014). We want to apply Corollary 2.2 of the supplement of Spokoiny (2012). Again we slightly generalized the formulation but the proof remains the same.

Corollary 45  Let $(\mathcal U(\mathrm r))_{0\le\mathrm r\le\mathrm r^*} \subset \mathbb R^{p^*}$ be a sequence of balls around $\upsilon^*$ induced by the metric $d(\cdot,\cdot)$. Let a random real-valued process $\mathcal U(\upsilon)$ fulfill $\mathcal U(\upsilon^*) = 0$ and

$(Ed)$  For any $\upsilon, \upsilon' \in \mathcal U(\mathrm r)$
$$\log\mathbb E\exp\Big\{\lambda\,\frac{\mathcal U(\upsilon) - \mathcal U(\upsilon')}{d(\upsilon, \upsilon')}\Big\} \le \frac{\nu_0^2\lambda^2}{2}, \quad |\lambda| \le g. \tag{45}$$
Then for each $0 \le \mathrm r \le \mathrm r^*$, on a set of probability greater than $1 - e^{-x}$,
$$\sup_{\upsilon\in\mathcal U(\mathrm r)}\mathcal U(\upsilon) \le 3\nu_1 z_1(x, \mathbb Q)\, d(\upsilon, \upsilon^*),$$
where $\mathbb Q \stackrel{\rm def}{=} \mathbb Q(\mathcal U(\mathrm r^*))$ denotes the entropy of the set $\mathcal U(\mathrm r^*) \subset \mathbb R^{p^*}$ and where, with $g_0 = \nu_0 g$ and for some $\mathbb Q > 0$,
$$z_1(x, \mathbb Q) \stackrel{\rm def}{=}\begin{cases}\sqrt{2(x + \mathbb Q)} & \text{if } \sqrt{2(x + \mathbb Q)} \le g_0,\\ g_0^{-1}(x + \mathbb Q) + g_0/2 & \text{otherwise}.\end{cases} \tag{46}$$

To use this result let $Y(\upsilon)$ be a smooth centered random process with values in $\mathbb R^{p^*\times p^*}$ and let $D: \mathbb R^{p^*} \to \mathbb R^{p^*}$ be some linear operator. We aim at bounding the maximum of the spectral norm $\|Y(\upsilon)\|$ over a vicinity $\Upsilon_\circ(\mathrm r) \stackrel{\rm def}{=} \{\|\upsilon - \upsilon^*\|_Y \le \mathrm r\}$ of $\upsilon^*$. Suppose that $Y(\upsilon)$ satisfies $Y(\upsilon^*) = 0$ and for each $0 < \mathrm r < \mathrm r^*$ and for all pairs $\upsilon, \upsilon' \in \Upsilon_\circ(\mathrm r) = \{\upsilon \in \Upsilon: \|\upsilon - \upsilon'\|_Y \le \mathrm r\} \subset \mathbb R^{p^*}$
$$\sup_{\|u_1\|\le1}\sup_{\|u_2\|\le1}\log\mathbb E\exp\Big\{\lambda\,\frac{u_1^\top\big(Y(\upsilon) - Y(\upsilon')\big)u_2}{\epsilon^2\|D(\upsilon - \upsilon')\|}\Big\} \le \frac{\nu_0^2\lambda^2}{2}, \quad |\lambda| \le g. \tag{47}$$

Remark 46  In the setting of Theorem 14 we have $\|\upsilon - \upsilon^*\|_Y = \|D(\upsilon - \upsilon^*)\|$ and
$$Y(\upsilon) = D^{-1}\nabla^2\zeta(\upsilon) - D^{-1}\nabla^2\zeta(\upsilon^*),$$
and condition (47) becomes $(ED_2)$ from Section 2.1.

Theorem 47  Let a random process $Y(\upsilon) \in \mathbb R^{p^*\times p^*}$ fulfill $Y(\upsilon^*) = 0$ and let condition (47) be satisfied. Then for each $0 \le \mathrm r \le \mathrm r^*$, on a set of probability greater than $1 - e^{-x}$,
$$\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\|Y(\upsilon)\| \le 9\epsilon^2\nu_2 z_1(x, 6p^*)\,\mathrm r,$$
with $g_0 = \nu_0 g$.

Remark 48  Note that the entropy of the original set $\Upsilon_\circ(\mathrm r) \subset \mathbb R^{p^*}$ is multiplied by 3. So in order to control the spectral norm $\|Y(\upsilon)\|$ one only pays with this factor.

Proof  In what follows, we use the representation
$$\|Y(\upsilon)\| = \epsilon^2\sup_{\|u_1\|\le\mathrm r}\sup_{\|u_2\|\le\mathrm r}\frac{1}{\epsilon^2\mathrm r^2}\,u_1^\top Y(\upsilon)u_2.$$
This implies
$$\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\|Y(\upsilon)\| = \epsilon^2\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\sup_{\|u_1\|\le\mathrm r}\sup_{\|u_2\|\le\mathrm r}\frac{1}{\epsilon^2\mathrm r^2}\,u_1^\top Y(\upsilon)u_2.$$
Due to Lemma 49 the process
$$\mathcal U(\upsilon, u_1, u_2) \stackrel{\rm def}{=} \frac{1}{\epsilon^2\mathrm r^2}\,u_1^\top Y(\upsilon)u_2$$
satisfies condition $(Ed)$ (see (45)) as a process on
$$\mathcal U(\mathrm r) \stackrel{\rm def}{=} \Upsilon_\circ(\mathrm r) \times B_{\mathrm r}(0) \times B_{\mathrm r}(0) \subset \mathbb R^{3p^*}. \tag{48}$$
This allows to apply Corollary 45 to obtain the desired result. We get on a set of probability greater than $1 - e^{-x}$
$$\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\|Y(\upsilon)\| \le \sup_{(\upsilon, u_1, u_2)\in\mathcal U(\mathrm r)}\frac{\epsilon^2}{\mathrm r^2}\,u_1^\top Y(\upsilon)u_2 \le 9\epsilon^2\nu_2 z_1\big(x, \mathbb Q(\mathcal U(\mathrm r^*))\big)\,\mathrm r.$$
The constant $\mathbb Q(\mathcal U(\mathrm r)) > 0$ quantifies the complexity of the set $\mathcal U(\mathrm r) \subset \mathbb R^{3p^*}$. We point out that for compact $M \subset \mathbb R^{3p^*}$ we have $\mathbb Q(M) = 6p^*$ (see the supplement of Spokoiny (2012), Lemma 2.10). This gives the claim.
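The threshold form of $z_1(x, \mathbb Q)$ in (46) is continuous at the switch point: when $\sqrt{2(x + \mathbb Q)} = g_0$ we have $x + \mathbb Q = g_0^2/2$, so the second branch gives $g_0^{-1}(x + \mathbb Q) + g_0/2 = g_0/2 + g_0/2 = g_0$, matching the first. The short check below assumes the square roots lost in extraction belong in the first branch and in the condition, so treat it as a sketch of the intended formula rather than the definitive one.

```python
import math

def z1(x, Q, g0):
    # Sketch of the piecewise quantile of (46); the square roots are
    # reconstructed from context, so this illustrates the intended shape.
    s = math.sqrt(2.0 * (x + Q))
    if s <= g0:
        return s                      # sub-Gaussian zone
    return (x + Q) / g0 + g0 / 2.0    # sub-exponential zone

# At the switch point 2 * (x + Q) == g0**2 both branches coincide:
g0 = 3.0
xQ = g0**2 / 2.0              # value of x + Q at the threshold
left = math.sqrt(2.0 * xQ)    # first branch
right = xQ / g0 + g0 / 2.0    # second branch
```

This continuity is what makes the two deviation regimes glue together into a single quantile function.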
Lemma 49  Suppose that $Y(\upsilon) \in \mathbb R^{p^*\times p^*}$ satisfies $Y(\upsilon^*) = 0$ and for each $\|u_1\| \le 1$, $\|u_2\| \le 1$ and $|\lambda| \le g$ the inequality (47). Then the process
$$\mathcal U(\upsilon, u_1, u_2) = \frac{1}{2\epsilon^2\mathrm r^2}\,u_1^\top Y(\upsilon)u_2$$
satisfies $(Ed)$ from (45) with $\mathcal U \subset \mathbb R^{3p^*}$ defined in (48), with $|\lambda| \le g/3$ and with
$$d\big((\upsilon, u_1, u_2), (\upsilon', u_1', u_2')\big)^2 = \|D(\upsilon - \upsilon')\|^2 + \|u_1 - u_1'\|^2 + \|u_2 - u_2'\|^2,$$
i.e. for any $(\upsilon, u_1, u_2), (\upsilon', u_1', u_2') \in \mathcal U$
$$\log\mathbb E\exp\Big\{\lambda\,\frac{\mathcal U(\upsilon, u_1, u_2) - \mathcal U(\upsilon', u_1', u_2')}{d\big((\upsilon, u_1, u_2), (\upsilon', u_1', u_2')\big)}\Big\} \le \frac{(3\nu_0)^2\lambda^2}{2}, \quad |\lambda| \le g/3.$$

Proof  Let $(\upsilon, u_1, u_2), (\upsilon', u_1', u_2') \in \mathcal U$. By the Hölder inequality and (47), we find
$$\log\mathbb E\exp\Big\{\lambda\,\frac{\mathcal U(\upsilon, u_1, u_2) - \mathcal U(\upsilon', u_1', u_2')}{d\big((\upsilon, u_1, u_2), (\upsilon', u_1', u_2')\big)}\Big\} = \log\mathbb E\exp\Big\{\lambda\,\frac{\big[\mathcal U(\upsilon, u_1, u_2) - \mathcal U(\upsilon', u_1, u_2)\big] + \big[\mathcal U(\upsilon', u_1, u_2) - \mathcal U(\upsilon', u_1', u_2)\big] + \big[\mathcal U(\upsilon', u_1', u_2) - \mathcal U(\upsilon', u_1', u_2')\big]}{d\big((\upsilon, u_1, u_2), (\upsilon', u_1', u_2')\big)}\Big\}$$
$$\le \frac13\log\mathbb E\exp\Big\{3\lambda\,\frac{u_1^\top\big(Y(\upsilon) - Y(\upsilon')\big)u_2}{\epsilon^2\mathrm r^2\,\|D(\upsilon - \upsilon')\|}\Big\} + \frac13\log\mathbb E\exp\Big\{3\lambda\,\frac{(u_1 - u_1')^\top Y(\upsilon')u_2}{\epsilon^2\|u_1 - u_1'\|\,\mathrm r^2}\Big\} + \frac13\log\mathbb E\exp\Big\{3\lambda\,\frac{u_1'^\top Y(\upsilon')(u_2 - u_2')}{\epsilon^2\|u_2 - u_2'\|\,\mathrm r^2}\Big\}$$
$$\le \frac13\sup_{\|u_1\|\le1}\sup_{\|u_2\|\le1}\log\mathbb E\exp\Big\{3\lambda\,\frac{u_1^\top\big(Y(\upsilon) - Y(\upsilon')\big)u_2}{\epsilon^2\|D(\upsilon - \upsilon')\|}\Big\} + \frac23\sup_{\|u_1\|\le1}\sup_{\|u_2\|\le1}\log\mathbb E\exp\Big\{3\lambda\,\frac{u_1^\top\big(Y(\upsilon') - Y(\upsilon^*)\big)u_2}{\epsilon^2\|D(\upsilon' - \upsilon^*)\|}\Big\} \le \frac{(3\nu_0)^2\lambda^2}{2}.$$

References

A. Andresen. Finite sample analysis of profile M-estimation in the single index model. Electronic Journal of Statistics, 9(2):2528–2641, 2015.

A. Andresen and V. Spokoiny. Critical dimension in profile semiparametric estimation. Electronic Journal of Statistics, 8(2):3077–3125, 2014. ISSN 1935-7524. doi: 10.1214/14-EJS982.

S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv:1408.2156, 2014.

X. Chen. Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6:5549–5632, 2007.

G. Cheng. How many iterations are sufficient for efficient semiparametric estimation? Scandinavian Journal of Statistics, 40(3):592–618, 2013.

M. Delecroix, W. Haerdle, and M. Hristache. Efficient estimation in single-index regression. Technical report, SFB 373, Humboldt University Berlin, 1997.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

U. Grenander. Abstract Inference. Wiley, New York, 1981.

W. Haerdle, P. Hall, and H. Ichimura. Optimal smoothing in single-index models. Annals of Statistics, 21:157–178, 1993.

I. A. Ibragimov and R. Z. Khas'minskij. Statistical Estimation. Asymptotic Theory. Translated from the Russian by Samuel Kotz. Springer-Verlag, New York-Heidelberg-Berlin, 1981.

P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. STOC, pages 665–674, 2013.

R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.

M. R. Kosorok. Introduction to Empirical Processes and Semiparametric Inference. Springer Series in Statistics, 2005.

G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, New York, 1997.

S. A. Murphy and A. W. van der Vaart. On profile likelihood. Journal of the American Statistical Association, 95(450):449–465, 2000.

P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. NIPS, pages 2796–2804, 2013.

V. Spokoiny. Parametric estimation. Finite sample theory. Annals of Statistics, 40(6):2877–2909, 2012.

J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12:389–434, 2012.

C. F. J. Wu. On the convergence properties of the EM algorithm. Annals of Statistics, 11:95–103, 1983.

X. Yi, C. Caramanis, and S. Sanghavi. Alternating minimization for mixed linear regression. arXiv:1310.3745, 2013.