Journal of Machine Learning Research 17 (2016) 1-53. Submitted 7/15; Revised 10/15; Published 4/16.

Convergence of an Alternating Maximization Procedure

Andreas Andresen (andreas.andresen@posteo.de)
Weierstrass Institute, Mohrenstr. 39, 10117 Berlin, Germany

Vladimir Spokoiny (spokoiny@wias-berlin.de)
Weierstrass Institute and Humboldt University Berlin; Higher School of Economics, IITP RAN, MIPT Moscow

Editor: Jie Peng

Abstract

We derive two convergence results for a sequential alternating maximization procedure that approximates the maximizer of random functionals such as the realized log-likelihood in maximum likelihood estimation. We show that the sequence attains the same deviation properties as shown for the profile M-estimator by Andresen and Spokoiny (2013), that is, a finite sample Wilks and Fisher theorem. Further, under slightly stronger smoothness constraints on the random functional, we show nearly linear convergence to the global maximizer if the starting point of the procedure is well chosen.

Keywords: alternating maximization, alternating minimization, profile maximum likelihood, EM algorithm, M-estimation, local linear approximation, local concentration, semiparametric

© 2016 Andreas Andresen and Vladimir Spokoiny.

Contents

1 Introduction
  1.1 Relation to the EM Algorithm
  1.2 Linear Series Estimators
  1.3 Finite Sample Wilks and Fisher Theorems
2 Main Results
  2.1 Conditions
    2.1.1 Smoothness
    2.1.2 Complexity
    2.1.3 Moments
    2.1.4 Conditions for the Full Model
    2.1.5 Quadratic Drift Beats Linear Fluctuation
  2.2 Dependence on Initial Guess
  2.3 Introduction of Important Objects
  2.4 Statistical Properties of the Alternating Sequence
  2.5 Convergence to the ME
  2.6 Critical Dimension
3 Application to Single Index Model
4 Proof of Theorem 7
  4.1 Idea of the Proof
  4.2 A Desirable Set
  4.3 Probability of Desirable Set
  4.4 Proof of Convergence
  4.5 Result after Convergence
5 Proof of Corollary 13
6 Proof of Theorem 14
A Deviation Bounds for Quadratic Forms
B A Uniform Bound for the Norm of a Random Process
C A Bound for the Spectral Norm of a Random Matrix Process

1. Introduction

This paper presents two convergence results for an alternating maximization procedure to approximate M-estimators. Let $Y \in \mathcal{Y}$ denote some observed random data and let $\mathbb{P}$ denote the data distribution. In the semiparametric profile M-estimation framework the target of analysis is
$$\theta^* = \Pi_\theta \upsilon^* = \Pi_\theta \operatorname{argmax}_{\upsilon \in \Upsilon} \mathbb{E}_{\mathbb{P}}\mathcal{L}(\upsilon, Y), \tag{1}$$
where $\mathcal{L}: \Upsilon \times \mathcal{Y} \to \mathbb{R}$ is an appropriate functional, $\Pi_\theta: \Upsilon \to \mathbb{R}^p$ is a projection and $\Upsilon$ is some high-dimensional or even infinite-dimensional parameter space. A prominent way of estimating $\theta^*$ is the profile M-estimator (pME)
$$\tilde\theta \stackrel{\text{def}}{=} \Pi_\theta\tilde\upsilon \stackrel{\text{def}}{=} \Pi_\theta \operatorname{argmax}_{(\theta,\eta)} \mathcal{L}(\theta,\eta) = \operatorname{argmax}_{\theta}\,\max_{\eta}\,\mathcal{L}(\theta,\eta). \tag{2}$$
This paper focuses on finite-dimensional parameter spaces $\Upsilon \subseteq \mathbb{R}^{p^*}$ with $p^* = p + m \in \mathbb{N}$ the full dimension, as infinite-dimensional maximization problems are computationally not feasible. This is motivated by the sieve M-estimation technique, which projects the estimation problem onto a finite-dimensional submodel; see Section 1.2 for details. The alternating maximization procedure is used in situations where a direct computation of the full maximum estimator (ME) $\tilde\upsilon \in \mathbb{R}^{p^*}$ is not feasible or simply very difficult to implement. Consider for example the task of calculating the pME with scalar random observations $\mathbf{Y} = (y_i)_{i=1}^n \subset \mathbb{R}$, parameter $\upsilon = (\theta,\eta) \in \mathbb{R}^p \times \mathbb{R}^m$ and a function basis $(e_k) \subset L^2(\mathbb{R})$:
$$\mathcal{L}(\theta,\eta) = -\frac{1}{2}\sum_{i=1}^{n}\Big(y_i - \sum_{k=1}^{m}\eta_k\, e_k(X_i^\top\theta)\Big)^2.$$
In this case the maximization problem is high dimensional and non-convex (see Section 3 for more details). But for fixed $\theta \in S_1 \subset \mathbb{R}^p$ the maximization with respect to $\eta \in \mathbb{R}^m$ is rather simple, while for fixed $\eta \in \mathbb{R}^m$ the maximization with respect to $\theta \in \mathbb{R}^p$ can be feasible for low $p \in \mathbb{N}$. This motivates the following iterative procedure. Given some (data dependent) functional $\mathcal{L}: \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}$ and an initial guess $\tilde\upsilon_0 \in \mathbb{R}^{p+m}$, set for $k \in \mathbb{N}$
$$\tilde\upsilon_{k,k+1} \stackrel{\text{def}}{=} (\tilde\theta_k, \tilde\eta_{k+1}) = \Big(\tilde\theta_k,\ \operatorname{argmax}_{\eta\in\mathbb{R}^m}\mathcal{L}(\tilde\theta_k,\eta)\Big),$$
$$\tilde\upsilon_{k,k} \stackrel{\text{def}}{=} (\tilde\theta_k, \tilde\eta_k) = \Big(\operatorname{argmax}_{\theta\in\mathbb{R}^p}\mathcal{L}(\theta,\tilde\eta_k),\ \tilde\eta_k\Big). \tag{3}$$
The so-called alternating maximization (or minimization) procedure is a widely applied algorithm in many parameter estimation tasks (see Jain, Netrapalli, and Sanghavi, 2013; Netrapalli, Jain, and Sanghavi, 2013; Keshavan, Montanari, and Oh, 2010; Yi, Caramanis, and Sanghavi, 2013). Some natural questions arise: Does the sequence $(\tilde\theta_k)$ converge to a limit that satisfies the same statistical properties as the profile estimator? If the answer is yes, after how many steps does the sequence acquire these properties? And under what circumstances does the sequence actually converge to the global maximizer $\tilde\upsilon$?
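The alternating updates just described are easiest to see in action on a toy example. The following sketch (a hypothetical concave quadratic functional with $p = m = 1$, not the paper's model) iterates the two exact partial maximizations and compares the result with the global maximizer; the contraction factor of the $\theta$-iteration here is $a^2/(dh) < 1$, playing the role of the quantity $\nu$ introduced later in condition (I).

```python
# Toy concave functional L(theta, eta) = -(d*theta^2 + 2*a*theta*eta + h*eta^2)/2
#                                        + g1*theta + g2*eta.
# Both partial maximizations have closed forms, so the alternating
# procedure can be iterated exactly.  (Hypothetical example values.)
d, a, h = 2.0, 0.8, 1.5
g1, g2 = 1.0, -0.5

def argmax_eta(theta):
    # dL/d_eta = -a*theta - h*eta + g2 = 0
    return (g2 - a * theta) / h

def argmax_theta(eta):
    # dL/d_theta = -d*theta - a*eta + g1 = 0
    return (g1 - a * eta) / d

# Global maximizer from the 2x2 linear system, for comparison.
det = d * h - a * a
theta_star = (h * g1 - a * g2) / det
eta_star = (d * g2 - a * g1) / det

theta, eta = 5.0, 5.0          # crude initial guess
for k in range(50):
    eta = argmax_eta(theta)    # update the nuisance given theta
    theta = argmax_theta(eta)  # update the target given the nuisance

# The theta-iteration contracts with factor nu = a^2/(d*h) < 1.
nu = a * a / (d * h)
print(nu, abs(theta - theta_star))
```

With these values the contraction factor is about 0.21, so 50 iterations bring the iterate numerically onto the global maximizer.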
This problem is hard because the behavior of each step of the sequence is determined by the actual finite sample realization of the functional $\mathcal{L}(\cdot, Y)$. To the authors' knowledge no general convergence result is available that answers the questions above, except for treatments of specific models (see Jain, Netrapalli, and Sanghavi, 2013; Netrapalli, Jain, and Sanghavi, 2013; Keshavan, Montanari, and Oh, 2010; Yi, Caramanis, and Sanghavi, 2013) or of variants of the procedure (Cheng, 2013). We address this difficulty by employing new finite sample techniques (Andresen and Spokoiny, 2014; Spokoiny, 2012), which allow us to answer the above questions: with growing iteration number $k \in \mathbb{N}$ the estimators $\tilde\theta_k$ attain the same statistical properties as the profile M-estimator, and Theorem 7 provides a choice of the necessary number of steps $K \in \mathbb{N}$. Under slightly stronger conditions on the structure of the model we can give a convergence result to the global maximizer that does not rely on unimodality. Further, we can address the important question under which ratio of full dimension $p^* = p + m \in \mathbb{N}$ to sample size $n \in \mathbb{N}$ the sequence behaves as desired. For instance, for smooth $\mathcal{L}$ our results become sharp if $p^*/\sqrt{n}$ is small, and convergence to the full maximizer already occurs if $p^*/n$ is small.

The alternating maximization procedure can be understood as a special case of the Expectation Maximization algorithm (EM algorithm), as we illustrate in Section 1.1. The EM algorithm itself was derived in a work of Dempster, Laird, and Rubin (1977), where particular versions of this approach are generalized. This paper (Dempster, Laird, and Rubin, 1977) also contains a variety of problems where an application of the EM algorithm can be fruitful; for a brief history of the EM algorithm see (McLachlan and Krishnan, 1997, Section 1.8). We briefly explain the EM algorithm in Section 1.1. Since the EM algorithm is very popular in applications, a lot of research on its behavior has been done.
We are only dealing with a special case of this procedure, so we restrict ourselves to citing the well-known convergence result by Wu (1983), which is still state of the art in most settings. Unfortunately Wu's result - as most convergence results on these iterative procedures - only ensures convergence to some set of local maximizers or fixed points of the procedure. Only in very special cases like unimodality can actual convergence to the maximizer be ensured. In a recent work (Balakrishnan, Wainwright, and Yu, 2014) a new way is presented of addressing the properties of the EM sequence in a very general i.i.d. setting, based on concavity of $\mathcal{L}$. They assume that the functional $\mathcal{L}$ is concave and smooth enough (first-order stability) and that for a sample $(Y_i)_{i=1,\dots,n}$ with high probability a uniform bound of the following kind is satisfied:
$$\max_{\theta \in B_r(\theta^*)}\Big\|\operatorname{argmax}_{\theta'}\mathcal{L}(\theta', \tilde\eta_\theta) - \operatorname{argmax}_{\theta'}\mathbb{E}\mathcal{L}(\theta', \tilde\eta_\theta)\Big\| \le \epsilon_n. \tag{4}$$
Under these assumptions they show that with high probability and some $\nu < 1$
$$\|\tilde\theta_k - \theta^*\| \le \nu^k\,\|\tilde\theta_0 - \theta^*\| + C\epsilon_n. \tag{5}$$
Unfortunately this does not answer our two questions to full satisfaction. First, the bound (4) is rather high level and has to be checked for each model, while we seek (and find) properties of the functional - such as smoothness and bounds on the moments of its gradient - that lead to comparably desirable behavior. Further, with (5) it remains unclear whether for large $k \in \mathbb{N}$ the alternating sequence satisfies a Fisher expansion or whether a Wilks type phenomenon occurs. In particular it remains open which ratio of dimension to sample size ensures good performance of the procedure. Also the actual convergence $\tilde\theta_k \to \tilde\theta$ is not addressed. These results apply to our problem if the involved regularity criteria are met. But as noted, these results do not tell us whether the limit of the sequence $(\tilde\theta_k)$ actually is the profile M-estimator, and the statistical properties of limit points are not clear without too restrictive assumptions on $\mathcal{L}$ and the data.
Another new work (Cheng, 2013) contains the analysis of a slightly altered algorithm in a very general semiparametric asymptotic framework. Instead of alternatingly maximizing the functional $\mathcal{L}$, a kind of gradient descent procedure for the profile likelihood $\mathrm{pl}(\theta) \stackrel{\text{def}}{=} \max_\eta \mathcal{L}(\theta,\eta)$ is analyzed, i.e. they define
$$\hat\theta_k \stackrel{\text{def}}{=} \hat\theta_{k-1} + \hat D(\hat\theta_{k-1})^{-2}\,\nabla\hat\ell(\hat\theta_{k-1}), \tag{6}$$
where $\hat\eta(\cdot)$ is an estimator of $\operatorname{argmax}_\eta \mathbb{E}\mathcal{L}(\cdot,\eta)$, $\hat D(\cdot)^2$ is an estimator of $-\nabla^2\mathbb{E}\max_\eta\mathcal{L}(\cdot,\eta)$ and $\hat\ell(\cdot)$ is an estimator of $\max_\eta\mathcal{L}(\cdot,\eta)$. Under common regularity conditions it is shown that $\|\hat\theta_k - \tilde\theta\| = o_P(1/\sqrt n)$ if $k(n) \in \mathbb{N}$ is chosen such that the rate of the initial guess $\hat\theta_0$ - obtained via a stochastic grid search - and the rate of the estimator of the nuisance parameter are taken into account. These results resemble very much what is aimed for in this work, but it is important to note a series of differences between the results of that work and the present paper. First and most importantly, the treated algorithm in that paper (Cheng, 2013) is similar in spirit to the alternating procedure, but in fact is a different procedure. It is a gradient descent scheme and involves a very careful data driven choice of step sizes when carrying out the estimations necessary in (6), and in that sense differs substantially from the simple and direct alternating maximization. Also the estimating step of the nuisance component is not the object of the analysis but is assumed to be sufficiently good for the arguments to go through. Finally, the results of (Cheng, 2013) are purely asymptotic.

In this work we carry out a finite sample analysis for the alternating maximization procedure in (3). Instead of a general semiparametric framework we address sieve profile estimators, also called finite dimensional linear series estimation (see Chen, 2007; Andresen and Spokoiny, 2014), see Section 1.2 for more details.
In this setting the bias of estimation - induced by projecting the full model onto a finite dimensional submodel - can be treated separately, and the model becomes finite dimensional as far as the algorithm is concerned. This allows a very careful and explicit analysis of the behavior of the procedure. In particular, the speed of convergence can be linked to characteristics of the information matrix $-\nabla^2\mathbb{E}\mathcal{L}(\upsilon^*)$ - namely to the constant $\nu < 1$ in (20) - and to the full dimension of the projected parameter space. The resulting number of iterations necessary for efficient estimation can be given in a rather simple and closed form. Finally, our results are nonasymptotic, which in this context is crucial, as a clear comparison of the computational and the estimation error for finite samples is needed for reasonable inference.

Our main result can be summarized as follows: under a set of regularity conditions on the data and the functional $\mathcal{L}$, points of the sequence $(\tilde\theta_k)$ behave for large iteration number $k \in \mathbb{N}$ like the pME. To be more precise, we show in Theorem 7 that if the initial guess $\tilde\upsilon_0 \in \Upsilon$ is good enough, the step estimator sequence $(\tilde\theta_k)$ satisfies with high probability
$$\big\|\breve D(\tilde\theta_k - \theta^*) - \breve\xi\big\| \le \epsilon\,(p^* + \nu^k R_0), \tag{7}$$
$$\Big|\max_\eta\mathcal{L}(\tilde\theta_k,\eta) - \max_\eta\mathcal{L}(\theta^*,\eta) - \|\breve\xi\|^2/2\Big| \le (p + x)^{1/2}\,\epsilon\,(p^* + \nu^k R_0), \tag{8}$$
where $\nu < 1$ is introduced in (20) and $\epsilon > 0$ is some small number, for example $\epsilon = C/\sqrt n$ in the smooth i.i.d. setting. Further, $R_0 > 0$ is a bound related to the quality of the initial guess. Generally it is proportional to the full dimension, and in this way the rate with which the full nuisance can be estimated affects the speed of the convergence of the procedure. The random variable $\breve\xi \in \mathbb{R}^p$ and the matrix $\breve D \in \mathbb{R}^{p\times p}$ are related to the efficient influence function in semiparametric models and its covariance. These are, up to the term $\nu^k R_0$, the same properties as those proven for the pME by Andresen and Spokoiny (2014) under nearly the same set of conditions.
Up to the finite sample bounds on the right hand sides this means that the estimating points of the procedure admit a Fisher expansion - in other words, are asymptotically normal - and a Wilks expansion. Consequently, the usual inference procedures based on confidence and concentration sets can be applied to these estimators. Further, in our second main result we manage to show, under slightly stronger smoothness conditions, that $(\tilde\theta_k, \tilde\eta_k)$ approaches the ME $(\tilde\theta,\tilde\eta) = \operatorname{argmax}_{(\theta,\eta)}\mathcal{L}(\theta,\eta)$ with nearly linear convergence speed, i.e.
$$\big\|\mathcal{D}\big((\tilde\theta_k,\tilde\eta_k) - \tilde\upsilon\big)\big\| \le \tau^{k/\log(k)}$$
with some $0 < \tau < 1$ and $\mathcal{D}^2 = -\nabla^2\mathbb{E}\mathcal{L}(\upsilon^*)$ (see Theorem 14).

To clarify, we want to mention that the term convergence refers to the behavior of the sequence $(\tilde\theta_k,\tilde\eta_k)$ when the number of iterations $k \in \mathbb{N}$ tends to infinity. We show $(\tilde\theta_k,\tilde\eta_k) \to (\tilde\theta,\tilde\eta)$. This has to be distinguished from the usual stochastic convergence results of the M-estimator $(\tilde\theta,\tilde\eta)$ towards the target $(\theta^*,\eta^*)$ or the weak convergence to a normal distribution as the sample size increases. Our setup assumes finite sample size, such that even with $k \to \infty$ there remains a gap between $(\tilde\theta_k,\tilde\eta_k)$ and $(\theta^*,\eta^*)$ and between $\tilde\theta_k$ and $\theta^*$ that is related to the parametric and semiparametric Cramér-Rao lower bounds, respectively. This is why in the finite sample setting we can only hope to obtain convergence of the alternating procedure to $(\tilde\theta,\tilde\eta)$ but not to $(\theta^*,\eta^*)$. But for a growing sample size, (7) implies the weak convergence results also for the estimator $\tilde\theta_k$ when $k(n) \in \mathbb{N}$ is large enough and $\epsilon p^*$ vanishes (see Section 1.3).

In the following we write $\tilde\upsilon_{k,k(+1)}$ in statements that are true for both $\tilde\upsilon_{k,k+1}$ and $\tilde\upsilon_{k,k}$. Also, we do not specify whether the elements of the resulting sequence are sets or single points. All statements made about properties of $\tilde\upsilon_{k,k(+1)}$ are to be understood in the sense that they hold for every point of $\tilde\upsilon_{k,k(+1)}$.
1.1 Relation to the EM Algorithm

In the introduction we claimed that the alternating procedure analyzed in this work is related to the EM algorithm. In this section we want to elaborate on that. First we explain the EM algorithm. Consider data $X \sim \mathbb{P}_\theta$ for some parametric family $(\mathbb{P}_\theta, \theta \in \Theta)$. Assume that a parameter $\theta \in \Theta$ is to be estimated as maximizer of the functional $\mathcal{L}_c(\theta, X) \in \mathbb{R}$, but that only $Y \in \mathcal{Y}$ is observed, where $Y = f_Y(X)$ is the image of the complete data set $X \in \mathcal{X}$ under some map $f_Y: \mathcal{X} \to \mathcal{Y}$. Prominent examples for the map $f_Y$ are projections onto subspaces of $\mathcal{X}$ if both $\mathcal{Y}, \mathcal{X}$ are vector spaces. The information lost under the map can be regarded as missing data or latent variables. As a direct maximization of the functional is impossible without knowledge of $X$, the EM algorithm serves as a workaround. It consists of the iteration of two steps: starting with some initial guess $\tilde\theta_0$, the $k$th Expectation step derives the functional $Q$ via
$$Q(\theta, \theta_k) = \mathbb{E}_{\theta_k}\big[\mathcal{L}_c(\theta, X)\,\big|\,Y\big],$$
which means that on the right hand side the conditional expectation is calculated under the distribution $\mathbb{P}_{\theta_k}$. The $k$th Maximization step then simply locates the maximizer $\theta_{k+1}$ of $Q(\cdot, \theta_k)$.

Now we can present the convergence result of Wu (1983) in more detail. Wu presents regularity conditions that ensure $L(\theta_{k+1}|Y) \ge L(\theta_k|Y)$, where
$$L(\theta|Y) \stackrel{\text{def}}{=} \mathbb{E}\big[\mathcal{L}_c(\theta, X)\,\big|\,Y = f_Y(X)\big],$$
such that $L(\theta_k|Y) \to L^*$ for some limit value $L^*(Y) > 0$ that may depend on the starting point $\theta_0$. Additionally, Wu gives conditions that guarantee that the sequence $\theta_k$ (possibly a sequence of sets) converges to
$$\mathcal{C}(L^*) \stackrel{\text{def}}{=} \{\theta\,|\, L(\theta|Y) = L^*(Y)\}.$$
Dempster, Laird, and Rubin (1977) show that the speed of convergence is linear in the case of point valued $\theta_k$ and some differentiability criterion being met. A limitation of these results is that it is not clear whether $L^*(Y) = \sup_\theta L(\theta|Y)$, and thus it is not guaranteed that $\mathcal{C}(L^*)$ contains the desired MLE and not just some local maximum.
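To make the E- and M-steps concrete, here is a minimal EM sketch for the textbook case of a two-component Gaussian mixture with known unit variances and equal weights (a hypothetical example, not a model treated in this paper): the E-step computes the posterior responsibilities under the current means, and the M-step maximizes $Q$ by taking responsibility-weighted means.

```python
import math
import random

random.seed(0)

# Complete data X = (component, y); only y is observed.
# Two-component Gaussian mixture, unit variances, equal weights;
# EM estimates the two means (mu0, mu1).  (Hypothetical example.)
truth = (-2.0, 2.0)
data = [random.gauss(truth[random.random() < 0.5], 1.0) for _ in range(2000)]

def em_step(mu0, mu1, ys):
    # E-step: posterior responsibility of component 1 for each observation.
    w0_sum = w1_sum = s0 = s1 = 0.0
    for y in ys:
        p0 = math.exp(-0.5 * (y - mu0) ** 2)
        p1 = math.exp(-0.5 * (y - mu1) ** 2)
        r1 = p1 / (p0 + p1)
        w1_sum += r1
        s1 += r1 * y
        w0_sum += 1 - r1
        s0 += (1 - r1) * y
    # M-step: maximize Q, i.e. take responsibility-weighted means.
    return s0 / w0_sum, s1 / w1_sum

mu0, mu1 = -0.5, 0.5   # rough initial guess
for _ in range(100):
    mu0, mu1 = em_step(mu0, mu1, data)

print(mu0, mu1)
```

With well-separated components and a reasonable starting point the iteration settles near the data-generating means; with a poor starting point it can stall at a local maximum, which is exactly the limitation of Wu's result discussed above.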
This problem disappears if $L(\cdot|Y)$ is unimodal and the regularity conditions are met, but this assumption may be too restrictive. To see that the procedure (3) is a special case of the EM algorithm we have to find the right triplet $(X, f_Y, \mathcal{L}_c)$. For this we take $X = (Z, Y)$ with $Z = \operatorname{argmax}_\eta \mathcal{L}\{(\theta,\eta), Y\}$ under $\mathbb{P}_\theta$. Further we set $f_Y(X) = Y$ and $\mathcal{L}_c(\theta, X) \stackrel{\text{def}}{=} \mathcal{L}(\theta,\eta,Y)$, where $X = (\eta, Y)$. Then we find
$$Q(\theta, \tilde\theta^{(k-1)}) = \mathbb{E}_{\tilde\theta^{(k-1)}}\big[\mathcal{L}_c(\theta, X)\,\big|\,Y\big] = \mathcal{L}_c\Big(\theta,\ \big(\operatorname{argmax}_\eta\mathcal{L}\{(\tilde\theta^{(k-1)},\eta), Y\},\ Y\big)\Big) = \mathcal{L}(\theta, \tilde\eta^{(k)}, Y),$$
and thus the resulting sequence is the same as in (3).

1.2 Linear Series Estimators

In semiparametric models the profile M-estimator $\tilde\theta \in \mathbb{R}^p$ from Equation (2) cannot be calculated in practice if the full model is infinite dimensional. There are various ways to circumvent this problem. Next to nonparametric estimation and plug-in of the nuisance $\eta \in \mathbb{X}$, a prominent approach is the so-called sieve technique that motivates the setting in this work. The sieve approach was systematically introduced by Grenander (see Grenander, 1981, Chapter 8) and consists in choosing a suitable sequence of subsets $(\Upsilon_m)_{m=1}^\infty \subseteq \Upsilon$ such that for each $\upsilon \in \Upsilon$ there exists a sequence $\Pi_m(\upsilon) \in \Upsilon_m$ with $\|\upsilon - \Pi_m(\upsilon)\| \to 0$. Furthermore, the sets $\Upsilon_m \subseteq \Upsilon$ have to be such that $\sup_{\upsilon \in \Upsilon_m}\mathcal{L}(\upsilon)$ can be calculated in practice. In the setting of semiparametric M-estimation we assume $\Upsilon = \Upsilon_\theta \times \Upsilon_\eta \subseteq \mathbb{R}^p \times \mathbb{X}$ with some infinite dimensional separable Hilbert space $\mathbb{X}$ and countable basis $\{e_1, e_2, \dots\} \subset \mathbb{X}$. We set $\Upsilon_m = \Upsilon_\theta \times \Pi_m\Upsilon_\eta$, where $\Pi_m: \mathbb{X} \to \mathbb{X}_m$ denotes the orthogonal projection onto $\mathbb{X}_m \stackrel{\text{def}}{=} \operatorname{span}\{e_1,\dots,e_m\}$. For each $m \in \mathbb{N}$ the sieve profile M-estimator is defined as
$$\tilde\theta_m \stackrel{\text{def}}{=} \Pi_\theta\tilde\upsilon_m \stackrel{\text{def}}{=} \Pi_\theta\operatorname{argmax}_{\theta\in\mathbb{R}^p,\ \eta\in\mathbb{R}^m}\mathcal{L}(\theta,\eta).$$
This means that for the calculation of the estimator $\tilde\theta_m$ only a finite dimensional setting has to be considered. In our analysis we will focus on the behavior of the alternating maximization procedure in that case.
But of course the projection onto a finite dimensional submodel induces an approximation bias $\upsilon^* - \upsilon^*_m$, where
$$(\theta^*_m, \eta^*_m) \stackrel{\text{def}}{=} \upsilon^*_m \stackrel{\text{def}}{=} \operatorname{argmax}_{\theta\in\mathbb{R}^p,\ \eta\in\mathbb{R}^m}\mathbb{E}\mathcal{L}(\theta,\eta).$$
In (Andresen and Spokoiny, 2014) it is explained in detail how this bias can be treated. Once the bias is controlled, this leads for each $m \in \mathbb{N}$ to bounds of the kind
$$\big\|\breve D(\tilde\theta_m - \theta^*) - \breve\xi_m\big\| \le \diamond(x) + \alpha(m),$$
$$\Big|\max_{\eta\in\Pi_m\Upsilon_\eta}\mathcal{L}(\tilde\theta_m,\eta) - \max_{\eta\in\Pi_m\Upsilon_\eta}\mathcal{L}(\theta^*,\eta) - \|\breve\xi_m\|^2/2\Big| \le \sqrt{p}\,\big(\diamond(x) + \alpha(m)\big),$$
where $\alpha(m) \to 0$ quantifies the impact of the bias $\upsilon^* - \upsilon^*_m$. The choice of $m \in \mathbb{N}$ then has to balance the two terms $\diamond(x)$ and $\alpha(m)$, which leads to common optimal choices for the dimension based on the smoothness of the nuisance component $\eta \in \mathbb{X}$. To ease notation we drop the $m$ in the following, as the treatment of the bias can be done separately (see Andresen and Spokoiny, 2014).

1.3 Finite Sample Wilks and Fisher Theorems

Before we present our main results we want to explain what type of results we aim at and how they can be interpreted. Hopefully this will ease the understanding and make some of the apparently cumbersome notation more intelligible. Usually in asymptotic treatments of semiparametric M-estimators like $\tilde\theta$ in (2), the aim is to derive statements of the kind
$$\big\|\sqrt n(\tilde\theta - \theta^*) - \breve d^{-1}\breve\xi\big\| = o_P(1), \tag{9}$$
$$\Big|\max_\eta\mathcal{L}(\tilde\theta,\eta) - \max_\eta\mathcal{L}(\theta^*,\eta) - \|\breve\xi\|^2/2\Big| = o_P(1), \tag{10}$$
$$\breve\xi \stackrel{w}{\longrightarrow} N\big(0,\ \breve d^{-1}\breve v^2\breve d^{-1}\big),$$
where $n \in \mathbb{N}$ denotes the sample size. The random variable $\breve\xi \in \mathbb{R}^p$ is called the semiparametric score. Below we briefly explain its derivation along with the matrices $\breve v^2, \breve d^2 \in \mathbb{R}^{p\times p}$. But first we sketch how (9) and (10) can be used for the construction of asymptotic confidence sets that yield statistical tests. Given the matrices $\breve v^2, \breve d^2$ the construction works as follows. Let $q_\alpha^2 > 0$ be an $\alpha$-level quantile of a $\chi^2_p(\breve d^{-1}\breve v^2\breve d^{-1})$-distribution. Set
$$\mathcal{E}(q_\alpha) = \big\{\theta:\ \|\sqrt n\,\breve d\,(\tilde\theta - \theta)\| \le q_\alpha\big\}; \tag{11}$$
then one can use (9) to show
$$\mathbb{P}\{\theta^* \notin \mathcal{E}(q_\alpha)\} = \mathbb{P}\big\{\|\sqrt n\,\breve d\,(\tilde\theta - \theta^*)\| \ge q_\alpha\big\} \to 1 - \alpha.$$
Similarly one can exploit (10). Now we explain the definition of $\breve v^2$ and $\breve d^2$.
Introduce
$$n\,\breve v^{-2}(\upsilon) \stackrel{\text{def}}{=} \Pi_\theta\big(\operatorname{Cov}(\nabla\mathcal{L}(\upsilon))\big)^{-1}\Pi_\theta^*,\qquad n\,\breve d^{-2}(\upsilon) \stackrel{\text{def}}{=} \Pi_\theta\big(-\nabla^2\mathbb{E}\mathcal{L}(\upsilon)\big)^{-1}\Pi_\theta^*,$$
where $\Pi_\theta$ is the orthogonal projection onto the $\theta$-component in $\mathbb{R}^{p^*}$ and $\Pi_\theta^*$ is its dual operator. Then $\breve v^2 = \breve v^2(\upsilon^*)$ and $\breve d^2 = \breve d^2(\upsilon^*)$.

Remark 1 Note that these two matrices coincide if the functional $\mathcal{L}$ is the complete log-likelihood of the observations, and that then $\breve d^2$ equals the covariance of the efficient influence function (see Kosorok, 2005, for more details).

For the definition of the semiparametric score $\breve\xi \in \mathbb{R}^p$ consider the block decomposition
$$n\,\mathfrak{d}^2(\upsilon) \stackrel{\text{def}}{=} \begin{pmatrix} d^2(\upsilon) & a(\upsilon)\\ a^\top(\upsilon) & h^2(\upsilon)\end{pmatrix} \stackrel{\text{def}}{=} -\nabla^2\mathbb{E}\mathcal{L}(\upsilon) \in \mathbb{R}^{p^*\times p^*}$$
and set
$$\breve\xi \stackrel{\text{def}}{=} \frac{1}{\sqrt n}\,\breve d(\upsilon^*)\,\Pi_\theta\,\mathfrak{d}^{-2}(\upsilon^*)\,\nabla\mathcal{L}(\upsilon^*) = \frac{1}{\sqrt n}\,\breve d^{-1}(\upsilon^*)\Big(\nabla_\theta\mathcal{L}(\upsilon^*) - a\,h^{-2}(\upsilon^*)\,\nabla_\eta\mathcal{L}(\upsilon^*)\Big).$$
This random variable is related to the efficient influence function in semiparametric estimation, and it plays the role that the usual score $\nabla\mathcal{L}(\upsilon^*)$ plays in the setting of parametric M-estimation.

In this work we derive (9) and (10) for $\tilde\theta_k$ instead of $\tilde\theta$, but with finite sample bounds for the terms on the right-hand sides of (9) and (10). To be more precise, we derive statements of the following kind. With probability greater than $1 - Ce^{-x}$,
$$\big\|\sqrt n(\tilde\theta_k - \theta^*) - \breve d^{-1}\breve\xi\big\| \le \epsilon\,(p^* + \nu^k R_0), \tag{12}$$
$$\Big|\max_\eta\mathcal{L}(\tilde\theta_k,\eta) - \max_\eta\mathcal{L}(\theta^*,\eta) - \|\breve\xi\|^2/2\Big| \le (p + x)^{1/2}\,\epsilon\,(p^* + \nu^k R_0), \tag{13}$$
with some small value $\epsilon > 0$ as in (7) and (8). Note that for vanishing right hand sides these equations imply (9) and (10) when $n$ tends to infinity. Using the scheme in (11), the bounds (12) and (13) allow the construction of (conservative) finite sample confidence sets. Assume that (approximate) quantiles $q_\alpha$ for $\|\breve\xi\|$ are available, i.e. that with some small $\delta > 0$ and any $\alpha \in [0,1]$
$$\mathbb{P}\big(\|\breve\xi\| \le q_\alpha\big) \in (\alpha - \delta,\ \alpha + \delta);$$
then with some generic constant $C > 0$ (see Andresen and Spokoiny, 2014, Remark 2.13)
$$\mathbb{P}\Big\{\theta^* \in \mathcal{E}\big(q_\alpha + \epsilon(p^* + \nu^k R_0)\big)\Big\} \ge \alpha - \delta - Ce^{-x},$$
$$\mathbb{P}\Big\{\theta^* \in \mathcal{E}\big(q_\alpha - \epsilon(p^* + \nu^k R_0)\big)\Big\} \le \alpha + \delta + Ce^{-x}.$$
The important achievement is that one can make approximate confidence statements for the estimators of the alternating procedure, and this even in the finite sample case, without ignoring hopefully small enough terms.
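The construction around (11) can be illustrated in the simplest possible setting (a hypothetical one-dimensional Gaussian-mean example where $\breve d = \breve v = 1$, not an example from the paper): approximate the quantile $q_\alpha$ of $\|\breve\xi\|$ by Monte Carlo, build the set $\mathcal{E}(q_\alpha)$, and check its coverage by simulation.

```python
import math
import random

random.seed(1)

# 1-D illustration: theta is a Gaussian mean, L the log-likelihood,
# so breve-d = breve-v = 1 and the score xi is standard normal.
# The quantile q_alpha of |xi| is approximated by Monte Carlo, as one
# would do in general via plug-in/simulation.  (Hypothetical example.)
n, alpha = 400, 0.90
sims = sorted(abs(random.gauss(0.0, 1.0)) for _ in range(100000))
q_alpha = sims[int(alpha * len(sims))]   # MC alpha-quantile of |xi|

theta_star = 0.7
covered = 0
reps = 2000
for _ in range(reps):
    ys = [random.gauss(theta_star, 1.0) for _ in range(n)]
    theta_hat = sum(ys) / n              # the M-estimator (here: the MLE)
    # E(q_alpha) = {theta : sqrt(n) * |theta_hat - theta| <= q_alpha}
    if math.sqrt(n) * abs(theta_hat - theta_star) <= q_alpha:
        covered += 1

print(covered / reps)   # empirical coverage, should be close to alpha
```

In this exactly Gaussian toy case the empirical coverage matches $\alpha$ up to Monte Carlo error; the point of the finite sample bounds (12) and (13) is to control the additional slack $\epsilon(p^* + \nu^k R_0)$ that appears when $\tilde\theta_k$ replaces the exact estimator.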
As remarked above, such approximate quantiles could be attained via a plug-in estimation of $\breve d^{-1}\breve v^2\breve d^{-1}$ combined with a Gaussian approximation or a bootstrap.

2. Main Results

This section contains the thorough presentation of our main convergence results. It involves the introduction of various technicalities and may appear a bit cumbersome on first reading. We recommend carefully reading Section 1.3 first to ease understanding.

2.1 Conditions

This section collects the conditions imposed on the model. We use the same set of assumptions as Andresen and Spokoiny (2014), and this section closely follows Section 2.1 of that paper. Let the full dimension of the parameter space be finite, i.e. $p^* < \infty$. Our conditions involve the symmetric positive definite information matrix $\mathcal{D}_0^2 \in \mathbb{R}^{p^*\times p^*}$ and a central point $\upsilon^* \in \mathbb{R}^{p^*}$. To ease presentation, in this paper we identify $\upsilon^*$ with the true point from (1) and define
$$\mathcal{D}_0^2 \stackrel{\text{def}}{=} -\nabla^2\mathbb{E}\mathcal{L}(\upsilon^*),$$
where we assume that the second derivative exists. In the context of semiparametric estimation it is convenient to represent the information matrix in block form:
$$\mathcal{D}_0^2 = \begin{pmatrix} D_0^2 & A_0\\ A_0^\top & H_0^2\end{pmatrix}.$$
First we state an identifiability condition, which basically imposes that the efficient information matrix is positive definite. Note that in this work $\|\cdot\|$ always denotes the spectral norm when its argument is a matrix.

(I) It holds for some $\nu < 1$
$$\big\|H_0^{-1}A_0^\top D_0^{-1}\big\| \le \nu. \tag{14}$$

The condition (I) allows us to introduce the important $p\times p$ efficient information matrix $\breve D_0^2$, which is defined as the inverse of the $\theta$-block of the inverse of the full dimensional matrix $\mathcal{D}_0^2$. The exact formula is given by
$$\breve D_0^2 \stackrel{\text{def}}{=} D_0^2 - A_0 H_0^{-2}A_0^\top,$$
and (I) ensures that the matrix $\breve D_0^2$ is positive definite, such that $\breve D_0$ is well defined. At the same time, $\nu < 1$ ensures that the alternating sequence actually converges. As can be seen in Theorem 7, the speed of convergence is linear in $\nu$.
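As a numerical sanity check (a hypothetical $2\times 2$ example with $p = m = 1$, not from the paper), the following sketch verifies that the Schur complement $D_0^2 - A_0 H_0^{-2}A_0^\top$ indeed equals the inverse of the $\theta$-block of $(\mathcal{D}_0^2)^{-1}$, and computes the quantity $\nu$ from (14):

```python
import math

# Hypothetical 2x2 block example (p = m = 1): D0^2 = [[d2, a], [a, h2]].
d2, a, h2 = 4.0, 1.2, 2.0

# Condition (I): nu = ||H0^-1 A0^T D0^-1|| must be < 1 (scalars here).
nu = abs(a) / (math.sqrt(h2) * math.sqrt(d2))

# Efficient information: Schur complement of the eta-block,
# breve-D0^2 = D0^2 - A0 H0^-2 A0^T.
breve_d2 = d2 - a * a / h2

# It should equal the inverse of the theta-block of the inverse matrix.
det = d2 * h2 - a * a
theta_block_of_inverse = h2 / det        # top-left entry of (D0^2)^-1
print(nu, breve_d2, 1.0 / theta_block_of_inverse)
```

The last two printed numbers agree, which is the standard block-inversion identity behind the definition of $\breve D_0^2$.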
Using the matrix $\mathcal{D}_0^2$ and the central point $\upsilon^* \in \mathbb{R}^{p^*}$, we define the local set $\Upsilon(r) \subseteq \Upsilon \subseteq \mathbb{R}^{p^*}$ with some $r \ge 0$:
$$\Upsilon(r) \stackrel{\text{def}}{=} \big\{\upsilon = (\theta,\eta) \in \Upsilon:\ \|\mathcal{D}_0(\upsilon - \upsilon^*)\| \le r\big\}. \tag{15}$$

2.1.1 Smoothness

Usually in the context of regular M-estimation one assumes local quadraticity, i.e. that
$$\mathrm{pl}(\theta) = \mathrm{pl}(\theta^*) + \nabla\mathrm{pl}(\theta^*)^\top(\theta - \theta^*) + \frac12(\theta - \theta^*)^\top\nabla^2\mathrm{pl}(\theta^*)(\theta - \theta^*) + o_P(1)$$
for $\theta \in \mathbb{R}^p$ close enough to $\theta^*$. The functional $\mathrm{pl}(\cdot)$ in this context denotes the profile functional and is defined as $\mathrm{pl}(\theta) \stackrel{\text{def}}{=} \max_\eta\mathcal{L}(\theta,\eta)$; see for instance (Cheng, 2013). In our setting we need a precise bound for the accuracy of a quadratic approximation of the expected profile functional $\mathbb{E}\,\mathrm{pl}$. We want to bound the error of a local linear approximation of the projected gradient $\breve\nabla_\theta\mathbb{E}\mathcal{L}(\upsilon)$, which is defined via
$$\breve\nabla_\theta \stackrel{\text{def}}{=} \nabla_\theta - A_0 H_0^{-2}\nabla_\eta.$$
Instead of local quadraticity of the profile functional we impose that for $(\theta,\eta) = \upsilon \in \Upsilon(r)$ - with the local set $\Upsilon(r)$ defined in (15) - with some small $\epsilon > 0$
$$\Big\|\breve D^{-1}\big(\breve\nabla_\theta\mathbb{E}\mathcal{L}(\upsilon) - \breve\nabla_\theta\mathbb{E}\mathcal{L}(\upsilon^*)\big) + \breve D(\theta - \theta^*)\Big\| \le \epsilon\, r^2.$$
The following condition serves as a less high-level way of checking that such an approximation holds (see Andresen and Spokoiny, 2014, Lemma B.1):

(L̆0) For each $r \le r_0$ there is a constant $\epsilon > 0$ such that it holds on the set $\Upsilon(r)$:
$$\big\|\breve D^{-1}\breve D^2(\upsilon)\breve D^{-1} - I_p\big\| \le \epsilon\,r,\qquad \big\|\breve D^{-1}\big(A(\upsilon) - A\big)H^{-1}\big\| \le \epsilon\,r,$$
$$\big\|\breve D^{-1}AH^{-1}\big(I_m - H^{-1}H^2(\upsilon)H^{-1}\big)\big\| \le \epsilon\,r.$$
If $\mathbb{E}\mathcal{L}$ is three times continuously differentiable one obtains $\epsilon \le C\|\mathcal{D}^{-1}\|$. In i.i.d. models one usually has $\|\mathcal{D}^{-1}\| = O(1/\sqrt n)$.

Remark 2 Here and in what follows we implicitly assume that the function $\mathcal{L}(\upsilon): \mathbb{R}^{p^*} \to \mathbb{R}$ is sufficiently smooth in $\upsilon \in \mathbb{R}^{p^*}$; $\nabla\mathcal{L}(\upsilon) \in \mathbb{R}^{p^*}$ stands for the gradient, and $\nabla^2\mathbb{E}\mathcal{L}(\upsilon) \in \mathbb{R}^{p^*\times p^*}$ for the Hessian of the expectation $\mathbb{E}\mathcal{L}: \mathbb{R}^{p^*} \to \mathbb{R}$ at $\upsilon \in \mathbb{R}^{p^*}$. By smooth enough we mean that we can interchange $\nabla\mathbb{E}\mathcal{L} = \mathbb{E}\nabla\mathcal{L}$ on $\Upsilon(R_0)$, where $\Upsilon(r)$ is defined in (15) and $R_0 > 0$ in (19). It is worth mentioning that $\mathcal{D}_0^2 = \mathcal{V}_0^2 \stackrel{\text{def}}{=} \operatorname{Cov}(\nabla\mathcal{L}(\upsilon^*))$ if the model $Y \sim \mathbb{P}_{\upsilon^*} \in (\mathbb{P}_\upsilon)$ is correctly specified and sufficiently regular; see e.g. (Ibragimov and Khas'minskij, 1981).
2.1.2 Complexity

The usual approach to gain asymptotic control of profile M-estimators would be to assume that the class $\{\nabla\mathcal{L}(\upsilon),\ \upsilon \in \Upsilon(r)\}$ is $\mathbb{P}$-Donsker (see Cheng, 2013). To pin down the estimator sequence $(\tilde\theta_k, \tilde\eta_k)_{k\in\mathbb{N}}$ and to obtain finite sample results we use a more specific approach, which is based on a new finite sample technique (Spokoiny, 2012; Andresen and Spokoiny, 2014). First note that we assume that - as far as the alternating procedure is concerned - the model is finite dimensional, which means that $\Upsilon(r) \subseteq \Upsilon \subseteq \mathbb{R}^{p^*}$ in (15) is automatically compact for finite radius $r > 0$. If we can ensure the right smoothness and moment conditions on $\nabla\mathcal{L}(\upsilon)$, we automatically obtain that the above class is $\mathbb{P}$-Donsker. But using the new techniques (Spokoiny, 2012; Andresen and Spokoiny, 2014) we manage to obtain slightly stronger bounds that are useful in a finite sample setting.

To understand the next condition, consider first the definition of a subgaussian random vector. A random vector $X \in \mathbb{R}^{p^*}$ is called subgaussian if for any $\mu \in \mathbb{R}$ and some $\nu > 0$
$$\sup_{\|\gamma\|\le 1}\log\mathbb{E}\exp\big\{\mu\,\gamma^\top X\big\} \le \nu^2\mu^2/2. \tag{16}$$
Obviously this is a strong condition. As it turns out, in many situations it is sufficient to assume subexponentiality, which simply relaxes subgaussianity by demanding that (16) is met only for $|\mu| \le g$ with some $g > 0$. In this way many distributions that would be excluded by assuming subgaussianity can still be treated. Our next condition combines subexponentiality with a smoothness constraint on the stochastic component of $\mathcal{L}(\upsilon)$. It assumes that - with some $\epsilon > 0$ - the random vector
$$\frac{1}{\epsilon\,\|\mathcal{D}(\upsilon - \upsilon')\|}\,\breve D^{-1}\big(\breve\nabla_\theta\zeta(\upsilon) - \breve\nabla_\theta\zeta(\upsilon')\big) \in \mathbb{R}^p$$
is uniformly subexponential for $\upsilon, \upsilon' \in \Upsilon(r)$, with the stochastic component defined as $\zeta(\upsilon) \stackrel{\text{def}}{=} \mathcal{L}(\upsilon) - \mathbb{E}\mathcal{L}(\upsilon)$.
It reads:

(ĔD1) For all $0 < r < r_0$ there exists a constant $\epsilon \le 1/2$ such that for all $|\mu| \le g$
$$\sup_{\upsilon,\upsilon'\in\Upsilon(r)}\ \sup_{\|\gamma\|\le 1}\ \log\mathbb{E}\exp\Bigg\{\frac{\mu\,\gamma^\top\breve D^{-1}\big(\breve\nabla_\theta\zeta(\upsilon) - \breve\nabla_\theta\zeta(\upsilon')\big)}{\epsilon\,\|\mathcal{D}(\upsilon - \upsilon')\|}\Bigg\} \le \frac{\nu_0^2\mu^2}{2}.$$

To convey more intuition, consider for some pair $\upsilon, \upsilon' \in \Upsilon(r)$ the function
$$\psi_\gamma(t) \stackrel{\text{def}}{=} \gamma^\top\breve D^{-1}\big(\breve\nabla_\theta\zeta(\upsilon) - \breve\nabla_\theta\zeta(\upsilon + t(\upsilon' - \upsilon))\big).$$
Then (ĔD1) is met with $\epsilon \simeq \|\mathcal{D}^{-1}\|$ if - uniformly in $\gamma$ - for any pair $\upsilon, \upsilon' \in \Upsilon(r)$ the corresponding function $\psi_\gamma: [0,1] \to \mathbb{R}$ is Lipschitz continuous and the Lipschitz constant is a subexponential random variable.

2.1.3 Moments

We need another condition that allows us to control the deviation behavior of $\breve D^{-1}\breve\nabla_\theta\zeta(\upsilon^*)$. To present this condition, define the covariance matrices $\mathcal{V}_0^2 \in \mathbb{R}^{p^*\times p^*}$ and $\breve V^2 \in \mathbb{R}^{p\times p}$:
$$\mathcal{V}_0^2 \stackrel{\text{def}}{=} \operatorname{Cov}\big(\nabla\mathcal{L}(\upsilon^*)\big),\qquad \breve V^2 = \operatorname{Cov}\big(\breve\nabla_\theta\zeta(\upsilon^*)\big).$$
We impose subexponential moments on $\breve V^{-1}\breve\nabla_\theta\zeta(\upsilon^*)$:

(ĔD) There exist constants $\nu_0 > 0$ and $g > 0$ such that for all $|\mu| \le g$
$$\sup_{\|\gamma\|\le 1}\log\mathbb{E}\exp\big\{\mu\,\gamma^\top\breve V^{-1}\breve\nabla_\theta\zeta(\upsilon^*)\big\} \le \frac{\nu_0^2\mu^2}{2}.$$

2.1.4 Conditions for the Full Model

So far we only presented conditions that allow us to treat the properties of $\tilde\theta_k$ on local sets $\Upsilon(r_k)$, for some sequence $(r_k)_{k\in\mathbb{N}}$. To show that the sequence of estimators $(\tilde\upsilon_k)_{k\in\mathbb{N}}$ satisfies $\tilde\upsilon_k \in \Upsilon(r_k)$ for an appropriately decreasing sequence $(r_k)_{k\in\mathbb{N}}$, the following stronger conditions are employed, which can be interpreted just as the previous ones.

(L0) For each $r \le r_0$ there is a constant $\epsilon > 0$ such that it holds on the set $\Upsilon(r)$:
$$\big\|\mathcal{D}_0^{-1}\big(-\nabla^2\mathbb{E}\mathcal{L}(\upsilon)\big)\mathcal{D}_0^{-1} - I_{p^*}\big\| \le \epsilon\,r.$$

(ED1) There exists a constant $\epsilon \le 1/2$ such that for all $|\mu| \le g$ and all $0 < r < r_0$
$$\sup_{\upsilon,\upsilon'\in\Upsilon(r)}\ \sup_{\|\gamma\|=1}\log\mathbb{E}\exp\Bigg\{\frac{\mu\,\gamma^\top\mathcal{D}_0^{-1}\big(\nabla\zeta(\upsilon) - \nabla\zeta(\upsilon')\big)}{\epsilon\,\|\mathcal{D}_0(\upsilon - \upsilon')\|}\Bigg\} \le \frac{\nu_0^2\mu^2}{2}.$$

(ED) There exist constants $\nu_0 > 0$ and $g > 0$ such that for all $|\mu| \le g$
$$\sup_{\|\gamma\|\le 1}\log\mathbb{E}\exp\big\{\mu\,\gamma^\top\mathcal{V}_0^{-1}\nabla\zeta(\upsilon^*)\big\} \le \frac{\nu_0^2\mu^2}{2}.$$

It is important to note that the constants $\epsilon, \nu_0$ in the respective weak and strong versions can differ substantially and may depend on the full dimension $p^* \in \mathbb{N}$ in more or less severe ways ($AH^{-2}\nabla_\eta\mathcal{L}$ might be quite smooth while $\nabla_\eta\mathcal{L}$ could be less regular).
For the convergence statement in Theorem 14 we additionally need the following condition, which controls the moments and the smoothness of the process $\nabla^2(\mathcal{L} - \mathbb{E}\mathcal{L})$:

(ED2) There exists a constant $\epsilon_2 \le 1/2$ such that for all $|\mu| \le g$ and all $0 < r < r_0$
$$\sup_{\upsilon,\upsilon'\in\Upsilon(r)}\ \sup_{\|\gamma_1\|=1}\ \sup_{\|\gamma_2\|=1}\log\mathbb{E}\exp\Bigg\{\frac{\mu\,\gamma_1^\top\mathcal{D}_0^{-1}\big(\nabla^2\zeta(\upsilon) - \nabla^2\zeta(\upsilon')\big)\gamma_2}{\epsilon_2\,\|\mathcal{D}_0(\upsilon - \upsilon')\|}\Bigg\} \le \frac{\nu_0^2\mu^2}{2}.$$

2.1.5 Quadratic Drift Beats Linear Fluctuation

Finally we present two conditions that ensure that with high probability the sequence $(\tilde\upsilon_{k,k(+1)})$ stays close to $\upsilon^*$ if the initial guess $\tilde\upsilon_0$ lands close to $\upsilon^*$. These conditions have to be satisfied on the whole set $\Upsilon \subseteq \mathbb{R}^{p^*}$. The first condition imposes that $\mathbb{E}\mathcal{L}(\upsilon)$ decreases nearly quadratically as the distance of $\upsilon$ to $\upsilon^*$ grows.

(Lr) For any $r > r_0$ there exists a value $b > 0$ such that
$$\mathbb{E}\big[\mathcal{L}(\upsilon) - \mathcal{L}(\upsilon^*)\big] \le -b\,\|\mathcal{D}_0(\upsilon - \upsilon^*)\|^2.$$

The next condition bounds the moments of the full gradient $\nabla\mathcal{L}(\upsilon)$ - again via subexponentiality.

(Er) For any $r \ge r_0$ there exists a constant $g(r) > 0$ such that
$$\sup_{\upsilon\in\Upsilon(r)}\ \sup_{\mu\le g(r)}\ \sup_{\|\gamma\|\le 1}\log\mathbb{E}\exp\big\{\mu\,\gamma^\top\mathcal{D}_0^{-1}\nabla\zeta(\upsilon)\big\} \le \frac{\nu_r^2\mu^2}{2}.$$

We impose one further, merely technical condition:

(B1) We assume for all $r$
$$6\nu_r\sqrt{x + 4p^*} + \frac{3\nu_r^2\, r}{b} \le g(r).$$

Remark 3 Without this condition the calculation of $R_0(x)$ in Section 4.3 would become technically more involved, without yielding further insight.

Remark 4 The condition (Lr) can be substantially relaxed to $b = b(r) > 0$ that decreases to 0 as $r \to \infty$. We avoid the resulting technicalities and refer the reader to the original publication for the non-constant case (see Spokoiny, 2012, Theorem 4.1).

2.2 Dependence on Initial Guess

Our main theorem is only valid under the conditions from Section 2.1 and under some constraints on the quality of the initial guess $\tilde\upsilon_0 \in \mathbb{R}^{p^*}$, which we denote by (A1), (A2) and (A3):

(A1) With probability greater than $1 - \beta$ the initial guess satisfies $\mathcal{L}(\tilde\upsilon_0) - \mathcal{L}(\upsilon^*) \ge -K_0$ for some $K_0 \ge 0$.

(A2) The conditions (ĔD1), (L̆0), (ED1) and (L0) from Section 2.1 hold for all $r \le R_0(x)$, where $R_0(x)$ can be bounded by a multiple of $\sqrt{x + p^* + K_0}$ (see (19)).
(A3) K0 ∈ R and ϵ > 0 are small enough to ensure

  ϵ C(ν) R0 < 1,  (17)

  C(ν) := 16√2 (1 + ν)/(1 − ν)²,  (18)

where ν > 0 is defined in (14).

Condition (A1) allows us to concentrate the analysis on a local set {‖𝒟(υ − υ*)‖ ≤ R0(x)} ∩ Υ with dominating probability (see Theorem 28). Conditions (A2) and (A3) ensure that this neighborhood is small enough to imply convergence of the procedure. They impose a bound on R0(x) and thus on K0 from (A1). These conditions boil down to ϵ√K0 being significantly smaller than 1, which is a quantification of the quality of the first guess. There are numerous ways to initialize the procedure. In Section 3 we use a grid search and show that for a sufficiently fine grid these conditions can be met in the treated model.

Remark 5 One way of obtaining condition (A1) is to show that ‖𝒟(υ̃0 − υ*)‖ ≤ R with probability greater than 1 − β for some finite R > 0 and 0 ≤ β(R) < 1. Then one can use, with some constant C > 0,

  K0 ≤ (1/2 + ϵ(1 + 12ν₀))(R + C√p*)²,

as we show in Lemma 31.

Remark 6 The precise definition of R0(x) > 0 reads

  R0(x) := z(x) ∨ { (6ν_r/(b(1 − ν))) √( x + 2.4p* + (b/(9ν_r²)) K0 ) },  (19)

with the term z(x) defined in (30).

2.3 Introduction of Important Objects

In this section we collect the most important objects and bounds that are relevant for Theorem 7. Remember the p*×p* information matrix 𝒟² from Section 2.1, which is defined similarly to the Fisher information matrix:

  𝒟² := −∇²EL(υ*) = [ D²  A ; Aᵀ  H² ].

A crucial object is the constant ν ≥ 0 defined by

  ‖D⁻¹AH⁻¹‖ =: ν,  (20)

which we assume with condition (I) to be smaller than 1. It determines the speed of convergence of the alternating procedure (see Theorem 7). Further introduce the p×p matrix D̆ and the p-vectors ∇̆θL and ξ̆ as

  D̆² = D² − AH⁻²Aᵀ,  ξ̆ = D̆⁻¹∇̆θL(υ*),  ∇̆θL = ∇θL − AH⁻²∇ηL.

The random variable ξ̆ ∈ R^p is related to the efficient influence function in semiparametric models.
If the model is regular and correctly specified, D̆⁻² is the covariance of the efficient influence function and its inverse the semiparametric Cramér-Rao lower bound for regular estimators. We define the semiparametric uniform spread

  ♦̆(r, x) := (16ϵ̆/(1 − ν²)²) ( r² + z_Q(x, 2p + 2p*)² ) (1 + 6ν₁²).  (21)

This object is central for our analysis as it describes the accuracy of our main results. It is small if ϵ̆(r² + x + p*) is small, since z_Q(x, p*) ≈ √(p* + x) (see its definition in Equation (42)).

2.4 Statistical Properties of the Alternating Sequence

In this section we present our main theorem in full rigor, i.e. we show that the limit of the alternating sequence satisfies a finite sample Wilks theorem and Fisher expansion.

Theorem 7 Assume that the conditions (Lr), (Er) and (B1) of Section 2.1 are met. Further assume (A1), (A2) and (A3) of Section 2.2. Then it holds with probability greater than 1 − 8e^{−x} − β for all k ∈ N

  ‖D̆(θ̃k − θ*) − ξ̆‖ ≤ ♦̆(r_k, x),  (22)

  | max_η L(θ̃k, η) − max_η L(θ*, η) − ‖ξ̆‖²/2 | ≤ 5( ‖ξ̆‖ + ♦̆(r_k, x) ) ♦̆(r_k, x),  (23)

where r_k ≤ C( √(p* + x) + ν^k R0 ), with a constant C that depends on ν < 1 and on 1 − C(ν)ϵR0 > 0. In particular this means that if k ≥ log( R0/√(p* + x) ) / log(1/ν), then ♦̆(r_k, x) ≤ ♦̆(C√(p* + x), x).

Remark 8 Note that with linear convergence speed this leads to statements about θ̃k that are very similar to those in (Andresen and Spokoiny, 2014) for the profile M-estimator θ̃.

Remark 9 Concerning the properties of ξ̆ ∈ R^p we repeat Remark 2.1 of (Andresen and Spokoiny, 2014). In case of correct model specification the deviation properties of the quadratic form ‖ξ̆‖² = ‖D̆⁻¹∇̆θL(υ*)‖² are essentially the same as those of a chi-square random variable with p degrees of freedom; see Theorem 39 in the appendix. In the case of a possible model misspecification the behavior of the quadratic form ‖ξ̆‖² will depend on the characteristics of the matrix IB := Cov(ξ̆); see again Theorem 39. Moreover, in the asymptotic setup the vector ξ̆ is asymptotically standard normal; see Section 2.2 of (Andresen and Spokoiny, 2014) for the i.i.d.
case.

Remark 10 These results allow us to derive some important corollaries like concentration and confidence sets (see Section 1.3).

Remark 11 In general an exact numerical computation of

  θ(η) := argmax_{θ∈R^p} L(θ, η),  or  η(θ) := argmax_{η∈R^m} L(θ, η),

is not possible. Define θ̂(η) and η̂(θ) as the numerical approximations to θ(η) and η(θ) and assume that, with the local set Υ◦(r) defined in (15),

  ‖D(θ̂(η) − θ(η))‖ ≤ τ  for all η ∈ Υ◦,η(R0) := {υ ∈ Υ◦(R0): Πη υ = η},
  ‖H(η̂(θ) − η(θ))‖ ≤ τ  for all θ ∈ Υ◦,θ(R0) := {υ ∈ Υ◦(R0): Πθ υ = θ}.

Then we can easily modify the proof of Theorem 7 by adding C(ν)τ to the error terms and the radii r_k, where C(ν) is some rational function of ν.

2.5 Convergence to the ME

Even though Theorem 7 tells us that the statistical properties of the alternating sequence resemble those of its target, the profile ME θ̃, it is an interesting question whether the underlying approach allows one to identify conditions under which the sequence actually attains the maximizer υ̃. Define the radius r0(x) > 0 to be the smallest radius r > 0 such that

  P( υ̃, υ̃_{θ*} ∈ Υ◦(r0) ) ≥ 1 − e^{−x},  where υ̃_{θ*} := argmax_{υ∈Υ, Πθυ=θ*} L(υ).

Remark 12 This radius can be determined using conditions (Lr) and (Er) of Section 2.1 and Theorem 28, which would yield r0(x) = O(√(x + p*)).

Without further assumptions Theorem 7 yields the following corollary:

Corollary 13 Under the assumptions of Theorem 7 it holds with probability greater than 1 − 8e^{−x} − β

  ‖D̆(θ̃ − θ̃k)‖ ≤ ♦̆(r_k, x) + ♦̆(r0, x).

Corollary 13 is a first step in the direction of an actual convergence result, but the gap ♦̆(r_k, x) + ♦̆(r0, x) is not a zero sequence in k ∈ N. It turns out that it is possible to prove convergence to the ME at the cost of assuming more smoothness of the functional L and using the right bound for the maximal eigenvalue of the Hessian ∇²L(υ*).
Define z₂(x, ∇²L(υ*)) via

  P( ‖𝒟⁻¹∇²ζ(υ*)‖ ≥ z₂(x, ∇²L(υ*)) ) ≤ e^{−x},

and κ(x, R0) as

  κ(x, R0) := ( 2√2(1 + ν)/(1 − ν) ) { ϵR0 + 9ν₂ϵ₂ ‖𝒟⁻¹‖ z₁(x, 6p*) R0 + ‖𝒟⁻¹‖ z₂(x, ∇²L(υ*)) },

where z₁(x, ·) ≈ √(x + p*), see (46). With these definitions we can prove the following theorem:

Theorem 14 Let the conditions (Lr), (Er) and (B1) be met. Further suppose (A1) and (A2), with (ED2) instead of (ED1). Assume that κ(x, R0) < 1 − ν. Then with probability greater than 1 − 3e^{−x} − β

  ‖𝒟(υ̃_{k,k(+1)} − υ̃)‖ ≤ r_k ≤ { ( 2√2 ν^k / (1 − κ(x, R0)k) ) R0,  if κ(x, R0)k ≤ 1;
                                 ν^{ (k/log(k)) log( (1−ν)/κ(x,R0) ) } c_k R0,  otherwise, }  (24)

with some sequence (c_k) ⊂ R, where 0 < c_k ≤ 2√2.

Remark 15 This means that we obtain nearly linear convergence to the global maximizer υ̃.

Remark 16 As in Remark 11, if no exact numerical computation of the stepwise maximizers is possible, we can easily modify the proof of Theorem 14 by adding C(ν)τ to κ(x, R0) to address that case.

Remark 17 For the case that L(υ) = Σ_{i=1}^n ℓ_i(υ) is a sum of independent marginal functionals ℓ_i: Υ → R, we can use Corollary 3.7 of (Tropp, 2012) to obtain z₂(x, ∇²L(υ*)) ≈ √(x + log(p*)), if for some sequence of matrices (A_i) ⊂ R^{p*×p*}

  log E exp{ λ∇²ℓ_i(υ*) } ≼ ν²λ²/2 · A_i.

In the case of smooth i.i.d. models this means that κ(x, R0) ≤ (C/√n)(x + R0 + log(p*)) if p* + x = o(n).

Remark 18 It may happen that κ(x, R0)/(1 − ν) is very close to or even larger than 1. But a close look at the proof of Theorem 14 reveals that this can be improved using Lemma 33. For this purpose bound r*_k ≤ C( z(x) + ν^k R0 ), with r*_k defined in (33) and with some constant C > 0. Then the result of Theorem 14 is true with κ(x, C√(p* + x)) instead of κ(x, R0) and with probability greater than 1 − 10e^{−x}. See Remark 38 for more details.

2.6 Critical Dimension

We want to address the issue of critical parameter dimensions when the full dimension p* grows with the sample size n. We write p* = p*_n. The results of Theorem 7 are accurate if the spread function ♦̆(r_k, x) from (21) is small. The critical size of p*_n then depends on the exact bounds on ϵ, ϵ̆.
In the i.i.d. setting we have ϵ ≈ ϵ̆ ≈ 1/√n, such that ♦̆(r_k, x) ≈ p*_n/√n for large k ∈ N. In other words, one needs that p*_n²/n is small to obtain an accurate non-asymptotic version of the Wilks phenomenon and the Fisher theorem for the limit of the alternating sequence. This is not surprising, because good performance of the ME itself can only be guaranteed if p*_n²/n is small, as is shown by Andresen and Spokoiny (2014). There are examples where the pME only satisfies a Wilks or Fisher result if p*_n²/n is small, such that in any of those settings the alternating sequence started in the global maximizer does not admit an accurate Wilks or Fisher expansion. Interestingly enough, the constraint κ(x, R0) < 1 − ν of Theorem 14 for the convergence of the sequence to the global maximizer means that one needs p*_n/√n ≪ 1 in the smooth i.i.d. setting if R0 ≤ C_{R0}√(p*_n + x). Further, Theorem 14 states a lower bound for the speed of convergence that in the smooth i.i.d. setting decreases if p*_n/√n grows. Unfortunately we were unable to find an example that meets the conditions of Section 2.1 and where no convergence occurs if p*_n/√n tends to infinity. So whether this dimension effect on the convergence is an artifact of our proofs or indeed a property of the alternating procedure remains an open question.

3. Application to Single Index Model

We illustrate how the results of Theorem 7 and Theorem 14 can be applied in single index modeling. This section is based on (Andresen, 2015); see that paper for a more detailed presentation. Consider the following model

  y_i = f(X_iᵀθ*) + ε_i,  i = 1, ..., n,

for some f: R → R and θ* ∈ S^{p,+}_1 ⊂ R^p, with i.i.d. errors ε_i ∈ R, Var(ε_i) = σ², and i.i.d. random variables X_i ∈ R^p with distribution denoted by P_X. The single-index model is widely applied in statistics. For example, in econometric studies it serves as a compromise between too restrictive parametric models and flexible but hardly estimable purely nonparametric models.
Usually the statistical inference focuses on estimating the index vector θ*. A lot of research has already been done in this field. For instance, (Delecroix et al., 1997) show the asymptotic efficiency of the general semiparametric maximum-functional estimator for particular examples, and in (Haerdle et al., 1993) the right choice of bandwidth for the nonparametric estimation of the link function is analyzed. We want to use this model to illustrate our theoretical results. To ensure identifiability of θ* ∈ R^p we assume that it lies in the half sphere S^{p,+}_1 := {θ ∈ R^p: ‖θ‖ = 1, θ1 > 0} ⊂ R^p. For simplicity we assume that the support of the X_i ∈ R^p is contained in the ball of radius s > 0. This allows us to approximate f ∈ {f: [−s, s] → R} by an orthonormal C³-Daubechies-wavelet basis (e_k)_{k∈N} on the interval. A candidate to estimate θ* is the sieve profile ME

  θ̃_m := Πθ argmax_{(θ,η)∈Υ_m} L_m(θ, η),  L_m(θ, η) = −(1/2) Σ_{i=1}^n ( y_i − Σ_{k=0}^{m−1} η_k e_k(X_iᵀθ) )²,

where Υ_m = S^{p,+}_1 × R^m. Again we will suppress the subindex m in the following. In this setting a direct computation of υ̃ becomes involved, as the maximization problem is high dimensional and not convex. But as noted in the introduction, the maximization with respect to η for given θ is high dimensional but convex and consequently feasible. Further, for moderate p ∈ N the maximization with respect to θ for fixed η is computationally realistic. So an alternating maximization procedure is applicable. To show that it behaves in the desired way we apply the technique presented above. For the initial guess υ̃0 ∈ Υ one can use a simple grid search. For this take a uniform grid G_N := (θ_1, ..., θ_N) ⊂ S^{+}_1 and define

  υ̃0 := argmax_{(θ,η)∈Υ, θ∈G_N} L(θ, η).  (25)

Note that given the grid the above maximizer is easily obtained. Simply calculate

  η̃_{0,k} := argmax_η L(θ_k, η) = ( Σ_{i=1}^n e eᵀ(X_iᵀθ_k) )⁻¹ Σ_{i=1}^n y_i e(X_iᵀθ_k) ∈ R^m,  (26)

where by abuse of notation e = (e_1, ..., e_m)ᵀ ∈ R^m. Now observe that υ̃0 = argmax_{k=1,...,N} L(θ_k, η̃_{0,k}).
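To make the scheme concrete, the following minimal numpy sketch runs the alternating procedure with the grid-search initialization and the closed-form η-step (26) on a toy single-index data set. It is only an illustration of the idea, not the authors' implementation: the wavelet basis is replaced by a small polynomial basis, the link is f(t) = sin(t), and the θ-step is a search over a fine grid on the half sphere in R².

```python
import numpy as np

rng = np.random.default_rng(0)

# toy single-index data: y_i = f(X_i' theta*) + eps_i with f(t) = sin(t)
n, p, m = 500, 2, 4                          # sample size, index dim, basis size
theta_star = np.array([np.cos(0.7), np.sin(0.7)])   # unit vector, theta_1 > 0
X = rng.uniform(-1, 1, size=(n, p))
y = np.sin(X @ theta_star) + 0.1 * rng.standard_normal(n)

def basis(t):
    # polynomial basis standing in for the wavelet basis (e_1, ..., e_m)
    return np.vander(t, m, increasing=True)  # shape (n, m)

def eta_step(theta):
    # closed-form least-squares maximizer in eta for fixed theta, cf. (26)
    E = basis(X @ theta)
    return np.linalg.solve(E.T @ E, E.T @ y)

def loss(theta, eta):
    # L(theta, eta) up to a constant: negative residual sum of squares
    return -np.sum((y - basis(X @ theta) @ eta) ** 2)

# uniform grid on the half sphere {theta: ||theta|| = 1, theta_1 >= 0}
grid = [np.array([np.cos(a), np.sin(a)])
        for a in np.linspace(-np.pi / 2, np.pi / 2, 361)]

# initial guess by grid search: profile out eta at every grid point
theta = max(grid, key=lambda th: loss(th, eta_step(th)))

# alternating maximization: convex eta-step, low-dimensional theta-step
for _ in range(10):
    eta = eta_step(theta)
    theta = max(grid, key=lambda th: loss(th, eta))

err = np.arccos(np.clip(theta @ theta_star, -1.0, 1.0))
print(f"angle between estimate and theta*: {err:.3f} rad")
```

With a grid fine enough relative to the noise level, the recovered direction agrees with θ* up to the grid resolution; refining the grid around the current iterate would play the role of the exact θ-step.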
Define the mesh size of the grid τ := sup_{θ∈S^{p,+}_1} inf_{θ′∈G_N} ‖θ − θ′‖. To apply the results presented in Theorem 7 and Theorem 14 we need a list of assumptions, denoted by (A).

(CondX) The random variables (X_i)_{i=1,...,n} ⊂ R^p are i.i.d. and bounded with distribution denoted by P_X and independent of (ε_i)_{i=1,...,n} ⊂ R. The measure P_X is absolutely continuous with respect to the Lebesgue measure. The Lebesgue density p_X of P_X is Lipschitz continuous and positive on B_s(0) ⊂ R^p. For any pair θ, θ′ ∈ S^{+,p}_1 with θ ≠ θ′ we have almost surely

  Var( Xᵀθ | Xᵀθ′ ) > 0.

(Condf) For some η* ∈ B_{r*}(0) ⊂ l₂ := {(u_k)_{k∈N}: Σ_{k=1}^∞ u_k² < ∞}, where ‖f‖_∞ < ∞ and ‖f′‖_∞ < ∞ and where with some α > 2

  Σ_{k=0}^∞ k^{2α} η*_k² < ∞.

On some interval [t0 − h, t0 + h] ⊆ [−s, s] with h > 0 it holds true that |f′(t)| > 0.

(Condε) The errors (ε_i) ⊂ R are i.i.d. with E[ε_i] = 0, Var(ε_i) = σ², and satisfy for all |µ| ≤ g̃, for some g̃ > 0 and some ν̃ > 0,

  log E[exp{µε₁}] ≤ ν̃²µ²/2.

Remark 19 Note that our assumptions in terms of moments and smoothness are quite common in this model. For instance, (Haerdle et al., 1993) assume that the density p_X of the regressors (X_i) is twice continuously differentiable, that f has two bounded derivatives and that the errors (ε_i) are centered with bounded polynomial moments of arbitrary degree.

Remark 20 Var(Xᵀθ | Xᵀθ′) = 0 would mean that Xᵀθ = a(Xᵀθ′) for some measurable function a: R → R. But then we would have for any (α, β) ∈ R² with α² + β² = 1 that

  f( Xᵀ(αθ + βθ′) ) = f( α a(Xᵀθ′) + β Xᵀθ′ ) =: f_{α,β}(Xᵀθ′),

such that the problem would no longer be identifiable. Also |f′(t)| > 0 on some interval is necessary for identifiability of θ*.

Proposition 21 Let τ = o(p^{−3/2}) and p⁵/n → 0. With initial guess given by Equation (25) and for not too large x > 0, the alternating sequence satisfies (22) and (23) with probability greater than 1 − 9 exp{−x}, where with some constant C ∈ R

  ♦̆(r, x) ≤ C (p + x)^{3/2}/√n ( r² + p + x ).

Proposition 22 Take the initial guess given by Equation (25). Assume (A).
Further assume that p⁴/n → 0 and τ = o(p^{−3/2}). Then we get the claim of Theorem 14 with β = e^{−x} and

  κ(x, R0) = O( τp^{3/2} + τxp^{3/2}/n^{1/4} ) + O( p²/√n ) → 0,

for moderate choice of x > 0.

Remark 23 The constraint τ = o(p^{−3/2}) implies that for the calculation of the initial guess the vector η̃_{0,l} of (26) and the functional L(·) have to be evaluated N = p^{3(p−1)/2} times. For details and proofs see (Andresen, 2015).

4. Proof of Theorem 7

In this section we present the proof of Theorem 7. As the proof is quite technical and complex, we first explain its basic ideas. In a second section we outline the steps of the proof more clearly. Finally we carry out each of these steps, which combine to yield the proof.

4.1 Idea of the Proof

To ease the understanding of what follows in the subsequent sections, we want to illustrate the central ideas with a simple model. Consider for some positive definite matrix 𝒟² ∈ R^{p*×p*} and some vector υ* = (θ*, η*) ∈ R^{p+m} = R^{p*} the model

  Y = υ* + ε ∈ R^{p*},  where ε ~ N(0, 𝒟⁻²),  𝒟² = [ D²  A ; Aᵀ  H² ].

Set L to be the true log likelihood of the observations, i.e.

  L(υ, Y) = −‖𝒟(υ − Y)‖²/2.

With any initial guess υ̃0 ∈ R^{p+m} we obtain from (3), for k ∈ N and the usual first order criterion of maximality, the following two equations:

  D(θ̃k − θ*) = Dε_θ − D⁻¹A(η̃k − η*),
  H(η̃_{k+1} − η*) = Hε_η − H⁻¹Aᵀ(θ̃k − θ*).

Combining these two equations we derive, assuming ‖D⁻¹AH⁻²AᵀD⁻¹‖ = ‖M₀‖ = ν < 1,

  D(θ̃k − θ*) = D⁻¹(D²ε_θ − Aε_η) + D⁻¹AH⁻²AᵀD⁻¹ D(θ̃_{k−1} − θ*)
            = Σ_{l=1}^{k} M₀^{k−l} D⁻¹(D²ε_θ − Aε_η) + M₀^k D(θ̃0 − θ*)
            → Σ_{l=0}^{∞} M₀^{l} D⁻¹(D²ε_θ − Aε_η) =: D(θ̂ − θ*).

Because the limit θ̂ is independent of the initial point υ̃0 and because the profile θ̃ is a fixed point of the procedure, the unique limit satisfies θ̂ = θ̃. This argument is based on the fact that in this setting the functional is quadratic, such that the gradient satisfies

  ∇L(υ) = −𝒟²(υ − υ*) + 𝒟²ε.
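The geometric convergence in this toy model can be checked numerically. The sketch below alternates the two first-order equations for the quadratic functional L(υ) = −‖𝒟(υ − Y)‖²/2, whose global maximizer is Y itself; the blocks D, A, H are arbitrary matrices made up for the demonstration, chosen so that ν = ‖D⁻¹AH⁻¹‖ < 1.

```python
import numpy as np

p, m = 3, 4
D = np.diag([2.0, 1.5, 1.0])          # theta-block of the information matrix
H = np.eye(m)                          # eta-block
A = 0.2 * np.ones((p, m))              # off-diagonal block (made up for the demo)
nu = np.linalg.norm(np.linalg.inv(D) @ A @ np.linalg.inv(H), 2)  # condition (I)

rng = np.random.default_rng(1)
ups_star = rng.standard_normal(p + m)          # true parameter (theta*, eta*)
Y = ups_star + rng.standard_normal(p + m)      # toy observation Y = ups* + eps

# L(ups) = -||Dcal(ups - Y)||^2 / 2 with Dcal^2 = [[D^2, A], [A^T, H^2]];
# alternating the first-order conditions gives the two update equations below
D2, H2 = D @ D, H @ H
theta, eta = np.zeros(p), np.zeros(m)          # arbitrary starting point
gaps = []
for _ in range(30):
    # theta-step: grad_theta L = 0  <=>  D^2 (theta - Y_th) + A (eta - Y_et) = 0
    theta = Y[:p] - np.linalg.solve(D2, A @ (eta - Y[p:]))
    # eta-step:   grad_eta L = 0    <=>  H^2 (eta - Y_et) + A^T (theta - Y_th) = 0
    eta = Y[p:] - np.linalg.solve(H2, A.T @ (theta - Y[:p]))
    gaps.append(np.hypot(np.linalg.norm(theta - Y[:p]),
                         np.linalg.norm(eta - Y[p:])))

gap = gaps[-1]
print(f"nu = {nu:.3f}, gap to the maximizer after 30 sweeps = {gap:.2e}")
```

Per full sweep the θ-error is multiplied by the matrix D⁻²AH⁻²Aᵀ, whose norm is at most ν², so the gap to the maximizer decays geometrically, as in the displayed recursion.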
Any smooth function is approximately quadratic around its maximizer, which motivates a local linear approximation of the gradient of the functional L to derive our results with similar arguments. This is done in the proof of Theorem 7. First it is ensured that the whole sequence (υ̃_{k,k(+1)})_{k∈N₀} satisfies, for some R0(x) > 0 and with probability greater than 1 − e^{−x},

  { υ̃_{k,k(+1)}, k ∈ N₀ } ⊆ { ‖𝒟(υ − υ*)‖ ≤ R0(x) },  (27)

where 𝒟² := −∇²EL(υ*) (see Theorem 28). In the second step we approximate, with ζ = L − EL,

  L(υ) − L(υ*) = ∇ζ(υ*)ᵀ(υ − υ*) − ‖𝒟(υ − υ*)‖²/2 + α(υ, υ*),  (28)

where α(υ, υ*) is defined by (28). Similar to the toy case above, this allows us, using the first order criterion of maximality and (27), to obtain a bound of the kind

  ‖𝒟(υ̃_{k,k} − υ*)‖ ≤ C Σ_{l=0}^{k} ν^l ( ‖𝒟⁻¹∇ζ(υ*)‖ + |α(υ̃_{l,l}, υ*)| ) ≤ C₁( ‖𝒟⁻¹∇ζ(υ*)‖ + ϵ(R0) ) + ν^k R0 =: r_k.

This is done in Lemma 32, using results from (Andresen and Spokoiny, 2014) to show that ϵ(R0) is small. Finally the same arguments as in (Andresen and Spokoiny, 2014) allow us to obtain our main result, using that with high probability for all k ∈ N₀

  υ̃_{k,k} ∈ { ‖𝒟(υ − υ*)‖ ≤ r_k }.

For the convergence result similar arguments are used. The only difference is that instead of (28) we use the approximation

  L(υ) − L(υ̃) = −‖𝒟(υ − υ̃)‖²/2 + α′(υ, υ̃),

exploiting that ∇L(υ̃) = 0, which allows us to obtain actual convergence to the ME.

4.2 A Desirable Set

In this section we explain the agenda of the proof. The first step of the proof is to find a desirable set Ω(x) ⊆ Ω of high probability on which a linear approximation of the gradient of the functional L(υ) can be carried out with sufficient accuracy. Once this set is found, all subsequent analysis concerns events in Ω(x) ⊆ Ω.
For this purpose define, with the local set Υ◦(r) defined in (15), for some K ∈ N the set

  Ω(x) := ∩_{k=0}^{K} ( C_{k,k} ∩ C_{k,k+1} ) ∩ C(x) ∩ { L(υ̃0) − L(υ*) ≥ −K0 },  (29)

where

  C_{k,k(+1)} = { ‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ R0(x), ‖D(θ̃k − θ*)‖ ≤ R0(x), ‖H(η̃_{k(+1)} − η*)‖ ≤ R0(x) },

and C(x) denotes the intersection of the sets

  { sup_{υ∈Υ◦(r)} (6ϵν₁)⁻¹ Y(υ) ≤ 2r² + z_Q(x, 4p*)² },
  { sup_{υ∈Υ◦(r)} (6ϵ̆ν₁)⁻¹ Y̆(υ) ≤ 2r² + z_Q(x, 2p + 2p*)² },
  { max{ ‖𝒟⁻¹∇L(υ*)‖, ‖D̆⁻¹∇̆θL(υ*)‖, ‖H⁻¹∇ηL(υ*)‖ } ≤ z(x) },
  { υ̃, υ̃_{θ*} ∈ Υ◦(r0(x)) }.

For ζ(υ) = L(υ) − EL(υ) the semiparametric normalized stochastic gradient gap is defined as

  Y̆(υ) = ‖D̆⁻¹( ∇̆θζ(υ) − ∇̆θζ(υ*) )‖,

the parametric normalized stochastic gradient gap Y(υ) is defined as

  Y(υ) = ‖𝒟⁻¹( ∇ζ(υ) − ∇ζ(υ*) )‖,

and r0(x) > 0 is chosen such that P(υ̃, υ̃_{θ*} ∈ Υ◦(r0)) ≥ 1 − e^{−x}, where υ̃_{θ*} := argmax_{υ∈Υ, Πθυ=θ*} L(υ). The constant z(x) in the definition of C(x) is only introduced for ease of notation. This makes some bounds less sharp but allows us to address all terms that are of order √(p* + x) with one symbol. It is defined as

  z(x) := z(x, 𝒟⁻¹V²𝒟⁻¹) ∨ z_Q(x, 4p*) ≈ √(p* + x),  (30)

where V² ∈ R^{p*×p*} denotes the covariance matrix from Section 2.1

  V² := Cov(∇L(υ*)).

Remark 24 z(x, ·) is explained in more detail in Section A and z_Q(x, ·) is defined in Equation (42). The constant z(x, IB) is comparable to the 1 − e^{−x}-quantile of the norm of X, where X ~ N(0, IB), i.e. it is of the order of the trace of IB. The constant z_Q(x, Q) arises as an exponential deviation bound for the supremum of a smooth process over a set with complexity described by Q.

Remark 25 We intersect the set with the event {υ̃, υ̃_{θ*} ∈ Υ◦(r0)}, where we a priori demand r0(x) > 0 to be chosen such that P(υ̃, υ̃_{θ*} ∈ Υ◦(r0)) ≥ 1 − e^{−x}. Note that condition (Er) together with (Lr) allows us to set √(p* + x) ≈ r0 ≤ R0 (see Theorem 28). In Section 4.3 we show that this set is of probability greater than 1 − 8e^{−x} − β(A). We want to explain the purpose of this set along the architecture of the proof of our main theorem.
{L(υ̃0) − L(υ*) ≥ −K0}: This set ensures that the first guess satisfies L(υ̃0) − L(υ*) ≥ −K0, which means that it is close enough to the target υ* ∈ R^{p*}. This fact allows us to obtain an a priori bound for the deviation of the sequence (υ̃_{k,k(+1)}) ⊂ Υ from υ* ∈ Υ with Theorem 28.

{‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ R0(x)}: As just mentioned, this event is of high probability due to L(υ̃0) − L(υ*) ≥ −K0 and Theorem 28. This allows us to concentrate the analysis on the set Υ◦(R0), on which Taylor expansions of the functional L: R^{p*} → R become accurate.

C(x): This set ensures that on Ω(x) ⊆ Ω all occurring random quadratic forms and stochastic errors are controlled by z(x) ∈ R. Consequently we can derive in the proof of Lemma 32 an a priori bound of the form ‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ r_k for a decreasing sequence of radii (r_k) ⊂ R₊ satisfying lim sup_{k→∞} r_k = Cz(x). Further, this set allows us to obtain in Lemma 34 the bounds for all k ∈ N. On Ω(x) ⊆ Ω we find υ̃_{k,k(+1)} ∈ Υ◦(r_k), such that we can follow the arguments of Theorem 2.2 of (Andresen and Spokoiny, 2014) to obtain the desired result with accuracy measured by ♦̆(r_k, x).

The sketch in Figure 1 illustrates the behavior of the first steps of the procedure. The axes correspond to the θ- and η-subspaces respectively. The two ellipsoids with center υ* and solid frame represent the local sets Υ◦(R_K) ⊆ Υ◦(R0), with R_K > 0 from Remark 5. We see that the initial guess υ̃0 lies in Υ◦(R). The elements (υ̃_{k,k(+1)}) of the alternating sequence all land inside the respective Υ◦(r_k), which are represented by shrinking ellipsoids centered in υ* with dotted frames. Note that not the set Υ◦(R_K) but Υ◦(R0) contains all points of the sequence.

4.3 Probability of the Desirable Set

Here we show that the set Ω(x) actually is of probability greater than 1 − 8e^{−x} − β. We prove the following two lemmas, which together yield the claim.

Lemma 26 The set C(x) satisfies

  P(C(x)) ≥ 1 − 7e^{−x}.

Figure 1: The behavior of the procedure for the first 4 steps of the alternating algorithm.
Proof The proof is similar to the proof of Theorem 3.1 in (Spokoiny, 2012). Denote

  A := { sup_{υ∈Υ◦(r)} (6ϵν₁)⁻¹ Y(υ) ≤ 2r² + z_Q(x, 4p*)² },
  B := { sup_{υ∈Υ◦(r)} (6ϵ̆ν₁)⁻¹ Y̆(υ) ≤ 2r² + z_Q(x, 2p + 2p*)² },
  C := { max{ ‖𝒟⁻¹∇L(υ*)‖, ‖D̆⁻¹∇̆θL(υ*)‖, ‖H⁻¹∇ηL(υ*)‖ } ≤ z(x) }.

We estimate

  P(C(x)) ≥ 1 − P(Aᶜ) − P(Bᶜ) − P(Cᶜ) − P( υ̃, υ̃_{θ*} ∉ Υ◦(r0) ) − P( ‖ξ̆‖ > z(x, Cov(ξ̆)) ).

We bound, using for both terms Theorem 42, which is applicable due to (ED1) and (ĔD1),

  P(Aᶜ) ≤ e^{−x},  P(Bᶜ) ≤ e^{−x}.

For the set C ⊆ Ω observe that we can use (I) and Lemma 27 to find ‖H⁻¹∇η‖ ∨ ‖D̆⁻¹∇̆θ‖ ≤ ‖𝒟⁻¹∇‖. This implies that

  { ‖𝒟⁻¹∇‖ ≤ z(x, IB) } ⊆ { ‖D̆⁻¹∇̆θ‖ ∨ ‖H⁻¹∇η‖ ≤ z(x, IB) }.

Using the deviation properties of quadratic forms as sketched in Section A we find

  P( ‖𝒟⁻¹∇‖ > z(x, IB) ) ≤ 2e^{−x},  P( ‖D̆⁻¹∇̆θ‖ > z(x, Cov(ξ̆)) ) ≤ 2e^{−x}.

By the choice of z(x) > 0 and r0 > 0 this gives the claim.

We cite Lemma B.2 of (Andresen and Spokoiny, 2014):

Lemma 27 Let

  𝒟² = [ D²  A ; Aᵀ  H² ] ∈ R^{(p+m)×(p+m)},  D ∈ R^{p×p}, H ∈ R^{m×m} invertible,  ‖D⁻¹AH⁻¹‖ < 1.

Then for any υ = (θ, η) ∈ R^{p+m} we have ‖H⁻¹η‖ ∨ ‖D⁻¹θ‖ ≤ ‖𝒟⁻¹υ‖.

The next step is to show that the set ∩_{k=1}^{K}(C_{k,k} ∩ C_{k,k+1}) has high probability that is independent of the number of necessary steps. A close look at the proof of Theorem 4.1 of (Spokoiny, 2012) shows that it actually yields the following modified version:

Theorem 28 ((Spokoiny, 2012), Theorem 4.1) Suppose (Er) and (Lr) with b(r) ≡ b. Further define the following random set

  Υ(K) := { υ ∈ Υ : L(υ) − L(υ*) ≥ −K }.

If for a fixed r0 and any r ≥ r0 the following conditions are fulfilled

  √(x + 2p*) ≤ 3ν_r g(r)/b,  x + 2p* + (b/(9ν_r²)) K ≤ r b,

then

  P( Υ(K) ⊆ Υ◦(r0) ) ≥ 1 − e^{−x}.

Note that with (I)

  ‖D(θ̃k − θ*)‖ ∨ ‖H(η̃_{k(+1)} − η*)‖ ≤ (1 − ν)⁻¹ ‖𝒟(υ̃_{k,k(+1)} − υ*)‖.

With assumption (B1) and

  R0(x) = ( 6ν_r / (b(1 − ν)) ) √( x + 2.4p* + (b/(9ν_r²)) K0 ),

this implies the desired result, as L(υ̃_{k,k(+1)}) − L(υ*) ≥ L(υ̃0) − L(υ*), such that with Theorem 28

  P( ∩_{k=0}^{K} (C_{k,k} ∩ C_{k,k+1}) ) ≥ P( ∩_{k=0}^{K} (C_{k,k} ∩ C_{k,k+1}) ∩ { L(υ̃0) − L(υ*) ≥ −K0 } )
      ≥ P( Υ(K0) ⊆ Υ◦( (1 − ν)R0(x) ) ) − β(A) ≥ 1 − e^{−x} − β(A).
Remark 29 This also shows that the sets of maximizers (υ̃_{k,k(+1)}) are nonempty and well defined, since the maximization always takes place on compact sets of the form {θ ∈ R^p: (θ, η) ∈ Υ◦(R0)} or {η ∈ R^m: (θ, η) ∈ Υ◦(R0)}.

Remark 30 To address the claim of Remark 5 we present the following lemma:

Lemma 31 On the set C(x) ∩ {υ̃0 ∈ Υ◦(R_K)} it holds

  L(υ̃0) − L(υ*) ≥ −(1/2 + ϵ(1 + 12ν₀))( R + z(x) )².

Proof With similar arguments as in the proof of Lemma 32 we have on C(x) ⊆ Ω that

  L(υ̃0) − L(υ*) ≥ E[L(υ̃0) − L(υ*)] − ‖𝒟⁻¹∇ζ(υ*)‖ R − |{ ∇ζ(υ̂) − ∇ζ(υ*) }ᵀ(υ̃0 − υ*)|
      ≥ −‖𝒟(υ̃0 − υ*)‖²/2 − ‖𝒟⁻¹∇ζ(υ*)‖ R − ‖𝒟⁻¹( ∇L(υ̂) − ∇L(υ*) )‖ R − ϵR²
      ≥ −R²/2 − z(x)R − ϵ(1 + 12ν₀)( R² + z(x)² )
      ≥ −(1/2 + ϵ(1 + 12ν₀))( R + z(x) )².

4.4 Proof of Convergence

We derive the a priori bound υ̃_{k,k(+1)} ∈ Υ◦(r_k) with an adequately decreasing sequence (r_k) ⊂ R₊ using the argument of Section 4.1, where lim sup r_k ≈ z(x). For this purpose we define the parametric uniform spread

  ♦(r, x) := ϵ( 2r² + z_Q(x, 4p*)² )(1 + 6ν₁²).  (31)

Lemma 32 Assume that for some sequence (r^{(l)}_k)_{k∈N} we have Ω(x) ⊆ {υ̃_{k,k(+1)} ∈ Υ◦(r^{(l)}_k)}. Then under the assumptions of Theorem 7 we get on Ω(x) for all k ∈ N₀

  ‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ 2√2(1 − ν)⁻¹ { z(x) + (1 + ν)ν^k R0(x) + Σ_{r=0}^{k} ν^r ♦(r^{(l)}_{k−r}, x) } =: r^{(l+1)}_k.

Proof 1. We first show that on Ω(x)

  D(θ̃k − θ*) = D⁻¹∇θL(υ*) − D⁻¹A(η̃k − η*) + τ(r^{(l)}_k),  (32)
  H(η̃k − η*) = H⁻¹∇ηL(υ*) − H⁻¹Aᵀ(θ̃_{k−1} − θ*) + τ(r^{(l)}_k),

where, with ♦(r, x) defined in (31), ‖τ(r)‖ ≤ ♦(r, x). The proof is the same in each step for both statements, such that we only prove the first one. The arguments presented here are similar to those of Theorem D.1 in (Andresen and Spokoiny, 2014). By assumption on Ω(x) we have υ̃_{k,k(+1)} ∈ Υ◦(r^{(l)}_k). Define with ζ = L − EL

  α(υ, υ*) := L(υ) − L(υ*) − { ∇ζ(υ*)ᵀ(υ − υ*) − ‖𝒟(υ − υ*)‖²/2 }.

Then

  L(υ) − L(υ*) = ∇ζ(υ*)ᵀ(υ − υ*) − ‖𝒟(υ − υ*)‖²/2 + α(υ, υ*)
      = ∇θζ(υ*)ᵀ(θ − θ*) − ‖D(θ − θ*)‖²/2 − (θ − θ*)ᵀA(η − η*) + ∇ηζ(υ*)ᵀ(η − η*) − ‖H(η − η*)‖²/2 + α(υ, υ*).

Setting ∇θL(θ̃k, η̃k) = 0 we find

  D(θ̃k − θ*) − D⁻¹∇θζ(υ*) + D⁻¹A(η̃k − η*) = D⁻¹∇θα(υ̃_{k,k}, υ*).
As we assume that υ̃_{k,k} ∈ Υ◦(R0), it suffices to show that with dominating probability

  sup_{(θ,η̃k)∈Υ◦(R0)} U_θ(θ, η̃k) ≤ ♦(r^{(l)}_k, x),  where  U_θ(θ, η̃k) := ‖D⁻¹{ ∇θL(υ̃_{k,k}) − ∇θL(υ*) + D²(θ − θ*) + A(η̃k − η*) }‖.

To see this note first that with Lemma 27, ‖D⁻¹Πθ𝒟υ‖ ≤ ‖𝒟⁻¹𝒟υ‖. This gives by condition (L0), Lemma 27 and a Taylor expansion

  sup_{(θ,η̃k)∈Υ◦(r)} EU_θ(θ, η̃k) ≤ sup_{υ∈Υ◦(r)} ‖D⁻¹Πθ{ ∇EL(υ) − ∇EL(υ*) + 𝒟²(υ − υ*) }‖
      ≤ sup_{υ∈Υ◦(r)} ‖D⁻¹Πθ𝒟‖ ‖𝒟⁻¹∇²EL(υ)𝒟⁻¹ − I_{p*}‖ ‖𝒟(υ − υ*)‖ ≤ ϵr².

For the remainder note that, again with Lemma 27,

  ‖D⁻¹( ∇θζ(υ) − ∇θζ(υ*) )‖ ≤ ‖𝒟⁻¹( ∇ζ(υ) − ∇ζ(υ*) )‖.

This yields that on Ω(x)

  sup_{(θ,η̃k)∈Υ◦(r)} | U_θ(θ, η̃k) − EU_θ(θ, η̃k) | ≤ sup_{υ∈Υ◦(r)} ‖D⁻¹( ∇θζ(υ) − ∇θζ(υ*) )‖ ≤ sup_{υ∈Υ◦(r)} Y(υ) ≤ 6ν₁ϵ( z_Q(x, 4p*)² + 2r² ).

Using the same argument for η̃k gives the claim.

2. We prove the a priori bound for the distance of the k-th estimator to the oracle,

  ‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ r^{(l+1)}_k.

To see this we first use the inequality

  ‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ √2 ‖D(θ̃k − θ*)‖ + √2 ‖H(η̃_{k(+1)} − η*)‖.

Now we find with (32)

  ‖D(θ̃k − θ*)‖ ≤ ‖D⁻¹∇θL(υ*)‖ + ‖D⁻¹A(η̃k − η*)‖ + ♦(r^{(l)}_k, x) ≤ ‖D⁻¹∇θL(υ*)‖ + ‖D⁻¹AH⁻¹‖ ‖H(η̃k − η*)‖ + ♦(r^{(l)}_k, x).

Next we use that on Ω(x)

  ‖D⁻¹AH⁻¹‖ ≤ ν,  ‖D⁻¹∇θL(υ*)‖ ≤ z(x),  ‖H⁻¹∇ηL(υ*)‖ ≤ z(x),
  ‖H(η̃k − η*)‖ ≤ ‖H⁻¹∇ηL(υ*)‖ + ‖H⁻¹Aᵀ(θ̃_{k−1} − θ*)‖ + ♦(r^{(l)}_k, x),

to derive the recursive formula

  ‖D(θ̃k − θ*)‖ ≤ (1 + ν){ z(x) + ♦(r^{(l)}_k, x) } + ν ‖D(θ̃_{k−1} − θ*)‖.

Deriving the analogous formula for ‖H(η̃k − η*)‖ and solving the recursion gives the claim.

Lemma 33 Assume the same as in Theorem 7. Further assume that (17) is met with C(ν) defined in (18). Then

  Ω(x) ⊆ { υ̃_{k,k(+1)} ∈ Υ◦(r*_k) },

where, with C(ν) ≥ 0 defined in (18),

  r*_k ≤ ( C(ν) + 4ϵC(ν)⁴ z(x) ) z(x) / ( 1 − ϵC(ν)z(x) ) + ν^k ( C(ν) + 4ϵC(ν)⁴ R0 ) R0.  (33)

Proof We prove this claim via induction. On Ω(x) we have υ̃_{k,k(+1)} ∈ Υ◦(R0); set r^{(0)}_k := R0. Now with Lemma 32 we find that if Ω(x) ⊆ { υ̃_{k,k(+1)} ∈ Υ◦(r^{(l)}_k) }, then Ω(x) ⊆ { υ̃_{k,k(+1)} ∈ Υ◦(r^{(l+1)}_k) }, with

  r^{(l+1)}_k = 2√2(1 − ν)⁻¹ { z(x) + (1 + ν)ν^k R0(x) + Σ_{r=0}^{k} ν^r ♦(r^{(l)}_{k−r}, x) }.

Setting l = 1 this gives

  r^{(1)}_k ≤ 2√2(1 − ν)⁻¹ { z(x) + ♦(R0, x) + (1 + ν)ν^k R0(x) }.

We show that lim sup_{l→∞} r^{(l)}_k ≤ r*_k, which implies Ω(x) ⊆ { υ̃_{k,k(+1)} ∈ Υ◦(r*_k) }.
So we have to show that lim sup_{l→∞} r^{(l)}_k ≤ r*_k in (33). For this we estimate further

  r^{(l)}_k ≤ 2√2(1 − ν)⁻¹ { z(x) + (1 + ν)ν^k R0(x) + ϵ Σ_{r=0}^{k} ν^r ( (r^{(l−1)}_{k−r})² + z(x)² ) }
      ≤ C(ν) { z(x) + ϵz(x)² + ν^k R0 + ϵ Σ_{r=0}^{k} ν^r (r^{(l−1)}_{k−r})² },

where C(ν) > 0 is defined in (18). We set

  A^{(l)}_{s,k} := Σ_{r_1=0}^{k−1} ν^{r_1} . . . Σ_{r_s=0}^{k−r_1−...−r_{s−1}−1} ν^{r_s} ( r^{(l−1)}_{k−r_1−...−r_s} )^{2^s}

and claim that

  A^{(l)}_{s,k} ≤ 4^{Σ_{t=0}^{s−1} 2^t} C(ν)^{2^s} { (1 − ν)^{−Σ_{t=0}^{s−1} 2^t} ( z(x) + ϵz(x)² )^{2^s} + ν^k (1 − ν)^{−Σ_{t=0}^{s−1} 2^t} R0^{2^s} } + 4^{Σ_{t=0}^{s−1} 2^t} (C(ν)ϵ)^{2^s} A^{(l−1)}_{s+1,k}.  (34)

We prove this claim via induction. Clearly

  A^{(l)}_{1,k} = Σ_{r_1=0}^{k−1} ν^{r_1} ( r^{(l−1)}_{k−r_1} )²
      ≤ 4C(ν)² Σ_{r_1=0}^{k−1} ν^{r_1} { ( z(x) + ϵz(x)² )² + ν^{2(k−r_1)} R0² } + 4C(ν)²ϵ² Σ_{r_1=0}^{k−1} ν^{r_1} ( Σ_{r_2=0}^{k−r_1−1} ν^{r_2} ( r^{(l−2)}_{k−r_1−r_2} )² )²
      ≤ 4C(ν)² { (1 − ν)⁻¹ ( z(x) + ϵz(x)² )² + ν^k (1 − ν)⁻¹ R0² } + 4C(ν)²ϵ² A^{(l−1)}_{2,k}.

Furthermore

  A^{(l)}_{s,k} = Σ_{r_1=0}^{k−1} ν^{r_1} A^{(l)}_{s−1,k−r_1}.

Plugging in (34) we get for s ≥ 2

  A^{(l)}_{s,k} ≤ Σ_{r_1=0}^{k−1} ν^{r_1} [ 4^{Σ_{t=0}^{s−2} 2^t} C(ν)^{2^{s−1}} { (1 − ν)^{−Σ_{t=0}^{s−2} 2^t} ( z(x) + ϵz(x)² )^{2^{s−1}} + ν^{k−r_1} (1 − ν)^{−Σ_{t=0}^{s−2} 2^t} R0^{2^{s−1}} } + 4^{Σ_{t=0}^{s−2} 2^t} (C(ν)ϵ)^{2^{s−1}} A^{(l−1)}_{s,k−r_1} ].

Shifting the index and a direct calculation then lead to

  A^{(l)}_{s,k} ≤ 4^{Σ_{t=0}^{s−1} 2^t} C(ν)^{2^s} { (1 − ν)^{−Σ_{t=0}^{s−1} 2^t} ( z(x) + ϵz(x)² )^{2^s} + ν^k (1 − ν)^{−Σ_{t=0}^{s−1} 2^t} R0^{2^s} } + 4^{Σ_{t=0}^{s−1} 2^t} (C(ν)ϵ)^{2^s} Σ_{r_1=0}^{k−1} ν^{r_1} ( A^{(l−1)}_{s,k−r_1} )²,

which gives (34) with (35). Similarly we can prove A^{(1)}_{s,k} = (1 − ν)^{−(2^s−1)} R0^{2^s}. Define

  λ_s := 4^{2^s−1} C(ν)^{2^s},  β_s := 4^{2^s−1} (C(ν)ϵ)^{2^s},  z_s(x) := (1 − ν)^{−(2^s−1)} ( z(x) + ϵz(x)² )^{2^s},  R_s := (1 − ν)^{−(2^s−1)} R0^{2^s}.

Then

  r^{(l)}_k ≤ C(ν) { z(x) + ϵz(x)² + ν^k R0 + ϵ A^{(l)}_{1,k} } ≤ Σ_{s=0}^{l−1} β_s z_s(x) + ν^k Σ_{s=0}^{l−1} β_s R_s + β_l R_l.  (36)

We estimate further

  Σ_{s=0}^{l−1} β_s z_s(x) − C(ν)( z(x) + ϵz(x)² ) ≤ ϵ 4² C(ν)⁴ (1 − ν)⁻¹ ( z(x) + ϵz(x)² )² Σ_{s≥1} ( 4ϵC(ν)(1 − ν)⁻¹ ( z(x) + ϵz(x)² ) )^{2^s − 2}.
Assuming (17) and the definition of R0 ≥ z(x), this gives

  Σ_{s=0}^{l−1} β_s z_s(x) ≤ ( C(ν) + 4ϵC(ν)⁴ z(x) ) z(x) / ( 1 − ϵC(ν)z(x) ).

With the same argument we find under (17) that

  Σ_{s=0}^{l−1} β_s R_s ≤ ν^k ( C(ν) + 4ϵC(ν)⁴ R0 ) R0.

Additionally (17) implies

  β_l R_l ≤ ( 4ϵC(ν)(1 − ν)⁻¹ )^{2^l − 1} R0^{2^l} → 0 (l → ∞).

Plugging these bounds into (36) and letting l → ∞ gives the claim.

4.5 Result after Convergence

In the previous section we showed that

  Ω(x) ⊆ ∩_{k∈N} { υ̃_{k,k} ∈ Υ◦(r*_k), υ̃_{k,k+1} ∈ Υ◦(r*_k) } ∩ { υ̃, υ̃_{θ*} ∈ Υ◦(r0) },

where r*_k is defined in (33). The claim of Theorem 7 follows with the following lemma:

Lemma 34 Assume (ĔD1), (L̆0) and (I). Then it holds on Ω(x) for all k ∈ N that

  ‖D̆(θ̃k − θ*) − ξ̆‖ ≤ ♦̆(r_k, x),  (37)

  | 2{ max_η L(θ̃k, η) − max_η L(θ*, η) } − ‖ξ̆‖² | ≤ 5( ‖ξ̆‖ + ♦̆(r_k, x) ) ♦̆(r_k, x),  (38)

where the spread ♦̆(r, x) is defined in (21) and where r_k := r*_k ∨ r0.

Proof The proof is nearly the same as that of Theorem 2.2 of (Andresen and Spokoiny, 2014), which is inspired by the proof of Theorem 1 of (Murphy and van der Vaart, 2000). So we only sketch it and refer the reader to (Andresen and Spokoiny, 2014) for the skipped arguments. We define

  l: R^p × Υ → R,  (θ1, θ2, η) ↦ L(θ1, η + H⁻²Aᵀ(θ2 − θ1)).

Then

  ∇_{θ1} l(θ1, θ2, η) = ∇̆θL(θ1, η + H⁻²Aᵀ(θ2 − θ1)),  θ̃k = argmax_θ l(θ, θ̃k, η̃k),

such that ∇̆θL(θ̃k, η̃k) = 0. This gives

  D̆(θ̃k − θ*) − ξ̆ = D̆⁻¹{ ∇̆θL(θ̃k, η̃k) − ∇̆θL(υ*) } + D̆(θ̃k − θ*).

Now the right hand side can be bounded just as in the proof of Theorem 2.2 of (Andresen and Spokoiny, 2014). This gives (37). For (38) we can represent

  max_η L(θ̃k, η) − max_η L(θ*, η) = l(θ̃k, θ̃k, η̃_{k+1}) − l(θ*, θ*, η̃_{θ*}),

where

  η̃_{θ*} := Πη argmax_{υ∈Υ, Πθυ=θ*} L(υ).

Due to the definition of θ̃k and η̃_{k+1}

  l(θ̃k, θ*, η̃_{θ*}) − l(θ*, θ*, η̃_{θ*}) ≤ max_η L(θ̃k, η) − max_η L(θ*, η) ≤ l(θ̃k, θ̃k, η̃_{k+1}) − l(θ*, θ̃k, η̃_{k+1}).

Again the remaining steps are exactly the same as in the proof of Theorem 2.2 of (Andresen and Spokoiny, 2014).

5. Proof of Corollary 13

Proof Note that with the argument of Section 4.3, P(Ω̆(x)) ≥ 1 − 8e^{−x} − β, where with Ω(x) from (29)

  Ω̆(x) = Ω(x) ∩ { υ̃ ∈ Υ◦(r0) }.
On Ω̆(x) it holds, due to Theorem 7 and due to Theorem 2.1 of (Andresen and Spokoiny, 2014),

  ‖D̆(θ̃k − θ*) − ξ̆‖ ≤ ♦̆(r_k, x),  ‖D̆(θ̃ − θ*) − ξ̆‖ ≤ ♦(r0, x).

Now the claim follows with the triangle inequality, noting that ♦(r0, x) ≤ ♦̆(r0, x).

6. Proof of Theorem 14

We prove this theorem in a similar manner to the convergence result in Lemma 32. Redefine the set

  Ω(x) := ∩_{k=0}^{K} ( C_{k,k} ∩ C_{k,k+1} ) ∩ C(x) ∩ { L(υ̃0) − L(υ*) ≥ −K0 } ∩ { sup_{υ∈Υ◦(R0(x))} Y^{(∇²)}(υ) ≤ 9ν₂ϵ₂ z₁(x, 6p*) R0(x) } ∩ { ‖𝒟⁻¹∇²ζ(υ*)‖ ≤ z(x, ∇²ζ(υ*)) },  (39)

where

  C_{k,k(+1)} = { ‖𝒟(υ̃_{k,k(+1)} − υ*)‖ ≤ R0(x), ‖D(θ̃k − θ*)‖ ≤ R0(x), ‖H(η̃_{k(+1)} − η*)‖ ≤ R0(x) },

and the normalized stochastic Hessian gap is measured in spectral norm,

  Y^{(∇²)}(υ) := ‖𝒟⁻¹( ∇²ζ(υ) − ∇²ζ(υ*) )‖.

We see that on Ω(x)

  υ̃_{k,k(+1)} ∈ Υ̃◦(R0) := { ‖𝒟(υ − υ̃)‖ ≤ R0 + r0 } ∩ Υ◦(R0).

Lemma 35 Under the conditions of Theorem 14

  P(Ω(x)) ≥ 1 − 3e^{−x} − β.

Proof The proof is very similar to the one presented in Section 4.3, so we only give a sketch. By assumption

  P( ‖𝒟⁻¹∇²ζ(υ*)‖ ≤ z(x, ∇²ζ(υ*)) ) ≥ 1 − e^{−x},

and due to (ED2) with Theorem 47

  P( sup_{υ∈Υ◦(R0(x))} Y^{(∇²)}(υ) ≤ 9ν₂ϵ₂ z₁(x, 6p*) R0(x) ) ≥ 1 − e^{−x}.

Lemma 36 Assume for some sequence (r^{(l)}_k) that Ω(x) ⊆ { ‖𝒟(υ̃_{k,k(+1)} − υ̃)‖ ≤ r^{(l)}_k }. Then we get on Ω(x)

  ‖𝒟(υ̃_{k,k(+1)} − υ̃)‖ ≤ √2 Σ_{r=0}^{k} ν^r τ(r^{(l)}_{k−r}) + 2√2 ν^k (R0 + r0) =: r^{(l+1)}_k,  (40)

where

  τ(r) ≤ { ϵR0 + 9ν₂ϵ₂ ‖𝒟⁻¹‖ z₁(x, 6p*) R0 + ‖𝒟⁻¹‖ z(x, ∇²ζ(υ*)) } r.

Proof 1. We first show that on Ω(x)

  D(θ̃k − θ̃) = −D⁻¹A(η̃k − η̃) + τ(r^{(l)}_k),
  H(η̃k − η̃) = −H⁻¹Aᵀ(θ̃_{k−1} − θ̃) + τ(r^{(l)}_k).

The proof is very similar to that of Lemma 32. Define

  α(υ, υ̃) := L(υ) − L(υ̃) + ‖𝒟(υ − υ̃)‖²/2.

Then

  L(υ) − L(υ̃) = −‖𝒟(υ − υ̃)‖²/2 + α(υ, υ̃) = −‖D(θ − θ̃)‖²/2 − (θ − θ̃)ᵀA(η − η̃) − ‖H(η − η̃)‖²/2 + α(υ, υ̃).

Setting ∇θL(θ̃k, η̃k) = 0 we find

  D(θ̃k − θ̃) = −D⁻¹A(η̃k − η̃) + D⁻¹∇θα(υ̃_{k,k}, υ̃).

We want to show

  sup_{(θ,η̃k)∈Υ̃◦(r^{(l)}_k)∩Υ◦(R0)} ‖D⁻¹∇θα((θ, η̃k), υ̃)‖ ≤ τ(r^{(l)}_k),  where  ‖D⁻¹∇θα(υ, υ̃)‖ = ‖D⁻¹{ ∇θL(υ) + D²(θ − θ̃) + A(η̃k − η̃) }‖.

To see this note that by assumption we have Ω(x) ⊆ {υ̃ ∈ Υ◦(r0)} ⊆ {υ̃ ∈ Υ◦(R0)}.
By condition $(L_0)$, Lemma 27, and a Taylor expansion we have
$$\sup_{(\theta,\tilde\eta_k)\in\tilde\Upsilon_\circ(r^{(l)}_k)\cap\Upsilon_\circ(R_0)}\|\mathbb E U(\theta,\tilde\eta_k)\| \le \sup_{\upsilon\in\tilde\Upsilon_\circ(r^{(l)}_k)\cap\Upsilon_\circ(R_0)}\big\|D^{-1}\Pi_\theta\big(\nabla\mathbb EL(\upsilon) - \nabla\mathbb EL(\tilde\upsilon)\big) - D(\upsilon - \upsilon^*)\big\| \le \sup_{\upsilon\in\Upsilon_\circ(R_0)}\big\|D^{-1}\Pi_\theta D\big(D^{-1}\nabla^2\mathbb EL(\upsilon)D^{-1} - I_{p^*}\big)\big\|\, r^{(l)}_k.$$
For the remainder note that, with $\zeta = L - \mathbb EL$, on $\Omega(x)$, using Lemma 27, we can bound
$$\sup_{(\theta,\tilde\eta_k)\in\tilde\Upsilon_\circ(r^{(l)}_k)\cap\Upsilon_\circ(R_0)}\|U_\theta(\theta,\tilde\eta_k) - \mathbb EU_\theta(\theta,\tilde\eta_k)\| \le \sup_{\upsilon\in\tilde\Upsilon_\circ(r^{(l)}_k)\cap\Upsilon_\circ(R_0)}\big\|D^{-1}\big(\nabla_\theta\zeta(\upsilon) - \nabla_\theta\zeta(\tilde\upsilon)\big)\big\| \le \sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\big\|D^{-1}\nabla^2\zeta(\upsilon)D^{-1}\big\|\, r^{(l)}_k$$
$$\le \Big(\sup_{\upsilon\in\Upsilon_\circ(R_0)}\big\|D^{-1}\big(\nabla^2\zeta(\upsilon) - \nabla^2\zeta(\upsilon^*)\big)D^{-1}\big\| + \big\|D^{-1}\nabla^2\zeta(\upsilon^*)D^{-1}\big\|\Big) r^{(l)}_k \le \big(9\nu_2\epsilon^2\|D^{-1}\|z_1(x, 6p^*)R_0 + \|D^{-1}\|z(x, \nabla^2\zeta(\upsilon^*))\big)\, r^{(l)}_k.$$
Using the same argument for $\tilde\eta_k$ gives the claim. Now the claim follows as in the proof of Lemma 32.

Lemma 37  Assume that $\kappa(x, R_0) < 1 - \nu$, where
$$\kappa(x, R_0) \stackrel{\rm def}{=} \frac{2\sqrt2(1+\nu)}{1-\nu}\big(\epsilon R_0 + 9\epsilon^2\nu_2\|D^{-1}\|z_1(x, 6p^*)R_0 + \|D^{-1}\|z_2(x, \nabla^2 L(\upsilon^*))\big).$$
Then
$$\Omega(x) \subseteq \big\{\tilde\upsilon_{k,k(+1)} \in \tilde\Upsilon_\circ(\mathrm r_k)\big\},$$
where $(\mathrm r_k)_{k\in\mathbb N}$ satisfies the bound (24).

Proof  Define for all $k \in \mathbb N_0$ the sequence $r^{(0)}_k \equiv R_0$. We estimate
$$\tau\big(r^{(l)}_k\big) \le \frac{1}{1-\nu}\big(\epsilon R_0 + 6\nu_1\epsilon^2\|D^{-1}\|z_1(x, 6p^*)R_0 + \|D^{-1}\|z(x, I\!B^{(\nabla 2)})\big)\, r^{(l)}_k,$$
such that by definition
$$2\sum_{r=0}^{k-1}\nu^r\,\tau\big(r^{(l)}_{k-r}\big) \le \kappa(x, R_0)\sum_{r=0}^{k-1}\nu^r r^{(l)}_{k-r}.$$
Plugging in the recursive formula for $r^{(l)}_k$ from (40) and denoting $\tilde R_0 \stackrel{\rm def}{=} R_0 + r_0$ we find
$$r^{(l)}_k \le \kappa(x,R_0)\sum_{r=0}^{k-1}\nu^r r^{(l-1)}_{k-r} + 2\sqrt2\,\nu^k\tilde R_0 \le \kappa(x,R_0)^2\sum_{r=0}^{k-1}\nu^r\sum_{s=0}^{k-r-1}\nu^s r^{(l-2)}_{k-r-s} + 2\sqrt2\,\nu^k\tilde R_0\big(\kappa(x,R_0)k + 1\big)$$
$$\le \kappa(x,R_0)^3\sum_{r=0}^{k-1}\nu^r\sum_{s=0}^{k-r-1}\nu^s\sum_{t=0}^{k-r-s-1}\nu^t r^{(l-3)}_{k-r-s-t} + 2\sqrt2\,\nu^k\tilde R_0\big(\kappa(x,R_0)^2k^2 + \kappa(x,R_0)k + 1\big).$$
By induction this gives for $l \in \mathbb N$
$$r^{(l)}_k \le \kappa(x,R_0)^l\sum_{r_1=0}^{k-1}\nu^{r_1}\sum_{r_2=0}^{k-r_1-1}\nu^{r_2}\cdots\sum_{r_l=0}^{k-\sum_{s=1}^{l-1}r_s-1}\nu^{r_l}\,\tilde R_0 + 2\sqrt2\,\nu^k\tilde R_0\sum_{s=0}^{l-1}\big(\kappa(x,R_0)k\big)^s$$
$$\le \begin{cases}\Big(\dfrac{\kappa(x,R_0)}{1-\nu}\Big)^l\tilde R_0 + \dfrac{2\sqrt2\,\nu^k}{1-\kappa(x,R_0)k}\,\tilde R_0, & \kappa(x,R_0)k \le 1,\\[6pt] \Big(\dfrac{\kappa(x,R_0)}{1-\nu}\Big)^l\tilde R_0 + \dfrac{2\sqrt2\,\nu^k k^l}{\kappa(x,R_0)k - 1}\,\tilde R_0, & \text{otherwise}.\end{cases}$$
By Lemma 36, $\Omega(x) \subseteq \{\tilde\upsilon_{k,k(+1)} \in \tilde\Upsilon_\circ(r^{(l)}_k)\}$ for every $l$. Set, provided $\kappa(x, R_0)/(1-\nu) < 1$,
$$l(k) \stackrel{\rm def}{=}\begin{cases}\infty, & \kappa(x,R_0)k \le 1,\\[4pt] \Big\lfloor\dfrac{k\log(\nu) + \log(2\sqrt2) - \log(\kappa(x,R_0)k - 1)}{\log(1-\nu) - \log(k)}\Big\rfloor, & \text{otherwise}.\end{cases}$$
Then with $\mathrm r^*_k \stackrel{\rm def}{=} r^{(l(k))}_k$ we get
$$\Omega(x) \subseteq \big\{\tilde\upsilon_{k,k(+1)} \in \tilde\Upsilon_\circ(\mathrm r^*_k)\big\}, \qquad \mathrm r^*_k \le \begin{cases}\dfrac{2\sqrt2\,\nu^k}{1-\kappa(x,R_0)k}\,\tilde R_0, & \kappa(x,R_0)k \le 1,\\[6pt] (1-\nu)^{\frac{k}{\log(k)}L(k) - 1}\,\tilde R_0, & \text{otherwise}.\end{cases}$$
The sequence $L(k) > 0$ is defined as
$$L(k) \stackrel{\rm def}{=} \Big\lfloor\frac{\log(2\sqrt2) - \log(\kappa(x,R_0)k - 1)}{\log(1/\nu)}\Big(1 + \frac{\log(1-\nu)}{\log(k)}\Big)\Big\rfloor,$$
where $\lfloor x\rfloor$ denotes the largest natural number smaller than $x > 0$. To ensure that $L(k) > 0$ we assume that $k\log(1/\nu) - \log(2\sqrt2) > 0$. Further, as $\kappa(x, R_0) < 1-\nu$ and $L(k)$ is only relevant once $\kappa(x, R_0)k > 1$, it follows that
$$0 < 1 + \frac{\log(1-\nu)}{\log(k)} < 1, \qquad L(k)\,\frac{\log(1/\nu)}{\log(2\sqrt2) - \log(\kappa(x,R_0)k - 1)} \le 1.$$
Consequently
$$(1-\nu)^{\frac{k}{\log(k)}L(k)} \le \kappa(x,R_0)^{\frac{1}{\log(k)}\left(\log(2\sqrt2) - \log(\kappa(x,R_0)k - 1)\right)}\,\nu^{\frac{k}{\log(k)}\log\frac{1-\nu}{\kappa(x,R_0)}} \le c_k^{\,k/\log(k)}, \qquad c_k \le \frac{\kappa(x,R_0)}{1-\nu} < 1.$$
Finally note that $\tilde R_0 \le 2R_0$ and the proof is complete.

Remark 38  As pointed out in Remark 18 the above result can be improved. Redefine $\Omega(x)$ as the intersection of the two sets in (29) and (39). Then $\mathbb P(\Omega(x)) \ge 1 - 10e^{-x}$. Also redefine
$$\kappa(x, \mathrm r) \stackrel{\rm def}{=} \frac{2\sqrt2(1+\nu)}{1-\nu}\big(\epsilon\mathrm r + 3\epsilon^2\|D^{-1}\|z_1(x, 6p^*)\mathrm r + \|D^{-1}\|z_2(x, \nabla^2 L(\upsilon^*))\big).$$
By the arguments of the proof of Theorem 7 we find, with $\mathrm r^*_k$ defined in (33),
$$\Omega(x) \subseteq \bigcap_{k\in\mathbb N}\big\{\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(\mathrm r^*_k)\big\}.$$
Using this in Lemma 36 instead of $\bigcap_{k\in\mathbb N}\{\tilde\upsilon_{k,k(+1)} \in \Upsilon_\circ(R_0)\}$ we can bound
$$\tau\big(r^{(l)}_k\big) \le \frac{1}{1-\nu}\big(\epsilon\mathrm r^*_k + 6\nu_1\epsilon^2\|D^{-1}\|z_1(x, 6p^*)\mathrm r^*_k + \|D^{-1}\|z(x, I\!B^{(\nabla 2)})\big)\, r^{(l)}_k.$$
Consequently, representing $\mathrm r^*_k = C\big(z(x) + \nu^k R_0\big)$ we find
$$r^{(l)}_k \le \kappa(x, Cz(x))\sum_{r=0}^{k-1}\nu^r r^{(l-1)}_{k-r} + 2\sqrt2\,\nu^k(R_0 + r_0) + C\epsilon\big(1 + \|D^{-1}\|z_1(x, 6p^*)\big)\sum_{r=0}^{k-1}\nu^k R_0\, r^{(l-1)}_{k-r}$$
$$\le \kappa(x, Cz(x))\sum_{r=0}^{k-1}\nu^r r^{(l-1)}_{k-r} + C_1\epsilon R_0 k\nu^k(R_0 + r_0), \qquad C_1 \stackrel{\rm def}{=} 2\sqrt2 + C\big(1 + \|D^{-1}\|z_1(x, 6p^*)\big).$$
With the same arguments as in the proof of Lemma 37 we infer
$$r^{(l)}_k/(R_0 + r_0) \le \begin{cases}\Big(\dfrac{\kappa(x,Cz(x))}{1-\nu}\Big)^l + \dfrac{k\nu^k C_1\epsilon R_0}{1-\kappa(x,Cz(x))k}, & \kappa(x,Cz(x))k \le 1,\\[6pt] \Big(\dfrac{\kappa(x,Cz(x))}{1-\nu}\Big)^l + \dfrac{\nu^k k^{l+1}C_1\epsilon R_0}{\kappa(x,Cz(x))k - 1}, & \text{otherwise},\end{cases}$$
and set
$$l(k) \stackrel{\rm def}{=}\begin{cases}\infty, & \kappa(x,Cz(x))k \le 1,\\[4pt] \Big\lfloor\dfrac{k\log(\nu) + \log(C_1\epsilon R_0) - \log(\kappa(x,Cz(x))k - 1) - \log(k)}{\log(1-\nu) - \log(k)}\Big\rfloor, & \text{otherwise}.\end{cases}$$
Then with $\mathrm r^*_k \stackrel{\rm def}{=} r^{(l(k))}_k$ we get, with a slight adaptation of $L(k)$,
$$\Omega(x) \subseteq \big\{\tilde\upsilon_{k,k(+1)} \in \tilde\Upsilon_\circ(\mathrm r^*_k)\big\}, \qquad \mathrm r^*_k \le \begin{cases}\dfrac{2}{1-\kappa(x,Cz(x))k}(R_0 + r_0), & \kappa(x,Cz(x))k \le 1,\\[6pt] 2\Big(\dfrac{\kappa(x,Cz(x))}{1-\nu}\Big)^{\frac{k}{\log(k)}L(k) - 1}(R_0 + r_0), & \text{otherwise}.\end{cases}$$
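The recursive bound in Lemma 37 has the generic form $r^{(l+1)} \le \kappa\, r^{(l)} + c\,\nu^k$ with contraction factor $\kappa < 1$, whose fixed point is $c\nu^k/(1-\kappa)$. A small numeric sketch (with hypothetical values for $\kappa$, $\nu$, $c$ and the initial radius) illustrates why iterating the bound improves the crude radius $R_0$ until only the geometrically decaying term is left:

```python
# Numeric sketch of a contraction-type radius recursion, as in Lemma 37:
#   r_{l+1} = kappa * r_l + c * nu**k,  with kappa < 1.
# The constants below are hypothetical; they only illustrate the mechanism.
kappa, nu, c, k = 0.3, 0.5, 1.0, 10
r = 10.0  # crude initial radius bound, playing the role of R0
for _ in range(200):
    r = kappa * r + c * nu**k
fixed_point = c * nu**k / (1 - kappa)
# r has contracted from the crude bound down to (approximately) the
# fixed point c * nu**k / (1 - kappa), which decays geometrically in k.
```

The same mechanism underlies the choice of the iteration depth $l(k)$ above: once the $(\kappa/(1-\nu))^l$ term is negligible, only the $\nu^k$-term determines $\mathrm r^*_k$.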
This gives the claim.

Acknowledgments

The first author was supported by Research Unit 1735 "Structural Inference in Statistics: Adaptation and Efficiency". The research was partly supported by the Russian Science Foundation grant (project 14-50-00150). Financial support by the German Research Foundation (DFG) through the Collaborative Research Center 649 "Economic Risk" is gratefully acknowledged.

Appendix A. Deviation Bounds for Quadratic Forms

This section is the same as Section A of Andresen and Spokoiny (2014). The following general result from Spokoiny (2012) helps to control the deviations of quadratic forms of the type $\|I\!B\xi\|^2$ for a given positive matrix $I\!B$ and a random vector $\xi$. It will be used several times in our proofs. Suppose that
$$\log\mathbb E\exp\{\gamma^\top\xi\} \le \|\gamma\|^2/2, \quad \gamma\in\mathbb R^p,\ \|\gamma\| \le g.$$
For a symmetric matrix $I\!B$, define
$$p_{I\!B} = \operatorname{tr}(I\!B^2), \qquad v_{I\!B}^2 = 2\operatorname{tr}(I\!B^4), \qquad \lambda_{I\!B} \stackrel{\rm def}{=} \|I\!B^2\| \stackrel{\rm def}{=} \lambda_{\max}(I\!B^2).$$
For ease of presentation, suppose that $g^2 \ge 2p_{I\!B}$. The other case only changes the constants in the inequalities. Note that $\|I\!B\xi\|^2 = \xi^\top I\!B^2\xi$. Define $\mu_c = 2/3$ and
$$2(x_c + 2) \stackrel{\rm def}{=} (g^2/\mu_c - p_{I\!B})/\lambda_{I\!B} + \log\det\big(I_p - \mu_c I\!B^2/\lambda_{I\!B}\big).$$

Proposition 39  Let $(ED_0)$ hold with $\nu_0 = 1$ and $g^2 \ge 2p_{I\!B}$. Then for each $x > 0$
$$\mathbb P\big(\|I\!B\xi\| \ge z(x, I\!B)\big) \le 2e^{-x},$$
where $z(x, I\!B)$ is defined by
$$z^2(I\!B, x) \stackrel{\rm def}{=}\begin{cases}p_{I\!B} + 2v_{I\!B}(x+1)^{1/2}, & x+1 \le v_{I\!B}/(18\lambda_{I\!B}),\\ p_{I\!B} + 6\lambda_{I\!B}(x+1), & v_{I\!B}/(18\lambda_{I\!B}) < x+1 \le x_c + 2,\\ \big(y_c + 2\lambda_{I\!B}(x - x_c + 1)/g_c\big)^2, & x > x_c + 1,\end{cases}$$
with $y_c^2 = p_{I\!B} + 6\lambda_{I\!B}(x_c + 2)$.

Appendix B. A Uniform Bound for the Norm of a Random Process

We want to derive for a random process $Y(\upsilon)\in\mathbb R^p$ a bound of the kind
$$\sup_{\mathrm r\le\mathrm r^*}\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\frac{1}{\epsilon}\|Y(\upsilon)\| \le 2\mathrm r^2\, C\, z_Q(x, p^*)^2.$$
This is a slightly stronger result than the one derived in Section D of Andresen and Spokoiny (2014), but the ideas employed here are very similar. We want to apply Corollary 2.5 of the supplement of Spokoiny (2012), which we cite here as a theorem. Note that we slightly generalized the formulation of the theorem to make it applicable in our setting. The proof remains the same.
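Proposition 39's quantile function $z(x, I\!B)$ is piecewise in $x$, switching from a $\sqrt{x}$ regime to a linear and finally a large-deviation regime. The following sketch evaluates the reconstructed piecewise formula for a given symmetric matrix; the constants `g_c` and `x_c` are hypothetical stand-ins for the quantities $g_c$ and $x_c$ defined in the text, and the large-deviation branch is a reconstruction from the garbled statement, so treat this as an illustration rather than the definitive formula.

```python
import numpy as np

def z_squared(x, B, g_c=4.0, x_c=3.0):
    """Sketch of the piecewise quantile z^2(B, x) of Proposition 39.

    B is a symmetric matrix; g_c and x_c are hypothetical constants
    standing in for the quantities g and x_c defined in the text.
    """
    B2 = B @ B
    p_B = np.trace(B2)                      # p_B  = tr(B^2)
    v_B = np.sqrt(2.0 * np.trace(B2 @ B2))  # v_B^2 = 2 tr(B^4)
    lam = np.linalg.eigvalsh(B2).max()      # largest eigenvalue of B^2
    if x + 1 <= v_B / (18.0 * lam):
        return p_B + 2.0 * v_B * np.sqrt(x + 1)       # sqrt(x) zone
    if x + 1 <= x_c + 2:
        return p_B + 6.0 * lam * (x + 1)              # linear zone
    y_c = np.sqrt(p_B + 6.0 * lam * (x_c + 2))        # continuity anchor y_c
    return (y_c + 2.0 * lam * (x - x_c + 1) / g_c) ** 2  # large deviations

B = np.diag([1.0, 0.5, 0.25])
vals = [z_squared(x, B) for x in (0.5, 2.0, 10.0)]
```

As expected of a deviation quantile, the sketch is increasing in $x$, and $z^2(I\!B, x) \ge p_{I\!B} = \mathbb E\|I\!B\xi\|^2$ for a standard Gaussian $\xi$.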
Theorem 40  Let $(\mathcal U(\mathrm r))_{0\le\mathrm r\le\mathrm r^*} \subset \mathbb R^{p^*}$ be a sequence of balls around $\upsilon^*$ induced by the metric $d(\cdot,\cdot)$. Let a random real-valued process $\mathcal U(\mathrm r, \upsilon)$ fulfill for any $0 \le \mathrm r \le \mathrm r^*$ that $\mathcal U(\mathrm r, \upsilon^*) = 0$ and

$(Ed)$  For any $\upsilon, \upsilon' \in \mathcal U(\mathrm r)$
$$\log\mathbb E\exp\Big\{\lambda\,\frac{\mathcal U(\mathrm r, \upsilon) - \mathcal U(\mathrm r, \upsilon')}{d(\upsilon, \upsilon')}\Big\} \le \frac{\nu_0^2\lambda^2}{2}, \quad |\lambda| \le g. \tag{41}$$

Finally assume that $\sup_{\upsilon\in\mathcal U(\mathrm r)}\mathcal U(\mathrm r, \upsilon)$ increases in $\mathrm r$. Then with probability greater than $1 - e^{-x}$
$$\frac{1}{3\nu_1}\,\mathcal U(\mathrm r, \upsilon) \le d(\upsilon, \upsilon^*)^2\, z_Q(x, \mathbb Q)^2,$$
where $\mathbb Q \stackrel{\rm def}{=} \mathbb Q(\mathcal U(\mathrm r^*))$ denotes the entropy of the set $\mathcal U(\mathrm r^*) \subset \mathbb R^{p^*}$ and where, with $g_0 = \nu_0 g$ and for some $\mathbb Q > 0$,
$$z_Q(x, \mathbb Q)^2 \stackrel{\rm def}{=}\begin{cases}\big(1 + \sqrt{x + \mathbb Q}\big)^2 & \text{if } 1 + \sqrt{x + \mathbb Q} \le g_0,\\ 1 + \big\{2g_0^{-1}(x + \mathbb Q) + g_0\big\}^2 & \text{otherwise}.\end{cases} \tag{42}$$

To use this result let $Y(\upsilon)$ be a smooth centered random vector process with values in $\mathbb R^p$ and let $D: \mathbb R^{p^*} \to \mathbb R^{p^*}$ be some linear operator. We aim at bounding the maximum of the norm $\|Y(\upsilon)\|$ over a vicinity $\Upsilon_\circ(\mathrm r) \stackrel{\rm def}{=} \{\|D(\upsilon - \upsilon^*)\| \le \mathrm r\}$ of $\upsilon^*$. Suppose that $Y(\upsilon)$ satisfies for each $0 < \mathrm r < \mathrm r^*$ and for all pairs $\upsilon, \upsilon' \in \Upsilon_\circ(\mathrm r) = \{\upsilon \in \Upsilon: \|D(\upsilon - \upsilon')\| \le \mathrm r\} \subset \mathbb R^{p^*}$
$$\sup_{\|u\|\le1}\log\mathbb E\exp\Big\{\lambda\,\frac{u^\top\big(Y(\upsilon) - Y(\upsilon')\big)}{\epsilon\|D(\upsilon - \upsilon')\|}\Big\} \le \frac{\nu_0^2\lambda^2}{2}, \quad |\lambda| \le g. \tag{43}$$

Remark 41  In the setting of Theorem 7 we have $Y(\upsilon) = D^{-1}\big(\nabla\zeta(\upsilon) - \nabla\zeta(\upsilon^*)\big)$, and condition (43) becomes $(ED_1)$ from Section 2.1.

Theorem 42  Let a random $p$-vector process $Y(\upsilon)$ fulfill $Y(\upsilon^*) = 0$ and let condition (43) be satisfied. Then for each $0 \le \mathrm r \le \mathrm r^*$, on a set of probability greater than $1 - e^{-x}$,
$$\sup_{\mathrm r\le\mathrm r^*}\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\frac{1}{6\epsilon\nu_1}\|Y(\upsilon)\| \le 2\mathrm r^2\, z_Q(x, 2p^* + 2p)^2,$$
with $g_0 = \nu_0 g$.

Remark 43  Note that the entropy of the original set $\Upsilon_\circ(\mathrm r) \subset \mathbb R^{p^*}$ is equal to $2p^*$. So in order to control the norm $\|Y(\upsilon)\|$ one only pays with the additional summand $2p$.

Proof  In what follows, we use the representation
$$\|Y(\upsilon)\| = \epsilon\sup_{\|u\|\le\|D(\upsilon - \upsilon^*)\|}\frac{1}{\epsilon\|D(\upsilon - \upsilon^*)\|}\,u^\top Y(\upsilon).$$
This implies
$$\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\|Y(\upsilon)\| = \epsilon\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\sup_{\|u\|\le\|D(\upsilon - \upsilon^*)\|}\frac{1}{\epsilon\|D(\upsilon - \upsilon^*)\|}\,u^\top Y(\upsilon).$$
Due to Lemma 44 the process
$$\mathcal U(\mathrm r, \upsilon, u) \stackrel{\rm def}{=} \frac{1}{\epsilon\|D(\upsilon - \upsilon^*)\|}\,u^\top Y(\upsilon)$$
satisfies condition $(Ed)$ (see (41)) as a process on $\mathcal U(\mathrm r^*)$, where
$$\mathcal U(\mathrm r) \stackrel{\rm def}{=} \Upsilon_\circ(\mathrm r) \times B_{\mathrm r}(0). \tag{44}$$
Further, $\sup_{(\upsilon,u)\in\mathcal U(\mathrm r)}\mathcal U(\mathrm r, \upsilon, u)$ is increasing in $\mathrm r$. This allows to apply Theorem 40 to obtain the desired result. Set $d\big((\upsilon, u), (\upsilon', u')\big)^2 = \|D(\upsilon - \upsilon')\|^2 + \|u - u'\|^2$.
We get on a set of probability greater than $1 - e^{-x}$
$$\sup_{(\upsilon,u)\in\mathcal U(\mathrm r^*)}\frac{1}{6\epsilon\nu_1}\,\frac{\frac{1}{\|D(\upsilon - \upsilon^*)\|}u^\top Y(\upsilon)}{\|D(\upsilon - \upsilon^*)\|^2 + \|u\|^2} \le z_Q\big(x, \mathbb Q(\mathcal U(\mathrm r^*))\big)^2.$$
The constant $\mathbb Q(\mathcal U(\mathrm r^*)) > 0$ quantifies the complexity of the set $\mathcal U(\mathrm r^*) \subset \mathbb R^{p^*}\times\mathbb R^p$. We point out that for compact $M \subset \mathbb R^p$ we have $\mathbb Q(M) = 2p$ (see the supplement of Spokoiny (2012), Lemma 2.10). This gives $\mathbb Q(\mathcal U) = 2p^* + 2p$. Finally observe that
$$\sup_{\mathrm r\le\mathrm r^*}\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\frac{1}{6\epsilon\nu_1}\frac{\|Y(\upsilon)\|}{2\mathrm r^2} \le \sup_{\mathrm r\le\mathrm r^*}\sup_{(\upsilon,u)\in\mathcal U(\mathrm r)}\frac{1}{6\epsilon\nu_1}\,\frac{\frac{1}{\|D(\upsilon - \upsilon^*)\|}u^\top Y(\upsilon)}{\|D(\upsilon - \upsilon^*)\|^2 + \|u\|^2} = \sup_{(\upsilon,u)\in\mathcal U(\mathrm r^*)}\frac{1}{6\epsilon\nu_1}\,\frac{\frac{1}{\|D(\upsilon - \upsilon^*)\|}u^\top Y(\upsilon)}{\|D(\upsilon - \upsilon^*)\|^2 + \|u\|^2}.$$

Lemma 44  Suppose that $Y(\upsilon)$ satisfies for each $\|u\| \le 1$ and $|\lambda| \le g$ the inequality (43). Then the process
$$\mathcal U(\upsilon, u) = \frac{1}{2\epsilon\|D(\upsilon - \upsilon^*)\|}\,u^\top Y(\upsilon)$$
satisfies $(Ed)$ from (41) with $|\lambda| \le g/2$, $d\big((\upsilon, u), (\upsilon', u')\big)^2 = \|D(\upsilon - \upsilon')\|^2 + \|u - u'\|^2$, $\nu = 2\nu_0$ and $\mathcal U \subset \mathbb R^{p^* + p}$ defined in (44), i.e. for any $(\upsilon, u_1), (\upsilon', u_2) \in \mathcal U$
$$\log\mathbb E\exp\Big\{\lambda\,\frac{\mathcal U(\upsilon, u_1) - \mathcal U(\upsilon', u_2)}{d\big((\upsilon, u_1), (\upsilon', u_2)\big)}\Big\} \le \frac{(2\nu_0)^2\lambda^2}{2}, \quad |\lambda| \le g/2.$$

Proof  Let $(\upsilon, u_1), (\upsilon', u_2) \in \mathcal U$ and w.l.o.g. $\|u_1\| \le \|D(\upsilon - \upsilon^*)\| \wedge \|D(\upsilon' - \upsilon^*)\|$. By the Hölder inequality and (43), we find
$$\log\mathbb E\exp\Big\{\lambda\,\frac{\mathcal U(\upsilon, u_1) - \mathcal U(\upsilon', u_2)}{d\big((\upsilon, u_1), (\upsilon', u_2)\big)}\Big\} = \log\mathbb E\exp\Big\{\lambda\,\frac{\big[\mathcal U(\upsilon, u_1) - \mathcal U(\upsilon', u_1)\big] + \big[\mathcal U(\upsilon', u_1) - \mathcal U(\upsilon', u_2)\big]}{d\big((\upsilon, u_1), (\upsilon', u_2)\big)}\Big\}$$
$$\le \frac12\log\mathbb E\exp\Big\{2\lambda\, u_1^\top\Big(\frac{Y(\upsilon)}{\epsilon\|D(\upsilon - \upsilon^*)\|} - \frac{Y(\upsilon')}{\epsilon\|D(\upsilon' - \upsilon^*)\|}\Big)\Big\} + \frac12\log\mathbb E\exp\Big\{2\lambda\,\frac{(u_1 - u_2)^\top Y(\upsilon')}{\epsilon\|u_1 - u_2\|\,\|D(\upsilon' - \upsilon^*)\|}\Big\}$$
$$\le \frac12\log\mathbb E\exp\Big\{2\lambda\,\frac{u^\top\big(Y(\upsilon) - Y(\upsilon')\big)}{\epsilon\|D(\upsilon - \upsilon')\|}\Big\} + \frac12\log\mathbb E\exp\Big\{2\lambda\,\frac{u^\top\big(Y(\upsilon') - Y(\upsilon^*)\big)}{\epsilon\|D(\upsilon' - \upsilon^*)\|}\Big\} \le \frac{(2\nu_0)^2\lambda^2}{2}.$$

Appendix C. A Bound for the Spectral Norm of a Random Matrix Process

We want to derive for a random process $Y(\upsilon) \in \mathbb R^{p^*\times p^*}$ a bound of the kind
$$\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\|Y(\upsilon)\| \le C\epsilon^2 z_1(x, p^*)\,\mathrm r.$$
We derive such a bound in a very similar manner to Theorem E.1 of Andresen and Spokoiny (2014). We want to apply Corollary 2.2 of the supplement of Spokoiny (2012). Again we slightly generalized the formulation but the proof remains the same.

Corollary 45  Let $(\mathcal U(\mathrm r))_{0\le\mathrm r\le\mathrm r^*} \subset \mathbb R^{p^*}$ be a sequence of balls around $\upsilon^*$ induced by the metric $d(\cdot,\cdot)$. Let a random real-valued process $\mathcal U(\upsilon)$ fulfill $\mathcal U(\upsilon^*) = 0$ and

$(Ed)$  For any $\upsilon, \upsilon' \in \mathcal U(\mathrm r)$
$$\log\mathbb E\exp\Big\{\lambda\,\frac{\mathcal U(\upsilon) - \mathcal U(\upsilon')}{d(\upsilon, \upsilon')}\Big\} \le \frac{\nu_0^2\lambda^2}{2}, \quad |\lambda| \le g. \tag{45}$$
Then for each $0 \le \mathrm r \le \mathrm r^*$, on a set of probability greater than $1 - e^{-x}$,
$$\sup_{\upsilon\in\mathcal U(\mathrm r)}\mathcal U(\upsilon) \le 3\nu_1 z_1(x, \mathbb Q)\, d(\upsilon, \upsilon^*),$$
where $\mathbb Q \stackrel{\rm def}{=} \mathbb Q(\mathcal U(\mathrm r^*))$ denotes the entropy of the set $\mathcal U(\mathrm r^*) \subset \mathbb R^{p^*}$ and where, with $g_0 = \nu_0 g$ and for some $\mathbb Q > 0$,
$$z_1(x, \mathbb Q) \stackrel{\rm def}{=}\begin{cases}\sqrt{2(x + \mathbb Q)} & \text{if } \sqrt{2(x + \mathbb Q)} \le g_0,\\ g_0^{-1}(x + \mathbb Q) + g_0/2 & \text{otherwise}.\end{cases} \tag{46}$$

To use this result let $Y(\upsilon)$ be a smooth centered random process with values in $\mathbb R^{p^*\times p^*}$ and let $D: \mathbb R^{p^*} \to \mathbb R^{p^*}$ be some linear operator. We aim at bounding the maximum of the spectral norm $\|Y(\upsilon)\|$ over a vicinity $\Upsilon_\circ(\mathrm r) \stackrel{\rm def}{=} \{\|\upsilon - \upsilon^*\|_Y \le \mathrm r\}$ of $\upsilon^*$. Suppose that $Y(\upsilon)$ satisfies $Y(\upsilon^*) = 0$ and for each $0 < \mathrm r < \mathrm r^*$ and for all pairs $\upsilon, \upsilon' \in \Upsilon_\circ(\mathrm r) = \{\upsilon \in \Upsilon: \|\upsilon - \upsilon'\|_Y \le \mathrm r\} \subset \mathbb R^{p^*}$
$$\sup_{\|u_1\|\le1}\sup_{\|u_2\|\le1}\log\mathbb E\exp\Big\{\lambda\,\frac{u_1^\top\big(Y(\upsilon) - Y(\upsilon')\big)u_2}{\epsilon^2\|D(\upsilon - \upsilon')\|}\Big\} \le \frac{\nu_0^2\lambda^2}{2}, \quad |\lambda| \le g. \tag{47}$$

Remark 46  In the setting of Theorem 14 we have $\|\upsilon - \upsilon^*\|_Y = \|D(\upsilon - \upsilon^*)\|$ and
$$Y(\upsilon) = D^{-1}\nabla^2\zeta(\upsilon) - D^{-1}\nabla^2\zeta(\upsilon^*),$$
and condition (47) becomes $(ED_2)$ from Section 2.1.

Theorem 47  Let a random process $Y(\upsilon) \in \mathbb R^{p^*\times p^*}$ fulfill $Y(\upsilon^*) = 0$ and let condition (47) be satisfied. Then for each $0 \le \mathrm r \le \mathrm r^*$, on a set of probability greater than $1 - e^{-x}$,
$$\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\|Y(\upsilon)\| \le 9\epsilon^2\nu_2 z_1(x, 6p^*)\,\mathrm r,$$
with $g_0 = \nu_0 g$.

Remark 48  Note that the entropy of the original set $\Upsilon_\circ(\mathrm r) \subset \mathbb R^{p^*}$ is multiplied by 3. So in order to control the spectral norm $\|Y(\upsilon)\|$ one only pays with this factor.

Proof  In what follows, we use the representation
$$\|Y(\upsilon)\| = \epsilon^2\sup_{\|u_1\|\le\mathrm r}\sup_{\|u_2\|\le\mathrm r}\frac{1}{\epsilon^2\mathrm r^2}\,u_1^\top Y(\upsilon)u_2.$$
This implies
$$\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\|Y(\upsilon)\| = \epsilon^2\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\sup_{\|u_1\|\le\mathrm r}\sup_{\|u_2\|\le\mathrm r}\frac{1}{\epsilon^2\mathrm r^2}\,u_1^\top Y(\upsilon)u_2.$$
Due to Lemma 49 the process
$$\mathcal U(\upsilon, u_1, u_2) \stackrel{\rm def}{=} \frac{1}{\epsilon^2\mathrm r^2}\,u_1^\top Y(\upsilon)u_2$$
satisfies condition $(Ed)$ (see (45)) as a process on
$$\mathcal U(\mathrm r) \stackrel{\rm def}{=} \Upsilon_\circ(\mathrm r) \times B_{\mathrm r}(0) \times B_{\mathrm r}(0) \subset \mathbb R^{3p^*}. \tag{48}$$
This allows to apply Corollary 45 to obtain the desired result. We get on a set of probability greater than $1 - e^{-x}$
$$\sup_{\upsilon\in\Upsilon_\circ(\mathrm r)}\|Y(\upsilon)\| \le \sup_{(\upsilon, u_1, u_2)\in\mathcal U(\mathrm r)}\frac{\epsilon^2}{\mathrm r^2}\,u_1^\top Y(\upsilon)u_2 \le 9\epsilon^2\nu_2 z_1\big(x, \mathbb Q(\mathcal U(\mathrm r^*))\big)\,\mathrm r.$$
The constant $\mathbb Q(\mathcal U(\mathrm r)) > 0$ quantifies the complexity of the set $\mathcal U(\mathrm r) \subset \mathbb R^{3p^*}$. We point out that for compact $M \subset \mathbb R^{3p^*}$ we have $\mathbb Q(M) = 6p^*$ (see the supplement of Spokoiny (2012), Lemma 2.10). This gives the claim.
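The threshold form of $z_1(x, \mathbb Q)$ in (46) is continuous at the switch point: when $\sqrt{2(x + \mathbb Q)} = g_0$ we have $x + \mathbb Q = g_0^2/2$, so the second branch gives $g_0^{-1}(x + \mathbb Q) + g_0/2 = g_0/2 + g_0/2 = g_0$, matching the first. The short check below assumes the square roots lost in extraction belong in the first branch and in the condition, so treat it as a sketch of the intended formula rather than the definitive one.

```python
import math

def z1(x, Q, g0):
    # Sketch of the piecewise quantile of (46); the square roots are
    # reconstructed from context, so this illustrates the intended shape.
    s = math.sqrt(2.0 * (x + Q))
    if s <= g0:
        return s                      # sub-Gaussian zone
    return (x + Q) / g0 + g0 / 2.0    # sub-exponential zone

# At the switch point 2 * (x + Q) == g0**2 both branches coincide:
g0 = 3.0
xQ = g0**2 / 2.0              # value of x + Q at the threshold
left = math.sqrt(2.0 * xQ)    # first branch
right = xQ / g0 + g0 / 2.0    # second branch
```

This continuity is what makes the two deviation regimes glue together into a single quantile function.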
Lemma 49  Suppose that $Y(\upsilon) \in \mathbb R^{p^*\times p^*}$ satisfies $Y(\upsilon^*) = 0$ and for each $\|u_1\| \le 1$, $\|u_2\| \le 1$ and $|\lambda| \le g$ the inequality (47). Then the process
$$\mathcal U(\upsilon, u_1, u_2) = \frac{1}{2\epsilon^2\mathrm r^2}\,u_1^\top Y(\upsilon)u_2$$
satisfies $(Ed)$ from (45) with $\mathcal U \subset \mathbb R^{3p^*}$ defined in (48), with $|\lambda| \le g/3$ and with
$$d\big((\upsilon, u_1, u_2), (\upsilon', u_1', u_2')\big)^2 = \|D(\upsilon - \upsilon')\|^2 + \|u_1 - u_1'\|^2 + \|u_2 - u_2'\|^2,$$
i.e. for any $(\upsilon, u_1, u_2), (\upsilon', u_1', u_2') \in \mathcal U$
$$\log\mathbb E\exp\Big\{\lambda\,\frac{\mathcal U(\upsilon, u_1, u_2) - \mathcal U(\upsilon', u_1', u_2')}{d\big((\upsilon, u_1, u_2), (\upsilon', u_1', u_2')\big)}\Big\} \le \frac{(3\nu_0)^2\lambda^2}{2}, \quad |\lambda| \le g/3.$$

Proof  Let $(\upsilon, u_1, u_2), (\upsilon', u_1', u_2') \in \mathcal U$. By the Hölder inequality and (47), we find
$$\log\mathbb E\exp\Big\{\lambda\,\frac{\mathcal U(\upsilon, u_1, u_2) - \mathcal U(\upsilon', u_1', u_2')}{d\big((\upsilon, u_1, u_2), (\upsilon', u_1', u_2')\big)}\Big\} = \log\mathbb E\exp\Big\{\lambda\,\frac{\big[\mathcal U(\upsilon, u_1, u_2) - \mathcal U(\upsilon', u_1, u_2)\big] + \big[\mathcal U(\upsilon', u_1, u_2) - \mathcal U(\upsilon', u_1', u_2)\big] + \big[\mathcal U(\upsilon', u_1', u_2) - \mathcal U(\upsilon', u_1', u_2')\big]}{d\big((\upsilon, u_1, u_2), (\upsilon', u_1', u_2')\big)}\Big\}$$
$$\le \frac13\log\mathbb E\exp\Big\{3\lambda\,\frac{u_1^\top\big(Y(\upsilon) - Y(\upsilon')\big)u_2}{\epsilon^2\mathrm r^2\,\|D(\upsilon - \upsilon')\|}\Big\} + \frac13\log\mathbb E\exp\Big\{3\lambda\,\frac{(u_1 - u_1')^\top Y(\upsilon')u_2}{\epsilon^2\|u_1 - u_1'\|\,\mathrm r^2}\Big\} + \frac13\log\mathbb E\exp\Big\{3\lambda\,\frac{u_1'^\top Y(\upsilon')(u_2 - u_2')}{\epsilon^2\|u_2 - u_2'\|\,\mathrm r^2}\Big\}$$
$$\le \frac13\sup_{\|u_1\|\le1}\sup_{\|u_2\|\le1}\log\mathbb E\exp\Big\{3\lambda\,\frac{u_1^\top\big(Y(\upsilon) - Y(\upsilon')\big)u_2}{\epsilon^2\|D(\upsilon - \upsilon')\|}\Big\} + \frac23\sup_{\|u_1\|\le1}\sup_{\|u_2\|\le1}\log\mathbb E\exp\Big\{3\lambda\,\frac{u_1^\top\big(Y(\upsilon') - Y(\upsilon^*)\big)u_2}{\epsilon^2\|D(\upsilon' - \upsilon^*)\|}\Big\} \le \frac{(3\nu_0)^2\lambda^2}{2}.$$

References

A. Andresen. Finite sample analysis of profile M-estimation in the single index model. Electronic Journal of Statistics, 9(2):2528–2641, 2015.

A. Andresen and V. Spokoiny. Critical dimension in profile semiparametric estimation. Electronic Journal of Statistics, 8(2):3077–3125, 2014. ISSN 1935-7524. doi: 10.1214/14-EJS982.

S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv:1408.2156, 2014.

X. Chen. Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6:5549–5632, 2007.

G. Cheng. How many iterations are sufficient for efficient semiparametric estimation? Scandinavian Journal of Statistics, 40(3):592–618, 2013.

M. Delecroix, W. Haerdle, and M. Hristache. Efficient estimation in single-index regression. Technical report, SFB 373, Humboldt University Berlin, 1997.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

U. Grenander. Abstract Inference. Wiley, New York, 1981.

W. Haerdle, P. Hall, and H. Ichimura. Optimal smoothing in single-index models. Annals of Statistics, 21:157–178, 1993.

I. A. Ibragimov and R. Z. Khas'minskij. Statistical Estimation. Asymptotic Theory. Translated from the Russian by Samuel Kotz. Springer-Verlag, New York-Heidelberg-Berlin, 1981.

P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. STOC, pages 665–674, 2013.

R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.

M. R. Kosorok. Introduction to Empirical Processes and Semiparametric Inference. Springer Series in Statistics, 2005.

G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, New York, 1997.

S. A. Murphy and A. W. van der Vaart. On profile likelihood. Journal of the American Statistical Association, 95(450):449–465, 2000.

P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. NIPS, pages 2796–2804, 2013.

V. Spokoiny. Parametric estimation. Finite sample theory. Annals of Statistics, 40(6):2877–2909, 2012.

J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12:389–434, 2012.

C. F. J. Wu. On the convergence properties of the EM algorithm. Annals of Statistics, 11:95–103, 1983.

X. Yi, C. Caramanis, and S. Sanghavi. Alternating minimization for mixed linear regression. arXiv:1310.3745, 2013.