# Estimating Structured Vector Autoregressive Models

Igor Melnyk (MELNYK@CS.UMN.EDU), Arindam Banerjee (BANERJEE@CS.UMN.EDU)
Department of Computer Science and Engineering, University of Minnesota, Twin Cities

While considerable advances have been made in estimating high-dimensional structured models from independent data using Lasso-type models, limited progress has been made for settings in which the samples are dependent. We consider estimating structured VAR (vector auto-regressive) models, where the structure can be captured by any suitable norm, e.g., Lasso, Group Lasso, ordered weighted Lasso, etc. In the VAR setting with correlated noise, although there is strong dependence over time and across covariates, we establish bounds on the non-asymptotic estimation error of structured VAR parameters. The estimation error is of the same order as that of the corresponding Lasso-type estimator with independent samples, and the analysis holds for any norm. Our analysis relies on results in generic chaining, sub-exponential martingales, and the spectral representation of VAR models. Experimental results on synthetic and real data with a variety of structures are presented, validating the theoretical results.

## 1. Introduction

The past decade has seen considerable progress on approaches to structured parameter estimation, especially in the linear regression setting, where one considers regularized estimation problems of the form

$$
\hat{\beta} = \underset{\beta \in \mathbb{R}^q}{\operatorname{argmin}} \; \frac{1}{M}\|y - Z\beta\|_2^2 + \lambda_M R(\beta), \qquad (1)
$$

where $\{(y_i, z_i),\, i = 1, \ldots, M\}$, $y_i \in \mathbb{R}$, $z_i \in \mathbb{R}^q$, such that $y = [y_1, \ldots, y_M]^T$ and $Z = [z_1^T, \ldots, z_M^T]^T$, is the training set of $M$ independently and identically distributed (i.i.d.) samples, $\lambda_M > 0$ is a regularization parameter, and $R(\cdot)$ denotes a suitable norm (Tibshirani, 1996; Zou & Hastie, 2005; Yuan & Lin, 2006). Specific choices of $R(\cdot)$ lead to certain types of structured parameters being estimated. For example, the decomposable norm $R(\beta) = \|\beta\|_1$ yields Lasso, estimating sparse parameters; $R(\beta) = \|\beta\|_G$ gives Group Lasso, estimating group-sparse parameters; and $R(\beta) = \|\beta\|_{\mathrm{owl}}$, the ordered weighted L1 norm (OWL) (Bogdan et al., 2013), gives the sorted-L1-penalized estimator, clustering correlated regression parameters (Figueiredo & Nowak, 2014). Non-decomposable norms, such as the K-support norm (Argyriou et al., 2012) or the overlapping group sparsity norm (Jacob et al., 2009), can be used to uncover more complicated model structures. Theoretical analysis of such models, including sample complexity and non-asymptotic bounds on the estimation error, relies on the design matrix $Z$, usually assumed (sub-)Gaussian with independent rows, and on the specific norm $R(\cdot)$ under consideration (Raskutti et al., 2010; Rudelson & Zhou, 2013). Recent work has generalized such estimators to work with any norm (Negahban et al., 2012; Banerjee et al., 2014), still with i.i.d. rows in $Z$.

The focus of the current paper is on structured estimation in vector auto-regressive (VAR) models (Lutkepohl, 2007), arguably the most widely used family of multivariate time series models. VAR models have been applied widely, ranging from describing the behavior of economic and financial time series (Tsay, 2005) to modeling dynamical systems (Ljung, 1998) and estimating brain functional connectivity (Valdes-Sosa et al., 2005), among others.
A VAR model of order $d$ is defined as

$$
x_t = A_1 x_{t-1} + A_2 x_{t-2} + \cdots + A_d x_{t-d} + \epsilon_t, \qquad (2)
$$

where $x_t \in \mathbb{R}^p$ denotes a multivariate time series, $A_k \in \mathbb{R}^{p \times p}$, $k = 1, \ldots, d$, are the parameters of the model, and $d \geq 1$ is the order of the model. In this work, we assume that the noise $\epsilon_t \in \mathbb{R}^p$ follows a Gaussian distribution, $\epsilon_t \sim \mathcal{N}(0, \Sigma)$, with $E(\epsilon_t\epsilon_t^T) = \Sigma$ and $E(\epsilon_t\epsilon_{t+\tau}^T) = 0$ for $\tau \neq 0$. The VAR process is assumed to be stable and stationary (Lutkepohl, 2007), while the noise covariance matrix $\Sigma$ is assumed to be positive definite with bounded largest eigenvalue, i.e., $\Lambda_{\min}(\Sigma) > 0$ and $\Lambda_{\max}(\Sigma) < \infty$.

In the current context, the parameters $\{A_k\}$ are assumed to be structured, in the sense of having low values according to a suitable norm $R(\cdot)$. We consider a general setting where any norm can be applied to the rows $A_k(i,:) \in \mathbb{R}^p$ of $A_k$, allowing the possibility of different norms being applied to different rows of $A_k$, and of different norms for different parameter matrices $A_k$, $k = 1, \ldots, d$. Choosing the L1 norm $\|A_k(i,:)\|_1$ for all rows and all parameter matrices is a simple special case of our setting. We discuss certain other choices in Section 2.1, and discuss related results in Section 4.

In order to estimate the parameters, one can consider regularized estimators of the form (1), where $y_i$ and $z_i$ correspond to $x_t$ in the VAR setting. Unfortunately, unlike $(y_i, z_i)$ in (1), the $x_t$ are far from independent, having strong dependence across time and correlation across dimensions. As a result, existing results from the rich literature on regularized estimators for structured problems (Zhao & Yu, 2006; Wainwright, 2009; Meinshausen & Yu, 2009) cannot be directly applied to obtain sample complexities and estimation error bounds in VAR models.

Related Work: In the recent literature, the problem of estimating structured VAR models has been considered for the special case of the L1 norm. (Han & Liu, 2013) analyzed a constrained estimator based on the Dantzig selector (Candes & Tao, 2007) and established recovery results for the special case of the L1 norm. (Song & Bickel, 2011) considered a regularized VAR estimation problem under Lasso and Group Lasso penalties and derived oracle inequalities for the prediction error and estimation accuracy. However, their analysis is for the case when the dimensionality of the problem is fixed with respect to the sample size. Moreover, they employed an assumption on the dependency structure of the VAR, thus sidestepping the sample correlation issues mentioned earlier. The work of (Kock & Callot, 2015) studied a regularized Lasso-based estimator while allowing the problem dimensionality to grow with the sample size, utilizing suitable martingale concentration inequalities to handle the dependency structure. (Loh & Wainwright, 2011) considered L1 VAR estimation for first-order models ($d = 1$) assuming $\|A_1\|_2 < 1$, and the analysis was not extended to the general case of $d > 1$. In recent work, (Basu & Michailidis, 2015) considered a VAR Lasso estimator and established sample complexity and error bounds by building on the prior work of (Loh & Wainwright, 2011). Their approach exploits the spectral properties of a general VAR model of order $d$, providing insights into the dependency structure of the VAR process. However, in line with the existing literature, the analysis was tailored to the special case of the L1 norm, thus limiting its generality.

Our Contributions: Compared to the existing literature, our results are substantially more general, since the results and analysis apply to any norm $R(\cdot)$.
One may wonder, given the popularity of the L1 norm, why worry about other norms? Over the past decade, considerable effort has been devoted to generalizing L1-norm-based results to other norms (Negahban et al., 2012; Chatterjee et al., 2012; Banerjee et al., 2014; Figueiredo & Nowak, 2014). Our work obviates the need for a similar exercise for VAR models. Further, some of these norms have found a key niche in specific application areas, e.g., (Zhou et al., 2012; Yang et al., 2015). From a technical perspective, one may also wonder: once we have the result for the L1 norm, why should the extension to other norms not be straightforward? A key technical aspect of the estimation error analysis boils down to getting sharp concentration bounds for $R^*(Z^T\epsilon)$, where $R^*(\cdot)$ is the dual norm of $R(\cdot)$, $Z$ is the design matrix, and $\epsilon$ is the noise (Banerjee et al., 2014). For the special case of L1, the dual norm is $L_\infty$, and one can use a union bound to get the required concentration. In fact, this is exactly how the analysis in (Basu & Michailidis, 2015) was done. For general norms, the union bound is inapplicable. Our analysis is based on a considerably more powerful tool, generic chaining (Talagrand, 2006), yielding an analysis applicable to any norm and producing results in terms of geometric properties, such as Gaussian widths (Ledoux & Talagrand, 2013), of sets related to the norm. Results for specific norms can then be obtained by plugging in suitable bounds on the Gaussian widths (Chandrasekaran et al., 2012; Chen & Banerjee, 2015). We illustrate the idea by recovering known bounds for Lasso and Group Lasso, and obtaining new results for Sparse Group Lasso and OWL norms.

Finally, in terms of the core technical analysis, the application of generic chaining to the VAR estimation setting is not straightforward. In the VAR setting, generic chaining has to consider a stochastic process derived from a sub-exponential martingale difference sequence (MDS). We first generalize the classical Azuma-Hoeffding inequality, applicable to sub-Gaussian MDSs, to obtain an Azuma-Bernstein inequality for sub-exponential MDSs. Further, we use suitable representations of Talagrand's $\gamma$-functions (Talagrand, 2006) in the context of generic chaining to obtain bounds on $R^*(Z^T\epsilon)$ in terms of the Gaussian width $w(\Omega_R)$ of the unit norm ball $\Omega_R = \{u \in \mathbb{R}^{dp} \mid R(u) \leq 1\}$. Our estimation error bounds in the VAR setting are exactly of the same order as for Lasso-type models in the i.i.d. setting, implying, surprisingly, that the strong temporal dependency in the VAR setting has no adverse effect on the estimation.

The rest of the paper is organized as follows. In Section 2 we present the estimation problem for the structured VAR model. The main results on estimation guarantees are established in Section 3. We present experimental results in Section 4 and conclude in Section 5.

## 2. Structured VAR Models

In this section we formulate the structured VAR estimation problem and discuss its properties, which are essential in characterizing sample complexity and error bounds.

### 2.1. Regularized Estimator

To estimate the parameters of the VAR model, we transform the model in (2) into a form suitable for the regularized estimator (1). Let $(x_0, x_1, \ldots, x_T)$ denote the $T+1$ samples generated by the stable VAR model in (2); stacking them together, we obtain

$$
\begin{bmatrix} x_d^T \\ x_{d+1}^T \\ \vdots \\ x_T^T \end{bmatrix}
=
\begin{bmatrix}
x_{d-1}^T & x_{d-2}^T & \ldots & x_0^T \\
x_d^T & x_{d-1}^T & \ldots & x_1^T \\
\vdots & \vdots & \ddots & \vdots \\
x_{T-1}^T & x_{T-2}^T & \ldots & x_{T-d}^T
\end{bmatrix}
\begin{bmatrix} A_1^T \\ A_2^T \\ \vdots \\ A_d^T \end{bmatrix}
+
\begin{bmatrix} \epsilon_d^T \\ \epsilon_{d+1}^T \\ \vdots \\ \epsilon_T^T \end{bmatrix},
$$
which can also be compactly written as

$$
Y = XB + E, \qquad (3)
$$

where $Y \in \mathbb{R}^{N \times p}$, $X \in \mathbb{R}^{N \times dp}$, $B \in \mathbb{R}^{dp \times p}$, and $E \in \mathbb{R}^{N \times p}$ for $N = T - d + 1$. Vectorizing (column-wise) each matrix in (3), we get

$$
\mathrm{vec}(Y) = (I_{p \times p} \otimes X)\,\mathrm{vec}(B) + \mathrm{vec}(E), \quad \text{i.e.,} \quad y = Z\beta + \epsilon,
$$

where $y \in \mathbb{R}^{Np}$, $Z = (I_{p \times p} \otimes X) \in \mathbb{R}^{Np \times dp^2}$, $\beta \in \mathbb{R}^{dp^2}$, $\epsilon \in \mathbb{R}^{Np}$, and $\otimes$ is the Kronecker product. The covariance matrix of the noise $\epsilon$ is now $E[\epsilon\epsilon^T] = \Sigma \otimes I_{N \times N}$. Consequently, the regularized estimator takes the form

$$
\hat{\beta} = \underset{\beta \in \mathbb{R}^{dp^2}}{\operatorname{argmin}} \; \frac{1}{N}\|y - Z\beta\|_2^2 + \lambda_N R(\beta), \qquad (4)
$$

where $R(\beta)$ can be any vector norm that is separable along the rows of the matrices $A_k$. Specifically, if we denote $\beta = [\beta_1^T, \ldots, \beta_p^T]^T$ and write $A_k(i,:)$ for the $i$-th row of the matrix $A_k$, $k = 1, \ldots, d$, then our assumption is equivalent to

$$
R(\beta) = \sum_{i=1}^{p} R\!\left(\big[A_1(i,:)^T \; \ldots \; A_d(i,:)^T\big]^T\right). \qquad (5)
$$

To reduce clutter and without loss of generality, we assume the norm $R(\cdot)$ to be the same for each row $i$. Since the analysis decouples across rows, it is straightforward to extend our analysis to the case when a different norm is used for each row of the $A_k$, e.g., L1 for row one, L2 for row two, the K-support norm (Argyriou et al., 2012) for row three, etc. Observe that within a row, the norm need not be decomposable across columns.

The main difference between the estimation problem in (1) and the formulation in (4) is the strong dependence between the samples $(x_0, x_1, \ldots, x_T)$, violating the i.i.d. assumption on the data $\{(y_i, z_i),\, i = 1, \ldots, Np\}$. In particular, this leads to correlations between the rows and columns of the matrix $X$ (and consequently of $Z$). To deal with such dependencies, following (Basu & Michailidis, 2015), we utilize the spectral representation of the autocovariance of VAR models to control the dependencies in the matrix $X$.

### 2.2. Stability of VAR Model

Since VAR models are (linear) dynamical systems, for the analysis we need to establish conditions under which the VAR model (2) is stable, i.e., the time-series process does not diverge over time. For understanding stability, it is convenient to rewrite the VAR model of order $d$ in (2) as an equivalent VAR model of order 1:

$$
\begin{bmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-(d-1)} \end{bmatrix}
=
\begin{bmatrix}
A_1 & A_2 & \ldots & A_{d-1} & A_d \\
I & 0 & \ldots & 0 & 0 \\
0 & I & \ldots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \ldots & I & 0
\end{bmatrix}
\begin{bmatrix} x_{t-1} \\ x_{t-2} \\ \vdots \\ x_{t-d} \end{bmatrix}
+
\begin{bmatrix} \epsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad (6)
$$

where $A \in \mathbb{R}^{dp \times dp}$ denotes the block-companion matrix above. Therefore, the VAR process is stable if all the eigenvalues of $A$, i.e., the solutions $\lambda \in \mathbb{C}$ of $\det(\lambda I_{dp \times dp} - A) = 0$, satisfy $|\lambda| < 1$. Equivalently, expressed in terms of the original parameters $A_k$, stability is satisfied if all solutions of $\det\!\big(I - \sum_{k=1}^{d} A_k \lambda^{-k}\big) = 0$ satisfy $|\lambda| < 1$ (see Section 2.1.1 of (Lutkepohl, 2007) and Section 1.3 of the supplement for more details).
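As an aside, the stability condition above is easy to check numerically. The following minimal Python sketch (with hypothetical dimensions and randomly drawn coefficients, not taken from the paper) forms the companion matrix $A$ for a VAR($d$) model, verifies that its spectral radius is below one, and simulates data from the model in (2).

```python
import numpy as np

def companion_matrix(A_list):
    """Stack VAR(d) coefficient matrices [A_1, ..., A_d] into the dp x dp companion matrix."""
    d = len(A_list)
    p = A_list[0].shape[0]
    A = np.zeros((d * p, d * p))
    A[:p, :] = np.hstack(A_list)            # top block row: [A_1 A_2 ... A_d]
    A[p:, :-p] = np.eye((d - 1) * p)        # identity blocks on the sub-diagonal
    return A

def simulate_var(A_list, Sigma, T, rng):
    """Generate T + d samples from x_t = A_1 x_{t-1} + ... + A_d x_{t-d} + eps_t."""
    d, p = len(A_list), A_list[0].shape[0]
    x = np.zeros((T + d, p))
    for t in range(d, T + d):
        x[t] = sum(A_list[k] @ x[t - 1 - k] for k in range(d)) \
               + rng.multivariate_normal(np.zeros(p), Sigma)
    return x

rng = np.random.default_rng(0)
p, d = 10, 2                                # hypothetical dimensions
A_list = [0.3 * rng.standard_normal((p, p)) / np.sqrt(p) for _ in range(d)]
A = companion_matrix(A_list)
spectral_radius = np.abs(np.linalg.eigvals(A)).max()
print("stable VAR:", spectral_radius < 1)   # all eigenvalues of A must lie inside the unit circle
x = simulate_var(A_list, np.eye(p), T=500, rng=rng)
```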
### 2.3. Properties of Design Matrix X

In what follows, we analyze the covariance structure of the matrix $X$ in (3) using spectral properties of the VAR model. The results will then be used in establishing the high-probability bounds for the estimation guarantees in problem (4).

Denote any row of $X$ as $X_{i,:} \in \mathbb{R}^{dp}$, $1 \leq i \leq N$. Since we assumed that $\epsilon_t \sim \mathcal{N}(0, \Sigma)$, it follows that each row is distributed as $X_{i,:} \sim \mathcal{N}(0, C_X)$, where the covariance matrix $C_X \in \mathbb{R}^{dp \times dp}$ is the same for all $i$:

$$
C_X =
\begin{bmatrix}
\Gamma(0) & \Gamma(1) & \ldots & \Gamma(d-1) \\
\Gamma(1)^T & \Gamma(0) & \ldots & \Gamma(d-2) \\
\vdots & \vdots & \ddots & \vdots \\
\Gamma(d-1)^T & \Gamma(d-2)^T & \ldots & \Gamma(0)
\end{bmatrix}, \qquad (7)
$$

where $\Gamma(h) = E(x_t x_{t+h}^T) \in \mathbb{R}^{p \times p}$. It turns out that since $C_X$ is a block-Toeplitz matrix, its eigenvalues can be bounded as (see (Gutierrez-Gutierrez & Crespo, 2011))

$$
\inf_{\substack{1 \leq j \leq p \\ \omega \in [0, 2\pi]}} \Lambda_j[\rho(\omega)] \;\leq\; \Lambda_k[C_X] \;\leq\; \sup_{\substack{1 \leq j \leq p \\ \omega \in [0, 2\pi]}} \Lambda_j[\rho(\omega)], \quad 1 \leq k \leq dp, \qquad (8)
$$

where $\Lambda_k[\cdot]$ denotes the $k$-th eigenvalue of a matrix and, for $i = \sqrt{-1}$, $\rho(\omega) = \sum_{h=-\infty}^{\infty}\Gamma(h)e^{-hi\omega}$, $\omega \in [0, 2\pi]$, is the spectral density, i.e., a Fourier transform of the autocovariance matrix $\Gamma(h)$. The advantage of utilizing the spectral density is that it has a closed-form expression (see Section 9.4 of (Priestley, 1981)):

$$
\rho(\omega) = \left(I - \sum_{k=1}^{d} A_k e^{-ki\omega}\right)^{-1} \Sigma \left[\left(I - \sum_{k=1}^{d} A_k e^{-ki\omega}\right)^{-1}\right]^{*},
$$

where $*$ denotes the Hermitian (conjugate transpose) of a matrix. Therefore, from (8) we can establish the following lower bound:

$$
\Lambda_{\min}[C_X] \;\geq\; \Lambda_{\min}(\Sigma)/\Lambda_{\max}(\mathcal{A}) \;=\; L, \qquad (9)
$$

where we defined $\Lambda_{\max}(\mathcal{A}) = \max_{\omega \in [0, 2\pi]} \Lambda_{\max}(\mathcal{A}(\omega))$ for

$$
\mathcal{A}(\omega) = \left(I - \sum_{k=1}^{d} A_k^T e^{ki\omega}\right)\left(I - \sum_{k=1}^{d} A_k e^{-ki\omega}\right). \qquad (10)
$$

In establishing the high-probability bounds we will also need information about the vector $q = Xa \in \mathbb{R}^N$ for any $a \in \mathbb{R}^{dp}$, $\|a\|_2 = 1$. Since each element $X_{i,:}^T a \sim \mathcal{N}(0, a^T C_X a)$, it follows that $q \sim \mathcal{N}(0, Q_a)$ with a covariance matrix $Q_a \in \mathbb{R}^{N \times N}$. It can be shown (see Section 2.1.2 of the supplement) that $Q_a$ can be written as

$$
Q_a = (I \otimes a^T)\, C_U\, (I \otimes a), \qquad (11)
$$

where $C_U = E(UU^T)$ for $U = [X_{1,:}^T, \ldots, X_{N,:}^T]^T \in \mathbb{R}^{Ndp}$, which is obtained from the matrix $X$ by stacking all of its rows into a single vector, i.e., $U = \mathrm{vec}(X^T)$. In order to bound the eigenvalues of $C_U$ (and consequently of $Q_a$), observe that $U$ can be viewed as a vector obtained by stacking $N$ outputs of the VAR model in (6). Similarly as in (8), if we denote the spectral density of the VAR process in (6) as $\rho_X(\omega) = \sum_{h=-\infty}^{\infty}\Gamma_X(h)e^{-hi\omega}$, $\omega \in [0, 2\pi]$, where $\Gamma_X(h) = E[X_{j,:}X_{j+h,:}^T] \in \mathbb{R}^{dp \times dp}$, then we can write

$$
\inf_{\substack{1 \leq l \leq dp \\ \omega \in [0, 2\pi]}} \Lambda_l[\rho_X(\omega)] \;\leq\; \Lambda_k[C_U] \;\leq\; \sup_{\substack{1 \leq l \leq dp \\ \omega \in [0, 2\pi]}} \Lambda_l[\rho_X(\omega)], \quad 1 \leq k \leq Ndp.
$$

The closed-form expression of this spectral density is

$$
\rho_X(\omega) = \left(I - A e^{-i\omega}\right)^{-1} \Sigma_E \left[\left(I - A e^{-i\omega}\right)^{-1}\right]^{*},
$$

where $\Sigma_E$ is the covariance matrix of the noise vector and $A$ is the companion matrix, both as defined in expression (6). Thus, an upper bound on $C_U$ can be obtained as $\Lambda_{\max}[C_U] \leq \Lambda_{\max}(\Sigma)/\Lambda_{\min}(\mathcal{A}_X)$, where we defined $\Lambda_{\min}(\mathcal{A}_X) = \min_{\omega \in [0, 2\pi]}\Lambda_{\min}(\mathcal{A}_X(\omega))$ for

$$
\mathcal{A}_X(\omega) = \left(I - A^T e^{i\omega}\right)\left(I - A e^{-i\omega}\right). \qquad (12)
$$

Referring back to the covariance matrix $Q_a$ in (11), we get

$$
\Lambda_{\max}[Q_a] \;\leq\; \Lambda_{\max}(\Sigma)/\Lambda_{\min}(\mathcal{A}_X) \;=\; M. \qquad (13)
$$

We note that for a general VAR model there might not exist closed-form expressions for $\Lambda_{\max}(\mathcal{A})$ and $\Lambda_{\min}(\mathcal{A}_X)$. However, for some special cases there are results establishing bounds on these quantities (e.g., see Proposition 2.2 in (Basu & Michailidis, 2015)).

## 3. Regularized Estimation Guarantees

Denote by $\Delta = \hat{\beta} - \beta^*$ the error between the solution of the optimization problem (4) and $\beta^*$, the true value of the parameter. The focus of our work is to determine conditions under which the optimization problem in (4) has guarantees on the accuracy of the obtained solution, i.e., the error term is bounded: $\|\Delta\|_2 \leq \delta$ for some known $\delta$. To establish such conditions, we utilize the framework of (Banerjee et al., 2014). Specifically, the estimation error analysis is based on the following known results, adapted to our setting. The first one characterizes the restricted error set $\Omega_E$, to which the error $\Delta$ belongs.

Lemma 3.1 Assume that

$$
\lambda_N \;\geq\; r\,R^*\!\left[\tfrac{1}{N}Z^T\epsilon\right], \qquad (14)
$$

for some constant $r > 1$, where $R^*\!\left[\tfrac{1}{N}Z^T\epsilon\right]$ is the dual form of the vector norm $R(\cdot)$, defined as $R^*\!\left[\tfrac{1}{N}Z^T\epsilon\right] = \sup_{R(U)\leq 1}\left\langle \tfrac{1}{N}Z^T\epsilon, U\right\rangle$ for $U \in \mathbb{R}^{dp^2}$, where $U = [u_1^T, u_2^T, \ldots, u_p^T]^T$ and $u_i \in \mathbb{R}^{dp}$. Then the error vector $\Delta$ belongs to the set

$$
\Omega_E = \left\{ \Delta \in \mathbb{R}^{dp^2} \,\middle|\, R(\beta^* + \Delta) \leq R(\beta^*) + \tfrac{1}{r}R(\Delta) \right\}. \qquad (15)
$$

The second result from (Banerjee et al., 2014) establishes the upper bound on the estimation error.

Lemma 3.2 Assume that the restricted eigenvalue (RE) condition holds for $\mathrm{cone}(\Omega_E)$ and some constant $\kappa > 0$, i.e.,

$$
\frac{1}{N}\|Z\Delta\|_2^2 \;\geq\; \kappa\,\|\Delta\|_2^2 \quad \text{for all } \Delta \in \mathrm{cone}(\Omega_E), \qquad (16)
$$

where $\mathrm{cone}(\Omega_E)$ is the cone of the error set. Then

$$
\|\Delta\|_2 \;\leq\; \frac{1+r}{r}\,\frac{\lambda_N}{\kappa}\,\Psi(\mathrm{cone}(\Omega_E)), \qquad (17)
$$

where $\Psi(\mathrm{cone}(\Omega_E))$ is a norm compatibility constant, defined as $\Psi(\mathrm{cone}(\Omega_E)) = \sup_{U \in \mathrm{cone}(\Omega_E)} \frac{R(U)}{\|U\|_2}$.

Note that the above error bound is deterministic, i.e., if (14) and (16) hold, then the error satisfies the upper bound in (17). However, the results are stated in terms of quantities involving $Z$ and $\epsilon$, which are random. Therefore, in the following we establish high-probability bounds on the regularization parameter in (14) and on the RE condition in (16).
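For intuition on condition (14) in the Lasso case, where the dual norm $R^*$ is the $L_\infty$ norm, the following minimal Monte Carlo sketch (hypothetical dimensions and coefficient matrix, not from the paper) estimates $\big\|\tfrac{1}{N}X^T E_{:,j}\big\|_\infty$ across replications of a simulated VAR(1) process; this is the quantity that $\lambda_N/r$ must dominate.

```python
import numpy as np

rng = np.random.default_rng(1)
p, reps = 10, 50                              # hypothetical sizes
A1 = 0.5 * np.eye(p)                          # a simple stable VAR(1) coefficient matrix

def dual_norm_stat(N):
    """Estimate E || (1/N) X^T E[:, j] ||_inf for a VAR(1) process (Lasso case: R* = L_inf)."""
    vals = []
    for _ in range(reps):
        x = np.zeros((N + 1, p))
        eps = rng.standard_normal((N + 1, p))
        for t in range(1, N + 1):
            x[t] = A1 @ x[t - 1] + eps[t]
        X, E = x[:-1], eps[1:]                # rows X_{i,:} = x_{t-1}, noise rows E_{i,:} = eps_t
        vals.append(np.abs(X.T @ E[:, 0] / N).max())
    return np.mean(vals)

for N in [100, 500, 2000]:
    # the statistic shrinks roughly like sqrt(log p) / sqrt(N), guiding the choice of lambda_N
    print(N, dual_norm_stat(N))
```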
### 3.1. High Probability Bounds

In this section we present the main results of our work, followed by a discussion of their properties and an illustration of some special cases based on the popular Lasso and Group Lasso regularization norms. In Section 3.4 we present the main ideas of our proof technique, with all the details delegated to the supplement.

To establish a lower bound on the regularization parameter $\lambda_N$, we derive an upper bound $R^*\!\left[\tfrac{1}{N}Z^T\epsilon\right] \leq \alpha$ for some $\alpha > 0$, which then establishes the required relationship $\lambda_N \geq r\alpha \geq r\,R^*\!\left[\tfrac{1}{N}Z^T\epsilon\right]$.

Theorem 3.3 Let $\Omega_R = \{u \in \mathbb{R}^{dp} \mid R(u) \leq 1\}$, and define $w(\Omega_R) = E\left[\sup_{u \in \Omega_R}\langle g, u\rangle\right]$ to be the Gaussian width of the set $\Omega_R$ for $g \sim \mathcal{N}(0, I)$. For any $\epsilon_1 > 0$ and $\epsilon_2 > 0$, with probability at least $1 - c\exp(-\min(\epsilon_2^2, \epsilon_1) + \log(p))$ we have

$$
R^*\!\left[\tfrac{1}{N}Z^T\epsilon\right] \;\leq\; c_2(1+\epsilon_2)\,\frac{w(\Omega_R)}{\sqrt{N}} + c_1(1+\epsilon_1)\,\frac{w^2(\Omega_R)}{N}, \qquad (18)
$$

where $c$, $c_1$ and $c_2$ are positive constants.

To establish the restricted eigenvalue condition, we will show that $\inf_{\Delta \in \mathrm{cone}(\Omega_E)} \frac{\|(I_{p\times p}\otimes X)\Delta\|_2}{\|\Delta\|_2} \geq \nu$ for some $\nu > 0$, with high probability.

Theorem 3.4 Let $\Theta = \mathrm{cone}(\Omega_{E_j}) \cap S^{dp-1}$, where $S^{dp-1}$ is the unit sphere. The error set $\Omega_{E_j}$ is defined as

$$
\Omega_{E_j} = \left\{ \Delta_j \in \mathbb{R}^{dp} \,\middle|\, R(\beta_j^* + \Delta_j) \leq R(\beta_j^*) + \tfrac{1}{r}R(\Delta_j) \right\},
$$

for $r > 1$, $j = 1, \ldots, p$, where $\Delta = [\Delta_1^T, \ldots, \Delta_p^T]^T$ with each $\Delta_j$ of size $dp \times 1$, and $\beta^* = [\beta_1^{*T}, \ldots, \beta_p^{*T}]^T$ with $\beta_j^* \in \mathbb{R}^{dp}$. The set $\Omega_{E_j}$ is a part of the decomposition $\Omega_E = \Omega_{E_1} \times \cdots \times \Omega_{E_p}$, which follows from the assumption on the row-wise separability of the norm $R(\cdot)$ in (5). Also define $w(\Theta) = E\left[\sup_{u \in \Theta}\langle g, u\rangle\right]$ to be the Gaussian width of the set $\Theta$ for $g \sim \mathcal{N}(0, I)$ and $u \in \mathbb{R}^{dp}$. Then with probability at least $1 - c_1\exp(-c_2\eta^2 + \log(p))$, for any $\eta > 0$,

$$
\inf_{\Delta \in \mathrm{cone}(\Omega_E)} \frac{\|(I_{p\times p}\otimes X)\Delta\|_2}{\|\Delta\|_2} \;\geq\; \sqrt{NL} - \sqrt{M}\big(c\,w(\Theta) + \eta\big),
$$

where $c$, $c_1$, $c_2$ are positive constants, and $L$, $M$ are defined in (9) and (13).

### 3.2. Discussion

From Theorem 3.4, we can choose $\eta = \frac{1}{2}\!\left(\sqrt{NL}/\sqrt{M} - c\,w(\Theta)\right)$, so that the restricted eigenvalue condition holds with $\kappa_N = \frac{1}{2}\!\left(\sqrt{NL} - c\sqrt{M}\,w(\Theta)\right)$. Since $\kappa_N > 0$ must be satisfied, we can establish a lower bound on the number of samples $N$: $\sqrt{NL} \geq c\sqrt{M}\,w(\Theta)$, i.e., $N \geq O(w^2(\Theta))$. Examining this bound and using (9) and (13), we can conclude that the number of samples needed to satisfy the restricted eigenvalue condition is smaller if $\Lambda_{\min}(\mathcal{A}_X)$ and $\Lambda_{\min}(\Sigma)$ are larger and $\Lambda_{\max}(\mathcal{A})$ and $\Lambda_{\max}(\Sigma)$ are smaller. In turn, this means that the matrices $\mathcal{A}$ and $\mathcal{A}_X$ in (10) and (12) must be well conditioned and the VAR process must be stable, with eigenvalues well inside the unit circle (see Section 2.2). Alternatively, we can also understand the bound on $N$ as showing that large values of $M$ and small values of $L$ indicate stronger dependency in the data, thus requiring more samples for the RE condition to hold with high probability.

Analyzing Theorems 3.3 and 3.4 together, we can interpret the established results as follows. As the size and dimensionality $N$, $p$ and $d$ of the problem increase, we emphasize the scale of the results and use order notation to absorb the constants. Select a number of samples at least $N \geq O(w^2(\Theta))$ and let the regularization parameter satisfy $\lambda_N \geq O\!\left(\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N}\right)$. With high probability the restricted eigenvalue condition $\|Z\Delta\|_2 \geq \kappa_N\|\Delta\|_2$ for $\Delta \in \mathrm{cone}(\Omega_E)$ then holds, so that $\kappa = \kappa_N^2/N = O(1)$ is a positive constant. Moreover, the norm of the estimation error in the optimization problem (4) is bounded by $\|\Delta\|_2 \leq O\!\left(\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N}\right)\Psi(\mathrm{cone}(\Omega_{E_j}))$. Note that the norm compatibility constant $\Psi(\mathrm{cone}(\Omega_{E_j}))$ is assumed to be the same for all $j = 1, \ldots, p$, which follows from our assumption in (5).
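The Gaussian widths entering Theorems 3.3 and 3.4 can be estimated numerically by Monte Carlo, since $w(\Omega_R) = E[\sup_{u \in \Omega_R}\langle g, u\rangle] = E[R^*(g)]$. The sketch below (hypothetical sizes, not from the paper) does this for the L1 unit ball, where $R^*(g) = \|g\|_\infty$, and compares against the familiar $\sqrt{2\log(\mathrm{dim})}$ scaling.

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_width_l1_ball(dim, n_samples=20000):
    """Monte Carlo estimate of w({u : ||u||_1 <= 1}) = E sup_{||u||_1 <= 1} <g, u> = E ||g||_inf."""
    g = rng.standard_normal((n_samples, dim))
    return np.abs(g).max(axis=1).mean()

for dim in [50, 500, 5000]:
    w_hat = gaussian_width_l1_ball(dim)
    # for the L1 unit ball the width grows like sqrt(2 log(dim)), which drives the
    # w(Omega_R)/sqrt(N) term in the lambda_N bound of Theorem 3.3
    print(dim, round(w_hat, 3), round(np.sqrt(2 * np.log(dim)), 3))
```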
Consider now Theorem 3.3 and the bound on the regularization parameter $\lambda_N \geq O\!\left(\frac{w(\Omega_R)}{\sqrt{N}} + \frac{w^2(\Omega_R)}{N}\right)$. As the dimensionality of the problem $p$ and $d$ grows and the number of samples $N$ increases, the first term $\frac{w(\Omega_R)}{\sqrt{N}}$ will dominate the second one $\frac{w^2(\Omega_R)}{N}$. This can be seen by computing the $N$ for which the two terms become equal, $\frac{w(\Omega_R)}{\sqrt{N}} = \frac{w^2(\Omega_R)}{N}$, which happens at $N = w^2(\Omega_R)$. Therefore, we can rewrite our results as follows: once the restricted eigenvalue condition holds and $\lambda_N \geq O\!\left(\frac{w(\Omega_R)}{\sqrt{N}}\right)$, the error norm is upper-bounded by $\|\Delta\|_2 \leq O\!\left(\frac{w(\Omega_R)}{\sqrt{N}}\right)\Psi(\mathrm{cone}(\Omega_{E_j}))$.

### 3.3. Special Cases

While the presented results are valid for any norm $R(\cdot)$ that is separable along the rows of the $A_k$, it is instructive to specialize our analysis to a few popular regularization choices, such as the L1, Group Lasso, Sparse Group Lasso and OWL norms.

#### 3.3.1. Lasso

To establish results for the L1 norm, we assume that the parameter $\beta^*$ is $s$-sparse, which in our case means that $s$ is the largest number of non-zero elements in any $\beta_i^*$, $i = 1, \ldots, p$, i.e., in the combined $i$-th rows of the matrices $A_k$, $k = 1, \ldots, d$. Since L1 is decomposable, it can be shown that $\Psi(\mathrm{cone}(\Omega_{E_j})) \leq 4\sqrt{s}$. Next, since $\Omega_R = \{u \in \mathbb{R}^{dp} \mid R(u) \leq 1\}$, using Lemma 3 in (Banerjee et al., 2014) and the Gaussian width results in (Chandrasekaran et al., 2012), we can establish that $w(\Omega_R) \leq O(\sqrt{\log(dp)})$. Therefore, based on Theorem 3.3 and the discussion at the end of Section 3.2, the bound on the regularization parameter takes the form $\lambda_N \geq O\!\left(\sqrt{\log(dp)/N}\right)$. Hence, the estimation error is bounded by $\|\Delta\|_2 \leq O\!\left(\sqrt{s\log(dp)/N}\right)$ as long as $N > O(\log(dp))$.

#### 3.3.2. Group Lasso

To establish results for the Group Lasso norm, we assume that for each $i = 1, \ldots, p$ the vector $\beta_i \in \mathbb{R}^{dp}$ can be partitioned into a set of $K$ disjoint groups, $G = \{G_1, \ldots, G_K\}$, with the size of the largest group $m = \max_k |G_k|$. The Group Lasso norm is defined as $\|\beta\|_{GL} = \sum_{k=1}^{K}\|\beta_{G_k}\|_2$. We assume that the parameter $\beta^*$ is $s_G$-group-sparse, which means that the largest number of non-zero groups in any $\beta_i^*$, $i = 1, \ldots, p$, is $s_G$. Since the Group Lasso norm is decomposable, as was established in (Negahban et al., 2012), it can be shown that $\Psi(\mathrm{cone}(\Omega_{E_j})) \leq 4\sqrt{s_G}$. Similarly as in the Lasso case, using Lemma 3 in (Banerjee et al., 2014), we get $w(\Omega_{R_{GL}}) \leq O(\sqrt{m + \log(K)})$. The bound on $\lambda_N$ takes the form $\lambda_N \geq O\!\left(\sqrt{(m + \log(K))/N}\right)$. Combining these derivations, we obtain the error bound $\|\Delta\|_2 \leq O\!\left(\sqrt{s_G(m + \log(K))/N}\right)$ for $N > O(m + \log(K))$.

#### 3.3.3. Sparse Group Lasso

Similarly as in Section 3.3.2, we assume that we have $K$ disjoint groups of size at most $m$. The Sparse Group Lasso norm enforces sparsity not only across but also within the groups, and is defined as $\|\beta\|_{SGL} = \alpha\|\beta\|_1 + (1-\alpha)\sum_{k=1}^{K}\|\beta_{G_k}\|_2$, where $\alpha \in [0, 1]$ is a parameter which regulates the convex combination of the Lasso and Group Lasso penalties. Note that since $\|\beta\|_2 \leq \|\beta\|_1$, it follows that $\|\beta\|_{GL} \leq \|\beta\|_{SGL}$. As a result, $\beta \in \Omega_{R_{SGL}}$ implies $\beta \in \Omega_{R_{GL}}$, so that $\Omega_{R_{SGL}} \subseteq \Omega_{R_{GL}}$ and thus $w(\Omega_{R_{SGL}}) \leq w(\Omega_{R_{GL}}) \leq O(\sqrt{m + \log(K)})$, according to Section 3.3.2. Assuming that $\beta^*$ is $s$-sparse and $s_G$-group-sparse, and noting that the norm is decomposable, we get $\Psi(\mathrm{cone}(\Omega_{E_j})) \leq 4(\alpha\sqrt{s} + (1-\alpha)\sqrt{s_G})$. Consequently, the error bound is $\|\Delta\|_2 \leq O\!\left((\alpha\sqrt{s} + (1-\alpha)\sqrt{s_G})\sqrt{(m + \log(K))/N}\right)$.

#### 3.3.4. OWL Norm

The ordered weighted L1 norm is a recently introduced regularizer defined as $\|\beta\|_{\mathrm{owl}} = \sum_{i=1}^{dp} c_i|\beta|_{(i)}$, where $c_1 \geq \ldots \geq c_{dp} \geq 0$ is a predefined non-increasing sequence of weights and $|\beta|_{(1)} \geq \ldots \geq |\beta|_{(dp)}$ is the sequence of absolute values of $\beta$, ranked in decreasing order. In (Chen & Banerjee, 2015) it was shown that $w(\Omega_R) \leq O(\sqrt{\log(dp)}/\bar{c})$, where $\bar{c}$ is the average of $c_1, \ldots, c_{dp}$, and that the norm compatibility constant satisfies $\Psi(\mathrm{cone}(\Omega_{E_j})) \leq 2c_1^2\sqrt{s}/\bar{c}$. Therefore, based on Theorem 3.3, we get $\lambda_N \geq O\!\left(\sqrt{\log(dp)}/(\bar{c}\sqrt{N})\right)$, and the estimation error is bounded by $\|\Delta\|_2 \leq O\!\left(\sqrt{s\log(dp)}/(\bar{c}\sqrt{N})\right)$.
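The four regularizers above are easy to compare numerically. The sketch below (hypothetical group structure and OWL weights, not from the paper) evaluates each norm on the same coefficient vector, which also makes the ordering $\|\beta\|_{GL} \leq \|\beta\|_{SGL}$ used in Section 3.3.3 easy to check.

```python
import numpy as np

def lasso_norm(beta):
    return np.abs(beta).sum()

def group_lasso_norm(beta, groups):
    # groups is a list of index arrays forming a disjoint partition of beta
    return sum(np.linalg.norm(beta[g]) for g in groups)

def sparse_group_lasso_norm(beta, groups, alpha=0.5):
    return alpha * lasso_norm(beta) + (1 - alpha) * group_lasso_norm(beta, groups)

def owl_norm(beta, c):
    # c must be a non-increasing, non-negative weight sequence of the same length as beta
    return np.sum(np.sort(c)[::-1] * np.sort(np.abs(beta))[::-1])

rng = np.random.default_rng(3)
dp = 12
beta = np.zeros(dp)
beta[:3] = rng.standard_normal(3)               # a sparse, group-sparse coefficient vector
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
c = np.linspace(1.0, 0.1, dp)                   # hypothetical OWL weights, non-increasing

print("L1  :", lasso_norm(beta))
print("GL  :", group_lasso_norm(beta, groups))
print("SGL :", sparse_group_lasso_norm(beta, groups))   # GL <= SGL since GL <= L1
print("OWL :", owl_norm(beta, c))
```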
We note that the bounds obtained for Lasso and Group Lasso are similar to the bounds obtained in (Song & Bickel, 2011; Basu & Michailidis, 2015; Kock & Callot, 2015). Moreover, these results are also similar to those in works dealing with independent observations, e.g., (Bickel et al., 2009; Negahban et al., 2012), with the difference being the constants, which reflect the correlation between the samples, as discussed in Section 3.2. The explicit bounds for Sparse Group Lasso and OWL are a novel aspect of our work on non-asymptotic recovery guarantees for the norm-regularized VAR estimation problem, obtained as a simple consequence of our more general framework.

### 3.4. Proof Sketch

In this section we outline the steps of the proofs of Theorems 3.3 and 3.4; all the details can be found in Sections 2.2 and 2.3 of the supplement.

#### 3.4.1. Bound on Regularization Parameter

Recall that our objective is to establish, for some $\alpha > 0$, a probabilistic statement that

$$
\lambda_N \;\geq\; r\alpha \;\geq\; r\,R^*\!\left[\tfrac{1}{N}Z^T\epsilon\right] = r\sup_{R(U)\leq 1}\left\langle \tfrac{1}{N}Z^T\epsilon, U\right\rangle,
$$

where $U = [u_1^T, \ldots, u_p^T]^T \in \mathbb{R}^{dp^2}$ for $u_j \in \mathbb{R}^{dp}$, and $\epsilon = \mathrm{vec}(E)$ for $E$ in (3). We denote by $E_{:,j} \in \mathbb{R}^N$ a column of the noise matrix $E$ and note that since $Z = I_{p\times p}\otimes X$, using the row-wise separability assumption in (5) we can split the overall probability statement into $p$ parts, which are easier to work with. Thus, our objective is to establish that, with high probability,

$$
\sup_{R(u_j)\leq r_j}\left\langle \tfrac{1}{N}X^T E_{:,j}, u_j\right\rangle \;\leq\; \alpha_j, \quad \text{for } j = 1, \ldots, p,
$$

where $\sum_{j=1}^{p}\alpha_j = \alpha$ and $\sum_{j=1}^{p} r_j = 1$. The overall strategy is to first show that the random variable $\left\langle \tfrac{1}{N}X^T E_{:,j}, u_j\right\rangle$ has sub-exponential tails. Based on the generic chaining argument, we then use Theorem 1.2.7 in (Talagrand, 2006) to bound $E\left[\sup_{R(u_j)\leq r_j}\left\langle \tfrac{1}{N}X^T E_{:,j}, u_j\right\rangle\right]$. Finally, using Theorem 1.2.9 in (Talagrand, 2006), we establish the high-probability bound on the concentration of $\sup_{R(u_j)\leq r_j}\left\langle \tfrac{1}{N}X^T E_{:,j}, u_j\right\rangle$ around its mean, i.e., we derive the bound in (18).

We note that the main difficulty in working with the term $\langle X^T E_{:,j}, u_j\rangle$ is the complicated dependency between $X$ and $E_{:,j}$, which is due to the VAR generation process in (3). However, if we write $\langle X^T E_{:,j}, u_j\rangle = \sum_{i=1}^{N} E_{i,j}(X_{i,:}u_j) = \sum_{i=1}^{N} m_i$, where $m_i = E_{i,j}(X_{i,:}u_j)$, then we can interpret this as a summation over a martingale difference sequence (Lutkepohl, 2007). This can be easily proven by showing $E(m_i \mid m_1, \ldots, m_{i-1}) = 0$, which holds since in $m_i = E_{i,j}(X_{i,:}u_j)$ the terms $E_{i,j}$ and $X_{i,:}u_j$ are independent, because $\epsilon_{d+i}$ is independent of $x_{d-k+i}$ for $0 \leq i \leq T-d$ and $1 \leq k \leq d$ (see (2)). To show that $\sum_{i=1}^{N} E_{i,j}(X_{i,:}u_j)$ has sub-exponential tails, recall that since $\epsilon_t$ in (2) is Gaussian, $E_{i,j}$ and $X_{i,:}u_j$ are independent Gaussian random variables, whose product has sub-exponential tails. Moreover, the sum over a sub-exponential martingale difference sequence can be shown to be itself sub-exponential using (Shamir, 2011), based on a Bernstein-type inequality (Vershynin, 2010).
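A quick simulation illustrates the martingale-difference structure exploited above: for a fixed unit vector $u$, the normalized sum $\frac{1}{N}\sum_i E_{i,j}(X_{i,:}u)$ concentrates around zero at roughly the $1/\sqrt{N}$ rate even though the rows of $X$ are strongly dependent. The sketch below (hypothetical VAR(1) process, not the authors' code) checks this empirically.

```python
import numpy as np

rng = np.random.default_rng(4)
p, reps = 10, 200
A1 = 0.6 * np.eye(p)                          # hypothetical stable VAR(1) coefficient matrix
u = np.ones(p) / np.sqrt(p)                   # fixed unit-norm direction

def mds_average(N):
    """Return (1/N) * sum_i E_{i,1} (X_{i,:} u) for one simulated VAR(1) path of length N."""
    x = np.zeros(p)
    total = 0.0
    for _ in range(N):
        eps = rng.standard_normal(p)
        total += eps[0] * (x @ u)             # m_i = E_{i,1} (X_{i,:} u), with X_{i,:} = previous state
        x = A1 @ x + eps                      # advance the VAR process
    return total / N

for N in [100, 400, 1600, 6400]:
    draws = np.array([mds_average(N) for _ in range(reps)])
    # the standard deviation should shrink roughly by half each time N is quadrupled
    print(N, round(draws.std(), 4))
```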
#### 3.4.2. Restricted Eigenvalue Condition

To show that $\frac{\|(I_{p\times p}\otimes X)\Delta\|_2}{\|\Delta\|_2}$ is bounded away from $0$ for all $\Delta \in \mathrm{cone}(\Omega_E)$, similarly as before we split the problem into $p$ parts by using the row-wise separability assumption on the norm in (5). In particular, denote $\Delta = [\Delta_1^T, \ldots, \Delta_p^T]^T$, where each $\Delta_j$ is of size $dp \times 1$; then we can represent the original set $\Omega_E$ as a Cartesian product of subsets $\Omega_{E_j}$, i.e., $\Omega_E = \Omega_{E_1} \times \cdots \times \Omega_{E_p}$, implying that $\mathrm{cone}(\Omega_E) = \mathrm{cone}(\Omega_{E_1}) \times \cdots \times \mathrm{cone}(\Omega_{E_p})$. Therefore, our objective is to establish that

$$
P\left[\inf_{u_j \in \Theta_j}\|Xu_j\|_2 \leq \nu_j\right] \leq \pi_j, \quad \text{for } j = 1, \ldots, p,
$$

where $\Theta_j = \mathrm{cone}(\Omega_{E_j}) \cap S^{dp-1}$ and we defined $u_j = \Delta_j/\|\Delta_j\|_2$, since it is easier to operate with unit-norm vectors (in what follows we drop the index $j$ to reduce clutter). The overall strategy is to first show that $\|Xu\|_2 - E(\|Xu\|_2)$ is sub-Gaussian. Then, using the generic chaining argument in (Talagrand, 2006), specifically Theorem 2.1.5, we bound $E\left[\inf_{u \in \Theta}\|Xu\|_2\right]$. Finally, based on Lemma 2.1.3 in (Talagrand, 2006), we establish the concentration inequality on $\inf_{u \in \Theta}\|Xu\|_2$ around its mean.

## 4. Experimental Results

In this section we present experiments on simulated and real data to illustrate the obtained theoretical results. In particular, for the L1 and Group L1 norms, we investigate how the error norm $\|\Delta\|_2$ and the regularization parameter $\lambda_N$ scale as the problem size $p$ and the sample size $N$ change. Moreover, using real flight data, we also compare the performance of the Sparse Group L1, OWL and ridge regularizers. Additional simulation and experimental results are included in the supplement.

### 4.1. Synthetic Data

To evaluate the estimation problem with the L1 norm, we simulated a first-order VAR process for different values of $p \in [10, 600]$, $s \in [4, 260]$, and $N \in [10, 5000]$. The regularization parameter was varied in the range $\lambda_N \in (0, \lambda_{\max})$, where $\lambda_{\max}$ is the smallest value for which the estimation problem (4) produces an all-zero solution. All the results are shown after averaging across 50 runs.

The results for Lasso are shown in the top row of Figure 1. In particular, in Figure 1(a) we show $\|\Delta\|_2$ for different $p$ and $N$ at a fixed $\lambda_N$. When $N$ is small, the estimation error is large and the results cannot be trusted. However, once $N \geq O(w^2(\Theta))$, the RE condition in Lemma 3.2 is satisfied and we see a fast decrease of the errors for all $p$. In Figure 1(b) we plot $\|\Delta\|_2$ against the rescaled sample size $N/[s\log(pd)]$. The errors are now closely aligned, confirming the results of Section 3.3.1, i.e., $\|\Delta\|_2 \leq O\!\left(\sqrt{(s\log(pd))/N}\right)$. Finally, in Figures 1(c) and 1(d) we show the dependence of the optimal $\lambda_N$ (for fixed $N$ and $p$, we picked the $\lambda_N$ achieving the smallest estimation error) on $N$ and $p$. It can be seen that as $p$ increases, $\lambda_N$ grows (for fixed $N$) at a rate similar to $\log p$. On the other hand, as $N$ increases, the selected $\lambda_N$ decreases (for fixed $p$) at a rate similar to $1/\sqrt{N}$.

For Group Lasso, the sparsity in the rows of $A_1$ was generated in groups, whose number varied as $K \in [2, 60]$. We set the largest number of non-zero groups in any row as $s_G \in [2, 22]$. Results are shown in the bottom row of Figure 1 and have a similar flavor as in the Lasso case. The difference can be seen in Figure 1(f), where a close alignment of the errors occurs when $N$ is now rescaled as $N/[s_G(m + \log K)]$. Moreover, the selected regularization parameter $\lambda_N$ increases with the number of groups $K$ and decreases with $N$.

### 4.2. Real Data

We have also performed evaluation tests on real data to compare the accuracy of VAR estimation using penalized formulations based on five regularizers: L1, OWL, Group Lasso, Sparse Group Lasso and Ridge (the square of L2). Although $\|\cdot\|_2^2$ is not a norm, we included its results for reference purposes, as it is frequently used in practice. In terms of data, we used the NASA flight dataset from (nas), consisting of over 100,000 flights, each having a record of about 250 parameters, sampled at 1 Hz. For our test, we selected 300 flights, picked the 31 parameters most suitable for the prediction task, and focused on the landing part of the trajectory (duration approximately 15 minutes). For each flight we separately fitted a first-order VAR model using the five approaches and performed 5-fold cross-validation to select the $\lambda$ achieving the smallest prediction error. For Sparse Group Lasso we set $\alpha = 0.5$, while for OWL the weights $c_1, \ldots, c_p$ were set as a monotonically decreasing sequence.
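The accuracy numbers reported in Table 1 below are one-step-ahead prediction errors. A minimal sketch of that evaluation for a fitted first-order VAR (hypothetical array names and stand-in estimates, not the authors' code) looks as follows.

```python
import numpy as np

def one_step_ahead_mse(A1_hat, x):
    """Mean squared one-step-ahead prediction error of a fitted VAR(1) on a held-out series x (T x p)."""
    preds = x[:-1] @ A1_hat.T          # prediction for x_t is A1_hat @ x_{t-1}
    resid = x[1:] - preds
    return np.mean(resid ** 2)

# usage sketch: x_test is a held-out (T x p) block of a flight time series,
# A1_hat is the coefficient matrix estimated on the training part with any of the five penalties
rng = np.random.default_rng(5)
p = 31
A1_true = 0.4 * np.eye(p)
x_test = np.zeros((200, p))
for t in range(1, 200):
    x_test[t] = A1_true @ x_test[t - 1] + rng.standard_normal(p)
A1_hat = 0.35 * np.eye(p)              # stand-in for an estimated coefficient matrix
print("one-step-ahead MSE:", one_step_ahead_mse(A1_hat, x_test))
```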
Figure 1 (plots not reproduced here). Results for estimating parameters of a stable first-order sparse VAR (top row) and group-sparse VAR (bottom row). Problem dimensions: $p \in [10, 600]$, $N \in [10, 5000]$, $\lambda_N/\lambda_{\max} \in [0, 1]$, $K \in [2, 60]$ and $d = 1$. Figures (a) and (e) show the dependency of the errors on the sample size for different $p$; in (b), $N$ is rescaled by $s\log(pd)$ and plotted against $\|\Delta\|_2$ to show that the errors scale as $\sqrt{(s\log(pd))/N}$; (f) is similar to (b) but for the group-sparse VAR; (c) and (g) show the dependency of $\lambda_N$ on $p$ (or on the number of groups $K$ in (g)) for a fixed sample size $N$; finally, (d) and (h) display the dependency of $\lambda_N$ on $N$ for fixed $p$.

Table 1. Mean squared error of the five methods used to fit the VAR model, evaluated on the aviation dataset (the MSE is computed using one-step-ahead prediction errors), together with the average number of non-zeros (as a percentage of the total number of elements) in the VAR matrix. The values in parentheses denote one standard deviation after averaging the results over 300 flights. The original table also shows a typical sparsity pattern in $A_1$ for each method (darker dots indicate stronger dependencies, lighter dots weaker ones), not reproduced here.

| | Lasso | OWL | Group Lasso | Sparse Group Lasso | Ridge |
|---|---|---|---|---|---|
| MSE | 32.3 (6.5) | 32.2 (6.6) | 32.7 (6.5) | 32.2 (6.4) | 33.5 (6.1) |
| Non-zeros (%) | 32.7 (7.9) | 44.5 (15.6) | 75.3 (8.4) | 38.4 (9.6) | 99.9 (0.2) |

Table 1 shows the results after averaging across 300 flights. From the table we can see that the considered problem exhibits a sparse structure, since all the methods detected similar patterns in the matrix $A_1$. In particular, the analysis of these patterns revealed meaningful relationships among the flight parameters (darker dots): e.g., normal acceleration had a high dependency on vertical speed and angle-of-attack, altitude depended mainly on fuel quantity, vertical speed on aircraft nose pitch angle, etc. The results also show that sparse regularization helps in recovering more accurate and parsimonious models, as is evident from comparing the performance of Ridge regression with the other methods. Moreover, while all four Lasso-based approaches performed similarly to each other, their sparsity levels were different, with Lasso producing the sparsest solutions. As was also expected, Group Lasso had a larger number of non-zeros, since it does not enforce sparsity within the groups, in contrast to the sparse version of this norm.

## 5. Conclusions

In this work we presented a set of results characterizing the non-asymptotic estimation error in estimating structured vector autoregressive models. The analysis holds for any norm separable along the rows of the parameter matrices. Our analysis is general, as it is expressed in terms of Gaussian widths, a geometric measure of the size of suitable sets, and includes as special cases many of the existing results focused on structured sparsity in VAR models.

## Acknowledgements

The research was supported by NSF grants IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, NASA grant NNX12AQ39A, and gifts from Adobe, IBM, and Yahoo.

## References
- NASA Aviation Safety Dataset. Available at https://c3.nasa.gov/dashlink/projects/85/.
- Argyriou, A., Foygel, R., and Srebro, N. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems, pp. 1457-1465, 2012.
- Banerjee, A., Chen, S., Fazayeli, F., and Sivakumar, V. Estimation with norm regularization. In Advances in Neural Information Processing Systems, pp. 1556-1564, 2014.
- Basu, S. and Michailidis, G. Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4):1535-1567, 2015.
- Bickel, P. J., Ritov, Y., and Tsybakov, A. B. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705-1732, 2009.
- Bogdan, M., Berg, E., Su, W., and Candes, E. Statistical estimation and testing via the sorted l1 norm. arXiv preprint arXiv:1310.1969, 2013.
- Candes, E. and Tao, T. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, pp. 2313-2351, 2007.
- Chandrasekaran, V., Recht, B., Parrilo, P., and Willsky, A. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805-849, 2012.
- Chatterjee, S., Steinhaeuser, K., Banerjee, A., Chatterjee, S., and Ganguly, G. Sparse group Lasso: Consistency and climate applications. In Proceedings of the International Conference on Data Mining, pp. 47-58, 2012.
- Chen, S. and Banerjee, A. Structured estimation with atomic norms: General bounds and applications. In Advances in Neural Information Processing Systems, pp. 2890-2898, 2015.
- Figueiredo, M. and Nowak, R. Sparse estimation with strongly correlated variables using ordered weighted l1 regularization. arXiv preprint arXiv:1409.4005, 2014.
- Gutierrez-Gutierrez, J. and Crespo, P. M. Block Toeplitz matrices: asymptotic results and applications. Foundations and Trends in Communications and Information Theory, 8(3):179-257, 2011.
- Han, F. and Liu, H. A direct estimation of high dimensional stationary vector autoregressions. arXiv preprint arXiv:1307.0293, 2013.
- Jacob, L., Obozinski, G., and Vert, J.-P. Group lasso with overlap and graph lasso. In Proceedings of the International Conference on Machine Learning, pp. 433-440, 2009.
- Kock, A. B. and Callot, L. Oracle inequalities for high dimensional vector autoregressions. Journal of Econometrics, 186(2):325-344, 2015.
- Ledoux, M. and Talagrand, M. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science, 2013.
- Ljung, L. System identification: theory for the user. Springer, 1998.
- Loh, P.-L. and Wainwright, M. J. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pp. 2726-2734, 2011.
- Lutkepohl, H. New introduction to multiple time series analysis. Springer, 2007.
- Meinshausen, N. and Yu, B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, pp. 246-270, 2009.
- Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538-557, 2012.
- Priestley, M. B. Spectral analysis and time series. Academic Press, 1981.
- Raskutti, G., Wainwright, M. J., and Yu, B. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241-2259, 2010.
- Rudelson, M. and Zhou, S. Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 59(6):3434-3447, 2013.
- Shamir, O. A variant of Azuma's inequality for martingales with subgaussian tails. arXiv preprint arXiv:1110.2392, 2011.
- Song, S. and Bickel, P. J. Large vector auto regressions. arXiv preprint arXiv:1106.3915, 2011.
- Talagrand, M. The generic chaining: upper and lower bounds of stochastic processes. Springer, 2006.
- Tibshirani, R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, pp. 267-288, 1996.
- Tsay, R. S. Analysis of financial time series, volume 543. 2005.
- Valdes-Sosa, P. A., Sanchez-Bornot, J. M., Lage-Castellanos, A., Vega-Hernandez, M., Bosch-Bayard, J., Melie-Garcia, L., and Canales-Rodriguez, E. Estimating brain functional connectivity with sparse multivariate autoregression. Philosophical Transactions of the Royal Society, 360(1457):969-981, 2005.
- Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
- Wainwright, M. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183-2202, 2009.
- Yang, T., Wang, J., Sun, Q., Hibar, D. P., Jahanshad, N., Liu, L., Wang, Y., Zhan, L., Thompson, P., and Ye, J. Detecting genetic risk factors for Alzheimer's disease in whole genome sequence data via Lasso screening. In IEEE International Symposium on Biomedical Imaging, 2015.
- Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, 68(1):49-67, 2006.
- Zhao, P. and Yu, B. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541-2563, 2006.
- Zhou, J., Liu, J., Narayan, V. A., and Ye, J. Modeling disease progression via fused sparse group Lasso. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 1095-1103, 2012.
- Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, 67(2):301-320, 2005.