# Invariant Subspace Decomposition

Journal of Machine Learning Research 26 (2025) 1-56. Submitted 5/24; Revised 2/25; Published 4/25.

**Margherita Lazzaretto** (mala@math.ku.dk), Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
**Jonas Peters** (jonas.peters@stat.math.ethz.ch), Seminar for Statistics, ETH Zürich, Zürich, Switzerland
**Niklas Pfister** (np@math.ku.dk), Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark

**Editor:** Aapo Hyvärinen

**Abstract.** We consider the task of predicting a response Y from a set of covariates X in settings where the conditional distribution of Y given X changes over time. For this to be feasible, assumptions on how the conditional distribution changes over time are required. Existing approaches assume, for example, that changes occur smoothly over time so that short-term prediction using only the recent past becomes feasible. To additionally exploit observations further in the past, we propose a novel invariance-based framework for linear conditionals, called Invariant Subspace Decomposition (ISD), that splits the conditional distribution into a time-invariant and a residual time-dependent component. As we show, this decomposition can be employed both for zero-shot and time-adaptation prediction tasks, that is, settings where either no or a small amount of training data is available at the time points we want to predict Y at, respectively. We propose a practical estimation procedure, which automatically infers the decomposition using tools from approximate joint matrix diagonalization. Furthermore, we provide finite sample guarantees for the proposed estimator and demonstrate empirically that it indeed improves on approaches that do not use the additional invariant structure.

**Keywords:** invariance, distribution shift, time adaptation, zero-shot learning, joint block diagonalization
## 1. Introduction

Many commonly studied systems evolve with time, giving rise to heterogeneous data. Changes over time in the data generating process lead to shifts in the observed data distribution and make it, in general, impossible to find a fixed model that consistently describes the system over time.*

\* NP is now at Lakera in Zürich, but most of this work was done while NP was at the University of Copenhagen. © 2025 Margherita Lazzaretto, Jonas Peters, Niklas Pfister. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0699.html.

In this work, we analyze the problem of estimating the relation between a response $Y_t$ and a set of covariates (or predictors) $X_t$, $t \in \mathbb{N}$, both of which are observed over some period of time, with the goal of predicting an unobserved response when new observations of the covariates become available. Two types of distribution shifts across time can arise: (i) variations in the mechanism relating the response to the covariates and (ii) variations in the covariate distribution. Under distribution shifts, successfully predicting an unobserved response from covariates requires assumptions on how the underlying data generating model changes over time. In regression settings, varying-coefficient models (e.g., Hastie and Tibshirani, 1993; Fan and Zhang, 2008) are a common approach to dealing with dynamic system behaviours caused by changes in the functional relationship between a response and some covariates. In these works, the model coefficients are usually assumed to change smoothly, and smoothing methods are used in their estimation. Another approach for online estimation of smoothly changing parameters uses state-space models and filtering solutions (see, for example, Durbin and Koopman, 2012).
In general, smooth changes or other kinds of structured variations, such as step-wise changes, allow us to learn a regression function using time points from the recent past. To exploit information further in the past, one can look for invariant patterns that persist through time. In this spirit, Meinshausen and Bühlmann (2015) define the maximin effect as a worst-case optimal model that maintains good predictive properties at all observed times. Distribution shifts in time-series models are conceptually similar to distribution shifts across domains. In both cases, a key problem is to find invariant parts of the observed data distribution and transfer them to the new domain or time point. Unsupervised domain adaptation methods aim to predict an unobserved response in the target domain; for the problem to be tractable, they rely, for example, on the invariance of the conditional distribution of the response given the covariates (Sun et al., 2017; Zhao et al., 2019) or on independently changing factors of the joint distribution of the covariates and the response (Zhang et al., 2015; Stojanov et al., 2021). Solutions to the unsupervised domain adaptation problem are often based on aligning the source and target domains by minimizing the discrepancy either between the covariate distributions (Zhang et al., 2015) or between the covariates' second-order moments (Sun et al., 2017), or by finding low-dimensional invariant transformations of the variables whose distributions match in the domains of interest (Zhao et al., 2019; Stojanov et al., 2021). Other unsupervised domain adaptation approaches, such as the one proposed by Bousmalis et al. (2016), rely on learning disentangled representations, where disentangled means that changes in one representation do not affect the remaining ones.
This approach is also explored in non-stationary settings, when changes in the covariate distribution happen over time instead of across different domains, for example in the works by Yao et al. (2022); Li et al. (2024). The goal of these works is not prediction, but rather the identification of generalizable latent states that could be used for arbitrary downstream tasks. The field of causality also widely explores the concept of invariant and independently changing mechanisms. In causal models, changes in the covariate distribution are modeled as interventions, and different interventional settings represent different environments. In this context, Peters et al. (2016) propose to look for a set of stable causal predictors, that is, a set of covariates for which the conditional distribution of the response given these covariates remains invariant across different environments. The same idea is extended by Pfister et al. (2019a) to a sequential setting in which the different environments are implicitly generated by changes of the covariate distribution over time. Such invariances can then be used for prediction in previously unseen environments, too (e.g., Rojas-Carulla et al., 2018; Magliacane et al., 2018; Pfister et al., 2021). Other works on causal discovery from heterogeneous or nonstationary data, such as those by Huang et al. (2020); Günther et al. (2023), study settings in which the skeleton of the causal graph remains invariant across different contexts or time points, but the causal mechanisms are allowed to change. Their goal is then to identify the causal graph on observed covariates. To detect whether a mechanism changes, the context or time is included as an additional variable in the graph. As the context or time can be assumed to be exogenous, this often leads to additional identification of causal edges in the graphs. Feng et al. (2022) develop and apply similar ideas in a reinforcement learning setting.
Their interest is in learning the causal graph and improving the efficiency of non-stationarity adaptation by disentangling the system variations as latent parameters. In this work, we build on invariance ideas both from the maximin framework proposed by Meinshausen and Bühlmann (2015) and from the covariate shift literature, and exploit them for time adaptation. We propose an invariance-based framework, which we call invariant subspace decomposition (ISD), to estimate time-varying regression models in which both the covariate distribution and its relationship with a response can change. In particular, we consider a sequence of independent random vectors $(X_t, Y_t)_{t \in \mathbb{N}} \in \mathbb{R}^p \times \mathbb{R}$ satisfying for all time points $t \in \mathbb{N}$ a linear model of the form

$$Y_t = X_t^\top \gamma_{0,t} + \epsilon_t \tag{1}$$

with $\mathbb{E}[\epsilon_t \mid X_t] = 0$, where $\gamma_{0,t}$ is the (unknown) true time-varying parameter. We allow the covariance matrices $\mathrm{Var}(X_t)$ to change over time, assuming that they are approximately constant in small time windows. Regarding changes in $\gamma_{0,t}$, we only need to assume smoothness during test time (if there is more than one test point); we provide more details later in Remark 2. Having observed $X_t$ and $Y_t$ up until some time point $n$, our goal is to learn $\gamma_{0,t^\ast}$ for some $t^\ast > n$, which we then use to predict $Y_{t^\ast}$ from $X_{t^\ast}$. We propose to do so by considering the explained variance, which, for all $t$, is given for some function $f$ by

$$\mathrm{Var}_t(f) := \mathrm{Var}(Y_t) - \mathrm{Var}(Y_t - f(X_t)).$$

Using the explained variance as the objective function provides an intuitive evaluation of the predictive quality of a function $f$: negative explained variance indicates in particular that using $f$ for prediction is harmful, in that it is worse than the best constant function. Under model (1), $f$ is a linear function fully characterized by a parameter $\beta \in \mathbb{R}^p$, i.e., $f(X_t) = X_t^\top \beta$, and the true time-varying parameter can always be expressed as

$$\gamma_{0,t} = \operatorname*{arg\,max}_{\beta \in \mathbb{R}^p} \mathrm{Var}_t(\beta). \tag{2}$$
Key to ISD is the decomposition of the explained variance maximization into a time-invariant and a time-dependent part: we show in Theorem 1 that, under some assumptions, for all $t \in \mathbb{N}$, $\gamma_{0,t}$ can be expressed as

$$\gamma_{0,t} = \underbrace{\operatorname*{arg\,max}_{\beta \in S^{\mathrm{inv}}} \overline{\mathrm{Var}}(\beta)}_{=:\,\beta^{\mathrm{inv}}} + \underbrace{\operatorname*{arg\,max}_{\beta \in S^{\mathrm{res}}} \mathrm{Var}_t(\beta)}_{=:\,\delta^{\mathrm{res}}_t} \tag{3}$$

where $\overline{\mathrm{Var}}(\beta) := \frac{1}{n} \sum_{t=1}^{n} \mathrm{Var}_t(\beta)$, and $S^{\mathrm{inv}}, S^{\mathrm{res}} \subseteq \mathbb{R}^p$ are two orthogonal linear subspaces, with $S^{\mathrm{inv}} \oplus S^{\mathrm{res}} = \mathbb{R}^p$, assumed to be uniquely determined by the distribution of $(X_1, Y_1), \ldots, (X_n, Y_n)$. The decomposition of $\mathbb{R}^p$ into $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ from observed data is achieved by exploiting ideas from independent subspace analysis (e.g., Gutch and Theis, 2012). Unlike works on latent representations, we do not reduce the overall dimensionality of the problem but partition the observed space into two lower-dimensional subspaces, $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$. This in particular implies that the two sub-problems in (3) each have fewer degrees of freedom than the original one. The first sub-problem can be solved using all available heterogeneous observations, which we call historical data: its solution $\beta^{\mathrm{inv}}$ is a time-invariant linear parameter that partially describes the dependence of $Y$ on $X$ at all times. The second sub-problem is a time adaptation problem that tunes the invariant component to a time point of interest: its solution $\delta^{\mathrm{res}}_t$ explains the residuals $Y_t - X_t^\top \beta^{\mathrm{inv}}$; the sum $\beta^{\mathrm{inv}} + \delta^{\mathrm{res}}_t$ gives an estimator for the time-varying linear parameter of interest $\gamma_{0,t}$. In order to estimate the residual component $\delta^{\mathrm{res}}_t$, we assume the model is approximately stationary in a small time window preceding $t$ and use this local subset of the data, which we call adaptation data, for estimation. We distinguish two tasks: (i) the zero-shot task, where adaptation data is not available and we approximate $\gamma_{0,t}$ by $\beta^{\mathrm{inv}}$, and (ii) the time-adaptation task, where adaptation data is available and we solve both sub-problems to approximate $\gamma_{0,t}$ by $\beta^{\mathrm{inv}} + \delta^{\mathrm{res}}_t$.
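To make the objective concrete, the following self-contained sketch (synthetic data; the helper name `explained_variance` is ours, not the paper's) verifies that the maximizer of the explained variance at a fixed time point is the OLS-type parameter $\mathrm{Var}(X_t)^{-1}\mathrm{Cov}(X_t, Y_t)$ from (2):

```python
import numpy as np

rng = np.random.default_rng(0)

def explained_variance(X, Y, beta):
    """Sample analogue of Var_t(beta) = Var(Y_t) - Var(Y_t - X_t' beta)."""
    return np.var(Y) - np.var(Y - X @ beta)

# Data from the linear model (1) at a single fixed time point t.
n, p = 5000, 3
gamma_t = np.array([1.0, -0.5, 2.0])   # plays the role of gamma_{0,t}
X = rng.normal(size=(n, p))
Y = X @ gamma_t + 0.1 * rng.normal(size=n)

# Maximizer of the explained variance: Var(X_t)^{-1} Cov(X_t, Y_t).
Sigma_hat = np.cov(X, rowvar=False)
cov_xy_hat = np.cov(X.T, Y)[:p, p]
beta_hat = np.linalg.solve(Sigma_hat, cov_xy_hat)

# beta_hat recovers gamma_t and explains more variance than, e.g., zero.
assert np.allclose(beta_hat, gamma_t, atol=0.05)
assert explained_variance(X, Y, beta_hat) >= explained_variance(X, Y, np.zeros(p))
```

The same computation applied per time window is the rolling window OLS baseline that ISD is compared against in Figures 1 and 2.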
A two-dimensional example of γ0,t and its ISD estimates is shown in Figure 1. The same example is presented again in more detail in Example 1 in Section 2.1 below and used as a running example throughout the paper. The fundamental assumption we make to apply the ISD framework to new data (Assumption 2) is that the decomposition inferred from available observations generalizes to unseen time points. This guarantees that the estimated invariant component can be meaningfully used for prediction at all new time points, either directly or as part of a two-step estimation that is fine-tuned by solving the time adaptation sub-problem. The ISD framework allows us to improve both on naive methods that directly maximize the explained variance on $\mathbb{R}^p$ using only the most recent available observations and on methods focusing on invariance across environments or time points, such as the maximin approach by Meinshausen and Bühlmann (2015), which do not account for time adaptation. We show in particular in Theorem 3 that isolating an invariant component and reducing the dimensionality of the adaptation problem guarantees lower prediction error or, equivalently, higher explained variance compared to naive methods. An example is provided for a simulated setting in Figure 2, which shows on the left the average explained variance of the invariant component computed at training time on historical data, and on the right the cumulative explained variance of the invariant component (zero-shot task) and the adapted invariant component (time-adaptation task) at testing time, when new observations become available. For the zero-shot task, we compare the ISD invariant component with the standard OLS solution, computed on all training observations, and with the maximin effect ($\hat\beta^{\mathrm{mm}}$) by Meinshausen and Bühlmann (2015), which maximizes the worst-case explained variance over the available observations.
For the time-adaptation task, we compare the ISD with the standard OLS solution computed on a rolling window over the latest new observations. For both tasks, the ISD estimates achieve higher cumulative explained variance at new time points.

Figure 1: Example of a two-dimensional true parameter γ0,t varying on a one-dimensional subspace of $\mathbb{R}^2$, and its estimates using ISD (right) compared to rolling window OLS (left). Time is visually encoded using a color map. (Left) True parameter γ0,t (hexagons) on 350 test points and OLS estimates $\hat\gamma^{\mathrm{OLS}}_t$ based on rolling windows of size 16. (Right) Same test data and true parameters γ0,t, but now we additionally use 1000 prior time points as historical data (not shown) to estimate the decomposition of $\mathbb{R}^2$ into the orthogonal subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ (dashed lines). Next, we estimate $\beta^{\mathrm{inv}}$ using the historical data and $\hat S^{\mathrm{inv}}$. Then, using the same rolling windows as in the left plot as adaptation data, we estimate $\delta^{\mathrm{res}}_t$ using $\hat S^{\mathrm{res}}$. The ISD estimates are then given by $\hat\gamma^{\mathrm{ISD}}_t = \hat\beta^{\mathrm{inv}} + \hat\delta^{\mathrm{res}}_t$. All details on the generative model are provided in Example 1. The subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ do not need to be axis-aligned, so ISD is applicable even in cases where the conditional of $Y_t$ given $X_t$ and all conditionals of $Y_t$ given subsets of $X_t$ vary over time.

The remainder of the paper is structured as follows. In Section 2 we introduce the ISD framework, starting from the identification of the invariant and residual subspaces for orthogonal covariate transformations and showing the explained variance separation (3) in Theorem 1.
In Sections 2.2 and 2.3 we define population estimators for the invariant and residual components, the solutions to the two sub-problems in (3). We then describe, in Section 3, the two tasks that ISD solves, namely zero-shot prediction and time adaptation, and provide a characterization of the invariant component as a worst-case optimal parameter. In Section 4 we propose an estimation method for ISD and provide finite sample generalization guarantees for the proposed estimator. Finally, in Section 5, we illustrate ISD based on numerical experiments, both on simulated and on real-world data, and validate the theoretical results presented in the paper.

Figure 2: For the data-generating model in Section 5.2, we plot (left) the average explained variance (distribution over 20 runs) obtained at training time (historical data) and (right) the cumulative explained variance $\sum_{r=1}^{t} \mathrm{Var}_r(\hat\beta)$ obtained at testing time (adaptation data), in one of the 20 runs; in this example, the time-varying components in the historical and adaptation data have disjoint support. The example considers $p = 10$-dimensional predictors and an invariant component of dimension 7. As baselines, we use (i) the true time-varying parameter γ0,t, which maximizes the explained variance at all observed time points $t$, and (ii) the oracle invariant component $\beta^{\mathrm{inv}}$. 6000 historical observations are used to estimate (iii) the invariant component $\hat\beta^{\mathrm{inv}}$ of the ISD framework, (iv) the OLS solution $\hat\beta^{\mathrm{OLS}}$, and (v) the maximin effect $\hat\beta^{\mathrm{mm}}$.
Starting from $t = 0$ after the observed history, windows of length $3p$ are used to estimate (vi) the adaptation parameter $\hat\delta^{\mathrm{res}}_t$ for $\hat\beta^{\mathrm{inv}}$, yielding the ISD estimate $\hat\gamma^{\mathrm{ISD}}_t$, and (vii) the rolling window OLS solution $\hat\gamma^{\mathrm{OLS}}_t$. While at training time on historical data the ISD invariant component $\hat\beta^{\mathrm{inv}}$ is the most conservative, with the lowest average explained variance, after a distribution shift (adaptation data) the same component can explain more variance than other methods based on historical data only ($\hat\beta^{\mathrm{OLS}}$, $\hat\beta^{\mathrm{mm}}$), and can be tuned to new time points to improve on estimators based on adaptation data only ($\hat\gamma^{\mathrm{OLS}}_t$).

## 2. Invariant subspace decomposition

We formalize the setup described in the introduction in the following setting.

**Setting 1** Let $(X_t, Y_t)_{t \in \mathbb{N}} \in \mathbb{R}^p \times \mathbb{R}$ be a sequence of independent random vectors satisfying for all $t \in \mathbb{N}$ a linear model as in (1), with $\gamma_{0,t} \in \mathbb{R}^p$ and $\mathbb{E}[\epsilon_t \mid X_t] = 0$. Assume that for all $t \in \mathbb{N}$ the covariance matrix of the predictors $\Sigma_t := \mathrm{Var}(X_t)$ is strictly positive definite. Moreover, for all $n \in \mathbb{N}$, let $[n] := \{1, \ldots, n\}$ and assume we observe $n$ observations $(X_t, Y_t)_{t \in [n]}$ from model (1), which we call historical data. Additionally, let $I^{\mathrm{ad}} \subseteq \mathbb{N}$ be an interval of consecutive time points with $m := |I^{\mathrm{ad}}| > p$ and $\min_{t \in I^{\mathrm{ad}}} t > n$. Assume we observe a second set of observations $(X_t, Y_t)_{t \in I^{\mathrm{ad}}}$ from the same model but succeeding the historical data, which we call adaptation data. Finally, denote by $t^\ast > \max_{t \in I^{\mathrm{ad}}} t$ a time point of interest that occurs after the adaptation data and assume that, for all $t \in I^{\mathrm{ad}} \cup \{t^\ast\}$, the quantities $\gamma_{0,t}$, $\Sigma_t$ and $\mathrm{Var}(\epsilon_t)$ in model (1) are constant (in practice, being approximately constant is sufficient).
Figure 3: Illustration of the historical data $(X_t, Y_t)_{t \in [n]}$ (time points $1, 2, \ldots, n$) and the adaptation data $(X^{\mathrm{ad}}_t, Y^{\mathrm{ad}}_t)_{t \in I^{\mathrm{ad}}}$ (time points $t^\ast - m, \ldots, t^\ast - 1$), together with the two tasks. Zero-shot task: observe $(X_t, Y_t)_{t \in [n]}$ and $X_{t^\ast}$, predict $Y_{t^\ast}$. Adaptation task: observe $(X_t, Y_t)_{t \in [n]}$, $(X^{\mathrm{ad}}_t, Y^{\mathrm{ad}}_t)_{t \in I^{\mathrm{ad}}}$ and $X_{t^\ast}$, predict $Y_{t^\ast}$.

Our goal is to predict $Y_{t^\ast}$ from $X_{t^\ast}$. A naive solution is to only consider the adaptation data in $I^{\mathrm{ad}}$ (see also the analysis in Section 3); we aim to improve on this approach by additionally exploiting the historical data in $[n]$. In particular, depending on whether we only have access to the historical data or additionally have access to the data in $I^{\mathrm{ad}}$, we can solve (i) the zero-shot task or (ii) the adaptation task, respectively. The full setup, including the different tasks, is visualized in Figure 3. The two tasks correspond to the two sub-problems in (3): the solution to (i) is a time-invariant parameter $\beta^{\mathrm{inv}} \in \mathbb{R}^p$, defined formally below in Section 2.2; the solution to (ii) is an adaptation parameter $\delta^{\mathrm{res}}_t$, formally defined in Section 2.3, that satisfies $\beta^{\mathrm{inv}} + \delta^{\mathrm{res}}_t = \gamma_{0,t}$. We start by defining what a time-invariant parameter is. To do so, we first consider the explained variance for a parameter $\beta \in \mathbb{R}^p$ at time $t \in \mathbb{N}$, defined by $\mathrm{Var}_t(\beta) := \mathrm{Var}(Y_t) - \mathrm{Var}(Y_t - X_t^\top \beta)$. Under model (1), the explained variance can be equivalently expressed as

$$\mathrm{Var}_t(\beta) = 2 \operatorname{Cov}(Y_t, X_t^\top \beta) - \mathrm{Var}(X_t^\top \beta) = 2 \operatorname{Cov}(Y_t - X_t^\top \beta,\, X_t^\top \beta) + \mathrm{Var}(X_t^\top \beta). \tag{4}$$

A desirable property for a non-varying parameter $\beta \in \mathbb{R}^p$ is that its explained variance remains non-negative at all time points. We call a parameter $\beta \in \mathbb{R}^p$ never harmful if, for all $t \in \mathbb{N}$, $\mathrm{Var}_t(\beta) \geq 0$.
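Both equalities in (4) are algebraic identities and also hold exactly for empirical moments computed with a consistent normalization; a quick self-contained check (the data and the `var`/`cov` helpers are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated covariates
Y = X @ rng.normal(size=p) + rng.normal(size=n)
beta = rng.normal(size=p)
Z = X @ beta                                           # fitted values X_t' beta

def var(a):
    # sample variance with ddof=0, used consistently below
    return np.mean((a - a.mean()) ** 2)

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

lhs = var(Y) - var(Y - Z)            # definition of Var_t(beta)
mid = 2 * cov(Y, Z) - var(Z)         # first expression in (4)
rhs = 2 * cov(Y - Z, Z) + var(Z)     # second expression in (4)

assert np.isclose(lhs, mid) and np.isclose(mid, rhs)
```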
Intuitively, parameters with this property at least partially explain $Y_t$ at all time points $t \in \mathbb{N}$, and are therefore meaningful to use for prediction.¹ In this paper, we consider the subset of never harmful parameters $\beta \in \mathbb{R}^p$ that satisfy $\mathrm{Var}_t(\beta) = \mathrm{Var}(X_t^\top \beta)$ or, equivalently, using (4), $\operatorname{Cov}(Y_t - X_t^\top \beta, X_t^\top \beta) = 0$; thus, the explained variance of these parameters reflects changes of $\mathrm{Var}(X_t)$ (over $t$) but is independent of changes in $\gamma_{0,t}$.

**Definition 1 (time-invariance)** We call a parameter $\beta \in \mathbb{R}^p$ time-invariant (over $[n]$) if for all $t \in [n]$,

$$\operatorname{Cov}(Y_t - X_t^\top \beta,\, X_t^\top \beta) = 0. \tag{5}$$

¹ Under an assumption similar to Assumption 2 below, the maximin framework by Meinshausen and Bühlmann (2015) allows us to estimate never harmful parameters; we provide a more detailed comparison between our approach and the maximin in Remark 3.

We use this definition, in Section 2.1, to define the subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ that allow us to obtain the separation of $\gamma_{0,t}$ as in (3). In particular, as we show in Proposition 2, the definition of the invariant subspace $S^{\mathrm{inv}}$ guarantees that the invariant component $\beta^{\mathrm{inv}}$ maximizing the pooled explained variance over $S^{\mathrm{inv}}$ is always a time-invariant parameter according to Definition 1. While we consider the explained variance here, one can equivalently consider the mean squared prediction error (MSPE).

**Remark 1 (exchanging explained variance with MSPE)** Under model (1), maximizing the explained variance at time $t \in [n]$ is equivalent to minimizing the MSPE at $t$, $\mathrm{MSPE}_t(\beta) := \mathbb{E}[(Y_t - X_t^\top \beta)^2]$. More precisely, assuming that all variables have mean zero,

$$\operatorname*{arg\,max}_{\beta \in \mathbb{R}^p} \mathrm{Var}_t(\beta) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \mathrm{Var}(Y_t - X_t^\top \beta) = \mathrm{Var}(X_t)^{-1} \operatorname{Cov}(X_t, Y_t) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \mathrm{MSPE}_t(\beta).$$

The remaining part of this section is organized as follows. Section 2.1 explicitly constructs the spaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ and shows that they can be characterized by joint block diagonalization.
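The following simulated sketch illustrates Definition 1 (the two-block construction and the candidate parameter are hypothetical choices of ours): a parameter supported on a covariate block whose coefficient does not change satisfies (5) at every time point, and its explained variance then reduces to $\mathrm{Var}(X_t^\top \beta)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

def sample_time_point(t):
    # Two uncorrelated covariate blocks whose scales change with t; the
    # coefficient on the second block is constant, the first one varies.
    X = rng.normal(size=(n, 2)) * np.array([1.0 + t, 0.5])
    gamma_t = np.array([np.sin(t), 2.0])
    Y = X @ gamma_t + 0.1 * rng.normal(size=n)
    return X, Y

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

beta = np.array([0.0, 2.0])     # candidate time-invariant parameter
for t in [0, 1, 2]:
    X, Y = sample_time_point(t)
    fitted, resid = X @ beta, Y - X @ beta
    # Definition 1: Cov(Y_t - X_t' beta, X_t' beta) = 0 at every t ...
    assert abs(cov(resid, fitted)) < 0.1
    # ... so the explained variance reduces to Var(X_t' beta):
    assert abs((np.var(Y) - np.var(resid)) - np.var(fitted)) < 0.2
```

Note that a parameter need not be supported on observed coordinates for this to work; Section 2.1 constructs the relevant (possibly rotated) subspaces.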
In Section 2.2 we then analyze the time-invariant part of (3), leading to the optimal time-invariant parameter $\beta^{\mathrm{inv}}$, and show in Proposition 2 (iii) that its solution corresponds to the maximally predictive time-invariant parameter and is an interesting target of inference in its own right. In Section 2.3 we analyze the residual part of (3), leading to the optimal parameter $\delta^{\mathrm{res}}_t$.

### 2.1 Invariant and residual subspaces

We now construct the two linear subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ that allow us to express the true time-varying parameter as the solution to two separate optimizations as in (3). For a linear subspace $S \subseteq \mathbb{R}^p$, denote by $\Pi_S \in \mathbb{R}^{p \times p}$ the orthogonal projection matrix from $\mathbb{R}^p$ onto $S$, that is, a symmetric matrix such that $\Pi_S^2 = \Pi_S$ and, for all vectors $v \in \mathbb{R}^p$, $\Pi_S v \in S$ and $(v - \Pi_S v)^\top \Pi_S v = 0$. We call a collection of pairwise orthogonal linear subspaces $S_1, \ldots, S_q \subseteq \mathbb{R}^p$ with $\bigoplus_{j=1}^q S_j = \mathbb{R}^p$ an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition (of cardinality $q$) if for all $i, j \in \{1, \ldots, q\}$ with $i \neq j$, and for all $t \in [n]$, it holds that

$$\operatorname{Cov}(\Pi_{S_i} X_t, \Pi_{S_j} X_t) = 0. \tag{6}$$

Moreover, we define an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition as irreducible if there is no orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition with strictly larger cardinality. We will construct $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ from an irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition. Given a (not necessarily irreducible) orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition $\{S_j\}_{j=1}^q$, the true time-varying parameter can be expressed as the sum of $q$ orthogonal components, each lying in one subspace of the partition. These components can be expressed in terms of the covariates and of the response at time $t$, as shown by the following lemma.

**Lemma 1** Let $\{S_j\}_{j=1}^q$ be an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition. Then it holds for all $t \in [n]$ that $\gamma_{0,t} = \sum_{j=1}^q \Pi_{S_j} \gamma_{0,t}$ and for all $j \in \{1, \ldots, q\}$ that

$$\Pi_{S_j} \gamma_{0,t} = \mathrm{Var}(\Pi_{S_j} X_t)^\dagger \operatorname{Cov}(\Pi_{S_j} X_t, Y_t), \tag{7}$$

where $(\cdot)^\dagger$ denotes the Moore-Penrose pseudoinverse.
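The projection properties and the decorrelation condition (6) can be sketched numerically (the basis and block sizes below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4

# Orthonormal basis U of R^p; the first two columns span S1, the last
# two span S2 (an arbitrary partition chosen for illustration).
U, _ = np.linalg.qr(rng.normal(size=(p, p)))
P1 = U[:, :2] @ U[:, :2].T     # orthogonal projection onto S1
P2 = U[:, 2:] @ U[:, 2:].T     # orthogonal projection onto S2

assert np.allclose(P1 @ P1, P1) and np.allclose(P1, P1.T)
assert np.allclose(P1 + P2, np.eye(p))        # S1 (+) S2 = R^p

# A covariance matrix that is block diagonal in the U basis (here even
# fully diagonal) makes {S1, S2} a decorrelating partition, cf. (6):
Sigma_t = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ U.T
cross_cov = P1 @ Sigma_t @ P2                 # Cov(P1 X_t, P2 X_t)
assert np.allclose(cross_cov, 0)
```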
Each component is in particular the projection of $\gamma_{0,t}$ onto the corresponding subspace $S_j$, as well as a maximizer of the explained variance in $S_j$, as shown by Lemma 2.

**Lemma 2** Let $N \subseteq \mathbb{N}$ and let $\{S_j\}_{j=1}^q$ be an orthogonal and $(X_t)_{t \in N}$-decorrelating partition.² Then it holds for all $j \in \{1, \ldots, q\}$ and for all $t \in N$ that

$$\operatorname*{arg\,max}_{\beta \in S_j} \mathrm{Var}_t(\beta) = \mathrm{Var}(\Pi_{S_j} X_t)^\dagger \operatorname{Cov}(\Pi_{S_j} X_t, Y_t). \tag{8}$$

Combining Lemmas 1 and 2 implies that

$$\gamma_{0,t} = \sum_{j=1}^q \operatorname*{arg\,max}_{\beta \in S_j} \mathrm{Var}_t(\beta). \tag{9}$$

Therefore, any orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition allows us to split the optimization (2) of the explained variance into separate optimizations over the individual subspaces. In order to leverage all available observations in at least some of the optimizations by pooling the explained variance as in (3), we need the optimizer to remain constant over time. More formally, we call a linear subspace $S \subseteq \mathbb{R}^p$ opt-invariant (optimum invariant) on $[n]$ if for all $t, s \in [n]$ it holds that

$$\operatorname*{arg\,max}_{\beta \in S} \mathrm{Var}_t(\beta) = \operatorname*{arg\,max}_{\beta \in S} \mathrm{Var}_s(\beta). \tag{10}$$

For all terms in (9) corresponding to an opt-invariant subspace, we can, by definition of opt-invariance, use all time points in the optimization. For an irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition $\{S_j\}_{j=1}^q$, we therefore define the invariant subspace $S^{\mathrm{inv}}$ and the residual subspace $S^{\mathrm{res}}$ by

$$S^{\mathrm{inv}} := \bigoplus_{\substack{j \in \{1,\ldots,q\}:\\ S_j \text{ opt-invariant on } [n]}} S_j \quad \text{and} \quad S^{\mathrm{res}} := \bigoplus_{\substack{j \in \{1,\ldots,q\}:\\ S_j \text{ not opt-invariant on } [n]}} S_j, \tag{11}$$

respectively. The invariant subspace $S^{\mathrm{inv}}$ is opt-invariant (see Lemma 7). Moreover, by definition, it holds that $S^{\mathrm{res}} = (S^{\mathrm{inv}})^\perp$, where $(\cdot)^\perp$ denotes the orthogonal complement in $\mathbb{R}^p$, and that $\{S^{\mathrm{inv}}, S^{\mathrm{res}}\}$ is an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition (see Lemma 6). To ensure that these two spaces do not depend on the chosen irreducible orthogonal partition, we introduce the following assumption.

**Assumption 1 (uniqueness of the subspace decomposition)** Let $\{S_j\}_{j=1}^q$ and $\{\tilde S_j\}_{j=1}^{\tilde q}$ be two irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partitions.
Then, it holds that

$$\bigoplus_{\substack{j \in \{1,\ldots,q\}:\\ S_j \text{ opt-invariant on } [n]}} S_j = \bigoplus_{\substack{j \in \{1,\ldots,\tilde q\}:\\ \tilde S_j \text{ opt-invariant on } [n]}} \tilde S_j.$$

Assumption 1 is, for example, satisfied if an irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition is unique. As we show in Appendix A.1, uniqueness does not always hold (e.g., if for all $t \in [n]$ the multiplicity of some eigenvalues of $\Sigma_t$ is larger than one and shared across all such matrices). Assumption 1 is, however, a mild assumption; for example, uniqueness of an irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition is satisfied if there exists at least one $t \in [n]$ such that all eigenvalues of $\Sigma_t$, which we have assumed to be non-zero, are distinct (see Lemma 8 and Proposition 1 below). Whenever Assumption 1 holds, the invariant and residual spaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ do not depend on which irreducible orthogonal partition is used in their construction.

² This is defined analogously to an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition.

**Example 1** This example describes the setting used to generate Figure 1. Consider a 2-dimensional covariate $X_t \in \mathbb{R}^2$, and assume that model (1) is defined as follows. We take, for all $t \in [n]$,

$$\gamma_{0,t} = \left(\tfrac{t}{n} - 1.5\right) \begin{pmatrix} \sqrt{3} \\ -1 \end{pmatrix} + \begin{pmatrix} 1 \\ \sqrt{3} \end{pmatrix} \quad \text{and} \quad \Sigma_t = \frac{1}{4} \begin{pmatrix} 3\sigma_{1,t} + \sigma_{2,t} & \sqrt{3}(\sigma_{2,t} - \sigma_{1,t}) \\ \sqrt{3}(\sigma_{2,t} - \sigma_{1,t}) & \sigma_{1,t} + 3\sigma_{2,t} \end{pmatrix},$$

where $\sigma_{1,t}$ and $\sigma_{2,t}$ are two fixed sequences sampled as two independent i.i.d. samples from a uniform distribution on $[0, 1]$. In this example, an irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition is given by the two spaces

$$S_1 = \operatorname{span}\left\{\begin{pmatrix} 0.5\sqrt{3} \\ -0.5 \end{pmatrix}\right\} \quad \text{and} \quad S_2 = \operatorname{span}\left\{\begin{pmatrix} 0.5 \\ 0.5\sqrt{3} \end{pmatrix}\right\}.$$

Indeed, it holds for all $t \in [n]$ that $\operatorname{Cov}(\Pi_{S_1} X_t, \Pi_{S_2} X_t) = \Pi_{S_1} \Sigma_t \Pi_{S_2} = 0_{2 \times 2}$. Moreover, since $\sigma_{1,t}$ and $\sigma_{2,t}$ are the eigenvalues of $\Sigma_t$, the irreducible orthogonal partition defined by $S_1$ and $S_2$ is unique (i.e., Assumption 1 is satisfied by Lemma 8) if $\sigma_{1,t} \neq \sigma_{2,t}$ for some $t \in [n]$. It also holds that $\Pi_{S_2} \gamma_{0,t} = (1, \sqrt{3})^\top$, which does not depend on $t$ (it can be verified that the same does not hold for $\Pi_{S_1} \gamma_{0,t}$).
As shown in Lemmas 1 and 2, it holds that

$$\operatorname*{arg\,max}_{\beta \in S_j} \mathrm{Var}_t(\beta) = \operatorname*{arg\,max}_{\beta \in S_j} \left\{ \mathrm{Var}(Y_t) - \mathrm{Var}(Y_t - X_t^\top \Pi_{S_j} \beta) \right\} = \Pi_{S_j} \gamma_{0,t}.$$

It follows that the subspace $S_2$ is opt-invariant, whereas $S_1$ is not, and therefore $S^{\mathrm{inv}} = S_2$ and $S^{\mathrm{res}} = S_1$. The two spaces $S_1$ and $S_2$ also appear in Figure 1: the true time-varying parameter $\gamma_{0,t}$ does not vary with $t$ in the direction of the vector $(0.5, 0.5\sqrt{3})^\top$ generating $S_2$; $S_1$ can be visualized by connecting the circles. In order for $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ to be useful for prediction on future observations, we assume that the subspace separations we consider remain fixed over time.

**Assumption 2 (generalization)** For all irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partitions, $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ defined in (11) satisfy the following two conditions: (i) for all $t \in \mathbb{N}$ it holds that $\operatorname{Cov}(\Pi_{S^{\mathrm{inv}}} X_t, \Pi_{S^{\mathrm{res}}} X_t) = 0$ and (ii) $S^{\mathrm{inv}}$ is opt-invariant on $\mathbb{N}$.

As shown in the following theorem, Assumption 2 ensures that the two sets satisfy a separation of the form (3) for all observed and unobserved time points $t \in \mathbb{N}$, as desired. For unobserved time points $t \in \mathbb{N} \setminus [n]$, Assumption 2 does not require (6) to hold for all $i, j \in \{1, \ldots, q\}$, but only for all $i, j$ such that $S_i \subseteq S^{\mathrm{inv}}$ and $S_j \subseteq S^{\mathrm{res}}$.

**Theorem 1** Assume Assumption 2 is satisfied. Let $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ be defined as in (11) for an arbitrary irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition. Then, it holds for all $t \in \mathbb{N}$ that

$$\gamma_{0,t} = \operatorname*{arg\,max}_{\beta \in S^{\mathrm{inv}}} \overline{\mathrm{Var}}(\beta) + \operatorname*{arg\,max}_{\beta \in S^{\mathrm{res}}} \mathrm{Var}_t(\beta), \tag{12}$$

where $\overline{\mathrm{Var}}(\beta) := \frac{1}{n} \sum_{t=1}^{n} \mathrm{Var}_t(\beta)$. Moreover, if Assumption 1 is satisfied, the separation in (12) is independent of the considered irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition.

The proof of Theorem 1 can be found in Appendix E.1 and relies on the fact that, under Assumption 2, the invariant and residual subspaces form an orthogonal and $(X_t)_{t \in \mathbb{N}}$-decorrelating partition.
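Example 1 can also be checked numerically. The constants below follow our reading of the example (the 30-degree rotation and the parameter path are reconstructions and should be checked against the original); `subspace_maximizer` is our helper implementing the population formula (8) via the pseudoinverse:

```python
import numpy as np

sqrt3 = np.sqrt(3.0)
n = 100

# Quantities of Example 1 (our reconstruction): clockwise 30-degree
# rotation U whose columns span S1 and S2, and eigenvalue sequences.
U = np.array([[sqrt3 / 2, 0.5],
              [-0.5, sqrt3 / 2]])
u1, u2 = U[:, 0], U[:, 1]
P1, P2 = np.outer(u1, u1), np.outer(u2, u2)

rng = np.random.default_rng(0)
s1, s2 = rng.uniform(size=n), rng.uniform(size=n)

def gamma0(t):
    return (t / n - 1.5) * np.array([sqrt3, -1.0]) + np.array([1.0, sqrt3])

def subspace_maximizer(P, Sigma_t, gamma_t):
    # argmax_{beta in S} Var_t(beta) = Var(P X_t)^+ Cov(P X_t, Y_t), cf. (8);
    # at the population level this equals P gamma_{0,t} (Lemma 1).
    return np.linalg.pinv(P @ Sigma_t @ P) @ (P @ Sigma_t @ gamma_t)

for t in range(n):
    Sigma_t = U @ np.diag([s1[t], s2[t]]) @ U.T
    # U jointly diagonalizes every Sigma_t:
    assert np.allclose(U.T @ Sigma_t @ U, np.diag([s1[t], s2[t]]))
    # S2 is opt-invariant: its maximizer is constant in t ...
    assert np.allclose(subspace_maximizer(P2, Sigma_t, gamma0(t)),
                       np.array([1.0, sqrt3]))
# ... while the maximizer over S1 changes with t:
g0 = subspace_maximizer(P1, U @ np.diag([s1[0], s2[0]]) @ U.T, gamma0(0))
g1 = subspace_maximizer(P1, U @ np.diag([s1[99], s2[99]]) @ U.T, gamma0(99))
assert not np.allclose(g0, g1)
```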
#### 2.1.1 Identifying invariant and residual subspaces using joint block diagonalization

We can characterize an irreducible orthogonal partition using joint block diagonalization of the set of covariance matrices $(\Sigma_t)_{t \in [n]}$. Joint block diagonalization of $(\Sigma_t)_{t \in [n]}$ consists of finding an orthogonal matrix $U \in \mathbb{R}^{p \times p}$ such that the matrices $\tilde\Sigma_t := U^\top \Sigma_t U$, $t \in [n]$, are block diagonal and we can choose $q^U$ blocks such that the indices of the corresponding submatrices do not change with $t$ (the entries of the blocks may change with $t$, though). Let $q^U_{\max}$ be the largest number of such blocks and let $(\tilde\Sigma_{t,1})_{t \in [n]}, \ldots, (\tilde\Sigma_{t,q^U_{\max}})_{t \in [n]}$ denote the corresponding common blocks with dimensions $p_1, \ldots, p_{q^U_{\max}}$ that are independent of $t$. We call $U$ a joint block diagonalizer of $(\Sigma_t)_{t \in [n]}$. Moreover, we call $U$ an irreducible joint block diagonalizer if, in addition, for all other joint block diagonalizers $\tilde U \in \mathbb{R}^{p \times p}$ the resulting number of common blocks is at most $q^U_{\max}$. Joint block diagonalization has been considered extensively in the literature (see, for example, Murota et al., 2010; Nion, 2011; Tichavsky and Koldovsky, 2012), and various computationally feasible algorithms have been proposed (see Section A.3 for further details). In our setting, joint block diagonalization can be used to identify the invariant and residual subspaces, since an irreducible joint block diagonalizer $U$ of the covariance matrices $(\Sigma_t)_{t \in [n]}$ corresponds to an irreducible orthogonal partition, as the following proposition shows.

**Proposition 1** (i) Let $U \in \mathbb{R}^{p \times p}$ be a joint block diagonalizer of $(\Sigma_t)_{t \in [n]}$. For all $j \in \{1, \ldots, q^U_{\max}\}$, let $\mathcal{S}_j \subseteq \{1, \ldots, p\}$ denote the subset of indices corresponding to the $j$-th common block $\tilde\Sigma_{t,j}$. Moreover, let $u_k$ denote the $k$-th column of $U$ and, for all $j \in \{1, \ldots, q^U_{\max}\}$, define $S_j := \operatorname{span}\{u_k \mid k \in \mathcal{S}_j\}$. Then, $\{S_j\}_{j=1}^{q^U_{\max}}$ is an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition.
Moreover, if the joint block diagonalizer is irreducible, then the corresponding orthogonal partition is irreducible.

(ii) The converse is also true. Let $\{S_j\}_{j=1}^q$ be an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition. Then, there exists a joint block diagonalizer $U \in \mathbb{R}^{p \times p}$ of $(\Sigma_t)_{t \in [n]}$ such that for all $t \in [n]$ the matrix $\tilde\Sigma_t := U^\top \Sigma_t U$ is block diagonal with $q$ diagonal blocks $\tilde\Sigma_{t,j} = (U_{\mathcal{S}_j})^\top \Sigma_t U_{\mathcal{S}_j}$, $j \in \{1, \ldots, q\}$, of dimension $|\mathcal{S}_j| = \dim(S_j)$, where $\mathcal{S}_j \subseteq \{1, \ldots, p\}$ indexes a subset of the columns of $U$. If the orthogonal partition is irreducible, then $U$ is an irreducible joint block diagonalizer. Moreover, $\Pi_{S_j} = U_{\mathcal{S}_j} (U_{\mathcal{S}_j})^\top$.

If Assumption 1 is satisfied, any irreducible joint block diagonalizer $U$, via its corresponding irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition constructed in Proposition 1 (i), leads to the same (unique) invariant and residual subspaces defined in (11). It is clear that Assumption 1 is automatically satisfied whenever the joint block diagonalization is unique up to trivial indeterminacies, that is, if for all orthogonal matrices $U, \tilde U \in \mathbb{R}^{p \times p}$ that jointly block diagonalize the set $(\Sigma_t)_{t \in [n]}$ into $q^U_{\max}$ irreducible common blocks, it holds that $\tilde U$ is equal to $U$ up to block permutations and block-wise isometric transformations. Explicit conditions under which uniqueness of joint block diagonalization is satisfied can be found, for example, in the works by De Lathauwer (2008); Murota et al. (2010). Intuitively, these conditions are satisfied whenever there is sufficient variability across time in the covariance matrices $(\Sigma_t)_{t \in [n]}$. Given an irreducible joint block diagonalizer $U$, the invariant and residual subspaces can be identified using Proposition 1 and Lemma 1, by checking for all $j \in \{1, \ldots, q^U_{\max}\}$ whether $\Pi_{S_j} \gamma_{0,t} = U_{\mathcal{S}_j} (U_{\mathcal{S}_j})^\top \gamma_{0,t}$ remains constant ($S_j \subseteq S^{\mathrm{inv}}$) or not ($S_j \subseteq S^{\mathrm{res}}$) on $[n]$.
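As a self-contained illustration of the idea (this is not the estimation algorithm proposed in the paper), in the idealized noise-free case a joint block diagonalizer can be obtained from the eigenvectors of a generic positive combination of the $\Sigma_t$, and the common blocks can then be read off as connected components of the joint off-diagonal support of the rotated matrices:

```python
import numpy as np

rng = np.random.default_rng(6)
p, T = 4, 30

# Ground truth: all Sigma_t share a 2+1+1 common block structure in the
# basis of a hidden orthogonal matrix Q (hypothetical noise-free example).
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))
Sigmas = []
for _ in range(T):
    A = rng.normal(size=(2, 2))
    D = np.zeros((p, p))
    D[:2, :2] = A @ A.T + np.eye(2)          # random 2x2 spd block
    D[2, 2] = rng.uniform(10.0, 11.0)        # two 1x1 blocks, well
    D[3, 3] = rng.uniform(20.0, 21.0)        # separated in scale
    Sigmas.append(Q @ D @ Q.T)

# Step 1: eigenvectors of a generic positive combination of the Sigma_t
# respect the common invariant subspaces.
w = rng.uniform(0.5, 1.0, size=T)
U = np.linalg.eigh(sum(wi * S for wi, S in zip(w, Sigmas)))[1]

# Step 2: common blocks = connected components of the joint off-diagonal
# support of the rotated matrices U' Sigma_t U.
M = sum(np.abs(U.T @ S @ U) for S in Sigmas)
adj = M > 1e-8 * M.max()
n_blocks, labels = 0, -np.ones(p, dtype=int)
for i in range(p):
    if labels[i] < 0:
        stack, labels[i] = [i], n_blocks      # flood fill one component
        while stack:
            j = stack.pop()
            for k in np.flatnonzero(adj[j]):
                if labels[k] < 0:
                    labels[k] = n_blocks
                    stack.append(k)
        n_blocks += 1

assert n_blocks == 3                          # recovers the 2+1+1 structure
```

With sample covariances, the rotated matrices are only approximately block diagonal, which is why the paper relies on approximate joint matrix diagonalization tools instead of a hard threshold.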
We denote by $U^{\mathrm{inv}}$ and $U^{\mathrm{res}}$ the submatrices of $U$ formed by the columns that span the invariant and the residual subspace, respectively.

Example 1 (continued) In Example 1 so far, we expressed $\mathcal{S}_1$ and $\mathcal{S}_2$ in terms of their generating vectors. These can be retrieved using Proposition 1 by jointly block diagonalizing the matrices $(\Sigma_t)_{t\in[n]}$. In this specific example, an irreducible joint block diagonalizer is given by
$$U = \begin{pmatrix} 0.5\sqrt{3} & 0.5 \\ -0.5 & 0.5\sqrt{3} \end{pmatrix},$$
which is a (clockwise) rotation matrix of 30 degrees (see Figure 1 and use $S_1 = \{1\}$ and $S_2 = \{2\}$). In particular, we have that $\tilde\Sigma_t = U^\top \Sigma_t U = \operatorname{diag}(\sigma_{1,t}, \sigma_{2,t})$: therefore, $q^U_{\max} = 2$ and each block has dimension 1. Moreover, it holds for all $t\in[n]$ that
$$\Pi_{\mathcal{S}_1}\gamma_{0,t} = U^{S_1}(U^{S_1})^\top \gamma_{0,t} = \begin{pmatrix} \sqrt{3}\,(1.5 - t/n) \\ t/n - 1.5 \end{pmatrix} \quad\text{and}\quad \Pi_{\mathcal{S}_2}\gamma_{0,t} = U^{S_2}(U^{S_2})^\top \gamma_{0,t} = \begin{pmatrix} 1 \\ \sqrt{3} \end{pmatrix},$$
and therefore $S^{\mathrm{inv}} = \mathcal{S}_2$ and $S^{\mathrm{res}} = \mathcal{S}_1$.

2.2 Invariant component

In Theorem 1 we have shown that the true time-varying parameter $\gamma_{0,t}$ can be expressed as the result of two separate optimization problems over the two orthogonal spaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$. In this section we analyze the result of the first optimization step, over the invariant subspace $S^{\mathrm{inv}}$. To ensure that the space $S^{\mathrm{inv}}$ is unique, we assume throughout Section 2.2 that Assumption 1 is satisfied.

Definition 2 (invariant component) We denote the parameter that maximizes the explained variance over the invariant subspace by
$$\beta^{\mathrm{inv}} := \arg\max_{\beta \in S^{\mathrm{inv}}} \mathrm{Var}(\beta). \quad (13)$$

The parameter $\beta^{\mathrm{inv}}$ corresponds to the pooled OLS solution obtained by regressing $Y_t$ on the projected predictors $\Pi_{S^{\mathrm{inv}}} X_t$, and can be computed using all observations in $[n]$. The whole procedure to find $\beta^{\mathrm{inv}}$, by first identifying $S^{\mathrm{inv}}$ via joint block diagonalization and then using Proposition 2(i) below, is summarized in Algorithm 1 (see Section 2.4). Proposition 2 summarizes some of the properties of $\beta^{\mathrm{inv}}$.

Proposition 2 (properties of $\beta^{\mathrm{inv}}$) Under Assumption 1, $\beta^{\mathrm{inv}}$ satisfies the following properties: (i) For all $t\in[n]$, $\beta^{\mathrm{inv}} = \Pi_{S^{\mathrm{inv}}}\gamma_{0,t} = \Pi_{S^{\mathrm{inv}}}\bar\gamma_0$, where $\bar\gamma_0 := \frac{1}{n}\sum_{t=1}^n \gamma_{0,t}$.
(ii) $\beta^{\mathrm{inv}}$ is time-invariant over $[n]$, see Definition 1. (iii) If, in addition, it holds for all $\beta \in \mathbb{R}^p$ time-invariant over $[n]$ that $\beta \in S^{\mathrm{inv}}$, then $\beta^{\mathrm{inv}} = \arg\max_{\beta \in \mathbb{R}^p \text{ time-invariant}} \mathrm{Var}(\beta)$.

Definition 1 guarantees, for all $t\in[n]$, that the explained variance for all $\beta \in \mathbb{R}^p$ time-invariant over $[n]$ is $\mathrm{Var}_t(\beta) = \mathrm{Var}(X_t^\top \beta) = \beta^\top \Sigma_t \beta$. Point (ii) of Proposition 2 therefore implies that, for all $t\in[n]$, $\mathrm{Var}_t(\beta^{\mathrm{inv}}) = (\beta^{\mathrm{inv}})^\top \Sigma_t \beta^{\mathrm{inv}}$. Under Assumption 2, the same holds for all $t\in\mathbb{N}$: $\beta^{\mathrm{inv}}$ is therefore a never-harmful parameter, and for all $t\in\mathbb{N}$ it is a solution to $\arg\max_{\beta \in S^{\mathrm{inv}}} \mathrm{Var}_t(\beta)$. Moreover, Proposition 2(iii) implies that, under an additional assumption, the parameter $\beta^{\mathrm{inv}}$ is optimal, i.e., maximizes the explained variance, among all time-invariant parameters. In particular, $\beta^{\mathrm{inv}}$ represents an interesting target of inference: it can be used for zero-shot prediction if no adaptation data is available at time $t$ to solve the second part of the optimization in (12) over $S^{\mathrm{res}}$. Using $U^{\mathrm{inv}}$ as defined in Section 2.1.1, we can express $\beta^{\mathrm{inv}}$ as follows:
$$\beta^{\mathrm{inv}} = U^{\mathrm{inv}}\big((U^{\mathrm{inv}})^\top \mathrm{Var}(X)\, U^{\mathrm{inv}}\big)^{-1} (U^{\mathrm{inv}})^\top \mathrm{Cov}(X, Y), \quad (14)$$
where $\mathrm{Var}(X) := \frac{1}{n}\sum_{t=1}^n \mathrm{Var}(X_t)$ and $\mathrm{Cov}(X, Y) := \frac{1}{n}\sum_{t=1}^n \mathrm{Cov}(X_t, Y_t)$. We derive this expression in Lemma 10 and later use it for estimation in Section 4.2.

Example 1 (continued) Considering again Example 1, we have $S^{\mathrm{inv}} = \mathcal{S}_2$, and therefore the invariant component is given by
$$\beta^{\mathrm{inv}} = \arg\max_{\beta \in \mathcal{S}_2} \mathrm{Var}(\beta) = \Pi_{\mathcal{S}_2}\bar\gamma_0 = \begin{pmatrix} 1 \\ \sqrt{3} \end{pmatrix}.$$
Moreover, we can express $\beta^{\mathrm{inv}}$ in the basis for the irreducible subspaces using the irreducible joint block diagonalizer $U$, obtaining $U^\top \beta^{\mathrm{inv}} = (0, 2)^\top$: the only non-zero component is indeed the one corresponding to the invariant subspace $\mathcal{S}_2$.

2.3 Residual component and time adaptation

Under the generalization assumption (Assumption 2), Theorem 1 implies that we can partially explain the variance of the response $Y_t$ at all observed and unobserved time points $t\in\mathbb{N}$ using $\beta^{\mathrm{inv}}$.
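The pooled expression in (14) can be sketched directly with empirical moments plugged in. The following is a minimal illustration on synthetic data, assuming the matrix $U^{\mathrm{inv}}$ is known; the function name is ours, not the paper's.

```python
import numpy as np

def invariant_component(X, Y, U_inv):
    """Pooled OLS of Y on the predictors projected onto span(U_inv), cf. (14)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    var_X = Xc.T @ Xc / len(X)        # empirical stand-in for pooled Var(X)
    cov_XY = Xc.T @ Yc / len(X)       # empirical stand-in for pooled Cov(X, Y)
    A = U_inv.T @ var_X @ U_inv
    return U_inv @ np.linalg.solve(A, U_inv.T @ cov_XY)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
U_inv = np.eye(3)[:, :2]              # assume the first two axes span S_inv
Y = X @ np.array([1.0, -2.0, 0.0])    # noiseless toy response in span(U_inv)
beta = invariant_component(X, Y, U_inv)  # approximately (1, -2, 0)
```

Because the true parameter already lies in the assumed invariant span, the projected pooled regression recovers it exactly in this noiseless toy case.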
It also implies that for all $t\in\mathbb{N}$ we can reconstruct the true time-varying parameter by adding to $\beta^{\mathrm{inv}}$ a residual parameter that maximizes the explained variance at time $t$ over the residual subspace $S^{\mathrm{res}}$, i.e., $\gamma_{0,t} = \beta^{\mathrm{inv}} + \arg\max_{\beta \in S^{\mathrm{res}}} \mathrm{Var}_t(\beta)$. In this section we focus on the second optimization step, over $S^{\mathrm{res}}$. We assume that Assumptions 1 and 2 hold, and for all $t\in\mathbb{N}$ we define the residual component $\delta^{\mathrm{res}}_t := \arg\max_{\beta \in S^{\mathrm{res}}} \mathrm{Var}_t(\beta)$. Using (7), (8) and (11), we can express $\delta^{\mathrm{res}}_t$ as
$$\delta^{\mathrm{res}}_t = \mathrm{Var}(\Pi_{S^{\mathrm{res}}} X_t)^{\dagger}\, \mathrm{Cov}(\Pi_{S^{\mathrm{res}}} X_t, Y_t) \quad (15)$$
$$\phantom{\delta^{\mathrm{res}}_t} = \mathrm{Var}(\Pi_{S^{\mathrm{res}}} X_t)^{\dagger}\, \mathrm{Cov}(\Pi_{S^{\mathrm{res}}} X_t, Y_t - X_t^\top \beta^{\mathrm{inv}}).$$
We can now reduce the number of parameters that need to be estimated by expressing $\delta^{\mathrm{res}}_t$ as an OLS solution with $\dim(S^{\mathrm{res}})$ parameters. To see this, we use that, under Assumption 1, Proposition 1 allows us to express the space $S^{\mathrm{res}}$ in terms of an irreducible joint block diagonalizer $U$ corresponding to the irreducible orthogonal partition $\{\mathcal{S}_j\}_{j=1}^{q^U_{\max}}$ as $S^{\mathrm{res}} = \operatorname{span}\{u_k \mid \exists j \in \{1, \dots, q^U_{\max}\}: \mathcal{S}_j \text{ not opt-invariant and } k \in S_j\}$. Moreover, the orthogonal projection matrix onto $S^{\mathrm{res}}$ is given by $\Pi_{S^{\mathrm{res}}} = U^{\mathrm{res}}(U^{\mathrm{res}})^\top$. Hence, using Lemma 5 we get that
$$\delta^{\mathrm{res}}_t = U^{\mathrm{res}}\, \mathrm{Var}\big((U^{\mathrm{res}})^\top X_t\big)^{-1} \mathrm{Cov}\big((U^{\mathrm{res}})^\top X_t,\, Y_t - X_t^\top \beta^{\mathrm{inv}}\big), \quad (16)$$
where $\mathrm{Var}((U^{\mathrm{res}})^\top X_t)^{-1}\mathrm{Cov}((U^{\mathrm{res}})^\top X_t, Y_t - X_t^\top \beta^{\mathrm{inv}})$ is the population ordinary least squares solution obtained by regressing $Y_t - X_t^\top \beta^{\mathrm{inv}}$ on $(U^{\mathrm{res}})^\top X_t \in \mathbb{R}^{\dim(S^{\mathrm{res}})}$.

Example 1 (continued) In Example 1, we have that for all $t\in[n]$ the residual component $\delta^{\mathrm{res}}_t$ is given by
$$\delta^{\mathrm{res}}_t = \arg\max_{\beta \in \mathcal{S}_1} \mathrm{Var}_t(\beta) = \Pi_{\mathcal{S}_1}\gamma_{0,t} = \begin{pmatrix} \sqrt{3}\,(1.5 - t/n) \\ t/n - 1.5 \end{pmatrix}.$$
Moreover, we can express $\delta^{\mathrm{res}}_t$ in the irreducible orthogonal partition basis (given by the two vectors generating $\mathcal{S}_1$ and $\mathcal{S}_2$), obtaining $U^\top \delta^{\mathrm{res}}_t = (3 - 2t/n,\, 0)^\top$, which indeed has only one degree of freedom, in the first component. This component takes the following values: for all $t\in[n]$ we have $3 - 2t/n \in [1, 3]$.

2.4 Population ISD algorithm

We call the procedure to identify $\beta^{\mathrm{inv}}$ and $\delta^{\mathrm{res}}_t$ invariant subspace decomposition (ISD).
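The reduction in (16) amounts to an OLS regression of the invariant-part residual on the lower-dimensional predictors $(U^{\mathrm{res}})^\top X_t$. A minimal sketch with empirical moments, assuming $U^{\mathrm{res}}$ and $\beta^{\mathrm{inv}}$ are given (the function name and toy data are ours):

```python
import numpy as np

def residual_component(X, Y, U_res, beta_inv):
    """Empirical analogue of (16): regress Y - X beta_inv on (U_res)^T X."""
    Z = X @ U_res                      # dim(S_res)-dimensional predictors
    R = Y - X @ beta_inv               # part of Y not explained by beta_inv
    Zc = Z - Z.mean(axis=0)
    Rc = R - R.mean()
    coef = np.linalg.solve(Zc.T @ Zc, Zc.T @ Rc)
    return U_res @ coef                # map back to R^p

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
beta_inv = np.array([1.0, -2.0, 0.0])
delta_true = np.array([0.0, 0.0, 3.0])   # residual variation on the third axis
Y = X @ (beta_inv + delta_true)
delta = residual_component(X, Y, np.eye(3)[:, 2:], beta_inv)  # approx. (0, 0, 3)
```

Note that only $\dim(S^{\mathrm{res}})$ coefficients are estimated, which is exactly what makes the later time-adaptation step data-efficient.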
By construction, the result of ISD in its population version is equal to the true time-varying parameter at time $t\in\mathbb{N}$, i.e.,
$$\gamma_{0,t} = \beta^{\mathrm{inv}} + \delta^{\mathrm{res}}_t. \quad (17)$$
The full ISD procedure is summarized in Algorithm 1. In the algorithm, the invariant and residual subspaces are identified through joint block diagonalization as described at the end of Section 2.1.1.

Algorithm 1 Population ISD
Input: distributions of $(X_t, Y_t)_{t\in[n]}$
Output: $\beta^{\mathrm{inv}}$, $\delta^{\mathrm{res}}_n$
1: $(\Sigma_t)_{t\in[n]} \leftarrow \{\mathrm{Var}(X_t)\}_{t\in[n]}$
2: $\{\gamma_{0,t}\}_{t\in[n]} \leftarrow \{\mathrm{Var}(X_t)^{-1}\mathrm{Cov}(X_t, Y_t)\}_{t\in[n]}$
3: $U, \{\mathcal{S}_j\}_{j=1}^{q} \leftarrow$ IrreducibleJointBlockDiagonalizer$\big((\Sigma_t)_{t\in[n]}\big)$ ▷ Prop. 1
4: Find the opt-invariant subspaces to identify $S^{\mathrm{inv}}$, $S^{\mathrm{res}}$:
5: $S^{\mathrm{inv}}, S^{\mathrm{res}} \leftarrow \emptyset$
6: for $j = 1, \dots, q$ do
7: $\quad \Pi_{\mathcal{S}_j} \leftarrow U^{S_j}(U^{S_j})^\top$
8: $\quad$ if $\Pi_{\mathcal{S}_j}\gamma_{0,t}$ constant in $[n]$ then $S^{\mathrm{inv}} \leftarrow S^{\mathrm{inv}} \oplus \mathcal{S}_j$ ▷ see Prop. 2
9: $\quad$ else $S^{\mathrm{res}} \leftarrow S^{\mathrm{res}} \oplus \mathcal{S}_j$
10: Define estimators for the invariant and residual component of $\gamma_{0,t}$:
11: $\Pi_{S^{\mathrm{inv}}} \leftarrow U^{\mathrm{inv}}(U^{\mathrm{inv}})^\top$
12: $\beta^{\mathrm{inv}} \leftarrow \frac{1}{n}\sum_{t=1}^n \mathrm{Var}(\Pi_{S^{\mathrm{inv}}} X_t)^{\dagger}\, \mathrm{Cov}(\Pi_{S^{\mathrm{inv}}} X_t, Y_t)$ ▷ see Prop. 2, Lemmas 1, 2
13: $\delta^{\mathrm{res}}_n \leftarrow U^{\mathrm{res}}\, \mathrm{Var}((U^{\mathrm{res}})^\top X_n)^{-1}\, \mathrm{Cov}((U^{\mathrm{res}})^\top X_n,\, Y_n - X_n^\top \beta^{\mathrm{inv}})$ ▷ see Eq. (16)

The number of subspaces $q$ in the irreducible orthogonal and $(X_t)_{t\in[n]}$-decorrelating partition is inferred by the joint block diagonalization algorithm. If Assumption 1 is not satisfied, then the decomposition in (17), and therefore the output of Algorithm 1, depends on the irreducible orthogonal and $(X_t)_{t\in[n]}$-decorrelating partition used. In Appendix C we show that a decomposition of the true time-varying parameter similar to (17) is also obtained when considering $(X_t)_{t\in[n]}$-decorrelating partitions that are not orthogonal, i.e., such that (6) holds but the subspaces in the partition are not necessarily pairwise orthogonal. In this case, in particular, we can still find an invariant and a residual parameter as in (14) and (16), respectively, where the matrix $U$ is now a non-orthogonal irreducible joint block diagonalizer.
Remark 2 (changes of $\gamma_{0,t}$ and $\Sigma_t$ over time) For the ISD procedure to work in practice, we assume that the covariance matrices $\Sigma_t$ are approximately constant in small time windows (e.g., smoothly changing or piece-wise constant). This is required to obtain accurate estimates of $\Sigma_t$, which we then (approximately) jointly block diagonalize (see Section 4.1.1). Similarly, we assume that $\gamma_{0,t}$ is approximately constant in the adaptation window, in order to perform the adaptation step (see Setting 1). In the historical data, however, $\gamma_{0,t}$ can vary arbitrarily fast in the residual subspace ($\beta^{\mathrm{inv}}$ remains constant, of course). Intuitively, even if $\gamma_{0,t}$ does not change in a structured way, its projection $\Pi_{S^{\mathrm{inv}}}\gamma_{0,t}$ onto the invariant subspace remains constant across time points. We provide an example in which $\gamma_{0,t}$ varies quickly in Appendix A.2.

3. Analysis of the two ISD tasks: zero-shot generalization and time adaptation

We now analyze the two prediction tasks that can be solved with the ISD framework introduced in the previous sections, namely (i) the zero-shot task and (ii) the adaptation task. We consider Setting 1 and assume for (i) that only the historical data is available, while for (ii) we assume to also have access to the adaptation data. Throughout the remainder of this section we assume that the invariant and residual subspaces are known (they can be computed from the joint distributions), which simplifies the theoretical analysis.

3.1 Zero-shot task

We start by analyzing the zero-shot task, in which no adaptation data are observed, but only historical data and $X_{t^\star}$. Under the generalization assumption (Assumption 2), $S^{\mathrm{inv}}$ is opt-invariant on $\mathbb{N}$, and we can characterize all possible models defined as in (1) by the possible variations of the true time-varying parameter in the residual space $S^{\mathrm{res}}$ or, equivalently, by all possible values of $\gamma_{0,t^\star} - \beta^{\mathrm{inv}} \in S^{\mathrm{res}}$. We then obtain that $\beta^{\mathrm{inv}}$ is worst-case optimal in the following sense.
Theorem 2 Under Assumptions 1 and 2, it holds for all $t^\star \in \mathbb{N}$ that
$$\beta^{\mathrm{inv}} = \arg\max_{\beta \in \mathbb{R}^p}\ \inf_{\gamma_{0,t^\star} \in \mathbb{R}^p:\ \gamma_{0,t^\star} - \beta^{\mathrm{inv}} \in S^{\mathrm{res}}} \mathrm{Var}_{t^\star}(\beta).$$
We further obtain from Proposition 2(iii) that, under an additional condition, $\beta^{\mathrm{inv}}$ is worst-case optimal among all time-invariant parameters. This characterization of $\beta^{\mathrm{inv}}$ allows for a direct comparison with the maximin effect by Meinshausen and Bühlmann (2015), which we report in more detail in Remark 3. In the absence of further information on $\delta^{\mathrm{res}}_{t^\star}$, Theorem 2 suggests using an estimate $\hat\beta^{\mathrm{inv}}$ of $\beta^{\mathrm{inv}}$ to predict $Y_{t^\star}$, i.e., $\hat Y_{t^\star} = X_{t^\star}^\top \hat\beta^{\mathrm{inv}}$.

Remark 3 (relation to maximin) The maximin framework introduced by Meinshausen and Bühlmann (2015) considers a linear regression model with varying coefficients, where the variations do not necessarily happen in a structured way, e.g., in time. Translated to our notation and restricting to time-based changes, their model can be expressed for $t\in[n]$ as $Y_t = X_t^\top \gamma_{0,t} + \epsilon_t$, where $\mathbb{E}[\epsilon_t \mid X_t] = 0$ and the covariance matrix $\Sigma := \mathrm{Var}(X_t)$ does not vary with $t$. The maximin estimator maximizes the explained variance in the worst (most adversarial) scenario that is covered in the training data; it is defined as
$$\beta^{\mathrm{mm}} := \arg\max_{\beta \in \mathbb{R}^p}\ \min_{t\in[n]} \mathrm{Var}_t(\beta).$$
The maximin estimator guarantees that, for all $t\in[n]$, $\mathrm{Cov}(Y_t - X_t^\top \beta^{\mathrm{mm}}, X_t^\top \beta^{\mathrm{mm}}) \geq 0$ and therefore $\mathrm{Var}_t(\beta^{\mathrm{mm}}) \geq 0$ (see Meinshausen and Bühlmann, 2015, Equation (8)). This means that the maximin estimator relies on a weaker notion of invariance that only requires the left-hand side of (5) to be non-negative instead of zero. This implies that $\beta^{\mathrm{inv}}$ is in general more conservative than $\beta^{\mathrm{mm}}$ in the sense that, for all $t\in[n]$, $\mathrm{Var}_t(\beta^{\mathrm{mm}}) \geq \mathrm{Var}_t(\beta^{\mathrm{inv}})$. By finding the invariant and residual subspaces, we determine the domain in which $\gamma_{0,t}$ varies and assume (Assumption 2) that this does not change even at unobserved time points.
The parameter $\beta^{\mathrm{inv}}$ is then worst-case optimal over this domain and guarantees that the explained variance remains positive even for scenarios that are more adversarial than the ones observed in $[n]$, i.e., such that $\mathrm{Var}_s(\beta^{\mathrm{mm}}) < \min_{t\in[n]} \mathrm{Var}_t(\beta^{\mathrm{mm}})$ for some $s \in \{n+1, n+2, \dots\}$ (see, for example, the results of the experiment described in Section 5.1 and shown in Figure 2). Furthermore, the decomposition allows for time adaptation (see Section 2.3), which would not be possible starting from the maximin effect.

3.2 Adaptation task

We now consider the adaptation task in which, in addition to the historical data, adaptation data are available. Adaptation data can be used to define an estimator for the residual component $\delta^{\mathrm{res}}_{t^\star}$, which we denote by $\hat\delta^{\mathrm{res}}_{t^\star}$. This, together with an estimator $\hat\beta^{\mathrm{inv}}$ for $\beta^{\mathrm{inv}}$ fitted on the historical data, gives us an estimator $\hat\gamma^{\mathrm{ISD}}_{t^\star} := \hat\beta^{\mathrm{inv}} + \hat\delta^{\mathrm{res}}_{t^\star}$ for the true time-varying parameter. We compare this estimator with a generic estimator $\hat\gamma_{t^\star}$ for $\gamma_{0,t^\star}$ which uses only the adaptation data. To do so, we consider the minimax lower bound provided by Mourtada (2022, Theorem 1) for the expected squared prediction error of an estimator $\hat\gamma$ computed using $n$ i.i.d. observations of a random covariate vector $X \in \mathbb{R}^p$ and of the corresponding response $Y$. It is given by
$$\inf_{\hat\gamma}\ \sup_{P \in \mathcal{P}}\ \mathbb{E}_P\big[(X^\top(\hat\gamma - \gamma))^2\big] \geq \sigma^2\, \frac{p}{n}, \quad (18)$$
where $\mathcal{P}$ is the class of distributions over $(X, Y)$ such that $Y = X^\top \gamma + \epsilon$, $\mathbb{E}[\epsilon \mid X] = 0$ and $\mathbb{E}[\epsilon^2 \mid X] < \sigma^2$. It follows from (18) that, for a generic estimator $\hat\gamma_{t^\star}$ of $\gamma_{0,t^\star}$ based on the adaptation data alone, we can at best expect to achieve a prediction error of $\sigma^2_{\mathrm{ad}}\, \frac{p}{m}$, where $\sigma^2_{\mathrm{ad}}$ is the variance of $\epsilon_t$ for $t \in I_{\mathrm{ad}}$ (which is assumed to be constant). We can improve on this if we allow the estimator $\hat\gamma_{t^\star}$ to also depend on the historical data. To see this, observe that we can always decompose $\hat\gamma_{t^\star} = \hat\beta + \hat\delta_{t^\star}$ with $\hat\beta \in S^{\mathrm{inv}}$ and $\hat\delta_{t^\star} \in S^{\mathrm{res}}$.
Moreover, under Assumption 2 we can split the expected prediction error of $\hat\gamma_{t^\star}$ at $t^\star$ accordingly as
$$\mathbb{E}\big[(X_{t^\star}^\top(\hat\gamma_{t^\star} - \gamma_{0,t^\star}))^2\big] = \mathbb{E}\big[((\Pi_{S^{\mathrm{inv}}} X_{t^\star})^\top(\hat\beta - \beta^{\mathrm{inv}}))^2\big] + \mathbb{E}\big[((\Pi_{S^{\mathrm{res}}} X_{t^\star})^\top(\hat\delta_{t^\star} - \delta^{\mathrm{res}}_{t^\star}))^2\big].$$
Then, $\hat\beta$ represents an estimator for $\beta^{\mathrm{inv}}$ and can be computed on historical data³, whereas $\hat\delta_{t^\star}$ estimates $\delta^{\mathrm{res}}_{t^\star}$ and is based on adaptation data: by decomposing $\hat\gamma_{t^\star}$ in this way, the best prediction error we can hope for is of the order $\frac{\dim(S^{\mathrm{inv}})}{n} + \frac{\dim(S^{\mathrm{res}})}{m}$. In Theorem 3 in Section 4, we prove that $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ indeed achieves this bound. If the invariant subspace is non-degenerate, so that $\dim(S^{\mathrm{res}}) < p$, and $n$ is sufficiently large, this implies that $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ has better finite sample performance than estimators based on the adaptation data alone.

3. In principle, an estimator for $\beta^{\mathrm{inv}}$ could be computed using both historical and adaptation data. However, the ISD procedure is motivated by scenarios in which the size of the historical data is very large (and $n \gg m$): this means that it could be computationally costly to update the estimate for the invariant component every time new adaptation data become available, without a significant gain in estimation accuracy. For this reason, we only consider estimators for $\beta^{\mathrm{inv}}$ that use historical data.

4. ISD estimator and its finite sample generalization guarantee

We now construct an empirical estimation procedure for the ISD framework, based on the results of Section 2.2 and on Algorithm 1 described in Section 2.4. Throughout this section, we assume that Assumptions 1 and 2 are satisfied. We assume that we observe both historical and adaptation data as in the adaptation task. We use the historical data to first estimate the decomposition of $\mathbb{R}^p$ into $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$, employing a joint block diagonalization algorithm (Section 4.1). We then use the resulting decomposition to estimate $\beta^{\mathrm{inv}}$. Finally, we use the adaptation data to construct an estimator for $\delta^{\mathrm{res}}_{t^\star}$ in Section 4.2.
In Section 4.3 we then show the advantage of separating the optimization as in (3) to estimate $\gamma_{0,t^\star}$ at the previously unobserved time point $t^\star$, by providing finite sample guarantees for the ISD estimator.

4.1 Estimating the subspace decomposition

4.1.1 Approximate joint block diagonalization

We first need to find a good estimator for the covariance matrices $\Sigma_t$. Since only one observation $(X_t, Y_t)$ is available at each time step $t\in[n]$, some further assumptions are needed about how $\Sigma_t$ varies over time. Here, we assume that $\Sigma_t$ varies smoothly with $t\in[n]$ and is therefore approximately constant in small time windows. We can then consider a rolling window approach, i.e., consider $K$ windows in $[n]$ of length $w \ll n$ over which the constant approximation is deemed valid and, for the $k$-th time window, $k \in \{1, \dots, K\}$, take the sample covariance $\hat\Sigma_k$ as an estimator for $\Sigma_t$ in that time window. Given the set of estimated covariance matrices $\{\hat\Sigma_k\}_{k=1}^K$, we now need to estimate an orthogonal transformation $\hat U$ that approximately jointly block diagonalizes them. We provide an overview of joint block diagonalization methods in Section A.3 in the appendix. In our simulated settings, we solve the approximate joint block diagonalization (AJBD) problem via approximate joint diagonalization (AJD), since we found this approach to represent a good trade-off between computational complexity and accuracy. In more detail, similarly to what is proposed by Tichavsky and Koldovsky (2012), we start from the output matrix $V \in \mathbb{R}^{p\times p}$ of the uwedge algorithm by Tichavsky and Yeredor (2008), which solves AJD for the set $\{\hat\Sigma_k\}_{k=1}^K$, i.e., it is such that for all $k \in \{1, \dots, K\}$ the matrix $V^\top \hat\Sigma_k V$ is approximately diagonal. We then use the off-diagonal elements of the approximately diagonalized covariance matrices to identify the common diagonal blocks. As described in more detail in Section A.3, this is achieved by finding an appropriate permutation matrix $P \in \mathbb{R}^{p\times p}$ for the columns of $V$ such that for all $k \in \{1, \dots, K\}$ the matrix $(VP)^\top \hat\Sigma_k (VP)$ is approximately joint block diagonal. The estimated irreducible joint block diagonalizer is then given by $\hat U = VP$. We denote by $q^{\hat U}_{\max}$ the number of estimated diagonal blocks and by $\{\hat{\mathcal{S}}_j\}_{j=1}^{q^{\hat U}_{\max}}$ the estimated irreducible orthogonal and $(X_t)_{t\in[n]}$-decorrelating partition.

4.1.2 Estimating the invariant and residual subspaces

We now estimate the invariant and residual subspaces using the estimated irreducible joint block diagonalizer $\hat U$. To do so, we first estimate the true time-varying parameter $\gamma_{0,t}$ using considerations similar to the ones made in Section 4.1.1. We assume, for example, that $\gamma_{0,t}$ is approximately constant in small windows (for simplicity, we consider the same $K$ windows defined in Section 4.1.1). This assumption is helpful to define the estimation procedure, but is not strictly necessary for the ISD framework to work. We show in Appendix A.2 that the procedure can still work if this assumption is violated. We then compute the regression coefficient $\hat\gamma_k$ of $\mathbf{Y}_k$ on $\mathbf{X}_k$, where $\mathbf{Y}_k \in \mathbb{R}^{w\times 1}$ and $\mathbf{X}_k \in \mathbb{R}^{w\times p}$ contain the observations in the $k$-th time window⁴. We use the estimates $\hat\gamma_k$ to determine which of the subspaces identified by $\hat U$ are opt-invariant. For all $j \in \{1, \dots, q^{\hat U}_{\max}\}$, we take the sets of indices $S_j$ as defined in Proposition 1, and consider the estimated orthogonal projection matrices $\Pi_{\hat{\mathcal{S}}_j} = \hat U^{S_j}(\hat U^{S_j})^\top$ from $\mathbb{R}^p$ onto the $j$-th subspace $\hat{\mathcal{S}}_j$ of the estimated irreducible orthogonal partition. It follows from Lemma 1 that we can find the opt-invariant subspaces by checking for all $j \in \{1, \dots, q^{\hat U}_{\max}\}$ whether $\Pi_{\hat{\mathcal{S}}_j}\hat\gamma_k$ remains approximately constant for $k \in \{1, \dots, K\}$. To do so, let
$$\hat\gamma := \Big(\sum_{k=1}^K \mathrm{Var}(\hat\gamma_k)^{-1}\Big)^{-1} \sum_{k=1}^K \mathrm{Var}(\hat\gamma_k)^{-1}\hat\gamma_k$$
be the average of the estimated regression coefficients, inversely weighted by their variance.
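The two rolling-window estimation steps above can be sketched as follows. This is a simplified illustration on synthetic data: the windows are taken to be equal-length and non-overlapping, and the inverse-variance weighting of the averaged coefficient is replaced by a plain mean; the function name is ours.

```python
import numpy as np

def windowed_estimates(X, Y, K):
    """Per-window sample covariances Sigma_hat_k and OLS coefficients gamma_hat_k."""
    n, _ = X.shape
    w = n // K
    covs, gammas = [], []
    for k in range(K):
        Xk, Yk = X[k * w:(k + 1) * w], Y[k * w:(k + 1) * w]
        Xc = Xk - Xk.mean(axis=0)
        covs.append(Xc.T @ Xc / (w - 1))                       # Sigma_hat_k
        gammas.append(np.linalg.lstsq(Xk, Yk, rcond=None)[0])  # gamma_hat_k
    return covs, np.array(gammas)

rng = np.random.default_rng(3)
X = rng.normal(size=(800, 2))
Y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=800)
covs, gammas = windowed_estimates(X, Y, K=8)
gamma_bar = gammas.mean(axis=0)   # unweighted stand-in for the weighted average
```

The covariances `covs` would then be passed to an approximate joint (block) diagonalization routine such as uwedge, and the coefficients `gammas` to the opt-invariance check below.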
We further use that, by Lemma 9, if $\hat{\mathcal{S}}_j$ is opt-invariant on $[n]$, then the weighted average $\Pi_{\hat{\mathcal{S}}_j}\hat\gamma$ of the (approximately constant) projected regression coefficients $\Pi_{\hat{\mathcal{S}}_j}\hat\gamma_k$ approximately satisfies the time-invariance constraint (5). This approach is motivated by Proposition 2(iii). If the corresponding assumption cannot be assumed to hold, other methods can alternatively be used to determine whether $\Pi_{\hat{\mathcal{S}}_j}\hat\gamma_k$ is constant, e.g., checking its gradient or variance. In (5), we can equivalently consider the correlation in place of the covariance, that is, $\mathrm{corr}(Y_t - X_t^\top \beta, X_t^\top \beta) = 0$: an estimate of this correlation allows us to obtain a normalized measure of (5) that is comparable across different experiments. Formally, we consider for all $k \in \{1, \dots, K\}$ and for all $j \in \{1, \dots, q^{\hat U}_{\max}\}$
$$\hat c^j_k := \widehat{\mathrm{Corr}}\big(\mathbf{Y}_k - \mathbf{X}_k(\Pi_{\hat{\mathcal{S}}_j}\hat\gamma),\ \mathbf{X}_k(\Pi_{\hat{\mathcal{S}}_j}\hat\gamma)\big)$$
and check, for all $j \in \{1, \dots, q^{\hat U}_{\max}\}$, whether
$$|\hat c^j_k| \leq \lambda \quad (19)$$
for some small threshold $\lambda \in [0, 1]$. The threshold $\lambda$ can be chosen, for example, using cross-validation (more details are provided in Section A.4 in the appendix). An estimator of the invariant and residual subspaces is then given by
$$\hat S^{\mathrm{inv}} = \bigoplus_{j \in \{1, \dots, q^{\hat U}_{\max}\}:\ (19)\text{ is satisfied}} \hat{\mathcal{S}}_j, \qquad \hat S^{\mathrm{res}} = \bigoplus_{j \in \{1, \dots, q^{\hat U}_{\max}\}:\ (19)\text{ is not satisfied}} \hat{\mathcal{S}}_j,$$
where we approximate opt-invariance with the inequality (19) being satisfied.

4.2 Estimating the invariant and residual components

Let $\hat U^{\mathrm{inv}}$ and $\hat U^{\mathrm{res}}$ be the submatrices of $\hat U$ whose columns span $\hat S^{\mathrm{inv}}$ and $\hat S^{\mathrm{res}}$, respectively. We propose to estimate $\beta^{\mathrm{inv}}$ using the following plug-in estimator for (14):
$$\hat\beta^{\mathrm{inv}} := \hat U^{\mathrm{inv}}\big((\hat U^{\mathrm{inv}})^\top \mathbf{X}^\top \mathbf{X}\, \hat U^{\mathrm{inv}}\big)^{-1}(\hat U^{\mathrm{inv}})^\top \mathbf{X}^\top \mathbf{Y}. \quad (21)$$

4. We have so far omitted the intercept in our linear model, but it can be included by adding a constant term to $X_t$ when estimating the linear parameters. We explicitly show how to take the intercept into account in Section B in the appendix.
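The correlation-based check in (19) can be sketched as a single function. This is an illustrative simplification: it tests one candidate parameter on the full sample rather than per window, and the threshold value is an assumption (the paper selects it by cross-validation).

```python
import numpy as np

def passes_invariance_check(X, Y, beta, lam=0.05):
    """Estimate corr(Y - X beta, X beta) and compare |corr| to threshold lam, cf. (19)."""
    fitted = X @ beta
    resid = Y - fitted
    return abs(np.corrcoef(resid, fitted)[0, 1]) <= lam

rng = np.random.default_rng(4)
X = rng.normal(size=(20000, 2))
beta_true = np.array([1.0, 0.0])
Y = X @ beta_true + 0.5 * rng.normal(size=20000)
# beta_true leaves residuals uncorrelated with the fit; a shrunken parameter
# such as (0.5, 0) does not, so it fails the check.
```

Under the true parameter the residual is pure noise, so the estimated correlation is near zero; any parameter that under- or over-shoots along a predictive direction leaves correlation between residual and fit.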
We now consider the new observation of the covariates $X_{t^\star}$ at time $t^\star$ and the adaptation data $(X_t, Y_t)_{t \in I_{\mathrm{ad}}}$ introduced in Setting 1, and denote by $\mathbf{X}^{\mathrm{ad}} \in \mathbb{R}^{m\times p}$ and $\mathbf{Y}^{\mathrm{ad}} \in \mathbb{R}^{m\times 1}$ the matrices containing these adaptation data. Similarly to $\hat\beta^{\mathrm{inv}}$, using (16) we obtain the following plug-in estimator for $\delta^{\mathrm{res}}_{t^\star}$:
$$\hat\delta^{\mathrm{res}}_{t^\star} := \hat U^{\mathrm{res}}\big((\hat U^{\mathrm{res}})^\top (\mathbf{X}^{\mathrm{ad}})^\top \mathbf{X}^{\mathrm{ad}}\, \hat U^{\mathrm{res}}\big)^{-1}(\hat U^{\mathrm{res}})^\top (\mathbf{X}^{\mathrm{ad}})^\top \big(\mathbf{Y}^{\mathrm{ad}} - \mathbf{X}^{\mathrm{ad}}\hat\beta^{\mathrm{inv}}\big). \quad (22)$$
We can now define the ISD estimator for the true time-varying parameter at $t^\star$ as
$$\hat\gamma^{\mathrm{ISD}}_{t^\star} := \hat\beta^{\mathrm{inv}} + \hat\delta^{\mathrm{res}}_{t^\star}. \quad (23)$$
A prediction of the response $Y_{t^\star}$ is then given by $\hat Y_{t^\star} = X_{t^\star}^\top \hat\gamma^{\mathrm{ISD}}_{t^\star}$. We provide pseudocode summarizing the whole ISD estimation procedure in Appendix B.

Example 1 (continued) Assume that in Example 1 the true time-varying parameter at time points $t \in \{n+1, n+2, \dots\}$ is given by
$$\gamma^{\mathrm{ad}}_{0,t} = \begin{pmatrix} 1 + 0.5\sqrt{3}\,\big(1 - 3\,\frac{t-n}{T}\sin^2\!\big(\frac{t-n}{T}\big)\big) \\[2pt] \sqrt{3} - 0.5 + 1.5\,\frac{t-n}{T}\sin^2\!\big(\frac{t-n}{T}\big) \end{pmatrix}$$
for some $T \in \mathbb{N}$. We can verify that $\Pi_{\mathcal{S}_2}\gamma^{\mathrm{ad}}_{0,t} = \Pi_{\mathcal{S}_2}\gamma_{0,t} = \beta^{\mathrm{inv}}$, and therefore the invariant and residual subspaces defined on $[n]$ generalize to $\{n+1, n+2, \dots\}$ and Assumption 2 is satisfied. Moreover, we obtain that for $t \in \{n+1, \dots\}$ the residual component expressed in the irreducible orthogonal partition basis is $U^\top \delta^{\mathrm{res}}_t = \big(1 - 3\,\frac{t-n}{T}\sin^2(\frac{t-n}{T}),\ 0\big)^\top$. The first entry now takes values in $[-1.5, 1]$, which is disjoint from the range of values of the first coordinate observed in $[n]$ (which was $[1, 3]$). We now take $T = 350$ and consider an online setup in which we sequentially observe $X_{t^\star}$ at a new time point $t^\star \in \{n+1, \dots\}$, and assume that $Y_t$ is observed until $t = t^\star - 1$. We take as historical data the observations on $[n]$, and consider as adaptation data the observations in windows $I_{\mathrm{ad}} = \{t^\star - m, \dots, t^\star - 1\}$ of length $m = 16$. After estimating $\beta^{\mathrm{inv}}$ on the historical data, we use the adaptation data to estimate $\delta^{\mathrm{res}}_{t^\star}$ and the OLS solution $\gamma^{\mathrm{OLS}}_{t^\star}$. We repeat this online step 350 times (each time increasing $t^\star$ by one). Figure 1 shows the results of this experiment.
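The estimators (21), (22) and (23) compose into a short end-to-end sketch. The following is a minimal illustration on synthetic data, assuming the matrices $\hat U^{\mathrm{inv}}$ and $\hat U^{\mathrm{res}}$ have already been estimated (here they are taken as known coordinate axes); the function name is ours.

```python
import numpy as np

def isd_estimator(X_hist, Y_hist, X_ad, Y_ad, U_inv, U_res):
    """Combine an invariant fit on historical data with a residual fit on
    a short adaptation window, cf. (21)-(23)."""
    # (21): plug-in invariant component from historical data
    Z = X_hist @ U_inv
    beta = U_inv @ np.linalg.solve(Z.T @ Z, Z.T @ Y_hist)
    # (22): residual component from the adaptation window
    W = X_ad @ U_res
    delta = U_res @ np.linalg.solve(W.T @ W, W.T @ (Y_ad - X_ad @ beta))
    # (23): ISD estimator of the time-varying parameter
    return beta + delta

rng = np.random.default_rng(5)
U = np.eye(3)                                    # assumed known decomposition
X_hist = rng.normal(size=(5000, 3))
Y_hist = X_hist @ np.array([1.0, -1.0, 0.0])     # residual coordinate inactive here
X_ad = rng.normal(size=(50, 3))
Y_ad = X_ad @ np.array([1.0, -1.0, 2.0])         # residual coordinate shifted to 2
gamma_hat = isd_estimator(X_hist, Y_hist, X_ad, Y_ad, U[:, :2], U[:, 2:])
```

Only the one-dimensional residual coordinate is re-estimated from the short adaptation window; the invariant coordinates are carried over from the large historical sample.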
4.3 Finite sample generalization guarantee

Considering the setting described in Section 2.3, we now compare the ISD estimator in (23) with the OLS estimator computed on $I_{\mathrm{ad}}$, i.e., $\hat\gamma^{\mathrm{OLS}}_{t^\star} := ((\mathbf{X}^{\mathrm{ad}})^\top \mathbf{X}^{\mathrm{ad}})^{-1}(\mathbf{X}^{\mathrm{ad}})^\top \mathbf{Y}^{\mathrm{ad}}$. We assume that we are given the (oracle) subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$, and consider the expected explained variance at $t^\star$ of $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ and $\hat\gamma^{\mathrm{OLS}}_{t^\star}$. In more detail, the expected explained variance at $t^\star$ of an arbitrary estimator $\hat\gamma$ of $\gamma_{0,t^\star}$ is given by
$$\mathbb{E}[\mathrm{Var}_{t^\star}(\hat\gamma)] := \mathbb{E}\big[\mathrm{Var}(Y_{t^\star}) - \mathrm{Var}(Y_{t^\star} - X_{t^\star}^\top \hat\gamma \mid \hat\gamma)\big]. \quad (24)$$
Evaluating (24) allows us to obtain a measure of the prediction accuracy of $\hat\gamma$: the higher the expected explained variance, the more predictive the estimator is. This also becomes clear by isolating in (24) the term $\mathbb{E}[\mathrm{Var}(Y_{t^\star} - X_{t^\star}^\top \hat\gamma \mid \hat\gamma)]$, which represents the mean squared prediction error obtained by using $\hat\gamma$ (see Remark 1). Our goal is to show that the variance explained by $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ is always greater than that of the OLS estimator.

Theorem 3 Assume Assumption 2 and that, in model (1), $\gamma_{0,t}$ and the variances of $X_t$ and $\epsilon_t$ do not change with respect to $t \in I_{\mathrm{ad}} \cup \{t^\star\}$, and denote them by $\gamma_{0,t^\star}$, $\Sigma_{t^\star}$ and $\sigma^2_{\mathrm{ad}}$, respectively. Moreover, let $c, \sigma^2_{\epsilon,\max} > 0$ be constants such that for all $n\in\mathbb{N}$ it holds that $\sigma^2_{\epsilon,\max} \geq \max_{t\in[n]} \mathrm{Var}(\epsilon_t)$, and for all $m\in\mathbb{N}$ with $m \geq p$, $\lambda_{\min}\big(\frac{1}{m}(\mathbf{X}^{\mathrm{ad}})^\top \mathbf{X}^{\mathrm{ad}}\big) \geq c$ almost surely, where $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue. Further assume that the invariant and residual subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ are known. Then, there exist constants $C^{\mathrm{inv}}, C^{\mathrm{res}} > 0$ such that for all $n, m \in \mathbb{N}$ with $n, m \geq p$ it holds that
$$\mathrm{MSPE}(\hat\gamma^{\mathrm{ISD}}_{t^\star}) := \mathbb{E}\big[(X_{t^\star}^\top(\hat\gamma^{\mathrm{ISD}}_{t^\star} - \gamma_{0,t^\star}))^2\big] \leq \sigma^2_{\epsilon,\max}\, \frac{\dim(S^{\mathrm{inv}})}{n}\, C^{\mathrm{inv}} + \sigma^2_{\mathrm{ad}}\, \frac{\dim(S^{\mathrm{res}})}{m}\, C^{\mathrm{res}}.$$
Furthermore, for all $n, m \in \mathbb{N}$ with $n, m \geq p$ it holds that
$$\mathrm{MSPE}(\hat\gamma^{\mathrm{OLS}}_{t^\star}) - \mathrm{MSPE}(\hat\gamma^{\mathrm{ISD}}_{t^\star}) \geq \sigma^2_{\mathrm{ad}}\, \frac{\dim(S^{\mathrm{inv}})}{m} - \sigma^2_{\epsilon,\max}\, \frac{\dim(S^{\mathrm{inv}})}{n}\, C^{\mathrm{inv}}.$$
From the proof of Theorem 3 it follows that $C^{\mathrm{res}}$ can be chosen close to 1 if we only consider sufficiently large $m$.
Moreover, because $\mathrm{MSPE}(\hat\gamma^{\mathrm{OLS}}_{t^\star}) - \mathrm{MSPE}(\hat\gamma^{\mathrm{ISD}}_{t^\star}) = \mathbb{E}[\mathrm{Var}_{t^\star}(\hat\gamma^{\mathrm{ISD}}_{t^\star})] - \mathbb{E}[\mathrm{Var}_{t^\star}(\hat\gamma^{\mathrm{OLS}}_{t^\star})]$, Theorem 3 implies that if $\dim(S^{\mathrm{inv}}) \geq 1$ and $n$ is sufficiently large, then $\mathbb{E}[\mathrm{Var}_{t^\star}(\hat\gamma^{\mathrm{ISD}}_{t^\star})] > \mathbb{E}[\mathrm{Var}_{t^\star}(\hat\gamma^{\mathrm{OLS}}_{t^\star})]$. The first term in the difference between the expected explained variances (or MSPEs) depends on the dimension of the invariant subspace and on the size of the time-adaptation window $I_{\mathrm{ad}}$: the gain from using $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ instead of $\hat\gamma^{\mathrm{OLS}}_{t^\star}$ is larger if the dimension of $S^{\mathrm{inv}}$ is large and only a small number of time points are available in the adaptation data.

5. Experiments

To show the effectiveness of the ISD framework we report the results of two simulation experiments and one real data experiment. The first simulation experiment evaluates the estimation accuracy of the invariant component $\hat\beta^{\mathrm{inv}}$ for increasing sample size $n$ of the historical data. The second simulation experiment compares the predictive accuracy of $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ and $\hat\gamma^{\mathrm{OLS}}_{t^\star}$ for different sizes $m$ of the adaptation dataset, to empirically investigate the dependence of the MSPE difference on the size of the adaptation data shown in Theorem 3. In both simulation experiments we let the dimension of the covariates be $p = 10$, with $\dim(S^{\mathrm{inv}}) = 7$ and $\dim(S^{\mathrm{res}}) = 3$, and generate data as follows. We sample a random orthogonal matrix $U$, and sample the covariates $X_t$ from a normal distribution with zero mean and covariance matrix $U \Sigma_t U^\top$, where $\Sigma_t$ is a block-diagonal matrix with four blocks of dimensions 2, 4, 3 and 1, and random entries that change 10 times over the observed time horizon $n$.
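The covariate-generation scheme just described can be sketched as follows. This is a simplified stand-in for the paper's simulation code (the random block entries and regime lengths are our own choices): a random orthogonal rotation of a piece-wise constant block-diagonal covariance with blocks of sizes 2, 4, 3 and 1.

```python
import numpy as np

def block_diag(mats):
    """Assemble square matrices into one block-diagonal matrix."""
    p = sum(m.shape[0] for m in mats)
    out = np.zeros((p, p))
    i = 0
    for m in mats:
        k = m.shape[0]
        out[i:i + k, i:i + k] = m
        i += k
    return out

rng = np.random.default_rng(6)
p, n, regimes = 10, 2000, 10
U, _ = np.linalg.qr(rng.normal(size=(p, p)))   # random orthogonal matrix
block_sizes = [2, 4, 3, 1]

X = np.empty((n, p))
for r in range(regimes):                        # covariance changes per regime
    mats = []
    for size in block_sizes:
        A = rng.normal(size=(size, size))
        mats.append(A @ A.T + np.eye(size))     # random positive definite block
    Sigma = U @ block_diag(mats) @ U.T
    L = np.linalg.cholesky(Sigma)
    idx = slice(r * n // regimes, (r + 1) * n // regimes)
    X[idx] = rng.normal(size=(n // regimes, p)) @ L.T
```

By construction, $U$ is a joint block diagonalizer of all regime covariances, so a joint block diagonalization procedure applied to windowed covariance estimates of `X` should approximately recover the four blocks.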
We take as true time-varying parameter the rotation by $U$ of the parameter with constant entries equal to 0.2 (we set these entries all to the same value for simplicity, but they need not be equal in general) corresponding to the blocks of sizes 4 and 3, and time-varying entries corresponding to the blocks of sizes 2 and 1 equal to $1 - 1.5\,(t/n)\sin^2(it/n + i)$, where $i \in \{2, \dots, 8\}$ is the entry index (the values of these coefficients range between 0.25 and 1). The noise terms $\epsilon_t$ are sampled i.i.d. from a normal distribution with zero mean and variance $\sigma^2_{\epsilon_t} = 0.64$. The dimensions of the subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ and the true time-varying parameter $\gamma_{0,t}$ are chosen to ensure that, in the historical data, $X_t^\top \beta^{\mathrm{inv}}$ and $X_t^\top \delta^{\mathrm{res}}_t$ each explain approximately half of the variance of $Y_t$. This choice allows for better visualization in the described experiments. In Section 5.3 we repeat the same experiments on real data coming from a controlled physical system (light tunnel) developed by Gamella et al. (2025). The data is taken from the smooth_polarizers experiment in the lt_walks_v1 dataset, available at https://github.com/juangamella/causal-chamber. The code for the presented experiments is available at https://github.com/mlazzaretto/Invariant-Subspace-Decomposition. The implementation of the uwedge algorithm is taken from the Python package https://github.com/sweichwald/coroICA-python developed by Pfister et al. (2019b).

5.1 Invariant decomposition and zero-shot prediction

We first estimate the time-invariant parameter $\beta^{\mathrm{inv}}$ for different sample sizes $n$ of the historical data. We consider $n \in \{500, 1000, 2500, 4000, 6000\}$, and repeat the experiment 20 times for each $n$. To compute $\hat\beta^{\mathrm{inv}}$, we use $K = 25$ equally distributed windows of length $n/8$ (see Section 4). Figure 4 shows that the mean squared error (MSE) $\|\beta^{\mathrm{inv}} - \hat\beta^{\mathrm{inv}}\|_2^2$ converges to zero for increasing values of $n$.
We then consider a separate time window of 250 observations in which the value of the time-varying coefficients (before the transformation using $U$) is set to 1. We use these observations to test the zero-shot predictive capability of the estimated invariant component, i.e., they can be seen as realizations of the variable $X_{t^\star}$ introduced in Setting 1 (we refer to this window as test data). We compare the predictive performance of the parameter $\hat\beta^{\mathrm{inv}}$ on the historical and test data with that of the oracle invariant parameter $\beta^{\mathrm{inv}}$, the maximin effect $\hat\beta^{\mathrm{mm}}$ (computed using the magging estimator proposed by Bühlmann and Meinshausen, 2015), and the OLS solution $\hat\beta^{\mathrm{OLS}}$, both computed using the historical data.

Figure 4: MSE of $\hat\beta^{\mathrm{inv}}$ for increasing size of the historical data $n$ (see Section 5.1). For larger values of $n$, the estimation of the invariant subspace decomposition becomes more precise and leads to smaller errors in the estimated invariant component $\hat\beta^{\mathrm{inv}}$.

Figure 5: Normalized explained variance ($R^2$) by $\hat\beta^{\mathrm{inv}}$ and comparison with $\beta^{\mathrm{inv}}$, $\hat\beta^{\mathrm{mm}}$ and $\hat\beta^{\mathrm{OLS}}$: training (historical data, left) and zero-shot generalization (test data, right), for different sizes $n$ of the historical data (see Section 5.1). The dashed line indicates the population value of the (normalized) explained variance by $\beta^{\mathrm{inv}}$.

We show in Figure 5 the results in terms of the $R^2$ coefficient, given by
$$R^2 = \frac{\sum_{t=1}^n \big(\widehat{\mathrm{Var}}(Y_t) - \widehat{\mathrm{Var}}(Y_t - X_t^\top \hat\beta)\big)}{\sum_{t=1}^n \widehat{\mathrm{Var}}(Y_t)}.$$
Figure 5 shows that the $R^2$ coefficient of the oracle invariant component $\beta^{\mathrm{inv}}$ remains positive even for values of $\gamma_{0,t}$ in the test data that lie outside of the observed support in the historical data; for increasing $n$ the same holds for the estimated $\hat\beta^{\mathrm{inv}}$. Using $\hat\beta^{\mathrm{OLS}}$ or $\hat\beta^{\mathrm{mm}}$ instead leads to negative explained variance in this experiment.
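The $R^2$ measure above can be sketched directly. This is an illustrative simplification: the per-window variance estimates are computed on equal-length splits of a single synthetic sample, and the function name is ours.

```python
import numpy as np

def explained_variance_r2(X, Y, beta, K=5):
    """Windowed explained-variance R^2 of a fixed parameter beta."""
    n = len(X)
    w = n // K
    num, den = 0.0, 0.0
    for k in range(K):
        Yk = Y[k * w:(k + 1) * w]
        Rk = Yk - X[k * w:(k + 1) * w] @ beta
        num += np.var(Yk) - np.var(Rk)   # per-window explained variance
        den += np.var(Yk)                # per-window total variance
    return num / den

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
beta_true = np.array([1.0, 0.5, -1.0])
Y = X @ beta_true + 0.5 * rng.normal(size=1000)
r2_true = explained_variance_r2(X, Y, beta_true)   # close to 0.9 here
r2_zero = explained_variance_r2(X, Y, np.zeros(3))  # exactly 0
```

Note that, unlike the usual in-sample $R^2$ of a fitted model, this quantity can be negative for a fixed parameter that fits poorly, which is exactly what Figure 5 shows for $\hat\beta^{\mathrm{OLS}}$ and $\hat\beta^{\mathrm{mm}}$ on the test data.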
The main limitation of the ISD method lies in the estimation of the invariant and residual subspaces. As outlined in Section 4, this process consists of two main steps, approximate joint block diagonalization and selection of the invariant blocks, both of which are in practice sensitive to noise. In both steps, we implement our estimator to be as conservative as possible, that is, such that it does not on average overestimate the number of common diagonal blocks or the dimension of $S^{\mathrm{inv}}$, to avoid including part of the residual subspace in $\hat S^{\mathrm{inv}}$. This behavior is, however, hard to avoid if the size of the historical dataset is not sufficiently large, therefore requiring large values of $n$ for the ISD framework to work effectively; see Figures 4 and 5.

5.2 Time adaptation

In the same setting, we now fix the size of the historical dataset to $n = 6000$, which we use to estimate $\hat\beta^{\mathrm{inv}}$, and consider a test dataset in which the time-varying coefficients (before the transformation using $U$) undergo two shifts and take values 0.5 and 2 on two consecutive time windows, each containing 1000 observations.

Figure 6: MSPE comparison: $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ (blue dots) vs. $\hat\gamma^{\mathrm{OLS}}_{t^\star}$ for $p = 10$ and various adaptation window lengths $m$ (see Section 5.2). The ISD estimator achieves lower MSPE than the OLS for smaller sizes $m$ of the adaptation window, while the two estimators become comparable for increasing $m$. The orange dots show the MSPE of $\beta^{\mathrm{inv}}(\text{oracle}) + \hat\delta^{\mathrm{res}}_{t^\star}$ vs. $\hat\gamma^{\mathrm{OLS}}_{t^\star}$: if the subspace decomposition is known, then ISD always achieves lower MSPE than the OLS.

We assume that the test data are observed sequentially, and take as adaptation data a rolling window of length $m$ contained in the test data and shifting by one time point at a time. We use these sequential adaptation datasets to estimate the residual parameter $\hat\delta^{\mathrm{res}}_{t^\star}$ and the OLS solution $\hat\gamma^{\mathrm{OLS}}_{t^\star}$.
We then compute the squared prediction error of γ̂ISD_t = β̂inv + δ̂res_t and γ̂OLS_t on the next data point X_{t+1}, i.e., (X_{t+1}^⊤(γ_{0,t+1} − γ̂_t))², and approximate the corresponding MSPE using a Monte-Carlo approximation with 1000 draws of X_{t+1} (these correspond to the 1000 sequential observations in each of the two windows of the test data). We repeat the simulation 20 times for different sizes of the adaptation window, m ∈ {1.5p, 2p, 5p, 10p}, and plot the obtained MSPE values for γ̂OLS_t against γ̂ISD_t. The result is shown in Figure 6, and empirically supports Theorem 3, in particular that the difference in the MSPEs of the OLS and ISD estimators is proportional to the ratio dim(Sinv)/m and is therefore larger for small values of m and shrinks for increasing size of the adaptation data (additional details on this simulation are provided in Section A.5 in the Appendix). The ISD framework is particularly helpful in scenarios in which the size of the available adaptation window is small (first two plots from the left in Figure 6). Indeed, from Theorem 3 it also follows that the larger the dimension of the invariant subspace, the greater the advantage of using the ISD framework for prediction rather than a naive OLS approach. A further benefit of the ISD estimator is that it allows us to estimate δ̂res_t for small lengths m of the adaptation window with dim(Sres) < m < p, where OLS is not feasible. We run a similar experiment to show (Figure 7) the average cumulative explained variance on the adaptation data over 20 runs, both by estimators computed only on the historical data and by estimators that use the adaptation data. For visualization purposes, we now consider the time-varying coefficients (before the transformation using U) equal to (0.5 t sin²(it/n + i))/n, where i ∈ {2, ..., 8} is the coefficient index, in the historical data, and constantly equal to 0.3, 0.65 and 1 in three consecutive time windows of size 150 on the time points after the historical data.
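The coefficient trajectories just described can be generated as in the following sketch (variable names are ours; the rotation by U and the remaining simulation details are omitted):

```python
import numpy as np

n = 6000                # size of the historical data
idx = np.arange(2, 9)   # coefficient indices i in {2, ..., 8}
t = np.arange(1, n + 1)

# historical data: smooth trajectories (0.5 * t * sin^2(i*t/n + i)) / n
gamma_hist = np.stack(
    [0.5 * t * np.sin(i * t / n + i) ** 2 / n for i in idx], axis=1
)

# after the historical data: constant on three consecutive windows of 150 points
gamma_adapt = np.repeat([0.3, 0.65, 1.0], 150)

print(gamma_hist.shape, gamma_adapt.shape)  # (6000, 7) (450,)
```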
We estimate δ̂res_t and γ̂OLS_t on a rolling adaptation window of size m = 3p. The plot in Figure 7 shows that, on average, the ISD framework, by exploiting invariance properties in the observed data, allows us to accurately explain the variance of the response using small windows for time adaptation, significantly improving on the OLS solution in the same time windows.

Figure 7: Average cumulative explained variance on the adaptation data and one standard deviation intervals (over 20 runs) by the true time-varying parameter γ_{0,t} and various estimators (see Section 5.2). When few data points are available for adaptation at time t, the explained variance of the ISD estimator γ̂ISD_t = β̂inv + δ̂res_t is significantly higher than that of the rolling window OLS γ̂OLS_t, and improves on invariance-based estimators such as the invariant component β̂inv or the maximin β̂mm. Due to a shift of γ_{0,t} in the residual subspace, the OLS computed on historical data can perform worse than the zero function.

5.3 Real data example

We now present a toy example that applies ISD to real data. The data used for this experiment are collected using a controlled physical system developed by Gamella et al. (2025). The system, shown in Figure 8, consists of a light tunnel with a light source XRGB whose emitted light passes through two polarizers with relative angle θ between them and is captured by a sensor placed at the end of the tunnel, behind the polarizers. At the end of the tunnel there are two additional LED light sources, XL31 and XL32, whose emitted light is unaffected by the polarizers. The sensor Y := I3 measures the overall infrared light at the end of the tunnel, which is affected by the intensity of the RGB source and by the two LEDs.
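To make the structure of this system concrete, the following sketch simulates a simplified version of the light-tunnel model (all coefficients and noise levels are our own illustrative choices, not calibrated to the real device): the coefficient of X_RGB varies with the polarizer angle θ_t, while the LED coefficients stay fixed over time.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
theta = rng.uniform(0, np.pi / 2, size=n)  # polarizer angle over time (toy)

X_rgb = rng.uniform(0, 500, size=n)
X_l31 = rng.uniform(0, 200, size=n)
X_l32 = rng.uniform(0, 200, size=n)

# Y depends on X_rgb through cos^2(theta) (a time-varying coefficient),
# while the LED coefficients b1, b2 are time-invariant (toy values)
b1, b2 = 0.05, 0.03
Y = np.cos(theta) ** 2 * X_rgb + b1 * X_l31 + b2 * X_l32 + rng.normal(0, 1, n)
```

Under this toy model, the coefficients of the two LEDs are time-invariant while the coefficient of X_RGB changes with θ, which is the kind of structure ISD aims to recover as a nontrivial invariant component.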
As described by Gamella et al. (2025), the effect of XRGB on Y is linear and depends on θ; more precisely, it holds that Y ∝ cos²(θ) XRGB. The dependence of Y on the two LEDs is instead independent of θ. We take Y as our response, consider the covariate vector [XRGB, XL31, XL32]^⊤ ∈ R³ and assume that the angle θ is unknown. Since we control the three light sources independently, and we expect the dependence of the infrared light Y on the two LEDs to remain the same across time, we hope to detect a nontrivial invariant subspace related to the two LED covariates. The available dataset contains 8000 observations, collected under changing values of the angle θ. The historical dataset contains the first 7000 observations, and the test dataset the remaining 1000. Figure 9 shows the dependence of the response on the three covariates, as well as the values of the response and of cos²(θ) through time.

Figure 8: Illustration of the light tunnel, see Section 5.3. The figure is taken from Gamella et al. (2025) (published under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)). The variables of interest in our experiment are the RGB values of the light source, the LED intensities L31 and L32, and the measurement I3 of the light sensor at the end of the tunnel.

Table 1: Normalized explained variance (R²) by β̂inv and comparison with β̂OLS and β̂mm: training (historical data) and zero-shot generalization (test data); see Section 5.3.

          R² (historical data)   R² (test data, zero-shot prediction)
  β̂inv    0.019                   0.062
  β̂OLS    0.558                 -10.703
  β̂mm     0.477                  -2.194

We apply ISD on the historical data to find an invariant component β̂inv.
For comparison, we also compute the maximin β̂mm (Bühlmann and Meinshausen, 2015) and the OLS solution β̂OLS on the historical data. We then use these estimated parameters for zero-shot prediction on the test data. Table 1 shows the R² coefficient, defined as in Section 5.1 as the fraction of explained variance

R^2 = \frac{\sum_{t=1}^{n} \big(\widehat{\mathrm{Var}}(Y_t) - \widehat{\mathrm{Var}}(Y_t - X_t^\top \hat{\beta})\big)}{\sum_{t=1}^{n} \widehat{\mathrm{Var}}(Y_t)},

on historical and test data. The invariant component β̂inv is the only estimator that achieves positive explained variance on the test data. This is because β̂inv only captures the parts of the variance that can be transferred to the test data, as can be seen from its lower explained variance on the historical data compared to β̂OLS. The estimated invariant subspace Ŝinv = span{[0.13910168, 0.95836843, 0.24936055]^⊤} has dimension 1 and shows in particular that most of the invariant information is encoded in the two LEDs, with the highest weight given to XL31. This is expected, since the LED intensities are not affected by the changing angle between the two polarizers. The dependence between Y and the two LEDs is small but significant (testing a reduced linear model without either L31 or L32 against the full model results in p-values smaller than 10⁻⁴, both for historical and test data). The relatively small R² is also expected: since the RGB light source is stronger than the two LEDs at the end of the tunnel, most of the variance in Y is explained by the non-invariant component XRGB (see Figure 9). However, it is not the case that the whole subspace spanned by the two covariates corresponding to the LEDs is invariant, as we would have assumed from the knowledge of the physical system. An explanation is that, due to the data collection process, there is nonzero observed correlation between XRGB and XL32, but not between XRGB and XL31 (on the historical data, we have corr(XRGB, XL31) = 0.004 with p-value 0.723 and corr(XRGB, XL32) = 0.107 with p-value smaller than 10⁻¹⁵).

Figure 9: Dataset of the experiment discussed in Section 5.3. The top four figures from the left show the dependence of the infrared measurement Y on the covariates and on time t (encoded by the colormap) for historical data. The angle of the polarizers changes over time (top right) and thus has an influence on the linear relationship between Y and XRGB (top left). For this experiment we assume that the angle θ is unknown. The second row shows the same quantities during test time, where the polarization angle is much smaller (bottom right).

We also run the adaptation step considering as adaptation data a rolling window of size m = 8 shifting through the test data. We show in Figure 10 the cumulative explained variance obtained by ISD, by the OLS solution computed on the same adaptation data, and by the OLS solution and the invariant component computed on historical data. The plot shows that γ̂ISD achieves the highest explained variance, with a small improvement on the rolling window OLS γ̂OLS. Indeed, in this particular example the size of the invariant subspace is small (dim(Sinv) = 1) and, by Theorem 3, we expect only a small improvement.

Figure 10: Cumulative explained variance on the adaptation data by the time adaptation estimators γ̂ISD and γ̂OLS and by the zero-shot predictors β̂inv and β̂OLS; see Section 5.3.

6. Conclusion

We propose Invariant Subspace Decomposition (ISD), a framework for invariance-based time adaptation.
Our method relies on the orthogonal decomposition of the parameter space into an invariant subspace Sinv and a residual subspace Sres, such that the maximizer of the explained variance over Sinv is time-invariant. The estimation of the invariant component βinv on a large historical dataset and the reduced dimensionality of Sres with respect to the original parameter space R^p allow the ISD estimator to improve on the prediction accuracy of existing estimation techniques. We provide finite sample guarantees for the proposed estimation method and additionally support the validity of our theoretical results through simulated experiments and a real-world data experiment. Future developments of this work may investigate the presented problem in the case of nonlinear time-varying models. For instance, Eastwood et al. (2023) show how to adapt invariant nonlinear features to heterogeneous environments in classification tasks; extending similar ideas to the regression setting considered here could be of interest. Future work may further study how to incorporate the ISD framework in specific applied settings such as contextual bandits.

Acknowledgments

We thank Juan Gamella for generating the data for the presented real-world example, and the three anonymous reviewers for the valuable comments. NP and ML are supported by a research grant (0069071) from Novo Nordisk Fonden.

References

K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. Advances in Neural Information Processing Systems, 29, 2016.

P. Bühlmann and N. Meinshausen. Magging: maximin aggregation for inhomogeneous large-scale data. Proceedings of the IEEE, 104(1):126–135, 2015.

L. De Lathauwer. Decompositions of a higher-order tensor in block terms — part II: Definitions and uniqueness. SIAM Journal on Matrix Analysis and Applications, 30(3):1033–1066, 2008.

J. Durbin and S. J. Koopman. Time Series Analysis by State Space Methods, volume 38. OUP Oxford, 2012.
C. Eastwood, S. Singh, A. L. Nicolicioiu, M. Vlastelica Pogančić, J. von Kügelgen, and B. Schölkopf. Spuriosity didn't kill the classifier: Using invariant predictions to harness spurious features. Advances in Neural Information Processing Systems, 36:18291–18324, 2023.

J. Fan and W. Zhang. Statistical methods with varying coefficient models. Statistics and its Interface, 1(1):179, 2008.

F. Feng, B. Huang, K. Zhang, and S. Magliacane. Factored adaptation for non-stationary reinforcement learning. Advances in Neural Information Processing Systems, 35:31957–31971, 2022.

C. Févotte and F. J. Theis. Pivot selection strategies in Jacobi joint block-diagonalization. In International Conference on Independent Component Analysis and Signal Separation, pages 177–184. Springer, 2007.

J. L. Gamella, J. Peters, and P. Bühlmann. Causal chambers as a real-world physical testbed for AI methodology. Nature Machine Intelligence, 7(1):107–118, Jan 2025. ISSN 2522-5839. doi: 10.1038/s42256-024-00964-x. URL https://doi.org/10.1038/s42256-024-00964-x.

W. Günther, U. Ninad, and J. Runge. Causal discovery for time series from multiple datasets with latent contexts. In Uncertainty in Artificial Intelligence, pages 766–776. PMLR, 2023.

H. W. Gutch and F. J. Theis. Uniqueness of linear factorizations into independent subspaces. Journal of Multivariate Analysis, 112:48–62, 2012.

T. Hastie and R. Tibshirani. Varying-coefficient models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 55(4):757–779, 1993.

R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2012.

B. Huang, K. Zhang, J. Zhang, J. Ramsey, R. Sanchez-Romero, C. Glymour, and B. Schölkopf. Causal discovery from heterogeneous/nonstationary data. Journal of Machine Learning Research, 21(89):1–53, 2020.

Z. Li, R. Cai, Z. Yang, H. Huang, G. Chen, Y. Shen, Z. Chen, X. Song, Z. Hao, and K. Zhang. When and how: Learning identifiable latent states for nonstationary time series forecasting. arXiv preprint arXiv:2402.12767, 2024.

S. Magliacane, T. van Ommen, T. Claassen, S. Bongers, P. Versteeg, and J. M. Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10846–10856. Curran Associates, Inc., 2018.

N. Meinshausen and P. Bühlmann. Maximin effects in inhomogeneous large-scale data. The Annals of Statistics, 43(4):1801–1830, 2015.

J. Mourtada. Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. The Annals of Statistics, 50(4):2157–2178, 2022.

K. Murota, Y. Kanno, M. Kojima, and S. Kojima. A numerical algorithm for block-diagonal decomposition of matrix algebras with application to semidefinite programming. Japan Journal of Industrial and Applied Mathematics, 27(1):125–160, 2010.

D. Nion. A tensor framework for nonunitary joint block diagonalization. IEEE Transactions on Signal Processing, 59(10):4585–4594, 2011.

J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):947–1012, 2016.

N. Pfister, P. Bühlmann, and J. Peters. Invariant causal prediction for sequential data. Journal of the American Statistical Association, 114(527):1264–1276, 2019a.

N. Pfister, S. Weichwald, P. Bühlmann, and B. Schölkopf. Robustifying independent component analysis by adjusting for group-wise stationary noise. Journal of Machine Learning Research, 20(147):1–50, 2019b.

N. Pfister, E. G. William, J. Peters, R. Aebersold, and P. Bühlmann. Stabilizing variable selection and regression. The Annals of Applied Statistics, 15(3):1220–1246, 2021.

M. Rojas-Carulla, B. Schölkopf, R. Turner, and J. Peters. Causal transfer in machine learning. Journal of Machine Learning Research, 19(36):1–34, 2018.

J. R. Schott. Matrix Analysis for Statistics. John Wiley & Sons, 2016.

P. Stojanov, Z. Li, M. Gong, R. Cai, J. Carbonell, and K. Zhang. Domain adaptation with invariant representation learning: What transformations to learn? In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 24791–24803. Curran Associates, Inc., 2021.

B. Sun, J. Feng, and K. Saenko. Correlation alignment for unsupervised domain adaptation. In Domain Adaptation in Computer Vision Applications, pages 153–171. Springer, 2017.

P. Tichavský and Z. Koldovský. Algorithms for nonorthogonal approximate joint block-diagonalization. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pages 2094–2098. IEEE, 2012.

P. Tichavský and A. Yeredor. Fast approximate joint diagonalization incorporating weight matrices. IEEE Transactions on Signal Processing, 57(3):878–891, 2008.

P. Tichavský, A. Yeredor, and Z. Koldovský. On computation of approximate joint block-diagonalization using ordinary AJD. In International Conference on Latent Variable Analysis and Signal Separation, pages 163–171. Springer, 2012.

W. Yao, G. Chen, and K. Zhang. Temporally disentangled representation learning. Advances in Neural Information Processing Systems, 35:26492–26503, 2022.

K. Zhang, M. Gong, and B. Schölkopf. Multi-source domain adaptation: A causal view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

H. Zhao, R. T. Des Combes, K. Zhang, and G. Gordon. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523–7532. PMLR, 2019.

Appendix A.
Supporting examples and remarks

A.1 Example of non-uniqueness of an irreducible orthogonal partition

Assume that for all t ∈ [n] the covariance matrix Σ_t of X_t takes one of the following two values (and each value is taken at least once in [n]):

\Sigma_1 = \mathrm{diag}(2, 2, 1), \qquad \Sigma_2 = \mathrm{diag}(3, 3, 2).

Define for all j ∈ {1, 2, 3} the linear spaces S_j := span{e_j}, where e_j is the j-th vector of the canonical basis. Then, since Σ_1 and Σ_2 are (block) diagonal, the partition S_1, S_2, S_3 is an irreducible orthogonal and (X_t)_{t∈[n]}-decorrelating partition. Consider now the orthonormal matrix

U = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} & 0 \\ -1/\sqrt{2} & 1/\sqrt{2} & 0 \\ 0 & 0 & 1 \end{pmatrix},

with columns u_1 = (e_1 − e_2)/√2, u_2 = (e_1 + e_2)/√2 and u_3 = e_3. Since the upper-left 2×2 blocks of Σ_1 and Σ_2 are multiples of the identity, it holds that U^⊤Σ_1U = Σ_1 and U^⊤Σ_2U = Σ_2. Therefore, the spaces S̃_j := span{u_j} also form an irreducible orthogonal and (X_t)_{t∈[n]}-decorrelating partition (this follows, for example, from Proposition 1), but {S_1, S_2, S_3} ≠ {S̃_1, S̃_2, S̃_3}. Then, if we assume for example that γ_{0,t} = (1, t, 1)^⊤, given the first partition we obtain Sinv = S_1 ⊕ S_3, whereas given the second partition it holds that Π_{S̃_1}γ_{0,t} = ((1 − t)/2)(1, −1, 0)^⊤ and Π_{S̃_2}γ_{0,t} = ((1 + t)/2)(1, 1, 0)^⊤, and therefore S̃inv = S̃_3, leading to Assumption 1 not being satisfied.

A.2 Example of quickly varying γ_{0,t} and zero-shot generalization

We repeat the same simulation described in Section 5.1 but now consider non-smooth variations of γ_{0,t}. More specifically, the only difference from the simulation described in Section 5.1 is that we let the 3 time-varying entries of γ_{0,t} (prior to its rotation by U) vary quickly with t: at each time point, each of the 3 entries is sampled uniformly in an interval of width 1 centered around a value which changes 20 times over the observed time horizon n. These centers are randomly sampled in [0, 1.2]. These choices of the sampling intervals ensure that the experiments are comparable for different sizes n of the historical data and that there is a shift in γ_{0,t} outside of the observed support in the test data (which is generated as in Section 5.1). This experiment supports Remark 2, showing that we do not need assumptions on the type of changes in γ_{0,t} in the historical data. In particular, Figure 11 shows that the MSE of β̂inv converges to zero for increasing values of n; that is, we are able to estimate the invariant component even when γ_{0,t} is quickly varying in the historical data. Moreover, Figure 12 shows the results in terms of the R² coefficient. In particular, for n large enough, the R² coefficient of the estimated invariant component β̂inv remains positive even for values of the test data that lie outside of the observed support.

Figure 11: MSE of β̂inv for increasing size of the historical data n. For larger values of n, the estimation of the invariant subspace decomposition becomes more precise and leads to smaller errors in the estimated invariant component β̂inv.

Figure 12: Normalized explained variance (R²) by β̂inv and comparison with βinv, β̂mm and β̂OLS: training (historical data, left) and zero-shot generalization (test data, right), for different sizes n of the historical data. The dashed line indicates the population value of the (normalized) explained variance by βinv. In the historical data, γ_{0,t} is quickly varying.

A.3 Methods and computational complexity of joint block diagonalization

In the context of joint block diagonalization, we can differentiate between methods that solve the exact problem (JBD), i.e., for which the transformed matrices have exactly zero off-block-diagonal entries, and approximate methods (AJBD), which assume the presence of noise and aim to minimize the off-block-diagonal entries without necessarily setting them to zero. JBD is in general an easier problem, and algorithms that solve it have been shown to achieve polynomial complexity (see, e.g., Murota et al., 2010; Tichavský et al., 2012).
Many of these methods, e.g., the one presented by Murota et al. (2010), are based on eigenvalue decompositions. Alternatively, as shown by Tichavský et al. (2012), some algorithms that solve the problem of approximate joint diagonalization (AJD) of a set of matrices, such as uwedge developed by Tichavský and Yeredor (2008), can also be used for JBD. In more detail, a solution to AJD is a matrix that maximally jointly diagonalizes a set of matrices by minimizing the average value of the off-diagonal entries; if the matrices in the set cannot be exactly jointly diagonalized, the transformed matrices will have some non-zero off-diagonal elements. Adding an appropriate permutation of the columns of the joint diagonalizer allows one to reorganize the non-zero off-diagonal elements into blocks, leading to jointly block diagonal matrices: in Section 4.3 of their work, Gutch and Theis (2012) argue that if the matrices to be block diagonalized are symmetric, then the solution found in this way is also an optimal JBD solution. However, in general, methods for JBD cannot be directly applied to solve AJBD. Algorithms that solve AJBD directly, based on the iterative optimization of a cost function via matrix rotations, have been developed, for example, by Tichavský and Koldovský (2012) and Févotte and Theis (2007), but they require the number of diagonal blocks to be known in advance. Alternatively, and similarly to how JBD can be solved by AJD methods, one can also, with some slight modifications, use AJD methods to solve AJBD. More specifically, some heuristics need to be used to determine the size of the blocks: these can consist, for example, in setting a threshold for the non-zero off-block-diagonal elements in the transformed matrices. In the (orthogonal) settings considered here, we have found this last approach to work effectively. More specifically, in Section 4.1.1, we denoted by V the AJD solution for the set of estimated covariance matrices {Σ̂_k}_{k=1}^K.
Similarly to how Tichavský and Koldovský (2012) suggest determining a permutation of the AJD result, we proceed in the following way. To discriminate the non-zero off-block-diagonal entries in these matrices, we start by computing the following auxiliary matrix using V:

\bar{\Sigma} := \max_{k \in \{1,\ldots,K\}} |V^\top \hat{\Sigma}_k V|,

where the maximum and the absolute value are taken element-wise. The matrix Σ̄ captures in its off-diagonal entries the residual correlation among the components identified by AJD, over all K jointly diagonalized matrices. For all thresholds τ ∈ R, we let P(τ) ∈ R^{p×p} denote one of the permutation matrices satisfying that P(τ)^⊤ Σ̄ P(τ) is a block diagonal matrix if all entries smaller than τ are considered zero. We then define the optimal threshold τ* by

\tau^* \in \arg\min_{\tau} \; \frac{1}{K} \sum_{k=1}^{K} \mathrm{OBD}_\tau\!\big(|(V P(\tau))^\top \hat{\Sigma}_k (V P(\tau))|\big) + \nu \, \frac{\mathrm{pbd}(\tau)}{p^2},

where OBD_τ(·) denotes the average value of the off-block-diagonal entries (determined by the threshold τ) of a matrix, pbd(τ) is the total number of entries in the blocks induced by τ, and ν ∈ R is a regularization parameter. The penalization term pbd(τ)/p² is introduced to avoid always selecting a zero threshold, and the regularization parameter is set to ν = (1/K) Σ_{k=1}^K λ_min(Σ̂_k), where λ_min(·) denotes the minimum eigenvalue. The optimal permutation is then P* := P(τ*) and the estimated irreducible joint block diagonalizer is Û = V P*.

A.4 Threshold selection for opt-invariant subspaces

In the simulations, we select the threshold λ in (19) by cross-validation. In more detail, we define the grid of possible thresholds

\Lambda := \Big\{ \tfrac{1}{K} \sum_{k=1}^{K} |\hat{c}^{1}_k|, \; \ldots, \; \tfrac{1}{K} \sum_{k=1}^{K} |\hat{c}^{\,q^{\max}_{\hat U}}_k| \Big\},

where, for all j ∈ {1, ..., q^max_Û} and k ∈ {1, ..., K},

\hat{c}^{j}_k := \widehat{\mathrm{Corr}}\big(Y_k - X_k(\Pi_{\hat{S}_j}\hat{\gamma}), \; X_k(\Pi_{\hat{S}_j}\hat{\gamma})\big).

We then split the historical data into L = 10 disjoint blocks of observations, and for all ℓ ∈ {1, ..., L} denote by X_ℓ and Y_ℓ the observations in the ℓ-th block and by X_{−ℓ} and Y_{−ℓ} the remaining historical data. For all possible thresholds λ ∈ Λ we proceed in the following way.
For all folds ℓ ∈ {1, ..., L}, we compute an estimate of the invariant component as in (20) using X_{−ℓ} and Y_{−ℓ}, which we denote by β̂inv,−ℓ(λ). Inside the left-out ℓ-th block of observations, we then consider a rolling window of length d = 2p and the observation at t* immediately following the rolling window: we compute the residual parameter δ̂res_{t*}(λ) as in (22) using the d observations in the rolling window, and evaluate the empirical explained variance by β̂inv,−ℓ(λ) + δ̂res_{t*}(λ) on the observation at t*, i.e.,

\widehat{\mathrm{Var}}_{t^*}(\lambda) := (Y_{t^*})^2 - \big(Y_{t^*} - X_{t^*}^\top (\hat{\beta}^{\mathrm{inv},-\ell}(\lambda) + \hat{\delta}^{\mathrm{res}}_{t^*}(\lambda))\big)^2.

We repeat this computation for all possible t* ∈ I_ℓ, where I_ℓ denotes the time points in the ℓ-th block of observations excluding the first d observations, and define

\overline{\mathrm{Var}}{}^{\lambda}_{\ell} := \frac{1}{|I_\ell|} \sum_{t^* \in I_\ell} \widehat{\mathrm{Var}}_{t^*}(\lambda).

For all λ ∈ Λ, we denote the average explained variance over the L folds by V̄ar^λ := (1/L) Σ_{ℓ=1}^L V̄ar^λ_ℓ and the standard error (across the L folds) of this explained variance by

\mathrm{se}(\overline{\mathrm{Var}}{}^{\lambda}) := \frac{1}{L} \sqrt{\sum_{\ell=1}^{L} \big(\overline{\mathrm{Var}}{}^{\lambda}_{\ell} - \overline{\mathrm{Var}}{}^{\lambda}\big)^2}.

Moreover, let λmax := arg max_{λ∈Λ} V̄ar^λ. Then, we choose the optimal threshold as

\lambda^* := \min\big\{ \lambda \in \Lambda \; : \; \overline{\mathrm{Var}}{}^{\lambda} > \overline{\mathrm{Var}}{}^{\lambda_{\max}} - t_{\mathrm{se}} \, \mathrm{se}(\overline{\mathrm{Var}}{}^{\lambda_{\max}}) \big\},

which is the most conservative (lowest) threshold such that the corresponding explained variance is within t_se (in our simulations, we choose t_se = 1) standard errors (computed across the folds) of the maximal explained variance.

A.5 Further simulation details: MSPE comparison

In Section 5.2 we present a simulated experiment in which we compare the ISD estimator and the OLS estimator on the time adaptation task. Figure 6 shows that the difference in the MSPE of γ̂OLS_t and γ̂ISD_t is positive and decreases for increasing values of m. To further support the statement of Theorem 3, we show in Figure 13 the value of this difference against σ²_ad dim(Sinv)/m, when computing γ̂ISD_t both with the estimated and the oracle invariant component.
The figure shows that the difference in the MSPEs indeed satisfies the bound stated in Theorem 3, i.e., it is always greater than σ²_ad dim(Sinv)/m. Moreover, it shows that for small values of m the gain from using the ISD estimator over the OLS is even higher than what the theoretical bound suggests, indicating that the bound is not sharp for small m. We further show in Figure 14 the MSPE of γ̂ISD_t (again computed both with the estimated and the oracle invariant component) against σ²_ad dim(Sres)/m, obtaining in this case an empirical confirmation of the first bound presented in Theorem 3.

Figure 13: Difference in the MSPE of γ̂OLS_t and γ̂ISD_t (and an oracle version of γ̂ISD_t based on the true βinv) with respect to σ²_ad dim(Sinv)/m for different values of m. The computed difference is larger than σ²_ad dim(Sinv)/m for all values of m (in the oracle case), empirically confirming the lower bound obtained in Theorem 3. The filled dots in the boxplots show the mean over 20 runs (while the empty dots represent the outliers).

Figure 14: MSPE of γ̂ISD_t (and an oracle version of γ̂ISD_t based on the true βinv) with respect to σ²_ad dim(Sres)/m for different values of m. The computed MSPE is smaller than σ²_ad dim(Sres)/m for all values of m (in the oracle case), empirically confirming the upper bound obtained in Theorem 3. The filled dots in the boxplots show the mean over 20 runs (while the empty dots represent the outliers).

Appendix B. ISD estimation algorithm

We provide here the pseudocode summarizing the ISD procedure described in Section 4. The algorithm includes the estimation of the intercept, that is, it considers for all t ∈ N the model

Y_t = \gamma^0_{0,t} + X_t^\top \gamma_{0,t} + \epsilon_t \qquad (25)

satisfying the assumptions of Setting 1 but with the addition of γ⁰_{0,t} ∈ R.
In the estimation of the intercept, the algorithm distinguishes between two cases: (a) the intercept remains approximately constant on [n] ∪ I_ad, and (b) the intercept changes with time but is assumed to be approximately constant on I_ad. In case (a), the computation of γ̂⁰_{t*} in line 21 of Algorithm 2 is done by averaging the estimated values of the (approximately constant) intercept on the historical data. Alternatively, in case (a) we could estimate the intercept simultaneously with the invariant component βinv: the whole vector (including the intercept) can be computed by taking the OLS solution of regressing Y on [1_n  X Ûinv] and premultiplying it by [e_1  Ûinv] (where e_1 is the p-dimensional vector with the first entry equal to 1 and the remaining entries equal to zero), in place of lines 16 and 21 of Algorithm 2 (see Remark 4). In case (b), the intercept can instead be estimated simultaneously with the residual component δ̂res_{t*}: the computation in line 22 of the algorithm is equivalent to taking the first component of the OLS solution of regressing Y_ad − X_ad β̂inv on [1_m  X_ad Ûres], premultiplied by [e_1  Ûres].

Remark 4 Recall that the population OLS solution for the linear model (25) at time t can be found by adding a 1 to the vector X_t and solving

\arg\min_{\gamma^0, \gamma} \; \mathbb{E}\Big[\Big(Y_t - [1 \;\; X_t^\top] \begin{pmatrix} \gamma^0 \\ \gamma \end{pmatrix}\Big)^2\Big].

The problem has the closed-form solution

\begin{pmatrix} 1 & \mathbb{E}[X_t]^\top \\ \mathbb{E}[X_t] & \mathbb{E}[X_t X_t^\top] \end{pmatrix}^{-1} \begin{pmatrix} \mathbb{E}[Y_t] \\ \mathbb{E}[X_t Y_t] \end{pmatrix} = \begin{pmatrix} \mathbb{E}[Y_t] - \mathbb{E}[X_t]^\top \mathrm{Var}(X_t)^{-1} \mathrm{Cov}(X_t, Y_t) \\ \mathrm{Var}(X_t)^{-1} \mathrm{Cov}(X_t, Y_t) \end{pmatrix}.

Algorithm 2 ISD: estimation
Input: observations (X_t, Y_t)_{t∈[n]∪I_ad}, X_{t*}, number of windows K, λ ∈ [0, 1]
Output: β̂inv, δ̂res_{t*}, γ̂_{t*}, γ̂⁰_{t*}
1: (X_k)_{k∈[K]} ← ([X_{(k−1)n/K+1}, ..., X_{kn/K}]^⊤)_{k∈[K]}
2: (Y_k)_{k∈[K]} ← ([Y_{(k−1)n/K+1}, ..., Y_{kn/K}]^⊤)_{k∈[K]}
3: (Σ̂_k)_{k∈[K]} ← (V̂ar(X_k))_{k∈[K]}
4: for k = 1, ..., K do
5:     (γ̂⁰_k, γ̂_k) ← OLS(Y_k, [1  X_k])
6: Û, {Ŝ_j}_{j=1}^q ← approxIrreducibleJointBlockDiagonalizer((Σ̂_k)_{k∈[K]})   ▷ see Sec. 4.1.1
7: Find the opt-invariant subspaces to estimate Sinv, Sres:
8: Ŝinv ← {0}, Ŝres ← {0}
9: for j = 1, ..., q do
10:     Π_{Ŝ_j} ← Û_{S_j}(Û_{S_j})^⊤
11:     (ĉ^j_k)_{k∈[K]} ← (Ĉorr(Y_k − X_k(Π_{Ŝ_j}γ̂), X_k(Π_{Ŝ_j}γ̂)))_{k∈[K]}
12:     if (1/K) Σ_{k=1}^K |ĉ^j_k| ≤ λ then Ŝinv ← Ŝinv ⊕ Ŝ_j   ▷ see Eq. (19)
13:     else Ŝres ← Ŝres ⊕ Ŝ_j
14: Invariant component estimation:
15: X ← [X_1 ... X_n]^⊤, Y ← [Y_1 ... Y_n]^⊤
16: β̂inv ← Ûinv V̂ar(X Ûinv)^{−1} Ĉov(X Ûinv, Y)   ▷ see Eq. (21)
17: Adaptation step:
18: X_ad ← [(X_t)_{t∈I_ad}]^⊤, Y_ad ← [(Y_t)_{t∈I_ad}]^⊤
19: δ̂res_{t*} ← Ûres V̂ar(X_ad Ûres)^{−1} Ĉov(X_ad Ûres, Y_ad − X_ad β̂inv)   ▷ see Eq. (22)
20: Intercept estimation:
21: if {γ̂⁰_k}_{k∈[K]} approximately constant then γ̂⁰_{t*} ← mean of {γ̂⁰_k}_{k∈[K]}
22: else γ̂⁰_{t*} ← Ê[Y_ad − X_ad β̂inv] − Ê[X_ad]^⊤ δ̂res_{t*}
23: γ̂_{t*} ← β̂inv + δ̂res_{t*}   ▷ see Eq. (23)

Appendix C. Extension to non-orthogonal subspaces

In Section 2.1 we defined an orthogonal and (X_t)_{t∈[n]}-decorrelating partition as a collection {S_j}_{j∈{1,...,q}} of pairwise orthogonal linear subspaces of R^p satisfying (6). Finding the orthogonal subspaces that form the partition means, in particular, finding a rotation of the original X-space such that each subspace is spanned by a subset of the rotated axes, and the coordinates of the predictors projected onto one subspace are uncorrelated with those in the remaining subspaces. In this section, we show that orthogonality of the subspaces in the partition is not strictly required to obtain a separation of the true time-varying parameter of the form (7); that is, more general invertible linear transformations can be considered besides rotations. In particular, we briefly present results similar to the ones obtained throughout Section 2, but where the subspaces in the partition are not necessarily orthogonal. To do so, we call a collection of (not necessarily orthogonal) linear subspaces S_1, ..., S_q ⊆ R^p with ⊕_{j=1}^q S_j = R^p satisfying (6) a (X_t)_{t∈[n]}-decorrelating partition (of cardinality q), and further call it irreducible if it is of maximal cardinality.
A (Xt)_{t∈[n]}-decorrelating partition can still be identified through a joint block diagonalization of the covariance matrices (Σt)_{t∈[n]} as described in Section 2.1.1, but with an adjustment. More specifically, instead of assuming that the joint diagonalizer U is an orthogonal matrix, we only assume that it is invertible and that, for all j ∈ {1, . . . , q}, the columns of USj are orthonormal vectors. A version of Proposition 1 in which the resulting partition is not necessarily orthogonal follows with the same proof. Moreover, similarly to the orthogonal case, the uniqueness of an irreducible (Xt)_{t∈[n]}-decorrelating partition is implied by the uniqueness of an irreducible non-orthogonal joint block diagonalizer for (Σt)_{t∈[n]}; explicit conditions under which such uniqueness holds can be found, for example, in the work by Nion (2011). In the results presented in the remainder of this section we adopt the notation introduced in Section 2.1.1, and we additionally define the matrix W := U^{−⊤}. A (Xt)_{t∈[n]}-decorrelating partition of cardinality q allows us to decompose the true time-varying parameter into the sum of q components. To obtain such a decomposition via non-orthogonal subspaces, oblique projections need to be considered in place of orthogonal ones. Oblique projections are defined (see, e.g., Schott, 2016) for two subspaces S1, S2 ⊆ Rp such that S1 ⊕ S2 = Rp and a vector x ∈ Rp as the vectors x1 ∈ S1 and x2 ∈ S2 such that x = x1 + x2: x1 is called the projection of x onto S1 along S2, and x2 the projection of x onto S2 along S1. For a (Xt)_{t∈[n]}-decorrelating partition {Sj}^q_{j=1}, we denote by PSj|S−j the oblique projection matrix onto Sj along ⊕_{i∈{1,...,q}\{j}} Si; it can be expressed in terms of a (non-orthogonal) joint block diagonalizer U corresponding to the partition as PSj|S−j = USj(WSj)⊤. Orthogonal and (Xt)_{t∈[n]}-decorrelating partitions {Sj}^q_{j=1} are a special case of (Xt)_{t∈[n]}-decorrelating partitions.
In particular, if the subspaces are pairwise orthogonal, it holds for all j ∈ {1, . . . , q} that PSj|S−j = ΠSj. By the definition of oblique projections, for all t ∈ [n], we can express γ0,t as γ0,t = Σ^q_{j=1} PSj|S−j γ0,t. Similarly to how we define opt-invariance on [n] for orthogonal subspaces in Section 2.1, we say that a subspace Sj in a (Xt)_{t∈[n]}-decorrelating partition is proj-invariant on [n] if it satisfies for all t, s ∈ [n] that PSj|S−j γ0,t = PSj|S−j γ0,s. By Lemma 1 it follows that for orthogonal partitions proj-invariance is equivalent to opt-invariance. For an irreducible (Xt)_{t∈[n]}-decorrelating partition, we now define the invariant subspace Sinv and the residual subspace Sres as

Sinv := ⊕_{j∈{1,...,q}: Sj proj-invariant on [n]} Sj and Sres := ⊕_{j∈{1,...,q}: Sj not proj-invariant on [n]} Sj.

It follows directly from the definition of partitions that {Sinv, Sres} is a (Xt)_{t∈[n]}-decorrelating partition. Moreover, Sinv is proj-invariant on [n] since PSinv|Sres γ0,t = Σ_{j∈{1,...,q}: Sj proj-invariant on [n]} PSj|S−j γ0,t. We finally define the invariant and residual components by βinv := PSinv|Sres γ̄0 and δres_t := PSres|Sinv γ0,t. We show in the following proposition that the expressions (14) and (16), used to construct our estimators for the invariant and residual components in the case of orthogonal partitions, remain valid in the case of non-orthogonal partitions.

Proposition 3 Let {Sj}^q_{j=1} be a (Xt)_{t∈[n]}-decorrelating partition and let U be a joint block diagonalizer corresponding to that partition. Then, the following results hold.
(i) For all t ∈ [n] and for all j ∈ {1, . . . , q},
PSj|S−j γ0,t = USj Var((USj)⊤ Xt)⁻¹ Cov((USj)⊤ Xt, Yt). (26)
(ii) βinv is time-invariant over [n] and βinv = Uinv((Uinv)⊤ V̄ar(X) Uinv)⁻¹ (Uinv)⊤ C̄ov(X, Y).
(iii) For all t ∈ [n], δres_t = Ures((Ures)⊤ Var(Xt) Ures)⁻¹ (Ures)⊤ Cov(Xt, Yt − Xt⊤ βinv).
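Before the proofs, the algebra behind these oblique projections can be sanity-checked numerically. The sketch below is illustrative only (not the paper's code; it assumes numpy, and the matrix U and the index sets are arbitrary choices): it verifies the defining identities of PSj|S−j = USj(WSj)⊤ with W = U^{−⊤}.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 4))        # generic invertible U; its column groups define subspaces
W = np.linalg.inv(U).T             # W := U^{-T}, so that W.T @ U = I
blocks = [[0, 1], [2, 3]]          # column index sets S1, S2

# Oblique projections P_{Sj | S_-j} = U^{Sj} (W^{Sj})^T
P = [U[:, S] @ W[:, S].T for S in blocks]

# Idempotency, complementarity and the direct-sum identity P1 + P2 = I.
assert np.allclose(P[0] @ P[0], P[0])
assert np.allclose(P[0] @ P[1], 0)
assert np.allclose(P[0] + P[1], np.eye(4))

# Each P_j acts as the identity on its subspace and annihilates the other one.
assert np.allclose(P[0] @ U[:, [0, 1]], U[:, [0, 1]])
assert np.allclose(P[0] @ U[:, [2, 3]], 0)
print("oblique projection identities verified")
```

These identities follow directly from W⊤U = Ip: the (Si, Sj) blocks of W⊤U are identity for i = j and zero otherwise, which is exactly what the assertions above check.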
Proposition 3 implies in particular that, apart from the differences in the joint block diagonalization step, estimating βinv and δres_t in the non-orthogonal case can be done as described in Section 4. Moreover, the parameter βinv defined for non-orthogonal (Xt)_{t∈[n]}-decorrelating partitions is still a time-invariant parameter. Under a generalization assumption analogous to Assumption 2, βinv has positive explained variance at all time points t ∈ N and can be used to at least partially predict γ0,t (we have not added this result explicitly). In addition, also in the non-orthogonal case, the estimation of the residual component only requires estimating a reduced number of parameters, namely dim(Sres). In the case of non-orthogonal partitions, however, we cannot directly interpret ISD as separating the true time-varying parameter into two separate optimizations of the explained variance over Sinv and Sres.

Example 2 (non-orthogonal irreducible partition) Let Xt ∈ R3 with covariance matrix Σt that for all t ∈ [n] takes one of the following values:

[1 0.5 0; 0.5 1 0; 0 0 1] or [4 2 0; 2 7/2 0; 0 0 1].

These matrices are in block diagonal form, and in particular the 2 × 2 submatrices forming the first diagonal block do not commute. This implies that an irreducible (orthogonal) joint block diagonalizer for these matrices is the three-dimensional identity matrix I3, and the diagonal blocks cannot be further reduced into smaller blocks using an orthogonal transformation. It also implies that an irreducible orthogonal and (Xt)_{t∈[n]}-decorrelating partition is given by {Sj}^2_{j=1} with S1 := span{e1, e2} and S2 := span{e3}, where ej is the j-th vector of the canonical basis of R3. There exists, however, a non-orthogonal joint diagonalizer for these matrices, that is, a non-orthogonal matrix U such that Σ̃t := U⊤ Σt U is diagonal.
It is given by

U = [1 1/√5 0; 0 −2/√5 0; 0 0 1].

U induces an irreducible (Xt)_{t∈[n]}-decorrelating partition {S̃j}^3_{j=1} with S̃1 := span(US1), S̃2 := span(US2) and S̃3 := span(US3). Let Sinv and S̃inv be the invariant subspaces associated with the irreducible orthogonal partition {Sj}^2_{j=1} and the irreducible partition {S̃j}^3_{j=1}, respectively. As any irreducible orthogonal and (Xt)_{t∈[n]}-decorrelating partition is also a (Xt)_{t∈[n]}-decorrelating partition, it in general holds that dim(Sinv) ≤ dim(S̃inv). Moreover, as in the explicit example above, the inequality can be strict.

Appendix D. Auxiliary results

Lemma 3 Let B ∈ Rm×m be a symmetric invertible matrix, U ∈ Rm×m an orthogonal block diagonalizer of B and S1, . . . , Sq ⊆ {1, . . . , m} disjoint subsets satisfying

U⊤BU = diag((US1)⊤BUS1, . . . , (USq)⊤BUSq).

Then it holds for all j ∈ {1, . . . , q} that (ΠSj B ΠSj)† = ΠSj B⁻¹ ΠSj, where ΠSj := USj(USj)⊤.

Proof The pseudo-inverse A† of a matrix A is defined as the unique matrix satisfying (i) AA†A = A, (ii) A†AA† = A†, (iii) (AA†)⊤ = AA† and (iv) (A†A)⊤ = A†A. Fix j ∈ {1, . . . , q} and define A := ΠSj B ΠSj and A† := ΠSj B⁻¹ ΠSj. Moreover, define B̃ := U⊤BU and, for all k ∈ {1, . . . , q}, B̃k := (USk)⊤BUSk. We now verify that conditions (i)–(iv) hold for A and A†, and hence A† is indeed the pseudo-inverse. Conditions (ii) and (iv) hold by symmetry of B and ΠSj. For (i) and (iii), first observe that by the properties of orthogonal matrices it holds that B̃⁻¹ = U⊤B⁻¹U and, due to the block diagonal structure of B̃,

diag(B̃1⁻¹, . . . , B̃q⁻¹) = [(US1)⊤; . . . ; (USq)⊤] B⁻¹ [US1, . . . , USq]. (27)

Hence we get that B̃j⁻¹ = (USj)⊤B⁻¹USj. For (i), we now get

ΠSj B ΠSj ΠSj B⁻¹ ΠSj ΠSj B ΠSj = USj(USj)⊤BUSj(USj)⊤B⁻¹USj(USj)⊤BUSj(USj)⊤ = USj B̃j B̃j⁻¹ B̃j (USj)⊤ = ΠSj B ΠSj.

Similarly, for (iii) we get

(ΠSj B ΠSj ΠSj B⁻¹ ΠSj)⊤ = (USj(USj)⊤BUSj(USj)⊤B⁻¹USj(USj)⊤)⊤ = (USj B̃j B̃j⁻¹ (USj)⊤)⊤ = (ΠSj)⊤ = ΠSj B ΠSj ΠSj B⁻¹ ΠSj.

This completes the proof of Lemma 3.

Lemma 4 Let N ⊆ ℕ and let {Sj}^q_{j=1} be an orthogonal and (Xt)_{t∈N}-decorrelating partition.
Then, there exists a joint block diagonalizer of (Σt)_{t∈N}. More precisely, there exists an orthonormal matrix U ∈ Rp×p such that for all t ∈ N the matrix Σ̃t := U⊤ Σt U is block diagonal with q diagonal blocks Σ̃t,j = (USj)⊤ Σt USj, j ∈ {1, . . . , q}, of dimension |Sj| = dim(Sj), where Sj ⊆ {1, . . . , p} indexes a subset of the columns of U. Moreover, ΠSj = USj(USj)⊤.

Proof Let U = (u1, . . . , up) ∈ Rp×p be an orthonormal matrix with columns u1, . . . , up such that for all j ∈ {1, . . . , q} there exists Sj ⊆ {1, . . . , p} such that Sj = span({uk | k ∈ Sj}). Such a matrix U can be constructed by selecting an orthonormal basis for each of the disjoint subspaces {Sj}^q_{j=1}. Furthermore, assume that the columns of U are ordered in such a way that for all i, j ∈ {1, . . . , q} with i < j it holds for all k ∈ Si and ℓ ∈ Sj that k < ℓ. As the matrix U is orthogonal, it holds for all j ∈ {1, . . . , q} that the projection matrix ΠSj can be expressed as ΠSj = USj(USj)⊤ and hence, using the definition of an orthogonal partition (see (6)), it holds for all t ∈ N that

Σ̃t := U⊤ Σt U = diag(Σ̃t,1, . . . , Σ̃t,q),

where for all j ∈ {1, . . . , q} we defined Σ̃t,j := (USj)⊤ Σt USj.

Lemma 5 Let {Sj}^q_{j=1} be an orthogonal and (Xt)_{t∈[n]}-decorrelating partition. Then it holds for all t ∈ [n] and for all j ∈ {1, . . . , q} that

Var(ΠSj Xt)† = ΠSj Var(Xt)⁻¹ ΠSj. (28)

Proof Let U ∈ Rp×p be an orthonormal matrix such that, for all t ∈ [n], Σ̃t := U⊤ Σt U is block diagonal with q diagonal blocks Σ̃t,1, . . . , Σ̃t,q of dimensions dim(S1), . . . , dim(Sq). Such a matrix exists by Lemma 4, each diagonal block is given by Σ̃t,j := (USj)⊤ Σt USj, and the projection matrix ΠSj can be expressed as ΠSj = USj(USj)⊤. The statement then follows from Lemma 3.

D.1 Proof of Lemma 1

Proof Let U ∈ Rp×p be an orthonormal matrix such that, for all t ∈ [n], Σ̃t := U⊤ Σt U is block diagonal with q diagonal blocks Σ̃t,1, . . . , Σ̃t,q of dimensions dim(S1), . . . , dim(Sq).
Such a matrix exists by Lemma 4, each diagonal block is given by Σ̃t,j := (USj)⊤ Σt USj, and the projection matrix ΠSj can be expressed as ΠSj = USj(USj)⊤. Now, for an arbitrary t ∈ [n], it holds using the linear model (1) that γ0,t = Σt⁻¹ Cov(Xt, Yt), and hence we use U to get the following expansion:

γ0,t = Σt⁻¹ Cov(Xt, Yt) = U Σ̃t⁻¹ U⊤ Cov(Xt, Yt) = U diag(Σ̃t,1⁻¹, . . . , Σ̃t,q⁻¹) U⊤ Cov(Xt, Yt) = U [Σ̃t,1⁻¹ (US1)⊤ Cov(Xt, Yt); . . . ; Σ̃t,q⁻¹ (USq)⊤ Cov(Xt, Yt)].

By the properties of orthogonal matrices it holds that Σ̃t⁻¹ = U⊤ Σt⁻¹ U and, due to the block diagonal structure of Σ̃t,

diag(Σ̃t,1⁻¹, . . . , Σ̃t,q⁻¹) = [(US1)⊤; . . . ; (USq)⊤] Σt⁻¹ [US1, . . . , USq]. (29)

This implies that Σ̃t,j⁻¹ = (USj)⊤ Σt⁻¹ USj and therefore

γ0,t = Σ^q_{j=1} USj(USj)⊤ Σt⁻¹ USj(USj)⊤ Cov(Xt, Yt) = Σ^q_{j=1} ΠSj Σt⁻¹ ΠSj Cov(Xt, Yt) = Σ^q_{j=1} ΠSj Σt⁻¹ ΠSj ΠSj Cov(Xt, Yt) = Σ^q_{j=1} Var(ΠSj Xt)† Cov(ΠSj Xt, Yt).

In the second to last equality, we have used that ΠSj is a projection matrix and thus idempotent. In the last equality we have used that Var(ΠSj Xt)† = ΠSj Σt⁻¹ ΠSj, by Lemma 5. It now suffices to show that ΠSj γ0,t = Var(ΠSj Xt)† Cov(ΠSj Xt, Yt), which follows from the computation

ΠSj γ0,t = ΠSj Σ^q_{i=1} ΠSi Var(Xt)⁻¹ ΠSi Cov(ΠSi Xt, Yt) = ΠSj Var(Xt)⁻¹ ΠSj Cov(ΠSj Xt, Yt) = Var(ΠSj Xt)† Cov(ΠSj Xt, Yt),

where the first equality follows from the first part of this proof and the third equality follows from Lemma 5.

D.2 Proof of Lemma 2

Proof For all j ∈ {1, . . . , q}, for all β ∈ Sj and for all t ∈ N it holds that

Vart(β) = 2 Cov(Yt, Xt)β − β⊤ Var(Xt)β = 2 Cov(Yt, Σ^q_{k=1} ΠSk Xt) ΠSj β − β⊤ ΠSj Var(Σ^q_{k=1} ΠSk Xt) ΠSj β = 2 Cov(Yt, ΠSj Xt) ΠSj β − β⊤ ΠSj Var(ΠSj Xt) ΠSj β = 2 Cov(Yt, ΠSj Xt)β − β⊤ Var(ΠSj Xt)β, (30)

where the first equality follows by (4) and the third equality follows from the definition of an orthogonal partition.
It follows that

∇Vart(β) = 2 Cov(ΠSj Xt, Yt) − 2 Var(ΠSj Xt)β,

where ∇ denotes the gradient. The equation ∇Vart(β) = 0 has a unique solution in Sj given by Var(ΠSj Xt)† Cov(ΠSj Xt, Yt). To see this, observe that all solutions in Rp are given, for an arbitrary vector w ∈ Rp, by

Var(ΠSj Xt)† Cov(ΠSj Xt, Yt) + (Ip − Var(ΠSj Xt)† Var(ΠSj Xt))w, (31)

where Ip denotes the identity matrix. Let U ∈ Rp×p and (Σ̃t)_{t∈N} be defined for the orthogonal and (Xt)_{t∈N}-decorrelating partition as in Lemma 4. We now observe that

Var(ΠSj Xt)† Var(ΠSj Xt) = ΠSj Σt⁻¹ ΠSj ΠSj Σt ΠSj (32) = USj(USj)⊤ Σt⁻¹ USj(USj)⊤ Σt USj(USj)⊤ = USj Σ̃t,j⁻¹ Σ̃t,j (USj)⊤ = ΠSj,

where the first equality follows from Lemma 3, since U jointly block diagonalizes the matrices (Σt)_{t∈N} by Lemma 4. We can therefore rewrite (31) as Var(ΠSj Xt)† Cov(ΠSj Xt, Yt) + (Ip − ΠSj)w. For all w ∈ Sj, this expression equals Var(ΠSj Xt)† Cov(ΠSj Xt, Yt). For all w ∉ Sj, it is not in Sj. This concludes the proof of Lemma 2.

Lemma 6 {Sinv, Sres} is an orthogonal and (Xt)_{t∈[n]}-decorrelating partition.

Proof Let {Sj}^q_{j=1} be a fixed irreducible orthogonal and (Xt)_{t∈[n]}-decorrelating partition according to which Sinv and Sres are defined. By orthogonality of the subspaces in the partition and by definition of Sinv and Sres, Sres = (Sinv)⊥. Moreover, by definition of an orthogonal and (Xt)_{t∈[n]}-decorrelating partition, it holds that

Cov(ΠSinv Xt, ΠSres Xt) = Cov(Σ_{j∈{1,...,q}: Sj opt-invariant on [n]} ΠSj Xt, Σ_{j∈{1,...,q}: Sj not opt-invariant on [n]} ΠSj Xt) = 0.

Lemma 7 Sinv is opt-invariant on [n].

Proof That Sinv is opt-invariant on [n] can be seen from the following computation. It holds for all t, s ∈ [n] that

argmax_{β∈Sinv} Vart(β) = Var(ΠSinv Xt)† Cov(ΠSinv Xt, Yt) = Σ_{j∈{1,...,q}: Sj opt-invariant on [n]} Var(ΠSj Xt)† Cov(ΠSj Xt, Yt) = Σ_{j∈{1,...,q}: Sj opt-invariant on [n]} argmax_{β∈Sj} Vart(β) = Σ_{j∈{1,...,q}: Sj opt-invariant on [n]} argmax_{β∈Sj} Vars(β) = argmax_{β∈Sinv} Vars(β).
The first equality holds by Lemma 2, since {Sinv, Sres} is indeed an orthogonal and (Xt)_{t∈[n]}-decorrelating partition, see Lemma 6. The second equality follows from the definition of Sinv and can be proved using Lemma 1 and observing that the set of subspaces {Sj | j ∈ {1, . . . , q} : Sj opt-invariant on [n]} is an orthogonal and (ΠSinv Xt)_{t∈[n]}-decorrelating partition. The third equality holds by Lemma 2 and the fourth by definition of an opt-invariant subspace on [n].

Lemma 8 Let {At}^n_{t=1} be a set of n symmetric strictly positive definite matrices. If there exists a matrix A ∈ {At}^n_{t=1} that has all distinct eigenvalues, then any two irreducible joint block diagonalizers U, Ũ for the set {At}^n_{t=1} are equal up to block permutations and block-wise isometric transformations.

Proof We start by observing that if a matrix is symmetric and has all distinct eigenvalues, then its eigenvectors are orthogonal to each other and unique up to scaling. We define Q ∈ Rp×p as the orthonormal matrix whose columns are eigenvectors of A; Q is then uniquely defined up to permutations of its columns. We now exploit the results by Murota et al. (2010) used in the construction of an irreducible orthogonal joint block diagonalizer for a set of symmetric matrices {At}^n_{t=1} (not necessarily containing a matrix with all distinct eigenvalues). In the following, we translate all the useful results by Murota et al. (2010) into the notation introduced for joint block diagonalizers in Section 2.1.1. Murota et al. (2010) show in Theorem 1 that there exists an irreducible orthogonal joint block diagonalizer U such that, for all t ∈ [n], U⊤ At U = ⊕^q_{j=1}(I_{mj} ⊗ Ãt,j), where q, mj ∈ N, 0 < q, mj ≤ p are such that q_U^max = Σ^q_{j=1} mj, and Ãt,j are square matrices (common diagonal blocks). Here ⊕ denotes the direct sum operator for matrices and ⊗ the Kronecker product. They further propose to partition the columns of the matrix U into q subsets, each denoted by USj, with j ∈ {1, . . .
, q} indexing the diagonal blocks and Sj ⊆ {1, . . . , p} denoting the subset of indices corresponding to the selected columns in USj, such that, for all t ∈ [n], (USj)⊤ At USj = I_{mj} ⊗ Ãt,j. They then argue that, as a consequence of Theorem 1, the spaces spanned by such subsets of columns, that is, Uj := span{uk | k ∈ Sj}, are uniquely defined. We therefore only need to prove that, if at least one matrix in the set {At}^n_{t=1} has all distinct eigenvalues, then for all j ∈ {1, . . . , q}, mj = 1. This condition implies that the diagonal blocks indexed by Sj, j ∈ {1, . . . , q}, are irreducible, and q_U^max = q. The uniqueness of the spaces Uj then implies the uniqueness of the irreducible joint block diagonalizer U up to block permutations and block-wise isometric transformations. To show this result, we now use Murota et al. (2010, Proposition 1). In particular, if the set {At}^n_{t=1} contains at least one matrix A with all distinct eigenvalues, then the assumptions of Proposition 1 are satisfied. Let {Q1, . . . , Qp} be the set of eigenspaces of A and for all i ∈ {1, . . . , p} let m̃i := dim(Qi) = 1. The proposition then implies that for all i ∈ {1, . . . , p} there exists j ∈ {1, . . . , q} such that Qi ⊆ Uj. Moreover, for all i such that Qi ⊆ Uj it holds that m̃i = mj. As a consequence, we obtain that, since the eigenvalues of A are distinct, m̃1 = · · · = m̃p = 1 and therefore, for all j ∈ {1, . . . , q}, mj = 1. This concludes the proof of Lemma 8.

Lemma 9 Let {Sj}^q_{j=1} be an irreducible orthogonal and (Xt)_{t∈[n]}-decorrelating partition. Then, for all j ∈ {1, . . . , q} it holds that

argmax_{β∈Sj} V̄ar(β) = ((1/n) Σ^n_{t=1} Var(ΠSj Xt))† ((1/n) Σ^n_{t=1} Cov(ΠSj Xt, Yt)).

In addition, for all j ∈ {1, . . . , q} such that Sj is opt-invariant on [n], it holds for all t ∈ [n] that

argmax_{β∈Sj} V̄ar(β) = ΠSj γ0,t = ΠSj γ̄0.

Moreover, ΠSj γ̄0 is time-invariant on [n].

Proof Let U ∈ Rp×p and (Σ̃t)_{t∈[n]} be defined for the orthogonal and (Xt)_{t∈[n]}-decorrelating partition as in Lemma 4. For all j ∈ {1, . . .
, q} and for all β Sj it follows from (30) that t=1 Cov(Yt, ΠSj Xt)β β 1 t=1 Var(ΠSj Xt)β which has gradient ( Var)(β) = 2 t=1 Cov(ΠSj Xt, Yt) 2 t=1 Var(ΠSj Xt)β. The equation ( Var)(β) = 0 has solution equal to t=1 Var(ΠSj Xt) t=1 Cov(ΠSj Xt, Yt). Lazzaretto, Peters, Pfister By Lemma 3, it holds that 1 n Pn t=1 Var(ΠSj Xt) = ΠSj 1 n Pn t=1 Var(Xt) 1 ΠSj and therefore β Sj. In order to apply Lemma 3, we in particular use that U is such that, for all t [n], Σt = U Σt U is block diagonal with diagonal blocks given, for all j {1, . . . , q}, by Σt,j = (USj) Σt USj. This implies that U (Pn t=1 Σt) U is also block diagonal with diagonal blocks (USj) (Pn t=1 Σt) USj. Moreover, its inverse is block diagonal and, by the properties of orthogonal matrices, its diagonal blocks are (USj) (Pn t=1 Σt) 1 USj. It now remains to show that this is also the only solution in Sj. All other solutions in Rp are given, for an arbitrary vector w Rp, by t=1 Var(ΠSj Xt) t=1 Var(ΠSj Xt) =β + Ip ΠSj w. The equality follows from the following computation 1 n t=1 Var(ΠSj Xt) t=1 Var(ΠSj Xt) t=1 Var(Xt) t=1 Var(Xt) = USj(USj) n X ! 1 USj(USj) n X The first equality follows again from Lemma 3. For all w Sj, β + (Ip ΠSj)w equals β . For all w Sj, β + (Ip ΠSj)w is not in Sj. For all Sj opt-invariant on [n] and for all t [n] it holds that arg max β Sj = arg max β Sj s=1 Vars(β) = arg max β Sj Vart(β) = Var(ΠSj Xt) Cov(ΠSj Xt, Yt) where we used the definition of opt-invariance on [n] for the second equality, Lemma 2 for the third equality and Lemma 1 for the fourth equality. Since the result holds for all t [n], it also follows that ΠSjγ0,t = ΠSj γ0. Invariant Subspace Decomposition We now need to prove for all t [n] that Cov(Yt X t ΠSj γ0, X t ΠSj γ0) = 0. To see this, fix t [n]. 
Then Cov(Yt X t ΠSj γ0, X t ΠSj γ0) = Cov(X t (γ0,t ΠSjγ0,t), X t ΠSjγ0,t) = Cov(X t ( i=1 ΠSiγ0,t ΠSjγ0,t), X t ΠSjγ0,t) = Cov(X t X i {1,...,q}:i =j ΠSiγ0,t, X t ΠSjγ0,t) i {1,...,q}:i =j γ 0,t Cov(ΠSi Xt, ΠSj Xt)γ0,t The last equality follows from the definition of an orthogonal and (Xt)t [n]-decorrelating partition. Lemma 10 Let {Sj}q j=1 be an irreducible orthogonal and (Xt)t [n]-decorrelating partition and let Sinv be the corresponding invariant subspace. Moreover, let Var(X) := 1 n Pn t=1 Var(Xt) and Cov(X, Y ) := 1 n Pn t=1 Cov(Xt, Yt). Finally, let Uinv be the submatrix of an arbitrary irreducible joint block diagonalizer U corresponding to the irreducible orthogonal and (Xt)t [n]-decorrelating partition whose columns span Sinv. Then, βinv = Uinv((Uinv) Var(X)Uinv) 1(Uinv) Cov(X, Y ). Proof Expanding the definition of βinv, we obtain that βinv = arg max β Sinv Var(β) = arg max β Sinv 1 n t=1 Vart(β) = arg max β Sinv 1 n s=1 (2 Cov(Yt, ΠSinv Xt)β β Var(ΠSinv Xt)β) t=1 Var(ΠSinv Xt) t=1 Cov(ΠSinv Xt, Yt) = ΠSinv Var(X) 1ΠSinv Cov(X, Y ) = Uinv((Uinv) Var(X)Uinv) 1(Uinv) Cov(X, Y ). The fourth equality follows from Lemma 9. The fifth equality follows by Lemma 3 (see the proof of Lemma 9). In the last equality we used that (Uinv) Var(X) 1Uinv = ((Uinv) Var(X)Uinv) 1, which follows from the properties of orthogonal matrices and Lazzaretto, Peters, Pfister the block diagonal structure of U Var(X)U: these imply that (U Var(X)U) 1 = U Var(X) 1U and ((US1) Var(X)US1) 1 ... ((USq) Var(X)USq) 1 (US1) Var(X) 1US1 ... (USq) Var(X) 1USq Appendix E. Proofs E.1 Proof of Theorem 1 Proof Let {Sj}q j=1 be an irreducible orthogonal and (Xt)t [n]-decorrelating partition and define Sinv and Sres as in (11). By Assumption 2, {Sinv, Sres} form an orthogonal and (Xt)t N-decorrelating partition and Sinv is opt-invariant on N. Therefore, using Lemma 1 and Lemma 2, we get for all t N that γ0,t = arg max β Sinv Vart(β) + arg max β Sres Vart(β). 
Furthermore, Sinv opt-invariant on N implies that the first term, argmax_{β∈Sinv} Vart(β), does not depend on t, and hence it holds for all t ∈ N that

γ0,t = argmax_{β∈Sinv} V̄ar(β) + argmax_{β∈Sres} Vart(β).

Moreover, Assumption 1 ensures that the invariant and residual subspaces can be uniquely identified from an arbitrary irreducible orthogonal and (Xt)_{t∈[n]}-decorrelating partition, and therefore all the above results do not depend on the considered partition. This concludes the proof of Theorem 1.

E.2 Proof of Proposition 1

Proof (i) For all j ∈ {1, . . . , q_U^max}, let USj denote the submatrix of U formed by the columns indexed by Sj. Then, the orthogonal projection matrix onto the subspace Sj can be expressed as ΠSj = USj(USj)⊤. It follows for all t ∈ [n] and all i ≠ j that

Cov(ΠSi Xt, ΠSj Xt) = ΠSi Σt ΠSj = USi(USi)⊤ Σt USj(USj)⊤ = 0,

where the last equality holds since (USi)⊤ Σt USj is the (i, j)-th (off-diagonal) block of Σ̃t := U⊤ Σt U, which is zero due to the block diagonal structure of Σ̃t. Irreducibility of the orthogonal partition follows from the irreducibility of the joint block diagonal decomposition. (ii) The statement follows from Lemma 4, taking N = [n]. Irreducibility of the joint block diagonalizer follows from the irreducibility of the orthogonal partition.

E.3 Proof of Proposition 2

Proof (i) First, observe that, by Lemma 6, {Sinv, Sres} is an orthogonal and (Xt)_{t∈[n]}-decorrelating partition and that, by Lemma 7, Sinv is opt-invariant on [n]. Then, by definition of βinv and by Lemma 9 we get for all t ∈ [n] that βinv = argmax_{β∈Sinv} V̄ar(β) = ΠSinv γ0,t = ΠSinv γ̄0. (ii) We need to prove for all t ∈ [n] that Cov(Yt − Xt⊤ βinv, Xt⊤ βinv) = 0. To see this, fix t ∈ [n].
Then Cov(Yt X t βinv, X t βinv) = Cov(X t (γ0,t βinv), X t βinv) = Cov(X t (ΠSinvγ0,t + ΠSresγ0,t βinv), X t (ΠSinvβinv)) = Cov(X t ΠSresγ0,t, X t ΠSinvβinv) = γ 0,t Cov(ΠSres Xt, ΠSinv Xt)βinv where the third equality uses βinv = ΠSinvγ0,t by Proposition 2 (i) and the last equality follows from the fact that Sinv, Sres are an orthogonal and (Xt)t [n]-decorrelating partition by Lemma 6. (iii) Let Bn denote the set of all time-invariant parameters over [n]. By assumption, we have that Bn Sinv. Moreover, by point (ii), we have that βinv Bn. Therefore, by definition of βinv we obtain that βinv = arg maxβ Bn Var(β). This completes the proof of Proposition 2. E.4 Proof of Theorem 2 Proof Under Assumptions 1 and 2 and using the definitions of βinv and δres t , we can write the explained variance of β Rp at time t N for the true time varying parameter γ0,t Rp Vart(β) = 2γ 0,t Var(Xt)β β Var(Xt)β = 2(βinv + δres t ) Var((ΠSinv + ΠSres)Xt)β β Var(Xt)β = 2(βinv) Var(ΠSinv Xt)ΠSinvβ + 2(δres t ) Var(ΠSres Xt)ΠSresβ β Var(Xt)β. Lazzaretto, Peters, Pfister Using this expansion, we get, since δres t = γ0,t βinv, for all β / Sinv that inf γ0,t Rp: γ0,t βinv Sres Vart(β) = . arg max β Rp inf γ0,t Rp: γ0,t βinv Sres Vart(β) = arg max β Sinv inf γ0,t Rp: γ0,t βinv Sres Vart(β) = arg max β Sinv Vart(β). Since by assumption it holds that Sinv is opt-invariant on N, it further holds that for all t N arg max β Sinv Vart(β) = arg max β Sinv Var(β). The claim follows from the definition of βinv. E.5 Proof of Theorem 3 Proof We assume without loss of generality that the observed predictors Xt have zero mean in t Iad t . Alternatively, as mentioned in Section 4.1.2, a constant term could be added to Xt to account for the mean. We also observe that ˆγOLS t and ˆδres t are both unbiased estimators for γ0,t and δres t , respectively (this can be checked using standard OLS analysis). We now start by computing the out-of-sample MSPE for ˆγISD t . 
MSPE(ˆγISD t ) = E[(X t (ˆγISD t γ0,t ))2] = E[(X t (ΠSinv + ΠSres)(ˆβinv + ˆδres t βinv δres t ))2] = trace(E[(ˆβinv βinv)(ˆβinv βinv) ] Var(ΠSinv Xt )) + trace(E[(ˆδres t δres t )(ˆδres t δres t ) ] Var(ΠSres Xt )) = trace(E[Var(ˆβinv | X)] Var(ΠSinv Xt )) + trace(E[Var(ˆδres t | Xad)] Var(ΠSres Xt )) = trace(E[Var(ˆβinv | X)] Var(ΠSinv Xt )) + σ2 ad m trace(ΠSres E[( 1 m(Xad) Xad) 1]ΠSres Var(ΠSres Xt )), where we have used that Var(ˆδres t | Xad) = σ2 adΠSres((Xad) Xad) 1ΠSres. We further observe that Var(ˆβinv | X) = ΠSinv(X X) 1ΠSinv X Var(ϵ)XΠSinv(X X) 1ΠSinv, where Var(ϵ) is a n n diagonal matrix whose diagonal elements are the error variances at each observed time steps, (Var(ϵt))t [n]. Let σ2 ϵ,min := mint [n] Var(ϵt) and σ2 ϵ,max maxt [n] Var(ϵt). Then, σ2 ϵ,minΠSinv(X X) 1ΠSinv Var(ˆβinv | X) σ2 ϵ,maxΠSinv(X X) 1ΠSinv, Invariant Subspace Decomposition where denotes the Loewner order, and it follows from diagn(σ2 ϵ,min) Var(ϵt) diagn(σ2 ϵ,max), with diagn( ) denoting an n-dimensional diagonal matrix with diagonal elements all equal. We have used in particular that for two symmetric matrices A, B Rn n such that A B and a matrix S Rn n it holds that S AS S BS (see, for example, Theorem 7.7.2. by Horn and Johnson, 2012). Using this relation and Jensen s inequality, we obtain that the first term in MSPE(ˆγISD t ) is lower bounded by trace(E[Var(ˆβinv | X)] Var(ΠSinv Xt )) σ2 ϵ,min trace(ΠSinv E[(X X) 1]ΠSinv Var(ΠSinv Xt )) n trace(ΠSinvΣ 1ΠSinvΣt ΠSinv), where Σ := E[ 1 n X X]. Using now that, by assumption, {Sinv, Sres} is an orthogonal and (Xt)t N-decorrelating partition, let U Rp p be defined for such a partition as in Lemma 4 and let Uinv Rp dim(Sinv) be the submatrix of U whose columns form an orthonormal basis for Sinv, such that ΠSinv = Uinv(Uinv) . Moreover, let Σinv := (Uinv) ΣUinv and Σinv t := (Uinv) Σt Uinv denote, as in Lemma 4, the diagonal block corresponding to Sinv of the block diagonal matrices U ΣU and U Σt U, respectively. 
By the properties of orthogonal matrices, ( Σinv) 1 = (Uinv) Σ 1Uinv. For all i {1, . . . , dim(Sinv)}, let λi denote the i-th eigenvalue in decreasing order. Then, we can further express the above lower bound as n trace(ΠSinvΣ 1ΠSinvΣt ΠSinv) n trace(Uinv(Uinv) Σ 1Uinv(Uinv) Σt Uinv(Uinv) ) n trace(Uinv( Σinv) 1 Σinv t (Uinv) ) σ2 ϵ,min dim(Sinv) n λmin( Σinv t ) λmax( Σinv) =σ2 ϵ,min dim(Sinv) where we define cinv := λmin( Σinv t )/λmax( Σinv). The same term in MSPE(ˆγISD t ) is upper bounded by trace(E[Var(ˆβinv | X)] Var(ΠSinv Xt )) σ2 ϵ,max trace(ΠSinv E[(X X) 1]ΠSinv Var(ΠSinv Xt )) σ2 ϵ,max dim(Sinv) where Cinv := 1 + trace(ΠSinv(E[( 1 n X X) 1] Σ 1 t )ΠSinv Var(ΠSinv Xt )). Proceeding in a similar way, we can express the second term in MSPE(ˆγISD t ) as σ2 ad m trace(ΠSres E[( 1 m(Xad) Xad) 1]ΠSres Var(ΠSres Xt )) = σ2 ad m hres(m), where hres(m) := trace(ΠSres E[( 1 m(Xad) Xad) 1]ΠSres Var(ΠSres Xt )). Since we have assumed that for all t Iad {t } the distribution of Xt does not change, by Jensen s inequality and by the fact that ΠSresΣ 1 t ΠSresΣt ΠSres = ΠSres (see Lemma 6 and (32)) we Lazzaretto, Peters, Pfister have that hres(m) dim(Sres), and limm hres(m) = dim(Sres). Moreover, we can find an upper bound for hres(m) in the following way. hres(m) = trace(ΠSres E[( 1 m(Xad) Xad) 1]ΠSres Var(ΠSres Xt )) = trace(E[( 1 m(Xad) Xad) 1] Var(ΠSres Xt )) trace(E[( 1 m(Xad) Xad) 1]) trace(Var(ΠSres Xt )) m(Xad) Xad) 1] op trace(Var(ΠSres Xt )) m(Xad) Xad) 1 op] trace(Var(ΠSres Xt )) c 1λmax(Σt ) dim(Sres). Summarizing, we obtain that MSPE(ˆγISD t ) σ2 ϵ,min dim(Sinv) n cinv + σ2 ad dim(Sres) m and MSPE(ˆγISD t ) σ2 ϵ,max dim(Sinv) n Cinv + σ2 ad dim(Sres) where Cres := c 1λmax(Σt ) is the constant introduced in the theorem statement. We now compute the MSPE of ˆγOLS t . 
MSPE(ˆγOLS t ) = E[(X t (ˆγOLS t γ0,t ))2] = trace(E[(ˆγOLS t γ0,t )(ˆγOLS t γ0,t ) Xt X t ]) = trace(E[(ˆγOLS t γ0,t )(ˆγOLS t γ0,t ) ] Var(Xt )) = trace(E[Var(ˆγOLS t | Xad)]Σt )) = σ2 ad trace(E[((Xad) Xad) 1]Σt )) = σ2 ad m trace(E[( 1 m(Xad) Xad) 1]Σt )). We can further express MSPE(ˆγOLS t ) as σ2 ad m trace(E[( 1 m(Xad) Xad) 1]Σt ) =σ2 ad m trace((ΠSinv + ΠSres)E[( 1 m(Xad) Xad) 1](ΠSinv + ΠSres) Var((ΠSinv + ΠSres)Xt )) =σ2 ad m trace(ΠSinv E[( 1 m(Xad) Xad) 1]ΠSinv Var(ΠSinv Xt )) + σ2 ad m trace(ΠSres E[( 1 m(Xad) Xad) 1]ΠSres Var(ΠSres Xt )). In particular, the second term in the above sum also appears in MSPE(ˆγISD t ). Taking now the difference between the MSPE of ˆγOLS t and ˆγISD t , we obtain that MSPE(γOLS t ) MSPE(γISD t ) = σ2 ad m trace(ΠSinv E[( 1 m(Xad) Xad) 1]ΠSinv Var(ΠSinv Xt )) trace(E[Var(ˆβinv | X)] Var(ΠSinv Xt )) We have already obtained an upper bound for the second term in the difference, namely σ2 ϵ,max dim(Sinv) n Cinv. For the first term, we can make the same considerations made above for σ2 ad m trace(ΠSres E[( 1 m(Xad) Xad) 1]ΠSres Var(ΠSres Xt )) and hres(m). In particular, σ2 ad m trace(ΠSinv E[( 1 m(Xad) Xad) 1]ΠSinv Var(ΠSinv Xt )) = σ2 ad m had,inv(m) Invariant Subspace Decomposition where had,inv(m) := trace(ΠSinv E[( 1 m(Xad) Xad) 1]ΠSinv Var(ΠSinv Xt )) satisfies had,inv(m) dim(Sinv) and had,inv(m) c 1λmax(Σt ) dim(Sinv) (as for hres(m), we observe that limm had,inv(m) = dim(Sinv)). Therefore, we obtain that MSPE(γOLS t ) MSPE(γISD t ) σ2 ad dim(Sinv) m σ2 ϵ,max dim(Sinv) and MSPE(γOLS t ) MSPE(γISD t ) σ2 ad dim(Sinv) m Cad,inv σ2 ϵ,min dim(Sinv) where Cad,inv := c 1λmax(Σt ). In particular, this difference is always positive if n is sufficiently large. 
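The qualitative conclusion of these bounds (the MSPE gap between OLS and ISD is positive once n is large relative to m) can be illustrated with a small simulation. The following sketch is not the paper's code and makes simplifying, hypothetical choices: identity covariance, one invariant and one residual coordinate (so the decorrelating partition is the canonical basis), and arbitrary values for n, m and the residual coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, reps = 2000, 20, 300     # long history, short adaptation window
sigma = 1.0                     # noise standard deviation
beta_inv = 1.5                  # invariant coefficient (first coordinate)

mse_ols, mse_isd = 0.0, 0.0
for _ in range(reps):
    # Historical data: second coefficient varies smoothly over time.
    X = rng.normal(size=(n, 2))
    t_hist = np.arange(n)
    Y = X[:, 0] * beta_inv + X[:, 1] * np.sin(t_hist / 50.0) \
        + sigma * rng.normal(size=n)

    # Adaptation window at t*: residual coefficient frozen at d_star.
    d_star = 0.7
    Xad = rng.normal(size=(m, 2))
    Yad = Xad[:, 0] * beta_inv + Xad[:, 1] * d_star + sigma * rng.normal(size=m)

    # ISD-style estimate: invariant part from all history (coordinates are
    # decorrelated, so a marginal regression suffices), residual part from
    # the adaptation window only.
    b_inv = (X[:, 0] @ Y) / (X[:, 0] @ X[:, 0])
    d_res = (Xad[:, 1] @ (Yad - Xad[:, 0] * b_inv)) / (Xad[:, 1] @ Xad[:, 1])
    g_isd = np.array([b_inv, d_res])

    # Plain OLS fitted on the adaptation window only.
    g_ols, *_ = np.linalg.lstsq(Xad, Yad, rcond=None)

    g0 = np.array([beta_inv, d_star])
    # With identity covariance, parameter MSE equals the out-of-sample MSPE.
    mse_isd += np.sum((g_isd - g0) ** 2)
    mse_ols += np.sum((g_ols - g0) ** 2)

print(mse_isd / reps, mse_ols / reps)
```

In this regime the ISD-style estimator pays roughly σ²/m for the single residual coordinate, while OLS pays roughly σ²/m per coordinate, matching the dim(Sinv)·σ²/m leading term of the difference derived above.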
Finally, we can observe that the expected explained variance (24) for ˆγ, where ˆγ = ˆγISD_t* or ˆγ = ˆγOLS_t*, is

E[Vart*(ˆγ)] = E[Var(Yt*) − Var(Yt* − Xt*⊤ ˆγ | ˆγ)] = Var(Xt*⊤ γ0,t* + εt*) − E[Var(Xt*⊤ (γ0,t* − ˆγ) + εt* | ˆγ)] = γ0,t*⊤ Var(Xt*) γ0,t* + σ²ε − E[(γ0,t* − ˆγ)⊤ Var(Xt*)(γ0,t* − ˆγ)] − σ²ε = γ0,t*⊤ Var(Xt*) γ0,t* − E[trace((γ0,t* − ˆγ)(γ0,t* − ˆγ)⊤ Var(Xt*))] = γ0,t*⊤ Σt* γ0,t* − MSPE(ˆγ),

and therefore the same inequalities found for the difference of the MSPEs equivalently hold for E[Vart*(ˆγISD_t*)] − E[Vart*(ˆγOLS_t*)].

E.6 Proof of Proposition 3

Proof (i) By definition of a (non-orthogonal) joint block diagonalizer, it holds for all t ∈ [n] that the matrix Σ̃t := U⊤ Σt U is block diagonal with q diagonal blocks Σ̃t,j := (USj)⊤ Σt USj, j ∈ {1, . . . , q}. Define now the matrix W := U^{−⊤} and observe that the following relations hold: Σt = W Σ̃t W⊤, Σt⁻¹ = U Σ̃t⁻¹ U⊤, and Σ̃t⁻¹ = W⊤ Σt⁻¹ W is block diagonal with blocks Σ̃t,j⁻¹ = (WSj)⊤ Σt⁻¹ WSj. The last relation is obtained by observing that

diag(Σ̃t,1⁻¹, . . . , Σ̃t,q⁻¹) = [(WS1)⊤; . . . ; (WSq)⊤] Σt⁻¹ [WS1, . . . , WSq].

Moreover, Σ̃t,j⁻¹ = Var((USj)⊤ Xt)⁻¹. We observe that, because W⊤U = Ip, it holds for all i, j ∈ {1, . . . , q} with i ≠ j that (WSj)⊤ USi = 0, that is, the space spanned by the columns of WSj is orthogonal to the space spanned by the columns of USi. Moreover, the matrix USj(WSj)⊤ is an oblique projection matrix onto Sj along ⊕_{i∈{1,...,q}\{j}} Si. Fix j ∈ {1, . . . , q}. Then, using these relations we obtain that

PSj|S−j γ0,t = USj(WSj)⊤ Σt⁻¹ Cov(Xt, Yt) = USj(WSj)⊤ Σt⁻¹ W U⊤ Cov(Xt, Yt) = USj(WSj)⊤ Σt⁻¹ [WS1, . . . , WSq] [(US1)⊤; . . . ; (USq)⊤] Cov(Xt, Yt) = USj(WSj)⊤ Σt⁻¹ WSj(USj)⊤ Cov(Xt, Yt) = USj Σ̃t,j⁻¹ Cov((USj)⊤ Xt, Yt) = USj Var((USj)⊤ Xt)⁻¹ Cov((USj)⊤ Xt, Yt).

The fourth equality follows from the block diagonal structure of W⊤ Σt⁻¹ W, which implies that, for all i ≠ j, (WSi)⊤ Σt⁻¹ WSj = 0.
(ii) By definition of βinv, using point (i) of this proposition and the fact that {Sinv, Sres} forms a (Xt)_{t∈[n]}-decorrelating partition, we obtain that, for all t ∈ [n], βinv = Uinv Var((Uinv)⊤ Xt)⁻¹ (Uinv)⊤ Cov(Xt, Yt). We now observe that β̃inv := Var((Uinv)⊤ Xt)⁻¹ (Uinv)⊤ Cov(Xt, Yt) is the OLS solution of regressing Yt onto (Uinv)⊤ Xt (as we assume all variables have zero mean). Moreover, this quantity is constant over t by proj-invariance of Sinv. In particular, this means that β̃inv = argmin_{β∈R^{dim(Sinv)}} (1/n) Σ^n_{t=1} E[(Yt − Xt⊤ Uinv β)²], which implies that β̃inv = ((Uinv)⊤ V̄ar(X) Uinv)⁻¹ (Uinv)⊤ C̄ov(X, Y). Therefore, βinv = Uinv((Uinv)⊤ V̄ar(X) Uinv)⁻¹ (Uinv)⊤ C̄ov(X, Y). (iii) The claim follows from the definition of δres_t, from the fact that {Sinv, Sres} forms a (Xt)_{t∈[n]}-decorrelating partition, and from point (i) of this proposition.
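As a final numerical sanity check (illustrative only, assuming numpy; the matrices below are randomly generated rather than taken from the paper), the identity in Proposition 3 (i) can be verified directly by constructing covariances that a non-orthogonal U block-diagonalizes:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
blocks = [[0, 1], [2, 3]]

U = rng.normal(size=(p, p))      # generic non-orthogonal joint block diagonalizer
W = np.linalg.inv(U).T           # W := U^{-T}

# Build a covariance that U block-diagonalizes: Sigma = W D W^T with D block
# diagonal, so that U^T Sigma U = D.
D = np.zeros((p, p))
for S in blocks:
    A = rng.normal(size=(len(S), len(S)))
    D[np.ix_(S, S)] = A @ A.T + len(S) * np.eye(len(S))  # positive definite block
Sigma = W @ D @ W.T

gamma0 = rng.normal(size=p)
cov_xy = Sigma @ gamma0          # Cov(X, Y) for Y = X^T gamma0 + independent noise

for S in blocks:
    US, WS = U[:, S], W[:, S]
    # Left side: oblique projection of gamma0 onto S_j along the other subspace.
    lhs = US @ WS.T @ gamma0
    # Right side: the regression expression (26) from Proposition 3 (i).
    rhs = US @ np.linalg.solve(US.T @ Sigma @ US, US.T @ cov_xy)
    assert np.allclose(lhs, rhs)
print("Proposition 3 (i) verified numerically")
```

The check works because (WSj)⊤U selects the j-th block row of the identity, so the oblique projection picks out exactly the j-th block of the block-diagonal system, which is what the reduced regression on (USj)⊤Xt solves.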