# Invariant Subspace Decomposition

Journal of Machine Learning Research 26 (2025) 1-56. Submitted 5/24; Revised 2/25; Published 4/25.

**Margherita Lazzaretto** (mala@math.ku.dk), Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
**Jonas Peters** (jonas.peters@stat.math.ethz.ch), Seminar for Statistics, ETH Zürich, Zürich, Switzerland
**Niklas Pfister** (np@math.ku.dk), Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark

**Editor:** Aapo Hyvärinen

**Abstract.** We consider the task of predicting a response Y from a set of covariates X in settings where the conditional distribution of Y given X changes over time. For this to be feasible, assumptions on how the conditional distribution changes over time are required. Existing approaches assume, for example, that changes occur smoothly over time so that short-term prediction using only the recent past becomes feasible. To additionally exploit observations further in the past, we propose a novel invariance-based framework for linear conditionals, called Invariant Subspace Decomposition (ISD), that splits the conditional distribution into a time-invariant and a residual time-dependent component. As we show, this decomposition can be employed both for zero-shot and time-adaptation prediction tasks, that is, settings where either no or a small amount of training data is available at the time points we want to predict Y at, respectively. We propose a practical estimation procedure, which automatically infers the decomposition using tools from approximate joint matrix diagonalization. Furthermore, we provide finite sample guarantees for the proposed estimator and demonstrate empirically that it indeed improves on approaches that do not use the additional invariant structure.

**Keywords:** invariance, distribution shift, time adaptation, zero-shot learning, joint block diagonalization
## 1. Introduction

Many commonly studied systems evolve with time, giving rise to heterogeneous data. Changes over time in the data generating process lead to shifts in the observed data distribution and make it, in general, impossible to find a fixed model that consistently describes the system over time.*

\* NP is now at Lakera in Zürich, but most of this work was done while NP was at the University of Copenhagen. © 2025 Margherita Lazzaretto, Jonas Peters, Niklas Pfister. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0699.html.

In this work, we analyze the problem of estimating the relation between a response $Y_t$ and a set of covariates (or predictors) $X_t$, $t \in \mathbb{N}$, both of which are observed over some period of time, with the goal of predicting an unobserved response when new observations of the covariates become available. Two types of distribution shifts across time can arise: (i) variations in the mechanism relating the response to the covariates and (ii) variations in the covariate distribution. Under distribution shifts, successfully predicting an unobserved response from covariates requires assumptions on how the underlying data generating model changes over time. In regression settings, varying-coefficient models (e.g., Hastie and Tibshirani, 1993; Fan and Zhang, 2008) are a common approach to dealing with dynamic system behaviours caused by changes in the functional relationship between a response and some covariates. In these works, the model coefficients are usually assumed to change smoothly, and smoothing methods are used in their estimation. Another approach for online estimation of smoothly changing parameters uses state-space models and filtering solutions (see, for example, Durbin and Koopman, 2012).
In general, smooth changes or other kinds of structured variations, such as step-wise changes, allow us to learn a regression function using time points from the recent past. To exploit information further in the past, one can look for invariant patterns that persist through time. In this spirit, Meinshausen and Bühlmann (2015) define the maximin effect as a worst-case optimal model that maintains good predictive properties at all observed times. Distribution shifts in time-series models are conceptually similar to distribution shifts across domains. In both cases, a key problem is to find invariant parts of the observed data distribution and transfer them to the new domain or time point. Unsupervised domain adaptation methods aim to predict an unobserved response in the target domain; for the problem to be tractable, they rely, for example, on the invariance of the conditional distribution of the response given the covariates (Sun et al., 2017; Zhao et al., 2019) or on independently changing factors of the joint distribution of the covariates and the response (Zhang et al., 2015; Stojanov et al., 2021). Solutions to the unsupervised domain adaptation problem are often based on aligning the source and target domains by minimizing the discrepancy either between the covariate distributions (Zhang et al., 2015) or between the covariates' second-order moments (Sun et al., 2017), or by finding low-dimensional invariant transformations of the variables whose distributions match in the domains of interest (Zhao et al., 2019; Stojanov et al., 2021). Other unsupervised domain adaptation approaches, such as the one proposed by Bousmalis et al. (2016), rely on learning disentangled representations, where disentangled means that changes in one representation do not affect the remaining ones.
This approach is also explored in non-stationary settings, when changes in the covariate distribution happen over time instead of across different domains, for example in the works by Yao et al. (2022); Li et al. (2024). The goal of these works is not prediction, but rather the identification of generalizable latent states that could be used for arbitrary downstream tasks. The field of causality also widely explores the concept of invariant and independently changing mechanisms. In causal models, changes in the covariate distribution are modeled as interventions, and different interventional settings represent different environments. In this context, Peters et al. (2016) propose to look for a set of stable causal predictors, that is, a set of covariates for which the conditional distribution of the response given these covariates remains invariant across different environments. The same idea is extended by Pfister et al. (2019a) to a sequential setting in which the different environments are implicitly generated by changes of the covariate distribution over time. Such invariances can then be used for prediction in previously unseen environments, too (e.g., Rojas-Carulla et al., 2018; Magliacane et al., 2018; Pfister et al., 2021). Other works on causal discovery from heterogeneous or nonstationary data, such as those by Huang et al. (2020); Günther et al. (2023), study settings in which the skeleton of the causal graph remains invariant across different contexts or time points, but the causal mechanisms are allowed to change. Their goal is then to identify the causal graph on observed covariates. To detect whether a mechanism changes, the context or time is included as an additional variable in the graph. As the context or time can be assumed to be exogenous, this often leads to additional identification of causal edges in the graphs. Feng et al. (2022) develop and apply similar ideas in a reinforcement learning setting.
Their interest is in learning the causal graph and improving the efficiency of non-stationarity adaptation by disentangling the system variations as latent parameters. In this work, we build on invariance ideas both from the maximin framework proposed by Meinshausen and Bühlmann (2015) and from the covariate shift literature, and exploit them for time adaptation. We propose an invariance-based framework, which we call invariant subspace decomposition (ISD), to estimate time-varying regression models in which both the covariate distribution and its relationship with a response can change. In particular, we consider a sequence of independent random vectors $(X_t, Y_t)_{t \in \mathbb{N}} \in \mathbb{R}^p \times \mathbb{R}$ satisfying for all time points $t \in \mathbb{N}$ a linear model of the form

$$Y_t = X_t^\top \gamma_{0,t} + \epsilon_t \tag{1}$$

with $\mathbb{E}[\epsilon_t \mid X_t] = 0$, where $\gamma_{0,t}$ is the (unknown) true time-varying parameter. We allow the covariance matrices $\mathrm{Var}(X_t)$ to change over time, assuming that they are approximately constant in small time windows. Regarding changes in $\gamma_{0,t}$, we only need to assume smoothness during test time (if there is more than one test point); we provide more details later in Remark 2. Having observed $X_t$ and $Y_t$ up until some time point $n$, our goal is to learn $\gamma_{0,t^\ast}$ for some $t^\ast > n$, which we then use to predict $Y_{t^\ast}$ from $X_{t^\ast}$. We propose to do so by considering the explained variance, which, for all $t$, is given for some function $f$ by

$$\mathrm{Var}_t(f) := \mathrm{Var}(Y_t) - \mathrm{Var}(Y_t - f(X_t)).$$

Using the explained variance as the objective function provides an intuitive evaluation of the predictive quality of a function $f$: negative explained variance indicates in particular that using $f$ for prediction is harmful, in that it is worse than the best constant function. Under model (1), $f$ is a linear function fully characterized by a parameter $\beta \in \mathbb{R}^p$, i.e., $f(X_t) = X_t^\top \beta$, and the true time-varying parameter can always be expressed as

$$\gamma_{0,t} = \operatorname*{arg\,max}_{\beta \in \mathbb{R}^p} \mathrm{Var}_t(\beta). \tag{2}$$
Key to ISD is the decomposition of the explained variance maximization into a time-invariant and a time-dependent part: we show in Theorem 1 that, under some assumptions, for all $t \in \mathbb{N}$, $\gamma_{0,t}$ can be expressed as

$$\gamma_{0,t} = \underbrace{\operatorname*{arg\,max}_{\beta \in S^{\mathrm{inv}}} \overline{\mathrm{Var}}(\beta)}_{=:\,\beta^{\mathrm{inv}}} + \underbrace{\operatorname*{arg\,max}_{\beta \in S^{\mathrm{res}}} \mathrm{Var}_t(\beta)}_{=:\,\delta^{\mathrm{res}}_t} \tag{3}$$

where $\overline{\mathrm{Var}}(\beta) := \frac{1}{n} \sum_{t=1}^{n} \mathrm{Var}_t(\beta)$, and $S^{\mathrm{inv}}, S^{\mathrm{res}} \subseteq \mathbb{R}^p$ are two orthogonal linear subspaces, with $S^{\mathrm{inv}} \oplus S^{\mathrm{res}} = \mathbb{R}^p$, assumed to be uniquely determined by the distribution of $(X_1, Y_1), \ldots, (X_n, Y_n)$. The decomposition of $\mathbb{R}^p$ into $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ from observed data is achieved by exploiting ideas from independent subspace analysis (e.g., Gutch and Theis, 2012). Unlike works on latent representations, we do not reduce the overall dimensionality of the problem but partition the observed space into two lower-dimensional subspaces, $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$. This in particular implies that the two sub-problems in (3) each have fewer degrees of freedom than the original one. The first sub-problem can be solved using all available heterogeneous observations, which we call historical data: its solution $\beta^{\mathrm{inv}}$ is a time-invariant linear parameter that partially describes the dependence of $Y$ on $X$ at all times. The second sub-problem is a time adaptation problem that tunes the invariant component to a time point of interest: its solution $\delta^{\mathrm{res}}_t$ explains the residuals $Y_t - X_t^\top \beta^{\mathrm{inv}}$; the sum $\beta^{\mathrm{inv}} + \delta^{\mathrm{res}}_t$ gives an estimator for the time-varying linear parameter of interest $\gamma_{0,t}$. In order to estimate the residual component $\delta^{\mathrm{res}}_t$, we assume the model is approximately stationary in a small time window preceding $t$ and use this local subset of the data, which we call adaptation data, for estimation. We distinguish two tasks: (i) the zero-shot task, where adaptation data is not available and we approximate $\gamma_{0,t}$ by $\beta^{\mathrm{inv}}$, and (ii) the time-adaptation task, where adaptation data is available and we solve both sub-problems to approximate $\gamma_{0,t}$ by $\beta^{\mathrm{inv}} + \delta^{\mathrm{res}}_t$.
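To make the objective concrete, the following self-contained sketch (synthetic data; the helper name `explained_variance` is ours, not the paper's) verifies that the maximizer of the explained variance at a fixed time point is the OLS-type parameter $\mathrm{Var}(X_t)^{-1}\mathrm{Cov}(X_t, Y_t)$ from (2):

```python
import numpy as np

rng = np.random.default_rng(0)

def explained_variance(X, Y, beta):
    """Sample analogue of Var_t(beta) = Var(Y_t) - Var(Y_t - X_t' beta)."""
    return np.var(Y) - np.var(Y - X @ beta)

# Data from the linear model (1) at a single fixed time point t.
n, p = 5000, 3
gamma_t = np.array([1.0, -0.5, 2.0])   # plays the role of gamma_{0,t}
X = rng.normal(size=(n, p))
Y = X @ gamma_t + 0.1 * rng.normal(size=n)

# Maximizer of the explained variance: Var(X_t)^{-1} Cov(X_t, Y_t).
Sigma_hat = np.cov(X, rowvar=False)
cov_xy_hat = np.cov(X.T, Y)[:p, p]
beta_hat = np.linalg.solve(Sigma_hat, cov_xy_hat)

# beta_hat recovers gamma_t and explains more variance than, e.g., zero.
assert np.allclose(beta_hat, gamma_t, atol=0.05)
assert explained_variance(X, Y, beta_hat) >= explained_variance(X, Y, np.zeros(p))
```

The same computation applied per time window is the rolling window OLS baseline that ISD is compared against in Figures 1 and 2.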
A two-dimensional example of γ0,t and its ISD estimates is shown in Figure 1. The same example is presented again in more detail in Example 1 in Section 2.1 below and used as a running example throughout the paper. The fundamental assumption we make to apply the ISD framework to new data (Assumption 2) is that the decomposition inferred from available observations generalizes to unseen time points. This guarantees that the estimated invariant component can be meaningfully used for prediction at all new time points, either directly or as part of a two-step estimation that is fine-tuned by solving the time adaptation sub-problem. The ISD framework allows us to improve both on naive methods that directly maximize the explained variance on $\mathbb{R}^p$ using only the most recent available observations and on methods focusing on invariance across environments or time points, such as the maximin approach by Meinshausen and Bühlmann (2015), which do not account for time adaptation. We show in particular in Theorem 3 that isolating an invariant component and reducing the dimensionality of the adaptation problem guarantees lower prediction error or, equivalently, higher explained variance compared to naive methods. An example is provided for a simulated setting in Figure 2, which shows on the left the average explained variance of the invariant component computed at training time on historical data, and on the right the cumulative explained variance of the invariant component (zero-shot task) and the adapted invariant component (time-adaptation task) at testing time, when new observations become available. For the zero-shot task, we compare the ISD invariant component with the standard OLS solution, computed on all training observations, and with the maximin effect ($\hat\beta^{\mathrm{mm}}$) by Meinshausen and Bühlmann (2015), which maximizes the worst-case explained variance over the available observations.
For the time-adaptation task, we compare the ISD with the standard OLS solution computed on a rolling window over the latest new observations. For both tasks, the ISD estimates achieve higher cumulative explained variance at new time points.

Figure 1: Example of a two-dimensional true parameter γ0,t varying on a one-dimensional subspace of $\mathbb{R}^2$, and its estimates using ISD (right) compared to rolling window OLS (left). Time is visually encoded using a color map. (Left) True parameter γ0,t (hexagons) on 350 test points and OLS estimates $\hat\gamma^{\mathrm{OLS}}_t$ based on rolling windows of size 16. (Right) Same test data and true parameters γ0,t, but now we additionally use 1000 prior time points as historical data (not shown) to estimate the decomposition of $\mathbb{R}^2$ into the orthogonal subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ (dashed lines). Next, we estimate $\beta^{\mathrm{inv}}$ using the historical data and $\hat S^{\mathrm{inv}}$. Then, using the same rolling windows as in the left plot as adaptation data, we estimate $\delta^{\mathrm{res}}_t$ using $\hat S^{\mathrm{res}}$. The ISD estimates are then given by $\hat\gamma^{\mathrm{ISD}}_t = \hat\beta^{\mathrm{inv}} + \hat\delta^{\mathrm{res}}_t$. All details on the generative model are provided in Example 1. The subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ do not need to be axis-aligned, so ISD is applicable even in cases where the conditional of $Y_t$ given $X_t$ and all conditionals of $Y_t$ given subsets of $X_t$ vary over time.

The remainder of the paper is structured as follows. In Section 2 we introduce the ISD framework, starting from the identification of the invariant and residual subspaces for orthogonal covariate transformations and showing the explained variance separation (3) in Theorem 1.
In Sections 2.2 and 2.3 we define population estimators for the invariant and residual components, the solutions to the two sub-problems in (3). We then describe, in Section 3, the two tasks that ISD solves, namely zero-shot prediction and time adaptation, and provide a characterization of the invariant component as a worst-case optimal parameter. In Section 4 we propose an estimation method for ISD and provide finite sample generalization guarantees for the proposed estimator. Finally, in Section 5, we illustrate ISD based on numerical experiments, both on simulated and on real-world data, and validate the theoretical results presented in the paper.

Figure 2: For the data-generating model in Section 5.2, we plot (left) the average explained variance (distribution over 20 runs) obtained at training time (historical data) and (right) the cumulative explained variance $\sum_{r=1}^{t} \mathrm{Var}_r(\hat\beta)$ obtained at testing time (adaptation data), in one of the 20 runs; in this example, the time-varying components in the historical and adaptation data have disjoint support. The example considers $p = 10$-dimensional predictors and an invariant component of dimension 7. As baselines, we use (i) the true time-varying parameter γ0,t, which maximizes the explained variance at all observed time points $t$, and (ii) the oracle invariant component $\beta^{\mathrm{inv}}$. 6000 historical observations are used to estimate (iii) the invariant component $\hat\beta^{\mathrm{inv}}$ of the ISD framework, (iv) the OLS solution $\hat\beta^{\mathrm{OLS}}$, and (v) the maximin effect $\hat\beta^{\mathrm{mm}}$.
Starting from $t = 0$ after the observed history, windows of length $3p$ are used to estimate (vi) the adaptation parameter $\hat\delta^{\mathrm{res}}_t$ for $\hat\beta^{\mathrm{inv}}$, yielding the ISD estimate $\hat\gamma^{\mathrm{ISD}}_t$, and (vii) the rolling window OLS solution $\hat\gamma^{\mathrm{OLS}}_t$. While at training time on historical data the ISD invariant component $\hat\beta^{\mathrm{inv}}$ is the most conservative, with the lowest average explained variance, after a distribution shift (adaptation data) the same component can explain more variance than other methods based on historical data only ($\hat\beta^{\mathrm{OLS}}$, $\hat\beta^{\mathrm{mm}}$), and can be tuned to new time points to improve on estimators based on adaptation data only ($\hat\gamma^{\mathrm{OLS}}_t$).

## 2. Invariant subspace decomposition

We formalize the setup described in the introduction in the following setting.

**Setting 1** Let $(X_t, Y_t)_{t \in \mathbb{N}} \in \mathbb{R}^p \times \mathbb{R}$ be a sequence of independent random vectors satisfying for all $t \in \mathbb{N}$ a linear model as in (1), with $\gamma_{0,t} \in \mathbb{R}^p$ and $\mathbb{E}[\epsilon_t \mid X_t] = 0$. Assume that for all $t \in \mathbb{N}$ the covariance matrix of the predictors $\Sigma_t := \mathrm{Var}(X_t)$ is strictly positive definite. Moreover, for all $n \in \mathbb{N}$, let $[n] := \{1, \ldots, n\}$ and assume we observe $n$ observations $(X_t, Y_t)_{t \in [n]}$ from model (1), which we call historical data. Additionally, let $I^{\mathrm{ad}} \subseteq \mathbb{N}$ be an interval of consecutive time points with $m := |I^{\mathrm{ad}}| > p$ and $\min_{t \in I^{\mathrm{ad}}} t > n$. Assume we observe a second set of observations $(X_t, Y_t)_{t \in I^{\mathrm{ad}}}$ from the same model but succeeding the historical data, which we call adaptation data. Finally, denote by $t^\ast > \max_{t \in I^{\mathrm{ad}}} t$ a time point of interest that occurs after the adaptation data and assume that, for all $t \in I^{\mathrm{ad}} \cup \{t^\ast\}$, the quantities $\gamma_{0,t}$, $\Sigma_t$ and $\mathrm{Var}(\epsilon_t)$ in model (1) are constant (in practice, being approximately constant is sufficient).
Figure 3: Illustration of the historical data $(X_t, Y_t)_{t \in [n]}$ (time points $1, 2, \ldots, n$) and the adaptation data $(X^{\mathrm{ad}}_t, Y^{\mathrm{ad}}_t)_{t \in I^{\mathrm{ad}}}$ (time points $t^\ast - m, \ldots, t^\ast - 1$), together with the two tasks. Zero-shot task: observe $(X_t, Y_t)_{t \in [n]}$ and $X_{t^\ast}$, predict $Y_{t^\ast}$. Adaptation task: observe $(X_t, Y_t)_{t \in [n]}$, $(X^{\mathrm{ad}}_t, Y^{\mathrm{ad}}_t)_{t \in I^{\mathrm{ad}}}$ and $X_{t^\ast}$, predict $Y_{t^\ast}$.

Our goal is to predict $Y_{t^\ast}$ from $X_{t^\ast}$. A naive solution is to only consider the adaptation data in $I^{\mathrm{ad}}$ (see also the analysis in Section 3); we aim to improve on this approach by additionally exploiting the historical data in $[n]$. In particular, depending on whether we only have access to the historical data or additionally have access to the data in $I^{\mathrm{ad}}$, we can solve (i) the zero-shot task or (ii) the adaptation task, respectively. The full setup, including the different tasks, is visualized in Figure 3. The two tasks correspond to the two sub-problems in (3): the solution to (i) is a time-invariant parameter $\beta^{\mathrm{inv}} \in \mathbb{R}^p$, defined formally below in Section 2.2; the solution to (ii) is an adaptation parameter $\delta^{\mathrm{res}}_t$, formally defined in Section 2.3, that satisfies $\beta^{\mathrm{inv}} + \delta^{\mathrm{res}}_t = \gamma_{0,t}$. We start by defining what a time-invariant parameter is. To do so, we first consider the explained variance for a parameter $\beta \in \mathbb{R}^p$ at time $t \in \mathbb{N}$, defined by $\mathrm{Var}_t(\beta) := \mathrm{Var}(Y_t) - \mathrm{Var}(Y_t - X_t^\top \beta)$. Under model (1), the explained variance can be equivalently expressed as

$$\mathrm{Var}_t(\beta) = 2 \operatorname{Cov}(Y_t, X_t^\top \beta) - \mathrm{Var}(X_t^\top \beta) = 2 \operatorname{Cov}(Y_t - X_t^\top \beta,\, X_t^\top \beta) + \mathrm{Var}(X_t^\top \beta). \tag{4}$$

A desirable property for a non-varying parameter $\beta \in \mathbb{R}^p$ is that its explained variance remains non-negative at all time points. We call a parameter $\beta \in \mathbb{R}^p$ never harmful if, for all $t \in \mathbb{N}$, $\mathrm{Var}_t(\beta) \geq 0$.
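Both equalities in (4) are algebraic identities and also hold exactly for empirical moments computed with a consistent normalization; a quick self-contained check (the data and the `var`/`cov` helpers are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated covariates
Y = X @ rng.normal(size=p) + rng.normal(size=n)
beta = rng.normal(size=p)
Z = X @ beta                                           # fitted values X_t' beta

def var(a):
    # sample variance with ddof=0, used consistently below
    return np.mean((a - a.mean()) ** 2)

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

lhs = var(Y) - var(Y - Z)            # definition of Var_t(beta)
mid = 2 * cov(Y, Z) - var(Z)         # first expression in (4)
rhs = 2 * cov(Y - Z, Z) + var(Z)     # second expression in (4)

assert np.isclose(lhs, mid) and np.isclose(mid, rhs)
```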
Intuitively, parameters with this property at least partially explain $Y_t$ at all time points $t \in \mathbb{N}$, and are therefore meaningful to use for prediction.¹ In this paper, we consider the subset of never harmful parameters $\beta \in \mathbb{R}^p$ that satisfy $\mathrm{Var}_t(\beta) = \mathrm{Var}(X_t^\top \beta)$ or, equivalently, using (4), $\operatorname{Cov}(Y_t - X_t^\top \beta, X_t^\top \beta) = 0$; thus, the explained variance of these parameters reflects changes of $\mathrm{Var}(X_t)$ (over $t$) but is independent of changes in $\gamma_{0,t}$.

**Definition 1 (time-invariance)** We call a parameter $\beta \in \mathbb{R}^p$ time-invariant (over $[n]$) if for all $t \in [n]$,

$$\operatorname{Cov}(Y_t - X_t^\top \beta,\, X_t^\top \beta) = 0. \tag{5}$$

¹ Under an assumption similar to Assumption 2 below, the maximin framework by Meinshausen and Bühlmann (2015) allows us to estimate never harmful parameters; we provide a more detailed comparison between our approach and the maximin in Remark 3.

We use this definition, in Section 2.1, to define the subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ that allow us to obtain the separation of $\gamma_{0,t}$ as in (3). In particular, as we show in Proposition 2, the definition of the invariant subspace $S^{\mathrm{inv}}$ guarantees that the invariant component $\beta^{\mathrm{inv}}$ maximizing the pooled explained variance over $S^{\mathrm{inv}}$ is always a time-invariant parameter according to Definition 1. While we consider the explained variance here, one can equivalently consider the mean squared prediction error (MSPE).

**Remark 1 (exchanging explained variance with MSPE)** Under model (1), maximizing the explained variance at time $t \in [n]$ is equivalent to minimizing the MSPE at $t$, $\mathrm{MSPE}_t(\beta) := \mathbb{E}[(Y_t - X_t^\top \beta)^2]$. More precisely, assuming that all variables have mean zero,

$$\operatorname*{arg\,max}_{\beta \in \mathbb{R}^p} \mathrm{Var}_t(\beta) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \mathrm{Var}(Y_t - X_t^\top \beta) = \mathrm{Var}(X_t)^{-1} \operatorname{Cov}(X_t, Y_t) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \mathrm{MSPE}_t(\beta).$$

The remaining part of this section is organized as follows. Section 2.1 explicitly constructs the spaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ and shows that they can be characterized by joint block diagonalization.
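The following simulated sketch illustrates Definition 1 (the two-block construction and the candidate parameter are hypothetical choices of ours): a parameter supported on a covariate block whose coefficient does not change satisfies (5) at every time point, and its explained variance then reduces to $\mathrm{Var}(X_t^\top \beta)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

def sample_time_point(t):
    # Two uncorrelated covariate blocks whose scales change with t; the
    # coefficient on the second block is constant, the first one varies.
    X = rng.normal(size=(n, 2)) * np.array([1.0 + t, 0.5])
    gamma_t = np.array([np.sin(t), 2.0])
    Y = X @ gamma_t + 0.1 * rng.normal(size=n)
    return X, Y

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

beta = np.array([0.0, 2.0])     # candidate time-invariant parameter
for t in [0, 1, 2]:
    X, Y = sample_time_point(t)
    fitted, resid = X @ beta, Y - X @ beta
    # Definition 1: Cov(Y_t - X_t' beta, X_t' beta) = 0 at every t ...
    assert abs(cov(resid, fitted)) < 0.1
    # ... so the explained variance reduces to Var(X_t' beta):
    assert abs((np.var(Y) - np.var(resid)) - np.var(fitted)) < 0.2
```

Note that a parameter need not be supported on observed coordinates for this to work; Section 2.1 constructs the relevant (possibly rotated) subspaces.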
In Section 2.2 we then analyze the time-invariant part of (3), leading to the optimal time-invariant parameter $\beta^{\mathrm{inv}}$, and show in Proposition 2 (iii) that its solution corresponds to the maximally predictive time-invariant parameter and is an interesting target of inference in its own right. In Section 2.3 we analyze the residual part of (3), leading to the optimal parameter $\delta^{\mathrm{res}}_t$.

### 2.1 Invariant and residual subspaces

We now construct the two linear subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ that allow us to express the true time-varying parameter as the solution to two separate optimizations as in (3). For a linear subspace $S \subseteq \mathbb{R}^p$, denote by $\Pi_S \in \mathbb{R}^{p \times p}$ the orthogonal projection matrix from $\mathbb{R}^p$ onto $S$, that is, a symmetric matrix such that $\Pi_S^2 = \Pi_S$ and, for all vectors $v \in \mathbb{R}^p$, $\Pi_S v \in S$ and $(v - \Pi_S v)^\top \Pi_S v = 0$. We call a collection of pairwise orthogonal linear subspaces $S_1, \ldots, S_q \subseteq \mathbb{R}^p$ with $\bigoplus_{j=1}^q S_j = \mathbb{R}^p$ an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition (of cardinality $q$) if for all $i, j \in \{1, \ldots, q\}$ with $i \neq j$, and for all $t \in [n]$, it holds that

$$\operatorname{Cov}(\Pi_{S_i} X_t, \Pi_{S_j} X_t) = 0. \tag{6}$$

Moreover, we define an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition as irreducible if there is no orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition with strictly larger cardinality. We will construct $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ from an irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition. Given a (not necessarily irreducible) orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition $\{S_j\}_{j=1}^q$, the true time-varying parameter can be expressed as the sum of $q$ orthogonal components, each lying in one subspace of the partition. These components can be expressed in terms of the covariates and of the response at time $t$, as shown by the following lemma.

**Lemma 1** Let $\{S_j\}_{j=1}^q$ be an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition. Then it holds for all $t \in [n]$ that $\gamma_{0,t} = \sum_{j=1}^q \Pi_{S_j} \gamma_{0,t}$ and for all $j \in \{1, \ldots, q\}$ that

$$\Pi_{S_j} \gamma_{0,t} = \mathrm{Var}(\Pi_{S_j} X_t)^\dagger \operatorname{Cov}(\Pi_{S_j} X_t, Y_t), \tag{7}$$

where $(\cdot)^\dagger$ denotes the Moore-Penrose pseudoinverse.
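The projection properties and the decorrelation condition (6) can be sketched numerically (the basis and block sizes below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4

# Orthonormal basis U of R^p; the first two columns span S1, the last
# two span S2 (an arbitrary partition chosen for illustration).
U, _ = np.linalg.qr(rng.normal(size=(p, p)))
P1 = U[:, :2] @ U[:, :2].T     # orthogonal projection onto S1
P2 = U[:, 2:] @ U[:, 2:].T     # orthogonal projection onto S2

assert np.allclose(P1 @ P1, P1) and np.allclose(P1, P1.T)
assert np.allclose(P1 + P2, np.eye(p))        # S1 (+) S2 = R^p

# A covariance matrix that is block diagonal in the U basis (here even
# fully diagonal) makes {S1, S2} a decorrelating partition, cf. (6):
Sigma_t = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ U.T
cross_cov = P1 @ Sigma_t @ P2                 # Cov(P1 X_t, P2 X_t)
assert np.allclose(cross_cov, 0)
```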
Each component is in particular the projection of $\gamma_{0,t}$ onto the corresponding subspace $S_j$, as well as a maximizer of the explained variance in $S_j$, as shown by Lemma 2.

**Lemma 2** Let $N \subseteq \mathbb{N}$ and let $\{S_j\}_{j=1}^q$ be an orthogonal and $(X_t)_{t \in N}$-decorrelating partition.² Then it holds for all $j \in \{1, \ldots, q\}$ and for all $t \in N$ that

$$\operatorname*{arg\,max}_{\beta \in S_j} \mathrm{Var}_t(\beta) = \mathrm{Var}(\Pi_{S_j} X_t)^\dagger \operatorname{Cov}(\Pi_{S_j} X_t, Y_t). \tag{8}$$

Combining Lemmas 1 and 2 implies that

$$\gamma_{0,t} = \sum_{j=1}^q \operatorname*{arg\,max}_{\beta \in S_j} \mathrm{Var}_t(\beta). \tag{9}$$

Therefore, any orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition allows us to split the optimization (2) of the explained variance into separate optimizations over the individual subspaces. In order to leverage all available observations in at least some of the optimizations by pooling the explained variance as in (3), we need the optimizer to remain constant over time. More formally, we call a linear subspace $S \subseteq \mathbb{R}^p$ opt-invariant (optimum invariant) on $[n]$ if for all $t, s \in [n]$ it holds that

$$\operatorname*{arg\,max}_{\beta \in S} \mathrm{Var}_t(\beta) = \operatorname*{arg\,max}_{\beta \in S} \mathrm{Var}_s(\beta). \tag{10}$$

For all terms in (9) corresponding to an opt-invariant subspace, we can, by definition of opt-invariance, use all time points in the optimization. For an irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition $\{S_j\}_{j=1}^q$, we therefore define the invariant subspace $S^{\mathrm{inv}}$ and the residual subspace $S^{\mathrm{res}}$ by

$$S^{\mathrm{inv}} := \bigoplus_{\substack{j \in \{1,\ldots,q\}:\\ S_j \text{ opt-invariant on } [n]}} S_j \quad \text{and} \quad S^{\mathrm{res}} := \bigoplus_{\substack{j \in \{1,\ldots,q\}:\\ S_j \text{ not opt-invariant on } [n]}} S_j, \tag{11}$$

respectively. The invariant subspace $S^{\mathrm{inv}}$ is opt-invariant (see Lemma 7). Moreover, by definition, it holds that $S^{\mathrm{res}} = (S^{\mathrm{inv}})^\perp$, where $(\cdot)^\perp$ denotes the orthogonal complement in $\mathbb{R}^p$, and that $\{S^{\mathrm{inv}}, S^{\mathrm{res}}\}$ is an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition (see Lemma 6). To ensure that these two spaces do not depend on the chosen irreducible orthogonal partition, we introduce the following assumption.

**Assumption 1 (uniqueness of the subspace decomposition)** Let $\{S_j\}_{j=1}^q$ and $\{\tilde S_j\}_{j=1}^{\tilde q}$ be two irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partitions.
Then, it holds that

$$\bigoplus_{\substack{j \in \{1,\ldots,q\}:\\ S_j \text{ opt-invariant on } [n]}} S_j = \bigoplus_{\substack{j \in \{1,\ldots,\tilde q\}:\\ \tilde S_j \text{ opt-invariant on } [n]}} \tilde S_j.$$

Assumption 1 is, for example, satisfied if an irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition is unique. As we show in Appendix A.1, uniqueness does not always hold (e.g., if for all $t \in [n]$ the multiplicity of some eigenvalues of $\Sigma_t$ is larger than one and shared across all such matrices). Assumption 1 is, however, a mild assumption; for example, uniqueness of an irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition is satisfied if there exists at least one $t \in [n]$ such that all eigenvalues of $\Sigma_t$, which we have assumed to be non-zero, are distinct (see Lemma 8 and Proposition 1 below). Whenever Assumption 1 holds, the invariant and residual spaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ do not depend on which irreducible orthogonal partition is used in their construction.

² This is defined analogously to an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition.

**Example 1** This example describes the setting used to generate Figure 1. Consider a 2-dimensional covariate $X_t \in \mathbb{R}^2$, and assume that model (1) is defined as follows. We take, for all $t \in [n]$,

$$\gamma_{0,t} = \left(\tfrac{t}{n} - 1.5\right) \begin{pmatrix} \sqrt{3} \\ -1 \end{pmatrix} + \begin{pmatrix} 1 \\ \sqrt{3} \end{pmatrix} \quad \text{and} \quad \Sigma_t = \frac{1}{4} \begin{pmatrix} 3\sigma_{1,t} + \sigma_{2,t} & \sqrt{3}(\sigma_{2,t} - \sigma_{1,t}) \\ \sqrt{3}(\sigma_{2,t} - \sigma_{1,t}) & \sigma_{1,t} + 3\sigma_{2,t} \end{pmatrix},$$

where $\sigma_{1,t}$ and $\sigma_{2,t}$ are two fixed sequences sampled as two independent i.i.d. samples from a uniform distribution on $[0, 1]$. In this example, an irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition is given by the two spaces

$$S_1 = \operatorname{span}\left\{\begin{pmatrix} 0.5\sqrt{3} \\ -0.5 \end{pmatrix}\right\} \quad \text{and} \quad S_2 = \operatorname{span}\left\{\begin{pmatrix} 0.5 \\ 0.5\sqrt{3} \end{pmatrix}\right\}.$$

Indeed, it holds for all $t \in [n]$ that $\operatorname{Cov}(\Pi_{S_1} X_t, \Pi_{S_2} X_t) = \Pi_{S_1} \Sigma_t \Pi_{S_2} = 0_{2 \times 2}$. Moreover, since $\sigma_{1,t}$ and $\sigma_{2,t}$ are the eigenvalues of $\Sigma_t$, the irreducible orthogonal partition defined by $S_1$ and $S_2$ is unique (i.e., Assumption 1 is satisfied by Lemma 8) if $\sigma_{1,t} \neq \sigma_{2,t}$ for some $t \in [n]$. It also holds that $\Pi_{S_2} \gamma_{0,t} = (1, \sqrt{3})^\top$, which does not depend on $t$ (it can be verified that the same does not hold for $\Pi_{S_1} \gamma_{0,t}$).
As shown in Lemmas 1 and 2, it holds that

$$\operatorname*{arg\,max}_{\beta \in S_j} \mathrm{Var}_t(\beta) = \operatorname*{arg\,max}_{\beta \in S_j} \left\{ \mathrm{Var}(Y_t) - \mathrm{Var}(Y_t - X_t^\top \Pi_{S_j} \beta) \right\} = \Pi_{S_j} \gamma_{0,t}.$$

It follows that the subspace $S_2$ is opt-invariant, whereas $S_1$ is not, and therefore $S^{\mathrm{inv}} = S_2$ and $S^{\mathrm{res}} = S_1$. The two spaces $S_1$ and $S_2$ also appear in Figure 1: the true time-varying parameter $\gamma_{0,t}$ does not vary with $t$ in the direction of the vector $(0.5, 0.5\sqrt{3})^\top$ generating $S_2$; $S_1$ can be visualized by connecting the circles. In order for $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ to be useful for prediction on future observations, we assume that the subspace separations we consider remain fixed over time.

**Assumption 2 (generalization)** For all irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partitions, $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ defined in (11) satisfy the following two conditions: (i) for all $t \in \mathbb{N}$ it holds that $\operatorname{Cov}(\Pi_{S^{\mathrm{inv}}} X_t, \Pi_{S^{\mathrm{res}}} X_t) = 0$ and (ii) $S^{\mathrm{inv}}$ is opt-invariant on $\mathbb{N}$.

As shown in the following theorem, Assumption 2 ensures that the two sets satisfy a separation of the form (3) for all observed and unobserved time points $t \in \mathbb{N}$, as desired. For unobserved time points $t \in \mathbb{N} \setminus [n]$, Assumption 2 does not require (6) to hold for all $i, j \in \{1, \ldots, q\}$, but only for all $i, j$ such that $S_i \subseteq S^{\mathrm{inv}}$ and $S_j \subseteq S^{\mathrm{res}}$.

**Theorem 1** Assume Assumption 2 is satisfied. Let $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ be defined as in (11) for an arbitrary irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition. Then, it holds for all $t \in \mathbb{N}$ that

$$\gamma_{0,t} = \operatorname*{arg\,max}_{\beta \in S^{\mathrm{inv}}} \overline{\mathrm{Var}}(\beta) + \operatorname*{arg\,max}_{\beta \in S^{\mathrm{res}}} \mathrm{Var}_t(\beta), \tag{12}$$

where $\overline{\mathrm{Var}}(\beta) := \frac{1}{n} \sum_{t=1}^{n} \mathrm{Var}_t(\beta)$. Moreover, if Assumption 1 is satisfied, the separation in (12) is independent of the considered irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition.

The proof of Theorem 1 can be found in Appendix E.1 and relies on the fact that, under Assumption 2, the invariant and residual subspaces form an orthogonal and $(X_t)_{t \in \mathbb{N}}$-decorrelating partition.
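Example 1 can also be checked numerically. The constants below follow our reading of the example (the 30-degree rotation and the parameter path are reconstructions and should be checked against the original); `subspace_maximizer` is our helper implementing the population formula (8) via the pseudoinverse:

```python
import numpy as np

sqrt3 = np.sqrt(3.0)
n = 100

# Quantities of Example 1 (our reconstruction): clockwise 30-degree
# rotation U whose columns span S1 and S2, and eigenvalue sequences.
U = np.array([[sqrt3 / 2, 0.5],
              [-0.5, sqrt3 / 2]])
u1, u2 = U[:, 0], U[:, 1]
P1, P2 = np.outer(u1, u1), np.outer(u2, u2)

rng = np.random.default_rng(0)
s1, s2 = rng.uniform(size=n), rng.uniform(size=n)

def gamma0(t):
    return (t / n - 1.5) * np.array([sqrt3, -1.0]) + np.array([1.0, sqrt3])

def subspace_maximizer(P, Sigma_t, gamma_t):
    # argmax_{beta in S} Var_t(beta) = Var(P X_t)^+ Cov(P X_t, Y_t), cf. (8);
    # at the population level this equals P gamma_{0,t} (Lemma 1).
    return np.linalg.pinv(P @ Sigma_t @ P) @ (P @ Sigma_t @ gamma_t)

for t in range(n):
    Sigma_t = U @ np.diag([s1[t], s2[t]]) @ U.T
    # U jointly diagonalizes every Sigma_t:
    assert np.allclose(U.T @ Sigma_t @ U, np.diag([s1[t], s2[t]]))
    # S2 is opt-invariant: its maximizer is constant in t ...
    assert np.allclose(subspace_maximizer(P2, Sigma_t, gamma0(t)),
                       np.array([1.0, sqrt3]))
# ... while the maximizer over S1 changes with t:
g0 = subspace_maximizer(P1, U @ np.diag([s1[0], s2[0]]) @ U.T, gamma0(0))
g1 = subspace_maximizer(P1, U @ np.diag([s1[99], s2[99]]) @ U.T, gamma0(99))
assert not np.allclose(g0, g1)
```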
#### 2.1.1 Identifying invariant and residual subspaces using joint block diagonalization

We can characterize an irreducible orthogonal partition using joint block diagonalization of the set of covariance matrices $(\Sigma_t)_{t \in [n]}$. Joint block diagonalization of $(\Sigma_t)_{t \in [n]}$ consists of finding an orthogonal matrix $U \in \mathbb{R}^{p \times p}$ such that the matrices $\tilde\Sigma_t := U^\top \Sigma_t U$, $t \in [n]$, are block diagonal and we can choose $q^U$ blocks such that the indices of the corresponding submatrices do not change with $t$ (the entries of the blocks may change with $t$, though). Let $q^U_{\max}$ be the largest number of such blocks and let $(\tilde\Sigma_{t,1})_{t \in [n]}, \ldots, (\tilde\Sigma_{t,q^U_{\max}})_{t \in [n]}$ denote the corresponding common blocks with dimensions $p_1, \ldots, p_{q^U_{\max}}$ that are independent of $t$. We call $U$ a joint block diagonalizer of $(\Sigma_t)_{t \in [n]}$. Moreover, we call $U$ an irreducible joint block diagonalizer if, in addition, for all other joint block diagonalizers $\tilde U \in \mathbb{R}^{p \times p}$ the resulting number of common blocks is at most $q^U_{\max}$. Joint block diagonalization has been considered extensively in the literature (see, for example, Murota et al., 2010; Nion, 2011; Tichavsky and Koldovsky, 2012), and various computationally feasible algorithms have been proposed (see Section A.3 for further details). In our setting, joint block diagonalization can be used to identify the invariant and residual subspaces, since an irreducible joint block diagonalizer $U$ of the covariance matrices $(\Sigma_t)_{t \in [n]}$ corresponds to an irreducible orthogonal partition, as the following proposition shows.

**Proposition 1** (i) Let $U \in \mathbb{R}^{p \times p}$ be a joint block diagonalizer of $(\Sigma_t)_{t \in [n]}$. For all $j \in \{1, \ldots, q^U_{\max}\}$, let $\mathcal{S}_j \subseteq \{1, \ldots, p\}$ denote the subset of indices corresponding to the $j$-th common block $\tilde\Sigma_{t,j}$. Moreover, let $u_k$ denote the $k$-th column of $U$ and, for all $j \in \{1, \ldots, q^U_{\max}\}$, define $S_j := \operatorname{span}\{u_k \mid k \in \mathcal{S}_j\}$. Then, $\{S_j\}_{j=1}^{q^U_{\max}}$ is an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition.
Moreover, if the joint block diagonalizer is irreducible, then the corresponding orthogonal partition is irreducible.

(ii) The converse is also true. Let $\{S_j\}_{j=1}^q$ be an orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition. Then, there exists a joint block diagonalizer $U \in \mathbb{R}^{p \times p}$ of $(\Sigma_t)_{t \in [n]}$ such that for all $t \in [n]$ the matrix $\tilde\Sigma_t := U^\top \Sigma_t U$ is block diagonal with $q$ diagonal blocks $\tilde\Sigma_{t,j} = (U_{\mathcal{S}_j})^\top \Sigma_t U_{\mathcal{S}_j}$, $j \in \{1, \ldots, q\}$, of dimension $|\mathcal{S}_j| = \dim(S_j)$, where $\mathcal{S}_j \subseteq \{1, \ldots, p\}$ indexes a subset of the columns of $U$. If the orthogonal partition is irreducible, then $U$ is an irreducible joint block diagonalizer. Moreover, $\Pi_{S_j} = U_{\mathcal{S}_j} (U_{\mathcal{S}_j})^\top$.

If Assumption 1 is satisfied, any irreducible joint block diagonalizer $U$, via its corresponding irreducible orthogonal and $(X_t)_{t \in [n]}$-decorrelating partition constructed in Proposition 1 (i), leads to the same (unique) invariant and residual subspaces defined in (11). It is clear that Assumption 1 is automatically satisfied whenever the joint block diagonalization is unique up to trivial indeterminacies, that is, if for all orthogonal matrices $U, \tilde U \in \mathbb{R}^{p \times p}$ that jointly block diagonalize the set $(\Sigma_t)_{t \in [n]}$ into $q^U_{\max}$ irreducible common blocks, it holds that $\tilde U$ is equal to $U$ up to block permutations and block-wise isometric transformations. Explicit conditions under which uniqueness of joint block diagonalization is satisfied can be found, for example, in the works by De Lathauwer (2008); Murota et al. (2010). Intuitively, these conditions are satisfied whenever there is sufficient variability across time in the covariance matrices $(\Sigma_t)_{t \in [n]}$. Given an irreducible joint block diagonalizer $U$, the invariant and residual subspaces can be identified using Proposition 1 and Lemma 1, by checking for all $j \in \{1, \ldots, q^U_{\max}\}$ whether $\Pi_{S_j} \gamma_{0,t} = U_{\mathcal{S}_j} (U_{\mathcal{S}_j})^\top \gamma_{0,t}$ remains constant ($S_j \subseteq S^{\mathrm{inv}}$) or not ($S_j \subseteq S^{\mathrm{res}}$) on $[n]$.
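As a self-contained illustration of the idea (this is not the estimation algorithm proposed in the paper), in the idealized noise-free case a joint block diagonalizer can be obtained from the eigenvectors of a generic positive combination of the $\Sigma_t$, and the common blocks can then be read off as connected components of the joint off-diagonal support of the rotated matrices:

```python
import numpy as np

rng = np.random.default_rng(6)
p, T = 4, 30

# Ground truth: all Sigma_t share a 2+1+1 common block structure in the
# basis of a hidden orthogonal matrix Q (hypothetical noise-free example).
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))
Sigmas = []
for _ in range(T):
    A = rng.normal(size=(2, 2))
    D = np.zeros((p, p))
    D[:2, :2] = A @ A.T + np.eye(2)          # random 2x2 spd block
    D[2, 2] = rng.uniform(10.0, 11.0)        # two 1x1 blocks, well
    D[3, 3] = rng.uniform(20.0, 21.0)        # separated in scale
    Sigmas.append(Q @ D @ Q.T)

# Step 1: eigenvectors of a generic positive combination of the Sigma_t
# respect the common invariant subspaces.
w = rng.uniform(0.5, 1.0, size=T)
U = np.linalg.eigh(sum(wi * S for wi, S in zip(w, Sigmas)))[1]

# Step 2: common blocks = connected components of the joint off-diagonal
# support of the rotated matrices U' Sigma_t U.
M = sum(np.abs(U.T @ S @ U) for S in Sigmas)
adj = M > 1e-8 * M.max()
n_blocks, labels = 0, -np.ones(p, dtype=int)
for i in range(p):
    if labels[i] < 0:
        stack, labels[i] = [i], n_blocks      # flood fill one component
        while stack:
            j = stack.pop()
            for k in np.flatnonzero(adj[j]):
                if labels[k] < 0:
                    labels[k] = n_blocks
                    stack.append(k)
        n_blocks += 1

assert n_blocks == 3                          # recovers the 2+1+1 structure
```

With sample covariances, the rotated matrices are only approximately block diagonal, which is why the paper relies on approximate joint matrix diagonalization tools instead of a hard threshold.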
We denote by $U^{\mathrm{inv}}$ and $U^{\mathrm{res}}$ the submatrices of $U$ formed by the columns that span the invariant and the residual subspace, respectively.

Example 1 (continued) In Example 1 so far, we expressed $\mathcal{S}_1$ and $\mathcal{S}_2$ in terms of their generating vectors. These can be retrieved using Proposition 1 by jointly block diagonalizing the matrices $(\Sigma_t)_{t\in[n]}$. In this specific example, an irreducible joint block diagonalizer is given by
$$U = \begin{pmatrix} 0.5\sqrt{3} & 0.5 \\ -0.5 & 0.5\sqrt{3} \end{pmatrix},$$
which is a (clockwise) rotation matrix of 30 degrees (see Figure 1 and use $S_1 = \{1\}$ and $S_2 = \{2\}$). In particular, we have that $\tilde\Sigma_t = U^\top \Sigma_t U = \operatorname{diag}(\sigma_{1,t}, \sigma_{2,t})$: therefore, $q^U_{\max} = 2$ and each block has dimension 1. Moreover, it holds for all $t\in[n]$ that
$$\Pi_{\mathcal{S}_1}\gamma_{0,t} = U^{S_1}(U^{S_1})^\top \gamma_{0,t} = \begin{pmatrix} \sqrt{3}\,(1.5 - t/n) \\ t/n - 1.5 \end{pmatrix} \quad\text{and}\quad \Pi_{\mathcal{S}_2}\gamma_{0,t} = U^{S_2}(U^{S_2})^\top \gamma_{0,t} = \begin{pmatrix} 1 \\ \sqrt{3} \end{pmatrix},$$
and therefore $S^{\mathrm{inv}} = \mathcal{S}_2$ and $S^{\mathrm{res}} = \mathcal{S}_1$.

2.2 Invariant component

In Theorem 1 we have shown that the true time-varying parameter $\gamma_{0,t}$ can be expressed as the result of two separate optimization problems over the two orthogonal spaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$. In this section we analyze the result of the first optimization step, over the invariant subspace $S^{\mathrm{inv}}$. To ensure that the space $S^{\mathrm{inv}}$ is unique, we assume throughout Section 2.2 that Assumption 1 is satisfied.

Definition 2 (invariant component) We denote the parameter that maximizes the explained variance over the invariant subspace by
$$\beta^{\mathrm{inv}} := \arg\max_{\beta \in S^{\mathrm{inv}}} \mathrm{Var}(\beta). \quad (13)$$

The parameter $\beta^{\mathrm{inv}}$ corresponds to the pooled OLS solution obtained by regressing $Y_t$ on the projected predictors $\Pi_{S^{\mathrm{inv}}} X_t$, and can be computed using all observations in $[n]$. The whole procedure to find $\beta^{\mathrm{inv}}$, by first identifying $S^{\mathrm{inv}}$ via joint block diagonalization and then using Proposition 2(i) below, is summarized in Algorithm 1 (see Section 2.4). Proposition 2 summarizes some of the properties of $\beta^{\mathrm{inv}}$.

Proposition 2 (properties of $\beta^{\mathrm{inv}}$) Under Assumption 1, $\beta^{\mathrm{inv}}$ satisfies the following properties: (i) For all $t\in[n]$, $\beta^{\mathrm{inv}} = \Pi_{S^{\mathrm{inv}}}\gamma_{0,t} = \Pi_{S^{\mathrm{inv}}}\bar\gamma_0$, where $\bar\gamma_0 := \frac{1}{n}\sum_{t=1}^n \gamma_{0,t}$.
(ii) $\beta^{\mathrm{inv}}$ is time-invariant over $[n]$, see Definition 1. (iii) If, in addition, it holds for all $\beta \in \mathbb{R}^p$ time-invariant over $[n]$ that $\beta \in S^{\mathrm{inv}}$, then $\beta^{\mathrm{inv}} = \arg\max_{\beta \in \mathbb{R}^p \text{ time-invariant}} \mathrm{Var}(\beta)$.

Definition 1 guarantees, for all $t\in[n]$, that the explained variance for all $\beta \in \mathbb{R}^p$ time-invariant over $[n]$ is $\mathrm{Var}_t(\beta) = \mathrm{Var}(X_t^\top \beta) = \beta^\top \Sigma_t \beta$. Point (ii) of Proposition 2 therefore implies that, for all $t\in[n]$, $\mathrm{Var}_t(\beta^{\mathrm{inv}}) = (\beta^{\mathrm{inv}})^\top \Sigma_t \beta^{\mathrm{inv}}$. Under Assumption 2, the same holds for all $t\in\mathbb{N}$: $\beta^{\mathrm{inv}}$ is therefore a never-harmful parameter, and for all $t\in\mathbb{N}$ it is a solution to $\arg\max_{\beta \in S^{\mathrm{inv}}} \mathrm{Var}_t(\beta)$. Moreover, Proposition 2(iii) implies that, under an additional assumption, the parameter $\beta^{\mathrm{inv}}$ is optimal, i.e., maximizes the explained variance, among all time-invariant parameters. In particular, $\beta^{\mathrm{inv}}$ represents an interesting target of inference: it can be used for zero-shot prediction if no adaptation data is available at time $t$ to solve the second part of the optimization in (12) over $S^{\mathrm{res}}$. Using $U^{\mathrm{inv}}$ as defined in Section 2.1.1, we can express $\beta^{\mathrm{inv}}$ as follows:
$$\beta^{\mathrm{inv}} = U^{\mathrm{inv}}\big((U^{\mathrm{inv}})^\top \mathrm{Var}(X)\, U^{\mathrm{inv}}\big)^{-1} (U^{\mathrm{inv}})^\top \mathrm{Cov}(X, Y), \quad (14)$$
where $\mathrm{Var}(X) := \frac{1}{n}\sum_{t=1}^n \mathrm{Var}(X_t)$ and $\mathrm{Cov}(X, Y) := \frac{1}{n}\sum_{t=1}^n \mathrm{Cov}(X_t, Y_t)$. We derive this expression in Lemma 10 and later use it for estimation in Section 4.2.

Example 1 (continued) Considering again Example 1, we have $S^{\mathrm{inv}} = \mathcal{S}_2$, and therefore the invariant component is given by
$$\beta^{\mathrm{inv}} = \arg\max_{\beta \in \mathcal{S}_2} \mathrm{Var}(\beta) = \Pi_{\mathcal{S}_2}\bar\gamma_0 = \begin{pmatrix} 1 \\ \sqrt{3} \end{pmatrix}.$$
Moreover, we can express $\beta^{\mathrm{inv}}$ in the basis for the irreducible subspaces using the irreducible joint block diagonalizer $U$, obtaining $U^\top \beta^{\mathrm{inv}} = (0, 2)^\top$: the only non-zero component is indeed the one corresponding to the invariant subspace $\mathcal{S}_2$.

2.3 Residual component and time adaptation

Under the generalization assumption (Assumption 2), Theorem 1 implies that we can partially explain the variance of the response $Y_t$ at all observed and unobserved time points $t\in\mathbb{N}$ using $\beta^{\mathrm{inv}}$.
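The pooled expression in (14) can be sketched directly with empirical moments plugged in. The following is a minimal illustration on synthetic data, assuming the matrix $U^{\mathrm{inv}}$ is known; the function name is ours, not the paper's.

```python
import numpy as np

def invariant_component(X, Y, U_inv):
    """Pooled OLS of Y on the predictors projected onto span(U_inv), cf. (14)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    var_X = Xc.T @ Xc / len(X)        # empirical stand-in for pooled Var(X)
    cov_XY = Xc.T @ Yc / len(X)       # empirical stand-in for pooled Cov(X, Y)
    A = U_inv.T @ var_X @ U_inv
    return U_inv @ np.linalg.solve(A, U_inv.T @ cov_XY)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
U_inv = np.eye(3)[:, :2]              # assume the first two axes span S_inv
Y = X @ np.array([1.0, -2.0, 0.0])    # noiseless toy response in span(U_inv)
beta = invariant_component(X, Y, U_inv)  # approximately (1, -2, 0)
```

Because the true parameter already lies in the assumed invariant span, the projected pooled regression recovers it exactly in this noiseless toy case.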
It also implies that for all $t\in\mathbb{N}$ we can reconstruct the true time-varying parameter by adding to $\beta^{\mathrm{inv}}$ a residual parameter that maximizes the explained variance at time $t$ over the residual subspace $S^{\mathrm{res}}$, i.e., $\gamma_{0,t} = \beta^{\mathrm{inv}} + \arg\max_{\beta \in S^{\mathrm{res}}} \mathrm{Var}_t(\beta)$. In this section we focus on the second optimization step, over $S^{\mathrm{res}}$. We assume that Assumptions 1 and 2 hold, and for all $t\in\mathbb{N}$ we define the residual component $\delta^{\mathrm{res}}_t := \arg\max_{\beta \in S^{\mathrm{res}}} \mathrm{Var}_t(\beta)$. Using (7), (8) and (11), we can express $\delta^{\mathrm{res}}_t$ as
$$\delta^{\mathrm{res}}_t = \mathrm{Var}(\Pi_{S^{\mathrm{res}}} X_t)^{\dagger}\, \mathrm{Cov}(\Pi_{S^{\mathrm{res}}} X_t, Y_t) \quad (15)$$
$$\phantom{\delta^{\mathrm{res}}_t} = \mathrm{Var}(\Pi_{S^{\mathrm{res}}} X_t)^{\dagger}\, \mathrm{Cov}(\Pi_{S^{\mathrm{res}}} X_t, Y_t - X_t^\top \beta^{\mathrm{inv}}).$$
We can now reduce the number of parameters that need to be estimated by expressing $\delta^{\mathrm{res}}_t$ as an OLS solution with $\dim(S^{\mathrm{res}})$ parameters. To see this, we use that, under Assumption 1, Proposition 1 allows us to express the space $S^{\mathrm{res}}$ in terms of an irreducible joint block diagonalizer $U$ corresponding to the irreducible orthogonal partition $\{\mathcal{S}_j\}_{j=1}^{q^U_{\max}}$ as $S^{\mathrm{res}} = \operatorname{span}\{u_k \mid \exists j \in \{1, \dots, q^U_{\max}\}: \mathcal{S}_j \text{ not opt-invariant and } k \in S_j\}$. Moreover, the orthogonal projection matrix onto $S^{\mathrm{res}}$ is given by $\Pi_{S^{\mathrm{res}}} = U^{\mathrm{res}}(U^{\mathrm{res}})^\top$. Hence, using Lemma 5 we get that
$$\delta^{\mathrm{res}}_t = U^{\mathrm{res}}\, \mathrm{Var}\big((U^{\mathrm{res}})^\top X_t\big)^{-1} \mathrm{Cov}\big((U^{\mathrm{res}})^\top X_t,\, Y_t - X_t^\top \beta^{\mathrm{inv}}\big), \quad (16)$$
where $\mathrm{Var}((U^{\mathrm{res}})^\top X_t)^{-1}\mathrm{Cov}((U^{\mathrm{res}})^\top X_t, Y_t - X_t^\top \beta^{\mathrm{inv}})$ is the population ordinary least squares solution obtained by regressing $Y_t - X_t^\top \beta^{\mathrm{inv}}$ on $(U^{\mathrm{res}})^\top X_t \in \mathbb{R}^{\dim(S^{\mathrm{res}})}$.

Example 1 (continued) In Example 1, we have that for all $t\in[n]$ the residual component $\delta^{\mathrm{res}}_t$ is given by
$$\delta^{\mathrm{res}}_t = \arg\max_{\beta \in \mathcal{S}_1} \mathrm{Var}_t(\beta) = \Pi_{\mathcal{S}_1}\gamma_{0,t} = \begin{pmatrix} \sqrt{3}\,(1.5 - t/n) \\ t/n - 1.5 \end{pmatrix}.$$
Moreover, we can express $\delta^{\mathrm{res}}_t$ in the irreducible orthogonal partition basis (given by the two vectors generating $\mathcal{S}_1$ and $\mathcal{S}_2$), obtaining $U^\top \delta^{\mathrm{res}}_t = (3 - 2t/n,\, 0)^\top$, which indeed has only one degree of freedom, in the first component. This component takes the following values: for all $t\in[n]$ we have $3 - 2t/n \in [1, 3]$.

2.4 Population ISD algorithm

We call the procedure to identify $\beta^{\mathrm{inv}}$ and $\delta^{\mathrm{res}}_t$ invariant subspace decomposition (ISD).
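The reduction in (16) amounts to an OLS regression of the invariant-part residual on the lower-dimensional predictors $(U^{\mathrm{res}})^\top X_t$. A minimal sketch with empirical moments, assuming $U^{\mathrm{res}}$ and $\beta^{\mathrm{inv}}$ are given (the function name and toy data are ours):

```python
import numpy as np

def residual_component(X, Y, U_res, beta_inv):
    """Empirical analogue of (16): regress Y - X beta_inv on (U_res)^T X."""
    Z = X @ U_res                      # dim(S_res)-dimensional predictors
    R = Y - X @ beta_inv               # part of Y not explained by beta_inv
    Zc = Z - Z.mean(axis=0)
    Rc = R - R.mean()
    coef = np.linalg.solve(Zc.T @ Zc, Zc.T @ Rc)
    return U_res @ coef                # map back to R^p

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
beta_inv = np.array([1.0, -2.0, 0.0])
delta_true = np.array([0.0, 0.0, 3.0])   # residual variation on the third axis
Y = X @ (beta_inv + delta_true)
delta = residual_component(X, Y, np.eye(3)[:, 2:], beta_inv)  # approx. (0, 0, 3)
```

Note that only $\dim(S^{\mathrm{res}})$ coefficients are estimated, which is exactly what makes the later time-adaptation step data-efficient.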
By construction, the result of ISD in its population version is equal to the true time-varying parameter at time $t\in\mathbb{N}$, i.e.,
$$\gamma_{0,t} = \beta^{\mathrm{inv}} + \delta^{\mathrm{res}}_t. \quad (17)$$
The full ISD procedure is summarized in Algorithm 1. In the algorithm, the invariant and residual subspaces are identified through joint block diagonalization as described at the end of Section 2.1.1.

Algorithm 1 Population ISD
Input: distributions of $(X_t, Y_t)_{t\in[n]}$
Output: $\beta^{\mathrm{inv}}$, $\delta^{\mathrm{res}}_n$
1: $(\Sigma_t)_{t\in[n]} \leftarrow \{\mathrm{Var}(X_t)\}_{t\in[n]}$
2: $\{\gamma_{0,t}\}_{t\in[n]} \leftarrow \{\mathrm{Var}(X_t)^{-1}\mathrm{Cov}(X_t, Y_t)\}_{t\in[n]}$
3: $U, \{\mathcal{S}_j\}_{j=1}^{q} \leftarrow$ IrreducibleJointBlockDiagonalizer$\big((\Sigma_t)_{t\in[n]}\big)$ ▷ Prop. 1
4: Find the opt-invariant subspaces to identify $S^{\mathrm{inv}}$, $S^{\mathrm{res}}$:
5: $S^{\mathrm{inv}}, S^{\mathrm{res}} \leftarrow \emptyset$
6: for $j = 1, \dots, q$ do
7: $\quad \Pi_{\mathcal{S}_j} \leftarrow U^{S_j}(U^{S_j})^\top$
8: $\quad$ if $\Pi_{\mathcal{S}_j}\gamma_{0,t}$ constant in $[n]$ then $S^{\mathrm{inv}} \leftarrow S^{\mathrm{inv}} \oplus \mathcal{S}_j$ ▷ see Prop. 2
9: $\quad$ else $S^{\mathrm{res}} \leftarrow S^{\mathrm{res}} \oplus \mathcal{S}_j$
10: Define estimators for the invariant and residual component of $\gamma_{0,t}$:
11: $\Pi_{S^{\mathrm{inv}}} \leftarrow U^{\mathrm{inv}}(U^{\mathrm{inv}})^\top$
12: $\beta^{\mathrm{inv}} \leftarrow \frac{1}{n}\sum_{t=1}^n \mathrm{Var}(\Pi_{S^{\mathrm{inv}}} X_t)^{\dagger}\, \mathrm{Cov}(\Pi_{S^{\mathrm{inv}}} X_t, Y_t)$ ▷ see Prop. 2, Lemmas 1, 2
13: $\delta^{\mathrm{res}}_n \leftarrow U^{\mathrm{res}}\, \mathrm{Var}((U^{\mathrm{res}})^\top X_n)^{-1}\, \mathrm{Cov}((U^{\mathrm{res}})^\top X_n,\, Y_n - X_n^\top \beta^{\mathrm{inv}})$ ▷ see Eq. (16)

The number of subspaces $q$ in the irreducible orthogonal and $(X_t)_{t\in[n]}$-decorrelating partition is inferred by the joint block diagonalization algorithm. If Assumption 1 is not satisfied, then the decomposition in (17), and therefore the output of Algorithm 1, depends on the irreducible orthogonal and $(X_t)_{t\in[n]}$-decorrelating partition used. In Appendix C we show that a decomposition of the true time-varying parameter similar to (17) is also obtained when considering $(X_t)_{t\in[n]}$-decorrelating partitions that are not orthogonal, i.e., such that (6) holds but the subspaces in the partition are not necessarily pairwise orthogonal. In this case, in particular, we can still find an invariant and a residual parameter as in (14) and (16), respectively, where the matrix $U$ is now a non-orthogonal irreducible joint block diagonalizer.
Remark 2 (changes of $\gamma_{0,t}$ and $\Sigma_t$ over time) For the ISD procedure to work in practice, we assume that the covariance matrices $\Sigma_t$ are approximately constant in small time windows (e.g., smoothly changing or piece-wise constant). This is required to obtain accurate estimates of $\Sigma_t$, which we then (approximately) jointly block diagonalize (see Section 4.1.1). Similarly, we assume that $\gamma_{0,t}$ is approximately constant in the adaptation window, in order to perform the adaptation step (see Setting 1). In the historical data, however, $\gamma_{0,t}$ can vary arbitrarily fast in the residual subspace ($\beta^{\mathrm{inv}}$ remains constant, of course). Intuitively, even if $\gamma_{0,t}$ does not change in a structured way, its projection $\Pi_{S^{\mathrm{inv}}}\gamma_{0,t}$ onto the invariant subspace remains constant across time points. We provide an example in which $\gamma_{0,t}$ varies quickly in Appendix A.2.

3. Analysis of the two ISD tasks: zero-shot generalization and time adaptation

We now analyze the two prediction tasks that can be solved with the ISD framework introduced in the previous sections, namely (i) the zero-shot task and (ii) the adaptation task. We consider Setting 1 and assume for (i) that only the historical data is available, while for (ii) we assume to also have access to the adaptation data. Throughout the remainder of this section we assume that the invariant and residual subspaces are known (they can be computed from the joint distributions), which simplifies the theoretical analysis.

3.1 Zero-shot task

We start by analyzing the zero-shot task, in which no adaptation data are observed, but only historical data and $X_{t^\star}$. Under the generalization assumption (Assumption 2), $S^{\mathrm{inv}}$ is opt-invariant on $\mathbb{N}$, and we can characterize all possible models defined as in (1) by the possible variations of the true time-varying parameter in the residual space $S^{\mathrm{res}}$ or, equivalently, by all possible values of $\gamma_{0,t^\star} - \beta^{\mathrm{inv}} \in S^{\mathrm{res}}$. We then obtain that $\beta^{\mathrm{inv}}$ is worst-case optimal in the following sense.
Theorem 2 Under Assumptions 1 and 2, it holds for all $t^\star \in \mathbb{N}$ that
$$\beta^{\mathrm{inv}} = \arg\max_{\beta \in \mathbb{R}^p}\ \inf_{\gamma_{0,t^\star} \in \mathbb{R}^p:\ \gamma_{0,t^\star} - \beta^{\mathrm{inv}} \in S^{\mathrm{res}}} \mathrm{Var}_{t^\star}(\beta).$$
We further obtain from Proposition 2(iii) that, under an additional condition, $\beta^{\mathrm{inv}}$ is worst-case optimal among all time-invariant parameters. This characterization of $\beta^{\mathrm{inv}}$ allows for a direct comparison with the maximin effect by Meinshausen and Bühlmann (2015), which we report in more detail in Remark 3. In the absence of further information on $\delta^{\mathrm{res}}_{t^\star}$, Theorem 2 suggests using an estimate $\hat\beta^{\mathrm{inv}}$ of $\beta^{\mathrm{inv}}$ to predict $Y_{t^\star}$, i.e., $\hat Y_{t^\star} = X_{t^\star}^\top \hat\beta^{\mathrm{inv}}$.

Remark 3 (relation to maximin) The maximin framework introduced by Meinshausen and Bühlmann (2015) considers a linear regression model with varying coefficients, where the variations do not necessarily happen in a structured way, e.g., in time. Translated to our notation and restricting to time-based changes, their model can be expressed for $t\in[n]$ as $Y_t = X_t^\top \gamma_{0,t} + \epsilon_t$, where $\mathbb{E}[\epsilon_t \mid X_t] = 0$ and the covariance matrix $\Sigma := \mathrm{Var}(X_t)$ does not vary with $t$. The maximin estimator maximizes the explained variance in the worst (most adversarial) scenario that is covered in the training data; it is defined as
$$\beta^{\mathrm{mm}} := \arg\max_{\beta \in \mathbb{R}^p}\ \min_{t\in[n]} \mathrm{Var}_t(\beta).$$
The maximin estimator guarantees that, for all $t\in[n]$, $\mathrm{Cov}(Y_t - X_t^\top \beta^{\mathrm{mm}}, X_t^\top \beta^{\mathrm{mm}}) \geq 0$ and therefore $\mathrm{Var}_t(\beta^{\mathrm{mm}}) \geq 0$ (see Meinshausen and Bühlmann, 2015, Equation (8)). This means that the maximin estimator relies on a weaker notion of invariance that only requires the left-hand side of (5) to be non-negative instead of zero. This implies that $\beta^{\mathrm{inv}}$ is in general more conservative than $\beta^{\mathrm{mm}}$ in the sense that, for all $t\in[n]$, $\mathrm{Var}_t(\beta^{\mathrm{mm}}) \geq \mathrm{Var}_t(\beta^{\mathrm{inv}})$. By finding the invariant and residual subspaces, we determine the domain in which $\gamma_{0,t}$ varies and assume (Assumption 2) that this does not change even at unobserved time points.
The parameter $\beta^{\mathrm{inv}}$ is then worst-case optimal over this domain and guarantees that the explained variance remains positive even for scenarios that are more adversarial than the ones observed in $[n]$, i.e., such that $\mathrm{Var}_s(\beta^{\mathrm{mm}}) < \min_{t\in[n]} \mathrm{Var}_t(\beta^{\mathrm{mm}})$ for some $s \in \{n+1, n+2, \dots\}$ (see, for example, the results of the experiment described in Section 5.1 and shown in Figure 2). Furthermore, the decomposition allows for time adaptation (see Section 2.3), which would not be possible starting from the maximin effect.

3.2 Adaptation task

We now consider the adaptation task in which, in addition to the historical data, adaptation data are available. Adaptation data can be used to define an estimator for the residual component $\delta^{\mathrm{res}}_{t^\star}$, which we denote by $\hat\delta^{\mathrm{res}}_{t^\star}$. This, together with an estimator $\hat\beta^{\mathrm{inv}}$ for $\beta^{\mathrm{inv}}$ fitted on the historical data, gives us an estimator $\hat\gamma^{\mathrm{ISD}}_{t^\star} := \hat\beta^{\mathrm{inv}} + \hat\delta^{\mathrm{res}}_{t^\star}$ for the true time-varying parameter. We compare this estimator with a generic estimator $\hat\gamma_{t^\star}$ for $\gamma_{0,t^\star}$ which uses only the adaptation data. To do so, we consider the minimax lower bound provided by Mourtada (2022, Theorem 1) for the expected squared prediction error of an estimator $\hat\gamma$ computed using $n$ i.i.d. observations of a random covariate vector $X \in \mathbb{R}^p$ and of the corresponding response $Y$. It is given by
$$\inf_{\hat\gamma}\ \sup_{P \in \mathcal{P}}\ \mathbb{E}_P\big[(X^\top(\hat\gamma - \gamma))^2\big] \geq \sigma^2\, \frac{p}{n}, \quad (18)$$
where $\mathcal{P}$ is the class of distributions over $(X, Y)$ such that $Y = X^\top \gamma + \epsilon$, $\mathbb{E}[\epsilon \mid X] = 0$ and $\mathbb{E}[\epsilon^2 \mid X] < \sigma^2$. It follows from (18) that, for a generic estimator $\hat\gamma_{t^\star}$ of $\gamma_{0,t^\star}$ based on the adaptation data alone, we can at best expect to achieve a prediction error of $\sigma^2_{\mathrm{ad}}\, \frac{p}{m}$, where $\sigma^2_{\mathrm{ad}}$ is the variance of $\epsilon_t$ for $t \in I_{\mathrm{ad}}$ (which is assumed to be constant). We can improve on this if we allow the estimator $\hat\gamma_{t^\star}$ to also depend on the historical data. To see this, observe that we can always decompose $\hat\gamma_{t^\star} = \hat\beta + \hat\delta_{t^\star}$ with $\hat\beta \in S^{\mathrm{inv}}$ and $\hat\delta_{t^\star} \in S^{\mathrm{res}}$.
Moreover, under Assumption 2 we can split the expected prediction error of $\hat\gamma_{t^\star}$ at $t^\star$ accordingly as
$$\mathbb{E}\big[(X_{t^\star}^\top(\hat\gamma_{t^\star} - \gamma_{0,t^\star}))^2\big] = \mathbb{E}\big[((\Pi_{S^{\mathrm{inv}}} X_{t^\star})^\top(\hat\beta - \beta^{\mathrm{inv}}))^2\big] + \mathbb{E}\big[((\Pi_{S^{\mathrm{res}}} X_{t^\star})^\top(\hat\delta_{t^\star} - \delta^{\mathrm{res}}_{t^\star}))^2\big].$$
Then, $\hat\beta$ represents an estimator for $\beta^{\mathrm{inv}}$ and can be computed on historical data³, whereas $\hat\delta_{t^\star}$ estimates $\delta^{\mathrm{res}}_{t^\star}$ and is based on adaptation data: by decomposing $\hat\gamma_{t^\star}$ in this way, the best prediction error we can hope for is of the order $\frac{\dim(S^{\mathrm{inv}})}{n} + \frac{\dim(S^{\mathrm{res}})}{m}$. In Theorem 3 in Section 4, we prove that $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ indeed achieves this bound. If the invariant subspace is non-degenerate, so that $\dim(S^{\mathrm{res}}) < p$, and $n$ is sufficiently large, this implies that $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ has better finite sample performance than estimators based on the adaptation data alone.

3. In principle, an estimator for $\beta^{\mathrm{inv}}$ could be computed using both historical and adaptation data. However, the ISD procedure is motivated by scenarios in which the size of the historical data is very large (and $n \gg m$): this means that it could be computationally costly to update the estimate for the invariant component every time new adaptation data become available, without a significant gain in estimation accuracy. For this reason, we only consider estimators for $\beta^{\mathrm{inv}}$ that use historical data.

4. ISD estimator and its finite sample generalization guarantee

We now construct an empirical estimation procedure for the ISD framework, based on the results of Section 2.2 and on Algorithm 1 described in Section 2.4. Throughout this section, we assume that Assumptions 1 and 2 are satisfied. We assume that we observe both historical and adaptation data as in the adaptation task. We use the historical data to first estimate the decomposition of $\mathbb{R}^p$ into $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$, employing a joint block diagonalization algorithm (Section 4.1). We then use the resulting decomposition to estimate $\beta^{\mathrm{inv}}$. Finally, we use the adaptation data to construct an estimator for $\delta^{\mathrm{res}}_{t^\star}$ in Section 4.2.
In Section 4.3 we then show the advantage of separating the optimization as in (3) to estimate $\gamma_{0,t^\star}$ at the previously unobserved time point $t^\star$, by providing finite sample guarantees for the ISD estimator.

4.1 Estimating the subspace decomposition

4.1.1 Approximate joint block diagonalization

We first need to find a good estimator for the covariance matrices $\Sigma_t$. Since only one observation $(X_t, Y_t)$ is available at each time step $t\in[n]$, some further assumptions are needed about how $\Sigma_t$ varies over time. Here, we assume that $\Sigma_t$ varies smoothly with $t\in[n]$ and is therefore approximately constant in small time windows. We can then consider a rolling window approach, i.e., consider $K$ windows in $[n]$ of length $w \ll n$ over which the constant approximation is deemed valid and, for the $k$-th time window, $k \in \{1, \dots, K\}$, take the sample covariance $\hat\Sigma_k$ as an estimator for $\Sigma_t$ in that time window. Given the set of estimated covariance matrices $\{\hat\Sigma_k\}_{k=1}^K$, we now need to estimate an orthogonal transformation $\hat U$ that approximately jointly block diagonalizes them. We provide an overview of joint block diagonalization methods in Section A.3 in the appendix. In our simulated settings, we solve the approximate joint block diagonalization (AJBD) problem via approximate joint diagonalization (AJD), since we found this approach to represent a good trade-off between computational complexity and accuracy. In more detail, similarly to what is proposed by Tichavsky and Koldovsky (2012), we start from the output matrix $V \in \mathbb{R}^{p\times p}$ of the uwedge algorithm by Tichavsky and Yeredor (2008), which solves AJD for the set $\{\hat\Sigma_k\}_{k=1}^K$, i.e., it is such that for all $k \in \{1, \dots, K\}$ the matrix $V^\top \hat\Sigma_k V$ is approximately diagonal. We then use the off-diagonal elements of the approximately diagonalized covariance matrices to identify the common diagonal blocks. As described in more detail in Section A.3, this is achieved by finding an appropriate permutation matrix $P \in \mathbb{R}^{p\times p}$ for the columns of $V$ such that for all $k \in \{1, \dots, K\}$ the matrix $(VP)^\top \hat\Sigma_k (VP)$ is approximately joint block diagonal. The estimated irreducible joint block diagonalizer is then given by $\hat U = VP$. We denote by $q^{\hat U}_{\max}$ the number of estimated diagonal blocks and by $\{\hat{\mathcal{S}}_j\}_{j=1}^{q^{\hat U}_{\max}}$ the estimated irreducible orthogonal and $(X_t)_{t\in[n]}$-decorrelating partition.

4.1.2 Estimating the invariant and residual subspaces

We now estimate the invariant and residual subspaces using the estimated irreducible joint block diagonalizer $\hat U$. To do so, we first estimate the true time-varying parameter $\gamma_{0,t}$ using considerations similar to the ones made in Section 4.1.1. We assume, for example, that $\gamma_{0,t}$ is approximately constant in small windows (for simplicity, we consider the same $K$ windows defined in Section 4.1.1). This assumption is helpful to define the estimation procedure, but is not strictly necessary for the ISD framework to work. We show in Appendix A.2 that the procedure can still work if this assumption is violated. We then compute the regression coefficient $\hat\gamma_k$ of $\mathbf{Y}_k$ on $\mathbf{X}_k$, where $\mathbf{Y}_k \in \mathbb{R}^{w\times 1}$ and $\mathbf{X}_k \in \mathbb{R}^{w\times p}$ contain the observations in the $k$-th time window⁴. We use the estimates $\hat\gamma_k$ to determine which of the subspaces identified by $\hat U$ are opt-invariant. For all $j \in \{1, \dots, q^{\hat U}_{\max}\}$, we take the sets of indices $S_j$ as defined in Proposition 1, and consider the estimated orthogonal projection matrices $\Pi_{\hat{\mathcal{S}}_j} = \hat U^{S_j}(\hat U^{S_j})^\top$ from $\mathbb{R}^p$ onto the $j$-th subspace $\hat{\mathcal{S}}_j$ of the estimated irreducible orthogonal partition. It follows from Lemma 1 that we can find the opt-invariant subspaces by checking for all $j \in \{1, \dots, q^{\hat U}_{\max}\}$ whether $\Pi_{\hat{\mathcal{S}}_j}\hat\gamma_k$ remains approximately constant for $k \in \{1, \dots, K\}$. To do so, let
$$\hat\gamma := \Big(\sum_{k=1}^K \mathrm{Var}(\hat\gamma_k)^{-1}\Big)^{-1} \sum_{k=1}^K \mathrm{Var}(\hat\gamma_k)^{-1}\hat\gamma_k$$
be the average of the estimated regression coefficients, inversely weighted by their variance.
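The two rolling-window estimation steps above can be sketched as follows. This is a simplified illustration on synthetic data: the windows are taken to be equal-length and non-overlapping, and the inverse-variance weighting of the averaged coefficient is replaced by a plain mean; the function name is ours.

```python
import numpy as np

def windowed_estimates(X, Y, K):
    """Per-window sample covariances Sigma_hat_k and OLS coefficients gamma_hat_k."""
    n, _ = X.shape
    w = n // K
    covs, gammas = [], []
    for k in range(K):
        Xk, Yk = X[k * w:(k + 1) * w], Y[k * w:(k + 1) * w]
        Xc = Xk - Xk.mean(axis=0)
        covs.append(Xc.T @ Xc / (w - 1))                       # Sigma_hat_k
        gammas.append(np.linalg.lstsq(Xk, Yk, rcond=None)[0])  # gamma_hat_k
    return covs, np.array(gammas)

rng = np.random.default_rng(3)
X = rng.normal(size=(800, 2))
Y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=800)
covs, gammas = windowed_estimates(X, Y, K=8)
gamma_bar = gammas.mean(axis=0)   # unweighted stand-in for the weighted average
```

The covariances `covs` would then be passed to an approximate joint (block) diagonalization routine such as uwedge, and the coefficients `gammas` to the opt-invariance check below.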
We further use that, by Lemma 9, if $\hat{\mathcal{S}}_j$ is opt-invariant on $[n]$, then the weighted average $\Pi_{\hat{\mathcal{S}}_j}\hat\gamma$ of the (approximately constant) projected regression coefficients $\Pi_{\hat{\mathcal{S}}_j}\hat\gamma_k$ approximately satisfies the time-invariance constraint (5). This approach is motivated by Proposition 2(iii). If the corresponding assumption cannot be assumed to hold, other methods can alternatively be used to determine whether $\Pi_{\hat{\mathcal{S}}_j}\hat\gamma_k$ is constant, e.g., checking its gradient or variance. In (5), we can equivalently consider the correlation in place of the covariance, that is, $\mathrm{corr}(Y_t - X_t^\top \beta, X_t^\top \beta) = 0$: an estimate of this correlation allows us to obtain a normalized measure of (5) that is comparable across different experiments. Formally, we consider for all $k \in \{1, \dots, K\}$ and for all $j \in \{1, \dots, q^{\hat U}_{\max}\}$
$$\hat c^j_k := \widehat{\mathrm{Corr}}\big(\mathbf{Y}_k - \mathbf{X}_k(\Pi_{\hat{\mathcal{S}}_j}\hat\gamma),\ \mathbf{X}_k(\Pi_{\hat{\mathcal{S}}_j}\hat\gamma)\big)$$
and check, for all $j \in \{1, \dots, q^{\hat U}_{\max}\}$, whether
$$|\hat c^j_k| \leq \lambda \quad (19)$$
for some small threshold $\lambda \in [0, 1]$. The threshold $\lambda$ can be chosen, for example, using cross-validation (more details are provided in Section A.4 in the appendix). An estimator of the invariant and residual subspaces is then given by
$$\hat S^{\mathrm{inv}} = \bigoplus_{j \in \{1, \dots, q^{\hat U}_{\max}\}:\ (19)\text{ is satisfied}} \hat{\mathcal{S}}_j, \qquad \hat S^{\mathrm{res}} = \bigoplus_{j \in \{1, \dots, q^{\hat U}_{\max}\}:\ (19)\text{ is not satisfied}} \hat{\mathcal{S}}_j,$$
where we approximate opt-invariance with the inequality (19) being satisfied.

4.2 Estimating the invariant and residual components

Let $\hat U^{\mathrm{inv}}$ and $\hat U^{\mathrm{res}}$ be the submatrices of $\hat U$ whose columns span $\hat S^{\mathrm{inv}}$ and $\hat S^{\mathrm{res}}$, respectively. We propose to estimate $\beta^{\mathrm{inv}}$ using the following plug-in estimator for (14):
$$\hat\beta^{\mathrm{inv}} := \hat U^{\mathrm{inv}}\big((\hat U^{\mathrm{inv}})^\top \mathbf{X}^\top \mathbf{X}\, \hat U^{\mathrm{inv}}\big)^{-1}(\hat U^{\mathrm{inv}})^\top \mathbf{X}^\top \mathbf{Y}. \quad (21)$$

4. We have so far omitted the intercept in our linear model, but it can be included by adding a constant term to $X_t$ when estimating the linear parameters. We explicitly show how to take the intercept into account in Section B in the appendix.
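The correlation-based check in (19) can be sketched as a single function. This is an illustrative simplification: it tests one candidate parameter on the full sample rather than per window, and the threshold value is an assumption (the paper selects it by cross-validation).

```python
import numpy as np

def passes_invariance_check(X, Y, beta, lam=0.05):
    """Estimate corr(Y - X beta, X beta) and compare |corr| to threshold lam, cf. (19)."""
    fitted = X @ beta
    resid = Y - fitted
    return abs(np.corrcoef(resid, fitted)[0, 1]) <= lam

rng = np.random.default_rng(4)
X = rng.normal(size=(20000, 2))
beta_true = np.array([1.0, 0.0])
Y = X @ beta_true + 0.5 * rng.normal(size=20000)
# beta_true leaves residuals uncorrelated with the fit; a shrunken parameter
# such as (0.5, 0) does not, so it fails the check.
```

Under the true parameter the residual is pure noise, so the estimated correlation is near zero; any parameter that under- or over-shoots along a predictive direction leaves correlation between residual and fit.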
We now consider the new observation of the covariates $X_{t^\star}$ at time $t^\star$ and the adaptation data $(X_t, Y_t)_{t \in I_{\mathrm{ad}}}$ introduced in Setting 1, and denote by $\mathbf{X}^{\mathrm{ad}} \in \mathbb{R}^{m\times p}$ and $\mathbf{Y}^{\mathrm{ad}} \in \mathbb{R}^{m\times 1}$ the matrices containing these adaptation data. Similarly to $\hat\beta^{\mathrm{inv}}$, using (16) we obtain the following plug-in estimator for $\delta^{\mathrm{res}}_{t^\star}$:
$$\hat\delta^{\mathrm{res}}_{t^\star} := \hat U^{\mathrm{res}}\big((\hat U^{\mathrm{res}})^\top (\mathbf{X}^{\mathrm{ad}})^\top \mathbf{X}^{\mathrm{ad}}\, \hat U^{\mathrm{res}}\big)^{-1}(\hat U^{\mathrm{res}})^\top (\mathbf{X}^{\mathrm{ad}})^\top \big(\mathbf{Y}^{\mathrm{ad}} - \mathbf{X}^{\mathrm{ad}}\hat\beta^{\mathrm{inv}}\big). \quad (22)$$
We can now define the ISD estimator for the true time-varying parameter at $t^\star$ as
$$\hat\gamma^{\mathrm{ISD}}_{t^\star} := \hat\beta^{\mathrm{inv}} + \hat\delta^{\mathrm{res}}_{t^\star}. \quad (23)$$
A prediction of the response $Y_{t^\star}$ is then given by $\hat Y_{t^\star} = X_{t^\star}^\top \hat\gamma^{\mathrm{ISD}}_{t^\star}$. We provide pseudocode summarizing the whole ISD estimation procedure in Appendix B.

Example 1 (continued) Assume that in Example 1 the true time-varying parameter at time points $t \in \{n+1, n+2, \dots\}$ is given by
$$\gamma^{\mathrm{ad}}_{0,t} = \begin{pmatrix} 1 + 0.5\sqrt{3}\,\big(1 - 3\,\frac{t-n}{T}\sin^2\!\big(\frac{t-n}{T}\big)\big) \\[2pt] \sqrt{3} - 0.5 + 1.5\,\frac{t-n}{T}\sin^2\!\big(\frac{t-n}{T}\big) \end{pmatrix}$$
for some $T \in \mathbb{N}$. We can verify that $\Pi_{\mathcal{S}_2}\gamma^{\mathrm{ad}}_{0,t} = \Pi_{\mathcal{S}_2}\gamma_{0,t} = \beta^{\mathrm{inv}}$, and therefore the invariant and residual subspaces defined on $[n]$ generalize to $\{n+1, n+2, \dots\}$ and Assumption 2 is satisfied. Moreover, we obtain that for $t \in \{n+1, \dots\}$ the residual component expressed in the irreducible orthogonal partition basis is $U^\top \delta^{\mathrm{res}}_t = \big(1 - 3\,\frac{t-n}{T}\sin^2(\frac{t-n}{T}),\ 0\big)^\top$. The first entry now takes values in $[-1.5, 1]$, which is disjoint from the range of values of the first coordinate observed in $[n]$ (which was $[1, 3]$). We now take $T = 350$ and consider an online setup in which we sequentially observe $X_{t^\star}$ at a new time point $t^\star \in \{n+1, \dots\}$, and assume that $Y_t$ is observed until $t = t^\star - 1$. We take as historical data the observations on $[n]$, and consider as adaptation data the observations in windows $I_{\mathrm{ad}} = \{t^\star - m, \dots, t^\star - 1\}$ of length $m = 16$. After estimating $\beta^{\mathrm{inv}}$ on the historical data, we use the adaptation data to estimate $\delta^{\mathrm{res}}_{t^\star}$ and the OLS solution $\gamma^{\mathrm{OLS}}_{t^\star}$. We repeat this online step 350 times (each time increasing $t^\star$ by one). Figure 1 shows the results of this experiment.
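The estimators (21), (22) and (23) compose into a short end-to-end sketch. The following is a minimal illustration on synthetic data, assuming the matrices $\hat U^{\mathrm{inv}}$ and $\hat U^{\mathrm{res}}$ have already been estimated (here they are taken as known coordinate axes); the function name is ours.

```python
import numpy as np

def isd_estimator(X_hist, Y_hist, X_ad, Y_ad, U_inv, U_res):
    """Combine an invariant fit on historical data with a residual fit on
    a short adaptation window, cf. (21)-(23)."""
    # (21): plug-in invariant component from historical data
    Z = X_hist @ U_inv
    beta = U_inv @ np.linalg.solve(Z.T @ Z, Z.T @ Y_hist)
    # (22): residual component from the adaptation window
    W = X_ad @ U_res
    delta = U_res @ np.linalg.solve(W.T @ W, W.T @ (Y_ad - X_ad @ beta))
    # (23): ISD estimator of the time-varying parameter
    return beta + delta

rng = np.random.default_rng(5)
U = np.eye(3)                                    # assumed known decomposition
X_hist = rng.normal(size=(5000, 3))
Y_hist = X_hist @ np.array([1.0, -1.0, 0.0])     # residual coordinate inactive here
X_ad = rng.normal(size=(50, 3))
Y_ad = X_ad @ np.array([1.0, -1.0, 2.0])         # residual coordinate shifted to 2
gamma_hat = isd_estimator(X_hist, Y_hist, X_ad, Y_ad, U[:, :2], U[:, 2:])
```

Only the one-dimensional residual coordinate is re-estimated from the short adaptation window; the invariant coordinates are carried over from the large historical sample.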
4.3 Finite sample generalization guarantee

Considering the setting described in Section 2.3, we now compare the ISD estimator in (23) with the OLS estimator computed on $I_{\mathrm{ad}}$, i.e., $\hat\gamma^{\mathrm{OLS}}_{t^\star} := ((\mathbf{X}^{\mathrm{ad}})^\top \mathbf{X}^{\mathrm{ad}})^{-1}(\mathbf{X}^{\mathrm{ad}})^\top \mathbf{Y}^{\mathrm{ad}}$. We assume that we are given the (oracle) subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$, and consider the expected explained variance at $t^\star$ of $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ and $\hat\gamma^{\mathrm{OLS}}_{t^\star}$. In more detail, the expected explained variance at $t^\star$ of an arbitrary estimator $\hat\gamma$ of $\gamma_{0,t^\star}$ is given by
$$\mathbb{E}[\mathrm{Var}_{t^\star}(\hat\gamma)] := \mathbb{E}\big[\mathrm{Var}(Y_{t^\star}) - \mathrm{Var}(Y_{t^\star} - X_{t^\star}^\top \hat\gamma \mid \hat\gamma)\big]. \quad (24)$$
Evaluating (24) allows us to obtain a measure of the prediction accuracy of $\hat\gamma$: the higher the expected explained variance, the more predictive the estimator is. This also becomes clear by isolating in (24) the term $\mathbb{E}[\mathrm{Var}(Y_{t^\star} - X_{t^\star}^\top \hat\gamma \mid \hat\gamma)]$, which represents the mean squared prediction error obtained by using $\hat\gamma$ (see Remark 1). Our goal is to show that the variance explained by $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ is always greater than that of the OLS estimator.

Theorem 3 Assume Assumption 2 and that, in model (1), $\gamma_{0,t}$ and the variances of $X_t$ and $\epsilon_t$ do not change with respect to $t \in I_{\mathrm{ad}} \cup \{t^\star\}$, and denote them by $\gamma_{0,t^\star}$, $\Sigma_{t^\star}$ and $\sigma^2_{\mathrm{ad}}$, respectively. Moreover, let $c, \sigma^2_{\epsilon,\max} > 0$ be constants such that for all $n\in\mathbb{N}$ it holds that $\sigma^2_{\epsilon,\max} \geq \max_{t\in[n]} \mathrm{Var}(\epsilon_t)$, and for all $m\in\mathbb{N}$ with $m \geq p$, $\lambda_{\min}\big(\frac{1}{m}(\mathbf{X}^{\mathrm{ad}})^\top \mathbf{X}^{\mathrm{ad}}\big) \geq c$ almost surely, where $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue. Further assume that the invariant and residual subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ are known. Then, there exist constants $C^{\mathrm{inv}}, C^{\mathrm{res}} > 0$ such that for all $n, m \in \mathbb{N}$ with $n, m \geq p$ it holds that
$$\mathrm{MSPE}(\hat\gamma^{\mathrm{ISD}}_{t^\star}) := \mathbb{E}\big[(X_{t^\star}^\top(\hat\gamma^{\mathrm{ISD}}_{t^\star} - \gamma_{0,t^\star}))^2\big] \leq \sigma^2_{\epsilon,\max}\, \frac{\dim(S^{\mathrm{inv}})}{n}\, C^{\mathrm{inv}} + \sigma^2_{\mathrm{ad}}\, \frac{\dim(S^{\mathrm{res}})}{m}\, C^{\mathrm{res}}.$$
Furthermore, for all $n, m \in \mathbb{N}$ with $n, m \geq p$ it holds that
$$\mathrm{MSPE}(\hat\gamma^{\mathrm{OLS}}_{t^\star}) - \mathrm{MSPE}(\hat\gamma^{\mathrm{ISD}}_{t^\star}) \geq \sigma^2_{\mathrm{ad}}\, \frac{\dim(S^{\mathrm{inv}})}{m} - \sigma^2_{\epsilon,\max}\, \frac{\dim(S^{\mathrm{inv}})}{n}\, C^{\mathrm{inv}}.$$
From the proof of Theorem 3 it follows that $C^{\mathrm{res}}$ can be chosen close to 1 if we only consider sufficiently large $m$.
Moreover, because $\mathrm{MSPE}(\hat\gamma^{\mathrm{OLS}}_{t^\star}) - \mathrm{MSPE}(\hat\gamma^{\mathrm{ISD}}_{t^\star}) = \mathbb{E}[\mathrm{Var}_{t^\star}(\hat\gamma^{\mathrm{ISD}}_{t^\star})] - \mathbb{E}[\mathrm{Var}_{t^\star}(\hat\gamma^{\mathrm{OLS}}_{t^\star})]$, Theorem 3 implies that if $\dim(S^{\mathrm{inv}}) \geq 1$ and $n$ is sufficiently large, then $\mathbb{E}[\mathrm{Var}_{t^\star}(\hat\gamma^{\mathrm{ISD}}_{t^\star})] > \mathbb{E}[\mathrm{Var}_{t^\star}(\hat\gamma^{\mathrm{OLS}}_{t^\star})]$. The first term in the difference between the expected explained variances (or MSPEs) depends on the dimension of the invariant subspace and on the size of the time-adaptation window $I_{\mathrm{ad}}$: the gain from using $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ instead of $\hat\gamma^{\mathrm{OLS}}_{t^\star}$ is larger if the dimension of $S^{\mathrm{inv}}$ is large and only a small number of time points are available in the adaptation data.

5. Experiments

To show the effectiveness of the ISD framework we report the results of two simulation experiments and one real data experiment. The first simulation experiment evaluates the estimation accuracy of the invariant component $\hat\beta^{\mathrm{inv}}$ for increasing sample size $n$ of the historical data. The second simulation experiment compares the predictive accuracy of $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ and $\hat\gamma^{\mathrm{OLS}}_{t^\star}$ for different sizes $m$ of the adaptation dataset, to empirically investigate the dependence of the MSPE difference on the size of the adaptation data shown in Theorem 3. In both simulation experiments we let the dimension of the covariates be $p = 10$, with $\dim(S^{\mathrm{inv}}) = 7$ and $\dim(S^{\mathrm{res}}) = 3$, and generate data as follows. We sample a random orthogonal matrix $U$, and sample the covariates $X_t$ from a normal distribution with zero mean and covariance matrix $U \Sigma_t U^\top$, where $\Sigma_t$ is a block-diagonal matrix with four blocks of dimensions 2, 4, 3 and 1, and random entries that change 10 times over the observed time horizon $n$.
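The covariate-generation scheme just described can be sketched as follows. This is a simplified stand-in for the paper's simulation code (the random block entries and regime lengths are our own choices): a random orthogonal rotation of a piece-wise constant block-diagonal covariance with blocks of sizes 2, 4, 3 and 1.

```python
import numpy as np

def block_diag(mats):
    """Assemble square matrices into one block-diagonal matrix."""
    p = sum(m.shape[0] for m in mats)
    out = np.zeros((p, p))
    i = 0
    for m in mats:
        k = m.shape[0]
        out[i:i + k, i:i + k] = m
        i += k
    return out

rng = np.random.default_rng(6)
p, n, regimes = 10, 2000, 10
U, _ = np.linalg.qr(rng.normal(size=(p, p)))   # random orthogonal matrix
block_sizes = [2, 4, 3, 1]

X = np.empty((n, p))
for r in range(regimes):                        # covariance changes per regime
    mats = []
    for size in block_sizes:
        A = rng.normal(size=(size, size))
        mats.append(A @ A.T + np.eye(size))     # random positive definite block
    Sigma = U @ block_diag(mats) @ U.T
    L = np.linalg.cholesky(Sigma)
    idx = slice(r * n // regimes, (r + 1) * n // regimes)
    X[idx] = rng.normal(size=(n // regimes, p)) @ L.T
```

By construction, $U$ is a joint block diagonalizer of all regime covariances, so a joint block diagonalization procedure applied to windowed covariance estimates of `X` should approximately recover the four blocks.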
We take as true time-varying parameter the rotation by $U$ of the parameter with constant entries equal to 0.2 (we set these entries all to the same value for simplicity, but they need not be equal in general) corresponding to the blocks of sizes 4 and 3, and time-varying entries corresponding to the blocks of sizes 2 and 1 equal to $1 - 1.5\,(t/n)\sin^2(it/n + i)$, where $i \in \{2, \dots, 8\}$ is the entry index (the values of these coefficients range between 0.25 and 1). The noise terms $\epsilon_t$ are sampled i.i.d. from a normal distribution with zero mean and variance $\sigma^2_{\epsilon_t} = 0.64$. The dimensions of the subspaces $S^{\mathrm{inv}}$ and $S^{\mathrm{res}}$ and the true time-varying parameter $\gamma_{0,t}$ are chosen to ensure that, in the historical data, $X_t^\top \beta^{\mathrm{inv}}$ and $X_t^\top \delta^{\mathrm{res}}_t$ each explain approximately half of the variance of $Y_t$. This choice allows for better visualization in the described experiments. In Section 5.3 we repeat the same experiments on real data coming from a controlled physical system (light tunnel) developed by Gamella et al. (2025). The data is taken from the smooth_polarizers experiment in the lt_walks_v1 dataset, available at https://github.com/juangamella/causal-chamber. The code for the presented experiments is available at https://github.com/mlazzaretto/Invariant-Subspace-Decomposition. The implementation of the uwedge algorithm is taken from the Python package https://github.com/sweichwald/coroICA-python developed by Pfister et al. (2019b).

5.1 Invariant decomposition and zero-shot prediction

We first estimate the time-invariant parameter $\beta^{\mathrm{inv}}$ for different sample sizes $n$ of the historical data. We consider $n \in \{500, 1000, 2500, 4000, 6000\}$, and repeat the experiment 20 times for each $n$. To compute $\hat\beta^{\mathrm{inv}}$, we use $K = 25$ equally distributed windows of length $n/8$ (see Section 4). Figure 4 shows that the mean squared error (MSE) $\|\beta^{\mathrm{inv}} - \hat\beta^{\mathrm{inv}}\|_2^2$ converges to zero for increasing values of $n$.
We then consider a separate time window of 250 observations in which the value of the time-varying coefficients (before the transformation using $U$) is set to 1. We use these observations to test the zero-shot predictive capability of the estimated invariant component, i.e., they can be seen as realizations of the variable $X_{t^\star}$ introduced in Setting 1 (we refer to this window as test data). We compare the predictive performance of the parameter $\hat\beta^{\mathrm{inv}}$ on the historical and test data with that of the oracle invariant parameter $\beta^{\mathrm{inv}}$, the maximin effect $\hat\beta^{\mathrm{mm}}$ (computed using the magging estimator proposed by Bühlmann and Meinshausen, 2015), and the OLS solution $\hat\beta^{\mathrm{OLS}}$, both computed using the historical data.

Figure 4: MSE of $\hat\beta^{\mathrm{inv}}$ for increasing size of the historical data $n$ (see Section 5.1). For larger values of $n$, the estimation of the invariant subspace decomposition becomes more precise and leads to smaller errors in the estimated invariant component $\hat\beta^{\mathrm{inv}}$.

Figure 5: Normalized explained variance ($R^2$) by $\hat\beta^{\mathrm{inv}}$ and comparison with $\beta^{\mathrm{inv}}$, $\hat\beta^{\mathrm{mm}}$ and $\hat\beta^{\mathrm{OLS}}$: training (historical data, left) and zero-shot generalization (test data, right), for different sizes $n$ of the historical data (see Section 5.1). The dashed line indicates the population value of the (normalized) explained variance by $\beta^{\mathrm{inv}}$.

We show in Figure 5 the results in terms of the $R^2$ coefficient, given by
$$R^2 = \frac{\sum_{t=1}^n \big(\widehat{\mathrm{Var}}(Y_t) - \widehat{\mathrm{Var}}(Y_t - X_t^\top \hat\beta)\big)}{\sum_{t=1}^n \widehat{\mathrm{Var}}(Y_t)}.$$
Figure 5 shows that the $R^2$ coefficient of the oracle invariant component $\beta^{\mathrm{inv}}$ remains positive even for values of $\gamma_{0,t}$ in the test data that lie outside of the observed support in the historical data; for increasing $n$ the same holds for the estimated $\hat\beta^{\mathrm{inv}}$. Using $\hat\beta^{\mathrm{OLS}}$ or $\hat\beta^{\mathrm{mm}}$ instead leads to negative explained variance in this experiment.
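The $R^2$ measure above can be sketched directly. This is an illustrative simplification: the per-window variance estimates are computed on equal-length splits of a single synthetic sample, and the function name is ours.

```python
import numpy as np

def explained_variance_r2(X, Y, beta, K=5):
    """Windowed explained-variance R^2 of a fixed parameter beta."""
    n = len(X)
    w = n // K
    num, den = 0.0, 0.0
    for k in range(K):
        Yk = Y[k * w:(k + 1) * w]
        Rk = Yk - X[k * w:(k + 1) * w] @ beta
        num += np.var(Yk) - np.var(Rk)   # per-window explained variance
        den += np.var(Yk)                # per-window total variance
    return num / den

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
beta_true = np.array([1.0, 0.5, -1.0])
Y = X @ beta_true + 0.5 * rng.normal(size=1000)
r2_true = explained_variance_r2(X, Y, beta_true)   # close to 0.9 here
r2_zero = explained_variance_r2(X, Y, np.zeros(3))  # exactly 0
```

Note that, unlike the usual in-sample $R^2$ of a fitted model, this quantity can be negative for a fixed parameter that fits poorly, which is exactly what Figure 5 shows for $\hat\beta^{\mathrm{OLS}}$ and $\hat\beta^{\mathrm{mm}}$ on the test data.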
The main limitation of the ISD method lies in the estimation of the invariant and residual subspaces. As outlined in Section 4, this process consists of two main steps, approximate joint block diagonalization and selection of the invariant blocks, both of which are in practice sensitive to noise. In both steps, we implement our estimator to be as conservative as possible, that is, such that it does not on average overestimate the number of common diagonal blocks or the dimension of $S^{\mathrm{inv}}$, to avoid including part of the residual subspace in $\hat S^{\mathrm{inv}}$. This behavior is, however, hard to avoid if the size of the historical dataset is not sufficiently large, therefore requiring large values of $n$ for the ISD framework to work effectively; see Figures 4 and 5.

5.2 Time adaptation

In the same setting, we now fix the size of the historical dataset to $n = 6000$, which we use to estimate $\hat\beta^{\mathrm{inv}}$, and consider a test dataset in which the time-varying coefficients (before the transformation using $U$) undergo two shifts and take values 0.5 and 2 on two consecutive time windows, each containing 1000 observations.

Figure 6: MSPE comparison: $\hat\gamma^{\mathrm{ISD}}_{t^\star}$ (blue dots) vs. $\hat\gamma^{\mathrm{OLS}}_{t^\star}$ for $p = 10$ and various adaptation window lengths $m$ (see Section 5.2). The ISD estimator achieves lower MSPE than the OLS for smaller sizes $m$ of the adaptation window, while the two estimators become comparable for increasing $m$. The orange dots show the MSPE of $\beta^{\mathrm{inv}}(\text{oracle}) + \hat\delta^{\mathrm{res}}_{t^\star}$ vs. $\hat\gamma^{\mathrm{OLS}}_{t^\star}$: if the subspace decomposition is known, then ISD always achieves lower MSPE than the OLS.

We assume that the test data are observed sequentially, and take as adaptation data a rolling window of length $m$ contained in the test data and shifting by one time point at a time. We use these sequential adaptation datasets to estimate the residual parameter $\hat\delta^{\mathrm{res}}_{t^\star}$ and the OLS solution $\hat\gamma^{\mathrm{OLS}}_{t^\star}$.
We then compute the squared prediction error of γ̂ISD_t = β̂inv + δ̂res_t and γ̂OLS_t on the next data point X_{t+1}, i.e., (X_{t+1}^⊤(γ_{0,t+1} − γ̂_t))², and approximate the corresponding MSPE using a Monte-Carlo approximation with 1000 draws of X_{t+1} (these correspond to the 1000 sequential observations in each of the two windows of the test data). We repeat the simulation 20 times for different sizes of the adaptation window, m ∈ {1.5p, 2p, 5p, 10p}, and plot the obtained MSPE values for γ̂OLS_t against γ̂ISD_t. The result is shown in Figure 6, and empirically supports Theorem 3, in particular that the difference in the MSPEs of the OLS and ISD estimators is proportional to the ratio dim(Sinv)/m and is therefore larger for small values of m and shrinks for increasing size of the adaptation data (additional details on this simulation are provided in Section A.5 in the Appendix). The ISD framework is particularly helpful in scenarios in which the size of the available adaptation window is small (first two plots from the left in Figure 6). Indeed, from Theorem 3 it also follows that the larger the dimension of the invariant subspace, the greater the advantage of using the ISD framework for prediction rather than a naive OLS approach. A further benefit of the ISD estimator is that it allows us to estimate δ̂res_t for small lengths m of the adaptation window with dim(Sres) < m < p, where OLS is not feasible. We run a similar experiment to show (Figure 7) the average cumulative explained variance on the adaptation data over 20 runs, both by estimators computed only on the historical data and by estimators that use the adaptation data. For visualization purposes, we now consider the time-varying coefficients (before the transformation using U) equal to (0.5 t sin²(it/n + i))/n, where i ∈ {2, ..., 8} is the coefficient index, in the historical data, and constantly equal to 0.3, 0.65 and 1 in three consecutive time windows of size 150 on the time points after the historical data.
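The coefficient trajectories just described can be generated as in the following sketch (variable names are ours; the rotation by U and the remaining simulation details are omitted):

```python
import numpy as np

n = 6000                # size of the historical data
idx = np.arange(2, 9)   # coefficient indices i in {2, ..., 8}
t = np.arange(1, n + 1)

# historical data: smooth trajectories (0.5 * t * sin^2(i*t/n + i)) / n
gamma_hist = np.stack(
    [0.5 * t * np.sin(i * t / n + i) ** 2 / n for i in idx], axis=1
)

# after the historical data: constant on three consecutive windows of 150 points
gamma_adapt = np.repeat([0.3, 0.65, 1.0], 150)

print(gamma_hist.shape, gamma_adapt.shape)  # (6000, 7) (450,)
```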
We estimate δ̂res_t and γ̂OLS_t on a rolling adaptation window of size m = 3p. The plot in Figure 7 shows that, on average, the ISD framework, by exploiting invariance properties in the observed data, allows us to accurately explain the variance of the response using small windows for time adaptation, significantly improving on the OLS solution in the same time windows.

Figure 7: Average cumulative explained variance on the adaptation data and one standard deviation intervals (over 20 runs) by the true time-varying parameter γ_{0,t} and various estimators (see Section 5.2). When few data points are available for adaptation at time t, the explained variance of the ISD estimator γ̂ISD_t = β̂inv + δ̂res_t is significantly higher than that of the rolling window OLS γ̂OLS_t, and improves on invariance-based estimators such as the invariant component β̂inv or the maximin β̂mm. Due to a shift of γ_{0,t} in the residual subspace, the OLS computed on historical data can perform worse than the zero function.

5.3 Real data example

We now present a toy example that applies ISD to real data. The data used for this experiment are collected using a controlled physical system developed by Gamella et al. (2025). The system, shown in Figure 8, consists of a light tunnel with a light source XRGB whose emitted light passes through two polarizers with relative angle θ between them and is captured by a sensor placed at the end of the tunnel, behind the polarizers. At the end of the tunnel there are two additional LED light sources, XL31 and XL32, whose emitted light is unaffected by the polarizers. The sensor Y := I3 measures the overall infrared light at the end of the tunnel, which is affected by the intensity of the RGB source and by the two LEDs.
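To make the structure of this system concrete, the following sketch simulates a simplified version of the light-tunnel model (all coefficients and noise levels are our own illustrative choices, not calibrated to the real device): the coefficient of X_RGB varies with the polarizer angle θ_t, while the LED coefficients stay fixed over time.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
theta = rng.uniform(0, np.pi / 2, size=n)  # polarizer angle over time (toy)

X_rgb = rng.uniform(0, 500, size=n)
X_l31 = rng.uniform(0, 200, size=n)
X_l32 = rng.uniform(0, 200, size=n)

# Y depends on X_rgb through cos^2(theta) (a time-varying coefficient),
# while the LED coefficients b1, b2 are time-invariant (toy values)
b1, b2 = 0.05, 0.03
Y = np.cos(theta) ** 2 * X_rgb + b1 * X_l31 + b2 * X_l32 + rng.normal(0, 1, n)
```

Under this toy model, the coefficients of the two LEDs are time-invariant while the coefficient of X_RGB changes with θ, which is the kind of structure ISD aims to recover as a nontrivial invariant component.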
As described by Gamella et al. (2025), the effect of XRGB on Y is linear and depends on θ; more precisely, it holds that Y ∝ cos²(θ) XRGB. The dependence of Y on the two LEDs is instead independent of θ. We take Y as our response, consider the covariate vector [XRGB, XL31, XL32]^⊤ ∈ R³ and assume that the angle θ is unknown. Since we control the three light sources independently, and we expect the dependence of the infrared light Y on the two LEDs to remain the same across time, we hope to detect a nontrivial invariant subspace related to the two LED covariates. The available dataset contains 8000 observations, collected under changing values of the angle θ. The historical dataset contains the first 7000 observations, and the test dataset the remaining 1000. Figure 9 shows the dependence of the response on the three covariates, as well as the values of the response and of cos²(θ) through time.

Figure 8: Illustration of the light tunnel, see Section 5.3. The figure is taken from Gamella et al. (2025) (published under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)). The variables of interest in our experiment are the RGB values of the light source, the LED intensities L31 and L32, and the measurement I3 of the light sensor at the end of the tunnel.

Table 1: Normalized explained variance (R²) by β̂inv and comparison with β̂OLS and β̂mm: training (historical data) and zero-shot generalization (test data); see Section 5.3.

          R² (historical data)   R² (test data, zero-shot prediction)
  β̂inv    0.019                   0.062
  β̂OLS    0.558                 -10.703
  β̂mm     0.477                  -2.194

We apply ISD on the historical data to find an invariant component β̂inv.
For comparison, we also compute the maximin β̂mm (Bühlmann and Meinshausen, 2015) and the OLS solution β̂OLS on the historical data. We then use these estimated parameters for zero-shot prediction on the test data. Table 1 shows the R² coefficient, defined as in Section 5.1 as the fraction of explained variance

R^2 = \frac{\sum_{t=1}^{n} \big(\widehat{\mathrm{Var}}(Y_t) - \widehat{\mathrm{Var}}(Y_t - X_t^\top \hat{\beta})\big)}{\sum_{t=1}^{n} \widehat{\mathrm{Var}}(Y_t)},

on historical and test data. The invariant component β̂inv is the only estimator that achieves positive explained variance on the test data. This is because β̂inv only captures the parts of the variance that can be transferred to the test data, as can be seen from its lower explained variance on the historical data compared to β̂OLS. The estimated invariant subspace Ŝinv = span{[0.13910168, 0.95836843, 0.24936055]^⊤} has dimension 1 and shows in particular that most of the invariant information is encoded in the two LEDs, with the highest weight given to XL31. This is expected, since the LED intensities are not affected by the changing angle between the two polarizers. The dependence between Y and the two LEDs is small but significant (testing a reduced linear model without either L31 or L32 against the full model results in p-values smaller than 10⁻⁴, both for historical and test data). The relatively small R² is also expected: since the RGB light source is stronger than the two LEDs at the end of the tunnel, most of the variance in Y is explained by the non-invariant component XRGB (see Figure 9). However, it is not the case that the whole subspace spanned by the two covariates corresponding to the LEDs is invariant, as we would have assumed from the knowledge of the physical system. An explanation is that, due to the data collection process, there is nonzero observed correlation between XRGB and XL32, but not between XRGB and XL31 (on the historical data, we have corr(XRGB, XL31) = 0.004 with p-value 0.723 and corr(XRGB, XL32) = 0.107 with p-value smaller than 10⁻¹⁵).

Figure 9: Dataset of the experiment discussed in Section 5.3. The top four figures from the left show the dependence of the infrared measurement Y on the covariates and on time t (encoded by the colormap) for historical data. The angle of the polarizers changes over time (top right) and thus has an influence on the linear relationship between Y and XRGB (top left). For this experiment we assume that the angle θ is unknown. The second row shows the same quantities during test time, where the polarization angle is much smaller (bottom right).

We also run the adaptation step considering as adaptation data a rolling window of size m = 8 shifting through the test data. We show in Figure 10 the cumulative explained variance obtained by ISD, by the OLS solution computed on the same adaptation data, and by the OLS solution and the invariant component computed on historical data. The plot shows that γ̂ISD achieves the highest explained variance, with a small improvement on the rolling window OLS γ̂OLS. Indeed, in this particular example the size of the invariant subspace is small (dim(Sinv) = 1) and, by Theorem 3, we expect only a small improvement.

Figure 10: Cumulative explained variance on the adaptation data by the time adaptation estimators γ̂ISD and γ̂OLS and by the zero-shot predictors β̂inv and β̂OLS; see Section 5.3.

6. Conclusion

We propose Invariant Subspace Decomposition (ISD), a framework for invariance-based time adaptation.
Our method relies on the orthogonal decomposition of the parameter space into an invariant subspace Sinv and a residual subspace Sres, such that the maximizer of the explained variance over Sinv is time-invariant. The estimation of the invariant component βinv on a large historical dataset and the reduced dimensionality of Sres with respect to the original parameter space R^p allow the ISD estimator to improve on the prediction accuracy of existing estimation techniques. We provide finite sample guarantees for the proposed estimation method and additionally support the validity of our theoretical results through simulated experiments and a real-world data experiment. Future developments of this work may investigate the presented problem in the case of nonlinear time-varying models. For instance, Eastwood et al. (2023) show how to adapt invariant nonlinear features to heterogeneous environments in classification tasks; extending similar ideas to the regression setting considered here could be of interest. Future work may further study how to incorporate the ISD framework in specific applied settings such as contextual bandits.

Acknowledgments

We thank Juan Gamella for generating the data for the presented real-world example, and the three anonymous reviewers for the valuable comments. NP and ML are supported by a research grant (0069071) from Novo Nordisk Fonden.

References

K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. Advances in Neural Information Processing Systems, 29, 2016.

P. Bühlmann and N. Meinshausen. Magging: maximin aggregation for inhomogeneous large-scale data. Proceedings of the IEEE, 104(1):126–135, 2015.

L. De Lathauwer. Decompositions of a higher-order tensor in block terms — part II: Definitions and uniqueness. SIAM Journal on Matrix Analysis and Applications, 30(3):1033–1066, 2008.

J. Durbin and S. J. Koopman. Time Series Analysis by State Space Methods, volume 38. OUP Oxford, 2012.
C. Eastwood, S. Singh, A. L. Nicolicioiu, M. Vlastelica Pogančić, J. von Kügelgen, and B. Schölkopf. Spuriosity didn't kill the classifier: Using invariant predictions to harness spurious features. Advances in Neural Information Processing Systems, 36:18291–18324, 2023.

J. Fan and W. Zhang. Statistical methods with varying coefficient models. Statistics and its Interface, 1(1):179, 2008.

F. Feng, B. Huang, K. Zhang, and S. Magliacane. Factored adaptation for non-stationary reinforcement learning. Advances in Neural Information Processing Systems, 35:31957–31971, 2022.

C. Févotte and F. J. Theis. Pivot selection strategies in Jacobi joint block-diagonalization. In International Conference on Independent Component Analysis and Signal Separation, pages 177–184. Springer, 2007.

J. L. Gamella, J. Peters, and P. Bühlmann. Causal chambers as a real-world physical testbed for AI methodology. Nature Machine Intelligence, 7(1):107–118, Jan 2025. ISSN 2522-5839. doi: 10.1038/s42256-024-00964-x. URL https://doi.org/10.1038/s42256-024-00964-x.

W. Günther, U. Ninad, and J. Runge. Causal discovery for time series from multiple datasets with latent contexts. In Uncertainty in Artificial Intelligence, pages 766–776. PMLR, 2023.

H. W. Gutch and F. J. Theis. Uniqueness of linear factorizations into independent subspaces. Journal of Multivariate Analysis, 112:48–62, 2012.

T. Hastie and R. Tibshirani. Varying-coefficient models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 55(4):757–779, 1993.

R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2012.

B. Huang, K. Zhang, J. Zhang, J. Ramsey, R. Sanchez-Romero, C. Glymour, and B. Schölkopf. Causal discovery from heterogeneous/nonstationary data. Journal of Machine Learning Research, 21(89):1–53, 2020.

Z. Li, R. Cai, Z. Yang, H. Huang, G. Chen, Y. Shen, Z. Chen, X. Song, Z. Hao, and K. Zhang. When and how: Learning identifiable latent states for nonstationary time series forecasting. arXiv preprint arXiv:2402.12767, 2024.

S. Magliacane, T. van Ommen, T. Claassen, S. Bongers, P. Versteeg, and J. M. Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10846–10856. Curran Associates, Inc., 2018.

N. Meinshausen and P. Bühlmann. Maximin effects in inhomogeneous large-scale data. The Annals of Statistics, 43(4):1801–1830, 2015.

J. Mourtada. Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. The Annals of Statistics, 50(4):2157–2178, 2022.

K. Murota, Y. Kanno, M. Kojima, and S. Kojima. A numerical algorithm for block-diagonal decomposition of matrix algebras with application to semidefinite programming. Japan Journal of Industrial and Applied Mathematics, 27(1):125–160, 2010.

D. Nion. A tensor framework for nonunitary joint block diagonalization. IEEE Transactions on Signal Processing, 59(10):4585–4594, 2011.

J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):947–1012, 2016.

N. Pfister, P. Bühlmann, and J. Peters. Invariant causal prediction for sequential data. Journal of the American Statistical Association, 114(527):1264–1276, 2019a.

N. Pfister, S. Weichwald, P. Bühlmann, and B. Schölkopf. Robustifying independent component analysis by adjusting for group-wise stationary noise. Journal of Machine Learning Research, 20(147):1–50, 2019b.

N. Pfister, E. G. William, J. Peters, R. Aebersold, and P. Bühlmann. Stabilizing variable selection and regression. The Annals of Applied Statistics, 15(3):1220–1246, 2021.

M. Rojas-Carulla, B. Schölkopf, R. Turner, and J. Peters. Causal transfer in machine learning. Journal of Machine Learning Research, 19(36):1–34, 2018.

J. R. Schott. Matrix Analysis for Statistics. John Wiley & Sons, 2016.

P. Stojanov, Z. Li, M. Gong, R. Cai, J. Carbonell, and K. Zhang. Domain adaptation with invariant representation learning: What transformations to learn? In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 24791–24803. Curran Associates, Inc., 2021.

B. Sun, J. Feng, and K. Saenko. Correlation alignment for unsupervised domain adaptation. In Domain Adaptation in Computer Vision Applications, pages 153–171. Springer, 2017.

P. Tichavský and Z. Koldovský. Algorithms for nonorthogonal approximate joint block-diagonalization. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pages 2094–2098. IEEE, 2012.

P. Tichavský and A. Yeredor. Fast approximate joint diagonalization incorporating weight matrices. IEEE Transactions on Signal Processing, 57(3):878–891, 2008.

P. Tichavský, A. Yeredor, and Z. Koldovský. On computation of approximate joint block-diagonalization using ordinary AJD. In International Conference on Latent Variable Analysis and Signal Separation, pages 163–171. Springer, 2012.

W. Yao, G. Chen, and K. Zhang. Temporally disentangled representation learning. Advances in Neural Information Processing Systems, 35:26492–26503, 2022.

K. Zhang, M. Gong, and B. Schölkopf. Multi-source domain adaptation: A causal view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

H. Zhao, R. T. Des Combes, K. Zhang, and G. Gordon. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523–7532. PMLR, 2019.

Appendix A.
Supporting examples and remarks

A.1 Example of non-uniqueness of an irreducible orthogonal partition

Assume that for all t ∈ [n] the covariance matrix Σ_t of X_t takes one of the following two values (and each value is taken at least once in [n]):

\Sigma_1 = \mathrm{diag}(2, 2, 1), \qquad \Sigma_2 = \mathrm{diag}(3, 3, 2).

Define for all j ∈ {1, 2, 3} the linear spaces S_j := span{e_j}, where e_j is the j-th vector of the canonical basis. Then, since Σ_1 and Σ_2 are (block) diagonal, the partition S_1, S_2, S_3 is an irreducible orthogonal and (X_t)_{t∈[n]}-decorrelating partition. Consider now the orthonormal matrix

U = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} & 0 \\ -1/\sqrt{2} & 1/\sqrt{2} & 0 \\ 0 & 0 & 1 \end{pmatrix},

with columns u_1 = (e_1 − e_2)/√2, u_2 = (e_1 + e_2)/√2 and u_3 = e_3. Since the upper-left 2×2 blocks of Σ_1 and Σ_2 are multiples of the identity, it holds that U^⊤Σ_1U = Σ_1 and U^⊤Σ_2U = Σ_2. Therefore, the spaces S̃_j := span{u_j} also form an irreducible orthogonal and (X_t)_{t∈[n]}-decorrelating partition (this follows, for example, from Proposition 1), but {S_1, S_2, S_3} ≠ {S̃_1, S̃_2, S̃_3}. Then, if we assume for example that γ_{0,t} = (1, t, 1)^⊤, given the first partition we obtain Sinv = S_1 ⊕ S_3, whereas given the second partition it holds that Π_{S̃_1}γ_{0,t} = ((1 − t)/2)(1, −1, 0)^⊤ and Π_{S̃_2}γ_{0,t} = ((1 + t)/2)(1, 1, 0)^⊤, and therefore S̃inv = S̃_3, leading to Assumption 1 not being satisfied.

A.2 Example of quickly varying γ_{0,t} and zero-shot generalization

We repeat the same simulation described in Section 5.1 but now consider non-smooth variations of γ_{0,t}. More specifically, the only difference from the simulation described in Section 5.1 is that we let the 3 time-varying entries of γ_{0,t} (prior to its rotation by U) vary quickly with t: at each time point, each of the 3 entries is sampled uniformly in an interval of width 1 centered around a value which changes 20 times over the observed time horizon n. These centers are randomly sampled in [0, 1.2]. These choices of the sampling intervals ensure that the experiments are comparable for different sizes n of the historical data and that there is a shift in γ_{0,t} outside of the observed support in the test data (which is generated as in Section 5.1). This experiment supports Remark 2, showing that we do not need assumptions on the type of changes in γ_{0,t} in the historical data. In particular, Figure 11 shows that the MSE of β̂inv converges to zero for increasing values of n; that is, we are able to estimate the invariant component even when γ_{0,t} is quickly varying in the historical data. Moreover, Figure 12 shows the results in terms of the R² coefficient. In particular, for n large enough, the R² coefficient of the estimated invariant component β̂inv remains positive even for values of the test data that lie outside of the observed support.

Figure 11: MSE of β̂inv for increasing size of the historical data n. For larger values of n, the estimation of the invariant subspace decomposition becomes more precise and leads to smaller errors in the estimated invariant component β̂inv.

Figure 12: Normalized explained variance (R²) by β̂inv and comparison with βinv, β̂mm and β̂OLS: training (historical data, left) and zero-shot generalization (test data, right), for different sizes n of the historical data. The dashed line indicates the population value of the (normalized) explained variance by βinv. In the historical data, γ_{0,t} is quickly varying.

A.3 Methods and computational complexity of joint block diagonalization

In the context of joint block diagonalization, we can differentiate between methods that solve the exact problem (JBD), i.e., for which the transformed matrices have exactly zero off-block-diagonal entries, and approximate methods (AJBD), which assume the presence of noise and aim to minimize the off-block-diagonal entries without necessarily setting them to zero. JBD is in general an easier problem, and algorithms that solve it have been shown to achieve polynomial complexity (see, e.g., Murota et al., 2010; Tichavský et al., 2012).
Many of these methods, e.g., the one presented by Murota et al. (2010), are based on eigenvalue decompositions. Alternatively, as shown by Tichavský et al. (2012), some algorithms that solve the problem of approximate joint diagonalization (AJD) of a set of matrices, such as uwedge developed by Tichavský and Yeredor (2008), can also be used for JBD. In more detail, a solution to AJD is a matrix that maximally jointly diagonalizes a set of matrices by minimizing the average value of the off-diagonal entries; if the matrices in the set cannot be exactly jointly diagonalized, the transformed matrices will have some non-zero off-diagonal elements. Adding an appropriate permutation of the columns of the joint diagonalizer allows one to reorganize the non-zero off-diagonal elements into blocks, leading to jointly block diagonal matrices: in Section 4.3 of their work, Gutch and Theis (2012) argue that if the matrices to be block diagonalized are symmetric, then the solution found in this way is also an optimal JBD solution. However, in general, methods for JBD cannot be directly applied to solve AJBD. Algorithms that solve AJBD directly, based on the iterative optimization of a cost function via matrix rotations, have been developed, for example, by Tichavský and Koldovský (2012) and Févotte and Theis (2007), but they require the number of diagonal blocks to be known in advance. Alternatively, and similarly to how JBD can be solved by AJD methods, one can also, with some slight modifications, use AJD methods to solve AJBD. More specifically, some heuristics need to be used to determine the size of the blocks: these can consist, for example, in setting a threshold for the non-zero off-block-diagonal elements in the transformed matrices. In the (orthogonal) settings considered here, we have found this last approach to work effectively. More specifically, in Section 4.1.1, we denoted by V the AJD solution for the set of estimated covariance matrices {Σ̂_k}_{k=1}^K.
Similarly to how Tichavský and Koldovský (2012) suggest determining a permutation of the AJD result, we proceed in the following way. To discriminate the non-zero off-block-diagonal entries in these matrices, we start by computing the following auxiliary matrix using V:

\bar{\Sigma} := \max_{k \in \{1,\ldots,K\}} |V^\top \hat{\Sigma}_k V|,

where the maximum and the absolute value are taken element-wise. The matrix Σ̄ captures in its off-diagonal entries the residual correlation among the components identified by AJD, over all K jointly diagonalized matrices. For all thresholds τ ∈ R, we let P(τ) ∈ R^{p×p} denote one of the permutation matrices satisfying that P(τ)^⊤ Σ̄ P(τ) is a block diagonal matrix if all entries smaller than τ are considered zero. We then define the optimal threshold τ* by

\tau^* \in \arg\min_{\tau} \; \frac{1}{K} \sum_{k=1}^{K} \mathrm{OBD}_\tau\!\big(|(V P(\tau))^\top \hat{\Sigma}_k (V P(\tau))|\big) + \nu \, \frac{\mathrm{pbd}(\tau)}{p^2},

where OBD_τ(·) denotes the average value of the off-block-diagonal entries (determined by the threshold τ) of a matrix, pbd(τ) is the total number of entries in the blocks induced by τ, and ν ∈ R is a regularization parameter. The penalization term pbd(τ)/p² is introduced to avoid always selecting a zero threshold, and the regularization parameter is set to ν = (1/K) Σ_{k=1}^K λ_min(Σ̂_k), where λ_min(·) denotes the minimum eigenvalue. The optimal permutation is then P* := P(τ*) and the estimated irreducible joint block diagonalizer is Û = V P*.

A.4 Threshold selection for opt-invariant subspaces

In the simulations, we select the threshold λ in (19) by cross-validation. In more detail, we define the grid of possible thresholds

\Lambda := \Big\{ \tfrac{1}{K} \sum_{k=1}^{K} |\hat{c}^{1}_k|, \; \ldots, \; \tfrac{1}{K} \sum_{k=1}^{K} |\hat{c}^{\,q^{\max}_{\hat U}}_k| \Big\},

where, for all j ∈ {1, ..., q^max_Û} and k ∈ {1, ..., K},

\hat{c}^{j}_k := \widehat{\mathrm{Corr}}\big(Y_k - X_k(\Pi_{\hat{S}_j}\hat{\gamma}), \; X_k(\Pi_{\hat{S}_j}\hat{\gamma})\big).

We then split the historical data into L = 10 disjoint blocks of observations, and for all ℓ ∈ {1, ..., L} denote by X_ℓ and Y_ℓ the observations in the ℓ-th block and by X_{−ℓ} and Y_{−ℓ} the remaining historical data. For all possible thresholds λ ∈ Λ we proceed in the following way.
For all folds ℓ ∈ {1, ..., L}, we compute an estimate of the invariant component as in (20) using X_{−ℓ} and Y_{−ℓ}, which we denote by β̂inv,−ℓ(λ). Inside the left-out ℓ-th block of observations, we then consider a rolling window of length d = 2p and the observation at t* immediately following the rolling window: we compute the residual parameter δ̂res_{t*}(λ) as in (22) using the d observations in the rolling window, and evaluate the empirical explained variance by β̂inv,−ℓ(λ) + δ̂res_{t*}(λ) on the observation at t*, i.e.,

\widehat{\mathrm{Var}}_{t^*}(\lambda) := (Y_{t^*})^2 - \big(Y_{t^*} - X_{t^*}^\top (\hat{\beta}^{\mathrm{inv},-\ell}(\lambda) + \hat{\delta}^{\mathrm{res}}_{t^*}(\lambda))\big)^2.

We repeat this computation for all possible t* ∈ I_ℓ, where I_ℓ denotes the time points in the ℓ-th block of observations excluding the first d observations, and define

\overline{\mathrm{Var}}{}^{\lambda}_{\ell} := \frac{1}{|I_\ell|} \sum_{t^* \in I_\ell} \widehat{\mathrm{Var}}_{t^*}(\lambda).

For all λ ∈ Λ, we denote the average explained variance over the L folds by V̄ar^λ := (1/L) Σ_{ℓ=1}^L V̄ar^λ_ℓ and the standard error (across the L folds) of this explained variance by

\mathrm{se}(\overline{\mathrm{Var}}{}^{\lambda}) := \frac{1}{L} \sqrt{\sum_{\ell=1}^{L} \big(\overline{\mathrm{Var}}{}^{\lambda}_{\ell} - \overline{\mathrm{Var}}{}^{\lambda}\big)^2}.

Moreover, let λmax := arg max_{λ∈Λ} V̄ar^λ. Then, we choose the optimal threshold as

\lambda^* := \min\big\{ \lambda \in \Lambda \; : \; \overline{\mathrm{Var}}{}^{\lambda} > \overline{\mathrm{Var}}{}^{\lambda_{\max}} - t_{\mathrm{se}} \, \mathrm{se}(\overline{\mathrm{Var}}{}^{\lambda_{\max}}) \big\},

which is the most conservative (lowest) threshold such that the corresponding explained variance is within t_se (in our simulations, we choose t_se = 1) standard errors (computed across the folds) of the maximal explained variance.

A.5 Further simulation details: MSPE comparison

In Section 5.2 we present a simulated experiment in which we compare the ISD estimator and the OLS estimator on the time adaptation task. Figure 6 shows that the difference in the MSPE of γ̂OLS_t and γ̂ISD_t is positive and decreases for increasing values of m. To further support the statement of Theorem 3, we show in Figure 13 the value of this difference against σ²_ad dim(Sinv)/m, when computing γ̂ISD_t both with the estimated and the oracle invariant component.
The figure shows that the difference in the MSPEs indeed satisfies the bound stated in Theorem 3, i.e., it is always greater than σ²_ad dim(Sinv)/m. Moreover, it shows that for small values of m the gain from using the ISD estimator over the OLS is even higher than what the theoretical bound suggests, indicating that the bound is not sharp for small m. We further show in Figure 14 the MSPE of γ̂ISD_t (again computed both with the estimated and the oracle invariant component) against σ²_ad dim(Sres)/m, obtaining in this case an empirical confirmation of the first bound presented in Theorem 3.

Figure 13: Difference in the MSPE of γ̂OLS_t and γ̂ISD_t (and an oracle version of γ̂ISD_t based on the true βinv) with respect to σ²_ad dim(Sinv)/m for different values of m. The computed difference is larger than σ²_ad dim(Sinv)/m for all values of m (in the oracle case), empirically confirming the lower bound obtained in Theorem 3. The filled dots in the boxplots show the mean over 20 runs (while the empty dots represent the outliers).

Figure 14: MSPE of γ̂ISD_t (and an oracle version of γ̂ISD_t based on the true βinv) with respect to σ²_ad dim(Sres)/m for different values of m. The computed MSPE is smaller than σ²_ad dim(Sres)/m for all values of m (in the oracle case), empirically confirming the upper bound obtained in Theorem 3. The filled dots in the boxplots show the mean over 20 runs (while the empty dots represent the outliers).

Appendix B. ISD estimation algorithm

We provide here the pseudocode summarizing the ISD procedure described in Section 4. The algorithm includes the estimation of the intercept, that is, it considers for all t ∈ N the model

Y_t = \gamma^0_{0,t} + X_t^\top \gamma_{0,t} + \epsilon_t \qquad (25)

satisfying the assumptions of Setting 1 but with the addition of γ⁰_{0,t} ∈ R.
In the estimation of the intercept, the algorithm distinguishes between two cases: (a) the intercept remains approximately constant on [n] ∪ I_ad, and (b) the intercept changes with time but is assumed to be approximately constant on I_ad. In case (a), the computation of γ̂⁰_{t*} in line 21 of Algorithm 2 is done by averaging the estimated values of the (approximately constant) intercept on the historical data. Alternatively, in case (a) we could estimate the intercept simultaneously with the invariant component βinv: the whole vector (including the intercept) can be computed by taking the OLS solution of regressing Y on [1_n  X Ûinv] and premultiplying it by [e_1  Ûinv] (where e_1 is the p-dimensional vector with the first entry equal to 1 and the remaining entries equal to zero), in place of lines 16 and 21 of Algorithm 2 (see Remark 4). In case (b), the intercept can instead be estimated simultaneously with the residual component δ̂res_{t*}: the computation in line 22 of the algorithm is equivalent to taking the first component of the OLS solution of regressing Y_ad − X_ad β̂inv on [1_m  X_ad Ûres], premultiplied by [e_1  Ûres].

Remark 4 Recall that the population OLS solution for the linear model (25) at time t can be found by adding a 1 to the vector X_t and solving

\arg\min_{\gamma^0, \gamma} \; \mathbb{E}\Big[\Big(Y_t - [1 \;\; X_t^\top] \begin{pmatrix} \gamma^0 \\ \gamma \end{pmatrix}\Big)^2\Big].

The problem has the closed-form solution

\begin{pmatrix} 1 & \mathbb{E}[X_t]^\top \\ \mathbb{E}[X_t] & \mathbb{E}[X_t X_t^\top] \end{pmatrix}^{-1} \begin{pmatrix} \mathbb{E}[Y_t] \\ \mathbb{E}[X_t Y_t] \end{pmatrix} = \begin{pmatrix} \mathbb{E}[Y_t] - \mathbb{E}[X_t]^\top \mathrm{Var}(X_t)^{-1} \mathrm{Cov}(X_t, Y_t) \\ \mathrm{Var}(X_t)^{-1} \mathrm{Cov}(X_t, Y_t) \end{pmatrix}.

Algorithm 2 ISD: estimation
Input: observations (X_t, Y_t)_{t∈[n]∪I_ad}, X_{t*}, number of windows K, λ ∈ [0, 1]
Output: β̂inv, δ̂res_{t*}, γ̂_{t*}, γ̂⁰_{t*}
1: (X_k)_{k∈[K]} ← ([X_{(k−1)n/K+1}, ..., X_{kn/K}]^⊤)_{k∈[K]}
2: (Y_k)_{k∈[K]} ← ([Y_{(k−1)n/K+1}, ..., Y_{kn/K}]^⊤)_{k∈[K]}
3: (Σ̂_k)_{k∈[K]} ← (V̂ar(X_k))_{k∈[K]}
4: for k = 1, ..., K do
5:     (γ̂⁰_k, γ̂_k) ← OLS(Y_k, [1  X_k])
6: Û, {Ŝ_j}_{j=1}^q ← approxIrreducibleJointBlockDiagonalizer((Σ̂_k)_{k∈[K]})   ▷ see Sec. 4.1.1
7: Find the opt-invariant subspaces to estimate Sinv, Sres:
8: Ŝinv ← {0}, Ŝres ← {0}
9: for j = 1, ..., q do
10:     Π_{Ŝ_j} ← Û_{S_j}(Û_{S_j})^⊤
11:     (ĉ^j_k)_{k∈[K]} ← (Ĉorr(Y_k − X_k(Π_{Ŝ_j}γ̂), X_k(Π_{Ŝ_j}γ̂)))_{k∈[K]}
12:     if (1/K) Σ_{k=1}^K |ĉ^j_k| ≤ λ then Ŝinv ← Ŝinv ⊕ Ŝ_j   ▷ see Eq. (19)
13:     else Ŝres ← Ŝres ⊕ Ŝ_j
14: Invariant component estimation:
15: X ← [X_1 ... X_n]^⊤, Y ← [Y_1 ... Y_n]^⊤
16: β̂inv ← Ûinv V̂ar(X Ûinv)^{−1} Ĉov(X Ûinv, Y)   ▷ see Eq. (21)
17: Adaptation step:
18: X_ad ← [(X_t)_{t∈I_ad}]^⊤, Y_ad ← [(Y_t)_{t∈I_ad}]^⊤
19: δ̂res_{t*} ← Ûres V̂ar(X_ad Ûres)^{−1} Ĉov(X_ad Ûres, Y_ad − X_ad β̂inv)   ▷ see Eq. (22)
20: Intercept estimation:
21: if {γ̂⁰_k}_{k∈[K]} approximately constant then γ̂⁰_{t*} ← mean of {γ̂⁰_k}_{k∈[K]}
22: else γ̂⁰_{t*} ← Ê[Y_ad − X_ad β̂inv] − Ê[X_ad]^⊤ δ̂res_{t*}
23: γ̂_{t*} ← β̂inv + δ̂res_{t*}   ▷ see Eq. (23)

Appendix C. Extension to non-orthogonal subspaces

In Section 2.1 we defined an orthogonal and (X_t)_{t∈[n]}-decorrelating partition as a collection {S_j}_{j∈{1,...,q}} of pairwise orthogonal linear subspaces of R^p satisfying (6). Finding the orthogonal subspaces that form the partition means, in particular, finding a rotation of the original X-space such that each subspace is spanned by a subset of the rotated axes, and the coordinates of the predictors projected onto one subspace are uncorrelated with those in the remaining subspaces. In this section, we show that orthogonality of the subspaces in the partition is not strictly required to obtain a separation of the true time-varying parameter of the form (7); that is, more general invertible linear transformations can be considered besides rotations. In particular, we briefly present results similar to the ones obtained throughout Section 2, but where the subspaces in the partition are not necessarily orthogonal. To do so, we call a collection of (not necessarily orthogonal) linear subspaces S_1, ..., S_q ⊆ R^p with ⊕_{j=1}^q S_j = R^p satisfying (6) a (X_t)_{t∈[n]}-decorrelating partition (of cardinality q), and further call it irreducible if it is of maximal cardinality.
A (Xt)_{t∈[n]}-decorrelating partition can still be identified through a joint block diagonalization of the covariance matrices (Σt)_{t∈[n]} as described in Section 2.1.1, but with an adjustment. More specifically, instead of assuming that the joint diagonalizer U is an orthogonal matrix, we only assume that it is invertible and that, for all j ∈ {1, . . . , q}, the columns of USj are orthonormal vectors. A version of Proposition 1 in which the resulting partition is not necessarily orthogonal follows with the same proof. Moreover, similarly to the orthogonal case, the uniqueness of an irreducible (Xt)_{t∈[n]}-decorrelating partition is implied by the uniqueness of an irreducible non-orthogonal joint block diagonalizer for (Σt)_{t∈[n]}; explicit conditions under which such uniqueness holds can be found, for example, in the work by Nion (2011). In the results presented in the remainder of this section we adopt the notation introduced in Section 2.1.1, and we additionally define the matrix W := U^{−⊤}. A (Xt)_{t∈[n]}-decorrelating partition of cardinality q allows us to decompose the true time-varying parameter into the sum of q components. To obtain such a decomposition via non-orthogonal subspaces, oblique projections need to be considered in place of orthogonal ones. Oblique projections are defined (see, e.g., Schott, 2016) for two subspaces S1, S2 ⊆ Rp such that S1 ⊕ S2 = Rp and a vector x ∈ Rp as the vectors x1 ∈ S1 and x2 ∈ S2 such that x = x1 + x2: x1 is called the projection of x onto S1 along S2, and x2 the projection of x onto S2 along S1. For a (Xt)_{t∈[n]}-decorrelating partition {Sj}^q_{j=1}, we denote by PSj|S−j the oblique projection matrix onto Sj along ⊕_{i∈{1,...,q}\{j}} Si; it can be expressed in terms of a (non-orthogonal) joint block diagonalizer U corresponding to the partition as PSj|S−j = USj(WSj)⊤. Orthogonal and (Xt)_{t∈[n]}-decorrelating partitions {Sj}^q_{j=1} are a special case of (Xt)_{t∈[n]}-decorrelating partitions.
In particular, if the subspaces are pairwise orthogonal, it holds for all j ∈ {1, . . . , q} that PSj|S−j = ΠSj. By the definition of oblique projections, for all t ∈ [n], we can express γ0,t as γ0,t = Σ^q_{j=1} PSj|S−j γ0,t. Similarly to how we define opt-invariance on [n] for orthogonal subspaces in Section 2.1, we say that a subspace Sj in a (Xt)_{t∈[n]}-decorrelating partition is proj-invariant on [n] if it satisfies for all t, s ∈ [n] that PSj|S−j γ0,t = PSj|S−j γ0,s. By Lemma 1 it follows that for orthogonal partitions proj-invariance is equivalent to opt-invariance. For an irreducible (Xt)_{t∈[n]}-decorrelating partition, we now define the invariant subspace Sinv and the residual subspace Sres as

Sinv := ⊕_{j∈{1,...,q}: Sj proj-invariant on [n]} Sj and Sres := ⊕_{j∈{1,...,q}: Sj not proj-invariant on [n]} Sj.

It follows directly from the definition of partitions that {Sinv, Sres} is a (Xt)_{t∈[n]}-decorrelating partition. Moreover, Sinv is proj-invariant on [n] since PSinv|Sres γ0,t = Σ_{j∈{1,...,q}: Sj proj-invariant on [n]} PSj|S−j γ0,t. We finally define the invariant and residual components by βinv := PSinv|Sres γ̄0 and δres_t := PSres|Sinv γ0,t. We show in the following proposition that the expressions (14) and (16), used to construct our estimators for the invariant and residual components in the case of orthogonal partitions, remain valid in the case of non-orthogonal partitions.

Proposition 3 Let {Sj}^q_{j=1} be a (Xt)_{t∈[n]}-decorrelating partition and let U be a joint block diagonalizer corresponding to that partition. Then, the following results hold.
(i) For all t ∈ [n] and for all j ∈ {1, . . . , q},
PSj|S−j γ0,t = USj Var((USj)⊤ Xt)⁻¹ Cov((USj)⊤ Xt, Yt). (26)
(ii) βinv is time-invariant over [n] and βinv = Uinv((Uinv)⊤ V̄ar(X) Uinv)⁻¹ (Uinv)⊤ C̄ov(X, Y).
(iii) For all t ∈ [n], δres_t = Ures((Ures)⊤ Var(Xt) Ures)⁻¹ (Ures)⊤ Cov(Xt, Yt − Xt⊤ βinv).
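Before the proofs, the algebra behind these oblique projections can be sanity-checked numerically. The sketch below is illustrative only (not the paper's code; it assumes numpy, and the matrix U and the index sets are arbitrary choices): it verifies the defining identities of PSj|S−j = USj(WSj)⊤ with W = U^{−⊤}.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 4))        # generic invertible U; its column groups define subspaces
W = np.linalg.inv(U).T             # W := U^{-T}, so that W.T @ U = I
blocks = [[0, 1], [2, 3]]          # column index sets S1, S2

# Oblique projections P_{Sj | S_-j} = U^{Sj} (W^{Sj})^T
P = [U[:, S] @ W[:, S].T for S in blocks]

# Idempotency, complementarity and the direct-sum identity P1 + P2 = I.
assert np.allclose(P[0] @ P[0], P[0])
assert np.allclose(P[0] @ P[1], 0)
assert np.allclose(P[0] + P[1], np.eye(4))

# Each P_j acts as the identity on its subspace and annihilates the other one.
assert np.allclose(P[0] @ U[:, [0, 1]], U[:, [0, 1]])
assert np.allclose(P[0] @ U[:, [2, 3]], 0)
print("oblique projection identities verified")
```

These identities follow directly from W⊤U = Ip: the (Si, Sj) blocks of W⊤U are identity for i = j and zero otherwise, which is exactly what the assertions above check.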
Proposition 3 implies in particular that, apart from the differences in the joint block diagonalization step, estimating βinv and δres_t in the non-orthogonal case can be done as described in Section 4. Moreover, the parameter βinv defined for non-orthogonal (Xt)_{t∈[n]}-decorrelating partitions is still a time-invariant parameter. Under a generalization assumption analogous to Assumption 2, βinv has positive explained variance at all time points t ∈ N and can be used to at least partially predict γ0,t (we have not added this result explicitly). In addition, also in the non-orthogonal case, the estimation of the residual component only requires estimating a reduced number of parameters, namely dim(Sres). In the case of non-orthogonal partitions, however, we cannot directly interpret ISD as separating the true time-varying parameter into two separate optimizations of the explained variance over Sinv and Sres.

Example 2 (non-orthogonal irreducible partition) Let Xt ∈ R3 with covariance matrix Σt that for all t ∈ [n] takes one of the following values:

[1 0.5 0; 0.5 1 0; 0 0 1] or [4 2 0; 2 7/2 0; 0 0 1].

These matrices are in block diagonal form, and in particular the 2 × 2 submatrices forming the first diagonal block do not commute. This implies that an irreducible (orthogonal) joint block diagonalizer for these matrices is the three-dimensional identity matrix I3, and the diagonal blocks cannot be further reduced into smaller blocks using an orthogonal transformation. It also implies that an irreducible orthogonal and (Xt)_{t∈[n]}-decorrelating partition is given by {Sj}^2_{j=1} with S1 := span{e1, e2} and S2 := span{e3}, where ej is the j-th vector of the canonical basis of R3. There exists, however, a non-orthogonal joint diagonalizer for these matrices, that is, a non-orthogonal matrix U such that Σ̃t := U⊤ Σt U is diagonal.
It is given by

U = [1 1/√5 0; 0 −2/√5 0; 0 0 1].

U induces an irreducible (Xt)_{t∈[n]}-decorrelating partition {S̃j}^3_{j=1} with S̃1 := span(US1), S̃2 := span(US2) and S̃3 := span(US3). Let Sinv and S̃inv be the invariant subspaces associated with the irreducible orthogonal partition {Sj}^2_{j=1} and the irreducible partition {S̃j}^3_{j=1}, respectively. As any irreducible orthogonal and (Xt)_{t∈[n]}-decorrelating partition is also a (Xt)_{t∈[n]}-decorrelating partition, it in general holds that dim(Sinv) ≤ dim(S̃inv). Moreover, as in the explicit example above, the inequality can be strict.

Appendix D. Auxiliary results

Lemma 3 Let B ∈ Rm×m be a symmetric invertible matrix, U ∈ Rm×m an orthogonal block diagonalizer of B and S1, . . . , Sq ⊆ {1, . . . , m} disjoint subsets satisfying

U⊤BU = diag((US1)⊤BUS1, . . . , (USq)⊤BUSq).

Then it holds for all j ∈ {1, . . . , q} that (ΠSj B ΠSj)† = ΠSj B⁻¹ ΠSj, where ΠSj := USj(USj)⊤.

Proof The pseudo-inverse A† of a matrix A is defined as the unique matrix satisfying (i) AA†A = A, (ii) A†AA† = A†, (iii) (AA†)⊤ = AA† and (iv) (A†A)⊤ = A†A. Fix j ∈ {1, . . . , q} and define A := ΠSj B ΠSj and A† := ΠSj B⁻¹ ΠSj. Moreover, define B̃ := U⊤BU and, for all k ∈ {1, . . . , q}, B̃k := (USk)⊤BUSk. We now verify that conditions (i)–(iv) hold for A and A†, and hence A† is indeed the pseudo-inverse. Conditions (ii) and (iv) hold by symmetry of B and ΠSj. For (i) and (iii), first observe that by the properties of orthogonal matrices it holds that B̃⁻¹ = U⊤B⁻¹U and, due to the block diagonal structure of B̃,

diag(B̃1⁻¹, . . . , B̃q⁻¹) = [(US1)⊤; . . . ; (USq)⊤] B⁻¹ [US1, . . . , USq]. (27)

Hence we get that B̃j⁻¹ = (USj)⊤B⁻¹USj. For (i), we now get

ΠSj B ΠSj ΠSj B⁻¹ ΠSj ΠSj B ΠSj = USj(USj)⊤BUSj(USj)⊤B⁻¹USj(USj)⊤BUSj(USj)⊤ = USj B̃j B̃j⁻¹ B̃j (USj)⊤ = ΠSj B ΠSj.

Similarly, for (iii) we get

(ΠSj B ΠSj ΠSj B⁻¹ ΠSj)⊤ = (USj(USj)⊤BUSj(USj)⊤B⁻¹USj(USj)⊤)⊤ = (USj B̃j B̃j⁻¹ (USj)⊤)⊤ = (ΠSj)⊤ = ΠSj B ΠSj ΠSj B⁻¹ ΠSj.

This completes the proof of Lemma 3.

Lemma 4 Let N ⊆ ℕ and let {Sj}^q_{j=1} be an orthogonal and (Xt)_{t∈N}-decorrelating partition.
Then, there exists a joint block diagonalizer of (Σt)_{t∈N}. More precisely, there exists an orthonormal matrix U ∈ Rp×p such that for all t ∈ N the matrix Σ̃t := U⊤ Σt U is block diagonal with q diagonal blocks Σ̃t,j = (USj)⊤ Σt USj, j ∈ {1, . . . , q}, of dimension |Sj| = dim(Sj), where Sj ⊆ {1, . . . , p} indexes a subset of the columns of U. Moreover, ΠSj = USj(USj)⊤.

Proof Let U = (u1, . . . , up) ∈ Rp×p be an orthonormal matrix with columns u1, . . . , up such that for all j ∈ {1, . . . , q} there exists Sj ⊆ {1, . . . , p} such that Sj = span({uk | k ∈ Sj}). Such a matrix U can be constructed by selecting an orthonormal basis for each of the disjoint subspaces {Sj}^q_{j=1}. Furthermore, assume that the columns of U are ordered in such a way that for all i, j ∈ {1, . . . , q} with i < j it holds for all k ∈ Si and ℓ ∈ Sj that k < ℓ. As the matrix U is orthogonal, it holds for all j ∈ {1, . . . , q} that the projection matrix ΠSj can be expressed as ΠSj = USj(USj)⊤ and hence, using the definition of an orthogonal partition (see (6)), it holds for all t ∈ N that

Σ̃t := U⊤ Σt U = diag(Σ̃t,1, . . . , Σ̃t,q),

where for all j ∈ {1, . . . , q} we defined Σ̃t,j := (USj)⊤ Σt USj.

Lemma 5 Let {Sj}^q_{j=1} be an orthogonal and (Xt)_{t∈[n]}-decorrelating partition. Then it holds for all t ∈ [n] and for all j ∈ {1, . . . , q} that

Var(ΠSj Xt)† = ΠSj Var(Xt)⁻¹ ΠSj. (28)

Proof Let U ∈ Rp×p be an orthonormal matrix such that, for all t ∈ [n], Σ̃t := U⊤ Σt U is block diagonal with q diagonal blocks Σ̃t,1, . . . , Σ̃t,q of dimensions dim(S1), . . . , dim(Sq). Such a matrix exists by Lemma 4, each diagonal block is given by Σ̃t,j := (USj)⊤ Σt USj, and the projection matrix ΠSj can be expressed as ΠSj = USj(USj)⊤. The statement then follows from Lemma 3.

D.1 Proof of Lemma 1

Proof Let U ∈ Rp×p be an orthonormal matrix such that, for all t ∈ [n], Σ̃t := U⊤ Σt U is block diagonal with q diagonal blocks Σ̃t,1, . . . , Σ̃t,q of dimensions dim(S1), . . . , dim(Sq).
Such a matrix exists by Lemma 4, each diagonal block is given by Σ̃t,j := (USj)⊤ Σt USj, and the projection matrix ΠSj can be expressed as ΠSj = USj(USj)⊤. Now, for an arbitrary t ∈ [n], it holds using the linear model (1) that γ0,t = Σt⁻¹ Cov(Xt, Yt), and hence we use U to get the following expansion:

γ0,t = Σt⁻¹ Cov(Xt, Yt) = U Σ̃t⁻¹ U⊤ Cov(Xt, Yt) = U diag(Σ̃t,1⁻¹, . . . , Σ̃t,q⁻¹) U⊤ Cov(Xt, Yt) = U [Σ̃t,1⁻¹ (US1)⊤ Cov(Xt, Yt); . . . ; Σ̃t,q⁻¹ (USq)⊤ Cov(Xt, Yt)].

By the properties of orthogonal matrices it holds that Σ̃t⁻¹ = U⊤ Σt⁻¹ U and, due to the block diagonal structure of Σ̃t,

diag(Σ̃t,1⁻¹, . . . , Σ̃t,q⁻¹) = [(US1)⊤; . . . ; (USq)⊤] Σt⁻¹ [US1, . . . , USq]. (29)

This implies that Σ̃t,j⁻¹ = (USj)⊤ Σt⁻¹ USj and therefore

γ0,t = Σ^q_{j=1} USj(USj)⊤ Σt⁻¹ USj(USj)⊤ Cov(Xt, Yt) = Σ^q_{j=1} ΠSj Σt⁻¹ ΠSj Cov(Xt, Yt) = Σ^q_{j=1} ΠSj Σt⁻¹ ΠSj ΠSj Cov(Xt, Yt) = Σ^q_{j=1} Var(ΠSj Xt)† Cov(ΠSj Xt, Yt).

In the second to last equality, we have used that ΠSj is a projection matrix and thus idempotent. In the last equality we have used that Var(ΠSj Xt)† = ΠSj Σt⁻¹ ΠSj, by Lemma 5. It now suffices to show that ΠSj γ0,t = Var(ΠSj Xt)† Cov(ΠSj Xt, Yt), which follows from the computation

ΠSj γ0,t = ΠSj Σ^q_{i=1} ΠSi Var(Xt)⁻¹ ΠSi Cov(ΠSi Xt, Yt) = ΠSj Var(Xt)⁻¹ ΠSj Cov(ΠSj Xt, Yt) = Var(ΠSj Xt)† Cov(ΠSj Xt, Yt),

where the first equality follows from the first part of this proof and the third equality follows from Lemma 5.

D.2 Proof of Lemma 2

Proof For all j ∈ {1, . . . , q}, for all β ∈ Sj and for all t ∈ N it holds that

Vart(β) = 2 Cov(Yt, Xt)β − β⊤ Var(Xt)β = 2 Cov(Yt, Σ^q_{k=1} ΠSk Xt) ΠSj β − β⊤ ΠSj Var(Σ^q_{k=1} ΠSk Xt) ΠSj β = 2 Cov(Yt, ΠSj Xt) ΠSj β − β⊤ ΠSj Var(ΠSj Xt) ΠSj β = 2 Cov(Yt, ΠSj Xt)β − β⊤ Var(ΠSj Xt)β, (30)

where the first equality follows by (4) and the third equality follows from the definition of an orthogonal partition.
It follows that

∇Vart(β) = 2 Cov(ΠSj Xt, Yt) − 2 Var(ΠSj Xt)β,

where ∇ denotes the gradient. The equation ∇Vart(β) = 0 has a unique solution in Sj given by Var(ΠSj Xt)† Cov(ΠSj Xt, Yt). To see this, observe that all solutions in Rp are given, for an arbitrary vector w ∈ Rp, by

Var(ΠSj Xt)† Cov(ΠSj Xt, Yt) + (Ip − Var(ΠSj Xt)† Var(ΠSj Xt))w, (31)

where Ip denotes the identity matrix. Let U ∈ Rp×p and (Σ̃t)_{t∈N} be defined for the orthogonal and (Xt)_{t∈N}-decorrelating partition as in Lemma 4. We now observe that

Var(ΠSj Xt)† Var(ΠSj Xt) = ΠSj Σt⁻¹ ΠSj ΠSj Σt ΠSj (32) = USj(USj)⊤ Σt⁻¹ USj(USj)⊤ Σt USj(USj)⊤ = USj Σ̃t,j⁻¹ Σ̃t,j (USj)⊤ = ΠSj,

where the first equality follows from Lemma 3, since U jointly block diagonalizes the matrices (Σt)_{t∈N} by Lemma 4. We can therefore rewrite (31) as Var(ΠSj Xt)† Cov(ΠSj Xt, Yt) + (Ip − ΠSj)w. For all w ∈ Sj, this expression equals Var(ΠSj Xt)† Cov(ΠSj Xt, Yt). For all w ∉ Sj, it is not in Sj. This concludes the proof of Lemma 2.

Lemma 6 {Sinv, Sres} is an orthogonal and (Xt)_{t∈[n]}-decorrelating partition.

Proof Let {Sj}^q_{j=1} be a fixed irreducible orthogonal and (Xt)_{t∈[n]}-decorrelating partition according to which Sinv and Sres are defined. By orthogonality of the subspaces in the partition and by definition of Sinv and Sres, Sres = (Sinv)⊥. Moreover, by definition of an orthogonal and (Xt)_{t∈[n]}-decorrelating partition, it holds that

Cov(ΠSinv Xt, ΠSres Xt) = Cov(Σ_{j∈{1,...,q}: Sj opt-invariant on [n]} ΠSj Xt, Σ_{j∈{1,...,q}: Sj not opt-invariant on [n]} ΠSj Xt) = 0.

Lemma 7 Sinv is opt-invariant on [n].

Proof That Sinv is opt-invariant on [n] can be seen from the following computation. It holds for all t, s ∈ [n] that

argmax_{β∈Sinv} Vart(β) = Var(ΠSinv Xt)† Cov(ΠSinv Xt, Yt) = Σ_{j∈{1,...,q}: Sj opt-invariant on [n]} Var(ΠSj Xt)† Cov(ΠSj Xt, Yt) = Σ_{j∈{1,...,q}: Sj opt-invariant on [n]} argmax_{β∈Sj} Vart(β) = Σ_{j∈{1,...,q}: Sj opt-invariant on [n]} argmax_{β∈Sj} Vars(β) = argmax_{β∈Sinv} Vars(β).
The first equality holds by Lemma 2, since {Sinv, Sres} is indeed an orthogonal and (Xt)_{t∈[n]}-decorrelating partition, see Lemma 6. The second equality follows from the definition of Sinv and can be proved using Lemma 1 and observing that the set of subspaces {Sj | j ∈ {1, . . . , q} : Sj opt-invariant on [n]} is an orthogonal and (ΠSinv Xt)_{t∈[n]}-decorrelating partition. The third equality holds by Lemma 2 and the fourth by definition of an opt-invariant subspace on [n].

Lemma 8 Let {At}^n_{t=1} be a set of n symmetric strictly positive definite matrices. If there exists a matrix A ∈ {At}^n_{t=1} that has all distinct eigenvalues, then any two irreducible joint block diagonalizers U, Ũ for the set {At}^n_{t=1} are equal up to block permutations and block-wise isometric transformations.

Proof We start by observing that if a matrix is symmetric and has all distinct eigenvalues, then its eigenvectors are orthogonal to each other and unique up to scaling. We define Q ∈ Rp×p as the orthonormal matrix whose columns are eigenvectors of A; Q is then uniquely defined up to permutations of its columns. We now exploit the results by Murota et al. (2010) used in the construction of an irreducible orthogonal joint block diagonalizer for a set of symmetric matrices {At}^n_{t=1} (not necessarily containing a matrix with all distinct eigenvalues). In the following, we translate all the useful results by Murota et al. (2010) into the notation introduced for joint block diagonalizers in Section 2.1.1. Murota et al. (2010) show in Theorem 1 that there exists an irreducible orthogonal joint block diagonalizer U such that, for all t ∈ [n], U⊤ At U = ⊕^q_{j=1}(I_{mj} ⊗ Ãt,j), where q, mj ∈ N, 0 < q, mj ≤ p are such that q_U^max = Σ^q_{j=1} mj, and Ãt,j are square matrices (common diagonal blocks). Here ⊕ denotes the direct sum operator for matrices and ⊗ the Kronecker product. They further propose to partition the columns of the matrix U into q subsets, each denoted by USj, with j ∈ {1, . . .
, q} indexing the diagonal blocks and Sj ⊆ {1, . . . , p} denoting the subset of indices corresponding to the selected columns in USj, such that, for all t ∈ [n], (USj)⊤ At USj = I_{mj} ⊗ Ãt,j. They then argue that, as a consequence of Theorem 1, the spaces spanned by such subsets of columns, that is, Uj := span{uk | k ∈ Sj}, are uniquely defined. We therefore only need to prove that, if at least one matrix in the set {At}^n_{t=1} has all distinct eigenvalues, then for all j ∈ {1, . . . , q}, mj = 1. This condition implies that the diagonal blocks indexed by Sj, j ∈ {1, . . . , q}, are irreducible, and q_U^max = q. The uniqueness of the spaces Uj then implies the uniqueness of the irreducible joint block diagonalizer U up to block permutations and block-wise isometric transformations. To show this result, we now use Murota et al. (2010, Proposition 1). In particular, if the set {At}^n_{t=1} contains at least one matrix A with all distinct eigenvalues, then the assumptions of Proposition 1 are satisfied. Let {Q1, . . . , Qp} be the set of eigenspaces of A and for all i ∈ {1, . . . , p} let m̃i := dim(Qi) = 1. The proposition then implies that for all i ∈ {1, . . . , p} there exists j ∈ {1, . . . , q} such that Qi ⊆ Uj. Moreover, for all i such that Qi ⊆ Uj it holds that m̃i = mj. As a consequence, we obtain that, since the eigenvalues of A are distinct, m̃1 = · · · = m̃p = 1 and therefore, for all j ∈ {1, . . . , q}, mj = 1. This concludes the proof of Lemma 8.

Lemma 9 Let {Sj}^q_{j=1} be an irreducible orthogonal and (Xt)_{t∈[n]}-decorrelating partition. Then, for all j ∈ {1, . . . , q} it holds that

argmax_{β∈Sj} V̄ar(β) = ((1/n) Σ^n_{t=1} Var(ΠSj Xt))† ((1/n) Σ^n_{t=1} Cov(ΠSj Xt, Yt)).

In addition, for all j ∈ {1, . . . , q} such that Sj is opt-invariant on [n], it holds for all t ∈ [n] that

argmax_{β∈Sj} V̄ar(β) = ΠSj γ0,t = ΠSj γ̄0.

Moreover, ΠSj γ̄0 is time-invariant on [n].

Proof Let U ∈ Rp×p and (Σ̃t)_{t∈[n]} be defined for the orthogonal and (Xt)_{t∈[n]}-decorrelating partition as in Lemma 4. For all j ∈ {1, . . .
, q} and for all β Sj it follows from (30) that t=1 Cov(Yt, ΠSj Xt)β β 1 t=1 Var(ΠSj Xt)β which has gradient ( Var)(β) = 2 t=1 Cov(ΠSj Xt, Yt) 2 t=1 Var(ΠSj Xt)β. The equation ( Var)(β) = 0 has solution equal to t=1 Var(ΠSj Xt) t=1 Cov(ΠSj Xt, Yt). Lazzaretto, Peters, Pfister By Lemma 3, it holds that 1 n Pn t=1 Var(ΠSj Xt) = ΠSj 1 n Pn t=1 Var(Xt) 1 ΠSj and therefore β Sj. In order to apply Lemma 3, we in particular use that U is such that, for all t [n], Σt = U Σt U is block diagonal with diagonal blocks given, for all j {1, . . . , q}, by Σt,j = (USj) Σt USj. This implies that U (Pn t=1 Σt) U is also block diagonal with diagonal blocks (USj) (Pn t=1 Σt) USj. Moreover, its inverse is block diagonal and, by the properties of orthogonal matrices, its diagonal blocks are (USj) (Pn t=1 Σt) 1 USj. It now remains to show that this is also the only solution in Sj. All other solutions in Rp are given, for an arbitrary vector w Rp, by t=1 Var(ΠSj Xt) t=1 Var(ΠSj Xt) =β + Ip ΠSj w. The equality follows from the following computation 1 n t=1 Var(ΠSj Xt) t=1 Var(ΠSj Xt) t=1 Var(Xt) t=1 Var(Xt) = USj(USj) n X ! 1 USj(USj) n X The first equality follows again from Lemma 3. For all w Sj, β + (Ip ΠSj)w equals β . For all w Sj, β + (Ip ΠSj)w is not in Sj. For all Sj opt-invariant on [n] and for all t [n] it holds that arg max β Sj = arg max β Sj s=1 Vars(β) = arg max β Sj Vart(β) = Var(ΠSj Xt) Cov(ΠSj Xt, Yt) where we used the definition of opt-invariance on [n] for the second equality, Lemma 2 for the third equality and Lemma 1 for the fourth equality. Since the result holds for all t [n], it also follows that ΠSjγ0,t = ΠSj γ0. Invariant Subspace Decomposition We now need to prove for all t [n] that Cov(Yt X t ΠSj γ0, X t ΠSj γ0) = 0. To see this, fix t [n]. 
Then Cov(Yt X t ΠSj γ0, X t ΠSj γ0) = Cov(X t (γ0,t ΠSjγ0,t), X t ΠSjγ0,t) = Cov(X t ( i=1 ΠSiγ0,t ΠSjγ0,t), X t ΠSjγ0,t) = Cov(X t X i {1,...,q}:i =j ΠSiγ0,t, X t ΠSjγ0,t) i {1,...,q}:i =j γ 0,t Cov(ΠSi Xt, ΠSj Xt)γ0,t The last equality follows from the definition of an orthogonal and (Xt)t [n]-decorrelating partition. Lemma 10 Let {Sj}q j=1 be an irreducible orthogonal and (Xt)t [n]-decorrelating partition and let Sinv be the corresponding invariant subspace. Moreover, let Var(X) := 1 n Pn t=1 Var(Xt) and Cov(X, Y ) := 1 n Pn t=1 Cov(Xt, Yt). Finally, let Uinv be the submatrix of an arbitrary irreducible joint block diagonalizer U corresponding to the irreducible orthogonal and (Xt)t [n]-decorrelating partition whose columns span Sinv. Then, βinv = Uinv((Uinv) Var(X)Uinv) 1(Uinv) Cov(X, Y ). Proof Expanding the definition of βinv, we obtain that βinv = arg max β Sinv Var(β) = arg max β Sinv 1 n t=1 Vart(β) = arg max β Sinv 1 n s=1 (2 Cov(Yt, ΠSinv Xt)β β Var(ΠSinv Xt)β) t=1 Var(ΠSinv Xt) t=1 Cov(ΠSinv Xt, Yt) = ΠSinv Var(X) 1ΠSinv Cov(X, Y ) = Uinv((Uinv) Var(X)Uinv) 1(Uinv) Cov(X, Y ). The fourth equality follows from Lemma 9. The fifth equality follows by Lemma 3 (see the proof of Lemma 9). In the last equality we used that (Uinv) Var(X) 1Uinv = ((Uinv) Var(X)Uinv) 1, which follows from the properties of orthogonal matrices and Lazzaretto, Peters, Pfister the block diagonal structure of U Var(X)U: these imply that (U Var(X)U) 1 = U Var(X) 1U and ((US1) Var(X)US1) 1 ... ((USq) Var(X)USq) 1 (US1) Var(X) 1US1 ... (USq) Var(X) 1USq Appendix E. Proofs E.1 Proof of Theorem 1 Proof Let {Sj}q j=1 be an irreducible orthogonal and (Xt)t [n]-decorrelating partition and define Sinv and Sres as in (11). By Assumption 2, {Sinv, Sres} form an orthogonal and (Xt)t N-decorrelating partition and Sinv is opt-invariant on N. Therefore, using Lemma 1 and Lemma 2, we get for all t N that γ0,t = arg max β Sinv Vart(β) + arg max β Sres Vart(β). 
Furthermore, Sinv opt-invariant on N implies that the first term, argmax_{β∈Sinv} Vart(β), does not depend on t, and hence it holds for all t ∈ N that

γ0,t = argmax_{β∈Sinv} V̄ar(β) + argmax_{β∈Sres} Vart(β).

Moreover, Assumption 1 ensures that the invariant and residual subspaces can be uniquely identified from an arbitrary irreducible orthogonal and (Xt)_{t∈[n]}-decorrelating partition, and therefore all the above results do not depend on the considered partition. This concludes the proof of Theorem 1.

E.2 Proof of Proposition 1

Proof (i) For all j ∈ {1, . . . , q_U^max}, let USj denote the submatrix of U formed by the columns indexed by Sj. Then, the orthogonal projection matrix onto the subspace Sj can be expressed as ΠSj = USj(USj)⊤. It follows for all t ∈ [n] and all i ≠ j that

Cov(ΠSi Xt, ΠSj Xt) = ΠSi Σt ΠSj = USi(USi)⊤ Σt USj(USj)⊤ = 0,

where the last equality holds since (USi)⊤ Σt USj is the (i, j)-th (off-diagonal) block of Σ̃t := U⊤ Σt U, which is zero due to the block diagonal structure of Σ̃t. Irreducibility of the orthogonal partition follows from the irreducibility of the joint block diagonal decomposition. (ii) The statement follows from Lemma 4, taking N = [n]. Irreducibility of the joint block diagonalizer follows from the irreducibility of the orthogonal partition.

E.3 Proof of Proposition 2

Proof (i) First, observe that, by Lemma 6, {Sinv, Sres} is an orthogonal and (Xt)_{t∈[n]}-decorrelating partition and that, by Lemma 7, Sinv is opt-invariant on [n]. Then, by definition of βinv and by Lemma 9 we get for all t ∈ [n] that βinv = argmax_{β∈Sinv} V̄ar(β) = ΠSinv γ0,t = ΠSinv γ̄0. (ii) We need to prove for all t ∈ [n] that Cov(Yt − Xt⊤ βinv, Xt⊤ βinv) = 0. To see this, fix t ∈ [n].
Then Cov(Yt X t βinv, X t βinv) = Cov(X t (γ0,t βinv), X t βinv) = Cov(X t (ΠSinvγ0,t + ΠSresγ0,t βinv), X t (ΠSinvβinv)) = Cov(X t ΠSresγ0,t, X t ΠSinvβinv) = γ 0,t Cov(ΠSres Xt, ΠSinv Xt)βinv where the third equality uses βinv = ΠSinvγ0,t by Proposition 2 (i) and the last equality follows from the fact that Sinv, Sres are an orthogonal and (Xt)t [n]-decorrelating partition by Lemma 6. (iii) Let Bn denote the set of all time-invariant parameters over [n]. By assumption, we have that Bn Sinv. Moreover, by point (ii), we have that βinv Bn. Therefore, by definition of βinv we obtain that βinv = arg maxβ Bn Var(β). This completes the proof of Proposition 2. E.4 Proof of Theorem 2 Proof Under Assumptions 1 and 2 and using the definitions of βinv and δres t , we can write the explained variance of β Rp at time t N for the true time varying parameter γ0,t Rp Vart(β) = 2γ 0,t Var(Xt)β β Var(Xt)β = 2(βinv + δres t ) Var((ΠSinv + ΠSres)Xt)β β Var(Xt)β = 2(βinv) Var(ΠSinv Xt)ΠSinvβ + 2(δres t ) Var(ΠSres Xt)ΠSresβ β Var(Xt)β. Lazzaretto, Peters, Pfister Using this expansion, we get, since δres t = γ0,t βinv, for all β / Sinv that inf γ0,t Rp: γ0,t βinv Sres Vart(β) = . arg max β Rp inf γ0,t Rp: γ0,t βinv Sres Vart(β) = arg max β Sinv inf γ0,t Rp: γ0,t βinv Sres Vart(β) = arg max β Sinv Vart(β). Since by assumption it holds that Sinv is opt-invariant on N, it further holds that for all t N arg max β Sinv Vart(β) = arg max β Sinv Var(β). The claim follows from the definition of βinv. E.5 Proof of Theorem 3 Proof We assume without loss of generality that the observed predictors Xt have zero mean in t Iad t . Alternatively, as mentioned in Section 4.1.2, a constant term could be added to Xt to account for the mean. We also observe that ˆγOLS t and ˆδres t are both unbiased estimators for γ0,t and δres t , respectively (this can be checked using standard OLS analysis). We now start by computing the out-of-sample MSPE for ˆγISD t . 
MSPE(ˆγISD t ) = E[(X t (ˆγISD t γ0,t ))2] = E[(X t (ΠSinv + ΠSres)(ˆβinv + ˆδres t βinv δres t ))2] = trace(E[(ˆβinv βinv)(ˆβinv βinv) ] Var(ΠSinv Xt )) + trace(E[(ˆδres t δres t )(ˆδres t δres t ) ] Var(ΠSres Xt )) = trace(E[Var(ˆβinv | X)] Var(ΠSinv Xt )) + trace(E[Var(ˆδres t | Xad)] Var(ΠSres Xt )) = trace(E[Var(ˆβinv | X)] Var(ΠSinv Xt )) + σ2 ad m trace(ΠSres E[( 1 m(Xad) Xad) 1]ΠSres Var(ΠSres Xt )), where we have used that Var(ˆδres t | Xad) = σ2 adΠSres((Xad) Xad) 1ΠSres. We further observe that Var(ˆβinv | X) = ΠSinv(X X) 1ΠSinv X Var(ϵ)XΠSinv(X X) 1ΠSinv, where Var(ϵ) is a n n diagonal matrix whose diagonal elements are the error variances at each observed time steps, (Var(ϵt))t [n]. Let σ2 ϵ,min := mint [n] Var(ϵt) and σ2 ϵ,max maxt [n] Var(ϵt). Then, σ2 ϵ,minΠSinv(X X) 1ΠSinv Var(ˆβinv | X) σ2 ϵ,maxΠSinv(X X) 1ΠSinv, Invariant Subspace Decomposition where denotes the Loewner order, and it follows from diagn(σ2 ϵ,min) Var(ϵt) diagn(σ2 ϵ,max), with diagn( ) denoting an n-dimensional diagonal matrix with diagonal elements all equal. We have used in particular that for two symmetric matrices A, B Rn n such that A B and a matrix S Rn n it holds that S AS S BS (see, for example, Theorem 7.7.2. by Horn and Johnson, 2012). Using this relation and Jensen s inequality, we obtain that the first term in MSPE(ˆγISD t ) is lower bounded by trace(E[Var(ˆβinv | X)] Var(ΠSinv Xt )) σ2 ϵ,min trace(ΠSinv E[(X X) 1]ΠSinv Var(ΠSinv Xt )) n trace(ΠSinvΣ 1ΠSinvΣt ΠSinv), where Σ := E[ 1 n X X]. Using now that, by assumption, {Sinv, Sres} is an orthogonal and (Xt)t N-decorrelating partition, let U Rp p be defined for such a partition as in Lemma 4 and let Uinv Rp dim(Sinv) be the submatrix of U whose columns form an orthonormal basis for Sinv, such that ΠSinv = Uinv(Uinv) . Moreover, let Σinv := (Uinv) ΣUinv and Σinv t := (Uinv) Σt Uinv denote, as in Lemma 4, the diagonal block corresponding to Sinv of the block diagonal matrices U ΣU and U Σt U, respectively. 
By the properties of orthogonal matrices, ( Σinv) 1 = (Uinv) Σ 1Uinv. For all i {1, . . . , dim(Sinv)}, let λi denote the i-th eigenvalue in decreasing order. Then, we can further express the above lower bound as n trace(ΠSinvΣ 1ΠSinvΣt ΠSinv) n trace(Uinv(Uinv) Σ 1Uinv(Uinv) Σt Uinv(Uinv) ) n trace(Uinv( Σinv) 1 Σinv t (Uinv) ) σ2 ϵ,min dim(Sinv) n λmin( Σinv t ) λmax( Σinv) =σ2 ϵ,min dim(Sinv) where we define cinv := λmin( Σinv t )/λmax( Σinv). The same term in MSPE(ˆγISD t ) is upper bounded by trace(E[Var(ˆβinv | X)] Var(ΠSinv Xt )) σ2 ϵ,max trace(ΠSinv E[(X X) 1]ΠSinv Var(ΠSinv Xt )) σ2 ϵ,max dim(Sinv) where Cinv := 1 + trace(ΠSinv(E[( 1 n X X) 1] Σ 1 t )ΠSinv Var(ΠSinv Xt )). Proceeding in a similar way, we can express the second term in MSPE(ˆγISD t ) as σ2 ad m trace(ΠSres E[( 1 m(Xad) Xad) 1]ΠSres Var(ΠSres Xt )) = σ2 ad m hres(m), where hres(m) := trace(ΠSres E[( 1 m(Xad) Xad) 1]ΠSres Var(ΠSres Xt )). Since we have assumed that for all t Iad {t } the distribution of Xt does not change, by Jensen s inequality and by the fact that ΠSresΣ 1 t ΠSresΣt ΠSres = ΠSres (see Lemma 6 and (32)) we Lazzaretto, Peters, Pfister have that hres(m) dim(Sres), and limm hres(m) = dim(Sres). Moreover, we can find an upper bound for hres(m) in the following way. hres(m) = trace(ΠSres E[( 1 m(Xad) Xad) 1]ΠSres Var(ΠSres Xt )) = trace(E[( 1 m(Xad) Xad) 1] Var(ΠSres Xt )) trace(E[( 1 m(Xad) Xad) 1]) trace(Var(ΠSres Xt )) m(Xad) Xad) 1] op trace(Var(ΠSres Xt )) m(Xad) Xad) 1 op] trace(Var(ΠSres Xt )) c 1λmax(Σt ) dim(Sres). Summarizing, we obtain that MSPE(ˆγISD t ) σ2 ϵ,min dim(Sinv) n cinv + σ2 ad dim(Sres) m and MSPE(ˆγISD t ) σ2 ϵ,max dim(Sinv) n Cinv + σ2 ad dim(Sres) where Cres := c 1λmax(Σt ) is the constant introduced in the theorem statement. We now compute the MSPE of ˆγOLS t . 
MSPE(ˆγOLS t ) = E[(X t (ˆγOLS t γ0,t ))2] = trace(E[(ˆγOLS t γ0,t )(ˆγOLS t γ0,t ) Xt X t ]) = trace(E[(ˆγOLS t γ0,t )(ˆγOLS t γ0,t ) ] Var(Xt )) = trace(E[Var(ˆγOLS t | Xad)]Σt )) = σ2 ad trace(E[((Xad) Xad) 1]Σt )) = σ2 ad m trace(E[( 1 m(Xad) Xad) 1]Σt )). We can further express MSPE(ˆγOLS t ) as σ2 ad m trace(E[( 1 m(Xad) Xad) 1]Σt ) =σ2 ad m trace((ΠSinv + ΠSres)E[( 1 m(Xad) Xad) 1](ΠSinv + ΠSres) Var((ΠSinv + ΠSres)Xt )) =σ2 ad m trace(ΠSinv E[( 1 m(Xad) Xad) 1]ΠSinv Var(ΠSinv Xt )) + σ2 ad m trace(ΠSres E[( 1 m(Xad) Xad) 1]ΠSres Var(ΠSres Xt )). In particular, the second term in the above sum also appears in MSPE(ˆγISD t ). Taking now the difference between the MSPE of ˆγOLS t and ˆγISD t , we obtain that MSPE(γOLS t ) MSPE(γISD t ) = σ2 ad m trace(ΠSinv E[( 1 m(Xad) Xad) 1]ΠSinv Var(ΠSinv Xt )) trace(E[Var(ˆβinv | X)] Var(ΠSinv Xt )) We have already obtained an upper bound for the second term in the difference, namely σ2 ϵ,max dim(Sinv) n Cinv. For the first term, we can make the same considerations made above for σ2 ad m trace(ΠSres E[( 1 m(Xad) Xad) 1]ΠSres Var(ΠSres Xt )) and hres(m). In particular, σ2 ad m trace(ΠSinv E[( 1 m(Xad) Xad) 1]ΠSinv Var(ΠSinv Xt )) = σ2 ad m had,inv(m) Invariant Subspace Decomposition where had,inv(m) := trace(ΠSinv E[( 1 m(Xad) Xad) 1]ΠSinv Var(ΠSinv Xt )) satisfies had,inv(m) dim(Sinv) and had,inv(m) c 1λmax(Σt ) dim(Sinv) (as for hres(m), we observe that limm had,inv(m) = dim(Sinv)). Therefore, we obtain that MSPE(γOLS t ) MSPE(γISD t ) σ2 ad dim(Sinv) m σ2 ϵ,max dim(Sinv) and MSPE(γOLS t ) MSPE(γISD t ) σ2 ad dim(Sinv) m Cad,inv σ2 ϵ,min dim(Sinv) where Cad,inv := c 1λmax(Σt ). In particular, this difference is always positive if n is sufficiently large. 
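The qualitative conclusion of these bounds (the MSPE gap between OLS and ISD is positive once n is large relative to m) can be illustrated with a small simulation. The following sketch is not the paper's code and makes simplifying, hypothetical choices: identity covariance, one invariant and one residual coordinate (so the decorrelating partition is the canonical basis), and arbitrary values for n, m and the residual coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, reps = 2000, 20, 300     # long history, short adaptation window
sigma = 1.0                     # noise standard deviation
beta_inv = 1.5                  # invariant coefficient (first coordinate)

mse_ols, mse_isd = 0.0, 0.0
for _ in range(reps):
    # Historical data: second coefficient varies smoothly over time.
    X = rng.normal(size=(n, 2))
    t_hist = np.arange(n)
    Y = X[:, 0] * beta_inv + X[:, 1] * np.sin(t_hist / 50.0) \
        + sigma * rng.normal(size=n)

    # Adaptation window at t*: residual coefficient frozen at d_star.
    d_star = 0.7
    Xad = rng.normal(size=(m, 2))
    Yad = Xad[:, 0] * beta_inv + Xad[:, 1] * d_star + sigma * rng.normal(size=m)

    # ISD-style estimate: invariant part from all history (coordinates are
    # decorrelated, so a marginal regression suffices), residual part from
    # the adaptation window only.
    b_inv = (X[:, 0] @ Y) / (X[:, 0] @ X[:, 0])
    d_res = (Xad[:, 1] @ (Yad - Xad[:, 0] * b_inv)) / (Xad[:, 1] @ Xad[:, 1])
    g_isd = np.array([b_inv, d_res])

    # Plain OLS fitted on the adaptation window only.
    g_ols, *_ = np.linalg.lstsq(Xad, Yad, rcond=None)

    g0 = np.array([beta_inv, d_star])
    # With identity covariance, parameter MSE equals the out-of-sample MSPE.
    mse_isd += np.sum((g_isd - g0) ** 2)
    mse_ols += np.sum((g_ols - g0) ** 2)

print(mse_isd / reps, mse_ols / reps)
```

In this regime the ISD-style estimator pays roughly σ²/m for the single residual coordinate, while OLS pays roughly σ²/m per coordinate, matching the dim(Sinv)·σ²/m leading term of the difference derived above.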
Finally, we can observe that the expected explained variance (24) for ˆγ, where ˆγ = ˆγISD_t* or ˆγ = ˆγOLS_t*, is

E[Vart*(ˆγ)] = E[Var(Yt*) − Var(Yt* − Xt*⊤ ˆγ | ˆγ)] = Var(Xt*⊤ γ0,t* + εt*) − E[Var(Xt*⊤ (γ0,t* − ˆγ) + εt* | ˆγ)] = γ0,t*⊤ Var(Xt*) γ0,t* + σ²ε − E[(γ0,t* − ˆγ)⊤ Var(Xt*)(γ0,t* − ˆγ)] − σ²ε = γ0,t*⊤ Var(Xt*) γ0,t* − E[trace((γ0,t* − ˆγ)(γ0,t* − ˆγ)⊤ Var(Xt*))] = γ0,t*⊤ Σt* γ0,t* − MSPE(ˆγ),

and therefore the same inequalities found for the difference of the MSPEs equivalently hold for E[Vart*(ˆγISD_t*)] − E[Vart*(ˆγOLS_t*)].

E.6 Proof of Proposition 3

Proof (i) By definition of a (non-orthogonal) joint block diagonalizer, it holds for all t ∈ [n] that the matrix Σ̃t := U⊤ Σt U is block diagonal with q diagonal blocks Σ̃t,j := (USj)⊤ Σt USj, j ∈ {1, . . . , q}. Define now the matrix W := U^{−⊤} and observe that the following relations hold: Σt = W Σ̃t W⊤, Σt⁻¹ = U Σ̃t⁻¹ U⊤, and Σ̃t⁻¹ = W⊤ Σt⁻¹ W is block diagonal with blocks Σ̃t,j⁻¹ = (WSj)⊤ Σt⁻¹ WSj. The last relation is obtained by observing that

diag(Σ̃t,1⁻¹, . . . , Σ̃t,q⁻¹) = [(WS1)⊤; . . . ; (WSq)⊤] Σt⁻¹ [WS1, . . . , WSq].

Moreover, Σ̃t,j⁻¹ = Var((USj)⊤ Xt)⁻¹. We observe that, because W⊤U = Ip, it holds for all i, j ∈ {1, . . . , q} with i ≠ j that (WSj)⊤ USi = 0, that is, the space spanned by the columns of WSj is orthogonal to the space spanned by the columns of USi. Moreover, the matrix USj(WSj)⊤ is an oblique projection matrix onto Sj along ⊕_{i∈{1,...,q}\{j}} Si. Fix j ∈ {1, . . . , q}. Then, using these relations we obtain that

PSj|S−j γ0,t = USj(WSj)⊤ Σt⁻¹ Cov(Xt, Yt) = USj(WSj)⊤ Σt⁻¹ W U⊤ Cov(Xt, Yt) = USj(WSj)⊤ Σt⁻¹ [WS1, . . . , WSq] [(US1)⊤; . . . ; (USq)⊤] Cov(Xt, Yt) = USj(WSj)⊤ Σt⁻¹ WSj(USj)⊤ Cov(Xt, Yt) = USj Σ̃t,j⁻¹ Cov((USj)⊤ Xt, Yt) = USj Var((USj)⊤ Xt)⁻¹ Cov((USj)⊤ Xt, Yt).

The fourth equality follows from the block diagonal structure of W⊤ Σt⁻¹ W, which implies that, for all i ≠ j, (WSi)⊤ Σt⁻¹ WSj = 0.
(ii) By definition of βinv, using point (i) of this proposition and the fact that {Sinv, Sres} forms a (Xt)_{t∈[n]}-decorrelating partition, we obtain that, for all t ∈ [n], βinv = Uinv Var((Uinv)⊤ Xt)⁻¹ (Uinv)⊤ Cov(Xt, Yt). We now observe that β̃inv := Var((Uinv)⊤ Xt)⁻¹ (Uinv)⊤ Cov(Xt, Yt) is the OLS solution of regressing Yt onto (Uinv)⊤ Xt (as we assume all variables have zero mean). Moreover, this quantity is constant over t by proj-invariance of Sinv. In particular, this means that β̃inv = argmin_{β∈R^{dim(Sinv)}} (1/n) Σ^n_{t=1} E[(Yt − Xt⊤ Uinv β)²], which implies that β̃inv = ((Uinv)⊤ V̄ar(X) Uinv)⁻¹ (Uinv)⊤ C̄ov(X, Y). Therefore, βinv = Uinv((Uinv)⊤ V̄ar(X) Uinv)⁻¹ (Uinv)⊤ C̄ov(X, Y). (iii) The claim follows from the definition of δres_t, from the fact that {Sinv, Sres} forms a (Xt)_{t∈[n]}-decorrelating partition, and from point (i) of this proposition.
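As a final numerical sanity check (illustrative only, assuming numpy; the matrices below are randomly generated rather than taken from the paper), the identity in Proposition 3 (i) can be verified directly by constructing covariances that a non-orthogonal U block-diagonalizes:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
blocks = [[0, 1], [2, 3]]

U = rng.normal(size=(p, p))      # generic non-orthogonal joint block diagonalizer
W = np.linalg.inv(U).T           # W := U^{-T}

# Build a covariance that U block-diagonalizes: Sigma = W D W^T with D block
# diagonal, so that U^T Sigma U = D.
D = np.zeros((p, p))
for S in blocks:
    A = rng.normal(size=(len(S), len(S)))
    D[np.ix_(S, S)] = A @ A.T + len(S) * np.eye(len(S))  # positive definite block
Sigma = W @ D @ W.T

gamma0 = rng.normal(size=p)
cov_xy = Sigma @ gamma0          # Cov(X, Y) for Y = X^T gamma0 + independent noise

for S in blocks:
    US, WS = U[:, S], W[:, S]
    # Left side: oblique projection of gamma0 onto S_j along the other subspace.
    lhs = US @ WS.T @ gamma0
    # Right side: the regression expression (26) from Proposition 3 (i).
    rhs = US @ np.linalg.solve(US.T @ Sigma @ US, US.T @ cov_xy)
    assert np.allclose(lhs, rhs)
print("Proposition 3 (i) verified numerically")
```

The check works because (WSj)⊤U selects the j-th block row of the identity, so the oblique projection picks out exactly the j-th block of the block-diagonal system, which is what the reduced regression on (USj)⊤Xt solves.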