# identification_of_intermittent_temporal_latent_process__c91bfd83.pdf

Published as a conference paper at ICLR 2025

IDENTIFICATION OF INTERMITTENT TEMPORAL LATENT PROCESS

Yuke Li1 , Yujia Zheng2 , Guangyi Chen2, Kun Zhang2, Heng Huang1

1 University of Maryland College Park, College Park, MD, USA 2 Carnegie Mellon University, Pittsburgh PA, USA

Identifying time-delayed temporal latent process is crucial for understanding temporal dynamics and enabling downstream reasoning. Although recent methods have made remarkable progress in this field, they cannot address the dynamics in which the influence of some latent factors on both the subsequent latent states and the observed data can become inactive or irrelevant at different time steps. Therefore, we introduce intermittent temporal latent processes, where: (1) any subset of latent factors may be missing during nonlinear data generation at any time step, and (2) the active latent factors at each step are unknown. This framework encompasses both nonstationary and stationary transitions, accommodating changing or consistent active factors over time. Our work shows that under certain assumptions, the latent variables are block-wise identifiable. With further conditional independence assumption, each latent variable can even be recovered up to component-wise transformations. Using this identification theory, we propose an unsupervised approach, Inter Latent , to reliably uncover the representations of the intermittent temporal latent process. The experiments on both synthetic and real-world datasets verify our theoretical claims.

1 INTRODUCTION

Learning meaningful representations from sequential data remains a fundamental challenge across various fields. Time series data, such as financial markets and climate observations, are ubiquitous and exhibit high nonlinearity Berzuini et al. (2012); Ghysels et al. (2016). This inspires a extensive line of works to temporal latent representation learning Yao et al. (2022b;a); Chen et al. (2024), upon the recent advancement in nonlinear ICA Khemakhem et al. (2020); Zhang et al.; Kong et al. (2022); Zheng et al. (2022); Li et al. (2023); von K ugelgen et al. (2024); Ng et al. (2023); Zheng & Zhang (2024); Morioka & Hyvarinen (2024); Yao et al. (2024); Zheng et al. (2024); Kong et al. (2024); Lachapelle et al. (2024a). However, many real-world systems exhibit latent time-delayed dynamics where the influence of certain latent factors on both subsequent latent states and observed data can be inactive or irrelevant at specific time steps. Consider, for example, a complex manufacturing process: various machine components contribute to the final product quality at different stages, with some components becoming temporarily inactive or irrelevant during certain production phases. Current works may struggle to capture these intermittent influences, potentially missing crucial aspects of the underlying dynamics. This highlights the need for a more flexible and robust framework to identify such temporal processes with intermittence of latent variables.

In this work, we investigate the identification of representations of intermittent temporal latent processes. Two key properties characterize the intermittence of a temporal latent process: (1) any subset of latent factors can be missing during the nonlinear time-delayed data generation at any time step, and (2) the specific set of active latent factors at a time step is unknown. Figure 1 takes an example of data generating mechanism of an intermittent latent temporal process to illustrate its concept. In the transition mechanism (top of Figure 1b), we see the zero entries in Jacobians indicating that not all latent variables influence each other s transitions. Similarly, in the generating mechanism

Equal contribution. This work was partially supported by NSF IIS 2347592, 2348169, DBI 2405416, CCF 2348306, CNS 2347617.

Published as a conference paper at ICLR 2025

(bottom of Figure 1b), the sparse Jacobians show that not all latent variables contribute to every observed variable. We define the support as the set of active latent factors at each time step, for both the transition and generating mechanisms. Missingness occurs when a latent factor is absent from the support, having no influence on the subsequent latent state or the observed data.

The intermittent nature of these processes presents two significant challenges for representation learning: (1) The supports for both transition and generating mechanisms are unknown, requiring methods to adapt to the data generated by only active latent factors at each time step. (2) The interactions between intermittently active latent variables may be intricate and time-varying in both mechanisms, necessitating the models that can capture the possible various support and missingness. The existing literature has yet to fully address these challenges. Wiedemer et al. (2024) relies on compositional mixing functions and requires supervision on the latent variables. Lachapelle et al. (2023); Fumero et al. (2023); Xu et al. (2024) are restricted to linear or piecewise linear settings. Ca Ri NG Chen et al. (2024) tackles missingness only within the mixing function by leveraging historical information during the unmixing process.

In contrast to previous works, we present identification guarantees for uncovering intermittent temporal latent processes. Our theoretical analysis begins with establishing block-wise identifiability under assumptions of sufficient variability of transitions of latent variables (Theorem 1). Block-wise identifiability ensures that blocks of true latent variables can be uniquely recovered from estimated latent variables up to a certain indeterminacy. Notably, this outcome holds regardless of whether latent variables are within the support or missingness. Building on this foundation, we further prove component-wise identifiability for latent variables within the support in Theorem 2, given an additional assumption of independence of latent variables conditioning on previous time steps. Component-wise identifiability guarantees that each individual component within the support can be uniquely recovered from estimated latent variables up to a certain indeterminacy. Moreover, our identifiability results are able to handle both nonstationary and stationary temporal latent process, allowing for changing or consistent active factors over time without compromising identifiability guarantees. These theoretical contributions establish, to the best of our knowledge, one of the first general frameworks for uncovering latent variables in intermittent temporal processes with appropriate identifiability guarantees.

Leveraging these theoretical insights, we introduce a novel unsupervised method that extends the Sequential Variational Autoencoder Li & Mandt (2018). Our method, Inter Latent , accommodates supports and missingnesses through sparsity regularizations on both mixing and transition functions, enabling it to model complex interactions between intermittently active latent variables and handle sparse, temporally variable latent spaces. We evaluate our approach on synthetic and real-world datasets, demonstrating its effectiveness in uncovering complex hidden temporal processes, as well as validating the proposed identifiability theory.

2 PROBLEM SETTING

Given a temporal sequence ranging from t = 1 to t = T, let x = {x1, x2, . . . , x T } denote the K-dimensional observations. At each time step t, N latent causal variables zt = {z1 t , . . . , z N t } generate xt RK. We formalize the data generating process as follows:

xt = g(zt), zn t = fn(Pa(zn t ), ϵn t ) for n [1, N] (1)

Here, g is assumed to be an injective, nonlinear, non-parametric mixing function: RN RK. In this work, we work on the undercomplete case, where K N to ensure the injectivity of g. fn denotes the nonlinear, nonparametric time-delayed transition function for the n-th latent variable. Pa(zn t ) represents the parent nodes of zn t from previous time steps. Without loss of generality, we assume a time lag of 1 in Eq. equation 1, i.e., Pa(zn t ) zt 1. The general case of multiple lags and sequence lengths is discussed in Appendix B.1. ϵn t is the noise term, sampled independently for each zn t from a standard normal distribution N(0, 1).

We are now ready to introduce the intermittent temporal latent process upon the concept of missingness of latent variables. In particular, not all components of zt participate in the data generating process at a time step. Formally, there exists a u [1, N] such that the u-th row of Jacobian of the transition function f u, denoted as Ju,: f,t, and the u-th column of the Jacobian of the mixing function, J:,u g,t, are zero. This implies that when zu t is missing, it neither receives influence from zt 1 nor

Published as a conference paper at ICLR 2025

exerts influence on zt+1 or xt in the data generation process. Figure 1 illustrates this concept, where z1 2 and z2 3 are examples of such missing latent variables.

The non-missing indices of zt help to define the support of the data generating process in Eq. 1 by:

st := {i [1, N] | zt 1 and zt, J:,i g,t(zt) = 0 Ji,: f,t(zt 1) = 0 J:,i f,t+1(zt) = 0} (2)

A similar formulation can define the missingness of zt by sc t. We assume that st and sc t partition the index set [1, N] sc t := {u [1, N] | zt 1 and zt, J:,u g,t(zt) = 0 Ju,: f,t(zt 1) = 0 J:,u f,t+1(zt) = 0} (3)

Figure 1: Data generations of Intermittent Temporal Latent Process and its Jacobian Structures for a three-step sequence, e.g., z1 3 means the latent variable z1 at t = 3. (a) illustrates the connections between time steps and how zt generates xt in intermittent temporal latent process. (b) Jacobian structures reveals the definition of support and missingness in Eq. 2 and Eq. 3 by and 0, respectively.

Equations 2 and 3 sets the stage for identifying zt by characterizing a sparse support at t of both transition and mixing functions. Specifically, there may exist latent variables {zu t |u sc t} that do not participate in the data generating process described in Eq. 1. The zero entries in the Jacobian matrices of both the transition and mixing functions, as illustrated in Figure 1(b), provide a clear visual representation of this sparsity. Let dt = |st| denote the cardinality of st, and consequently, |sc t| = dc t = N dt. In our analysis, we assume both st and sc t are non-empty for all time steps t. The case where sc t = can be considered a special instance of the intermittent temporal latent process. It is worthnoting that the mixing function g itself remains unchanged across time steps - only its Jacobian structure varies based on the missingness in zt. To illustrate this, we give an example in the following. Let g(zt) = sinh(zt) as our mixing function. When certain components of zt are missing, the corresponding columns in the Jacobian become zero, but g remains the same function. For instance, let us say N = 2, and K = 2, if z2 t is missing at time t, the jacobian of the mixing function is:

x1 t z1 t = cosh(z1 t ) 0

x2 t z1 t = cosh(z1 t ) 0

In order to introduce our identification resutls, we define the observational equivalence next.

Definition 1 (Observational Equivalence): Given a sequence of observed variables x = {x1, x2, . . . , x T } for t = 1 to T, let the true temporally causal latent process be specified by f, g, p(ϵ) as in Eq. equation 1. A learned generative model ˆf, ˆg, p(ˆϵ) is observationally equivalent to the ground truth if the model distribution matches the data distribution everywhere: p ˆ f,ˆg,p(ˆϵ)(x1:T ) = pf,g,p(ϵ)(x1:T ) (4)

Both the mixing function and transition functions can be recovered (up to certain indeterminacies) once zt is identified as we assume the injectivity of g and no latent causal confounders, respectively.

Suppose there exists an invertible mapping h, such that ˆzt = h(zt). We further provide the definition of block-wise identifiability and component-wise idenfiability in the following

Definition 2 (Block-wise Identifiability): h is considered block-wise identifiable if, given a block of the true latent variables z B t , there exists an unique partitioning B of ˆzt that matches z B t up to a permutation π, such that ˆz B t = h B(π(z B t )). h B is invertible, and its domain is z B t R|B|.

Published as a conference paper at ICLR 2025

Definition 3 (Component-wise Identifiability): For an individual component of the latent variables zn t , there exists a unique component n of ˆzt matches zn t up to a permutation π, such that ˆzn t = hn(π(zn t )), where hn is invertible with a domain zt RN. zn t is component-wise identifiable.

3 IDENTIFIABILITY THEORY

This section presents our identifiability results. We first leverage the assumptions of sufficient variability of temporal data and support sparsity to establish block-wise identifiability, as detailed in Theorem 1. Building upon this foundation, we then demonstrate component-wise identifiability of latent variables by exploiting conditional independence assumptions, as formalized in Theorem 2.

3.1 BLOCK-WISE IDENTIFIABILITY

Eq. 2 and 3 enable the partitioning of zt into two subsets: {zi t|i st} and {zu t |u sc t}. We now present our findings on the block-wise identifiability of these subsets:

Theorem 1 For the observations xt RK and estimated latent variables zt Z RN, and ˆz ˆZ RN, suppose that there exist functions ˆg and ˆf satisfying observational equivalence in Eq. 4 If the following assumptions and regularization hold:

i (Smoothness and positivity): The probability density function of the latent causal variables, p(zt), has positive measure in the space of zt and is twice continuously differentiable. ii (Path connected): For any z0, z1 Z, there is a continous function ϕ : [0, 1] Z, s.t. ϕ(0) = z0 and ϕ(1) = z1 . iii (Sufficient variability of zt and ˆzt): Let q(zt|zt 1) = log p(zt|zt 1), as well as q(ˆzt|ˆzt 1) = log p(ˆzt|ˆzt 1), and Hzt,zt 1q(zt|zt 1) denotes the Hessian matrix of q(zt|zt 1) w.r.t. zt and zt 1. Suppose Gzt {0, 1}N N as a binary adjacency matrix that indicates the existence of the transitions from zt 1 and zt, where Gzt i1,i2 = 1 means that there exists a transition from zi1 t 1 to zi2 t . We assume that: span{Hzt,zt 1q(zt|zt 1)}dt j=1 = Rdt dt Gzt and

span{Hˆzt,ˆzt 1q(ˆzt|ˆzt 1)} ˆdt j=1 = R ˆdt ˆdt ˆ Gˆzt .

iv (Support sparsity regularization): For any time step t, st is not an empty set, ˆdt dt

There exists a permutation σ, such that

ˆst = σ(st) and ˆsc t = σ(sc t).

In other words, i, {zi t|i st} are paritial identifiable from u, {zu t |u sc t}.

Proof sketch The complete proof is provided in the Appendix B.1. Here, we outline the key steps: First, let p(zt|zt 1) be the ground-truth transition pdf and p(ˆzt|ˆzt 1) be the estimated transition pdf. Define q(zt|zt 1) = log p(zt|zt 1) and q(ˆzt|ˆzt 1) = log p(ˆzt|ˆzt 1). Using h defined in ˆzt = h(zt) helps us derive: Hˆzt,ˆzt 1q(ˆzt|ˆzt 1) = (Jh 1(ˆzt)) Hzt,zt 1q(zt|zt 1)Jh 1(ˆzt 1), where H denotes the Hessian matrix and Jh 1 is the inverse Jacobian of h. We further leverage the sufficient variability assumption to establish a connection between the support sets st and ˆst, as well as their complements sc t and ˆsc t, respectively. By incorporating the support sparsity regularization ˆdt dt, we conclude the block-wise identifiability of both i, {zi t|i st} and u, {zu t |u sc t}.

Remarks Assumptions i and ii have been commonly adopted for the identification theory Chen et al. (2024); Lachapelle et al. (2024b). These assumptions provide the foundations to present Theorem 1, which concerns the transitions from zt 1 to zt over the space Z.

Recall that this work does not require all components of zt to actively participate in the data generation process. The crux of our identification approach lies in formalizing the relationships between the support sets st and ˆst, as well as their complements sc t and ˆsc t. To this end, we further introduce the sufficient variability in assumption iii to ensure the span of the Hessian matrices of the log-transition probabilities covers the full space of i, {zi t|i st}. We can thus establish our main result on partial identifiability by leveraging support sparsity regularization to reach our conclusion of the block-wise identifiability.

Published as a conference paper at ICLR 2025

Notably, we do not assume the invariance of the support set st over time. Regardless of whether st changes or remains constant, we demonstrate in Section 3.2 that for all i, {zi t|i st} can be recovered up to a component-wise invertible transformation and a permutation.

3.2 COMPONENT-WISE IDENTIFIABILITY

In this section, we exploit the conditional independence assumption to establish the component-wise identifiability of {zi t|i st}.

Theorem 2 Let all assumptions from the Theorem 1 hold. Additionally, suppose the following assumption is exposed to data generating process in Eq. 1 as well:

i (Conditional independence): At t, we assume that each component of zt is conditional independent given the previous latent variables zt 1. For any i1, i2 [N]:

zi1 t zi2 t |zt 1 (5)

Then {ˆzj t|j ˆst} must be a component-wise transformation of a permuted version of true {zi t|i st}.

Proof Sketch: Our main idea rests on proving component-wise identifiability by contradiction. We demonstrate that if component-wise identifiability does not hold, it would violate the conditional independence assumption. More specifically, The proof proceeds as follows: 1. From the Theorem 1, we have: Hˆzt,ˆzt 1q(ˆzt|ˆzt 1) = (Jh 1(ˆzt)) Hzt,zt 1q(zt|zt 1)Jh 1(ˆzt 1); 2. the conditional independence of q(zt|zt 1) as established in Eq. 5 only hold if J 1 h (ˆzt) is a diagonal matrix.

Remarks: The conditional independence is widely adopted in identification results for time-series data, as evidenced in recent works Yao et al. (2022b;a); Chen et al. (2024). Our analysis demonstrates that this regularization plays a crucial role in establishing our identification results.

The conclusions derived from Theorem 1 and Theorem 2 are applicable to both stationary and nonstationary processes. We consider the process is nonstationary if the support sets st and sc t vary over time since the transition from zt 1 to zt changes as well. Conversely, the process is stationary if these support sets remains unchanged over time. Our framework allows for temporal variation in the support sets st and their complements sc t, subject only to the constraint that neither is an empty set at any time point.

Our proposed data generating process encompasses previous models as special cases. For instance, if we remove the intermittent feature described in Eq. 2 and Eq. 3, our model reduces to LEAP Yao et al. (2022b) without the domain index. When handling non-stationary sequences, our identifiability results removes the assumption of known auxiliary variables, which is required by Yao et al. (2022b;a); Chen et al. (2024).

4 IN T E RLA T E N T APPROACH

Building upon our identifiability results, we now introduce Inter Latent to estimate the latent causal variables. Our approach aims to achieve the observational equivalence by modeling the support sparsity and the conditional independence assumptions for the data generating process in Eq. 1. In general, Inter Latent formalizes the probabilistic joint distribution of Eq. 1 as:

p(x1:T , z1:T ) = pγ(x1|z1)pϕ(z1)

t=2 pγ(xt|zt)pϕ(zt|zt 1). (6)

where γ denotes the parameters for the mixing function g, and ϕ denotes the parameters for the transition function f. To learn zt from the observations xt, we also introduce the encoder qω(zt|xt) with parameters ω. We build our approach upon Sequential Variational Auto-Encoders Li & Mandt (2018). Figure 2 illustrates the overall framework of Inter Latent . In what follows, we introduce each part of our network individually.

Published as a conference paper at ICLR 2025

4.1 NETWORK DESIGN

Eq. 6 suggests that the architecture of Inter Latent comprises of three key components. The encoder acquires latent causal representations by inferring qω(ˆzt|xt) from observations. These learned latent variables are then used by the step-to-step decoder pγ(ˆxt|ˆzt) to reconstruct the observations, implementing the mixing function g in Eq. 1. To learn the latent variables, we constrain them through the KL divergence between their posterior distribution and a prior distribution, which is estimated using a normalizing flow that converts the prior into Gaussian noise. A detailed exploration of all modules is forthcoming.

Figure 2: The overall framework of Inter Latent consists of: (1) an encoder that maps observations xt to latent variables ˆzt (t [1, T]), (2) a decoder that reconstructs observations ˆxt (t [1, T]) from zt, and (3) a temporal prior estimation module that models the transition dynamics between latent states. We train Inter Latent by LRecon along LKLD. ˆϵt (t [1, T]) denotes the estimation of the true noise terms ϵt (t [1, T]).

Encoder qω(ˆzt|xt): We assume ˆzt is independent of ˆzt conditioning on x, where t = t . Therefore, the decomposition of joint probability distribution of the posterior

is qω(ˆz|x) = TQ

t=1 qω(ˆzt|xt). We choose to approximate

q by an isotropic Gaussian characterized by mean µt and covariance σt. To learn the posterior we use an encoder composed of an MLP followed by leaky Re LU activation:

ˆzt N(µt, σt), µt, σt = Leaky Re LU(MLP(xt)). (7)

Temporal Prior Estimation pϕ(zt|zt 1): To enforce the conditional independence assumption in Eq. 5, we minimize the KL divergence between the posterior distribution and a prior distribution. This approach encourages the posterior to adopt the independence property as well, such that ˆzt|xt are mutually independent. To address the challenges of directly estimating the arbitrary density function pϕ(zt|zt 1), we introduce a transition prior module based on normalizing flows. This design represents the prior as a Gaussian distribution transformed by the Jacobian of the transition function, enabling efficient computing. Formally, j, {ˆzj t|j ˆst}, we formulate the prior module as ˆϵj t = ˆf 1 j (ˆzj t|ˆzt 1). This computation meets the requirement that fn to be invertible. Then the prior distribution of the j-th dimension of the temporal

dynamics, ˆzj t, can be computed as pϕ(ˆϵj t)| ˆ fj 1

ˆzj t | = pϕ( ˆf 1 j (ˆzj t|ˆzt 1))| ˆ fj 1

In addition, for any v, such that {ˆzv t |v ˆsc t}, we evaluate that ˆϵv t = ˆf 1 v (ˆzv t ). The prior distribution

is calculated by pϕ(ˆϵv t )| ˆ fv 1

ˆzv t | = pϕ( ˆf 1 v (ˆzv t ))| ˆ fv 1

ˆzv t | as ˆzv t is independent of ˆzt 1. Combing together, the total prior distribution is:

pϕ(ˆzt|ˆzt 1) =

n=1 pϕ(ˆϵn t )| ˆf 1 n ˆzn t | (8)

The flow model f in Eq. 8 is built with the MLP layers. For more details on the derivations of prior estimation, please refer to Appendix C.1.

Decoder pγ(ˆxt|ˆzt): The decoder pairs with our encoder to generate an reconstruct of the observation ˆxt from the estimated latent variables ˆzt, which consists of a stacked MLP followed by leaky Re LU activation:

ˆxt = Leaky Re LU(MLP(ˆzt)). (9)

4.2 LEARNING OBJECTIVE

In this work, we extend our learning objective from Sequential Variational Autoencoder Li & Mandt (2018) with a modified ELBO. In general, the ELBO implements the observational equivalence requirement from Definition 1, which ensures our learned model matches the data-generating distri-

Published as a conference paper at ICLR 2025

Figure 3: Mean Correlation Coefficient (MCC) scores for various methods for both Nonstationary and Stationary settings. dc means the size of sc t in a sequence. Higher MCC scores indicate better performance in identifying latent variables.

bution. We formulate the entire ELBO objective in the following:

t=1 Eˆzt qω log pγ(ˆxt|ˆzt)

| {z } LRecon

t=1 βEˆzt qω log q (ˆzt|xt) log pϕ(ˆzt|ˆzt 1)

| {z } LKLD

t=1 (|Jˆg,t|2,1 + |J ˆ f,t|1,1) +

t=2 |J ˆ f,t|2,1

| {z } Sparsity Regularization

where β is the hyperparameter to balance the two losses. The reconstruction loss LRecon minimizes the discrepancy between xt and ˆxt using mean-squared error.

The KL divergence loss LKLD serves dual theoretical purposes. It enforces the conditional independence assumption from Theorem 2 through the factorized prior pϕ(ˆzt|ˆzt 1) by Eq. 8, while simultaneously satisfying the sufficient variability assumption from Theorem 1 by encouraging diverse transitions in the latent space. When computing LKLD, we follow Yao et al. (2022a); Chen et al. (2024) employ a sampling method since the prior distribution lacks an explicit form.

The sparsity regularization implements the support sparsity of intermittent sequences through three terms. The L1 norm on decoder Jacobian columns Jˆg,t|2,1 enforces sparse mixing patterns. The L1 norm on transition Jacobian rows ||J ˆ f,t|1,1 ensures sparse transitions from zt 1 to zt. The L1 norm on transition Jacobian columns |J ˆ f,t|2,1 maintains consistent sparsity structure. Following standard practice, we use these L1 norms to approximate the L0 norm for differentiability.

5 EXPERIMENTS

5.1 SYNTHETIC EXPERIMENTS

Experimental Setup To evaluate Inter Latent ability to learn causal processes and identify latent variables in non-invertible scenarios, we conduct simulation experiments using random causal structures with specified sample and variable sizes. We generate synthetic dataset satisfying the identifiability assumptions outlined in Theorem 1 and 2 (details in Appendix D.1), considering both nonstationary (st varying across the sequence) and stationary (st constant throughout) settings.

Published as a conference paper at ICLR 2025

Figure 4: Visualization of the correlations between zt and ˆzt at time steps t = 2, 5, and 8. The top row represents a scatter plot on a nonstationary sequence, while the bottom row depicts a scatter plot on a stationary sequence. The red bounding boxes depicts the missing part of zt, i.e, {zu|u sc t}. The green bounding boxes highlight the latent variables that are component-wise identified for {zi t|i st}. The results confirm that Inter Latent successfully identifies {zi t|i st} in both nonstationary and stationary sequences. Also, we can observe that {zu|u sc t} is distinguishable from {zi t|i st}.

For each setting, we generate three scenarios of sequences, resulting in six scenarios in total. Each scenario has a particular value of dc t. This design allows us to assess the performance of Inter Latent under different complexities of missingness. The Mean Correlation Coefficient (MCC) serves as our evaluation metric, measuring latent factor recovery by computing absolute correlation coefficients between ground-truth and estimated latent variables. MCC scores range from 0 to 1, with higher values indicating better identifiability.

Results Figure 3 summarizes our main results on our simulations. We evaluate Inter Latent against several state-of-the-art approaches in identifying time-series causal variables and representation learning, such as LEAP Yao et al. (2022b), TDRL Yao et al. (2022a) and Ca Ri NG Chen et al. (2024). Additionally, we include classic representation learning approaches, such as Beta VAE Higgins et al. (2016), i-VAE Khemakhem et al. (2020) and Slow VAE Klindt et al. (2020).

The results from Figure 3 demonstrate that Inter Latent consistently achieves higher Mean Correlation Coefficient (MCC) across both nonstationary and stationary scenarios. For instance, in the nonstationary sequence with dc t = 1, Inter Latent outperforms all other methods by a substantial margin, exceeding 0.1 in MCC. We attribute the superior performance of Inter Latent to its capability of handling missingness in zt, a feature not present in the comparative methods. This key distinction enables our approach to more accurately capture the temporal dynamics of the latent variables. Figure 4 visualizes the disentanglement between the true latent variable and the estimations at different time steps from a sequence.

Ablation Study and Discussions To elucidate the key assumptions of our data generating process in Eq. 1, we further conduct ablation study focusing on the impact of sparse support. We introduce three baselines: (1) W/O s of f , which removes sparsity regularization on transition functions;

Published as a conference paper at ICLR 2025

(2) W/O s of g , which removes sparsity regularization on mixing functions; (3) WS , a weakly supervised variant drops all sparse regularizations as having access to st and sc t during training.

We summarize our experimental results in Figure 3. Inter Latent obtains the scores on par with WS baseline. This speaks the effectiveness of using the sparsity regularization terms against using st and sc t for g and f directly. The W/O S of f baseline, which assumes sc t = , yields a significantly lower Mean Correlation Coefficient (MCC) compared to Inter Latent approach. Similarly, W/O S of g fails to achieve competitive results due to its disregard for missingness in the mixing function. These outcomes confirm that without accounting for missing components, the baselines are unable to adequately model our simulated data.

5.2 REAL-WORLD EXPERIMENTS

Task setup To evaluate our proposed identification theories in complex real-world scenarios, we apply them to the task of Group Activity Recognition (GAR) using the Volleyball dataset Ibrahim et al. (2016). GAR aims to categorize the activity for an individual frames in multi-actor scenes, aligning well with our scenario of intermittent temporal latent processes. This is because not all actors participate in every activity, reflecting real-world dynamics where some may be occluded or out of view in fast-evolving sporting scenarios. In our implementation, each actor at a given time point is modeled as a specific component of the latent variables, with occluded or out-of-view actors treated as missing in the activity representations. This setup provides a solid testbed for our identification theory, allowing us to assess its robustness and effectiveness in handling real-world complexities such as partial observations and dynamic participant involvement.

Let x = {xn t }T,N t=1,n=1 denote a video consisting of T-frame observations and N agents. For each time step t and agent n, there exists a latent variable zn t {zn t }T,N t=1,n=1 that generates xn t according to Equation 1.

We take inspirations from the two-phase training pipeline from Li et al. (2024) to modify our training objective. First, we train Inter Latent using the objective function defined in Eq. 10. Subsequently, a classifier ˆc predicts the one-hot activity label ˆy from the learned sequence of latent representations ˆz1:T using an MLP: ˆy = MLP(Concat(ˆz1:T )). The classifier is trained using a crossentropy loss with a L1 regularization on its Jacobian: LCE cls = Eˆy one-hot(y) log(softmax(ˆy)) + |Jˆc|2,1, where one-hot(y) denotes the one-hot embedding of the true activity label. More data preprocessing details can be found in Appendix D.2.

Methods MCA SACRF Pramono et al. (2020) 83.3 AT Gavrilyuk et al. (2020) 84.3 SAM Yan et al. (2020) 86.3 DIN Yuan et al. (2021) 86.5 DFGAR Kim et al. (2022) 90.5 Hi GCIN Yan et al. (2023) 91.4 PAP Nakatani et al. (2024) 91.8 Dual-AI Han et al. (2022) 93.2 Bi Causal Zhang et al. (2024) 93.4 TDRL Yao et al. (2022a) 92.9 Ca Ri NG Chen et al. (2024) 94.0 Inter Latent 95.7 Table 1: Comparison with the state-of-the-art methods on Volleball dataset

Data and Comparing Methods The Volleyball dataset Ibrahim et al. (2016) contains 55 video recordings of volleyball games and is split into 3493 training clips and 1337 testing clips. The center frame of each clip is annotated with one group activity label out of eight labels (i.e. right set, right spike, right pass, right winpoint, left set, left spike, left pass, and left winpoint).

The comparing methods include the state-of-the-art methods on GAR task, such as SAM Yan et al. (2020), AT Gavrilyuk et al. (2020), ASACRF Pramono et al. (2020), DIN Yuan et al. (2021), DFGAR Kim et al. (2022), Hi GCIN Yan et al. (2023), PAP Nakatani et al. (2024), and Bi Causal Zhang et al. (2024). We also benchmark against TDRL Yao et al. (2022a) and Ca Ri NG Chen et al. (2024) to evaluate the efficacy of identifying the intermittent temporal latent process. For fair comparisons, Inter Latent adopts the Res Net-18 backbone He et al. (2016) and weakly supervised setting from Yan et al. (2020) for feature extractions from the RGB frames, which is also commonly utilized by other approaches.

Results and Discussions Table 1 presents a comparison of Multi-class Classification Accuracy (MCA) on the Volleyball dataset. Notably, Inter Latent demonstrates superior performance with respect to those do not consider the missingness in both transition and mixing functions, i.e., Ca Ri NG and TDRL. For example, Inter Latent achieves the highest accuracy of 95.7, significantly

Published as a conference paper at ICLR 2025

Figure 5: The visual examples of Inter Latent on Volleyball dataset. Highlighted frames show the annotated activity, with yellow bounding boxes indicating occluded actors. Inter Latent correctly predicts the three activities but misclassifies a video of left spike as left set . Note that the spike activity is performed by a actor that is severely occluded. This implies the misclassification may stem from that the label itself is not grounded by the true process.

surpassing the previous best result of 94.0 obtained by Ca Ri NG. Inter Latent also outperforms the state-of-the-art approaches on GAR, such as Dual-AI and Bi Causal by significant margins of 2.5 and 2.3 points, respectively.

Figure 5 illustrates the visual examples of activity classification outcomes produced by Inter Latent . The model demonstrates robust performance in handling occlusion-induced missingness, accurately categorizing activities in challenging scenarios. Figure 5a 5c showcase successful classifications of right set , left set and left pass respectively, despite partial occlusions of key players. We also present a failure case in Figure 5d, where Inter Latent misclassifies a left spike as a left set. However, this misclassification stems from that the label itself is not grounded by the true process, since the spike activity is performed by a player that is severely occluded.

6 CONCLUSION

We establish a set of novel identifiability results for intermittent latent temporal processes, extending the identifiability theory to scenarios where latent factors may be missing or inactive at different time steps. Specifically, we prove block-wise identifiability under assumptions on support sparsity, and further demonstrate component-wise identifiability within the support given conditional independence assumption. These results hold for both nonstationary and stationary transitions, accommodating a wide range of real-world temporal dynamics. Our theoretical findings are validated through experiments on both synthetic and real-world datasets, demonstrating the practical applicability of our approach. The proposed Inter Latent framework not only advances our understanding of complex temporal processes but also provides a principled method for uncovering hidden structures in time-delayed systems with variable latent factor participation. Future work could explore the application of this framework to related tasks such as temporal disentanglement, transfer learning in time series data, and causal discovery in dynamic systems. While we have demonstrated the effectiveness of our approach on visual-based task, the lack of other applications is a limitation of this work.

Published as a conference paper at ICLR 2025

Carlo Berzuini, Philip Dawid, and Luisa Bernardinell. Causality: Statistical perspectives and applications. John Wiley & Sons, 2012.

Philippe Brouillard, S ebastien Lachapelle, Alexandre Lacoste, Simon Lacoste-Julien, and Alexandre Drouin. Differentiable causal discovery from interventional data. Advances in Neural Information Processing Systems, 33:21865 21877, 2020.

Guangyi Chen, Yifan Shen, Zhenhao Chen, Xiangchen Song, Yuewen Sun, Weiran Yao, Xiao Liu, and Kun Zhang. Caring: Learning temporal causal representation under non-invertible generation process. ar Xiv preprint ar Xiv:2401.14535, 2024.

Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark. ar Xiv preprint ar Xiv:1906.07155, 2019.

Martin Danelljan, Gustav H ager, Fahad Khan, and Michael Felsberg. Accurate scale estimation for robust visual tracking. In British machine vision conference, Nottingham, September 1-5, 2014. Bmva Press, 2014.

Marco Fumero, Florian Wenzel, Luca Zancato, Alessandro Achille, Emanuele Rodol a, Stefano Soatto, Bernhard Sch olkopf, and Francesco Locatello. Leveraging sparse and shared feature activations for disentangled representation learning. Advances in Neural Information Processing Systems, 36:27682 27698, 2023.

Kirill Gavrilyuk, Ryan Sanford, Mehrsan Javan, and Cees GM Snoek. Actor-transformers for group activity recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 839 848, 2020.

Eric Ghysels, Jonathan B Hill, and Kaiji Motegi. Testing for granger causality with mixed frequency data. Journal of Econometrics, 192(1):207 230, 2016.

Mingfei Han, David Junhao Zhang, Yali Wang, Rui Yan, Lina Yao, Xiaojun Chang, and Yu Qiao. Dual-ai: Dual-path actor interaction learning for group activity recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2990 2999, 2022.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016.

Kaiming He, Georgia Gkioxari, Piotr Doll ar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961 2969, 2017.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations, 2016.

Mostafa S Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, and Greg Mori. A hierarchical deep temporal model for group activity recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1971 1980, 2016.

Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvarinen. Variational autoencoders and nonlinear ica: A unifying framework. In International conference on artificial intelligence and statistics, pp. 2207 2217. PMLR, 2020.

Dongkeun Kim, Jinsung Lee, Minsu Cho, and Suha Kwak. Detector-free weakly supervised group activity recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20083 20093, 2022.

David Klindt, Lukas Schott, Yash Sharma, Ivan Ustyuzhaninov, Wieland Brendel, Matthias Bethge, and Dylan Paiton. Towards nonlinear disentanglement in natural data with temporal sparse coding. ar Xiv preprint ar Xiv:2007.10930, 2020.

Published as a conference paper at ICLR 2025

Lingjing Kong, Shaoan Xie, Weiran Yao, Yujia Zheng, Guangyi Chen, Petar Stojanov, Victor Akinwande, and Kun Zhang. Partial disentanglement for domain adaptation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 11455 11472. PMLR, 17 23 Jul 2022.

Lingjing Kong, Guangyi Chen, Biwei Huang, Eric P Xing, Yuejie Chi, and Kun Zhang. Learning discrete concepts in latent hierarchical models. ar Xiv preprint ar Xiv:2406.00519, 2024.

S ebastien Lachapelle, Tristan Deleu, Divyat Mahajan, Ioannis Mitliagkas, Yoshua Bengio, Simon Lacoste-Julien, and Quentin Bertrand. Synergies between disentanglement and sparsity: Generalization and identifiability in multi-task learning. In International Conference on Machine Learning, pp. 18171 18206. PMLR, 2023.

S ebastien Lachapelle, Pau Rodr ıguez L opez, Yash Sharma, Katie Everett, R emi Le Priol, Alexandre Lacoste, and Simon Lacoste-Julien. Nonparametric partial disentanglement via mechanism sparsity: Sparse actions, interventions and sparse temporal dependencies. ar Xiv preprint ar Xiv:2401.04890, 2024a.

S ebastien Lachapelle, Divyat Mahajan, Ioannis Mitliagkas, and Simon Lacoste-Julien. Additive decoders for latent variables identification and cartesian-product extrapolation. Advances in Neural Information Processing Systems, 36, 2024b.

Yingzhen Li and Stephan Mandt. Disentangled sequential autoencoder. ar Xiv preprint ar Xiv:1803.02991, 2018.

Yuke Li, Guangyi Chen, Ben Abramowitz, Stefano Anzellotti, and Donglai Wei. Learning causal domain-invariant temporal dynamics for few-shot action recognition. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id= Lvuu Yq U0BW.

Zijian Li, Ruichu Cai, Guangyi Chen, Boyang Sun, Zhifeng Hao, and Kun Zhang. Subspace identification for multi-source domain adaptation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=BACQLWQW8u.

Lars Lorch, Jonas Rothfuss, Bernhard Sch olkopf, and Andreas Krause. Dibs: Differentiable bayesian structure learning. In Neur IPS, pp. 24111 24123, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/ ca6ab34959489659f8c3776aaf1f8efd-Abstract.html.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

Hiroshi Morioka and Aapo Hyvarinen. Causal representation learning made identifiable by grouping of observational variables. In Forty-first International Conference on Machine Learning, 2024.

Chihiro Nakatani, Hiroaki Kawashima, and Norimichi Ukita. Learning group activity features through person attribute prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18233 18242, 2024.

Ignavier Ng, Yujia Zheng, Xinshuai Dong, and Kun Zhang. On the identifiability of sparse ica without assuming non-gaussianity. Advances in Neural Information Processing Systems, 36:47960 47990, 2023.

Rizard Renanda Adhi Pramono, Yie Tarng Chen, and Wen Hsien Fang. Empowering relational network by self-attention augmented conditional random fields for group activity recognition. In European Conference on Computer Vision, pp. 71 90. Springer, 2020.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence, 39(6):1137 1149, 2016.

Published as a conference paper at ICLR 2025

Xiangchen Song, Weiran Yao, Yewen Fan, Xinshuai Dong, Guangyi Chen, Juan Carlos Niebles, Eric Xing, and Kun Zhang. Temporally disentangled representation learning under unknown nonstationarity. Neur IPS, 2023.

Xiangchen Song, Zijian Li, Guangyi Chen, Yujia Zheng, Yewen Fan, Xinshuai Dong, and Kun Zhang. Causal temporal representation learning with nonstationary sparse transition. ar Xiv preprint ar Xiv:2409.03142, 2024.

Julius von K ugelgen, Michel Besserve, Liang Wendong, Luigi Gresele, Armin Keki c, Elias Bareinboim, David Blei, and Bernhard Sch olkopf. Nonparametric identifiability of causal representations from unknown interventions. Advances in Neural Information Processing Systems, 36, 2024.

Thadd aus Wiedemer, Prasanna Mayilvahanan, Matthias Bethge, and Wieland Brendel. Compositional generalization from first principles. Advances in Neural Information Processing Systems, 36, 2024.

Danru Xu, Dingling Yao, S ebastien Lachapelle, Perouz Taslakian, Julius von K ugelgen, Francesco Locatello, and Sara Magliacane. A sparsity principle for partially observable causal representation learning. ar Xiv preprint ar Xiv:2403.08335, 2024.

Rui Yan, Lingxi Xie, Jinhui Tang, Xiangbo Shu, and Qi Tian. Social adaptive module for weaklysupervised group activity recognition. In Computer Vision ECCV 2020: 16th European Conference, Glasgow, UK, August 23 28, 2020, Proceedings, Part VIII 16, pp. 208 224. Springer, 2020.

Rui Yan, Lingxi Xie, Jinhui Tang, Xiangbo Shu, and Qi Tian. Higcin: Hierarchical graph-based cross inference network for group activity recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 45(06):6955 6968, 2023.

Dingling Yao, Danru Xu, Sebastien Lachapelle, Sara Magliacane, Perouz Taslakian, Georg Martius, Julius von K ugelgen, and Francesco Locatello. Multi-view causal representation learning with partial observability. In The Twelfth International Conference on Learning Representations, 2024.

Weiran Yao, Guangyi Chen, and Kun Zhang. Temporally disentangled representation learning. Advances in Neural Information Processing Systems, 35:26492 26503, 2022a.

Weiran Yao, Yuewen Sun, Alex Ho, Changyin Sun, and Kun Zhang. Learning temporally causal latent processes from general temporal data. In International Conference on Learning Representations, 2022b.

Hangjie Yuan, Dong Ni, and Mang Wang. Spatio-temporal dynamic inference network for group activity recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7476 7485, 2021.

Kun Zhang, Shaoan Xie, Ignavier Ng, and Yujia Zheng. Causal representation learning from multiple distributions: A general setting. In Forty-first International Conference on Machine Learning.

Youliang Zhang, Wenxuan Liu, Danni Xu, Zhuo Zhou, and Zheng Wang. Bi-causal: Group activity recognition via bidirectional causality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1450 1459, 2024.

Yujia Zheng and Kun Zhang. Generalizing nonlinear ica beyond structural sparsity. Advances in Neural Information Processing Systems, 36, 2024.

Yujia Zheng, Ignavier Ng, and Kun Zhang. On the identifiability of nonlinear ica: Sparsity and beyond. Advances in neural information processing systems, 35:16411 16422, 2022.

Yujia Zheng, Biwei Huang, Wei Chen, Joseph Ramsey, Mingming Gong, Ruichu Cai, Shohei Shimizu, Peter Spirtes, and Kun Zhang. Causal-learn: Causal discovery in python. Journal of Machine Learning Research, 25(60):1 8, 2024. URL http://jmlr.org/papers/v25/ 23-0970.html.

Published as a conference paper at ICLR 2025

Table of notions

Data and estimations

xt RK Observations ˆxt RK Reconstructions zt RN Latent variables ˆzt RN Latent variable estimations ϵt True noise term ˆϵt Estimated noise term st The support of zt ˆst The support of ˆzt sc t The missingness of zt ˆsc t The missingness of ˆzt dt The cardinality of st ˆdt The cardinality of ˆst dc t The cardinality of sc t ˆdc t The cardinality of ˆsc t

t [1, T] Time step n [N] The indices of zt i st The indices of zt within st j ˆst The indices of ˆzt within ˆst u sc t The indices of zt within sc t v ˆsc t The indices of ˆzt within ˆsc t

Ground-truth & Learned model

g True mixing function ˆg Learned mixing function f True transition function ˆf Learned transition function Jg Jacobian of g Jˆg Jacobian of ˆg Jf Jacobian of f J ˆ f Jacobian of ˆf

Optimizations

ϕ True parameters of f ˆϕ Learned parameters of ˆf γ True parameters of g ˆγ Learned parameters of ˆg ω True parameters of encoder ˆω Learned parameters of encoder | |2,1 L1 norm on columns of | |1,1 L1 norm on rows of

B PROOF OF THEOREM1 AND THEOREM2

In this section, we provide proof of our identifiability results in Theorem1 and Theorem2. To this end, we take insiprations from Lemma B.1 of Lachapelle et al. (2023) to present a Lemma that is throughout our proof.

Lemma 1 (Invertible matrix contains a permutation) Let L RN N be an invertible matrix. Then, there exists a permutation σ such that Ln,σ(n) = 0 for all n, In other words, P L where P is the permutation matrix associated with σ, i.e. Pen = eσ(n).

Proof: Since the matrix L is invertible, its determinant is nonzero. We can obtain the following with the assistance of Leibniz formula:

|det L| = X

σ S sign(σ) Y

i Ln,σ(n) = 0 (11)

where S denotes a set of permutation. Eq. 11 suggests that at least one term of the sum is non-zero, meaning there exists σ S, such that n, Ln,σ(n) = 0.

B.1 PROOF OF THEOREM 1

Theorem 1 is where most of the theoretical contribution of this work lies. Let us recall Theorem 1:

Published as a conference paper at ICLR 2025

Theorem 1 (Block-wise identifiability): For the observations xt RK and estimated latent variables ˆzt RN, suppose there exist functions ˆg satisfying observational equivalence in Eq. 4 If the following assumptions and regularization hold:

i (Smoothness and positivity): The probability density function of the latent causal variables, p(zt), has positive measure in the space of zt and is twice continuously differentiable. ii (Path connected): For any z0, z1 Z, there is a continous function ϕ : [0, 1] Z, s.t. ϕ(0) = z0 and ϕ(1) = z1 . iii (Sufficient variability of zt and ˆzt): Let q(zt|zt 1) = log p(zt|zt 1), as well as q(ˆzt|ˆzt 1) = log p(ˆzt|ˆzt 1), and Hzt,zt 1q(zt|zt 1) denotes the Hessian matrix of q(zt|zt 1) w.r.t. zt and zt 1. Suppose Gzt {0, 1}N N as a binary adjacency matrix that indicates the existence of the transitions from zt 1 and zt. Gzt i1,i2 = 1 means that there exists a transition from zi1 t 1 to zi2 t .

We assume that: span{Hzt,zt 1q(zt|zt 1)}dt i=1 = Rdt dt Gzt and span{Hˆzt,ˆzt 1q(ˆzt|ˆzt 1)} ˆdt j=1 =

R ˆdt ˆdt ˆ Gˆzt iv (Support sparsity regularization): For any time step t, st is not an empty set, ˆdt dt

There exists a permutation σ, such that

ˆst = σ(st) and ˆsc t = σ(sc t)

In other words, both i, {zi t|i st} and u, {zu t |u sc t} are block-wise identifiable.

Proof: Taking inspirations from Zheng & Zhang (2024), applying the chain rule to our definition in Definition 1 leads to:

xt = ˆxt = g(zt) = g(h 1(ˆzt)) = ˆg(ˆzt) = Jg(zt) Jh 1(ˆzt) = Jˆg(ˆzt), (12)

where Jg, Jh 1, and Jˆg denote the Jacobian matrices of g, h 1, and ˆg, respectively. Eq. 12 provides a rigorous definition of observational equivalence in the context of intermittent temporal latent processes, establishing the relationship between the true and learned models through their distributions and Jacobians.

We follow Yao et al. (2022b;a); Song et al. (2023); Chen et al. (2024); Song et al. (2024) to connect Jh 1(ˆzt) in Eq. 12 with the transition probability density function p(zt|zt 1) as we work on identification over the time-series data. Given the fact that p(zt|zt 1) = p(zt|g(zt 1)) = p(zt|xt 1) as well as p(ˆzt|ˆzt 1) = p(ˆzt|g(ˆzt 1)) = p(zt|xt 1), we are able to map (xt 1, zt) to (xt 1, ˆzt) with

the jaocbian I 0 0 Jh(zt)

p(zt|xt 1) = p(ˆzt|ˆxt 1)|det Jh(zt)| p(zt|zt 1) = p(ˆzt|ˆzt 1)|det Jh(zt)|, (13)

where h is an invertible mapping, such that ˆzt = h(zt). |det Jh(zt)| denotes the determinant of h(zt).

Taking the logarithm on both sides of Eq. 13, we have:

log p(zt|zt 1) log |det Jh(zt)| = log p(ˆzt|ˆzt 1). (14)

We replace q(ˆzt|ˆzt 1) = log p(ˆzt|ˆzt 1)) as well as q(zt|zt 1) = log p(zt|zt 1), and calculate the Hessian with ˆzt and ˆzt 1 on both sides of Eq. 14 using change-of-variable and chain rule:

Hˆzt,ˆzt 1q(ˆzt|ˆzt 1) = (Jh 1(ˆzt)) Hzt,zt 1q(zt|zt 1)Jh 1(ˆzt 1). (15)

We can rewrite Eq. 15 based on assumption iii by:

span{Hˆzt,ˆzt 1q(ˆzt|ˆzt 1)} ˆdt j=1 = (Jh 1(ˆzt)) span{Hzt,zt 1q(zt|zt 1)}dt i=1Jh 1(ˆzt 1), (16)

where denotes the Hadmard product.

For any i1, i2 st, (i1 = i2) we can obtain the standrad one-hot basis vector ei1 and ei2 of Hessian matrix, such that ei1(ei2) Rdt dt Gzt . Eq. 16 indicates the existence of a permutation matrix P associated with the permutation σ of Jh 1 (Lemma 1), such that:

P ei1 (ei2) P = eσ(i1)(eσ(i2)) R ˆdt ˆdt ˆ Gˆzt, (17)

Published as a conference paper at ICLR 2025

which implies:

(σ(i1), σ(i2)) Hˆzt,ˆzt 1q(ˆzt|ˆzt 1). (18)

The support sparsity constraint suggests that:

ˆdt = |Hˆzt,ˆzt 1q(ˆzt|ˆzt 1)|1,0 |Hzt,zt 1q(zt|zt 1)|1,0 = dt

where | |1,0 denotes the ℓ0-norm of the rows of matrix .

Combining this with Equation 18, we can conclude that:

ˆst = σ(st). (19)

If this were not the case, there would exist a pair (i 1, i 2) Gzt, where i 1 = i 2, that contradicts Equation 18.

Eq. 19 suggests that, i st and v ˆsc t:

zi t ˆzv t = 0 (20)

Also, as h is a invertible mapping, we can conclude that det(J 1 h 1) = 0. Therefore, ˆzj t zu t = 0, which supports conclusion of block-wise identification.

Generalize from zt 1 to z<t: Notably we can easily generalize Theorem 1 by replacing zt 1 with z<t, and ˆzt 1 with ˆz<t from Eq. 13 to Eq. 16, respectively. In other words, we can extend the conditional probability density function into a non-markov setting. Accordingly, Gzt {0, 1}dt dt(t 1)

and Gˆzt {0, 1} ˆdt ˆdt(t 1), if both z<t and ˆz<t start from t = 1 in Eq. 13.

B.2 PROOF OF THEOREM 2

Theorem 1 allows us to further explore the identifiability i, {zi t|i st}. In what follows, we provide the proof of Theorem 2 in details.

Theorem 2 (Component-wise identifiability i, {zi t|i st}): Let all assumptions from the Theorem hold. Additionally, suppose the following assumption is true for data generating process in Eq. 1 as well:

i (Conditional independence): At t, we assume that each component of zt is conditional independent given the previous latent variables zt 1. For any i1, i2 [N]:

zi1 t zi2 t |zt 1

Then for {ˆzj t|j ˆst} must be a component-wise transformation of a permuted version of true {zi t|i st}.

Proof: Following previous works Zheng & Zhang (2024), our goal can be rewritten as demonstrating that Jh 1(ˆzt) = D(ˆzt)P, where D denotes an diagonal matrix. P is a permutation matrix that is defined in Lemma 1, and has been proven in Theorem 1. If Jh 1(ˆzt) = D(ˆzt)P, there must exist i1 and i2 (i1 = i2), such that j1, j2 J:,i1 g 1, and j2 J:,i2 h 1. J:,i1 h 1 is the i1-th column of Jh 1, which

corresponds to zi1 t . Similarly, J:,i2 h 1 corresponds to zi2 t . Given Eq. 15, we can obtain:

ˆzj1 t , ˆzj2 t J:,i1 h 1(ˆzt) Hzt,zt 1q(zt|zt 1)Jh 1(ˆzt 1). (21)

ˆzj2 t J:,i2 h 1(ˆzt) Hzt,zt 1q(zt|zt 1)Jh 1(ˆzt 1). (22)

Therefore, zi1 t and zi2 t are dependent as given they are both dependent on ˆzj2 t . This contradicts with our conditional independence assumption.

Published as a conference paper at ICLR 2025

B.3 EXTENSION OF TEMPORAL SUPPORT SPARSITY

Proosition 1(Identifiability under temporal support sparsity): In addition to the assumptions of Theorem 1, if the following assumption and regularization hold:

i (positivity and independence of the support): For any time step t, there exists the probability density function of the support, p(st), has positive measure in the space of st. The support at any time step t is independent of the supports at other time steps, thus can be factorized by

p(s1:T ) = TQ

ii (temporal support sparsity regularization): For any time step t, st is not an empty set, E( ˆd1:T ) E(d1:T )

There exists a permutation σ, such that

ˆst = σ(st) and ˆsc t = σ(sc t)

In other words, both i, {zi t|i st} and u, {zu t |u sc t} are block-wise identifiable.

We start from Eq. 18 in Theorem 1 to prove the Theorem 3. Let Ht = Hzt,zt 1q(zt|zt 1), and ˆHt = Hzt,zt 1q(zt|zt 1), the expected sparsity constraint can be reformulated by:

E(d1:T ) = E|H1:T |1,0 = Ep(s1:T )E(

n=1 1(Hn,: t = 0)|st)

= Ep(s1:T )

n=1 E(1(Hn,: t = 0)|st)

= Ep(s1:T )(

n=1 PHt|st(Hn,: t = 0)) (23)

where 1( ) denotes the indicator function of , PH|st denotes the

Let J = Jh 1(ˆzt 1), and J 1 = Jh 1(ˆzt). We can perform the similar steps to obtain:

E( ˆd1:T ) = E| ˆH1:T |1,0 = Ep(s1:T )E(

n=1 1(J 1Hn,: t J = 0)|st)

= Ep(s1:T )

n=1 E(1(J 1Hn,: t J = 0)|st)

= Ep(s1:T )(

n=1 PHt|st(J 1Hn,: t J = 0)) (24)

The temporal support sparsity constraint suggests that: E( ˆd1:T ) E(d1:T ), which leads to

n=1 PHt|st(J 1Hn,:J = 0)) Ep(s1:T )(

n=1 PHt|st(Hn,: = 0)) 0

= Ep(s1:T )(

n=1 (PHt|st(J 1Hn,: t J ) PHt|st(Hn,: t )) 0 (25)

Eq. 18 suggests that n [1, N], σ(n), s.t.PHt|st(Hn,: t ) = (PHt|st(J 1Hσ(n),: t J ), the L.H.S. of Eq. 25 is a sum of non-negative terms which is itself non-positive. This means that every term in the sum is zero. The rest of the proof remains the same with Theorem 1 to obtain the block-wise identifiability. Moreover, if the conditional independence assumption in Theorem 2 holds, we can further obtain the component-wise identifiability.

Published as a conference paper at ICLR 2025

B.4 ESTIMATING st AND MODIFIED ELBO

The temporal support sparsity in Theorem 3 requires to obtain p(s1:T ) for the identifiability results. In order to allow for gradient-based optimization of ˆs1:T , we take inspirations from the structure learning Brouillard et al. (2020); Lorch et al. (2021) to treat ˆst ˆsc t = S as a 1 N vector. Each entries of this vector is a independent Bernoulli random variable with probability of success σ(αn), where σ is the sigmoid function and αn is a parameter learned using the Gumbel-Softmax trick. Accordingly, our ELBO needs to be modified as following:

LELBO = LRecon + LKLD, subject to ES σ(α)|S| β (26)

where β is an hyperparameter (which should be set ideally to the true dt) and S σ(α) means that each entry of S are independent and distributed according to σ(α). Comparing to Eq. 10, Eq. 26 drops the sparsity regularization terms as we use Gumbel-Softmax instead.

C IMPLEMENTATION DETAILS

C.1 PRIOR LIKELIHOOD DERIVATION

Consider a paradigmatic instance of latent causal processes. In this case, we are concerned with two time-delayed latent variables, namely, zt = [z1 t, z2 t]. We set time lag is defined as 1 for simplicity. This implies that each latent variable, zn t , is formulated as zn t = fn(Pa(zn t ), ϵn t ), where Pa(zn t ) zt 1 is the parent of zn t . The noise terms, ϵn t , are mutually independent. To represent this latent process more succinctly, we introduce a transformation map, denoted as f. It s worth noting that in this context, we employ an overloaded notation; specifically, the symbol f serves dual purposes, representing both transition functions and the transformation map.

z1 t 1 z2 t 1 z1 t z2 t

z1 t 1 z2 t 1 ϵ1 t ϵ2 t

By leveraging the change of variables formula on the map f, we can evaluate the joint distribution of the latent variables p(z1 t 1, z2 t 1, z1 t, z2 t) as:

p(z1 t 1, z2 t 1, z1 t, z2 t) = p(z1 t 1, z2 t 1, ϵ1 t, ϵ2 t)/ |det Jf| , (28)

where Jf is the Jacobian matrix of the map f, which is naturally a low-triangular matrix:

1 0 0 0 0 1 0 0 z1 t z1 t 1

z1 t z2 t 1

z1 t ϵ1 t 0

z2 t z1 t 1

z2 t z2 t 1 0 z2 t ϵ2 t

Given that this Jacobian is triangular, we can efficiently compute its determinant as Q

zn t ϵn t . Further-

more, because the noise terms are mutually independent, and hence ϵ1 t ϵ2 t, and ϵt zt 1, we can write Eq. 28 as:

p(z1 t 1, z2 t 1, z1 t, z2 t) = p(z1 t 1, z2 t 1) p(ϵ1 t, ϵ2 t)/ |det Jf| (because ϵt zt 1)

= p(z1 t 1, z2 t 1) Y

i p(ϵi t)/ |det Jf| (because ϵ1 t ϵ2 t). (29)

Let {f 1 n }n=1,2,3... be a set of learned inverse dynamics transition functions that take the estimated latent causal variables in the dynamics subspace and lagged latent variables, and output the noise terms, i.e., ϵn t = f 1 n (zn t , Pa(zn t )). By eliminating the marginals of the lagged latent variable p(z1 t 1, z2 t 1) on both sides, we derive the total transition prior likelihood as:

p(z1 t, z2 t|z1 t 1, z2 t 1) = Y

n p(ϵi t)/ |det Jf| = Y

n p(f 1 n (zn t , Pa(zn t ))) det J 1 f (30)

Published as a conference paper at ICLR 2025

in which, i, {zi t|i st}, the prior likelihood is:

p(zi t|zt 1) = Y

i p(ϵi t)/ |det Jf| = Y

i p(f 1 i zi t, Pa(zi t) ) det J 1 f . (31)

Then, u, {zu t |u sm t }, given Pa(zu t ) = , the prior likelihood is:

p(zu t ) = Y

u p(ϵu t )/ |det Jf| = Y

u p(f 1 u (zu t )) det J 1 f . (32)

C.2 NETWORK ARCHITECTURES

Configuration Description Output dimensions

Encoder Input: concat(x1:T ) BS T K Dense 128 neurons, Leaky Re LU BS T 128 Dense 128 neurons, Leaky Re LU BS T 128 Dense Temporal embeddings BS T 2N Bottleneck Compute mean and variance of posterior µ, σ Reparameterization Sequential sampling ˆz1:T

Decoder Input: ˆz1:T BS T N Dense 128 neurons, Leaky Re LU BS T 128 Dense 128 neurons, Leaky Re LU BS T 128 Dense input embeddings BS T K

Temporal prior module Input ˆz1:T BS T N Inverse Transition ˆϵt BS T N Jacobian Compute log |det Jf| BS

Table 2: The details of our network architectures for Inter Latent , where BS means batch size.

Table 2 summarizes the network architectures of Inter Latent .

C.3 TRAINING DETAILS

Simulation Experiments

We implemented our models using Py Torch 1.11.0. For optimization, we employed the Adam W optimizer Loshchilov & Hutter (2019), which has been shown to improve generalization performance in deep learning models. The hyperparameters were set as follows: learning rate of 1e-3 and minibatch size of 64. To ensure robustness and statistical significance, we trained each model under 10 different random seeds and report the overall performance as mean standard deviation across these runs. The loss function balances reconstruction error and KL-divergence, with the latter weighted by β = 0.02. This choice of β was determined through preliminary experiments to achieve an optimal trade-off between reconstruction quality and latent space regularity. All experiments were conducted on a single NVIDIA Ge Force RTX 2080 Ti GPU with 11GB meory.

Real-World Experiments

We employ the Adam W optimizer with cosine annealing for training our network. The initial learning rate is set to 2e-3, with a weight decay of 1e-2 to mitigate overfitting. For all video sequences in Volleyball dataset, we uniformly sample T = 10 frames as input. The ELBO loss is computed with a β value of 0.02. We utilize a batch size of 128, which we found to provide a good trade-off between computational efficiency and optimization stability. The network is implemented using Py Torch [2], leveraging its dynamic computational graph and GPU acceleration capabilities. Training

Published as a conference paper at ICLR 2025

is conducted for 80 epochs on a multi-GPU setup consisting of four NVIDIA Ge Force RTX 2080 Ti GPUs, providing a total of 44GB of meory.

D ADDITIONAL EXPERIMENTS DETAILS

D.1 SYNTHETIC DATA GENERATION PROCESS

Our approach generates six distinct scenarios of sequences, encompassing both stationary and nonstationary settings with varying degrees of missingness. Each sequence consists of 9 time steps, with latent variables zt R5 and observations xt R5. Missingness is introduced by selecting a constant value dc {1, 2, 3} for each sequence, representing the number of missing dimensions throughout that sequence. The set of missing dimensions, sc t, is then determined based on dc t. We generate six scenarios in total: (1). non-stationary sequences with dc t = 1; (2). non-stationary sequences with dc t = 2; (3). non-stationary sequences with dc t = 3; (4). stationary sequences with dc t = 1; (5). stationary sequences with dc t = 2; (6). stationary sequences with dc t = 3. In nonstationary sequences, sc t varies every 3 time steps, while in stationary sequences, it remains fixed throughout.

For each scenario, the data generation process begins with 10,000 initial states drawn from z0 U(0, 1). From t = 1 to t = 9, zt within st is generated using a nonlinear function f with non-

additive, zero-biased Gaussian noise ϵi t, where (σ = 0.1): i st, zi t = fi(z {i }dt i =1 t 1 , ϵi t), where

z {i }dt i =1 t 1 is the set of zi t 1 within st 1. The missing dimensions are set as u sc t, zu t = fu(ϵt). Observations are then generated using a mixing function g that only considers zt within st: xt =

g(z {i}dt i=1 t ), where z {i}dt i=1 t is the set of zi t within st.

D.2 ADDITOINAL DETAILS OF THE VOLLEYBALL DATASET

Our preprocessing and feature extraction pipeline builds upon the procedure from Yan et al. (2023). We leverage a pretrained Faster R-CNN model Ren et al. (2016) implemented via the MMDetection toolbox Chen et al. (2019) to detect potential persons in each frame. These detections are then tracked across frames using the method proposed by Danelljan et al. (2014). For feature extraction, we utilize Res Net-18 He et al. (2016). We apply Ro IAlign He et al. (2017) with a crop size of 5 5. The resulting features are embedded into a K = 1024 vector. we select the top N = 20 person proposals based on detection confidence scores.