# Reduced-Rank Linear Dynamical Systems

Qi She (1,3), Yuan Gao (2), Kai Xu (4,5), Rosa H. M. Chan (3)
1 Princeton Neuroscience Institute, Princeton University
2 Tencent AI Lab
3 Department of Electronic Engineering, City University of Hong Kong
4 Department of Computer Science, Princeton University
5 School of Computer Science, National University of Defense Technology

Abstract

Linear Dynamical Systems are widely used to study the underlying patterns of multivariate time series. A basic assumption of these models is that high-dimensional time series can be characterized by some underlying, low-dimensional and time-varying latent states. However, existing approaches to LDS modeling mostly learn the latent space with a prescribed dimensionality. When dealing with short-length, high-dimensional time series data, such models are easily overfitted. We propose Reduced-Rank Linear Dynamical Systems (RRLDS) to automatically retrieve the intrinsic dimensionality of the latent space during model learning. Our key observation is that the rank of the dynamics matrix of an LDS captures the intrinsic dimensionality, and variational inference with a reduced-rank regularization leads to a concise, structured, and interpretable latent space. To enable our method to handle count-valued data, we introduce the dispersion-adaptive distribution to accommodate the over-, equal-, and under-dispersed nature of such data. Results on both simulated and experimental data demonstrate that our model robustly learns the latent space from short-length, noisy, count-valued data and significantly improves prediction performance over state-of-the-art methods.

Introduction

Deciphering the latent structure of high-dimensional time series is one of the fundamental problems of Artificial Intelligence, with applications ranging from social science and economics to biology (Linderman, Stock, and Adams 2014; She, So, and Chan 2015; She, Chen, and Chan 2016; So et al. 2016; Hein et al. 2016). In such settings, many studies and theories posit that high-dimensional time series are noisy observations of some underlying, low-dimensional, and time-varying signal of interest (Pfau, Pnevmatikakis, and Paninski 2013; Archer et al. 2014; Sussillo et al. 2016). Linear Dynamical Systems (LDS) have been employed to extract a low-dimensional implicit network structure from observed multivariate time series data (Archer et al. 2014; Lakshmanan et al. 2015; Linderman et al. 2017), capturing the variability of the observations both spatially and temporally.

Figure 1: Latent trajectories reconstructed from (a) an unconstrained dynamics matrix and (b) a reduced-rank dynamics matrix (different colors indicate different simulated trials). The low-dimensional manifold in (b) is smoother and better structured.

However, two main challenges exist when using LDS to retrieve an optimal implicit network structure. First, existing models need a predefined latent dimensionality. To ensure sufficient model capacity, it is typically set to a large value, which causes overfitting when modeling short-length, high-dimensional time series. This is troublesome because short-length time series arise in many real-world scenarios.
For example, in neuroscience, we cannot obtain long sequences of high-quality neural data in experiments because of (i) the short lifetime of some neurons, (ii) the limited viable time of recording materials, and (iii) the micro-movement of recording electrodes while the animal is active (Spira and Hai 2013). In the clinical domain, the length of patient clinical data is usually less than 50 because the hospitalization period for most patients is less than two weeks (Banaee, Ahmed, and Loutfi 2013). In economics, multivariate time series such as gross domestic product and the consumer price index are measured quarterly or yearly, which results in short-length data.

Second, real-world time series data are often count-valued (rather than real-valued). Applying the standard LDS, which assumes the observations follow a Gaussian distribution, is infeasible (She, So, and Chan 2016). Examples include multiple spike trains recorded from neural populations (Paninski et al. 2010) and trades on the S&P 100 index (Linderman and Adams 2014), to name just a few. Extensions of the model to handle the count nature of the data are therefore necessary. Recently, the Poisson linear dynamical system (PLDS) (Buesing et al. 2014) was proposed for count-data modeling. Nevertheless, the Poisson assumption implies equal dispersion of the observations, i.e., the conditional mean and variance are equal. This limits PLDS in characterizing neural spike counts, which are commonly observed to be either over- or under-dispersed (variance greater or less than the mean) (Churchland et al. 2010; She, Jelfs, and Chan 2016). Without a proper distribution capturing the dispersion of count data, one cannot learn the variability of the data, and thus fails to infer an optimal implicit network.

In view of these limitations, we propose a novel solution to infer the implicit network from short-length, noisy count data. We focus on the dynamics matrix of the LDS, which represents the influence that one latent node exerts on the subsequent activity of another; in other words, the dynamics matrix governs the node evolution of the implicit network. Our key observation is that the rank of the dynamics matrix captures the intrinsic dimensionality of the state space of the nodes. To prevent the LDS from overfitting given short-length data, we seek to learn a compact, low-rank dynamics matrix. Specifically, we compose two different low-rank priors for the dynamics matrix, i.e., a multivariate Laplacian prior and a nuclear norm prior, which offer similar performance for retrieving the intrinsic dimensionality and are widely applied in different scenarios (Gao and Yuille 2017; Gao, Ma, and Yuille 2017). Moreover, to facilitate learning a reduced-rank dynamics matrix from count data, we introduce the dispersion-adaptive (DA) distribution and develop a novel, flexibly parameterized observation model.

Figure 1 demonstrates the advantage of a reduced-rank dynamics matrix with the DA distribution in recovering low-dimensional manifolds from short-length, noisy, count-valued time series. The observations are 40-dimensional (40D) time series, modeled with a 10D dynamics matrix (same initial states). Our method successfully retrieves the three intrinsic dimensions from the dynamics matrix, leading to a smoother and better-structured manifold (indicated by the three-dimensional curves), while the method with an unconstrained dynamics matrix fails.
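To make this setting concrete, the following minimal Python sketch (our illustration, not code from the paper; the variable names and the simple Poisson emission are assumptions made only for brevity) generates count-valued time series from a latent LDS whose 10D dynamics matrix has rank 3, mirroring the simulated setup behind Figure 1.

```python
# Illustrative sketch only: count observations from a low-rank latent LDS.
import numpy as np

rng = np.random.default_rng(0)
q, n, rank, T = 40, 10, 3, 100            # observed dim, latent dim, intrinsic rank, length

# Rank-3 dynamics matrix A = U V^T, rescaled so the system stays stable.
U = rng.normal(size=(n, rank))
V = rng.normal(size=(n, rank))
A = U @ V.T
A *= 0.95 / np.max(np.abs(np.linalg.eigvals(A)))

C = rng.normal(scale=0.3, size=(q, n))    # emission matrix
x = np.zeros((T, n))
y = np.zeros((T, q), dtype=int)
x[0] = rng.normal(size=n)
for t in range(T):
    if t > 0:
        x[t] = A @ x[t - 1] + rng.normal(scale=0.1, size=n)
    rate = np.exp(C @ x[t])               # Poisson emission as the simplest count model
    y[t] = rng.poisson(rate)

print("rank of A:", np.linalg.matrix_rank(A), "| mean count:", y.mean())
```

Even though the latent space is nominally 10-dimensional, all of the temporal structure in such data lives in a 3-dimensional subspace; RRLDS aims to recover exactly this rank from short, noisy count observations.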
In summary, our contributions are four-fold:
- We propose to retrieve the intrinsic dimensionality of multivariate time series by imposing two reduced-rank structures on the dynamics matrix.
- We introduce a count-valued exponential-family distribution (the DA distribution) that captures the dispersion of count data, and derive a variety of commonly used distributions as special cases.
- We use a latent, reduced-rank linear dynamical model to modulate the expectation of the DA observation distribution, forming a novel linear dynamical system.
- We develop a Variational Bayes Expectation-Maximization (VBEM) algorithm by extending current state-of-the-art methods to the new model.

Our framework is evaluated against baseline methods on both simulated and real-world data. The promising performance demonstrates that our method is able to: (1) automatically reduce redundant dimensions of the latent state space, which prevents overfitting when a large number of latent states is predefined; (2) significantly improve prediction performance over baseline methods for noisy neuronal spiking activities; and (3) robustly and efficiently retrieve the intrinsic dimensionality of underlying complex neural systems from two experimental datasets.

Background

The Linear Dynamical System (LDS), with well-developed inference and learning methods, is an elegant mathematical framework for modeling and learning multivariate time series (MTS). An LDS models real-valued MTS $\{y_t \in \mathbb{R}^q\}_{t=1}^{T}$ using latent states $\{x_t \in \mathbb{R}^n\}_{t=1}^{T}$:

$$x_t \mid x_{t-1} \sim \mathcal{N}(x_t \mid A x_{t-1}, Q), \qquad (1)$$
$$y_t \mid x_t \sim \mathcal{N}(y_t \mid C x_t, R). \qquad (2)$$

Eq. 1 represents the state dynamics, and Eq. 2 is the observation model. Briefly, $\{x_t\}$ evolves via a dynamics matrix $A \in \mathbb{R}^{n \times n}$, and observations $\{y_t\}$ are generated from $\{x_t\}$ via an emission matrix $C \in \mathbb{R}^{q \times n}$. Both processes have Gaussian noise with mean 0 and covariance matrices $Q$ and $R$, respectively. The initial state $x_1$ is multivariate Gaussian with mean $x_0$ and covariance $Q_0$. The complete set of LDS parameters is $\Omega = \{A, C, Q, R, x_0, Q_0\}$. While in some LDS applications the model parameters are known a priori, in the majority of real-world applications they are unknown and must be learned from MTS data. This can be done with LDS learning methods such as Expectation Maximization (Ghahramani and Hinton 1996) or spectral learning (Buesing, Macke, and Sahani 2012).

Related Work

Various regularization methods have been proposed for both time series modeling and prediction with LDS in different applications (Charles et al. 2011). These can be divided into four categories: (1) state regularization; (2) innovation regularization; (3) combination regularization; and (4) parameter regularization.

State Regularization

The latent states $\{x_t\}_{t=1}^{T}$ are sparsified at each step of Kalman-filter inference. Charles et al. (2011) incorporate sparsity constraints to achieve a sparse state estimate $\hat{x}_t$ at each time stamp. Angelosante et al. (2009) treat the state sequence $\{x_t\}_{t=1}^{T}$ as a state-estimate matrix and enforce a row-level group lasso on this matrix.

Innovation Regularization

The error of state estimation is called the innovation, i.e., $\|\hat{x}_t - A\hat{x}_{t-1}\|$. An $\ell_1$ regularization is applied to the innovation during state estimation, which helps balance fidelity to the measurements against sparsity of the innovations (Asif et al. 2011).

Combination Regularization

Ghanem and Ahuja (2010) train a dictionary of LDSs, in which each LDS is learned from one training trial of the MTS.
The final LDS is a weighted combination of these individual LDSs, whose weights are regularized by an $\ell_1$ penalty.

Parameter Regularization

Parameter regularization imposes penalties on the parameters of an LDS during the learning process. Boots et al. (2007) constrained the largest eigenvalue of the dynamics matrix to lie within the unit circle to avoid unstable latent dynamics; a spectral algorithm was proposed to learn a stable LDS, which is good for simulating and predicting from the learned system. Our reduced-rank linear dynamical system belongs to this category, as we introduce a low-rank dynamics matrix to recover the intrinsic dimensionality.

Our method differs from categories (1) and (2) because both of them learn a sparse representation of the latent states $\{x_t\}_{t=1}^{T}$ while assuming all LDS parameters are known a priori. The combination regularization in (3) requires an extra training process, since an LDS dictionary must be constructed from different MTS. For the limited time series data in our case, these approaches can neither solve the overfitting problem nor retrieve the intrinsic dimensionality of the MTS. While our method belongs to the same category (4) as the stable LDS of Boots et al. (2007), we focus on a different aspect of the problem: they aim to achieve stability of the LDS, whereas we aim to find an appropriate state space and prevent overfitting given a small amount of count-valued MTS data.

For learning an LDS from Gaussian observations, the standard EM algorithm (Ghahramani and Hinton 1996) iteratively finds the maximum likelihood solution. The Subspace Identification (Subspace ID) method computes an asymptotically unbiased solution in closed form using oblique projections and the Singular Value Decomposition (Van Overschee and De Moor 2012). For learning an LDS from count observations, Buesing et al. proposed the Poisson Linear Dynamical System (PLDS) and learned it using spectral methods (2012) or variational inference (2014). The Poisson assumption on count data, while offering algorithmic conveniences, implies that the conditional mean and variance of the counts are equal. This assumption is inappropriate in some analyses of count data, which are observed to be either over- or under-dispersed (Churchland et al. 2010). Thus, it is crucial to develop more general observation distributions that capture over-, equal-, and under-dispersion of count data. To address these needs, we employ a count-valued exponential-family distribution (a weighted Poisson distribution) that is superior to current methods in two aspects: (i) adaptive dispersion, where a log-convex/linear/concave weight function produces the expected over-/equal-/under-dispersion; and (ii) flexible parameterization, with a variety of previous models derived as special cases of this distribution.

Methodology

Reduced-Rank Structure

To recover the intrinsic dimensionality of MTS datasets through the rank of the dynamics matrix $A$, we choose priors that induce the desired low-rank property. We consider two choices for inducing a low-rank dynamics matrix: (1) a multivariate Laplacian prior and (2) a nuclear norm prior, as shown in Table 1.

| Prior Name | Prior Form | Regularization |
|---|---|---|
| Multivariate Laplacian | $\exp(-\beta_1 \lVert A_i \rVert_2)$ | $\beta_1 \lVert A_i \rVert_2$ |
| Nuclear norm | $\exp(-\beta_2 \lVert A \rVert_*)$ | $\beta_2 \lVert A \rVert_*$ |

Table 1: Prior choices for the dynamics matrix.
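As a quick numerical illustration of the two penalties in Table 1 (our own sketch, not the paper's code; the matrix and coefficient values are arbitrary), the group-$\ell_2$ penalty sums the row norms of $A$ while the nuclear-norm penalty sums its singular values; both are combined with a Frobenius (ridge) term, anticipating Eqs. 3 and 4 below.

```python
# Illustrative only: evaluate the two reduced-rank penalties on a candidate A.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 10))
beta1, beta2 = 1.0, 0.1          # multivariate Laplacian + ridge coefficients
beta3, beta4 = 1.0, 0.1          # nuclear norm + ridge coefficients

# Multivariate Laplacian prior: sum of row-wise l2 norms (a group lasso on rows).
pen_ml = beta1 * np.linalg.norm(A, axis=1).sum() + 0.5 * beta2 * np.linalg.norm(A, "fro") ** 2

# Nuclear norm prior: sum of singular values (a convex surrogate of the rank).
pen_nn = beta3 * np.linalg.svd(A, compute_uv=False).sum() + 0.5 * beta4 * np.linalg.norm(A, "fro") ** 2

print(f"row-group penalty: {pen_ml:.2f}   nuclear-norm penalty: {pen_nn:.2f}")
```

Both penalties shrink toward matrices with few active rows or few non-zero singular values, which is exactly the low-rank bias the two priors are meant to encode.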
(1) Multivariate Laplacian prior

This prior assumes that every row of the dynamics matrix $A$ is independent of the others and follows a multivariate Laplacian density. To avoid overfitting, we additionally place a multivariate Gaussian prior on each element of $A$, which leads to ridge regularization. Combining the multivariate Laplacian and Gaussian priors gives a new prior $p_{\mathrm{ML}}(A)$:

$$\log p_{\mathrm{ML}}(A) = -\beta_1 \sum_{i=1}^{n} \|A_i\|_2 - \frac{\beta_2}{2} \|A\|_F^2 + \mathrm{const}, \qquad (3)$$

where $\beta_1, \beta_2$ are regularization parameters, $\{A_i\}_{i=1}^{n}$ are the rows of $A$, and $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the $\ell_2$ and Frobenius norms.

(2) Nuclear norm prior

The nuclear norm can be regarded as a convex relaxation of the number of non-zero eigenvalues (i.e., the rank) of the dynamics matrix $A$. We obtain an alternative prior $p_{\mathrm{NN}}(A)$ by applying a nuclear norm density together with a multivariate Gaussian prior to the dynamics matrix:

$$\log p_{\mathrm{NN}}(A) = -\beta_3 \|A\|_* - \frac{\beta_4}{2} \|A\|_F^2 + \mathrm{const}, \qquad (4)$$

where $\|\cdot\|_*$ is the nuclear norm and $\beta_3, \beta_4$ are regularization parameters. In all experiments, $\{\beta_i\}_{i=1}^{4}$ are selected by internal cross-validation while optimizing the model's predictive performance. We impose $p_{\mathrm{ML}}(A)$ and $p_{\mathrm{NN}}(A)$ separately in the learning process, and derive two methods to optimize a low-rank dynamics matrix.

Dispersion-adaptive (DA) Distribution

For count-valued observations, we define the DA distribution as the family of count-valued probability distributions

$$p_{\mathrm{DA}}(Y = k;\, \theta, w(\cdot)) = \frac{w(k)\, \exp(\theta k)}{k!\; \mathbb{E}[w(Y)]}, \qquad k \in \mathbb{N}, \qquad (5)$$

where $\theta \in \mathbb{R}$ and the function $w(\cdot): \mathbb{N} \rightarrow \mathbb{R}$ parameterize the distribution, and $\mathbb{E}[w(Y)] = \sum_{k \in \mathbb{N}} \frac{w(k)\exp(\theta k)}{k!}$ is the normalizing constant. The DA distribution can be viewed as an extension of the Poisson distribution with a weight function $w(\cdot)$. This kind of generalization has been of interest since del Castillo and Pérez-Casany (2005), who proved that: (1) log-concave/linear/convex functions $w(\cdot)$ imply under-/equal-/over-dispersed distributions; and (2) the expectation of any weighted Poisson distribution is monotonically increasing in $\theta$, for a fixed $w(\cdot)$. Figure 2 (a) demonstrates how different $w(\cdot)$ functions model the dispersion of count data and how controlling $\theta$ adjusts the mean of the DA distribution. As shown in Figure 2 (b), many commonly used count-data distributions are special cases of DA, obtained by restricting $w(\cdot)$ and $\theta$ to certain parametric forms. Overall, DA offers a rich, flexible exponential family for count data, in which $w(\cdot)$ and $\theta$ remain interpretable for capturing the statistics of count-valued data.

Figure 2: (a) The mean and variance of the DA distribution for different choices of the function $w(\cdot)$; for a fixed $\log w(\cdot)$, increasing $\theta$ drives the mean and variance larger. (b) Common count distributions (Bernoulli, Poisson, negative binomial, COM-Poisson) are special cases of the DA distribution obtained by parameterizing $\theta$ and $w(\cdot)$.
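A minimal numerical sketch of this family (our illustration, not the paper's code; the particular weight functions and the truncated support used to compute the normalizer are assumptions) shows the three dispersion regimes: a constant weight recovers the Poisson case, a log-convex weight yields over-dispersion, and a log-concave weight yields under-dispersion.

```python
# Illustrative only: mean/variance of p(Y = k) ∝ w(k) exp(theta*k) / k! (Eq. 5),
# with the normalizing constant computed on a truncated support.
import numpy as np
from scipy.special import gammaln

def da_mean_var(theta, log_w, kmax=500):
    k = np.arange(kmax + 1)
    logp = log_w(k) + theta * k - gammaln(k + 1.0)
    p = np.exp(logp - logp.max())
    p /= p.sum()                               # numerical stand-in for E[w(Y)]
    mean = np.sum(k * p)
    return mean, np.sum((k - mean) ** 2 * p)

# log-linear  w(k) = 1            -> Poisson: variance == mean
print(da_mean_var(np.log(5.0), lambda k: np.zeros_like(k, dtype=float)))
# log-convex  w(k) = Gamma(k + 2) -> negative-binomial-like: variance > mean
print(da_mean_var(np.log(0.6), lambda k: gammaln(k + 2.0)))
# log-concave w(k) = 1 / k!       -> COM-Poisson (nu = 2): variance < mean
print(da_mean_var(np.log(5.0), lambda k: -gammaln(k + 1.0)))
```

Changing only $w(\cdot)$ while keeping the same functional skeleton is what lets a single observation model adapt to over-, equal-, or under-dispersed spike counts.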
Reduced-Rank Linear Dynamical System (RRLDS)

With the two reduced-rank structures and the DA distribution in hand, we now couple them with a latent, linear dynamical system. This system, which we call RRLDS, is well suited to modeling limited count data and retrieving their intrinsic dimensionality. We apply it to time series (spike counts) recorded from brain neurons, and it extends straightforwardly to describing and interpreting other count-process observations.

Denoting $y^i_{t,r}$ as the spike count of neuron $i \in \{1, \dots, q\}$ at time $t \in \{1, \dots, T\}$ on experimental trial $r \in \{1, \dots, R\}$, we assume the spiking activities of the neurons are noisy count observations of underlying low-dimensional latent states $x_{t,r} \in \mathbb{R}^n$ ($n < q$), which modulate the mean of the DA distribution, and define the DA observation model as

$$y^i_{t,r} \mid x_{t,r} \sim \mathrm{DA}\!\left(c_i^\top x_{t,r},\, w_i(\cdot)\right). \qquad (6)$$

We parameterize $\theta = c_i^\top x_{t,r}$, where $C = [c_1, \dots, c_q]^\top \in \mathbb{R}^{q \times n}$ is the emission matrix mapping the latent space to the observation space, and $w_i(\cdot)$ is a neuron-specific function capturing the dispersion of each time series. The evolution of the latent state $x_{t,r}$ is described by a linear first-order process:

$$x_{1,r} \sim \mathcal{N}(x_{1,r} \mid x_0, Q_0), \qquad x_{t,r} \mid x_{t-1,r} \sim \mathcal{N}(x_{t,r} \mid A x_{t-1,r} + B u_{t-1,r},\, Q). \qquad (7)$$

Here, $x_0$ and $Q_0$ are the mean and covariance of the initial state, and $Q$ is the covariance of the innovations. External inputs $u_{t,r}$ with coupling matrix $B$ are included in the latent evolution. For example, in the hippocampal experiments presented in the Results section, the rat's position (trajectory) can be regarded as an external stimulus, and the location-stimulus filter $B$ is obtained under this setting. Meanwhile, the reduced-rank structures $p_{\mathrm{ML}}(A)$ and $p_{\mathrm{NN}}(A)$ are imposed on the dynamics matrix $A$.

RRLDS is illustrated in Figure 3 along with its two-stage structure: the first stage imposes the reduced-rank structures on the dynamics matrix $A$, which governs the evolution of the latent states $x_t$; the second stage maps the latent states $x_t$ onto the responses $y_t$ via the DA observation model, which learns the dispersion property.

Figure 3: Illustration of the two stages of RRLDS (reduced-rank latent-state dynamics followed by the dispersion-adaptive response model).

A Variational Bayes Expectation-Maximization (VBEM) algorithm is adopted for estimating the latent states $x_{1:T,r}$ (E-step) and the parameters $\Theta = \{A, B, C, Q, Q_0, x_0, \{w_i(\cdot)\}_{i=1}^{q}\}$ (M-step). Since the posterior distribution of $x_{1:T,r}$ has no analytical solution, an approximation step is implemented via a variational lower bound. Meanwhile, two regularization strategies are proposed for optimizing the dynamics matrix.

Inference (E-step)

We need to characterize the full posterior distribution of the latent states given the observations $y_{1:T,r}$ and parameters $\Theta$, i.e., $p(x_{1:T,r} \mid y_{1:T,r}, \Theta)$. This distribution is intractable, and we make a Gaussian approximation. Denote $\bar{x}_r = \mathrm{vec}(x_{1:T,r})$ and $\bar{y}_r = \mathrm{vec}(y_{1:T,r})$; for ease of notation we drop the trial index $r$. We thus use the Gaussian approximation $p(\bar{x} \mid \bar{y}) \approx q(\bar{x}) = \mathcal{N}(\mu, \Sigma)$. We identify the optimal $(\mu, \Sigma)$ by maximizing a variational Bayesian lower bound (the ELBO) over the variational parameters $\mu$ and $\Sigma$:

$$\mathcal{L}(\mu, \Sigma) = \mathbb{E}_{q(\bar{x})}\!\left[\log \frac{p(\bar{x} \mid \Theta)}{q(\bar{x})}\right] + \mathbb{E}_{q(\bar{x})}\!\left[\log p(\bar{y} \mid \bar{x}, \Theta)\right]. \qquad (8)$$

The first term of Eq. 8 is the negative Kullback-Leibler divergence between the variational distribution and the prior distribution, encouraging the variational distribution to be close to the prior. The second term involves the DA likelihood and encourages the variational distribution to explain the observations well. The integrations in the second term are intractable due to non-conjugacy. We use the ideas of Khan et al. (2013) to derive a further lower bound. First, we expand the second term using the DA likelihood of Eq. 6:

$$\mathbb{E}_{q(\theta^i_t)}\!\left[\log p_{\mathrm{DA}}\!\left(y^i_t \mid \theta^i_t, w_i(\cdot)\right)\right] = \mathbb{E}_{q(\theta^i_t)}\!\left[y^i_t \theta^i_t + \log \frac{w_i(y^i_t)}{y^i_t!} - \log \sum_{k} \frac{\exp(k\theta^i_t)\, w_i(k)}{k!}\right], \qquad (9)$$

where $\theta^i_t = c_i^\top x_t$. Denoting $m^i_{t,k} = k\theta^i_t + \log w_i(k) - \log(k!) = k c_i^\top x_t + \log w_i(k) - \log(k!)$, we see that $m^i_{t,k}$ is a linear transformation of $x_t$. Under the variational distribution, $m^i_{t,k}$ is also normally distributed, $m^i_{t,k} \sim \mathcal{N}(h^i_{t,k}, \rho^i_{t,k})$.
We have $h^i_{t,k} = k c_i^\top \mu_t + \log w_i(k) - \log k!$ and $\rho^i_{t,k} = k^2 c_i^\top \Sigma_t c_i$, where $(\mu_t, \Sigma_t)$ are the mean and covariance of $x_t$ under the variational distribution. Eq. 9 then reduces to $\mathbb{E}_{q}\big[m^i_{t,y^i_t} - \log \sum_k \exp(m^i_{t,k})\big]$. A further lower bound can be derived by Jensen's inequality:

$$\mathbb{E}_{q}\!\left[m^i_{t,y^i_t} - \log \sum_{k} \exp(m^i_{t,k})\right] \;\geq\; h^i_{t,y^i_t} - \log \sum_{k} \exp\!\left(h^i_{t,k} + \rho^i_{t,k}/2\right) \;=\; f(h^i_t, \rho^i_t). \qquad (10)$$

Combining Eq. 8 and Eq. 10, we obtain a tractable variational lower bound $\mathcal{L}^*(\mu, \Sigma) \leq \mathcal{L}(\mu, \Sigma)$:

$$\mathcal{L}^*(\mu, \Sigma) = \mathbb{E}_{q(\bar{x})}\!\left[\log \frac{p(\bar{x} \mid \Theta)}{q(\bar{x})}\right] + \sum_{t,i} f(h^i_t, \rho^i_t). \qquad (11)$$

In the E-step, we maximize the new lower bound $\mathcal{L}^*$ via its dual (Khan et al. 2013). Finally, we obtain the sufficient statistics $\mathbb{E}_{\bar{x}}[x_t x_t^\top]$, $\mathbb{E}_{\bar{x}}[x_{t-1} x_t^\top]$, and $\mathbb{E}_{\bar{x}}[x_t]$, which are required by the M-step. Details are derived in the supplementary materials.

Learning (M-step)

The M-step maximizes $\mathcal{L}^*$ over $\Theta$. This involves optimizing three parts: (1) the dynamics matrix $A$; (2) the other dynamical-system parameters $\{B, Q, Q_0, x_0\}$; and (3) the DA model parameters $\{C, \{w_i(\cdot)\}_{i=1}^{q}\}$.

Optimization of A

In the M-step, the dynamics matrix $A$ is optimized by maximizing $\mathbb{E}_{\bar{x}}\big[\sum_{t=2}^{T} \log p(x_t \mid x_{t-1})\big] + \log p(A)$. We already introduced the two reduced-rank structures $p_{\mathrm{ML}}(A)$ and $p_{\mathrm{NN}}(A)$. Below we briefly outline the two corresponding algorithms (A1 and A2) and our novel contributions.

A1: $p_{\mathrm{ML}}(A)$. Two levels of constraints are applied to $A$: (1) the multivariate Laplacian prior and (2) the multivariate Gaussian prior. The first leads to the low-rank structure and the second prevents overfitting. The objective function is

$$\min_{A}\; g(A) + \beta_1 \sum_{i=1}^{n} \|A_i\|_2 + \frac{\beta_2}{2} \|A\|_F^2, \qquad (12)$$

where $g(A) = \frac{1}{2} \sum_{t=2}^{T} \mathbb{E}_{\bar{x}}\big[\|x_t - \hat{x}_t\|^2_{Q^{-1}}\big]$ and $\hat{x}_t = A x_{t-1} + B u_{t-1}$; $\beta_1, \beta_2$ are selected by internal cross-validation. Eq. 12 can be transformed into a quadratic problem with a non-smooth Euclidean norm (details in the supplementary materials):

$$\operatorname*{argmin}_{A}\; \frac{1}{2} a^\top H a - b^\top a + \beta_1 \sum_{i=1}^{n} \|A_i\|_2, \qquad (13)$$

where, for clarity, $a = \mathrm{vec}(A^\top)$ and $Q^{-1} = L L^\top$; $H$ and $b$ are assembled from the sufficient statistics $\sum_{t=2}^{T} \mathbb{E}_{\bar{x}}[x_{t-1} x_{t-1}^\top] + \beta_2 I$, $\sum_{t=2}^{T} \mathbb{E}_{\bar{x}}[x_t x_{t-1}^\top - B u_{t-1} x_{t-1}^\top]$, and $\mathrm{vec}(L)$ (full expressions in the supplementary materials). Eq. 13 can be cast as a second-order cone program (SOCP) and solved using existing SOCP solvers. SOCP provides solutions with high precision (low duality gap) when the state size remains moderate (< 50), which is the case in our experiments. The transformed SOCP is

$$\min_{\alpha_0, \dots, \alpha_n}\; \alpha_0 + \beta_1 \sum_{i=1}^{n} \alpha_i \quad \text{s.t.} \quad \frac{1}{2} a^\top H a - b^\top a \leq \alpha_0, \quad \|A_i\|_2 \leq \alpha_i. \qquad (14)$$

A2: $p_{\mathrm{NN}}(A)$. We assume that the dynamics matrix $A$ has a nuclear norm density and, as with $p_{\mathrm{ML}}(A)$, a multivariate Gaussian prior on each of its elements. The objective function becomes

$$\min_{A}\; g(A) + \beta_3 \|A\|_* + \frac{\beta_4}{2} \|A\|_F^2. \qquad (15)$$

Denoting $h(A) = g(A) + \frac{\beta_4}{2}\|A\|_F^2$, which is convex and differentiable with respect to $A$, we can minimize Eq. 15 with a generalized gradient descent algorithm:

$$A^{(j+1)} = \mathrm{prox}_{d_j}\!\left(A^{(j)} - d_j\, \nabla h(A^{(j)})\right), \qquad (16)$$

where $\mathrm{prox}_{d_j}(\cdot)$ is the singular-value soft-thresholding operator and $d_j$ is the step size at iteration $j$. We select the step size to ensure a fast convergence rate based on Theorem 1; the proof is in the supplementary material.

Theorem 1. Generalized gradient descent with a fixed step size $d \leq 1 / \big(\|Q^{-1}\|_F \sum_{t=1}^{T-1} \|\mathbb{E}[x_t x_t^\top]\|_F + \beta_3\big)$ for minimizing Eq. 15 has convergence rate $O(1/J)$, where $J$ is the total number of iterations.
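The following minimal sketch (our illustration under a simplified quadratic surrogate for $g(A)$; it is not the paper's Matlab implementation) shows the generalized gradient step of Eq. 16: a gradient step on the smooth part $h(A)$ followed by singular-value soft-thresholding, which is what drives redundant singular values of $A$ to zero.

```python
# Illustrative only: one proximal-gradient update for Eq. 15 / Eq. 16.
import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_grad_step(A, grad_h, d, beta3):
    """A <- prox_{d*beta3*||.||_*}(A - d * grad h(A)), as in Eq. 16."""
    return svt(A - d * grad_h(A), d * beta3)

# Toy usage: g(A) = 0.5 * ||A - M||_F^2 stands in for the expected-statistics term
# (an assumption for illustration only).
rng = np.random.default_rng(2)
M = rng.normal(size=(10, 10))
beta3, beta4, d = 5.0, 0.1, 0.5
grad_h = lambda A: (A - M) + beta4 * A      # gradient of g plus the Frobenius (ridge) term
A = np.zeros_like(M)
for _ in range(200):
    A = prox_grad_step(A, grad_h, d, beta3)
print("rank after shrinkage:", np.linalg.matrix_rank(A, tol=1e-6))
```

The soft-thresholding step is the reason the learned dynamics matrix ends up with exactly zero trailing singular values, rather than merely small ones.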
Optimization of {B, Q, Q0, x0}

With the variational distribution $q(\bar{x})$, the part of the likelihood involving the Gaussian linear dynamics is quadratic in $\bar{x}$ and has closed-form solutions following Ghahramani and Hinton (1996).

Optimization of {C, w(·)}

The part of the likelihood involving the DA observation distribution does not have the standard form of previous work (Ghahramani and Hinton 1996). In our model, considering all experimental trials, the relevant objective for neuron $i$ is

$$\sum_{r=1}^{R} \sum_{t=1}^{T} \left[ y^i_{t,r}\, (c_i^\top \mu_{t,r}) + \log w_i(y^i_{t,r}) - \log \sum_{k} \frac{w_i(k)}{k!} \exp\!\left(k (c_i^\top \mu_{t,r}) + \tfrac{1}{2} k^2\, c_i^\top \Sigma_{t,r} c_i\right) \right]. \qquad (17)$$

This part is concave and can be optimized efficiently using existing convex optimization techniques. In practice, we initialize our parameters using the Laplace-EM algorithm (Buesing et al. 2014), which empirically gives runtime advantages and produces a sensible optimum. The above steps constitute the full inference and learning procedure (VBEM), summarized in Algorithm 1.

Algorithm 1: Framework of inference and learning (VBEM)
Input: $\{A, B, C, Q, Q_0, x_0, w(\cdot)\}$ initialized via Laplace-EM; observation sequences $y^i_t$ and stimulus sequences $u^i_t$ with $i \in \{1, \dots, q\}$ and $t \in \{1, \dots, T\}$.
Output: $\mathbb{E}_{\bar{x}}[x_t]$, $\hat{\Theta} = \{\hat{A}, \hat{B}, \hat{C}, \hat{Q}, \hat{Q}_0, \hat{x}_0, \{\hat{w}_i(\cdot)\}_{i=1}^{q}\}$
1: repeat
2:   E-step:
3:     Compute the new ELBO $\mathcal{L}^*(\mu, \Sigma)$ from Eq. 11
4:     $\{\mu, \Sigma\} \leftarrow$ dual optimization over $\mathcal{L}^*(\mu, \Sigma)$
5:     $p(\bar{x} \mid \bar{y}, \Theta) \approx \mathcal{N}(\mu, \Sigma)$
6:     Compute $\mathbb{E}_{\bar{x}}[x_t x_t^\top]$, $\mathbb{E}_{\bar{x}}[x_{t-1} x_t^\top]$, and $\mathbb{E}_{\bar{x}}[x_t]$
7:   M-step:
8:   if $p(A)$ is $p_{\mathrm{ML}}(A)$ then
9:     $A \leftarrow$ SOCP solving Eq. 14
10:  else ($p(A)$ is $p_{\mathrm{NN}}(A)$)
11:    $d \leftarrow 1 / \big(\|Q^{-1}\|_F \sum_{t=1}^{T-1} \|\mathbb{E}[x_t x_t^\top]\|_F + \beta_3\big)$
12:    Compute the gradient of $h(A)$
13:    $A \leftarrow \mathrm{prox}_d\big(A - d\, \nabla h(A)\big)$
14:  end if
15:  $\{\hat{B}, \hat{Q}, \hat{Q}_0, \hat{x}_0\} \leftarrow \operatorname{argmax}_{\Theta} \mathbb{E}_{\bar{x}}[\log p(\bar{x} \mid \Theta)]$
16:  $\{\hat{C}, \{\hat{w}_i(\cdot)\}_{i=1}^{q}\} \leftarrow$ optimization over Eq. 17
17: until convergence

Results

To demonstrate the generality of DA and verify our algorithmic implementation, we first test the inference and learning method on extensive simulated data. Then we evaluate prediction performance against state-of-the-art methods on two neuroscience datasets. Finally, we verify that our method can retrieve the intrinsic dimensionality from these multivariate time series, which is substantially more powerful than existing work. Table 2 lists the abbreviations of the methods compared in the Results.

|          | with Reduced Rank | w/o Reduced Rank |
|----------|-------------------|------------------|
| with DA  | RRLDS-ML / -NN    | DALDS            |
| w/o DA   | RRLDS (w/o DA)    | alternative LDSs |

Table 2: Abbreviations for our method and several baselines. ML stands for Multivariate Laplacian and NN for Nuclear Norm. Alternative LDS methods include vanilla LDS (Ghahramani and Hinton 1996), PLDS (Buesing, Macke, and Sahani 2012), Subspace ID (Van Overschee and De Moor 2012), and Stable LDS (Boots, Gordon, and Siddiqi 2007).

Simulated results

Reconstruction of Latent States

We evaluate the proposed variational inference of the posterior distribution of the latent states, given a known set of parameters $\Theta$, observed data $\{y_{1:T,r}\}_{r=1}^{2}$, and stimuli $\{u_{1:T,r}\}_{r=1}^{2}$. The intrinsic dimensionality of the simulated data is set to the number of latent states ($x_{t,r} \in \mathbb{R}^3$) used in inference. The true state trajectories $\{x_{t,r}\}_{r=1}^{2}$ (grey) and the posterior mean estimates $\{\mathbb{E}[x_{t,r}]\}_{r=1}^{2}$ (black) are plotted in Figure 4, with 3 states over 2 trials. Our variational inference method achieves a faithful reconstruction of the state trajectories.

Figure 4: Reconstruction of latent states from multiple count-data trials. Top row: true (grey) and estimated (black) trajectories of 3 latent states in simulated trial #1; bottom row: the same for simulated trial #2.

Parameter Estimation

For parameter estimation, it is inappropriate to perform an element-wise comparison between the estimated parameters and the ground truth, because dynamics matrices are unique (in terms of producing the same observations) only up to linear transformations. Therefore, we compare invariants of the estimated and true parameters, as in Macke et al. (2015).
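As a small sketch of why spectra are suitable invariants (our own illustration; the matrices are arbitrary), the eigenvalues of a dynamics matrix are unchanged by any invertible change of latent coordinates, so true and estimated spectra can be compared directly even though the matrices themselves cannot.

```python
# Illustrative only: eigenvalue spectra are invariant to similarity transforms.
import numpy as np

rng = np.random.default_rng(3)
A_true = rng.normal(size=(10, 10)) * 0.2
S = rng.normal(size=(10, 10))                     # arbitrary invertible transform
A_est = np.linalg.inv(S) @ A_true @ S             # same system in rotated latent coordinates

eig_true = np.linalg.eigvals(A_true)
eig_est = np.linalg.eigvals(A_est)
# Match each true eigenvalue to its nearest estimate (spectra are set-valued).
mismatch = max(np.min(np.abs(eig_est - lam)) for lam in eig_true)
print("max eigenvalue mismatch:", mismatch)        # ~ 0 up to numerical error
```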
Figure 5: (a) The eigenvalue spectra of dynamics matrices estimated by RRLDS-ML (red triangles), RRLDS-NN (yellow squares), and the LDS with a Poisson observation model (PLDS, purple crosses), plotted in the complex plane (real vs. imaginary part). The true eigenvalue spectrum is indicated by blue circles. Both RRLDS-ML/-NN (w/o DA) recover eigenvalues close to the true ones, while PLDS fails to eliminate the redundant dimensions. (b) Scatter plot of the elements of the stationary covariance matrix of the predicted versus true count data.

Figure 5 (a) compares the eigenvalue spectrum of the estimated dynamics matrices (by RRLDS-ML/-NN (w/o DA) and PLDS) with the true one. The true rank of the dynamics matrix is 10 (the number of blue circles), while the number of latent states is initialized to 20, larger than the rank. The experiment is performed on simulated data $y$ with 40 sequences, each of 100 bins. It verifies that RRLDS-ML/-NN (w/o DA) indeed yield a low-rank estimate of the dynamics matrix, with higher accuracy than PLDS. Specifically, PLDS overfits the data with redundant eigenvalues, whereas our method learns a rank-10 dynamics matrix from the initial 20D state space, leaving the remaining eigenvalues at 0. Given the estimated parameters, Figure 5 (b) plots the elements of the stationary covariance matrix (Macke, Buesing, and Sahani 2015) for the predicted (using RRLDS-NN) and true count data, further demonstrating the accuracy of our estimation.

Prediction performance

We compare the predictive log-likelihood of the systems learned by RRLDS-ML/-NN, DALDS, and PLDS; a higher predictive log-likelihood implies better performance. Figure 6 (a) shows the results of the four LDSs, each evaluated with a true dynamics-matrix rank of 2, 4, 6, 8, and 10. The number of latent states is 10, and the dimension of the simulated observations $y_{1:T}$ is 40. The length is 500 for training data and 100 for testing data; ten trials in total are simulated. DALDS outperforms PLDS thanks to the dispersion-adaptive distribution. RRLDS-ML/-NN further improve prediction accuracy by alleviating overfitting through intrinsic-dimensionality recovery.

Figure 6: Predictive log-likelihood of (a) four learned LDSs with different true ranks of the dynamics matrix, and (b) four rank-5 LDSs learned with different lengths of training data. The predefined number of latent states is 10 in both (a) and (b). RRLDS-ML/-NN significantly (p < 0.001, paired t-test) outperform the alternatives.

In Figure 6 (b), we plot the predictive log-likelihood as a function of training-data length, fixing the rank at 5 and the predefined number of states at 10. When the training data become long enough, RRLDS-ML/-NN and DALDS converge to similar performance. This is because the eigenvalue/eigenvector pairs corresponding to the redundant dimensions of the dynamics matrix are then estimated close to the ground truth, and the decomposition space spanned by the eigenvectors is similar with or without regularization. However, only RRLDS-ML/-NN consistently achieve promising predictions for short-length data, which is more applicable to real-world scenarios.
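For readers unfamiliar with the evaluation metric, the sketch below (ours; it uses a plain Poisson likelihood rather than the paper's DA likelihood, and all arrays are synthetic) shows how a predictive log-likelihood is accumulated over held-out counts given model-predicted rates.

```python
# Illustrative only: held-out predictive log-likelihood under predicted Poisson rates.
import numpy as np
from scipy.special import gammaln

def poisson_pred_loglik(y, rate):
    """Sum of log Poisson(y | rate) over all time bins and neurons."""
    return np.sum(y * np.log(rate) - rate - gammaln(y + 1.0))

rng = np.random.default_rng(4)
rates = rng.uniform(0.5, 5.0, size=(100, 40))      # predicted rates (T x q), synthetic
y_test = rng.poisson(rates)                        # held-out counts, synthetic
print("predictive log-likelihood:", poisson_pred_loglik(y_test, rates))
```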
Neuroscience Data

We also evaluated our method on two experimental hippocampus datasets (Mizuseki et al. 2009). These datasets contain neuronal spike-count data recorded from the brains of rats performing two different running tasks.

Prediction performance of neural activities

Figure 7 shows the prediction performance of six LDSs (RRLDS-ML/-NN, PLDS, Subspace ID, Stable LDS, and LDS) with different predefined numbers of latent states for Task #1. As shown by the predictive log-likelihood, a single latent state cannot model the system well (the bars to the left), and RRLDS-ML/-NN significantly (p < 0.001, paired t-test) outperform the alternatives.

Figure 7: Predictive log-likelihood of experimental spiking activities in Task #1 (a single state fails to capture the dynamics).

Figure 8 compares performance in predicting the spike counts of 10 neurons over 100 time bins (Task #2). The color represents the count value predicted at each time step. The results demonstrate that RRLDS captures more details of the spiking activities than all baselines. PLDS predicts better than Subspace ID, Stable LDS, and LDS, but still loses temporal precision compared with RRLDS. Specifically, neuron #4 (during time bins 45-65) and neuron #6 (during time bins 80-95) have high and varying firing rates (highlighted in the figure). RRLDS-NN captures them precisely, while the others fail. We note that without the reduced-rank structures and DA, the baseline methods produce temporally spurious spike-count predictions.

Figure 8: Prediction performance of five models for neuronal spike counts (Task #2). The rows of each subfigure indicate the spiking sequences of the neurons; the color highlights the count values recorded/predicted at each time step.

Retrieval of intrinsic dimensionality

We test the retrieval of the intrinsic dimensionality of the complex neural system based on the estimated rank of the dynamics matrix. In Figure 9, each subfigure plots the normalized eigenvalues of the dynamics matrices learned from different experimental trials. Given the same task, the rank of the optimized dynamics matrix consistently converges to 5 or 6 for Task #1 and to 10 or 11 for Task #2, regardless of the number of latent states (10, 20, or 30). This result provides a valuable insight into the internal factors of the neural system: the recorded spiking activities in the hippocampus for the two tasks are intrinsically characterized by an underlying low-rank dynamical system.

Figure 9: Latent-state-space recovery from neuroscience data using RRLDS-NN (normalized eigenvalues vs. eigenvalue indices). Top row: Task #1; bottom row: Task #2. Different lines in each subfigure represent different trials. 10, 20, and 30 latent states are used to test the robustness of RRLDS in retrieving the intrinsic dimensionality.

Conclusion

We have proposed reduced-rank linear dynamical systems to retrieve the intrinsic dimensionality of short-length, noisy, count-valued data. Two reduced-rank structures and a dispersion-adaptive distribution family are introduced and incorporated into our model. Both simulation and experimental results verify the effectiveness of our method in eliminating redundant latent dimensions. Extensions to nonlinear dynamical systems are left for future work; they may require higher algorithmic complexity in learning with the priors $p_{\mathrm{ML/NN}}(A)$ and the weight functions $w(\cdot)$. The applications of this system are not limited to neuroscience.
We expect our method to benefit the learning of more concise, structured, and interpretable patterns from social-science and financial data, which are often observed to be short-length, noisy, and count-valued. We implement RRLDS in Matlab (2017a), and our code is available at https://github.com/sheqi/RRLDS.

References

Angelosante, D.; Roumeliotis, S. I.; and Giannakis, G. B. 2009. Lasso-Kalman smoother for tracking sparse signals. In Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on, 181-185. IEEE.
Archer, E. W.; Koster, U.; Pillow, J. W.; and Macke, J. H. 2014. Low-dimensional models of neural population activity in sensory cortical circuits. In NIPS, 343-351.
Asif, M. S.; Charles, A.; Romberg, J.; and Rozell, C. 2011. Estimation and dynamic updating of time-varying signals with sparse variations. In ICASSP, IEEE International Conference on, 3908-3911. IEEE.
Banaee, H.; Ahmed, M. U.; and Loutfi, A. 2013. Data mining for wearable sensors in health monitoring systems: a review of recent trends and challenges. Sensors 13(12):17472-17500.
Boots, B.; Gordon, G. J.; and Siddiqi, S. M. 2007. A constraint generation approach to learning stable linear dynamical systems. In NIPS, 1329-1336.
Buesing, L.; Machado, T. A.; Cunningham, J. P.; and Paninski, L. 2014. Clustered factor analysis of multineuronal spike data. In NIPS, 3500-3508.
Buesing, L.; Macke, J. H.; and Sahani, M. 2012. Spectral learning of linear dynamics from generalised-linear observations with application to neural population data. In NIPS, 1682-1690.
Charles, A.; Asif, M. S.; Romberg, J.; and Rozell, C. 2011. Sparsity penalties in dynamical system estimation. In CISS, 2011 45th Annual Conference on, 1-6. IEEE.
Churchland, M. M.; Byron, M. Y.; Cunningham, J. P.; Sugrue, L. P.; Cohen, M. R.; Corrado, G. S.; Newsome, W. T.; Clark, A. M.; Hosseini, P.; Scott, B. B.; et al. 2010. Stimulus onset quenches neural variability: a widespread cortical phenomenon. Nature Neuroscience 13(3):369-378.
del Castillo, J., and Pérez-Casany, M. 2005. Overdispersed and underdispersed Poisson generalizations. Journal of Statistical Planning and Inference 134(2):486-500.
Gao, Y., and Yuille, A. L. 2017. Exploiting symmetry and/or Manhattan properties for 3D object structure estimation from single and multiple images. In CVPR.
Gao, Y.; Ma, J.; and Yuille, A. L. 2017. Semi-supervised sparse representation based classification for face recognition with insufficient labeled samples. IEEE Transactions on Image Processing 26(5):2545-2560.
Ghahramani, Z., and Hinton, G. E. 1996. Parameter estimation for linear dynamical systems. Technical Report CRG-TR-96-2, Dept. of Computer Science, University of Toronto.
Hein, G.; Morishima, Y.; Leiberg, S.; Sul, S.; and Fehr, E. 2016. The brain's functional network architecture reveals human motives. Science 351(6277):1074-1078.
Khan, M. E.; Aravkin, A.; Friedlander, M.; and Seeger, M. 2013. Fast dual variational inference for non-conjugate latent Gaussian models. In ICML, 951-959.
Lakshmanan, K. C.; Sadtler, P. T.; Tyler-Kabara, E. C.; Batista, A. P.; and Byron, M. Y. 2015. Extracting low-dimensional latent structure from time series in the presence of delays. Neural Computation.
Linderman, S., and Adams, R. 2014. Discovering latent network structure in point process data. In ICML, 1413-1421.
Linderman, S.; Johnson, M.; Miller, A.; Adams, R.; Blei, D.; and Paninski, L. 2017. Bayesian learning and inference in recurrent switching linear dynamical systems. In AISTATS, 914-922.
Linderman, S.; Stock, C. H.; and Adams, R. P. 2014. A framework for studying synaptic plasticity with neural spike train data. In NIPS, 2330-2338.
Macke, J. H.; Buesing, L.; and Sahani, M. 2015. Estimating state and parameters in state space models of spike trains. In NIPS, 137-159.
Mizuseki, K.; Sirota, A.; Pastalkova, E.; and Buzsáki, G. 2009. Theta oscillations provide temporal windows for local circuit computation in the entorhinal-hippocampal loop. Neuron 64(2):267-280.
Paninski, L.; Ahmadian, Y.; Ferreira, D. G.; Koyama, S.; Rad, K. R.; Vidne, M.; Vogelstein, J.; and Wu, W. 2010. A new look at state-space models for neural data. Journal of Computational Neuroscience 29(1-2):107-126.
Pfau, D.; Pnevmatikakis, E. A.; and Paninski, L. 2013. Robust learning of low-dimensional dynamics from large neural ensembles. In NIPS, 2391-2399.
She, Q.; Chen, G.; and Chan, R. H. 2016. Evaluating the small-world-ness of a sampled network: Functional connectivity of entorhinal-hippocampal circuitry. Scientific Reports 6.
She, Q.; Jelfs, B.; and Chan, R. H. 2016. Modeling short over-dispersed spike count data: A hierarchical parametric empirical Bayes framework. arXiv preprint arXiv:1605.02869.
She, Q.; So, W. K.; and Chan, R. H. 2015. Reconstruction of neural network topology using spike train data: Small-world features of hippocampal network. In EMBC 2015, 2506-2509. IEEE.
She, Q.; So, W. K.; and Chan, R. H. 2016. Effective connectivity matrix for neural ensembles. In EMBC 2016, 1612-1615. IEEE.
So, W. K.; Yang, L.; Jelfs, B.; She, Q.; Wong, S. W.; Mak, J. N.; and Chan, R. H. 2016. Cross-frequency information transfer from EEG to EMG in grasping. In EMBC 2016, 4531-4534. IEEE.
Spira, M. E., and Hai, A. 2013. Multi-electrode array technologies for neuroscience and cardiology. Nature Nanotechnology 8(2):83-94.
Sussillo, D.; Jozefowicz, R.; Abbott, L.; and Pandarinath, C. 2016. Latent factor analysis via dynamical systems. arXiv preprint arXiv:1608.06315.
Van Overschee, P., and De Moor, B. 2012. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Springer Science & Business Media.