# Deep Factors for Forecasting

Yuyang Wang, Alex Smola, Danielle C. Maddix, Jan Gasthaus, Dean Foster, Tim Januschowski (Amazon Research. Correspondence to: Yuyang Wang.)

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Producing probabilistic forecasts for large collections of similar and/or dependent time series is a practically relevant and challenging task. Classical time series models fail to capture complex patterns in the data, and multivariate techniques struggle to scale to large problem sizes. Their reliance on strong structural assumptions makes them data-efficient and allows them to provide uncertainty estimates. The converse is true for models based on deep neural networks, which can learn complex patterns and dependencies given enough data. In this paper, we propose a hybrid model that incorporates the benefits of both approaches. Our new method is data-driven and scalable via a latent, global, deep component. It also handles uncertainty through a local classical model. We provide both theoretical and empirical evidence for the soundness of our approach through a necessary and sufficient decomposition of exchangeable time series into a global and a local part. Our experiments demonstrate the advantages of our model in terms of data efficiency, accuracy and computational complexity.

1. Introduction

Time series forecasting is a key ingredient in the automation and optimization of business processes. In retail, decisions about which products to stock, when to (re)order them, and where to store them depend on forecasts of future demand in different regions; in (cloud) computing, the estimated future usage of services and infrastructure components guides capacity planning; regional forecasts of energy consumption are used to plan and optimize the generation of power; and workforce scheduling in warehouses and factories depends on forecasts of the future workload.

The prevalent forecasting methods in statistics and econometrics have been developed in the context of forecasting individual or small groups of time series. The core of these methods is formed by comparatively simple (often linear) models, which require manual feature engineering and model design by domain experts to achieve good performance (Harvey, 1990). Recently, there has been a paradigm shift from these model-based methods to fully-automated data-driven approaches. This shift can be attributed to the availability of large and diverse time series datasets in a wide variety of fields, e.g. energy consumption of households, server load in a data center, online user behavior, and demand for all products that a large retailer offers. These large datasets make it possible and necessary to learn models from data without significant manual work (Böse et al., 2017).

A collection of time series can exhibit various dependency relationships between the individual time series that can be leveraged in forecasting. These include: (1) local covariate relationships (e.g. the price and demand for a product, which tend to be (negatively) correlated), (2) indirect relationships through shared latent causes (e.g. demand for multiple products increasing because an advertising campaign is driving traffic to the site), and (3) subtle dependencies through smoothness, temporal dynamics, and noise characteristics of time series that are measurements of similar underlying phenomena (e.g.
product sales time series tend to be similar to each other, but different from energy consumption time series). The data in practical forecasting problems typically has all of these forms of dependencies. Making use of this data from related time series allows more complex and potentially more accurate models to be fitted without overfitting.

Classical time series models have been extended to address dependencies of types (1) and (2) above by allowing exogenous variables (e.g. the ARIMAX model and control inputs in linear dynamical systems), and by employing multivariate time series models that impose a certain covariance structure (dynamic factor models), respectively. Neural network-based models have recently been shown to excel at extracting complex patterns from large datasets of related time series by exploiting similar smoothness and temporal dynamics, and common responses to exogenous input, i.e. dependencies of types (1) and (3) (Flunkert et al., 2017; Wen et al., 2017; Mukherjee et al., 2018; Gasthaus et al., 2019). These models struggle to produce calibrated uncertainty estimates. They can also be sample-inefficient, and cannot handle type (2) dependencies. See (Faloutsos et al., 2018) for a recent survey on traditional and modern methods for forecasting.

The two main challenges that arise in fully-automated data-driven approaches are: how to build statistical models that are able to borrow statistical strength and effectively learn to forecast from large and diverse data sources exhibiting all forms of dependencies, and how to combine the data efficiency and uncertainty characterization of classical time series models with the expressive power of deep neural networks. In this paper, we propose a family of models that efficiently (in terms of sample complexity) and effectively (in terms of computational complexity) addresses these challenges.

1.1. Background

Classical time series models, such as general State-Space Models (SSMs), including ARIMA and exponential smoothing, excel at modeling the complex dynamics of individual time series with sufficiently long histories. For Gaussian State-Space Models, these methods are computationally efficient, e.g. via a Kalman filter, and provide uncertainty estimates. Uncertainty estimates are critical for optimal downstream decision making. Gaussian Processes (GPs) (Rasmussen & Williams, 2006; Seeger, 2004) are another family of models that have been applied to time series forecasting (Girard et al., 2003; Brahim-Belhouari & Bermak, 2004). These methods are local, that is, they learn one model per time series. As a consequence, they cannot effectively extract information across multiple time series. Finally, these classical methods struggle with cold-start problems, where more time series are added or removed over time.

Mixed effect models (Crawley, 2012) consist of two kinds of effects: fixed (global) effects that describe the whole population, and random (local) effects that capture the idiosyncrasies of individuals or subgroups. A similar mixed approach is used in Hierarchical Bayesian methods (Gelman et al., 2013), which combine global and local models to jointly model a population of related statistical problems. Other combined local and global models are detailed in (Ahmed et al., 2012; Low et al., 2011).
Dynamic factor models (DFMs) have been studied in econometrics for decades to model the co-evolution of multiple time series (Geweke, 1977; Stock & Watson, 2011; Forni et al., 2000; Pan & Yao, 2008). DFMs can be thought of as an extension of principal component analysis to the temporal setting: all the time series are assumed to be driven by a small number of dynamic (latent) factors. As with other models in classical statistics, the theory and techniques are developed under the assumption that the data is normally distributed and stationary, and the desired theoretical properties are often lost when generalizing to other likelihoods. Closely related are matrix factorization (MF) techniques (Yu et al., 2016; Xie et al., 2017; Hooi et al., 2019) and tensor factorization (Araujo et al., 2019), which have been applied to the time series matrix with temporal regularization to ensure the regularity of the latent time series. These methods are not probabilistic in nature, and cannot provide uncertainty estimates for non-Gaussian observations.

1.2. Main Contributions

In this paper, we propose a novel global-local method, Deep Factor Models with Random Effects. It is based on a global DNN backbone and local probabilistic graphical models for computational efficiency. The global-local structure extracts complex non-linear patterns globally while capturing individual random effects for each time series locally. The main idea of our approach is to represent each time series, or its latent function, as a combination of a global time series and a corresponding local model. The global part is given by a linear combination of a set of deep dynamic factors, where the loading is temporally determined by attentions. The local model is stochastic; typical choices include white noise processes, linear dynamical systems (LDSs) or Gaussian processes (GPs). The stochastic local component allows uncertainty to propagate forward in time. Our contributions are as follows:

i) We provide a unique characterization of exchangeable time series (Section 2);
ii) We propose a novel global-local framework for time series forecasting, based on i), that systematically marries deep neural networks and probabilistic models (Section 3);
iii) We develop an efficient and scalable inference algorithm for non-Gaussian likelihoods that is generally applicable to any normally distributed probabilistic model, such as SSMs and GPs (Section 3). As a byproduct, we obtain new approximate inference methods for SSMs/GPs with non-Gaussian likelihoods;
iv) We show how state-of-the-art time series forecasting methods can be subsumed in the proposed framework (Section 4);
v) We demonstrate the accuracy and data efficiency of our approach through scientific experiments (Section 5).

2. Exchangeable Series

In this section, we formulate a general model for exchangeable time series. A distribution over objects is exchangeable if the probability of the objects is invariant under any permutation. Exchangeable time series are a common occurrence. For instance, user purchase behavior is exchangeable, since there is no specific reason to assign a particular coordinate to a particular user. Other practical examples include sales statistics over similar products, prices of securities on the stock market, and electricity usage.
2.1. Characterization

Let $z_i \in \mathcal{Z}^T$, where $z_i$ denotes the $i$th exchangeable time series, $\mathcal{Z}$ denotes the domain of observations and $T \in \mathbb{N}$ denotes the length of the time series. (Without loss of generality and to avoid notational clutter, we omit the extension to time series beginning at different points in time. Our approach is general and covers these cases as well.) We denote individual observations at some time $t$ as $z_{i,t}$. We assume that we observe $z_{i,t}$ at discrete time steps, so that we have a proper time series rather than a marked point process.

Theorem 1. Let $p$ be a distribution over exchangeable time series $z_i$ over $\mathcal{Z}$ with length $T$, where $1 \le i \le N$. Then $p$ admits the form
$$p(z) = \int \prod_{t=1}^{T} p(g_t \mid g_{1:t-1}) \prod_{i=1}^{N} p(z_{i,t} \mid z_{i,1:t-1}, g_t)\, dg.$$
In other words, $p(z)$ decomposes into a global time series $g$ and $N$ local time series $z_i$, which are conditionally independent given the latent series $g$.

Proof. It follows from de Finetti's theorem (Diaconis, 1977; Diaconis & Freedman, 1980) that
$$p(z) = \int p(g) \prod_{i=1}^{N} p(z_i \mid g)\, dg. \qquad (1)$$
Since the $z_i$ are time series, we can decompose $p(z_i \mid g)$ in the causal direction using the chain rule as $\prod_{t=1}^{T} p(z_{i,t} \mid z_{i,1:t-1}, g)$. Substituting this into the de Finetti factorization in Eqn. (1) gives
$$p(z) = \int p(g) \prod_{i=1}^{N} \prod_{t=1}^{T} p(z_{i,t} \mid z_{i,1:t-1}, g)\, dg.$$
Lastly, we can decompose $g$ such that $g_t$ contains a sufficient statistic of $g$ with respect to $z_{\cdot,t}$. This holds trivially by setting $g_t = g$, but that defeats the purpose of the subsequent models. Using the chain rule on $p(g)$ and substituting the result in proves the claim.

Theorem 2. For tree-wise exchangeable time series, that is, time series that can be grouped hierarchically into exchangeable sets, there exists a corresponding set of hierarchical latent variable models.

The proof is analogous to that of Theorem 1, and follows from a hierarchical extension of de Finetti's theorem (Austin & Panchenko, 2014). This decomposition is useful when dealing with product hierarchies. For instance, the sales events within the category of iPhone charger cables and the category of electronics cables may be exchangeable.

2.2. Practical Considerations

We now review some common design decisions used in modeling time series. The first is to replace the decomposition $\prod_{t=1}^{T} p(z_{i,t} \mid z_{i,1:t-1})$ by a tractable, approximate statistic $h_t$ of the past, such that $p(z_{i,t} \mid z_{i,1:t-1}) \approx p(z_{i,t} \mid h_{i,t})$. Here, $h_t$ typically assumes the form of a latent variable model via $p(h_{i,t} \mid h_{i,t-1}, z_{i,t-1})$. Popular choices for real-valued random variables are SSMs and GPs.

The second is to assume that the global variable $g_t$ is drawn from some $p(g_t \mid g_{t-1})$. Inference in this model is costly, since it requires constant interaction between local and global variables at prediction time, via Sequential Monte Carlo, variational inference or a similar procedure. One way to reduce these expensive calculations is to incorporate past local observations $z_{\cdot,t-1}$ explicitly. While this somewhat negates the simplicity of Theorem 1, it yields significantly higher accuracy for a limited computational budget: $g_t \sim p(g_t \mid g_{t-1}, z_{\cdot,t-1})$.

Lastly, the time series often come with observed covariates, such as a user's location or a detailed description of an item being sold. We add these covariates $x_{i,t}$ to the time series signal to obtain the following model:
$$p(z) = \int \prod_{t=1}^{T} p(g_t \mid g_{t-1}, x_{\cdot,t}, z_{\cdot,t-1}) \prod_{i=1}^{N} p(h_{i,t} \mid g_t, h_{i,t-1}, z_{i,t-1}, x_{i,t})\, p(z_{i,t} \mid g_t, h_{i,t}, z_{i,t-1}, x_{i,t})\, dg\, dh. \qquad (2)$$
Even though this model is rarely used in its full generality, Eqn. (2) is relevant because it is by some measure the most general model to consider, based on the de Finetti factorization in Theorem 1.
2.3. Special Cases

The global-local structure has been used previously in a number of special contexts (Xu et al., 2009; Ahmadi et al., 2011; Hwang et al., 2016; Choi et al., 2011). For instance, in Temporal LDA (Ahmed et al., 2012) we assume a common fixed Dirichlet-Multinomial distribution capturing the distribution of tokens per topic, and a time-variant set of latent random variables capturing the changes in user preferences. This is a special case of the above model, where the global time series does not depend on time, but is stationary instead. A more closely related case is the Neural Survival Recommender model of (Jing & Smola, 2017), which models the temporal dynamics of return times of a user to an app via survival analysis. In particular, it uses an LSTM for the global dynamics and LSTMs for the local survival probabilities. In this form, it falls into the category of models described by Theorem 1. Unlike the models we propose in this paper, it does not capture local uncertainty accurately. It also primarily deals with point processes rather than proper time series, and the inference algorithm differs quite significantly.

3. Deep Factor Models with Random Effects

Motivated by the characterization of exchangeable time series, in this section we propose a general framework for global-local forecasting models, called Deep Factor Models with Random Effects, that follows the structure given by the decomposition in Theorem 1. We describe the family of methods, show three concrete instantiations (DF-RNN, DF-LDS, and DF-GP), and derive the general inference and learning algorithm. Further models that can be obtained within the same framework, and additional details about the design choices, are described in Appendix A.

We are given a set of $N$ time series, with the $i$th time series consisting of tuples $(x_{i,t}, z_{i,t}) \in \mathbb{R}^d \times \mathbb{R}$, $t = 1, \dots, T$, where $x_{i,t}$ are the input covariates and $z_{i,t}$ is the corresponding observation at time $t$. Given a forecast horizon $\tau \in \mathbb{N}_+$, our goal is to calculate the joint predictive distribution of future observations,
$$p\big(\{z_{i,T+1:T+\tau}\}_{i=1}^N \,\big|\, \{x_{i,1:T+\tau}, z_{i,1:T}\}_{i=1}^N\big),$$
i.e. the joint distribution over future observations given all covariates (including future ones) and past observations.

3.1. Generative Model

Our key modeling assumption is that each time series $z_{i,t}$, $t = 1, 2, \dots$ is governed by a fixed global (non-random) component and a random component whose prior is specified by a generative model $\mathcal{R}_i$. In particular, we assume the following generative process:

global factors: $g_k(\cdot) = \mathrm{RNN}_k(\cdot)$, $k = 1, \dots, K$;
fixed effect: $f_i(\cdot) = \sum_{k=1}^{K} w_{i,k}\, g_k(\cdot)$; (3)
random effect: $r_i(\cdot) \sim \mathcal{R}_i$, $i = 1, \dots, N$; (4)
latent function: $u_i(\cdot) = f_i(\cdot) + r_i(\cdot)$; (5)
emission: $z_{i,t} \sim p(z_{i,t} \mid u_i(x_{i,t}))$.

The observation model $p$ can be any parametric distribution, such as Gaussian, Poisson or Negative Binomial. All the functions $g_k(\cdot)$, $r_i(\cdot)$, $u_i(\cdot)$ take features $x_{i,t}$ as input, and we define $u_{i,t} := u_i(x_{i,t})$ and the embedding $w_i := [w_{i,k}]_k$.

3.1.1. Global Effects (Common Patterns)

The global effects are given by linear combinations of $K$ latent global deep factors modeled by RNNs. These deep factors can be thought of as dynamic principal components or eigen time series that drive the underlying dynamics of all the time series.

Figure 1. Plate graph of the proposed Deep Factor Model with Random Effects. The diamond nodes represent deterministic states.
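To make the generative process in Eqns. (3)-(5) concrete, the following is a minimal NumPy sketch of a DF-RNN-style sample path. The simple tanh recurrence standing in for the RNN factors, the dimensions, noise scales, and all variable names are illustrative assumptions for exposition, not the implementation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K, d, H = 4, 50, 3, 2, 8          # series, time steps, factors, covariate dim, RNN state dim

x = rng.normal(size=(N, T, d))          # covariates x_{i,t}
w = rng.normal(size=(N, K))             # per-series embeddings w_i (factor loadings)

def rnn_factor(xs, Wx, Wh, Wo):
    """Toy recurrent factor g_k: a tanh RNN mapping a covariate sequence to one value per step."""
    h, out = np.zeros(H), []
    for x_t in xs:
        h = np.tanh(Wx @ x_t + Wh @ h)
        out.append(Wo @ h)
    return np.array(out)

# K global deep factors g_k(.), shared across all series and combined via the loadings w_i (Eqn. 3).
factor_params = [(rng.normal(size=(H, d)), 0.1 * rng.normal(size=(H, H)), rng.normal(size=H))
                 for _ in range(K)]

z = np.zeros((N, T))
for i in range(N):
    g = np.stack([rnn_factor(x[i], *p) for p in factor_params])   # (K, T) global factor values
    f = w[i] @ g                                                   # fixed effect f_i (Eqn. 3)
    r = 0.1 * rng.normal(size=T)                                   # random effect r_i: white noise (Eqn. 4)
    u = f + r                                                      # latent function u_i (Eqn. 5)
    z[i] = rng.normal(loc=u, scale=0.05)                           # Gaussian emission z_{i,t} ~ p(.|u_i(x_{i,t}))
```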
As mentioned in Section 2.2, we restrict the global effects to be deterministic to avoid costly inference at the global level that depends on all time series. The novel formulation of the fixed effects from the RNN in Eqn. (3) has advantages over a standard RNN forecaster. Figure 2 compares the generalization errors and average running times of using Eqn. (3) with the L2 loss against a standard RNN forecaster with the same L2 loss and a comparable number of parameters, on the real-world dataset electricity (Dheeru & Karra Taniskidou, 2017). Our fixed effect formulation shows significant improvements in data and computational efficiency, and the proposed model has less variance than the standard structure. Detailed empirical explorations of the proposed structure can be found in Section 5.1.

Figure 2. Generalization error (solid line), Mean Absolute Percentage Error (MAPE) on the test set, and running time in seconds (dashed line) vs. the size of the training set, in terms of data points per time series. The experiments are repeated over 10 runs.

3.1.2. Random Effects (Local Fluctuations)

The random effects $r_i(\cdot)$ in Eqn. (4) can be chosen to be any classical probabilistic time series model $\mathcal{R}_i$. To efficiently compute their marginal likelihood $p(z_i \mid \mathcal{R}_i)$, $r_{i,t}$ should be chosen to satisfy the normally distributed observation assumption. Table 1 summarizes the three models we consider for the local random effects models $\mathcal{R}_i$.

Table 1. Summary of Deep Factor Models with Random Effects. The likelihood column is under the assumption of Gaussian noise.

| Name | Description | Local model | Likelihood (Gaussian case) |
|---|---|---|---|
| DF-RNN | Zero-mean Gaussian noise process given by an RNN | $r_{i,t} \sim N(0, \sigma_{i,t}^2)$ | $p(z_i) = \prod_t N(z_{i,t} - f_{i,t} \mid 0, \sigma_{i,t}^2)$ |
| DF-LDS | State-space models | $r_{i,t} \sim \mathrm{LDS}_{i,t}$ (cf. Eqn. (6)) | $p(z_i)$ given by the Kalman filter |
| DF-GP | Zero-mean Gaussian Process | $r_{i,t} \sim \mathrm{GP}_{i,t}$ (cf. Eqn. (7)) | $p(z_i) = N(z_i - f_i \mid 0, K_i + \sigma_i^2 I)$ |

The first local model, DF-RNN, is defined as $r_{i,t} \sim N(0, \sigma_{i,t}^2)$, where $\sigma_{i,t}$ is given by a noise RNN that takes the input feature $x_{i,t}$. The noise process becomes correlated through the covariance function implicitly defined by the RNN, resulting in a simple deep generative model.

The second local model, DF-LDS, is part of a special and robust family of SSMs, Innovation State-Space Models (ISSMs) (Hyndman et al., 2008; Seeger et al., 2016; 2017b). This gives the following generative model:
$$h_{i,t} = F_{i,t} h_{i,t-1} + q_{i,t}\, \varepsilon_{i,t}, \qquad \varepsilon_{i,t} \sim N(0, 1). \qquad (6)$$
The latent state $h_{i,t}$ contains information about the level, trend, and seasonality patterns. It evolves by a deterministic transition matrix $F_{i,t}$ and a random innovation $q_{i,t}$. The structure of the transition matrix $F_{i,t}$ and the innovation strength $q_{i,t}$ determine which kinds of time series patterns the latent state $h_{i,t}$ encodes (cf. Appendix A.2 for the concrete choice of ISSM).

The third local model, DF-GP, is defined as the Gaussian Process
$$r_i(\cdot) \sim \mathrm{GP}(0, K_i(\cdot, \cdot)), \qquad (7)$$
where $K_i(\cdot, \cdot)$ denotes the kernel function. In this model, each time series has its own set of GP hyper-parameters to be learned.

3.2. Inference and Learning

Given a set of $N$ time series, our goal is to jointly estimate $\Theta$: the parameters of the global RNNs, the embeddings, and the hyper-parameters of the local models. We use maximum likelihood estimation, where $\Theta = \arg\max_\Theta \sum_i \log p(z_i \mid \Theta)$. Computing the marginal likelihood may require inference over the latent variables. The general learning algorithm is summarized in Algorithm 1.
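For concreteness, the following is a minimal NumPy sketch, under assumed toy inputs, of the Gaussian-case DF-RNN likelihood from Table 1, i.e. the quantity that step 4 of Algorithm 1 below evaluates for that model: the residuals $z_{i,t} - f_{i,t}$ are scored under independent zero-mean Gaussians with time-varying variance $\sigma_{i,t}^2$ produced by the noise RNN. The function and variable names are illustrative assumptions.

```python
import numpy as np

def df_rnn_log_likelihood(z, f, sigma):
    """Gaussian DF-RNN marginal log-likelihood for one series (Table 1):
    log p(z_i) = sum_t log N(z_{i,t} - f_{i,t} | 0, sigma_{i,t}^2).
    z, f, sigma are length-T arrays; sigma would come from the noise RNN."""
    resid = z - f
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * resid**2 / sigma**2)

# Toy usage with made-up values (illustrative only).
T = 24
f = np.sin(np.linspace(0, 3, T))                                # fixed effect f_i from the global factors
z = f + 0.1 * np.random.randn(T)                                # observations z_i
sigma = np.full(T, 0.1)                                         # sigma_{i,t} (here constant for simplicity)
print(df_rnn_log_likelihood(z, f, sigma))
```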
Algorithm 1. Training Procedure for Deep Factor Models with Random Effects.
1: for each time series $(x_i, z_i)$ do
2:   Sample the estimated latent representation from the variational encoder, $\tilde{u}_i \sim q_\phi(\cdot \mid z_i)$, for a non-Gaussian likelihood; otherwise set $\tilde{u}_i := z_i$.
3:   With the current estimate of the model parameters $\Theta$, compute the fixed effect $f_{i,t} = \sum_{k=1}^{K} w_{i,k}\, g_k(x_{i,t})$, and the corresponding ISSM parameters for DF-LDS or the kernel matrix $K_i$ for DF-GP.
4:   Calculate the marginal likelihood $p(z_i)$ as in Table 1, or its variational lower bound as in Eqn. (10).
5: end for
6: Accumulate the loss over the current mini-batch, and perform stochastic gradient descent.

3.2.1. Gaussian Likelihood

When the observation model $p(\cdot \mid u_{i,t})$ is Gaussian, $z_{i,t} \sim N(u_{i,t}, \sigma_{i,t}^2)$, the marginal likelihood can easily be computed for all three models. Evaluating the marginal likelihood for DF-RNN is straightforward (see Table 1). For DF-LDS and DF-GP, the Gaussian noise can be absorbed into the local model, yielding $z_{i,t} = u_{i,t} = f_{i,t} + r_{i,t}$, where $r_{i,t}$, instead of coming from the noiseless LDS or GP, is generated by the noisy version. More precisely, for DF-LDS, $r_{i,t} = a_{i,t}^\top h_{i,t} + \epsilon_{i,t}$ with $\epsilon_{i,t} \sim N(0, \sigma_i^2)$, and the marginal likelihood is obtained with a Kalman filter. For DF-GP, it amounts to adding $\sigma_{i,t}^2\, \delta(\cdot, \cdot)$ to Eqn. (7), where $\delta(\cdot, \cdot)$ is the Dirac delta function. The marginal likelihood becomes the standard GP marginal likelihood, the multivariate normal with mean $f_i := [f_i(x_{i,t})]_t$ and covariance matrix $K_i + \sigma_i^2 I$, where $K_i$ is the kernel matrix evaluated at the features of series $i$ and $I$ is the identity matrix of suitable size.

3.2.2. Non-Gaussian Likelihood

When the likelihood is not Gaussian, the exact marginal likelihood is intractable. We use variational inference, and optimize a variational lower bound of the marginal likelihood $\log p(z)$:
$$\log p(z) \ge \mathbb{E}_{q_\phi(u,h)}\left[\log \frac{p(z, u, h)}{q_\phi(u, h)}\right], \qquad (8)$$
where $u$ denotes the latent function values and $h$ the latent states of the local probabilistic models (for cleaner notation, we omit the time series index $i$). Optimizing this stochastic variational lower bound jointly over all time series is computationally expensive. We propose a new method that leverages the structure of the local probabilistic model to enable fast inference at the per-time-series level. This enables parallelism and efficient inference that scales up to a large collection (millions) of time series.

Motivated by (Fraccaro et al., 2017), we choose the structured approximation $q_\phi(h, u \mid z) := q_\phi(u \mid z)\, p(h \mid u)$, where the second term matches the exact conditional posterior under the random effect probabilistic model $\mathcal{R}$. With this form, given $u$, the inference becomes the canonical inference problem with $\mathcal{R}$ from Section 3.2.1. The first term $q_\phi(u \mid z)$ is given by another neural network parameterized by $\phi$, called a recognition network in the variational Auto-Encoder (VAE) framework (Kingma & Welling, 2014; Rezende et al., 2014). After massaging the equations, the stochastic variational lower bound in Eqn. (8) becomes
$$\frac{1}{L} \sum_{j=1}^{L} \big(\log p(z \mid \tilde{u}_j) + \log p(\tilde{u}_j) - \log q_\phi(\tilde{u}_j \mid z)\big), \qquad (10)$$
with $\tilde{u}_j \sim q_\phi(u \mid z)$ for $j = 1, \dots, L$ sampled from the recognition network. The first and third terms in Eqn. (10) are straightforward to compute. For the second term, we drop the sample index $j$ to obtain the marginal likelihood $\log p(\tilde{u})$ under the normally distributed random effect models. This term is computed in the same manner as in Section 3.2.1, with $z_i$ substituted by $\tilde{u}$. When the likelihood is Gaussian, the latent function values $u$ are equal to $z$, and we arrive at $\log p(z)$ from Eqn. (9).
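As an illustration of Eqn. (10), the following is a minimal NumPy/SciPy sketch of the Monte-Carlo estimate of the variational lower bound for one series with Poisson observations. The toy Gaussian "recognition network", the white-noise local model standing in for $\mathcal{R}$, and all parameter values are assumptions for exposition, not the paper's recognition network or inference code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
T, L = 30, 8                                    # series length, number of MC samples

z = rng.poisson(lam=3.0, size=T)                # non-Gaussian (Poisson) observations
f = np.full(T, 1.0)                             # fixed effect f_{i,t} from the global factors (assumed given)

# Toy recognition network q_phi(u|z): a diagonal Gaussian whose mean/scale are simple functions of z.
q_mean = np.log1p(z.astype(float))              # stand-in for an encoder output
q_std = np.full(T, 0.3)

elbo_terms = []
for _ in range(L):
    u = q_mean + q_std * rng.normal(size=T)     # u_j ~ q_phi(u|z)
    rate = np.log1p(np.exp(u))                  # Poisson rate lambda(u) = log(1 + exp(u)), as in Section 5.1
    log_p_z_given_u = stats.poisson.logpmf(z, rate).sum()          # log p(z | u_j)
    log_p_u = stats.norm.logpdf(u, loc=f, scale=0.5).sum()         # log p(u_j): Gaussian local model around f
    log_q_u = stats.norm.logpdf(u, loc=q_mean, scale=q_std).sum()  # log q_phi(u_j | z)
    elbo_terms.append(log_p_z_given_u + log_p_u - log_q_u)

elbo = np.mean(elbo_terms)                      # Monte-Carlo estimate of Eqn. (10)
print(elbo)
```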
4. Related Work and Discussions

Effectively combining probabilistic graphical models and deep learning approaches has been an active research area. Several approaches have been proposed for marrying RNNs with SSMs through either one or both of the following: (i) extending the Gaussian emission to complex likelihood models; (ii) making the transition equation non-linear via a multi-layer perceptron (MLP) or interlacing the SSM with transition matrices temporally specified by RNNs. Deep Markov Models (DMMs), proposed by (Krishnan et al., 2017; 2015), keep the Gaussian transition dynamics with mean and covariance matrix parameterized by MLPs. Stochastic RNNs (SRNNs) (Fraccaro et al., 2016) explicitly incorporate the deterministic dynamics from RNNs that do not depend on latent variables by interlacing them with an SSM. Chung et al. (2015) first proposed Variational RNNs (VRNNs), another way to make the transition equation non-linear, by cutting the ties between the latent states and associating them with deterministic states. This makes the state transition non-linearly determined by the RNN. VRNNs are also used in Latent LSTM Allocation (LLA) (Zaheer et al., 2017) and State-Space LSTM (SSL) (Zheng et al., 2017). These models require expensive inference at the global level through a recognition network, which is in stark contrast with Deep Factors, where the structural assumption of the variational approximation decomposes the inference problem into local probabilistic inference that is easily parallelizable, and global standard RNN learning (cf. Section 3.2).

In Fraccaro et al. (2017) and the recent Deep State Models (Rangapuram et al., 2018), the linear Gaussian transition structure is kept intact, so that the highly efficient Kalman filter/smoother is readily applicable. Our model differs from the former in that we avoid sampling the latent states in the ELBO, and eliminate the variance coming from the Monte-Carlo estimate of the second integral in Eqn. (10). Deep State is designed for a Gaussian likelihood with time-varying SSM parameters per time series. In contrast, with time-invariant local SSMs and flexible global effects, our model DF-LDS offers a parsimonious representation that can handle non-Gaussian observations.

Deep Gaussian Processes (Damianou & Lawrence, 2013) have attracted growing interest in recent years. Inspired by the GP-LVM structure (Lawrence, 2004), Deep GPs stack GPs on top of latent variables, resulting in more expressive mappings. Our framework provides an alternative approach to utilizing GPs more efficiently.

Due to its flexibility in interpolating between purely local and purely global models, a variety of common methods can be subsumed in our proposed framework. Deep Factors with one factor and no random effects, accompanied by autoregressive inputs, reduces to DeepAR (Flunkert et al., 2017). One difference in our formulation is that the scale of each time series is estimated automatically rather than pre-specified as it is in DeepAR. Changing the emission probability to Gaussian mixtures results in AR-MDN (Mukherjee et al., 2018). Sequence-to-Sequence models for forecasting (Wen et al., 2017) are another family of models that are a part of our framework. These methods make predictions discriminatively rather than generatively.
By dropping the random effects, using GPs as the prior and removing the restriction of RNNs on the global factors, we recover the semi-parametric latent factor model (Teh et al., 2005). By dropping the global effects, we arrive at the standard SSM or GP. Our newly developed general variational inference algorithm applies to both of these methods and to other normally distributed local stochastic models (cf. Section 3.2.2). While we have built upon existing work, to the best of our knowledge, Deep Factors provide the first model framework that incorporates SSMs and GPs with DNNs in a systematic manner.

5. Experiments

We conduct experiments with synthetic and real-world data to provide evidence for the practical effectiveness of our approach. We use a p3.8xlarge SageMaker instance in all our experiments. Our algorithms are implemented in MXNet Gluon (Chen et al., 2015), and make extensive use of its linear algebra library (Seeger et al., 2017a; Zhenwen et al., 2018). Further experiments with GPs as the local model are detailed in (Maddix et al., 2018).

5.1. Model Understanding and Exploration

The first set of experiments compares our proposed structure in Eqn. (3), with no probabilistic component, to the standard RNN structure on the electricity dataset. For each time series $i$, we have its embedding $w_i \in \mathbb{R}^{10}$ and two time features $x_t \in \mathbb{R}^2$, day of the week and hour of the day. Given an RNN cell (LSTM, GRU, etc.), the RNN Forecaster predicts the time series values $\hat{z}_{i,t} = \mathrm{RNN}_t(x_{i,t})$, where $x_{i,t} = [w_i; x_t]$. Deep Factor generates the point forecast as $\hat{z}_{i,t} = w_i^\top \mathrm{RNN}_t(x_t)$. The RNN cell for RNN Forecaster has an output dimension of 1, while that of Deep Factor has dimension 10, so the resulting numbers of parameters of both structures are roughly the same. Figure 2 demonstrates that the proposed structure significantly improves data efficiency while having less variance, and that the runtime of Deep Factor scales better than that of the standard RNN structure. By using the concatenation $[w_i; x_t]$, the standard RNN structure operates on the outer product space of $x_t$ and $w_i$ (of dimension $12 \times T$), while Deep Factor computes the intrinsic structure (of dimension 12 and $T$ separately).

Next, we focus on the DF-LDS model, and investigate (i) the effect of the recognition network for non-Gaussian observations with the purely local part ($f_{i,t} = 0$), i.e. the variational LDS (cf. Appendix A.2), and (ii) the recovery of the global factors in the presence of Gaussian and non-Gaussian noise. We generate data according to the following model, adapted from Example 24.3 in (Barber, 2012). The two-dimensional latent vector $h_t$ is rotated at each iteration and then projected to produce a scalar observation:
$$h_{t+1} = A h_t + \epsilon_h, \qquad A = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \qquad \epsilon_h \sim N(0, \sigma_h^2 I_2),$$
$$u_{t+1} = e_1^\top h_{t+1} + \epsilon_v, \qquad \epsilon_v \sim N(0, \sigma^2),$$
where $I_2$ is the $2 \times 2$ identity matrix and $e_1 \in \mathbb{R}^2$ is the standard basis vector. The true observations are generated by a Poisson distribution, $z_t \sim \mathrm{Poisson}[\lambda(u_t)]$, where $\lambda(u_t) = \log[1 + \exp(u_t)]$. This could be used to model seasonal products, where most of the sales happen when an item is in season, e.g. snowboards normally sell shortly before or during winters, and less so afterwards. Figure 3 shows the reconstructed intensity function $\lambda(u_t)$, as well as the corresponding forecasts for each choice of recognition network. Visual inspection reveals that RNNs are superior to MLPs as recognition networks. This is expected because the time series are sequential.
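The following is a minimal NumPy sketch of this synthetic data generator. The rotation angle, noise scales, and series length are illustrative choices, not necessarily the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
T, theta = 200, 2 * np.pi / 24                   # length and rotation angle (assumed values)
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation of the 2-D latent state
sigma_h, sigma_v = 0.05, 0.1                     # latent and projection noise scales (assumed)

h = np.array([1.0, 0.0])
u = np.zeros(T)
for t in range(T):
    h = A @ h + sigma_h * rng.normal(size=2)     # h_{t+1} = A h_t + eps_h
    u[t] = h[0] + sigma_v * rng.normal()         # u_{t+1} = e_1^T h_{t+1} + eps_v

lam = np.log1p(np.exp(u))                        # intensity lambda(u_t) = log(1 + exp(u_t))
z = rng.poisson(lam)                             # observations z_t ~ Poisson(lambda(u_t))
```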
We also test the ability of our algorithm to recover the underlying global factors. Our experiments show that even with the Poisson noise model, we are able to identify the true latent factors, in the sense of the distances between the subspaces spanned by the estimated and true global factors (cf. Appendix B.1).

5.2. Empirical Studies

In this subsection, we test how our model performs on several real-world, publicly available datasets: electricity (E) and traffic (T) from the UCI data set (Dheeru & Karra Taniskidou, 2017; Yu et al., 2016), nyc taxi (N) (Taxi & Commission, 2015) and uber (U) (Flowers, 2015) (cf. Appendix B.2). In the experiments, we choose the DF-RNN (DF) model with a Gaussian likelihood. To assess the performance of our algorithm, we compare with DeepAR (DA), a state-of-the-art RNN-based probabilistic forecasting algorithm available on AWS SageMaker (Januschowski et al., 2018), MQ-RNN (MR), a sequence model that generates conditional predictive quantiles (Wen et al., 2017), and Prophet (P), a Bayesian structural time series model (Taylor & Letham, 2017). The Deep Factor model has 10 global factors with a 1-layer LSTM cell of 50 hidden units. The noise LSTM has 1 layer and 5 hidden units. For a fair comparison with DeepAR, we use a comparable number of model parameters, that is, an embedding size of 10 with 1 layer and 50 hidden LSTM units. The Student-t likelihood in DeepAR is chosen for its robust performance. The same model structure is chosen for MQ-RNN, and the decoder MLP has a single hidden layer of 20 units. We use the adam optimization method with the default parameters in Gluon to train the DF-RNN and MQ-RNN, and the default training parameters for DeepAR.

We use the quantile loss to evaluate the probabilistic forecasts. For a given quantile $\rho \in (0, 1)$, a target value $z_t$ and a $\rho$-quantile prediction $\hat{z}_t(\rho)$, the $\rho$-quantile loss is defined as
$$\mathrm{QL}_\rho[z_t, \hat{z}_t(\rho)] = 2\big(\rho\,(z_t - \hat{z}_t(\rho))\,\mathbb{I}_{z_t - \hat{z}_t(\rho) > 0} + (1 - \rho)\,(\hat{z}_t(\rho) - z_t)\,\mathbb{I}_{z_t - \hat{z}_t(\rho) \le 0}\big).$$
We use a normalized sum of quantile losses, $\sum_{i,t} \mathrm{QL}_\rho[z_{i,t}, \hat{z}_{i,t}(\rho)] / \sum_{i,t} |z_{i,t}|$, to compute the quantile loss for a given span across all time series. We include results for $\rho = 0.5, 0.9$, which we abbreviate as the P50QL (mean absolute percentage error, MAPE) and P90QL, respectively.

Figure 3. Deep Factor (DF-LDS) with no global effects (variational LDS). Left: reconstructed intensity λ(u_t) with different recognition networks. Center and right: predictive distributions with MLP (center) and bidirectional LSTM (right).

Table 2. Results for the short-term (72-hour) and near-term (24-hour) forecast scenarios with one week of training data (mean ± std over runs; Prophet is run once).

| DS | H | DA (P50) | P (P50) | MR (P50) | DF (P50) | DA (P90) | P (P90) | MR (P90) | DF (P90) |
|---|---|---|---|---|---|---|---|---|---|
| E | 72 | 0.216±0.054 | 0.149 | 0.204±0.042 | 0.101±0.006 | 0.182±0.077 | 0.103 | 0.088±0.008 | 0.049±0.004 |
| E | 24 | 0.272±0.078 | 0.124 | 0.185±0.042 | 0.112±0.012 | 0.100±0.013 | 0.091 | 0.083±0.008 | 0.059±0.013 |
| N | 72 | 0.468±0.043 | 0.397 | 0.418±0.031 | 0.212±0.044 | 0.175±0.005 | 0.178 | 0.267±0.038 | 0.104±0.028 |
| N | 24 | 0.390±0.042 | 0.328 | 0.358±0.029 | 0.239±0.037 | 0.167±0.005 | 0.168 | 0.248±0.088 | 0.139±0.035 |
| T | 72 | 0.337±0.026 | 0.457 | 0.383±0.040 | 0.190±0.048 | 0.184±0.051 | 0.207 | 0.226±0.017 | 0.127±0.026 |
| T | 24 | 0.296±0.021 | 0.380 | 0.331±0.011 | 0.225±0.050 | 0.149±0.011 | 0.191 | 0.154±0.020 | 0.159±0.059 |
| U | 72 | 0.417±0.011 | 0.441 | 0.730±0.031 | 0.344±0.033 | 0.296±0.017 | 0.239 | 0.577±0.059 | 0.190±0.013 |
| U | 24 | 0.353±0.009 | 0.468 | 0.879±0.156 | 0.425±0.063 | 0.238±0.009 | 0.239 | 0.489±0.069 | 0.238±0.026 |

Figure 4. P50QL (MAPE) results for the short-term (72-hour) forecast in Table 2 on the four datasets, comparing DeepAR, Prophet, MQ-RNN and Deep Factor. Purple denotes the proposed method.
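Returning to the evaluation metric above, the following is a minimal NumPy sketch of the normalized ρ-quantile loss. The array shapes, function names, and toy inputs are assumptions for exposition.

```python
import numpy as np

def quantile_loss(z, z_hat, rho):
    """Pointwise rho-quantile loss:
    QL_rho[z, z_hat] = 2 * (rho * (z - z_hat) * 1{z > z_hat} + (1 - rho) * (z_hat - z) * 1{z <= z_hat})."""
    diff = z - z_hat
    return 2.0 * np.where(diff > 0, rho * diff, (rho - 1.0) * diff)

def normalized_quantile_loss(z, z_hat, rho):
    """Normalized sum over all series and time points: sum_{i,t} QL_rho / sum_{i,t} |z_{i,t}|."""
    return quantile_loss(z, z_hat, rho).sum() / np.abs(z).sum()

# Toy usage: z and z_hat are (num_series, horizon) arrays of targets and rho-quantile predictions.
z = np.array([[3.0, 5.0, 2.0], [10.0, 8.0, 7.0]])
z_hat = np.array([[2.5, 5.5, 2.0], [9.0, 8.5, 6.0]])
print(normalized_quantile_loss(z, z_hat, 0.5))   # P50QL (MAPE)
print(normalized_quantile_loss(z, z_hat, 0.9))   # P90QL
```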
For all the datasets, we limit the training length to only one week of time series (168 observations per time series). This represents a relevant scenario that occurs frequently in demand forecasting, where products often have only limited historical sales data. We average the Deep Factor, MQ-RNN and DeepAR results over ten trials. We use one trial for Prophet, since classical methods are typically less variable than neural-network based models.

Figure 4 illustrates the performance of the different algorithms in terms of the MAPE (P50 quantile loss) for the 72-hour forecast horizon. Table 2 shows the full results, and that our model outperforms the others in terms of accuracy and variability in most of the cases. For DeepAR, using SageMaker's HPO, our preliminary results (cf. Appendix B.2) show that with a larger model it achieves performance that is on par with our method, which has far fewer parameters. In addition, the sequence-to-sequence structure of DeepAR and MQ-RNN limits their ability to react flexibly to changing forecasting scenarios, e.g. during on-demand forecasting or interactive scenarios. For a forecasting scenario with a longer prediction horizon than the training horizon, DeepAR needs to be retrained to reflect the change in decoder length. Similarly, these models cannot generate forecasts that are longer than the length of the training time series, for example, the case in Figure 2. Our method has no difficulty performing this task and has greater data efficiency.

6. Conclusion

We propose a novel global-local framework for forecasting a collection of related time series, accompanied by a result that uniquely characterizes exchangeable time series. Our main contribution is a general, powerful and practically relevant modeling framework that scales and obtains state-of-the-art performance. Future work includes comparing variational dropout (Gal & Ghahramani, 2016) or deep ensembles (Lakshminarayanan et al., 2017) of non-probabilistic DNN models (e.g., the RNN Forecaster of Section 5.1) for uncertainty estimation.

Acknowledgments

We would like to thank the anonymous reviewers, Valentin Flunkert, David Salinas, Michael Bohlke-Schneider, Edo Liberty and Bing Xiang for their valuable feedback.

References

Ahmadi, B., Kersting, K., and Sanner, S. Multi-evidence lifted message passing, with application to PageRank and the Kalman filter. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.

Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., and Smola, A. J. Scalable inference in latent variable models. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 123-132. ACM, 2012.

Araujo, M., Ribeiro, P., Song, H. A., and Faloutsos, C. TensorCast: forecasting and mining with coupled tensors. Knowledge and Information Systems, 59(3):497-522, 2019.

Austin, T. and Panchenko, D. A hierarchical version of the de Finetti and Aldous-Hoover representations. Probability Theory and Related Fields, 159(3-4):809-823, 2014.

Barber, D. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.

Böse, J.-H., Flunkert, V., Gasthaus, J., Januschowski, T., Lange, D., Salinas, D., Schelter, S., Seeger, M., and Wang, Y. Probabilistic demand forecasting at scale. Proceedings of the VLDB Endowment, 10(12):1694-1705, 2017.

Brahim-Belhouari, S. and Bermak, A. Gaussian process for nonstationary time series prediction. Computational Statistics & Data Analysis, 47(4):705-712, 2004.
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

Choi, J., Guzman-Rivera, A., and Amir, E. Lifted relational Kalman filtering. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pp. 2980-2988, 2015.

Crawley, M. J. Mixed-effects models. The R Book, Second Edition, pp. 681-714, 2012.

Damianou, A. and Lawrence, N. D. Deep Gaussian processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 207-215, 2013.

Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017. University of California, Irvine, School of Information and Computer Sciences.

Diaconis, P. Finite forms of de Finetti's theorem on exchangeability. Synthese, 36:271-281, 1977.

Diaconis, P. and Freedman, D. Finite exchangeable sequences. The Annals of Probability, 8:745-764, 1980.

Faloutsos, C., Gasthaus, J., Januschowski, T., and Wang, Y. Forecasting big time series: old and new. Proceedings of the VLDB Endowment, 11(12):2102-2105, 2018.

Flowers, A. Uber TLC FOIL response. https://github.com/fivethirtyeight/uber-tlc-foil-response/blob/master/uber-trip-data/uber-raw-data-apr14.csv, 2015.

Flunkert, V., Salinas, D., and Gasthaus, J. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. arXiv preprint arXiv:1704.04110, 2017.

Forni, M., Hallin, M., Lippi, M., and Reichlin, L. The generalized dynamic-factor model: Identification and estimation. Review of Economics and Statistics, 82(4):540-554, 2000.

Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pp. 2199-2207, 2016.

Fraccaro, M., Kamronn, S., Paquet, U., and Winther, O. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, pp. 3604-3613, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050-1059, 2016.

Gasthaus, J., Benidis, K., Wang, Y., Rangapuram, S. S., Salinas, D., Flunkert, V., and Januschowski, T. Probabilistic forecasting with spline quantile function RNNs. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1901-1910, 2019.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. Bayesian Data Analysis. CRC Press, 2013.

Geweke, J. The dynamic factor analysis of economic time series. Latent Variables in Socio-Economic Models, 1977.

Girard, A., Rasmussen, C. E., Candela, J. Q., and Murray-Smith, R. Gaussian process priors with uncertain inputs: application to multiple-step ahead time series forecasting. In Advances in Neural Information Processing Systems, pp. 545-552, 2003.

Harvey, A. C. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1990.

Hooi, B., Shin, K., Liu, S., and Faloutsos, C. SMF: Drift-aware matrix factorization with seasonal patterns.
In Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 621-629. SIAM, 2019.

Hwang, Y., Tong, A., and Choi, J. Automatic construction of nonparametric relational regression models for multiple time series. In International Conference on Machine Learning, pp. 3030-3039, 2016.

Hyndman, R., Koehler, A. B., Ord, J. K., and Snyder, R. D. Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media, 2008.

Januschowski, T., Arpin, D., Salinas, D., Flunkert, V., Gasthaus, J., Stella, L., and Vazquez, P. Now available in Amazon SageMaker: DeepAR algorithm for more accurate time series forecasting. https://aws.amazon.com/blogs/machine-learning/now-available-in-amazon-sagemaker-deepar-algorithm-for-more-accurate-time-series-forecasting/, 2018.

Jing, H. and Smola, A. J. Neural survival recommender. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 515-524. ACM, 2017.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. ICLR, 2014.

Krishnan, R. G., Shalit, U., and Sontag, D. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.

Krishnan, R. G., Shalit, U., and Sontag, D. Structured inference networks for nonlinear state space models. In AAAI, pp. 2101-2109, 2017.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402-6413, 2017.

Lawrence, N. D. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems, pp. 329-336, 2004.

Low, Y., Agarwal, D., and Smola, A. J. Multiple domain user personalization. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 123-131. ACM, 2011.

Maddix, D. C., Wang, Y., and Smola, A. Deep factors with Gaussian processes for forecasting. NeurIPS Workshop on Bayesian Deep Learning, 2018.

Mukherjee, S., Shankar, D., Ghosh, A., Tathawadekar, N., Kompalli, P., Sarawagi, S., and Chaudhury, K. AR-MDN: Associative and recurrent mixture density networks for e-retail demand forecasting. arXiv preprint arXiv:1803.03800, 2018.

Pan, J. and Yao, Q. Modelling multiple time series via common factors. Biometrika, 95(2):365-379, 2008.

Rangapuram, S. S., Seeger, M., Gasthaus, J., Stella, L., Wang, Y., and Januschowski, T. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems, 2018.

Rasmussen, C. E. and Williams, C. K. Gaussian Processes for Machine Learning. MIT Press, 2006.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278-1286, 2014.

Seeger, M. Gaussian processes for machine learning. International Journal of Neural Systems, 14(02):69-106, 2004.

Seeger, M., Hetzel, A., Dai, Z., Meissner, E., and Lawrence, N. D. Auto-differentiating linear algebra. arXiv preprint arXiv:1710.08717, 2017a.

Seeger, M., Rangapuram, S., Wang, Y., Salinas, D., Gasthaus, J., Januschowski, T., and Flunkert, V. Approximate Bayesian inference in linear state space models for intermittent demand forecasting at scale. arXiv preprint arXiv:1709.07638, 2017b.

Seeger, M. W., Salinas, D., and Flunkert, V. Bayesian intermittent demand forecasting for large inventories. In Advances in Neural Information Processing Systems, pp. 4646-4654, 2016.

Stock, J. H. and Watson, M. Dynamic factor models. Oxford Handbook on Economic Forecasting, 2011.
Taxi, N. and Commission, L. TLC trip record data. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml, 2015.

Taylor, S. J. and Letham, B. Forecasting at scale. The American Statistician, 2017.

Teh, Y., Seeger, M., and Jordan, M. Semiparametric latent factor models. In AISTATS 2005 - Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pp. 333-340, 2005.

Wen, R., Torkkola, K., and Narayanaswamy, B. A multi-horizon quantile recurrent forecaster. NIPS Workshop on Time Series, 2017.

Xie, C., Tank, A., Greaves-Tunnell, A., and Fox, E. A unified framework for long range and cold start forecasting of seasonal profiles in time series. arXiv preprint arXiv:1710.08473, 2017.

Xu, Z., Kersting, K., and Tresp, V. Multi-relational learning with Gaussian processes. In Twenty-First International Joint Conference on Artificial Intelligence, 2009.

Yu, H.-F., Rao, N., and Dhillon, I. S. Temporal regularized matrix factorization for high-dimensional time series prediction. In Advances in Neural Information Processing Systems, pp. 847-855, 2016.

Zaheer, M., Ahmed, A., and Smola, A. J. Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequence data. In International Conference on Machine Learning, pp. 3967-3976, 2017.

Zheng, X., Zaheer, M., Ahmed, A., Wang, Y., Xing, E. P., and Smola, A. J. State space LSTM models with particle MCMC inference. arXiv preprint arXiv:1711.11179, 2017.

Zhenwen, D., Meissner, E., and Lawrence, N. D. MXFusion: A modular deep probabilistic programming library. NIPS 2018 Workshop MLOSS, 2018.