Published as a conference paper at ICLR 2021

GENERATIVE TIME-SERIES MODELING WITH FOURIER FLOWS

Ahmed M. Alaa
University of California, Los Angeles, USA
ahmedmalaa@ucla.edu

Alex J. Chan
University of Cambridge, UK
ajc340@cam.ac.uk

Mihaela van der Schaar
University of Cambridge, UK
University of California, Los Angeles, USA
Cambridge Center for AI in Medicine, UK
The Alan Turing Institute, UK
mv472@cam.ac.uk

ABSTRACT

Generating synthetic time-series data is crucial in various application domains, such as medical prognosis, wherein research is hamstrung by the lack of access to data due to concerns over privacy. Most of the recently proposed methods for generating synthetic time-series rely on implicit likelihood modeling using generative adversarial networks (GANs), but such models can be difficult to train and may jeopardize privacy by memorizing temporal patterns in training data. In this paper, we propose an explicit likelihood model based on a novel class of normalizing flows that view time-series data in the frequency domain rather than the time domain. The proposed flow, dubbed a Fourier flow, uses a discrete Fourier transform (DFT) to convert variable-length time-series with arbitrary sampling periods into fixed-length spectral representations, then applies a (data-dependent) spectral filter to the frequency-transformed time-series. We show that, by virtue of the analytic properties of the DFT, the Jacobian determinants and inverse mapping for the Fourier flow can be computed efficiently in linearithmic time, without imposing explicit structural constraints as in existing flows such as NICE (Dinh et al. (2014)), Real NVP (Dinh et al. (2016)) and GLOW (Kingma & Dhariwal (2018)). Experiments show that Fourier flows perform competitively compared to state-of-the-art baselines.

1 INTRODUCTION

Lack of access to data is a key hindrance to the development of machine learning solutions in application domains where data sharing may lead to privacy breaches (Walonoski et al. (2018)). Areas where this problem is most conspicuous include medicine, where access to (highly sensitive) clinical data is stringently regulated by medical institutions; such strict regulations undermine scientific progress by hindering model development and reproducibility. Generative models that produce sensible and realistic synthetic data present a viable solution to this problem: artificially-generated data sets produced by such models can be shared widely without privacy concerns (Buczak et al. (2010)).

In this paper, we focus on the time-series data setup, where observations are collected sequentially over arbitrary periods of time with different observation frequencies across different features. This general data setup is pervasive in the medical domain: it captures the kind of data maintained in electronic health records (Shickel et al. (2017)) or collected in intensive care units (Johnson et al. (2016)). While many machine learning-based predictive models that capitalize on such data have been proposed over the past few years (Jagannatha & Yu (2016); Choi et al. (2017); Alaa & van der Schaar (2019)), much less work has been done on generative models that could emulate and synthesize these data sets. Existing generative models for (medical) time-series are based predominantly on implicit likelihood modeling using generative adversarial networks (GANs), e.g., Recurrent Conditional GAN (RCGAN) (Esteban et al. (2017)) and TimeGAN (Yoon et al. (2019)).
These models apply representation learning via recurrent neural networks (RNNs) combined with adversarial training in order to map noise sequences in a latent space to synthetic sequential data in the output space. Albeit capable of flexibly learning complex representations, GAN-based models can be difficult to train (Srivastava et al. (2017)), especially in the complex time-series data setup. Moreover, because they hinge on implicit likelihood modeling, GAN-based models can be hard to evaluate quantitatively due to the absence of an explicitly computable likelihood function. Finally, GANs are vulnerable to training data memorization (Nagarajan et al. (2018)), a problem that would be exacerbated in the temporal setting, where memorizing only a partial segment of a medical time-series may suffice to reveal a patient's identity, which defeats the original purpose of using synthetic data in the first place.

Here, we propose an alternative explicit likelihood approach for generating time-series data based on a novel class of normalizing flows which we call Fourier flows. Our proposed flow-based model operates on time-series data in the frequency domain rather than the time domain: it converts variable-length time-series with varying sampling rates across different features to a fixed-size spectral representation using the discrete Fourier transform (DFT), and then learns the distribution of the data in the frequency domain by applying a chain of data-dependent spectral filters to the frequency-transformed time-series.

Using the convolution property of the DFT (Oppenheim (1999)), we show that spectral filtering of a time-series in the frequency domain, an operation that mathematically resembles the affine transformations used in existing flows (Dinh et al. (2016)), is equivalent to a convolutional transformation in the time domain. This enhancement in the richness of distributions learned by our flow comes at no extra computational cost: using Fast Fourier Transform (FFT) algorithms, we show that all steps of our flow run in $O(T \log T)$ time, compared to the polynomial complexity of $O(T^2)$ for a direct, time-domain convolutional transformation. We also show that, because the DFT is a linear transform with a Vandermonde transformation matrix, computation of its Jacobian determinant is trivial. The zero-padding and interpolation properties of the DFT enable a natural handling of variable-length and inconsistently-sampled time-series. Unlike existing explicit-likelihood models for time-series data, such as deep state-space models (Krishnan et al., 2017; Alaa & van der Schaar, 2019), our model can be optimized and assessed through the exact likelihood rather than a variational lower bound.

2 PROBLEM SETUP

We consider a general temporal data setup where each instance of a (discrete) time-series comprises a sequence of vectors $x = [x_0, \ldots, x_{T-1}]$, $x_t \in \mathcal{X}$, $0 \le t \le T-1$, covering a period of $T$ time steps. We assume that each dimension in the feature vector $x_t$ is sampled with a different rate, i.e., at each time step $t$, the observed feature vector is $x_t = [\,x_{t,1}[r_1], \ldots, x_{t,D}[r_D]\,]$, where $r_d \in \mathbb{N}^+$ is the sampling period of feature dimension $d \in \{1, \ldots, D\}$. That is, for a given sampling period $r_d$, we observe a value of $x_{t,d}$ every $r_d$ time steps, and observe a missing value (denoted as $*$) otherwise, i.e.,

$$x_{t,d}[r_d] = \begin{cases} x_{t,d}, & t \bmod r_d = 0, \\ *, & t \bmod r_d \neq 0. \end{cases} \tag{1}$$
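To make the sampling convention in (1) concrete, the following is a minimal numpy sketch (our illustration; the helper name is hypothetical, not from the paper) that builds the observed feature matrix for a toy series, using NaN in place of the missing-value symbol $*$:

```python
import numpy as np

def observed_series(x, sampling_periods):
    """Apply the sampling convention of Eq. (1): feature d is observed only
    at time steps t with t % r_d == 0; all other entries are missing (NaN
    stands in for the missing-value symbol *)."""
    T, D = x.shape
    out = np.full((T, D), np.nan)
    for d, r_d in enumerate(sampling_periods):
        observed_t = np.arange(0, T, r_d)   # all t with t mod r_d == 0
        out[observed_t, d] = x[observed_t, d]
    return out

# Toy example: T = 6 time steps, D = 2 features with sampling periods (1, 3),
# so the second feature is observed only at t = 0 and t = 3.
x = np.arange(12, dtype=float).reshape(6, 2)
print(observed_series(x, sampling_periods=(1, 3)))
```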
The data setup described above is primarily motivated by medical time-series modeling problems, wherein a patient's clinical measurements and bio-markers are collected over time at different rates (Johnson et al. (2016); Jagannatha & Yu (2016)). Despite our focus on medical data, our proposed generative modeling approach applies more generally to other applications, such as speech synthesis (Prenger et al. (2019)) and financial data generation (Wiese et al. (2020)).

Each realization of the time-series $x$ is drawn from a probability distribution $x \sim p(x)$. In order to capture variable-length time-series (common in medical problems), the length $T$ of each sequence is also assumed to be a random variable; for notational convenience, we absorb the distribution of $T$ into $p$. One possible way to represent the joint distribution $p(x)$ is through the factorization:¹

$$p(x) = p(x_0, \ldots, x_{T-1}, T) = p(T) \prod_{t=0}^{T-1} p(x_t \mid x_0, \ldots, x_{t-1}, T). \tag{2}$$

We assume that the sampling period $r_d$ for each feature $d$ is fixed for all realizations of $x$. The feature space $\mathcal{X}$ is assumed to accommodate a mix of continuous and discrete variables on its $D$ dimensions.

¹Our proposed method is not restricted to any specific factorization of $p(x)$.

Key objective. Using a training data set $\mathcal{D} = \{x^{(i)}\}_{i=1}^{n}$ comprising $n$ time-series, our goal is to (1) estimate a density function $\hat{p}(x)$ that best approximates $p(x)$, and (2) sample synthetic realizations of the time-series $x$ from the estimated density $\hat{p}(x)$. When dealing with data sets with variable lengths for the time-series, we model the distribution $p(T)$ independently, following the factorization in (2). We model $p(T)$ using a binomial distribution. Throughout the paper, we focus on developing a flow-based model for the conditional distribution $p(x_0, \ldots, x_{T-1} \mid T)$.

3 PRELIMINARIES

Let $z \in \mathbb{R}^D$ be a random variable with a known and tractable probability density function $p(z)$, and let $g: \mathbb{R}^D \to \mathbb{R}^D$ be an invertible and differentiable mapping with an inverse mapping $f = g^{-1}$. Let $x = g(z)$ be a transformed random variable; the probability density function $p(x)$ can be obtained using the change of variable rule as

$$p(x) = p(z)\, |\det J[g]|^{-1} = p(f(x))\, |\det J[f]|,$$

where $J[f]$ and $J[g]$ are the Jacobian matrices of the functions $f$ and $g$, respectively (Durrett (2019)).

Normalizing flows are compositions of $M$ mappings that transform random draws from a predefined distribution $z \sim p(z)$ to a desired distribution $p(x)$. Formally, a flow comprises a chain of bijective maps $g = g^{(1)} \circ g^{(2)} \circ \cdots \circ g^{(M)}$ with an inverse mapping $f = f^{(1)} \circ f^{(2)} \circ \cdots \circ f^{(M)}$. Using the change of variables formula described above, and applying the chain rule to the Jacobian of the composition, the log-likelihood of $x$ can be written as (Rezende & Mohamed (2015)):

$$\log p(x) = \log p(z) + \sum_{m=1}^{M} \log |\det J[f^{(m)}]|. \tag{3}$$

Existing approaches to generative modeling with normalizing flows construct composite mappings $g$ with structural assumptions that render the computation of the Jacobian determinant in (3) viable. Examples of such structurally-constrained mappings include: Sylvester transformations, with a Jacobian corresponding to a perturbed diagonal matrix (Rezende & Mohamed (2015)); 1×1 convolutions for cross-channel mixing, which exhibit a block-diagonal Jacobian (Kingma & Dhariwal (2018)); and affine coupling layers that correspond to triangular Jacobian matrices (Dinh et al. (2016)).
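As a minimal worked example of the change-of-variables computation in (3), the sketch below (our illustration, not code from the paper) evaluates the exact log-likelihood for a single diagonal affine map $g(z) = a \odot z + b$, whose diagonal Jacobian makes the log-determinant a simple sum; in a full flow, one such term accumulates per layer:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
a = rng.normal(size=D) + 2.0   # scale parameters of g(z) = a*z + b (kept away from 0)
b = rng.normal(size=D)         # shift parameters

def log_likelihood(x):
    """Exact log p(x) for x = g(z) with z ~ N(0, I), via the change of
    variables: log p(x) = log p(f(x)) + log|det J[f]|, with f(x) = (x - b)/a."""
    z = (x - b) / a                                  # inverse map f = g^{-1}
    log_pz = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))
    log_det_jf = -np.sum(np.log(np.abs(a)))          # J[f] is diagonal with entries 1/a_d
    return log_pz + log_det_jf

x = a * rng.normal(size=D) + b                       # a draw from the model
print(log_likelihood(x))
```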
3.1 FOURIER TRANSFORM

The Fourier transform is a mathematical operation that converts a finite-length, regularly-sampled time-domain signal $x$ to its frequency-domain representation $X$ (Bracewell & Bracewell (1986)). A $T$-point discrete Fourier transform (DFT), denoted as $X = \mathcal{F}_T\{x\}$, transforms a (complex-valued) time-stamped sequence $x = \{x_0, \ldots, x_{T-1}\}$ into a length-$T$ sequence of (complex-valued) frequency components, $X = \{X_0, \ldots, X_{T-1}\}$, through the following operation (Oppenheim (1999)):

$$X_k = \sum_{t=0}^{T-1} x_t\, e^{-2\pi j \frac{kt}{T}}, \quad 0 \le k \le T-1, \tag{4}$$

where $j$ denotes the imaginary unit. Using Euler's formula, the complex exponential terms in (4) can be expressed as $e^{-2\pi j \frac{kt}{T}} = \cos\big(2\pi \frac{kt}{T}\big) - j \sin\big(2\pi \frac{kt}{T}\big)$. Thus, the Fourier transform decomposes any time-series into a linear combination of sinusoidal signals of varying frequencies; the resulting sequence of frequency components, $X$, corresponds to the coefficients assigned to the different frequencies of the sinusoidal signals constituting the time-domain signal $x$. The DFT is a key computational and conceptual tool in many practical applications involving digital signal processing and communications (Oppenheim (1999)).

3.2 FOURIER TRANSFORM PROPERTIES

In developing our model, we will rely on various key properties of the DFT. These properties describe various operations on the time-domain data and their dual (equivalent) operations in the frequency domain. The DFT properties relevant to the development of our model are listed as follows (a numerical sanity check is given after Table 1):

Convolution: $x_1 \circledast x_2 \leftrightarrow X_1 \odot X_2$.
Symmetry: if $x$ is real-valued, then $X_k = X^*_{-k + mT}$, $m \in \mathbb{N}$.
Even/Odd Transforms: $\mathcal{F}\{\mathrm{Even}(x)\} = \mathrm{Re}(X)$, $\mathcal{F}\{\mathrm{Odd}(x)\} = \mathrm{Im}(X)$,

where $\odot$ denotes element-wise multiplication, $\circledast$ denotes circular convolution, $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ denote the real and imaginary components, $\mathrm{Even}(x) = (x + x^-)/2$ and $\mathrm{Odd}(x) = (x - x^-)/2$, where $x^-$ signifies the reflection of $x$ with respect to the $t = 0$ axis, and $x = \mathrm{Even}(x) + \mathrm{Odd}(x)$. Another property that is relevant to our model is the interpolation property, which posits that zero-padding of $x$ in the time domain corresponds to an up-sampled version of $X$ in the frequency domain.

Table 1: The two main layers in an $N$-point Fourier flow, their inverses, and the corresponding log-determinant of the Jacobian. The input to the flow, $x$, is a $D \times T$ matrix comprising a set of $D$ time-series, each of length $T$, and the output, $Y$, is a $2 \times D \times N/2$ tensor comprising the filtered spectral representation of $x$. Here, we show the application of the Fourier transform layer to a given feature dimension $d$; the same operation is applied independently to all feature dimensions. The vector $\hat{X}^-_d$ is the reversed conjugate of $\hat{X}_d$, and $H = [H_{i,j}]_{i,j}$. When cascading multiple Fourier flows, we alternate between feeding either of the real and imaginary channels of $\hat{X}$, $\mathrm{Im}(\hat{X})$ and $\mathrm{Re}(\hat{X})$, to the BiRNN network in the different flows within the cascade.

| Layer | Function | Inverse function | $\log \lvert \det J \rvert$ |
|---|---|---|---|
| Fourier transform | $\tilde{x}_d = x_d \cup 0_{N-T}$, $\;\tilde{x}_{t,d} = 0 \;\forall t: t \bmod r_d \neq 0$; $\;\tilde{X}_d = \mathcal{F}_N\{\tilde{x}_d\}$; $\;\hat{X}_d = [\tilde{X}_{0,d}, \ldots, \tilde{X}_{N/2,d}]$ | $\tilde{X}_d = [\hat{X}_d, \hat{X}^-_d]$; $\;\tilde{x}_d = \mathcal{F}^{-1}_N\{\tilde{X}_d\}$; $\;x_d = [\tilde{x}_{0,d}, \ldots, \tilde{x}_{T-1,d}]$ | $\log \lvert \det W \rvert = 0$ |
| Spectral filtering | $(\log H, \mu) = \mathrm{BiRNN}(\mathrm{Im}(\hat{X}))$; $\;Y_1 = H \odot \mathrm{Re}(\hat{X}) + \mu$; $\;Y_2 = \mathrm{Im}(\hat{X})$; $\;Y = \mathrm{concat}(Y_1, Y_2)$ | $Y_1, Y_2 = \mathrm{split}(Y)$; $\;(\log H, \mu) = \mathrm{BiRNN}(Y_2)$; $\;\mathrm{Re}(\hat{X}) = (Y_1 - \mu)/H$; $\;\hat{X} = \mathrm{Re}(\hat{X}) + j\, \mathrm{Im}(\hat{X})$ | $\sum_{i,j} \log \lvert H_{i,j} \rvert$ |
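As a quick numerical sanity check, the following numpy snippet (our illustration, not code from the paper) verifies the two DFT properties the flow relies on most heavily: the convolution property underlying the spectral filtering layer, and the conjugate symmetry underlying the spectral cropping step:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8
x1, x2 = rng.normal(size=T), rng.normal(size=T)

# Convolution property: circular convolution in time <-> element-wise product in frequency.
circ = np.array([sum(x1[i] * x2[(t - i) % T] for i in range(T)) for t in range(T)])
assert np.allclose(np.fft.fft(circ), np.fft.fft(x1) * np.fft.fft(x2))

# Conjugate symmetry of real-valued signals: X_k equals the conjugate of X_{-k mod T}.
# This redundancy is what lets the Fourier transform layer crop the spectrum
# to its first half without losing information.
X = np.fft.fft(x1)
assert np.allclose(X, np.conj(X[(-np.arange(T)) % T]))
print("DFT properties verified")
```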
4 FOURIER FLOWS

We propose a new flow for time-series data, the Fourier flow, which hinges on the frequency-domain view of time-series data (explained in Section 3.1) and capitalizes on the Fourier transform properties discussed in Section 3.2. In a nutshell, a Fourier flow comprises two steps: (a) a frequency transform layer (Section 4.1), followed by (b) a data-dependent spectral filtering layer (Section 4.2). The steps involved in a Fourier flow are summarized in Table 1 and Figure 1.

4.1 FOURIER TRANSFORM LAYER

In the first step of the proposed flow, we transform the time-series $x = [x_0, \ldots, x_{T-1}]$ into its spectral representation via the Fourier transform; we do so by applying the DFT operation (described in Section 3.1) to each feature dimension $d$ independently. Let $x_d = [\,x_{0,d}[r_d], \ldots, x_{T-1,d}[r_d]\,]$ be the time-series associated with feature dimension $d$; the Fourier transform layer computes the $N$-point DFT of $x_d$ for all $d \in \{1, \ldots, D\}$ through the following three steps (a minimal code sketch is given at the end of this section):

$$\begin{aligned} \text{Temporal zero-padding:} \quad & \tilde{x}_d = x_d \cup 0_{N-T}, \quad \tilde{x}_{t,d}[r_d] = 0, \;\forall t: t \bmod r_d \neq 0, \\ \text{Fourier transform:} \quad & \tilde{X}_d = \mathcal{F}_N\{\tilde{x}_d\}, \\ \text{Spectral cropping:} \quad & \hat{X}_d = [\tilde{X}_{0,d}, \ldots, \tilde{X}_{N/2,d}]. \end{aligned} \tag{5}$$

Here, $0_{N-T}$ denotes a set of $N - T$ zeros, and the union operator denotes the padding operation that appends the zeros $0_{N-T}$ to the time-series $x_d$. The temporal zero-padding step capitalizes on the frequency interpolation and sampling properties of the DFT (Section 3.2) to ensure that the padded time-series $\tilde{x}_d$ and its frequency spectrum $\tilde{X}_d$ have a fixed (predetermined) length of $N$, irrespective of the interval length $T$ and the sampling period $r_d$. Because the DFT coefficients are complex-valued, $\tilde{X}_d$ is a tensor with dimensions $2 \times 1 \times N$, and the collection of Fourier transforms for the $D$ feature dimensions, $\tilde{X}$, is a tensor with dimensions $2 \times D \times N$. That is, the DFT layer converts each time-series $x$ into two-channel, image-like $D \times N$ matrices $\mathrm{Re}(\tilde{X})$ and $\mathrm{Im}(\tilde{X})$, as shown in Figure 1. A flow with an $N$-point DFT will be referred to as an $N$-point Fourier flow in the rest of the paper. To guarantee a lossless recovery of the time-series $x$ via the inverse DFT, we ensure that $N \geq T$.

Finally, the spectral cropping step in (5) discards the $(N/2+1)$-th to $N$-th frequency components in both $\mathrm{Re}(\tilde{X})$ and $\mathrm{Im}(\tilde{X})$. This is because $x$ is real-valued, hence $\mathrm{Re}(\tilde{X})$ and $\mathrm{Im}(\tilde{X})$ are symmetric and anti-symmetric, respectively (see Section 3.2), which renders the discarded frequency components redundant. The final output of this layer, $\hat{X}$, is a $2 \times D \times N/2$ tensor.

DFT computation. Since the DFT operation in (4) is linear, we can represent the frequency transform step in (5) through a linear transformation and apply the DFT via matrix multiplication as follows:

$$\tilde{X}_d = W \tilde{x}_d, \quad W = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega & \omega^2 & \cdots & \omega^{N-1} \\ 1 & \omega^2 & \omega^4 & \cdots & \omega^{2(N-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{N-1} & \omega^{2(N-1)} & \cdots & \omega^{(N-1)(N-1)} \end{bmatrix}, \quad \omega = e^{-2\pi j / N}. \tag{6}$$

When $N$ is set to be a power of 2, the DFT operation in (6) can be implemented using any variant of the Fast Fourier Transform (FFT) methods (Nussbaumer (1981); Van Loan (1992)), such as the Cooley-Tukey FFT algorithm (Cooley & Tukey (1965)), with a computational complexity of $O(N \log N)$.

Determinant of the DFT Jacobian. The DFT is a natural and intuitive transformation for temporal data, but how does introducing the DFT mapping affect the complexity of the Jacobian determinant of the flow? To answer this question, we note that the DFT matrix $W$ in (6) is a (square) Vandermonde matrix, thus we can evaluate the DFT Jacobian determinant in closed form as follows:

$$|\det(J[W])| \overset{(a)}{=} |\det(W)| \overset{(b)}{=} \prod_{0 \le m < k \le N-1} |\omega^k - \omega^m|,$$

where (a) follows from the linearity of the DFT, and (b) follows from the Vandermonde determinant formula; this is a data-independent constant that can be evaluated once, regardless of the values of $x$.
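Below is a minimal numpy sketch of the Fourier transform layer in (5), under the simplifying assumptions of a single, fully observed feature ($D = 1$, $r_d = 1$); this is our illustration, not the authors' implementation. Note that np.fft.rfft directly returns the cropped spectrum $\hat{X}_d$ (the first $N/2 + 1$ components) that (5) obtains via conjugate symmetry, and runs in $O(N \log N)$ time:

```python
import numpy as np

def fourier_layer(x_d, N):
    """Eq. (5) for one real-valued feature: zero-pad to length N, take the
    N-point DFT, and keep only the first N//2 + 1 components (the rest are
    redundant by conjugate symmetry). np.fft.rfft performs the DFT and the
    spectral cropping in a single O(N log N) call."""
    T = len(x_d)
    assert N >= T, "N >= T is required for lossless recovery"
    x_pad = np.concatenate([x_d, np.zeros(N - T)])   # temporal zero-padding
    return np.fft.rfft(x_pad)                        # DFT + spectral cropping

def inverse_fourier_layer(X_hat, N, T):
    """Inverse map: rebuild the full spectrum by symmetry, invert, drop padding."""
    return np.fft.irfft(X_hat, n=N)[:T]

T, N = 5, 8                                          # N chosen as a power of 2, N >= T
x_d = np.random.randn(T)
X_hat = fourier_layer(x_d, N)
assert np.allclose(inverse_fourier_layer(X_hat, N, T), x_d)  # exact recovery
```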