# generative_timeseries_modeling_with_fourier_flows__484d3c08.pdf

Published as a conference paper at ICLR 2021

GENERATIVE TIME-SERIES MODELING WITH FOURIER FLOWS

Ahmed M. Alaa University of California, Los Angeles, USA ahmedmalaa@ucla.edu

Alex J. Chan University of Cambridge, UK ajc340@cam.ac.uk

Mihaela van der Schaar University of Cambridge, UK University of California, Los Angeles, USA Cambridge Center for AI in Medicine, UK The Alan Turing Institute, UK mv472@cam.ac.uk

Generating synthetic time-series data is crucial in various application domains, such as medical prognosis, wherein research is hamstrung by the lack of access to data due to concerns over privacy. Most of the recently proposed methods for generating synthetic time-series rely on implicit likelihood modeling using generative adversarial networks (GANs) but such models can be difﬁcult to train, and may jeopardize privacy by memorizing temporal patterns in training data. In this paper, we propose an explicit likelihood model based on a novel class of normalizing ﬂows that view time-series data in the frequency-domain rather than the time-domain. The proposed ﬂow, dubbed a Fourier ﬂow, uses a discrete Fourier transform (DFT) to convert variable-length time-series with arbitrary sampling periods into ﬁxedlength spectral representations, then applies a (data-dependent) spectral ﬁlter to the frequency-transformed time-series. We show that, by virtue of the DFT analytic properties, the Jacobian determinants and inverse mapping for the Fourier ﬂow can be computed efﬁciently in linearithmic time, without imposing explicit structural constraints as in existing ﬂows such as NICE (Dinh et al. (2014)), Real NVP (Dinh et al. (2016)) and GLOW (Kingma & Dhariwal (2018)). Experiments show that Fourier ﬂows perform competitively compared to state-of-the-art baselines.

1 INTRODUCTION

Lack of access to data is a key hindrance to the development of machine learning solutions in application domains where data sharing may lead to privacy breaches (Walonoski et al. (2018)). Areas where this problem is most conspicuous include medicine, where access to (highly-sensitive) clinical data is stringently regulated by medical institutions; such strict regulations undermine scientiﬁc progress by hindering model development and reproducibility. Generative models that produce sensible and realistic synthetic data present a viable solution to this problem artiﬁcially-generated data sets produced by such models can be shared widely without privacy concerns (Buczak et al. (2010)).

In this paper, we focus on the time-series data setup, where observations are collected sequentially over arbitrary periods of time with different observation frequencies across different features. This general data setup is pervasive in the medical domain it captures the kind of data maintained in electronic health records (Shickel et al. (2017)) or collected in intensive care units (Johnson et al. (2016)). While many machine learning-based predictive models that capitalize on such data have been proposed over the past few years (Jagannatha & Yu (2016); Choi et al. (2017); Alaa & van der Schaar (2019)), much less work has been done on generative models that could emulate and synthesize these data sets.

Existing generative models for (medical) time-series are based predominantly on implicit likelihood modeling using generative adversarial networks (GANs), e.g., Recurrent Conditional GAN (RCGAN) (Esteban et al. (2017)) and Time GAN (Yoon et al. (2019)). These models apply representation learning

Published as a conference paper at ICLR 2021

via recurrent neural networks (RNN) combined with adversarial training in order to map noise sequences in a latent space to synthetic sequential data in the output space. Albeit capable of ﬂexibly learning complex representations, GAN-based models can be difﬁcult to train (Srivastava et al. (2017)), especially in the complex time-series data setup. Moreover, because they hinge on implicit likelihood modeling, GAN-based models can be hard to evaluate quantitatively due to the absence of an explicitly computable likelihood function. Finally, GANs are vulnerable to training data memorization (Nagarajan et al. (2018)) a problem that would be exacerbated in the temporal setting where memorizing only a partial segment of a medical time-series may sufﬁce to reveal a patient s identify, which defeats the original purpose of using synthetic data in the ﬁrst place.

Here, we propose an alternative explicit likelihood approach for generating time-series data based on a novel class of normalizing ﬂows which we call Fourier ﬂows. Our proposed ﬂow-based model operates on time-series data in the frequency-domain rather than the time-domain it converts variable-length time-series with varying sampling rates across different features to a ﬁxed-size spectral representation using the discrete Fourier transform (DFT), and then learns the distribution of the data in the frequency domain by applying a chain of data-dependent spectral ﬁlters to frequency-transformed time-series.

Using the convolution property of DFT (Oppenheim (1999)), we show that spectral ﬁltering of a timeseries in the frequency-domain an operation that mathematically resembles afﬁne transformations used in existing ﬂows (Dinh et al. (2016)) is equivalent to a convolutional transformation in the timedomain. This enhancement in the richness of distributions learned by our ﬂow comes at no extra computational cost: using Fast Fourier Transform (FFT) algorithms, we show that the entire steps of our ﬂow run in O(T log T) time, compared to the polynomial complexity of O(T 2) for a direct, time-domain convolutional transformation. We also show that, because the DFT is a linear transform with a Vandermonde transformation matrix, computation of its Jacobian determinant is trivial. The zero-padding and interpolation properties of DFT enables a natural handling of variable-length and inconsistently-sampled time-series. Unlike existing explicit-likelihood models for time-series data, such as deep state-space models (Krishnan et al., 2017; Alaa & van der Schaar, 2019), our model can be optimized and assessed through the exact likelihood rather than a variational lower bound.

2 PROBLEM SETUP

We consider a general temporal data setup where each instance of a (discrete) time-series comprises a sequence of vectors x = [x0, . . . , x T 1], xt X, 0 t T 1, covering a period of T time steps. We assume that each dimension in the feature vector xt is sampled with a different rate, i.e., at each time step t, the observed feature vector is xt = [ xt,1[r1], . . . , xt,D[r D] ], where rd N+ is the sampling period of feature dimension d {1, . . . , D}. That is, for a given sampling period rd, we observe a value of xt,d every rd time steps, and observe a missing value (denoted as *) otherwise, i.e.,

xt,d[rd] xt,d, t mod rd = 0, , t mod rd = 0. (1)

The data setup described above is primarily motivated by medical time-series modeling problems, wherein a patient s clinical measurements and bio-markers are collected over time at different rates (Johnson et al. (2016); Jagannatha & Yu (2016)). Despite our focus on medical data, our proposed generative modeling approach applies more generally to other applications, such as speech synthesis (Prenger et al. (2019)) and ﬁnancial data generation (Wiese et al. (2020)).

Each realization of the time-series x is drawn from a probability distribution x p(x). In order to capture variable-length time-series (common in medical problems), the length T of each sequence is also assumed to be a random variable for notational convenience, we absorb the distribution of T into p. One possible way to represent the joint distribution p(x) is through the factorization:1

p(x) = p(x0, . . . , x T 1, T) = p(T)

t=0 p(xt | x0, . . . , xt 1, T). (2)

We assume that the sampling period rd for each feature d is ﬁxed for all realizations of x. The feature space X is assumed to accommodate a mix of continuous and discrete variables on its D dimensions.

1Our proposed method is not restricted to any speciﬁc factorization of p(x).

Published as a conference paper at ICLR 2021

Key objective. Using a training data set D = {x(i)}n i=1 comprising n time-series, our goal is to (1) estimate a density function ˆp(x) that best approximates p(x), and (2) sample synthetic realizations of the time-series x from the estimated density ˆp(x). When dealing with data sets with variable lengths for the time-series, we model the distribution p(T) independently following the factorization in (2). We model p(T) using a binomial distribution. Throughout the paper, we focus on developing a ﬂow-based model for the conditional distribution p(x0, . . . , x T 1 | T).

3 PRELIMINARIES

Let z RD be a random variable with a known and tractable probability density function p(z), and let g : RD RD be an invertible and differentiable mapping with an inverse mapping f = g 1. Let x = g(z) be a transformed random variable the probability density function p(x) can be obtained using the change of variable rule as p(x) = p(z) | det J[g]| 1 = p(f(x)) | det J[f]|, where J[f] and J[g] are the Jacobian matrices of functions f and g, respectively (Durrett (2019)).

Normalizing ﬂows are compositions of M mappings that transform random draws from a predeﬁned distribution z p(z) to a desired distribution p(x). Formally, a ﬂow comprises a chain of bijective maps g = g(1) g(2) g(M) with an inverse mapping f = f (1) f (2) f (M). Using the change of variables formula described above, and applying the chain rule to the Jacobian of the composition, the log-likelihood of x can be written as (Rezende & Mohamed (2015)):

log p(x) = log p(z) +

m=1 log | det J[fm]|. (3)

Existing approaches to generative modeling with normalizing ﬂows construct composite mappings g with structural assumptions that render the computation of the Jacobian determinant in (3) viable. Examples of such structurally-constrained mappings include: Sylvester transformations, with a Jacobian corresponding to a perturbed diagonal matrix (Rezende & Mohamed (2015)), 1 1 convolutions for cross channel mixing, which exhibit a block diagonal Jacobian (Kingma & Dhariwal (2018)), and afﬁne coupling layers that correspond to triangular Jacobian matrices (Dinh et al. (2016)).

3.1 FOURIER TRANSFORM

The Fourier transform is a mathematical operation that converts a ﬁnite-length, regularly-sampled time domain signal x to its frequency domain representation X (Bracewell & Bracewell (1986)). A T-point discrete Fourier transform (DFT), denoted as X = FT {x}, transforms a (complex-valued) time-stamped sequence x {x0, . . . , x T 1} into a length-T sequence of (complex-valued) frequency components, X {X0, . . . , XT 1}, through the following operation (Oppenheim (1999)):

t=0 xt e 2πj kt

T , 1 k T 1, (4)

where j corresponds to the imaginary unit of a split-complex number. Using Euler s formula, the complex exponential terms in (4) can be expressed as e 2πj kt

T = cos 2π kt

T j sin 2π kt

T . Thus, the Fourier transform decomposes any time-series into a linear combination of sinusoidal signals of varying frequencies the resulting sequence of frequency components, X, corresponds to the coefﬁcients assigned to the different frequencies of the sinusoidal signals constituting the time domain signal x. The DFT is a key computational and conceptual tool in many practical applications involving digital signal processing and communications (Oppenheim (1999)).

3.2 FOURIER TRANSFORM PROPERTIES

We will rely in developing our model on various key properties of the DFT. These properties describe various operations on the time-domain data and their dual (equivalent) operations in the frequency domain. The DFT properties relevant to the development of our model are listed as follows: Convolution: x1 x2 X1 X2. Symmetry: If x is real-valued Xk = X k m, m N.

Even/Odd Transforms: F{Even(x)} = Re(X), F{Odd(x)} = Im(X),

Published as a conference paper at ICLR 2021

where denotes element-wise multiplication, denotes circular convolution, Re(.) and Im(.) denote the real and imaginary components, Even(x) = (x + x )/2, Odd(x) = (x x )/2, where x signiﬁes the reﬂection of x with respect to the x = 0 axis, and x = Even(x) + Odd(x). Another property that is relevant to our model is the interpolation property, which posits that zero-padding of x in the time domain corresponds to an up-sampled version of X in the frequency domain.

Table 1: The two main layers in an N-point Fourier ﬂow, their inverses, and the corresponding log-determinant Jacobian. The input to the ﬂow, x, is a D T matrix comprising a set of D time-series each of length T, and the output, Y , is a 2 D N/2 tensor comprising the ﬁltered spectral representation of x. Here, we show the application of the Fourier transform layer to a given feature dimension d the same operation is applied independently to all feature dimensions. The vector c X d is the reversed conjugate of c Xd, and H = [Hi,j]i,j. When cascading multiple Fourier ﬂows, we alternate between feeding either of the real and imaginary channels of c X, Im(c X) and Re(c X), to the Bi RNN network in the different ﬂows within the cascade.

Layer Function Inverse function log |det J|

Fourier transform xd = xd 0N T , Xd = [ c Xd, c X d ], xt,d 0, t mod rd = 0, xd = F 1 N { Xd}, log | det W| = 0. Xd = FN{ xd}, xt,d , t mod rd = 0,

c Xd = [ X0,d, . . . , X0, N/2 ]. xd = [ xd 0, . . . , xd T 1 ].

Spectral ﬁltering (log H, µ) = Bi RNN(Im(c X)), Y1, Y2 = split(Y ),

Y1 = H Re(c X) + M, (log H, µ) = Bi RNN(Y2), P

i,j log(|Hi,j|).

Y2 = Im(c X), Re(c X) = (Y1 M)/H,

Y = concat(Y1, Y2). c X = Re(c X) + j Im(c X).

4 FOURIER FLOWS

We propose a new ﬂow for time-series data, the Fourier Flow, which hinges on the frequency domain view of time-series data (explained in Section 3.1), and capitalizes on the Fourier transform properties discussed in Section 3.2. In a nutshell, a Fourier Flow comprises two steps: (a) a frequency transform layer (Section 4.1), followed by (b) a data-dependent spectral ﬁltering layer (Section 4.2). The steps involved in a Fourier Flow are summarized in Table 1 and Figure 1.

4.1 FOURIER TRANSFORM LAYER

In the ﬁrst step of the proposed ﬂow, we transform the time-series x = [ x0, . . . , x T 1 ] into its spectral representation via Fourier transform we do so by applying the DFT operation (described in Section 3.1) to each feature dimension d independently. Let xd = [ x0,d[rd], . . . , x T 1,d[rd] ] be the time-series associated with feature dimension d; the Fourier transform layer computes the N-point DFT of xd for all d {1, . . . , D} through the following three steps:

Temporal zero padding: xd = xd 0N T , xt,d[rd] = 0, t, t mod rd = 0,

Fourier Transform: Xd = FN{ xd},

Spectral cropping: c Xd = [ X0,d, . . . , X0, N/2 ]. (5)

Here, 0N T denotes a set of N T zeros, and the union denotes the padding operation that appends the zeros 0N T to the time-series xd. The temporal zero-padding step capitalizes on the frequency interpolation and sampling properties of DFT (Section 3.2) to ensure that the padded time-series xd and its frequency spectrum Xd have a ﬁxed (predetermined) length of N, irrespective of the interval length T and the sampling period rd. Because the DFT coefﬁcients are complex-valued, Xd is a tensor with dimensions 2 1 N, and the collection of Fourier transforms for the D feature dimensions, X, is a tensor with dimensions 2 D N. That is, the DFT layer converts each time-series x into a two-channel, image-like D N matrices Re( X) and Im( X) as shown in Figure 1. A ﬂow with an N-point DFT will be referred to as an N-point Fourier ﬂow in the rest of the paper. To guarantee a lossless recovery of the time-series x via inverse DFT, we ensure that N T.

Published as a conference paper at ICLR 2021

Finally, the spectral cropping step in (5) discards the N/2+1th to N th frequency components in both Re( X) and Im( X). This is because x is real-valued, hence Re( X) and Im( X) are symmetric and anti-symmetric, respectively (See Section 3.2), which renders the discarded frequency components redundant. The ﬁnal output of this layer, c X, is a 2 D N/2 tensor.

DFT computation. Since the DFT operation in (4) is linear, we can represent the frequency transform step in (5) through a linear transformation and apply DFT via matrix multiplication as follows:

Xd = W xd, W = 1

1 1 1 1 1 ω ω2 ωN 1

1 ω2 ω4 ω2(N 1) ... ... ... ... ... 1 ωN 1 ω2(N 1) ω(N 1)(N 1)

, and ω = e 2πj/N. (6)

When N is set to be a power of 2, the DFT operation in (6) can be implemented using any variant of the Fast Fourier Transform (FFT) methods (Nussbaumer (1981); Van Loan (1992)), such as the Cooley Tukey FFT algorithm (Cooley & Tukey (1965)), with a computational complexity of O(N log N).

Determinant of the DFT Jacobian. The DFT is a natural and intuitive transformation for temporal data, but how does introducing the DFT mapping affect the complexity of the Jacobian determinant of the ﬂow? To answer this question, we note that the DFT matrix W in (6) is a (square) Vandermonde matrix, thus we can evaluate the DFT Jacobian determinant in closed-form as follows:

| det(J[W]) | (a) = | det(W) | (b) = 1

1<n<m N (ωm ωn) (c) = 1, (7)

where (a) follows from the fact that the Jacobian of a linear transformation is equal to the transformation matrix W, and (b) follows from the scalar multiplication property of determinants, which posits that det(αVN N) = αN det(VN N) for a scalar α and matrix V , combined with the formula for the determinants of Vandermonde matrices (Bj orck & Pereyra (1970)). (c) follows from the fact that the magnitude of the complex exponential is | ω | = 1, and the polynomial product in (7) comprises a total of N N/2 terms, each of which is of the form | ωnωm | = 1. The result in (7) shows that the DFT mapping does not incur extra computational costs when evaluating the ﬂow likelihood in (3).

4.2 SPECTRAL FILTERING LAYER

The second layer of the Fourier ﬂow is an afﬁne coupling layer, similar to that originally introduced in (Dinh et al. (2014; 2016)), but applied to the time-series x in the frequency domain as follows:

(log H, M) = Bi RNN(Im(c X)), Y1 = H Re(c X) + M, Y2 = Im(c X), Y = concat(Y1, Y2),

where H and µ are D N matrices, Bi RNN denotes a bi-directional recurrent neural network (Schuster & Paliwal (1997)), and denotes the Hadamard (element-wise) product. Here, we split the real and imaginary channels in c X rather than splitting the feature dimensions as in (Dinh et al. (2016)). The detailed steps of the ﬂow and its inversion are given in Table 1.

The afﬁne transformation in the frequency domain can be thought of as a spectral ﬁltering operation whereby the frequency transform of the even part of the time-series, Re(b X), is applied to a ﬁlter with a transfer function H (Recall the even-odd decomposition properties of DFT in Section 3.2). The transfer function itself is data-dependent: it depends on Im(b X), or equivalently, the frequency transform of the odd part of the time-series x. The mapping from Im(b X) to the transfer function H is implemented through an RNN that shares parameters across all different frequency components, since N can grow very large for time-series with a large T. We use a bi-directional RNN since all frequency components are available at our disposal at the same time.

4.3 FOURIER FLOWS IN THE TIME-DOMAIN

How does the Fourier ﬂow mapping look like in the time domain? Using the convolution property of the DFT (Section 3.2), the time-domain view of the Fourier ﬂow can be expressed as follows

y1 = h( x) Even( x) + µ, y2 = Odd( x), y = concat(y1, y2), (8)

Published as a conference paper at ICLR 2021

(a) Fourier Transform Layer (b) Spectral Filtering Layer

Frequency index

FFT algorithm

(2 x D x N Tensor)

Spectral cropping

(1 x D x N/2 Tensor)

Spectral filter

Figure 1: Pictorial depiction of the two layers underlying a Fourier ﬂow. Here, we consider an exemplary instance of a time-series with D = 5 and T = 6, to which we apply a 10-point Fourier ﬂow (N = 10). In all displays, the x-axis corresponds to either the time or frequency indexes, the y-axis corresponds to the feature dimension, and the red dots correspond to the value associated with a given time (frequency) index in a given feature dimension. (Darker shades of red correspond to higher values.) (a) DFT layer. In this example, we assume the sampling periods to be r1 = 2, and rd = 1, d = 1. In the ﬁrst step of the ﬂow, we set all missing values in the under-sampled feature dimension to 0, and padd N T = 4 zeros to all features. (Padded zeros are depicted as white dotted circles), which results in a padded time-series x in the form of a D N matrix. Next, the DFT operation is applied to each row in x, resulting in a 2 D N tensor X comprising the real and imaginary components of the frequency transform. (b) Spectral ﬁltering layer. The imaginary component of the (cropped) spectrum is passed to a bi-directional RNN, which views the input as an N/2-length sequence of D-dimensional features. The RNN outputs a N/2 D sequence that corresponds to the spectral ﬁlter H, which is then multiplied by Re(c X). The ﬁnal output concatenates the ﬁltered spectrum, Y1, with Y2 = Im(c X).

where yi = F 1 N {Yi}, i {1, 2}, µ = F 1 N {M}, and h = F 1 N {H} is the impulse response of the spectral ﬁlter H. That is, the Fourier ﬂow in the time domain corresponds to a circular convolution between the even part of x and the inverse DFT of the ﬁlter H, which depends on the odd part of x.

The time-domain view of a Fourier ﬂow elucidates its multifaceted modeling advantages. First, there is a representational advantage of having a more expressive convolutional transformation in (8) compared to the element-wise afﬁne transformations in the coupling layers of existing methods (Dinh et al. (2014; 2016); Kingma & Dhariwal (2018)). Moreover, a Fourier Flow does not require splitting feature dimensions disjointly, but rather decomposes each feature to its even and odd components. Second, this richer representation comes with no extra computational cost: while time-domain convolution would run in O(N 2D2) time (Hunt (1971)), our spectral ﬁlter applies an O(N) element-wise afﬁne operation in the frequency domain, and the DFT runs in O(DN log N) time. Computation of the Jacobian determinant is achieved by simply adding all elements of H (Table 1) as it is the case in time-domain afﬁne coupling methods. Finally, the DFT operation enables handling of variable-length time-series and inconsistent sampling periods across features without any extra modeling efforts.

Related work. Existing explicit-likelihood models for sequential data are based predominantly on state-space modeling approaches, which assume that hidden state dynamics control the sequence of observed data. Examples of such models include hidden Markov model (Beal et al. (2002)), deep Markov models (Krishnan et al. (2017)), and attentive state-space models (Alaa & van der Schaar (2019)). The key advantage of our model compared to these is that ours is optimized and assessed using the exact likelihood rather than a variational lower bound.

Existing models based on normalizing ﬂows have been primarily focused on static (non-temporal) data (Ziegler & Rush (2019)) examples include NICE (Dinh et al., 2014), Real NVP (Dinh et al., 2016), and GLOW (Kingma & Dhariwal, 2018; Prenger et al., 2019). These ﬂows assume highly-structured transformation with easy-to-compute Jacobian determinants, e.g., diagonal, block diagonal and triangular Jacobian matrices. Fourier ﬂows assume a structured transformation in the frequency domain, which corresponds to a richer and more complex representation than existing ﬂows in time domain. In particular, a Fourier ﬂow can be thought of as the frequency-domain dual of the afﬁne transformation

Published as a conference paper at ICLR 2021

(a) Distributions of the sinusoidal frequencies recovered by the different baselines.

Frequency Frequency Frequency

True Real NVP

True Time GAN

Time GAN Real NVP

Time GAN Real NVP

Number of Epochs Frequency Time step

(b) Performance comparison between the three baselines.

NLL Histogram

Figure 2: Performance of FF, Real NVP and Time GAN in learning to generate random sinusoidal time-series with random frequencies. (a) Here, we estimate the dominant frequency of the sinusoidal signals generated by each baseline and plot the histogram for the estimated frequencies across that of the ground-truth distribution of the true frequency f. We plot a kernel density estimate of the histogram in each plot. Note that the ranges of the xand y-axes in each plot change according to the estimated distributions. (b) Assessment of the time-series data generated by the different baselines in terms of exact data likelihood, spectral properties and predictive utility.

of Real NVP, which translates to a more complex convolutional mapping in the time domain. The closest work to ours is the invertible convolutional ﬂows developed independently in (Karami et al. (2019)). This work does not deal with time-series data, but introduces a convolutional transformation on the feature dimension using Toeplitz matrices, and implements the transformation using DFT.

5 EXPERIMENTS

5.1 ILLUSTRATIVE RESULTS ON SYNTHETIC DATA

To examine the utility of our explicit-likelihood frequency-domain model, we start off by comparing Fourier ﬂows with their time-domain analogues, Real NVP ﬂows (Dinh et al. (2016)), and a state-ofthe-art implicit-likelihood method, Time GAN (Yoon et al. (2019)), in a stylized experimental setup. In particular, we consider the following time-series data generation process:

x = sin(ft + φ), φ N(0, 1), f Beta(α = 2, β = 2), t {0, . . . , T 1}, (9)

where Beta denotes a Beta distribution with shape parameters α and β, and T = 100. That is, our time-series data is a set of sinusoidal sequences with random frequency and phase, where the phase is drawn from a normal distribution and the frequency is drawn from a Beta distribution. We generate a total of 1,000 time-series based on (9) and use these to train all baselines. We compare a Fourier ﬂow (FF) model with a composition of 10 ﬂows and a (single-layer) Bi RNN with 200 hidden units with an equivalent Real NVP model with the same number of ﬂows and an Bi RNN for generating the coefﬁcients of the afﬁne transformation. For the Time GAN model, we use the hyper-parameter conﬁguration recommended in (Yoon et al. (2019)). We train all models with 1,000 iterations and a batch size of 128 we then generate 1,000 synthetic time-series from each trained model.

In Figure 2(a), we plot the histogram for the frequencies of the time-series generated by each model and compare it with the ground-truth Beta distribution of f. We estimate the frequency of each generated time-series (which are generally not perfectly sinusoidal) by ﬁnding the dominant frequency component ˆf in the DFT of each generated time-series. As we can see, the distribution of ˆf recovered by FF bears the strongest resemblance to the true Beta distribution. Moreover, we note that the implicit likelihood approach in Time GAN concentrates the probability mass on few frequencies, indicating that it memorized some of the sinusoidals in training data, which is problematic in applications where

Published as a conference paper at ICLR 2021

we would want to keep the training samples private this problem does not arise in FF where the learned distribution is smooth and does not concentrate around any speciﬁc frequencies. On the other hand, Real NVP learned a distribution with a much wider support than that of the original data.

The richness of the (frequency-domain) FF transformation compared to the (time-domain) Real NVP is evident in the lower negative log-likelihood achieved by our model (Figure 2(b), left). In Figure 2(b) (middle), we compare all baselines with respect to the power spectral density (PSD) of the 1,000 timeseries sampled from each model; the PSD is a spectral representation of a stochastic process given by PSD = F{E[ x x ]}. Figure 2(b) plots the cumulative mean absolute error (MAE) between the PSD of the original data and that of the samples generated by each model within all frequency bands [0, f], f [0, 1]. FF provides competitive accuracy in the recovered PSD compared to Time GAN (MAE: 2.83 for FF, 3.44 for Time GAN), and signiﬁcantly outperforms Real NVP (MAE: 5.12). This implies that the distribution of frequencies generated by FF accurately resembles the original data.

Finally, we evaluated the accuracy of the sampled data in the time-domain by assessing their predictive usefulness as follows: we trained a vanilla RNN model using each of the three synthetically generated samples to sequentially predict the next value in each time-series in the original data. Figure 2(c) (right) shows the MAE of the three RNN models across all time step as we can see, the RNN trained on FF-generated data consistently outperforms the baselines.

Table 2: Performance scores for all baselines.

(F-score, MAE)

Method Stocks Energy Lung cancer

Time GAN (0.938, 0.173) (0.479, 0.056) (0.732, 3.248) WGAN (0.989, 0.010) (0.985, 0.049) (0.517, 4.824) ARIMA (0.648, 0.016) (0.667, 0.037) (0.659, 4.299) Wave Net (0.441, 0.123) (0.013, 0.213) Real NVP (0.979, 0.020) (0.965, 0.053) (0.866, 5.762) FF (0.984, 0.009) (0.949, 0.039) (0.904, 3.207)

5.2 PERFORMANCE EVALUATION ON REAL DATA

We replicate the experimental setup in (Yoon et al. (2019)) to test the performance of our model on multiple time-series data sets with respect to its predictive utility, i.e., the extent to which models trained on the synthetic samples can perform well on prediction tasks in the original sample. We measure performance with respect to this criterion via a predictive score that corresponds to the accuracy of an RNN trained on the synthetic data to predict next step in real data. In addition, we measure the quality of the synthetic data generated by each baseline using the precision and recall metrics (Sajjadi et al. (2018)) averaged over all time steps, and then we combine both scores into a single F-score. We conduct experiments on: Google stocks data, UCI Energy data set, and a longitudinal follow-up clinical data set for lung cancer patients. To highlight the richness of the learned FF distributions, we also visualize the t-SNE plots for the FF samples compared to their time-domain analogues generated by Real NVP.

We compare Fourier Flows with the following baselines: Time GAN (Yoon et al. (2019)), Wasserstein GAN (WGAN) trained on time steps as distinct feature dimensions, a Autoregressive Integrated Moving-Average (ARIMA) model, Wave Net (Oord et al. (2016)) and a na ıve Real NVP benchmark with time steps treated as feature dimensions. Performance results are provided in Table 2 the results indicate that FF consistently generates high-quality synthetic time-series that are useful for predictive model training (as quantiﬁed by the predictive score). FF outperforms all methods with respect to the predictive score on Stocks and Lung cancer data sets, and its learned distribution more closely resemble the real distribution compared to the time-domain method (Figure 3 shows the t-SNE plots on Google s stock data). It is also worth mentioning that FF performs competitively with respect to the domainand task-agnostic precision and recall with respect to all data sets, which means that the FF-generated synthetic samples could reliably replace original ones in a wide variety of modeling tasks.

(a) t-SNE plot for Fourier flows.

(b) t-SNE plot for Real NVP.

Figure 3: t-SNE plots for Stock.

Published as a conference paper at ICLR 2021

6 DISCUSSION

In this paper, we introduced an explicit likelihood model based on a novel class of normalizing ﬂows that models the distributions of stochastic time-series data in the frequency-domain rather than the time-domain. Capitalizing on the Fourier transform properties, the proposed ﬂow naturally handles arbitrary sample rates and duration of time-series data, and enables learning richer representation compared with the existing methods at no extra computational cost.

Fourier ﬂows enjoy three modeling advantages that enable them to accurately model time-series distributions. First, the DFT layer compresses temporal information into a low-dimensional spectral representation, enabling a more efﬁcient distribution learning. Consider for instance the sinusoidal example in Section 5.1. In this example the data is drawn from a stochastic process sin(ft + φ) where f is a random frequency. To model length-T time-series drawn from such process using conventional methods, we would need to model a T-dimensional random variable. However, using the DFT transform, we would be modeling the spectral representation X = 2

j [δ(v + f) δ(v f)] where δ is the Dirac-delta function. Thus, the spectral representation can be fully described with one piece of information, which is the location of the frequency component f, i.e. a 1-dimensional random variable. Second, Fourier ﬂows enable construction of complex transformations at no extra cost for Jacobean computation. That is, a DFT layer followed by a simple transform (such as an afﬁne transform) would amount for a complex overall transform without any extra complications associated with the Jacobean transformation. Finally, the usage of an RNN model in the novel spectral ﬁltering layer captures the sequential nature of the spectral data.

Another key advantage for Fourier ﬂows is that they are explicit likelihood models with a tractable exact likelihood. This means that they can be trained and evaluated using the exact model likelihood instead of a variational lower bound as in most existing approaches. Moreover, explicit likelihood models are hypothesized to be superior to implicit likelihood models, such as GANs, in terms of privacy preservation, as they are less likely to memorize training points. While the theoretical generalization properties of GANs have been previously investigated Nagarajan et al. (2018), less work has been done in studying the theoretical generalization performance of ﬂows. Proving that normalizing ﬂows do not memorize data is beyond the scope of this paper as it would amount for a broader and more general result that is not limited to the time series setup. Such theoretical analysis is an interesting subject for future work.

While Fourier transforms enjoy various computational advantages, they may fall short when modeling non-stationary time-series, where we know that the distribution of each new observation is not just a function of the process history, but also the time index. Thus, investigating generalizations of our model with more general spectral transforms, such as generalized Fourier transforms or discrete Wavelet transforms are promising directions for future work.

ACKNOWLEDGMENTS

We would like to thank the reviewers for their valuable feedback. The research presented in this paper was supported by the US Ofﬁce of Naval Research (ONR), and by the National Science Foundation (NSF) under grant numbers 1407712, 1462245, 1524417, 1533983, and 1722516.

Ahmed M Alaa and Mihaela van der Schaar. Attentive state-space modeling of disease progression. In Advances in Neural Information Processing Systems, pp. 11338 11348, 2019.

Matthew J Beal, Zoubin Ghahramani, and Carl E Rasmussen. The inﬁnite hidden markov model. In Advances in neural information processing systems, pp. 577 584, 2002.

Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing ﬂows for variational inference. ar Xiv preprint ar Xiv:1803.05649, 2018.

ke Bj orck and Victor Pereyra. Solution of vandermonde systems of equations. Mathematics of computation, 24(112):893 903, 1970.

Published as a conference paper at ICLR 2021

Ronald Newbold Bracewell and Ronald N Bracewell. The Fourier transform and its applications, volume 31999. Mc Graw-Hill New York, 1986.

Anna L Buczak, Steven Babin, and Linda Moniz. Data-driven approach for creating synthetic electronic medical records. BMC medical informatics and decision making, 10(1):59, 2010.

Edward Choi, Andy Schuetz, Walter F Stewart, and Jimeng Sun. Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association, 24(2):361 370, 2017.

James W Cooley and John W Tukey. An algorithm for the machine calculation of complex fourier series. Mathematics of computation, 19(90):297 301, 1965.

Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. ar Xiv preprint ar Xiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. ar Xiv preprint ar Xiv:1605.08803, 2016.

Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline ﬂows. In Advances in Neural Information Processing Systems, pp. 7511 7522, 2019.

Rick Durrett. Probability: theory and examples, volume 49. Cambridge university press, 2019.

Crist obal Esteban, Stephanie L Hyland, and Gunnar R atsch. Real-valued (medical) time series generation with recurrent conditional gans. ar Xiv preprint ar Xiv:1706.02633, 2017.

Alex Graves. Generating sequences with recurrent neural networks. ar Xiv preprint ar Xiv:1308.0850, 2013.

Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving ﬂowbased generative models with variational dequantization and architecture design. ar Xiv preprint ar Xiv:1902.00275, 2019.

Emiel Hoogeboom, Rianne van den Berg, and Max Welling. Emerging convolutions for generative normalizing ﬂows. ar Xiv preprint ar Xiv:1901.11137, 2019.

Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive ﬂows. International Conference on Machine Learning (ICML), 2018.

B Hunt. A matrix theory proof of the discrete convolution theorem. IEEE Transactions on Audio and Electroacoustics, 19(4):285 288, 1971.

Abhyuday N Jagannatha and Hong Yu. Bidirectional rnn for medical event detection in electronic health records. In Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, volume 2016, pp. 473. NIH Public Access, 2016.

Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientiﬁc data, 3:160035, 2016.

Mahdi Karami, Dale Schuurmans, Jascha Sohl-Dickstein, Laurent Dinh, and Daniel Duckworth. Invertible convolutional ﬂow. In Advances in Neural Information Processing Systems, pp. 5635 5645, 2019.

Durk P Kingma and Prafulla Dhariwal. Glow: Generative ﬂow with invertible 1x1 convolutions. In Advances in neural information processing systems, pp. 10215 10224, 2018.

Ivan Kobyzev, Simon Prince, and Marcus Brubaker. Normalizing ﬂows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. AAAI, 2017.

Published as a conference paper at ICLR 2021

Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances in neural information processing systems, pp. 4601 4609, 2016.

Vaishnavh Nagarajan, Colin Raffel, and Ian J Goodfellow. Theoretical insights into memorization in gans. In Neural Information Processing Systems Workshop, 2018.

Henri J Nussbaumer. The fast fourier transform. In Fast Fourier Transform and Convolution Algorithms, pp. 80 111. Springer, 1981.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. ar Xiv preprint ar Xiv:1609.03499, 2016.

Alan V Oppenheim. Discrete-time signal processing. Pearson Education India, 1999.

George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive ﬂow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338 2347, 2017.

Ryan Prenger, Rafael Valle, and Bryan Catanzaro. Waveglow: A ﬂow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617 3621. IEEE, 2019.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing ﬂows. International Conference on Machine Learning (ICML), 2015.

Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 5234 5243, 2018.

Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673 2681, 1997.

Benjamin Shickel, Patrick James Tighe, Azra Bihorac, and Parisa Rashidi. Deep ehr: a survey of recent advances in deep learning techniques for electronic health record (ehr) analysis. IEEE journal of biomedical and health informatics, 22(5):1589 1604, 2017.

Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton. Veegan: Reducing mode collapse in gans using implicit variational learning. In Advances in Neural Information Processing Systems, pp. 3308 3318, 2017.

Charles Van Loan. Computational frameworks for the fast Fourier transform. SIAM, 1992.

Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, and Scott Mc Lachlan. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association, 25(3):230 238, 2018.

Magnus Wiese, Robert Knobloch, Ralf Korn, and Peter Kretschmer. Quant gans: Deep generation of ﬁnancial time series. Quantitative Finance, pp. 1 22, 2020.

Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar. Time-series generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 5508 5518, 2019.

Jinsung Yoon, Lydia N Drumright, and Mihaela Van Der Schaar. Anonymization through data synthesis using generative adversarial networks (ads-gan). IEEE Journal of Biomedical and Health Informatics, 2020.

Zachary M Ziegler and Alexander M Rush. Latent normalizing ﬂows for discrete sequences. ar Xiv preprint ar Xiv:1901.10548, 2019.