Time-series Generation by Contrastive Imitation

Daniel Jarrett (University of Cambridge, UK) daniel.jarrett@maths.cam.ac.uk
Ioana Bica (University of Oxford, UK; Alan Turing Institute, UK) ioana.bica@eng.ox.ac.uk
Mihaela van der Schaar (University of California, Los Angeles; University of Cambridge, UK; Alan Turing Institute, UK) mv472@cam.ac.uk

Abstract

Consider learning a generative model for time-series data. The sequential setting poses a unique challenge: Not only should the generator capture the conditional dynamics of (stepwise) transitions, but its open-loop rollouts should also preserve the joint distribution of (multi-step) trajectories. On one hand, autoregressive models trained by MLE allow learning and computing explicit transition distributions, but suffer from compounding error during rollouts. On the other hand, adversarial models based on GAN training alleviate such exposure bias, but transitions are implicit and hard to assess. In this work, we study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy, where the reinforcement signal is provided by a global (but stepwise-decomposable) energy model trained by contrastive estimation. At training, the two components are learned cooperatively, avoiding the instabilities typical of adversarial objectives. At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality. By expressly training a policy to imitate the sequential behavior of time-series features in a dataset, this approach embodies generation by imitation. Theoretically, we illustrate the correctness of this formulation and the consistency of the algorithm. Empirically, we evaluate its ability to generate predictively useful samples from real-world datasets, verifying that it performs at the standard of existing benchmarks.

1 Introduction

Time-series data are ubiquitous in diverse machine learning applications, such as financial, industrial, and healthcare settings. At the same time, lack of public access to data is a recurring obstacle to the development and reproducibility of research in domains where datasets are proprietary [1]. Generating synthetic but realistic time-series data is a promising solution [2], and has received increasing attention in recent years, driven by advances in deep learning and generative adversarial networks [3,4].

Owing to the fact that time-series features are generated sequentially, generative modeling in the temporal setting faces a two-pronged challenge: First, a good generator should accurately capture the conditional dynamics of stepwise transitions $p(x_t \mid x_1, \dots, x_{t-1})$; this is important, as the faithfulness of any conceivable downstream time-series analysis depends on the learned correlations across both temporal and feature dimensions. Second, however, the recursive rollouts of the generator should also respect the joint distribution of multi-step trajectories $p(x_1, \dots, x_T)$; this is equally important, as synthetic trajectories that inadvertently wander beyond the support of the original data are useless at best. Recent work falls into two main categories.
On one hand, autoregressive models trained via MLE [5] explicitly factor the distribution of trajectories into a product of conditionals $\prod_t p(x_t \mid x_1, \dots, x_{t-1})$. While this allows directly learning and computing such transitions, with finite data this is prone to compounding errors during multi-step generation, due to the discrepancy between closed-loop training (i.e. conditioned on ground-truths as inputs) and open-loop sampling (i.e. conditioned on its own previous outputs) [6]. A variety of methods have sought to counteract this problem of exposure bias, employing auxiliary techniques from curriculum learning [7,8] and adversarial domain adaptation [9,10]; however, such remedies are not without biases [11], and empirical improvements have been mixed [12-14]. On the other hand, adversarial models based on GAN training and its relatives [15-17] directly model the distribution of trajectories $p(x_1, \dots, x_T)$ [18-20]. To provide a more granular learning signal for the generator, a popular variant matches the induced distribution of sub-trajectories instead, providing stepwise feedback from the discriminator [21,22]. TimeGAN [12] is the most recent incarnation of this, and operates within a jointly optimized latent space. GAN-based approaches alleviate the risk of compounding errors, and have been applied to banking [23], sensors [24], biosignals [25], and smart grids [26]. However, the conditional dynamics are only implicitly learned, yielding no way of inspecting or assessing the quality of sampled transitions or trajectories. Moreover, the adversarial objective leads to characteristically challenging optimization, exacerbated by the temporal dimension.

Three Operations Consider a probabilistic generative model $p$ for some dataset $\mathcal{D}$. We are generally interested in performing one or more of the following operations: (1) sampling a time series $\tau \sim p$, (2) evaluating the likelihood $p(\tau)$, and (3) learning the model $p$ from a set of i.i.d. samples $\tau$. In light of the preceding, we investigate a generative framework that attempts to fulfill the following criteria: Samples should respect both the stepwise conditional distributions of features and the joint distribution of full trajectories; unlike pure MLE, we wish to avoid multi-step compounding error. Evaluating likelihoods should be possible, since generic measures of sample quality for both transitions and trajectories are often desired for sample comparison, model auditing, or bias correction [27,28]. Unlike black-box GAN discriminators, we wish that the evaluator be decoupled from any specific sampler, such that the two components can be trained non-adversarially, and thus may be more stable.

Contributions In the sequel, we explore an approach that seeks to satisfy these criteria. We first give precise treatment of the compounding error problem, thus motivating a specific trajectory-centric optimization objective from first principles (Section 2). To carry it out, we develop a general training framework and practical algorithm, along with its theoretical justification: We train a forward-looking transition policy to imitate the sequential behavior of time series using a stepwise-decomposable energy model as reinforcement, giving a method that embodies generation by imitation (Section 3). Importantly, to understand its strengths and limitations, we compare the method to existing generative models for time-series data, and relate it to imitation learning of sequential behavior (Section 4). Lastly, through experiments with application to real-world time-series datasets, we verify that it generates predictively useful samples that perform at the standard of comparable benchmarks (Section 5).
2 Synthetic Time Series

2.1 Problem Setup

We operate in the standard discrete-time setting for time series. Let feature vectors $x_t \in \mathcal{X}$ be indexed by time steps $t$, and let a full trajectory of length $T$ be denoted $\tau := (x_1, \dots, x_T) \in \mathcal{T} := \mathcal{X}^T$. Also, denote with $h_t := (x_1, \dots, x_{t-1}) \in \mathcal{H} := \cup_{t=1}^{T}\mathcal{X}^{t}$ the history prior to time $t$. For ease of exposition we shall work with trajectories of fixed length $T$, but our results trivially generalize to the case where $T$ itself is a random variable (for instance, by employing padding tokens up to some maximum length).

Consider a dataset $\mathcal{D} := \{\tau_n\}_{n=1}^{N}$ of $N$ trajectories sampled from some true source $s$. We assume the trajectories are generated sequentially by some unknown transition process $\pi_s \in \Delta(\mathcal{X})^{\mathcal{H}}$, such that features at each step $t$ are sampled as $x_t \sim \pi_s(\cdot \mid h_t)$. In addition to this stepwise conditional, denote with $\mu_s(h) := \frac{1}{T}\sum_t p(h_t = h \mid \pi_s)$ the normalized occupancy measure, i.e. the distribution of histories induced by $\pi_s$. Intuitively, this is the visitation distribution of history states encountered by a generator when navigating about the feature space $\mathcal{X}$ by rolling out policy $\pi_s$. With slight abuse of notation, we may also write $\mu_s(h, x) := \mu_s(h)\,\pi_s(x \mid h)$ to indicate the marginal distribution of transitions. Finally, let the joint distribution of full trajectories be denoted $p_s(\tau) := \prod_t \pi_s(x_t \mid h_t)$. The goal is to learn a sequential generator $\pi_\theta$ parameterized by $\theta$ using samples $\tau \sim p_s$ from $\mathcal{D}$, such that $p_\theta \approx p_s$.

Note here that we do not assume stationarity of the time-series data, nor stationarity of the transition conditionals; any influence of $t$ is implicit through the dependence of $\pi_s$ (and $\pi_\theta$) on variable-length histories. In line with recent work [14,20], for simplicity we do not consider static metadata as supplemental inputs or outputs, as these are commonly and easily incorporated via an additional conditioning layer or auxiliary generator [12,19]. Lastly, note that much recent work on sequential modeling is devoted to domain-specific, architecture-level designs for generating audio [29,30], text [31,32], and video [33,34]. In contrast, our work is closer in spirit to [12,14] in being an agnostic, framework-level study applicable to generic tabular data in any time-series setting.
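As a concrete point of reference, the following is a minimal sketch (in PyTorch) of the kind of history-conditioned Gaussian transition policy $\pi_\theta(x_t \mid h_t)$ assumed throughout: an LSTM summarizes the variable-length history, a small head outputs the parameters of the next-step distribution, and open-loop rollouts are produced by feeding samples back in as inputs. All names and sizes are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a history-conditioned Gaussian transition policy pi_theta(x_t | h_t).
import torch
import torch.nn as nn

class TransitionPolicy(nn.Module):
    def __init__(self, x_dim: int, hidden: int = 32):
        super().__init__()
        self.rnn = nn.LSTM(x_dim, hidden, batch_first=True)   # encodes h_t = (x_1, ..., x_{t-1})
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ELU(), nn.Linear(hidden, 2 * x_dim))

    def dist(self, history: torch.Tensor) -> torch.distributions.Normal:
        # history: (batch, t, x_dim); an empty history is represented by a zero "start" token
        out, _ = self.rnn(history)
        mean, log_std = self.head(out[:, -1]).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

    @torch.no_grad()
    def rollout(self, batch: int, T: int, x_dim: int) -> torch.Tensor:
        # Open-loop sampling: each step conditions on the policy's own previous outputs.
        # (The history LSTM is naively re-run each step; a real implementation would carry state.)
        xs = [torch.zeros(batch, 1, x_dim)]                    # start token standing in for h_1 = ()
        for _ in range(T):
            xs.append(self.dist(torch.cat(xs, dim=1)).sample().unsqueeze(1))
        return torch.cat(xs[1:], dim=1)                        # tau = (x_1, ..., x_T)

policy = TransitionPolicy(x_dim=5)
tau = policy.rollout(batch=4, T=24, x_dim=5)                   # synthetic trajectories, shape (4, 24, 5)
```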
Measuring Sample Quality How do we determine the quality of a sample? In specialized domains, of course, we often have prior access to task-specific metrics, such as BLEU or ROUGE scores in text generation [6,35]; then, the generator can simply be optimized for such scores via standard methods in reinforcement learning [36]. In generic time-series settings, however, the challenge is that any such metric must necessarily be task-agnostic, and access to it must necessarily come from learning. So, for any data source $s$, let us speak of some hypothetical function $f_s : \mathcal{H} \times \mathcal{X} \to [-c, c]$ with $c < \infty$, such that $f_s(h, x)$ gives the quality of any sampled transition, that is, any tuple $(h, x)$. Intuitively, we may interpret this as quantifying how typical it is for the random process to be in state $h$ and step towards $x$. Likewise, let us also speak of some function $F_s : \mathcal{T} \to [-cT, cT]$ such that $F_s(\tau)$ gives the quality of any sampled trajectory. Naturally, in time-series settings where the underlying process is causally conditioned, it is reasonable to define this as the decomposition $F_s(\tau) := \sum_t f_s(h_t, x_t)$.

Now of course, we have no access to the true $F_s$. But clearly, in learning a generative model $p_\theta$ of $p_s$, we wish that the quality of samples $\tau$ drawn from $p_\theta$ and $p_s$ be similar in expectation. More precisely:

Definition 1 (Expected Quality Difference) Let $\Delta F_s : \Theta \to [-2cT, 2cT]$ denote the expected quality difference between $p_s$ and $p_\theta$, where $\Theta$ indicates the space of parameterizations for generator $\pi_\theta$:

$$\Delta F_s(\theta) := \mathbb{E}_{\tau \sim p_s} F_s(\tau) - \mathbb{E}_{\tau \sim p_\theta} F_s(\tau) \tag{1}$$

Our objective, then, is to learn a generator $\pi_\theta$ that minimizes the expected quality difference $\Delta F_s(\theta)$. Two points bear emphasis. First, we know nothing about $F_s$ beyond it being the sequential aggregate of $f_s$. This challenge uniquely differentiates this agnostic setting from more popular media-specific applications, for which various predefined measures are readily available for supervision. Second, in addition to matching this expectation over samples, we also wish to match the variety of samples in the original data. After all, we want $p_\theta$ to mimic samples from $p_s$ of different degrees of "typicality". So we should expect to incorporate some measure of entropy, e.g. the commonly used Shannon entropy.

2.2 Matching Local Moments

Recall the apparent tradeoff between autoregressive models and adversarial models. In the spirit of the former, suppose we seek to directly learn transition conditionals via supervised learning. That is,

$$\arg\min_\theta\; \mathbb{E}_{h \sim \mu_s}\, L\big(\pi_s(\cdot \mid h), \pi_\theta(\cdot \mid h)\big) \tag{2}$$

Consider the log-likelihood loss $L(\pi_s(\cdot \mid h), \pi_\theta(\cdot \mid h)) := -\mathbb{E}_{x \sim \pi_s(\cdot \mid h)} \log \pi_\theta(x \mid h)$. In the case of exponential-family models for $\pi_\theta(\cdot \mid h)$, a basic result is that this is dual to maximizing its conditional entropy subject to the constraint on feature expectations $\mathbb{E}_{h \sim \mu_s,\, x \sim \pi_\theta(\cdot \mid h)} T(x) = \mathbb{E}_{h \sim \mu_s,\, x \sim \pi_s(\cdot \mid h)} T(x)$, where $T : \mathcal{X} \to \mathbb{R}$ is some sufficient statistic [37-39]. More generally, for deep energy-based models we have (however, recall that strong duality does not generalize to the nonlinear case; see Appendix A):

$$\arg\min_\theta\; \mathbb{E}_{h \sim \mu_s,\, x \sim \pi_\theta(\cdot \mid h)}\big[\log \pi_\theta(x \mid h)\big] \;+\; \max_{f \in \mathbb{R}^{\mathcal{H} \times \mathcal{X}}}\Big(\mathbb{E}_{h \sim \mu_s,\, x \sim \pi_s(\cdot \mid h)} f(h, x) \;-\; \mathbb{E}_{h \sim \mu_s,\, x \sim \pi_\theta(\cdot \mid h)} f(h, x)\Big) \tag{3}$$

Note that the moment-matching constraint is local, that is, at the level of individual transitions, and all conditioning is based on $h$ from $\mu_s$ alone. This is precisely the exposure bias: The objective is only ever exposed to inputs $h$ drawn from the (perfect) source distribution $\mu_s$, and is thus unaware of the endogeneity of the (imperfect) synthetic distribution $\mu_\theta$ induced by $\pi_\theta$. This is not desirable, since $\pi_\theta$ is rolled out by open-loop sampling at test time. Now, although at the global optimum the moment-matching discrepancy must be zero (i.e. the equality constraint is enforced), in practice there may be a variety of reasons why this is not perfectly achieved (e.g. error in estimating expectations, error in function approximation, error in optimization, etc.). Suppose we could bound how well we are able to enforce the moment-matching constraint; as it turns out, we cannot eliminate error compounding:

Lemma 1 Let $\max_{f \in \mathbb{R}^{\mathcal{H} \times \mathcal{X}}}\big(\mathbb{E}_{h \sim \mu_s,\, x \sim \pi_s(\cdot \mid h)} f(h, x) - \mathbb{E}_{h \sim \mu_s,\, x \sim \pi_\theta(\cdot \mid h)} f(h, x)\big) \le \epsilon$. Then $\Delta F_s(\theta) \le O(T^2 \epsilon)$. Proof. Appendix A.

This reveals the problem with modeling conditionals per se: Not all mistakes are equal. An objective like Equation 2 penalizes unrealistic transitions $(h, x)$ by treating all conditioning histories $h$ equally, regardless of how realistic $h$ is to begin with. Clearly, however, we care much less about what $x$ looks like if the current subsequence $h$ is already highly unlikely (and vice versa). Intuitively, earlier mistakes in a trajectory should weigh more: Once $\pi_\theta$ wanders into areas of $\mathcal{H}$ with low support in $\mu_s$, no amount of good transitions will bring the trajectory back to high-likelihood areas of $\mathcal{T}$ under $p_s$.
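To make the exposure bias concrete, here is a minimal sketch of the conditional-MLE (teacher-forcing) objective of Equation 2, assuming the illustrative TransitionPolicy sketched above: every conditioning history is taken from a real trajectory, so training never sees histories from the policy's own induced occupancy measure $\mu_\theta$.

```python
# Minimal sketch of the conditional-MLE (teacher forcing) loss of Equation 2.
import torch

def mle_loss(policy, real_batch: torch.Tensor) -> torch.Tensor:
    # real_batch: (batch, T, x_dim) trajectories tau ~ p_s from the dataset D
    B, T, D = real_batch.shape
    start = torch.zeros(B, 1, D)
    nll = 0.0
    for t in range(T):
        history = torch.cat([start, real_batch[:, :t]], dim=1)   # ground-truth h_t, never from mu_theta
        nll = nll - policy.dist(history).log_prob(real_batch[:, t]).sum(-1).mean()
    return nll / T
```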
2.3 Matching Global Moments

Now suppose instead that we seek to directly constrain the trajectory distribution $p_\theta$ to be similar to $p_s$:

$$\arg\min_\theta\; L(p_s, p_\theta) \tag{4}$$

Consider the Kullback-Leibler divergence $L(p_s, p_\theta) := D_{\mathrm{KL}}(p_s \,\|\, p_\theta)$. Like before, we know that in the case of exponential-family models for $p_\theta$, this is dual to maximizing its entropy subject to the constraint $\mathbb{E}_{\tau \sim p_s} T(\tau) = \mathbb{E}_{\tau \sim p_\theta} T(\tau)$, where $T : \mathcal{T} \to \mathbb{R}$ is some sufficient statistic [40]. More broadly, for deep energy-based models we have $\arg\min_\theta \big(\mathbb{E}_{\tau \sim p_\theta} \log p_\theta(\tau) + \max_{F \in \mathbb{R}^{\mathcal{T}}}(\mathbb{E}_{\tau \sim p_s} F(\tau) - \mathbb{E}_{\tau \sim p_\theta} F(\tau))\big)$ (but again, recall here that strong duality does not generalize to the nonlinear case; see Appendix A). Now, observe that by definition of the occupancy measure $\mu$, for any function $f : \mathcal{H} \times \mathcal{X} \to \mathbb{R}$ it must be the case that $\mathbb{E}_{\tau \sim p} \sum_t f(h_t, x_t) = T\, \mathbb{E}_{h \sim \mu,\, x \sim \pi(\cdot \mid h)} f(h, x)$. Therefore we may equivalently write

$$\arg\min_\theta\; \mathbb{E}_{h \sim \mu_\theta,\, x \sim \pi_\theta(\cdot \mid h)}\big[\log \pi_\theta(x \mid h)\big] \;+\; \max_{f \in \mathbb{R}^{\mathcal{H} \times \mathcal{X}}}\Big(\mathbb{E}_{h \sim \mu_s,\, x \sim \pi_s(\cdot \mid h)} f(h, x) \;-\; \mathbb{E}_{h \sim \mu_\theta,\, x \sim \pi_\theta(\cdot \mid h)} f(h, x)\Big) \tag{5}$$

Importantly, note that the moment-matching constraint is now global, that is, at the level of trajectory rollouts, and $\pi_\theta$ is now conditioned on histories $h$ drawn from its own induced occupancy measure $\mu_\theta$. There is no longer any exposure bias here: In order to respect the constraint, not only does $\pi_\theta(\cdot \mid h)$ have to be close to $\pi_s(\cdot \mid h)$ for any given $h$, but the occupancy measure $\mu_\theta$ induced by $\pi_\theta$ also has to be close to the occupancy measure $\mu_s$ induced by $\pi_s$. As it turns out, this seemingly minor difference is sufficient to mitigate compounding errors. As before, although at the global optimum the moment-matching discrepancy must be zero, in practice this may not be perfectly achieved. Now, suppose we could bound how well we are able to enforce the moment-matching constraint; we then have:

Lemma 2 Let $\max_{f \in \mathbb{R}^{\mathcal{H} \times \mathcal{X}}}\big(\mathbb{E}_{h \sim \mu_s,\, x \sim \pi_s(\cdot \mid h)} f(h, x) - \mathbb{E}_{h \sim \mu_\theta,\, x \sim \pi_\theta(\cdot \mid h)} f(h, x)\big) \le \epsilon$. Then $\Delta F_s(\theta) \le O(T \epsilon)$. Proof. Appendix A.

(Lemmas 1 and 2 are similar in spirit to results for error accumulation in imitation by behavioral cloning and by distribution matching; see Appendix A. This analogy with imitation learning is formally identified in Section 4.)

This illustrates why even transition-centric adversarial models such as [12,21] have shown promise in generating realistic trajectories [23-26]. First, unlike trajectory-centric GANs [18,19], which directly attempt to minimize some form of Equation 4, in transition-centric GANs the objective is to match the transition marginals $\mu_\theta(h, x)$ and $\mu_s(h, x)$, so the discriminator provides more granular feedback to the generator for training. At the same time, we see from Lemma 2 that matching transition marginals is already indirectly performing the sort of moment-matching that alleviates compounding error. Can we be more direct? In Section 3, we shall start by tackling Equation 5 itself. As we shall see, this endeavor gives rise to a technique that trains a conditional policy (for sampling), an energy model (for evaluation), and a non-adversarial framework (for learning), addressing our three initial criteria.
3 Generating by Imitating

First, consider the most straightforward implementation: Let us parameterize $f \in \mathbb{R}^{\mathcal{H} \times \mathcal{X}}$ as $\phi$, and begin with the primal form of Equation 5, which yields the following adversarial learning objective:

$$L(\theta, \phi) := \max_\phi \min_\theta\; \mathbb{E}_{h \sim \mu_\theta,\, x \sim \pi_\theta(\cdot \mid h)}\big[\log \pi_\theta(x \mid h)\big] + \mathbb{E}_{h \sim \mu_s,\, x \sim \pi_s(\cdot \mid h)} f_\phi(h, x) - \mathbb{E}_{h \sim \mu_\theta,\, x \sim \pi_\theta(\cdot \mid h)} f_\phi(h, x) \tag{6}$$

It is easy to see that this effectively describes variational training of the energy-based model $p_\phi(\tau) := \exp(F_\phi(\tau) - \log Z_\phi)$, where $F_\phi(\tau) := \sum_t f_\phi(h_t, x_t)$, to approximate the true $p_s(\tau)$, using samples from the variational $p_\theta$. The (outer) energy player is the maximizing agent, and the (inner) policy player is the minimizing agent. The form of this objective naturally prescribes a bilevel optimization procedure in which we perform (gradient-based) updates of $\phi$ with nested (best-response) updates of $\theta$.

3.1 Challenges of Learning

Abstractly, of course, training energy models using variational samplers is not new: Multiple works in static domains such as image modeling have investigated this approach as a means of bypassing the expense and variance of MCMC sampling [41,42]. In our setting, however, there is the additional temporal dimension: The negative energy $F_\phi(\tau)$ of any trajectory is computed as the sequential composition of stepwise qualities $f_\phi(h_t, x_t)$, and each trajectory sampled from $p_\theta$ must be generated as the sequential rollout of stepwise policies $\pi_\theta(x_t \mid h_t)$. Consider the gradient update for the energy,

$$\nabla_\phi L = \mathbb{E}_{h \sim \mu_s,\, x \sim \pi_s(\cdot \mid h)} \nabla_\phi f_\phi(h, x) - \mathbb{E}_{h \sim \mu_\theta,\, x \sim \pi_\theta(\cdot \mid h)} \nabla_\phi f_\phi(h, x) \tag{7}$$

and the inner-loop update for the policy,

$$\arg\min_\theta\; \mathbb{E}_{h \sim \mu_\theta,\, x \sim \pi_\theta(\cdot \mid h)}\big[\log \pi_\theta(x \mid h)\big] - \mathbb{E}_{h \sim \mu_\theta,\, x \sim \pi_\theta(\cdot \mid h)} f_\phi(h, x) \tag{8}$$

Note that the max-min optimization requires complete optimization within each inner update in order for the outer update to be correct. Otherwise the gradients will be biased, and there would be no guarantee that the procedure converges to anything meaningful. Yet, unlike in the static setting, for which there exists a variety of standard approximations for the inner update [41-44], here the policy update amounts to entropy-regularized reinforcement learning [45-47] using $f_\phi(h_t, x_t)$ as the reward function. Thus our first difficulty is computational: Repeatedly performing inner-loop RL is simply infeasible. Now, an obvious alternative is to dispense with complete policy optimization at each step, and instead to employ importance sampling to ensure that the gradients for the energy updates are still unbiased:

$$\nabla_\phi L = \mathbb{E}_{\tau \sim p_s} \nabla_\phi F_\phi(\tau) - \frac{1}{Z_\phi}\, \mathbb{E}_{\tau \sim p_\theta}\bigg[\frac{\exp(\sum_t f_\phi(h_t, x_t))}{\prod_t \pi_\theta(x_t \mid h_t)}\, \nabla_\phi F_\phi(\tau)\bigg] \tag{9}$$

where the partition function is computed as $Z_\phi = \mathbb{E}_{\tau \sim p_\theta}\big[\exp(\sum_t f_\phi(h_t, x_t)) / \prod_t \pi_\theta(x_t \mid h_t)\big]$, and the sampling policy $\pi_\theta$ is no longer required to be perfectly optimized with respect to $f_\phi$. Unfortunately, this strategy simply replaces the original difficulty with a statistical one: As soon as we consider time-series data of non-trivial lengths $T$, the multiplicative effect of each time step on the importance weights means the gradient estimates, albeit unbiased, will have impractically high variance [48,49].

3.2 Contrastive Imitation

We now investigate a generative framework that seeks to avoid these difficulties. The key idea is that instead of Equation 7, we shall learn $p_\phi$ by contrasting (real) positive samples $\tau \sim p_s$ and (any) negative samples $\tau \sim p_\theta$, which, as we shall see, rids us of the requirement that $\pi_\theta$ be fully optimized at each step for learning to be guaranteed. First, let us establish the notion of a structured classifier:

Definition 2 (Structured Classification) Recall the $\pi_\theta$-induced distribution $p_\theta(\tau) := \prod_t \pi_\theta(x_t \mid h_t)$. Denote with $p_\phi$ the un-normalized energy-based model such that $p_\phi(\tau) := \exp(\sum_t f_\phi(h_t, x_t))$, and let $Z_\phi$ be folded into $\phi$ as a learnable parameter. Define the structured classifier $d_{\theta,\phi} : \mathcal{T} \to [0, 1]$:

$$d_{\theta,\phi}(\tau) := \frac{\frac{1}{Z_\phi}\, p_\phi(\tau)}{\frac{1}{Z_\phi}\, p_\phi(\tau) + p_\theta(\tau)} \tag{10}$$

The idea that density estimation can be performed by logistic regression goes back at least to [50], and was formalized as negative sampling [51] and noise-contrastive estimation [52]. Structured classifiers have been studied in the context of imitation learning [53,54] by analogy with GANs. In the time-series setting, however, we shall see that this approach is equivalent to noise-contrastive estimation with an adaptive noise distribution.

Figure 1: Comparison of Time-series Generative Models. Examples of (a) a conditional MLE-based autoregressive model, T-Forcing [5] (explicit policy rollout, conditional MLE gradient); (b) a trajectory-centric GAN, C-RNN-GAN [18] (implicit generation, black-box classification, adversarial LL gradient); (c) a transition-centric GAN, TimeGAN [12] (implicit generation, black-box classification, adversarial LL gradient); and (d) our proposed technique, TimeGCI (explicit policy rollout, structured classification, non-adversarial policy gradient). See also Table 1.
That is, unlike a black-box classifier that may be arbitrarily parameterized, such as a generic discriminator $d$ in a GAN, here $d_{\theta,\phi}$ is structured in that it is modularly parameterized by the embedded energy and policy functions. Now, we shall train $\phi$ such that $d_{\theta,\phi}$ discriminates well between $\tau \sim p_s$ and $\tau \sim p_\theta$, that is, so that the output $d_{\theta,\phi}(\tau)$ represents the (posterior) probability that $\tau$ is real,

$$L_{\text{energy}}(\phi; \theta) := -\mathbb{E}_{\tau \sim p_s} \log d_{\theta,\phi}(\tau) - \mathbb{E}_{\tau \sim p_\theta} \log\big(1 - d_{\theta,\phi}(\tau)\big) \tag{11}$$

and, as before,

$$L_{\text{policy}}(\theta; \phi) := \mathbb{E}_{h \sim \mu_\theta,\, x \sim \pi_\theta(\cdot \mid h)}\big[\log \pi_\theta(x \mid h)\big] - \mathbb{E}_{h \sim \mu_\theta,\, x \sim \pi_\theta(\cdot \mid h)} f_\phi(h, x) \tag{12}$$
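For concreteness, the following is a minimal sketch of Equations 10-11 under the illustrative modules used earlier: an energy network producing the stepwise $f_\phi(h_t, x_t)$ with the log-partition folded in as a learnable parameter, the trajectory log-likelihood of the Gaussian policy, and the logistic (contrastive) loss of the structured classifier, which reduces to a sigmoid of the log-density ratio. Module and function names are assumptions for illustration, not the paper's released code.

```python
# Minimal sketch of the structured classifier (Eq. 10) and the contrastive energy loss (Eq. 11).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnergyModel(nn.Module):
    def __init__(self, x_dim: int, hidden: int = 32):
        super().__init__()
        self.rnn = nn.LSTM(x_dim, hidden, batch_first=True)
        self.f = nn.Sequential(nn.Linear(hidden + x_dim, hidden), nn.ELU(), nn.Linear(hidden, 1))
        self.log_Z = nn.Parameter(torch.zeros(1))              # log Z_phi folded into phi (Definition 2)

    def stepwise(self, tau: torch.Tensor) -> torch.Tensor:     # f_phi(h_t, x_t) for every step, (B, T)
        B, T, D = tau.shape
        h, _ = self.rnn(torch.cat([torch.zeros(B, 1, D), tau[:, :-1]], dim=1))   # encode each h_t
        return self.f(torch.cat([h, tau], dim=-1)).squeeze(-1)

    def log_p_tilde(self, tau: torch.Tensor) -> torch.Tensor:  # F_phi(tau) - log Z_phi
        return self.stepwise(tau).sum(-1) - self.log_Z

def policy_log_prob(policy, tau):                              # sum_t log pi_theta(x_t | h_t), shape (B,)
    B, T, D = tau.shape
    h, _ = policy.rnn(torch.cat([torch.zeros(B, 1, D), tau[:, :-1]], dim=1))
    mean, log_std = policy.head(h).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
    return dist.log_prob(tau).sum(dim=(-1, -2))

def energy_loss(energy, policy, real_tau, fake_tau):
    # d_{theta,phi}(tau) = sigmoid(log p_phi~(tau) - log p_theta(tau)); Eq. 11 is the logistic loss
    # on real (label 1) vs. policy-sampled (label 0) trajectories. Only phi receives gradients here.
    def logits(tau):
        return energy.log_p_tilde(tau) - policy_log_prob(policy, tau).detach()
    return F.binary_cross_entropy_with_logits(logits(real_tau), torch.ones(real_tau.shape[0])) + \
           F.binary_cross_entropy_with_logits(logits(fake_tau), torch.zeros(fake_tau.shape[0]))
```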
Why is this better? As we now show formally, each gradient update no longer requires $\theta$ to be optimal for the current value of $\phi$, nor does it require importance sampling, unlike the procedure described by Equation 6. The only requirement is that $p_\theta$ can be sampled and evaluated efficiently, e.g. using learned Gaussian policies as usual, or, should more flexibility be required, with normalizing flow-based policies. As a practical result, this means policy updates can be interleaved with energy updates, instead of being nested within a repeated inner loop. Specifically, let us establish the following results:

Proposition 3 (Global Optimality) Let $f_\phi \in \mathbb{R}^{\mathcal{H} \times \mathcal{X}}$, and let $p_\theta \in \Delta(\mathcal{T})$ be any distribution satisfying positivity: $p_s(\tau) > 0 \Rightarrow p_\theta(\tau) > 0$ (this does not require that $\pi_\theta$ be optimal for $f_\phi$). Then $L_{\text{energy}}(\phi; \theta)$ is globally minimized at $F_\phi(\cdot) - \log Z_\phi = \log p_s(\cdot)$, whence $p_\phi$ is self-normalized with unit integral. Proof. Appendix A.

This result is intuitive by analogy with noise-contrastive estimation [52,55]: $\phi$ is learnable as long as negative samples $\tau \sim p_\theta$ cover the support of the true $p_s$. The positivity condition is mild (e.g. take Gaussian policies $\pi_\theta$), and so is the realizability condition (e.g. take neural networks for $f_\phi$). Importantly, note that at optimality the classifier $d_{\theta,\phi}$ is decoupled from any specific value of $\theta$; contrast this with generic discriminators $d$ in GANs, which are only ever optimal for the current generator. Now, in practice we must approximate $p_s$ and $p_\theta$ using finite samples. In light of this, two questions are immediate: First, does the learned $\phi$ converge to the global optimum as the sample size increases? Second, what role does the quality of the policy's samples play in how $\phi$ is learned? For the former:

Proposition 4 (Asymptotic Consistency) Let $\phi^*$ denote the minimizer for $L_{\text{energy}}(\phi; \theta)$, and let $\hat\phi_M$ denote the minimizer for its finite-data approximation, that is, where the expectations over $p_s$ and $p_\theta$ are approximated by $M$ samples. Then under some mild conditions, $\hat\phi_M \to \phi^*$ (in probability) as $M$ increases. Proof. Appendix A.

Now for the second question: Clearly, if $p_\theta$ were too far from $p_s$, learning would be slow: the job would be too easy for the classifier $d_{\theta,\phi}$, and it may be able to distinguish samples via basic statistics alone. Indeed, in standard noise-contrastive estimation with a fixed noise distribution, learning is ineffective in the presence of many variables [56]. Precisely, however, that is why we continuously update the policy itself as an adaptive noise distribution: As $p_\phi$ moves closer to $p_s$, so does $p_\theta$, thus providing more challenging negative samples. (It is easy to see that minimizing Equation 12 equivalently minimizes the reverse KL divergence between $p_\phi$ and $p_\theta$.) In fact, should we insist on greedily taking each policy update to optimality, we recover a weighted version of the original max-min gradient from before:

Proposition 5 (Gradient Equality) Let $\phi_k$ be the value taken by $\phi$ after the $k$-th gradient update, and let $\theta^*_k$ denote the associated minimizer for $L_{\text{policy}}(\theta; \phi_k)$. Suppose $p_\phi$ is already normalized; then
$$\nabla_\phi L_{\text{energy}}(\phi; \theta^*_k) = -\tfrac{T}{2}\, \nabla_\phi L(\theta^*_k, \phi)$$
That is, at $\theta^*_k$ the energy gradient (of Equation 11) recovers the original gradient (from Equation 7). In the general case, suppose $p_\phi$ is un-normalized, such that $p_{\theta^*_k} = p_\phi / K_\phi$ for some constant $K_\phi$; then
$$\nabla_\phi L_{\text{energy}}(\phi; \theta^*_k) = \frac{T K_\phi}{K_\phi + 1}\, \mathbb{E}_{h \sim \mu_{\theta^*_k},\, x \sim \pi_{\theta^*_k}(\cdot \mid h)} \nabla_\phi f_\phi(h, x) \;-\; \frac{T}{K_\phi + 1}\, \mathbb{E}_{h \sim \mu_s,\, x \sim \pi_s(\cdot \mid h)} \nabla_\phi f_\phi(h, x)$$
Proof. Appendix A.

This weighting is intuitive: If $p_\phi$ were un-normalized such that $K_\phi > 1$, the energy loss automatically places higher weights on negative samples $h \sim \mu_{\theta^*_k}, x \sim \pi_{\theta^*_k}(\cdot \mid h)$ to bring it down; conversely, if $p_\phi$ were un-normalized such that $K_\phi < 1$, the energy loss places higher weights on positive samples $h \sim \mu_s, x \sim \pi_s(\cdot \mid h)$ to bring it up. (If $p_\phi$ were normalized, then $K_\phi = 1$ and the weights are equal.) In sum, we have arrived at a framework that learns an explicit sampling policy without exposure bias and a decoupled energy model without nested or saddle-point optimization, and that is self-normalizing without importance sampling or estimating the partition function. Figure 1 gives a representative comparison.

3.3 Optimization Algorithm

Algorithm 1: Time-series Generation by Contrastive Imitation (details in Appendix B)
1: Input: source dataset $\mathcal{D} \sim p_s$, mini-batch size $M$, regularization coefficient $\kappa$, learning rates $\lambda$
2: Initialize: replay buffer $\mathcal{B}$, energy parameter $\phi$, policy parameter $\theta$, critic parameter $\psi$
3: for each iteration do
4:   for each policy rollout do
5:     $\mathcal{B} \leftarrow \mathcal{B} \cup \{\tau \sim p_\theta\}$  (generate sample)
6:   for each gradient step do
7:     $\theta \leftarrow \theta - \lambda_{\text{actor}} \nabla_\theta \big(L_{\text{actor}}(\theta; \phi, \psi) + \kappa\, L_{\text{mle}}(\theta)\big)$  (update policy)
8:     $\phi \leftarrow \phi - \lambda_{\text{energy}} \nabla_\phi L_{\text{energy}}(\phi; \theta)$  (update energy)
9:     $\psi \leftarrow \psi - \lambda_{\text{critic}} \nabla_\psi L_{\text{critic}}(\psi; \phi)$  (update critic)
10: Output: learned policy parameter $\theta$ and energy parameter $\phi$

The only remaining choice is the method of policy optimization. Here we employ soft actor-critic [57], although in principle any technique will do; the only requirement is that it performs reinforcement learning with entropy regularization [45-47]. To optimize the policy per Equation 12, in addition to the policy actor itself, this trains a critic to estimate value functions. As usual, the actor takes soft policy improvement steps, minimizing $L_{\text{actor}}(\theta; \phi, \psi) := \mathbb{E}_{h \sim \mathcal{B}}\, \mathbb{E}_{x \sim \pi_\theta(\cdot \mid h)}[\log \pi_\theta(x \mid h) - Q_\psi(h, x)]$, where $Q_\psi : \mathcal{H} \times \mathcal{X} \to \mathbb{R}$ is the transition-wise soft value function parameterized by $\psi$, and $\mathcal{B}$ is a replay buffer of samples generated by $\pi_\theta$. For stability, the actor is regularized with the conditional MLE loss $L_{\text{mle}}(\theta) := -\mathbb{E}_{x \sim \pi_s(\cdot \mid h)} \log \pi_\theta(x \mid h)$. The critic is trained to minimize the soft Bellman residual $L_{\text{critic}}(\psi; \phi) := \mathbb{E}_{(h, x) \sim \mathcal{B}}\big(Q_\psi(h, x) - f_\phi(h, x) - V_\psi(h')\big)^2$, where the state-values are bootstrapped as $V_\psi(h') := \mathbb{E}_{x' \sim \pi_\theta(\cdot \mid h')}[Q_\psi(h', x') - \log \pi_\theta(x' \mid h')]$. By expressly training an imitation policy to mimic time-series behavior using rewards from an energy model trained by contrastive learning, we call this framework Time-series Generation by Contrastive Imitation (TimeGCI): see Algorithm 1.
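As a point of reference, here is a minimal sketch of the critic and actor updates just described, reusing the illustrative TransitionPolicy and EnergyModel from the earlier sketches. It omits target networks, the MLE regularizer, and other stabilizers used in practice, so it is an assumption-laden illustration of the update rules rather than the paper's implementation.

```python
# Minimal sketch of the soft actor-critic losses of Section 3.3 (no target networks, no stabilizers).
import torch
import torch.nn as nn

def shift(tau):                                              # build histories h_t by prepending a start token
    B, _, D = tau.shape
    return torch.cat([torch.zeros(B, 1, D), tau[:, :-1]], dim=1)

class Critic(nn.Module):
    def __init__(self, x_dim: int, hidden: int = 32):
        super().__init__()
        self.rnn = nn.LSTM(x_dim, hidden, batch_first=True)
        self.q = nn.Sequential(nn.Linear(hidden + x_dim, hidden), nn.ELU(), nn.Linear(hidden, 1))

    def values(self, tau, x):                                # Q_psi(h_t, x_t) for every step, shape (B, T)
        h, _ = self.rnn(shift(tau))
        return self.q(torch.cat([h, x], dim=-1)).squeeze(-1)

def step_dists(policy, tau):                                 # pi_theta(. | h_t) for every step of tau
    h, _ = policy.rnn(shift(tau))
    mean, log_std = policy.head(h).chunk(2, dim=-1)
    return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

def actor_loss(policy, critic, tau):                         # tau: trajectories drawn from the replay buffer B
    dist = step_dists(policy, tau)
    x = dist.rsample()                                       # reparameterized x ~ pi_theta(. | h_t)
    log_pi = dist.log_prob(x).sum(-1)
    return (log_pi - critic.values(tau, x)).mean()           # soft policy improvement (only theta is stepped)

def critic_loss(critic, policy, energy, tau):
    q = critic.values(tau, tau)                              # Q_psi(h_t, x_t) on buffer transitions
    with torch.no_grad():
        dist = step_dists(policy, tau)
        x_new = dist.sample()
        v = critic.values(tau, x_new) - dist.log_prob(x_new).sum(-1)          # V_psi(h_t) for every t
        v_next = torch.cat([v[:, 1:], torch.zeros(v.shape[0], 1)], dim=1)     # V_psi(h_{t+1}), 0 at the end
        target = energy.stepwise(tau) + v_next               # reward f_phi(h_t, x_t) + bootstrap value
    return ((q - target) ** 2).mean()                        # soft Bellman residual
```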
4 Discussion

Our theoretical motivations are apparent (Sections 2.2-3.1), and the practical mechanics of optimization are straightforward (Sections 3.2-3.3). To understand the strengths and limitations of TimeGCI, two questions remain: First, how does this relate to bread-and-butter imitation learning of sequential decision-making? Second, how does this compare with recent deep generative models for time series?

Imitation Perspective In sequential decision-making, imitation learning deals with training a policy purely on the basis of demonstrated behavior, that is, with no knowledge of the reward signals that induced the behavior in the first place [58-60]. Consider the standard Markov decision process setting, with states $z \in \mathcal{Z}$, actions $u \in \mathcal{U}$, dynamics $\omega \in \Delta(\mathcal{Z})^{\mathcal{Z} \times \mathcal{U}}$, and rewards $\rho \in \mathbb{R}^{\mathcal{Z} \times \mathcal{U}}$. Classically, imitation learning seeks to minimize the regret $R_s(\theta) := \mathbb{E}_{\pi_s}[\sum_t \rho(z_t, u_t)] - \mathbb{E}_{\pi_\theta}[\sum_t \rho(z_t, u_t)]$, with $\pi_s, \pi_\theta \in \Delta(\mathcal{U})^{\mathcal{Z}}$ here being the demonstrator and imitator policies, and expectations taken over episodes generated per $u_t \sim \pi(\cdot \mid z_t)$ and $z_{t+1} \sim \omega(\cdot \mid z_t, u_t)$ [61,62]. First, observe that by interpreting $h$ as "states" and $x$ as "actions", our problem setup bears a precise resemblance to imitation learning:

Corollary 6 (Generation as Imitation) Let state space $\mathcal{Z} := \mathcal{H}$, action space $\mathcal{U} := \mathcal{X}$, and reward function $\rho := f_s$. In addition, let the dynamics be such that $\omega(\cdot \mid h_t, x_t)$ is the Dirac delta centered at $h_{t+1} := (x_1, \dots, x_t)$. Then the regret exactly corresponds to the expected quality difference: $R_s = \Delta F_s$. Proof. Immediate from Definition 1.

Now, since we want low regret but have no knowledge of the true quality measure (i.e. "reward signal"), we may naturally learn it together.
In this sense, TimeGCI is analogous to imitation by inverse reinforcement learning (IRL), which seeks to infer rewards that plausibly induced the demonstrated behavior, and to optimize imitating policies on that basis [63-66]. Further, in simultaneously optimizing for variety (cf. entropy) and typicality (cf. energy), TimeGCI is analogous to maximum-entropy IRL [67,68]. Our contrastive approach also bears mild resemblance to stepwise discriminators studied in this vein [54,69], although our framework focuses on trajectory-wise modeling and is not adversarial (see Appendix D for more discussion on how TimeGCI relates to popular imitation learning methods). There are also crucial differences: In imitation learning, dynamics are generally Markovian; states are readily defined as discrete elements or real vectors, and action spaces are small or discrete. The practical challenge is sample efficiency, to reduce the cost of environment interactions [70,71]. In time-series generation, however, rollouts are free: generating a synthetic trajectory does not require interacting with the real world. But dynamics are never Markovian: The practical challenge is that representations of variable-length histories must be jointly learned. Moreover, actions are the full-dimensional feature vectors themselves, which renders policy optimization more demanding than usual (see Appendix B); beyond the tractable tabular settings we experiment in, higher-dimensional data may prove challenging.

Table 1: Comparison of Time-series Generative Models. Examples of conditional MLE-based autoregressive models, trajectory-centric GANs, transition-centric GANs, as well as our proposed technique. See also Figure 1.

| Type | Examples | Optimization Objective(s) | Generator Signal | Discrim. Signal | No Exposure Bias | Decoupled Discrim. | Non-Adversarial | Explicit Policy | Explicit Energy |
|---|---|---|---|---|---|---|---|---|---|
| Condit. MLE | T-Forcing [5] | Data LL | Stepwise | (N/A) | | (N/A) | | | |
| | Z-Forcing [13] | Data LL (ELBO) | Stepwise | (N/A) | | (N/A) | | | |
| | P-Forcing [10] | Data LL + Class. LL ($p_\theta$ v. $p_\theta$) | Stepwise | Global | | | | | |
| Traject. GAN | C-RNN-GAN [18] | Classification LL ($p_\theta$ v. $p_s$) | Global | Global | | | | | |
| | DoppelGANger [19] | Classification LL ($p_\theta$ v. $p_s$) | Global | Global | | | | | |
| | COT-GAN [20] | Sinkhorn Divergence ($p_\theta$ v. $p_s$) | Global | Global | | | | | |
| Transit. GAN | RC-GAN [21] | Classification LL ($\mu_\theta$ v. $\mu_s$) | Stepwise | Stepwise | | | | | |
| | T-CGAN [22] | Classification LL ($\mu_\theta$ v. $\mu_s$) | Stepwise | Stepwise | | | | | |
| | TimeGAN [12] | Class. LL ($\mu_\theta$ v. $\mu_s$) + Data LL | Stepwise | Stepwise | | | | | |
| (Ours) | TimeGCI | Discrim.: Class. LL ($p_\theta$ v. $p_s$); Generator: Policy Optimization | Stepwise | Global | | | | | |

Related Work Table 1 summarizes the key differentiators of TimeGCI from prevailing techniques. As discussed in Section 1, MLE-based autoregressive models [5,10,13] are easy to optimize, and learn explicit conditional distributions that can be used for inspection, resampling, or uncertainty estimation, but they suffer from exposure bias [8,11,12]. GAN-based adversarial models fall into two camps: Trajectory-centric methods [18,19,72], with only sequence-level signals to guide the generator, often struggle to converge to the adversarial objective without extensive tuning [12], with the exception of [72], which utilizes Sinkhorn divergences instead. Transition-centric methods [12,21,22] provide more granular signals to guide the generator, but this simply alters the objective of learning $p_s$ to one of learning $\mu_s$, and still inherits the disadvantages of implicit, adversarial learning. Our analysis is built on ideas from energy-based models (EBMs) [73-75] and reinforcement learning for sequence prediction [76-78]. In particular, our initial formulation (Section 3.1) can be viewed as a temporal extension of variational EBMs [41,42]. Moreover, by adaptively learning $\pi_\theta$ to give negative samples for $d_{\theta,\phi}$, the formulation we study (Section 3.2) is equivalent to a temporal analogue of noise-contrastive estimation (NCE) [55,79]. More tangentially, conditional EBMs have been trained with NCE for text generation [80-82], and the strength of global normalization has been studied [83]; that said, these are confined to the case where external input tokens are available for conditioning at each step, and are not free-running as in our time-series setting. Finally, note that viewing sequence generation as a decision-making problem is present in language modeling [6,35], where task-specific metrics are available as signals.
In the absence of predefined signals, GAN-based methods that jointly train discriminators to provide rewards for imitation have been studied [84-89], although they are adversarial, and all focus on the special case of generating discrete tokens for language modeling.

5 Experiments

Benchmarks We test Algorithm 1 (TimeGCI) against the following: The classic Teacher Forcing trains autoregressive networks using ground-truth conditioning (T-Forcing) [5]. Professor Forcing uses adversarial domain adaptation, training an auxiliary discriminator to encourage the dynamics of the network's free-running and teacher-forced states to be similar (P-Forcing) [10]. Trajectory-centric recurrent GANs (C-RNN-GAN) directly plug RNNs into the GAN framework as generators and discriminators for full sequences [18]. Causal Optimal Transport GAN (COT-GAN) is the latest variant of this [20], proposing to approximate Sinkhorn divergences instead of the standard JS divergence. For transition-centric recurrent GANs (RC-GAN), the adversarial loss is computed as the sum of log-likelihoods for the stepwise feature vectors conditioned on histories [21], instead of directly as the log-likelihood for the entire sequence. Finally, Time-series GAN (TimeGAN) is its latest incarnation [12], proposing to generate and discriminate within a jointly optimized embedding space for efficiency.

Datasets We employ five tabular time-series datasets with a variety of different characteristics, such as periodicity, noise level, and correlations: First, we use a synthetic dataset of multivariate sinusoids with different frequencies and phases (Sines) [12]. Second, we use a UCI dataset from the monitored energy usage of household appliances in a low-energy house (Energy) [90]. Third, we use a UCI dataset from temperature-modulated semiconductor gas sensors for chemical detection (Gas) [91]. Fourth, we use a UCI dataset of hourly interstate vehicle volume at a state traffic recording station (Metro) [92]. Fifth, we use a medical dataset of intensive-care patients from the Medical Information Mart for Intensive Care (MIMIC-III) [93]. All datasets are accessible from their sources, and we use the original source code by [12] for preprocessing Sines and the UCI datasets, publicly available at [94]. Table 2 shows summary statistics for the datasets used in the experiments.

Table 2: Summary Statistics for Datasets Used.

| Dataset | Dimension | Length | Autocorr. | +3 Lag | +5 Lag |
|---|---|---|---|---|---|
| Sines | 5 | 24 | 0.875 | 0.623 | 0.377 |
| Metro | 9 | 24 | 0.429 | 0.200 | 0.029 |
| Gas | 20 | 24 | 0.656 | 0.382 | 0.170 |
| Energy | 29 | 24 | 0.702 | 0.411 | 0.176 |
| MIMIC-III | 52 | 24 | 0.532 | 0.212 | 0.059 |

Implementation Experiments for each dataset are arranged as follows: The real trajectories that constitute the original dataset $\mathcal{D}$ are fed as input to train all algorithms. Each algorithm is subsequently used in test mode to generate 10,000 synthetic trajectories. Then, the performance of each algorithm is evaluated on the basis of these generated trajectories. This process is then performed for a total of 10 repetitions, from which we compile the means and standard errors for each reported result. For fair comparison, analogous network components across all benchmarks share the same recurrent architecture: Wherever a generator, policy, discriminator, energy, or critic network applies, we use LSTMs with one hidden layer of 32 units to compute hidden states for representing histories $h$, and two fully-connected hidden layers of 32 units each with ELU activations to compute task-specific output variables (i.e. the generator output, policy parameters, discriminator output, energy functions, or critic values). In other respects, we use the publicly available source code to construct the benchmark algorithms, accessible at [94-98]. See Appendix C for additional detail on hyperparameters and implementations.
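As an illustration of this shared architecture, a minimal sketch of the common backbone (one 32-unit LSTM layer plus two 32-unit ELU layers) might look as follows; this is an assumed instantiation for illustration, not the benchmarks' released code.

```python
# Minimal sketch of the shared backbone: 32-unit LSTM history encoder + two 32-unit ELU layers.
import torch.nn as nn

def backbone(x_dim: int, out_dim: int, hidden: int = 32):
    rnn = nn.LSTM(x_dim, hidden, num_layers=1, batch_first=True)       # history representation of h
    head = nn.Sequential(nn.Linear(hidden, hidden), nn.ELU(),
                         nn.Linear(hidden, hidden), nn.ELU(),
                         nn.Linear(hidden, out_dim))                    # task-specific outputs
    return rnn, head

policy_net = backbone(x_dim=5, out_dim=2 * 5)   # e.g. mean and log-std of a Gaussian policy head
```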
Evaluation and Results In the tabular data setting, assessing synthetic data generation is inherently tricky [27,99,100]: Unlike in media-specific applications, we have no predefined measures such as music polyphony or BLEU scores, nor can we use human evaluation of realism as is done for videos. For tabular time series, the generally accepted standard for comparing synthetic data is the Train-on-Synthetic, Test-on-Real (TSTR) framework, first proposed by [21] and employed by most recent work in synthetic time-series generation [12,14,21,22,26,101], as well as more generally for tabular synthetic data of any kind [99,100,102]. Specifically, we apply the performance measure used by [12,14,101] to quantify how much the synthetic sequences inherit the predictive characteristics of the original dataset (Predictive Score): Using synthetic samples, a generic post-hoc sequence-prediction model is learned to forecast next-step feature vectors over training sequences. Then, the trained model is evaluated on the original data, and its predictive performance is quantified in terms of the mean absolute error. We use the original source code for computing this metric, publicly available at [94]. Further to prior works using this measure, we additionally believe that synthetic data evaluation should be more general than just next-step TSTR forecasting. After all, the distinguishing characteristic of sequential (vs. static) data generation is that we care about the evolution of features over time. Hence we also compute TSTR metrics for horizons of other lengths (+3 Steps Ahead and +5 Steps Ahead). Importantly, note that a key strength of TSTR evaluation is its sensitivity to mode collapse: If any generation scheme suffers from mode collapse (as GAN methods are prone to), TSTR scores would degrade due to the synthetic data failing to capture the diversity of the real data (which means any prediction model trained on that basis would also fail to capture this variation). Finally, similar to some recent works [19,20], we also compute the cross-correlations of real and synthetic feature vectors, and report the sum of the absolute differences between them, averaged over time (x-Corr. Score); this serves to verify whether feature relationships are preserved well, in addition to temporal relationships.
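As an illustration of the protocol, a minimal sketch of the (next-step) Predictive Score might look as follows, with a small LSTM regressor standing in for the generic post-hoc sequence-prediction model. The reported metric is computed with the original TimeGAN evaluation code [94], so this is only an illustration of the TSTR idea: train on synthetic, report MAE on real.

```python
# Minimal sketch of the TSTR Predictive Score: train a next-step forecaster on synthetic data,
# then report its mean absolute error on the real data.
import torch
import torch.nn as nn

def predictive_score(synthetic, real, epochs=50, hidden=32):
    # synthetic, real: tensors of shape (N, T, D); train on synthetic, test on real (TSTR)
    D = synthetic.shape[-1]
    rnn, head = nn.LSTM(D, hidden, batch_first=True), nn.Linear(hidden, D)
    opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)
    for _ in range(epochs):
        out, _ = rnn(synthetic[:, :-1])                      # predict x_{t+1} from (x_1, ..., x_t)
        loss = (head(out) - synthetic[:, 1:]).abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        out, _ = rnn(real[:, :-1])
        return (head(out) - real[:, 1:]).abs().mean().item() # MAE on the original (real) data
```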
Table 3: Performance Comparison of TimeGCI and Benchmarks. Bold numbers indicate best-performing results.

| Benchmark | Metric | Sines | Energy | Gas | Metro | MIMIC-III |
|---|---|---|---|---|---|---|
| | Predictive Score | 0.108 ± 0.002 | 0.310 ± 0.001 | 0.035 ± 0.003 | 0.242 ± 0.001 | 0.017 ± 0.001 |
| | +3 Steps Ahead | 0.115 ± 0.001 | 0.281 ± 0.001 | 0.080 ± 0.001 | 0.244 ± 0.001 | 0.024 ± 0.007 |
| | +5 Steps Ahead | 0.122 ± 0.003 | 0.270 ± 0.002 | 0.111 ± 0.001 | 0.248 ± 0.001 | 0.018 ± 0.003 |
| | x-Corr. Score | 8.369 ± 0.015 | 194.1 ± 0.043 | 150.8 ± 0.067 | 4.222 ± 0.013 | 400.9 ± 3.203 |
| | Predictive Score | 0.105 ± 0.001 | 0.303 ± 0.002 | 0.037 ± 0.001 | 0.241 ± 0.001 | 0.023 ± 0.006 |
| | +3 Steps Ahead | 0.110 ± 0.001 | 0.268 ± 0.002 | 0.086 ± 0.002 | 0.241 ± 0.001 | 0.018 ± 0.001 |
| | +5 Steps Ahead | 0.115 ± 0.001 | 0.259 ± 0.002 | 0.121 ± 0.002 | 0.242 ± 0.001 | 0.017 ± 0.001 |
| | x-Corr. Score | 8.156 ± 0.010 | 207.6 ± 0.057 | 150.5 ± 0.023 | 3.014 ± 0.006 | 346.6 ± 2.901 |
| | Predictive Score | 0.751 ± 0.001 | 0.500 ± 0.001 | 0.242 ± 0.001 | 0.419 ± 0.005 | 0.019 ± 0.001 |
| | +3 Steps Ahead | 0.769 ± 0.001 | 0.500 ± 0.001 | 0.243 ± 0.001 | 0.416 ± 0.002 | 0.020 ± 0.001 |
| | +5 Steps Ahead | 0.786 ± 0.001 | 0.501 ± 0.001 | 0.241 ± 0.001 | 0.416 ± 0.003 | 0.019 ± 0.001 |
| | x-Corr. Score | 10.76 ± 0.012 | 644.2 ± 0.112 | 266.4 ± 0.008 | 18.39 ± 0.003 | 1720. ± 0.339 |
| | Predictive Score | 0.099 ± 0.001 | 0.259 ± 0.001 | 0.022 ± 0.001 | 0.245 ± 0.001 | 0.014 ± 0.001 |
| | +3 Steps Ahead | 0.109 ± 0.001 | 0.261 ± 0.001 | 0.050 ± 0.001 | 0.246 ± 0.001 | 0.013 ± 0.001 |
| | +5 Steps Ahead | 0.110 ± 0.001 | 0.262 ± 0.001 | 0.072 ± 0.001 | 0.245 ± 0.001 | 0.013 ± 0.001 |
| | x-Corr. Score | 3.114 ± 0.038 | 67.93 ± 0.227 | 25.56 ± 0.156 | 3.055 ± 0.013 | 497.7 ± 2.581 |
| | Predictive Score | 0.751 ± 0.001 | 0.498 ± 0.001 | 0.243 ± 0.001 | 0.412 ± 0.003 | 0.019 ± 0.001 |
| | +3 Steps Ahead | 0.770 ± 0.001 | 0.500 ± 0.001 | 0.244 ± 0.001 | 0.415 ± 0.004 | 0.019 ± 0.001 |
| | +5 Steps Ahead | 0.786 ± 0.001 | 0.499 ± 0.001 | 0.243 ± 0.001 | 0.418 ± 0.004 | 0.018 ± 0.001 |
| | x-Corr. Score | 5.649 ± 0.012 | 582.3 ± 0.047 | 231.2 ± 0.003 | 19.77 ± 0.001 | 1592. ± 0.192 |
| | Predictive Score | 0.196 ± 0.006 | 0.261 ± 0.001 | 0.264 ± 0.011 | 0.245 ± 0.002 | 0.502 ± 0.023 |
| | +3 Steps Ahead | 0.223 ± 0.006 | 0.263 ± 0.001 | 0.251 ± 0.014 | 0.243 ± 0.001 | 0.484 ± 0.021 |
| | +5 Steps Ahead | 0.246 ± 0.005 | 0.262 ± 0.005 | 0.252 ± 0.012 | 0.242 ± 0.001 | 0.453 ± 0.020 |
| | x-Corr. Score | 17.86 ± 0.001 | 667.5 ± 0.001 | 282.5 ± 0.001 | 17.11 ± 0.001 | 2140. ± 0.010 |
| TimeGCI (Ours) | Predictive Score | 0.097 ± 0.001 | 0.251 ± 0.001 | 0.018 ± 0.000 | 0.239 ± 0.001 | 0.002 ± 0.000 |
| | +3 Steps Ahead | 0.104 ± 0.001 | 0.251 ± 0.001 | 0.042 ± 0.001 | 0.239 ± 0.001 | 0.001 ± 0.000 |
| | +5 Steps Ahead | 0.109 ± 0.001 | 0.251 ± 0.001 | 0.067 ± 0.001 | 0.239 ± 0.001 | 0.001 ± 0.000 |
| | x-Corr. Score | 1.195 ± 0.011 | 105.2 ± 0.433 | 47.91 ± 0.811 | 0.738 ± 0.019 | 194.3 ± 0.180 |

Table 3 shows the results: With respect to these metrics, we find that TimeGCI somewhat consistently produces synthetic samples that perform similarly to or better than benchmark algorithms across all datasets. (Note that we do empirically observe several instances of mode collapse in GAN-based benchmarks.)

6 Conclusion

In this work, we invite an explicit analogy between time-series generation and imitation learning, and explore a framework that fleshes out this connection. Two caveats are in order: First, while we began from the notion of moment-matching to address the error-compounding problem, in practice there is no guarantee that this is accomplished well during optimization. In particular, scalability is a major limitation beyond the range of feature dimensions and sequence lengths considered in our experiments. Sample-based estimates could rapidly degrade with the horizon, especially if transitions are highly stochastic. A relevant question is whether or not training on fixed subsequence lengths could potentially alleviate this concern for longer sequences. In addition, while our approach seeks to dispense with the instabilities typical of adversarial training, we are instead left with the difficulties of policy optimization, which may prove a prohibitive challenge in higher-dimensional feature spaces. For the datasets we consider, we find that pre-training and regularizing the policy with maximum likelihood, combined with a small enough learning rate, had the most impact in promoting stability and learning. Second, we reiterate that a perennial challenge in modeling tabular data is choosing the metric for evaluation. While we opted for the most commonly accepted method of TSTR, this may not be general enough to capture the range of downstream tasks that may be performed on the synthetic data. Future work will benefit from a deeper investigation into more sophisticated measures for time series, such as contrastive methods, and into how to evaluate different aspects of the quality of the generated trajectories.

Acknowledgments We would like to thank the reviewers for all their invaluable feedback.
This work was supported by Alzheimer s Research UK, The Alan Turing Institute under the EPSRC grant EP/N510129/1, the US Office of Naval Research, as well as the National Science Foundation under grant number 1722516. [1] Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, and Scott Mc Lachlan. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association, 2018. [2] Anna L Buczak, Steven Babin, and Linda Moniz. Data-driven approach for creating synthetic electronic medical records. BMC medical informatics and decision making, 2010. [3] Saloni Dash, Andrew Yale, Isabelle Guyon, and Kristin P Bennett. Medical time-series data generation using generative adversarial networks. International Conference on Artificial Intelligence in Medicine (AIME), 2020. [4] James Jordon, Daniel Jarrett, Evgeny Saveliev, Jinsung Yoon, Paul Elbers, Patrick Thoral, Ari Ercole, Cheng Zhang, Danielle Belgrave, and Mihaela van der Schaar. Hide-and-seek privacy challenge: Synthetic data generation vs. patient re-identification. Neur IPS 2020 Competition and Demonstration Track, 2021. [5] Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1989. [6] Marc Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. International Conference on Learning Representations (ICLR), 2016. [7] Yoshua Bengio and Paolo Frasconi. An input output hmm architecture. Advances in neural information processing systems (Neur IPS), 1995. [8] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. International Conference on Machine Learning (ICML), 2009. [9] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research (JMLR), 2016. [10] Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. Advances in Neural Information Processing Systems (Neur IPS), 2016. [11] Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? International Conference on Learning Representations (ICLR), 2016. [12] Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar. Time-series generative adversarial networks. Advances in Neural Information Processing Systems (Neur IPS), 2019. [13] Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. Z-forcing: Training stochastic recurrent networks. Advances in Neural Information Processing Systems (Neur IPS), 2017. [14] Ahmed M Alaa, Alex J Chan, and Mihaela van der Schaar. Generative time-series modeling with fourier flows. International Conference on Learning Representations (ICLR), 2020. [15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672 2680, 2014. [16] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. ar Xiv preprint, 2014. [17] Aude Genevay, Gabriel Peyré, and Marco Cuturi. 
Learning generative models with sinkhorn divergences. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018. [18] Olof Mogren. C-rnn-gan: Continuous recurrent neural networks with adversarial training. Advances in Neural Information Processing Systems (Neur IPS), 2016. [19] Zinan Lin, Alankar Jain, Chen Wang, Giulia Fanti, and Vyas Sekar. Generating high-fidelity, synthetic time series datasets with doppelganger. ACM Internet Measurement Conference (IMC), 2019. [20] Tianlin Xu, Li K Wenliang, Michael Munn, and Beatrice Acciaio. Cot-gan: Generating sequential data via causal optimal transport. Advances in Neural Information Processing Systems (Neur IPS), 2020. [21] Cristóbal Esteban, Stephanie L Hyland, and Gunnar Rätsch. Real-valued (medical) time series generation with recurrent conditional gans. ar Xiv preprint, 2017. [22] Giorgia Ramponi, Pavlos Protopapas, Marco Brambilla, and Ryan Janssen. T-cgan: Conditional generative adversarial network for data augmentation in noisy time series with irregular sampling. ar Xiv preprint, 2018. [23] Luca Simonetto. Generating spiking time series with generative adversarial networks: an application on banking transactions. 2018. [24] Moustafa Alzantot, Supriyo Chakraborty, and Mani Srivastava. Sensegen: A deep learning architecture for synthetic sensor data generation. In 2017 IEEE International Conference on Pervasive Computing and Communications Workshops (Per Com Workshops), pages 188 193. IEEE, 2017. [25] Shota Haradal, Hideaki Hayashi, and Seiichi Uchida. Biosignal data augmentation based on generative adversarial networks. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 368 371. IEEE, 2018. [26] Chi Zhang, Sanmukh R Kuppannagari, Rajgopal Kannan, and Viktor K Prasanna. Generative adversarial network for synthetic time series data generation in smart grids. In 2018 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (Smart Grid Comm), pages 1 6. IEEE, 2018. [27] Ahmed M Alaa, Boris van Breugel, Evgeny Saveliev, and Mihaela van der Schaar. How faithful is your synthetic data? sample-level metrics for evaluating and auditing generative models. International Conference on Machine Learning (ICML), 2021. [28] Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, Eric Horvitz, and Stefano Ermon. Bias correction of learned generative models using likelihood-free importance weighting. Advances in Neural Information Processing Systems (Neur IPS), 2019. [29] Chris Donahue, Julian Mc Auley, and Miller Puckette. Adversarial audio synthesis. International Conference on Learning Representations (ICLR), 2019. [30] Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. Gansynth: Adversarial neural audio synthesis. International Conference on Learning Representations (ICLR), 2019. [31] Weili Nie, Nina Narodytska, and Ankit Patel. Relgan: Relational generative adversarial networks for text generation. International Conference on Learning Representations (ICLR), 2019. [32] Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. Language gans falling short. International Conference on Learning Representations (ICLR), 2020. [33] Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. IEEE International Conference on Computer Vision (ICCV), 2017. 
[34] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[35] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. International Conference on Learning Representations (ICLR), 2017.
[36] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. Advances in Neural Information Processing Systems (NeurIPS), 2000.
[37] Peter D Grünwald and A Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Annals of Statistics, 2004.
[38] Farzan Farnia and David Tse. A minimax approach to supervised learning. Advances in Neural Information Processing Systems (NeurIPS), 2016.
[39] Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Dissertation, Carnegie Mellon University, 2010.
[40] Michael I Jordan. An introduction to probabilistic graphical models. 2003.
[41] Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. International Conference on Learning Representations (ICLR), 2016.
[42] Shuangfei Zhai, Yu Cheng, Rogerio Feris, and Zhongfei Zhang. Generative adversarial networks as variational training of energy based models. arXiv preprint, 2016.
[43] Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Calibrating energy-based generative adversarial networks. International Conference on Learning Representations (ICLR), 2017.
[44] Rithesh Kumar, Sherjil Ozair, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. arXiv preprint, 2019.
[45] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. Conference on Uncertainty in Artificial Intelligence (UAI), 2016.
[46] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. International Conference on Machine Learning (ICML), 2017.
[47] Wenjie Shi, Shiji Song, and Cheng Wu. Soft policy gradient method for maximum entropy deep reinforcement learning. International Joint Conference on Artificial Intelligence (IJCAI), 2019.
[48] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[49] Josiah P Hanna and Peter Stone. Towards a data efficient off-policy policy gradient. AAAI Symposium on Data Efficient Reinforcement Learning (AAAI), 2018.
[50] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Unsupervised as supervised learning. The Elements of Statistical Learning, 2009.
[51] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (NeurIPS), 2013.
[52] Michael U Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research (JMLR), 2012.
[53] Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. NeurIPS Workshop on Adversarial Training, 2016.
[54] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. International Conference on Learning Representations (ICLR), 2018.
[55] Ian J Goodfellow. On distinguishability criteria for estimating generative models. International Conference on Learning Representations (ICLR), 2015.
[56] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
[57] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning (ICML), 2018.
[58] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
[59] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, and Jan Peters. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 2018.
[60] Alexandre Attia and Sharone Dayan. Global overview of imitation learning. arXiv preprint, 2018.
[61] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[62] Umar Syed and Robert E Schapire. A reduction from apprenticeship learning to classification. Advances in Neural Information Processing Systems (NeurIPS), 2010.
[63] Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. International Conference on Machine Learning (ICML), 2000.
[64] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. International Conference on Machine Learning (ICML), 2004.
[65] Ioana Bica, Daniel Jarrett, Alihan Hüyük, and Mihaela van der Schaar. Learning what-if explanations for sequential decision-making. International Conference on Learning Representations (ICLR), 2021.
[66] Daniel Jarrett, Alihan Hüyük, and Mihaela van der Schaar. Inverse decision modeling: Learning interpretable representations of behavior. International Conference on Machine Learning (ICML), 2021.
[67] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. AAAI Conference on Artificial Intelligence (AAAI), 2008.
[68] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. International Conference on Machine Learning (ICML), 2016.
[69] Ahmed H Qureshi, Byron Boots, and Michael C Yip. Adversarial imitation via variational inverse reinforcement learning. International Conference on Learning Representations (ICLR), 2019.
[70] Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. International Conference on Learning Representations (ICLR), 2019.
[71] Lionel Blondé and Alexandros Kalousis. Sample-efficient imitation learning via GANs. International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
[72] Huan Xu and Shie Mannor. Distributionally robust Markov decision processes. Mathematics of Operations Research, 2012.
[73] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting Structured Data, 2006.
[74] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative ConvNet. International Conference on Machine Learning (ICML), 2016.
[75] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. Advances in Neural Information Processing Systems (NeurIPS), 2019.
[76] Philip Bachman and Doina Precup. Data generation as sequential decision making. Advances in Neural Information Processing Systems (NeurIPS), 2015.
[77] Arun Venkatraman, Martial Hebert, and J Andrew Bagnell. Improving multi-step prediction of learned time series models. AAAI Conference on Artificial Intelligence (AAAI), 2015.
[78] Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, and Chandan K Reddy. Deep reinforcement learning for sequence-to-sequence models. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2019.
[79] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[80] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. International Conference on Machine Learning (ICML), 2012.
[81] Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. Advances in Neural Information Processing Systems (NeurIPS), 2013.
[82] Zhuang Ma and Michael Collins. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
[83] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. Globally normalized transition-based neural networks. Annual Meeting of the Association for Computational Linguistics (ACL), 2016.
[84] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. AAAI Conference on Artificial Intelligence (AAAI), 2017.
[85] Sidi Lu, Lantao Yu, Siyuan Feng, Yaoming Zhu, and Weinan Zhang. CoT: Cooperative training for generative modeling of discrete data. International Conference on Machine Learning (ICML), 2019.
[86] Haiyan Yin, Dingcheng Li, Xu Li, and Ping Li. Meta-CoTGAN: A meta cooperative training paradigm for improving adversarial text generation. AAAI Conference on Artificial Intelligence (AAAI), 2020.
[87] William Fedus, Ian Goodfellow, and Andrew M Dai. MaskGAN: Better text generation via filling in the ______. International Conference on Learning Representations (ICLR), 2018.
[88] Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation via adversarial training with leaked information. AAAI Conference on Artificial Intelligence (AAAI), 2018.
[89] Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. International Conference on Machine Learning (ICML), 2017.
[90] Luis M Candanedo, Véronique Feldheim, and Dominique Deramaix. Data driven prediction models of energy use of appliances in a low-energy house. Energy and Buildings, 2017.
[91] Javier Burgués, Juan Manuel Jiménez-Soto, and Santiago Marco. Estimation of the limit of detection in semiconductor gas sensors through linearized calibration models. Analytica Chimica Acta, 2018.
[92] John Hogue. Hourly Interstate 94 westbound traffic volume for MN DoT ATR station 301. Minnesota Department of Transportation, 2018.
[93] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Nature Scientific Data, 2016.
[94] Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar. Time-series generative adversarial networks. https://github.com/jsyoon0823/TimeGAN, 2019.
[95] Tianlin Xu, Li K Wenliang, Michael Munn, and Beatrice Acciaio. COT-GAN: Generating sequential data via causal optimal transport. https://github.com/tianlinxu312/cot-gan, 2020.
[96] Olof Mogren. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. https://github.com/olofmogren/c-rnn-gan, 2016.
[97] Cristóbal Esteban, Stephanie L Hyland, and Gunnar Rätsch. Real-valued (medical) time series generation with recurrent conditional GANs. https://github.com/ratschlab/RGAN, 2017.
[98] Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. Professor forcing. https://github.com/anirudh9119/LM_GANS, 2016.
[99] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using conditional GAN. Advances in Neural Information Processing Systems (NeurIPS), 2019.
[100] Noseong Park, Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia, Hongkyu Park, and Youngmin Kim. Data synthesis based on generative adversarial networks. International Conference on Very Large Data Bases (VLDB), 2018.
[101] Mohammad Navid Fekri, Ananda Mohon Ghosh, and Katarina Grolinger. Generating energy data for machine learning with recurrent generative adversarial networks. Energies, 2020.
[102] James Jordon, Jinsung Yoon, and Mihaela van der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. International Conference on Learning Representations (ICLR), 2019.