# LANGUAGE MODELING VIA STOCHASTIC PROCESSES

Published as a conference paper at ICLR 2022

Rose E. Wang, Esin Durmus, Noah Goodman, Tatsunori B. Hashimoto
Stanford University
{rewang, edurmus, ngoodman, thashim}@stanford.edu

Modern language models can generate high-quality short texts. However, they often meander or are incoherent when generating longer texts. These issues arise from the next-token-only language modeling objective. To address these issues, we introduce Time Control (TC), a language model that implicitly plans via a latent stochastic process. TC does this by learning a representation which maps the dynamics of how text changes in a document to the dynamics of a stochastic process of interest. Using this representation, the language model can generate text by first implicitly generating a document plan via a stochastic process, and then generating text that is consistent with this latent plan. Compared to domain-specific methods and fine-tuning GPT2 across a variety of text domains, TC improves performance on text infilling and discourse coherence. On long text generation settings, TC preserves the text structure both in terms of ordering (up to +40% better) and text length consistency (up to +17% better). Human evaluators also prefer TC's output 28.6% more than the baselines.¹

¹The accompanying code can be found here: https://github.com/rosewang2008/language_modeling_via_stochastic_processes.

## 1 INTRODUCTION

Large language models (LLMs) such as GPT-2 have been extremely successful in text generation (Radford et al., 2019; Brown et al., 2020). However, LLMs are known to generate incoherent long texts. One reason is that they are unable to plan ahead or represent long-range text dynamics (Kiddon et al., 2016; Fan et al., 2019; Hua & Wang, 2020; Duboue & McKeown, 2001; Stent et al., 2004; Tamkin et al., 2020). As a result, they often produce wandering content with poor discourse structure and low relevance (Hua & Wang, 2020; Zhao et al., 2017; Xu et al., 2020); the text reads as if the model has no anchored goal when generating. These problems with coherence are further exacerbated when forcing autoregressive models to generate longer texts, as the model struggles to extrapolate beyond its expected text end point.

These problems suggest that LLMs currently fail to properly capture how documents evolve from beginning to end. Doing so is critical for succeeding in goal-oriented tasks such as story, dialog, or recipe generation. Prior work has explored the use of planning-based methods for generating globally coherent text (Kiddon et al., 2016; Fan et al., 2019; Hua & Wang, 2020; Duboue & McKeown, 2001; Stent et al., 2004). However, these methods rely on manually defining text dynamics for specific domains. Other work has attempted to use sentence representations for modeling text, such as with variational autoencoders (Bowman et al., 2016) or contrastive learning (Gao et al., 2021; Devlin et al., 2019). Their shortcoming in text generation settings is that the latent representations are static: they capture semantic similarity between sentence neighbors, but don't capture how sentence embeddings evolve over a document. Methods including van den Oord et al. (2019) have tried to remedy this by learning a model of local latent dynamics. However, it is difficult to use learned local dynamics for generating accurate goal-conditioned trajectories, especially long-horizon ones. We explore an alternative that explicitly assumes a simple, fixed dynamics model with goal-conditioned generation.
In this work, we propose Time Control as a way to learn a latent space with known, goal-conditioned dynamics. We begin by assuming that meandering text generated without a goal can be represented as Brownian motion in latent space; this motion enforces that the embeddings of neighboring sentences are similar to each other, whereas those of distant sentences are dissimilar. Goal-directed behavior can be incorporated into this model by conditioning on a fixed start and end point. In this case, the Brownian motion becomes a Brownian bridge, and the resulting latent trajectories abide by simple, closed-form dynamics.

In Time Control, we derive a novel contrastive objective for learning a latent space with Brownian bridge dynamics. We can then use this latent space to generate text that retains local coherence and has improved global coherence. To perform text generation, Time Control first plans a latent trajectory via the Brownian bridge process pinned at a start and end point. It then conditionally generates sentences using this latent plan. In our work, we decode latent plans by fine-tuning GPT2 to generate text conditioned on Time Control's latent trajectory. Trajectories from Time Control act as abstract semantic positions in a document that guide generation of fine-tuned language models.

In summary, our work's contributions are the following:

- We derive Time Control, a language model which explicitly models latent structure with Brownian bridge dynamics learned using a novel contrastive objective.
- Across a range of text domains, we show that Time Control generates more or equally coherent text on tasks including text infilling and forced long text generation, compared to task-specific methods.
- We validate that our latent representations capture text dynamics competitively by evaluating discourse coherence with human experiments.
- We ablate our method to understand the importance of the contrastive objective, enforcing Brownian bridge dynamics, and explicitly modeling latent dynamics.

## 2 RELATED WORKS

Generating long, coherent text is conceptually difficult for autoregressive models because they lack the ability to model text structure and dynamics (Lin et al., 2021). This means that they struggle to plan and look ahead, which leads to globally incoherent text. Forcing autoregressive models to generate longer texts exacerbates this incoherence because the models struggle to extrapolate beyond their expected text end point. Prior work has tried to address the problem of generating globally coherent text with planning-based approaches (Puduppully et al., 2019; Moryossef et al., 2019; Fan et al., 2019; Kiddon et al., 2016). However, planning-based approaches rely on domain-specific heuristics for capturing text structure and dynamics. Our work uses a contrastive objective to learn latent dynamics in text without domain-specific heuristics. Contrastive objectives have been applied to several domains, including language (Devlin et al., 2019; Iter et al., 2020; Liu & Liu, 2021), vision (Chen et al., 2020), and general time series data (Hyvarinen & Morioka, 2016; Hyvarinen et al., 2019).
In particular for language, contrastive objectives have been applied to the next-sentence prediction task for improving BERT embeddings (Devlin et al., 2019) and to the discourse coherence setting (Nie et al., 2019; Chen et al., 2019b) for evaluating how coherent pairs of sentences are. However, these methods have two shortcomings which we address with our work. One is that the resulting sentence embeddings are often static: they capture semantic similarity between sentence neighbors, but don't capture how sentence embeddings evolve over a document. Two is that they are not used for generation and are limited to classification tasks like discourse coherence. Prior work has also tried fitting latent variable models (Bowman et al., 2016); however, these generally result in poor language generation (He et al., 2018) or are domain-specific (Weber et al., 2020; Arora et al., 2016).

Our work is closely related to Contrastive Predictive Coding (CPC) from van den Oord et al. (2019). The key difference is that CPC implicitly learns unconditioned latent dynamics, whereas we impose known goal-conditioned dynamics on our latent space. Doing so allows us to extrapolate further in time. Additionally, our method builds off of recent findings that contrastive objectives can be used to approximate local transition kernels of stochastic processes (Liu et al., 2021). The main difference between Liu et al. (2021) and our work is that they focus on provable conditions for latent recovery; we focus on empirically effective methods that leverage similar insights for recovering latent representations from language. Finally, our use of stochastic processes draws similarities to diffusion models (Song et al., 2020; Sohl-Dickstein et al., 2015), which apply a chain of diffusion steps onto the data and learn to reverse the diffusion process. However, our application is conceptually different: diffusion processes characterize properties of our latent space and are not a fixed inference method in our work.

Figure 1: Latent space for a positive triplet of sentences (x0, xt, xT) that are part of the same conversation, for example:
x0: [USER] Hello, I'd like to buy tickets for tomorrow.
xT: [USER] Could you confirm my tickets just in case?
xt: [ASSISTANT] What movie theater do you prefer?
x': [USER] Hi, I'm looking to purchase tickets for my family.
Time Control maps positive triplets to a smooth Brownian bridge trajectory. It embeds zt close to the expected embedding µt pinned by z0, zT. The green oval area illustrates the uncertainty over zt as a function of how close t is to 0 and T. In contrast, a negative random sentence x' from a different conversation is not coherent with x0 and xT; thus, it is embedded far from µt. This is captured by our contrastive loss, L.

## 3 TIME CONTROL

The intuition behind Time Control is to learn a latent space with smooth temporal dynamics for modeling and generating coherent text. We detail Time Control in three sections. The first section discusses training the encoder via contrastive learning to map sentences to a Brownian bridge (Revuz & Yor, 2013) latent space. The second section discusses training a decoder to reconstruct sentences from this latent space. The third section discusses generating text from Time Control.

### 3.1 TRAINING AN ENCODER WITH BROWNIAN BRIDGE DYNAMICS

Our encoder is a nonlinear mapping from raw input space to latent space, fθ : X → Z.
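As a minimal sketch, one possible instantiation of fθ is to pool features from a frozen pretrained language model and pass them through a small trainable MLP head. This particular architecture, the latent dimension, and all names below are illustrative assumptions rather than a specification of the exact encoder used here.

```python
# A minimal sketch of an encoder f_theta: sentences -> low-dimensional latents.
# Assumes a frozen GPT-2 feature extractor with a small trainable MLP head;
# the class name, latent_dim, and pooling choice are illustrative.
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class BridgeEncoder(nn.Module):
    def __init__(self, latent_dim: int = 8):
        super().__init__()
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.backbone = GPT2Model.from_pretrained("gpt2")
        for p in self.backbone.parameters():      # keep the feature extractor frozen
            p.requires_grad = False
        self.head = nn.Sequential(                # trainable nonlinear projection to Z
            nn.Linear(self.backbone.config.n_embd, 128),
            nn.GELU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, sentences: list[str]) -> torch.Tensor:
        batch = self.tokenizer(sentences, return_tensors="pt",
                               padding=True, truncation=True)
        hidden = self.backbone(**batch).last_hidden_state     # (B, L, n_embd)
        mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, L, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)         # mean-pool over tokens
        return self.head(pooled)                              # (B, latent_dim)
```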
The objective for the encoder is to map high-dimensional sequential data into low-dimensional latents which follow a stochastic process of interest; in this paper, it is the Brownian bridge process. The density of a Brownian bridge process between an arbitrary start point z0 at t = 0 and end point zT at t = T is

$$p(z_t \mid z_0, z_T) = \mathcal{N}\!\left(\left(1 - \frac{t}{T}\right) z_0 + \frac{t}{T} z_T,\ \frac{t(T - t)}{T}\right). \tag{1}$$

This density is intuitive to understand: it acts like a noisy linear interpolation between the start and end point of the trajectory, where zt should be more like z0 at the start and more like zT at the end of the trajectory. Uncertainty is highest in the middle region and low near the end points (cf. Figure 1).

Consider a set of triplet observations, (x1, x2, x3). The goal of our work is to ensure that fθ(x1), fθ(x2), fθ(x3) follow the Brownian bridge transition density in Equation 1. We ensure this using a contrastive objective. Formally, given multiple sequences of data points, X = {x1, ..., xN}, we draw batches consisting of randomly sampled positive triplets (x0, xt, xT) where 0 < t < T: B = {(x0, xt, xT)}.² Our encoder is optimized by

$$\mathcal{L} = \mathbb{E}\left[-\log \frac{\exp\big(d(x_0, x_t, x_T; f_\theta)\big)}{\sum_{(x_0, x_{t'}, x_T) \in B} \exp\big(d(x_0, x_{t'}, x_T; f_\theta)\big)}\right], \quad \text{where} \tag{2}$$

$$d(x_0, x_t, x_T; f_\theta) = -\frac{1}{2\sigma^2}\left\lVert \underbrace{f_\theta(x_t)}_{z_t} - \underbrace{\left(\left(1 - \tfrac{t}{T}\right) f_\theta(x_0) + \tfrac{t}{T}\, f_\theta(x_T)\right)}_{\text{mean in Equation 1}} \right\rVert_2^2$$

and σ² is the variance in Equation 1: t(T − t)/T. Note that Equation 2 sums over negative middle contrasts, xt'. This objective can be viewed as maximizing the extent to which true triplets from the data follow the Brownian bridge process while minimizing the extent to which an alternative mid-point sampled from another sequence does so.³

²We use indices 0, t, T to denote the start, middle, and end point of a Brownian bridge, but these do not correspond to strictly sampling the first, middle, and last sentence of a document. x0, xt, xT can be any sentences in a document as long as x0 comes before xt and xt before xT in the document.

Figure 1 illustrates how the objective translates into the language setting for training the encoder. The objective samples triplet sentences from a document. Sentences drawn from the same document make up a smooth latent trajectory; they should be close to each other and follow the conditional density in latent space. Sentences drawn from different documents should not make up a smooth trajectory and are less likely to follow bridge dynamics.

**Connection to mutual information estimation and triplet classification.** We draw connections between our contrastive loss and the mutual information estimation setup from van den Oord et al. (2019); Poole et al. (2019) (as |B| → ∞) and the classification setup from Liu et al. (2021) (|B| = 1). Following van den Oord et al. (2019); Poole et al. (2019), this objective can be seen as a lower bound on the mutual information between the two end points and the middle point: I(Xt; {X0, XT}) ≥ log(N) − LN. Hence, by minimizing the contrastive loss, we are maximizing the amount of information between the trajectory and the linear interpolation of its end points. Assuming |B| = 1, we can draw a connection to the classification setup studied in Liu et al. (2021). They train a classifier to distinguish in- vs. out-of-order input pairs and show that the Bayes optimal logits for pair-wise classification can be written as a function of the stochastic process transition kernel. This is equivalent to our loss on a single triplet i:

$$\ell_i = -\log \frac{\exp\big(d(x_0, x_t, x_T; f_\theta)\big)}{\exp\big(d(x_0, x_t, x_T; f_\theta)\big) + \exp\big(d(x_0, x_{t'}, x_T; f_\theta)\big)}.$$

Liu et al. (2021) consider pairs, whereas our work considers triplets; we show in Appendix A that the pairwise and triplet setups are equivalent.
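To make Equations 1 and 2 concrete, below is a minimal PyTorch sketch of the Brownian bridge contrastive objective. The function names, tensor shapes, and the in-batch negative-sampling scheme (reusing the other triplets' middle embeddings as negatives) are illustrative choices, not a description of the exact implementation.

```python
# Minimal sketch of the Brownian bridge contrastive loss (Equation 2).
# z0, zt, zT are encoder outputs f_theta(x0), f_theta(xt), f_theta(xT) for a
# batch of positive triplets; t and T hold each triplet's sampled indices.
import torch
import torch.nn.functional as F

def bridge_score(z0, zt, zT, t, T):
    """d(x0, xt, xT; f_theta): log of the Gaussian bridge kernel, up to a constant."""
    alpha = (t / T).unsqueeze(-1)                  # (B, 1)
    mean = (1 - alpha) * z0 + alpha * zT           # mean in Equation 1
    var = t * (T - t) / T                          # variance in Equation 1, shape (B,)
    return -((zt - mean) ** 2).sum(-1) / (2 * var)

def bridge_contrastive_loss(z0, zt, zT, t, T):
    """
    z0, zt, zT: (B, d) latent triplets; t, T: (B,) float tensors with 0 < t < T.
    Each bridge (z0_i, zT_i, t_i, T_i) scores every middle point in the batch;
    its own middle point is the positive, the others serve as negatives.
    """
    B = zt.shape[0]
    scores = torch.stack(
        [bridge_score(z0, zt[j].expand_as(zt), zT, t, T) for j in range(B)],
        dim=1,
    )                                               # (B, B): scores[i, j] = d_i(middle_j)
    targets = torch.arange(B, device=scores.device) # positives lie on the diagonal
    return F.cross_entropy(scores, targets)         # -log softmax over candidate middles
```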
### 3.2 TRAINING A DECODER WITH LATENT PLANS

Here we discuss how to train a language model to decode latent plans for generation. We first map all the sentences in the training dataset to our learned latent space using the pretrained encoder fθ. This gives us a Brownian bridge trajectory of sentence-level latent codes (z0, ..., zt, ..., zT) for each document in the dataset. Then, rather than learning a decoder from scratch, we fine-tune GPT2 (Radford et al., 2019) to generate text conditioned on past context and the latent plan.

We fine-tune in the following manner. Let x1, ..., xW be a document with W tokens and T sentences used to train the decoder. Using the encoder fθ, we can obtain embeddings z1, ..., zT for each sentence. The decoder is a standard auto-regressive language model that is modified in the following way: at time t, the decoder must predict xt using all tokens in the past x<t together with the latent code of the sentence that xt belongs to.
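As a sketch of what this conditioning can look like in practice, the snippet below fine-tunes GPT-2 so that each token also sees the latent code of its sentence. Projecting the latent and adding it to the token embeddings is one simple conditioning mechanism assumed here for illustration; the class name, shapes, and parameters are hypothetical.

```python
# Minimal sketch of fine-tuning GPT-2 on text conditioned on sentence-level latents.
# The conditioning mechanism (project z and add it to the token embeddings) is an
# illustrative assumption, not necessarily the exact mechanism used by Time Control.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class LatentConditionedGPT2(nn.Module):
    def __init__(self, latent_dim: int = 8):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained("gpt2")
        self.latent_proj = nn.Linear(latent_dim, self.lm.config.n_embd)

    def forward(self, input_ids, token_latents, labels=None):
        """
        input_ids:     (B, W) token ids x_1 ... x_W
        token_latents: (B, W, latent_dim) latent code of the sentence each token belongs to
        """
        tok_emb = self.lm.transformer.wte(input_ids)   # (B, W, n_embd) token embeddings
        cond = self.latent_proj(token_latents)         # (B, W, n_embd) projected latents
        return self.lm(inputs_embeds=tok_emb + cond, labels=labels)

# Training step (sketch): standard next-token cross-entropy, with every token
# additionally conditioned on its sentence's latent code.
# out = model(input_ids, token_latents, labels=input_ids); out.loss.backward()
```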