Published as a conference paper at ICLR 2020
VARIATIONAL TEMPLATE MACHINE FOR DATA-TO-TEXT GENERATION
Rong Ye, Wenxian Shi, Hao Zhou, Zhongyu Wei, Lei Li
Fudan University {rye18,zywei}@fudan.edu.cn
ByteDance AI Lab {shiwenxian,zhouhao.nlp,lileilab}@bytedance.com
ABSTRACT

How to generate descriptions from structured data organized in tables? Existing approaches using neural encoder-decoder models often suffer from a lack of diversity. We claim that an open set of templates is crucial for enriching phrase constructions and realizing varied generations. Learning such templates is prohibitive since it often requires a large paired corpus, which is seldom available. This paper explores the problem of automatically learning reusable templates from paired and non-paired data. We propose the variational template machine (VTM), a novel method to generate text descriptions from data tables. Our contributions include: a) we carefully devise a specific model architecture and losses to explicitly disentangle text template and semantic content information in the latent spaces, and b) we utilize both small parallel data and large raw text without aligned tables to enrich template learning. Experiments on datasets from a variety of domains show that VTM is able to generate more diverse outputs while maintaining good fluency and quality.
1 INTRODUCTION
Generating text descriptions from structured data (data-to-text) is an important task with many practical applications. Data-to-text has been used to generate different kinds of texts, such as weather reports (Angeli et al., 2010), sports news (Mei et al., 2016; Wiseman et al., 2017) and biographies (Lebret et al., 2016; Wang et al., 2018b; Chisholm et al., 2017). Figure 1 gives an example of the data-to-text task, which takes an infobox¹ as input and outputs a brief description of the information in the table. Several recent methods utilize neural encoder-decoder frameworks to generate text descriptions from data tables (Lebret et al., 2016; Bao et al., 2018; Chisholm et al., 2017; Liu et al., 2018).
Although current table-to-text models can generate high-quality sentences, the diversity of these output sentences is not satisfactory. We find that templates are crucial for increasing the variation of sentence structure. For example, Table 1 gives three descriptions with their templates for the given table input. Different templates control the sentence arrangement and thus vary the generation. Related work (Wiseman et al., 2018; Dou et al., 2018) employs a hidden semi-Markov model to extract templates from table-text pairs.
We argue that templates can be better exploited to generate more diverse outputs. First, it is non-trivial to sample different templates in order to obtain different output utterances. Directly adopting variational auto-encoders (VAEs; Kingma & Welling, 2014) for table-to-text only allows sampling in a single latent space. Such sampling tends to produce irrelevant outputs: it may change the table content rather than merely the template, which harms the quality of the output sentences. If we could instead sample directly in a dedicated template space, we could obtain more diverse outputs while keeping the quality of the output sentences high.
Work done while Rong Ye was a research intern at ByteDance AI Lab.
¹An infobox is a table containing attribute-value data about a certain subject. It is mostly used on Wikipedia pages.
Table: name[nameVariable], eatType[pub], food[Japanese], priceRange[average], customerRating[low], area[riverside]
Template 1: [name] is a [food] restaurant, it is a [eatType] and it has an [priceRange] cost and [customerRating] rating. it is in [area].
Sentence 1: nameVariable is a Japanese restaurant, it is a pub and it has an average cost and low rating. it is in riverside.
Template 2: [name] has an [priceRange] price range with a [customerRating] rating, and [name] is an [food] [eatType] in [area].
Sentence 2: nameVariable has an average price range with a low rating, and nameVariable is an Japanese pub in riverside.
Template 3: [name] is a [eatType] with a [customerRating] rating and [priceRange] cost, it is a [food] restaurant and [name] is in [area].
Sentence 3: nameVariable is a pub with a low rating and average cost, it is a Japanese restaurant and nameVariable is in riverside.
Table 1: An example: generating sentences based on different templates.
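To make the template mechanism concrete, here is a minimal sketch of how a slot-style template like those in Table 1 is filled with table values. The dictionary representation of the table and the `fill_template` helper are illustrative assumptions, not part of VTM itself.

```python
import re

# Illustrative table from Table 1, represented as a field -> value mapping (an assumption).
table = {
    "name": "nameVariable",
    "eatType": "pub",
    "food": "Japanese",
    "priceRange": "average",
    "customerRating": "low",
    "area": "riverside",
}

# Template 1 from Table 1: bracketed slots stand for table fields.
template = ("[name] is a [food] restaurant, it is a [eatType] and it has "
            "an [priceRange] cost and [customerRating] rating. it is in [area].")

def fill_template(template: str, table: dict) -> str:
    """Replace every [field] slot with the corresponding table value."""
    return re.sub(r"\[(\w+)\]", lambda m: table[m.group(1)], template)

print(fill_template(template, table))
# -> "nameVariable is a Japanese restaurant, it is a pub and it has an
#     average cost and low rating. it is in riverside."
```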
Second, we can hardly obtain promising sentences by sampling in the template space if that space is not informative. Both encoder-decoder and VAE-based models require abundant parallel table-text pairs during training, and constructing a high-quality parallel dataset is labor-intensive. With limited table-sentence pairs, a VAE model cannot construct an informative template space. How to fully utilize raw sentences (without aligned tables) to enrich the latent template space remains under-explored.
In this paper, to address the above two problems, we propose the variational template machine (VTM) for data-to-text generation, which generates sentences with diverse templates while preserving high quality. In particular, we introduce two latent variables, representing the template and the content, to control the generation. The two latent variables are disentangled, so we can generate diverse outputs by directly sampling in the latent template space. Moreover, we propose a novel approach for semi-supervised learning in the VAE framework, which fully exploits raw sentences to enrich the template space. Inspired by back-translation (Sennrich et al., 2016; Burlot & Yvon, 2018; Artetxe et al., 2018), we design a variational back-translation process. Instead of training a sentence-to-table backward generation model directly, we take the variational posterior of the content latent variable as the backward model to help train the forward generative model. Auxiliary losses are introduced to ensure the learning of meaningful and disentangled latent variables.
Experimental results on the Wikipedia biography dataset (Lebret et al., 2016) and the sentence-planning NLG dataset (Reed et al., 2018) show that our model generates texts with more diversity while keeping good fluency. Trained together with a large amount of raw text, VTM further improves generation performance. Moreover, VTM is particularly advantageous in cases where a sentence-to-table backward model is hard to train. Ablation studies also demonstrate the effects of the auxiliary losses on the disentanglement of the template and content spaces.
2 PROBLEM FORMULATION AND NOTATIONS
As a data-to-text task, we have table-text pairs $\mathcal{D}_p = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the table and $y_i$ is the output sentence.
Following the description scheme of Lebret et al. (2016), a table $x$ can be viewed as a set of $K$ records of field-position-value triples, i.e., $x = \{(f, p, v)_i\}_{i=1}^{K}$, where $f$ is the field and $p$ is the index of value $v$ in field $f$. For example, the item Name: John Lennon is denoted as two corresponding records: (Name, 1, John) and (Name, 2, Lennon). For each triple, we first embed the field, position and value as $d$-dimensional vectors $e_f, e_p, e_v \in \mathbb{R}^d$. Then, the $d_t$-dimensional representation of the record is obtained by $h_i = \tanh(W[e_f; e_p; e_v]^{T} + b),\ i = 1 \ldots K$, where $W \in \mathbb{R}^{d_t \times 3d}$ and $b \in \mathbb{R}^{d_t}$ are parameters. The final representation of the table, denoted as $f_{enc}(x)$, is obtained by max-pooling over all field-position-value triple records,
$$f_{enc}(x) = h = \mathrm{MaxPool}_i\{h_i;\ i = 1 \ldots K\}.$$
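As a rough illustration, the following PyTorch sketch implements this table encoder under stated assumptions: the class name `TableEncoder`, the vocabulary sizes, and the default dimensions are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

class TableEncoder(nn.Module):
    """Encode a table of (field, position, value) records into a single vector,
    following h_i = tanh(W [e_f; e_p; e_v] + b) and f_enc(x) = MaxPool_i{h_i}."""

    def __init__(self, n_fields, n_positions, n_values, d=64, d_t=128):
        super().__init__()
        self.field_emb = nn.Embedding(n_fields, d)
        self.pos_emb = nn.Embedding(n_positions, d)
        self.value_emb = nn.Embedding(n_values, d)
        self.proj = nn.Linear(3 * d, d_t)  # plays the role of W and b

    def forward(self, fields, positions, values):
        # fields, positions, values: LongTensors of shape (K,) for K records
        e = torch.cat([self.field_emb(fields),
                       self.pos_emb(positions),
                       self.value_emb(values)], dim=-1)   # (K, 3d)
        h = torch.tanh(self.proj(e))                       # (K, d_t)
        return h.max(dim=0).values                         # max-pool over records -> (d_t,)

# Example with toy index tensors for the records (Name, 1, John), (Name, 2, Lennon).
enc = TableEncoder(n_fields=10, n_positions=20, n_values=100)
c = enc(torch.tensor([3, 3]), torch.tensor([1, 2]), torch.tensor([7, 8]))
print(c.shape)  # torch.Size([128])
```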
In addition to the table-text pairs, we also have raw texts without table input, denoted as $\mathcal{D}_r = \{y_i\}_{i=1}^{M}$. Usually $M \gg N$.
Figure 1: Two types of data in the data-to-text task: Row 2 presents an example of table-text pairs; Row 3 shows a sample of raw text, for which the table input is missing and only the sentence is provided.
Figure 2: The graphical model of VTM: $z$ is the latent variable from the template space, and $c$ is the content variable. $x$ is the corresponding table for the table-text pairs. $y$ is the observed sentence. Solid lines depict the generative model and dashed lines form the inference model.
3 VARIATIONAL TEMPLATE MACHINE
As shown in the graphical model in Figure 2, our VTM modifies the vanilla VAE by introducing two independent latent variables $z$ and $c$, representing the template latent variable and the content latent variable, respectively. $c$ models the content information in the table, while $z$ models the sentence-template information. The target sentence $y$ is generated from both the content and template variables. The two latent variables are disentangled, which makes it possible to generate diverse yet relevant sentences by sampling the template variable while retaining the content variable. For the paired and raw data shown in Figure 1, the generation process of the content latent variable $c$ differs.
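As a rough sketch of how this enables diverse generation at inference time, one can fix the content variable computed from the table and resample only the template variable. The names `f_enc` and `decoder` below are placeholders for a trained table encoder and a decoder standing in for $p_\theta(y|z, c, x)$.

```python
import torch

def generate_diverse(table, f_enc, decoder, n_samples=3, z_dim=64):
    """Sample several outputs for one table by fixing c and resampling z.

    f_enc(table) -> content vector c (deterministic, from the table encoder).
    decoder(z, c) -> a decoded sentence; stands in for p_theta(y | z, c, x).
    """
    c = f_enc(table)                       # content is observable from the table
    sentences = []
    for _ in range(n_samples):
        z = torch.randn(z_dim)             # template variable from the prior N(0, I)
        sentences.append(decoder(z, c))    # same content, different template
    return sentences
```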
For a given table-text pair $(x, y) \in \mathcal{D}_p$, the content is observable from the table $x$. As a result, $c$ is assumed to be deterministic given $x$, and its prior is defined as a delta distribution $p(c|x) = \delta(c = f_{enc}(x))$. The marginal log-likelihood is:
$$\log p_\theta(y|x) = \log \int_{z}\int_{c} p_\theta(y|x, z, c)\, p(z)\, p(c|x)\, dc\, dz = \log \int_{z} p_\theta(y|x, z, c = f_{enc}(x))\, p(z)\, dz, \quad (x, y) \in \mathcal{D}_p. \qquad (1)$$
For a raw text $y \in \mathcal{D}_r$, the content is unobservable because the table $x$ is absent. As a result, the content latent variable $c$ is sampled from the Gaussian prior $\mathcal{N}(0, I)$. The marginal log-likelihood is:
$$\log p_\theta(y) = \log \int_{z}\int_{c} p_\theta(y|z, c)\, p(z)\, p(c)\, dc\, dz, \quad y \in \mathcal{D}_r. \qquad (2)$$
To make full use of both the table-text pair data and the raw text data, the two marginal log-likelihoods are optimized jointly:
$$\mathcal{L}(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}_p}\left[\log p_\theta(y|x)\right] + \mathbb{E}_{y \sim \mathcal{D}_r}\left[\log p_\theta(y)\right]. \qquad (3)$$
Directly optimizing Equation 3 is intractable. Following the idea of variational inference (Kingma & Welling, 2014), a variational posterior $q_\phi(\cdot)$ is constructed as an inference model (dashed lines in Figure 2) to approximate the true posterior. Instead of optimizing the marginal log-likelihood in Equation 3, we maximize the evidence lower bound (ELBO). The ELBOs for the table-text pair data and the raw text data are derived in Sections 3.1 and 3.2, respectively.
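A schematic of how the two data sources are mixed during training, assuming per-example ELBO-based losses `elbo_pair` and `elbo_raw` (hypothetical callables standing in for the objectives derived in Sections 3.1 and 3.2); this is a sketch of the joint objective in Equation 3, not the full VTM training loop.

```python
def joint_loss(pair_batch, raw_batch, elbo_pair, elbo_raw):
    """Combine the pairwise and raw-text objectives of Equation 3.

    pair_batch: iterable of (table, sentence) pairs from D_p
    raw_batch:  iterable of sentences from D_r
    elbo_pair(x, y) / elbo_raw(y): per-example ELBO-based losses (to be minimized).
    """
    loss_p = sum(elbo_pair(x, y) for x, y in pair_batch) / len(pair_batch)
    loss_r = sum(elbo_raw(y) for y in raw_batch) / len(raw_batch)
    return loss_p + loss_r
```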
3.1 LEARNING FROM TABLE-TEXT PAIR DATA
In this section, we present the learning loss for table-text pair data. According to the aforementioned assumption, the content variable $c$ is observable and follows a delta distribution centered at the hidden representation of the table $x$.
ELBO objective. Assuming that the template variable $z$ relies only on the template of the target sentence, we introduce $q_{\phi_z}(z|y)$ as an approximation of the true posterior $p(z|y, c, x)$.
The ELBO loss of Equation 1 is written as
$$\mathcal{L}_{ELBO_p}(x, y) = -\mathbb{E}_{q_{\phi_z}(z|y)}\left[\log p_\theta(y|z, c = f_{enc}(x), x)\right] + D_{KL}\!\left(q_{\phi_z}(z|y)\,\|\,p(z)\right), \quad (x, y) \in \mathcal{D}_p.$$
The variational posterior $q_{\phi_z}(z|y)$ is modeled as a multivariate Gaussian $\mathcal{N}(\mu_{\phi_z}(y), \Sigma_{\phi_z}(y))$, while the prior $p(z)$ is the standard normal $\mathcal{N}(0, I)$.
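A minimal PyTorch sketch of this posterior and the KL term in the ELBO loss above, using the standard reparameterization trick; the diagonal-covariance parameterization and the module name `TemplatePosterior` are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemplatePosterior(nn.Module):
    """q_{phi_z}(z | y): a diagonal Gaussian N(mu(y), Sigma(y)) over the template variable."""

    def __init__(self, sent_dim, z_dim):
        super().__init__()
        self.mu = nn.Linear(sent_dim, z_dim)
        self.logvar = nn.Linear(sent_dim, z_dim)

    def forward(self, sent_repr):
        mu, logvar = self.mu(sent_repr), self.logvar(sent_repr)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Closed-form KL( N(mu, Sigma) || N(0, I) ) for diagonal Gaussians
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl
```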
Preserving-Template Loss. Without any supervision, the ELBO loss alone does not guarantee learning a good template representation space. Inspired by work on style transfer (Hu et al., 2017b; Shen et al., 2017; Bao et al., 2019; John et al., 2018), an auxiliary loss is introduced to embed the template information of sentences into the template variable $z$.
Given the table, we can roughly align the tokens in the sentence with the records in the table. By replacing these aligned tokens with a special placeholder token, we remove the content information from the sentence and obtain a sketchy sentence template, denoted as $\tilde{y}$. We introduce the preserving-template loss $\mathcal{L}_{pt}$ to ensure that the latent variable $z$ contains only template information.
$$\mathcal{L}_{pt}(x, y, \tilde{y}) = -\mathbb{E}_{q_{\phi_z}(z|y)}\left[\log p_\eta(\tilde{y}|z)\right] = -\mathbb{E}_{q_{\phi_z}(z|y)}\left[\sum_{t} \log p_\eta(\tilde{y}_t|z, \tilde{y}_{<t})\right]$$
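A minimal sketch of the delexicalization step that produces $\tilde{y}$ from an aligned table-sentence pair. The placeholder string `"<ent>"` and the exact value-matching rule are assumptions; the paper only specifies that aligned tokens are replaced by a special token.

```python
def delexicalize(sentence: str, table: dict, placeholder: str = "<ent>") -> str:
    """Replace sentence tokens that match table values with a placeholder token,
    yielding the sketchy template y-tilde used by the preserving-template loss."""
    values = {tok.lower() for value in table.values() for tok in value.split()}
    tokens = [placeholder if tok.lower() in values else tok
              for tok in sentence.split()]
    return " ".join(tokens)

table = {"name": "nameVariable", "food": "Japanese", "area": "riverside"}
print(delexicalize("nameVariable is a Japanese restaurant in riverside .", table))
# -> "<ent> is a <ent> restaurant in <ent> ."
```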