Published as a conference paper at ICLR 2020
VARIATIONAL TEMPLATE MACHINE FOR DATA-TO-TEXT GENERATION
Rong Ye, Wenxian Shi, Hao Zhou, Zhongyu Wei, Lei Li
Fudan University {rye18,zywei}@fudan.edu.cn
ByteDance AI Lab {shiwenxian,zhouhao.nlp,lileilab}@bytedance.com
ABSTRACT

How to generate descriptions from structured data organized in tables? Existing approaches using neural encoder-decoder models often suffer from a lack of diversity. We claim that an open set of templates is crucial for enriching phrase constructions and realizing varied generations. Learning such templates is prohibitive since it often requires a large paired corpus, which is seldom available. This paper explores the problem of automatically learning reusable templates from paired and non-paired data. We propose the variational template machine (VTM), a novel method to generate text descriptions from data tables. Our contributions include: a) we carefully devise a specific model architecture and losses to explicitly disentangle text template and semantic content information in the latent spaces, and b) we utilize both small parallel data and large raw text without aligned tables to enrich template learning. Experiments on datasets from a variety of domains show that VTM is able to generate more diverse outputs while maintaining good fluency and quality.
1 INTRODUCTION
Generating text descriptions from structured data (data-to-text) is an important task with many practical applications. Data-to-text has been used to generate different kinds of texts, such as weather reports (Angeli et al., 2010), sports news (Mei et al., 2016; Wiseman et al., 2017) and biographies (Lebret et al., 2016; Wang et al., 2018b; Chisholm et al., 2017). Figure 1 gives an example of the data-to-text task, which takes an infobox¹ as input and outputs a brief description of the information in the table. Several recent methods utilize neural encoder-decoder frameworks to generate text descriptions from data tables (Lebret et al., 2016; Bao et al., 2018; Chisholm et al., 2017; Liu et al., 2018).
Although current table-to-text models can generate high-quality sentences, the diversity of these output sentences is not satisfactory. We find that templates are crucial for increasing the variation of sentence structure. For example, Table 1 gives three descriptions with their templates for the given table input. Different templates control the sentence arrangement and thus vary the generation. Related work (Wiseman et al., 2018; Dou et al., 2018) employs a hidden semi-Markov model to extract templates from table-text pairs.
We argue that templates can be better exploited to generate more diverse outputs. First, it is non-trivial to sample different templates in order to obtain different output utterances. Directly adopting variational auto-encoders (VAEs; Kingma & Welling, 2014) for table-to-text only allows sampling in a single latent space. Such sampling tends to produce irrelevant outputs: it may change the table content rather than merely the template, which harms the quality of the output sentences. If we could instead sample directly in a dedicated template space, we could obtain more diverse outputs while keeping the quality of the output sentences high.
Work done while Rong Ye was a research intern at ByteDance AI Lab.
¹An infobox is a table containing attribute-value data about a certain subject. It is mostly used on Wikipedia pages.
Table: name[nameVariable], eatType[pub], food[Japanese], priceRange[average], customerRating[low], area[riverside]
Template 1: [name] is a [food] restaurant, it is a [eatType] and it has an [priceRange] cost and [customerRating] rating. it is in [area].
Sentence 1: nameVariable is a Japanese restaurant, it is a pub and it has an average cost and low rating. it is in riverside.
Template 2: [name] has an [priceRange] price range with a [customerRating] rating, and [name] is an [food] [eatType] in [area].
Sentence 2: nameVariable has an average price range with a low rating, and nameVariable is an Japanese pub in riverside.
Template 3: [name] is a [eatType] with a [customerRating] rating and [priceRange] cost, it is a [food] restaurant and [name] is in [area].
Sentence 3: nameVariable is a pub with a low rating and average cost, it is a Japanese restaurant and nameVariable is in riverside.
Table 1: An example: generating sentences based on different templates.
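To make the template mechanism concrete, here is a minimal sketch of how a slot-style template like those in Table 1 is filled with table values. The dictionary representation of the table and the `fill_template` helper are illustrative assumptions, not part of VTM itself.

```python
import re

# Illustrative table from Table 1, represented as a field -> value mapping (an assumption).
table = {
    "name": "nameVariable",
    "eatType": "pub",
    "food": "Japanese",
    "priceRange": "average",
    "customerRating": "low",
    "area": "riverside",
}

# Template 1 from Table 1: bracketed slots stand for table fields.
template = ("[name] is a [food] restaurant, it is a [eatType] and it has "
            "an [priceRange] cost and [customerRating] rating. it is in [area].")

def fill_template(template: str, table: dict) -> str:
    """Replace every [field] slot with the corresponding table value."""
    return re.sub(r"\[(\w+)\]", lambda m: table[m.group(1)], template)

print(fill_template(template, table))
# -> "nameVariable is a Japanese restaurant, it is a pub and it has an
#     average cost and low rating. it is in riverside."
```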
Second, we can hardly obtain promising sentences by sampling in the template space if that space is not informative. Both encoder-decoder and VAE-based models require abundant parallel table-text pairs during training, and constructing a high-quality parallel dataset is labor-intensive. With limited table-sentence pairs, a VAE model cannot construct an informative template space. How to fully utilize raw sentences (without aligned tables) to enrich the latent template space remains under-explored.
In this paper, to address the above two problems, we propose the variational template machine (VTM) for data-to-text generation, which generates sentences with diverse templates while preserving high quality. In particular, we introduce two latent variables, representing the template and the content, to control the generation. The two latent variables are disentangled, so we can generate diverse outputs by directly sampling in the latent template space. Moreover, we propose a novel approach for semi-supervised learning in the VAE framework, which fully exploits raw sentences to enrich the template space. Inspired by back-translation (Sennrich et al., 2016; Burlot & Yvon, 2018; Artetxe et al., 2018), we design a variational back-translation process. Instead of training a sentence-to-table backward generation model directly, we take the variational posterior of the content latent variable as the backward model to help train the forward generative model. Auxiliary losses are introduced to ensure the learning of meaningful and disentangled latent variables.
Experimental results on the Wikipedia biography dataset (Lebret et al., 2016) and the sentence-planning NLG dataset (Reed et al., 2018) show that our model generates texts with more diversity while keeping good fluency. Trained together with a large amount of raw text, VTM further improves generation performance. Moreover, VTM is particularly advantageous in cases where a sentence-to-table backward model is hard to train. Ablation studies also demonstrate the effects of the auxiliary losses on the disentanglement of the template and content spaces.
2 PROBLEM FORMULATION AND NOTATIONS
As a data-to-text task, we have table-text pairs $\mathcal{D}_p = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the table and $y_i$ is the output sentence.
Following the description scheme of Lebret et al. (2016), a table $x$ can be viewed as a set of $K$ records of field-position-value triples, i.e., $x = \{(f, p, v)_i\}_{i=1}^{K}$, where $f$ is the field and $p$ is the index of value $v$ in field $f$. For example, the item Name: John Lennon is denoted as two corresponding records: (Name, 1, John) and (Name, 2, Lennon). For each triple, we first embed the field, position and value as $d$-dimensional vectors $e_f, e_p, e_v \in \mathbb{R}^d$. Then, the $d_t$-dimensional representation of the record is obtained by $h_i = \tanh(W[e_f; e_p; e_v]^{T} + b),\ i = 1 \ldots K$, where $W \in \mathbb{R}^{d_t \times 3d}$ and $b \in \mathbb{R}^{d_t}$ are parameters. The final representation of the table, denoted as $f_{enc}(x)$, is obtained by max-pooling over all field-position-value triple records,
$$f_{enc}(x) = h = \mathrm{MaxPool}_i\{h_i;\ i = 1 \ldots K\}.$$
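As a rough illustration, the following PyTorch sketch implements this table encoder under stated assumptions: the class name `TableEncoder`, the vocabulary sizes, and the default dimensions are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

class TableEncoder(nn.Module):
    """Encode a table of (field, position, value) records into a single vector,
    following h_i = tanh(W [e_f; e_p; e_v] + b) and f_enc(x) = MaxPool_i{h_i}."""

    def __init__(self, n_fields, n_positions, n_values, d=64, d_t=128):
        super().__init__()
        self.field_emb = nn.Embedding(n_fields, d)
        self.pos_emb = nn.Embedding(n_positions, d)
        self.value_emb = nn.Embedding(n_values, d)
        self.proj = nn.Linear(3 * d, d_t)  # plays the role of W and b

    def forward(self, fields, positions, values):
        # fields, positions, values: LongTensors of shape (K,) for K records
        e = torch.cat([self.field_emb(fields),
                       self.pos_emb(positions),
                       self.value_emb(values)], dim=-1)   # (K, 3d)
        h = torch.tanh(self.proj(e))                       # (K, d_t)
        return h.max(dim=0).values                         # max-pool over records -> (d_t,)

# Example with toy index tensors for the records (Name, 1, John), (Name, 2, Lennon).
enc = TableEncoder(n_fields=10, n_positions=20, n_values=100)
c = enc(torch.tensor([3, 3]), torch.tensor([1, 2]), torch.tensor([7, 8]))
print(c.shape)  # torch.Size([128])
```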
In addition to the table-text pairs, we also have raw texts without table input, denoted as $\mathcal{D}_r = \{y_i\}_{i=1}^{M}$. Usually $M \gg N$.
Figure 1: Two types of data in the data-to-text task: Row 2 presents an example of table-text pairs; Row 3 shows a sample of raw text, for which the table input is missing and only the sentence is provided.
Figure 2: The graphical model of VTM: $z$ is the latent variable from the template space, and $c$ is the content variable. $x$ is the corresponding table for the table-text pairs. $y$ is the observed sentence. Solid lines depict the generative model and dashed lines form the inference model.
3 VARIATIONAL TEMPLATE MACHINE
As shown in the graphical model in Figure 2, our VTM modifies the vanilla VAE by introducing two independent latent variables $z$ and $c$, representing the template latent variable and the content latent variable, respectively. $c$ models the content information in the table, while $z$ models the sentence-template information. The target sentence $y$ is generated from both the content and template variables. The two latent variables are disentangled, which makes it possible to generate diverse yet relevant sentences by sampling the template variable while retaining the content variable. For the paired and raw data shown in Figure 1, the generation process of the content latent variable $c$ differs.
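As a rough sketch of how this enables diverse generation at inference time, one can fix the content variable computed from the table and resample only the template variable. The names `f_enc` and `decoder` below are placeholders for a trained table encoder and a decoder standing in for $p_\theta(y|z, c, x)$.

```python
import torch

def generate_diverse(table, f_enc, decoder, n_samples=3, z_dim=64):
    """Sample several outputs for one table by fixing c and resampling z.

    f_enc(table) -> content vector c (deterministic, from the table encoder).
    decoder(z, c) -> a decoded sentence; stands in for p_theta(y | z, c, x).
    """
    c = f_enc(table)                       # content is observable from the table
    sentences = []
    for _ in range(n_samples):
        z = torch.randn(z_dim)             # template variable from the prior N(0, I)
        sentences.append(decoder(z, c))    # same content, different template
    return sentences
```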
For a given table-text pair $(x, y) \in \mathcal{D}_p$, the content is observable from the table $x$. As a result, $c$ is assumed to be deterministic given $x$, and its prior is defined as a delta distribution $p(c|x) = \delta(c = f_{enc}(x))$. The marginal log-likelihood is:
$$\log p_\theta(y|x) = \log \int_{z}\int_{c} p_\theta(y|x, z, c)\, p(z)\, p(c|x)\, dc\, dz = \log \int_{z} p_\theta(y|x, z, c = f_{enc}(x))\, p(z)\, dz, \quad (x, y) \in \mathcal{D}_p. \qquad (1)$$
For a raw text $y \in \mathcal{D}_r$, the content is unobservable because the table $x$ is absent. As a result, the content latent variable $c$ is sampled from the Gaussian prior $\mathcal{N}(0, I)$. The marginal log-likelihood is:
$$\log p_\theta(y) = \log \int_{z}\int_{c} p_\theta(y|z, c)\, p(z)\, p(c)\, dc\, dz, \quad y \in \mathcal{D}_r. \qquad (2)$$
To make full use of both the table-text pair data and the raw text data, the two marginal log-likelihoods are optimized jointly:
$$\mathcal{L}(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}_p}\left[\log p_\theta(y|x)\right] + \mathbb{E}_{y \sim \mathcal{D}_r}\left[\log p_\theta(y)\right]. \qquad (3)$$
Directly optimizing Equation 3 is intractable. Following the idea of variational inference (Kingma & Welling, 2014), a variational posterior $q_\phi(\cdot)$ is constructed as an inference model (dashed lines in Figure 2) to approximate the true posterior. Instead of optimizing the marginal log-likelihood in Equation 3, we maximize the evidence lower bound (ELBO). The ELBOs for the table-text pair data and the raw text data are derived in Sections 3.1 and 3.2, respectively.
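A schematic of how the two data sources are mixed during training, assuming per-example ELBO-based losses `elbo_pair` and `elbo_raw` (hypothetical callables standing in for the objectives derived in Sections 3.1 and 3.2); this is a sketch of the joint objective in Equation 3, not the full VTM training loop.

```python
def joint_loss(pair_batch, raw_batch, elbo_pair, elbo_raw):
    """Combine the pairwise and raw-text objectives of Equation 3.

    pair_batch: iterable of (table, sentence) pairs from D_p
    raw_batch:  iterable of sentences from D_r
    elbo_pair(x, y) / elbo_raw(y): per-example ELBO-based losses (to be minimized).
    """
    loss_p = sum(elbo_pair(x, y) for x, y in pair_batch) / len(pair_batch)
    loss_r = sum(elbo_raw(y) for y in raw_batch) / len(raw_batch)
    return loss_p + loss_r
```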
3.1 LEARNING FROM TABLE-TEXT PAIR DATA
In this section, we present the learning loss for table-text pair data. According to the aforementioned assumption, the content variable $c$ is observable and follows a delta distribution centered at the hidden representation of the table $x$.
ELBO objective. Assuming that the template variable $z$ relies only on the template of the target sentence, we introduce $q_{\phi_z}(z|y)$ as an approximation of the true posterior $p(z|y, c, x)$.
The ELBO loss of Equation 1 is written as
$$\mathcal{L}_{ELBO_p}(x, y) = -\mathbb{E}_{q_{\phi_z}(z|y)}\left[\log p_\theta(y|z, c = f_{enc}(x), x)\right] + D_{KL}\!\left(q_{\phi_z}(z|y)\,\|\,p(z)\right), \quad (x, y) \in \mathcal{D}_p.$$
The variational posterior $q_{\phi_z}(z|y)$ is modeled as a multivariate Gaussian $\mathcal{N}(\mu_{\phi_z}(y), \Sigma_{\phi_z}(y))$, while the prior $p(z)$ is the standard normal $\mathcal{N}(0, I)$.
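A minimal PyTorch sketch of this posterior and the KL term in the ELBO loss above, using the standard reparameterization trick; the diagonal-covariance parameterization and the module name `TemplatePosterior` are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemplatePosterior(nn.Module):
    """q_{phi_z}(z | y): a diagonal Gaussian N(mu(y), Sigma(y)) over the template variable."""

    def __init__(self, sent_dim, z_dim):
        super().__init__()
        self.mu = nn.Linear(sent_dim, z_dim)
        self.logvar = nn.Linear(sent_dim, z_dim)

    def forward(self, sent_repr):
        mu, logvar = self.mu(sent_repr), self.logvar(sent_repr)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Closed-form KL( N(mu, Sigma) || N(0, I) ) for diagonal Gaussians
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl
```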
Preserving-Template Loss. Without any supervision, the ELBO loss alone does not guarantee learning a good template representation space. Inspired by work on style transfer (Hu et al., 2017b; Shen et al., 2017; Bao et al., 2019; John et al., 2018), an auxiliary loss is introduced to embed the template information of sentences into the template variable $z$.
Given the table, we can roughly align the tokens in the sentence with the records in the table. By replacing these aligned tokens with a special placeholder token, we remove the content information from the sentence and obtain a sketchy sentence template, denoted as $\tilde{y}$. We introduce the preserving-template loss $\mathcal{L}_{pt}$ to ensure that the latent variable $z$ contains only template information.
$$\mathcal{L}_{pt}(x, y, \tilde{y}) = -\mathbb{E}_{q_{\phi_z}(z|y)}\left[\log p_\eta(\tilde{y}|z)\right] = -\mathbb{E}_{q_{\phi_z}(z|y)}\left[\sum_{t} \log p_\eta(\tilde{y}_t|z, \tilde{y}_{<t})\right]$$
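A minimal sketch of the delexicalization step that produces $\tilde{y}$ from an aligned table-sentence pair. The placeholder string `"<ent>"` and the exact value-matching rule are assumptions; the paper only specifies that aligned tokens are replaced by a special token.

```python
def delexicalize(sentence: str, table: dict, placeholder: str = "<ent>") -> str:
    """Replace sentence tokens that match table values with a placeholder token,
    yielding the sketchy template y-tilde used by the preserving-template loss."""
    values = {tok.lower() for value in table.values() for tok in value.split()}
    tokens = [placeholder if tok.lower() in values else tok
              for tok in sentence.split()]
    return " ".join(tokens)

table = {"name": "nameVariable", "food": "Japanese", "area": "riverside"}
print(delexicalize("nameVariable is a Japanese restaurant in riverside .", table))
# -> "<ent> is a <ent> restaurant in <ent> ."
```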