# Open-Domain Text Evaluation via Contrastive Distribution Methods

Sidi Lu (1), Hongyi Liu (2), Asli Celikyilmaz (3), Tianlu Wang (3), Nanyun Peng (1)

(1) Department of Computer Science, University of California, Los Angeles; (2) Shanghai Jiao Tong University; (3) Meta FAIR. Correspondence to: Sidi Lu, Nanyun Peng.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Recent advancements in open-domain text generation, driven by the power of large pre-trained language models (LLMs), have demonstrated remarkable performance. However, assessing these models' generation quality remains a challenge. In this paper, we introduce a novel method for evaluating open-domain text generation called Contrastive Distribution Methods (CDM). Leveraging the connection between increasing model parameters and enhanced LLM performance, CDM creates a mapping from the contrast of two probabilistic distributions (one known to be superior to the other) to quality measures. We investigate CDM for open-domain text generation evaluation under two paradigms: 1) Generative CDM, which harnesses the contrast of two language models' distributions to generate synthetic examples for training discriminator-based metrics; 2) Discriminative CDM, which directly uses distribution disparities between two language models for evaluation. Our experiments on coherence evaluation for multi-turn dialogue and commonsense evaluation for controllable generation demonstrate CDM's stronger correlation with human judgment than existing automatic evaluation metrics, highlighting the strong performance and generalizability of our approach.[1]

[1] Code: https://github.com/PlusLabNLP/CDM

1. Introduction

In recent years, open-domain text generation, fueled by large pretrained generative language models (LLMs), has made significant advancements, garnering substantial attention (Radford et al., 2018; 2019; Brown et al., 2020; OpenAI, 2022; 2023). These systems have showcased remarkable capabilities, such as producing human-like responses, contributing to natural language comprehension, and even performing complex tasks like programming and content generation. With this empirical success, the development of reliable and scalable automatic evaluation metrics for these models becomes imperative, yet the problem remains an unresolved challenge.

Existing automatic evaluation metrics from the pre-LLM era have their respective limitations. Specifically, reference-based statistical metrics (e.g., BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee & Lavie, 2005)) do not work well for open-ended generation problems with high content diversity, such as storytelling (Yao et al., 2019) and dialogue systems (Mesgar et al., 2019; Li et al., 2017; Wen et al., 2016): for these tasks, it is challenging, if not impossible, to collect a sufficiently large number of reference examples to represent the distribution of all feasible outputs. Accordingly, prior works have shown their low correlation with human judgments (Liu et al., 2016; Hu et al., 2020). With recent progress in pretrained models, model-based reference metrics such as BERTScore (Zhang et al., 2019) and BLEURT (Sellam et al., 2020) have been proposed to facilitate automatic evaluation for text generation.
They alleviate the sample-efficiency issue of statistical reference-based methods by using pretrained models to compute similarities between texts based on higher-level semantics. However, the effectiveness of such methods still relies on the representativeness of the reference set, and thus falls short when the output semantic space is also highly diverse.

Reference-free evaluation metrics, which assess text directly and provide a quality score, offer a more flexible solution for automatically evaluating open-domain text generation. There are two major paradigms for training models to evaluate texts without references: 1) Discriminator-based approaches, such as ADEM (Lowe et al., 2017) and DEAM (Ghazarian et al., 2022), treat the problem as a prediction task. They train a classifier or regressor to produce a score as the quality assessment. However, these methods typically require extensive human annotations or involve dedicated manual designs for generating negative samples to train the classifier. 2) Distribution/divergence-based approaches (Pillutla et al., 2021; Pimentel et al., 2022) focus on obtaining a continuous divergence score between distributions. These approaches have shown promising results in system-level evaluations. However, they often face challenges in accurately assigning credit to individual data points, limiting their ability to perform instance-level evaluations.

Figure 1. Conceptual illustration of the Contrastive Distribution Methods (CDM). (a) Generative CDM generates negative examples (e.g., turning "Have you watched Love, Death and Robots?" into "Have you worked for Love, Death and Robots?") for training a discriminator-based metric. (b) Discriminative CDM directly evaluates the distribution/sequence by contrasting and pooling the step-wise likelihood scores of the expert-model and amateur-model distributions.

In this paper, we propose Contrastive Distribution Methods (CDM), a general and reference-free framework for evaluating open-domain text generation. CDM operates on an intuitive yet broadly applicable premise: models with similar architectures but varying sizes generally exhibit improved performance as model size increases. Consequently, CDM is designed to capture the dynamics of model performance as it scales with the increasing number of parameters. Utilizing such dynamics, CDM contrasts two language models' distributions and conducts inference in both generative and discriminative manners to create automatic evaluation metrics. Specifically, Generative CDM, as illustrated in the upper right corner of Figure 1, produces effective negative samples to facilitate the learning of discriminator-based evaluation metrics without requiring additional human annotations or a sophisticated design for the data generation process. Discriminative CDM, as illustrated in the lower right corner of Figure 1, provides a distribution-level measurement of quality for each instance, and thus yields reliable distribution-based metrics without compromising instance-level evaluation performance.
Experiments on open-domain dialogue evaluation and commonsense keywords-to-text evaluation demonstrate the strong performance of CDMs, which consistently outperform strong baselines such as G-Eval (Liu et al., 2023a) in terms of correlation with human judgments across datasets.

2. Background and Related Works

Open-Domain Text Evaluation. There has been a steadily growing interest in developing robust evaluation methods for open-domain text generation models. Traditional evaluation metrics, such as BLEU and ROUGE, have been shown to be inadequate for assessing the quality of complex, multi-sentence responses generated by these models. As a result, researchers have explored alternative evaluation methods, including human evaluation, adversarial evaluation, and unsupervised metrics. Human evaluation remains the gold standard, but it is time-consuming and costly. Adversarial evaluation, which involves testing models against a set of challenging examples, has shown promise in identifying weaknesses in current models. Unsupervised metrics, such as BERTScore and perplexity, provide quick and automated evaluation, but their correlation with human judgments remains a topic of debate. The field of open-domain text evaluation continues to evolve, and developing reliable evaluation methods will be essential for advancing the state of the art in this area of research.

Discriminator-based Metrics. ADEM (Lowe et al., 2017) is one of the first attempts at training a model to evaluate machine-generated text. It addresses the single-turn dialogue evaluation problem and uses the contextualized representation of the context, in interaction with that of the responses, to train the model. DEAM (Ghazarian et al., 2022) and AMRFact (Qiu et al., 2023) are evaluation metrics that assess open-ended generation models with structured manipulations that create negative samples from positive ones, allowing for a more nuanced assessment of model performance on aspects such as coherence (for dialogue systems) or factuality (for summarization models). Typically, they operate by first parsing the sequence into an abstract meaning representation (AMR), and then manipulating the AMR to introduce inconsistencies and irrelevancies that undermine the coherence of the dialogue. The manipulated AMR is then transformed back into text form for evaluation. This method supports multi-turn dialogue evaluation and has achieved state-of-the-art performance on various benchmark datasets. By using AMR-based semantic manipulations, these methods provide a class of promising approaches for performing automatic evaluation in a more comprehensive and accurate manner. Generative CDM shares a similar process, as it manipulates the positive (true) samples to generate negative samples for training a classifier.

Distribution/Divergence-based Metrics. MAUVE and follow-up works (Pillutla et al., 2021; Pimentel et al., 2022) analyse the quality gap between human-generated and machine-generated text by studying the divergence frontier of human-generated samples in contrast to the learnt model. While their setup is not directly relevant to our approach, it provides an insightful perspective on using the likelihood predictions of LMs for evaluation purposes. Zhong et al. (2022a) propose a multi-dimensional evaluation system for more robust automatic evaluation. It ensembles the scores from a set of discriminator-based metrics, each of which is trained to evaluate a specific, intuitively defined aspect of text quality.
GPTEval (Liu et al., 2023b) tries to quantitatively exploit large language models that are trained with strong human alignment. It uses the score prediction from GPT-4 (OpenAI, 2023) to evaluate how well the given text adheres to human opinion. Discriminative CDM falls under this paradigm, since it serves as a metric with more continuously distributed scores for the evaluated text.

Contrastive Decoding, Contrastive Momentum and ExPO. Contrastive decoding is a decoding algorithm that leverages the strengths of two language models: a stronger expert model and a weaker amateur model. The algorithm decodes towards the objective of maximizing the difference between the log-probabilities of the expert and amateur models, resulting in high-quality generated samples. Specifically, the algorithm tries to decode sequences that maximize the contrastive momentum:

$\log p_e(x) - \log p_a(x)$,    (1)

where $p_e$ and $p_a$ represent the expert and the amateur models, respectively, and $x$ is the generated sample. The original paper (Li et al., 2022) demonstrates that this approach yields higher-quality samples than decoding from the expert model alone. Contrastive decoding provides an insightful way to study the dynamics of how models' capabilities scale with larger parameter counts. The proposed CDM is highly inspired by the contrastive decoding method, yet leverages it for evaluation purposes.

Notably, a recent preference optimization method called ExPO (Zheng et al., 2024) also shares a similar idea. ExPO significantly improves the instruction-following abilities of (open-sourced) large language models without any costly training and with even less data and/or trial sampling from the LLMs. It creates an extrapolation of the human-aligned model (in our notion, the expert) in contrast to its primitive version after only the supervised fine-tuning (SFT) stage (in our notion, the amateur) on instruction-following data. The biggest difference between ExPO and contrastive decoding (or this paper) is that, since the amateur and expert models in ExPO share the same parameter space and are assumed to be very close to each other, ExPO performs the extrapolation directly in the parameter space, instead of the log-probability space used by our approach and the contrastive decoding algorithm.
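To make the objective in Equation 1 concrete, the following is a minimal sketch of one contrastive-decoding step, not the implementation of Li et al. (2022) or of this paper. The expert/amateur checkpoints (gpt2-xl and gpt2) and the simple plausibility cutoff alpha are illustrative assumptions; ExPO would instead interpolate model weights rather than log-probabilities.

```python
# Minimal sketch of greedy decoding under the contrastive objective in Eq. (1).
# Model names and the plausibility cutoff "alpha" are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                      # shared vocabulary
expert = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()  # stronger model p_e
amateur = AutoModelForCausalLM.from_pretrained("gpt2").eval()    # weaker model p_a

@torch.no_grad()
def contrastive_next_token(prefix: str, alpha: float = 0.1) -> str:
    ids = tok(prefix, return_tensors="pt").input_ids
    logp_e = expert(ids).logits[0, -1].log_softmax(-1)   # log p_e(x | prefix)
    logp_a = amateur(ids).logits[0, -1].log_softmax(-1)  # log p_a(x | prefix)
    # Restrict to tokens the expert itself finds reasonably likely, then maximize
    # the contrastive momentum log p_e - log p_a over that plausible set.
    plausible = logp_e >= logp_e.max() + torch.log(torch.tensor(alpha))
    score = torch.where(plausible, logp_e - logp_a, torch.tensor(float("-inf")))
    return tok.decode(int(score.argmax()))

print(contrastive_next_token("Have you watched Love, Death and"))
```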
3. Methodology

3.1. Notations and Problem Formulation

We use $s$ to denote a sequence and $s_i$ to denote the $i$-th token in $s$. $p(s)$ denotes the probability of sequence $s$ under a model $p$. We assume model $p$ is a probabilistic distribution defined on $\Sigma^*$, where $\Sigma$ is the set of valid tokens and $\Sigma^*$ is the universal set of all sequences consisting of such tokens.

Consider an imaginary distribution-level oracle metric $E(p)$ which projects from a model distribution $p(s)$ to a scalar measure of model performance. This function does not necessarily have an analytical form; however, we assume that we have access to some partial order relations it defines. Intuitively, this imaginary oracle $E(p)$ should correlate perfectly with human judgments of model performance, and any evaluation metric that correlates better with human judgments is a better approximation of $E(p)$.

Figure 2. (a) While it is hard to assume a total order for models from different model classes under the oracle metric $E(p)$, it is plausible to assume partial orders for models from the same model class. (b) Generative CDM uses the degraded distribution $p_n$ to synthesize fake samples for training a discriminator as the metric. The warm/cold region indicates the decision boundary of the resulting trainable metric induced by fake samples from $p_n$. (c) Discriminative CDM directly determines the decision boundary by pooling the values of the step-wise contrastive momentum.

With the notion of the oracle $E(p)$, we can perform:

Discriminative inference: a) Distribution-level evaluation, to evaluate any existing models by ranking them according to $E(p)$; b) Sample-level evaluation, to use $\partial E(p) / \partial p(s)$ to reflect the quality of $s$, because given the evaluated sequence $s$, $\partial E(p) / \partial p(s)$ represents whether and how much altering the model $p$ towards higher $p(s)$ would improve $E(p)$.

Generative inference: to improve or degrade the generation quality by altering $p$ towards better or worse values of $E(p)$. The altered distribution produces more obfuscating fake examples, which can then be used to train discriminator-based sample-level evaluation metrics.

In the following, we explain the discriminative and generative inference of CDM for the automatic evaluation of open-domain generation in more detail.

3.2. The Partial Order Assumption

While it is nontrivial to come up with analytical forms for $E(p)$, we can make some assumptions to obtain partial orders from $E(p)$. Consider a series of models that share similar architectures and other pretraining/finetuning setups, but differ in model sizes (e.g., T5-small/base/large, etc.). It is usually safe to assume that the model with a larger number of parameters performs better than the smaller one under most aspects. More formally, we can assume a partial order (a linear order within one concerned model class) induced by the oracle metric $E(p)$, as illustrated in Equation 2 and Figure 2(a):

$E(p_{\text{small}}) < E(p_{\text{base}}) < E(p_{\text{large}})$    (2)

Limitation. Note that, while the partial order assumption usually holds for most existing model families in empirical practice, we are open to the possibility that it might not hold in some cases. As a result, the effectiveness of the proposed approach is inherently limited to cases where the partial order assumption holds.

3.3. First-Order Approximation of E(p)

Since we do not assume knowledge about the analytical form of $E(p)$, it is intractable to compute $\partial E(p) / \partial p(s)$. However, following a similar approach as in (Li et al., 2022), we can approximate $E(p)$ using a secant hyperplane between two distributions in the range of $E(p)$, i.e., the amateur distribution $p_a$ and the expert distribution $p_e$. In other words, we approximate $E(p)$ using the following analytic form:

$E(p) \approx \sum_{s} \big(\log p_e(s) - \log p_a(s)\big)\, p(s)$    (3)

It is trivial to prove that this approximation ensures $E(p_e) > E(p_a)$. We can further define the contrastive momentum $m(s) = \log p_e(s) - \log p_a(s)$. Different choices of $p_a$ and $p_e$ result in different contrastive momenta, and thus first-order approximations of $E(p)$ of different quality and different performance of the resulting evaluation metric. We investigate the general principle for choosing the expert and amateur distributions in the experiment section.
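As a minimal sketch of how this surrogate could be used (not the experimental setup of this paper), one can read Equation 3 as an expectation under the evaluated model $p$ and estimate it by Monte-Carlo sampling. The GPT-2 expert/amateur checkpoints, sample count, and decoding settings below are all assumptions for illustration.

```python
# Sketch: Monte-Carlo estimate of the surrogate E(p) ≈ Σ_s (log p_e(s) - log p_a(s)) p(s)
# by drawing samples s ~ p from the evaluated model. Checkpoints/settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
expert = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()   # p_e
amateur = AutoModelForCausalLM.from_pretrained("gpt2").eval()     # p_a

@torch.no_grad()
def seq_logprob(model, ids: torch.Tensor) -> float:
    """Sum of log p(s_i | s_<i) over all tokens of the sequence."""
    logp = model(ids).logits[0, :-1].log_softmax(-1)
    targets = ids[0, 1:].unsqueeze(-1)
    return logp.gather(-1, targets).sum().item()

@torch.no_grad()
def contrastive_momentum(text: str) -> float:
    """m(s) = log p_e(s) - log p_a(s), the slope of the secant surrogate at s."""
    ids = tok(text, return_tensors="pt").input_ids
    return seq_logprob(expert, ids) - seq_logprob(amateur, ids)

@torch.no_grad()
def estimate_E(evaluated_model, prompt: str, n_samples: int = 16) -> float:
    """Distribution-level score: average m(s) over samples s ~ p(· | prompt)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    outs = evaluated_model.generate(ids, do_sample=True, max_new_tokens=40,
                                    num_return_sequences=n_samples,
                                    pad_token_id=tok.eos_token_id)
    texts = [tok.decode(o, skip_special_tokens=True) for o in outs]
    return sum(contrastive_momentum(t) for t in texts) / n_samples
```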
3.4. Contrastive Distribution Methods

3.4.1. Generative CDM

Generative CDM focuses on synthetic data generation using contrastive distributions. We follow prior works such as ADEM (Lowe et al., 2017) and DEAM (Ghazarian et al., 2022) in formulating reference-free evaluation metrics as prediction tasks. In order to evaluate generated texts, a discriminator can be trained on positive and negative examples to serve as the evaluation metric.[2] However, while we can assume human-written texts are positive examples, negative examples are non-trivial to obtain. Randomly generating negative examples using a uniform distribution over all possible sequences of tokens is not efficient, as most negative examples generated this way would be too trivial. On the other hand, generating negative examples by masking out spans in positive examples and having pretrained large language models fill in the masks may not result in genuinely low-quality texts, which would confuse the discriminator.

To this end, generative CDM provides a controllable approach to reduce the quality of pretrained language models in order to generate deceptive negative examples. Specifically, it generates from a novice distribution $p_n$ that descends along the direction of $\partial E(p) / \partial \log p$ from the amateur model $p_a$, yielding a distribution weaker than the amateur model itself. Applying the approximation in Equation 3, we follow the reversed direction of the contrastive momentum $m = \log p_e - \log p_a$ to degrade the amateur model $p_a$. Mathematically, we obtain a probability distribution $\log p_n \propto \log p_a - \gamma m$ that amplifies the likelihood of machine artifacts at a controllable scale (set by the hyper-parameter $\gamma$). Sampling from $p_n$ allows us to obtain suitable negative examples.

Implementation Details. We hereby discuss how to generate targeted negative examples. We start from existing positive examples $s$ and construct the negative samples by masking out a certain part $s_{M^+}$ of the positive ones, then conduct conditional generation using the remaining part $s \setminus s_{M^+}$ as the initial context. As a result, the generated negative examples are more deceptive than samples drawn directly from $p_n$. To achieve this, we train a segment-infilling model: given a positive example and the position at which a segment is removed (randomly or strategically), we model the conditional distribution that reconstructs the original segment. Once we have trained an expert and an amateur model with segment-infilling capabilities, we can compose the distribution for sampling in the following form:

$\log p_{\text{edit}}(s_M \mid s \setminus s_{M^+}) \propto \log p_a(s_M \mid s \setminus s_{M^+}) - \gamma\, m(s_M \mid s \setminus s_{M^+})$,    (4)

where $m(s_M \mid s \setminus s_{M^+}) = \log p_e(s_M \mid s \setminus s_{M^+}) - \log p_a(s_M \mid s \setminus s_{M^+})$. This enables us to flexibly generate targeted negative examples that are deceptive. Figures 3(a) and 2(b) illustrate this process in the procedural and distributional views. The full process of generative CDM is summarized in Algorithm 1.

[2] The discriminator does not necessarily need to provide binary decisions; it can also produce scores. We use binary examples for simplicity.

Figure 3. A more detailed illustration of the two Contrastive Distribution Methods (CDM). (a) Generative CDM constructs fake negative samples from positive ones (e.g., infilling the masked span of "Have you ______ Love, Death and Robots?" with the degraded distribution) for training a discriminator-based metric. (b) Discriminative CDM directly evaluates the distribution/sequence by contrasting and aggregating the step-wise likelihood scores (e.g., via max or average pooling).

Algorithm 1 Generative CDM
1: Train the amateur model $p_a$ to solve the segment insertion problem
2: Train the expert model $p_e$ to solve the segment insertion problem
3: Construct the contrastive momentum $m_{a \to e} = \log p_e - \log p_a$
4: Construct the degraded distribution $\log p'_a \propto \log p_a - \gamma m_{a \to e}$
5: negativeSamples = {}
6: for positive sample $s^+$ in positiveSamples do
7:   Remove a segment $e^+ \subset s^+$ from $s^+$ to construct the context $c = s^+ \setminus e^+$
8:   Regenerate a segment $e^-$ in the same position using $p'_a(e^- \mid c)$
9:   Obtain the reconstructed negative sample $s^- = c \oplus e^-$
10:  Add $s^-$ to negativeSamples
11: end for
12: Train the metric model $D$ as a discriminator with {negativeSamples, positiveSamples}
13: return metric $D$
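The following is a minimal sketch of the degraded infilling distribution in Equation 4 and lines 4 and 8 of Algorithm 1. It is an illustration only: the off-the-shelf T5 checkpoints stand in for the trained amateur/expert segment-infilling models, and the sentinel-mask format and $\gamma$ value are assumptions rather than the paper's configuration.

```python
# Sketch of sampling a deceptive replacement segment from the degraded distribution
# log p_edit ∝ log p_a - γ·(log p_e - log p_a) of Eq. (4). Off-the-shelf T5 checkpoints
# stand in for the trained segment-infilling amateur/expert models.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")                     # shared vocabulary
amateur = AutoModelForSeq2SeqLM.from_pretrained("t5-small").eval()  # p_a
expert = AutoModelForSeq2SeqLM.from_pretrained("t5-large").eval()   # p_e

@torch.no_grad()
def degraded_infill(masked_context: str, gamma: float = 1.0, max_len: int = 12) -> str:
    enc = tok(masked_context, return_tensors="pt")
    dec = torch.tensor([[amateur.config.decoder_start_token_id]])
    for _ in range(max_len):
        logp_a = amateur(**enc, decoder_input_ids=dec).logits[0, -1].log_softmax(-1)
        logp_e = expert(**enc, decoder_input_ids=dec).logits[0, -1].log_softmax(-1)
        # Move away from the expert: amplify the amateur's machine artifacts.
        logp_n = logp_a - gamma * (logp_e - logp_a)
        next_id = torch.multinomial(logp_n.softmax(-1), num_samples=1)
        dec = torch.cat([dec, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(dec[0], skip_special_tokens=True)

# e.g., corrupt the masked span of a positive example (T5 sentinel as the mask):
print(degraded_infill("Have you <extra_id_0> Love, Death and Robots?"))
```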
3.4.2. Discriminative CDM

Although Generative CDM is a reasonably flexible and scalable framework, there are many variable factors in the generation process (e.g., how to choose which segment to remove, the degradation strength factor $\gamma$, etc.) that may affect the quality of the resulting data and thus of the evaluation metrics. Therefore, we propose an alternative paradigm under the CDM framework that removes the generation subroutine completely.

In generative CDM, after data generation, we train a discriminator to distinguish positive and negative examples as the evaluation metric. Effectively, we are learning the boundary between positive and negative samples, because we usually do not have a tractable model for the positive or negative distribution. However, under the CDM framework, we do have a tractable model for the negative distribution, which is composed from the amateur model $p_a$ and the expert model $p_e$. In light of this, we can consider directly deploying $m$ as a divergence-based metric for evaluation. For each sequence, we collect the step-wise contrastive momentum $m(x_t \mid s_{<t})$ at every position $t$ and pool these values (e.g., via max or average pooling, as illustrated in Figure 3(b)) into a sequence-level score.
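A minimal sketch of this instance-level scoring is given below. It is an illustration under assumptions: the GPT-2 expert/amateur pair and the mean/max pooling choices stand in for whatever model pair and aggregation a practitioner selects; the paper's actual configuration is described in its experiment section.

```python
# Sketch of Discriminative CDM: per-token contrastive momentum pooled into a
# sequence-level score. Expert/amateur checkpoints and pooling choice are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
expert = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()   # p_e
amateur = AutoModelForCausalLM.from_pretrained("gpt2").eval()     # p_a

@torch.no_grad()
def stepwise_momentum(text: str) -> torch.Tensor:
    """m(x_t | s_<t) = log p_e(x_t | s_<t) - log p_a(x_t | s_<t) at every position t."""
    ids = tok(text, return_tensors="pt").input_ids
    tgt = ids[0, 1:].unsqueeze(-1)
    lp_e = expert(ids).logits[0, :-1].log_softmax(-1).gather(-1, tgt).squeeze(-1)
    lp_a = amateur(ids).logits[0, :-1].log_softmax(-1).gather(-1, tgt).squeeze(-1)
    return lp_e - lp_a

def cdm_score(text: str, pooling: str = "mean") -> float:
    """Pool the step-wise momenta (mean or max, as in Figure 3(b)) into one score."""
    m = stepwise_momentum(text)
    return (m.max() if pooling == "max" else m.mean()).item()

print(cdm_score("Have you heard Love, Death and Robots?"))
```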