# Language Modelling via Learning to Rank

Arvid Frydenlund (1,2), Gagandeep Singh (3), Frank Rudzicz (1,2,4)
(1) Department of Computer Science, University of Toronto
(2) Vector Institute for Artificial Intelligence
(3) Nuance Communications Inc.
(4) Unity Health Toronto
arvie@cs.toronto.edu, gagandeep.singh1@nuance.com, frank@cs.toronto.edu

We consider language modelling (LM) as a multi-label structured prediction task by re-framing training from solely predicting a single ground-truth word to ranking a set of words which could continue a given context. To avoid annotating top-k ranks, we generate them using pre-trained LMs: GPT-2, BERT, and Born-Again models. This leads to a rank-based form of knowledge distillation (KD). We also develop a method using N-grams to create a non-probabilistic teacher which generates the ranks without the need for a pre-trained LM. We confirm the hypotheses that we can treat LMing as a ranking task and that we can do so without the use of a pre-trained LM. We show that rank-based KD generally improves perplexity (PPL), often with statistical significance, when compared to Kullback-Leibler-based KD. Surprisingly, given the simplicity of the method, the N-grams act as competitive teachers and achieve performance similar to using either BERT or a Born-Again model as the teacher. GPT-2 always acts as the best teacher, though, and using it with a Transformer-XL student on Wiki-02, rank-based KD reduces perplexity from a cross-entropy baseline of 65.27 to 55.94, compared to 56.70 for KL-based KD.

## 1 Introduction and Motivation

More often than not, there are many ways to say the same thing. For example, *the cat sat on the mat* is semantically equivalent to *the cat sat on the rug* in most contexts. This basic fact about language is ignored when training language models, where we try to maximize the probability of predicting a single ground-truth word for a given context and penalize all other words regardless of any potential semantic equivalence. In particular, a word-level language model (LM) defines a distribution over $T$ discrete tokens, $x_1, \dots, x_T$, as

$$p(x_1, \dots, x_T \mid x_0) = \prod_{t=1}^{T} p(x_t \mid x_{<t}).$$

Figure 1: Branching sets created for the ground-truth *Adams River is a tributary to the Thompson and Fraser Rivers ...* using the ground-truths ($O_{gt}$) and four orders ($O_5$, $O_4$, $O_3$, $O_2$) from the Wiki-02 training partition. The branching sets for *tributary* and *Thompson* are only the ground-truths since all other orders have been pruned.

Any word that continues those N-gram contexts will be in the branching set (BS). Then, in order to derive artificial rank ground-truths (GTs) from the BS, we can give a weak ordering to the words in the BS by specifying that words in the BS with a larger context should have a higher rank relevance than those with a smaller context. If, for example, we consider bigrams and trigrams, then any word that continues the trigram context will also continue the bigram context, but words which only continue the bigram context and not the trigram context will be less relevant. Unigrams are excluded since they use no context.
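To make this construction concrete before it is formalized in the next paragraph, here is a minimal Python sketch of assembling a branching set from pre-computed N-gram continuation counts, using past context only. The count-table layout, the default maximum order of 5, the cut-off value of 30, and all names are illustrative assumptions rather than details taken from the paper.

```python
from collections import Counter

def branching_set(tokens, t, ngram_counts, max_order=5, cutoff=30):
    """Sketch: collect the branching set for position t of a token sequence.

    ngram_counts[m] is assumed to map a context tuple of m - 1 words to a
    Counter of the words that followed that context in the training corpus.
    Returns (word, order) pairs; the ground truth comes first, then words
    from longer contexts, so list position reflects rank relevance.
    """
    gt = tokens[t]                        # O_gt: the original ground-truth word
    branching, seen = [(gt, "gt")], {gt}
    for m in range(max_order, 1, -1):     # orders O_M, ..., O_2 (unigrams excluded)
        if t < m - 1:                     # not enough past context for this order
            continue
        context = tuple(tokens[t - (m - 1):t])
        followers = ngram_counts.get(m, {}).get(context, Counter())
        order_m = [w for w in followers if w not in seen]
        if len(order_m) >= cutoff:        # prune uninformative, overly large orders
            continue
        branching.extend((w, m) for w in order_m)
        seen.update(order_m)
    return branching
```

In this sketch, list position stands in for rank relevance: the ground truth first, then words from longer contexts. Ties within an order are left unresolved, which is exactly the weak ordering that the formal construction below refines with future context.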
More formally, we construct a tuple of sets referred to as orders, $O = (O_{gt}, O_M, \dots, O_2)$, which will be merged to form the BS. $O_{gt} = \{x_t\}$ contains only the original GT. The other orders are made up of all words which continue a specific context length, such that we construct each of the orders as $O_m = \{x \mid x \in V \setminus O_{>m},\ C(c_t(m), x) \geq 1\}$, where $m \in \{M, \dots, 2\}$ and $C(c_t(m), x)$ is the count of how many times $x$ appears in the context $c_t(m)$. Note that we exclude words already included in previous orders. We also use a cut-off, $q$, so that we only consider sufficiently small orders, $|O_m| < q$, since large BSs will not be informative; e.g., all words which follow *the* or *is a*. The BS is constructed as an ordered tuple, $b_t = (x \mid x \in O_i,\ O_i \in O)$. The artificial GT ranks are then specified as the weak rank ordering of the BS, which is weak since each $O_m$ may contain multiple unordered words.

We can further resolve the rank within each order by using future context in addition to past context. This assumes that, had we had access to longer past contexts, the ranks would have resolved in the same way as when using future information. Note that we can introduce future information here because the teacher model does not need to form a valid probability distribution, since it only defines targets for the student model. Let $c_t(n_p, n_f)$ be the context defined by using $n_p - 1$ words in the past and $n_f - 1$ words in the future. We prioritize past context over future context, since that is what the student model will be able to observe as input. This is done by specifying the orders by a reverse orthographical sorting of past and future context lengths, i.e., $c_t(3, 4) \succ c_t(3, 3) \succ c_t(2, 4)$. This does not completely resolve the weak ordering, though, since two words may share both the same past and future contexts. Figure 1 shows an example BS and Figure 2 shows how it was constructed using future information.

Figure 2: BS construction for the example in Figure 1 using four orders, 2p-1f, 2p, 1p-1f, 1p, and the GTs, where 2p-1f indicates an order which matches two past words and one future word. We show the branching sets for *to* and *Fraser*, centred in dashed lines, and the contexts that selected those words in solid lines. The table has been truncated.

## 2.2 Plackett-Luce Rank Loss

Let us assume that the BS, $b_t$, is strongly ordered by some preferred (partial) rank over $k \leq |V|$ words in $V$ for step $t$ and that the GT word has the first rank. Let $y_t$ be the set of words in $V$ and say that $b_t(m)$ indexes the element of rank $m$ according to $b_t$, i.e., $b_t$ defines a rank permutation. If we assume that $y_t$ and $w_t$ are sorted by rank, we can drop the indexing in the following section. Given strongly ordered top-$k$ ground-truths, we train our LM by learning to rank with the top-$k$ Plackett-Luce (PL) rank loss (Plackett 1975; Cao et al. 2007; Xia, Liu, and Li 2009). The PL loss models the data as a Plackett-Luce distribution, where for any time step $t$,

$$p_t(y_t^1 \succ \cdots \succ y_t^k) = \prod_{i=1}^{k} p_t(y_t^i \mid y_t^1, \dots, y_t^{i-1}) = \prod_{i=1}^{k} \frac{e^{w_t^i}}{\sum_{j=i}^{|V|} e^{w_t^j}}. \quad (3)$$

The PL distribution represents a generative process of sampling from a softmax distribution without replacement, where we re-normalize the distribution by excluding previous samples at every ranking step. The loss is then defined as the negative log-likelihood of equation 3,

$$-\log p_t(y_t^1 \succ \cdots \succ y_t^k) = -\sum_{i=1}^{k} \log \frac{e^{w_t^i}}{\sum_{j=i}^{|V|} e^{w_t^j}} = \sum_{i=1}^{k} \left( \log Z_t^i - w_t^i \right),$$

where $Z_t^i = \sum_{j=i}^{|V|} e^{w_t^j}$.
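The following is a minimal PyTorch-style sketch of this top-$k$ PL negative log-likelihood, assuming the branching-set indices are already strongly ordered (ground truth first) and contain no duplicates; the function name and tensor layout are illustrative assumptions, not the authors' implementation.

```python
import torch

def plackett_luce_topk_nll(logits, ranked_idx):
    """Top-k Plackett-Luce NLL for a batch of time steps (a sketch).

    logits:     (B, V) unnormalised scores w_t over the vocabulary.
    ranked_idx: (B, k) vocabulary indices of the branching set, sorted by
                target rank with the ground truth at position 0, no duplicates.
    """
    B, V = logits.shape
    k = ranked_idx.shape[1]

    # Build a permutation that puts the k ranked words first and the rest after.
    mask = torch.ones(B, V, dtype=torch.bool, device=logits.device)
    mask.scatter_(1, ranked_idx, False)               # False at ranked positions
    rest = logits[mask].view(B, V - k)                # unranked scores, any order
    ranked = logits.gather(1, ranked_idx)             # (B, k) scores in rank order
    w = torch.cat([ranked, rest], dim=1)              # (B, V) rank-sorted scores

    # log Z_t^i = logsumexp over positions j >= i, via a reversed cumulative
    # logsumexp; only the first k normalisers are needed.
    rev = torch.flip(w, dims=[1])
    log_Z = torch.flip(torch.logcumsumexp(rev, dim=1), dims=[1])[:, :k]

    # L_t = sum_{i=1}^{k} (log Z_t^i - w_t^i), averaged over the batch.
    return (log_Z - ranked).sum(dim=1).mean()
```

Note that for $k = 1$ the sum has a single term, $\log Z_t^1 - w_t^1$, which is exactly the standard cross-entropy loss on the single ground-truth word.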