Published as a conference paper at ICLR 2024

A BRANCHING DECODER FOR SET GENERATION

Zixian Huang, Gengyang Xiao
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
{zixianhuang, gyxiao}@smail.nju.edu.cn

Yu Gu
The Ohio State University, Columbus, USA
gu.826@osu.edu

Gong Cheng
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
gcheng@nju.edu.cn

Corresponding author.

ABSTRACT

Generating a set of text is a common challenge for many NLP applications, for example, automatically providing multiple keyphrases for a document to facilitate user reading. Existing generative models use a sequential decoder that generates a single sequence successively, and the set generation problem is converted to sequence generation by concatenating multiple texts into one long text sequence. However, the elements of a set are unordered, which makes this scheme suffer from biased or conflicting training signals. In this paper, we propose a branching decoder, which can generate a dynamic number of tokens at each time-step and branch multiple generation paths. In particular, paths are generated individually so that no order dependence is required. Moreover, multiple paths can be generated in parallel, which greatly reduces the inference time. Experiments on several keyphrase generation datasets demonstrate that the branching decoder is more effective and efficient than the existing sequential decoder.

1 INTRODUCTION

In the past few years, the research on generative models has greatly promoted the development of AI. Pioneering studies, such as those represented by BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), demonstrate that tasks with diverse structures can be seamlessly transformed into a unified text-to-text framework. This development has allowed generative models to transition from a task-specific orientation to a more versatile, general-purpose capability. The introduction of models such as GPT-3 (Brown et al., 2020) further solidified this trend within both academic and industrial spheres. Consequently, there has been a pronounced push towards delving into more nuanced tasks that generative models can address. A prime example is set generation (Zhang et al., 2019; Madaan et al., 2022), wherein the model is entrusted with generating a variable number of target sequences. Such functionality is imperative in contexts such as a news application displaying multiple keyphrases for a long document to assist reading (Gallina et al., 2019) or a question answering system offering multiple responses to a user query (Li et al., 2022).

Building on the foundational text-to-text paradigm, many current methods for set generation have naturally evolved to adopt the One2Seq scheme (Yuan et al., 2020; Meng et al., 2021; Madaan et al., 2022). Within this scheme, all text from the set is concatenated into a single, extended sequence, upon which the model is trained for generation. The One2Seq scheme specifically employs a sequential decoder, generating multiple answers successively in an autoregressive fashion (see Figure 1). While several investigations explore optimizing the concatenation order (Ye et al., 2021; Cao & Zhang, 2022), they invariably retain the use of a sequential decoder.

Challenges. However, we argue that the prevailing One2Seq approach is not optimal for set generation, presenting limitations during both the training and inference phases of model development.
The training phase is hampered by the absence of a universally recognized canonical concatenation order. Distinct sequences, resulting from varied arrangements of set elements, can introduce divergent and conflicting training signals. This inconsistency disrupts the optimization trajectory, potentially limiting the model's performance (Meng et al., 2021; Ye et al., 2021). During inference, generating a sequence encompassing all target text proves to be considerably time-consuming, further underscoring the inefficiencies of the current approach (He & Tang, 2022).

Figure 1: A comparison of the traditional sequential decoder and our branching decoder for a case of set generation in the keyphrase generation task.

Our contributions. To address the noted limitations inherent in the sequential decoder, we introduce a novel decoding scheme, termed One2Branch. This scheme is meticulously crafted to enable the generation of multiple sequences within a set concurrently. As shown in the lower segment of Figure 1, our branching decoder generates each keyphrase individually, foregoing the need to concatenate them into a single extended sequence. Our novel design circumvents the challenges of ambiguous concatenation order that beleaguer the sequential decoder. In addition, by generating each sequence in the set separately, the model can capitalize on parallel computation, significantly enhancing generation speed.

Our methodology is anchored by the branching decoder, adept at systematically delineating multiple paths from a specific input sequence, where each path uniquely signifies a sequence within the target set. A cornerstone of our contribution is the introduction of a unified framework that harmoniously integrates the ZLPR loss (Su et al., 2022) for model training (Section 3.3) and a threshold-based decoding algorithm for inference (Section 3.4). This fusion ensures that, at every decoding step, our model can selectively identify a dynamic set of tokens with logits exceeding a designated threshold (i.e., 0), facilitating the emergence of new generative branches. In a thorough evaluation spanning three representative keyphrase generation benchmarks, One2Branch unequivocally outpaces the established One2Seq techniques, yielding both augmented performance and heightened efficiency. Our methodology, seamlessly integrable with existing generative models, holds potential to redefine the paradigm for set generation using generative models.

Code: https://github.com/nju-websoft/One2Branch

2 RELATED WORK

Text generation. Text generation has made great progress in many aspects in recent years, such as different model architectures (Brown et al., 2020; Lewis et al., 2020; Du et al., 2021) and rich optimization techniques (Li & Liang, 2021; Goodwin et al., 2020; Houlsby et al., 2019). Existing models are usually based on the text-to-text framework (Raffel et al., 2020), i.e., the input and output of the model are unstructured text. Some works adapt the encoder to structured inputs, such as text sets (Izacard & Grave, 2021), tables (Liu et al., 2019), and graphs (Zhao et al., 2020), but their decoder still generates an unstructured text sequence.
Some works (Zhang et al., 2018; Wen et al., 2023) use multiple decoders to generate different information, but they essentially adopt a multi-view scheme to obtain a single enhanced text sequence rather than multiple texts as our branching decoder does. Some other works (Vijayakumar et al., 2016; Holtzman et al., 2020) study decoding strategies to obtain multiple diversified text outputs, but these unsupervised strategies need a manually specified number of generated paths. In contrast, our branching decoder is optimized to generate multiple sequences in the training stage, and the decoding strategy in the inference stage is consistent with the behavior of training, so as to automatically determine the number of sequences generated.

Set generation. Many tasks in the NLP community can be regarded as set generation, such as keyphrase generation (Yuan et al., 2020; Ye et al., 2021), named entity recognition (Tan et al., 2021), multi-label classification (Yang et al., 2018; Cao & Zhang, 2022), and multi-answer question answering (Huang et al., 2023a;b). Early work used the One2One scheme (Meng et al., 2017) to deal with set generation, which couples each target sequence with its input sequence to form an individual training sample. However, One2One relies on manually setting a beam size to generate sequences, which results in either sacrificing recall to only generate a small number of sequences with the highest probability, or sacrificing precision to sample many sequences with a large beam size (Meng et al., 2021). Similar limitations also appear in some work based on non-autoregressive decoders (Tan et al., 2021; Sui et al., 2021) that can only generate a set of a fixed size. Current research on set generation is mainly based on the One2Seq scheme (Yuan et al., 2020), which concatenates multiple sequences into one long sequence. However, since a set is unordered, this scheme easily introduces order bias during training (Meng et al., 2021). The main idea to alleviate this problem is to reduce the order bias in training by finding a more reasonable concatenation order, which includes designing heuristic rules (Yang et al., 2018), using the reward feedback of reinforcement learning (Yang et al., 2019), and selecting the optimal matching between the generated sequences and the target sequences to calculate the loss (Ye et al., 2021; Cao & Zhang, 2022). Another way to enhance One2Seq is data augmentation, which adds sequences representing different concatenation orders to the training set (Madaan et al., 2022). Although these methods have been verified to alleviate the problem of order bias during One2Seq's training on some tasks, it is arguable whether a particular optimal order exists for tasks such as keyphrase generation. Compared with the above work, our One2Branch scheme can not only generate an unfixed number of sequences like One2Seq, but is also untroubled by the unordered nature of the set.

3 ONE2BRANCH: A BRANCHING DECODER

3.1 OVERVIEW

Given an input sequence X = X_1, ..., X_l with l tokens, the goal of set generation is to generate a target set Y = {Y_1, ..., Y_n} containing n target sequences, where each text output Y_i = Y_{i,1}, ..., Y_{i,m_i} contains m_i tokens. A vocabulary V = V_1, ..., V_µ containing µ tokens is predefined and all generated tokens are selected from it. The operator Index(·) is used to map tokens to their index positions in the vocabulary, i.e., Index(V_i) = i.
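To make the contrast between the two schemes concrete before describing our method, the following toy sketch (in Python, with invented keyphrases) shows how One2Seq linearizes a target set into one delimiter-joined sequence, whereas One2Branch keeps each element as an independent target. It is purely illustrative and not taken from any released implementation.

```python
# Toy contrast between the two target constructions (illustrative only; keyphrases are invented).
target_set = ["branching decoder", "set generation", "keyphrase generation"]

# One2Seq: concatenate the set into one long sequence with a delimiter. Any permutation of the
# elements is an equally "valid" target, which is the source of the order-bias problem.
one2seq_target = " <sep> ".join(target_set)
# -> "branching decoder <sep> set generation <sep> keyphrase generation"

# One2Branch: each element of the set is kept as an independent target sequence
# (one generation path each), so no concatenation order has to be chosen.
one2branch_targets = [phrase.split() for phrase in target_set]
# -> [["branching", "decoder"], ["set", "generation"], ["keyphrase", "generation"]]
```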
Figure 2 gives an example of the generation strategy of our branching decoder, containing 3 generation paths that correspond to 3 target sequences. Compared with existing generative models using a sequential decoder, the branching decoder has two characteristics: (1) generating a dynamic number of tokens at each time-step, and (2) generating multiple target sequences in parallel. Specifically, unlike the sequential decoder, which can only select a fixed beam number of tokens, the branching decoder uses a thresholding strategy that allows selecting a dynamic number of tokens, so one or more new generation paths can be branched at each time-step. As shown in Figure 2, the branching decoder generates 2-3-3-1 tokens at time-steps 1 to 4, respectively. Benefiting from the ability to dynamically branch, multiple target sequences can be generated in parallel. In Figure 2, the branching decoder uses 1-2-3-1 decoders to generate multiple target sequences in parallel at time-steps 1 to 4. Compared with the traditional sequential decoder, which needs a total of 12 time-steps (including 2 separating tokens) to generate all target sequences, the branching decoder only needs 4 time-steps since multiple sequences can be generated in parallel.

Figure 2: An example of the branching decoder for set generation. It uses 0 as the threshold to select nodes at each time-step and generates the set {"B A", "B C A", "C B"}.

3.2 ARCHITECTURE

Our One2Branch scheme uses the classic encoder-decoder architecture, but the output of the encoder can be fed into multiple parameter-sharing decoders to generate in parallel. A stacked transformer encoder first encodes the input sequence to obtain its hidden state representation:

H_E = Encoder(X),    (1)

where H_E ∈ R^{l×d} and d denotes the dimension of the representation. At the t-th time-step, for the i-th generation path with t−1 generated tokens P_{i,:t−1}, the scores and representations of the 1st to t-th time-steps can be obtained as:

S_{i,:t}, H_{i,:t} = Decoder(P_{i,:t−1}, H_E),    (2)

where Decoder(·,·) is an autoregressive decoder composed of stacked transformer layers, and S_{i,:t} ∈ R^{t×µ} and H_{i,:t} ∈ R^{t×d} contain the generation scores and representations of the t tokens on the i-th path, respectively. For the t-th token, its generation score vector S_{i,t} ∈ S_{i,:t} contains the score of each token in the predefined vocabulary V. Tokens with a score greater than the threshold are selected as generated tokens, and each of them branches a new path at the next time-step.
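As a concrete reference for Equations (1) and (2), the sketch below shows the scoring computation using the Hugging Face Transformers T5 interface (the backbone used in Section 4.2). The variable names, the toy input, and the wiring are our own simplification rather than the released One2Branch code, and the 0-threshold selection at the end is only meaningful after the model has been trained with the loss introduced in Section 3.3.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Equation (1): encode the input sequence once to obtain H_E.
x = tokenizer("a document about branching decoders", return_tensors="pt")
encoder_outputs = model.get_encoder()(input_ids=x.input_ids, attention_mask=x.attention_mask)

# Equation (2): score one generation path P_{i,:t-1}. The same encoder states H_E are shared by
# every path; the decoder runs autoregressively over the path's prefix.
prefix = tokenizer("branching", add_special_tokens=False, return_tensors="pt").input_ids
decoder_input_ids = torch.cat(
    [torch.full((1, 1), model.config.decoder_start_token_id), prefix], dim=-1
)
outputs = model(
    encoder_outputs=encoder_outputs,
    attention_mask=x.attention_mask,
    decoder_input_ids=decoder_input_ids,
)
S_i = outputs.logits[0]       # shape (t, vocab_size): generation scores S_{i,:t}
next_token_scores = S_i[-1]   # S_{i,t}: one score per vocabulary token at the current step

# After ZLPR training (Section 3.3), every token scoring above 0 branches a new path.
selected = (next_token_scores > 0).nonzero().squeeze(-1)
```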
In the following, we introduce our training strategy, which enables a fixed threshold to decode a dynamic number of tokens and branch out multiple paths to generate in parallel.

3.3 TRAINING

Our training strategy consists of three parts: sharing decoder, learning threshold, and negative sequence.

Sharing decoder. The branching decoder should generate multiple unordered target sequences in parallel, so we adopt a sharing-decoder approach to train the model. As shown in Figure 3, each target sequence is independently input into a decoder with shared parameters in a teacher forcing manner to obtain the generation score of each token in the vocabulary, which is implemented by Equation (2):

S_i, H_i = Decoder([⟨bos⟩; Y_i], H_E),    (3)

where S_i ∈ R^{(|Y_i|+1)×µ} and H_i ∈ R^{(|Y_i|+1)×d} are the generation scores and representations of all the |Y_i| + 1 tokens on the i-th path, respectively, [·;·] is the concatenation operator, and ⟨bos⟩ is a special token used to represent the beginning of a generated sequence.

Although each sequence is input to a decoder independently, their label sets are related to each other. As illustrated in Figure 3, when the decoder inputs are the common prefix ⟨bos⟩ of all the target sequences, because both "B" and "C" are tokens that can be generated next, the positive label sets are Ω⁺_{i,1} = Index({"B", "C"}) at time-step 1 for every decoder i. Similarly, when the decoder inputs are the common prefix "⟨bos⟩ B" of the target sequences "B A" and "B C A", the positive label sets are Ω⁺_{i,2} = Index({"A", "C"}) at time-step 2 for decoders i = 1 and i = 2. For each target sequence at each time-step, the negative label set is obtained by removing the positive label set from the vocabulary, i.e., Ω⁻_{i,t} = Index(V) \ Ω⁺_{i,t}.

Figure 3: An example of the One2Branch scheme's training strategy. For the target set {"B A", "B C A", "C B"}, One2Branch uses three decoders with shared parameters to learn from each target sequence in the teacher forcing manner. A negative sequence "A B" is also used for training. For each input token, the scores of positive tokens in the vocabulary are optimized to be greater than the threshold 0, and the scores of the remaining negative tokens are optimized to be less than 0.

If two target sequences Y_i and Y_j have a common prefix up to time-step t but differ at time-step t + 1, then the positive label sets of both sequences at time-step t include the indices of both Y_{i,t+1} and Y_{j,t+1}. This approach of constructing labels does not fix the number of positive and negative labels, which requires the model to learn to predict a dynamic number of tokens at each time-step.

Learning threshold. In order to realize the selection of a dynamic number of tokens at each time-step to branch multiple paths, we expect the model to learn a threshold that qualifies tokens for selection. Benefiting from the work of Su et al. (2022) on the multi-label classification task, we introduce the ZLPR loss into the generative model to learn a threshold at 0:

L = Σ_{i=1}^{n} Σ_{t=1}^{m_i+1} [ log(1 + s⁺_{i,t}) + log(1 + s⁻_{i,t}) ],
where s⁺_{i,t} = Σ_{w ∈ Ω⁺_{i,t}} exp(−S_{i,t,w}),  s⁻_{i,t} = Σ_{w ∈ Ω⁻_{i,t}} exp(S_{i,t,w}),    (4)

where n is the size of the target set, m_i is the length of target sequence Y_i, and S_{i,t,w} ∈ R is the generation score of token V_w at time-step t on the i-th path. When using Equation (4) to supervise model training, both s⁺_{i,t}, s⁻_{i,t} ∈ (0, +∞) are optimized to be minimal, meaning the scores of positive tokens in Ω⁺_{i,t} are pushed toward positive infinity, away from 0, and the scores of negative tokens in Ω⁻_{i,t} are pushed toward negative infinity, away from 0, which enables using 0 as the threshold in the inference stage.
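To make Equation (4) concrete, the sketch below is our own small PyTorch re-implementation of the ZLPR loss for one generation path, driven by a boolean mask encoding the dynamic positive label sets Ω⁺_{i,t}. It illustrates the formulation under these assumptions and is not the authors' released code.

```python
import torch

def zlpr_loss(scores: torch.Tensor, pos_mask: torch.Tensor) -> torch.Tensor:
    """ZLPR loss for one generation path, following Equation (4).

    scores:   (T, V) generation scores S_{i,t,w} for T time-steps over a vocabulary of size V.
    pos_mask: (T, V) boolean mask of the positive label sets (True = positive token).
    """
    neg_inf = torch.finfo(scores.dtype).min
    # log(1 + sum(exp(.))) is a logsumexp with one extra implicit 0 term.
    pos_terms = torch.where(pos_mask, -scores, torch.full_like(scores, neg_inf))
    neg_terms = torch.where(~pos_mask, scores, torch.full_like(scores, neg_inf))
    zeros = scores.new_zeros(scores.size(0), 1)
    loss_pos = torch.logsumexp(torch.cat([zeros, pos_terms], dim=-1), dim=-1)  # log(1 + s^+_{i,t})
    loss_neg = torch.logsumexp(torch.cat([zeros, neg_terms], dim=-1), dim=-1)  # log(1 + s^-_{i,t})
    return (loss_pos + loss_neg).sum()

# Toy usage: 2 time-steps over a vocabulary of size 5; time-step 1 has two positive tokens
# (as in the {B, C} example above), time-step 2 has one. Positives above 0 and negatives
# below 0 yield a small loss; an all-False row (as for negative sequences) also works.
scores = torch.tensor([[-3.0,  2.0, 1.5, -1.0, -2.0],
                       [-2.0, -1.0, 3.0, -3.0, -1.5]])
pos_mask = torch.tensor([[False, True,  True,  False, False],
                         [False, False, True,  False, False]])
print(zlpr_loss(scores, pos_mask))
```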
Negative sequence. Recall that a common way to train a decoder is to calculate the loss from positive sequences using teacher forcing, which often leads to exposure bias and affects model generalizability (Bengio et al., 2015). To address this, we incorporate a loss from negative sequences to supervise the model in distinguishing between positive and negative sequences. From the first incorrect token of a negative sequence, for all the subsequent tokens, we construct their label sets as Ω⁺_{i,t} = ∅ and Ω⁻_{i,t} = Index(V). As Figure 3 shows, "A B" is a negative sequence; "A" and "B" at the second and third time-steps are negative tokens with empty positive label sets.

In practice, we train the branching decoder in two stages. In the first stage, only target sequences are used. In the second stage, we first use the model trained in the first stage to predict on the training set and collect the incorrectly generated sequences having the highest scores as hard negatives, and then further train the branching decoder with both target sequences and negative sequences.

3.4 DECODING

Using the above training strategy, as illustrated in Figure 2, the branching decoder can select an unfixed number of tokens with a generation score greater than 0 at each time-step in the inference phase, and dynamically branch out new paths to continue generating in parallel at the next time-step. However, this naive decoding strategy may suffer from insufficient generation, branching out too few paths, or excessive generation, branching out too many paths to fit in memory. In order to avoid such extreme cases, we set a minimum number of explored paths k_min and a maximum number of generated paths k_max, and rely on the average score of the entire path instead of the score of the current token to smooth the result. We show the decoding algorithm of the branching decoder in Algorithm 1.

Algorithm 1 Decoding
Input: X, step_max, k_min, k_max
 1: F, C_F ← ∅, ∅                      // F: the set of finished paths. C_F: the scores of F.
 2: U, C_U ← {⟨bos⟩}, {0}              // U: the set of unfinished paths. C_U: the scores of U.
 3: H_E ← Encoder(X)
 4: t ← 1
 5: while U is not empty AND t ≤ step_max do
 6:   B, C_B ← ∅, ∅                    // B: the set of newly branched paths. C_B: the scores of B.
 7:   S, H ← Decoder(U, H_E)           // Calculate the generation scores of all the paths in U in a batch.
 8:   for k ← 1 to k_max do
 9:     S_{i,t,w}, i, w ← k-th Largest(S_{:,t})   // Return the k-th largest score S_{i,t,w} and its path index i and vocabulary index w at time-step t.
10:     if S_{i,t,w} > 0 or k ≤ k_min then
11:       B.add([U_i; V_w]), C_B.add(C_{U_i} + S_{i,t,w})
12:   F, C_F ← EndWithEos(B, C_B, F, C_F)   // If a path has generated ⟨eos⟩, it is a finished path.
13:   U, C_U ← B \ F, C_B \ C_F
14:   t ← t + 1
15: return F, C_F

For an input sequence X, we first use Equation (1) to obtain its representation H_E ∈ R^{l×d} (line 3), and then iterate up to step_max time-steps to generate paths in parallel (lines 5–14). In each iteration, the representation H_E is repeated |U| times and fed to Equation (3) with the set U to obtain the scores of all possible generated paths S ∈ R^{|U|×t×µ} (line 7). In order to avoid potential memory overflow caused by generating too many paths, we only generate new paths from the k_max highest-scored tokens among all the |U| × µ candidate tokens at time-step t (lines 8–9). Then, the candidate tokens V_w whose generation scores S_{i,t,w} are greater than 0 or are among the top-k_min are appended to path U_i and added to the set B as new paths, and their corresponding generation scores are added to the set C_B (lines 10–11). The paths that end with the terminal symbol ⟨eos⟩ are moved into the set F and their corresponding scores into C_F (line 12), and the rest continue to generate at the next time-step (line 13). Finally, we filter out the paths whose average score is less than 0 to obtain the final generated sequences.
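The following Python sketch mirrors the control flow of Algorithm 1 under simplifying assumptions: a hypothetical score_next_tokens callback stands in for the batched decoder call on line 7, path scores are accumulated additively, and batching and caching details are omitted. It illustrates the thresholded branching loop rather than reproducing the released implementation.

```python
from typing import Callable, List, Tuple

def branching_decode(
    score_next_tokens: Callable[[List[List[int]]], List[List[float]]],  # stands in for Decoder(U, H_E)
    bos_id: int, eos_id: int,
    step_max: int = 20, k_min: int = 8, k_max: int = 60,
) -> List[Tuple[List[int], float]]:
    finished: List[Tuple[List[int], float]] = []                   # F and C_F
    unfinished: List[Tuple[List[int], float]] = [([bos_id], 0.0)]  # U and C_U
    for _ in range(step_max):
        if not unfinished:
            break
        # One score row per unfinished path, over the whole vocabulary (Algorithm 1, line 7).
        scores = score_next_tokens([path for path, _ in unfinished])
        # Rank all (path, token) candidates at this time-step and keep the k_max best (lines 8-9).
        candidates = sorted(
            ((s, i, w) for i, row in enumerate(scores) for w, s in enumerate(row)),
            key=lambda c: c[0], reverse=True,
        )[:k_max]
        branched: List[Tuple[List[int], float]] = []
        for k, (s, i, w) in enumerate(candidates, start=1):
            if s > 0 or k <= k_min:                                # threshold 0, minimum of k_min paths
                path, cum = unfinished[i]
                branched.append((path + [w], cum + s))
        finished += [(p, c) for p, c in branched if p[-1] == eos_id]
        unfinished = [(p, c) for p, c in branched if p[-1] != eos_id]
    # Keep only paths whose average score is non-negative (the filtering step described above).
    return [(p, c) for p, c in finished if c / (len(p) - 1) >= 0]
```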
4 EXPERIMENTAL SETUP

4.1 DATASETS AND EVALUATION METRICS

Keyphrase generation (KG) is a classic task of set generation with rich experimental data. We selected three large-scale KG datasets as our main experimental data: KP20k (Meng et al., 2017), KPTimes (Gallina et al., 2019), and StackEx (Yuan et al., 2020), which are from the fields of science, news, and online forums, respectively. Each dataset contains not only keyphrases that are present in a document but also those that are absent. Dataset statistics are shown in Table 1.

Table 1: Dataset statistics. # KP, |KP|, and % Abs KP refer to the average number of keyphrases per document, the average number of words that each keyphrase contains, and the percentage of absent keyphrases, respectively. All of them are calculated over the dev set.

Dataset   Field     # Train   # Dev   # Test   # KP   |KP|   % Abs KP
KP20k     Science   509 K     20 K    20 K     5.3    2.1    39.8
KPTimes   News      259 K     10 K    20 K     5.0    2.2    56.4
StackEx   Forum     298 K     16 K    16 K     2.7    1.3    46.5

Following Chen et al. (2020), we used macro-averaged F1@5 and F1@M to report the generation performance of present and absent keyphrases. When the number of predicted keyphrases is below 5, F1@5 first appends incorrect keyphrases until there are 5 and then compares with the ground truth to calculate the F1 score defined by Yuan et al. (2020). F1@M is the version of F1@5 without appending incorrect keyphrases, which compares all predicted keyphrases with the ground truth to compute the F1 score.

4.2 IMPLEMENTATION DETAILS

We implemented our One2Branch scheme based on the code of Hugging Face Transformers 4.12.5 (https://github.com/huggingface/transformers) and used T5 (Raffel et al., 2020) as the backbone. For the two stages of training, we trained 15 epochs in the first stage and 5 epochs in the second stage. We set step_max = 20 to ensure that step_max is greater than the length of all keyphrases on all dev sets. We set k_max = 60 to ensure that k_max is greater than the number of keyphrases of any sample on all dev sets. We tuned k_min on each dev set from 1 to 15 to search for the largest sum of all metrics, and the best k_min was used on the test set. On all three datasets, the best performance was achieved with k_min = 8. We followed the setting of Wu et al. (2022), using batch size 64, learning rate 1e-4, maximum sequence length 512, and the AdamW optimizer. We used three seeds {0, 1, 2} and took the mean results. We used gradient accumulation 64, trained One2Branch based on T5-Base (223 M) on a single RTX 4090 (24 GB), and trained One2Branch based on T5-Large (738 M) on two RTX 4090. For inference, we ran both the base and large versions on a single RTX 4090. Unless otherwise stated, the results of our baseline methods came from the work of Wu et al. (2022). We also implemented our One2Branch scheme based on MindSpore 2.0.

5 EXPERIMENTAL RESULTS

5.1 MAIN RESULTS: COMPARISON WITH SEQUENTIAL DECODER

We present the comparison of generation performance, inference throughput, and GPU memory usage between the One2Branch scheme based on the branching decoder and the One2Seq scheme based on the sequential decoder. The implementation of One2Seq was trained with Present-Absent concatenation ordering, using a delimiter ⟨sep⟩ to concatenate the present keyphrases and the absent keyphrases, which has been seen as an effective ordering in previous works (Yuan et al., 2020; Meng et al., 2021).
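Before turning to the results, the sketch below records how we read the F1@5 and F1@M definitions from Section 4.1 for a single document: exact string match against the ground truth, with F1@5 padding short prediction lists with incorrect placeholders and truncating to five. It is an illustrative reading with toy inputs, not the official evaluation script, and it ignores details such as stemming and duplicate handling.

```python
def f1(pred: list, gold: set) -> float:
    # Plain F1 between a list of predicted phrases and the ground-truth set (exact match).
    if not pred or not gold:
        return 0.0
    correct = len(set(pred) & gold)
    if correct == 0:
        return 0.0
    p, r = correct / len(pred), correct / len(gold)
    return 2 * p * r / (p + r)

def f1_at_m(pred: list, gold: set) -> float:
    # F1@M: compare all predicted keyphrases with the ground truth.
    return f1(pred, gold)

def f1_at_5(pred: list, gold: set) -> float:
    # F1@5: if fewer than 5 keyphrases are predicted, pad with incorrect placeholders,
    # then keep only the first 5 predictions.
    padded = (pred + [f"<incorrect-{i}>" for i in range(5 - len(pred))])[:5]
    return f1(padded, gold)

# Toy usage with phrases borrowed from the case study in Appendix A.3 (scores are illustrative).
gold = {"one way hash function", "block cipher", "mac", "cryptography"}
pred = ["block cipher", "message authentication code", "mac"]
print(f1_at_5(pred, gold), f1_at_m(pred, gold))  # F1@5 is penalized by the padding, F1@M is not
```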
We experimented the throughput and GPU memory usage with batch size 1 on a single RTX 4090, and used greedy search to decode One2Seq (see results in Table 2). Generation performance. One2Branch outperforms One2Seq in all F1 scores on absent keyphrase and more than half of F1 scores on present keyphrase. For absent keyphrases, they are completely unordered, and it is difficult for One2Seq to give a reasonable concatenation order to avoid bias during training, while One2Branch is a label order agnostic model which can deal with this problem. The results in Table 2 support our motivation; compared with One2Seq, One2Branch improves by up to 3.5, 11.6, and 10.2 in F1@5 and to 3.6, 6.3, and 9.5 in F1@M on three datasets. It shows that One2Branch has better set generation capabilities. For present keyphrases, they may not be an unordered set in some cases due to the Present-Absent concatenation ordering, which potentially reduces the risk of One2Seq suffering from order bias. For example, the cases whose present keyphrase number is only one are ordered for the present part, and their proportions in the three datasets are 16.7%, 15.6% and 38.1%. Despite this, One2Branch still performs better overall, outperforming One2Seq at least 3.5 in F1@5, and at least 1.7 in F1@M 1https://github.com/huggingface/transformers Published as a conference paper at ICLR 2024 Table 2: Comparison with the sequential decoder of the One2Seq scheme. The reported results are the average scores of three seeds. The standard deviation of each F1 score is presented in the subscript. For example, 33.61 means an average of 33.6 with a standard deviation of 0.1. We reported the throughput and GPU Memory in inference stage with batch size 1. Backbone Present Absent Throughput (example/s) GPU Memory F1@5 F1@M F1@5 F1@M KP20k One2Seq T5-Base 33.61 38.80 1.70 3.40 2.5 1,714 M One2Branch 36.31 35.23 4.71 5.61 6.8 (2.7x) 1,858 M One2Seq T5-Large 34.32 39.30 1.70 3.50 1.6 3,972 M One2Branch 36.71 38.40 5.20 7.10 5.1 (3.2x) 3,632 M KPTimes One2Seq T5-Base 34.62 49.22 15.31 24.21 3.0 1,534 M One2Branch 38.22 51.11 25.71 28.75 7.5 (2.5x) 2,084 M One2Seq T5-Large 36.60 50.81 15.71 24.11 1.7 3,944 M One2Branch 40.12 52.52 27.38 30.45 4.4 (2.6x) 3,682 M Stack Ex One2Seq T5-Base 28.71 56.11 9.40 21.61 6.6 1,476 M One2Branch 29.93 55.01 19.67 31.16 10.6 (1.6x) 1,500 M One2Seq T5-Large 30.52 58.03 10.61 23.92 3.7 3,486 M One2Branch 30.31 56.01 20.52 32.01 6.3 (1.7x) 3,472 M on KPTimes. It is worth noting that KPTimes has the largest proportion of absent keyphrase compared to the other two datasets (shown in Table 1), which may bring greater challenges to the training of One2Seq with unordered set, resulting in overall performance lagging behind One2Branch. One2Branch performs comparably to One2Seq on KP20k and Stack Ex with better F1@5 but lower F1@M. On these two datasets, One2Branch generally performs better when the target set is larger. Throughput. One2Branch has significant advantages over One2Seq in inference speed. Benefiting from the parallel decoding capability of the branching decoder, the advantages of One2Branch are more obvious when the target set is larger. Recall the statistics in Table 1, KP20k has the highest average number of keyphrases, so it witnesses the largest speedup accordingly (3.2 times faster than One2Seq based on T5-Large). Stack Ex has the lowest average number of keyphrases, so the speedup on this dataset is lower than the other two datasets, but still at least 1.6 times faster than One2Seq. 
Impressively, One2Branch based on T5-Large may even be faster than One2Seq based on T5-Base (5.1 vs. 2.5 on KP20k and 4.4 vs. 3.0 on KPTimes), even though the former has 2 3 times as many parameters as the latter. GPU memory usage. Although the higher throughput of One2Branch comes from generating multiple paths in parallel at the same time, their GPU memory usage is comparable, and One2Branch even uses less memory on the large version. There are two factors that cause this phenomenon. Firstly, the common prefix of multiple sequences is only generated once, while One2Seq needs to generate each one independently. Secondly, the sequence generated by One2Seq needs to interact with the previously generated sequences, which will increase the usage of GPU memory. 5.2 COMPARISON WITH SOTA MODELS Following the work of Wu et al., 2022, we compared with two categories of state-of-the-art keyphrase generation models. The first category is to study better model structures and mechanisms to improve sequential decoder. Cat Seq (Yuan et al., 2020) integrates a copy mechanism (Meng et al., 2017) into the model, and Ex Hi RD-h (Chen et al., 2021) further improves it by using an exclusion mechanism to reduce duplicates. Ye et al. (2021) first introduces Transformer (Vaswani et al., 2017) to One2Seq model and then proposes Set Trans with an order-agnostic training algorithm for sequential decoder. The second category is to use powerful pre-trained models such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), and further pre-train on in-domain data, such as Key Bart Kulkarni et al. (2022), News Bart Wu et al. (2022), Sci Bart Wu et al. (2022). The comparison results are reported in Table 3. For the absent keyphrases, One2Branch exceeds all Published as a conference paper at ICLR 2024 Table 3: Comparison with state-of-the-art One2Seq models. KP20k KPTimes Stack Ex F1@5 F1@M F1@5 F1@M F1@5 F1@M Present Cat Seq 29.1 36.7 29.5 45.3 - - Ex Hi RD-h 31.11 37.40 32.116 45.27 28.82 54.82 Transformer 33.31 37.62 30.25 45.36 30.85 55.42 Set Trans 35.60 39.12 35.65 46.34 35.83 56.75 BART-Base 32.22 38.83 35.91 49.92 30.41 57.11 BART-Large 33.24 39.22 37.316 51.015 31.22 57.88 T5-Base 33.61 38.80 34.62 49.22 28.71 56.11 T5-Large 34.32 39.30 36.60 50.81 30.52 58.03 Key BART 32.51 39.82 37.86 51.31 31.95 58.92 Sci BART-Base 34.11 39.62 34.84 48.81 30.46 57.64 Sci BART-Large 34.73 41.54 35.34 49.72 30.93 57.82 News BART-Base 32.43 38.72 35.42 49.81 30.73 57.50 One2Branch (T5-Base) 36.31 35.23 38.22 51.11 29.93 55.01 One2Branch (T5-Large) 36.71 38.40 40.12 52.52 30.31 56.01 Absent Cat Seq 1.5 3.2 15.7 22.7 - - Ex Hi RD-h 1.60 2.50 13.42 16.51 10.11 15.51 Transformer 2.22 4.64 17.11 23.11 10.42 18.72 Set Trans 3.51 5.81 19.83 21.92 13.91 20.70 BART-Base 2.21 4.22 17.12 24.91 11.70 24.92 BART-Large 2.72 4.72 17.610 24.419 12.41 26.13 T5-Base 1.70 3.40 15.31 24.21 9.40 21.61 T5-Large 1.70 3.50 15.71 24.11 10.61 23.92 Key BART 2.61 4.71 18.07 25.52 13.05 27.15 Sci BART-Base 2.93 5.24 17.23 24.62 11.16 24.28 Sci BART-Large 3.12 5.73 17.23 25.72 12.61 26.71 News BART-Base 2.21 4.42 17.63 26.11 12.13 25.74 One2Branch (T5-Base) 4.71 5.61 25.71 28.75 19.67 31.16 One2Branch (T5-Large) 5.20 7.10 27.38 30.45 20.52 32.01 One2Seq baselines on all three datasets, demonstrating its superior set generation capabilities. For the present keyphrases, One2Branch is also among the best on two datasets. 5.3 ADDITIONAL EXPERIMENTS We conducted supplementary experiments, detailed in the appendix. 
We analysed the effectiveness of training with negative sequences in Section A.1 and the influence of the hyperparameter kmin in Section A.2. Case study is presented in Section A.3 to analyze the characteristics of One2Branch concretely. To more comprehensively assess One2Branch s performance, we experimented with four out-of-distribution datasets (Section A.4) and a question answering dataset (Section A.5). These experiments reinforce the advantages of One2Branch as discussed in Section 5.1. 6 CONCLUSION In this paper, we propose a new decoder called branching decoder, which can generate a set of sequences in parallel, in contrast with existing decoders that successively generate all the sequences in a single concatenated long sequence. In the experiments, the branching decoder performed impressively in both set generation performance and inference speed. As a new paradigm for decoders, although we have demonstrated its promising effectiveness and efficiency in this work, more in-depth exploration is still needed. Currently, the branching decoder is implemented based on the existing pre-trained model with inconsistent training methods, which may limit the capability of branching decoder. It would be helpful to explore pre-training methods that better fit the branching decoder. In addition, whether branching decoder can achieve consistent performance on different generation architectures, such as GPT-style causal language model, deserves further experimental analysis. Published as a conference paper at ICLR 2024 ACKNOWLEDGMENTS This work was supported by the NSFC (62072224) and the CAAI-Huawei Mind Spore Open Fund. Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1171 1179, 2015. URL https://proceedings.neurips.cc/paper/2015/hash/ e995f98d56967d946471af29d7bf99f1-Abstract.html. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ 1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html. Jie Cao and Yin Zhang. Otseq2set: An optimal transport enhanced sequence-to-set model for extreme multi-label text classification. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp. 5588 5597. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.emnlp-main.377. 
URL https://doi.org/10.18653/v1/2022.emnlp-main.377. Wang Chen, Hou Pong Chan, Piji Li, and Irwin King. Exclusive hierarchical decoding for deep keyphrase generation. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 1095 1105. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.103. URL https://doi.org/10.18653/v1/ 2020.acl-main.103. Wang Chen, Piji Li, and Irwin King. A training-free and reference-free summarization evaluation metric via centrality-weighted relevance and self-referenced redundancy. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 404 414. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021. acl-long.34. URL https://doi.org/10.18653/v1/2021.acl-long.34. Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. All NLP tasks are generation tasks: A general pretraining framework. Co RR, abs/2103.10360, 2021. URL https://arxiv.org/abs/2103.10360. Ygor Gallina, Florian Boudin, and B eatrice Daille. Kptimes: A large-scale dataset for keyphrase generation on news documents. In Kees van Deemter, Chenghua Lin, and Hiroya Takamura (eds.), Proceedings of the 12th International Conference on Natural Language Generation, INLG 2019, Tokyo, Japan, October 29 - November 1, 2019, pp. 130 135. Association for Computational Linguistics, 2019. doi: 10.18653/v1/W19-8617. URL https://aclanthology. org/W19-8617/. Travis R. Goodwin, Max E. Savery, and Dina Demner-Fushman. Towards zero shot conditional summarization with adaptive multi-task fine-tuning. In Trevor Cohn, Yulan He, and Yang Liu Published as a conference paper at ICLR 2024 (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pp. 3215 3226. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.findings-emnlp.289. URL https: //doi.org/10.18653/v1/2020.findings-emnlp.289. Yuxin He and Buzhou Tang. Setgner: General named entity recognition as entity set generation. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp. 3074 3085. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.emnlp-main.200. URL https://doi.org/10.18653/ v1/2022.emnlp-main.200. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. URL https://openreview. net/forum?id=ryg GQyr Fv H. Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. 
In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 2790 2799. PMLR, 2019. URL http://proceedings.mlr.press/v97/houlsby19a.html. Zixian Huang, Jiaying Zhou, Chenxu Niu, and Gong Cheng. Spans, not tokens: A span-centric model for multi-span reading comprehension. In Ingo Frommholz, Frank Hopfgartner, Mark Lee, Michael Oakes, Mounia Lalmas, Min Zhang, and Rodrygo L. T. Santos (eds.), Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023, pp. 874 884. ACM, 2023a. doi: 10.1145/3583780.3615064. URL https://doi.org/10.1145/3583780.3615064. Zixian Huang, Jiaying Zhou, Gengyang Xiao, and Gong Cheng. Enhancing in-context learning with answer feedback for multi-span question answering. Co RR, abs/2306.04508, 2023b. doi: 10. 48550/ARXIV.2306.04508. URL https://doi.org/10.48550/ar Xiv.2306.04508. Gautier Izacard and Edouard Grave. Distilling knowledge from reader to retriever for question answering. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. URL https://openreview.net/ forum?id=NTEz-6wysdb. Mayank Kulkarni, Debanjan Mahata, Ravneet Arora, and Rajarshi Bhowmik. Learning rich representation of keyphrases from text. In Marine Carpuat, Marie-Catherine de Marneffe, and Iv an Vladimir Meza Ru ız (eds.), Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pp. 891 906. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.findings-naacl.67. URL https://doi.org/10. 18653/v1/2022.findings-naacl.67. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: denoising sequence-to-sequence pretraining for natural language generation, translation, and comprehension. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 7871 7880. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.703. URL https://doi.org/10.18653/v1/2020.acl-main.703. Haonan Li, Martin Tomko, Maria Vasardani, and Timothy Baldwin. Multispanqa: A dataset for multi-span question answering. In Marine Carpuat, Marie-Catherine de Marneffe, and Iv an Vladimir Meza Ru ız (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pp. 1250 1260. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.naacl-main.90. URL https://doi.org/ 10.18653/v1/2022.naacl-main.90. Published as a conference paper at ICLR 2024 Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 4582 4597. Association for Computational Linguistics, 2021. 
doi: 10.18653/v1/2021.acl-long.353. URL https://doi.org/10.18653/v1/ 2021.acl-long.353. Tianyu Liu, Fuli Luo, Qiaolin Xia, Shuming Ma, Baobao Chang, and Zhifang Sui. Hierarchical encoder with auxiliary supervision for neural table-to-text generation: Learning better representation for tables. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 6786 6793. AAAI Press, 2019. doi: 10.1609/ aaai.v33i01.33016786. URL https://doi.org/10.1609/aaai.v33i01.33016786. Aman Madaan, Dheeraj Rajagopal, Niket Tandon, Yiming Yang, and Antoine Bosselut. Conditional set generation using seq2seq models. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp. 4874 4896. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.emnlp-main.324. URL https://doi.org/10.18653/v1/2022.emnlp-main.324. Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. Deep keyphrase generation. In Regina Barzilay and Min-Yen Kan (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 582 592. Association for Computational Linguistics, 2017. doi: 10.18653/v1/P17-1054. URL https://doi.org/10.18653/v1/ P17-1054. Rui Meng, Xingdi Yuan, Tong Wang, Sanqiang Zhao, Adam Trischler, and Daqing He. An empirical study on neural keyphrase generation. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-T ur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pp. 4985 5007. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.naacl-main.396. URL https://doi.org/10.18653/v1/2021. naacl-main.396. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-totext transformer. J. Mach. Learn. Res., 21:140:1 140:67, 2020. URL http://jmlr.org/ papers/v21/20-074.html. Jianlin Su, Mingren Zhu, Ahmed Murtadha, Shengfeng Pan, Bo Wen, and Yunfeng Liu. ZLPR: A novel loss for multi-label classification. Co RR, abs/2208.02955, 2022. doi: 10.48550/ar Xiv.2208. 02955. URL https://doi.org/10.48550/ar Xiv.2208.02955. Dianbo Sui, Chenhao Wang, Yubo Chen, Kang Liu, Jun Zhao, and Wei Bi. Set generation networks for end-to-end knowledge base population. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 9650 9660. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.760. URL https://doi.org/10.18653/v1/ 2021.emnlp-main.760. Zeqi Tan, Yongliang Shen, Shuai Zhang, Weiming Lu, and Yueting Zhuang. A sequence-to-set network for nested named entity recognition. 
In Zhi-Hua Zhou (ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pp. 3936 3942. ijcai.org, 2021. doi: 10.24963/ijcai.2021/542. URL https://doi.org/10.24963/ijcai.2021/542. Published as a conference paper at ICLR 2024 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998 6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/ 3f5ee243547dee91fbd053c1c4a845aa-Abstract.html. Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. Co RR, abs/1610.02424, 2016. URL http://arxiv.org/abs/ 1610.02424. Yuqiao Wen, Yongchang Hao, Yanshuai Cao, and Lili Mou. An equal-size hard EM algorithm for diverse dialogue generation. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open Review.net, 2023. URL https://openreview.net/pdf?id=k5PEHHY4sp M. Di Wu, Wasi Uddin Ahmad, and Kai-Wei Chang. Pre-trained language models for keyphrase generation: A thorough empirical study. Co RR, abs/2212.10233, 2022. doi: 10.48550/ar Xiv.2212. 10233. URL https://doi.org/10.48550/ar Xiv.2212.10233. Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. URL https://openreview.net/ forum?id=ze Frfgy Zln. Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. SGM: sequence generation model for multi-label classification. In Emily M. Bender, Leon Derczynski, and Pierre Isabelle (eds.), Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pp. 3915 3926. Association for Computational Linguistics, 2018. URL https://aclanthology.org/C18-1330/. Pengcheng Yang, Fuli Luo, Shuming Ma, Junyang Lin, and Xu Sun. A deep reinforced sequenceto-set model for multi-label classification. In Anna Korhonen, David R. Traum, and Llu ıs M arquez (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28August 2, 2019, Volume 1: Long Papers, pp. 5252 5258. Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1518. URL https://doi.org/10.18653/v1/p19-1518. Jiacheng Ye, Tao Gui, Yichao Luo, Yige Xu, and Qi Zhang. One2set: Generating diverse keyphrases as a set. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 4598 4608. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.acl-long.354. 
URL https://doi.org/10.18653/ v1/2021.acl-long.354. Xingdi Yuan, Tong Wang, Rui Meng, Khushboo Thaker, Peter Brusilovsky, Daqing He, and Adam Trischler. One size does not fit all: Generating and evaluating variable number of keyphrases. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 7961 7975. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020. acl-main.710. URL https://doi.org/10.18653/v1/2020.acl-main.710. Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. Asynchronous bidirectional decoding for neural machine translation. In Sheila A. Mc Ilraith and Kilian Q. Weinberger (eds.), Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), Published as a conference paper at ICLR 2024 Table 4: The ablation study of negative sequence. Avg denotes the average of F1@5 and F1@M. Backbone KP20k KPTimes Stack Ex F1@5 F1@M F1@5 F1@M F1@5 F1@M Present One2Branch T5-Base 36.3 35.2 38.2 51.1 29.9 55.0 w/o negative 34.5 36.4 35.5 51.0 33.3 51.2 One2Branch T5-Large 36.7 38.4 40.1 52.5 30.3 56.0 w/o negative 36.5 38.2 39.2 52.4 33.4 52.3 Absent One2Branch T5-Base 4.7 5.6 25.7 28.7 19.6 31.1 w/o negative 4.3 5.9 26.7 27.2 22.9 24.0 One2Branch T5-Large 5.2 7.1 27.3 30.4 20.5 32.0 w/o negative 5.0 7.0 27.7 29.2 23.3 25.2 New Orleans, Louisiana, USA, February 2-7, 2018, pp. 5698 5705. AAAI Press, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16784. Yan Zhang, Jonathon S. Hare, and Adam Pr ugel-Bennett. Deep set prediction networks. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alch e-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 3207 3217, 2019. URL https://proceedings.neurips.cc/paper/ 2019/hash/6e79ed05baec2754e25b4eac73a332d2-Abstract.html. Chao Zhao, Marilyn A. Walker, and Snigdha Chaturvedi. Bridging the structural gap between encoding and decoding for data-to-text generation. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 2481 2491. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.224. URL https://doi.org/10.18653/v1/2020.acl-main.224. A.1 ABLATION STUDY: TRAINING WITH NEGATIVE SEQUENCES A unique feature of One2Branch is its training with negative sequences. In this section, we delve into the effectiveness of this approach. For a fair comparison, we trained the checkpoints from the first stage for an additional 5 epochs without using negative sequences. Table 4 reports the ablation results. The incorporation of negative sequences resulted in performance improvement in most cases (16 out of 24). Among them, the absent keyphases of Stack Ex showed the most significant increase on F1@M, recording 7.1 for the base version and 6.8 for the large version. Although using negative sequences induces a drop of 8 indicators, the technique holds promise. 
Exploring it further, for instance, through dynamic negative sequence sampling as suggested by Xiong et al. (2021), might offer enhancements in the future.

A.2 INFLUENCE OF THE MINIMUM NUMBER OF EXPLORED PATHS

To prevent One2Branch from inadequate generation, we set a minimum number of explored paths, denoted as kmin, as detailed in Section 3.4. Instead of requiring a score strictly greater than 0 at every time-step, we allow the average generation score to be above 0. This adjustment, i.e., using a larger kmin, allows the branching decoder to explore a broader generation space. Figure 4 shows the average number of keyphrases (# KP) generated by One2Branch; as kmin increases, # KP also increases. For StackEx, when kmin is greater than 5, # KP changes slowly, indicating that a larger exploration space is not necessary. For KP20k and KPTimes, # KP still shows a growing trend when kmin = 15.

Figure 4: The average number of keyphrases generated by One2Branch-Base (left) and One2Branch-Large (right) under different kmin. Red (dot), blue (square), and orange (triangle) refer to the KP20k, KPTimes, and StackEx datasets, respectively.

In order to further analyze the effect of kmin on generation performance, we present the F1 scores of One2Branch on the test sets in Figure 5 and Figure 6. As kmin increases, the F1@5 of both present and absent keyphrases gradually increases and then flattens, indicating that a larger exploration space helps generate correct sequences for the samples with more target keyphrases. From the performance on KPTimes and StackEx, we note that as kmin increases, F1@M first increases and then decreases. This suggests that moderately increasing kmin can enhance exploration and generation. However, an excessively large kmin can introduce noise.

Figure 5: The keyphrase generation performance of One2Branch (T5-Base) under different kmin. Red (dot), blue (square), and orange (triangle) refer to the KP20k, KPTimes, and StackEx datasets, respectively.

Figure 6: The keyphrase generation performance of One2Branch (T5-Large) under different kmin. Red (dot), blue (square), and orange (triangle) refer to the KP20k, KPTimes, and StackEx datasets, respectively.

A.3 CASE STUDY

Table 5 presents a case from the test set of KP20k generated by One2Branch-Large with kmin set to 1, 8, and 15. Compared with kmin = 1, a larger kmin can help the model generate more insightful keyphrases. For example, when kmin ≥ 8, One2Branch generates the absent keyphrase "cryptography", which is only obscurely expressed in the document. There are many similar expressions among the generated keyphrases. For example, the generated sequences {"message authentication code", "message authentication", "authentication", "mac scheme", "mac"} all have the same meaning and some of them share the same prefix. One2Branch tends to generate multiple semantically similar sequences when it is difficult to determine the optimal one. This problem may be alleviated by adding a deduplication module. In addition, an excessively large kmin can lead to overly broad generation. For example, when kmin = 15, the generated keyphrase "digital signature" is irrelevant to the document. Therefore, a suitable kmin is integral to One2Branch.

Table 5: A case generated by One2Branch from KP20k.
The keyphrases generated correctly are marked blue. Construct message authentication code with one way hash functions and block ciphers. We suggest an mac scheme which combines a hash function and an block cipher in order. We strengthen this scheme to prevent the problem of leaking the intermediate hash value between the hash function and the block cipher by additional random bits. The requirements to the used hash function are loosely. Security of the proposed scheme is heavily dependent on the underlying block cipher. This scheme is efficient on software implementation for processing long messages and has clear security properties. Ground-truth Present: {one way hash function, block cipher, mac} Absent: {cryptography} kmin = 1 Generated KPs {block cipher, message authentication code, message authentication, one way hash function} kmin = 8 Generated KPs {block cipher, message authentication code, message authentication, one way hash function, hash function, authentication, authentication code, mac scheme, cryptography, security proof, mac} kmin = 15 Generated KPs {block cipher, message authentication code, message authentication, one way hash function, hash function, authentication, authentication code, mac scheme, cryptography, security proof, mac, digital signature} Table 6: Statistics of keyphrase generation (KG) and question answering (QA) datasets. # Trg, |Trg|, and % Abs Trg refer to the average number of target sequences per document, the average number of words that each target sequence contains, and the percentage of absent target sequences, respectively. All of them are calculated over the dev set. Dataset Task # Test # Trg |Trg| % Abs Trg Inspec KG 500 5.3 2.0 37.1 Krapivin KG 400 9.8 2.5 26.4 NUS KG 211 5.9 2.2 44.3 Sem Eval KG 100 14.7 2.4 57.4 MSQA QA 653 2.9 3.0 0 A.4 OUT-OF-DISTRIBUTION KEYPHRASE GENERATION DATASETS Following Wu et al. (2022), we used the One2Branch model trained on KP20k to evaluate on four out-of-distribution datasets. Some statistics of these datasets are shown in Table 6. Table 7 reports the comparsion of One2Branch and One2Seq in term of generation performance, inference speed, and GPU memory usage. Similar to the results in Section 5.1, One2Branch has outstanding performance in the generation of absent keyphrase and inference speed. It is comparable and alternately leads with One2Seq on present keyphrases. Published as a conference paper at ICLR 2024 Table 7: Comparison with the sequential decoder of the One2Seq scheme on out-of-domain keyphrase generation datasets. The reported results are the average scores over three seeds. The standard deviation of each F1 score is presented in the subscript. For example, 33.61 means an average of 33.6 with a standard deviation of 0.1. We reported the throughput and GPU Memory in inference stage with batch size 1. 
Table 7: Comparison with the sequential decoder of the One2Seq scheme on out-of-domain keyphrase generation datasets. The reported results are averages over three seeds. The standard deviation of each F1 score is presented in the subscript; for example, 33.61 means an average of 33.6 with a standard deviation of 0.1. Throughput and GPU memory are reported for the inference stage with batch size 1.

              Backbone    Present F1@5   Present F1@M   Absent F1@5   Absent F1@M   Throughput (examples/s)   GPU Memory
Inspec
  One2Seq     T5-Base     28.85          33.95          1.11          2.03          3.2                       1,698 M
  One2Branch  T5-Base     30.46          33.73          2.61          3.42          8.7 (2.7x)                1,652 M
  One2Seq     T5-Large    29.51          34.34          1.13          2.16          2.1                       3,948 M
  One2Branch  T5-Large    32.21          34.21          3.10          4.00          7.2 (3.4x)                3,516 M
Krapivin
  One2Seq     T5-Base     30.23          35.02          2.32          4.34          2.8                       1,716 M
  One2Branch  T5-Base     27.53          26.95          5.01          6.13          9.0 (3.2x)                1,710 M
  One2Seq     T5-Large    31.52          35.95          2.34          4.57          1.9                       3,960 M
  One2Branch  T5-Large    30.31          31.31          5.00          7.91          6.7 (3.5x)                3,614 M
NUS
  One2Seq     T5-Base     38.86          44.04          2.70          5.13          3.2                       1,712 M
  One2Branch  T5-Base     39.12          39.74          5.67          6.56          9.3 (2.9x)                1,706 M
  One2Seq     T5-Large    39.84          43.86          2.53          4.26          1.9                       3,944 M
  One2Branch  T5-Large    41.61          43.60          6.00          8.30          6.8 (3.6x)                3,592 M
SemEval
  One2Seq     T5-Base     29.516         32.616         1.44          2.05          3.6                       1,712 M
  One2Branch  T5-Base     29.32          30.64          2.51          2.95          8.7 (2.4x)                1,618 M
  One2Seq     T5-Large    29.710         32.111         1.51          2.03          1.9                       3,974 M
  One2Branch  T5-Large    28.20          31.51          2.70          3.41          7.1 (3.7x)                3,560 M

In Table 8, we report the comparison of One2Branch with state-of-the-art One2Seq models. One2Branch achieves a significant lead in absent keyphrase generation over all baselines. For present keyphrases, SetTrans remains the best-performing model, thanks to its ability to capture the dependencies between generated sequences.

A.5 QA PERFORMANCE

Besides the keyphrase generation datasets, we also experimented with One2Branch on MSQA (Li et al., 2022), a multi-answer question answering dataset. Each sample in MSQA includes a question with at least two answers, together with a document that contains all the answers. Dataset statistics are reported in Table 6. For both One2Branch and One2Seq, we searched for the best learning rate in {1e-4, 3e-4} and trained models for 50 epochs. The maximum sequence length was set to 2048 so that no documents are truncated. For One2Seq, we concatenated the answers in the order of their occurrence in the document.

We followed Li et al. (2022) and used two metrics. Exact Match (EM) measures the F1 score between generated answers and gold-standard answers, requiring an exact match between answer texts. Partial Match (PM) generalizes EM by using the length of the longest common substring to score each predicted answer against its nearest ground-truth answer.

We experimented with MSQA in two configurations, with and without the given document. When the document is used, MSQA becomes an extraction task, i.e., extracting all answers from the document. It is worth noting that the dataset provides the order in which the answers appear in the document, so this configuration is no longer an unordered set generation task and allows the model to generate answers in that order. When the document is not used, MSQA becomes a fully absent generation task, and models need to generate answers from their parameters.

The comparison results of One2Branch and One2Seq on MSQA are reported in Table 9.
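To make the EM and PM metrics above concrete, here is a sketch based on our reading of their description; the official MSQA evaluation script of Li et al. (2022) may normalize answers and aggregate partial credit (especially on the recall side) differently.

```python
# Illustrative sketch of the EM / PM metrics as described above (our reading;
# the official MSQA evaluation script may normalize and aggregate differently).
def lcs_substring_len(a, b):
    """Length of the longest common contiguous substring of a and b."""
    best, dp = 0, [0] * (len(b) + 1)
    for ch_a in a:
        new = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                new[j] = dp[j - 1] + 1
                best = max(best, new[j])
        dp = new
    return best

def answer_set_f1(predicted, gold, partial=False):
    """EM (partial=False): exact-match F1 between answer sets.
    PM (partial=True): each prediction is scored by its longest common
    substring with the nearest gold answer, normalized by length."""
    if not predicted or not gold:
        return 0.0
    if partial:
        score = lambda p: max(
            lcs_substring_len(p, g) / max(len(p), len(g)) for g in gold)
    else:
        score = lambda p: float(p in gold)
    credit = sum(score(p) for p in predicted)
    if credit == 0:
        return 0.0
    precision, recall = credit / len(predicted), credit / len(gold)
    return 2 * precision * recall / (precision + recall)
```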
Table 8: Comparison with state-of-the-art One2Seq models on four out-of-distribution datasets. All of the models were trained on the KP20k dataset.

Present keyphrases
Model                      Inspec           Krapivin         NUS              SemEval
                           F1@5    F1@M     F1@5    F1@M     F1@5    F1@M     F1@5    F1@M
CatSeq                     22.5    26.2     26.9    35.4     32.3    39.7     24.2    28.3
ExHiRD-h                   25.44   29.13    28.64   30.84    -       -        30.417  28.218
Transformer                28.87   33.35    31.49   36.57    37.86   42.99    28.85   32.18
SetTrans                   29.13   32.81    33.510  37.511   39.98   44.622   32.28   34.214
BART-Base                  27.03   32.37    27.06   33.66    36.61   42.48    27.111  32.121
BART-Large                 27.611  33.39    28.42   34.73    38.08   43.511   27.412  31.116
T5-Base                    28.85   33.95    30.23   35.02    38.86   44.04    29.516  32.616
T5-Large                   29.51   34.34    31.52   35.95    39.84   43.86    29.710  32.111
KeyBART                    26.83   32.55    28.76   36.514   37.37   43.010   26.08   28.94
SciBART-Base               27.510  32.88    28.28   32.911   37.37   42.114   27.08   30.48
SciBART-Large              26.112  31.713   27.111  32.412   36.418  40.912   27.914  32.012
NewsBART-Base              26.210  31.711   26.28   32.315   36.98   42.410   26.421  30.423
One2Branch (T5-Base)       30.46   33.73    27.53   26.95    39.12   39.74    29.32   30.64
One2Branch (T5-Large)      32.21   34.21    30.31   31.31    41.61   43.60    28.20   31.51

Absent keyphrases
Model                      Inspec           Krapivin         NUS              SemEval
                           F1@5    F1@M     F1@5    F1@M     F1@5    F1@M     F1@5    F1@M
CatSeq                     0.4     0.8      1.8     3.6      1.6     2.8      2.0     2.8
ExHiRD-h                   1.11    1.62     2.23    3.34     -       -        1.64    2.16
Transformer                1.20    2.31     3.32    6.34     2.54    4.49     1.62    2.24
SetTrans                   1.91    3.01     4.51    7.23     3.710   5.517    2.22    2.92
BART-Base                  1.01    1.72     2.83    4.96     2.64    4.29     1.61    2.12
BART-Large                 1.53    2.44     3.11    5.12     3.15    4.89     1.93    2.43
T5-Base                    1.11    2.03     2.32    4.34     2.70    5.13     1.44    2.05
T5-Large                   1.13    2.16     2.34    4.57     2.53    4.26     1.51    2.03
KeyBART                    1.42    2.32     3.62    6.46     3.14    5.57     1.64    2.25
SciBART-Base               1.62    2.84     3.34    5.48     3.31    5.32     1.81    2.21
SciBART-Large              1.52    2.62     3.41    5.63     3.25    5.07     2.66    3.38
NewsBART-Base              1.01    1.82     2.42    4.54     2.44    4.09     1.61    2.22
One2Branch (T5-Base)       2.61    3.42     5.01    6.13     5.67    6.56     2.51    2.95
One2Branch (T5-Large)      3.10    4.00     5.00    7.91     6.00    8.30     2.70    3.41

Table 9: Comparison with the sequential decoder of the One2Seq scheme on the multi-span question answering dataset (MSQA). The reported results are averages over three seeds. The standard deviation of each score is presented in the subscript; for example, 72.34 means an average of 72.3 with a standard deviation of 0.4. Throughput and GPU memory are reported for the inference stage with batch size 1.

              Backbone    Dev EM   Dev PM   Test EM   Test PM   Throughput (examples/s)   GPU Memory
w/ doc (extraction task)
  One2Seq     T5-Base     72.34    84.01    71.58     84.16     5.3                       2,220 M
  One2Branch  T5-Base     71.41    83.22    70.23     82.95     9.0 (1.7x)                2,060 M
  One2Seq     T5-Large    74.62    85.94    74.77     86.43     3.2                       4,986 M
  One2Branch  T5-Large    74.73    85.73    73.36     85.56     5.7 (1.8x)                4,640 M
w/o doc (generation task)
  One2Seq     T5-Base     16.44    35.62    16.04     35.45     7.2                       2,346 M
  One2Branch  T5-Base     19.92    36.55    20.94     37.22     8.9 (1.2x)                1,480 M
  One2Seq     T5-Large    18.22    37.34    18.12     37.31     4.8                       3,878 M
  One2Branch  T5-Large    21.62    38.83    23.35     39.75     8.0 (1.7x)                3,352 M

One2Branch lags slightly behind One2Seq in the configuration with the document, with a maximum gap of 1.4 in EM on the test set. This may reflect a benefit of One2Seq, in which later generated sequences can interact with previously generated ones; moreover, this configuration is not a set generation task because the order of the answers is known. However, One2Branch has a clear advantage in inference speed: the throughput of One2Branch based on T5-Large even exceeds that of One2Seq based on T5-Base, while the former has a significant advantage in QA performance. One2Branch achieves a more pronounced lead in the configuration without the document, with a maximum gap of 5.2 in EM on the test set. This demonstrates the potential of One2Branch for a broader range of set generation tasks beyond keyphrase generation.