# towards_optimal_multidraft_speculative_decoding__77646fb2.pdf

Published as a conference paper at ICLR 2025

TOWARDS OPTIMAL MULTI-DRAFT SPECULATIVE DECODING

Zhengmian Hu1,2, , Tong Zheng1, , Vignesh Viswanathan2, 3, Ziyi Chen1, Ryan A. Rossi2, Yihan Wu1, Dinesh Manocha1,4, Heng Huang1

1Department of Computer Science, University of Maryland, College Park, MD, USA 2Adobe Research, San Jose, CA, USA 3Manning College of Information & Computer Sciences, University of Massachusetts Amherst, MA, USA 4Department of Electrical and Computer Engineering, University of Maryland, College Park, MD, USA

Large Language Models (LLMs) have become an indispensable part of natural language processing tasks. However, autoregressive sampling has become an efficiency bottleneck. Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts, and the target LLM verifies them in parallel, ensuring that the final output conforms to the target model distribution. The two main design choices in MDSD are the draft sampling method and the verification algorithm. For a fixed draft sampling method, the optimal acceptance rate is a solution to an optimal transport problem, but the complexity of this problem makes it difficult to solve for the optimal acceptance rate and measure the gap between existing verification algorithms and the theoretical upper bound. This paper discusses the dual of the optimal transport problem, providing a way to efficiently compute the optimal acceptance rate. For the first time, we measure the theoretical upper bound of MDSD efficiency for vocabulary sizes in the thousands and quantify the gap between existing verification algorithms and this bound. We also compare different draft sampling methods based on their optimal acceptance rates. Our results show that the draft sampling method strongly influences the optimal acceptance rate, with sampling without replacement outperforming sampling with replacement. Additionally, existing verification algorithms do not reach the theoretical upper bound for both without replacement and with replacement sampling. Our findings suggest that carefully designed draft sampling methods can potentially improve the optimal acceptance rate and enable the development of verification algorithms that closely match the theoretical upper bound.

1 INTRODUCTION

Autoregressive language models have achieved state-of-the-art results in various language tasks (Brown et al., 2020; Touvron et al., 2023), including chatbots (Luo et al., 2022) and code generation (Chen et al., 2021). These models generate outputs by predicting the next token sequentially. However, this autoregressive decoding process leads to significant computational resource requirements and high latency, posing challenges for user experience and limiting potential applications.

Speculative decoding (Leviathan et al., 2023; Chen et al., 2023a) has been proposed to address the high inference cost issue. The method uses a small, fast draft model to generate candidate results, which are then verified and corrected by a large, accurate target model to maintain the model output distribution. Compared to other acceleration methods, such as knowledge distillation, model quantization, and model pruning, speculative decoding has the advantage of significantly reducing inference latency without sacrificing quality of the generated content.

*Co-first authors.

Published as a conference paper at ICLR 2025

Multi-Draft Speculative Decoding (MDSD) (Miao et al., 2024; Cai et al., 2024; Li et al., 2024; Spector & Re, 2023) is a recent advancement in speculative decoding. When generating each token, the small draft model generates multiple draft tokens instead of a single one, as in vanilla speculative decoding. The target LLM verifies these tokens in parallel, ensuring that the final output aligns with the target model s distribution while achieving a higher overall acceptance rate than vanilla speculative decoding, as the multiple drafts provide better coverage of the target model s possible outputs.

MDSD algorithms have two main design choices: (1) The draft sampling method. Common approaches include sampling with replacement, where each token is independently sampled from the draft model output distribution, and sampling without replacement, where the probability of selecting a token is updated after each draw to exclude previously selected tokens. (2) The verification algorithm design. Examples include Recursive Rejection Sampling (RRS) (Yang et al., 2024b; Jeon et al., 2024), which sequentially verifies the draft tokens, and K-SEQ (Sun et al., 2024e), which is designed to improve acceptance rate for sampling with replacement.

The acceptance rate, a measure of MDSD algorithm performance, also depends on these two design choices. Any verification algorithm that guarantees the final output aligns with the target model distribution can be viewed as a transport from the draft tokens distribution to the target model s distribution. For a fixed draft sampling method, the optimal verification algorithm is a solution to an optimal transport problem (Sun et al., 2024e), corresponding to an optimal acceptance rate.

However, the complexity of this optimal transport problem, with the number of variables and constraints growing exponentially with the number of draft tokens, makes it difficult to find efficient solutions. This difficulty has led to two open questions:

(1) For modern LLMs, where the vocabulary size is typically in the thousands, the optimal acceptance rate has never been computed, to the best of our knowledge. Simple linear program (LP) solvers can only compute the optimal transport for small toy models, making it challenging to measure the optimal acceptance rate in practical scenarios.

(2) Although it is widely known that existing verification algorithms are only approximate solutions to the optimal transport problem, the gap between their performance and the theoretical upper bound has never been quantified with respect to real text distribution. Without knowing the optimal acceptance rate, it is difficult to assess how suboptimal these algorithms are.

This paper addresses these two open questions. Our contributions include:

We transform the problem of solving the optimal acceptance rate corresponding to the optimal transport into a subset selection problem by considering the dual of the problem and then applying total unimodularity. This provides a novel perspective for understanding the efficiency of MDSD.

For certain special cases, we propose efficient methods to solve the subset selection problem by noticing convexity-like structures in the set function. This includes sampling with replacement and sampling without replacement. For the first time, we provide a practical method to compute the theoretical acceptance rate upper bound of MDSD for a draft distribution.

For the first time, we measure the theoretical upper bound of MDSD efficiency on real text, and the gap of existing verification algorithms. We compare different draft sampling methods through their optimal acceptance rates and observe that sampling without replacement outperforms sampling with replacement. We evaluate existing verification algorithms, including K-SEQ for with replacement and RRS for without replacement and with replacement sampling, and find that they still have significant gaps from the theoretical upper bound.

We propose a novel draft sampling method that greedily selects high-probability drafts, with only the last draft being random. In some cases, it achieves an even higher optimal acceptance rate than without replacement. We also propose a corresponding verification algorithm that perfectly reaches the theoretical acceptance rate upper bound.

2 PRELIMINARIES 2.1 SPECULATIVE DECODING FOR ACCELERATING LLM INFERENCE Let Σ denote the vocabulary set. We have a target model Ptarget( |x1, x2, ..., xm), which is a probabilistic model that predicts the probability of the next word. Our goal is to sample from this model as the output.

Published as a conference paper at ICLR 2025

The process of single step Multi-Draft Speculative Decoding is as follows:

1. For a draft model Pdraft( |x1, x2, ..., xm), sample n draft tokens bx(1), . . . bx(n). 2. Compute the probabilities of the target model in parallel: Ptarget( |x1, x2, ..., xm), Ptarget( |x1, x2, ..., xm, bx(1)), ..., Ptarget( |x1, x2, ..., xm, bx(n)). Due to parallel computation, this step is not much slower than computing Ptarget( |x1, x2, ..., xm) alone. 3. Run the verification algorithm xm+1 Pverify( |bx(1), . . . bx(n)). 4. If accepted, for some draft bx(i), we have xm+1 = bx(i). In this case, we can perform another sampling step xm+2 Ptarget( |x1, x2, ..., xm, bx(i)), generating two tokens in one step and achieving acceleration.

Speculative Decoding can generate multiple steps, with multiple drafts at each step and all drafts forming a tree. However, we only consider the single-step case in this paper. For following analysis, we use p( ) = Ptarget( |x1, x2, ..., xm) to denote the target distribution and pdraft for distribution of draft tokens. 2.2 SPECULATIVE DECODING WITH A SINGLE DRAFT TOKEN

Informally, the verification algorithm depends on two distributions p and pdraft, and one draft token j pdraft. The goal is to output i p such that the objective max P(i = j) is achieved, that is to maximize the probability of random variable i to be the same as random variable j.

More formally, given p Σ and pdraft Σ representing two probability distributions over the space Σ, we seek a joint distribution π Π(p, pdraft) such that the marginal distributions are p and pdraft, respectively, and the objective max P

i Σ π(i, i) is maximized. This forms a optimal transport problem. The optimal transport is denoted as π p,pdraft Π(p, pdraft), and the optimal objective function value is α (p, pdraft) = P

i Σ π p,pdraft(i, i).

The problem can be formulated as an LP by representing the joint distribution as a matrix:

max C RΣ Σ X

i Σ Ci,i s.t. X

j Σ Ci,j = p(i), X

i Σ Ci,j = pdraft(j), Ci,j 0 i, j Σ. (1)

The optimal transport has the following closed-form expression

π p,pdraft(i, j) = C i,j =

( min(p(i), pdraft(i)) i = j (p(i) pdraft(i))+(pdraft(j) p(j))+ P

z Σ(pdraft(z) p(z))+ i = j , (2)

and the optimal objective function value is

α (p, pdraft) = X

i Σ min(p(i), pdraft(i)) . (3)

The conditional distribution of i given j when (i, j) π is denoted as π( |j) G(Σ), where

π(i|j) = πi,j P

i Σ πi,j = πi,j pdraft(j) . (4)

For the optimal transport, this leads to

π p,pdraft(i|j) =

( min( p(i) pdraft(j), 1) i = j

(1 p(j) pdraft(j))+ (p(i) pdraft(i))+ P

z Σ(pdraft(z) p(z))+ i = j . (5)

The basic single-step, single-draft speculative decoding can be improved in two directions. Multistep methods generate a draft sequence. Some improvements (Sun et al., 2024d; Hu & Huang, 2024; Sun et al., 2024c) in this scenario are discussed in Appendix B.3. Our paper focuses on the multi-draft direction, where multiple draft tokens are generated at each step.

2.3 MULTI-DRAFT SPECULATIVE DECODING

For i Σ, define the incidence set Ai := { i Σn| j [n], ij = i}.

Informally, the verification algorithm depends on two distributions p and pdraft, where pdraft is now a joint distribution of n tokens. Common constructions of pdraft include:

Sampling with replacement: Given a draft model with output distribution q( ), independently sample n times. For i = ( i1, . . . , in) Σn, we have pdraft( i) = Qn j=1 q( ij).

Published as a conference paper at ICLR 2025

Sampling without replacement: pdraft( i) = Qn j=1 q i1,..., ij 1( ij), where

q i1,..., ij 1(x) =

z { i1,..., ij 1} q(z) x / { i1, . . . , ij 1}

0 x { i1, . . . , ij 1} . (6)

Product of different draft distributions: pdraft( i) = Qn j=1 qj( ij).

Given multiple draft tokens i = ( i1, . . . , in) pdraft, the goal is to output i p such that the objective max P( j [n], i = ij) or equivalently max P( i Ai) is achieved, that is to maximize the probability of random variable i to be the same as one of random variable in ( i1, . . . , in).

More formally, given p Σ and pdraft Σn representing a probability distribution over the space Σ and a probability distribution over the space Σn, respectively, we seek a joint distribution π Π(p, pdraft) such that the marginal distributions are p and pdraft, respectively, and the objective max P

i Σ P i Ai π(i, i) is maximized. The optimal transport is denoted as π p,pdraft Π(p, pdraft), and the optimal objective function value is α (p, pdraft) = P

i Σ P i Ai π p,pdraft(i, i).

The problem can be formulated as an LP by representing the joint distribution as a tensor: max C RΣ Σn X

i Σn Ci, i = p(i) i Σ, X

i Σ Ci, i = pdraft( i) i Σn,

Ci, i 0 i Σ, i Σn.

The difficulty lies in the exponential number of variables and constraints.

Several approximation have been proposed for the multi-draft speculative decoding problem, including Recursive Rejection Sampling (RRS) (Yang et al., 2024b; Jeon et al., 2024) and K-SEQ (Sun et al., 2024e). RRS recursively verifies the draft tokens, while K-SEQ improves the acceptance rate for sampling with replacement. Due to space constraints, we move the details of these methods (Appendix A), other related work (Appendix B) and all proofs (Appendix C) to the appendix.

3 OPTIMAL ACCEPTANCE RATE AS SUBSET SELECTION PROBLEM

We show that the optimal acceptance rate can be expressed as a subset selection problem:

α (p, pdraft) = 1 + min H Σ

i Hn pdraft( i) . (8) 3.1 DUAL PROBLEM

We start from the linear programming formulation (7) and derive an equivalent formulation:

max S RΣ Σn X

i Σn Si, i p(i) i Σ, X

i Σ Si, i pdraft( i) i Σn,

Si, i 0 i Σ, i Σn, Si, i = 0 i Σ, i / Ai.

Lemma 1. The two formulations (7) and (9) are equivalent. This equivalent formulation transforms the transportation problem (Hitchcock, 1941) into a bmatching problem, whose dual is a w-vertex cover problem (Schrijver et al., 2003) (with a detailed derivation in Appendix C.1):

min y RΣ,z RΣn X

i Σ yip(i) + X

i Σn z ipdraft( i)

s.t. yi + z i 1 i Σ, i Ai, yi 0 i Σ, z i 0 i Σn. (10)

3.2 TOTAL UNIMODULARITY

The coefficient matrix of the constraints in (10) is totally unimodular (TUM). The first set of constraints forms an incidence matrix of a bipartite graph, where one side of the nodes corresponds to Σ and the other side corresponds to Σn. There is an edge between i and i if and only if i Ai. Therefore, it is a totally unimodular matrix (Biggs, 1993). The second and third sets of constraints have

Published as a conference paper at ICLR 2025

coefficient matrices that are identity matrices, with |Σ| + |Σ|n variables and |Σ| + |Σ|n constraints. The concatenation of a TUM matrix and an identity matrix is also TUM (Commoner, 1973).

Since the right-hand side of the constraints are integers, the dual problem (10) always has an integer optimal solution (Hoffman & Kruskal, 2010).

3.3 SUBSET SELECTION FORMULATION

By restricting the variables in (10) to integers, we obtain:

min y ZΣ,z ZΣn X

i Σ yip(i) + X

i Σn z ipdraft( i)

s.t. yi + z i 1 i Σ, i Ai, yi 0 i Σ, z i 0 i Σn. (11)

In the optimal solution, yi and z i will not exceed 1, so they can only take values 0 or 1. Therefore, the problem can be further simplified as:

min y {0,1}Σ min z {0,1}Σn X

i Σ yip(i) + X

i Σn z ipdraft( i)

s.t. yi + z i 1 i Σ, i Ai . (12)

Define H = {i Σ|yi = 1}. The problem becomes:

min H Σ min z {0,1}Σn X

i H p(i) + X

i Σn z ipdraft( i)

s.t. z i 1 i Σ \ H, i Ai . (13)

The optimal solution for z is

x Σ\H Ax 0 i / S

x Σ\H Ax . (14)

Substituting this solution, we obtain the subset selection formulation:

i H p(i) + X

x Σ\H Ax pdraft( i) . (15)

Finally, note that X

x Σ\H Ax pdraft( i) + X

i Hn pdraft( i) = 1 . (16)

This completes the derivation of the subset selection formulation (8).

4 COMPUTING OPTIMAL ACCEPTANCE RATE IN SPECIAL CASES

In this section, we discuss how to efficiently compute the optimal acceptance rate for certain special cases of the draft distribution pdraft. For any set function f, we define the marginal value of an element x with respect to a set H as f(x|H) = f(H {x}) f(H). We also define the following shorthand notations: P(H) = P

i H p(i), Q(H) = P i Hn pdraft( i), f(H) = P(H) Q(H).

The optimal acceptance rate can be expressed as α (p, pdraft) = 1 + min H Σ f(H).

4.1 q-CONVEX FUNCTIONS

Definition 2 (q-Convex Function). A set function Q : 2Σ R is called a q-convex function if there exists a function q : Σ R 0 such that for all H Σ and x, y Σ \ H with x = y, we have

q(x) Q(y|H {x})

q(y) . (17)

Intuitively, if we order the elements of Σ arbitrarily and construct a sequence of sets Hi by adding elements one by one, then the curve of Q(Hi) against the sum of q values is always convex.

Published as a conference paper at ICLR 2025

Theorem 3. For sampling with replacement, the function Q is a q-convex function. Theorem 4. For sampling without replacement, the function Q is a q-convex function. Theorem 5. All q-convex functions are supermodular functions.

For both sampling with replacement and without replacement, computing α (p, pdraft) can be formulated as an unconstrained submodular minimization problem, which has polynomial-time algorithms (Iwata, 2008). However, by fully exploiting the properties of q-convex functions, we can solve the problem even faster, as shown in the next section.

4.2 EFFICIENT COMPUTATION

Theorem 6. Suppose that Q is a q-convex function, Q is monotone increasing, and p(x) > 0 for all x Σ. For all H Σ and x, y Σ \ H with x = y, if q(x)

p(y) and f(x|H) 0, then f(y|H {x}) 0.

The above theorem requires p(x) > 0. When p(x) = 0, there exists an optimal set H for (8) that contains x because Q is monotone increasing.

4.2.1 ALGORITHM

Inspired by Theorem 6, we can compute the optimal acceptance rate efficiently as follows:

1. Find an ordering σ of Σ such that q(σ1)

p(σ1) q(σ|Σ|) p(σ|Σ|). 2. Construct a sequence of sets Hi = {σ1, . . . , σi}. 3. Compute α (p, pdraft) = 1 + mini f(Hi).

Intuitively, we sort the elements by the ratio of q and p in non-increasing order and then perform a linear search.

4.2.2 COMPLEXITY OF COMPUTING Q AND α

For sampling with replacement, Q has a simple expression Q(H) = (P

x H q(x))n. The time complexity for computing α (p, pdraft) is O(|Σ| log |Σ|) for the sorting step, plus O(|Σ|) for the linear scan.

For sampling without replacement, we can compute Q(H) = Wn,H

Wn,Σ based on the coefficient of generating function Wn,H = Coefftn GH(t) = Coefftn Q

i H(1 + q(i)t) and apply dynamic programming with recurrence relation Wn,H {x} = Wn,H + q(x)Wn 1,H. The time complexity is O(|Σ| log |Σ|) for the sorting step, plus O(n|Σ|) for computing coefficient of generating function with dynamic programming.

5 A GREEDY APPROACH FOR SELECTING DRAFT TOKENS

In this section, we propose a novel method for constructing the draft distribution pdraft and a corresponding verification algorithm that achieves the optimal acceptance rate for this distribution.

5.1 DRAFT CONSTRUCTION

Given a draft model output distribution q Σ, we construct the draft tokens i = ( i1, . . . , in) as follows:

The first n 1 tokens are deterministically set to be the top n 1 tokens according to the probability in q, i.e., i1, . . . , in 1 = Topn 1(q), such that q( i1) q( in 1) and maxi Σ\{ i1,..., in 1} q(i) q( in 1). Only the last token in is randomly sampled from q without replacement (i.e., it is different from the previous n 1 tokens): in q Topn 1(q)( ) = q( ) 1 Pn 1 j=1 q( ij).

The resulting draft distribution is

pdraft( i) = q Topn 1(q)( in) i1, . . . , in 1 = Topn 1(q) 0 i1, . . . , in 1 = Topn 1(q) . (18)

Published as a conference paper at ICLR 2025

5.2 VERIFICATION ALGORITHM

The corresponding optimal transport problem for this draft distribution is simple because only one draft token is random. We can design a verification algorithm that strictly achieves the optimal acceptance rate for this draft distribution (, with unfolded definition in Appendix D):

πGreedy p,pdraft(i| i) = π p,q Topn 1(q)(i| in) . (19)

Theorem 7. The optimal acceptance rate for the greedy draft distribution is

α (p, pdraft) = αGreedy(p, pdraft) = X

i Topn 1(q) p(i) + X

i Σ min(p(i), q Topn 1(q)(i)) . (20)

Our subset selection formulation (8) provides a convenient way to prove the above theorem.

5.3 CONNECTION TO SPECHUB

Spec Hub (Sun et al., 2024b) is a recently proposed MDSD method that is only applicable to the case of n = 2. The draft construction in Spec Hub is as follows:

First, sample the first draft token i1. If i1 = Top1(q) is the token with the highest probability in q, then sample the second draft token i2 without replacement to ensure it is different from i1. If i1 = Top1(q) is not the token with the highest probability in q, then deterministically set the second draft token to be the token with the highest probability, i.e., i2 = Top1(q).

The resulting draft distribution is:

pdraft( i) =

q( i1) i2 = Top1(q) q( i1) 1 q( i1)q( i2) i1 = Top1(q) 0 otherwise . (21)

We note that the greedy method for n = 2 is essentially equivalent to Spec Hub because both methods ensure that at least one draft token is the token with the highest probability in q. However, the specific draft distributions are different, leading to a simpler verification algorithm for the greedy method.

6 EXPERIMENTS

The goal of our experiments is to measure the acceptance rates of various MDSD methods on real text distributions and compare them with the theoretical upper bounds. In the previous sections, we analyzed the theoretical acceptance rate α (p, pdraft) for three different draft distributions: sampling with replacement, sampling without replacement, and greedy approach (Section 5). We also discussed some existing verification methods (Appendix A), such as RRS and K-SEQ, whose acceptance rates are expected to be lower than the theoretical upper bound. For K-SEQ, its average acceptance rate αK-SEQ can be derived theoretically (see Appendix A.2 for details). Our efficient computation methods (Section 4) make it possible, for the first time, to obtain the theoretical upper bound of MDSD for vocabulary sizes of thousands.

To obtain realistic distributions p and pdraft, we select real-world datasets for various tasks, including Alpaca (Taori et al., 2023) for instruction-following, WMT 14 De-En (Bojar et al., 2014) for translation, and CNN-Daily Mail (Hermann et al., 2015) for summarization. For each task, we use an LLM to generate responses on 1024 data samples, with a maximum length of 128 tokens. We then measure the logits of the target model and the draft model on these generated responses to construct p and pdraft.

We evaluated different approaches based on four publicly available large language models, including 1) LLa MA (Touvron et al., 2023), 2) Vicuna (Chiang et al., 2023), the instruction fine-tuned version of LLa MA models, 3) OPT (Zhang et al., 2022), and 4) Qwen2 (Yang et al., 2024a). Specifically, for the LLa MA family, we select LLa MA-7B as the target model and LLa MA-68M as the draft model, which is consistent with previous work (Miao et al., 2024). For the OPT family, we select OPT-6.7B as the target model and OPT-125M as the draft model. Moreover, for the Vicuna family

Published as a conference paper at ICLR 2025

Table 1: Acceptance rates of different MDSD methods across various models and tasks. α means the gap between a verification method and the theoretical upper bound, with statistically significant differences indicated by directional arrows.

Model Pairs Draft Sampling Method Alpaca CNN-Daily Mail WMT 14

α α α α α α

OPT-125M OPT-6.7B

With Replacement

RRS 85.4 0.1 1.5 77.3 0.1 3.3 70.6 0.1 1.5 K-SEQ 85.8 0.1 1.1 78.4 0.1 2.2 71.1 0.1 0.9 αK-SEQ 85.9 0.1 0.9 78.5 0.1 2.2 71.0 0.1 1.0 α 86.9 0.1 - 80.7 0.1 - 72.0 0.1 -

Without Replacement RRS 88.9 0.1 0.9 81.5 0.1 2.8 75.1 0.1 1.0 α 89.9 0.1 - 84.3 0.1 - 76.0 0.1 -

Greedy Verify 90.7 0.1 0.0 84.2 0.1 0.1 77.0 0.1 0.0 α 90.7 0.1 - 84.3 0.1 - 77.1 0.1 -

LLa MA-68M LLa MA-7B

With Replacement

RRS 71.6 0.1 1.5 65.3 0.1 2.2 59.8 0.1 1.0 K-SEQ 71.9 0.1 1.1 66.0 0.1 1.5 60.0 0.1 0.8 αK-SEQ 72.0 0.1 1.0 66.2 0.1 1.3 60.2 0.1 0.6 α 73.0 0.1 - 67.5 0.1 - 60.8 0.1 -

Without Replacement RRS 75.7 0.1 0.8 70.5 0.1 1.3 63.3 0.1 0.1 α 76.5 0.1 - 71.8 0.1 - 63.4 0.1 -

Greedy Verify 78.4 0.1 0.1 73.2 0.1 0.1 66.1 0.1 0.0 α 78.4 0.1 - 73.1 0.1 - 66.1 0.1 -

Eagle-0.24B Vicuna-7B

With Replacement

RRS 63.4 0.2 1.0 56.7 0.1 1.1 32.9 0.2 0.2 K-SEQ 63.9 0.2 0.5 57.0 0.1 0.8 33.0 0.2 0.1 αK-SEQ 63.7 0.1 0.7 57.1 0.1 0.7 32.9 0.2 0.1 α 64.4 0.1 - 57.8 0.1 - 33.1 0.2 -

Without Replacement RRS 70.9 0.2 0.6 63.4 0.1 0.8 36.9 0.2 0.4 α 71.5 0.1 - 64.2 0.1 - 36.4 0.2 -

Greedy Verify 72.6 0.2 0.2 65.7 0.1 0.1 39.5 0.2 0.1 α 72.8 0.1 - 65.8 0.1 - 39.5 0.2 -

Eagle-0.26B Qwen2-7B

With Replacement

RRS 59.6 0.2 1.0 46.7 0.1 1.6 38.3 0.1 0.4 K-SEQ 59.9 0.2 0.8 47.2 0.1 1.1 38.3 0.1 0.4 αK-SEQ 59.9 0.1 0.7 47.3 0.1 1.0 38.3 0.1 0.3 α 60.7 0.1 - 48.3 0.1 - 38.7 0.1 -

Without Replacement RRS 68.3 0.2 1.1 52.4 0.1 1.7 43.9 0.1 0.1 α 69.4 0.1 - 54.1 0.1 - 44.0 0.1 -

Greedy Verify 69.9 0.2 0.0 54.0 0.1 0.0 45.5 0.1 0.1 α 70.0 0.1 - 53.9 0.1 - 45.4 0.1 -

and the Qwen family, we select Vicuna-7B-v1.3 and Qwen2-7B-Instruct as target models, and we use paired draft models provided by EAGEL (Li et al., 2024), with 0.24B parameters and 0.26B parameters, respectively.

Unless otherwise specified, we use a default generation temperature of 0.7 and a draft token number of 3. The total computational cost is less than 50 GPU hours on RTXA6000.

6.1 MAIN EXPERIMENT

In the main experiment, we compare the acceptance rates of different MDSD methods across various LLMs and tasks. The results are shown in Table 1. We observe that the existing verify methods, RRS and K-SEQ, still have gaps compared to the theoretical acceptance rate upper bound. Sampling without replacement achieves higher acceptance rates than sampling with replacement, both in terms of the theoretical upper bound and the existing verification algorithms. We can attribute this to the fact that sampling with replacement may lead to duplicate draft tokens, which are less helpful for acceleration. The greedy method obtains the highest acceptance rate, but this is not always the case, as we will see in the ablation study below that the greedy method performs worse when the temperature is 1.

Published as a conference paper at ICLR 2025

Greedy Optimal Without Replacement Optimal With Replacement Optimal K-SEQ Theory Greedy Without Replacement RRS With Replacement RRS K-SEQ

0.1 0.3 0.5 0.7 0.9

Temperatures

(a) WMT 14 De-En

0.1 0.3 0.5 0.7 0.9

Temperatures

(b) CNN-Daily Mail

0.1 0.3 0.5 0.7 0.9

Temperatures

(c) Alpaca Figure 1: Comparison of acceptance rate α for different temperatures across datasets.

Greedy Optimal Without Replacement Optimal With Replacement Optimal K-SEQ Theory Greedy Without Replacement RRS With Replacement RRS K-SEQ

(a) WMT 14 De-En

2 4 6 8 10 0.5

(b) CNN-Daily Mail

(c) Alpaca Figure 2: Comparison of acceptance rate α for different number of drafts across datasets.

6.2 ABLATION STUDY I: IMPACT OF TEMPERATURE

We study the impact of different temperatures on the acceptance rates. The temperature affects the distributions of the target model and the draft model, even if the logits remain unchanged. It also affects the output text during the sampling process, resulting in different responses. Figure 1 shows the results. We use LLa MA-7B as the target model and LLa MA-68M as the draft model for our ablation studies. We can have the following observations:

The impact of temperature is non-monotonic. Moreover, different methods respond differently to temperature changes. At low temperatures, all methods fall into two categories. The first includes methods that allow duplicate tokens. When T = 0, these methods essentially have only one effective draft token, the one with the largest logits on the draft model. The second includes methods that prevent duplicate tokens. When T = 0, these methods always select the top n tokens on the draft model. The gap between the optimal acceptance rate and acceptance rates for previously existing verification methods, RRS and K-SEQ, gradually increases as the temperature rises. As temperature increases, the gap between methods with replacement sampling and methods without replacement sampling decreases. We can attribute this to the fact that, at high temperatures, the probability distribution is less concentrated, making with replacement sampling strategies have less probability to generate duplicate tokens.

6.3 ABLATION STUDY II: IMPACT OF NUMBER OF DRAFTS

We investigate the impact of different numbers of drafts on the acceptance rates. The results are shown in Figure 2. We have the following observations:

As the number of drafts increases, the coverage of the target model s possible outputs improves, therefore leading to better acceptance rate. This trend holds for all methods. The draft sampling strategy significantly impacts the benefits derived from an increase in the number of drafts. Sampling without replacement generally benefit more from an increase in drafts compared to sampling with replacement. This is because sampling with replacement can lead to redundant drafts, which do not fully leverage the advantages of increasing the number of drafts.

Published as a conference paper at ICLR 2025

Table 2: Acceptance rates of different MDSD methods on MT-Bench based on Eagle framework.

Method # Drafts = 2, # Steps = 4 # Drafts = 4, # Steps = 3 EAGLE default sparse tree

α Speed α Speed α Speed

RRS w/ replacement 75.3 0.3 - 78.4 0.3 - 74.7 0.3 - RRS w/o replacement 79.4 0.3 1.04 ( 0.02) 80.4 0.3 1.03 ( 0.01) 76.8 0.3 1.04 ( 0.02) Spec Hub 84.0 0.3 1.11 ( 0.02) - - - - Greedy 84.7 0.3 1.13 ( 0.02) 88.8 0.2 1.17 ( 0.01) 79.1 0.3 1.08 ( 0.02)

RRS w/ replacement 78.8 0.3 - 84.7 0.3 - 76.2 0.3 - RRS w/o replacement 82.4 0.3 1.07 ( 0.02) 88.6 0.2 1.07 ( 0.01) 77.6 0.3 1.05 ( 0.02) Spec Hub 82.3 0.3 1.02 ( 0.02) - - - - Greedy 82.8 0.3 1.04 ( 0.02) 90.0 0.2 1.09 ( 0.01) 78.3 0.3 1.01 ( 0.02)

RRS w/ replacement 76.7 0.3 - 83.5 0.3 - 72.1 0.3 - RRS w/o replacement 76.4 0.3 1.00 ( 0.02) 85.3 0.3 1.05 ( 0.01) 74.1 0.3 1.03 ( 0.02) Spec Hub 79.5 0.3 1.01 ( 0.02) - - - - Greedy 79.2 0.3 1.02 ( 0.02) 87.8 0.2 1.08 ( 0.01) 72.9 0.3 0.97 ( 0.02)

The gap between the optimal acceptance rate and acceptance rates for previously existing verification methods, RRS and K-SEQ, gradually increases as the number of drafts rises.

6.4 EVALUATING THE GREEDY SAMPLING METHOD ON GENERATION TASKS

In this section, we evaluate the effectiveness and generation efficiency of the proposed Greedy draft sampling method (Section 5) on real-world generation tasks and compare it with other MDSD methods.

We implement the Greedy method within the EAGLE Framework (Li et al., 2024), which supports multi-step MDSD with a draft tree structure. We experiment with three types of tree structures: (1) drafts = 2, depths = 4; (2) drafts = 4, depths = 3; and (3) a sparse tree with up to 4 drafts and 5 steps, which is the default setting in EAGLE. We conduct experiments on the MT-Bench dataset (Zheng et al., 2023) using Vicuna-7B-v1.3 (Chiang et al., 2023) as the target model and its corresponding Eagle model with 0.24B parameters as the draft model.

Table 3 presents the results. As discussed in Section 5.3, the Greedy method and Spec Hub have equal acceptance rates when the number of draft tokens is 2. Our experiments confirm this theoretical insight, showing no statistically significant difference between the two methods for any temperature.

The Greedy method demonstrates improved performance at low temperatures. For example, at T=0.1, it achieves a higher acceptance rate compared to RRS without replacement, leading to faster generation. However, as the temperature increases, the performance gain of the Greedy method diminishes. This observation is consistent with the ablation study in Figure 1.

7 CONCLUSION

In this paper, we studied the acceptance rate of Multi-Draft Speculative Decoding (MDSD).

On the theoretical side, we discovered an equivalence between the optimal acceptance rate and a subset selection problem. We also provided efficient methods to compute the optimal acceptance rate for common draft distributions.

On the practical side, for the first time, we measured the optimal acceptance rate under real text distributions and quantified the gap between existing algorithms and the optimal acceptance rate.

Furthermore, we proposed a practical greedy draft construction method that, in some cases, achieves an even higher acceptance rate than sampling without replacement.

We hope that our work will stimulate further research on improving the efficiency of large language model inference and make these powerful models more accessible and applicable in real-world scenarios.

Published as a conference paper at ICLR 2025

ACKNOWLEDGMENT

This work was partially supported by NSF IIS 2347592, 2348169, DBI 2405416, CCF 2348306, CNS 2347617.

Norman Biggs. Algebraic graph theory. Number 67. Cambridge university press, 1993.

Ondˇrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the ninth workshop on statistical machine translation, pp. 12 58, 2014.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877 1901. Curran Associates, Inc., 2020.

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. ar Xiv preprint ar Xiv:2401.10774, 2024.

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. ar Xiv preprint ar Xiv:2302.01318, 2023a.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. ar Xiv preprint ar Xiv:2107.03374, 2021.

Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. Sequoia: Scalable, robust, and hardware-aware speculative decoding. ar Xiv preprint ar Xiv:2402.12374, 2024.

Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, and Kevin Chen-Chuan Chang. Cascade speculative drafting for even faster llm inference. ar Xiv preprint ar Xiv:2312.11462, 2023b.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna, March 2023.

Frederic G Commoner. A sufficient condition for a matrix to be totally unimodular. Networks, 3(4): 351 365, 1973.

Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D Lee, and Di He. Rest: Retrieval-based speculative decoding. ar Xiv preprint ar Xiv:2311.08252, 2023.

Karl Moritz Hermann, Tom as Kocisk y, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Twenty-eighth Conference on Neural Information Processing Systems, pp. 1693 1701, 2015.

Frank L Hitchcock. The distribution of a product from several sources to numerous localities. Journal of mathematics and physics, 20(1-4):224 230, 1941.

Alan J Hoffman and Joseph B Kruskal. Integral boundary points of convex polyhedra. 50 Years of Integer Programming 1958-2008: From the Early Years to the State-of-the-Art, pp. 49 76, 2010.

Published as a conference paper at ICLR 2025

Zhengmian Hu and Heng Huang. Accelerated speculative sampling based on tree monte carlo. In Forty-first International Conference on Machine Learning, 2024.

Satoru Iwata. Submodular function minimization. Mathematical Programming, 112:45 64, 2008.

Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, and Christopher Lott. Recursive speculative decoding: Accelerating llm inference via sampling without replacement. ar Xiv preprint ar Xiv:2402.14160, 2024.

Ashish J Khisti, Arash Behravesh, Hassan Dbouk, Arash Behboodi, Roland Memisevic, and Christos Louizos. Importance weighted multi-draft speculative sampling. In ICML 2024 Workshop on Theoretical Foundations of Foundation Models, 2024.

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274 19286. PMLR, 2023.

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. ar Xiv preprint ar Xiv:2401.15077, 2024.

Bei Luo, Raymond YK Lau, Chunping Li, and Yain-Whar Si. A critical review of state-of-the-art chatbot designs and applications. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12(1):e1434, 2022.

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 932 949, 2024.

Giovanni Monea, Armand Joulin, and Edouard Grave. Pass: Parallel speculative sampling. ar Xiv preprint ar Xiv:2311.13581, 2023.

Jie Ou, Yueming Chen, and Wenhong Tian. Lossless acceleration of large language model via adaptive n-gram parallel decoding. ar Xiv preprint ar Xiv:2404.08698, 2024.

Alexander Schrijver et al. Combinatorial optimization: polyhedra and efficiency, volume 24. Springer, 2003.

Benjamin Spector and Chris Re. Accelerating llm inference with staged speculative decoding. ar Xiv preprint ar Xiv:2308.04623, 2023.

Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, and Beidi Chen. Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. ar Xiv preprint ar Xiv:2404.11912, 2024a.

Ryan Sun, Tianyi Zhou, Xun Chen, and Lichao Sun. Spec Hub: Provable acceleration to multi-draft speculative decoding. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 20620 20641, Miami, Florida, USA, November 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.1148.

Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, and Ananda Theertha Suresh. Block verification accelerates speculative decoding. In Workshop on Efficient Systems for Foundation Models II@ ICML2024, 2024c.

Ziteng Sun, Jae Hun Ro, Ahmad Beirami, and Ananda Theertha Suresh. Optimal block-level draft verification for accelerating speculative decoding. ar Xiv preprint ar Xiv:2403.10444, 2024d.

Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. Spectr: Fast speculative decoding via optimal transport. Advances in Neural Information Processing Systems, 36, 2024e.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model, 2023.

Published as a conference paper at ICLR 2025

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ee Lacroix, Baptiste Rozi ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. ar Xiv preprint ar Xiv:2302.13971, 2023.

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. ar Xiv preprint ar Xiv:2407.10671, 2024a.

Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, and Furu Wei. Inference with reference: Lossless acceleration of large language models. ar Xiv preprint ar Xiv:2304.04487, 2023.

Sen Yang, Shujian Huang, Xinyu Dai, and Jiajun Chen. Multi-candidate speculative decoding. ar Xiv preprint ar Xiv:2401.06706, 2024b.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. ar Xiv preprint ar Xiv:2205.01068, 2022.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595 46623, 2023.

Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-Franc ois Kagy, and Rishabh Agarwal. Distillspec: Improving speculative decoding via knowledge distillation. ar Xiv preprint ar Xiv:2310.08461, 2023.

Published as a conference paper at ICLR 2025

A APPROXIMATE SOLUTIONS

A.1 RECURSIVE REJECTION SAMPLING (RRS)

Yang et al. (2024b) and Jeon et al. (2024) use the Recursive Rejection Sampling method. Define the residual distribution Resp q Σ for p, q Σ as:

Resp q(i) = (p(i) q(i))+ P

z Σ(p(z) q(z))+ (22)

When p = q, Resp q can be defined as an arbitrary distribution.

For pdraft from sampling with replacement, the RRS algorithm is recursively defined as:

πRRS,w p,pdraft(i| i) = eπRRS,w p,q (i| i) (23)

eπRRS,w p,q (i| i) =

( min( p( i1)

q( i1), 1) i = i1 (1 p( i1)

q( i1))+eπRRS,w Resp q,q(i| i2:) i = i1 (24)

and eπRRS,w p,q (i|()) = p(i) (25)

Here i2: denotes the sequence i with the first element removed.

For pdraft from sampling without replacement, the RRS algorithm is defined as:

πRRS,w p,pdraft(i| i) = eπRRS,wo p,q (i| i) (26)

eπRRS,wo p,q (i| i) =

( min( p( i1)

q( i1), 1) i = i1 (1 p( i1)

q( i1))+eπRRS,wo Resp q,q i1(i| i2:) i = i1 (27)

and eπRRS,wo p,q (i|()) = p(i) (28)

The acceptance rates are denoted as αRRS,w(p, pdraft) and αRRS,wo(p, pdraft), respectively.

Sun et al. (2024e) proposed the K-SEQ method to verify drafts sampled with replacement. Define

βp,q(ρ) = X

i Σ min(p(i)

ρ , q(i)) (29)

and let ρ be the solution to the equation

1 (1 βp,q(ρ))n = ρβp,q(ρ) (30)

The K-SEQ algorithm is defined as:

πK-SEQ p,pdraft(i| i) = eπK-SEQ p,q,ρ (i| i) (31)

eπK-SEQ p,q,ρ (i| i) = (1 p( i1)

ρq( i1))+eπK-SEQ p,q,ρ (i| i2:) +

( min( p( i1)

ρq( i1), 1) i = i1 0 i = i1 (32)

eπK-SEQ p,q,ρ (i|()) = p(i) min n q(i), p(i)

ρ o 1 (1 βp,q(ρ))n

βp,q(ρ) (1 βp,q(ρ))n (33)

The acceptance rate is denoted as αK-SEQ(p, pdraft) = 1 (1 βp,q(ρ))n, which is theoretically guaranteed to achieve a (1 e 1)-approximation of the optimal acceptance rate.

Published as a conference paper at ICLR 2025

B RELATED WORKS

B.1 DRAFT MODEL DESIGN

Numerous studies have explored the design of better draft models for speculative decoding. In principle, any autoregressive probabilistic model can serve as a draft model. The simplest approaches include using n-gram models (Ou et al., 2024) or document retrieval as draft models (Yang et al., 2023; He et al., 2023). Small transformer-based language models have also been employed (Leviathan et al., 2023; Chen et al., 2023a), often with distillation techniques to further increase the overlap between the draft and target models (Zhou et al., 2023).

The design of a good draft model involves a trade-off between its similarity to the target model and its computational complexity. More complex draft models lead to higher acceptance rates due to their closer resemblance to the target model, but they also incur higher computational overhead. To achieve a better trade-off, some works have proposed reusing the target model s computational results. For example, Monea et al. (2023) use the original model with look ahead tokens, while Cai et al. (2024) add new heads to the last hidden layer of the original model to predict tokens further ahead. Li et al. (2024) reuse the last layer hidden state computation of the large model and introduce a new attention layer to predict the next token. Sun et al. (2024a) employ the target model with a partial key-value cache as the draft model.

B.2 MULTI-DRAFT SPECULATIVE DECODING

Many related works on Multi-Draft Speculative Decoding (MDSD) have been introduced in other sections. This paper focuses on the single-step Multi-Draft scenario. When MDSD generates multiple steps, with each step involving multiple drafts, it forms a tree structure. Sequoia (Chen et al., 2024) propose a dynamic programming algorithm to search for the optimal tree topology.

As the tree grows deeper, the acceptance probability of certain branches decreases. Cascade Speculative Drafting (Chen et al., 2023b) addresses this issue by assigning the largest draft model to generate draft tokens at shallower levels, which are more likely to be accepted, and gradually using smaller models to generate drafts for less relevant branches.

Khisti et al. (2024) studied the optimal acceptance rate for special case of sampling with replacement for n = 2 drafts, and obtained the following result:

α (p, pdraft) = min H Σ

This is essentially the same as our result (8) under this special case. However, our theory is more general, without any assumption on the draft sampling methods or the number of draft tokens.

B.3 MULTI-STEP SPECULATIVE DECODING

The basic single-step, single-draft speculative decoding, as introduced in Section 2.1, can be applied to multiple steps, with each step having only one draft and an independent verification process (Leviathan et al., 2023; Chen et al., 2023a). However, such an approach of repeatedly applying single-step verification is not optimal for the multi-step scenario. Some works, such as Sun et al. (2024d); Hu & Huang (2024); Sun et al. (2024c), have designed better verification algorithms specifically for the multi-step setting. These algorithms are tailored for the multi-step scenario while remaining compatible with the single-step case, reducing to the basic speculative sampling algorithm when applied to a draft sequence of length 1.

Multi-step speculative decoding and multi-draft speculative decoding represent different directions for improvement.

As shown in Figure 3, Sun et al. (2024e); Khisti et al. (2024) and our work improve speculative decoding from the multi-draft perspective. When there is only a single draft, it reduces to the case in Leviathan et al. (2023); Chen et al. (2023a). On the other hand, Sun et al. (2024d); Hu & Huang

Published as a conference paper at ICLR 2025

Speculative Decoding Leviathan et al. (2023) Chen et al. (2023a)

Sun et al. (2024d) Hu & Huang (2024) Sun et al. (2024c)

Sun et al. (2024e) Khisti et al. (2024) This paper

multi-draft

Figure 3: Different directions for improving speculative decoding.

(2024); Sun et al. (2024c) enhance speculative decoding from the multi-step perspective. When there is only a single step, it reduces to the case in Leviathan et al. (2023); Chen et al. (2023a).

Combining both improvements in the multi-draft and multi-step scenario would be ideal, and could be a direction for future research.

Proof of Lemma 1. Let f1(C) and f2(S) denote the objective function values of (7) and (9), respectively. Let v1 = f1(C ) and v2 = f2(S ) be the optimal objective function values.

First, we show that the optimal solution of (7) is feasible for (9). Define Si, i = C i, i for i Ai and Si, i = 0 for i / Ai. This solution maintains the objective function value, i.e., f2(S) = f1(C ). Therefore, v2 = f2(S ) f2(S) = f1(C ) = v1.

Next, we show that the optimal solution of (9) is feasible for (7). Define pres(i) = p(i) P i Σn S i, i 0 for i Σ and pres draft( i) = pdraft( i) P i Σ S i, i 0 for i Σn. We have P i Σn pres draft( i) = P

i Σ pres(i). Define Ci, i = S i, i + pres(i)pres draft( i) P

i Σ pres(i) S i, i. This solution has a larger objective function value, i.e., f1(C) f1(S ). Therefore, v1 = f1(C ) f1(C) f2(S ) = v2.

Combining the two parts, we have v1 = v2, which proves the equivalence of the two formulations.

Proof of Theorem 3. For i = ( i1, . . . , in) Σn, we have pdraft( i) = Qn j=1 q( ij). The function Q(H) = P i Hn pdraft( i) represents the probability that all n samples drawn with replacement are in the set H. Therefore, Q(H) = (P

x H q(x))n.

Consider the convex function g(x) = xn. To prove the q-convexity of Q, it suffices to show that for all H Σ and x, y Σ \ H with x = y, we have:

(Q(H) + q(x))n Q(x)n

q(x) (Q(H) + q(x) + q(y))n (Q(H) + q(x))n

This can be rewritten as: g(Q(H) + q(x)) g(Q(x))

q(x) g(Q(H) + q(x) + q(y)) g(Q(H) + q(x))

Note that both sides are finite differences of the convex function g. Define a = Q(x), b = Q(x) + q(x), and c = Q(x) + q(x) + q(y). It suffices to show that:

b a g(c) g(b)

This follows directly from the convexity of g.

Published as a conference paper at ICLR 2025

Proof of Theorem 4. The function Q(H) = P i Hn pdraft( i) represents the probability that all n samples drawn without replacement are in the set H. To handle the more complex case of sampling without replacement, we use generating functions.

Define the generating function GH(t) = Q i H(1 + q(i)t) and the coefficient Wn,H = Coefftn GH(t). Note that Q(H) = Wn,H

The coefficients satisfy the following recurrence relation:

Wn,H {x} = Coefftn GH {x}(t) (38)

= Coefftn GH(t) + q(x)t GH(t) (39) = Wn,H + q(x)Wn 1,H (40)

To prove the q-convexity of Q, it suffices to show that for all H Σ and x, y Σ \ H with x = y, we have: Wn,H {x} Wn,H

q(x)Wn,Σ Wn,H {x,y} Wn,H {x}

q(y)Wn,Σ (41)

Applying the recurrence relation, it suffices to show that:

Wn 1,H Wn 1,H {x} (42)

Applying the recurrence relation again, it suffices to show that:

0 q(x)Wn 2,H (43)

This holds because the coefficients of G are always non-negative, i.e., Wn 2,H 0.

Proof of Theorem 5. It suffices to show that for all H Σ and x, y Σ \ H with x = y, we have:

Q(x|H) Q(x|H {y}) (44)

By the q-convexity of Q, we have:

q(x) Q(y|H {x})

q(x) Q(x|H) + Q(y|H {x})

q(x) + q(y) = Q(H {x, y}) Q(H)

q(x) + q(y) Q(y|H {x})

Similarly, by symmetry, we can reverse x and y to obtain:

q(y) Q(H {x, y}) Q(H)

q(x) + q(y) Q(x|H {y})

Therefore, Q(x|H)

q(x) Q(H {x, y}) Q(H)

q(x) + q(y) Q(x|H {y})

This implies that: Q(x|H) Q(x|H {y}) (49)

Proof of Theorem 6. We have:

f(x|H) = p(x) Q(x|H) (50)

= p(x)(1 q(x)

p(x) Q(x|H)

q(x) ) 0 (51)

Published as a conference paper at ICLR 2025

Therefore, f(x|H)

p(x) = 1 q(x)

p(x) Q(x|H)

q(x) 0 (52)

By assumption, q(x)

p(y). By the q-convexity of Q, we have Q(x|H)

q(x) Q(y|H {x})

q(y) . Therefore,

q(y) p(y) Q(y|H {x})

p(x) Q(x|H)

It follows that: f(y|H {x})

p(y) = 1 q(y)

p(y) Q(y|H {x})

p(x) 0 (55)

Proof of Theorem 7. We first prove that the acceptance rate of the greedy method is:

αGreedy(p, pdraft) (56)

i Ai πGreedy p,pdraft(i, i) (57)

i Ai πGreedy p,pdraft(i| i)pdraft( i) (58)

i Topn 1(q)

i Ai πGreedy p,pdraft(i| i)pdraft( i) (59)

i Σ\Topn 1(q)

i Ai πGreedy p,pdraft(i| i)pdraft( i) (60)

i Topn 1(q)

in Σ π p,q Topn 1(q)(i| in)q Topn 1(q)( in) (61)

i Σ\Topn 1(q) πGreedy p,pdraft(i|(Topn 1(q), i))q Topn 1(q)(i) (62)

i Topn 1(q) p(i) (63)

i Σ\Topn 1(q) π p,q Topn 1(q)(i|i)q Topn 1(q)(i) (64)

i Topn 1(q) p(i) + X

i Σ\Topn 1(q) min(p(i), q Topn 1(q)(i)) (65)

Note that P

i Topn 1(q) min(p(i), q Topn 1(q)(i)) = P

i Topn 1(q) min(p(i), 0) = 0.

Next, we compute the optimal acceptance rate. Note that when Topn 1(q) H, we must have Q(H) = 0. When Topn 1(q) H, we have Q(H) = P

i H q Topn 1(q)(i). Therefore,

α (p, pdraft) (66) =1 + min H Σ P(H) Q(H) (67)

=1 + min H Σ,s.t. Topn 1(q) H

i H p(i) q Topn 1(q)(i) (68)

The optimal set is H = {i Σ|q Topn 1(q)(i) p(i)} Topn 1(q). In this case,

α (p, pdraft) (69)

Published as a conference paper at ICLR 2025

i Σ (q Topn 1(q)(i) p(i))+ + X

i Topn 1(q) p(i) (70)

i Σ min(p(i), q Topn 1(q)(i))+ + X

i Topn 1(q) p(i) (71)

C.1 DERIVATION OF THE DUAL PROBLEM

We start from the primal problem (9):

max S RΣ Σn X

i Σn Si, i p(i) i Σ

i Σ Si, i pdraft( i) i Σn

Si, i 0 i Σ, i Σn

Si, i = 0 i Σ, i / Ai

We introduce dual variables yi for each constraint P i Σn Si, i p(i) and z i for each constraint P

i Σ Si, i pdraft( i). The Lagrangian function is:

L(S, y, z) = X

i Σn Si, i (73)

i Σ yi(p(i) X

i Σn Si, i) (74)

i Σn z i(pdraft( i) X

i Σ Si, i) (75)

The dual function is: g(y, z) = max S RΣ Σn L(S, y, z)

s.t. Si, i 0 i Σ, i Σn

Si, i = 0 i Σ, i / Ai

Rearranging the Lagrangian function:

L(S, y, z) = X

i Σn (1 yi z i)Si, i (77)

i Σ yip(i) + X

i Σn z ipdraft( i) (78)

For the dual function to be bounded, we must have:

1 yi z i 0, i Σ, i Ai (79)

Therefore, the dual problem is:

min y RΣ,z RΣn X

i Σ yip(i) + X

i Σn z ipdraft( i)

s.t.yi + z i 1 i Σ, i Ai yi 0 i Σ

This completes the derivation of the dual problem.

Published as a conference paper at ICLR 2025

D ADDITIONAL ILLUSTRATION

Illustration of single-step draft tokens generation and verification:

Input (Context)

Draft Model Pdraft

Generate n Draft Tokens bx(1), . . . , bx(n)

Verification Algorithm Pverify

Output Token

Compute Probabilities Ptarget( )

Target Model Ptarget

Generate Drafts

Draft Tokens Probabilities

Probabilities

Parallel Computation

Pseudo code for apply multi-draft speculative sampling for multiple steps, with arbitrary tree topology.

def m u l t i d r a f t s p e c u l a t i v e d e c o d i n g ( prompt , t r e e t o p o l o g y , draft model , t a r g e t m o d e l ) : Multi Draft S p e c u l a t i v e Decoding algorithm f o r a c c e l e r a t i n g language model i n f e r e n c e .

Example t r e e t o p o l o g y : t r e e t o p o l o g y = [ [0] , [1] , [2] , [3] , # F i r s t l e v e l : 4 branches [0 ,0] , [0 ,1] , [0 ,2] , [1 ,0] , [1 ,1] , [2 ,0] , [2 ,1] , [3 ,0] , # Second l e v e l [0 ,0 ,0] , [0 ,0 ,1] , [0 ,0 ,2] , [0 ,1 ,0] , [0 ,1 ,1] , [0 ,2 ,0] , [0 ,2 ,1] , [1 ,0 ,0] , # Third l e v e l [0 ,0 ,0 ,0] , [0 ,0 ,0 ,1] , [0 ,0 ,0 ,2] , # Fourth l e v e l [0 ,0 ,0 ,0 ,0] , [0 ,0 ,0 ,0 ,1] # F i f t h l e v e l ] This i s the d e f a u l t EAGLE t r e e s t r u c t u r e where each number r e p r e s e n t s which d r a f t token to use at each l e v e l .

Published as a conference paper at ICLR 2025

# I n i t i a l i z e d i c t i o n a r i e s to s t o r e d r a f t s and d i s t r i b u t i o n s d r a f t s = {} d r a f t d i s t r i b u t i o n s = {} t a r g e t d i s t r i b u t i o n s = {}

# Generate d r a f t s f o r each p r e f i x in the t r e e topology for p r e f i x in [ [ ] ] + t r e e t o p o l o g y : c h i l d r e n = g e t c h i l d r e n ( t r e e t o p o l o g y , p r e f i x ) i f not c h i l d r e n : continue # Skip i f no expandable c h i l d r e n paths # Generate d r a f t s and corresponding d i s t r i b u t i o n s # e . g . sampling with / without replacement or S e c t i o n 5.1 ( d r a f t s [ tuple ( p r e f i x ) ] , d r a f t d i s t r i b u t i o n s [ tuple ( p r e f i x ) ] ) = g e n e r a t e d r a f t t o k e n s ( draft model , p r e f i x )

# Compute p r o b a b i l i t y d i s t r i b u t i o n s from t a r g e t model f o r a l l d r a f t s t a r g e t d i s t r i b u t i o n s = c o m p u t e t a r g e t d i s t r i b u t i o n s ( target model , d r a f t s )

# S t a r t v e r i f i c a t i o n and g e n e r a t i o n process p r e f i x = [ ] while True : i f tuple ( p r e f i x ) not in d r a f t s : break # End i f no a v a i l a b l e d r a f t s # e . g . RRS or K Seq or S e c t i o n 5.2 o u t p u t t o k e n = v e r i f i c a t i o n ( d r a f t s [ tuple ( p r e f i x ) ] , d r a f t d i s t r i b u t i o n s [ tuple ( p r e f i x ) ] , t a r g e t d i s t r i b u t i o n s [ tuple ( p r e f i x ) ] ) p r e f i x . append ( o u t p u t t o k e n )

# Return the generated sequence return p r e f i x

def g e t c h i l d r e n ( t r e e t o p o l o g y , p r e f i x ) : Get a l l c h i l d paths in t r e e t o p o l o g y t h a t extend the given p r e f i x by one token .

Examples : t r e e t o p o l o g y = [ [ 0 ] , [1] , [0 ,0] , [0 ,1] , [1 ,0]] g e t c h i l d r e n ( [ ] , t r e e t o p o l o g y ) > [ [ 0 ] , [1]] g e t c h i l d r e n ( [ 0 ] , t r e e t o p o l o g y ) > [ [ 0 , 0 ] , [0 ,1]] g e t c h i l d r e n ( [ 1 ] , t r e e t o p o l o g y ) > [ [ 1 , 0 ] ] g e t c h i l d r e n ( [ 0 , 0 ] , t r e e t o p o l o g y ) > [] return [ path for path in t r e e t o p o l o g y i f len ( path ) == len ( p r e f i x ) + 1 and path [ : len ( p r e f i x ) ] == p r e f i x ]

Published as a conference paper at ICLR 2025

Unfolded definition of verify algorithm for greedy draft construction.

πGreedy p,pdraft(i| i)

=π p,q Topn 1(q)(i| in)

min( p(i)(1 P

j Topn 1(q) q(j))

q(i) , 1) i = in

(1 p( in)(1 P

j Topn 1(q) q(j))

q( in) )+ p(i)(1 P

j Topn 1(q) q(j)) P

z Σ(p(z)(1 P

j Topn 1(q) q(j)) I(z / Topn 1(q))q(z))+ i Topn 1(q)

(1 p( in)(1 P

j Topn 1(q) q(j))

q( in) )+ (p(i)(1 P

j Topn 1(q) q(j)) q(i))+ P

z Σ(p(z)(1 P

j Topn 1(q) q(j)) I(z / Topn 1(q))q(z))+ i = in, i / Topn 1(q)

E SUMMARY OF NOTATIONS

Σ: The vocabulary set

Σ: The probability simplex over vocabulary Σ

[n]: The set 1,...,n

Ptarget( |x1, x2, ..., xm): The target model, a probabilistic model that predicts the probability of the next word given the context

Pdraft( |x1, x2, ..., xm): The draft model used to generate candidate tokens

Pverify( |bx(1), . . . bx(n)): The verification algorithm that selects the final output token from the draft tokens

p( ) = Ptarget( |x1, x2, ..., xm): Shorthand for the target distribution

pdraft: The distribution of draft tokens

π Π(p, pdraft): A joint distribution with marginal distributions p and pdraft π p,pdraft Π(p, pdraft): The optimal transport joint distribution

α (p, pdraft): The optimal acceptance rate

Ai := { i Σn| j [n], ij = i}: The incidence set for token i

q( ): The shorthand notation of the output distribution of the draft model

q i1,..., ij 1(x): The probability of token x when sampling without replacement, excluding previously selected tokens

Resp q Σ: The residual distribution

πRRS,w p,pdraft, πRRS,wo p,pdraft : The RRS verification algorithms for with/without replacement sampling

αRRS,w(p, pdraft), αRRS,wo(p, pdraft): Acceptance rates for RRS with/without replacement

βp,q(ρ): A function used in the K-SEQ algorithm

πK-SEQ p,pdraft: The K-SEQ verification algorithm

αK-SEQ(p, pdraft): Acceptance rate for the K-SEQ algorithm

i H p(i): Sum of target probabilities over set H

Q(H) = P i Hn pdraft( i): Sum of draft probabilities over set Hn

f(H) = P(H) Q(H): Difference between target and draft probabilities over set H

πGreedy p,pdraft: The greedy verification algorithm

αGreedy(p, pdraft): Acceptance rate for the greedy draft sampling method

Ci,j: Matrix representation of joint distribution

(p(i) pdraft(i))+: The positive part of the difference

GH(t): Generating function defined as Q

i H(1 + q(i)t)

Wn,H: Coefficient of tn in GH(t)

Published as a conference paper at ICLR 2025

f(x|H): The marginal value of element x with respect to set H

Topn 1(q): The top n-1 tokens according to probability in q

S RΣ Σn: Variables in the equivalent LP formulation

y RΣ, z RΣn: Dual variables

π( |j): The conditional distribution given j

πi,j: Individual elements of the joint distribution matrix

i2:: The sequence i with the first element removed

Hi: Sets constructed by adding elements one by one

σ: An ordering of Σ used in the efficient computation algorithm

Coefftn: Coefficient of tn in a generating function

F ADDITIONAL EXPERIMENTS

Table 3: Average generation length τ of different MDSD methods and their ratio on MT-Bench based on Eagle framework.

Method # Drafts = 2, # Steps = 4 # Drafts = 4, # Steps = 3 EAGLE default sparse tree

RRS w/ replacement 3.04 0.02 - 2.83 0.02 - 3.19 0.03 - RRS w/o replacement 3.27 0.02 1.07 ( 0.02) 2.96 0.02 1.05 ( 0.01) 3.42 0.03 1.07 ( 0.02) Spec Hub 3.63 0.02 1.19 ( 0.02) - - - - Greedy 3.62 0.02 1.19 ( 0.02) 3.39 0.01 1.20 ( 0.02) 3.70 0.03 1.16 ( 0.02)

RRS w/ replacement 3.22 0.02 - 3.11 0.01 - 3.41 0.02 - RRS w/o replacement 3.52 0.02 1.09 ( 0.02) 3.39 0.01 1.09 ( 0.01) 3.71 0.02 1.09 ( 0.02) Spec Hub 3.52 0.02 1.09 ( 0.02) - - - - Greedy 3.52 0.02 1.09 ( 0.02) 3.45 0.01 1.11 ( 0.01) 3.66 0.02 1.07 ( 0.02)

RRS w/ replacement 3.14 0.02 - 3.09 0.01 - 3.22 0.02 - RRS w/o replacement 3.22 0.02 1.02 ( 0.02) 3.25 0.01 1.05 ( 0.01) 3.43 0.02 1.06 ( 0.02) Spec Hub 3.35 0.02 1.07 ( 0.02) - - - - Greedy 3.33 0.02 1.06 ( 0.02) 3.34 0.01 1.08 ( 0.01) 3.33 0.02 1.03 ( 0.02)

Remark 8. For # Drafts = 2, # Steps = 4, and T = 0.6, three methods - RRS without replacement, Spec Hub, and Greedy - show similar average generation lengths. After truncating to two decimal places, they appear to be the same. However, they are actually different numbers: 3.51781, 3.51944, 3.51975.