Published as a conference paper at ICLR 2025

BLOCK VERIFICATION ACCELERATES SPECULATIVE DECODING

Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh
Google Research
{zitengsun, urimend, leviathan, asafaharoni, jaero, beirami, theertha}@google.com

ABSTRACT

Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a fast model to draft a block of tokens which are then verified in parallel by the target model, and provides a guarantee that the output is distributed identically to a sample from the target model. In prior works, draft verification is performed independently token-by-token. Surprisingly, we show that this approach is not optimal. We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly and provides additional wall-clock speedup. We prove that the proposed mechanism is optimal in the expected number of tokens produced each iteration, and specifically is never worse than the standard token-level verification. Empirically, block verification provides modest but consistent wall-clock speedups of 5%-8% over the standard token verification algorithm in a range of tasks and datasets. Given that block verification does not increase code complexity, maintains the strong lossless guarantee of the standard speculative decoding verification algorithm, cannot deteriorate performance, and, in fact, consistently improves it, it can be used as a good default in speculative decoding implementations.

1 INTRODUCTION

Large language models (LLMs) (Chowdhery et al., 2022; Touvron et al., 2023; Achiam et al., 2023; Gemini Team et al., 2023) are often decoded through autoregressive sampling, where generating k tokens requires k costly serial evaluations of the model.
To improve generation latency, Leviathan et al. (2022) proposed speculative decoding, which enables an LLM to generate several tokens concurrently. In each iteration, conditioned on the current decoded prefix, a guess of the next block of γ tokens is made by a fast drafter (e.g., a small model or a heuristic). Each of the resulting γ + 1 prefixes is then evaluated by the large target model in parallel. To guarantee that the final output follows the same distribution as that of the large model, some of the generated tokens are accepted while others are rejected. The accepted tokens1 are then appended to the prefix, and the process repeats until generation ends. See Figure 1 and Algorithm 3.

In Leviathan et al. (2022), the drafts are verified through a sequence of token-level rejection steps. More specifically, given a prefix c, let X1, X2, . . . , Xγ be one sample block of length γ from the draft model Ms, where ∀i ≤ γ, Xi ∼ Ms(· | c, X^{i−1}). Using the conditional distributions under the target large model Mb returned by the parallel evaluation step (∀ 0 ≤ i ≤ γ, Mb(· | c, X^i)), the algorithm iterates over the draft tokens sequentially, and accepts each token Xi with probability

min{1, Mb(Xi | c, X^{i−1}) / Ms(Xi | c, X^{i−1})}.  (1)

All emails are @google.com. 1 With an extra token sampled from either a residual distribution or the large model distribution.

Figure 1: One iteration of speculative decoding (Algorithm 3). Given the prefix "Tomorrow, the weather", the small model drafts "will be mostly cloudy", the large model evaluates the draft in parallel, and verification accepts "will be mostly" and corrects the next token to "sunny". The prefixes and verified tokens are in blue, the unverified tokens from the draft model are in red, and the tokens sampled from the residual distribution are underlined.
The process continues until a token is rejected, at which point an extra token is sampled, for free, according to a residual distribution (see Algorithm 1 and Leviathan et al. (2022) for more details). We refer to this algorithm as Token Verification. Since its introduction in Leviathan et al. (2022), this token-by-token verification procedure has been the standard for follow-up works (see Section 7).

In this work, we make the surprising observation that the standard token verification algorithm is not optimal, and propose a strictly better method. Our key observation is that we can increase the number of accepted tokens, while maintaining the identical distribution guarantee, by jointly verifying the entire block of draft tokens instead of verifying each token independently. Our proposed algorithm, which we call Block Verification, has the following advantages:

- Simple to use. The algorithm is a plug-and-play replacement of the standard token verification algorithm of speculative decoding. It does not incur additional computation or code complexity costs. See Algorithms 1 and 2 for a side-by-side comparison.
- Identical distribution. Importantly, our method is not an approximation and maintains the identical distribution guarantee of speculative decoding (Theorem 1).
- Optimal improvement. With the same drafting model, the speedup of block verification is no worse than that of standard token verification. Moreover, we show that block verification is an optimal verification procedure (Theorem 2).

We empirically test block verification and compare it with the standard token verification on a range of tasks and datasets. We show that our algorithm consistently improves block efficiency (i.e., the expected number of generated tokens per serial call to the target model) by 7%-10% and overall empirical wall clock time by 5%-8% (see Table 1).
Notably, our algorithm provides improvements only through the verification phase of speculative decoding, and hence these improvements can be combined with those obtained from other works that aim at improving the drafting phase. Since these improvements come for free, our block verification algorithm can be used as the draft verification algorithm by default in speculative decoding implementations.

2 A MOTIVATING EXAMPLE

The standard token verification algorithm stochastically rejects draft tokens with a higher probability from Ms than from Mb. This is necessary to guarantee that the generated tokens follow the same distribution as that of Mb. Our main observation is that considering whether to reject a block of draft tokens jointly, instead of one-by-one, can result in accepting more tokens. We now illustrate this through a simple example.

Consider the following trivial language model whose token space consists only of 2 tokens: A and B. Further, assume that both the large model Mb and the small model Ms are context-independent, and specifically that

∀c, Mb(A) = 1/3, Mb(B) = 2/3, Ms(A) = 2/3, Ms(B) = 1/3.  (2)

In this setting, token verification will accept each draft token X independently with probability 1 if X = B and 1/2 if X = A. With a block size of γ = 2, since the total variation (TV) distance d_TV(Mb, Ms) = 1/3, the expected number of accepted tokens2 from Ms with the token verification algorithm is (1 − 1/3) + (1 − 1/3)^2 = 10/9 (see analysis in Leviathan et al. (2022)).

An ideal algorithm with full information. Suppose an algorithm can decide on what tokens to accept from Ms based on the full joint distributions of both tokens, i.e.,

Mb(AA) = 1/9, Mb(AB) = 2/9, Mb(BA) = 2/9, Mb(BB) = 4/9,
Ms(AA) = 4/9, Ms(AB) = 2/9, Ms(BA) = 2/9, Ms(BB) = 1/9.
The algorithm would have performed the following improved acceptance logic: always accept X1X2 when X1X2 = AB, BA, or BB, since Mb(X1X2) ≥ Ms(X1X2), and accept AA with probability Mb(AA)/Ms(AA) = 1/4 (correcting the samples to BB). The expected number of accepted tokens from Ms now becomes:

2 · (Ms(AB) + Ms(BA) + Ms(BB) + 1/4 · Ms(AA)) = 12/9 > 10/9.

This illustrates the benefit of considering the distribution of draft blocks jointly.

Verification with partial information. In general, the full distribution over the next block of tokens is intractable to calculate. Instead, we only have access to the conditional distributions of the next token along the sample path of the draft block, Mb(· | c, X^i), Ms(· | c, X^i), for various i's. To emphasize, the ideal rejection logic does not need access to the full distribution, but care is needed in properly assigning the residual distribution. Our block verification does exactly this, as follows.

For the simple toy example described above, we propose the following improved algorithm, which is a simplified version of the general block verification algorithm stated in Algorithm 2. When the draft tokens X1X2 = AB or BB, Pr(Accept X1X2) = 1, similar to the idealized algorithm. When X1X2 = AA, Pr(Accept X1X2) = 1/4, and else the algorithm rejects both tokens and only corrects the first token to B, since the algorithm doesn't have access to Mb(· | B). When X1X2 = BA, it always accepts B, and then accepts A with probability 1/2 (else it corrects the second token to B). Importantly, the marginal distribution of the generated tokens at both the first and the second position is always Mb(·). Moreover, the algorithm only uses distributions that are conditioned on the sample path of the draft block, and hence it works in the partial information setting. We then simply add the generated tokens to the prefix and proceed to the next iteration.
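The expectations in this toy example can be checked mechanically. The following sketch (our own illustrative code, not from the paper) enumerates the four possible draft blocks with exact fractions and applies each scheme's acceptance rule as described above.

```python
from fractions import Fraction as F

Mb = {"A": F(1, 3), "B": F(2, 3)}   # target model (context-independent)
Ms = {"A": F(2, 3), "B": F(1, 3)}   # draft model

# All draft blocks of length 2 with their probabilities under Ms.
blocks = [(x1 + x2, Ms[x1] * Ms[x2]) for x1 in "AB" for x2 in "AB"]

def token_accepted(b):
    # Token verification: each token accepted independently w.p. min(1, Mb/Ms);
    # the loop stops at the first rejection.
    acc = {t: min(F(1), Mb[t] / Ms[t]) for t in "AB"}
    p1 = acc[b[0]]
    return p1 * (1 - acc[b[1]]) * 1 + p1 * acc[b[1]] * 2

def ideal_accepted(b):
    # Idealized full-information scheme: accept the whole block
    # w.p. min(1, Mb/Ms) computed on the joint block probability.
    ratio = (Mb[b[0]] * Mb[b[1]]) / (Ms[b[0]] * Ms[b[1]])
    return min(F(1), ratio) * 2

def block_accepted(b):
    # Toy block verification from the text: AB, BB always accepted;
    # AA fully accepted w.p. 1/4 (else both rejected);
    # BA keeps B, then keeps A w.p. 1/2.
    rules = {"AB": F(2), "BB": F(2), "AA": F(1, 4) * 2, "BA": F(1) + F(1, 2)}
    return rules[b]

def expectation(scheme):
    return sum(p * scheme(b) for b, p in blocks)

assert expectation(token_accepted) == F(10, 9)
assert expectation(ideal_accepted) == F(12, 9)
assert expectation(block_accepted) == F(11, 9)
```

The three assertions reproduce the 10/9, 12/9, and 11/9 values discussed in this section.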
The expected number of accepted tokens is

2 · (Ms(AB) + Ms(BB)) + (1 + 1/2) · Ms(BA) + 1/4 · 2 · Ms(AA) = 11/9,

which is better than the 10/9 obtained by token verification. This example proves the following result:

Lemma 1. The standard token verification algorithm of speculative decoding is not optimal.

Note that while the expected number of accepted tokens in the example for block verification (11/9) is higher than that of the standard token verification algorithm (10/9), it is still less than that of the ideal algorithm with access to the full distribution (12/9). In Section 4, we show that block verification is indeed optimal in the partial information case, under natural assumptions.

3 BLOCK VERIFICATION

In this section, we extend the above intuition to develop a general block verification algorithm, which works for standard speculative decoding with partial information. The high-level idea is to couple the acceptance of each draft token with other draft tokens. To do this, the algorithm considers draft sub-blocks with different lengths, and decides whether to accept each sub-block independently. The final accepted draft block is the longest accepted sub-block in the above process. The acceptance probabilities for each sub-block and the residual distributions are carefully chosen to maintain the distribution guarantee of the final output, and achieve optimal speedup. See Algorithm 2 for a sketch implementation of block verification, and Algorithm 1 for a sketch implementation of the standard token verification for comparison. Note that the implementations follow the same overall structure (the differences are highlighted).

2 This is different from the number of generated tokens in one iteration, which is the number of accepted tokens plus one (corrected token).

Algorithm 1 Token Verification
Input: Draft block X^γ; small model distributions ∀i < γ, Ms(· | c, X^i); large model distributions ∀i ≤ γ, Mb(· | c, X^i).
1: Sample η1, . . . , ηγ ∼ U(0, 1).
2: Set τ = 0.
3: for i = 1, . . . , γ do
4:   Set h^token_i = min{Mb(Xi | c, X^{i−1}) / Ms(Xi | c, X^{i−1}), 1}.
5:   if ηi ≤ h^token_i then
6:     Set τ = i.
7:   else
8:     break.
9:   end if
10: end for
11: if τ = γ then
12:   Sample Y from Mb(· | c, X^γ).
13: else
14:   Sample Y from p^token_res(· | c, X^τ) as in Equation (3).
15: end if
16: Return X^τ, Y.

Algorithm 2 Block Verification
Input: Draft block X^γ; small model distributions ∀i < γ, Ms(· | c, X^i); large model distributions ∀i ≤ γ, Mb(· | c, X^i).
1: Sample η1, . . . , ηγ ∼ U(0, 1).
2: Set τ = 0, p0 = 1.
3: for i = 1, . . . , γ do
4:   Set pi = min{p_{i−1} · Mb(Xi | c, X^{i−1}) / Ms(Xi | c, X^{i−1}), 1}.
5:   Set h^block_i as in Equation (5).
6:   if ηi ≤ h^block_i then
7:     Set τ = i.
8:   else
9:     continue.
10:  end if
11: end for
12: if τ = γ then
13:   Sample Y from Mb(· | c, X^γ).
14: else
15:   Sample Y from p^block_res(· | c, X^τ) as in Equation (4).
16: end if
17: Return X^τ, Y.

Residual distribution in Algorithm 1 (Line 14):

∀x ∈ X, p^token_res(x | c, X^i) = max{Mb(x | c, X^i) − Ms(x | c, X^i), 0} / Σ_{x' ∈ X} max{Mb(x' | c, X^i) − Ms(x' | c, X^i), 0}.  (3)

Residual distribution in Algorithm 2 (Line 15):

∀x ∈ X, p^block_res(x | c, X^i) = max{p_i · Mb(x | c, X^i) − Ms(x | c, X^i), 0} / Σ_{x' ∈ X} max{p_i · Mb(x' | c, X^i) − Ms(x' | c, X^i), 0}.  (4)

Acceptance probability in Algorithm 2 (Line 5): h^block_γ = p_γ, and when i < γ,

h^block_i = Σ_{x ∈ X} max{p_i · Mb(x | c, X^i) − Ms(x | c, X^i), 0} / (Σ_{x ∈ X} max{p_i · Mb(x | c, X^i) − Ms(x | c, X^i), 0} + 1 − p_i).  (5)

Figure 2: The acceptance probabilities and residual distributions in Algorithms 1 and 2.

Importantly, token verification stops as soon as a token is rejected (the break in Line 8 of Algorithm 1), while block verification always operates on the full block. Equivalently, in token verification τ is the last index before the first rejection (τ = γ if no token is rejected), while in block verification τ = max{i : ηi ≤ h^block_i} is the largest accepted index (τ = 0 if no sub-block is accepted). This difference makes block verification tend to accept longer sub-blocks compared to token verification, resulting in higher block efficiencies.
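The two verification loops and Equations (3)-(5) can be sketched in a few lines of NumPy. The function and variable names below are our own (the paper's reference sketches are in its Appendix A): `qs[i]` and `ps[i]` stand for the per-step distributions Ms(· | c, X^i) and Mb(· | c, X^i) as arrays over the vocabulary, and edge cases (e.g., a zero residual mass when Mb = Ms) are not handled.

```python
import numpy as np

def token_verify(x, qs, ps, rng):
    """Algorithm 1 sketch. x: draft tokens (length gamma);
    qs[i] = Ms(.|c, x[:i]) for i < gamma; ps[i] = Mb(.|c, x[:i]) for i <= gamma."""
    gamma = len(x)
    tau = 0
    for i in range(gamma):
        h = min(ps[i][x[i]] / qs[i][x[i]], 1.0)
        if rng.random() <= h:
            tau = i + 1                  # token i+1 accepted
        else:
            break                        # stop at the first rejection
    if tau == gamma:
        res = ps[gamma]
    else:                                # Equation (3): normalized max{Mb - Ms, 0}
        res = np.maximum(ps[tau] - qs[tau], 0.0)
        res = res / res.sum()
    return x[:tau], rng.choice(len(res), p=res)

def block_verify(x, qs, ps, rng):
    """Algorithm 2 sketch: scan the full block, keep the longest accepted sub-block."""
    gamma = len(x)
    tau, p = 0, 1.0
    p_at = [1.0]                         # p_i values along the sample path
    for i in range(gamma):
        p = min(p * ps[i][x[i]] / qs[i][x[i]], 1.0)
        p_at.append(p)
        if i + 1 == gamma:
            h = p                        # h_gamma = p_gamma
        else:                            # Equation (5)
            num = np.maximum(p * ps[i + 1] - qs[i + 1], 0.0).sum()
            h = num / (num + 1.0 - p)
        if rng.random() <= h:
            tau = i + 1                  # no break: later sub-blocks may still win
    if tau == gamma:
        res = ps[gamma]
    else:                                # Equation (4): normalized max{p_tau*Mb - Ms, 0}
        res = np.maximum(p_at[tau] * ps[tau] - qs[tau], 0.0)
        res = res / res.sum()
    return x[:tau], rng.choice(len(res), p=res)
```

On the toy models of Equation (2) with γ = 2, simulating both functions recovers, approximately, the expected acceptance counts 10/9 and 11/9 from Section 2.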
In Figure 3, we plot the empirical complementary CDF of the acceptance length for both algorithms with the toy models introduced in Equation (2) to demonstrate this. See Algorithm 3 for the outer loop of the speculative decoding algorithm, which remains unchanged for both verification methods. See Figure 2 for additional definitions. See Appendix A for sketch Python implementations. Due to the simplicity of the change, the algorithm can be easily implemented without incurring additional code complexity in practical systems.

Figure 3: Empirical complementary CDF of τ for both algorithms with draft length γ = 10. The draft and target models are the context-independent toy models introduced in Equation (2).

Algorithm 3 Speculative decoding (SPECDEC) (Leviathan et al., 2022)
Input: Prefix c, target model Mb, draft model Ms, draft length γ, verification algorithm VERIFY.
1: while E.O.S. ∉ (X^τ, Y) do
2:   Sample X1, . . . , Xγ ∼ Ms(· | c) using autoregressive sampling, keeping the conditional probabilities at each step, Ms(· | c, X^i) for i = 0, . . . , γ − 1. {Obtain draft block.}
3:   Call the large model Mb and compute the conditional probabilities Mb(· | c, X^i) for i = 0, 1, . . . , γ in parallel. {Parallel scoring.}
4:   Get the accepted tokens with draft verification: X^τ, Y = VERIFY(X^γ, {Ms(· | c, X^i)}^{γ−1}_{i=0}, {Mb(· | c, X^i)}^{γ}_{i=0}). {Draft verification and correction.}
5:   c ← (c, X^τ, Y). {Add decoded tokens to the prefix.}
6: end while

Theoretical guarantees. Speculative decoding with block verification preserves the distribution of its outputs (Theorem 1). Moreover, block verification achieves the optimal speedup among all valid draft verification algorithms in the outer loop of speculative decoding (Algorithm 3), resulting in a strict improvement over the standard token verification (Theorem 2).
We defer the formal statements and the intuitions behind the parameter choices in Algorithm 2 to Section 4.

4 THEORETICAL GUARANTEES

In this section, we present the formal theoretical guarantees of block verification: namely, that it produces the correct distribution and that it is optimal in terms of the expected number of generated tokens. Let M^∞(· | c) denote the distribution of the sequence up to the end of the generative process under model M and context c.

Definition 1 (Valid draft verification algorithm). A draft verification algorithm VERIFY takes the draft block X^γ, and the small model and large model distributions along the sample path, namely ∀i < γ, Ms(· | c, X^i) and ∀i ≤ γ, Mb(· | c, X^i), as inputs, and outputs a prefix X^τ, τ ≤ γ, of X^γ, and an additional token Y. VERIFY is said to be a valid draft verification algorithm if ∀c, models Ms, Mb, and block length γ, the outputs of Algorithm 3 (SPECDEC) with verification algorithm VERIFY satisfy

SPECDEC(c, Mb, Ms, γ, VERIFY) =_p M_b^∞(· | c),3  (6)

i.e., the distribution of the outputs is preserved. Note for example that the standard token verification algorithm is a valid draft verification algorithm (Appendix A.1 in Leviathan et al. (2022)).

3 We use =_p to denote that two distributions are the same.

We now claim the following:

Theorem 1. Block verification (Algorithm 2) is a valid draft verification algorithm. In other words, speculative decoding with block verification preserves the distribution of the output sequence.

We now further claim that block verification is optimal among all valid draft verification algorithms4.

Theorem 2. For i > 0, let N(i) be the number of decoded tokens after i iterations in Algorithm 3. For any valid draft verification algorithm VERIFY in Definition 1, we have ∀c, Ms, Mb, γ, and i,

E_BLOCKVERIFY[N(i)] ≥ E_VERIFY[N(i)],

where the randomness is over the randomness of the draft block and the randomness of the algorithm.
In particular, E_BLOCKVERIFY[N(i)] ≥ E_TOKENVERIFY[N(i)]. In other words, among all valid verification algorithms, speculative decoding with block verification decodes the highest number of tokens in expectation in a fixed number of iterations. Note that since the computation overhead added by block verification is negligibly small, this establishes the overall optimality of the block verification algorithm. In particular, block verification provides a greater speedup than the standard token verification. We defer the proofs to Appendix B. Below we give intuitions on the algorithm changes that contribute to achieving the above guarantees.

Intuition on parameter choices and theoretical guarantees. The key quantities for achieving the speedup and distribution matching guarantees are the p_i's. In Lemma 3 in Appendix B.1, we show that p_i corresponds to the probability that the sub-block X^i will be kept in the final output. This is guaranteed by choosing the per-step acceptance probability properly, since block verification keeps the longest accepted sub-block. Next we discuss how the p_i's contribute to the distribution matching and optimality guarantees.

Distribution matching guarantee (Theorem 1). To start, ignore the minimum operation in the recursive definition of the p_i's. In that case, each p_i is simply Mb(X^i | c)/Ms(X^i | c), which is an upper bound on the actual p_i's. As shown in Lemma 3, for any X^i, the probability that it is in the accepted block is Ms(X^i) · p_i(X^i). Since the draft block X^i is generated with probability Ms(X^i | c), this guarantees that the probability of getting X^i by accepting it from the draft will be at most Mb(X^i | c), and hence the algorithm does not accept X^i more than needed. The remaining part is to choose a suitable residual distribution p^block_res so that the distribution on the next token follows Mb(· | c, X^i).
Note that for any possible next token x, (X^i, x) could also be obtained by accepting X^{i+1} when X_{i+1} = x, with a probability of Ms(X^i, x) · p_{i+1}(X^i, x), which should be subtracted to obtain the residual mass on (X^i, x). This leads to the choice of p^block_res in Equation (4) after proper normalization.

Optimality guarantee (Theorem 2). For the optimality guarantee, the main step of the proof is to show that for any prefix X^i in the draft block, p_i(X^i) is the maximum probability with which a valid verification algorithm can accept X^i, which is stated in Lemma 4. This implies that in one iteration, block verification accepts the most tokens in expectation, and the multi-iteration case follows by an induction argument. To see why p_i(X^i) is an upper bound on the acceptance probability, we show that this is necessary to guarantee that for any prefix X^i that could be obtained from multiple draft sample paths, the distribution over subsequent tokens is always the same, Mb(· | c, X^i). This enables block verification to be used as a plug-and-play replacement of token verification in the outer loop of speculative decoding (Algorithm 3).

4 We use BLOCKVERIFY and TOKENVERIFY to denote block verification and token verification respectively when convenient.

Finally, we note that the optimality guarantee holds for all verification algorithms that can be used in Algorithm 3 as is. There exist verification procedures, whose decoding logic depends on the previous accept/reject decisions, that produce more accepted tokens on average in one iteration. However, this affects the decoding speed in subsequent iterations. In Appendix C, we present such an algorithm, which we name greedy block verification. We empirically observe that block verification consistently outperforms it, so we include it mainly as a theoretical result.
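The subtraction argument behind the residual distribution can be written out in one line. Using the recursion $p_{i+1}(X^i, x) = \min\{p_i\, M_b(x \mid c, X^i)/M_s(x \mid c, X^i),\, 1\}$, the unnormalized residual mass on token $x$ after keeping $X^i$ (with the common factor $M_s(X^i \mid c)$ divided out) is

$$
p_i\, M_b(x \mid c, X^i) - M_s(x \mid c, X^i)\, p_{i+1}(X^i, x)
  = p_i\, M_b(x \mid c, X^i) - \min\bigl\{p_i\, M_b(x \mid c, X^i),\, M_s(x \mid c, X^i)\bigr\}
  = \max\bigl\{p_i\, M_b(x \mid c, X^i) - M_s(x \mid c, X^i),\, 0\bigr\},
$$

which, after normalization, is exactly the residual distribution $p^{\mathrm{block}}_{\mathrm{res}}$ of Equation (4).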
5 EXPERIMENT SETUP

We conduct experiments using PALM-2 models (Chowdhery et al., 2022) and Vicuna models (Chiang et al., 2023). For the experiments on PALM-2 models, we use PALM-2-S as the large target model and PALM-2-XXS / PALM-2-XXXS as the small drafter model. The order of the sizes of the models is PALM-2-XXXS < PALM-2-XXS < PALM-2-S. We evaluate on prompts from a wide range of datasets and tasks, including language modeling on the One Billion Word benchmark (LM1B) (Chelba et al., 2013), ChatGPT prompts sourced from LearnGPT (GPT Prompt) (Rashad, 2023), reasoning questions (WebQA) (Berant et al., 2013), physical commonsense reasoning questions (PIQA) (Bisk et al., 2020), scraped conversations with ChatGPT (ShareGPT) (Rashad, 2023; RyokoAI, 2023), summarization tasks (XSum) (Narayan et al., 2018), grade school math problems (GSM8K) (Cobbe et al., 2021), and German-to-English translation (WMT De-En) (Bojar et al., 2014). For all datasets, we decode the first 1000 prompts using a max input prompt length of 512 and decode up to 128 output tokens. We use a batch size of 1 in all experiments except for those in Appendix D.1. Note that since our method only modifies the verification phase of the algorithm and doesn't introduce additional draft tokens, the speedup we get is independent of the batch size. We use a temperature of 1.0 for the experiments on PALM-2 models. For the Vicuna family of models (Chiang et al., 2023), we conduct the set of experiments in Spec-Bench (Xia et al., 2024). We discuss the detailed settings for these experiments in Section 6.2.

We focus our main experiments on the comparison between block verification and token verification. Recent works (Sun et al., 2023; Miao et al., 2023) have extended speculative decoding to the case with multiple draft blocks to improve block efficiency.
However, these methods also increase the required computation from the large model to verify the drafts, which is undesirable when query batching is performed. We empirically show that our method outperforms these methods in the large batch setting even with only one draft block. We defer the results to Appendix D.1 and focus on the one-draft case in the main section below.

6.1 EXPERIMENTAL RESULTS ON PALM-2 MODELS

We empirically compare speculative decoding with block verification to speculative decoding with token verification, and find that block verification provides small yet consistent improvements in a wide range of settings, both when measuring idealized block efficiency and real-world wall clock time.

Block efficiency measures the speedup in an idealized setting where we neglect the evaluation time of the draft model and assume that we have enough compute capacity for evaluating the large model on all draft prefixes in parallel. Specifically, it measures the average number of decoded tokens per serial call to the target model. We observe consistent improvements for all datasets and draft models. For γ = 8 with PALM-2-XXS as the drafter, the improvement in block efficiency ranges from 7.00% to 10.06% with an average of 8.30%.

We also observe consistent improvements in wall clock time, which measures the actual speedup, including all the real-world overheads. See (Leviathan et al., 2022; Chen et al., 2023a) for a more detailed discussion of these overheads. For γ = 8 with PALM-2-XXS as the drafter, the improvement in wall clock time ranges from 5.36% to 8.14% with an average of 6.49%. The detailed numbers for this setting are listed in Table 1.

Table 1: Speedup comparison between token verification (TOKENV) and block verification (BLOCKV) with γ = 8, with PALM-2-S as the target model and PALM-2-XXS as the draft model, on various datasets and tasks.
We list the average and standard deviation across 3 runs with different seeds on 1000 test prompts. BE = block efficiency; WS = wall clock time speedup over baseline.

| Dataset | BE: TOKENV | BE: BLOCKV | BE: Improve. % | WS: TOKENV | WS: BLOCKV | WS: Improve. % |
|---|---|---|---|---|---|---|
| LM1B | 3.21 ± 0.01 | 3.49 ± 0.02 | 8.68 ± 0.79 | 2.17 ± 0.01 | 2.32 ± 0.01 | 6.85 ± 0.74 |
| GPT Prompt | 3.41 ± 0.04 | 3.76 ± 0.02 | 10.06 ± 1.66 | 2.30 ± 0.02 | 2.48 ± 0.01 | 8.14 ± 1.55 |
| WebQA | 3.44 ± 0.01 | 3.70 ± 0.01 | 7.53 ± 0.24 | 2.32 ± 0.00 | 2.45 ± 0.01 | 5.75 ± 0.22 |
| PIQA | 3.40 ± 0.02 | 3.68 ± 0.00 | 8.30 ± 0.62 | 2.29 ± 0.01 | 2.44 ± 0.00 | 6.52 ± 0.58 |
| ShareGPT | 3.34 ± 0.01 | 3.62 ± 0.03 | 8.45 ± 0.98 | 2.25 ± 0.01 | 2.40 ± 0.02 | 6.68 ± 0.91 |
| XSum | 3.49 ± 0.02 | 3.76 ± 0.01 | 7.63 ± 0.94 | 2.35 ± 0.01 | 2.49 ± 0.01 | 5.82 ± 0.88 |
| GSM8K | 3.81 ± 0.01 | 4.15 ± 0.03 | 8.74 ± 0.56 | 2.55 ± 0.01 | 2.73 ± 0.02 | 6.84 ± 0.51 |
| WMT De-En | 3.19 ± 0.01 | 3.41 ± 0.02 | 7.00 ± 0.78 | 2.15 ± 0.01 | 2.27 ± 0.01 | 5.36 ± 0.73 |
| Average | 3.41 | 3.70 | 8.30 | 2.30 | 2.45 | 6.49 |

The effect of draft length γ. We also compare the algorithms for other block lengths (γ = 4 and γ = 6) and observe consistent improvements. We plot the average improvement over all datasets in Figure 5, with the numbers in Figure 4. With the same drafter, the relative improvement of block verification over token verification increases as γ increases. This is consistent with our intuition: when γ = 1, the two algorithms are the same, and as γ increases, block verification benefits more from coordinating the acceptance rule considering the realization of all tokens in the draft block. As shown in Figure 4, similar to token verification, the block efficiency of block verification increases as γ increases. However, the wall clock speedup peaks at a certain draft length (γ = 4 or γ = 6 for all settings) due to the increased computation cost in the verification phase. Hence we focus on γ ≤ 8 in our experiments.

The effect of the drafter. We also consider the effect of the quality of the drafter on the improvement. In Figure 4, we list the average block efficiency and wall clock speedup under different draft lengths for both drafters.
Note that PALM-2-XXS is a larger model than PALM-2-XXXS, and hence a better drafter in terms of quality, as demonstrated by the better average block efficiencies in the table. In Figure 5, we plot the average improvement under the two drafter models, PALM-2-XXS and PALM-2-XXXS. The improvements hold for both drafters, and the relative improvement in block efficiency under PALM-2-XXS is greater than that under PALM-2-XXXS. This shows that the improvement obtained from block verification can be combined with improvements in the quality of the drafter, and the improvement might be more significant under better drafters. Detailed results for experiments performed with different drafters, different datasets, and different draft lengths are listed in Appendix D.

6.2 EXPERIMENTAL RESULTS ON SPEC-BENCH WITH VICUNA MODELS

We also conduct the set of experiments proposed in Spec-Bench (Xia et al., 2024) with the Vicuna family of models (Chiang et al., 2023). The benchmark includes various generation subtasks, including multi-turn conversation, retrieval-augmented generation, summarization, translation, question answering, and mathematical reasoning. See Xia et al. (2024) for a detailed discussion of the subtasks. For all experiments in this section, we use a single NVIDIA H100 GPU with a batch size of 1 and a max generation length of 1024. We use Vicuna-7B-v1.3 as the target model and Vicuna-68M as the draft model. To study the effect of temperature, we consider temperatures in {0.2, 0.6, 1.0} and fix γ = 8. The results are listed in Table 2. The reported numbers are the average of 3 runs.
γ | Drafter | TOKENV BE | TOKENV WS | BLOCKV BE | BLOCKV WS
4 | XXS   | 2.89 | 2.44 | 2.99 | 2.50
4 | XXXS  | 2.35 | 2.36 | 2.43 | 2.43
6 | XXS   | 3.23 | 2.43 | 3.43 | 2.54
6 | XXXS  | 2.50 | 2.39 | 2.63 | 2.50
8 | XXS   | 3.41 | 2.30 | 3.70 | 2.45
8 | XXXS  | 2.57 | 2.28 | 2.73 | 2.40

Figure 4: Table of average block efficiency (BE) and wall-clock speedup (WS) across all datasets for token verification (TOKENV) and block verification (BLOCKV) with different γ. The large model is PALM-2-S and the drafter model is either PALM-2-XXS (XXS) or PALM-2-XXXS (XXXS).

[Figure 5: line plot of relative improvement (%) vs. draft length γ ∈ {4, 5, 6, 7, 8}, with curves for BE and WS under the XXS and XXXS drafters.]

Figure 5: Average relative improvement of block verification over token verification in block efficiency (BE) and wall-clock speedup (WS) across all datasets for different drafters and draft lengths.

Our algorithm obtains consistent improvements over token verification (up to 8.7% in block efficiency and up to 6.7% in wall-clock speedup) across different draft lengths for all temperatures greater than 0. This demonstrates the applicability of our method to different families of models.

The effect of temperature. Note that at temperature 0, which corresponds to greedy decoding, our algorithm degenerates to token verification and doesn't provide additional speedups. In non-zero temperature settings, the advantage is consistent, and the additional improvement is higher for larger temperatures. This observation is consistent with the intuition behind the algorithm, which improves block efficiency by coordinating the randomness in the acceptance decisions at different token locations.

Table 2: Speedup comparisons between token verification (TOKENV) and block verification (BLOCKV) on Spec-Bench (Xia et al., 2024) for temperature T ∈ {0.2, 0.6, 1.0}. We use Vicuna-7B-v1.3 as the target model and Vicuna-68M as the draft model. γ = 8 for all experiments, and each number is an average of 3 runs. Wall-clock speedups are over the autoregressive baseline.

T   | BE TOKENV | BE BLOCKV | BE Improve. % | WS TOKENV | WS BLOCKV | WS Improve. %
0.2 | 2.75 | 2.85 | 3.72 | 1.22 | 1.24 | 1.66
0.6 | 2.75 | 2.90 | 5.32 | 1.23 | 1.29 | 4.24
1.0 | 2.79 | 3.04 | 8.70 | 1.27 | 1.34 | 6.07

7 RELATED WORK

Parallel decoding. Our work improves speculative decoding (Leviathan et al., 2022), a framework for decoding several tokens concurrently. Draft and verify (Stern et al., 2018) was an earlier work, which proposed to independently predict and decode several tokens in parallel for the greedy decoding case (zero temperature). Speculative decoding was later also proposed in Chen et al. (2023a).

Single-draft improvements. There have been many works aiming to improve speculative decoding without making use of more than one decoding draft. In Table 3, we list a set of works in the draft-and-verify framework with a breakdown of their drafting and verification algorithms. See Xia et al. (2024) for a comprehensive study.

Table 3: Recent works based on the draft-and-verify framework. Temperature 0 refers to greedy decoding and non-zero temperature refers to sampling.

Work | # drafts | Temp. | Drafting | Verification
Stern et al. (2018)     | 1 | 0   | parallel softmax layers | token matching
Yang et al. (2023)      | 1 | 0   | additional text         | token matching
Leviathan et al. (2022) | 1 | ≥ 0 | small LM                | TOKENVERIFY (Algorithm 1)
Chen et al. (2023a)     | 1 | ≥ 0 | small LM                | TOKENVERIFY (Algorithm 1)
He et al. (2023)        | 1 | ≥ 0 | database retrieval      | TOKENVERIFY (Algorithm 1)
Chen et al. (2023b)     | 1 | ≥ 0 | cascade of small LMs    | TOKENVERIFY (Algorithm 1)
Sun et al. (2024)       | 1 | ≥ 0 | hierarchical drafters   | TOKENVERIFY (Algorithm 1)
Zhou et al. (2023)      | 1 | ≥ 0 | distilled small LMs     | TOKENVERIFY (Algorithm 1)
Liu et al. (2023)       | 1 | ≥ 0 | distilled small LMs     | TOKENVERIFY (Algorithm 1)
Gloeckle et al. (2024)  | 1 | ≥ 0 | parallel softmax layers | TOKENVERIFY (Algorithm 1)
Zhang et al. (2024)     | 1 | ≥ 0 | layer skip              | TOKENVERIFY (Algorithm 1)
Elhoushi et al. (2024)  | 1 | ≥ 0 | early exit              | TOKENVERIFY (Algorithm 1)
This work               | 1 | ≥ 0 | small LM                | BLOCKVERIFY (Algorithm 2)
Sun et al. (2023)       | ≥ 2 | ≥ 0 | small LM              | SpecTr
Miao et al. (2023)      | ≥ 2 | ≥ 0 | small LM              | multi-round TOKENVERIFY
Li et al. (2024)        | ≥ 2 | ≥ 0 | small LM              | multi-round TOKENVERIFY
Chen et al. (2024)      | ≥ 2 | ≥ 0 | small LM              | multi-round TOKENVERIFY

In the single-draft case, several works have focused on improving the drafting phase of speculative decoding (He et al., 2023; Chen et al., 2023b; Sun et al., 2024; Zhou et al., 2023; Liu et al., 2023; Gloeckle et al., 2024; Zhang et al., 2024; Elhoushi et al., 2024). However, these algorithms all use the same token verification algorithm. Our proposed block verification algorithm can be used in tandem with the drafting techniques in Table 3, yielding combined gains. We leave a more systematic study of the improvement of block verification in these cases for future work. The only other work that we are aware of that improves the verification step in speculative decoding is the independent work of Hu and Huang (2024), which uses tree Monte Carlo to improve speculative decoding in the single-draft case and proves that their algorithm improves over token verification. In contrast, we prove that our algorithm achieves the optimal speedup among all valid verification algorithms, including theirs. Our algorithm also requires minimal changes to the original token verification algorithm, making it easy to implement and adopt in practice.

Multiple drafts. Recently, speculative decoding has been extended to multiple drafts (Sun et al., 2023; Miao et al., 2023), and new verification algorithms for the multi-draft scenario have been proposed (Li et al., 2024; Chen et al., 2024). While increasing the number of draft sequences has been shown to improve the overall speedup, it comes at the cost of more computation. In Appendix D.1, we show that in the large-batch setting, where inference is less memory bound, our method outperforms these methods without increasing the number of draft blocks.
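As a sanity check on how the improvement columns are derived, the relative gains can be recomputed from the rounded block-efficiency entries of Table 2. This is a small sketch; the reported percentages use the unrounded measurements, so the results differ slightly in the last digit:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100 * (improved / baseline - 1)

# Rounded block-efficiency entries from Table 2 (T = 0.2, 0.6, 1.0).
rows = [(0.2, 2.75, 2.85), (0.6, 2.75, 2.90), (1.0, 2.79, 3.04)]
gains = [round(relative_improvement(tokenv, blockv), 1)
         for _, tokenv, blockv in rows]
```

Using the rounded entries this gives roughly 3.6%, 5.5%, and 9.0%, close to the reported 3.72%, 5.32%, and 8.70% computed from unrounded values.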
In all of these works, the verification algorithm is a generalization of the token verification procedure. Extending block verification to the multi-sample case is an interesting future direction.

8 DISCUSSION

We showed that the standard token verification algorithm used by speculative decoding is not optimal. Further, we proposed a new verification algorithm, block verification, and proved that it is an optimal verification algorithm. We also demonstrated empirically that block verification consistently outperforms token verification in a range of tasks. While the theoretical proofs are somewhat involved, the actual implementation of block verification is not more complex than the standard algorithm (see Appendix A), and since our proposed algorithm can only perform better, never worse, than the standard token verification algorithm (see Theorem 2), it can be used as a good default in speculative decoding implementations.

REFERENCES

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

J. Berant, A. Chou, R. Frostig, and P. Liang. Semantic parsing on Freebase from question-answer pairs. In D. Yarowsky, T. Baldwin, A. Korhonen, K. Livescu, and S. Bethard, editors, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, Oct. 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1160.

Y. Bisk, R. Zellers, R. Le bras, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439, Apr. 2020. doi: 10.1609/aaai.v34i05.6239. URL https://ojs.aaai.org/index.php/AAAI/article/view/6239.

O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, R.
Soricut, L. Specia, and A. Tamchyna. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W/W14/W14-3302.

C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.

C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023a.

Z. Chen, X. Yang, J. Lin, C. Sun, J. Huang, and K. C.-C. Chang. Cascade speculative drafting for even faster LLM inference. arXiv preprint arXiv:2312.11462, 2023b.

Z. Chen, A. May, R. Svirschevski, Y. Huang, M. Ryabinin, Z. Jia, and B. Chen. Sequoia: Scalable, robust, and hardware-aware speculative decoding, 2024.

W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

M. Elhoushi, A. Shrivastava, D. Liskovich, B. Hosmer, B. Wasti, L. Lai, A. Mahmoud, B. Acun, S. Agarwal, A. Roman, A. A. Aly, B. Chen, and C.-J. Wu. LayerSkip: Enabling early exit inference and self-speculative decoding, 2024.

Gemini Team, R. Anil, S. Borgeaud, Y.
Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve. Better & faster large language models via multi-token prediction, 2024.

Z. He, Z. Zhong, T. Cai, J. D. Lee, and D. He. REST: Retrieval-based speculative decoding. arXiv preprint arXiv:2311.08252, 2023.

Z. Hu and H. Huang. Accelerated speculative sampling based on tree Monte Carlo. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=stMhi1Sn2G.

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, pages 611–626, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702297. doi: 10.1145/3600006.3613165. URL https://doi.org/10.1145/3600006.3613165.

Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2022.

Y. Li, F. Wei, C. Zhang, and H. Zhang. EAGLE: Speculative sampling requires rethinking feature uncertainty, 2024.

X. Liu, L. Hu, P. Bailis, I. Stoica, Z. Deng, A. Cheung, and H. Zhang. Online speculative decoding, 2023.

X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, R. Y. Y. Wong, Z. Chen, D. Arfeen, R. Abhyankar, and Z. Jia. SpecInfer: Accelerating generative LLM serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023.

S. Narayan, S. B. Cohen, and M. Lapata. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv, abs/1808.08745, 2018.

J. P. Quirk and R.
Saposnik. Admissibility and measurable utility functions. The Review of Economic Studies, 29(2):140–146, 1962.

M. Rashad. ChatGPT-prompts. https://huggingface.co/datasets/MohamedRashad/ChatGPT-prompts, 2023.

RyokoAI. ShareGPT. https://huggingface.co/datasets/RyokoAI/ShareGPT52K, 2023.

M. Stern, N. Shazeer, and J. Uszkoreit. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018.

H. Sun, Z. Chen, X. Yang, Y. Tian, and B. Chen. TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding, 2024.

Z. Sun, A. T. Suresh, J. H. Ro, A. Beirami, H. Jain, and F. Yu. SpecTr: Fast speculative decoding via optimal transport. arXiv preprint arXiv:2310.15141, 2023.

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

C. Villani et al. Optimal transport: old and new, volume 338. Springer, 2009.

H. Xia, Z. Yang, Q. Dong, P. Wang, Y. Li, T. Ge, T. Liu, W. Li, and Z. Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Findings of the Association for Computational Linguistics ACL 2024, pages 7655–7671, Bangkok, Thailand and virtual meeting, Aug. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.456. URL https://aclanthology.org/2024.findings-acl.456.

N. Yang, T. Ge, L. Wang, B. Jiao, D. Jiang, L. Yang, R. Majumder, and F. Wei. Inference with reference: Lossless acceleration of large language models. arXiv preprint arXiv:2304.04487, 2023.

J. Zhang, J. Wang, H. Li, L. Shou, K. Chen, G. Chen, and S. Mehrotra. Draft & verify: Lossless large language model acceleration via self-speculative decoding, 2024.

Y. Zhou, K. Lyu, A. S. Rawat, A. K. Menon, A. Rostamizadeh, S. Kumar, J.-F.
Kagy, and R. Agarwal. DistillSpec: Improving speculative decoding via knowledge distillation, 2023.

A PYTHON IMPLEMENTATION

In this section we provide a sketch implementation of block verification (Algorithm 2) in Python. Note that these are meant for illustration purposes only and are not fit for practical use. Let V = |X| be the size of the vocabulary. The inputs to the algorithm are:

ps: a (γ+1) × V numpy array with the distributions from the large model Mb(· | c, X^i);
qs: a γ × V numpy array with the distributions from the draft model Ms(· | c, X^i);
drafts: a length-γ numpy array with the ids of the draft tokens X^γ.

import numpy as np

def block_verification(
    ps: np.ndarray, qs: np.ndarray, drafts: np.ndarray) -> list[int]:
    draft_length, vocab_size = qs.shape
    qs = np.vstack([qs, np.zeros(vocab_size)])  # Append a zero vector.
    token_sequence = []  # Will hold the token sequence we return.
    accept_probability = 1.0  # Acceptance prob. for the current sub-block.
    with np.errstate(divide='ignore', invalid='ignore'):
        probability_ratios = ps / qs
    sampling_weights = np.empty(vocab_size + 1)
    # Add one pseudo-token that indicates rejecting the sequence.
    vocab_plus_one = np.arange(vocab_size + 1)
    # A dummy trailing entry so the loop also covers the extra token.
    xs = list(drafts) + [0]
    for token_index, token_value in enumerate(xs):
        # Unnormalized residual probability.
        sampling_weights[:vocab_size] = np.maximum(
            0, ps[token_index] * accept_probability - qs[token_index])
        # Unnormalized probability of rejecting the sequence.
        sampling_weights[vocab_size] = 1 - accept_probability
        sampling_weights /= np.sum(sampling_weights)
        chosen_token = np.random.choice(vocab_plus_one, p=sampling_weights)
        # Update the sequence.
        if chosen_token < vocab_size:
            token_sequence = list(drafts[:token_index]) + [int(chosen_token)]
        # Update the acceptance probability.
        if token_index < draft_length:
            accept_probability = min(1, probability_ratios[
                token_index, token_value] * accept_probability)
    return token_sequence

For reference, here is a sketch implementation of the token verification algorithm (Algorithm 1):

def token_verification(
    ps: np.ndarray, qs: np.ndarray, drafts: np.ndarray) -> list[int]:
    draft_length, vocab_size = qs.shape
    qs = np.vstack([qs, np.zeros(vocab_size)])  # Append a zero vector.
    token_sequence = []  # Will hold the token sequence we return.
    with np.errstate(divide='ignore', invalid='ignore'):
        probability_ratios = ps / qs
    token_index = 0
    vocab_range = np.arange(vocab_size)
    for token_value in drafts:
        accept_probability = probability_ratios[token_index, token_value]
        if (not np.isfinite(accept_probability)
                or np.random.random() > accept_probability):  # Rejection.
            break
        token_index += 1
        token_sequence.append(int(token_value))
    # Calculate the residual distribution.
    sampling_weights = np.maximum(0, ps[token_index] - qs[token_index])
    sampling_weights /= np.sum(sampling_weights)
    token_sequence.append(
        int(np.random.choice(vocab_range, p=sampling_weights)))
    return token_sequence

B FORMAL PROOFS

We start by setting up a few necessary notations. Let X be the space of output tokens. For ℓ ≥ 1, we use M^ℓ(· | c) to denote the joint distribution of the next ℓ tokens conditioned on the prefix under M, i.e., for all x^ℓ = (x_1, . . . , x_ℓ) ∈ X^ℓ, M^ℓ(x_1, . . . , x_ℓ | c) = ∏_{i=1}^{ℓ} M(x_i | c, x^{i−1}). We use M^∞(· | c) to denote the distribution of the sequence up to the end of the generative process. Below we first describe a necessary and sufficient condition for a valid draft verification algorithm in Algorithm 3.

Lemma 2. For all c, Ms, Mb, γ, let X^γ be generated from M_s^γ(· | c), and (X^τ, Y) = VERIFY(X^γ, {Ms(· | c, X^i)}_{i=0}^{γ−1}, {Mb(· | c, X^i)}_{i=0}^{γ}). Let Z^{γ−τ} be generated from M_b^{γ−τ}(· | c, X^τ, Y). VERIFY is a valid draft verification algorithm (Definition 1) if and only if for all c, Ms, Mb, γ,

(X^τ, Y, Z^{γ−τ}) ∼ M_b^{γ+1}(· | c). (7)

Proof. We first prove the forward direction (Equation (7) implies that VERIFY satisfies Definition 1) by induction on the maximum generation length of Mb(· | c). When the maximum generation length is 0, for all new context c′, the next token is a point mass on E.O.S., i.e., Mb(x | c, c′) = δ{x = E.O.S.}. Then Equation (7) implies that VERIFY will only output E.O.S., which is the same as Definition 1.
Suppose Equation (6) holds for all context and Mb with generation length at most T, for a context c and Mb with maximum generation length at most T + 1, we have that the output of SPECDEC(c, Mb, Ms, γ, VERIFY) is Xτ, Y, SPECDEC((c, Xτ, Y ), Mb, Ms, γ, VERIFY). Let Zγ τ be the first γ τ tokens from SPECDEC((c, Xτ, Y ), Mb, Ms, γ, VERIFY), and O be the tokens after. Since Xτ, Y is at least of length one, the generation length of Mb( | c, Xτ, Y ) is at most T. By the induction hypothesis, we have Zγ τ p Mγ τ b ( | c, Xτ, Y ), and O p M b( | c, Xτ, Y, Zγ τ). And hence by Equation (7), SPECDEC(c, Mb, Ms, γ, VERIFY) = Xτ, Y, SPECDEC((c, Xτ, Y ), Mb, Ms, γ, VERIFY) = Xτ, Y, Zγ τ, O p M b( | c). This completes the proof for the forward direction. For the backward direction, we have Equation (6) implies that for all Xτ, Y , SPECDEC((c, Xτ, Y ), Mb, Ms, γ, VERIFY)[: γ τ]5 p Mγ τ b ( | c, Xτ, Y ). Let Zγ τ be a draw from Mγ τ b ( | c, Xτ, Y ), then Zγ τ p SPECDEC((c, Xτ, Y ), Mb, Ms, γ, VERIFY)[: γ τ]. 5We use v[i : j] to denote the entries i to j in v. Published as a conference paper at ICLR 2025 And hence when Xτ, Y is the output of VERIFY, Xτ, Y, Zγ τ p Xτ, Y, SPECDEC((c, Xτ, Y ), Mb, Ms, γ, VERIFY)[: γ τ] p SPECDEC(c, Mb, Ms, γ, VERIFY)[: γ + 1] p Mγ+1 b ( | c), where the last derivation follows from Equation (6) in Definition 1. In all proofs below, we fix the context c, and the models Ms and Mb. We note that the proofs won t use specific information about these choices and hence can be easily extended to all cases. B.1 PROOF OF THEOREM 1 By Lemma 2, it would be enough to prove that block verification satisfies Equation (7). For simplicity, we often refer to the sequence (Xτ, Y, Zγ τ) by Oγ+1. Note that Oγ+1 Mb( | c, Oγ) always holds since when τ < γ, Oγ+1 Mb( | c, Oγ) by definition and when τ = γ, Oγ+1 = Y Mb( | c, Xγ) = Mb( | c, Oγ). 
Hence it is enough to prove the following ℓ γ, xℓ X ℓ, Pr Oℓ= xℓ = Mℓ b(xℓ| c), (8) Note that in block verification, pi s depend on the draft tokens Xγ. The following definition makes this explicit. Let pi be such that p0 = 1, and 1 i γ, xi X i, pi(xi | c) = min pi 1(xi | c)Mb(xi | c, xi 1) Ms(xi | c, xi 1), 1 . (9) For most cases, when the prefix c is clear, we will ignore c and simply use pi(xi) = pi(xi | c). We will only make the prefix explicit when necessary. We first state the following lemma on the distribution of the number of tokens accepted by block verification. Lemma 3. Let Xγ Mγ s( | c), and Xτ, Y = BLOCKVERIFY(Xγ, {Ms( | c, Xi)}γ 1 i=0 , {Mb( | c, Xi)}γ i=0). Then we have i γ, and xi X i, Pr τ i | Xi = xi = pi(xi). We first prove Theorem 1 based on Lemma 3 and defer the proof of the lemma to Appendix B.3. We prove Equation (8) by induction on the time index ℓ. When ℓ= 1, O1 is either X1, or a residual sample from pblock res ( | c) where pblock res ( | c) = max{Mb(x | c) Ms(x | c), 0} P x max{Mb(x | c) Ms(x | c), 0}, Hence we have x X, by Lemma 3, Pr (O1 = x) = Pr (O1 = x, τ 1) + Pr (O1 = x, τ = 0) = Pr (X1 = x) Pr (τ 1 | X1 = x) + X x Pr (X1 = x )(1 Pr (τ 1 | X1 = x )) pblock res (x | c) = Ms(x | c) p1(x) + X x Ms(x | c)(1 p1(x )) pblock res (x | c) = min{Mb(x | c), Ms(x | c)} + X x max{Mb(x | c) Ms(x | c), 0} pblock res (x | c) (10) = min{Mb(x | c), Ms(x | c)} + max{Mb(x | c) Ms(x | c), 0} (11) = Mb(x | c), Published as a conference paper at ICLR 2025 where Equation (10) comes from the definition of p1 in Equation (9) and Equation (11) is due to Equation (4) with i = 0. Hence the Equation (8) holds for ℓ= 1. Suppose Equation (8) holds up to ℓ< γ. For ℓ= ℓ+ 1, we have Oℓ+1 is either equal to Xℓ+1 when τ ℓ+ 1, or a sample from pblock res ( | c, Xℓ) when τ = ℓ, or a sample from Mb( | c, Oℓ) when τ < ℓ. 
Hence Pr Oℓ+1 = xℓ+1 can be broken down below: Pr Oℓ+1 = xℓ+1 = Pr Oℓ+1 = xℓ+1, τ ℓ+ 1 + Pr Oℓ+1 = xℓ+1, τ = ℓ + Pr Oℓ+1 = xℓ+1, τ < ℓ (12) For the first term (τ ℓ+ 1), we have Pr Oℓ+1 = xℓ+1, τ ℓ+ 1 = Pr Xℓ+1 = xℓ+1 Pr τ ℓ+ 1 | Xℓ+1 = xℓ+1 = Ms(xℓ+1 | c) pℓ+1(xℓ+1) = Ms(xℓ| c) min{pℓ(xℓ)Mb(xℓ+1 | c, xℓ), Ms(xℓ+1 | c, xℓ)}. (13) For the second term (τ = ℓ), we have Pr Oℓ+1 = xℓ+1, τ = ℓ = Pr Xℓ= xℓ Pr τ = ℓ| Xℓ= xℓ Pr Oℓ+1 = xℓ+1 | Oℓ= xℓ, τ = ℓ = Pr Xℓ= xℓ Pr τ = ℓ| Xℓ= xℓ pblock res (xℓ+1 | c, xℓ) Pr τ = ℓ| Xℓ= xℓ = Pr τ ℓ| Xℓ= xℓ X x Ms(x | c, xℓ) Pr τ ℓ+ 1 | c, Xℓ+1 = xℓ, x x Ms(x | c, xℓ) pℓ+1(xℓ, x) x min{pℓ(xℓ)Mb(x | c, xℓ), Ms(x | c, xℓ)} x max{pℓ(xℓ)Mb(x | c, xℓ) Ms(x | c, xℓ), 0}. And hence by the definition of pblock res (xℓ+1 | c, xℓ) in Equation (4), we have Pr Oℓ+1 = xℓ+1, τ = ℓ = Ms(xℓ| c) max{pℓ(xℓ)Mb(xℓ+1 | c, xℓ) Ms(xℓ+1 | c, xℓ), 0}. (14) For the third term (τ < ℓ), by induction, and the generation process of Oγ+1, we have Pr Oℓ+1 = xℓ+1, τ < ℓ = Pr Oℓ= xℓ, τ < ℓ Pr Oℓ+1 = xℓ+1 | Oℓ= xℓ, τ < ℓ = Pr Oℓ= xℓ Pr Oℓ= xℓ, τ ℓ Mb(xℓ+1 | c, xℓ) = Mb(xℓ| c) Ms(xℓ| c)pℓ(xℓ) Mb(xℓ+1 | c, xℓ) (15) Plugging Equations (13) to (15) into Equation (12), we get xℓ+1 X ℓ+1, Pr Oℓ+1 = xℓ+1 = Mb(xℓ+1 | c), completing the induction step and hence the proof of Equation (8) and Theorem 1. B.2 PROOF OF THEOREM 2 We first state the following lemma, which when combined with Lemma 3, shows that in one iteration, among all valid draft verification algorithms, block verification accepts each subsequence with the highest probability. Published as a conference paper at ICLR 2025 Lemma 4. For draft verification algorithms that satisfy the constraints in Lemma 2, we have i γ, and xi X i, Pr τ i | Xi = xi pi(xi). We defer the proof of the lemma to Appendix B.4 and first prove Theorem 2 based on the lemma. We start by breaking down the expected number of decoded tokens EVERIFY[N(i)] into the distribution of N(i) on different sample paths. Let O = O1, O2, . . . 
, be the complete output sequence from speculative decoding. We set all tokens after E.O.S to be E.O.S as well. Then we have EVERIFY[N(i)] = ℓ=1 Pr VERIFY (N(i) ℓ) = X ℓ=1 Pr VERIFY (O = x , N(i) ℓ) Hence it would be enough to prove the following. Lemma 5. For all draft verification algorithms that satisfy the constraints in Lemma 2, we have c, and output x X Pr VERIFY (O = x , N(i) ℓ| c) Pr BLOCKVERIFY (O = x , N(i) ℓ| c) (16) We prove the lemma by induction on the number of iterations i. We first prove the following lemma for all verification algorithms. Lemma 6. For all ℓ γ, Pr VERIFY (O = x , τ ℓ| c) = Pr VERIFY Oℓ= xℓ, τ ℓ| c M b xℓ+1: | c, xℓ . Proof. When Pr VERIFY Oℓ= xℓ, τ ℓ| c = 0, the bound is trivial since both sides are 0. Otherwise we have Pr VERIFY (O = x , τ ℓ| c) = Pr VERIFY Oℓ+1: = xℓ+1: , Oℓ= xℓ, τ ℓ| c = Pr VERIFY Oℓ= xℓ, τ ℓ| c Pr VERIFY Oℓ+1: = xℓ+1: | Oℓ= xℓ, τ ℓ, c It would be enough to show that Pr VERIFY Oℓ+1: = xℓ+1: | Oℓ= xℓ, τ ℓ, c = M b xℓ+1: | c, xℓ . (17) M b xℓ+1: | c, xℓ = Pr VERIFY Oℓ+1: = xℓ+1: | Oℓ= xℓ, c = Pr VERIFY Oℓ+1: = xℓ+1: , τ ℓ| Oℓ= xℓ, c + Pr VERIFY Oℓ+1: = xℓ+1: , τ < ℓ| Oℓ= xℓ, c = Pr VERIFY τ ℓ| Oℓ= xℓ, c Pr VERIFY Oℓ+1: = xℓ+1: | Oℓ= xℓ, τ ℓ, c + Pr VERIFY τ < ℓ| Oℓ= xℓ, c Pr VERIFY Oℓ+1: = xℓ+1: | Oℓ= xℓ, τ < ℓ, c (18) When Pr VERIFY τ < ℓ| Oℓ= xℓ, c = 0, we have Pr VERIFY τ ℓ| Oℓ= xℓ, c = 1, and hence Pr VERIFY Oℓ+1: = xℓ+1: | Oℓ= xℓ, τ ℓ, c = M b xℓ+1: | c, xℓ . When Pr VERIFY τ < ℓ| Oℓ= xℓ, c > 0, since VERIFY is a valid verification algorithm (Definition 1), we have tokens starting from location ℓ+ 1 is valid draw from Mb( | c, xℓ), i.e., Pr VERIFY Oℓ+1: = xℓ+1: | Oℓ= xℓ, τ < ℓ, c = M b xℓ+1: | c, xℓ . Plugging this into Equation (18) completes the proof. Published as a conference paper at ICLR 2025 When i = 1, we have that N(1) = τ + 1, where τ is the number of accepted tokens. 
Hence we have Pr VERIFY (O = x , N(1) ℓ| c) = Pr VERIFY (O = x , τ ℓ 1 | c) = Pr VERIFY Oℓ 1 = xℓ 1, τ ℓ 1 | c M b xℓ: | c, xℓ 1 Lemma 6 = Ms(xℓ 1 | c)M b(xℓ: | c, xℓ 1) Pr VERIFY τ ℓ 1 | Xℓ 1 = xℓ 1, c , where the last equality is because Pr VERIFY Oℓ 1 = xℓ 1, τ ℓ 1 | c is the probability of the event that Oℓ 1 = xℓ 1 is contained in the accepted tokens, which happens under the joint of two events: (1) The first ℓ 1 tokens in the draft block from the small model Xℓ 1 = xℓ 1. This probability is Ms(xℓ 1 | c); (2) Conditioned on Xℓ 1 = xℓ 1, at least ℓ 1 tokens are accepted, this is Pr VERIFY τ ℓ 1 | Xℓ 1 = xℓ 1, c , and hence Pr VERIFY Oℓ 1 = xℓ 1, τ ℓ 1 | c = Ms(xℓ 1 | c) Pr VERIFY τ ℓ 1 | Xℓ 1 = xℓ 1, c . Similarly, we have Pr BLOCKVERIFY (O = x , N(1) ℓ| c) =Ms(xℓ 1 | c)M b(xℓ: | c, xℓ 1) Pr BLOCKVERIFY τ ℓ 1 | Xℓ 1 = xℓ 1, c . Note that Lemmas 3 and 4 imply that for all verification algorithm, we have Pr VERIFY τ ℓ 1 | Xℓ 1 = xℓ 1, c Pr BLOCKVERIFY τ ℓ 1 | Xℓ 1 = xℓ 1, c . Combining these, we have Pr VERIFY (O = x , N(1) ℓ| c) Pr BLOCKVERIFY (O = x , N(1) ℓ| c). (19) Suppose the lemma holds for all iterations up to i, for the (i + 1)th iteration, let τi+1 be the number of tokens accepted in the (i + 1)th iteration, we have Pr VERIFY (O = x , N(i + 1) ℓ| c) ℓ <ℓ Pr VERIFY (O = x , N(i) = ℓ , N(i + 1) ℓ| c) ℓ <ℓ Pr VERIFY (O = x , N(i) = ℓ | c) Pr VERIFY (τi+1 ℓ ℓ 1 | c, O = x ,N(i) = ℓ ) = Pr VERIFY (O = x | c) X ℓ <ℓ Pr VERIFY (N(i) = ℓ | O = x , c) Pr VERIFY (τi+1 ℓ ℓ 1 | c, O = x , N(i) = ℓ ) = Mb(x | c) X ℓ <ℓ Pr VERIFY (N(i) = ℓ | O = x , c) Pr VERIFY (τi+1 ℓ ℓ 1 | c, O = x , N(i) = ℓ ) Let ηVERIFY be a random variable distributed according to Pr VERIFY (N(i) = ℓ | O = x , c), and f VERIFY(η) = Pr VERIFY (τi+1 ℓ η 1 | c, O = x , N(i) = η). 
Plugging these into Equation (20), we have Pr VERIFY (O = x , N(i + 1) ℓ| c) = Mb(x | c)EηVERIFY [f VERIFY(ηVERIFY)] Note that let cη = c, xη, we have f VERIFY(η) = Pr VERIFY (τi+1 ℓ η 1 | c, O = x , N(i) = η) = Pr VERIFY τ ℓ η 1 | cη, O = xη+1: (21) Published as a conference paper at ICLR 2025 where Equation (21) is due to the iterative structure of speculative decoding and after generating Oη = xη in the first i iterations (N(i) = η), the next iteration is the same as generating from scratch with context cη = c, xη. Similarly, we have f BLOCKVERIFY(η) = Pr BLOCKVERIFY τ ℓ η 1 | cη, O = xη+1: . Note that c, x X , and i, we have Pr BLOCKVERIFY (τ i | c, O = x ) = Pr BLOCKVERIFY (O = x , τ i | c) Pr BLOCKVERIFY (O = x | c) = Pr BLOCKVERIFY (O = x , τ i | c) Pr VERIFY (O = x , τ i | c) M b(x | c) (22) = Pr VERIFY (τ i | c, O = x ), (23) where Equation (22) is due to Equation (19) and that N(1) = τ + 1. Equation (23) implies that f BLOCKVERIFY(η) f VERIFY(η), and hence we have Pr VERIFY (O = x , N(i + 1) ℓ| c) = Mb(x | c)EηVERIFY [f VERIFY(η)] Mb(x | c)EηVERIFY [f BLOCKVERIFY(η)] . Note that Pr BLOCKVERIFY (O = x , N(i + 1) ℓ| c) = Mb(x | c)EηBLOCKVERIFY [f BLOCKVERIFY(η)]. It would be enough to prove that EηVERIFY [f BLOCKVERIFY(η)] EηBLOCKVERIFY [f BLOCKVERIFY(η)] . (24) Next we prove Equation (24) using the lemma below. Lemma 7 (Quirk and Saposnik (1962)). Let f : R R be an increasing function and X1 stochastically dominates X2, meaning x, we have Pr (X1 x) Pr (X2 x), then we have E[f(X1)] E[f(X2)]. By the induction hypothesis, we have ηBLOCKVERIFY stochastically dominates (Quirk and Saposnik, 1962) ηVERIFY for any valid verification algorithm. It remains to show that f BLOCKVERIFY(η) is an increasing function. By definition, since 0 τ γ, when η < ℓ γ 1, f BLOCKVERIFY(η) = 0 and when η > ℓ 1, f BLOCKVERIFY(η) = 1. When ℓ γ 1 η ℓ 2, by definition and Lemma 3, f BLOCKVERIFY(η) = pℓ η 1(xη+1:ℓ 1 | c, xη). 
To see that pℓ η 1(xη+1:ℓ 1 | c, xη) is an increasing function of η , for η = η + 1, we can obtain pℓ η 1(xη +1:ℓ 1 | c, xη ) by following the same recursion steps as in Equation (9) but replacing p1(xη+1:ℓ 1 | c, xη) with p0(xη+2:ℓ 1 | c, xη+1) = 1, and hence only increasing the values. This proves that f BLOCKVERIFY is increasing and hence Equation (24) holds. This implies that the induction step holds due to Lemma 7, completing proof of Theorem 2. B.3 PROOF OF LEMMA 3 Note that in Line 4 of Algorithm 2, pi = pi(Xi). We prove the statement by backward induction. When i = γ, we have by definition of hblock γ in Figure 2, xγ X γ, Pr (τ γ | Xγ = xγ) = hblock γ = pγ(xγ). Published as a conference paper at ICLR 2025 Suppose the statement holds for i ℓ. When i = ℓ 1, we have Pr τ ℓ 1 | Xℓ 1 = xℓ 1 xℓ X Ms(xℓ| c, xℓ 1) Pr τ ℓ 1 | Xℓ= xℓ xℓ X Ms(xℓ| c, xℓ 1) Pr τ ℓ| Xℓ= xℓ + Pr τ = ℓ 1 | Xℓ= xℓ xℓ X Ms(xℓ| c, xℓ 1) Pr τ ℓ| Xℓ= xℓ + Pr τ < ℓ| Xℓ= xℓ hblock ℓ 1 xℓ X Ms(xℓ| c, xℓ 1) pℓ(xℓ) + (1 pℓ(xℓ)) hblock ℓ 1 xℓ X Ms(xℓ| c, xℓ 1) pℓ(xℓ) + hblock ℓ 1 X xℓ X Ms(xℓ| c, xℓ 1) (1 pℓ(xℓ)) xℓ X Ms(xℓ| c, xℓ 1) pℓ(xℓ) + hblock ℓ 1 (1 X xℓ X Ms(xℓ| c, xℓ 1)pℓ(xℓ)). (26) Equation (25) above holds since τ = ℓ 1 happens under the joint event of τ < ℓand ηℓ 1 < hblock ℓ 1 . Note that in the definition of hblock ℓ 1 (Equation (5)), X x max{pℓ 1(xℓ 1)Mb(x | c, xℓ 1) Ms(x | c, xℓ 1), 0} pℓ 1(xℓ 1)Mb(x | c, xℓ 1) min{pℓ 1(xℓ 1)Mb(x | c, xℓ 1), Ms(x | c, xℓ 1)} = pℓ 1(xℓ 1) X x min{pℓ 1(xℓ 1)Mb(x | c, xℓ 1), Ms(x | c, xℓ 1)} Plugging this into Equation (5), hblock ℓ 1 = pℓ 1(xℓ 1) P x min{pℓ 1(xℓ 1)Mb(x | c, xℓ 1), Ms(x | c, xℓ 1)} 1 P x min{pℓ 1(xℓ 1)Mb(x | c, xℓ 1), Ms(x | c, xℓ 1)} . (27) Moreover, we have by the definition of pℓ(xℓ), X xℓ X Ms(xℓ| c, xℓ 1) pℓ(xℓ) = X xℓ X min{pℓ 1(xℓ 1)Mb(xℓ| c, xℓ 1), Ms(xℓ| c, xℓ 1)} x X min{pℓ 1(xℓ 1)Mb(x | c, xℓ 1), Ms(x | c, xℓ 1)}. 
Plugging Equation (28) and Equation (27) into Equation (26), we get Pr τ ℓ 1 | Xℓ 1 = xℓ 1 = pℓ 1(xℓ 1), as desired. The lemma hence follows by induction. B.4 PROOF OF LEMMA 4 Recall that we use Oγ+1 to denote the sequence (Xτ, Y, Zγ τ) in Equation (7). Without loss of generality, we only consider oℓsuch that Pr Oℓ= oℓ > 0 and Pr Xℓ= oℓ > 0 since otherwise Pr τ i | Xi = xi is either zero or ill-defined. We break the proof into the two cases below. If i < ℓ, it satisfies that pi 1(xi 1)Mb(xi | c, xi 1) Ms(xi | c, xi 1), then we have in the recursive formula of pi s in Algorithm 2, we always have pi(xi) = pi 1(xi 1)Mb(xi | c, xi 1) Ms(xi | c, xi 1), Published as a conference paper at ICLR 2025 pℓ 1(xℓ 1) = Mb(xℓ 1 | c) Ms(xℓ 1 | c), And for xℓ, we have pℓ(xℓ) = min{Mb(xℓ| c) Ms(xℓ| c), 1}. Pr Oℓ= xℓ, τ ℓ = Pr Xℓ= xℓ Pr τ ℓ| Xℓ= xℓ = Ms(xℓ| c) Pr τ ℓ| Xℓ= xℓ Moreover, we have Pr Oℓ= xℓ, τ ℓ Pr Oℓ= xℓ = Mb(xℓ| c), Pr Oℓ= xℓ, τ ℓ = Pr Xℓ= xℓ Pr τ ℓ| Xℓ= xℓ Pr Xℓ= xℓ = Ms(xℓ| c). Pr τ ℓ| Xℓ= xℓ = Pr Oℓ= xℓ, τ ℓ Ms(xℓ| c) min{Mb(xℓ| c) Ms(xℓ| c), 1} = pℓ(xℓ). In the other case, there must exist some i such that pi 1(xi 1)Mb(xi | c, xi 1) > Ms(xi | c, xi 1), then we have pi(xi) = min{pi 1(xi 1)Mb(xi | c, xi 1) Ms(xi | c, xi 1) , 1} = 1. WLOG, let i be the largest such index. In this case, we have i < j < ℓ, pj 1(xj 1)Mb(xj | c, xj 1) Ms(xj | c, xj 1), and hence pℓ(xℓ) = pi(xi)Mℓ i b (xi+1:ℓ| c, xi) Mℓ i s (xi+1:ℓ| c, xi) = Mℓ i b (xi+1:ℓ| c, xi) Mℓ i s (xi+1:ℓ| c, xi) . Moreover, by definition, we have pi 1(xi 1) pi 2(xi 2)Mb(xi 1 | c, xi 2) Ms(xi 1 | c, xi 2) . . . Mi 1 b (xi 1 | c) Mi 1 s (xi 1 | c), and hence when pi 1(xi 1)Mb(xi | c, xi 1) > Ms(xi | c, xi 1), Mb(xi | c) Ms(xi | c) = Mi 1 b (xi 1 | c) Mi 1 s (xi 1 | c) Mb(xi | c, xi 1) Ms(xi | c, xi 1) pi 1(xi 1) Mb(xi | c, xi 1) Ms(xi | c, xi 1) > 1. Pr Oi = xi, τ i = Pr Xi = xi Pr τ i | Xi = xi Ms(xi | c) < Mb(xi | c), Pr Oi = xi, τ < i = Pr Oi = xi Pr Oi = xi, τ i = Mb(xi | c) Ms(xi | c) > 0. 
Note that when O^i = x^i and τ < i, by the constraints in Equation (7), we have

Pr(O_{i+1:ℓ} = x_{i+1:ℓ} | O^i = x^i, τ < i) = M_b^{ℓ−i}(x_{i+1:ℓ} | c, x^i).

This implies

Pr(O^ℓ = x^ℓ) = Pr(O^i = x^i, τ < i) · Pr(O_{i+1:ℓ} = x_{i+1:ℓ} | O^i = x^i, τ < i) + Pr(O^i = x^i, τ ≥ i) · Pr(O_{i+1:ℓ} = x_{i+1:ℓ} | O^i = x^i, τ ≥ i)
= Pr(O^i = x^i, τ < i) · M_b^{ℓ−i}(x_{i+1:ℓ} | c, x^i) + Pr(O^i = x^i, τ ≥ i) · Pr(O_{i+1:ℓ} = x_{i+1:ℓ} | O^i = x^i, τ ≥ i).

Moreover, we have

Pr(O^ℓ = x^ℓ) = M_b^i(x^i | c) · M_b^{ℓ−i}(x_{i+1:ℓ} | c, x^i).

Combining both, we get

1 = Pr(O^ℓ = x^ℓ) / (M_b^i(x^i | c) · M_b^{ℓ−i}(x_{i+1:ℓ} | c, x^i))
= Pr(τ < i | O^i = x^i) + Pr(τ ≥ i | O^i = x^i) · Pr(O_{i+1:ℓ} = x_{i+1:ℓ} | O^i = x^i, τ ≥ i) / M_b^{ℓ−i}(x_{i+1:ℓ} | c, x^i)
= 1 − Pr(τ ≥ i | O^i = x^i) · (1 − Pr(O_{i+1:ℓ} = x_{i+1:ℓ} | O^i = x^i, τ ≥ i) / M_b^{ℓ−i}(x_{i+1:ℓ} | c, x^i)),

and this implies that (note that by assumption Pr(τ ≥ i | O^i = x^i) ≠ 0)

Pr(O_{i+1:ℓ} = x_{i+1:ℓ} | O^i = x^i, τ ≥ i) = M_b^{ℓ−i}(x_{i+1:ℓ} | c, x^i). (29)

Hence

Pr(O^ℓ = x^ℓ, τ ≥ ℓ) ≤ Pr(O^ℓ = x^ℓ, τ ≥ i) = Pr(O^i = x^i, τ ≥ i) · Pr(O_{i+1:ℓ} = x_{i+1:ℓ} | O^i = x^i, τ ≥ i) ≤ Ms(x^i | c) · p_i(x^i) · M_b^{ℓ−i}(x_{i+1:ℓ} | c, x^i) = Ms(x^i | c) · M_b^{ℓ−i}(x_{i+1:ℓ} | c, x^i).

If Pr(τ ≥ ℓ | X^ℓ = x^ℓ) > p_ℓ(x^ℓ), we have

Pr(O^ℓ = x^ℓ, τ ≥ ℓ) = Pr(X^ℓ = x^ℓ) · Pr(τ ≥ ℓ | X^ℓ = x^ℓ) > Ms(x^ℓ | c) · p_ℓ(x^ℓ) = Ms(x^ℓ | c) · M_b^{ℓ−i}(x_{i+1:ℓ} | c, x^i) / M_s^{ℓ−i}(x_{i+1:ℓ} | c, x^i) = Ms(x^i | c) · M_b^{ℓ−i}(x_{i+1:ℓ} | c, x^i),

which leads to a contradiction. This completes the proof.

C GREEDY BLOCK VERIFICATION

In this section, we show that it is possible to accept more tokens than block verification (Algorithm 2) in one iteration with a modification to the speculative decoding framework in Algorithm 3 that allows the decoding logic to depend on the previous accept/reject decisions. However, as shown in Table 4, the resulting algorithm, greedy block verification, doesn't improve over block verification. We include the description and analysis of the algorithm as a theoretical result. The claims in the main paper hold independently of the results in this section. We start by introducing the algorithm (Algorithm 4) and then discuss the necessary modifications to maintain the identical distribution guarantee.
The above greedy block verification algorithm has a similar procedure as block verification (Algorithm 2) with differences in the setting of of acceptance probabilities and residual distributions, as highlighted. Similar to Algorithm 2, Algorithm 4 maintains a list of probabilities pi s, which satisfies that min{1, pi} is the probability that the subblock Xi is accepted. hi s are chosen to achieve the above acceptance guarantee, and pgreedy res s are chosen to maintain the identical distribution guarantee. Note that compared to pi in block verification, the recursive definition of pi doesn t have a minimum over one term, hence it is always an upper bound on pi s. This leads to a higher acceptance probability for every subblock in greedy block verification (Theorem 3). However, Algorithm 4 cannot be used directly in the iterative implementation of speculative decoding in Algorithm 3. To see this, consider the simple example in Section 2. Greedy block verification will perform the following: Published as a conference paper at ICLR 2025 Algorithm 4 Greedy block verification Input: Draft block Xγ; small model distributions i < γ, Ms( | c, Xi); target model distributions i γ, Mb( | c, Xi). 1: Sample η1, . . . , ηγ U(0, 1). 2: Set τ = 0, p0 = 1. 3: for i = 1, . . . , γ 1 do 4: Set pi = pi 1 Mb(Xi|c,Xi 1) Ms(Xi|c,Xi 1). 5: Set hi = x max{ pi Mb(x|c,Xi) Ms(x|c,Xi), 0} P x max{Ms(x|c,Xi) pi Mb(x|c,Xi), 0} 6: if ηi hi then 7: Set τ = i. 8: else 9: continue. 10: end if 11: end for 12: pγ = pγ 1 Mb(Xγ|c,Xγ 1) Ms(Xγ|c,Xγ 1) 13: if ηγ < pγ then 14: Set τ = γ, and sample Y from Mb( | c, Xγ). 15: else 16: Sample Y from pgreedy res ( | c, Xτ) as below: pgreedy res (x | c, Xi) = max{ pi Mb(x | c, Xi) Ms(x | c, Xi), 0} P x X max{ pi Mb(x | c, Xi) Ms(x | c, Xi), 0} . (30) 17: end if 18: Return Xτ, Y . Accept X1X2 = AB, BA, BB with probability one, and sample an extra token from Mb( ). Accept X1X2 = AA with probability 1/4 and sample an extra token from Mb( ). 
- When $X_1X_2 = AA$ is rejected, accept no tokens and sample a correction token $Y = B$.

Note that in this case, if the algorithm uses $Y$ as the context for the next iteration and samples based on $M_b$, the next token will be $A$ with probability $1/3$. This makes the total probability of generating $BA$ as the first two tokens
$$M_s(BA)\,\Pr(\text{Accept } BA) + M_s(AA)\,\Pr(\text{Reject } AA \text{ and } Y = B)\, M_b(A) = \frac{2}{9}\cdot 1 + \frac{4}{9}\cdot\frac{3}{4}\cdot\frac{1}{3} = \frac{1}{3},$$
which is higher than $M_b(BA) = 2/9$. This violates the identical distribution guarantee.

Below we introduce a distribution modification algorithm, which can be used with Algorithm 4 to maintain the identical distribution guarantee. It can be shown that if $X^\tau, Y$ are returned by Algorithm 4, and the next $\gamma - \tau - 1$ tokens are sampled according to $M_{\text{new}}$ from Algorithm 5, the identical distribution guarantee is maintained. In particular, we have the following lemma:

Lemma 8. Let $X^\gamma \sim M_s^\gamma(\cdot \mid c)$ be the draft tokens and $X^\tau, Y$ be the output of Algorithm 4. Let $M_{\text{new}}$ be the modified distribution from Algorithm 5, and $Z^{\gamma-\tau-1} \sim M_{\text{new}}^{\gamma-\tau-1}(\cdot \mid c, X^\tau, Y)$. Then we have $(X^\tau, Y, Z^{\gamma-\tau-1}) \sim M_b^\gamma(\cdot \mid c)$.

The proof is presented in Appendix C.2. The above leads to the speculative decoding algorithm with greedy block verification, presented in Algorithm 6. Note that Lemma 8 implies that it maintains the identical distribution guarantee.

Algorithm 5 Distribution modification
Input: Small model $M_s$; target model $M_b$; draft length $\gamma$; generated tokens $X^\tau, Y$ from Algorithm 4.
1: Let $M_{\text{new}}$ be such that for all $i \le \gamma - \tau - 1$ and $x^i \in \mathcal{X}^i$,
$$M_{\text{new}}(x_i \mid c, X^\tau, Y, x^{i-1}) = \frac{\max\{M_b(c, X^\tau, Y, x^i) - M_s(c, X^\tau, Y, x^i),\, 0\}}{\sum_{x' \in \mathcal{X}} \max\{M_b(c, X^\tau, Y, x^{i-1}, x') - M_s(c, X^\tau, Y, x^{i-1}, x'),\, 0\}}, \qquad (31)$$
{Modify the distribution at rejected locations.}
and for all $i > \gamma - \tau - 1$ and $x^i \in \mathcal{X}^i$,
$$M_{\text{new}}(x_i \mid c, X^\tau, Y, x^{i-1}) = M_b(x_i \mid c, X^\tau, Y, x^{i-1}).$$
{Keep the distributions for future locations unchanged.}
2: Return $M_{\text{new}}$.
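The modification in Equation (31) can be sketched on the same hypothetical toy example (an assumption: i.i.d. $M_s(A) = 2/3$, $M_b(A) = 1/3$ over $\{A, B\}$). After the whole draft is rejected ($\tau = 0$) and the correction token $Y = B$ is emitted, the next-token distribution becomes the normalized positive part of $M_b - M_s$ along the prefix; the discussion following Theorem 3 notes this becomes a point mass on $B$, which the sketch reproduces.

```python
# Hypothetical toy distributions (an assumption matching the quoted example).
Ms = {"A": 2 / 3, "B": 1 / 3}
Mb = {"A": 1 / 3, "B": 2 / 3}
ALPHABET = "AB"


def seq_prob(M, seq):
    # i.i.d. sequence probability under per-token distribution M (assumption).
    p = 1.0
    for tok in seq:
        p *= M[tok]
    return p


def modified_next(prefix):
    """M_new(x | prefix) per Equation (31), at the first rejected slot."""
    pos = {x: max(seq_prob(Mb, prefix + x) - seq_prob(Ms, prefix + x), 0.0)
           for x in ALPHABET}
    z = sum(pos.values())
    return {x: v / z for x, v in pos.items()}


mn = modified_next("B")  # prefix is the correction token Y = B, tau = 0
print(mn)  # point mass on B
# Expected per-token acceptance in the next iteration drops:
# sum_x min(Ms(x), M_new(x)) = 1/3, versus 2/3 under the original Mb.
print(sum(min(Ms[x], mn[x]) for x in ALPHABET))
```

The second print illustrates the later remark that, with $M_s$ unchanged, draft tokens are less likely to be accepted under the modified target.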
Algorithm 6 Speculative decoding with greedy block verification
Input: Prefix $c$; large model $M_b$; draft model $M_s$; draft length $\gamma$.
1: while E.O.S. $\notin (X^\tau, Y)$ do
2:   Sample $X_1, \ldots, X_\gamma \sim M_s(\cdot \mid c)$ using autoregressive sampling, keeping the conditional probabilities $M_s(\cdot \mid c, X^i)$ for $i = 0, \ldots, \gamma - 1$ at each step. {Obtain draft block.}
3:   Call the large model $M_b$ and compute the conditional probabilities $M_b(\cdot \mid c, X^i)$⁶ for $i = 0, 1, \ldots, \gamma$ in parallel. {Parallel scoring.}
4:   Get the accepted tokens with draft verification: $X^\tau, Y = \text{VERIFY}(X^\gamma, \{M_s(\cdot \mid c, X^i)\}_{i=0}^{\gamma-1}, \{M_b(\cdot \mid c, X^i)\}_{i=0}^{\gamma})$. {Draft verification and correction.}
5:   $c \leftarrow (c, X^\tau, Y)$. {Add decoded tokens to the prefix.}
6:   $M_b \leftarrow \text{DISTRIBUTIONMODIFY}(M_b, M_s, \gamma, X^\tau, Y)$. {Modify target distribution.}
7: end while

C.1 COMPARISON TO BLOCK VERIFICATION

In one draft iteration with the same pair of draft and target distributions, greedy block verification is always at least as good.

Theorem 3 (Informal). In one draft iteration with the same models $M_s, M_b$ and draft length $\gamma$, greedy block verification decodes at least as many tokens as block verification.

The theorem is proved in Appendix C.3. However, due to the distribution modification step, the target distribution might change after the first iteration, which might affect the expected number of accepted tokens in later iterations. For example, in the Bernoulli example considered in Section 2, suppose the draft block $X_1X_2 = AA$ is rejected by greedy block verification. It can be shown that the modified distribution is then a point mass on token $B$. In future iterations, if the algorithm still uses $M_s$ as the draft model, there is a lower chance that the draft tokens will be accepted. Hence, theoretically, it is unclear whether one approach dominates the other.

Empirical comparison. We conduct the same set of experiments as in Section 6 on greedy block verification to compare the two approaches empirically. We list the block efficiency comparison when PALM-2-XXS is used as the drafter and γ = 8 in Table 4.
As we can see, while greedy block verification still consistently improves over token verification, the improvement is less significant than that of block verification. The trend is the same for wall-clock numbers as well as in other parameter settings. Hence we recommend using block verification instead of the greedy version.

Footnote 6: After the first iteration, $M_b$ may no longer be the original large transformer model. In that case $M_b$ can be obtained by evaluating with the original large model and then performing the modification in Equation (31).

Table 4: Block efficiency comparison among token verification, block verification, and greedy block verification with γ = 8. Each statistic is computed using 1000 test prompts from different datasets on various tasks (each run is an average over 3 different random seeds).

| Dataset | Token verification | Block verification | Greedy block verification |
|---|---|---|---|
| LM1B | 3.21 | 3.49 | 3.30 |
| GPT Prompt | 3.41 | 3.76 | 3.51 |
| Web QA | 3.44 | 3.70 | 3.52 |
| PIQA | 3.40 | 3.68 | 3.49 |
| Share GPT | 3.34 | 3.62 | 3.44 |
| XSum | 3.49 | 3.76 | 3.59 |
| GSM8K | 3.81 | 4.15 | 3.96 |
| WMT-De En | 3.19 | 3.41 | 3.26 |

C.2 PROOF OF LEMMA 8

In the proof, we drop the context $c$; the argument generalizes to arbitrary $c$. We start by introducing two useful quantities:
$$p_{\text{remain}}(x^i) := \sum_x \max\{M_b(x^i, x) - M_s(x^i, x),\, 0\}, \qquad (32)$$
$$p_{\text{rej}}(x^i) := \sum_x \max\{M_s(x^i, x) - M_b(x^i, x),\, 0\}. \qquad (33)$$
Note that $\tilde{p}_i$ in Algorithm 4 depends on the draft block $x^i$, and by its recursive definition we have $\tilde{p}_i = M_b(x^i) / M_s(x^i)$. Hence
$$\frac{\sum_x \max\{\tilde{p}_i M_b(x \mid c, X^i) - M_s(x \mid c, X^i),\, 0\}}{\sum_x \max\{M_s(x \mid c, X^i) - \tilde{p}_i M_b(x \mid c, X^i),\, 0\}} = \frac{p_{\text{remain}}(x^i)}{p_{\text{rej}}(x^i)}.$$
Moreover, the expression for $\tilde{p}_i$ also implies that $p^{\text{greedy}}_{\text{res}}(\cdot \mid c, X^\tau) = M_{\text{new}}(\cdot \mid c, X^\tau)$ (defined in Equation (30) and Equation (31), respectively). We now prove the following lemma about the acceptance length $\tau$ in Algorithm 4.

Lemma 9. For all $\ell \in [1, \gamma]$ and $x^\ell \in \mathcal{X}^\ell$,
$$\Pr(X^\ell = x^\ell, \tau \ge \ell) = \min\{M_b(x^\ell), M_s(x^\ell)\}.$$

Proof. We prove this by induction in the backward direction.
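The key identity the backward induction relies on, namely $\sum_x \min\{M_b(x^i, x), M_s(x^i, x)\} + \min\{p_{\text{remain}}(x^i), p_{\text{rej}}(x^i)\} = \min\{M_b(x^i), M_s(x^i)\}$, can be checked numerically. The distributions below are a hypothetical toy instance (i.i.d. $M_s(A) = 2/3$, $M_b(A) = 1/3$ over $\{A, B\}$, consistent with the numbers quoted from the Section 2 example), not the paper's experimental setup.

```python
# Numeric check of the identity behind Lemma 9 on a hypothetical toy example:
#   sum_x min{Mb(x^i, x), Ms(x^i, x)} + min{premain(x^i), prej(x^i)}
#     = min{Mb(x^i), Ms(x^i)}.
Ms = {"A": 2 / 3, "B": 1 / 3}  # assumed i.i.d. per-token draft model
Mb = {"A": 1 / 3, "B": 2 / 3}  # assumed i.i.d. per-token target model
ALPHABET = "AB"


def seq_prob(M, seq):
    p = 1.0
    for tok in seq:
        p *= M[tok]
    return p


def premain(prefix):  # Equation (32)
    return sum(max(seq_prob(Mb, prefix + x) - seq_prob(Ms, prefix + x), 0.0)
               for x in ALPHABET)


def prej(prefix):  # Equation (33)
    return sum(max(seq_prob(Ms, prefix + x) - seq_prob(Mb, prefix + x), 0.0)
               for x in ALPHABET)


for prefix in ALPHABET:
    lhs = (sum(min(seq_prob(Mb, prefix + x), seq_prob(Ms, prefix + x))
               for x in ALPHABET)
           + min(premain(prefix), prej(prefix)))
    rhs = min(seq_prob(Mb, prefix), seq_prob(Ms, prefix))
    assert abs(lhs - rhs) < 1e-12, (prefix, lhs, rhs)
print("identity holds on all length-1 prefixes")
```

The same check passes for any prefix length, which is exactly the combination step (Equation (37)) in the proof below.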
When $\ell = \gamma$, Steps 12-14 in Algorithm 4 accept $X^\gamma$ with probability
$$\Pr(\tau = \gamma \mid X^\gamma = x^\gamma) = \min\{1, \tilde{p}_\gamma\} = \min\left\{1, \frac{M_b(x^\gamma)}{M_s(x^\gamma)}\right\},$$
and hence
$$\Pr(X^\gamma = x^\gamma, \tau \ge \gamma) = \Pr(X^\gamma = x^\gamma)\,\Pr(\tau = \gamma \mid X^\gamma = x^\gamma) = \min\{M_s(x^\gamma), M_b(x^\gamma)\}.$$
Suppose the equation holds for all $\ell \ge \ell_0$. For $\ell = \ell_0 - 1$, we have
$$\Pr(X^{\ell_0-1} = x^{\ell_0-1}, \tau \ge \ell_0 - 1) = \Pr(X^{\ell_0-1} = x^{\ell_0-1}, \tau \ge \ell_0) + \Pr(X^{\ell_0-1} = x^{\ell_0-1}, \tau = \ell_0 - 1).$$
Next we consider the two terms separately. For the first term, by the induction hypothesis, we have
$$\Pr(X^{\ell_0-1} = x^{\ell_0-1}, \tau \ge \ell_0) = \sum_{x \in \mathcal{X}} \min\{M_b(x^{\ell_0-1}, x), M_s(x^{\ell_0-1}, x)\}.$$
For the second term, we have
$$\begin{aligned}
\Pr(X^{\ell_0-1} = x^{\ell_0-1}, \tau = \ell_0 - 1) &= \Pr(X^{\ell_0-1} = x^{\ell_0-1}, \tau < \ell_0)\,\Pr(X^{\ell_0-1}\text{ is accepted}) \qquad (34) \\
&= \Pr(X^{\ell_0-1} = x^{\ell_0-1}, \tau < \ell_0)\,\Pr(\eta_{\ell_0-1} \le h_{\ell_0-1}) \\
&= \left(\Pr(X^{\ell_0-1} = x^{\ell_0-1}) - \Pr(X^{\ell_0-1} = x^{\ell_0-1}, \tau \ge \ell_0)\right)\min\left\{1, \frac{p_{\text{remain}}(x^{\ell_0-1})}{p_{\text{rej}}(x^{\ell_0-1})}\right\} \qquad (35) \\
&= \left(M_s(x^{\ell_0-1}) - \sum_{x} \min\{M_b(x^{\ell_0-1}, x), M_s(x^{\ell_0-1}, x)\}\right)\min\left\{1, \frac{p_{\text{remain}}(x^{\ell_0-1})}{p_{\text{rej}}(x^{\ell_0-1})}\right\} \\
&= p_{\text{rej}}(x^{\ell_0-1})\min\left\{1, \frac{p_{\text{remain}}(x^{\ell_0-1})}{p_{\text{rej}}(x^{\ell_0-1})}\right\} \qquad (36) \\
&= \min\{p_{\text{remain}}(x^{\ell_0-1}),\, p_{\text{rej}}(x^{\ell_0-1})\}.
\end{aligned}$$
In the above derivation, Equation (34) holds because Algorithm 4 outputs the longest accepted subblock; Equation (35) follows from the induction hypothesis; Equation (36) follows from the definition of $p_{\text{rej}}$ in Equation (33). Combining the two terms, we have
$$\begin{aligned}
\Pr(X^{\ell_0-1} = x^{\ell_0-1}, \tau \ge \ell_0 - 1) &= \sum_{x \in \mathcal{X}} \min\{M_b(x^{\ell_0-1}, x), M_s(x^{\ell_0-1}, x)\} + \min\{p_{\text{remain}}(x^{\ell_0-1}),\, p_{\text{rej}}(x^{\ell_0-1})\} \\
&= \min\Big\{\sum_{x \in \mathcal{X}} \min\{M_b(x^{\ell_0-1}, x), M_s(x^{\ell_0-1}, x)\} + p_{\text{remain}}(x^{\ell_0-1}),\ \sum_{x \in \mathcal{X}} \min\{M_b(x^{\ell_0-1}, x), M_s(x^{\ell_0-1}, x)\} + p_{\text{rej}}(x^{\ell_0-1})\Big\} \qquad (37) \\
&= \min\Big\{\sum_{x \in \mathcal{X}} M_b(x^{\ell_0-1}, x),\ \sum_{x \in \mathcal{X}} M_s(x^{\ell_0-1}, x)\Big\} \\
&= \min\{M_b(x^{\ell_0-1}), M_s(x^{\ell_0-1})\},
\end{aligned}$$
which is the desired quantity in the lemma. Here Equation (37) follows from the definitions of $p_{\text{remain}}$ and $p_{\text{rej}}$ in Equations (32) and (33). This concludes the proof.

Next we proceed to prove Lemma 8. Let $O^\gamma = (X^\tau, Y, Z^{\gamma-\tau-1})$. It is enough to show that for all $i \le \gamma$ and $x^i \in \mathcal{X}^i$, $\Pr(O^i = x^i) = M_b(x^i)$. We prove this via induction. The claim holds trivially for $i = 0$, where both sides equal 1. Suppose the claim holds for all $i \le i_0$.
This means that for all $x^{i_0} \in \mathcal{X}^{i_0}$, we have $\Pr(O^{i_0} = x^{i_0}) = M_b(x^{i_0})$. When $i = i_0 + 1$, by the algorithm, either $\tau \ge i_0 + 1$, in which case $O_{i_0+1}$ is an accepted token, or $\tau \le i_0$, in which case $O_{i_0+1}$ is sampled according to $M_{\text{new}}(\cdot \mid O^{i_0})$ (or $p^{\text{greedy}}_{\text{res}}(\cdot \mid O^{i_0})$, which is the same as $M_{\text{new}}(\cdot \mid O^{i_0})$). By Lemma 9, we have
$$\begin{aligned}
\Pr(O^{i_0+1} = x^{i_0+1}) &= \Pr(O^{i_0+1} = x^{i_0+1}, \tau \ge i_0 + 1) + \Pr(O^{i_0+1} = x^{i_0+1}, \tau \le i_0) \\
&= \Pr(X^{i_0+1} = x^{i_0+1}, \tau \ge i_0 + 1) + \Pr(O^{i_0} = x^{i_0}, \tau \le i_0)\, M_{\text{new}}(x_{i_0+1} \mid x^{i_0}) \\
&= \Pr(X^{i_0+1} = x^{i_0+1}, \tau \ge i_0 + 1) + \left(\Pr(O^{i_0} = x^{i_0}) - \Pr(O^{i_0} = x^{i_0}, \tau \ge i_0 + 1)\right) M_{\text{new}}(x_{i_0+1} \mid x^{i_0}) \\
&= \min\{M_b(x^{i_0+1}), M_s(x^{i_0+1})\} + \Big(M_b(x^{i_0}) - \sum_x \min\{M_b(x^{i_0}, x), M_s(x^{i_0}, x)\}\Big) M_{\text{new}}(x_{i_0+1} \mid x^{i_0}) \qquad (38) \\
&= \min\{M_b(x^{i_0+1}), M_s(x^{i_0+1})\} + \Big(\sum_x \max\{M_b(x^{i_0}, x) - M_s(x^{i_0}, x),\, 0\}\Big) M_{\text{new}}(x_{i_0+1} \mid x^{i_0}) \\
&= \min\{M_b(x^{i_0+1}), M_s(x^{i_0+1})\} + \max\{M_b(x^{i_0}, x_{i_0+1}) - M_s(x^{i_0}, x_{i_0+1}),\, 0\} \qquad (39) \\
&= M_b(x^{i_0+1}).
\end{aligned}$$
Here Equation (38) follows from the induction hypothesis and Lemma 9, and Equation (39) follows from the definition of $M_{\text{new}}$ in Equation (31). By induction, this concludes the proof.

C.3 PROOF OF THEOREM 3

To prove Theorem 3, we first observe the following. For Algorithm 4, let $\tau$ be the number of accepted tokens. By Lemma 9, we have
$$\mathbb{E}_{X^\gamma \sim M_s^\gamma}[\tau] = \sum_{\ell=1}^{\gamma} \Pr(\tau \ge \ell) = \sum_{\ell=1}^{\gamma} \sum_{x^\ell \in \mathcal{X}^\ell} \Pr(X^\ell = x^\ell, \tau \ge \ell) = \sum_{\ell=1}^{\gamma} \sum_{x^\ell \in \mathcal{X}^\ell} \min\{M_s(x^\ell), M_b(x^\ell)\}.$$
Next we show that this expected acceptance length is optimal for a family of draft verification algorithms that perform a coupling between sample blocks from the draft and target distributions. For all $x^\gamma, y^\gamma \in \mathcal{X}^\gamma$, let the maximum common prefix length be defined as
$$\beta(x^\gamma, y^\gamma) := \max\{\ell \le \gamma : \forall i \le \ell,\ x_i = y_i\}.$$
Formally, let $\pi$ be a joint distribution over $\mathcal{X}^\gamma \times \mathcal{X}^\gamma$. Then Algorithm 4 solves the following optimization problem:
$$\max_\pi\ \mathbb{E}_{(X^\gamma, Y^\gamma) \sim \pi}\left[\beta(X^\gamma, Y^\gamma)\right], \qquad (40)$$
subject to the constraints
$$\sum_{y^\gamma} \pi(x^\gamma, y^\gamma) = M_s^\gamma(x^\gamma) \quad \forall x^\gamma \in \mathcal{X}^\gamma, \qquad (41)$$
$$\sum_{x^\gamma} \pi(x^\gamma, y^\gamma) = M_b^\gamma(y^\gamma) \quad \forall y^\gamma \in \mathcal{X}^\gamma. \qquad (42)$$
In this formulation, the marginal distributions satisfy $X^\gamma \sim M_s^\gamma$ and $Y^\gamma \sim M_b^\gamma$, and the maximum common prefix length corresponds to the number of accepted tokens in one iteration of speculative decoding.
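The expected acceptance length $\sum_{\ell \le \gamma} \sum_{x^\ell} \min\{M_s(x^\ell), M_b(x^\ell)\}$ can be evaluated by direct enumeration. The sketch below uses the same hypothetical i.i.d. toy distributions as earlier in this section (an assumption: $M_s(A) = 2/3$, $M_b(A) = 1/3$ over $\{A, B\}$, $\gamma = 2$), where the sum works out to $2/3 + 2/3 = 4/3$.

```python
from itertools import product

# Hypothetical toy i.i.d. distributions (an assumption, not the paper's setup).
Ms = {"A": 2 / 3, "B": 1 / 3}
Mb = {"A": 1 / 3, "B": 2 / 3}


def seq_prob(M, seq):
    p = 1.0
    for tok in seq:
        p *= M[tok]
    return p


def optimal_expected_tau(gamma):
    """E[tau] = sum_{l <= gamma} sum_{x^l} min{Ms(x^l), Mb(x^l)} (Lemma 9)."""
    total = 0.0
    for length in range(1, gamma + 1):
        for seq in product("AB", repeat=length):
            total += min(seq_prob(Ms, seq), seq_prob(Mb, seq))
    return total


print(optimal_expected_tau(2))  # ≈ 1.3333 (= 4/3) on this toy example
```

By Lemma 10 below, no coupling satisfying Equations (41) and (42) can exceed this value, so on this toy instance greedy block verification attains the optimum.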
Note that the optimization problem can be viewed as an optimal transport problem (Villani et al., 2009) between the distributions $M_s^\gamma(\cdot)$ and $M_b^\gamma(\cdot)$ with cost function $\gamma - \beta(X^\gamma, Y^\gamma)$. The next lemma establishes the optimality of Algorithm 4 for this problem.

Lemma 10. The value of Equation (40) is upper bounded by
$$\sum_{\ell=1}^{\gamma} \sum_{x^\ell \in \mathcal{X}^\ell} \min\{M_s^\ell(x^\ell), M_b^\ell(x^\ell)\}.$$

Proof of Lemma 10. For any $\pi$ that satisfies Equations (41) and (42) and $(X^\gamma, Y^\gamma) \sim \pi$, we have
$$\begin{aligned}
\mathbb{E}_{X^\gamma, Y^\gamma}[\beta(X^\gamma, Y^\gamma)] &\stackrel{(a)}{=} \sum_{\ell \le \gamma} \Pr_{X^\gamma, Y^\gamma}\left(\beta(X^\gamma, Y^\gamma) \ge \ell\right) \\
&= \sum_{\ell \le \gamma} \sum_{x^\ell} \Pr_{X^\gamma, Y^\gamma}\left(X^\ell = Y^\ell = x^\ell, \beta(X^\gamma, Y^\gamma) \ge \ell\right) \\
&\le \sum_{\ell \le \gamma} \sum_{x^\ell} \Pr_{X^\gamma, Y^\gamma}\left(X^\ell = Y^\ell = x^\ell\right) \\
&\stackrel{(b)}{\le} \sum_{\ell \le \gamma} \sum_{x^\ell} \min\left\{\Pr(X^\ell = x^\ell),\ \Pr(Y^\ell = x^\ell)\right\} \\
&= \sum_{\ell \le \gamma} \sum_{x^\ell} \min\{M_s^\ell(x^\ell), M_b^\ell(x^\ell)\}.
\end{aligned}$$
Here (a) follows from the fact that for a positive integer random variable, $\mathbb{E}[X] = \sum_i \Pr(X \ge i)$; (b) follows from the fact that a joint probability is upper bounded by the minimum of its marginals.

Theorem 3 then holds by noticing that block verification (Algorithm 2) is also an instance of such a coupling, obtained by setting $X^\gamma$ to be the draft tokens and $Y^\gamma = (X^\tau, Y, Z^{\gamma-\tau-1})$, where $(X^\tau, Y)$ are the outputs of Algorithm 2 and $Z^{\gamma-\tau-1} \sim M_b(\cdot \mid c, X^\tau, Y)$ (Lemma 2).

D ADDITIONAL EXPERIMENTAL RESULTS

D.1 COMPARISON TO SPECULATIVE DECODING WITH MULTIPLE DRAFTS

Recent works (Sun et al., 2023; Miao et al., 2023) have extended speculative decoding to the case of multiple draft blocks to improve block efficiency. However, these methods also increase the computation required from the large model to verify the drafts. In high-throughput LLM serving systems, query batching (Kwon et al., 2023) is a common technique where multiple prefixes are decoded at the same time. In these cases, inference is less memory-bound and there is not enough spare parallel compute to evaluate the increased number of drafts without increasing latency. We empirically compare block verification against SpecTr (Sun et al., 2023) and SpecInfer (Miao et al., 2023) with query batching.
We set the batch size B = 8 and use PALM-2-XXS as the draft model. In Table 5, we list the wall-clock speedup and block efficiency for γ = 8. The number of draft blocks for SpecTr and SpecInfer is taken to be 2, which achieves the lowest latency over {2, 4, 8} when B = 8, γ = 8. We observe that while SpecTr and SpecInfer achieve higher block efficiencies, due to the increased computation required to evaluate more candidates, our method achieves better speedup than SpecTr and SpecInfer, demonstrating the advantage of our method in the common practical setting with query batching.

Table 5: γ = 8, B = 8. Speedup comparison between token verification (TOKENV) and block verification (BLOCKV), alongside SpecTr and SpecInfer, with PALM-2-XXS as the draft model on various datasets and tasks. "Speedup" is wall-clock speedup over the baseline; "BE" is block efficiency.

| Dataset | TOKENV speedup | BLOCKV speedup | SpecTr speedup | SpecInfer speedup | TOKENV BE | BLOCKV BE | SpecTr BE | SpecInfer BE |
|---|---|---|---|---|---|---|---|---|
| GPT Prompt | 1.300 | 1.381 | 1.290 | 1.263 | 3.394 | 3.715 | 3.898 | 3.833 |
| Web QA | 1.302 | 1.368 | 1.279 | 1.274 | 3.451 | 3.7 | 3.933 | 3.894 |
| Share GPT | 1.267 | 1.333 | 1.244 | 1.236 | 3.366 | 3.63 | 3.824 | 3.78 |
| GSM8K | 1.353 | 1.445 | 1.344 | 1.319 | 3.856 | 4.179 | 4.356 | 4.277 |
| XSum | 1.328 | 1.403 | 1.300 | 1.285 | 3.487 | 3.768 | 3.949 | 3.897 |
| PIQA | 1.305 | 1.377 | 1.270 | 1.280 | 3.401 | 3.685 | 3.846 | 3.82 |
| LM1B | 1.274 | 1.344 | 1.253 | 1.245 | 3.218 | 3.494 | 3.669 | 3.629 |
| WMT-De En | 1.222 | 1.293 | 1.204 | 1.194 | 3.165 | 3.422 | 3.603 | 3.56 |

D.2 DETAILED RESULTS WITH OTHER PARAMETER SETTINGS

In this section, we present experimental results for the same set of experiments described in Section 6 with different block lengths (γ = 4, 6, 8) and different drafters (PALM-2-XXS and PALM-2-XXXS):

- Table 6. Drafter: PALM-2-XXS, γ = 4.
- Table 7. Drafter: PALM-2-XXS, γ = 6.
- Table 8. Drafter: PALM-2-XXXS, γ = 4.
- Table 9. Drafter: PALM-2-XXXS, γ = 6.
- Table 10. Drafter: PALM-2-XXXS, γ = 8.

Table 6: Speedup comparison between token verification (TOKENV) and block verification (BLOCKV) with γ = 4 and PALM-2-XXS as the draft model.
Each statistic is computed using 1000 test prompts from different datasets on various tasks (each run is an average over 3 different random seeds). Numbers after ± represent standard deviations.

| Dataset | TOKENV block eff. | BLOCKV block eff. | Improve. % | TOKENV speedup | BLOCKV speedup | Improve. % |
|---|---|---|---|---|---|---|
| LM1B | 2.78 ± 0.01 | 2.88 ± 0.01 | 3.48 ± 0.24 | 2.36 ± 0.00 | 2.42 ± 0.01 | 2.51 ± 0.22 |
| GPT Prompt | 2.88 ± 0.01 | 3.00 ± 0.00 | 4.33 ± 0.25 | 2.43 ± 0.01 | 2.51 ± 0.00 | 3.43 ± 0.24 |
| Web QA | 2.91 ± 0.01 | 2.99 ± 0.01 | 2.83 ± 0.65 | 2.45 ± 0.01 | 2.50 ± 0.01 | 1.94 ± 0.61 |
| PIQA | 2.89 ± 0.00 | 2.99 ± 0.01 | 3.48 ± 0.21 | 2.44 ± 0.00 | 2.50 ± 0.01 | 2.66 ± 0.20 |
| Share GPT | 2.85 ± 0.01 | 2.95 ± 0.00 | 3.48 ± 0.19 | 2.41 ± 0.01 | 2.47 ± 0.00 | 2.63 ± 0.17 |
| XSum | 2.94 ± 0.01 | 3.03 ± 0.01 | 3.24 ± 0.51 | 2.48 ± 0.01 | 2.54 ± 0.01 | 2.35 ± 0.48 |
| GSM8K | 3.12 ± 0.01 | 3.21 ± 0.02 | 3.06 ± 0.95 | 2.62 ± 0.01 | 2.68 ± 0.02 | 2.19 ± 0.89 |
| WMT-De En | 2.75 ± 0.01 | 2.83 ± 0.01 | 2.99 ± 0.09 | 2.33 ± 0.01 | 2.38 ± 0.01 | 2.18 ± 0.09 |
| Average | 2.89 | 2.99 | 3.36 | 2.44 | 2.50 | 2.49 |

Table 7: Speedup comparison between token verification (TOKENV) and block verification (BLOCKV) with γ = 6 and PALM-2-XXS as the draft model. Each statistic is computed using 1000 test prompts from different datasets on various tasks (each run is an average over 3 different random seeds). Numbers after ± represent standard deviations.

| Dataset | TOKENV block eff. | BLOCKV block eff. | Improve. % | TOKENV speedup | BLOCKV speedup | Improve. % |
|---|---|---|---|---|---|---|
| LM1B | 3.08 ± 0.01 | 3.27 ± 0.01 | 6.42 ± 0.07 | 2.32 ± 0.01 | 2.43 ± 0.01 | 5.00 ± 0.06 |
| GPT Prompt | 3.22 ± 0.01 | 3.44 ± 0.02 | 6.55 ± 0.83 | 2.42 ± 0.00 | 2.54 ± 0.02 | 5.06 ± 0.77 |
| Web QA | 3.26 ± 0.01 | 3.44 ± 0.01 | 5.60 ± 0.22 | 2.45 ± 0.01 | 2.55 ± 0.01 | 4.24 ± 0.21 |
| PIQA | 3.22 ± 0.02 | 3.43 ± 0.02 | 6.36 ± 0.78 | 2.42 ± 0.01 | 2.54 ± 0.01 | 4.92 ± 0.72 |
| Share GPT | 3.18 ± 0.02 | 3.37 ± 0.01 | 6.13 ± 0.53 | 2.39 ± 0.02 | 2.50 ± 0.01 | 4.74 ± 0.49 |
| XSum | 3.29 ± 0.01 | 3.48 ± 0.01 | 5.91 ± 0.82 | 2.47 ± 0.01 | 2.58 ± 0.01 | 4.47 ± 0.77 |
| GSM8K | 3.56 ± 0.01 | 3.80 ± 0.03 | 6.86 ± 0.60 | 2.66 ± 0.01 | 2.80 ± 0.02 | 5.38 ± 0.56 |
| WMT-De En | 3.04 ± 0.01 | 3.19 ± 0.01 | 4.92 ± 0.29 | 2.29 ± 0.01 | 2.37 ± 0.01 | 3.57 ± 0.27 |
| Average | 3.23 | 3.43 | 6.10 | 2.43 | 2.54 | 4.67 |

Table 8: Speedup comparison between token verification (TOKENV) and block verification (BLOCKV) with γ = 4 and PALM-2-XXXS as the draft model. Each statistic is computed using 1000 test prompts from different datasets on various tasks (each run is an average over 3 different random seeds). Numbers after ± represent standard deviations.

| Dataset | TOKENV block eff. | BLOCKV block eff. | Improve. % | TOKENV speedup | BLOCKV speedup | Improve. % |
|---|---|---|---|---|---|---|
| LM1B | 2.24 ± 0.00 | 2.33 ± 0.01 | 4.23 ± 0.44 | 2.25 ± 0.00 | 2.34 ± 0.01 | 3.89 ± 0.41 |
| GPT Prompt | 2.41 ± 0.02 | 2.48 ± 0.01 | 2.96 ± 1.00 | 2.42 ± 0.02 | 2.48 ± 0.01 | 2.72 ± 0.94 |
| Web QA | 2.38 ± 0.01 | 2.45 ± 0.01 | 2.87 ± 0.13 | 2.39 ± 0.01 | 2.45 ± 0.01 | 2.63 ± 0.12 |
| PIQA | 2.36 ± 0.01 | 2.43 ± 0.01 | 3.22 ± 0.37 | 2.37 ± 0.01 | 2.44 ± 0.01 | 2.97 ± 0.35 |
| Share GPT | 2.34 ± 0.00 | 2.42 ± 0.01 | 3.49 ± 0.12 | 2.35 ± 0.00 | 2.42 ± 0.01 | 3.16 ± 0.12 |
| XSum | 2.38 ± 0.01 | 2.45 ± 0.01 | 2.91 ± 0.63 | 2.39 ± 0.01 | 2.45 ± 0.01 | 2.68 ± 0.60 |
| GSM8K | 2.51 ± 0.01 | 2.58 ± 0.02 | 2.99 ± 0.47 | 2.51 ± 0.01 | 2.58 ± 0.02 | 2.74 ± 0.44 |
| WMT-De En | 2.22 ± 0.00 | 2.28 ± 0.00 | 2.59 ± 0.09 | 2.24 ± 0.00 | 2.29 ± 0.00 | 2.37 ± 0.08 |
| Average | 2.35 | 2.43 | 3.16 | 2.36 | 2.43 | 2.89 |

Table 9: Speedup comparison between token verification (TOKENV) and block verification (BLOCKV) with γ = 6 and PALM-2-XXXS as the draft model. Each statistic is computed using 1000 test prompts from different datasets on various tasks (each run is an average over 3 different random seeds). Numbers after ± represent standard deviations.
| Dataset | TOKENV block eff. | BLOCKV block eff. | Improve. % | TOKENV speedup | BLOCKV speedup | Improve. % |
|---|---|---|---|---|---|---|
| LM1B | 2.36 ± 0.01 | 2.48 ± 0.00 | 4.93 ± 0.46 | 2.27 ± 0.01 | 2.37 ± 0.00 | 4.55 ± 0.43 |
| GPT Prompt | 2.58 ± 0.04 | 2.72 ± 0.02 | 5.57 ± 1.29 | 2.46 ± 0.03 | 2.59 ± 0.01 | 5.10 ± 1.22 |
| Web QA | 2.54 ± 0.00 | 2.68 ± 0.02 | 5.46 ± 0.50 | 2.43 ± 0.00 | 2.55 ± 0.01 | 5.02 ± 0.47 |
| PIQA | 2.50 ± 0.00 | 2.62 ± 0.01 | 5.06 ± 0.39 | 2.39 ± 0.00 | 2.50 ± 0.01 | 4.66 ± 0.37 |
| Share GPT | 2.47 ± 0.01 | 2.60 ± 0.01 | 5.10 ± 0.49 | 2.37 ± 0.01 | 2.48 ± 0.01 | 4.69 ± 0.46 |
| XSum | 2.54 ± 0.01 | 2.67 ± 0.01 | 4.83 ± 0.47 | 2.43 ± 0.01 | 2.54 ± 0.01 | 4.45 ± 0.44 |
| GSM8K | 2.71 ± 0.03 | 2.83 ± 0.00 | 4.27 ± 0.89 | 2.58 ± 0.02 | 2.69 ± 0.00 | 3.92 ± 0.84 |
| WMT-De En | 2.31 ± 0.01 | 2.43 ± 0.02 | 5.38 ± 0.57 | 2.21 ± 0.00 | 2.32 ± 0.01 | 4.99 ± 0.54 |
| Average | 2.50 | 2.63 | 5.07 | 2.39 | 2.50 | 4.67 |

Table 10: Speedup comparison between token verification (TOKENV) and block verification (BLOCKV) with γ = 8 and PALM-2-XXXS as the draft model. Each statistic is computed using 1000 test prompts from different datasets on various tasks (each run is an average over 3 different random seeds). Numbers after ± represent standard deviations.

| Dataset | TOKENV block eff. | BLOCKV block eff. | Improve. % | TOKENV speedup | BLOCKV speedup | Improve. % |
|---|---|---|---|---|---|---|
| LM1B | 2.40 ± 0.01 | 2.55 ± 0.01 | 6.19 ± 0.43 | 2.13 ± 0.01 | 2.25 ± 0.01 | 5.28 ± 0.40 |
| GPT Prompt | 2.66 ± 0.01 | 2.82 ± 0.02 | 6.28 ± 1.01 | 2.35 ± 0.01 | 2.47 ± 0.02 | 5.37 ± 0.95 |
| Web QA | 2.61 ± 0.01 | 2.78 ± 0.00 | 6.27 ± 0.49 | 2.31 ± 0.01 | 2.43 ± 0.00 | 5.39 ± 0.46 |
| PIQA | 2.57 ± 0.01 | 2.76 ± 0.01 | 7.48 ± 0.51 | 2.27 ± 0.01 | 2.42 ± 0.01 | 6.51 ± 0.47 |
| Share GPT | 2.54 ± 0.01 | 2.71 ± 0.01 | 6.63 ± 0.72 | 2.25 ± 0.01 | 2.38 ± 0.01 | 5.68 ± 0.68 |
| XSum | 2.60 ± 0.01 | 2.77 ± 0.00 | 6.46 ± 0.49 | 2.30 ± 0.01 | 2.43 ± 0.00 | 5.53 ± 0.46 |
| GSM8K | 2.82 ± 0.02 | 2.98 ± 0.03 | 5.48 ± 1.18 | 2.49 ± 0.01 | 2.60 ± 0.03 | 4.62 ± 1.11 |
| WMT-De En | 2.37 ± 0.00 | 2.49 ± 0.01 | 5.33 ± 0.46 | 2.10 ± 0.00 | 2.20 ± 0.01 | 4.53 ± 0.43 |
| Average | 2.57 | 2.73 | 6.27 | 2.28 | 2.40 | 5.36 |