Published as a conference paper at ICLR 2025

FASTER CASCADES VIA SPECULATIVE DECODING

Harikrishna Narasimhan1, Wittawat Jitkrittum1, Ankit Singh Rawat1, Seungyeon Kim2, Neha Gupta3, Aditya Krishna Menon1, Sanjiv Kumar1
1Google Research, 2Google DeepMind, 3Mistral AI
{hnarasimhan, wittawat, ankitsrawat, adityakmenon, sanjivk}@google.com

ABSTRACT

Cascades and speculative decoding are two common approaches to improving language model inference efficiency. Both approaches interleave two models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for hard inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel scoring mode. These mechanisms offer different benefits: cascades offer compelling cost-quality trade-offs, often even outperforming the large model; speculative decoding offers impressive speed-ups while guaranteeing quality-neutrality. In this paper, we leverage the best of both these approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative cascades, and employ a plug-in approximation to the optimal rule. Experiments with Gemma and T5 models on a range of language benchmarks show that our approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.

1 INTRODUCTION

Large language models (LLMs) have yielded significant advances in quality on a range of natural language processing tasks (Radford et al., 2018; Raffel et al., 2020; Brown et al., 2020; Black et al., 2022; Chowdhery et al., 2022; Anil et al., 2023; Touvron et al., 2023; Team et al., 2023; et al., 2024b;a), at the cost of an increase in inference latency.
This has sparked a growing body of literature on reducing LLM inference costs without (overly) compromising on quality (Elbayad et al., 2020; Pope et al., 2022; Schuster et al., 2022; Leviathan et al., 2023; Chen et al., 2023a; Sheng et al., 2023; Sun et al., 2024). One such line of work involves constructing a family of models of various sizes (e.g., a small and large model), and suitably orchestrating amongst them to make a prediction. Two canonical instantiations of this strategy are model cascading (Wang et al., 2020; Mamou et al., 2022; Varshney & Baral, 2022; Khalili et al., 2022; Dohan et al., 2022; Chen et al., 2023b; Gupta et al., 2024; Ding et al., 2024) and speculative decoding (Stern et al., 2018; Chen et al., 2023a; Leviathan et al., 2023; Sun et al., 2024; Li et al., 2024a; Xia et al., 2024).

While similar in spirit, cascades and speculative decoding are fundamentally different in their details. Cascades employ a deferral rule to identify hard inputs, and only invoke larger models on such inputs. For example, in a two-model cascade, one first invokes the smaller model, and uses its associated probability of the generated output to decide whether to defer to the larger model. By contrast, speculative decoding uses a small model to draft a block of tokens via standard autoregressive decoding, which are then verified in parallel by a large model. One then accepts all drafted tokens until the first implausible one, which is rolled back based on the larger LM's prediction.

Owing to their different mechanisms, both methods have complementary strengths. Cascades seek to output distributions that have the best quality for a given cost budget, sometimes even yielding better quality than the individual models they are constructed with (Jitkrittum et al., 2023; Kim et al., 2023) (§3).
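As an illustration of the deferral mechanism just described, a two-model cascade with a Chow-style confidence rule can be sketched as follows (the helper names are ours, not the paper's implementation):

```python
import numpy as np

def chow_defer(q_probs: np.ndarray, alpha: float) -> bool:
    """Chow-style deferral: defer to the large model when the small
    model's confidence (its top-token probability) is low."""
    return 1.0 - float(np.max(q_probs)) > alpha

def cascade_next_token(q_probs: np.ndarray, p_probs: np.ndarray,
                       alpha: float, rng: np.random.Generator) -> int:
    """Two-model cascade step: sample from the small model's
    distribution q unless the deferral rule fires, in which case
    sample from the large model's distribution p."""
    dist = p_probs if chow_defer(q_probs, alpha) else q_probs
    return int(rng.choice(len(dist), p=dist))
```

In this sequential form, the larger model is only invoked when the rule fires, one decision at a time; the speculative variants studied in the paper instead evaluate such rules on drafted tokens in parallel.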
By contrast, speculative decoding is theoretically guaranteed to match the output distribution (or a close approximation thereof (Tran-Thien, 2023)), and is practically observed to provide impressive speed-ups (Stern et al., 2018; Chen et al., 2023a; Leviathan et al., 2023; Sun et al., 2024). Given their complementary nature, a natural question arises: can we leverage the best of both techniques?

Work done while at Google.

Figure 1: Speculative cascade inference between a small and a large LM via a deferral rule.

In this paper, we do so by designing new techniques for two-model cascades that implement their deferral rule in a speculative manner: we have the smaller model generate drafts auto-regressively, and the larger model execute in parallel on the drafts to decide whether or not to defer on them. We show that this speculative cascading approach yields better cost-quality trade-offs than both standard cascades and speculative decoding. In detail, we make the following contributions:

(i) We introduce a general recipe for speculative execution, where we seek to mimic a general target distribution that interleaves the drafter's and verifier's distributions. Lossy speculative sampling (Tran-Thien, 2023) is a special case of this recipe for a particular target distribution (§4.1).

(ii) We show how common cascading rules, such as Chow's rule (Chow, 1970) and confidence-difference thresholding (Jitkrittum et al., 2023), can be implemented speculatively by plugging in their target distribution into our framework. We refer to these as speculative cascades (§4.2).

(iii) We characterize the theoretically optimal deferral rule for a speculative cascade, and design a speculative cascading technique that implements a plug-in estimate to the optimal rule (§4.3, Lemma 4, Table 1). We also present token-specific variants of our deferral rules (§5).
(iv) Through experiments with Gemma (Team et al., 2024) and T5 models (Raffel et al., 2020) on a range of benchmark language tasks including summarization, translation, reasoning, coding and QA, we show that speculative cascades are able to provide better cost-quality trade-offs than their sequential cascade and speculative decoding counterparts (§6).

Overall, we aim to develop a principled approach to trade-off quality and inference costs by interleaving two models of different sizes, with promising empirical results. We hope to inspire future research adapting the proposed ideas with ingredients underpinning the state-of-the-art in speculative decoding (Cai et al., 2024; Li et al., 2024a;b; Chen et al., 2024).

2 A TALE OF TWO EFFICIENT LM INFERENCE STRATEGIES

Let V denote a finite vocabulary of tokens, with V∗ denoting the set of all finite-length sequences generated by this vocabulary. Let ∆V denote the set of all probability distributions over tokens in V. Given an arbitrary length sequence x = x1 x2 · · · xL ∈ V∗ and index i ≤ L, we denote the prefix x<i = x1 x2 · · · xi−1.

Table 1: Deferral rules and target distributions associated with different inference strategies, where α is a free parameter, δ denotes the deferral decision, and DTV denotes the total variation distance.

| Method | Deferral rule δ(q, p) | Target distribution | Execution |
|---|---|---|---|
| Cascade [Chow] (Chow, 1970) | 1(maxv q(v) < 1 − α) | (1 − δ) q(u) + δ p(u) | Sequential |
| Oracle [Diff] (Jitkrittum et al., 2023) | 1(maxv q(v) < maxv p(v) − α) | (1 − δ) q(u) + δ p(u) | Oracle |
| Spec Cascade [Chow] | 1(maxv q(v) < 1 − α) | (1 − δ) q(u) + δ p(u) | Speculative |
| Spec Cascade [Diff] | 1(maxv q(v) < maxv p(v) − α) | (1 − δ) q(u) + δ p(u) | Speculative |
| Spec Cascade [OPT] | 1(maxv q(v) < maxv p(v) − α DTV(p, q)) | (1 − δ) q(u) + δ p(u) | Speculative |

Speculative decoding is an alternate strategy that applies token-level interleaving between q and p, seeking to provably match the larger model's quality at a reduced inference cost (Stern et al., 2018; Leviathan et al., 2023). Given a prefix x<t, the ideal deferral decision depends on expectations under the true distribution P(· | x<t), which is not available during inference time. A common approach in the cascades literature is to replace the expected losses with the models' confidence estimates (Jitkrittum et al., 2023).
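The basic speculative sampling step that the strategies above build on can be checked numerically: accepting a draft token v ∼ q with probability min(1, p(v)/q(v)), and resampling rejections from the residual norm(max(0, p − q)), reproduces the verifier distribution p exactly. A small illustrative sketch (not the paper's code; function names are ours):

```python
import numpy as np

def speculative_marginal(q: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Exact marginal distribution of one speculative-sampling step:
    draw v ~ q, accept with prob. min(1, p(v)/q(v)); on rejection,
    resample from the residual norm(max(0, p - q))."""
    # Acceptance probability per token (tokens with q(v)=0 are never drafted).
    accept = np.minimum(1.0, np.divide(p, q, out=np.ones_like(p), where=q > 0))
    reject_mass = float(np.sum(q * (1.0 - accept)))  # = sum_v max(0, q(v) - p(v))
    residual = np.maximum(0.0, p - q)
    if residual.sum() > 0:
        residual = residual / residual.sum()
    return q * accept + reject_mass * residual
```

Summing the accepted mass min(q, p) and the redistributed rejected mass max(0, p − q) recovers p term by term, which is the quality-neutrality guarantee the paper refers to.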
For example, when ℓ = ℓ0-1, it may be reasonable to use 1 − maxv qt(v) as an estimate of the expected 0-1 loss under P(· | x<t). Equivalently, one may minimize an unconstrained objective similar to (3), for a suitable cost parameter α > 0 (see §D.5): Lspec(r; x<t) = …

A.2 PROOF OF LEMMA 2

Proof. Case (i): q(x) ≥ (1/(1 − α)) p(x) ≥ (1/β) p(x). In this case, π(x) = (1/(1 − α)) p(x). As a result:

κπ(x) = min{1, π(x)/q(x)} = min{1, p(x) / ((1 − α) q(x))},
pπres(x) = norm(max{0, (1/(1 − α)) p(x) − q(x)})(x) = 0 = norm(max{0, (1/β) p(x) − q(x)})(x) = pres(x).

Case (ii): (1/(1 − α)) p(x) ≥ (1/β) p(x) > q(x). In this case, π(x) = (1/β) p(x). As a result:

κπ(x) = min{1, p(x) / (β q(x))} = 1 = min{1, p(x) / ((1 − α) q(x))},
pπres(x) = norm(max{0, (1/β) p(x) − q(x)})(x) = pres(x).

Case (iii): (1/(1 − α)) p(x) ≥ q(x) ≥ (1/β) p(x). In this case, π(x) = q(x). As a result:

κπ(x) = 1 = min{1, p(x) / ((1 − α) q(x))},
pπres(x) = 0 = norm(max{0, (1/β) p(x) − q(x)})(x) = pres(x).

In all three cases, the acceptance probabilities and residual distributions are identical.

A.3 PROOF OF LEMMA 3

Proof. Under a target distribution πt, the probability of a draft token drawn from qt being rejected is given by (Leviathan et al., 2023):

rejection probability = Σv∈V qt(v) (1 − min{1, πt(v)/qt(v)})
= 1 − Σv∈V min{qt(v), πt(v)}
= Σv∈V πt(v) − Σv∈V min{qt(v), πt(v)}
= Σv∈V max{0, πt(v) − qt(v)}.

Expanding πt = (1 − r(x<t)) qt + r(x<t) pt, the rejection probability becomes:

rejection probability = Σv∈V max{0, (1 − r(x<t)) qt(v) + r(x<t) pt(v) − qt(v)}
= r(x<t) Σv∈V max{0, pt(v) − qt(v)}
= r(x<t) · DTV(pt, qt).
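The chain of equalities in this proof can be verified numerically; the following is an illustrative sketch (function names are ours):

```python
import numpy as np

def rejection_probability_direct(q: np.ndarray, pi: np.ndarray) -> float:
    """P(draft v ~ q is rejected) = sum_v q(v) (1 - min(1, pi(v)/q(v)))."""
    accept = np.minimum(1.0, np.divide(pi, q, out=np.ones_like(pi), where=q > 0))
    return float(np.sum(q * (1.0 - accept)))

def rejection_probability_tv_form(q: np.ndarray, pi: np.ndarray) -> float:
    """Equivalent closed form from the proof: sum_v max(0, pi(v) - q(v))."""
    return float(np.maximum(0.0, pi - q).sum())
```

For the cascade target pi = (1 − r) q + r p, both forms reduce to r · DTV(p, q), which is why the OPT deferral rule discounts the deferral decision by the total-variation distance.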
When (i) holds, r̂OPT(x<t) …

| Method | Deferral rule δ(q, p) | Target distribution | Execution |
|---|---|---|---|
| Cascade [Chow] (Chow, 1970) | 1(maxv∈V q(v) < 1 − α) | (1 − δ) q(u) + δ p(u) | Sequential |
| Cascade [Chow Log] | 1(entropy(q) > α) | (1 − δ) q(u) + δ p(u) | Sequential |
| Oracle [Diff] (Jitkrittum et al., 2023) | 1(maxv∈V q(v) < maxv∈V p(v) − α) | (1 − δ) q(u) + δ p(u) | Oracle |
| Oracle [Diff Log] | 1(entropy(p) < entropy(q) − α) | (1 − δ) q(u) + δ p(u) | Oracle |
| Spec Cascade [Chow] | 1(maxv∈V q(v) < 1 − α) | (1 − δ) q(u) + δ p(u) | Speculative |
| Spec Cascade [Chow Log] | 1(entropy(q) > α) | (1 − δ) q(u) + δ p(u) | Speculative |
| Spec Cascade [Diff01] | 1(maxv∈V q(v) < maxv∈V p(v) − α) | (1 − δ) q(u) + δ p(u) | Speculative |
| Spec Cascade [Diff Log] | 1(entropy(p) < entropy(q) − α) | (1 − δ) q(u) + δ p(u) | Speculative |
| Spec Cascade [OPT01] | 1(maxv∈V q(v) < maxv∈V p(v) − α DTV(p, q)) | (1 − δ) q(u) + δ p(u) | Speculative |
| Spec Cascade [OPTLog] | 1(entropy(p) < entropy(q) − α DTV(p, q)) | (1 − δ) q(u) + δ p(u) | Speculative |

Table 3: Target distributions associated with different inference algorithms, where α is a free parameter and β ≥ 1 − α is a parameter dependent on q, p and α. The last column indicates whether the execution is sequential (Algorithm 2), via an oracle (Algorithm 3), or speculative (Algorithm 5) (Leviathan et al., 2023). The third row presents a variant of the BiLD algorithm of Kim et al. (2023), where D(q, p) is a measure of discrepancy between q and p; the original algorithm differs from (Leviathan et al., 2023) in the use of a deterministic speculative decoding procedure with a dynamic draft window (see Appendix B).
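The entropy-based (Log) variants of the deferral rules in Table 3 are straightforward to compute from the token distributions; a small sketch (helper names are ours):

```python
import numpy as np

def entropy(dist: np.ndarray) -> float:
    """Shannon entropy, -sum_v q(v) log q(v), ignoring zero entries."""
    nz = dist[dist > 0]
    return float(-np.sum(nz * np.log(nz)))

def defer_chow_log(q: np.ndarray, alpha: float) -> bool:
    """Chow-Log rule: defer when the drafter is uncertain."""
    return entropy(q) > alpha

def defer_diff_log(q: np.ndarray, p: np.ndarray, alpha: float) -> bool:
    """Diff-Log rule: defer when the verifier is more confident
    (lower entropy) than the drafter by a margin alpha."""
    return entropy(p) < entropy(q) - alpha
```

A uniform drafter distribution has maximal entropy and triggers deferral, while a peaked one does not, mirroring the 0-1 variants that compare top-token probabilities.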
B FURTHER RELATED WORK

Several works have studied improving the drafting process in speculative decoding; these include having the drafter and verifier share their backbone (Stern et al., 2018; Kim et al., 2024; Cai et al., 2024; Monea et al., 2023; Hooper et al., 2023; Zhang et al., 2023; Elhoushi et al., 2024; Liu et al., 2024), using multiple small draft models (Chen et al., 2023c; Wang et al., 2024), using tree-structured draft batches (Spector & Re, 2023; Miao et al., 2024), distilling the drafter with the verifier (Zhou et al., 2024), and leveraging multiple sampled drafts (Sun et al., 2024; Chen et al., 2024).

The work that is most closely related to our specific proposal is the Big Little Decoder (BiLD) (Kim et al., 2023), which can be seen as another lossy variant of speculative decoding (Leviathan et al., 2023; Tran-Thien, 2023; Zhou et al., 2024). BiLD has two phases: a fallback phase, during which the drafter q is run auto-regressively until its maximum predicted probability is sufficiently low; and a rollback phase, during which the verifier p is run in parallel on the prefixes generated by q and rolls back to the point where D(q, p) > α, for a metric D that measures discrepancy and a threshold α. The fallback phase implements Chow's deferral rule in (2), and allows the draft window size to vary dynamically based on an estimate of how likely the draft tokens will be accepted; the rollback phase can be seen as a deterministic variant of the rejection sampling algorithm of Leviathan et al. (2023).

An advantage of BiLD over the rejection sampling algorithm in Leviathan et al. (2023) is the use of Chow's rule to vary the draft window size. However, the final target distribution it seeks to mimic,

TBiLD(q, p)(v) = 1(D(q, p) ≤ α) q(v) + 1(D(q, p) > α) p(v),

is an approximation to p; specifically, the target distribution π = TBiLD(q, p) is chosen to satisfy D(π, p) ≤ α.
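To illustrate the all-or-nothing nature of this target, here is a sketch of T_BiLD (for concreteness we plug in total-variation distance for D; BiLD's own discrepancy metric differs, and the helper names are ours):

```python
import numpy as np

def tv_distance(q: np.ndarray, p: np.ndarray) -> float:
    """Total variation distance, 0.5 * ||q - p||_1."""
    return 0.5 * float(np.abs(q - p).sum())

def bild_target(q: np.ndarray, p: np.ndarray, alpha: float,
                discrepancy=tv_distance) -> np.ndarray:
    """T_BiLD(q, p) = q if D(q, p) <= alpha else p: one model's
    distribution is chosen wholesale, in contrast to the blended
    cascade targets in Table 1."""
    return q if discrepancy(q, p) <= alpha else p
```

Because the choice is binary, whenever q deviates substantially from p the target collapses to p, regardless of which model would actually incur lower loss on the prefix.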
Hence, in cases where q deviates substantially from p, BiLD would choose p as the target distribution, even when q offers better quality on a prefix (where quality can be measured using a suitable loss function). In contrast, our proposed approach in §4 uses speculative decoding to approximate target distributions that seek to optimally cascade between q and p. In our experiments, we compare the efficacy of using TBiLD as the target distribution with the target distributions we propose in this paper (see Table 3).

C CONTRASTING SPECULATIVE CASCADES AND LOSSY SPECULATIVE SAMPLING UNDER DIFFERENT SAMPLING SCHEMES

We contrast how speculative cascades and lossy speculative sampling behave under temperature sampling, top-P sampling and greedy decoding.

C.1 SPECULATIVE CASCADES UNDER TEMPERATURE SAMPLING AND TOP-P SAMPLING

When implementing speculative cascades with temperature sampling and top-P sampling, we compute the deferral rule on the original distributions p and q, but use the deferral decisions to interleave the temperature-scaled (or top-P truncated) versions of p and q. For the cascaded deferral rules in Table 1, with the exception of OPT, we construct the target distribution in (6) as follows:

πt(v) = (1 − δ(qt, pt)) S(qt)(v) + δ(qt, pt) S(pt)(v), (23)

where S : ∆V → ∆V denotes a transformation of the distribution such as temperature scaling or top-P truncation, and δ : ∆V × ∆V → {0, 1} denotes the deferral rule. One may run Algorithm 5 with πt as the target distribution, and S(qt) and S(pt) as the drafter and verifier distributions. In the case of the OPT rule, we would formulate the constrained problem in (7) to use the TV distance between the distributions S(qt) and S(pt) to measure the rejection rate. The optimal deferral rule in Lemma 4 would now use DTV(S(pt), S(qt)) instead of DTV(pt, qt).
To construct a plug-in estimator to this optimal rule, we still prescribe using the unscaled probabilities qt and pt to estimate the expected loss, giving us, for ℓ = ℓ0-1:

δ(pt, qt) = 1(maxv qt(v) < maxv pt(v) − α DTV(S(pt), S(qt))).

For the token-specific deferral rules in §5, we compute the target distribution in (16) using the transformed distributions S(qt) and S(pt): πToken(v) = S(qt)(v) (1 − r(x<t, v)) + …

Table 4: Acceptance criteria for different speculative inference strategies under temperature sampling (T > 0), and what they reduce to under greedy decoding (T = 0).

| Method | Acceptance criterion (T > 0) | Acceptance criterion (T = 0) |
|---|---|---|
| Spec Decode [Lossy] (Leviathan et al., 2023) | min{1, S(p)(v) / ((1 − α) S(q)(v))} | 1(v = arg maxv′ p(v′)) |
| Spec Decode [Lossy, Greedy] (Leviathan et al., 2023) | – | p(v) ≥ (1 − α) maxv′ p(v′) |
| Spec Cascade [Token V3] (this paper) | min{1, S(π)(v) / S(q)(v)}, where π is in (16) | p(v) ≥ (1 − α) maxv′ p(v′) |

C.4 LOSSY SPECULATIVE GREEDY DECODING VARIANT BY LEVIATHAN ET AL. (2023)

For the special case of greedy decoding, Leviathan et al. (2023) propose an alternate lossy variant of speculative decoding (Appendix A.5 in their paper), where a draft token v is accepted deterministically whenever pt(v) ≥ (1 − α) maxv′ pt(v′); when the token is rejected, it is replaced with a new token sampled from pt(·). We will refer to this variant as Spec Decode [Lossy, Greedy]. We now show that the proposed speculative cascade with the Token V3 deferral rule (15) is identical to Spec Decode [Lossy, Greedy] when sampling with temperature 0.

Lemma 7. For any fixed trade-off parameter α ∈ [0, 1], Spec Cascade [Token V3] is identical to Spec Decode [Lossy, Greedy] when sampling with temperature 0.

Proof. Let S0(p) denote a temperature-scaled one-hot version of distribution p which places all its mass on the mode of p. Let p̃t = S0(pt) and q̃t = S0(qt). With the Token V3 rule, the acceptance criterion is computed against the target distribution in (16) with trade-off parameter α:

πt(v) = q̃t(v) 1(v ∈ Tα) + p̃t(v) Σv′∉Tα q̃t(v′),

where Tα = {v ∈ V : pt(v) ≥ (1 − α) maxv′ pt(v′)} is the set of top-ranked tokens by the original (unscaled) distribution pt(·). Under greedy decoding, the draft token is given by v′ = arg maxv q̃t(v). We consider two cases: (i) v′ ∈ Tα and (ii) v′ ∉ Tα.
In the first case, we have πt(v′) = q̃t(v′). As a result, the draft token v′ is accepted with probability:

min{1, πt(v′) / q̃t(v′)} = min{1, q̃t(v′) / q̃t(v′)} = 1.

In the second case, it is clear that the draft token v′ is not the maximizer of pt(·). Furthermore, πt(v′) = p̃t(v′). As a result, the draft token v′ is rejected, since the acceptance probability for the token becomes:

min{1, πt(v′) / q̃t(v′)} = min{1, p̃t(v′) / q̃t(v′)} = min{1, 0} = 0.

It is then replaced with a token sampled from:

norm(max{0, πt(·) − q̃t(·)}) = norm(max{0, p̃t(·) − q̃t(·)}) = norm(p̃t(·)) = p̃t(·),

which would produce the token maximizing pt(·). In both cases, the sampling procedure is identical to that of Spec Decode [Lossy, Greedy].

Table 4 summarizes the acceptance criteria for different speculative inference strategies under temperature sampling and what they reduce to under greedy decoding.

D OPTIMAL DEFERRAL: ADDITIONAL DISCUSSION

We provide additional discussion for the deferral rules derived in §3 and §4.

D.1 DERIVATION OF CHOW'S RULE

We show below that Chow's rule is a plug-in estimator to the optimal solution to the following objective:

Lrej(r; x<t) = … (25)

If ℓ = ℓ0-1, one may employ a plug-in estimator to (25) by replacing the expected 0-1 loss on qt with 1 − maxv∈V qt(v), giving us

r̂Chow(x<t) = 1(1 − maxv∈V qt(v) > α). (26)

Similarly, for ℓ = ℓlog, the plug-in rule is 1(entropy(qt) > α), where entropy(q) = −Σv∈V q(v) log(q(v)).

D.2 OPTIMAL SEQUENTIAL DEFERRAL WHEN ℓ = ℓlog

Recall that the optimal deferral rule for a sequential cascade in Lemma 1 compares the expected losses of qt and pt under Ext∼P(·|x<t). For ℓ = ℓ0-1 and α > 0, the Diff and OPT deferral rules are computed as:

r̂Diff(x<t) = 1(maxv qt(v) < maxv pt(v) − α),
r̂OPT(x<t) = 1(maxv qt(v) < maxv pt(v) − α DTV(pt, qt)).

Since r is a binary variable, we may formulate an equivalent unconstrained problem with the same minimizer:

min over r ∈ {0, 1} of (1 − r) c0 + r c1 + α r c2,

where we choose α = 0 when c2 ≤ B, and choose α > (1/c2)(c0 − c1) otherwise. This unconstrained optimization problem is of the form in (8).
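The Diff and OPT plug-in rules discussed above reduce to simple comparisons of model confidences; a sketch (helper names are ours):

```python
import numpy as np

def tv_distance(p: np.ndarray, q: np.ndarray) -> float:
    """D_TV(p, q) = sum_v max(0, p(v) - q(v)) = 0.5 * ||p - q||_1."""
    return float(np.maximum(0.0, p - q).sum())

def defer_diff(q: np.ndarray, p: np.ndarray, alpha: float) -> bool:
    """Diff rule: defer when the drafter's top-token confidence trails
    the verifier's by more than alpha."""
    return bool(np.max(q) < np.max(p) - alpha)

def defer_opt(q: np.ndarray, p: np.ndarray, alpha: float) -> bool:
    """OPT rule: like Diff, but the margin is scaled by D_TV(p, q),
    the rejection cost a deferral would incur under speculative execution."""
    return bool(np.max(q) < np.max(p) - alpha * tv_distance(p, q))
```

When the two distributions are close (small D_TV), the OPT rule defers more readily than Diff, since the speculative rejection penalty for doing so is small.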
E TOKEN-SPECIFIC SPECULATIVE CASCADE

We provide a modification of Algorithm 5 to accommodate the token-specific deferral rules in §5.

Algorithm 6 (Token Spec Cascade). Input: models q, p, token-specific deferral rule r, prefix x<t, … where B > 0 is a budget parameter. However, unlike §4.3, the above constrained optimization problem does not directly lend itself to a simple closed-form solution. In some highly simplistic special cases, we may be able to derive a trivial solution. For example, suppose ℓ = ℓ0-1, and the mode of qt coincides with that of P(· | x<t) …

In our implementation, we adopt the speculative sampling algorithm from Leviathan et al. (2023): we do not have the fallback policy, and replace the deterministic rollback policy with the rejection sampling in Algorithm 4. Figure 9 (top) compares the original version of BiLD with the version we use in §6. We interleave between a T5-small and a T5-large model on WMT, using greedy decoding (T = 0) for inference. As prescribed by the authors (Kim et al., 2023), we use the following discrepancy metric for greedy decoding:

D(q, p) = −log p(arg maxv∈V q(v)).

We compare our implementation (BiLD∗), where we set the block size to 5 (same as our proposed speculative cascading approaches), with the original BiLD for different choices of maximum block size γ and different fallback thresholds αf. For both methods, we vary the threshold α on D(q, p) to vary the latency, and plot the resulting BLEU score. A higher fallback threshold αf results in larger draft generation windows; this gives an advantage in the low-latency regime, where most of the draft tokens are accepted. As a result, BiLD [γ = 10, αf = 0.9] yields the lowest latencies, but also yields lower quality. A low fallback threshold results in very small draft generation windows, and consequently, in higher latencies. This is why BiLD [γ = 5, αf = 0.1] is the slowest but yields high quality metrics.
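The greedy-decoding discrepancy metric and rollback point just described can be sketched as follows (helper names are ours; the original BiLD implementation differs in its details):

```python
import numpy as np

def bild_discrepancy(q: np.ndarray, p: np.ndarray) -> float:
    """D(q, p) = -log p(argmax_v q(v)): how implausible the drafter's
    greedy token looks under the verifier."""
    return float(-np.log(p[int(np.argmax(q))]))

def rollback_point(q_steps, p_steps, alpha: float) -> int:
    """Index of the first drafted position where D(q, p) > alpha,
    i.e. where the rollback phase rejects; len(q_steps) if no position
    is rejected."""
    for t, (q, p) in enumerate(zip(q_steps, p_steps)):
        if bild_discrepancy(q, p) > alpha:
            return t
    return len(q_steps)
```

Raising the threshold α makes the verifier more permissive, so more drafted tokens survive the rollback phase at the price of drifting further from p.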
Our implementation BiLD∗ is seen to perform comparably to the best parameter choices for the original BiLD algorithm in Figure 9.

Figure 9: Top: Plots of quality vs. latency on WMT (small → large, T = 0), comparing BiLD∗ with the original BiLD algorithm in Kim et al. (2023) with varying maximum draft window size γ and fallback confidence threshold αf. Bottom: Comparison of lossy speculative decoding with β = 1 [Lossy] and β tuned using the procedure in Tran-Thien (2023) [Lossy∗], plotting quality against the fraction of calls to the larger model on WMT and XSum (small → large, T = 1).

Figure 10: Plots of quality vs. latency for T5 models on WMT, XSum and CNNDM (small → large, T = 1) with all three token-specific speculative cascade deferral rules in equations (13)–(15). Each method interleaves a T5-small and a T5-large model. The x-axis tracks the latency relative to that of calling the large model on all inputs. The horizontal dotted line denotes the quality of the large model.
It is worth noting that while we view TBiLD as the target distribution that the algorithm in Kim et al. (2023) seeks to mimic, the presence of the fallback phase could mean that on some inputs an output response is generated without the verification (or rollback) phase being invoked. In such cases, the output response will come solely from the drafter, even if it turns out to contain tokens for which D(qt, pt) > α.

Figure 11: Plots of quality vs. rejection rate for Gemma models with all three token-specific speculative cascade deferral rules in equations (13)–(15). Each method interleaves a Gemma 2B drafter with a Gemma 27B verifier. The horizontal dotted line denotes the quality of the large model. The panels cover WMT (5-shot), CNNDM (5-shot), GSM8K (8-shot, exact match), SQuAD (1-shot), WebQ (1-shot), Natural QA (1-shot), MBPP (3-shot, [IT]) and TriviaQA (1-shot).

F.6 LOSSY SPECULATIVE DECODING VARIANTS

In our experiments in the main text (§6), we compared against the lossy speculative decoding variant (Tran-Thien, 2023; Zhou et al., 2024) described in §2, with the parameter β set to 1. We now present results for this method with β tuned according to the procedure in Tran-Thien (2023), and show that choosing β = 1 fares at least as well as tuning β. The goal in Tran-Thien (2023) is to choose α and β so as to maximize the acceptance rate for the draft token, while ensuring that the KL divergence between the resulting target distribution and p is within an allowable limit R. The authors prescribe specifying R, and for each prefix, tuning α and β to solve the resulting constrained optimization problem.
To be consistent with the rest of our experimental setup, we vary α to vary the draft acceptance rate (note that each choice of α corresponds to a particular KL divergence to p), and tune β ≥ 1 − α to satisfy the following condition outlined in Tran-Thien (2023):

Σv∈V max{0, q(v) − p(v)/(1 − α)} = Σv∈V max{0, p(v)/β − q(v)}.

We pick β using a grid-search over 1000 values between α and 10.

Figure 12: Plots of quality vs. rejection rate with Gemma 2B → 9B speculative cascades. Each method interleaves a Gemma 2B drafter with a Gemma 9B verifier. The horizontal dotted line denotes the quality of the large model. We include all three token-specific speculative cascade deferral rules in equations (13)–(15). The panels cover WMT (5-shot), GSM8K (8-shot, exact match), SQuAD 2.0 (1-shot), WebQuestions (1-shot), Natural QA (1-shot) and TriviaQA (1-shot).
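The grid-search just described can be sketched as follows; note that the balance condition encoded here (matching the draft mass rejected under acceptance probability min(1, p/((1 − α) q)) to the residual mass of norm(max(0, p/β − q))) is our reading of the tuning procedure in Tran-Thien (2023), and the helper name is ours:

```python
import numpy as np

def tune_beta(q: np.ndarray, p: np.ndarray, alpha: float,
              grid_size: int = 1000) -> float:
    """Grid-search beta in [alpha, 10] so that the residual mass
    sum_v max(0, p(v)/beta - q(v)) best matches the rejected draft
    mass sum_v max(0, q(v) - p(v)/(1 - alpha))."""
    target = np.maximum(0.0, q - p / (1.0 - alpha)).sum()
    betas = np.linspace(alpha, 10.0, grid_size)
    gaps = [abs(np.maximum(0.0, p / b - q).sum() - target) for b in betas]
    return float(betas[int(np.argmin(gaps))])
```

Since the residual mass decreases monotonically in β, the search amounts to locating the crossing point on the grid.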
Since this tuning procedure can, in turn, add to the method's latency, for a fair comparison we plot quality as a function of the fraction of calls to the large model (the rejection rate), instead of relative latency. In Figure 9 (bottom), we plot these trade-off curves for lossy speculative decoding with β = 1 [Lossy] and for speculative decoding with β tuned using the above procedure [Lossy∗]. We compare performances on WMT and XSum, and in each case interleave a T5-small model with a T5-large model. In both cases, setting β = 1 provides trade-offs comparable to or better than using a tuned value of β. The reason a tuned value of β fares worse than setting β = 1 might be that we measure quality in terms of BLEU or ROUGE-2, which is different from the KL-divergence-to-p objective that the tuning procedure in Tran-Thien (2023) seeks to optimize.

F.7 TOKEN-SPECIFIC DEFERRAL RULE VARIANTS

In Figure 10, we present latency-quality trade-off plots for cascades constructed from a T5-small and a T5-large model. We include in these comparisons all three token-specific deferral rules in (13)–(15). In Figure 11, we present trade-off plots for cascades constructed from Gemma 2B and Gemma 27B models with all three token-specific rules, and in Figure 12, we include similar plots for cascades constructed from Gemma 2B and Gemma 9B models. We note that the trends with the 2B → 9B cascades are similar to those seen with the 2B → 27B cascades. With the T5 models, the results are mixed, with the V1 and V2 variants sometimes surpassing the V3 variant (which is the variant we included in the main experimental results in §6). Interestingly, with the Gemma models, the V3 variant is seen to outperform the others for most rejection rates, with the exception of the 2B → 27B cascade on SQuAD 2.0, where the V2 variant is better.
The reason for the V3 variant outperforming the V1 and V2 variants on the Gemma models could be that it uses the larger model's distribution pt(·) to measure confidence for both the drafter and verifier tokens (see the LHS and RHS in (13)). We expect this to be particularly helpful when there is a larger gap in sizes between q and p, and the larger model's distribution is better aligned with the data-generating distribution than the smaller model's. Furthermore, as per the discussion in §5, the multiplicative form of the rule (15) results in a target distribution with an intuitive form: it seeks to mimic qt(·) on the top-α ranked tokens by pt(·), and uses a re-scaled version of pt(·) for the other tokens:

πToken V3(v) = qt(v) 1(v ∈ Tα) + pt(v) Σv′∉Tα qt(v′), where Tα = {v ∈ V : pt(v) ≥ (1 − α) maxv′ pt(v′)}.

G LIMITATIONS

One of the limitations of our proposal is the use of plug-in estimators to approximate the optimal rule (9). While these approximations are effective in practice, they rely on the individual models being calibrated. An alternative to the use of plug-in estimators is to use a router model explicitly trained to mimic the optimal rule using a validation sample drawn from P (Gupta et al., 2024). Another limitation is that the optimization objectives we seek to minimize are local objectives that seek to make the best deferral decision at the current position t. In doing so, they ignore the downstream effects of choosing a particular model in the current step. Devising a global deferral objective that takes downstream errors into account would be an interesting direction for future work. More broadly, our paper seeks to improve cost-quality trade-offs in LM inference. It is important that such improvements do not unfairly advantage one slice of the data or a subset of the population at the cost of others.
Ensuring that the trade-off gains our approach offers are equitable across different slices of the data is another important direction for future work.