# Prompt-based Depth Pruning of Large Language Models

Juyun Wee*1, Minjae Park*1, Jaeho Lee1

Abstract. Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent: a block that is crucial for one task can be removed without degrading the accuracy on another task. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models, and achieves better on-task performance than static depth pruning baselines. Project page: jwee01.github.io/PuDDing. Code: github.com/tada0347/PuDDing

1. Introduction

Recent advances in large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks (Brown et al., 2020; Touvron et al., 2023; Dubey et al., 2024). However, the significant computational requirements of LLMs pose challenges in resource-constrained environments, limiting their practicality. For example, LLaMA-3.3-70B needs 140GB of RAM to be loaded in bf16, which is often too big for memory-constrained local devices. Thus, reducing the model size is essential to make LLMs feasible for on-device applications.

*Equal contribution. 1POSTECH. Correspondence to: Jaeho Lee.

Depth pruning is a versatile model compression technique
that is particularly effective for on-device scenarios (Song et al., 2024; Kim et al., 2024). Such methods simply remove several transformer blocks (which we call the omission set) from the pretrained model, based on some measure of block importance computed using a small number of calibration samples. As everything is identical except for the number of blocks, the pruned model is suitable for deployment on any hardware, without tailored support for low precision (e.g., integer cores) or fine-grained sparsity (e.g., 2:4 sparsity). Furthermore, as there is no extensive training involved, depth pruning can easily be done in a device-by-device manner for deployment on various devices.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1. The general framework of prompt-based depth pruning. Given some query from the user, the goal is to identify which layers of an LLM can be omitted, so that one can make accurate predictions on low-memory consumer devices.

A key limitation of typical depth pruning algorithms is that their pruning decision is static, i.e., the same omission set is removed regardless of the query given to the model. While this choice allows one to save storage (e.g., flash drives) by discarding the pruned parameters at the local device, it sacrifices the ability to adapt to various downstream tasks. Indeed, our empirical observations show that pruning some transformer blocks in an LLM may incur significant accuracy degradation on certain tasks, while being highly unnecessary for other tasks (see Section 3).

Can we make dynamic depth pruning decisions to improve the performance on various tasks? This question has not been well studied yet, especially in the context of on-device inference.
A recent line of work develops effective dynamic token routing mechanisms to save training/inference computation by processing each token with a limited number of transformer blocks (Raposo et al., 2024; Wang et al., 2024). However, such methods require all parameters to be loaded in high-speed memory (e.g., on-GPU memory); thus, these methods are appropriate for large-scale server clusters, not for on-device inference with memory constraints.

Contribution. To overcome these limitations, we develop a new prompt-based depth pruning approach (Section 4): In the pre-fill stage, based on the prompt given by the user, a limited number of transformer blocks are selected and loaded into the on-device RAM from the storage drive. This approach requires neither a large memory to hold all parameters nor highly repeated per-token routing, and thus can effectively accelerate inference on low-memory devices.

A naïve way to achieve this goal might be to conduct conventional static depth pruning at each inference, using the given prompt as calibration samples. However, this approach incurs a large latency from running a static pruning algorithm at every inference. Furthermore, such a method is likely to fail to make a good pruning decision due to the shortage of calibration data, especially in the single-batch inference cases common in on-device scenarios.

To this end, we propose a training-based method for the prompt-based depth pruning of large language models (Section 5). Our method, coined Prompt-routed Dynamic Depth Pruning (PuDDing), works in two steps.

1. Candidate omission set generation. We construct a small yet diverse and performant family of omission sets. This is done by drawing multiple splits of calibration data from various task datasets, and then finding an omission set which achieves low loss on each split; here, we use a newly developed task-centric loss instead of perplexity.

2. Router training.
We train a lightweight router which predicts the appropriate omission set from the given prompt. This is done by generating a training dataset consisting of prompt-loss pairs for each omission set, and training the model to predict the loss from the prompt; routing can then be done by choosing the minimum-loss option.

Empirically, we find that the proposed PuDDing enjoys a clear advantage over static depth pruning algorithms, achieving a more than 4%p accuracy increase on zero-shot commonsense reasoning tasks (Section 6). At the same time, as the algorithm invokes the router only once per prompt, PuDDing enjoys an over 1.2× generation speedup over the dense model, similar to static depth pruning algorithms.

Our key contributions can be summarized as follows:

- Our observations reveal that optimal depth pruning decisions may highly depend on the task at hand, underscoring the need for task-dependent depth pruning.
- We consider the task of prompt-based depth pruning for the first time (to our knowledge), and propose a training-based strategy as a solution.
- Compared with static depth pruning algorithms, our algorithm achieves much higher zero-shot accuracies on various tasks, while being competitive in terms of computational efficiency.

Table 1. A high-level comparison of the proposed prompt-based depth pruning framework with related depth pruning approaches: static depth pruning and dynamic token routing.

| Approach | Task Adaptive | Peak Memory | Routing |
|---|---|---|---|
| Static pruning (Song et al., 2024; Kim et al., 2024) | ✗ | Sparse | - |
| Token routing (Raposo et al., 2024; Wang et al., 2024) | ✓ | Dense | Per token |
| Prompt-based depth pruning (this paper) | ✓ | Sparse | Per prompt |

2. Related Work

In this section, we provide an in-depth comparison of the proposed framework against existing depth and width sparsity frameworks. See Table 1 for a concise summary.

2.1.
Static Depth Pruning

Static depth pruning methods select and remove unnecessary blocks from a pretrained LLM using various proxy metrics to measure the importance of the blocks. ShortGPT (Men et al., 2024) measures the block importance using the expected cosine similarity between the input and output activations of the block; a block that does not change the direction of the activation is deemed unnecessary. Shortened LLaMA (Kim et al., 2024) directly measures the perplexity drop after removing each transformer block, and SLEB (Song et al., 2024) combines this idea with iterative pruning.

Several recent works also focus on layer-level depth pruning, instead of removing an entire transformer block. In particular, Siddiqui et al. (2024) and He et al. (2024) discover that pruning out self-attention layers has a much less significant impact than removing the feed-forward layers.

Unlike these works, this paper aims to perform dynamic depth pruning using the prompts for the downstream tasks; to account for this difference, we design and use new likelihood-based metrics to measure the block importance.

2.2. Dynamic Token Routing

Inspired by the success of mixture-of-experts (Jacobs et al., 1991; Fedus et al., 2022), several recent works have developed mechanisms to route tokens through only a fraction of all transformer blocks. Mixture-of-Depths (Raposo et al., 2024) adopts depth sparsity during the training phase with a jointly trained router, to reduce the training cost of LLMs. Here, the trained router can also be used at inference. D-LLM (Wang et al., 2024) trains a router that can be applied to pretrained LLMs to reduce their inference cost. Our approach differs from both of these works in the sense that it keeps only a limited number of transformer blocks active for a single input query (or prompt); the routing is conducted once per input prompt, not per token.

2.3.
Contextual Sparsity

Our work is most closely related to the idea of contextual sparsity, where a lightweight router selects an input-dependent subnetwork at inference time without updating the base weights. In the context of width pruning, prior works such as Deja Vu (Liu et al., 2023), ShadowLLM (Akhauri et al., 2024), Sirius (Zhou et al., 2024), and CATS (Lee et al., 2024) have demonstrated that context-aware routing can be done with minimal or no degradation in task performance. PuDDing extends this paradigm to depth pruning for the first time: instead of skipping neurons or channels, our router decides which entire transformer blocks to omit. This preserves the original matrix shapes and avoids the hardware mismatches often caused by width pruning.

3. A Motivating Observation

Before describing the proposed framework, we briefly describe a motivating observation which demonstrates that:

The importance of a transformer block in a language model may be highly task-dependent.

Setup. To show this point, we have compared the zero-shot accuracies of LLMs whose omission sets differ by a single transformer block. More concretely, we compare the performance of an omission set (b1, b2, ..., bk-1, bk) to another omission set (b1, b2, ..., bk-1, b'k), on the LLaMA-3.1-8B model. Here, we have used SLEB (Song et al., 2024) to generate an omission set, and then replaced a single block to get another one. Then, we observe the impact of such a replacement on three commonsense reasoning tasks: BoolQ, PIQA, and WinoGrande.

Result. Figure 2 illustrates our findings. We observe that pruning out block 29 instead of block 30 has a two-sided impact: On BoolQ, the change makes a dramatic drop in accuracy (62.2% → 38.0%, 62.5% → 37.9%). However, on PIQA and WinoGrande, we observe a slight accuracy boost. This phenomenon suggests that block 29 may contain more knowledge relevant to answering BoolQ questions, while block 30 may be more knowledgeable about PIQA and WinoGrande.
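This block-swap probe can be sketched in a few lines. The omission set and the accuracy table below are illustrative stand-ins (only the BoolQ drop mirrors the trend reported above); they are not outputs of the actual evaluation.

```python
# Toy sketch of the block-swap probe from Section 3: replace one block in
# an omission set and compare per-task accuracy of the two pruned models.
# The omission set and the accuracy numbers are illustrative assumptions.

def swap_block(omission_set, old, new):
    """Return a copy of the omission set with `old` replaced by `new`."""
    return frozenset(omission_set - {old} | {new})

base = frozenset({3, 11, 18, 24, 27, 30})   # hypothetical omission set
probe = swap_block(base, old=30, new=29)    # prune block 29 instead of 30

# Illustrative zero-shot accuracies (%); only the BoolQ drop mimics the paper.
acc = {
    base:  {"BoolQ": 62.2, "PIQA": 74.0, "WinoGrande": 63.5},
    probe: {"BoolQ": 38.0, "PIQA": 74.6, "WinoGrande": 64.1},
}
delta = {task: round(acc[probe][task] - acc[base][task], 1) for task in acc[base]}
print(delta)  # BoolQ collapses while PIQA and WinoGrande tick up slightly
```

The point of the sketch is only that a one-block difference in the omission set can move different tasks in opposite directions.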
This observation highlights the need to consider task variability during the selection of the omission set.

Figure 2. The impact of pruning transformer block 29 vs. block 30. On the BoolQ dataset, pruning block 29 instead of block 30 incurs a dramatic performance degradation, with an over 20%p drop. On the other hand, on PIQA and WinoGrande, the accuracy does not change much, or even increases.

To formally address such a need, this paper considers an inference of task information from the prompt.

4. Problem Description

Inspired by the observations in Section 3, we now formalize the problem of prompt-based depth pruning. In a nutshell, given some pretrained LLM and a prompt, the goal of prompt-based depth pruning is to designate which transformer blocks should be removed from the model to generate the most accurate response to the prompt.

More concretely, let x be the prompt given to the model, and let W = (W1, ..., Wd) be the weight parameters of a pretrained LLM consisting of d transformer blocks, with Wi indicating the weights of the i-th block. The prediction quality of this language model is measured by the expected loss between the model output and the ground truth, i.e.,

L(W) := E[ℓ((x, y); W)],   (1)

where ℓ((·, ·); W) is some loss function which also encapsulates the generative procedure of the language model with parameter W (e.g., perplexity).

In static depth pruning, the goal is to find which blocks to prune from the given LLM. More formally, define an omission set as a(n unordered) set of transformer block indices

b = {b1, b2, ..., bk} ⊆ {1, 2, ..., d},   (2)

which designates which blocks will be omitted from the target LLM. Then, let W\b be a sequence of d - k weights, with the bi-th blocks eliminated from W. Then, static depth pruning aims to solve the minimization

min_{b : |b| ≥ k} L(W\b),   (3)

given the depth constraint k designated by the operational constraints, such as the desired latency or the peak memory.
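To make the static objective in (3) concrete, here is a minimal greedy sketch in the spirit of SLEB (Song et al., 2024): the omission set is grown one block at a time, always adding the block whose removal increases the calibration loss the least. The additive toy loss is an assumption for illustration; the real method evaluates the pruned LLM's loss on calibration data.

```python
# A minimal sketch of greedy static depth pruning. `loss_fn` is a
# stand-in for the calibration loss L(W \ b); here we use a toy
# additive loss so the example is self-contained.

def greedy_depth_prune(num_blocks, k, loss_fn):
    """Greedily grow an omission set b with |b| = k, minimizing loss_fn(b)."""
    omission = set()
    for _ in range(k):
        # Try removing each remaining block and keep the cheapest choice.
        best_block = min(
            (b for b in range(num_blocks) if b not in omission),
            key=lambda b: loss_fn(omission | {b}),
        )
        omission.add(best_block)
    return omission

# Toy per-block importance scores: lower means the block matters less.
importance = [5.0, 4.0, 0.1, 3.0, 0.2, 6.0, 0.3, 7.0]
toy_loss = lambda b: sum(importance[i] for i in b)

print(sorted(greedy_depth_prune(len(importance), 3, toy_loss)))  # -> [2, 4, 6]
```

With an additive loss the greedy choice reduces to picking the k least important blocks; with the true (non-additive) LLM loss, the iterative re-evaluation after each removal is what distinguishes this from one-shot scoring.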
Prompt-based Depth Pruning. The problem of prompt-based depth pruning can be described as optimizing the omission set as a function b̂(x), i.e., solving

min_{b̂(·)} E[ℓ((x, y); W\b̂(x))],   (4)

subject to Pr(|b̂(x)| ≥ k) = 1. Note that we are constraining the omission set to have cardinality no smaller than k for all x. In other words, the pruned model should always have d - k or fewer blocks. This is because we mainly consider the peak memory constraint, i.e., the RAM cannot hold more than d - k blocks. Otherwise, one can consider a slightly modified version of the problem (4) with a probabilistic constraint.

5. PuDDing

We now formally describe the proposed PuDDing (Prompt-routed Dynamic Depth Pruning), an algorithm to train a router b̂(·) for prompt-based depth pruning. In a nutshell, PuDDing operates in two steps:

1. Generating candidate omission sets using the prompt-answer dataset collected from various tasks (Section 5.1)
2. Training a router to predict the best option among the candidate omission sets (Section 5.2)

During the inference phase, the given prompt is fed to the router, which predicts which omission set (among the candidates) one should use for the given prompt. Then, the model parameters are loaded from the storage to the high-speed memory to constitute the depth-pruned LLM (see Figure 3). We note that this classification-based approach is in contrast with the approach of dynamic token routing (Wang et al., 2024), where one makes yes/no decisions for omitting each block in a sequential manner; this change is made to render the router training easier and more generalizable.

5.1. Candidate Omission Set Generation

The first step is to generate a candidate pool of omission sets. That is, we generate a family of omission sets

B = {b1, ..., bm},   (5)

which will be used as the codomain of the router b̂(·); the router will simply be an m-class classifier.
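The inference-time flow over a fixed pool B can be sketched as an m-way argmin: the router scores every candidate omission set for the given prompt, the minimum-(predicted-)loss candidate wins, and only the surviving blocks need to reside in memory. The router below is a hypothetical stand-in (a fixed score table), not a trained model.

```python
# A minimal sketch of prompt-routed selection over a candidate pool B.
# `predicted_loss` plays the role of the trained router's output and is
# an illustrative assumption.

def route(prompt, candidates, predicted_loss):
    """Return the candidate omission set with the lowest predicted loss."""
    best = min(range(len(candidates)), key=lambda i: predicted_loss[prompt][i])
    return candidates[best]

def blocks_to_load(num_blocks, omission_set):
    """Indices of the transformer blocks that must reside in memory."""
    return [i for i in range(num_blocks) if i not in omission_set]

B = [frozenset({2, 4, 6}), frozenset({1, 4, 7})]   # m = 2 candidate omission sets
predicted_loss = {"Is the sky blue?": [0.9, 0.4]}  # toy router scores per candidate

chosen = route("Is the sky blue?", B, predicted_loss)
print(sorted(chosen))             # -> [1, 4, 7]
print(blocks_to_load(8, chosen))  # -> [0, 2, 3, 5, 6]
```

Because the argmin runs once per prompt (not per token), the routing cost is amortized over the entire generation.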
Desirable properties of the candidate set B are as follows:

- Coverage: For any realistic prompt-answer pair (x, y) from a wide range of tasks, the set B should contain at least one bi with a small loss ℓ(y, f(x; W\bi)).
- Cardinality: The number of omission sets m should be sufficiently small, so that one can train a good predictor over B with a limited number of samples.

To obtain these properties, we adopt the following strategy: First, we collect t calibration datasets D1, ..., Dt on a diverse set of downstream tasks. Then, on each calibration dataset, we select the omission set that minimizes some loss criterion, i.e., solve

bi = arg min_b E_{Di}[ℓ(y, f(x; W\b))].   (6)

Here, the minimization is done in a greedy manner, similar to Song et al. (2024). We apply l different loss criteria on each calibration dataset to get m = t · l omission sets.

Losses. As the loss function, we use new task-focused variants of the perplexity loss, which we call the task likelihood losses. The perplexity measures the fluency of the generated sentences by averaging the log-likelihood losses over the whole sequence. That is, for a sample sentence z = (z1, z2, . . .
, zT), the perplexity is

ppl(z; W) = exp( -(1/T) Σ_{i=1}^{T} log p(zi | z<i; W) ).

Table 2. Zero-shot accuracy comparison of model compression methods on LLaMA-3.1 8B.

| Method | Pruned Blocks (Sparsity) | Arc-C | Arc-E | BoolQ | HellaSwag | PIQA | WinoGrande | Average Acc. (%) |
|---|---|---|---|---|---|---|---|---|
| SLEB | Depth 7 (>21%) | 34.90 | 66.25 | 49.11 | 61.60 | 74.37 | 57.22 | 57.24 |
| SLEB per prompt | Depth 7 (>21%) | 33.44 | 50.59 | 57.95 | 53.57 | 63.44 | 56.51 | 52.58 |
| Shortened LLaMA | Depth 7 (>21%) | 34.30 | 65.15 | 44.52 | 60.55 | 73.67 | 56.43 | 55.77 |
| PuDDing (Ours) | Depth 7 (>21%) | 41.47 | 67.09 | 62.02 | 62.92 | 73.94 | 64.16 | 61.93 (+4.69) |
| FLAP | Width - (15%) | 33.19 | 60.02 | 69.45 | 58.18 | 71.16 | 61.88 | 58.98 |
| SliceGPT | Width - (15%) | 32.59 | 59.60 | 49.82 | 58.59 | 67.14 | 64.56 | 55.38 |
| SLEB | Depth 5 (>15%) | 39.59 | 70.58 | 58.17 | 67.16 | 75.63 | 63.77 | 62.48 |
| Shortened LLaMA | Depth 5 (>15%) | 40.78 | 69.11 | 60.67 | 67.46 | 76.28 | 64.09 | 63.07 |
| PuDDing (Ours) | Depth 5 (>15%) | 42.32 | 72.39 | 65.11 | 67.28 | 75.79 | 65.35 | 64.71 (+1.64) |
| FLAP | Width - (10%) | 36.43 | 66.20 | 69.69 | 63.29 | 74.10 | 66.61 | 62.72 |
| SliceGPT | Width - (10%) | 38.14 | 68.90 | 63.67 | 65.47 | 70.78 | 66.30 | 62.21 |
| SLEB | Depth 3 (>9%) | 45.73 | 76.01 | 68.93 | 71.96 | 77.53 | 68.98 | 68.19 |
| Shortened LLaMA | Depth 3 (>9%) | 38.57 | 69.91 | 69.72 | 71.28 | 77.31 | 67.48 | 65.71 |
| PuDDing (Ours) | Depth 3 (>9%) | 48.98 | 77.02 | 70.18 | 73.26 | 77.20 | 68.11 | 69.13 (+0.94) |

Table 3. Zero-shot task accuracy comparison on LLaMA 3.1 8B, OPT 6.7B, and Vicuna 1.5 7B. The best performances are marked in bold, and the runner-up is marked with underline. We have applied 20% sparsity (i.e., pruned seven blocks).

| Method | LLaMA 3.1 8B | OPT 6.7B | Vicuna 1.5 7B |
|---|---|---|---|
| Dense | 74.90 | 62.51 | 70.49 |
| FLAP | 50.96 | 46.68 | 51.45 |
| SliceGPT | 50.28 | 55.45 | 59.11 |
| SLEB | 57.24 | 56.55 | 58.68 |
| Shortened LLaMA | 55.77 | 54.58 | 59.78 |
| PuDDing (Ours) | 61.93 | 58.37 | 60.01 |

only compare on a limited number of scenarios.

Dataset: Evaluation. We evaluate on the test splits of six zero-shot commonsense reasoning tasks: ARC-Challenge and ARC-Easy (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2020), and BoolQ (Clark et al., 2019).

Dataset: Calibration for Baselines. For the baseline algorithms, we have used the calibration data designated in the original papers.
For SLEB, FLAP, and SliceGPT, we have used WikiText-2 (Merity et al., 2022). For Shortened LLaMA, we have used BookCorpus (Zhu et al., 2015).

Training. To generate the candidate omission sets for our algorithm, we have used 128 randomly drawn samples from the training splits of five zero-shot commonsense reasoning tasks: ARC-Challenge, ARC-Easy, HellaSwag, PIQA, and WinoGrande. That is, we use a total of 10 omission sets (as we use two different losses). For training the router, we have used the full training splits. The BoolQ dataset has been left out in order to evaluate generalization to unseen sets. The router has been trained with AdamW with learning rate 10⁻⁵, weight decay 0.01, and batch size 32 for 10 epochs, with 500 warm-up steps. Also, for the WinoGrande dataset, we use a tailored data pre-processing procedure; we describe this in detail in Appendix A.

Hardware. We have mainly used an NVIDIA RTX 6000 Ada for evaluation and training. In addition, we have used cloud instances of NVIDIA A100 for evaluation.

6.2. Main Experiment

Table 2 provides a comparison of zero-shot accuracies of the model compression methods on the LLaMA-3.1 8B model. From the table, we observe that PuDDing achieves the highest average accuracy on all sparsity levels tested. Especially when 7 blocks have been pruned (over 20% sparsity), the improvement over the best baselines is almost 3%p.

An interesting observation is the poor performance of SLEB per prompt, which determines which blocks to remove on the fly, by using the given prompt as a calibration dataset. In fact, the performance is worse than that of vanilla SLEB. We hypothesize that this is because a single prompt usually does not contain enough information to serve as good calibration data. Our training-based strategy circumvents such difficulty by training a router from the data.

Table 4. Zero-shot accuracy comparison of PuDDing vs. other depth pruning methods on LLaMA-3.1 8B, with LoRA fine-tuning.
The best performances are marked in bold, and the runner-up is marked with underline.

| Method | Pruned Blocks (Sparsity) | Arc-C | Arc-E | BoolQ | HellaSwag | PIQA | WinoGrande | Average Acc. (%) |
|---|---|---|---|---|---|---|---|---|
| Dense | 0 | 53.50 | 81.52 | 82.20 | 78.81 | 79.98 | 73.40 | 74.90 |
| SLEB + LoRA | 7 (>21%) | 45.39 | 74.92 | 69.05 | 70.92 | 78.35 | 64.40 | 67.17 |
| Shortened LLaMA + LoRA | 7 (>21%) | 43.52 | 74.07 | 63.88 | 71.74 | 78.35 | 63.85 | 65.90 |
| LLM-Streamline (w/ fine-tune) | 7 (>21%) | 44.80 | 70.12 | 70.06 | 67.15 | 72.63 | 71.74 | 66.08 |
| PuDDing + LoRA (Ours) | 7 (>21%) | 45.39 | 75.34 | 71.96 | 71.58 | 77.26 | 66.54 | 68.01 (+0.84) |

Table 5. Accuracy comparison of PuDDing vs. other depth pruning methods on LLaMA-3.1 8B on unseen tasks that require more complicated reasoning. The best performances are marked in bold, and the runner-up is marked with underline.

| Method | Pruned Blocks | OpenBookQA | MathQA | MMLU | PubMedQA | SciQ |
|---|---|---|---|---|---|---|
| Dense | 0 | 44.60 | 39.53 | 63.49 | 75.80 | 96.00 |
| SLEB | 7 | 36.00 | 25.19 | 23.76 | 56.40 | 89.20 |
| Shortened LLaMA | 7 | 34.20 | 25.76 | 26.78 | 52.60 | 89.20 |
| PuDDing | 7 | 36.40 | 27.20 | 39.00 | 60.00 | 92.70 |

Regarding out-of-distribution generalization, we observe that PuDDing also works well on the unseen dataset (BoolQ). PuDDing outperforms all other baselines except for FLAP, which works extraordinarily well on this specific dataset. In Table 3, we provide comparisons on other language models: Vicuna and OPT. We confirm that our algorithm works better than the other baselines under this setup as well.

6.3. LoRA Fine-tuning

Next, we compare the performance where we assume that we can recover the accuracies using LoRA (Hu et al., 2022). For PuDDing, we generate LoRA updates for each omission set (thus 10 in total for these experiments). This requires additional storage space for storing 10 separate copies of LoRA weights, one per omission set. However, this only incurs a 2.5% increase in the total storage space.
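A back-of-the-envelope sketch of this storage overhead, under assumed sizes: the 16 GB base-model size matches the paper's setup, but the ~40 MB per-adapter size is an illustrative guess, not a measured value.

```python
# Storage overhead of keeping one LoRA adapter per omission set.
# The adapter size is an illustrative assumption.

def lora_overhead(base_gb, adapter_mb, num_adapters):
    """Fraction of extra storage from num_adapters LoRA copies."""
    extra_gb = adapter_mb * num_adapters / 1024
    return extra_gb / base_gb

# e.g., a 16 GB base model with ten ~40 MB adapters:
overhead = lora_overhead(base_gb=16.0, adapter_mb=40.0, num_adapters=10)
print(f"{overhead:.1%}")  # -> 2.4%
```

Under these assumed sizes, the overhead lands near the ~2.5% figure quoted above; the exact number depends on the LoRA rank and the set of adapted matrices.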
For training the LoRA weights, we have followed the setup and hyperparameters used for LoRA training in Shortened LLaMA (Kim et al., 2024); we have used the Alpaca dataset (Taori et al., 2023) for training, as in that paper.

Table 4 provides LoRA fine-tuned results of depth pruning algorithms on zero-shot commonsense reasoning tasks, for LLaMA-3.1-8B pruned to 20% sparsity. We observe that PuDDing continues to achieve the best performance among all options evaluated. That is, the advantage of prompt-adaptivity also persists after fine-tuning.

Table 6. Wall-clock inference speed of the PuDDing-compressed LLaMA-3.1 8B evaluated on NVIDIA A100 and RTX 6000 Ada. Columns are labeled as prompt length / generation length; the first three columns report pre-fill (TTFT), the last three pre-fill plus generation.

| A100 | 128/1 | 256/1 | 512/1 | 128/128 | 128/256 | 128/512 |
|---|---|---|---|---|---|---|
| Dense | 0.137s | 0.251s | 0.505s | 3.296s | 6.634s | 13.595s |
| PuDDing | 0.109s | 0.201s | 0.393s | 2.694s | 5.375s | 11.024s |
| Router | +0.004s | +0.005s | +0.008s | +0.004s | +0.004s | +0.004s |
| Speedup | 1.21× | 1.22× | 1.23× | 1.22× | 1.23× | 1.23× |

| RTX 6000 Ada | 128/1 | 256/1 | 512/1 | 128/128 | 128/256 | 128/512 |
|---|---|---|---|---|---|---|
| Dense | 0.008s | 0.171s | 0.323s | 4.923s | 9.877s | 19.973s |
| PuDDing | 0.069s | 0.134s | 0.260s | 3.946s | 7.926s | 16.039s |
| Router | +0.005s | +0.005s | +0.005s | +0.005s | +0.005s | +0.005s |
| Speedup | 1.19× | 1.23× | 1.22× | 1.25× | 1.25× | 1.25× |

6.4. More Complicated Tasks

In Table 5, we compare the performance of various depth pruning algorithms on more complicated tasks, including OpenBookQA (Mihaylov et al., 2018), MathQA (Amini et al., 2019), and MMLU (Hendrycks et al., 2021). From the results, we observe that PuDDing continues to perform better than the baselines, even though these tasks have not been observed during the training of the router.

7. Analysis

We now provide further analyses of PuDDing. In particular, we provide the following analyses: wall-clock speedup (Section 7.1), and a visualization of omission sets per task (Section 7.2). In Appendix B, we conduct ablation studies.

7.1.
Wall-clock Speedup

We now provide wall-clock analyses and estimates of the latency and throughput of the PuDDing-compressed models.

Figure 4. A visual illustration of PuDDing's pruning rate for each transformer block, given prompts drawn from various zero-shot tasks. The results are for the LLaMA 3.1 8B model, pruned to 20% sparsity (seven blocks removed). Red indicates that a block is likely to be pruned, and green indicates that a block is likely to be retained. We provide additional visualizations for the other language models (OPT 6.7B and Vicuna 1.5 7B) in Appendix C.

Table 7. Wall-clock inference speed of the PuDDing-compressed LLaMA-3.1 8B evaluated on an edge device (Apple M3 Pro). Columns are labeled as prompt length / generation length.

| M3 Pro (Apple) | 128/1 | 256/1 | 512/1 | 128/128 | 128/256 | 128/512 |
|---|---|---|---|---|---|---|
| Dense | 0.177s | 0.300s | 0.480s | 7.890s | 15.970s | 32.520s |
| PuDDing | 0.138s | 0.235s | 0.376s | 6.174s | 12.497s | 25.447s |
| Router | 0.009s | 0.016s | 0.029s | 0.009s | 0.009s | 0.009s |
| Speedup | 1.20× | 1.20× | 1.19× | 1.28× | 1.28× | 1.28× |

Table 8. The estimated time required to transfer the weight parameters of LLaMA-3.1 8B and PuDDing (with seven blocks pruned) to an NVIDIA A100 GPU through various communication channels.

| Channel | Bandwidth | Dense | PuDDing |
|---|---|---|---|
| PCIe Gen4 x4 | 64GB/s | 0.250s | 0.198s |
| NVIDIA NVLink | 600GB/s | 0.027s | 0.021s |

Inference. Table 6 presents the average wall-clock inference time comparison between the dense and PuDDing-pruned versions of LLaMA 3.1 8B, evaluated on NVIDIA A100 and RTX 6000 Ada. For PuDDing, we have pruned seven layers (21.88% sparsity). We observe that PuDDing provides a consistent 1.19-1.23× speedup during the pre-fill stage, and a 1.22-1.25× speedup including the generation stage. The total routing time takes up 4-8ms, which can be deemed negligible compared with the overall latency. Also, Table 7 presents results on edge devices (e.g., Apple M3 Pro), showing consistent speedup.
This outcome shows that the proposed method is well-suited for both server-like and edge-like hardware.

Parameter loading. Table 8 presents the estimated time required for loading the model parameters of LLaMA-3.1 8B (16GB in 16-bit precision) from the storage to the GPU. PuDDing can save around 52ms on PCIe and 6ms on NVLink, which is non-negligible compared with the computational scale of running these models. However, a pitfall is that, for repeated inference, PuDDing may require loading additional weights to account for different prompts. This additional cost can be minimized by loading only the previously unloaded blocks from the storage; in fact, many blocks overlap, as we demonstrate in Section 7.2.

7.2. Pruned Blocks vs. Task

Figure 4 depicts the distribution of the pruned transformer blocks in the LLaMA-3.1-8B model, given prompts from different tasks. Again, we consider the case where we drop seven transformer blocks for each prompt. From the figure, we make two intriguing observations: First, several blocks are considered almost universally unnecessary. In particular, blocks 20, 26, and 27 are removed with over 80% probability in all tasks. Similarly, there are certain blocks which are almost never pruned, e.g., blocks 1-3 and 5-8. Second, for some blocks, the importance varies highly across tasks. For instance, transformer block 4 is pruned with over 80% probability for ARC-Easy and ARC-Challenge. On the other hand, for PIQA and WinoGrande, its pruning rate is less than 40%; in these tasks, blocks 9 and 10 are likelier to be less important. We note that similar patterns can be observed for OPT and Vicuna; see Appendix C for visualizations on these models.

8. Conclusion

In this paper, we have developed a new paradigm for the depth pruning of large language models, where we dynamically determine which blocks should be utilized for processing the prompt given by the user.
By doing so, we can save both the memory access cost and the inference computation, making the approach suitable for on-device deployment of large language models. We have proposed PuDDing, an algorithm to train a router using various task data. Through our experiments, we have confirmed that such a framework is quite effective, clearly outperforming existing static depth pruning algorithms consistently over multiple LLMs.

Limitations and future work. A notable limitation of the proposed method is that we assume access to various task datasets. In particular, we have focused on the case where we use LLMs for commonsense reasoning tasks, instead of open-ended language generation. A promising future direction will be to develop new techniques to harness unlabeled text corpora, such as Alpaca or C4, to generate diverse clusters of calibration data for attaining corresponding omission sets.

Another limitation is a general lack of mechanisms to account for the different difficulties of the tasks. For some tasks, it may be necessary to utilize all layers to generate an answer of sufficiently high quality; on the other hand, some tasks can be handled with very few layers. While our decision to consider a fixed number of transformer blocks is motivated by the practical constraints of on-device inference, we believe that utilizing variable depth can be even more effective whenever the on-device memory is spacious but can be preempted by other processes.

Acknowledgements

This work has been supported by National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIT) (Nos. RS-2023-00213710, RS-2024-00453301).

Impact Statement

Our paper targets advancing the general field of machine learning and LLM compression. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Akhauri, Y., Abou Elhamayed, A., Dotzel, J., Zhang, Z., Rush, A.
M., Huda, S., and Abdelfattah, M. ShadowLLM: Predictor-based contextual sparsity for large language models. In Empirical Methods in Natural Language Processing, 2024.

Amini, A., Gabriel, S., Lin, S., Koncel-Kedziorski, R., Choi, Y., and Hajishirzi, H. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In North American Chapter of the Association for Computational Linguistics, 2019.

An, Y., Zhao, X., Yu, T., Tang, M., and Wang, J. Fluctuation-based adaptive structured pruning for large language models. In Association for the Advancement of Artificial Intelligence, 2024.

Ashkboos, S., Croci, M. L., do Nascimento, M. G., Hoefler, T., and Hensman, J. SliceGPT: Compress large language models by deleting rows and columns. In International Conference on Learning Representations, 2024.

Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. PIQA: Reasoning about physical commonsense in natural language. In Association for the Advancement of Artificial Intelligence, 2020.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.

Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In North American Chapter of the Association for Computational Linguistics, 2019.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint, 2018.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint, 2024.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

He, S., Sun, G., Shen, Z., and Li, A. What matters in transformers? Not all attention is needed. arXiv preprint, 2024.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021.

Hu, E. J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

Kim, B.-K., Kim, G., Kim, T.-H., Castells, T., Choi, S., Shin, J., and Song, H.-K. Shortened LLaMA: A simple depth pruning for large language models. In Proceedings of the ICLR Workshop on Memory-Efficient Foundation Models (ME-FoMo), 2024.

Lee, D., Lee, J.-Y., Zhang, G., Tiwari, M., and Mirhoseini, A. CATS: Contextually-aware thresholding for sparsity in large language models. In Conference on Language Modeling, 2024.

Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of Machine Learning and Systems, 2024.
Liu, Z., Wang, J., Dao, T., Zhou, T., Yuan, B., Song, Z., Shrivastava, A., Zhang, C., Tian, Y., Ré, C., et al. Deja Vu: Contextual sparsity for efficient LLMs at inference time. In International Conference on Machine Learning, 2023.

Men, X., Xu, M., Zhang, Q., Wang, B., Lin, H., Lu, Y., Han, X., and Chen, W. ShortGPT: Layers in large language models are more redundant than you expect. arXiv preprint, 2024.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In International Conference on Learning Representations, 2022.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Empirical Methods in Natural Language Processing, 2018.

Raposo, D., Ritter, S., Richards, B., Lillicrap, T., Humphreys, P. C., and Santoro, A. Mixture-of-Depths: Dynamically allocating compute in transformer-based language models. arXiv preprint, 2024.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd schema challenge at scale. In Association for the Advancement of Artificial Intelligence, 2020.

Siddiqui, S. A., Dong, X., Heinrich, G., Breuel, T., Kautz, J., Krueger, D., and Molchanov, P. A deeper look at depth pruning of LLMs. In ICML 2024 Workshop on Theoretical Foundations of Foundation Models, 2024.

Song, J., Oh, K., Kim, T., Kim, H., Kim, Y., et al. SLEB: Streamlining LLMs through redundancy verification and elimination of transformer blocks. In International Conference on Machine Learning, 2024.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford Alpaca: An instruction-following LLaMA model, 2023.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint, 2023.
Wang, H., Xie, L., Zhao, H., Zhang, C., Qian, H., Lui, J. C., et al. D-LLM: A token adaptive computing resource allocation strategy for large language models. In Advances in Neural Information Processing Systems, 2024.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In Association for Computational Linguistics, 2019.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. OPT: Open pre-trained transformer language models. arXiv preprint, 2022.

Zhou, Y., Chen, Z., Xu, Z., Lin, X. V., and Chen, B. SIRIUS: Contextual sparsity with correction for efficient LLMs. In Advances in Neural Information Processing Systems, 2024.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In International Conference on Computer Vision, 2015.

A. Pre-processing for the WinoGrande Dataset

The WinoGrande dataset, originally consisting of fill-in-the-blank sentences, was initially scored using the sentence-level likelihood (sl), defined as:

$$\mathrm{sl}(z; W) = \frac{1}{n} \sum_{i=1}^{n} \log p_i(z_i \mid z_{<i}; W)$$
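The sentence-level likelihood above is simply the mean per-token log-probability of a candidate sentence under the model. A minimal sketch follows; it assumes the conditional probabilities $p_i(z_i \mid z_{<i}; W)$ have already been extracted from a forward pass of the (pruned) model, and the function name and toy probability values are hypothetical:

```python
import math

def sentence_loglikelihood(token_probs):
    """Mean per-token log-likelihood:
    sl(z; W) = (1/n) * sum_{i=1..n} log p_i(z_i | z_<i; W).

    token_probs: conditional probabilities of each token given its prefix,
    assumed precomputed by a forward pass through the model.
    """
    n = len(token_probs)
    return sum(math.log(p) for p in token_probs) / n

# Between two fill-in-the-blank completions, pick the one the model
# assigns the higher sentence-level likelihood (hypothetical values).
cand_a = [0.9, 0.8, 0.7]
cand_b = [0.9, 0.4, 0.7]
best = max([cand_a, cand_b], key=sentence_loglikelihood)
```

Averaging by the token count `n` keeps the score comparable between completions of different lengths, which matters when the two blank-fillers tokenize to different numbers of tokens.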