# Parameter-Efficient Fine-Tuning of State Space Models

Kevin Galim*¹ Wonjun Kang*¹² Yuchen Zeng*³ Hyung Il Koo¹ Kangwook Lee³

**Abstract.** Deep State Space Models (SSMs), such as Mamba (Gu & Dao, 2024), have become powerful tools for language modeling, offering high performance and linear scalability with sequence length. However, the application of parameter-efficient fine-tuning (PEFT) methods to SSM-based models remains largely underexplored. We start by investigating two fundamental questions about existing PEFT methods: (i) How do they perform on SSM-based models? (ii) Which parameters should they target for optimal results? Our analysis shows that LoRA and its variants consistently outperform all other PEFT methods. While LoRA is effective for linear projection matrices, it fails on SSM modules, yet it still outperforms the other methods applicable to SSMs, indicating their limitations. This underscores the need for a specialized SSM tuning approach. To address this, we propose Sparse Dimension Tuning (SDT), a PEFT method tailored for SSM modules. Combining SDT for SSMs with LoRA for linear projection matrices, we achieve state-of-the-art performance across extensive experiments.

## 1. Introduction

In the past few years, Large Language Models (LLMs) such as ChatGPT (Achiam et al., 2023; Brown et al., 2020) have achieved groundbreaking performance and are now widely used in daily life. While many models rely on the Transformer architecture (Vaswani et al., 2017), its quadratic time complexity due to the attention mechanism poses challenges for long sequences. To address this, alternative architectures such as linear attention (Katharopoulos et al., 2020), RWKV (Peng et al., 2023), RetNet (Sun et al., 2023), and Mamba (Gu & Dao, 2024) have been developed, offering

*Equal contribution; authors listed in alphabetical order. ¹FuriosaAI ²Seoul National University ³University of Wisconsin-Madison.
Correspondence to: Kangwook Lee.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

subquadratic time complexity. Efficient attention alternatives often rely on State Space Models (SSMs) or their variants (Gu et al., 2021; 2022b;a; Gu & Dao, 2024), which are akin to linear RNNs, maintaining fixed-size hidden states for sequential processing. S4 (Gu et al., 2022b;a) overcomes RNNs' parallel-training limitations by constraining parameter structures, enabling a convolutional form for efficient parallel computation. S6 (Gu & Dao, 2024) improves on this with input-dependent parameters, enabling selective focus on relevant information per token. Building on S6 with linear projection matrices (analogous to the feed-forward networks in Transformer layers), Mamba-I (Gu & Dao, 2024) emerged as a prominent SSM-based model. Mamba-I was later extended to Mamba-II (Dao & Gu, 2024), with both models achieving Transformer-level performance in language modeling and gaining widespread recognition.

As SSMs gain popularity, performing parameter-efficient fine-tuning (PEFT) of pretrained models for downstream tasks is crucial, since full fine-tuning is costly and inefficient. Numerous PEFT methods (Houlsby et al., 2019; Hu et al., 2021; He et al., 2021; Li & Liang, 2021; Lester et al., 2021; Zaken et al., 2022; Liu et al., 2021; 2022) have been developed, achieving notable success on Transformer models. The most popular PEFT methods fall into three categories: (i) input-injection methods, which add sequences to the model's main input (Lester et al., 2021) or prepend tokens to the intermediate inputs at each layer (Li & Liang, 2021); (ii) architecture-enhancement methods, which adjust the model architecture. For example, Houlsby et al.
(2019) added layers between Transformer layers, while Additional-scan (Yoshimura et al., 2025) expands the state dimensions in the SSM module; (iii) weight-tuning methods, which directly modify existing model weights. Notable weight-tuning approaches include BitFit (Zaken et al., 2022), which updates only bias terms, and LoRA (Hu et al., 2021), which modifies weight matrices through low-rank updates, along with its variants such as DoRA (Liu et al., 2024) and LoRA+ (Hayou et al., 2024). For simplicity, we denote LoRA and its variants as LoRA*.

Despite the success that existing PEFT methods have achieved in adapting Transformer-based models, their efficacy in adapting SSM-based models remains largely underexplored, leaving many interesting questions open.

*Figure 1. A visual guide to PEFT methods in SSM-based models: benchmarking and innovation. We compare various existing PEFT approaches on SSM-based models, demonstrating that LoRA applied to linear projection matrices outperforms all other methods. However, extending LoRA to SSM modules fails to yield further improvements. To address this, we propose Sparse Dimension Tuning (SDT), which achieves state-of-the-art performance on SSM-based models when combined with LoRA for linear projection matrices.*

1. Do existing popular PEFT methods remain effective for SSM-based models?
2. If applicable, what is the optimal way to integrate these methods into SSM-based models, and which parameters should be updated?
3.
If not, can we design specialized variants tailored to SSMs that yield superior performance?

Our main contributions to address these questions are:

**Comprehensive Benchmarking of PEFT Methods.** We benchmark six widely used PEFT methods across three categories on diverse tasks, including natural language understanding, generation, and computer vision. We evaluate these methods on both SSM-based models (i.e., Mamba) and a hybrid model (i.e., Jamba (Lieber et al., 2025)), which consists of both Transformer layers and Mamba layers. Our results show that LoRA* consistently outperforms all other PEFT methods on both SSM-based and hybrid models. However, its effectiveness is limited to linear projection matrices, as further tuning of SSM modules does not improve performance. Notably, other methods applicable to SSM modules perform worse than LoRA*, further underscoring the need for a specialized approach to tuning SSM modules.

**Introducing Sparse Dimension Tuning (SDT) for SSM Modules.** To develop an effective method for tuning SSM modules, we conduct a theoretical analysis of the roles of different parameters. This analysis motivates the Sparse Dimension Tuning and Pruning (SDT-P) method, which improves efficiency by freezing and pruning certain channel and state dimensions while training only the remaining ones. We establish theoretical guarantees for its effectiveness in SSM-based models when combined with LoRA applied to linear projection matrices. We then simplify SDT-P into Sparse Dimension Tuning (SDT) by omitting explicit pruning, since pruned dimensions are equivalent to trained dimensions set to zero. SDT selectively updates channels and fine-tunes specific dimensions within them, as illustrated in Fig. 1.

**Demonstrating Effectiveness of SDT.**
Through extensive experiments, we demonstrate that integrating SDT into SSM-based models, combined with applying LoRA to their linear projection matrices, achieves state-of-the-art fine-tuning performance. The roadmap of our paper is illustrated in Fig. 1. Our code is available at https://github.com/furiosa-ai/ssm-peft.

## 2. Related Works

**Concurrent Works on PEFT for SSMs.** Several concurrent studies (Halloran et al., 2024; Yoshimura et al., 2025; Kang et al., 2025) have investigated PEFT methods for SSM-based models. Halloran et al. (2024) studied both in-context learning and parameter-efficient fine-tuning, with an orthogonal focus on analyzing Mamba's stability under mixed-precision training using Lyapunov exponents. Kang et al. (2025) introduced state-based PEFT methods and proposed State-offset Tuning, focusing solely on fine-tuning Mamba's S6 blocks. Yoshimura et al. (2025) benchmarked multiple PEFT approaches, including established methods, a new method called Additional-scan (which adds a trainable state dimension to the SSM module), and partial tuning (fine-tuning only a subset of parameters), and introduced MambaPEFT through PEFT search strategies. While Yoshimura et al. (2025) focused solely on Mamba-I, providing an in-depth study of that particular architecture, our work investigates a broader class of SSM-based models, including deep S4, Mamba-I, and Jamba in the main body, as well as Mamba-II in Sec. C.2 and E.2, aiming to offer general insights on how to tune SSMs effectively rather than focusing on a single variant.

**Sparse Tuning.** Several studies have explored sparse parameter selection in fine-tuning (Song et al., 2024) and skill localization (Panigrahi et al., 2023). Song et al. (2024) showed that sparse tuning is an effective PEFT method, linking the low intrinsic dimensionality of pre-trained models to the proportion of parameters needing updates.
They propose selecting optimal fine-tuning parameters based on gradient magnitudes. We enable sparse tuning for SSMs by applying sparsity across entire dimensions (channel and state) rather than to individual neurons. Panigrahi et al. (2023) focused on identifying neurons responsible for specific downstream tasks by fully fine-tuning the model and computing neuron masks that minimize task loss. While effective for skill localization, this method is computationally expensive and not optimized for parameter-efficient fine-tuning. In Sec. A, we provide a more detailed discussion of related work on SSMs and PEFT.

## 3. Preliminaries

### 3.1. State Space Models

**Discrete-Time SSMs.** The initial SSM is derived from a continuous system that maps a one-dimensional function or signal $x(t) \in \mathbb{R}$ to $y(t) \in \mathbb{R}$ via an $H$-dimensional latent state $h(t) \in \mathbb{R}^H$, as described in (1). In this formulation, the input transition vector $B \in \mathbb{R}^{H \times 1}$ indicates the input's impact on the state of the system, the state matrix $A \in \mathbb{R}^{H \times H}$ characterizes the system's internal state dynamics, and the output mapping vector $C \in \mathbb{R}^{1 \times H}$ relates the state to the output $y(t)$.¹

$$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t) \tag{1}$$

To handle discrete inputs, the continuous parameters $(A, B)$ are discretized into $(\bar{A}, \bar{B})$ using a learnable step size $\Delta \in \mathbb{R}$. A common discretization rule, the zero-order hold, defines $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$. The discrete-time SSM, given in (2), enables efficient inference via the long convolution described in (3):

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t \tag{2}$$

$$\bar{K} = \big(C\bar{B},\; C\bar{A}\bar{B},\; \ldots,\; C\bar{A}^{\,t-1}\bar{B}\big), \qquad (y_1, \ldots, y_t) = (x_1, \ldots, x_t) * \bar{K} \tag{3}$$

For multi-channel inputs $x, y \in \mathbb{R}^D$, separate SSMs are used per channel, with a superscript $(d)$ indicating channel-specific parameters when needed.

**Structured State Space Sequence Model (S4).** S4, introduced by Gu et al. (2022b), is an early application of SSMs in deep learning, featuring a diagonal state matrix $A$.
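As a sanity check on the equivalence between the recurrence (2) and the convolution (3), the following sketch (not from the paper; all parameters are random placeholders) runs both forms for a single channel and confirms they produce the same outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 4, 6                                  # state size, sequence length
A = np.diag(rng.uniform(0.1, 0.9, H))        # diagonal state matrix, as in S4
B = rng.standard_normal((H, 1))
C = rng.standard_normal((1, H))
x = rng.standard_normal(T)

# Recurrent form (2): h_t = A h_{t-1} + B x_t,  y_t = C h_t
h = np.zeros((H, 1))
y_rec = []
for t in range(T):
    h = A @ h + B * x[t]
    y_rec.append((C @ h).item())

# Convolutional form (3): y = x * K with kernel entries K_t = C A^t B
K = np.array([(C @ np.linalg.matrix_power(A, t) @ B).item() for t in range(T)])
y_conv = [sum(K[t - m] * x[m] for m in range(t + 1)) for t in range(T)]

assert np.allclose(y_rec, y_conv)            # both forms agree
```

The kernel view is what allows S4 to train in parallel via FFT-based convolution instead of a sequential scan.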
To introduce non-linearity and cross-channel mixing, S4 integrates a position-wise linear layer, an activation function, and a residual connection from input to output.

¹Note that $B$, $C$ are vectors; we use bold capitals for consistency with prior work (Gu et al., 2022b; Gu & Dao, 2024).

Let $\odot$ denote the element-wise product and $\mathrm{S4}(\cdot)$ the S4 mechanism, where each channel's output follows (3) with its convolutional kernel $\bar{K}^{(d)}$. To facilitate theoretical analysis, certain subtle details such as activation functions may differ slightly from previous studies (Gu et al., 2022b;a). We define the deep S4 layer as:

$$y_t = \mathrm{ReLU}\big(W\,\mathrm{S4}_t(x_1, \ldots, x_t) + \beta + u \odot x_t\big), \tag{4}$$

where $W \in \mathbb{R}^{D \times D}$ and $\beta \in \mathbb{R}^D$ are the linear projection matrix and bias, respectively, and $u \in \mathbb{R}^D$ is the coefficient of the residual connection. Trainable parameters include the SSM parameters $(A^{(d)}, B^{(d)}, C^{(d)}, \Delta^{(d)})$ across $D$ channels, with $A^{(d)}$ diagonal, as well as the linear layer $(W, \beta)$ and the residual connection $u$.

**Selective State Space Models (S6).** All SSMs mentioned above exhibit linear time invariance (LTI), meaning their dynamics remain constant over time. A key limitation of LTI SSMs is their fixed dynamics, which hinder selective context extraction and input-dependent state transitions. S6 (Gu & Dao, 2024) addresses this by making the parameters input-dependent. At each time step $t$, given input $x_t \in \mathbb{R}^D$, S6 computes input-dependent step sizes $\Delta_t = (\Delta_t^{(1)}, \ldots, \Delta_t^{(D)}) \in \mathbb{R}^D$, input transition vectors $B_t \in \mathbb{R}^{H \times 1}$, and output mapping vectors $C_t \in \mathbb{R}^{1 \times H}$ via linear projections:

$$\Delta_t = \mathrm{softplus}(W_\Delta x_t + \beta_\Delta), \qquad B_t = W_B x_t, \qquad C_t = W_C x_t,$$

where the diagonal state matrices $A^{(1)}, \ldots, A^{(D)}$ remain input-independent. The weight $W_\Delta \in \mathbb{R}^{D \times D}$ is factorized as $W_\Delta = W_{\Delta,1} W_{\Delta,2}$, with $W_{\Delta,1} \in \mathbb{R}^{D \times R}$ and $W_{\Delta,2} \in \mathbb{R}^{R \times D}$, to reduce computation (Wang et al., 2021; 2023a). Trainable parameters in S6 include $A^{(d)}$ across $D$ channels; $W_{\Delta,1}$, $W_{\Delta,2}$, and $\beta_\Delta$ for computing $\Delta_t$; and $W_B, W_C \in \mathbb{R}^{H \times D}$ for computing $B_t$, $C_t$.
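A minimal NumPy sketch of the S6 computation, including the zero-order-hold discretization described next (the weights are random and the wiring is an illustrative assumption, not Mamba's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, R, T = 3, 4, 2, 5                     # channels, states, low rank, length
softplus = lambda z: np.log1p(np.exp(z))

A = -np.exp(rng.standard_normal((D, H)))    # diagonals of A^(d), kept negative
W_d1, W_d2 = rng.standard_normal((D, R)), rng.standard_normal((R, D))
beta = rng.standard_normal(D)
W_B, W_C = rng.standard_normal((H, D)), rng.standard_normal((H, D))
x = rng.standard_normal((T, D))

h = np.zeros((D, H))                        # one H-dim state per channel
ys = []
for t in range(T):
    delta = softplus(W_d1 @ (W_d2 @ x[t]) + beta)   # input-dependent step sizes
    B_t, C_t = W_B @ x[t], W_C @ x[t]               # input-dependent B_t, C_t
    Abar = np.exp(delta[:, None] * A)               # zero-order-hold discretization
    Bbar = delta[:, None] * B_t[None, :]            # per-channel scaling of B_t
    h = Abar * h + Bbar * x[t][:, None]             # selective-scan state update
    ys.append(h @ C_t)                              # C_t shared across channels
y = np.stack(ys)                                    # (T, D) outputs
```

Note how per-channel variation enters only through the scalar step size $\Delta_t^{(d)}$, while $B_t$ and $C_t$ are shared across channels, mirroring the discussion below.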
Discretization follows:

$$\bar{A}_t^{(d)} = \exp\big(\Delta_t^{(d)} A^{(d)}\big), \qquad \bar{B}_t^{(d)} = \Delta_t^{(d)} B_t.$$

Unlike S4, where $\bar{B}^{(d)}$ varies per channel, S6's per-channel variation in $\bar{B}_t^{(d)}$ stems from the scalar $\Delta_t^{(d)}$. Additionally, S6 shares $C_t$ across all channels at each time step $t$, while S4 assigns a distinct $C^{(d)}$ to each channel.

**Mamba & Jamba.** Similar to the Transformer block, which consists of attention and linear layers, the Mamba-I block proposed by Gu & Dao (2024) features an S6 module; a point-wise 1D causal convolution layer (Conv1d) for token mixing; linear layers, including input ($W_{\text{in}}$) and output ($W_{\text{out}}$) projection layers; and a gated MLP. Mamba-II (Dao & Gu, 2024) further simplifies the state matrix $A$ to a scalar. Building on Mamba-I, Jamba (Lieber et al., 2025) introduces a hybrid architecture that integrates both Transformer blocks and Mamba blocks, leveraging the strengths of both to enhance performance. This paper focuses on Mamba-I (referred to as Mamba in this paper) and Jamba, deferring Mamba-II discussion to the appendix.

### 3.2. Parameter-Efficient Fine-Tuning

**Input-Injection Methods.** Input-injection methods, such as prompt tuning (Lester et al., 2021) and prefix-tuning (Li & Liang, 2021), enhance the model's input by injecting specialized sequences. Prompt tuning prepends a set of trainable embeddings $P \in \mathbb{R}^{D \times M}$ to the original input $X \in \mathbb{R}^{D \times N}$, forming the concatenated sequence $\widetilde{X} = [P; X]$. Prefix-tuning (Li & Liang, 2021) instead injects learnable vectors into the key and value matrices of each attention layer. For a Transformer layer, it prepends prefix states $P_K, P_V \in \mathbb{R}^{L \times D}$ to the original projections: $\widetilde{K} = [P_K; K]$ and $\widetilde{V} = [P_V; V]$, where $K$ and $V$ are the key and value matrices derived from the input. We note that prefix-tuning is functionally equivalent to prepending soft tokens to the input at each attention layer and discarding the outputs associated with the prepended tokens.
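This soft-token view can be sketched architecture-agnostically; `causal_layer` below is a hypothetical stand-in for any causal sequence layer (attention or SSM), and the prefix embeddings would be trainable in practice:

```python
import numpy as np

def prefix_tune_layer(layer_fn, prefix, X):
    """Soft-token view of prefix-tuning: prepend trainable embeddings
    `prefix` (M x D) to the layer input `X` (N x D), run the layer, and
    discard the outputs at the prepended positions."""
    M = prefix.shape[0]
    out = layer_fn(np.concatenate([prefix, X], axis=0))
    return out[M:]                          # keep only the original N positions

# Toy causal layer (placeholder for attention or an SSM): cumulative mean.
causal_layer = lambda X: np.cumsum(X, axis=0) / np.arange(1, len(X) + 1)[:, None]

X = np.ones((4, 2))
prefix = np.zeros((3, 2))                   # learnable in practice; fixed here
Y = prefix_tune_layer(causal_layer, prefix, X)
assert Y.shape == X.shape                   # output length matches the input
```

The prefix changes every retained output through the layer's causal state, which is exactly how the prepended tokens influence the model without altering its weights.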
This view simplifies adaptation to SSMs, which lack explicit key and query projections. Yoshimura et al. (2025) also adopt this implementation, though they refer to it as affix-tuning.

**Architecture-Enhancement Methods.** These methods modify the model's internal structure to introduce tunable components. In the context of SSMs, one example is Additional-scan (Yoshimura et al., 2025), which expands the state dimensions within the SSM block and fine-tunes only the added parameters, leaving the original weights untouched.

**Weight-Tuning Methods.** Notable weight-tuning methods include LoRA (Hu et al., 2021) and its variants (Liu et al., 2024; Hayou et al., 2024), as well as BitFit (Zaken et al., 2022). LoRA fine-tunes a model by introducing low-rank updates to its weight matrices. Given a weight matrix $W_0 \in \mathbb{R}^{D \times D}$, LoRA updates it as $W = W_0 + W_{\downarrow} W_{\uparrow}$, with $W_{\downarrow} \in \mathbb{R}^{D \times R}$, $W_{\uparrow} \in \mathbb{R}^{R \times D}$, and rank $R \ll D$. Only $W_{\downarrow}$ and $W_{\uparrow}$ are trained, reducing the number of trainable parameters from $D^2$ to $2RD$. Weight-Decomposed Low-Rank Adaptation (DoRA) (Liu et al., 2024) improves upon LoRA by decomposing the weight matrix into two components, magnitude ($m \in \mathbb{R}^D$) and direction ($W_0 + W_{\downarrow} W_{\uparrow}$), leading to the formulation

$$W = m \odot \frac{W_0 + W_{\downarrow} W_{\uparrow}}{\lVert W_0 + W_{\downarrow} W_{\uparrow} \rVert_c},$$

where $\lVert \cdot \rVert_c$ denotes the column-wise norm. The additional parameter $m$ enhances both training capacity and stability. LoRA+ (Hayou et al., 2024) modifies LoRA by applying different learning rates to $W_{\downarrow}$ and $W_{\uparrow}$, enabling more effective feature learning. In contrast, BitFit (Zaken et al., 2022) updates only the bias terms, offering a lightweight and highly parameter-efficient alternative.

## 4. Benchmarking PEFT Methods on SSM-based Models

In this section, we examine the effectiveness of popular PEFT methods when applied naively to SSM-based models, specifically Mamba and Jamba.

### 4.1. Experiment Setup

We evaluate PEFT methods across three categories: input-injection, architecture-enhancement, and weight-tuning.
For input-injection methods, we use prompt tuning (Lester et al., 2021) and prefix-tuning (Li & Liang, 2021), where prefix-tuning employs an overparameterized MLP for stable optimization. For architecture-enhancement methods, we include Additional-scan (Yoshimura et al., 2025), which introduces and fine-tunes newly added state dimensions in SSM modules. For weight-tuning, we consider BitFit (Zaken et al., 2022) and LoRA*, including LoRA (Hu et al., 2021) and DoRA (Liu et al., 2024), while LoRA+ (Hayou et al., 2024) is deferred to Sec. E.2. BitFit fine-tunes the bias terms of Conv1d and of the $\Delta$ projection.

We use six datasets spanning different domains: GLUE for natural language understanding (Wang et al., 2019), DART for RDF-to-text generation (Nan et al., 2021), SAMSum (Gliwa et al., 2019) for summarization, Spider for text-to-SQL generation (Yu et al., 2018), and two vision datasets, CIFAR-10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015), with the vision datasets processed by cropping, resizing, and flattening pixel values into space-separated numerical sentences. Details are in Sec. B.

Prefix-tuning requires significantly more parameters than other PEFT methods due to its per-layer MLP for projecting fixed sequences into soft tokens. For all methods except prefix-tuning and the special case of LoRA and DoRA applied to both linear projection layers, we limit trainable parameters to below 1% for Mamba and below 0.15% for Jamba. For Jamba, all PEFT methods are applied to the Mamba layers, while the Transformer layers remain frozen to isolate performance effects. See more details in Sec. C.1.

### 4.2. Results

Table 1 summarizes the benchmarking results. Detailed results for GLUE and Spider subtasks appear in Sec. C.2. We analyze the results from three key perspectives below.

| Model | Method | Major Target Module | GLUE (Avg. Score) | DART (METEOR) | DART (BLEU) | SAMSum (R1) | SAMSum (R2) | SAMSum (RL) | Spider (Acc.) | CIFAR-10 (Acc.) | CelebA (Acc.) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mamba | Prompt Tuning | Other | 63.8 | 66.2 | 39.8 | 50.1 | 25.6 | 41.6 | 43.6 | 30.4 | 82.5 |
| Mamba | Prefix-Tuning | SSM | 68.6 | 66.6 | 42.5 | 50.6 | 26.5 | 42.1 | 39.7 | 41.0 | 86.5 |
| Mamba | BitFit | Both | 76.8 | 67.0 | 43.7 | 50.3 | 25.7 | 41.9 | 48.4 | 44.4 | 86.9 |
| Mamba | LoRA | SSM | 76.9 | 68.8 | 48.0 | 50.4 | 26.0 | 41.8 | 55.0 | 52.3 | 87.0 |
| Mamba | LoRA | LinProj | 81.2 | 70.9 | 49.5 | 50.9 | 27.0 | 42.3 | 57.5 | 61.0 | 87.0 |
| Mamba | LoRA | Both | 80.3 | 70.2 | 52.2 | 50.7 | 26.8 | 42.4 | 57.0 | 58.4 | 89.8 |
| Mamba | DoRA | SSM | 77.9 | 68.3 | 47.3 | 48.1 | 24.2 | 39.6 | 55.3 | 44.5 | 87.1 |
| Mamba | DoRA | LinProj | 81.1 | 70.7 | 51.6 | 51.0 | 26.9 | 42.8 | 60.7 | 57.6 | 86.7 |
| Mamba | DoRA | Both | 80.8 | 70.8 | 51.4 | 51.3 | 27.2 | 43.0 | 58.1 | 58.2 | 89.8 |
| Mamba | Additional-Scan | SSM | 62.4 | 60.6 | 15.8 | 37.6 | 17.5 | 30.9 | 26.9 | 32.2 | 86.0 |
| Mamba | Full Fine-Tuning | Both | 80.5 | 71.0 | 51.8 | 51.2 | 27.3 | 42.9 | 66.2 | 60.0 | 89.4 |
| Jamba | Prompt Tuning | Other | 73.3 | 54.1 | 6.3 | 54.7 | 31.8 | 46.8 | 74.9 | 40.9 | 85.6 |
| Jamba | Prefix-Tuning | SSM | 56.9 | 59.6 | 14.4 | 11.5 | 1.8 | 10.4 | 0.3 | 29.9 | 82.2 |
| Jamba | BitFit | Other | 75.2 | 59.2 | 14.8 | 54.7 | 31.9 | 47.0 | 73.7 | 45.6 | 86.3 |
| Jamba | LoRA | LinProj | 73.9 | 68.9 | 37.8 | 54.6 | 32.3 | 46.8 | 69.3 | 59.7 | 89.0 |
| Jamba | DoRA | LinProj | 71.4 | 68.1 | 28.8 | 55.2 | 32.2 | 47.3 | 70.9 | 58.6 | 89.0 |
| Jamba | Additional-Scan | SSM | 68.3 | 63.3 | 20.1 | 53.4 | 30.5 | 45.6 | 69.3 | 50.6 | 0.0 |

Table 1. Benchmarking popular parameter-efficient fine-tuning (PEFT) methods on Mamba (Gu & Dao, 2024) and Jamba (Lieber et al., 2025) across six real-world datasets. R1/R2/RL stand for ROUGE-1/2/L. We evaluate PEFT applied to different target modules: the SSM module only, the linear projection matrices (LinProj) only, both, or other components such as the embedding layer. For Mamba and Jamba, all methods use fewer than 1% and 0.15% of parameters, respectively, except when the target module for LoRA or DoRA is set to Both or when prefix-tuning is applied. Comprehensive hyperparameter tuning was performed for all methods. (In the original typeset table, bold marks the best result per model and underlining the second-best per task, excluding full fine-tuning.)
Key findings include: (i) among PEFT methods applied to SSM modules, LoRA outperforms the others; (ii) among all PEFT methods, LoRA* achieves the best performance; (iii) applying LoRA to linear projections yields results comparable to applying it to both linear projections and SSM modules, while outperforming its application solely to SSM modules; and (iv) input-injection methods (i.e., prompt tuning and prefix-tuning) are generally ineffective.

**Superiority of LoRA*.** The most prominent finding is that LoRA* consistently outperforms the other PEFT methods (e.g., prompt tuning, prefix-tuning, BitFit, Additional-scan), regardless of the target module.

**Finding:** Across all target modules, LoRA* surpasses existing PEFT methods in performance. Even when restricted to SSM modules, LoRA* still outperforms all other PEFT baselines applied to the same target.

**Limitations of Input-Injection Methods.** Input-injection methods like prefix-tuning are ineffective for SSM-based models (Table 1), as their expressiveness reduces to tuning only the initial hidden state (Proposition 1). The formal statement, proof, and empirical verification are in Sec. C.3.

**Optimal Application of LoRA in SSM-based Models.** Table 1 shows that LoRA* outperforms all other PEFT methods in most scenarios. From our results, we explore the optimal layers for applying LoRA in SSM-based models: the SSM module, the linear projection matrices, or a combination of both. Note that S6 in Mamba and Jamba includes fine-grained parameters such as x_proj ($W_B$, $W_C$, $W_{\Delta,2}$) and dt_proj ($W_{\Delta,1}$), which were already explored by Yoshimura et al. (2025) on Mamba. We defer a deeper discussion of them to Sec. C.4 and focus on the key question here: Is applying LoRA to SSM modules necessary for performance gains? By narrowing our scope, we aim to clarify LoRA's impact across major components (e.g., SSM modules, linear projection matrices) rather than all specific parameters.
We evaluate LoRA's performance on linear projections using $W_{\text{in}}$, $W_{\text{out}}$, and both combined. Since the performance of different combinations of linear projections is consistent across datasets (see Sec. C.4), we report only the results for LoRA applied to $W_{\text{in}}$ in Table 1. For SSM modules, we apply LoRA to weight matrices, including those for the input-dependent step size $\Delta$. For the state transition matrices $A$, we treat their diagonals as vectors, concatenate them across channels to form a matrix, and apply LoRA. Table 1 summarizes results for the best-performing configurations (see Sec. C.2 for full results). Based on these results, we present the following finding:

**Finding:** For LoRA*, tuning SSM modules is less effective than tuning linear projection matrices, and the latter performs comparably to tuning both.

Detailed experiments, including LoRA on different linear projection matrices and additional evaluations on Mamba-II, are presented in Sec. C.2. These experiments reinforce the finding that LoRA is highly effective for linear projections but less suitable for SSM modules. To further elucidate this point, we present the following lemma, which examines a simplified model architecture consisting of S6 with two linear input projection matrices at each layer. We demonstrate that fine-tuning one input projection matrix encompasses the expressivity of fine-tuning the parameters $W_B$, $W_C$, and $W_{\Delta,2}$. Consider an S6 model with two input projection matrices $W_{\text{in},1}, W_{\text{in},2} \in \mathbb{R}^{D \times D}$: the first affects how the internal parameters depend on the input, while the second governs the input passed directly into the S6 module.
Under this setup, the output $y_N^{(d)}$ can be expressed as:

$$y_N^{(d)} = \underbrace{C_N(W_{\text{in},1} x_N)}_{\text{input-dependent } C_N} \sum_{n=1}^{N} \Bigg[ \prod_{m=n+1}^{N} \underbrace{\bar{A}^{(d)}(W_{\text{in},1} x_m)}_{\text{input-dependent } \bar{A}_m} \Bigg] \underbrace{\bar{B}_n^{(d)}(W_{\text{in},1} x_n)}_{\text{input-dependent } \bar{B}_n} \, \underbrace{(W_{\text{in},2} x_n)^{(d)}}_{\text{input after projection } W_{\text{in},2}},$$

where the parameters in the first three factors depend on the input after projection $W_{\text{in},1}$. When $W_{\text{in},1} = W_{\text{in},2}$, this reduces to a standard architecture with a single input projection followed by an S6 layer. For simplicity, we let $\beta_\Delta = 0$. The full model is then parameterized by $(\{A^{(d)}\}_{d=1}^D, W_B, W_C, W_{\Delta,1}, W_{\Delta,2}, W_{\text{in},1}, W_{\text{in},2})$. Assume none of the parameters are zero and $D > 2H + R$, where $R$ is the rank of $W_\Delta = W_{\Delta,1} W_{\Delta,2}$.

**Lemma 1** (Expressivity of Fine-Tuning Projection Matrices). *Consider two models with the architecture described above:*

- *a target model $f^*$ parameterized by $(\{A^{(d)}\}_{d=1}^D, W_B^*, W_C^*, W_{\Delta,1}, W_{\Delta,2}^*, W_{\text{in},1}^*, W_{\text{in},2})$;*
- *a frozen model $f_0$ parameterized by $(\{A^{(d)}\}_{d=1}^D, W_B, W_C, W_{\Delta,1}, W_{\Delta,2}, W_{\text{in},1}, W_{\text{in},2})$.*

*The two models share $\{A^{(d)}\}_{d=1}^D$, $W_{\Delta,1}$, and $W_{\text{in},2}$, while differing in $W_B$, $W_C$, $W_{\Delta,2}$, and $W_{\text{in},1}$. Then there exists an updated projection matrix $\widehat{W}_{\text{in},1}$ such that the frozen model matches the output of the target model for any input sequence, without updating $W_B$, $W_C$, or $W_{\Delta,2}$, i.e.,*

$$f\big(\,\cdot\,; \{A^{(d)}\}_{d=1}^D, W_B, W_C, W_{\Delta,1}, W_{\Delta,2}, \widehat{W}_{\text{in},1}, W_{\text{in},2}\big) = f^*\big(\,\cdot\,; \{A^{(d)}\}_{d=1}^D, W_B^*, W_C^*, W_{\Delta,1}, W_{\Delta,2}^*, W_{\text{in},1}^*, W_{\text{in},2}\big).$$

We expand on this discussion in Sec. C.4, where we present both theoretical proofs and empirical validation. The lemma shows that tuning the linear projection matrix can match the expressive power of certain SSM parameters (i.e., $W_B$, $W_C$, and $W_{\Delta,2}$), aligning with our empirical observation that tuning only the linear projections already performs well. However, a key limitation remains: tuning only the linear projection matrices lacks the expressive power to affect the state matrix $A$, an essential parameter for sequence-to-sequence operations. Therefore, tuning the SSM modules is still necessary.
Existing PEFT methods fall short in effectively tuning SSM modules: (i) alternative methods underperform LoRA on SSM modules, and (ii) applying LoRA to SSM modules does not improve performance beyond applying it to linear projections alone. These findings highlight a gap in current PEFT techniques for SSM modules and lead to an important question: Is there a more effective strategy for fine-tuning SSM modules?

## 5. Sparse Dimension Tuning

This section develops an algorithm for tuning SSM modules. We start by analyzing the roles of different parameters, as outlined in Lemma 2. This analysis motivates us to classify channels and state dimensions into three categories: (i) zero, (ii) trainable, and (iii) frozen, leading to the Sparse Dimension Tuning and Pruning (SDT-P) method. We then establish theoretical guarantees for applying SDT-P to SSM modules together with LoRA on linear projection matrices (Theorem 1). Finally, we simplify SDT-P into Sparse Dimension Tuning (SDT) by omitting pruning, since pruned parameters can equivalently be considered as trained to zero. This simplified version is the primary method used in our experiments.

### 5.1. Understanding Key Parameters in S4 Modules

**Problem Setting.** Inspired by Zeng & Lee (2024), we analyze the expressive power of S4 parameters using a similar framework. We assume a well-performing target model and a frozen model (pretrained or random) and aim to update the frozen model efficiently to match the target. Following Zeng & Lee (2024), we assume the frozen model has capacity at least as large as the target model. This assumption ensures analytical tractability and is reasonable, as frozen models are typically overparameterized in practice. Both models are S4, with hidden dimensions $H^*$ (target) and $H \geq H^*$ (frozen).
Assuming all hidden dimensions are active (i.e., all parameters are non-zero), we define their dynamics using discretized parameters $(\bar{A}, \bar{B}, \bar{C})$:

$$\text{(Target model)} \quad f^*(x)_n = \sum_{m=1}^{n} \bar{C}^* (\bar{A}^*)^{n-m} \bar{B}^* x_m, \qquad \text{(Frozen model)} \quad f_0(x)_n = \sum_{m=1}^{n} \bar{C}_0 \bar{A}_0^{\,n-m} \bar{B}_0 x_m,$$

where $\mathrm{diag}(\bar{A}^*), \bar{B}^*, \bar{C}^{*\top} \in \mathbb{R}^{H^*}$ and $\mathrm{diag}(\bar{A}_0), \bar{B}_0, \bar{C}_0^\top \in \mathbb{R}^{H}$. This formulation shows that the S4 module remains unchanged if the state dimensions are permuted.

**Parameter Efficiency Analysis on S4.** We analyze the parameter efficiency of updating a frozen S4 module with discretized parameters $(\bar{A}_0, \bar{B}_0, \bar{C}_0)$ so that it matches the functionality of a target S4 module with discretized parameters $(\bar{A}^*, \bar{B}^*, \bar{C}^*)$. Based on this setup, we present the following result characterizing the minimum number of parameters that must be tuned for functional equivalence.

**Lemma 2** (Minimal Parameter Adjustment for S4 Fine-Tuning). *Assume all hidden dimensions of the target model $f^*$ are active, i.e., all elements of $\mathrm{diag}(\bar{A}^*) \odot \bar{B}^* \odot \bar{C}^{*\top}$ are non-zero. To update the frozen model $f_0$ so that it becomes functionally equivalent to the target model $f^*$, the minimum number of tunable parameters is*

$$\min_{(\bar{A}, \bar{B}, \bar{C}) \in \mathcal{P}} \;\; \underbrace{\big\lVert \big(\mathrm{diag}(\bar{A}) \odot \bar{B} \odot \bar{C}^\top\big)_{H^*+1:H} \big\rVert_0}_{\text{eliminating redundant dimensions}} \;+\; \underbrace{\big\lVert \bar{A}_{1:H^*,\,1:H^*} - \bar{A}^* \big\rVert_0 + \big\lVert (\bar{B} \odot \bar{C}^\top)_{1:H^*} - \bar{B}^* \odot \bar{C}^{*\top} \big\rVert_0}_{\text{aligning remaining dimensions with target model}}, \tag{5}$$

*where $\mathcal{P} = \{(P^\top \bar{A}_0 P, \; P^\top \bar{B}_0, \; \bar{C}_0 P) : P \text{ is a permutation matrix}\}$.*

Note that the search space consists of all S4 parameterizations obtainable by permuting the hidden dimensions of the frozen model. Proofs and further details are provided in Sec. D.1. This result highlights three distinct roles of the state dimensions. First, any dimensions that do not contribute to the target function (the first term in (5)) are effectively zero and can be pruned. These correspond to state dimensions beyond those of the target model after permutation, indicating that redundant information can be removed directly to eliminate its impact.
Second, among the remaining dimensions, alignment is necessary for those that do not already match the target. The state matrix $A$ plays a crucial role in sequence modeling by capturing dependencies between tokens at different positions. To achieve functional equivalence (the second term in (5)), $A$ must be aligned; notably, dimensions already aligned with the target require no updates. These two insights motivate our Sparse Dimension Tuning and Pruning (SDT-P) method, which classifies hidden dimensions into three categories: (i) zero, (ii) frozen (already aligned), and (iii) trainable. Finally, the third term in (5) indicates that the expressive power of $\bar{B}$ and $\bar{C}$ is essentially equivalent, meaning that tuning either one is equivalent to updating both.

### 5.2. Sparse Dimension Tuning and Pruning (SDT-P)

Building on Lemma 2, we introduce SDT-P, the precursor to Sparse Dimension Tuning (SDT). SDT-P updates parameters selectively based on the role of each state dimension. In the multi-channel case, we first categorize the channel dimensions into three groups: pruned, frozen, and trainable. Then the state dimensions of each trainable channel are categorized in the same way. This hierarchical selection ensures that updates are applied only where necessary, while pruned dimensions are discarded and frozen dimensions remain unchanged.

**Dimension Selection Algorithm.** To enable this structured tuning process, we first introduce our dimension selection algorithm. The algorithm starts with a warmup epoch, in which the SSM modules are updated on a subset of the dataset for one epoch. After this warmup, we classify channel dimensions based on the magnitude of the state matrix $A$: dimensions with small magnitude are pruned (set to zero), those with significant changes are marked as trainable, and the rest remain frozen. Next, we apply the same classification to state dimensions, but only within the trainable channels.
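A rough sketch of this selection rule follows; the thresholds and the magnitude/change scores are our illustrative assumptions, not the paper's exact criteria (the actual pseudo-code is in Sec. D.4):

```python
import numpy as np

def classify_dims(A_warm, A_init, tau_prune=0.05, tau_train=0.1):
    """Classify the channels of a (D x H) matrix of state-matrix diagonals
    after a warmup epoch: 'pruned' if the magnitude is small, 'trainable'
    if the warmup changed it a lot, 'frozen' otherwise. The scores used
    here (mean |A| and mean |A_warm - A_init| per channel) are illustrative."""
    mag    = np.abs(A_warm).mean(axis=1)            # per-channel magnitude
    change = np.abs(A_warm - A_init).mean(axis=1)   # per-channel warmup drift
    return np.where(mag < tau_prune, "pruned",
           np.where(change > tau_train, "trainable", "frozen"))

rng = np.random.default_rng(0)
A_init = rng.uniform(-1, 1, (8, 4))
A_warm = A_init.copy()
A_warm[0] *= 0.01          # channel 0 shrinks toward zero -> pruned
A_warm[1] += 0.5           # channel 1 moves substantially -> trainable
labels = classify_dims(A_warm, A_init)
```

The same rule would then be re-applied to the state dimensions inside each trainable channel, giving the hierarchical pruned/frozen/trainable partition described above.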
The detailed pseudo-code is in Sec. D.4.

Parameter Update Scheme. Once the channel and state dimensions are selected, we determine how to update the parameters. (S4) For S4, Gu et al. (2022a) showed that tuning C alone is as effective as tuning both B and C. Therefore, we always freeze B and update only A and C. Specifically, an entry in A or C is trainable if and only if both its channel and state dimensions are trainable. If either the channel or state dimension is pruned, the entry is pruned as well. All other entries remain frozen. (S6) For S6, where parameters are input-dependent, we update A, W_B, and W_C instead. Since W_B and W_C operate across channels, we categorize their updates based only on channel dimensions; we do not update individual state dimensions differently for each channel. Based on this categorization, we mark the corresponding columns of W_B and W_C as trainable, frozen, or pruned accordingly. The dimension selection algorithm and parameter updates together form the SDT-P method for tuning SSM modules. Next, we provide theoretical guarantees for applying SDT-P to SSM modules and LoRA to linear projection matrices.

5.3. Expressive Power of SDT-P Combined with LoRA

Our analysis focuses on simplified SSM-based models, where each layer consists of an SSM module followed by linear projection matrices with residual connections. We refer to this structure as a deep SSM layer: (i) a deep S4 layer consists of an S4 module followed by linear projections; (ii) a deep S6 layer follows the same structure but replaces S4 with S6. A deep S4 model is composed of deep S4 layers, and a deep S6 model of deep S6 layers. The detailed formulation of deep S4 layers is provided in Sec. 3. The following theorem highlights the expressive power of SDT-P on updating SSM modules, where each layer uses a single type of SSM module (S4 or S6) followed by linear projections.
Theorem 1 (Expressive Power of SDT-P with LoRA on Simplified SSM-based Models). Assume all layers use linear activations. Let f_0 be a frozen deep S4 or S6 model with L layers, each containing H hidden states per channel. Let f^* be a smaller target model of the same type (S4 or S6), with no residual connections, L^* < L layers, and H^* < H hidden states per channel. Then, there exists a set of parameter updates to f_0 satisfying the following conditions such that for any finite-length input sequence X = (x_1, . . . , x_N) with x_n \in \mathcal{X} \subseteq \mathbb{R}^D, where \mathcal{X} is bounded, the resulting model f satisfies f(X) = f^*(X):

1. (SDT-P on SSM) In each SSM module, update at most DL^*/L channels. Within each updated channel, fine-tune at most H^* hidden states and set the rest to zero.
2. (LoRA on Linear Projections) Apply rank-L^*/L updates to each linear projection matrix.
3. (Minimal Additional Updates) Update only the residual connections, per-layer biases, and the final-layer output projection matrix.

For proof and details, refer to Sec. D.2 and D.3. This theorem shows that a larger pretrained model can be fine-tuned into any smaller model of the same architecture by applying SDT-P to SSM modules and LoRA to linear projection matrices. Moreover, for less complex tasks, where the target model has fewer layers (L^*) and hidden states (H^*), the required number of trainable channels and hidden states also decreases. This aligns with the theoretical analysis of LoRA by Zeng & Lee (2024), which demonstrates that larger pretrained models require fewer learnable parameters (i.e., a lower-rank update) during fine-tuning, especially for simpler tasks. While our theorem assumes linear activations, no residual connections in the target model, and full fine-tuning of the last-layer projection matrix, our findings have broader implications. As our experimental results in Sec. 6 will show, these insights generalize beyond these theoretical constraints.
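The SDT-P per-entry update rule from Sec. 5.2 (trainable iff both the channel and state dimension are trainable; pruned if either is pruned; frozen otherwise) can be rendered as a toy mask construction. All labels and helper names here are our own, not the authors' code:

```python
# Toy rendering of the SDT-P per-entry status for A and C in an S4 module.
# State dimensions are categorized only within trainable channels; frozen
# channels keep all entries frozen, pruned channels drop all entries.

TRAIN, FREEZE, PRUNE = "trainable", "frozen", "pruned"

def entry_status(channel_status, state_status):
    if PRUNE in (channel_status, state_status):
        return PRUNE                       # pruned channel or state: drop entry
    if channel_status == TRAIN and state_status == TRAIN:
        return TRAIN                       # both trainable: update entry
    return FREEZE                          # otherwise: keep entry fixed

def build_mask(channel_cats, state_cats):
    """channel_cats: one category per channel; state_cats: one category per
    (channel, state) pair. Returns the D x H grid of entry statuses."""
    return [[entry_status(c, s) for s in states]
            for c, states in zip(channel_cats, state_cats)]

channels = [TRAIN, FREEZE, PRUNE]          # D = 3 channels
states = [[TRAIN, FREEZE],                 # trainable channel: states categorized
          [FREEZE, FREEZE],                # frozen channel: all states frozen
          [PRUNE, PRUNE]]                  # pruned channel: all states pruned
mask = build_mask(channels, states)
assert mask == [[TRAIN, FREEZE], [FREEZE, FREEZE], [PRUNE, PRUNE]]
```

For S6, the same categorization would apply per column of W_B and W_C at channel granularity only, as described in the parameter update scheme.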
Algorithm 1 Dimension Selection Algorithm of SDT
Input: a small subset of the dataset D, warmup epochs E, number of layers L, total channels D, total states H, channel freeze ratio α, state freeze ratio β
  /* Warmup epochs */
  Perform a full update of the SSM modules using D for E epochs
  for l = 1 to L do
    /* Unfreeze channel dimensions */
    Sort the channels by the change in A^(d)
    Freeze the bottom α|D| channels; let D′ denote the remaining channels
    for each channel d ∈ D′ do
      /* Unfreeze state dimensions */
      Sort the state dimensions of the d-th channel by the change in A^(d)
      Freeze the bottom β|H| state dimensions of the d-th channel

5.4. Sparse Dimension Tuning (SDT): A Pruning-Free Alternative

While SDT-P classifies channels and states into three categories, we simplify our approach by omitting pruning and categorizing parameters as either trainable or frozen. We refer to this simplified method as Sparse Dimension Tuning (SDT). This reduces the number of hyperparameters, as pruned parameters are effectively equivalent to parameters trained to zero. The resulting dimension selection approach is outlined in the pseudo-code (Alg. 1), which corresponds to the update scheme illustrated in Fig. 1. Experiments will show that this simplification remains effective.

Overhead Analysis. We assess the computational overhead of applying SDT with LoRA (for linear projection matrices) versus LoRA alone, with Table 2 summarizing the results. Although SDT involves an additional dimension selection stage, Table 2 shows that this incurs minimal extra cost. Furthermore, with the same parameter budget, SDT for SSM modules combined with LoRA on linear projections runs faster than LoRA alone, since LoRA introduces extra matrix multiplications between two low-rank matrices for the SSM modules, whereas SDT does not. In Sec. D.6, we detail the experimental settings and present a memory usage analysis showing that SDT also consumes less memory during fine-tuning for the same reason.

Stage                 Method      Mamba-130M      Mamba-1.4B       Jamba-Mini-52B
Dim. Selection        LoRA & SDT  16.5 ± 3.9      85.8 ± 5.3       163.9 ± 10.2
Training (per epoch)  LoRA        410.0 ± 80.0    2060.0 ± 135.0   3427.5 ± 185.0
                      LoRA & SDT  330.0 ± 77.5    1697.5 ± 87.5    3065.0 ± 232.5

Table 2. PEFT combining SDT with LoRA is more efficient than LoRA alone when the same number of trainable parameters is used. Shown are dimension selection and per-epoch training times (s) for Mamba and Jamba models.

6. Experimental Studies of SDT

In this section, we evaluate the performance of SDT in tuning SSM modules, comparing it to LoRA, the best existing PEFT method for fine-tuning SSM modules, as shown in Sec. 4. Our experiments reveal the key result:

Finding: SDT outperforms LoRA on SSM modules.

6.1. Synthetic Experiments on Deep S4 Models

This experiment validates our theoretical guarantees under broader conditions, including residual connections and ReLU activations in both models, without fully fine-tuning the last-layer projection matrix. See Sec. E.1 for details.

(Experimental Setup) We employ a regression setting to validate our theoretical results. We randomly initialize two models: a one-layer deep S4 model as the target and a four-layer deep S4 model as the frozen model. LoRA is applied to linear projection matrices, while different methods are tested on the SSM module to assess their effectiveness. The goal is to update the frozen model to match the target model's functionality.

Figure 2. SDT surpasses LoRA in tuning S4 within deep S4 models when LoRA is applied to linear projection matrices in synthetic experiments.

We generate an input sequence X of length 200 and dimension 64, with values uniformly drawn from integers between 0 and 9. This input is then processed through the target model to obtain the corresponding outputs. These input-output pairs are used to train the frozen model over 500 iterations using the Mean Squared Error (MSE) loss.
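The matching protocol of this setup can be sketched in miniature: label integer-valued inputs with a fixed "target" map, then fit a student's tunable parameters by MSE gradient descent. This is a heavily simplified, hedged rendering; the linear one-parameter student, the `target` map, and the learning rate are illustrative stand-ins, not the deep S4 models of the actual experiment:

```python
import random

# Toy teacher-student MSE fit: 200 integer-valued inputs in 0..9 (as in the
# synthetic setup), labels from a fixed target map, 500 training iterations.
random.seed(0)

def target(x):                      # fixed target map (illustrative stand-in)
    return 2.0 * x - 1.0

xs = [float(random.randint(0, 9)) for _ in range(200)]
ys = [target(x) for x in xs]

w, b, lr = 0.0, 0.0, 0.02           # tunable student parameters and step size
for _ in range(500):                # 500 iterations of MSE gradient descent
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w, b = w - lr * grad_w, b - lr * grad_b

mse = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
assert mse < 1e-2                   # student has matched the target closely
```

In the paper's experiment the student is the frozen four-layer deep S4 model and only the parameters allowed by each PEFT method are updated; the quantity plotted in Figure 2 is exactly this token-averaged MSE.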
(Results) Figure 2 shows the MSE, averaged across all tokens, plotted against the number of trainable parameters for different methods on SSM modules. SDT achieves significantly lower MSE than LoRA on SSM modules, demonstrating its effectiveness in updating SSM modules.

6.2. Real-World Experiments on Pretrained Models

Lastly, we conduct experiments to evaluate our approach on pretrained models, including Mamba and Jamba with different model sizes. We consider five datasets: GLUE, DART, SAMSum, Spider, and CelebA. For these experiments, we split the datasets into three parts (train, validation, and test), unlike the benchmarking experiments. We combine our proposed SDT with LoRA and evaluate it in three different settings against three pure LoRA settings. In SDT, 99% of channels are frozen, and we adjust state freeze ratios. For the pure LoRA settings, we apply LoRA to different parameter sets, selecting ranks to ensure all settings have a comparable parameter budget for fair comparison. Residual connections and biases are frozen, and learning rates are independently selected via a small grid search over data subsets. See Sec. E.2 for further details.

Mamba. The experimental results for Mamba are reported in Table 3, showing that applying SDT to SSM modules outperforms pure LoRA, even when 99% of the channels are frozen. This underscores the effectiveness of SDT in fine-tuning SSM modules.

Jamba. We extend our experiments to Jamba, applying all tested methods exclusively to its Mamba layers. Notably, the performance gain on Jamba is smaller compared to Mamba. This is because we freeze all Transformer layers to isolate the effect of Mamba layers for a fair evaluation. Additionally, since the Mamba layers in Jamba contain significantly fewer parameters than those in the Mamba model, fine-tuning them yields limited performance improvements. Nevertheless, results on GLUE (Table 4) validate the effectiveness of our method. See Table 22 for more results.

Lin Proj  S6    GLUE (Avg.)  DART (BLEU / MET.)  CelebA (Acc.)  SAMSum (R1 / R2 / RL)  Spider (Acc.)
LoRA      LoRA  80.8         51.0 / 70.2         88.6           51.6 / 28.2 / 43.2     83.5
LoRA      SDT   81.1         51.5 / 70.5         88.6           51.7 / 28.1 / 43.4     84.5
DoRA      DoRA  80.1         51.2 / 70.4         88.4           51.8 / 28.0 / 43.4     83.8
DoRA      SDT   78.2         51.5 / 70.8         88.6           52.1 / 28.3 / 43.7     85.1

Table 3. Performance comparison between SDT and LoRA on pretrained Mamba models. Bold numbers indicate the best performance for each task. We use Mamba-130M to compare the performance of SDT and LoRA on the GLUE (Wang et al., 2019), DART (Nan et al., 2021), and CelebA (Liu et al., 2015) benchmarks. For all other datasets, we employ Mamba-1.4B. We report only the best setting out of three for each method. We observe that SDT outperforms LoRA in updating SSM modules on Mamba.

Lin Proj  S6    RTE   MRPC  CoLA  SST-2  QNLI  QQP   MNLI  Avg.
DoRA      DoRA  65.7  77.8  7.1   93.9   77.8  67.8  85.4  67.9
DoRA      SDT   67.1  77.5  7.5   94.2   79.6  72.7  85.5  69.2

Table 4. Performance comparison between SDT and DoRA on pretrained Jamba models. Bold numbers indicate the best performance for each task. We use Jamba-Tiny-319M to compare the performance of SDT and DoRA on the GLUE (Wang et al., 2019) benchmark. We report only the best setting out of three for each method. We observe that SDT outperforms DoRA in updating SSM modules on Jamba.

7. Discussion

In this paper, we study PEFT methods applied to SSM-based models. Our evaluation of existing PEFT methods provides valuable insights and guidelines for future researchers seeking to parameter-efficiently fine-tune SSM-based models in other domains. Moreover, we take an initial step toward establishing a theoretical framework for studying PEFT methods on SSM-based models. Additionally, we introduce SDT, a PEFT method specifically tailored to SSM modules, demonstrating superior performance compared to existing approaches.

Limitations & Future Works.
The theoretical guarantees for SDT are restricted to linear activations and require full fine-tuning of the last layer. Nonetheless, our experiments show that SDT performs well in practice despite these constraints. Addressing these theoretical limitations or developing new PEFT methods applicable to broader scenarios is a promising future direction. Additionally, our theory shows that modifying a subset of channels and states is sufficient but does not guide optimal selection. Our approach, based on a warmup stage and parameter magnitude, might not be optimal. Future research could explore the impact of channel/state selection and improve dimension selection algorithms.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgment

This work is supported by NSF Award DMS-2023239, NSF CAREER Award CCF-2339978, an Amazon Research Award, and a grant from FuriosaAI.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pp. 1877-1901, 2020.

Dao, T. and Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning, 2024.

Dinh, T., Zeng, Y., Zhang, R., Lin, Z., Gira, M., Rajput, S., yong Sohn, J., Papailiopoulos, D., and Lee, K. LIFT: Language-interfaced fine-tuning for non-language machine learning tasks. In Advances in Neural Information Processing Systems, 2022.

Fu, D.
Y., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and Re, C. Hungry hungry hippos: Towards language modeling with state space models. In International Conference on Learning Representations, 2022.

Giannou, A., Rajput, S., and Papailiopoulos, D. The expressive power of tuning only the normalization layers. In The Thirty Sixth Annual Conference on Learning Theory, pp. 4130-4131, 2023.

Gliwa, B., Mochol, I., Biesek, M., and Wawer, A. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. EMNLP-IJCNLP 2019, pp. 70, 2019.

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024.

Gu, A., Dao, T., Ermon, S., Rudra, A., and Ré, C. HiPPO: Recurrent memory with optimal polynomial projections. In Advances in Neural Information Processing Systems, volume 33, pp. 1474-1487, 2020.

Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. In Advances in Neural Information Processing Systems, volume 34, pp. 572-585, 2021.

Gu, A., Goel, K., Gupta, A., and Ré, C. On the parameterization and initialization of diagonal state space models. In Advances in Neural Information Processing Systems, volume 35, pp. 35971-35983, 2022a.

Gu, A., Goel, K., and Re, C. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022b.

Gupta, A., Gu, A., and Berant, J. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982-22994, 2022.

Halloran, J. T., Gulati, M., and Roysdon, P. F. Mamba state-space models can be strong downstream learners. arXiv preprint arXiv:2406.00209, 2024.

Hayou, S., Ghosh, N., and Yu, B. LoRA+: Efficient low rank adaptation of large models. In Proceedings of the 41st International Conference on Machine Learning, pp. 17783-17806, 2024.
He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., and Neubig, G. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations, 2021.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pp. 2790-2799, 2019.

Hu, E. J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.

Hu, Z., Wang, L., Lan, Y., Xu, W., Lim, E.-P., Bing, L., Xu, X., Poria, S., and Lee, R. LLM-Adapters: An adapter family for parameter-efficient fine-tuning of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 5254-5276, 2023.

Jang, U., Lee, J. D., and Ryu, E. K. LoRA training in the NTK regime has no spurious local minima. In International Conference on Machine Learning, 2024.

Kang, W., Galim, K., Zeng, Y., Lee, M., Koo, H. I., and Cho, N. I. State-offset tuning: State-based parameter-efficient fine-tuning for state space models. arXiv preprint arXiv:2503.03499, 2025.

Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156-5165, 2020.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045-3059, 2021.

Li, X. L. and Liang, P. Prefix-Tuning: Optimizing continuous prompts for generation.
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4582-4597, 2021.

Lieber, O., Lenz, B., Bata, H., Cohen, G., Osin, J., Dalmedigos, I., Safahi, E., Meirom, S., Belinkov, Y., Shalev-Shwartz, S., et al. Jamba: Hybrid transformer-mamba language models. In The Thirteenth International Conference on Learning Representations, 2025.

Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., and Chen, M.-H. DoRA: Weight-decomposed low-rank adaptation. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pp. 32100-32121, 2024.

Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., and Tang, J. GPT understands, too. arXiv preprint arXiv:2103.10385, 2021.

Liu, X., Ji, K., Fu, Y., Tam, W., Du, Z., Yang, Z., and Tang, J. P-Tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 61-68, 2022.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730-3738, 2015.

Nan, L., Radev, D., Zhang, R., Rau, A., Sivaprasad, A., Hsieh, C., Tang, X., Vyas, A., Verma, N., Krishna, P., et al. DART: Open-domain structured data record to text generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 432-447, 2021.

Oymak, S., Rawat, A. S., Soltanolkotabi, M., and Thrampoulidis, C. On the role of attention in prompt-tuning. In International Conference on Machine Learning, pp. 26724-26768, 2023.

Panigrahi, A., Saunshi, N., Zhao, H., and Arora, S. Task-specific skill localization in fine-tuned language models. In International Conference on Machine Learning, pp.
27011-27033, 2023.

Park, J., Park, J., Xiong, Z., Lee, N., Cho, J., Oymak, S., Lee, K., and Papailiopoulos, D. Can Mamba learn how to learn? A comparative study on in-context learning tasks. In International Conference on Machine Learning, pp. 39793-39812, 2024.

Peng, B., Alcaide, E., Anthony, Q. G., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M. N., Derczynski, L., et al. RWKV: Reinventing RNNs for the transformer era. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

Petrov, A., Torr, P. H., and Bibi, A. When do prompting and prefix-tuning work? A theory of capabilities and limitations. In International Conference on Learning Representations, 2024.

Scholak, T., Schucher, N., and Bahdanau, D. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 9895-9901, 2021.

Song, W., Li, Z., Zhang, L., Zhao, H., and Du, B. Sparse is enough in fine-tuning pre-trained large language models. In Proceedings of the 41st International Conference on Machine Learning, pp. 46121-46135, 2024.

Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019.

Wang, H., Agarwal, S., and Papailiopoulos, D. Pufferfish: Communication-efficient models at no extra cost.
In Proceedings of Machine Learning and Systems, volume 3, pp. 365-386, 2021.

Wang, H., Agarwal, S., Tanaka, Y., Xing, E., Papailiopoulos, D., et al. Cuttlefish: Low-rank model training without all the tuning. Proceedings of Machine Learning and Systems, 5, 2023a.

Wang, Y., Chauhan, J., Wang, W., and Hsieh, C.-J. Universality and limitations of prompt tuning. In Advances in Neural Information Processing Systems, 2023b.

Yoshimura, M., Hayashi, T., and Maeda, Y. MambaPEFT: Exploring parameter-efficient fine-tuning for Mamba. In The Thirteenth International Conference on Learning Representations, 2025.

Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., Ma, J., Li, I., Yao, Q., Roman, S., et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3911-3921, 2018.

Zaken, E. B., Goldberg, Y., and Ravfogel, S. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1-9, 2022.

Zeng, Y. and Lee, K. The expressive power of low-rank adaptation. In International Conference on Learning Representations, 2024.

Appendix

A Additional Related Works
  A.1 Additional Related Works on SSMs
  A.2 Additional Related Works on PEFT
B Details of Datasets
C Details of Sec. 4: Benchmarking PEFT Methods on SSM-based Models
  C.1 Experiment Setup
  C.2 Extended Results on Benchmarking Existing PEFT Methods
  C.3 Limitations of Applying Input-injection Methods on SSMs
  C.4 Optimal Application of LoRA in SSM-based Models
D Details of Sec. 5: SDT
  D.1 Understanding the Roles of State Matrix A, Input Transition Vector B, and Output Mapping Vector C for a Single Channel in S4 Modules
  D.2 Extension to Deep S4 Models
  D.3 Extension to S6
  D.4 Sparse Dimension Tuning and Pruning (SDT-P)
  D.5 Extension to S5
  D.6 Memory Usage and Runtime Analysis of SDT
E Expanded Sec. 6: Evaluation of SDT
  E.1 Experiments on Deep S4 Models
  E.2 Experiments on Mamba-II, Jamba, and LoRA+

A. Additional Related Works

A.1. Additional Related Works on SSMs

Linear State-Space Layers (LSSL) represent one of the earliest SSM layers utilized in deep learning, functioning as continuous-time, recurrent, and convolutional models (Gu et al., 2021). LSSL employs HiPPO theory (Gu et al., 2020) to initialize the state matrix A, enabling the capture of long dependencies. However, LSSL is computationally expensive, limiting its practical application. Gu et al. (2022b) introduced Structured State Space Models (S4), which optimize computation efficiency by employing a structured state matrix A. Gupta et al.
(2022) proposed DSS, which simplifies the model by using a diagonal matrix for A, and empirically demonstrated that this suffices to achieve performance comparable to S4. Further, Gu et al. (2022a) provided a theoretical explanation for the effectiveness of the diagonal state matrix A in DSS and introduced S4D, which offers various initialization methods for A. Subsequently, the diagonal structure of the state matrix A has been adopted in follow-up methods (Gu & Dao, 2024). Despite differences in optimization algorithms, we refer to S4 and its close variants, including DSS and S4D, collectively as S4. This terminology encompasses models that maintain the standard discrete-time SSM form with a diagonal state matrix.

Despite the remarkable performance of SSMs on certain sequence modeling tasks, SSMs still performed worse than Transformers on language modeling. Fu et al. (2022) transitioned from synthetic language modeling tasks to real language modeling tasks with SSMs. They proposed H3, which is inspired by Linear Attention (Katharopoulos et al., 2020), introducing both a diagonal SSM and a shift SSM. Recently, Mamba (Gu & Dao, 2024; Dao & Gu, 2024) moved beyond linear time-invariant (LTI) modeling by introducing input-dependent terms and achieved better performance than Transformers on language modeling. Furthermore, several hybrid models (Lieber et al., 2025; Park et al., 2024) exploit the advantages of both SSMs and Transformers.

A.2. Additional Related Works on PEFT

In this section, we provide a more detailed description of the baseline methods.

LoRA (Hu et al., 2021). LoRA (Low-Rank Adaptation) focuses on fine-tuning large models by freezing pretrained parameters and injecting trainable low-rank matrices into each layer of the Transformer architecture. The intuition behind using low-rank matrices comes from linear algebra, where a large matrix can be closely approximated by the product of two smaller matrices.
The number of trainable parameters can be controlled via the rank of the low-rank matrices. LoRA also uses a scaling parameter (LoRA alpha) for the weight matrices to control the balance between the original model weights and the LoRA weights during training. After fine-tuning, the LoRA weights can be merged with the original model weights, introducing no additional inference overhead.

Prompt Tuning (Lester et al., 2021). Prompt tuning freezes all model weights and prepends a trainable soft prompt to the input prompt. The soft prompt consists of trainable, continuous virtual tokens. At inference time, prompt tuning introduces an overhead proportional to the number of virtual tokens used.

Prefix-Tuning (Li & Liang, 2021). Prefix-tuning also prepends trainable tokens to the input, like prompt tuning, but injects separate prefixes in every layer. For each Transformer layer, prefix-tuning prepends trainable embeddings to the attention's K and V matrices. The authors found that directly training these prefixes can lead to unstable training, so they propose to over-parameterize them with a large MLP to increase training stability. After training, the MLP can be dropped. Like prompt tuning, prefix-tuning introduces an inference overhead that scales with the number of trainable embeddings.

BitFit (Zaken et al., 2022). BitFit is a simple but effective PEFT method that freezes all model weights except the bias terms, consequently greatly reducing the number of trainable parameters. As no additional parameters are added, no inference overhead occurs.

Theoretical Understanding of PEFT. Numerous efforts have been made to theoretically understand existing PEFT methods. For input-injection methods, Wang et al. (2023b), Petrov et al. (2024), and Oymak et al. (2023) have theoretically analyzed the effectiveness and limitations of prompt tuning and prefix-tuning for Transformer-based models.
For LoRA, Zeng & Lee (2024) explored its expressive power by demonstrating that even a randomly initialized model can be adapted to match any smaller target model using LoRA. Some of our theoretical analysis draws upon the framework established by Zeng & Lee (2024). Jang et al. (2024) conducted a theoretical exploration of LoRA within the neural tangent kernel (NTK) regime.

B. Details of Datasets

In this paper, we consider six datasets across three domains: (i) Natural Language Understanding (NLU), represented by GLUE (Wang et al., 2019); (ii) Natural Language Generation (NLG), including SAMSum (Gliwa et al., 2019), Spider (Yu et al., 2018), and DART (Nan et al., 2021); and (iii) Computer Vision (CV), represented by CIFAR-10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015).

GLUE (Wang et al., 2019). The GLUE (General Language Understanding Evaluation) benchmark is a collection of datasets used for training, evaluating, and analyzing natural language understanding models across a range of diverse tasks. The benchmark includes nine sentence- or sentence-pair language understanding tasks that require various aspects of understanding, such as sentiment analysis, linguistic acceptability, semantic textual similarity, and question answering. We use seven datasets from the GLUE benchmark (RTE, MRPC, CoLA, SST-2, QNLI, QQP, MNLI), where the model has to choose between two or three (for MNLI) different choices for the respective task. Except for CoLA, we evaluate all used datasets with the accuracy metric. For CoLA, Matthews correlation is employed.

SAMSum (Gliwa et al., 2019). SAMSum is a dataset for dialogue summarization research, comprising approximately 16,000 synthetic text conversations with accompanying summaries. Created by English-fluent linguists, these exchanges simulate real-world digital communications across various topics and styles.
The conversations range from informal to formal, incorporating elements like slang and emoticons to reflect authentic messaging patterns. Each dialogue is paired with a concise, third-person summary capturing its essential content. This structure makes SAMSum particularly useful for developing and evaluating automated summarization systems capable of processing conversational text.

Spider (Yu et al., 2018). Spider is a large-scale, complex, and cross-domain semantic parsing and text-to-SQL dataset. It contains about 10,000 annotated SQL queries, distributed across 200+ databases, each with multiple tables. We follow Scholak et al. (2021) and use about 7,000 examples for training and about 1,000 examples for validation, where we ignore sequences longer than 1536 tokens. The dataset consists of English question and SQL query pairs, which cover a wide range of SQL operations including SELECT, WHERE, COUNT, GROUP BY, ORDER BY, JOIN, and more. Given an English question and an SQL database schema, the task for the model is to translate the English question into an appropriate SQL statement. Evaluation is performed via accuracy, where the output is considered correct if the model's predicted SQL query and the included ground-truth SQL query give the same result when executed on the database. The dataset additionally categorizes each query into easy (25%), medium (40%), hard (20%), and extra hard (15%) based on the complexity of the required SQL statement. For evaluation, we report the execution accuracy over all categories.

DART (Nan et al., 2021). The DART (DAta Record to Text) benchmark is a large-scale, structured dataset designed for RDF-to-text (Resource Description Framework-to-text) generation with 80,000+ instances. The DART benchmark is composed of a collection of structured data triples and corresponding text summaries, which are organized into different categories.
The task of the DART benchmark is to generate natural language summaries that correctly represent the given structured data inputs. DART is typically evaluated with METEOR and BLEU.

CIFAR-10 (Krizhevsky et al., 2009). The CIFAR-10 (Canadian Institute For Advanced Research) dataset is a collection of images commonly used to train machine learning and computer vision algorithms, and it is one of the most widely used datasets for image classification. CIFAR-10 contains 60,000 (50,000 for training, 10,000 for validation) 32×32 color images in 10 classes: airplane, car, bird, cat, deer, dog, frog, horse, ship, and truck, with 6,000 images per class. For training, we center-crop each image to 24×24 pixels and flatten it into a string with a total of 24×24×3 words, where each word is a number between 0 and 255 representing the respective pixel value. Although CIFAR-10 is a dataset for computer vision, previous work (Dinh et al., 2022) showed that Transformers can be adapted to the vision domain from the language domain. In our work, we extend this investigation to SSMs, examining their ability to perform on vision data.

CelebA (Liu et al., 2015). The CelebA (CelebFaces Attributes) dataset is an extensive collection of more than 200,000 celebrity images, each tagged with 40 attributes. This dataset is notable for its diversity, volume, and comprehensive annotations, encompassing 10,177 distinct identities, 202,599 facial images, and annotations of five landmark points with 40 binary attributes per image. The dataset, which includes images with varied poses and complex backgrounds, is an essential resource for computer vision tasks such as face recognition, attribute analysis, detection, and facial landmark localization, and it offers significant utility in face editing and synthesis.

The dataset characteristics, including our train, validation, and test set sizes, sequence lengths, and number of epochs, are summarized in Table 5.

| Dataset | Size (Train) | Size (Val) | Size (Test) | Max. seq. len. | #Epochs | Mamba Size | Jamba Size | Metrics |
|---|---|---|---|---|---|---|---|---|
| RTE | 1992 | 498 | 277 | 291 | 10 | 130M | 319M | Accuracy |
| MRPC | 2934 | 734 | 408 | 105 | 10 | 130M | 319M | Accuracy |
| CoLA | 6840 | 1711 | 1043 | 47 | 10 | 130M | 319M | Matthews corr. |
| SST-2 | 53879 | 13470 | 872 | 68 | 10 | 130M | 319M | Accuracy |
| QNLI | 83794 | 20949 | 5463 | 602 | 10 | 130M | 319M | Accuracy |
| QQP | 291076 | 72770 | 40430 | 316 | 3 | 130M | 319M | Accuracy |
| MNLI | 314161 | 78541 | 19647 | 425 | 3 | 130M | 319M | Accuracy |
| Spider | 5543 | 1375 | 1034 | 1412 | 10 | 1.4B, 2.8B | 52B | Accuracy |
| SAMSum | 14732 | 818 | 819 | 1174 | 10 | 1.4B | 52B | ROUGE |
| DART | 62659 | 2768 | 5097 | 491 | 10 | 130M | 52B | METEOR, BLEU |
| CIFAR-10 | 40000 | 10000 | 10000 | 1730 | 5 | 130M | 319M | Accuracy |
| CelebA | 162770 | 19867 | 19962 | 12614 | 3 | 130M | 319M | Accuracy |

Table 5. Datasets and models for our experiments. For each dataset, we report the number of training, validation, and test samples, the maximum sequence length, the number of training epochs, the model sizes, and the evaluation metric used.

C. Details of Sec. 4: Benchmarking PEFT Methods on SSM-based Models

In this section, we provide a comprehensive experimental setup, proofs and further discussion of theoretical results, and more detailed experimental outcomes.

C.1. Experiment Setup

For each dataset, we choose the model size depending on how challenging the dataset is and perform a small grid search for one epoch on a subset of the data (1k-2k instances) with learning rates {4×10^-1, 2×10^-1, 1×10^-1, ..., 1×10^-5} to find the optimal learning rate for each PEFT method. We only report the validation metric of the best epoch during training (early stopping) in our results.
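The selection procedure described above can be sketched as a simple sweep-and-report loop. This is an illustrative sketch only, not the training code used in the paper; `run_one_epoch`, `sweep_learning_rates`, and `report_best_epoch` are hypothetical names, and the actual training and evaluation functions are stand-ins.

```python
def sweep_learning_rates(candidates, run_one_epoch):
    """Grid search: pick the learning rate whose one-epoch validation
    metric (on a small data subset) is highest.

    `run_one_epoch` is a hypothetical callable mapping lr -> validation metric.
    """
    best_lr, best_metric = None, float("-inf")
    for lr in candidates:
        metric = run_one_epoch(lr)
        if metric > best_metric:
            best_lr, best_metric = lr, metric
    return best_lr


def report_best_epoch(val_metrics_per_epoch):
    """Early stopping as reporting: return the best validation metric
    observed across all training epochs."""
    return max(val_metrics_per_epoch)
```

For instance, with a toy metric that peaks at lr = 0.2, `sweep_learning_rates([0.4, 0.2, 0.1], lambda lr: -abs(lr - 0.2))` selects 0.2, and `report_best_epoch` simply takes the maximum of the per-epoch validation metrics.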
We fine-tune pretrained Mamba and Jamba models with AdamW using a linear learning rate decay schedule. For LoRA, we set the rank to 8, alpha to 8, and dropout to 0.1 for all experiments. For evaluating NLG tasks, we employ beam search with five beams and a maximum beam length of 1024.

C.2. Extended Results on Benchmarking Existing PEFT Methods

Mamba-I. We present comprehensive fine-tuning results for the GLUE benchmark (Wang et al., 2019), the DART dataset (Nan et al., 2021), the SAMSum dataset (Gliwa et al., 2019), the Spider dataset (Yu et al., 2018), and CIFAR-10 (Krizhevsky et al., 2009) in Table 6, Table 7, Table 8, Table 9, and Table 10, respectively. These experimental results encompass various LoRA configurations (on different weight matrices and modules) and provide more fine-grained results across all subtasks.

Mamba-II. Table 11 and Table 12 present the benchmark results of LoRA and full fine-tuning across different layers of Mamba-II. We follow the same experimental setup used for Mamba-I and demonstrate that our conclusion also holds on Mamba-II: LoRA is more effective on linear projection layers than on SSM modules.

Jamba. Table 13 presents the benchmark results of LoRA and full fine-tuning across different layers of Jamba. Our findings demonstrate that, on Jamba, LoRA is more effective on linear projection layers than on SSM modules, which aligns with our conclusion on Mamba.

| Layer | Method | # Params (%) | RTE | MRPC | CoLA | SST-2 | QNLI | QQP | MNLI | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| – | Pretrained | 0.00 | 46.9 | 67.9 | 0.0 | 52.4 | 50.5 | 36.8 | 32.3 | 41.0 |
| All | Full | 100.00 | 71.1 | 80.6 | 63.2 | 92.2 | 87.4 | 87.9 | 80.8 | 80.5 |
| All | LoRA | 1.92 | 69.9 | 80.9 | 61.4 | 91.9 | 88.4 | 87.6 | 81.1 | 80.2 |
| Prompt | Prompt Tuning (16 tokens) | 0.01 | 56.0 | 71.6 | 12.0 | 89.4 | 76.8 | 79.6 | 61.5 | 63.8 |
| Prompt | Prefix-Tuning (1 token, no MLP) | 0.03 | 67.5 | 75.7 | 43.4 | 91.5 | 83.4 | 83.1 | 35.6 | 68.6 |
| Bias β, Conv1d | BitFit | 0.06 | 69.5 | 80.4 | 54.7 | 92.0 | 86.2 | 85.3 | 77.2 | 77.9 |
| Linear Projections: All | LoRA | 1.02 | 70.0 | 82.4 | 57.7 | 93.3 | 88.7 | 88.7 | 82.5 | 80.5 |
| Linear Projections: Win,x | LoRA | 0.34 | 70.4 | 82.1 | 57.4 | 91.7 | 88.3 | 87.7 | 81.2 | 79.8 |
| Linear Projections: Win,z | LoRA | 0.34 | 70.0 | 82.4 | 58.1 | 92.4 | 87.3 | 87.3 | 80.4 | 79.7 |
| Linear Projections: Win,x, Win,z | LoRA | 0.68 | 70.4 | 84.3 | 62.4 | 92.5 | 88.6 | 88.3 | 81.7 | 81.2 |
| Linear Projections: Wout | LoRA | 0.34 | 70.4 | 82.8 | 60.6 | 92.4 | 88.4 | 87.7 | 81.5 | 80.5 |
| SSM: All | Full | 4.31 | 69.7 | 78.9 | 59.1 | 91.5 | 88.1 | 87.5 | 80.5 | 79.3 |
| SSM: All | LoRA | 0.92 | 66.1 | 78.7 | 57.8 | 90.8 | 87.8 | 86.9 | 79.8 | 78.3 |
| SSM: A | Full | 0.46 | 68.2 | 82.1 | 54.2 | 90.9 | 86.4 | 87.9 | 79.4 | 78.4 |
| SSM: WB, WC, W∆,↓ | Full | 2.28 | 69.7 | 77.0 | 55.8 | 91.4 | 85.4 | 85.0 | 76.8 | 77.3 |
| SSM: WB, WC, W∆,↓ | LoRA | 0.69 | 67.9 | 78.9 | 48.8 | 91.4 | 86.9 | 85.8 | 78.6 | 76.9 |
| SSM: W∆,↑ | Full | 1.40 | 66.1 | 75.2 | 56.7 | 91.1 | 86.2 | 87.1 | 78.5 | 77.3 |
| SSM: W∆,↑ | LoRA | 0.23 | 67.1 | 79.9 | 55.1 | 90.9 | 52.7 | 86.6 | 78.7 | 73.0 |
| SSM: Conv1d | Full | 0.14 | 68.2 | 78.4 | 57.9 | 91.1 | 86.0 | 86.0 | 78.0 | 77.9 |
| Others: D, LayerNorm | Full | 0.04 | 65.3 | 79.2 | 40.3 | 91.1 | 83.9 | 86.0 | 67.0 | 73.3 |

Table 6. Full benchmark results on the GLUE (Wang et al., 2019) benchmark using Mamba-I 130M. We report accuracy (↑) for the RTE, MRPC, SST-2, QNLI, QQP, and MNLI tasks. CoLA performance is measured using the Matthews correlation coefficient (↑). In each Mamba block, Win,x and Win,z are input projections that preprocess the input for the SSM module and the gating branch, respectively. Wout denotes the output projection after the gating mechanism. WB and WC are the weight matrices for computing the input-dependent Bn and Cn. W∆,↓ and W∆,↑ represent the down and up projections of the low-rank weight matrices in the linear layer computing the input-dependent step size ∆n. β represents the bias in this linear layer. D denotes the weight of the residual connections.

C.3.
Limitations of Applying Input-injection Methods on SSMs

We start by introducing the necessary notation. Denote the space of S4 mechanisms with $D$ channels as $\mathcal{F}_{\mathrm{S4},D}$. Let $H_0 = (h_0^{(1)}, h_0^{(2)}, \ldots, h_0^{(D)}) \in \mathbb{R}^{H \times D}$ represent the initial hidden state, and let $X = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^{D \times N}$ denote the input sequence. The output of the S4 mechanism is represented as $f(X; H_0)$. Furthermore, for the $d$-th channel, let the state transition matrix be $A^{(d)} = \mathrm{diag}(a_1^{(d)}, \ldots, a_H^{(d)})$ and the input transition vector be $B^{(d)} = (b_1, \ldots, b_H)^\top$, where $d = 1, \ldots, D$. For any vector $v \in \mathbb{R}^n$, we use $v_{i:j} \in \mathbb{R}^{j-i}$ to denote the subvector of $v$ containing elements from $i \in \mathbb{N}_+$ to $j \in \mathbb{N}_+$, where $i < j$. Similarly, for any matrix $M \in \mathbb{R}^{m \times n}$, we use $M_{i_1:j_1,\, i_2:j_2}$ to denote the submatrix containing rows $i_1 \in \mathbb{N}_+$ to $j_1 \in \mathbb{N}_+$ and columns $i_2 \in \mathbb{N}_+$ to $j_2 \in \mathbb{N}_+$, where $i_1 < j_1$ and $i_2 < j_2$.

Proposition 1 (Expressivity of Prefix-Tuning on SSMs). Let $f \in \mathcal{F}_{\mathrm{S4},D}$ be an S4 mechanism. Consider prefix-tuning that prepends a sequence $P = (p_1, \ldots, p_M) \in \mathbb{R}^{D \times M}$ to the input sequence $X = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^{D \times N}$. For any prefix $P \in \mathbb{R}^{D \times M}$, there exists an initial hidden state $H_0' \in \mathbb{R}^{H \times D}$ such that the output of S4 after prefix-tuning and that after initial state tuning are identical, i.e., $f(X; H_0') \equiv f([P, X]; H_0)_{1:D,\, M+1:M+N}$ for all $X \in \mathbb{R}^{D \times N}$. Furthermore, assume that Q 0 i