# tuning_frequency_bias_of_state_space_models__fd168899.pdf Published as a conference paper at ICLR 2025 TUNING FREQUENCY BIAS OF STATE SPACE MODELS Annan Yu1 Dongwei Lyu2 Soon Hoe Lim3,4 Michael W. Mahoney5 7 N. Benjamin Erichson5,6 1 Center for Applied Mathematics, Cornell University, Ithaca, NY 14853, USA 2 Data Science Institute, University of Chicago, Chicago, IL 60637, USA 3 Department of Mathematics, KTH Royal Institute of Technology, Stockholm, Sweden 4 Nordita, KTH Royal Institute of Technology and Stockholm University, Stockholm, Sweden 5 Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA 6 International Computer Science Institute, Berkeley, CA 94704, USA 7 Department of Statistics, University of California at Berkeley, Berkeley, CA 94720, USA ay262@cornell.edu, dwlyu@uchicago.edu, shlim@kth.se, mmahoney@stat.berkeley.edu, erichson@icsi.berkeley.edu State space models (SSMs) leverage linear, time-invariant (LTI) systems to effectively learn sequences with long-range dependencies. By analyzing the transfer functions of LTI systems, we find that SSMs exhibit an implicit bias toward capturing low-frequency components more effectively than high-frequency ones. This behavior aligns with the broader notion of frequency bias in deep learning model training. We show that the initialization of an SSM assigns it an innate frequency bias and that training the model in a conventional way does not alter this bias. Based on our theory, we propose two mechanisms to tune frequency bias: either by scaling the initialization to tune the inborn frequency bias; or by applying a Sobolev-norm-based filter to adjust the sensitivity of the gradients to highfrequency inputs, which allows us to change the frequency bias via training. Using an image-denoising task, we empirically show that we can strengthen, weaken, or even reverse the frequency bias using both mechanisms. By tuning the frequency bias, we can also improve SSMs performance on learning long-range sequences, averaging an 88.26% accuracy on the Long-Range Arena (LRA) benchmark tasks. 1 INTRODUCTION Sequential data are ubiquitous in fields such as natural language processing, computer vision, generative modeling, and scientific machine learning. Numerous specialized classes of sequential models have been developed, including recurrent neural networks (RNNs) (Arjovsky et al., 2016; Chang et al., 2019; Erichson et al., 2021; Rusch & Mishra, 2021; Orvieto et al., 2023), convolutional neural networks (CNNs) (Bai et al., 2018; Romero et al., 2022), continuous-time models (CTMs) (Gu et al., 2021b; Yildiz et al., 2021), transformers (Katharopoulos et al., 2020; Choromanski et al., 2020; Kitaev et al., 2020; Zhou et al., 2022; Nie et al., 2023), state space models (SSMs) (Gu et al., 2022b;a; Hasani et al., 2023; Smith et al., 2023), and Mamba (Gu & Dao, 2023; Dao & Gu, 2024). Among these, SSMs stand out for their ability to learn sequences with long-range dependencies. Using the continuous-time linear, time-invariant (LTI) systems, x (t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t), (1) where A Cn n, B Cn m, C Cp n, and D Cp m, an SSM computes the output time-series y(t) from the input u(t) via a latent state vector x(t). Compared to an RNN, a major computational advantage of an SSM is that the LTI system can be trained both efficiently (i.e., the training can be parallelized for long sequences) and numerically robustly (i.e., it does not suffer from vanishing and exploding gradients). An LTI system can be computed in the time domain via convolution: y(t) = (h u + Du)(t) = ˆ h(t τ)u(τ)dτ + Du(t), h(t) = Cexp(t A)B. Published as a conference paper at ICLR 2025 Problem Formulation Model Input Model Output a2 cos(16x) a3 cos(256x) s = 1 s = 16 s = 256 Figure 1: In a synthetic example to illustrate the frequency bias of SSMs, we form the inputs by superposing three waves of low, moderate, and high frequencies, respectively. We train an S4D model to regress the magnitudes of the three waves. We observe that the magnitudes of the lowfrequency waves can be approximated much better compared to those of the high-frequency waves. In Figure 10, we show how to tune the frequency bias in this example. Alternatively, it can be viewed as an action in the frequency domain: ˆy(s) = G(is)ˆu(s), G(is) := C(is I A) 1B + D, s R, (2) where i is the imaginary unit and I is the identity matrix. The function G is called the transfer function of the LTI system. The frequency-domain characterization of the LTI systems in eq. (2) sets the stage for understanding the so-called frequency bias of an SSM. The term frequency bias originated from the study of a general overparameterized multilayer perceptron (MLP) (Rahaman et al., 2019), where it was observed that the low-frequency content was learned much faster than the high-frequency content. It is a form of implicit regularization (Mahoney, 2012). Frequency bias is a double-edged sword: on one hand, it partially explains the good generalization capability of deep learning models, because most high-frequency noises are not learned until the low-frequency components are well-captured; on the other hand, it puts a curse on learning the useful high-frequency information in the target. In this paper, we aim to understand the frequency bias of SSMs. In Figure 1, we observe that, similar to most deep learning models, SSMs are also better at learning the low frequencies than the high ones. To understand that, we develop a theory that connects the spectrum of A to the SSM s capability of processing high-frequency signals. Then, based on the spectrum of A, we analyze the frequency bias in two steps. First, we show that the most popular initialization schemes (Gu et al., 2020; 2021b; Yu et al., 2024b) lead to SSMs that have an innate frequency bias. More precisely, they place the spectrum of A, Λ(A), in the low-frequency region in the s-plane, preventing LTI systems from processing high-frequency input, regardless of the values of B and C. Second, we consider the training of the SSMs. Using the decay properties of the transfer function, we show that if an eigenvalue aj Λ(A) is initialized in the low-frequency region, then its gradient is insensitive to the loss induced by the high-frequency input content. Hence, if an SSM is not initialized with the capability of handling high-frequency inputs, then it will not be trained to do so by conventional training. The initialization of the LTI systems equips an SSM with a certain frequency bias, but this is not necessarily the appropriate implicit bias for a given task. Depending on whether an SSM needs more expressiveness or generalizability, we may want less or more frequency bias, respectively (see Figure 10). Motivated by our analysis, we propose two ways to tune the frequency bias: 1. Instead of using the Hi PPO initializations, the most popular class of initializations used in practice, we scale Λ(A) to lower or higher-frequency regions at initialization as a hard tuning strategy that marks out the regions in the frequency domain that can be learned. 2. Motivated by the Sobolev norm, which applies weights to the Fourier domain, we can apply a multiplicative factor of (1 + |s|)β to the transfer function G(is). This is a soft tuning strategy that reweighs each location in the frequency domain. By selecting a positive or negative β, we make the gradients more or even less sensitive to the high-frequency input content, respectively, which changes the frequency bias during training. One can think of these two mechanisms as ways to tune frequency bias at initialization and during training, respectively. After rigorously analyzing them, we present an experiment on imagedenoising with different noise frequencies to demonstrate their effectiveness. We also show that tuning the frequency bias enables better performance on tasks involving long-range sequences. Equipped with our two tuning strategies, a simple S4D model can be trained to average an 88.26% accuracy on the Long-Range Arena (LRA) benchmark tasks (Tay et al., 2021). Published as a conference paper at ICLR 2025 Contribution. Here are our main contributions: (1) We formalize the notion of frequency bias for SSMs and quantify it using the spectrum of A. We show that a diagonal SSM initialized by Hi PPO has an innate frequency bias. We are the first to study the training of the state matrix A, and we show that training the SSM does not alter this frequency bias. (2) We propose two ways to tune frequency bias, by scaling the initialization, and by applying a Sobolev-norm-based filter to the transfer function of the LTI systems. We study the theory of both strategies and provide guidelines for using them in practice. (3) We empirically demonstrate the effectiveness of our tuning strategies using an image-denoising task. We also show that tuning the frequency bias helps an S4D model to achieve state-of-the-art performance on the Long-Range Arena tasks and provide ablation studies. To make the presentation cleaner, throughout this paper, we focus on a single single-input/singleoutput (SISO) LTI system Γ = (A, B, C, D) in an SSM, i.e., m = p = 1, although all discussions naturally extend to the multiple-input/multiple-output (MIMO) case, as in an S5 model (Smith et al., 2023). Hence, the transfer function G : C C is complex-valued. We emphasize that while we focus on a single system Γ, we do not isolate it from a large SSM; in fact, when we study the training of Γ in section 4, we backpropagate through the entire SSM. Related Work. The frequency bias, also known as the spectral bias, of a general neural network (NN) was initially observed and studied in Rahaman et al. (2019); Yang & Salman (2019); Xu (2020). The name spectral bias stemmed from the spectral decomposition of the so-called neural tangent kernels (NTKs) (Jacot et al., 2018), which provides a means of approximating the training dynamics of an overparameterized NN (Arora et al., 2019; Su & Yang, 2019; Cao et al., 2019). By carefully analyzing the eigenfunctions of the NTKs, Basri et al. (2019); Bietti & Mairal (2019) proved the frequency bias of an overparameterized two-layer NN for uniform input data. The case of nonuniform input data was later studied in Basri et al. (2020); Yu et al. (2023). The idea of Sobolev-norm-based training of NNs has been considered in Vlassis & Sun (2021); Yu et al. (2023); Tsay (2021); Son et al. (2021); Czarnecki et al. (2017); Zhu et al. (2021); Son (2023); Liu et al. (2024). The initialization of the LTI systems in SSMs plays a crucial role, which was first observed in Gu et al. (2020). Empirically successful initialization schemes called Hi PPO were proposed in Voelker et al. (2019); Gu et al. (2020; 2023). Other efforts in improving the initialization of an SSM were studied in Yu et al. (2024b); Liu & Li (2024a). Later, Orvieto et al. (2023); Yu et al. (2024a) attributed the success of Hi PPO to the proximity of the spectrum of A to the imaginary axis (i.e., the real parts of the eigenvalues of A are close to zero). This paper considers the imaginary parts of the eigenvalues of A, which was also discussed in the context of the approximationestimation tradeoff in Liu & Li (2024b). The training of SSMs has mainly been considered in Sm ekal et al. (2024); Liu & Li (2024a), where the matrix A is assumed to be fixed, making the optimization convex. To our knowledge, we are the first to consider the training of A. While we consider the decay of the transfer functions of the LTI systems in the frequency domain, there is extensive literature on the decay of the convolutional kernels in the time domain (i.e., the memory) (Hardt et al., 2018; Gu et al., 2020; Wang & Li, 2023; Wang & Xue, 2024; Orvieto et al., 2024; Yu et al., 2024a). 2 WHAT IS THE FREQUENCY BIAS OF AN SSM? Transfer Function G s imaginary axis / frequency domain converge to D converge to D V blow alow (G) ahigh bhigh V bhigh ahigh (G) low-frequency region high-frequency region high-frequency region Figure 2: The frequency bias of an SSM says that the frequency response has more variation in the low-frequency area than the high-frequency one. In Figure 1, we see an example where an S4D model is better at predicting the magnitude of a low-frequency component in the input than a high-frequency one. This coincides with our intuitive interpretation of frequency bias: the model is better at handling low frequencies than high frequencies. To rigorously analyze this phenomenon for SSMs, however, we need to formalize the notion of frequency bias. This is our goal in this section. One might imagine that an SSM has a frequency bias if, given a time-series input u(t) that has rich highfrequency information, its time-series output y(t) lacks high-frequency content. Unfortunately, this is not the case: an SSM is capable of generating high-frequency outputs. Indeed, the skip connection D of an LTI system is an all-pass filter, multiplying the whole input u(t) by a factor of D and adding it to the output y(t). On the other hand, the secret of a successful SSM hides in A, B, and C (Yu et al., 2024a). Published as a conference paper at ICLR 2025 In an ablation study, when D is removed, an S4D model only loses less than 2% of accuracy on the s CIFAR-10 task (Tay et al., 2021; Krizhevsky et al., 2009), whereas the model completely fails when we remove A, B, and C. This can be ascribed to the LTI system s power to model complicated behaviors in the frequency domain. That is, each Fourier mode in the input has its own distinct pass rate (see Adamyan et al. (1971); Sun (2020); Yu & Townsend (2024) for why this is an important feature of LTI systems). For example, the task in Figure 1 can be trivially solved if the LTI system can filter out a single mode a1 cos(x), a2 cos(16x), or a3 cos(256x) from the superposition of the three; the skip connection D alone is not capable of doing that. Given that, we can formulate the frequency bias of an LTI system as follows (see Figure 2): Frequency bias of an SSM means that the frequency responses (i.e., the transfer functions G) of LTI systems have more variation in the low-frequency area than the high-frequency area. More precisely, given the transfer function G(is), we can study its total variation in a particular interval [a, b] in the Fourier domain defined by V b a (G) := sup a=s0 maxj |wj|, we have |cj| |wj + B|, V B ( G) |cj| |wj B|. Given the formula of the transfer function, the proof of Lemma 1 is almost immediate. In this paper, we leave all proofs to Appendix B and D. Lemma 1 illustrates a clear and intuitive concept: If the imaginary parts of aj are distributed in the low-frequency region, i.e., |wj| are small, the transfer function has a small total variation in the high-frequency areas ( , B] and [B, ) as B , inducing a frequency bias of the SSM. We can now apply Lemma 1 to study the innate frequency bias of the Hi PPO initialization (Gu et al., 2020). While there are many variants of Hi PPO, we choose the one that is commonly used in practice (Gu et al., 2021a). All other variants can be similarly analyzed. Published as a conference paper at ICLR 2025 Corollary 1. Assume that aj = 0.5 + i( 1)j j/2 π and ξj, ζj N(0, 1) i.i.d., where N(0, 1) is the standard normal distribution. Then, given B > nπ/2 and δ > 0, we have V B ( G), V B ( G) ln(1/δ)) B n/2 with probability 1 δ. In particular, Corollary 1 tells us that the Hi PPO initialization only captures the frequencies s [ B, B] up to B = O(n), because when B = ω(n), we see that V B ( G), V B ( G) vanish as n increases. This means that no complicated high-frequency responses can be learned. 4 FREQUENCY BIAS OF AN SSM DURING TRAINING In section 3, we see that the initialization of the LTI systems equips an SSM with an innate frequency bias. A natural question to ask is whether an SSM can be trained to adopt high-frequency responses. Analyzing the training of an SSM (or many other deep learning models) is not an easy task, and we lack theoretical characterizations. Two notable exceptions are Liu & Li (2024b); Sm ekal et al. (2024), where the convergence of a trainable LTI system to a target LTI system is analyzed, assuming that the state matrix A is fixed to make the optimization problem convex. Unfortunately, this assumption is too strong to be applied for our purpose. Indeed, Lemma 1 characterizes the frequency bias using the distribution of wj = Im(aj), making the training dynamics of A a crucial element in our analysis. Even if we set aside the issue of A, analyzing an isolated LTI system in an SSM remains unrealistic: when an SSM, consisting of hundreds of LTI systems, is trained for a single task, there is no clear notion of ground truth for each individual LTI system within the model. To make our discussion truly generic, we assume that there is a loss function L(Θ) that depends on all parameters Θ of an SSM. In particular, Θ contains vj, wj, ξj, ζj, and D from every LTI system within the SSM, as well as the encoder, decoder, and inter-layer connections. With mild assumptions on the regularity of the loss function L, we provide a quantification of the gradient of L with respect to wj that leads to a qualitative statement about the frequency bias during training. Theorem 1. Let L(Θ) be a loss function and (A, B, C, D) be a diagonal LTI system in an SSM defined in section 3. Let G be its associated real-valued transfer function defined in eq. (3). Suppose the functional derivative of L(Θ) with respect to G(is) exists and is denoted by ( / G(is))L. Then, if |( / G(is))L| = O(|s|p) for some p < 1, we have L G(is) Kj(s) ds, Kj(s) := ζj((s wj)2 v2 j) 2ξjvj(s wj) [v2 j + (s wj)2]2 , (4) for every 1 j n. In particular, we have that |Kj(s)| = O |ζjs 2| + |ξjs 3| as |s| . In Theorem 1, we use a technical tool called the functional derivative (Gelfand et al., 2000). The assumption that ( / G(is))L exists is easily satisfied, and we leave a survey of functional derivatives to Appendix C. The assumption that |( / G(is))L| grows at most sublinearly is to guarantee the convergence of the integral in eq. (4); it is also easily satisfiable. We will see that the growth/decay rate of |( / G(is))L| plays a more important role when we start to tune the frequency bias using the Sobolev-norm-based method (see section 5.2). As usual, one can intuitively think of the functional derivative ( / G(is))L as a measurement of the sensitivity of the loss function L to an LTI system s action on a particular frequency s (i.e., G(is)). The fact that it is multiplied by a factor of Kj(s) in the computation of the gradient in eq. (4) conveys the following important message: The gradient of L with respect to wj highly depends on the part of the loss that has local frequencies near s = wj. It is relatively unresponsive to the loss induced by high frequencies, with a decaying factor of O(|s| 2) as the frequency increases, i.e., as |s| . Hence, the loss landscape of the frequency domain contains many local minima, and an LTI system can rarely learn the high frequencies with the usual training. To verify this, we train an S4D model initialized by Hi PPO to learn the s CIFAR-10 task for 100 epochs. We measure the relative change of each parameter θ: (θ) = (|θ(0) θ(100)|)/(|θ(0)|), where the superscripts indicate the epoch number. As we will show in section 6, the Hi PPO initialization is unable to capture the high frequencies in the CIFAR-10 pictures fully. From Table 1, however, we see that Im(diag(A)) is trained Published as a conference paper at ICLR 2025 very little: every wj is only shifted by 1.43% on average. This can be explained by Theorem 1: wj is easily trapped by a low-frequency local minimum. Table 1: The average relative change of each LTI system matrix in an S4D model trained on the s CIFAR-10 task. We see that the imaginary parts of diag(A) are almost unchanged during training. Parameter Re(diag(A)) Im(diag(A)) B C D 1002.705 0.0143 1.1801 0.8913 An Illustrative Example. Our analysis of the training dynamics of wj in Theorem 1 is very generic, relying on the notion of the functional derivatives. To make the theorem more concrete, we consider a synthetic example (see Figure 3). We fall back to the case of approximating a target function F(is) = Re 5 is ( 1 50i) + 0.2 is ( 1 + 50i) + 0.01 cos 9 4s 1[ 2π,2π] using a trainable G, where 9/4 is chosen to guarantee the continuity of F. We set the number of states to be one, i.e., n = 1. For illustration purposes, we fix v = 1 and ζ = 0; therefore, we have G(is) = Re ξ is ( 1 wi) where our only trainable parameters are w and ξ. Our target function F contains two modes and some small noises between 2π and 2π, whereas G is unimodal with a trainable position and height (see Figure 3 (left)). We apply gradient flow on w and ξ with respect to the L2-loss in the Fourier domain, in which case the functional derivative ( / G(is))L simply reduces to the residual: L G(is) = 2( F G)(is), L = F(is) G(is) L2. In Figure 3 (middle), we show the training dynamics of (w(τ), ξ(τ)), initialized with different values (w(0), ξ(0) = 3), where τ is the time index of the gradient flow. We make two remarkable observations that corroborate our discussion of the frequency bias during training: 1. Depending on the initialization of w(0), it has two options of moving left or right. Since we fix ζ = 0, by Theorem 1, a mode (ˆw, ˆξ) = ( 50, 5) or (50, 0.2) in the residual F G impacts the gradient ( / w)L inverse-proportionally to the cube of the distance between the mode ˆw and the current w(τ). Since |24.5 50|3/|24.5 ( 50)|3 0.2/5, we indeed observe that when w(0) 24.5, it tends to move leftward, and rightward otherwise. 2. Although the magnitude of the noises in [ 2π, 2π] is only 5% of the smaller mode at ˆw = 50 and 0.2% of the larger mode at ˆw = 50, once w(τ) of the trainable LTI system G enters the noisy region, it gets stuck in a local minimum and never converges to one of the two modes of F (see Region II in Figure 3). This corroborates our discussion that the training dynamics of w is sensitive to local information and it rarely learns the high frequencies when initialized in the low-frequency region. 5 TUNING THE FREQUENCY BIAS OF AN SSM In section 3 and 4, we analyze the frequency bias of an SSM initialized by Hi PPO and trained by a gradient-based algorithm. While we now have a theoretical understanding of the frequency bias, from a practical perspective, we want to be able to tune it. In this section, we design two strategies to enhance, reduce, counterbalance, or even reverse the bias of an SSM against the high frequencies. The two strategies are motivated by our discussion of the initialization (see section 3) and training (see section 4) of an SSM, respectively. 5.1 TUNING FREQUENCY BIAS BY SCALING THE INITIALIZATION Since the initialization assigns an SSM some inborn frequency bias, a natural way to tune the frequency bias is to modify the initialization. Here, we introduce a hyperparameter α > 0 as a simple way to scale the Hi PPO initialization defined in Corollary 1: aj = 0.5 + i( 1)j j/2 απ, 1 j n. (5) Published as a conference paper at ICLR 2025 Ground Truth Trainable LTI The Target Function function value -10 -5 0 5 10 -0.1 The L2 Loss Landscape & Trajectories of (w(τ), ξ(τ)) The H2 Loss Landscape & Trajectories of (w(τ), ξ(τ)) Figure 3: We train an LTI system to learn a noisy bimodal target transfer function. The convergence to a local minimum depends on the initial location of the pole. Left: the ground truth contains a large mode and a small mode, plus some small noises. We want to investigate which mode, if any, our trainable LTI system converges to. Middle: we train the LTI system with respect to the L2-loss. We show the trajectories of (w(τ), ξ(τ)) given different initializations (w(0), ξ(0) = 3). The two local minima corresponding to the two modes of F are shown in red crosses. The green trajectories (initialized in Region I) converge to the mode at w = 50, the magenta trajectories (initialized in Region III) converge to the mode at w = 50, and the black ones (initialized in Region II) converge to neither. Right: the experiment is repeated with the H2-loss (see section 5.2). Compared to the original Hi PPO initialization, we scale the imaginary parts of the eigenvalues of A by a factor of α. By making the modification, we lose the polynomial projection interpretation of Hi PPO that was originally proposed as a way of explaining the success of the Hi PPO initialization; yet, as shown in Orvieto et al. (2023); Yu et al. (2024a), this mechanism is no longer regarded as the key for a good initialization. By setting α < 1, the eigenvalues aj are clustered around the origin, enhancing the bias against the high-frequency modes; conversely, choosing α > 1 allows us to capture more variations in the high-frequency domain, reducing the frequency bias. So far, our discussion in the paper is from a perspective of the continuous-time LTI systems acting on continuous time-series. For SSMs, however, the inputs come in a discrete sequence. Hence, we inevitably have to discretize our LTI systems. To study the scaling laws of α, we assume in this work that an LTI system is discretized using the bilinear transform (Glover, 1984; Gu et al., 2022b) with a sampling interval t > 0. Other discretization choices can be similarly studied. Then, given an input sequence u RL of length L, the output y can be computed by discretizing eq. (2): y = i FFT(FFT(u) G(s)), sj = 2 t exp (i2π(j 1)/L) 1 exp (i2π(j 1)/L) + 1, (6) with the same transfer function G in eq. (3). Our goal is to propose general guidelines for an upper bound of α. We leave most technical details to Appendix D; but we intuitively explain why α cannot be arbitrarily large for discrete inputs. Given a fixed sampling interval t, there is an upper bound for the frequency, called the Nyquist frequency, above which a signal cannot be seen by sampling, causing the so-called aliasing errors (Oppenheim, 1999; Condon & Ransom, 2016; Trefethen, 2019). As a straightforward example, one cannot distinguish between cos(5t) and cos(t) from their samples at t = kπ, k Z. Our next result tells us how to avoid aliasing by constraining the range of α. Proposition 1. Let a = v + iw be given and define G(is) = 1/(is a). Let g = G(s) be the vector of length L, where s is defined in eq. (6). Then, there exist constants C1, C2 > 0 such that C1 g 2 |w| t C2 g 2, C1 g 1 1+||w| 2( t) 1 tan ((1 (L 1)/L) π/2)| C2 g . You may have noticed that in Proposition 1, we study the norm of the complex G instead of its real part restriction G. The reason is that in an LTI system parameterized by complex numbers, we multiply G by a complex number ξ + iζ and then extract its real part. Hence, both Re(G) and Im(G) are important. By noting that max1 j n |wj| nπ/2 in our scaled initialization in eq. (5), Proposition 1 gives us two scaling laws of α that prevent g 2 and g from vanishing, respectively. First, the 2-norm of g measures the average-case contribution of the partial fraction 1/(is a) to the input-to-output mapping in eq. (6). Published as a conference paper at ICLR 2025 Rule I: (Law of Non-vanishing Average Information) For a fixed task, as n and t vary, one should scale α = O(1/(n t)) to preserve the LTI system s impact on an average input. Next, the -norm of g tells us the maximum extent of the system s action on any inputs. Therefore, if g is too small, then 1/(is a) can be dropped without seriously affecting the system at all. Rule II: (Law of Nonzero Information) Regardless of the task, one should never take α 4 tan ((1 (L 1)/L) π/2) /nπ t to avoid a partial fraction that does not contribute to the evaluation of the model. We reemphasize that our scaling laws provide upper bounds of α. Of course, one can always choose α to be much smaller to capture the low frequencies better. 5.2 TUNING FREQUENCY BIAS BY A SOBOLEV FILTER In section 5.1, we see that we can scale the Hi PPO initialization to redefine the region in the Fourier domain that can be learned by an LTI system. Here, we introduce another way to tune the frequency bias: by applying a Sobolev-norm-based filter. The two strategies both tune the frequency bias, but by different means: scaling the initialization identifies a new set of frequencies that can be learned by the SSM, whereas the filter in this section introduces weights to different frequencies. Our method is rooted in the Sobolev norm, which extends a general L2 norm. Imagine that we approximate a ground-truth transfer function F(is) using G(is). We can define the loss to be F G 2 Hβ := ˆ (1 + |s|)2β| F(is) G(is)|2 ds (7) for some hyperparameter β R. The scaling factor (1 + |s|)2β naturally reweighs the Fourier domain. Note that you may have seen other forms of this factor they all lead to the same norm up to norm-equivalency. When β = 0, eq. (7) reduces to the standard L2 loss. The high frequencies become less important when β < 0 and more important when β > 0. Unfortunately, as discussed in section 4, there lacks a notion of the ground-truth F for every single LTI system within an SSM, making eq. (7) uncomputable. To address this issue, instead of using a Sobolev loss function, we apply a Sobolev-norm-based filter to the transfer function G to redefine the dynamical system: ˆy(s) = G(β)(is)ˆu(s), G(β)(is) := (1 + |s|)β G(is). (8) This equation can be discretized using the same formula in eq. (6) by replacing G with G(β). Equation (8) can be alternatively viewed as applying the filter (1 + |s|)β to the FFT of the input u, which clearly allows us to reweigh the frequency components. Surprisingly, there is even more beyond this intuition: applying the filter allows us to modify the training dynamics of wj! Theorem 2. Let L(Θ) be a loss function and Γ = (A, B, C, D) be a diagonal LTI system in an SSM defined in section 3. For any β R, we apply the filter in eq. (8) to Γ and let G(β) be the new transfer function. Suppose the functional derivative of L(Θ) with respect to G(β)(is) exists and is denoted by ( / G(β)(is))L. Then, if |( / G(β)(is))L| = O(|s|p) for some p < 1 β, we have L G(β)(is) K(β) j (s)ds, K(β) j (s):=(1+|s|)β ζj((s wj)2 v2 j) 2ξjvj(s vj) [v2 j + (s wj)2]2 , (9) for every 1 j n. In particular, we have that K(β) j (s) =O |ζjs 2+β|+|ξjs 3+β| as |s| . Compared to Theorem 1, the gradient of L with respect to wj now depends on the loss at frequency s by a factor of O(|s| 2+β). Thus, the effect of our Sobolev-norm-based filter is not only a rescaling of the inputs in the frequency domain, but it also allows better learning the high frequencies: The higher the β is, the more sensitive wj is to high-frequency losses. Hence, wj is no longer constrained by the local-frequency loss and will activately learn the high frequencies. The decay constraint that |( / G(β)(is))L| = O(|s|p) for some p < 1 β is needed to guarantee the convergence of the integral in eq. (9). When it is violated, the theoretical statement breaks, but Published as a conference paper at ICLR 2025 we could still implement the filter in practice, which is similar to Yu et al. (2023). In Figure 3 (right), we reproduce the illustrative example introduced in Section 4 using the Sobolev-norm-based filter in eq. (8) with β = 2. This is equivalent to training an ordinary LTI system with respect to the H2-loss function defined in eq. (7). We find that in this case, the trajectories of (w(τ), ξ(τ)) always converge to one of the two modes in F regardless of the initialization, with more of them converging to the high-frequency global minimum on the left. This verifies our theory, because by setting β = 2, we amplify the contribution of the high-frequency residuals in the computation of ( / w)L, pushing a y out of the noisy region between 2π and 2π. We leave more illustrative experiments to Appendix E, which show the effect of our tuning filter also when β < 0. 6 EXPERIMENTS AND DISCUSSIONS (I) SSMs as Denoising Sequential Autoencoders. We now provide an example of how our two mechanisms allow us to tune frequency bias. In this example, we train an SSM to denoise an image in the Celeb A dataset (Liu et al., 2015). We flatten an image into a sequence of pixels in the row-major order and feed it into an S4D model. We collect the corresponding output sequence and reshape it into an image. Similar to the setting of an autoencoder, our objective is to learn the identity map. To make the task non-trivial, we remove the skip connection D from the LTI systems. During inference, we add two different types of noises to the input images: horizontal or vertical stripes (see Figure 4). While the two types of noises may be visually similar to each other, since we flatten the images using the row-major order, the horizontal stripes turn into low-frequency noises while the vertical stripes become high-frequency ones (see Figure 9). In Figure 4, we show the outputs of the models trained with different values of α and β as defined in section 5.1 and 5.2, respectively. α=0.1, β = 1 Outputs of Four Models α=1, β =0 α=100, β =1 Low-Freq. Noise High-Freq. Noise High-Frequency-Pass Filter Figure 4: The outputs of image-denoising S4D models trained with different configurations. From Figure 4, we can see that as α and β increase, our model learns the high frequencies in the input better; consequently, the high-frequency noises get preserved in the outputs and the low-frequency noises are dampened. This corroborates our intuitions from section 5.1 and 5.2. We can further quantify the pass rates of the low and high-frequency noises. That is, we compute the percentage of the low and high-frequency noises that are preserved in the output. We show in Table 2 the ratio between the low-pass rate and the high-pass rate, which decreases as α and β increase. (II) Tuning Frequency Bias in the Long-Range Arena. The two tuning strategies section 5.1 and 5.2 are not only good when one needs to deal with a particular high or low frequency, but they also improve the performance of an SSM on general long-range tasks. In Table 3, we show that equipped with the two tuning strategies, our SSM achieves state-of-the-art performance on the Long-Range Arena (LRA) tasks (Tay et al., 2021). Published as a conference paper at ICLR 2025 Table 2: We compute the percentage of the low and high-frequency noises that are preserved in the output of a model trained with a pair of configurations (α, β). The table shows the ratio between the low-frequency pass rate and the high-frequency pass rate. The more bluish a cell is, the better our model learns the high frequencies. Circled in red is the S4D default. β 1.0 0.5 0.0 0.5 1.0 0.1 4.463e+07 2.409e+06 1.198e+05 4.613e+03 1.738e+02 1 4.912e+05 2.124e+05 1.758e+04 9.595e+02 5.730e+01 10 9.654e+04 7.465e+03 6.073e+02 5.699e+01 6.394e+00 100 3.243e+00 3.745e-02 3.801e-03 7.299e-05 5.963e-06 10-4 10-2 100 102 104 106 Table 3: Test accuracies in the Long-Range Arena of different variants of SSMs. The bold (resp. underlined) numbers indicate the best (resp. second best) performance on a task. An entry is left blank if no result is found. The row labeled Ours stands for the S4D model equipped with our two tuning strategies. Experiments were run with 5 random seeds and the medians and the standard deviations are reported. The S4 and S4D results are from the original papers (Gu et al., 2022b;a). The sizes of our models are the same or smaller than the corresponding S4D models. Model List Ops Text Retrieval Image Pathfinder Path-X Avg. DSS (Gupta et al., 2022) 57.60 76.60 87.60 85.80 84.10 85.00 79.45 S4++ (Qi et al., 2024) 57.30 86.28 84.82 82.91 80.24 - - Reg. S4D (Liu & Li, 2024a) 61.48 88.19 91.25 88.12 94.93 95.63 86.60 Spectral SSM (Agarwal et al., 2023) 60.33 89.60 90.00 - 95.60 90.10 - Liquid S4 (Hasani et al., 2023) 62.75 89.02 91.20 89.50 94.80 96.66 87.32 S5 (Smith et al., 2023) 62.15 89.31 91.40 88.00 95.33 98.58 87.46 S4 (Gu et al., 2022b) 59.60 86.82 90.90 88.65 94.20 96.35 86.09 S4D (Gu et al., 2022a) 60.47 86.18 89.46 88.19 93.06 91.95 84.89 Ours 62.75 89.76 92.45 90.89 95.89 97.84 88.26 0.78 0.22 0.16 0.35 0.13 0.21 (III) Ablation Studies. We perform ablation studies of our two tuning strategies by training a smaller S4D model to learn the grayscale s CIFAR-10 task. From Figure 5, we obtain better performance when we slightly increase α or decrease β. It might feel like a contradiction because increasing α helps to learn the high frequencies while decreasing β downplays their role. It is not: α and β control two different notions of the bias: scaling α affects which frequencies we can learn; scaling β affects how much we want to learn a certain frequency. They can be used collaboratively and interactively to attain the optimal extent of frequency bias that we need for a problem. 10-1 100 101 102 72 S4D Default Little high-freq. info. can be learned Limited deg. of freedom in the low-freq. domain -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 72 S4D Default High freqs. get too little attention High freqs. are weighted too much Figure 5: Two ablation studies of the tuning strategies proposed in this paper. We train an S4D model with varying parameters of α and β, respectively. On the left, we see that holding β = 0 (the default value), the model achieves its best performance when α = 4; on the right, when we fix α = 1 (the default value), the model performs the best when β = 0.5. (IV) More Experiments. One can find more supplementary experiments in Appendix I on wave prediction (see Figure 1) and video generation. 7 CONCLUSION We formulated the frequency bias of an SSM and showed its existence by analyzing both the initialization and training. We proposed two different tuning mechanisms based on scaling the initialization and on applying a Sobolev-norm-based filter to the transfer function. As a future direction, one could develop ways to analyze the spectral information of the inputs of a problem and use it to guide the selection of the hyperparameters in our tuning mechanisms. Published as a conference paper at ICLR 2025 ACKNOWLEDGMENTS We would like to acknowledge the Laboratory Directed Research and Development Program of Lawrence Berkeley National Laboratory under U.S. Department of Energy Contract No. DE-AC0205CH11231. In addition, SHL would like to acknowledge support from the Wallenberg Initiative on Networks and Quantum Information (WINQ) and the Swedish Research Council (VR/2021-03648). We would also like to thank the anonymous reviewers for their comments that helped improve this paper. Vadim Movsesovich Adamyan, Damir Zyamovich Arov, and Mark Grigor evich Krein. Analytic properties of schmidt pairs for a hankel operator and the generalized schur takagi problem. Matematicheskii Sbornik, 128(1):34 75, 1971. Naman Agarwal, Daniel Suo, Xinyi Chen, and Elad Hazan. Spectral state space models. ar Xiv preprint ar Xiv:2312.06837, 2023. Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pp. 1120 1128. PMLR, 2016. S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Inter. Conf. Mach. Learn., pp. 322 332. PMLR, 2019. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. ar Xiv preprint ar Xiv:1803.01271, 2018. R. Basri, D. Jacobs, Y. Kasten, and S. Kritchman. The convergence rate of neural networks for learned functions of different frequencies. Adv. Neur. Info. Proc. Syst., 32, 2019. R. Basri, M. Galun, A. Geifman, D. Jacobs, Y. Kasten, and S. Kritchman. Frequency bias in neural networks for input of non-uniform density. In Inter. Conf. Mach. Learn., pp. 685 694. PMLR, 2020. A. Bietti and J. Mairal. On the inductive bias of neural tangent kernels. Adv. Neur. Info. Proc. Syst., 32, 2019. Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. Towards understanding the spectral bias of deep learning. ar Xiv preprint ar Xiv:1912.01198, 2019. Bo Chang, Minmin Chen, Eldad Haber, and Ed H Chi. Antisymmetricrnn: A dynamical system view on recurrent neural networks. In International Conference on Machine Learning, 2019. Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Machine Learning, 2020. James J Condon and Scott M Ransom. Essential radio astronomy, volume 2. Princeton University Press, 2016. W. M. Czarnecki, S. Osindero, M. Jaderberg, G. Swirszcz, and R. Pascanu. Sobolev training for neural networks. Adv. Neur. Info. Proc. Syst., 30, 2017. Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. ar Xiv preprint ar Xiv:2405.21060, 2024. Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE signal processing magazine, 29(6):141 142, 2012. N Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W Mahoney. Lipschitz recurrent neural networks. In International Conference on Learning Representations, 2021. Published as a conference paper at ICLR 2025 Izrail Moiseevitch Gelfand, Richard A Silverman, et al. Calculus of Variations. Courier Corporation, 2000. Keith Glover. All optimal hankel-norm approximations of linear multivariable systems and their L, -error bounds. International journal of control, 39(6):1115 1193, 1984. Walter Greiner and Joachim Reinhardt. Field quantization. Springer Science & Business Media, 2013. Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. ar Xiv preprint ar Xiv:2312.00752, 2023. Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher R e. Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33: 1474 1487, 2020. Albert Gu, Karan Goel, and Christopher R e. s4. https://github.com/state-spaces/s4, 2021a. Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher R e. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572 585, 2021b. Albert Gu, Karan Goel, Ankit Gupta, and Christopher R e. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971 35983, 2022a. Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022b. Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher R e. How to train your hippo: State space models with generalized orthogonal basis projections. International Conference on Learning Representations, 2023. Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982 22994, 2022. Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. Journal of Machine Learning Research, 19(29):1 44, 2018. Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. International Conference on Learning Representations, 2023. A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Adv. Neur. Info. Proc. Syst., 31, 2018. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Franc ois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp. 5156 5165. PMLR, 2020. Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Machine Learning, 2020. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of statistics, pp. 1302 1338, 2000. Soon Hoe Lim. Understanding recurrent neural networks using nonequilibrium response theory. Journal of Machine Learning Research, 22(47):1 48, 2021. Fusheng Liu and Qianxiao Li. From generalization analysis to optimization designs for state space models. ar Xiv preprint ar Xiv:2405.02670, 2024a. Published as a conference paper at ICLR 2025 Fusheng Liu and Qianxiao Li. The role of state matrix initialization in ssms: A perspective on the approximation-estimation tradeoff. ICML 2024 NGSM Workshop, 2024b. Xinliang Liu, Bo Xu, Shuhao Cao, and Lei Zhang. Mitigating spectral bias for the multiscale operator learning. Journal of Computational Physics, 506:112944, 2024. ISSN 0021-9991. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015. M. W. Mahoney. Approximate computation and implicit regularization for very large-scale data analysis. In Proceedings of the 31st ACM Symposium on Principles of Database Systems, pp. 143 154, 2012. Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In The Eleventh International Conference on Learning Representations, 2023. Alan V Oppenheim. Discrete-time signal processing. Pearson Education India, 1999. Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. ar Xiv preprint ar Xiv:2303.06349, 2023. Antonio Orvieto, Soham De, Caglar Gulcehre, Razvan Pascanu, and Samuel L Smith. Universality of linear recurrences followed by non-linear projections: Finite-width guarantees and benefits of complex eigenvalues. In Forty-first International Conference on Machine Learning, 2024. Robert G. Parr and Weitao Yang. Density-Functional Theory of Atoms and Molecules. International Series of Monographs on Chemistry. Oxford University Press, 1994. ISBN 9780195357738. Biqing Qi, Junqi Gao, Dong Li, Kaiyan Zhang, Jianxing Liu, Ligang Wu, and Bowen Zhou. S4++: Elevating long sequence modeling with state memory reply. 2024. N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of neural networks. In Inter. Conf. Mach. Learn., pp. 5301 5310. PMLR, 2019. David W Romero, Anna Kuzina, Erik J Bekkers, Jakub M Tomczak, and Mark Hoogendoorn. Ckconv: Continuous kernel convolution for sequential data. In International Conference on Machine Learning, 2022. T Konstantin Rusch and Siddhartha Mishra. Unicornn: A recurrent model for learning very long time dependencies. In International Conference on Machine Learning, pp. 9168 9178. PMLR, 2021. Jakub Sm ekal, Jimmy TH Smith, Michael Kleinman, Dan Biderman, and Scott W Linderman. Towards a theory of learning dynamics in deep state space models. ar Xiv preprint ar Xiv:2407.07279, 2024. Jimmy Smith, Shalini De Mello, Jan Kautz, Scott Linderman, and Wonmin Byeon. Convolutional state space models for long-range spatiotemporal modeling. Advances in Neural Information Processing Systems, 36, 2024. Jimmy T.H. Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations, 2023. H. Son, J.W. Jang, W.J. Han, and H.J. Hwang. Sobolev training for the neural network solutions of PDEs. ar Xiv preprint ar Xiv:2101.08932, 2021. Hwijae Son. Sobolev acceleration for neural networks. 2023. Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pp. 843 852. PMLR, 2015. Published as a conference paper at ICLR 2025 L. Su and P. Yang. On learning over-parameterized neural networks: A functional approximation perspective. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch e-Buc, E. Fox, and R. Garnett (eds.), Adv. Neur. Info. Proc. Syst., volume 32. Curran Associates, Inc., 2019. Dennis Sun. Introduction to probability. https://dlsun.github.io/probability/, 2020. Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. International Conference in Learning Representations, 2021. Lloyd N Trefethen. Approximation theory and approximation practice, extended edition. SIAM, 2019. C. Tsay. Sobolev trained neural network surrogate models for optimization. Comp. Chem. Eng., 153:107419, 2021. N.N. Vlassis and W. Sun. Sobolev training of thermodynamic-informed neural networks for interpretable elasto-plasticity models with level set hardening. Compu. Meth. Appl. Mech. Eng., 377: 113695, 2021. Aaron Voelker, Ivana Kaji c, and Chris Eliasmith. Legendre memory units: Continuous-time representation in recurrent neural networks. Advances in neural information processing systems, 32, 2019. Shida Wang and Qianxiao Li. Stable SSM: Alleviating the curse of memory in state-space models through stable reparameterization. ar Xiv preprint ar Xiv:2311.14495, 2023. Shida Wang and Beichen Xue. State-space models with layer-wise nonlinearity are universal approximators with exponential decaying memory. Advances in Neural Information Processing Systems, 36, 2024. Z.-Q. J. Xu. Frequency principle: Fourier analysis sheds light on deep neural networks. Commun. Comput. Phys., 28(5):1746 1767, 2020. G. Yang and H. Salman. A fine-grained spectral perspective on neural networks. ar Xiv preprint ar Xiv:1907.10599, 2019. Cagatay Yildiz, Markus Heinonen, and Harri L ahdesm aki. Continuous-time model-based reinforcement learning. In International Conference on Machine Learning, pp. 12009 12018. PMLR, 2021. Annan Yu and Alex Townsend. Leveraging the hankel norm approximation and data-driven algorithms in reduced order modeling. Numerical Linear Algebra with Applications, pp. e2555, 2024. Annan Yu, Yunan Yang, and Alex Townsend. Tuning frequency bias in neural network training with nonuniform data. International Conference on Learning Representations, 2023. Annan Yu, Michael W Mahoney, and N Benjamin Erichson. There is hope to avoid hippos for long-memory state space models. ar Xiv preprint ar Xiv:2405.13975, 2024a. Annan Yu, Arnur Nigmetov, Dmitriy Morozov, Michael W. Mahoney, and N. Benjamin Erichson. Robustifying state-space models for long sequences via approximate diagonalization. In The Twelfth International Conference on Learning Representations, 2024b. Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, pp. 27268 27286. PMLR, 2022. B. Zhu, J. Hu, Y. Lou, and Y. Yang. Implicit regularization effects of the Sobolev norms in image processing. ar Xiv preprint ar Xiv:2109.06255, 2021. Published as a conference paper at ICLR 2025 A MORE ON THE DEFINITION OF FREQUENCY BIAS In this section, we provide an additional figure that explains the frequency bias of an SSM. Figure 6: If the transfer function has a large total variation in the low-frequency region, then given two different low-frequency input signals, the LTI system sets very different pass rates for them. Conversely, when the transfer function has a small total variation in the high-frequency region, then given two different high-frequency input signals, the LTI system must set similar pass rates. The Fourier transform of a cosine wave involves both a positive and a negative frequency. We drop the negative frequency component for the cleanliness of the figure. We also provide additional justifications for our definition of the frequency bias in section 2. As noted in Yu et al. (2024a), one reason why an LTI system in an SSM degenerates is that its transfer function is too flat. That is, if G can be well-approximated by a much lower-degree rational function, i.e., inf G Rd sup s R |G(is) G(is)| is small for some d n, where Rd is the space of rationals of degree d, then the large LTI system can be well-approximated by a much smaller one (i.e., given the same input u, the outputs of the two systems are close to each other). In other words, our LTI system is not as expressive as its size may have suggested. If we now restrict our attention to only a part of the frequency domain, then the same reasoning applies: if the transfer function G is flat on [a, b], then that means inf G Rd sup s [a,b] |G(is) G(is)| is small for some d n. Therefore, there exists a much smaller LTI system Γ and its actions on the Fourier modes in [a, b] are very similar to those of the original system. Hence, this essentially means that our LTI system is unable to capture complex patterns in the frequency interval [a, b] because all it does in [a, b] can also be done by a much smaller system. In this section, we provide the proofs of all theoretical statements in the manuscript. Published as a conference paper at ICLR 2025 First, we prove the statements about the initialization of the LTI systems. The total variation can be bounded straightforwardly using the decay of the transfer functions. Proof of Lemma 1. Since G is the real part of G, its total variation is always no larger than the total variation of G. Hence, we have V B ( G) V B (G) The other bound is similarly obtained. Proof of Corollary 1. By Lemma 1 and the H older s inequality, we have |cj| |wj + B| B C 2 w 2, where wj = 1/(wj + B). Since ξj, ζj N(0, 1) i.i.d., we have that B C 2 2 follows the χ2distribution with degree 2n. By Laurent & Massart (2000), we have with probability at least 1 δ that B C 2 2 2n + p 2n ln(1/δ) + 2 ln(1/δ) 2( n + p ln(1/δ))2. By the definition of B, we have that w 2 2 n (B n/2)2 . The result for V B ( G) follows from the last two inequalities. The bound on V B ( G) can be derived similarly. The initialization in Corollary 1 is called the Hi PPO-Lin initialization. Using the same idea, we can immediately prove a result for the Hi PPO-Leg S initialization. Corollary 2. Assume that the diagonal entries aj are initialized using Hi PPO-Leg S (Gu et al., 2022a) and ξj, ζj N(0, 1) i.i.d., where N(0, 1) is the standard normal distribution. Then, given B > 0 and δ > 0, we have V B ( G), V B ( G) ln(1/δ)) B with probability 1 δ. Proof. The proof is done by noting that every aj is real-valued and mimicing the proof of Corollary 1. We skip the proof of Proposition 1 and defer it to Appendix D when we present a detailed derivation of the scaling laws. We next prove the statement about the training dynamics of the imaginary parts of diag(A). Proof of Theorem 1. Fix some 1 j n, we first view the transfer function G(s, wj) as a function of two variables. We compute the derivative of G(s, wj) with respect to wj: ζj(s wj) ξjvj v2 j + (s wj)2 ζj(s wj) ξjvj v2 j + (s wj)2 = (v2 j + (s wj)2) ( / wj)(ζj(s wj) ξjvj) (v2 j + (s wj)2)2 (ζj(s wj) ξjvj) ( / wj)(v2 j + (s wj)2) (v2 j + (s wj)2)2 = (v2 j + (s wj)2)ζj (v2 j + (s wj)2)2 + 2(ζj(s wj) ξjvj)(s wj) (v2 j + (s wj)2)2 = ζj( v2 j (s wj)2 + 2(s wj)2) ξj(2vj(s wj)) (v2 j + (s wj)2)2 = Kj(s). Published as a conference paper at ICLR 2025 Since we assume that |( / G(is))L| = O(|s|p) for some p < 1, the integral in eq. (4) converges. By Parr & Yang (1994, Appendix A), eq. (4) holds. The statement about the training dynamics of wj given a Sobolev filter follows immediately from Theorem 1. Proof of Theorem 2. Since we assume that |( / G(is))L| = O(|s|p) for some p < 1 β, the integral in eq. (9) converges. The result follows by noting that G(β)(s, wj) wj = (1 + |s|)β G(s, wj) wj = (1 + |s|)βKj(s) = K(β) j (s) and applying the equation in Parr & Yang (1994, Appendix A). C FUNCTIONAL DERIVATIVES In this section, we briefly introduce the notion of functional derivatives (see Appendix A.1 in (Lim, 2021) for a more technical overview). To make our discussion concrete, we do it in the context of Theorem 1. Consider the transfer function G(is) defined in eq. (3). It depends on the model parameters vj, wj, ξj, and ζj. In this section, we separate out a single wj for a fixed 1 j n, leaving the remaining parameters unchanged. Then, for every w R, we can define f (w)(s) to be the transfer function G(is) when wj = w. Under this setting, the set of all possible transfer functions indexed by w, i.e., F = {f (w)|w R} is a subset of a Banach space, say L2(R). To avoid potential confusions, we shall remark that f (w) is not linear in its index w, i.e., f (w1+w2) = f (w1) + f (w2) in general, neither is F a subspace of L2(R). This does not impact our following discussion. Now, consider the loss function L. Given a choice of wj = w and a corresponding transfer function f (w), the loss function maps f (w) to a real number that corresponds on the current loss. Hence, L can be viewed as a (not necessarily linear) functional of f (w). We would like to ask: how does L respond to a small change of f (w)(s) at some s R? Ideally, this can be measured as lim ϵ 0 L(f (w) + ϵδs) f (w) where δs is the Dirac delta function at s. However, the loss function L is not defined for distributions, making eq. (10) not directly well-defined. To fix this issue, we have to go through the functional derivatives. The idea, as usual in functional analysis, is to pass the difficulty of handling a distribution to smooth functions that approximate it. If there exists a function ( / f (w))L such that the equation ˆ L f (w) (s)ϕ(s) ds = lim ϵ 0 L(f (w) + ϵϕ) L(f (w)) holds for all smooth C 0 functions ϕ that are infinitely differentiable and vanish at infinity, then ( / f (w))L is defined to be the functional derivative of L at f (w). Taking {ϕj} j=1 to be an approximate identity centered at s, we recover eq. (10) using ( / f (w))L(s). One nice thing about the functional derivatives is that they allow us to write down a continuous analog of the chain rule, which is the meat of Theorem 1. To get some intuition, let us first consider a function L(w) = L(f1(w), . . . , fk(w)) that depends on w via k intermediate variables f1, . . . , fk. Assuming sufficient smoothness conditions, the derivative of L with respect to w can be calculated using the standard chain rule: w = h L f1 L fk Published as a conference paper at ICLR 2025 The only difference in the case of L is that instead of depending on k discrete intermediate variables f1, . . . , fk, our L depends on a continuous family of intermediate variables f (w)(s) indexed by s R. In this case, one would naturally expect that in eq. (11), the sum becomes an integral, or equivalently, the row and the column vectors become the row and the column functions. This is indeed the case given our functional derivative: L f (w) (s) f (w)(s) This formula can be found in Parr & Yang (1994, (A.24)) and Greiner & Reinhardt (2013, sect. 2.3). D SCALING LAWS OF THE INITIALIZATION In this section, we expand our discussions in section 5.1 and give the proof of Proposition 1. Throughout this section, we assume that we use the bilinear transform to discretize our continuoustime LTI system. The length of our sequence is L and the sampling interval is t. The bilinear transform is essentially a Mobius transform between the closed left half-plane of the s-domain and the closed unit disk in the z-domain. Hence, it gives us two ways to study this filter by either transplanting the transfer function G onto the unit circle and analyzing in the discrete domain or by transplanting the FFT nodes from the z-domain to the imaginary axis in the s-domain. The two ways are equivalent, but we choose the second method for simplicity. The output of an LTI system can be computed by y = i FFT(FFT(u) G(ω)), where G is the transfer function of the discrete system and where ω = exp 2πi 0 L exp 2πi L 1 is the length-L vector consisting of Lth roots of unity. We do not have direct access to G, but we do know G, its continuous analog, in the partial fractions format. They are related by the following equation: G(z) = G(s), s = 2 t z 1 z + 1. In that case, the vector G(ω) can be equivalently written as 2 t exp 2πi j This is how we obtained eq. (6). The locations of the new samplers on the imaginary axis are shown in Figure 7, with L = 101 and t = 0.01. Note that the right figure (the s-domain) is on a logarithmic scale and only the upper half-plane is shown due to the scale. We also choose L odd; when L is even, a pole is placed at infinity in the s-domain, at which any partial fraction vanishes. Why do we go through all the pains to study this bilinear transformation? The reason is that it gives us a guideline for scaling the poles. For instance, for L = 101 and t = 0.01, if a pole has a much larger imaginary part than 104, then the discrete sequence will hardly see the effect of this partial fraction even though the underlying continuous system will. This corresponds to the intuition behind the aliasing error that we discussed in the main text. As in Proposition 1, it suffices to study a single partial fraction instead of all. Hence, instead of studying the entire transfer function together, we focus on one component of it: G(is) = 1 is a, Re(s) = 0, Re(a) < 0. This is a partial fraction in the s-domain. For a fixed L and t, this partial fraction corresponds to a bounded linear operator G : ℓ2([L]) ℓ2([L]) that maps an input sequence to an output sequence, where [L] = {1, . . . , L} is the set of the first L natural numbers. We consider the norm of this Published as a conference paper at ICLR 2025 -0.5 0 0.5 1 -2000 0 2000 100 Figure 7: The poles of the FFT samplers in the z-domain and the samplers in the s-domain for L = 101 and t = 0.01. Due to the essential difficulty of plotting the entire real line on a logarithmic scale, we only show the semilog plot of the upper half-plane of the s-domain (right). Hence, on the right figure, we omit the samplers corresponding to the lower arc of the unit circle on the left. operator, where we will find that as |Im(s)| , the norm of the operator vanishes. The rate of vanishing will guide us in selecting an appropriate range for the pole. So, how can we tell the norm of this operator? By definition, the norm of G is defined by G ℓ2 ℓ2 = sup u ℓ2=1 y ℓ2 = sup ˆu ℓ2=1 ˆy ℓ2, where y is the output of the operator G given input u, i.e., y = Gu, and the second step follows from the Parseval s identity. By H older s inequality, we further have G ℓ2 ℓ2 = sup ˆu ℓ2=1 ˆu G(ω) ℓ2 g ℓ , where G(ω) = g is the sample vector of the bilinearly transformed transfer function of G in the z-domain, i.e., g = G(ω), G(ω)j = G(ωj) = G i 2 = 1 i2 tan (πj/L) / t a. Hence, we have Re(a)2 + (Im(a) 2 tan (πj/L) / t)2 . When Im(a) > 2 tan (πj/L) / t for all j, |gj|2 is maximized when j = L/2 1, in which case we have G 2 ℓ2 ℓ2 1 Re(a)2 + (Im(a) (2/ t) tan ((π/2)(1 (L 1)/L)))2 . (12) This gives us the second rule (Law of Zero Information) when scaling the diagonal of A. This is a worst-case analysis, where we essentially assume that the Fourier coefficient of u is one-hot at the highest frequency. In practice, of course, this assumption is a bit unrealistic; in fact, the Fourier coefficients usually decay as the frequency gets higher. Therefore, we should derive another rule for the average-case scenario. We consider the operator ˆG that maps the Fourier coefficients of the inputs to those of the outputs, then the norm ˆG ℓ ℓ2 is a good average-case estimate, because arg max ˆu ℓ =1 ˆGˆu ℓ2 is necessarily at a vertex of the simplex defined by ˆu ℓ 1 1. That is, ˆuj = 1 for all 1 j L. Now, using the H older s inequality again, we have ˆG ℓ ℓ2 = sup ˆu ℓ =1 ˆu g ℓ2 g ℓ2. 1Note that we can use max instead of sup because the domain { ˆu ℓ = 1} is compact. Published as a conference paper at ICLR 2025 Hence, instead of studying the ℓ -norm of g, we consider the ℓ2-norm for the average-case estimate. The precise computation of the ℓ2-norm can be hard, but let us write out the full expression: Re(a)2 + (Im(a) (2/ t) tan (πj/L))2 . Given the imaginary part of a > 0, we grab all Fourier nodes on the jω axis that are below Im(a)/2 and lower-bound them; we also grab all above Im(a)/2 and assume that they collapse to Im(a). This gives us an estimate of the ℓ2 norm: g 2 ℓ2 |{j|(2/ t) tan (πj/L) Im(a)/2}| Re(a)2 + Im(a)2/4 + |{j|(2/ t) tan (πj/L) > Im(a)/2}| L(2 arctan(Im(a) t/4)/π) + 1 Re(a)2 + Im(a)2/4 + L(1 2 arctan(Im(a) t/4)/π) + 1 (2 arctan(Im(a) t/4)/π) + 1/L Re(a)2 + Im(a)2/4 | {z } N1 + (1 2 arctan(Im(a) t/4)/π) + 1/L Re(a)2 | {z } N2 Proof of Proposition 1. Given eq. (12) and (13). Proposition 1 follows immediately. Let us take a closer look at this expression. Ideally, the ℓ2-norm of g should be independent of L and t; that is, as L and t 0, we do not want g 2 ℓ2/L to diminish. First, we note that N1 and N2 are independent of L as L . As t 0, N1 inevitably vanish, regardless of the location of Im(a). In order to maintain N2 a constant, we would need Im(a) t/4 to not blow up. This gives us the first rule (Law of Zero Information) for scaling the poles. We can further work out some constants in O to be used in practice. For example, to guarantee that Im(a) is smaller than the top 5% Fourier nodes, we would need that 2 π arctan Im(a) t 0.95 Im(a) 50.82 In particular, eq. (12) and eq. (13) together give us the proof of the lower bounds in Proposition 1. The upper bounds are proved by noting all all derivations in this section are asymptotically tight. E MORE NUMERICAL EXPERIMENTS ON THE ILLUSTRATIVE EXAMPLE In section 4 and 5.2, we see that using a Sobolev-norm-based filter with β > 0, one is able to escape from the local minima caused by small local noises. In this section, we present a similar set of experiments to show the effect of our filter, even when β < 0. We choose our new objective function to be F(is) = Re 5 is ( 1 75i) + 0.2 is ( 1 + 25i) Compared to the objective in section 4, we see two differences. First, we remove the sinusoidal noises around the origin. Second, we shift the locations of the two modes in the target: instead of locating at s = 50 and s = 50, we shift them to s = 75 and s = 25, respectively. This allows us to have a large high-frequency mode and a small low-frequency mode in the ground truth. We show the results in Figure 8 when we train an LTI system using a Sobolev-norm-based filter with different values of β. Note that when β = 0, the picture only differs from Figure 1 (middle) in the frequency labels because in that case, the gradient ( / w)L only cares about the relative difference w s but not the absolute values of s. From Figure 8, we see that as β increases, more trajectories converge to the local minimum near (ξ = 5, w = 75). The reason is that a larger β favors a higher frequency (see Theorem 2). Published as a conference paper at ICLR 2025 H0 (i.e., L2) Loss w Figure 8: An SSM is trained with a filter based on the Sobolev-norm (see section 5.2). The plots are read in the same way as those in Figure 1. The transfer function converges to one of the two local minima. As β ranges from 2 to 2, the transfer function becomes more sensitive to the highfrequency global information rather than the local information. Hence, the edge between the two different convergences shifts rightward. F IMPLEMENTATION OF THE SOBOLEV-NORM-BASED FILTER There are two popular ways to evaluate an LTI system: via a convolutional kernel (Gu et al., 2022b) or via parallel scans (Smith et al., 2023). The first approach is directly based on the frequency domain computation in eq. (8); thus, the implementation is trivial. The parallel scan algorithm does not operate in the frequency domain, making the computation of eq. (8) less trivial. However, by noting that ˆy(s) = G(β)(is)ˆu(s) = G(is) (1 + |s|)β ˆu(s) , one can imagine that we apply the standard parallel scan algorithm to the transformed inputs (1 + |s|)β ˆu(s) . Hence, all we need to do is to preprocess the inputs by applying the filter (1 + |s|)β in the Fourier domain and then use the standard parallel scan algorithm. G CHOICE OF THE SCALING FACTOR In this section, we briefly provide a practical guideline for how α can be chosen. We will advocate two ways to select an αmax as an upper bound of α. Then, given αmax, one needs to tune a hyperparameter α (0, αmax) that gives a satisfactory amount of frequency bias. We do not recommend tuning α together with other hyperparameters. Instead, we suggest starting with α = αmax, and once all the rest of the hyperparameters are chosen, one can tune α using the bisection method, which only requires O(log αmax) many trials. The reason why the bisection method works is that we believe in a unique local minimum of the loss as α changes (see Figure 5). To select an αmax, we can either be informed by the input data or not. If we choose not to study the input data, then we propose to guarantee that Im(a) is smaller than the top 10% Fourier nodes at which we sample the transfer function G (see Appendix D). That is, 2 π arctan Im(a) t 0.9 Im(a) 25.26 t αmax = 50.52 The other way is to select αmax based on the input data. That is, one can plot the densities of the FFTs of all input training data and identify an edge smax for which [ smax, smax] contains most densities. Since the Fourier domain is one-dimensional, one can identify smax by simply eyeballing. Then, we can select αmax so that the information in [ smax, smax] can be learned. That is, L t αmax = smax πn L t, where L is the length of the sequence. Published as a conference paper at ICLR 2025 H DETAILS OF THE EXPERIMENTS H.1 DENOISING SEQUENTIAL AUTOENCODER For every image in the Celeb A dataset, we reshaped it to have a resolution of 1024 256 pixels to allow for higher-frequency noises. We trained a single-layer S4D model with n = 128 and d model = 3. We dropped the skip connection D from the model. The model was trained using the MSE loss. That is, for every predicted sequence of pixels, we compared the model against the true image and computed the 2-norm of the difference vector. Flattening Vertical Stripes Horizontal Stripes Low-Frequency Noises High-Frequency Noises Image Sequence of Pixels Figure 9: When a noisy image is flattened into pixels using the row-major order, the noises induced by horizontal stripes become low-frequency noises while those induced by vertical stripes become high frequencies. We trained the model on the original images, i.e., those without any noises. When we inferred from the model, we added noises to the inputs, introducing about 10 cycles of horizontal or vertical stripes, respectively. Our noises were large, almost shielding the underlying images. When the value of a pixel was out of range, then we ignored such as issue during training; we clipped its value to the appropriate range when rendering the image in Figure 4. To obtain the numbers in Table 2, we computed with our trained models, where we set the inputs to be pure horizontal or vertical noises with no underlying images. Then, we evaluated the size of the output image and took the ratio of the outputs over the inputs. We call this value the pass rate of a particular noise. Table 2 shows the ratio between the pass rate of the low-frequency noises over the high-frequency ones. Our model did not have a nonlinear activation function, which made the model linear. Hence, it does not matter what the magnitude of the inputs was. H.2 LONG-RANGE ARENA In this section, we present the hyperparameters of our models trained on the Long-Range Arena tasks. Our model architecture and hyperparameters are almost identical to those of the S4D models reported in Gu et al. (2022a), with only two exceptions: for the List Ops experiment, we set n = 2 instead of n = 64, which aligns with Smith et al. (2023) instead; for the Path X experiment, we set d model = 128 to reduce the computational burden. We do not report the dropout rates since they are set to be the same as those in Gu et al. (2022a). Also, we made β a trainable parameter. I SUPPLEMENTARY EXPERIMENTS I.1 PREDICT THE MAGNITUDES OF WAVES In Figure 1, we see an example of the frequency bias of SSMs, where the model is better at extracting the wave information of a low-frequency wave than a high-frequency one. In this section, we produce more examples on the same task to show that one is able to tune frequency bias by playing with α and β we introduced in section 5.1 and 5.2, respectively. Published as a conference paper at ICLR 2025 Task Depth #Features Norm Prenorm α LR BS Epochs WD Range List Ops 8 256 BN False 3 0.002 50 80 0.05 (1e-3,1e0) Text 6 256 BN True 5 0.01 32 300 0.05 (1e-3,1e-1) Retrieval 6 128 BN True 3 0.004 64 40 0.03 (1e-3,1e-1) Image 6 512 LN False 3 0.01 50 1000 0.01 (1e-3,1e-1) Pathfinder 6 256 BN True 3 0.004 64 300 0.03 (1e-3,1e-1) Path-X 6 128 BN True 5 0.001 20 80 0.03 (1e-4,1e-1) Table 4: Configurations of our S4D model, where LR, BS, and WD stand for learning rate, batch size, and weight decay, respectively. The hyperparameter α is the scaling factor introduced in section 5.1. We set the parameter in section 5.2 as a trainable parameter to reduce the need for hyperparameter tuning. Frequency bias is . . . Counterbalanced s = 1 s = 16 s = 256 Figure 10: Reproduction of the experiment shown in Figure 1, with difference choices of (α, β). From the left to right, our choices of (α, β) are (0.01, 1), (1, 0) (the default), (10, 0.5), and (100, 1), respectively. The legend applies to all pictures. We can see that these choices allow us to enhance, counterbalance, or even reverse the frequency bias. We see from Figure 10 that by tuning hyperparameters, we can change the frequency bias of an SSM. In particular, when α = 100 and β = 1, we reversed the frequency bias so that the magnitude of the low-frequency wave cos(t) cannot be well-predicted, while a high-frequency wave is captured relatively well. I.2 TUNING FREQUENCY BIAS IN MOVING MNIST VIDEO PREDICTION In this section, we present an experiment to tune frequency bias in a video prediction task. We show that our frequency bias analysis and the tuning strategies not only work for vanilla SSMs but also their variants. We examine a model architecture called Conv S5 that combines SSMs and spatial convolution (Smith et al., 2024). We apply the model to predict movies from the Moving MNIST dataset (Srivastava et al., 2015). In this dataset, two (or more) digits taken from the MNIST dataset (Deng, 2012) move on a larger canvas and bounce when touching the border. This forms a video over time. In our experiment, we slightly modify the movies by coloring the two digits. In particular, every movie contains two moving digits a fast-moving red one and a slow-moving blue one. The speed of the red digit is ten times that of the blue digit; consequently, the red digit can be considered as a high-frequency component, whereas the blue digit is a low-frequency component. Our goal in this experiment is to use a Conv S5 model to generate up to 100 frames, conditioned on 500 frames. The Conv S5 model applies LTI systems to the time domain (i.e., the axis of the frames), but in the meantime incorporates spatial convolutions in the LTI systems, where the LTI systems are still initialized by the Hi PPO initialization. In this experiment, we train two models using two different initializations. The first initialization we use is the default Hi PPO initialization. Then, we try another initialization, where for every wj that is the imaginary part of an eigenvalue of A, we transform wj by wj 7 sign(wj)(|wj| + 200). (14) That is, we shift every wj away from the origin by 200. This does not correspond to any α > 0 that we introduced in section 5.1, but our intuition is still based on our discussions in section 3 and 4. That is, when we move away every wj that is contained in [ 200, 200], our model is incapable Published as a conference paper at ICLR 2025 of handling the low frequencies. This is indeed observed in Figure 11: when we use the original Hi PPO initialization, the high-frequency red digit cannot be predicted, whereas when we modify the initialization based on eq. (14), we well-predicted the red digit but the low-frequency blue digit is completely distorted. Conditioning Figure 11: A Conv S5 model is trained to predict the Moving MNIST videos. In Model1 , we use the original Hi PPO initialization; in Model2 , we modify the initialization based on eq. (14). We see that when we use the Hi PPO initialization, only the slow-moving blue digit can be generated; on the other hand, pushing all eigenvalues of A to the high-frequency regions (see eq. (14)) allows us to predict the fast-moving red digit.