# Random Feature Representation Boosting

Nikita Zozoulenko¹, Thomas Cass*¹, Lukas Gonon*¹²

Abstract

We introduce Random Feature Representation Boosting (RFRBoost), a novel method for constructing deep residual random feature neural networks (RFNNs) using boosting theory. RFRBoost uses random features at each layer to learn the functional gradient of the network representation, enhancing performance while preserving the convex optimization benefits of RFNNs. In the case of MSE loss, we obtain closed-form solutions to greedy layer-wise boosting with random features. For general loss functions, we show that fitting random feature residual blocks reduces to solving a quadratically constrained least squares problem. Through extensive numerical experiments on tabular datasets for both regression and classification, we show that RFRBoost significantly outperforms RFNNs and end-to-end trained MLP ResNets in the small- to medium-scale regime where RFNNs are typically applied. Moreover, RFRBoost offers substantial computational benefits, and theoretical guarantees stemming from boosting theory.

*Equal last authors. ¹Department of Mathematics, Imperial College London, UK. ²School of Computer Science, University of St. Gallen, Switzerland. Correspondence to: Nikita Zozoulenko.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Random feature neural networks (RFNNs) are single-hidden-layer neural networks where all model parameters are randomly initialized or sampled, with only the linear output layer being trained. This approach presents a computationally efficient alternative to neural networks trained via stochastic gradient descent (SGD), avoiding the challenges associated with non-convex optimization and vanishing/exploding gradients. Despite their simplicity, RFNNs and related random feature models have strong provable generalization guarantees (Rahimi & Recht, 2008b; Rudi & Rosasco, 2017; Lanthaler & Nelsen, 2023; Cheng et al., 2023), and have demonstrated state-of-the-art performance and speed across various tasks (Bolager et al., 2023; Dempster et al., 2023; Gattiglio et al., 2024; Prabhu et al., 2024).

Recent theoretical work on Fourier RFNNs has shown that deep residual RFNNs can achieve lower generalization errors than their single-layer counterparts (Kammonen et al., 2022). Current theory and algorithms for training deep RFNNs, however, are limited to the Fourier activation function (Davis et al., 2024), and use ideas from control theory to sample from optimal weight distributions. While skip connections have been crucial to the success of deep end-to-end-trained ResNets (He et al., 2015), introducing them into general RFNNs is not straightforward, as naively stacking random layers may degrade performance.

In this paper, we introduce random feature representation boosting (RFRBoost), a novel method for constructing deep ResNet RFNNs. Our approach not only significantly improves performance, but also retains the highly tractable convex optimization framework inherent to RFNNs, with theoretical guarantees stemming from boosting theory.

ResNets have traditionally been studied from two primary perspectives. The first views a ResNet $\Phi_t$, defined by

$$\Phi_t(x) = \Phi_{t-1}(x) + g_t(\Phi_{t-1}(x)), \qquad (1)$$

as an Euler discretization of a dynamical system $d\Phi_t(x) = g_t(\Phi_t(x))\,dt$ (E, 2017). Here $g_t$ represents a residual block at layer $t$, often expressed as $g_t(x) = A\sigma(Bx + b)$.
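As a concrete illustration of equation (1), the following minimal NumPy sketch (our own addition, not the paper's implementation) composes a few residual blocks of the stated form $g_t(x) = A\sigma(Bx + b)$ with randomly sampled weights; Section 3 is concerned with choosing $A$ in a principled way rather than at random.

```python
import numpy as np

def random_block(D, p, rng, scale=1.0):
    """One residual block g(x) = A sigma(Bx + b) with randomly sampled weights."""
    B = rng.normal(0, scale / np.sqrt(D), size=(p, D))
    b = rng.normal(0, scale, size=p)
    A = rng.normal(0, scale / np.sqrt(p), size=(D, p))
    return lambda x: np.tanh(x @ B.T + b) @ A.T

def resnet_forward(x, blocks):
    """Euler-style forward pass Phi_t = Phi_{t-1} + g_t(Phi_{t-1}), cf. equation (1)."""
    phi = x
    for g in blocks:
        phi = phi + g(phi)
    return phi

rng = np.random.default_rng(0)
blocks = [random_block(D=8, p=16, rng=rng) for _ in range(4)]
phi_T = resnet_forward(rng.normal(size=(32, 8)), blocks)  # shape (n=32, D=8)
```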
This dynamical systems framework was later generalized to the setting of neural ODEs (Chen et al., 2018; Dupont et al., 2019; Kidger et al., 2020; 2021; Walker et al., 2024). The second point of view is that of gradient boosting, where a ResNet can be seen as an ensemble of weak, shallow neural networks of varying sizes, $\Phi_T(x) = \sum_{t=1}^{T} g_t(\Phi_{t-1}(x))$, derived by unravelling equation (1) (Veit et al., 2016). This led to the development of gradient representation boosting (Nitanda & Suzuki, 2018; Suggala et al., 2020), which studies residual blocks via functional gradients in the space of square integrable random vectors $L_2^D(\mu)$, where $\mu$ is the distribution of the data.

A key challenge in extending RFNNs to deep ResNet architectures lies in the crucial role of the residual blocks $g_t$ when they are composed of random features. If the magnitude of $g_t$ is too small, the initial representation $\Phi_0$ dominates, rendering the added random features ineffective. Conversely, if $g_t$ is too large, information from previous layers can be lost. This problem is not merely one of scale; ideally, each residual block should approximate the negative functional gradient of the loss with respect to the network representation. However, random layers are not guaranteed to possess this property. Unlike end-to-end trained networks, where SGD with backpropagation can adjust all network weights to learn an appropriate scale and representation, RFNNs lack a comparable mechanism because their hidden layers are fixed. We address this issue by using random features at each layer of the ResNet to learn a mapping to the functional gradient of the training data, enabling tractable learning of optimal random feature residual blocks via analytical solutions or convex optimization.

1.1. Contributions

Our paper makes the following contributions:

- Introducing RFRBoost: We propose a novel method for constructing deep ResNet RFNNs, overcoming the limitations of naively stacking random feature layers. RFRBoost supports arbitrary random features, extending beyond classical random Fourier features.
- Analytical Solutions and Algorithms: For MSE loss, we derive closed-form solutions to greedy layer-wise boosting using random features by solving what we term sandwiched least squares problems, a special case of generalized Sylvester equations. For general losses, we show that fitting random feature residual blocks is equivalent to solving a quadratically constrained least squares problem.
- Theoretical Guarantees: We provide a regret bound for RFRBoost based on Rademacher complexities and established results from boosting theory.
- Empirical Validation: Through numerical experiments on 91 tabular regression and classification tasks from the curated OpenML repository, we demonstrate that RFRBoost significantly outperforms both single-layer RFNNs and end-to-end trained MLP ResNets, while offering substantial computational advantages.

1.2. Related literature

Classical Boosting: Boosting aims to build a strong ensemble of weak learners via additive modelling of the objective function, dating back to the 1990s with the development of AdaBoost (Freund & Schapire, 1997). Gradient boosting, introduced as a generalization of boosting, supports general differentiable loss functions (Mason et al., 1999; Friedman et al., 2000; Friedman, 2001), and includes popular frameworks such as XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost (Prokhorenkova et al., 2018).
These models typically use decision trees as weak learners and are widely considered the best out-of-the-box models for tabular data (Grinsztajn et al., 2022).

Boosting for Neural Networks: Applications of boosting for neural networks first appeared in AdaNet (Cortes et al., 2017), which built a network graph using boosting to minimize a data-dependent generalization bound. Huang et al. (2018) introduced an AdaBoost-inspired algorithm for sequentially learning residual blocks, boosting the feature representation rather than the class labels. Nitanda & Suzuki (2018; 2020) proposed using Fréchet derivatives in $L_2^D(\mu)$ to learn residual blocks that preserve small functional gradient norms, motivated by Reproducing Kernel Hilbert Space (RKHS) theory and smoothing techniques. GrowNet (Badirli et al., 2020) constructs an ensemble of neural networks in the classical sense of gradient boosting the labels, not as a ResNet, but by concatenating the features of previous models to the next weak learner. Suggala et al. (2020) introduced gradient representation boosting, which studies layer-wise training of ResNets by greedily or gradient-greedily minimizing a risk function, and gives modular excess risk bounds based on Rademacher complexities. Yu et al. (2023) used gradient boosting to improve the training of dynamic depth neural networks. Finally, Emami & Martínez-Muñoz (2023) constructed a network neuron-by-neuron using gradient boosting.

Random feature models: Random feature models use fixed, randomly generated features to map data into a high dimensional space, enabling efficient linear learning. These methods encompass a broad class of models studied under various names, including random Fourier features (Rahimi & Recht, 2007; Sriperumbudur & Szabo, 2015; Li et al., 2019; Kammonen et al., 2022; Davis et al., 2024), extreme learning machines (Huang et al., 2004; 2012; Huang, 2014), random features or RFNNs (Huang et al., 2006; Rahimi & Recht, 2008a;b; Rudi & Rosasco, 2017; Carratino et al., 2018; Yehudai & Shamir, 2019; Mei & Montanari, 2022; Bolager et al., 2023; Lanthaler & Nelsen, 2023; Ayme et al., 2024), reservoir computing (Jaeger, 2001; Lukoševičius & Jaeger, 2009; Gallicchio et al., 2017; Grigoryeva & Ortega, 2018; Tanaka et al., 2019; Hart et al., 2020; Gonon et al., 2023; 2024), kernel methods (Kar & Karnick, 2012; Sinha & Duchi, 2016; Sun et al., 2018; Szabo & Sriperumbudur, 2019; Cheng et al., 2023; Wang et al., 2024; Wang & Feng, 2024), and scattering networks (Bruna & Mallat, 2013; Cotter & Kingsbury, 2017; Oyallon et al., 2019; Trockman et al., 2023). Recently, RFNNs and related approaches have demonstrated exceptional performance in both speed and accuracy across a wide variety of applications. These include solving partial differential equations (Nelsen & Stuart, 2021; Gonon, 2023; Gattiglio et al., 2024; Neufeld & Schmocker, 2024), online learning (Prabhu et al., 2024), time series classification (Dempster et al., 2020; 2023), and mathematical finance (Jacquier & Zuric, 2023; Herrera et al., 2024). RFNNs have also been explored in the context of quantum computing (Innocenti et al., 2023; Gonon & Jacquier, 2023; Martínez-Peña & Ortega, 2023; Xiong et al., 2024), and in relation to random neural controlled differential equations and state-space models (Cuchiero et al., 2021; Cirone et al., 2023; 2024; Biagini et al., 2024).
2. Functional Gradient Boosting

In this section we introduce notation and give a brief overview of classical gradient boosting (Mason et al., 1999; Friedman, 2001), and its connection to deep ResNets via gradient representation boosting (Nitanda & Suzuki, 2018; 2020; Suggala et al., 2020).

Let $(X, Y) \sim \mu$ be a random sample with corresponding probability measure $\mu$. Denote by $X \sim \mu_X$ the features in $\mathbb{R}^q$, with targets $Y \sim \mu_Y$ in $\mathbb{R}^d$, and $\mu_{Y|X=x}$ the conditional law. Let $\hat\mu = \frac{1}{n}\sum_{i=1}^{n}\delta_{(x_i,y_i)}$ be the empirical measure of $\mu$. We will work in the Hilbert space $L_2^D(\mu)$, with inner product given by $\langle f, g \rangle_{L_2^D(\mu)} = \mathbb{E}_\mu[\langle f, g \rangle_{\mathbb{R}^D}]$. The concept of a functional gradient plays a key role in gradient boosting:

Definition 2.1. Let $H$ be a Hilbert space. A function $f : H \to \mathbb{R}$ is said to be Fréchet differentiable at $h \in H$ if there exists a map $\nabla f : U \to H$ on some open neighbourhood $U \subseteq H$ of $h$ such that $f(h + g) = f(h) + \langle g, \nabla f(h) \rangle_H + o(\|g\|_H)$ for $g \in U$. The element $\nabla f(h)$ is often referred to as the functional gradient of $f$ at $h$, or simply the gradient.

2.1. Traditional Gradient Boosting

Let $\ell : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a sufficiently regular loss function. Traditional gradient boosting seeks to minimize the risk $\mathcal{R}(F) = \mathbb{E}_\mu[\ell(F(X), Y)]$ by additively modelling $F \in \mathrm{span}(\mathcal{G})$ within a space of weak learners $\mathcal{G}$, typically a set of decision trees (Friedman, 2001; Chen & Guestrin, 2016). It iteratively updates an estimate $F_t = \sum_{s=1}^{t} \eta \alpha_s g_s$ based on a first-order expansion of the risk: $\mathcal{R}(F_t + g) \approx \mathcal{R}(F_t) + \langle g, \nabla\mathcal{R}(F_t) \rangle_{L_2^d(\mu)}$. To minimize the risk based on this approximation, a new weak learner $g_{t+1}$ is fit to the negative functional gradient $-\nabla\mathcal{R}(F_t)$, and the ensemble is then updated: $F_{t+1} = F_t + \eta \alpha_{t+1} g_{t+1}$, where $\eta, \alpha_{t+1} > 0$ are the global and local learning rates. The functional gradient can be shown to be equal to $\nabla\mathcal{R}(F)(x) = \mathbb{E}_{\mu_{Y|X=x}}[\nabla_1 \ell(F(x), Y)]$, and is easily computed for empirical measures $\hat\mu = \frac{1}{n}\sum_{i=1}^{n}\delta_{(x_i,y_i)}$ using the empirical risk $\hat{\mathcal{R}}(F)$. The method for fitting $g_{t+1}$ varies between implementations and depends on the class of weak learners, as seen in modern approaches like XGBoost (Chen & Guestrin, 2016), which incorporates second-order information.

2.2. Gradient Representation Boosting

Unlike classical boosting, which boosts in target space, neural network gradient representation boosting aims to additively learn a feature representation by modelling $F$ via $F_t(X) = W_t^\top \Phi_t(X)$, where $\Phi_t = \sum_{s=0}^{t} \eta g_s$ is a gradient boosted feature representation and $W_t \in \mathbb{R}^{D \times d}$ is the top-level linear predictor. In other words, gradient representation boosting seeks to learn the feature representation $\Phi_t$ additively to build a single strong predictor, rather than constructing $F_t$ as an ensemble of weak predictors (Huang et al., 2018; Nitanda & Suzuki, 2018; 2020; Suggala et al., 2020). For a depth $t$ ResNet $\Phi_t$, defined recursively as

$$\Phi_t = \Phi_{t-1} + \eta h_t(\Phi_{t-1}), \qquad (2)$$

we see by unravelling (2) that the weak learners $g_t$ are of the form $g_t = h_t(\Phi_{t-1})$. These functions $g_t$ are sometimes referred to as weak feature transformations. Prior work has employed shallow neural networks as $g_t$, trained greedily layer-by-layer with SGD (see references above). In the setting of gradient representation boosting we denote the risk as $\mathcal{R}(W, \Phi) = \mathbb{E}_\mu[\ell(W^\top \Phi(X), Y)]$, and the functional gradient with respect to the feature representation $\Phi$ is

$$\nabla_2 \mathcal{R}(W, \Phi)(x) = \mathbb{E}_{\mu_{Y|X=x}}\big[W \nabla_1 \ell(W^\top \Phi(x), Y)\big].$$

Let $\hat{\mathcal{R}}(W, \Phi)$ be the empirical risk, $\mathcal{W}$ the hypothesis set of top-level linear predictors, and $\mathcal{G}_t$ the set of weak feature transformations.
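To make the functional gradient concrete: for MSE loss $\ell(u, y) = \frac{1}{2}\|u - y\|^2$ we have $\nabla_1 \ell(u, y) = u - y$, so on an empirical sample the gradient $\nabla_2 \hat{\mathcal{R}}(W, \Phi)$ evaluates to $W(W^\top \Phi(x_i) - y_i)$ at each data point. A minimal NumPy sketch (our illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 100, 8, 3
Phi = rng.normal(size=(n, D))   # current representation Phi(x_i), one row per sample
Y = rng.normal(size=(n, d))     # targets y_i
W = rng.normal(size=(D, d))     # top-level linear predictor

# For MSE, grad_1 l(u, y) = u - y, so the functional gradient at x_i is
# W (W^T Phi(x_i) - y_i); stacked over samples this gives an (n, D) matrix.
G = (Phi @ W - Y) @ W.T
```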
As outlined by Suggala et al. (2020), there are two main strategies for training ResNets via gradient representation boosting, which we describe below.

1.) Exact-Greedy Layer-wise Boosting: At boosting step $t$, the model is updated greedily by solving the joint optimization problem

$$W_t, g_t = \operatorname*{argmin}_{W \in \mathcal{W},\, g \in \mathcal{G}_t} \hat{\mathcal{R}}(W, \Phi_{t-1} + g), \qquad (3)$$

leading to the update $\Phi_t = \Phi_{t-1} + \eta g_t$. This approach is feasible when $W$ and $g$ are jointly optimized by SGD. For general ResNets, this translates to $g_t = h_t(\Phi_{t-1})$, where $h_t$ is a shallow neural network (i.e., a residual block) and $W$ the top-level linear predictor. However, in the context of random feature ResNets, using SGD for the joint optimization in Equation (3) undermines the computational benefits inherent to the random feature approach. As will be demonstrated in Section 3, by restricting $g_t$ to a simple residual block $g_t = A_t f_t(\Phi_{t-1})$, with $f_t$ representing random features and $A_t$ a linear map, we can derive closed-form analytical solutions for a two-stage approach to the optimization problem (3). This speeds up training and removes the need for hyperparameter tuning of the scale of the random features $f_t$ at each individual layer.

2.) Gradient-Greedy Layer-wise Boosting: An alternative to directly minimizing the risk by solving (3), which can be intractable depending on the loss $\ell$ and the family of weak learners, is to follow the negative functional gradient. At boosting iteration $t$, this approach approximates the risk using the first-order functional Taylor expansion:

$$\mathcal{R}(W, \Phi + g) \approx \mathcal{R}(W, \Phi) + \langle g, \nabla_2 \mathcal{R}(W, \Phi) \rangle_{L_2^D(\mu)}. \qquad (4)$$

A new weak learner $g_t \in \mathcal{G}_t$ is fit by minimizing the empirical inner product $\langle g, \nabla_2 \hat{\mathcal{R}}(W_{t-1}, \Phi_{t-1}) \rangle_{L_2^D(\hat\mu)}$. The depth $t$ feature representation is obtained by $\Phi_t = \Phi_{t-1} + \eta g_t$, and the top-level predictor is then updated: $W_t = \operatorname*{argmin}_{W \in \mathcal{W}} \hat{\mathcal{R}}(W, \Phi_t)$.

Potential Challenges: The gradient-greedy approach for constructing ResNets has remained relatively unexplored in the literature. Suggala et al. (2020), who introduced both strategies in the context of end-to-end trained networks, only mention in passing that the gradient-greedy strategy performed worse comparatively. We believe this might be attributed to the functional direction of $g$ failing to be properly preserved during training with SGD. Specifically, when $g$ is chosen from the family $g = h(\Phi_{t-1})$, where $h$ is a shallow neural network, minimizing $\langle g, \nabla_2 \hat{\mathcal{R}}(W_{t-1}, \Phi_{t-1}) \rangle_{L_2^D(\hat\mu)}$ via SGD might increase $\|g\|_{L_2^D(\hat\mu)}$ without ensuring that $g$ aligns with the functional gradient in $L_2^D(\hat\mu)$. Since the first-order approximation (4) only holds for $g$ with small functional norm, we argue that this objective should instead be minimized under the constraint $\|g\|_{L_2^D(\hat\mu)} = 1$. In our random feature setting, we incorporate this constraint and show in Section 4 that it leads to the gradient-greedy approach outperforming the exact-greedy strategy.

3. Random Feature Representation Boosting

Our goal is to construct a random feature ResNet of the form $\Phi_t = \Phi_{t-1} + \eta A_t f_t$, where $f_t = f_t(x, \Phi_{t-1}(x)) \in \mathbb{R}^p$ are random features, $A_t \in \mathbb{R}^{D \times p}$ is a linear map, $x$ is input data, and $\eta > 0$ is the learning rate. We propose to use the theory of gradient representation boosting (Suggala et al., 2020) to derive optimal expressions for $A_t$. We consider three different cases, where $A_t$ is either a scalar (learning an optimal learning rate), a diagonal matrix (learning a dimension-wise learning rate), or a dense matrix (learning the functional gradient). Note that the scalar and diagonal cases assume that $p = D$.

This section is outlined as follows: We first define a random feature layer. We then analyze the case of MSE loss $\ell(x, y) = \frac{1}{2}\|x - y\|_{\mathbb{R}^d}^2$, deriving closed-form solutions for layer-wise exact-greedy boosting with random features. We then explore the gradient-greedy approach, which supports any differentiable loss function.

Figure 1. Diagram of random feature representation boosting.

3.1. Random Feature Layer

By a random feature layer, we mean any mapping which produces a feature vector $f_t \in \mathbb{R}^p$ and which is not trained after initialization. This is in contrast to end-to-end trained networks, where all model weights are adjusted continuously during training. A common choice for $f_t$ is a randomly sampled dense layer, $f_t(x) = \sigma(B_t \Phi_{t-1}(x))$, with activation function $\sigma$ and weight matrix $B_t \in \mathbb{R}^{p \times D}$. To enhance the expressive power of RFRBoost, we allow the random features $f_t$ to be functions of both the input data $x$ and the previous ResNet layer's output $\Phi_{t-1}(x)$, similar to GrowNet (Badirli et al., 2020) and ResFGB-FW (Nitanda & Suzuki, 2020). Figure 1 provides a visual representation of the RFRBoost architecture. Specifically, in our experiments on tabular data, we let the random feature layer be $f_t(x) = \sigma(\mathrm{concat}(B_t \Phi_{t-1}(x), C_t x))$. The initial mapping $\Phi_0$ can for instance be a random fixed linear projection, or the identity. Instead of initializing $B_t$ and $C_t$ i.i.d., we use SWIM random features (Bolager et al., 2023) in our experiments (see Appendix E for more details). Note, however, that RFRBoost supports arbitrary types of random features, not only randomly initialized or sampled dense layers.
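A minimal sketch of such a random feature layer, using i.i.d. Gaussian weights for readability (the paper's experiments instead sample $B_t$ and $C_t$ with SWIM; see Appendix E):

```python
import numpy as np

def make_random_feature_layer(D, q, p, rng, scale=1.0):
    """f_t(x) = tanh(concat(B Phi_{t-1}(x), C x)) with fixed random B and C."""
    B = rng.normal(0, scale / np.sqrt(D), size=(p // 2, D))
    C = rng.normal(0, scale / np.sqrt(q), size=(p - p // 2, q))
    def f(x, phi):
        return np.tanh(np.concatenate([phi @ B.T, x @ C.T], axis=1))
    return f

rng = np.random.default_rng(0)
f_t = make_random_feature_layer(D=8, q=5, p=512, rng=rng)
features = f_t(rng.normal(size=(32, 5)), rng.normal(size=(32, 8)))  # shape (32, 512)
```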
3.2. Exact-Greedy Case for Mean Squared Error Loss

Recall that in the exact-greedy layer-wise paradigm, the objective at layer $t$ is to find a function $g_t \in \mathcal{G}_t$ that additively and greedily minimizes the empirical risk $\hat{\mathcal{R}}(W_t, \Phi_{t-1} + g_t)$ for some linear map $W_t \in \mathcal{W}$, given a ResNet feature representation $\Phi_{t-1}$. We propose using functional gradient boosting to construct a residual block of the form $g_t = A_t f_t$, where $A_t \in \mathbb{R}^{D \times p}$ is a linear map and $f_t$ is a random feature layer. We consider the cases where $A_t$ is a scalar multiple of the identity, a diagonal matrix, or a general dense matrix. The procedure is outlined below:

Step 1: Generate random features $f_t = f_t(x, \Phi_{t-1}(x))$, which may depend on both the raw training data $x$ and the activations at the previous layer $\Phi_{t-1}(x)$. Using MSE loss, we find that

$$A_t = \operatorname*{argmin}_{A} \hat{\mathcal{R}}\big(W_{t-1}, \Phi_{t-1} + A f_t\big) = \operatorname*{argmin}_{A} \frac{1}{n}\sum_{i=1}^{n} \big\| y_i - W_{t-1}^\top\big(\Phi_{t-1}(x_i) + A f_t(x_i)\big) \big\|^2 = \operatorname*{argmin}_{A} \frac{1}{n}\sum_{i=1}^{n} \big\| r_i - W_{t-1}^\top A f_t(x_i) \big\|^2, \qquad (5)$$

where $r_i = y_i - W_{t-1}^\top \Phi_{t-1}(x_i)$ are the residuals of the model at layer $t-1$. We term problems of the form (5) sandwiched least squares problems, for which Theorem 3.1 below provides closed-form analytical solutions which are fast to compute.

Step 2: After computing $A_t$, we obtain the depth $t$ representation $\Phi_t = \Phi_{t-1} + \eta A_t f_t$. The top-level linear regressor $W_t$ is then updated via multiple least squares:

$$W_t = \operatorname*{argmin}_{W \in \mathcal{W}} \frac{1}{n}\sum_{i=1}^{n} \| y_i - W^\top \Phi_t(x_i) \|^2. \qquad (6)$$

In practice, $\ell_2$ regularization is added to equations (5) and (6). Steps 1 and 2 are repeated for $T$ layers. The complete procedure is detailed in Algorithm 1, and a code sketch of one such step is given below.
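Anticipating the dense-case closed form stated in Theorem 3.1 below, one full greedy boosting step for MSE loss can be sketched as follows (our illustration with hypothetical helper names; `Z` holds the random features $f_t(x_i)$ row-wise, and `Phi`, `Y` the stacked representations and targets):

```python
import numpy as np

def fit_dense_sandwiched(R, W, Z, lam):
    """argmin_A (1/n) sum_i ||r_i - W^T A z_i||^2 + lam ||A||_F^2 (Theorem 3.1, dense case)."""
    n = Z.shape[0]
    Lw, U = np.linalg.eigh(W @ W.T)      # W W^T = U diag(Lw) U^T
    Lz, V = np.linalg.eigh(Z.T @ Z)      # Z^T Z = V diag(Lz) V^T
    M = U.T @ W @ R.T @ Z @ V
    return U @ (M / (n * lam + np.outer(Lw, Lz))) @ V.T

def greedy_mse_step(Phi, Z, Y, W, eta, lam):
    """One exact-greedy RFRBoost layer: fit A_t to the residuals, then refit W by ridge."""
    n, D = Phi.shape
    R = Y - Phi @ W                       # residuals r_i = y_i - W^T Phi(x_i)
    A = fit_dense_sandwiched(R, W, Z, lam)
    Phi_new = Phi + eta * Z @ A.T         # Phi_t = Phi_{t-1} + eta * A_t f_t
    W_new = np.linalg.solve(
        Phi_new.T @ Phi_new + n * lam * np.eye(D), Phi_new.T @ Y
    )                                     # ridge update of the top-level regressor
    return Phi_new, W_new
```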
Theorem 3.1. Let $r_i \in \mathbb{R}^d$, $W \in \mathbb{R}^{D \times d}$, $z_i \in \mathbb{R}^p$ for all $i \in [n]$. Let $\lambda > 0$. Consider the settings of scalar $A$, diagonal $A$, and dense $A \in \mathbb{R}^{D \times p}$. Write $Z \in \mathbb{R}^{n \times p}$ and $R \in \mathbb{R}^{n \times d}$ for the stacked data and residual matrices. Then the minimum of

$$\frac{1}{n}\sum_{i=1}^{n} \| r_i - W^\top A z_i \|^2 + \lambda \|A\|_F^2$$

is, in the scalar, diagonal, and dense cases, attained at

$$A_{\mathrm{scalar}} = \frac{\langle R, ZW \rangle_F}{\|ZW\|_F^2 + n\lambda},$$

$$A_{\mathrm{diag}} = \big(WW^\top \odot Z^\top Z + n\lambda I\big)^{-1}\,\mathrm{diag}(W R^\top Z),$$

$$A_{\mathrm{dense}} = U \big[ (U^\top W R^\top Z V) \oslash \big(n\lambda \mathbf{1} + \mathrm{diag}(\Lambda_W) \otimes \mathrm{diag}(\Lambda_Z)\big) \big] V^\top,$$

where $\|\cdot\|_F$ is the Frobenius norm, $\odot$ denotes the element-wise product, $\oslash$ denotes element-wise division, $\otimes$ is the outer product, $\mathbf{1}$ is a matrix of ones, and $WW^\top = U \Lambda_W U^\top$ and $Z^\top Z = V \Lambda_Z V^\top$ are spectral decompositions.

Algorithm 1: Greedy RFRBoost (MSE loss)
Input: Data $(x_i, y_i)_{i=1}^n$, $T$ layers, learning rate $\eta$, $\ell_2$ regularization $\lambda$, initial representation $\Phi_0$.
  $W_0 \leftarrow \operatorname{argmin}_W \frac{1}{n}\sum_{i=1}^n \|y_i - W^\top \Phi_0(x_i)\|^2$
  for $t = 1$ to $T$ do
    Generate random features $f_{t,i} = f_t(x_i, \Phi_{t-1}(x_i))$
    Compute residuals $r_i \leftarrow y_i - W_{t-1}^\top \Phi_{t-1}(x_i)$
    Solve the sandwiched least squares problem $A_t \leftarrow \operatorname{argmin}_A \frac{1}{n}\sum_{i=1}^n \|r_i - W_{t-1}^\top A f_{t,i}\|^2 + \lambda \|A\|_F^2$
    Build the ResNet layer $\Phi_t \leftarrow \Phi_{t-1} + \eta A_t f_t$
    Update the top-level linear regressor $W_t \leftarrow \operatorname{argmin}_W \frac{1}{n}\sum_{i=1}^n \|y_i - W^\top \Phi_t(x_i)\|^2 + \lambda \|W\|_F^2$
  end for
Output: ResNet $\Phi_T$ and regressor head $W_T$.

Algorithm 2: Gradient RFRBoost (general loss)
Input: Data $(x_i, y_i)_{i=1}^n$, loss $\ell$, $T$ layers, learning rate $\eta$, $\ell_2$ regularization $\lambda$, initial representation $\Phi_0$.
  $W_0 \leftarrow \operatorname{argmin}_W \frac{1}{n}\sum_{i=1}^n \ell(W^\top \Phi_0(x_i), y_i)$
  for $t = 1$ to $T$ do
    Generate random features $f_{t,i} = f_t(x_i, \Phi_{t-1}(x_i))$
    Compute the gradient $G_i \leftarrow W_{t-1} \nabla_1 \ell(W_{t-1}^\top \Phi_{t-1}(x_i), y_i)$
    Fit the gradient $A_t \leftarrow \mathrm{least\_squares}(f_t, -\sqrt{n}\, G / \|G\|_F, \lambda)$
    Solve the line search with a convex solver $\alpha_t \leftarrow \operatorname{argmin}_{\alpha \in \mathbb{R}} \frac{1}{n}\sum_{i=1}^n \ell\big(W_{t-1}^\top(\Phi_{t-1}(x_i) + \alpha A_t f_{t,i}), y_i\big)$
    Build the ResNet layer $\Phi_t \leftarrow \Phi_{t-1} + \eta \alpha_t A_t f_t$
    Update the top-level predictor $W_t \leftarrow \operatorname{argmin}_W \frac{1}{n}\sum_{i=1}^n \ell(W^\top \Phi_t(x_i), y_i)$
  end for
Output: ResNet $\Phi_T$ and top linear layer $W_T$.

Proof. See Propositions A.1 to A.3 in the Appendix.

3.3. Gradient Boosting Random Features

While the greedy approach provides optimal solutions for MSE loss, many applications require more general loss functions, such as cross-entropy loss for classification. In such cases, we turn to the gradient-greedy strategy. Recall that in this setting, we aim to minimize the first-order functional Taylor expansion of the risk: $\mathcal{R}(W, \Phi + g) \approx \mathcal{R}(W, \Phi) + \langle g, \nabla_2 \mathcal{R}(W, \Phi) \rangle_{L_2^D(\mu)}$, which holds for functions $g$ with small $L_2^D(\mu)$-norm. As discussed in the context of general gradient representation boosting, a potential issue with directly minimizing $\langle g, \nabla_2 \hat{\mathcal{R}}(W, \Phi) \rangle_{L_2^D(\hat\mu)}$ is that $g$ might learn to maximize its magnitude without following the direction of the functional gradient. To address this, we constrain the problem to ensure that $g$ maintains a unit norm in $L_2^D(\hat\mu)$. If we restrict $g$ to residual blocks of the form $g = Af$, where $A \in \mathbb{R}^{D \times p}$ is a linear map and $f \in \mathbb{R}^p$ are random features, then solving the constrained $L_2^D(\hat\mu)$-inner product minimization problem becomes equivalent to solving a quadratically constrained least squares problem. See Appendix B for the proof.

Theorem 3.2. Let $\hat\mu = \frac{1}{n}\sum_{i=1}^n \delta_{(x_i, y_i)}$ be the empirical measure of the data. Then

$$\operatorname*{argmin}_{A \in \mathbb{R}^{D \times p} \,:\, \|Af\|_{L_2^D(\hat\mu)} \le 1} \big\langle Af, \nabla_2 \hat{\mathcal{R}}(W, \Phi) \big\rangle_{L_2^D(\hat\mu)}$$

is the solution to the quadratically constrained least squares problem

$$\text{minimize } \frac{1}{n}\sum_{i=1}^n \big\| -\nabla_2 \hat{\mathcal{R}}(W, \Phi)(x_i) - A f(x_i) \big\|^2, \quad \text{subject to } \frac{1}{n}\sum_{i=1}^n \|A f(x_i)\|^2 = 1,$$

which in particular has the closed-form analytical solution

$$A^* = -\frac{\sqrt{n}}{\|G\|_F}\, G^\top F (F^\top F)^{-1}$$

when $F$ has full rank, where $F \in \mathbb{R}^{n \times p}$ is the feature matrix, and $G \in \mathbb{R}^{n \times D}$ is the matrix given by $G_{i,j} = \nabla_2 \hat{\mathcal{R}}(W, \Phi)(x_i)_j$.
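The closed form in Theorem 3.2 translates directly into code; a minimal sketch (our illustration, omitting the $\ell_2$ regularization used in practice, cf. Remark B.2):

```python
import numpy as np

def fit_gradient_block(F, G):
    """Closed form A* = -sqrt(n)/||G||_F * G^T F (F^T F)^{-1} of Theorem 3.2.

    F: (n, p) random feature matrix, G: (n, D) functional gradient matrix.
    Returns A of shape (D, p), normalized per the theorem's constraint and
    maximally anti-aligned with the functional gradient.
    """
    n = F.shape[0]
    A_ls = np.linalg.solve(F.T @ F, F.T @ G).T      # least squares fit, shape (D, p)
    return -np.sqrt(n) / np.linalg.norm(G) * A_ls
```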
The procedure for using Gradient RFRBoost at layer $t$ to build a random feature ResNet $\Phi_t$ is outlined below (see also Figure 1):

Step 1: Generate random features $f_t$. Compute the data matrix of the functional gradient, $G_{i,j} = \nabla_2 \hat{\mathcal{R}}(W_{t-1}, \Phi_{t-1})(x_i)_j$. For MSE loss, this is given by

$$G^{\mathrm{MSE}}_{i,\cdot} = W_{t-1}\big(W_{t-1}^\top \Phi_{t-1}(x_i) - y_i\big),$$

and for categorical cross-entropy loss by

$$G^{\mathrm{CCE}}_{i,\cdot} = W_{t-1}\big(s(W_{t-1}^\top \Phi_{t-1}(x_i)) - e_{y_i}\big),$$

where $s$ is the softmax function and $e_{y_i}$ is the one-hot vector for label $y_i$. For the full derivation, see Appendix C. We then fit $A_t$ by fitting $f_t$ to the negative normalized functional gradient $-\sqrt{n}\, G / \|G\|_F$ via multiple least squares, according to Theorem B.1. In practice, we also include $\ell_2$ regularization.

Step 2: Find the optimal step size $\alpha_t$ via the line search $\alpha_t = \operatorname{argmin}_{\alpha > 0} \hat{\mathcal{R}}(W_{t-1}, \Phi_{t-1} + \alpha A_t f_t)$, using Theorem 3.1 for MSE loss, or via a suitable convex optimizer, such as Newton's method, for cross-entropy loss.

Step 3: Update the feature representation $\Phi_t = \Phi_{t-1} + \eta \alpha_t A_t f_t$, and the top-level linear predictor $W_t = \operatorname{argmin}_{W \in \mathcal{W}} \hat{\mathcal{R}}(W, \Phi_t)$, using an appropriate convex minimizer depending on the specific loss function, such as L-BFGS. The full gradient-greedy procedure is detailed in Algorithm 2.

3.4. Theoretical Guarantees

A key advantage of using RFRBoost to construct ResNets over traditional end-to-end trained networks is that RFRBoost inherits strong theoretical guarantees from boosting theory. We analyze RFRBoost within the theoretical framework of Generalized Boosting (Suggala et al., 2020), where excess risk bounds have been established in terms of modular Rademacher complexities of the family of weak learners $\mathcal{G}_t$ at each boosting iteration $t$. These bounds are based on the $(\beta, \epsilon)$-weak learning condition, defined below.

Definition 3.3 (Suggala et al. (2020)). Let $\beta \in (0, 1]$ and $\epsilon \ge 0$. We say that $\mathcal{G}_{t+1}$ satisfies the $(\beta, \epsilon)$-weak learning condition if there exists a $g \in \mathcal{G}_{t+1}$ such that

$$\big\langle g, \nabla_2 \mathcal{R}(W_t, \Phi_t) \big\rangle_{L_2^D(\mu)} \le \Big( -\beta \sup_{g' \in \mathcal{G}_{t+1}} \|g'\|_{L_2^D(\mu)} + \epsilon \Big) \big\| \nabla_2 \mathcal{R}(W_t, \Phi_t) \big\|_{L_2^D(\mu)}.$$

Intuitively, this condition states that there exists a weak feature transformation in $\mathcal{G}_{t+1}$ that is negatively correlated with the functional gradient. In the sample-splitting setting, where each boosting iteration uses an independent sample of size $\tilde n = n/T$, we derive the following regret bound for RFRBoost. This bound describes how the excess risk decays compared to an optimal predictor as $T$ and $\tilde n$ vary.

Theorem 3.4. Let $\ell$ be $L$-Lipschitz and $M$-smooth with respect to the first argument. For a matrix $W$, let $\lambda_{\min}(W)$ and $\lambda_{\max}(W)$ denote its smallest and largest singular values, respectively. Consider RFRBoost with the hypothesis set of linear predictors $\mathcal{W} = \{W \in \mathbb{R}^{D \times d} : \lambda_{\min}(W) > \lambda_0 > 0,\ \lambda_{\max}(W) < \lambda_1\}$, and weak feature transformations $\mathcal{G}_t = \{x \mapsto A \tanh(B \Phi_{t-1}(x)) : \lambda_{\max}(A), \lambda_{\max}(B) < \lambda_1\}$ satisfying the $(\beta, \epsilon_t)$-weak learning condition. Let the boosting learning rates be $\eta_t = c t^{-s}$ for some $s \in \big(\frac{\beta+1}{\beta+2}, 1\big)$ and $c > 0$. Then $T$-layer RFRBoost satisfies the following risk bound for any $W^*$, $\Phi^*$, and $a \in \big(0, \beta(1-s)\big)$, with probability at least $1 - \delta$ over datasets of size $n$:

$$\mathcal{R}(W_T, \Phi_T) \le \mathcal{R}(W^*, \Phi^*) + C\bigg(\frac{2}{T^{a}} + T^{2(s-1)}\Big(1 + \sqrt{\log(1/\delta)/\tilde n}\Big)\bigg),$$

where the constant $C$ does not depend on $T$, $n$, and $\delta$.

Proof. See Appendix D. For similar results in the literature, see for instance Suggala et al. (2020), Corollary 4.3.

3.5. Time Complexity

The serial time complexity of RFRBoost with MSE loss, assuming tabular data and dense random features, is $O\big(T[n(D^2 + Dd + p^2 + pD) + D^3 + dD^2 + p^3 + Dp^2]\big)$, as derived from Algorithm 2. Here $n$ is the dataset size, $D$ is the dimension of the neural network representation, $d$ is the output dimension of the regression task, $p$ is the number of random features, and $T$ is the number of layers (boosting rounds). The computation is dominated by matrix operations, which are well-suited for GPU acceleration. The classification case follows similarly.
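Before turning to experiments, the following compact sketch ties Steps 1-3 together for categorical cross-entropy loss (our illustration with hypothetical names; the Theorem 3.2 solver is inlined, the scalar line search is delegated to SciPy for brevity, and the refit of $W_t$ in Step 3 is omitted):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(Phi, W, y):
    P = softmax(Phi @ W)
    return -np.mean(np.log(P[np.arange(len(y)), y] + 1e-12))

def gradient_rfrboost_step(Phi, F, y, W, eta):
    """One Gradient RFRBoost iteration (Algorithm 2) with cross-entropy loss.

    Phi: (n, D) current representation, F: (n, p) random features,
    y: (n,) integer labels, W: (D, d) top-level linear predictor.
    """
    n = F.shape[0]
    onehot = np.eye(W.shape[1])[y]
    G = (softmax(Phi @ W) - onehot) @ W.T            # G^CCE from Step 1
    # Theorem 3.2 closed form: A = -sqrt(n)/||G||_F * G^T F (F^T F)^{-1}
    A = -np.sqrt(n) / np.linalg.norm(G) * np.linalg.solve(F.T @ F, F.T @ G).T
    step = F @ A.T                                   # A_t f_t evaluated on the data
    res = minimize_scalar(lambda a: cross_entropy(Phi + a * step, W, y),
                          bounds=(0.0, 10.0), method="bounded")  # Step 2 line search
    return Phi + eta * res.x * step                  # Step 3 update; refitting W_t omitted
```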
4. Numerical Experiments

In this section, we compare RFRBoost against a set of baseline models on a wide range of tabular regression and classification datasets, as well as on a challenging synthetic point cloud separation task. All our code is publicly available at https://github.com/nikitazozoulenko/random-feature-representation-boosting.

4.1. Tabular Regression and Classification

Datasets: We experiment on all datasets of the curated OpenML tabular regression (Fischer et al., 2023) and classification (Bischl et al., 2021) benchmark suites with 200 or fewer features. This amounts to a total of 91 datasets per model. Due to the large number of datasets, we limit ourselves to 5000 observations per dataset. We preprocess each dataset by one-hot encoding all categorical variables and normalizing all numerical features to have zero mean and variance one.

Evaluation Procedure: We use a nested 5-fold cross-validation (CV) procedure to tune and evaluate all models, run independently for each dataset. The innermost CV is used to tune all hyperparameters, for which we use the Bayesian hyperparameter tuning library Optuna (Akiba et al., 2019). For regression, we use MSE loss, and for classification, we use cross-entropy loss. All experiments are run on a single CPU core on an institutional HPC cluster, mostly comprised of AMD EPYC 7742 nodes. We report the average test scores for each model, as well as mean training times for a single fit. The average relative rank of each model is presented in a critical difference diagram, indicating statistically significant clusters based on a Wilcoxon signed-rank test with Holm correction. This is a well-established methodology for comparing multiple algorithms across multiple datasets (Demšar, 2006; García & Herrera, 2008; Benavoli et al., 2016).
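The ranking methodology can be sketched as follows (our illustration of the standard Wilcoxon-Holm procedure, not the paper's exact evaluation code):

```python
import numpy as np
from scipy.stats import wilcoxon

def pairwise_wilcoxon_holm(scores, alpha=0.05):
    """Pairwise Wilcoxon signed-rank tests across datasets, with Holm correction.

    scores: dict mapping model name -> array of per-dataset test scores.
    Returns a list of (model_a, model_b, adjusted_p, significant) tuples.
    """
    names = list(scores)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    pvals = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
    m, results = len(pvals), [None] * len(pvals)
    p_max = 0.0
    for rank, idx in enumerate(np.argsort(pvals)):   # Holm: step down from smallest p
        p_adj = min(1.0, (m - rank) * pvals[idx])
        p_max = max(p_max, p_adj)                    # enforce monotone adjusted p-values
        a, b = pairs[idx]
        results[idx] = (a, b, p_max, p_max <= alpha)
    return results
```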
Baseline Models: We compare RFRBoost, a random feature ResNet, against several strong baselines, including end-to-end (E2E) trained MLP ResNets, single-layer random feature neural networks (RFNNs), ridge regression, logistic regression, and XGBoost (Chen & Guestrin, 2016). The E2E ResNets are trained using the Adam optimizer (Kingma & Ba, 2015) with cosine learning rate annealing, ReLU activations, and batch normalization. While XGBoost is a powerful gradient boosting model, it differs fundamentally from RFRBoost. XGBoost ensembles a large number of weak decision trees (up to 1000 in our experiments) to build a strong predictor. RFRBoost, on the other hand, uses gradient representation boosting to construct a small number of residual blocks, followed by a single linear predictor. For RFRBoost and RFNNs we use SWIM random features with tanh activations, as detailed in Appendix E. In the regression setting, we evaluate three variants of RFRBoost, using a scalar, a diagonal, or a dense $A$ matrix. When using a dense $A$, we set the initial mapping $\Phi_0(x)$ to the identity; otherwise, $\Phi_0(x)$ is a randomly initialized dense layer. For classification, we use the gradient-greedy variant of RFRBoost with L-BFGS as the convex solver. We use the official implementation of XGBoost, and implement all other baseline models in PyTorch (Paszke et al., 2019).

Hyperparameters: All hyperparameters are tuned with Optuna in the innermost fold, using 100 trials per outer fold, model, and dataset. For ridge and logistic regression, we tune the $\ell_2$ regularization. For the neural network-based models, we fix the feature dimension of each residual block to 512 and use 1 to 10 layers. For E2E networks, we tune the hidden size, learning rate, learning rate decay, number of epochs, batch size, and weight decay. For RFRBoost, we tune the $\ell_2$ regularization of the linear predictor and functional gradient mapping, the boosting learning rate, and the variance of the random features. For RFNNs, we tune the random feature dimension, random feature variance, and $\ell_2$ regularization. For XGBoost, we tune the $\ell_1$ and $\ell_2$ regularization, tree depth, boosting learning rate, and the number of weak learners. For a detailed list of hyperparameter ranges, along with an ablation study comparing SWIM random features to i.i.d. Gaussian random features, we refer the reader to Appendix E.

Results: Summary results for regression and classification are presented in Tables 1 and 2 and Figures 2 and 3, respectively. Full dataset-wise results are reported in Appendix E.

Table 1. Average test RMSE and single-core CPU fit times on the OpenML regression datasets.

| Model | Mean RMSE | Fit time (s) |
| --- | --- | --- |
| Gradient RFRBoost | 0.408 | 1.688 |
| Greedy RFRBoost $A_{\text{dense}}$ | 0.408 | 2.734 |
| Greedy RFRBoost $A_{\text{diag}}$ | 0.415 | 1.631 |
| Greedy RFRBoost $A_{\text{scalar}}$ | 0.434 | 1.024 |
| XGBoost | 0.394 | 1.958 |
| E2E MLP ResNet | 0.412 | 19.309 |
| RFNN | 0.434 | 0.053 |
| Ridge regression | 0.540 | 0.001 |

Figure 2. Critical difference diagram based on pairwise relative rank of test RMSE. Bars indicate no significant difference (α = 0.05). The average rank is displayed for each model.

We find that RFRBoost ranks higher than all other baseline models. For regression tasks, the gradient-greedy version of RFRBoost outperforms the exact-greedy variant, contrary to the observations of Suggala et al. (2020) for SGD-trained gradient representation boosting. This difference likely arises because the SGD-based approach does not incorporate the $L_2^D(\mu)$-norm constraint during training, which is crucial for preserving the functional direction of the residual block. The test scores follow the ordering $A_{\text{dense}} > A_{\text{diag}} > A_{\text{scalar}}$, demonstrating that RFRBoost is more expressive when mapping random feature layers to the functional gradient, rather than simply stacking random feature layers. While XGBoost achieves a slightly lower RMSE than RFRBoost, it performs worse in terms of average rank. Moreover, RFRBoost significantly outperforms both RFNNs and E2E MLP ResNets, while being an order of magnitude faster to train than the latter. Although the reported training times are CPU-based, our implementation suggests both methods would benefit similarly from GPU acceleration, making the presented times representative.

4.2. Point Cloud Separation

We evaluate RFRBoost on a challenging synthetic dataset originating from the neural ODE literature (Sander et al., 2021). The dataset consists of 10,000 points sampled from concentric circles, and the task is to linearly separate (i.e. classify) the concentric circles while restricting the ResNet hidden size to 2; see Figure 4. We compare RFRBoost to E2E MLP ResNets, while also presenting classification results for logistic regression and RFNNs in Table 3. All models were trained with cross-entropy loss, using a 5-fold CV grid search for hyperparameter tuning.
Table 2. Average test accuracies and single-core CPU fit times on the OpenML classification datasets.

| Model | Mean acc | Fit time (s) |
| --- | --- | --- |
| RFRBoost | 0.853 | 2.519 |
| XGBoost | 0.853 | 3.859 |
| E2E MLP ResNet | 0.851 | 20.881 |
| RFNN | 0.845 | 1.189 |
| Logistic regression | 0.821 | 0.165 |

Figure 3. Critical difference diagram based on pairwise relative rank of test accuracy. Bars indicate no significant difference (α = 0.05). The average rank is displayed for each model.

Table 3. Average test accuracies on the concentric circles point cloud separation task, averaged across 10 runs.

| Model | Mean acc | Std dev |
| --- | --- | --- |
| RFRBoost | 0.997 | 0.002 |
| RFNN | 0.887 | 0.037 |
| E2E MLP ResNet | 0.732 | 0.144 |
| Logistic regression | 0.334 | 0.023 |

Figure 4. Point cloud separation of test data at each layer.

For E2E MLP ResNets, the learning rate was tuned, while for the other models, only the $\ell_2$ regularization of the classification head was tuned. The hidden size, residual block feature dimension, and activation function were fixed at 2, 512, and tanh, respectively, for all models. See Appendix E.3 for more details.

Notably, the RFNN, which uses SWIM random features to map the initial 2-dimensional input to a higher-dimensional space before applying a linear classifier, fails to classify all points correctly. RFRBoost, in contrast, achieves near-perfect linear separation by first mapping the random features to the functional gradient of the network representation, before applying a 2-dimensional linear classifier. This result is somewhat surprising because RFRBoost does not rely on explicit Fourier features or a learnt radial basis, demonstrating the power of gradient representation boosting. It further illustrates how RFRBoost can solve problems that are intractable for E2E-trained networks and RFNNs. Figure 4 shows that the E2E MLP ResNet correctly separates only two of the nine concentric rings, similar to the failure to converge observed by Sander et al. (2021).

4.3. Experiments on Larger-scale Datasets

To complement our experiments on the OpenML benchmark suite, we conducted additional full-scale evaluations on four larger datasets (two with 100k samples, two with 500k samples) to assess performance and scalability in larger data regimes. Full experimental details, including dataset splits, hyperparameter grids, and plots of training time and predictive performance versus training set size, are provided in Appendix E.5. We find that RFRBoost significantly outperforms traditional single-layer RFNNs across all tested datasets and training sizes, particularly as dataset size increases, highlighting its effectiveness in leveraging depth for random feature models. While RFRBoost performs strongly against E2E trained networks in medium-sized data regimes, our findings indicate that E2E networks and XGBoost eventually outperform RFRBoost as dataset size increases, on 3 out of 4 and 2 out of 4 datasets, respectively. We hypothesize that using RFRBoost as an initialization strategy, followed by end-to-end fine-tuning, could be a promising direction to further enhance the performance of deep networks. We leave this to future work.

5. Conclusion

This paper introduced RFRBoost, a novel method for constructing deep residual random feature neural networks (RFNNs) using boosting theory. RFRBoost addresses the limitations of single-layer RFNNs by using random features to learn optimal ResNet-like residual blocks that approximate the negative functional gradient at each layer of the network, thereby enhancing performance while retaining the computational benefits of convex optimization for RFNNs.
This procedure can be viewed as performing functional gradient descent on the network neurons, which has connections to gradient boosting theory, as opposed to classical SGD, which performs gradient descent on the network weights and biases. In our framework, we derived closed-form solutions for greedy layer-wise boosting with MSE loss, and presented a general fitting algorithm for arbitrary loss functions based on solving a quadratically constrained least squares problem. Through extensive numerical experiments on tabular datasets for both regression and classification, we demonstrated that RFRBoost significantly outperforms traditional RFNNs and end-to-end trained MLP ResNets in the small- to medium-scale regime where RFNNs are typically applied, while offering substantial computational advantages and theoretical guarantees stemming from boosting theory. RFRBoost represents a significant step towards building powerful, stable, efficient, and theoretically sound deep networks using untrained random features. Future work will focus on extending RFRBoost to other domains such as time series or image data, exploring different types of random features and momentum strategies, implementing more efficient GPU acceleration, scaling to large datasets, and using RFRBoost as an initialization strategy for large-scale end-to-end training.

Acknowledgements

TC has been supported by the EPSRC Programme Grant EP/S026347/1. NZ has been supported by the Roth Scholarship at Imperial College London, and acknowledges conference travel support from G-Research. We acknowledge computational resources and support provided by the Imperial College Research Computing Service (DOI: 10.14469/hpc/2232). For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pp. 2623-2631, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450362016.

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Ayme, A., Boyer, C., Dieuleveut, A., and Scornet, E. Random features models: a way to study the success of naive imputation. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 2108-2134. PMLR, 21-27 Jul 2024.

Badirli, S., Liu, X., Xing, Z., Bhowmik, A., Doan, K., and Keerthi, S. S. Gradient boosting neural networks: GrowNet, 2020.

Benavoli, A., Corani, G., and Mangili, F. Should we really use post-hoc tests based on mean-ranks? Journal of Machine Learning Research, 17(5):1-10, 2016.

Bertin-Mahieux, T. Year Prediction MSD. UCI Machine Learning Repository, 2011. DOI: https://doi.org/10.24432/C50K61.

Biagini, F., Gonon, L., and Walter, N. Universal randomised signatures for generative time series modelling, 2024.
Bischl, B., Casalicchio, G., Feurer, M., Gijsbers, P., Hutter, F., Lang, M., Mantovani, R. G., van Rijn, J. N., and Vanschoren, J. OpenML benchmarking suites. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.

Blackard, J. Covertype. UCI Machine Learning Repository, 1998. DOI: https://doi.org/10.24432/C50K5N.

Bolager, E. L., Burak, I., Datar, C., Sun, Q., and Dietrich, F. Sampling weights of deep neural networks. In Advances in Neural Information Processing Systems, volume 36, pp. 63075-63116. Curran Associates, Inc., 2023.

Bruna, J. and Mallat, S. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872-1886, 2013.

Carratino, L., Rudi, A., and Rosasco, L. Learning with SGD and random features. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 785-794, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450342322.

Cheng, T. S., Lucchi, A., Kratsios, A., Dokmanić, I., and Belius, D. A theoretical analysis of the test error of finite-rank kernel ridge regression. In Advances in Neural Information Processing Systems, volume 36, pp. 4767-4798. Curran Associates, Inc., 2023.

Cirone, N. M., Lemercier, M., and Salvi, C. Neural signature kernels as infinite-width-depth-limits of controlled ResNets. In Proceedings of the 40th International Conference on Machine Learning, ICML '23. JMLR.org, 2023.

Cirone, N. M., Orvieto, A., Walker, B., Salvi, C., and Lyons, T. Theoretical foundations of deep selective state-space models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 874-883. PMLR, 06-11 Aug 2017.

Cotter, F. and Kingsbury, N. Visualizing and improving scattering networks. In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1-6, 2017.

Cuchiero, C., Gonon, L., Grigoryeva, L., Ortega, J.-P., and Teichmann, J. Expressive power of randomized signature. In Advances in Neural Information Processing Systems, 2021.

Davis, O., Geraci, G., and Motamed, M. Deep learning without global optimization by random Fourier neural networks, 2024.

Dempster, A., Petitjean, F., and Webb, G. I. ROCKET: exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery, 34(5):1454-1495, 2020. ISSN 1573-756X.

Dempster, A., Schmidt, D. F., and Webb, G. I. Hydra: competing convolutional kernels for fast and accurate time series classification. Data Mining and Knowledge Discovery, 37(5):1779-1805, Sep 2023. ISSN 1573-756X.

Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(1):1-30, 2006.

Dupont, E., Doucet, A., and Teh, Y. W. Augmented neural ODEs. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
E, W. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1-11, Mar 2017. ISSN 2194-671X.

Emami, S. and Martínez-Muñoz, G. Sequential training of neural networks with gradient boosting. IEEE Access, 11:42738-42750, 2023.

Fischer, S. F., Feurer, M., and Bischl, B. OpenML-CTR23: a curated tabular regression benchmarking suite. In AutoML Conference 2023 (Workshop), 2023.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997. ISSN 0022-0000.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337-407, 2000.

Friedman, J. H. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189-1232, 2001.

Gallicchio, C., Micheli, A., and Pedrelli, L. Deep reservoir computing: A critical experimental analysis. Neurocomputing, 268:87-99, 2017. ISSN 0925-2312. Advances in artificial neural networks, machine learning and computational intelligence.

García, S. and Herrera, F. An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. Journal of Machine Learning Research, 9(89):2677-2694, 2008.

Gattiglio, G., Grigoryeva, L., and Tamborrino, M. RandNet-Parareal: a time-parallel PDE solver using random neural networks. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Gonon, L. Random feature neural networks learn Black-Scholes type PDEs without curse of dimensionality. Journal of Machine Learning Research, 24(189):1-51, 2023.

Gonon, L. and Jacquier, A. Universal approximation theorem and error bounds for quantum neural networks and quantum reservoirs, 2023.

Gonon, L., Grigoryeva, L., and Ortega, J.-P. Approximation bounds for random neural networks and reservoir systems. The Annals of Applied Probability, 33(1):28-69, 2023.

Gonon, L., Grigoryeva, L., and Ortega, J.-P. Infinite-dimensional reservoir computing. Neural Networks, 179:106486, 2024. ISSN 0893-6080.

Grigoryeva, L. and Ortega, J.-P. Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems. Journal of Machine Learning Research, 19(24):1-40, 2018.

Grinsztajn, L., Oyallon, E., and Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088.

Hart, A., Hook, J., and Dawes, J. Embedding and approximation theorems for echo state networks. Neural Networks, 128:234-247, 2020. ISSN 0893-6080.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2015.

Herrera, C., Krach, F., Ruyssen, P., and Teichmann, J. Optimal stopping via randomized neural networks. Frontiers of Mathematical Finance, 3(1):31-77, 2024.

Huang, F., Ash, J., Langford, J., and Schapire, R. Learning deep ResNet blocks sequentially using boosting theory. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2058-2067. PMLR, 10-15 Jul 2018.
Huang, G.-B. An insight into extreme learning machines: Random neurons, random features and kernels. Cognitive Computation, 6:376-390, 09 2014.

Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. Extreme learning machine: a new learning scheme of feedforward neural networks. In 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541), volume 2, pp. 985-990, 2004.

Huang, G.-B., Chen, L., and Siew, C.-K. Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Transactions on Neural Networks, 17(4):879-892, 2006.

Huang, G.-B., Zhou, H., Ding, X., and Zhang, R. Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(2):513-529, 2012.

Innocenti, L., Lorenzo, S., Palmisano, I., Ferraro, A., Paternostro, M., and Palma, G. M. Potential and limitations of quantum extreme learning machines. Communications Physics, 6(1):118, May 2023. ISSN 2399-3650.

Jacquier, A. and Zuric, Z. Random neural networks for rough volatility, 2023.

Jaeger, H. The "echo state" approach to analysing and training recurrent neural networks, with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148, 01 2001.

Kammonen, A., Kiessling, J., Plecháč, P., Sandberg, M., Szepessy, A., and Tempone, R. Smaller generalization error derived for a deep residual neural network compared with shallow networks. IMA Journal of Numerical Analysis, 43(5):2585-2632, 09 2022. ISSN 0272-4979.

Kar, P. and Karnick, H. Random feature maps for dot product kernels. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pp. 583-591, La Palma, Canary Islands, 21-23 Apr 2012. PMLR.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Kidger, P., Morrill, J., Foster, J., and Lyons, T. Neural controlled differential equations for irregular time series. In Advances in Neural Information Processing Systems, volume 33, pp. 6696-6707. Curran Associates, Inc., 2020.

Kidger, P., Foster, J., Li, X., and Lyons, T. J. Neural SDEs as infinite-dimensional GANs. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 5453-5463. PMLR, 18-24 Jul 2021.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Lanthaler, S. and Nelsen, N. H. Error bounds for learning with vector-valued random features. In Advances in Neural Information Processing Systems, volume 36, pp. 71834-71861. Curran Associates, Inc., 2023.

Li, Z., Ton, J.-F., Oglic, D., and Sejdinovic, D. Towards a unified analysis of random Fourier features. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3905-3914. PMLR, 09-15 Jun 2019.

Lukoševičius, M. and Jaeger, H. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127-149, 2009. ISSN 1574-0137.

Martínez-Peña, R. and Ortega, J.-P. Quantum reservoir computing in finite dimensions. Phys. Rev. E, 107:035306, Mar 2023.
Mason, L., Baxter, J., Bartlett, P., and Frean, M. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999.

Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667-766, 2022.

Middlehurst, M., Ismail-Fawaz, A., Guillaume, A., Holder, C., Guijo-Rubio, D., Bulatova, G., Tsaprounis, L., Mentel, L., Walter, M., Schäfer, P., and Bagnall, A. aeon: a python toolkit for learning from time series. Journal of Machine Learning Research, 25(289):1-10, 2024.

Nelsen, N. H. and Stuart, A. M. The random feature model for input-output maps between Banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212-A3243, 2021.

Neufeld, A. and Schmocker, P. Universal approximation property of Banach space-valued random feature models including random neural networks, 2024.

Nitanda, A. and Suzuki, T. Functional gradient boosting based on residual network perception. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3819-3828. PMLR, 10-15 Jul 2018.

Nitanda, A. and Suzuki, T. Functional gradient boosting for learning residual-like networks with statistical guarantees. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 2981-2991. PMLR, 26-28 Aug 2020.

Oyallon, E., Zagoruyko, S., Huang, G., Komodakis, N., Lacoste-Julien, S., Blaschko, M., and Belilovsky, E. Scattering networks for hybrid representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2208-2221, 2019.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: an imperative style, high-performance deep learning library. Curran Associates Inc., Red Hook, NY, USA, 2019.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825-2830, November 2011. ISSN 1532-4435.

Prabhu, A., Sinha, S., Kumaraguru, P., Torr, P., Sener, O., and Dokania, P. K. Random representations outperform online continually learned representations. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. CatBoost: unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, pp. 6639-6649, Red Hook, NY, USA, 2018. Curran Associates Inc.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.

Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008a.

Rahimi, A. and Recht, B. Uniform approximation of functions with random bases.
In 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 555-561, 2008b.

Rudi, A. and Rosasco, L. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Sander, M. E., Ablin, P., Blondel, M., and Peyré, G. Momentum residual neural networks. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 9276-9287. PMLR, 18-24 Jul 2021.

Sinha, A. and Duchi, J. C. Learning kernels with random features. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

Sriperumbudur, B. and Szabo, Z. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

Suggala, A., Liu, B., and Ravikumar, P. Generalized boosting. In Advances in Neural Information Processing Systems, volume 33, pp. 8787-8797. Curran Associates, Inc., 2020.

Sun, Y., Gilbert, A., and Tewari, A. But how does it work in theory? Linear SVM with random features. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Szabo, Z. and Sriperumbudur, B. On kernel derivative approximation with random Fourier features. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 827-836. PMLR, 16-18 Apr 2019.

Tanaka, G., Yamane, T., Héroux, J. B., Nakane, R., Kanazawa, N., Takeda, S., Numata, H., Nakano, D., and Hirose, A. Recent advances in physical reservoir computing: A review. Neural Networks, 115:100-123, 2019. ISSN 0893-6080.

Trockman, A., Willmott, D., and Kolter, J. Z. Understanding the covariance structure of convolutional filters. In The Eleventh International Conference on Learning Representations, 2023.

Veit, A., Wilber, M. J., and Belongie, S. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

Walker, B., McLeod, A. D., Qin, T., Cheng, Y., Li, H., and Lyons, T. Log neural controlled differential equations: The Lie brackets make a difference. In Forty-first International Conference on Machine Learning, 2024.

Wang, C. and Feng, X. Optimal kernel quantile learning with random features. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 50419-50452. PMLR, 21-27 Jul 2024.

Wang, C., Bing, X., He, X., and Wang, C. Towards theoretical understanding of learning large-scale dependent data via random features. In Forty-first International Conference on Machine Learning, 2024.

Xiong, W., Facelli, G., Sahebi, M., Agnel, O., Chotibut, T., Thanasilp, S., and Holmes, Z. On fundamental aspects of quantum extreme learning machines, 2024.

Yehudai, G. and Shamir, O. On the power and limitations of random features for understanding neural networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Yu, H., Li, H., Hua, G., Huang, G., and Shi, H. Boosted dynamic neural networks. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI'23/IAAI'23/EAAI'23. AAAI Press, 2023. ISBN 978-1-57735-880-0.
A. Analytic Solutions to Sandwiched Least Squares Problems

In this section, we derive the analytic closed-form expressions for the sandwiched least squares problems presented in Theorem 3.1. We use NumPy notation for element and row indexing, i.e. if $X$ is a matrix, then $X_i$ denotes the $i$-th row of $X$. $\|\cdot\|_F$ denotes the Frobenius norm.

Proposition A.1 (Scalar case). Let $R \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{D \times d}$, $A \in \mathbb{R}$, and $X \in \mathbb{R}^{n \times D}$. Let $\lambda > 0$. Then the minimum of
$$J(A) = \frac{1}{n} \sum_{i=1}^{n} \big\| R_i - A\, W^\top X_i \big\|^2 + \lambda A^2$$
is uniquely attained at
$$A_{\mathrm{scalar}} = \frac{\langle R, XW \rangle_F}{\|XW\|_F^2 + n\lambda} = \frac{\frac{1}{n} \sum_{i=1}^{n} \langle W^\top X_i, R_i \rangle}{\frac{1}{n} \sum_{i=1}^{n} \|W^\top X_i\|^2 + \lambda}.$$

Proof. We first rewrite the objective using the Frobenius norm as $J(A) = \frac{1}{n}\|A\,XW - R\|_F^2 + \lambda A^2$. Differentiating with respect to $A \in \mathbb{R}$ and setting $J'(A) = 0$ gives
$$0 = J'(A) = \frac{2}{n} \langle XW,\, A\,XW - R \rangle_F + 2\lambda A \;\Longleftrightarrow\; A \Big( \frac{1}{n}\|XW\|_F^2 + \lambda \Big) = \frac{1}{n} \langle XW, R \rangle_F,$$
from which the result follows by factoring out $A$. Note that the second derivative is $\frac{2}{n}\|XW\|_F^2 + 2\lambda > 0$, hence the problem is convex and the local minimum is the unique global minimum.

Python (NumPy) code for this is

    # X, W, R, n, l2_reg as in Proposition A.1; np is numpy
    XW = X @ W
    top = np.sum(R * XW) / n
    bot = np.sum(XW * XW) / n
    A = top / (bot + l2_reg)

Proposition A.2 (Diagonal case). Let $R \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{D \times d}$, $A = \mathrm{diag}(a_1, \ldots, a_D) \in \mathbb{R}^{D \times D}$, and $X \in \mathbb{R}^{n \times D}$. Let $\lambda > 0$. Then the minimum of
$$J(A) = \frac{1}{n} \sum_{i=1}^{n} \big\| R_i - W^\top A X_i \big\|^2 + \lambda \|A\|_F^2$$
is uniquely attained by the solution $a = (a_1, \ldots, a_D)$ of the system of linear equations
$$b = (C + \lambda I)\, a, \qquad C = \frac{1}{n}\, (WW^\top) \odot (X^\top X), \qquad b = \frac{1}{n}\, \mathrm{diag}(W R^\top X),$$
where $\odot$ denotes the element-wise (Hadamard) product.

Proof. We expand the objective as a function of $(a_1, \ldots, a_D)$:
$$J(a_1, \ldots, a_D) = \frac{1}{n} \sum_{i=1}^{n} \Big\| R_i - W^\top \big[ a_1 X_{i,1}, \ldots, a_D X_{i,D} \big]^\top \Big\|^2 + \lambda \sum_{k=1}^{D} a_k^2.$$
Differentiating with respect to a specific $a_k$ gives
$$0 = \frac{\partial J}{\partial a_k} = -\frac{2}{n} \sum_{i=1}^{n} (W_k X_{i,k})^\top \big( R_i - W^\top A X_i \big) + 2\lambda a_k,$$
which rearranges to
$$\frac{1}{n} \sum_{i=1}^{n} (W_k X_{i,k})^\top R_i = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{D} (W_k X_{i,k})^\top W_j X_{i,j}\, a_j + \lambda a_k.$$
This implies that the minimizer solves the linear system $b = (C + \lambda I) a$, where for $k, j \in [D]$
$$C_{k,j} = \frac{1}{n} \sum_{i=1}^{n} \langle W_k, W_j \rangle X_{i,k} X_{i,j}, \qquad b_k = \frac{1}{n} \sum_{i=1}^{n} \langle R_i, W_k \rangle X_{i,k}.$$
Simplifying, we obtain $C = \frac{1}{n}(WW^\top) \odot (X^\top X)$ and $b = \frac{1}{n}\mathrm{diag}(W R^\top X)$. The solution is unique since the objective is strictly convex with Hessian $2(C + \lambda I) \succ 0$.

Python code for this is

    # X, W, R, n, D, l2_reg as in Proposition A.2
    b = np.mean((R @ W.T) * X, axis=0)
    C = (W @ W.T) * (X.T @ X) / n
    A = np.linalg.solve(C + l2_reg * np.eye(D), b)
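As a quick numerical sanity check of the diagonal case, the closed-form linear system can be compared against a direct numerical minimization of the objective. The following is a minimal sketch under synthetic data; the shapes, the random seed, and the use of scipy.optimize.minimize are illustrative and not part of the paper's pipeline.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n, D, d, l2_reg = 200, 8, 3, 0.1  # hypothetical problem sizes
    X = rng.standard_normal((n, D))
    W = rng.standard_normal((D, d))
    R = rng.standard_normal((n, d))

    # Closed form of Proposition A.2: solve (C + lambda I) a = b.
    b = np.mean((R @ W.T) * X, axis=0)
    C = (W @ W.T) * (X.T @ X) / n
    a_closed = np.linalg.solve(C + l2_reg * np.eye(D), b)

    # Direct minimization of J(a) = (1/n)||X diag(a) W - R||_F^2 + lambda ||a||^2.
    def J(a):
        resid = (X * a) @ W - R  # X @ diag(a) equals column-wise scaling X * a
        return np.sum(resid**2) / n + l2_reg * np.sum(a**2)

    a_direct = minimize(J, np.zeros(D)).x
    print(np.max(np.abs(a_closed - a_direct)))  # small, up to solver tolerance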
Proposition A.3 (Dense case). Let $R \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{D \times d}$, $A \in \mathbb{R}^{D \times p}$, and $X \in \mathbb{R}^{n \times p}$. Let $\lambda > 0$. Then the minimum of
$$J(A) = \frac{1}{n} \sum_{i=1}^{n} \big\| R_i - W^\top A X_i \big\|^2 + \sum_{k=1}^{D} \sum_{j=1}^{p} \lambda A_{k,j}^2 = \frac{1}{n} \big\| X A^\top W - R \big\|_F^2 + \lambda \|A\|_F^2$$
is uniquely attained by the solution of the system of linear equations
$$W R^\top X = W W^\top A\, X^\top X + \lambda n A,$$
which can be solved using the spectral decompositions $WW^\top = U \Lambda_W U^\top$ and $X^\top X = V \Lambda_X V^\top$:
$$A_{\mathrm{dense}} = U \Big[ \big( U^\top W R^\top X V \big) \oslash \big( \lambda n \mathbf{1} + \mathrm{diag}(\Lambda_W) \otimes \mathrm{diag}(\Lambda_X) \big) \Big] V^\top,$$
where $\oslash$ denotes element-wise division, $\otimes$ the outer product, and $\mathbf{1}$ a matrix of ones.

Proof. Using the chain rule for matrix gradients we find
$$\nabla_A J = \frac{2}{n}\, W \big( W^\top A X^\top - R^\top \big) X + 2\lambda A,$$
so setting the gradient to zero yields $W R^\top X = W W^\top A\, X^\top X + \lambda n A$. Letting $WW^\top = U \Lambda_W U^\top$ and $X^\top X = V \Lambda_X V^\top$ be spectral decompositions, and setting $\tilde{A} = U^\top A V$, we see that
$$U^\top W R^\top X V = \Lambda_W \tilde{A} \Lambda_X + \lambda n \tilde{A}.$$
Inspecting this equation element-wise, we find
$$\big( U^\top W R^\top X V \big)_{k,j} = (\Lambda_W)_{k,k}\, \tilde{A}_{k,j}\, (\Lambda_X)_{j,j} + \lambda n \tilde{A}_{k,j} \;\Longrightarrow\; \tilde{A}_{k,j} = \frac{\big( U^\top W R^\top X V \big)_{k,j}}{(\Lambda_W)_{k,k} (\Lambda_X)_{j,j} + n\lambda},$$
from which the conclusion follows after the change of basis $A = U \tilde{A} V^\top$. The solution is unique since the problem is strictly convex, which follows from the convexity of the Frobenius norm and the linearity of the matrix expressions involving $A$, together with the strict convexity of the regularization term.

Python code for this is

    # X, W, R, n, lambda_reg as in Proposition A.3
    SW, U = np.linalg.eigh(W @ W.T)
    SX, V = np.linalg.eigh(X.T @ X)
    A = U.T @ W @ R.T @ X @ V
    A = A / (n * lambda_reg + SW[:, None] * SX[None, :])
    A = U @ A @ V.T

B. Functional Gradient Inner Product

In this section we prove Theorem 3.1, showing that minimizing the $L_2^D(\mu)$ inner product under a norm constraint for a simple random feature residual block is equivalent to solving a quadratically constrained least squares problem. We use the same matrix notation as in Appendix A.

Theorem B.1. Let $\mu_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$ be an empirical measure, $h \in L_2^D(\mu_n)$, and $f \in L_2^p(\mu_n)$. Then solving
$$\operatorname*{argmin}_{A \in \mathbb{R}^{D \times p} \text{ such that } \|Af\|_{L_2^D(\mu_n)} \le 1} \; \langle h, Af \rangle_{L_2^D(\mu_n)}$$
is equivalent to solving the quadratically constrained least squares problem
$$\operatorname*{argmin}_{A \in \mathbb{R}^{D \times p}} \; \frac{1}{n} \sum_{i=1}^{n} \big\| {-h(x_i)} - A f(x_i) \big\|^2, \qquad \text{subject to } \frac{1}{n} \sum_{i=1}^{n} \big\| A f(x_i) \big\|^2 = 1.$$
In particular, when $F$ is of full rank, we obtain the closed-form solution
$$A = -\frac{\sqrt{n}}{\|H\|_F}\, H^\top F \big( F^\top F \big)^{-1},$$
where $F \in \mathbb{R}^{n \times p}$ and $H \in \mathbb{R}^{n \times D}$ are the matrices given by $F_{i,j} = f(x_i)_j$ and $H_{i,k} = h(x_i)_k$.

Proof. Using the definition of the $L_2^D(\mu_n)$ inner product for empirical measures, we find that
$$\langle h, Af \rangle_{L_2^D(\mu_n)} = \frac{1}{n} \sum_{i=1}^{n} \big\langle h(x_i), A f(x_i) \big\rangle = \frac{1}{n} \sum_{i=1}^{n} \langle H_i, A F_i \rangle. \qquad (7)$$
Furthermore, the constraint can be expressed as
$$\|Af\|_{L_2^D(\mu_n)}^2 = \frac{1}{n} \sum_{i=1}^{n} \|A f(x_i)\|^2 = \frac{1}{n} \sum_{i=1}^{n} \|A F_i\|^2 = 1, \qquad (8)$$
with equality instead of inequality, since we always obtain a larger inner product in magnitude by normalizing by $\|Af\|_{L_2^D(\mu_n)}$. Minimizing (7) subject to (8) is equivalent to solving a constrained least squares problem, since we can write
$$\frac{1}{n} \sum_{i=1}^{n} \|H_i - A F_i\|^2 = \frac{1}{n} \sum_{i=1}^{n} \Big( \|H_i\|^2 - 2 \langle H_i, A F_i \rangle + \|A F_i\|^2 \Big),$$
where the first term $\|H_i\|^2$ is constant with respect to $A$, and the third term is fixed by the constraint. Hence, minimizing the constrained least squares problem is equivalent to maximizing the inner product, and the solution to the original problem is obtained by multiplying the least squares solution by $-1$, since we are interested in the argmin rather than the argmax.

Continuing, to solve the quadratically constrained least squares problem, we introduce the Lagrangian
$$J(A, \nu) = \frac{1}{n} \sum_{i=1}^{n} \|H_i - A F_i\|^2 - \nu \bigg( \frac{1}{n} \sum_{i=1}^{n} \|A F_i\|^2 - 1 \bigg) = \frac{1}{n} \big\| H^\top - A F^\top \big\|_F^2 - \nu \bigg( \frac{1}{n} \big\| A F^\top \big\|_F^2 - 1 \bigg).$$
Differentiating with respect to $A$ gives
$$0 = \nabla_A J(A, \nu) = -\frac{2}{n} \big( H^\top - A F^\top \big) F - \nu\, \frac{2}{n}\, A F^\top F,$$
which implies that $H^\top F = (1 - \nu) A F^\top F$, and hence
$$A = \frac{1}{1 - \nu}\, H^\top F \big( F^\top F \big)^{-1} = \frac{1}{1 - \nu}\, H^\top U \Lambda^{-1} V^\top,$$
assuming that $\nu \ne 1$ and that $F$ is of full rank, with singular value decomposition $F = U \Lambda V^\top$. The constraint becomes
$$1 = \frac{1}{n} \big\| A F^\top \big\|_F^2 = \frac{1}{n (1 - \nu)^2} \big\| H^\top U \Lambda^{-1} V^\top V \Lambda U^\top \big\|_F^2 = \frac{1}{n (1 - \nu)^2} \|H\|_F^2,$$
therefore $1 - \nu = \pm \frac{\|H\|_F}{\sqrt{n}}$. The solution to the constrained least squares problem is obtained by taking the positive sign, hence the solution to the original problem is given by the negative sign. If $\nu = 1$, then $H^\top F = 0$, implying that $\langle h, Af \rangle_{L_2^D(\mu_n)} = 0$ for all matrices $A$. Hence the same closed-form solution holds in this case too.

Remark B.2. It is clear from the proof how to augment the expression for $A$ when $F$ is not of full rank. However, in practice we instead use ridge regression for increased numerical stability. A similar result can be proven for ridge regression, albeit with a more complicated expression for $A$ involving non-trivial combinations of $\Lambda$ and $\lambda$. We omit this detail here, and simply use ridge regression in practice. Note also that $F(F^\top F)^{-1}$ can be expressed as a pseudo-inverse of $F$, after suitable transpositions of the matrices involved.
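To make the closed form concrete, the following is a minimal NumPy sketch of the solution from Theorem B.1 in its ridge-stabilized form (cf. Remark B.2). The function name and the l2_reg parameter are illustrative, and the scaling by sqrt(n)/||H||_F is exact only in the unregularized, full-rank case.

    import numpy as np

    def solve_constrained_direction(H, F, l2_reg=1e-6):
        # H: (n, D) functional-gradient evaluations h(x_i),
        # F: (n, p) random-feature evaluations f(x_i).
        # Returns A approximately minimizing <h, Af> subject to ||Af|| = 1.
        n, p = F.shape
        lstsq = H.T @ F @ np.linalg.inv(F.T @ F + l2_reg * np.eye(p))
        return -np.sqrt(n) / np.linalg.norm(H) * lstsq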
C. Gradient Calculations

For completeness, we derive the functional gradient used in Gradient RFRBoost for MSE loss, categorical cross-entropy loss, and binary cross-entropy loss. Recall that the functional gradient, in all cases, is given by
$$\nabla_2 R(W, \Phi)(x) = \mathbb{E}_{\mu_{Y|X=x}} \big[ W\, \partial_1 \ell\big( W^\top \Phi(x), Y \big) \big],$$
where $\partial_i$ denotes the gradient with respect to the $i$-th argument.

C.1. Mean Squared Error Loss

For regression, we use the mean squared error loss $\ell(x, y) = \frac{1}{2}\|x - y\|^2$. In this case, we find that $\partial_1 \ell(x, y) = x - y$, and hence
$$\nabla_2 R(W, \Phi)(x) = \mathbb{E}_{Y|X=x} \big[ W\, \partial_1 \ell\big( W^\top \Phi(x), Y \big) \big] = \mathbb{E}_{Y|X=x} \big[ W \big( W^\top \Phi(x) - Y \big) \big].$$
When $\mu = \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}$ is an empirical measure, the above reads in matrix form as
$$G = X W W^\top - Y W^\top,$$
where $X \in \mathbb{R}^{n \times D}$ and $Y \in \mathbb{R}^{n \times d}$ are the matrices given by $X_{i,j} = \Phi_j(x_i)$ and $Y_{i,k} = (y_i)_k$.

C.2. Binary Cross-Entropy Loss

Denote the binary cross-entropy (BCE) loss by $\ell(x, y) = -y \log(\sigma(x)) - (1 - y) \log(1 - \sigma(x))$, where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function. The gradient of the BCE loss with respect to the prediction $\sigma(x)$ is
$$\frac{\partial \ell}{\partial \sigma(x)} = -\frac{y}{\sigma(x)} + \frac{1 - y}{1 - \sigma(x)} = \frac{\sigma(x) - y}{\sigma(x)\big( 1 - \sigma(x) \big)}.$$
Using the chain rule together with the fact that $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ gives
$$\partial_1 \ell(x, y) = \frac{\partial \ell(x, y)}{\partial x} = \sigma(x) - y,$$
whence we obtain
$$\nabla_2 R(W, \Phi)(x) = \mathbb{E}_{\mu_{Y|X=x}} \big[ W \big( \sigma( W^\top \Phi(x) ) - Y \big) \big].$$

C.3. Categorical Cross-Entropy Loss

The analysis for the multi-class case is similar to the binary case. The cross-entropy loss $\ell : \mathbb{R}^K \times \{1, \ldots, K\} \to [0, \infty)$ for logits $x$ with true label $y$ is given by $\ell(x, y) = -\log(s_y(x))$, where $s$ is the softmax function defined by $s_y(x) = \frac{\exp(x_y)}{\sum_{j=1}^{K} \exp(x_j)}$. We aim to prove that $\partial_1 \ell(x, y) = s(x) - e_y$, where $e_y \in \mathbb{R}^K$ is the one-hot vector for $y$. To see this, consider for any $1 \le k \le K$ the following:
$$\frac{\partial}{\partial x_k} \log(s_y(x)) = \frac{1}{s_y(x)} \frac{\partial s_y(x)}{\partial x_k} = \frac{1}{s_y(x)} \left( \frac{\mathbb{1}_{y=k} \exp(x_y)}{\sum_{j=1}^{K} \exp(x_j)} - \frac{\exp(x_y) \exp(x_k)}{\big( \sum_{j=1}^{K} \exp(x_j) \big)^2} \right) = \frac{1}{s_y(x)} \big( \mathbb{1}_{y=k}\, s_y(x) - s_y(x) s_k(x) \big) = \mathbb{1}_{y=k} - s_k(x).$$
Since this holds for all $k$, we have $\partial_1 \ell(x, y) = s(x) - e_y$, and therefore
$$\nabla_2 R(W, \Phi)(x) = \mathbb{E}_{\mu_{Y|X=x}} \big[ W\, \partial_1 \ell\big( W^\top \Phi(x), Y \big) \big] = \mathbb{E}_{\mu_{Y|X=x}} \big[ W \big( s( W^\top \Phi(x) ) - e_Y \big) \big].$$
When $\mu = \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}$ is an empirical measure, the above reads in matrix form as
$$G = \big( P(X) - E_Y \big) W^\top,$$
where $P(X) \in \mathbb{R}^{n \times K}$ is the matrix given by $P(X)_{i,k} = s_k(W^\top \Phi(x_i))$, and $E_Y \in \mathbb{R}^{n \times K}$ is the matrix given by $(E_Y)_{i,k} = \mathbb{1}_{y_i = k}$.
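The empirical functional gradients above are cheap to form in NumPy. Below is a minimal sketch for the MSE and categorical cross-entropy cases; the function names and shapes are illustrative, with Phi playing the role of the feature matrix X of the current representation, and labels assumed 0-indexed.

    import numpy as np

    def functional_gradient_mse(W, Phi, Y):
        # G_i = W (W^T Phi(x_i) - y_i), i.e. G = Phi W W^T - Y W^T.
        return Phi @ W @ W.T - Y @ W.T

    def functional_gradient_ce(W, Phi, y):
        # Softmax probabilities P(X)_{i,k} = s_k(W^T Phi(x_i)).
        Z = Phi @ W
        Z -= Z.max(axis=1, keepdims=True)   # numerical stabilization
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)
        E_Y = np.eye(W.shape[1])[y]         # one-hot label matrix
        return (P - E_Y) @ W.T              # G = (P(X) - E_Y) W^T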
D. Excess Risk Bound

In this section we study the excess risk bound of RFRBoost in the framework of Generalized Boosting (Suggala et al., 2020). We take a slightly different approach in our proofs which streamlines the process: instead of bounding the $\ell_1$ norm of the rows of the weight matrices, we bound the maximum singular values $\sigma_{\max}$.

D.1. Preliminaries

We repeat the following definition from Section 3.

Definition D.1 (Suggala et al. (2020)). Let $\beta \in (0, 1]$ and $\epsilon \ge 0$. We say that $\mathcal{G}_{t+1}$ satisfies the $(\beta, \epsilon)$-weak learning condition if there exists a $g \in \mathcal{G}_{t+1}$ such that
$$\big\langle g,\, -\nabla_2 R(W_t, \Phi_t) \big\rangle_{L_2^D(\mu)} \;\ge\; \Big( \beta \sup_{g' \in \mathcal{G}_{t+1}} \|g'\|_{L_2^D(\mu)} - \epsilon \Big)\, \big\| \nabla_2 R(W_t, \Phi_t) \big\|_{L_2^D(\mu)}.$$

Consider the sample-splitting variant of boosting, where at each boosting iteration we use an independent sample of size $\tilde{n} = n/T$. Let $\mu_t = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \delta_{(x_{t,i}, y_{t,i})}$ denote the empirical measure of the $t$-th independent sample. The risk bounds in the sequel depend on Rademacher complexities related to the class of weak feature transformations $\mathcal{G}_t$ and the set of linear predictors $\mathcal{W}$, which we define below:
$$\mathcal{R}(\mathcal{W}, \mathcal{G}_t) = \mathbb{E}_\rho \sup_{W \in \mathcal{W},\, g \in \mathcal{G}_t} \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \sum_{k=1}^{d} \rho_{i,k} \big[ W^\top g(x_{t,i}) \big]_k, \qquad \mathcal{R}(\mathcal{G}_t) = \mathbb{E}_\rho \sup_{g \in \mathcal{G}_t} \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \sum_{j=1}^{D} \rho_{i,j} \big[ g(x_{t,i}) \big]_j,$$
where the $\rho_{i,j}$ are Rademacher random variables, that is, independent random variables taking values $1$ and $-1$ with equal probability.

The excess risk bound of RFRBoost is based on the following result:

Theorem D.2 (Suggala et al. (2020)). Suppose that the loss $\ell$ is $L$-Lipschitz and $M$-smooth with respect to its first argument. Let the hypothesis set of linear predictors $\mathcal{W}$ be such that all $W \in \mathcal{W}$ satisfy $\lambda_{\min}(W^\top W) \ge \sigma_{\min}^2 > 0$ and $\lambda_{\max}(W^\top W) \le \sigma_{\max}^2$. Moreover, suppose for all $t$ that $\mathcal{G}_t$ satisfies the $(\beta, \epsilon_t)$-weak learning condition for $\mu_t$, and that all $g \in \mathcal{G}_t$ are bounded with $\sup_X \|g(X)\|_2 \le R$. Let the boosting learning rates $(\eta_t)_{t=1}^{T}$ be $\eta_t = c\, t^{-s}$ for some $s \ge \frac{1}{\beta + 1}$ and $c > 0$. Then both the exact-greedy and gradient-greedy representation boosting algorithms of Section 2.2 satisfy the following risk bound for any $W^\star$, $\Phi^\star$, and $a \in \big( 0, \beta(1 - s) \big)$, with probability at least $1 - \delta$ over datasets of size $n$:
$$R(W_T, \Phi_T) \;\le\; R(W^\star, \Phi^\star) + O\bigg( T^{-a} + T^{2-s} \sum_{t=1}^{T} \Big( L\, \mathcal{R}(\mathcal{W}, \mathcal{G}_t) + L\, \mathcal{R}(\mathcal{G}_t) + \epsilon_t \Big) \bigg).$$

We aim to apply the excess risk bound of Theorem D.2 in the setting of RFRBoost. For simplicity, we consider only the traditional ResNet structure where the random feature layer $f_t(x) = \tanh(B \Phi_t(x))$ only takes as input the previous layer of the ResNet and not the raw features. This corresponds to the weak feature transformation hypothesis class $\mathcal{G}_t = \{ h \circ \Phi_{t-1} : h \in \mathcal{H} \}$, where $\mathcal{H}$ is the set of simple residual blocks
$$\mathcal{H} = \big\{ x \mapsto A \tanh(Bx) \;:\; \lambda_{\max}(A) \text{ and } \lambda_{\max}(B) \text{ bounded by } \lambda_1 \big\}.$$
Here $\lambda_{\max}$ denotes the maximum singular value. We will use the properties $\|A\|_2 \le \lambda_1$, $\|A_k\|_1 \le \lambda_1 \sqrt{p}$, $\|A_k\| \le \lambda_1$, $\|B_j\|_1 \le \lambda_1 \sqrt{D}$, and $\|B_j\| \le \lambda_1$, where $A_k$ denotes the $k$-th row of $A$, and $\|\cdot\|_p$ the $\ell_p$ vector norm or the matrix spectral norm.

To prove Theorem 3.4, we need to compute the Rademacher complexities of RFRBoost and verify that our class of weak learners satisfies all the assumptions of Theorem D.2. The only critical assumption to check is that $\sup_{g_t \in \mathcal{G}_t,\, x \in \mathcal{X}} \|g_t(x)\|_2$ is bounded. This is the case for our particular model class, since
$$\|g_t(x)\|_2 = \big\| A_t \tanh\big( B_t \Phi_{t-1}(x) \big) \big\|_2 \le \|A_t\|_2\, \big\| \tanh\big( B_t \Phi_{t-1}(x) \big) \big\|_2 \le \lambda_1 \sqrt{p},$$
which follows from basic properties of vector and matrix norms.

The following lemma will prove useful for the computation of the Rademacher complexity of RFRBoost.

Lemma D.3 (Allen-Zhu et al. (2019), Proposition A.12). Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a 1-Lipschitz function. Let $\mathcal{F}_1, \ldots, \mathcal{F}_m$ be sets of functions $\mathcal{X} \to \mathbb{R}$, and suppose that for each $j \in [m]$ every $f_j \in \mathcal{F}_j$ satisfies $\sup_{x \in \mathcal{X}} |\sigma(f_j(x))| \le R$. Then
$$\mathcal{R}\bigg( \bigg\{ x \mapsto \sum_{j=1}^{m} v_j\, \sigma\big( f_j(x) \big) : f_j \in \mathcal{F}_j,\; v \in \mathbb{R}^m,\; \|v\|_1 \le C \bigg\} \bigg) \;\le\; 2C \max_{j \in [m]} \mathcal{R}(\mathcal{F}_j) + O\bigg( C R \sqrt{\frac{\log m}{n}} \bigg).$$

D.2. RFRBoost Rademacher Computations

The proof follows along similar lines as Suggala et al. (2020); however, our hypothesis class is larger and the proof has to be adjusted accordingly. Our proof additionally differs by bounding the hypothesis class by the largest singular values rather than by $\ell_1$ norms, which we believe is more natural and leads to more representative bounds.

Lemma D.4. Under the assumptions of Theorem 3.4, the Rademacher complexity $\mathcal{R}(\mathcal{G}_t)$ satisfies
$$\mathcal{R}(\mathcal{G}_t) = O\bigg( \frac{p\, D^{3/2} \sqrt{\log D}\; t^{1-s}}{\sqrt{\tilde{n}}} \bigg),$$
where the constants hidden in the $O$-notation depend only on $\lambda_1$, $c$, and $s$.

Proof. Using basic properties of the supremum, Hölder's inequality, Hoeffding's inequality, and Lemma D.3, we obtain the following chain of inequalities:
$$\mathcal{R}(\mathcal{G}_t) = \mathbb{E}_\rho \sup_{g \in \mathcal{G}_t} \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \sum_{k=1}^{D} \rho_{i,k}\, g_k(x_{t,i}) \;\le\; D\, \max_{k}\, \mathbb{E}_\rho \sup_{\substack{\lambda_{\max}(A) \le \lambda_1 \\ \lambda_{\max}(B) \le \lambda_1}} \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \rho_{i,k} \big\langle A_k, \tanh\big( B \Phi_{t-1}(x_{t,i}) \big) \big\rangle$$
$$\le\; D \left( 2 \lambda_1 \sqrt{p}\;\, \mathbb{E}_\rho \sup_{\lambda_{\max}(B) \le \lambda_1} \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \rho_{i,1} \big\langle B_j, \Phi_{t-1}(x_{t,i}) \big\rangle + O\bigg( \lambda_1 \sqrt{p} \sqrt{\frac{\log p}{\tilde{n}}} \bigg) \right)$$
$$\le\; D \left( \frac{2 \lambda_1^2 \sqrt{pD}}{\tilde{n}}\;\, \mathbb{E}_\rho \bigg\| \sum_{i=1}^{\tilde{n}} \rho_{i,1}\, \Phi_{t-1}(x_{t,i}) \bigg\|_\infty + O\bigg( \lambda_1 \sqrt{p} \sqrt{\frac{\log p}{\tilde{n}}} \bigg) \right)$$
$$\le\; D \left( \frac{2 \lambda_1^2 \sqrt{pD}}{\sqrt{\tilde{n}}} \sqrt{2 \log(2D)}\, \sup_{x} \big\| \Phi_{t-1}(x) \big\|_\infty + O\bigg( \lambda_1 \sqrt{p} \sqrt{\frac{\log p}{\tilde{n}}} \bigg) \right) = O\bigg( \frac{p\, D^{3/2} \sqrt{\log D}\; t^{1-s}}{\sqrt{\tilde{n}}} \bigg).$$
The second inequality follows from Lemma D.3 (applied with $C = \|A_k\|_1 \le \lambda_1 \sqrt{p}$) and the fact that $\tanh$ is bounded by 1. The third inequality uses Hölder's inequality with $\|B_j\|_1 \le \lambda_1 \sqrt{D}$, and the fourth follows from Hoeffding's inequality for bounded random variables, applied coordinate-wise with a union bound over the $D$ coordinates. Finally, the last equality follows from the fact that
$$\sup_x \big\| \Phi_{t-1}(x) \big\|_\infty \le \sum_{r=1}^{t-1} \eta_r \|g_r\|_\infty \le \lambda_1 \sqrt{p}\, c \sum_{r=1}^{t-1} r^{-s} \le \frac{\lambda_1 \sqrt{p}\, c\, t^{1-s}}{1 - s},$$
derived from the recursive definition $\Phi_t = \Phi_{t-1} + \eta_t g_t$.
Lemma D.5. Under the assumptions of Theorem 3.4, the Rademacher complexity $\mathcal{R}(\mathcal{W}, \mathcal{G}_t)$ satisfies
$$\mathcal{R}(\mathcal{W}, \mathcal{G}_t) = O\bigg( \frac{p\, d\, D^{3/2} \sqrt{\log D}\; t^{1-s}}{\sqrt{\tilde{n}}} \bigg).$$

Proof. We proceed similarly as in the previous result, and obtain
$$\mathcal{R}(\mathcal{W}, \mathcal{G}_t) = \mathbb{E}_\rho \sup_{W \in \mathcal{W},\, g \in \mathcal{G}_t} \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \sum_{j=1}^{d} \rho_{i,j} \big[ W^\top g(x_{t,i}) \big]_j \;\le\; d\, \max_{j}\, \mathbb{E}_\rho \sup_{W,\, g} \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \rho_{i,j} \big\langle W_j, g(x_{t,i}) \big\rangle$$
$$\le\; 2 \lambda_1 d\;\, \mathbb{E}_\rho \sup_{g} \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \rho_{i,1}\, g_k(x_{t,i}) + d\, O\bigg( \lambda_1^2 \sqrt{\frac{p D \log D}{\tilde{n}}} \bigg) \qquad \text{(Lemma D.3)}$$
$$\le\; 2 \lambda_1 d\, O\bigg( \frac{p\, D^{3/2} \sqrt{\log D}\; t^{1-s}}{\sqrt{\tilde{n}}} \bigg) + d\, O\bigg( \lambda_1^2 \sqrt{\frac{p D \log D}{\tilde{n}}} \bigg) \qquad \text{(previous result)}$$
$$=\; O\bigg( \frac{p\, d\, D^{3/2} \sqrt{\log D}\; t^{1-s}}{\sqrt{\tilde{n}}} \bigg).$$
Here we additionally used the fact that $\sup_x |g_k(x)| = \sup_x |\langle A_k, \tanh(B \Phi_{t-1}(x)) \rangle| \le \lambda_1 \sqrt{p}$ when applying Lemma D.3.

We obtain Theorem 3.4 by combining Theorem D.2 with Lemma D.4 and Lemma D.5, and using the bound $\sup_x \|g(x)\|_2 \le \lambda_1 \sqrt{p}$.

E. Experimental Setup

This section provides additional details of the experimental setup, SWIM random features, evaluation procedures, baseline models, and hyperparameter ranges used in our numerical experiments.

E.1. SWIM Random Features

In our experiments, we use SWIM random features (Bolager et al., 2023) for initializing the dense layer weights, as opposed to traditional i.i.d. initialization. Let $\mathcal{X}$ denote the input space and $\sigma$ an activation function. A SWIM layer is formally defined as follows:

Definition E.1 (Bolager et al. (2023)). Let $\Phi(x) \in \mathbb{R}^D$ represent the features of a neural network for input $x \in \mathcal{X}$. Consider a dense layer $x \mapsto \sigma(Ax + b)$ to be added on top of $\Phi$, where $A \in \mathbb{R}^{H \times D}$, $b \in \mathbb{R}^H$, and $H$ is the number of neurons in the new layer. Let $(x^{(1)}_i, x^{(2)}_i)_{i=1}^{H} \subset \mathcal{X} \times \mathcal{X}$ be pairs of training data points, and let
$$A_i = c_2\, \frac{\Phi(x^{(2)}_i) - \Phi(x^{(1)}_i)}{\big\| \Phi(x^{(2)}_i) - \Phi(x^{(1)}_i) \big\|^2}, \qquad b_i = -\big\langle A_i, \Phi(x^{(1)}_i) \big\rangle - c_1, \qquad (9)$$
where $c_1$ and $c_2$ are fixed constants. We say that $x \mapsto \sigma(Ax + b)$ is a pair-sampled layer if the rows of the weight matrix $A$ and the biases $b$ are of the form (9).

The pairs of points $(x^{(1)}_i, x^{(2)}_i)_{i=1}^{H}$ can be sampled uniformly or based on a training-data-dependent sampling scheme. In our experiments, we adopt the gradient-based approach of Bolager et al. (2023). Pairs of data points are sampled at each layer $\Phi$ of the ResNet approximately proportional to
$$q\big( x^{(1)}, x^{(2)} \,\big|\, \Phi \big) = \frac{\big\| f(x^{(1)}) - f(x^{(2)}) \big\|}{\big\| \Phi(x^{(1)}) - \Phi(x^{(2)}) \big\| + \epsilon}, \qquad (10)$$
where $\epsilon > 0$ and $f$ is the true target function, i.e., the mapping taking $x_i$ to the true label $y_i = f(x_i)$. Specifically, to avoid quadratic time complexity in the dataset size $n$, we first consider the $n$ points $x^{(1)}_i$, $i = 1, \ldots, n$, and then uniformly generate $n$ offset points $x^{(2)}_i = x^{(1)}_{i + j_i}$. We then sample $H$ pairs from these proportional to $q(\cdot, \cdot \,|\, \Phi)$. This ensures the sampling procedure remains linear with respect to the dataset size $n$.

The constants $c_1$ and $c_2$ are chosen such that the pre-activated neurons are symmetric about the midpoint $(\Phi(x^{(1)}_j) + \Phi(x^{(2)}_j))/2$. For general inputs, the pre-activation of neuron $j$ in a sampled layer is
$$\big( A \Phi(x) + b \big)_j = \big\langle A_j, \Phi(x) \big\rangle - \big\langle A_j, \Phi(x^{(1)}_j) \big\rangle - c_1 = c_2\, \frac{\big\langle \Phi(x^{(2)}_j) - \Phi(x^{(1)}_j),\; \Phi(x) - \Phi(x^{(1)}_j) \big\rangle}{\big\| \Phi(x^{(2)}_j) - \Phi(x^{(1)}_j) \big\|^2} - c_1,$$
and specifically, when applied to $x^{(1)}_j$ and $x^{(2)}_j$, we get
$$\big( A \Phi(x^{(1)}_j) + b \big)_j = -c_1, \qquad \big( A \Phi(x^{(2)}_j) + b \big)_j = c_2 - c_1.$$
Thus, as suggested by Bolager et al. (2023), setting $c_2 = 2 c_1$ centers the pre-activations symmetrically about the midpoint $(\Phi(x^{(1)}_i) + \Phi(x^{(2)}_i))/2$. Intuitively, this facilitates the creation of a decision boundary between the two chosen points. The constant $c_2$ is what we refer to as the SWIM scale in the hyperparameter section.
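The following is a minimal NumPy sketch of pair-sampled SWIM weights with the gradient-based pair sampling of Eq. (10). The function name, the regression-style target matrix Y, and the exact offset scheme are illustrative assumptions rather than the reference implementation.

    import numpy as np

    def sample_swim_layer(Phi, Y, H, c1=0.5, eps=1e-6, seed=None):
        # Phi: (n, D) current representation, Y: (n, d) targets, H: layer width.
        rng = np.random.default_rng(seed)
        n = Phi.shape[0]
        offsets = rng.integers(1, n, size=n)         # nonzero random offsets
        idx2 = (np.arange(n) + offsets) % n          # x^(2)_i = x^(1)_{i + j_i}
        dPhi = Phi[idx2] - Phi                       # Phi(x^(2)) - Phi(x^(1))
        dY = Y[idx2] - Y
        norms = np.linalg.norm(dPhi, axis=1)
        q = np.linalg.norm(dY, axis=1) / (norms + eps)   # Eq. (10), unnormalized
        pick = rng.choice(n, size=H, p=q / q.sum())      # H pairs prop. to q
        c2 = 2.0 * c1                                # SWIM scale, cf. Eq. (9)
        A = c2 * dPhi[pick] / (norms[pick, None] ** 2 + eps)  # eps: numerical guard
        b = -np.einsum("ij,ij->i", A, Phi[pick]) - c1
        return A, b                                  # layer output: tanh(Phi @ A.T + b)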
E.2. Model Architectures and Hyperparameter Ranges for OpenML Tasks

We detail the hyperparameter ranges used when tuning all baseline models for the 91 regression and classification tasks of the curated OpenML benchmark suite. The hyperparameter tuning was conducted using the Bayesian optimization library Optuna (Akiba et al., 2019), with 100 trials per outer fold, model, and dataset, using an inner 5-fold cross-validation (a minimal sketch of this tuning loop is shown after Table 4). The specific ranges for each model are presented in Table 4.

Table 4. Hyperparameter ranges for OpenML experiments.

    Model                      Hyperparameter   Range
    E2E MLP ResNet             n layers         1 to 10
                               lr               1e-6 to 1e-1 (log scale)
                               end lr factor    0.01 to 1.0 (log scale)
                               n epochs         10 to 50 (log scale)
                               weight decay     1e-6 to 1e-3 (log scale)
                               batch size       128, 256, 384, 512
                               hidden dim       16 to 512 (log scale)
                               feature dim      fixed at 512
    RFRBoost                   n layers         1 to 10 (log scale)
                               l2 linpred       1e-5 to 10 (log scale)
                               l2 ghat          1e-5 to 10 (log scale)
                               boost lr         0.1 to 1.0 (log scale)
                               SWIM scale       0.25 to 2.0
                               feature dim      fixed at 512
    RFNN                       l2 reg           1e-5 to 10 (log scale)
                               feature dim      16 to 512 (log scale)
                               SWIM scale       0.25 to 2.0
    Ridge/logistic regression  l2 reg           1e-5 to 10 (log scale)
    XGBoost                    alpha            1e-5 to 1e2 (log scale)
                               lambda           1e-3 to 100 (log scale)
                               learning rate    0.01 to 0.5 (log scale)
                               n estimators     50 to 1000 (log scale)
                               max depth        1 to 10

For the E2E MLP ResNet, hidden dim refers to the dimension $D$ of the ResNet features $\Phi(x) \in \mathbb{R}^D$, and feature dim denotes the number of neurons in the residual block (also known as the bottleneck size, which was also fixed for RFRBoost). The residual blocks are structured sequentially as follows: [dense(hidden dim, feature dim), batchnorm, relu, dense(feature dim, hidden dim), batchnorm]. An additional upscaling layer was used to match the input dimension to the hidden dim.
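As a concrete illustration of the tuning protocol, a minimal Optuna sketch for a single model is given below. The use of a ridge regressor, the scoring metric, and the function name are illustrative stand-ins; only the search range and the 100-trial, inner 5-fold setup follow Table 4.

    import optuna
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    def tune(X_train, y_train, n_trials=100):
        def objective(trial):
            # Search the l2 regularization on a log scale, as in Table 4.
            l2 = trial.suggest_float("l2_reg", 1e-5, 10.0, log=True)
            model = Ridge(alpha=l2)
            # Inner 5-fold cross-validation on the training split.
            return cross_val_score(model, X_train, y_train, cv=5,
                                   scoring="neg_root_mean_squared_error").mean()
        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=n_trials)
        return study.best_params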
We used the Aeon library (Middlehurst et al., 2024) to generate the critical difference diagrams in Section 4.1. Below, we provide full dataset-wise results in Tables 5 to 7 for each model and dataset used in the evaluation, averaged across all 5 folds. Figures 5 and 6 visualize these results using scatter plots overlaid with box-and-whiskers plots, providing insight into the distribution and variability of performance across datasets. We refer the reader to the critical difference diagrams for a rigorous statistical comparison of model rankings across all datasets.

Table 5. Test classification scores on OpenML datasets. Each row lists the dataset ID followed by mean accuracy (standard deviation) per model.

    ID     Logistic Regression  RFRBoost       RFNN           E2E MLP ResNet  XGBoost
    3      0.974 (0.005)        0.996 (0.002)  0.989 (0.004)  0.996 (0.001)   0.997 (0.002)
    6      0.768 (0.005)        0.912 (0.008)  0.890 (0.009)  0.944 (0.009)   0.903 (0.015)
    11     0.878 (0.033)        0.981 (0.006)  0.979 (0.004)  0.934 (0.042)   0.931 (0.031)
    14     0.812 (0.011)        0.844 (0.017)  0.804 (0.018)  0.828 (0.020)   0.830 (0.008)
    15     0.964 (0.016)        0.966 (0.011)  0.963 (0.017)  0.964 (0.015)   0.960 (0.012)
    16     0.948 (0.004)        0.974 (0.009)  0.945 (0.008)  0.973 (0.007)   0.954 (0.016)
    18     0.728 (0.010)        0.739 (0.010)  0.727 (0.010)  0.739 (0.007)   0.721 (0.017)
    22     0.815 (0.012)        0.837 (0.004)  0.812 (0.011)  0.840 (0.017)   0.795 (0.012)
    23     0.512 (0.054)        0.551 (0.038)  0.553 (0.044)  0.532 (0.035)   0.547 (0.022)
    28     0.970 (0.005)        0.991 (0.003)  0.980 (0.006)  0.989 (0.005)   0.978 (0.004)
    29     0.854 (0.017)        0.854 (0.024)  0.845 (0.024)  0.852 (0.027)   0.862 (0.026)
    31     0.761 (0.023)        0.762 (0.028)  0.756 (0.028)  0.743 (0.039)   0.744 (0.031)
    32     0.946 (0.002)        0.993 (0.003)  0.990 (0.002)  0.992 (0.001)   0.986 (0.003)
    37     0.762 (0.020)        0.758 (0.027)  0.760 (0.032)  0.767 (0.025)   0.762 (0.034)
    38     0.968 (0.005)        0.979 (0.005)  0.977 (0.004)  0.977 (0.004)   0.988 (0.004)
    44     0.931 (0.007)        0.950 (0.004)  0.936 (0.009)  0.941 (0.009)   0.952 (0.002)
    50     0.983 (0.006)        0.983 (0.010)  0.980 (0.010)  0.976 (0.016)   0.995 (0.007)
    54     0.801 (0.026)        0.849 (0.018)  0.829 (0.021)  0.843 (0.015)   0.772 (0.019)
    151    0.760 (0.015)        0.796 (0.014)  0.784 (0.015)  0.792 (0.010)   0.847 (0.013)
    182    0.861 (0.009)        0.909 (0.007)  0.908 (0.007)  0.911 (0.009)   0.913 (0.006)
    188    0.641 (0.057)        0.664 (0.049)  0.643 (0.031)  0.635 (0.038)   0.652 (0.029)
    307    0.838 (0.012)        0.993 (0.002)  0.972 (0.016)  0.989 (0.010)   0.924 (0.031)
    458    0.994 (0.000)        0.994 (0.007)  0.993 (0.006)  0.988 (0.013)   0.990 (0.006)
    469    0.187 (0.023)        0.188 (0.028)  0.196 (0.021)  0.193 (0.028)   0.178 (0.031)
    1049   0.908 (0.012)        0.910 (0.013)  0.904 (0.008)  0.907 (0.014)   0.919 (0.009)
    1050   0.894 (0.013)        0.889 (0.010)  0.898 (0.010)  0.892 (0.006)   0.890 (0.007)
    1053   0.808 (0.009)        0.807 (0.011)  0.806 (0.012)  0.810 (0.011)   0.809 (0.014)
    1063   0.841 (0.023)        0.845 (0.039)  0.833 (0.024)  0.841 (0.020)   0.839 (0.021)
Table 6. Test classification scores on OpenML datasets (continued).

    ID     Logistic Regression  RFRBoost       RFNN           E2E MLP ResNet  XGBoost
    1067   0.853 (0.019)        0.855 (0.018)  0.858 (0.015)  0.855 (0.016)   0.853 (0.012)
    1068   0.925 (0.008)        0.927 (0.016)  0.928 (0.009)  0.928 (0.010)   0.935 (0.012)
    1461   0.902 (0.008)        0.902 (0.005)  0.904 (0.009)  0.900 (0.006)   0.903 (0.005)
    1462   0.988 (0.005)        1.000 (0.000)  1.000 (0.000)  1.000 (0.000)   0.997 (0.004)
    1464   0.766 (0.039)        0.791 (0.037)  0.794 (0.046)  0.769 (0.048)   0.783 (0.043)
    1475   0.478 (0.005)        0.545 (0.010)  0.534 (0.012)  0.570 (0.005)   0.601 (0.008)
    1480   0.715 (0.037)        0.703 (0.017)  0.703 (0.011)  0.671 (0.033)   0.684 (0.020)
    1486   0.943 (0.006)        0.949 (0.003)  0.946 (0.004)  0.950 (0.006)   0.959 (0.003)
    1487   0.935 (0.012)        0.945 (0.012)  0.945 (0.012)  0.944 (0.009)   0.940 (0.011)
    1489   0.748 (0.010)        0.876 (0.015)  0.864 (0.005)  0.906 (0.005)   0.903 (0.006)
    1494   0.873 (0.016)        0.882 (0.016)  0.868 (0.018)  0.878 (0.015)   0.873 (0.010)
    1497   0.701 (0.009)        0.929 (0.005)  0.907 (0.008)  0.944 (0.007)   0.998 (0.001)
    1510   0.977 (0.004)        0.974 (0.008)  0.974 (0.014)  0.961 (0.023)   0.961 (0.013)
    1590   0.850 (0.005)        0.858 (0.010)  0.856 (0.010)  0.851 (0.009)   0.865 (0.013)
    4534   0.938 (0.007)        0.952 (0.010)  0.949 (0.003)  0.960 (0.009)   0.956 (0.008)
    4538   0.478 (0.014)        0.554 (0.017)  0.513 (0.014)  0.606 (0.014)   0.660 (0.017)
    6332   0.735 (0.031)        0.791 (0.024)  0.763 (0.009)  0.791 (0.023)   0.796 (0.021)
    23381  0.626 (0.040)        0.640 (0.030)  0.624 (0.034)  0.566 (0.048)   0.612 (0.041)
    23517  0.505 (0.010)        0.516 (0.009)  0.502 (0.017)  0.489 (0.013)   0.500 (0.012)
    40499  0.997 (0.001)        0.997 (0.001)  0.992 (0.000)  0.999 (0.001)   0.985 (0.004)
    40668  0.752 (0.017)        0.775 (0.014)  0.759 (0.019)  0.777 (0.010)   0.800 (0.014)
    40701  0.867 (0.003)        0.936 (0.002)  0.923 (0.004)  0.951 (0.009)   0.953 (0.003)
    40966  0.994 (0.005)        0.999 (0.002)  0.989 (0.007)  0.997 (0.002)   0.984 (0.016)
    40975  0.932 (0.007)        0.997 (0.004)  0.977 (0.006)  0.999 (0.001)   0.994 (0.004)
    40982  0.718 (0.028)        0.752 (0.016)  0.758 (0.024)  0.755 (0.028)   0.800 (0.018)
    40983  0.971 (0.003)        0.986 (0.002)  0.987 (0.003)  0.988 (0.004)   0.983 (0.003)
    40984  0.901 (0.009)        0.934 (0.012)  0.930 (0.007)  0.927 (0.011)   0.929 (0.011)
    40994  0.961 (0.018)        0.957 (0.019)  0.957 (0.019)  0.946 (0.015)   0.944 (0.021)
    41027  0.685 (0.004)        0.812 (0.007)  0.798 (0.012)  0.844 (0.007)   0.838 (0.006)

Figure 5. Scatter and box-and-whiskers plot of classification accuracy across all OpenML datasets. Each dot corresponds to a single dataset, and box plots summarize the distribution for each model.

Figure 6. Scatter and box-and-whiskers plot of regression RMSE across all OpenML datasets. Each dot corresponds to a single dataset, and box plots summarize the distribution for each model.
Table 7. Test RMSE scores on OpenML regression datasets. Each row lists the dataset ID followed by mean RMSE (standard deviation). Columns: Ridge Regression; Greedy RFRBoost with A scalar, A diag, and A dense; Gradient RFRBoost; RFNN; E2E MLP ResNet; XGBoost.

    ID     Ridge          G-scalar       G-diag         G-dense        Gradient       RFNN           E2E ResNet     XGBoost
    41021  0.233 (0.012)  0.233 (0.013)  0.232 (0.013)  0.231 (0.012)  0.231 (0.011)  0.232 (0.012)  0.265 (0.011)  0.249 (0.011)
    44956  0.647 (0.020)  0.631 (0.018)  0.621 (0.019)  0.615 (0.019)  0.616 (0.020)  0.622 (0.022)  0.614 (0.023)  0.633 (0.020)
    44957  0.688 (0.033)  0.362 (0.026)  0.246 (0.017)  0.238 (0.011)  0.233 (0.013)  0.347 (0.021)  0.316 (0.026)  0.197 (0.015)
    44958  0.595 (0.010)  0.239 (0.019)  0.205 (0.010)  0.193 (0.016)  0.191 (0.027)  0.246 (0.022)  0.246 (0.025)  0.039 (0.011)
    44959  0.589 (0.026)  0.304 (0.007)  0.275 (0.013)  0.277 (0.012)  0.286 (0.018)  0.305 (0.015)  0.311 (0.011)  0.245 (0.026)
    44960  0.290 (0.028)  0.069 (0.013)  0.050 (0.005)  0.049 (0.005)  0.049 (0.006)  0.088 (0.024)  0.200 (0.027)  0.037 (0.007)
    44962  0.446 (0.063)  0.448 (0.062)  0.448 (0.062)  0.446 (0.063)  0.447 (0.062)  0.447 (0.063)  0.474 (0.069)  0.452 (0.067)
    44963  0.841 (0.011)  0.784 (0.005)  0.773 (0.020)  0.749 (0.008)  0.752 (0.005)  0.783 (0.013)  0.703 (0.017)  0.694 (0.010)
    44964  0.521 (0.017)  0.443 (0.018)  0.406 (0.013)  0.395 (0.012)  0.398 (0.009)  0.443 (0.014)  0.369 (0.005)  0.340 (0.010)
    44965  0.883 (0.066)  0.867 (0.071)  0.873 (0.077)  0.860 (0.075)  0.858 (0.070)  0.871 (0.077)  0.864 (0.075)  0.839 (0.081)
    44966  0.719 (0.082)  0.724 (0.085)  0.718 (0.084)  0.719 (0.082)  0.721 (0.080)  0.726 (0.075)  0.731 (0.079)  0.720 (0.080)
    44967  0.786 (0.065)  0.788 (0.071)  0.790 (0.068)  0.788 (0.065)  0.784 (0.067)  0.795 (0.069)  0.814 (0.045)  0.789 (0.056)
    44969  0.409 (0.011)  0.084 (0.027)  0.027 (0.007)  0.022 (0.002)  0.025 (0.002)  0.061 (0.022)  0.058 (0.015)  0.086 (0.003)
    44970  0.645 (0.042)  0.609 (0.044)  0.610 (0.044)  0.618 (0.035)  0.615 (0.040)  0.603 (0.045)  0.616 (0.029)  0.622 (0.043)
    44971  0.840 (0.028)  0.789 (0.018)  0.779 (0.022)  0.780 (0.017)  0.777 (0.021)  0.785 (0.024)  0.729 (0.009)  0.685 (0.015)
    44972  0.796 (0.036)  0.776 (0.028)  0.775 (0.030)  0.768 (0.026)  0.767 (0.030)  0.774 (0.026)  0.777 (0.029)  0.722 (0.012)
    44973  0.602 (0.007)  0.287 (0.013)  0.235 (0.008)  0.194 (0.007)  0.194 (0.004)  0.284 (0.016)  0.177 (0.007)  0.244 (0.008)
    44974  0.457 (0.014)  0.214 (0.013)  0.184 (0.015)  0.153 (0.008)  0.152 (0.007)  0.211 (0.012)  0.158 (0.012)  0.126 (0.012)
    44975  0.026 (0.008)  0.027 (0.008)  0.026 (0.008)  0.026 (0.008)  0.026 (0.008)  0.026 (0.008)  0.056 (0.012)  0.109 (0.003)
    44976  0.270 (0.004)  0.192 (0.007)  0.177 (0.009)  0.170 (0.005)  0.170 (0.006)  0.196 (0.009)  0.161 (0.006)  0.192 (0.007)
    44977  0.590 (0.021)  0.525 (0.019)  0.517 (0.016)  0.508 (0.019)  0.506 (0.021)  0.525 (0.018)  0.486 (0.020)  0.446 (0.022)
    44978  0.242 (0.016)  0.149 (0.005)  0.145 (0.007)  0.146 (0.006)  0.148 (0.007)  0.149 (0.006)  0.142 (0.004)  0.134 (0.011)
    44979  0.252 (0.015)  0.163 (0.015)  0.148 (0.012)  0.142 (0.010)  0.140 (0.010)  0.165 (0.013)  0.145 (0.011)  0.142 (0.011)
    44980  0.766 (0.022)  0.430 (0.008)  0.324 (0.007)  0.318 (0.008)  0.317 (0.007)  0.438 (0.022)  0.275 (0.010)  0.457 (0.031)
    44981  0.914 (0.017)  0.913 (0.016)  0.913 (0.017)  0.912 (0.017)  0.913 (0.017)  0.914 (0.017)  0.636 (0.015)  0.611 (0.006)
    44983  0.431 (0.004)  0.263 (0.012)  0.250 (0.013)  0.247 (0.016)  0.244 (0.014)  0.257 (0.010)  0.235 (0.019)  0.229 (0.014)
    44984  0.689 (0.023)  0.660 (0.023)  0.658 (0.024)  0.656 (0.022)  0.657 (0.023)  0.659 (0.023)  0.656 (0.023)  0.656 (0.023)
    44987  0.334 (0.035)  0.233 (0.020)  0.194 (0.024)  0.183 (0.026)  0.192 (0.036)  0.246 (0.019)  0.229 (0.057)  0.211 (0.020)
    44989  0.308 (0.010)  0.293 (0.009)  0.287 (0.006)  0.276 (0.012)  0.274 (0.013)  0.295 (0.009)  0.304 (0.016)  0.263 (0.005)
    44990  0.404 (0.009)  0.395 (0.009)  0.396 (0.008)  0.395 (0.009)  0.393 (0.009)  0.399 (0.010)  0.405 (0.008)  0.399 (0.008)
    44993  0.799 (0.017)  0.776 (0.020)  0.773 (0.017)  0.770 (0.018)  0.769 (0.019)  0.775 (0.017)  0.787 (0.018)  0.773 (0.019)
    44994  0.265 (0.016)  0.215 (0.011)  0.216 (0.009)  0.215 (0.007)  0.216 (0.008)  0.216 (0.009)  0.231 (0.013)  0.225 (0.010)
    45012  0.448 (0.027)  0.343 (0.020)  0.339 (0.019)  0.330 (0.018)  0.330 (0.017)  0.341 (0.016)  0.351 (0.022)  0.331 (0.021)
    45402  0.621 (0.023)  0.526 (0.024)  0.507 (0.017)  0.499 (0.017)  0.497 (0.019)  0.532 (0.019)  0.474 (0.030)  0.502 (0.029)

E.3. Point Cloud Separation Task

For the point cloud separation task, we use the same model structure as in the OpenML experiments, but fix the number of ResNet layers to 3 residual blocks. We additionally fix the activation to tanh and the feature dimension to 512 for all models. Further architectural details and learning rate annealing strategies are consistent with those described in Section 4.1 and Appendix E.2. The SWIM scale was fixed at 1.0 for all random feature models. We use 5,000 randomly sampled data points for training and validation, and leave the remaining 5,000 for the final test set. The hyperparameter configurations for RFRBoost, E2E MLP ResNet, RFNN, and logistic regression are detailed below in Table 8.

Table 8. Hyperparameter ranges for the point cloud separation task.

    Model                Hyperparameter   Values
    RFRBoost             l2 cls           [1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
                         l2 ghat          1e-4
                         hidden dim       2
                         feature dim      512
                         boost lr         1.0
                         use batchnorm    true
    RFNN                 l2 cls           [1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
                         feature dim      512
    Logistic regression  l2               [1, 1e-1, 1e-2, 1e-3, 1e-4]
    E2E MLP ResNet       lr               [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
                         hidden dim       2
                         feature dim      512
                         n epochs         30
                         end lr factor    0.01
                         weight decay     1e-5
                         batch size       128

E.4. SWIM vs i.i.d. Ablation Study

In this section, we present a small ablation study comparing SWIM random features to standard i.i.d. Gaussian random features across OpenML regression and classification tasks. The evaluation procedure is the same as presented in Section 4.1, the only change being that the SWIM scale parameter is replaced by an i.i.d. scale parameter, ranging logarithmically from 0.1 to 10 for all models. Pairwise comparisons for each model variant are visualized in Figures 7 and 8, where each point represents a dataset and compares the performance of SWIM and i.i.d. features. Summary statistics, including average RMSE/accuracy, number of dataset-wise wins, and average training time, are reported in Tables 9 and 10.

Overall, performance differences between SWIM and i.i.d. features are small but consistent. For regression, SWIM-based models achieve lower mean RMSE across all variants, and win on 22 out of 34 datasets in the best case (Greedy RFRBoost A dense). For classification, accuracy is nearly identical across methods, though i.i.d. features slightly outperform SWIM in the total number of wins (32 out of 56). SWIM-based methods typically incur slightly higher training time, reflecting the additional structure in their initialization. These results suggest that while SWIM can offer marginal gains in accuracy or RMSE, the benefits must be balanced against training efficiency.

Table 9. Comparison of SWIM vs i.i.d. feature initialization on OpenML regression tasks.

    Model                     Variant  Mean RMSE  Wins  Fit time (s)
    RFNN                      SWIM     0.434      20    0.053
                              i.i.d.   0.441      14    0.029
    Gradient RFRBoost         SWIM     0.408      18    1.688
                              i.i.d.   0.415      16    1.816
    Greedy RFRBoost A dense   SWIM     0.408      22    2.734
                              i.i.d.   0.416      12    2.667
    Greedy RFRBoost A diag    SWIM     0.415      19    1.631
                              i.i.d.   0.415      15    1.490
    Greedy RFRBoost A scalar  SWIM     0.434      22    1.024
                              i.i.d.   0.443      12    0.720
Table 10. Comparison of SWIM vs i.i.d. feature initialization on OpenML classification tasks.

    Model              Variant  Mean accuracy  Wins  Fit time (s)
    RFNN               SWIM     0.845          25    1.189
                       i.i.d.   0.844          31    0.634
    Gradient RFRBoost  SWIM     0.853          32    2.519
                       i.i.d.   0.850          24    2.350

Figure 7. Comparison of classification accuracy using SWIM versus i.i.d. Gaussian random features across OpenML datasets. Each scatter plot compares accuracy on a per-dataset basis for different model variants. Points below the diagonal indicate better performance with SWIM.

Figure 8. Comparison of RMSE using SWIM versus i.i.d. Gaussian random features across OpenML regression tasks. Each scatter plot compares RMSE per dataset for different model variants. Points above the diagonal indicate better performance with SWIM.

E.5. Supplementary Larger-Scale Experiments

To further assess the scalability and performance of RFRBoost relative to baseline models, we conducted experiments on four larger datasets: two of the largest from the OpenML benchmark suite used in this work (each with approximately 100k samples; one classification: ID 23517; one regression: ID 44975), and two widely used benchmark datasets with approximately 500k samples (one classification: Cover Type (Blackard, 1998); one regression: Year Prediction MSD (YPMSD) (Bertin-Mahieux, 2011)).

For these larger-scale evaluations, we employed a hold-out test set strategy. Hyperparameters for each model were selected via a grid search performed on a 20% validation split of the training data. The models considered are consistent with those in Section 4.1: E2E MLP ResNet, Gradient RFRBoost, Logistic/Ridge Regression, RFNN, and XGBoost. To analyze performance trends with increasing data availability, models were trained on exponentially increasing subsets of the training data, starting from 2048 samples (2^11) up to the maximum available training size for each dataset. For the OpenML datasets, the test sets were defined as 33% (ID 23517, classification) and 11% (ID 44975, regression) of the total instances, resulting in a maximum training set size of approximately 64,000 instances (2^16). For YPMSD (regression), we adhered to the designated split: the first 463,715 examples for training and the subsequent 51,630 examples for testing. For Cover Type (classification), we used the standard train-test split provided by the popular Python machine learning library Scikit-learn (Pedregosa et al., 2011), yielding 464,809 training and 116,203 test instances. The hyperparameter grids used for each model in these experiments are detailed in Table 11. To account for variability, each experiment was repeated 5 times with different random seeds for each model. All experiments were carried out on a single NVIDIA RTX 6000 (Turing architecture) GPU.

Table 11. Hyperparameter grid for the larger-scale experiments.
    Model                      Hyperparameter  Values
    RFRBoost                   l2 cls          [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
                               n layers        [1, 3, 6]
                               feature dim     512
                               boost lr        1.0
    RFNN                       l2              [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
                               feature dim     512
    Logistic/ridge regression  l2              [1e-1, 1e-2, 1e-3, 1e-4]
    XGBoost                    lr              [0.1, 0.033, 0.01]
                               n estimators    [250, 500, 1000]
                               max depth       [1, 3, 6, 10]
                               lambda          1.0
    E2E MLP ResNet             lr              [1e-1, 1e-2, 1e-3]
                               n epochs        [10, 20, 30]
                               feature dim     512
                               end lr factor   0.01
                               weight decay    1e-5
                               batch size      256

The model training times and predictive performance (RMSE for regression, accuracy for classification) as a function of training set size are presented in Figure 9 and Figure 10, respectively. Regarding the training times shown in Figure 9, Ridge/Logistic Regression and RFNNs consistently exhibit the fastest training, as expected. RFRBoost consistently trains faster than both XGBoost and the E2E MLP ResNets across all datasets. Notably, on the OpenML ID 23517 dataset, RFRBoost trains faster than the single-layer RFNN. This may be attributed to RFNNs fitting a potentially very wide random feature layer directly to the output targets (logits or regression values), which can be computationally intensive. In contrast, RFRBoost constructs a comparatively shallow representation layer-wise with random feature residual blocks, where only the final, potentially lower-dimensional, representation is mapped to the output targets. In some scenarios, this may lead to lower computational times.

Considering the predictive performance shown in Figure 10, RFRBoost consistently and significantly outperforms the baseline single-layer RFNN across all datasets and training sizes, demonstrating the effectiveness of depth in random feature models for increasing performance. This improvement is particularly pronounced on the larger Cover Type and YPMSD datasets, underscoring the benefit of RFRBoost's deep, layer-wise construction. However, in the larger data regimes, E2E-trained MLP ResNets and XGBoost generally achieve superior predictive performance. The OpenML ID 23517 dataset stands out, however, as RFRBoost and Logistic Regression not only exhibit lower variance but also achieve the best overall performance. Furthermore, we find that XGBoost significantly underperforms on OpenML ID 44975 compared to all other baseline models. Given RFRBoost's strong performance relative to its computational cost and its ability to construct meaningful deep representations, exploring its use as an initialization strategy for subsequent fine-tuning of end-to-end trained networks presents a promising avenue for future research, potentially further enhancing the performance of E2E models.

Figure 9. Model training time (seconds, log scale) versus training set size on larger datasets. Lines represent the mean training time over 5 runs, and shaded areas indicate 95% confidence intervals.

Figure 10. Model predictive performance (RMSE for regression, accuracy for classification) versus training set size on larger datasets. Lines represent the mean performance over 5 runs, and shaded areas indicate 95% confidence intervals.
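For reference, the subset-scaling protocol used to generate these curves amounts to the following minimal loop; the model factory, scoring function, and their signatures are illustrative placeholders rather than the experiment code.

    import numpy as np

    def scaling_curve(model_factory, X_train, y_train, X_test, y_test,
                      score, n_seeds=5, start=2**11):
        # Exponentially increasing training subsets: 2^11, 2^12, ..., n_train.
        sizes, s = [], start
        while s < len(X_train):
            sizes.append(s)
            s *= 2
        sizes.append(len(X_train))
        results = {}
        for size in sizes:
            scores = []
            for seed in range(n_seeds):  # repeat with different random seeds
                rng = np.random.default_rng(seed)
                idx = rng.choice(len(X_train), size=size, replace=False)
                model = model_factory(seed)
                model.fit(X_train[idx], y_train[idx])
                scores.append(score(y_test, model.predict(X_test)))
            results[size] = (np.mean(scores), np.std(scores))
        return results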