# Optimal Kernel Quantile Learning with Random Features

Caixing Wang¹  Xingdong Feng¹

¹School of Statistics and Management, Shanghai University of Finance and Economics. Correspondence to: Xingdong Feng.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

The random feature (RF) approach is a well-established and efficient tool for scalable kernel methods, but the existing literature has primarily focused on kernel ridge regression with random features (KRR-RF), which has limitations in handling heterogeneous data with heavy-tailed noises. This paper presents a generalization study of kernel quantile regression with random features (KQR-RF), which accounts for the non-smoothness of the check loss in KQR-RF by introducing a refined error decomposition and establishing a novel connection between KQR-RF and KRR-RF. Our study establishes capacity-dependent learning rates for KQR-RF under mild conditions on the number of RFs, which are minimax optimal up to some logarithmic factors. Importantly, our theoretical results, utilizing a data-dependent sampling strategy, can be extended to cover the agnostic setting where the target quantile function may not precisely align with the assumed kernel space. By slightly modifying our assumptions, the capacity-dependent error analysis can also be applied to cases with Lipschitz continuous losses, enabling broader applications in the machine learning community. To validate our theoretical findings, simulated experiments and a real data application are conducted.

1. Introduction

Kernel methods are pivotal in statistical analysis and have seen extensive applications in nonparametric regression (Wahba, 1990; Vapnik, 1999) and classification (Schölkopf & Smola, 2002; Steinwart & Christmann, 2008). Despite their empirical success, typical kernel algorithms struggle with large-scale datasets, hindered by substantial computational cost, scaling as O(|D|³), and considerable storage requirements, scaling as O(|D|²), where |D| is the sample size of the dataset. This limitation has motivated significant research efforts towards scalable kernel methods, such as distributed learning (Zhang et al., 2015; Lin et al., 2017; 2020; Lian, 2022), Nyström subsampling (Williams & Seeger, 2001; Rudi et al., 2015; Li et al., 2023a), random features (Rahimi & Recht, 2007; 2008; Rudin, 2017; Rudi & Rosasco, 2017), stochastic gradient methods (Lin & Rosasco, 2016; Lin & Cevher, 2020), Falkon (Rudi et al., 2017), and EigenPro 3.0 (Abedsoltan et al., 2023). Among these popular accelerated methods, random features (Rahimi & Recht, 2007) is a kernel approximation technique that maps the attribute space to a finite, low-dimensional space through Bochner's theorem (Bochner, 2005; Rudin, 2017). Recent theoretical attention has been directed toward kernel methods employing random features (Sutherland & Schneider, 2015; Rudi & Rosasco, 2017; Avron et al., 2017; Bach, 2017; Carratino et al., 2018). In the context of kernel ridge regression with random features, significant efforts have focused on achieving optimal learning rates (Rudi & Rosasco, 2017; Avron et al., 2017; Bach, 2017; Li et al., 2023b; Liu & Lian, 2023), aligning with the minimax optimal rates of exact KRR (Caponnetto & De Vito, 2007) under mild conditions on the number of random features.
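To make the random feature construction concrete, the following minimal sketch (our own illustration, not code from the cited papers) approximates a Gaussian kernel via Bochner's theorem with random Fourier features; the function names, the bandwidth sigma, and the feature count M are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    # Exact Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def random_fourier_features(X, M=500, sigma=1.0, rng=None):
    # Bochner's theorem: the Gaussian kernel is the Fourier transform of a Gaussian
    # spectral density, so K(x, z) ~ phi_M(x) . phi_M(z) with
    # phi_M(x) = sqrt(2 / M) * cos(W^T x + b), W ~ N(0, sigma^{-2} I), b ~ U(0, 2 pi).
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, M))
    b = rng.uniform(0.0, 2 * np.pi, size=M)
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
K = gaussian_kernel(X, X)
Phi = random_fourier_features(X, M=2000, sigma=1.0, rng=1)
print(np.abs(Phi @ Phi.T - K).max())  # uniform error decays like O(M^{-1/2})
```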
However, it is worth pointing out that these works mainly focus on the least square loss, which exclusively estimates the conditional mean of the response given the covariates. Broader investigations encompass losses that are Lipschitz continuous, as in the support vector machine (SVM) and logistic regression (Bach, 2017; Sun et al., 2018; Li et al., 2021). It is worth noting that their statistical guarantees are capacity-independent and rely on the stringent assumption that the true target function f_ρ lies in the assumed kernel space, i.e., f_ρ ∈ H_K, known as the realizable setting. However, the agnostic setting, where the true target function f_ρ is not in the assumed kernel space, i.e., f_ρ ∉ H_K, is more common in practice. This leads to a motivating question: can the capacity-dependent optimal rates for some general losses using random features be maintained in the agnostic setting?

This paper primarily investigates the statistical and computational trade-offs inherent in random feature approximation for nonparametric quantile regression within a reproducing kernel Hilbert space (RKHS), also known as kernel quantile regression with random features (KQR-RF). In contrast to KRR-RF, KQR-RF models the entire set of conditional quantiles of the response, enhancing robustness against outliers and handling data heterogeneity more effectively (Koenker, 2005; Takeuchi et al., 2006; Li et al., 2007; Lian, 2022; Wang et al., 2024+). Our objective is to establish capacity-dependent optimal rates for KQR-RF applicable to both realizable and agnostic settings. Based on this special check loss, we extend the theoretical framework to a broader family of Lipschitz continuous loss functions. This expansion encompasses various commonly employed methodologies as specialized instances, including mean regression, quantile regression, likelihood-based classification, and margin-based classification.

1.1. Our Contributions

The main contributions of this paper are multifold.

Comprehensive theoretical assessments. We propose a comprehensive theoretical analysis of KQR-RF, offering deep insights into the impact of random features on kernel quantile learning. To the best of our knowledge, this is the first work to provide a generalization analysis for random features in kernel quantile learning. Moreover, the optimal learning rates we derive can be directly extended to general Lipschitz loss functions. Compared to the existing results, which are either capacity-independent (Rahimi & Recht, 2008; Bach, 2017; Li et al., 2021) or suboptimal (Sun et al., 2018), we provide the capacity-dependent optimal learning rates (Theorem 3.13) for KQR-RF (and RF for Lipschitz loss) for both realizable and agnostic settings.

Efficient computational improvement. For both uniform sampling and data-dependent sampling schemes, we obtain, to the best of our knowledge, the minimum number of random features required for the optimal learning rates in the literature. Specifically, we reduce the number of random features from O(|D|) at r = 1/2 (Rahimi & Recht, 2008; Li et al., 2021) to O(|D|^{1/(2r+γ)}) for r ∈ (0, 1/2) and O(|D|^{((2r−1)γ+1)/(2r+γ)}) for r ∈ [1/2, 1] (Theorem 3.9) under the uniform sampling scheme, and from O(|D|^{2γ/(2γ+1)}) at r = 1/2 (Sun et al., 2018) to O(|D|^{γ/(2r+γ)}) for r ∈ (0, 1/2) and O(|D|^{(2r+γ−1)/(2r+γ)}) for r ∈ [1/2, 1] (Corollary 3.16) under the data-dependent sampling scheme. Here, |D| is the sample size, and r and γ are key parameters defined in Section 3. The improvement notably enhances computational efficiency.

Novel proof skills.
In contrast to existing related work on KRR and its RF variants (Caponnetto & De Vito, 2007; Rudi & Rosasco, 2017; Li et al., 2023b), the estimator of KQR-RF (random features with a Lipschitz loss) lacks an explicit solution, posing challenges in deriving the learning rates. In our proof, we first provide a novel error decomposition including a least square approximation (LS-approximation) error term (Lemma B.3). Leveraging the empirical process and a self-calibration assumption, we successfully establish a connection between the KQR-RF estimator f̂_{M,D,λ} and its KRR-RF approximation f̄_{M,D,λ}. The theoretical extension to the entire regularity range r ∈ (0, 1] is achieved by using the non-trivial Cauchy-Schwarz and Young's inequalities, along with sharper estimates for the differences between the operators. A more detailed proof sketch is provided in Section 4.

Diverse numerical verification. Another contribution of this work is the exploration of KQR-RF's efficacy across diverse synthetic and real-world examples, further validating the theoretical findings of this paper.

1.2. Related Work

Some of the most related works are presented below.

Random features approximation. Rahimi & Recht (2007); Sutherland & Schneider (2015); Sriperumbudur & Szabó (2015) have investigated the approximation error between the approximated kernel function K_M(x, x′) and the original kernel K(x, x′), requiring O(|D|) features to guarantee the accuracy of the approximation, i.e., sup_{x,x′} |K_M(x, x′) − K(x, x′)| = O(|D|^{−1/2}). Another line of research delves into the generalization properties of random features in various supervised learning tasks, such as kernel ridge regression (Bach, 2017; Avron et al., 2017; Rudi & Rosasco, 2017), kernel support vector machines (KSVM) (Sun et al., 2018), and kernel learning with Lipschitz losses (Rahimi & Recht, 2008; Li et al., 2021; Li, 2022). However, the success of these works depends on the realizable setting where the true function satisfies f_ρ ∈ H_K.

Agnostic kernel learning. Recent studies have established capacity-dependent optimal learning rates in agnostic kernel learning, such as KRR (Smale & Zhou, 2007; Zhang et al., 2023), along with its variations including random features (Li et al., 2023b; Li & Liu, 2023) and Nyström subsampling (Li et al., 2023a). However, these studies primarily concentrate on the least square loss, while our focus lies on KQR-RF with a non-smooth check loss function (and Lipschitz loss functions), posing additional challenges due to the lack of explicit solutions (refer to Section 4 for a detailed discussion).

Data-dependent sampling. Data-dependent sampling based on an importance ratio was initially introduced by Alaoui & Mahoney (2015) for Nyström subsampling and has been integrated into random features (Bach, 2017; Avron et al., 2017; Rudi & Rosasco, 2017; Li et al., 2021), facilitating faster learning rates with fewer random features. Despite its computational efficiency, there remains an open question regarding its impact on the theoretical results for KQR-RF (and RF for Lipschitz loss), particularly in the agnostic setting.

2. Methodology

2.1. Preambles

Consider a standard supervised learning problem in which we have a sample D = {(x_i, y_i)}_{i=1}^{|D|}, consisting of independent copies of (x, y) drawn from an unknown joint distribution ρ(x, y) over X × ℝ.
The τ-th conditional quantile of the response y is the minimizer of the expected quantile risk across all measurable functions, given by

f_τ^* = argmin_{f∈L²_{ρ_X}} ∫_{X×ℝ} ρ_τ(y − f(x)) dρ(x, y),   (1)

where ρ_τ(u) = u(τ − I{u ≤ 0}) denotes the check loss function at the τ-th quantile level with the indicator function I(·), and L²_{ρ_X} = {f : X → ℝ | ∫_X f²(x) dρ_X < ∞} is the space of square-integrable functions with respect to the distribution of the covariates ρ_X. We also denote the L²_{ρ_X}-norm of f as ‖f‖²_ρ = ⟨f, f⟩_ρ = ∫_X f²(x) dρ_X for f ∈ L²_{ρ_X}. From the definition of the quantile regression model, we have P(ϵ ≤ 0 | x) = τ, where ϵ = y − f_τ^*(x) is the noise term.

2.2. Kernel Quantile Regression

Kernel methods are among the most powerful nonparametric tools, particularly for learning predictive functions within an RKHS (Vapnik, 1999). Let H_K denote the RKHS induced by a symmetric, positive semi-definite kernel function K : X × X → ℝ, with endowed inner product ⟨·, ·⟩_K and equipped norm ‖·‖²_K = ⟨·, ·⟩_K. KQR estimates a function in the RKHS H_K by minimizing the check loss combined with a penalty based on the squared Hilbert norm,

f̂_{D,λ} = argmin_{f∈H_K} (1/|D|) Σ_{i=1}^{|D|} ρ_τ(y_i − f(x_i)) + λ‖f‖²_K,   (2)

where |D| is the cardinality of D and λ is the regularization parameter. According to the representer theorem (Wahba, 1990), the solution of the optimization task (2) has the finite form f̂_{D,λ}(x) = Σ_{i=1}^{|D|} α̂_i K(x, x_i) = α̂ᵀ K_N(x), where α = (α_1, . . . , α_{|D|})ᵀ ∈ ℝ^{|D|} are the representer coefficients and K_N(x) = (K(x_1, x), . . . , K(x_{|D|}, x))ᵀ ∈ ℝ^{|D|}. With this solution plugged into (2), the optimization problem can be reformulated as

α̂ = argmin_{α∈ℝ^{|D|}} (1/|D|) Σ_{i=1}^{|D|} ρ_τ(y_i − αᵀ K_N(x_i)) + λ αᵀ K α,

where K = {K(x_i, x_j)}_{i,j=1}^{|D|} is the Gram matrix. In the literature, this problem can be solved by dual optimization (Takeuchi et al., 2006; Feng et al., 2023), a path-following algorithm (Li et al., 2007), and the ADMM algorithm (Boyd et al., 2011; Wang et al., 2024+). However, its scalability for large datasets is limited due to the expensive computational complexity O(|D|³) and storage requirements O(|D|²) when |D| is large. Consequently, a surge of research investigating scalable kernel methods and analyzing their performance has surfaced (Lin et al., 2017; Rudi et al., 2015; Rudi & Rosasco, 2017; Li et al., 2021).

2.3. KQR with Random Features

Random features prove advantageous in kernel approximation. Assuming the kernel K has an integral representation,

K(x, x′) = ∫_Ω ϕ(x, ω) ϕ(x′, ω) dπ(ω),   (3)

for any x, x′ ∈ X, where (Ω, π) is a probability space and ϕ : X × Ω → ℝ, it is thus clear that we can adopt the standard Monte Carlo sampling method (Rahimi & Recht, 2007) to estimate K(x, x′) by K_M(x, x′) = ⟨ϕ_M(x, ω), ϕ_M(x′, ω)⟩, where ϕ_M(x, ω) = (1/√M)(ϕ(x, ω_1), . . . , ϕ(x, ω_M))ᵀ is the feature map and ω_1, . . . , ω_M are independently sampled with respect to π. Henceforth, we use ϕ_M(x) to denote ϕ_M(x, ω) for notational simplicity. Consequently, the solution of (2) with random features can be written as

f̂_{M,D,λ}(x) = ûᵀ ϕ_M(x),   (4)

and the optimization problem becomes

û = argmin_{u∈ℝ^M} (1/|D|) Σ_{i=1}^{|D|} ρ_τ(y_i − uᵀ ϕ_M(x_i)) + λ uᵀ u.   (5)

Notably, leveraging random features allows us to reformulate the initial problem into linear quantile regression augmented by a ridge penalty, reducing the number of parameters to M ≪ |D|. In our simulation study, we utilize the ADMM algorithm with the proximal operator (Boyd et al., 2011; Gu et al., 2018) to solve (5).
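As a concrete illustration of (4) and (5), the sketch below fits KQR-RF by plain subgradient descent on the regularized check loss over the random-feature coefficients, reusing the random_fourier_features helper sketched in Section 1. It is a minimal stand-in for the ADMM solver used in the paper; the data-generating model, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

def check_loss(u, tau):
    # rho_tau(u) = u * (tau - 1{u <= 0})
    return u * (tau - (u <= 0))

def fit_kqr_rf(Phi, y, tau=0.5, lam=1e-3, lr=0.05, n_iter=5000):
    # Solves (5): min_u (1/|D|) sum_i rho_tau(y_i - u^T phi_M(x_i)) + lam * u^T u
    # by plain subgradient descent; the paper uses an ADMM solver instead.
    n, M = Phi.shape
    u = np.zeros(M)
    for _ in range(n_iter):
        r = y - Phi @ u
        # subgradient of the check loss term w.r.t. u: -(tau - 1{r_i <= 0}) * phi_M(x_i)
        g = -Phi.T @ (tau - (r <= 0)) / n + 2.0 * lam * u
        u -= lr * g
    return u

# Small usage example with heavy-tailed noise (all choices below are illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_t(df=2, size=500)
Phi = random_fourier_features(X, M=200, sigma=0.2, rng=1)  # map from the sketch above
u_hat = fit_kqr_rf(Phi, y, tau=0.5, lam=1e-4)
print(check_loss(y - Phi @ u_hat, tau=0.5).mean())  # in-sample check risk
```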
Although random features can achieve significant success in both computation and storage by approximating the kernel, the detailed trade-off between the number of features required and the statistical prediction accuracy is still an open question, especially when the non-smooth check loss is considered and the true quantile function lies outside of the exact RKHS H_K. This paper aims to answer these theoretical questions about KQR-RF in the subsequent sections.

3. Theoretical Guarantee

In this section, we first present an existing bound for KQR-RF due to Li et al. (2021) (Theorem 3.8 below), where they focus on the Lipschitz continuous loss family including the check loss. Subsequently, we provide our capacity-dependent and sharper learning rates for KQR-RF (Theorem 3.9), which not only recover those of Li et al. (2021) but can also be applied to the agnostic setting where the true quantile function may not lie in the considered function space. Furthermore, we consider the data-dependent sampling strategy, which achieves the same rates (Corollary 3.16) with fewer random features and retains its applicability to the entire agnostic setting. Finally, we extend our theoretical results to a wider array of Lipschitz continuous losses with a modified local strong convexity assumption (Assumption 3.17).

The objective of KQR-RF is to find an estimator that minimizes the expected risk

E(f) = ∫_{X×ℝ} ρ_τ(y − f(x)) dρ(x, y),

and we evaluate the performance of KQR-RF by the excess risk E(f) − E(f_τ^*), or the squared L²_{ρ_X}-norm of the difference, ‖f − f_τ^*‖²_ρ. The following are some standard definitions and assumptions needed to establish the theoretical results.

Definition 3.1 (Integral operators). For any f ∈ L²_{ρ_X}, we define the integral operators associated with the kernels K and K_M as

(L_K f)(·) = ∫_X K(x, ·) f(x) dρ_X(x),   (L_M f)(·) = ∫_X K_M(x, ·) f(x) dρ_X(x).

Definition 3.2 (Effective dimension). For λ > 0, we define the effective dimensions of the kernels K and K_M as

N(λ) = Tr((L_K + λI)^{−1} L_K),   N_M(λ) = Tr((L_M + λI)^{−1} L_M).

The effective dimension N(λ) serves as a common metric in the kernel learning theory literature, measuring the complexity of the RKHS H_K (Caponnetto & De Vito, 2007; Smale & Zhou, 2007; Rudi et al., 2015; Rudi & Rosasco, 2017). Similarly, we introduce N_M(λ) as the effective dimension induced by the approximate kernel K_M. As indicated in Lemma E.7 (Rudi & Rosasco, 2017) in the appendix, N_M(λ) has been shown to be equivalent to N(λ) under mild conditions on the number of random features.

Assumption 3.3 (Bounded and continuous random features). Assume the kernel K has the integral representation defined in (3) with ϕ bounded and continuous in both variables; that is, there exists some constant κ ≥ 1 such that |ϕ(x, ω)| ≤ κ for any x ∈ X and ω ∈ Ω. The associated RKHS H_K is separable.

Assumption 3.3 is a common condition in the literature on random features (Rudi & Rosasco, 2017; Liu et al., 2020; Li et al., 2021), which can be satisfied when the random features are continuous and bounded and X is separable. Note that this assumption implies that sup_{x,x′∈X} K(x, x′) ≤ κ² and sup_{x,x′∈X} K_M(x, x′) ≤ κ².

Assumption 3.4 (Source condition). Suppose there exist R > 0, r > 0, and h_τ ∈ L²_{ρ_X} such that

f_τ^* = L_K^r h_τ,   (6)

where ‖h_τ‖_ρ ≤ R and L_K^r is the r-th power of L_K.

The parameter r controls the size of the functional class of f_τ^*, denoted as F = L_K^r(L²_{ρ_X}). According to Steinwart & Christmann (2008); Lin & Rosasco (2016), we have H_K = L_K^{1/2}(L²_{ρ_X}), and L_K^{r₁}(L²_{ρ_X}) ⊆ L_K^{r₂}(L²_{ρ_X}) if r₁ ≥ r₂.
When r ∈ [1/2, 1], the functional class F is a subset of the assumed RKHS H_K, so we have f_τ^* ∈ H_K. When r ∈ (0, 1/2), the functional class F is larger than the assumed RKHS H_K, and there exist cases where f_τ^* ∉ H_K. Existing literature on KQR and kernel methods with Lipschitz continuous loss functions often assumes that r = 1/2 (Bach, 2017; Sun et al., 2018; Li et al., 2021) or r ∈ [1/2, 1] (Lian, 2022), corresponding to the realizable setting f_τ^* ∈ H_K. However, our analysis further allows r ∈ (0, 1/2), relating to the agnostic setting f_τ^* ∉ H_K. This is a non-trivial extension since we consider a non-smooth loss with random feature approximation.

Assumption 3.5 (Capacity condition). For λ > 0, there exist Q > 0 and γ ∈ [0, 1] such that

N(λ) ≤ Q² λ^{−γ}.   (7)

Note that this assumption always holds when γ = 1 by taking Q² = Tr(L_K) ≤ κ², and γ = 0 corresponds to some more benign cases. It is more general than the eigenvalue decay assumption (Li et al., 2021; Li, 2022; Lian, 2022), since it is satisfied when the eigenvalues {µ_i}_{i≥1} of L_K have a polynomial decay, i.e., µ_i ≍ i^{−1/γ}. For KRR and KRR-RF, the minimax optimal capacity-dependent rate has been shown to be O(|D|^{−2r/(2r+γ)}) (Caponnetto & De Vito, 2007; Rudi & Rosasco, 2017). In the case of KQR, Lian (2022) also derives the same capacity-dependent rate O(|D|^{−2r/(2r+γ)}). We want to emphasize that these works mainly focus on the realizable setting with r ∈ [1/2, 1], while our result first extends the capacity-dependent rate analysis of KQR-RF to the agnostic setting.

Assumption 3.6 (Adaptive self-calibration condition). Let f_{y|x}(·) denote the conditional density function of y given x. Suppose that sup_{t∈ℝ} f_{y|x}(t) ≤ c₁ for some c₁ > 0. Furthermore, there exist universal constants ε, ε′, c₂ > 0 that are independent of x and y, such that for any y ∈ B(f_τ^*(x), ε) and |δ| ≤ ε′, the following inequality holds almost surely,

|F_{y|x}(y + δ) − F_{y|x}(y)| ≥ c₂|δ|,   (8)

where B(f_τ^*(x), ε) = {y : |y − f_τ^*(x)| ≤ ε} denotes the ball centered at f_τ^*(x) with radius ε, and F_{y|x}(·) is the cumulative distribution function of y given x.

Assumption 3.6 serves as an adaptive self-calibration condition on the conditional distribution of y given x. It is a mild condition intended to hold for most realistic distributions. For example, if y has a density that is bounded away from zero on some compact interval around f_τ^*(x), then Assumption 3.6 holds. More importantly, we do not impose any moment condition on the distribution of y. It is also worth noting that Assumption 3.6 is weaker than Condition 2 in He & Shi (1994), where the density function of y is lower bounded everywhere by some positive constant. It is also weaker than Condition D.1 in Belloni & Chernozhukov (2011), which requires the conditional density of y given x to be continuously differentiable and bounded away from zero uniformly for all τ ∈ (0, 1) and all x in the support X. The special case when ε = 0, aligning with the self-calibration condition, also appeared in Shen et al. (2021); Madrid Padilla & Chatterjee (2022).

Remark 3.7. This adaptive self-calibration condition plays a pivotal role in our novel error decomposition, as shown in Lemma B.3 of the appendix, which leads to an adaptive local strong convexity condition of the expected check loss in a small ball around f_τ^*. It is worth noting that the self-calibration condition is weaker than Assumption (A2′) of Lian (2022) and Assumption (B2) of Li et al.
(2021), where the conditional density of y given x is assumed to be bounded away from zero across all quantile levels and all x ∈ X. Under this assumption, we derive a tight bound for a novel least square approximation (LS-approximation) error between the KQR-RF estimator f̂_{M,D,λ} and its KRR-RF approximation estimator f̄_{M,D,λ}, detailed in Lemma B.8 of the appendix.

3.1. Existing Learning Rates for KQR-RF

To facilitate a clear comparison between our findings and existing results, we first introduce the best learning rates so far for KQR-RF (Li et al., 2021).

Theorem 3.8 (Existing learning rates for KQR-RF (random features with Lipschitz loss), Theorem 19 of Li et al. (2021)). Assume there exists a function f_H such that f_H = argmin_{f∈H_K} E(f). Under some technical assumptions¹, with λ = O(|D|^{−1/2}), when the number of random features satisfies M ≳ |D|^{γ/2} log |D| and |D| is sufficiently large, there holds

E(f̂_{M,D,λ}) − E(f_H) ≍ ‖f̂_{M,D,λ} − f_H‖²_ρ = O(|D|^{−1/2}),

with probability near to 1.

¹Assumption 3.3, Assumption 3.4 with r = 1/2, an eigenvalue decay assumption (stronger than Assumption 3.5), and a local strong convexity assumption, which can be derived from Assumption 3.6.

Theorem 3.8 establishes an upper bound for KQR-RF in the worst case, requiring only the existence of f_H. In this scenario, if the number of random features scales as |D|^{γ/2} log |D|, KQR-RF can achieve the capacity-independent optimal generalization properties. This represents a significant improvement over previous work, which required a larger number of random features to guarantee similar learning rates. Note that Rahimi & Recht (2008) required O(|D|) random features to guarantee a learning rate of O(|D|^{−1/2}). However, these results are capacity-independent and cannot be applied to the agnostic setting when the size of the RKHS is small. In our subsequent analysis, we present a sharper and capacity-dependent learning rate, allowing r ∈ (0, 1], which covers the entire source condition space. This marks the primary novelty and advancement in the theoretical understanding of KQR-RF.

3.2. Sharper Learning Rates for KQR-RF

Theorem 3.9 (Worst case). Under Assumptions 3.3-3.6, if r ∈ (0, 1], γ ∈ [0, 1], and we set λ = |D|^{−1/(2r+γ)}, when the number of random features satisfies

M ≳ |D|^{1/(2r+γ)}, for r ∈ (0, 1/2);   M ≳ |D|^{((2r−1)γ+1)/(2r+γ)}, for r ∈ [1/2, 1],

and |D| is sufficiently large, there holds

E(f̂_{M,D,λ}) − E(f_τ^*) ≍ ‖f̂_{M,D,λ} − f_τ^*‖²_ρ = O(|D|^{−2r/(2r+γ)} log² |D|),

with probability near to 1.

The capacity-dependent learning rates obtained in Theorem 3.9 align with those of KRR (Caponnetto & De Vito, 2007) and KRR-RF (Rudi & Rosasco, 2017), which are minimax optimal and thus cannot be improved any further. Specifically, in the scenario of highest regularity (r = 1) and a small RKHS (γ = 0), the rate approaches the standard parametric bound O(1/|D|). For r = 1/2 and γ = 1, corresponding to the worst case, our learning rates and the requirements on the number of random features match those in Theorem 3.8. More interestingly, our results extend the optimal learning rates to the agnostic case where the true quantile function is located outside of the RKHS H_K. Specifically, we relax the regularity condition from r ∈ [1/2, 1] to r ∈ (0, 1], covering a wider range of scenarios.

Remark 3.10.
Recent studies have explored the generalization performance of kernel-based methods in the agnostic setting, including kernel ridge regression (Zhang et al., 2023), kernel ridge regression with Nyström subsampling (Lu et al., 2019; Li et al., 2023a), and kernel ridge regression with random features (Li et al., 2023b; Li & Liu, 2023). However, these studies primarily focus on the least square loss, whereas our work delves into the more complex non-smooth check loss and a broader Lipschitz loss family. Our theory requires a distinct set of proof techniques compared to work grounded in the least square loss paradigm, which admits an explicit solution; the lack of a closed form for our estimator necessitates the use of the empirical process. Specifically, we introduce a novel error decomposition including an LS-approximation error term, which bridges the excess risk for the check loss with the L²_{ρ_X} error of an intermediate estimator f̄_{M,D,λ} (see details in Lemmas B.3 and B.8 of the appendix). To derive the faster learning rates for both realizable and agnostic settings, we use different technical tools to incorporate the regularity condition into the LS-approximation error term, such as the non-trivial Young's inequality and Cauchy-Schwarz inequality tailored for operators.

Remark 3.11. Theorem 3.9 broadens the regularity condition for optimal learning rates from r ∈ [1/2, 1] to (0, 1]. However, it uses the naive uniform sampling strategy for the random features (generating ϕ(x, ω) with ω drawn from π(ω)), which is independent of the training samples. This may lead to an unnecessary burden in computation. Inspired by data-dependent sampling strategies (Bach, 2017; Avron et al., 2017; Rudi & Rosasco, 2017), we demonstrate in the next section how these strategies enable attaining the optimal learning rates across all agnostic settings r ∈ (0, 1] with a reduced number of random features.

3.3. Refined Analysis: Beyond Uniform Sampling

To obtain sharper learning rates for the entire setting r ∈ (0, 1] with fewer random features, we first introduce a compatibility condition that is commonly used in the literature (Rudi et al., 2015; Rudi & Rosasco, 2017; Liu et al., 2020).

Assumption 3.12 (Compatibility condition). Define the maximum dimension of random features as

N_∞(λ) = sup_{ω∈Ω} ‖(L_K + λI)^{−1/2} ϕ(·, ω)‖²_ρ,   (9)

where λ > 0. There exist constants α ∈ [0, 1] and F > 0 such that N_∞(λ) ≤ F λ^{−α}.

The maximum dimension of random features in (9) correlates with the data-generating distribution through the integral operator L_K, and the condition is always satisfied for α = 1 and F = κ². Recall the definition of N(λ) in Definition 3.2. N(λ) and N_∞(λ) measure the average and supremum capacities of H_K, respectively, so we have N(λ) = E_ω ‖(L_K + λI)^{−1/2} ϕ(·, ω)‖²_ρ ≤ sup_{ω∈Ω} ‖(L_K + λI)^{−1/2} ϕ(·, ω)‖²_ρ = N_∞(λ), where E_ω denotes the expectation taken over ω.

Theorem 3.13. Under Assumptions 3.3-3.6 and 3.12, if r ∈ (0, 1], γ ∈ [0, 1], and we set λ = |D|^{−1/(2r+γ)}, when the number of random features satisfies

M ≳ |D|^{α/(2r+γ)}, for r ∈ (0, 1/2);   M ≳ |D|^{((2r−1)(1+γ−α)+α)/(2r+γ)}, for r ∈ [1/2, 1],

and |D| is sufficiently large, there holds

E(f̂_{M,D,λ}) − E(f_τ^*) ≍ ‖f̂_{M,D,λ} − f_τ^*‖²_ρ = O(|D|^{−2r/(2r+γ)} log² |D|),

with probability near to 1.
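To make the conditions of Theorem 3.13 concrete, the small helper below (added by us; constants and logarithmic factors are deliberately dropped, and the function name is ours) returns the theoretical regularization λ = |D|^{−1/(2r+γ)}, the exponent c such that M ≍ |D|^c random features suffice, and the resulting rate |D|^{−2r/(2r+γ)}.

```python
def kqr_rf_schedule(n, r, gamma, alpha=1.0):
    """Theoretical choices from Theorem 3.13 (constants and log factors dropped).

    alpha = 1 recovers uniform sampling (Theorem 3.9); alpha = gamma recovers
    leverage-score sampling (Corollary 3.16). Returns (lam, c, rate) with
    lam = n**(-1/(2r+gamma)), M ~ n**c sufficient, and rate = n**(-2r/(2r+gamma)).
    """
    assert 0 < r <= 1 and 0 <= gamma <= 1 and 0 <= alpha <= 1
    lam = n ** (-1.0 / (2 * r + gamma))
    if r < 0.5:
        c = alpha / (2 * r + gamma)
    else:
        c = ((2 * r - 1) * (1 + gamma - alpha) + alpha) / (2 * r + gamma)
    return lam, c, n ** (-2.0 * r / (2 * r + gamma))

print(kqr_rf_schedule(10_000, r=0.5, gamma=0.5, alpha=1.0))  # uniform: M ~ n^(2/3)
print(kqr_rf_schedule(10_000, r=0.5, gamma=0.5, alpha=0.5))  # leverage: M ~ n^(1/3)
```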
The above capacity-dependent learning rate is the same as that of Theorem 3.9, while the required number of random features reduces from O(|D|^{1/(2r+γ)}) to O(|D|^{α/(2r+γ)}) when r ∈ (0, 1/2), and from O(|D|^{((2r−1)γ+1)/(2r+γ)}) to O(|D|^{((2r−1)(1+γ−α)+α)/(2r+γ)}) when r ∈ [1/2, 1], owing to the additional imposition of the compatibility condition N_∞(λ) ≤ F λ^{−α}. By adopting a favorable sampling strategy, as demonstrated in Example 3.14, we can further reduce the required number of random features and achieve the optimal learning rates across the entire range of r ∈ (0, 1]; a code sketch of this strategy is given at the end of this subsection.

Example 3.14 (Leverage scores sampling). Given the integral representation of the kernel K as stated in (3), we adopt the leverage scores sampling strategy (Bach, 2017; Avron et al., 2017) by employing an importance ratio denoted as q(ω) = l_λ(ω) / ∫_Ω l_λ(ω) dπ(ω), where l_λ(ω) = ‖(L_K + λI)^{−1/2} ϕ(·, ω)‖²_ρ. Consequently, the random features are computed as ϕ_l(x, ω) = [q(ω)]^{−1/2} ϕ(x, ω), with ω sampled from the distribution π_l(ω) = q(ω)π(ω). As pointed out in Rudi & Rosasco (2017), these random features provide an integral representation of K and satisfy Assumption 3.12 with α = γ, indicating that N(λ) = N_∞(λ).

Remark 3.15. We call α = 1 the worst case (Theorem 3.9), obtained when sampling the random features uniformly according to (3), which is independent of the training samples, and α = γ the benign case (Corollary 3.16), obtained when adopting the data-dependent sampling strategy of Example 3.14.

Corollary 3.16 (Benign case). Under Assumptions 3.3-3.6, if the random features are sampled according to the strategy in Example 3.14, r ∈ (0, 1], γ ∈ [0, 1], and we set λ = |D|^{−1/(2r+γ)}, when the number of random features satisfies

M ≳ |D|^{γ/(2r+γ)}, for r ∈ (0, 1/2);   M ≳ |D|^{(2r+γ−1)/(2r+γ)}, for r ∈ [1/2, 1],

and |D| is sufficiently large, then there holds

E(f̂_{M,D,λ}) − E(f_τ^*) ≍ ‖f̂_{M,D,λ} − f_τ^*‖²_ρ = O(|D|^{−2r/(2r+γ)} log² |D|),

with probability near to 1.

Figure 1. Comparison of the number of random features M = O(|D|^c) required for uniform sampling (α = 1, left) and leverage scores sampling (α = γ, right); Figure 1(a) is the agnostic case and Figure 1(b) is the realizable case.

Theorem 3.9 is the worst case of Theorem 3.13 with α = 1, while Corollary 3.16 is the benign case of Theorem 3.13 with α = γ. This distinction arises because the uniform sampling strategy π(ω) in general only yields the trivial bound α = 1, whereas employing data-dependent random features ensures the favorable scenario α = γ. To better illustrate the computational improvement in the different cases, Figures 1(a) and 1(b) compare the number of random features required to ensure the optimal learning rates using uniform sampling (left panel) and data-dependent sampling (right panel) for the agnostic case r ∈ (0, 1/2) and the realizable case r ∈ [1/2, 1], respectively.
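The sketch below illustrates the data-dependent strategy of Example 3.14 with an empirical plug-in: the population scores l_λ(ω) are replaced by ridge leverage scores computed from a pool of uniformly drawn features, which are then resampled and reweighted as ϕ_l = q^{−1/2} ϕ. This is our own finite-sample approximation (the paper's analysis uses the population quantities), and the function name, pool size, and λ are illustrative.

```python
import numpy as np

def leverage_resample_features(Phi_pool, lam, M, rng=None):
    """Empirical stand-in for the leverage scores sampling of Example 3.14.

    Phi_pool is an (n, P) matrix of raw candidate features phi(x_i, w_j), with
    w_1, ..., w_P drawn uniformly from pi. The population scores
    l_lam(w) = ||(L_K + lam I)^{-1/2} phi(., w)||_rho^2 are replaced by empirical
    ridge leverage scores, and M of the P candidates are resampled accordingly.
    """
    rng = np.random.default_rng(rng)
    n, P = Phi_pool.shape
    L_hat = Phi_pool @ Phi_pool.T / (n * P) + lam * np.eye(n)  # empirical (L_K + lam I)
    sol = np.linalg.solve(L_hat, Phi_pool)                     # (L_K + lam I)^{-1} phi_j
    scores = (Phi_pool * sol).sum(axis=0) / n                  # empirical l_lam(w_j)
    q = scores / scores.sum()                                  # importance ratio q(w_j)
    idx = rng.choice(P, size=M, replace=True, p=q)
    # phi_l(x, w_j) = phi(x, w_j) / sqrt(P * q_j); scaling by 1/sqrt(M) gives the map,
    # so that Phi_M @ Phi_M.T approximates the kernel Gram matrix with M << P features.
    Phi_M = Phi_pool[:, idx] / np.sqrt(P * q[idx]) / np.sqrt(M)
    return Phi_M, idx, q
```

In practice the O(n³) solve would be replaced by a sketched or recursive leverage-score estimator, e.g., in the spirit of Alaoui & Mahoney (2015); the version above only makes the reweighting explicit.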
3.4. Extension to Lipschitz Loss

Note that the check loss belongs to the family of Lipschitz continuous losses. We aim to extend our theoretical results to the general Lipschitz continuous loss family, covering other kernel-based methods such as kernel support vector machines (KSVM; Sun et al., 2018) and kernel logistic regression (KLR; Keerthi et al., 2005). Similar to the quantile regression estimation in (5), we formulate the following general learning problem

ũ = argmin_{u∈ℝ^M} (1/|D|) Σ_{i=1}^{|D|} L(y_i, uᵀ ϕ_M(x_i)) + λ uᵀ u,

where L(y, ·) is a Lipschitz continuous loss such that for some V ≥ 0 there exists a constant c_L > 0 with |L(y, x) − L(y, x′)| ≤ c_L |x − x′| for all pairs x, x′ ∈ [−V, V] and y ∈ ℝ. We refer to Feng et al. (2023) for specific examples satisfying this property. Our objective is to replace the check loss ρ_τ with a general Lipschitz continuous loss and construct a unified theoretical framework. In our proof of the main theorems for KQR-RF, a pivotal step involves controlling the LS-approximation error term in Lemmas B.8 and B.9. To facilitate this, we merely need to substitute Assumption 3.6 with the following assumption.

Assumption 3.17 (Local strong convexity). There exist constants u, u′, c₃, c₄ > 0 such that for any f and f′ satisfying ‖f − f_L^*‖_ρ ≤ u and ‖f′ − f_L^*‖_ρ ≤ u′, there holds

E_L(f) − E_L(f_L^*) ≥ c₃ ‖f − f_L^*‖²_ρ,   (10)
E_L(f) − E_L(f′) + ‖f′ − f_L^*‖²_ρ ≥ c₄ ‖f − f′‖²_ρ,   (11)

where E_L(f) = E(L(y, f(x))) and f_L^* = argmin_f E_L(f).

Here, we refer to (10) as the local strong convexity of L(y, f), and (11) as the adaptive local strong convexity of L(y, f). It is worth pointing out that conditions (10) and (11) can be verified for the check loss ρ_τ(·) by using (8) in Assumption 3.6 with ε = 0 and ε′ = 0, respectively. With Assumption 3.6 replaced by Assumption 3.17 and keeping all other conditions unchanged, we can establish the same learning rates for the Lipschitz loss L. Specifically,

‖f̂^L_{M,D,λ} − f_L^*‖²_ρ = O(|D|^{−2r/(2r+γ)} log² |D|),

with probability near to 1, where f̂^L_{M,D,λ} = ũᵀ ϕ_M. The detailed results for Lipschitz continuous losses and their proofs are deferred to Appendix C due to the space limit.

4. Comparisons to Related Work

In this section, we compare the conditions and learning rates of our method with related existing approaches, including KRR, KRR-RF, KQR, and Lipschitz loss with RF, which are summarized in Table 1, where Uniform and Leverage stand for the uniform and leverage scores sampling strategies, respectively, and Lip is short for Lipschitz continuous loss. Note that under Assumption 3.17, the results in the last three rows of Table 1 can also be directly extended to the cases with general Lipschitz continuous losses.

Table 1. Summary of conditions for derived learning rates in different methods.
Each entry lists: regularity condition; capacity condition; random centers M; learning rate.
- KRR (Caponnetto & De Vito, 2007): r ∈ [1/2, 1]; γ ∈ [0, 1]; exact method (no RF); |D|^{−2r/(2r+γ)}
- KRR (Zhang et al., 2023): r ∈ (0, 1]; γ ∈ [0, 1]; exact method (no RF); |D|^{−2r/(2r+γ)}
- KRR-RF-Uniform (Rudi & Rosasco, 2017): r ∈ [1/2, 1]; γ ∈ [0, 1]; M ≳ |D|^{((2r−1)γ+1)/(2r+γ)}; |D|^{−2r/(2r+γ)}
- KRR-RF-Leverage (Rudi & Rosasco, 2017): r ∈ [1/2, 1]; γ ∈ [0, 1]; M ≳ |D|^{(2r+γ−1)/(2r+γ)}; |D|^{−2r/(2r+γ)}
- KRR-RF-Uniform (Li et al., 2023b): r ∈ (0, 1] with 2r + γ ≥ 1; γ ∈ [0, 1]; M ≳ |D|^{1/(2r+γ)}; |D|^{−2r/(2r+γ)}
- KRR-RF-Leverage (Li et al., 2023b): r ∈ (0, 1]; γ ∈ [0, 1]; M ≳ |D|^{γ/(2r+γ)}; |D|^{−2r/(2r+γ)}
- KQR (Lian, 2022): r ∈ [1/2, 1]; γ ∈ [0, 1]; exact method (no RF); |D|^{−2r/(2r+γ)}
- Lip-RF-Uniform (Rahimi & Recht, 2008): r = 1/2; γ ∈ [0, 1]; M ≳ |D|; |D|^{−1/2}
- Lip-RF-Leverage (Bach, 2017): r = 1/2; γ ∈ [0, 1]; M ≳ |D|^{γ/2}; |D|^{−1/2}
- Lip-RF-Uniform (Li et al., 2021): r = 1/2; γ ∈ [0, 1]; M ≳ |D|; |D|^{−1/2}
- Lip-RF-Leverage (Li et al., 2021): r = 1/2; γ ∈ [0, 1]; M ≳ |D|^{γ/2}; |D|^{−1/2}
- KSVM-RF (Sun et al., 2018): r = 1/2; γ ∈ [0, 1]; M ≳ |D|^{2γ/(2γ+1)}; |D|^{−1/(2γ+1)}
- KQR-RF (Theorem 3.13): r ∈ (0, 1]; γ ∈ [0, 1]; M ≳ |D|^{α/(2r+γ)} for r ∈ (0, 1/2) and M ≳ |D|^{((2r−1)(1+γ−α)+α)/(2r+γ)} for r ∈ [1/2, 1]; |D|^{−2r/(2r+γ)}
- KQR-RF-Uniform (Theorem 3.9): r ∈ (0, 1]; γ ∈ [0, 1]; M ≳ |D|^{1/(2r+γ)} for r ∈ (0, 1/2) and M ≳ |D|^{((2r−1)γ+1)/(2r+γ)} for r ∈ [1/2, 1]; |D|^{−2r/(2r+γ)}
- KQR-RF-Leverage (Corollary 3.16): r ∈ (0, 1]; γ ∈ [0, 1]; M ≳ |D|^{γ/(2r+γ)} for r ∈ (0, 1/2) and M ≳ |D|^{(2r+γ−1)/(2r+γ)} for r ∈ [1/2, 1]; |D|^{−2r/(2r+γ)}

Compared to KRR and its RF variants. Previous studies have extensively pursued the optimal learning rates for KRR (Caponnetto & De Vito, 2007; Smale & Zhou, 2007) and KRR-RF (Rudi & Rosasco, 2017; Avron et al., 2017). Recent extensions (Zhang et al., 2023; Li et al., 2023b; Li & Liu, 2023) have enlarged the regularity condition to the agnostic setting where the regression function lies outside of the RKHS. However, we focus on KQR-RF with the non-smooth check loss, which is more challenging since we have no explicit solutions. Notably, deriving a capacity-dependent learning rate for r ∈ (0, 1/2) demands distinct technical skills compared to those required for KRR and its RF variations. Moreover, our results can be easily extended to Lipschitz losses with a modified assumption, signifying the added novelty of our analysis.

Compared to kernel methods with Lipschitz loss and their RF variants. The existing literature on random features with Lipschitz losses (Rahimi & Recht, 2008; Sun et al., 2018; Li et al., 2021; Li, 2022) only considers the ideal case r = 1/2, and the resulting learning rates are either capacity-independent (Rahimi & Recht, 2008; Li et al., 2021) or suboptimal (Sun et al., 2018). Lian (2022) studied the capacity-dependent learning rate for KQR when r ∈ [1/2, 1]. However, that work cannot be directly applied to the random feature setting. In contrast, our study offers a comprehensive analysis of the capacity-dependent learning rates for KQR-RF (Lip-RF), exhibiting broader applicability across the entire range r ∈ (0, 1], including the agnostic setting.

We also provide a brief proof sketch to emphasize the theoretical contributions of this paper.

A novel error decomposition and the LS-approximation error ‖f̂_{M,D,λ} − f̄_{M,D,λ}‖_ρ. Unlike existing RF work (Rudi & Rosasco, 2017; Li et al., 2021; 2023b), we first introduce a novel error decomposition in Lemma B.3. In addition to the standard empirical, RF approximation, and kernel approximation errors, we have an extra least square approximation (LS-approximation) error term ‖f̂_{M,D,λ} − f̄_{M,D,λ}‖_ρ. By the adaptive self-calibration assumption, we build an adaptive local strong convexity condition of the expected loss on a small neighborhood of f_τ^*. This promises the convergence of the LS-approximation error term.
We also use the non-trivial Cauchy-Schwarz and Young's inequalities to take the source index into account when r ∈ (0, 1/2) and r ∈ [1/2, 1], respectively.

Sharper analysis for ‖f̄_{M,D,λ} − f_τ^*‖_ρ. As indicated in Lemma B.3, we divide ‖f̄_{M,D,λ} − f_τ^*‖_ρ into three terms. To get tighter bounds on the error terms ‖f̄_{M,D,λ} − f_{M,λ}‖_ρ and ‖f_{M,λ} − f_λ‖_ρ, we apply the compatibility condition together with Bernstein's inequalities for the operators L_K, L_M, C_M, and C_{M,D}. This refined procedure helps to relax the conditions on M and |D|, and further enlarges the regularity condition to r ∈ (0, 1]. In fact, the convergence of the term ‖f̄_{M,D,λ} − f_τ^*‖_ρ is also an important premise for the convergence of the LS-approximation error ‖f̂_{M,D,λ} − f̄_{M,D,λ}‖_ρ.

5. Numerical Experiments

Inspired by the simulation setup in Rudi & Rosasco (2017); Li et al. (2021), we also consider the spline kernel of order q, defined as

Λ_q(x, x′) = Σ_{k=−∞}^{∞} e^{2πikx} e^{−2πikx′} |k|^{−q},

where x, x′ ∈ [0, 1] and q ∈ ℝ. According to the properties of the spline kernel, we have ∫_0^1 Λ_q(x, z) Λ_{q′}(x′, z) dz = Λ_{q+q′}(x, x′) for any q, q′ ∈ ℝ. Consequently, for r ∈ (0, 1] and γ ∈ [0, 1], let K(x, x′) = Λ_{1/γ}(x, x′), and its corresponding random feature is ϕ(x, ω) = Λ_{1/(2γ)}(x, ω) with ω ∼ U(0, 1). We consider the model y = Λ_{r/2}(x, 0) + ε, where ε ∼ N(0, 0.01) and x ∼ U(0, 1). Then Assumptions 3.3-3.5 and 3.12 are satisfied with α = γ (Rudi & Rosasco, 2017). To graphically show the true and estimated quantile functions, we consider three different settings: (1) the worst case (r = 0, γ = 1); (2) a general case (r = 1/2, γ = 1); (3) the most benign case (r = 1, γ = 0). Without loss of generality, we fix τ = 0.5. We generate training data of size N_tr = 1000 and testing data of size N_te = 10000. The regularization parameter λ is selected via a grid search based on a validation set with 1000 samples, where the grid is set as {10^{0.5s} : s = −20, −19, . . . , 2}. The number of random features is selected according to Theorem 3.13. The estimated and true quantile curves on the testing data are shown in Figure 2.

Figure 2. Estimated and true quantile curves for r = 0, γ = 1 (left), r = 1/2, γ = 1 (middle), and r = 1, γ = 0 (right) when τ = 0.5.

From the results, we can conclude that KQR-RF estimates the quantile functions very well in both realizable and agnostic settings. To validate the derived learning rates, i.e., E(f̂_{M,D,λ}) − E(f_τ^*) = O(|D|^{−2r/(2r+γ)}), we estimate the excess risk on the testing data and compare it with the theoretical one. We consider two agnostic cases (r = 0.2, γ = 0.1 and r = 0.4, γ = 0.2) and two realizable cases (r = 0.5, γ = 0.1 and r = 0.8, γ = 0.2) for better illustration. The setting is the same as above except that the training data size varies in {1000, 2000, . . . , 10000}. We perform a log transform on the empirical excess risk and the number of training data and plot them in Figure 3. From the results, we can see that the data points are uniformly distributed on both sides of a straight line, which verifies the derived learning rate. To further investigate the constants in the big-O bounds, we calculate the slope of each learning curve and compare it to 2r/(2r+γ).
The slope constants are 0.81, 1.21, 1.63, and 0.95 in the four scenarios. This also highlights our contribution in deriving the sharper and capacity-dependent learning rates.

Figure 3. Log empirical excess risk for r = 0.2, γ = 0.1 (left top), r = 0.4, γ = 0.2 (right top), r = 0.5, γ = 0.1 (left bottom), and r = 0.8, γ = 0.2 (right bottom) when τ = 0.5.

6. Conclusion

This paper investigates kernel quantile regression with random features and derives capacity-dependent optimal learning rates for both realizable and agnostic settings. By introducing a modified local strong convexity assumption, our theoretical analysis seamlessly extends to the entire Lipschitz continuous loss family, leading to the sharpest results so far to the best of our knowledge. Extensive experiments are conducted on both simulated and real case studies, providing empirical evidence that supports the theoretical findings of this paper. Furthermore, it is feasible to extend the theoretical results of random features to incorporate other accelerated approaches, such as stochastic gradient methods or distributed techniques, or to consider the parallel problem of deep over-parameterized quantile regression.

Acknowledgement

The authors thank the area chair and the anonymous referees for their constructive suggestions, which significantly improved this article. This research is supported in part by NSFC-12371270 and Shanghai Science and Technology Development Funds (23JC1402100). This research is also supported by the Shanghai Research Center for Data Science and Decision Technology.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Abedsoltan, A., Belkin, M., and Pandit, P. Toward large kernel models. In International Conference on Machine Learning, pp. 61-78. PMLR, 2023.
Alaoui, A. and Mahoney, M. W. Fast randomized kernel ridge regression with statistical guarantees. Advances in Neural Information Processing Systems, 28:775-783, 2015.
Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for over-parameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322-332. PMLR, 2019.
Avron, H., Clarkson, K. L., and Woodruff, D. P. Faster kernel ridge regression using sketching and preconditioning. SIAM Journal on Matrix Analysis and Applications, 38(4):1116-1138, 2017.
Bach, F. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714-751, 2017.
Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849-15854, 2019.
Belloni, A. and Chernozhukov, V. ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39(1):82, 2011.
Blanchard, G. and Krämer, N. Optimal learning rates for kernel conjugate gradient regression. Advances in Neural Information Processing Systems, 23:226-234, 2010.
Bochner, S. Harmonic analysis and the theory of probability. Courier Corporation, 2005.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331-368, 2007.
Carratino, L., Rudi, A., and Rosasco, L. Learning with SGD and random features. Advances in Neural Information Processing Systems, 31:10213-10224, 2018.
Daniely, A., Frostig, R., and Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. Advances in Neural Information Processing Systems, 29:2261-2269, 2016.
Feng, X., He, X., Wang, C., Wang, C., and Zhang, J. Towards a unified analysis of kernel-based methods under covariate shift. Advances in Neural Information Processing Systems, 36:73839-73851, 2023.
Furuta, T. Invitation to linear operators: From matrices to bounded linear operators on a Hilbert space. CRC Press, 2001.
Gu, Y., Fan, J., Kong, L., Ma, S., and Zou, H. ADMM for high-dimensional sparse penalized quantile regression. Technometrics, 60(3):319-331, 2018.
He, X. and Shi, P. Convergence rate of B-spline estimators of nonparametric conditional quantile functions. Journal of Nonparametric Statistics, 3(3-4):299-308, 1994.
Keerthi, S. S., Duan, K., Shevade, S. K., and Poo, A. N. A fast dual algorithm for kernel logistic regression. Machine Learning, 61:151-165, 2005.
Koenker, R. Quantile regression, volume 38. Cambridge University Press, 2005.
Li, J. and Liu, Y. Towards sharp analysis for distributed learning with random features. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pp. 3920-3928, 2023.
Li, J., Liu, Y., and Wang, W. Optimal convergence rates for agnostic Nyström kernel learning. In International Conference on Machine Learning, pp. 19811-19836. PMLR, 2023a.
Li, J., Liu, Y., and Wang, W. Optimal convergence for agnostic kernel learning with random features. IEEE Transactions on Neural Networks and Learning Systems, 28:1-11, 2023b.
Li, Y., Liu, Y., and Zhu, J. Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association, 102(477):255-268, 2007.
Li, Z. Sharp analysis of random Fourier features in classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7444-7452, 2022.
Li, Z., Ton, J.-F., Oglic, D., and Sejdinovic, D. Towards a unified analysis of random Fourier features. The Journal of Machine Learning Research, 22(1):4887-4937, 2021.
Lian, H. Distributed learning of conditional quantiles in the reproducing kernel Hilbert space. Advances in Neural Information Processing Systems, 35:11686-11696, 2022.
Lin, J. and Cevher, V. Optimal convergence for distributed learning with stochastic gradient methods and spectral algorithms. The Journal of Machine Learning Research, 21(1):5852-5914, 2020.
Lin, J. and Rosasco, L. Optimal learning for multi-pass stochastic gradient methods. Advances in Neural Information Processing Systems, 29:4563-4571, 2016.
Lin, S.-B., Guo, X., and Zhou, D.-X. Distributed learning with regularized least squares. The Journal of Machine Learning Research, 18(1):3202-3232, 2017.
Lin, S.-B., Wang, D., and Zhou, D.-X. Distributed kernel ridge regression with communications. The Journal of Machine Learning Research, 21:3718-3755, 2020.
Liu, F., Huang, X., Chen, Y., and Suykens, J. A. Random features for kernel approximation: A survey on algorithms, theory, and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7128-7148, 2021.
Liu, J. and Lian, H. On optimal learning with random features. IEEE Transactions on Neural Networks and Learning Systems, 34(11):9536-9541, 2023.
Liu, Y., Liu, J., and Wang, S. Effective distributed learning with random features: Improved bounds and algorithms. In International Conference on Learning Representations, pp. 1-13, 2020.
Lu, S., Mathé, P., and Pereverzyev Jr, S. Analysis of regularized Nyström subsampling for regression functions of low smoothness. Analysis and Applications, 17(06):931-946, 2019.
Madrid Padilla, O. H. and Chatterjee, S. Risk bounds for quantile trend filtering. Biometrika, 109(3):751-768, 2022.
Peng, H., Pappas, N., Yogatama, D., Schwartz, R., Smith, N., and Kong, L. Random feature attention. In International Conference on Learning Representations (ICLR 2021), 2021.
Pollard, D. Convergence of stochastic processes. Springer Science & Business Media, 2012.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20:1177-1184, 2007.
Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Advances in Neural Information Processing Systems, 21:1313-1320, 2008.
Rudi, A. and Rosasco, L. Generalization properties of learning with random features. Advances in Neural Information Processing Systems, 30:3215-3225, 2017.
Rudi, A., Camoriano, R., and Rosasco, L. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, 28:1657-1665, 2015.
Rudi, A., Carratino, L., and Rosasco, L. Falkon: An optimal large scale kernel method. Advances in Neural Information Processing Systems, 30:3891-3901, 2017.
Rudin, W. Fourier analysis on groups. Courier Dover Publications, 2017.
Schölkopf, B. and Smola, A. J. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
Shen, G., Jiao, Y., Lin, Y., Horowitz, J. L., and Huang, J. Deep quantile regression: Mitigating the curse of dimensionality through composition. arXiv preprint arXiv:2107.04907, 2021.
Smale, S. and Zhou, D.-X. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153-172, 2007.
Sriperumbudur, B. and Szabó, Z. Optimal rates for random Fourier features. Advances in Neural Information Processing Systems, 28:1144-1152, 2015.
Steinwart, I. and Christmann, A. Support vector machines. Springer Science & Business Media, 2008.
Sun, Y., Gilbert, A., and Tewari, A. But how does it work in theory? Linear SVM with random features. Advances in Neural Information Processing Systems, 31:3379-3388, 2018.
Sutherland, D. J. and Schneider, J. On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 862-871, 2015.
Takeuchi, I., Le, Q. V., Sears, T. D., and Smola, A. J. Nonparametric quantile estimation. The Journal of Machine Learning Research, 7:1231-1264, 2006.
Vapnik, V. The nature of statistical learning theory. Springer Science & Business Media, 1999.
Wahba, G. Spline models for observational data. SIAM, 1990.
Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.
Wang, C., Li, T., Zhang, X., Feng, X., and He, X. Communication-efficient nonparametric quantile regression via random features. Journal of Computational and Graphical Statistics, to appear, 2024+.
Williams, C. and Seeger, M. Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems, pp. 682-688, 2001.
Zambon, D., Alippi, C., and Livi, L. Graph random neural features for distance-preserving graph representations. In International Conference on Machine Learning, pp. 10968-10977. PMLR, 2020.
Zandieh, A., Han, I., Avron, H., Shoham, N., Kim, C., and Shin, J. Scaling neural tangent kernels via sketching and random features. Advances in Neural Information Processing Systems, 34:1062-1073, 2021.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107-115, 2021.
Zhang, H., Li, Y., Lu, W., and Lin, Q. On the optimality of misspecified kernel ridge regression. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 41331-41353. PMLR, 2023.
Zhang, Y., Duchi, J., and Wainwright, M. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research, 16(1):3299-3340, 2015.

A. Random features in kernel methods and deep neural networks

Random features mapping is a powerful tool for scaling up kernel methods (Rudi & Rosasco, 2017), the neural tangent kernel (Zandieh et al., 2021), graph neural networks (Zambon et al., 2020), and attention in Transformers (Peng et al., 2021). In fact, random features can be viewed as a class of two-layer neural networks with fixed weights in their first layer (Liu et al., 2021). For example, consider a two-layer neural network, i.e., f(x, θ) = √(2/M) Σ_{j=1}^M α_j σ(ω_jᵀ x) for some activation function σ, where x ∈ ℝ^d and ω_j ∼ N(0, I_d). Its corresponding random features mapping is k(x, x′) = E_ω[σ(ωᵀ x) σ(ωᵀ x′)]. If the commonly used ReLU activation σ(x) = max{0, x} is adopted, then the kernel is the first-order arc-cosine kernel; up to scaling, k(x, x′) ∝ κ₁(u) = (1/π)[u(π − arccos(u)) + √(1 − u²)] with u = ⟨x, x′⟩/(‖x‖‖x′‖). This relationship helps to explain phenomena such as fitting random labels (Zhang et al., 2021) and double descent (Belkin et al., 2019) in two-layer over-parameterized neural networks (Arora et al., 2019). For a deep neural network with more than two layers and fixed weights except for the output layer, we can also find a compositional kernel as its widths tend to infinity (Daniely et al., 2016). In view of the connection between random features for kernel methods and neural networks, it is meaningful to study the generalization properties of random features in over-parameterized quantile neural networks through those in KQR, especially in the agnostic setting.

B. Proofs of the Learning Rates for KQR-RF

To start with, we define an M-dimensional function space H_M related to ϕ_M(x) as

H_M = {f | f(x) = uᵀ ϕ_M(x), x ∈ X, u ∈ ℝ^M}.

It is thus clear that H_M is an RKHS induced by the kernel function K_M(x, x′) = ⟨ϕ_M(x, ω), ϕ_M(x′, ω)⟩. For f = uᵀ ϕ_M(x) ∈ H_M and g = zᵀ ϕ_M(x) ∈ H_M, we define their inner product in H_M as ⟨f, g⟩_{H_M} = uᵀ z, and the corresponding norm of f in H_M is ‖f‖_{H_M} = √(uᵀ u) = ‖u‖₂.
In the rest of this paper, we denote as the operatorial norm, HS as the Hilbert-Schmidt norm and 2 as the Euclidean norm of a vector in Rn. Let H be a Hilbert space, we denote with , H the associated inner product, with H the norm and with Tr( ) the trace. B.1. Kernel and Random Feature Operators In this section, we provide some popular kernel and random feature operators used in the proofs. Definition B.1. For any g L2 ρX , β RM, we define SM : RM L2 ρX , (SMβ)( ) = βT ϕM( ), S M : L2 ρX RM, S Mg = R X ϕM(x)g(x)dρX (x), S M,D : L2 ρX RM, S M,Dg = 1 |D| P|D| i=1 ϕM(xi)g(xi), CM : RM RM, CM = R X ϕM(x)ϕM(x)T dρX (x), CM,D : RM RM, CM,D = 1 |D| P|D| i=1 ϕM(xi)ϕM(xi)T . According to Definition 3.1, B.1 and Assumption 3.3, we have LM, CM, SM, CM,D are finite dimensional. Moreover, we have LM = SMS M, CM = S MSM and CM,D = S M,DSM. Finally, LM, CM, CM,D are self-adjoint and positive operator, with spectrum is [0, κ2]. B.2. Error Decomposition In this section, we first introduce some intermediate estimators based on check loss and least square loss and then explain the relationship between the estimators. Finally, we give a tight error decomposition for KQR-RF. Optimal Kernel Quantile Learning with Random Features Definition B.2. We define the following intermediate estimators: f M,D,λ = argmin f HM (x,y) D ρτ (y f(x)) + λ f 2 HM f M,D,λ = argmin f HM (x,y) D (f(x) f τ (x))2 + λ f 2 HM f M,λ = argmin f HM f(x) f τ (x) 2dρX (x) + λ f 2 HM fλ = argmin f HK f(x) f τ (x) 2dρX (x) + λ f 2 K We can also write f M,D,λ = ϕM( )T ω M,D,λ and ω M,D,λ = argminω RM 1 |D| P (x,y) D(ϕM(x)T ω f τ (x))2 +λ ω 2 2, f M,λ = ϕM( )T ωM,λ and ωM,λ = argminω RM R X (ϕM(x)T ω f τ (x))2dρX (x) + λ ω 2 2. Note that f M,D,λ is the global estimator of KQR-RF, and it does not have an explicit form due to the non-smoothness of the check loss function, while the other three estimators are defined by the least square loss function. Recall the operators defined in Definition 3.1 and B.1, we have f M,D,λ = SM(CM,D + λI) 1S M,Df τ , f M,λ = (LM + λI) 1LMf τ = SM(CM + λI) 1S Mf τ , fλ = (LK + λI) 1LKf τ . According to the definition of these estimators, we summarized a relationship chain from the KQR-RF estimator to the true quantile function f τ in L2 ρX : f M,D,λ ρτ ls f M,D,λ ρX (x) f M,λ HM HK fλ HK L2 f τ . Consequently, we can decompose the error in terms of L2 ρX -norm into four parts: f M,D,λ f M,D,λ ρ is the LSapproximation error from the non-smooth check loss to the least square loss; f M,D,λ f M,λ ρ is the empirical error from the sample to the expectation; f M,λ fλ ρ is the approximation error introduced by the random features; and fλ f τ ρ is the approximation error between HK and L2 ρX . Lemma B.3. Let f M,D,λ, f M,D,λ, f M,λ and fλ be defined in Definition B.2, we have the following error decomposition for KQR-RF, f M,D,λ f τ ρ f M,D,λ f M,D,λ ρ | {z } LS-approximation error + f M,D,λ f M,λ ρ | {z } Empirical error + f M,λ fλ ρ | {z } RF error + fλ f τ ρ | {z } Approximation error Proof. According to the triangle inequality, we can obtain the result directly. B.3. Error Bounds In this section, we provide the bounds for the four error terms in Lemma B.3. By utilizing the operator representation of f M,D,λ, f M,λ and fλ, we first bound the last three error terms. Benefiting from the maximum dimension of random features and Berntein s inequalities, our refined convergence results allow the source condition index r [0, 1] which can also explain the agnostic case when f τ / HK. 
Thus we can show that f M,D,λ f τ ρ is small enough, i.e., OP (λr). Based on this result, we finally bound the LS-approximation error by using the empirical process and some properties of the check loss function. B.3.1. APPROXIMATION ERRORS Lemma B.4. Let fλ and f τ be defined in Definition B.2 and (1), respectively. Under Assumption 3.4, for any λ (0, 1) and r (0, 1], there holds fλ f τ ρ Rλr. Optimal Kernel Quantile Learning with Random Features Proof. Recall that fλ = (LK + λI) 1LKf τ and Assumption 3.4 that f τ = Lr Khρ with hρ ρ R, we have fλ f τ ρ = (LK + λI) 1LKf τ f τ ρ = λ(LK + λI) 1f τ ρ = λ (LK + λI) 1Lr Khρ ρ = λr λ1 r(LK + λI)r 1(LK + λI) r Lr Khρ ρ λr λ(LK + λI) 1 1 r (LK + λI) 1LK r hρ ρ Rλr, where the first inequality is from Lemma E.1 and the fact that λ(LK + λI) 1 1 r 1 and (LK + λI) 1LK r 1 for r (0, 1). Thus we complete the proof. B.3.2. RANDOM FEATURE ERROR Lemma B.5. Let f M,λ and fλ be defined in Definition B.2, for any 0 < λ LK and δ (0, 1), if the number of random features satisfies the following inequalities M 16(N (λ) + 1) log(4/δ), for r (0, 1/2); M 16(N (λ) + 1) log(4/δ) 128κ2λ1 2r N(λ)2r 1N (λ)2 2r log(4/δ), for r [1/2, 1], then under Assumptions 3.3, 3.4 and 3.12, there holds f M,λ fλ ρ Rλr, with probability at least 1 δ. Proof. From the definition of f M,λ and fλ, we have f M,λ fλ = ((LM + λI)LM (LK + λI)LK)f τ = λ((LK + λI) 1 (LM + λI) 1)f τ = λ(LM + λI) 1(LM LK)(LK + λI) 1f τ = λ1/2(λ1/2(LM + λI) 1/2)((LM + λI) 1/2(LK + λI)1/2)[(LK + λI) 1/2(LM LK)(LK + λI)r 1] ((LK + λI) r Lr K)hρ, where the second and third equalities use (A + λI) 1A = I λ(A + λI) 1 and A 1 B 1 = B 1(B A)A 1, and the last inequality we use Assumption 3.4 that f τ = Lr Khρ with hρ ρ R. Note that λ1/2(LM + λI) 1/2 1, (LK + λI) r Lr K 1 and (LM + λI) 1/2(LK + λI)1/2 2 in (32) from Lemma D.2 when M 16(N (λ) + 1) log(2/δ), thus for any δ (0, 1), there holds f M,λ fλ ρ R 2λ (LK + λI) 1/2(LM LK)(LK + λI)r 1 (13) with probability at least 1 δ. We next to bound f M,λ fλ ρ for two cases: For the case when r (0, 1/2), according to (13), we have f M,λ fλ ρ R 2λ (LK + λI) 1/2(LM LK)(LK + λI) 1/2 (LK + λI)r 1/2 2Rλr (LK + λI) 1/2(LM LK)(LK + λI) 1/2 λ1/2 r(LK + λI)r 1/2 2 2 Rλr < Rλr, where the third inequality is from λ1/2 r(LK + λI)r 1/2 1 for r (0, 1/2), and (LK + λI) 1/2(LM LK)(LK + λI) 1/2 1/2 in (31) from Lemma D.2 when M 16(N (λ) + 1) log(2/δ). For the case when r [1/2, 1], according to (13), we apply Lemma E.2 by letting s = 2 2r [0, 1], X = (LK + λI) 1/2(LM LK) and A = (LK + λI) 1/2, f M,λ fλ ρ R 2λ (LK + λI) 1/2(LM LK) 2r 1 (LK + λI) 1/2(LM LK)(LK + λI) 1/2 2 2r. Optimal Kernel Quantile Learning with Random Features Note that from Lemmas D.1 and D.3, for any δ (0, 1), with probability at least 1 δ, we have f M,λ fλ ρ R 2λ (LK + λI) 1/2(LM LK) 2r 1 (LK + λI) 1/2(LM LK)(LK + λI) 1/2 2 2r N (λ) log(4/δ) 4κ2N(λ) log(4/δ) !2r 1 2(N (λ) + 1) log(4/δ) 2N (λ) log(4/δ) N (λ) log(4/δ) 4κ2N(λ) log(4/δ) 4N (λ) log(4/δ) λN (λ)(log(4/δ))r λN(λ)2r 1N (λ)2 2r log(4/δ) (14) where the second inequality follows from the inequality that (a + b)2r 1 a2r 1 + b2r 1 for r [1/2, 1]. Next, we need to add a condition on M to bound (14) with Rλr. 
We consider M 128κ2λ1 2r N(λ)2r 1N (λ)2 2r log(4/δ), plug this condition into (14), we get f M,λ fλ ρ 4 λ4r2 2r+1N (λ)4r2 4r+1 1282rκ4r N(λ)4r2 2r + 1282rκ8r 8r2N(λ)4r2 2r + where the second inequality is obtained from N (λ) = supω (LK + λ) 1/2ϕω ρ κ2/λ due to Assumptions 3.3 and 3.12, the third inequality follows from: (1) 1282r N(λ)4r2 2r 128, this is from the fact that 2r 1 and 4r2 2r 0 with r [1/2, 1), and N(λ) = Tr((LK + λI) 1LK) |LK| |LK|+λ 1/2 with 0 λ |LK|; (2) κ8r 8r2 1 and κ4 4r 1 with κ 1 from Assumption 3.3 and r [1/2, 1]. Combining the results of two cases, we complete the proof. B.3.3. EMPIRICAL ERROR The empirical error is related to the similarity between CM and CM,D, so we first define two important quantities measuring this similarity QM,D,λ = (CM + λI)1/2(CM,D + λI) 1/2 , RM,D,λ = (CM + λI) 1/2(CM CM,D)(CM + λI) 1/2 . Lemma B.6. Let f M,D,λ and f M,λ be defined in Definition B.2, if the number of random features and the total sample size satisfy inequalities M 16(N (λ) + 1) log(4/δ), for r (0, 1/2); M 16(N (λ) + 1) log(4/δ) 128κ2λ1 2r N(λ)2r 1N (λ)2 2r log(4/δ), for r [1/2, 1], and |D| 16(κ2λ 1 + 1) log(6/δ), respectively, then under Assumption 3.3, 3.4 and 3.12, for any δ (0, 1), there holds f M,D,λ f M,λ HM 2 e C1λr 1/2, f M,D,λ f M,λ ρ with probability at least 1 δ, where e C1 is a constant defined in the proof. Optimal Kernel Quantile Learning with Random Features Proof. From the definition of f M,D,λ and f M,λ, we have f M,D,λ f M,λ HM = ω M,D,λ ωM,λ 2, and by the equality that A 1 B 1 = A 1(B A)B 1 for A and B are invertible operators, we have ω M,D,λ ωM,λ =(CM,D + λI) 1S M,Df τ (CM + λI) 1S Mf τ =(CM,D + λI) 1(S M,D S M)f τ + [(CM,D + λI) 1 (CM + λI) 1]S Mf τ =(CM,D + λI) 1(S M,D S M)f τ + (CM,D + λI) 1(CM CM,D)ωM,λ =(CM,D + λI) 1(S M,D S M)f τ + (CM,D + λI) 1(S M S M,D)SMωM,λ =(CM,D + λI) 1S M,D(f τ f M,λ) + (CM,D + λI) 1S M(f M,λ f τ ), where the fourth equality uses CM,D = S M,DSM and CM = S MSM. Thus we have ω M,D,λ ωM,λ 2 (CM,D + λI) 1S M,D + (CM,D + λI) 1S M f M,λ f τ ρ (CM,D + λI) 1/2 (CM,D + λI) 1/2S M,D + (CM,D + λI) 1/2S M f M,λ f τ ρ. Note that CM,D is self-adjoint and positive operator, we have (CM,D + λI) 1/2 λ 1/2. On the other hand, it holds that (CM,D + λI) 1/2S M,D = (CM,D + λI) 1/2CM,D(CM,D + λI) 1/2 1/2 1, (CM,D + λI) 1/2S M = (CM,D + λI) 1/2(CM + λI)1/2(CM + λI) 1/2S M (CM,D + λI) 1/2(CM + λI)1/2 (CM + λI) 1/2S M = QM,D,λ (CM + λI) 1/2CM(CM + λI) 1/2 1/2 where the second equality uses the fact that AB = BA for A and B are self-adjoint operators, and the last inequality uses Lemma D.5 that QM,D,λ 2 and (CM + λI) 1/2CM(CM + λI) 1/2 1/2 1. Combine these inequalities and Lemma B.5, we get that ω M,D,λ ωM,λ 2 (1 + 2)λ 1/2 f M,λ f τ ρ e C1λr 1/2, where e C1 = (1 + 2)R. Similarly, we have f M,D,λ f M,λ = SM(ω M,D,λ ωM,λ) =SM(CM,D + λI) 1S M,D(f τ f M,λ) + SM(CM,D + λI) 1S M(f M,λ f τ ) =SM(CM + λI) 1/2(CM + λI)1/2(CM,D + λI) 1/2(CM,D + λI) 1/2S M,D(f τ f M,λ) + SM(CM + λI) 1/2 (CM + λI)1/2(CM,D + λI) 1/2(CM,D + λI) 1/2(CM + λI)1/2(CM + λI) 1/2S M(f M,λ f τ ). Note that SM(CM + λI) 1/2 = (CM + λI) 1/2CM(CM + λI) 1/2 1/2 1, and the above inequalities that (CM,D + λI) 1/2S M,D 1 and (CM + λI) 1/2S M 1, we have f M,D,λ f M,λ ρ (QM,D,λ + Q2 M,D,λ) f M,λ f τ ρ (2 + Thus we complete the proof. The following proposition states the convergence rate of f M,D,λ under some mild conditions on r, γ, λ, and M. Optimal Kernel Quantile Learning with Random Features Proposition B.7. 
Under Assumptions 3.3-3.5 and 3.12, if r [0, 1], γ [0, 1], and λ = |D| 1 2r+γ , when the number of random features satisfies the following inequalities M |D| α 2r+γ , for r (0, 1/2), (2r 1)(1+γ α)+α 2r+γ , for r [1/2, 1], and |D| is sufficiently large, then with probability near to 1, there holds f M,D,λ f τ ρ e C2|D| r 2r+γ , (15) where e C2 = 2R + Proof. Combining Lemmas B.4-B.6, and setting λ = |D| 1 2r+γ , we can obtain that f M,D,λ f τ ρ (2R + 2 e C1)λr = (2R + 2 e C1)|D| r 2r+γ , with probability near to 1. Now we check the following conditions for M and |D|, M 16(N (λ) + 1) log(4/δ), for r (0, 1/2); M 16(N (λ) + 1) log(4/δ) 128κ2λ1 2r N(λ)2r 1N (λ)2 2r log(4/δ), for r [1/2, 1]; |D| 32(κ2λ 1 + 1) log(6/δ). Recalling Assumptions 3.5 and 3.12 that N(λ) Q2λ γ, N (λ) Fλ α, and λ = |D| 1 2r+γ , we have M λ α = |D| α 2r+γ , for r (0, 1/2), and M λ α λ(2r 2)α+(1 2r)(γ+1) = |D| (2r 1)(1+γ α)+α 2r+γ , for r [1/2, 1], and |D| λ 1 = |D| 1 2r+γ |D| is sufficiently large. Thus we complete the proof. B.3.4. LS-APPROXIMATION ERROR Now we are ready to provide the bound for the LS-approximation error. We first give a lemma that establishes the connection between the L2 ρX error term f f M,D,λ 2 ρ and the excess risk error term E ρτ(y f(x)) ρτ(y f M,D,λ(x)) for any f L2 ρX . This lemma heavily relies on the adaptive self-calibration condition governing the conditional distribution of y (see Assumption 3.6). To use this assumption, we need the conclusion on Proposition B.7 that under mild condition that f M,D,λ lies in the ball center at f τ with radius f M,D,λ f τ ρ ε for ε 1 when |D| is large enough. Lemma B.8. Suppose that Assumptions 3.1-3.6 and the conditions in Proposition B.7 are satisfied, for any f L2 ρX , if |D| e C3, then with probability near to 1, there holds f f M,D,λ 2 ρ 4 c2 E ρτ(y f(x)) ρτ(y f M,D,λ(x)) + 4c2 1 e C2 2 c2 2 f M,D,λ f τ 2 ρ, where c1, c2, e C2 and e C3 are some universal positive constants . Proof. Using Knight s identity that ρτ(u v) ρτ(u) = v τ I(u 0) + R v 0 I(u t) I(u 0) dt, we have ρτ(y f(x)) ρτ(y f M,D,λ(x)) = (f(x) f M,D,λ(x)) τ I(y f M,D,λ(x)) + Z f(x) f M,D,λ(x) I(y f M,D,λ(x) + t) I(y f M,D,λ(x)) dt = (f(x) f M,D,λ(x)) τ I(y f τ (x)) (f(x) f M,D,λ(x)) I(y f τ (x)) I(y f M,D,λ(x)) + Z f(x) f M,D,λ(x) I(y f M,D,λ(x) + t) I(y f M,D,λ(x)) dt. Optimal Kernel Quantile Learning with Random Features Here we take the expectation and using Fubini s theorem, E ρτ(y f(x)) ρτ(y f M,D,λ(x)) = E (f(x) f M,D,λ(x))E((τ I(y f τ (x))|x) E (f(x) f M,D,λ(x))E((I(y f τ (x)) I(y f M,D,λ(x)))|x) "Z f(x) f M,D,λ(x) E(I(y f M,D,λ(x) + t)|x) E(I(y f M,D,λ(x))|x) dt The first term on the right side of (16) is 0 due to the fact that P(y f τ (x)|x) = τ. For the second term, E (f(x) f M,D,λ(x))E((I(y f τ (x)) I(y f M,D,λ(x)))|x) E |f(x) f M,D,λ(x)||E((I(y f τ (x)) I(y f M,D,λ(x)))|x)| =E |f(x) f M,D,λ(x)||Fy|x(f M,D,λ(x)) Fy|x(f τ (x))| =E |f(x) f M,D,λ(x)||fy|x(ξ)(f M,D,λ(x) f τ (x))| c1E |f(x) f M,D,λ(x)||f M,D,λ(x) f τ (x)| E (f(x) f M,D,λ(x))2 q E (f M,D,λ(x) f τ (x))2 = c1 f f M,D,λ ρ f M,D,λ f τ ρ, where the first inequality is from E(AB) E(|A||B|) for any random variable A and B, the second equality is from the mean value theorem with ξ [f M,D,λ(x), f τ (x)] or ξ [f τ (x), f M,D,λ(x)] together with Assumption 3.6 that the conditional density fy|x( ) is uniformly bounded, and the last inequality is from the Cauchy-Schwarz inequality. 
Similarly, for the third term on the right side of (16), "Z f(x) f M,D,λ(x) E(I(y f M,D,λ(x) + t)|x) E(I(y f M,D,λ(x))|x) dt "Z f(x) f M,D,λ(x) Fy|x(f M,D,λ(x) + t) Fy|x(f M,D,λ(x)) dt "Z f(x) f M,D,λ(x) 2 f f M,D,λ 2 ρ, where the inequality is from Assumption 3.6 and Propsition B.7 that f M,D,λ f τ ρ ξ when |D| ( e C2/ξ)2+γ/r. Plug these results into (16), we get 2 f f M,D,λ 2 ρ c1 f f M,D,λ ρ f M,D,λ f τ ρ + E ρτ(y f(x)) ρτ(y f M,D,λ(x)) 4β f f M,D,λ 2 ρ + c1β f M,D,λ f τ 2 ρ + E ρτ(y f(x)) ρτ(y f M,D,λ(x)) , then we set β = c1/c2, it holds that f f M,D,λ 2 ρ 4 c2 E ρτ(y f(x)) ρτ(y f M,D,λ(x)) + 4c2 1 c2 2 f M,D,λ f τ 2 ρ c2 E ρτ(y f(x)) ρτ(y f M,D,λ(x)) + 4c2 1 e C2 2 c2 2 |D| 2r 2r+γ , with probability near to 1. Thus we completes the proof. The following lemma bounds the supremum of the difference between the empirical average dependent on the data 1 |D| P|D| i=1[ρτ(yi f(xi)) ρτ(yi f M,D,λ(xi))] and its expectation E[ρτ(y f(x)) ρτ(y f M,D,λ(x)] within a local ball using the Rademacher complexity function on HM j=1 min{µj, δ2}, Optimal Kernel Quantile Learning with Random Features where µj s are the eigenvalues of the spectral decomposition of LM. Recall the definition of the effective dimension of HM that NM(λ) = Tr((LM + λI) 1LM) = P j=1 µj/(µj + λ). It is easy to verify that NM(λ) |D|R2 M( λ)/λ (using inequality that min(a, b)/2 ab a+b min(a, b) for a, b R ). So these two quantities are equivalent to some extent, and according to Lemma E.8, under some mild conditions on the number of random features, we have RM(δ) R(δ), where R(δ) is the Rademacher complexity function on HK defined by j=1 min{µ j, δ2}, where µ j s are the eigenvalues of the spectral decomposition of LK. Lemma B.9. For any δ > 0 and f HM, we define the event M(δ) as i=1 [ρτ(yi f(xi)) ρτ(yi f M,D,λ(xi))] E[ρτ(y f(x)) ρτ(y f M,D,λ(x)] C log |D|RM(δ) where Θ(δ) := {f HM | f f M,D,λ ρ δ, and f f M,D,λ HM 1}, then M(δ) holds with probability near to 1. Proof. For the notation simplify, we denote i=1 [ρτ(yi f(xi)) ρτ(yi f M,D,λ(xi))] E[ρτ(y f(x)) ρτ(y f M,D,λ(x)] and C is a universal positive constant that may be different from line to line in this lemma. We first use the standard symmetrization argument in the empirical process (Pollard, 2012) to bound E[A] such that i=1 σi ρτ(yi f(xi)) ρτ(yi f M,D,λ(xi)) i=1 σi f(xi) f M,D,λ(xi) where {σi} s denote the Rademacher variables taking values in { 1, 1} with equal probability, the second inequality follows from the fact that ρτ( ) is 1-Lipschitz continuous and the Ledoux Talagrand contraction inequality (Wainwright, 2019). For any f Θ(δ), we denote g = f f M,D,λ HM, and g = P j=1 gjψj with gj = R X f(x)ψj(x)ρX (x)dx. Note that g ρ δ and g HM 1, this implies that P j=1 g2 j δ2 and P j=1 g2 j /µj 1. Combine these two inequalities, we have g2 j min{µj, δ2} 2. (18) Then we have i=1 σi f(xi) f M,D,λ(xi) = j=1 gjψj(xi) min{µj, δ2} min{µj, δ2} i=1 σiψj(xi) j=1 min{µj, δ2} i=1 σiψj(xi) Optimal Kernel Quantile Learning with Random Features where the inequality is from Cauthy-Schwarz inequality and (18). Plug (19) into (17), we have j=1 min{µj, δ2} i=1 σiψj(xi) j=1 min{µj, δ2}Ex,σ i=1 σiψj(xi) j=1 min{µj, δ2} i=1 Ex,σ σ2 i ψ2 j (xi) j=1 min{µj, δ2}, where the second inequality follows from Jensen s inequality, and the first equality is from the fact that Ex,σ(σiψj(xi)) = 0 for each i. Thus we have E[A] CRM(δ). 
(20) Next, we turn to bound A E[A], note that j=1 gjψj(x) g2 j min{µj, δ2} j=1 min{µj, δ2}ψ2 j (x) C j=1 min{µj, δ2} = C p Thus we have ρτ(y f(x)) ρτ(y f M,D,λ(x)) f(x) f M,D,λ(x) = |g(x)| C p and E ρτ(y f(x)) ρτ(y f M,D,λ(x)) 2 E f(x) f M,D,λ(x) 2 = E(g(x))2 C|D|R2 M(δ). With these two inequalities, we use the Bousquet bound inequality in Lemma E.7 and set t = C q A E[A] C log |D|RM(δ) (21) holds with probability at least 1 n C. Combine (20) and (21), we can obtain the inequality in the lemma. Thus we complete the proof. According to Lemma B.9, we can also get the following inequality by some normalized procedure, i=1 [ρτ(yi f(xi)) ρτ(yi f M,D,λ(xi))] E[ρτ(y f(x)) ρτ(y f M,D,λ(x)] C log |D|RM(δ) f f M,D,λ ρ δ + f f M,D,λ HM Lemma B.10. Suppose that Assumptions 3.3-3.6 and 3.12 and the conditions in Proposition B.7 are satisfied, if |D| e C3, then with probability near to 1, there holds f M,D,λ f M,D,λ ρ C|D| r 2r+γ log |D|, where C is a universal positive constant. Optimal Kernel Quantile Learning with Random Features Proof. Recall the definition of f M,D,λ, we have i=1 ρτ(yi f M,D,λ(xi)) + λ f M,D,λ HM 1 |D| i=1 ρτ(yi f M,D,λ(xi)) + λ f M,D,λ HM . We can not directly obtain a upper bound of E[ρτ(y f M,D,λ(x)) ρτ(y f M,D,λ(x)] according to (22) from Lemma B.9, because E[ρτ(y f M,D,λ(x)) ρτ(y f M,D,λ(x)] 0 does not always hold. Thus we use Lemma B.8 and note that E[ρτ(y f M,D,λ(x)) ρτ(y f M,D,λ(x)] + C f M,D,λ f τ 0 holds. Thus combine Lemma B.8 and (22) from Lemma B.9 (plus a C|D| r 2r+γ term and minus the same term in the left of (22)) , with probability near to 1, we have E[ρτ(y f M,D,λ(x)) ρτ(y f M,D,λ(x)] + C|D| 2r 2r+γ λ f M,D,λ HM λ f M,D,λ HM + C log |D|RM(δ) δ f f M,D,λ ρ + C log |D|RM(δ) f f M,D,λ HM + C|D| 2r 2r+γ = 2λ f M,D,λ, f M,D,λ f M,D,λ HM λ f M,D,λ f M,D,λ 2 HM + C log |D|RM(δ) δ f M,D,λ f M,D,λ ρ +C log |D|RM(δ) f M,D,λ f M,D,λ HM + C|D| 2r 2r+γ = 2λ f M,D,λ, f M,D,λ f M,D,λ HM λ f M,D,λ f M,D,λ 2 HM + Cλr log |D| f M,D,λ f M,D,λ ρ +Cλr+1/2 log |D| f M,D,λ f M,D,λ HM + Cλ2r, (23) where in the last equality, we choose δ satisfying that RM(δ) = δ1+2r. Note that RM(δ) R(δ) δ1 γ/ p |D| (Lian, 2022), we can obtain that δ = |D| 1 4r+2γ , and λ = δ2 = |D| 1 2r+γ . We now establish the bound of the term 2λ f M,D,λ, f M,D,λ f M,D,λ HM . Note that by the triangle inequality, we have |λ f M,D,λ, f M,D,λ f M,D,λ HM | |λ f M,D,λ f M,λ, f M,D,λ f M,D,λ HM | + |λ f M,λ, f M,D,λ f M,D,λ HM |. (24) For the first term of the right side of (24), we use the Cauchy Schwarz inequality and Lemma B.6 and obtain that |λ f M,D,λ f M,λ, f M,D,λ f M,D,λ HM | λ f M,D,λ f M,λ HM f M,D,λ f M,D,λ HM e C1λr+1/2 f M,D,λ f M,D,λ HM . For the second term in the right side of (24), we consider the following two cases: (i). For the case when r (0, 1/2), recall the definition of f M,λ we get f M,λ HM = (LM + λI) 1LMfρ HM = (LM + λI) 1LMLr Khρ HM (LM + λI) 1LM Lr Khρ ρ Lr Khρ ρ Lr K hρ ρ Rκ2r, where the first and third inequality is from the fact that (LM +λI) 1LM and Lr K are linear operators, the last inequality is from Lr K κ2r for r (0, 1/2) and hρ ρ R. Then by the Cauchy Schwarz inequality, we have |λ f M,λ, f M,D,λ f M,D,λ HM | λ f M,λ HM f M,D,λ f M,D,λ HM Rϕ2rλ f M,D,λ f M,D,λ HM Rκ2rλr+1/2 f M,D,λ f M,D,λ HM . Optimal Kernel Quantile Learning with Random Features (ii). 
For the case when r [1/2, 1], we have |λ f M,λ, f M,D,λ f M,D,λ HM | =|λ f M,λ, L 1 M (f M,D,λ f M,D,λ) ρ| =λ| (LM + λI) 1LMLr Mh τ, L 1 M (f M,D,λ f M,D,λ) ρ| Rλr λ1 r(LM + λI) 1Lr M(f M,D,λ f M,D,λ) ρ f M,D,λ f M,D,λ, λ2 2r L2r M(LM + λI) 2(f M,D,λ f M,D,λ) ρ f M,D,λ f M,D,λ, ((2 2r)λ + (2r 1)LM)LM(LM + λI) 2(f M,D,λ f M,D,λ) ρ f M,D,λ f M,D,λ, λLM(LM + λI) 2(f M,D,λ f M,D,λ) ρ+ f M,D,λ f M,D,λ, L2 M(LM + λI) 2(f M,D,λ f M,D,λ) ρ f M,D,λ f M,D,λ, L2 M(LM + λI) 2(f M,D,λ f M,D,λ) HM + f M,D,λ f M,D,λ, L2 M(LM + λI) 2(f M,D,λ f M,D,λ) ρ Rλr+1/2 f M,D,λ f M,D,λ HM + Rλr f M,D,λ f M,D,λ ρ, where we use the fact that f ρ = L1/2 M f HM for any f L2 ρX , the the second inequality uses the Young s inequality that λ2 2r L2r M (2 2r)λ + 2r LM for the positive operator LM, λ > 0, and r [1/2, 1], the last inequality is from (LM + λI) 1LM 1. This technical proof taking consideration of r [1/2, 1] is inspired from that of KQR in Lian (2022). Plug the aforementioned two results (i) and (ii) into (24), we have |λ f M,D,λ, f M,D,λ f M,D,λ HM | Cλr+1/2 f M,D,λ f M,D,λ HM + Cλr f M,D,λ f M,D,λ ρ, where C = ( e C1 + R(ϕ2r + 1)) + R. Plug this result into (23), we get E[ρτ(y f M,D,λ(x)) ρτ(y f M,D,λ(x)] Cλr+1/2 f M,D,λ f M,D,λ HM + Cλr f M,D,λ f M,D,λ ρ λ f M,D,λ f M,D,λ 2 HM +Cλr log |D| f M,D,λ f M,D,λ ρ + Cλr+1/2 log |D| f M,D,λ f M,D,λ HM + Cλ2r 4 f M,D,λ f M,D,λ 2 HM + Cλr log |D| f M,D,λ f M,D,λ ρ λ f M,D,λ f M,D,λ 2 HM +Cλ2r log2 |D| + λ 4 f M,D,λ f M,D,λ 2 HM Cλ2r log2 |D| + Cλr log |D| f M,D,λ f M,D,λ ρ = C|D| 2r 2r+γ log2 |D| + C|D| r 2r+γ log |D| f M,D,λ f M,D,λ ρ. By Lemma B.8, we have f M,D,λ f M,D,λ 2 ρ CE[ρτ(y f M,D,λ(x)) ρτ(y f M,D,λ(x)] + C|D| 2r 2r+γ C|D| 2r 2r+γ log2 |D| + C|D| r 2r+γ log |D| f M,D,λ f M,D,λ ρ. Solve the above inequality we can finally obtain that f M,D,λ f M,D,λ ρ C|D| r 2r+γ log |D|. Thus we complete the proof. Optimal Kernel Quantile Learning with Random Features B.4. Proofs of Theorems 3.9 and 3.13 and Corollary 3.16 Now we are ready to prove Theorems 3.9 and 3.13 and Corollary 3.16. Proof. By Proposition B.7 and Lemma B.10, if r [0, 1], γ [0, 1], 2r + γ α, and λ = |D| 1 2r+γ , and the number of random features satisfies the following two inequalities M |D| α 2r+γ , for r (0, 1/2), (2r 1)(1+γ α)+α 2r+γ , for r [1/2, 1], then with probability near to 1, there holds f M,D,λ f τ ρ f M,D,λ f M,D,λ ρ + f M,D,λ f τ ρ ( e C2 + C)|D| r 2r+γ log |D|, so we have f M,D,λ f M,D,λ ρ C|D| r 2r+γ log2 |D|. Recall the knight inequality that ρτ(y f M,D,λ(x)) ρτ(y f τ (x)) = (f M,D,λ(x) f τ (x)) τ I(y f τ (x)) + Z f M,D,λ(x) f τ (x) I(y f τ (x) + t) I(y f τ (x)) dt. Taking the expectation and using Fubini s theorem, we obtain that E ρτ(y f M,D,λ(x)) ρτ(y f τ (x)) = E (f M,D,λ(x) f M,D,λ(x))E((τ I(y f τ (x))|x) "Z f M,D,λ(x) f τ (x) E(I(y f τ (x) + t)|x) E(I(y f τ (x))|x) dt The first term is 0 due to that fact that E((τ I(y f τ (x))|x) = 0 and the second term can be bounded by "Z f M,D,λ(x) f τ (x) E(I(y f τ (x) + t)|x) E(I(y f τ (x))|x) dt "Z f M,D,λ(x) f τ (x) Fy|x(f τ (x) + t) Fy|x(f τ (x)) dt "Z f M,D,λ(x) f τ (x) 2 f M,D,λ f τ 2 ρ, where the inequality uses Assumption 3.12 that supt R fy|x(t) c1. Therefore, we have E(f M,D,λ) E(f τ ) c1/2 f M,D,λ f τ 2 ρ C|D| 2r 2r+γ log2 |D|. Thus we complete the proof of Theorem 3.13. By Theorem 3.13 with α = 1 and α = γ, we can estabilsh the proofs of Theorem 3.9 and Corollary 3.16. C. Extension to the Lipschitz Loss In this section, we consider random feature method with Lipschitz continuous loss function L( , ). 
Similar to the check loss case in (5), we approximate yi with f L M,D,λ = euϕM and formulate the following general learning problem eu = argmin u RM 1 |D| i=1 L yi, u T ϕM(xi) + λu T u. The following theorem shows the capacity-dependent learning rates for the RF estimator with Lipschitz continuous loss function (Lip-RF), which is sharper than those of the existing literature (Li et al., 2021; Li, 2022) and can be applied to the agnostic setting. Optimal Kernel Quantile Learning with Random Features Theorem C.1. Under Assumptions 3.3-3.6 and 3.17, if r (0, 1], γ [0, 1], and λ = |D| 1 2r+γ , when the number of random features satisfies M |D| α 2r+γ , for r (0, 1/2), (2r 1)(1+γ α)+α 2r+γ , for r [1/2, 1], and |D| is sufficiently large, then there holds f L M,D,λ f 2 ρ = O(|D| 2r 2r+γ log |D|), with probability near to 1, where f = argminf EL(f). Proof. Similar to Lemma B.3, we decompose the error for Lip-RF in the following f L M,D,λ f ρ f L M,D,λ f M,D,λ ρ | {z } LS-approximation error + f M,D,λ f M,λ ρ | {z } Empirical error + f M,λ fλ ρ | {z } RF error + fλ f τ ρ | {z } Approximation error For the last three error terms, we have established their upper bounds in Lemma B.4-B.6. So we only need to bound the first LS-approximation error term. With the similar argument in the proof of Lemma B.8, we have the similar (adaptive) local strongly convexity condition on L with Assumption 3.17, EL(f) EL(f M,D,λ) c3 f f M,D,λ 2 ρ, (26) EL(f) EL(f M,D,λ) + f M,D,λ f 2 ρ c4 f f M,D,λ 2 ρ. (27) Note that L is Lipschitz continuous, we can also replace ρτ with L and obtain a similar inequality with that in (22) i=1 [L(yi f(xi)) L(yi f M,D,λ(xi))] E[L(y f(x)) L(y f M,D,λ(x)] C log |D|RM(δ) f f M,D,λ ρ δ + f f M,D,λ HM Using (26)-(28), we perform a similar procedure in Lemma B.10 with ρτ replaced by L and get the upper bound for the LS-approximation error term f L M,D,λ f M,D,λ ρ C|D| r 2r+γ log |D|, (29) with probability near to 1. Combining Lemmas B.4-B.6, (25) and (29), we have f L M,D,λ f 2 ρ = O(|D| 2r 2r+γ log |D|), with probability near to 1. With α = 1 and α = γ, we can derive the following corollaries for Lip-RF with uniformly sampling and data-dependent sampling strategies. Optimal Kernel Quantile Learning with Random Features Corollary C.2. Under Assumptions 3.3-3.6 and 3.17, if random features are sampled according to the uniform strategy, r (0, 1], γ [0, 1], and λ = |D| 1 2r+γ , when the number of random features satisfies M |D| 1 2r+γ , for r (0, 1/2), 2r+γ , for r [1/2, 1], and |D| is sufficiently large, then there holds f L M,D,λ f 2 ρ = O(|D| 2r 2r+γ log |D|) with probability near to 1, where f = argminf EL(f). Corollary C.3. Under Assumptions 3.3-3.6 and 3.17, if random features are sampled according to the strategy in Example 3.14, r (0, 1], γ [0, 1], and λ = |D| 1 2r+γ , when the number of random features satisfies γ 2r+γ , for r (0, 1/2), 2r+γ , for r [1/2, 1], and |D| is sufficiently large, then there holds f L M,D,λ f 2 ρ = O(|D| 2r 2r+γ log |D|), with probability near to 1, where f = argminf EL(f). Remark C.4. Note that if we further pose an assumption which is widely used in the literature (Feng et al., 2023) EL(f) EL(f ) C f f 2 ρ, we can also establish the learning rates for the excess risk of Lip-RF as given by EL(f L M,D,λ) EL(f ) f L M,D,λ f 2 ρ = O(|D| 2r 2r+γ log2 |D|). D. Operator Similarities In this section, we provide some tight bounds of operator similarities. We first analyze the similarity between LK and LM. Lemma D.1. 
For any δ (0, 1), under Assumption 3.12, there holds (LK + λI) 1/2(LK LM)(LK + λI) 1/2 2(N (λ) + 1) log(2/δ) 2N (λ) log(2/δ) with probability at least 1 δ. Proof. We denote ϕω as the the function ϕ( , ω) for any ω Ω. Note that i=1 ϕωi ϕωi and LK = Eω[ϕω ϕω]. and Assumption 3.12 that the inequality (LK + λI) 1/2ϕω 2 ρ N (λ) holds almost everywhere. By Lemma E.6 with Q = LK and vi = ϕωi for i = 1, . . . , M we can get (30) with probability at least 1 δ. Thus we complete the proof. Lemma D.2. If the number of random features M 16(N (λ) + 1) log(2/δ), then under Assumption 3.12, for any δ (0, 1), there holds (LK + λI) 1/2(LK LM)(LK + λI) 1/2 1 (LM + λI) 1/2(LK + λI)1/2 with probability at least 1 δ. Optimal Kernel Quantile Learning with Random Features Proof. If M 16(N (λ) + 1) log(2/δ), from Lemma D.1, we have (LK + λI) 1/2(LK LM)(LK + λI) 1/2 2(N (λ) + 1) log(2/δ) 2N (λ) log(2/δ) By Lemma E.3 with A = LK, B = LM, and η = 1/2, we have (LM + λI) 1/2(LK + λI)1/2 Thus we complete the proof. Lemma D.3. For any δ (0, 1), under Assumptions 3.3 and 3.12, there holds (LK + λI) 1/2(LK LM) 4κ p N (λ) log(2/δ) 4κ2N(λ) log(2/δ) with probability at least 1 δ. Proof. Let vi = zi = ϕωi for i = 1, . . . , M, then i=1 vi zi = 1 i=1 ϕωi ϕωi = LM, and Q = T = E[v v] = E[v z] = E[ϕω ϕω] = LK. Note that v 2 = ϕω 2 κ2 from Assumption 3.3, and (LK + λ) 1/2ϕω 2 N (λ) from Assumption 3.12, and N(λ) = Tr((LK + λI) 1LK), then by Lemma E.5, we have (LK + λI) 1/2(LK LM) 4κ p N (λ) log(2/δ) 4κ2N(λ) log(2/δ) Thus we complete the proof. Lemma D.4. Under Assumption 3.12, for any δ (0, 1), there holds RM,D,λ 2(κ2λ 1 + 1) log(2/δ) 2κ2λ 1 log(2/δ) with probability at least 1 δ. Proof. Recall that RM,D,λ = (CM + λI) 1/2(CM CM,D)(CM + λI) 1/2 , we have CM,D = 1 |D| x D ϕM(x) ϕM(x) and CM = Ex[ϕM(x) ϕM(x)]. (CM + λI) 1/2ϕM(x) 2 2 1 λ sup x X ϕM(x) 2 2 = 1 λM sup x X i=1 |ϕ(x, ωi)|2 i=1 sup x X |ϕ(x, ωi)|2 1 λM i=1 sup x X,ω Ω |ϕ(x, ωi)|2 κ2λ 1, where the second inequality is from Lemma D.2. By Lemma E.6 with Q = CM and vi = ϕM(xi) for xi D(x), we can get (30) with probability at least 1 δ. Thus we complete the proof. Optimal Kernel Quantile Learning with Random Features Lemma D.5. If the number of sample |D| 16(κ2λ 1 + 1) log(2/δ), then under Assumption 3.12, for any δ (0, 1), there holds with probability at least 1 δ. Proof. If |D| 16(κ2λ 1 + 1) log(2/δ), from Lemma D.4, we have (CM + λI) 1/2(CM CM,D)(CM + λI) 1/2 2(κ2λ 1 + 1) log(2/δ) 2κ2λ 1 log(2/δ) By Lemma E.3 with A = CM, B = CM,D, and η = 1/2, we have (CM,D + λI) 1/2(CM + λI)1/2 Thus we complete the proof. E. Technical Lemmas Lemma E.1 (Cordes Inequality (Furuta, 2001)). Let A and B be positive bounded linear operators on a separable Hilbert space. Then, for any 0 < τ 1, we have AτBτ AB τ. Lemma E.2 (Proposition 9 in (Rudi & Rosasco, 2017)). Let H, K be two separable Hilbert spaces and X, A be bounded linear operators, with X : H K and A : H H be positive semi-definite, then there holds XAs X 1 s XA s, s [0, 1]. Lemma E.3 (Lemma E.2 in (Blanchard & Kr amer, 2010)). For any self-adjoint and positive semi-definite operators A and B, if there exists some η [0, 1] such that (A + λI) 1/2(B A)(A + λI) 1/2 1 η, then we have (A + λI)1/2(B + λI) 1/2 1 η . Lemma E.4 (Bernstein s inequality for sum of random vectors (Proposition 2 in (Rudi & Rosasco, 2017))). Let ξ1, . . . 
, ξn be a sequence of i.i.d random variables on a separable Hilbert space H, if there exists eσ, e B 0 such that E ξi Eξi p H 1 2p!eσ2 e Bp 2, p 2, (38) for any 0 i n, then for any δ (0, 1], there holds 1 n H 2 e B log(2/δ) 2eσ2 log(2/δ) with probability at least 1 δ. Particularly, (38) is satisfied if ξ H e B 2 , a.s. and E ξ 2 H eσ2, or ξ Eξ H e B, a.s. and E ξ Eξ 2 H eσ2. Optimal Kernel Quantile Learning with Random Features Lemma E.5 (Proposition 5 in (Rudi & Rosasco, 2017)). Let H and K be two separable Hilbert spaces and (v1, z1), . . . , (vn, zn) H K for n 1 be i.i.d. random variables such that there exists some constant τ such that v H τ and z H τ almost everywhere. Let Q = Ev v and T = Ev z and Tn = 1 n Pn i=1 vi zi, then for any δ (0, 1], the following holds with probability at least 1 δ, (Q + λI) 1/2(T Tn) HS 4τ p Q (λ) log(2/δ) 4τ 2Q(λ) log(2/δ) where Q (λ) = supv H (Q + λI) 1/2v 2 and Q(λ) = Tr((Q + λI) 1/2Q). Lemma E.6 (Proposition 6 in (Rudi & Rosasco, 2017)). Let v1, . . . , vn be a sequence of i.i.d random variables on a separable Hilbert spaces H such that Q = Ev v is trace class, and for any λ > 0 there exists a constant Q (λ) < such that v, (Q + λI) 1v Q (λ) almost everywhere. Let Qn = 1 n Pn i=1 vi vi, then for any δ (0, 1], the following holds with probability at least 1 δ, (Q + λI) 1/2(Q Qn)(Q + λI) 1/2 2(Q (λ) + 1) log(2/δ) 2Q (λ) log(2/δ) Lemma E.7 (Bousquet Inequality). Let Z1, . . . , Zn be independent random elements taking values in some space Z and let Ξ be a class of real-valued functions on Z, if we have ξ ηn and 1 n i=1 Var (ξ (Zi)) ζ2 n, ξ Ξ. Define Z := supξ Ξ 1 n Pn i=1 (ξ (Zi) Eξ (Zi)) . Then for t > 0 P Z E(Z) + t p 2 (ζ2n + 2ηn E(Z)) + 2ηnt2 Lemma E.8 (Proposition 10 in (Rudi & Rosasco, 2017)). If the number of random features M (4 + 18N (λ)) log(12κ2/λδ), then under Assumption 3.3, for any δ (0, 1), there holds |NM(λ) N(λ)| 1.55N(λ), with probability at least 1 δ. F. Additional numerical Experiments In this section, we conduct some additional numerical experiments on both simulated and real-world data to demonstrate the effectiveness of random features in large kernel quantile learning tasks. F.1. Simulated Data For the simulated data, we consider the following two data-generating schemes that (i) Homoscedastic case: yi = exp( xi1 + xi2) xi2xi3 + xi + ϵi, i = 1, 2, . . . , N. (ii) Heteroscedastic case: j=1 βi sin(2πxij) + (1 + xi) ϵi F 1 ϵ (τ) , i = 1, 2, . . . , N. In both cases, xij U(0, 1) with xi = 1 p Pp j=1 xij, and ϵi follows the standard normal distribution. Moreover, in the heteroscedastic case, βi U(0, 1) and F 1 ϵ denotes the quantile function of ϵ. Clearly, the τ-th conditional quantile of y Optimal Kernel Quantile Learning with Random Features (a) τ = 0.1 (b) τ = 0.25 (c) τ = 0.5 (d) τ = 0.75 (e) τ = 0.9 Figure 4. Averaged PQE and its standard deviation against the number of random features used in KQR-RF under various scenarios. given x is given by f τ (x) = f(x) + F 1 ϵ (τ) with f(x) = exp( xi1 + xi2) xi2xi3 + xi; and in the homoscedastic case, the τ-th conditional quantile of y given x is f τ (x) = P3 j=1 βi sin(2πxij). Parameters setting. In our simulation, several scenarios are considered by varying quantile level τ from {0.1, 0.25, 0.5, 0.75, 0.9}. In all the scenarios, we employ the standard Gaussian kernel K(x, x ) = exp( x x 2/2). As suggested by Rahimi & Recht (2007) and Rudi & Rosasco (2017), the corresponding random features are taken as ϕ(x, ω) = 2 cos(ωT x + b), where ω N(0, I) and b U(0, 2π). 
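For concreteness, here is a minimal sketch of the random feature construction just described (our own illustration; variable names are not from the paper): it draws $(\omega_i, b_i)$, forms $\phi(x, \omega) = \sqrt{2}\cos(\omega^T x + b)$, and checks that the averaged feature products approximate the Gaussian kernel $K(x, x') = \exp(-\|x - x'\|^2/2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, M = 4, 2000

def random_fourier_features(X, Omega, b):
    """phi_M(x): the sqrt(2) cos(omega_i^T x + b_i) features, stacked with a 1/sqrt(M) factor."""
    return np.sqrt(2.0 / Omega.shape[1]) * np.cos(X @ Omega + b)

Omega = rng.normal(size=(d, M))          # omega_i ~ N(0, I_d)
b = rng.uniform(0.0, 2.0 * np.pi, M)     # b_i ~ U(0, 2*pi)

X = rng.uniform(size=(6, d))
Phi = random_fourier_features(X, Omega, b)

K_approx = Phi @ Phi.T                                   # random feature kernel K_M
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dist / 2.0)                         # Gaussian kernel K(x, x')

print("max |K_M - K|:", np.abs(K_approx - K_exact).max())
```

With M = 2000 the entrywise error is typically of order $10^{-2}$, consistent with the Monte Carlo rate $M^{-1/2}$.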
The regularization parameter $\lambda$ is selected via a grid search on a validation set with 1000 samples, where the grid is set as $\{10^{0.5s} : s = -20, -19, \ldots, 2\}$.

Performance evaluation. To assess the numerical performance of KQR-RF, we use the predicted quantile error (PQE), defined on a testing dataset of $n_{te} = 10000$ samples $\{x_i^{te}, y_i^{te}\}$ as
$$\widehat{\mathcal{E}}_\tau(\hat f) = \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \rho_\tau\big(y_i^{te} - \hat f(x_i^{te})\big).$$
Our investigation evaluates the influence of several critical factors, including the number of random features, the sample size, and the sampling scheme of the random features. All reported numerical results are averaged over 50 independently repeated experiments.

F.1.1. EFFECT OF RANDOM FEATURE SIZE

In this part, we evaluate the performance of KQR-RF with varying numbers of random features. Specifically, we fix N = 1000 and vary the number of random features M ∈ {2, 5, 10, 50, 100, 150}. The results of KQR-RF are summarized in Figure 4. As indicated in Figure 4, the PQE of KQR-RF tends to be smaller when a larger number of random features is used. The PQE curves become relatively flat once the number of random features reaches 50, which implies that the marginal gain from additional random features is limited once enough features have been taken into account.

Table 2. Averaged PQE and its standard deviation against N of different methods in the homoscedastic case (columns: N = 500, 1000, 2000, 5000, 10000).
τ = 0.1: KQR-RF 0.193(0.014), 0.185(0.010), 0.178(0.007), 0.176(0.005), 0.174(0.003); KQR 0.184(0.005), 0.175(0.004), 0.168(0.001), 0.164(0.001), 0.162(0.001)
τ = 0.25: KQR-RF 0.332(0.007), 0.325(0.005), 0.321(0.002), 0.319(0.002), 0.317(0.002); KQR 0.323(0.002), 0.315(0.002), 0.312(0.001), 0.308(0.001), 0.306(0.001)
τ = 0.5: KQR-RF 0.419(0.008), 0.407(0.006), 0.402(0.002), 0.400(0.002), 0.398(0.001); KQR 0.405(0.002), 0.395(0.002), 0.391(0.002), 0.388(0.001), 0.386(0.001)
τ = 0.75: KQR-RF 0.338(0.010), 0.327(0.008), 0.322(0.004), 0.319(0.002), 0.316(0.002); KQR 0.321(0.003), 0.313(0.003), 0.306(0.002), 0.304(0.001), 0.302(0.001)
τ = 0.9: KQR-RF 0.191(0.016), 0.183(0.011), 0.177(0.005), 0.175(0.004), 0.174(0.003); KQR 0.183(0.005), 0.173(0.004), 0.168(0.001), 0.166(0.001), 0.165(0.001)

Table 3. Averaged PQE and its standard deviation against N of different methods in the heteroscedastic case (columns: N = 500, 1000, 2000, 5000, 10000).
τ = 0.1: KQR-RF 0.284(0.009), 0.275(0.008), 0.268(0.005), 0.266(0.004), 0.264(0.003); KQR 0.271(0.005), 0.266(0.004), 0.261(0.001), 0.259(0.001), 0.256(0.001)
τ = 0.25: KQR-RF 0.502(0.007), 0.495(0.006), 0.487(0.003), 0.483(0.002), 0.479(0.002); KQR 0.492(0.002), 0.484(0.002), 0.479(0.001), 0.472(0.001), 0.470(0.001)
τ = 0.5: KQR-RF 0.621(0.009), 0.614(0.006), 0.605(0.003), 0.601(0.002), 0.598(0.001); KQR 0.609(0.002), 0.601(0.002), 0.597(0.002), 0.594(0.001), 0.591(0.001)
τ = 0.75: KQR-RF 0.501(0.009), 0.495(0.008), 0.482(0.004), 0.479(0.002), 0.476(0.002); KQR 0.493(0.003), 0.483(0.003), 0.474(0.002), 0.467(0.001), 0.463(0.001)
τ = 0.9: KQR-RF 0.295(0.013), 0.283(0.011), 0.270(0.006), 0.266(0.004), 0.264(0.003); KQR 0.284(0.005), 0.275(0.004), 0.261(0.001), 0.256(0.001), 0.254(0.001)
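The PQE values reported in Tables 2 and 3, and in the comparisons below, can be reproduced in spirit with the following sketch: it fits KQR-RF by minimizing the penalized check loss over the random-feature coefficients and then evaluates the fit by PQE on held-out data from the heteroscedastic scheme (ii). The optimizer (plain subgradient descent), the interpretation of $\beta_j \sim U(0,1)$ as fixed across samples, and all helper names are our own choices for illustration; the paper does not prescribe a solver in this appendix.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def rff(X, Omega, b):
    # phi_M(x): the sqrt(2/M) cos(omega^T x + b) random Fourier features of Section F.1
    return np.sqrt(2.0 / Omega.shape[1]) * np.cos(X @ Omega + b)

def check_loss(u, tau):
    # averaged check loss rho_tau(u) = u * (tau - I(u < 0))
    return np.mean(u * (tau - (u < 0)))

def fit_kqr_rf(Phi, y, tau, lam, lr=0.5, n_iter=3000):
    """Subgradient descent on (1/n) sum_i rho_tau(y_i - Phi_i w) + lam * ||w||_2^2."""
    n, M = Phi.shape
    w = np.zeros(M)
    for t in range(n_iter):
        r = y - Phi @ w
        grad = -Phi.T @ (tau - (r < 0)) / n + 2.0 * lam * w
        w -= lr / np.sqrt(t + 1.0) * grad
    return w

def simulate(N, tau, beta):
    # heteroscedastic scheme (ii) with p = 3 covariates and a fixed beta vector
    X = rng.uniform(size=(N, 3))
    eps = rng.normal(size=N)
    y = np.sin(2.0 * np.pi * X) @ beta + (1.0 + X.mean(axis=1)) * (eps - norm.ppf(tau))
    return X, y

tau, lam, M = 0.5, 1e-4, 50
beta = rng.uniform(size=3)
X_tr, y_tr = simulate(1000, tau, beta)
X_te, y_te = simulate(10000, tau, beta)

Omega = rng.normal(size=(3, M))
b = rng.uniform(0.0, 2.0 * np.pi, M)
w = fit_kqr_rf(rff(X_tr, Omega, b), y_tr, tau, lam)

pqe = check_loss(y_te - rff(X_te, Omega, b) @ w, tau)   # predicted quantile error
print(f"PQE at tau = {tau}: {pqe:.3f}")
```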
F.1.2. EFFECT OF SAMPLE SIZE

In this part, we investigate how the performance of KQR-RF is affected by the sample size N, and we also compare it with the exact kernel quantile regression (KQR, without random features). Specifically, all the settings are exactly the same as those in Section F.1.1, except that we set M = 50 and vary N ∈ {500, 1000, 2000, 5000, 10000}. It is clear from Table 2 that in the homoscedastic case the PQE of KQR-RF decreases as N increases at all quantile levels, which is consistent with the theoretical result given in Theorem 3.13. Moreover, the performance of KQR-RF approaches that of exact KQR as the sample size increases, which shows the consistency of the random feature approximation. Similar conclusions can be drawn in the heteroscedastic case, as indicated in Table 3.

F.1.3. EFFECT OF SAMPLING STRATEGY

In this part, we compare the PQE of KQR-RF under different sampling strategies. Specifically, we consider the following two strategies.

1. Uniform RF: We generate the random features with the uniform sampling strategy in (3), so the corresponding random features are $\phi(x, \omega_i) = \sqrt{2}\cos(\omega_i^T x + b)$, where $\omega_i \sim N(0, I)$ and $b \sim U(0, 2\pi)$ for $i = 1, \ldots, M$.

2. Leverage scores RF: We generate the random features with the leverage-score sampling strategy in Example 3.14. Adopting the idea of Sun et al. (2018) and Li et al. (2023b), we consider the importance ratio $q(\omega_i) = r_i / \sum_{j=1}^{M} r_j$, where $\{r_i\}_{i=1}^{M}$ is the diagonal of $\phi_M(X)^T \phi_M(X)\,\big(\phi_M(X)^T \phi_M(X) + \lambda N I\big)^{-1}$ with $\phi_M(X) = (\phi_M(x_1), \ldots, \phi_M(x_N))^T \in \mathbb{R}^{N \times M}$. The corresponding random features are then given as $\phi^l(x, \omega_i) = [q(\omega_i)]^{-1/2}\,\phi(x, \omega_i)$, where $\phi(x, \omega_i)$ is the uniform RF.

We consider the same settings as those in Section F.1.1, except that we additionally include the data-dependent sampling strategy. The results of KQR-RF are summarized in Figures 5 and 6 for the homoscedastic and heteroscedastic cases, respectively.

Figure 5. Averaged PQE and its standard deviation against the number of random features used in KQR-RF for different sampling strategies in the homoscedastic case (panels (a)–(e): τ = 0.1, 0.25, 0.5, 0.75, 0.9).

From the results in Figures 5 and 6, we can see that both uniform random features and leverage-score random features achieve better performance as the number of random features increases. The data-dependent sampling strategy is more effective than the uniform sampling strategy for a fixed number of random features, which confirms our theoretical findings in Theorem 3.9 and Corollary 3.16.

Theoretical and empirical leverage-score sampling. Example 3.14 considers a leverage-score sampling strategy with the importance ratio $q(\omega) = l_\lambda(\omega) / \int_{\Omega} l_\lambda(\omega)\, d\pi(\omega)$, where $l_\lambda(\omega) = \|(L_K + \lambda I)^{-1/2} \psi(\cdot, \omega)\|_{\rho_X}^2$. The corresponding parameters $\omega$ are sampled from the distribution $\pi^l(\omega) = q(\omega)\pi(\omega)$, and the random features are $\phi^l(x, \omega) = [q(\omega)]^{-1/2}\phi(x, \omega)$. This reweighted sampling ensures that $K(x, x') = \mathbb{E}_{\omega \sim \pi^l}\big[\langle \phi^l(x, \omega), \phi^l(x', \omega) \rangle\big]$. We emphasize that Example 3.14 is a theoretical construction: the data-dependent sampling scheme is implicit, due to the integral operator $L_K$ and the expectation with respect to $x$ in $\|(L_K + \lambda I)^{-1/2} \psi(\cdot, \omega)\|_{\rho_X}^2$. A small sketch of the empirical construction in item 2 above is given below.
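The following is a minimal sketch of that empirical leverage-score construction (our own illustration). It computes $q(\omega_i)$ from the displayed formula, resamples feature indices according to $q$ as in Example 3.14, and reweights the kept features; the $(Mq_i)^{-1/2}$ factor treats $Mq_i$ as the density ratio with respect to uniform sampling, and normalization conventions may differ from those used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, M, lam = 1000, 3, 50, 1e-4

X = rng.uniform(size=(N, d))
Omega = rng.normal(size=(d, M))
b = rng.uniform(0.0, 2.0 * np.pi, M)
Phi = np.sqrt(2.0 / M) * np.cos(X @ Omega + b)      # uniform RF matrix phi_M(X), N x M

# importance ratios q(omega_i) = r_i / sum_j r_j, with {r_i} the diagonal of
# Phi^T Phi (Phi^T Phi + lam * N * I)^{-1}
G = Phi.T @ Phi
r = np.diag(G @ np.linalg.inv(G + lam * N * np.eye(M)))
q = r / r.sum()

# resample M feature indices according to q (data-dependent sampling)
idx = rng.choice(M, size=M, replace=True, p=q)

# reweighted leverage-score features (assumed normalization, see lead-in)
Phi_lev = Phi[:, idx] / np.sqrt(M * q[idx])

# both feature maps should approximate the Gaussian kernel on a few points
sq = ((X[:5, None, :] - X[None, :5, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
print("uniform RF error:       ", np.abs(Phi[:5] @ Phi[:5].T - K).max())
print("leverage-score RF error:", np.abs(Phi_lev[:5] @ Phi_lev[:5].T - K).max())
```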
However, in the literature, there are many empirical leverage-score sampling strategies that depend directly on the data. For example, the empirical random-feature leverage scores are $\hat{l}_\lambda(\omega) = \hat{\Xi}(\omega)^T (K + \lambda I)^{-1} \hat{\Xi}(\omega)$, with $\hat{\Xi}(\omega) \in \mathbb{R}^{|D|}$, $(\hat{\Xi}(\omega))_i = \phi(x_i, \omega)$, and $K = \{k(x_i, x_j)\}_{ij}$ the data kernel matrix (see Remark 4 in Rudi & Rosasco (2017)). There are also approximate leverage-score sampling strategies that reduce the computational cost; see Sun et al. (2018) and Li et al. (2021).

F.2. Real Case Study

In this study, we consider the UK used car prices dataset from Kaggle (https://www.kaggle.com/datasets/kukuroo3/used-car-price-dataset-competition-format). The raw dataset contains N = 7632 samples after excluding those with missing values. The response is the price of each used car, while the covariates include crucial information about the used cars, such as the registration year, mileage, road tax, miles per gallon (mpg), and engine size. We aim to predict the prices of used cars, thereby assisting car buyers in making optimal purchasing decisions. In Figure 7, we plot the histogram of the used car price together with its skewness and kurtosis, revealing a notable right skew in the raw price distribution. Although this skewness is partially mitigated by applying a log transformation to the price, a right tail remains. To obtain more robust estimates, we consider quantile regression with the log-transformed price as the response. Since the sample size is large, it is natural to use random features to reduce the computational cost. In our experimental setup, we randomly choose $N_{tr} = 5000$ samples as the training data (assumed to be randomly distributed), $N_{va} = 1000$ samples as the validation data, and the rest as the testing data.

Figure 6. Averaged PQE and its standard deviation against the number of random features used in KQR-RF for different sampling strategies in the heteroscedastic case (panels (a)–(e): τ = 0.1, 0.25, 0.5, 0.75, 0.9).

Figure 7. Histograms of the raw used car price (a) and the log-transformed price (b).

Table 4. Averaged PQE and its standard deviation for τ = 0.5 of different methods on the used car price dataset.
Exact KQR: 0.094(0.000); KQR-RF (Uniform RF): 0.105(0.004); KQR-RF (Leverage scores RF): 0.098(0.003)

Figure 8. True quantile curves for r = 0, γ = 1 (left), r = 1/2, γ = 1 (middle), and r = 1, γ = 0 (right).

A minimal sketch of the preprocessing for this dataset is given below.
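This sketch (our own, not the paper's code) log-transforms the price and forms the random 5000/1000/rest train/validation/test split; the CSV file name and the column name "price" are hypothetical placeholders for the Kaggle file, the remaining columns are assumed to be numeric covariates, and the min-max rescaling described in the next paragraph is included for completeness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# hypothetical file/column names for the Kaggle used-car dataset
df = pd.read_csv("used_car_price.csv").dropna()
y = np.log(df["price"].to_numpy())                      # log-transformed response
X = df.drop(columns=["price"]).to_numpy(dtype=float)    # assumes numeric covariates

# min-max rescaling of each covariate to [0, 1] over the entire dataset
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# random split: 5000 training, 1000 validation, remaining samples for testing
perm = rng.permutation(len(y))
tr, va, te = perm[:5000], perm[5000:6000], perm[6000:]
X_tr, y_tr = X[tr], y[tr]
X_va, y_va = X[va], y[va]
X_te, y_te = X[te], y[te]
```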
For each dataset, we perform min-max normalization for each covariate, i.e., $x_{ij}$ is rescaled as $x_{ij} = (x_{ij} - x_{i,\min})/(x_{i,\max} - x_{i,\min})$, where $x_{i,\min}$ and $x_{i,\max}$ are the minimum and maximum values of the $i$-th covariate within the entire dataset. Following the suggestion in Section F.1.1, we set the number of random features to M = 100, and the choices of the regularization parameter λ are the same as in Section F.1. Considering that car buyers primarily focus on the average price of used cars, we only consider the quantile level τ = 0.5. Table 4 reports the averaged PQE and its standard deviation (over 50 repetitions) for three methods: the exact KQR, KQR-RF with uniform random features, and KQR-RF with leverage-score random features. Clearly, the two random feature methods perform close to the exact KQR, especially when the data-dependent random features are used. These results further support the effectiveness of KQR-RF and substantiate the theoretical results provided in the main text.

F.3. True quantile functions in the simulation of the main text

To graphically show the quantile functions at different quantile levels in the simulation of the main text, we consider three settings: (1) the worst case (r = 0, γ = 1); (2) a general case (r = 1/2, γ = 1); (3) the most benign case (r = 1, γ = 0). We generate data of size N = 200 and plot the quantile curves for τ ∈ {0.25, 0.5, 0.75} in Figure 8.