# Joint quantile regression in vector-valued RKHSs

Maxime Sangnier, Olivier Fercoq, Florence d'Alché-Buc
LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, 75013 Paris, France
{maxime.sangnier, olivier.fercoq, florence.dalche}@telecom-paristech.fr

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

Addressing the need for a more complete picture than the average relationship provided by standard regression, we introduce a novel framework for estimating and predicting several conditional quantiles simultaneously. The proposed methodology leverages kernel-based multi-task learning to curb the embarrassing phenomenon of quantile crossing, with a one-step estimation procedure and no post-processing. Moreover, this framework comes with theoretical guarantees and an efficient coordinate descent learning algorithm. Numerical experiments on benchmark and real datasets highlight the improvements of our approach regarding prediction error, crossing occurrences and training time.

1 Introduction

Given a couple (X, Y) of random variables, where Y takes scalar values, a common aim in statistics and machine learning is to estimate the conditional expectation $E[Y \mid X = x]$ as a function of x. In this setting, called regression, one assumes that the main information in Y is a scalar value corrupted by a centered noise. However, in some applications such as medicine, economics, social sciences and ecology, a more complete picture than an average relationship is required to deepen the analysis. Expectiles and quantiles are two kinds of quantities able to achieve this goal. This paper deals with the latter setting, called (conditional) quantile regression. The topic was championed by Koenker and Bassett [16] as the minimization of the pinball loss (see [15] for an extensive presentation) and brought to the attention of the machine learning community by Takeuchi et al. [26]. Ever since then, several studies have built upon this framework, the most recent ones including the regression of a single quantile of a random vector [12]. On the contrary, we are interested in estimating and predicting simultaneously several quantiles of a scalar-valued random variable Y|X (see Figure 1), a problem thus called joint quantile regression. For this purpose, we focus on non-parametric hypotheses from a vector-valued Reproducing Kernel Hilbert Space (RKHS).

Since the quantiles of a distribution are closely related, joint quantile regression is subsumed under the field of multi-task learning [3]. As a consequence, vector-valued kernel methods are appropriate for such a task. They have already been used for various applications, such as structured classification [10] and prediction [7], manifold regularization [21, 6] and functional regression [14]. Quantile regression is a new opportunity for vector-valued RKHSs to tackle a multi-task problem, along with a loss that differs from the $\ell_2$ cost predominantly used in the previous references. In addition, such a framework offers a novel way to curb the phenomenon of quantile curve crossing while preserving the so-called quantile property (which may not hold for current approaches). The latter guarantees that the ratio of observations lying below a predicted quantile is close to the quantile level of interest.
In a nutshell, the contributions of this work are (following the outline of the paper): i) a novel methodology for joint quantile regression, based on vector-valued RKHSs; ii) enhanced predictions thanks to a multi-task approach, along with a limited appearance of crossing curves; iii) theoretical guarantees regarding the generalization of the model; iv) an efficient coordinate descent algorithm, able to handle the intercept of the model in a manner that is simple and different from Sequential Minimal Optimization (SMO). Besides these novelties, the enhancements of the proposed method and the efficiency of our learning algorithm are supported by numerical experiments on benchmark and real datasets.

2 Problem definition

2.1 Quantile regression

Let $\mathcal{Y} \subset \mathbb{R}$ be a compact set, $\mathcal{X}$ be an arbitrary input space and $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ be a pair of random variables following an unknown joint distribution. For a given probability $\tau \in (0, 1)$, the conditional $\tau$-quantile of $(X, Y)$ is the function $\mu_\tau \colon \mathcal{X} \to \mathbb{R}$ such that $\mu_\tau(x) = \inf\{\mu \in \mathbb{R} : P(Y \le \mu \mid X = x) \ge \tau\}$. Thus, given a training set $\{(x_i, y_i)\}_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$, the quantile regression problem aims at estimating this conditional $\tau$-quantile function $\mu_\tau$.

Following Koenker [15], this can be achieved by minimizing the pinball loss
$$\ell_\tau(r) = \max(\tau r, (\tau - 1) r),$$
where $r \in \mathbb{R}$ is a residual. Using such a loss first arose from the observation that the location parameter $\mu$ that minimizes the $\ell_1$-loss $\sum_{i=1}^n |y_i - \mu|$ is an estimator of the unconditional median [16]. Now focusing on the estimation of a conditional quantile, one can show that the target function $\mu_\tau$ is a minimizer of the $\tau$-quantile risk $R_\tau(h) = E[\ell_\tau(Y - h(X))]$ [17]. However, since the joint distribution of $(X, Y)$ is unknown and we are only provided with an independent and identically distributed (iid) sample of observations $\{(x_i, y_i)\}_{i=1}^n$, we resort to minimizing the empirical risk
$$R^{\mathrm{emp}}_\tau(h) = \frac{1}{n} \sum_{i=1}^n \ell_\tau(y_i - h(x_i)),$$
within a class $H \subset (\mathbb{R})^{\mathcal{X}}$ of functions, calibrated in order to overcome the shift from the true risk to the empirical one. In particular, when $H$ has the form
$$H = \{h = f + b : b \in \mathbb{R},\ f \in (\mathbb{R})^{\mathcal{X}},\ \psi(f) \le c\},$$
with $\psi \colon (\mathbb{R})^{\mathcal{X}} \to \mathbb{R}$ a convex function and $c > 0$ a constant, Takeuchi et al. [26] proved that (similarly to the unconditional case) the quantile property is satisfied: for any estimator $\hat{h}$ obtained by minimizing $R^{\mathrm{emp}}_\tau$ in $H$, the ratio of observations lying below $\hat{h}$ (i.e. $y_i < \hat{h}(x_i)$) equals $\tau$ up to a small error (the ratio of observations exactly equal to $\hat{h}(x_i)$). Moreover, under some regularity assumptions, this quantity converges to $\tau$ as the sample grows. Note that these properties hold because the intercept $b$ is unconstrained.

2.2 Multiple quantile regression

In many real problems (such as medical reference charts), one is interested in estimating not a single quantile curve but several of them. Thus, denoting $\mathbb{N}_p$ the range of integers between 1 and $p$, for several quantile levels $\tau_j$ ($j \in \mathbb{N}_p$) and functions $h_j \in H$, the empirical loss to be minimized can be written as the following separable function:
$$R^{\mathrm{emp}}_{\boldsymbol{\tau}}(h_1, \dots, h_p) = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^p \ell_{\tau_j}(y_i - h_j(x_i)),$$
where $\boldsymbol{\tau}$ denotes the $p$-dimensional vector of quantile levels. A nice feature of multiple quantile regression is thus to extract slices of the conditional distribution of Y|X. However, when quantiles are estimated independently, an embarrassing phenomenon often appears: quantile functions cross, thus violating the basic principle that the cumulative distribution function should be monotonically non-decreasing. We refer to that pitfall as the crossing problem.
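As a small numerical illustration (not taken from the paper), the snippet below evaluates the pinball loss and checks the unconditional quantile property: the minimizer of the empirical pinball risk is close to the empirical $\tau$-quantile, and the ratio of observations lying below it is close to $\tau$.

```python
import numpy as np

def pinball_loss(r, tau):
    """Pinball loss: l_tau(r) = max(tau * r, (tau - 1) * r), elementwise."""
    return np.maximum(tau * r, (tau - 1.0) * r)

rng = np.random.default_rng(0)
y = rng.standard_normal(1000)
tau = 0.7

# Minimize the empirical pinball risk over a grid of candidate locations mu.
grid = np.linspace(y.min(), y.max(), 2001)
risks = np.array([pinball_loss(y - mu, tau).mean() for mu in grid])
mu_hat = grid[risks.argmin()]

print(mu_hat, np.quantile(y, tau))  # the two estimates nearly coincide
print(np.mean(y < mu_hat))          # quantile property: ratio of points below mu_hat is close to 0.7
```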
In this paper, we propose to prevent curve crossing by considering the problem of multiple quantile regression as a vector-valued regression problem where the outputs are not independent. An interesting feature of our method is that it preserves the quantile property, while most other approaches lose it when struggling with the crossing problem.

2.3 Related work

Going beyond linear and spline-based models, quantile regression in RKHSs was introduced a decade ago [26, 17]. In [26], the authors proposed to minimize the pinball loss in a scalar-valued RKHS and to add hard constraints on the training points in order to prevent the crossing problem. Our work can legitimately be seen as an extension of [26] to multiple quantile regression, using a vector-valued RKHS and structural constraints against curve crossing thanks to an appropriate matrix-valued kernel. Another related work is [27], which first introduced the idea of multi-task learning for quantile regression. In [27], linear quantile curves are estimated jointly with a common feature subspace shared across the tasks, based on multi-task feature learning [3]. In addition, the authors showed that for such linear regressors, a common representation shared across infinitely many tasks can be computed, thus estimating simultaneously conditional quantiles for all possible quantile levels. Both previous approaches are considered in the numerical experiments.

Quantile regression has been investigated from many perspectives, including different losses leading to an approximate quantile property ($\epsilon$-insensitive [25], re-weighted least squares [22]), along with models and estimation procedures to curb the crossing problem: location-scale model with a multi-step strategy [13], tensor product spline surface [22], non-negative valued kernels [18], hard non-crossing constraints [26, 28, 5], inversion and monotonization of a conditional distribution estimation [9] and rearrangement of quantile estimations [8], to cite only a few references. Let us remark that some solutions, such as non-crossing constraints [26], theoretically lose the quantile property because they constrain the intercept. In comparison to the literature, we propose a novel methodology, based on vector-valued RKHSs, with a one-step estimation, no post-processing, and which keeps the quantile property while dealing with curve crossing. We also provide an efficient learning algorithm and theoretical guarantees.

3 Vector-valued RKHS for joint quantile regression

3.1 Joint estimation

Given a vector $\boldsymbol{\tau} \in (0, 1)^p$ of quantile levels, multiple quantile regression is now considered as the joint estimation in $(\mathbb{R}^p)^{\mathcal{X}}$ of the target function $x \in \mathcal{X} \mapsto (\mu_{\tau_1}(x), \dots, \mu_{\tau_p}(x)) \in \mathbb{R}^p$ of conditional quantiles. Thus, let now $\psi$ be a convex regularizer on $(\mathbb{R}^p)^{\mathcal{X}}$ and
$$H = \{h = f + b : b \in \mathbb{R}^p,\ f \in (\mathbb{R}^p)^{\mathcal{X}},\ \psi(f) \le c\}$$
be the hypothesis set. Similarly to the scalar case, joint quantile regression aims at minimizing
$$R^{\mathrm{emp}}_{\boldsymbol{\tau}}(h) = \frac{1}{n} \sum_{i=1}^n \ell_{\boldsymbol{\tau}}(y_i \mathbf{1} - h(x_i)),$$
where $\mathbf{1}$ stands for the all-ones vector, $\ell_{\boldsymbol{\tau}}(r) = \sum_{j=1}^p \ell_{\tau_j}(r_j)$ and $h$ lies in $H$, which is to be appropriately chosen in order to estimate the $p$ conditional quantiles while enhancing predictions and avoiding curve crossing. It is worth remarking that, independently of the choice of $\psi$, the quantile property is still verified for a vector-valued estimator, since the loss is separable and the intercept is unconstrained. Similarly, the vector-valued function whose components are the conditional $\tau_j$-quantiles is still a minimizer of the $\boldsymbol{\tau}$-quantile risk $R_{\boldsymbol{\tau}}(h) = E[\ell_{\boldsymbol{\tau}}(Y \mathbf{1} - h(X))]$.
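For concreteness, here is a small NumPy sketch (not from the paper) of the joint empirical risk above, with the predictions of a vector-valued hypothesis h stored as an (n, p) array:

```python
import numpy as np

def joint_pinball_risk(Y_pred, y, taus):
    """R_emp(h) = (1/n) sum_i sum_j l_{tau_j}(y_i - h_j(x_i)).

    Y_pred: (n, p) array with Y_pred[i, j] = h_j(x_i).
    y:      (n,) array of observed responses.
    taus:   (p,) array of quantile levels.
    """
    r = y[:, None] - Y_pred                    # residuals y_i - h_j(x_i)
    t = np.asarray(taus)[None, :]
    return np.mean(np.sum(np.maximum(t * r, (t - 1.0) * r), axis=1))
```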
In this context, the constraint $\psi$ does not necessarily apply independently to each coordinate function $h_j$ but can impose dependencies between them. The theory of vector-valued RKHSs seems especially well suited for this purpose when $\psi$ is taken to be the norm of such a space. In this situation, the choice of the kernel does not only influence the nature of the hypotheses (linear, non-linear, universal approximators) but also the way the estimation procedure is regularized. In particular, the kernel critically operates on the output space by encoding structural constraints on the outputs.

3.2 Matrix-valued kernel

Let us denote by $\cdot^\top$ the transpose operator and by $\mathcal{L}(\mathbb{R}^p)$ the set of linear and bounded operators from $\mathbb{R}^p$ to itself. In our (finite-dimensional) case, $\mathcal{L}(\mathbb{R}^p)$ comes down to the set of $p \times p$ real-valued matrices. A matrix-valued kernel is a function $K \colon \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathbb{R}^p)$ that is symmetric and positive [20]:
$$\forall (x, x') \in \mathcal{X} \times \mathcal{X},\ K(x, x') = K(x', x)^\top \quad \text{and} \quad \forall m \in \mathbb{N},\ \forall \{(\alpha_i, \beta_i)\}_{1 \le i \le m} \in (\mathcal{X} \times \mathbb{R}^p)^m,\ \sum_{1 \le i, j \le m} \langle \beta_i \mid K(\alpha_i, \alpha_j) \beta_j \rangle \ge 0.$$
Let $K$ be such a kernel and, for any $x \in \mathcal{X}$, let $K_x \colon y \in \mathbb{R}^p \mapsto K_x y \in (\mathbb{R}^p)^{\mathcal{X}}$ be the linear operator such that $\forall x' \in \mathcal{X} \colon (K_x y)(x') = K(x', x) y$. There exists a unique Hilbert space of functions $\mathcal{K}_K \subset (\mathbb{R}^p)^{\mathcal{X}}$ (with inner product and norm respectively denoted $\langle \cdot \mid \cdot \rangle_K$ and $\|\cdot\|_K$), called the RKHS associated to $K$, such that for all $x \in \mathcal{X}$ [20]: $K_x$ spans the space $\mathcal{K}_K$ ($\forall y \in \mathbb{R}^p \colon K_x y \in \mathcal{K}_K$), $K_x$ is bounded for the uniform norm ($\sup_{y \in \mathbb{R}^p} \|K_x y\|_K < \infty$) and $\forall f \in \mathcal{K}_K \colon f(x) = K_x^* f$ (reproducing property), where $\cdot^*$ is the adjoint operator.

From now on, we assume that we are provided with a matrix-valued kernel $K$ and we limit the hypothesis space to $H = \{f + b : b \in \mathbb{R}^p,\ f \in \mathcal{K}_K,\ \|f\|_K \le c\}$ (i.e. $\psi = \|\cdot\|_K$). Though several candidates are available [1], we focus on one of the simplest and most efficiently computable kernels, called a decomposable kernel: $K \colon (x, x') \mapsto k(x, x') B$, where $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a scalar-valued kernel and $B$ is a $p \times p$ symmetric Positive Semi-Definite (PSD) matrix.

In this particular case, the matrix $B$ encodes the relationship between the components $f_j$ and thus the link between the different conditional quantile estimators. A rational choice is to consider $B = \left(\exp(-\gamma (\tau_i - \tau_j)^2)\right)_{1 \le i, j \le p}$. To explain it, let us consider two extreme cases (see also Figure 1). First, when $\gamma = 0$, $B$ is the all-ones matrix. Since $\mathcal{K}_K$ is the closure of the space $\operatorname{span}\{K_x y : (x, y) \in \mathcal{X} \times \mathbb{R}^p\}$, any $f \in \mathcal{K}_K$ has all its components equal. Consequently, the quantile estimators $h_j = f_j + b_j$ are parallel (and non-crossing) curves. In this case, the regressor is said to be homoscedastic. Second, when $\gamma \to +\infty$, then $B \to I$ (the identity matrix). In this situation, it is easy to show that the components of $f \in \mathcal{K}_K$ are independent from each other and that $\|f\|_K^2 = \sum_{j=1}^p \|f_j\|_k^2$ (where $\|\cdot\|_k$ is the norm of the RKHS associated to $k$) is separable. Thus, each quantile function is learned independently from the others. Regressors are then said to be heteroscedastic. It appears clearly that between these two extreme cases, there is room for learning a non-homoscedastic and non-crossing quantile regressor (while preserving the quantile property).

Figure 1: Estimated (plain lines) and true (dashed lines) conditional quantiles of Y|X (synthetic dataset), from homoscedastic regressors (γ = 0) to heteroscedastic ones (γ → +∞).
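A quick way to see the two extremes discussed above (a small illustration, not from the paper): with γ = 0 the output matrix B is all ones, while for a very large γ it is nearly the identity.

```python
import numpy as np

def output_matrix(taus, gamma):
    """B = (exp(-gamma * (tau_i - tau_j)^2))_{i,j}: the output part of the decomposable kernel."""
    d = np.subtract.outer(taus, taus)
    return np.exp(-gamma * d ** 2)

taus = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
print(output_matrix(taus, 0.0))   # all-ones matrix: components tied together (parallel quantile curves)
print(output_matrix(taus, 1e6))   # close to the identity: quantiles estimated independently
```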
4 Theoretical analysis

This section is intended to give a few theoretical insights into the expected behavior of our hypotheses. Here, we assume working in an RKHS but not necessarily with a decomposable kernel. First, we aim at providing a uniform generalization bound.

For this purpose, let $F = \{f \in \mathcal{K}_K : \|f\|_K \le c\}$, let $\operatorname{tr}(\cdot)$ be the trace operator, let $((X_i, Y_i))_{1 \le i \le n} \in (\mathcal{X} \times \mathcal{Y})^n$ be an iid sample and denote by $\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^n \ell_{\boldsymbol{\tau}}(Y_i \mathbf{1} - h(X_i))$ the random variable associated to the empirical risk of a hypothesis $h$.

Theorem 4.1 (Generalization). Let $a \in \mathbb{R}_+$ be such that $\sup_{y \in \mathcal{Y}} |y| \le a$, let $b \in \mathcal{Y}^p$ and let $H = \{f + b : f \in F\}$ be the class of hypotheses. Moreover, assume that there exists $\kappa \ge 0$ such that $\sup_{x \in \mathcal{X}} \operatorname{tr}(K(x, x)) \le \kappa$. Then, with probability at least $1 - \delta$ (for $\delta \in (0, 1]$):
$$\forall h \in H \colon \quad R(h) \le \hat{R}_n(h) + \frac{2 c \sqrt{p \kappa}}{\sqrt{n}} + \left(2 p a + c \sqrt{p \kappa}\right) \sqrt{\frac{\log(1/\delta)}{2n}}.$$

Sketch of proof (full derivation in Appendix A.1). We start with a concentration inequality for scalar-valued functions [4] and we use a vector-contraction property [19]. The bound on the Rademacher complexity from [24, Theorem 3.1] concludes the proof.

The uniform bound in Theorem 4.1 states that, with high probability, all the hypotheses of interest have a true risk which is less than the empirical risk up to an additive bias in $O(1/\sqrt{n})$. Let us remark that it makes use of the output dimension $p$. There exist non-uniform generalization bounds for operator-valued kernel-based hypotheses which do not depend on the output dimension [14], and are thus well suited for infinite-dimensional output spaces. Yet those results only hold for optimal solutions $\hat{h}$ of the learning problem, which we never obtain in practice.

As a second theoretical insight, Theorem 4.2 gives a bound on the quantile property, similar to the one provided in [26] for scalar-valued functions. It states that $E[P(Y \le h_j(X) \mid X)]$ does not deviate too much from $\tau_j$.

Theorem 4.2 (Quantile deviation). Let us consider that the assumptions of Theorem 4.1 hold. Moreover, let $\epsilon > 0$ be an artificial margin, let $\Gamma^+_\epsilon \colon r \in \mathbb{R} \mapsto \operatorname{proj}_{[0,1]}\!\left(1 - \frac{r}{\epsilon}\right)$ and $\Gamma^-_\epsilon \colon r \in \mathbb{R} \mapsto \operatorname{proj}_{[0,1]}\!\left(-\frac{r}{\epsilon}\right)$ be two ramp functions, and let $j \in \mathbb{N}_p$ and $\delta \in (0, 1]$. Then, with probability at least $1 - \delta$:
$$\frac{1}{n} \sum_{i=1}^n \Gamma^-_\epsilon\!\left(Y_i - h_j(X_i)\right) - \Delta \;\le\; E\!\left[P(Y \le h_j(X) \mid X)\right] \;\le\; \frac{1}{n} \sum_{i=1}^n \Gamma^+_\epsilon\!\left(Y_i - h_j(X_i)\right) + \Delta,$$
where $\Delta$ denotes a deviation term of order $O(1/(\epsilon \sqrt{n}))$ coming from the same concentration argument.

Sketch of proof (full derivation in Appendix A.2). The proof is similar to that of Theorem 4.1, remarking that $\Gamma^+_\epsilon$ and $\Gamma^-_\epsilon$ are $1/\epsilon$-Lipschitz continuous.

5 Optimization algorithm

In order to finalize the M-estimation of a non-parametric function, we need a way to jointly solve the optimization problem of interest and compute the estimator. For ridge regression in vector-valued RKHSs, representer theorems make it possible to reformulate the hypothesis f and to derive algorithms based on matrix inversion [20, 6] or on a Sylvester equation [10]. Since the optimization problem we are tackling is quite different, those methods do not apply. Yet, deriving a dual optimization problem makes it possible to hit the mark.

Quantile estimation, as presented in this paper, comes down to minimizing a regularized empirical risk defined by the pinball loss $\ell_{\boldsymbol{\tau}}$. Since this loss function is non-differentiable, we introduce slack variables $\xi$ and $\xi^*$ to get the following primal formulation, with a regularization parameter $C$ to be tuned:
$$\begin{aligned} \underset{f \in \mathcal{K}_K,\ b \in \mathbb{R}^p,\ \xi, \xi^* \in (\mathbb{R}^p)^n}{\text{minimize}} \quad & \frac{1}{2} \|f\|_K^2 + C \sum_{i=1}^n \left( \langle \boldsymbol{\tau} \mid \xi_i \rangle_{\ell_2} + \langle \mathbf{1} - \boldsymbol{\tau} \mid \xi_i^* \rangle_{\ell_2} \right) \\ \text{s.t.} \quad & \forall i \in \mathbb{N}_n \colon\ \xi_i \succeq 0,\ \xi_i^* \succeq 0,\ y_i \mathbf{1} - f(x_i) - b = \xi_i - \xi_i^*, \end{aligned} \tag{1}$$
where $\succeq$ is a pointwise inequality. A dual formulation of Problem (1) is (see Appendix B):
$$\begin{aligned} \underset{\alpha \in (\mathbb{R}^p)^n}{\text{minimize}} \quad & \frac{1}{2} \sum_{i,j=1}^n \langle \alpha_i \mid K(x_i, x_j) \alpha_j \rangle_{\ell_2} - \sum_{i=1}^n y_i \langle \alpha_i \mid \mathbf{1} \rangle_{\ell_2} \\ \text{s.t.} \quad & \sum_{i=1}^n \alpha_i = 0_{\mathbb{R}^p}, \quad \forall i \in \mathbb{N}_n \colon\ C(\boldsymbol{\tau} - \mathbf{1}) \preceq \alpha_i \preceq C \boldsymbol{\tau}, \end{aligned} \tag{2}$$
where the linear constraints come from considering an intercept $b$. The Karush-Kuhn-Tucker (KKT) conditions of Problem (1) indicate that a minimizer $\hat{f}$ of (1) can be recovered from a solution $\hat{\alpha}$ of (2) with the formula $\hat{f} = \sum_{i=1}^n K_{x_i} \hat{\alpha}_i$. Moreover, $\hat{b}$ can also be obtained thanks to the KKT conditions.
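As a minimal sketch (assuming a decomposable kernel K(x, x') = k(x, x')B and the CVXOPT interface `solvers.qp`, which is not the authors' implementation), Problem (2) can be passed to a generic QP solver as follows; the stacked dual variable is α = (α_1, …, α_n), with one block of size p per training point.

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_dual_qp(K, B, y, taus, C):
    """Solve the dual problem (2) for a decomposable kernel k(x, x')B with a generic QP solver.

    K: (n, n) Gram matrix of the scalar kernel k; B: (p, p) symmetric PSD output matrix;
    y: (n,) responses; taus: (p,) quantile levels; C: regularization parameter.
    Returns alpha as an (n, p) array (row i is alpha_i).
    """
    n, p = K.shape[0], B.shape[0]
    P = np.kron(K, B) + 1e-8 * np.eye(n * p)          # quadratic part (small jitter for numerical PSD-ness)
    q = -np.kron(y, np.ones(p))                       # linear part: -sum_i y_i <alpha_i, 1>
    # Box constraints C(tau_l - 1) <= alpha_i^l <= C tau_l, written as G alpha <= h.
    upper = C * np.tile(taus, n)
    lower = C * (np.tile(taus, n) - 1.0)
    G = np.vstack([np.eye(n * p), -np.eye(n * p)])
    h = np.concatenate([upper, -lower])
    # Equality constraint sum_i alpha_i = 0 in R^p (coming from the intercept b).
    A = np.kron(np.ones((1, n)), np.eye(p))
    b_eq = np.zeros(p)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h), matrix(A), matrix(b_eq))
    return np.array(sol["x"]).reshape(n, p)

# Predictions then follow from f(x) = sum_i k(x, x_i) B alpha_i, i.e. F_test = K_test @ alpha @ B.
```

Note that this direct formulation materializes the np × np matrix kron(K, B); the cost of doing so is discussed below.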
However, since in practice we only have a numerical, approximate solution $\alpha$, the intercept $b$ is computed by solving Problem (1) with $f$ fixed. This boils down to taking $b_j$ as the $\tau_j$-quantile of $(y_i - f_j(x_i))_{1 \le i \le n}$.

Problem (2) is a common quadratic program that can be solved with off-the-shelf solvers. However, since we are essentially interested in decomposable kernels $K(\cdot, \cdot) = k(\cdot, \cdot) B$, the quadratic part of the objective function is defined by the $np \times np$ matrix $K \otimes B$, where $\otimes$ is the Kronecker product and $K = (k(x_i, x_j))_{1 \le i, j \le n}$. Storing this matrix explicitly is likely to be time and memory expensive. In order to improve the estimation procedure, ad hoc algorithms can be derived. For instance, regression with a decomposable kernel boils down to solving a Sylvester equation (which can be done efficiently) [10], and a vector-valued Support Vector Machine (SVM) without intercept can be learned with a coordinate descent algorithm [21]. However, these methods cannot be used in our setting, since the loss function is different and considering the intercept is necessary for the quantile property. Coordinate descent could theoretically be extended into an SMO technique able to handle the linear constraints introduced by the intercept. However, SMO usually works with a single linear constraint and needs heuristics to run efficiently, which are quite difficult to find (even though an implementation exists for two linear constraints [25]). Therefore, for the sake of efficiency, we propose to use a Primal-Dual Coordinate Descent (PDCD) technique recently introduced in [11]. This algorithm (which is proved to converge) is able to deal with the linear constraints coming from the intercept and is thus perfectly workable for the problem at hand. Moreover, PDCD has been shown to compare favorably with SMO for SVMs.
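To make the memory point above concrete, here is a small check (not from the paper): the product by K ⊗ B that appears in the dual objective can be computed from K and B alone, without ever forming the np × np matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
K = rng.standard_normal((n, n)); K = K @ K.T        # any symmetric PSD Gram matrix
B = rng.standard_normal((p, p)); B = B @ B.T        # symmetric PSD output matrix
A = rng.standard_normal((n, p))                     # row i plays the role of alpha_i

# Row-major flattening of A stacks the blocks alpha_1, ..., alpha_n, matching kron(K, B).
explicit = (np.kron(K, B) @ A.reshape(-1)).reshape(n, p)   # forms an (n*p) x (n*p) matrix
implicit = K @ A @ B                                        # uses only K, A and B (B is symmetric)
print(np.allclose(explicit, implicit))                      # True
```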
Table 1: Empirical pinball loss and crossing loss (×100); the lower, the better. Bullets (resp. circles) indicate statistically significant (resp. non-significant) differences. The proposed method is JQR.

| Dataset | Pinball: IND. | Pinball: IND. (NC) | Pinball: MTFL | Pinball: JQR | Crossing: IND. | Crossing: IND. (NC) | Crossing: MTFL | Crossing: JQR |
|---|---|---|---|---|---|---|---|---|
| caution | 102.6 ± 17.3 | 103.2 ± 17.2 | 102.9 ± 19.0 | 102.6 ± 19.0 | 0.53 ± 0.67 | 0.31 ± 0.70 | 0.69 ± 0.54 | 0.09 ± 0.14 |
| ftcollinssnow | 151.1 ± 8.2 | 150.8 ± 8.0 | 152.4 ± 8.9 | 153.7 ± 12.1 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| highway | 102.9 ± 39.1 | 102.8 ± 38.9 | 102.0 ± 34.5 | 103.7 ± 35.7 | 9.08 ± 7.38 | 9.00 ± 7.39 | 3.48 ± 4.49 | 8.81 ± 7.46 |
| heights | 128.2 ± 2.4 | 128.2 ± 2.4 | 128.6 ± 2.2 | 127.9 ± 1.8 | 0.04 ± 0.05 | 0.04 ± 0.05 | 0.07 ± 0.14 | 0.00 ± 0.00 |
| sniffer | 44.8 ± 6.7 | 44.6 ± 6.8 | 46.9 ± 7.6 | 45.2 ± 6.9 | 1.01 ± 0.75 | 0.52 ± 0.48 | 1.23 ± 0.77 | 0.15 ± 0.22 |
| snowgeese | 68.4 ± 35.3 | 68.4 ± 35.3 | 75.3 ± 38.2 | 76.0 ± 31.5 | 3.24 ± 5.10 | 2.60 ± 4.28 | 8.93 ± 19.52 | 0.94 ± 3.46 |
| ufc | 81.8 ± 4.6 | 81.6 ± 4.6 | 84.9 ± 4.7 | 80.6 ± 4.1 | 0.24 ± 0.22 | 0.27 ± 0.42 | 0.82 ± 1.47 | 0.05 ± 0.15 |
| birthwt | 139.0 ± 9.9 | 139.0 ± 9.9 | 142.6 ± 11.6 | 139.8 ± 11.7 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.31 ± 0.88 | 0.00 ± 0.00 |
| crabs | 12.3 ± 1.0 | 12.3 ± 1.0 | 12.6 ± 1.0 | 11.9 ± 0.9 | 0.46 ± 0.33 | 0.35 ± 0.24 | 0.30 ± 0.22 | 0.06 ± 0.20 |
| GAGurine | 62.6 ± 8.2 | 62.6 ± 8.2 | 64.5 ± 7.5 | 62.6 ± 8.1 | 0.05 ± 0.08 | 0.04 ± 0.07 | 0.05 ± 0.09 | 0.03 ± 0.08 |
| geyser | 110.2 ± 7.8 | 110.1 ± 7.8 | 109.4 ± 7.1 | 111.3 ± 8.2 | 0.87 ± 1.60 | 0.92 ± 2.02 | 0.80 ± 1.18 | 0.72 ± 1.51 |
| gilgais | 47.4 ± 4.4 | 47.2 ± 4.4 | 49.9 ± 3.6 | 46.9 ± 4.6 | 1.23 ± 0.96 | 0.95 ± 0.85 | 0.71 ± 0.96 | 0.81 ± 0.43 |
| topo | 71.1 ± 13.0 | 70.1 ± 13.7 | 73.1 ± 11.8 | 69.6 ± 13.4 | 2.72 ± 3.26 | 1.52 ± 2.47 | 2.75 ± 2.93 | 1.14 ± 2.02 |
| Boston Housing | 48.5 ± 5.0 | 48.5 ± 5.0 | 49.7 ± 4.7 | 47.4 ± 4.7 | 0.64 ± 0.32 | 0.48 ± 0.27 | 1.11 ± 0.33 | 0.58 ± 0.34 |
| Cobar Ore | 0.5 ± 0.5 | 0.5 ± 0.5 | 5.0 ± 4.9 | 0.6 ± 0.5 | 0.10 ± 0.13 | 0.10 ± 0.13 | 0.29 ± 0.35 | 0.02 ± 0.05 |
| engel | 61.3 ± 18.3 | 61.2 ± 19.0 | 58.7 ± 17.9 | 64.4 ± 23.2 | 1.50 ± 4.94 | 1.25 ± 4.53 | 1.65 ± 5.97 | 0.06 ± 0.14 |
| mcycle | 89.2 ± 8.5 | 88.9 ± 8.4 | 102.0 ± 11.7 | 84.3 ± 10.3 | 2.10 ± 1.83 | 0.92 ± 1.25 | 1.13 ± 1.10 | 0.14 ± 0.37 |
| BigMac2003 | 71.0 ± 21.0 | 70.9 ± 21.1 | 68.7 ± 18.1 | 67.6 ± 20.9 | 2.50 ± 2.12 | 1.87 ± 1.68 | 0.73 ± 0.92 | 1.55 ± 1.75 |
| UN3 | 99.5 ± 7.0 | 99.4 ± 7.0 | 101.8 ± 7.1 | 98.8 ± 7.6 | 1.06 ± 0.85 | 0.85 ± 0.70 | 0.65 ± 0.62 | 0.09 ± 0.31 |
| cpus | 20.0 ± 13.7 | 19.9 ± 13.6 | 23.8 ± 16.0 | 19.7 ± 13.7 | 1.29 ± 1.13 | 1.17 ± 1.15 | 0.46 ± 0.28 | 0.09 ± 0.13 |

PDCD is described in Algorithm 1, where, for $\alpha = (\alpha_i)_{1 \le i \le n} \in (\mathbb{R}^p)^n$, $\alpha^j \in \mathbb{R}^n$ denotes its $j$th row vector and $\alpha^j_i$ its $i$th component, $\operatorname{diag}$ is the operator mapping a vector to a diagonal matrix, and $\operatorname{proj}_{\mathbf{1}}$ and $\operatorname{proj}_{[C(\tau_l - 1), C\tau_l]}$ are respectively the projectors onto the vector $\mathbf{1}$ (i.e. its linear span) and onto the compact set $[C(\tau_l - 1), C\tau_l]$. PDCD uses dual variables $\theta \in (\mathbb{R}^p)^n$ (which are updated during the descent) and has two sets of parameters $\nu \in (\mathbb{R}^p)^n$ and $\mu \in (\mathbb{R}^p)^n$, which verify, for all $(i, l) \in \mathbb{N}_n \times \mathbb{N}_p$: $\mu^l_i < \frac{1}{(K(x_i, x_i))_{l,l} + \nu^l_i}$. In practice, we kept the same parameters as in [11]: $\nu^l_i = 10\,(K(x_i, x_i))_{l,l}$ and $\mu^l_i$ equal to 0.95 times the bound. Moreover, as is standard for coordinate descent methods, our implementation uses efficient updates for the computation of both $\sum_{j=1}^n K(x_i, x_j) \alpha_j$ and $\bar{\theta}^l$.

Algorithm 1: Primal-Dual Coordinate Descent
  Initialize $\alpha_i, \theta_i \in \mathbb{R}^p$ ($\forall i \in \mathbb{N}_n$).
  repeat
    Choose $(i, l) \in \mathbb{N}_n \times \mathbb{N}_p$ uniformly at random.
    Set $\bar{\theta}^l \leftarrow \operatorname{proj}_{\mathbf{1}}\!\left(\theta^l + \operatorname{diag}(\nu^l)\,\alpha^l\right)$.
    Set $d^l_i \leftarrow \sum_{j=1}^n (K(x_i, x_j)\alpha_j)_l - y_i + 2\bar{\theta}^l_i - \theta^l_i$.
    Set $\bar{\alpha}^l_i \leftarrow \operatorname{proj}_{[C(\tau_l - 1), C\tau_l]}\!\left(\alpha^l_i - \mu^l_i d^l_i\right)$.
    Update coordinate $(i, l)$: $\alpha^l_i \leftarrow \bar{\alpha}^l_i$, $\theta^l_i \leftarrow \bar{\theta}^l_i$, and keep the other coordinates unchanged.
  until the duality gap between (1) and (2) is small enough
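Below is a minimal NumPy sketch (not the authors' implementation) of one update of Algorithm 1 for a decomposable kernel K(x, x') = k(x, x')B, with α and θ stored as (n, p) arrays; the cached updates and the duality-gap stopping test mentioned above are omitted.

```python
import numpy as np

def pdcd_update(A, Theta, K, B, y, taus, C, nu, mu, rng):
    """One coordinate update of Algorithm 1 (primal-dual coordinate descent).

    A, Theta, nu, mu: (n, p) arrays whose row i holds alpha_i, theta_i, nu_i, mu_i.
    K: (n, n) Gram matrix of the scalar kernel k; B: (p, p) output matrix.
    """
    n, p = A.shape
    i, l = rng.integers(n), rng.integers(p)
    # Projection of theta^l + diag(nu^l) alpha^l onto the span of the all-ones vector: its mean.
    theta_bar_il = np.mean(Theta[:, l] + nu[:, l] * A[:, l])
    # d_i^l = sum_j (K(x_i, x_j) alpha_j)_l - y_i + 2 theta_bar_i^l - theta_i^l
    # (note that kron(K, B) is never formed).
    d_il = K[i, :] @ (A @ B[:, l]) - y[i] + 2.0 * theta_bar_il - Theta[i, l]
    # Projected (proximal) step onto the box [C(tau_l - 1), C tau_l].
    A[i, l] = np.clip(A[i, l] - mu[i, l] * d_il, C * (taus[l] - 1.0), C * taus[l])
    Theta[i, l] = theta_bar_il
```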
6 Numerical experiments

Two sets of experiments are presented, respectively aimed at assessing the ability of our methodology to predict quantiles and at comparing an implementation of Algorithm 1 with an off-the-shelf solver and an augmented Lagrangian scheme. Following the previous sections, a decomposable kernel $K(x, x') = k(x, x') B$ is used, where $B = (\exp(-\gamma (\tau_i - \tau_j)^2))_{1 \le i, j \le p}$ and $k(x, x') = \exp(-\|x - x'\|_{\ell_2}^2 / 2\sigma^2)$, with $\sigma$ being the 0.7-quantile of the pairwise distances of the training data $\{x_i\}_{1 \le i \le n}$. The quantile levels of interest are $\boldsymbol{\tau} = (0.1, 0.3, 0.5, 0.7, 0.9)$.

6.1 Quantile regression

Quantile regression is assessed with two criteria: the pinball loss $\frac{1}{n} \sum_{i=1}^n \ell_{\boldsymbol{\tau}}(y_i \mathbf{1} - h(x_i))$, which is the one minimized to build the proposed estimator, and the crossing loss $\sum_{j=1}^{p-1} \frac{1}{n} \sum_{i=1}^n \max(0, h_{j+1}(x_i) - h_j(x_i))$, which, assuming that $\tau_j > \tau_{j+1}$, quantifies how far $h_j$ goes below $h_{j+1}$, while $h_j$ is expected to always stay above $h_{j+1}$. More experiments are given in Appendix D.1.

This study focuses on three non-parametric models based on the RKHS theory. Other linear and spline-based models have been dismissed, since Takeuchi et al. [26] have already provided a comparison of those with kernel methods. First, we considered an independent estimation of the quantile regressors (IND.), which boils down to setting B = I (this approach could be set up without vector-valued RKHSs, with scalar-valued kernels only). Second, hard non-crossing constraints on the training data have been imposed (IND. (NC)), as proposed in [26]. Third, the proposed joint estimator (JQR) uses the Gaussian matrix B presented above. Quantile regression with multi-task feature learning (MTFL), as proposed in [27], is also included. For a fair comparison, each point is mapped with $\psi(x) = (k(x, x_1), \dots, k(x, x_n))$ and the estimator $h(x) = W^\top \psi(x) + b$ ($W \in \mathbb{R}^{n \times p}$) is learned jointly with the PSD matrix $D \in \mathbb{R}^{n \times n}$ of the regularizer $\psi(h) = \operatorname{tr}(W^\top D^{-1} W)$. This comes down to alternating our approach (with $B = I$ and $k(\cdot, \cdot) = \langle \cdot \mid D\, \cdot \rangle_{\ell_2}$) and the update $D \leftarrow (W W^\top)^{1/2} / \operatorname{tr}((W W^\top)^{1/2})$.

To present an honorable comparison of these four methods, we did not choose datasets for the benefit of our method but considered the ones used in [26]. These 20 datasets (whose names are indicated in Table 1) come from the UCI repository and three R packages: quantreg, alr3 and MASS. The sample sizes vary from 38 (Cobar Ore) to 1375 (heights) and the numbers of explanatory variables vary from 1 (5 sets) to 12 (Boston Housing). The datasets were standardized coordinate-wise to have zero mean and unit variance.

Results are given in Table 1 as the mean and standard deviation of the test losses recorded on 20 random train-test splits with ratio 0.7-0.3. The best result of each line is boldfaced and the bullets indicate significant differences of each competitor from JQR (based on a Wilcoxon signed-rank test with significance level 0.05). The parameter C is chosen by cross-validation (minimizing the pinball loss) inside a logarithmic grid ($10^{-5}, 10^{-4}, \dots, 10^{5}$) for all methods and datasets. For our approach (JQR), the parameter γ is chosen in the same grid as C, with extra candidates 0 and $+\infty$. Finally, for a balanced comparison, the dual optimization problems corresponding to each approach are solved with CVXOPT [2].

Regarding the pinball loss, joint quantile regression compares favorably to independent estimation and to hard non-crossing constraints for 12 vs 8 datasets (5 vs 1 significantly different). These results bear out the assumption concerning the relationship between conditional quantiles and the usefulness of multiple-output methods for quantile regression. Prediction is also enhanced compared to MTFL for 15 vs 5 datasets (11 vs 1 significantly different).
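As a reminder before discussing the second criterion, here is a small sketch (not from the paper) of the crossing loss, written for quantile levels sorted in increasing order as in the experiments (τ = (0.1, …, 0.9)), so that column j of the predictions should stay below column j + 1:

```python
import numpy as np

def crossing_loss(Y_pred):
    """Crossing loss: sum over adjacent quantile levels of the average violation amount.

    Y_pred: (n, p) array of predicted quantiles, columns ordered by increasing level;
    a crossing occurs whenever a lower-level curve exceeds the next higher-level one.
    """
    violations = np.maximum(0.0, Y_pred[:, :-1] - Y_pred[:, 1:])
    return np.sum(np.mean(violations, axis=0))
```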
The crossing loss clearly shows that joint regression weakens the crossing problem, in comparison to independent estimation and hard non-crossing constraints (18 vs 1 favorable datasets and 9 vs 0 significantly different). Results are similar compared to MTFL (16 vs 3, 12 vs 1). Note that for IND. (NC), the crossing loss is null on the training data by construction, but not necessarily on the test data. In addition, let us remark that model selection (in particular for γ, which tunes the trade-off between heteroscedastic and homoscedastic regressors) has been performed based on the pinball loss only. It seems that, in a way, the pinball loss embraces the crossing loss as a sub-criterion.

6.2 Learning algorithms

This section is aimed at comparing the running (CPU) times of three implementations of algorithms for estimating joint quantile regressors (solving Problem (2)). First, the off-the-shelf solver (based on an interior-point method) included in CVXOPT [2] (QP) is applied to Problem (2) turned into a standard form of linearly constrained quadratic program. Second, an augmented Lagrangian scheme (AUG. LAG.) is used in order to get rid of the linear constraints and make it possible to use a coordinate descent approach (detailed procedure in Appendix C). In this scheme, the inner solver is Algorithm 1 with the intercept dismissed, which boils down to the algorithm proposed in [23]. The last approach (PDCD) is Algorithm 1.

We use a synthetic dataset (the same as in Figure 1), for which $X \in [0, 1.5]$. The target Y is computed as a sine curve at 1 Hz, modulated by a sine envelope at 1/3 Hz with mean 1. Moreover, this pattern is distorted with a centered Gaussian noise whose standard deviation decreases linearly from 1.2 at X = 0 to 0.2 at X = 1.5. The model parameters are $(C, \gamma) = (10^2, 10^{-2})$.

To compare the implementations of the three algorithms, we first run QP, with a relative tolerance set to $10^{-2}$, and store the optimal objective value. Then, the two other methods (AUG. LAG. and PDCD) are launched and stopped when they pass the objective value reached by QP (optimal objective values are reported in Appendix D.2). Table 2 gives the mean and standard deviation of the CPU time required by each method over 10 random datasets and several sample sizes. Some statistics are missing because AUG. LAG. ran out of time.

Table 2: CPU time (s) for training a model.

| Size | QP | AUG. LAG. | PDCD |
|---|---|---|---|
| 250 | 8.73 ± 0.34 | 261.11 ± 46.69 | 18.69 ± 3.54 |
| 500 | 75.53 ± 2.98 | 865.86 ± 92.26 | 61.30 ± 7.05 |
| 1000 | 621.60 ± 30.37 | n/a | 266.50 ± 41.16 |
| 2000 | 3416.55 ± 104.41 | n/a | 958.93 ± 107.80 |

As expected, for a not too tight tolerance and large datasets, the implementation of Algorithm 1 outperforms the two other competitors. Let us remark that QP is also more expensive in memory than coordinate-based algorithms like ours. Moreover, the training time may seem high in comparison to usual SVMs. However, let us first recall that we jointly learn p regressors: a fair comparison should thus be made with an SVM applied to an $np \times np$ matrix instead of an $n \times n$ one. In addition, there is no sample sparsity in quantile regression, whereas sparsity does speed up SVM training.
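For reproducibility, a plausible generator for the synthetic dataset described above could look as follows (a sketch only: the exact distribution of X and the phase of the envelope are assumptions, not specified in the paper):

```python
import numpy as np

def synthetic_data(n, rng):
    """Sine at 1 Hz modulated by a sine envelope at 1/3 Hz with mean 1, plus heteroscedastic noise."""
    x = rng.uniform(0.0, 1.5, size=n)                # assumed: X uniform on [0, 1.5]
    envelope = 1.0 + np.sin(2.0 * np.pi * x / 3.0)   # sine envelope at 1/3 Hz, mean 1
    signal = envelope * np.sin(2.0 * np.pi * x)      # carrier at 1 Hz
    sigma = 1.2 + (0.2 - 1.2) * x / 1.5              # noise std decreasing linearly from 1.2 to 0.2
    y = signal + sigma * rng.standard_normal(n)
    return x, y

x, y = synthetic_data(1000, np.random.default_rng(0))
```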
Last but not least, in order to illustrate the use of our algorithm, we have run it on two 2000-point datasets from economics and medicine: the U.S. 2000 Census data, consisting of the annual salary and 9 related features on workers, and the 2014 National Center for Health Statistics data, regarding girl birth weight and 16 statistics on parents (data available at www.census.gov/census2000/PUMS5.html and www.nber.org/data/vital-statistics-natality-data.html). The parameters (C, γ) have been set to (1, 100) and (0.1, 1) respectively for the Census and NCHS datasets (determined by cross-validation).

Figure 2 depicts 9 estimated conditional quantiles of the salary with respect to education (17 levels, from no schooling completed to doctorate degree) and of the birth weight (in grams) with respect to the mother's pre-pregnancy weight (in pounds). As expected, the Census data reveal an increasing and heteroscedastic trend, while the newborn's weight does not seem correlated to the mother's weight.

Figure 2: Estimated conditional quantiles for the Census data (left, salary vs education) and the NCHS data (right, birth weight vs mother's pre-pregnancy weight).

7 Conclusion

This paper introduces a novel framework for joint quantile regression, based on vector-valued RKHSs. It comes with theoretical guarantees and an efficient learning algorithm. Moreover, this methodology, which keeps the quantile property, enjoys few curve crossings and enhanced performances compared to independent estimations and hard non-crossing constraints. Going forward, let us remark that this framework benefits from all the tools now associated with vector-valued RKHSs, such as manifold learning for the semi-supervised setting, multiple kernel learning for measuring feature importance, and random Fourier features for very large scale applications. Moreover, extensions of our methodology to multivariate output variables are to be investigated, given that this requires choosing among the various definitions of multivariate quantiles.

Acknowledgments

This work was supported by the industrial chair "Machine Learning for Big Data".

References

[1] M.A. Alvarez, L. Rosasco, and N.D. Lawrence. Kernels for Vector-Valued Functions: a Review. Foundations and Trends in Machine Learning, 4(3):195-266, 2012. arXiv:1106.6251.
[2] M.S. Anderson, J. Dahl, and L. Vandenberghe. CVXOPT: A Python package for convex optimization, version 1.1.5, 2012.
[3] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
[4] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.
[5] H.D. Bondell, B.J. Reich, and H. Wang. Noncrossing quantile regression curve estimation. Biometrika, 97(4):825-838, 2010.
[6] C. Brouard, F. d'Alché-Buc, and M. Szafranski. Semi-supervised Penalized Output Kernel Regression for Link Prediction. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[7] C. Brouard, M. Szafranski, and F. d'Alché-Buc. Input Output Kernel Regression: Supervised and Semi-Supervised Structured Output Prediction with Operator-Valued Kernels. Journal of Machine Learning Research, 17(176):1-48, 2016.
[8] V. Chernozhukov, I. Fernández-Val, and A. Galichon. Quantile and Probability Curves Without Crossing. Econometrica, 78(3):1093-1125, 2010.
[9] H. Dette and S. Volgushev. Non-crossing non-parametric estimates of quantile curves. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3):609-627, 2008.
[10] F. Dinuzzo, C.S. Ong, P. Gehler, and G. Pillonetto. Learning Output Kernels with Block Coordinate Descent. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[11] O. Fercoq and P. Bianchi. A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non Separable Functions. arXiv:1508.04625 [math], 2015.
[12] M. Hallin and M. Šiman. Elliptical multiple-output quantile regression and convex optimization. Statistics & Probability Letters, 109:232-237, 2016.
[13] X. He. Quantile Curves without Crossing. The American Statistician, 51(2):186-192, 1997.
[14] H. Kadri, E. Duflos, P. Preux, S. Canu, A. Rakotomamonjy, and J. Audiffren. Operator-valued Kernels for Learning from Functional Response Data. Journal of Machine Learning Research, 16:1-54, 2015.
[15] R. Koenker. Quantile Regression. Cambridge University Press, Cambridge, New York, 2005.
[16] R. Koenker and G. Bassett. Regression Quantiles. Econometrica, 46(1):33-50, 1978.
[17] Y. Li, Y. Liu, and J. Zhu. Quantile Regression in Reproducing Kernel Hilbert Spaces. Journal of the American Statistical Association, 102(477):255-268, 2007.
[18] Y. Liu and Y. Wu. Simultaneous multiple non-crossing quantile regression estimation using kernel constraints. Journal of Nonparametric Statistics, 23(2):415-437, 2011.
[19] A. Maurer. A vector-contraction inequality for Rademacher complexities. In Proceedings of the 27th International Conference on Algorithmic Learning Theory, 2016.
[20] C.A. Micchelli and M.A. Pontil. On Learning Vector-Valued Functions. Neural Computation, 17:177-204, 2005.
[21] H.Q. Minh, L. Bazzani, and V. Murino. A Unifying Framework in Vector-valued Reproducing Kernel Hilbert Spaces for Manifold Regularization and Co-Regularized Multi-view Learning. Journal of Machine Learning Research, 17(25):1-72, 2016.
[22] S.K. Schnabel and P.H.C. Eilers. Simultaneous estimation of quantile curves using quantile sheets. AStA Advances in Statistical Analysis, 97(1):77-87, 2012.
[23] S. Shalev-Shwartz and T. Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. Journal of Machine Learning Research, 14:567-599, 2013.
[24] V. Sindhwani, M.H. Quang, and A.C. Lozano. Scalable Matrix-valued Kernel Learning for High-dimensional Nonlinear Multivariate Regression and Granger Causality. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 2013.
[25] I. Takeuchi and T. Furuhashi. Non-crossing quantile regressions by SVM. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, July 2004.
[26] I. Takeuchi, Q.V. Le, T.D. Sears, and A.J. Smola. Nonparametric Quantile Estimation. Journal of Machine Learning Research, 7:1231-1264, 2006.
[27] I. Takeuchi, T. Hongo, M. Sugiyama, and S. Nakajima. Parametric Task Learning. In Advances in Neural Information Processing Systems 26, pages 1358-1366. Curran Associates, Inc., 2013.
[28] Y. Wu and Y. Liu. Stepwise multiple quantile regression estimation using non-crossing constraints. Statistics and Its Interface, 2:299-310, 2009.