# Hard Shape-Constrained Kernel Machines

Pierre-Cyril Aubin-Frankowski, École des Ponts ParisTech and CAS MINES ParisTech, PSL, Paris, 75006, France. pierre-cyril.aubin@mines-paristech.fr
Zoltán Szabó, Center of Applied Mathematics, CNRS, École Polytechnique, Institut Polytechnique de Paris, Route de Saclay, Palaiseau, 91128, France. zoltan.szabo@polytechnique.edu

Abstract: Shape constraints (such as non-negativity, monotonicity, convexity) play a central role in a large number of applications, as they usually improve performance for small sample size and help interpretability. However, enforcing these shape requirements in a hard fashion is an extremely challenging problem. Classically, this task is tackled (i) in a soft way (without out-of-sample guarantees), (ii) by specialized transformations of the variables on a case-by-case basis, or (iii) by using highly restricted function classes, such as polynomials or polynomial splines. In this paper, we prove that hard affine shape constraints on function derivatives can be encoded in kernel machines, which represent one of the most flexible and powerful tools in machine learning and statistics. Particularly, we present a tightened second-order cone constrained reformulation that can be readily implemented in convex solvers. We prove performance guarantees on the solution, and demonstrate the efficiency of the approach in joint quantile regression with applications to economics and to the analysis of aircraft trajectories, among others.

1 Introduction

Shape constraints (such as non-negativity, monotonicity, convexity) are omnipresent in data science, with numerous successful applications in statistics, economics, biology, finance, game theory, reinforcement learning and control problems. For example, in biology, monotone regression techniques have been applied to identify genome interactions (Luss et al., 2012) and in dose-response studies (Hu et al., 2005). Economic theory dictates that utility functions are increasing and concave (Matzkin, 1991), demand functions of normal goods are downward sloping (Lewbel, 2010; Blundell et al., 2012), and production functions are concave (Varian, 1984) or S-shaped (Yagi et al., 2020). Moreover, cyclic monotonicity has recently turned out to be beneficial in panel multinomial choice problems (Shi et al., 2018), and most link functions used in single index models are monotone (Li and Racine, 2007; Chen and Samworth, 2016; Balabdaoui et al., 2019). Meanwhile, supermodularity is a common assumption in supply chain models, stochastic multi-period inventory problems, pricing models and game theory (Topkis, 1998; Simchi-Levi et al., 2014). In finance, European and American call option prices are convex and monotone in the underlying stock price and increasing in volatility (Aït-Sahalia and Duarte, 2003). In statistics, the conditional quantile function is increasing w.r.t. the quantile level. In reinforcement learning and in stochastic optimization, value functions are regularly supposed to be convex (Keshavarz et al., 2011; Shapiro et al., 2014). More examples can be found in recent surveys on shape-constrained regression (Johnson and Jiang, 2018; Guntuboyina and Sen, 2018; Chetverikov et al., 2018).
Leveraging prior knowledge expressed in terms of shape structures has several practical benefits: the resulting techniques allow for estimation with smaller sample size, and the imposed shape constraints help interpretability. Despite the numerous practical advantages, the construction of shape-constrained estimators can be quite challenging. Existing techniques typically impose the shape constraints (i) in a soft fashion, as a regularizer or at finitely many points (Delecroix et al., 1996; Blundell et al., 2012; Aybat and Wang, 2014; Wu et al., 2015; Takeuchi et al., 2006; Sangnier et al., 2016; Chen and Samworth, 2016; Agrell, 2019; Mazumder et al., 2019; Koppel et al., 2019; Han et al., 2019; Yagi et al., 2020), (ii) through constraint-specific transformations of the variables such as quadratic reparameterization (Flaxman et al., 2017), positive semi-definite quadratic forms (Bagnell and Farahmand, 2015), or integrated exponential functions (Wu and Sickles, 2018), or (iii) make use of highly restricted function classes such as classical polynomials (Hall, 2018) or polynomial splines (Turlach, 2005; Papp and Alizadeh, 2014; Pya and Wood, 2015; Wu and Sickles, 2018; Meyer, 2018; Koppel et al., 2019). Both the polynomial and spline-based shape-constrained techniques rely heavily on the underlying algebraic structure of these bases, without direct extension to more general function families. From a statistical viewpoint, the main focus has been on the design of estimators with uniform guarantees (Horowitz and Lee, 2017; Freyberger and Reeves, 2018). Several existing methods have been analyzed from this perspective and were shown to be (uniformly) consistent, on a case-by-case basis and when handling specific shape constraints (Wu et al., 2015; Chen and Samworth, 2016; Han and Wellner, 2016; Mazumder et al., 2019; Koppel et al., 2019; Han et al., 2019; Yagi et al., 2020). While these asymptotic results are of significant theoretical interest, applying shape priors is generally beneficial in the small sample regime.

In this paper we propose a flexible optimization framework allowing multiple shape constraints to be jointly handled in a hard fashion. In addition, to address the bottlenecks of restricted shape priors and function families, we consider general affine constraints on derivatives, and use reproducing kernel Hilbert spaces (RKHS) as the hypothesis space. RKHSs (also called abstract splines; Aronszajn, 1950; Wahba, 1990; Berlinet and Thomas-Agnan, 2004; Wang, 2011) significantly increase the richness and modelling power of classical polynomial splines. Indeed, the resulting function family can be rich enough, for instance, (i) to encode probability distributions without loss of information (Fukumizu et al., 2008; Sriperumbudur et al., 2010), (ii) to characterize statistical independence of random variables (Bach and Jordan, 2002; Szabó and Sriperumbudur, 2018), or (iii) to approximate various function classes arbitrarily well (Steinwart, 2001; Micchelli et al., 2006; Carmeli et al., 2010; Sriperumbudur et al., 2011; Simon-Gabriel and Schölkopf, 2018), including the space of bounded continuous functions. An excellent overview on kernels and RKHSs is given by Hofmann et al. (2008); Steinwart and Christmann (2008); Saitoh and Sawano (2016). In this paper we incorporate into this flexible RKHS function class the possibility to impose hard linear shape requirements on derivatives, i.e.
constraints of the form

$$0 \le b + Df(x), \quad \forall x \in K, \qquad (1)$$

for a bias $b \in \mathbb{R}$, a differential operator $D = \sum_j \gamma_j \partial^{r_j}$, where $\partial^r f(x) = \frac{\partial^{|r|} f(x)}{\partial x_1^{r_1} \cdots \partial x_d^{r_d}}$, and a compact set $K \subset \mathbb{R}^d$. The fundamental technical challenge is to guarantee the pointwise inequality (1) at all points of the (typically non-finite) set $K$. We show that one can tighten the infinite number of affine constraints (1) into a finite number of second-order cone constraints

$$\eta \|f\|_k \le b + Df(x_m), \quad \forall m \in \{1, \dots, M\}, \qquad (2)$$

for a suitable choice of $\eta > 0$ and $\{x_m\}_{m \in [M]} \subset K$. Our contributions can be summarized as follows.

1. We show that hard shape requirements can be embedded in kernel machines by taking a second-order cone (SOC) tightening of constraint (1), which can be readily tackled by convex solvers. Our formulation builds upon the reproducing property for kernel derivatives and on coverings of compact sets.
2. We prove error bounds on the distance between the solutions of the strengthened and original problems.
3. We achieve state-of-the-art performance in joint quantile regression (JQR) in RKHSs. We also combine JQR with other shape constraints in economics and in the analysis of aircraft trajectories.

The paper is structured as follows. Section 2 is about problem formulation. Our main result is presented in Section 3. Numerical illustrations are given in Section 4. Proofs and additional examples are provided in the supplement.

2 Problem formulation

In this section we formulate our problem after introducing some notations, which the reader may skip at first and return to if necessary.

Notations: Let $\mathbb{N} := \{0, 1, \dots\}$, $\mathbb{N}^* := \{1, 2, \dots\}$ and $\mathbb{R}_+$ denote the set of natural numbers, positive integers and non-negative real numbers, respectively. We use the shorthand $[n] := \{1, \dots, n\}$. The $p$-norm of a vector $v \in \mathbb{R}^d$ is $\|v\|_p = \big(\sum_{j \in [d]} |v_j|^p\big)^{1/p}$ ($1 \le p < \infty$); $\|v\|_\infty = \max_{j \in [d]} |v_j|$. The $j$-th canonical basis vector is $e_j$; $\mathbf{0}_d \in \mathbb{R}^d$ is the zero vector. Let $B_{\|\cdot\|}(c, r) = \{x \in \mathbb{R}^d : \|x - c\| \le r\}$ be the closed ball in $\mathbb{R}^d$ with center $c$ and radius $r$ in norm $\|\cdot\|$. Given a norm $\|\cdot\|$ and radius $\delta > 0$, a $\delta$-net of a compact set $K \subset \mathbb{R}^d$ consists of a set of points $\{x_m\}_{m \in [M]}$ such that $K \subseteq \cup_{m \in [M]} B_{\|\cdot\|}(x_m, \delta)$; in other words, $\big(B_{\|\cdot\|}(x_m, \delta)\big)_{m \in [M]}$ forms a covering of $K$. The identity matrix is $I$. For a matrix $M \in \mathbb{R}^{d_1 \times d_2}$, $M^\top \in \mathbb{R}^{d_2 \times d_1}$ denotes its transpose, and its operator norm is $\|M\| = \sup_{x \in \mathbb{R}^{d_2}: \|x\|_2 = 1} \|Mx\|_2$. The inverse of a non-singular matrix $M \in \mathbb{R}^{d \times d}$ is $M^{-1} \in \mathbb{R}^{d \times d}$. The concatenation of matrices $M_1 \in \mathbb{R}^{d_1 \times d}, \dots, M_N \in \mathbb{R}^{d_N \times d}$ is denoted by $M = [M_1; \dots; M_N] \in \mathbb{R}^{(\sum_{n \in [N]} d_n) \times d}$. Let $X$ be an open subset of $\mathbb{R}^d$ with a real-valued kernel $k : X \times X \to \mathbb{R}$ and associated reproducing kernel Hilbert space (RKHS) $F_k$. The Hilbert space $F_k$ is characterized by $f(x) = \langle f, k(x, \cdot) \rangle_k$ ($\forall x \in X$, $\forall f \in F_k$) and $k(x, \cdot) \in F_k$ ($\forall x \in X$), where $\langle \cdot, \cdot \rangle_k$ stands for the inner product in $F_k$ and $k(x, \cdot)$ denotes the function $y \in X \mapsto k(x, y) \in \mathbb{R}$. The first property is called the reproducing property; the second one describes a generating family of $F_k$. The norm on $F_k$ is written as $\|\cdot\|_k$. For a multi-index $r \in \mathbb{N}^d$, the $r$-th order partial derivative of a function $f$ is denoted by $\partial^r f(x) = \frac{\partial^{|r|} f(x)}{\partial x_1^{r_1} \cdots \partial x_d^{r_d}}$, where $|r| = \sum_{j \in [d]} r_j$ is the length of $r$. When $d = 1$, the notation $f^{(n)} = \partial^n f$ is applied; specifically $f''$ and $f'$ are used in case of $n = 2$ and $n = 1$. Given $s \in \mathbb{N}$, let $C^s(X)$ be the set of real-valued functions on $X$ with continuous derivatives up to order $s$ (i.e., $\partial^r f \in C(X) := C^0(X)$ when $|r| \le s$). Let $I \in \mathbb{N}^*$. Given sets $(A_i)_{i \in [I]}$, let $\prod_{i \in [I]} A_i$ denote their Cartesian product; we write $A^I$ in case of $A = A_1 = \dots = A_I$.
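The $\delta$-nets just defined are the geometric ingredient used later to cover the constraint sets. As a minimal sketch (ours, not part of the original paper; the helper name and the grid construction are our own illustration), here is how a $\delta$-net of an axis-aligned box $K = \prod_{r \in [d]} [l_r, u_r]$ can be built with $\infty$-norm balls, i.e. a regular grid of cubes, in Python:

```python
import numpy as np
from itertools import product

def delta_net_box(lower, upper, delta):
    """delta-net of the box prod_r [lower_r, upper_r] in the sup-norm:
    grid points whose cubes of half-width delta cover the whole box."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    axes = []
    for l, u in zip(lower, upper):
        n = max(1, int(np.ceil((u - l) / (2 * delta))))  # cubes needed along this axis
        # centers of n segments of length (u - l)/n <= 2*delta each
        axes.append(l + (u - l) * (np.arange(n) + 0.5) / n)
    return np.array(list(product(*axes)))  # shape (M, d)

# example: cover [0, 1]^2 with cubes of half-width 0.1 (M = 25 centers)
centers = delta_net_box([0, 0], [1, 1], 0.1)
print(centers.shape)
```

In Section 4 such coverings are taken over the smallest rectangle containing the samples, possibly recycling the sample points themselves as centers.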
Our goal is to solve hard shape-constrained kernel machines of the form

$$(\bar f, \bar b) = \arg\min_{f = (f_q)_{q \in [Q]} \in (F_k)^Q,\ b = (b_p)_{p \in [P]} \in B,\ (f, b) \in C} \mathcal{L}(f, b), \qquad \text{(P)}$$

where we are given an objective function $\mathcal{L}$ and a constraint set $C$ (detailed below), a closed convex constraint set $B \subseteq \mathbb{R}^P$ on the biases, an order $s \in \mathbb{N}$, an open set $X \subseteq \mathbb{R}^d$ with a kernel $k \in C^s(X \times X)$ and associated RKHS $F_k$, and samples $S = \{(x_n, y_n)\}_{n \in [N]} \subset X \times \mathbb{R}$. The objective function in (P) is specified by the triplet $(S, L, \Omega)$:

$$\mathcal{L}(f, b) = L\Big(b, \big(x_n, y_n, (f_q(x_n))_{q \in [Q]}\big)_{n \in [N]}\Big) + \Omega\big((\|f_q\|_k)_{q \in [Q]}\big),$$

with loss function $L : \mathbb{R}^P \times (X \times \mathbb{R} \times \mathbb{R}^Q)^N \to \mathbb{R}$ and regularizer $\Omega : (\mathbb{R}_+)^Q \to \mathbb{R}$. Notice that the objective $\mathcal{L}$ depends on the samples $S$, which are assumed to be fixed; hence our proposed optimization framework focuses on the empirical risk. The bias $b \in \mathbb{R}^P$ can be both constraint-related (such as $f(x) \ge b_1$, $f'(x) \ge b_2$) and variable-related ($f_q + b_q$, see (4)-(5) later). The restriction of $L$ to $B$ is assumed to be strictly convex in $b$, and $\Omega$ is supposed to be strictly increasing in each of its arguments, to ensure the uniqueness of minimizers of (P). The $I \in \mathbb{N}^*$ hard shape requirements in (P) take the form (see footnote 1)

$$C = \{(f, b) \mid (b_0 - Ub)_i \le D_i(Wf - f_0)_i(x),\ \forall x \in K_i,\ \forall i \in [I]\}, \qquad \text{(C)}$$

i.e., (C) encodes affine constraints on derivatives of order at most $s$ ($D_i = \sum_{j \in [n_{i,j}]} \gamma_{i,j} \partial^{r_{i,j}}$, $|r_{i,j}| \le s$, $\gamma_{i,j} \in \mathbb{R}$). Possible shifts are expressed by the terms $b_0 = (b_{0,i})_{i \in [I]} \in \mathbb{R}^I$ and $f_0 = (f_{0,i})_{i \in [I]} \in (F_k)^I$. The matrices $U \in \mathbb{R}^{I \times P}$ and $W \in \mathbb{R}^{I \times Q}$ capture the potential interactions within the bias variables $(b_p)_{p \in [P]}$ and functions $(f_q)_{q \in [Q]}$, respectively. The compact sets $K_i \subset X$ ($i \in [I]$) define the domain where the constraints are imposed.

Footnote 1: In constraint (C), $Wf$ is meant as a formal matrix-vector product: $(Wf)_i = \sum_{q \in [Q]} W_{i,q} f_q$.

Differential operators: As $X \subseteq \mathbb{R}^d$ is open and $k \in C^s(X \times X)$, any differential operator $D_i$ of order at most $s$ is well defined (Saitoh and Sawano, 2016, Theorems 2.5 and 2.6, page 76) as a mapping from $F_k$ to $C(X)$. Since the coefficients $\{\gamma_{i,j}\}_j$ of the $D_i$-s belong to the whole $\mathbb{R}$, (C) can cover inequality constraints in both directions.

Bias constraint $B$: Choosing $B = \{\mathbf{0}_P\}$ leads to a constant l.h.s. $b_0$ in (C). The other extreme is $B = \mathbb{R}^P$, in which case no hard constraint is imposed on the bias variable $b$.

Compactness of the $K_i$-s: The compactness assumption on the sets $K_i$ is exploited in the construction of our strengthened optimization problem (Section 3). This requirement also ensures that restrictions are not imposed too far from the observation points, which could be impossible to satisfy. Indeed, let us consider for instance a $c_0$-kernel $k$ on $\mathbb{R}$, i.e. such that $k(x, \cdot) \in C^0(\mathbb{R})$ for all $x$ and $\lim_{|y| \to \infty} k(x, y) = 0$ for all $x \in \mathbb{R}$ (as for the Gaussian kernel). In this case $\lim_{|y| \to \infty} f(y) = 0$ also holds for all $f \in F_k$. Hence a constraint of the form "for all $t \in \mathbb{R}_+$, $f(t) \ge \epsilon > 0$" cannot be satisfied for $f \in F_k$.

Assumption on $X$: If $s = 0$ (in other words, only function evaluations are present in the shape constraints), then $X$ can be any topological space.

We give various examples for the considered problem family (P). We start with an example where $Q = 1$.

Kernel ridge regression (KRR) with monotonicity constraint: In this case the objective function and constraint are

$$\mathcal{L}(f, b) := \frac{1}{N} \sum_{n \in [N]} |y_n - f(x_n)|^2 + \lambda_f \|f\|_k^2, \quad \text{s.t.} \quad f'(x) \ge 0,\ \forall x \in [x_l, x_u], \qquad (3)$$

with $\lambda_f > 0$. In other words, in (P) we have $Q = 1$, $d = 1$, $s = 1$, $P = I = 1$, $K_1 = [x_l, x_u]$, $\Omega(z) = \lambda_f z^2$, $D_1 = \partial^1$, $U = W = 1$, $f_{0,1} = 0$, $b_{0,1} = 0$, and $b \in B = \{0\}$ is a dummy variable.

Joint quantile regression (JQR; e.g. Sangnier et al., 2016): Given levels $0 < \tau_1 < \dots < \tau_Q < 1$, the goal is to estimate jointly the $(\tau_1, \dots, \tau_Q)$-quantiles of the conditional distribution $P(Y \mid X = x)$, where $Y$ is real-valued.
In this case the objective function is

$$\mathcal{L}(f, b) = \frac{1}{N} \sum_{n \in [N]} \sum_{q \in [Q]} l_{\tau_q}\big(y_n - [f_q(x_n) + b_q]\big) + \lambda_b \|b\|_2^2 + \lambda_f \sum_{q \in [Q]} \|f_q\|_k^2, \qquad (4)$$

where $\lambda_b > 0$, $\lambda_f > 0$ (see footnote 2), and the pinball loss is defined as $l_\tau(e) = \max(\tau e, (\tau - 1)e)$ with $\tau \in (0, 1)$. In JQR, the estimated $\tau_q$-quantile functions $\{f_q + b_q\}_{q \in [Q]}$ are not independent; the joint shape constraint they should satisfy is the monotonically increasing property w.r.t. the quantile level $\tau$. It is natural to impose this non-crossing requirement on the smallest rectangle containing the points $\{x_n\}_{n \in [N]}$, i.e. $K = \prod_{r \in [d]} \big[\min_{n \in [N]} (x_n)_r, \max_{n \in [N]} (x_n)_r\big]$. The corresponding shape constraint is

$$f_{q+1}(x) + b_{q+1} \ge f_q(x) + b_q, \quad \forall q \in [Q-1],\ \forall x \in K. \qquad (5)$$

One gets (4)-(5) from (P) by choosing $P = Q$, $I = Q - 1$, $s = 0$, $b_0 = \mathbf{0}$, $f_0 = \mathbf{0}$, $B = \mathbb{R}^P$, $K_i = K$ ($\forall i \in [I]$), $\Omega(z) = \lambda_f \sum_{q \in [Q]} (z_q)^2$, and

$$U = W = \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix}.$$

Footnote 2: Sangnier et al. (2016) considered the same loss but with a soft non-crossing-inducing regularizer inspired by matrix-valued kernels, and also set $\lambda_b = 0$.

Further examples: There are various other widely-used shape constraints beyond non-negativity (for which $s = 0$), monotonicity ($s = 1$) or convexity ($s = 2$) which can be taken into account in (C). For instance one can consider $n$-monotonicity ($s = n$), $(n-1)$-alternating monotonicity, monotonicity w.r.t. unordered weak majorization ($s = 1$) or w.r.t. product ordering ($s = 1$), or supermodularity ($s = 2$). For details on how these shape constraints can be written as (C), see the supplement (Section C).

3 Main result

In this section, we first present our strengthened SOC-constrained problem, followed by a representer theorem and explicit bounds on the distance to the solution of (P). In order to introduce our proposed tightening, let us first consider the discretization of the $I$ constraints using $M_i$ points $\{\tilde x_{i,m}\}_{m \in [M_i]} \subset K_i$. This would lead to the following relaxation of (P):

$$v_{\mathrm{disc}} = \min_{f \in (F_k)^Q,\ b \in B} \mathcal{L}(f, b) \quad \text{s.t.} \quad (b_0 - Ub)_i \le \min_{m \in [M_i]} D_i(Wf - f_0)_i(\tilde x_{i,m}),\ \forall i \in [I]. \qquad (6)$$

Our proposed SOC-constrained tightening can be thought of as adding extra, appropriately chosen $\eta_i$-buffers to the l.h.s. of the constraints:

$$(f_\eta, b_\eta) = \arg\min_{f \in (F_k)^Q,\ b \in B \subseteq \mathbb{R}^P} \mathcal{L}(f, b) \qquad \text{(P$_\eta$)}$$
$$\text{s.t.} \quad (b_0 - Ub)_i + \eta_i \|(Wf - f_0)_i\|_k \le \min_{m \in [M_i]} D_i(Wf - f_0)_i(\tilde x_{i,m}),\ \forall i \in [I]. \qquad \text{(C$_\eta$)}$$

The SOC constraint (C$_\eta$) is determined by a fixed coefficient vector $\eta = (\eta_i)_{i \in [I]} \in \mathbb{R}_+^I$ and by the points $\{\tilde x_{i,m}\}$ (see footnote 3). For each $i \in [I]$, the points $\{\tilde x_{i,m}\}_{m \in [M_i]}$ are chosen to form a $\delta_i$-net of the compact set $K_i$ for some $\delta_i > 0$ and a pre-specified norm $\|\cdot\|_X$ (see footnote 4). Given $\{\tilde x_{i,m}\}_{m \in [M_i]}$, the coefficients $\eta_i \in \mathbb{R}_+$ are then defined as

$$\eta_i = \sup_{m \in [M_i],\ u \in B_{\|\cdot\|_X}(0, 1)} \big\|D_{i,x} k(\tilde x_{i,m}, \cdot) - D_{i,x} k(\tilde x_{i,m} + \delta_i u, \cdot)\big\|_k, \qquad (8)$$

where $D_{i,x} k(x_0, \cdot)$ is a shorthand for $y \mapsto D_i(x \mapsto k(x, y))(x_0)$. Problem (P$_\eta$) has $I$ scalar SOC constraints (C$_\eta$) over infinite-dimensional variables.

Footnote 3: Constraint (C$_\eta$) is the precise meaning of the preliminary intuition given in (2).
Footnote 4: The existence of finite $\delta_i$-nets ($M_i < \infty$) stems from the compactness of the $K_i$-s. The flexibility in the choice of the norm $\|\cdot\|_X$ allows, for instance, using cubes by taking the 1-norm or the $\infty$-norm on $\mathbb{R}^d$ when covering the $K_i$-s.

Let $\bar v = \mathcal{L}(\bar f, \bar b)$ be the minimal value of (P) and $v_\eta = \mathcal{L}(f_\eta, b_\eta)$ be that of (P$_\eta$). Notice that, when formally setting $\eta = \mathbf{0}$, (P$_\eta$) corresponds to (6). In our main result below, (i) shows that (C$_\eta$) is indeed a tightening of (C), (ii) provides a representer theorem which allows one to solve (P$_\eta$) numerically, and (iii) gives bounds on the difference between the solution of (P$_\eta$) and that of (P), as a function of $(v_\eta - v_{\mathrm{disc}})$ and $\eta$, respectively.

Theorem (Tightened task, representer theorem, bounds). Let $f_\eta = (f_{\eta,q})_{q \in [Q]}$. Then,

(i) Tightening: any $(f, b)$ satisfying (C$_\eta$) also satisfies (C), hence $v_{\mathrm{disc}} \le \bar v \le v_\eta$.
(ii) Representer theorem: For all $q \in [Q]$, there exist real coefficients $\tilde a_{i,0,q}, \tilde a_{i,m,q}, a_{n,q} \in \mathbb{R}$ such that

$$f_{\eta,q} = \sum_{i \in [I]} \Big[ \tilde a_{i,0,q}\, f_{0,i} + \sum_{m \in [M_i]} \tilde a_{i,m,q}\, D_{i,x} k(\tilde x_{i,m}, \cdot) \Big] + \sum_{n \in [N]} a_{n,q}\, k(x_n, \cdot).$$

(iii) Performance guarantee: if $\mathcal{L}$ is $(\mu_{f_q}, \mu_b)$-strongly convex w.r.t. $(f_q, b)$ for any $q \in [Q]$, then

$$\|f_{\eta,q} - \bar f_q\|_k \le \sqrt{\frac{2(v_\eta - v_{\mathrm{disc}})}{\mu_{f_q}}}, \qquad \|b_\eta - \bar b\|_2 \le \sqrt{\frac{2(v_\eta - v_{\mathrm{disc}})}{\mu_b}}. \qquad (9)$$

If in addition $U$ is of full row-rank (i.e. surjective), $B = \mathbb{R}^P$, and $L(\bar f, \cdot)$ is $L_b$-Lipschitz continuous on $B_{\|\cdot\|_2}(\bar b, c_f \|\eta\|_\infty)$, where $c_f = \sqrt{I}\, \|(U U^\top)^{-1} U\| \max_{i \in [I]} \|(W \bar f - f_0)_i\|_k$, then

$$\|f_{\eta,q} - \bar f_q\|_k \le \sqrt{\frac{2 L_b c_f \|\eta\|_\infty}{\mu_{f_q}}}, \qquad \|b_\eta - \bar b\|_2 \le \sqrt{\frac{2 L_b c_f \|\eta\|_\infty}{\mu_b}}. \qquad (10)$$

Proof (idea): The SOC-based reformulation relies on rewriting the constraint (C) as the inclusion of the sets $\Phi_{D_i}(K_i)$ in the closed halfspaces $H^+_{\varphi_i, \beta_i} := \{g \in F_k \mid \langle \varphi_i, g \rangle_k \ge \beta_i\}$ for $i \in [I]$, where $\Phi_{D_i}(x) := D_{i,x} k(x, \cdot) \in F_k$, $\Phi_{D_i}(X) := \{\Phi_{D_i}(x) \mid x \in X\}$, $\varphi_i := (Wf - f_0)_i$ and $\beta_i := (b_0 - Ub)_i$. The tightening is obtained by guaranteeing these inclusions with an $\eta_i$-net of $\Phi_{D_i}(K_i)$ containing the $\delta_i$-net of $K_i$ when pushed to $F_k$. The bounds stem from classical inequalities for strongly convex objective functions. The proof details of (i)-(iii) are available in the supplement (Section A).

The representer theorem allows one to express (P$_\eta$) as a finite-dimensional SOC-constrained problem:

$$\min_{A \in \mathbb{R}^{N \times Q},\ \tilde A \in \mathbb{R}^{M \times Q},\ \tilde A_0 \in \mathbb{R}^{I \times Q},\ b \in B} \mathcal{L}(f, b) \quad \text{s.t.} \quad (b_0 - Ub)_i + \eta_i \|G^{1/2} g_i\|_2 \le \min_{m \in [M_i]} (G_{D_i} g_i)_{I+N+m},\ \forall i \in [I],$$

where $e_i \in \mathbb{R}^I$ and $\tilde e_i \in \mathbb{R}^{I+N+M}$ are the canonical basis vectors, $g_i := [\tilde A_0; A; \tilde A]\, W^\top e_i - \tilde e_i$, and the coefficients of the components of $f$ were collected into $\tilde A_0 = [\tilde a_{i,0,q}]_{i \in [I], q \in [Q]} \in \mathbb{R}^{I \times Q}$, $A = [a_{n,q}]_{n \in [N], q \in [Q]} \in \mathbb{R}^{N \times Q}$, $\tilde A = [\tilde a_{i,q}]_{i \in [I], q \in [Q]} \in \mathbb{R}^{M \times Q}$, with $M = \sum_{i \in [I]} M_i$ and $\tilde a_{i,q} = [\tilde a_{i,m,q}]_{m \in [M_i]} \in \mathbb{R}^{M_i}$ ($i \in [I]$, $q \in [Q]$). In this finite-dimensional optimization task, $G \in \mathbb{R}^{(I+N+M) \times (I+N+M)}$ is the Gram matrix of $\big(\{f_{0,i}\}_{i \in [I]}, \{k(x_n, \cdot)\}_{n \in [N]}, \{D_{i,x} k(\tilde x_{i,m}, \cdot)\}_{m \in [M_i], i \in [I]}\big)$, $G_{D_i} \in \mathbb{R}^{(I+N+M) \times (I+N+M)}$ is the Gram matrix of the differentials $D_i$ of these functions, and $G^{1/2}$ is the matrix square root of the positive semi-definite $G$.

The bounds (9)-(10) (see footnote 5) show that a smaller $\eta$ gives a tighter guarantee on the recovery of $\bar f$ and $\bar b$. Since $|\partial^r f_{\eta,q}(x) - \partial^r \bar f_q(x)| \le \sqrt{\partial^{r,r} k(x, x)}\, \|f_{\eta,q} - \bar f_q\|_k$ by the reproducing property and the Cauchy-Schwarz inequality, the bounds on $\|f_\eta - \bar f\|_k$ can be propagated to pointwise bounds on the derivatives (for $|r| \le s$). We emphasize again that in our optimization problem (P$_\eta$) the samples $S = \{(x_n, y_n)\}_{n \in [N]}$ are assumed to be fixed; in other words, the bounds (9) and (10) are meant conditioned on $S$.

Footnote 5: Notice that (9) is a computable bound, while (10) is not, as the latter depends on properties of the unknown solution of (P).

The parameters $M$, $\delta$ and $\eta$ are strongly intertwined; their interplay reveals an accuracy-computation tradeoff. Consider a shift-invariant kernel ($k(x, y) = k_0(x - y)$, $\forall x, y$); then (8) simplifies to

$$\eta_i := \sup_{u \in B_{\|\cdot\|_X}(0, 1)} \sqrt{\big|2 D_{i,x} D_{i,y} k_0(0) - 2 D_{i,x} D_{i,y} k_0(\delta_i u)\big|},$$

where $D_{i,y}$ is defined similarly to $D_{i,x}$ (see footnote 6). This expression of $\eta_i$ implies that whenever $D_{i,x} D_{i,y} k_0$ is $L_\delta$-Lipschitz (see footnote 7) on $B_{\|\cdot\|_X}(0, \delta_i)$, then $\eta_i \le \sqrt{2 L_\delta \delta_i}$. By the previous point, a smaller $\eta$ ensures a better recovery, which can be guaranteed by smaller $\delta_i$-s, which themselves correspond to a larger number of centers ($M_i$-s) in the $\delta_i$-nets of the $K_i$-s. Hence, one can control the computational complexity by the total number $M$ of points in the nets. Indeed, most SOCP solvers rely on primal-dual interior point methods which have (in the worst case) cubic complexity $O\big((P + N + M)^3\big)$ per iteration (Alizadeh and Goldfarb, 2003).

Footnote 6: A similar computation can be carried out for higher-order derivatives. For more general kernels, estimating the $\eta_i$-s can also be done by sampling $u$ uniformly in the unit ball.
Footnote 7: For instance, any $C^{s+1}$ kernel satisfies this local Lipschitz requirement.
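The simplification above can be evaluated directly. As a small numerical sketch (our own Python illustration, not from the paper; the function names are ours and we assume the Gaussian convention $k_0(t) = \exp(-t^2/(2\sigma^2))$), the following computes $\eta_i$ for the two constraint types used later: function-value constraints ($D_i = \mathrm{id}$, as in the non-crossing requirement) and first-derivative constraints in 1D ($D_i = \partial^1$, as in monotonicity). The grid over $[0, \delta]$ plays the role of the supremum in (8).

```python
import numpy as np

def eta_value_constraint(delta, sigma):
    """eta for D = identity and Gaussian k0(t) = exp(-t^2 / (2 sigma^2)):
    sup_{|t| <= delta} sqrt(2 k0(0) - 2 k0(t)), attained at |t| = delta."""
    return np.sqrt(2.0 - 2.0 * np.exp(-delta**2 / (2.0 * sigma**2)))

def eta_first_derivative(delta, sigma, n_grid=1000):
    """eta for D = d/dx in 1D: sup_{|t| <= delta} sqrt(|2 h(0) - 2 h(t)|) with
    h(t) = d^2 k0 / (dx dy)(t) = (1/sigma^2 - t^2/sigma^4) exp(-t^2 / (2 sigma^2)).
    The supremum is approximated on a fine grid of t in [0, delta] (h is even)."""
    t = np.linspace(0.0, delta, n_grid)
    h = (1.0 / sigma**2 - t**2 / sigma**4) * np.exp(-t**2 / (2.0 * sigma**2))
    return np.sqrt(np.max(np.abs(2.0 * h[0] - 2.0 * h)))

# example: covering radius delta = 0.05, bandwidth sigma = 0.3
print(eta_value_constraint(0.05, 0.3), eta_first_derivative(0.05, 0.3))
```

As expected from the Lipschitz bound, both values shrink with $\delta$, so refining the covering (larger $M_i$) directly tightens the buffer in (C$_\eta$).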
Controlling $M$ allows one to tackle hard shape-constrained problems in moderate dimensions ($d$); for details see Section 4. In practice, to reduce the number of coefficients in $f_{\eta,q}$, it is beneficial to recycle the points $\{x_n\}_{n \in [N]}$ among the $M_i$ virtual centers whenever these points belong to a constraint set $K_i$. This simple trick was possible in all our numerical examples and kept the computational expense quite benign. The supplement (Section B) presents an example of the actual computational complexity observed.

While in this work we focused on the optimization problem (P), which contains solely infinite-dimensional SOC constraints (C$_\eta$), the proved (C$_\eta$) $\Rightarrow$ (C) implication can be of independent interest to tackle problems where other types of constraints are present (see footnote 8). For simplicity we formulated our result with uniform coverings ($\delta_i$, $\eta_i$, $i \in [I]$). However, we prove it for more general non-uniform coverings ($\delta_{i,m}$, $\eta_{i,m}$, $i \in [I]$, $m \in [M_i]$; see Section A). This can be beneficial for sets with complex geometry (e.g. star-shaped) or when recycling of the samples was used to obtain coverings (as the samples in $S$ have no reason to be equally spaced); we provide an example (in economics) using a non-uniform covering in Section 4.

Footnote 8: For example, having a unit integral is a natural additional requirement beyond non-negativity in density estimation, and it writes as a linear equality constraint over the coefficients of $f_{\eta,q}$.

In practice, since the convergence speed of SOCP solvers depends highly on the condition number of $G^{1/2}$, it is worth replacing $G^{1/2}$ with $(G + \epsilon_{\mathrm{tol}} I)^{1/2}$, setting a tolerance $\epsilon_{\mathrm{tol}} \approx 10^{-4}$. As $G + \epsilon_{\mathrm{tol}} I \succeq G$ (in the sense of positive semi-definite matrices), this regularization strengthens further the SOC constraint. Moreover, SOCP modeling frameworks (e.g. CVXPY or CVXGEN) suggest replacing quadratic penalties (see (4)) with the equivalent constraint forms $\sqrt{\sum_{q \in [Q]} \|f_q\|_k^2} \le \lambda_f$ and $\|b\|_2 \le \lambda_b$. This stems from their reliance on internal primal-dual interior point techniques.

Table 1: Joint quantile regression on 9 UCI datasets. Compared techniques: Primal-Dual Coordinate Descent (PDCD, Sangnier et al., 2016) and the presented SOC technique. Rows: benchmarks. 2nd column: dimension (d). 3rd column: sample number (N). 4th-5th columns: mean ± std of 100 × the pinball loss for PDCD and SOC; smaller is better.

| Dataset | d | N | PDCD | SOC |
|---|---|---|---|---|
| engel | 1 | 235 | 48 ± 8 | 53 ± 9 |
| GAGurine | 1 | 314 | 61 ± 7 | 65 ± 6 |
| geyser | 1 | 299 | 105 ± 7 | 108 ± 3 |
| mcycle | 1 | 133 | 66 ± 9 | 62 ± 5 |
| ftcollinssnow | 1 | 93 | 154 ± 16 | 148 ± 13 |
| CobarOre | 2 | 38 | 159 ± 24 | 151 ± 17 |
| topo | 2 | 52 | 69 ± 18 | 62 ± 14 |
| caution | 2 | 100 | 88 ± 17 | 98 ± 22 |
| ufc | 3 | 372 | 81 ± 4 | 87 ± 6 |

4 Numerical experiments

In this section we demonstrate the efficiency of the presented SOC technique to solve hard shape-constrained problems (see footnote 9). We focus on the task of joint quantile regression (JQR), where the conditional quantiles are encoded by the pinball loss (4) and the shape requirement to fulfill is the non-crossing property (5). The supplement (Section B) provides an additional illustration in kernel ridge regression (KRR, (3)) on the importance of enforcing hard shape constraints in case of increasing noise level.

Footnote 9: The code replicating our numerical experiments is available at https://github.com/PCAubin/Hard-Shape-Constraints-for-Kernels.
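To illustrate how (P$_\eta$) becomes a finite-dimensional SOCP in the simplest setting, the following self-contained sketch (ours, not from the paper) solves the monotone KRR problem (3) with a 1D Gaussian kernel on synthetic data, using CVXPY rather than the CVXGEN setup of the experiments. The representer basis mixes kernel evaluations at the samples and kernel derivatives at the covering centers, and each center contributes one SOC constraint of the form (C$_\eta$); all parameter values (N, M, sigma, lambda, the small ridge added to G) are our own illustrative choices.

```python
import numpy as np
import cvxpy as cp

# Gaussian kernel and its derivatives w.r.t. the first / both arguments
def k(a, b, s):    return np.exp(-(a - b) ** 2 / (2 * s ** 2))
def k10(a, b, s):  return -(a - b) / s ** 2 * k(a, b, s)                      # d k / d a
def k11(a, b, s):  return (1 / s ** 2 - (a - b) ** 2 / s ** 4) * k(a, b, s)   # d^2 k / (da db)

# synthetic data from an increasing curve (illustration only)
rng = np.random.default_rng(0)
N, sig, lam = 40, 0.3, 1e-4
x = np.sort(rng.uniform(0.0, 1.0, N))
y = x ** 2 + 0.05 * rng.standard_normal(N)

# delta-net of K = [0, 1] and the corresponding eta (grid approximation of (8))
M = 30
x_t = np.linspace(0.0, 1.0, M)
delta = 0.5 / (M - 1)
t = np.linspace(0.0, delta, 500)
h = (1 / sig ** 2 - t ** 2 / sig ** 4) * np.exp(-t ** 2 / (2 * sig ** 2))
eta = np.sqrt(np.max(np.abs(2 * h[0] - 2 * h)))

# Gram matrix of the basis (k(x_n,.), d/dx k(x_t_m,.)), built via the derivative
# reproducing property: <k(a,.), d k(b,.)> = k10(b, a), <d k(a,.), d k(b,.)> = k11(a, b)
G = np.zeros((N + M, N + M))
G[:N, :N] = k(x[:, None], x[None, :], sig)
G[:N, N:] = k10(x_t[None, :], x[:, None], sig)
G[N:, :N] = G[:N, N:].T
G[N:, N:] = k11(x_t[:, None], x_t[None, :], sig)
w, V = np.linalg.eigh(G + 1e-6 * np.eye(N + M))              # (G + eps I)^(1/2), as advised above
G_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

# finite-dimensional SOCP: f(x_n) = (G c)[n], f'(x_t_m) = (G c)[N + m], ||f||_k^2 ~ c' (G + eps I) c
c = cp.Variable(N + M)
f_x, fp_t = G[:N, :] @ c, G[N:, :] @ c
objective = cp.sum_squares(y - f_x) / N + lam * cp.sum_squares(G_half @ c)
constraints = [eta * cp.norm(G_half @ c, 2) <= fp_t[m] for m in range(M)]
cp.Problem(cp.Minimize(objective), constraints).solve()
print("optimal value:", objective.value)
```

By the tightening result (i) of the theorem, any solution of this finite SOCP satisfies $f'(x) \ge 0$ on the whole interval $[0, 1]$, not only at the $M$ covering centers.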
Experiment-1: We compare the performance of the proposed SOC technique on 9 UCI benchmark datasets with a state-of-the-art JQR solver relying on soft shape constraints.

Experiment-2: We augment the non-crossing constraint of JQR with monotonicity and concavity. Our two examples here belong to economics and to the analysis of aircraft trajectories.

In our experiments we used a Gaussian kernel with bandwidth $\sigma$, ridge regularization parameters $\lambda_f$ and $\lambda_b$ (or upper bounds $\lambda_f$ on $\sqrt{\sum_{q \in [Q]} \|f_q\|_k^2}$ and $\lambda_b$ on $\|b\|_2$). We learned jointly five quantile functions ($\tau_q \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$). We used CVXGEN (Mattingley and Boyd, 2012) to solve (P$_\eta$); the experiments took from seconds to a few minutes to run on an i7-CPU 16GB-RAM laptop.

In our first set of experiments we compared the efficiency of the proposed SOC approach with the PDCD technique (Sangnier et al., 2016), which minimizes the same loss (4) but with a soft non-crossing-encouraging regularizer. We considered 9 UCI benchmarks. Our datasets were selected with $d \in \{1, 2, 3\}$; to the best of our knowledge, none of the available JQR solvers is able to guarantee in a hard fashion the non-crossing property of the learned quantiles out of samples even in this case. Each dataset was split into training (70%) and test (30%) sets; the split and the experiment were repeated twenty times. For each split, we optimized the hyperparameters ($\sigma$, $\lambda_f$, $\lambda_b$) of SOC, searching over a grid to minimize the pinball loss through a 5-fold cross-validation on the training set (a code sketch of this selection step is given below). Particularly, the kernel bandwidth $\sigma$ was searched over the square roots of the deciles of the squared pairwise distances between the points $\{x_n\}_{n \in [N]}$. The upper bound $\lambda_f$ on $\sqrt{\sum_{q \in [Q]} \|f_q\|_k^2}$ was scanned in the log-scale interval $[-1, 2]$. The upper bound $\lambda_b$ on $\|b\|_2$ was kept fixed: $\lambda_b = 10 \max_{n \in [N]} |y_n|$. We then learned a model on the whole training set and evaluated it on the test set. The covering of $K = \prod_{r \in [d]} \big[\min_{n \in [N]} (x_n)_r, \max_{n \in [N]} (x_n)_r\big]$ was carried out with $\ell_2$-balls of radius $\delta$ chosen such that the number $M$ of added points was less than 100. This allowed for a rough covering while keeping the computation time for each run below one minute. Our results are summarized in Table 1. The table shows that while the proposed SOC method guarantees the shape constraint in a hard fashion, its performance is on par with the state-of-the-art soft JQR solver.

In our second set of experiments, we demonstrate the efficiency of the proposed SOC estimator on tasks with additional hard shape constraints. Our first example is drawn from economics; we focused on JQR for the engel dataset, applying the same parameter optimization as in the first experiment. In this benchmark, the $\{(x_n, y_n)\}_{n \in [N]} \subset \mathbb{R}^2$ pairs correspond to annual household income ($x_n$) and food expenditure ($y_n$), preprocessed to have zero mean and unit variance. Engel's law postulates a monotone increasing property of $y$ w.r.t. $x$, as well as concavity. We therefore constrained the quantile functions to be non-crossing, monotonically increasing and concave.
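For reference, the hyperparameter grid and validation score used in Experiment-1 (and reused here) can be sketched as follows. This is our own Python illustration, not the original code; the helper names are ours, and the number of points in the $\lambda_f$ grid is an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist

def pinball_loss(y_true, y_pred, tau):
    """Pinball loss l_tau(e) = max(tau * e, (tau - 1) * e), averaged over samples."""
    e = y_true - y_pred
    return np.mean(np.maximum(tau * e, (tau - 1.0) * e))

def bandwidth_grid(X):
    """Candidate Gaussian bandwidths: square roots of the deciles of the
    squared pairwise distances between the input points."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    sq_dists = pdist(X, metric="sqeuclidean")
    return np.sqrt(np.percentile(sq_dists, np.arange(10, 100, 10)))

# log-scale grid for the upper bound lambda_f on sqrt(sum_q ||f_q||_k^2);
# the number of grid points (here 7) is our own illustrative choice
lambda_f_grid = np.logspace(-1.0, 2.0, 7)
```

The cross-validation score averages the pinball loss over the five quantile levels and the validation folds, and the grid point with the smallest average is retained.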
The interval $K = \big[\min_{n \in [N]} x_n, \max_{n \in [N]} x_n\big]$ was covered with a non-uniform partition centered at the ordered centers $\{\tilde x_m\}_{m \in [M]}$, which included the original points $\{x_n\}_{n \in [N]}$ augmented with 15 virtual points. The radii were set to $\delta_{i,m} := |\tilde x_{m+1} - \tilde x_m| / 2$ ($m \in [M-1]$, $i \in [I]$). The estimates with or without concavity are available in Fig. 1. It is interesting to notice that the estimated curves can intersect outside of the interval $K$ (see Fig. 1(a)), and that the additional concavity constraint mitigates this intersection (see Fig. 1(b)).

Figure 1: Joint quantile regression on the engel dataset using the SOC technique. Solid lines: estimated conditional quantile functions with $\tau_q \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$, from bottom (dark green) to top (blue). Left plot: with non-crossing and increasing constraints. Right plot: with non-crossing, increasing and concavity constraints.

In our second example with extra shape constraints, we focused on the analysis of more than 300 aircraft trajectories (Nicol, 2013), which describe the radar-measured altitude ($y$) of aircraft flying between two cities (Paris and Toulouse) as a function of time ($x$). These trajectories were restricted to their takeoff phase (where the monotone increasing property should hold), giving rise to a total number of samples $N = 15657$. We imposed the non-crossing and monotonicity properties. The resulting SOC-based quantile function estimates describing the aircraft takeoffs are depicted in Fig. 2. The plot illustrates how the estimated quantile functions respect the hard shape constraints and shows where the aircraft trajectories concentrate under various levels of probability, defining a corridor of normal flight altitude values.

Figure 2: Joint quantile regression on aircraft takeoff trajectories (horizontal axis: time in s; vertical axis: altitude in units of 100 ft). Number of samples: N = 15657. Shape constraints: non-crossing and increasing constraints. Dashed lines: trajectory samples. Solid lines: estimated conditional quantile functions.

These experiments demonstrate the efficiency of the proposed SOC-based solution to hard shape-constrained kernel machines.

5 Broader impact

Shape constraints play a central role in economics, social sciences, biology, finance, game theory, reinforcement learning and control problems, as they enable more data-efficient computation and help interpretability. The proposed principled way of imposing hard shape constraints and the accompanying algorithmic solution are expected to have positive impact in the aforementioned areas. For instance, from a social perspective, the studied quantile regression application can help ensure that safety regulations are better met. The improved sample efficiency, however, might result in dropping production indices and reduced privacy due to more target-specific applications.

Acknowledgments and Disclosure of Funding

ZSz benefited from the support of the Europlace Institute of Finance and that of the Chair Stress Test, RISK Management and Financial Steering, led by the French École Polytechnique and its Foundation and sponsored by BNP Paribas.

References

Agrell, C. (2019). Gaussian processes with linear operator inequality constraints. Journal of Machine Learning Research, 20:1-36.
Aït-Sahalia, Y. and Duarte, J. (2003). Nonparametric option pricing under shape restrictions. Journal of Econometrics, 116(1-2):9-47.
Alizadeh, F. and Goldfarb, D. (2003). Second-order cone programming. Mathematical Programming, 95(1):3-51.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404.
Aybat, N. S. and Wang, Z. (2014). A parallel method for large scale convex regression problems. In Conference on Decision and Control (CDC), pages 5710-5717.
Bach, F. and Jordan, M. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48.
Bagnell, J. A. and Farahmand, A. (2015). Learning positive functions in a Hilbert space. NIPS Workshop on Optimization (OPT2015). (https://www.ri.cmu.edu/pub_files/2015/0/Kernel-SOS.pdf).
Balabdaoui, F., Durot, C., and Jankowski, H. (2019). Least squares estimation in the monotone single index model. Bernoulli, 25(4B):3276-3310.
Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer.
Blundell, R., Horowitz, J. L., and Parey, M. (2012). Measuring the price responsiveness of gasoline demand: economic shape restrictions and nonparametric demand estimation. Quantitative Economics, 3:29-51.
Carmeli, C., Vito, E. D., Toigo, A., and Umanitá, V. (2010). Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8:19-61.
Chatterjee, S., Guntuboyina, A., and Sen, B. (2015). On risk bounds in isotonic and other shape restricted regression problems. Annals of Statistics, 43(4):1774-1800.
Chen, Y. and Samworth, R. J. (2016). Generalized additive and index models with shape constraints. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(4):729-754.
Chetverikov, D., Santos, A., and Shaikh, A. M. (2018). The econometrics of shape restrictions. Annual Review of Economics, 10(1):31-63.
Delecroix, M., Simioni, M., and Thomas-Agnan, C. (1996). Functional estimation under shape constraints. Journal of Nonparametric Statistics, 6(1):69-89.
Fink, A. M. (1982). Kolmogorov-Landau inequalities for monotone functions. Journal of Mathematical Analysis and Applications, 90:251-258.
Flaxman, S., Teh, Y. W., and Sejdinovic, D. (2017). Poisson intensity estimation with reproducing kernels. Electronic Journal of Statistics, 11(2):5081-5104.
Freyberger, J. and Reeves, B. (2018). Inference under shape restrictions. Technical report, University of Wisconsin-Madison. (https://www.ssc.wisc.edu/~jfreyberger/Shape_Inference_Freyberger_Reeves.pdf).
Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems (NIPS), pages 489-496.
Guntuboyina, A. and Sen, B. (2018). Nonparametric shape-restricted regression. Statistical Science, 33(4):568-594.
Hall, G. (2018). Optimization over nonnegative and convex polynomials with and without semidefinite programming. PhD Thesis, Princeton University.
Han, Q., Wang, T., Chatterjee, S., and Samworth, R. J. (2019). Isotonic regression in general dimensions. Annals of Statistics, 47(5):2440-2471.
Han, Q. and Wellner, J. A. (2016). Multivariate convex regression: global risk bounds and adaptation. Technical report. (https://arxiv.org/abs/1601.06844).
Hofmann, T., Schölkopf, B., and Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, 36(3):1171-1220.
Horowitz, J. L. and Lee, S. (2017). Nonparametric estimation and inference under shape restrictions. Journal of Econometrics, 201:108-126.
Hu, J., Kapoor, M., Zhang, W., Hamilton, S. R., and Coombes, K. R. (2005). Analysis of dose-response effects on gene expression data with comparison of two microarray platforms. Bioinformatics, 21(17):3524-3529.
Johnson, A. L. and Jiang, D. R. (2018). Shape constraints in economics and operations research. Statistical Science, 33(4):527-546.
Keshavarz, A., Wang, Y., and Boyd, S. (2011). Imputing a convex objective function. In IEEE Multi-Conference on Systems and Control, pages 613-619.
Koppel, A., Zhang, K., Zhu, H., and Başar, T. (2019). Projected stochastic primal-dual method for constrained online learning with kernels. IEEE Transactions on Signal Processing, 67(10):2528-2542.
Lewbel, A. (2010). Shape-invariant demand functions. The Review of Economics and Statistics, 92(3):549-556.
Li, Q. and Racine, J. S. (2007). Nonparametric Econometrics. Princeton University Press.
Luss, R., Rosset, S., and Shahar, M. (2012). Efficient regularized isotonic regression with application to gene-gene interaction search. Annals of Applied Statistics, 6(1):253-283.
Malov, S. V. (2001). On finite-dimensional Archimedean copulas. In Asymptotic Methods in Probability and Statistics with Applications, pages 19-35.
Marshall, A. W., Olkin, I., and Arnold, B. C. (2011). Inequalities: Theory of Majorization and Its Applications. Springer.
Mattingley, J. and Boyd, S. (2012). CVXGEN: A code generator for embedded convex optimization. Optimization and Engineering, 12(1):1-27.
Matzkin, R. L. (1991). Semiparametric estimation of monotone and concave utility functions for polychotomous choice models. Econometrica, 59(5):1315-1327.
Mazumder, R., Choudhury, A., Iyengar, G., and Sen, B. (2019). A computational framework for multivariate convex regression and its variants. Journal of the American Statistical Association, 114(525):318-331.
McNeil, A. J. and Nešlehová, J. (2009). Multivariate Archimedean copulas, d-monotone functions and ℓ1-norm symmetric distributions. Annals of Statistics, 37(5B):3059-3097.
Meyer, M. C. (2018). A framework for estimation and inference in generalized additive models with shape and order restrictions. Statistical Science, 33(4):595-614.
Micchelli, C., Xu, Y., and Zhang, H. (2006). Universal kernels. Journal of Machine Learning Research, 7:2651-2667.
Nicol, F. (2013). Functional principal component analysis of aircraft trajectories. In International Conference on Interdisciplinary Science for Innovative Air Traffic Management (ISIATM).
Papp, D. and Alizadeh, F. (2014). Shape-constrained estimation using nonnegative splines. Journal of Computational and Graphical Statistics, 23(1):211-231.
Pya, N. and Wood, S. N. (2015). Shape constrained additive models. Statistics and Computing, 25:543-559.
Saitoh, S. and Sawano, Y. (2016). Theory of Reproducing Kernels and Applications. Springer Singapore.
Sangnier, M., Fercoq, O., and d'Alché-Buc, F. (2016). Joint quantile regression in vector-valued RKHSs. In Advances in Neural Information Processing Systems (NIPS), pages 3693-3701.
Shapiro, A., Dentcheva, D., and Ruszczynski, A. (2014). Lectures on Stochastic Programming: Modeling and Theory. SIAM - Society for Industrial and Applied Mathematics.
Shi, X., Shum, M., and Song, W. (2018). Estimating semi-parametric panel multinomial choice models using cyclic monotonicity. Econometrica, 86(2):737-761.
Simchi-Levi, D., Chen, X., and Bramel, J. (2014). The Logic of Logistics: Theory, Algorithms, and Applications for Logistics Management. Springer.
Simon-Gabriel, C.-J. and Schölkopf, B. (2018). Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. Journal of Machine Learning Research, 19(44):1-29.
Sriperumbudur, B., Fukumizu, K., and Lanckriet, G. (2011). Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389-2410.
Sriperumbudur, B., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517-1561.
Steinwart, I. (2001). On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 6(3):67-93.
Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer.
Szabó, Z. and Sriperumbudur, B. K. (2018). Characteristic and universal tensor product kernels. Journal of Machine Learning Research, 18(233):1-29.
Takeuchi, I., Le, Q., Sears, T., and Smola, A. (2006). Nonparametric quantile estimation. Journal of Machine Learning Research, 7:1231-1264.
Topkis, D. M. (1998). Supermodularity and Complementarity. Princeton University Press.
Turlach, B. A. (2005). Shape constrained smoothing using smoothing splines. Computational Statistics, 20:81-104.
Varian, H. R. (1984). The nonparametric approach to production analysis. Econometrica, 52(3):579-597.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, CBMS-NSF Regional Conference Series in Applied Mathematics.
Wang, Y. (2011). Smoothing Splines: Methods and Applications. CRC Press.
Wu, J., Meyer, M. C., and Opsomer, J. D. (2015). Penalized isotonic regression. Journal of Statistical Planning and Inference, 161:12-24.
Wu, X. and Sickles, R. (2018). Semiparametric estimation under shape constraints. Econometrics and Statistics, 6:74-89.
Yagi, D., Chen, Y., Johnson, A. L., and Kuosmanen, T. (2020). Shape-constrained kernel-weighted least squares: Estimating production functions for Chilean manufacturing industries. Journal of Business & Economic Statistics, 38(1):43-54.
Zhou, D.-X. (2008). Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220:456-463.