Prediction Rule Reshaping

Matt Bonakdarpour (1), Sabyasachi Chatterjee (2), Rina Foygel Barber (1), John Lafferty (3)

(1) Department of Statistics, The University of Chicago. (2) Department of Statistics, University of Illinois at Urbana-Champaign. (3) Department of Statistics and Data Science, Yale University. Correspondence to: Matt Bonakdarpour.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Two methods are proposed for high-dimensional shape-constrained regression and classification. These methods reshape pre-trained prediction rules to satisfy shape constraints like monotonicity and convexity. The first method can be applied to any pre-trained prediction rule, while the second method deals specifically with random forests. In both cases, efficient algorithms are developed for computing the estimators, and experiments are performed to demonstrate their performance on four datasets. We find that reshaping methods enforce shape constraints without compromising predictive accuracy.

1. Introduction

Shape constraints like monotonicity and convexity arise naturally in many real-world regression and classification tasks. For example, holding all other variables fixed, a practitioner might assume that the price of a house is a decreasing function of neighborhood crime rate, that an individual's utility function is concave in income level, or that phenotypes such as height or the likelihood of contracting a disease are monotonic in certain genetic effects. Parametric models like linear regression implicitly impose monotonicity constraints at the cost of strong assumptions on the true underlying function. On the other hand, nonparametric techniques like kernel regression impose weak assumptions but do not guarantee monotonicity or convexity in their predictions.

Shape-constrained nonparametric regression methods attempt to offer the best of both worlds, allowing practitioners to dispense with parametric assumptions while retaining many of their appealing properties. However, classical approaches to nonparametric regression under shape constraints suffer from the curse of dimensionality (Han & Wellner, 2016; Han et al., 2017). Some methods have been developed to mitigate this issue under assumptions like additivity, where the true function $f$ is assumed to have the form $f(x) = \sum_j f_j(x_j) + c$ and a subset of the component functions $f_j$ are shape-constrained (Pya & Wood, 2015; Chen & Samworth, 2016; Xu et al., 2016). But in many real-world settings, the lack of interaction terms among the predictors can be too restrictive.

Approaches from the machine learning community like random forests, gradient boosted trees, and deep learning methods have been shown to exhibit outstanding empirical performance on high-dimensional tasks. But these methods do not guarantee monotonicity or convexity.

In this paper, we propose two methods for high-dimensional shape-constrained regression and classification. These methods blend the performance of machine learning methods with the classical least-squares approach to nonparametric shape-constrained regression. In Section (2.1), we describe black box reshaping, which takes any pre-trained prediction rule and reshapes it on a set of test inputs to enforce shape constraints. In the case of monotonicity constraints, we develop an efficient algorithm to compute the estimator.
Section (2.2) presents a second method designed specifically to reshape random forests (Breiman, 2001). This approach reshapes each individual decision tree based on its split rules and estimated leaf values. Again, in the case of monotonicity constraints, we present another efficient reshaping algorithm. We apply our methods to four datasets in Section (3) and show that they enforce the pre-specified shape constraints without sacrificing accuracy.

1.1. Related Work

In the context of monotonicity constraints, the black box reshaping method is related to the method of rearrangements (Chernozhukov et al., 2009; 2010). The rearrangement operation takes a pre-trained prediction rule and sorts its predictions to enforce monotonicity. In higher dimensions, the rearranged estimator is the average of one-dimensional rearrangements. In contrast, this paper focuses on isotonization of prediction values, jointly reshaping multiple dimensions in tandem. It would be interesting to explore adaptive procedures that average rearranged and isotonized predictions in future work.

Monotonic decision trees have previously been studied in the context of classification. Several methods require that the training data satisfy monotonicity constraints (Potharst & Feelders, 2002; Makino et al., 1996), a relatively strong assumption in the presence of noise. The methods we propose here do not place any restrictions on the training data. Another class of methods augments the score function for each split to incorporate the degree of non-monotonicity introduced by that split (Ben-David, 1995; González et al., 2015); however, this approach does not guarantee monotonicity. Feelders & Pardoel (2003) apply pruning algorithms to non-monotonic trees as a post-processing step in order to enforce monotonicity. For a comprehensive survey of methods for estimating monotonic functions, see Gupta et al. (2016).

A line of recent work has led to a method for learning deep monotonic models by alternating different types of monotone layers (You et al., 2017). Amos et al. (2017) propose a method for fitting neural networks whose predictions are convex with respect to a subset of predictors.

Our methods differ from this work in several ways. First, our techniques can be used to enforce both monotonic and convex/concave relationships. Second, unlike pruning methods, neither approach presented here changes the structure of the original tree. Third, black box reshaping, described in Section (2.1), can be applied to any pre-trained prediction rule, giving practitioners the flexibility of picking the method of their choice. And both methods guarantee that the intended shape constraints are satisfied on test data.

2. Prediction Rule Reshaping

In what follows, we say that a function $f : \mathbb{R}^d \to \mathbb{R}$ is monotone with respect to variables $\mathcal{R} \subseteq [d] = \{1, \ldots, d\}$ if $f(x) \le f(y)$ whenever $x_i \le y_i$ for $i \in \mathcal{R}$ and $x_i = y_i$ otherwise. Similarly, a function $f$ is convex in $\mathcal{R}$ if for all $x, y \in \mathbb{R}^d$ and $\alpha \in [0, 1]$, $f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y)$ whenever $x_i = y_i$ for all $i \notin \mathcal{R}$.

2.1. Black Box Reshaping

Let $\hat f : \mathbb{R}^d \to \mathbb{R}$ denote an arbitrary prediction rule fit on a training set, and assume we have a candidate set of shape constraints with respect to variables $\mathcal{R} \subseteq [d]$. For example, we might require that the function be monotone increasing in each variable $v \in \mathcal{R}$. Let $\mathcal{F}$ denote the class of functions that satisfy the desired shape constraints on each predictor variable $v \in \mathcal{R}$.
We aim to find a function $f^* \in \mathcal{F}$ that is close to $\hat f$ in the $L_2$ norm:

$$f^* = \arg\min_{f \in \mathcal{F}} \| f - \hat f \|_2, \qquad (1)$$

where the $L_2$ norm is with respect to the uniform measure on a compact set containing the data.

We simplify this infinite-dimensional problem by only considering values of $\hat f$ on certain fixed test points. Suppose we take a sequence $t^1, t^2, \ldots, t^n$ of test points, each in $\mathbb{R}^d$, that differ only in their $v$-th coordinate, so that $t^i_k = t^j_k$ for all $k \neq v$ and all $i, j$. These points can be ordered by their $v$-th coordinate, allowing us to consider shape constraints on the vector $(f(t^1), f(t^2), \ldots, f(t^n)) \in \mathbb{R}^n$. For instance, under a monotone-increasing constraint with respect to $v$, if $t^1_v \le t^2_v \le \cdots \le t^n_v$, then we consider functions $f$ such that $(f(t^1), f(t^2), \ldots, f(t^n))$ is a monotone increasing sequence.

There is now the question of choosing a test point $t$ as well as a sequence of values $t^1_v, \ldots, t^n_v$ to plug into its $v$-th coordinate. A natural choice is to use the observed data values as both test points and coordinate values. Let $\mathcal{D}_n = \{(x^1, y_1), \ldots, (x^n, y_n)\}$ denote the set of observed values, where $y_i$ is the response and $x^i \in \mathbb{R}^d$ are the predictors. From each $x^i$, we construct a sequence of test points that can be ordered according to their $v$-th coordinate in the following way. Let $x^{i,k,v}$ denote the observed vector $x^i$ with its $v$-th coordinate replaced by the $v$-th coordinate of $x^k$, so that

$$x^{i,k,v} = (x^i_1, x^i_2, \ldots, x^i_{v-1}, x^k_v, x^i_{v+1}, \ldots, x^i_d). \qquad (2)$$

This process yields $n$ points derived from $x^i$ that can be ordered by their $v$-th coordinate, $x^{i,1,v}, x^{i,2,v}, \ldots, x^{i,n,v}$. We then require $(f(x^{i,1,v}), f(x^{i,2,v}), \ldots, f(x^{i,n,v})) \in S_v$, where $S_v \subseteq \mathbb{R}^n$ is the appropriate convex cone that enforces the shape constraint for variable $v \in \mathcal{R}$, for example the cone of monotone increasing or convex sequences.

To summarize, for each coordinate $v \in \mathcal{R}$ and for each $i \in [n]$, we:

1. Take the $i$-th observed data point $x^i$ as a test point.
2. Replace its $v$-th coordinate with the $n$ observed $v$-th coordinates $x^1_v, \ldots, x^n_v$ to produce $x^{i,1,v}, x^{i,2,v}, \ldots, x^{i,n,v}$.
3. Enforce the appropriate shape constraint on the vector of evaluated function values, $(f(x^{i,1,v}), f(x^{i,2,v}), \ldots, f(x^{i,n,v})) \in S_v$.

See Figure (1) for an illustration.

Figure 1: Two-dimensional illustration of the black box setup when reshaping the $y$ dimension. $x^{i,k}$ denotes the original test point $x^i$ with its $y$-coordinate replaced by the $y$-coordinate of $x^k$. Dark gray nodes represent the original observed points. For monotonicity constraints, a directed edge from node $x$ to node $y$ represents the constraint $f(x) \le f(y)$ on the shape-constrained function $f$.

This leads to the following relaxation of (1):

$$f^* = \arg\min_{f \in \mathcal{F}_n} \| f - \hat f \|_2, \qquad (3)$$

where $\mathcal{F}_n$ is the class of functions $f$ such that $(f(x^{i,1,v}), f(x^{i,2,v}), \ldots, f(x^{i,n,v})) \in S_v \subseteq \mathbb{R}^n$ for each $v \in \mathcal{R}$ and each $i \in [n]$. In other words, we have relaxed the shape constraints on the function $f$, requiring the constraints to hold only relative to the selected test points. However, this optimization is still infinite dimensional. We make the final transition to finite dimensions by changing the objective function to only consider values of $f$ on the test points. Letting $F_{i,k,v}$ denote the value of $f$ evaluated on test point $x^{i,k,v}$, we relax (3) to obtain the solution $F^* = (F^*_{i,k,v})_{v \in \mathcal{R},\, i \in [n],\, k \in [n]}$ of the optimization:

$$\min_F \ \sum_{i,k,v} \left( F_{i,k,v} - \hat f(x^{i,k,v}) \right)^2 \quad \text{subject to} \quad (F_{i,1,v}, \ldots, F_{i,n,v}) \in S_v \quad \forall\, v \in \mathcal{R},\ i \in [n]. \qquad (4)$$
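The construction in Steps 1-3 is straightforward to vectorize. The following is a minimal NumPy sketch (ours, not the paper's implementation) that builds the test-point grid of Equation (2) for a single reshaped coordinate $v$ and evaluates a pre-trained rule on it; the `predict` callable and all variable names are placeholders for whatever fitted model and data are at hand.

```python
import numpy as np

def blackbox_test_grid(X, predict, v):
    """Build the test points x^{i,k,v} of Equation (2) for one reshaped
    coordinate v and evaluate a pre-trained prediction rule on them.

    X       : (n, d) array of observed predictors
    predict : fitted rule, called as predict(points) -> array of predictions
    v       : index of the coordinate being reshaped

    Returns fhat, an (n, n) array with fhat[i, k] = predict(x^{i,k,v}), where
    columns are ordered by the sorted v-th coordinates, plus the sort order.
    """
    n, d = X.shape
    order = np.argsort(X[:, v])                   # order substituted values by x_v
    grid = np.repeat(X[:, None, :], n, axis=1)    # grid[i, k] starts as a copy of x^i
    grid[:, :, v] = X[order, v][None, :]          # replace coordinate v with x^k_v
    fhat = predict(grid.reshape(n * n, d)).reshape(n, n)
    return fhat, order
```

Each row `fhat[i, :]` is the vector $(\hat f(x^{i,1,v}), \ldots, \hat f(x^{i,n,v}))$ entering the objective in (4); for a monotone-increasing constraint, that row is the input to an isotonic regression.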
However, this leads to ill-defined predictions on the original data points $x^1, \ldots, x^n$, since for each $v$, $x^{i,i,v} = x^i$, but we may obtain different values $F^*_{i,i,v}$ for various $v \in \mathcal{R}$. We avoid this issue by adding a consistency constraint (7) to obtain our final black box reshaping optimization (BBOPT):

$$\min_F \ \sum_{i,k,v} \left( F_{i,k,v} - \hat f(x^{i,k,v}) \right)^2 \qquad (5)$$
$$\text{subject to} \quad (F_{i,1,v}, \ldots, F_{i,n,v}) \in S_v \quad \forall\, v \in \mathcal{R},\ i \in [n], \qquad (6)$$
$$\text{and} \quad F_{i,i,v} = F_{i,i,w} \quad \forall\, v, w \in \mathcal{R},\ i \in [n]. \qquad (7)$$

We then take the reshaped predictions to be $f^*(x^i) = F^*_{i,i,v}$ for any $v \in \mathcal{R}$. Since the constraints depend on each $x^i$ independently, BBOPT decomposes into $n$ optimization problems, one for each observed value. Note that the true response values $y_i$ are not used when reshaping. We could select optimal shape constraints on a held-out test set.

2.1.1. Intersecting Isotonic Regression

In this section, we present an efficient algorithm for solving BBOPT for the case when each $S_v$ imposes monotonicity constraints. Let $R = |\mathcal{R}|$ denote the number of monotonicity constraints. When reshaping with respect to only one predictor ($R = 1$), the consistency constraints (7) vanish, so the optimization decomposes into $n$ isotonic regression problems. Each problem is efficiently solved in $\Theta(n)$ time with the pool adjacent violators algorithm (PAVA) (Ayer et al., 1955).

For $R > 1$ monotonicity constraints, BBOPT gives rise to $n$ independent intersecting isotonic regression problems. The $k$-th problem corresponds to the $k$-th observed value $x^k$; the intersection is implied by the consistency constraints (7). For each independent problem, our algorithm takes $O(m \log R)$ time, where $m = nR$ is the number of variables in each problem.

We first state the general problem. Assume $v^1, v^2, \ldots, v^K$ are each real-valued vectors with dimensions $d_1, d_2, \ldots, d_K$, respectively. Let $i_j \in \{1, \ldots, d_j\}$ denote an index in the $j$-th vector $v^j$. The intersecting isotonic regression problem (IISO) is:

$$\min_{(\hat v^k)_{k=1}^K} \ \sum_{k=1}^K \| \hat v^k - v^k \|^2 \quad \text{subject to} \quad \hat v^k_1 \le \hat v^k_2 \le \cdots \le \hat v^k_{d_k} \ \ \forall k \in [K], \quad \text{and} \quad \hat v^1_{i_1} = \hat v^2_{i_2} = \cdots = \hat v^K_{i_K}. \qquad (8)$$

First consider the simpler constrained isotonic regression problem with a single sequence $v \in \mathbb{R}^d$, index $i \in [d]$, and fixed value $c \in \mathbb{R}$:

$$\min_{\hat v} \ \| \hat v - v \|^2 \quad \text{subject to} \quad \hat v_1 \le \hat v_2 \le \cdots \le \hat v_d \ \ \text{and} \ \ \hat v_i = c. \qquad (9)$$

Lemma 2.1. The solution $v^*$ to (9) can be computed by using index $i$ as a pivot and splitting $v$ into its left and right tails, $\ell = (v_1, v_2, \ldots, v_{i-1})$ and $r = (v_{i+1}, \ldots, v_d)$, then applying PAVA to obtain monotone tails $\hat\ell$ and $\hat r$. The solution $v^*$ is obtained by setting the elements of $\hat\ell$ and $\hat r$ to

$$\ell^*_k = \min(\hat\ell_k, c), \qquad r^*_k = \max(\hat r_k, c), \qquad (10)$$

and concatenating the resulting tails so that $v^* = (\ell^*, c, r^*) \in \mathbb{R}^d$.

Algorithm 1: IISO Algorithm
1. Apply PAVA to each of the $2K$ tails.
2. Combine and sort the left and right tails separately.
3. Find the segment $s^*$ between tail values where the derivative $g'(\eta)$ changes sign.
4. Compute $c^*$, the minimizer of $g(c)$ in segment $s^*$.

We now explain the IISO algorithm presented in Algorithm (1). First divide each vector $v^j$ into two tails, the left tail $\ell^j$ and the right tail $r^j$, using the intersection index $i_j$ as a pivot,

$$v^j = (\underbrace{v^j_1, v^j_2, \ldots, v^j_{i_j - 1}}_{\ell^j},\ v^j_{i_j},\ \underbrace{v^j_{i_j+1}, v^j_{i_j+2}, \ldots, v^j_{d_j}}_{r^j}),$$

resulting in $2K$ tails $\{\ell^1, \ldots, \ell^K, r^1, \ldots, r^K\}$. Step 1 of Algorithm (1) performs an unconstrained isotonic regression on each tail using PAVA to obtain $2K$ monotone tails $\{\hat\ell^1, \ldots, \hat\ell^K, \hat r^1, \ldots, \hat r^K\}$. This can be done in $\Theta(n)$ time, where $n = \sum_{k=1}^K d_k$ is the total number of elements across all vectors.
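PAVA itself is a standard routine (for example, scikit-learn exposes it through `IsotonicRegression`). For reference, here is a compact, self-contained sketch of the stack-based variant that could be applied to each tail in Step 1; it is an illustration, not the paper's implementation.

```python
import numpy as np

def pava(y):
    """Pool adjacent violators: the L2 projection of y onto the set of
    nondecreasing sequences.  Blocks are kept on a stack as (mean, size)
    pairs and merged whenever their means violate monotonicity."""
    blocks = []                                   # each entry: [block mean, block size]
    for value in np.asarray(y, dtype=float):
        blocks.append([value, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, s2 = blocks.pop()
            m1, s1 = blocks.pop()
            blocks.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    return np.concatenate([np.full(size, mean) for mean, size in blocks])

# Example: pava([1.0, 3.0, 2.0, 4.0]) -> array([1. , 2.5, 2.5, 4. ])
```

The amortized cost is linear, since each merge removes a block that is never recreated.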
Given the monotone tails, we can write a closed-form expression for the IISO objective function in terms of the value at the point of intersection. Let $c$ be the value of the vectors at the point of intersection, so that $c = \hat v^1_{i_1} = \hat v^2_{i_2} = \cdots = \hat v^K_{i_K}$. For a fixed $c$, we can solve IISO by applying Lemma (2.1) to each sequence separately. This yields the following expression for the squared error as a function of $c$:

$$g(c) = \sum_{k=1}^K \left( c - v^k_{i_k} \right)^2 + \sum_{k=1}^K \sum_{i=1}^{i_k - 1} \left( \ell^k_i - \min(\hat\ell^k_i, c) \right)^2 + \sum_{l=1}^K \sum_{j=1}^{d_l - i_l} \left( r^l_j - \max(\hat r^l_j, c) \right)^2, \qquad (11)$$

which is piecewise quadratic with knots at each $\hat\ell^k_i$ and $\hat r^l_j$. Our goal is to find $c^* = \arg\min_c g(c)$. Note that $g(c)$ is convex and differentiable. We proceed by computing the derivative of $g$ at each knot, from smallest to largest, and finding the segment in which the sign of the derivative changes from negative to positive. The minimizer $c^*$ will live in this segment.

Step 2 of Algorithm (1) merges the left and right sorted tails into two sorted lists. This can be done in $O(n \log K)$ time with a heap data structure. Step 3 computes the derivative of the objective function $g$ at each knot, from smallest to largest, searching for the segment in which the derivative changes sign. Step 4 computes the minimizer of $g$ in the corresponding segment. By updating the derivative incrementally and storing relevant side information, Steps 3 and 4 can be done in linear time. The total time complexity is therefore $O(n \log K)$.
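The combining step (Steps 3-4) amounts to minimizing the one-dimensional convex function $g$. Below is a plain NumPy sketch of that step under the setup of Equation (11); it re-evaluates $g'$ at every knot, so it runs in quadratic rather than $O(n \log K)$ time, and it is an illustration rather than the authors' implementation. Here `left_fit[k]` and `right_fit[k]` would be the PAVA fits of the corresponding tails (e.g., from the `pava` sketch above).

```python
import numpy as np

def iiso_combine(pivots, left_orig, left_fit, right_orig, right_fit):
    """Find the intersection value c* minimizing g(c) in Equation (11),
    given the PAVA-monotonized tails from Step 1 of Algorithm 1."""
    p = np.asarray(pivots, dtype=float)

    def cat(arrays):
        return np.concatenate([np.ravel(a) for a in arrays]) if arrays else np.empty(0)

    lo, lf = cat(left_orig), cat(left_fit)     # original and fitted left-tail values
    ro, rf = cat(right_orig), cat(right_fit)   # original and fitted right-tail values

    def gprime(c):
        # g'(c): pivot terms always contribute; a left-tail term contributes while
        # its fitted value exceeds c, a right-tail term while its fitted value is below c.
        return 2.0 * (np.sum(c - p) - np.sum(lo[lf > c] - c) + np.sum(c - ro[rf < c]))

    def segment_minimizer(active_left, active_right):
        # with the active sets fixed, g is a single quadratic; set g'(c) = 0
        num = p.sum() + lo[active_left].sum() + ro[active_right].sum()
        return num / (p.size + active_left.sum() + active_right.sum())

    knots = np.sort(np.concatenate([lf, rf]))
    nonneg = np.flatnonzero(np.array([gprime(k) for k in knots]) >= 0)
    idx = nonneg[0] if nonneg.size else knots.size
    if knots.size and idx == 0:       # c* lies at or below the smallest knot
        return segment_minimizer(np.ones(lo.size, bool), np.zeros(ro.size, bool))
    if idx == knots.size:             # c* lies at or above the largest knot
        return segment_minimizer(np.zeros(lo.size, bool), np.ones(ro.size, bool))
    # otherwise c* lies in the segment (knots[idx-1], knots[idx]]
    return segment_minimizer(lf >= knots[idx], rf <= knots[idx - 1])
```

Once $c^*$ is found, each $\hat v^k$ is recovered by clipping its monotone tails at $c^*$ as in Lemma (2.1).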
2.2. Reshaping Random Forests

In this section, we describe a framework for reshaping a random forest to ensure monotonicity of its predictions in a subset of its predictor variables. A similar method can be applied to ensure convexity. For both regression and probability trees (Malley et al., 2012), the prediction of the forest is an average of the prediction of each tree; it is therefore sufficient to ensure monotonicity or convexity of the trees. For the rest of this section, we focus on reshaping individual trees to enforce monotonicity.

Our method is a two-step procedure. The first step is to grow a tree in the usual way. The second step is to reshape the leaf values to enforce monotonicity. We hope to explore the implications of combining these steps in future work.

Let $T(x)$ be a regression tree and $\mathcal{R} \subseteq [d]$ a set of predictor variables to be reshaped. Let $x \in \mathbb{R}^d$ be an input point, denote the $k$-th coordinate of $x$ by $x_k$, and assume $v \in \mathcal{R}$ is a predictor variable to be reshaped. The following thought experiment, illustrated in Figure (2), motivates our approach. Imagine dropping $x$ down $T$ until it falls in its corresponding leaf, $\ell_1$. Let $p_1$ be the closest ancestor node of $\ell_1$ that splits on $v$, and assume it has split rule $\{x_v \le t_1\}$. Holding all other coordinates constant, increasing $x_v$ until it is greater than $t_1$ would create a new point that drops down to a different leaf $\ell_2$ in the right subtree of $p_1$. If $\ell_1$ and $\ell_2$ both share another ancestor $p_2$ farther up the tree with split rule $\{x_v \le t_2\}$, increasing $x_v$ beyond $t_2$ would yield another leaf $\ell_3$. Assume these leaves have no other shared ancestors that split on $v$. Denoting the value of leaf $\ell$ by $\mu_\ell$, in order to ensure monotonicity in $v$ for this point $x$, we require $\mu_{\ell_1} \le \mu_{\ell_2} \le \mu_{\ell_3}$.

Figure 2: An illustration motivating the random forest reshaping method. The only nodes that split on the shape-constrained variable $v$ are $p_1$ and $p_2$. Assume $x \in \mathbb{R}^d$ drops down to leaf $\ell_1$ and that, holding all other variables constant, increasing $x_v$ beyond $t_1$ and $t_2$ results in dropping to leaves $\ell_2$ and $\ell_3$, respectively. To enforce monotonicity in $v$ for this point, we need to ensure that the leaf values satisfy $\mu_{\ell_1} \le \mu_{\ell_2} \le \mu_{\ell_3}$.

We use this line of thinking to propose a framework for estimating monotonic random forests, and we describe two estimators that fall under this framework.

2.2.1. Exact Estimator

Each leaf $\ell$ in a decision tree is a cell (or hyperrectangle) $C_\ell$, which is an intersection of intervals:

$$C_\ell = \bigcap_{j=1}^{d} \{ x : x_j \in I^\ell_j \}.$$

When we split on a shape-constrained variable $v$ with split value $t$, each cell in the left subtree is of the form $C_l = \bar C_l \cap \{x : x_v \le t\}$ and each cell in the right subtree is of the form $C_r = \bar C_r \cap \{x : x_v > t\}$. For cells $l$ in the left subtree and $r$ in the right subtree, our goal is to constrain the corresponding leaf values $\mu_l \le \mu_r$ only when $\bar C_l \cap \bar C_r \neq \emptyset$. See Figure (3) for an illustration.

We must devise an algorithm to find the intersecting cell pairs $(l, r)$ and add each to a constraint set $E$. This can be done efficiently with an interval tree data structure. Assume there are $n$ unique leaves appearing in $E$. The exact estimator is obtained by solving the following optimization:

$$\min_{(\hat\mu_\ell)_{\ell=1}^n} \ \sum_{\ell=1}^n (\mu_\ell - \hat\mu_\ell)^2 \quad \text{subject to} \quad \hat\mu_i \le \hat\mu_j \ \ \forall (i, j) \in E, \qquad (12)$$

where $\mu_\ell$ is the original value of leaf $\ell$. This is an instance of $L_2$ isotonic regression on a directed acyclic graph in which each leaf value $\mu_\ell$ is a node and each constraint in $E$ is an edge. With $n$ vertices and $m$ edges, the fastest known exact algorithm for this problem has time complexity $\Theta(n^4)$ (Spouge et al., 2003), and the fastest known $\delta$-approximate algorithm has complexity $O(m^{1.5} \log^2 n \log \frac{n}{\delta})$ (Kyng et al., 2015).

Figure 3: Suppose we have three variables $(x_1, x_2, x_3)$, and when we split on reshaped variable $x_3$, the left and right subtrees and their corresponding cells are as shown in the figure. By examination, any point that drops to $\ell_2$ can only travel to $r_2$ when its $x_3$ coordinate is increased. By this logic, the exact estimator would use the six constraints $\mu_{\ell_2} \le \mu_{r_2}$, $\mu_{\ell_1} \le \mu_{r_1}$, $\mu_{\ell_1} \le \mu_{r_2}$, $\mu_{\ell_3} \le \mu_{r_1}$, $\mu_{\ell_3} \le \mu_{r_2}$, $\mu_{\ell_3} \le \mu_{r_3}$, whereas the over-constrained estimator would use all nine pairwise constraints.

With a corresponding change to the constraints in Equation (12), this approach extends naturally to convex regression trees. It can also be applied directly to probability trees for binary classification by reshaping the estimated probabilities in each leaf.
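To make the constraint-set construction concrete, here is a small sketch of assembling $E$ for a single shape-constrained split. It is our own brute-force illustration, not the paper's interval-tree implementation: it assumes leaf cells are stored as dictionaries mapping a variable index to a half-open interval, treats coordinates that were never split on as unbounded, and uses the simplified test that two cells intersect when they overlap in every coordinate other than the reshaped variable $v$.

```python
from itertools import product

def cells_intersect(cell_a, cell_b, skip):
    """Check whether two leaf cells overlap in every coordinate except `skip`
    (the shape-constrained split variable).  Cells map variable index -> (low, high),
    interpreted as half-open intervals [low, high)."""
    for j in set(cell_a) | set(cell_b):
        if j == skip:
            continue
        lo_a, hi_a = cell_a.get(j, (float("-inf"), float("inf")))
        lo_b, hi_b = cell_b.get(j, (float("-inf"), float("inf")))
        if max(lo_a, lo_b) >= min(hi_a, hi_b):   # empty overlap in coordinate j
            return False
    return True

def exact_constraints(left_leaves, right_leaves, v):
    """Constraint edges (l, r), meaning mu_l <= mu_r, for one split on variable v.

    left_leaves, right_leaves : lists of (leaf_id, cell) pairs for the two subtrees.
    A production version would query an interval tree instead of scanning
    all |left| * |right| pairs.
    """
    return [(lid, rid)
            for (lid, lcell), (rid, rcell) in product(left_leaves, right_leaves)
            if cells_intersect(lcell, rcell, skip=v)]
```

The union of these edge sets over every shape-constrained split node defines the directed acyclic graph on which (12) is solved.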
2.2.2. Over-Constrained Estimator

In this section, we propose an alternative estimator that can be more efficient to compute, depending on the tree structure. In our experiments below, we find that computing this estimator is always faster.

Let $E_p$ denote the set of constraints that arise between leaf values under a shape-constrained split node $p$, and let $n_p$ denote the number of leaves under $p$. By adding additional constraints to $E_p$, we can solve (12) exactly for each shape-constrained split node in $O(n_p \log n_p)$ time. In this setting, each shape-constrained split node gives rise to an independent optimization involving its leaves. Due to transitivity, we can solve these optimizations sequentially in reverse (bottom-up) level-order on the tree.

For each node $p$ that is split on a shape-constrained variable, the over-constrained estimator solves the following max-min problem:

$$\min_{(\hat\mu_\ell)_{\ell=1}^{n_p}} \ \sum_{\ell=1}^{n_p} (\mu_\ell - \hat\mu_\ell)^2 \quad \text{subject to} \quad \max_{\ell \in \mathrm{left}(p)} \hat\mu_\ell \ \le \ \min_{r \in \mathrm{right}(p)} \hat\mu_r, \qquad (13)$$

where $\mathrm{left}(p)$ denotes all leaves in the left subtree of $p$ and $\mathrm{right}(p)$ denotes all leaves in the right subtree. This is equivalent to adding an edge $(\ell, r)$ to $E$ for every pair of leaves such that $\ell$ is in $\mathrm{left}(p)$ and $r$ is in $\mathrm{right}(p)$. Not all such pairs necessarily appear in $E$ for the exact estimator; see Figure (3). For each shape-constrained split, (13) is an instance of $L_2$ isotonic regression on a complete directed bipartite graph.

For a given shape-constrained split node $p$, let $\ell = (\ell_1, \ell_2, \ldots, \ell_{n_1})$ be the values of the leaves in its left subtree and $r = (r_1, r_2, \ldots, r_{n_2})$ be the values of the leaves in its right subtree, indexed so that $\ell_1 \le \cdots \le \ell_{n_1}$ and $r_1 \le \cdots \le r_{n_2}$. Then the max-min problem (13) is equivalent to:

$$\min_{\tilde\ell, \tilde r} \ \sum_{i=1}^{n_1} (\ell_i - \tilde\ell_i)^2 + \sum_{i=1}^{n_2} (r_i - \tilde r_i)^2 \quad \text{subject to} \quad \tilde\ell_1 \le \tilde\ell_2 \le \cdots \le \tilde\ell_{n_1} \le \tilde r_1 \le \cdots \le \tilde r_{n_2}.$$

The solution to this optimization is of the form $\tilde\ell_i = \min(c, \ell_i)$ and $\tilde r_i = \max(c, r_i)$ for some constant $c$. Given the two sorted vectors $\ell$ and $r$, the optimization therefore becomes:

$$\min_c \ \sum_{i=1}^{n_1} \left( \ell_i - \min(c, \ell_i) \right)^2 + \sum_{i=1}^{n_2} \left( r_i - \max(c, r_i) \right)^2.$$

This objective is convex and differentiable in $c$. Similar to the black box reshaping method, we can compute the derivative at each of the data values, find where it flips sign, and then compute the minimizer in the corresponding segment. This takes $O(n)$ time, where $n = n_1 + n_2$ is the number of leaves under the shape-constrained split. With sorting, the over-constrained estimator can therefore be computed in $O(n \log n)$ time for each shape-constrained split node. We apply this procedure sequentially to the leaves of every shape-constrained node in reverse level-order on the tree.
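As a concrete illustration of the per-split solve (our sketch, not the paper's R/C++ implementation), the function below finds the clipping threshold $c$ by scanning the derivative at each leaf value, then adjusts the two sides. For clarity it re-evaluates the derivative at every knot rather than updating it incrementally, and it assumes both subtrees are non-empty.

```python
import numpy as np

def overconstrained_split(left_vals, right_vals):
    """Solve the single-split problem: choose c minimizing
    sum_i (l_i - min(c, l_i))^2 + sum_j (r_j - max(c, r_j))^2,
    then return the clipped leaf values for the left and right subtrees."""
    l = np.asarray(left_vals, dtype=float)
    r = np.asarray(right_vals, dtype=float)
    if l.max() <= r.min():                  # split is already monotone; nothing to do
        return l, r

    def hprime(c):
        # derivative of the convex objective: only clipped leaves contribute
        return 2.0 * (np.sum(c - l[l > c]) + np.sum(c - r[r < c]))

    knots = np.sort(np.concatenate([l, r]))
    idx = next(i for i, k in enumerate(knots) if hprime(k) >= 0)   # first sign change
    # within the segment ending at knots[idx], the set of clipped leaves is fixed
    clipped_l = l >= knots[idx]
    clipped_r = r <= knots[idx - 1]
    c = (l[clipped_l].sum() + r[clipped_r].sum()) / (clipped_l.sum() + clipped_r.sum())
    return np.minimum(l, c), np.maximum(r, c)
```

For example, `overconstrained_split([2.0, 5.0], [1.0, 6.0])` pools the violating leaves at $c = 3$, returning left values $(2, 3)$ and right values $(3, 6)$.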
3. Experiments

We apply the reshaping methods described above to two regression tasks and two binary classification tasks, and we show that reshaping allows us to enforce shape constraints without compromising predictive accuracy. For convenience, we use the acronyms in Table (2) to refer to each method.

Table 2: Acronyms used for reshaping methods.

| Method                     | Acronym | Section |
|----------------------------|---------|---------|
| Random forest              | RF      | -       |
| Black box reshaping        | BB      | 2.1     |
| Exact estimator            | EX      | 2.2.1   |
| Over-constrained estimator | OC      | 2.2.2   |

The BB method was implemented in R, and the OC and EX methods were implemented in R and C++, extending the R package ranger (Wright & Ziegler, 2017). The exact estimator from Section (2.2.1) is computed using the MOSEK C++ package (ApS, 2017). For binary classification, we use the probability tree implementation found in ranger, enforcing monotonicity of the probability of a positive classification with respect to the chosen predictors. For the purposes of these experiments, black box reshaping is applied to a traditional random forest. The random forest was fit with the default settings found in ranger.

We apply 5-fold cross-validation on all four tasks and present the results under the relevant performance metrics in Table (1).

Table 1: Experimental results of 5-fold cross-validation. Accuracy is measured by mean squared error (Diabetes), mean absolute percent error (Zillow), and classification accuracy (Adult, Spam). Standard errors are shown in parentheses.

| Method | Diabetes   | Zillow        | Adult        | Spam         |
|--------|------------|---------------|--------------|--------------|
| RF     | 3209 (377) | 2.594% (0.1%) | 87.0% (1.7%) | 95.4% (1.3%) |
| BB     | 3210 (390) | 2.594% (0.1%) | 87.1% (1.8%) | 95.4% (1.3%) |
| EX     | 3154 (353) | -             | 86.8% (1.5%) | 95.3% (1.2%) |
| OC     | 3155 (350) | 2.588% (0.1%) | 87.0% (1.5%) | 95.3% (1.2%) |

3.1. Diabetes Dataset

The diabetes dataset (Efron et al., 2004) consists of ten physiological baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) for each of 442 patients. The response is a quantitative measure of disease progression measured one year after baseline. Holding all other variables constant, we might expect disease progression to be monotonically increasing in body mass index (Ganz et al., 2014). We estimate a random forest and apply our reshaping techniques, then make predictions for a random test subject as we vary the body mass index predictor variable. The results shown in Figure (4a) illustrate the effect of reshaping on the predictions.

Figure 4: Illustrating the result of reshaping. Panel (a): Diabetes dataset (predicted disease progression). Panel (b): Adult dataset (predicted Pr(income > $50k) against years of education). We choose a test point at random and make predictions with each model as we vary the predictor variable on the x-axis, holding all other variables constant. In both cases, the original RF predictions are not monotone increasing. The Diabetes plot (4a) also shows the true value of the chosen data point with an X.

We use mean squared error to measure accuracy. The results in Table (1) indicate that the prediction accuracy of all four estimators is approximately the same.

3.2. Zillow Dataset

In this section, the regression task is to predict real estate sale prices using property information. The data were obtained from Zillow, an online real estate database company. For each of 206,820 properties, we are given the list price, number of bedrooms and bathrooms, square footage, build decade, sale year, major renovation year (if any), city, and metropolitan area. The response is the actual sale price of the home. We reshape to enforce monotonicity of the sale price with respect to the list price.

Due to the size of the constraint set, this problem becomes intractable for MOSEK, so the results for the EX method are omitted. An interesting direction for future work is to investigate more efficient algorithms for this method. Following reported results from Zillow, we use mean absolute percent error (MAPE) as our measure of accuracy: for an estimate $\hat y$ of the true value $y$, the APE is $|\hat y - y| / y$. The results in Table (1) show that the performance across all estimators is indistinguishable.

3.3. Adult Dataset

We apply the reshaping techniques to the binary classification task found in the Adult dataset (Lichman, 2013). The task is to predict whether an individual's income is less than or greater than $50,000. Following the experiments performed in Milani Fard et al. (2016) and You et al. (2017), we apply monotonic reshaping to four variables: capital gain, weekly hours of work, education level, and the gender wage gap. We illustrate the effect of reshaping on the predictions in Figure (4b). The results in Table (1) show that we achieve similar test set accuracy before and after reshaping the random forest.

3.4. Spambase Dataset

Finally, we apply reshaping to classify whether an email is spam or not. The Spambase dataset (Lichman, 2013) contains 4,601 emails, each with 57 predictors: 48 word frequency predictors, 6 character frequency predictors, and 3 predictors related to the number of capital letters appearing in the email. The data were collected by Hewlett-Packard Labs and donated by George Forman. One of the predictors is the frequency of the word "george", typically assumed to be an indicator of non-spam for this dataset.
We reshape the predictions so that the probability of an email being classified as spam is monotonically decreasing in the frequencies of the words "george" and "hp". The results in Table (1) again show similar performance across all methods.

4. Discussion

We presented two strategies for prediction rule reshaping. We developed efficient algorithms to compute the reshaped estimators and illustrated their properties on four datasets. Both approaches can be viewed as frameworks for developing more sophisticated reshaping techniques.

There are several ways that this work can be extended. Extensions to the black box method include adaptively combining rearrangements and isotonization (Chernozhukov et al., 2009), and considering a weighted objective function to account for the distance between test points. More generally, the random forest reshaping method can be viewed as operating on the pre-trained parameters of a specific model. Applying this line of thinking to gradient boosted trees, deep learning methods, and other machine learning techniques could yield useful variants of this approach.

Finally, while practitioners might require certain shape constraints on their predictions, many scientific applications also require inferential quantities, such as confidence intervals and confidence bands. Developing inferential procedures for reshaped predictions, similar to Chernozhukov et al. (2010) for rearrangements and Athey et al. (2018) for random forests, would yield interpretable predictions along with useful measures of uncertainty.

Acknowledgements

Research supported in part by ONR grant N00014-12-10762, NSF grants DMS-1513594 and DMS-1654076, and an Alfred P. Sloan fellowship.

References

Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.

ApS, M. MOSEK Fusion API for C++ 8.0.0.94, 2017. URL https://docs.mosek.com/8.0/cxxfusion/index.html.

Athey, S., Tibshirani, J., and Wager, S. Generalized random forests. Forthcoming in Annals of Statistics, 2018. URL https://arxiv.org/abs/1610.01271.

Ayer, M., Brunk, H., Ewing, G., Reid, W., and Silverman, E. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 26:641-648, 1955.

Ben-David, A. Monotonicity maintenance in information-theoretic machine learning algorithms. Machine Learning, 19(1):29-43, 1995.

Breiman, L. Random forests. Machine Learning, 45(1):5-32, 2001.

Chen, Y. and Samworth, R. J. Generalized additive and index models with shape constraints. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(4):729-754, 2016.

Chernozhukov, V., Fernández-Val, I., and Galichon, A. Improving point and interval estimators of monotone functions by rearrangement. Biometrika, 96(3):559-575, 2009.

Chernozhukov, V., Fernández-Val, I., and Galichon, A. Quantile and probability curves without crossing. Econometrica, 78(3):1093-1125, 2010.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. Annals of Statistics, 32:407-499, 2004.

Feelders, A. and Pardoel, M. Pruning for monotone classification trees. In Berthold, M. R., Lenz, H.-J., Bradley, E., Kruse, R., and Borgelt, C. (eds.), Advances in Intelligent Data Analysis V, pp. 1-12, Berlin, Heidelberg, 2003. Springer.

Ganz, M. L., Wintfeld, N., Li, Q., Alas, V., Langer, J., and Hammer, M.
The association of body mass index with the risk of type 2 diabetes: a case-control study nested in an electronic health records system in the United States. Diabetology & Metabolic Syndrome, 6(1):50, 2014.

González, S., Herrera, F., and García, S. Monotonic random forest with an ensemble pruning mechanism based on the degree of monotonicity. 33:367-388, 2015.

Gupta, M., Cotter, A., Pfeifer, J., Voevodski, K., Canini, K., Mangylov, A., Moczydlowski, W., and van Esbroeck, A. Monotonic calibrated interpolated look-up tables. Journal of Machine Learning Research, 17(109):1-47, 2016.

Han, Q. and Wellner, J. A. Multivariate convex regression: global risk bounds and adaptation. arXiv preprint arXiv:1601.06844, 2016.

Han, Q., Wang, T., Chatterjee, S., and Samworth, R. J. Isotonic regression in general dimensions. arXiv preprint arXiv:1708.09468, 2017.

Kyng, R., Rao, A., and Sachdeva, S. Fast, provable algorithms for isotonic regression in all ℓp-norms. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2719-2727. Curran Associates, Inc., 2015.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Makino, K., Suda, T., Yano, K., and Ibaraki, T. Data analysis by positive decision trees. In CODAS, pp. 257-264, 1996.

Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., and Ziegler, A. Probability machines: consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine, 51(1):74, 2012.

Milani Fard, M., Canini, K., Cotter, A., Pfeifer, J., and Gupta, M. Fast and flexible monotonic functions with ensembles of lattices. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 2919-2927. Curran Associates, Inc., 2016.

Potharst, R. and Feelders, A. J. Classification trees for problems with monotonicity constraints. SIGKDD Explorations Newsletter, 4(1):1-10, 2002.

Pya, N. and Wood, S. N. Shape constrained additive models. Statistics and Computing, 25(3):543-559, 2015.

Spouge, J., Wan, H., and Wilbur, W. Least squares isotonic regression in two dimensions. Journal of Optimization Theory and Applications, 117(3):585-605, 2003.

Wright, M. N. and Ziegler, A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1):1-17, 2017.

Xu, M., Chen, M., and Lafferty, J. Faithful variable screening for high-dimensional convex regression. Annals of Statistics, 44(6):2624-2660, 2016.

You, S., Ding, D., Canini, K., Pfeifer, J., and Gupta, M. Deep lattice networks and partial monotonic functions. In NIPS, 2017.