Published as a conference paper at ICLR 2021

MONOTONIC KRONECKER-FACTORED LATTICE

Nobuyuki Morioka, Erez Louidor, William Bakst
Google Research
{nmorioka,erez,wbakst}@google.com
Equal contribution.

It is computationally challenging to learn flexible monotonic functions that guarantee model behavior and provide interpretability beyond a few input features, and at a time when minimizing resource use is increasingly important, we must be able to learn such models efficiently. In this paper we show how to effectively and efficiently learn such functions using Kronecker-Factored Lattice (KFL), an efficient reparameterization of flexible monotonic lattice regression via the Kronecker product. Both computational and storage costs scale linearly in the number of input features, a significant improvement over existing methods that grow exponentially. We also show that we can still properly enforce monotonicity and other shape constraints. The KFL function class consists of products of piecewise-linear functions, and the size of the function class can be further increased through ensembling. We prove that the function class of an ensemble of M base KFL models strictly increases as M increases up to a certain threshold; once this threshold is reached, every multilinear interpolated lattice function can be expressed. Our experimental results demonstrate that KFL trains faster with fewer parameters while still achieving accuracy and evaluation speeds comparable to or better than the baseline methods and preserving monotonicity guarantees on the learned model.

1 INTRODUCTION

Many machine learning problems have requirements beyond accuracy, such as efficiency, storage, and interpretability. For example, the ability to learn flexible monotonic functions at scale is useful in practice because machine learning practitioners often know a priori which input features relate positively or negatively to the output, and can incorporate these hints as inductive bias in training to further regularize the model (Abu-Mostafa, 1993) and guarantee its expected behavior on unseen examples. It is, however, computationally challenging to learn such functions efficiently. In this paper, we extend the work on interpretable monotonic lattice regression (Gupta et al., 2016) to significantly reduce computational and storage costs without compromising accuracy.

While a linear model with nonnegative coefficients is able to learn simple monotonic functions, its function class is restricted. Prior works proposed non-linear methods (Sill, 1997; Dugas et al., 2009; Daniels & Velikova, 2010; Qu & Hu, 2011) to learn more flexible monotonic functions, but they have been shown to work only in limited settings with small datasets and low-dimensional feature spaces. Monotonic lattice regression (Gupta et al., 2016), an extension of lattice regression (Garcia et al., 2012), learns an interpolated look-up table with linear inequality constraints that impose monotonicity. This has been demonstrated to work with millions of training examples and to achieve competitive accuracy against, for example, random forests; however, because the number of model parameters scales exponentially in the number of input features, it is difficult to apply such models in high-dimensional feature space settings. To overcome this limitation, Canini et al. (2016) incorporated ensemble learning to combine many tiny lattice models, each capturing non-linear interactions among a small random subset of features.
This paper proposes Kronecker-Factored Lattice (KFL), a novel reparameterization of monotonic lattice regression via the Kronecker product that achieves significant parameter efficiency and simultaneously provides guaranteed monotonicity of the learned model with respect to a user-prescribed set of input features for user trust. Both inference and storage costs scale linearly in the number of features; hence, it potentially allows more features to interact non-linearly. Kronecker factorization has been applied to a wide variety of problems including optimization (Martens & Grosse, 2015; George et al., 2018; Osawa et al., 2019), convolutional neural networks (Zhang et al., 2015), and recurrent neural networks (Jose et al., 2018), but, to the best of our knowledge, it has not yet been explored for learning flexible monotonic functions.

The main contributions of this paper are: (1) a reparameterization of monotonic lattice regression that achieves linear evaluation time and storage for the resulting subclass of models, (2) proofs of sufficient and necessary conditions for a KFL model to be monotonic as well as convex and nonnegative, (3) showing how the conditions for monotonicity can be efficiently imposed during training, (4) characterizing the values of M for which the capacity of the function class of an ensemble of M base KFL models strictly increases, and showing that with a large enough M an ensemble can express every multilinear interpolated lattice function, and (5) experimental results on both public and proprietary datasets demonstrating that KFL has accuracy and evaluation speeds comparable to or better than other lattice methods while requiring less training time and fewer parameters.

2 NOTATION AND OVERVIEW OF MONOTONIC LATTICE REGRESSION

For n ∈ ℕ we denote by [n] the set {1, 2, ..., n}. For x ∈ ℝ^D and d ∈ [D], let x[d] denote the dth entry of x. We use e_{d,n} to denote the one-hot vector in {0, 1}^n with e_{d,n}[j] = 1 if j = d and e_{d,n}[j] = 0 otherwise. When n is clear from the context we write e_d. For two real vectors v, w of the same length we use v · w to denote their dot product. We write v ≥ w and v ≤ w if the respective inequality holds entrywise. We denote by 0 the zero vector in a real vector space whose dimension is clear from the context. For two sets S and T we use S × T to denote their Cartesian product and S \ T to denote the set of elements in S but not in T. We denote by |S| the cardinality of S. For w ∈ ℝ^V, we use fpwl(x; w) to denote the 1-dimensional continuous piecewise-linear function fpwl(·; w) : [0, V−1] → ℝ whose graph has V−1 linear segments, where for i = 1, ..., V−1, the ith segment connects the point (i−1, w[i]) with (i, w[i+1]). See Figure 1 for an example. Observe that for any α, β ∈ ℝ and vectors w, v ∈ ℝ^V, we have fpwl(·; αw + βv) = α fpwl(·; w) + β fpwl(·; v).

Table 1: Table of major notation

| Symbol | Description |
| D | Number of features |
| e_{d,n} | "1-hot" n-dimensional vector with 1 in the dth entry |
| fpwl | 1-dimensional piecewise-linear function |
| fmll | Multilinear lattice function |
| fsl | Simplex lattice function |
| fkfl | KFL function |
| V | D-dimensional size of lattice |
| M_V | Set of vertices of lattice of size V |
| D_V | Domain of a lattice model |
| Φ | Interpolation kernel |
| θ | Vector of lattice parameters |
| ⊗_i v_i | Outer / Kronecker product of vectors v_i |
| KFL(V, M) | Class of ensembles of M KFL functions of size V |
| r(T) | Rank of tensor T |
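To make the definition of fpwl concrete, here is a minimal NumPy sketch (our own illustration, not code released with the paper) that evaluates a 1-dimensional piecewise-linear function with uniformly spaced knots at 0, 1, ..., V−1; the example reproduces Figure 1.

```python
import numpy as np

def f_pwl(x, w):
    """Evaluate fpwl(x; w): piecewise-linear with knots at 0, 1, ..., len(w)-1.

    w[i] (0-indexed here) is the value at knot i; x is assumed to lie in
    [0, len(w) - 1].
    """
    w = np.asarray(w, dtype=float)
    lo = int(np.clip(np.floor(x), 0, len(w) - 2))  # left knot of the active segment
    frac = x - lo                                  # position within that segment
    return (1.0 - frac) * w[lo] + frac * w[lo + 1]

# Reproduces Figure 1, f(x) = fpwl(x; [0 1 0.5]):
print(f_pwl(0.5, [0.0, 1.0, 0.5]))  # 0.5  (between the knot values 0 and 1)
print(f_pwl(1.5, [0.0, 1.0, 0.5]))  # 0.75 (between the knot values 1 and 0.5)
```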
Lattice models, proposed in Garcia et al. (2012), are interpolated look-up tables whose parameters are the function values sampled on a regular grid. They have been shown to admit efficient training procedures such that the resulting model is guaranteed to satisfy various types of shape constraints, including monotonicity and convexity (Gupta et al., 2016; 2018; Cotter et al., 2019), without severely restricting the model class. See Figure 2 for an example.

Figure 1: f(x) = fpwl(x; [0 1 0.5]).
Figure 2: A 2×2 multilinear interpolated lattice with four parameters defining a monotonically increasing function in both dimensions.

More precisely, a D-dimensional lattice of size V ∈ ℕ^D consists of a regular D-dimensional grid of look-up table vertices M_V = {0, 1, ..., V[1]−1} × ... × {0, 1, ..., V[D]−1}. Thus, V[d] is the number of vertices of the lattice in the dth dimension, and the grid has ∏_{d=1}^D V[d] vertices. We denote by D_V ⊂ ℝ^D the domain of the lattice model, given by D_V = [0, V[1]−1] × ... × [0, V[D]−1]. Typically, each feature d is "calibrated" so that its value lies in [0, V[d]−1], using the piecewise-linear calibration technique of Gupta et al. (2016) (see also Section 3.5). The lattice parameters are a vector θ ∈ ℝ^{M_V}, where ℝ^{M_V} denotes the set of vectors of length |M_V| with entries indexed by M_V: for each vertex v ∈ M_V there is a corresponding parameter, which we denote by θ_v. The lattice model or function is obtained by interpolating the parameters using one of many possible interpolation schemes specified by an interpolation kernel Φ : D_V → ℝ^{M_V}. Given Φ, the resulting lattice model f : D_V → ℝ is defined by

f(x; θ) = θ · Φ(x) = ∑_{v ∈ M_V} θ_v (Φ(x))_v,    (1)

where (Φ(x))_v denotes the entry of Φ(x) with index v. Typically, one requires that for all u, v ∈ M_V, (Φ(v))_u = 1 if u = v and 0 otherwise, so that f(v; θ) = θ_v. In Gupta et al. (2016) the authors present two interpolation schemes, termed multilinear and simplex. We denote the corresponding kernels by Φ_V^multilinear and Φ_V^simplex and the resulting lattice models by fmll(·; θ) and fsl(·; θ), respectively. We give the definition of Φ_V^multilinear here since we make use of it later, but refer the reader to Gupta et al. (2016) for the definition of simplex interpolation:

(Φ_V^multilinear(x))_v = ∏_{j ∈ [D]} fpwl(x[j]; e_{v[j]+1, V[j]}),  for v ∈ M_V.    (2)

For both multilinear and simplex interpolated lattice functions, the resulting function is increasing (resp. decreasing) in a given direction if and only if the function is increasing (resp. decreasing) on the grid's vertices in that direction (Gupta et al., 2016, Lemmas 1, 3). As a result, training a lattice model with these interpolation schemes to respect monotonicity constraints can be done by solving a constrained optimization problem with linear inequality constraints (Gupta et al., 2016). Evaluating the lattice function at a single point can be done in O(2^D) time for multilinear interpolation and in O(D log D) time for simplex interpolation (Gupta et al., 2016). Both interpolation schemes require O(∏_d V[d]) space for storing the parameters (assuming a constant number of bits per parameter).
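For a concrete sense of where the O(2^D) evaluation cost comes from, the following sketch (our own illustration, not the TensorFlow Lattice implementation) performs multilinear interpolation on a full lattice; note that every evaluation loops over all 2^D corners of the cell containing x.

```python
import itertools
import numpy as np

def multilinear_lattice(x, theta):
    """Multilinear interpolation of lattice parameters theta at a point x.

    theta is a D-dimensional array of shape V (one parameter per vertex);
    x has one coordinate per dimension, each in [0, V[d] - 1].
    """
    x = np.asarray(x, dtype=float)
    V = np.array(theta.shape)
    lo = np.minimum(np.floor(x).astype(int), V - 2)  # lower corner of the enclosing cell
    frac = x - lo
    value = 0.0
    # Every evaluation sums over the 2^D corners of that cell.
    for corner in itertools.product([0, 1], repeat=len(V)):
        weight = np.prod([f if c else 1.0 - f for c, f in zip(corner, frac)])
        value += weight * theta[tuple(lo + np.array(corner))]
    return value

# The 2x2 lattice of Figure 2: one parameter per corner.
theta = np.array([[0.0, 0.2], [0.4, 1.0]])
print(multilinear_lattice([0.5, 0.5], theta))  # blend of all four corner parameters
```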
In the next section we present a sub-class of the set of multilinear interpolation lattice functions, which we term KFL(V). Each function in this class can be evaluated in O(D) time and requires only O(∑_d V[d]) space for its parameters. Moreover, like the multilinear and simplex interpolated lattice function classes, there is a necessary and sufficient condition for a function in KFL(V) to be monotonic in a subset of its inputs, and this condition can be efficiently checked. As a result, one can train models in KFL(V) that are guaranteed to be monotonic in a prescribed set of features.

3 KRONECKER-FACTORED LATTICE

The Kronecker product of two matrices A = (a_{i,j}) ∈ ℝ^{m×n} and B = (b_{i,j}) ∈ ℝ^{p×q} is defined as the mp × nq real block matrix

A ⊗ B = [ a_{11}B ... a_{1n}B ; ... ; a_{m1}B ... a_{mn}B ].

The Kronecker product is associative and generally non-commutative; however, A ⊗ B and B ⊗ A are permutation equivalent. We extend the definition of the Kronecker product to vectors by regarding them as matrices with a single row or column. For s vectors v_i ∈ ℝ^{n_i}, we use the notation ⊗_{i=1}^s v_i = v_1 ⊗ ... ⊗ v_s. Let I = {0, ..., n_1−1} × ... × {0, ..., n_s−1}. We index the entries of ⊗_{i=1}^s v_i by I, so that the entry with index (i_1, ..., i_s) ∈ I is ∏_{j=1}^s v_j[i_j + 1]. The Kronecker product satisfies the mixed-product property: for four matrices A ∈ ℝ^{m×n}, B ∈ ℝ^{p×q}, C ∈ ℝ^{n×r}, D ∈ ℝ^{q×s} it holds that (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD), where for two matrices M, N we denote their conventional product by MN. Consequently, for vectors a, c ∈ ℝ^n and b, d ∈ ℝ^q it holds that

(a ⊗ b) · (c ⊗ d) = (a · c) ⊗ (b · d) = (a · c)(b · d),    (4)

where the rightmost equality follows since the Kronecker product of scalars reduces to a simple product. See Van Loan (2000) for more details on Kronecker products.

We are now ready to define the class KFL(V). Let V ∈ ℕ^D and fix a lattice of size V. The class KFL(V) is the subclass of all multilinear interpolation lattice functions whose parameters are Kronecker products of D factors. Precisely,

KFL(V) = { fmll(·; ⊗_{d=1}^D w_d) : w_d ∈ ℝ^{V[d]} for all d }.

We reparameterize the functions in KFL(V) and use fkfl(·; w_1, ..., w_D) = fmll(·; ⊗_{d=1}^D w_d) to denote the function with parameters {w_d}_d. The following proposition shows that functions in KFL(V) are products of 1-dimensional piecewise-linear functions. It essentially follows from the fact that Φ_V^multilinear(x) can be expressed as a Kronecker product of D vectors.

Proposition 1. Let V ∈ ℕ^D. For all w_1 ∈ ℝ^{V[1]}, ..., w_D ∈ ℝ^{V[D]},

fkfl(x; w_1, ..., w_D) = ∏_d fpwl(x[d]; w_d)  for all x ∈ D_V.    (5)

Proof of Proposition 1. For v ∈ ℕ, let Ψ_v : ℝ → ℝ^v be given by Ψ_v(x) = [fpwl(x; e_{1,v}) fpwl(x; e_{2,v}) ... fpwl(x; e_{v,v})]. Let w_1, ..., w_D be vectors satisfying the conditions in the proposition. By the definition of KFL(V), we have fkfl(·; w_1, ..., w_D) = fmll(·; ⊗_{d=1}^D w_d). Fix Φ = Φ_V^multilinear. It follows from (2) that for all x ∈ D_V we have Φ(x) = ⊗_{d=1}^D Ψ_{V[d]}(x[d]). Therefore, using (4) we get, for all x ∈ D_V,

fkfl(x; w_1, ..., w_D) = fmll(x; ⊗_{d=1}^D w_d) = (⊗_{d=1}^D w_d) · (⊗_{d=1}^D Ψ_{V[d]}(x[d])) = ∏_{d=1}^D (w_d · Ψ_{V[d]}(x[d])).    (6)

Now, for each d ∈ [D] we have

w_d · Ψ_{V[d]}(x[d]) = ∑_i w_d[i] fpwl(x[d]; e_{i,V[d]}) = fpwl(x[d]; ∑_i w_d[i] e_{i,V[d]}) = fpwl(x[d]; w_d).    (7)

Substituting (7) into (6), we obtain (5).

It follows from Proposition 1 that evaluating a function in KFL(V) requires evaluating D piecewise-linear functions with uniformly spaced knots and computing their product. This can be done in O(D) time. The storage complexity is O(# parameters) = O(∑_d V[d]).
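The sketch below (again our own illustration) evaluates fkfl as a product of D one-dimensional interpolations, and numerically checks Proposition 1 by comparing against multilinear interpolation of the corresponding Kronecker-product parameter tensor; SciPy's RegularGridInterpolator stands in for the full multilinear lattice here.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def kfl_eval(x, ws):
    """fkfl(x; w_1, ..., w_D) = prod_d fpwl(x[d]; w_d), evaluated in O(D) time.

    ws is a list of D weight vectors; w_d holds the values at the knots
    0, 1, ..., V[d]-1, so each 1-D piecewise-linear evaluation is np.interp.
    """
    return float(np.prod([np.interp(xd, np.arange(len(w)), w)
                          for xd, w in zip(x, ws)]))

# Numerical check of Proposition 1 for a random example with V = (2, 3, 4):
rng = np.random.default_rng(0)
sizes = (2, 3, 4)
ws = [rng.normal(size=v) for v in sizes]
theta = np.einsum('i,j,k->ijk', *ws)  # Kronecker (outer) product of the factors
full_lattice = RegularGridInterpolator([np.arange(v) for v in sizes], theta)
x = [0.7, 1.3, 2.2]
print(kfl_eval(x, ws))         # product of D one-dimensional interpolations
print(float(full_lattice(x)))  # multilinear interpolation of the full tensor; same value
```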
3.1 MONOTONICITY CRITERIA FOR KFL

Let i ∈ {1, ..., D}. We say that a function f : D → ℝ, where D ⊆ ℝ^D, is increasing (resp. decreasing) in direction i in D if for every x ∈ D and real δ ≥ 0 such that x + δe_i ∈ D, it holds that f(x) ≤ f(x + δe_i) (resp. f(x) ≥ f(x + δe_i)). Note that a function that does not depend on x[i] is thus "trivially" increasing in direction i. In this section we disregard such edge cases and require a function increasing in the ith direction to strictly increase on at least one pair of inputs. To simplify the exposition, we only deal with increasing functions in this section; all the results transfer naturally to the decreasing case as well.

The next proposition gives a sufficient and necessary condition for a function in KFL(V) to be increasing in a subset of its features in D_V. In Section 3.5 we explain how to use this result to train models that are guaranteed to be monotonic.

Proposition 2. Fix V ∈ ℕ^D, let f ∈ KFL(V), and let i_1, ..., i_p ∈ [D] be distinct direction indices for p ≤ D. Then f is increasing in directions i_1, ..., i_p iff f can be written as

f(x) = σ ∏_d fpwl(x[d]; w_d),    (8)

where σ ∈ {+1, −1} and w_d ∈ ℝ^{V[d]}, d ∈ [D], are vectors satisfying the following 3 conditions:
1. If p ≥ 2 then w_d ≥ 0 for all d ∈ [D].
2. If p = 1 then w_d ≥ 0 for all d ∈ [D] \ {i_1}.
3. For all j ∈ [p], σ w_{i_j}[1] ≤ σ w_{i_j}[2] ≤ ... ≤ σ w_{i_j}[V[i_j]].

See Appendix A for the proof.

3.2 CRITERIA FOR OTHER SHAPE CONSTRAINTS

While the focus of this paper is on learning flexible monotonic functions, other shape constraints, namely convexity and nonnegativity, can easily be imposed on KFL. This section presents propositions giving sufficient and necessary conditions for these constraints. More shape constraints beyond these are left for future work.

We say that a function f : D → ℝ is convex (resp. concave) in direction i in D if for every x ∈ D and real δ ≥ 0 such that x, x + δe_i ∈ D, we have f(tx + (1−t)(x + δe_i)) ≤ t f(x) + (1−t) f(x + δe_i) (resp. f(tx + (1−t)(x + δe_i)) ≥ t f(x) + (1−t) f(x + δe_i)) for all t ∈ [0, 1]. As with monotonicity, we require a function convex (resp. concave) in direction i to strictly satisfy the inequality for some x, δ, and t (i.e., we disregard functions that are affine in direction i in D). The following proposition gives sufficient and necessary criteria for a function to be convex in a subset of directions; analogous criteria can be derived for concavity.

Proposition 3. Fix V ∈ ℕ^D, let f ∈ KFL(V), and let i_1, ..., i_p ∈ [D] be distinct direction indices for p ≤ D. Then f is convex in directions i_1, ..., i_p iff f can be written as

f(x) = σ ∏_d fpwl(x[d]; w_d),    (9)

where σ ∈ {+1, −1} and w_d ∈ ℝ^{V[d]}, d ∈ [D], are vectors satisfying the following 3 conditions:
1. If p ≥ 2 then w_d ≥ 0 for all d ∈ [D].
2. If p = 1 then w_d ≥ 0 for all d ∈ [D] \ {i_1}.
3. For all j ∈ [p], either V[i_j] = 2, or V[i_j] > 2 and σ(w_{i_j}[k+2] − w_{i_j}[k+1]) ≥ σ(w_{i_j}[k+1] − w_{i_j}[k]) for all k ∈ [V[i_j] − 2].

See Appendix A for the proof.

Proposition 4. Fix V ∈ ℕ^D and let f ∈ KFL(V). Then f is nonnegative (resp. strictly positive) iff f can be written as f(x) = ∏_d fpwl(x[d]; w_d), where w_d ≥ 0 (resp. w_d > 0) for all d ∈ [D].

See Appendix A for the proof.
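As a concrete reading of these criteria, the following sketch (our own, 0-indexed) checks the three conditions of Proposition 2 for a candidate sign σ and factor vectors; an analogous check for Proposition 3 would replace the sortedness test on σ w_{i_j} with a test on its second differences.

```python
import numpy as np

def satisfies_proposition_2(sigma, ws, increasing_dims):
    """Check conditions 1-3 of Proposition 2 for f(x) = sigma * prod_d fpwl(x[d]; w_d).

    sigma is +1 or -1, ws is the list of factor vectors w_1, ..., w_D, and
    increasing_dims lists the directions i_1, ..., i_p (0-indexed here).
    """
    p = len(increasing_dims)
    for d, w in enumerate(ws):
        needs_nonneg = p >= 2 or (p == 1 and d != increasing_dims[0])
        if needs_nonneg and np.any(np.asarray(w) < 0):   # conditions 1 and 2
            return False
    for d in increasing_dims:                            # condition 3
        if np.any(np.diff(sigma * np.asarray(ws[d])) < 0):
            return False
    return True

# p = 1: the single monotonic factor may change sign, the others must be nonnegative.
print(satisfies_proposition_2(+1, [[-1.0, 0.0, 2.0], [0.5, 0.3]], [0]))      # True
# p = 2: every factor must now be nonnegative, so the same weights fail.
print(satisfies_proposition_2(+1, [[-1.0, 0.0, 2.0], [0.3, 0.5]], [0, 1]))   # False
```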
3.3 ENSEMBLES OF KFL MODELS

It follows from Proposition 2 that the set of monotonic functions in KFL(V) is fairly restricted: for example, a function in KFL(V) that is increasing in two or more directions cannot change sign in D_V. To make the model class larger, we use ensembles of sub-models in KFL(V) and take the average or the sum of their predictions as the composite model's prediction. We are thus motivated to define the class of functions that can be expressed as sums of M KFL models:

KFL(V, M) = { ∑_{i=1}^M g_i : g_i ∈ KFL(V) }.    (10)

Since the zero function with domain D_V is in KFL(V), we have KFL(V, M) ⊆ KFL(V, M+1). Let L(V) denote the set of all multilinear interpolation lattice functions on a grid of size V. By construction, L(V) is closed under addition, so we have KFL(V) = KFL(V, 1) ⊆ KFL(V, 2) ⊆ ... ⊆ L(V). Moreover, the following proposition shows that there exists a positive integer M such that for all m < M there are functions that can be expressed as a sum of m+1 KFL(V) models but not as a sum of m KFL(V) models, and every function in L(V) can be expressed as a sum of M functions in KFL(V).

To state the proposition, we require the following definitions on tensors, which are multidimensional generalizations of matrices. There are multiple ways to define a real tensor; here we view a real tensor of size V as a D-dimensional V[1] × ... × V[D] array of real numbers. The set of such tensors is denoted by ℝ^{V[1]} ⊗ ... ⊗ ℝ^{V[D]}. We index the entries of tensors of size V by M_V. Addition of two tensors of the same size and multiplication of a tensor by a real scalar are defined entrywise. Observe that each element of ℝ^{M_V} can naturally be regarded as a tensor of size V and vice versa. The outer product of D vectors w_i ∈ ℝ^{V[i]}, i ∈ [D], is their Kronecker product ⊗_{i=1}^D w_i regarded as a tensor of size V, and we use the same notation for the outer product as for the Kronecker product. A real tensor is called simple if it is the outer product of some D real vectors. Every tensor T can be expressed as a sum of simple tensors, and the rank of T, denoted r(T), is the minimum number of simple tensors that sum to T. The rank of a tensor of size V is at most |M_V| / max_i{V[i]}. See Rabanser et al. (2017) for an introduction to tensors.

Proposition 5. For two sets A, B we use A ⊊ B to denote that A is a subset of B but A ≠ B. Let M = max{r(T) : T ∈ ℝ^{V[1]} ⊗ ... ⊗ ℝ^{V[D]}}. Then KFL(V, 1) ⊊ KFL(V, 2) ⊊ ... ⊊ KFL(V, M) = L(V).

See Appendix A for the proof.

3.4 DEPENDENCE OF THE CAPACITY OF THE MODEL CLASS ON LATTICE SIZE

In the previous section, we showed that the capacity of KFL(V, M) increases as M increases up to a certain threshold. One may also ask how the capacity of KFL(V, M) depends on V. Unfortunately, the domain D_V of a member of KFL(V, M) is different for different vectors V, which introduces a technical difficulty when trying to compare capacity across different lattice sizes. In practice, we follow the technique of Gupta et al. (2016) and use a "calibration" transformation c : ℝ^D → D_V to map the "raw" input x into the lattice domain D_V. Here, c(x) = (c_1(x[1]), ..., c_D(x[D])), and each c_d : ℝ → [0, V[d]−1] is a learned piecewise-linear function with a fixed, prespecified number of linear segments. See Gupta et al. (2016) for more details. Let C(V) denote the set of all calibration transformations whose image is in D_V. For a positive integer M, we define the calibrated Kronecker-Factored Lattice model class CKFL by:

CKFL(V, M) = { f(c(·)) : f ∈ KFL(V, M), c ∈ C(V) }.

The following proposition shows that the capacity of CKFL(V, M) increases as V increases.

Proposition 6. Let V_1, V_2 ∈ ℕ^D be two lattice sizes and M a positive integer. If V_1 ≤ V_2, then CKFL(V_1, M) ⊆ CKFL(V_2, M).

See Appendix A for the proof.
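In tensor terms, an ensemble of M base KFL models parameterizes the lattice tensor as a sum of M simple tensors (a rank-at-most-M, CP-style factorization), and Proposition 5 says every lattice tensor is reached once M matches the maximal tensor rank. The sketch below (our own illustration; SciPy's RegularGridInterpolator again stands in for full multilinear interpolation) confirms that such an ensemble agrees with multilinear interpolation of the summed tensor.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

rng = np.random.default_rng(1)
sizes = (2, 3, 4)  # lattice size V
M = 3              # number of base KFL models in the ensemble

# Base model m has factors w_1^(m), ..., w_D^(m); the ensemble's implicit lattice
# parameters are the sum of the M simple (Kronecker-product) tensors.
factors = [[rng.normal(size=v) for v in sizes] for _ in range(M)]
theta = sum(np.einsum('i,j,k->ijk', *ws) for ws in factors)

def ensemble_eval(x):
    """Sum over base models of the product of 1-D piecewise-linear factors."""
    return sum(np.prod([np.interp(xd, np.arange(len(w)), w)
                        for xd, w in zip(x, ws)])
               for ws in factors)

# The ensemble agrees with multilinear interpolation of the summed tensor,
# i.e. every KFL(V, M) function is an ordinary lattice function in L(V).
full_lattice = RegularGridInterpolator([np.arange(v) for v in sizes], theta)
x = [0.25, 1.8, 2.6]
print(ensemble_eval(x), float(full_lattice(x)))
```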
3.5 MODEL TRAINING DETAILS

To ensure that the input to a lattice model lies in its domain, we follow the input calibration techniques explained in Gupta et al. (2016). For numeric inputs, we use a learned one-dimensional piecewise-linear function to transform the feature's input domain to [0, V[d]−1]. For categorical inputs, we learn a one-dimensional embedding that maps each category value to a real value in [0, V[d]−1]. Calibrators can be efficiently trained to be monotonic.

Figure 3: KFL model architecture.

Our model architecture for KFL is depicted in Figure 3 and defined by the following equation:

f(x) = β + (1/M) ∑_{m=1}^M γ_m fkfl(c(x); w_1^(m), ..., w_D^(m)),

where c is the input calibration layer described above (note that c has its own learned parameters, which we do not show here to simplify the notation), and γ_m, β ∈ ℝ are additional scaling and bias parameters, respectively. To guarantee that the final model is monotonic in a given feature, it is sufficient to impose monotonicity on its calibrator and on each of the terms in the sum.

We train our model by minimizing the empirical risk subject to the monotonicity constraints using projected mini-batch gradient descent. Following Proposition 2, after every gradient descent step we project each w_d^(m) to be nonnegative where required. Then, for each monotonic feature d, we project w_d^(m) so that its components increase if σ = 1 and decrease if σ = −1. Computing this projection is an instance of the isotonic regression problem and can be solved efficiently using the pool-adjacent-violators algorithm (Barlow et al., 1972). In our experiments, we instead use a more efficient approximation algorithm that computes a vector satisfying the constraints, but which may not be the closest (in L2 norm) to the original vector. See Appendix B for the exact algorithm we use.
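Putting Section 3.5 together, here is a minimal NumPy sketch of the forward pass and the per-step projection. It is our own illustration rather than the TensorFlow Lattice implementation: the calibrators are fixed to the identity for brevity, σ is taken to be 1, and the cumulative-max step is a simplified stand-in for the projection of Appendix B.

```python
import numpy as np

class KFLEnsemble:
    """f(x) = beta + (1/M) * sum_m gamma_m * prod_d fpwl(c_d(x[d]); w_d^(m))."""

    def __init__(self, sizes, M, monotonic_dims, seed=0):
        rng = np.random.default_rng(seed)
        self.W = [[rng.uniform(0.1, 1.0, size=v) for v in sizes] for _ in range(M)]
        self.gamma = np.ones(M)
        self.beta = 0.0
        self.monotonic_dims = monotonic_dims

    def __call__(self, x):
        # Identity calibration for brevity; a learned piecewise-linear calibrator
        # would first map each raw feature into [0, V[d] - 1].
        terms = [g * np.prod([np.interp(xd, np.arange(len(w)), w)
                              for xd, w in zip(x, ws)])
                 for g, ws in zip(self.gamma, self.W)]
        return self.beta + float(np.mean(terms))

    def project(self):
        """Projection applied after each gradient step (Proposition 2 with sigma = 1)."""
        for ws in self.W:
            for w in ws:
                np.maximum(w, 0.0, out=w)               # clip every factor to be nonnegative
            for d in self.monotonic_dims:
                np.maximum.accumulate(ws[d], out=ws[d]) # simple stand-in for Algorithm 1

model = KFLEnsemble(sizes=(4, 4, 4), M=5, monotonic_dims=[0, 2])
model.project()
print(model([1.2, 0.3, 2.7]))
```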
4 EXPERIMENTS

We present experimental results on three datasets, summarized in Table 2. The first is the public Adult Income dataset (Dheeru & Karra Taniskidou, 2017) with the same monotonicity constraint setup described in Canini et al. (2016). The other two datasets are provided by a large internet services company, with monotonic features specified by product groups as domain knowledge or policy requirements.

Table 2: Experiment dataset summary

| Dataset | Type | # Features | # Monotonic | # Train | # Test |
| Adult Income | Classification | 14 | 4 | 32,561 | 16,282 |
| Query Result Matching | Regression | 14 | 11 | 805,660 | 201,415 |
| User Query Intent | Classification | 24 | 16 | 522,302 | 130,576 |

We compare KFL against two previously proposed baseline lattice models: (1) a calibrated multilinear interpolation lattice and (2) a calibrated simplex interpolation lattice, both using the input calibration layer described in Section 3.5 and Gupta et al. (2016). We note that we do not compare KFL to deep lattice networks (DLN) (You et al., 2017) in this paper. In a DLN, KFL would act as a replacement for the lattice layers, which comprise only a subset of the model's layers; we believe such a comparison would be dominated by the other layers and would not properly show how KFL differs from previously proposed lattice models.

All our models were implemented using TensorFlow (Abadi et al., 2015) and TensorFlow Lattice (Google AI Blog, 2017). Open-source code for KFL has been pushed to the TensorFlow Lattice 2.0 library and can be downloaded at github.com/tensorflow/lattice. We used logistic loss for classification and mean squared error (MSE) loss for regression. For each experiment, we train for 100 epochs with a batch size of 256 using the Adam optimizer and validate the learning rate from {0.001, 0.01, 0.1, 1.0} with five-fold cross-validation. We use a lattice size with the same entry V for each dimension, tuning V from {2, 4, 8} and different settings of M from [1, 100] for KFL (we note that increasing V above 2 for our baselines was not feasible on our datasets). The results were averaged over 10 independent runs. The train and evaluation times were measured on a workstation with 6 Intel Xeon W-2135 CPUs. We report the models with the best cross-validation results, as well as some other settings for comparison.

4.1 UCI ADULT INCOME

For the Adult Income dataset (Dheeru & Karra Taniskidou, 2017), the goal is to predict whether income is greater than or equal to $50,000. We followed the setup described in Canini et al. (2016), specifying capital gain, weekly hours of work, education level, and the gender wage gap as monotonically increasing features. The results are summarized in Table 3 and Figure 4.

Table 3: UCI Adult Income results

| Model | M | V | Train Acc. | Test Acc. | Train Time (s) | Eval Time (µs) | # Parameters |
| Multilinear | - | 2 | 85.60 ± 0.02 | 85.51 ± 0.05 | 333 | 3700 | 16,552 |
| Simplex | - | 2 | 86.71 ± 0.21 | 85.72 ± 0.15 | 151 | 2200 | 16,552 |
| KFL | 1 | 4 | 86.31 ± 0.06 | 86.05 ± 0.07 | 70 | 1800 | 226 |
| KFL | 2 | 8 | 86.51 ± 0.02 | 86.24 ± 0.05 | 85 | 2100 | 395 |

4.2 QUERY RESULT MATCHING

This experiment is a regression problem of predicting the matching quality score of a result to a query. The results are summarized in Table 4 and Figure 5.

Table 4: Query Result Matching results

| Model | M | V | Train MSE | Test MSE | Train Time (s) | Eval Time (µs) | # Parameters |
| Multilinear | - | 2 | 0.5694 ± 0.0010 | 0.5702 ± 0.0009 | 4606 | 2835 | 16,512 |
| Simplex | - | 2 | 0.5718 ± 0.0066 | 0.5732 ± 0.0066 | 2150 | 1952 | 16,512 |
| KFL | 25 | 8 | 0.5728 ± 0.0054 | 0.5735 ± 0.0055 | 870 | 1847 | 2954 |
| KFL | 50 | 4 | 0.5700 ± 0.0037 | 0.5707 ± 0.0038 | 868 | 1839 | 2979 |

4.3 USER QUERY INTENT

For this real-world problem, the goal is to classify the user intent into one of two classes. For the baseline multilinear and simplex models, a single lattice with all 24 features is infeasible. When the number of features is too large for a single lattice, we follow the technique described in Canini et al. (2016), which uses an ensemble of lattice sub-models, each taking a small subset of the calibrated input features; the output of the model is the average of the sub-models' outputs, and the calibrators are shared among all lattices. Here we use a random tiny lattice ensemble of L lattices, each seeing a random subset of 10 of the 24 features. We tune and set L = 100. The results are summarized in Table 5 and Figure 6.
Table 5: User Query Intent results

| Model | M | V | Train Acc. | Test Acc. | Train Time (s) | Eval Time (µs) | # Parameters |
| Multilinear | - | 2 | 69.48 ± 0.02 | 69.20 ± 0.01 | 13,559 | 15,932 | 102,619 |
| Simplex | - | 2 | 69.48 ± 0.03 | 69.34 ± 0.02 | 4478 | 6559 | 102,619 |
| KFL | 100 | 4 | 69.53 ± 0.04 | 69.35 ± 0.04 | 1842 | 2534 | 9920 |
| KFL | 100 | 8 | 69.74 ± 0.08 | 69.53 ± 0.07 | 3507 | 2560 | 19,520 |

5 DISCUSSION

Across our experiments we see very similar results: (1) the accuracy or MSE of KFL is comparable to or slightly better than the baselines; (2) the training time is significantly reduced; (3) the number of parameters is significantly reduced; (4) KFL can use larger lattice sizes V because the number of parameters scales linearly rather than as a power of D; (5) increasing M increases the capacity of the KFL model class, and the value of M that makes KFL expressive enough to perform well is relatively small (going above such an M no longer provides additional value); and (6) the evaluation speed of KFL relative to multilinear and simplex interpolation aligns with our theoretical runtime analysis: KFL is significantly faster than multilinear interpolation (particularly when the number of features is large, e.g., Experiment 4.3) and comparable to simplex interpolation.

Figure 4: UCI Adult Income.
Figure 5: Query Result Matching.
Figure 6: User Query Intent.
Figure 7: Comparison of test results of KFL with different M and V values.

Furthermore, each experiment highlights different qualities of KFL. Experiment 4.1 shows that increasing V can sometimes be more important than increasing M, and such an increase can potentially lead to further boosts in performance with only a linear impact on the model's time and space efficiency; this result follows from Proposition 6. In Figure 4, we see that a small M acts like a form of regularization: KFL's accuracy drops as M gets too high, indicating that we start to overfit and lose this regularization effect as the size of the model class increases. Experiment 4.2 shows that KFL can also perform well on regression tasks but will likely need larger M and V for more complex problems; it also shows a trade-off between the values of M and V in which computation time, number of parameters, and MSE remain essentially fixed. Experiment 4.3 demonstrates KFL's ability to create higher-order feature interactions by keeping all features in a single lattice: unlike multilinear and simplex interpolated lattice functions, KFL can handle many more features in a single lattice, and it can achieve even faster evaluation than the ensembling method described by Canini et al. (2016). Experiments 4.2 and 4.3 also show a logarithmic-like improvement of performance with respect to M, indicating that the M needed to achieve comparable accuracy could become impractically large for more complex problems; however, as Proposition 6 shows, there is a useful trade-off between M and V, where we can instead increase V to compensate and ultimately achieve comparable or better performance with a reasonable M.

6 CONCLUSION

We proposed KFL, which reparameterizes monotonic lattice regression using Kronecker factorization to achieve significant improvements in parameter and computational efficiency. We proved sufficient and necessary conditions to efficiently impose multiple shape constraints during training on a KFL model with respect to a subset of its features.
We showed that through ensembling the KFL function class strictly increases as more base KFL models are added, becoming the class of all multilinear interpolated lattice functions when the number of base models is sufficiently large. Our experimental results demonstrated its practical advantage over the multilinear and simplex interpolated lattice models in terms of speed and storage while maintaining accuracy. For future work, it would be interesting to (1) explore applications of KFL in more complex architectures such as deep lattice networks (You et al., 2017) and (2) understand whether we can impose other useful shape constraints (Cotter et al., 2019; Gupta et al., 2018).

REFERENCES

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Y. S. Abu-Mostafa. A method for learning from hints. In Advances in Neural Information Processing Systems, 1993.

R. E. Barlow, D. J. Bartholomew, and J. M. Bremner. Statistical Inference under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, 1972.

K. Canini, A. Cotter, M. M. Fard, M. R. Gupta, and J. Pfeifer. Fast and flexible monotonic functions with ensembles of lattices. In Advances in Neural Information Processing Systems, 2016.

A. Cotter, M. R. Gupta, H. Jiang, E. Louidor, J. Muller, T. Narayan, S. Wang, and T. Zhu. Shape constraints for set functions. In Proceedings of the 36th International Conference on Machine Learning, 2019.

H. Daniels and M. Velikova. Monotone and partially monotone neural networks. IEEE Transactions on Neural Networks, 21(6):906-917, 2010.

D. Dheeru and E. Karra Taniskidou. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.

C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia. Incorporating functional knowledge in neural networks. Journal of Machine Learning Research, 10:1239-1262, June 2009. ISSN 1532-4435.

E. K. Garcia, R. Arora, and M. R. Gupta. Optimized regression for efficient function evaluation. IEEE Transactions on Image Processing, 21(9):4128-4140, 2012.

T. George, C. Laurent, X. Bouthillier, N. Ballas, and P. Vincent. Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. In Advances in Neural Information Processing Systems, 2018.

Google AI Blog. TensorFlow Lattice: Flexibility empowered by prior knowledge, 2017. URL https://ai.googleblog.com/2017/10/tensorflow-lattice-flexibility.html.

M. R. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. Van Esbroeck. Monotonic calibrated interpolated look-up tables. Journal of Machine Learning Research, 17(109):1-47, 2016. URL http://jmlr.org/papers/v17/15-243.html.

M. R. Gupta, D. Bahri, A. Cotter, and K. Canini. Diminishing returns shape constraints for interpretability and regularization. In Advances in Neural Information Processing Systems, 2018.

C. Jose, M. Cisse, and F. Fleuret. Kronecker recurrent units. In Proceedings of the 35th International Conference on Machine Learning, 2018.
C. F. Van Loan. The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics, 123(1):85-100, 2000. ISSN 0377-0427. URL http://www.sciencedirect.com/science/article/pii/S0377042700003939.

J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

K. Osawa, Y. Tsuji, Y. Ueno, A. Naruse, R. Yokota, and S. Matsuoka. Large-scale distributed second-order optimization using Kronecker-factored approximate curvature for deep convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, June 2019.

Y. Qu and B. Hu. Generalized constraint neural network regression model subject to linear priors. IEEE Transactions on Neural Networks, 22(12):2447-2459, 2011.

S. Rabanser, O. Shchur, and S. Günnemann. Introduction to tensor decompositions and their applications in machine learning. arXiv, abs/1711.10781, 2017.

J. Sill. Monotonic networks. In Advances in Neural Information Processing Systems, 1997.

S. You, K. Canini, D. Ding, J. Pfeifer, and M. R. Gupta. Deep lattice networks and partial monotonic functions. In Advances in Neural Information Processing Systems, 2017.

X. Zhang, F. X. Yu, R. Guo, S. Kumar, S. Wang, and S.-F. Chang. Fast orthogonal projection based on Kronecker product. In International Conference on Computer Vision, 2015.

SUPPLEMENTAL MATERIAL

A PROOFS

Proposition 1. Let V ∈ ℕ^D. For all w_1 ∈ ℝ^{V[1]}, ..., w_D ∈ ℝ^{V[D]},

fkfl(x; w_1, ..., w_D) = ∏_d fpwl(x[d]; w_d)  for all x ∈ D_V.    (5)

Proof of Proposition 1. For v ∈ ℕ, let Ψ_v : ℝ → ℝ^v be given by Ψ_v(x) = [fpwl(x; e_{1,v}) fpwl(x; e_{2,v}) ... fpwl(x; e_{v,v})]. Let w_1, ..., w_D be vectors satisfying the conditions in the proposition. By the definition of KFL(V), we have fkfl(·; w_1, ..., w_D) = fmll(·; ⊗_{d=1}^D w_d). Fix Φ = Φ_V^multilinear. It follows from (2) that for all x ∈ D_V we have Φ(x) = ⊗_{d=1}^D Ψ_{V[d]}(x[d]). Therefore, using (4) we get, for all x ∈ D_V,

fkfl(x; w_1, ..., w_D) = fmll(x; ⊗_{d=1}^D w_d) = (⊗_{d=1}^D w_d) · (⊗_{d=1}^D Ψ_{V[d]}(x[d])) = ∏_{d=1}^D (w_d · Ψ_{V[d]}(x[d])).    (6)

Now, for each d ∈ [D] we have

w_d · Ψ_{V[d]}(x[d]) = ∑_i w_d[i] fpwl(x[d]; e_{i,V[d]}) = fpwl(x[d]; ∑_i w_d[i] e_{i,V[d]}) = fpwl(x[d]; w_d).    (7)

Substituting (7) into (6), we obtain (5).

Proposition 2. Fix V ∈ ℕ^D, let f ∈ KFL(V), and let i_1, ..., i_p ∈ [D] be distinct direction indices for p ≤ D. Then f is increasing in directions i_1, ..., i_p iff f can be written as

f(x) = σ ∏_d fpwl(x[d]; w_d),    (8)

where σ ∈ {+1, −1} and w_d ∈ ℝ^{V[d]}, d ∈ [D], are vectors satisfying the following 3 conditions:
1. If p ≥ 2 then w_d ≥ 0 for all d ∈ [D].
2. If p = 1 then w_d ≥ 0 for all d ∈ [D] \ {i_1}.
3. For all j ∈ [p], σ w_{i_j}[1] ≤ σ w_{i_j}[2] ≤ ... ≤ σ w_{i_j}[V[i_j]].

To prove the proposition we shall make use of the following lemma.

Lemma 1. Let f(x) = ∏_d fpwl(x[d]; w_d) be a function in KFL(V). If for some i ∈ [D], f is increasing in direction i in D_V, then for all d ∈ [D] \ {i}, either w_d ≥ 0 or w_d ≤ 0.

Proof of Lemma 1. Let f be as above, and assume to the contrary that for some j ∈ [D] \ {i} neither w_j ≥ 0 nor w_j ≤ 0 holds. Then there must exist s, t ∈ [V[j]] such that w_j[s] < 0 and w_j[t] > 0. For d ∈ [D], let ℓ_d(x) = fpwl(x; w_d). Then ℓ_j(s−1) = w_j[s] < 0 and ℓ_j(t−1) = w_j[t] > 0.
By our assumption there exist x ∈ D_V and δ > 0 such that f(x + δe_i) − f(x) > 0 (otherwise f(x + δe_i) = f(x) for all x and δ, implying that f(x) does not depend on x[i], which we do not regard as increasing in direction i). Now, for any r ∈ [0, V[j]−1],

f(x + (r − x[j])e_j + δe_i) − f(x + (r − x[j])e_j) = (ℓ_i(x[i] + δ) − ℓ_i(x[i])) ℓ_j(r) ∏_{d ∈ [D] \ {i,j}} ℓ_d(x[d]) = (f(x + δe_i) − f(x)) ℓ_j(r) / ℓ_j(x[j]).

Since for some r ∈ {s−1, t−1} it holds that ℓ_j(r)/ℓ_j(x[j]) < 0, it follows from the last equation that for that r, f(x + (r − x[j])e_j + δe_i) < f(x + (r − x[j])e_j), which contradicts our assumption that f is increasing in direction i.

Proof of Proposition 2. We first show the "if" direction. Assume that f can be written as above and let j ∈ [p]. Then f is the product of two terms: σ fpwl(x[i_j]; w_{i_j}) and ∏_{d ≠ i_j} fpwl(x[d]; w_d). Condition 3 implies the first term is increasing in direction i_j. The second term does not depend on x[i_j], and conditions 1 and 2 imply that it is nonnegative over D_V. It follows that f, the product of the two terms, is increasing in direction i_j.

We next show the "only if" direction. To simplify the notation, we only prove this for the case p ≥ 2; a similar argument shows the proposition holds for p = 1 as well. Let f ∈ KFL(V) be increasing in directions i_1, ..., i_p in D_V. By Proposition 1 there exist D vectors w'_d ∈ ℝ^{V[d]}, d ∈ [D], such that f(x) = ∏_d fpwl(x[d]; w'_d). Let w_d be w'_d if w'_d ≥ 0, and −w'_d otherwise. By Lemma 1 applied to directions i_1 and i_2, it follows that for every d ∈ [D], w_d ≥ 0. Since for any scalar α ∈ ℝ, fpwl(x[d]; α w'_d) = α fpwl(x[d]; w'_d), it follows that f(x) = σ ∏_d fpwl(x[d]; w_d), where σ = (−1)^{|{d : w'_d ≤ 0}|}. Therefore condition 1 holds.

As for condition 3, let j ∈ [p] and assume to the contrary that σ w_{i_j}[s] > σ w_{i_j}[s+1] for some s ∈ [V[i_j] − 1]. Since f is not identically zero in D_V (as we do not consider the zero function increasing in direction i_j), for each d ∈ [D] there must be x_d ∈ [0, V[d]−1] such that fpwl(x_d; w_d) ≠ 0. Since w_d ≥ 0, it follows that fpwl(x_d; w_d) > 0. Let x ∈ ℝ^D be the vector with x[d] = x_d for d ≠ i_j and x[i_j] = s − 1. Then we have f(x) = σ fpwl(s−1; w_{i_j}) ∏_{d ≠ i_j} fpwl(x_d; w_d) = σ w_{i_j}[s] ∏_{d ≠ i_j} fpwl(x_d; w_d) > σ w_{i_j}[s+1] ∏_{d ≠ i_j} fpwl(x_d; w_d) = f(x + e_{i_j}), contradicting our assumption that f is increasing in direction i_j. Therefore condition 3 holds as well.

Proposition 3. Fix V ∈ ℕ^D, let f ∈ KFL(V), and let i_1, ..., i_p ∈ [D] be distinct direction indices for p ≤ D. Then f is convex in directions i_1, ..., i_p iff f can be written as

f(x) = σ ∏_d fpwl(x[d]; w_d),    (9)

where σ ∈ {+1, −1} and w_d ∈ ℝ^{V[d]}, d ∈ [D], are vectors satisfying the following 3 conditions:
1. If p ≥ 2 then w_d ≥ 0 for all d ∈ [D].
2. If p = 1 then w_d ≥ 0 for all d ∈ [D] \ {i_1}.
3. For all j ∈ [p], either V[i_j] = 2, or V[i_j] > 2 and σ(w_{i_j}[k+2] − w_{i_j}[k+1]) ≥ σ(w_{i_j}[k+1] − w_{i_j}[k]) for all k ∈ [V[i_j] − 2].

To prove the proposition we shall make use of the following lemma.

Lemma 2. Let f(x) = ∏_d fpwl(x[d]; w_d) be a function in KFL(V). If for some i ∈ [D], f is convex in direction i in D_V, then for all d ∈ [D] \ {i}, either w_d ≥ 0 or w_d ≤ 0.

Proof of Lemma 2. Let f be as above, and assume to the contrary that for some j ∈ [D] \ {i} neither w_j ≥ 0 nor w_j ≤ 0. Then there must exist s, q ∈ [V[j]] such that w_j[s] < 0 and w_j[q] > 0. For d ∈ [D], let ℓ_d(x[d]) = fpwl(x[d]; w_d). Then ℓ_j(s−1) = w_j[s] < 0 and ℓ_j(q−1) = w_j[q] > 0.
By our assumption there exist x ∈ D_V, δ > 0, and t ∈ [0, 1] such that t f(x) + (1−t) f(x + δe_i) − f(tx + (1−t)(x + δe_i)) > 0 (otherwise this expression equals 0 for all x, δ, and t, implying that f(x) is affine in direction i, which we do not regard as convex in direction i). To further simplify notation, we set ∏_{d ≠ i,j} ℓ_d(x[d]) = z_{d≠i,j} and ∏_{d ≠ i} ℓ_d(x[d]) = z_{d≠i}. Now, for any r ∈ [0, V[j]−1], we have:

t f(x + (r − x[j])e_j) + (1−t) f(x + (r − x[j])e_j + δe_i) − f(t(x + (r − x[j])e_j) + (1−t)(x + (r − x[j])e_j + δe_i))
= t ℓ_i(x[i]) ℓ_j(r) z_{d≠i,j} + (1−t) ℓ_i(x[i] + δ) ℓ_j(r) z_{d≠i,j} − ℓ_i(t x[i] + (1−t)(x[i] + δ)) ℓ_j(r) z_{d≠i,j}
= (t ℓ_i(x[i]) z_{d≠i} + (1−t) ℓ_i(x[i] + δ) z_{d≠i} − ℓ_i(t x[i] + (1−t)(x[i] + δ)) z_{d≠i}) ℓ_j(r) / ℓ_j(x[j])
= (t f(x) + (1−t) f(x + δe_i) − f(tx + (1−t)(x + δe_i))) ℓ_j(r) / ℓ_j(x[j]).

Since for some r ∈ {s−1, q−1} it holds that ℓ_j(r)/ℓ_j(x[j]) < 0, it follows from the last equation that for that r, f(t(x + (r − x[j])e_j) + (1−t)(x + (r − x[j])e_j + δe_i)) > t f(x + (r − x[j])e_j) + (1−t) f(x + (r − x[j])e_j + δe_i), which contradicts our assumption that f is convex in direction i (for the point x + (r − x[j])e_j).

Proof of Proposition 3. We first show the "if" direction. Assume that f can be written as above and let j ∈ [p]. Then f is the product of two terms: σ fpwl(x[i_j]; w_{i_j}) and ∏_{d ≠ i_j} fpwl(x[d]; w_d). Condition 3 implies the first term is convex in direction i_j. The second term does not depend on x[i_j], and conditions 1 and 2 imply that it is nonnegative over D_V. It follows that f, the product of the two terms, is convex in direction i_j.

We next show the "only if" direction. To simplify the notation, we only prove this for the case p ≥ 2; a similar argument shows the proposition holds for p = 1 as well. Let f ∈ KFL(V) be convex in directions i_1, ..., i_p in D_V. By Proposition 1 there exist D vectors w'_d ∈ ℝ^{V[d]}, d ∈ [D], such that f(x) = ∏_d fpwl(x[d]; w'_d). Let w_d be w'_d if w'_d ≥ 0, and −w'_d otherwise. By Lemma 2 applied to directions i_1 and i_2, it follows that for every d ∈ [D], w_d ≥ 0. Since for any scalar α ∈ ℝ, fpwl(x[d]; α w'_d) = α fpwl(x[d]; w'_d), it follows that f(x) = σ ∏_d fpwl(x[d]; w_d), where σ = (−1)^{|{d : w'_d ≤ 0}|}. Therefore condition 1 holds.

As for condition 3, let j ∈ [p] and assume to the contrary that V[i_j] > 2 but σ(w_{i_j}[k+2] − w_{i_j}[k+1]) < σ(w_{i_j}[k+1] − w_{i_j}[k]) for some k ∈ [V[i_j] − 2]. Since f is not identically zero in D_V (as we do not consider the zero function convex in direction i_j), for each d ∈ [D] there must be x_d ∈ [0, V[d]−1] such that fpwl(x_d; w_d) ≠ 0. Since w_d ≥ 0, it follows that fpwl(x_d; w_d) > 0. Now let t = 0.5 and consider inputs x_1, x_2 ∈ ℝ^D such that x_1[d] = x_2[d] = x_d for all d ≠ i_j, x_1[i_j] = k−1, and x_2[i_j] = k+1. We first decompose f(x) = σ fpwl(x[i_j]; w_{i_j}) ∏_{d ≠ i_j} fpwl(x[d]; w_d). Next, by construction, we note that ∏_{d ≠ i_j} fpwl(x_1[d]; w_d) = ∏_{d ≠ i_j} fpwl(x_2[d]; w_d) = ∏_{d ≠ i_j} fpwl((t x_1 + (1−t) x_2)[d]; w_d) = z > 0, and since f is convex in direction i_j, we have:

σ z fpwl(t x_1[i_j] + (1−t) x_2[i_j]; w_{i_j}) ≤ σ z t fpwl(x_1[i_j]; w_{i_j}) + σ z (1−t) fpwl(x_2[i_j]; w_{i_j})
σ fpwl(0.5(k−1) + 0.5(k+1); w_{i_j}) ≤ 0.5 (σ fpwl(k−1; w_{i_j}) + σ fpwl(k+1; w_{i_j}))
2 σ w_{i_j}[k+1] ≤ σ w_{i_j}[k] + σ w_{i_j}[k+2]
σ(w_{i_j}[k+1] − w_{i_j}[k]) ≤ σ(w_{i_j}[k+2] − w_{i_j}[k+1]),

which contradicts our assumption.

Proposition 4. Fix V ∈ ℕ^D and let f ∈ KFL(V). Then f is nonnegative (resp. strictly positive) iff f can be written as f(x) = ∏_d fpwl(x[d]; w_d), where w_d ≥ 0 (resp. w_d > 0) for all d ∈ [D].

Proof of Proposition 4 (nonnegative case). We first show the "if" direction. Assume that f can be written as above. Then for each d ∈ [D], since w_d ≥ 0, it follows that fpwl(x[d]; w_d) ≥ 0.
Therefore f, the product of D nonnegative piecewise-linear functions, is nonnegative.

We next show the "only if" direction. By Proposition 1 there exist D vectors w'_d ∈ ℝ^{V[d]}, d ∈ [D], such that f(x) = ∏_d fpwl(x[d]; w'_d). If for some i ∈ [D] neither w'_i ≥ 0 nor w'_i ≤ 0, then there must exist s, t ∈ [0, V[i]−1] such that fpwl(s; w'_i) < 0 and fpwl(t; w'_i) > 0. It follows that there exists x ∈ D_V such that either fpwl(s; w'_i) ∏_{d ≠ i} fpwl(x[d]; w'_d) < 0 or fpwl(t; w'_i) ∏_{d ≠ i} fpwl(x[d]; w'_d) < 0, so f is not nonnegative (if for some j ∈ [D] \ {i} we have w'_j = 0, then f is still nonnegative, but in this edge case f can trivially be written as f(x) = ∏_d fpwl(x[d]; w_d) with w_d = 0). Hence, for all d ∈ [D], either w'_d ≥ 0 or w'_d ≤ 0. Now, let w_d be w'_d if w'_d ≥ 0, and −w'_d otherwise. Since for any scalar α ∈ ℝ, fpwl(x[d]; α w'_d) = α fpwl(x[d]; w'_d), it follows that f(x) = σ ∏_d fpwl(x[d]; w_d), where σ = (−1)^{|{d : w'_d ≤ 0}|}. Since ∏_d fpwl(x[d]; w_d) ≥ 0, for f to be nonnegative |{d : w'_d ≤ 0}| must be even, which implies σ = 1. Therefore a nonnegative f can be written as f(x) = ∏_d fpwl(x[d]; w_d), where w_d ≥ 0 for all d ∈ [D].

Proof of Proposition 4 (strictly positive case). We first show the "if" direction. Assume that f can be written as above. Then for each d ∈ [D], since w_d > 0, it follows that fpwl(x[d]; w_d) > 0. Therefore f, the product of D strictly positive piecewise-linear functions, is strictly positive.

We next show the "only if" direction. By Proposition 1 there exist D vectors w'_d ∈ ℝ^{V[d]}, d ∈ [D], such that f(x) = ∏_d fpwl(x[d]; w'_d). If for some i ∈ [D] neither w'_i > 0 nor w'_i < 0, then there must exist x[i] such that fpwl(x[i]; w'_i) = 0, and thus f is not strictly positive. Hence, for all d ∈ [D], either w'_d > 0 or w'_d < 0. Now, let w_d be w'_d if w'_d > 0, and −w'_d otherwise. Since for any scalar α ∈ ℝ, fpwl(x[d]; α w'_d) = α fpwl(x[d]; w'_d), it follows that f(x) = σ ∏_d fpwl(x[d]; w_d), where σ = (−1)^{|{d : w'_d < 0}|}. Since ∏_d fpwl(x[d]; w_d) > 0, for f to be strictly positive |{d : w'_d < 0}| must be even, which implies σ = 1. Therefore a strictly positive f can be written as f(x) = ∏_d fpwl(x[d]; w_d), where w_d > 0 for all d ∈ [D].

Proposition 5. For two sets A, B we use A ⊊ B to denote that A is a subset of B but A ≠ B. Let M = max{r(T) : T ∈ ℝ^{V[1]} ⊗ ... ⊗ ℝ^{V[D]}}. Then KFL(V, 1) ⊊ KFL(V, 2) ⊊ ... ⊊ KFL(V, M) = L(V).

Proof. We first show KFL(V, M) = L(V). Since L(V) is closed under addition, KFL(V, M) ⊆ L(V), so it suffices to show that L(V) ⊆ KFL(V, M). For θ ∈ ℝ^{M_V}, denote by fmll(·; θ) the multilinear interpolation lattice function with parameters θ. Let fmll(·; θ) ∈ L(V). Regard θ as a tensor of size V and set r = r(θ). Thus, there exist rD vectors w_j^(i) ∈ ℝ^{V[j]}, i ∈ [r], j ∈ [D], such that ∑_{i ∈ [r]} ⊗_{j=1}^D w_j^(i) = θ. Thus, from (1), it follows that fmll(·; θ) = fmll(·; ∑_{i ∈ [r]} ⊗_{j=1}^D w_j^(i)) = ∑_{i ∈ [r]} fmll(·; ⊗_{j=1}^D w_j^(i)). Now, by the definition of KFL(V), for each i ∈ [r], fmll(·; ⊗_{j=1}^D w_j^(i)) ∈ KFL(V), and by the definition of M in the proposition it holds that r ≤ M. Since the zero function is in KFL(V), it follows that fmll(·; θ) = ∑_{i ∈ [r]} fmll(·; ⊗_{j=1}^D w_j^(i)) ∈ KFL(V, M).

It remains to show that for each m ∈ [M−1], KFL(V, m) ⊊ KFL(V, m+1). Since the zero function is in KFL(V), KFL(V, m) ⊆ KFL(V, m+1). Assume to the contrary that KFL(V, m+1) = KFL(V, m), and let θ ∈ ℝ^{M_V} be a tensor of size V and rank M. Then

θ = ∑_{i ∈ [M]} ⊗_{j=1}^D w_j^(i),    (11)

for some MD vectors w_j^(i) ∈ ℝ^{V[j]}, with j ∈ [D] and i ∈ [M]. From the definition of KFL(V, m+1), it follows that ∑_{i ∈ [m+1]} fmll(·; ⊗_{j=1}^D w_j^(i)) ∈ KFL(V, m+1), and thus by our assumption it is also in KFL(V, m).
Therefore there exist mD vectors u_j^(i) ∈ ℝ^{V[j]}, with j ∈ [D] and i ∈ [m], such that ∑_{i ∈ [m+1]} fmll(·; ⊗_{j=1}^D w_j^(i)) = ∑_{i ∈ [m]} fmll(·; ⊗_{j=1}^D u_j^(i)), or equivalently, fmll(·; ∑_{i ∈ [m+1]} ⊗_{j=1}^D w_j^(i)) = fmll(·; ∑_{i ∈ [m]} ⊗_{j=1}^D u_j^(i)). Since the mapping θ ↦ fmll(·; θ) is one-to-one (as for all v ∈ M_V, f(v; θ) = θ_v, where θ_v is the entry of θ with index v), it follows from the last equality that ∑_{i ∈ [m+1]} ⊗_{j=1}^D w_j^(i) = ∑_{i ∈ [m]} ⊗_{j=1}^D u_j^(i). Therefore we may replace the sum of the first m+1 terms in (11) with the sum of the m simple tensors ∑_{i ∈ [m]} ⊗_{j=1}^D u_j^(i), obtaining a sum of M−1 simple tensors that equals θ, contradicting the assumption that r(θ) = M.

Proposition 6. Let V_1, V_2 ∈ ℕ^D be two lattice sizes and M a positive integer. If V_1 ≤ V_2, then CKFL(V_1, M) ⊆ CKFL(V_2, M).

Proof. Assume V_1 ≤ V_2 and let g ∈ CKFL(V_1, M). We need to show that g ∈ CKFL(V_2, M) as well. By our assumption, g = f(c(·)) for some f ∈ KFL(V_1, M) and c ∈ C(V_1). Thus, there exist vectors w_d^(m) for m ∈ [M], d ∈ [D], such that f = ∑_m fkfl(·; w_1^(m), ..., w_D^(m)). For each such m and d, define w'_d^(m) ∈ ℝ^{V_2[d]} by w'_d^(m)[j] = w_d^(m)[j] for j ∈ [V_1[d]] and w'_d^(m)[j] = 0 for j ∈ [V_2[d]] \ [V_1[d]]. Let f' ∈ KFL(V_2, M) be the function given by f' = ∑_m fkfl(·; w'_1^(m), ..., w'_D^(m)). Since V_1 ≤ V_2, we have D_{V_1} ⊆ D_{V_2}, and thus c ∈ C(V_2) as well. Therefore, by the definition of CKFL, we have f'(c(·)) ∈ CKFL(V_2, M). Now, for each m ∈ [M], it follows from (5) and the construction of the vectors w'_d^(m) that for all y ∈ D_{V_1} we have fkfl(y; w'_1^(m), ..., w'_D^(m)) = fkfl(y; w_1^(m), ..., w_D^(m)). Consequently, for all y ∈ D_{V_1}, f(y) = f'(y). Since c(x) ∈ D_{V_1} for all x ∈ ℝ^D, we get that g = f(c(·)) = f'(c(·)). Therefore g ∈ CKFL(V_2, M).

B APPROXIMATION ALGORITHM FOR MONOTONICITY PROJECTION

We use Algorithm 1 to map a vector w ∈ ℝ^D to a close (in L2 norm) vector w' that satisfies w'[1] ≤ ... ≤ w'[D]. The vector w' may not be the closest such vector to w, but in our experiments the algorithm worked well enough as the projection step in the projected SGD optimization, and it is faster than the pool-adjacent-violators algorithm. We use an analogous algorithm to map w to a close vector with monotonically decreasing components. In the description of the algorithm we denote by |v| the L2 norm of a real vector v.

Algorithm 1: Approximation algorithm for simply ordered, uniformly weighted isotonic regression

  input:  a vector w ∈ ℝ^D
  output: a vector w' ∈ ℝ^D such that w'[1] ≤ ... ≤ w'[D] and |w' − w| is "small"
  1:  v ← w
  2:  // Make v satisfy the constraints by increasing components.
  3:  for i ← 2 to D do
  4:      v[i] ← max{v[i], v[i−1]}
  5:  end for
  6:  w' ← (w + v) / 2
  7:  // Make w' satisfy the constraints by decreasing components.
  8:  for i ← D−1 down to 1 do
  9:      w'[i] ← min{w'[i], w'[i+1]}
  10: end for
  11: return w'
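For convenience, here is a direct NumPy transcription of Algorithm 1 as reconstructed above (our own sketch, not the library code); in particular, the averaging step corresponds to line 6 of the reconstruction.

```python
import numpy as np

def approx_isotonic_projection(w):
    """Approximately project w onto {u : u[1] <= u[2] <= ... <= u[D]} (Algorithm 1).

    A forward cumulative-max pass, an averaging step, and a backward
    cumulative-min pass; the output always satisfies the ordering constraint
    but may not be the L2-closest such vector.
    """
    w = np.asarray(w, dtype=float)
    v = np.maximum.accumulate(w)                    # raise entries to enforce the order
    half = 0.5 * (w + v)                            # stay close to the original vector
    return np.minimum.accumulate(half[::-1])[::-1]  # lower entries from the right

print(approx_isotonic_projection([3.0, 1.0, 2.0, 0.5]))  # nondecreasing output
```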