# Private Isotonic Regression

Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi
Google Research, Mountain View, CA, US
badihghazi@gmail.com, pritish@alum.mit.edu, ravi.k53@gmail.com, pasin@google.com

In this paper, we consider the problem of differentially private (DP) algorithms for isotonic regression. For the most general problem of isotonic regression over a partially ordered set (poset) $X$ and for any Lipschitz loss function, we obtain a pure-DP algorithm that, given $n$ input points, has an expected excess empirical risk of roughly $\mathrm{width}(X) \cdot \log|X| / n$, where $\mathrm{width}(X)$ is the width of the poset. In contrast, we also obtain a near-matching lower bound of roughly $(\mathrm{width}(X) + \log|X|)/n$, which holds even for approximate-DP algorithms. Moreover, we show that the above bounds are essentially the best that can be obtained without utilizing any further structure of the poset. In the special case of a totally ordered set and for $\ell_1$ and $\ell_2^2$ losses, our algorithm can be implemented in near-linear running time; we also provide extensions of this algorithm to the problem of private isotonic regression with additional structural constraints on the output function.

1 Introduction

Isotonic regression is a basic primitive in statistics and machine learning, which has been studied at least since the 1950s [4, 9]; see also the textbooks on the topic [5, 38]. It has seen applications in numerous fields including medicine [31, 39], where the expression of an antigen is modeled as a monotone function of the DNA index and WBC count, and education [19], where isotonic regression was used to predict college GPA using high school GPA and standardized test scores. Isotonic regression is also arguably the most common non-parametric method for calibrating machine learning models [51], including modern neural networks [23].

In this paper, we study isotonic regression with a differential privacy (DP) constraint on the output model. DP [17, 16] is a highly popular notion of privacy for algorithms and machine learning primitives, and has seen increased adoption due to its powerful guarantees and properties [37, 43]. Despite the plethora of work on DP statistics and machine learning (see Section 5 for related work), ours is, to the best of our knowledge, the first to study DP isotonic regression. In fact, we consider the most general version of the isotonic regression problem.

We first set up some notation to describe our results. Let $(X, \preceq)$ be any partially ordered set (poset). A function $f : X \to [0, 1]$ is monotone if and only if $f(x) \le f(x')$ for all $x \preceq x'$. For brevity, we use $F(X, Y)$ to denote the set of all monotone functions from $X$ to $Y$; throughout, we consider $Y \subseteq [0, 1]$. Let $[n]$ denote $\{1, \dots, n\}$. Given an input dataset $D = \{(x_1, y_1), \dots, (x_n, y_n)\} \in (X \times [0, 1])^n$, let the empirical risk of a function $f : X \to [0, 1]$ be $\mathcal{L}(f; D) := \frac{1}{n} \sum_{i \in [n]} \ell(f(x_i), y_i)$, where $\ell : [0, 1] \times [0, 1] \to \mathbb{R}$ is a loss function.

We study private isotonic regression in the basic machine learning framework of empirical risk minimization. Specifically, the goal of the isotonic regression problem, given a dataset $D = \{(x_1, y_1), \dots, (x_n, y_n)\} \in (X \times [0, 1])^n$, is to find a monotone function $f : X \to [0, 1]$ that minimizes $\mathcal{L}(f; D)$. The excess empirical risk of a function $f$ is defined as $\mathcal{L}(f; D) - \mathcal{L}(f^*; D)$, where $f^* := \mathrm{argmin}_{g \in F(X, Y)} \mathcal{L}(g; D)$.

1.1 Our Results

General Posets.
Our first contribution is to give nearly tight upper and lower bounds for any poset, based on its width, as stated below (see Section 4 for a formal definition).

Theorem 1 (Upper Bound for General Poset). Let $X$ be any finite poset and let $\ell$ be an $L$-Lipschitz loss function. For any $\varepsilon \in (0, 1]$, there is an $\varepsilon$-DP algorithm for isotonic regression for $\ell$ with expected excess empirical risk at most $O\!\left(\frac{L \cdot \mathrm{width}(X) \cdot \log|X| \cdot (1 + \log^2(\varepsilon n))}{\varepsilon n}\right)$.

Theorem 2 (Lower Bound for General Poset; Informal). For any $\varepsilon \in (0, 1]$ and any $\delta < 0.01 \cdot \varepsilon / |X|$, any $(\varepsilon, \delta)$-DP algorithm for isotonic regression for a nice loss function $\ell$ must have expected excess empirical risk $\Omega\!\left(\frac{\mathrm{width}(X) + \log|X|}{\varepsilon n}\right)$.

While our upper and lower bounds do not exactly match because of the multiplication-vs-addition of $\log|X|$, we show in Section 4.3 that there are posets for which each bound is tight. In other words, this gap cannot be closed for generic posets.

Totally Ordered Sets. The above upper and lower bounds immediately translate to the case of totally ordered sets, by plugging in $\mathrm{width}(X) = 1$. More importantly, we give efficient algorithms in this case, which run in time $O(n^2 + n \log|X|)$ for a general loss function $\ell$, and in nearly linear $O(n \log|X|)$ time for the widely studied $\ell_2^2$- and $\ell_1$-losses.¹

Theorem 3. For all finite totally ordered sets $X$, $L$-Lipschitz loss functions $\ell$, and $\varepsilon \in (0, 1]$, there is an $\varepsilon$-DP algorithm for isotonic regression for $\ell$ with expected excess empirical risk $O\!\left(\frac{L \cdot \log|X| \cdot (1 + \log^2(\varepsilon n))}{\varepsilon n}\right)$. The running time of this algorithm is $O(n^2 + n \log|X|)$ in general and can be improved to $O(n \log|X|)$ for the $\ell_1$ and $\ell_2^2$ losses.

We are not aware of any prior work on private isotonic regression. A simple baseline algorithm for this problem would be to use the exponential mechanism over the set of all monotone functions taking values in a discretized set, to choose one with small loss. We show in Appendix A that this achieves an excess empirical risk of $O\!\left(L \sqrt{\mathrm{width}(X) \log|X| / (\varepsilon n)}\right)$, which is quadratically worse than the bound in Theorem 1. Moreover, even in the case of a totally ordered set, it is unclear how to implement such a mechanism efficiently.

We demonstrate the flexibility of our techniques by showing that they can be extended to variants of isotonic regression where, in addition to monotonicity, we also require $f$ to satisfy additional properties. For example, we may want $f$ to be $L_f$-Lipschitz for some specified $L_f > 0$. Other constraints we can handle include $k$-piecewise constant, $k$-piecewise linear, convexity, and concavity. For each of these constraints, we devise an algorithm that yields essentially the same error as in the unconstrained case and still runs in polynomial time.

Theorem 4. For all finite totally ordered sets $X$, $L$-Lipschitz loss functions $\ell$, and $\varepsilon \in (0, 1]$, there is an $\varepsilon$-DP algorithm for $k$-piecewise constant, $k$-piecewise linear, Lipschitz, convex, or concave isotonic regression for $\ell$ with expected excess empirical risk $O\!\left(\frac{L \cdot \log|X|}{\varepsilon n}\right)$. The running time of this algorithm is $(n|X|)^{O(1)}$.

Organization. We next provide necessary background on DP. In Section 3, we prove our results for totally ordered sets (including Theorem 3). We then move on to discuss general posets in Section 4. Section 5 contains additional related work. Finally, we conclude with some discussion in Section 6. Due to space constraints, we omit some proofs from the main body; these can be found in the Appendix.

¹ Recall that the $\ell_2^2$-loss is $\ell_2^2(y, y') = (y - y')^2$ and the $\ell_1$-loss is $\ell_1(y, y') = |y - y'|$.
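As a concrete non-private reference point for the problem just described, the $\ell_2^2$ empirical risk minimizer over a totally ordered domain can be computed with standard tools; the following minimal Python sketch uses scikit-learn's PAV-based solver (the synthetic data and parameter choices are ours, for illustration only and not part of the paper).

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Non-private baseline: fit argmin over monotone f of (1/n) * sum_i (f(x_i) - y_i)^2
# on a totally ordered domain X = [m]. The paper's contribution is to achieve this
# kind of objective under differential privacy.
rng = np.random.default_rng(0)
m, n = 100, 500
x = rng.integers(1, m + 1, size=n)                       # points in X = [m]
y = np.clip(x / m + 0.1 * rng.normal(size=n), 0.0, 1.0)  # labels in [0, 1]

iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=True).fit(x, y)
print("non-private l2^2 empirical risk:", np.mean((iso.predict(x) - y) ** 2))
```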
2 Background on Differential Privacy

Two datasets $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ and $D' = \{(x'_1, y'_1), \dots, (x'_n, y'_n)\}$ are said to be neighboring, denoted $D \sim D'$, if there is an index $i \in [n]$ such that $(x_j, y_j) = (x'_j, y'_j)$ for all $j \in [n] \setminus \{i\}$. We recall the formal definition of differential privacy [18, 16]:

Definition 5 (Differential Privacy (DP) [18, 16]). Let $\varepsilon > 0$ and $\delta \in [0, 1]$. A randomized algorithm $M : \mathcal{X}^n \to \mathcal{Y}$ is $(\varepsilon, \delta)$-differentially private ($(\varepsilon, \delta)$-DP) if, for all $D \sim D'$ and all (measurable) outcomes $S \subseteq \mathcal{Y}$, we have that $\Pr[M(D) \in S] \le e^{\varepsilon} \Pr[M(D') \in S] + \delta$.

We denote $(\varepsilon, 0)$-DP as $\varepsilon$-DP (aka pure-DP). The case where $\delta > 0$ is referred to as approximate-DP. We will use the following composition theorems throughout our proofs.

Lemma 6. $(\varepsilon, \delta)$-DP satisfies the following:
- Basic Composition: If mechanisms $M_1, \dots, M_t$ are such that $M_i$ satisfies $(\varepsilon_i, \delta_i)$-DP, then the composed mechanism $(M_1(D), \dots, M_t(D))$ satisfies $\left(\sum_i \varepsilon_i, \sum_i \delta_i\right)$-DP. This holds even under adaptive composition, where each $M_i$ can depend on the outputs of $M_1, \dots, M_{i-1}$.
- Parallel Composition [33]: If a mechanism $M$ satisfies $(\varepsilon, \delta)$-DP, then for any partition of $D = D_1 \cup \dots \cup D_t$, the composed mechanism given as $(M(D_1), \dots, M(D_t))$ satisfies $(\varepsilon, \delta)$-DP.

Exponential Mechanism. The exponential mechanism solves the basic task of selection in data analysis: given a dataset $D \in \mathcal{Z}^n$ and a set $A$ of options, it outputs the (approximately) best option, where "best" is defined by a scoring function $s : A \times \mathcal{Z}^n \to \mathbb{R}$. The $\varepsilon$-DP exponential mechanism [34] is the randomized mechanism $M : \mathcal{Z}^n \to A$ given by
$$\forall D \in \mathcal{Z}^n,\, a \in A : \quad \Pr[M(D) = a] \ \propto\ \exp\!\left(-\frac{\varepsilon}{2\Delta s} \cdot s(a, D)\right),$$
where $\Delta s := \sup_{D \sim D'} \max_{a \in A} |s(a, D) - s(a, D')|$ is the sensitivity of the scoring function.

Lemma 7 ([34]). For $M$ being the $\varepsilon$-DP exponential mechanism, it holds for all $D \in \mathcal{Z}^n$ that $\mathbb{E}[s(M(D), D)] \le \min_{a \in A} s(a, D) + O\!\left(\frac{\Delta s \cdot \log|A|}{\varepsilon}\right)$.

Lower Bound for Privatizing Vectors. Lower bounds for DP algorithms that output a binary vector close (say, in Hamming distance) to the input are well known.

Lemma 8 (e.g., [32]). Let $\varepsilon, \delta > 0$ and $m \in \mathbb{N}$, let the input domain be $\{0, 1\}^m$, and let two vectors $z, z' \in \{0, 1\}^m$ be neighbors if and only if $\|z - z'\|_0 \le 1$. Then, for any $(\varepsilon, \delta)$-DP algorithm $M : \{0, 1\}^m \to \{0, 1\}^m$, we have $\mathbb{E}_{z \sim \{0,1\}^m}[\|M(z) - z\|_0] \ge e^{-\varepsilon} \cdot m \cdot 0.5 \cdot (1 - \delta)$.

It is also simple to extend the lower bound to the case where the vector is not binary, as stated below. We defer the full proof to Appendix B.

Lemma 9. Let $D, m$ be positive integers such that $D \ge 2$, let the input domain be $[D]^m$, and let two vectors $z, z' \in [D]^m$ be neighbors if and only if $\|z - z'\|_0 \le 1$. Then, for any $(\ln(D/2), 0.25)$-DP algorithm $M : [D]^m \to [D]^m$, we have that $\mathbb{E}_{z \sim [D]^m}[\|M(z) - z\|_0] \ge \Omega(m)$.

Group Differential Privacy. For any neighboring relation $\sim$, we write $\sim_k$ for the neighboring relation where $D \sim_k D'$ iff there is a sequence $D = D_0, \dots, D_{k'} = D'$ for some $k' \le k$ such that $D_{i-1} \sim D_i$ for all $i \in [k']$.

Fact 10 (e.g., [41]). Let $\varepsilon > 0$, $\delta \ge 0$, and $k \in \mathbb{N}$. Suppose that $M$ is an $(\varepsilon, \delta)$-DP algorithm for the neighboring relation $\sim$. Then $M$ is $\left(k\varepsilon, \frac{e^{k\varepsilon} - 1}{e^{\varepsilon} - 1} \cdot \delta\right)$-DP for the neighboring relation $\sim_k$.
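To make the selection primitive above concrete, the following is a minimal Python sketch of the exponential mechanism in the convention used later in the paper, where lower scores are better (so the sampling weight is $\exp(-\varepsilon \cdot s(a, D)/(2\Delta s))$); the helper name and interface are ours, not the paper's.

```python
import numpy as np

def exponential_mechanism_min(scores, sensitivity, epsilon, rng=None):
    """Sample index a with Pr[a] proportional to exp(-epsilon * scores[a] / (2 * sensitivity)).

    This is the epsilon-DP exponential mechanism for a scoring function that is
    to be minimized (e.g., an unnormalized empirical loss)."""
    if rng is None:
        rng = np.random.default_rng()
    scores = np.asarray(scores, dtype=float)
    logits = -epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()   # numerical stabilization; does not change the distribution
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(scores), p=probs))

# Example: choose among four candidates whose (unnormalized) losses have sensitivity 1.
# idx = exponential_mechanism_min([3.0, 1.0, 0.5, 2.0], sensitivity=1.0, epsilon=1.0)
```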
3 DP Isotonic Regression over Total Orders

We first focus on the one-dimensional case where $X$ is totally ordered; for convenience, we assume that $X = [m]$ where the order is the natural order on integers. We first present an efficient algorithm for this case and then a matching lower bound.

3.1 An Efficient Algorithm

To describe our algorithm, it will be more convenient to use the unnormalized version of the empirical risk, which we define as $\mathcal{L}^{\mathrm{abs}}(f; D) := \sum_{(x, y) \in D} \ell(f(x), y)$. We now provide a high-level overview of our algorithm.

Any monotone function $f : [m] \to [0, 1]$ admits a (not necessarily unique) threshold $\alpha \in \{0\} \cup [m]$ such that $f(a) \ge 1/2$ for all $a > \alpha$ and $f(a) \le 1/2$ for all $a \le \alpha$. Our algorithm works by first choosing this threshold $\alpha$ in a private manner using the exponential mechanism. The choice of $\alpha$ partitions $[m]$ into $[m]_{>\alpha} := \{a \in [m] \mid a > \alpha\}$ and $[m]_{\le\alpha} := \{a \in [m] \mid a \le \alpha\}$. The algorithm recurses on these two parts to find functions $f_> : [m]_{>\alpha} \to [1/2, 1]$ and $f_\le : [m]_{\le\alpha} \to [0, 1/2]$, which are then glued to obtain the final function.

In particular, the algorithm proceeds in $T$ stages, where in stage $t$ the algorithm starts with a partition of $[m]$ into $2^t$ intervals $\{P_{i,t} \mid i \in \{0, \dots, 2^t - 1\}\}$, and the algorithm eventually outputs a monotone function $f$ such that $f(x) \in [i/2^t, (i+1)/2^t]$ for all $x \in P_{i,t}$. This partition is further refined for stage $t+1$ by choosing a threshold $\alpha_{i,t}$ in $P_{i,t}$ and partitioning $P_{i,t}$ into $P_{i,t}^{>\alpha_{i,t}}$ and $P_{i,t}^{\le\alpha_{i,t}}$. In the final stage, the function $f$ is chosen to be the constant $i/2^{T-1}$ over $P_{i,T-1}$. Note that the algorithm may stop at $T = \Theta_\varepsilon(\log n)$ because the Lipschitzness of $\ell$ ensures that assigning each part the constant $i/2^{T-1}$ cannot increase the error by more than $L/2^T \le O_\varepsilon(L/n)$.

We have already mentioned that each $\alpha_{i,t}$ has to be chosen in a private manner. However, if we let the scoring function be the unnormalized empirical risk directly, then its sensitivity remains as large as $L$ even at a large stage $t$. This is undesirable because the error from each run of the exponential mechanism can be as large as $O(L \log m)$, and there are as many as $2^t$ runs in stage $t$; adding these error terms up would result in a far larger total error than desired. To circumvent this, we observe that while the sensitivity can still be large, the large contributions are mostly ineffective because the function range is now restricted to an interval of length $1/2^t$. Indeed, we may use the following clipped version of the loss function, which has low sensitivity of $L/2^t$ instead.

Definition 11 (Clipped Loss Function). For a range $[\tau, \theta] \subseteq [0, 1]$, let $\ell_{[\tau,\theta]} : [\tau, \theta] \times [0, 1] \to \mathbb{R}$ be given as $\ell_{[\tau,\theta]}(\hat{y}, y) := \ell(\hat{y}, y) - \min_{y' \in [\tau, \theta]} \ell(y', y)$.

Similar to above, we also define $\mathcal{L}^{\mathrm{abs}}_{[\tau,\theta]}(f; D) := \sum_{(x, y) \in D} \ell_{[\tau,\theta]}(f(x), y)$. Observe that $\mathrm{range}(\ell_{[\tau,\theta]}) \subseteq [0, L \cdot (\theta - \tau)]$, since $\ell$ is $L$-Lipschitz. In other words, the sensitivity of $\mathcal{L}^{\mathrm{abs}}_{[\tau,\theta]}(f; D)$ is only $L \cdot (\theta - \tau)$. Algorithm 1 contains a full description.
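Before turning to the full pseudocode, here is a small illustrative sketch of Definition 11 for the $\ell_2^2$ loss, where the inner minimum over $y' \in [\tau, \theta]$ is attained at the label clipped to $[\tau, \theta]$; the function names are ours.

```python
import numpy as np

def sq_loss(y_hat, y):
    """The l2^2 loss on [0, 1] x [0, 1]."""
    return (y_hat - y) ** 2

def clipped_sq_loss(y_hat, y, tau, theta):
    """Clipped loss of Definition 11 instantiated for the l2^2 loss.

    Subtracts the best loss achievable by any prediction in [tau, theta]; for
    this convex loss that minimum is attained at clip(y, tau, theta), so the
    clipped loss always lies in [0, L * (theta - tau)]."""
    assert tau <= y_hat <= theta
    best = sq_loss(np.clip(y, tau, theta), y)
    return sq_loss(y_hat, y) - best
```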
Algorithm 1: DP Isotonic Regression for Totally Ordered Sets.
Input: $X = [m]$, dataset $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, DP parameter $\varepsilon$.
Output: Monotone function $f : [m] \to [0, 1]$.

  $T \leftarrow \lceil \log(\varepsilon n) \rceil$
  $\varepsilon' \leftarrow \varepsilon / T$
  $P_{0,0} \leftarrow [m]$
  for $t = 0, \dots, T - 1$ do
    for $i = 0, \dots, 2^t - 1$ do
      $D_{i,t} \leftarrow \{(x_j, y_j) \mid j \in [n],\, x_j \in P_{i,t}\}$   {Set of all input points whose $x$ belongs to $P_{i,t}$}
      {Notation: define $D_{i,t}^{\le\alpha} := \{(x, y) \in D_{i,t} \mid x \le \alpha\}$ and $D_{i,t}^{>\alpha}$ similarly}
      {Notation: define $P_{i,t}^{\le\alpha} := \{x \in P_{i,t} \mid x \le \alpha\}$ and $P_{i,t}^{>\alpha}$ similarly}
      Choose a threshold $\alpha_{i,t} \in \{0\} \cup P_{i,t}$ using the $\varepsilon'$-DP exponential mechanism with scoring function
        $\mathrm{score}_{i,t}(\alpha) := \min_{f_1 \in F(P_{i,t}^{\le\alpha},\, [\frac{i}{2^t}, \frac{i+0.5}{2^t}])} \mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t}, \frac{i+1}{2^t}]}(f_1; D_{i,t}^{\le\alpha}) \ +\ \min_{f_2 \in F(P_{i,t}^{>\alpha},\, [\frac{i+0.5}{2^t}, \frac{i+1}{2^t}])} \mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t}, \frac{i+1}{2^t}]}(f_2; D_{i,t}^{>\alpha})$
        {Note: $\mathrm{score}_{i,t}(\alpha)$ has sensitivity at most $L/2^t$.}
      $P_{2i,t+1} \leftarrow P_{i,t}^{\le\alpha_{i,t}}$ and $P_{2i+1,t+1} \leftarrow P_{i,t}^{>\alpha_{i,t}}$.
  Let $f : [m] \to [0, 1]$ be given as $f(x) = i/2^{T-1}$ for all $x \in P_{i,T-1}$ and all $i \in \{0, \dots, 2^{T-1} - 1\}$.
  return $f$

Proof of Theorem 3. Before proceeding to prove the algorithm's privacy and utility guarantees, we note that the output $f$ is indeed monotone: for every $x < x'$ that get separated when we partition $P_{i,t}$ into $P_{2i,t+1}, P_{2i+1,t+1}$, we must have $x \in P_{2i,t+1}$ and $x' \in P_{2i+1,t+1}$.

Privacy Analysis. Since the exponential mechanism is $\varepsilon'$-DP and the dataset is partitioned with the exponential mechanism applied to each part only once, the parallel composition property (Lemma 6) implies that the entire subroutine for each $t$ is $\varepsilon'$-DP. Thus, by basic composition (Lemma 6), it follows that Algorithm 1 is $\varepsilon$-DP (since $\varepsilon = \varepsilon' T$).

Utility Analysis. Since the sensitivity of $\mathrm{score}_{i,t}(\cdot)$ is at most $L/2^t$, we have from Lemma 7 that, for all $t \in \{0, \dots, T-1\}$ and $i \in \{0, 1, \dots, 2^t - 1\}$,
$$\mathbb{E}[\mathrm{score}_{i,t}(\alpha_{i,t})] - \min_{\alpha \in P_{i,t}} \mathrm{score}_{i,t}(\alpha) \ \le\ O\!\left(\frac{L \log|P_{i,t}|}{2^t \varepsilon'}\right). \tag{1}$$

Let $h_{i,t}$ denote $\mathrm{argmin}_{h \in F(P_{i,t}, [i/2^t, (i+1)/2^t])} \mathcal{L}^{\mathrm{abs}}(h; D_{i,t})$ (with ties broken arbitrarily). Then, let $\tilde{\alpha}_{i,t}$ denote the largest element in $P_{i,t}$ such that $h_{i,t}(\tilde{\alpha}_{i,t}) \le (i + 1/2)/2^t$; namely, $\tilde{\alpha}_{i,t}$ is the optimal threshold when restricted to $D_{i,t}$.

Under this notation, we have that
$$\mathrm{score}_{i,t}(\alpha_{i,t}) - \min_{\alpha \in P_{i,t}} \mathrm{score}_{i,t}(\alpha) \ \ge\ \mathrm{score}_{i,t}(\alpha_{i,t}) - \mathrm{score}_{i,t}(\tilde{\alpha}_{i,t})$$
$$\ge\ \mathcal{L}^{\mathrm{abs}}_{[i/2^t, (i+1)/2^t]}(h_{2i,t+1}; D_{2i,t+1}) + \mathcal{L}^{\mathrm{abs}}_{[i/2^t, (i+1)/2^t]}(h_{2i+1,t+1}; D_{2i+1,t+1}) - \mathcal{L}^{\mathrm{abs}}_{[i/2^t, (i+1)/2^t]}(h_{i,t}; D_{i,t})$$
$$=\ \mathcal{L}^{\mathrm{abs}}(h_{2i,t+1}; D_{2i,t+1}) + \mathcal{L}^{\mathrm{abs}}(h_{2i+1,t+1}; D_{2i+1,t+1}) - \mathcal{L}^{\mathrm{abs}}(h_{i,t}; D_{i,t}). \tag{2}$$

Finally, notice that
$$\mathcal{L}^{\mathrm{abs}}(f; D_{i,T-1}) - \mathcal{L}^{\mathrm{abs}}(h_{i,T-1}; D_{i,T-1}) \ \le\ \frac{L}{2^{T-1}} \cdot |D_{i,T-1}| \ =\ O\!\left(\frac{L \cdot |D_{i,T-1}|}{\varepsilon n}\right).$$

With all the ingredients ready, we may now bound the expected (unnormalized) excess risk:
$$\mathcal{L}^{\mathrm{abs}}(f; D) \ =\ \sum_{0 \le i < 2^{T-1}} \mathcal{L}^{\mathrm{abs}}(f; D_{i,T-1}) \tag{3}$$
$$\le\ \sum_{0 \le i < 2^{T-1}} \left( O\!\left(\frac{L \cdot |D_{i,T-1}|}{\varepsilon n}\right) + \mathcal{L}^{\mathrm{abs}}(h_{i,T-1}; D_{i,T-1}) \right)$$
$$=\ O(L/\varepsilon) + \sum_{0 \le i < 2^{T-1}} \mathcal{L}^{\mathrm{abs}}(h_{i,T-1}; D_{i,T-1})$$
$$=\ O(L/\varepsilon) + \mathcal{L}^{\mathrm{abs}}(h_{0,0}; D_{0,0}) + \sum_{t \in [T-1]} \sum_{0 \le i < 2^{t-1}} \left( \mathcal{L}^{\mathrm{abs}}(h_{2i,t}; D_{2i,t}) + \mathcal{L}^{\mathrm{abs}}(h_{2i+1,t}; D_{2i+1,t}) - \mathcal{L}^{\mathrm{abs}}(h_{i,t-1}; D_{i,t-1}) \right).$$

Taking the expectation on both sides and using (1) and (2) yields
$$\mathbb{E}[\mathcal{L}^{\mathrm{abs}}(f; D)] \ \le\ O(L/\varepsilon) + \mathcal{L}^{\mathrm{abs}}(h_{0,0}; D_{0,0}) + \sum_{t \in [T-1]} \sum_{0 \le i < 2^{t-1}} O\!\left(\frac{L \log m}{2^{t-1} \varepsilon'}\right)$$
$$=\ O(L/\varepsilon) + \mathcal{L}^{\mathrm{abs}}(f^*; D) + O\!\left(\frac{T^2 L \log m}{\varepsilon}\right) \ =\ \mathcal{L}^{\mathrm{abs}}(f^*; D) + O\!\left(\frac{L \log m \cdot (1 + \log^2(\varepsilon n))}{\varepsilon}\right).$$

Dividing both sides by $n$ yields the desired claim.

Running Time. To obtain a bound on the running time for general loss functions, we make a slight modification to the algorithm: we additionally restrict the range of $f_1, f_2$ to multiples of $1/2^{T-1}$. We remark that this does not affect the utility, since the final output only takes values that are multiples of $1/2^{T-1}$ anyway.

Given any dataset $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ where $x_1 < \dots < x_n$, the prefix isotonic regression problem is to compute, for each $i \in [n]$, the optimal loss of isotonic regression on $(x_1, y_1), \dots, (x_i, y_i)$. Straightforward dynamic programming solves this in $O(n \cdot v)$ time, where $v$ denotes the number of possible values allowed for the function.

Now, for each $i, t$, we may run the above algorithm with $D = D_{i,t}$, where the allowed values are all multiples of $1/2^{T-1}$ in $[\frac{i}{2^t}, \frac{i+0.5}{2^t}]$; this gives us $\min_{f_1 \in F(P_{i,t}^{\le\alpha}, [\frac{i}{2^t}, \frac{i+0.5}{2^t}])} \mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t}, \frac{i+1}{2^t}]}(f_1; D_{i,t}^{\le\alpha})$ for all $\alpha \in P_{i,t}$ in time $O(|D_{i,t}| \cdot 2^{T-t} + |P_{i,t}|)$. Analogously, we can also compute $\min_{f_2 \in F(P_{i,t}^{>\alpha}, [\frac{i+0.5}{2^t}, \frac{i+1}{2^t}])} \mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t}, \frac{i+1}{2^t}]}(f_2; D_{i,t}^{>\alpha})$ for all $\alpha \in P_{i,t}$ in a similar time. Thus, we can compute $(\mathrm{score}_{i,t}(\alpha))_{\alpha \in P_{i,t}}$ in time $O(|D_{i,t}| \cdot 2^{T-t} + |P_{i,t}|)$, and then sample accordingly.

We can further speed up the algorithm by observing that the score remains constant for all $\alpha \in [x_i, x_{i+1})$. Hence, we may first sample an interval among $[0, x_1), [x_1, x_2), \dots, [x_{n-1}, x_n), [x_n, m)$ and then sample $\alpha_{i,t}$ uniformly from that interval. This entire process can be done in $O(|D_{i,t}| \cdot 2^{T-t} + \log m)$ time. In total, the running time of the algorithm is thus
$$\sum_{t=0}^{T-1} \sum_{i=0}^{2^t - 1} O\!\left(|D_{i,t}| \cdot 2^{T-t} + \log m\right) \ \le\ \sum_{t=0}^{T-1} O\!\left(n \cdot 2^{T} + 2^t \log m\right) \ \le\ O(n^2 \log n + n \log m).$$
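As a minimal sketch of the $O(n \cdot v)$ dynamic program for prefix isotonic regression over a discretized value grid, as used in the running-time argument above, consider the following Python code; the function name, interface, and grid choice are ours.

```python
import numpy as np

def prefix_isotonic_losses(ys, values, loss):
    """For labels ys = (y_1, ..., y_n), given in increasing order of x, return, for
    each prefix i, the optimal isotonic-regression loss when f is restricted to
    the sorted value grid `values`. Runs in O(n * len(values)) time.

    best[k] holds the optimal loss of the current prefix subject to f(x_i) = values[k];
    the prefix-minimum step enforces monotonicity (earlier values may only be smaller)."""
    best = np.zeros(len(values))
    out = []
    for y in ys:
        best = np.minimum.accumulate(best)
        best = best + np.array([loss(v, y) for v in values])
        out.append(float(best.min()))
    return out

# Example with the l1 loss and the grid of multiples of 1/8 in [0, 1]:
# prefix_isotonic_losses([0.9, 0.2, 0.4], [k / 8 for k in range(9)], lambda v, y: abs(v - y))
```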
Near-Linear Time Algorithms for $\ell_1$- and $\ell_2^2$-Losses. We now describe faster algorithms for the $\ell_1$- and $\ell_2^2$-loss functions, thereby proving the last part of Theorem 3. The key observation is that, for convex loss functions, the range-restricted optimum is simple: we just have to clip the optimal function to lie in the range $[\tau, \theta]$. Below, $\mathrm{clip}_{[\tau,\theta]}$ denotes the function $y \mapsto \min\{\theta, \max\{\tau, y\}\}$.

Observation 12. Let $\ell$ be any convex loss function, $D$ any dataset, $f^* \in \mathrm{argmin}_{f \in F(X, Y)} \mathcal{L}(f; D)$, and $\tau \le \theta$ any real numbers such that $\tau, \theta \in Y$. Define $f^*_{\mathrm{clipped}}(x) := \mathrm{clip}_{[\tau,\theta]}(f^*(x))$. Then, we must have $f^*_{\mathrm{clipped}} \in \mathrm{argmin}_{f \in F(X, Y \cap [\tau, \theta])} \mathcal{L}(f; D)$.

Proof. Consider any $f \in F(X, Y \cap [\tau, \theta])$. Let $X^>$ (resp. $X^<$) denote the set of all $x \in X$ such that $f^*(x) > \theta$ (resp. $f^*(x) < \tau$). Consider the following operations: for each $x \in X^>$, change $f(x)$ to $\theta$; for each $x \in X^<$, change $f(x)$ to $\tau$; and let $f(x) = f^*(x)$ for all $x \in X \setminus (X^> \cup X^<)$. At the end, we end up with $f(x) = f^*_{\mathrm{clipped}}(x)$. Each of the first two changes does not increase the loss $\mathcal{L}(f; D)$; otherwise, due to convexity, changing $f^*(x)$ to $f(x)$ would have decreased the objective function. Finally, the last operation does not increase the loss either; otherwise, replacing this section of $f^*$ with the values of $f$ would yield a monotone function with strictly smaller loss, contradicting the optimality of $f^*$. Thus, we can conclude that $f^*_{\mathrm{clipped}} \in \mathrm{argmin}_{f \in F(X, Y \cap [\tau, \theta])} \mathcal{L}(f; D)$.

We will now show how to compute the scores in Algorithm 1 simultaneously for all $\alpha$ (for fixed $i, t$) in nearly linear time. To do this, recall the prefix isotonic regression problem from earlier. For this problem, Stout [42] gave an $O(n)$-time algorithm for the $\ell_2^2$-loss and an $O(n \log n)$-time algorithm for the $\ell_1$-loss (both for the unrestricted-value case). Furthermore, after the $i$-th iteration, the algorithm also keeps a succinct representation $S^{\mathrm{opt}}_i$ of the optimal solution in the form of an array $(i_1, v_1, \ell_1), \dots, (i_k, v_k, \ell_k)$, which denotes that $f(x) = v_j$ for all $x \in [x_{i_j}, x_{i_{j+1}})$, and $\ell_j$ records the loss $\mathcal{L}^{\mathrm{abs}}$ accumulated up to (but not including) $x_{i_{j+1}}$.

We can extend the above algorithm to the prefix clipped isotonic regression problem, which we define in the same manner as above except that we restrict the function range to $[\tau, \theta]$ for some given $\tau < \theta$. Using Observation 12, it is not hard to extend the above algorithm to work in this case.

Lemma 13. There is an $O(n \log n)$-time algorithm for $\ell_2^2$- and $\ell_1$-prefix clipped isotonic regression.

Proof. We first precompute $c_\tau(i) = \sum_{j \le i} \ell(\tau, y_j)$ and $c_\theta(i) = \sum_{j \le i} \ell(\theta, y_j)$ for all $i \in [n]$. We then run the aforementioned algorithm from [42]. At each iteration $i$, use binary search to find the largest index $j_\tau$ such that $v_{j_\tau} < \tau$ and the largest index $j_\theta$ such that $v_{j_\theta} < \theta$. Observation 12 implies that the optimal solution of the clipped version is the same as the unrestricted one, except that we need to change the function values before $x_{i_{j_\tau}}$ to $\tau$ and after $x_{i_{j_\theta}}$ to $\theta$. The loss of this clipped optimum can be written as $\ell_{j_\theta} - \ell_{j_\tau} + c_\tau(j_\tau) + (c_\theta(i) - c_\theta(j_\theta))$, which can be computed in $O(1)$ time given that we have already precomputed $c_\tau, c_\theta$. The running time of the entire algorithm is thus the same as that of [42] plus the binary search time; the latter totals $O(n \log n)$.

Our fast algorithm for computing $(\mathrm{score}_{i,t}(\alpha))_{\alpha \in P_{i,t}}$ first runs the above algorithm with $\tau = \frac{i}{2^t}$, $\theta = \frac{i+0.5}{2^t}$, and $D = D_{i,t}$; this gives us $\min_{f_1 \in F(P_{i,t}^{\le\alpha}, [\frac{i}{2^t}, \frac{i+0.5}{2^t}])} \mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t}, \frac{i+1}{2^t}]}(f_1; D_{i,t}^{\le\alpha})$ for all $\alpha \in P_{i,t}$ in time $O(|D_{i,t}| \log|D_{i,t}| + |P_{i,t}|)$. Analogously, we can also compute $\min_{f_2 \in F(P_{i,t}^{>\alpha}, [\frac{i+0.5}{2^t}, \frac{i+1}{2^t}])} \mathcal{L}^{\mathrm{abs}}_{[\frac{i}{2^t}, \frac{i+1}{2^t}]}(f_2; D_{i,t}^{>\alpha})$ for all $\alpha \in P_{i,t}$ in a similar time. Thus, we can compute $(\mathrm{score}_{i,t}(\alpha))_{\alpha \in P_{i,t}}$ in time $O(|D_{i,t}| \log|D_{i,t}| + |P_{i,t}|)$, and sample accordingly. Using the same observation as in the general loss function case, this can be sped up further to $O(|D_{i,t}| \log|D_{i,t}| + \log m)$ time. In total, the running time of the algorithm is thus
$$\sum_{t=0}^{T-1} \sum_{i=0}^{2^t - 1} O\!\left(|D_{i,t}| \log|D_{i,t}| + \log m\right) \ \le\ \sum_{t=0}^{T-1} O\!\left(n \log n + 2^t \log m\right) \ \le\ O\!\left(n (\log^2 n + \log m)\right).$$
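A small sketch of Observation 12 in action for the $\ell_2^2$ loss: compute the unrestricted isotonic fit and clip it to $[\tau, \theta]$. Here scikit-learn's PAV-based solver is used as an assumed stand-in for the prefix algorithm of [42]; the function name is ours.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def clipped_isotonic_l2(x, y, tau, theta):
    """Optimal monotone fit with values restricted to [tau, theta] under the l2^2 loss.

    By Observation 12, clipping the unrestricted optimum to [tau, theta] yields a
    minimizer of the range-restricted problem."""
    fit = IsotonicRegression(increasing=True).fit(x, y).predict(x)
    return np.clip(fit, tau, theta)

# Example:
# clipped_isotonic_l2([1, 2, 3, 4], [0.0, 0.9, 0.1, 1.0], tau=0.25, theta=0.75)
```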
3.2 A Nearly Matching Lower Bound

We show that the excess empirical risk guarantee in Theorem 3 is tight, even for approximate-DP algorithms with a sufficiently small $\delta$, under a mild assumption on the loss function stated below.

Definition 14 (Distance-Based Loss Function). For $R \ge 0$, a loss function $\ell$ is said to be $R$-distance-based if there exists $g : [0, 1] \to \mathbb{R}_+$ such that $\ell(y, y') = g(|y - y'|)$, where $g$ is a non-decreasing function with $g(0) = 0$ and $g(1/2) \ge R$.

We remark that standard loss functions, including the $\ell_1$- or $\ell_2^2$-loss, are all $\Omega(1)$-distance-based. Our lower bound is stated below. It is proved via a packing argument [25] in a similar manner as a lower bound for properly PAC learning threshold functions [10]. This is not a coincidence: indeed, when we restrict the range of our function to $\{0, 1\}$, the problem becomes exactly (the empirical version of) properly learning threshold functions. As a result, the same technique can be used to prove a lower bound in our setting as well.

Theorem 15. For all $0 \le \delta < 0.1 \cdot (e^\varepsilon - 1)/m$, any $(\varepsilon, \delta)$-DP algorithm for isotonic regression over $[m]$ for any $R$-distance-based loss function $\ell$ must have expected excess empirical risk $\Omega\!\left(R \cdot \min\left\{1, \frac{\log m}{\varepsilon n}\right\}\right)$.

Proof. Suppose for the sake of contradiction that there exists an $(\varepsilon, \delta)$-DP algorithm $M$ for isotonic regression with an $R$-distance-based loss function $\ell$ whose expected excess empirical risk is at most $0.01 \cdot R \cdot \min\{1, \log(0.1m)/(\varepsilon n)\}$. Let $k := \lfloor 0.1 \log(0.1m)/\varepsilon \rfloor$. We may assume that $n \ge 2k$, as the $\Omega(R)$ lower bound for the case $n = 2k$ can easily be adapted to an $\Omega(R)$ lower bound for the case $n < 2k$ as well.

We will use the standard packing argument [25]. For each $j \in [m - 1]$, we create a dataset $D_j$ that contains $k$ copies of $(j, 0)$, $k$ copies of $(j + 1, 1)$, and $n - 2k$ copies of $(1, 0)$. Finally, let $D_m$ denote the dataset that contains $k$ copies of $(m, 0)$ and $n - k$ copies of $(1, 0)$.

Let $V_j$ denote the set of all $f \in F([m], [0, 1])$ such that $\mathcal{L}(f; D_j) < Rk/n$. The utility guarantee of $M$ implies that $\Pr[M(D_j) \in V_j] \ge 0.5$. Furthermore, it is not hard to see that $V_1, \dots, V_m$ are disjoint. In particular, for any function $f \in F([m], [0, 1])$, let $x_f$ be the largest element $x \in [m]$ for which $f(x) \le 1/2$; if no such $x$ exists (i.e., $f(1) > 1/2$), let $x_f = 0$. For any $j < x_f$, we have $\mathcal{L}(f; D_j) \ge \frac{k}{n} \ell(f(j + 1), 1) \ge \frac{k}{n} g(1/2) \ge Rk/n$. Similarly, for any $j > x_f$, we have $\mathcal{L}(f; D_j) \ge \frac{k}{n} \ell(f(j), 0) \ge \frac{k}{n} g(1/2) \ge Rk/n$. This implies that $f$ can belong only to $V_{x_f}$, as claimed.

Therefore, we have that
$$1 \ \ge\ \sum_{j \in [m]} \Pr[M(D_m) \in V_j] \ \ge\ \sum_{j \in [m]} \frac{1}{e^{2k\varepsilon}} \left( \Pr[M(D_j) \in V_j] - \delta \cdot \frac{e^{2k\varepsilon} - 1}{e^{\varepsilon} - 1} \right) \quad \text{(Fact 10)}$$
$$\ge\ \sum_{j \in [m]} \frac{10}{m} \cdot (0.5 - 0.1) \ >\ 1,$$
a contradiction.

3.3 Extensions

We now discuss several variants of the isotonic regression problem that place certain additional constraints on the function $f$ that we seek, as listed below.
- k-Piecewise Constant: $f$ must be a step function that consists of at most $k$ pieces.
- k-Piecewise Linear: $f$ must be a piecewise linear function with at most $k$ pieces.
- Lipschitz Regression: $f$ must be $L_f$-Lipschitz for some specified $L_f > 0$.
- Convex/Concave: $f$ must be convex/concave.

We devise a general meta-algorithm that, with a small tweak in each case, works for all of these constraints and yields Theorem 4.
At a high level, our algorithm is similar to Algorithm 1, except that, in addition to using the exponential mechanism to pick the threshold $\alpha_{i,t}$, we also pick certain auxiliary information that is then passed on to the next stage. For example, in the $k$-piecewise constant setting, the algorithm also picks the number of pieces to the left of $\alpha_{i,t}$ and the number to the right of it. These are then passed on to the next stage. The recursion stops when the number of pieces becomes one, at which point we simply use the exponential mechanism to find the constant value on this subdomain. The full description of the algorithm and the corresponding proof are deferred to Appendix C.

4 DP Isotonic Regression over General Posets

We now provide an algorithm and lower bounds for the case of general discrete posets. We first recall basic quantities about posets. An anti-chain of a poset $(X, \preceq)$ is a set of elements such that no two distinct elements are comparable, whereas a chain is a set of elements such that every pair of elements is comparable. The width of a poset $(X, \preceq)$, denoted by $\mathrm{width}(X)$, is defined as the maximum size among all anti-chains in the poset. The height of $(X, \preceq)$, denoted by $\mathrm{height}(X)$, is defined as the maximum size among all chains in the poset. Dilworth's theorem and Mirsky's theorem give the following relation between chains and anti-chains:

Lemma 16 (Dilworth's and Mirsky's theorems [12, 36]). A poset with width $w$ can be partitioned into $w$ chains. A poset with height $h$ can be partitioned into $h$ anti-chains.

4.1 An Algorithm

Our algorithm for general posets is similar to that for totally ordered sets presented in the previous section. The only difference is that, instead of attempting to pick a single maximal point $\alpha$ such that $f(\alpha) \le \tau$ as in the previous case, there could now be many such maximal $\alpha$'s. Indeed, we need to use the exponential mechanism to pick all such $\alpha$'s. Since these are all maximal, they must be incomparable; therefore, they form an anti-chain. Since there can be as many as $|X|^{\mathrm{width}(X)}$ anti-chains in total, the error from the exponential mechanism is $O\!\left(\log\!\left(|X|^{\mathrm{width}(X)}\right)/\varepsilon'\right) = O(\mathrm{width}(X) \cdot \log|X| / \varepsilon')$, leading to the multiplicative increase of $\mathrm{width}(X)$ in the total error. This completes our proof sketch for Theorem 1.

4.2 Lower Bounds

To prove a lower bound of $\Omega(\mathrm{width}(X)/(\varepsilon n))$, we observe that the values of the function on any anti-chain can be arbitrary. Therefore, we may use each element of a maximum anti-chain to encode one coordinate of a binary vector. The lower bound from Lemma 8 then gives us an $\Omega(\mathrm{width}(X)/n)$ lower bound for $\varepsilon = 1$, as formalized below.

Lemma 17. For any $\delta > 0$, any $(1, \delta)$-DP algorithm for isotonic regression for any $R$-distance-based loss function $\ell$ must have expected excess empirical risk $\Omega\!\left(R(1 - \delta) \cdot \min\left\{1, \frac{\mathrm{width}(X)}{n}\right\}\right)$.

Proof. Consider any $(1, \delta)$-DP isotonic regression algorithm $M$ for loss $\ell$. Let $A$ be any maximum anti-chain (of size $\mathrm{width}(X)$) in $X$. We use this algorithm to build a $(1, \delta)$-DP algorithm $M'$ for privatizing a binary vector of $m = \min\{n, |A| - 1\}$ dimensions as follows: Let $x_0, x_1, \dots, x_m$ be distinct elements of $A$. On input $z \in \{0, 1\}^m$, create a dataset $D = \{(x_1, z_1), \dots, (x_m, z_m), (x_0, 0), \dots, (x_0, 0)\}$ where $(x_0, 0)$ is repeated $n - m$ times. Run $M$ on the instance $D$ to get $f$, and output a vector $z'$ where $z'_i = \mathbb{1}[f(x_i) \ge 1/2]$. It is obvious that this algorithm is $(1, \delta)$-DP.
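For concreteness, here is a minimal Python sketch of the reduction $M'$ just described, treating the DP isotonic-regression algorithm $M$ as a black box; the callable `iso_mech` and the anti-chain representation are assumptions of this illustration, not part of the paper.

```python
def privatize_binary_vector(z, antichain, n, iso_mech):
    """Reduction from the proof of Lemma 17: encode a binary vector on an
    anti-chain, run a DP isotonic-regression mechanism, and decode by
    thresholding the returned function at 1/2.

    z          -- list of m bits, with m <= len(antichain) - 1 and m <= n
    antichain  -- elements x_0, x_1, ..., x_m that are pairwise incomparable
    n          -- dataset size expected by the mechanism
    iso_mech   -- black-box (1, delta)-DP isotonic regression algorithm mapping a
                  dataset of (point, label) pairs to a monotone function f
    """
    m = len(z)
    x0, xs = antichain[0], antichain[1:m + 1]
    dataset = [(xs[i], float(z[i])) for i in range(m)] + [(x0, 0.0)] * (n - m)
    f = iso_mech(dataset)
    return [1 if f(xs[i]) >= 0.5 else 0 for i in range(m)]
```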
Observe also that $\mathcal{L}(f^*; D) = 0$, and thus $M$'s expected excess empirical risk is
$$\mathbb{E}_{f \sim M(D)}[\mathcal{L}(f; D)] \ \ge\ \frac{R}{n} \cdot \mathbb{E}_{z' \sim M'(z)}\big[\|z' - z\|_0\big],$$
which, from Lemma 8, must be at least $\Omega(Rm(1 - \delta)/n) = \Omega\!\left(R(1 - \delta) \cdot \min\left\{1, \frac{\mathrm{width}(X)}{n}\right\}\right)$.

By using group privacy (Fact 10) and repeating each element $\Theta(1/\varepsilon)$ times, we arrive at a lower bound of $\Omega\!\left(R \cdot \min\left\{1, \frac{\mathrm{width}(X)}{\varepsilon n}\right\}\right)$. Furthermore, since $X$ contains $\mathrm{height}(X)$ elements that form a totally ordered set, Theorem 15 gives a lower bound of $\Omega(R \log(\mathrm{height}(X))/(\varepsilon n))$ as long as $\delta < 0.01 \cdot \varepsilon / \mathrm{height}(X)$. Finally, due to Lemma 16, we have $\mathrm{height}(X) \ge |X| / \mathrm{width}(X)$, which means that $\max\{\mathrm{width}(X), \log(\mathrm{height}(X))\} \ge \Omega(\log|X|)$. Thus, we arrive at:

Theorem 18. For any $\varepsilon \in (0, 1]$ and any $\delta < 0.01 \cdot \varepsilon / |X|$, any $(\varepsilon, \delta)$-DP algorithm for isotonic regression for an $R$-distance-based loss function $\ell$ must have expected excess empirical risk $\Omega\!\left(R \cdot \min\left\{1, \frac{\mathrm{width}(X) + \log|X|}{\varepsilon n}\right\}\right)$.

4.3 Tight Examples for Upper and Lower Bounds

Recall that our upper bound is $O\!\left(\frac{\mathrm{width}(X) \log|X|}{\varepsilon n}\right)$ while our lower bound is $\Omega\!\left(\frac{\mathrm{width}(X) + \log|X|}{\varepsilon n}\right)$. One might wonder whether this gap can be closed. Below we show that, unfortunately, this is impossible in general: there are posets for which each bound is tight.

Tight Lower Bound Example. Let $X_{\mathrm{disj}(w,h)}$ denote the poset that consists of $w$ disjoint chains $C_1, \dots, C_w$, where $|C_1| = h$ and $|C_2| = \dots = |C_w| = 1$. (Every pair of elements on different chains is incomparable.) In this case, we can solve the isotonic regression problem directly on each chain and piece the solutions together into the final output $f$. Note that $|X_{\mathrm{disj}(w,h)}| = w + h - 1$, $\mathrm{width}(X) = w$, and $\mathrm{height}(X) = h$. According to Theorem 1, the unnormalized excess empirical risk in $C_i$ is $O(\log|C_i| / \varepsilon)$. Therefore, the total (normalized) excess empirical risk for the entire domain $X$ is $O\!\left(\frac{\log h + (w - 1)}{\varepsilon n}\right)$. This is at most $O\!\left(\frac{w}{\varepsilon n}\right)$ as long as $h \le \exp(O(w))$; this matches the lower bound.

Tight Upper Bound Example. Consider the grid poset $X_{\mathrm{grid}(w,h)} := [w] \times [h]$ where $(x, y) \preceq (x', y')$ if and only if $x \le x'$ and $y \le y'$. We assume throughout that $w \le h$. Observe that $\mathrm{width}(X_{\mathrm{grid}(w,h)}) = w$ and $\mathrm{height}(X_{\mathrm{grid}(w,h)}) = w + h - 1$. We will show the following lower bound, which matches the $O\!\left(\frac{\mathrm{width}(X) \log|X|}{\varepsilon n}\right)$ upper bound in the case where $h \ge w^{1 + \Omega(1)}$, up to an $O(\log^2(\varepsilon n))$ factor. We prove it by a reduction from Lemma 9. Note that this reduction is in some sense a combination of the proofs of Theorem 15 and Lemma 17: the coordinate-wise encoding aspect of Lemma 17 is still present (across the rows), whereas the packing-style lower bound is present in how we embed elements of $[D]$ (in blocks of columns).

Lemma 19. For any $\varepsilon \in (0, 1]$ and $\delta < O_\varepsilon(1/h)$, any $(\varepsilon, \delta)$-DP algorithm for isotonic regression for any $R$-distance-based loss function $\ell$ must have expected excess empirical risk $\Omega\!\left(R \cdot \min\left\{1, \frac{w \log(h/w)}{\varepsilon n}\right\}\right)$.

Proof. Let $D := \lfloor h/w \rfloor - 1$, $m := w$, and $r := \min\{\lfloor 0.5 n/m \rfloor, \lfloor 0.5 \ln(D/2)/\varepsilon \rfloor\}$. Consider any $(\varepsilon, \delta)$-DP algorithm $M$ for isotonic regression for $\ell$ on $X_{\mathrm{grid}(w,h)}$ where $\delta \le 0.01 \varepsilon / D$. We use this algorithm to build a $(\ln(D/2), 0.25)$-DP algorithm $M'$ for privatizing a vector $z \in [D]^m$ as follows:
- Create a dataset $D'$ that contains: for all $i \in [m]$, $r$ copies of $((i, (w - i)(D + 1) + z_i), 0)$ and $r$ copies of $((i, (w - i)(D + 1) + z_i + 1), 1)$; and $n - 2rm$ copies of $((1, 1), 0)$.
- Run $M$ on the instance $D'$ to get $f$. Output a vector $z'$ where $z'_i = \max\{j \in [D] \mid f((i, (w - i)(D + 1) + j)) \le 1/2\}$. (For simplicity, if no such $j$ exists, let $z'_i = 0$.)

By group privacy, $M'$ is $(\ln(D/2), 0.25)$-DP.
Furthermore, $\mathcal{L}(f^*; D') = 0$, and the expected excess empirical risk of $M$ is
$$\mathbb{E}_{f \sim M(D')}[\mathcal{L}(f; D')] \ \ge\ \frac{r}{n} \cdot \mathbb{E}\!\left[\sum_{i \in [m]} \Big( \ell\big(f(i, (w - i)(D + 1) + z_i), 0\big) + \ell\big(f(i, (w - i)(D + 1) + z_i + 1), 1\big) \Big)\right]$$
$$\ge\ \frac{r}{n} \cdot \mathbb{E}\!\left[\sum_{i \in [m]} g(1/2) \cdot \mathbb{1}[z'_i \ne z_i]\right] \ \ge\ \frac{Rr}{n} \cdot \mathbb{E}\big[\|z' - z\|_0\big],$$
which, by Lemma 9, must be at least $\Omega(Rrm/n) = \Omega\!\left(R \cdot \min\left\{1, \frac{w \log(h/w)}{\varepsilon n}\right\}\right)$.

5 Additional Related Work

(Non-private) isotonic regression is well studied in statistics and machine learning. The one-dimensional (aka univariate) case has a long history [9, 46, 5, 44, 45, 13, 8, 35, 14, 15, 49]; for a general introduction, see [22]. Moreover, isotonic regression has been studied in higher dimensions [24, 27, 26], including the sparse setting [21], as well as in online learning [29]. A related line of work studies learning neural networks under (partial) monotonicity constraints [3, 50, 30, 40].

There has been a rich body of work on DP machine learning, including DP empirical risk minimization (ERM), e.g., [11, 6, 48, 47], and DP linear regression, e.g., [1]; however, to the best of our knowledge, none of these can be applied to isotonic regression to obtain non-trivial guarantees. Another line of work related to our setting concerns privately learning threshold functions [7, 20, 10, 2, 28]. We leveraged this relation to prove our lower bound for the totally ordered case (Section 3.2).

6 Conclusions

In this paper, we obtained new private algorithms for isotonic regression on posets and proved nearly matching lower bounds in terms of the expected excess empirical risk. Although our algorithms for totally ordered sets are efficient, our algorithm for general posets is not. Specifically, a trivial implementation of the algorithm would run in time $\exp(\widetilde{O}(\mathrm{width}(X)))$. It remains an interesting open question whether this can be sped up. To the best of our knowledge, this question does not seem to be well understood even in the non-private setting, as previous algorithmic works have focused primarily on the totally ordered case. Similarly, while our algorithm is efficient for totally ordered sets, it remains interesting to understand whether the nearly linear time algorithms for the $\ell_1$- and $\ell_2^2$-losses can be extended to a larger class of loss functions.

References

[1] D. Alabi, A. McMillan, J. Sarathy, A. D. Smith, and S. P. Vadhan. Differentially private simple linear regression. Proc. Priv. Enhancing Technol., 2022(2):184-204, 2022.
[2] N. Alon, R. Livni, M. Malliaris, and S. Moran. Private PAC learning implies finite Littlestone dimension. In STOC, pages 852-860, 2019.
[3] N. P. Archer and S. Wang. Application of the back propagation neural network algorithm with monotonicity constraints for two-group classification problems. Dec. Sci., 24(1):60-75, 1993.
[4] M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. Ann. Math. Stat., pages 641-647, 1955.
[5] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical Inference Under Order Restrictions. John Wiley & Sons, 1973.
[6] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464-473, 2014.
[7] A. Beimel, K. Nissim, and U. Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. Theory Comput., 12(1):1-61, 2016.
[8] L. Birgé and P. Massart. Rates of convergence for minimum contrast estimators. Prob. Theory Rel. Fields, 97(1):113-150, 1993.
[9] H. D. Brunk. Maximum likelihood estimates of monotone parameters. Ann. Math. Stat., pages 607-616, 1955.
[10] M. Bun, K. Nissim, U. Stemmer, and S. P. Vadhan. Differentially private release and learning of threshold functions. In FOCS, pages 634-649, 2015.
[11] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. JMLR, 12(3), 2011.
[12] R. P. Dilworth. A decomposition theorem for partially ordered sets. Ann. Math., 51(1):161-166, 1950.
[13] D. L. Donoho. Gelfand n-widths and the method of least squares. Technical Report, University of California, 1991.
[14] C. Durot. On the ℓp-error of monotonicity constrained estimators. Ann. Stat., 35(3):1080-1104, 2007.
[15] C. Durot. Monotone nonparametric regression with random design. Math. Methods Stat., 17(4):327-341, 2008.
[16] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486-503, 2006.
[17] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265-284, 2006.
[18] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265-284, 2006.
[19] R. L. Dykstra and T. Robertson. An algorithm for isotonic regression for two or more independent variables. Ann. Stat., 10(3):708-716, 1982.
[20] V. Feldman and D. Xiao. Sample complexity bounds on differentially private learning via communication complexity. In COLT, volume 35, pages 1000-1019, 2014.
[21] D. Gamarnik and J. Gaudio. Sparse high-dimensional isotonic regression. In NeurIPS, 2019.
[22] P. Groeneboom and G. Jongbloed. Nonparametric Estimation under Shape Constraints. Cambridge University Press, 2014.
[23] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In ICML, pages 1321-1330, 2017.
[24] Q. Han, T. Wang, S. Chatterjee, and R. J. Samworth. Isotonic regression in general dimensions. Ann. Stat., 47(5):2440-2471, 2019.
[25] M. Hardt and K. Talwar. On the geometry of differential privacy. In STOC, pages 705-714, 2010.
[26] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NeurIPS, 2011.
[27] A. T. Kalai and R. Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
[28] H. Kaplan, K. Ligett, Y. Mansour, M. Naor, and U. Stemmer. Privately learning thresholds: Closing the exponential gap. In COLT, pages 2263-2285, 2020.
[29] W. Kotłowski, W. M. Koolen, and A. Malek. Online isotonic regression. In COLT, pages 1165-1189, 2016.
[30] X. Liu, X. Han, N. Zhang, and Q. Liu. Certified monotonic neural networks. In NeurIPS, pages 15427-15438, 2020.
[31] R. Luss, S. Rosset, and M. Shahar. Efficient regularized isotonic regression with application to gene-gene interaction search. Ann. Appl. Stat., 6(1):253-283, 2012.
[32] P. Manurangsi. Tight bounds for differentially private anonymized histograms. In SOSA, pages 203-213, 2022.
[33] F. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun. ACM, 53(9):89-97, 2010.
[34] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pages 94-103, 2007.
[35] M. Meyer and M. Woodroofe. On the degrees of freedom in shape-restricted regression. Ann. Stat., 28(4):1083-1104, 2000.
[36] L. Mirsky. A dual of Dilworth's decomposition theorem. AMS, 78(8):876-877, 1971.
[37] C. Radebaugh and U. Erlingsson. Introducing TensorFlow Privacy: Learning with Differential Privacy for Training Data, March 2019. blog.tensorflow.org.
[38] T. Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. John Wiley & Sons, 1988.
[39] M. J. Schell and B. Singh. The reduced monotonic regression method. JASA, 92(437):128-135, 1997.
[40] A. Sivaraman, G. Farnadi, T. Millstein, and G. Van den Broeck. Counterexample-guided learning of monotonic neural networks. In NeurIPS, pages 11936-11948, 2020.
[41] T. Steinke and J. R. Ullman. Between pure and approximate differential privacy. J. Priv. Confidentiality, 7(2), 2016.
[42] Q. F. Stout. Unimodal regression via prefix isotonic regression. Comput. Stat. Data Anal., 53(2):289-297, 2008.
[43] D. Testuggine and I. Mironov. PyTorch Differential Privacy Series Part 1: DP-SGD Algorithm Explained, August 2020. medium.com.
[44] S. Van de Geer. Estimating a regression function. Ann. Stat., pages 907-924, 1990.
[45] S. Van de Geer. Hellinger-consistency of certain nonparametric maximum likelihood estimators. Ann. Stat., 21(1):14-44, 1993.
[46] C. van Eeden. Testing and Estimating Ordered Parameters of Probability Distribution. CWI, Amsterdam, 1958.
[47] D. Wang, C. Chen, and J. Xu. Differentially private empirical risk minimization with non-convex loss functions. In ICML, pages 6526-6535, 2019.
[48] D. Wang, M. Ye, and J. Xu. Differentially private empirical risk minimization revisited: Faster and more general. In NIPS, 2017.
[49] F. Yang and R. F. Barber. Contraction and uniform convergence of isotonic regression. Elec. J. Stat., 13(1):646-677, 2019.
[50] S. You, D. Ding, K. Canini, J. Pfeifer, and M. Gupta. Deep lattice networks and partial monotonic functions. In NIPS, 2017.
[51] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pages 694-699, 2002.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 6.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] Since this is a purely theoretical paper on private algorithms for the well-studied ML task of isotonic regression, we do not foresee any immediate potential negative impacts.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] Including some proofs in the supplementary material.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]