# Agnostically Learning Single-Index Models using Omnipredictors

Aravind Gollakota (Apple) · Parikshit Gopalan (Apple) · Adam R. Klivans (UT Austin) · Konstantinos Stavropoulos (UT Austin)

**Abstract.** We give the first result for agnostically learning Single-Index Models (SIMs) with arbitrary monotone and Lipschitz activations. All prior work either held only in the realizable setting or required the activation to be known. Moreover, we only require the marginal to have bounded second moments, whereas all prior work required stronger distributional assumptions (such as anticoncentration or boundedness). Our algorithm is based on recent work by Gopalan et al. [2023] on omniprediction using predictors satisfying calibrated multiaccuracy. Our analysis is simple and relies on the relationship between Bregman divergences (or matching losses) and $\ell_p$ distances. We also provide new guarantees for standard algorithms like GLMtron and logistic regression in the agnostic setting.

## 1 Introduction

Generalized Linear Models (GLMs) and Single-Index Models (SIMs) constitute fundamental frameworks in statistics and supervised learning [McCullagh, 1984, Agresti, 2015], capturing and generalizing basic models such as linear and logistic regression. In the GLM framework, labeled examples $(x, y)$ are assumed to satisfy $u(\mathbb{E}[y \mid x]) = w \cdot x$ (equivalently, $\mathbb{E}[y \mid x] = u^{-1}(w \cdot x)$), where $u$ is a known monotone function (called the link function) and $w$ is an unknown vector. Single-Index Models (SIMs) are a generalization in which the monotone link function $u$ is also unknown.

In the realizable setting, where the labels are indeed generated according to a GLM with a known Lipschitz link function, the GLMtron algorithm of Kakade et al. [2011] is a simple and efficient learning algorithm. When the ground truth is only assumed to be a SIM (i.e., the link function is unknown), it can be learned efficiently by the Isotron algorithm [Kalai and Sastry, 2009, Kakade et al., 2011].

In this work, we consider the significantly more challenging agnostic setting, where the labels are arbitrary and not necessarily realizable by any SIM. Importantly, we do not fix a link function in advance; our goal is to output a predictor whose squared error is comparable to that of the optimal SIM with an arbitrary monotone and Lipschitz link function. We can equivalently view this as a natural squared-error regression problem in which the final optimality guarantee must hold with respect to all SIMs with bounded weights and monotone, Lipschitz link functions.¹

Formally, consider a distribution $D$ over $\mathbb{R}^d \times [0,1]$ and denote the squared error of a function $h : \mathbb{R}^d \to \mathbb{R}$ by $\mathrm{err}_2(h) = \mathbb{E}_{(x,y)\sim D}[(y - h(x))^2]$. Let $\mathrm{opt}(\mathrm{SIM})$ denote the optimal value of $\mathrm{err}_2(h)$ over all SIMs $h$ with bounded weights and arbitrary 1-Lipschitz monotone activations (we call the inverse $u^{-1}$ of a link function $u$ the activation function). Given access to samples from $D$, the goal of an agnostic learning algorithm is to compute a predictor $p : \mathbb{R}^d \to [0,1]$ whose error $\mathrm{err}_2(p)$ is, with high probability over the samples, comparable to the error of the optimal SIM: $\mathrm{err}_2(p) \le \mathrm{opt}(\mathrm{SIM}) + \varepsilon$.

¹In this context, in recent times it has also been common to refer to the class of GLMs (resp. SIMs) as the class of single neurons with known (resp. unknown) activation functions.

Our main result is the first efficient learning algorithm with a guarantee of this form.

**Theorem 1.1** (Informal, see Theorem 3.1).
Let $\mathrm{SIM}_B$ denote the class of SIMs of the form $x \mapsto u^{-1}(w \cdot x)$ for some 1-Lipschitz function $u^{-1}$ and $\|w\|_2 \le B$. Let $D$ be any distribution over $\mathbb{R}^d \times [0,1]$ whose marginal on $\mathbb{R}^d$ has bounded second moments. There is an efficient algorithm (Algorithm 1) that agnostically learns $\mathrm{SIM}_B$ over $D$ up to error
$$\mathrm{err}_2(p) \le O\big(B\sqrt{\mathrm{opt}(\mathrm{SIM}_B)}\big) + \varepsilon.$$

This result provides a guarantee comparable to that of the Isotron algorithm [Kalai and Sastry, 2009, Kakade et al., 2011] (which also tackles the SIM setting, where the link function is unknown), but for the challenging agnostic setting rather than the realizable setting (where $\mathrm{opt}(\mathrm{SIM}_B, D) = 0$). Moreover, Isotron's guarantees require the distribution to be supported on the unit ball, whereas we only require a mild second moment condition. In our view, this result helps establish Algorithm 1 (and indeed any algorithm to compute omnipredictors, as we introduce and discuss shortly) as an efficient and powerful off-the-shelf supervised learning algorithm, akin to random forests or gradient-boosted trees.

A natural question is whether our guarantees are near-optimal, e.g., whether we can obtain a guarantee of the form $\mathrm{err}_2(p) \le \mathrm{opt}(\mathrm{SIM}) + \varepsilon$. However, there is strong evidence that such results cannot be obtained using efficient algorithms [Goel et al., 2019, Diakonikolas et al., 2020a, Goel et al., 2020, Diakonikolas et al., 2021]. We partially justify the form of our guarantee by showing in Section 5 (adapting a result due to Diakonikolas et al. [2022a]) that one cannot avoid a dependence on the norm bound $B$. Further closing the gap between upper and lower bounds is an important direction for future work.

**Overview of techniques.** We now describe the main ingredients and techniques that go into proving Theorem 1.1. Our starting point is the connection between GLMs and so-called matching losses [Auer et al., 1995]. This connection arises from the fact that fitting a GLM with a known link function is equivalent to minimizing (over the class of all linear functions) a certain convex loss known as the matching loss corresponding to that link function (see Definition 1.8). (We refer the reader to [Gopalan et al., 2023, Sec 5.1] for a more detailed discussion.) Importantly for us, minimizing the matching loss corresponding to a link function $u$ yields a meaningful guarantee even in the agnostic setting, where the Bayes optimal predictor (i.e., $x \mapsto \mathbb{E}[y \mid x]$) is not necessarily a GLM at all. Specifically, it turns out to be equivalent (via Fenchel-Legendre duality) to finding the closest predictor to the Bayes optimal predictor in the metric of the Bregman divergence associated with the link function $u$ (see e.g. [Gopalan et al., 2023, Lemma 5.4]).

This is a powerful observation, but it still assumes that we have a fixed and known link function $u$ in mind. In the agnostic setting, it is arguably unrealistic to assume that we know the best link function for the distribution at hand. Remarkably, recent work by Gopalan et al. [2022, 2023] has shown that there exist efficient learning algorithms that simultaneously minimize all matching losses corresponding to arbitrary monotone and Lipschitz link functions. Their solution concept is called an omnipredictor, i.e., a single predictor that is able to compete with the best-fitting classifier in a class $\mathcal{C}$ as measured by a large class of losses (as opposed to just a single pre-specified loss, as is standard in machine learning). They obtain such predictors through calibrated multiaccuracy [Gopalan et al., 2023] or multicalibration [Gopalan et al., 2022].
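To make the solution concept operationally concrete, here is a minimal sketch (our illustration, not code from Gopalan et al. [2023]; the function and variable names are ours) of estimating the two quantities that calibrated multiaccuracy controls on a finite sample, namely the coordinate-wise correlations $|\mathbb{E}[x_i(y - p(x))]|$ and the calibration error $\mathbb{E}_{p(x)}\big|\mathbb{E}[y - p(x) \mid p(x)]\big|$ (both defined formally in Definition 1.10 below):

```python
import numpy as np

def multiaccuracy_violation(X, y, p):
    """Largest coordinate-wise correlation |E[x_i (y - p(x))]| on the sample.
    X: (n, d) features, y: (n,) labels in [0,1], p: (n,) predictions p(x)."""
    residual = y - p
    return np.max(np.abs(X.T @ residual) / len(y))

def calibration_violation(y, p, grid=0.05):
    """Empirical estimate of E_{p(x)} |E[y - p(x) | p(x)]|, after discretizing
    the prediction values to a grid of width `grid`."""
    bins = np.round(p / grid).astype(int)
    total = 0.0
    for b in np.unique(bins):
        mask = bins == b
        total += mask.mean() * abs(np.mean(y[mask] - p[mask]))
    return total
```

A predictor is (approximately) calibrated and multiaccurate when both of these estimates are small; Algorithm 1 below drives them down iteratively.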
From the point of view of ordinary supervised learning or regression, however, an optimality guarantee in terms of such matching losses or Bregman divergences is hard to interpret. A much more standard metric is simply the squared error. The key final step for our results is to find a way of translating (a) optimality (ranging over all linear functions) in terms of all matching losses simultaneously into (b) optimality (ranging over all SIMs) in terms of squared error. We do so by proving simple analytic distortion inequalities relating matching losses to $\ell_p$ losses, which we believe may be of independent interest. On a technical level, to prove these distortion inequalities we first prove strong versions of such inequalities for matching losses arising from bi-Lipschitz link functions (see Lemma 2.2). We then obtain our results for general Lipschitz link functions by carefully approximating them using bi-Lipschitz link functions (see Lemma 3.3).

**Further applications.** As further applications of our approach, if we let $\mathrm{opt}(\mathrm{GLM}_{u^{-1},B})$ denote the optimal value of $\mathrm{err}_2(h)$ over all GLMs of the form $x \mapsto u^{-1}(w \cdot x)$ with $\|w\|_2 \le B$, we obtain the following result about bi-Lipschitz link functions (including, for example, the Leaky ReLU).

**Theorem 1.2** (Informal, see Theorem 2.1). Let $u : \mathbb{R} \to \mathbb{R}$ be a bi-Lipschitz invertible link function. Then, any predictor $p : \mathbb{R}^d \to \mathbb{R}$ that is an $\varepsilon$-approximate minimizer of the population matching loss corresponding to $u$, with respect to a distribution $D$ over $\mathbb{R}^d \times [0,1]$, satisfies
$$\mathrm{err}_2(p) \le O(\mathrm{opt}(\mathrm{GLM}_{u^{-1},B})) + O(\varepsilon).$$

This guarantee holds under milder distributional assumptions than are required by comparable prior work on agnostically learning GLMs or single neurons [Frei et al., 2020, Diakonikolas et al., 2022b]. Moreover, when we focus on distortion bounds between the logistic loss and the squared loss, we obtain a near-optimal guarantee of $\widetilde{O}(\mathrm{opt}(\mathrm{GLM}_{u^{-1},B}))$ for logistic regression, when $u$ is the logit link function (i.e., when $\mathrm{GLM}_{u^{-1},B}$ is the class of sigmoid neurons).

**Theorem 1.3** (Informal, see Theorem 4.1). Let $u(t) = \ln\frac{t}{1-t}$. Then, any predictor $p : \mathbb{R}^d \to \mathbb{R}$ that is an $\varepsilon$-approximate minimizer of the population logistic loss, with respect to a distribution $D$ over $\mathbb{R}^d \times [0,1]$ whose marginal on $\mathbb{R}^d$ has subgaussian tails in every direction, satisfies
$$\mathrm{err}_2(p) \le \widetilde{O}(\mathrm{opt}(\mathrm{GLM}_{u^{-1},B})) + O(\varepsilon).$$

While our error guarantee for this problem is weaker than that of Diakonikolas et al. [2022b], we do not make the anti-concentration assumptions their results require.

### 1.1 Background and Relation to Prior Work

We note that matching losses have been studied in various previous works, either implicitly [Kakade et al., 2011] or explicitly [Auer et al., 1995, Diakonikolas et al., 2020b, Gopalan et al., 2023], and capture various fundamental algorithms like logistic and linear regression [McCullagh, 1984, Agresti, 2015]. However, to the best of our knowledge, our generic and direct approach for transforming matching loss guarantees into squared error bounds has not been explored previously. Furthermore, our results do not depend on the specific implementation of an algorithm, but only on the matching loss bounds achieved by its output. In this sense, we provide new agnostic error guarantees for various existing algorithms in the literature. For example, our results imply new guarantees for the GLMtron algorithm of Kakade et al. [2011] in the agnostic setting, since GLMtron can be viewed as performing gradient descent (with unit step size) on the matching loss corresponding to a specified link function.
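For concreteness, here is a minimal sketch of GLMtron in this view (our rendering of the update rule of Kakade et al. [2011], not their reference code; the data format and the best-iterate convention are our assumptions). The update is exactly a unit-step gradient step on the empirical matching loss $\widehat{\mathcal{L}}_g(w) = \frac{1}{n}\sum_i \big(g(w \cdot x_i) - y_i\, (w \cdot x_i)\big)$, whose gradient is $\frac{1}{n}\sum_i (g'(w \cdot x_i) - y_i)\, x_i$:

```python
import numpy as np

def glmtron(X, y, activation, T=100):
    """GLMtron: unit-step gradient descent on the empirical matching loss.
    `activation` is the (known) activation g', e.g. the sigmoid."""
    n, d = X.shape
    w = np.zeros(d)
    best_w, best_err = w.copy(), np.inf
    for _ in range(T):
        preds = activation(X @ w)
        err = np.mean((preds - y) ** 2)
        if err < best_err:                      # keep the best iterate seen
            best_w, best_err = w.copy(), err
        w = w - (X.T @ (preds - y)) / n         # unit step on the matching loss
    return best_w

# Example with the sigmoid activation:
# w_hat = glmtron(X, y, activation=lambda t: 1.0 / (1.0 + np.exp(-t)))
```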
Matching losses over linear functions are also linked to the Chow parameters [O'Donnell and Servedio, 2008] through their gradient with respect to $w$, as observed by Diakonikolas et al. [2020b]. In fact, the norm of the matching loss gradient is also linked to multiaccuracy, a notion that originates in the fairness literature [Hébert-Johnson et al., 2018, Kim et al., 2019]. A stationary point $w$ of the matching loss corresponding to a GLM with link $u$ is associated with a multiaccurate predictor $p(x) = u^{-1}(w \cdot x)$, i.e., a predictor such that for all $i \in [d]$, $\mathbb{E}[x_i(y - p(x))] = 0$. The work of Gopalan et al. [2022, 2023] on omnipredictors presents a single predictor that is better than any linear model $w \cdot x$ for every matching loss. The results of Gopalan et al. [2022] show that a multicalibrated predictor (with respect to the features $x_i$) is an omnipredictor for all convex losses, whereas Gopalan et al. [2023] shows that the simpler condition of calibrated multiaccuracy suffices for matching losses that arise from GLMs. In view of the relationship between multiaccuracy and the gradient of the matching loss, our results show that, while multiaccuracy implies bounds on agnostically learning GLMs, the additional guarantee of calibration is sufficient for agnostically learning all SIMs.

The work of Shalev-Shwartz et al. [2011] showed strong agnostic learning guarantees in terms of the absolute error (rather than the squared error), of the form $\mathrm{opt} + \varepsilon$, for a range of GLMs, but their work incurs an exponential dependence on the weight norm $B$. In comparison, for the absolute loss we get a bound of the form $B \cdot \mathrm{opt}\log(1/\mathrm{opt})$ for logistic regression (see Theorem 4.3).

In more recent years, the problem of agnostically learning GLMs has frequently also been phrased as the problem of agnostically learning single neurons (with a known activation). For the ReLU activation, work by Goel et al. [2017] showed an algorithm achieving error $\mathrm{opt} + \varepsilon$ in time $\mathrm{poly}(d)\cdot\exp(1/\varepsilon)$ over marginals on the unit sphere, and Diakonikolas et al. [2020b] showed an algorithm achieving error $O(\mathrm{opt}) + \varepsilon$ in fully polynomial time over isotropic log-concave marginals. The works of Frei et al. [2020] and Diakonikolas et al. [2022b] both show guarantees for learning general neurons (with known activations) using the natural approach of running SGD directly on the squared loss (or a regularized variant thereof). Frei et al. [2020] achieve error $O(\mathrm{opt})$ for any given strictly increasing activation and $O(\sqrt{\mathrm{opt}})$ for the ReLU activation, but they assume that the marginal distribution is bounded. In contrast, we only assume that the marginal distribution has bounded second moments, and we do not assume that the activation is known. Diakonikolas et al. [2022b] proved an $O(\mathrm{opt})$ guarantee for a wide range of activations (including the ReLU), in the setting where the activation is known and over a large class of structured marginals, which must, however, satisfy strong concentration and anti-concentration properties. In terms of lower bounds and hardness results for this problem, the work of [Goel et al., 2019, Diakonikolas et al., 2020a, Goel et al., 2020, Diakonikolas et al., 2021, 2022a] has established superpolynomial hardness even for the setting of agnostically learning single ReLUs over Gaussian marginals.

**Limitations and directions for future work.** While we justify a certain dependence on the norm bound $B$ in our main result on agnostically learning SIMs, we do not completely justify the exact form of Theorem 1.1.
An important direction for future work is to tightly characterize the optimal bounds achievable for this problem, as well as to show matching algorithms.

### 1.2 Preliminaries

In what follows, $(x, y)$ denotes a labelled sample from a distribution $D$ over $\mathbb{R}^d \times \mathcal{Y}$, where $\mathcal{Y}$ denotes the interval $[0,1]$ unless it is specified to be the set $\{0,1\}$. We note that, although we provide results for the setting where the labels lie within $[0,1]$, similar results hold for any bounded label space. We use $\mathbb{P}_D$ (resp. $\mathbb{E}_D$) to denote probability (resp. expectation) over $D$ and, similarly, $\mathbb{P}_S$ (resp. $\mathbb{E}_S$) to denote the corresponding empirical quantity over a set $S$ of labelled examples. Throughout the paper, we use the term differentiable function to mean a function that is differentiable except at finitely many points. Our main results assume the following about the marginal distribution on $\mathbb{R}^d$.

**Definition 1.4** (Bounded moments). For $\lambda \ge 1$ and $k \in \mathbb{N}$, we say that a distribution $D_x$ over $\mathbb{R}^d$ has $\lambda$-bounded $2k$-th moments if for any $v \in \mathbb{S}^{d-1}$ we have $\mathbb{E}_{x \sim D_x}[(v \cdot x)^{2k}] \le \lambda$.

For a concept class $\mathcal{C} \subseteq \{c : \mathbb{R}^d \to \mathbb{R}\}$, we define $\mathrm{opt}(\mathcal{C}, D)$ to be the minimum squared error achievable by a concept $c \in \mathcal{C}$ with respect to the distribution $D$. We now define our main learning task.

**Definition 1.5** (Agnostic learning). Let $\mathcal{C} \subseteq \{c : \mathbb{R}^d \to \mathbb{R}\}$ be a concept class, let $\Psi : [0,1] \to [0,1]$ be an increasing function, let $D$ be a distribution over $\mathbb{R}^d \times [0,1]$ and let $\varepsilon > 0$. We say that an algorithm $\mathcal{A}$ agnostically learns the class $\mathcal{C}$ up to error $\Psi(\mathrm{opt}(\mathcal{C}, D)) + \varepsilon$ if $\mathcal{A}$, given a number of i.i.d. samples from $D$, outputs, with probability at least $2/3$ over the samples and the randomness of $\mathcal{A}$, a hypothesis $h : \mathbb{R}^d \to [0,1]$ with $\mathbb{E}_D[(y - h(x))^2] \le \Psi(\mathrm{opt}(\mathcal{C}, D)) + \varepsilon$.

We will also provide results that are specific to the sigmoid activation and work under the assumption that the marginal distribution is sufficiently concentrated.

**Definition 1.6** (Concentrated marginals). For $\lambda, \gamma > 0$, we say that a distribution $D_x$ over $\mathbb{R}^d$ is $(\lambda, \gamma)$-concentrated if for any $v \in \mathbb{S}^{d-1}$ and $r \ge 0$ we have $\mathbb{P}_{x \sim D_x}[|v \cdot x| \ge r] \le \lambda \exp(-r^{\gamma})$.

**Definition 1.7** (Fenchel-Legendre pairs). We call a pair of functions $(f, g)$ a Fenchel-Legendre pair if the following conditions hold.

1. $g' : \mathbb{R} \to \mathbb{R}$ is continuous, non-decreasing, differentiable and 1-Lipschitz with range $\mathrm{ran}(g') \subseteq (0,1)$, and $g(t) = \int_0^t g'(\tau)\, d\tau$ for any $t \in \mathbb{R}$.
2. $f : \mathrm{ran}(g') \to \mathbb{R}$ is the convex conjugate (Fenchel-Legendre transform) of $g$, i.e., $f(r) = \sup_{t \in \mathbb{R}}\{r t - g(t)\}$ for any $r \in \mathrm{ran}(g')$.

For such pairs of functions (and their derivatives $f'$, $g'$), the following hold for $r \in \mathrm{ran}(g')$ and $t \in \mathrm{ran}(f')$ (note that $\mathrm{ran}(f')$ is not necessarily $\mathbb{R}$ when $g'$ is not invertible):
$$g'(f'(r)) = r \quad\text{and}\quad f(r) = r f'(r) - g(f'(r)), \quad \text{for } r \in \mathrm{ran}(g'), \tag{1.1}$$
$$f'(g'(t)) = t \quad\text{and}\quad g(t) = t\, g'(t) - f(g'(t)), \quad \text{for } t \in \mathrm{ran}(f'). \tag{1.2}$$

Note that $g'$ will be used as an activation function for single neurons, while $f'$ corresponds to the unknown link function of a SIM (or the known link function of a GLM). We say that $g'$ is bi-Lipschitz if for any $t_1 < t_2 \in \mathbb{R}$ we have $(g'(t_2) - g'(t_1))/(t_2 - t_1) \in [\alpha, \beta]$. If $g'$ is $[\alpha, \beta]$ bi-Lipschitz, then $f'$ is $[\frac{1}{\beta}, \frac{1}{\alpha}]$ bi-Lipschitz. However, the converse implication is not necessarily true when $g'$ is not strictly increasing.

**Definition 1.8** (Matching Losses). For a non-decreasing and Lipschitz activation $g' : \mathbb{R} \to \mathbb{R}$, the matching loss $\ell_g : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ is defined pointwise as follows:
$$\ell_g(y, t) = \int_0^t \big(g'(\tau) - y\big)\, d\tau, \quad\text{where } g(t) = \int_0^t g'.$$
The function $\ell_g$ is convex and smooth with respect to its second argument.
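As a concrete instance of Definitions 1.7 and 1.8 (a standard computation, spelled out here purely for illustration), take the sigmoid activation. Its Fenchel-Legendre dual is the negative binary entropy, its link is the logit function of Section 4, and the induced matching loss is the logistic loss up to an additive constant:
$$g'(t) = \frac{1}{1+e^{-t}}, \qquad g(t) = \int_0^t g'(\tau)\, d\tau = \ln\frac{1+e^t}{2},$$
$$f(r) = \sup_{t \in \mathbb{R}}\{rt - g(t)\} = r\ln r + (1-r)\ln(1-r) + \ln 2, \qquad f'(r) = \ln\frac{r}{1-r},$$
$$\ell_g(y, t) = g(t) - y t = \ln(1+e^t) - y t - \ln 2.$$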
The corresponding population matching loss is
$$\mathcal{L}_g(c\,; D) = \mathbb{E}_{(x,y)\sim D}\big[\ell_g(y, c(x))\big]. \tag{1.3}$$
In Equation (1.3), $c : \mathbb{R}^d \to \mathbb{R}$ is some concept and $D$ is some distribution over $\mathbb{R}^d \times [0,1]$. In the specific case where $c$ is a linear function, i.e., $c(x) = w \cdot x$ for some $w \in \mathbb{R}^d$, we may alternatively denote $\mathcal{L}_g(c\,; D)$ by $\mathcal{L}_g(w\,; D)$. We also define the Bregman divergence associated with $f$ to be $D_f(q, r) = f(q) - f(r) - (q - r) f'(r)$, for any $q, r \in \mathrm{ran}(g')$. Note that $D_f(q, r) \ge 0$, with equality iff $q = r$.

**Definition 1.9** (SIMs and GLMs as Concept Classes). For $B > 0$, we use $\mathrm{SIM}_B$ to refer to the class of all SIMs of the form $x \mapsto g'(w \cdot x)$, where $\|w\|_2 \le B$ and $g' : \mathbb{R} \to \mathbb{R}$ is an arbitrary 1-Lipschitz monotone activation that is differentiable (except possibly at finitely many points). We define $\mathrm{GLM}_{g',B}$ similarly, except that $g'$ is fixed and known.

We also define the notion of calibrated multiaccuracy that we need to obtain omnipredictors in our context.

**Definition 1.10** (Calibrated Multiaccuracy). A predictor $p : \mathbb{R}^d \to [0,1]$ is called $\varepsilon$-multiaccurate if for all $i \in [d]$, $|\mathbb{E}[x_i(y - p(x))]| \le \varepsilon$. It is called $\varepsilon$-calibrated if $\mathbb{E}_{p(x)}\big|\mathbb{E}_{y|p(x)}[y - p(x)]\big| \le \varepsilon$.

## 2 Distortion Bounds for the Matching Loss

In this section, we propose a simple approach for bounding the squared error of a predictor that minimizes a (convex) matching loss, in the agnostic setting. We convert matching loss bounds to squared loss bounds in a generic way, through appropriate pointwise distortion bounds between the two losses. In particular, for a given matching loss $\mathcal{L}_g$, we transform guarantees on $\mathcal{L}_g$ that are competitive with the optimum linear minimizer of $\mathcal{L}_g$ into guarantees on the squared error that are competitive with the optimum GLM whose activation ($g'$) depends on the matching loss at hand. We now provide the main result we establish in this section.

**Theorem 2.1** (Squared Error Minimization through Matching Loss Minimization). Let $D$ be a distribution over $\mathbb{R}^d \times [0,1]$, let $0 < \alpha \le \beta$ and let $(f, g)$ be a Fenchel-Legendre pair such that $g' : \mathbb{R} \to \mathbb{R}$ is $[\alpha, \beta]$ bi-Lipschitz. Suppose that for a predictor $p : \mathbb{R}^d \to \mathrm{ran}(g')$ we have
$$\mathcal{L}_g(f' \circ p\,; D) \le \min_{\|w\|_2 \le B} \mathcal{L}_g(w\,; D) + \varepsilon. \tag{2.1}$$
Then we also have:
$$\mathrm{err}_2(p) \le \frac{\beta}{\alpha}\, \mathrm{opt}(\mathrm{GLM}_{g',B}) + 2\beta\varepsilon.$$

The proof of Theorem 2.1 is based on the following pointwise distortion bound between matching losses corresponding to bi-Lipschitz link functions and the squared distance.

**Lemma 2.2** (Pointwise Distortion Bound for bi-Lipschitz link functions). Let $0 < \alpha \le \beta$ and let $(f, g)$ be a Fenchel-Legendre pair such that $f' : \mathrm{ran}(g') \to \mathbb{R}$ is $[\frac{1}{\beta}, \frac{1}{\alpha}]$ bi-Lipschitz. Then for any $y, p \in \mathrm{ran}(g')$ we have
$$\ell_g(y, f'(p)) - \ell_g(y, f'(y)) = D_f(y, p) \in \Big[\frac{1}{2\beta}(y-p)^2,\ \frac{1}{2\alpha}(y-p)^2\Big].$$

In the case that $f'$ is differentiable on $(0,1)$, the proof of Lemma 2.2 follows from an application of Taylor's theorem of degree 2 to the function $f$, since the Bregman divergence $D_f(y, p)$ is exactly the error of the second-degree Taylor approximation of $f(y)$ around $p$, and $f''(\xi) \in [\frac{1}{\beta}, \frac{1}{\alpha}]$ for any $\xi \in \mathrm{ran}(g')$. The relationship between $\ell_g$ and $D_f$ follows from property (1.2). Note that when $g'$ is $[\alpha, \beta]$ bi-Lipschitz, $f'$ is $[\frac{1}{\beta}, \frac{1}{\alpha}]$ bi-Lipschitz. Theorem 2.1 follows by applying Lemma 2.2 appropriately: we bound the squared error of a predictor $p$ by its matching loss $\mathcal{L}_g(f' \circ p)$, and we bound the matching loss of the linear function corresponding to $w^*$ by the squared error of $g'(w^* \cdot x)$, where $g'(w^* \cdot x)$ is the element of $\mathrm{GLM}_{g',B}$ with minimum squared error.
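Lemma 2.2 is easy to sanity-check numerically. The sketch below is ours (not from the paper); the activation $\phi'(t) = \sigma(t) + \alpha t$ is a hypothetical bi-Lipschitz example in the spirit of Section 3, with slope in $[\alpha, \tfrac14 + \alpha]$. It computes $D_f(y,p) = \ell_g(y, f'(p)) - \ell_g(y, f'(y)) = \int_{f'(y)}^{f'(p)} (g'(\tau) - y)\, d\tau$ by numerical integration and verifies the sandwich bound:

```python
import numpy as np

def check_lemma_2_2(alpha=0.1, trials=500, seed=0):
    # Bi-Lipschitz activation g'(t) = sigmoid(t) + alpha * t, slope in [alpha, 1/4 + alpha].
    gp = lambda t: 1.0 / (1.0 + np.exp(-t)) + alpha * t
    beta = 0.25 + alpha

    def gp_inv(q, lo=-60.0, hi=60.0):
        # Bisection for f'(q) = (g')^{-1}(q); g' is strictly increasing.
        for _ in range(100):
            mid = (lo + hi) / 2.0
            lo, hi = (mid, hi) if gp(mid) < q else (lo, mid)
        return (lo + hi) / 2.0

    rng = np.random.default_rng(seed)
    for _ in range(trials):
        y, p = gp(rng.uniform(-3, 3)), gp(rng.uniform(-3, 3))
        ty, tp = gp_inv(y), gp_inv(p)
        # D_f(y, p) = signed integral of (g'(tau) - y) from ty to tp (midpoint rule).
        ts = np.linspace(ty, tp, 20001)
        mids = (ts[:-1] + ts[1:]) / 2.0
        D = np.sum(gp(mids) - y) * (tp - ty) / (len(ts) - 1)
        gap = (y - p) ** 2
        assert gap / (2 * beta) - 1e-7 <= D <= gap / (2 * alpha) + 1e-7

check_lemma_2_2()
```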
Although Theorem 2.1 only applies to bi-Lipschitz activations $g'$, it has the advantage that the assumption it makes on $p$ corresponds to a convex optimization problem and, when the marginal distribution has certain concentration properties (for generalization), can be solved efficiently through gradient descent on the empirical loss function. As a consequence, for bi-Lipschitz activations we can obtain $O(\mathrm{opt})$ efficiently under mild distributional assumptions in the agnostic setting.

## 3 Agnostically Learning Single-Index Models

In this section, we provide our main result on agnostically learning SIMs. We combine the distortion bounds we established in Section 2 with results from Gopalan et al. [2023] on omniprediction, which can be used to learn a predictor $p$ that satisfies the guarantee of Theorem 2.1 simultaneously for all bi-Lipschitz activations. By doing so, we obtain a result for all Lipschitz and non-decreasing activations simultaneously.

**Theorem 3.1** (Agnostically Learning SIMs). Let $\lambda, B \ge 1$, $\varepsilon > 0$ and let $D$ be a distribution over $\mathbb{R}^d \times [0,1]$ with second moments bounded by $\lambda$. Then, Algorithm 1 agnostically learns the class $\mathrm{SIM}_B$ over $D$ up to squared error $O\big(B\sqrt{\lambda \cdot \mathrm{opt}(\mathrm{SIM}_B, D)}\big) + \varepsilon$, using time and sample complexity $\mathrm{poly}(d, B, \lambda, \frac{1}{\varepsilon})$. Moreover, the same is true for any algorithm with an omniprediction guarantee like the one of Theorem 3.2.

In order to apply Theorem 2.1, we use the following theorem, which is a combination of results in Gopalan et al. [2023], where it is shown that the matching losses corresponding to a wide class of functions can all be minimized simultaneously by an efficiently computable predictor.

**Theorem 3.2** (Omnipredictors for Matching Losses, combination of results in Gopalan et al. [2023]). Let $\lambda, L, R, B \ge 1$, $\varepsilon > 0$ and let $D$ be a distribution over $\mathbb{R}^d \times [0,1]$ whose marginal on $\mathbb{R}^d$ has $\lambda$-bounded second moments. Then Algorithm 1, given sample access to $D$, with probability at least $2/3$ returns a predictor $p : \mathbb{R}^d \to (0,1)$ with the following guarantee. For any Fenchel-Legendre pair $(f, g)$ such that $g' : \mathbb{R} \to \mathbb{R}$ is $L$-Lipschitz and $f'$ takes values within the interval $[-R, R]$, $p$ satisfies
$$\mathcal{L}_g(f' \circ p\,; D) \le \min_{\|w\|_2 \le B} \mathcal{L}_g(w\,; D) + \varepsilon.$$
The algorithm requires time and sample complexity $\mathrm{poly}(\lambda, B, L, R, \frac{1}{\varepsilon})$.

Algorithm 1 is a version of the algorithm of Gopalan et al. [2023] for calibrated multiaccuracy, specialized to our setting. For a more quantitative version of Theorem 3.2, see Theorem C.3 in the appendix. We aim to apply Theorem 3.2 to the class of all Lipschitz activations (which is wider than the class of bi-Lipschitz activations). This is enabled by the following lemma, whose proof is based on Theorem 2.1 and the fact that the error of a predictor is bounded by the sum of the error of another predictor and the expected squared distance between the two predictors.

**Lemma 3.3.** Let $D$ be a distribution over $\mathbb{R}^d \times [0,1]$. Let $g' : \mathbb{R} \to \mathbb{R}$ be some fixed activation. Consider the class $\mathrm{GLM}_{g',B}$, and let $w^*$ be the weights achieving $\mathrm{opt}(\mathrm{GLM}_{g',B}, D)$. Let $\phi' : \mathbb{R} \to \mathbb{R}$ be an $[\alpha, \beta]$ bi-Lipschitz function (differentiable except possibly at finitely many points) by which we wish to approximate $g'$, and let $f$ be its Fenchel-Legendre dual. Any predictor $p : \mathbb{R}^d \to \mathbb{R}$ that satisfies
$$\mathcal{L}_\phi(f' \circ p\,; D) \le \min_{\|w\|_2 \le B} \mathcal{L}_\phi(w\,; D) + \varepsilon$$
also satisfies the following $\ell_2$ error guarantee:
$$\mathrm{err}_2(p) \le \frac{2\beta}{\alpha}\,\mathrm{opt}(\mathrm{GLM}_{g',B}) + \frac{2\beta}{\alpha}\,\mathbb{E}\big[(g'(w^* \cdot x) - \phi'(w^* \cdot x))^2\big] + 2\beta\varepsilon.$$

**Algorithm 1:** Calibrated Multiaccuracy (modification of Algorithm 2 in Gopalan et al. [2023])
**Input:** Sufficiently large training set $S$ over $\mathbb{R}^d \times [0,1]$; parameters $\varepsilon, \lambda, B, L, R$
**Output:** A predictor $p : \mathbb{R}^d \to (0,1)$

1. Form a training set $S'$ by substituting each element $(x, y)$ of $S$ with $(x, y')$, where, conditioned on $y$, $y' \in \{0,1\}$ is an independent Bernoulli random variable with probability of success $y$.
2. Let $A = BL$, let $C > 0$ be a sufficiently large universal constant, and set $\delta = \frac{\varepsilon^2}{CA^2\lambda R^2}$ and $\sigma = \frac{\varepsilon - \delta R}{2CA^2\lambda R}$.
3. Set $p \equiv 0$ (the zero function).
4. Repeat (outer loop):
   1. Repeat (inner loop):
      1. Let $S''$ be a subset of $S'$ with $|S''| = C\,\frac{d^2A^2\lambda R^2}{\varepsilon^2}$ and set $S' \leftarrow S' \setminus S''$.
      2. Form a set $T$ by substituting each element $(x, y')$ of $S''$ with $(x, z)$, where $z = y' - p(x)$. Let $v = \frac{1}{|T|}\sum_{(x,z)\in T} z\,x$.
      3. Reject if $\|v\|_2 \le 3(\varepsilon - A\sqrt{\lambda\delta})/(4A)$. Otherwise, let $w = Bv/\|v\|_2$ and update $p$ to the predictor $x \mapsto \big(p(x) + \sigma (w \cdot x)\big)_{[0,1]}$, where $(t)_{[0,1]} = \min\{\max\{t, 0\}, 1\}$,

      until some iteration rejects (which happens once a multiaccurate predictor has been found).
   2. Let $p_\delta : \mathbb{R}^d \to [0,1]$ be such that $p_\delta(x) = \sum_{j=0}^{1/\delta} j\delta\,\mathbf{1}\{p(x) \in I_j\}$, where $I_j = [(j - \tfrac12)\delta, (j + \tfrac12)\delta)$.
   3. Let $S''$ be a subset of $S'$ with $|S''| = C\,\frac{A^8\lambda^4R^8}{\varepsilon^8}\log\big(\frac{A\lambda R}{\varepsilon}\big)$, set $S' \leftarrow S' \setminus S''$, and let $S''_j = \{(x, y') \in S'' : p_\delta(x) = j\delta\}$.
   4. Estimate the calibration error $\mathbb{E}_{p_\delta(x)}\big|\mathbb{E}_{y'|p_\delta(x)}[y' - p_\delta(x)]\big|$ of $p_\delta$ using the samples in $S''$, via the expression $\sum_j \frac{|S''_j|}{|S''|}\,\big|j\delta - \bar y_j\big|$, where $\bar y_j = \frac{1}{|S''_j|}\sum_{(x,y')\in S''_j} y'$.
   5. If the estimate has value at most $\varepsilon/2$, output $p_\delta$ and terminate. Otherwise, update $p$ to the following (calibrated) predictor: $x \mapsto \sum_j \bar y_j\,\mathbf{1}\{p(x) \in I_j\}$,

   until some iteration terminates the execution.

By combining Theorem 3.2 with Lemma 3.3 (whose proofs can be found in Appendix C), we are now ready to prove our main theorem.

**Proof of Theorem 3.1.** We will combine Theorem 3.2, which states that there is an efficient algorithm that simultaneously minimizes the matching losses corresponding to bounded, non-decreasing and Lipschitz activations, with Lemma 3.3, which implies that minimizing the matching loss corresponding to a nearby bi-Lipschitz activation is sufficient to obtain small error. Note that we may assume $\varepsilon < 1/2$, since otherwise the problem is trivial (output the zero function and take the constant in the guarantee to be $2$).

As a first step, we show that link functions corresponding to bi-Lipschitz activations are bounded (according to Definition C.1, with $\gamma = 0$). In particular, let $\phi' : \mathbb{R} \to \mathbb{R}$ be an $[\alpha, \beta]$ bi-Lipschitz activation, for some $\beta \ge \alpha > 0$, such that $\phi'(s) \in [-1, 2]$ for some $s \in \mathbb{R}$, and let $\psi'$ be the inverse of $\phi'$ ($\phi'$ is invertible since it is strictly increasing). We will show that $\psi'(r) \in [-R, R]$ for any $r \in [0,1]$, with $R = O(|s| + 1/\alpha)$. We pick $r_0 = 0$, $r_1 = 1$ and get that $|\psi'(\phi'(s)) - \psi'(r_0)| \le \frac{1}{\alpha}|\phi'(s) - r_0| \le \frac{2}{\alpha}$. Hence $\psi'(r_0) \ge \psi'(\phi'(s)) - \frac{2}{\alpha} = s - \frac{2}{\alpha}$. Similarly, we get $\psi'(r_1) \le s + \frac{2}{\alpha}$. Therefore, $\psi'(r) \in [\psi'(0), \psi'(1)] \subseteq [-|s| - \frac{2}{\alpha},\ |s| + \frac{2}{\alpha}]$ for any $r \in [0,1]$, due to the monotonicity of $\psi'$.

For a given non-decreasing and 1-Lipschitz $g'$, we now show that there is a bounded bi-Lipschitz activation $\phi'$ such that if the assumption of Lemma 3.3 is satisfied for $\phi'$ by a predictor $p$, then the error of $p$ is bounded by
$$\mathrm{err}_2(p) \le O\big(B\sqrt{\lambda}\,(\mathrm{opt}(\mathrm{GLM}_{g',B}))^{1/2}\big) + O(\lambda B^2 \varepsilon).$$
Suppose, first, that $\mathrm{opt}(\mathrm{GLM}_{g',B}) \le \varepsilon^2$. Then we pick $\phi'(t) = g'(t) + \varepsilon t$. Note that $\phi'$ is $[\varepsilon, 1+\varepsilon]$ bi-Lipschitz. Moreover, since $\mathrm{opt}(\mathrm{GLM}_{g',B}) \le \varepsilon^2$, we must have some $s \in \mathbb{R}$ with $|s| \le 2\lambda B^2$ such that $g'(s) \in [-1, 2]$. Otherwise, we would have
$$\mathrm{opt}(\mathrm{GLM}_{g',B}) = \mathbb{E}\big[(g'(w^*\cdot x) - y)^2\big] \ge \mathbb{E}\big[(g'(w^*\cdot x) - y)^2 \,\big|\, |w^*\cdot x| \le 2\lambda B^2\big]\,\mathbb{P}\big[|w^*\cdot x| \le 2\lambda B^2\big] \ge \mathbb{P}\big[|w^*\cdot x| \le 2\lambda B^2\big] = 1 - \mathbb{P}\big[|w^*\cdot x| > 2\lambda B^2\big] \ge 1 - \tfrac14 > \varepsilon^2,$$
due to Chebyshev's inequality, the fact that $\|w^*\|_2 \le B$, and the bounded moments assumption.
Therefore, $\psi'$ is $(R = 2\lambda B^2 + \frac{2}{\varepsilon},\ \gamma = 0)$-bounded, and we have
$$\mathbb{E}\big[(g'(w^*\cdot x) - \phi'(w^*\cdot x))^2\big] = \varepsilon^2\,\mathbb{E}[(w^*\cdot x)^2] \le \varepsilon^2\lambda B^2.$$
As a consequence, under the assumption of Lemma 3.3 for $\phi'$, the error of the corresponding predictor $p$ is
$$\mathrm{err}_2(p) \le 2(1+\varepsilon)\varepsilon + 2(1+\varepsilon)\varepsilon\lambda B^2 + 2(1+\varepsilon)\varepsilon = O(\lambda B^2\varepsilon).$$

In the case that $\mathrm{opt}(\mathrm{GLM}_{g',B}) > \varepsilon^2$, we pick $\phi'(t) = g'(t) + t\,\frac{\sqrt{\mathrm{opt}(\mathrm{GLM}_{g',B})}}{B\sqrt{\lambda}}$. We may also assume that $\mathrm{opt}(\mathrm{GLM}_{g',B}) \le 1/2$, since otherwise any predictor with range $[0,1]$ has error at most $2\,\mathrm{opt}(\mathrm{GLM}_{g',B})$. Then $\psi'$ is $\big(O(\lambda B^2 + \frac{B\sqrt{\lambda}}{\varepsilon}),\ 0\big)$-bounded and $\phi'$ is $\big[\frac{\sqrt{\mathrm{opt}(\mathrm{GLM}_{g',B})}}{B\sqrt\lambda},\ 1 + \frac{\sqrt{\mathrm{opt}(\mathrm{GLM}_{g',B})}}{B\sqrt\lambda}\big]$ bi-Lipschitz, which gives
$$\mathbb{E}\big[(g'(w^*\cdot x) - \phi'(w^*\cdot x))^2\big] = \frac{\mathrm{opt}(\mathrm{GLM}_{g',B})}{B^2\lambda}\,\mathbb{E}[(w^*\cdot x)^2] \le \mathrm{opt}(\mathrm{GLM}_{g',B}).$$
As a consequence, under the assumption of Lemma 3.3 for $\phi'$, the error of the corresponding predictor $p$ is
$$\mathrm{err}_2(p) \le 4\Big(1+\tfrac1B\Big)B\sqrt{\lambda\cdot\mathrm{opt}(\mathrm{GLM}_{g',B})} + 2\Big(1+\tfrac1B\Big)\varepsilon.$$
Using a similar approach as in the case $\mathrm{opt}(\mathrm{GLM}_{g',B}) \le \varepsilon^2$, we can show that $\psi'$ is polynomially bounded (as per Definition C.1), since $\varepsilon^2 < \mathrm{opt}(\mathrm{GLM}_{g',B}) \le 1/2$.

To conclude the proof of Theorem 3.1, we apply Theorem 3.2 with an appropriate (polynomial) choice of parameters ($R = O(\lambda B^2 + \frac{B\sqrt\lambda}{\varepsilon})$, $L = 2$), to show that there is an efficient algorithm that outputs a predictor $p : \mathbb{R}^d \to (0,1)$ for which the assumption of Lemma 3.3 holds simultaneously for all bi-Lipschitz activations $\phi'$ with sufficiently bounded inverses $\psi'$.

## 4 Stronger Guarantees for Logistic Regression

In this section, we follow the same recipe we used in Section 2 to get distortion bounds similar to Theorem 2.1 for the sigmoid activation (or, equivalently, the logistic model), under the assumption that the marginal distribution is sufficiently concentrated (see Definition 1.6). In particular, Theorem 2.1 does not apply, since the sigmoid is not bi-Lipschitz, and our main Theorem 3.1 only provides a guarantee of $O(\sqrt{\mathrm{opt}})$ for the squared error. We use appropriate pointwise distortion bounds for the matching loss corresponding to the sigmoid activation and provide guarantees of $\widetilde{O}(\mathrm{opt})$ for logistic regression with respect to both squared and absolute error, under appropriate assumptions about the concentration of the marginal distribution. The proofs of this section are provided in Appendix D.

For the logistic model, the link function $f'$ is $f'(r) = \ln\frac{r}{1-r}$ for $r \in (0,1)$, and the corresponding activation $g'$ is the sigmoid $g'(t) = \frac{1}{1+e^{-t}}$ for $t \in \mathbb{R}$. The corresponding matching loss is the logistic loss.

**Squared error.** We first provide a result for squared loss minimization. In comparison to Theorem 2.1, the qualitative interpretation of the following theorem is that, while the sigmoid activation is not bi-Lipschitz, it is effectively bi-Lipschitz under sufficiently concentrated marginals.

**Theorem 4.1** (Squared Loss Minimization through Logistic Loss Minimization). Let $D$ be a distribution over $\mathbb{R}^d \times [0,1]$ whose marginal on $\mathbb{R}^d$ is $(1,2)$-concentrated. Let $g' : \mathbb{R} \to \mathbb{R}$ be the sigmoid activation, i.e., $g'(t) = (1+e^{-t})^{-1}$ for $t \in \mathbb{R}$. Assume that for some $B > 0$, $\varepsilon > 0$ and a predictor $p : \mathbb{R}^d \to (0,1)$ we have
$$\mathcal{L}_g(f' \circ p\,; D) \le \min_{w : \|w\|_2 \le B} \mathcal{L}_g(w\,; D) + \varepsilon. \tag{4.1}$$
If we let $\mathrm{opt}_g = \min_{\|w\|_2 \le B} \mathrm{err}_2(g'_w)$, where $g'_w(x) = g'(w\cdot x)$, then for the predictor $p$ and some universal constant $C > 0$ we also have
$$\mathrm{err}_2(p) \le C\,\mathrm{opt}_g\,\exp\Big(B^2 + B\sqrt{\log\tfrac{1}{\mathrm{opt}_g}}\Big) + C\varepsilon.$$
In particular, the squared error of $p$ is upper bounded by $\widetilde{O}(\mathrm{opt}_g)$ (treating $B$ as a constant), since the function $t \mapsto \exp(\log^{1/2} t)$ is asymptotically smaller than any polynomial function $t \mapsto t^{\gamma}$ with $\gamma > 0$.

Once more, the proof of our result is based on an appropriate pointwise distortion bound, which we provide below.
It follows from the fact that the Bregman divergence corresponding to the sigmoid is the Kullback-Leibler divergence, combining Pinsker's inequality (lower bound) with Lemma 4.1 of Götze et al. [2019] (upper bound).

**Lemma 4.2** (Pointwise Distortion Bound for Sigmoid). Let $g'$ be the sigmoid activation. Then for any $y, p \in (0,1)$ we have
$$D_f(y, p) = D_{\mathrm{KL}}(y \,\|\, p) = y\ln\frac{y}{p} + (1-y)\ln\frac{1-y}{1-p}.$$
Moreover,
$$\ell_g(y, f'(p)) - \ell_g(y, f'(y)) = D_{\mathrm{KL}}(y \,\|\, p) \in \Big[\frac12 (y-p)^2,\ \frac{2}{\min\{p, 1-p\}}(y-p)^2\Big].$$

We translate Lemma 4.2 into a bound on the squared error of a matching loss minimizer (in this case, the logistic loss) using an approach similar to the one for Theorem 2.1. In order to use the upper bound on the surrogate loss provided by Lemma 4.2, we apply it with $p \leftarrow g'(w \cdot x)$, where $g'$ is the sigmoid function, and observe that the quantity $\frac{1}{p(1-p)}$ is exponential in $|w \cdot x|$. Hence, when the marginal is $(\lambda, 2)$-concentrated (subgaussian concentration), $\frac{1}{p(1-p)}$ is effectively bounded.

**Absolute error.** All of the results we have provided so far have focused on squared error minimization. We now show that our approach yields results even for the absolute error, which can also be viewed as learning in the p-concept model [Kearns and Schapire, 1994]. In particular, for a distribution $D$ over $\mathbb{R}^d \times [0,1]$, we define the absolute error of a predictor $p : \mathbb{R}^d \to [0,1]$ as
$$\mathrm{err}_1(p) = \mathbb{E}_{(x,y)\sim D}\big[|y - p(x)|\big].$$
In the specific case when the labels are binary, i.e., $y \in \{0,1\}$, we have
$$\mathrm{err}_1(p) = \mathbb{E}_{(x,y)\sim D}\big[|y - p(x)|\big] = \mathbb{P}_{(x,y,y_p)\sim D_p}[y \ne y_p]$$
(see Proposition A.2), where the distribution $D_p$ is over $\mathbb{R}^d \times \{0,1\} \times \{0,1\}$ and is formed by drawing samples $(x, y)$ from $D$ and, given $x$, forming $y_p$ by drawing a conditionally independent Bernoulli random variable with parameter $p(x)$. We provide the following result.

**Theorem 4.3** (Absolute Loss Minimization through Logistic Loss Minimization). Let $D$ be a distribution over $\mathbb{R}^d \times \{0,1\}$ whose marginal on $\mathbb{R}^d$ is $(1,1)$-concentrated. Let $g' : \mathbb{R} \to \mathbb{R}$ be the sigmoid activation, i.e., $g'(t) = (1+e^{-t})^{-1}$ for $t \in \mathbb{R}$. Assume that for some $B > 0$, $\varepsilon > 0$ and a predictor $p : \mathbb{R}^d \to (0,1)$ we have
$$\mathcal{L}_g(f' \circ p\,; D) \le \min_{w : \|w\|_2 \le B} \mathcal{L}_g(w\,; D) + \varepsilon. \tag{4.2}$$
If we let $\mathrm{opt}_g = \min_{\|w\|_2 \le B} \mathrm{err}_1(g'_w)$, then for the predictor $p$ and some universal constant $C > 0$ we also have
$$\mathrm{err}_1(p) \le C\Big(B\,\mathrm{opt}_g\log\tfrac{1}{\mathrm{opt}_g} + \varepsilon\Big).$$

The corresponding distortion bound in this case is between the absolute and logistic losses and applies when the labels are binary.

**Lemma 4.4** (Pointwise Distortion between Absolute and Logistic Loss). Let $g'$ be the sigmoid activation. Then there is a constant $c \in \mathbb{R}$ (in fact $c = \ln 2$) such that for any $y \in \{0,1\}$ and $p \in (0,1)$ we have
$$\ell_g(y, f'(p)) + c \in \Big[\,|y - p|,\ 2|y-p|\ln\tfrac{1}{p(1-p)}\Big].$$

The bound of Theorem 4.3 implies an algorithm for learning an unknown sigmoid neuron in the p-concept model, by minimizing a convex loss. While there are algorithms achieving stronger guarantees [Diakonikolas et al., 2022b] for agnostically learning sigmoid neurons, such algorithms typically make strong distributional assumptions, including concentration, anti-concentration and anti-anti-concentration, or boundedness. Moreover, it is useful to compare the bound we provide in Theorem 4.3 to a lower bound by Diakonikolas et al. [2020c, Theorem 4.1], which concerns the problem of agnostically learning halfspaces by minimizing convex surrogates. In particular, they show that even under log-concave marginals, no convex surrogate loss can achieve a guarantee better than $O(\mathrm{opt}\log(1/\mathrm{opt}))$, where $\mathrm{opt}$ is measured with respect to the $\ell_1$ error (which is equal to the probability of error).
The result is not directly comparable to our upper bound, since we examine the sigmoid activation. Their setting can be viewed as a limiting case of ours, obtained by letting the norm of the weight vector grow indefinitely (the sigmoid tends to the step function); the main complication is that our upper bound is of the form $O(B\,\mathrm{opt}\log(1/\mathrm{opt}))$, which scales with $B$. However, their lower bound concerns marginal distributions that are not only concentrated, but also anti-concentrated and anti-anti-concentrated, while our results only make concentration assumptions.

## 5 Necessity of Norm Dependence

In this final section, we use a lower bound due to Diakonikolas et al. [2022a] on agnostic learning of GLMs using SQ algorithms and compare it with our main result (Theorem 3.1). For simplicity, we specialize to the case of the standard sigmoid or logistic function. A modification of their proof ensures that the bound holds under isotropic marginals.²

**Theorem 5.1** (SQ Lower Bound for Agnostically Learning GLMs, variant of [Diakonikolas et al., 2022a, Thm C.3]). Let $g' : \mathbb{R} \to \mathbb{R}$ be the standard logistic function. Any SQ algorithm requires either $d^{\omega(1)}$ queries or queries of tolerance $d^{-\omega(1)}$ to distinguish between the following two labeled distributions:

- (Labels have signal.) $D_{\mathrm{signal}}$ on $\mathbb{R}^d \times \mathbb{R}$ is such that $\mathrm{opt}(\mathrm{GLM}_{g',B}, D_{\mathrm{signal}}) \le \exp(-\Omega(\log^{1/4} d)) = o(1)$ for some $B = \mathrm{poly}(d)$.
- (Labels are random.) $D_{\mathrm{random}}$ on $\mathbb{R}^d \times \mathbb{R}$ is such that the labels $y$ are drawn i.i.d. from $\{a, b\}$ for certain universal constants $a, b \in [0,1]$. In particular, $\mathrm{opt}(\mathrm{GLM}_{g',B}, D_{\mathrm{random}}) = \Omega(1)$ for any $B$.

Both $D_{\mathrm{signal}}$ and $D_{\mathrm{random}}$ have the same marginal on $\mathbb{R}^d$, with 1-bounded second moments.

Let us consider applying our main theorem (Theorem 3.1) to this setting, with $D$ being either $D_{\mathrm{signal}}$ or $D_{\mathrm{random}}$, and with the same $B = \mathrm{poly}(d)$ as is required to achieve small error in the "labels have signal" case. We would obtain a predictor with $\ell_2$ error at most $B\sqrt{\mathrm{opt}(\mathrm{GLM}_{g',B})}$ (or indeed with $\mathrm{SIM}_B$ in place of $\mathrm{GLM}_{g',B}$). Since this is $\omega(1)$, this guarantee is insufficient to distinguish the two cases above, which is as it should be, since our main algorithm indeed fits into the SQ framework.

Theorem 5.1 does, however, justify a dependence on the norm $B$ in our main result. In particular, it is clear that a guarantee of the form $\mathrm{opt}(\mathrm{GLM}_{g',B})^c$ for any universal constant $c > 0$ (independent of $B$) would be too strong, as it would let us distinguish the two cases above. In fact, this lower bound rules out a large space of potential error guarantees stated as functions of $B$ and $\mathrm{opt}(\mathrm{GLM}_{g',B})$. For instance, for sufficiently large $d$, it rules out any error guarantee of the form $\exp(O(\log^{1/5} B))\cdot \mathrm{opt}(\mathrm{GLM}_{g',B})^c$ for any universal constant $c > 0$.

²Specifically, our features correspond to all multilinear monomials (or parities) of degree at most $k$ over $\{\pm 1\}^n$, whereas they use all monomials (not necessarily multilinear) of degree at most $k$. These yield equivalent representations since the hard distributions are obtained from the uniform distribution on $\{\pm 1\}^n$.

## Acknowledgments and Disclosure of Funding

We wish to thank the anonymous reviewers of NeurIPS 2023 for their constructive feedback. Aravind Gollakota was at UT Austin while this work was done, supported by NSF award AF-1909204 and the NSF AI Institute for Foundations of Machine Learning (IFML). Adam R. Klivans was supported by NSF award AF-1909204 and the NSF AI Institute for Foundations of Machine Learning (IFML).
Konstantinos Stavropoulos was supported by NSF award AF-1909204, the NSF AI Institute for Foundations of Machine Learning (IFML), and by scholarships from the Bodossaki Foundation and the Leventis Foundation.

## References

Parikshit Gopalan, Lunjia Hu, Michael P. Kim, Omer Reingold, and Udi Wieder. Loss minimization through the lens of outcome indistinguishability. In Yael Tauman Kalai, editor, 14th Innovations in Theoretical Computer Science Conference (ITCS 2023), volume 251 of LIPIcs, pages 60:1-60:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023.

Peter McCullagh. Generalized linear models. European Journal of Operational Research, 16(3):285-292, 1984. ISSN 0377-2217. doi: 10.1016/0377-2217(84)90282-0. URL https://www.sciencedirect.com/science/article/pii/0377221784902820.

Alan Agresti. Foundations of Linear and Generalized Linear Models. John Wiley & Sons, 2015.

Sham M. Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. Advances in Neural Information Processing Systems, 24, 2011.

Adam Tauman Kalai and Ravi Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, 2009.

Surbhi Goel, Sushrut Karmalkar, and Adam Klivans. Time/accuracy tradeoffs for learning a ReLU with respect to Gaussian marginals. Advances in Neural Information Processing Systems, 32, 2019.

Ilias Diakonikolas, Daniel Kane, and Nikos Zarifis. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. Advances in Neural Information Processing Systems, 33:13586-13596, 2020a.

Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33:2147-2158, 2020.

Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas, and Nikos Zarifis. The optimality of polynomial regression for agnostic learning under Gaussian marginals in the SQ model. In Proceedings of the Thirty-Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 1552-1584. PMLR, 2021.

Ilias Diakonikolas, Daniel Kane, Pasin Manurangsi, and Lisheng Ren. Hardness of learning a single neuron with adversarial label noise. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 8199-8213. PMLR, 2022a.

Peter Auer, Mark Herbster, and Manfred K. Warmuth. Exponentially many local minima for single neurons. In Advances in Neural Information Processing Systems, volume 8. MIT Press, 1995. URL https://proceedings.neurips.cc/paper_files/paper/1995/file/3806734b256c27e41ec2c6bffa26d9e7-Paper.pdf.

Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022.

Spencer Frei, Yuan Cao, and Quanquan Gu. Agnostic learning of a single neuron with gradient descent.
Advances in Neural Information Processing Systems, 33:5417-5428, 2020.

Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning a single neuron with adversarial label noise via gradient descent. In Proceedings of the Thirty-Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 4313-4361. PMLR, 2022b. URL https://proceedings.mlr.press/v178/diakonikolas22c.html.

Ilias Diakonikolas, Surbhi Goel, Sushrut Karmalkar, Adam R. Klivans, and Mahdi Soltanolkotabi. Approximation schemes for ReLU regression. In Conference on Learning Theory, pages 1452-1485. PMLR, 2020b.

Ryan O'Donnell and Rocco A. Servedio. The Chow parameters problem. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 517-526, 2008.

Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1939-1948. PMLR, 2018. URL https://proceedings.mlr.press/v80/hebert-johnson18a.html.

Michael P. Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 247-254, 2019.

Shai Shalev-Shwartz, Ohad Shamir, and Karthik Sridharan. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40(6):1623-1646, 2011.

Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the ReLU in polynomial time. In Conference on Learning Theory, pages 1004-1042. PMLR, 2017.

Friedrich Götze, Holger Sambale, and Arthur Sinulis. Higher order concentration for functions of weakly dependent random variables. Electronic Journal of Probability, 24:1-19, 2019.

Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464-497, 1994.

Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Non-convex SGD learns halfspaces with adversarial label noise. In Advances in Neural Information Processing Systems, volume 33, pages 18540-18549. Curran Associates, Inc., 2020c. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/d785bf9067f8af9e078b93cf26de2b54-Paper.pdf.

## A Technical Lemmas

In this section, we provide some technical lemmas that we use in our proofs.

**Proposition A.1** (Weak Learner for Linear Functions). Let $D$ be a distribution over $\mathbb{R}^d \times [-1,1]$ whose marginal on $\mathbb{R}^d$ has $\lambda$-bounded second moments, and let $B > 0$. For any $\varepsilon > 0$ and $\delta \in (0,1)$, there is a universal constant $C > 0$ and an algorithm that, given a set $S$ of i.i.d. samples from $D$ of size at least $C\,\frac{d^2\lambda B^2}{\varepsilon^2}\log\frac{1}{\delta}$, runs in time $O(d\,|S|)$ and satisfies the following specifications with probability at least $1-\delta$:

1. If $\mathbb{E}_{(x,z)\sim D}[z(w\cdot x)] \ge \varepsilon$ for some $w \in \mathbb{R}^d$ with $\|w\|_2 \le B$, then the algorithm accepts. Otherwise, it may or may not reject and return a special symbol.
2. If the algorithm accepts, then it returns $w' \in \mathbb{R}^d$ with $\|w'\|_2 \le B$ such that $\mathbb{E}_{(x,z)\sim D}[z(w'\cdot x)] \ge \varepsilon/4$.

**Proof.** We will prove the proposition for $\delta = 1/6$; the probability of success can be boosted with repetition. The algorithm computes the vector $v = \mathbb{E}_S[z\,x]$.
If $\|v\|_2 \le \frac{3\varepsilon}{4B}$, then the algorithm rejects and outputs a special symbol. Otherwise, it outputs the vector $\frac{B}{\|v\|_2}\,v$.

Suppose, first, that $\mathbb{E}_{(x,z)\sim D}[z(w\cdot x)] \ge \varepsilon$ for some $w$ with $\|w\|_2 \le B$. Then, due to Chebyshev's inequality, we have for any $i \in [d]$
$$\mathbb{P}\Big[\big|\mathbb{E}_S[z\,x_i] - \mathbb{E}_D[z\,x_i]\big| > \frac{\varepsilon}{8\sqrt{d}\,B}\Big] \le \frac{64\, d\, B^2\, \mathbb{E}[x_i^2]}{|S|\,\varepsilon^2} \le \frac{1}{6d} \quad (\text{for large enough } |S|).$$
Hence, with probability at least $5/6$, we have $\|\mathbb{E}_S[z x] - \mathbb{E}_D[z x]\|_2 \le \frac{\varepsilon}{8B}$, due to a union bound. Therefore,
$$\|v\|_2 \ge \|\mathbb{E}_D[z x]\|_2 - \frac{\varepsilon}{8B} \ge \frac{\mathbb{E}_D[z(w\cdot x)]}{B} - \frac{\varepsilon}{8B} \ge \frac{7\varepsilon}{8B} > \frac{3\varepsilon}{4B},$$
and the algorithm accepts.

Suppose, now, that the algorithm accepts. Then we have $\|v\|_2 > \frac{3\varepsilon}{4B}$ and (with probability at least $5/6$) $\|\mathbb{E}_S[z x] - \mathbb{E}_D[z x]\|_2 \le \frac{\varepsilon}{8B}$, so for the output $w' = \frac{B}{\|v\|_2}v$ we get
$$\mathbb{E}_D[z(w'\cdot x)] = \frac{B}{\|v\|_2}\, v\cdot \mathbb{E}_D[z x] \ge \frac{B}{\|v\|_2}\Big(\|v\|_2^2 - \|v\|_2\,\frac{\varepsilon}{8B}\Big) = B\|v\|_2 - \frac{\varepsilon}{8} \ge \frac{\varepsilon}{4}.$$
This concludes the proof.

**Proposition A.2.** Let $D$ be a distribution over $\mathbb{R}^d \times \{0,1\}$ and $p : \mathbb{R}^d \to [0,1]$. Consider the distribution $D_p$ over $\mathbb{R}^d \times \{0,1\} \times \{0,1\}$, which is formed by drawing samples $(x, y)$ from $D$ and, given $x$, forming $y_p$ by drawing a conditionally independent Bernoulli random variable with parameter $p(x)$. Then we have
$$\mathrm{err}_1(p) = \mathbb{E}_{(x,y)\sim D}\big[|y - p(x)|\big] = \mathbb{P}_{(x,y,y_p)\sim D_p}[y \ne y_p].$$

**Proof.** Since, over $D_p$, $y$ and $y_p$ are conditionally independent given $x$, we have
$$\mathrm{err}_1(p) = \mathbb{E}[|y - p(x)|] = \mathbb{E}\big[(1-p(x))\mathbf{1}\{y=1\} + p(x)\mathbf{1}\{y=0\}\big] = \mathbb{E}_x\big[\mathbb{P}[y_p = 0 \mid x]\,\mathbb{P}[y = 1 \mid x] + \mathbb{P}[y_p = 1 \mid x]\,\mathbb{P}[y = 0 \mid x]\big] = \mathbb{E}_x\big[\mathbb{P}[y \ne y_p \mid x]\big] = \mathbb{P}[y \ne y_p].$$

## B Proofs from Section 2

### B.1 Proof of Theorem 2.1

To prove Theorem 2.1, we first prove the following more general theorem. Theorem 2.1 may then be easily recovered by setting $\mathcal{H} = \mathrm{GLM}_{g',B}$ and observing that $f'(g'(w\cdot x)) = w\cdot x$, since $g'$ is invertible.

**Theorem B.1** (Squared Error Minimization through Distorted Matching Loss Minimization). Let $D$ be a distribution over $\mathbb{R}^d \times [0,1]$, let $0 < \alpha \le \beta$, and let $(f, g)$ be a pair of Fenchel-Legendre dual functions such that $g' : \mathbb{R} \to \mathbb{R}$ is continuous and non-decreasing and $f' : \mathrm{ran}(g') \to \mathbb{R}$ is $[\frac{1}{\beta}, \frac{1}{\alpha}]$ bi-Lipschitz. Let $\varepsilon > 0$ and $\mathcal{H} \subseteq \{h : \mathbb{R}^d \to \mathrm{ran}(g')\}$. Assume that for a predictor $p : \mathbb{R}^d \to \mathrm{ran}(g')$ we have
$$\mathcal{L}_g(f' \circ p\,; D) \le \min_{h \in \mathcal{H}} \mathcal{L}_g(f' \circ h\,; D) + \varepsilon. \tag{B.1}$$
Then, for the predictor $p$, we also have: $\mathrm{err}_2(p) \le \frac{\beta}{\alpha}\min_{h\in\mathcal{H}}\mathrm{err}_2(h) + 2\beta\varepsilon$.

**Proof.** We apply Lemma 2.2 with $y \leftarrow y$ and $p \leftarrow p(x)$ and take expectations over $D$ on both sides. We have
$$\mathrm{err}_2(p) \le 2\beta\,\mathbb{E}\big[\ell_g(y, f'(p(x)))\big] - 2\beta\,\mathbb{E}\big[\ell_g(y, f'(y))\big].$$
Therefore, writing $Q = \mathbb{E}[\ell_g(y, f'(y))]$, we can bound the squared error of $p$ as follows:
$$\mathrm{err}_2(p) \le 2\beta\,\mathcal{L}_g(f'\circ p\,; D) - 2\beta Q \le 2\beta\,\mathcal{L}_g(f'\circ h\,; D) + 2\beta\varepsilon - 2\beta Q \quad (\text{by assumption, for any } h\in\mathcal{H}).$$
We now apply Lemma 2.2 again, with $y \leftarrow y$ and $p \leftarrow h(x)$, and similarly obtain
$$\mathbb{E}\big[\ell_g(y, f'(h(x)))\big] - Q \le \frac{1}{2\alpha}\,\mathrm{err}_2(h).$$
Therefore, for any $h \in \mathcal{H}$, we have in total: $\mathrm{err}_2(p) \le \frac{\beta}{\alpha}\,\mathrm{err}_2(h) + 2\beta\varepsilon$.

We now prove Lemma 2.2, which we restate here for convenience.

**Lemma B.2.** Assume $f'$ is $[\frac{1}{\beta}, \frac{1}{\alpha}]$ bi-Lipschitz and differentiable except at finitely many points on any bounded interval. Then for any $y, p \in \mathrm{ran}(g')$ we have
$$\ell_g(y, f'(p)) - \ell_g(y, f'(y)) = D_f(y, p) \in \Big[\frac{1}{2\beta}(y-p)^2,\ \frac{1}{2\alpha}(y-p)^2\Big].$$

**Proof.** We first show that $\ell_g(y, f'(p)) - \ell_g(y, f'(y)) = D_f(y, p)$. In particular, we have
$$g(f'(p)) = f'(p)\, g'(f'(p)) - f(g'(f'(p))) = p\,f'(p) - f(p),$$
since $f'(p) \in \mathrm{ran}(f')$ and we know that $g(t) = t\,g'(t) - f(g'(t))$ for any $t \in \mathrm{ran}(f')$, as well as $g'(f'(p)) = p$ for any $p \in \mathrm{ran}(g')$. Therefore,
$$\ell_g(y, f'(p)) - \ell_g(y, f'(y)) = g(f'(p)) - y f'(p) - g(f'(y)) + y f'(y) = p f'(p) - f(p) - y f'(p) + f(y) = f(y) - f(p) - (y - p) f'(p) = D_f(y, p).$$
Let $\psi : \mathrm{ran}(g') \to \mathbb{R}$ be such that $\psi'(p) = f'(p)$ and $\psi$ is twice differentiable on the open interval between $y$ and $p$, with $\psi''(\xi) \in [\frac{1}{\beta}, \frac{1}{\alpha}]$ for any $\xi$ between $y$ and $p$. Let $\gamma_y := f(y) - \psi(y)$, $\gamma_p := f(p) - \psi(p)$ and $\gamma_\psi := 2\max\{|\gamma_y|, |\gamma_p|\}$. Then, by Taylor's theorem, for some $\xi$ between $y$ and $p$,
$$D_f(y, p) = \psi(y) - \psi(p) - (y-p)\psi'(p) + (\gamma_y - \gamma_p) = \frac12\psi''(\xi)(y-p)^2 + (\gamma_y - \gamma_p) \in \Big[\frac{1}{2\beta}(y-p)^2 - 2\gamma_\psi,\ \frac{1}{2\alpha}(y-p)^2 + 2\gamma_\psi\Big],$$
for any $\psi$ as defined above (say $\psi \in \Psi$).
In particular, we have
$$D_f(y,p) \le \frac{1}{2\alpha}(y-p)^2 + 2\inf_{\psi\in\Psi}\gamma_\psi \quad\text{and}\quad D_f(y,p) \ge \frac{1}{2\beta}(y-p)^2 - 2\inf_{\psi\in\Psi}\gamma_\psi.$$
Since there are only finitely many points where the derivative is not well defined, a simple smoothing technique yields a family $\Psi$ such that $\inf_{\psi\in\Psi}\gamma_\psi = 0$.

## C Proofs from Section 3

### C.1 Proof of Theorem 3.2

We first define a boundedness property which we use in order to apply the results from Gopalan et al. [2023]. The property states that the activation function (the partial inverse of the link function) must either have a range that covers all possible labels, or have a range whose closure covers all possible labels, with the labels covered at least polynomially fast as we tend to the limits of the domain. For example, the sigmoid activation tends to 1 (resp. 0) exponentially fast as its argument increases (resp. decreases).

**Definition C.1** (Bounded Functions). Let $u : (0,1) \to \mathbb{R}$ be a non-decreasing function defined on the interval $(0,1)$. For $R, \gamma \ge 0$, we say that $u$ is $(R, \gamma)$-bounded on $[0,1]$ if for any $\varepsilon > 0$ there are $r_0 \le r_1 \in [0,1]$ such that, letting $u(r_i) = \lim_{r \to r_i} u(r)$ for $i = 0, 1$:

1. $\max\{|u(r_0)|, |u(r_1)|\} \le R\,(1/\varepsilon)^{\gamma}$;
2. $(1 - r_1)\big(u(r) - u(r_1)\big) \le \varepsilon$ for $r \ge r_1$, and $r_0\big(u(r_0) - u(r)\big) \le \varepsilon$ for $r \le r_0$.

**Proposition C.2.** Let $u : (0,1) \to (-R, R)$ be non-decreasing and continuous. Then $u$ is $(R, 0)$-bounded.

We restate a more quantitative version of Theorem 3.2 here for convenience.

**Theorem C.3** (Omnipredictors for Matching Losses, combination of results in Gopalan et al. [2023]). Let $D$ be a distribution over $\mathbb{R}^d \times [0,1]$ whose marginal on $\mathbb{R}^d$ has $\lambda$-bounded second moments. There is an algorithm that, given sample access to $D$, with high probability returns a predictor $p : \mathbb{R}^d \to (0,1)$ with the following guarantee. For any pair of Fenchel-Legendre dual functions $(f, g)$ such that $g' : \mathbb{R} \to \mathbb{R}$ is continuous, non-decreasing and $L$-Lipschitz, and $f'$ is $(R, \gamma)$-bounded (see Definition C.1), $p$ satisfies
$$\mathcal{L}_g(f' \circ p\,; D) \le \min_{\|w\|_2 \le B} \mathcal{L}_g(w\,; D) + \varepsilon.$$
The algorithm requires time
$$O\Big(\frac{d^3 B^4 L^4 \lambda^2 R^3}{\varepsilon^{3+3\gamma}} + \frac{d\, B^2 L^2 \lambda R^4}{\varepsilon^{4+4\gamma}} + \frac{B^{10} L^{10} \lambda^5 R^{12}}{\varepsilon^{12+12\gamma}}\log\frac{BL\lambda R}{\varepsilon}\Big)$$
and sample complexity
$$O\Big(\frac{d^2 B^4 L^4 \lambda^2 R^3}{\varepsilon^{3+3\gamma}} + \frac{B^8 L^8 \lambda^4 R^{10}}{\varepsilon^{10+10\gamma}}\log\frac{BL\lambda R}{\varepsilon}\Big).$$

**Proof of Theorem 3.2.** The idea of the proof (given by Gopalan et al. [2023]) is that in each repetition of both the inner and the outer loop of Algorithm 1 there is a potential function that decreases by an amount bounded away from zero. The potential function is $\mathbb{E}_D[(p^*(x) - p(x))^2]$, where $p^*(x) = \mathbb{E}_D[y' \mid x]$ (for us, $y'$ is formed by drawing a Bernoulli random variable with probability of success $y$, given an example $(x, y) \in \mathbb{R}^d \times [0,1]$ drawn from $D$). Since $\mathbb{E}_D[(p^*(x) - p(x))^2] \in [0,1]$, the number of iterations of each of the loops has to be bounded. Moreover, after the completion of each inner loop, the current value of $p$ corresponds to an approximately multiaccurate predictor, and the algorithm terminates if and only if a discretized version $p_\delta$ of (the multiaccurate) $p$ is approximately calibrated. The output is then $p_\delta$, which is close to $p$ (and hence also approximately multiaccurate), but also approximately calibrated.

We will now present a small number of technical modifications we need to make to the proof of Gopalan et al. [2023] in order to specialize it to our setting. The main difference here is that we do not assume the distribution of $x$ to have bounded norm with probability 1, but only $\lambda$-bounded second moments. In what follows, we assume that $g'$ is 1-Lipschitz, since we can set $B \leftarrow B\cdot L$ and push the Lipschitz constant into the domain of $w$.
Suppose first that the marginal of $D$ on $\mathbb{R}^d$ is supported on the unit ball $\mathbb{B}^d$ and that the labels are binary. Then the result would follow from Theorems 7.7 and A.4 of Gopalan et al. [2023]. In particular, Theorem 7.7 states that, given access to a weak learner with the specifications of Proposition A.1, there is an efficient algorithm that computes an $\varepsilon_1$-calibrated and $(\mathcal{C}, \varepsilon_1)$-multiaccurate predictor $p$, where the notions of calibration and multiaccuracy originate in the fairness literature and are defined, e.g., in Definitions 3.1 and 3.2 of Gopalan et al. [2023], and $\mathcal{C} = \{x \mapsto w\cdot x : \|w\|_2 \le B\} \cup \{x \mapsto 1\}$ (the class $\mathcal{C}$ is bounded as long as $\|x\|_2 \le 1$ almost surely). Theorem A.4 states that for $\varepsilon_2 > 0$, any $\varepsilon_1$-calibrated and $(\mathcal{C}, \varepsilon_1)$-multiaccurate predictor $p$ simultaneously minimizes the matching loss corresponding to any pair $(f, g) \in \mathcal{F}$ (where $f'$ is $(R, \gamma)$-bounded) up to error
$$R\,(1/\varepsilon_2)^{\gamma}\varepsilon_1 + \varepsilon_1 + \varepsilon_2.$$
The expression above is obtained by proving that any pair $(f, g) \in \mathcal{F}$ has the property that $f'$ is $(\varepsilon_2, R(1/\varepsilon_2)^{\gamma})$-approximately optimal (as per Definition A.1 of Gopalan et al. [2023]), for any $\varepsilon_2 > 0$. In particular, we would like to show that for any $\varepsilon_2 > 0$ there exists $\hat f'$ such that the following is true for any $r \in [0,1]$:
$$\ell_g(r, \hat f'(r)) \le \ell_g(r, f'(r)) + \varepsilon_2 \quad\text{and}\quad |\hat f'(r)| \le R\,(1/\varepsilon_2)^{\gamma}.$$
We may pick $\hat f'$ as follows (for $r_0 \le r_1$ as given by Definition C.1 with $\varepsilon \leftarrow \varepsilon_2$, since $f'$ is $(R, \gamma)$-bounded):
$$\hat f'(r) = \begin{cases} f'(r), & \text{if } r \in [r_0, r_1] \\ f'(r_0), & \text{if } r < r_0 \\ f'(r_1), & \text{if } r > r_1. \end{cases}$$
The desired result follows from the expression for $\ell_g$, the convexity of $g$ (since $g'$ is non-decreasing), and the guarantees of Definition C.1. In particular, $|\hat f'| \le \max\{|f'(r_0)|, |f'(r_1)|\}$, since $f'$ is increasing. Moreover, for $r \in [r_0, r_1]$ we have $\ell_g(r, \hat f'(r)) = \ell_g(r, f'(r))$, and for $r < r_0$ we have
$$\ell_g(r, \hat f'(r)) = \ell_g(r, f'(r_0)) = g(f'(r_0)) - r f'(r_0)$$
$$\le r_0\big(f'(r_0) - f'(r)\big) + g(f'(r)) - r f'(r_0) \quad (\text{since } g \text{ is convex})$$
$$\le r_0\big(f'(r_0) - f'(r)\big) + g(f'(r)) - r f'(r) \quad (\text{since } f' \text{ is increasing and } r < r_0)$$
$$\le \varepsilon_2 + g(f'(r)) - r f'(r) \quad (\text{since } f' \text{ is bounded according to Definition C.1})$$
$$= \ell_g(r, f'(r)) + \varepsilon_2.$$
Similarly, for $r > r_1$ we have
$$\ell_g(r, \hat f'(r)) = \ell_g(r, f'(r_1)) = g(f'(r_1)) - r f'(r_1)$$
$$\le r_1\big(f'(r_1) - f'(r)\big) + g(f'(r)) - r f'(r_1) \quad (\text{since } g \text{ is convex})$$
$$= (r - r_1)\big(f'(r) - f'(r_1)\big) + g(f'(r)) - r f'(r)$$
$$\le (1 - r_1)\big(f'(r) - f'(r_1)\big) + g(f'(r)) - r f'(r) \quad (\text{since } f' \text{ is increasing and } r_1 < r \le 1)$$
$$\le \varepsilon_2 + g(f'(r)) - r f'(r) = \ell_g(r, f'(r)) + \varepsilon_2 \quad (\text{since } f' \text{ is bounded according to Definition C.1}).$$
In order to ensure $R(1/\varepsilon_2)^{\gamma}\varepsilon_1 + \varepsilon_1 + \varepsilon_2 \le \varepsilon$, we set $\varepsilon_2 = \varepsilon/3$ and $\varepsilon_1 = (\varepsilon/3)^{1+\gamma}/R$.

However, we only assume that the marginal distribution has $\lambda$-bounded second moments, and we therefore need to make certain modifications to the proof of their Theorem 7.7. In particular, the boundedness assumption is used in the proofs of Lemma 7.2, Lemma 7.6 and Theorem 7.7 in Gopalan et al. [2023]. During the execution of Algorithm 1, two types of updates are made. The first type of update is done within the inner loop and corresponds to beginning from a predictor $p_{\mathrm{old}}$ and acquiring $p_{\mathrm{new}}$, which is the function $x \mapsto (p_{\mathrm{old}}(x) + \sigma(w\cdot x))_{[0,1]}$, where $w$ is the output of the weak learner of our Proposition A.1, run with $\varepsilon \leftarrow \varepsilon_3$ ($\varepsilon_3$ will be specified later). To lower bound the decrease in the potential function during this type of update, we use a version of Lemma 7.6 in Gopalan et al. [2023]. In this case, one needs to pick a step size $\sigma$ that is polynomially smaller than the guarantee that the weak learner provides. In particular, if the weak learner of our Proposition A.1 is run with $\varepsilon \leftarrow \varepsilon_3$, then one acquires (following the proof of Lemma 7.6 in Gopalan et al.
[2023])
$$\mathbb{E}\big[(p^*(x) - p_{\mathrm{old}}(x))^2\big] - \mathbb{E}\big[(p^*(x) - p_{\mathrm{new}}(x))^2\big] \ge \sigma\big(\varepsilon_3 - \sigma B^2\lambda\big).$$
If $\sigma$ is picked to be $\sigma = \frac{\varepsilon_3}{4B^2\lambda}$, then the quantity of interest $\mathbb{E}[(p^*(x) - p_{\mathrm{old}}(x))^2] - \mathbb{E}[(p^*(x) - p_{\mathrm{new}}(x))^2]$ is lower bounded by $\frac{\varepsilon_3^2}{16 B^2\lambda}$. Hence the number of iterations of the inner loop is upper bounded by $O(\frac{B^2\lambda}{\varepsilon_3^2})$. Note that in their original algorithm $\sigma$ was picked equal to $\varepsilon_3$, which is why $\varepsilon_3$ (or another corresponding parameter) does not appear in their proofs. The updated choice of $\sigma$ generates a polynomial overhead in time and sample complexity. We pick $\varepsilon_3 = \varepsilon_1 - B\sqrt{\lambda\delta}$ so that, each time we exit the inner loop, with high probability the current value of $p$ corresponds to an $(\varepsilon_1 - B\sqrt{\lambda\delta})$-multiaccurate predictor.

The second type of update (the outer loop update) is a calibration step, where $p_{\mathrm{new}}$ is set to be the following function (in the notation of Algorithm 1): $x \mapsto \sum_j \bar y_j\,\mathbf{1}\{p_{\mathrm{old}}(x) \in I_j\}$. In this case, Corollary 7.5 of Gopalan et al. [2023] can be used as-is to obtain $\mathbb{E}[(p^*(x) - p_{\mathrm{old}}(x))^2] - \mathbb{E}[(p^*(x) - p_{\mathrm{new}}(x))^2] \ge \varepsilon_1^2/8$, if we set $\delta \le \varepsilon_1^2/C$ for some large enough universal constant $C > 0$. Hence the number of iterations of the outer loop is upper bounded by $O(1/\varepsilon_1^2)$.

It remains to show that, once the algorithm terminates, the output is approximately calibrated and multiaccurate. Regarding multiaccuracy, whenever we exit the inner loop, the current value of $p$ corresponds (with high probability) to an $(\varepsilon_1 - B\sqrt{\lambda\delta})$-multiaccurate predictor. The predictor $p_\delta$ is close to $p$ in absolute distance and can therefore be shown to be approximately multiaccurate using a version of Lemma 7.2 in Gopalan et al. [2023]. In particular, following the proof of Lemma 7.2, the multiaccuracy guarantee for $p_\delta$ changes to $(\mathcal{C}, \alpha + B\sqrt{\lambda\delta})$-multiaccuracy, where $\alpha$ is the multiaccuracy guarantee for $p$, by using the Cauchy-Schwarz inequality and bounding $\mathbb{E}[(w\cdot x)^2]$ by $B^2\lambda$. Since $\alpha = \varepsilon_1 - B\sqrt{\lambda\delta}$, the multiaccuracy guarantee is $\varepsilon_1$. The calibration guarantee follows from the fact that the termination criterion corresponds to empirically checking whether $p_\delta$ is calibrated: by Lemma 7.4 in Gopalan et al. [2023], the empirical estimate is close to the true calibration error. Overall, once we terminate, the output is $\varepsilon_1$-calibrated and $\varepsilon_1$-multiaccurate. Since the time and sample complexity of each call to the weak learner depends on $\varepsilon_3$, we pick $\delta = \frac{\varepsilon_1^2}{C(1 + B^2\lambda)}$ so that $\varepsilon_3 = \Theta(\varepsilon_1)$ and $\delta \le \varepsilon_1^2/C$, as required.

**Sample complexity.** For each of the inner loop iterations, we require a fresh sample of size $O(\frac{d^2 B^2\lambda}{\varepsilon_1^2})$, as specified by Proposition A.1 (setting the failure probability to a sufficiently small constant). For the outer loop, we require $O(\frac{1 + B^8\lambda^4}{\varepsilon_1^8}\log(\frac{1 + B\lambda}{\varepsilon_1}))$ samples per iteration, so that the calibration error estimate is accurate enough, according to Lemma 7.4 in Gopalan et al. [2023].

**Time complexity.** Each of the inner loop iterations requires $O(\frac{d^3 B^2\lambda}{\varepsilon_1^2})$ time, and each of the outer loop iterations requires an additional $O\big(d\,\frac{1 + B^2\lambda}{\varepsilon_1^2} + \frac{1 + B^{10}\lambda^5}{\varepsilon_1^{10}}\log(\frac{1 + B\lambda}{\varepsilon_1})\big)$ time.

The final technical complication we need to address is that their algorithm works only given binary labels. We can, however, form binary labels as follows. Let $(x, y)$ be drawn from $D$, with $y \in [0,1]$. Given $y$, we draw an independent Bernoulli random variable $y'$ with probability of success $y$, forming the distribution $D'$ over $\mathbb{R}^d \times \{0,1\}$. We run the algorithm of Gopalan et al.
[2023] on $D'$ and receive a predictor $p$ such that
$$\mathcal{L}_g(f' \circ p\,; D') \le \min_{\|w\|_2 \le B} \mathcal{L}_g(w\,; D') + \varepsilon, \quad \text{for any } (f, g) \in \mathcal{F}.$$
We have, for any concept $c$,
$$\mathcal{L}_g(c\,; D') = \mathbb{E}_{x,y'}\big[g(c(x)) - y'\,c(x)\big] = \mathbb{E}_x\Big[g(c(x)) - \mathbb{E}_y\big[\mathbb{E}_{y'|y}[y'] \,\big|\, x\big]\,c(x)\Big] = \mathbb{E}_x\Big[g(c(x)) - \mathbb{E}[y \mid x]\,c(x)\Big] = \mathbb{E}_{x,y}\big[g(c(x)) - y\,c(x)\big] = \mathcal{L}_g(c\,; D).$$
This concludes the proof of Theorem 3.2.

### C.2 Proof of Lemma 3.3

We first apply Theorem 2.1 with $g' \leftarrow \phi'$ to get that, for $\phi'_{w^*}(x) = \phi'(w^*\cdot x)$,
$$\mathrm{err}_2(p) \le \frac{\beta}{\alpha}\,\mathrm{err}_2(\phi'_{w^*}) + 2\beta\varepsilon \quad (\text{since the assumed inequality holds in particular against } w^*).$$
Moreover, we have
$$\mathrm{err}_2(\phi'_{w^*}) = \mathbb{E}\big[(y - \phi'(w^*\cdot x))^2\big] = \mathbb{E}\big[(y - g'(w^*\cdot x) + g'(w^*\cdot x) - \phi'(w^*\cdot x))^2\big] \le 2\,\mathbb{E}\big[(y - g'(w^*\cdot x))^2\big] + 2\,\mathbb{E}\big[(g'(w^*\cdot x) - \phi'(w^*\cdot x))^2\big] = 2\,\mathrm{opt}_g + 2\,\mathbb{E}\big[(g'(w^*\cdot x) - \phi'(w^*\cdot x))^2\big].$$
This concludes the proof of Lemma 3.3.

## D Proofs from Section 4

### D.1 Proof of Theorem 4.1

In the case we consider here, $g'$ is the sigmoid activation, i.e., $g'(t) = (1 + e^{-t})^{-1}$ for any $t \in \mathbb{R}$. In particular, the pointwise surrogate loss we consider satisfies
$$\ell_g(y, f'(p)) = y\ln\frac{1}{p} + (1-y)\ln\frac{1}{1-p} - \ln 2, \quad \text{for any } y \in [0,1] \text{ and } p \in (0,1).$$
We may extend Lemma 4.2 to also capture $y \in \{0,1\}$, by defining $\ell_g(0, f'(0)) = \ell_g(1, f'(1)) = -\ln 2$ (the inequality holds under this definition). Hence, following a procedure similar to the one used for proving Theorem B.1, we obtain the following by applying Lemma 4.2:
$$\mathrm{err}_2(p) \le 2\big(\mathcal{L}_g(f'\circ p) - \mathbb{E}[\ell_g(y, f'(y))]\big), \tag{D.1}$$
$$\mathcal{L}_g(w^*) - \mathbb{E}[\ell_g(y, f'(y))] \le \mathbb{E}\Big[\frac{2}{g'(w^*\cdot x)\,(1 - g'(w^*\cdot x))}\,\big(y - g'(w^*\cdot x)\big)^2\Big], \tag{D.2}$$
$$\mathcal{L}_g(f'\circ p) \le \mathcal{L}_g(w^*) + \varepsilon. \tag{D.3}$$
Therefore, in order to prove Theorem 4.1, it is sufficient to provide a strong enough upper bound for the right-hand side of Equation (D.2) in terms of $\mathrm{opt}_g$. We observe that
$$\frac{2}{g'(w^*\cdot x)\,(1 - g'(w^*\cdot x))} \le 4\exp(|w^*\cdot x|), \quad \text{for any } x \in \mathbb{R}^d.$$
It remains to bound the quantity $\mathbb{E}[e^{|w^*\cdot x|}(y - g'_{w^*}(x))^2]$. Set $Q = e^{|w^*\cdot x|}(y - g'_{w^*}(x))^2$ ($Q$ is a random variable). Then for any $r \ge 0$ we have
$$\mathbb{E}[Q] = \mathbb{E}\big[Q\,\mathbf{1}\{|w^*\cdot x| \le r\}\big] + \mathbb{E}\big[Q\,\mathbf{1}\{|w^*\cdot x| > r\}\big] \le e^r\,\mathbb{E}\big[(y - g'_{w^*}(x))^2\mathbf{1}\{|w^*\cdot x| \le r\}\big] + \mathbb{E}\big[e^{|w^*\cdot x|}\mathbf{1}\{|w^*\cdot x| > r\}\big] \le e^r\,\mathrm{opt}_g + \mathbb{E}\big[e^{|w^*\cdot x|}\mathbf{1}\{|w^*\cdot x| > r\}\big].$$
To bound the quantity $\mathbb{E}[e^{|w^*\cdot x|}\mathbf{1}\{|w^*\cdot x| > r\}]$, consider $F(s) := \Pr\big[e^{|w^*\cdot x|}\mathbf{1}\{|w^*\cdot x| > r\} \ge s\big]$. We have that $F(s) = \mathbf{1}\{s = 0\}\Pr[|w^*\cdot x| \le r] + \Pr\big[|w^*\cdot x| \ge \max\{\ln s, r\}\big]$. Since $\mathbb{E}[e^{|w^*\cdot x|}\mathbf{1}\{|w^*\cdot x| > r\}] = \int_0^\infty F(s)\,ds = \int_0^\infty \Pr[|w^*\cdot x| \ge \max\{\ln s, r\}]\,ds$, we obtain, writing $\hat w = w^*/\|w^*\|_2$,
$$\int_0^\infty \Pr\big[|w^*\cdot x| \ge \max\{\ln s, r\}\big]\,ds \le \int_0^{e^r}\Pr\Big[|\hat w\cdot x| \ge \frac{r}{B}\Big]\,ds + \int_{e^r}^\infty \Pr\Big[|\hat w\cdot x| \ge \frac{\ln s}{B}\Big]\,ds \quad (\text{since } \|w^*\|_2 \le B)$$
$$\le e^r e^{-(r/B)^2} + \int_{e^r}^\infty e^{-(\ln s / B)^2}\,ds \quad (\text{see Def. 1.6})$$
$$= e^r e^{-(r/B)^2} + e^r\int_0^\infty e^{u - (\frac{u+r}{B})^2}\,du \quad (\text{substituting } u = \ln s - r)$$
$$\le e^r e^{-(r/B)^2} + e^r e^{-(r/B)^2}\,B\,e^{B^2} \le C\,e^{B^2} e^r e^{-(r/B)^2}.$$
Therefore, in total, we have
$$\mathrm{err}_2(p) \le 8 e^r\,\mathrm{opt}_g + 8C\,e^{B^2} e^r e^{-(r/B)^2} + 2\varepsilon,$$
and we may obtain Theorem 4.1 by picking $r = B(\ln\frac{1}{\mathrm{opt}_g})^{1/2}$.

### D.2 Proof of Theorem 4.3

We first prove Lemma 4.4. For any $y \in \{0,1\}$ and $p \in (0,1)$ we have
$$\ell_g(y, f'(p)) = y\ln\frac{1}{p} + (1-y)\ln\frac{1}{1-p} - \ln 2 = \mathrm{CE}(y, p) - \ln 2,$$
where $\mathrm{CE}(y, p)$ is the cross-entropy function. It is sufficient to show that for $y \in \{0,1\}$ and $p \in (0,1)$,
$$|y - p| \le \mathrm{CE}(y, p) \le 2|y - p|\log\frac{1}{p(1-p)}.$$
Observe that
$$\mathrm{CE}(0, p) = \log\frac{1}{1-p} = \sum_{i=1}^\infty \frac{p^i}{i}, \tag{D.5}$$
where the series on the right converges for all $p < 1$. We can also write $\mathrm{CE}(1, p) = \log\frac{1}{p} = \sum_{i=1}^\infty \frac{(1-p)^i}{i}$, with the series converging for $p > 0$. For the lower bound, we observe that the following inequalities hold for all $p \in [0,1]$:
$$\mathrm{CE}(0, p) \ge p = |0 - p|, \qquad \mathrm{CE}(1, p) \ge 1 - p = |1 - p|.$$
For the upper bound, we first prove the claim for $y = 0$, where it states that
$$\mathrm{CE}(0, p) = \log\frac{1}{1-p} \le 2p\log\frac{1}{p(1-p)}.$$
When $p \le 1/2$ we can use Eq. (D.5) to bound
$$\log\frac{1}{1-p} \le \sum_{i=1}^\infty p^i = \frac{p}{1-p} \le 2p. \tag{D.8}$$
The bound then holds by observing that, since $p(1-p) \le 1/4$, we have $\log\frac{1}{p(1-p)} \ge \log 4 \ge 1$. When $p \ge 1/2$, we note that $\log\frac{1}{1-p} \le \log\frac{1}{p(1-p)}$ and $2p \ge 1$, hence the bound holds in this case too. In the case where $y = 1$, the bound states that
$$\mathrm{CE}(1, p) = \log\frac{1}{p} \le 2(1-p)\log\frac{1}{p(1-p)}.$$
This is implied by our bound for $y = 0$ by taking $q = 1 - p$.
This concludes the proof of Lemma 4.4, and we are ready to prove Theorem 4.3. The following hold:
$$\mathrm{err}_1(p) \le \mathcal{L}_g(f'\circ p) + \ln 2, \tag{D.9}$$
$$\mathcal{L}_g(w^*) + \ln 2 \le 2\,\mathbb{E}\Big[\ln\frac{1}{g'(w^*\cdot x)\,(1 - g'(w^*\cdot x))}\;\big|y - g'(w^*\cdot x)\big|\Big], \tag{D.10}$$
$$\mathcal{L}_g(f'\circ p) \le \mathcal{L}_g(w^*) + \varepsilon. \tag{D.11}$$
Similarly to the proof of Theorem 4.1, we observe that
$$\ln\frac{1}{g'(w^*\cdot x)\,(1 - g'(w^*\cdot x))} \le \ln 4 + |w^*\cdot x|.$$
It remains to bound the quantity $\mathbb{E}[|w^*\cdot x|\,|y - g'_{w^*}(x)|]$. Set $Q = |w^*\cdot x|\,|y - g'_{w^*}(x)|$ ($Q$ is a random variable). Then for any $r \ge 0$ we have
$$\mathbb{E}[Q] = \mathbb{E}\big[Q\,\mathbf{1}\{|w^*\cdot x| \le r\}\big] + \mathbb{E}\big[Q\,\mathbf{1}\{|w^*\cdot x| > r\}\big] \le r\,\mathbb{E}\big[|y - g'_{w^*}(x)|\,\mathbf{1}\{|w^*\cdot x| \le r\}\big] + \mathbb{E}\big[|w^*\cdot x|\,\mathbf{1}\{|w^*\cdot x| > r\}\big] \le r\,\mathrm{opt}_g + \mathbb{E}\big[|w^*\cdot x|\,\mathbf{1}\{|w^*\cdot x| > r\}\big].$$
To bound the quantity $\mathbb{E}[|w^*\cdot x|\,\mathbf{1}\{|w^*\cdot x| > r\}]$, consider $F(s) := \Pr[|w^*\cdot x|\,\mathbf{1}\{|w^*\cdot x| > r\} \ge s]$. We have that $F(s) = \mathbf{1}\{s = 0\}\Pr[|w^*\cdot x| \le r] + \Pr[|w^*\cdot x| \ge \max\{s, r\}]$. Since $\mathbb{E}[|w^*\cdot x|\,\mathbf{1}\{|w^*\cdot x| > r\}] = \int_0^\infty F(s)\,ds = \int_0^\infty \Pr[|w^*\cdot x| \ge \max\{s, r\}]\,ds$, we obtain, writing $\hat w = w^*/\|w^*\|_2$,
$$\int_0^\infty \Pr\big[|w^*\cdot x| \ge \max\{s, r\}\big]\,ds \le \int_0^{r}\Pr\Big[|\hat w\cdot x| \ge \frac{r}{B}\Big]\,ds + \int_{r}^\infty \Pr\Big[|\hat w\cdot x| \ge \frac{s}{B}\Big]\,ds \quad (\text{since } \|w^*\|_2 \le B)$$
$$\le r\, e^{-r/B} + \int_{r}^\infty e^{-s/B}\,ds \quad (\text{see Def. 1.6})$$
$$= r\, e^{-r/B} + B\, e^{-r/B}.$$
Therefore, in total, we have
$$\mathrm{err}_1(p) \le 2\ln 4\cdot\mathrm{opt}_g + 2(r + B)e^{-r/B} + 2r\,\mathrm{opt}_g + \varepsilon,$$
and we may obtain Theorem 4.3 by picking $r = B\ln\frac{1}{\mathrm{opt}_g}$.
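As a final numeric sanity check (ours, not part of the paper's formal development), the cross-entropy distortion bound at the heart of Lemma 4.4, namely $|y - p| \le \mathrm{CE}(y, p) \le 2|y-p|\ln\frac{1}{p(1-p)}$ for $y \in \{0,1\}$, can be verified directly on a grid:

```python
import numpy as np

# Check |y - p| <= CE(y, p) <= 2 |y - p| * ln(1 / (p (1 - p))) for y in {0,1}.
ps = np.linspace(1e-6, 1 - 1e-6, 100001)
for y in (0.0, 1.0):
    ce = -(y * np.log(ps) + (1 - y) * np.log(1 - ps))   # cross entropy CE(y, p)
    gap = np.abs(y - ps)
    upper = 2 * gap * np.log(1.0 / (ps * (1 - ps)))
    assert np.all(gap <= ce + 1e-12) and np.all(ce <= upper + 1e-12)
```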