# Conformalization of Sparse Generalized Linear Models

Etash Kumar Guha¹, Eugene Ndiaye², Xiaoming Huo³

Abstract

Given a sequence of observable variables $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, the conformal prediction method estimates a confidence set for $y_{n+1}$ given $x_{n+1}$ that is valid for any finite sample size by merely assuming that the joint distribution of the data is permutation invariant. Although attractive, computing such a set is computationally infeasible in most regression problems. Indeed, in these cases, the unknown variable $y_{n+1}$ can take an infinite number of possible candidate values, and generating conformal sets requires retraining a predictive model for each candidate. In this paper, we focus on a sparse linear model with only a subset of variables used for prediction and employ numerical continuation techniques to approximate the solution path efficiently. The critical property we exploit is that the set of selected variables is invariant under a small perturbation of the input data. Therefore, it is sufficient to enumerate and refit the model only at the change points of the set of active features and smoothly interpolate the rest of the solution via a Predictor-Corrector mechanism. We show how our path-following algorithm accurately approximates conformal prediction sets and illustrate its performance using synthetic and real data examples.

1. Introduction

Modern statistical learning algorithms perform remarkably well in predicting an object based on its observed characteristics. In terms of AI safety, it is essential to quantify the uncertainty of their predictions. More precisely, after observing a finite sequence of data $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, it is interesting to analyze to what extent one can build a confidence set for the next observation $y_{n+1}$ given $x_{n+1}$.

¹College of Computing, Georgia Institute of Technology, Atlanta, GA, USA. ²Apple (work partly done while at Georgia Tech). ³H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA. Correspondence to: Etash Guha.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

A classical approach is to fit a prediction model $\mu_{D_n}$ on the observed data $D_n$ and consider an interval centered around the prediction of $y_{n+1}$ when the fitted model receives $x_{n+1}$ as new input, i.e., using $\mu_{D_n}(x_{n+1})$. We calibrate the confidence interval to attain $100(1-\alpha)\%$ confidence by considering, for any level $\alpha$ in $(0, 1)$, the set

$$\{z : |z - \mu_{D_n}(x_{n+1})| \le Q_n(1-\alpha)\}, \qquad (1)$$

where $Q_n(1-\alpha)$ is the $(1-\alpha)$-quantile of the empirical cumulative distribution function of the fitted residuals $|y_i - \mu_{D_n}(x_i)|$ for indices $i$ in $\{1, \ldots, n\}$. If the fitted model is close to the true one, this method is approximately valid as $n$ goes to infinity.
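To make the construction in Equation (1) concrete, here is a minimal sketch of the residual-quantile interval around a fitted sparse model. The data, the Lasso estimator standing in for $\mu_{D_n}$, and the order-statistic quantile convention are illustrative choices made for this sketch, not the paper's prescribed setup.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data standing in for D_n = {(x_i, y_i)}; everything here is illustrative.
rng = np.random.default_rng(0)
n, p = 100, 20
X, w = rng.normal(size=(n, p)), rng.normal(size=p)
y = X @ w + 0.1 * rng.normal(size=n)
x_new = rng.normal(size=p)

alpha = 0.1                                  # miscoverage level
model = Lasso(alpha=0.1).fit(X, y)           # mu_{D_n}; alpha here is the lasso penalty, unrelated to the level
residuals = np.abs(y - model.predict(X))     # fitted residuals |y_i - mu_{D_n}(x_i)|

# (1 - alpha)-quantile of the empirical residual distribution (order-statistic convention)
k = int(np.ceil(n * (1 - alpha)))
q = np.sort(residuals)[k - 1]

center = model.predict(x_new.reshape(1, -1))[0]
naive_interval = (center - q, center + q)    # the set in Equation (1)
```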
Alternatively, conformal prediction is a versatile and simple method, introduced in (Vovk et al., 2005; Shafer & Vovk, 2008), that provides a finite-sample, distribution-free $100(1-\alpha)\%$ confidence region for the predicted object based on past observations. The main idea is to follow the construction of the confidence set in Equation (1) using candidate values for $y_{n+1}$. Since the true $y_{n+1}$ is not given in the observed dataset $D_n$, one can instead learn a predictive model $\mu_{D_{n+1}(z)}$ on an augmented database

$$D_{n+1}(z) = D_n \cup \{(x_{n+1}, z)\},$$

where a candidate $z$ replaces the unknown response $y_{n+1}$. We can, therefore, define a prediction loss for each observation and rank them. A candidate $z$ will be considered conformal, or typical, if the rank of its loss is sufficiently small. The conformal prediction set simply contains the most typical $z$ as a confidence set for $y_{n+1}$. More formally, the conformal prediction set is

$$\left\{z : |z - \mu_{D_{n+1}(z)}(x_{n+1})| \le Q_{n+1}(1-\alpha, z)\right\}, \qquad (2)$$

where $Q_{n+1}(1-\alpha, z)$ is the $(1-\alpha)$-quantile of the empirical cumulative distribution function of the refitted residuals, e.g., $|y_i(z) - \mu_{D_{n+1}(z)}(x_i)|$ for indices $i$ in $\{1, \ldots, n+1\}$ and $y(z) = (y_1, \ldots, y_n, z)$. This method benefits from a strong coverage guarantee without any assumption on the data distribution and for any finite sample size $n$; see Section 4.

The conformal prediction approach has been applied to designing uncertainty sets in active learning (Ho & Wechsler, 2008), anomaly detection (Laxhammar & Falkman, 2015; Bates et al., 2021), few-shot learning (Fisch et al., 2021), time series (Chernozhukov et al., 2018; Xu & Xie, 2021; Chernozhukov et al., 2021; Lin et al., 2022), and to infer performance guarantees for statistical learning algorithms (Holland, 2020; Cella & Ryan, 2020; Ndiaye, 2022). We refer to the extensive review in Balasubramanian et al. (2014) for other applications to artificial intelligence.

Despite its attractive properties, the computation of conformal prediction sets traditionally requires fitting a model $\mu_{D_{n+1}(z)}$ for each possible augmented dataset $D_{n+1}(z)$, i.e., for each possible candidate $z$ for $y_{n+1}$. The number of possible candidates is infinite in a regression setting, where the response can take an uncountable number of values. Therefore, the computation of conformal prediction is generally infeasible without additional structural assumptions on the underlying model fit; otherwise, the calculation cost remains prohibitive. While many algorithms address the related problem of fitting many models under changes to the regularization parameter $\lambda$ (Park & Hastie, 2007), to our knowledge, no such algorithms exist for general loss functions under changes to the dataset without a high computational cost. We avoid the central issue of refitting the model many times by using the structural assumptions given by the setting of Generalized Linear Models with $\ell_1$ regularization.

Contributions. We generalize linear homotopy approaches from the quadratic loss to a broader class of nonlinear loss functions, using numerical continuation to efficiently trace a piecewise smooth solution path. Overall, we propose a homotopy-drawing algorithm that efficiently keeps track of the weights over the space of possible candidates using the sparsity induced by the $\ell_1$ regularization. From this homotopy algorithm, we develop an efficient conformal prediction algorithm for sparse generalized linear models. Additionally, using numerical continuation and the patterns in the sparsity of the weights, we relinquish the expensive necessity of retraining the model many times from random initialization. Furthermore, we provide a primal prediction step that significantly reduces the number of iterations needed to obtain an approximation at high precision. We illustrate the performance of our algorithm as a homotopy drawer and a conformal set generator using quadratic, asymmetric, and robust loss functions with $\ell_1$ regularization.
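To make the computational bottleneck behind Equation (2) concrete, the following sketch implements the brute-force construction that full conformal prediction naively requires: one refit per candidate on a grid. The grid size, the Lasso model standing in for the fitted predictor, and the helper name `full_conformal_grid` are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

def full_conformal_grid(X, y, x_new, alpha=0.1, n_grid=100):
    """Brute-force conformal set: refit once per candidate z (the cost our method removes)."""
    n = len(y)
    X_aug = np.vstack([X, x_new])
    z_grid = np.linspace(y.min(), y.max(), n_grid)
    accepted = []
    for z in z_grid:
        y_aug = np.append(y, z)                        # augmented labels y(z)
        model = Lasso(alpha=0.1).fit(X_aug, y_aug)     # mu_{D_{n+1}(z)}, refit from scratch
        scores = np.abs(y_aug - model.predict(X_aug))  # conformity scores E_i(z)
        rank = np.sum(scores <= scores[-1])            # rank of the candidate's score
        if rank <= np.ceil((n + 1) * (1 - alpha)):     # typicalness test behind Equation (2)
            accepted.append(z)
    return (min(accepted), max(accepted)) if accepted else None
```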
Related Works. Our methodology uses numerical continuation (also called homotopy) to generate a path of solutions. Such continuation techniques have been previously used when the objective function is differentiable (Allgower & Georg, 2012), for support vector machines (Hastie et al., 2004), for logistic regression (Bach et al., 2004), and for more general loss functions regularized with the $\ell_1$ norm (Rosset & Zhu, 2007; Park & Hastie, 2007; Tibshirani, 2013; Mairal & Yu, 2012). However, the latter works focus on the regularization path and trace the solution curve as the regularization parameter $\lambda$ varies. To our knowledge, no existing work generates the solution curve as the label $z$ varies in $y(z)$ for general loss functions. In the setting we consider, we recall that it is the response vector that is parameterized, as $y(z) = (y_1, \ldots, y_n, z)$ for a real value $z$; for this setting, Garrigues & Ghaoui (2009) and Lei (2019) proposed a homotopy algorithm when the loss function is quadratic. However, such algorithms do not apply to general nonlinear loss functions; our algorithm extends these works to such losses. For general loss functions, works such as Ndiaye & Takeuchi (2019) aim to approximate the homotopy only finely enough to generate the conformal prediction set. However, this approach degrades as increasing accuracy is required when drawing the homotopy and cannot, for example, recover the path for the quadratic loss, for which an exact homotopy algorithm is known.

Notation. For a nonzero integer $n$, we denote by $[n]$ the set $\{1, \ldots, n\}$. The row-wise feature matrix is $X = [x_1, \ldots, x_{n+1}]^\top$ so that $X \in \mathbb{R}^{(n+1) \times p}$. We use the notation $X_A$ to refer to the sub-matrix of $X$ assembled from the columns with indices in $A$; for a single index $j \in [p]$, we write $X_j$. For brevity, we define $\sigma_{\max}(X_A)$ as the maximum singular value of $X_A$, i.e., $\sigma_{\max}(X_A) = \|X_A\|_2$, and similarly $\sigma_{\min}(X_A)$. If a function $\beta(z)$ returns a vector for some input $z$, we index that output vector by $\beta_A(z)$ for $A \subseteq [p]$, or $\beta_j(z)$ for $j \in [p]$. Moreover, given a function $f(x_i, x_j)$ of two variables, we denote its gradient by $\nabla f$ and use the shorthand $\nabla_{i,j,k} f = \frac{\partial^3 f}{\partial x_i \partial x_j \partial x_k}$ for $i, j, k \in [2]$ (and analogously $\nabla_{i,j} f$ for second derivatives). We denote by $\lceil r \rceil$ the smallest integer no less than a real value $r$. We denote by $Q_{n+1}(1-\alpha)$ the $(1-\alpha)$-quantile of a real-valued sequence $(U_i)_{i \in [n+1]}$, defined as $Q_{n+1}(1-\alpha) = U_{(\lceil (n+1)(1-\alpha) \rceil)}$, where $U_{(i)}$ is the $i$-th order statistic. For $k \in [n+1]$, the rank of $U_k$ among $U_1, \ldots, U_{n+1}$ is defined as $\mathrm{Rank}(U_k) = \sum_{i=1}^{n+1} \mathbb{1}_{U_i \le U_k}$.

2. Sparse Generalized Linear Models

By definition of the conformal prediction set in Equation (2), one needs to consider an augmented dataset $D_{n+1}(z)$ for any possible replacement of the target variable $y_{n+1}$ by a real value $z$. This implies computing the whole path $z \mapsto \mu_{D_{n+1}(z)}(x_{n+1})$ as well as the paths of scores and quantiles, which is generally difficult to achieve. We focus on the Generalized Linear Model (GLM) regularized with an $\ell_1$ norm that promotes sparsity of the model parameter. For a fixed $z \in \mathbb{R}$, the weight vector $\beta^*(z)$ is defined as a solution of the optimization problem

$$\beta^*(z) \in \arg\min_{\beta \in \mathbb{R}^p} f(y(z), X\beta) + \lambda \|\beta\|_1, \qquad (3)$$

where the data-fitting term $f(y(z), y^*(z))$ is a non-negative loss function between a prediction vector $y^*(z)$ and the augmented vector of labels $y(z) = (y_1, \ldots, y_n, z)$. We parameterize a linear prediction as $y^*_i(z) = x_i^\top \beta^*(z)$, and the empirical loss is

$$f(y(z), y^*(z)) = \sum_{i=1}^{n} \ell(y_i, y^*_i(z)) + \ell(z, y^*_{n+1}(z)).$$
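As a concrete instance of Problem (3), the sketch below solves the $\ell_1$-regularized objective for one candidate $z$ (already appended to the label vector) using CVXPY, which the paper also employs as a corrector. The specific loss implementations are assumptions of this sketch: the quadratic and Linex losses follow the definitions in the text, while the Huber loss stands in for a generic robust choice and may differ from the paper's exact robust loss.

```python
import cvxpy as cp

def fit_sparse_glm(X_aug, y_aug, lam, loss="quadratic", gamma=0.5):
    """Solve Equation (3) for a fixed candidate z already appended to y_aug.

    Loss choices are illustrative: quadratic, the Linex loss, and a Huber loss
    as one robust option (the paper's exact robust loss may differ).
    """
    n_aug, p = X_aug.shape
    beta = cp.Variable(p)
    pred = X_aug @ beta                       # linear prediction y*(z)
    if loss == "quadratic":
        data_fit = cp.sum_squares(y_aug - pred)
    elif loss == "linex":                     # asymmetric loss exp(g*r) - g*r - 1
        r = y_aug - pred
        data_fit = cp.sum(cp.exp(gamma * r) - gamma * r - 1)
    else:                                     # "robust": Huber as a stand-in
        data_fit = cp.sum(cp.huber(y_aug - pred, M=1.0))
    problem = cp.Problem(cp.Minimize(data_fit + lam * cp.norm1(beta)))
    problem.solve()
    return beta.value
```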
There are many examples of cost functions in the literature. A popular example is power norm regression, where $\ell(a, b) = |a - b|^q$. When $q = 2$, this corresponds to classical linear regression. The cases $q \in [1, 2)$ are frequent in robust statistics, and the case $q = 1$ is known as least absolute deviation. One can also consider the Linex loss function (Gruber, 2010; Chang & Hung, 2007), which provides an asymmetric loss

$$\ell(a, b) = \exp(\gamma(a - b)) - \gamma(a - b) - 1, \quad \text{for } \gamma \neq 0.$$

2.1. Assumptions and Properties

We first describe the structure of the optimal solution $\beta^*(z)$ for a candidate $z$. A solution to the optimization problem in Equation (3) must obey the first-order optimality condition. Analyzing this condition reveals a set of weights in $\beta^*(z)$ whose value is $0$ and which thus do not contribute to the inference. This is a crucial property of $\ell_1$ regularization.

Lemma 2.1. A vector $\beta^*(z) \in \mathbb{R}^p$ is optimal for Equation (3) if and only if, for $y^*(z) = X\beta^*(z)$, it holds that

$$X^\top \nabla_2 f(y(z), y^*(z)) = -\lambda v(z), \qquad (4)$$

where $v(z)$ belongs to the subdifferential of the $\ell_1$ norm at $\beta^*(z)$, i.e., for all $j \in \{1, \ldots, p\}$,

$$v_j(z) \in \begin{cases} \{\mathrm{sign}(\beta^*_j(z))\} & \text{if } \beta^*_j(z) \neq 0, \\ [-1, 1] & \text{if } \beta^*_j(z) = 0. \end{cases} \qquad (5)$$

Within this lemma, we wish to formally distinguish between nonzero weights and zero weights, as this helps determine the value of $v_j(z)$, per Equation (5).

Definition 2.1. We define the active set at a point $z$ as

$$A(z) = \left\{ j \in [p] : |X_j^\top \nabla_2 f(y(z), y^*(z))| = \lambda \right\}. \qquad (6)$$

The active set contains at least all the indices of the optimal solution that are guaranteed to be nonzero. We write $A = A(z)$ when there is no ambiguity.

The following result provides sufficient conditions ensuring uniqueness of the solution path, i.e., for any $z$, there exists a single optimal solution $\beta^*(z)$ of Problem (3).

Lemma 2.2. For all $z$, assume that the matrix $X_{A(z)}$ is full rank and that the loss function $f$ is strictly convex. Under these two assumptions, for every candidate $z$, there exists a unique optimal solution $\beta^*(z)$. Thus, the solution path $z \mapsto \beta^*(z)$ is well defined.

In the following, for simplicity of presentation of the algorithms, we add the classical qualification condition that the active set coincides with the support of the solution for any candidate $z$ where the path is differentiable.

3. Efficient Computation of the Solution Path

We aim to finely approximate the function $\beta^*(z)$ by $\hat\beta(z)$ across all candidates $z$. The initial and main observation is that the active set map is piecewise constant (and the solution path is piecewise smooth). That is to say, the set of variables selected by the $\ell_1$ penalty is invariant with respect to small perturbations of the input data. Building on this, the path-drawing algorithm combines finding the points where active set changes occur with estimating the optimal solution, leveraging the regularity of the loss $f$. There are two situations in which the active set changes:

- A nonzero variable becomes zero, i.e., $\exists j \in A(z)$ such that $\beta^*_j(z) \neq 0$ and $\beta^*_j(z^{\text{out}}_j) = 0$.
- A zero variable becomes nonzero, i.e., $\exists j \in A^c(z)$ such that $|X_j^\top \nabla_2 f(y(z^{\text{in}}_j), y^*(z^{\text{in}}_j))| = \lambda$.

Here, $z^{\text{out}}_j$ and $z^{\text{in}}_j$ are the estimated points where variable $j$ could leave or join the active set, respectively. With decreasing input $z$, the next change point occurs at

$$z_{\text{next}}(z) = \max\left( \max_{j \in A(z)} z^{\text{out}}_j, \ \max_{j \in A^c(z)} z^{\text{in}}_j \right). \qquad (7)$$

Here, $z_{\text{next}}(z)$ is the function that finds where the active set changes after the point $z$. The change points are called kinks of the path because they correspond to the nondifferentiable points of the solution path $z \mapsto \beta^*(z)$. A small sketch of how the active set of Definition 2.1 is read off a fitted solution in practice is given below.
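The following minimal sketch makes Definition 2.1 operational. It assumes the quadratic loss $f(y, y') = \tfrac{1}{2}\|y - y'\|^2$ (so that $\nabla_2 f(y, y') = y' - y$) and replaces the exact equality $|X_j^\top \nabla_2 f| = \lambda$ by a numerical tolerance, both of which are choices of this sketch rather than requirements of the paper.

```python
import numpy as np

def active_set(X_aug, y_aug, beta_hat, lam, tol=1e-6):
    """Definition 2.1 for the quadratic loss f(y, y') = 0.5 * ||y - y'||^2,
    whose gradient in the prediction is grad_2 f = y' - y.  Exact equality
    |X_j^T grad_2 f| = lam is replaced by a numerical tolerance."""
    pred = X_aug @ beta_hat                 # prediction vector y*(z)
    grad2 = pred - y_aug                    # grad_2 f(y(z), y*(z)) for the quadratic loss
    corr = np.abs(X_aug.T @ grad2)          # |X_j^T grad_2 f|, one entry per feature
    return np.flatnonzero(corr >= lam - tol)
```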
Core difficulties are that $f$ can be highly nonlinear and that the optimal weights $\beta^*(z_+)$ at an arbitrary point $z_+$ cannot be efficiently computed for many loss functions. To alleviate this, our algorithm sequentially builds a linearized version of $\beta^*_A(z_+)$, denoted $\tilde\beta_A(z_+)$ (Section 3.1), in order to estimate the active set changes (Section 3.2). Given a point of active set change $z_t$, we correct $\tilde\beta_A(z_t)$ into $\hat\beta_A(z_t)$ so that $\hat\beta_A(z_t) \approx \beta^*_A(z_t)$ up to a negligible optimization error $\epsilon_{\text{tol}}$ using any appropriate solver (Section 3.3). The algorithm then approximates $\beta^*_{A_+}(z_+)$, where $A_+$ is the new active set, and repeats these steps until the stopping point is reached. We detail the entire pipeline in Algorithm 1 and illustrate how our approximated solution path deviates from the exact one for different loss functions in Appendix A.

3.1. Solution Estimation

We wish to approximate $\beta^*_A(z_+)$ for a candidate $z_+$ smaller than the most recently found kink $z_t$, where $A(z_+) = A(z_t)$. To start, we assume access to the corrected (up to negligible error) weights $\hat\beta_A(z_t)$ at the previous kink $z_t$. We use a local linearization of the solution path,

$$\tilde\beta_A(z_+) = \hat\beta_A(z_t) + \hat\beta'_A(z_t)\,(z_+ - z_t), \qquad (8)$$

where $\hat\beta'_A(z_t)$ is our approximation of the true slope $\frac{\partial \beta^*_A}{\partial z}(z_t)$, which we do not have access to. To understand this term, we follow Park & Hastie (2007) and define

$$H(y(z), \beta^*_A(z)) = X_A^\top \nabla_2 f(y(z), y^*(z)) + \lambda v_A.$$

From the optimality condition in Equation (4), it holds that $H(y(z), \beta^*_A(z)) = 0$ along the path, and hence its total derivative with respect to $z$ vanishes. By the implicit function theorem and the chain rule, we have

$$\frac{\partial H}{\partial \beta} = X_A^\top \nabla_{2,2} f(y(z), y^*(z))\, X_A, \qquad \frac{\partial H}{\partial y} = X_A^\top \nabla_{2,1} f(y(z), y^*(z)), \qquad \frac{\partial y}{\partial z} = (0, \ldots, 0, 1)^\top,$$

so that

$$\frac{\partial \beta^*_A}{\partial z}(z) = -\left(\frac{\partial H}{\partial \beta}\right)^{-1} \frac{\partial H}{\partial y}\, \frac{\partial y}{\partial z}.$$

To compute an approximation of $\frac{\partial \beta^*_A}{\partial z}(z_t)$, we use a plug-in approach and simply replace the (unknown) exact prediction $y^*(z_t) = X\beta^*(z_t)$ with the approximation $\hat y(z_t) = X\hat\beta(z_t)$, yielding $\hat\beta'_A(z_t)$. Notably, we obtain an expression for $\tilde\beta_A(z_+)$ that is efficient to compute given $y(z_+)$. As a reminder, the loss function $f$ is what differentiates this algorithm from existing path-finding algorithms tailored to changes in the hyperparameter $\lambda$. If $f$ is the quadratic loss, we recover the path-finding algorithm of Lei (2019); a completely different homotopy is generated for other loss functions. A sketch of this predictor computation is given below.
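The sketch below implements the predictor step of Equation (8) together with the plug-in slope derived from the implicit function theorem above. It assumes a separable loss, so that $\nabla_{2,2} f$ and $\nabla_{2,1} f$ are diagonal, and the helper names `d2_pred` and `d2_mixed` are illustrative; the quadratic loss is given as one concrete instance.

```python
import numpy as np

def predictor_step(X_aug, y_aug, beta_hat, A, z_plus, z_t, d2_pred, d2_mixed):
    """Local linearization of Equation (8).

    d2_pred(y, y_hat) and d2_mixed(y, y_hat) return the diagonals of
    grad_{2,2} f and grad_{2,1} f for a separable loss (an assumption made to
    keep the sketch small); A is an integer array of active indices."""
    y_hat = X_aug @ beta_hat                          # plug-in prediction \hat y(z_t)
    X_A = X_aug[:, A]
    D22 = np.diag(d2_pred(y_aug, y_hat))              # grad_{2,2} f o zeta_hat(z_t)
    D21 = np.diag(d2_mixed(y_aug, y_hat))             # grad_{2,1} f o zeta_hat(z_t)
    H_beta = X_A.T @ D22 @ X_A                        # dH/dbeta
    H_y = X_A.T @ D21                                 # dH/dy
    e_last = np.zeros(len(y_aug)); e_last[-1] = 1.0   # dy/dz = (0, ..., 0, 1)
    slope_A = -np.linalg.solve(H_beta, H_y @ e_last)  # plug-in slope \hat beta'_A(z_t)
    beta_tilde_A = beta_hat[A] + slope_A * (z_plus - z_t)   # Equation (8)
    return beta_tilde_A, slope_A

# Quadratic loss f(y, y') = 0.5 * ||y - y'||^2 as a concrete instance:
quad_d2_pred = lambda y, y_hat: np.ones_like(y)       # d^2 l / d y'^2 = 1
quad_d2_mixed = lambda y, y_hat: -np.ones_like(y)     # d^2 l / (d y' d y) = -1
```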
3.2. Active Set Updates

We have to sequentially track the changes that may occur in the active set along the path, depending on whether a variable leaves or enters the active set. We compute our path restricted to the interval $[z_{\min}, z_{\max}]$ where

$$z_{\min} = \min(y_1, \ldots, y_n) \quad \text{and} \quad z_{\max} = \max(y_1, \ldots, y_n).$$

For a sufficiently large sample size $n$, any point $z$ outside this interval has a very low probability of being in the conformal set, since it is an outlier with respect to the observed labels; see the justification in Lemma C.1. For simplicity, we reiterate that we know the corrected $\hat\beta(z_t)$ at the most recent kink $z_t$, approximating $\beta^*(z_t)$ up to error $\epsilon_{\text{tol}}$, as well as the active set $A(z_t)$. We estimate the kinks by following Equation (7), with the exact solution replaced by the linearization of Equation (8). As such, we iteratively set $z_{t+1} = z_{\text{next}}(z_t)$ as the next change point following Equation (7).

Leaving the active set. At a point where a nonzero variable becomes zero, Equation (8) gives a closed-form approximation of $\beta^*_A(z_+)$ given $\hat\beta_A(z_t)$. Therefore, for a feature index $j \in A$, we have a closed-form approximation of $\beta^*_j(z_+)$ as a function of $z_+$, which we can compute efficiently. Thus, from Equation (8), variable $j$ leaving the active set corresponds to $\beta^*_j(z_+) = 0$, i.e., a kink occurs at the $z_+$ solving $0 = \tilde\beta_j(z_+)$ with $\tilde\beta_j(z_+)$ defined by the right-hand side of Equation (8), which is easily solvable in closed form. Thus, for an active variable $j$ with nonvanishing slope $\hat\beta'_j(z_t) \neq 0$, we define

$$z^{\text{out}}_{j,t+1} = z_t - \frac{\hat\beta_j(z_t)}{\hat\beta'_j(z_t)},$$

and set $z^{\text{out}}_{j,t+1} = -\infty$ otherwise. We remind the reader that $\hat\beta'_j(z_t)$ is our approximation of the true slope $\frac{\partial \beta^*_j}{\partial z}(z_t)$ from Section 3.1.

Joining the active set. At a point where a variable becomes nonzero, we know from Equation (4) that any inactive variable $j \in A^c$ that joins the active set satisfies $|X_j^\top \nabla_2 f(y(z_+), X_{A_+}\beta^*_{A_+}(z_+))| = \lambda$, where $A_+ = A \cup \{j\}$. However, since we are searching for the first point $z_+$ where the active set shifts from $A$ to $A_+$, the coefficient $\beta^*_j(z_+)$ is still roughly $0$ there. Therefore, the contribution $X_j \beta^*_j(z_+)$ to the prediction vanishes at the kink, and we have the equivalence

$$X_{A_+} \beta^*_{A_+}(z_+) = X_A \beta^*_A(z_+) = y^*(z_+).$$

This equivalence is useful because we know how to approximate $\beta^*_A(z_+)$, and therefore $y^*(z_+)$, efficiently from Equation (8). Therefore, the $j$-th variable must join the active set approximately at the $z_+$ such that $I_j(z_+) = 0$, where

$$I_j(z_+) = |X_j^\top \nabla_2 f(y(z_+), y^*(z_+))| - \lambda. \qquad (9)$$

We again use a plug-in estimate of Equation (9), replacing $y^*(\cdot)$ by $\hat y(\cdot)$. We could use a root-finding routine to locate the roots of $I_j(z_+)$ where the kink may lie. However, we seek a closed form, as in Equation (8), to make finding the roots of $I_j(z_+)$ more efficient. We do this via linearization again.

Approximation of $\nabla_2 f(y(z_+), y^*(z_+))$. While $\tilde\beta_j(z_+)$ is linear in $z_+$, giving an explicit solution for $z_+$, this property does not hold for $I_j(z_+)$ in Equation (9). To achieve such a form, we further linearize $\nabla_2 f(y(z_+), y^*(z_+))$. To simplify, we write

$$f(y(z), y^*(z)) = f \circ \zeta(z) \quad \text{where} \quad \zeta(z) = (y(z), y^*(z)),$$

and approximate its gradient $\nabla_2 f \circ \zeta(z_+)$ as

$$\nabla_2 f \circ \zeta(z) + \nabla_{2,1} f \circ \zeta(z)\,\Delta y + \nabla_{2,2} f \circ \zeta(z)\,\Delta y^*, \qquad (10)$$

where $\Delta y = y(z_+) - y(z)$ and $\Delta y^* = y^*(z_+) - y^*(z)$. Equation (10) can still be nonlinear, since $\Delta y^*$ can be nonlinear in $z_+$. To alleviate this, we leverage the local approximation of the solution path in Equation (8) and the plug-in replacement of $\frac{\partial \beta^*_A}{\partial z}$ with $\hat\beta'_A$. As such, we can estimate the root of $I_j(z_+)$ and sequentially define the next point where the $j$-th variable becomes active. To simplify the expression, we set $\hat\zeta(z) = (y(z), \hat y(z))$ and

$$g(z_t) = \left[\nabla_{2,1} f \circ \hat\zeta(z_t)\right]_{n+1} + \nabla_{2,2} f \circ \hat\zeta(z_t)\, X_A \hat\beta'_A(z_t).$$

A zero variable $j$ is then estimated to become nonzero at

$$z^{\text{in}}_{j,t+1} = z_t + \frac{\pm\lambda - X_j^\top \nabla_2 f \circ \hat\zeta(z_t)}{X_j^\top g(z_t)}.$$

The detailed computations are provided in Appendix D. Note that when the denominator $X_j^\top g(z_t)$ is zero, we set $z^{\text{in}}_{j,t+1} = -\infty$. Finally, the next kink is estimated as

$$z_{t+1} = \max\left( \max_{j \in A(z_t)} z^{\text{out}}_{j,t+1}, \ \max_{j \in A^c(z_t)} z^{\text{in}}_{j,t+1} \right).$$

A sketch combining the leaving and joining estimates into a next-kink computation is given below.
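As referenced above, here is a sketch that combines the leaving and joining estimates into a next-kink computation. It again assumes a separable loss with illustrative derivative helpers, and the handling of the two signs of $\lambda$ (keeping the closest candidate below $z_t$, consistent with the decreasing sweep) is one reasonable convention rather than necessarily the paper's exact rule.

```python
import numpy as np

def next_kink(X_aug, y_aug, beta_hat, slope_full, A, z_t, lam,
              grad2, d2_pred, d2_mixed):
    """Estimate z_{t+1} by combining leaving times z^out and joining times z^in.

    slope_full is a length-p vector holding the plug-in slope on A and zeros
    elsewhere; grad2, d2_pred, d2_mixed give the first/second derivatives of a
    separable loss in the prediction (illustrative helper names); A is an
    integer array of active indices."""
    p = X_aug.shape[1]
    y_hat = X_aug @ beta_hat
    # Leaving times: an active coefficient crosses zero along the linearization.
    z_out = np.full(p, -np.inf)
    nz = A[np.abs(slope_full[A]) > 1e-12]
    z_out[nz] = z_t - beta_hat[nz] / slope_full[nz]
    # Joining times: the linearized correlation |X_j^T grad_2 f| reaches lambda.
    c = X_aug.T @ grad2(y_aug, y_hat)                        # current correlations
    g = (d2_mixed(y_aug, y_hat) * (np.arange(len(y_aug)) == len(y_aug) - 1)
         + d2_pred(y_aug, y_hat) * (X_aug @ slope_full))     # g(z_t), elementwise
    denom = X_aug.T @ g
    z_in = np.full(p, -np.inf)
    for j in np.setdiff1d(np.arange(p), A):
        if abs(denom[j]) > 1e-12:
            # both signs of lambda are candidate crossings; keep the closest one below z_t
            cands = [z_t + (s * lam - c[j]) / denom[j] for s in (+1.0, -1.0)]
            below = [zc for zc in cands if zc < z_t]
            if below:
                z_in[j] = max(below)
    return max(z_out.max(), z_in.max())
```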
3.3. Solution Updates

Our active-set change-point finder obtains the next kink $z_{t+1}$ by tracking each variable in the optimal solution to see whether or not it changes status after $z_t$. However, our kink-finding tool requires knowledge of $\hat\beta(z_t)$, as in Equation (8). To find the subsequent kink, we therefore need to know $\hat\beta(z_{t+1})$. To ensure that our linearized version $\tilde\beta_A(z_{t+1})$ is close enough to the exact solution $\beta^*_A(z_{t+1})$, we correct the linearized weights $\tilde\beta_A(z_{t+1})$, creating $\hat\beta_A(z_{t+1})$. We use the Predictor-Corrector strategy described below (Allgower & Georg, 2012).

Predictor. To initialize the solving process for $\hat\beta(z_{t+1})$, we first provide our linearized version $\tilde\beta(z_{t+1})$ from Equation (8) as a warm-start initialization. This vastly improves the computation time of the corrector step hereafter.

Corrector. The solution obtained in the warm start often has a reasonably small approximation error; for example, in the case of the quadratic loss, this warm start is exact and correction is unnecessary. In general, however, it is an imprecise estimate of the exact solution. To overcome this, we use an additional corrector step with an iterative solver, such as proximal gradient descent initialized with the predictor output, or more advanced solvers such as CVXPY (Diamond & Boyd, 2016) or skglm (Bertrand et al., 2022). This takes our linearized weight estimate $\tilde\beta(z_{t+1})$ and outputs approximate weights $\hat\beta(z_{t+1}) \approx \beta^*(z_{t+1})$ up to an error $\epsilon_{\text{tol}}$, which is a hyperparameter of the corrector. Finally, we can summarize our approximation of the homotopy as

$$\hat\beta(z) = \begin{cases} \tilde\beta(z) & \text{if } z \notin \{z_1, \ldots, z_t\}, \\ \text{output of the corrector} \approx \beta^*(z) & \text{if } z \in \{z_1, \ldots, z_t\}. \end{cases}$$

For a point $z$ that is not a kink, we form our weight estimates simply through the linearization; otherwise, we use the output of the corrector as our estimate.

Algorithm 1: Full Homotopy Generation
Input: data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, $x_{n+1}$, $\lambda > 0$.
Initialization: $t = 0$, $z_0 = \max(y_1, \ldots, y_n)$, $y(z_0) = (y_1, \ldots, y_n, z_0)$,
  $\beta^*(z_0) = \arg\min_{\beta \in \mathbb{R}^p} f(y(z_0), X\beta) + \lambda\|\beta\|_1$,
  $A(z_0) = \{ j \in [p] : |X_j^\top \nabla_2 f(y(z_0), y^*(z_0))| = \lambda \}$.
while $z_t > \min(y_1, \ldots, y_n)$ do
  $z_I = \max_{j \in A^c(z_t)} z^{\text{in}}_{j,t+1}$ and $j_I = \arg\max_{j \in A^c(z_t)} z^{\text{in}}_{j,t+1}$
  $z_O = \max_{j \in A(z_t)} z^{\text{out}}_{j,t+1}$ and $j_O = \arg\max_{j \in A(z_t)} z^{\text{out}}_{j,t+1}$
  if $z_I > z_O$ then
    $z_{t+1} = z_I$, $A(z_{t+1}) = A(z_t) \cup \{j_I\}$
  else
    $z_{t+1} = z_O$, $A(z_{t+1}) = A(z_t) \setminus \{j_O\}$
  end if
  Predictor: $\tilde\beta(z_{t+1}) = \hat\beta(z_t) + \hat\beta'(z_t)(z_{t+1} - z_t)$
  Corrector, warm-started with $\tilde\beta(z_{t+1})$: $\hat\beta(z_{t+1}) \approx \arg\min_{\beta \in \mathbb{R}^p} f(y(z_{t+1}), X\beta) + \lambda\|\beta\|_1$
  $t = t + 1$
end while
Return: $\hat\beta(z_t)$, $z_t$ for all $t$.

4. Conformal Prediction for Sparse GLM

Given a homotopy for specific data and a loss function, computing the conformal prediction set relies on a simple calculation using the homotopy. The primary tool for proving its validity is that the rank of one variable among an exchangeable and identically distributed sequence follows a (sub-)uniform distribution (Bröcker & Kantz, 2011). This notion of rank helps construct distribution-free confidence intervals. We can estimate the conformity of a given candidate $z$ by calculating its prediction loss $|z - y^*_{n+1}(z)|$ and computing its rank relative to the losses of the other data points. The candidate is considered conformal if the rank of its loss is sufficiently small. Let us define the conformity measures on $D_{n+1}(z)$ as

$$E_i(z) = |y_i - y^*_i(z)|, \quad i \in [n], \qquad (11)$$
$$E_{n+1}(z) = |z - y^*_{n+1}(z)|. \qquad (12)$$

The main idea for constructing a conformal confidence set is to consider the conformity of a candidate point $z$, measured as

$$\pi(z) = 1 - \frac{1}{n+1}\, \mathrm{Rank}(E_{n+1}(z)). \qquad (13)$$

The conformal prediction set collects the most conformal values $z$ as a confidence set for $y_{n+1}$, i.e., it gathers all real values $z$ such that $\pi(z) \ge \alpha$. This condition holds if and only if the score $E_{n+1}(z)$ is ranked no higher than $\lceil (n+1)(1-\alpha) \rceil$ among the sequence $\{E_i(z)\}_{i \in [n+1]}$, i.e.,

$$\{z \in \mathbb{R} : E_{n+1}(z) \le Q_{n+1}(1-\alpha, z)\},$$

which is exactly the conformal set defined in Equation (2). To compute a conformal set, we need to evaluate the piecewise constant function $z \mapsto \pi(z)$; a small sketch of this evaluation at a single candidate is given below.
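As referenced above, the sketch below evaluates the typicalness $\pi(z)$ of Equation (13) and the membership test behind Equation (2) at a single candidate, given the (approximate) prediction vector produced by the homotopy; the helper names are illustrative.

```python
import numpy as np

def pi_value(y_aug, y_pred_aug):
    """Typicalness pi(z) of Equation (13) from the (approximate) fit at one candidate z.

    y_aug = (y_1, ..., y_n, z) and y_pred_aug is the corresponding prediction
    vector; the last entries play the role of the candidate and its prediction."""
    scores = np.abs(y_aug - y_pred_aug)    # E_i(z), Equations (11)-(12)
    rank = np.sum(scores <= scores[-1])    # Rank(E_{n+1}(z))
    return 1.0 - rank / len(y_aug)

def is_conformal(y_aug, y_pred_aug, alpha=0.1):
    """z belongs to the conformal set iff the rank of its score is small enough."""
    scores = np.abs(y_aug - y_pred_aug)
    return np.sum(scores <= scores[-1]) <= np.ceil(len(y_aug) * (1 - alpha))
```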
Fortunately, our framework directly sheds light on the computation of this value over the whole range. Access to the homotopy, as well as the kinks, yields an efficient methodology for calculating the conformal prediction set over the range of candidates. One could readily use a root-finding approach (Ndiaye & Takeuchi, 2021), but it requires the assumption that the conformal set is an interval. Instead, we track where changes in this set occur. Naturally, changes in the rank function only occur when the error of one example surpasses or drops below the error of the last example. Formally, this happens when

$$|y_i(z) - y^*_i(z)| = |y_{n+1}(z) - y^*_{n+1}(z)|. \qquad (14)$$

We look between consecutive kinks to efficiently find points satisfying Equation (14). For a point $z$ between two kinks, we can efficiently estimate $y^*(z)$. Indeed, given that $z$ lies between two kinks $z_t$ and $z_{t+1}$ with active set $A$, we can use Equation (8) to estimate the quantity $y(z) - y^*(z)$ as

$$F(z) = y(z) - \left( \hat y(z_t) + X_A \hat\beta'_A(z_t)\,(z - z_t) \right),$$

where $\hat y(z_t) = X\hat\beta(z_t)$ uses the weights stored from the corrector step at the kink $z_t$. Given that this quantity is linear in $z$, we can form a closed-form explicit approximation of the $z$ solving Equation (14). Therefore, we can look for where the value of $\pi(z)$ changes between every sequential pair of kinks. To find the conformal set, we track the changes of $\pi(z)$ and recompute it at each root of Equation (14), yielding an efficient methodology to compute $\pi(z)$, and therefore the conformal set, over the space of possible $y_{n+1}$ values.

Algorithm 2: Conformal Set Generation
Input: $\{(z_t, \hat\beta(z_t))\}_{t \in [0:T]}$, $\alpha \in (0, 1)$.
// Find where changes in $\pi$ occur
Set $C = \emptyset$
for $t \in [T]$, $i \in [n]$ do
  if $\exists z_+$ s.t. $F(z_+)_i = F(z_+)_{n+1}$ then
    if $|F(z_t)_{n+1}| \ge |F(z_t)_i|$ then
      $C = C \cup \{(z_+, -1)\}$
    else if $|F(z_t)_{n+1}| \le |F(z_t)_i|$ then
      $C = C \cup \{(z_+, +1)\}$
    end if
  end if
end for
// Get $z$ s.t. $\mathrm{Rank}(E_{n+1}(z))$ is small
Set $E = \emptyset$, $R = \mathrm{Rank}(E_{n+1}(z_0))$
SORT($C$) according to the first argument $z$
for $(z, c) \in C$ do
  $R = R + c$, $\mathrm{Rank}(E_{n+1}(z)) = R$
  if $\mathrm{Rank}(E_{n+1}(z)) \le \lceil (n+1)(1-\alpha) \rceil$ then
    $E = E \cup \{z\}$
  end if
end for
Return: $(\min(E), \max(E))$.

5. Theoretical Analysis

To understand where and how our algorithm can fail, we provide an upper bound on its pointwise error. The error is mainly accumulated in the linearizations we use for estimating the solution and the gradient of the loss. To form such bounds, we need assumptions on the regularity of the loss function $f$ itself and on the sequence of design matrices restricted to the active sets along the path. Namely, we first note that the derivatives of the loss function are bounded.

Lemma 5.1. The second derivatives of the loss function $f$, assumed to be continuous, are locally bounded by data-dependent constants. Indeed, for any $z \in [z_{\min}, z_{\max}]$, we have $\beta^*(z) \in B_1(0, R/\lambda)$ where

$$R = \max_{z \in [z_{\min}, z_{\max}]} f(y(z), 0).$$

By the Weierstrass theorem, for any $i, j \in [2]$, we have $\|\nabla_{i,j} f \circ \zeta(z)\|_2 \le \nu_f$.

Lemma 5.2. We assume that the loss $f$ is $\mu_f$-strongly convex, i.e.,

$$\mu_f := \inf_{\zeta \in B} \sigma_{\min}(\nabla_{2,2} f(\zeta)) > 0,$$

where $B$ is provided in the appendix. Then, for any $z \in [z_{\min}, z_{\max}]$, the maximum singular value of the inverse of the matrix $\frac{\partial H}{\partial \beta} = X_A^\top \nabla_{2,2} f \circ \zeta(z)\, X_A$ is upper bounded as

$$\left\| \left( \frac{\partial H}{\partial \beta} \right)^{-1} \right\|_2 \le \frac{1}{\sigma^2_{\min}(X_A)\, \mu_f}.$$

With these two lemmas, we can form our error bounds.

Theorem 5.1. The error between our linearized weights $\tilde\beta(z_+)$ and the true weights $\beta^*(z_+)$ is upper bounded by

$$\|\tilde\beta(z_+) - \beta^*(z_+)\|_2 \le \epsilon_{\text{tol}} + L\, \frac{\nu_f}{\mu_f}\, |z_+ - z_t|,$$

where

$$L = \frac{\sigma_{\max}(X_{A(z_t)})}{\sigma^2_{\min}(X_{A(z_t)})} + \sup_{z \in [z_+, z_t]} \frac{\sigma_{\max}(X_{A(z)})}{\sigma^2_{\min}(X_{A(z)})},$$

and $z_t$ is the kink preceding $z_+$.
Theorem 5.2. The estimation error is upper bounded by

$$\|\nabla_2 f \circ \zeta(z_+) - \nabla_2 f \circ \hat\zeta(z_+)\|_2 \le K \left( \epsilon_{\text{tol}} + L\, \frac{\nu_f}{\mu_f}\, |z_+ - z_t| \right),$$

where $K = \nu_f\, \sigma_{\max}(X_A)$.

6. Numerical Experiments

Our central claim is twofold: our method efficiently and accurately generates the homotopy for general loss functions, and it efficiently and accurately generates conformal sets for general loss functions. We demonstrate these two claims over different datasets and loss functions. For reproducibility, our implementation is available at github.com/EtashGuha/sparse_conformal.

Datasets. We use four datasets to illustrate the performance of our algorithm. The first three are real datasets sourced from scikit-learn (Pedregosa et al., 2011). The Diabetes dataset is a regression dataset with 20 features and 442 samples. Additionally, we use the well-known regression dataset from Friedman (1991), denoted Friedman1, which has 10 features and 100 samples. We also use the multivariate dataset denoted Friedman2 from Breiman (1996), which has 100 samples and 4 features. These datasets are used to demonstrate the capabilities of our algorithm on real data. We also generate regression problems synthetically: we sample the data and labels from a uniform distribution on $[-1, 1]$ and divide by the standard deviation to normalize the dataset. We generate two different synthetic datasets: a normal-sized dataset, denoted Synthetic, with 100 samples and 100 features, and a larger dataset, denoted Large, with 1000 features and 20 samples. This larger dataset is intended to probe our algorithm's complexity in terms of the number of features. These datasets represent a reasonable range of regression problems for our experiments.

Baselines. We compare against several baselines. The most naive conformal prediction algorithm is the Grid baseline: it selects 100 candidate values evenly spaced across the range of possible candidates and runs the primal corrector at each point to compute the weights, forming the homotopy. A more sophisticated conformal prediction and homotopy-generating algorithm is the Approximate homotopy of Ndiaye & Takeuchi (2019), which leverages loss-function smoothness to track violations (up to a prescribed error tolerance) of the optimality condition along the path.

6.1. Homotopy Experiments

To test our algorithm in terms of homotopy generation, we measure its accuracy and efficiency against different baselines across different loss functions. For all baselines and our algorithm, we use proximal gradient descent as the primal corrector for the Lasso loss and CVXPY for the Robust and Asymmetric losses. Precisely, we measure the negative logarithm of the gap between the primal values of the computed $\hat\beta$ and a ground-truth baseline, averaged over many possible $z$ values; a sketch of this metric is given below. The ground-truth baseline is a grid-based homotopy, where we compute the solution iteratively along a fine grid of candidates. Given that we apply the negative logarithm to the primal gap, a larger reported value means a smaller true error and better performance. Moreover, we report the time, in seconds, required to form the homotopy. Our experiments cover the Lasso, Robust, and Asymmetric loss functions across all datasets. We report our results in Table 1 and Table 2, where we shorten Synthetic to Synth and Approximate to Appr for brevity.
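The sketch below shows one way to compute the accuracy metric just described; the averaging order, the logarithm base, and the helper `objective(beta, z)` (the primal value $P(\beta, z)$) are our reading of the protocol and may differ from the paper's implementation.

```python
import numpy as np

def avg_neg_log_primal_gap(objective, beta_path, beta_ref_path, z_grid):
    """Average of -log(primal gap) over candidates z.

    beta_ref_path holds ground-truth weights from a fine-grid solve; a small
    floor guards against log(0) when the two solutions coincide."""
    gaps = []
    for z, b_hat, b_ref in zip(z_grid, beta_path, beta_ref_path):
        gap = max(objective(b_hat, z) - objective(b_ref, z), 1e-16)
        gaps.append(-np.log(gap))
    return float(np.mean(gaps))
```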
As is evident, we see a significant decrease in the time used compared to Approximate Homotopy for most applications of the Lasso loss, together with a significant increase in accuracy. On the largest dataset with the Lasso loss, our algorithm attains similar accuracy and is much more efficient. Furthermore, we report similar primal gaps for our method and the approximate homotopy algorithm on the Robust and Asymmetric losses, while achieving significant time improvements. Notably, on the Diabetes and Large datasets for the Asymmetric loss and on the Synthetic and Large datasets for both the Asymmetric and Robust losses, we report an almost 50% reduction in the time taken to achieve a similar error. Overall, across all loss types and datasets, we either achieve similar or better errors with the same or less time relative to the standard Approximate Homotopy, demonstrating the capability of our algorithm to efficiently and accurately generate the homotopy.

To illustrate the accuracy of our algorithm, we plot the optimization error gap over the space of all $z \in [z_{\min}, z_{\max}]$ for all three loss functions and four datasets; see Figure 1. Notably, in Figure 1d all losses reach errors better than $10^{-4}$, and in the other figures all objective errors are bounded by $10^{-2}$. Our application of the Lasso and Robust losses over all datasets achieves near-zero objective error over the entire path.

Figure 1: Objective error of our achieved homotopy over the space of possible $y_{n+1}$ on all four datasets ((a) Friedman1, (b) Friedman2, (c) Diabetes, (d) Synthetic) and all loss functions.

Figure 2: The $\pi(z)$ function as generated by a ground-truth discretized searching algorithm and by our homotopy drawing algorithm for the Robust loss function over all four datasets ((a) Friedman1, (b) Friedman2, (c) Diabetes, (d) Synthetic).

Figure 3: The $\pi(z)$ function as generated by a ground-truth discretized searching algorithm and by our homotopy drawing algorithm for the Asymmetric loss function over all four datasets ((a) Friedman1, (b) Friedman2, (c) Diabetes, (d) Synthetic).

Table 1: Average time of homotopy (seconds).

| Method       | Synth. | Friedman1 | Diabetes | Friedman2 | Large   |
|--------------|--------|-----------|----------|-----------|---------|
| Our Lasso    | 1.706  | 1.945     | 1.0785   | 0.681     | 150.012 |
| Appr. Lasso  | 5.176  | 43.823    | 70.813   | 14.055    | 500.820 |
| Our Robust   | 27.156 | 1.069     | 2.411    | 0.701     | 323.372 |
| Appr. Robust | 62.894 | 1.009     | 2.734    | 0.618     | 607.203 |
| Our Asym.    | 9.270  | 3.147     | 27.349   | 2.454     | 41.269  |
| Appr. Asym.  | 18.963 | 2.699     | 54.149   | 3.342     | 82.857  |

Table 2: Average negative logarithm of the primal gap of the homotopy.

| Method       | Synth. | Friedman1 | Diabetes | Friedman2 | Large |
|--------------|--------|-----------|----------|-----------|-------|
| Our Lasso    | 12.498 | 15.844    | 16.001   | 15.241    | 7.933 |
| Appr. Lasso  | 6.597  | 6.469     | 7.554    | 6.702     | 7.558 |
| Our Robust   | 5.137  | 2.317     | 3.819    | 2.778     | 5.223 |
| Appr. Robust | 5.990  | 3.561     | 3.712    | 4.434     | 5.026 |
| Our Asym.    | 7.879  | 3.633     | 3.814    | 3.058     | 6.101 |
| Appr. Asym.  | 6.939  | 3.208     | 4.032    | 2.795     | 5.365 |

6.2. Conformal Prediction Experiments

It is a natural question whether this improvement in the generation of the homotopy function yields a strong conformal set generation algorithm. We demonstrate this both visually and empirically. We draw the $\pi(z)$ function for visual verification over all four datasets and three loss functions using our algorithm. To form a baseline, we use the Grid algorithm, which serves as a ground truth against which we compare our $\pi(z)$ function. The evaluation protocol is sketched below.
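As referenced above, the following sketch gives one plausible form of the evaluation protocol: empirical coverage and average interval length over repeated splits. The helper signatures are assumptions of this sketch, not the paper's code.

```python
import numpy as np

def evaluate_conformal(method, dataset_splits, alpha=0.1):
    """Empirical coverage and average length, the metrics reported in Figure 4.

    `method(X, y, x_new, alpha)` is assumed to return an interval (lo, hi) for
    the held-out response; `dataset_splits` yields (X, y, x_new, y_new) tuples,
    e.g. one per random train/test split."""
    covered, lengths = [], []
    for X, y, x_new, y_new in dataset_splits:
        lo, hi = method(X, y, x_new, alpha)
        covered.append(lo <= y_new <= hi)
        lengths.append(hi - lo)
    return np.mean(covered), np.mean(lengths)
```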
For empirical verification, we compare the coverage, length, and running time of our method against several important baselines: the Grid method; the Approximate homotopy of Ndiaye & Takeuchi (2019); the Oracle methodology, which has access to the true value of $y_{n+1}$ to form its conformal interval; and the Split methodology, which uses a calibration dataset to calibrate the predicted conformal values but loses statistical validity.

(a) Coverage results over several datasets.

| Method      | Diab. Lasso | Diab. Robust | Diab. Asym. | Fried1 Lasso | Fried1 Robust | Fried1 Asym. | Fried2 Lasso | Fried2 Robust | Fried2 Asym. | Synth. Lasso | Synth. Robust | Synth. Asym. |
|-------------|-------------|--------------|-------------|--------------|---------------|--------------|--------------|---------------|--------------|--------------|---------------|--------------|
| Ours        | 0.933 | 0.933 | 0.867 | 0.900 | 0.900 | 0.883 | 0.900 | 0.850 | 0.933 | 0.900 | 0.900 | 0.900 |
| Approximate | 0.933 | 0.933 | 0.867 | 0.850 | 0.900 | 0.833 | 0.900 | 0.800 | 0.900 | 0.850 | 0.900 | 0.850 |
| Split       | 0.933 | 0.867 | 0.800 | 0.867 | 0.933 | 0.867 | 1.000 | 0.867 | 0.933 | 1.000 | 1.000 | 0.900 |
| Grid        | 0.933 | 0.933 | 0.867 | 0.767 | 0.767 | 0.767 | 0.900 | 0.767 | 0.933 | 0.850 | 0.850 | 0.850 |
| Oracle      | 0.933 | 0.933 | 0.867 | 0.867 | 0.967 | 0.933 | 0.900 | 0.867 | 0.933 | 1.000 | 0.900 | 1.000 |

(b) Length results over several datasets.

| Method      | Diab. Asym. | Diab. Lasso | Diab. Robust | Fried1 Lasso | Fried1 Robust | Fried1 Asym. | Fried2 Lasso | Fried2 Robust | Fried2 Asym. | Synth. Lasso | Synth. Robust | Synth. Asym. |
|-------------|-------------|-------------|--------------|--------------|---------------|--------------|--------------|---------------|--------------|--------------|---------------|--------------|
| Ours        | 0.867 | 2.234 | 2.230 | 2.340 | 2.570 | 2.875 | 2.004 | 2.191 | 2.621 | 0.632 | 0.714 | 0.702 |
| Approximate | 0.867 | 2.262 | 2.237 | 2.368 | 2.599 | 2.897 | 2.024 | 2.245 | 2.618 | 0.786 | 0.705 | 0.790 |
| Split       | 0.800 | 2.409 | 2.429 | 2.589 | 2.837 | 3.219 | 2.361 | 2.448 | 2.888 | 0.831 | 0.831 | 0.831 |
| Grid        | 0.867 | 2.286 | 2.255 | 2.475 | 2.782 | 2.982 | 2.108 | 2.338 | 2.741 | 0.872 | 0.903 | 0.651 |
| Oracle      | 0.867 | 2.320 | 2.337 | 2.204 | 2.508 | 2.787 | 2.240 | 2.447 | 2.550 | 0.062 | 0.001 | 0.226 |

(c) Time results over several datasets (seconds).

| Method      | Diab. Lasso | Diab. Robust | Diab. Asym. | Fried1 Lasso | Fried1 Robust | Fried1 Asym. | Fried2 Lasso | Fried2 Robust | Fried2 Asym. | Synth. Lasso | Synth. Robust | Synth. Asym. |
|-------------|-------------|--------------|-------------|--------------|---------------|--------------|--------------|---------------|--------------|--------------|---------------|--------------|
| Ours        | 0.658 | 1.125  | 6.865   | 0.493 | 0.398 | 1.047  | 0.326 | 0.305 | 0.901  | 6.056  | 115.565 | 4.037 |
| Approximate | 4.012 | 33.858 | 213.995 | 5.127 | 4.234 | 17.111 | 1.538 | 2.991 | 13.178 | 30.124 | 332.152 | 9.142 |
| Split       | 0.025 | 0.086  | 0.474   | 0.139 | 0.041 | 0.102  | 0.036 | 0.042 | 0.100  | 0.599  | 0.122   | 0.039 |
| Grid        | 0.769 | 5.692  | 58.632  | 1.704 | 2.238 | 11.331 | 0.564 | 2.068 | 8.834  | 29.150 | 431.398 | 2.418 |
| Oracle      | 0.049 | 0.188  | 1.032   | 0.116 | 0.034 | 0.212  | 0.040 | 0.033 | 0.193  | 0.429  | 0.256   | 0.063 |

Figure 4: Performance of our conformal set algorithm against several baselines across many datasets and loss functions. Our algorithm maintains strong coverage, length, and time metrics across many loss functions and datasets.

Visual Results. We report the figures in Figures 2 and 3. As is evident over all loss functions and datasets, our estimated $\pi(z)$ closely traces the true $\pi(z)$ generated by the discretized searching algorithm. On particular examples, notably Figures 2d, 3a, and 3c, the trace is less accurate than on the others; however, the error stays within a range that preserves the desired coverage and length guarantees. We report similar experiments for the Lasso loss in Appendix A, since our method is exact for the Lasso loss. Overall, our homotopy drawing algorithm yields an efficient and accurate methodology for generating conformal sets for general loss functions, as tested on several datasets.

Empirical Results. We report our empirical results in Figures 4a, 4b, and 4c. We can see that most methods maintain strong coverage guarantees over all datasets.
For our experiments, we used $\alpha = 0.1$, and most of our results hover around that level of coverage. Moreover, in Figure 4b, we see that, except for the Oracle, our algorithm achieves the smallest length across several loss functions and datasets. The Oracle consistently has the best length due to its knowledge of the true $y_{n+1}$. Also, our algorithm is the fastest among all homotopy methods, although slower than Split and Oracle, as seen in Figure 4c. Therefore, our experiments indicate that our conformal prediction algorithm is competitive in all of the coverage, length, and time measures.

7. Conclusion

Our results demonstrate that we can efficiently and accurately draw the homotopy of the typicalness function of a model for several loss functions by exploiting the sparsity structure of linear models with $\ell_1$ regularization. Furthermore, we obtain explicit closed-form equations to model the behavior of this homotopy. Previous results mainly focus on quadratic loss functions or ignore the structure of the regularization altogether. Our framework, instead, captures this information and uses it to improve the accuracy of the final results. Several avenues for extending our research remain interesting. Spline instead of linear interpolation may yield improved accuracy for different loss functions. Additionally, smoothing at the kinks may reduce the algorithm's sensitivity to the primal corrector's results. Furthermore, we would like to expand our work to non-convex settings such as deep learning in future work.

References

Allgower, E. L. and Georg, K. Numerical continuation methods: an introduction. Springer Science & Business Media, 2012.

Bach, F., Thibaux, R., and Jordan, M. Computing regularization paths for learning multiple kernels. Advances in Neural Information Processing Systems, 2004.

Balasubramanian, V., Ho, S.-S., and Vovk, V. Conformal prediction for reliable machine learning: theory, adaptations and applications. Elsevier, 2014.

Bates, S., Candès, E., Lei, L., Romano, Y., and Sesia, M. Testing for outliers with conformal p-values. arXiv preprint arXiv:2104.08279, 2021.

Bertrand, Q., Klopfenstein, Q., Bannier, P.-A., Gidel, G., and Massias, M. Beyond L1: Faster and better sparse models with skglm. In NeurIPS, 2022.

Breiman, L. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Bröcker, J. and Kantz, H. The concept of exchangeability in ensemble forecasting. Nonlinear Processes in Geophysics, 2011.

Cella, L. and Ryan, R. Valid distribution-free inferential models for prediction. arXiv preprint arXiv:2001.09225, 2020.

Chang, Y.-C. and Hung, W.-L. Linex loss functions with applications to determining the optimum process parameters. Quality & Quantity, 2007.

Chernozhukov, V., Wüthrich, K., and Zhu, Y. Exact and robust conformal inference methods for predictive machine learning with dependent data. Conference On Learning Theory, 2018.

Chernozhukov, V., Wüthrich, K., and Zhu, Y. An exact and robust conformal inference method for counterfactual and synthetic controls. Journal of the American Statistical Association, 2021.

Diamond, S. and Boyd, S. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 2016.

Fisch, A., Schuster, T., Jaakkola, T., and Barzilay, R. Few-shot conformal prediction with auxiliary tasks. ICML, 2021.

Garrigues, P. and Ghaoui, L. E. An homotopy algorithm for the lasso with online observations. In Advances in Neural Information Processing Systems, pp. 489–496, 2009.
Gruber, M. Regression estimators: A comparative study. JHU Press, 2010.

Friedman, J. H. Multivariate adaptive regression splines. The Annals of Statistics, 1991.

Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 2004.

Ho, S.-S. and Wechsler, H. Query by transduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.

Holland, M. J. Making learning more transparent using conformalized performance prediction. arXiv preprint arXiv:2007.04486, 2020.

Laxhammar, R. and Falkman, G. Inductive conformal anomaly detection for sequential detection of anomalous sub-trajectories. Annals of Mathematics and Artificial Intelligence, 2015.

Lei, J. Fast exact conformalization of lasso using piecewise linear homotopy. Biometrika, 2019.

Lin, Z., Trivedi, S., and Sun, J. Conformal prediction intervals with temporal dependence. Transactions on Machine Learning Research, 2022.

Mairal, J. and Yu, B. Complexity analysis of the lasso regularization path. ICML, 2012.

Ndiaye, E. Stable conformal prediction sets. In International Conference on Machine Learning. PMLR, 2022.

Ndiaye, E. and Takeuchi, I. Computing full conformal prediction set with approximate homotopy. NeurIPS, 2019.

Ndiaye, E. and Takeuchi, I. Root-finding approaches for computing conformal prediction set. arXiv preprint arXiv:2104.06648, 2021.

Park, M. Y. and Hastie, T. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2007.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Rosset, S. and Zhu, J. Piecewise linear regularized solution paths. The Annals of Statistics, 2007.

Shafer, G. and Vovk, V. A tutorial on conformal prediction. Journal of Machine Learning Research, 2008.

Tibshirani, R. J. The lasso problem and uniqueness. Electronic Journal of Statistics, 2013.

Vovk, V., Gammerman, A., and Shafer, G. Algorithmic learning in a random world. Springer, 2005.

Xu, C. and Xie, Y. Conformal prediction interval for dynamic time-series. ICML, 2021.

Figure 5: We generate several example homotopies over the Lasso, Asymmetric, and Robust loss functions. We plot the paths produced by our algorithm and by a discretized search-space algorithm, where the space of potential candidates $z$ is split into several points and we solve for $\beta$ using proximal gradient descent at each point.

Figure 6: The $\pi(z)$ function as generated by a ground-truth discretized searching algorithm and by our homotopy drawing algorithm for the quadratic loss function over all four datasets ((a) Friedman1, (b) Friedman2, (c) Diabetes, (d) Synthetic).

A. Additional Visualizations

We provide two extra visualizations for the reader's understanding: figures of the homotopy for different loss functions, and the conformality function $\pi(z)$ for the quadratic loss function on both our real and synthetic datasets.
Homotopy Visualizations. We run our homotopy generation algorithm with the quadratic, Robust, and Asymmetric loss functions over several low-dimensional synthetic examples. As we can see in Figure 5a, for the quadratic loss, our homotopy algorithm perfectly matches the Grid baseline, since our algorithm recovers the quadratic-loss homotopy of Lei (2019) exactly. Moreover, in Figures 5b and 5c, we see that our algorithm very closely matches the homotopy of the Grid algorithm. In the Robust and Asymmetric cases, the linearization causes a slight miss in the kink, but the difference is negligible. Across all dimensions, our homotopy generation algorithm closely tracks that of the baseline Grid algorithm. These visualizations verify visually that our homotopy generation algorithm is accurate.

Conformity Function for Quadratic Loss. We also visualize the conformality function $\pi(z)$ across several datasets for the quadratic loss function. We do not include these in the main manuscript since our algorithm is exact for the quadratic loss, and no visual verification is truly needed. Nevertheless, we provide such visuals in Figure 6. Our conformal prediction algorithm indeed matches the Grid baseline algorithm exactly, confirming our claim that our algorithm is exact for the quadratic loss function.

B. Proofs for Properties of GLMs

B.1. Proof of Lemma 2.1

Lemma 2.1. A vector $\beta^*(z) \in \mathbb{R}^p$ is optimal for Equation (3) if and only if, for $y^*(z) = X\beta^*(z)$, it holds that
$$X^\top \nabla_2 f(y(z), y^*(z)) = -\lambda v(z), \qquad (4)$$
where $v(z)$ belongs to the subdifferential of the $\ell_1$ norm at $\beta^*(z)$, i.e., for all $j \in \{1, \ldots, p\}$,
$$v_j(z) \in \begin{cases} \{\mathrm{sign}(\beta^*_j(z))\} & \text{if } \beta^*_j(z) \neq 0, \\ [-1, 1] & \text{if } \beta^*_j(z) = 0. \end{cases} \qquad (5)$$

Proof. The Fermat rule reads
$$0 \in X^\top \nabla_2 f(y(z), y^*(z)) + \lambda\, \partial\|\cdot\|_1(\beta^*(z)).$$
Defining $v(z) \in \partial\|\cdot\|_1(\beta^*(z))$ such that $X^\top \nabla_2 f(y(z), y^*(z)) = -\lambda v(z)$ yields Equation (4). To show Equation (5), we examine $v_j(z)$ for an arbitrary index $j$. We remind that, by separability of the $\ell_1$ norm, $v_j(z) \in \partial|\cdot|(\beta^*_j(z))$. Hence, $v_j(z) = \mathrm{sign}(\beta^*_j(z))$ if $\beta^*_j(z) \neq 0$ and $v_j(z) \in [-1, 1]$ otherwise. This proves the claim.

B.2. Proof of Lemma 2.2

Lemma 2.2. For all $z$, assume that the matrix $X_{A(z)}$ is full rank and that the loss function $f$ is strictly convex. Under these two assumptions, for every candidate $z$, there exists a unique optimal solution $\beta^*(z)$. Thus, the solution path $z \mapsto \beta^*(z)$ is well defined.

Proof. We first prove that $A(z)$ is unique. From the definition of the active set, we have
$$A(z) = \left\{ j \in [p] : |X_j^\top \nabla_2 f(y(z), X\beta^*(z))| = \lambda \right\},$$
where we remind that
$$\beta^*(z) \in \arg\min_{\beta \in \mathbb{R}^p} f(y(z), X\beta) + \lambda\|\beta\|_1.$$
Since, by strict convexity of $f$, the prediction $X\beta^*(z)$ is unique for any solution $\beta^*(z)$ of the above optimization problem, the set $A(z)$ is uniquely defined. From the first-order optimality condition, there exists $v(z) \in \partial\|\cdot\|_1(\beta^*(z))$ such that
$$0 \in X^\top \nabla_2 f(y(z), X\beta^*(z)) + \lambda v(z).$$
Restricting to the active set yields
$$0 \in X_A^\top \nabla_2 f(y(z), X_A\beta^*_A(z)) + \lambda v_A(z), \quad \text{i.e.,} \quad \beta^*_A(z) \in \arg\min_{w \in \mathbb{R}^{|A|}} f(y(z), X_A w) + \lambda\|w\|_1.$$
Since $f$ is strictly convex and $X_A$ is full rank, the latter optimization problem is strictly convex, meaning $\beta^*_A(z)$ is unique.

B.3. Proof of Lemma B.1

Lemma B.1. Let $0$ denote the vector of zeros. For all $z \in [z_{\min}, z_{\max}]$, the optimal weights $\beta^*(z)$ satisfy
$$\{\beta^*(z) : z \in [z_{\min}, z_{\max}]\} \subseteq \{\beta : \|\beta\|_1 \le R/\lambda\} \quad \text{where} \quad R = \sup_{z \in [z_{\min}, z_{\max}]} f(y(z), 0).$$

Proof. Let us denote the objective function by
$$P(\beta, z) = f(y(z), X\beta) + \lambda\|\beta\|_1.$$
We remind the reader that the solution $\beta^*(z)$ satisfies $\beta^*(z) = \arg\min_\beta P(\beta, z)$.
By optimality and the assumption that $f$ is non-negative, we have, for any $z$,
$$\lambda\|\beta^*(z)\|_1 \le P(\beta^*(z), z) \le P(0, z) = f(y(z), 0).$$
Here, $0$ is the vector of zeros; the first step comes from the definition of $P$, and the second inequality from the fact that $\beta^*(z)$ is a minimizer of $P(\cdot, z)$. Naturally, the $\ell_1$ norm of the optimal weights $\beta^*(z)$ is then bounded by $f(y(z), 0)/\lambda$: any solution $\beta^*(z)$ lies inside the $\ell_1$ ball centered at $0$ with radius $R/\lambda$. Since the path is truncated, i.e., $z \in [z_{\min}, z_{\max}]$, the solution path is bounded:
$$\{\beta^*(z) : z \in [z_{\min}, z_{\max}]\} \subseteq \left\{\beta : \|\beta\|_1 \le \frac{R}{\lambda}\right\} \quad \text{where} \quad R = \sup_{z \in [z_{\min}, z_{\max}]} f(y(z), 0).$$
Also, it is easy to see that, along the path,
$$\|y(z)\|_2 \le \max(\|y(z_{\min})\|_2, \|y(z_{\max})\|_2), \qquad \|y^*(z)\|_2 = \|X_A\beta^*_A\|_2 \le \sigma_{\max}(X_A)\,\frac{R}{\lambda},$$
$$\|y(z)\|_2^2 + \|y^*(z)\|_2^2 \le \max(\|y(z_{\min})\|_2, \|y(z_{\max})\|_2)^2 + \left(\sigma_{\max}(X_A)\,\frac{R}{\lambda}\right)^2.$$
Note that, for simplicity, we naturally suppose that the estimate $\hat\beta(z)$ is a better minimizer than the vector $0$. Thus, the same bounds above hold for $\hat\beta(z)$, $\hat y(z)$, and $\hat\zeta(z)$.

C. Choice of the Range $[z_{\min}, z_{\max}]$

Lemma C.1. Choosing $z_0 = z_{\max} = \max(y_1, \ldots, y_n)$ and stopping once $z_t \le z_{\min} = \min(y_1, \ldots, y_n)$ reduces the probability of coverage by at most $\frac{2}{n+1}$.

Proof. Given our exchangeability assumption, the probability that $y_{n+1} \ge z_{\max}$ is at most $\frac{1}{n+1}$. Similarly, the probability that $y_{n+1} \le z_{\min}$ is at most $\frac{1}{n+1}$. Therefore, by the union bound, the stopping criteria used in our algorithm affect the coverage by at most $\frac{2}{n+1}$, which becomes negligible as $n$ grows.

D. Details on $\nabla_2 f$

In the main text, we mentioned that we approximate $\nabla_2 f(y(z_+), y^*(z_+))$ via linearization:
$$\nabla_2 f \circ \zeta(z_+) \approx \nabla_2 f \circ \zeta(z) + \nabla_{2,1} f \circ \zeta(z)\,(y(z_+) - y(z)) + \nabla_{2,2} f \circ \zeta(z)\,(y^*(z_+) - y^*(z)),$$
with
$$y(z_+) - y(z) = (0, \ldots, 0, z_+ - z)^\top, \qquad y^*(z_+) - y^*(z) = X_A(\beta^*_A(z_+) - \beta^*_A(z)) \approx X_A \frac{\partial \beta^*_A}{\partial z}(z)\,(z_+ - z).$$
Finally, with a plug-in approach, we approximate $\hat\beta'_A(z) \approx \frac{\partial \beta^*_A}{\partial z}(z)$ and obtain
$$\nabla_2 f \circ \zeta(z_+) \approx \nabla_2 f \circ \hat\zeta(z) + \left( \left[\nabla_{2,1} f \circ \hat\zeta(z)\right]_{n+1} + \nabla_{2,2} f \circ \hat\zeta(z)\, X_A \hat\beta'_A(z) \right)(z_+ - z).$$
Here, the first approximation comes from the definition, the second from applying the chain rule to the intermediate variables, the third from a simple notational switch, and the final expression from using our estimator of $\frac{\partial \beta^*_A}{\partial z}$. Now, we can find the roots of $I_j$ from Equation (9) as
$$z^{\text{in}}_{j,t+1} = z_t + \frac{\pm\lambda - X_j^\top \nabla_2 f \circ \hat\zeta(z_t)}{X_j^\top \left( \left[\nabla_{2,1} f \circ \hat\zeta(z_t)\right]_{n+1} + \nabla_{2,2} f \circ \hat\zeta(z_t)\, X_A \hat\beta'_A(z_t) \right)}.$$
We can now present our desired lemma.

Lemma D.1. The gradient of the solution path $\frac{\partial \beta^*_A}{\partial z}(z)$, as well as its estimate $\hat\beta'_A(z)$, are bounded as follows:
$$\left\| \frac{\partial \beta^*_A}{\partial z}(z) \right\|, \ \left\| \hat\beta'_A(z) \right\| \le \frac{\sigma_{\max}(X_{A(z)})}{\sigma^2_{\min}(X_{A(z)})}\, \frac{\nu_f}{\mu_f}.$$

Proof. We remind that $\frac{\partial y}{\partial z} = (0, \ldots, 0, 1)^\top$ and
$$\frac{\partial \beta^*_A}{\partial z}(z) = -\left(\frac{\partial H}{\partial \beta}\right)^{-1} \frac{\partial H}{\partial y}\, \frac{\partial y}{\partial z}, \qquad \hat\beta'_A(z) = -\left(\frac{\partial \hat H}{\partial \beta}\right)^{-1} \frac{\partial \hat H}{\partial y}\, \frac{\partial y}{\partial z},$$
where
$$\frac{\partial H}{\partial \beta} = X_A^\top \nabla_{2,2} f \circ \zeta(z)\, X_A, \quad \frac{\partial \hat H}{\partial \beta} = X_A^\top \nabla_{2,2} f \circ \hat\zeta(z)\, X_A, \quad \frac{\partial H}{\partial y} = X_A^\top \nabla_{2,1} f \circ \zeta(z), \quad \frac{\partial \hat H}{\partial y} = X_A^\top \nabla_{2,1} f \circ \hat\zeta(z).$$
Hence
$$\left\| \frac{\partial \beta^*_A}{\partial z}(z) \right\| \le \left\| \left(\frac{\partial H}{\partial \beta}\right)^{-1} \right\|_2 \left\| \frac{\partial H}{\partial y} \right\|_2 \left\| \frac{\partial y}{\partial z} \right\|_2.$$
By definition, we have, for any $z \in [z_{\min}, z_{\max}]$,
$$\sigma_{\min}\!\left( \frac{\partial H}{\partial \beta}(y(z), \beta^*_A(z)) \right) = \sigma_{\min}\!\left( X_A^\top \nabla_{2,2} f \circ \zeta(z)\, X_A \right) \ge \sigma_{\min}(X_A^\top X_A)\, \inf_{\zeta \in B} \sigma_{\min}(\nabla_{2,2} f(\zeta)).$$
Since $f$ is assumed to be $\mu_f$-strongly convex from Lemma 5.2, it holds that $\inf_{\zeta \in B} \sigma_{\min}(\nabla_{2,2} f(\zeta)) \ge \mu_f > 0$, and hence
$$\left\| \left(\frac{\partial H}{\partial \beta}\right)^{-1} \right\|_2 \le \frac{1}{\sigma^2_{\min}(X_A)\, \mu_f}.$$
Similarly, since $f$ is smooth with constant $\nu_f$ from Lemma 5.1, we have
$$\left\| \frac{\partial H}{\partial y}(y(z), \beta^*_A(z)) \right\|_2 = \left\| X_A^\top \nabla_{2,1} f \circ \zeta(z) \right\|_2 \le \sigma_{\max}(X_A)\, \|\nabla_{2,1} f \circ \zeta(z)\|_2 \le \sigma_{\max}(X_A)\, \nu_f.$$
Hence the result. The proof of the upper bound for the estimated gradient norm follows the same lines.

Theorem 5.1. The error between our linearized weights $\tilde\beta(z_+)$ and the true weights $\beta^*(z_+)$ is upper bounded by
$$\|\tilde\beta(z_+) - \beta^*(z_+)\|_2 \le \epsilon_{\text{tol}} + L\, \frac{\nu_f}{\mu_f}\, |z_+ - z_t|,$$
where
$$L = \frac{\sigma_{\max}(X_{A(z_t)})}{\sigma^2_{\min}(X_{A(z_t)})} + \sup_{z \in [z_+, z_t]} \frac{\sigma_{\max}(X_{A(z)})}{\sigma^2_{\min}(X_{A(z)})},$$
and $z_t$ is the kink preceding $z_+$.

Proof.
To analyze $\|\tilde\beta(z_+) - \beta^*(z_+)\|_2$, we use the definition from our algorithm, $\tilde\beta(z_+) = \hat\beta(z_t) + \hat\beta'(z_t)(z_+ - z_t)$, where $z_t$ is the last point at which we ran the primal corrector. Using this, we can decompose the error as follows:
$$\begin{aligned}
\|\tilde\beta(z_+) - \beta^*(z_+)\|_2 &= \|\hat\beta(z_t) + \hat\beta'(z_t)(z_+ - z_t) - \beta^*(z_+)\|_2 \\
&= \|\hat\beta(z_t) + \hat\beta'(z_t)(z_+ - z_t) - \beta^*(z_+) + \beta^*(z_t) - \beta^*(z_t)\|_2 \\
&\le \|\hat\beta(z_t) - \beta^*(z_t)\|_2 + \left\| \int_{z_t}^{z_+} \left( \hat\beta'(z_t) - \beta^{*\prime}(z) \right) dz \right\|_2 \\
&\le \|\hat\beta(z_t) - \beta^*(z_t)\|_2 + \sup_{z \in [z_+, z_t]} \left\| \hat\beta'(z_t) - \beta^{*\prime}(z) \right\|_2\, |z_+ - z_t|.
\end{aligned}$$
Here, the third line comes from the fact that $\beta^*(z_+) - \beta^*(z_t) = \int_{z_t}^{z_+} \beta^{*\prime}(z)\, dz$. Now, from the triangle inequality and Lemma D.1, we have
$$\sup_{z \in [z_+, z_t]} \left\| \hat\beta'(z_t) - \beta^{*\prime}(z) \right\|_2 \le \left\| \hat\beta'(z_t) \right\|_2 + \sup_{z \in [z_+, z_t]} \left\| \beta^{*\prime}(z) \right\|_2 \le \left[ \frac{\sigma_{\max}(X_{A(z_t)})}{\sigma^2_{\min}(X_{A(z_t)})} + \sup_{z \in [z_+, z_t]} \frac{\sigma_{\max}(X_{A(z)})}{\sigma^2_{\min}(X_{A(z)})} \right] \frac{\nu_f}{\mu_f}.$$
Now, if the point $z_+$ is such that $z_+ \in (z_{t+1}, z_t]$, the active sets at $z_+$ and $z_t$ coincide, and we can simplify:
$$\|\tilde\beta(z_+) - \beta^*(z_+)\|_2 \le \epsilon_{\text{tol}} + 2\, \frac{\sigma_{\max}(X_{A(z_t)})}{\sigma^2_{\min}(X_{A(z_t)})}\, \frac{\nu_f}{\mu_f}\, |z_+ - z_t|.$$
Note that if the candidate $z_+ = z_t$ is exactly a kink, the right-most term vanishes and only the corrector error remains. If $z_+$ and $z_t$ do not share the same active set, we have
$$\|\tilde\beta(z_+) - \beta^*(z_+)\|_2 \le \epsilon_{\text{tol}} + L\, \frac{\nu_f}{\mu_f}\, |z_+ - z_t| \quad \text{with} \quad L = \frac{\sigma_{\max}(X_{A(z_t)})}{\sigma^2_{\min}(X_{A(z_t)})} + \sup_{z \in [z_+, z_t]} \frac{\sigma_{\max}(X_{A(z)})}{\sigma^2_{\min}(X_{A(z)})}.$$

Theorem 5.2. The estimation error is upper bounded by
$$\|\nabla_2 f \circ \zeta(z_+) - \nabla_2 f \circ \hat\zeta(z_+)\|_2 \le K \left( \epsilon_{\text{tol}} + L\, \frac{\nu_f}{\mu_f}\, |z_+ - z_t| \right), \quad \text{where } K = \nu_f\, \sigma_{\max}(X_A).$$

Proof. Let us define, for $t \in [0, 1]$, the function
$$\phi(t) = \nabla_2 f\big(\hat\zeta(z_+) + t(\zeta(z_+) - \hat\zeta(z_+))\big).$$
From the fundamental theorem of calculus, $\phi(1) - \phi(0) = \int_0^1 \phi'(t)\, dt$, where
$$\phi(1) - \phi(0) = \nabla_2 f(\zeta(z_+)) - \nabla_2 f(\hat\zeta(z_+))$$
and
$$\phi'(t) = \nabla_{2,1} f\big(\hat\zeta(z_+) + t(\zeta(z_+) - \hat\zeta(z_+))\big)\, [\zeta(z_+) - \hat\zeta(z_+)]_1 + \nabla_{2,2} f\big(\hat\zeta(z_+) + t(\zeta(z_+) - \hat\zeta(z_+))\big)\, [\zeta(z_+) - \hat\zeta(z_+)]_2.$$
We remind the reader that, by definition,
$$[\zeta(z_+) - \hat\zeta(z_+)]_1 = y(z_+) - y(z_+) = 0, \qquad [\zeta(z_+) - \hat\zeta(z_+)]_2 = y^*(z_+) - \hat y(z_+),$$
and deduce that
$$\begin{aligned}
\|\nabla_2 f \circ \zeta(z_+) - \nabla_2 f \circ \hat\zeta(z_+)\|_2 &= \|\phi(1) - \phi(0)\|_2 \\
&\le \left\| \int_0^1 \nabla_{2,2} f\big(\hat\zeta(z_+) + t(\zeta(z_+) - \hat\zeta(z_+))\big)\, [\zeta(z_+) - \hat\zeta(z_+)]_2\, dt \right\|_2 \\
&\le \sup_{t \in [0,1]} \left\| \nabla_{2,2} f\big(\hat\zeta(z_+) + t(\zeta(z_+) - \hat\zeta(z_+))\big) \right\|_2\, \|y^*(z_+) - \hat y(z_+)\|_2 \\
&\le \nu_f\, \|X_A \beta^*_A(z_+) - X_A \hat\beta_A(z_+)\|_2 \\
&\le \nu_f\, \sigma_{\max}(X_A)\, \|\beta^*_A(z_+) - \hat\beta_A(z_+)\|_2 \\
&\le \nu_f\, \sigma_{\max}(X_A) \left( \epsilon_{\text{tol}} + L\, \frac{\nu_f}{\mu_f}\, |z_+ - z_t| \right).
\end{aligned}$$
Here, the bound involving $\nu_f$ comes from Lemma 5.1, and the final inequality comes from Theorem 5.1.