Robust Yet Efficient Conformal Prediction Sets

Soroush H. Zargarbashi 1, Mohammad Sadegh Akhondzadeh 2, Aleksandar Bojchevski 2

1 CISPA Helmholtz Center for Information Security, 2 University of Cologne. Correspondence to: Soroush H. Zargarbashi. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Conformal prediction (CP) can convert any model's output into prediction sets guaranteed to include the true label with any user-specified probability. However, like the model itself, CP is vulnerable to adversarial test examples (evasion) and perturbed calibration data (poisoning). We derive provably robust sets by bounding the worst-case change in conformity scores. Our tighter bounds lead to more efficient sets. We cover both continuous and discrete (sparse) data, and our guarantees work both for evasion and poisoning attacks (on both features and labels).

1. Introduction

Uncertainty quantification (UQ) is crucial for deploying models, especially in safety-critical domains. The predicted probability is not a reliable source for UQ as it is often uncalibrated (Guo et al., 2017). Most methods do not provide any guarantees and require retraining or modifications in the model architecture (Abdar et al., 2021). Instead, conformal prediction (CP) returns prediction sets with a distribution-free guarantee to cover the true label. It only requires black-box access to the model and assumes exchangeable data (a weaker assumption than i.i.d.). This makes CP flexible: we can apply it to image classification, segmentation (Angelopoulos et al., 2023), question answering (Angelopoulos et al., 2022), and node classification (Huang et al., 2023).

Most models suffer a significant performance drop when fed noisy or manipulated data, even for indistinguishable (label-preserving) perturbations (Silva & Najafirad, 2020). Adversaries can exploit this vulnerability by perturbing the training data (poisoning) or the test data (evasion). CP's performance is also sensitive to the same attacks. One goal of the adversary is to break the guarantee, reducing the probability to cover the true label by perturbing the test inputs (evasion) or poisoning the calibration data. In all settings, the perturbations are limited according to a threat model, e.g. a ball of a given radius around the clean input (see §2). Unlike heuristic defenses, which are easily overcome by new attacks (Athalye et al., 2018; Mujkanovic et al., 2022), certificates provide worst-case guarantees that the prediction does not change. How can we extend robustness certificates to conformal prediction sets?

Given calibration data and a score function $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ capturing conformity (agreement) between inputs and all potential labels, CP finds a calibrated threshold $q_\alpha$ and defines prediction sets $C_\alpha(x) = \{y : s(x, y) \ge q_\alpha\}$ that include all labels with scores above it. CP guarantees that $\Pr[y_\text{true} \in C_\alpha(x)] \ge 1 - \alpha$ for a clean x, exchangeable with the calibration data, and any user-specified α. To certify robustness, we can define conservative sets that ensure the coverage remains above $1 - \alpha$ even under perturbation. To this end, Gendler et al. (2021) leverage the fact that the randomly smoothed scores $\mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^2 I)}[s(x + \delta, y)]$ change slowly around the input to compute an upper bound on the worst-case score.
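For intuition, the smoothed score is estimated in practice with Monte-Carlo samples. A minimal sketch is below; `score_fn` is a hypothetical stand-in for any base conformity score (e.g. the softmax probability of class y) and is not part of the paper's released code.

```python
import numpy as np

def smoothed_score(score_fn, x, y, sigma=0.25, n_samples=10_000, seed=0):
    """Monte-Carlo estimate of E_{delta ~ N(0, sigma^2 I)}[s(x + delta, y)].

    `score_fn(x, y)` is a hypothetical base conformity score in [0, 1].
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=sigma, size=(n_samples,) + np.shape(x))
    samples = np.array([score_fn(x + d, y) for d in noise])
    return samples.mean()
```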
Their randomly smoothed conformal prediction (RSCP) method has 4 limitations: (i) It considers only the mean of randomized scores resulting in a looser bound and thus larger sets; (ii) It only certifies evasion but not poisoning attacks; (iii) It only supports L2-bounded perturbations of continuous data, ignoring discrete and sparse data such as graphs; (iv) It does not correct for finite-sample approximation errors. We address all of these limitations. Our key insight is that we can use the cumulative distribution (CDF) of smooth scores to obtain tighter upper bounds. The resulting CDF-aware sets are smaller while maintaining the same robustness guarantee. For continuous data we reuse Kumar et al. (2020) s bound developed to certify confidence, while for discrete/graph data we extend the bounds of Bojchevski et al. (2020).1 We then propose an approach for finite sample correction. Different from Yan et al. (2024), we bound calibration points instead of test points. In addition to being significantly faster (especially for large datasets like Image Net), our calibration-time algorithm also leads to smaller sets when correcting for finite samples. 1Both of these methods do not provide sets or CP guarantees. Robust Yet Efficient Conformal Prediction Sets Currently, there are no CP methods designed to handle poisoning. To fill this gap, we further derive provably robust sets that maintain worst-case coverage when either the features or the labels of the calibration set can be perturbed. Moreover, the poisoning guarantee is independent of how the bound on conformity scores is derived. Hence, our poisoning-aware and evasion-aware methods can be combined to provide robustness to both attacks simultaneously. In short, we introduce CDF-Aware smoothed prediction Sets (CAS) that provably cover the true label under adversarial attacks. For evasion, we show a consistent improvement on all metrics and datasets compared to RSCP. Moreover, for the first time, we additionally provide guarantees for poisoning, as well as discrete and sparse data. 2. Background Conformal prediction. Given a holdout calibration set Dcal = {(xi, yi)}n i=1 exchangeably sampled from the data distribution (or a finite dataset) with labels unseen by the model (during training), and a user-specified coverage probability 1 α, for any test point xn+1, CP defines a prediction set Cα(xn+1) Y that is guaranteed to cover the true label yn+1 with the predetermined probability. Theorem 2.1 (Vovk et al. (2005)). If Dcal = {(xi, yi)}n i=1, and (xn+1, yn+1) are exchangeable, for any continuous score function s : X Y 7 R capturing the agreement between x, and y, and user-specified α (0, 1), the prediction set defined as Cα(xn+1) = {y : s(xn+1, y) qα} has coverage probability Pr [yn+1 Cα(xn+1)] 1 α (1) where qα := Quant (α; {s(xi, yi)}n i=1) is the α-quantile of the true scores in the calibration set. This theorem was extended to graphs (Zargarbashi et al., 2023; Huang et al., 2023) showing that the same guarantee holds for node classification. Although the coverage is guaranteed regardless of the choice of score function, a good choice is reflected in the size of the prediction sets (also called efficiency), the proportion of singleton sets covering the true label, and other metrics. A simple score function known as threshold prediction sets (TPS) directly considers the model s output s(x, y) = π(x, y) where π are the class probability (softmax) estimates (Sadinle et al., 2018). 
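As a concrete instance of Theorem 2.1, a minimal split-conformal sketch with the TPS score is shown below. This is a hedged sketch: `predict_proba` is a hypothetical model wrapper, and the quantile uses the standard finite-sample index.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """alpha-quantile of the calibration conformity scores (Theorem 2.1).

    The finite-sample index floor(alpha * (n + 1)) guarantees
    Pr[s(x_{n+1}, y_{n+1}) >= q_alpha] >= 1 - alpha under exchangeability.
    """
    n = len(cal_scores)
    k = int(np.floor(alpha * (n + 1)))
    return np.sort(cal_scores)[k - 1] if k >= 1 else -np.inf

def tps_prediction_set(probs, q_alpha):
    """TPS score s(x, y) = pi(x, y): keep every label whose softmax
    probability reaches the calibrated threshold."""
    return np.where(probs >= q_alpha)[0]

# Usage sketch (predict_proba is a hypothetical model wrapper):
# cal_scores = np.array([predict_proba(x)[y] for x, y in cal_set])
# q_alpha = conformal_threshold(cal_scores, alpha=0.1)
# pred_set = tps_prediction_set(predict_proba(x_test), q_alpha)
```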
TPS tends to over-cover easy examples and under-cover hard ones (Angelopoulos & Bates, 2021). This is remedied by the commonly used adaptive prediction sets (APS) score defined as s(x, y) := (ρ(x, y) + u π(x)y). Here ρ(x, y) := PK c=1 π(x)c1 [π(x)c > π(x)y] is the sum of all classes predicted as more likely than y, and u [0, 1] is a uniform random value that breaks the ties between different scores to allow exact 1 α coverage (Romano et al., 2020). While we report our results on both scoring functions, our approach is orthogonal and hence applicable to any other choice (see A for an extended introduction to CP). Adversarial attacks. We define the threat model the set of all possible perturbations the adversary can apply by a ball centered around a clean input x. For continuous x we consider the l2 ball of radius r around the input Br(x) = { x X : || x x||2 r}. For binary data, we define the ball w.r.t. the number of flipped bits: Bra,rd(x) = { x X : Pd i=1 1[ xi = xi 1] rd, Pd i=1 1[ xi = xi + 1] ra} where rd and ra are the numbers of deleted and added bits respectively. This distinction accounts for sparsity as shown by Bojchevski et al. (2020). We discuss categorical data in C, extensions to other threat models are simple. Evasion attacks. For a given input x and the model f, the adversary s usual goal is to find a perturbed input x such that f( x) = f(x) (Yuan et al., 2019; Madry et al., 2017). In CP, the goal changes to excluding the true label from the prediction set Cα( x) which breaks the guarantee in Eq. 1. Here we assume that CP is calibrated with clean calibration points. Poisoning attacks. The adversary can perturb the training data to e.g. decrease accuracy. However, since CP is modelagnostic, the guarantee holds regardless of the model s accuracy. Instead, here the goal of the adversary is to perturb the calibration set in order to decrease the empirical coverage breaking the guarantee (see formal definition in 3.2). 3. Robust Prediction Sets 3.1. Robustness to Evasion Attacks Definition 3.1 (Robust coverage). The prediction sets Cα( ) have adversarially robust 1 α coverage if for any (xn+1, yn+1) exchangeable with Dcal Pr [yn+1 Cα( xn+1) | xn+1 B(xn+1)] 1 α (2) where B(x) can be the l2 ball Br(x), the binary ball Bra,rd, or any other threat model. Gendler et al. (2021) define a score srscp(x, y) = Φ 1(Eδ N(0,σ2I)[s(x + δ, y)]) based on Gaussian smoothing (Cohen et al., 2019) where Φ 1( ) is the inverse CDF of N(0, 1). Since the smooth score is bounded, srscp( x, y) srscp(x, y) + r σ, x Br(x) they shift the quantile qα = qα r σ to ensure robustness. Instead of shifting the quantile we directly bound the conformal scores which is a slight generalization. Proposition 3.1. Define s(x, y) as the upper bound for {s( x, y) : x B(x)}. With qα as the α-quantile of the true (clean) calibration scores, let Cα(x) = {y : s(x, y) qα}. For all xn+1 B(xn+1), if (xn+1, yn+1) is exchangeable with Dcal then we have Pr yn+1 Cα( xn+1) 1 α. All omitted proofs are in D.1. We summarize our notation Robust Yet Efficient Conformal Prediction Sets in K. In short, the conservative set for any x B(x) includes the labels of the vanilla prediction set for x. Thus, the coverage guarantee also applies for the perturbed points. RSCP is a special case with s(x, y) = srscp(x, y) + r σ. We can equivalently rewrite RSCP as an upper bound on E[s( , )] instead of Φ 1(E[s( , )]) which matches the bound from Kumar et al. (2020) (see F.1). In 4 we significantly improve the bound using the CDF. 
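For concreteness, here is a small sketch of Prop. 3.1 instantiated with the RSCP-style baseline bound $\bar{s}(x, y) = s_\text{rscp}(x, y) + r/\sigma$, assuming the smoothed probabilities are already estimated; the CDF-aware bound of §4 would simply replace the upper-bound computation. Function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def rscp_score(smooth_prob):
    """s_rscp(x, y) = Phi^{-1}(E[s(x + delta, y)]), clipped away from {0, 1}."""
    return norm.ppf(np.clip(smooth_prob, 1e-6, 1 - 1e-6))

def robust_set_baseline(smooth_probs_test, q_alpha, r, sigma):
    """Prop. 3.1 with the baseline (RSCP) bound: include label y whenever the
    worst-case score s_rscp(x, y) + r / sigma still reaches the clean threshold."""
    upper = rscp_score(smooth_probs_test) + r / sigma
    return np.where(upper >= q_alpha)[0]

# q_alpha is the alpha-quantile of {s_rscp(x_i, y_i)} over the clean calibration
# set, e.g. computed with the conformal_threshold helper sketched earlier.
```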
Tighter bounds result in smaller (more efficient) sets. 3.2. Robustness to Feature Poisoning Attacks We assume that the adversary can modify at most k instances, 0 k n = |Dcal|, whose features can be perturbed in a (continuous or discrete) ball B around the clean features. We define the threat model at dataset-level: Bk,B(D) = { D : D = {( xi, yi) : (xi, yi) D, j=1 1[ xj = xj] k}} Let qα be the α-quantile of the clean calibration scores. To decrease coverage the adversary aims to find a perturbed calibration set Dcal Bk,B(Dcal) that moves the quantile qα = Quant(α; Dcal) as right as possible compared to qα.2 This shift increases the probability of rejecting true labels, resulting in a lower coverage. Namely, for α = Quant 1( qα; Dcal), the quantile inverse of the poisoned threshold q w.r.t. the clean calibration set, the poisoned calibration set results in near 1 α coverage where by definition 1 α 1 α. Given a potentially poisoned calibration set Dcal we certify the prediction sets via the following optimization problem: qα = min zi X Quant (α; {s(zi, yi)}n i=1) s.t. ( xi, yi) Dcal : zi B( xi) X i n 1[zi = xi] k (3) The problem in Eq. 3 finds the most conservative quantile qα and it holds that qα qα since for any perturbed Dcal by definition it holds Dcal Bk,B( Dcal). We show that the minimizer of problem Eq. 3 certifies at least 1 α coverage. Proposition 3.2. Let qα to be the solution to the optimization problem in Eq. 3. With the conservative prediction sets Cα(xn+1) = yi : s(xn+1, yi) qα for any (xn+1, yn+1) exchangeable with (clean) Dcal we have Pr yn+1 Cα(xn+1) 1 α. 2Our setup works with conformity score capturing the agreement between x and y. With a non-conformity score, the goal is to equivalently shift the quantile to the left (see A). With access to lower and upper bounds on the adversarial scores we can change the constraint zi B( xi) in Eq. 3 to zi [s( xi, yi), s( xi, yi)] where zi R is a scalar variable, and solve the relaxed problem. We describe in 4 how to obtain such bounds using randomized smoothing which we can use in both Prop. 3.1 and Prop. 3.2. Regardless of how we solve Eq. 3, as long as it finds a qα qα conditional on the clean Dcal the guarantee holds. 3.3. Robustness to Label Poisoning Attacks In the label poisoning setup, the adversary can flip the labels of at most k datapoints in the calibration set, again aiming to shift the quantile to the right. As before, we can find the most conservative quantile by solving the problem: qα = min zi Y Quant α; n s(xi, zi) : (xi, yi) Dcal o i n 1[zi = yi] k (5) Similar to 3.2, since qα qα, prediction sets defined as in Eq. 4 maintain 1 α coverage even under worst-case label perturbation. We can solve both problems (Eq. 3 and Eq. 5) by writing them as mixed-interger linear programs (MILPs). We present the technical details in G. Interestingly, our evasion-aware sets can easily be combined with our poisoning-aware threshold to obtain prediction sets that are robust to both types of attacks. Similarly, we can easily combine the feature and label poisoning constraints in a single problem. We discuss these extensions in H. 4. Randomized Smoothing Bounds To instantiate the conservative sets Cα( ) defined in 3 we need bounds on the worst-case change in conformity scores under perturbation. There is a rich literature on robustness certificates for standard classification (Li et al., 2023) that we can lean on, since they often need to compute similar bounds as a byproduct. 
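Before moving to the smoothing bounds, here is a sketch of the relaxed version of Eq. 3 described above: once each calibration score has a certified lower bound, the worst-case quantile can be found exactly by scanning candidate values. For label poisoning (Eq. 5), the lower bound is the minimum score over wrong labels. The function name and interface are illustrative, not from the paper's code.

```python
import numpy as np

def worst_case_quantile(scores, lower_bounds, k, alpha):
    """Relaxation of Eq. 3: the adversary may replace at most k of the observed
    calibration scores, each by any value down to its certified lower bound.
    Returns the smallest alpha-quantile the adversary can achieve.

    For label poisoning (Eq. 5), use lower_bounds[i] = min_{z != y_i} s(x_i, z).
    """
    scores = np.asarray(scores, dtype=float)
    lower_bounds = np.asarray(lower_bounds, dtype=float)
    n = len(scores)
    m = int(np.floor(alpha * (n + 1)))           # index of the conformal quantile
    if m < 1:
        return -np.inf
    candidates = np.sort(np.concatenate([scores, lower_bounds]))
    for q in candidates:                         # smallest value the quantile can reach
        below = np.sum(scores <= q)              # already below q without any edit
        movable = np.sum((scores > q) & (lower_bounds <= q))  # can be pushed below q
        if below + min(k, movable) >= m:
            return q
    return candidates[-1]                        # never reached: quantile unchanged
```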
We focus on methods based on the randomized smoothing framework (Cohen et al., 2019) given their high flexibility and black-box nature. This couples well with the flexibility of CP, ensuring that our final robust CP method can be broadly applied. Smooth scores. A smoothing scheme ξ : X 7 X is a function that maps the input x to a nearby random point. Given an arbitrary score s( , ), we compute the expected (smooth) conformal scores as ˆs(x, y) := E[s(ξ(x), y)]. Following Cohen et al. (2019) for Gaussian smoothing, we add isotropic noise where the scale σ2 determines the amount of smoothing ˆs(x, y) = Eδ N(0,σ2I)[s(x + δ, y)]. For binary data, we use sparse smoothing (Bojchevski et al., 2020) and flip zeros and ones with probabilities p0 and p1 respectively: ˆs(x, y) = E[s(x δ, y)], where is the XOR and each Robust Yet Efficient Conformal Prediction Sets entry δ[i] Bernoulli (p = px[i]). See C for more details. Our approach works with other smoothing schemes such as uniform noise for l1 threat models (Levine & Feizi, 2021), but we focus on these two due to their popularity. Gaussian smoothing preserves exchangeability (Gendler et al., 2021). Similar argument applies to sparse smoothing and other methods that are symmetric w.r.t. xn+1 and Dcal. The goal is to bound the smooth score ˆs( x, y) of any adversarial x B(x). Since the base score function s( , ) often depends on a complex model such as a neural network, even computing the expected score ˆs( , ) is challenging, let alone finding the worst-case x. Therefore, we follow the general recipe of relaxing the problem by searching over the space of all possible score functions h( , ) H. We focus on upper bounds, but the entire discussion equivalently applies to lower bounds by switching from max to min. By definition we have s( , ) H, therefore it holds that: max x B(x) E[s(ξ( x), y)] max x B(x),h H E[h(ξ( x), y)] (6) The solution to Eq. 6 is trivial unless we add additional constraints to the functions h( , ) H that capture information about the actual score function s( , ). The tightness of the resulting bound is directly controlled by the constraints. First, we describe a baseline bound that only captures information about the mean of s( , ). This is exactly the bound used by RSCP. Then, we describe a second bound that leverages information about the entire distribution of scores via the CDF. In both cases, we only need black-box access to the score function and the underlying classifier, and we assume that s( , ) [a, b] is bounded (w.l.o.g. a = 0, b = 1). Canonical view. It turns out that for both Gaussian and sparse smoothing it is sufficient to derive a so-called pointwise bound for a given (x, x) pair since it can be shown that the maximum in Eq. 6 is always attained at a canonical x which is on the sphere of the respective ball. Namely, for the continuous Br(x) we have the canonical vectors x = 0, x = [r, 0, 0, . . . ] that completely specify the problem. For the binary Bra,rd we have the canonical x = [1, . . . , 1, 0, . . . , 0] and x = 1 x where x 0 = rd and x 0 = ra. Intuitively, the reason is due to the symmetry of the smoothing distributions and the balls (see C). Baseline bound. A straightforward approach only incorporates the expected smoothed score (mean) for the given input x. Let p = E[s(ξ(x), y)] for simplicity. With x B(x) the baseline upper-bound for ˆs( x, y) = E[s(ξ( x), y)] is determined by the following problem: smean(x, y) = max h H E[h(ξ( x), y)] s.t. 
$\mathbb{E}[h(\xi(x), y)] = p \quad (7)$

This bound discards a lot of information about the distribution of scores around the given x. To remedy this, we incorporate the information from the CDF of the scores.

CDF-based bound. Let $a = b_1 < b_2 < \dots < b_{m-1} < b_m = b$ be m real numbers that partition the output space. Let $p_i = \Pr[s(\xi(x), y) \le b_i]$. We define the problem:

$\bar{s}_\text{cdf}(x, y) = \max_{h \in \mathcal{H}} \ \mathbb{E}[h(\xi(\tilde{x}), y)] \quad \text{s.t.} \quad \forall b_i : \Pr[h(\xi(x), y) \le b_i] = p_i \quad (8)$

The key insight for solving Eq. 8 is to upper bound the mean of h via the CDF. Intuitively, we compute the probability of each bin $[b_j, b_{j+1}]$ and choose the upper end of the bin to get an upper bound. This can be rewritten in terms of the CDF. Let $F_h(b_j) = \Pr[h(x, y) \le b_j]$; then for any function h

$\mathbb{E}[h] \le \sum_{j=2}^{m} b_j \left[ F_h(b_j) - F_h(b_{j-1}) \right] = b_m - \sum_{j=1}^{m-1} F_h(b_j)\,(b_{j+1} - b_j)$

Next, we show how to solve both problems for the two different smoothing schemes. For Gaussian smoothing, both problems in Eq. 7 and Eq. 8 have closed-form solutions as shown by Kumar et al. (2020). For sparse smoothing, Bojchevski et al. (2020) provide an efficient algorithm to solve Eq. 7. We extend their approach to also solve Eq. 8, which is a novel contribution of potentially independent interest, e.g. to certify graph neural networks on regression tasks. In practice, $\bar{s}_\text{cdf}$ is tighter than $\bar{s}_\text{mean}$, and the improvement depends on the distribution of random scores. While we can easily combine both mean and CDF constraints to get a provably tighter bound, we focus only on CDF constraints.

Bounds for Gaussian smoothing. For any perturbed $\tilde{x}$ with $\|\tilde{x} - x\|_2 \le r$ we have the baseline bound

$\hat{s}(\tilde{x}, y) \le \bar{s}_\text{mean}(x, y) = \Phi_\sigma\!\left(\Phi_\sigma^{-1}(p) + r\right)$

where $\Phi_\sigma$ is the CDF of $\mathcal{N}(0, \sigma^2)$ and $p = \mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^2 I)}[s(x + \delta, y)]$ is the clean expected score. We can get the lower bound by flipping the sign of r. The CDF bound is $\hat{s}(\tilde{x}, y) \le \bar{s}_\text{cdf}(x, y)$ with

$\bar{s}_\text{cdf}(x, y) = b_m - \sum_{j=1}^{m-1} \Phi_\sigma\!\left(\Phi_\sigma^{-1}(p_j) - r\right)(b_{j+1} - b_j) \quad (10)$

where $p_j = \Pr_{\delta \sim \mathcal{N}(0, \sigma^2 I)}[s(x + \delta, y) \le b_j]$. The corresponding lower bound and derivations are in §D.2.

Bounds for sparse smoothing. To solve both optimization problems, we apply the same approach as Bojchevski et al. (2020), dividing the input space into regions of constant likelihood ratio $\mathcal{X} = \bigcup_{i=1}^{I} \mathcal{R}_i$ where $\mathcal{R}_i = \{z : \Pr[\xi(x) = z] / \Pr[\xi(\tilde{x}) = z] = c_i\}$. For the mean variant, we greedily distribute the p mass to each region (from the highest to the lowest ratio) until the constraint is satisfied. For the CDF variant, we instead distribute the $p_j$ masses in each region and each bin $[b_j, b_{j+1}]$. Technical details, including the linear programming formulations, are in §C. The runtime complexity scales linearly with the number of regions, which is $I = r_a + r_d + 1$. We provide an efficient algorithm that runs in less than a few milliseconds.

Clean vs. observed input. In the discussion we refer to a clean x and a perturbed $\tilde{x} \in B(x)$. In practice, we do not know whether the observed input is clean or perturbed. However, since the $l_2$-ball is symmetric, if $\tilde{x} \in B_r(x)$ then also $x \in B_r(\tilde{x})$. Thus, computing an upper bound for any observed $\tilde{x}$ in the threat model yields a valid upper bound for the clean x, i.e. $\hat{s}(x, y) \le \bar{s}(\tilde{x}, y)$. That is, we do not assume that the clean input is given at test time. For sparse data $\tilde{x} \in B_{r_a, r_d}(x) \Leftrightarrow x \in B_{r_d, r_a}(\tilde{x})$, so we need to switch $r_a$ and $r_d$ when computing the certificate. Similar conclusions apply for an observed and potentially perturbed $\tilde{D}_\text{cal}$, since the clean $D_\text{cal} \in B_{k,B}(\tilde{D}_\text{cal})$ for any $\tilde{D}_\text{cal} \in B_{k,B}(D_\text{cal})$. This detail is not important for standard certificates since they only certify that the prediction does not change.

5.
CAS: CDF-Aware Sets We use the CDF-based bounds to obtain conservative prediction sets for evasion and conservative thresholds for poisoning attacks. We summarize our approach with the pseudocode in Algorithm 1 that works with any score function3. Algorithm 1 CDF-Aware Sets (CAS, Evasion) qα = Quant (α; {ˆs(x, y))(x,y) Dcal} Clean quantile Compute scdf(x, y), e.g. with Eq. 10 Upper bound Return Cα = {y : scdf(x, y) qα} Conservative set Calibration-time variant. For evasion we need to compute s(x, y) via solving Eq. 8 (or Eq. 7) for each test point and each class. This can be computationally costly if we have many classes (e.g. Image Net has 1000) at deployment. We define an alternative approach that instead needs only a lower bound s(x, y) for each x Dcal and the true y. The key insight is that we can directly compare the smooth test score ˆs( xn+1, y) against a conservative (lower) quantile. Proposition 5.1. For xn+1 B(xn+1) and (xn+1, yn+1) exchangeable with Dcal, define qα = Quant (α; {s(xi, yi) : (xi, yi) Dcal}) (11) For prediction sets Cα( xn+1) = {y : ˆs( xn+1, y) qα} we have Pr[yn+1 Cα( xn+1)] 1 α. Moreover, the vanilla CP covers the true label with probability 1 β for β = Quant 1 (qα; {s(xi, yi) : (xi, yi) Dcal}) (12) and Quant 1 (t; A) = min {τ : Quant (τ ; A) t}. 3Our code and experiments are in the github repository soroushzargar/CAS. With Prop. 5.1 we need only |Dcal| certified bounds as a preprocessing step. At test time we directly plug in ˆs( xn+1, y) and not its upper bound. Since |Dcal| is often significantly smaller than the test set the computational savings are substantial (see Table 3). With Eq. 12 we can compute a lower bound on the coverage of vanilla (non-robust) CP under perturbation, where by definition 1 β 1 α. This is a generalization of Theorem 2 in Gendler et al. (2021). Poisoning. For poisoning attacks we simply use the conservative threshold qα from Eq. 3 or Eq. 5 where we use the CDF-bounds in the constraints (see 3.2). If the test examples are assumed clean we return Cα = {y : ˆs(x, y) qα}. Since robustness to evasion and poisoning are independent, we can achieve simultaneous robustness to both evasion and poisoning via Cα = {y : scdf(x, y) qα}. To solve the two poisoning optimization problems we rewrite them as mixed-integer linear programs and solve them with an off-the-shelf solver. We only need 2 |Dcal| binary variables for Eq. 3 and |Dcal| |Y| binary variables for Eq. 5. See G for technical details. Since the calibration set is relatively small we can solve the MILPs in just a few minutes. Thus, our guarantees are practically feasible. 6. Finite Sample Correction Solving Eq. 7, or Eq. 8 requires the true mean or CDF. Since exact computation is intractable, we use Monte-Carlo (MC) samples. To ensure a valid certificate, we bound the exact statistics via concentration inequalities. The resulting confidence intervals are valid together with adjustable 1 η probability. To account for this we calibrate with α = α η so that the final sets still have 1 α coverage (see E). RSCP did not include such finite-sample correction, and the resulting sets are only asymptotically valid without it. Yan et al. (2024) incorporates the correction directly in the conformity scores, leveraging exchangeability between MC-estimated calibration scores and clean test scores. We discuss this in E and propose another approach built on Prop. 5.1. Our correction results in smaller sets for CAS with the same guarantee; and similar results for RSCP (see 7). Proposition 6.1. 
Let scdf+(xi, yi) scdf(xi, yi) hold with 1 η/(2|Dcal|) probability for each (xi, yi) Dcal, and ˆs+( xn+1, yn+1) ˆs( xn+1, yn+1) hold with 1 η/(2|Y|) probability. Define the conservative qα+ = Quant α η; scdf+(xi, yi) : (xi, yi) Dcal and Cα+(xn+1) = {y : ˆs+(xn+1, y) qα+}. Then Pr[yn+1 Cα+( xn+1)] 1 α (13) We compute scdf+(xi, yi) by solving the minimization variant of Eq. 8 with CDF error correction through the Dvoretzky Kiefer Wolfowitz inequality (Dvoretzky et al., Robust Yet Efficient Conformal Prediction Sets clean perturbed Emp. Coverage clean perturbed Ave. Set Size 0.2 0.3 0.4 0.5 Radius r Emp. Coverage RSCP CAS 1 α: 0.85 0.9 0.95 Figure 1. Empirical coverage [left] and average set size [middle] of RSCP and CAS for clean and perturbed data. All sets are certified robust up to radius r = 0.125. [Right] Empirical coverage for different certified radii (on clean data). All results are for CIFAR-10 with Gaussian smoothing (σ = 0.25). CAS is less conservative since it is closer to the nominal 1 α, and has smaller sets. 1956). We define ˆs+( xn+1, y) = 1 ns Pns s(ξ(xn+1), y)+ϵ where ϵ is the error given by the Bernstein confidence interval. In short, we divide the η budget between |Dcal| + |Y| estimates. This divides between all calibration scores (only for the true class), and |Y| classes for the test input. The corrected CP in (Yan et al., 2024) compares qα,mc and s+(xn+1, y) + ϵhoef, where the quantile qα,mc is computed on the clean scores estimated with MC-sampling without correction. Instead, ϵhoef is added to account for the difference between the unseen clean test MC-score (exchangeable with the MC-calibration scores) and the upper bound which only bounds the true (non-MC) mean. See E for details. In our case, we compare the corrected quantile qα+ and the corrected estimate of the input test score ˆs+(xn+1, y). Instead of a Hoeffding bound we can use the tighter Bernstein bound for ˆs+(xn+1, y) since we have access to it. In addition, to compute qα+ we use DKW-corrected scores which introduce less error compared to the Hoeffding bound. Feature Poisoning. We find the lower bound quantile qα (Eq. 3) using smooth scores (see G, Eq. 24). To apply sample correction, again with an error budget of η we divide this budget equally between calibration points. For each calibration point, the CDF bound with correction finds a probabilistic lower bound on the clean smooth score. Since the test scores are computed with MC-estimation, we directly bound the MC-estimated clean calibration scores for exchangeability. Since we do not have access to the clean calibration scores, following Yan et al. (2024) we use Hoeffding s inequality. The corrected quantile is lower than the clean quantile for MC-estimated calibration scores. Proposition 6.2. Let scdf+( xi, yi) scdf( xi, yi) hold with 1 η/(2|Dcal|) probability for all ( xi, yi) Dcal. Let qα+ be the solution to Eq. 24 (Eq. 3 with CDF bounds) for α = α η with si = scdf+( xi, yi) ϵhoef. Then for each new test point xn+1 exchangeable with Dcal the prediction set defined as C(xn+1) = {y : smc(xn+1, y) qα+} has 1 α coverage. Here ϵhoef = p log(2/η)/2|Dcal| comes from the Hoeffding inequality. In label poisoning we do not use randomly smoothed scores, therefore sample correction is not needed. 7. Experiments For evasion, we compare CAS with RSCP (Gendler et al., 2021). Even though the original RSCP is not able to handle sparse or discrete data, we extend it and use it as an additional baseline (see C). There are no baselines for poisoning. 
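Stepping back to the finite-sample corrections of §6, they reduce to simple closed-form adjustments of the Monte-Carlo estimates. A sketch of the two ingredients is below: DKW for the calibration CDFs and a one-sided concentration bound for the test mean. The paper uses the tighter Bernstein bound for the latter; plain Hoeffding is shown here only to keep the sketch short, and the function names are illustrative.

```python
import numpy as np

def dkw_lower_cdf(mc_samples, bin_edges, eta):
    """Dvoretzky-Kiefer-Wolfowitz: with probability >= 1 - eta the true CDF at
    every bin edge is at least the empirical CDF minus eps."""
    eps = np.sqrt(np.log(2.0 / eta) / (2 * len(mc_samples)))
    empirical = np.array([(mc_samples <= b).mean() for b in bin_edges])
    return np.clip(empirical - eps, 0.0, 1.0)

def hoeffding_upper_mean(mc_samples, eta):
    """One-sided Hoeffding bound for scores in [0, 1]: the true mean is at most
    the empirical mean plus eps with probability >= 1 - eta."""
    eps = np.sqrt(np.log(1.0 / eta) / (2 * len(mc_samples)))
    return min(1.0, mc_samples.mean() + eps)
```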
Since both RSCP and CAS have the same guaranteed coverage we focus on two main metrics: the average size of prediction sets (or efficiency) and the empirical coverage. Ideally, we want the coverage to be concentrated around the nominal 1 α. Higher coverage costs larger prediction sets. In J we report additional experiments including the singleton hits ratio metric. We also consider the maximum perturbation radius such that robust CP has the same set size as standard CP (averaged across test points). This sizepreserving r is the largest certified radius which we can get for free . On all metrics CAS outperforms RSCP. Setup. We evaluate our method on two image datasets: CIFAR-10 (Krizhevsky, 2009) and Image Net (Deng et al., 2009), and one node-classification (graph) dataset Cora-ML (Mc Callum et al., 2004). We used Res Net-110 and Res Net-50 pretrained on CIFAR-10 and Image Net with noisy data augmentation from Cohen et al. (2019). We trained a GCN model (Kipf & Welling, 2017) for node classification. All models are trained on data augmented with noise. The GNN is trained with 20 nodes per class with stratified sampling as the training set and similarly sampled validation set. The size of the calibration set is between 100 and 150 (sparsely labeled setting). We use APS as the main score function. For each dataset, we pick a number of test points at random (900 for CIFAR-10, 400 for Image Net, and 2480 nodes for Cora). We estimate the expected smooth scores with Robust Yet Efficient Conformal Prediction Sets 0.0 0.2 0.4 0.6 Radius r Average Set Size 0.0 0.2 0.4 0.6 Radius r 0 1 2 3 Radius r 0.85 0.9 0.95 Figure 2. Average set size of CAS and RSCP under evasion for (from left to right) CIFAR-10, Image Net (with TPS), and Cora-ML. APS TPS Score Function Average Set Size 0.80 0.85 0.90 0.95 1 α 0.0 0.2 0.4 0.6 ϵ Average Set Size 0.12 0.25 0.5 Figure 3. [Left] Set size for r = 0.12 with different scores. [Middle] Maximum set size-preserving radius (average over test points). Both results are on CIFAR-10 dataset and σ = 0.25. [Right] The effect of smoothing parameter σ on the set size across a range of radii for CIFAR-10 dataset with error correction for 104 samples. 104 Monte-Carlo samples. All results are an average of 100 runs with exchangeable calibration sampling (details in J). Evasion Certificate. The conservative robust sets are necessarily larger than non-robust sets. Consequently, on Fig. 1 (left) we observe a higher empirical coverage on clean data compared to the nominal 1 α. The coverage on perturbed inputs which we find with a PGD attack (Madry et al., 2017) is above 1 α verifying our theory. In Fig. 1 (right) we see that the empirical coverage increases with the certified radius r and is 1 α for r = 0. CAS is less needlessly conservative (grows slower with r) than RSCP while still providing the same guarantee. This leads to improved efficiency (smaller sets) as shown in Fig. 1 (middle). The set size is slightly higher for perturbed inputs. In Fig. 2 we see that CAS s results in smaller prediction sets, across all radii, and all nominal 1 α values, and as in Fig. 3 (left) all scores. The improvement is substantial and also grows with r for larger radii it is doubled or even tripled, especially on Image Net and Cora-ML. Similarly, Fig. 3 (middle) shows that with CAS we can consistently certify a larger maximum radius for free . Calibration-time evasion. Following Prop. 
5.1 if we use vanilla (non-robust) CP, in the adversarial setup we can certify a lower bound 1 β on the worst-case robust cov- erage. In Fig. 4 (left) we see that the certificate based on CAS leads to a better (higher) lower bound. At the same time, Prop. 5.1 implies that we can avoid computing upper bounds for the test points and instead account for the effect of the adversary by choosing a conservative conformal threshold (qα) via the lower bound on the calibration scores. Fig. 4 (middle) show the set size distribution for test-time vs. calibration-time evasion. The results for RSCP are comparable. CAS shows smaller sets for the calibration-time certificate. This approach is also computationally faster especially for datasets with a high number of classes, which is discussed in B. Ablation study. In Fig. 3 (right) we study the effect of the smoothing strength as controlled by σ in N(0, σ2I). For all σ values and all radii r we get the same 1 α coverage guarantee, however, there is a clear trade-off for choosing σ. A smaller amount of smoothing results in a smaller set size in the beginning, but the set sizes grows rapidly by increasing the certified radius. In all cases, CAS is better than RSCP. Here for each σ we use the model that is pretrained on the same noise augmentation. Finite sample correction. The previous results were without error correction since RSCP did not account for finitesample errors when estimating the smooth scores with Robust Yet Efficient Conformal Prediction Sets 0.80 0.85 0.90 0.95 1.00 1 α Guaranteed Coverage Test-time Cal-time Approach Average Set Size 0.2 0.3 0.4 0.5 Radius r Average Set Size Cal-time Test-time Figure 4. [Left] Lower bound 1 β on the robust coverage of vanilla CP (Prop. 5.1). CAS certifies a larger lower bound. [Middle] Distribution of prediction set sizes using the slower test-time vs. the faster calibration-time evasion certificate. [Right] Set sizes for RSCP and CAS with account for finite sample error. All results are for CIFAR-10 with σ = 0.25. 0.25 0.50 0.75 ϵ Poisoned Cov. 0.25 0.50 0.75 ϵ Robust Cov. 0.80 0.85 0.90 0.95 1 α Ave. Set Size Figure 5. [Left] The coverage of vanilla CP under calibration set with poisoned features. [Middle] The result of robust CP on the same calibration set. [Right] The average set size of the CP robust to feature poisoning for a range of nominal coverages given the clean calibration data, and r = 0.12. All results are for CIFAR-10 dataset with σ = 0.25. Monte-Carlo samples. The sets are still asymptotically valid without correction, as confirmed by Fig. 1 (left); however, correction is necessary for a valid certificate as argued by Yan et al. (2024). In Fig. 4 (right) we see that the size for RSCP quickly explodes, reaching almost all classes (|Y| = 10) for large radii, while CAS maintains low average size. Moreover, CAS has smaller standard deviation across test inputs. CAS uses calibration-time correction (see E). Label poisoning. Next, we study label poisoning where now the attacker can perturb the ground-truth labels of the calibration points. In Table 1 we see that increasing the budget k leads to predictably larger set size and larger empirical coverage. The difference to the clean calibration set (k = 0) is minor, showing that provable label robustness comes almost for free for small k. While Einbinder et al. (2022) show that standard CP is already naturally robust to random (non-worst case) label noise, Table 1 shows that adversarial label noise can break the guarantee even for small budget k. 
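Returning to the calibration-time certificate evaluated above, Prop. 5.1 amounts to only a few lines. A sketch follows, assuming the certified lower bounds on the calibration scores have already been computed.

```python
import numpy as np

def calibration_time_threshold(cal_lower_bounds, alpha):
    """Prop. 5.1: calibrate on certified lower bounds of the smooth calibration
    scores; the resulting threshold is conservative for any perturbed test input."""
    n = len(cal_lower_bounds)
    k = int(np.floor(alpha * (n + 1)))
    return np.sort(cal_lower_bounds)[k - 1] if k >= 1 else -np.inf

def calibration_time_set(smooth_scores_test, q_alpha_lower):
    # At test time no per-point certification is needed: plain smooth scores
    # are compared against the conservative threshold.
    return np.where(smooth_scores_test >= q_alpha_lower)[0]
```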
Feature poisoning. Since there are no baselines that provide robust coverage guarantees under poisoning we can only study the behaviour of CAS. First, we consider feature poi- Table 1. Label poisoning for CIFAR-10. Cov. Vanilla Cov. Robust Cov. Set Size k (Clean) (Pert) (Pert) (Clean) 0 0.897 0.897 0.897 1.41 1 0.916 0.872 0.900 1.58 2 0.923 0.859 0.901 1.62 soning where the attacker is allowed to change k calibration points which we refer to as the budget, each of which can be perturbed in a given ball Br(x) (see Eq. 3). In Fig. 5 (left) we show that the coverage can slightly decrease via poisoning the features with a limited budget. This drop becomes significant when the adversary can perturb all the calibration points. To poison the data, we run the PGD attack on all calibration points and decide which point to perturb by solving Eq. 3 (specifically Eq. 24) with maximization goal. Fig. 5 (middle) shows the robustness of CAS even under an infinite budget which verifies Prop. 3.2. We also show the set size of robust CP in Fig. 5 (right). We see that as expected a smaller budget k leads to less conservative sets which translates to smaller set sizes. Interestingly, for small Robust Yet Efficient Conformal Prediction Sets Emp. Coverage Ave. Set Size r k With Without With Without 0.12 3 94.6 94.5 1.84 1.83 |Dcal| 97.7 96.3 3.17 2.47 0.25 3 94.0 94.0 1.756 1.752 |Dcal| 99.6 98.7 7.32 4.48 Table 2. CAS for feature poisoning with and without finite-sample correction. r (e.g. r = 0.12) even with an infinite budget the set size does not increase drastically. Making CP robust to poisoning comes at only a small cost. Note that for k = setting each calibration score to its lower bound is one solution to Eq. 3, which equals calibration-time evasion. Similar to evasion, in Table 2 we show the results of CAS robust to feature poisoning with and without sample correction. For sample correction, we use Prop. 6.2. Note that label-poisoning does not require sample correction. 8. Related Work Ghosh et al. (2023) introduce the notion of probabilistically robust CP. Intuitively, their guarantee is w.r.t. the average adversarial input, while for RSCP and our method the guarantee is w.r.t. the worst-case input. They produce more efficient sets via a quantile of quantiles method one quantile considers the adversarial examples around a datapoint and the other finds the CP threshold over the first set of quantiles. This enables a tuneable trade-off between nominal performance and robustness. Our method is orthogonal since we consider exact coverage, and Ghosh et al. (2023) s probabilistic robustness can be applied on top of ours. Cauchois et al. (2020) propose an approach which returns prediction sets that are robust to distribution shift between the calibration and the test distribution. As input, their method needs an upper bound ρ on the f-divergence between the two distributions, which they estimate from data. In principle, for a given radius r one can derive a suitable ρ, however, the resulting sets can be needlessly too conservative. We can conclude this from the fact that the optimization problem with the resulting f-divergence constraint is a relaxation as shown by Dvijotham et al. (2020) in a different context (classification certificates). Gendler et al. (2021) extensively discuss the differences between RSCP and Cauchois et al. (2020) s approach across various settings (e.g. model trained with and without noise) and report better or equal efficiency. 
With CAS outperforming RSCP, we draw similar conclusions by transitivity. Further discussion is in F.2. Two concurrent works use the same bound as RSCP, but improve the sets by modifying other aspects of the algorithm. Yan et al. (2024) adopt robust conformal training (Stutz et al., 2022) and propose to transform the smooth score (ranking + sigmoid scaling) using an additional holdout set. Kang et al. (2024) integrate a reasoning component via probabilistic circuits. Both are completely orthogonal to our method and can be directly improved with our CDF bounds. Angelopoulos et al. (2022) extend conformal prediction to control the expected value of any monotone loss function, including adversarial risk (see Proposition 7). However, they do not propose an algorithm to compute the worst-case adversarial loss. Einbinder et al. (2022) show that standard CP is already robust to random label noise, e.g. resulting from wrong annotation or any other natural source of noise. Unlike our work, they do not study robustness to adversarial (worst-case) label perturbations. 9. Conclusion We provide certified robustness for conformal prediction both for evasion and poisoning attacks. We propose a CDFaware bound on the conformity scores under adversarial perturbation. Our bound is empirically tighter and leads to consistent improvements compared to previous certificates. We further propose novel certificates against feature and/or label poisoning of the calibration set. We generalize both results to discrete and binary (sparse) data. Finally, we show how we can correct for finite-sample error. Our calibrationtime approach for robustness to evasion that reduces the inflation of set sizes when correcting for finite samples. Overall, our method CAS yields provably robust yet efficient (small) prediction sets. Limitations. We identify three main limitations. First, the coverage guarantee is marginal, which means that it holds on average across the entire input domain. Conditional coverage Pr[y C(x) | x] is impossible to achieve without strong assumptions (Barber et al., 2019). Achieving nearconditional coverage is still and open problem. This means that CP can have over-coverage or under-coverage for different groups which can be unfair. This holds true for both vanilla CP and robust CP. Lu et al. (2022) consider a groupconformal variant to equalize coverage across groups, however, unfairness can still be reflected in the set-size. Studying the intersection of robustness and fairness is an exciting future direction. Second, while randomized smoothing is a powerful and flexible method, estimating empirical statistics requires a large number of Monte-Carlo samples. This can be computationally expensive. Finally, we assumed that the goal of the attacker is to reduce the empirical coverage and designed our certificate to prevent this. However, the attacker may have other goals, e.g. to increase the set size, or to attack only a subset of labels. Robust Yet Efficient Conformal Prediction Sets Acknowledgements We thank Giuliana Thomanek for feedbacks on our draft. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D., Liu, L., Ghavamzadeh, M., Fieguth, P., Cao, X., Khosravi, A., Acharya, U. R., et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. 
Information fusion, 76:243 297, 2021. Angelopoulos, A. N. and Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. ar Xiv preprint ar Xiv:2107.07511, 2021. Angelopoulos, A. N., Bates, S., Fisch, A., Lei, L., and Schuster, T. Conformal risk control. Ar Xiv, abs/2208.02814, 2022. URL https://api.semanticscholar. org/Corpus ID:251320513. Angelopoulos, A. N., Bates, S., et al. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4):494 591, 2023. Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pp. 274 283. PMLR, 2018. Barber, R. F., Cand es, E. J., Ramdas, A., and Tibshirani, R. J. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 2019. Bojchevski, A., Gasteiger, J., and G unnemann, S. Efficient robustness certificates for discrete data: Sparsity-aware randomized smoothing for graphs, images and more. In International Conference on Machine Learning, pp. 1003 1013. PMLR, 2020. Cauchois, M., Gupta, S., Ali, A., and Duchi, J. C. Robust validation: Confident predictions even when distributions shift. ar Xiv preprint ar Xiv:2008.04267, 2020. Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adversarial robustness via randomized smoothing. In international conference on machine learning, pp. 1310 1320. PMLR, 2019. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248 255, 2009. URL https://api. semanticscholar.org/Corpus ID:57246310. Dvijotham, K., Hayes, J., Balle, B., Kolter, Z., Qin, C., Gy orgy, A., Xiao, K. Y., Gowal, S., and Kohli, P. A framework for robustness certification of smoothed classifiers using f-divergences. In International Conference on Learning Representations, 2020. URL https://api.semanticscholar. org/Corpus ID:213452491. Dvoretzky, A., Kiefer, J., and Wolfowitz, J. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pp. 642 669, 1956. Einbinder, B.-S., Bates, S., Angelopoulos, A. N., Gendler, A., and Romano, Y. Conformal prediction is robust to label noise. Ar Xiv, abs/2209.14295, 2022. URL https://api.semanticscholar. org/Corpus ID:262091979. Fey, M. and Lenssen, J. E. Fast graph representation learning with Py Torch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. Gendler, A., Weng, T.-W., Daniel, L., and Romano, Y. Adversarially robust conformal prediction. In International Conference on Learning Representations, 2021. Ghosh, S., Shi, Y., Belkhouja, T., Yan, Y., Doppa, J., and Jones, B. Probabilistically robust conformal prediction. In Evans, R. J. and Shpitser, I. (eds.), Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, volume 216 of Proceedings of Machine Learning Research, pp. 681 690. PMLR, 31 Jul 04 Aug 2023. URL https://proceedings.mlr.press/ v216/ghosh23a.html. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International conference on machine learning, pp. 1321 1330. PMLR, 2017. Huang, K., Jin, Y., Candes, E., and Leskovec, J. Uncertainty quantification over graph with conformalized graph neural networks. 
ar Xiv preprint ar Xiv:2305.14535, 2023. Kang, M., G urel, N. M., Li, L., and Li, B. COLEP: Certifiably robust learning-reasoning conformal prediction via probabilistic circuits. In The Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=XN6ZPINd Sg. Robust Yet Efficient Conformal Prediction Sets Kipf, T. and Welling, M. Semi-supervised classification with graph convolutional networks. Ar Xiv, abs/1609.02907, 2017. Krizhevsky, A. Learning multiple layers of features from tiny images. 2009. URL https://api. semanticscholar.org/Corpus ID:18268744. Kumar, A., Levine, A., Feizi, S., and Goldstein, T. Certifying confidence via randomized smoothing. Advances in Neural Information Processing Systems, 33:5165 5177, 2020. Lecuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE symposium on security and privacy (SP), pp. 656 672. IEEE, 2019. Lee, G.-H., Yuan, Y., Chang, S., and Jaakkola, T. Tight certificates of adversarial robustness for randomly smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019. Levine, A. and Feizi, S. Improved, deterministic smoothing for l1 certified robustness. 2021. Li, L., Xie, T., and Li, B. Sok: Certified robustness for deep neural networks. 2023. Lu, C., Lemay, A., Chang, K., H obel, K., and Kalpathy Cramer, J. Fair conformal predictors for applications in medical imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 12008 12016, 2022. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017. Mc Callum, A., Nigam, K., Rennie, J. D. M., and Seymore, K. Automating the construction of internet portals with machine learning. Information Retrieval, 3:127 163, 2004. Mujkanovic, F., Geisler, S., G unnemann, S., and Bojchevski, A. Are defenses for graph neural networks robust? Advances in Neural Information Processing Systems, 35: 8954 8968, 2022. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024 8035. Curran Associates, Inc., 2019. Romano, Y., Sesia, M., and Cand es, E. J. Classification with valid and adaptive coverage. ar Xiv: Methodology, 2020. Sadinle, M., Lei, J., and Wasserman, L. A. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114:223 234, 2018. Salman, H., Li, J., Razenshteyn, I., Zhang, P., Zhang, H., Bubeck, S., and Yang, G. Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019. Silva, S. H. and Najafirad, P. Opportunities and challenges in deep learning adversarial robustness: A survey. ar Xiv preprint ar Xiv:2007.00753, 2020. Stutz, D., Dvijotham, K. D., Cemgil, A. T., and Doucet, A. Learning optimal conformal classifiers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum? id=t8O-4LKFVx. Teng, J., Wen, C., Zhang, D., Bengio, Y., Gao, Y., and Yuan, Y. Predictive inference with feature conformal prediction. 
In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview. net/forum?id=0u Rm1Ym FTu. Vovk, V., Gammerman, A., and Shafer, G. Algorithmic learning in a random world. 2005. Yan, G., Romano, Y., and Weng, T.-W. Provably robust conformal prediction with improved efficiency. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/ forum?id=BWAh Ej Xje G. Yuan, X., He, P., Zhu, Q., and Li, X. Adversarial examples: Attacks and defenses for deep learning. IEEE transactions on neural networks and learning systems, 30(9): 2805 2824, 2019. Zargarbashi, S. H., Antonelli, S., and Bojchevski, A. Conformal prediction sets for graph neural networks. In Proceedings of the 40th International Conference on Machine Learning, 2023. URL https://openreview. net/forum?id=z Gf8J0b Nf X. Robust Yet Efficient Conformal Prediction Sets A. More On Conformal Prediction Conformity vs. non-conformity scores. As mentioned in 2, for CP we need to define a score function that quantifies the agreement between the input and each label. Equivalently, one can define CP with a non-conformity score function that captures disagreement instead. In this case, the conformal threshold is the 1 α quantile of the calibration true scores. Similarly, in the test time, labels with score less than the threshold are included in the prediction set. Both approaches are equivalent up to a change in the sign of the scores. The latter setup is used in (Gendler et al., 2021) and is equivalent to our implementation that uses conformity scores. Our choice of agreement score is due to simplicity. Score function. In 2 we mentioned that conformal prediction returns guaranteed sets regardless of the score function employed. Specifically, any score function maintaining the exchangeability (between calibration and test) is viable. In brief, the exchangeability of random variable Z1, . . . , Zn means that the joint distribution of the variables is insensitive to the order/index. In other words for any permutation function ψ : [n] 7 [n] we have Pr [Z1, . . . , Zn] = Pr Zψ(1), . . . , Zψ(n) . Assuming the calibration set to be exchangeably sampled from the data distribution, any permutation equivariant transformation on the data still preserves the exchangeability. Conclusively, the smooth scores from Gendler et al. (2021) and Bojchevski et al. (2020) are both permutation equivariant (the smoothing applies similarly to all calibration and test points regardless of their order). Therefore, smoothing scores maintains exchangeability. While any score function preserving the exchangeability maintains the conformal guarantee, better scores result in better performance with respect to the metric of interest. For instance, even a function that returns uniform conformity scores at random provides a valid guarantee, although the prediction sets will be large. Various score functions are proposed in the literature of conformal classification ranging from simple softmax function on top of model s result (Sadinle et al., 2018), to more complex functions leveraging information from embedding spaces of the model (Teng et al., 2023), or from the confidence of adjacent datapoints within a network structure (Zargarbashi et al., 2023). The expected score within the smoothing scheme around an input is no exception as it only involves the datapoint itself and applies symmetrically to all datapoints. Similar conditions hold for any approximation of that expectation e.g. 
the mean of Monte-Carlo samples. See B in Yan et al. (2024) for a longer discussion. Effect of the calibration set size. With a calibration set exchangeably sampled from the data distribution (infinite samples), conformal prediction provides a marginal coverage of at least 1 α (Eq. 1). This probability is also upper bounded by 1 α + 1/(n + 1). Precisely, the coverage is distributed as Beta(n + 1 l, l) with l = (n + 1)α . For a finite set of points and an exchangeably sampled calibration subset, e.g. transductive node-classification, Huang et al. (2023) show that the coverage probability, Cov(D) = (1/|D|) P (xi,yi) D 1[yi C(xi)] is distributed as Pr [Cov(D) t] = 1 ΦHG( α(n + 1) 1; M + N, N, Mt + α(n + 1) ) (14) Where M = |D|, N = |Dcal| is the size of the calibration set, and ΦHG(P, p, K) is the CDF function of hypergeometric distribution of population P, sample size p, and K successful samples within the population. This means that the coverage probability on standard CP is concentrated around 1 α. It also means that the variance around 1 α decrease as the size of Dcal increases. When moving the threshold from qα to any other value q within the domain of the score function (as in poisoning), the new threshold will correspond to another quantile β = Quant 1 (q ; Dcal) and the coverage will be similarly concentrated around 1 β. Access to a large calibration set (e.g. 1000 points) is unrealistic. Even with a large set of labeled points, there is an open question of whether to use a portion of it for training the model toward better accuracy which can help even in the efficiency of CP. While we ran our experiments with the sparse labeled setting, increasing the size of the calibration set will result in similar values on average but the results will be more concentrated following the distribution of conformal probability. Conservative coverage. Both RSCP and CAS result in an empirical coverage higher than 1 α for clean data. This is since the vanilla prediction set is a subset of their conservative prediction set. The empirical coverage for RSCP is even higher compared to CAS since it uses looser bounds on the score and the prediction sets are unnecessarily more conservative. Higher empirical coverage is gained by larger prediction sets; therefore the goal of Robust CP is to find conservative sets that cover the worst-case perturbed input with higher than 1 α probability but not by increasing the set size significantly. One-sided robust guarantee. Although CP comes with a two-sided coverage guarantee (upper and lower bound on the coverage probability), our robust coverage guarantee is one-sided we only guarantee that the coverage is larger than 1 α. The standard two-sided guarantee relies on exchangeability. However, since the adversary might perturb each point Robust Yet Efficient Conformal Prediction Sets differently, i.e. we have a non-symmetric mapping from clean x to perturbed x; therefore, the perturbed points are no longer exchangeable. Another strategy to obtain the second side, would be to compute max x B(x) s( x, y) which needs access to the clean test point. Given the difficulty, we leave computing two-sided guarantees for future work. A.1. Impelementation Details We based our implementation on Py Torch (Paszke et al., 2019) and Pytorch Geometric (Fey & Lenssen, 2019). We run all our experiments both on CPU (Intel(R) Xeon(R) Platinum 8368 CPU @ 2.40GHz) and, and on GPU (NVIDIA A100-SXM4-40GB). B. 
Faster Evasion-Robustness via Calibration-time Bound The evasion-robust CP algorithm (see 5) requires an estimation of the expected smooth score for (i) the true class for all calibration points, (ii) and all classes for each test point. Moreover, for the standard evasion-aware robustness, we need to additionally compute adversarial upper bounds (solutions to Eq. 8) within the threat model for all classes of all test points. This upper bound has a closed-form for continuous data, and an efficient algorithm for binary/discrete data (see C). Nonetheless, it can be beneficial to reduce the overall runtime. Let tbound be the time complexity for the upper bound computation for a single (x, y) and t MC be the time complexity of approximating the expected smooth score with M Monte-Carlo samples. With n calibration points and c classes, we need O(n t MC) time for calibration, including the quantile computation. Then, for each test point we need O(c t MC tbound) time. We define a computationally more efficient and robust alternative built upon Prop. 5.1 in which we offload the computational overhead from the test set to the calibration set. Prop. 5.1 gives a worst-case coverage lower bound for vanilla CP even if we evaluate vanilla CP with smooth (but not upper bounded) scores. Alternatively, we can find a conservative quantile that results in a certified 1 α coverage probability for the worst case input. We call this approach the faster evasion method. This method of producing prediction sets significantly reduces the computation in two ways: (i) instead of test points (which are larger in number), we compute the upper bounds on calibration points, (ii) instead of computing the upper bound for all classes, we only compute it for the true class. Thus, we need O(n t MC tbound) for calibration, and O(c t MC) for each test point. In practical scenarios where the test set (during deployment) is larger than the calibration set, the computational savings of the faster approach become significant, especially for tasks with a large number of classes (e.g. Image Net with 1000 classes). As shown in Table 3 we gain a significant speed up (more than 3X) on CIFAR-10 with 204 calibration points and just 100 test points. Here we have gains despite using a relatively tiny test set (even smaller than the calibration set) since we have c = 10 classes. Similar, and even better speed-ups can be achieved for datasets with a larger test set and larger number of classes. Table 3. Run-time comparison between test-time (slower) and calibration-time (faster) upper bound computation. The result is for CIFAR-10 with 104 number of Monte Carlo samples. Here, m is the number of test samples. Time (seconds) No. Datapoints Runtime Standard Evasion Robust Sets Faster Evasion Robust Sets Calibration 0.15 O(n t MC) 0.79 O(n t MC tb) 204 Testing 2.93 O(m c t MC tb) 0.15 O(m c t MC) 100 Total 3.08 0.94 C. Technical Details On Randomized Smoothing RSCP uses the closed-form solution to Eq. 7 as an upperbound on the score function within the L2 perturbation radius (details are in F). The same equation can be used to address other perturbation schemes (e.g. perturbations for sparse data). We use the results from Bojchevski et al. (2020) to find extend RSCP to sparse and discrete data and use it as a baseline. To apply randomized smoothing we need to define a smoothing scheme ξ( ) a probabilistic function that adds random noise to the input. Given any score function s, we define ˆs(x, y) = E[s(ξ(x), y)]. 
C. Technical Details on Randomized Smoothing

RSCP uses the closed-form solution of Eq. 7 as an upper bound on the score function within the l2 perturbation radius (details are in F). The same equation can be used to address other perturbation schemes (e.g. perturbations of sparse data). We use the results from Bojchevski et al. (2020) to extend RSCP to sparse and discrete data and use it as a baseline. To apply randomized smoothing we need to define a smoothing scheme ξ(·), a probabilistic function that adds random noise to the input. Given any score function s, we define ŝ(x, y) = E[s(ξ(x), y)]. Here Pr[ξ(x) = z] is the probability of visiting some z in the domain by smoothing from x. For continuous data we use Gaussian smoothing, where ξ(x) = x + δ with δ ∼ N(0, σ²I) drawn from an isotropic Gaussian distribution with zero mean and variance σ². We can compute the adversarial upper bounds using the closed-form expressions from Kumar et al. (2020) (see e.g. Eq. 10 in 4). For binary data, following Bojchevski et al. (2020), we use the following smoothing function:

Pr[ξ(x)[i] ≠ x[i]] = p_{x[i]}   (15)

This means that ξ toggles each 1-bit of x with probability p_1 and each 0-bit with probability p_0. This distinction allows us to preserve sparsity by specifying a lower p_0. Setting p_1 = p_0 = p we recover the special case of flipping each bit with the same probability p. Similarly, Bojchevski et al. (2020) generalize the binary case to the discrete case. Assuming that x ∈ X_K = {0, 1, . . . , K}^d, the sparsity-aware randomization scheme is defined as

Pr[ξ(x)_i = k] = (p_0 / (K − 1))^{1[x[i] ≠ k]} (1 − p_0)^{1[x[i] = k]}   if x[i] = 0
Pr[ξ(x)_i = k] = (p_1 / (K − 1))^{1[x[i] ≠ k]} (1 − p_1)^{1[x[i] = k]}   if x[i] ≠ 0   (16)

which flips any zero bit with probability p_0 and any non-zero bit with probability p_1 to any of the other (K − 1) possible values.

For the baseline bound we can rewrite Eq. 7 as a linear program by partitioning the input space X into regions of constant likelihood ratio (Lee et al., 2019). Let X = ∪_i R_i with R_i ∩ R_j = ∅ be a partitioning into disjoint regions of constant likelihood ratio, such that for every z ∈ R_i it holds that Pr[ξ(x) = z] / Pr[ξ(x̃) = z] = c_i for some constant c_i. Let t_i = Pr[ξ(x) ∈ R_i] and t̃_i = Pr[ξ(x̃) ∈ R_i] for each region R_i. Then Eq. 7 is equivalent to:

max_h h^⊤ t̃   s.t.   h^⊤ t = p,   0 ≤ h ≤ 1   (17)

where h ∈ [0, 1]^I is the vector that we optimize over, corresponding to the score function h ∈ H, t and t̃ are the vectors with t_i and t̃_i as elements, and I = r_a + r_d + 1 is the number of regions. Note that by replacing the constraint with a ≤ h ≤ b we can handle score functions that are bounded in [a, b]. The exact solution to this LP can be obtained with a simple algorithm. We visit each region in increasing order w.r.t. c_i (a ratio of powers of p_0, 1 − p_1, p_1, and 1 − p_0 that depends on i − r_d; see Bojchevski et al. (2020) for the closed form) and assign h_i = 1 for all regions R_i until the budget constraint is met, and h_i = 0 for the remaining regions, with the exception of the region in between, where h_i takes a value between 0 and 1 such that the equality constraint is met exactly. Since the likelihood ratios c_i are monotonic in i, the regions are automatically sorted, so the solution to the LP can be obtained in linear O(I) time. See Bojchevski et al. (2020) for more details and the pseudo-code.

For the CDF-based bound we can similarly rewrite Eq. 8 as the following linear program:

max_H b_m − d^⊤ H t̃   s.t.   H t = p,   0 ≤ H ≤ 1   (19)

where H ∈ [0, 1]^{(m−1)×I} is the matrix that we optimize over, with H_{ji} being the score assigned to the j-th bin and the i-th region, d is the vector of bin widths with d_j = b_j − b_{j−1}, and p is a vector with p_j = Pr[s(ξ(x), y) ≤ b_j]. Intuitively, for each bin and each region the worst-case score function h ∈ H assigns the same score to all z in that region since the likelihood ratio is constant. As before, we have a simple algorithm to obtain the exact solution to this LP. Observe that Eq. 19 can be decomposed into m − 1 separate LPs similar to Eq. 17, which can be solved in parallel using the same algorithm as above. The reason is that there is no interaction between the different bins (different rows of H) in either the constraint or the objective function. Therefore, the solution can be obtained in O(m · I) time with serial computation and O(I) with parallel computation.
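As an illustration, here is a minimal sketch of the greedy solution of the LP in Eq. 17, assuming the regions are already summarized by their likelihood ratios c_i and masses t_i, t̃_i (a simplified stand-in for the pseudo-code in Bojchevski et al. (2020)); the same routine can be applied independently to each row of H to solve Eq. 19.

```python
import numpy as np

def solve_region_lp(c, t, t_tilde, p):
    """Greedy solution of   max_h  h^T t_tilde   s.t.  h^T t = p,  0 <= h <= 1.
    c: likelihood ratios of the regions, t / t_tilde: region masses under xi(x) / xi(x_tilde),
    p: the (Monte-Carlo estimated) smooth score at the clean point."""
    c, t, t_tilde = (np.asarray(a, dtype=float) for a in (c, t, t_tilde))
    h = np.zeros_like(t)
    budget = p                              # remaining mass to spend on the constraint h^T t = p
    for i in np.argsort(c):                 # visit regions in increasing likelihood-ratio order
        if budget <= 0:
            break
        h[i] = min(1.0, budget / t[i]) if t[i] > 0 else 1.0
        budget -= h[i] * t[i]               # the last visited region gets a fractional value
    return float(h @ t_tilde)               # worst-case smoothed score at the perturbed point
```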
Tightness. All four bounds are tight, i.e. they cannot be improved unless we make additional assumptions or provide additional constraints. The reason is that there exists a base score function s such that, when relaxing s to h ∈ H, we get an equality in Eq. 6. See Kumar et al. (2020) for a discussion of why the two Gaussian bounds are tight when certifying the confidence of a classifier, and observe that their analysis immediately applies to our score functions. Similarly, the two discrete bounds are tight since there exists an s for which we obtain equality. Such an s can be constructed using the optimal h from the problem in Eq. 17, and similarly for Eq. 19.

D. Supplementary to Theoretical Support

D.1. Proofs

Proof of Prop. 3.1. Given the exchangeability of (x_{n+1}, y_{n+1}) with the calibration set, Eq. 1 holds for the clean point. Since x̃_{n+1} ∈ B(x_{n+1}), we have for all y_i: s̄(x̃_{n+1}, y_i) ≥ s(x_{n+1}, y_i). By the definition of CP, for any label y_i we have

y_i ∈ C(x_{n+1}) ⟹ s(x_{n+1}, y_i) ≥ q_α ⟹ s̄(x̃_{n+1}, y_i) ≥ s(x_{n+1}, y_i) ≥ q_α ⟹ y_i ∈ C̄(x̃_{n+1}),

hence C(x_{n+1}) ⊆ C̄(x̃), which clearly implies that Pr[y_{n+1} ∈ C̄(x̃)] ≥ Pr[y_{n+1} ∈ C(x)] ≥ 1 − α.

Proof of Prop. 3.2. By definition D_cal ∈ B_{k,B}(D_cal); therefore q_α is a feasible solution of Eq. 3 and we have q̲_α ≤ q_α. It follows that C_α(x) ⊆ C̲_α(x), where C_α(x) = {y_i : s(x, y_i) ≥ q_α} and C̲_α(x) = {y_i : s(x, y_i) ≥ q̲_α}. Since Pr[y_{n+1} ∈ C_α(x)] ≥ 1 − α due to exchangeability, it follows that Pr[y_{n+1} ∈ C̲_α(x)] ≥ 1 − α. In summary, the following chain of implications holds for any y_i:

y_i ∈ C(x_{n+1}) ⟹ s(x_{n+1}, y_i) ≥ q_α ≥ q̲_α ⟹ y_i ∈ C̲(x_{n+1}).

Proof of Prop. 5.1. Setting q̲_α = Quant(α; {s̲(x_i, y_i) : (x_i, y_i) ∈ D_cal}) we have:

Pr[ŝ(x̃_{n+1}, y_{n+1}) ≥ q̲_α] ≥ Pr[s̲(x_{n+1}, y_{n+1}) ≥ q̲_α]   (lower bound within the threat model)
≥ 1 − α   (exchangeability between the lower bounds)

Alternatively, vanilla CP is calibrated with the quantile q_α = Quant(α; {ŝ(x_i, y_i) : (x_i, y_i) ∈ D_cal}). The probability that a given, potentially perturbed, x̃_{n+1} is covered is:

Pr[y_{n+1} ∈ C_α(x̃_{n+1})] = Pr[ŝ(x̃_{n+1}, y_{n+1}) ≥ q_α]   (definition of CP)
≥ Pr[s̲(x_{n+1}, y_{n+1}) ≥ q_α]   (lower bound within the threat model)

Let β = Quant⁻¹(q_α; {s̲(x_i, y_i) : (x_i, y_i) ∈ D_cal}). If s̲ is computed symmetrically, i.e. it is invariant to the indices of the data, then s̲(x_{n+1}, y_{n+1}) and {s̲(x_i, y_i) : (x_i, y_i) ∈ D_cal} are exchangeable. Hence, via the quantile lemma we have:

Pr[s̲(x_{n+1}, y_{n+1}) ≥ q_α] ≥ 1 − β

Proof of Prop. 6.1. Since, for each calibration point, the event s_cdf+(x_i, y_i) ≤ s_cdf(x_i, y_i) fails with probability at most η/(2|D_cal|), the following holds with probability 1 − η/2 via the union bound:

q_{α+} := Quant(α − η; {s_cdf+(x_i, y_i)}_{(x_i,y_i)∈D_cal}) ≤ Quant(α − η/2; {s_cdf(x_i, y_i)}_{(x_i,y_i)∈D_cal}) =: q′_α

This is because every element s_cdf+(x_i, y_i) of the first set is lower than the corresponding element of the second set. Now, given the new test datapoint x_{n+1}, the new calibration scores {s_cdf(x_i, y_i) : (x_i, y_i) ∈ D_cal} and s_cdf(x_{n+1}, y_{n+1}) are exchangeable; as a result, for the corresponding clean point x_{n+1} we have Pr[s_cdf(x_{n+1}, y_{n+1}) ≥ q′_α] ≥ 1 − α + η. Therefore, we have the following chain of inequalities, where the label on each step denotes the probability with which it holds:

q_{α+} ≤ [w.p. 1 − η/2] q′_α ≤ [w.p. 1 − α + η] s_cdf(x_{n+1}, y_{n+1}) ≤ ŝ(x̃_{n+1}, y_{n+1}) ≤ [w.p. 1 − η/2] ŝ+(x̃_{n+1}, y_{n+1})

Summing up the failure probabilities of these inequalities yields the claimed bound on Pr[y_{n+1} ∈ C_{α+}] = Pr[ŝ+(x̃_{n+1}, y_{n+1}) ≥ q_{α+}].

Proof of Prop. 6.2. Since Eq. 24 (and Eq. 3) is a minimization problem, dropping the a_i ≤ s_i constraints does not change the optimal solution.
For each calibration point, we have:

s_mc(x_i, y_i) ≤ [w.p. 1 − η/(2|D_cal|)] ŝ(x_i, y_i) + ε_hoef ≤ s_cdf(x̃_i, y_i) + ε_hoef ≤ [w.p. 1 − η/(2|D_cal|)] s_cdf+(x̃_i, y_i) + ε_hoef

Thus s_cdf+(x̃_i, y_i) + ε_hoef ≥ s_mc(x_i, y_i) holds with probability 1 − η for all i via the union bound. It follows that q_{α+} ≤ q_{α,mc}, where q_{α,mc} is the α-quantile of the MC scores of the clean calibration set. Therefore, by exchangeability, we have Pr[s_mc(x_{n+1}, y_{n+1}) ≥ q_{α,mc}] ≥ 1 − α + η. Finally, since q_{α+} ≤ q_{α,mc} holds with probability at least 1 − η,

Pr[s_mc(x_{n+1}, y_{n+1}) ≥ q_{α+}] ≥ Pr[s_mc(x_{n+1}, y_{n+1}) ≥ q_{α,mc}] − η ≥ 1 − α + η − η.

Therefore, Pr[s_mc(x_{n+1}, y_{n+1}) ≥ q_{α+}] ≥ 1 − α.

D.2. Details on the l2 CDF Bounds

Rephrasing of Kumar et al. (2020). The upper bound in Eq. 10 is a rephrasing of Theorem 2 from Kumar et al. (2020). In the original version the bins are defined as a < c_1 ≤ c_2 ≤ · · · ≤ c_n < b. For a score function s and a (clean) input x, the statistic p_{c_j} is defined as p_{c_j} = Pr_{δ∼N(0,σ²I)}[s(x + δ, y) ≥ c_j]. Here we correct for finite-sample estimation via the Dvoretzky-Kiefer-Wolfowitz inequality. Given the detailed discussion on Monte-Carlo sample correction in E, we assume that the statistics are computed with the error correction. For the adversarial point x̃ ∈ B_r(x) we have the following upper bound:

ŝ(x̃, y) ≤ c_1 + (b − c_n) Φ_σ(Φ_σ⁻¹(p_{c_n}) + r) + Σ_{j=1}^{n−1} (c_{j+1} − c_j) Φ_σ(Φ_σ⁻¹(p_{c_j}) + r)   (20)

In Eq. 10 we rewrote the same inequality with a simpler notation. Here we show that the two inequalities are the same. Our bins are indexed as a = b_1 < b_2 ≤ b_3 ≤ · · · ≤ b_{m−1} < b_m = b; therefore, for the same number of bins (n = m − 2), there is an index mapping: for 1 < i < m we have c_{i−1} = b_i. Rewriting Eq. 20 with the new bins, we have:

ŝ(x̃, y) ≤ b_2 + (b_m − b_{m−1}) Φ_σ(Φ_σ⁻¹(p_{b_{m−1}}) + r) + Σ_{j=1}^{m−3} (b_{j+2} − b_{j+1}) Φ_σ(Φ_σ⁻¹(p_{b_{j+1}}) + r)
 = b_2 + Σ_{j=2}^{m−1} (b_{j+1} − b_j) Φ_σ(Φ_σ⁻¹(p_{b_j}) + r)

We write the upper bound in terms of the CDF, where p_j = Pr_{δ∼N(0,σ²I)}[s(x + δ, y) ≤ b_j], so that p_{b_j} = 1 − p_j. We use two properties of the Gaussian distribution: (i) for the CDF it holds that Φ_σ(−z) = 1 − Φ_σ(z), and (ii) for the quantile (inverse CDF) it holds that Φ_σ⁻¹(1 − z) = −Φ_σ⁻¹(z). Hence, we have

Φ_σ(Φ_σ⁻¹(p_{b_j}) + r) = Φ_σ(Φ_σ⁻¹(1 − p_j) + r) = Φ_σ(−Φ_σ⁻¹(p_j) + r) = 1 − Φ_σ(Φ_σ⁻¹(p_j) − r)

and therefore

b_2 + Σ_{j=2}^{m−1} (b_{j+1} − b_j) Φ_σ(Φ_σ⁻¹(p_{b_j}) + r)
 = b_2 + Σ_{j=2}^{m−1} (b_{j+1} − b_j) [1 − Φ_σ(Φ_σ⁻¹(p_j) − r)]
 = b_2 + Σ_{j=2}^{m−1} (b_{j+1} − b_j) − Σ_{j=2}^{m−1} (b_{j+1} − b_j) Φ_σ(Φ_σ⁻¹(p_j) − r)
 = b_m − Σ_{j=2}^{m−1} (b_{j+1} − b_j) Φ_σ(Φ_σ⁻¹(p_j) − r)

Intuitively, with a fixed set of bins, the mass of each bin can be bounded within B_r(x) independently (the bound for each bin is similar to the mean bound). Therefore, for the discrete empirical CDF of the scores around x, we first find a worst-case upper-bound CDF and then bound the mean via the Anderson inequality (Eq. 9) given that worst-case CDF.

Lower bounds within B_r(·). Similar to the mean upper bound from the Anderson inequality (Eq. 9), the mean can be lower bounded as:

Σ_{j=2}^{m} b_{j−1} [F_h(b_j) − F_h(b_{j−1})] = b_{m−1} − Σ_{j=2}^{m−1} F_h(b_j) (b_j − b_{j−1})   (21)

The lower and upper bounds are intuitive: they assume that every point within a bin [b_{j−1}, b_j) is equal to b_{j−1} for the lower bound and to b_j for the upper bound. The rest is just computing the average based on the relative frequencies F_h(b_j) − F_h(b_{j−1}). With that, the lower-bound version of Eq. 8 is

ŝ(x̃, y) ≥ s̲_cdf(x, y) = b_{m−1} − Σ_{j=2}^{m−1} Φ_σ(Φ_σ⁻¹(p_j) + r) (b_j − b_{j−1})   (22)
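To make the bin-based Gaussian bounds concrete, the following is a minimal numerical sketch of the upper bound b_m − Σ_j (b_{j+1} − b_j) Φ_σ(Φ_σ⁻¹(p_j) − r) and the lower bound in Eq. 22 (assuming the bins and the estimated CDF values p_j are given; the DKW correction is omitted).

```python
import numpy as np
from scipy.stats import norm

def gaussian_cdf_bounds(b, p, r, sigma):
    """CDF-based upper/lower bounds on the smooth score within an l2 ball of radius r.
    b: bin edges b_1 < ... < b_m; p[j]: estimate of Pr[s(x + delta, y) <= b_j] at the
    inner bins b_2 .. b_{m-1} (length m - 2), clipped away from {0, 1} for stability."""
    b = np.asarray(b, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    q = norm.ppf(p, scale=sigma)                                        # Phi_sigma^{-1}(p_j)
    upper = b[-1] - np.sum((b[2:] - b[1:-1]) * norm.cdf(q - r, scale=sigma))
    lower = b[-2] - np.sum((b[1:-1] - b[:-2]) * norm.cdf(q + r, scale=sigma))
    return lower, upper

bins = np.linspace(0.0, 1.0, 11)                # scores in [0, 1], m = 11 bin edges
p = np.linspace(0.05, 0.95, len(bins) - 2)      # toy CDF estimates at the inner bins
print(gaussian_cdf_bounds(bins, p, r=0.25, sigma=0.25))
```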
Here we derive the equality in Eq. 9, namely:

Σ_{j=2}^{m} b_j [F_h(b_j) − F_h(b_{j−1})] = b_m − Σ_{j=2}^{m−1} F_h(b_j) (b_{j+1} − b_j)

The lower bound can be derived in a similar way. We have

Σ_{j=2}^{m} b_j [F_h(b_j) − F_h(b_{j−1})]
 = b_2 [F_h(b_2) − F_h(b_1)] + b_3 [F_h(b_3) − F_h(b_2)] + · · · + b_m [F_h(b_m) − F_h(b_{m−1})]
 = −b_2 F_h(b_1) + [b_2 F_h(b_2) − b_3 F_h(b_2)] + · · · + [b_{m−1} F_h(b_{m−1}) − b_m F_h(b_{m−1})] + b_m F_h(b_m)

With F_h(b_1) = 0 and F_h(b_m) = 1 we have

Σ_{j=2}^{m} b_j [F_h(b_j) − F_h(b_{j−1})] = 0 + [−F_h(b_2)(b_3 − b_2)] + · · · + [−F_h(b_{m−1})(b_m − b_{m−1})] + b_m = b_m − Σ_{j=2}^{m−1} F_h(b_j) (b_{j+1} − b_j)

E. Estimating Expectations with Monte-Carlo Sampling

Concentration inequalities. For any random variable z, let z_1, . . . , z_m be Monte-Carlo samples of z. With E_m[z] = (1/m) Σ_{i=1}^{m} z_i, we bound the true expectation around the MC estimate via Hoeffding's inequality. For scores bounded in [0, 1], the following holds with any adjustable probability 1 − η:

|E[z] − E_m[z]| ≤ √(ln(2/η) / (2m))

This bound only requires the empirical mean and not the individual samples. Therefore, whenever we want to account for the distance between the empirical mean and the true expectation of an unknown variable, we can use this bound. An example is the test-time correction, where an upper bound on the mean of the unseen clean point is computed although its empirical mean is not computable (since there are no samples from it). Let σ²_m be the variance of the MC samples; then the empirical Bernstein inequality produces variance-dependent confidence intervals as follows:

|E[z] − E_m[z]| ≤ √(2 σ²_m ln(4/η) / m) + 7 ln(4/η) / (3(m − 1))

Similar to the mean, the empirical CDF is also bounded between an upper and a lower CDF via the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (Dvoretzky et al., 1956). Let F(b_i) = Pr[z ≤ b_i] and F_m(b_i) = (1/m) Σ_{j=1}^{m} 1[z_j ≤ b_i]; then, with probability 1 − η,

|F(b_i) − F_m(b_i)| ≤ √(ln(2/η) / (2m))

The above inequality holds simultaneously for all b_i. For Eq. 7 we use the Bernstein inequality as it has shown better empirical results compared to Hoeffding's inequality. For Eq. 8 we use the DKW inequality to find confidence intervals around the empirical CDF.

Error correction in Eq. 7 and Eq. 8. To find the upper (or lower) bound in Eq. 7, we need to estimate the mean of the smooth score around the input x. We use the mean corrected with the Bernstein confidence interval. For the upper-bound problem, we use the upper end of the interval since it is more conservative. The same logic applies to the lower bound.

Figure 6. Comparison of CAS and RSCP for faster (calibration-time) and test-time error correction (average set size vs. radius r).

For Eq. 8 we use the Dvoretzky-Kiefer-Wolfowitz inequality to find an upper (or lower) CDF. Since in Eq. 9 the CDF enters with a negative sign, the lower endpoint of the confidence interval should be used to obtain a conservative upper bound. Empirically, Bernstein's confidence intervals are tighter than Hoeffding's intervals. Therefore, we only use the Hoeffding error whenever we need a correction without access to the variance.

Test-time correction (Yan et al., 2024). The MC-sampled smooth score does not break exchangeability since the estimation is permutation invariant. This means that, given the clean input x_{n+1}, the estimated scores are exchangeable and the guarantee is valid without any error correction. However, given x̃, CAS and RSCP find bounds on the true mean. Given x̃, we compute s̄+ by solving either Eq. 7 (RSCP) or Eq. 8 (CAS) with the error-corrected estimate. For both methods, the following holds, where the label on each inequality denotes the probability with which it holds:

q_{α,mc} ≤ [w.p. 1 − α] ŝ_mc(x_{n+1}, y_{n+1}) ≤ [w.p. 1 − η_1] ŝ(x_{n+1}, y_{n+1}) + ε_hoef ≤ s̄(x̃_{n+1}, y_{n+1}) + ε_hoef ≤ [w.p. 1 − η_2] s̄+(x̃_{n+1}, y_{n+1}) + ε_hoef

By setting ᾱ = α + η_1 + η_2 we have a valid CP guarantee with certified 1 − ᾱ probability.
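The following is a small sketch of the three confidence intervals used above, for Monte-Carlo estimates of scores in [0, 1]; the constants follow the standard statements of these inequalities and may differ slightly from the ones used in our implementation.

```python
import numpy as np

def hoeffding_eps(m, eta):
    """Two-sided Hoeffding half-width for m samples of a [0, 1] variable."""
    return np.sqrt(np.log(2.0 / eta) / (2.0 * m))

def bernstein_eps(samples, eta):
    """Empirical-Bernstein half-width (Maurer & Pontil, 2009) for [0, 1] samples."""
    m = len(samples)
    var = np.var(samples, ddof=1)                       # unbiased sample variance
    return np.sqrt(2.0 * var * np.log(4.0 / eta) / m) + 7.0 * np.log(4.0 / eta) / (3.0 * (m - 1))

def dkw_band(samples, bins, eta):
    """Empirical CDF at the given bins together with the DKW confidence band."""
    m = len(samples)
    ecdf = np.array([(samples <= b).mean() for b in bins])
    eps = np.sqrt(np.log(2.0 / eta) / (2.0 * m))
    return np.clip(ecdf - eps, 0, 1), np.clip(ecdf + eps, 0, 1)
```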
Calibration-time vs test-time correction. As shown in Fig. 6, CAS benefits significantly from calibration-time robustness. The reason is that the CDF bound (Eq. 8) performs significantly better than the mean bound (Eq. 7) when the score distribution is more spread out. For distributions concentrated around an endpoint of the domain, the CDF has a high slope at that endpoint and is almost flat elsewhere; when using the DKW inequality, a large penalty is added to the distribution, resulting in larger CDF intervals. Meanwhile, for such distributions the mean bound can benefit from the Bernstein inequality, which performs even better due to the low variance. In calibration-time robustness, we lower bound the true-class scores (which are often more spread out), while at test time, unlikely classes whose scores are concentrated near 0 are bounded by a large value (due to DKW on concentrated scores), which directly affects the set size. In addition, Yan et al. (2024) add a Hoeffding error to the unseen clean score, whereas in our method we bound the estimate at the given input, which can use the Bernstein error. Since we are free to choose between test-time and calibration-time correction, and RSCP performs equally for both, we argue that calibration-time correction should be used as the default. For Fig. 4 we choose the best performance of each method under either calibration- or test-time robustness with error correction.

F. Details on RSCP

F.1. Equivalence Between RSCP and our Gaussian Baseline Bound

For a given score function s : X × Y → [0, 1] on continuous inputs, Gendler et al. (2021) define the new scoring function as follows:

s_rscp(x, y) = Φ⁻¹(E[s(ξ(x), y)])   (23)

RSCP computes the α-quantile q_α of the new calibration scores (Eq. 23) and compares each score with the modified threshold⁴ q̃_α = q_α − r/σ, where r is the radius of the l2 ball from the threat model and σ is the scale of the smoothing distribution. We can equivalently add an additional r/σ term to the test scores instead and compare the augmented score with the unchanged q_α. Using Φ_σ⁻¹(p) = σΦ⁻¹(p), a property of the inverse CDF of the Gaussian distribution, we have

Φ⁻¹(E[s(ξ_σ(x), y)]) ≤ Φ⁻¹(E[s(ξ_σ(x̃), y)]) + r/σ   ⟺   Φ_σ⁻¹(E[s(ξ_σ(x), y)]) ≤ Φ_σ⁻¹(E[s(ξ_σ(x̃), y)]) + r

Since the CDF is a monotonically increasing function, we can apply Φ_σ to both sides of the inequality:

Φ_σ(Φ_σ⁻¹(E[s(ξ_σ(x), y)])) ≤ Φ_σ(Φ_σ⁻¹(E[s(ξ_σ(x̃), y)]) + r)   ⟺   E[s(ξ_σ(x), y)] ≤ Φ_σ(Φ_σ⁻¹(E[s(ξ_σ(x̃), y)]) + r)

Substituting p = E[s(ξ(x̃), y)] we see that this is equivalent to the Gaussian s̄_mean upper bound defined in 4.
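As an illustration of this equivalence, here is a minimal sketch (with hypothetical arrays of smoothed probabilities standing in for the Monte-Carlo estimates of E[s(ξ(x), y)]; finite-sample corrections omitted) showing that shifting the quantile by r/σ and augmenting the test scores by r/σ produce identical sets.

```python
import numpy as np
from scipy.stats import norm

def rscp_sets(cal_smooth, cal_labels, test_smooth, alpha, r, sigma):
    """RSCP-style prediction sets from smoothed probabilities (Eq. 23); the two
    equivalent variants (shifting the quantile vs. augmenting the test scores)."""
    eps = 1e-6
    cal_p = np.clip(cal_smooth[np.arange(len(cal_labels)), cal_labels], eps, 1 - eps)
    q = np.quantile(norm.ppf(cal_p), alpha)                      # alpha-quantile of cal scores
    test_scores = norm.ppf(np.clip(test_smooth, eps, 1 - eps))   # (n_test, c)
    sets_shift_q = [np.where(s >= q - r / sigma)[0] for s in test_scores]
    sets_shift_s = [np.where(s + r / sigma >= q)[0] for s in test_scores]
    assert all(np.array_equal(a, b) for a, b in zip(sets_shift_q, sets_shift_s))
    return sets_shift_q
```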
F.2. Comparison with Cauchois et al. (2020)

Cauchois et al. (2020) derive robust prediction sets when the f-divergence between the test distribution and the calibration distribution of the non-conformity scores is bounded by a fixed value ρ. We can connect their approach to our definition of adversarial robustness using the results of Dvijotham et al. (2020). Specifically, we can rewrite the optimization problem max_{x̃∈B_r(x)} E[s(ξ(x̃), y)] over the ball B_r(x) as the optimization problem max_{ν∈P} E_{z∼ν}[s(z, y)] over the space of probability measures P = {ξ(x̃) | x̃ ∈ B_r(x)}. Since this set is intractable, we can relax the problem using the fact that P ⊆ {ν : D_f(ν‖ξ) ≤ ρ_f^r} for an appropriately chosen ρ_f^r, where D_f(ν‖ξ) is the f-divergence between the smoothing distribution ν centered at a perturbed example and the smoothing distribution ξ centered at the clean example. See Dvijotham et al. (2020) for a derivation of the optimal ρ_f^r for different divergence functions f and different smoothing distributions. Thus, for smooth scores there is a direct connection between RSCP, CAS, and the method of Cauchois et al. (2020). Importantly, however, for most choices of f (e.g. the KL divergence) the relaxation results in a looser (though potentially easier to compute) bound. The analysis in Dvijotham et al. (2020) was developed for classification problems, but it directly applies to our setting as well. They show that we need to use hockey-stick divergences with the right parameters to obtain tight certificates. Specifically, for Gaussian smoothing and an l2 norm the result is equivalent to the tight certificate from Cohen et al. (2019). Disregarding that hockey-stick divergences are harder to estimate in general, this means that in the best case the approach of Cauchois et al. (2020) can recover the baseline s̄_mean(x̃, y), which we have shown is looser than our s̄_cdf(x̃, y).

⁴ Originally RSCP shifts the quantile forward, q̃_α = q_α + r/σ, since it is defined in the non-conformity setup where scores lower than the quantile are accepted. Here, since we use conformity (agreement) scores and the acceptance criterion is being larger than q_α, we shift the quantile backward. The two setups are equivalent up to changing the sign of the scores (see A).

G. Technical Details on Poisoning Certificates

Feature poisoning. The solution of the optimization problem in Eq. 3 is robust to feature poisoning; however, the problem is hard to solve since (i) we need to optimize over each z_i in B(x̃_i), (ii) it involves a quantile computation, and (iii) it has a cardinality constraint expressed as a sum of indicator functions. Therefore, we relax the problem to a MILP which can be solved with standard solvers. First, we replace each z_i ∈ B(x̃_i) constraint with a constraint s̲_i ≤ s_i ≤ s̄_i directly over the scores s_i, where the lower and upper bounds are computed as discussed in 4. This is a sound relaxation, and the optimal q̲_α of the relaxed problem is smaller than or equal to the q̲_α of the original problem. Then, we introduce |D_cal| binary variables to compute the α-quantile, and an additional |D_cal| binary variables to enforce the perturbation budget. The resulting MILP is:

q̲_α = min_{s_i, q} q
s.t.  ∀i : s̲_i ≤ s_i ≤ s̄_i,
      z_i := 1[s_i ≤ q],  Σ_{i=1}^{n} z_i ≥ αn,  Σ_{i=1}^{n} (1 − z_i) ≥ (1 − α)n,
      b_i := 1[s_i ≠ s(x̃_i, y_i)],  Σ_{i=1}^{n} b_i ≤ k   (24)

In Eq. 24, the z_i variables indicate whether a calibration point is below or above the α-quantile q, and the b_i variables indicate whether the point is perturbed or not. We use the standard big-M technique to translate this into a canonical form, which we solve with MOSEK.

Label poisoning. We can directly rewrite Eq. 5 as a MILP without any relaxations. Let S be an n × c matrix (n = |D_cal|) of scores for each class and each calibration point, where c is the number of classes. We have

q̲_α = min_{q, C∈{0,1}^{n×c}} q
s.t.  C 1_{c×1} = 1_{n×1},
      r = (C ⊙ S) 1_{c×1},
      Σ_i C[i, y_i] ≥ n − k,
      z_i := 1[r_i ≤ q],  Σ_{i=1}^{n} z_i ≥ αn,  Σ_{i=1}^{n} (1 − z_i) ≥ (1 − α)n

where the binary one-hot matrix C selects one score per calibration point (i.e. one of the c possible labels), r is the resulting vector of chosen scores, and the z_i variables implement the quantile as before.

Complexity. Note that while solving MILPs is computationally expensive in general, our calibration sets are relatively small, so we can still obtain the exact solution in reasonable wall-clock time. We leave deriving more efficient algorithms for the feature and label poisoning problems as future work.
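For illustration, here is a minimal cvxpy sketch of the label-poisoning MILP above; it is a simplified stand-in for the MOSEK-based implementation (the big-M encoding of the indicator z_i = 1[r_i ≤ q], the rounding of the quantile constraint, and the dropped upper quantile constraint are assumptions of this sketch).

```python
import numpy as np
import cvxpy as cp

def robust_quantile_label_poisoning(S, y, alpha, k, M=1.0):
    """Conservative quantile when the adversary may flip at most k calibration labels.
    Scores are assumed to lie in [0, 1], so M = 1 is a valid big-M constant."""
    n, c = S.shape
    Y = np.zeros((n, c))
    Y[np.arange(n), y] = 1.0                                 # one-hot observed labels

    q = cp.Variable()
    C = cp.Variable((n, c), boolean=True)                    # adversarial label choice
    z = cp.Variable(n, boolean=True)                         # z_i ~ 1[r_i <= q]
    r = cp.sum(cp.multiply(C, S), axis=1)                    # selected score per point

    cons = [
        q >= 0, q <= 1,
        cp.sum(C, axis=1) == 1,                              # exactly one label per point
        cp.sum(cp.multiply(C, Y)) >= n - k,                  # at most k labels flipped
        r - q <= M * (1 - z),                                # z_i = 1  =>  r_i <= q
        q - r <= M * z,                                      # z_i = 0  =>  r_i >= q
        cp.sum(z) >= int(np.ceil(alpha * n)),                # q reaches the alpha-quantile
    ]
    prob = cp.Problem(cp.Minimize(q), cons)
    prob.solve()                                             # needs a MILP-capable solver
    return q.value
```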
H. Robustness to Poisoning and Evasion Attacks Combined

In 3.2 we make CP robust to poisoning (in the feature or the label domain) by finding a conservative threshold q̂ such that, under the most adverse attack (within the defined budget and threat model), the coverage probability remains above 1 − α. Once this threshold is found, we can treat the calibrated quantile as safely satisfying the guarantee on clean test points, i.e. we can proceed as if CP had been calibrated on clean calibration data. Formally, the solution to Eq. 3 and Eq. 5 is a threshold with which the prediction sets constructed for a clean x have a coverage probability larger than 1 − α. When making CP robust to evasion, we only consider the confidence interval of the score of the clean test point given the potentially perturbed point x̃. This process only involves computing upper bounds at the given test point and is hence independent of the prior robustness to poisoning. In other words, the resulting conservative prediction set includes the prediction set of the clean datapoint: C(x) ⊆ C̄(x̃).

Figure 7. Singleton hit ratio of CAS and RSCP under evasion for (from left to right) CIFAR-10 with APS, ImageNet with TPS, and Cora with APS.

This shows that we can make CP robust to poisoning and evasion attacks at the same time. However, this combined robustness comes at the price of comparably larger prediction sets. The robust threshold q̲ is smaller than q_α, which allows more labels to be included in the prediction sets. At the same time, for each test point, the upper-bounded scores again increase the probability that a label is included in the prediction set. So there are two conservative processes, each increasing the chance of accepting a label, which increases the expected set size.

I. Time and Space Complexity

Our robust CP approach breaks down into four computations: (i) computing the score function, (ii) estimating expectations for randomized smoothing (in practice, MC sampling and computing the confidence intervals), (iii) computing upper bounds, and (iv) the standard CP steps, including calibration and constructing prediction sets. Here we omit the time complexity of the model itself; with black-box access, we assume the model's prediction of the logits takes O(1) steps. The computation of the conformity score depends on the choice of this function. TPS takes O(K) steps (K is the number of classes) to compute the categorical distribution via the softmax function. The APS score function takes an additional O(K log K) steps to sort the class probabilities and compute the cumulative sums of confidences (see 2 for the definition of the score function). This additional sort can become time-consuming for datasets with a large number of classes (like ImageNet). For simplicity, we denote the cost of the score function by t_s. The standard CP procedures are calibration and constructing prediction sets. Given n calibration scores, finding the calibrated quantile takes O(n) steps (as in median computation via selection), and a prediction set takes O(K) steps to construct for each test input. All time complexities are reported w.r.t. serial computation; with a sufficient number of parallel processing cores, all of the above computations can be performed in a correspondingly lower number of steps.
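To make the score-function cost concrete, here is a minimal sketch of TPS and of an APS-style conformity score computed from softmax probabilities (a common formulation; the randomization over ties used in some APS variants is omitted).

```python
import numpy as np

def tps_scores(probs):
    """TPS: the conformity score of each class is its softmax probability, O(K)."""
    return probs                                         # shape (K,)

def aps_scores(probs):
    """APS-style conformity scores: one minus the cumulative probability mass of all
    classes ranked at or above each class, O(K log K) due to the sort."""
    order = np.argsort(-probs)                           # classes from most to least likely
    cum = np.cumsum(probs[order])                        # cumulative mass along the ranking
    scores = np.empty_like(probs)
    scores[order] = 1.0 - cum                            # higher score = more conforming
    return scores
```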
In randomized smoothing we need to estimate the expected score under the smoothing scheme. For that, we use Monte-Carlo sampling, which takes O(N · M) steps to compute the means of M Monte-Carlo samples, where N is the total number of datapoints. Given the Monte-Carlo samples, each upper and lower bound requires solving an optimization problem. The optimal value is found via a closed-form solution for Gaussian smoothing. Given S bins, computing the binary (and discrete) CDF bound takes O(S · R) time, where R is the number of regions of constant likelihood ratio and R = r_a + r_d + 1. We refer to the computation time of the bound as t_b. As a result, in the evasion setup we take O(N · M) additional steps for calibration on smooth scores and O(M · K + K · t_b) steps for constructing the prediction sets. We also proposed a faster way to provide robust prediction sets in B. There, we compute a bound for each calibration datapoint, but only for the true class. For any given test point we only compute smooth scores, which in total reduces the test-time computation to O(M · K) (per test datapoint) and increases the calibration-time complexity to O(N · t_b). This procedure decreases the total number of steps. Table 3 compares the runtime of both approaches for a limited number of calibration and test points.

For poisoning in the feature space, we first compute the upper and lower bounds for each calibration point, which takes O(N · M + N · t_b) steps; here we only compute the bounds for the true label. We then solve a mixed-integer linear program, which is computationally hard. We apply tricks like the big-M method to bring the problem into a form that standard solvers can handle. Similarly, the label poisoning problem is hard and involves ILP solvers, but there we do not need to compute bounds on the scores since the perturbations are in the label domain.

J. Supplementary to Experiments

J.1. Details on the Experiments in the Manuscript

In our core experiments, we utilized a ResNet-110 model pre-trained on the CIFAR-10 dataset and a ResNet-50 model pre-trained on the ImageNet dataset. Both models were trained with noisy training, i.e. Gaussian data augmentation across various noise variances, as proposed by Lecuyer et al. (2019) and later used by Cohen et al. (2019) for randomized smoothing. Details of the model training and augmentation processes are given in Cohen et al. (2019); Salman et al. (2019). For evaluation, we employed an l2 smoothing paradigm and applied various noise levels, identifying the model that delivered the best performance based on the findings of Cohen et al. (2019). On the CIFAR-10 dataset we used a skip parameter and ran the experiments on between 1000 and 2000 samples. Similarly, 500 data points were used from ImageNet by sampling every 100th image. The noise variance settings were σ = 0.25 for CIFAR-10 and σ = 0.5 for ImageNet. During Monte-Carlo sampling, each datapoint was processed through 10⁴ iterations to calculate the expected probability or mean.
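For reference, here is a minimal sketch of this Monte-Carlo estimation of smooth scores under Gaussian noise; `model` and `score_fn` are hypothetical stand-ins for the classifier and the conformity score, and the batching is simplified.

```python
import torch

def smooth_scores(model, score_fn, x, sigma=0.25, n_samples=10_000, batch=500):
    """Estimate E[s(x + delta, y)] for all classes y with delta ~ N(0, sigma^2 I)."""
    model.eval()
    total, done = None, 0
    with torch.no_grad():
        while done < n_samples:
            b = min(batch, n_samples - done)
            noisy = x.unsqueeze(0) + sigma * torch.randn(b, *x.shape)
            scores = score_fn(model(noisy))              # shape (b, num_classes)
            total = scores.sum(0) if total is None else total + scores.sum(0)
            done += b
    return total / n_samples                             # Monte-Carlo mean per class
```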
For our experiments on the Cora-ML dataset, we utilized a two-layer GCN with 64 hidden units. Following Bojchevski et al. (2020), our training procedure incorporated randomized perturbations of the node features. Specifically, we used a perturbation addition probability (p+) of 0.01 and a deletion probability (p−) of 0.6. For training, we used 20 node labels per class and a similar number of nodes for validation, and we trained for 1,000 epochs. The remaining portion of the dataset was set aside for evaluation purposes. In our conformal prediction strategy, we adopted the split conformal method. To account for the effect of randomness in the calibration set sampling, we report our results as means with confidence bounds over 100 calibration samplings. Moreover, we discuss results for adversarial cases. For these attacks, we employed the projected gradient descent (PGD) attack (Madry et al., 2017), using a step size (the PGD α) of 0.1 over 40 iterations. The attack outcomes, constrained in l2 distance from the original image, are presented for r = 0.125.

Singleton hit ratio. This metric quantifies the proportion of correct singleton predictions, which can be used without any further post-processing. Similar to the prediction set size, Fig. 7 shows that CAS outperforms RSCP on all datasets.

Proportion of Empty, Singleton, and Multi-sets. While we report the average set size (similar to many other studies in CP), a CP method might misleadingly appear more efficient by returning more empty sets. An alternative metric is therefore to report the average size of non-empty sets only. In Fig. 8 we report the proportion of empty, singleton, and multi-element prediction sets for various radii. In vanilla CP, as we increase the 1 − α guarantee to higher values, CP adds more elements to the prediction sets to satisfy the increased coverage guarantee. Since there are almost no empty prediction sets across the considered values of α and r, the effective and the average set size are the same.

Different Score Functions. As mentioned in 2 (and extensively in A), the coverage guarantee of vanilla CP and of the robust methods defined on top of it (including RSCP and CAS) is agnostic to the score function, leaving the freedom to choose the score based on the application domain. Here we support this argument empirically. Fig. 9 compares RSCP with CAS applied to the TPS and APS score functions. For all scores and all metrics, CAS shows improved results.

Figure 8. The proportion of singleton, empty, and multi-element sets for RSCP and CAS across radii (left) r = 0, (middle) r = 0.12, and (right) r = 0.25 (x-axis: 1 − α).

K. Notations and Definition Guide

For a complete guide to all notations used in the paper, see Table 4.

Figure 9. Comparison of RSCP and CAS for smooth APS and TPS scores across various radii: (left column) empirical coverage, (middle column) effective set size, and (right column) singleton hit rate. From the upper to the lower row, the results are for r = 0, r = 0.12, and r = 0.25, respectively. All results are for CIFAR-10 with smoothing σ = 0.25.

Notation — Description
x — The clean input.
(x_i, y_i) — The clean input alongside its true label.
D_cal — Clean calibration set: a set of labeled datapoints whose labels are unseen by the model during training. Precisely, the conformity score (e.g. model softmax) is exchangeable between the elements of this set and the test set.
x̃ — The input perturbed by the adversary.
(x_i, ỹ_i) — The clean input alongside a label that is potentially flipped by the adversary.
D̃_cal — The poisoned calibration set: given the clean D_cal, the adversary returns a set where, under the threat model, either features are perturbed or labels are flipped (or both).
B(·) — Point-level threat model: the set of all allowed perturbations w.r.t. the clean point, e.g. all points that are closer than r in l2 distance.
B_{k,B}(D) — Set-level threat model: the set of all allowed perturbations of an input set, e.g. CP's calibration set. As an example, the set of all perturbed sets where the adversary has changed at most k points, each within a point-level threat model.
s(·, ·) — Conformity score function originally defined for vanilla CP.
s_rscp(·, ·) — Scores defined by Gendler et al. (2021).
s̄(·, ·), s̲(·, ·) — Upper and lower bounds for a given score function s(·, ·) within the specified threat model.
q_α — Conformal quantile computed by CP on the clean calibration set with nominal coverage probability 1 − α.
q̃_α — Adversarial conformal quantile; the quantile of a calibration set poisoned by the adversary. This quantile is expected to result in lower coverage than q_α.
q̲_α — Conservative lower bound on the adversarial quantile; computed by the defender to return robust prediction sets.
C_α(·) — Prediction set of vanilla CP with 1 − α nominal coverage.
C̄_α(·) — Prediction set of robust CP with 1 − α nominal coverage. Depending on the attack scenario (evasion or poisoning), this set is robust to perturbations within the threat model.
ŝ(·, ·) — The smooth score for the input x: the expectation of the score under a predefined randomized smoothing framework.
s̄_mean(·, ·) — The upper-bound score calculated by solving Eq. 7, which only has the mean similarity constraint.
s̄_cdf(·, ·) — The upper-bound score calculated by solving Eq. 8, which only has the CDF similarity constraint.

Table 4. Table of notations used in the paper.