# Data Poisoning Attacks against Conformal Prediction

Yangyi Li*¹, Aobo Chen*¹, Wei Qian¹, Chenxu Zhao¹, Divya Lidder¹, Mengdi Huai¹

Efficient and theoretically sound uncertainty quantification is crucial for building trust in deep learning models. This has spurred a growing interest in conformal prediction (CP), a powerful technique that provides a model-agnostic and distribution-free method for obtaining conformal prediction sets with theoretical guarantees. However, the vulnerabilities of such CP methods with regard to dedicated data poisoning attacks have not been studied previously. To bridge this gap, we propose in this paper, for the first time, a new class of black-box data poisoning attacks against CP, where the adversary aims to cause desired manipulations of some specific examples' prediction uncertainty results (instead of misclassifications). Additionally, we design novel optimization frameworks for our proposed attacks. Further, we conduct extensive experiments to validate the effectiveness of our attacks in various settings (e.g., the full and split CP settings). Notably, our extensive experiments show that our attacks are more effective in manipulating uncertainty results than traditional poisoning attacks that aim at inducing misclassifications, and that existing defenses against conventional attacks are ineffective against our proposed attacks.

*Equal contribution. ¹Department of Computer Science, Iowa State University, United States. Correspondence to: Mengdi Huai. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Deep Neural Networks (DNNs) have achieved remarkable success in recent years. Although deep learning models work well in numerous fields, deploying such models in real-world applications often requires appropriately quantifying the uncertainty of their predictions. To tackle uncertainty issues, people have developed different uncertainty quantification techniques, including Bayesian neural networks (Trinh et al., 2022; Hobbhahn et al., 2022).

Among different uncertainty quantification techniques, conformal prediction (CP), pioneered by Vovk et al. (2005), has become a popular distribution-free technique to perform uncertainty quantification (Ndiaye, 2022; Fisch et al., 2022; Stutz et al., 2021; Fisch et al., 2021; Qian et al., 2024). The model-agnostic and distribution-free nature of CP makes it particularly suitable for large neural networks. Concretely, we are mainly interested in a conformal set prediction setting where we are given $n$ examples $(X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, as calibration data, drawn exchangeably from some underlying distribution $P$ (Humbert et al., 2023; Fisch et al., 2022; Lin et al., 2022; Teng et al., 2022). Let $X_{n+1} \in \mathcal{X}$ be a new exchangeable test example for which we would like to predict $Y_{n+1} = f(X_{n+1}; \theta) \in \mathcal{Y}$, where $\theta \in \Theta$ is a well-trained model. CP aims to construct a conformal prediction set, i.e., $C_{\varepsilon}(X_{n+1}; \theta)$, that contains $Y_{n+1}$ with marginal coverage at a significance level $\varepsilon \in (0, 1)$, i.e.,

$$\mathbb{P}\big(Y_{n+1} \in C_{\varepsilon}(X_{n+1}; \theta)\big) \geq 1 - \varepsilon. \quad (1)$$

A conformal model is considered valid if the frequency of error, $Y_{n+1} \notin C_{\varepsilon}(X_{n+1}; \theta)$, remains below the threshold $\varepsilon$. CP offers straightforward uncertainty estimates, where larger conformal sets $C_{\varepsilon}$ generally convey higher uncertainty.
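To make the construction behind Eq. (1) concrete, the following minimal NumPy sketch builds split-CP prediction sets from held-out softmax outputs using the simple HPS-style nonconformity score $S((x, y); \theta) = 1 - f_y(x; \theta)$. The function name, array shapes, and toy data are illustrative assumptions for this sketch, not part of the paper's setup.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, eps=0.1):
    """Split-CP sketch with the HPS-style score S((x, y); theta) = 1 - f_y(x; theta).

    cal_probs:  (n, K) softmax outputs on the calibration set
    cal_labels: (n,)   true calibration labels
    test_probs: (m, K) softmax outputs on test points
    """
    n = len(cal_labels)
    # Nonconformity scores of the calibration examples.
    cal_scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # (1 - eps) quantile with the usual finite-sample correction.
    q_level = min(np.ceil((1 - eps) * (n + 1)) / n, 1.0)
    q_hat = np.quantile(cal_scores, q_level, method="higher")
    # Include every label whose nonconformity score does not exceed the threshold.
    return [np.where(1.0 - p <= q_hat)[0] for p in test_probs]

# Toy usage with random "softmax" outputs, just to show the shapes involved.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(10), size=500)
cal_labels = rng.integers(0, 10, size=500)
test_probs = rng.dirichlet(np.ones(10), size=5)
print([len(s) for s in conformal_sets(cal_probs, cal_labels, test_probs)])
# Larger sets signal higher predictive uncertainty.
```

Under exchangeability, sets built this way satisfy the marginal coverage guarantee of Eq. (1), and their sizes are exactly the uncertainty signal that the attacks studied in this paper aim to manipulate.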
Although CP is being increasingly used in safety-critical and security-related applications, there is still a gap in understanding the effects of poisoning attacks on CP, an area that remains largely unexplored. In practice, the risk of data poisoning attacks (Yang et al., 2023; Jagielski et al., 2021; Qian et al., 2023) intensifies in DNNs, since they rely on large and diverse datasets whose size makes it difficult to guarantee the trustworthiness of the training data. As a result, models trained on such datasets are susceptible to data poisoning attacks, wherein an adversary places specifically constructed poisoned examples into the training data with harmful intentions (e.g., leading to unequalized and unfair coverage outcomes). In Schwarzschild et al. (2021), industry practitioners identified data poisoning as the most significant concern among various threats.

In this work, we perform the first study on data poisoning attacks against CP, where the adversary aims to undermine the use of CP techniques by manipulating conformal prediction sets while ensuring label correctness. Consider a scenario where a doctor relies on conformal prediction to decide whether a model prediction is reliable enough or requires more attention from the doctor. An adversary can compromise this process through poisoning attacks. Such attacks affect the model's ability to accurately estimate uncertainty, leading to potential risks in medical decision-making. Traditional data poisoning attacks (Zhao et al., 2024; Jagielski et al., 2021; Li et al., 2021; Geiping et al., 2021b; Peri et al., 2020; Foret et al., 2020; Qian et al., 2023) mainly focus on inducing misclassifications, whereas we focus on vulnerabilities related to the model's prediction uncertainties. Data poisoning attacks against CP could be more subtle and harder to detect than traditional poisoning attacks, since they manipulate the model's prediction uncertainties rather than directly altering label predictions, and might therefore bypass existing defenses that depend on label changes.

While there are a few existing works (Ghosh et al., 2023; Gendler et al., 2021; Zhao et al., 2023) addressing test-time adversarial attacks on CP, they do not consider the risks of data poisoning attacks during the training process. Compared with these existing adversarial attacks, performing data poisoning attacks in the above-discussed CP setting is more stealthy, due to the preservation of the data exchangeability assumption in such scenarios. Notably, the limited existing works on adversarial attacks against CP usually focus on how to ensure validity (i.e., the coverage guarantee in Eq. (1)) under violations of data exchangeability, and fail to consider maliciously manipulated efficiency, where the adversary targets prediction confidence. For example, the adversary might deliberately craft the poisoning training data to cause unequalized coverage probabilities that fail for specific sub-populations. Therefore, these adversarially robust CP methods are not equipped to counteract our proposed data poisoning attacks.

Motivated by the above, we believe that studying poisoning attacks targeting prediction uncertainty is essential for the safe application of CP. In this work, we take the first step in this direction, i.e., understanding the effects of data poisoning attacks on CP.
To this end, we design a novel bi-level poisoning attack framework to craft effective poisoning points in the black-box setting. In the proposed framework, we first design approximate relaxations to handle the discrete conformal sets and the non-differentiable quantile. We also present a new worst-case adversarial loss that maximizes the poisoning effect on the worst-case model, yielding a stronger attack. Further, we present novel efficient optimization methods that rigorously refine our attacks to generate effective poisoning points through closed-form updates, thus eliminating the need for extensive model retraining or full access to the training data. We conduct thorough experiments to verify the effectiveness of our attacks in various scenarios, including both full and split CP settings. Our detailed analysis reveals that these attacks are more successful at manipulating uncertainty outcomes than conventional poisoning attacks. Moreover, we find that current defenses against traditional poisoning attacks do not effectively counter our proposed attacks, underscoring the need for new strategies to address these advanced forms of data poisoning. These findings underscore the potential negative impacts of poisoning attacks on CP and aim to raise awareness within the research community about this issue.

2. Related Work

Compared with traditional uncertainty estimation techniques, CP (Vovk et al., 2005) is a general framework for constructing conformal confidence sets, with the remarkable properties of being distribution-free, having coverage guarantees, and being adaptable to any estimator. However, previous literature on uncertainty estimation (Ren et al., 2023; Ledda et al., 2023; Alarab & Prakoonwit, 2022; Wicker et al., 2020; Yuan et al., 2020; Wang et al., 2018; 2022) has not delved into the vulnerability of CP to data poisoning attacks. On the other hand, data poisoning attacks at training time have emerged as a significant potential threat. Traditional poisoning attacks primarily deceive the model into making incorrect predictions. However, the distinct characteristics of both split CP and full CP (e.g., the coverage guarantees) pose challenges in directly applying these conventional poisoning attacks. Meanwhile, existing defenses against data poisoning attacks primarily depend on either anomaly detection based on nearest neighbors, training loss, singular-value decomposition, or clustering (Peri et al., 2020; Cretu et al., 2008; Tran et al., 2018; Chen et al., 2018; Steinhardt et al., 2017), or robust training based on randomized smoothing, ensembling, data augmentation, and adversarial training (Weber et al., 2023; Li et al., 2021; Tao et al., 2021; Levine & Feizi, 2020; Ma et al., 2019; Abadi et al., 2016). For example, Peri et al. (2020) filter examples whose class labels differ from those of their nearest neighbors in the feature space. However, our proposed attacks are tailored to exploit vulnerabilities related to prediction uncertainty via CP, a nuance that these existing defenses may not be designed to handle. Additionally, compared with existing works (Jagielski et al., 2021; Geiping et al., 2021b; Koh & Liang, 2017) that do not consider retraining in an end-to-end manner, our proposed attacks maximize the poisoning effect on the worst-case model during optimization, resulting in enhanced stability and strong attack performance.
Additionally, our attacks come with a more rigorous derivation of the attack optimization methodology through closed-form gradient updates between the poisoned and benign models. Importantly, these closed-form updates allow our optimization framework to trade off precision against computational cost by adjusting the precision of the model update.

3. Preliminaries

We assume pairs $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ have a joint distribution denoted as $P$, with the marginal distributions of $X$ and $Y$ and the conditional distribution $Y \mid X$ denoted as $P_X$, $P_Y$, and $P_{Y \mid X}$, respectively. Given a new sample $X$, for every candidate label $Y \in \mathcal{Y}$, CP applies a simple test to either accept or reject the null hypothesis that the pair $(X, Y)$ is correct (Fisch et al., 2021). The test statistic for this test is a nonconformity measure, $S((X, Y); \theta)$, where $\theta$ is a model fit to the training data using some learning algorithm. Informally, a lower value of $S$ reflects that $(X, Y)$ conforms to the training data, whereas a higher value of $S$ reflects that $(X, Y)$ is atypical relative to the training data.

Assumption 3.1 (Exchangeability). Consider the calibration data $Z_1 = (X_1, Y_1), \dots, Z_n = (X_n, Y_n)$ and the test data $Z_{n+1} = (X_{n+1}, Y_{n+1})$. The examples are exchangeable if any permutation yields the same distribution, i.e.,

$$(Z_1, \dots, Z_{n+1}) \overset{d}{=} (Z_{\tau(1)}, \dots, Z_{\tau(n+1)}), \quad (2)$$

for an arbitrary permutation $\tau$ of the integers $1, \dots, n+1$.

Let $\mathcal{D} = \{Z_i = (X_i, Y_i)\}_{i=1}^{n}$ denote a calibration set of exchangeable (see Assumption 3.1) and correctly labeled examples. To determine the conformal prediction set for a test sample $X$, the classifier tests the nonconformity score for each potential label $Y$ against a pre-defined significance level $\varepsilon$, and includes all $Y$ for which the null hypothesis that the candidate data pair $(X, Y)$ is conformal is not rejected. This is achieved by comparing the nonconformity score of the test candidate against the nonconformity scores computed over the calibration dataset $\mathcal{D}$. This comparison uses the quantile

$$Q_{1-\varepsilon}(\mathcal{D}, \theta) := \mathrm{Quantile}\big(1 - \varepsilon;\ \{S((X_i, Y_i); \theta)\}_{i=1}^{n} \cup \{\infty\}\big). \quad (3)$$

Note that compared with full CP, split CP is fast and easy to implement.

Theorem 3.2 (Vovk et al., 2005). Assume that the examples $(X_i, Y_i)$, $i = 1, \dots, n+1$, are exchangeable. For any nonconformity measure $S$ and any $\varepsilon$, define the conformal set (based on the first $n$ examples) at $X_{n+1} \in \mathcal{X}$ as

$$C_\varepsilon(X_{n+1}; \theta) = \big\{Y_{n+1} \in \mathcal{Y} : S(X_{n+1}, Y_{n+1}) \leq \mathrm{Quantile}\big(1 - \varepsilon;\ \{S((X_i, Y_i); \theta)\}_{i=1}^{n} \cup \{\infty\}\big)\big\}. \quad (4)$$

Then $C_\varepsilon(X; \theta)$ satisfies Eq. (1).

4. Problem Statement

4.1. Threat Model

We consider a realistic threat model, where the adversary has no knowledge of the internal model parameters or the training process of the victim model, and is unable to alter test data during the model's testing phase. Additionally, the adversary cannot gain knowledge of the data points adopted for training, and can only inject a limited number of manipulated new points into the training data. This situation depicts an attack setting where the adversary spreads poisoned data that developers unknowingly compile, along with vast amounts of benign data, to form the model's training set. However, we allow the adversary the computational capability required to train a pre-trained model $\theta^*(\mathcal{D})$ on a separate auxiliary dataset $\mathcal{D}$, which is comparable to the victim model (Jagielski et al., 2021; Geiping et al., 2021b). $\mathcal{D}$ is similar to the training data owned by the model owner and is sampled from the same distribution.
Note that the assumption that the adversary can access a pre-trained model trained on an auxiliary dataset is reasonable given the widespread availability of public data; it has been a common assumption for black-box attacks in the existing literature (Jagielski et al., 2021). Additionally, we also consider the white-box setting (Chen & Gu, 2020; Huai et al., 2020a; Neekhara et al., 2021; Wang et al., 2021; Huai et al., 2022; Gluch & Urbanke, 2021; Liu et al., 2024; Suya et al., 2021; Schwarzschild et al., 2021). In this scenario, the adversary has full access to the victim model's training data and network architecture.

The adversary aims to interfere with conformal prediction either through overconfidence CP attacks, leading the model to underestimate prediction uncertainty by shrinking its conformal prediction sets, or through underconfidence CP attacks, making the model underconfident by widening its conformal prediction sets. Additionally, to ensure stealth, we also consider maintaining the same coverage guarantees and executing targeted attacks without compromising uncertainty accuracy on benign samples. Note that our proposed data poisoning attacks can also be utilized to cause unequalized coverage across subgroups.

4.2. Attack Formulation

Here, we propose our attacks for crafting poisoning samples against CP. As discussed in Section 3, in CP we first split the auxiliary dataset $\mathcal{D}$ into a training fold $\mathcal{D}^{tr}$ and a calibration fold $\mathcal{D}^{ca}$. Then, based on the learning algorithm $\mathcal{A}$ and the training data $\mathcal{D}^{tr}$, the adversary can train a model $f$ with parameters $\theta$ that correctly classifies as many data points as possible, maximizing $\mathbb{E}_{(X,Y) \sim \mathcal{D}^{tr}}\,\mathbb{I}(f(X; \theta) = Y)$, where $\mathbb{I}$ is the indicator function. We denote the training loss over the training data as $L(\theta; \mathcal{D}^{tr}) = \frac{1}{n}\sum_{i=1}^{n} l(f(X_i; \theta), Y_i)$. We denote the set of victim target samples as $\{(X_v, Y_v)\}_{v=1}^{V}$. We assume that the adversary selects a subset $\mathcal{D}^{tr}_p$ from $\mathcal{D}^{tr}$, which constitutes a fraction $\xi_1 \in [0, 1]$ of $\mathcal{D}^{tr}$, and replaces it with a poisoning set $\tilde{\mathcal{D}}^{tr}_p$. We denote the remaining clean data as $\mathcal{D}^{tr}_c = \mathcal{D}^{tr} \setminus \mathcal{D}^{tr}_p$. For simplicity, we will omit the superscripts for $\mathcal{D}^{tr}_p$, $\mathcal{D}^{tr}_c$, and $\tilde{\mathcal{D}}^{tr}_p$ in the following. The effective poisoning points can be obtained by solving the following optimization problem:

$$\tilde{\mathcal{D}}_p^{*} \in \arg\max_{\tilde{\mathcal{D}}_p}\ \ell_1\big(\{X_v\}_{v=1}^{V};\, \theta(\tilde{\mathcal{D}}_p),\, Q_{1-\varepsilon}(\theta(\tilde{\mathcal{D}}_p))\big) = \sum_{v=1}^{V} \big|C_\varepsilon(X_v; \theta(\tilde{\mathcal{D}}_p))\big| + \sum_{v=1}^{V} \mathbb{I}\big(Y_v = f(X_v; \theta(\tilde{\mathcal{D}}_p))\big) + \sum_{v=1}^{V} \mathbb{I}\big(Y_v \in C_\varepsilon(X_v; \theta(\tilde{\mathcal{D}}_p))\big), \quad (5)$$

where $\theta(\tilde{\mathcal{D}}_p)$ is obtained by training on the poisoned data $\tilde{\mathcal{D}}^{tr} = \tilde{\mathcal{D}}_p \cup \mathcal{D}_c$, and $C_\varepsilon(X_v; \theta(\tilde{\mathcal{D}}_p)) = \{Y \in \mathcal{Y} : S(X_v, Y; \theta(\tilde{\mathcal{D}}_p)) < Q_{1-\varepsilon}(\mathcal{D}^{ca}, \theta(\tilde{\mathcal{D}}_p))\}$. Here $Q_{1-\varepsilon}(\theta(\tilde{\mathcal{D}}_p))$ is the new quantile calculated from the poisoned model $\theta(\tilde{\mathcal{D}}_p)$. Without loss of generality, we here focus on scenarios where the adversary aims to increase the prediction uncertainty by enlarging the sizes of the conformal prediction sets. The second and third loss terms in the above equation are designed to ensure correct label predictions and the inclusion of true labels in post-attack conformal prediction sets, respectively. This enhances attack stealthiness without impacting coverage results or altering label predictions. Note that the above is a bi-level optimization problem: the outer optimization over $\tilde{\mathcal{D}}_p$ involves the model parameters $\theta(\tilde{\mathcal{D}}_p)$, which are themselves the minimizer of the following training problem:

$$\theta(\tilde{\mathcal{D}}_p) = \arg\min_{\theta \in \Theta} L(\theta; \tilde{\mathcal{D}}^{tr} = \tilde{\mathcal{D}}_p \cup \mathcal{D}_c). \quad (6)$$

Note that Eq. (5) and Eq. (6) provide a high-level formulation for crafting poisoning examples $\tilde{\mathcal{D}}_p$ to increase the conformal set sizes (i.e., $|C_\varepsilon(X_v; \theta(\tilde{\mathcal{D}}_p))|$).
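For illustration, the hedged PyTorch sketch below evaluates the three ingredients of the discrete objective $\ell_1$ in Eq. (5) for a given (possibly poisoned) model: the conformal set sizes of the victim samples, the label-consistency indicator, and the true-label coverage indicator. The helper name, the HPS-style score $1 - f_Y(X_v; \theta)$, and the tensor layout are assumptions made for this sketch, not the authors' implementation.

```python
import math
import torch

def attack_objective_eq5(model, victims, victim_labels, cal_scores, eps=0.1):
    """Evaluate the three terms of the discrete objective in Eq. (5) for a
    candidate (poisoned) model, using an HPS-style score 1 - f_Y(X_v; theta).

    victims:       (V, ...) victim inputs X_v
    victim_labels: (V,)     true labels Y_v
    cal_scores:    (n,)     nonconformity scores on the calibration fold,
                            computed under the *same* model
    """
    with torch.no_grad():
        probs = torch.softmax(model(victims), dim=1)       # f(X_v; theta(D~_p))
    scores = 1.0 - probs                                   # S(X_v, Y; theta(D~_p))
    n = cal_scores.numel()
    k = min(math.ceil((1 - eps) * (n + 1)), n)             # order statistic behind Q_{1-eps}
    q_hat = torch.sort(cal_scores).values[k - 1]           # empirical quantile, cf. Eq. (3)
    in_set = scores < q_hat                                # membership in C_eps(X_v)
    set_sizes = in_set.sum(dim=1).float()                  # |C_eps(X_v)|
    preds = probs.argmax(dim=1)
    label_ok = (preds == victim_labels).float()            # I(Y_v = f(X_v))
    covered = in_set[torch.arange(len(victim_labels)), victim_labels].float()
    return set_sizes.mean(), label_ok.mean(), covered.mean()
```

Because the set sizes and the two indicator terms are piecewise constant in the model parameters, this objective cannot be optimized directly with gradient methods, which is precisely what motivates the continuous relaxations introduced next.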
However, directly solving this framework is infeasible due to the discrete nature of the conformal sets. Recall that the conformal prediction set $C_\varepsilon(X_v; \theta(\tilde{\mathcal{D}}_p))$ (defined in Eq. (4)) is based on comparing nonconformity scores to a threshold. A straightforward way is to directly adopt the quantile to formulate the relative comparison. However, this is impractical due to the difficulty of expressing the quantile $Q_{1-\varepsilon}(\theta(\tilde{\mathcal{D}}_p))$ in a continuous and differentiable way. To overcome this, we develop a more feasible method, drawing upon the derivation of $Q_{1-\varepsilon}(\theta(\tilde{\mathcal{D}}_p))$ in Eq. (3). Based on this, we can have

$$\min_{\tilde{\mathcal{D}}_p}\ \ell_2\big(\{X_v\}_{v=1}^{V}; \theta(\tilde{\mathcal{D}}_p), Q_{1-\varepsilon}(\theta(\tilde{\mathcal{D}}_p))\big) = \sum_{v=1}^{V}\Big[\sum_{Y \notin C_\varepsilon(X_v; \theta(\tilde{\mathcal{D}}_p)),\, Y \in \mathcal{Y}_a} \max\big(S(X_v, Y; \theta(\tilde{\mathcal{D}}_p)) - S(X_{i^*}, Y_{i^*}; \theta(\tilde{\mathcal{D}}_p)),\ 0\big)\Big] + \sum_{v=1}^{V} \max\big(\max_{Y \neq Y_v} f_Y(X_v; \theta(\tilde{\mathcal{D}}_p)) - f_{Y_v}(X_v; \theta(\tilde{\mathcal{D}}_p)),\ \beta\big), \quad (7)$$

where $i^*$ indexes the $n^{\varepsilon}_{ca}$-th smallest calibration score with $n^{\varepsilon}_{ca} = \lceil (1-\varepsilon)\,|\mathcal{D}^{ca}| \rceil$, $\mathcal{Y}_a$ is the set of labels we aim to add into the prediction set, and $\beta$ is a constant. Since the second and third terms are non-convex and non-differentiable, we design surrogate losses to approximate them.

Note that the above optimization in Eq. (7) and Eq. (6) is designed to craft effective poisoning samples that fulfill the adversary's objectives, which are then injected into the dataset of the model owner. However, during re-training for optimization, the poisoned model $\theta(\tilde{\mathcal{D}}_p)$ can converge differently due to training uncertainties such as model initialization and hyperparameter choice. Consequently, this can diminish the effectiveness of these poisoning samples and reduce their overall poisoning impact. To address this, we propose to focus on the worst-case poisoned model, which is the inner minimum in Eq. (6) that has the worst poisoning effect (Andriushchenko & Flammarion, 2022; Wen et al., 2022). Our key idea here is to maximize the poisoning effect on the worst-case model, so that a high poisoning effect is preserved for other models. We can then formulate the worst-case poisoned model as $\theta' = \arg\max_{\theta \in \Theta_p} \ell_2(\{X_v\}_{v=1}^{V}; \theta, Q_{1-\varepsilon}(\theta))$, where $\Theta_p = \{\theta : L(\theta; \tilde{\mathcal{D}}^{tr} = \tilde{\mathcal{D}}_p \cup \mathcal{D}_c) \leq \tau_1\}$ is the poisoned model space, i.e., the set of all models that are trained on the poisoned dataset and have a small training loss. Then, based on the notion of model sharpness (Foret et al., 2020), we can approximate the worst-case loss $\ell_2(\theta')$ by $\ell_2(\theta') \approx \max_{\|\zeta\|_p \leq \rho} \ell_2(\{X_v\}_{v=1}^{V}; \theta(\tilde{\mathcal{D}}_p) + \zeta, Q_{1-\varepsilon}(\theta(\tilde{\mathcal{D}}_p)))$. Therefore, we can obtain

$$\tilde{\mathcal{D}}_p^{*} \in \arg\min_{\tilde{\mathcal{D}}_p}\ \ell_3\big(\{X_v\}_{v=1}^{V}; \theta(\tilde{\mathcal{D}}_p), Q_{1-\varepsilon}\big) = \arg\min_{\tilde{\mathcal{D}}_p}\ \max_{\|\zeta\|_p \leq \rho} \ell_2\big(\{X_v\}_{v=1}^{V}; \theta(\tilde{\mathcal{D}}_p) + \zeta, Q_{1-\varepsilon}\big), \quad (8)$$

where $\theta(\tilde{\mathcal{D}}_p)$ is the minimizer of the training problem in Eq. (6). In the above, we locally maximize the loss by perturbing $\theta(\tilde{\mathcal{D}}_p)$ with a vector $\zeta$ (constrained by a norm limit $\|\zeta\|_p \leq \rho$). In this way, the perturbed model $\theta(\tilde{\mathcal{D}}_p) + \zeta$ has a worse poisoning effect than $\theta(\tilde{\mathcal{D}}_p)$. The attack framework described in Eq. (8) and Eq. (6) is a bi-level optimization problem, where the outer optimization in Eq. (8) defines the adversarial attack objective and the inner problem in Eq. (6) specifies the model's learning objective using both the clean and poisoning data. Notably, compared with the original adversarial objective in Eq. (7), the perturbations on the inner minimum in the worst-case optimization of Eq. (8) help achieve a strong poisoning effect.

4.3. Optimization

Fundamentally, the formulated bi-level optimization problem in Eq. (8) and Eq. (6) can be computationally expensive, especially for DNNs, since we need to fully solve the inner problem in Eq. (6) to update the outer variables.
Besides the high computational complexity, the inner optimization also incurs significant storage costs to maintain the entirety of the large training dataset. This raises a critical question: is it possible to craft effective poisoning points without retraining DNNs or accessing the entire training dataset? This question underscores the need for more resource-efficient strategies that circumvent the extensive computational and storage requirements typically associated with such end-to-end poisoning attacks (Foret et al., 2020).

To address the above challenges, we resort to formulating the optimization as a closed-form update of the original pre-trained model $\theta^*$, while only knowing the subset $\mathcal{D}_p$. Specifically, we adopt influence functions (Hampel, 1974) to find a closed-form model update $\Psi(\mathcal{D}_p, \tilde{\mathcal{D}}_p)$ that we add to the original model $\theta^*$ (trained over $\mathcal{D}^{tr} = \mathcal{D}_p \cup \mathcal{D}_c$) for the generated poisoning samples. In this way, by capturing the changes to the pre-trained model $\theta^*$ in a closed-form update, we can provide significant speed-ups over existing retraining-based methods (Huang et al., 2020). Our closed-form updates are not limited to feature-level manipulations; they also cover label manipulations. To map changes in the training data to closed-form updates of the model parameters, we can formulate

$$\theta^*_{\xi,\, \mathcal{D}_p \to \tilde{\mathcal{D}}_p} = \arg\min_{\theta}\ L_\xi(\theta; \mathcal{D}^{tr}) = \arg\min_{\theta}\ L(\theta; \mathcal{D}^{tr}) + \xi \sum_{\tilde{Z}_p \in \tilde{\mathcal{D}}_p} l(\tilde{Z}_p, \theta) - \xi \sum_{Z_p \in \mathcal{D}_p} l(Z_p, \theta). \quad (9)$$

The above generalization allows for the substitution of $Z_p$ with $\tilde{Z}_p$ by slightly increasing the weight of $\tilde{Z}_p$ by a small value $\xi$ and correspondingly decreasing that of $Z_p$. Below, we introduce our rigorously refined attacks based on the first-order and second-order closed-form gradient updates.

First-order case. To derive the first-order update, when $\xi$ is small and $l$ is differentiable with respect to $\theta$, we can use a first-order Taylor series at $\theta^*$ to approximate $L_\xi(\theta; \mathcal{D}^{tr})$ in Eq. (9) by

$$L_\xi(\theta^*_{\xi, Z_p \to \tilde{Z}_p}; \mathcal{D}^{tr}) \approx L(\theta^*; \mathcal{D}^{tr}) + \xi\big(l(\tilde{Z}_p, \theta^*) - l(Z_p, \theta^*)\big) + \Psi(\mathcal{D}_p, \tilde{\mathcal{D}}_p)^{\top}\big(\nabla_\theta L(\theta^*; \mathcal{D}^{tr}) + \xi(\nabla_\theta l(\tilde{Z}_p; \theta^*) - \nabla_\theta l(Z_p; \theta^*))\big), \quad (10)$$

where $\theta^*$ is obtained over $\mathcal{D}^{tr}$. Given that the poisoned model $\theta^*_{\xi, Z_p \to \tilde{Z}_p}$ is a minimum of $L_\xi(\cdot\,; \mathcal{D}^{tr})$, we can assume that $L_\xi(\theta^*_{\xi, Z_p \to \tilde{Z}_p}; \mathcal{D}^{tr}) < L_\xi(\theta^*; \mathcal{D}^{tr})$. Integrating this into the Taylor series approximation and using the condition that $\nabla_\theta L(\theta^*; \mathcal{D}^{tr}) = 0$, based on Eq. (9) we now have $\xi\, \Psi(\mathcal{D}_p, \tilde{\mathcal{D}}_p)^{\top}\big(\nabla_\theta l(\tilde{Z}_p, \theta^*) - \nabla_\theta l(Z_p, \theta^*)\big) < 0$. Given $\xi > 0$, our attention shifts to analyzing the dot product within this inequality. For two given vectors $\mu_1, \mu_2$, the dot product can be expressed as $\mu_1^{\top}\mu_2 = \|\mu_1\|\,\|\mu_2\|\cos(\mu_1, \mu_2)$, where $\cos(\mu_1, \mu_2)$ is the cosine between $\mu_1$ and $\mu_2$. The minimum cosine, $-1$, occurs when $\mu_1 = -\mu_2$. Therefore, we can arrive at

$$\Psi(\mathcal{D}_p, \tilde{\mathcal{D}}_p) = \sum_{Z_p \in \mathcal{D}_p} \nabla_\theta l(Z_p, \theta^*) - \sum_{\tilde{Z}_p \in \tilde{\mathcal{D}}_p} \nabla_\theta l(\tilde{Z}_p, \theta^*),$$

which indicates that the optimal direction for the adjustment from $\theta^*$ is opposite to $\sum_{\tilde{Z}_p \in \tilde{\mathcal{D}}_p} \nabla_\theta l(\tilde{Z}_p, \theta^*) - \sum_{Z_p \in \mathcal{D}_p} \nabla_\theta l(Z_p, \theta^*)$. The actual step size is unknown and requires calibration with a small constant $\tau$ to determine the appropriate update magnitude. Based on this, we can have

$$\theta^*_{\xi,\, \mathcal{D}_p \to \tilde{\mathcal{D}}_p} \approx \theta^* - \tau\Big(\sum_{\tilde{Z}_p \in \tilde{\mathcal{D}}_p} \nabla_\theta l(\tilde{Z}_p, \theta^*) - \sum_{Z_p \in \mathcal{D}_p} \nabla_\theta l(Z_p, \theta^*)\Big). \quad (11)$$

Intuitively, this update shifts the model from fitting the replaced clean points $Z_p \in \mathcal{D}_p$ to fitting the poisoning points $\tilde{Z}_p \in \tilde{\mathcal{D}}_p$, with $\tau$ dictating the update's step size. Next, we can use a first-order Taylor series around $\theta^*$ to approximate $\ell_3(X_v; \theta^*_{\xi,\mathcal{D}_p \to \tilde{\mathcal{D}}_p}, Q_{1-\varepsilon})$ in Eq. (8) as follows:

$$\min_{\tilde{\mathcal{D}}_p} \ell_3(X_v; \theta^*_{\xi,\mathcal{D}_p \to \tilde{\mathcal{D}}_p}, Q_{1-\varepsilon}) = \min_{\tilde{\mathcal{D}}_p} \ell_3(X_v; \theta^*_{\xi,\mathcal{D}_p \to \tilde{\mathcal{D}}_p}, Q_{1-\varepsilon}) - \ell_3(X_v; \theta^*, Q_{1-\varepsilon}) \approx \min_{\tilde{\mathcal{D}}_p} \nabla_\theta \ell_3(X_v; \theta^*, Q_{1-\varepsilon})^{\top}\big[\theta^*_{\xi,\mathcal{D}_p \to \tilde{\mathcal{D}}_p} - \theta^*\big] = \min_{\tilde{\mathcal{D}}_p}\ \tau\, \nabla_\theta \ell_3(X_v; \theta^*, Q_{1-\varepsilon})^{\top}\, \Psi(\mathcal{D}_p, \tilde{\mathcal{D}}_p). \quad (12)$$
Therefore, to induce a modification $\theta^*_{\xi, \mathcal{D}_p \to \tilde{\mathcal{D}}_p} - \theta^*$ that can most increase the adversarial loss $\ell_3$ on the victim examples $\{(X_v, Y_v)\}_{v=1}^{V}$, we can minimize the above equation when $\xi$ is small, i.e., by maximizing $\nabla_\theta \ell_3(\{X_v\}_{v=1}^{V}; \theta^*, Q_{1-\varepsilon})^{\top}\, \Psi(\mathcal{D}_p, \tilde{\mathcal{D}}_p)$. The objective is then to solve

$$\arg\max_{\tilde{\mathcal{D}}_p} \Phi(\tilde{\mathcal{D}}_p, \theta) = \frac{\nabla_\theta \ell_3(\{X_v\}_{v=1}^{V}; \theta^*, Q_{1-\varepsilon})^{\top}\, \Psi(\mathcal{D}_p, \tilde{\mathcal{D}}_p)}{\big\|\nabla_\theta \ell_3(\{X_v\}_{v=1}^{V}; \theta^*, Q_{1-\varepsilon})\big\|\, \big\|\Psi(\mathcal{D}_p, \tilde{\mathcal{D}}_p)\big\|}, \quad (13)$$

which achieves the maximized attack goal by aligning the directions of $\nabla_\theta \ell_3(\{X_v\}_{v=1}^{V}; \theta^*, Q_{1-\varepsilon})$ and $\Psi(\mathcal{D}_p, \tilde{\mathcal{D}}_p)$. To compute the adversarial gradient $\nabla_\theta \ell_3(\{X_v\}_{v=1}^{V}; \theta^*, Q_{1-\varepsilon})$, we adopt the technique in Foret et al. (2020) and first approximate $\ell_3$ by leveraging a first-order method:

$$\hat{\zeta} = \rho\, \mathrm{sign}\big(\nabla_\theta \ell_2(\{X_v\}_{v=1}^{V}; \theta^*, Q_{1-\varepsilon})\big)\, \big|\nabla_\theta \ell_2(\{X_v\}_{v=1}^{V}; \theta^*, Q_{1-\varepsilon})\big|^{q-1} \Big/ \Big(\big\|\nabla_\theta \ell_2(\{X_v\}_{v=1}^{V}; \theta^*, Q_{1-\varepsilon})\big\|_q^{q}\Big)^{1/p}, \quad (14)$$

where $1/p + 1/q = 1$. We set $p = 2$, following Foret et al. (2020), unless otherwise stated. Then, we can obtain the approximation used to calculate $\nabla_\theta \ell_3(\{X_v\}_{v=1}^{V}; \theta^*, Q_{1-\varepsilon})$ by replacing $\theta^*$ with $\theta^* + \hat{\zeta}$:

$$\nabla_\theta \ell_3(\{X_v\}_{v=1}^{V}; \theta^*, Q_{1-\varepsilon}) \approx \nabla_\theta \ell_2(\{X_v\}_{v=1}^{V}; \theta, Q_{1-\varepsilon})\big|_{\theta = \theta^* + \hat{\zeta}}. \quad (15)$$

In this way, by fixing $\nabla_\theta \ell_3$, we can solve Eq. (13) to find effective poisoning samples $\tilde{\mathcal{D}}_p$ via gradient descent. The resulting poisoned model is thus specifically tailored to exhibit malicious behavior towards the victim's data samples.

Second-order case. When the loss $L(\theta; \mathcal{D}^{tr})$ is twice differentiable and strictly convex, there exists an inverse Hessian matrix $H_{\theta^*}^{-1}$, which allows for the approximation of changes to the model (Ling, 1984). In particular, the optimality condition for Eq. (9) can be directly stated as $0 = \nabla_\theta L(\theta^*_{\xi, Z \to \tilde{Z}}; \mathcal{D}^{tr}) + \xi\big(\nabla_\theta l(\tilde{Z}, \theta^*_{\xi, Z \to \tilde{Z}}) - \nabla_\theta l(Z, \theta^*_{\xi, Z \to \tilde{Z}})\big)$. If $\xi$ is sufficiently small, we can use a first-order Taylor series at $\theta^*$ to approximate this condition as

$$0 \approx \nabla_\theta L(\theta^*; \mathcal{D}^{tr}) + \xi\big(\nabla_\theta l(\tilde{Z}, \theta^*) - \nabla_\theta l(Z, \theta^*)\big) + \big(\nabla^2_\theta L(\theta^*; \mathcal{D}^{tr}) + \xi(\nabla^2_\theta l(\tilde{Z}; \theta^*) - \nabla^2_\theta l(Z; \theta^*))\big)\big(\theta^*_{\xi, Z \to \tilde{Z}} - \theta^*\big). \quad (16)$$

Given the optimality condition $\nabla_\theta L(\theta^*; \mathcal{D}^{tr}) = 0$ for $\theta^*$, using the Hessian of the loss function we can rearrange this expression and get $\theta^*_{\xi, \mathcal{D}_p \to \tilde{\mathcal{D}}_p} - \theta^* \approx -\xi H_{\theta^*}^{-1}\big(\sum_{\tilde{Z}_p \in \tilde{\mathcal{D}}_p} \nabla_\theta l(\tilde{Z}_p, \theta^*) - \sum_{Z_p \in \mathcal{D}_p} \nabla_\theta l(Z_p, \theta^*)\big)$, where we additionally omit higher-order terms. Then, we can set $\xi = 1$ to replace the samples $Z_p$ completely by $\tilde{Z}_p$, which leads to the second-order update

$$\theta^*_{\mathcal{D}_p \to \tilde{\mathcal{D}}_p} \approx \theta^* - H_{\theta^*}^{-1}\Big(\sum_{\tilde{Z}_p \in \tilde{\mathcal{D}}_p} \nabla_\theta l(\tilde{Z}_p, \theta^*) - \sum_{Z_p \in \mathcal{D}_p} \nabla_\theta l(Z_p, \theta^*)\Big). \quad (17)$$

Combining this with Eq. (13), we can easily derive the second-order based attack framework.

Theorem 4.1. Assume that $L(\theta)$ is locally convex and differentiable. Let $\mathcal{D}_p = \{(X_i, Y_i)\}_{i=1}^{P}$, let $L(\theta^*)$ be the initial optimal solution, and let $L(\theta^*_u)$ be the updated optimal solution. Given a bound $\epsilon > 0$ with the perturbations satisfying $\|\delta_i\|_2 \leq \epsilon$, assume that $\|\theta^* - \theta^*_u\|_2$ has an upper bound $B_\theta$, and that the gradient $\nabla l$ is $L_z$-Lipschitz with respect to $X$ at $\theta^*$ and $L_1$-Lipschitz with respect to $\theta$. We obtain $\theta^*_{\mathcal{D}_p \to \tilde{\mathcal{D}}_p}$ from $\theta^*$ by our closed-form updates. Then the following upper bounds hold: for the first-order update of our approach, if $\tau \leq \frac{1}{L_1}$, we have $L_\xi(\theta^*_{\mathcal{D}_p \to \tilde{\mathcal{D}}_p}) - L(\theta^*_u) \leq \epsilon L_z |\mathcal{D}_p| B_\theta$; for the second-order update of our approach, we have $L_\xi(\theta^*_{\mathcal{D}_p \to \tilde{\mathcal{D}}_p}) - L(\theta^*_u) \leq \epsilon B_\theta L_z |\mathcal{D}_p| + \big(1 + \frac{1}{2L_1^2}\big) L_1 (\epsilon L_z |\mathcal{D}_p|)^2$.

Theorem 4.1 gives a finite-sample bound that quantifies the difference between our two approximation methods and the optimal solution. It demonstrates that as we decrease the perturbation bound $\epsilon$ and the number of poisoned data points $|\mathcal{D}_p|$, our methods approximate $L(\theta^*_u)$ more closely. The procedure for optimizing the above losses is postponed to the full version of this paper. Notably, we can easily generalize this algorithm to perform other attack types, e.g., adding irrelevant labels or removing specific labels regardless of the correctness of labels.
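To give a rough, self-contained picture of the first-order machinery in Eqs. (11)–(15), the sketch below computes the gradient-difference update of Eq. (11) and the $p = 2$ sharpness-style perturbation of Eq. (14). It deliberately simplifies: the model is a single flat parameter tensor, the training and adversarial losses are passed as callables, and the function names are illustrative assumptions; it is not the authors' released implementation.

```python
import torch

def first_order_update(theta, loss_fn, d_clean, d_poison, tau):
    """Sketch of the closed-form shift in Eq. (11).

    theta:    flat parameter tensor with requires_grad=True (stand-in for theta*)
    loss_fn:  callable (theta, batch) -> scalar training loss on that batch
    d_clean:  the replaced clean points D_p; d_poison: the crafted poisons D~_p
    """
    g_clean = torch.autograd.grad(loss_fn(theta, d_clean), theta)[0]
    g_poison = torch.autograd.grad(loss_fn(theta, d_poison), theta)[0]
    # theta*_{D_p -> D~_p} ~= theta* - tau * (grad on poisons - grad on removed clean points)
    return theta - tau * (g_poison - g_clean)

def sharpness_perturbation(theta, adv_loss_fn, rho):
    """Eq. (14) with p = q = 2: zeta_hat = rho * grad / ||grad||_2."""
    grad = torch.autograd.grad(adv_loss_fn(theta), theta)[0]
    return rho * grad / (grad.norm(p=2) + 1e-12)

# Eq. (15) then evaluates the adversarial gradient at theta* + zeta_hat, and
# Eq. (13) adjusts the poison features by gradient ascent on the cosine
# alignment between that gradient and Psi(D_p, D~_p) = g_clean - g_poison.
```

In the full attack, this closed-form update stands in for retraining on the inner problem of Eq. (6), while the outer step of Eq. (13) iteratively refines the poisoning features.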
Such generalizations further demonstrate the significant threat that poisoning attacks pose to CP. Theorem 4.2 shows that, under specific conditions on the step sizes, the victim model will converge to a stationary point of the adversarial loss when the main training loss is optimized using stochastic gradient descent.

Theorem 4.2. Let $\ell_3(\theta)$ be bounded below and have a Lipschitz continuous gradient with constant $L > 0$. Assume that the victim model is trained by stochastic gradient descent (SGD) with step sizes $\alpha_t$, i.e., by sampling a random index $i_t$ uniformly from $\{1, \dots, n\}$ and then updating $\theta_{t+1} = \theta_t - \alpha_t \nabla L_{i_t}(\theta_t)$. If the step sizes $\alpha_t > 0$ satisfy $\alpha_t L < \omega\, \Phi(\tilde{\mathcal{D}}_p, \theta_t)\, \frac{\|\nabla \ell_3(\theta_t)\|}{\|\nabla L(\theta_t)\|}$ and $\mathbb{E}\big[\|\nabla L_{i_t}(\theta_t)\|^2\big] \leq \|\nabla L(\theta_t)\|^2$ for some fixed $\omega < 2$, then $\mathbb{E}[\ell_3(\theta_{t+1})] < \mathbb{E}[\ell_3(\theta_t)]$. If, in addition, there exist $\mu > 0$ and $t_0$ such that $\Phi(\tilde{\mathcal{D}}_p, \theta_t) > \mu$ for all $t \geq t_0$, then $\lim_{t \to \infty} \|\nabla \ell_3(\theta_t)\| = 0$.

Proof. For $\ell_3(\theta_{t+1})$, we have

$$\ell_3(\theta_{t+1}) = \ell_3\big(\theta_t - \alpha_t \nabla L_{i_t}(\theta_t)\big) \leq \ell_3(\theta_t) - \alpha_t \nabla L_{i_t}(\theta_t)^{\top} \nabla \ell_3(\theta_t) + \frac{\alpha_t^2 L}{2}\, \big\|\nabla L_{i_t}(\theta_t)\big\|^2. \quad (18)$$

Taking the expected value of both sides of this expression (where the expectation is taken over the randomness in the sample selection $i_t$), we get

$$\mathbb{E}[\ell_3(\theta_{t+1})] \leq \mathbb{E}[\ell_3(\theta_t)] - \alpha_t\, \mathbb{E}\big[\nabla L_{i_t}(\theta_t)^{\top} \nabla \ell_3(\theta_t)\big] + \frac{\alpha_t^2 L}{2}\, \mathbb{E}\big[\|\nabla L_{i_t}(\theta_t)\|^2\big]. \quad (19)$$

Now, the expected value of $\nabla L_{i_t}(\theta_t)$ given $\theta_t$ is $\mathbb{E}[\nabla L_{i_t}(\theta_t) \mid \theta_t] = \sum_{i=1}^{n} \nabla L_i(\theta_t)\, \mathbb{P}(i_t = i \mid \theta_t) = \sum_{i=1}^{n} \nabla L_i(\theta_t)\, \frac{1}{n} = \nabla L(\theta_t)$. Based on this, we have

$$\mathbb{E}[\ell_3(\theta_{t+1})] \leq \mathbb{E}[\ell_3(\theta_t)] - \alpha_t\, \nabla L(\theta_t)^{\top} \nabla \ell_3(\theta_t) + \frac{\alpha_t^2 L}{2}\, \mathbb{E}\big[\|\nabla L_{i_t}(\theta_t)\|^2\big]. \quad (20)$$

According to the assumption $\mathbb{E}\big[\|\nabla L_{i_t}(\theta_t)\|^2\big] \leq \|\nabla L(\theta_t)\|^2$, we get

$$\mathbb{E}[\ell_3(\theta_{t+1})] \leq \mathbb{E}[\ell_3(\theta_t)] - \Big(\alpha_t\, \frac{\|\nabla \ell_3(\theta_t)\|}{\|\nabla L(\theta_t)\|}\cos\gamma_t - \frac{\alpha_t^2 L}{2}\Big)\, \|\nabla L(\theta_t)\|^2, \quad (21)$$

where $\gamma_t$ denotes the angle between $\nabla \ell_3(\theta_t)$ and $\nabla L(\theta_t)$. As such, the adversarial loss decreases for nonzero step sizes whenever $\frac{\|\nabla \ell_3(\theta_t)\|}{\|\nabla L(\theta_t)\|}\cos\gamma_t > \frac{1}{2}\alpha_t L$; this follows from our assumption on the parameter $\omega$. Therefore, we can get $\mathbb{E}[\ell_3(\theta_{t+1})] < \mathbb{E}[\ell_3(\theta_t)]$. Reinserting this estimate into Eq. (21) reveals that $\mathbb{E}[\ell_3(\theta_{t+1})] \leq \mathbb{E}[\ell_3(\theta_t)] - \frac{\cos^2\gamma_t}{2L}\, \|\nabla \ell_3(\theta_t)\|^2$. Due to monotonicity, we may sum over all descent inequalities, yielding $\sum_{t=0}^{T-1}\big(\mathbb{E}[\ell_3(\theta_t)] - \mathbb{E}[\ell_3(\theta_{t+1})]\big) \geq \sum_{t=0}^{T-1} \frac{\cos^2\gamma_t}{2L}\, \|\nabla \ell_3(\theta_t)\|^2$, and then

$$\ell_3(\theta_0) - \ell_3^{*} \geq \ell_3(\theta_0) - \mathbb{E}[\ell_3(\theta_T)] \geq \sum_{t=0}^{T-1} \frac{\cos^2\gamma_t}{2L}\, \|\nabla \ell_3(\theta_t)\|^2, \quad (22)$$

where $\ell_3^{*}$ is the global optimum of $\ell_3$. When $T \to \infty$, we can find

$$\sum_{t=0}^{\infty} \frac{\cos^2\gamma_t}{2L}\, \|\nabla \ell_3(\theta_t)\|^2 < \infty. \quad (23)$$

According to the assumption that $\cos\gamma_t$ is bounded below by some fixed $\mu > 0$ for all but finitely many iterates (i.e., the angle between the adversarial and training gradients is less than 90°), we have convergence to a stationary point since $\sum_{t=0}^{\infty} \frac{\mu^2}{2L}\, \|\nabla \ell_3(\theta_t)\|^2 < \infty$. Therefore, we get

$$\lim_{t \to \infty} \|\nabla \ell_3(\theta_t)\| = 0. \quad (24)$$

Figure 1. Performance of overconfidence CP attacks on CIFAR-10 and CIFAR-100, and underconfidence CP attacks on Tiny-ImageNet. (Panels: (a) CIFAR-10, (b) CIFAR-100, (c) Tiny-ImageNet; each panel plots the set size reduction ratio, or expansion ratio for Tiny-ImageNet, against poison budgets of 0.5%–8% for our attacks and the RandUn/RandGa baselines under the HPS, APS, and RAPS conformal methods.)

Discussions on poisoning attacks against full conformal prediction. Full conformal prediction assumes that both the training and test data are exchangeable. Therefore, directly crafting perturbation-based poisoning samples would violate the data exchangeability assumption, which would increase the risk of being detected simply by checking the coverage results. One straightforward way is to inject exchangeable samples without perturbations.
However, such a method is limited both in attack effectiveness and in the availability of a large number of exchangeable points. To study the effects of poisoning attacks on full conformal prediction while maintaining validity, we can employ transfer learning-based attack settings (Shen et al., 2021; Shafahi et al., 2018), where the adversary has knowledge of a pre-trained model and the victim model is fine-tuned from this pre-trained model. Due to space limitations, more details about poisoning attacks against full conformal prediction can be found in the full version of this paper.

5. Experiments

In this section, we perform extensive experiments to validate our proposed poisoning attacks against conformal prediction. Due to space limitations, more experimental details and results (e.g., more datasets and attack scenarios for unequalized and unfair coverage subgroups) are given in the full version of this paper.

Datasets and models. In the experiments, we adopt the following image classification datasets: Tiny-ImageNet (Deng et al., 2009) and CIFAR-10/100 (Krizhevsky et al.). We consider various DNN models, including MobileNet-V2 (Sandler et al., 2018), ResNet-18 (He et al., 2016), VGG-16 (Simonyan & Zisserman, 2014), and a 5-layer ConvNet.

Baselines. As there is no existing work on data poisoning attacks against conformal prediction, in our experiments we adopt the RandUn and RandGa baselines to assess the effectiveness of the proposed poisoning strategies. Specifically, we use random uniform noise and Gaussian noise as poisoning perturbations for the RandUn and RandGa baselines, respectively.

Evaluation metrics. To evaluate attack effectiveness, we measure the set size reduction ratio, $(\text{set size}_{\text{benign}} - \text{set size}_{\text{victim}})/\text{set size}_{\text{benign}}$, and the set size expansion ratio, $(\text{set size}_{\text{victim}} - \text{set size}_{\text{benign}})/\text{set size}_{\text{benign}}$, of the target samples on the victim model. In addition, we analyze prediction consistency (whether the predicted labels are consistent) and the empirical coverage rate between the benign and victim models to show the stealthiness of our attacks.

Table 1. Set size reduction ratio of overconfidence CP attacks under data poisoning defenses.

| Defense method | HPS | APS | RAPS | RSCP |
| --- | --- | --- | --- | --- |
| Ours + No defense | 0.46 ± 0.02 | 0.47 ± 0.04 | 0.54 ± 0.04 | 0.49 ± 0.05 |
| Ours + MaxUp (Gong et al., 2021) | 0.34 ± 0.05 | 0.32 ± 0.07 | 0.49 ± 0.06 | 0.26 ± 0.07 |
| Ours + Adversarial Poisoning (Geiping et al., 2021a) | 0.39 ± 0.04 | 0.28 ± 0.09 | 0.41 ± 0.06 | 0.24 ± 0.07 |
| Ours + EPIC (Yang et al., 2022) | 0.38 ± 0.06 | 0.33 ± 0.09 | 0.42 ± 0.07 | 0.38 ± 0.10 |

The adopted conformal methods. In the experiments, we adopt the following popular conformal methods: RSCP (Gendler et al., 2021), an adversarially robust CP method against adversarial attacks; APS (Romano et al., 2020), designed to improve conditional coverage; RAPS (Angelopoulos et al., 2020), a regularized variant of APS for generating smaller sets; and HPS (Lei et al., 2013; Vovk et al., 2005), which relies on the softmax output.

Implementation details. In the experiments, we allocate 10% of the data for calibration and maintain a default coverage rate $(1 - \varepsilon)$ of 0.9. We limit the perturbation bound $\epsilon$ to 16/255. The poisoning attacks are implemented by training the models from scratch (Huang et al., 2020; Huai et al., 2020b), utilizing the SGD optimizer with a learning rate of 0.01 and a batch size of 128. We evaluate the attack results in each experiment by randomly sampling a target class. We generate poisons and evaluate them on 8 newly initialized victim models.
We repeat each experiment 10 times and report the means and standard errors.

5.1. Attack Performance against Conformal Prediction

In Figure 1, we present the performance of overconfidence CP attacks on CIFAR-10 and CIFAR-100, as well as underconfidence CP attacks on Tiny-ImageNet. We observe that our proposed attacks significantly outperform the RandUn and RandGa baselines in terms of set size reduction ratio and set size expansion ratio across various poison budgets. For example, consider Figure 1a, where overconfidence CP attacks are conducted on CIFAR-10 with HPS and RAPS. The benign set size of HPS is 2.0 (implying a maximum reduction ratio of 0.5 in order to obtain a set size of 1), and the benign set size of RAPS is about 2.84 (with a maximum reduction ratio of 0.64). Our proposed attacks achieve a reduction ratio of 0.48 with HPS and 0.54 with RAPS using a 2% poison budget, while the baselines achieve reduction ratios below 0.32. Therefore, our proposed attacks can effectively manipulate the uncertainty of CP and successfully trick the model into being overconfident or underconfident on target samples.

In addition, in Figure 2, we demonstrate the stealthiness of overconfidence CP attacks on CIFAR-10 with HPS. Our proposed attacks achieve high prediction consistency and empirical coverage rates similar to those of the benign model. This underscores the stealthiness of our attacks when targeting uncertainty in CP.

Figure 2. Stealthiness of overconfidence CP attacks on CIFAR-10. (Panels: (a) prediction consistency and (b) empirical coverage rate of the benign and victim models, plotted against poison budgets from 0.5% to 8%.)

5.2. Attack Performance under Data Poisoning Defenses

In this section, we explore the performance of our proposed attacks under existing data poisoning defenses. In Table 1, we report the set size reduction ratio of overconfidence CP attacks under MaxUp (Gong et al., 2021), Adversarial Poisoning (Geiping et al., 2021a), and EPIC (Yang et al., 2022), using a 2% poison budget on CIFAR-10. Specifically, MaxUp generates augmented data with random perturbations, aiming to minimize the worst-case loss of the augmented data. Adversarial Poisoning is a variant of adversarial training that builds a robust model using adversarially poisoned data. EPIC identifies and eliminates effective poison data in gradient space during training to prevent poisoning attacks. Notably, our proposed attacks remain effective even under these existing poisoning defenses, since we specifically target the nonconformity scores in our attack framework. For example, our attack still achieves a reduction ratio of 0.34 under MaxUp with HPS, compared to 0.46 without defense. Therefore, our proposed attacks demonstrate a satisfying set size reduction ratio across existing defense mechanisms, indicating the utility and effectiveness of our approach.

5.3. Ablation Study

First, we compare the performance and running time of overconfidence CP attacks with different optimizations against HPS on CIFAR-10.

Table 2. Set size reduction ratio and running time (min) of overconfidence CP attacks with varying optimizations.
| Poison budget | Ours first-order (reduction ratio) | Ours first-order (running time) | Ours second-order (reduction ratio) | Ours second-order (running time) | MetaPoison (Huang et al., 2020) (reduction ratio) | MetaPoison (running time) |
| --- | --- | --- | --- | --- | --- | --- |
| 0.5% | 0.36 ± 0.03 | 12.75 ± 0.09 | 0.38 ± 0.05 | 95.76 ± 0.37 | 0.31 ± 0.04 | 233.48 ± 0.38 |
| 1% | 0.39 ± 0.05 | 14.96 ± 0.25 | 0.40 ± 0.05 | 179.77 ± 0.96 | 0.36 ± 0.05 | 300.04 ± 1.93 |
| 2% | 0.46 ± 0.02 | 19.26 ± 0.14 | 0.48 ± 0.02 | 369.23 ± 4.81 | 0.38 ± 0.07 | 494.72 ± 2.35 |

Figure 3. Impact of selected loss (set size reduction ratio vs. poison budgets of 0.5%–2%, for our attack with the ℓ3 loss and with the ℓ2 loss).

Figure 4. Impact of benign set size (set size reduction ratio vs. benign set sizes of 2–4, for HPS, APS, RAPS, and RSCP).

Figure 5. Impact of perturbation bound (set size reduction ratio vs. perturbation bounds ϵ of 8/255–32/255, for HPS, APS, RAPS, and RSCP).

The results in Table 2 reveal that our proposed attacks, with both the first-order and second-order optimizations, achieve a significantly higher set size reduction ratio and require much less running time than the MetaPoison (Huang et al., 2020) optimization.

Next, we examine the performance of overconfidence CP attacks using the ℓ2 loss and the ℓ3 loss under various poison budgets in the practical black-box scenario. Note that, unlike ℓ2, ℓ3 considers the worst-case poisoned model. As shown in Figure 3, our attacks demonstrate significantly higher set size reduction ratios compared to the RandUn baseline. When comparing the two loss functions, we observe that under budgets of 0.5% and 1%, attacks employing the ℓ3 loss achieve reduction ratios that are higher by 0.01 and 0.02, respectively, than those with the ℓ2 loss. This indicates that we can obtain an improvement by taking the worst case of the model into account when conducting the poisoning attacks.

Furthermore, we conduct overconfidence CP attacks on varying benign set sizes, using a 2% poison budget on CIFAR-10. As shown in Figure 4, our proposed attacks consistently reduce the uncertainty for target samples across different benign set sizes. Typically, a larger prediction set implies more uncertainty and poses a greater challenge for attacks due to the need to manipulate more labels. Nonetheless, our attacks persist in showcasing their capability to reduce the set size. Our optimization approach specifically targets each nonconformity score associated with labels in the prediction set, ensuring that the attacked prediction set exclusively contains the predicted label, thereby reducing uncertainty.

Lastly, in Figure 5, we illustrate the performance of overconfidence CP attacks across different perturbation bounds employing the ℓ3 loss, using a 2% poison budget on CIFAR-10 in the practical black-box scenario. The results show that our proposed attacks generally achieve higher set size reduction ratios with larger perturbation bounds. Even with a small perturbation bound (e.g., 16/255), our proposed attacks exhibit remarkable performance. The reason is that, as the perturbation bound increases, the adversary has more space to adjust the features of victim samples, allowing them to explore a broader range and find perturbations that deceive the model more effectively.

6. Conclusion and Future Work

For the first time, to the best of our knowledge, we study in this paper the vulnerabilities of CP to data poisoning attacks, and devise a bi-level attack framework for crafting effective poisoning points in black-box scenarios. Specifically, in our proposed strategy, we first propose to compute the worst-case poisoned model before using it to update the poisoning points, so as to maintain a strong poisoning effect across various models and maximize the impact of our attacks.
Additionally, we also design approximate relaxations for handling the discrete uncertainty set sizes and the non-convex, non-differentiable quantile. Further, we introduce rigorous optimization methods that refine our strategies for efficiently creating effective poisoning points using closed-form updates, thus bypassing the need for full model retraining or complete dataset access. Our extensive experiments in both full and split CP settings demonstrate our attacks' effectiveness in manipulating uncertainty, surpassing traditional poisoning methods. Moreover, we discover that existing defenses are inadequate against our advanced attack strategies. In the future, we will extend our proposed attacks to a broader range of machine learning models, CP methods, and larger datasets. Notably, the attack strategies proposed in this paper could potentially be used by malicious users to attack real CP systems. To mitigate the potential negative consequences and impacts, we will design robust CP algorithms that can effectively defend against such poisoning attacks in our future work.

Impact Statement

In this paper, we introduce a novel class of data poisoning attacks tailored to compromise conformal prediction systems by manipulating the uncertainty estimate. This approach reveals vulnerabilities of conformal prediction, thereby shedding light on potential security breaches in the predicted conformal results for uncertainty estimation. Our results highlight the urgent need for further research to protect against such significant threats and improve the security and reliability of such uncertainty estimation methods.

References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318, 2016.
Alarab, I. and Prakoonwit, S. Adversarial attack for uncertainty estimation: identifying critical regions in neural networks. Neural Processing Letters, 54(3):1805–1821, 2022.
Andriushchenko, M. and Flammarion, N. Towards understanding sharpness-aware minimization. In International Conference on Machine Learning, pp. 639–668. PMLR, 2022.
Angelopoulos, A., Bates, S., Malik, J., and Jordan, M. I. Uncertainty sets for image classifiers using conformal prediction. arXiv preprint arXiv:2009.14193, 2020.
Chen, B., Carvalho, W., Baracaldo, N., Ludwig, H., Edwards, B., Lee, T., Molloy, I., and Srivastava, B. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728, 2018.
Chen, J. and Gu, Q. RayS: A ray searching method for hard-label adversarial attack. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1739–1747, 2020.
Cretu, G. F., Stavrou, A., Locasto, M. E., Stolfo, S. J., and Keromytis, A. D. Casting out demons: Sanitizing training data for anomaly sensors. In 2008 IEEE Symposium on Security and Privacy (SP 2008), pp. 81–95. IEEE, 2008.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
Fisch, A., Schuster, T., Jaakkola, T., and Barzilay, R. Few-shot conformal prediction with auxiliary tasks. In International Conference on Machine Learning, pp. 3329–3339. PMLR, 2021.
Fisch, A., Schuster, T., Jaakkola, T., and Barzilay, R. Conformal prediction sets with limited false positives. In International Conference on Machine Learning, pp. 6514–6532. PMLR, 2022.
Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
Geiping, J., Fowl, L., Somepalli, G., Goldblum, M., Moeller, M., and Goldstein, T. What doesn't kill you makes you robust(er): How to adversarially train against data poisoning. arXiv preprint arXiv:2102.13624, 2021a.
Geiping, J., Fowl, L. H., Huang, W. R., Czaja, W., Taylor, G., Moeller, M., and Goldstein, T. Witches' brew: Industrial scale data poisoning via gradient matching. In International Conference on Learning Representations, 2021b.
Gendler, A., Weng, T.-W., Daniel, L., and Romano, Y. Adversarially robust conformal prediction. In International Conference on Learning Representations, 2021.
Ghosh, S., Shi, Y., Belkhouja, T., Yan, Y., Doppa, J., and Jones, B. Probabilistically robust conformal prediction. In Uncertainty in Artificial Intelligence, pp. 681–690. PMLR, 2023.
Gluch, G. and Urbanke, R. Query complexity of adversarial attacks. In International Conference on Machine Learning, pp. 3723–3733. PMLR, 2021.
Gong, C., Ren, T., Ye, M., and Liu, Q. MaxUp: Lightweight adversarial training with data augmentation improves neural network training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2474–2483, 2021.
Hampel, F. R. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
Hobbhahn, M., Kristiadi, A., and Hennig, P. Fast predictive uncertainty for classification with Bayesian deep networks. In Uncertainty in Artificial Intelligence, pp. 822–832. PMLR, 2022.
Huai, M., Sun, J., Cai, R., Yao, L., and Zhang, A. Malicious attacks against deep reinforcement learning interpretations. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 472–482, 2020a.
Huai, M., Wang, D., Miao, C., Xu, J., and Zhang, A. Pairwise learning with differential privacy guarantees. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 694–701, 2020b.
Huai, M., Zheng, T., Miao, C., Yao, L., and Zhang, A. On the robustness of metric learning: an adversarial perspective. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(5):1–25, 2022.
Huang, W. R., Geiping, J., Fowl, L., Taylor, G., and Goldstein, T. MetaPoison: Practical general-purpose clean-label data poisoning. Advances in Neural Information Processing Systems, 33:12080–12091, 2020.
Humbert, P., Le Bars, B., Bellet, A., and Arlot, S. One-shot federated conformal prediction. In International Conference on Machine Learning, pp. 14153–14177. PMLR, 2023.
Jagielski, M., Severi, G., Pousette Harger, N., and Oprea, A. Subpopulation data poisoning attacks. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pp. 3104–3122, 2021.
Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885–1894. PMLR, 2017.
Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
Ledda, E., Angioni, D., Piras, G., Fumera, G., Biggio, B., and Roli, F. Adversarial attacks against uncertainty quantification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4599–4608, 2023.
Lei, J., Robins, J., and Wasserman, L. Distribution-free prediction sets. Journal of the American Statistical Association, 108(501):278–287, 2013.
Levine, A. and Feizi, S. Deep partition aggregation: Provable defenses against general poisoning attacks. In International Conference on Learning Representations, 2020.
Li, Y., Lyu, X., Koren, N., Lyu, L., Li, B., and Ma, X. Anti-backdoor learning: Training clean models on poisoned data. Advances in Neural Information Processing Systems, 34:14900–14912, 2021.
Lin, Z., Trivedi, S., and Sun, J. Conformal prediction with temporal quantile adjustments. Advances in Neural Information Processing Systems, 35:31017–31030, 2022.
Ling, R. F. Residuals and influence in regression, 1984.
Liu, Z., Wang, T., Huai, M., and Miao, C. Backdoor attacks via machine unlearning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 14115–14123, 2024.
Ma, Y., Zhu, X., and Hsu, J. Data poisoning against differentially-private learners: attacks and defenses. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 4732–4738, 2019.
Ndiaye, E. Stable conformal prediction sets. In International Conference on Machine Learning, pp. 16462–16479. PMLR, 2022.
Neekhara, P., Dolhansky, B., Bitton, J., and Ferrer, C. C. Adversarial threats to deepfake detection: A practical perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 923–932, 2021.
Peri, N., Gupta, N., Huang, W. R., Fowl, L., Zhu, C., Feizi, S., Goldstein, T., and Dickerson, J. P. Deep k-NN defense against clean-label data poisoning attacks. In Computer Vision – ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pp. 55–70. Springer, 2020.
Qian, W., Zhao, C., Le, W., Ma, M., and Huai, M. Towards understanding and enhancing robustness of deep learning models against malicious unlearning attacks. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1932–1942, 2023.
Qian, W., Zhao, C., Li, Y., Ma, F., Zhang, C., and Huai, M. Towards modeling uncertainties of self-explaining neural networks via conformal prediction. arXiv preprint arXiv:2401.01549, 2024.
Ren, Q., Deng, H., Chen, Y., Lou, S., and Zhang, Q. Bayesian neural networks avoid encoding complex and perturbation-sensitive concepts. In International Conference on Machine Learning, pp. 28889–28913. PMLR, 2023.
Romano, Y., Sesia, M., and Candès, E. J. Classification with valid and adaptive coverage. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
Schwarzschild, A., Goldblum, M., Gupta, A., Dickerson, J. P., and Goldstein, T. Just how toxic is data poisoning? A unified benchmark for backdoor and data poisoning attacks. In International Conference on Machine Learning, pp. 9389–9398. PMLR, 2021.
Shafahi, A., Huang, W. R., Najibi, M., Suciu, O., Studer, C., Dumitras, T., and Goldstein, T. Poison frogs! Targeted clean-label poisoning attacks on neural networks. Advances in Neural Information Processing Systems, 31, 2018.
Shen, L., Ji, S., Zhang, X., Li, J., Chen, J., Shi, J., Fang, C., Yin, J., and Wang, T. Backdoor pre-trained models can transfer to all. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pp. 3141–3158, 2021.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Steinhardt, J., Koh, P. W. W., and Liang, P. S. Certified defenses for data poisoning attacks. Advances in Neural Information Processing Systems, 30, 2017.
Stutz, D., Cemgil, A. T., Doucet, A., et al. Learning optimal conformal classifiers. arXiv preprint arXiv:2110.09192, 2021.
Suya, F., Mahloujifar, S., Suri, A., Evans, D., and Tian, Y. Model-targeted poisoning attacks with provable convergence. In International Conference on Machine Learning, pp. 10000–10010. PMLR, 2021.
Tao, L., Feng, L., Yi, J., Huang, S.-J., and Chen, S. Better safe than sorry: Preventing delusive adversaries with adversarial training. Advances in Neural Information Processing Systems, 34:16209–16225, 2021.
Teng, J., Wen, C., Zhang, D., Bengio, Y., Gao, Y., and Yuan, Y. Predictive inference with feature conformal prediction. In The Eleventh International Conference on Learning Representations, 2022.
Tran, B., Li, J., and Madry, A. Spectral signatures in backdoor attacks. Advances in Neural Information Processing Systems, 31, 2018.
Trinh, T. Q., Heinonen, M., Acerbi, L., and Kaski, S. Tackling covariate shift with node-based Bayesian neural networks. In International Conference on Machine Learning, pp. 21751–21775. PMLR, 2022.
Vovk, V., Gammerman, A., and Shafer, G. Algorithmic learning in a random world, volume 29. Springer, 2005.
Wang, K.-C., Vicol, P., Lucas, J., Gu, L., Grosse, R., and Zemel, R. Adversarial distillation of Bayesian neural network posteriors. In International Conference on Machine Learning, pp. 5190–5199. PMLR, 2018.
Wang, X., He, X., Wang, J., and He, K. Admix: Enhancing the transferability of adversarial attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16158–16167, 2021.
Wang, X., Li, Y., Xu, Z., and Luo, Y. Nested information representation of multi-dimensional decision: An improved PROMETHEE method based on NPLTSs. Information Sciences, 607:1224–1244, 2022.
Weber, M., Xu, X., Karlaš, B., Zhang, C., and Li, B. RAB: Provable robustness against backdoor attacks. In 2023 IEEE Symposium on Security and Privacy (SP), pp. 1311–1328. IEEE, 2023.
Wen, K., Ma, T., and Li, Z. How does sharpness-aware minimization minimize sharpness? arXiv preprint arXiv:2211.05729, 2022.
Wicker, M., Laurenti, L., Patane, A., and Kwiatkowska, M. Probabilistic safety for Bayesian neural networks. In Conference on Uncertainty in Artificial Intelligence, pp. 1198–1207. PMLR, 2020.
Yang, Y., Liu, T. Y., and Mirzasoleiman, B. Not all poisons are created equal: Robust training against data poisoning. In International Conference on Machine Learning, pp. 25154–25165. PMLR, 2022.
Yang, Z., He, X., Li, Z., Backes, M., Humbert, M., Berrang, P., and Zhang, Y. Data poisoning attacks against multimodal encoders. In International Conference on Machine Learning, pp. 39299–39313. PMLR, 2023.
Yuan, M., Wicker, M., and Laurenti, L. Gradient-free adversarial attacks for Bayesian neural networks. arXiv preprint arXiv:2012.12640, 2020.
Zhao, C., Qian, W., Li, Y., Li, W., and Huai, M. Rethinking adversarial robustness in the context of the right to be forgotten. 2023.
Zhao, C., Qian, W., Ying, R., and Huai, M. Static and sequential malicious attacks in the context of selective forgetting. Advances in Neural Information Processing Systems, 36, 2024.