Risk-controlling Prediction with Distributionally Robust Optimization

Published in Transactions on Machine Learning Research (08/2025)

Franck Iutzeler (franck.iutzeler@math.univ-toulouse.fr), Institut de Mathématiques de Toulouse, Université de Toulouse, CNRS, UPS, 31062, Toulouse, France

Adrien Mazoyer (adrien.mazoyer@math.univ-toulouse.fr), Institut de Mathématiques de Toulouse, Université de Toulouse, CNRS, UPS, 31062, Toulouse, France

Reviewed on OpenReview: https://openreview.net/forum?id=d9dl6DyJpJ

Conformal prediction is a popular paradigm for quantifying the uncertainty of a model's output on a new batch of data. Quite differently, distributionally robust optimization aims at training a model that is robust to uncertainties in the distribution of the training data. In this paper, we examine the links between the two approaches. In particular, we show that conformal prediction intervals can be learned by distributionally robust optimization of a well-chosen objective. This further makes it possible to train a model and build conformal prediction intervals all at once, using the same data.

1 Introduction

1.1 Background and contribution

During the last decade, the use of conformal inference to quantify the reliability of black-box methods and complex machine learning models has been widely studied (see e.g., Vovk et al., 2005; Angelopoulos and Bates, 2023; Fontana et al., 2023). In particular, Split Conformal Inference (SCI), as introduced by Papadopoulos et al. (2002), is a procedure that estimates the reliability of pre-trained models by observing their performance on calibration data. More formally, for input/output spaces $\mathcal{X}$ and $\mathcal{Y}$, the aim is to construct a suitable function $C : \mathcal{X} \to \mathcal{P}(\mathcal{Y})$, with output in the power set of $\mathcal{Y}$ denoted by $\mathcal{P}(\mathcal{Y})$, from the model $f : \mathcal{X} \to \mathcal{Y}$ and $n$ calibration points $(X_i, Y_i)_{i=1}^n$ in $\mathcal{X} \times \mathcal{Y}$.
In SCI, the objective is that for a new point $(X_{n+1}, Y_{n+1})$ and a miscoverage level $\alpha$ set by the user, the following holds:
$$\mathbb{P}\big[Y_{n+1} \in C(X_{n+1})\big] \geq 1 - \alpha, \qquad (1)$$
where the above probability is taken with respect to the joint distribution of $(X_i, Y_i)_{i=1}^{n+1}$. The main advantage of SCI is that it only requires exchangeability of the joint distribution of $(X_i, Y_i)_{i=1}^{n+1}$ (see the references above for more details), and still gives finite-sample guarantees.

We consider here a paradigm related to SCI, referred to as risk-controlling prediction by Bates et al. (2021). The controlled risk corresponds to $R(T_\lambda) := \mathbb{E}[L(Y_{n+1}, T_\lambda(X_{n+1}))]$, where $T_\lambda : \mathcal{X} \to \mathcal{P}(\mathcal{Y})$ is a parameterized set-valued predictor and $L : \mathcal{Y} \times \mathcal{P}(\mathcal{Y}) \to \mathbb{R}_+$ is a loss that evaluates the quality of $T_\lambda$. The objective here is to ensure that, given some $\hat\lambda_n$ depending on the calibration data, the risk $R(T_{\hat\lambda_n}) := \mathbb{E}[L(Y_{n+1}, T_{\hat\lambda_n}(X_{n+1})) \mid (X_i, Y_i)_{i=1}^n]$ remains, with high probability, below some target level. More formally, we aim to guarantee, for levels $\alpha$ and $\delta$ chosen by the user, that
$$\mathbb{P}\big[R(T_{\hat\lambda_n}) \leq \alpha\big] \geq 1 - \delta, \qquad (2)$$
where the above probability is over the joint distribution of $(X_i, Y_i)_{i=1}^n$. The only requirements for this guarantee are that the calibration points $(X_i, Y_i)_{i=1}^n$ be independent and identically distributed (i.i.d.) and that a concentration inequality be available in order to bound $R(T_\lambda)$ by some $\alpha$ with probability $1 - \delta$.

We develop in this paper the idea that an interesting alternative to concentration inequalities in this context can be obtained through the formalism of distributionally robust optimization (DRO). Indeed, the principle of distributional robustness is to design an uncertainty neighborhood of distributions $\mathcal{U}(P_n)$ around the empirical distribution of the calibration data $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{(X_i, Y_i)}$.
Then, if the distribution of $(X_{n+1}, Y_{n+1})$ belongs to $\mathcal{U}(P_n)$ with probability $1 - \delta$, (2) holds with $\alpha = \alpha(\hat\lambda_n)$, where
$$\alpha(\lambda) = \sup_{Q \in \mathcal{U}(P_n)} \mathbb{E}_{(X,Y) \sim Q}\big[L(Y, T_\lambda(X))\big] \qquad (3)$$
is a distributionally robust version of the conformal loss. Conversely, starting from the user-chosen risk level $\alpha$, choosing $\hat\lambda_n$ so that $\alpha(\hat\lambda_n) \leq \alpha$ leads to the sought guarantee (2). The design effort for the upper bound is thus deferred to the construction of $\mathcal{U}(P_n)$.

Conformal methods mainly rely on exchangeability assumptions on $(X_i, Y_i)_{i=1,\dots,n+1}$, which may be violated, typically when the calibration points are i.i.d. according to a distribution different from that of the test point. Interestingly, the distributional robustness approach presented above does not rely on such assumptions but only on the probability that the distribution of $(X_{n+1}, Y_{n+1})$ belongs to $\mathcal{U}(P_n)$.

In this paper, we consider the paradigm of Risk-Controlling Prediction Sets (RCPS), i.e., (2), where the upper bound on the risk is obtained by Wasserstein distributionally robust optimization (WDRO), i.e., (3) with $\mathcal{U}(P_n)$ a ball of radius $\rho$ around $P_n$ in the Wasserstein distance. We demonstrate that the procedure exposed by Bates et al. (2021) remains valid provided that $\rho$ is adequately chosen, and that the test distribution can additionally suffer a shift, in Wasserstein distance, compared to the training distribution. In Section 2, we recall the RCPS methodology and connect it to distributional robustness. Then, in Section 3, we leverage tools from Wasserstein distributionally robust optimization to show the following result (stated below informally):

Theorem 1 (informal). Let $(X_i, Y_i)_{i=1}^n \sim P^{\otimes n}$ and $(X_{n+1}, Y_{n+1}) \sim P^\star$. For any $\alpha > 0$ and any $\delta \in (0, 1)$, there exist $\rho(n, \delta)$ and $\hat\lambda_n$ (chosen so that $\alpha(\hat\lambda_n) \leq \alpha$, where $\alpha(\cdot)$ is defined in (3)) such that for any $\rho \geq \rho(n, \delta)$,
$$\mathbb{P}\big[R(T_{\hat\lambda_n}) \leq \alpha\big] \geq 1 - \delta \quad \text{whenever } W_2(P, P^\star) \leq \rho - \rho(n, \delta).$$

This theoretically enables the use of WDRO to produce upper bounds in RCPS.
Interestingly, this method is pertinent for any value of $\alpha$ (see Section 3.2.3) and is compatible with jointly training the predictor and the conformal prediction sets on the same dataset (see Section 3.5). Finally, we illustrate its encouraging performance in practice in Section 4.

1.2 Related works

The link between classical SCI and RCPS can be made through the work of Gupta et al. (2022), who reformulated Conformal Inference (CI) in terms of nested prediction sets, as well as through Conformal Risk Control (CRC), introduced by Angelopoulos et al. (2024), where the guarantee (2) is expressed in expectation, that is, $\mathbb{E}[R(T_{\hat\lambda_n})] \leq \alpha$. Note that guarantee (1) is then retrieved by taking the loss $L(Y_{n+1}, T_{\hat\lambda_n}(X_{n+1})) = \mathbf{1}_{Y_{n+1} \notin T_{\hat\lambda_n}(X_{n+1})}$, where the expression of $\hat\lambda_n$ depends on the calibration data (see Bates et al. (2021, App. A) for more details).

Distribution shift in CI has recently been widely studied, for different types of shift. Tibshirani et al. (2019) studied the covariate shift case (i.e., the $X$ distribution differs between calibration and test points, but not the $Y|X$ distribution) by introducing the notion of weighted exchangeability. This approach was generalized to the CRC context by Angelopoulos et al. (2024, Sec. 4.1). Podkopaev and Ramdas (2021) used similar arguments in order to tackle the label shift case (i.e., the $Y$ distribution differs but not the $X|Y$ distribution). Barber et al. (2023) studied the consequences of such distribution shifts by quantifying the coverage gap in terms of total variation distance. The online time-series case has also been considered in the works of Gibbs and Candès (2021; 2024); the former has been adapted to the CRC context by Feldman et al. (2023). Cauchois et al. (2024) studied shifts in terms of $f$-divergence, which is the closest context to ours, and less restrictive than the other cases mentioned here.
It is also worth noticing that our results are close to, yet different from, the conformal prediction with adversarial shift presented by Bates et al. (2021, Sec. 6.3) as a possible extension, since we consider shifts of the data distribution rather than pointwise shifts of the training set. That adversarial context was studied for SCI by Gendler et al. (2022), followed by Yan et al. (2024). Very recently, Aolaritei et al. (2025) proposed to address the problem of distribution shifts in conformal prediction by considering ambiguity sets of distributions (similar to what we denote by $\mathcal{U}(P_n)$) based on the Lévy-Prokhorov pseudo-metric (where we use the Wasserstein distance).1 Their results are thus complementary to ours.

2 Risk-controlled prediction sets and Distributional Robustness

In this section, we focus on the RCPS paradigm. In the third part of the section, we connect it with the notion of distributional robustness. We suppose we have access to a sample $(X_i, Y_i)_{i=1}^n$ of i.i.d. couples of random variables with probability distribution $P_0$ on $\mathcal{X} \times \mathcal{Y}$. We seek to provide conformal prediction sets around a predictor $f : \mathcal{X} \to \mathcal{Y}'$ (the predictor output space $\mathcal{Y}'$ is usually equal to $\mathcal{Y}$, but in some contexts such as classification, it can be equal to the simplex over $\mathcal{Y}$). For the moment, we consider the predictor to be fixed (e.g., learned by an external procedure); we come back to predictor training in Section 3.5.

We recall here the notion of nested prediction sets as described by Bates et al. (2021). We consider a class of parametric sets of the form
$$T : \Lambda \times \mathcal{X} \to \mathcal{P}(\mathcal{Y}), \qquad (\lambda, x) \mapsto T_\lambda(x),$$
where $\mathcal{P}(\mathcal{Y})$ denotes the power set of $\mathcal{Y}$ and $\Lambda$ is a closed subset of $\mathbb{R}$. We assume that these prediction sets are nested, i.e., that
$$[\lambda_1 \leq \lambda_2] \implies [T_{\lambda_1}(x) \subseteq T_{\lambda_2}(x) \text{ for all } x \in \mathcal{X}]. \qquad (4)$$
Typically, these sets are constructed from predictor outputs, as detailed in the examples below.
Finally, we define the loss for the conformal prediction as
$$L : \mathcal{Y} \times \mathcal{P}(\mathcal{Y}) \to \mathbb{R}_+, \qquad (y, S) \mapsto L(y, S),$$
which we will often use in conjunction with the parametric sets above by considering $L(y, T_\lambda(x))$. Moreover, for practical convenience and in line with the applications of interest, we choose it monotone in its second argument, i.e.,
$$[S \subseteq S'] \implies [L(y, S) \geq L(y, S') \text{ for all } y \in \mathcal{Y}]. \qquad (5)$$
The goal in conformal prediction is to use the available samples to build prediction sets around the predictor output such that the conformal prediction loss is low in expectation over the unknown distribution from which the samples are generated. To do so, we rely on the following assumption.

1 The authors comment on the links between the Lévy-Prokhorov pseudo-metric and the type-$\infty$ Wasserstein distance, which is very different from the type-2 Wasserstein distance we employ, both in terms of statistical guarantees and computational issues.

Assumption 2. The following properties hold:
i) the parametric sets $\{T_\lambda\}_{\lambda \in \Lambda}$ satisfy the nesting property (4);
ii) the conformal loss $L$ is non-negative, upper-semicontinuous, and satisfies the monotonicity property (5);
iii) the pointwise limit of the function $(\lambda, (x, y)) \mapsto L(y, T_\lambda(x))$ as $\lambda \to \sup(\Lambda)$ is the null function;
iv) the function $(x, y) \mapsto L(y, T_\lambda(x))$ is upper-bounded uniformly in $\lambda$.

Assumption 2-i) is the classical nesting property of conformal prediction. Item ii) is also standard and fits the vast majority of the target use cases (the upper-semicontinuity assumption can be relaxed for most results of the paper). Note also that non-monotone losses could be considered, using procedures described by Angelopoulos et al. (2024).2 As for item iii), it is slightly different from, e.g., Bates et al.
(2021), which assumes the existence of some $\lambda_{\max}$ such that $\mathbb{E}_{(X,Y) \sim P^\star}[L(Y, T_{\lambda_{\max}}(X))] = 0$ (as well as for all greater $\lambda$, by monotonicity of $L$); we rely on this modified assumption mainly to avoid complications about distribution supports in later developments. Notice that item iv) is actually not mandatory for RCPS: it can be relaxed into assuming that the coefficient of variation of the empirical risk (6) is bounded, in order to consider unbounded losses (see Bates et al. (2021, Sec. 3.2)). However, it will be required when introducing the connection with Wasserstein distributionally robust optimization in Section 3. Moreover, without any loss of generality, we will assume that losses are bounded by 1.

Example 2.1 (Regression). Take $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \mathbb{R}$. Typical predictors include (pretrained) linear models ($f_\theta(x) = \langle \theta, x \rangle$), neural networks, etc. A simple class of parametric sets fitting our framework is $T_\lambda(x) = \{y \in \mathcal{Y} : |y - f_\theta(x)| \leq \lambda\}$ (in this setting, $\mathcal{Y}' = \mathcal{Y} = \mathbb{R}$). Doing so, the conformal loss can be defined from the 0/1 loss as $L(y, T_\lambda(x)) = \mathbf{1}_{y \notin T_\lambda(x)} = \mathbf{1}_{|y - f_\theta(x)| > \lambda}$.

Example 2.2 (Multi-class Classification). Take $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{1, 2, \dots, c\}$. Usual predictors once again include (pretrained) linear models, neural networks, etc. However, in this setup, in order to get meaningful prediction sets, it is usual to work with predictors outputting the probability of each class rather than a single class, so $\mathcal{Y}'$ is the simplex over the $c$ alternatives (this is not really restrictive since this is exactly the output of a linear predictor or a neural network after the softmax but before thresholding). A possible class of parametric sets is $T_\lambda(x) = \{y \in \mathcal{Y} : f_\theta(x)[y] \geq 1 - \lambda\}$ (the most probable classes, in the sense of a predicted probability greater than the threshold), associated with a loss of the form $L(y, T_\lambda(x)) = \mathbf{1}_{y \notin T_\lambda(x)} = \mathbf{1}_{f_\theta(x)[y] < 1 - \lambda}$.

2.2 Risk-controlled prediction sets

The objective of risk-controlled prediction as described by Bates et al.
(2021) is to find some $\lambda$ that gives a low loss for a new point $(X, Y)$ distributed according to some distribution $P^\star$. More precisely, we want to ensure that $R(\lambda) := \mathbb{E}_{(X,Y) \sim P^\star}[L(Y, T_\lambda(X))]$ is small. Note that the expectation is taken over the distribution of the new sample, $P^\star$, which is a priori unknown. Importantly, $T_\lambda$ (and thus $(x, y) \mapsto L(y, T_\lambda(x))$) does not depend on $(X_i, Y_i)_{i=1}^n$, but does depend on $\lambda$, which is a fixed parameter. However, in what follows, the choice of $\lambda$ may depend on $(X_i, Y_i)_{i=1}^n$, in which case it will be denoted by $\hat\lambda_n$ to avoid confusion. As a proxy, the empirical risk obtained from the data is an entry point for computations:
$$\widehat{R}_n(\lambda) := \mathbb{E}_{(X,Y) \sim P_n}\big[L(Y, T_\lambda(X))\big] = \frac{1}{n}\sum_{i=1}^n L(Y_i, T_\lambda(X_i)), \qquad (6)$$
where $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{(X_i, Y_i)}$ denotes the empirical distribution of the calibration data. Whenever the samples $(X_i, Y_i)_{i=1}^n$ are i.i.d. and, even more importantly, $P_0 = P^\star$, the following rationale has been proposed by Bates et al. (2021):

Step A. Given a confidence level $1 - \delta \in (0, 1)$, consider a (random) function $\widehat{R}^+_n(\lambda)$ verifying
$$\forall \lambda \in \Lambda, \quad \mathbb{P}_{(X_i,Y_i)_{i=1}^n \sim P_0^{\otimes n}}\big[R(\lambda) \leq \widehat{R}^+_n(\lambda)\big] \geq 1 - \delta; \qquad (7)$$
this is typically obtained by a concentration inequality (e.g., Hoeffding's).

Step B. Given a level $\alpha \in (0, 1)$, find a random value $\hat\lambda_n$ (depending on $(X_i, Y_i)_{i=1}^n$) such that
$$\forall \lambda \in \Lambda, \quad \lambda \geq \hat\lambda_n \implies \widehat{R}^+_n(\lambda) \leq \alpha, \qquad (8)$$
by using a 1D search.

Putting it all together, we obtain that
$$\mathbb{P}_{(X_i,Y_i)_{i=1}^n \sim P_0^{\otimes n}}\big[R(\hat\lambda_n) \leq \alpha\big] \geq 1 - \delta, \qquad (9)$$
which, in words, means that we have a low true risk with high probability (this last probability reflecting the fact that $\hat\lambda_n$ depends on $(X_i, Y_i)_{i=1}^n$ and is thus random).

2 At a cost, though: this might lead to larger prediction sets as the gap to monotonicity of the loss increases.

Example 2.3 (Regression (cont.)).
With the notation of Example 2.1, and relying on the fact that $P_0 = P^\star = P$, Equation (9) guarantees that
$$\mathbb{P}_{(X_i,Y_i)_{i=1}^n \sim P^{\otimes n}}\Big[\mathbb{P}_{(X,Y) \sim P}\big[|Y - f_\theta(X)| < \hat\lambda_n\big] > 1 - \alpha\Big] \geq 1 - \delta,$$
where we recover our double guarantee on the conformal interval: with respect to the randomness of the new point and of the data sample.

2.3 Distributionally Robust Optimization

The empirical risk (6) can lead to over-confident decisions (lower risk on training samples than in test) and be sensitive to distribution shifts between training and application (see e.g., Esfahani and Kuhn, 2018). In our context, this means that the empirical risk may not be a valid upper bound on the true risk in the sense of (7) (hence the need for an auxiliary function $\widehat{R}^+_n(\lambda)$). To overcome these drawbacks, an approach gaining momentum in machine learning is distributionally robust optimization, which consists in minimizing the worst-case expectation of the loss when the distribution lives in a neighborhood of $P_n$. In our context, distributionally robust optimization amounts to solving the problem
$$\sup_{Q \in \mathcal{U}_\rho(P_n)} \mathbb{E}_{(X,Y) \sim Q}\big[L(Y, T_\lambda(X))\big], \qquad (10)$$
where the supremum is taken over all probability distributions on $\Xi := \mathcal{X} \times \mathcal{Y}$ that are close to $P_n$, i.e.,
$$\mathcal{U}_\rho(P_n) := \{Q \in \mathcal{P}(\Xi) : \mathrm{dist}(P_n, Q) \leq \rho\},$$
where $\rho > 0$ controls the required level of robustness around $P_n$ and $\mathrm{dist}$ denotes any metric on probability distributions. Popular choices of distribution neighborhoods are based on the Kullback-Leibler (KL) divergence (Laguel et al., 2020; Levy et al., 2020), kernel tools (Zhu et al., 2021a; Staib and Jegelka, 2019; Zhu et al., 2021b), moments (Delage and Ye, 2010; Goh and Sim, 2010), or the Wasserstein distance (Shafieezadeh-Abadeh et al., 2015; Esfahani and Kuhn, 2018). An important motivation for distributionally robust models is that if the true distribution belongs to the robustness neighborhood, the method directly benefits from generalization guarantees. This is why a careful choice of the neighborhood is necessary.
Indeed, if $P^\star \in \mathcal{U}_\rho(P_n)$, then we directly get that
$$R(\lambda) := \mathbb{E}_{(X,Y) \sim P^\star}\big[L(Y, T_\lambda(X))\big] \leq \sup_{Q \in \mathcal{U}_\rho(P_n)} \mathbb{E}_{(X,Y) \sim Q}\big[L(Y, T_\lambda(X))\big]. \qquad (11)$$
This attractive property is also highly connected to the choice of the metric on probability distributions. For instance, using the KL divergence (i.e., taking $\mathcal{U}_\rho(P_n) = \{Q \in \mathcal{P}(\Xi) : \mathrm{KL}(P_n, Q) \leq \rho\}$) has been studied in detail in the literature as a natural choice leading to tractable formulations; see e.g., Ben-Tal et al. (2013); Hu and Hong (2013); Namkoong and Duchi (2016) and Wang et al. (2023, Rem. 5). However, it imposes that the support of $P^\star$ be included in that of $P_n$ if one wants $P^\star \in \mathcal{U}_\rho(P_n)$, since the KL divergence between two distributions is infinite as soon as the supports are disjoint. Thus, this approach amounts to reweighting the samples with higher conformal loss so that the newly obtained empirical conformal loss is higher and can act as an upper bound. While this approach does not grant us the same guarantees, we note that the idea of reweighting samples to account for distribution shifts provides a nice connection with the notion of weighted conformal prediction for covariate shifts; see Tibshirani et al. (2019) and Cauchois et al. (2024) in light of Hu and Hong (2013, Sec. 2). In this paper, we focus on ambiguity sets relying on the Wasserstein distance, which naturally metrizes the convergence of measures,3 a sought-after feature in this context. This is the topic of the next section.

3 Risk-controlled prediction sets using Wasserstein Distributionally Robust Optimization

In this section, we propose to use a Wasserstein distributionally robust version of the empirical risk as our high-probability upper bound on the true risk, in the form of (7). We investigate the tractability and the statistical guarantees that can be offered by WDRO RCPS.
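Before specializing to WDRO, the generic two-step rationale of Section 2.2 can be sketched in code. The sketch below is a toy illustration of ours (function and variable names are not from the paper): Step A uses a Hoeffding upper bound for losses in $[0, 1]$, and Step B replaces a generic 1D search by a downward grid scan, which is valid because the loss is monotone in $\lambda$.

```python
import numpy as np

def rcps_lambda(losses_at, lambdas, alpha, delta):
    """Steps A and B of the RCPS rationale with a Hoeffding upper bound.

    losses_at(lam) must return the n per-sample losses, in [0, 1], of the
    candidate set T_lam on the calibration points.  Step A: the bound
    mean + sqrt(log(1/delta) / (2 n)) upper-bounds the true risk R(lam)
    with probability 1 - delta, pointwise in lam.  Step B: scan lambda
    from large to small and keep the smallest feasible value.
    """
    lam_hat = None
    for lam in sorted(lambdas, reverse=True):
        losses = np.asarray(losses_at(lam))
        slack = np.sqrt(np.log(1.0 / delta) / (2.0 * len(losses)))
        if losses.mean() + slack <= alpha:
            lam_hat = lam      # upper bound holds, try a smaller lambda
        else:
            break              # monotonicity: smaller lambda is only worse
    return lam_hat

# Toy usage: residuals of a hypothetical regressor with the 0/1 loss of Example 2.1.
residuals = np.linspace(0.0, 1.0, 100)
losses_at = lambda lam: (residuals > lam).astype(float)
lam_hat = rcps_lambda(losses_at, np.linspace(0.0, 1.0, 101), alpha=0.3, delta=0.1)
```

Any finer 1D search (e.g., bisection) could replace the grid scan; the guarantee (9) only requires that the returned value satisfies (8).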
We recall that:
- we denote by $P^\star$ the unknown true distribution of $(X, Y)$, so that the true risk is $R(\lambda) := \mathbb{E}_{(X,Y) \sim P^\star}[L(Y, T_\lambda(X))]$;
- we have access to $n$ samples $(X_i, Y_i)_{i=1}^n$, i.i.d. with probability distribution $P_0$ (different from $P^\star$ in general);
- we denote by $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{(X_i, Y_i)}$ the empirical data distribution, so that the empirical risk is $\widehat{R}_n(\lambda) := \mathbb{E}_{(X,Y) \sim P_n}[L(Y, T_\lambda(X))] = \frac{1}{n}\sum_{i=1}^n L(Y_i, T_\lambda(X_i))$.

3.1 Wasserstein distributionally robust optimization

Wasserstein distributionally robust optimization corresponds to Problem (10) with distribution neighborhoods formulated in terms of an optimal transport distance, i.e.,
$$\mathcal{U}_\rho(P_n) := \{Q \in \mathcal{P}(\Xi) : W_2(P_n, Q) \leq \rho\},$$
where $W_2(\cdot, \cdot)$ denotes the (type-2) Wasserstein distance4 defined as
$$W_2(Q, Q') := \inf_{\pi \in \mathcal{P}(\Xi \times \Xi),\, \pi_1 = Q,\, \pi_2 = Q'} \Big(\mathbb{E}_{(\xi, \zeta) \sim \pi}\|\xi - \zeta\|^2\Big)^{1/2},$$
where $\mathcal{P}(\Xi \times \Xi)$ is the set of probability distributions on the product space $\Xi \times \Xi$, and $\pi_1$ (resp. $\pi_2$) denotes the first (resp. second) marginal of $\pi$. The Wasserstein distance is a natural metric for comparing discrete and absolutely continuous probability distributions, and its use for distributional robustness has attracted a lot of attention; see e.g., (Shafieezadeh-Abadeh et al., 2015; Sinha et al., 2018; Shafieezadeh-Abadeh et al., 2019; Li et al., 2020; Kwon et al., 2020) and the review articles (Blanchet et al., 2021; Kuhn et al., 2024).

3 While this motivation is valid and provides a good intuition, we note that having $P^\star \in \mathcal{U}_\rho(P_n)$ may be slightly too demanding compared to verifying (11) directly. We will come back to this point in more detail in the next sections.

4 We focus here on the type-2 distance for simplicity. Actually, the transport cost could easily be modified in several parts of the paper to accommodate, for instance, discrete sets for classification problems. The results still hold without modification, except for Lemma 4, which has to be adapted accordingly (see Gao and Kleywegt (2023, Rem. 2)).
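For intuition on the $W_2$ ball, here is a minimal stdlib-only sketch of ours (not from the paper) computing $W_2$ between two one-dimensional uniform empirical measures with the same number of atoms. In one dimension the sorted (monotone) matching is an optimal coupling, so the infimum over couplings has a closed form; on a general $\Xi = \mathcal{X} \times \mathcal{Y}$ the full optimal-transport problem must be solved instead.

```python
import math

def w2_1d(xs, ys):
    """Type-2 Wasserstein distance between two uniform empirical measures
    on the real line with the same number of atoms.  In 1D, sorting both
    samples and matching them in order is an optimal coupling, so
    W2^2 = (1/n) * sum_i (x_(i) - y_(i))^2 over the sorted samples."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs))
```

In particular, translating every atom by $t$ yields a distance of exactly $|t|$, which matches the intuition that the radius $\rho$ measures how far the test distribution may drift from $P_n$.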
3.2 Construction of WDRO risk-controlled prediction sets

Let us now construct risk-controlled prediction sets using Wasserstein distributionally robust optimization problems, following the rationale of Section 2.2.

3.2.1 Preliminaries

In this section, we will use as an upper bound on the true risk the Wasserstein distributionally robust empirical risk defined as
$$\widehat{R}^\rho_n(\lambda) := \sup_{Q \in \mathcal{P}(\Xi) : W_2(P_n, Q) \leq \rho} \mathbb{E}_{(X,Y) \sim Q}\big[L(Y, T_\lambda(X))\big]. \qquad (12)$$
Now, in order for this proxy to have the desired properties, such as finiteness, we have to introduce adequate assumptions.

Assumption 3. Assume that $\widehat{R}_n(\lambda) < +\infty$ for all $\lambda \in \Lambda$ and that either of the following properties holds:
a) the set $\mathcal{X} \times \mathcal{Y}$ is bounded and the function $(x, y) \mapsto L(y, T_\lambda(x))$ is upper-bounded uniformly in $\lambda$;
b) the set $\mathcal{X} \times \mathcal{Y}$ is unbounded and there exist finite constants $M, M' > 0$, independent of $\lambda$, such that $L(y, T_\lambda(x)) \leq M + M'(\|x\|^{2-\varepsilon} + \|y\|^{2-\varepsilon})$ for some $\varepsilon \in (0, 2)$.

These assumptions are rather mild, as conformal losses are usually bounded (especially in light of Assumption 2-iv)). This implies that $\widehat{R}^\rho_n(\lambda)$ is finite by classical results of the WDRO literature (e.g., Gao and Kleywegt (2023)).

Lemma 4. Let Assumption 3 hold. Then, for all $\lambda \in \Lambda$ and all $\rho \geq 0$, $\widehat{R}^\rho_n(\lambda) < +\infty$.

Proof. This directly follows from Gao and Kleywegt (2023, Th. 1), as Assumption 3 guarantees that $(x, y) \mapsto L(y, T_\lambda(x))$ belongs to $L^1(P_n)$ and that the growth rate (see Gao and Kleywegt (2023, Def. 4)) is finite.

3.2.2 Step A

In order to build upper bounds of the form (7), one has to guarantee that $\widehat{R}^\rho_n(\lambda)$ is a valid upper bound for our goal, provided that the robustness radius is chosen adequately. Since we consider a Wasserstein distributionally robust upper bound, we can directly rely on measure concentration results in $W_2$ distance (e.g., Fournier and Guillin (2015)).
This naturally provides a candidate radius that has the merit of being valid uniformly in $\lambda$ (and in the base model), as it gives a probability for $P_0$ to be included in the constraint set of the definition of $\widehat{R}^\rho_n(\lambda)$ in (12).

Lemma 5. Suppose that $\mathbb{E}_{\xi \sim P_0} \exp(\gamma\|\xi\|^\beta) < \infty$ for some $\beta > 2$, $\gamma > 0$. For any $\delta \in (0, 1)$, there exists $\rho(n, \delta) = O\big((n^{-1}\log(\delta^{-1}))^{2/\max(4,d)}\big)$ such that for any $\rho \geq \rho(n, \delta)$, we have
$$\mathbb{P}_{(X_i,Y_i)_{i=1}^n \sim P_0^{\otimes n}}\big[\forall \lambda \in \Lambda,\ R(\lambda) \leq \widehat{R}^\rho_n(\lambda)\big] \geq 1 - \delta \qquad (13)$$
whenever $W_2(P_0, P^\star) \leq \rho - \rho(n, \delta)$. In particular, if $P_0 = P^\star$, Eq. (13) holds as soon as $\rho \geq \rho(n, \delta)$.

Proof. From Fournier and Guillin (2015, Th. 2), case (1), we have that if $\mathcal{E}_{\beta,\gamma} := \mathbb{E}_{\xi \sim P_0}\exp(\gamma\|\xi\|^\beta) < \infty$ for some $\beta > 2$, $\gamma > 0$, then for any $x > 0$,
$$\mathbb{P}_{(X_i,Y_i)_{i=1}^n \sim P_0^{\otimes n}}\big[W_2(P_n, P_0) < \sqrt{x}\big] \geq 1 - \begin{cases} C\exp(-cnx) & \text{if } d < 4 \text{ and } x \leq 1,\\ C\exp\big(-cnx/\log(2 + 1/\sqrt{x})^2\big) & \text{if } d = 4 \text{ and } x \leq 1,\\ C\exp(-cnx^{d/4}) & \text{if } d > 4 \text{ and } x \leq 1,\\ C\exp(-cnx^{\beta/4}) & \text{otherwise,} \end{cases} \qquad (14)$$
where the constants $c$ and $C$ only depend on $d$, $\beta$, $\gamma$ and $\mathcal{E}_{\beta,\gamma}$. Fix $\delta \in (0, 1)$. Then, with
$$\rho(n, \delta)^2 = \begin{cases} \dfrac{\log(C/\delta)}{cn} & \text{if } d \leq 4 \text{ and } n \geq \dfrac{\log(C/\delta)}{c},\\[4pt] \Big(\dfrac{\log(C/\delta)}{cn}\Big)^{4/d} & \text{if } d > 4,\\[4pt] \Big(\dfrac{\log(C/\delta)}{cn}\Big)^{4/\beta} & \text{otherwise,} \end{cases}$$
we have $\mathbb{P}_{(X_i,Y_i)_{i=1}^n \sim P_0^{\otimes n}}[W_2(P_n, P_0) < \rho(n, \delta)] \geq 1 - \delta$. In turn, if $W_2(P_0, P^\star) \leq \rho - \rho(n, \delta)$, then with probability $1 - \delta$, $W_2(P^\star, P_n) \leq \rho$ and thus
$$\forall \lambda \in \Lambda, \quad R(\lambda) = \mathbb{E}_{(X,Y) \sim P^\star}\big[L(Y, T_\lambda(X))\big] \leq \sup_{Q : W_2(Q, P_n) \leq \rho} \mathbb{E}_{(X,Y) \sim Q}\big[L(Y, T_\lambda(X))\big] = \widehat{R}^\rho_n(\lambda),$$
which is the claimed result.

This result validates the use of WDRO for Step A of the production of risk-controlled prediction sets. One disadvantage of the approach is that the uncertainty radius is not easy to choose, as it depends non-trivially on the problem's parameters.

Remark 6 (Conformal guarantees on the WDRO problem). Flipping the viewpoint, the above result means that for any choice of radius $\rho > 0$, the Wasserstein distributionally robust conformal prediction problem (12) provides a natural guarantee of the form of Eq. (13) with probability $1 - \delta$, where $\delta = C\exp(-cn\rho^{2p})$ with $p \in [1, \max(d, \beta)/4]$ depending on the value of $d$ and $\rho$ (see Eq. (14)).
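To make the case split in the proof of Lemma 5 concrete, the sketch below (ours) evaluates the candidate radius for given $n$, $\delta$, dimension $d$ and tail exponent $\beta$, by solving $\delta = C\exp(-cnx^e)$ for the squared radius $x$ and taking the square root. The constants $c$ and $C$ are problem-dependent and not explicitly computable in general, so the defaults here are placeholders, and the logarithmic correction at $d = 4$ is ignored.

```python
import math

def candidate_radius(n, delta, d, beta, c=1.0, C=1.0):
    """Candidate WDRO radius rho(n, delta) from the concentration bound (14).

    Solve delta = C * exp(-c * n * x^e) for the squared radius x with
    exponent e = 1 (d <= 4) or d/4 (d > 4); if the solution leaves the
    small-radius regime (x > 1), switch to the tail exponent beta/4.
    The constants c and C are placeholders (problem-dependent in general).
    """
    t = math.log(C / delta) / (c * n)   # base quantity log(C/delta) / (c n)
    if d <= 4:
        x = t                            # squared radius ~ n^{-1}
    else:
        x = t ** (4.0 / d)               # curse of dimensionality for d > 4
    if x > 1.0:                          # large-deviation regime of (14)
        x = t ** (4.0 / beta)
    return math.sqrt(x)
```

The qualitative behavior matches the discussion of Section 3.3: the radius shrinks with $n$, grows as $\delta \to 0$, and shrinks much more slowly when $d$ is large.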
3.2.3 Step B

Now, we want to find some $\hat\lambda_n$ such that (8) is satisfied with our WDRO upper bound. In words, we have to make the conformal prediction set sufficiently large so that the robust conformal loss is arbitrarily small. We first show existence of such a $\hat\lambda_n$.

Lemma 7 (Existence). Let Assumption 2 hold. Then, for any $\alpha \in (0, 1)$ and $\rho > 0$, there exists $\bar\lambda$ satisfying
$$\forall \lambda \in \Lambda, \quad \lambda \geq \bar\lambda \implies \widehat{R}^\rho_n(\lambda) \leq \alpha.$$

Proof. Fix $\alpha \in (0, 1)$. Using Assumption 2-iii), there exists some $\bar\lambda$ such that for all $\lambda \geq \bar\lambda$, $L(y, T_\lambda(x)) \leq \alpha$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Now, we have that for any $\rho > 0$ and all $\lambda \geq \bar\lambda$,
$$\widehat{R}^\rho_n(\lambda) = \sup_{Q \in \mathcal{P}(\Xi) : W_2(P_n, Q) \leq \rho} \mathbb{E}_{(X,Y) \sim Q}\big[L(Y, T_\lambda(X))\big] \leq \alpha,$$
where we simply used that the supremum is taken over probability distributions.

The result is thus almost direct from Assumption 2-iii); nevertheless, it tells us that under this assumption (which holds in most cases of interest), one can find a suitable $\hat\lambda_n$ for any $\alpha$, which is not the case with the classical Hoeffding or Bernstein bounds (see the upcoming Fig. 2 in Section 4.1). The above result means that one can find a data-independent value for $\hat\lambda_n$ achieving (8). Nevertheless, in practice, we shall rely on data-driven values, both for tractability and performance. In order to do so, a desirable property is monotonicity in $\lambda$, as it enables the use of dichotomy search.

Lemma 8 (Monotonicity). Let Assumption 2 hold. Then, for any $\rho > 0$, $\lambda \mapsto \widehat{R}^\rho_n(\lambda)$ is monotonically decreasing.

Proof. Fix $\rho > 0$. Using Assumption 2-i) and ii), we have that if $\lambda_1 \leq \lambda_2$, then $L(y, T_{\lambda_1}(x)) \geq L(y, T_{\lambda_2}(x))$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Thus,
$$\widehat{R}^\rho_n(\lambda_2) = \sup_{Q \in \mathcal{P}(\Xi) : W_2(P_n, Q) \leq \rho} \mathbb{E}_{(X,Y) \sim Q}\big[L(Y, T_{\lambda_2}(X))\big] \leq \sup_{Q \in \mathcal{P}(\Xi) : W_2(P_n, Q) \leq \rho} \mathbb{E}_{(X,Y) \sim Q}\big[L(Y, T_{\lambda_1}(X))\big] = \widehat{R}^\rho_n(\lambda_1),$$
hence the monotonicity.

3.2.4 Conclusion

We are now in a position to state our main result on the construction of risk-controlled prediction sets using WDRO.
It states that if the radius $\rho$ is well chosen, there exists $\hat\lambda_n$ that gives the sought guarantees; more precisely, thanks to monotonicity, any $\hat\lambda_n$ satisfying $\widehat{R}^\rho_n(\hat\lambda_n) \leq \alpha$ will do.

Theorem 9. Let Assumptions 2 and 3 hold, and suppose that $\mathbb{E}_{\xi \sim P_0}\exp(\gamma\|\xi\|^\beta) < \infty$ for some $\beta > 2$, $\gamma > 0$. For any $\alpha \in (0, 1)$ and any $\delta \in (0, 1)$, there exists $\rho(n, \delta) = O\big((n^{-1}\log(\delta^{-1}))^{2/\max(4,d)}\big)$ such that for any $\rho \geq \rho(n, \delta)$, choosing $\hat\lambda_n$ so that $\widehat{R}^\rho_n(\hat\lambda_n) \leq \alpha$, we have
$$\mathbb{P}_{(X_i,Y_i)_{i=1}^n \sim P_0^{\otimes n}}\big[R(\hat\lambda_n) \leq \alpha\big] \geq 1 - \delta$$
whenever $W_2(P_0, P^\star) \leq \rho - \rho(n, \delta)$. In particular, if $P_0 = P^\star$, the result holds as soon as $\rho \geq \rho(n, \delta)$.

The result directly follows from Lemmas 5, 7 and 8. Although quite natural once the adequate results of the WDRO literature and the assumptions are put together, it highlights that the connection between WDRO and RCPS (and possibly other conformal methods) can be of interest. After discussing the reach and limitations of this result, the remainder of the paper is devoted to the computation and numerical evaluation of this methodology.

3.3 Discussion

The main parameter to tune in WDRO is the robustness radius $\rho$ (one could also mention the transport cost), and the same holds for the approach developed here. One point to notice is that once $\rho$ is chosen, the statistical guarantee of Eq. (13) is uniform in $\lambda$. The sole condition on $\rho$ in Theorem 9 is that it should be greater than some prescribed radius $\rho(n, \delta)$. Indeed, the excess $\rho - \rho(n, \delta)$ can be interpreted as the amount of distribution shift that can be handled by this approach. This point is important, as it means that WDRO risk-controlled prediction sets are inherently robust to some amount of distribution shift. Nevertheless, a drawback is that $\rho(n, \delta)$ is not explicitly computable. In fact, the tuning of $\rho$ is still an active research direction in the WDRO community. A common choice is $1/\sqrt{n}$, reflecting the fact that the radius should be larger in a data-poor environment.
However, the amount of distribution shift between the training distribution and the true one is also difficult to estimate in general, so this drawback is not really specific to the developed approach. Another limitation is that the prescribed radius $\rho(n, \delta)$ in Theorem 9 (coming from Lemma 5) suffers from the curse of dimensionality. Indeed, $\rho(n, \delta)$ decreases in $O(1/n^{1/d})$, which can be very slow when $d$ is large. This is due to the fact that we concentrate the full probability distribution here instead of looking directly at the WDRO problem. Indeed, for WDRO problems, a scaling in $O(1/\sqrt{n})$ appears to be asymptotically optimal (Blanchet et al., 2022; Blanchet and Shapiro, 2023). Recent works have managed to demonstrate this formally for a wide class of WDRO objectives (see Azizian et al., 2023b; Le and Malick, 2024); nevertheless, the required assumptions on the loss make these results inapplicable in our setting of interest. In order to focus on the main points of the paper, we leave this question open.

3.4 Dual problem and Numerical Tractability

As problem (12) is an infinite-dimensional problem over a space of measures, duality has been a central tool in both the theoretical analyses and the computational schemes of WDRO from the onset (Shafieezadeh-Abadeh et al., 2015; Esfahani and Kuhn, 2018). Indeed, the dual problem (Gao and Kleywegt, 2023; Sinha et al., 2018; Blanchet and Murthy, 2019) can be written as a one-dimensional problem involving $n$ subproblems:
$$\widehat{D}^\rho_n(\lambda) := \inf_{\gamma \geq 0}\ \gamma\rho^2 + \mathbb{E}_{(X,Y) \sim P_n}\Big[\sup_{(x', y') \in \Xi}\ L(y', T_\lambda(x')) - \gamma\|X - x'\|^2 - \gamma\|Y - y'\|^2\Big]. \qquad (15)$$
Though its expression appears simpler, the presence of a supremum makes (15) involved to solve in general. Nevertheless, since both the optimization in $\gamma$ and the inner supremum are finite-dimensional, it opens the door to easier practical implementation and carries important information about the solution of the primal problem.
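As a crude illustration of how (15) can be attacked numerically, the sketch below (ours, not a solver from the WDRO literature) evaluates the dual objective for a user-supplied loss, replacing the inner supremum by a finite grid of candidate label displacements $y'$ (the $X$'s are kept fixed for brevity) and the outer infimum by a grid over $\gamma$.

```python
def dual_value(gamma, points, loss, lam, rho, y_grid):
    """Objective of the dual problem (15) at a fixed gamma >= 0, where the
    inner supremum is approximated over a finite grid of candidate labels y'
    and only the label coordinate is displaced."""
    total = gamma * rho ** 2
    for (x, y) in points:
        inner = max(loss(yp, x, lam) - gamma * (y - yp) ** 2 for yp in y_grid)
        total += inner / len(points)
    return total

def robust_risk_dual(points, loss, lam, rho, y_grid, gamma_grid):
    """Outer infimum of (15), approximated by a grid search over gamma."""
    return min(dual_value(g, points, loss, lam, rho, y_grid) for g in gamma_grid)
```

For instance, with the 0/1 loss of Example 2.1 and a predictor identically zero, `loss = lambda yp, x, lam: float(abs(yp) >= lam)`, the returned value increases with $\rho$ and reduces to the empirical risk when $\rho = 0$, in line with strong duality (Theorem 10 below).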
3.4.1 Strong duality and worst-case distributions

The following theorem guarantees that we have strong duality, i.e., that the optimal value of the dual problem corresponds to $\widehat{R}^\rho_n(\lambda)$, which is the workhorse of our approach.

Theorem 10 (Strong duality). Under Assumptions 2 and 3, for all $\lambda \in \Lambda$, $\widehat{R}^\rho_n(\lambda) = \widehat{D}^\rho_n(\lambda)$, and both are finite. Furthermore, the worst-case distribution for the WDRO risk is attained and
$$Q^\star := \arg\sup_{Q \in \mathcal{P}(\Xi) : W_2(P_n, Q) \leq \rho} \mathbb{E}_{(X,Y) \sim Q}\big[L(Y, T_\lambda(X))\big]$$
is supported on $n + 1$ atoms, with
$$Q^\star = \frac{1}{n}\sum_{i=1}^n \delta_{\xi^\star_i} + p_0\big(\delta_{\tilde\xi^\star_{i_0}} - \delta_{\xi^\star_{i_0}}\big), \qquad \xi^\star_i \in \arg\sup_{(x', y') \in \Xi}\big\{L(y', T_\lambda(x')) - \gamma^\star\|X_i - x'\|^2 - \gamma^\star\|Y_i - y'\|^2\big\}, \quad i = 1, \dots, n,$$
with $\gamma^\star$ the optimal dual variable in (15), $i_0 \in \{1, \dots, n\}$, $p_0 \in [0, 1/n]$, and $\tilde\xi^\star_{i_0} \in \arg\sup_{(x', y') \in \Xi}\big\{L(y', T_\lambda(x')) - \gamma^\star\|X_{i_0} - x'\|^2 - \gamma^\star\|Y_{i_0} - y'\|^2\big\}$.

Proof. As for Lemma 4, strong duality follows from Gao and Kleywegt (2023, Th. 1). Furthermore, we can invoke Yue et al. (2022, Th. 4) to get the existence of a worst-case distribution for $\widehat{R}^\rho_n(\lambda)$ and Gao and Kleywegt (2023, Cor. 2) to obtain the form of the worst-case distribution.

This result gives an intuition of what the worst distribution shift looks like. In words, the worst-case distribution is a shift of the empirical one towards the regions where the conformal loss is large.

3.4.2 A special case

To give a simple example where (15) leads to an explicit solution, consider the setup of Example 2.1 where $T_\lambda(x) = \{y \in \mathcal{Y} : |y - f_\theta(x)| < \lambda\}$ and $L(y, T_\lambda(x)) = \mathbf{1}_{y \notin T_\lambda(x)} = \mathbf{1}_{|y - f_\theta(x)| \geq \lambda}$. In addition, let us suppose that only the $Y$'s are moved in the worst-case distribution (this can be seen, for instance, as the transport cost for $X$ being multiplied by a large constant). The dual problem is then
$$\widehat{D}^\rho_n(\lambda) := \inf_{\gamma \geq 0}\ \gamma\rho^2 + \frac{1}{n}\sum_{i=1}^n \sup_{y' \in \mathbb{R}}\big\{\mathbf{1}_{|y' - f_\theta(X_i)| \geq \lambda} - \gamma|Y_i - y'|^2\big\}, \qquad (16)$$
and notice that, for each $i$:

1.
if |Yi fθ(Xi)| λ (the data point is not in the confidence region), then supy R 1|y fθ(Xi)| λ γ Yi y 2 = 1, attained by Yi; Published in Transactions on Machine Learning Research (08/2025) fθ(Xi) fθ(Xi) + λ fθ(Xi) λ fθ(Xi) + λ 1 γ fθ(Xi) λ + 1 γ Figure 1: Illustration of the value of the suprema in (16) depending on Yi. If Yi is in the plain region around fθ(Xi) (Case 2.a), it is sufficiently well classified (depending on λ but also γ and thus ρ as show after) so it is too costly to displace it, the optimal y is Yi and the supremum is worth 0. If Yi is in the dashed region (Case 2.b), it is well classified but can be pushed outside the confidence region, the optimal y is then fθ(Xi) λ and the supremum is in (0, 1). Finally, if Yi is in the dotted region (Case 1), no displacement is necessary to be out of the confidence region and thus the optimal y is Yi and the supremum is worth 1. 2. otherwise (the data point is in the confidence region), in order to push y outside of the prediction interval, it has to be at distance at least λ |Yi fθ(Xi)| > 0 of Yi and thus: a) if |Yi fθ(Xi)| > λ 1 γ 1 γ(λ |Yi fθ(Xi)|)2 > 0, the displaced point leads to a positive value for the term in braces. Thus, it is optimal to displace the point and the supremum is equal to 1 γ(λ |Yi fθ(Xi)|)2 > 0; b) otherwise, |Yi fθ(Xi)| λ 1 γ 1 γ(λ |Yi fθ(Xi)|)2 0, the displaced point leads to a negative value for the term in braces. Thus, it is optimal to take y = Yi, the supremum is then equal to 0; This reasoning is illustrated by Fig. 1; note that we are actually exhibiting the worst-case distribution. 
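The case analysis above can be turned into a few lines of code. The following sketch (hypothetical helper name, assuming the squared-distance transport cost on Y only, as in the text) returns, for each calibration point, the optimal displaced label y' and the value of the supremum in (16):

```python
import numpy as np

def worst_case_targets(residuals, preds, lam, gamma):
    """Optimal displaced labels y' and supremum values in (16), per case.

    residuals: signed errors Y_i - f_theta(X_i); preds: f_theta(X_i);
    lam: prediction size; gamma: dual variable.
    """
    residuals = np.asarray(residuals, dtype=float)
    preds = np.asarray(preds, dtype=float)
    y_star = preds + residuals               # default: keep Y_i (Cases 1 and 2.b)
    sup_val = np.zeros_like(residuals)
    r = np.abs(residuals)
    case1 = r >= lam                         # already outside the interval
    sup_val[case1] = 1.0                     # supremum 1, attained at Y_i
    case2a = ~case1 & (r > lam - 1.0 / np.sqrt(gamma))
    side = np.where(residuals[case2a] >= 0.0, 1.0, -1.0)
    y_star[case2a] = preds[case2a] + side * lam          # push to the boundary
    sup_val[case2a] = 1.0 - gamma * (lam - r[case2a]) ** 2
    return y_star, sup_val
```

Averaging the returned supremum values and adding γρ² gives the dual objective of (16) for that value of γ.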
Without loss of generality, reorder the samples by decreasing error so that $|Y_1 - f_\theta(X_1)| \ge |Y_2 - f_\theta(X_2)| \ge \dots \ge |Y_n - f_\theta(X_n)|$, and let $i_0 = \max\{i : |Y_i - f_\theta(X_i)| \ge \lambda\}$ and $i_\gamma = \max\{i : |Y_i - f_\theta(X_i)| > \lambda - \frac{1}{\sqrt{\gamma}}\}$ (so that indices in $[1, i_0]$ correspond to Case 1, indices in $[i_0 + 1, i_\gamma]$ to Case 2.a, and indices in $[i_\gamma + 1, n]$ to Case 2.b), to have

$$\widehat{D}^\rho_n(\lambda) = \inf_{\gamma \ge 0} \; \gamma \rho^2 + \frac{1}{n} \sum_{i=1}^n \Big[ 1 - \gamma \big( (\lambda - |Y_i - f_\theta(X_i)|)_+ \big)^2 \Big]_+ = \inf_{\gamma \ge 0} \; \gamma \rho^2 + \frac{1}{n} \Big( i_0 + \sum_{i=i_0+1}^{i_\gamma} \big( 1 - \gamma (\lambda - |Y_i - f_\theta(X_i)|)^2 \big) \Big). \quad (17)$$

We are left with finding the optimal value of the dual variable γ. In fact, by looking at the definition of $i_\gamma$ and the expression of (17), it is direct that γ can be sought among $1/(\lambda - |Y_i - f_\theta(X_i)|)^2$ for $i$ in $[i_0 + 1, n]$. By computing the value of (17) for these $n - i_0$ candidates and denoting by $i^\star$ the arg min, we obtain an optimal dual variable $\gamma^\star = 1/(\lambda - |Y_{i^\star} - f_\theta(X_{i^\star})|)^2$ and we have

$$\widehat{R}^\rho_n(\lambda) = \widehat{D}^\rho_n(\lambda) = \frac{\rho^2}{(\lambda - |Y_{i^\star} - f_\theta(X_{i^\star})|)^2} + \frac{1}{n} \left( i^\star - \sum_{i=i_0+1}^{i^\star} \frac{(\lambda - |Y_i - f_\theta(X_i)|)^2}{(\lambda - |Y_{i^\star} - f_\theta(X_{i^\star})|)^2} \right) \quad (18)$$

and the worst-case distribution is exactly the training data except for the indices between $i_0 + 1$ and $i^\star$ (included), corresponding to points that used to fall into the conformal prediction set and that are pushed to its boundary. The formula in (18) seems difficult to interpret in general, but we nevertheless see some interesting similarities between its formulation and the Waudby-Smith–Ramdas bound (Waudby-Smith and Ramdas, 2024; Bates et al., 2021), despite the completely different approaches in their construction.

Remark 11. In order to include variations of X while keeping a closed form, it would be possible to consider a linear model $f_\theta(X) = \langle \theta, X \rangle$ and apply the same reasoning as above. Nevertheless, the derivations rapidly become cumbersome and the result is not much more informative.

3.4.3 Regularization

Providing explicit expressions for the solution of (15) as in the previous section is out of reach in many situations of interest, a common limitation of WDRO.
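In the special case of the previous section, however, the closed form (17)–(18) is directly implementable: it suffices to enumerate the candidate dual variables. A minimal sketch (hypothetical function name, assuming the 0/1 conformal loss with only Y displaced):

```python
import numpy as np

def simple_wdro_bound(residuals, lam, rho):
    """Evaluate the closed-form dual (17)-(18) for the 0/1 conformal loss
    when only Y is displaced.

    residuals: |Y_i - f_theta(X_i)| on the calibration set; lam: prediction
    size; rho: Wasserstein radius.  The optimal dual variable is sought among
    gamma = 1/(lam - r_i)^2 for the points inside the prediction interval
    (gamma = 0 gives the trivial bound 1).
    """
    r = np.asarray(residuals, dtype=float)
    best = 1.0                                   # value of the dual at gamma = 0
    for gamma in 1.0 / (lam - r[r < lam]) ** 2:
        # per-point supremum: [1 - gamma ((lam - r_i)_+)^2]_+
        vals = np.clip(1.0 - gamma * np.clip(lam - r, 0.0, None) ** 2, 0.0, None)
        best = min(best, gamma * rho ** 2 + float(vals.mean()))
    return best
```

For ρ = 0 this recovers the empirical conformal risk, and the bound grows with ρ, as expected from the robustification.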
Nevertheless, entropic regularizations of WDRO have been proposed and studied recently, offering promising performance (Azizian et al., 2023a;b; Wang et al., 2023). These approaches can be directly used in our context: by adding an entropic regularization term to the transportation problem, the dual problem (15) becomes

$$\widehat{D}^{\rho,\varepsilon}_n(\lambda) := \inf_{\gamma \ge 0} \; \gamma \rho^2 + \varepsilon \, \mathbb{E}_{(X,Y)\sim P_n} \log \mathbb{E}_{(X',Y')\sim Q((X,Y))} \exp\left( \frac{L(Y', T_\lambda(X')) - \gamma \|X - X'\|^2 - \gamma \|Y - Y'\|^2}{\varepsilon} \right) \quad (19)$$

where ε > 0 is the regularization strength and Q((X, Y)) is a user-defined conditional distribution; typical choices are ε = ρ and Q((X, Y)) = N((X, Y), ρI). The regularized dual (19) makes it possible to find an approximate solution of (15) by relying on gradient-based methods (since the supremum in (15) is replaced by a differentiable log-sum-exp). Nevertheless, one still has to sample the points (X', Y') in order to numerically approximate the inner expectation. Compared to the special case of the previous section or to classical conformal bounds, the numerical cost of solving this problem is clearly much higher (due to the sampling and the optimization in γ), but when jointly training and performing conformal prediction on a model, the computation times are on par with classical adversarial training methods. We refer to the papers above for more details and to Vincent et al. (2024) for a practical implementation.

3.5 Simultaneous learning and conformal prediction

A crucial point in our analysis is that the statistical guarantees for WDRO are uniform in the loss function; in Lemma 5, the result is uniform in λ. This means that if the prediction depends on some parameter θ, we can learn θ while constructing our WDRO conformal prediction sets, using the same dataset. This feature is seldom found in the conformal prediction literature, where the model and the conformal prediction set usually have to be learned on separate datasets (a recent exception is (Braun et al., 2025)).
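Whether for a fixed model or inside such a joint training loop, evaluating the regularized dual (19) amounts to sampling and a one-dimensional minimization. Below is a minimal numpy sketch under our own assumptions: the function name is hypothetical, the 1/ε scaling inside the exponential matches Sinkhorn-type duals, `loss` is a user-supplied vectorized conformal loss, and the inner γ-problem is solved by plain grid search (the skwdro package implements the actual method):

```python
import numpy as np

rng = np.random.default_rng(0)

def regularized_dual(X, Y, loss, rho, eps=None, n_samples=64, gamma_grid=None):
    """Monte Carlo sketch of the entropy-regularized dual (19).

    loss(x, y) is a vectorized conformal loss L(y, T_lambda(x)); the sampling
    distribution Q((X, Y)) is the Gaussian N((X, Y), rho I) mentioned in the
    text, and eps defaults to rho.
    """
    eps = rho if eps is None else eps
    n, d = X.shape
    # sample (X', Y') ~ N((X, Y), rho I) for each calibration point
    Xs = X[:, None, :] + np.sqrt(rho) * rng.standard_normal((n, n_samples, d))
    Ys = Y[:, None] + np.sqrt(rho) * rng.standard_normal((n, n_samples))
    cost = ((Xs - X[:, None, :]) ** 2).sum(-1) + (Ys - Y[:, None]) ** 2
    ell = loss(Xs, Ys)

    def dual(gamma):
        inner = (ell - gamma * cost) / eps
        m = inner.max(axis=1, keepdims=True)             # stable log-mean-exp
        lme = m[:, 0] + np.log(np.exp(inner - m).mean(axis=1))
        return gamma * rho ** 2 + eps * float(lme.mean())

    gammas = np.linspace(0.0, 20.0, 101) if gamma_grid is None else gamma_grid
    return min(dual(g) for g in gammas)
```

Since the log-mean-exp is differentiable, a gradient-based solver can replace the grid search in γ (and extend to joint optimization in θ).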
To make this more precise, consider the setup of Example 2.1 where $T_\lambda(x) = \{y \in \mathcal{Y} : |y - f_\theta(x)| < \lambda\}$ and $L(y, T_\lambda(x)) = 1_{y \notin T_\lambda(x)} = 1_{|y - f_\theta(x)| \ge \lambda}$. Then, we can use

$$\widehat{R}^\rho_n(\lambda, \theta) = \sup_{Q \in \mathcal{P}(\Xi)\,:\,W_2(P_n, Q) \le \rho} \; \mathbb{E}_{(X,Y)\sim Q}\big[ 1_{|Y - f_\theta(X)| \ge \lambda} \big].$$

Then, choosing $\hat\lambda_n, \hat\theta_n$ so that $\widehat{R}^\rho_n(\hat\lambda_n, \hat\theta_n) \le \alpha$, we have the same guarantees as in Theorem 9. Thus, we can actually optimize the model jointly with the estimation of the risk-controlled prediction set, on the same dataset. Note here that the obtained model will differ from the one obtained by empirical risk minimization or even WDRO, as it is learned to optimize the conformal loss. Numerically, a good solution is to rely on the regularized formulation mentioned in Section 3.4.3.

4 Numerical illustrations

In this section, we illustrate the results of the paper and the applicability of WDRO-based Risk Controlling Prediction Sets. Our goal is thus to show how the WDRO-based bounds behave compared to classical approaches for RCPS and how they can encompass distribution shifts and simultaneous training and conformal prediction, rather than performing a complete performance evaluation. The code used for these experiments is available at https://github.com/iutzeler/rcps-wdro. We place ourselves in the regression setting of Example 2.1 and compare the following upper-bound functions $\widehat{R}^+_n(\lambda)$ aimed at verifying (7) in Step A:

- Hoeffding: $\widehat{R}^H_n(\lambda) = \widehat{R}_n(\lambda) + \sqrt{\log(1/\delta)/(2n)}$ (see (Bates et al., 2021, Sec. 3.1.1))
- Bernstein: $\widehat{R}^B_n(\lambda) = \widehat{R}_n(\lambda) + \hat\sigma \sqrt{2\log(2/\delta)/n} + \frac{7}{3(n-1)} \log\frac{2}{\delta}$, where $\hat\sigma^2$ is the standard unbiased variance estimator of the losses (see (Bates et al., 2021, Sec. 3.1.3))
- Simple WDRO: the bound of (18)
- SKWDRO: the bound of (19), where the 0/1 loss is replaced by a smooth version⁵, solved using the package skwdro (Vincent et al., 2024)

In all this section, unless otherwise specified, we take α = 0.1 and δ = 0.05.
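For reference, the two classical bounds and the selection of $\hat\lambda_n$ can be sketched as follows (hypothetical helper names; the Bernstein variant is the empirical Maurer–Pontil bound used in Bates et al. (2021)):

```python
import numpy as np

def hoeffding_bound(losses, delta):
    """Hoeffding upper bound on the risk for losses in [0, 1]."""
    n = len(losses)
    return losses.mean() + np.sqrt(np.log(1.0 / delta) / (2.0 * n))

def bernstein_bound(losses, delta):
    """Empirical (Maurer-Pontil) Bernstein upper bound for losses in [0, 1]."""
    n = len(losses)
    sigma = losses.std(ddof=1)               # unbiased standard deviation
    return (losses.mean()
            + sigma * np.sqrt(2.0 * np.log(2.0 / delta) / n)
            + 7.0 * np.log(2.0 / delta) / (3.0 * (n - 1)))

def select_lambda(residuals, lambdas, bound, alpha, delta):
    """Smallest lambda whose upper bound on the conformal risk is below alpha."""
    for lam in np.sort(lambdas):
        losses = (residuals >= lam).astype(float)   # 0/1 conformal loss
        if bound(losses, delta) <= alpha:
            return lam
    return None
```

The same `select_lambda` routine applies unchanged to the WDRO-based bounds by swapping the `bound` argument.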
For the WDRO-based approaches, we take $\rho = c/\sqrt{n}$ with c = 2 for SKWDRO and c = 10⁻² for Simple WDRO. Finally, note that for both WDRO-based approaches, the solved objectives ((18) and (19)) are dual-based.

Remark 12 (Choice of the radius ρ). The choice of the radius is an important and difficult question in distributionally robust optimization, both theoretically and practically. We can distinguish two approaches: (i) generic measure concentration (as used in Esfahani and Kuhn (2018) and Lemma 5), which gives $\rho = (\log(C/\delta)/(cn))^{1/d}$ and thus suffers from the curse of dimensionality (but can be used directly to obtain statistical guarantees on WDRO estimators); or (ii) specific concentration on the WDRO (dual) objective, which gives $\rho = C'/\sqrt{n}$ but is unfortunately elusive for the conformal objective functions considered in this paper (mainly due to the discontinuity, see Azizian et al. (2023b)). In both cases, the constants are not explicit. Nevertheless, as in classical WDRO, we found that $\rho \propto 1/\sqrt{n}$ was performing rather well all around, and this was used in our experiments as mentioned above. This finding is in line with the practical rate for WDRO estimation as mentioned in Sec. 7.2.D of Esfahani and Kuhn (2018). Furthermore, for a specific dataset, several approaches can be used to tune the radius adequately. Indeed, as training and conformal prediction can be done simultaneously over the same dataset (contrary to most conformal prediction techniques, which rely on a separate calibration set), a small portion of the data can be held out to cross-validate the value of ρ so as to satisfy the requirements, with virtually no performance loss over the whole training procedure. This approach is detailed in (Esfahani and Kuhn, 2018, Sec. 7.2.B,C) for instance and can be an alternative to the standard $1/\sqrt{n}$ choice.
Finally, in the case of an unknown distribution shift, these techniques could be too optimistic (unless some samples from the shifted distribution are available), and the (square of the) radius should theoretically be increased by the measure of the shift (in squared Wasserstein distance $W_2^2$). In practice, this is often too conservative, as the geometry of the Wasserstein distance with Euclidean cost may not adequately capture the loss degradation. This is a general issue in the modeling of distribution shifts that goes beyond the scope of the present paper.

4.1 Coverage and interval size

In this section, our experimental setup is the following. We use 1000 samples generated from the make_regression function of scikit-learn:⁶ 500 are used to train a linear prediction model (using scikit-learn's LinearRegression estimator), 250 are used for calibration (this is the n in the notation of the paper), and 250 are used for evaluation. In Fig. 2, we display the values of the considered upper bounds as a function of the prediction size λ for one realization of the experiment. The selected prediction size $\hat\lambda_n$ for each upper bound is the smallest value of λ for which the upper bound is below the target risk (represented as a dashed line); the smaller, the better, as this results in tighter prediction sets for the same guarantee. We observe that the WDRO-based upper bounds are consistently smaller than Hoeffding or Bernstein, meaning they will lead to tighter prediction sets in general. In Fig. 3, we repeat our experiment 100 times and report boxplots of the coverages (i.e., one minus the risk) evaluated on the test data, as well as prediction interval sizes. We observe that the WDRO-based approaches lead to admissible coverages (slightly closer to the target risk) with smaller interval sizes. Thus, they constitute interesting substitutes for the Hoeffding or Bernstein-based approaches.

⁵Precisely, we replace $1_{u>0}$ by $\frac{1}{1+\exp(-30u-3)}$.
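The pipeline of this section can be sketched with numpy alone; in the sketch below, synthetic linear data stands in for make_regression and LinearRegression, and a Hoeffding-style slack plays the role of the Step A bound (all names are ours, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# numpy stand-in for the make_regression data: linear model in dimension 5
d = 5
w = rng.standard_normal(d)
X = rng.standard_normal((1000, d))
y = X @ w + 0.5 * rng.standard_normal(1000)
X_tr, y_tr = X[:500], y[:500]            # 500 points to train the model
X_cal, y_cal = X[500:750], y[500:750]    # 250 calibration points (n = 250)
X_te, y_te = X[750:], y[750:]            # 250 evaluation points

theta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)   # linear predictor

alpha, delta = 0.1, 0.05
res_cal = np.abs(y_cal - X_cal @ theta)

def risk_bound(lam):
    """Empirical conformal risk on the calibration set plus a Hoeffding slack."""
    losses = (res_cal >= lam).astype(float)
    return losses.mean() + np.sqrt(np.log(1.0 / delta) / (2 * len(losses)))

# smallest lambda on a grid whose certified bound is below the target alpha
lam_hat = next(l for l in np.linspace(0.0, 5.0, 201) if risk_bound(l) <= alpha)
coverage = float(np.mean(np.abs(y_te - X_te @ theta) < lam_hat))
```

Because the bound is conservative, the resulting test coverage typically sits above the target 1 − α.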
⁶We use the options n_features=5, n_informative=3, noise=20, leading to a problem in dimension d = 5 with a fair amount of noise.

Figure 2: Behavior of the RCPS bounds (Hoeffding, Bernstein, WDRO, SKWDRO, together with the empirical conformal risk and the target risk) as a function of the prediction size λ.

Figure 3: Performance of the RCPS bounds over N = 100 runs: test coverage (1 − risk) and prediction interval size for Hoeffding, Bernstein, WDRO, and SKWDRO. The dashed line represents the target coverage 1 − α. The whiskers of the boxplots are set to the δ and 1 − δ quantiles, so that the target of RCPS is to have the lower whisker above the dashed line.

Figure 4: Performance of the RCPS bounds over N = 100 runs with and without distribution shift: test coverage, coverage after distribution shift, and prediction interval size for Hoeffding, Bernstein, and the WDRO bound with ρ ∈ {0.5, 1.0, 2.0, 4.0}. The dashed line represents the target coverage 1 − α. The whiskers of the boxplots are set to the δ and 1 − δ quantiles, so that the target of RCPS is to have the lower whisker above the dashed line. The leftmost plot corresponds to the classical test coverage (as in the previous section) while the center one corresponds to the coverage on the shifted dataset.

4.2 Distribution shifts

We place ourselves in the same setup as before. To simulate a distribution shift, we create a shifted dataset by copying the test dataset and adding to the points (X_i) a Gaussian noise of mean (0.25, 0, 0, 0, 0) and covariance 0.25²I, resulting in a Wasserstein distance of 0.612 between the testing and training data. In Fig.
4, we display the coverage for the test set and the shifted set, comparing different values of ρ for the WDRO bound. We first notice that the Bernstein and Hoeffding bounds are so pessimistic that they can actually accommodate quite large distribution shifts in practice (without theoretical grounding though). For the WDRO-based bounds, we see that, as expected, the interval sizes and the coverage grow with ρ, which makes it possible to take into account different magnitudes of shift (recall that, following Theorem 9, a WDRO model can accommodate a shift of magnitude s in Wasserstein distance whenever $\rho \ge s + \rho(n, \delta)$, where ρ(n, δ) is the concentration radius, which can be taken proportional to $1/\sqrt{n}$).⁷ In particular, in the present case, the shift is of magnitude s = 0.612 and the practical concentration radius is $\rho_n = 10\sqrt{2/250} \approx 0.894$, leading to a minimal radius of 1.506. Hence, the choice ρ = 0.5 is smaller than the distribution shift magnitude and thus does not provide coverage in the shifted situation (middle plot), nor even without shift (left plot). The choice ρ = 1.0 is closer to the prescribed value and provides a correct coverage before shift, but does not match the RCPS target after shift. A satisfying performance after shift is attained with ρ = 2.0. This validates the interest of this approach for directly integrating distribution shifts.

4.3 Simultaneous training and conformal inference

In this section, we draw n = 200 samples from the following distribution: X is drawn uniformly in [−2, 2] and Y = f(X) + U, where $f(t) = \frac{10}{\exp(t) + \exp(-t)} + t$ and U is a standard Gaussian variable. For a fixed λ, we perform a degree-4 polynomial regression of Y on X and compare two approaches for training and conformal inference using the SKWDRO bound:⁸

Separate. The regression is learned using sklearn's LinearRegression on polynomial features, then the conformal upper bound is computed.

Joint.
The polynomial model is trained to minimize the conformal loss, i.e., the bound of (12) is also optimized in θ.

⁷We do not comment further on how to tune ρ with respect to the distribution shift and refer to Remark 12. Furthermore, the pointwise distance used as a base for the Wasserstein distance (here the Euclidean distance) may be tuned for each problem for a large performance gain. Such a numerical study is out of scope for the present paper.

⁸That is, the bound of (12) with a smoothed loss, solved by entropic regularization.

In Fig. 5, we report the bound values and the test risk (estimated from 1000 independent samples) for separate and joint training as a function of λ. We observe that the joint approach yields a smaller bound and risk than the separate one, for any value of λ. Hence, simultaneous training and conformal inference leads to a better conformal inference performance, at the cost of degrading the empirical performance. In Fig. 6, we observe that the models obtained by joint learning and conformal prediction can be very different from the one obtained separately by empirical risk minimization. Indeed, if λ is small, it is difficult to obtain meaningful confidence guarantees and thus any model that can capture a few points will be optimal. Conversely, if λ is large, it is easy to be confident thanks to the large error margin, and a simple trend of the data is sufficient to obtain a good conformal prediction. In between, the situation is mixed: a sufficient number of points should be correctly predicted, but whenever the prediction fails, it does not matter by how much, and thus training on the conformal loss offers some kind of regularization to the predictor. Thus, we can see that training a model using the distributionally robust conformal loss does not lead to good test performance even though the conformal risk is improved.
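To make the joint approach concrete, here is a minimal numpy sketch under our own assumptions (all names hypothetical): plain gradient descent on the smoothed empirical conformal loss of footnote 5, compared with the separate least-squares fit. The actual SKWDRO approach additionally robustifies this loss through the regularized dual (19):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data mimicking Section 4.3: X uniform on [-2, 2], Y = f(X) + noise
n = 200
X = rng.uniform(-2.0, 2.0, n)
Y = 10.0 / (np.exp(X) + np.exp(-X)) + X + rng.standard_normal(n)
P = np.vander(X, 5, increasing=True)          # degree-4 polynomial features

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -500.0, 500.0)))

def smoothed_conformal_risk(theta, lam):
    """Empirical conformal risk with 1_{u > 0} replaced by sigmoid(30 u + 3)."""
    return float(np.mean(sigmoid(30.0 * (np.abs(Y - P @ theta) - lam) + 3.0)))

def joint_train(lam, lr=1e-3, steps=2000):
    """Gradient descent on the smoothed empirical conformal risk."""
    theta = np.zeros(P.shape[1])
    for _ in range(steps):
        r = Y - P @ theta
        s = sigmoid(30.0 * (np.abs(r) - lam) + 3.0)
        # gradient of mean sigmoid(30(|r| - lam) + 3) with respect to theta
        grad = -(30.0 * s * (1.0 - s) * np.sign(r)) @ P / n
        theta -= lr * grad
    return theta

theta_sep, *_ = np.linalg.lstsq(P, Y, rcond=None)   # "Separate": least squares
theta_joint = joint_train(lam=2.6)                  # "Joint": conformal loss
```

As discussed above, the jointly trained model targets the conformal risk rather than the squared error, so the two fits can differ substantially.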
Nevertheless, this opens the door to the study of the WDRO-based conformal risk as a regularizer for empirical risk minimization, which could lead to good performance with an improved conformal coverage.

Remark 13 (On the interplay between coverage and performance). There is a subtlety here in the interaction between conformal prediction (i.e., risk control), which is at the core of this work, and the prediction performance of the base model, which is not our main concern. The basic conformal prediction framework aims to quantify the uncertainty of a predictor (whatever its performance). Typically, in the regression case, if the predictor has poor performance for a given criterion and the conformal interval is built with respect to this criterion, then the obtained prediction sets (i.e., λ) will be large. Conversely, if we minimize the prediction size while jointly training a model (as we do in this section), then the obtained predictor should perform reasonably well but has no reason to be optimal.⁹ When jointly training, the RCPS guarantees are theoretically preserved, since the generalization part only depends on measure concentration. Intuitively, by doing so, we should reduce the overfitting of the predictor even if we might overfit the prediction set; still, this is mitigated by the distributionally robust optimization of the conformal loss. In addition, conformal prediction intervals can be obtained even when the models depend on the calibration data, under some assumptions (see e.g. Barber et al. (2023)). Additional experiments exploring this setting can be found in Appendix A.

5 Conclusion

In this paper, we discussed the design of risk-controlled prediction sets using Wasserstein distributionally robust optimization. We showed that, by replacing the model loss with the conformal loss, the WDRO problem leads exactly to the type of generalization guarantees needed in the considered conformal prediction approach.
This demonstrates the potential links between these two separate methods, which can foster future research. In addition, the obtained WDRO-based conformal prediction methods displayed promising performance in our numerical illustrations and benefit from two major advantages: i) they can seamlessly account for distribution shifts; and ii) they allow for simultaneous training and calibration on the same dataset, which is advantageous in data-poor situations.

⁹Note that we consider in this work the conformal risk control approach, which is a more general framework: as shown in Example 2.1, the size of the conformal interval is then generalized through the parameter λ. Both frameworks aim to get guarantees such as Eq. (1) (i.e., getting a prediction set for the new output at a given level 1 − α), which is then a guarantee in expectation. In particular, it implies that the level 1 − α may not be achieved for some realizations of $(X_i, Y_i)_{i=1}^{n+1}$. The guarantees we study here (Eq. (9)) aim to ensure that the proportion of such events remains controlled at another level 1 − δ. This can be achieved by extrapolating concentration knowledge for the considered risk (Eq. (7)), as proposed by Bates et al. (2021).

Figure 5: Conformal bounds and test risk (estimated from 1000 independent samples) for separate and joint training, as a function of λ.

Acknowledgments

The authors thank the action editor and the reviewers for their fruitful remarks. F. Iutzeler acknowledges the support of the AI Cluster ANITI (ANR-23-IACL-0002) and the ANR project MAD (ANR-24-CE23-1529). A.
Mazoyer has been supported by the project ROMEO (ANR-21-ASIA-0001) from the ASTRID joint program of the French National Research Agency (ANR) and the French Innovation and Defence Agency (AID).

Figure 6: Polynomial models learned with separate and joint training, for (a) λ = 1.2, (b) λ = 2.6, and (c) λ = 5.0.

References

A. N. Angelopoulos and S. Bates. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4):494–591, 2023.
A. N. Angelopoulos, S. Bates, A. Fisch, L. Lei, and T. Schuster. Conformal risk control. In International Conference on Learning Representations, 2024.
L. Aolaritei, M. I. Jordan, Y. Marzouk, Z. O. Wang, and J. Zhu. Conformal prediction under Lévy-Prokhorov distribution shifts: Robustness to local and global perturbations. arXiv preprint arXiv:2502.14105, 2025.
W. Azizian, F. Iutzeler, and J. Malick. Regularization for Wasserstein distributionally robust optimization. ESAIM: COCV, 29:33, 2023a.
W. Azizian, F. Iutzeler, and J. Malick. Exact generalization guarantees for (regularized) Wasserstein distributionally robust models. Advances in Neural Information Processing Systems, 36, 2023b.
R. F. Barber, E. J. Candès, A. Ramdas, and R. J. Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816–845, 2023.
S. Bates, A. Angelopoulos, L. Lei, J. Malik, and M. Jordan. Distribution-free, risk-controlling prediction sets. Journal of the ACM, 68(6):1–34, 2021.
A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B.
Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
J. Blanchet and K. Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, 2019.
J. Blanchet and A. Shapiro. Statistical limit theorems in distributionally robust optimization. arXiv preprint arXiv:2303.14867, 2023.
J. Blanchet, K. Murthy, and V. A. Nguyen. Statistical analysis of Wasserstein distributionally robust estimators. In Tutorials in Operations Research: Emerging Optimization Methods and Modeling Techniques with Applications, pages 227–254. INFORMS, 2021.
J. Blanchet, K. Murthy, and N. Si. Confidence regions in Wasserstein distributionally robust estimation. Biometrika, 109(2):295–315, 2022.
S. Braun, L. Aolaritei, M. I. Jordan, and F. Bach. Minimum volume conformal sets for multivariate regression. arXiv preprint arXiv:2503.19068, 2025.
M. Cauchois, S. Gupta, A. Ali, and J. C. Duchi. Robust validation: Confident predictions even when distributions shift. Journal of the American Statistical Association, 119(548):3033–3044, 2024.
E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58:595–612, 2010.
P. M. Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, 2018.
S. Feldman, L. Ringel, S. Bates, and Y. Romano. Achieving risk control in online learning settings. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
M. Fontana, G. Zeni, and S. Vantini. Conformal prediction: a unified review of theory and new challenges. Bernoulli, 29(1):1–23, 2023.
N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure.
Probability Theory and Related Fields, 162(3):707–738, 2015.
R. Gao and A. J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of Operations Research, 48(2):603–655, 2023.
A. Gendler, T.-W. Weng, L. Daniel, and Y. Romano. Adversarially robust conformal prediction. In International Conference on Learning Representations, 2022.
I. Gibbs and E. Candès. Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems, volume 34, pages 1660–1672, 2021.
I. Gibbs and E. Candès. Conformal inference for online prediction with arbitrary distribution shifts. Journal of Machine Learning Research, 25(162):1–36, 2024.
J. Goh and M. Sim. Distributionally robust optimization and its tractable approximations. Operations Research, 58:902–917, 2010.
C. Gupta, A. K. Kuchibhotla, and A. Ramdas. Nested conformal prediction and quantile out-of-bag ensemble methods. Pattern Recognition, 127, 2022.
Z. Hu and L. J. Hong. Kullback-Leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 1(2):9, 2013.
D. Kuhn, S. Shafiee, and W. Wiesemann. Distributionally robust optimization, 2024. URL https://arxiv.org/abs/2411.02549.
Y. Kwon, W. Kim, J.-H. Won, and M. C. Paik. Principled learning method for Wasserstein distributionally robust optimization with local perturbations. In International Conference on Machine Learning, pages 5567–5576. PMLR, 2020.
Y. Laguel, J. Malick, and Z. Harchaoui. First-order optimization for superquantile-based supervised learning. In International Workshop on Machine Learning for Signal Processing, pages 1–6. IEEE, 2020.
T. Le and J. Malick. Universal generalization guarantees for Wasserstein distributionally robust models. arXiv preprint arXiv:2402.11981, 2024.
D. Levy, Y. Carmon, J. C. Duchi, and A. Sidford. Large-scale methods for distributionally robust optimization.
Advances in Neural Information Processing Systems, 33, 2020.
J. Li, C. Chen, and A. M.-C. So. Fast epigraphical projection-based incremental algorithms for Wasserstein distributionally robust support vector machine. In Advances in Neural Information Processing Systems, volume 33, pages 4029–4039. Curran Associates, Inc., 2020.
H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. Advances in Neural Information Processing Systems, 29, 2016.
H. Papadopoulos, K. Proedrou, V. Vovk, and A. Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, pages 345–356, 2002.
A. Podkopaev and A. Ramdas. Distribution-free uncertainty quantification for classification under label shift. In Uncertainty in Artificial Intelligence, volume 161, pages 844–853, 2021.
S. Shafieezadeh Abadeh, P. M. Mohajerin Esfahani, and D. Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
S. Shafieezadeh-Abadeh, D. Kuhn, and P. M. Esfahani. Regularization via mass transportation. Journal of Machine Learning Research, 20:1–68, 2019.
A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
M. Staib and S. Jegelka. Distributionally robust optimization and generalization in kernel methods. Advances in Neural Information Processing Systems, 32, 2019.
R. J. Tibshirani, R. F. Barber, E. J. Candès, and A. Ramdas. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems, volume 32, 2019.
F. Vincent, W. Azizian, F. Iutzeler, and J. Malick. skwdro: a library for Wasserstein distributionally robust machine learning. arXiv preprint arXiv:2410.21231, 2024.
V. Vovk, A. Gammerman, and G. Shafer. Algorithmic learning in a random world. Springer, 2005.
J.
Wang, R. Gao, and Y. Xie. Sinkhorn distributionally robust optimization. arXiv preprint arXiv:2109.11926, 2023.
I. Waudby-Smith and A. Ramdas. Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1):1–27, 2024.
G. Yan, Y. Romano, and T.-W. Weng. Provably robust conformal prediction with improved efficiency. In International Conference on Learning Representations, 2024.
M.-C. Yue, D. Kuhn, and W. Wiesemann. On linear optimization over Wasserstein balls. Mathematical Programming, 195(1):1107–1122, 2022.
J.-J. Zhu, W. Jitkrittum, M. Diehl, and B. Schölkopf. Kernel distributionally robust optimization: Generalized duality theorem and stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pages 280–288. PMLR, 2021a.
J.-J. Zhu, C. Kouridi, Y. Nemmour, and B. Schölkopf. Adversarially robust kernel smoothing. arXiv preprint arXiv:2102.08474, 2021b.

A Extra numerical illustrations for joint training and conformal inference

In this section, we use the same setup as in Section 4.3 but change some of the parameters.

A.1 Varying the number of samples

We take n = 500, 1000, 2000 samples and report the bounds and test risk in Fig. 7 and the obtained models in Fig. 8. We observe that neither the models nor the predictions change dramatically, indicating that the observed concentration, and hence the predictions, are stable. In particular, we do not observe any noticeable overfitting behavior.

A.2 Changing the variance

We take a noise level of σ = 0.1, 0.2, and 0.5 (instead of 1) for the Gaussian noise U and report the bounds and test risk in Fig. 9 and the obtained models in Fig. 10. With a smaller variance, the prediction size needed to obtain a coverage guarantee diminishes, as we see in Fig. 9.
Thus, for λ = 1.0, we observe that the models are better adjusted than with unit variance; for λ = 2.6, corresponding to an approximate coverage of 80%, the model follows the curve better than in the unit-variance case (which has a coverage of approximately 65%). Finally, for λ = 5.0, the prediction size is so large that virtually any model has perfect coverage, hence the observed behavior.

Figure 7: Conformal bounds and test risk for separate and joint training, for (a) n = 500, (b) n = 1000, and (c) n = 2000.

Figure 8: Polynomial models learned for λ ∈ {1.0, 2.6, 5.0} with separate and joint training, for n = 500, 1000, and 2000.

Figure 9: Conformal bounds and test risk for separate and joint training, for (a) σ = 0.1, (b) σ = 0.2, and (c) σ = 0.5.

Figure 10: Polynomial models learned for λ ∈ {1.0, 2.6, 5.0} with separate and joint training, for σ = 0.1, 0.2, and 0.5.
function Separately trained polynomial model Jointly trained polynomial model (g) σ = 0.5, λ = 1.0 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2 train data generating function Separately trained polynomial model Jointly trained polynomial model (h) σ = 0.5, λ = 2.6 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2 train data generating function Separately trained polynomial model Jointly trained polynomial model (i) σ = 0.5, λ = 5.0 Figure 10: Polynomial models learned for different values of λ with separate and joint training.
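The qualitative effect of σ and λ described above can be sketched numerically. The snippet below is a minimal illustration under stated assumptions, not the paper's actual experimental code: it assumes interval predictors of the form Tλ(x) = [f(x) − λ, f(x) + λ], a hypothetical sine data-generating function, and a least-squares polynomial fit for f; it then estimates the miscoverage risk R(Tλ) = E[1{Y ∉ Tλ(X)}] empirically on held-out data for several λ.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_risks(n=1000, sigma=1.0, degree=5, lambdas=(1.0, 2.6, 5.0)):
    """Empirical miscoverage of T_lambda(x) = [f(x)-lambda, f(x)+lambda]."""
    # Hypothetical data-generating function (an assumption, not from the paper).
    g = lambda x: np.sin(3 * x)
    x = rng.uniform(-2, 2, size=n)
    y = g(x) + sigma * rng.normal(size=n)
    # Fit a polynomial model f by least squares on the first half of the data.
    f = np.poly1d(np.polyfit(x[: n // 2], y[: n // 2], degree))
    # Miscoverage loss L(y, T_lambda(x)) = 1{|y - f(x)| > lambda}, averaged
    # over the held-out second half.
    resid = np.abs(y[n // 2 :] - f(x[n // 2 :]))
    return {lam: float(np.mean(resid > lam)) for lam in lambdas}

risks_unit = empirical_risks(sigma=1.0)
risks_small = empirical_risks(sigma=0.2)
# As in Figs. 9-10: risk decreases as lambda grows, and a smaller noise level
# needs a smaller prediction size to reach the same coverage.
assert risks_unit[1.0] >= risks_unit[5.0]
assert risks_small[1.0] <= risks_unit[1.0]
```

This reproduces the trend discussed in A.2: at λ = 5.0 the intervals are wide enough that almost any model covers, while at λ = 1.0 the achievable coverage depends strongly on σ.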