Distributed Conformal Prediction via Message Passing

Haifeng Wen*1, Hong Xing*1,2, Osvaldo Simeone3

Abstract: Post-hoc calibration of pre-trained models is critical for ensuring reliable inference, especially in safety-critical domains such as healthcare. Conformal Prediction (CP) offers a robust post-hoc calibration framework, providing distribution-free statistical coverage guarantees for prediction sets by leveraging held-out datasets. In this work, we address a decentralized setting where each device has limited calibration data and can communicate only with its neighbors over an arbitrary graph topology. We propose two message-passing-based approaches for achieving reliable inference via CP: quantile-based distributed conformal prediction (Q-DCP) and histogram-based distributed conformal prediction (H-DCP). Q-DCP employs distributed quantile regression enhanced with tailored smoothing and regularization terms to accelerate convergence, while H-DCP uses a consensus-based histogram estimation approach. Through extensive experiments, we investigate the trade-offs between hyperparameter tuning requirements, communication overhead, coverage guarantees, and prediction set sizes across different network topologies. The code of our work is released at: https://github.com/HaifengWen/Distributed-Conformal-Prediction.

*Equal contribution. 1IoT Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China. 2Department of ECE, The Hong Kong University of Science and Technology, Hong Kong SAR. 3Department of Engineering, King's College London, London, U.K. Correspondence to: Hong Xing.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

1.1. Context and Motivation

The post-hoc calibration of pre-trained artificial intelligence (AI) models has become increasingly important as a means
to ensure reliable inference and decision-making in safety-critical domains such as healthcare (Kompa et al., 2021), engineering (Cohen et al., 2023), and large language models (LLMs) (Ji et al., 2023; Huang et al., 2025). Conformal prediction (CP) is a model-agnostic post-hoc calibration framework that provides distribution-free statistical coverage guarantees. This is done by augmenting an AI model's decisions with prediction sets evaluated on the basis of held-out data (Vovk et al., 2005; Angelopoulos et al., 2024b).

CP uses held-out calibration data to infer statistical information about the distribution of the errors made by the pre-trained AI model. Based on this analysis, CP associates with each decision of the AI model a prediction set that includes all the outputs that are consistent with the inferred error statistics. Specifically, CP calculates a quantile of the performance scores attained by the pre-trained model on the calibration data (Vovk et al., 2005; Lei et al., 2018; Barber et al., 2021). The prediction set provably meets marginal coverage guarantees for a user-defined target. Accordingly, the correct output is included in the prediction set with high probability with respect to the distribution of calibration and test data.

In light of its simplicity and strong theoretical properties, CP has recently been developed in several directions, including to address more general risk functions (Angelopoulos et al., 2024c), to provide localized statistical guarantees (Gibbs et al., 2025), to operate via an online feedback-based mechanism (Gibbs & Candes, 2021), and to incorporate Bayesian inference mechanisms (Clarté & Zdeborová, 2024). Furthermore, it has been applied in safety-critical areas such as medical diagnostics (Lu et al., 2022) and LLMs (Quach et al., 2024; Mohri & Hashimoto, 2024). As mentioned, CP requires access to a data set of calibration data points.
However, in practice, a decision maker may not have sufficient calibration data stored locally. This is an important issue, since the size of the prediction set depends on the number of available calibration data points, and thus a small calibration data set would yield uninformative prediction sets (Zecchin et al., 2024). However, calibration data may be available in a decentralized fashion across multiple devices (Xu et al., 2023). Examples include diagnostic healthcare models at different hospitals, Internet-of-Things (IoT) systems, and autonomous vehicle networks.

Figure 1. (a) This work studies a decentralized inference setting in which multiple devices share the same pre-trained model, and each device has a local calibration data set. (b) Given a common input, the devices aim at producing a prediction set (e.g., {leopard, tiger, cat}) that includes the true label with a user-defined probability $1-\alpha$. (c) We propose two message-passing schemes for conformal prediction (CP): quantile-based distributed CP (Q-DCP) addresses the decentralized optimization of the pinball, or quantile, loss over the calibration scores, while histogram-based distributed CP (H-DCP) targets the consensus-based estimation of the histogram of the calibration scores.

In many of these scenarios, the distributed devices are privacy-conscious, preventing a direct exchange of the local calibration data sets. With this motivation, prior art has studied settings in which multiple data-holding devices are connected to the decision-making device via capacity-limited links in a star topology (Humbert et al., 2023; Lu et al., 2023; Plassier et al., 2023; Zhu et al., 2024), which is widely adopted for federated learning (McMahan et al., 2017; Kairouz et al., 2021).
In these systems, the central device collects information from the devices to estimate the relevant global quantile of the calibration data distributed across the devices. In particular, in (Humbert et al., 2023), the global quantile is estimated via a quantile-of-quantiles approach, in which the devices transmit their local quantiles to the central server, and the quantile of the received quantiles is calculated at the central server. In (Zhu et al., 2024), the central server estimates the average of the local histograms of quantized calibration scores and then estimates the global quantile based on the average histogram.

1.2. Main Contributions

The star topology assumed in the prior art does not reflect many important scenarios of interest in which communication is inherently local, being limited to the neighbors of each device. As illustrated in Fig. 1(a), this paper studies CP in a fully decentralized architecture in which the data-holding devices can only communicate via message passing on a connectivity graph. In this setting, all devices are decision-makers that have access to limited local calibration data and share a common pre-trained model. As shown in Fig. 1(b), given a common input, the devices aim at producing a prediction set that includes the true label with a user-defined probability.

This paper proposes two distributed CP (DCP) approaches: quantile-based DCP (Q-DCP) and histogram-based DCP (H-DCP). Q-DCP employs distributed quantile regression enhanced with tailored smoothing and regularization terms to accelerate convergence via message passing, while H-DCP uses a consensus-based histogram estimation approach inspired by (Zhu et al., 2024). Specifically, the main contributions of this work are as follows:

1. We introduce Q-DCP, a message-passing CP protocol based on quantile regression.
Q-DCP addresses a decentralized quantile regression problem by means of the alternating direction method of multipliers (ADMM) (Boyd et al., 2011), with tailored smoothing and regularization to accelerate the convergence speed. Q-DCP guarantees marginal coverage, like centralized CP, as long as hyperparameters related to the network topology and to the initialization are properly selected.

2. We also introduce H-DCP, a decentralized CP protocol that attains hyperparameter-free coverage guarantees. H-DCP estimates the global histogram of quantized local scores via a consensus algorithm.

3. Experiments explore the trade-offs between hyperparameter tuning requirements, communication overhead, coverage guarantees, and prediction set sizes for Q-DCP and H-DCP across different network topologies. While H-DCP requires a larger communication load than Q-DCP, its coverage guarantees are not tied to hyperparameter selection.

1.3. Notations

We use lower-case letters ($x$) to denote scalars; upper-case letters ($X$) to denote random variables; bold lower-case letters ($\mathbf{x}$) to denote vectors; bold upper-case letters ($\mathbf{X}$) to denote matrices; and script letters ($\mathcal{X}$) to denote sets. We denote by $\|x\|$ the $\ell_2$-norm of a vector $x$; by $\mathbb{I}\{\cdot\}$ the indicator function; by $\|x\|_G = \sqrt{x^T G x}$ the $G$-norm of a vector $x$ with a positive semidefinite matrix $G$; by $\sigma_{\max}(X)$ and $\sigma_{\min}(X)$ the largest singular value and the smallest non-zero singular value of a matrix $X$, respectively; and by $\lambda_i(X)$ the $i$-th largest eigenvalue of a matrix $X$.

2. Problem Description

We study a network of $K$ devices in which each device $k$ has access to a local calibration data set $\mathcal{D}_k = \{(X_{i,k}, Y_{i,k})\}_{i=1}^{n_k}$, with data points $(X_{i,k}, Y_{i,k})$ drawn i.i.d. from a distribution $P_k$. We define the global data set as the union $\mathcal{D} = \bigcup_{k=1}^K \mathcal{D}_k$, and $n = \sum_{k=1}^K n_k$ as the total number of data points.
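As a running numerical illustration of this data model (a synthetic sketch: the uniform scores, the seed, and all variable names are our own choices, with $K = 20$ and $n_k = 50$ matching the experimental setting of Sec. 6), the decentralized calibration data can be set up as:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 20                      # number of devices
n_k = [50] * K              # local calibration set sizes n_k
# Stand-ins for per-device calibration data; in practice each device would
# hold labeled examples (X_{i,k}, Y_{i,k}) and score them with the shared
# pre-trained model (see Sec. 3.1).
local_scores = [rng.uniform(0.0, 1.0, size=m) for m in n_k]

n = sum(n_k)                                   # total number of data points
global_scores = np.concatenate(local_scores)   # pooled multiset over all devices
assert global_scores.size == n
```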
Using the local data sets $\{\mathcal{D}_k\}_{k=1}^K$ and message-passing-based communication, the devices aim to collaboratively calibrate the decisions of a shared pre-trained model $f : \mathcal{X} \to \mathcal{Y}$. To elaborate, assume that a test input $X_{\text{test}}$ is available at all devices. This may represent, e.g., a common observation in a sensor network or a user query distributed to multiple servers. Given a target coverage level $1-\alpha$ for $\alpha \in [0, 1]$, the goal of the system is to determine a set-valued function $C(\cdot|\mathcal{D}) : \mathcal{X} \to 2^{\mathcal{Y}}$ with marginal coverage guarantees. Specifically, following (Lu et al., 2023), given a test data point $(X_{\text{test}}, Y_{\text{test}})$ drawn from the mixture distribution

$P = \sum_{k=1}^K w_k P_k$,  (1)

for some weights $w_k \geq 0$ with $\sum_{k=1}^K w_k = 1$, we impose the coverage requirement

$\mathbb{P}(Y_{\text{test}} \in C(X_{\text{test}}|\mathcal{D})) \geq 1 - \alpha$.  (2)

This condition requires that the test label $Y_{\text{test}}$ belong to the prediction set $C(X_{\text{test}}|\mathcal{D})$ with probability at least $1-\alpha$. The probability in (2) is evaluated over the calibration data in $\mathcal{D}$ and the test data $(X_{\text{test}}, Y_{\text{test}})$. As in (Lu et al., 2023), we will specifically concentrate on the choice

$w_k \propto n_k + 1$,  (3)

in which the weight $w_k$ for each device $k$ is proportional to the size of the local data set $\mathcal{D}_k$.

The average size of the output of $C(\cdot|\mathcal{D})$, referred to as the inefficiency, is defined as the expectation $\mathbb{E}|C(X_{\text{test}}|\mathcal{D})|$, where the average is evaluated over $X_{\text{test}}$ and over the distribution of the data $\mathcal{D}$. Ideally, the prediction set $C(X_{\text{test}}|\mathcal{D})$ minimizes the inefficiency while satisfying the constraint (2).

As shown in Fig. 1(a), in order to produce the prediction set $C(X_{\text{test}}|\mathcal{D})$, the devices can communicate over a connectivity graph. The connectivity graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ is undirected, with $\mathcal{V}$ denoting the set of devices, $|\mathcal{V}| = K$, and $\mathcal{E} \subseteq \{(i, j) \in \mathcal{V} \times \mathcal{V} \,|\, i \neq j\}$ denoting the set of edges, $|\mathcal{E}| = E$. The connectivity of the graph is characterized by the $2E \times K$ incidence matrix $A = [A_1^T, A_2^T]^T$ with submatrices $A_1, A_2 \in \mathbb{R}^{E \times K}$. Denoting the index corresponding to an edge $(i, j) \in \mathcal{E}$ by $q \in \{1, \ldots$
, $|\mathcal{E}|\}$, the $(q, i)$-th entry of matrix $A_1$ and the $(q, j)$-th entry of matrix $A_2$ equal 1 if $(i, j)$ is an edge in graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, i.e., $(i, j) \in \mathcal{E}$, and they equal zero otherwise. Each device $k$ can only communicate with its set of neighbors, denoted as $\mathcal{N}_k = \{j \in \mathcal{V} \,|\, (j, k) \in \mathcal{E}\}$, under communication constraints to be specified in Sec. 4.

3. Background

3.1. Split Conformal Prediction

Split CP provides a general framework for the design of post-hoc calibration strategies satisfying the coverage requirement (2) in centralized settings. CP constructs the prediction set $C(\cdot|\mathcal{D})$ based on a calibration data set $\mathcal{D}$. This is done by evaluating a quantile, dependent on the largest miscoverage level $\alpha$, of the scores assigned by the pre-trained model $f$ to the calibration data points in the set $\mathcal{D}$.

To elaborate, define as $s(x, y)$ a negatively oriented score derived from model $f$, such as the absolute error $s(x, y) = |y - f(x)|$ for regression or the log-loss $s(x, y) = -\log f_y(x)$ for classification. Write as $S_i = s(X_i, Y_i)$ the score assigned by the model $f$ to the $i$-th calibration data point $(X_i, Y_i)$ in the data set $\mathcal{D}$. The prediction set is evaluated by including all the labels $y \in \mathcal{Y}$ with a score no larger than a fraction, approximately $1-\alpha$, of the calibration scores $\{S_i\}_{i=1}^n$. Specifically, CP produces the set

$C(X|\mathcal{D}) = \{y \in \mathcal{Y} : s(X, y) \leq Q((1-\alpha)(1 + 1/n); \{S_i\}_{i=1}^n)\}$,  (4)

where $Q(\gamma; \{S_i\}_{i=1}^n)$ is the $\lceil \gamma n \rceil$-th smallest value of the set $\{S_i\}_{i=1}^n$, which is set as the score threshold. The empirical $\gamma$-quantile $Q(\gamma; \{S_i\}_{i=1}^n)$ in (4) can be obtained by solving the following quantile regression problem (Koenker & Bassett Jr, 1978)

$Q(\gamma; \{S_i\}_{i=1}^n) = \arg\min_{s' \in \mathbb{R}} \rho_\gamma(s'\,|\,\{S_i\}_{i=1}^n)$,  (5)

where $\rho_\gamma(\cdot\,|\,\{S_i\}_{i=1}^n)$ is the pinball loss function defined as

$\rho_\gamma(s'\,|\,\{S_i\}_{i=1}^n) = \gamma \sum_{i=1}^n \mathrm{ReLU}(S_i - s') + (1-\gamma) \sum_{i=1}^n \mathrm{ReLU}(s' - S_i)$,  (6)

with $\mathrm{ReLU}(x) = \max(x, 0)$.

3.2.
Distributed Conformal Prediction on a Star Topology

The calculation of the empirical quantile in the prediction set (4) requires full knowledge of the scores of all calibration data points in the set $\mathcal{D}$. When the data points are distributed across privacy-sensitive and communication-constrained devices, as in the setting under study in this paper, evaluating the empirical quantile is not possible unless communication is enabled. Prior work has studied a federated setting in which all devices are connected to a centralized server (Humbert et al., 2023; Lu et al., 2023; Plassier et al., 2023; Zhu et al., 2024; 2025; Kang et al., 2024). For reference, we briefly introduce FedCP-QQ (Humbert et al., 2023), FCP (Lu et al., 2023), and WFCP (Zhu et al., 2024) next.

FedCP-QQ: Denote as $\{S_{i,k} = s(X_{i,k}, Y_{i,k})\}_{i=1}^{n_k}$ the scores evaluated using the shared model $f$ on the local data set $\mathcal{D}_k$ of device $k$. In FedCP-QQ (Humbert et al., 2023), each device $k$ first calculates the empirical $(1-\alpha')$-quantile of its local scores, for some probability $\alpha'$, which is transmitted to the centralized server. After collecting the $K$ local quantiles, the central server calculates the empirical $(1-\beta)$-quantile of the received quantiles, thus obtaining a quantile-of-quantiles. By optimizing the probabilities $\alpha'$ and $\beta$, the prediction set (4) is constructed using the quantile-of-quantiles as the threshold. This approach satisfies the coverage condition (2) if the data sets $\mathcal{D}_k$ are i.i.d. across devices.

FCP: While FedCP-QQ assumes i.i.d. data sets across devices, FCP (Lu et al., 2023) adopts the model described in Sec. 2, allowing data points from different local data sets to be drawn from different distributions $\{P_k\}_{k=1}^K$, while the test data point $(X_{\text{test}}, Y_{\text{test}})$ is drawn from a mixture $P = \sum_{k=1}^K w_k P_k$ of the local distributions. In FCP, the central server collects the local scores from the $K$ devices, and then calculates the empirical $(1-\alpha)(1+K/n)$-quantile.
This value is used as the threshold to construct the prediction set $C(X|\mathcal{D})$ in (4). FCP satisfies the coverage condition (2) by choosing the weight $w_k$ to be proportional to $n_k + 1$, i.e., $w_k \propto n_k + 1$.

WFCP: Unlike FedCP-QQ and FCP, which communicate quantiles between devices and server, Zhu et al. (2024) proposed to exchange information about the local histograms of the quantized scores. Specifically, the scores $\{S_{i,k}\}_{i=1}^{n_k}$ are first quantized to $M$ levels by each device $k$, which then evaluates the histogram vector $p_k = [p_{1,k}, p_{2,k}, \ldots, p_{M,k}]^T \in \mathbb{R}^M$ of the scores. The vectors $p_k$, with $k = 1, 2, \ldots, K$, are synchronously transmitted to the centralized server over an additive multi-access channel, and the server estimates the average histogram $\bar{p} = \frac{1}{K}\sum_{k=1}^K p_k$ based on the received signal. The empirical $(1-\alpha')$-quantile of all the scores is then estimated from the estimated average histogram $\bar{p}$ for some optimized value $\alpha'$. Under the non-i.i.d. setting described in Sec. 2, WFCP can guarantee the coverage condition (2).

4. Quantile-based Distributed CP

In this section, we present the first fully decentralized protocol proposed in this work, which is referred to as quantile-based distributed CP (Q-DCP).

4.1. Decentralized Quantile Regression via ADMM

Q-DCP addresses the quantile regression problem (5) in the fully decentralized setting described in Sec. 2 using ADMM (Boyd et al., 2011). This strategy obtains an approximation of the empirical quantile $Q((1-\alpha)(1+K/n); \{S_i\}_{i=1}^n)$ with a controlled error, and it requires only a single scalar broadcast from each device to its neighbors at each optimization iteration. As will be discussed, on the flip side, Q-DCP requires careful hyperparameter tuning in order to satisfy the coverage condition (2). The alternative strategy proposed in the next section alleviates such requirements on hyperparameter tuning at the cost of a larger per-iteration communication cost.
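For reference, the centralized quantities that Q-DCP approximates — the empirical quantile (5), its characterization as a pinball-loss minimizer (6), and the resulting split-CP set (4) — can be sketched as follows (a minimal sketch; the scores, labels, and toy score function are illustrative, and the function names are ours):

```python
import numpy as np

def empirical_quantile(scores, gamma):
    """Q(gamma; {S_i}): the ceil(gamma * n)-th smallest calibration score."""
    s = np.sort(np.asarray(scores, dtype=float))
    rank = int(np.ceil(gamma * s.size))
    rank = min(max(rank, 1), s.size)          # clip to a valid order statistic
    return s[rank - 1]

def pinball(s, scores, gamma):
    """Pinball (quantile) loss (6); its minimizers include the gamma-quantile."""
    scores = np.asarray(scores, dtype=float)
    return (gamma * np.maximum(scores - s, 0.0).sum()
            + (1.0 - gamma) * np.maximum(s - scores, 0.0).sum())

def split_cp_set(score_fn, x, labels, scores, alpha):
    """Split-CP prediction set (4) with the (1 - alpha)(1 + 1/n) threshold."""
    n = len(scores)
    q = empirical_quantile(scores, (1.0 - alpha) * (1.0 + 1.0 / n))
    return [y for y in labels if score_fn(x, y) <= q]
```

For instance, with 100 equally spaced calibration scores and $\alpha = 0.5$, the threshold is the 51st smallest score, and only labels scoring below it enter the set.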
Formally, in Q-DCP, the $K$ devices collaboratively solve the quantile regression problem

$\min_{s \in \mathbb{R}} \sum_{k=1}^K \rho_{(1-\alpha)(1+K/n)}(s\,|\,\{S_{i,k}\}_{i=1}^{n_k})$,  (7)

where the target level $(1-\alpha)(1+K/n)$ follows FCP (Lu et al., 2023) (see the previous section). The objective function in (7) is convex, but it is neither smooth nor strongly convex. For such an objective function, a direct application of ADMM would exhibit a sub-linear convergence rate, causing a large optimality gap when the number of communication rounds is limited (He & Yuan, 2012).

To ensure a linear convergence rate, we propose to replace the ReLU operation in (6) with the smooth function $g(x) = x + (1/\kappa)\log(1 + e^{-\kappa x})$, where a smaller $\kappa$ leads to greater smoothness at the cost of a larger approximation error. In the limit, the function $g(x)$ coincides with $\mathrm{ReLU}(x)$, i.e., $\lim_{\kappa \to \infty} g(x) = \mathrm{ReLU}(x)$ (Chen & Mangasarian, 1996). Furthermore, to guarantee strong convexity, we add a regularization term to the pinball loss function (6), yielding

$\tilde{\rho}_\gamma(s\,|\,\{S_i\}_{i=1}^n) = \gamma \sum_{i=1}^n g(S_i - s) + (1-\gamma) \sum_{i=1}^n g(s - S_i) + \frac{\mu}{2}(s - s_0)^2$,  (8)

where $\mu > 0$ and $s_0$ are hyperparameters. The modified pinball loss function $\tilde{\rho}_\gamma(\cdot\,|\,\{S_i\}_{i=1}^n)$ is $L$-smooth and $\mu$-strongly convex, with $L = n\kappa/4 + \mu$. Using the smooth and strongly convex loss in (8), the decentralized quantile regression problem (7) is modified as

$\min_{s \in \mathbb{R}} \sum_{k=1}^K \tilde{\rho}_{(1-\alpha)(1+K/n)}(s\,|\,\{S_{i,k}\}_{i=1}^{n_k})$.  (9)

Note that the (unique) solution of problem (9), denoted by $\hat{s}^\star$, is generally different from the optimal solution $s^\star$ of problem (7), unless $\kappa$ tends to infinity and $\mu$ is set to zero.

Q-DCP adopts ADMM to address problem (9), obtaining the formulation

$\min_{s \in \mathbb{R}^K,\, z \in \mathbb{R}^E} \; f(s) = \sum_{k=1}^K \tilde{\rho}_{(1-\alpha)(1+K/n)}(s_k\,|\,\{S_{i,k}\}_{i=1}^{n_k})$  subject to  $As + Bz = 0$,  (10)

where $A \in \mathbb{R}^{2E \times K}$ is the incidence matrix defined in Sec.
2, $B = [-I_E; -I_E] \in \mathbb{R}^{2E \times E}$, and $z$ is an auxiliary variable imposing the consensus constraint on neighboring devices, i.e., $s_i = z_q$ and $s_j = z_q$ if $(i, j) \in \mathcal{E}$, with $q$ being the corresponding index of edge $(i, j)$.

4.2. Description of Q-DCP

Following ADMM (Boyd et al., 2011), Q-DCP solves problem (10) by considering the augmented Lagrangian

$L_c(s, z, \lambda) = f(s) + \langle \lambda, As + Bz \rangle + \frac{c}{2}\|As + Bz\|_2^2$,  (11)

where $\lambda \in \mathbb{R}^{2E}$ is the Lagrange multiplier associated with the $2E$ equality constraints in (10), and $c > 0$ is a hyperparameter. The updates of the local quantile estimates $s$, the consensus variables $z$, and the dual variables $\lambda$ at iteration $t+1$ are given by (Boyd et al., 2011)

$s^{(t+1)} = \arg\min_{s \in \mathbb{R}^K} L_c(s, z^{(t)}, \lambda^{(t)})$,  (12a)
$z^{(t+1)} = \arg\min_{z \in \mathbb{R}^E} L_c(s^{(t+1)}, z, \lambda^{(t)})$,  (12b)
$\lambda^{(t+1)} = \lambda^{(t)} + c\,(As^{(t+1)} + Bz^{(t+1)})$.  (12c)

After $T$ iterations, ADMM produces the quantile estimate $s_k^{(T)}$ at each device $k \in \mathcal{V}$. Then, the devices run the fast distributed averaging algorithm (Xiao & Boyd, 2004) to obtain the average $\bar{s}^{(T)} = \frac{1}{K}\sum_{k \in \mathcal{V}} s_k^{(T)}$ with negligible error. Finally, the prediction set is constructed as

$C^{\text{Q-DCP}}(X|\mathcal{D}) = \{y \in \mathcal{Y} : s(X, y) \leq s^{\text{Q-DCP}}\}$,  (13)

with

$s^{\text{Q-DCP}} = \bar{s}^{(T)} + \epsilon^{\text{Q-DCP}}$,  (14)

where $\epsilon^{\text{Q-DCP}}$ is an upper bound on the error $|\bar{s}^{(T)} - s^\star|$ of the quantile estimate $\bar{s}^{(T)}$, to be derived in the next subsection. The proposed Q-DCP is summarized in Algorithm 1.
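To make the updates (12a)-(12c) concrete, the following is a minimal, centrally simulated sketch of the decentralized ADMM iteration (our own simplifications: the per-device scalar update (12a) is solved by bisection on the strictly increasing gradient of the smoothed loss (8), and all parameter values are illustrative, not the paper's):

```python
import numpy as np

def sigmoid(t):
    """Numerically stable logistic function, applied elementwise."""
    t = np.asarray(t, dtype=float)
    e = np.exp(-np.abs(t))
    return np.where(t >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def loss_grad(s, scores, gamma, kappa, mu, s0):
    """Gradient in s of the smoothed, regularized pinball loss (8),
    using g'(x) = sigmoid(kappa * x) for g(x) = x + (1/kappa) log(1 + e^{-kappa x})."""
    return (-gamma * sigmoid(kappa * (scores - s)).sum()
            + (1.0 - gamma) * sigmoid(kappa * (s - scores)).sum()
            + mu * (s - s0))

def q_dcp_admm(local_scores, edges, alpha, c=10.0, T=400,
               kappa=200.0, mu=0.1, s0=0.5):
    """Simulation of the ADMM updates (12a)-(12c) on a graph given by `edges`.
    Device k holds a scalar estimate s[k]; z has one entry per edge."""
    K, E = len(local_scores), len(edges)
    n = sum(len(sc) for sc in local_scores)
    gamma = min((1.0 - alpha) * (1.0 + K / n), 1.0)
    deg = np.zeros(K)
    for i, j in edges:
        deg[i] += 1.0
        deg[j] += 1.0
    s, z = np.zeros(K), np.zeros(E)
    lam = np.zeros((2, E))              # one dual per constraint row of (10)
    for _ in range(T):
        # (12a): per-device strictly convex scalar problem, solved by bisection
        lin = np.zeros(K)               # collects A^T lambda and -c z terms
        for q, (i, j) in enumerate(edges):
            lin[i] += lam[0, q] - c * z[q]
            lin[j] += lam[1, q] - c * z[q]
        for k in range(K):
            lo, hi = -50.0, 51.0
            for _ in range(80):
                mid = 0.5 * (lo + hi)
                g = (loss_grad(mid, local_scores[k], gamma, kappa, mu, s0)
                     + lin[k] + c * deg[k] * mid)
                lo, hi = (mid, hi) if g < 0 else (lo, mid)
            s[k] = 0.5 * (lo + hi)
        # (12b): z_q = midpoint of the edge endpoints plus a dual correction
        for q, (i, j) in enumerate(edges):
            z[q] = 0.5 * (s[i] + s[j]) + (lam[0, q] + lam[1, q]) / (2.0 * c)
        # (12c): dual ascent on the consensus residuals s_i - z_q and s_j - z_q
        for q, (i, j) in enumerate(edges):
            lam[0, q] += c * (s[i] - z[q])
            lam[1, q] += c * (s[j] - z[q])
    return s  # per-device estimates; their average plays the role of s_bar^(T)
```

On a small chain of devices with bounded scores, the per-device estimates agree up to a small consensus residual and approach the pooled empirical $(1-\alpha)(1+K/n)$-quantile, up to the smoothing and regularization bias discussed in Sec. 4.3.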
Algorithm 1: Q-DCP
Input: ADMM parameters $c > 0$ and $b > 1$, number of iterations $T$, incidence matrix $A$, smoothness parameter $L$, regularization parameters $(\mu, s_0, \epsilon_0)$, coverage level $1-\alpha$, and score $s(\cdot, \cdot)$
Initialize $s^{(0)} = z^{(0)} = \lambda^{(0)} = 0$ and $t = 0$
ADMM (Boyd et al., 2011):
while $t < T$ do
  Update $s^{(t+1)}$ by solving $\nabla f(s^{(t+1)}) + A^T \lambda^{(t)} + c A^T (A s^{(t+1)} + B z^{(t)}) = 0$
  $z^{(t+1)} = -(c B^T B)^{-1} B^T (c A s^{(t+1)} + \lambda^{(t)})$
  $\lambda^{(t+1)} = \lambda^{(t)} + c A s^{(t+1)} + c B z^{(t+1)}$
  $t \leftarrow t + 1$
end while
Prediction set construction:
for each device $k \in \mathcal{V}$ do
  Run distributed averaging (Xiao & Boyd, 2004) to obtain $\bar{s}^{(T)} = \frac{1}{K}\sum_{k=1}^K s_k^{(T)}$
  Calculate $\epsilon^{(T)}$ and $\tilde{\epsilon}_0$ using (17) and (16), respectively, and set $s^{\text{Q-DCP}} = \bar{s}^{(T)} + \epsilon^{(T)} + \tilde{\epsilon}_0$
end for
Output: $C^{\text{Q-DCP}}(\cdot|\mathcal{D}) = \{y \in \mathcal{Y} : s(\cdot, y) \leq s^{\text{Q-DCP}}\}$

4.3. Theoretical Guarantees

In this section, we first derive the upper bound $\epsilon^{\text{Q-DCP}}$ on the estimation error of Q-DCP used in constructing the prediction set (13). Then we prove that Q-DCP attains the coverage guarantee (2).

By the triangle inequality, the estimation error $|\bar{s}^{(T)} - s^\star|$ can be upper bounded as

$|\bar{s}^{(T)} - s^\star| \leq |\bar{s}^{(T)} - \hat{s}^\star| + |\hat{s}^\star - s^\star|$,  (15)

where the first term accounts for the convergence error for problem (10), while the second term quantifies the bias caused by the use of the smooth and strongly convex approximation (8).

The bias term can be bounded as follows. We start with the following assumption.

Assumption 4.1. The regularization parameter $s_0$ and the optimal solution $s^\star$ in (5) differ by at most $\epsilon_0 \geq 0$, i.e., $|s_0 - s^\star| \leq \epsilon_0$.

Proposition 4.2. Under Assumption 4.1, the bias $|\hat{s}^\star - s^\star|$ is upper bounded as

$|\hat{s}^\star - s^\star| \leq \sqrt{\frac{2n\ln 2}{\mu\kappa} + \epsilon_0^2} \triangleq \tilde{\epsilon}_0$.  (16)

Proof: See supplementary material (Appendix A.1).

Next, the convergence error $|\bar{s}^{(T)} - \hat{s}^\star|$ is bounded by leveraging existing results on the convergence of ADMM.

Proposition 4.3 (Theorem 1, Shi et al., 2014).
The convergence error $|\bar{s}^{(T)} - \hat{s}^\star|$ is upper bounded as

$|\bar{s}^{(T)} - \hat{s}^\star| \leq (1+\delta)^{-\frac{T-1}{2}}\, \|u^{(0)} - u^\star\|_G \triangleq \epsilon^{(T)}$,  (17)

where the parameter $\delta > 0$ is defined as

$\delta = \min\left\{ \dfrac{(b-1)\,\sigma_{\min}^2(M_-)}{b\,\sigma_{\max}^2(M_+)},\; \dfrac{\mu}{\frac{c}{4}\sigma_{\max}^2(M_+) + \frac{b}{c}\,L^2\,\sigma_{\min}^{-2}(M_-)} \right\}$,  (18)

with any $b > 1$, $M_- = A_1^T - A_2^T$, $M_+ = A_1^T + A_2^T$, and

$G = \begin{bmatrix} c I_E & 0_E \\ 0_E & \frac{1}{c} I_E \end{bmatrix}$.

Moreover, $u^{(0)} = [(z^{(0)})^T, (\bar{\lambda}^{(0)})^T]^T$, with $\bar{\lambda}^{(0)} \in \mathbb{R}^E$ containing the first $E$ entries of the vector $\lambda^{(0)}$, and we also write as $u^\star = [(z^\star)^T, (\bar{\lambda}^\star)^T]^T$ the $2E \times 1$ vector collecting the optimal primal solution $z^\star$ and the optimal Lagrange multipliers $\bar{\lambda}^\star$ of problem (10).

The upper bound (17) depends on the initial error, quantified by the distance $\|u^{(0)} - u^\star\|_G$, and on the connectivity of the graph. In fact, a larger connectivity is reflected by a larger ratio $\sigma_{\min}(M_-)/\sigma_{\max}(M_+)$ (Fiedler, 1973), and thus a larger parameter $\delta$ in (17).

Using the obtained bounds (16) and (17) in (15), we obtain the following main result on the performance of Q-DCP.

Theorem 4.4 (Coverage Guarantee for Q-DCP). The prediction set $C^{\text{Q-DCP}}(\cdot|\mathcal{D})$ in (13) produced by Q-DCP satisfies the coverage condition (2) with weights selected as in (3).

Proof: See supplementary material (Appendix A.2).

5. Histogram-based Distributed CP

As demonstrated by Theorem 4.4, Q-DCP requires knowledge of a parameter $\epsilon_0$ satisfying Assumption 4.1, as well as of the parameter $\delta$ in (17), which depends on the connectivity properties of the graph through the incidence matrix $A$. In this section, we present an alternative fully decentralized protocol, referred to as histogram-based DCP (H-DCP), which provides rigorous coverage guarantees without the need for hyperparameter tuning as in Q-DCP, but at the cost of a larger communication overhead.

In H-DCP, devices exchange histograms of quantized calibration scores in a manner similar to WFCP (Zhu et al., 2024). As a result of a consensus algorithm, the devices evaluate the average histogram in order to yield an estimate of the mixture distribution $P$ in (1) with weights (3).
From this estimate, a suitable bound is derived on the $(1-\alpha)(1+K/n)$-quantile of the distribution $P$ to evaluate a prediction set of the form (4).

5.1. Description of H-DCP

To start, in H-DCP, all devices apply a uniform scalar quantizer $\Gamma(\cdot) : [0, 1] \to \{1/M, 2/M, \ldots, 1\}$, with step size $1/M$, to all the local calibration scores, under the following assumption.

Assumption 5.1. The score function $s(\cdot, \cdot)$ is bounded, i.e., without loss of generality, $0 \leq s(\cdot, \cdot) \leq 1$.

The quantizer is formally defined as

$\Gamma(s) = \frac{1}{M}$ for $s \in [0, \frac{1}{M}]$, and $\Gamma(s) = \frac{m}{M}$ for $s \in (\frac{m-1}{M}, \frac{m}{M}]$, $m = 2, \ldots, M$.  (19)

With this quantizer, each device $k \in \mathcal{V}$ evaluates the local histogram $p_k = [p_{1,k}, \ldots, p_{M,k}]^T$ associated with the local quantized scores $\{\Gamma(S_{i,k})\}_{i=1}^{n_k}$, with $\sum_{m=1}^M p_{m,k} = 1$. Accordingly, this vector includes the entries

$p_{m,k} = \frac{1}{n_k}\sum_{i=1}^{n_k} \mathbb{I}\{\Gamma(S_{i,k}) = \frac{m}{M}\}$,  (20)

with $p_{m,k}$ representing the fraction of the quantized scores associated with the $m$-th quantization level at device $k \in \mathcal{V}$. The weighted sum

$p_m = \frac{1}{n}\sum_{k=1}^K n_k\, p_{m,k} = \frac{1}{n}\sum_{k=1}^K \sum_{i=1}^{n_k} \mathbb{I}\{\Gamma(S_{i,k}) = \frac{m}{M}\}$  (21)

corresponds to the fraction of quantized scores equal to $m/M$ in the set of quantized scores for the global data set $\mathcal{D}$, defining the global histogram vector $p = [p_1, \ldots, p_M]^T$.

In H-DCP, the devices aim at estimating the global score histogram $p$ in (21) in order to obtain an estimate of the $(1-\alpha)(1+K/n)$-quantile of the quantized calibration scores, i.e.,

$\frac{\bar{m}_\alpha}{M} = Q\big((1-\alpha)(1+K/n); \bigcup_{k=1}^K \{\Gamma(S_{i,k})\}_{i=1}^{n_k}\big)$, with $\bar{m}_\alpha = \arg\min_{m=1,\ldots,M} \big\{ m : \sum_{i=1}^m p_i \geq (1-\alpha)\big(1+\tfrac{K}{n}\big) \big\}$.  (22)

By the design of the quantizer (19), this quantile provides an upper bound on the $(1-\alpha)(1+K/n)$-quantile of the unquantized calibration scores. To estimate the sum (21), the devices apply consensus with a matrix $W \in \mathbb{R}^{K \times K}$ satisfying $W = W^T$, $W\mathbf{1} = \mathbf{1}$ (equivalently, $\mathbf{1}^T W = \mathbf{1}^T$), and $\|W - \mathbf{1}\mathbf{1}^T/K\|_2 < 1$, with entries $[W]_{k,j} = W_{k,j} > 0$ if $(k, j) \in \mathcal{E}$ and $[W]_{k,j} = 0$ otherwise.
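The per-device quantization and local-histogram steps (19)-(20) above can be sketched as follows (a minimal sketch; function names are ours):

```python
import numpy as np

def quantize_idx(scores, M):
    """Index m in {1, ..., M} of the quantizer (19); the level value is m / M.
    Scores in ((m-1)/M, m/M] map to m; scores in [0, 1/M] map to 1."""
    m = np.ceil(np.asarray(scores, dtype=float) * M).astype(int)
    return np.clip(m, 1, M)

def local_histogram(scores, M):
    """Local histogram p_k of (20): fraction of quantized scores per level."""
    counts = np.bincount(quantize_idx(scores, M) - 1, minlength=M)
    return counts / counts.sum()
```

For example, with $M = 10$, the scores 0.05 and 0.1 both land on level 1 (value $1/10$), while 0.11 lands on level 2.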
Following the fast distributed averaging algorithm (Xiao & Boyd, 2004), each device $k \in \mathcal{V}$ updates a local vector $x_k$ via the linear consensus step

$x_k^{(t+1)} = x_k^{(t)} + \eta \sum_{j=1}^K W_{kj}\,(x_j^{(t)} - x_k^{(t)})$,  (23)

where we define the initialization $x_k^{(0)} = \frac{K n_k}{n} p_k$, such that $p = \frac{1}{K}\sum_{k=1}^K x_k^{(0)}$, and $\eta \in (0, 1]$ is the update rate. Finally, H-DCP constructs the prediction set at device $k$ as

$C_k^{\text{H-DCP}}(X|\mathcal{D}) = \{y \in \mathcal{Y} : \Gamma(s(X, y)) \leq \frac{m_{\alpha,k}^{\text{H-DCP}}}{M}\}$,  (24)

where

$m_{\alpha,k}^{\text{H-DCP}} = \arg\min_{m=1,\ldots,M} \big\{ m : \sum_{i=1}^m x_{i,k}^{(T)} \geq (1-\alpha)\big(1+\tfrac{K}{n}\big) + \epsilon^{\text{H-DCP}} \big\}$,  (25)

and $\epsilon^{\text{H-DCP}}$ is a parameter, to be determined below, that upper bounds the error of the local estimate $x_k^{(T)}$ of the global vector $p$ at device $k \in \mathcal{V}$ after $T$ global communication rounds. Note that if $(1-\alpha)(1+K/n) + \epsilon^{\text{H-DCP}} > 1$, we set $m_{\alpha,k}^{\text{H-DCP}} = M$. The proposed H-DCP is summarized in Algorithm 2.

Algorithm 2: H-DCP
Input: Consensus parameters $\eta > 0$ and matrix $W$, number of iterations $T$, quantization level $M$, coverage level $1-\alpha$, and score $s(\cdot, \cdot)$
Each device $k \in \mathcal{V}$ calculates the local histogram $p_k$ via (20)
Initialize $x_k^{(0)} = \frac{K n_k}{n} p_k$ and $t = 0$
Consensus:
while $t < T$ do
  for each device $k \in \mathcal{V}$ do
    $x_k^{(t+1)} = x_k^{(t)} + \eta \sum_{j=1}^K W_{kj}(x_j^{(t)} - x_k^{(t)})$
  end for
  $t \leftarrow t + 1$
end while
Prediction set construction:
for each device $k \in \mathcal{V}$ do
  Calculate $\epsilon^{\text{H-DCP}}$ via (26)
  Calculate $m_{\alpha,k}^{\text{H-DCP}}$ via (25)
end for
Output: For all $k \in \mathcal{V}$, construct $C_k^{\text{H-DCP}}(\cdot|\mathcal{D})$ via (24)

5.2. Theoretical Guarantee

In this subsection, we provide the main result on the coverage guarantee of H-DCP through the following theorem.

Theorem 5.2 (Coverage Guarantee for H-DCP). Setting

$\epsilon^{\text{H-DCP}} = (1 - \eta\hat{\zeta})^T\, \frac{K}{n}\sqrt{\max_{k \in \mathcal{V}}\{n_k^2\} + \frac{1}{K}\sum_{j=1}^K n_j^2}$,  (26)

where $\hat{\zeta} = 1 - |\lambda_2(W)|$ is the spectral gap of the consensus matrix $W$, the prediction sets $C_k^{\text{H-DCP}}(\cdot|\mathcal{D})$ in (24) produced by H-DCP for all $k \in \mathcal{V}$ satisfy the coverage condition (2) with weights chosen as in (3).

Proof: See supplementary material (Appendix A.3).

Theorem 5.2 shows that H-DCP only requires knowledge of the spectral gap $\hat{\zeta}$, which can be efficiently estimated in a fully decentralized manner (Muniraju et al., 2020).
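The consensus step (23) and the threshold selection (25) can be sketched as follows, using the eigenvalue-based consensus matrix described in the experimental section and, as our own simplification, the error margin $\epsilon^{\text{H-DCP}}$ set to zero:

```python
import numpy as np

def consensus_matrix(K, edges):
    """W with W_kj = a on edges and W_kk = 1 - |N_k| a, where
    a = 2 / (lambda_1(L) + lambda_{K-1}(L)) for the graph Laplacian L,
    as in the experimental section."""
    L = np.zeros((K, K))
    for i, j in edges:
        L[i, j] = L[j, i] = -1.0
        L[i, i] += 1.0
        L[j, j] += 1.0
    lam = np.sort(np.linalg.eigvalsh(L))[::-1]    # eigenvalues, descending
    a = 2.0 / (lam[0] + lam[K - 2])
    W = np.where(L == -1.0, a, 0.0)
    np.fill_diagonal(W, 1.0 - a * np.diag(L))
    return W

def h_dcp_thresholds(local_hists, sizes, edges, alpha, T=150, eta=1.0):
    """Run the consensus updates (23) and return each device's index (25),
    with the margin eps^H-DCP set to 0 for this illustration."""
    K, n = len(local_hists), sum(sizes)
    M = len(local_hists[0])
    # initialization x_k^(0) = (K n_k / n) p_k, whose average equals p
    x = np.stack([(K * nk / n) * np.asarray(p, dtype=float)
                  for p, nk in zip(local_hists, sizes)])
    W = consensus_matrix(K, edges)
    for _ in range(T):
        x = x + eta * (W @ x - x)
    target = (1.0 - alpha) * (1.0 + K / n)
    cdf = np.cumsum(x, axis=1)
    # smallest level m whose estimated CDF reaches the target (capped at M)
    return np.array([min(np.searchsorted(c, target) + 1, M) for c in cdf])
```

For instance, on a 4-device cycle where device $k$'s quantized scores all fall on level $k+1$, every device recovers the same threshold index after the consensus rounds.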
The flip side is that, while Q-DCP requires the exchange of a single real number at each iteration, H-DCP has a communication overhead that is $M$ times larger, requiring the exchange of an $M$-dimensional real vector.

6. Experimental Results

In this section, we report experimental results for the proposed decentralized CP protocols, Q-DCP and H-DCP, using conventional centralized CP as a benchmark. We first study Q-DCP and H-DCP separately, and then provide a comparison between Q-DCP and H-DCP as a function of the communication load. Throughout, we evaluate the performance in terms of coverage and average set size, also known as inefficiency (Vovk et al., 2005), for different network topologies.

6.1. Setting

As in Lu et al. (2023), we first train a shared model $f(\cdot)$ on the CIFAR-100 training data set to generate the score function $s(\cdot, \cdot)$. Calibration data, obtained from the CIFAR-100 test data, is distributed in a non-i.i.d. manner among $K = 20$ devices by assigning 5 unique classes to each device. Specifically, device 1 receives $n_1$ data points from classes 0-4, device 2 receives $n_2$ data points from classes 5-9, and so on. We set $n_k = 50$ for all devices $k \in \mathcal{V}$. Using the shared model $f(\cdot)$, the score function is defined as $s(x, y) = 1 - [f(x)]_y$, where $[f(x)]_y$ is the confidence assigned by the model to label $y$ (Sadinle et al., 2019). We extract the calibration data from the CIFAR-100 test data by sampling uniformly at random (without replacement), and results are averaged over 10 random generations of the calibration data. The set size is normalized by the overall number of classes, i.e., 100.

6.2. Results for Q-DCP

The hyperparameters for the Q-DCP loss (8) are chosen as follows. We set $\kappa = 2000$ for the smooth function $g(\cdot)$, as suggested by Nkansah et al. (2021), and we choose $\mu = 2000$.
Moreover, unless noted otherwise, in (8) we set $s_0$ to be the average of the local score quantiles, which can be evaluated via a preliminary consensus step. Note that the need to choose a suitable value for $s_0$ is related to the general problem of hyperparameter tuning for Q-DCP discussed in Sec. 4.3. This aspect is further elaborated below by studying the impact of the selection of the hyperparameter $\epsilon_0$ in (16).

The impact of the coverage level $1-\alpha$ on Q-DCP is studied in Fig. 2 for $T = 1500$ and $\epsilon_0 = 0.1$, while results on convergence can be found in the supplementary material (Appendix B.2). We specifically consider the chain, cycle, star, torus, and complete graph, which are listed in order of increasing connectivity. Validating Theorem 4.4, the figure shows that Q-DCP provides coverage guarantees, with more conservative prediction sets obtained over graphs with lower connectivity. In particular, with a complete graph, the assumed $T = 1500$ iterations are seen to be sufficient to obtain the same set size as centralized CP.

As detailed in Theorem 4.4, Q-DCP requires a carefully selected hyperparameter $\epsilon_0$, satisfying Assumption 4.1, in order to meet coverage guarantees. To verify this point, Fig. 3 shows coverage and set size as a function of $1-\alpha$ for the pair $(s_0 = 8 - 20\alpha, \epsilon_0 = 10^{-4})$, for which Assumption 4.1 is not satisfied. The figure confirms that Q-DCP can fail to yield coverage rates larger than $1-\alpha$ for some topologies. This is particularly evident for graphs with stronger connectivity, which yield a faster rate of error compounding across the iterations.

6.3. Results for H-DCP

For H-DCP, unless noted otherwise, we set the consensus rate to $\eta = 1$ and the number of quantization levels to $M = 1000$. We choose the consensus matrix $W$ in the

Figure 2.
Coverage and normalized set size versus coverage level $1-\alpha$ for CP and Q-DCP when Assumption 4.1 is satisfied ($T = 1500$ and $\epsilon_0 = 0.1$). The shaded error bars correspond to intervals covering 95% of the realized values.

Figure 3. Coverage and normalized set size versus coverage level $1-\alpha$ for CP and Q-DCP when Assumption 4.1 is not satisfied ($T = 1500$, $\epsilon_0 = 10^{-4}$, and $s_0 = 8 - 20\alpha$).

standard form with entries $W_{kj} = a$ for all edges $(k, j) \in \mathcal{E}$, $W_{kk} = 1 - |\mathcal{N}_k| a$ for all devices $k \in \mathcal{V}$, and $W_{kj} = 0$ otherwise, where $a = 2/(\lambda_1(L) + \lambda_{K-1}(L))$ with $L$ being the Laplacian of the connectivity graph $\mathcal{G}$ (Xiao & Boyd, 2004). As illustrated in Fig. 4, unlike Q-DCP, H-DCP guarantees the coverage level $1-\alpha$ without the need for hyperparameter tuning. Results on the convergence of H-DCP can be found in the supplementary material (Appendix B.2).

Figure 4. Coverage and normalized set size versus coverage level $1-\alpha$ for CP and H-DCP ($T = 150$).

6.4. Comparing Q-DCP and H-DCP

Figure 5. Coverage and normalized set size versus the total per-device communication load (torus graph with $\alpha = 0.1$; Q-DCP with hyperparameter setting 1 given by ($\epsilon_0 = 0.1$, $s_0$ being the average of the local score quantiles) and hyperparameter setting 2 given by ($\epsilon_0 = 10^{-4}$, $s_0 = 10$)).

Finally, we conduct experiments to compare the two proposed approaches in terms of the trade-offs among communication overhead, coverage guarantees, and prediction set sizes. In order to enable a fair comparison, we evaluate the performance of both Q-DCP and H-DCP under the same
The communication load is evaluated in bits as TC, where T is the number of iterations and C denotes the number of bits communicated by one device per iteration. The per-iteration communication load C is evaluated for the two schemes as detailed next. For H-DCP, we fix the number of quantization levels to M = 1000, so that each histogram vector contains M = 1000 numbers. Each of these numbers is represented using a floating-point format with f ∈ {16, 32, 64} bits. Overall, the per-iteration per-device communication load of H-DCP is C = Mf. For Q-DCP, we also represent each estimated quantile s(t)k using a floating-point format with f ∈ {16, 32, 64} bits, yielding a communication load equal to C = f.

Fig. 5 shows the coverage and set size versus the total per-device communication load, TC, for both Q-DCP and H-DCP on the torus graph using different values of the numerical precision f. The choice of the torus graph is motivated by the fact that the spectral gap of the torus graph with 20 devices is moderate, providing an intermediate setting between a complete graph and a cycle graph. For Q-DCP, we consider two choices of hyperparameters: one, set as in Fig. 2, satisfying Assumption 4.1; and one, chosen as in Fig. 3, not satisfying this assumption. As seen in Fig. 5, with well-selected hyperparameters, Q-DCP provides more efficient prediction sets, while also meeting the coverage requirement 1 − α = 0.9. An exception is given by the case f = 16, in which the reduced numerical precision of the inputs prevents Q-DCP from obtaining a high-quality solution for equation (12a). However, with poorly selected hyperparameters, Q-DCP can violate the coverage requirements even at a high precision f. In contrast, H-DCP maintains the coverage level 1 − α across all considered communication loads.

7. Conclusion

This work has addressed the post-hoc calibration of pre-trained models in fully decentralized settings via conformal prediction (CP).
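The per-device communication accounting used in the comparison above reduces to a one-line computation. The helper below is an illustrative sketch (the function name and the example parameter values are assumptions chosen to mirror the text): at equal budgets and precision f, Q-DCP can afford M times more iterations than H-DCP.

```python
def total_load_bits(scheme: str, T: int, f: int, M: int = 1000) -> int:
    """Total per-device load T * C, with C = M * f bits for H-DCP
    (an M-bin histogram at f bits per entry) and C = f bits for
    Q-DCP (a single scalar quantile per iteration)."""
    C = M * f if scheme == "H-DCP" else f
    return T * C

# Example of matched budgets: Q-DCP over T = 8000 rounds at f = 32 bits
# costs exactly as much as H-DCP over 8 rounds with M = 1000 bins.
q = total_load_bits("Q-DCP", T=8000, f=32)
h = total_load_bits("H-DCP", T=8, f=32, M=1000)
print(q, h)  # both equal 256000 bits
```

The PathMNIST experiment in Appendix B.6 equalizes budgets in exactly this way, running Q-DCP for T = 8000 rounds against H-DCP with (T = 80, M = 100).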
We have proposed two message-passing-based approaches, quantile-based distributed CP (Q-DCP) and histogram-based distributed CP (H-DCP), that achieve reliable inference with marginal coverage guarantees. Q-DCP leverages distributed quantile regression with smoothing and regularization to enhance convergence, while H-DCP applies consensus-based histogram estimation. Our extensive experiments demonstrated the effectiveness of both methods in balancing communication overhead, coverage guarantees, and prediction set sizes across various network topologies. Specifically, Q-DCP was observed to have lower communication requirements, while being sensitive to hyperparameter tuning. In contrast, H-DCP offers robust coverage guarantees at the cost of a larger communication load. Future work may investigate extensions to asynchronous networks (Wei & Ozdaglar, 2013; Tian et al., 2020), to localized risk guarantees (Gibbs et al., 2025; Zecchin & Simeone, 2024), and to online CP (Zhang et al., 2024; Angelopoulos et al., 2024a).

Acknowledgements

The work of H. Xing was supported by the Guangdong Basic and Applied Basic Research Foundation under Grant 2025A1515010123, by the Guangdong Provincial Key Lab of Integrated Communication, Sensing and Computation for Ubiquitous Internet of Things under Grant 2023B1212010007, and by the Guangzhou Municipal Science and Technology Project under Grants 2024A04J4527 and 2024A03J0623. The work of O. Simeone was supported by the European Union's Horizon Europe project CENTRIC (101096379), by an Open Fellowship of the EPSRC (EP/W024101/1), and by the EPSRC project (EP/X011852/1).

Impact Statement

This paper presents work whose goal is to advance the field of conformal prediction and reliable machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Angelopoulos, A. N., Barber, R., and Bates, S. Online conformal prediction with decaying step sizes.
In Proceedings of the 41st International Conference on Machine Learning, volume 235, pp. 1616–1630, Jul. 2024a.

Angelopoulos, A. N., Barber, R. F., and Bates, S. Theoretical foundations of conformal prediction. arXiv preprint arXiv:2411.11824, 2024b.

Angelopoulos, A. N., Bates, S., Fisch, A., Lei, L., and Schuster, T. Conformal risk control. In The Twelfth International Conference on Learning Representations, Vienna, Austria, May 2024c.

Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani, R. J. Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486–507, 2021.

Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Chen, C. and Mangasarian, O. L. A class of smoothing functions for nonlinear and mixed complementarity problems. Computational Optimization and Applications, 5(2):97–138, 1996.

Clarté, L. and Zdeborová, L. Building conformal prediction intervals with approximate message passing. arXiv preprint arXiv:2410.16493, 2024.

Cohen, K. M., Park, S., Simeone, O., and Shitz, S. S. Calibrating AI models for wireless communications via conformal prediction. IEEE Transactions on Machine Learning in Communications and Networking, 1:296–312, 2023.

Fiedler, M. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2):298–305, 1973.

Gibbs, I. and Candes, E. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34:1660–1672, 2021.

Gibbs, I., Cherian, J. J., and Candès, E. J. Conformal prediction with conditional guarantees. Journal of the Royal Statistical Society Series B: Statistical Methodology, pp. qkaf008, 2025.

He, B. and Yuan, X. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method.
SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025.

Humbert, P., Le Bars, B., Bellet, A., and Arlot, S. One-shot federated conformal prediction. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 14153–14177, Jul. 2023.

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021.

Kang, M., Lin, Z., Sun, J., Xiao, C., and Li, B. Certifiably Byzantine-robust federated conformal prediction. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pp. 23022–23057, Jul. 2024.

Koenker, R. and Bassett Jr, G. Regression quantiles. Econometrica: Journal of the Econometric Society, pp. 33–50, 1978.

Kompa, B., Snoek, J., and Beam, A. L. Second opinion needed: communicating uncertainty in medical machine learning. NPJ Digital Medicine, 4(1):4, 2021.

Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.

Lu, C., Lemay, A., Chang, K., Hoebel, K., and Kalpathy-Cramer, J. Fair conformal predictors for applications in medical imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 12008–12016, 2022.

Lu, C., Yu, Y., Karimireddy, S. P., Jordan, M., and Raskar, R.
Federated conformal predictors for distributed uncertainty quantification. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 22942–22964, Jul. 2023.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.

Mohri, C. and Hashimoto, T. Language models with conformal factuality guarantees. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pp. 36029–36047, Jul. 2024.

Muniraju, G., Tepedelenlioglu, C., and Spanias, A. Consensus based distributed spectral radius estimation. IEEE Signal Processing Letters, 27:1045–1049, 2020.

Nkansah, H., Benyah, F., and Amankwah, H. Smoothing approximations for least squares minimization with L1-norm regularization functional. International Journal of Analysis and Applications, 19(2):264–279, 2021.

Plassier, V., Makni, M., Rubashevskii, A., Moulines, E., and Panov, M. Conformal prediction for federated uncertainty quantification under label shift. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 27907–27947, Jul. 2023.

Quach, V., Fisch, A., Schuster, T., Yala, A., Sohn, J. H., Jaakkola, T. S., and Barzilay, R. Conformal language modeling. In The Twelfth International Conference on Learning Representations, Vienna, Austria, May 2024.

Sadinle, M., Lei, J., and Wasserman, L. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114(525):223–234, 2019.

Shi, W., Ling, Q., Yuan, K., Wu, G., and Yin, W. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.

Tian, Y., Sun, Y., and Scutari, G. Achieving linear convergence in distributed asynchronous multiagent optimization.
IEEE Transactions on Automatic Control, 65(12):5264–5279, 2020.

Vovk, V., Gammerman, A., and Shafer, G. Algorithmic Learning in a Random World, volume 29. Springer, 2005.

Wei, E. and Ozdaglar, A. On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers. In 2013 IEEE Global Conference on Signal and Information Processing, pp. 551–554, 2013.

Xiao, L. and Boyd, S. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.

Xu, M., Wu, Y., Cai, D., Li, X., and Wang, S. Federated fine-tuning of billion-sized language models across mobile devices. arXiv preprint arXiv:2308.13894, 2023.

Zecchin, M. and Simeone, O. Localized adaptive risk control. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, Vancouver, Canada, Dec. 2024.

Zecchin, M., Park, S., Simeone, O., and Hellström, F. Generalization and informativeness of conformal prediction. In 2024 IEEE International Symposium on Information Theory, pp. 244–249. IEEE, 2024.

Zhang, Y., Park, S., and Simeone, O. Bayesian optimization with formal safety guarantees via online conformal prediction. IEEE Journal of Selected Topics in Signal Processing, 2024.

Zhu, M., Zecchin, M., Park, S., Guo, C., Feng, C., and Simeone, O. Federated inference with reliable uncertainty quantification over wireless channels via conformal prediction. IEEE Transactions on Signal Processing, 72:1235–1250, 2024.

Zhu, M., Zecchin, M., Park, S., Guo, C., Feng, C., Popovski, P., and Simeone, O. Conformal distributed remote inference in sensor networks under reliability and communication constraints. IEEE Transactions on Signal Processing, 73:1485–1500, 2025.

A. Proofs of Main Results

A.1. Proof of Proposition 4.2

For simplicity, in this proof, we use the shorthand notations ρ̃_γ(s) = ρ̃_γ(s | {S_i}_{i=1}^n) and ρ_γ(s) = ρ_γ(s | {S_i}_{i=1}^n).
By definition, we set s* = argmin_{s∈ℝ} ρ_β(s) and ŝ* = argmin_{s∈ℝ} ρ̃_β(s), where β = (1 − α)(1 + K/n). First, by 0 ≤ g(x) − max(x, 0) ≤ (log 2)/κ, we have

ρ̃_β(s) − ρ_β(s) = β Σ_{i=1}^n (g(S_i − s) − max(S_i − s, 0)) + (1 − β) Σ_{i=1}^n (g(s − S_i) − max(s − S_i, 0)) + (μ/2)(s − s_0)² ≤ (n log 2)/κ + (μ/2)(s − s_0)²

and ρ̃_β(s) − ρ_β(s) ≥ (μ/2)(s − s_0)². By μ-strong convexity and ∇ρ̃_β(ŝ*) = 0, we have

ρ̃_β(s*) − ρ̃_β(ŝ*) ≥ (μ/2)(s* − ŝ*)².   (28)

We next bound the term ρ̃_β(s*) − ρ̃_β(ŝ*) as follows:

ρ̃_β(s*) − ρ̃_β(ŝ*) = ρ_β(s*) − ρ̃_β(ŝ*) + ρ̃_β(s*) − ρ_β(s*)
  ≤ (n log 2)/κ + (μ/2)(s* − s_0)² + ρ_β(s*) − ρ̃_β(ŝ*)
  ≤(a) (n log 2)/κ + (μ/2)(s* − s_0)² + ρ_β(ŝ*) − ρ̃_β(ŝ*)
  ≤ (n log 2)/κ + (μ/2)(s* − s_0)² − (μ/2)(ŝ* − s_0)²
  ≤(b) (n log 2)/κ + (μ/2)ξ_0²,   (29)

where (a) is due to ρ_β(s*) ≤ ρ_β(ŝ*) by the definition of s*, and (b) is by Assumption 4.1. Finally, we have

(s* − ŝ*)² ≤ (2/μ)(ρ̃_β(s*) − ρ̃_β(ŝ*)) ≤ (2n log 2)/(μκ) + ξ_0².   (30)

This yields the desired result.

A.2. Proof of Theorem 4.4

We have |s^(T) − s*| ≤ |s^(T) − ŝ*| + |ŝ* − s*| ≤ ξ^(T) + ξ_0, yielding

s^(T) − ξ^(T) − ξ_0 ≤ s* ≤ s^(T) + ξ^(T) + ξ_0.   (31)

Then, for a data point (X, Y) drawn from P in (1), we have

P(Y ∈ C^{Q-DCP}(X|D)) = P(s(X, Y) ≤ s^(T) + ξ^(T) + ξ_0) ≥ P(s(X, Y) ≤ s*).   (32)

Since s* is the empirical (1 − α)(1 + K/n)-quantile in (4) and by [Theorem 4.3, Lu et al., 2023] with w_k ∝ n_k + 1, we have P(s(X, Y) ≤ s*) ≥ 1 − α, which implies the desired result.

A.3. Proof of Theorem 5.2

Define the spectral gap δ = 1 − |λ_2(W)|. Then, the typical result on consensus convergence in Xiao & Boyd (2004) yields the inequality

Σ_{k∈V} ‖x_k^(t) − p‖² ≤ (1 − ηδ)^{2t} Σ_{k∈V} ‖x_k^(0) − p‖².   (33)

In other words, the consensus error decays exponentially. Define the histogram error as e_k^(T) = x_k^(T) − p. To simplify the notation, we denote m_{α,k} = m^{H-DCP}_{α,k}, ξ = ξ^{H-DCP}, Ŝ_{i,k} = Γ(s(X_{i,k}, Y_{i,k})) as the i-th quantized score on device k, and Ŝ_test = Γ(s(X_test, Y_test)) as the quantized score of the test point. By the definition of m_{α,k}, we have

Σ_{i=1}^{m_{α,k}} x_{i,k}^(T) = Σ_{i=1}^{m_{α,k}} (p_i + e_{i,k}^(T)) ≥ (1 − α)(1 + K/n) + ξ,

so that

Σ_{i=1}^{m_{α,k}} p_i ≥ (1 − α)(1 + K/n) + ξ − Σ_{i=1}^{m_{α,k}} e_{i,k}^(T).   (34)

Following the technique in (Lu et al., 2023), let w_k = (n_k + 1)/(n + K) and define the event E as

E = { ∀k ∈ V, ∃π_k : (Ŝ_{π_k(1),k}, Ŝ_{π_k(2),k}, ..., Ŝ_{π_k(n_k+1),k}) = (s_{1,k}, s_{2,k}, ..., s_{n_k+1,k}) },   (35)

where {s_{i,k}}_{i∈[n_k+1], k∈V} are the order statistics of the scores. For each device k, we also define b_k(γ) = |{i : Ŝ_{i,k} ≤ γ}|. By the definition of m_{α,k}, m_{α,k}/M is the ⌈(1 − α)(n + K) + nξ − n Σ_{i=1}^{m_{α,k}} e_{i,k}^(T)⌉-th smallest score in ∪_{k=1}^K {Ŝ_{i,k}}, leading to

Σ_{k=1}^K b_k(m_{α,k}/M) ≥ (1 − α)(n + K) + nξ − n Σ_{i=1}^{m_{α,k}} e_{i,k}^(T).   (36)

Then, we have

P(Ŝ_test ≤ m_{α,k}/M | E) = Σ_{k=1}^K w_k P(Ŝ_test ≤ m_{α,k}/M | {Ŝ_{1,k}, ..., Ŝ_{n_k,k}, Ŝ_test} are exchangeable, E)
  ≥ Σ_{k=1}^K w_k b_k(m_{α,k}/M)/(n_k + 1)
  ≥ min{ 1, 1 − α + (nξ − n Σ_{i=1}^{m_{α,k}} e_{i,k}^(T))/(n + K) }.   (37)

We next bound the RHS. Using min{x, y} = (x + y − |x − y|)/2, we have

P(Ŝ_test ≤ m_{α,k}/M | E) ≥ 1 − (1/2)(α − nξ/(n + K)) − (1/2)|α − nξ/(n + K)| − (n/(2(n + K)))( Σ_{i=1}^{m_{α,k}} e_{i,k}^(T) + |Σ_{i=1}^{m_{α,k}} e_{i,k}^(T)| ).   (38)

Next, we bound the term (Σ_{i=1}^{m_{α,k}} e_{i,k}^(T))² as

(Σ_{i=1}^{m_{α,k}} e_{i,k}^(T))² ≤ m_{α,k} Σ_{i=1}^{m_{α,k}} (e_{i,k}^(T))² ≤ M Σ_{i=1}^M (e_{i,k}^(T))² = M ‖e_k^(T)‖²,   (39)

where the first inequality follows from Jensen's inequality. This directly implies Σ_{i=1}^{m_{α,k}} e_{i,k}^(T) ≤ |Σ_{i=1}^{m_{α,k}} e_{i,k}^(T)| ≤ √M ‖e_k^(T)‖. Furthermore, by the convergence of consensus (33), we have

‖e_k^(T)‖² ≤ Σ_{k∈V} ‖e_k^(T)‖² ≤ (1 − ηδ)^{2T} Σ_{k∈V} ‖x_k^(0) − p‖² ≤ K(1 − ηδ)^{2T} max_{k∈V} ‖x_k^(0) − p‖².   (40)

The last term ‖x_k^(0) − p‖² can be bounded as, for any k ∈ V,

‖x_k^(0) − p‖² = ‖x_k − (1/K) Σ_{j=1}^K x_j‖² ≤ (1/K) Σ_{j=1}^K ‖x_k − x_j‖² ≤ (2/K) Σ_{j=1}^K (‖x_k‖² + ‖x_j‖²).   (41)

To bound ‖x_k‖², we first recall that x_k = (K n_k/n) p_k, where p_k = [p_{1,k}, ..., p_{M,k}] with p_{m,k} = (1/n_k) Σ_{i=1}^{n_k} I{Γ(S_{i,k}) = m/M}, 0 ≤ p_{m,k} ≤ 1, and 1ᵀ p_k = 1. Then, we have

‖x_k‖² = (K² n_k²/n²) ‖p_k‖² = (K² n_k²/n²) Σ_{m=1}^M (p_{m,k})² ≤ (K² n_k²/n²) Σ_{m=1}^M p_{m,k} = K² n_k²/n².   (42)

It follows that

‖e_k^(T)‖² ≤ 2K³ (1 − ηδ)^{2T} max_{k∈V} n_k²/n².   (43)

To simplify the notation, we define the constant

A = √( 2MK³ (1 − ηδ)^{2T} max_{k∈V} n_k²/n² ),   (44)

so that |Σ_{i=1}^{m_{α,k}} e_{i,k}^(T)| ≤ √M ‖e_k^(T)‖ ≤ A. Combining the above bounds, we have

P(Ŝ_test ≤ m_{α,k}/M | E) ≥ 1 − (1/2)(α − nξ/(n + K)) − (1/2)|α − nξ/(n + K)| − (n/(n + K)) A.   (45)

Setting the RHS of (45) equal to 1 − α yields the expression for ξ = ξ^{H-DCP} given in (46). Finally, the coverage guarantee of H-DCP follows from

P(Y_test ∈ C_k^{H-DCP}(X_test|D)) = P(Ŝ_test ≤ m_{α,k}/M) = E_E[ P(Ŝ_test ≤ m_{α,k}/M | E) ] ≥ 1 − α.   (47)

B. Additional Experimental Results

B.1.
Comparing with SOTA on a star topology

In the special case of a star topology, the communication cost of Q-DCP coincides with that of FCP, while H-DCP reduces to WFCP. Experimental results with α = 0.1, comparing Q-DCP and H-DCP with FCP (Lu et al., 2023), FedCP-QQ (Humbert et al., 2023), and WFCP (Zhu et al., 2024), can be found in Table 1. These results show that the proposed protocols have performance comparable to the state of the art in terms of coverage and set size in the special case of a star topology. However, in contrast to existing schemes, H-DCP and Q-DCP apply to any network topology.

Table 1. Coverage and set size for Q-DCP, H-DCP, FCP, WFCP, and FedCP-QQ under CIFAR100 in a star topology with α = 0.1.

                   Q-DCP          H-DCP (WFCP)   FCP            FedCP-QQ
Coverage (± std)   0.913 ± 0.01   0.92 ± 0.014   0.918 ± 0.01   0.996 ± 0.003
Set Size (± std)   11.35 ± 0.89   12.16 ± 1.47   11.98 ± 0.99   45.32 ± 7.22

B.2. Convergence Results for Q-DCP and H-DCP

Setting ξ0 = 0.1, Fig. 6 shows the coverage and set size versus the number of ADMM iterations T for centralized CP and for Q-DCP on different graphs. We specifically consider the complete graph, torus, star, cycle, and chain, which are listed in order of decreasing connectivity. While coverage is guaranteed for any number of iterations T, Q-DCP is seen to require more rounds to achieve efficient prediction sets on graphs with weaker connectivity levels. Furthermore, Q-DCP achieves performance comparable to CP on all graphs at convergence.

Figure 6. Coverage and normalized set size versus the number of ADMM iterations T for CP and Q-DCP (α = 0.1 and ξ0 = 0.1).

Fig. 7 shows coverage and set size performance versus the number of consensus iterations T for H-DCP on different graphs.
Graphs with lower connectivity are observed to require more consensus iterations to yield efficient prediction sets, as in the case of Q-DCP presented in Fig. 6.

Figure 7. Coverage and normalized set size versus the number of consensus iterations T for CP and H-DCP (α = 0.1).

B.3. Scalability and sensitivity of Q-DCP and H-DCP to network size

To validate the proposed methods on a larger network, we considered a network with K = 100 devices, each of which includes data from a distinct class. Setting T = 3000 for both H-DCP and Q-DCP (with ξ0 = 0.5), experimental results can be found in Fig. 8 and Fig. 9. These results demonstrate that the proposed schemes are scalable to larger networks.

To evaluate the sensitivity of the performance to the choice of ξ0, we have evaluated coverage and set size for Q-DCP on Erdős–Rényi graphs with an increasing number of devices K, in which each edge is included in the graph with fixed probability 0.5. The 100 classes of CIFAR100 are divided (almost) equally among the K devices (without replacement). Other parameters are the same as in Sec. 6. For α = 0.1 and T = 3000, the experimental results can be found in Fig. 10 with ξ0 = 1 and in Fig. 11 with ξ0 = 0.1.

Figure 8. Coverage and normalized set size versus coverage level 1 − α for CP and Q-DCP under CIFAR100 when Assumption 4.1 is satisfied (K = 100).

Figure 9. Coverage and normalized set size versus coverage level for CP and H-DCP under CIFAR100 (K = 100).

The level of connectivity increases with K, as the average spectral gap increases from 0.44 to 0.68. As a result, for fixed T, the set size decreases as the level of connectivity increases. This observation is robust with respect to the choice of the hyperparameter ξ0.
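The connectivity trend just described can be reproduced with a short sketch. The code below is illustrative and relies on assumed conventions: the spectral gap is computed as 1 − |λ2(W)| for the same constant-weight consensus matrix as in Sec. 6.3, and the exact values depend on the random graph draw.

```python
import numpy as np

def spectral_gap(A: np.ndarray) -> float:
    """Spectral gap 1 - |lambda_2(W)| of the constant-weight consensus matrix."""
    K = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A
    eig = np.sort(np.linalg.eigvalsh(L))[::-1]          # Laplacian spectrum, descending
    a = 2.0 / (eig[0] + eig[K - 2])                     # best constant edge weight
    W = np.where(A > 0, a, 0.0)
    np.fill_diagonal(W, 1.0 - A.sum(axis=1) * a)
    lam = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]  # |eigenvalues| of W, descending
    return 1.0 - lam[1]

rng = np.random.default_rng(0)
gaps = []
for K in (20, 40, 80):
    # Erdos-Renyi graph: each edge present independently with probability 0.5.
    U = np.triu(rng.random((K, K)) < 0.5, 1).astype(float)
    A = U + U.T
    gaps.append(spectral_gap(A))
    print(K, round(gaps[-1], 2))
```

Denser random graphs concentrate the Laplacian spectrum, which shrinks |λ2(W)| and speeds up consensus; this is why, for fixed T, larger K yields smaller prediction sets in this experiment.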
However, as verified by these results, the optimal choice of ξ0 depends on the size of the network. In practice, for ξ0 = 1, Assumption 4.1 is satisfied for all values of K between 20 and 80, and thus Proposition 4.3 guarantees convergence to the target coverage probability 1 − α = 0.9 when T is large enough. This is not the case for ξ0 = 0.1, in which case Assumption 4.1 is violated as K grows larger.

Figure 10. Coverage and normalized set size versus the number of devices K for CP and Q-DCP under CIFAR100 on an Erdős–Rényi graph with connection probability 0.5 (α = 0.1, T = 3000, ξ0 = 1).

Figure 11. Coverage and normalized set size versus the number of devices K for CP and Q-DCP under CIFAR100 on an Erdős–Rényi graph with connection probability 0.5 (α = 0.1, T = 3000, ξ0 = 0.1).

B.4. Ablation studies for Q-DCP

Figs. 12 to 14 show the effects of different choices of the smoothing parameter κ and the regularization parameter μ, respectively, on the coverage guarantee and set size for the torus graph. Fig. 12 shows that a small κ leads to an inefficient prediction set due to the inaccurate approximation of the ReLU, while a large κ slows the convergence of the ADMM, also yielding an inefficient prediction set. Fig. 13 demonstrates that a small μ causes a low ADMM convergence rate, making the prediction set conservative. Fig. 14 shows results under the poor choices of s0 and ξ0 of Fig. 3, for which Assumption 4.1 is not satisfied. One can observe that a small μ allows the coverage condition to be satisfied, while a large μ forces the ADMM solution to remain close to s0, yielding a violation of the coverage condition.

Figure 12. Coverage and set size versus κ for Q-DCP (torus graph with T = 1000, α = 0.1, s0 = s*, and ξ0 = 0.1).

B.5. Ablation study for H-DCP

Fig. 15 shows the coverage and set size versus the quantization level M for H-DCP on different graphs.
One can observe that, under a limited number of consensus iterations, increasing the quantization level M makes the estimated quantiles more accurate, thus potentially providing more efficient prediction sets. On the flip side, it leads to a larger consensus error bound, and may degrade the efficiency of the prediction sets on graphs with poor connectivity.

Figure 13. Coverage and set size versus μ for Q-DCP (torus graph with T = 1000, α = 0.1, s0 = s*, and ξ0 = 0.1).

Figure 14. Coverage and set size versus μ for Q-DCP (torus graph with T = 1000, α = 0.1, s0 = min_{k∈V} Q((1 − α)(1 + K/n); {S_{i,k}}_{i=1}^{n_k}), and ξ0 = 10⁻⁴).

B.6. Experimental results on PathMNIST

PathMNIST includes 9 classes and 107,180 data samples in total (89,996 for training, 10,004 for validation, and 7,180 for testing). Medical datasets involve private and sensitive information, and are usually distributed over siloed medical centers, making a decentralized implementation practically relevant. To this end, we trained a small-resnet14 as the shared model, and we considered a setting with K = 8 devices. Seven of the devices have data from only one class, while the last device stores data for the remaining two classes. For Q-DCP, the number of ADMM communication rounds was set as T = 8000, while the number of consensus iterations and the quantization level for H-DCP were set as T = 80 and M = 100. This way, both Q-DCP and H-DCP are subject to the same communication costs (in bits). Other settings remain the same as in Sec. 6. Results on coverage and normalized set size versus coverage level for Q-DCP and H-DCP can be found in Fig. 16 and Fig. 17, respectively. These experimental results confirm the efficiency of the proposed methods for the applications of interest.

Figure 15.
Coverage and set size versus the quantization level M for H-DCP (T = 150 and α = 0.1).

Figure 16. Coverage and normalized set size versus coverage level for CP and Q-DCP under PathMNIST.

Figure 17. Coverage and normalized set size versus coverage level for CP and H-DCP under PathMNIST.