# multivariate_conformal_selection__d96a3450.pdf

Multivariate Conformal Selection

Tian Bai 1 Yue Zhao 2 Xiang Yu 3 Archer Y. Yang 1 4

Selecting high-quality candidates from large datasets is critical in applications such as drug discovery, precision medicine, and alignment of large language models (LLMs). While Conformal Selection (CS) provides rigorous uncertainty quantification, it is limited to univariate responses and scalar criteria. To address this issue, we propose Multivariate Conformal Selection (m CS), a generalization of CS designed for multivariate response settings. Our method introduces regional monotonicity and employs multivariate nonconformity scores to construct conformal pvalues, enabling finite-sample False Discovery Rate (FDR) control. We present two variants: m CS-dist, using distance-based scores, and m CS-learn, which learns optimal scores via differentiable optimization. Experiments on simulated and real-world datasets demonstrate that m CS significantly improves selection power while maintaining FDR control, establishing it as a robust framework for multivariate selection tasks.

1. Introduction

Selecting a subset of promising candidates from a large pool is crucial across various scientific and real-world applications. In drug discovery, researchers search vast chemical spaces to identify compounds with strong effects, such as high binding affinity to a specific target (Szyma nski et al., 2011; Scannell et al., 2022; Sheridan et al., 2015; Zhang et al., 2025). Similarly, precision medicine aims to identify positive individual treatment effects (Lei & Cand es, 2021), and post-hoc certification of large language model (LLM) outputs seeks to retain only trustworthy generations that meet user-defined criteria (Gui et al., 2024).

1Department of Mathematics and Statistics, Mc Gill University, Montreal, Canada 2Department of Mathematics, University of York, York, UK 3MRL, Merck & Co., Inc., Rahway, NJ, USA 4Mila - Quebec AI Institute, Montreal, Quebec, Canada. Correspondence to: Archer Y. Yang <archer.yang@mcgill.ca>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

In these settings, true test responses (e.g., binding affinity or alignment score) are often unavailable, requiring selection to rely on machine learning model predictions. Since selected targets drive downstream decisions, quantifying selection uncertainty is essential for maintaining efficiency. Conformal selection (Jin & Cand es, 2023) provides a modelagnostic framework for selection with uncertainty quantification by extending conformal prediction (Vovk et al., 2005) to multiple hypothesis testing, using conformal p-values (Bates et al., 2023) and multiple testing corrections. It has shown promise in real-world drug discovery (Bai et al., 2024) and LLM alignment (Gui et al., 2024).

However, existing conformal selection methods are limited to univariate responses and selection targets of the form y > c, where c is a user-defined threshold. Many real-world applications require selection based on multiple interdependent criteria. For instance, LLM outputs must simultaneously satisfy alignment requirements such as fairness, safety, and correctness (Bai et al., 2022), which are better represented as multivariate alignment scores. This highlights the need for a principled selection method with uncertainty quantification in multivariate settings.

In this paper, we extend the conformal selection framework to multivariate response settings. To ensure finite-sample false discovery rate (FDR) control in our procedure, we generalize the concept of monotonicity (Jin & Cand es, 2023) to regional monotonicity for the nonconformity function. We propose two types of nonconformity scores that satisfy this property: (i) distance-based nonconformity scores for regular-shaped and convex target regions, and (ii) a learningbased method for optimizing nonconformity scores. The latter approach leverages a loss function that penalizes either the smooth selection size or the conformal p-value to learn an optimal score. This method is particularly effective when the dimension of responses is large or when the target region is irregular or nonconvex. Through experiments on simulated and real-world datasets, both variants of m CS demonstrates enhanced selection power over baseline methods while ensuring finite-sample FDR control.

2. Background and Related Work

Problem Setup We let x Rp represent the pdimensional features, and let y Rd denote the d-

Multivariate Conformal Selection

dimensional multivariate response variables. We consider a training dataset Dtrain = {xi, yi}n i=1 and a test dataset Dtest = {xn+j}m j=1, where the corresponding test responses {yn+j}m j=1 are unobserved. We further assume that the combined set of training and test samples {xi, yi}n+m i=1 are drawn i.i.d.1 from an unknown, arbitrary distribution DX Y .

We formulate the selection problem as follows: Given a predefined d-dimensional closed region R Rd, our goal is to identify a subset of indices S {1, . . . , m} from Dtest such that as many test observations j S satisfy yn+j R as possible, while controlling the FDR (Benjamini & Hochberg, 1995) below a user-specified level q. The FDR is defined as the expected proportion of false discoveries (j S but yn+j / R) among all selected observations

where H0 = {j : yn+j / R}, with the convention that 0/0 = 0 in the fraction above. This criterion measures the overall Type-I error rate of the selection procedure.

The overall Type-II error of selection can be quantified by the power, defined as the expected proportion of desirable observations (yn+j R) that are correctly selected,

where H1 = {j : yn+j R}. The Type-II error of selection is therefore (1 Power). An ideal selection procedure should aim to maximize the power while keeping the FDR below the specified nominal level.

Conformal Prediction Conformal prediction (CP) (Vovk et al., 2005) is a popular framework for uncertainty quantification that constructs prediction intervals on a per-sample basis. Assuming exchangeable calibration and test data, CP provides prediction sets b C1 α(x) with finite-sample coverage guarantees for α (0, 1):

P(y b C1 α(x)) 1 α.

Although CP was originally designed for univariate responses, numerous studies have proposed multivariate generalizations (Kuleshov et al., 2018; Bates et al., 2021; Messoudi et al., 2022; Johnstone & Cox, 2021; Feldman et al., 2023; Park et al., 2024; Klein et al., 2025). However, these multivariate CP methods are not directly applicable to our selection problem. The primary objective of CP constructing confidence sets for predictions does not naturally align

1Later, we will relax the i.i.d. assumption to exchangeability conditions.

with selection tasks. Specifically, the potentially complex shapes of the multivariate CP sets b C1 α may be incompatible with the pre-defined target region R.

Even in the simpler cases, such as when the response is binary, using CP for selection introduces a multiplicity issue (Jin & Cand es, 2023). In this context, CP only controls the per-comparison error rate (PCER), which differs from the FDR. PCER also measures the Type-I error, and is defined as (1) with the denominator replaced by m, the size of the test dataset. By definition, the PCER is always smaller than the FDR, making it a less stringent error control criterion. As a result, procedures controlling PCER may fail to meet the stricter requirements of FDR control.

Conformal Selection Conformal selection (CS) (Jin & Cand es, 2023) is a model-agnostic selection framework that guarantees finite-sample FDR control. However, CS only considers the univariate response case (d = 1) and assumes that the selection region takes the form [c, + ), where c is a predefined threshold. In this setting, CS formulates one hypothesis test per candidate:

H0j : yn+j c vs. H1j : yn+j > c.

Rejecting the null hypothesis H0j indicates that the j-th test sample is selected, as its response is deemed to exceed the threshold c.

CS uses nonconformity scores to guide its selection process. A nonconformity measure quantifies how atypical (or nonconforming) an observation is, based on the relationship between inputs and responses. For calibration samples, the nonconformity scores are Vi = V (xi, yi), and for test samples, they are b Vn+j = V (xn+j, c), where c replaces the unobserved yn+j. These scores are then used to compute conformal p-values through a rank-based comparison of b Vn+j against the calibration scores V1, . . . , Vn. A lower rank of b Vn+j relative to the calibration scores provides stronger evidence for rejecting H0j.

To determine the final selected subset S, CS applies the Benjamini-Hochberg (BH) procedure (Benjamini & Hochberg, 1995; 1997), a widely used method for controlling FDR in multiple testing setting, to the set of conformal p-values. The use of the BH procedure ensures that the overall Type-I error rate is kept below the specified level.

3. Multivariate Conformal Selection

In this section, we introduce the key concepts, procedures, and theoretical foundations of multivariate Conformal Selection (m CS).

First, for each d-dimensional multivariate response yn+j,

Multivariate Conformal Selection

m CS performs the following hypothesis test:

H0j : yn+j Rc vs. H1j : yn+j R,

where R represents an arbitrary pre-defined closed target region in Rd. Multivariate responses yn+j can represent either regression or classification outcomes. In this paper, we focus on the more challenging regression setting. For a discussion of the classification case, please see Appendix B.1. The m CS process consists of three main steps:

1. Training: Construct a multivariate predictive model ˆµ for y. This model can be obtained using any suitable machine learning algorithm.

2. Calibration: Build a regionally monotone multivariate nonconformity function based on ˆµ, and evaluate this function on the calibration dataset and test dataset. Subsequently, we compute the conformal p-values for each test sample.

3. Thresholding: Apply the Benjamini-Hochberg (BH) procedure as in the original CS procedure to the set of conformal p-values, yielding the final selection set S.

Both the training step and the calibration step rely on the labeled dataset Dtrain. In the case where a model ˆµ is already available, m CS can be directly applied using all labeled data for calibration. Otherwise, the training data is divided into two subsets: one for model training (the proper training dataset) and the other for calibration. For simplicity, in the following discussion, we assume that ˆµ is already available and all training data {xi, yi}n i=1 are used for calibration so that Dcal = Dtrain.

The conformal p-values (Bates et al., 2023) are used to perform the hypothesis tests. If the true responses {yn+j}m j=1 were observed, the oracle conformal p-value would be defined as

p j = Pn i=1 1{Vi < Vn+j} + Uj(1 + Pn i=1 1{Vi = Vn+j}) n + 1 (3)

where Vi = V (xi, yi) for i = 1, . . . , n + m and V is a multivariate nonconformity function based on ˆµ, and Uj Unif(0, 1) is an independent random variable for the tiebreaking of the nonconformity scores. We defer the specific choice of V to later sections.

The evaluation of the oracle p-value p j is infeasible, because that in the above definition (3), the computation of Vn+j = V (xn+j, yn+j) requires knowledge of the unobserved response yn+j. To address this issue, we replace Vn+j with b Vn+j = V (xn+j, rn+j), where rn+j is an arbitrarily chosen point in the target region R, yielding the

(practical) conformal p-values:

pj = Pn i=1 1{Vi < b Vn+j} + Uj(1 + Pn i=1 1{Vi = b Vn+j}) n + 1 .

Standard results from conformal inference ensure that the oracle conformal p-values p j follow the Unif(0, 1) distribution, in particular implying that they are conservative in the sense that p j as a random variable has a super-uniform distribution on [0, 1], satisfying P(p j α) α (Bates et al., 2023). To guarantee that the practical conformal p-values pj also maintain this conservativeness, the nonconformity function V must satisfy a property that we introduce as regional monotonicity. With this property the pj s in the resulting m CS procedure will ensure the control of the FDR. We formally define the regional monotonicity as follows:

Definition 3.1 (Regional Monotonicity). A nonconformity score V : X Y R satisfy the regional monotone property if V (x, y ) V (x, y) for any x X, y Rc

Definition 3.1 leads to the conservativeness of pj. The following proposition formalizes this result, with proof available in Appendix A.1.

Proposition 3.2. Given that the calibration data {(xi, yi)}n i=1 together with the j-th data test data point (xn+j, yn+j) are exchangeable for j {1, . . . , m}, regionally monotone nonconformity scores V ensures that the conformal p-value pj defined in (4) is conservative in the following sense,

P(pj α and j H0) α, for all α (0, 1). (5)

Remark 3.3 (Clarification on conservativeness). The conservativeness described in (5) differs from the conventional notion of statistical conservativeness, which is not conditional on the event j H0. Due to the inherent randomness in yn+j within our hypothesis tests, an unknown dependency exists between the p-values and the event j H0. Consequently, the standard form of conservativeness does not hold in this context. Remark 3.4 (Univariate monotonicity as a special case). Regional monotonicity generalizes the univariate monotonicity concept introduced in CS (Jin & Cand es, 2023), which was originally defined for nonconformity scores increasing in their second argument. The original definition is restricted to the univariate case, where the target region R is specified as (c, + ). For a univariate nonconformity score V : X Y R, monotonicity implies regional monotonicity. Specifically, for any y Rc = ( , c] and y R, it holds that V (x, y) V (x, y ) for all x X. Moreover, even in the univariate case, the original definition of monotonicity across the entire domain X is unnecessary and monotonicity across the regions Rc and R suffices.

Multivariate Conformal Selection

Once a regionally monotone multivariate nonconformity score is defined, conformal p-values can be computed using (4). Leveraging these conformal p-values, m CS again applies the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995) to construct a selection set S. The complete approach is outlined in Algorithm 1.

Algorithm 1 m CS: Multivariate Conformal Selection

Input: Calibration data Dcal = {(xi, yi)}n i=1, test data Dtest = {xn+j}m j=1, target target region R, FDR nominal level q (0, 1), regionally monotone nonconformity score V : X Y R. Output: Selection set S.

1: Compute Vi = V (xi, yi) for i = 1, . . . , n, and b Vn+j = V (xn+j, rn+j) for j = 1, . . . , m with rn+j R. 2: Construct conformal p-values {pi}m i=1 as in (4). 3: (BH procedure) Compute the BH selection threshold k = max{k Z 0 : Pm j=1 1{pj qk/m} k}, and return the selection set as S = {j : pj qk /m}.

The following theorem shows that Algorithm 1 controls FDR in finite sample. The proof can be found in Appendix A.2.

Theorem 3.5. Suppose V is a regionally monotone nonconformity score, and for any j {1, . . . , m}, the random variables V1, . . . , Vn, Vn+j are exchangeable conditioned on {b Vn+ℓ}ℓ =j. Then, for any q (0, 1), the output S of m CS satisfies FDR q.

4. Choices of Nonconformity Score

While the previous sections demonstrate that FDRcontrolled selection can be performed using any regionally monotone nonconformity score, the selection power heavily depends on the quality of the chosen score. Although related studies have explored this issue in the context of CP (Romano et al., 2019; Kivaranovic et al., 2020; Sesia & Cand es, 2020), there has been limited focus on the choice of scores specific for CS.

In this section, we introduce two types of nonconformity score that satisfies the regional monotonicity. As a result of Theorem 3.5, applying Algorithm 1 with the purposed scores would guarantee FDR control.

4.1. m CS-dist: Distance-based Scores

For multivariate selection, we propose distance-based nonconformity scores of the following form:

V (x, y) = D1(y, Rc) D2(ˆµ(x), Rc) (6)

where ˆµ is a trained predictive model of y, and D1 and D2 are distance functions. Two examples of distance-based

nonconformity score are as follows:

1.(regular) D1(z, Rc) = D2(z, Rc) = inf s Rc z s p.

2.(clipped) D1(z, Rc) = M 1{z R \ R},

D2(z, Rc) = inf s Rc z s p, (8)

where M is a large constant that serves as a relaxation of infinity, approximately satisfying M > 2D2(ˆµ(x), Rc). We discuss the role of M in the following paragraphs. These scores generalize the signed error score and the clipped score (Jin & Cand es, 2023), respectively.

This formulation enables the decomposition of V (x, y) into two terms. The first term D1 inherently ensures the regional monotonicity of the score V (Definition 3.1), as D1 is a distance function satisfying 0 = D1(y , Rc) D1(y, Rc) for any y Rc and y R. Meanwhile, the second term D2 measures the distance between the predicted responses of the points and Rc; this term is designed to increase as ˆµ(x) moves away from Rc, ensuring that the test points with the predicted responses having large distance from Rc

will yield smaller test scores b Vn+j. According to (4), these points will have smaller conformal p-values and are more likely to be rejected (selected) by the BH procedure. Note that when the predictive model ˆµ outputs an estimated conditional distribution b P(y|x) as in classification, conditional density estimation, or Bayesian models the second term D2 can be replaced by the predicted probability of being in the target region: y R, i.e. R 1{y R}d b P(y|x). This serves the same purpose: points with high predicted probability of satisfying the selection criterion will receive lower scores and are more likely to be selected.

We note that computing b Vn+j = V (xn+j, rn+j) in (4) requires choosing a point rn+j R to ensure FDR control. Although any point in R works, selecting rn+j on the boundary R is optimal for maximizing selection power for both regular and clipped scores. In this case, for any fixed xn+j, the test score b Vn+j achieves its minimum, ensuring it to be smaller than a larger proportion of calibration scores Vi = V (xi, yi). For example, with the clipped score, if rn+j R, then the first term D1 becomes 0 in b Vn+j, while for any calibration samples with yi R, we have D1 = M (except when yi lies exactly on R, which occurs with zero probability.) This yields smaller p-values pj and enables more samples to be selected. We require M to be large for the same reason.

The following asymptotic analysis provides a theoretical basis for preferring the second, clipped score (8) over the first score (7) when choosing a distance-based nonconformity score. This result extends the original CS framework (Jin & Cand es, 2023) to the multivariate setting, as formalized in the following theorem:

Multivariate Conformal Selection

Theorem 4.1. Let V be any fixed regionally monotone nonconformity score, and suppose {(xi, yi)}n+m i=1 are exchangeable from distribution DX Y . Let (x, y) denote a random pair also drawn from DX Y , and define F(v, u) = P(V (x, y) < v) + u P(V (x, y) = v) for any v R and u [0, 1]. Assuming the choice of rn+j r R is fixed, define

t = sup n t [0, 1] : t P(F(V (x, r), U) t) q o . (9)

Suppose that, for any sufficiently small ϵ > 0, there exists some t (t ϵ, t ) such that t P(F (V (x,r),U) t) q. Then the output S of Algorithm 1 from input {(xi, yi)}n i=1 {xn+j}m j=1 satisfies

lim n,m FDR = P(F(V (x, r), U) t , y Rc)

P(F(V (x, r), U) t ) , and

lim n,m Power = P(F(V (x, r), U) t , y R)

P(y R) . (10)

We omit the proof of Theorem 4.1, as it closely mirrors Proposition 7 in Jin & Cand es (2023). An intuitive explanation can be found in Appendix B.2. Leveraging a characterization of the BH procedure (Storey et al., 2004), the theorem establishes that the asymptotics of FDR and power can be precisely achieved by replacing each term in (1) and (2) with its population counterpart. Notably, in this context, t represents the population version of the BH rejection threshold for the p-values.

Theorem 4.1 indicates that the second, clipped score (8) is preferable to the first score (7) for achieving higher power. According to (10), since the value V (x, r) = infs Rc r s p is identical for both scores assuming r R, it suffices to compare their asymptotic BH thresholds t . The score with the larger t achieves higher asymptotic power and is therefore more effective. In the definition of t in (9), the fraction can be rewritten as

GV (t) t P(F(V (x, r), U) t) = P(F(V (x, y), U) t)

P(F(V (x, r), U) t)

= P(V (x, y) s)

P(V (x, r) s)

when s is the inverse of F( , U) at t, assuming it exists. The first equality follows from the fact that F(V (x, y), U) Unif(0, 1), while the second arises from the monotonicity of F with respect to its first argument v. To maximize t in (9) (equivalently, the inverse s ), an effective score should yield a larger V (x, y) relative to V (x, r), thereby reducing GV (t) for a fixed t. This, in turn, results in a larger t when computing sup{t : GV (t) q} in (9).

This criterion is precisely satisfied by the clipped score. An alternative justification for favoring the clipped score,

based on maximizing the realized FDR, is provided in Appendix B.2, along with further discussions on Theorem 4.1. For an empirical comparison of the performance of the two scores (7) and (8), refer to Appendix C.2.1.

4.2. m CS-learn: Learning-based Nonconformity Scores

The two distance-based nonconformity scores introduced in the previous section offer straightforward and practical solutions for many scenarios. However, their effectiveness, particularly in the design of the second distance term D2( ), is limited in some cases. For example, our numerical simulations indicate that m CS-dist would only achieve suboptimal power when R is a nonconvex set; see Appendix C.2.3 for further details. Furthermore, when the target region R is irregular, constructing a closed-form distance function can be challenging, leading to higher computational costs and potential inaccuracies.

To address these challenges, we propose an alternative method m CS-learn, which leverages a loss function that penalizes either the smooth selection size or conformal pvalue to learn an optimal nonconformity score within the following family:

V θ(x, y) = M 1{y R \ R} fθ(x, y; R) (11)

where M is a large constant and fθ : X Y R is a flexible function parametrized by θ, that can be chosen from a specific machine learning model class, such as kernel machines, gradient boosting models or neural networks, etc. The first term, an indicator function identical to D1( ) in (8), ensures regional monotonicity in Definition 3.1 and boosts selection power, as suggested in Section 4.1. The second term generalizes the distance term D2( ) from (6), offering a more expressive framework for constructing optimal nonconformity scores.

The following result demonstrates the expressiveness of the family (11) by showing that it can include the optimal nonconformity score for any selection task. The proof is provided in Appendix A.3. Proposition 4.2. Let {(xi, yi)}n+m i=1 be sampled i.i.d. from a distribution, and assume a fixed choice of rn+j r. Under Algorithm 1, for any nominal FDR level q and target region R, there exists a function f such that the score constructed using f in (11) maximizes the number of selected samples (and thus the power) among all scores with FDR control. Remark 4.3 (Subfamilies of the score class). A notable subfamily of (11) is

M 1{y R \ R} fθ(ˆµ(x); R), (12)

where fθ depends solely on the prediction ˆµ(x). This subfamily includes the clipped distance-based score from Sec-

Multivariate Conformal Selection

tion 4.1 as a special case. Here, regional monotonicity is guaranteed for any constant M, but the score can also be generalized further by allowing f to depend on x as well:

M 1{y R \ R} fθ(x, ˆµ(x); R).

In contrast, the broader family defined in (11) offers greater flexibility by incorporating y in the second term. However, this added expressiveness requires a sufficiently large M to preserve regional monotonicity. Specifically, M must satisfy M > 2|fθ(x, y; R)|. In practice, M is chosen to be sufficiently large to ensure that this inequality holds across the entire dataset. Remark 4.4 (Incorporating pretrained models). This could be achieved in several ways, such as using the predictions of ˆµ as inputs to fθ, or train fθ as a prediction on top of ˆµ when both models are implemented as neural networks. While such practice does not increase the expressiveness of the score family, it often facilitates the training of fθ, as ˆµ(x) estimates y and is very informative for selection. Since fθ can directly learn the data and the selection task, m CS-learn can still perform well when ˆµ is poorly fitted; see Appendix C.2.4.

To identify an optimal function within the family (11), we introduce a differentiable loss function that mimics the inherently non-differentiable m CS procedure. The hard sorting and ranking operations in the m CS workflow are replaced with their smooth, differentiable counterparts (Blondel et al., 2020; Cuturi et al., 2019). We adopt the implementation introduced in Blondel et al. (2020), with ℓ2 regularization and regularization strength set to 0.1. The resulting loss function is then used for a chosen machine learning method to train fθ. Specifically, we partition the calibration data into three batches Dcal = Df-train Df-val D cal, where Df-train and Df-val are used for training and validating fθ, respectively. Upon completion of training and validation, Algorithm 1 can be applied with D cal as the calibration dataset for V θ to generate the final selection set S.

Training Step. The training loss function is defined based on the smoothed selection size. We denote the softened rank of an element a A within the set A by soft-rank(a; A). We randomly partition Df-train into two subsets Df-train1 and Df-train2, and we assume Df-train1 = {(xi, yi)}n i=1 and Df-train2 = {(xn +j, yn +j)}m j=1 for notational simplicity. We define the smooth conformal p-value pθ j for j = 1, . . . , m as

pθ j = soft-rank b V θ n +j; {V θ i }n i=1 {b V θ n +j}

n + 1 . (13)

where V θ i := V θ(xi, yi) are computed on Df-train1 and b V θ n +j := V θ(xn +j, rn +j) are computed on Df-train2, with

V θ in (11). Here, Df-train1 and Df-train2 serve as the calibration dataset and test dataset respectively, to obtain the smoothened conformal p-value for training.

Next, to smooth the BH procedure, we first apply the softsorting operation to the smooth p-values pθ j to obtain their corresponding ranks aθ j,

aθ j = soft-rank pθ j; { pθ k}m k=1), (14)

and then compute the softened selection size S(θ) as

S(θ) (i) = log

j=1 eaθ j sθ j , where sθ j (ii) = σ q aθ j/m pθ j τ

In the above equation, the sigmoid function σ with temperature coefficient τ in (ii) serves as a smooth approximation of the indicator function 1{ pθ j < qaθ j/m }, while the logsum-exp function in (i) approximates the element-wise max function max(aθ 1sθ 1, . . . , aθ m sθ m ). This formulation ensures that S(θ) closely approximates the BH selection size, as the selection size is defined by the largest rank with a p-value below the threshold.

To maximize the selection size, the loss function for learning the m CS-learn score can be defined as

L1(θ) = S(θ). (15)

While above formulation of the loss L1 is intuitive in its attempt to approximate the final selection size using differentiable functions, the inclusion of two soft sorting steps may reduce numerical stability and impede the training process.

In the m CS procedure, the BH procedure is solely intended for multiplicity correction for FDR control, and thus is not necessarily required in the formulation of the loss function, whose primary objective is to learn the function fθ. A simpler yet effective alternative loss function directly penalizes the smooth p-values pθ j through the following loss function:

j=1 pθ j 1(yn+j R) γ 1(yn+j Rc) . (16)

where pθ j is minimized when the j-th sample is deemed desirable, as indicated by the first term. The second term, scaled by a balancing coefficient γ, ensures that the p-value is not uniformly small but becomes relatively larger for less favorable samples. This approach eliminates the need to estimate the BH threshold via a secondary soft-ordering (14), leading to improved numerical stability and enhanced overall performance. A comparison of the two loss functions can be found in Appendix C.2.5. After computing the loss in each epoch, we can follow the standard backpropagation procedure to train θ.

Multivariate Conformal Selection

Validation Step. To avoid overfitting fθ, we perform an additional model selection procedure using a hold-out dataset Df-val. Specifically, for each epoch t = 1, . . . , T of the backpropagation procedure, we apply K random partitions on Df-val to obtain D(k) f-val1 and D(k) f-val2 for k = 1, . . . , K. For each k, we then apply Algorithm 1 with the setting Dcal := D(k) f-val1 and Dtest := D(k) f-val2, and record the validation power ρk(t). We then compute the average validation power Power(t) = PK k=1 ρk(t)/K.

In the end, we select t from {1, . . . , T} to be the epoch with the highest average validation selection power Power(t), and deploy the associated model fθ t for the final selection.

Finally, Algorithm 2 details the complete learning procedure for m CS-learn scores. The code for reproduction can be found at https://github.com/Tian-Bai/mcs.

Algorithm 2 m CS-learn Learning Procedure

Input: Training data Df-train, validation data Df-val, target region R, FDR level q (0, 1), other hyperparameters. Output: Trained nonconformity function fθ.

1: Initialize parameters θ = θ0. 2: for epoch t = 1, . . . , T do

Training Step 3: Randomly partition Df-train into two disjoint subsets Df-train1 and Df-train2. 4: Use the current fθ to obtain V θ i from Df-train1 and b V θ n +j from Df-train2.

5: Compute the smooth conformal p-values pθ j (13). 6: Compute the loss function using (15) or (16). 7: Back-propagate to update the parameters θ = θt.

Validation Step 8: Apply K random partitions on Df-val to obtain D(k) f-val1 and D(k) f-val2 for each k = 1, . . . , K. 9: For each k, apply Algorithm 1 for score function V θ with the setting Dcal := D(k) f-val1, Dtest :=

D(k) f-val2 and compute validation power ρk(t).

10: Compute Power(t) = PK k=1 ρk(t)/K.

11: end for 12: Determine t = arg maxt Power(t) and return fθ t.

5. Simulation Studies

5.1. Baseline Methods

While the standard CS approach is originally designed for univariate settings and cannot be directly applied to multivariate selections, appropriate adaptations can be devised. In this section, we introduce several na ıve methods directly

adapted from CS to address the multivariate case. Later, we employ these adapted methods as baselines and compare their performance against our proposed method.

In the scenario when the target region R Rd is rectangular, the overall selection criterion can be decomposed, allowing each dimension to be evaluated independently. By applying CS separately to each dimension, we obtain d selection sets S1, . . . , Sd, where each Sk contains observations satisfying the k-th corresponding marginal criterion. The final selection set S is then given by the intersection of these individual sets, S = d k=1Sk. We refer to this approach as CS int.

It can be shown that CS int, when each CS subroutine is conducted at a nominal level q, fails to control the FDR at or below q. This issue is analogous to the intersection hypothesis testing (IHT) problem in statistics. A common approach to address this is the Bonferroni correction (Dunn, 1961), which adjusts the nominal level of each subroutine to a lower threshold q/d. Then obtain the intersection of individual sets. However, this method is widely recognized as being overly conservative (Perneger, 1998; Westfall & Young, 1993). We refer to the Bonferroni-adjusted CS int as CS ib. An alternative to the Bonferroni correction is to adaptively account for intersection hypothesis testing. Rather than predefining the nominal levels for subroutines, we determine suitable values by validating on a hold-out dataset. This approach necessitates additional data splits to construct the hold-out set, and we refer to it as CS is.

Beyond considering each dimension separately, another natural adaptation of CS involves transforming the response vector y before applying CS. Specifically, each response yi is converted to a binary indicator reflecting whether it meets the selection criterion, defined as yi = 1{yi R}. Under this transformation, the new selection threshold can be set to c = 0, as yi > 0 is equivalent to yi R. We refer to this approach as bi.

5.2. Numerical Results

We compare the performance of m CS-dist, m CS-learn (abbreviated as m CS-d and m CS-l respectively) against the baseline methods outlined in Section 5.1.

In our data generation processes, covariates x are sampled uniformly from Unif( 1, 1)p where p is the covariate dimension, and the responses y are generated as y = µ(x)+ϵ, where µ denotes the regression function and ϵ represents noise drawn from either a multivariate Gaussian or multivariate t-distribution. By varying the regression function, the size of response dimensions, and the choice of Gaussian or heavy-tailed noise, we create a range of selection problems with differing levels of difficulty.

We consider two selection tasks where the target region R

Multivariate Conformal Selection

is defined as:

Task 1. The (shifted) first orthant, R = {y : yk ck k}, Task 2. A sphere centered at c, R = {y : y c 2 r}.

These two specific tasks are particularly relevant in applications, as they simulate scenarios where (1) d criteria must be simultaneously satisfied or (2) an instance must be sufficiently close to a specific point c to be deemed acceptable.

In our simulation, the coefficients c and r for each selection problem are chosen to ensure that approximately 15% to 35% of the data points lie within R across all six datagenerating processes. Detailed descriptions of the data generation process, model specifications, and the specific values of the coefficients are provided in Appendix C.1. We first train a support vector regression model ˆµ using 1000 data points, and use an additional labeled dataset of 1000 samples to construct selection sets for different methods in comparison. We evaluate the selection power and FDR using a test dataset of size 100. We adopt the clipped score (8) for m CS-dist, and adopt the loss function in (16) with balancing coefficient γ = 0.5 for m CS-learn. For m CS-learn, the calibration data is split to Df-train, Df-val and D cal with ratio 8:1:1, and the model fθ is formulated as a two-layer MLP with batch normalization. The response dimension is set to be d = 30, and nominal FDR level is set at q = 0.3. Number of iterations for validation is set to K = 100. The selection process is repeated across 100 iterations, with a new dataset generated independently for each iteration.

Table 1 and Table 2 summarize the experimental results for the first selection task. As shown in Table 1, CS int substantially violates FDR control. CS is provides only approximate FDR control, and in scenarios such as Setting 6, the FDR control may be compromised. In each setting, the red numbers in Table 2 indicate the highest achieved power among all methods that properly control the FDR. Among the four remaining methods that always maintain valid FDR control, our two proposed methods consistently achieve the best and second-best power, outperforming the baseline methods under all settings.

Table 3 summarizes similar results for the second selection task. Baseline methods CS int and CS ib are not included as they are not applicable to non-rectangular target regions. Figure 1 shows the realized FDR and power curves across varying nominal FDR levels, ranging from 0.05 to 0.5 in increments of 0.05. Results are shown exclusively for Setting 3 due to space constraints. Among the methods compared, m CS-dist, m CS-learn, and bi demonstrate consistent FDR control, as their respective curves remain below the black dashed line indicating the nominal FDR threshold. Notably, m CS-learn also achieves consistently higher power across all nominal levels.

Additional simulation results and ablation studies explor-

ing various factors of the selection problem and the m CS algorithm are provided in Appendix C.2.

Table 1: Observed FDR for Task 1 (shifted first orthant R). The nominal FDR level is q = 0.3.

Setting CS int CS ib CS is bi m CS-d m CS-l

1 0.773 0.000 0.280 0.240 0.277 0.251 2 0.801 0.000 0.133 0.300 0.315 0.266 3 0.724 0.000 0.204 0.295 0.264 0.278 4 0.811 0.000 0.255 0.309 0.277 0.315 5 0.810 0.000 0.300 0.266 0.308 0.239 6 0.778 0.000 0.374 0.245 0.287 0.258

Table 2: Observed power for Task 1 (shifted first orthant R).

Setting CS int CS ib CS is bi m CS-d m CS-l

1 1.000 0.000 0.406 0.324 0.555 0.325 2 1.000 0.000 0.012 0.069 0.104 0.108 3 1.000 0.000 0.039 0.059 0.068 0.102 4 1.000 0.000 0.124 0.194 0.324 0.198 5 1.000 0.000 0.019 0.035 0.060 0.042 6 1.000 0.000 0.101 0.027 0.046 0.034

Table 3: Observed FDR and power for Task 2 (spherical R). The nominal FDR level is q = 0.3.

Setting bi m CS-d m CS-l bi m CS-d m CS-l

1 0.255 0.265 0.279 0.636 0.760 0.534 2 0.260 0.263 0.273 0.343 0.405 0.421 3 0.216 0.223 0.263 0.115 0.134 0.179 4 0.316 0.286 0.254 0.192 0.333 0.180 5 0.307 0.291 0.273 0.137 0.170 0.189 6 0.292 0.283 0.207 0.055 0.063 0.061

6. Real Data Application

We apply the m CS framework to drug discovery, selecting drug candidates with desirable biological properties. Each candidate corresponds to a chemical compound, where the feature vector x encodes structural or chemical characteristics, and the multivariate response y represents biological properties. The multidimensional nature of y reflects the need to evaluate multiple biological criteria before advancing a compound. Ensuring FDR control improves downstream processes, such as wet-lab validation.

We employ an imputed public ADMET dataset compiled from multiple sources (Wenzel et al., 2019; Iwata et al., 2022; Kim et al., 2023; Watanabe et al., 2018; Falc on-Cano et al., 2022; Esposito et al., 2020; Braga et al., 2015; Aliagas et al., 2022; Perryman et al., 2020; Meng et al., 2022; Vermeire et al., 2022), comprising n = 22805 compounds with d = 15 biological assay responses. We focus on three different selection tasks with the following target regions:

Task 1. The (shifted) first orthant, R = {y : yk ck k},

Multivariate Conformal Selection

0.1 0.2 0.3 0.4 0.5 Nominal FDR level

0.1 0.2 0.3 0.4 0.5 Nominal FDR level

0.1 0.2 0.3 0.4 0.5 Nominal FDR level

Task 1 Power

0.1 0.2 0.3 0.4 0.5 Nominal FDR level

Task 2 Power

bi m CS-d m CS-l CS_int CS_ib CS_is

Figure 1: Observed FDR and power across varying nominal levels for Task 1 (shifted first orthant R) and 2 (spherical R).

Task 2. A sphere centered at c, R = {y : y c 2 r}, Task 3. The complement of a sphere centered at c, R = {y : y c 2 r }.

We included the two tasks introduced in the numerical simulations (Section 5), and we also designed the third task to evaluate the performance of various methods under a nonconvex target region with real data. Each of the selection task is designed so that approximately 15% 30% of the compounds qualify as acceptable. Further details on the dataset and problem setup, including response descriptions and cutoffs, can be found in Appendix D.

In this experiment, the underlying predictor ˆµ is specified as a drug property prediction model from the Deep Purpose Python package (Huang et al., 2020) with Morgan drug encoding. We train the model using ntrain = 12000 samples, provide ncal = 8000 samples for calibration and reserves the remaining data of size ntest = 2805 as test data. We keep the configuration and hyperparameters of the methods unchanged as in Section 5. Two nominal levels q = 0.3 and q = 0.5 are considered, and the selection processes for each method are repeated across 500 iterations.

Table 4: Observed FDR of different methods with real data.

Task q CS int CS ib CS is bi m CS-d m CS-l

1 0.3 0.760 0.000 0.303 0.038 0.304 0.275 2 0.3 0.000 0.300 0.293 3 0.3 0.084 0.301 0.296

1 0.5 0.782 0.393 0.496 0.040 0.499 0.488 2 0.5 0.000 0.499 0.498 3 0.5 0.084 0.501 0.497

Tables 4 and Table 5 summarize the FDR and power of the competing methods respectively. Consistent with our numerical simulation, the methods bi, m CS-dist, and m CS-learn all demonstrate valid FDR control. Among the methods guaranteed to control FDR, m CS-dist and m CS-learn consistently achieve the best and the secondbest power across all tasks and nominal levels. Notably,

Table 5: Observed power of methods with real data.

Task q CS int CS ib CS is bi m CS-d m CS-l

1 0.3 0.993 0.000 0.019 0.000 0.006 0.010 2 0.3 0.000 0.278 0.086 3 0.3 0.000 0.410 0.431

1 0.5 1.000 0.003 0.225 0.000 0.433 0.193 2 0.5 0.000 0.759 0.515 3 0.5 0.000 0.449 0.589

m CS-learn exhibits superior performance under nonconvex target regions, corroborating the results presented in Appendix C.2.3. We note that although the baseline method bi performed well in the simulation settings, it barely selected any compound in the current task. This outcome may be attributed to the suboptimal performance of the underlying binary classification model, which achieved an F1 score of only 0.31.

7. Conclusion

We propose multivariate conformal selection, an extension of conformal selection to multivariate response settings. Our experiments demonstrate that m CS significantly improves selection power while maintaining rigorous FDR control, outperforming existing baselines across simulated and realworld datasets. The flexibility of m CS makes it a valuable tool for selective tasks involved in diverse fields including drug discovery. Looking forward, we anticipate that the m CS framework can be further extended to handle additional practical challenges, including settings with hierarchical or structured responses. By addressing these challenges, m CS has the potential to further enhance its applicability and impact across diverse scientific and industrial domains.

Impact Statement

This work does not present any foreseeable societal consequence.

Multivariate Conformal Selection

Acknowledgments

This work was supported by Natural Sciences and Engineering Research Council (NSERC) Discovery Grant (RGPIN2024-06780) and FRQNT Team Research Project Grant (FRQ-NT 327788).

Aliagas, I., Gobbi, A., Lee, M.-L., and Sellers, B. D. Comparison of logp and logd correction models trained with public and proprietary data sets. Journal of Computer Aided Molecular Design, 36(3):253 262, 2022.

Bai, T., Tang, P., Xu, Y., Svetnik, V., Khalili, A., Yu, X., and Yang, A. Conformal selection for efficient and accurate compound screening in drug discovery. Chem Rxiv, 2024. doi: 10.26434/ chemrxiv-2024-pf3ph-v2. URL https://chemrxiv. org/engage/chemrxiv/article-details/ 67cfbb1c81d2151a02f4efc6.

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., Das Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. ar Xiv preprint ar Xiv:2204.05862, 2022.

Bates, S., Angelopoulos, A., Lei, L., Malik, J., and Jordan, M. Distribution-free, risk-controlling prediction sets. Journal of the ACM (JACM), 68(6):1 34, 2021.

Bates, S., Cand es, E., Lei, L., Romano, Y., and Sesia, M. Testing for outliers with conformal p-values. The Annals of Statistics, 51(1), February 2023. ISSN 0090-5364. doi: 10.1214/22-aos2244. URL http://dx.doi.org/ 10.1214/22-AOS2244.

Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289 300, 1995. ISSN 00359246. URL http://www.jstor.org/ stable/2346101.

Benjamini, Y. and Hochberg, Y. Multiple hypotheses testing with weights. Scandinavian Journal of Statistics, 24(3): 407 418, 1997. doi: https://doi.org/10.1111/1467-9469. 00072. URL https://onlinelibrary.wiley. com/doi/abs/10.1111/1467-9469.00072.

Blondel, M., Teboul, O., Berthet, Q., and Djolonga, J. Fast differentiable sorting and ranking. In International Conference on Machine Learning, pp. 950 959. PMLR, 2020.

Braga, R. C., Alves, V. M., Silva, M. F., Muratov, E., Fourches, D., Li ao, L. M., Tropsha, A., and Andrade,

C. H. Pred-herg: A novel web-accessible computational tool for predicting cardiac toxicity. Molecular Informatics, 34(10):698 701, 2015.

Cuturi, M., Teboul, O., and Vert, J.-P. Differentiable ranking and sorting using optimal transport. Advances in Neural Information Processing Systems, 32, 2019.

Dunn, O. J. Multiple comparisons among means. Journal of the American Statistical Association, 56(293):52 64, 1961.

Esposito, C., Wang, S., Lange, U. E., Oellien, F., and Riniker, S. Combining machine learning and molecular dynamics to predict p-glycoprotein substrates. Journal of Chemical Information and Modeling, 60(10):4730 4749, 2020.

Etemadi, N. and Kaminski, M. Strong law of large numbers for 2-exchangeable random variables. Statistics & probability letters, 28(3):245 250, 1996.

Falc on-Cano, G., Molina, C., and Cabrera-P erez, M. A. Reliable prediction of caco-2 permeability by supervised recursive machine learning approaches. Pharmaceutics, 14(10):1998, 2022.

Feldman, S., Bates, S., and Romano, Y. Calibrated multipleoutput quantile regression with representation learning. Journal of Machine Learning Research, 24(24):1 48, 2023.

Gui, Y., Jin, Y., and Ren, Z. Conformal alignment: Knowing when to trust foundation models with guarantees. Advances in Neural Information Processing Systems, 2024.

Heid, E., Greenman, K. P., Chung, Y., Li, S.-C., Graff, D. E., Vermeire, F. H., Wu, H., Green, W. H., and Mc Gill, C. J. Chemprop: a machine learning package for chemical property prediction. Journal of Chemical Information and Modeling, 64(1):9 17, 2023.

Huang, K., Fu, T., Glass, L. M., Zitnik, M., Xiao, C., and Sun, J. Deeppurpose: a deep learning library for drug target interaction prediction. Bioinformatics, 36(22-23): 5545 5547, 2020.

Iwata, H., Matsuo, T., Mamada, H., Motomura, T., Matsushita, M., Fujiwara, T., Maeda, K., and Handa, K. Predicting total drug clearance and volumes of distribution using the machine learning-mediated multimodal method through the imputation of various nonclinical data. Journal of Chemical Information and Modeling, 62 (17):4057 4065, 2022.

Jin, Y. and Cand es, E. J. Selection by prediction with conformal p-values. Journal of Machine Learning Research, 24(244):1 41, 2023.

Multivariate Conformal Selection

Johnstone, C. and Cox, B. Conformal uncertainty sets for robust optimization. In Conformal and Probabilistic Prediction and Applications, pp. 72 90. PMLR, 2021.

Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker, B. A., Thiessen, P. A., Yu, B., et al. Pubchem 2023 update. Nucleic Acids Research, 51(D1): D1373 D1380, 2023.

Kivaranovic, D., Johnson, K. D., and Leeb, H. Adaptive, distribution-free prediction intervals for deep networks. In International Conference on Artificial Intelligence and Statistics, pp. 4346 4356. PMLR, 2020.

Klein, M., Bethune, L., Ndiaye, E., and Cuturi, M. Multivariate conformal prediction using optimal transport. ar Xiv preprint ar Xiv:2502.03609, 2025.

Kuleshov, A., Bernstein, A., and Burnaev, E. Conformal prediction in manifold learning. In Conformal and Probabilistic Prediction and Applications, pp. 234 253. PMLR, 2018.

Lei, L. and Cand es, E. J. Conformal inference of counterfactuals and individual treatment effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(5):911 938, 2021.

Meng, J., Chen, P., Wahib, M., Yang, M., Zheng, L., Wei, Y., Feng, S., and Liu, W. Boosting the predictive performance with aqueous solubility dataset curation. Scientific Data, 9(1):71, 2022.

Messoudi, S., Destercke, S., and Rousseau, S. Ellipsoidal conformal inference for multi-target regression. In Conformal and Probabilistic Prediction with Applications, pp. 294 306. PMLR, 2022.

Park, J. W., Tibshirani, R., and Cho, K. Semiparametric conformal prediction. ar Xiv preprint ar Xiv:2411.02114, 2024.

Perneger, T. V. What s wrong with bonferroni adjustments. BMJ, 316(7139):1236 1238, 1998.

Perryman, A. L., Inoyama, D., Patel, J. S., Ekins, S., and Freundlich, J. S. Pruned machine learning models to predict aqueous solubility. ACS Omega, 5(27):16562 16567, 2020.

Romano, Y., Patterson, E., and Candes, E. Conformalized quantile regression. Advances in Neural Information Processing Systems, 32, 2019.

Scannell, J. W., Bosley, J., Hickman, J. A., Dawson, G. R., Truebel, H., Ferreira, G. S., Richards, D., and Treherne, J. M. Predictive validity in drug discovery: what it is, why it matters and how to improve it. Nature Reviews Drug Discovery, 21(12):915 931, 2022.

Sesia, M. and Cand es, E. J. A comparison of some conformal quantile regression methods. Stat, 9(1):e261, 2020.

Sheridan, R. P., Mc Masters, D. R., Voigt, J. H., and Wildey, M. J. ecounterscreening: using qsar predictions to prioritize testing for off-target activities and setting the balance between benefit and risk. Journal of Chemical Information and Modeling, 55(2):231 238, 2015.

Storey, J. D., Taylor, J. E., and Siegmund, D. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society Series B: Statistical Methodology, 66(1):187 205, 2004.

Szyma nski, P., Markowicz, M., and Mikiciuk-Olasik, E. Adaptation of high-throughput screening in drug discovery toxicological screening tests. International Journal of Molecular Sciences, 13(1):427 452, 2011.

Vermeire, F. H., Chung, Y., and Green, W. H. Predicting solubility limits of organic solutes for a wide range of solvents and temperatures. Journal of the American Chemical Society, 144(24):10785 10797, 2022.

Vovk, V., Gammerman, A., and Shafer, G. Algorithmic Learning in a Random World. Springer, 2005. URL https://api.semanticscholar. org/Corpus ID:118783209.

Watanabe, R., Esaki, T., Kawashima, H., Natsume-Kitatani, Y., Nagao, C., Ohashi, R., and Mizuguchi, K. Predicting fraction unbound in human plasma from chemical structure: improved accuracy in the low value ranges. Molecular Pharmaceutics, 15(11):5302 5311, 2018.

Wenzel, J., Matter, H., and Schmidt, F. Predictive multitask deep neural network models for adme-tox properties: learning from large data sets. Journal of Chemical Information and Modeling, 59(3):1253 1268, 2019.

Westfall, P. H. and Young, S. S. Resampling-Based Multiple Testing: Examples and Methods for p-value Adjustment, volume 279. John Wiley & Sons, 1993.

Yang, K., Swanson, K., Jin, W., Coley, C., Eiden, P., Gao, H., Guzman-Perez, A., Hopper, T., Kelley, B., Mathea, M., et al. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370 3388, 2019.

Zhang, K., Yang, X., Wang, Y., Yu, Y., Huang, N., Li, G., Li, X., Wu, J. C., and Yang, S. Artificial intelligence in drug development. Nature Medicine, pp. 1 15, 2025.

Multivariate Conformal Selection

A. Technical Proofs

A.1. Proof of Proposition 3.2

Proof of Proposition 3.2. Since V is fixed, the nonconformity scores V1, . . . , Vn and Vn+j are exchangeable. By a standard result from conformal inference Vovk et al. (2005, Proposition 2.4), the oracle p-value p j defined as in (3) is uniform distributed with value ranging in (0, 1), and P(p j α) α. This gives

P(p j α and j H0) α.

When the null hypothesis H0j is true, yn+j Rc. Since rn+j R, by the regional monotone property, we have Vn+j = V (xn+j, yn+j) V (xn+j, rn+j) = b Vn+j. We then have p j pj by definition, and

P(pj α and j H0) P(p j α and j H0) α.

A.2. Proof of Theorem 3.5

Proof of Theorem 3.5. We adapt the proof of Theorem 6 in (Jin & Cand es, 2023). In the proof that follows, we fix index j {1, . . . , m}. For notational simplicity, only in this proof we deal with the deterministic conformal p-values,

pj = 1 n + 1 1 +

i=1 1{Vi b Vn+j}

We note that the deterministic conformal p-values are only valid when the scores {Vi}n i=1 {Vn+j}m j=1 have no ties almost surely, and therefore we also make this assumption. We highlight that the validity of the random conformal p-value does not rely on this statement. We also define the corresponding deterministic oracle p-values,

p j = 1 n + 1 1 +

i=1 1{Vi Vn+j} .

This version of p-values are also conservative by standard results in conformal inference (Bates et al., 2023; Jin & Cand es, 2023). For l = 1, . . . , m, we define a set of slightly modified p-values,

p(j) l = 1 n + 1 n X

i=1 1{Vi b Vn+l} + 1{Vn+j b Vn+l} .

Also define S(a1, . . . , am) {1, . . . , m} as the rejection index set obtained by the Benjamini-Hochberg procedure, from p-values taking values a1, . . . , am. Then, the output of m CS is S(p1, . . . , pm).

In the sequel, we will compare S(p1, . . . , pm) to

S(p(j) 1 , . . . , p(j) j 1, p j, p(j) j+1, . . . , p(j) m )

for the test sample j that is falsely rejected, i.e.

j S, Yn+j Rc.

First, on this event, we have Vn+j = V (xn+j, yn+j) V (xn+j, rn+j) = b Vn+j and p j pj. For the remaining p-values,

{p(j) l }l =j, we consider two cases:

(i) If b Vn+l b Vn+j, then pl pj. In this case, we also have b Vn+l Vn+j as b Vn+j Vn+j. This implies

p(j) l = 1 n + 1 n X

i=1 1{Vi b Vn+l} + 1{Vn+j b Vn+l} = 1 n + 1 n X

i=1 1{Vi b Vn+l} + 1 = pl.

Multivariate Conformal Selection

(ii) If b Vn+l < b Vn+j, then pl pj. We also have

p(j) l 1 n + 1 n X

i=1 1{Vi b Vn+l} + 1 1 n + 1 n X

i=1 1{Vi b Vn+j} + 1 = pj.

Since j S, by the definition of Benjamini-Hochberg procedure l S as pl has smaller rank when ordering the p-values.

To summarize, if we replace (p1, . . . , pm) by (p(j) 1 , . . . , p(j) j 1, p j, p(j) j+1, . . . , p(j) m ) on the event {j S, Yn+j Rc}, such a replacement does not modify any of those p-values pl if they satisfied pl pj. Also, for all p-values pl with pl pj including j itself (l = j), their replaced values p(j) l are still no greater than pj. Since all p-values are no larger than their original values after the replacements, the size of rejection set must not decrease. On the other hand, since j S and no p-values larger than pj are modified, no new hypotheses can be rejected by the new set of p-values. We conclude that

S j := S(p(j) 1 , . . . , p(j) j 1, p j, p(j) j+1, . . . , p(j) m ) = S(p1, . . . , pm) = S

on the event {Yn+j Rc, j S}. By decomposing the FDR we have

FDR = E Pm j=1 1{Yn+j Rc}1{j S}

max(1, |S|)

1 k E 1{|S| = k}1{Yn+j Rc}1{pj qk

1 k E 1{|S j | = k}1{Yn+j Rc}1{p j qk

1 k E 1{|S j | = k}1{p j qk

1 k E 1{|S j | = k}1{j S j } .

The last equality is again by the property of the Benjamini-Hochberg procedure. Also, by its step-up nature, sending p j to zero does not change the rejection set if the corresponding hypothesis of p j is rejected, i.e. on the event {j S j }. We have

S j = S(p(j) 1 , . . . , p(j) j 1, p j, p(j) j+1, . . . , p(j) m ) = S(p(j) 1 , . . . , p(j) j 1, 0, p(j) j+1, . . . , p(j) m ) =: S j 0

1 k E 1{|S j 0| = k}1{j S j } =

j=1 E 1{p j q|S j |/m} max(1, |S j 0|)

j=1 E 1{p j q|S j 0|/m} max(1, |S j 0|)

By definition, {p(j) l }l =j is invariant after permuting {Vi}n i=1 {Vn+j}. Since {Vi}n i=1 {Vn+j} are exchangeable conditioned on {b Vn+l}l =j, the distribution of {p(j) l }l =j is independent from the ordering of {Vi}n i=1 {Vn+j} conditioned on the (unordered) set [V1, . . . , Vn, Vn+j] {b Vn+l}l =j. Also, conditioned on {b Vn+l}l =j, S j 0 only depends on {p(j) l }l =j which is in turn only dependent on the unordered set [V1, . . . , Vn, Vn+j], and p j only depends on the ordering of {Vi}n i=1 {Vn+j}. This implies that S j 0 is independent on p j conditioned on the (unordered) set [V1, . . . , Vn, Vn+j] and {b Vn+l}l =j. Therefore, by the conservative property of conformal p-values and conditional independence,

P p j q|S j 0| m

[V1, . . . , Vn, Vn+j] {b Vn+l}l =j q|S j 0| m .

Multivariate Conformal Selection

By the law of total expectation,

E 1{p j q|S j 0|/m} max(1, |S j 0|)

E 1{p j q|S j 0|/m} max(1, |S j 0|)

[V1, . . . , Vn, Vn+j] {b Vn+l}l =j

E h 1{|S j 0| = 0} q|S j 0| m|S j 0|

[V1, . . . , Vn, Vn+j] {b Vn+l}l =j i

Since when |S j 0| = 0, 1{p j q|S j 0|/m} = 0. Now, the proof is concluded by summing over every j = 1, . . . , m.

A.3. Proof of Proposition 4.2

Proof of Proposition 4.2. For any specific dataset and selection problem, let V opt denote the optimal nonconformity score that controls the FDR and maximize the number of selected samples. Define V opt i = V opt(xi, yi) for calibration samples and b V opt n+j = V opt(xn+j, rn+j) for test samples. Consider the following score W within the family (11):

W(x, y) = M 1{y R \ R} + V opt(x, y).

Picking rn+j r R, the test nonconformity scores c Wn+j satisfy:

c Wn+j = W(xn+j, rn+j) = V opt(xn+j, rn+j) = b V opt n+j,

and for each calibration score,

Wi = W(xi, yi) V opt(xi, yi) = V opt i .

Therefore, by replacing V opt with W, the calibration nonconformity scores W1, . . . , Wn may increase, while the test nonconformity scores c Wn+j remain unchanged. According to (4), this leads to a decrease in each conformal p-value pj, which in turn increase the value k in Algorithm 1. This means that we would select more samples, by the definition of S. Consequently, we conclude that there must exist a score within the family (11) that achieves the optimal selection size. A similar argument applies to the maximization selection power, thereby concluding the proof.

B. Deferred Discussions

B.1. Discussions on Classification Responses

In our paper, we choose to focus primarily on the regression response case for multivariate conformal selection. In fact, the selection problem for classification responses (univariate or multivariate) can be directly reduced to the univariate conformal selection framework introduced by Jin & Cand es (2023):

For the univariate classification setting, suppose the response space is composed of classes Y = K k=1Ck with target region R = s k=1Ck (with s < K). Then, by defining a binary response: yi = 1{yi R}, the original selection problem directly translates into a univariate conformal selection problem, where we select samples with yi = 1.

For multivariate classification case, e.g. suppose that bivariate responses yi are drawn from a joint class space Y = (Y(1), Y(2)) = k,ℓ(C(1) k , C(2) ℓ ), and the target region R = (k,ℓ) I(C(1) k , C(2) ℓ ). Again, we define a binary indicator yi = 1{yi R}, converting the original multivariate selection problem into a standard univariate selection task:

H0j : yn+j < 0.5 versus H1j : yn+j 0.5.

Since P( yi = 1) = P(yi R), there is a direct correspondence between the multivariate and univariate nonconformity scores: V (x, y) = M 1{y R} b P(y R|x)

and V (x, y) = M 1{ y 0.5} µ(x)

Multivariate Conformal Selection

where µ(x) b P( y = 1|x). Moreover, regional monotonicity is simply the usual monotone condition of univariate conformal selection: V (x, y = 0) V (x, y = 1).

Thus, every classification-based selection task can be naturally and effectively solved using existing univariate conformal selection methods.

B.2. Additional Discussions on Theorem 4.1 and Advantages of The Clipped Nonconformity Score

In this section, we first provide interpretations about Theorem 4.1 and offer an alternative perspective on the advantages of the clipped nonconformity score.

We first note that the (practical) conformal p-values defined in (4) can be rewritten as

pj = 1 n + 1

i=1 1{Vi < b Vn+j} + 1 n + 1Uj

i=1 1{Vi = b Vn+j} + 1 n + 1Uj.

By the Glivenko-Cantelli theorem under exchangeability (Etemadi & Kaminski, 1996), as n ,

i=1 1{Vi < t} + 1 n + 1Uj

i=1 1{Vi = t} + 1 n + 1Uj P(V (x, y) < t) Uj P(V (x, y) = t)

Replacing t in the above by b Vn+j = V (xn+j, r) yields

pj a.s. P(V (x, y) < V (xn+j, r)|V (xn+j, r)) + Uj P(V (x, y) = V (xn+j, r)|V (xn+j, r)) = F(V (xn+j, r), Uj)

d F(V (x, r), U).

Therefore, F(V (x, r), U) is also a conservative random variable.

The quantity t in Theorem 4.1 can also be viewed as the asymptotic counterpart of a corresponding finite-sample quantity. A characterization of the BH procedure (Storey et al., 2004) states that the rejection set S = {j : pj τ }, where the BH rejection threshold τ is defined as

τ = sup n t [0, 1] : t 1 m Pm j=1 1{pj t} q o .

By the fact that 1

m Pm j=1 1{pj t} P(pj t) p 0 as m (which follows due to the the weak law of large numbers for triangular arrays), and our earlier characterization of F(V (x, r), Uj) which implies P(pj t) P(F(V (x, r), U) t) as n , the fraction in the definition of τ satisfies (in the limit n, m )

t 1 m Pm j=1 1{pj t}

p t P(F(V (x, r), U) t).

This establishes t as the asymptotic limit of the BH rejection threshold τ , which further implies

j=1 1{pj τ } p P(F(V (x, r), U) t ) and 1 m

j=1 1{pj τ and yn+j R} p P(F(V (x, r), U) t , y R).

Theorem 4.1 then gives the asymptotic version of FDR and power:

" |S {j : yn+j Rc}|

" |{j : pj τ } {j : yn+j Rc}|

|{j : pj τ }|

m Pm j=1 1{pj τ and yn+j Rc}

1 m Pm j=1 1{pj τ }

P(F(V (x, r), U) t , y Rc)

P(F(V (x, r), U) t ) ,

Multivariate Conformal Selection

" |S {j : yn+j R}|

|{j : yn+j R}|

" |{j : pj τ } {j : yn+j R}|

|{j : yn+j R}|

m Pm j=1 1{pj τ and yn+j R}

1 m Pm j=1 1{yn+j R}

P(F(V (x, r), U) t , y R)

Theorem 4.1 suggests that the clipped score should be preferred from the perspective of maximizing the realized (asymptotic) FDR (which is still controlled below the nominal level). The asymptotic FDR is bounded through the following chain of inequalities:

P(F(V (x, r), U) t , y Rc)

P(F(V (x, r), U) t ) P(F(V (x, y), U) t , y Rc)

P(F(V (x, r), U) t )

( ) P(F(V (x, y), U) t )

P(F(V (x, r), U) t ) = t

P(F(V (x, r), U) t ). (17)

By the definition of t , the final term in (17) is at most the nominal level q. If the bounds in this chain of inequalities could be further tightened, the realized FDR would more closely align with the nominal level q, effectively allowing for greater selection power as more of the FDR budget could be utilized.

Using the clipped score tightens the inequality ( ) in (17). Assuming the score function V is clipped, we observe that

P(F(V (x, y), U) t ) P(F(V (x, r), U) t ) P(F(V (x, y), U) t , y Rc)

P(F(V (x, r), U) t ) = P(F(V (x, y), U) t , y R)

P(F(V (x, r), U) t )

is approximately zero, since for y R, the clipped score satisfies V (x, y) M, implying that P(F(V (x, y), U) t , y R) 0, because that F(v, u) is monotone with respect to its first argument v. Notably, the large constant M is originally introduced in conformal selection as a relaxation of infinity, and these approximations indeed hold. This phenomenon is verified through our extra simulations in Appendix C.2.1.

C. Additional Details for Numerical Simulations (Section 5)

C.1. Data Generating Processes and Configuration of the Selection Tasks

The configuration of the true regression function µ and noise term ϵ Rd can be found in Table 6.

For the noise term, the degree of freedom for the t-distribution scenario is set to ν = 3, and the scale matrix Σ is specified as a matrix with diagonal entries of 0.5 and off-diagonal entries of 0.05. The second column denotes the k-th output entry of the true regression function µ, which relates to the k-th coordinate yk of the response y and the k-th coordinate ϵk of the noise ϵ as yk = [µ(x)]k + ϵk. In all of our simulations, we take p = 10 (recall that x Unif( 1, 1)p but y Rd). If, when generating yk the value of xℓ(the ℓ-th coordinate of x) for ℓ> p is needed, we take xℓ= x((ℓ 1) mod p)+1.

Table 7 summarizes the values of coefficient vectors (c and r) that define the selection tasks for each response dimension. Within each vector, all entries share the same value. For instance, if ck is listed as 1, it indicates that c = (1, 1, . . . , 1).

The six settings we consider differ in two key aspects that influence the difficulty of selection. First, the regression function µ exhibits varying degrees of nonlinearity across settings. Specifically, Settings 1 and 4 are linear, while Settings 2 and 5 have weak nonlinearity, and Settings 3 and 6 exhibit strong nonlinearity. The degree of nonlinearity affects the predictive accuracy of the estimated regression function ˆµ, which in turn impacts the selection difficulty. Second, the distribution of the noise term ϵ differs across settings, leading to variations in the conditional variance. In Settings 1, 2, and 3, ϵ follows a multivariate Gaussian distribution, whereas in Settings 4, 5, and 6, it follows a multivariate t-distribution. The conditional variance cov(ϵ) = Var(y | x) in the latter three settings is ν ν 2Σ = 3Σ, which is three times larger than that of the former three settings, thereby increasing the difficulty of selection.

To better illustrate the data distributions, Figure 2 presents example scatter plots of the response vector y for the six settings, with the response dimension set to 2 for visualization purposes. Higher-dimensional cases are not displayed as they are

Multivariate Conformal Selection

Table 6: True regression functions and noise distributions

Setting [µ( )]k ϵ

2xk+1 + xk+2 + 3

2 xk + x2 k+2 + 1

3 1{xkxk+1 > 0} 1{xk+2 > 0.5} (0.25 + xk+2) 1{xkxk+1 0} 1{xk+2 0.5} (xk+2 0.25) + 0.75 N(0, Σ)

4 Same as Setting 1 tν(0, Σ)

5 Same as Setting 2 tν(0, Σ)

6 Same as Setting 3 tν(0, Σ)

Table 7: Coefficients for Selection Task 1 and 2

Response Dimension ck (Task 1) ck (Task 2) rk (Task 2)

5 0.2 2 2.6

10 0.2 2 4.1

30 0.6 2 7.5

more challenging to interpret. Settings within the same column share the same conditional expectation E(y | x), while those in the lower row exhibit greater noise and more dispersed scatter patterns due to the multivariate t-distributed noise ϵ. The target regions for the two tasks, defined based on the coefficients in Table 7, are highlighted as red and yellow shaded areas in the plots.

C.2. Extra Simulated Experiments

C.2.1. COMPARING TWO DISTANCE-BASED SCORES

In Tables 8 and 9, we compare the performance of the two distance-based scores (7) and (8) in m CS-dist. All other configurations are kept the same as in Section 5.

Table 8: Observed FDR of m CS-dist with score (7) and (8) for Task 1 and 2.

Task 1 Task 2

Setting Score (7) Clipped Score (8) Score (7) Clipped Score (8)

1 0.154 0.277 0.102 0.265 2 0.269 0.314 0.127 0.263 3 0.241 0.265 0.150 0.223 4 0.202 0.278 0.226 0.286 5 0.282 0.308 0.240 0.292 6 0.265 0.287 0.220 0.283

Across a wide range of tasks and experimental settings, the clipped score (8) consistently demonstrates superior performance.

Multivariate Conformal Selection

4 2 0 2 4 6 8

Setting 1, linear, Gaussian noise

4 2 0 2 4 6 8

Setting 2, weak nonlinearity, Gaussian noise

4 2 0 2 4 6 8

Setting 3, strong nonlinearity, Gaussian noise

4 2 0 2 4 6 8

Setting 4, linear, t-distributed noise

4 2 0 2 4 6 8

Setting 5, weak nonlinearity, t-distributed noise

4 2 0 2 4 6 8

Setting 6, strong nonlinearity, t-distributed noise

Figure 2: Scatter plots of 1,000 i.i.d. samples from our data generating processes. Each point represents a sample, with the xand y-axes corresponding to its responses entries y1 and y2, respectively. The red and the yellow shaded areas represents the two target regions for the two selection tasks.

Table 9: Observed power of m CS-dist with score (7) and (8) for Task 1 and 2.

Task 1 Task 2

Setting Score (7) Clipped Score (8) Score (7) Clipped Score (8)

1 0.305 0.555 0.440 0.760 2 0.069 0.104 0.208 0.405 3 0.046 0.068 0.075 0.134 4 0.191 0.324 0.222 0.333 5 0.052 0.060 0.093 0.170 6 0.038 0.046 0.048 0.063

Although both scoring methods guarantee valid FDR control, the clipped score tends to exhibit a higher observed FDR. This finding is consistent with the asymptotic analysis in Appendix B.2, which suggests that the clipped score can leverage a larger portion of the available FDR budget.

C.2.2. VARYING RESPONSE DIMENSION

The main article included results with response dimension d = 30. In this section, we access the performance of m CS-dist and m CS-learn in lower dimensional response settings d {2, 5, 10}, examining whether m CS-learn continues to outperform competing methods for smaller values of d. To simplify the analysis, we focus on Setting 3 for these experiments. All other configurations are kept unchanged.

Tables 10 and 11 show that even for smaller d, m CS-learn and m CS-dist continue to outperform the other baseline methods, achieving the best and the second-best power respectively (while controlling FDR). Notably, the performance gap of m CS-learn over competing methods becomes more pronounced as the response dimension increases, suggesting that

Multivariate Conformal Selection

Table 10: Observed FDR of different methods for lower response dimensions.

Task 1 Task 2

d CS int CS ib CS is bi m CS-d m CS-l bi m CS-d m CS-l

2 0.244 0.000 0.142 0.303 0.268 0.281 0.270 0.299 0.302 5 0.706 0.000 0.255 0.290 0.324 0.294 0.246 0.290 0.302 10 0.735 0.000 0.342 0.311 0.276 0.310 0.257 0.265 0.311 30 0.724 0.000 0.204 0.295 0.264 0.278 0.216 0.223 0.263

Table 11: Observed power of different methods for lower response dimensions.

Task 1 Task 2

d CS int CS ib CS is bi m CS-d m CS-l bi m CS-d m CS-l

2 0.029 0.000 0.022 0.032 0.047 0.063 0.050 0.068 0.069 5 1.000 0.000 0.067 0.041 0.049 0.083 0.050 0.057 0.072 10 1.000 0.000 0.055 0.044 0.058 0.069 0.062 0.069 0.106 30 1.000 0.000 0.039 0.059 0.068 0.102 0.115 0.134 0.179

m CS-learn is particularly advantageous for high-dimensional settings.

C.2.3. NONCONVEX TARGET REGION

To examine the suitability of our methods for tasks involving more irregular target regions, we conduct experiments on two additional tasks in which the target region R is nonconvex:

3. The complement of the (shifted) orthant, R = {y : yk ck k}c = {y : yk < ck for some k},

4. The complement of a sphere centered at c, R = {y : y c 2 > r}.

To ensure that a reasonable proportion (15%-35%) of responses y fall within the selection region, the coefficients ck and r are defined differently from their counterparts in Task 1 and 2. Table 12 presents the specific values, where each scalar, following the convention in Table 7, represents a vector whose entries are all equal to that scalar. All other setups, configurations, and model hyperparameters are the same as in Section 5.2.

Table 12: Coefficients for Selection Task 3 and 4

Response Dimension ck (Task 3) ck (Task 3) rk (Task 4)

10 1.1 2 5.5

30 1.6 2 9.5

Table 13 and Table 14 summarize the observed power and FDR for the two additional tasks respectively, where the dimension is d = 30 and nominal FDR level is q = 0.3. For the nonconvex tasks, m CS-dist demonstrates inferior performance compared to the baseline bi. In contrast, m CS-learn performs comparably to bi in relatively easy settings and surpasses bi in more challenging scenarios, such as Settings 3 and 6. Consequently, m CS-learn emerges as the preferred choice when R is nonconvex or otherwise irregular.

C.2.4. OTHER PRETRAINED MODELS ˆµ

We also test the performance of various methods when the underlying model ˆµ is less predictive. To simulate this effect, we train ˆµ as a linear model instead of support vector machines as in Section 5. All other setups remain unchanged.

Multivariate Conformal Selection

Table 13: Observed FDR of different methods for Task 3 and 4.

Task 3 Task 4

Setting bi m CS-d m CS-l bi m CS-d m CS-l

1 0.321 0.267 0.280 0.327 0.298 0.304 2 0.312 0.228 0.332 0.381 0.271 0.286 3 0.334 0.257 0.276 0.385 0.276 0.275 4 0.314 0.328 0.277 0.298 0.298 0.284 5 0.273 0.266 0.279 0.305 0.288 0.313 6 0.237 0.255 0.303 0.335 0.263 0.265

Table 14: Observed power of different methods for Task 3 and 4.

Task 3 Task 4

Setting bi m CS-d m CS-l bi m CS-d m CS-l

1 0.305 0.052 0.255 0.763 0.311 0.555 2 0.008 0.011 0.017 0.098 0.015 0.089 3 0.018 0.005 0.017 0.043 0.012 0.044 4 0.307 0.050 0.241 0.589 0.233 0.446 5 0.024 0.001 0.034 0.114 0.017 0.117 6 0.023 0.001 0.040 0.036 0.016 0.062

Table 15 and Table 16 summarize the FDR and power of different methods when ˆµ is a fitted linear model. Although overall performance declines for most procedures, m CS-dist and m CS-learn remain comparatively stronger than the baselines. In particular, m CS-learn appears less sensitive to the imperfect linear fit, thanks to its data-adaptive nonconformity score V θ, which is learned based on the model ˆµ and the observed data.

Table 15: Observed FDR of different methods for lower response dimensions.

Task 1 Task 2

Setting CS int CS ib CS is bi m CS-d m CS-l bi m CS-d m CS-l

1 0.776 0.101 0.266 0.274 0.264 0.267 0.257 0.273 0.276 2 0.798 0.000 0.230 0.269 0.299 0.288 0.245 0.222 0.275 3 0.742 0.000 0.210 0.312 0.235 0.206 0.232 0.223 0.287 4 0.803 0.000 0.267 0.318 0.290 0.282 0.307 0.286 0.260 5 0.813 0.000 0.174 0.306 0.279 0.272 0.273 0.298 0.285 6 0.779 0.000 0.268 0.259 0.306 0.264 0.292 0.267 0.279

Table 16: Observed power of different methods for lower response dimensions.

Task 1 Task 2

Setting CS int CS ib CS is bi m CS-d m CS-l bi m CS-d m CS-l

1 1.000 0.204 0.467 0.195 0.550 0.379 0.063 0.842 0.583 2 1.000 0.000 0.024 0.077 0.071 0.089 0.386 0.337 0.419 3 1.000 0.000 0.026 0.066 0.074 0.079 0.129 0.138 0.148 4 1.000 0.000 0.203 0.120 0.309 0.186 0.040 0.408 0.188 5 1.000 0.000 0.016 0.036 0.048 0.051 0.176 0.135 0.162 6 1.000 0.000 0.028 0.042 0.039 0.048 0.056 0.056 0.073

Multivariate Conformal Selection

C.2.5. COMPARISON OF LOSS FUNCTIONS IN m CS-learn

Here we compare the performance of m CS-learn with loss L1 (15) and loss L2 (16). The response dimension is d = 30, and the nominal FDR level is q = 0.3.

Table 17: Observed FDR of m CS-learn with loss L1 and L2.

Task 1 Task 2

Setting L1 L2 L1 L2

1 0.255 0.251 0.276 0.279 2 0.279 0.267 0.278 0.273 3 0.299 0.278 0.331 0.263 4 0.274 0.315 0.280 0.254 5 0.210 0.239 0.278 0.273 6 0.296 0.258 0.361 0.207

Table 18: Observed power of m CS-learn with loss L1 and L2.

Task 1 Task 2

Setting L1 L2 L1 L2

1 0.040 0.325 0.051 0.534 2 0.024 0.109 0.111 0.421 3 0.022 0.102 0.041 0.179 4 0.034 0.199 0.019 0.180 5 0.009 0.042 0.041 0.189 6 0.012 0.034 0.032 0.061

As noted in the main article, employing the L2 loss obviates the need for an additional round of smooth ranking, thereby enhancing both numerical stability and training efficacy. This explains the superior power observed in Table 18 when m CS-learn utilizes the L2 loss.

C.2.6. COMPARISON OF SCORE FAMILIES IN m CS-learn

When learning fθ in m CS-learn, various forms of data may be utilized as inputs, giving rise to different score families. The simplest approach draws exclusively on the covariates x as features, making it applicable even in scenarios where the model ˆµ is not available. Alternatively, incorporating the response y expands the family to (11). When ˆµ is available, its predictions ˆµ(x) can also be included as input. Although this does not increase the overall expressiveness of the family, it can accelerate the training process. In this section, we evaluate the performance of the following score families with varying complexity:

1. Covariate only: M 1{y R \ R} fθ(x; R).

2. Prediction only: M 1{y R \ R} fθ(ˆµ(x); R).

3. Covariate and Prediction: M 1{y R \ R} fθ(x, ˆµ(x); R).

4. Full Family (12) : M 1{y R \ R} fθ(x, y; R).

5. All available information: M 1{y R \ R} fθ(x, ˆµ(x), y; R).

As discussed in the main article, families 4 and 5 incorporate y as an input, which compromises the regional monotonicity of the score unless M is sufficiently large. Table 19 summarizes the FDR and power of m CS-learn with different families, with Setting 3. For both Task 1 and Task 2, families 4 and 5 violated FDR control when M = 103. Among the three subfamilies that maintained valid FDR control, family 3 exhibited the best performance.

Multivariate Conformal Selection

Table 19: Observed FDR and power of m CS-learn with different score families.

Family Task 1 Task 2 Task 1 Task 2

1 0.298 0.237 0.108 0.165 2 0.260 0.291 0.096 0.175 3 0.278 0.263 0.102 0.179 4 0.711 0.594 0.972 0.807 5 0.667 0.494 0.782 0.536

D. Additional Details for Real Data Application (Section 6)

D.1. Overview of Drug Discovery Data and Configuration of the Selection Tasks

The drug discovery dataset we used in Section 6 is compiled from various public sources (Wenzel et al., 2019; Iwata et al., 2022; Kim et al., 2023; Watanabe et al., 2018; Falc on-Cano et al., 2022; Esposito et al., 2020; Braga et al., 2015; Aliagas et al., 2022; Perryman et al., 2020; Meng et al., 2022; Vermeire et al., 2022). Because the integrated data contained missing values, we employed Chemprop (Yang et al., 2019; Heid et al., 2023) to impute these entries. The resulting imputed dataset was then used in all subsequent experiments. The processed dataset contains n = 22805 data points, and can be found at https://github.com/Tian-Bai/mcs.

We list the names, units and cutoffs of the responses (only relevant to the first selection task) in the imputed dataset in Table 20, and provide detailed descriptions of their biological significance and drug discovery relevance in Figure 4. Recall that in the first task we consider target regions of the shape R = {y : yk ck k}, where the cutoffs are the values ck defining the selection problem. Figure 3 shows the distribution of these 15 responses, with vertical red lines indicating their corresponding cutoffs. Approximately 21% of the test dataset compounds exceed all 15 thresholds, thereby qualifying for selection.

For the second task, the target region is defined as a sphere {y : y c 2 r}. For convenience, we take the center of the sphere the same as the cutoffs ck in task 1, and let r = 2.4. Under this definition, approximately 24% of the compounds qualify for selection. Similarly, for the third task where the target region is the complement of a sphere {y : y c 2 r }, we adopt the same center c and set r = 3.4. This choice ensures that 18% of the compounds qualify for selection.

Table 20: List of responses in the drug discovery dataset.

Name Unit Cutoff ck

CL microsome human log 10 (m L/min/kg) 4

CL microsome mouse log 10 (m L/min/kg) 4

CL microsome rat log 10 (m L/min/kg) 4

CL total dog log 10 (m L/min/kg) 0.5

CL total human log 10 (m L/min/kg) 0

CL total monkey log 10 (m L/min/kg) 0.5

CL total rat log 10 (m L/min/kg) 1

CYP2C8 inhibition log 10 (n Molar IC50) 3.5

CYP2C9 inhibition log 10 (n Molar IC50) 3.5

CYP2D6 inhibition log 10 (n Molar IC50) 3.5

CYP3A4 inhibition log 10 (n Molar IC50) 3.5

Papp Caco2 log 10 (10 6cm/s) 0.8

Pgp human log 10 (efflux ratio) 0.2

h ERG binding log 10 (n Molar IC50) 3.5

Log D p H 7.4 log 10 (M/M) 2

Multivariate Conformal Selection

CL_microsome_human

CL_microsome_mouse

CL_microsome_rat

2 1 0 1 2 0

CL_total_dog

2 1 0 1 2 0

CL_total_human

CL_total_monkey

2 1 0 1 2 0

CL_total_rat

2 3 4 5 6 0

CYP2C8_inhibition

2 3 4 5 6 0

CYP2C9_inhibition

CYP2D6_inhibition

CYP3A4_inhibition

h ERG_binding

Log D_p H_7.4

Figure 3: Histograms of 15 responses in the drug dataset. The vertical red lines denote the corresponding cutoffs (for the first task) for each response.

Multivariate Conformal Selection

Figure 4: Description and drug discovery relevance of the responses.

CL microsome human:

Intrinsic metabolic clearance in human liver microsomes. (Help predict rate of liver metabolism in human through the in vitro in vivo correlation.)

CL microsome mouse:

Intrinsic metabolic clearance in mouse liver microsomes. (Help predict rate of liver metabolism in human through the in vitro in vivo correlation.)

CL microsome rat:

Intrinsic metabolic clearance in rat liver microsomes. (Help predict rate of liver metabolism in human through the in vitro in vivo correlation.)

CL total dog:

Total body clearance measured in vivo in dogs. (Drug exposure in this species, crucial for determine human dose regimens through translational modeling.)

CL total human:

Total body clearance measured in vivo in humans (Drug exposure in this species, crucial for determine human dose regimens through translational modeling.)

CL total monkey:

Total body clearance measured in vivo in monkeys. (Drug exposure in this species, crucial for determine human dose regimens through translational modeling.)

CL total rat:

Total body clearance measured in vivo in rats. (Drug exposure in this species, crucial for determine human dose regimens through translational modeling.)

CYP2C8 inhibition:

Inhibition potential against human CYP2C8 enzyme. (Assessment of risk of drug-drug interactions (DDIs) involving drugs metabolized by CYP2C8. Important for safety assessment.)

CYP2C9 inhibition:

Inhibition potential against human CYP2C9 enzyme. (Assessment of drug-drug interactions (DDIs) involving drugs metabolized by CYP2C9 (e.g., warfarin). Important for safety assessment.)

CYP2D6 inhibition:

Inhibition potential against human CYP2D6 enzyme. (Assessment of drug-drug interactions (DDIs) involving drugs metabolized by CYP2D6 (e.g., some antidepressants, beta-blockers). Important for safety assessment.)

CYP3A4 inhibition:

Inhibition potential against human CYP3A4 enzyme. (Assessment of drug-drug interactions (DDIs) involving drugs metabolized by CYP3A4 (a very common pathway). Important for safety assessment.)

Papp Caco2:

Apparent permeability across Caco-2 cell monolayer. (Potential for oral absorption by crossing the intestinal wall. High Papp suggests good absorption is likely.)

Interaction with human P-glycoprotein (Pgp) efflux transporter. (Assess if the drug is pumped out of cells by Pgp, potentially limiting oral absorption and brain penetration.)

h ERG binding:

Binding/inhibition of the h ERG potassium channel. (Assessment of risk of cardiac toxicity (QT prolongation, arrhythmias). A critical safety screen; high binding is a major safety concern.)

Log D p H 7.4:

Logarithm of the distribution coefficient at p H 7.4. (Predicts drug s lipophilicity (fat vs. water solubility) under physiological conditions. Influences absorption, distribution, membrane crossing, and CNS penetration.)

Multivariate Conformal Selection

D.2. Extra Experiments on Real Data

D.2.1. NUMERICAL STABILITY OF DIFFERENT METHODS

In this section, we test the numerical stability of m CS-dist and m CS-learn on real data. To do this, unlike previous experiments where we sample new data or randomly partition data for each iteration, we fix the pretrained model ˆµ, the calibration data Dcal and the test data Dtest. This is to avoid the variability from data sampling or pretraining, and only consider the inherent variance of the methods. Due to the high computation cost of retraining fθ, we also keep fθ the same across different iterations. We keep all other setups and configurations unchanged as in Section 6.

Table 21: Observed standard error of FDRs for different methods with real data.

Task q CS int CS ib CS is bi m CS-d m CS-l

1 0.3 0.000 0.000 0.248 0.000 0.000 0.000 2 0.3 0.000 0.000 0.002 3 0.3 0.000 0.021 0.004

1 0.5 0.000 0.000 0.032 0.000 0.001 0.004 2 0.5 0.000 0.000 0.002 3 0.5 0.000 0.028 0.005

Table 22: Observed standard error of powers for different methods with real data.

Task q CS int CS ib CS is bi m CS-d m CS-l

1 0.3 0.000 0.000 0.022 0.000 0.000 0.000 2 0.3 0.000 0.000 0.001 3 0.3 0.000 0.005 0.002

1 0.5 0.000 0.000 0.056 0.000 0.000 0.004 2 0.5 0.000 0.000 0.004 3 0.5 0.000 0.011 0.005

For all methods, the standard errors for both FDR and power remain low, with the exception of CS is at a nominal level of q = 0.3. This phenomenon likely arises from the unstable nature of the subroutine nominal level search at lower userspecified nominal levels. The failure of CS is to control the FDR at this low nominal level (Figure 1) further corroborates this interpretation.