Published as a conference paper at ICLR 2024

OUT-OF-DOMAIN UNLABELED DATA IMPROVES GENERALIZATION

Amir Hossein Saberi, Amir Najafi, Alireza Heidari, Mohammad Hosein Movasaghinia, Seyed Abolfazl Motahari, Babak H. Khalaj
Department of Electrical Engineering, Department of Computer Engineering, Sharif Center for Information Systems and Data Science, Sharif Institute for Convergence Science & Technology, Sharif University of Technology, Tehran, Iran
Corresponding author: motahari@sharif.edu

ABSTRACT

We propose a novel framework for incorporating unlabeled data into semi-supervised classification problems, where scenarios involving the minimization of either i) adversarially robust or ii) non-robust loss functions are considered. Notably, we allow the unlabeled samples to deviate slightly (in the total variation sense) from the in-domain distribution. The core idea behind our framework is to combine Distributionally Robust Optimization (DRO) with self-supervised training. As a result, we also leverage efficient polynomial-time algorithms for the training stage. From a theoretical standpoint, we apply our framework to the classification problem of a mixture of two Gaussians in ℝ^d, where in addition to the m independent and labeled samples from the true distribution, a set of n (usually with n ≫ m) out-of-domain and unlabeled samples is given as well. Using only the labeled data, it is known that the generalization error can be bounded by a term of order (d/m)^{1/2}. However, using our method on both isotropic and non-isotropic Gaussian mixture models, one can derive a new set of analytically explicit and non-asymptotic bounds which show substantial improvement on the generalization error compared to ERM. Our results underscore two significant insights: 1) out-of-domain samples, even when unlabeled, can be harnessed to narrow the generalization gap, provided that the true data distribution adheres to a form of the "cluster assumption", and 2) the semi-supervised learning paradigm can be regarded as a special case of our framework when there are no distributional shifts. We validate our claims through experiments conducted on a variety of synthetic and real-world datasets.

1 INTRODUCTION

Semi-supervised learning has long been a focal point in the machine learning literature, primarily due to the cost-effectiveness of utilizing unlabeled data compared to labeled counterparts. However, unlabeled data in various domains, such as medicine, genetics, imaging, and audio processing, often originates from diverse sources and technologies, leading to distributional differences between labeled and unlabeled samples. Concurrently, the development of robust classifiers against adversarial attacks has emerged as a vibrant research area, driven by the rise of large-scale neural networks (Goodfellow et al., 2014; Biggio & Roli, 2018). While the primary objective of these methods is to reduce model sensitivity to minor adversarial perturbations, recent observations suggest that enhancing adversarial robustness may also improve the utilization of unlabeled samples (Najafi et al., 2019; Miyato et al., 2018). This paper aims to demonstrate the efficacy of incorporating out-of-domain unlabeled samples to decrease the reliance on labeled in-domain data. To achieve this, we propose a novel framework inspired by a fusion of concepts from adversarial robustness and self-training.
Specifically, we introduce a unique constraint to the conventional Empirical Risk Minimization (ERM) procedure, focusing exclusively on the unlabeled part of the dataset. Our theoretical and experimental analyses show that the inclusion of unlabeled data reduces the generalization gap for both robust and non-robust loss functions. Importantly, our alternative optimization criteria are computationally efficient and can be solved in polynomial time. We have implemented and validated the effectiveness of our method on various synthetic and real-world datasets. From a theoretical standpoint, akin to prior research (Schmidt et al., 2018; Carmon et al., 2019; Zhai et al., 2019; Alayrac et al., 2019), we also address the binary classification problem involving two Gaussian models in ℝ^d. This problem has been the center of attention in several recent works on the theoretical analysis of both semi-supervised and/or adversarially robust learning paradigms. Despite several recent theoretical investigations, the precise trade-off between the sizes of labeled (m) and unlabeled (n) data, even in this specific case, remains incomplete. A number of works have bounded the labeled sample complexity under the assumption of an asymptotically large n (Kumar et al., 2020), while another series of papers have analyzed this task from a completely unsupervised viewpoint. We endeavor to fill this gap by providing the first empirical trade-off between m and n, even when the unlabeled data originates from a slightly perturbed distribution. We derive explicit bounds for both robust and non-robust losses of linear classifiers in this scenario. Our results show that as long as n ≥ Ω(m²/d), our proposed algorithm surpasses traditional techniques that rely solely on labeled data. We also consider the more general case of non-isotropic Gaussian models, as explored in previous studies. The remainder of this paper is structured as follows: Section 1.1 provides an overview of related works in distributionally robust optimization and semi-supervised learning. Section 1.2 summarizes our main contributions, and Section 1.3 introduces our notation and definitions. Section 2 formally states our problem, and in Section 3 we present our novel method, followed by a theoretical analysis in Section 4. Section 5 showcases our experimental validations, further supporting our theoretical findings. Finally, we draw conclusions in Section 6.

1.1 PRIOR WORKS

One of the challenges in adversarially robust learning is the substantial difficulty in increasing the robust accuracy compared to achieving high accuracy in non-robust scenarios (Carlini & Wagner, 2017). A study by Schmidt et al. (2018) posited that this challenge arises from the larger sample complexity associated with learning robust classifiers in general. Specifically, they presented a simple model where a good classifier with high standard (non-robust) accuracy can be achieved using only a single sample, while a significantly larger training set is needed to attain a classifier with high robust accuracy. Recent works (Carmon et al., 2019; Zhai et al., 2019; Alayrac et al., 2019) demonstrated that the gap in sample complexity between robust and standard learning, as outlined by Schmidt et al. (2018) in the context of a two-component Gaussian mixture model, can be bridged with the inclusion of unlabeled samples.
Essentially, unlabeled samples can be harnessed to mitigate classification errors even when test samples are perturbed by an adversary. Another study by Najafi et al. (2019) achieved a similar result using a different definition of adversarial robustness and a more comprehensive data generation model. Their approach involves the use of self-training to assign soft/hard labels to unlabeled data, in contrast to our approach, where unlabeled data is exclusively utilized to constrain the set of classifiers, aiming to avoid crowded regions. While DRO serves as a tool in our approach, it is not necessarily the primary objective. In Deng et al. (2021), the authors showed that in the setting of Schmidt et al. (2018), out-of-domain unlabeled samples improve adversarial robustness. Theoretical analysis of Semi-Supervised Learning (SSL) under the so-called cluster assumption has been a long-studied task (Rigollet, 2007). Beyond Najafi et al. (2019), several recent methods leveraging DRO for semi-supervised learning have emerged (Blanchet & Kang, 2020; Frogner et al., 2021). Notably, Frogner et al. (2021) shares similarities with Najafi et al. (2019); however, instead of assigning artificial labels to unlabeled samples, Frogner et al. (2021) employs them to delimit the ambiguity set and enhance understanding of the marginals. Our work primarily focuses on the robustness aspect of the problem rather than advancing the general SSL paradigm. Defense mechanisms against adversarial attacks usually consider two types of adversaries: i) point-wise attacks, similar to Miyato et al. (2018); Nguyen et al. (2015); Szegedy et al. (2013), and ii) distributional attacks (Staib & Jegelka, 2017; Shafieezadeh Abadeh et al., 2015; Mohajerin Esfahani & Kuhn, 2018), where in the case of the latter, the adversary can change the distribution of the data up to a predefined budget. It has been shown that Distributionally Robust Learning (DRL) achieves superior robustness compared to point-wise methods (Staib & Jegelka, 2017). Namkoong & Duchi (2017) utilized DRL in order to achieve a balance between the bias and variance of the classifier's error, leading to faster rates of convergence compared to empirical risk minimization even in the non-robust case. In DRL, the learner typically aims to minimize the loss while allowing the data distribution to vary within an uncertainty neighborhood. The central idea used by Namkoong & Duchi (2017) was to regulate the diameter of this uncertainty neighborhood based on the number of samples. Gao (2022) achieved similar results in DRL while utilizing the Wasserstein metric to define the perturbation budget for the data distribution. Based on the above arguments, we have also utilized DRL as the main tool in developing our proposed framework.

1.2 MAIN CONTRIBUTIONS

We introduce a novel integration of DRO and Semi-Supervised Learning (SSL), leveraging out-of-domain unlabeled samples to enhance the generalization bound of the learning problem. Specifically, we theoretically analyze our method in the setting where samples are generated from a Gaussian mixture model with two components, which is a common assumption in several theoretical analyses in this field. For example, a simpler setting, in which the two Gaussians are isotropic and well-separated, is the sole focus of many papers such as Schmidt et al. (2018); Carmon et al. (2019); Alayrac et al.
(2019). Some of our notable contributions and improvements over recent works in the field include:

(i) In Theorem 4.1, we present a non-asymptotic bound for adversarially robust learning, leveraging both labeled and unlabeled samples jointly. This result builds upon the work of Carmon et al. (2019) and Alayrac et al. (2019), which focused on the effectiveness of unlabeled samples in a regime where a single labeled sample is sufficient to learn a (non-robust) linear classifier. However, these studies do not provide insights into the necessary number of unlabeled samples when multiple labeled samples are involved, particularly in scenarios where the underlying distribution exhibits limited separation between the two classes. Our theoretical bounds address and fill this crucial gap.

(ii) Theorem 4.2 introduces a novel non-asymptotic bound for integrating labeled and unlabeled samples in SSL. To underscore the significance of our findings, consider the following example: In the realizable setting, where positive and negative samples can be completely separated by a hyperplane in ℝ^d, the sample complexity of supervised learning for a linear binary classifier is known to be O(d/ϵ) (Mohri et al., 2018). However, in the non-realizable setting, this complexity escalates to O(d/ϵ²) (Mohri et al., 2018). A pivotal question in learning theory revolves around how to approach the sample complexity of O(d/ϵ) in the non-realizable setting. Insights provided by Namkoong & Duchi (2017) delve into this inquiry. Notably, even with the awareness that the underlying distribution is a Gaussian mixture, the optimal sample complexity, as per Ashtiani et al. (2018), still exceeds O(d/ϵ²). Our work demonstrates that in scenarios where the underlying distribution is a Gaussian mixture and we possess m = O(d/ϵ) labeled samples, coupled with n = O(d/ϵ⁶) unlabeled samples (without knowledge of the underlying distribution), one can achieve an error rate lower than or equal to the case of having access to O(d/ϵ²) labeled samples.

(iii) We formalize the incorporation of out-of-domain unlabeled samples into the generalization bounds of both robust and non-robust classifiers in Theorems 4.1, 4.2, and 4.4. We contend that this represents a novel contribution to the field, with its closest counterpart being Deng et al. (2021). Notably, Deng et al. (2021) addresses a scenario where the underlying distribution is an isotropic Gaussian mixture with well-separated Gaussian components, while the separation of components is not a prerequisite for our results.

1.3 NOTATION AND DEFINITIONS

Let us denote the feature space by X ⊆ ℝ^d, and assume H is a class of binary classifiers parameterized by the parameter set Θ: for each θ ∈ Θ, we have a classifier hθ ∈ H with hθ : X → {−1, 1}. Assume a positive function ℓ : X × {−1, 1} × Θ → ℝ≥0 as the loss function. Also, let P be the unknown data distribution over X × {−1, 1}, and let S = {(Xi, yi)}_{i=1}^m, for m ∈ ℕ, be a set of i.i.d. samples drawn from P. Then, for all θ ∈ Θ, the true risk R and the empirical risk R̂ of a classifier w.r.t. P can be defined as follows:

R(θ, P) = E_P[ℓ(X, y; θ)],   R̂(θ, P̂_S^m) = E_{P̂_S^m}[ℓ(X, y; θ)] = (1/m) Σ_{i=1}^m ℓ(Xi, yi; θ),   (1)

where P̂_S^m denotes the empirical estimate of P based on the m samples in S. We also need a way to measure the distance between various distributions that are supported over X. A well-known candidate for this goal is the Wasserstein distance (Definition A.1).
Subsequently, we also define a Wasserstein ball in Definition A.2 in order to effectively constrain a set of probability measures. It should be noted that throughout this paper, the Wasserstein distance between any two distributions supported over X × {−1, 1} is defined as the distance between their respective marginals on X. The ultimate goal of classical learning is to find a parameter θ̂ ∈ Θ such that, with high probability, R(θ̂, P) is sufficiently close to min_{θ∈Θ} R(θ, P). A well-known approach to achieve this goal is the Empirical Risk Minimization (ERM) algorithm, formally defined as follows:

θ̂_ERM(S) ∈ argmin_{θ∈Θ} E_{P̂_S^m}[ℓ(X, y; θ)] = argmin_{θ∈Θ} (1/m) Σ_{i=1}^m ℓ(Xi, yi; θ).   (2)

A recent variant of ERM, which has gained huge popularity in both theory and practice, is the so-called Distributionally Robust Learning (DRL), formulated as follows:

Definition 1.1 (Distributionally Robust Learning (DRL)). DRL aims at training a classifier which is robust against adversarial attacks on the data distribution. In this regard, the learner attempts to find a classifier with a small robust risk, denoted as R^robust_{ϵ,c}(θ, P), which is defined as

R^robust_{ϵ,c}(θ, P) = sup_{P′ ∈ B^c_ϵ(P)} R(θ, P′),   (3)

for all θ ∈ Θ and any ϵ ≥ 0. Therefore, DRL solves the following optimization problem:

θ̂^DRL_{ϵ,c}(S) ∈ argmin_{θ∈Θ} R^robust_{ϵ,c}(θ, P̂_S^m).   (4)

Surprisingly, the sophisticated minimax optimization problem of equation 4, which takes place in a subset of the infinite-dimensional space of probability measures corresponding to the constraints, can be substantially simplified when it is re-written in its dual format:

Lemma 1.2 (From Blanchet et al. (2019)). For a sufficiently small ϵ > 0, the minimax optimization problem of equation 4 has the following dual form:

inf_{θ∈Θ} sup_{P′ ∈ B^c_ϵ(P̂_S^m)} R(θ, P′) = inf_{γ≥0} { γϵ + inf_{θ∈Θ} (1/m) Σ_{i=1}^m sup_{Z∈X} [ ℓ(Z, yi; θ) − γ c(Z, Xi) ] },   (5)

where γ and ϵ are dual parameters, and there is a bijective and reciprocal relation between ϵ and γ*, i.e., the optimal value of γ which minimizes the r.h.s.

As suggested by Sinha et al. (2017), the inf_{γ≥0} on the r.h.s. of the above optimization problem can be removed by fixing a user-defined value for γ. This also means that if one attempts to find the optimal value for θ, the additive term γϵ is ineffective and can be removed as well. It should be noted that this also fixes an (unknown) value for ϵ. In practice, the appropriate value for ϵ is not known beforehand and is thus usually found through a cross-validation stage, while the same procedure can be applied to its dual counterpart, i.e., γ. In other words, the above-mentioned strategy keeps the generality of the problem intact. For the sake of simplicity in the relations, throughout the rest of the paper we work with the dual formulation in equation 5 and let γ be a fixed and arbitrary value.

2 PROBLEM DEFINITION

At this point, we can formally define our problem. Let X ⊆ ℝ^d, and let P0 be an unknown and arbitrary distribution supported on X × {−1, 1}, i.e., P0 produces feature-label pairs. For a valid cost function c : X² → ℝ≥0, let P1 represent a shifted version of P0 such that the marginal distributions of P0 and P1 on X are shifted by W_c(P_{0,X}, P_{1,X}) = α for some α > 0. No assumption on P1(y | X) is necessary in this work. Here, the subscript X denotes the marginal distribution on X. Let us consider the following two sets of samples:

S0 = {(Xi, yi)}_{i=1}^m ∼ P0^m,   S1 = {X′_j}_{j=1}^n ∼ P_{1,X}^n,

where S0 indicates the labeled set and S1 represents the unlabeled out-of-domain data.
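To make this sampling setup concrete, the short sketch below draws a labeled in-domain set S0 and an unlabeled, slightly shifted set S1 from a two-component Gaussian mixture of the kind used as the running example in the next section; the dimensions, sample sizes, means, and shift magnitude are illustrative choices of ours, not values prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, n = 200, 40, 4000             # illustrative sizes with n >> m
mu0 = np.ones(d) / np.sqrt(d)        # in-domain class mean, ||mu0||_2 = 1
sigma0 = 1.0

# Out-of-domain marginal P_{1,X}: mean and variance perturbed on the order of alpha.
alpha = 0.1
mu1 = mu0 + (alpha / np.sqrt(d)) * rng.standard_normal(d)
sigma1 = sigma0 + alpha

# Labeled in-domain set S0 ~ P0^m:  y ~ Unif{-1,+1},  X | y ~ N(y*mu0, sigma0^2 I).
y0 = rng.choice([-1, 1], size=m)
X0 = y0[:, None] * mu0 + sigma0 * rng.standard_normal((m, d))

# Unlabeled out-of-domain set S1 ~ P_{1,X}^n: an equal mixture of N(+mu1, sigma1^2 I)
# and N(-mu1, sigma1^2 I); the mixture component (the would-be label) is discarded.
u = rng.choice([-1, 1], size=n)
X1 = u[:, None] * mu1 + sigma1 * rng.standard_normal((n, d))
```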
A classical result from VC-theory states that the generalization gap in learning from only S0 can (with high probability) be bounded as

R(θ̂_ERM, P0) ≤ min_{θ∈Θ} R(θ, P0) + O(√(VCdim(H)/m)) + √(O(1)/m),   (6)

where VCdim(H) denotes the VC-dimension of the hypothesis class H (Mohri et al., 2018). This bound can be prohibitively large when VCdim(H) grows uncontrollably, e.g., in the case of linear classifiers in very high dimensions (d ≫ 1). We aim to propose a general framework that leverages both S0 and S1 concurrently, and outputs (in polynomial time) an estimator, denoted by θ̂_RSS, such that the second term on the r.h.s. of equation 6 decays faster as one increases both m and n. We are especially interested in cases where n ≫ m. In the next step, we apply our method to a simplified theoretical example in order to give explicit bounds. Similar to Schmidt et al. (2018); Carmon et al. (2019); Zhai et al. (2019); Alayrac et al. (2019), we focus entirely on the binary classification of a high-dimensional Gaussian mixture model with two components using linear classifiers. Mathematically speaking, for some σ0 ≥ 0 and µ0 ∈ ℝ^d, let P0 be the feature-label joint distribution over ℝ^d × {−1, 1} defined as follows:

P0(y = 1) = 1/2,   P0(X | y) = N(yµ0, σ0² I).   (7)

Also, suppose a shifted version of P0, denoted by P1, with P_{1,X} = (1/2) Σ_{u ∈ {−1,1}} N(uµ1, σ1² I), where ∥µ0 − µ1∥₂ ≤ O(α) and |σ1 − σ0| ≤ O(α) (having a Wasserstein distance of α between two such high-dimensional Gaussian mixtures implies that the mean vectors µ0, µ1 and the variances σ0, σ1 are within a fraction of at most O(α) of each other). Given the two sample sets S0 and S1 in this configuration, the problem is to estimate the optimal linear classifier which achieves the minimum error rate.

3 PROPOSED METHOD: ROBUST SELF-SUPERVISED (RSS) TRAINING

We propose a solution that combines two generally independent paradigms in machine learning: self-training (Grandvalet & Bengio, 2004; Amini & Gallinari, 2002) and the distributionally robust learning of equation 4. The essence of self-training is to use the currently learned model in order to induce artificial labels on the unlabeled data. Thus, for an unlabeled sample X′_j and any given model parameter θ ∈ Θ, one can temporarily consider a pseudo-label given by hθ(X′_j). In this regard, the proposed solution, denoted by θ̂_RSS = θ̂_RSS(S0, S1), is defined as follows:

Definition 3.1 (Robust Self-Supervised (RSS) Training). The essence of RSS training is to add a penalty term to the robust version of the original ERM formulation, which is evaluated solely on the out-of-domain unlabeled samples in S1. Mathematically speaking, for a cost function c and a parameter γ ≥ 0, let us define the robust loss φγ : X × {−1, 1} × Θ → ℝ as

φγ(X, y; θ) ≜ sup_{Z ∈ X} [ ℓ(Z, y; θ) − γ c(Z, X) ].   (8)

In this regard, for a given set of parameters γ, γ′, λ ∈ ℝ≥0, the proposed RSS estimator is defined as

θ̂_RSS ∈ argmin_{θ∈Θ} { (1/m) Σ_{i=1}^m φγ(Xi, yi; θ) + λ (1/n) Σ_{j=1}^n φ_{γ′}(X′_j, hθ(X′_j); θ) }.   (9)

The proposed RSS loss in equation 9 comprises two main terms. The first term attempts to minimize the empirical robust risk over the labeled data in S0, where an adversary can alter the distribution of samples within a Wasserstein radius characterized by γ. In the proceeding sections, we show that γ can become asymptotically large (the radius becomes infinitesimally small) as m → ∞, which is similar to Gao (2022). In fact, a small (but non-zero) budget for the adversary can control the generalization. The second term works only on the unlabeled data, which are artificially labeled by hθ. It can be shown that this term regularizes the classifier by forcing it to avoid crowded areas. The sensitivity of such regularization is controlled by both λ and γ′.
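As a rough illustration of Definition 3.1 (and not the authors' implementation), the sketch below instantiates the robust loss of equation 8 for a differentiable surrogate loss and the squared-ℓ2 transport cost, approximating the inner supremum with a few gradient-ascent steps on the perturbed input Z, and then assembles the RSS objective of equation 9 with pseudo-labels hθ(X′_j) on the unlabeled batch. The logistic surrogate, the step counts, and all names are our own assumptions; the zero-one loss of the theory section would not admit this gradient-based inner maximization.

```python
import torch
import torch.nn.functional as F


def robust_loss(model, X, y, gamma, steps=15, lr=0.1):
    """phi_gamma(X, y; theta) = sup_Z [ loss(Z, y; theta) - gamma * ||Z - X||_2^2 ],
    approximated by gradient ascent on Z (cf. equation 8)."""
    Z = X.clone().detach().requires_grad_(True)
    for _ in range(steps):
        obj = (F.softplus(-y * model(Z).squeeze(-1))            # logistic surrogate loss
               - gamma * ((Z - X) ** 2).sum(dim=-1)).sum()      # transport penalty
        (grad,) = torch.autograd.grad(obj, Z)
        Z = (Z + lr * grad).detach().requires_grad_(True)
    # Envelope theorem: evaluate at the (approximate) maximizer, held fixed.
    Zs = Z.detach()
    return F.softplus(-y * model(Zs).squeeze(-1)) - gamma * ((Zs - X) ** 2).sum(dim=-1)


def rss_objective(model, X_lab, y_lab, X_unl, gamma, gamma_prime, lam):
    """Equation 9: robust loss on labeled data plus the self-labeled robust penalty."""
    labeled_term = robust_loss(model, X_lab, y_lab, gamma).mean()
    pseudo = torch.sign(model(X_unl).squeeze(-1)).detach()      # h_theta(X'_j) as pseudo-label
    unlabeled_term = robust_loss(model, X_unl, pseudo, gamma_prime).mean()
    return labeled_term + lam * unlabeled_term
```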
3.1 MODEL OPTIMIZATION: ALGORITHM AND THEORETICAL GUARANTEES

It can be shown that for a convex loss function ℓ, a convex cost function c, and sufficiently large γ and γ′ (i.e., sufficiently small Wasserstein radii), the optimization problem of equation 9 is convex and can be solved up to an arbitrarily high precision in polynomial time. Moreover, if ℓ is not convex, e.g., when H is the set of all neural networks, a simple Stochastic Gradient Descent (SGD) algorithm is still guaranteed to reach at least a local minimum of equation 9. More specifically, equation 9 is a minimax optimization problem and consists of an inner maximization (formulated in equation 8) followed by an outer minimization. As long as the cost function c is strictly convex and γ or γ′ are chosen sufficiently large, the inner maximization problem of equation 8 becomes strictly concave (Najafi et al., 2019; Sinha et al., 2017). This interesting property holds regardless of the convexity of ℓ, which is of paramount importance since ℓ is not convex in most practical situations. On the other hand, the cost function candidates for c considered in this paper are ∥·∥₂ and ∥·∥₂², which are strictly convex. Hence, equation 8 can be optimally solved in polynomial time. The outer minimization problem of equation 9 is also differentiable as long as ℓ is sufficiently smooth (again, convexity is not needed). This means the gradient of equation 9 exists and can be efficiently computed using the envelope theorem. Explicit bounds on the maximum number of SGD steps (with a mini-batch size of 1) required to reach an ε-neighborhood of the global maximum of equation 8, and a local minimum of equation 9, are given by Sinha et al. (2017). Also, formulating the gradient of minimax loss functions such as equation 9 via the envelope theorem has been carried out, for example, in Najafi et al. (2019); Sinha et al. (2017). We have used the same gradient formulation for the numerical optimization of our model parameters in Section 5, where experimental results on real data using neural networks are presented. In the next section, we derive theoretical guarantees for θ̂_RSS and show that it leads to improved generalization bounds when n is sufficiently large and α is controlled.
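Continuing the sketch above, the outer minimization of Section 3.1 can be a plain SGD loop: each step solves the inner maximization inside rss_objective and, per the envelope theorem, back-propagates only through the model parameters while the inner maximizers are held fixed. The model, batch size, learning rate, and weights below are placeholders of ours, and the snippet assumes the robust_loss / rss_objective helpers sketched earlier; the tensors X_lab, y_lab, X_unl would normally come from S0 and S1.

```python
import torch

d = 200
model = torch.nn.Linear(d, 1, bias=False)       # any torch.nn.Module could be used here
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# Placeholder data standing in for S0 (labeled) and S1 (unlabeled, possibly shifted).
X_lab, y_lab = torch.randn(40, d), torch.sign(torch.randn(40))
X_unl = torch.randn(4000, d)

for step in range(200):
    idx = torch.randint(0, X_unl.shape[0], (64,))           # mini-batch of unlabeled points
    loss = rss_objective(model, X_lab, y_lab, X_unl[idx],
                         gamma=1.0, gamma_prime=1.0, lam=0.5)
    opt.zero_grad()
    loss.backward()    # envelope theorem: gradient w.r.t. theta only, Z* treated as constant
    opt.step()
```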
4 THEORETICAL GUARANTEES AND GENERALIZATION BOUNDS

In this section, we discuss the theoretical aspects of the RSS training method, especially for the classification of a two-component Gaussian mixture model using linear classifiers, i.e.,

H ≜ { sign(⟨θ, ·⟩) : ℝ^d → {−1, 1} | θ ∈ ℝ^d }.

For the sake of simplicity in the results, let us define the loss function ℓ as the zero-one loss:

ℓ(X, y; θ) = 1(y⟨θ, X⟩ ≤ 0).   (10)

However, the extension of the theoretical guarantees in this work to other types of loss functions is straightforward. The following theorem shows that the proposed RSS estimator in equation 9 can potentially improve the generalization bound in a robust learning scenario.

Theorem 4.1. Consider the setup described in Section 2 for the sample generation process (GMM assumption), and the loss function defined in equation 10. Using RSS training with m labeled and n unlabeled samples in S0 and S1, respectively, and for any γ, δ > 0, there exist λ and γ′ which can be calculated solely based on the input samples such that the following holds with probability at least 1 − δ:

E_{P0}[φγ(X, y; θ̂_RSS)] − min_{θ∈Θ} E_{P0}[φγ(X, y; θ)]
  ≤ O( √(α(∥µ0∥₂² + σ0²)) + √(2d/(2n+m)) + √(2 log(1/δ)/(2n+m)) + √(2 log(1/δ)/m) ).   (11)

The proof, as well as the procedure for calculating λ and γ′, can be found in Appendix B. Theorem 4.1 presents a generalization bound for the proposed estimator when one considers the robust loss under an adversarial budget, which is characterized by γ. Larger values of γ correspond to smaller Wasserstein radii for the distributional adversary of equation 3. The residual term on the r.h.s. of equation 11 converges to zero at a faster rate compared to that of equation 6, given that n is sufficiently large and α is sufficiently small. We derive explicit conditions regarding this event in Corollary 4.3. Before that, let us show that for fixed m, as one increases the number of unlabeled samples n, the non-robust excess risk of the RSS-trained classifier decreases as well:

Theorem 4.2. Consider the setting described in Theorem 4.1. Then, the estimator θ̂_RSS of equation 9, using respectively m labeled and n unlabeled samples, along with specific values of γ, γ′, and λ which can be calculated solely from the input samples, satisfies the following non-robust generalization bound with probability at least 1 − δ:

R(θ̂_RSS, P) − min_{θ∈Θ} R(θ, P)
  ≤ e^{−∥µ0∥₂²/(4σ0²)} √(∥µ1∥₂² + σ1²) · O( √( 2dα + 2d/(2n+m) + 2 log(1/δ)/m ) ).   (12)

Again, the proof and the procedure for calculating γ, γ′, and λ are discussed in Appendix B. Based on the previous results, the following corollary showcases a number of surprising non-asymptotic conditions under which our generalization bound becomes superior to conventional approaches.

Corollary 4.3. Consider the setting described in Theorem 4.2. Then, θ̂_RSS of equation 9 with m labeled and n unlabeled samples has an advantage over the traditional ERM if

α ≤ O(d/m),   n ≥ Ω(m²/d).   (13)

Also, the following conditions are sufficient to make the minimum required m (for a given error bound) independent of the dimension d:

α ≤ O(d^{-1}),   n ≥ Ω(d³).   (14)

The proof is given in the Appendix. Finally, Theorem 4.2 also implies that if the unlabeled samples are drawn from the same distribution as that of the labeled ones, i.e., α = 0, then the excess risk of RSS training satisfies the following inequality with probability at least 1 − δ:

R(θ̂_RSS, P) − min_{θ∈Θ} R(θ, P) ≤ O( d³ log(1/δ) / (m²(2n + m)) ),   (15)

which again shows the previously mentioned improvements when all samples are in-domain. The assumption of an isotropic GMM with two components has already been studied in the literature (see Section 1). Next, we present a more general case of Theorem 4.2 where each Gaussian component can have a non-diagonal covariance matrix. Mathematically speaking, suppose that P0 and P1 are defined as follows:

P0(y = 1) = 1/2,   P0(X | y) = N(yµ0, Σ0),   P_{1,X} = (1/2) N(µ1, Σ1) + (1/2) N(−µ1, Σ1),   (16)

where ∥µ1 − µ0∥₂ ≤ O(α), ∥Σ1 − Σ0∥₂ ≤ O(α), and ∥µ1∥₂ ≥ β λmax(Σ1). Assume a set of m labeled samples S0 ∼ P0^m, and a set of n unlabeled samples S1 ∼ P_{1,X}^n.
Theorem 4.4 (Generalization Bound for General Gaussian Mixture Models). Consider the setting described in equation 16. Using the algorithm in equation 9 with m labeled and n unlabeled samples, there exists a set of parameters γ, γ′, λ for which the following holds with probability at least 1 − δ:

R(θ̂_RSS, P) ≤ min_{θ∈Θ} R(θ, P) + O(·),   (17)

where the O(·) term, derived in the Appendix, depends on ∥µ1∥₂² + Tr(Σ1) together with the quantities

ϑ = | µ1ᵀ Σ1⁻¹ µ1 − µ0ᵀ Σ0⁻¹ µ0 |,   C = ∥µ0∥₂ + λmin(Σ1)/∥µ0∥₂,
κ1 = λmax(Σ1)/λmin(Σ1),   κ′1 = λmax(Σ1)/Δ(Σ1),   Δ(Σ1) = min{ λi(Σ1) − λj(Σ1) : i, j s.t. λi(Σ1) ≠ λj(Σ1) },   (18)

and λi(Σ) denotes the i-th eigenvalue of Σ.

The proof can be found in the Appendix. One important difference to note between Theorem 4.4 and Theorem 4.2 is the choice of γ′, which controls the adversarial budget for the unlabeled (and out-of-domain) part of the dataset. In the setting of Theorem 4.2, we prefer to choose γ′ as small as possible. However, in the setting of Theorem 4.4, we consider the eigenvectors and eigenvalues of Σ1 and Σ0, as well as the directions of µ1 and µ0, in order to find the optimal value for the adversarial budget. In fact, there are cases in which selecting a large γ′ (less freedom for the adversary) may actually be the optimal choice.

5 EXPERIMENTAL RESULTS

The effectiveness of the proposed method has been assessed through experiments on various datasets, including simulated data and real-world datasets of histopathology images. Each experiment has been divided into two parts: i) cases in which both labeled and unlabeled data are sampled from the same distribution, and ii) scenarios where the unlabeled data differs in distribution from the labeled data. First, let us specify the datasets used in our experiments:

1. Simulated data consists of binary-labeled data points with a dimension of d = 200, generated according to the setting described in Section 2.
2. NCT-CRC-HE-100K consists of 100,000 histopathology images of colon tissue (Kather et al., 2018). The images have dimensions of 224 × 224 and were captured at 20x magnification. The dataset is labeled with 9 distinct classes.
3. Patch Camelyon is a widely used benchmark dataset for medical image analysis. It consists of a large collection of 327,680 color histopathology images from lymph nodes, each with dimensions 96 × 96. The dataset has binary labels for the presence/absence of metastatic tissue.

5.1 EXPERIMENTS ON SIMULATED DATA

To evaluate the effectiveness of our method on simulated data, we first find the optimal classifier using only labeled samples. Then, we apply our method with a varying number of unlabeled samples. The results (see Table 1) show that our proposed method achieves accuracy improvements compared to models trained only on labeled samples. Moreover, the results indicate that our method is more effective when labeled and unlabeled data come from the same distribution. However, it still demonstrates significant improvement even when the unlabeled samples undergo a distribution shift.

Table 1: Accuracy of the model trained on labeled datasets of sizes 10, 20, 40, and 10,000 with varying amounts of unlabeled data from the same distribution with α = 0 (left), and from a different distribution with α = 0.5 ∥µ0∥₂ (right).

                    Same distribution                       Different distribution
Labeled size  Acc    Unlabeled size  Acc      Labeled size  Acc    Unlabeled size  Acc
10            0.63   10              0.61     10            0.59   10              0.59
                     100             0.66                          100             0.65
                     1,000           0.79                          1,000           0.78
                     10,000          0.82                          10,000          0.81
20            0.64   20              0.65     20            0.62   20              0.62
                     200             0.69                          200             0.65
                     2,000           0.80                          2,000           0.79
                     10,000          0.82                          10,000          0.80
40            0.65   40              0.65     40            0.65   40              0.65
                     400             0.71                          400             0.73
                     4,000           0.81                          4,000           0.78
                     10,000          0.82                          10,000          0.80
10,000        0.83   -               -        10,000        0.83   -               -
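For the linear classifier with the zero-one loss, the appendix derives closed-form robust surrogates (equations 28 and 29), which make the simulated-data experiment easy to reproduce in a few lines. The sketch below is our own minimal reproduction under the α = 0 setting, with illustrative choices of γ, γ′, λ, the step count, and the test-set size; it compares a labeled-only baseline (λ = 0) with the RSS objective and is not the authors' code.

```python
import torch

torch.manual_seed(0)
d, m, n = 200, 40, 4000
mu0 = torch.ones(d) / d ** 0.5
y0 = torch.sign(torch.randn(m)); X0 = y0[:, None] * mu0 + torch.randn(m, d)   # labeled S0
u = torch.sign(torch.randn(n));  X1 = u[:, None] * mu0 + torch.randn(n, d)    # unlabeled S1 (alpha = 0)
yt = torch.sign(torch.randn(2000)); Xt = yt[:, None] * mu0 + torch.randn(2000, d)  # test set

def phi_labeled(theta, X, y, gamma):        # equation 28: min{1, max{0, 1 - gamma*y*<theta,X>}}
    return torch.clamp(1 - gamma * y * (X @ theta), 0.0, 1.0)

def phi_unlabeled(theta, X, gamma_p):       # equation 29: max{0, 1 - gamma'*<theta,X>^2}
    return torch.clamp(1 - gamma_p * (X @ theta) ** 2, min=0.0)

def train(lam, gamma=1.0, gamma_p=1.0, steps=500, lr=0.1):
    theta = torch.randn(d); theta /= theta.norm(); theta.requires_grad_(True)
    for _ in range(steps):
        obj = phi_labeled(theta, X0, y0, gamma).mean()
        if lam > 0:
            obj = obj + lam * phi_unlabeled(theta, X1, gamma_p).mean()
        (g,) = torch.autograd.grad(obj, theta)
        with torch.no_grad():
            theta -= lr * g
            theta /= theta.norm()           # project back onto the unit sphere
    return theta.detach()

for lam in (0.0, 1.0):                      # labeled-only baseline vs. RSS with unlabeled data
    theta = train(lam)
    acc = (torch.sign(Xt @ theta) == yt).float().mean().item()
    print(f"lambda={lam}: test accuracy = {acc:.3f}")
```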
5.2 EXPERIMENTS ON HISTOPATHOLOGY DATA

The processing pipeline for the real-world dataset of histopathology images is based on a ResNet50 encoder pre-trained on ImageNet (Deng et al., 2009; He et al., 2016), which extracts and stores 1 × 1024 embeddings from the input images. These embeddings are then used to train a deep neural network with four layers of size 2048 and one output layer for the class id. Also, we have used a LeakyReLU activation function. Experimental results in this part are shown in Table 2. Under the same-distribution setting, both labeled and unlabeled data have been taken from the NCT-CRC-HE-100K dataset. On the other hand, the different-distributions setting implies that the labeled data come from the NCT-CRC-HE-100K dataset (labels are either Normal or Tumor), while the Patch Camelyon dataset is used for the unlabeled data. As a result, the final labeling is binary. The experimental results demonstrate that increasing the amount of unlabeled data leads to an improvement in accuracy for both the same- and different-distribution settings.

Table 2: Accuracy of the model trained on labeled data from the NCT-CRC-HE-100K dataset with varying amounts of unlabeled data from the same distribution (left), as well as when unlabeled samples come from a different distribution (the Patch Camelyon dataset) (right).

                    Same distribution                       Different distribution
Labeled size  Acc    Unlabeled size  Acc      Labeled size  Acc    Unlabeled size  Acc
200           0.71   100             0.78     48            0.65   25              0.78
                     700             0.80                          400             0.79
                     2,000           0.82                          2,000           0.81
500           0.78   200             0.82     240           0.77   50              0.82
                     1,200           0.82                          700             0.86
                     4,000           0.83                          3,000           0.87
3,000         0.87   600             0.88     1,040         0.83   300             0.87
                     10,000          0.89                          2,000           0.89
                     20,000          0.91                          8,000           0.90
50,000        0.916  -               -        32,000        0.94   -               -
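A sketch of the feature-extraction pipeline described above, assuming standard torchvision components: a frozen ImageNet-pretrained ResNet50 backbone produces per-image features, which feed a four-layer LeakyReLU head with 2048-unit hidden layers and a final class-logit layer. The paper stores 1 × 1024 embeddings per image; the linear projection from the 2048-dimensional ResNet50 output down to 1024 below is our assumption about how that size could be obtained, not a detail given in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen ImageNet-pretrained ResNet50 backbone used purely as a feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()                   # expose the 2048-d pooled features
for p in backbone.parameters():
    p.requires_grad_(False)

# Assumed projection to the 1x1024 embedding size reported in the paper.
project = nn.Linear(2048, 1024)

# Classification head: four hidden layers of width 2048 with LeakyReLU, then class logits.
num_classes = 9                               # 9 tissue classes in NCT-CRC-HE-100K
layers, in_dim = [], 1024
for _ in range(4):
    layers += [nn.Linear(in_dim, 2048), nn.LeakyReLU()]
    in_dim = 2048
layers.append(nn.Linear(in_dim, num_classes))
head = nn.Sequential(*layers)

def embed_and_classify(images):               # images: (B, 3, 224, 224) normalized tensors
    with torch.no_grad():
        feats = backbone(images)              # (B, 2048) frozen embeddings
    return head(project(feats))               # (B, num_classes) logits
```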
6 CONCLUSION

In this study, we address robust and non-robust classification challenges with a limited labeled dataset and a larger collection of unlabeled samples, assuming a slight perturbation in the distribution of the unlabeled data. We present the first non-asymptotic trade-off between the labeled (m) and unlabeled (n) sample sizes when learning a two-component Gaussian mixture model. Our analysis reveals that when n ≥ Ω(m²/d), the generalization bound improves compared to using only labeled data, even when the unlabeled data points are slightly out-of-domain. We derive sophisticated results for the generalization error in both robust and non-robust scenarios, employing a technique based on optimizing a robust loss together with a regularization that avoids crowded and dense areas. Our framework integrates tools from self-training, distributionally robust learning, and optimal transport. Experiments on synthetic and real-world datasets validate our theoretical findings, demonstrating improved classification accuracy, even for non-Gaussian cases, by incorporating out-of-domain unlabeled samples. Our methodology hinges on leveraging such data to enhance robust accuracy and on adapting the uncertainty-neighborhood radius based on the labeled and unlabeled sample quantities to strike a balance between bias and variance in the classification error. For future work, there is room for improving and relaxing the conditions for the utility of unlabeled data. Exploring error lower bounds and impossibility results presents another intriguing avenue. Additionally, relaxing the constraints on the level of distribution shift for out-of-domain samples could be a promising direction.

REFERENCES

Jean-Baptiste Alayrac, Jonathan Uesato, Po-Sen Huang, Alhussein Fawzi, Robert Stanforth, and Pushmeet Kohli. Are labels required for improving adversarial robustness? Advances in Neural Information Processing Systems, 32, 2019.

Massih-Reza Amini and Patrick Gallinari. Semi-supervised logistic regression. In ECAI, volume 2, pp. 11, 2002.

Hassan Ashtiani, Shai Ben-David, Nicholas Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes. Advances in Neural Information Processing Systems, 31, 2018.

B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, and M. Welling. Rotation equivariant CNNs for digital pathology, September 2018.

Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 2154-2156, 2018.

Jose Blanchet and Yang Kang. Semi-supervised learning based on distributionally robust optimization. Data Analysis and Applications 3: Computational, Classification, Financial, Statistical and Stochastic Methods, 5:1-33, 2020.

Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830-857, 2019.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39-57. IEEE, 2017.

Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C. Duchi, and Percy S. Liang. Unlabeled data improves adversarial robustness. Advances in Neural Information Processing Systems, 32, 2019.

Richard J. Chen and Rahul G. Krishnan. Self-supervised vision transformers learn visual concepts in histopathology. CoRR, abs/2203.00585, 2022.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Zhun Deng, Linjun Zhang, Amirata Ghorbani, and James Zou. Improving adversarial robustness via unlabeled out-of-domain data. In International Conference on Artificial Intelligence and Statistics, pp. 2845-2853. PMLR, 2021.

Charlie Frogner, Sebastian Claici, Edward Chien, and Justin Solomon. Incorporating unlabeled data into distributionally robust learning. Journal of Machine Learning Research, 22(56):1-46, 2021.

Rui Gao. Finite-sample guarantees for Wasserstein distributionally robust optimization: Breaking the curse of dimensionality. Operations Research, 2022.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. Advances in Neural Information Processing Systems, 17, 2004.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Jakob Nikolas Kather, Niels Halama, and Alexander Marx. 100,000 histological images of human colorectal cancer and healthy tissue, April 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
In Yoshua Bengio and Yann Le Cun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. Ananya Kumar, Tengyu Ma, and Percy Liang. Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning, pp. 5468 5479. PMLR, 2020. Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979 1993, 2018. Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115 166, 2018. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018. Amir Najafi, Shin-ichi Maeda, Masanori Koyama, and Takeru Miyato. Robustness to adversarial perturbations in learning from incomplete data. Advances in Neural Information Processing Systems, 32, 2019. Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. Advances in neural information processing systems, 30, 2017. Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427 436, 2015. Philippe Rigollet. Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research, 8(7), 2007. Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. Advances in neural information processing systems, 31, 2018. Soroosh Shafieezadeh Abadeh, Peyman M Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. Advances in Neural Information Processing Systems, 28, 2015. Aman Sinha, Hongseok Namkoong, Riccardo Volpi, and John Duchi. Certifying some distributional robustness with principled adversarial training. ar Xiv preprint ar Xiv:1710.10571, 2017. Matthew Staib and Stefanie Jegelka. Distributionally robust deep learning as a generalization of adversarial training. In NIPS workshop on Machine Learning and Computer Security, volume 3, pp. 4, 2017. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ar Xiv preprint ar Xiv:1312.6199, 2013. Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019. Runtian Zhai, Tianle Cai, Di He, Chen Dan, Kun He, John Hopcroft, and Liwei Wang. Adversarially robust generalization just requires more unlabeled data. ar Xiv preprint ar Xiv:1906.00555, 2019. A AUXILIARY DEFINITIONS A number of our auxiliary definitions are presented in this section: Published as a conference paper at ICLR 2024 Definition A.1 (Wasserstein Distance). Consider two probability distributions P and Q supported on X, and assume cost function c : X X R+ is a non-negative lower semi-continuous function satisfying c(X, X) = 0 for all X X. Then, the Wasserstein distance between P and Q w.r.t. 
c, denoted as Wc (P, Q), is defined as Wc (P, Q) = inf µ Γ(X 2) EX,X µ [c(X, X )] , subject to µ (X, ) = P , µ ( , X ) = Q, (19) where Γ X 2 denotes the set of all couplings over X X. Definition A.2 (ϵ-neighborhood of a Distribution). The ϵ-neighborhood of a distribution P is defined as the set of all distributions that have a Wasserstein distance less than ϵ from P. Mathematically, it can be represented as: Bc ϵ (P) = {Q : Wc (P, Q) ϵ} . (20) B PROOF OF THEOREMS In this section, we present the proofs for the Theorems and Lemmas from the main manuscript. Proof of Theorem 4.1. With respect to discussions from the main manuscript, we already know that ˆθRSS is the result of the following optimization problem: ˆθRSS arg min θ Θ i=1 ϕγ (Xi, yi; θ) + λ j=1 ϕγ X j, hθ X j ; θ where we have Θ θ Rd, θ 2 = 1 . If we consider the loss function mentioned in Theorem 4.1, along with the cost functions c(X, X ) = X X 2 and c (X, X ) = X X 2 2, the Slater conditions will be satisfied. Hence, we have strong duality, and therefore, there exists s 0 such that ˆθRSS is also the result of the following optimization problem: ˆθRSS = arg min θ: 1 n Pn j=1 ϕγ (X j,hθ(X j);θ) s i=1 ϕγ (Xi, yi; θ). (22) To prove the theorem, we need to establish an upper bound on the distance between the expected robust loss of ˆθRSS and the best possible expected robust loss. To achieve this, we first provide a concentration bound for the maximum distance between the empirical and expected robust loss across all parameters with 1 n j=1 ϕγ X j, hθ X j ; θ s, (23) and then, we demonstrate that the parameter θ , which achieves the best possible robust loss, satisfies equation 23. Suppose that the loss function is bounded with b, i.e., l (X, y; θ) b, θ, X, y, ϕc γ (X, y; θ) b, c, γ, θ, X, y, where the second inequality directly results from the first one. Therefore, 1 m Pm i=1 ϕγ (Xi, yi; θ) has the bounded difference property with parameter b/m, which results that Φ (P0, S0, γ, c) = sup θ Θ i=1 ϕγ (Xi, yi; θ) EP0 ϕc γ (X, y; θ) ! also has the bounded difference property with parameter b/m, where Θ is defined as j=1 ϕγ X j, hθ X j ; θ s, θ 2 = 1 Published as a conference paper at ICLR 2024 So, we have the following inequality with probability more than 1 δ: Φ (P0, S0, γ, c) E [Φ (P0, S0, γ, c)] + m , θ Θ. (26) In the above inequality we can give an upper-bound for the first term in the right hand side of the inequality as follows: E [Φ (P0, S0, γ, c)] =ES0 P m 0 i=1 ϕγ (Xi, yi; θ) EP0 ϕc γ (θ; X, y) !# i=1 ϕγ (Xi, yi; θ) 1 i=1 ϕγ (X i, y i; θ) (Xi,yi) S0 ϵiϕc γ (θ; Xi, yi) Consider the loss function mentioned in the Theorem 4.1, and the following cost functions c, c c(X, X ) = X X 2, c (X, X ) = X X 2 2. With these loss and cost functions we can rewrite ϕc γ and ϕc γ as follows: ϕc γ (X, y; θ) = min {1, max {0, 1 γy θ, X }} , (28) ϕc γ (X, hθ (X) ; θ) = max 0, 1 γ θ, X 2 . (29) As we can see ϕc γ (X, y; θ) is a γ-lipschitz function of y θ, X . Based on the Talagrand s contraction lemma, we can continue the inequalities in equation 27 as E [Φ (P + 0, S0, γ, c)] ES0,ϵ (Xi,yi) S0 ϵiϕc γ (Xi, yi; θ) (Xi,yi) S0 ϵiyi θ, Xi inf ζ>0 sup θ Θ (Xi,yi) S0 ϵi θ, yi Xi ζ X i S1 ϕc γ (X i, hθ (X i) ; θ) + ζs inf ζ>0 sup θ Θ (Xi,yi) S0 ϵiyi θ, Xi +ζγ X i S1 θ, X i 2 ζ (1 s) inf ζ>0 sup θ Θ θ, 1 Xi S0 ϵiyi Xi +ζγ θT X i S1 X i X T i Let us set v = 1 Xi S0 ϵiyi Xi and ˆΣ = 1 n P X i S1 X i X T i , and then continue the inequalities in equation 30 as E [Φ (P + 0, S0, γ, c)] γES0,ϵ inf ζ>0 sup θ Θ θ, v +ζγ θT ˆΣθ ζ (1 s) . 
(31) In order to find a proper upper-bound for the Rademacher complexity, we attempt to solve the inner maximization problem sup θ Θ θ, v + ζγ θT ˆΣθ. (32) We know that the parameter θ which maximizes the first term θ, v is a vector with Euclidean norm of 1 in the direction of v. Also, the parameter which maximizes the second term θT ˆΣθ is a vector Published as a conference paper at ICLR 2024 with norm 1 in the direction of umax, which is the eigenvector of ˆΣ that corresponds to the maximum eigenvalue λmax. Therefore, the solution to the maximization problem in equation 32 should belong to the span of v and umax, i.e., θ =ψ1 v v 2 + ψ2 umax umax 2 =ψ1ˆv + ψ2 ˆumax, (33) where: ψ2 1 + ψ2 2 + 2ψ1ψ2η = 1, η = ˆv, ˆumax , 1 ˆv, ˆumax 2 τ = ˆv T ˆΣˆv Due to equation 32, equation 33 and equation 34, we have the following problem: sup ψ1 v 2 + ψ2η. (35) S.t. ψ2 1 + ψ2 2 + 2ψ1ψ2η = 1, ψ2 2λmax + ψ2 1τ + 2ψ1ψ2λmaxη s The constraints in the above problem can be rewritten as ψ2 1 + ψ2 2 + 2ψ1ψ2η = 1, (36) λmax ψ2 1 (λmax τ) s . In the above equations, if we set s = λmax, then the only ψ1 which satisfies the constraint is ψ1 = 0. Suppose that we use a set of n i.i.d. samples from P1X (which is different from the S1) to estimate the λmax, and name this estimate ˆλmax. If we set γ = ˆλmax (1 α) 3O form Theorem 6.5 in Wainwright (2019) we have the following relation with probability more than 1 δ: λmax (1 α) 5O s λmax (1 α) O By weakening the constraint in equation 35, the result of the optimization becomes larger, which means by solving the following optimization problem we have an upper-bound for equation 35: sup ψ1 v 2 + ψ2η. (39) S.t. ψ2 1 + ψ2 2 + 2ψ1ψ2η = 1 ψ2 1 (λmax τ) αλmax + 5O where for the Gaussian distribution and the way ˆv is built we have: λmax τ = µ1 2 2 µ0 2 2 + σ2 0 p µ0 2 2 + σ2 0d In this regard, the results of optimization problems equation 39 and equation 32 have the following upper-bound: v u u tαλmax + 5O r Xi S0 ϵi Xi 2 v u u tαλmax + 5O r Published as a conference paper at ICLR 2024 This way, we also have an upper-bound for Rademacher complexity in equation 27 as E [Φ (P + 0, S0, γ, c)] γ Eϵ,S0 1 Xi S0 ϵi Xi 2 v u u u tαλmax + 5O EX [ X 2 2] χ m v u u u tαλmax + 5O where for the Gaussian distribution we have: E [Φ (P + 0, S0, γ, c)] γ µ1 2 2 + σ2 1d χ m v u u u tαλmax + 5O v u u u u u u t 1 + σ2 1d µ1 2 2 v u u u tαλmax + 5O v u u u t d Using equation 24,equation 26 and equation 43, we have E ˆ P m 0 ϕc γ (X, y; θ) EP0 ϕc γ (X, y; θ) O v u u u t d for all θ Θ, where Θ is defined as in equation 25. Now, we need to demonstrate that the parameter θ that minimizes the expected robust loss also satisfies the constraint defined in equation 23. Given the data generating distribution in here, we know that θ also minimizes the expected non-robust loss. Therefore, we have θ = arg min θ Θ EP0 [l (X, y; θ)], (45) where for the loss function, cost functions and distributions considered in here we have: θ = arg min θ Θ Pu N( θ,µ0 ,σ2 θ 2 2) (u 0) = arg min θ Θ Q θ, µ0 2 = µ0 µ0 2 . 
(46) In the setting of the Theorem, we also have: Due to the Auxiliary Lemma 2, we know that if we set γ = 1 ˆλmax log n + O( d n) , s = 1 γ ˆλmax (1 α) 3O then with high probability θ Θ and the following holds: E(X,y) P0 h ϕc γ X, y; ˆθRSS i E(X,y) ˆ P m 0 h ϕc γ X, y; ˆθRSS i Published as a conference paper at ICLR 2024 v u u u t d E(X,y) ˆ P m 0 ϕc γ (X, y; θ ) v u u u t d E(X,y) P0 ϕc γ (X, y; θ ) v u u u t d Now, let s consider a scenario where we split our labeled samples into two parts, each containing m/2 samples. We then ignore the labels of the second part and add these samples to the unlabeled set. In this situation, we have m/2 labeled samples and n + m/2 unlabeled samples and the bound above inequalities changes as follows: E(X,y) P0 h ϕc γ X, y; ˆθRSS i E(X,y) P0 ϕc γ (X, y; θ ) v u u u t2d 2d 2n + m + (49) which completes the proof. Proof of Theorem 4.2. Using Theorem 4.1 and Auxiliary Lemma 1, with probability more than 1 δ we have: R ˆθRSS, P0 E(X,y) P0 h ϕc γ X, y; ˆθRSS i E(X,y) P0 ϕc γ (X, y; θ ) + O v u u u t2d 2d + 2 log 1 R (θ , P0) + e v u u u t2d 2d + 2 log 1 For the setting described in the statement of the theorem, we know that θ = µ0 and ˆλmax µ1 2 2 + σ2 1 + q d n. By setting v u u u t2d 2d + 2 log 1 and subsequent substitution in equation 50, one has R ˆθRSS, P0 R (θ , P0) µ0 2 2 4σ2 0 p µ1 2 2 + σ2 1 2d Published as a conference paper at ICLR 2024 Proof of Corollary 4.3. Due to the result of Theorem 4.2, to have an advantage over the ERM method we should have: 2dα If we have the following inequality it can be seen that the above inequality will be satisfied: It can be seen that if we have α O 1 d and n > Ω d3 , then with high probability we have : e P ˆθRSS O 1 Therefore if we aim to have an excess risk less than ϵ we should have at least max n 1 ϵ4 , log 1 samples which is independent from the dimension of data. Proof of theorem 4.4. From the manuscript we know that ˆθRSS is the result of the following optimization problem: ˆθRSS arg min θ Θ i=1 ϕγ (Xi, yi; θ) + λ j=1 ϕγ X j, hθ X j ; θ where we have Θ θ Rd, θ 2 = 1 . Like the proof of Theorem 4.1 if we consider the loss function mentioned in Theorem 4.1, along with the cost functions c(X, X ) = X X 2 and c (X, X ) = X X 2 2, the Slater conditions will be satisfied. Hence, we have strong duality, and therefore, there exists s 0 such that ˆθRSS is also the result of the following optimization problem: ˆθRSS = arg min θ: 1 n Pn j=1 ϕγ (X j,hθ(X j);θ) s i=1 ϕγ (Xi, yi; θ). (56) To prove the theorem, we need to establish an upper bound on the distance between the expected robust loss of ˆθRSS and the best possible expected robust loss. To achieve this, we first provide a concentration bound for the maximum distance between the empirical and expected robust loss across all parameters with 1 n j=1 ϕγ X j, hθ X j ; θ s, (57) and then, we demonstrate that the parameter θ , which achieves the best possible robust loss, satisfies equation 57. Suppose that the loss function is bounded with b, l (X, y; θ) b, θ, X, y, ϕc γ (X, y; θ) b, c, γ, θ, X, y, where the second inequality results from the first one. Therefore, 1 m Pm i=1 ϕγ (Xi, yi; θ) has the bounded difference property with parameter b/m, which results that Φ (P0, S0, γ, c) = sup θ Θ i=1 ϕγ (Xi, yi; θ) EP0 ϕc γ (X, y; θ) ! 
also has the bounded difference property with parameter b/m, where Θ is defined as j=1 ϕγ X j, hθ X j ; θ s, θ 2 = 1 Published as a conference paper at ICLR 2024 So, we have the following inequality with probability more than 1 δ: Φ (P0, S0, γ, c) E [Φ (P0, S0, γ, c)] + m , θ Θ. (60) In the above inequality we can give an upper-bound for the first term in the right hand side of the inequality as follows: E [Φ (P0, S0, γ, c)] =ES0 P m 0 i=1 ϕγ (Xi, yi; θ) EP0 ϕc γ (θ; X, y) !# i=1 ϕγ (Xi, yi; θ) 1 i=1 ϕγ (X i, y i; θ) (Xi,yi) S0 ϵiϕc γ (θ; Xi, yi) Consider the loss function mentioned in the Theorem 4.4, and the following cost functions c, c c(X, X ) = X X 2, c (X, X ) = X X 2 2. With these loss and cost functions we can rewrite ϕc γ and ϕc γ as follows: ϕc γ (X, y; θ) = min {1, max {0, 1 γy θ, X }} , (62) ϕc γ (X, hθ (X) ; θ) = max 0, 1 γ θ, X 2 . (63) As we can see ϕc γ (X, y; θ) is a γ-lipschitz function of y θ, X , therefore due to the Talagrand s lemma we can continue inequalities in equation 61 as follows: E [Φ (P + 0, S0, γ, c)] ES0,ϵ (Xi,yi) S0 ϵiϕc γ (Xi, yi; θ) (Xi,yi) S0 ϵiyi θ, Xi sup θ Θ θ, 1 Xi S0 ϵiyi Xi Now suppose that we weaken the constraint in the above inequalities as follows: j=1 ϕγ X j, hθ X j ; θ EP1X,θ h ϕc γ (X, y; θ) i 4γ Tr (Σ1) + µ1 2 2 3γ 4γ Tr (Σ1) + µ1 2 2 where we use Auxiliary Lemma 3 in the equation 65 and Auxiliary Lemma 4 in the equation 66. Suppose that we set s as follows: s = inf θ Θ 1 n Xi S 1 ϕc γ (Xi, hθ (Xi) ; θ) + 12γ Tr ˆΣ (S 1) d n + α, (67) where in the above equation S 1 is generated in a same way but independent of S1, and ˆΣ (S 1) is the sample covariance matrix of S1. Therefore we have the following upper-bound for s: s = inf θ Θ 1 n Xi S 1 ϕc γ (Xi, hθ (Xi) ; θ) + 12γ Tr ˆΣ (S 1) Published as a conference paper at ICLR 2024 inf θ Θ EP1X,θ h ϕc γ (X, hθ (Xi) ; θ) i + 24γ Tr (Σ1) + µ1 2 2 inf θ Θ 4e θ,µ1 2 3 γ + 24γ Tr (Σ1) + µ1 2 2 4e µT 1 Σ 1 1 µ1 2 3 γ + 24γ Tr (Σ1) + µ1 2 2 d n + α (69) where in the equation 68 we use the result of Auxiliary Lemma 4. In the equation 69 we insert θ = Σ 1 1 µ1 which minimizes e θ,µ1 2 2θT Σ1θ . We use equation 65 to equation 69, to define Θw, which is a super-set for Θ, as follows: θ Θ : 4e θ,µ1 2 3 γ 4e µT 1 Σ 1 1 µ1 2 3 γ + 24γ Tr (Σ1) + µ1 2 2 = θ Θ : θ, µ1 2 θT Σ1θ µ1Σ 1 1 µ1 µ1Σ 1 1 µ1 2 18γ 2 Tr (Σ1) + µ1 2 2 + 5γ q = θ Θ : µ1Σ 1 1 µ1 θ, µ1 2 µ1Σ 1 1 µ1 2 18γ 2 Tr (Σ1) + µ1 2 2 + 5γ q We can write θ = Σ 1 1 µ1+ϵv Σ 1 1 µ1+ϵv 2 , where v is a vector which its Euclidean norm is equal to 1 and belongs to the hyperplane orthogonal to Σ 1 1 µ1, and ϵ R. If we use this new notation we can rewrite Θw as follows: Θw = θ = Σ 1 1 µ1 + ϵv Σ 1 1 µ1 + ϵv 2 : ϵ2v T Σ1 µ1µT 1 µT 1 Σ 1 1 µ1 v t, v T Σ 1 1 µ1 = 0, v 2 = 1 , µ1Σ 1 1 µ1 2 18γ 2 Tr (Σ1) + µ1 2 2 + 5γ q where due to Auxiliary Lemma 7 we have: v T Σ1 µ1µT 1 µT 1 Σ 1 1 µ1 dκ1κ 1 , (73) κ1 =λmax (Σ1) λmin (Σ1) , κ 1 = λmax (Σ1) (Σ1) (Σ1) = min {λi (Σ1) λj (Σ1)} , i, j : λi (Σ1) = λj (Σ1) . 
Therefore we can weaken Θw as follows: Θw = θ = Σ 1 1 µ1 + ϵv Σ 1 1 µ1 + ϵv 2 : ϵ t , v T Σ 1 1 µ1 = 0, v 2 = 1 , (74) Published as a conference paper at ICLR 2024 v u u t dκ1κ 1 (Σ1) µ1Σ 1 1 µ1 2 18γ 2 (Tr (Σ1) + µ1 2 2) + 5γ q Due to the definition of Θw in 74 and the fact that it is a super-set for Θ we can continue equation 64 as follows: E [Φ (P + 0, S0, γ, c)] γES0,ϵ sup θ Θ θ, 1 Xi S0 ϵiyi Xi sup θ Θw θ, 1 Xi S0 ϵiyi Xi λmax (Σ1) E [ X 2 2] m µ1 2 ( µ1 2 2 + Tr (Σ1)) λmax (Σ1) m µ1 2 , (76) Using Inequalities equation 76 and equation 60 we have: i=1 ϕγ (Xi, yi; θ) EP0 ϕc γ (X, y) ; θ γt s ( µ1 2 2 + Tr (Σ1)) λmax (Σ1) Suppose that we define θ as follows: θ = arg min θ Θ EP0 [l (X, y; θ)]. (78) For the loss function, cost functions and distributions considered in the Theorem 4.4 we have: θ = arg min θ Θ Pu N( θ,µ0 ,θT Σ0θ) (u 0) = arg min θ Θ Q θ, µ0 2 = Σ 1 0 µ0 Σ 1 0 µ0 2 . (79) In the setting of the theorem, we also have: µ0 µ1 2 α, Σ1 Σ0 2 α Due to the Lemma 6 we know that if we set s and γ as s = inf θ Θ 1 n Xi S 1 ϕc γ (Xi, hθ (Xi) ; θ) + 12γ Tr ˆΣ (S 1) ˆλmax(ˆΣ(S 1)) (80) then with high probability θ Θ, and we have the following: EP0 h ϕc γ X, y; ˆθRSS i EP0 ϕc γ (X, y; θ ) + γt s ( µ1 2 2 + Tr (Σ1)) λmax (Σ1) Using the result of Auxiliary Lemma 5 and the above inequality we have: E(X,y) P0 h l X, y; ˆθRSS i E(X,y) P0 [l (X, y; θ )] + e θ ,µ0 2 Published as a conference paper at ICLR 2024 ( µ1 2 2 + Tr (Σ1)) λmax (Σ1) Therefore if we set γ as follows: 2t λmax ˆ Σ1 (S 1) , (83) then we have: E(X,y) P0 h l X, y; ˆθRSS i E(X,y) P0 [l (X, y; θ )] t λmax (Σ1) m µ1 2 + Now, if we split our labeled samples into two equal parts, each containing m/2 samples, and combine the unlabeled set with the second part, we have m/2 labeled samples and n+m/2 unlabeled samples. When we set t from equation 75 in this new scenario, the proof will be complete. C AUXILIARY LEMMAS Auxiliary Lemma 1. Consider the setting described in Theorem 4.1. let us define fΘ,P (θ, γ) as fΘ,P (θ, γ) = E(X,y) P ϕc γ (X, y; θ) E(X,y) P [ℓ(X, y; θ)] . (85) Then, for the loss and cost functions described in Theorem 4.1, P = Nyµ,σ (X, y), and the class of linear classifiers with |θ|2 = 1, the function fΘ,P (θ, γ) satisfies the following inequality: fΘ,P (θ, γ) e Proof. For the setting described in the lemma we have: u = y θ, X N θ, µ , σ2 . (87) Therefore we can calculate an upper-bound for fΘ,P (θ, γ) as follows: E(X,y) P ϕc γ (X, y; θ) E(X,y) P [l (X, y; θ)] = Z 1/γ 2πσ2 (1 γu) e (u θ,µ )2 2πσ2 (1 γu) e θ,µ 2 And the proof is complete. Auxiliary Lemma 2. Consider the setting in Theorem 4.1 if we define θ as follows: θ µ0 µ0 2 , (89) then with high probability we have: j=1 ϕγ X j, hθ X j ; θ s, As long as we set: γ = 1 ˆλmax log n + O( d n) , s = 1 γ ˆλmax (1 α) 3O Published as a conference paper at ICLR 2024 Proof. Due to the boundedness of ϕc γ with probability more than 1 δ we have the following inequality: X i S1 ϕc γ (X i, hθ (X i) ; θ ) EX P1X h ϕc γ (X, hθ (X) ; θ ) i + =EX P1X max 0, 1 γ θ , X 2 + where we have the following upper-bound for the first term in 92: EX P1X max 0, 1 γ θ , X 2 =EX P1X 1 γ θ , X 2 γ 1 γ θ , X 2 d P1X =EX P1X 1 γ θ , X 2 γ 1 γ θ , X 2 N µ1, σ2 1I γ 1 γ θ , X 2 N µ1, σ2 1I EX P1X 1 γ θ , X 2 σ1 1 γ ( θ , µ1 + σ1u)2 N 0, σ2 1 σ1 1 γ ( θ , µ1 + σ1u)2 N 0, σ2 1 . Now if we set γ 1 θ , µ1 + σ1 log n 2 (94) we can continue inequalities in 93 as follows: EX P1X max 0, 1 γ θ , X 2 EX P1X 1 γ θ , X 2 + 1 n =1 γ θ , µ1 2 + σ2 1 + 1 n. 
However, since we lack access to $\mu_1$, $\sigma_1$, or $\theta^*$, we cannot obtain $\gamma'$ directly from equation 94. Nevertheless, a suitable choice of $\gamma'$ that satisfies equation 94 and can be determined from samples is
$$\gamma' = \frac{1}{\hat\lambda_{\max}\big(2\log n + O(\sqrt{d/n})\big)}, \tag{96}$$
where $\hat\lambda_{\max}$ is the maximum eigenvalue of the sample covariance matrix $\hat\Sigma$ of a set of $n$ unlabeled samples. Now, from equations 92 and 95 we have
$$\tfrac{1}{n}\textstyle\sum_{j=1}^{n}\phi_{\gamma'}\big(X'_j, h_{\theta^*}(X'_j);\theta^*\big)
\;\le\; 1 - \gamma'\big(\langle\theta^*,\mu_1\rangle^2 + \sigma_1^2\big) + \tfrac{2}{n} + \sqrt{\tfrac{\log(1/\delta)}{2n}}. \tag{97}$$
To complete the proof we need to show that
$$1 - \gamma'\big(\langle\theta^*,\mu_1\rangle^2 + \sigma_1^2\big) + \tfrac{2}{n} + \sqrt{\tfrac{\log(1/\delta)}{2n}} \;\le\; s, \tag{98}$$
and, due to the setting of this lemma, it suffices to set
$$s = 1 - \gamma'\,\hat\lambda_{\max}\,(1-\alpha) + 3\,O\Big(\sqrt{\tfrac{\log(1/\delta)}{n}}\Big). \tag{99}$$

Auxiliary Lemma 3. Consider the setting described in Theorem 4.4 and define the function $\phi^{c'}_{\gamma'}$ as
$$\phi^{c'}_{\gamma'}(X, h_\theta(X);\theta) = \max\{0,\ 1-\gamma'\langle\theta,X\rangle^2\}. \tag{100}$$
Then, with probability greater than $1-\delta$, for all $\theta\in\Theta$ with $\|\theta\|_2=1$ we have
$$\Big|\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\phi^{c'}_{\gamma'}(X_i, h_\theta(X_i);\theta) - \mathbb{E}\big[\phi^{c'}_{\gamma'}(X, h_\theta(X);\theta)\big]\Big|
\;\le\; \frac{4\gamma'\big(\mathrm{Tr}(\Sigma_1)+\|\mu_1\|_2^2\big)}{\sqrt{n}} + \sqrt{\tfrac{\log(1/\delta)}{2n}}. \tag{101}$$
Proof. From the definition of $\phi^{c'}_{\gamma'}$ we know that this function is bounded between 0 and 1; therefore the following function has the bounded difference property with parameter $1/n$:
$$\Phi = \sup_{\theta\in\Theta}\ \tfrac{1}{n}\textstyle\sum_{i=1}^{n}\phi^{c'}_{\gamma'}(X_i, h_\theta(X_i);\theta) - \mathbb{E}\big[\phi^{c'}_{\gamma'}(X, h_\theta(X);\theta)\big]. \tag{102}$$
McDiarmid's inequality for this function then gives, with probability at least $1-\delta$,
$$\Phi \;\le\; \mathbb{E}[\Phi] + \sqrt{\tfrac{\log(1/\delta)}{2n}}. \tag{103}$$
Next we upper-bound the first term on the right-hand side of the above inequality by symmetrization and contraction:
$$\mathbb{E}[\Phi]
\;\le\; 2\,\mathbb{E}_{\{X_i\},\{\epsilon_i\}}\Big[\sup_{\theta\in\Theta}\ \tfrac{1}{n}\textstyle\sum_{i=1}^{n}\epsilon_i\,\phi^{c'}_{\gamma'}(X_i, h_\theta(X_i);\theta)\Big]
\;\le\; 2\gamma'\,\mathbb{E}_{\{X_i\},\{\epsilon_i\}}\Big[\sup_{\theta\in\Theta}\ \tfrac{1}{n}\textstyle\sum_{i=1}^{n}\epsilon_i\,\langle\theta,X_i\rangle^2\Big]
= 2\gamma'\,\mathbb{E}_{\{X_i\},\{\epsilon_i\}}\Big[\sup_{\theta\in\Theta}\ \theta^T\Big(\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\epsilon_i X_i X_i^T\Big)\theta\Big] \tag{104}$$
$$\;\le\; 2\gamma'\,\mathbb{E}_{\{X_i\},\{\epsilon_i\}}\Big[\Big\|\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\epsilon_i X_i X_i^T\Big\|_2\Big]
\;\le\; 2\gamma'\,\sqrt{\mathbb{E}_{\{X_i\},\{\epsilon_i\}}\Big[\Big\|\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\epsilon_i X_i X_i^T\Big\|_2^2\Big]}
\;\le\; \frac{2\gamma'}{\sqrt{n}}\,\sqrt{\mathbb{E}_X\big[\|X\|_2^4\big]}
\;\le\; \frac{4\gamma'\big(\mathrm{Tr}(\Sigma_1)+\|\mu_1\|_2^2\big)}{\sqrt{n}}, \tag{105}$$
where the $\epsilon_i$ in equation 104 are Rademacher random variables, and the last inequality in equation 105 follows from the fact that the $X_i$ are sampled from the distribution $P_1$ defined in Theorem 4.4 of the manuscript. This completes the proof.

Auxiliary Lemma 4. Consider the setting described in Theorem 4.4 and define the function $\phi^{c'}_{\gamma'}$ as
$$\phi^{c'}_{\gamma'}(X, h_\theta(X);\theta) = \max\{0,\ 1-\gamma'\langle\theta,X\rangle^2\}. \tag{106}$$
Then the expected value of $\phi^{c'}_{\gamma'}$, when $X$ is sampled from the distribution $P_1$ defined in Theorem 4.4 of the manuscript, satisfies
$$\frac{4\,e^{-\langle\theta,\mu_1\rangle^2/(2\theta^T\Sigma_1\theta)}}{3\sqrt{\gamma'}}
\;\le\; \mathbb{E}\big[\phi^{c'}_{\gamma'}(X, h_\theta(X);\theta)\big]
\;\le\; \frac{4}{3\sqrt{\gamma'}}\,Q\Big(\frac{\langle\theta,\mu_1\rangle}{\sqrt{\theta^T\Sigma_1\theta}}\Big). \tag{107}$$
Proof. For the expected value of $\phi^{c'}_{\gamma'}$ we can write
$$\mathbb{E}\big[\phi^{c'}_{\gamma'}(X, h_\theta(X);\theta)\big]
= \int \max\{0,\ 1-\gamma'\langle\theta,X\rangle^2\}\ dN(\mu_1,\Sigma_1)
= \mathbb{E}_{u\sim N(0,1)}\Big[\max\Big\{0,\ 1-\gamma'\big(\langle\theta,\mu_1\rangle + u\sqrt{\theta^T\Sigma_1\theta}\big)^2\Big\}\Big]. \tag{108}$$
When $\gamma' \gg 1/\langle\theta,\mu_1\rangle^2$, this yields the upper bound
$$\mathbb{E}\big[\phi^{c'}_{\gamma'}(X, h_\theta(X);\theta)\big] \;\le\; \frac{4}{3\sqrt{\gamma'}}\,Q\Big(\frac{\langle\theta,\mu_1\rangle}{\sqrt{\theta^T\Sigma_1\theta}}\Big). \tag{109}$$
Now, assuming without loss of generality that the largest eigenvalue of $\Sigma_1$ is at most 1, we obtain the following lower bound when $\gamma' \gg 1/\langle\theta,\mu_1\rangle^2$:
$$\mathbb{E}\big[\phi^{c'}_{\gamma'}(X, h_\theta(X);\theta)\big] \;\ge\; \frac{4\,e^{-\langle\theta,\mu_1\rangle^2/(2\theta^T\Sigma_1\theta)}}{3\sqrt{\gamma'}}. \tag{110}$$

Auxiliary Lemma 5. Consider any distribution $P$, loss function $\ell$, transportation cost $c$, and $\theta\in\Theta$, and define $f_{\Theta,P}(\theta,\gamma)$ as
$$f_{\Theta,P}(\theta,\gamma) = \mathbb{E}_{(X,y)\sim P}\big[\phi^c_\gamma(X,y;\theta)\big] - \mathbb{E}_{(X,y)\sim P}\big[\ell(X,y;\theta)\big]. \tag{111}$$
Then, for the loss and cost functions described in Section 4, $P = P_0$ with $P_0$ as described in Theorem 4.4 of the manuscript, and the class of linear classifiers with $\|\theta\|_2 = 1$, the function $f_{\Theta,P}(\theta,\gamma)$ satisfies
$$f_{\Theta,P}(\theta,\gamma) \;\le\; \frac{e^{-\langle\theta,\mu_0\rangle^2/(2\,\theta^T\Sigma_0\theta)}}{\gamma\sqrt{2\pi\,\theta^T\Sigma_0\theta}}. \tag{112}$$
Proof.
For the setting described in the lemma we have
$$u = y\langle\theta, X\rangle \sim N\big(\langle\theta,\mu_0\rangle,\ \theta^T\Sigma_0\theta\big). \tag{113}$$
Therefore we can upper-bound $f_{\Theta,P}(\theta,\gamma)$ as follows:
$$\mathbb{E}_{(X,y)\sim P_0}\big[\phi^c_\gamma(X,y;\theta)\big] - \mathbb{E}_{(X,y)\sim P_0}\big[l(X,y;\theta)\big]
= \int_{0}^{1/\gamma} \frac{1}{\sqrt{2\pi\,\theta^T\Sigma_0\theta}}\,(1-\gamma u)\, e^{-\frac{(u-\langle\theta,\mu_0\rangle)^2}{2\,\theta^T\Sigma_0\theta}}\, du
\;\le\; \frac{e^{-\langle\theta,\mu_0\rangle^2/(2\,\theta^T\Sigma_0\theta)}}{\gamma\sqrt{2\pi\,\theta^T\Sigma_0\theta}}, \tag{114}$$
and the proof is complete.

Auxiliary Lemma 6. Consider the setting in Theorem 4.4 and define $\theta^*$ as
$$\theta^* = \frac{\Sigma_0^{-1}\mu_0}{\|\Sigma_0^{-1}\mu_0\|_2}. \tag{115}$$
Then with high probability we have
$$\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\phi^{c'}_{\gamma'}\big(X_i, h_{\theta^*}(X_i);\theta^*\big) \le s,$$
as long as we set
$$s = \inf_{\theta\in\Theta}\ \tfrac{1}{n}\textstyle\sum_{X_i\in S'_1}\phi^{c'}_{\gamma'}(X_i, h_\theta(X_i);\theta) + 12\gamma'\,\mathrm{Tr}\big(\hat\Sigma(S'_1)\big)\sqrt{\tfrac{d}{n}} + \alpha \tag{116}$$
and choose $\gamma'$ as in equation 121 below.
Proof. To prove the lemma we first give a lower bound on $s$:
$$s \;\ge\; \inf_{\theta\in\Theta}\ \mathbb{E}_{P_{1X},\theta}\big[\phi^{c'}_{\gamma'}(X, h_\theta(X);\theta)\big] + 4\gamma'\big(\mathrm{Tr}(\Sigma_1)+\|\mu_1\|_2^2\big)\sqrt{\tfrac{\log(1/\delta)}{n}} + \alpha \tag{117}$$
$$\;\ge\; \frac{4\,e^{-\mu_1^T\Sigma_1^{-1}\mu_1/2}}{3\sqrt{\gamma'}} + 4\gamma'\big(\mathrm{Tr}(\Sigma_1)+\|\mu_1\|_2^2\big)\sqrt{\tfrac{\log(1/\delta)}{n}} + \alpha, \tag{118}$$
where inequality 117 is due to Auxiliary Lemma 3. To complete the proof we give an upper bound on $\tfrac{1}{n}\sum_{i=1}^{n}\phi^{c'}_{\gamma'}(X_i, h_{\theta^*}(X_i);\theta^*)$:
$$\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\phi^{c'}_{\gamma'}(X_i, h_{\theta^*}(X_i);\theta^*)
\;\le\; \mathbb{E}_{P_{1X},\theta^*}\big[\phi^{c'}_{\gamma'}(X, h_{\theta^*}(X);\theta^*)\big] + 4\gamma'\big(\mathrm{Tr}(\Sigma_1)+\|\mu_1\|_2^2\big)\sqrt{\tfrac{\log(1/\delta)}{n}}
\;\le\; \frac{4}{3\sqrt{\gamma'}}\,Q\Big(\frac{\langle\theta^*,\mu_1\rangle}{\sqrt{\theta^{*T}\Sigma_1\theta^*}}\Big) + 4\gamma'\big(\mathrm{Tr}(\Sigma_1)+\|\mu_1\|_2^2\big)\sqrt{\tfrac{\log(1/\delta)}{n}}. \tag{119}$$
From equations 119 and 118 we can complete the proof by choosing $\gamma'$ so that the bound in equation 119 does not exceed the bound in equation 118; this yields a condition on $\gamma'$ expressed in terms of $\theta^{*T}\Sigma_1\theta^*$, $e^{\mu_1^T\Sigma_1^{-1}\mu_1}$, and $\alpha$ (equation 120), where we suppose that $\alpha$ is small enough. If, in addition, we know that $\|\mu_1\|_2 \le \beta\,\lambda_{\max}(\Sigma_1)$, then a good choice of $\gamma'$ is
$$\gamma' \;\propto\; \frac{1}{\hat\lambda_{\max}\big(\hat\Sigma(S'_1)\big)}, \tag{121}$$
with the proportionality constant depending on $\beta$.

Auxiliary Lemma 7. Consider a positive definite $d\times d$ matrix $\Sigma$ and a vector $\mu\in\mathbb{R}^d$. For any vector $v\in\mathbb{R}^d$ with
$$\|v\|_2 = 1, \qquad v^T\Sigma^{-1}\mu = 0, \tag{122}$$
we have
$$v^T\Big(\Sigma - \frac{\mu\mu^T}{\mu^T\Sigma^{-1}\mu}\Big)v \;\ge\; \min\Big\{\frac{\lambda_{\min}(\Sigma)\,\Delta^2(\Sigma)}{d\,\lambda_{\max}^2(\Sigma)},\ \lambda_{\min}(\Sigma)\Big\}. \tag{123}$$
Proof. Write $v$ as
$$v = \psi_1\hat\mu + \psi_2\hat\mu^{\perp}, \qquad \psi_1^2 + \psi_2^2 = 1, \tag{124}$$
where $\hat\mu$ is the unit-norm vector in the direction of $\mu$ and $\hat\mu^{\perp}$ is a unit-norm vector perpendicular to $\mu$. With this decomposition we have
$$v^T\Big(\Sigma - \frac{\mu\mu^T}{\mu^T\Sigma^{-1}\mu}\Big)v
= \psi_1^2\,\hat\mu^T\Big(\Sigma - \frac{\mu\mu^T}{\mu^T\Sigma^{-1}\mu}\Big)\hat\mu + \psi_2^2\,(\hat\mu^{\perp})^T\Big(\Sigma - \frac{\mu\mu^T}{\mu^T\Sigma^{-1}\mu}\Big)\hat\mu^{\perp} + 2\psi_1\psi_2\,\hat\mu^T\Big(\Sigma - \frac{\mu\mu^T}{\mu^T\Sigma^{-1}\mu}\Big)\hat\mu^{\perp}
\;\ge\; \min\Big\{\hat\mu^T\Big(\Sigma - \frac{\mu\mu^T}{\mu^T\Sigma^{-1}\mu}\Big)\hat\mu,\ \lambda_{\min}(\Sigma)\Big\}. \tag{125}$$
To complete the proof we need a lower bound on the first term inside the minimum. To this end, write $\hat\mu$ as
$$\hat\mu = \sum_{i=1}^{d} a_i u_i, \tag{126}$$
where $u_i$ is the $i$-th eigenvector of $\Sigma$ and $\lambda_i$ its $i$-th eigenvalue, with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$. With this representation we can write
$$\hat\mu^T\Big(\Sigma - \frac{\mu\mu^T}{\mu^T\Sigma^{-1}\mu}\Big)\hat\mu
= \sum_{i=1}^{d} a_i^2\lambda_i - \frac{1}{\sum_{i=1}^{d} a_i^2/\lambda_i}
= \frac{\sum_{i,j=1}^{d} a_i^2 a_j^2\,\frac{(\lambda_i-\lambda_j)^2}{\lambda_i\lambda_j}}{2\sum_{i=1}^{d} a_i^2/\lambda_i}
\;\ge\; \frac{\lambda_{\min}(\Sigma)\,\Delta^2(\Sigma)}{d\,\lambda_{\max}^2(\Sigma)}, \tag{127}$$
where $\Delta$ in the last inequality is defined as
$$\Delta(\Sigma) = \min\big\{\lambda_i - \lambda_j\ :\ i,j \text{ with } \lambda_i \ne \lambda_j\big\}. \tag{128}$$
An important point in the above inequalities is that, if $\hat\mu$ lies exactly in the direction of an eigenvector of $\Sigma$, then $\hat\mu^T\big(\Sigma - \frac{\mu\mu^T}{\mu^T\Sigma^{-1}\mu}\big)\hat\mu$ equals zero, but in that case $\psi_1$ must be $0$: in this situation $\hat\mu$ and $\Sigma^{-1}\mu$ point in the same direction, and we know from equation 122 that $v$ is perpendicular to $\Sigma^{-1}\mu$, so the $\hat\mu$-component of $v$ vanishes. Therefore we have
$$v^T\Big(\Sigma - \frac{\mu\mu^T}{\mu^T\Sigma^{-1}\mu}\Big)v \;\ge\; \min\Big\{\frac{\lambda_{\min}(\Sigma)\,\Delta^2(\Sigma)}{d\,\lambda_{\max}^2(\Sigma)},\ \lambda_{\min}(\Sigma)\Big\}, \tag{129}$$
and the proof is complete.
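To make the quantity controlled by Auxiliary Lemmas 3 and 4 more tangible, the short Python snippet below (our own illustration, not part of the paper or its released code) estimates $\mathbb{E}\big[\max\{0,\ 1-\gamma\langle\theta,X\rangle^2\}\big]$ by Monte Carlo for a Gaussian $X$; the dimension, covariance, and values of $\gamma$ are arbitrary choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def unlabeled_surrogate(X, theta, gamma):
    """Monte Carlo estimate of E[max(0, 1 - gamma * <theta, X>^2)] over the rows of X."""
    proj = X @ theta
    return np.maximum(0.0, 1.0 - gamma * proj ** 2).mean()

d, n = 20, 200_000                       # illustrative dimension and sample size
mu1 = np.full(d, 1.0 / np.sqrt(d))       # unit-norm mean, so <theta, mu1> = 1 below
Sigma1 = 0.25 * np.eye(d)                # illustrative covariance with lambda_max = 0.25
theta = mu1 / np.linalg.norm(mu1)        # unit-norm linear classifier aligned with mu1

X = rng.multivariate_normal(mu1, Sigma1, size=n)
for gamma in (0.1, 1.0, 10.0):
    print(f"gamma = {gamma:5.1f}  ->  E[phi] ~ {unlabeled_surrogate(X, theta, gamma):.4f}")
```

As $\gamma$ grows, or as the projected mean $\langle\theta,\mu_1\rangle$ grows relative to $\theta^T\Sigma_1\theta$, an increasing fraction of samples satisfies $\gamma\langle\theta,X\rangle^2 \ge 1$ and the surrogate is clipped to zero; this decay is exactly what the closed-form bounds above quantify.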
D EXPERIMENTS DETAILS

As mentioned in the manuscript, two different experiments were designed and implemented to demonstrate the performance of the proposed method. The code is written in the Python programming language using the PyTorch 2.0 machine learning framework. The details of each experiment are explained below.

D.1 EXPERIMENT ON SIMULATED DATA

D.1.1 SIMULATED DATA

The data employed in this section comprise two classes, positive and negative, generated by sampling from 200-dimensional Gaussian distributions denoted $N(\mu, \sigma^2 I)$ and $N(-\mu, \sigma^2 I)$, respectively. The parameter $\mu$ is initialized randomly such that $\|\mu\|_2 = 1$. For the unlabeled out-of-distribution data, two different Gaussian distributions, $N(\mu', \sigma^2 I)$ and $N(-\mu', \sigma^2 I)$, are used for the positive and negative classes, where $\mu' = \mu + \alpha v$ with $\alpha = \|\mu\|_2/k$ for some $k \in \mathbb{N}$ and $v$ a random unit-norm vector. For both training and testing, the positive and negative classes are sampled uniformly in the labeled and unlabeled modes. All test data are drawn from the main distributions ($N(\mu, \sigma^2 I)$ and $N(-\mu, \sigma^2 I)$) and consist of 10,000 samples, half from the positive class and half from the negative class.

D.1.2 MODELING

A linear model is used to learn these data; the goal is to obtain optimal values for the weight vector $w$. When only labeled data are used, the loss is given by equation 130, whose objective is to maximize the margin of the linear classifier:
$$l_{\mathrm{labeled}}(x, y) = \frac{1}{n}\sum_{i=1}^{n}\min\big\{1,\ \max\{0,\ 1 - \gamma\, y_i\, w^T x_i\}\big\}. \tag{130}$$
In equation 130, the $x_i$ are the feature vectors and the $y_i$ their corresponding labels, $\gamma$ is a regularization parameter for robust learning, $w$ is the weight vector of the linear model, and $n$ is the number of labeled samples. The loss for unlabeled samples is defined in equation 131; since we lack labels for these samples, we use the model's predictions in their place. Here, too, the goal is to maximize the classifier's margin:
$$l_{\mathrm{unlabeled}}(x') = \frac{1}{m}\sum_{j=1}^{m}\min\big\{1,\ \max\{0,\ 1 - \gamma'\, (w^T x'_j)^2\}\big\}. \tag{131}$$
In the above equation, the $x'_j$ are the feature vectors of the unlabeled samples, $\gamma'$ is a regularization parameter for robust learning, and $m$ is the number of unlabeled samples. Finally, following the proposed method, the combination of the previous loss functions is used as the loss of the model when both labeled and unlabeled samples are available. We define this combined loss as
$$l_{\mathrm{total}}(x, y, x') = l_{\mathrm{labeled}}(x, y) + \lambda\, l_{\mathrm{unlabeled}}(x'). \tag{132}$$
In all of our experiments we use the Adam optimizer (Kingma & Ba, 2015) with a regularization term (weight decay). Since the hyper-parameters must be tuned separately for each scenario, a random search was performed to find the optimal $\gamma$, $\gamma'$, $\lambda$, and weight decay. We then select the combination of hyper-parameters that achieves the highest accuracy on a validation dataset and report the accuracy of the resulting model on the test samples. This process is repeated for each experiment. The hyper-parameter values of the experiments reported for simulated data in the manuscript are available in the attached file Hyperparameter-Simulated.csv.
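The following PyTorch sketch shows one way to implement equations 130 to 132 for the linear model; it is a minimal illustration under our own naming, batching, and hyper-parameter assumptions, not the authors' released code.

```python
import torch

def labeled_loss(w, x, y, gamma):
    """Equation 130: clipped hinge-style margin loss on labeled pairs (y in {-1, +1})."""
    margins = 1.0 - gamma * y * (x @ w)            # shape: (n,)
    return torch.clamp(margins, min=0.0, max=1.0).mean()

def unlabeled_loss(w, x_u, gamma_u):
    """Equation 131: self-supervised margin loss on unlabeled points."""
    margins = 1.0 - gamma_u * (x_u @ w) ** 2       # shape: (m,)
    return torch.clamp(margins, min=0.0, max=1.0).mean()

def total_loss(w, x, y, x_u, gamma, gamma_u, lam):
    """Equation 132: labeled term plus lambda times the unlabeled term."""
    return labeled_loss(w, x, y, gamma) + lam * unlabeled_loss(w, x_u, gamma_u)

# Illustrative usage with arbitrary sizes and values (d = 200 as in Section D.1.1).
d = 200
w = torch.zeros(d, requires_grad=True)
opt = torch.optim.Adam([w], lr=1e-3, weight_decay=1e-4)

x = torch.randn(64, d)
y = torch.randint(0, 2, (64,)).float() * 2 - 1     # labels in {-1, +1}
x_u = torch.randn(256, d)

loss = total_loss(w, x, y, x_u, gamma=1.0, gamma_u=1.0, lam=0.1)
opt.zero_grad()
loss.backward()
opt.step()
```

The random search described above would then be run over $\gamma$, $\gamma'$, $\lambda$, and the weight decay, keeping the configuration with the best validation accuracy.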
D.2 EXPERIMENT ON REAL DATA

D.2.1 DATASET

To evaluate the performance of the proposed method on real-world datasets, we selected a set of histopathology data. The experiment is divided into two parts: one with labeled and unlabeled data from the same distribution, and another with labeled and unlabeled data from different distributions. The NCT-CRC-HE-100K dataset is used as the main distribution, and the Patch Camelyon dataset as the out-of-distribution dataset. Here are some additional details about each of the datasets used in the experiment:

1. NCT-CRC-HE-100K: This dataset contains 100,000 non-overlapping patches of 224 x 224 pixels extracted from 86 whole-slide histopathology images of human colorectal cancer (CRC) and normal tissue. It contains 9 different classes: adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), and colorectal adenocarcinoma epithelium (TUM) (Kather et al., 2018).

2. Patch Camelyon: This dataset contains 327,680 patches of 96 x 96 pixels from whole-slide histopathology images of lymph node sections. It has 2 classes indicating whether a patch is tumor or normal (Veeling et al., 2018).

We conducted two distinct experiments to evaluate the impact of adding unlabeled data. The first experiment assessed the accuracy improvement obtained by incorporating unlabeled data sampled from the same distribution as the initial dataset. The second experiment evaluated the accuracy improvement when the unlabeled data were sourced from a distribution different from the original dataset. For the experiment with samples from the same distribution we used the NCT-CRC-HE-100K dataset, while for the experiment with out-of-domain samples we used the Patch Camelyon dataset.

In the first experiment, where both labeled and unlabeled data come from the same distribution (NCT-CRC-HE-100K), we evaluated performance over the 9 classes present in the dataset. In the second experiment, however, the Patch Camelyon dataset has only two labels, so we needed to align its labels with those of NCT-CRC-HE-100K. For this purpose, we merged the cancer-associated stroma (STR) and colorectal adenocarcinoma epithelium (TUM) classes and labeled them as "tumor", while the remaining categories were labeled as "normal".

D.2.2 MODELING

As mentioned in the manuscript, the processing pipeline feeds the images through a ResNet model pre-trained on the ImageNet dataset (Deng et al., 2009; He et al., 2016), and the output embedding is saved for each image. In these experiments we use the ResNet50 implementation of Chen & Krishnan (2022) to extract the embeddings. The stored embeddings are then fed to a deep neural network that performs the classification. This network consists of four hidden layers of width 2048 with the LeakyReLU activation function, followed by an output layer whose size equals the number of classes (the input has dimension 1024); a sketch of this head is given below. As shown in Algorithm 1, to obtain the adversarially perturbed version of an original input we repeatedly update the input in the direction of the gradient. Algorithm 2 shows the training loop that uses the labeled and unlabeled data. The same procedure covers both the labeled-only mode and the combined labeled-and-unlabeled mode: in the labeled-only mode, the steps related to unlabeled samples are skipped and the loss function only includes ϕ_i.
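Below is a minimal PyTorch sketch of the classification head described above (input dimension 1024, four hidden layers of width 2048 with LeakyReLU, and an output layer sized to the number of classes). It is an illustrative re-implementation; details such as the absence of normalization or dropout layers are assumptions on our part.

```python
import torch.nn as nn

class EmbeddingClassifier(nn.Module):
    """MLP head over pre-extracted ResNet50 embeddings."""

    def __init__(self, in_dim: int = 1024, hidden: int = 2048, num_classes: int = 9):
        super().__init__()
        layers, prev = [], in_dim
        for _ in range(4):                                  # four hidden layers of width 2048
            layers += [nn.Linear(prev, hidden), nn.LeakyReLU()]
            prev = hidden
        layers.append(nn.Linear(prev, num_classes))         # output layer sized to the number of classes
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)                                  # logits; cross-entropy is applied in the training loop
```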
To determine the optimal values of the hyperparameters, namely the learning rate, weight decay, λ, α, γ, and γ', we employ a random search over their respective parameter spaces. Through this approach we systematically explore various combinations of these hyperparameters and identify the most effective values for each experiment. Table 3 presents the search space for each hyperparameter used in the experiments. The hyperparameter values of the experiments reported for histopathology data in the manuscript are available in the attached file Hyperparameter-Histopathology.csv.

Algorithm 1 Finding the adversarially perturbed input for an original input via gradient ascent
function ADVERSARIAL-PERTURB(x, y, γ, α, Ns)
    x' ← x
    for step = 1, ..., Ns do                          ▷ gradient-ascent loop
        p ← model(x')
        ϕ ← CE(p, y) − γ‖x' − x‖₂²                    ▷ CE: cross-entropy loss
        α ← α / step
        x' ← x' + α ∇_{x'} ϕ
    end for
    return x'
end function

Algorithm 2 The training loop
Require: number of epochs N_ep, number of perturbation steps Ns, set of hyper-parameters {γ, γ', λ, learning rate of the perturbation α}
    L = {(x_0, y_0), (x_1, y_1), ..., (x_N, y_N)}      ▷ labeled data of size N
    U = {x'_0, x'_1, ..., x'_M}                        ▷ unlabeled data of size M
    k = 2
    L_b ← the set of batches of L with batch size N/k
    U_b ← the set of batches of U with batch size M/k
    for epoch = 1, ..., N_ep do
        for (x_i, y_i), x'_j in L_b, U_b do
            x_i^p ← ADVERSARIAL-PERTURB(x_i, y_i, γ, α, Ns)
            ŷ'_j ← model(x'_j)
            x'^p_j ← ADVERSARIAL-PERTURB(x'_j, ŷ'_j, γ', α, Ns)
            p_i ← model(x_i^p)
            p_j ← model(x'^p_j)
            ϕ_i ← CE(p_i, y_i) − γ‖x_i^p − x_i‖₂²
            ϕ_j ← CE(p_j, ŷ'_j) − γ'‖x'^p_j − x'_j‖₂²
            l ← ϕ_i + λϕ_j
            backpropagate the loss l and update the network weights by gradient descent
        end for
    end for

Table 3: The hyperparameter search spaces explored by random search (i an integer).
Hyperparameter | Search space          | Description
Learning rate  | {10^i : −5 ≤ i ≤ −1}  | learning rate for the Adam optimizer
Weight decay   | {10^i : −7 ≤ i ≤ −2}  | regularization term for the Adam optimizer
λ              | {10^i : −5 ≤ i ≤ 2}   | coefficient of the unlabeled term in the loss function
α              | {10^i : −5 ≤ i ≤ −1}  | learning rate for the adversarial perturbation function
γ              | {10^i : −7 ≤ i ≤ −2}  | coefficient of the norm term in the labeled loss function
γ'             | {10^i : −7 ≤ i ≤ −2}  | coefficient of the norm term in the unlabeled loss function
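For readers who prefer code to pseudocode, here is a compact PyTorch sketch of Algorithms 1 and 2. It is a simplified re-implementation under our own assumptions (hard argmax pseudo-labels for the unlabeled batch, summed squared-distance penalties, and generic data loaders), not the authors' released training code.

```python
import torch
import torch.nn.functional as F

def adversarial_perturb(model, x, y, gamma, alpha, n_steps):
    """Algorithm 1: gradient ascent on CE(model(x'), y) - gamma * ||x' - x||^2, starting from x' = x."""
    x_adv = x.clone().detach()
    for step in range(1, n_steps + 1):
        x_adv.requires_grad_(True)
        phi = F.cross_entropy(model(x_adv), y) - gamma * ((x_adv - x) ** 2).sum()
        grad, = torch.autograd.grad(phi, x_adv)
        x_adv = (x_adv + (alpha / step) * grad).detach()    # decaying step size alpha / step
    return x_adv

def train(model, opt, labeled_loader, unlabeled_loader,
          gamma, gamma_u, lam, alpha, n_steps, n_epochs):
    """Algorithm 2: joint training on labeled batches and pseudo-labeled unlabeled batches."""
    for _ in range(n_epochs):
        for (x, y), x_u in zip(labeled_loader, unlabeled_loader):
            x_p = adversarial_perturb(model, x, y, gamma, alpha, n_steps)
            y_u = model(x_u).argmax(dim=1)                  # model predictions as pseudo-labels
            x_up = adversarial_perturb(model, x_u, y_u, gamma_u, alpha, n_steps)
            phi_l = F.cross_entropy(model(x_p), y) - gamma * ((x_p - x) ** 2).sum()
            phi_u = F.cross_entropy(model(x_up), y_u) - gamma_u * ((x_up - x_u) ** 2).sum()
            loss = phi_l + lam * phi_u                      # combined loss, as in Algorithm 2
            opt.zero_grad()
            loss.backward()
            opt.step()
```

In the labeled-only mode, the unlabeled branch (the pseudo-labeling, the second perturbation, and the ϕ_j term) is simply skipped, matching the description in Section D.2.2.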