# Enhancing Instance-Level Image Classification with Set-Level Labels

Published as a conference paper at ICLR 2024

Renyu Zhang, Department of Computer Science, University of Chicago, zhangr@uchicago.edu
Aly A. Khan, Department of Pathology and Family Medicine, University of Chicago, aakhan@uchicago.edu
Yuxin Chen, Department of Computer Science, University of Chicago, chenyuxin@uchicago.edu
Robert L. Grossman, Department of Computer Science and Medicine, University of Chicago, rgrossman1@uchicago.edu

ABSTRACT

Instance-level image classification tasks, e.g., few-shot learning and transfer learning, have traditionally relied on single-instance labels to train models. However, set-level coarse-grained labels that capture relationships among instances can also provide rich information in real-world scenarios. In this paper, we present a novel approach to enhance instance-level image classification by leveraging set-level labels. We provide a theoretical analysis of the proposed method, including recognizing conditions for a fast excess risk rate, shedding light on the theoretical foundations of our approach. We conducted experiments on two distinct categories of datasets: natural image datasets and histopathology image datasets. Our experimental results demonstrate the effectiveness of our approach, showcasing improved classification performance compared to traditional single-instance label-based methods. Notably, our algorithm achieves a 13% improvement in classification accuracy over the strongest baseline on the histopathology image classification benchmarks. Importantly, our experimental findings align with the theoretical analysis, reinforcing the robustness and reliability of our proposed method. This work bridges the gap between instance-level and set-level image classification, offering a promising avenue for advancing the capabilities of image classification models with set-level coarse-grained labels.

1 INTRODUCTION

A large amount of labeled data is typically required in traditional machine learning approaches, e.g., few-shot learning (FSL) and transfer learning (TL), to learn a robust model. However, procuring sufficient labeled data for each task is often challenging or infeasible in real-world scenarios. In this paper, we consider a novel problem setting where, similar to FSL, we have a limited number of fine-grained labels in the target domain. In the source domain, though, we have a large amount of coarse-grained set-level labels, which are easier to obtain and relevant to the fine-grained labels. For example, in a digital library, coarse-grained set-level labels may indicate the general content of photo albums, such as "beach vacation", "nature landscapes", or "picnic". However, within each of these albums there are numerous individual images, each with its own unique details and characteristics that are not explicitly labeled. In the downstream task, for instance, we care about object classification such as "tree", "beach", or "mountain". Similarly, in the medical domain, it is often useful to predict fine-grained labels of tissues, while only set-level annotations of histopathology slides are available for training at scale. We seek to enhance the downstream classification tasks with these coarse-grained set-level labels.
An effective approach to addressing the overreliance on abundant training data is FSL, a paradigm that has gained significant attention in recent years (Vinyals et al., 2016; Wang & Hebert, 2016; Triantafillou et al., 2017; Finn et al., 2017; Snell et al., 2017; Sung et al., 2018; Wang et al., 2018; Oreshkin et al., 2018; Rusu et al., 2018; Ye et al., 2018; Lee et al., 2019b; Li et al., 2019).

Figure 1: (a) A collection of image sets sampled from CIFAR-100 is shown in the upper row; the coarse-grained label of a set is the most frequent superclass (e.g., household furniture, large natural outdoor scenes) of the images inside the set. WSI examples from TCGA and patches from the NCT dataset are shown in the lower row. (b) Hierarchy of coarse- and fine-grained labels for histopathology images, e.g., the coarse-grained label pancreas with fine-grained labels such as normal acini, islet, and stroma.

FSL pretrains a model that can quickly adapt to new tasks using only a few labeled examples. Recent studies (Chen et al., 2019; Tian et al., 2020; Shakeri et al., 2022; Yang et al., 2022) have shown that pretraining, coupled with fine-tuning on a new task, outperforms more sophisticated episodic training methods. This involves initially training a base model on a diverse set of tasks using abundant labeled data from a source domain, and subsequently fine-tuning the model using only a small number of labeled examples specific to the target task. Despite their promising performance, existing FSL models typically depend on finely labeled source data for predicting fine-grained labels.

As an illustrative example, we consider histopathology image classification, where acquiring a substantial number of fine-grained labels for individual patches (e.g., the tissue labels shown in the lower row of figure 1a) is challenging. Conversely, a wealth of coarse-grained labels (e.g., the site of origin of the tumors associated with whole slide images (WSIs) from TCGA, shown on the left-hand side of the upper row of figure 1a) is easily available. This motivates us to leverage these abundant coarse-grained labels and hierarchical relationships, such as between organs and tissues (as depicted in figure 1b), to enhance representation learning. Tissues consist of cellular assemblies with shared functionalities, while organs are comprised of multiple tissues. This hierarchical relationship serves as a conceptual foundation for our approach and provides significant contextual information for facilitating representation learning. By using coarse-grained information within this hierarchy, our goal is to efficiently learn fine-grained tissue representations within WSIs. Another example is shown in the upper row of figure 1a: we emulate a programmatic labeler that uses heuristics such as keywords, regular expressions, or knowledge bases to solicit sets of images. The coarse-grained labels, e.g., the most frequent superclass of the images in a set, can be used to facilitate representation learning for downstream tasks such as instance-level image classification.

Our contribution In this paper, we introduce Fine-grAined representation learning from Coarse-graIned LabEls (FACILE), a novel generic representation learning framework that uses easily accessible coarse-grained annotations to quickly adapt to new fine-grained tasks. Distinct from existing practices in FSL and TL, our approach utilizes coarse-grained labels in the source domain.
This sets our methodology apart from conventional FSL and TL techniques, which typically rely on meticulously labeled source data to train models. We provide an initial theoretical analysis to motivate the empirical success of FACILE and examine the convergence rate for the excess risk of downstream tasks under a novel Lipschitzness condition on the loss function with respect to the fine-grained labels. Our study reveals that the availability of coarse-grained labels can lead to a substantial acceleration in the excess risk rate for fine-grained label prediction tasks, achieving a fast rate of $O(1/n)$, where $n$ represents the number of fine-grained data points. This analysis highlights the significant potential of leveraging coarse-grained labels to enhance the learning process in fine-grained label prediction tasks.

In our experiments, we thoroughly investigate the effectiveness of FACILE through a series of extensive experiments on natural image datasets and histopathology image datasets. For natural image datasets, we sample input sets from the CIFAR-100 training data and use the number of unique superclasses and the most frequent superclass as coarse-grained labels. The generated datasets are used to evaluate different models. We also evaluate models by fine-tuning the fully connected layer appended to the ViT-B/16 (Dosovitskiy et al., 2020) of CLIP (Radford et al., 2021) on an anomaly detection dataset based on CUB200 (He & Peng, 2019). For histopathology applications, we leverage two large datasets with coarse-grained labels to pretrain our models. Subsequently, we evaluate the performance of these trained models on a diverse collection of histopathology datasets. Our algorithm achieves strong performance on 4 downstream datasets. Notably, when tested on LC25000 (Borkowski et al., 2021), our model achieves roughly 90% average ACC over 1,000 randomly sampled tasks, each with only 5 fine-grained labeled data points for each of the 5 classes, outperforming the strongest baseline by roughly 13% with a logistic regression fine-grained classifier. We further evaluate various models by fine-tuning the fully connected layer appended to the ViT-B/14 (Dosovitskiy et al., 2020) of DINOv2 (Oquab et al., 2023). These models can leverage the capability of foundation models and enhance the model performance on target tasks. Our experiments provide compelling evidence of the efficacy and generalizability of FACILE across various datasets, highlighting its potential as a robust representation learning framework.

2 FINE-GRAINED REPRESENTATION LEARNING FROM COARSE-GRAINED LABELS

Notation Our model pre-trains on a collection of samples, denoted by $\{(s_i, w_i)\}_{i=1}^m$. Each $s_i$ is an input set of instances $\{x_j\}_{j=1}^a$, where $a$ represents the (variable) input set size, and $\{w_i\}$ are the set-level coarse-grained labels. The space of all instances is $\mathcal{X}$ and the space of all instance labels, which we call fine-grained labels, is $\mathcal{Y}$. The space of pre-training data is $\mathcal{S} \times \mathcal{W}$, where $\mathcal{S} = \{\{x_1, \dots, x_a\} : x_j \in \mathcal{X} \text{ for } j \in [a]\}$ and $\mathcal{W}$ denotes the space of coarse-grained labels. We receive $(X, Y)$ from the product space $\mathcal{X} \times \mathcal{Y}$ and corresponding $(S, W)$ from the product space $\mathcal{S} \times \mathcal{W}$. The goal is to predict the fine-grained label $y \in \mathcal{Y}$ from the instance features $x \in \mathcal{X}$.

2.1 THE FACILE ALGORITHM

We study the model in an FSL setting where we have three datasets: (1) a pre-training coarse-grained dataset $D^{cg}_m = \{(s_i, w_i)\}_{i=1}^m$ sampled i.i.d.
from $P_{S,W}$; (2) a fine-grained support dataset $D^{fg}_n = \{(x_i, y_i)\}_{i=1}^n$ sampled i.i.d. from $P_{X,Y}$; and (3) a query set $D^{query}$. The support set $D^{fg}_n$ contains $c$ classes and $k$ samples $x$ in each class (i.e., $n = kc$). We assume a latent space $\mathcal{Z}$ for the embedding $Z$. We define instance feature maps $E = \{e : \mathcal{X} \to \mathcal{Z}\}$, set-input functions $G = \{g : \mathcal{M} \to \mathcal{W}\}$ where $\mathcal{M} = \{\{z_1, \dots, z_a\} : z_j \in \mathcal{Z} \text{ for } j \in [a]\}$, and fine-grained label predictors $F = \{f : \mathcal{Z} \to \mathcal{Y}\}$. The corresponding set-input feature map of an instance feature map $e$ is defined as $\phi_e : \mathcal{S} \to \mathcal{M}$. We assume the class of $f$ is parameterized and identify $f$ with its parameter vector for theoretical analysis. We then learn a feature map $e$ and a fine-grained label predictor $f$, and predict fine-grained labels with $f \circ e$. The schema of our model is illustrated in figure 2.

Figure 2: Schema of the FACILE model. The dotted lines represent the flow of fine-grained data, and the solid lines denote the flow of coarse-grained labels.

We assume two loss functions: $\ell^{fg} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ for fine-grained label prediction and $\ell^{cg} : \mathcal{W} \times \mathcal{W} \to \mathbb{R}$ for coarse-grained label prediction. $\ell^{fg}$ measures the loss of the fine-grained label predictor; we assume this loss is differentiable in its first argument. $\ell^{cg}$ measures the loss of pre-training with coarse-grained labels. For theoretical analysis, we are interested in two particular cases of $\ell^{cg}$: i) $\ell^{cg}(w, w') = \mathbf{1}\{w \neq w'\}$ where $\mathcal{W}$ is a categorical space; and ii) $\ell^{cg}(w, w') = \|w - w'\|$ (for some norm on $\mathcal{W}$) where $\mathcal{W}$ is a continuous space. We can also measure the loss of a feature map $e$ by $\ell^{cg}_e(s, w) = \ell^{cg}(g_e \circ \phi_e(s), w)$, where $g_e \in \arg\min_g \mathbb{E}_{P_{S,W}} \ell^{cg}(g \circ \phi_e(S), W)$. We assume there is an unknown good embedding $M = \phi_{e_0}(S) \in \mathcal{M}$, by which a set-input function $g_{e_0}$ can determine $W$, i.e., $g_{e_0}(M) = g_{e_0} \circ \phi_{e_0}(S) = W$. The strict assumption of equality can be relaxed by incorporating an additive error term for $g_{e_0} \circ \phi_{e_0}$ into our risk bounds. Our primary goal is to learn an instance-level (fine-grained) label predictor $\hat f \circ \hat e$ that achieves low risk $\mathbb{E}_{P_{X,Y}}[\ell^{fg}(\hat f \circ \hat e(X), Y)]$, and we can bound the excess risk

$$\mathbb{E}_{P_{X,Y}}\big[\ell^{fg}(\hat f \circ \hat e(X), Y) - \ell^{fg}(f^* \circ e^*(X), Y)\big], \tag{1}$$

where $e^* \in \arg\min_{e \in E} \mathbb{E}_{P_{S,W}} \ell^{cg}_e(S, W)$ and $f^* \in \arg\min_{f \in F} \mathbb{E}_{P_{X,Y}}[\ell^{fg}(f \circ e^*(X), Y)]$.

Algorithm 1 FACILE algorithm
1: Input: loss functions $\ell^{fg}$, $\ell^{cg}$, predictors $E$, $G$, $F$, datasets $D^{cg}_m$ and $D^{fg}_n$
2: obtain feature map $\hat e \leftarrow A(\ell^{cg}, D^{cg}_m, E)$
3: create dataset $D^{fg,aug}_n = \{(z_i, y_i) : z_i = \hat e(x_i), (x_i, y_i) \in D^{fg}_n\}_{i=1}^n$
4: obtain fine-grained label predictor $\hat f \circ \hat e$, where $\hat f \leftarrow A(\ell^{fg}, D^{fg,aug}_n, F)$
5: Return: $\hat f \circ \hat e$

Figure 3: An overview of the FACILE algorithm. (a) Pre-training step of FACILE with coarse-grained labels. The input is a set of images and the target is the set-level coarse-grained label (e.g., fruit and vegetables). $e$ is an instance feature map, $\phi_e$ is the corresponding set-input feature map, and $g$ is the set-input model. We can instantiate $A(\ell^{cg}, D^{cg}_m, E)$ with any supervised learning algorithm, e.g., fully supervised pretraining (FSP) with cross-entropy loss or the SupCon model. (b) Fine-grained learning of FACILE with fine-grained labels. The learned instance feature map $\hat e$ extracts instance-level features from patches of the support set and query set; $f$ is the fine-grained label predictor.
The pseudocode for FACILE is provided in Algorithm 1, and we further illustrate the FACILE algorithm in figure 3. Given an input set $s_i$ comprising instances $x_1, \dots, x_a$, the feature map $e$ is employed to extract instance-level features for all the instances within the input set. Subsequently, a set-input model $g$ is utilized to generate set-level features based on the instance-level features. Our FACILE framework is designed to be a generic algorithm that is compatible with any supervised learning method in its pre-training stage. We chose SupCon (Supervised Contrastive Learning) (Khosla et al., 2020) and FSP as they are representative of the two main approaches within supervised learning: contrastive and non-contrastive (traditional supervised) learning, respectively. During testing, we extract the pre-trained feature map $\hat e$ and fine-tune a classifier $f$ using the embeddings generated by $\hat e$ and the fine-grained labels of the support set. The performance of the classifier $\hat f$ is then reported on the query set. Note that Algorithm 1 is generic since the two learning steps can use any supervised learning algorithm.
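For illustration, the following is a minimal PyTorch-style sketch of the two-stage procedure in Algorithm 1. It is a simplified stand-in rather than the released implementation: the names (`Encoder`, `SetPool`, `pretrain_coarse`, `fit_fine_grained`) are hypothetical, mean pooling replaces the Set Transformer used for $g$, and cross-entropy corresponds to the FSP instantiation of the pre-training loss.

```python
import torch
import torch.nn as nn
import torchvision
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

class Encoder(nn.Module):
    """Instance feature map e: X -> Z (ResNet-18 backbone, illustrative choice)."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the 512-d pooled features
        self.backbone = backbone

    def forward(self, x):                    # x: (batch, 3, H, W)
        return self.backbone(x)

class SetPool(nn.Module):
    """Simplified set-input model g: mean-pool instance features, then classify.
    The paper uses a Set Transformer; mean pooling is a stand-in here."""
    def __init__(self, dim, num_coarse_classes):
        super().__init__()
        self.head = nn.Linear(dim, num_coarse_classes)

    def forward(self, z_set):                # z_set: (batch, set_size, dim)
        return self.head(z_set.mean(dim=1))

def pretrain_coarse(encoder, set_model, loader, epochs=1, lr=0.0125):
    """Step 2 of Algorithm 1: fit (e, g) on coarse-grained set-level labels."""
    params = list(encoder.parameters()) + list(set_model.parameters())
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=1e-4)
    ce = nn.CrossEntropyLoss()               # FSP-style pre-training loss
    for _ in range(epochs):
        for sets, w in loader:               # sets: (batch, set_size, 3, H, W), w: (batch,)
            b, a = sets.shape[:2]
            z = encoder(sets.flatten(0, 1)).view(b, a, -1)
            loss = ce(set_model(z), w)
            opt.zero_grad(); loss.backward(); opt.step()

def fit_fine_grained(encoder, support_x, support_y):
    """Steps 3-4 of Algorithm 1: embed the support set with e-hat and fit f."""
    encoder.eval()
    with torch.no_grad():
        z = encoder(support_x).cpu().numpy()
    z = normalize(z)                         # l2-normalized features, as in section 3.1
    clf = LogisticRegression(max_iter=1000).fit(z, support_y.numpy())
    return clf
```

At test time, the returned classifier is applied to the l2-normalized embeddings of the query images produced by the same frozen encoder.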
2.2 THEORETICAL ANALYSIS

We denote the underlying distribution of $D^{cg}_m$ by $P_{S,W}$ and the underlying distribution of $D^{fg}_n$ by $P_{X,Y}$. We assume the joint distribution of $Z$ and $Y$ is $P_{Z,Y}$. After we learn the feature map $\hat e$, we can define a new distribution $\hat P_{Z,Y} = P(Z, Y)\,\mathbf{1}\{Z = \hat e(X)\}$, where $\mathbf{1}$ is the indicator function. The dataset $D^{fg,aug}_n$ consists of i.i.d. samples from $\hat P_{Z,Y}$. In order to include the underlying distributions of $D^{cg}_m$ and $D^{fg}_n$ in the analysis, with a slight abuse of notation we use $A_m(\ell^{cg}, P_{S,W}, E)$ to denote $A(\ell^{cg}, D^{cg}_m, E)$ and $A_n(\ell^{fg}, \hat P_{Z,Y}, F)$ to denote $A(\ell^{fg}, D^{fg,aug}_n, F)$. The two learning algorithms are described as follows.

Definition 1. (Coarse-grained learning; pretraining) Let $\mathrm{Rate}_m(\ell^{cg}, P_{S,W}, E; \delta)$ (abbreviated to $\mathrm{Rate}_m(\ell^{cg}, P_{S,W}, E)$) be the rate of $A_m(\ell^{cg}, P_{S,W}, E)$, which takes $\ell^{cg}$, $E$, and $m$ i.i.d. observations from $P_{S,W}$ as input and returns a feature map $\hat e \in E$ such that $\mathbb{E}_{P_{S,W}} \ell^{cg}_{\hat e}(S, W) \le \mathrm{Rate}_m(\ell^{cg}, P_{S,W}, E; \delta)$ with probability at least $1 - \delta$.

Definition 2. (Fine-grained learning; downstream task learning) Let $\mathrm{Rate}_n(\ell^{fg}, P_{Z,Y}, F; \delta)$ (abbreviated to $\mathrm{Rate}_n(\ell^{fg}, P_{Z,Y}, F)$) be the excess risk rate of $A_n(\ell^{fg}, P_{Z,Y}, F)$, which takes $\ell^{fg}$, $F$, and $n$ i.i.d. observations from a distribution $P_{Z,Y}$ as input and returns a fine-grained predictor $\hat f \in F$ such that $\mathbb{E}_{P_{Z,Y}}\big[\ell^{fg}_{\hat f}(Z, Y) - \ell^{fg}_{f^*}(Z, Y)\big] \le \mathrm{Rate}_n(\ell^{fg}, P_{Z,Y}, F; \delta)$ with probability at least $1 - \delta$.

Next, we introduce our relative Lipschitz assumption and the central condition for quantifying task relatedness. The Lipschitz property requires that small perturbations of the feature map $e$ that do not harm the pre-training task do not affect the loss of the downstream task much either.

Definition 3. We say that $f$ is $L$-Lipschitz relative to $E$ if for all $s \in \mathcal{S}$, $x \in s$, $y \in \mathcal{Y}$, and $e, e' \in E$,
$$|\ell^{fg}(f \circ e(x), y) - \ell^{fg}(f \circ e'(x), y)| \le L\, \ell^{cg}\big(g_e \circ \phi_e(s),\, g_{e'} \circ \phi_{e'}(s)\big).$$
The function class $F$ is $L$-Lipschitz relative to $E$ if every $f \in F$ is $L$-Lipschitz relative to $E$.

Definition 3 generalizes the definition of L-Lipschitzness in Robinson et al. (2020) to bound the downstream loss deviation through the loss of the set label predictions. In the special case where $s = \{x\}$ and $g$ is a classifier for the pretraining labels, our Lipschitz condition reduces to the Lipschitzness definition of Robinson et al. (2020). The central condition is well known to yield fast rates for supervised learning (Van Erven et al., 2015); please refer to definition 6 for its definition. We show that our surrogate problem $(\ell^{fg}, \hat P_{Z,Y}, F)$ satisfies a central condition in proposition 7.

Theorem 4. Suppose that $(\ell^{fg}, P_{Z,Y}, F)$ satisfies the central condition, $F$ is $L$-Lipschitz relative to $E$, $\ell^{fg}$ is bounded by $B > 0$, $F$ is $L'$-Lipschitz in its $d$-dimensional parameters in the $\ell_2$ norm, $F$ is contained in the Euclidean ball of radius $R$, and $\mathcal{Y}$ is compact. We also assume that $\mathrm{Rate}_m(\ell^{cg}, P_{S,W}, E) = O(1/m^{\alpha})$. Then, when $A_n(\ell^{fg}, \hat P_{Z,Y}, F)$ is ERM, the excess risk $\mathbb{E}_{P_{X,Y}}\big[\ell^{fg}_{\hat f \circ \hat e}(X, Y) - \ell^{fg}_{f^* \circ e^*}(X, Y)\big]$ is bounded, with probability at least $1 - \delta$, by
$$O\!\left(\frac{d\,\alpha\beta \log(R L' n) + \log(1/\delta)}{n^{\alpha\beta}}\right)$$
if $m = \Omega(n^{\beta})$ and $\ell^{cg}(w, w') = \mathbf{1}\{w \neq w'\}$.

For a typical scenario where $\mathrm{Rate}_m(\ell^{cg}, P_{S,W}, E) = O(1/\sqrt{m})$, we can obtain fast rates with $m = \Omega(n^2)$. Similarly, in the scenario where $A_m(\ell^{cg}, P_{S,W}, E)$ achieves a fast rate, i.e., $\mathrm{Rate}_m(\ell^{cg}, P_{S,W}, E) = O(1/m)$, one obtains fast rates when $m = \Omega(n)$. More generally, if $\alpha\beta \ge 1$, we observe fast rates. We prove the theorem by first showing that the excess risk of $\hat f \circ \hat e$ can be bounded by $2L\,\mathrm{Rate}_m(\ell^{cg}, P_{S,W}, E) + \mathrm{Rate}_n(\ell^{fg}, \hat P_{Z,Y}, F)$ in proposition 5. Then, we show that $(\ell^{fg}, \hat P_{Z,Y}, F)$ also satisfies the weak central condition in proposition 7; thus, $\mathrm{Rate}_n(\ell^{fg}, \hat P_{Z,Y}, F)$ is also bounded by proposition 8. We refer interested readers to section H for full details of the proof.

In the next section, we first empirically study, in sections 3.3 and 3.4, the relationship between generalization error, coarse-grained dataset size, and fine-grained dataset size that our theoretical analysis predicts. We also demonstrate the exceptional efficacy of the proposed algorithm compared to baseline models on natural image datasets and histopathology image datasets.

3 EMPIRICAL STUDY

3.1 BASELINE MODELS AND ALGORITHM INSTANTIATION

We consider two sets of baseline models: self-supervised models (Bachman et al., 2019; He et al., 2020; Chen et al., 2020; Caron et al., 2020; Grill et al., 2020; Chen & He, 2021) and weakly supervised models (Donahue et al., 2014; Sun et al., 2017; Zeiler & Fergus, 2014; Robinson et al., 2020).

Self-supervised models Given pre-training data $(S, W)$, self-supervised learning models ignore the labels $W$ and learn $\hat e$ from $S$. We then test $\hat e$ on a new task, which consists of a support set and a query set: a new model that leverages the learned $\hat e$ is fine-tuned on the support set and tested on the query set. We evaluate two self-supervised learning models, one from each category: SimCLR (Chen et al., 2020) for contrastive learning and SimSiam (Chen & He, 2021) for non-contrastive learning. Details of these self-supervised learning algorithms can be found in section G.

Weakly supervised models We assign each instance from the pre-training dataset the label of the input set to which it belongs. We train the feature map $\hat e$ appended with a linear classifier on the pre-training dataset. We call this model FSP-Patch, where FSP stands for fully supervised pre-training; the model is trained with the assigned instance-level labels.

For a new task with a support set and a query set, we use $\hat e$ to extract features for both sets, train a classifier on the support set features, and test the classifier on the query set features. Following previous work in FSL (Tian et al., 2020; Chen et al., 2019), we use l2-normalized features for downstream tasks. Unless otherwise specified, we evaluate methods with 1,000 randomly sampled meta-tasks from each dataset. All meta-tasks use 15 samples per class as the query set. The average F1/ACC and 95% confidence interval (CI) are reported. We follow the test setting of Yang et al. (2022) and use Nearest Centroid (NC), Logistic Regression (LR), and Ridge Classifier (RC).
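As an illustration of this evaluation protocol, the snippet below samples one meta-task from precomputed embeddings and scores the three downstream classifiers. It is a hedged sketch: the sampling helper, array layout, and the way the 95% CI is computed are assumptions, not the paper's exact evaluation code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.metrics import f1_score
from sklearn.preprocessing import normalize

def sample_meta_task(features, labels, n_way=5, k_shot=5, n_query=15, rng=None):
    """Sample one n-way k-shot task with n_query query samples per class."""
    rng = rng or np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    sx, sy, qx, qy = [], [], [], []
    for c_new, c in enumerate(classes):
        idx = rng.permutation(np.where(labels == c)[0])[: k_shot + n_query]
        sx.append(features[idx[:k_shot]]); sy += [c_new] * k_shot
        qx.append(features[idx[k_shot:]]); qy += [c_new] * n_query
    return np.vstack(sx), np.array(sy), np.vstack(qx), np.array(qy)

def evaluate(features, labels, n_tasks=1000, **task_kw):
    """Average macro-F1 of NC / LR / RC over randomly sampled meta-tasks."""
    features = normalize(features)                 # l2-normalize the embeddings
    scores = {"NC": [], "LR": [], "RC": []}
    for _ in range(n_tasks):
        sx, sy, qx, qy = sample_meta_task(features, labels, **task_kw)
        for name, clf in [("NC", NearestCentroid()),
                          ("LR", LogisticRegression(max_iter=1000)),
                          ("RC", RidgeClassifier())]:
            pred = clf.fit(sx, sy).predict(qx)
            scores[name].append(f1_score(qy, pred, average="macro"))
    # mean and an approximate 95% CI over tasks (assumed normal approximation)
    return {k: (np.mean(v), 1.96 * np.std(v) / np.sqrt(len(v))) for k, v in scores.items()}
```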
3.2 PRETRAIN WITH UNIQUE CLASS NUMBER OF INPUT SETS

In order to show the advantages of using coarse-grained labels, we introduce a new task of pre-training with the number of unique classes in input sets. Inspired by Lee et al. (2019a), we use the CIFAR-100 (Krizhevsky et al., 2009) dataset, which contains 100 classes grouped into 20 superclasses. We generate input sets by sampling between 6 and 10 images from the CIFAR-100 training data. The target of an input set is the number of unique superclasses it contains. In our downstream tasks, we perform few-shot classification of fine-grained classes. Despite being distinct from the downstream fine-grained labels, the coarse-grained labels offer useful information for learning useful representations for downstream tasks.

| pretraining method | unique: NC | unique: LR | unique: RC | most freq.: NC | most freq.: LR | most freq.: RC |
|---|---|---|---|---|---|---|
| SimCLR | 76.07 ± 0.97 | 75.88 ± 1.01 | 75.50 ± 1.02 | 75.91 ± 1.00 | 75.82 ± 1.01 | 75.91 ± 1.02 |
| SimSiam | 78.15 ± 0.93 | 79.44 ± 0.92 | 79.03 ± 0.95 | 78.80 ± 0.93 | 79.44 ± 0.95 | 79.43 ± 0.93 |
| FSP-Patch | N/A | N/A | N/A | 73.21 ± 0.97 | 73.92 ± 0.98 | 73.40 ± 0.98 |
| FACILE-SupCon | N/A | N/A | N/A | 79.54 ± 0.92 | 79.54 ± 0.96 | 79.12 ± 0.95 |
| FACILE-FSP | 86.25 ± 0.79 | 85.42 ± 0.82 | 85.84 ± 0.81 | 82.04 ± 0.84 | 81.70 ± 0.91 | 81.75 ± 0.90 |

Table 1: Pretraining on input sets from CIFAR-100 with the unique superclass number or the most frequent superclass as set-level targets. Testing with 5-shot 5-way meta-test sets; average F1 and CI are reported. ResNet18 (He et al., 2016) is used as the feature map $\hat e$.

For FACILE-FSP, we pre-train the feature map $\hat e$ on these input sets and targets with the $\ell_1$ loss. The features of CIFAR-100 test images are extracted with $\hat e$. Training settings of SimSiam, SimCLR, and FACILE-FSP can be found in section A.1. We then test $\hat e$ in a few-shot manner: we randomly sample 5 classes, with 5 examples from each class, for each meta-test dataset. The fine-grained label predictor $\hat f$ is trained on the support examples and tested on the query examples. The performance of these models is reported in table 1. We can see that FACILE-FSP outperforms the self-supervised learning models by a large margin.

3.3 PRETRAIN WITH MOST FREQUENT CLASS LABEL

We sample input sets randomly from the training data of CIFAR-100. The targets are the most frequent superclass of the input sets. If there is a tie in an input set, we choose one of the most frequent superclasses at random as the target. Training settings are similar to section 3.2 and can be found in section A.1. The performance of all models is reported in table 1. We can see that FACILE-FSP obtains better results compared to other models.
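Both set-level labeling schemes used in sections 3.2 and 3.3 can be generated as follows. This is a hedged sketch of the data construction under the stated sampling rules; the function name and data layout are illustrative, not the released code.

```python
import random
from collections import Counter

def make_coarse_sets(images, fine_labels, fine_to_super, n_sets=50000,
                     min_size=6, max_size=10, seed=0):
    """Build (input set, coarse label) pairs from CIFAR-100 training data.

    fine_to_super maps each of the 100 fine classes to one of 20 superclasses.
    Returns set-level targets for both schemes: the number of unique superclasses
    (section 3.2) and the most frequent superclass with random tie-breaking (section 3.3).
    """
    rng = random.Random(seed)
    sets, unique_counts, most_frequent = [], [], []
    for _ in range(n_sets):
        size = rng.randint(min_size, max_size)               # 6 to 10 images per set
        idx = rng.sample(range(len(images)), size)
        supers = [fine_to_super[fine_labels[i]] for i in idx]
        sets.append([images[i] for i in idx])
        unique_counts.append(len(set(supers)))               # target for section 3.2
        counts = Counter(supers)
        top = max(counts.values())
        most_frequent.append(rng.choice(                     # target for section 3.3
            [s for s, c in counts.items() if c == top]))
    return sets, unique_counts, most_frequent
```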
Note that an excess risk bound of the form $b = C/n^{\gamma}$ implies a log-linear relationship $\log b = \log C - \gamma \log n$ between the error and the number of fine-grained labels, so we can visually interpret the learning rate $\gamma$. We study two cases: when the number of coarse-grained labels $m$ grows linearly with the number of fine-grained labels, and when $m$ grows quadratically with the number of fine-grained labels. In order to show the generalization error of FACILE-FSP w.r.t. the number of fine-grained labels on the CIFAR-100 test dataset, we randomly sample 5 classes (i.e., 5-way testing) for each task. We then sample $n/5$ fine-grained examples in each class for the support set and 15 examples per class for the query set. The curves are shown in figure 4. The figure shows the log-linear relationship of FACILE-FSP's generalization error on downstream tasks w.r.t. the number of fine-grained labels. This visualization effectively captures how the number of coarse-grained labels $m$ impacts the model's generalization capabilities.

3.4 EVALUATION ON HISTOPATHOLOGY IMAGES

Figure 4: Generalization error of FACILE-FSP on the CIFAR-100 test dataset as a function of the number of fine-grained labels, shown for two growth rates of the number of coarse-grained labels $m$.

Datasets and data extraction We pretrain our models using two independent sources of WSIs. First, we downloaded data from The Cancer Genome Atlas (TCGA) from the NCI Genomic Data Commons (GDC) (Heath et al., 2021). We extracted two collections of non-overlapping patches with different patch sizes, i.e., 224 × 224 and 1,000 × 1,000, at 20X magnification. Background patches with high or low intensity were removed. Because the number of patches generated with size 224 × 224 at 20X magnification is very large, at most 1,000 randomly selected patches are kept for each slide. The names of the tumors/organs from which slides were collected are used as coarse-grained labels. Second, we downloaded all clinical slides from the Genotype-Tissue Expression (GTEx) project (Lonsdale et al., 2013), which provides a resource for studying human gene expression and regulation in relation to genetic variation. We extracted non-overlapping patches with size 1,000 × 1,000 at 20X magnification, and patches with intensity larger than 0.1 and smaller than 0.85 were kept. For these slides, we used the organs from which the tissues were extracted as coarse-grained labels. Examples and class distributions for the two datasets can be found in section C.

We test models on 3 public datasets, LC (Borkowski et al., 2021), PAIP (Kim et al., 2021), and NCT (Kather et al., 2018), and 1 private dataset, PDAC. Details of these datasets are deferred to section C. Note that TCGA and GTEx have meticulously categorized an extensive array of cancer types and organs, covering a diverse range of tissues as outlined in LC, PAIP, and NCT. The strategic use of WSI-level labels is rooted in their potential to enrich tissue-level classification. While these labels may appear broad, they encapsulate a wealth of underlying heterogeneity inherent to different cancer regions and tissue types.

Pretrain on TCGA with patch size 224 × 224 We first train models on TCGA patches with size 224 × 224 at 20X magnification. After the models are trained, we test the feature maps of these models on LC, PAIP, and NCT. Full details about the training settings of FACILE-FSP, FACILE-SupCon, and the baseline models can be found in section A.3.
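A rough sketch of the patch-extraction step described above is given below, using OpenSlide to tile a slide and an intensity filter to drop background. The magnification handling, thresholds, and function name are simplifications of the actual pipeline and should be treated as assumptions.

```python
import random
import numpy as np
import openslide

def extract_patches(slide_path, coarse_label, patch_size=224,
                    max_patches=1000, lo=0.1, hi=0.85, seed=0):
    """Tile one WSI into non-overlapping patches and keep tissue patches.

    Returns (patch, coarse_label) pairs, where coarse_label is the organ/tumor
    name associated with the whole slide.
    """
    slide = openslide.OpenSlide(slide_path)
    width, height = slide.dimensions            # level-0 size; ~20X assumed here
    coords = [(x, y)
              for y in range(0, height - patch_size + 1, patch_size)
              for x in range(0, width - patch_size + 1, patch_size)]
    random.Random(seed).shuffle(coords)

    patches = []
    for x, y in coords:
        patch = slide.read_region((x, y), 0, (patch_size, patch_size)).convert("RGB")
        intensity = np.asarray(patch).mean() / 255.0
        if lo < intensity < hi:                 # drop background (too dark / too bright)
            patches.append((patch, coarse_label))
        if len(patches) >= max_patches:         # cap the number of patches per slide
            break
    slide.close()
    return patches
```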
Latent augmentation (LA) has been shown to improve FSL performance for histopathology images (Yang et al., 2022). We use faiss (Johnson et al., 2019) to perform k-means clustering. Following the setting of Yang et al. (2022), the number of prototypes in the base dictionary is 16, and each sample is augmented 100 times by LA. We refer readers to section E for details of LA.

Figure 5: Generalization error on the NCT dataset as a function of the number of fine-grained labels. FACILE-FSP is trained on the TCGA dataset with $m$ coarse-grained labels; error curves are shown for two growth rates of $m$.

Main results The test results are shown in table 2. In order to show the performance improvement over models pre-trained on natural image datasets, we report the performance of the FSP model pre-trained on ImageNet. We can see from table 2 that our model FACILE-FSP performs the best, by a large margin compared to other models. The contrastive learning model SimCLR performs worse than the non-contrastive learning model SimSiam; a possible reason could be the small batch size we used for SimCLR, whereas SimSiam maintains high performance even with small batch sizes. FSP-Patch achieves better performance than the self-supervised learning models and the ImageNet-pretrained model, which shows the usefulness of the coarse-grained labels for downstream tasks. More experimental results on test ACC for the LC, PAIP, and NCT datasets can be found in section B.2.1. Test results with larger shot numbers are in section B.2.2. We further pre-train models on GTEx and TCGA with patch size 1,000 × 1,000 and test the models on our private dataset PDAC; we refer readers to section B.5 for experimental results on the PDAC dataset. We show the generalization error of FACILE-FSP w.r.t. the number of fine-grained labels in figure 5. The figure reveals a pronounced log-linear relationship, and a larger growth rate of coarse-grained labels implies a faster excess risk rate.

1-shot 5-way test on LC dataset

| pretraining method | NC | LR | RC | LR+LA | RC+LA |
|---|---|---|---|---|---|
| ImageNet (FSP) | 63.26 ± 1.46 | 63.13 ± 1.41 | 63.24 ± 1.40 | 64.51 ± 1.41 | 64.95 ± 1.39 |
| SimSiam | 65.83 ± 1.32 | 66.52 ± 1.31 | 66.24 ± 1.32 | 67.21 ± 1.29 | 67.83 ± 1.33 |
| SimCLR | 64.57 ± 1.36 | 63.85 ± 1.37 | 64.16 ± 1.37 | 65.78 ± 1.33 | 66.81 ± 1.40 |
| FSP-Patch | 66.73 ± 1.29 | 66.25 ± 1.29 | 66.59 ± 1.28 | 68.01 ± 1.24 | 68.28 ± 1.26 |
| FACILE-SupCon | 74.91 ± 1.25 | 76.23 ± 1.16 | 75.01 ± 1.19 | 75.60 ± 1.19 | 75.64 ± 1.18 |
| FACILE-FSP | 77.39 ± 1.21 | 76.14 ± 1.25 | 75.18 ± 1.30 | 77.55 ± 1.17 | 73.72 ± 1.34 |

5-shot 5-way test on LC dataset

| pretraining method | NC | LR | RC | LR+LA | RC+LA |
|---|---|---|---|---|---|
| ImageNet (FSP) | 82.82 ± 0.75 | 80.13 ± 0.82 | 80.23 ± 0.83 | 84.70 ± 0.70 | 84.42 ± 0.74 |
| SimSiam | 85.12 ± 0.68 | 82.69 ± 0.75 | 82.80 ± 0.76 | 87.45 ± 0.63 | 87.50 ± 0.66 |
| SimCLR | 83.45 ± 0.77 | 81.93 ± 0.83 | 81.40 ± 0.89 | 85.69 ± 0.73 | 84.93 ± 0.79 |
| FSP-Patch | 84.96 ± 0.64 | 84.10 ± 0.69 | 84.45 ± 0.68 | 86.31 ± 0.65 | 86.29 ± 0.68 |
| FACILE-SupCon | 91.09 ± 0.47 | 90.34 ± 0.48 | 90.25 ± 0.48 | 91.32 ± 0.47 | 90.94 ± 0.50 |
| FACILE-FSP | 91.67 ± 0.45 | 90.64 ± 0.50 | 90.52 ± 0.52 | 92.07 ± 0.48 | 89.81 ± 0.61 |

1-shot 3-way test on PAIP dataset

| pretraining method | NC | LR | RC | LR+LA | RC+LA |
|---|---|---|---|---|---|
| ImageNet (FSP) | 45.96 ± 1.22 | 47.82 ± 1.29 | 47.43 ± 1.29 | 46.38 ± 1.24 | 44.90 ± 1.24 |
| SimSiam | 46.43 ± 1.21 | 47.93 ± 1.24 | 47.74 ± 1.23 | 47.20 ± 1.21 | 46.31 ± 1.22 |
| SimCLR | 44.51 ± 1.16 | 46.44 ± 1.14 | 45.59 ± 1.15 | 45.40 ± 1.14 | 45.04 ± 1.16 |
| FSP-Patch | 48.85 ± 1.21 | 49.44 ± 1.26 | 50.27 ± 1.22 | 49.76 ± 1.20 | 48.44 ± 1.21 |
| FACILE-SupCon | 46.60 ± 1.20 | 48.63 ± 1.22 | 48.46 ± 1.21 | 47.13 ± 1.20 | 47.87 ± 1.22 |
| FACILE-FSP | 45.40 ± 1.24 | 46.71 ± 1.20 | 46.60 ± 1.21 | 46.36 ± 1.22 | 45.49 ± 1.20 |

5-shot 3-way test on PAIP dataset

| pretraining method | NC | LR | RC | LR+LA | RC+LA |
|---|---|---|---|---|---|
| ImageNet (FSP) | 60.73 ± 1.02 | 61.21 ± 1.12 | 61.04 ± 1.11 | 61.66 ± 0.91 | 59.30 ± 0.93 |
| SimSiam | 62.88 ± 0.97 | 62.59 ± 1.08 | 63.48 ± 1.04 | 65.01 ± 0.88 | 63.22 ± 0.89 |
| SimCLR | 60.99 ± 0.93 | 61.38 ± 1.00 | 61.62 ± 1.02 | 62.39 ± 0.91 | 61.29 ± 0.90 |
| FSP-Patch | 64.45 ± 0.92 | 64.60 ± 0.98 | 64.49 ± 0.99 | 64.08 ± 0.89 | 62.79 ± 0.89 |
| FACILE-SupCon | 64.74 ± 0.91 | 65.63 ± 0.97 | 65.93 ± 0.97 | 66.68 ± 0.86 | 66.48 ± 0.82 |
| FACILE-FSP | 63.90 ± 0.94 | 64.59 ± 0.96 | 65.43 ± 0.96 | 66.77 ± 0.86 | 66.34 ± 0.85 |

1-shot 9-way test on NCT dataset

| pretraining method | NC | LR | RC | LR+LA | RC+LA |
|---|---|---|---|---|---|
| ImageNet (FSP) | 57.35 ± 1.68 | 56.39 ± 1.64 | 56.08 ± 1.64 | 57.78 ± 1.66 | 55.85 ± 1.64 |
| SimSiam | 63.60 ± 1.62 | 64.43 ± 1.54 | 64.79 ± 1.53 | 65.26 ± 1.56 | 65.39 ± 1.53 |
| SimCLR | 59.73 ± 1.57 | 59.61 ± 1.57 | 59.34 ± 1.56 | 60.57 ± 1.57 | 60.99 ± 1.53 |
| FSP-Patch | 60.08 ± 1.46 | 61.55 ± 1.50 | 62.32 ± 1.50 | 61.99 ± 1.42 | 60.62 ± 1.38 |
| FACILE-SupCon | 68.10 ± 1.29 | 69.63 ± 1.25 | 69.81 ± 1.24 | 69.54 ± 1.25 | 69.77 ± 1.22 |
| FACILE-FSP | 66.38 ± 1.38 | 67.03 ± 1.34 | 67.56 ± 1.32 | 68.35 ± 1.33 | 69.77 ± 1.30 |

5-shot 9-way test on NCT dataset

| pretraining method | NC | LR | RC | LR+LA | RC+LA |
|---|---|---|---|---|---|
| ImageNet (FSP) | 74.59 ± 1.11 | 73.21 ± 1.13 | 74.60 ± 1.07 | 76.68 ± 1.04 | 74.39 ± 1.09 |
| SimSiam | 79.97 ± 1.05 | 79.81 ± 1.03 | 80.84 ± 0.98 | 83.45 ± 0.92 | 83.61 ± 0.90 |
| SimCLR | 76.80 ± 1.09 | 76.95 ± 1.07 | 78.25 ± 1.03 | 80.54 ± 0.97 | 81.13 ± 0.95 |
| FSP-Patch | 79.50 ± 0.94 | 79.54 ± 0.95 | 81.00 ± 0.88 | 82.42 ± 0.81 | 81.33 ± 0.79 |
| FACILE-SupCon | 86.79 ± 0.61 | 87.89 ± 0.58 | 89.10 ± 0.52 | 89.53 ± 0.52 | 88.58 ± 0.54 |
| FACILE-FSP | 84.68 ± 0.74 | 85.47 ± 0.72 | 87.44 ± 0.64 | 88.00 ± 0.63 | 87.51 ± 0.66 |

Table 2: Test results on the LC, PAIP, and NCT datasets; average F1 and CI are reported.

Benefits of pre-training on large pathology datasets In order to show the benefits of pre-training on large pathology datasets, we pre-train different models on the NCT training dataset and test their performance on the LC dataset, following the setting of Yang et al. (2022). Instead of separating the mixture-domain and out-domain tasks, we directly report the average F1 and CI of LR models over all 5 classes of the LC dataset. Training details of the models can be found in section B.4. The test results on the LC dataset are shown in table 3. We can see from table 3 that the best model pre-trained on NCT, i.e., FSP with strong augmentation, performs worse than our model FACILE-FSP in table 2. Our method gets a roughly 13% improvement over Yang et al. (2022) on the LC dataset. The large margin between the two best models pre-trained on the two different datasets shows the importance of pre-training with a large number of coarse-grained labels. More results on LC and PAIP can be found in figure 8. Note that the SimSiam model, trained with a batch size of 55, maintains performance competitive with MoCo v3, which needs a large batch size.

| pretraining method | NC | LR | RC | LR+LA | RC+LA |
|---|---|---|---|---|---|
| SimSiam | 76.21 ± 0.87 | 74.05 ± 1.10 | 74.59 ± 1.10 | 77.87 ± 0.87 | 76.03 ± 0.94 |
| MoCo v3 (Yang et al., 2022) | 72.82 ± 1.25 | 70.29 ± 1.43 | 71.31 ± 1.40 | 78.72 ± 1.00 | 79.71 ± 0.95 |
| FSP (simple aug; Yang et al., 2022) | 56.44 ± 1.50 | 52.27 ± 1.81 | 55.62 ± 1.74 | 63.47 ± 1.37 | 63.47 ± 1.46 |
| FSP (strong aug) | 83.53 ± 0.79 | 80.81 ± 1.01 | 80.27 ± 1.08 | 85.57 ± 0.77 | 84.06 ± 0.89 |
| SupCon | 81.51 ± 0.85 | 78.77 ± 1.03 | 78.65 ± 1.08 | 83.51 ± 0.84 | 83.31 ± 0.91 |

Table 3: Pretraining on NCT and 5-shot 5-way testing on the LC dataset; average F1 and CI are reported.

More experiments and ablation study We refer interested readers to section F for ablation studies on set size, training procedures, and set-input models. We provide additional insights through fine-tuning experiments. In Appendix B.1, we detail the fine-tuning of the ViT-B/16 (Dosovitskiy et al., 2020) of CLIP (Radford et al., 2021) using CUB200-based anomaly detection data (He & Peng, 2019). Similarly, Appendix B.3 discusses fine-tuning the ViT-B/14 model of DINOv2 (Oquab et al., 2023) on the TCGA dataset. These experiments extend our analysis to specialized tasks, showcasing the adaptability of FACILE to foundation models.
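As a concrete picture of this foundation-model variant (Appendices B.1 and B.3), the sketch below freezes a DINOv2 ViT-B/14 backbone and trains only an appended fully connected head with set-level supervision. The loading call follows common DINOv2 usage; the head design and mean pooling are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def build_frozen_backbone_head(num_coarse_classes, feat_dim=768, hidden=512):
    """Frozen DINOv2 ViT-B/14 backbone with a trainable fc head (assumed setup).

    Only the appended layers receive gradients; set-level features are obtained
    by mean-pooling instance features, a simplified stand-in for the set-input model.
    """
    backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False

    head = nn.Sequential(nn.Linear(feat_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU())
    set_classifier = nn.Linear(hidden, num_coarse_classes)
    return backbone, head, set_classifier

def coarse_step(backbone, head, set_classifier, sets, w, optimizer):
    """One pre-training step on a batch of (input set, coarse label) pairs."""
    b, a = sets.shape[:2]                                   # sets: (batch, set_size, 3, H, W)
    with torch.no_grad():
        feats = backbone(sets.flatten(0, 1))                # (batch*set_size, feat_dim)
    z = head(feats).view(b, a, -1).mean(dim=1)              # pooled set-level representation
    loss = nn.functional.cross_entropy(set_classifier(z), w)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

In this setup, only the parameters of `head` and `set_classifier` would be passed to the optimizer; the backbone acts purely as a fixed feature extractor.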
4 RELATED WORK

Weakly supervised learning The concept of weakly supervised learning was introduced as a means to alleviate the annotation bottleneck in the training of machine learning models, and there is a large body of existing work on learning with only weak labels. Comprehensive surveys of weakly supervised learning are provided in Zhou (2018) and Zhang et al. (2022). We study a novel form of weak supervision provided by set-level coarse-grained labels. Among weakly supervised learning methods, Robinson et al. (2020) studied the generalization properties of weakly supervised learning and proposed a generic learning algorithm that learns from weak and strong labels and provably achieves a fast rate. The authors consider a different setting in which each instance has a weak label and a strong label, and the strong label predictor learns to predict the strong labels from the instances and their corresponding embeddings learned with weak labels. In contrast, we consider the setting where coarse-grained labels are available for sets rather than instances, and the downstream classifiers use only the learned embeddings to train and test on the downstream tasks.

Multiple-instance learning for WSIs WSI classification and regression are formulated based on multiple-instance learning (MIL) (Campanella et al., 2019; Xu et al., 2022; Ilse et al., 2018; Sharma et al., 2021; Hashimoto et al., 2020; Shao et al., 2021; Yao et al., 2020; Lu et al., 2021b;a; Chen et al., 2021b; Li et al., 2021; Chen et al., 2021a; Myronenko et al., 2021; Xiang & Zhang, 2022; Javed et al., 2022). These MIL models employ two procedures: i) feature extraction for patches cropped from a WSI and ii) aggregation of features from the same WSI. ImageNet-pretrained backbones, self-supervised backbones pretrained on histopathology images, or backbones fine-tuned during training are used to extract features from patches. Deep attention pooling, graph neural networks, or sequence models adapted for WSIs are used for feature aggregation. In this paper, we consider a different problem setting where we enhance patch-level classification with related set-level labels. In the application to histopathology images, line 2 of our generic algorithm can be instantiated with any MIL model whose backbone has trainable modules for extracting patch-level features, e.g., Ilse et al. (2018). A complete comparison of MIL models for WSIs is out of the scope of this paper.

Learning from coarsely-labeled data Another related line of research is Phoo & Hariharan (2021), where the authors assume a taxonomy of classes with two levels, i.e., a set of fine-grained classes that are more challenging to annotate and a set of coarse-grained classes that are easier to annotate. In our paper, we do not assume such a taxonomy: the coarse-grained and fine-grained labels need not be related via a class hierarchy. Also, the inputs that are fed to models to predict the coarse-grained or fine-grained labels are different, i.e., set inputs for coarse-grained labels and instances for fine-grained labels.

5 CONCLUSION AND DISCUSSION

Summary We introduce FACILE, a representation learning framework that leverages coarse-grained labels for model training and enhances model performance on downstream tasks. Our theoretical analysis highlights the significant potential of leveraging set-level labels to benefit the learning process of fine-grained label prediction tasks. To demonstrate the effectiveness of FACILE, we conduct pre-training on CIFAR-100-based datasets and two large public histopathology datasets using coarse-grained labels and evaluate our model on a diverse collection of datasets with fine-grained labels.
Limitation and future work In this paper, we consider a novel problem setting where we enhance downstream fine-grained label classification with easily available coarse-grained labels and propose a generic algorithm that contains two supervised learning steps. It is important to note that the separate utilization of loosely related coarse-grained labels and fine-grained labels can be expensive. Specifically, the pre-training step of our proposed algorithm could be expensive given large amounts of coarse-grained data and the set-input nature of the data. For this reason, we are investigating methods of selecting a subset of the coarse-grained dataset to accelerate pre-training.

6 ACKNOWLEDGEMENT

The research was funded in part by the Center for Translational Data Science at the University of Chicago, the Research Computing Center at the University of Chicago, and the National Science Foundation under Grant No. 2313130. We extend our deepest gratitude to Christopher R. Weber for his generous provision of the PDAC dataset, which was invaluable to the completion of this research.

REFERENCES

The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61-70, 2012.

Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. Advances in neural information processing systems, 32, 2019.

Andrew A Borkowski, Marilyn M Bui, L Brannon Thomas, Catherine P Wilson, Lauren A DeLand, and Stephen M Mastorides. LC25000 lung and colon histopathological image dataset, 2021.

Gabriele Campanella, Matthew G Hanna, Luke Geneslaw, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J Busam, Edi Brogi, Victor E Reuter, David S Klimstra, and Thomas J Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine, 25(8):1301-1309, 2019.

Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132-149, 2018.

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912-9924, 2020.

Richard J Chen, Ming Y Lu, Muhammad Shaban, Chengkuan Chen, Tiffany Y Chen, Drew FK Williamson, and Faisal Mahmood. Whole slide images are 2D point clouds: Context-aware survival prediction using patch-based graph convolutional networks. In Medical Image Computing and Computer Assisted Intervention (MICCAI 2021), Proceedings, Part VIII, pp. 339-349. Springer, 2021a.

Richard J Chen, Ming Y Lu, Wei-Hung Weng, Tiffany Y Chen, Drew FK Williamson, Trevor Manz, Maha Shady, and Faisal Mahmood. Multimodal co-attention transformer for survival prediction in gigapixel whole slide images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015-4025, 2021b.
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597 1607. PMLR, 2020. Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. ar Xiv preprint ar Xiv:1904.04232, 2019. Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15750 15758, 2021. Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9640 9649, 2021c. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Image Net: A Large-Scale Hierarchical Image Database. In CVPR09, 2009. Published as a conference paper at ICLR 2024 Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pp. 647 655. PMLR, 2014. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2020. Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. ar Xiv preprint ar Xiv:2002.09434, 2020. Harrison Edwards and Amos Storkey. Towards a neural statistician. ar Xiv preprint ar Xiv:1606.02185, 2016. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp. 1126 1135. PMLR, 2017. Jean-Bastien Grill, Florian Strub, Florent Altch e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271 21284, 2020. Noriaki Hashimoto, Daisuke Fukushima, Ryoichi Koga, Yusuke Takagi, Kaho Ko, Kei Kohno, Masato Nakaguro, Shigeo Nakamura, Hidekata Hontani, and Ichiro Takeuchi. Multi-scale domain-adversarial multiple-instance cnn for cancer subtype classification with unannotated histopathological images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3852 3861, 2020. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729 9738, 2020. Xiangteng He and Yuxin Peng. Fine-grained visual-textual representation learning. IEEE Transactions on Circuits and Systems for Video Technology, 30(2):520 531, 2019. Allison P Heath, Vincent Ferretti, Stuti Agrawal, Maksim An, James C Angelakos, Renuka Arya, Rosita Bajari, Bilal Baqar, Justin HB Barnowski, Jeffrey Burt, et al. The nci genomic data commons. Nature genetics, 53(3):257 262, 2021. Olivier Henaff. 
Data-efficient image recognition with contrastive predictive coding. In International conference on machine learning, pp. 4182 4192. PMLR, 2020. Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. In International conference on machine learning, pp. 2127 2136. PMLR, 2018. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448 456. pmlr, 2015. Syed Ashar Javed, Dinkar Juyal, Harshith Padigela, Amaro Taylor-Weiner, Limin Yu, and Aaditya Prakash. Additive mil: intrinsically interpretable multiple instance learning for pathology. Advances in Neural Information Processing Systems, 35:20689 20702, 2022. Jeff Johnson, Matthijs Douze, and Herv e J egou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535 547, 2019. Jakob Nikolas Kather, Niels Halama, and Alexander Marx. 100,000 histological images of human colorectal cancer and healthy tissue. https://doi.org/10.5281/zenodo.1214456, April 2018. doi: 10.5281/zenodo.1214456. URL https://doi.org/10.5281/zenodo.1214456. Published as a conference paper at ICLR 2024 Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33:18661 18673, 2020. Yoo Jung Kim, Hyungjoon Jang, Kyoungbun Lee, Seongkeun Park, Sung-Gyu Min, Choyeon Hong, Jeong Hwan Park, Kanggeun Lee, Jisoo Kim, Wonjae Hong, et al. Paip 2019: Liver cancer segmentation challenge. Medical Image Analysis, 67:101854, 2021. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. University of Toronto, 2009. Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744 3753. PMLR, 2019a. Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10657 10665, 2019b. Alexander C Li, Alexei A Efros, and Deepak Pathak. Understanding collapse in non-contrastive siamese representation learning. In Computer Vision ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23 27, 2022, Proceedings, Part XXXI, pp. 490 505. Springer, 2022. Bin Li, Yin Li, and Kevin W Eliceiri. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14318 14328, 2021. Hongyang Li, David Eigen, Samuel Dodge, Matthew Zeiler, and Xiaogang Wang. Finding taskrelevant features for few-shot learning by category traversal. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1 10, 2019. Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. ar Xiv preprint ar Xiv:1703.03130, 2017. John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad, Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et al. The genotype-tissue expression (gtex) project. Nature genetics, 45(6):580 585, 2013. 
Ming Y Lu, Tiffany Y Chen, Drew FK Williamson, Melissa Zhao, Maha Shady, Jana Lipkova, and Faisal Mahmood. Ai-based pathology predicts origins for cancers of unknown primary. Nature, 594(7861):106 110, 2021a. Ming Y Lu, Drew FK Williamson, Tiffany Y Chen, Richard J Chen, Matteo Barbieri, and Faisal Mahmood. Data-efficient and weakly supervised computational pathology on whole-slide images. Nature biomedical engineering, 5(6):555 570, 2021b. Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6707 6717, 2020. Andriy Myronenko, Ziyue Xu, Dong Yang, Holger R Roth, and Daguang Xu. Accounting for dependencies in deep learning based multiple instance learning for whole slide imaging. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 329 338. Springer, 2021. Maxime Oquab, Timoth ee Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2023. Published as a conference paper at ICLR 2024 Boris Oreshkin, Pau Rodr ıguez L opez, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. Advances in neural information processing systems, 31, 2018. Anabik Pal, Zhiyun Xue, Kanan Desai, Adekunbiola Aina F Banjo, Clement Akinfolarin Adepiti, L Rodney Long, Mark Schiffman, and Sameer Antani. Deep multiple-instance learning for abnormal cell detection in cervical histopathology images. Computers in Biology and Medicine, 138: 104890, 2021. Cheng Perng Phoo and Bharath Hariharan. Coarsely-labeled data for better few-shot transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9052 9061, 2021. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748 8763. PMLR, 2021. Colin Raffel and Daniel PW Ellis. Feed-forward networks with attention can solve some long-term memory problems. ar Xiv preprint ar Xiv:1512.08756, 2015. Pierre H Richemond, Jean-Bastien Grill, Florent Altch e, Corentin Tallec, Florian Strub, Andrew Brock, Samuel Smith, Soham De, Razvan Pascanu, Bilal Piot, et al. Byol works even without batch statistics. ar Xiv preprint ar Xiv:2010.10241, 2020. Joshua Robinson, Stefanie Jegelka, and Suvrit Sra. Strength from weakness: Fast learning using weak supervision. In International Conference on Machine Learning, pp. 8127 8136. PMLR, 2020. Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. ar Xiv preprint ar Xiv:1807.05960, 2018. Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. Advances in neural information processing systems, 30, 2017. 
Fereshteh Shakeri, Malik Boudiaf, Sina Mohammadi, Ivaxi Sheth, Mohammad Havaei, Ismail Ben Ayed, and Samira Ebrahimi Kahou. FHIST: A benchmark for few-shot classification of histological images. arXiv preprint arXiv:2206.00092, 2022.

Zhuchen Shao, Hao Bian, Yang Chen, Yifeng Wang, Jian Zhang, Xiangyang Ji, et al. TransMIL: Transformer based correlated multiple instance learning for whole slide image classification. Advances in neural information processing systems, 34:2136-2147, 2021.

Yash Sharma, Aman Shrivastava, Lubaina Ehsan, Christopher A Moskaluk, Sana Syed, and Donald Brown. Cluster-to-conquer: A framework for end-to-end multi-instance learning for whole slide image classification. In Medical Imaging with Deep Learning, pp. 682-698. PMLR, 2021.

Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. DeepPano: Deep panoramic representation for 3-D shape recognition. IEEE Signal Processing Letters, 22(12):2339-2343, 2015.

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.

The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061-1068, 2008.

Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 945-953, 2015.

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843-852, 2017.

Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199-1208, 2018.

Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, Proceedings, Part XIV, pp. 266-282. Springer, 2020.

Eleni Triantafillou, Richard Zemel, and Raquel Urtasun. Few-shot learning through an information retrieval lens. Advances in neural information processing systems, 30, 2017.

Tim Van Erven, Peter Grunwald, Nishant A Mehta, Mark Reid, Robert Williamson, et al. Fast rates in statistical and online learning. MIT Press, 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning.
Advances in neural information processing systems, 29, 2016. Yu-Xiong Wang and Martial Hebert. Learning to learn: Model regression networks for easy small sample learning. In Computer Vision ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI 14, pp. 616 634. Springer, 2016. Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7278 7286, 2018. Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via nonparametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3733 3742, 2018. Jinxi Xiang and Jun Zhang. Exploring low-rank property in multiple instance learning for whole slide image classification. In The Eleventh International Conference on Learning Representations, 2022. Zhixin Xu, Seohoon Lim, Hong-Kyu Shin, Kwang-Hyun Uhm, Yucheng Lu, Seung-Won Jung, and Sung-Jea Ko. Risk-aware survival time prediction from whole slide pathological images. Scientific Reports, 12(1):21948, 2022. Jiawei Yang, Hanbo Chen, Jiangpeng Yan, Xiaoyu Chen, and Jianhua Yao. Towards better understanding and better generalization of few-shot classification in histology images with contrastive learning. ar Xiv preprint ar Xiv:2202.09059, 2022. Shuo Yang, Lu Liu, and Min Xu. Free lunch for few-shot learning: Distribution calibration. ar Xiv preprint ar Xiv:2101.06395, 2021. Jiawen Yao, Xinliang Zhu, Jitendra Jonnagaddala, Nicholas Hawkins, and Junzhou Huang. Whole slide images based cancer survival prediction using attention guided deep multiple instance learning networks. Medical Image Analysis, 65:101789, 2020. Published as a conference paper at ICLR 2024 Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Learning embedding adaptation for few-shot learning. ar Xiv preprint ar Xiv:1812.03664, 7, 2018. Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. ar Xiv preprint ar Xiv:1708.03888, 2017. Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in neural information processing systems, 30, 2017. Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pp. 818 833. Springer, 2014. Jieyu Zhang, Cheng-Yu Hsieh, Yue Yu, Chao Zhang, and Alexander Ratner. A survey on programmatic weak supervision. ar Xiv preprint ar Xiv:2202.05433, 2022. Renyu Zhang, Aly A Khan, and Robert L Grossman. Evaluation of hyperbolic attention in histopathology images. In 2020 IEEE 20th International Conference on Bioinformatics and Bioengineering (BIBE), pp. 773 776. IEEE, 2020a. Renyu Zhang, Christopher Weber, Robert Grossman, and Aly A Khan. Evaluating and interpreting caption prediction for histopathology images. In Machine Learning for Healthcare Conference, pp. 418 435. PMLR, 2020b. Zhi-Hua Zhou. A brief introduction to weakly supervised learning. National science review, 5(1): 44 53, 2018. Published as a conference paper at ICLR 2024 A TRAINING DETAILS A.1 PRETRAIN WITH UNIQUE CLASS NUMBER AND MOST FREQUENT CLASS OF INPUT SETS In our study, an epoch refers to going through all the input sets in the dataset once. Sim Siam is trained for 2,000 epochs using a batch size of 512. 
SGD is employed with a learning rate of 0.1, weight decay of 1e-4, and momentum of 0.9, together with a cosine scheduler. Similarly, SimCLR is trained for 2,000 epochs with a batch size of 256 and a temperature of 0.07; SGD is used with a learning rate of 0.05, weight decay of 1e-4, and momentum of 0.9, also with a cosine scheduler. We train FSP-Patch for 800 epochs with a batch size of 256, using SGD with a weight decay of 1e-4, momentum of 0.9, and a cosine scheduler. FACILE-FSP is trained for 800 epochs with a batch size of 64; SGD is used with a learning rate of 0.0125, weight decay of 1e-4, and momentum of 0.9. The $\ell_1$ loss is optimized when pretraining with the unique class numbers of input sets. For FACILE-SupCon, we train the model for 2,000 epochs with a batch size of 256 and an additional temperature parameter of 0.07; SGD is used with a learning rate of 0.05, weight decay of 1e-4, and momentum of 0.9.

A.2 FINE-TUNE VIT-B/16 OF CLIP WITH CUB200

SimSiam is trained for 400 epochs using a batch size of 64. SGD is used with an initial learning rate of 0.0125, weight decay of 1e-4, and momentum of 0.9, and a cosine scheduler is used for the optimizer. SimCLR is also trained for 400 epochs with a batch size of 64; the additional temperature parameter is set to 0.07, and SGD is used with a learning rate of 0.0125, weight decay of 1e-4, and momentum of 0.9, also with a cosine scheduler. FACILE-FSP is trained for 200 epochs with a batch size of 64, using SGD with a learning rate of 0.0125, weight decay of 1e-4, and momentum of 0.9. For FACILE-SupCon, we train the model for 800 epochs with a batch size of 64; the additional temperature parameter is set to 0.07, and SGD is used with an initial learning rate of 0.0125, weight decay of 1e-4, and momentum of 0.9. Both models are trained with a cosine annealing scheduler.

A.3 PRETRAIN RESNET18 WITH TCGA AND GTEX DATASETS

In the SimSiam, SimCLR, and FSP-Patch models, the data loader samples one patch for each slide. In FACILE-FSP and FACILE-SupCon, the data loader samples a set of $a$ patches for each slide. SimSiam is trained for 5,000 epochs using a batch size of 55; SGD is employed with a learning rate of 0.01, weight decay of 1e-4, and momentum of 0.9, together with a cosine scheduler. Similarly, SimCLR is trained for 5,000 epochs with a batch size of 32 and an additional temperature parameter of 0.07; SGD is used with a learning rate of 0.006, weight decay of 1e-4, and momentum of 0.9, also with a cosine scheduler. FSP-Patch is trained for 1,000 epochs with a batch size of 64; we employ SGD with a learning rate of 0.05, weight decay of 1e-4, momentum of 0.9, and a cosine scheduler. FACILE-FSP is trained for 3,000 epochs with a batch size of 32 and an input set size of 5 by default; we employ SGD with a learning rate of 0.0125, weight decay of 1e-4, momentum of 0.9, and a cosine scheduler. A Set Transformer with 3 inducing points and 4 attention heads is used for the set-input model $g$. Similarly, for our FACILE-SupCon model, we use the same input set size and set-input model. Training uses a batch size of 32 over 3,000 epochs, with the additional temperature parameter set to 0.07, and SGD with a learning rate of 0.00625, weight decay of 1e-4, and momentum of 0.9. We use an MLP as a projection head with two fully-connected layers, a hidden dimension of 512, and an output dimension of 512.
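To make the set-level pretraining step above concrete, the following is a minimal PyTorch sketch of one FACILE-FSP-style training step: patches sampled from one slide are embedded by a ResNet18 feature map $e$, aggregated by a permutation-invariant set function $g$, and the pooled set embedding is trained against the slide's coarse-grained label. For brevity the set function here is a Deep Set-style MLP with mean pooling rather than the Set Transformer used above, and the class names, toy batch, and number of coarse classes are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torchvision

class FacileFSPSketch(nn.Module):
    """Feature map e (ResNet18) + a simple permutation-invariant set function g
    (Deep Set-style MLP with mean pooling), followed by a coarse-grained classifier."""
    def __init__(self, num_coarse_classes: int, feat_dim: int = 512):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()                     # e: patch -> 512-d feature
        self.e = backbone
        self.g = nn.Sequential(                         # g: per-patch features -> set features
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.head = nn.Linear(feat_dim, num_coarse_classes)

    def forward(self, sets):                            # sets: (batch, set_size, 3, 224, 224)
        b, a = sets.shape[:2]
        feats = self.e(sets.flatten(0, 1)).view(b, a, -1)   # per-patch embeddings
        pooled = self.g(feats).mean(dim=1)                  # permutation-invariant pooling
        return self.head(pooled)                            # coarse-grained logits

model = FacileFSPSketch(num_coarse_classes=32)
opt = torch.optim.SGD(model.parameters(), lr=0.0125, weight_decay=1e-4, momentum=0.9)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=3000)
loss_fn = nn.CrossEntropyLoss()

# One illustrative step: 4 sets of 5 patches each, with slide-level (coarse) labels.
x = torch.randn(4, 5, 3, 224, 224)
y = torch.randint(0, 32, (4,))
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
sched.step()
opt.zero_grad()
```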
A.4 FINE-TUNE VIT-B/14 OF DINOV2 WITH TCGA

SimSiam is trained for 400 epochs with a batch size of 64, using stochastic gradient descent (SGD) with an initial learning rate of 0.0125, a weight decay of 1e-4, and a momentum of 0.9; a cosine scheduler is employed. SimCLR follows a similar regimen of 400 epochs with a batch size of 64, an additional temperature parameter of 0.07, and identical SGD parameters, including a cosine scheduler for learning rate adjustment. FSP-Patch is also trained for 400 epochs with a batch size of 64, using SGD with a learning rate of 0.0125, a weight decay of 1e-4, and a momentum of 0.9, along with a cosine scheduler. For FACILE-FSP, training spans 200 epochs with a batch size of 64, using SGD with the same learning rate, weight decay, and momentum settings. FACILE-SupCon is trained for 800 epochs with a batch size of 64, an additional temperature of 0.07, and the same SGD configuration. Both FACILE-FSP and FACILE-SupCon use a cosine annealing scheduler.

B ADDITIONAL RESULTS

B.1 FINE-TUNE CLIP MODEL WITH ANOMALY DETECTION DATASET

In this experiment, we sought to enhance model performance with coarse-grained labels from anomaly detection datasets (Zaheer et al., 2017; Lee et al., 2019a). A total of 11,788 input sets of size 10 are constructed from the CUB200 (He & Peng, 2019) training dataset by including one example that lacks an attribute common to the other examples in the input set. The coarse-grained label of a set is the position of the anomaly. This setup creates a challenging scenario for models, as they must identify the outlier among otherwise similar instances. In the downstream task, we evaluate the fine-tuned feature encoder, composed of the fixed CLIP (Radford et al., 2021) image encoder ViT-B/16 and an appended fully-connected layer, on species classification on the CUB200 test dataset. Batch normalization (Ioffe & Szegedy, 2015) and ReLU are applied to the fully-connected layer. Under this experimental setup, the rationale for utilizing coarse-grained labels is grounded in their potential to enhance model discernment in downstream tasks. By training the model to identify anomalies in sets where one item diverges from the rest, we essentially teach it to focus on subtle differences and critical attribute features. This enhanced focus is particularly beneficial for fine-grained classification on the CUB200 test dataset, where distinguishing between closely related species requires the model to recognize and prioritize minute, yet significant, differences. The model training approach in this experiment centers on the CLIP image encoder, enhanced with an additional fully-connected layer. FACILE-FSP and FACILE-SupCon incorporate this setup, using the CLIP-based feature encoder and fine-tuning the fully-connected layer through the FACILE pretraining step. In contrast, the SimSiam approach uses the CLIP image encoder as a backbone while fine-tuning the projector and predictor components. Similarly, the SimCLR method also uses the CLIP encoder as its foundation but fine-tunes only the projector.
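A minimal sketch of the shared encoder used by these variants is given below: the CLIP ViT-B/16 image encoder is kept frozen and only the appended fully-connected layer (with batch normalization and ReLU) is trainable. The use of OpenAI's `clip` package and the 512-dimensional head are assumptions for illustration, not the exact released code.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP package (assumed available); any ViT-B/16 CLIP wrapper would work

class ClipWithHead(nn.Module):
    """Frozen CLIP ViT-B/16 image encoder followed by a trainable fully-connected head."""
    def __init__(self, out_dim: int = 512, device: str = "cpu"):
        super().__init__()
        self.clip_model, _ = clip.load("ViT-B/16", device=device)
        for p in self.clip_model.parameters():      # keep the CLIP backbone fixed
            p.requires_grad_(False)
        clip_dim = 512                              # embedding dimension of ViT-B/16 image features
        self.head = nn.Sequential(                  # the only trainable part
            nn.Linear(clip_dim, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(),
        )

    def forward(self, images):
        with torch.no_grad():
            feats = self.clip_model.encode_image(images).float()
        return self.head(feats)
```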
These varied strategies reflect our efforts to optimize the feature encoder for accurately identifying anomalies and improving classification performance in related tasks. The training details can be found in section A.2. pretraining method NC LR RC CLIP (Vi T-B/16) 83.84 1.10 81.01 1.23 82.75 1.17 Sim CLR 84.03 1.08 83.49 1.14 86.30 1.03 Sim Siam 84.02 1.10 83.90 1.13 85.68 1.07 FACILE-Sup Con 87.49 0.99 86.57 1.07 88.01 0.99 FACILE-FSP 88.74 0.94 88.45 0.96 88.36 0.95 Table 4: Pretraining on input sets from CUB200. Testing with 5-shot 20-way meta-test sets; average F1 and CI are reported. Note that table 4 clearly demonstrates that all models tested benefit from incorporating data from the target domain. Notably, both FACILE-Sup Con and FACILE-FSP exhibit superior performance Published as a conference paper at ICLR 2024 compared to other baseline models. This observation underscores the effectiveness of our models in leveraging coarse-grained labels to enhance their anomaly detection capabilities. B.2 PRETRAIN RESNET18 WITH TCGA B.2.1 ACC ON LC, PAIP, AND NCT DATASETS We pretrain the models on TCGA datasets with patches size 224 224 at 20X magnification. Then, these pretrained models are tested on LC, PAIP, and NCT datasets. The average ACC and CI on the LC, PAIP, and NCT datasets are shown in table 5. pretraining method NC LR RC LR+LA RC+LA 1-shot 5-way test on LC dataset Image Net (FSP) 65.64 0.49 66.06 0.46 65.92 0.48 66.60 0.48 67.09 0.47 Sim Siam 68.88 0.51 68.53 0.48 68.27 0.48 68.81 0.49 70.24 0.47 Sim CLR 66.41 0.48 66.52 0.46 66.10 0.46 67.70 0.45 68.71 0.46 FSP-Patch 68.56 0.46 68.51 0.45 68.68 0.46 69.38 0.46 69.63 0.46 FACILE-Sup Con 76.64 0.50 77.88 0.47 76.77 0.47 77.15 0.48 77.16 0.48 FACILE-FSP 79.01 0.49 78.16 0.48 77.43 0.50 79.15 0.47 75.81 0.48 5-shot 5-way test on LC dataset Image Net (FSP) 82.79 0.32 81.31 0.31 81.13 0.30 84.50 0.30 84.73 0.28 Sim Siam 85.12 0.30 83.39 0.32 83.85 0.30 87.74 0.27 87.90 0.26 Sim CLR 83.75 0.30 82.38 0.30 82.32 0.31 86.12 0.28 85.40 0.30 FSP-Patch 85.15 0.29 84.38 0.31 85.01 0.29 86.71 0.28 86.24 0.27 FACILE-Sup Con 91.16 0.24 90.48 0.24 90.40 0.24 91.39 0.24 91.03 0.22 FACILE-FSP 91.77 0.21 90.85 0.23 90.77 0.24 92.19 0.22 90.02 0.24 pretraining method NC LR RC LR+LA RC+LA 1-shot 3-way test on PAIP dataset Image Net (FSP) 48.44 0.65 50.34 0.65 50.21 0.62 48.90 0.62 47.51 0.59 Sim Siam 49.42 0.65 50.25 0.65 49.76 0.65 49.51 0.62 49.09 0.63 Sim CLR 47.39 0.59 48.35 0.59 47.97 0.58 47.77 0.59 47.65 0.60 FSP-Patch 51.61 0.68 51.61 0.67 52.06 0.67 51.74 0.66 51.38 0.66 FACILE-Sup Con 49.65 0.61 51.32 0.66 51.16 0.63 50.00 0.62 50.81 0.65 FACILE-FSP 48.91 0.61 49.57 0.63 49.68 0.63 49.42 0.65 48.60 0.64 5-shot 3-way test on PAIP dataset Image Net (FSP) 62.46 0.52 62.48 0.48 63.14 0.50 62.11 0.51 60.52 0.49 Sim Siam 63.05 0.52 64.44 0.49 64.66 0.50 65.44 0.53 64.64 0.55 Sim CLR 61.48 0.52 61.84 0.53 62.75 0.51 63.03 0.52 61.70 0.52 FSP-Patch 65.29 0.49 65.81 0.51 65.98 0.48 65.70 0.50 64.01 0.52 FACILE-Sup Con 65.44 0.51 66.75 0.52 67.11 0.51 67.24 0.53 67.06 0.52 FACILE-FSP 64.68 0.53 65.75 0.49 66.58 0.51 67.42 0.53 67.06 0.53 pretraining method NC LR RC LR+LA RC+LA 1-shot 9-way test on NCT dataset Image Net (FSP) 58.75 0.35 58.66 0.36 58.48 0.34 58.83 0.36 57.32 0.36 Sim Siam 64.76 0.40 66.09 0.39 66.09 0.39 66.54 0.40 67.05 0.41 Sim CLR 60.47 0.41 61.17 0.38 61.43 0.39 61.65 0.40 62.48 0.38 FSP-Patch 61.03 0.42 63.53 0.40 63.26 0.42 62.75 0.43 61.57 0.42 FACILE-Sup Con 68.99 0.45 70.76 0.40 70.89 0.41 70.45 0.45 70.63 0.44 FACILE-FSP 
67.43 0.44 68.45 0.4 68.97 0.42 69.53 0.43 70.89 0.42 5-shot 9-way test on NCT dataset Image Net (FSP) 74.82 0.26 74.35 0.26 75.20 0.26 77.11 0.23 74.89 0.26 Sim Siam 80.59 0.23 80.51 0.23 81.54 0.21 83.68 0.22 83.85 0.22 Sim CLR 77.30 0.25 77.64 0.24 79.17 0.24 80.99 0.24 81.71 0.23 FSP-Patch 79.61 0.25 79.89 0.24 81.71 0.23 82.92 0.24 81.67 0.24 FACILE-Sup Con 86.89 0.22 88.06 0.20 89.26 0.19 89.62 0.19 88.67 0.21 FACILE-FSP 84.83 0.24 85.78 0.23 87.68 0.20 88.16 0.20 87.67 0.20 Table 5: Models tested on LC, PAIP, and NCT dataset; average ACC and CI are reported. B.2.2 TEST WITH LARGE SHOT NUMBER We further test the trained models with a larger shot number k. The result is shown in table 6 Published as a conference paper at ICLR 2024 pretraining method NC LR RC LR+LA RC+LA 10-shot 5-way on LC Image Net (FSP) 78.76 0.94 78.92 0.92 80.45 0.87 82.25 0.83 80.20 0.89 Sim Siam 88.52 0.55 87.20 0.58 87.73 0.56 91.62 0.46 91.88 0.47 Sim CLR 87.02 0.64 86.26 0.64 85.61 0.72 90.28 0.52 89.60 0.58 FSP-Patch 88.41 0.53 88.64 0.52 89.15 0.51 90.49 0.50 89.88 0.54 FACILE-Sup Con 92.84 0.39 92.87 0.38 93.21 0.37 94.25 0.36 93.72 0.39 FACILE-FSP 93.10 0.39 93.11 0.38 93.63 0.37 94.52 0.35 93.07 0.45 10-shot 3-way on PAIP Image Net (FSP) 65.36 0.91 65.17 1.00 65.40 0.99 66.52 0.81 64.45 0.81 Sim Siam 67.19 0.88 67.35 0.98 68.55 0.94 70.88 0.77 70.62 0.77 Sim CLR 65.77 0.85 66.70 0.91 67.01 0.91 68.41 0.79 66.96 0.82 FSP-Patch 68.50 0.82 69.12 0.85 69.39 0.85 70.13 0.75 68.25 0.76 FACILE-Sup Con 70.03 0.81 71.24 0.84 72.17 0.83 73.31 0.71 72.50 0.71 FACILE-FSP 69.19 0.82 71.13 0.82 71.78 0.81 73.22 0.73 72.78 0.71 10-shot 9-way on NCT Image Net (FSP) 78.76 0.94 78.92 0.92 80.45 0.87 82.25 0.83 80.20 0.89 Sim Siam 82.92 0.91 83.42 0.89 84.76 0.81 87.66 0.72 88.12 0.69 Sim CLR 80.34 0.96 81.67 0.90 83.09 0.84 85.96 0.76 86.82 0.72 FSP-Patch 83.36 0.77 84.05 0.74 85.93 0.65 87.15 0.62 86.05 0.63 FACILE-Sup Con 89.57 0.49 91.11 0.45 92.20 0.41 92.88 0.39 92.02 0.41 FACILE-FSP 87.54 0.61 89.25 0.56 90.77 0.49 91.63 0.48 91.23 0.50 Table 6: Test result on LC, PAIP, and NCT dataset with shot number 10; average F1 and CI are reported. B.3 FINE-TUNE VIT-B/14 OF DINO V2 ON TCGA DATASET Similar to section B.1, we fine-tune a fully-connected layer that is appended after DINO V2 Oquab et al. (2023) Vi T-B/14. This methodology is applied across various models to assess their performance on histopathology image datasets. By adopting the DINO V2 architecture, known for its robustness and effectiveness in visual representation learning, we aim to harness its potential for the specialized domain of histopathology. We refer interested readers to section A.4 for details of pretraining. Notably, our methods, FACILE-Sup Con and FACILE-FSP, demonstrated markedly superior results in comparison to other baseline models when applied to histopathology image datasets as shown in table 7. This outcome highlights the effectiveness of these methods in leveraging coarse-grained labels specific to histopathology, thereby greatly enhancing the model performance of downstream tasks. Another critical insight emerged from our research: the current foundation model, DINO V2, exhibits limitations in its generalization performance on histopathology images. This suggests that while DINO V2 provides a strong starting point due to its robust visual representation capabilities, there is a clear need for further finetuning or prompt learning to optimize its performance for the unique challenges presented by histopathology datasets. 
This finding underscores the importance of specialized adaptation in the application of foundation models to specific domains like medical imaging. B.4 BENEFITS OF PRETRAINING ON LARGE PATHOLOGY DATASETS In order to demonstrate the advantages of pretraining on large pathology datasets, we compare the performance of models pretrained on TCGA datasets with those pretrained on NCT datast, which are also studied in Yang et al. (2022). The Sim Siam model is trained for 100 epochs. SGD optimizer is used with learning rate of 0.01, weight decay of 0.0001, momentum of 0.9, and cosine learning rate decay. The batch size is 55. For Mo Co v3, similar to (Chen et al., 2021c; Yang et al., 2022), LARS optimizer (You et al., 2017) was used with an initial learning rate of 0.3, weight decay of 1.5e 6, the momentum of 0.9, and cosine decay schedule. Mo Co v3 was trained with a batch size of 256 for 200 epochs. The FSP model with simple augmentation follows the setting of Yang et al. (2022). SGD optimizer with learning rate of 0.5, momentum of 0.9 and weight decay of 0 are used. A large batch size is used 512. The model is trained for 100 epochs with step decay schedule. The learning rate Published as a conference paper at ICLR 2024 pretraining method NC LR RC LR+LA RC+LA 1-shot 5-way test on LC dataset DINO V2 (Vi T-B/14) 44.82 1.41 47.51 1.39 47.63 1.38 47.36 1.39 48.88 1.44 Sim Siam 48.79 1.37 49.43 1.35 48.43 1.36 49.38 1.34 49.50 1.34 Sim CLR 50.47 1.31 50.52 1.33 50.44 1.32 51.66 1.32 51.78 1.38 FSP-Patch 49.73 1.41 53.59 1.38 53.07 1.41 51.79 1.40 51.27 1.43 FACILE-Sup Con 56.24 1.43 56.51 1.41 55.95 1.42 56.29 1.43 54.07 1.44 FACILE-FSP 55.67 1.40 56.26 1.36 55.83 1.35 56.01 1.38 55.35 1.40 5-shot 5-way test on LC dataset DINO V2 (Vi T-B/14) 66.12 0.98 64.71 1.12 66.36 1.10 72.95 0.93 75.11 0.91 Sim Siam 67.51 0.96 64.99 1.05 65.39 1.05 70.30 0.93 71.19 0.93 Sim CLR 70.10 0.92 69.28 0.96 69.18 0.97 72.99 0.92 72.91 0.94 FSP-Patch 71.97 0.96 71.11 1.04 71.19 1.03 73.96 0.94 73.20 0.96 FACILE-Sup Con 75.58 0.88 74.26 0.94 73.20 0.95 75.81 0.90 74.34 0.96 FACILE-FSP 75.86 0.86 74.64 0.89 74.12 0.93 76.17 0.88 75.08 0.95 1-shot 3-way test on PAIP dataset DINO V2 (Vi T-B/14) 41.51 1.27 44.37 1.26 44.28 1.25 42.43 1.27 42.78 1.27 Sim Siam 49.42 1.28 48.07 1.35 48.44 1.36 48.76 1.33 46.48 1.37 Sim CLR 48.60 1.19 48.76 1.25 47.98 1.26 48.94 1.23 47.20 1.26 FSP-Patch 46.09 1.17 47.44 1.18 48.09 1.19 46.76 1.18 43.68 1.22 FACILE-Sup Con 51.97 1.18 52.25 1.22 51.80 1.22 51.36 1.22 50.24 1.23 FACILE-FSP 51.34 1.16 51.18 1.19 51.51 1.19 51.50 1.16 49.77 1.22 5-shot 3-way test on PAIP dataset DINO V2 (Vi T-B/14) 57.59 1.07 58.19 1.10 59.37 1.07 61.84 0.85 60.81 0.86 Sim Siam 61.56 0.97 62.52 1.01 62.81 1.01 64.40 0.86 62.44 0.93 Sim CLR 62.20 0.93 61.78 0.99 63.20 0.97 63.38 0.86 63.03 0.88 FSP-Patch 63.77 0.88 63.85 0.94 63.85 0.93 63.61 0.85 60.91 0.87 FACILE-Sup Con 67.16 0.84 67.29 0.89 66.88 0.90 67.61 0.85 66.34 0.84 FACILE-FSP 67.14 0.85 67.67 0.84 67.54 0.86 67.12 0.81 66.05 0.83 1-shot 9-way test on NCT dataset DINO V2 (Vi T-B/14) 56.03 1.62 59.11 1.57 60.13 1.55 58.71 1.57 59.06 1.55 Sim Siam 62.60 1.45 61.89 1.50 61.90 1.51 62.27 1.47 61.05 1.44 Sim CLR 65.43 1.43 64.18 1.44 64.15 1.46 64.83 1.43 62.69 1.38 FSP-Patch 65.22 1.49 65.93 1.41 65.94 1.40 65.26 1.45 62.66 1.46 FACILE-Sup Con 71.55 1.36 70.36 1.37 70.52 1.35 71.05 1.35 68.85 1.40 FACILE-FSP 72.05 1.34 70.70 1.35 70.77 1.34 71.14 1.34 68.03 1.40 5-shot 9-way test on NCT dataset DINO V2 (Vi T-B/14) 76.85 0.98 76.51 1.02 78.67 0.94 82.20 0.82 
82.75 0.83 Sim Siam 80.81 0.85 80.06 0.87 81.55 0.85 83.18 0.80 82.39 0.83 Sim CLR 82.87 0.80 81.91 0.82 82.86 0.80 83.92 0.77 82.89 0.79 FSP-Patch 83.63 0.83 83.49 0.80 84.34 0.78 85.32 0.75 83.03 0.79 FACILE-Sup Con 87.74 0.64 87.00 0.64 87.38 0.62 87.82 0.63 86.15 0.69 FACILE-FSP 87.93 0.65 87.52 0.65 87.72 0.62 88.01 0.64 86.46 0.70 Table 7: Test result on LC, PAIP, and NCT dataset with Vi T-B/14 from DINO V2; average F1 and CI are reported. multiplied by 0.1 at 30, 60, and 90 epochs respectively. The FSP model with strong augmentation was trained for 50 epochs. The batch size is set to 64. The SGD is used with a learning rate of 0.03, momentum of 0.9, weight decay of 0.0001, and the cosine schedule. The model is trained for 50 epochs. The Sup Con model is trained with trained for 100 epochs. The batch size is set to 64. The SGD optimizer is used with a learning rate of 0.01, momentum of 0.9, weight decay of 0.0001, and the cosine schedule. table 8 shows the performance of the pretrained models on the LC and PAIP dataset with shot numbers 1 or 5. Notably, the best-performing models on the two test datasets exhibit a significant performance gap compared to the best models pretrained on TCGA datasets as depicted in table 2. B.5 PRETRAIN ON TCGA AND GTEX WITH PATCH SIZE 1,000X1,000 We train the models using TCGA patches of size 1, 000 1, 000, which are extracted from 20X magnification and resized to 224 224. Subsequently, the pretrained models are evaluated on PDAC datasets, and the corresponding test performance is presented in Figure 9. Notably, for shot number of 1 and 5, our model significantly outperforms other models, demonstrating a substantial performance margin. Published as a conference paper at ICLR 2024 pretraining method NC LR RC LR+LA RC+LA 1-shot 5-way test on LC dataset Sim Siam 59.30 1.31 58.67 1.41 58.58 1.40 59.66 1.35 59.85 1.35 Mo Co v3 ((Yang et al., 2022)) 59.38 1.62 59.39 1.68 59.46 1.68 60.15 1.59 60.54 1.58 FSP (simple aug; (Yang et al., 2022)) 51.42 1.59 46.06 1.88 46.33 1.86 50.53 1.65 51.00 1.65 FSP (strong aug) 68.00 1.29 66.17 1.41 66.18 1.46 68.39 1.34 68.02 1.40 Sup Con 64.48 1.33 63.52 1.42 63.84 1.40 65.43 1.33 65.98 1.38 5-shot 5-way test on LC dataset Sim Siam 76.21 0.87 74.05 1.10 74.59 1.10 77.87 0.87 76.03 0.94 Mo Co v3 ((Yang et al., 2022)) 72.82 1.25 70.29 1.43 71.31 1.40 78.72 1.00 79.71 0.95 FSP (simple aug; (Yang et al., 2022)) 56.44 1.50 52.27 1.81 55.62 1.74 63.47 1.37 63.47 1.46 FSP (strong aug) 83.53 0.79 80.81 1.01 80.27 1.08 85.57 0.77 84.06 0.89 Sup Con 81.51 0.85 78.77 1.03 78.65 1.08 83.51 0.84 83.31 0.91 1-shot 3-way test on PAIP dataset Sim Siam 37.13 1.14 38.26 1.13 37.93 1.15 38.00 1.12 38.67 1.12 Mo Co v3 ((Yang et al., 2022)) 43.17 1.26 42.48 1.30 43.02 1.31 43.55 1.28 44.57 1.28 FSP (simple aug; (Yang et al., 2022)) 37.15 1.07 36.69 1.13 37.39 1.08 37.40 1.07 35.28 1.09 FSP (strong aug) 47.67 1.18 48.44 1.19 48.16 1.21 48.27 1.17 49.38 1.19 Sup Con 48.45 1.19 49.29 1.20 48.97 1.22 49.47 1.20 48.53 1.20 5-shot 3-way test on PAIP dataset Sim Siam 47.52 1.00 48.12 1.10 47.04 1.11 52.70 0.95 54.51 1.00 Mo Co v3 ((Yang et al., 2022)) 55.43 1.00 54.23 1.09 54.05 1.09 56.07 0.92 55.73 0.93 FSP (simple aug; (Yang et al., 2022)) 44.98 0.95 45.13 0.96 45.30 0.96 44.34 0.87 44.03 0.88 FSP (strong aug) 62.00 0.88 62.48 0.97 62.04 0.98 64.82 0.86 64.60 0.87 Sup Con 63.62 0.91 64.38 0.96 63.61 1.00 64.37 0.87 64.28 0.88 Table 8: pretraining on NCT dataset and testing on LC and PAIP dataset; average F1 and CI are reported. 
pretraining method NC LR RC LR+LA RC+LA 1-shot 5-way test Image Net (FSP) 29.57 1.07 31.32 1.09 31.16 1.07 30.88 1.08 30.14 1.08 Sim Siam 30.48 1.08 30.18 1.12 30.19 1.13 30.41 1.08 31.13 1.10 Sim CLR 30.79 1.08 30.93 1.13 30.78 1.12 31.33 1.08 31.22 1.07 FSP-Patch 34.04 1.16 33.99 1.20 33.69 1.20 34.29 1.15 34.99 1.16 FACILE-Sup Con 35.44 1.17 34.94 1.20 34.58 1.22 35.68 1.16 35.27 1.17 FACILE-FSP 37.36 1.16 36.07 1.23 36.93 1.21 36.79 1.19 36.81 1.18 5-shot 5-way test Image Net (FSP) 41.83 0.96 41.30 1.10 41.08 1.08 42.38 0.94 41.29 0.93 Sim Siam 40.15 1.03 37.29 1.21 37.43 1.21 41.87 1.00 42.70 1.01 Sim CLR 40.30 1.04 38.74 1.19 39.02 1.16 40.98 0.96 40.90 0.98 FSP-Patch 44.26 1.10 42.99 1.20 43.69 1.12 46.32 0.97 46.69 0.96 FACILE-Sup Con 45.83 1.09 45.07 1.18 45.93 1.13 47.72 0.95 47.00 0.95 FACILE-FSP 48.21 1.04 47.62 1.12 47.94 1.08 48.84 0.95 48.37 0.95 Table 9: Models pretrained on TCGA and tested on PDAC dataset; average F1 and CI are reported. Similarly, we train the models using GTEx patches with dimensions of 1, 000 1, 000. The patches are extracted from 20X magnification and resized to 224 224. The pretrained models are tested on PDAC datasets, revealing similar outcomes, as illustrated in Table 10. pretraining method NC LR RC LR+LA RC+LA 1-shot 5-way test Sim Siam 34.78 1.18 34.57 1.25 35.30 1.25 35.13 1.19 35.27 1.19 Sim CLR 33.68 1.14 33.74 1.18 33.69 1.17 34.28 1.14 33.84 1.12 FSP-Patch 31.87 1.09 32.90 1.13 32.53 1.11 32.55 1.09 32.10 1.07 FACILE-Sup Con 34.36 1.06 34.35 1.13 34.39 1.14 34.70 1.07 34.35 1.07 FACILE-FSP 35.62 1.10 35.51 1.15 35.40 1.13 35.87 1.10 36.16 1.09 5-shot 5-way test Sim Siam 46.00 1.10 43.26 1.30 44.19 1.26 47.24 1.00 47.85 1.00 Sim CLR 44.44 1.08 43.40 1.19 43.58 1.15 44.60 0.98 44.17 0.96 FSP-Patch 42.09 0.99 40.15 1.15 40.69 1.09 42.71 0.92 42.66 0.90 FACILE-Sup Con 44.85 1.02 43.65 1.15 44.01 1.13 46.37 0.93 45.10 0.92 FACILE-FSP 46.91 0.97 46.32 1.07 47.10 1.02 48.01 0.90 47.70 0.89 Table 10: Models pretrained on GTEx and tested on PDAC dataset; average F1 and CI are reported. Published as a conference paper at ICLR 2024 C.1 GTEX DATASET The Genotype-Tissue Expression (GTEx) project is a pioneering initiative aimed at constructing an extensive public repository to investigate tissue-specific gene expression and regulation. The GTEx project collected samples from 54 non-diseased tissue sites across nearly 1000 individuals, with an emphasis on molecular assays such as Whole Genome Sequencing (WGS), Whole Exome Sequencing (WES), and RNA-sequencing. Additionally, the GTEx Biobank contains a plethora of unutilized samples. The GTEx portal (https://gtexportal.org/home/) provides unrestricted access to a plethora of data, including gene expression levels, quantitative trait loci (QTLs), and histology images, to aid the research community in advancing our understanding of human gene expression and its regulation. We downloaded all the slides from the GTEx portal. The organs from which the slides are extracted are used for coarse-grained labels. We extract all the non-overlapping patches with size 1, 000 1, 000 and only keep those with intensity in [0.1, 0.85] to filter out backgrounds. The number of slides from each organ for GTEx can be found in figure 6. Thumbnails of WSI examples from the GTEx dataset can be found in figure 7. 
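As a rough illustration of the patch extraction and background filtering described above, the snippet below tiles a slide image into non-overlapping 1,000x1,000 patches and keeps only those whose mean intensity (normalized to [0, 1]) falls in [0.1, 0.85]. The exact intensity statistic is not specified in the text, so treating it as the mean grayscale value is an assumption; real WSIs would typically be read with a library such as OpenSlide, with a plain PIL image standing in here.

```python
import numpy as np
from PIL import Image

def keep_patch(patch: Image.Image, lo: float = 0.1, hi: float = 0.85) -> bool:
    """Heuristic background filter: keep a patch only if its mean grayscale
    intensity (scaled to [0, 1]) lies in [lo, hi]."""
    gray = np.asarray(patch.convert("L"), dtype=np.float32) / 255.0
    return lo <= float(gray.mean()) <= hi

def extract_patches(slide_img: Image.Image, patch_size: int = 1000):
    """Tile the slide image into non-overlapping patch_size x patch_size patches
    and yield the ones passing the intensity filter."""
    w, h = slide_img.size
    for x in range(0, w - patch_size + 1, patch_size):
        for y in range(0, h - patch_size + 1, patch_size):
            patch = slide_img.crop((x, y, x + patch_size, y + patch_size))
            if keep_patch(patch):
                yield (x, y), patch
```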
Figure 6: Slide number for each organ in GTEx.

C.2 TCGA DATASET

The Cancer Genome Atlas (TCGA; https://www.cancer.gov/ccg/research/genome-sequencing/tcga) is a project that aims to comprehensively characterize genetic mutations responsible for cancer using genome sequencing and bioinformatics. The TCGA dataset consists of 10,825 patient samples, including gene expression, DNA methylation, copy number variation, mutation, and histopathology data, among others (Cancer Genome Atlas Research Network, 2008; 2012). This large-scale dataset has enabled researchers to identify numerous genomic alterations associated with cancer and has contributed to the development of new diagnostic and therapeutic approaches. We downloaded all the diagnostic slides from the GDC portal (https://portal.gdc.cancer.gov/). The project names of the slides are used as coarse-grained labels. We extract patches at two different scales, i.e., 224x224 and 1,000x1,000, at 20X magnification from all the slides. The number of slides from each project in TCGA can be found in figure 8. Thumbnails of WSI examples from the TCGA dataset can be found in figure 9.

Figure 7: Randomly selected examples from the GTEx dataset.

Figure 8: Slide number for each tumor in TCGA.

C.3 PDAC DATASET

To address the presence of multiple tissues within certain patches, we employ a labeling strategy that identifies and labels the centered tissue within these patches. To ensure annotation accuracy, each patch is labeled by a minimum of two pathologists, thereby maintaining the quality of the annotations. The patch numbers corresponding to each tissue in the PDAC dataset are given in figure 10, and examples of patches from the PDAC dataset are provided in figure 11.

Table 11: Dataset statistics (data type and annotation; WSI number; extracted patch number).
Coarse-grained datasets:
GTEx: slides annotated with organs; 25,501 WSIs; 9,465,689 patches.
TCGA: slides annotated with tumors; 11,638 WSIs; 10,321,273 patches (11,588,226 with patch size 224).
Fine-grained datasets:
PDAC: patches annotated with tissues; 194 WSIs; 12,250 patches.
LC25000: patches annotated with tissues; 1,250 WSIs; 25,000 patches.
PAIP19: patches annotated with tissues; 60 WSIs; 75,000 patches.
NCT-CRC-HE-100K: patches annotated with tissues; 86 WSIs; 100,000 patches.

In order to validate our model on a real-world dataset, we generated WSIs of Pancreatic Ductal Adenocarcinoma (PDAC).¹
¹We will make the data publicly available upon acceptance of our paper.
PDAC, a particularly aggressive and lethal form of cancer originating in the pancreatic duct cells, presents various subtypes, each with distinct morphological characteristics. These variations underscore the need for advanced automated tools to accurately characterize and differentiate between these subtypes, thereby aiding disease studies and potentially informing treatment strategies. Examples of PDAC patches and the class distribution are shown in figure 11 and figure 10, respectively. There are in total 12,250 annotated patches extracted from 194 slides. The patch size used for this analysis is 1,000x1,000 at 20X magnification. Each patch was annotated with one of 5 classes (i.e., Stroma, Normal Acini, Normal Duct, Tumor, and Islet) and confirmed by at least two pathologists.

C.4 NCT, PAIP, AND LC

We test our models on 4 datasets with fine-grained labels. These datasets are from diverse body sites; their statistics can be found in table 11. NCT-CRC-HE-100K (NCT) is collected from the colon (Kather et al., 2018). It consists of 9 classes with 100K non-overlapping patches of size 224x224. LC25000 (LC) is collected from lung and colon sites (Borkowski et al., 2021). It has 5 classes with 5,000 patches per class; the patch size is 768x768, and we resize the patches to 224x224. PAIP19 (PAIP) is collected from the liver (Kim et al., 2021). There are in total 50 WSIs, which are cropped into patches of size 224x224. We only keep the patches with masks and assign labels by majority voting, similar to Yang et al. (2022). We downsample these patches to 75K patches, with 25K in each class.

Figure 9: Randomly selected examples from the TCGA dataset (TCGA-PRAD, TCGA-SKCM, TCGA-LUSC, TCGA-THCA, TCGA-BRCA, TCGA-UCEC, TCGA-HNSC, TCGA-KIRC).

Figure 10: Patch number for each tissue in the PDAC dataset.

D DATA AUGMENTATION

Two data augmentation strategies are used in this paper.

Simple augmentation. Following Yang et al. (2022), we use a simple augmentation policy that includes random resized cropping and horizontal flipping. In our paper, this simple augmentation policy is only used for FSP-Patch pretraining on the NCT dataset.

Strong augmentation. Following previous work (Grill et al., 2020; Chen et al., 2021c; Yang et al., 2022), for the SimCLR and SupCon models we use a similar strong data augmentation that contains random resized cropping, horizontal flipping, color jittering (Wu et al., 2018) with (brightness=0.8, contrast=0.8, saturation=0.8, hue=0.2, probability=0.8), grayscale conversion (Wu et al., 2018) with (probability=0.2), Gaussian blurring (Chen et al., 2020) with (kernel size=5, min=0.1, max=2.0, probability=0.5), and solarization (Grill et al., 2020) with (threshold=128, probability=0.2). For the SimSiam model, we adopt a comparable strong augmentation strategy, but set the color jittering brightness, contrast, and saturation adjustments to 0.4 and the hue to 0.1, applied with a probability of 0.8, as informed by Chen & He (2021).

E LATENT AUGMENTATION

Latent augmentation (LA) was originally proposed in Yang et al. (2022) to improve the performance of the few-shot learning system in a simple unsupervised way. The pretrained feature extractor transfers only part of the knowledge available in the pretraining dataset through its learned weights; additional transferable knowledge is inherent in the pretraining data representations. To fully exploit the pretraining data, possible semantic shifts of the clustered pretraining representations are transferred to downstream tasks in addition to the pretrained feature extractor weights. Concretely, k-means clustering is performed on the representations of the pretraining dataset generated by the pretrained feature extractor $\hat{e}$. Assume we obtain $C$ clusters after clustering. The base dictionary $B = \{(c_i, \Sigma_i)\}_{i=1}^{C}$ is constructed, where $c_i$ is the $i$-th cluster prototype, i.e., the mean representation of all samples in the cluster, and $\Sigma_i$ is the covariance matrix of the cluster. During downstream task testing, LA uses the original representation $z$ to select the closest prototype from $B$ and forms the additive augmentation $z' = z + \delta$, where $\delta$ is sampled from $\mathcal{N}(0, \Sigma_{i^*})$ and $i^*$ is the index of the closest prototype to $z$. The classifier of the downstream task is then trained on both the original representations and the augmented representations.
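A compact sketch of this latent augmentation procedure, under the assumption that plain k-means (e.g., from scikit-learn) and full covariance matrices are used, might look as follows; the function names and toy dimensions are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_base_dictionary(pretrain_feats: np.ndarray, num_clusters: int):
    """Cluster pretraining representations and store (prototype, covariance) pairs."""
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(pretrain_feats)
    base = []
    for i in range(num_clusters):
        members = pretrain_feats[km.labels_ == i]
        base.append((members.mean(axis=0), np.cov(members, rowvar=False)))
    return base

def latent_augment(z: np.ndarray, base, num_aug: int = 5, rng=None):
    """Sample additive offsets from the covariance of the closest prototype."""
    if rng is None:
        rng = np.random.default_rng(0)
    dists = [np.linalg.norm(z - c) for c, _ in base]
    _, cov = base[int(np.argmin(dists))]
    return np.stack([z + rng.multivariate_normal(np.zeros_like(z), cov)
                     for _ in range(num_aug)])

# Toy usage: 1,000 pretraining features and one downstream support feature, both 64-d.
feats = np.random.randn(1000, 64)
base = build_base_dictionary(feats, num_clusters=8)
z = np.random.randn(64)
augmented = latent_augment(z, base)   # shape (5, 64); the classifier sees z and augmented
```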
Figure 11: Randomly selected examples from each class of the PDAC dataset (Islet, Normal Duct, Stroma, Tumor).

F ABLATION STUDY

F.1 SET-INPUT MODELS

Pooling architectures have been used in various set-input problems, e.g., 3D shape recognition (Shi et al., 2015; Su et al., 2015) and learning the statistics of a set (Edwards & Storkey, 2016). Vinyals et al. (2015) and Ilse et al. (2018) pool the elements in a set by a weighted average with weights computed by an attention module. Zaheer et al. (2017) and Edwards & Storkey (2016) proposed to aggregate instance embeddings, extracted using a neural network, with pooling operations (e.g., mean, sum, max); this simple method satisfies the permutation-invariance property and can work with any set size. Santoro et al. (2017) used a relational network to model all pairwise interactions of elements in a given set. Lee et al. (2019a) proposed to use the Transformer (Vaswani et al., 2017) to explicitly model higher-order interactions among the instances in a set. We evaluate three set-input models for the FACILE-FSP model: attention-based MIL pooling (Ilse et al., 2018), Deep Set (Zaheer et al., 2017), and Set Transformer (Lee et al., 2019a). Attention-based MIL pooling uses a weighted average of the instance embeddings in a set, where the weights are determined by a neural network; it corresponds to a version of attention (Lin et al., 2017; Raffel & Ellis, 2015) and has been adapted by Zhang et al. (2020b;a) and Pal et al. (2021) in the context of H&E images. It uses a single fully-connected layer and softmax, with batch normalization and ReLU activation, to predict the attention weights for instances. In the Deep Set model, each instance in a set is independently fed into a neural network that takes fixed-sized inputs; the extracted features are then aggregated with a pooling operation (i.e., mean, sum, or max), and the final output is obtained by further non-linear operations. This simple architecture satisfies the permutation-invariance property and can work with any set size. Set Transformer adapts the Transformer model to set data. It leverages the attention mechanism (Vaswani et al., 2017) to capture interactions between instances of the input set.
It applies the idea of inducing points from the sparse Gaussian process literature to reduce quadratic complexity to linear in the size of the input set. We train FACILE-FSP with three set-input models. The set size a is set to 5. In the attention-based MIL pooling model, we implemented the simple version, and use the single fc layer with softmax to predict attention weights from Res Net18 extracted features. For Deep Set model, we use two fc layers with Re LU activation functions in between to extract instance features before set pooling. In the Set Transformer, we use 4 attention heads and 3 inducing points. From table 12, we conclude that none of the 3 set-input models used in FACILE-FSP is consistently better than the other set-input models. The Deep Set model achieves the highest average F1 score with more tasks. F.2 LEARNING CURVE To validate the adequacy of training for all models, we assess the intermediate checkpoints of each pretraining model on the LC dataset. The learning curves and confidence intervals (CI) of FACILEFSP, FSP-Patch, and Sim Siam are displayed in figure 12. Upon careful examination of the learning curves in figure 12, we observe conclusive evidence of complete training for all models, as they have reached convergence. F.3 INPUT SET SIZE To examine the impact of input set size on downstream tasks, we conduct pretraining experiments using FACILE-FSP on the TCGA dataset with varying input set sizes. The resulting feature map e Published as a conference paper at ICLR 2024 set-input model NC LR RC LR+LA RC+LA 1-shot 5-way test on LC dataset Attention-based MIL pooling 70.53 1.32 69.86 1.39 69.75 1.37 71.15 1.31 70.31 1.34 Deep Set 77.84 1.16 77.56 1.16 77.56 1.17 79.16 1.09 77.38 1.18 Set Transformer 75.09 1.30 73.57 1.29 73.16 1.33 74.03 1.28 72.88 1.34 5-shot 5-way test on LC dataset Attention-based MIL pooling 88.12 0.59 81.60 1.04 82.51 0.97 89.18 0.57 88.15 0.65 Deep Set 90.35 0.50 90.91 0.47 91.54 0.46 91.68 0.50 90.97 0.54 Set Transformer 90.67 0.54 89.18 0.61 89.02 0.63 90.03 0.59 88.71 0.67 1-shot 3-way test on PAIP dataset Attention-based MIL pooling 50.98 1.37 51.93 1.35 51.91 1.36 51.98 1.36 52.39 1.35 Deep Set 52.04 1.25 53.27 1.25 54.19 1.26 52.66 1.25 52.79 1.23 Set Transformer 48.81 1.21 50.08 1.24 50.75 1.23 50.03 1.23 49.41 1.20 5-shot 3-way test on PAIP dataset Attention-based MIL pooling 67.04 1.00 66.06 1.17 66.61 1.10 70.19 0.87 70.54 0.81 Deep Set 69.42 0.85 69.93 0.92 70.52 0.87 69.96 0.84 68.39 0.84 Set Transformer 66.61 0.91 67.57 0.95 67.78 0.95 68.24 0.85 67.20 0.86 1-shot 9-way test on NCT dataset Attention-based MIL pooling 60.04 1.40 64.53 1.29 64.81 1.31 64.00 1.34 66.66 1.32 Deep Set 68.21 1.30 68.17 1.31 68.69 1.30 69.24 1.28 68.18 1.33 Set Transformer 67.76 1.31 68.52 1.30 68.55 1.28 68.33 1.28 67.72 1.28 5-shot 9-way test on NCT dataset Attention-based MIL pooling 81.94 0.75 82.40 0.72 84.46 0.65 86.49 0.62 87.66 0.59 Deep Set 85.18 0.60 85.87 0.60 87.11 0.56 87.06 0.61 85.81 0.66 Set Transformer 86.45 0.62 87.74 0.59 87.97 0.58 88.00 0.59 86.92 0.61 Table 12: Performance of FACIEL-FSP with three different set-input models; average F1 and CI are reported. 200 400 600 800 number of epochs NC LR RC LR+LA RC+LA (a) FACILE-FSP model 200 400 600 800 1000 number of epochs NC LR RC LR+LA RC+LA (b) FSP-Patch model 1000 2000 3000 4000 5000 number of epochs NC LR RC LR+LA RC+LA (c) Sim Siam model Figure 12: Learning curves of FACILE-FSP model, FSP-Patch model, and Sim Siam. 
The mean F1 score and CI of 5 few-shot models tested on the LC dataset with 5-shot are shown with curves. from the trained FACILE-FSP is then evaluated on LC, PAIP, and NCT datasets with shot numbers 1 and 5. The corresponding performances are reported in table 13. Observing table 13, we find that models with an input set size of 5 consistently demonstrate superior performance for LC and PAIP datasets. While slight improvements are observed for larger input set sizes, they are not substantial. Conversely, for the NCT dataset, as presented in table 13, the best performance is attained when the input set size is 10. G CONTRASTIVE AND NON-CONTRASTIVE LEARNING MODELS Self-supervised learning achieves promising results on multiple visual tasks (Bachman et al., 2019; He et al., 2020; Chen et al., 2020; Grill et al., 2020; Caron et al., 2020; Chen & He, 2021). Contrastive learning method avoid collapse by encouraging the representations to be far apart for views from different images. Henaff (2020); He et al. (2020); Misra & Maaten (2020); Chen et al. (2020) implemented instance discrimination, in which a pair of augmented views from the same image are positive and others are negative. Caron et al. (2020; 2018) contrasted different cluster of positives. Published as a conference paper at ICLR 2024 set size 2 5 10 15 1-shot 5-way test on LC dataset NC 75.29 1.33 77.84 1.16 74.88 1.36 75.25 1.29 LR 73.72 1.33 77.56 1.16 73.84 1.29 74.00 1.27 RC 74.10 1.34 77.56 1.17 73.42 1.31 73.42 1.29 LR+LA 75.27 1.28 79.16 1.09 74.41 1.31 74.92 1.26 RC+LA 74.36 1.33 77.38 1.18 72.60 1.34 73.16 1.32 5-shot 5-way test on LC dataset NC 90.62 0.56 90.35 0.50 90.62 0.57 90.83 0.55 LR 89.41 0.63 90.91 0.47 89.80 0.59 89.63 0.60 RC 89.11 0.63 91.54 0.46 89.26 0.61 89.25 0.60 LR+LA 90.46 0.58 91.68 0.50 90.29 0.57 90.46 0.56 RC+LA 89.64 0.63 90.97 0.54 88.52 0.66 89.00 0.64 NC 48.95 1.24 52.04 1.25 51.72 1.22 52.46 1.20 LR 50.55 1.22 53.27 1.25 52.33 1.25 53.38 1.23 RC 50.14 1.25 54.19 1.26 53.04 1.24 52.68 1.25 LR+LA 50.12 1.22 52.66 1.25 52.96 1.21 53.41 1.21 RC+LA 49.91 1.22 52.79 1.23 51.67 1.17 51.51 1.20 5-shot 3-way test on PAIP dataset NC 66.99 0.93 69.42 0.85 69.10 0.91 69.08 0.87 LR 68.11 0.94 69.93 0.92 70.30 0.90 69.28 0.90 RC 68.63 0.91 70.52 0.87 70.45 0.87 70.12 0.90 LR+LA 69.03 0.83 69.96 0.84 70.25 0.81 70.00 0.81 RC+LA 67.32 0.83 68.39 0.84 68.35 0.83 67.70 0.81 NC 66.31 1.36 68.21 1.30 72.44 1.25 72.05 1.27 LR 68.55 1.32 68.17 1.31 72.62 1.25 72.14 1.27 RC 68.58 1.32 68.69 1.30 72.60 1.25 72.04 1.27 LR+LA 67.42 1.33 69.24 1.28 72.18 1.26 71.92 1.27 RC+LA 65.87 1.36 68.18 1.33 69.98 1.31 69.88 1.28 5-shot 9-way test on NCT dataset NC 85.28 0.72 85.18 0.60 88.25 0.56 88.22 0.57 LR 86.39 0.69 85.87 0.60 88.80 0.55 88.55 0.55 RC 87.03 0.66 87.11 0.56 89.25 0.52 89.02 0.54 LR+LA 86.85 0.65 87.06 0.61 88.52 0.59 88.93 0.55 RC+LA 85.60 0.70 85.81 0.66 87.40 0.63 87.74 0.59 Table 13: Abation on set size; models tested on LC, PAIP, and NCT dataset; average F1 and CI are reported. Non-contrastive models (Grill et al., 2020; Richemond et al., 2020; Chen & He, 2021) removed the reliance on negatives. These non-contrastive models achieved strong results in the Image Net (Deng et al., 2009) pretraining setting. Sim Siam (Chen & He, 2021) works with typical batches and does not rely on large-batch training, which makes it preferable for academics and practitioners with low computation resources. 
In this section, we explain the contrastive and non-contrastive learning models used in this paper, e.g., SimCLR, SupCon, and SimSiam, and provide implementation details.

There are three main components in the SimCLR and SupCon framework. We follow the notation of Khosla et al. (2020) in this section to explain SimCLR and SupCon.

Data augmentation $\mathrm{Aug}(\cdot)$. For each input sample $x$, the augmentation module generates two random augmented views $\tilde{x} \sim \mathrm{Aug}(x)$. The augmentation schedules used in this paper are explained in section D.

Encoder $\mathrm{Enc}(\cdot)$. The encoder extracts a representation vector $r = \mathrm{Enc}(\tilde{x})$. The pair of augmented views is separately fed to the same encoder, producing a pair of representations. The representation $r$ is normalized to the unit hypersphere.

Projection head $\mathrm{Proj}(\cdot)$. It maps $r$ to a vector $z = \mathrm{Proj}(r)$. We instantiate $\mathrm{Proj}(\cdot)$ as a multi-layer perceptron (MLP) with a single hidden layer of size 512 and an output vector of size 512. We also normalize the output to the unit hypersphere.

Figure 13: Abstraction of the SimCLR structure.

Consider a set of $N$ randomly sampled sample/label pairs $\{(x_k, y_k)\}_{k=1}^{N}$. The corresponding batch used for training consists of $2N$ pairs $\{(\tilde{x}_l, \tilde{y}_l)\}_{l=1}^{2N}$, where $\tilde{x}_{2k-1}$ and $\tilde{x}_{2k}$ are two random augmented views of $x_k$ and $\tilde{y}_{2k-1} = \tilde{y}_{2k} = y_k$. Let $i \in I \equiv \{1, \dots, 2N\}$ be the index of an arbitrary augmented sample and let $j(i)$ be the index of the other augmented sample originating from the same source sample. The abstraction of the SimCLR structure can be found in figure 13. In SimCLR, the loss takes the following form:

$$\mathcal{L}^{self} = \sum_{i \in I} \mathcal{L}^{self}_i = -\sum_{i \in I} \log \frac{\exp(z_i \cdot z_{j(i)}/\tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a/\tau)} \qquad (2)$$

where $\tau$ is the temperature parameter and $A(i) \equiv I \setminus \{i\}$. The denominator has a total of $2N-1$ terms. In this paper, $\tau$ is always set to 0.07. The patches are augmented randomly by the augmentation module described in section D. We use an MLP as a projection head with two fully-connected layers, a hidden dimension of 512, and an output dimension of 512.

For supervised learning, the contrastive loss in equation 2 cannot handle class discrimination (Khosla et al., 2020). Khosla et al. (2020) proposed two straightforward ways, shown in equation 3 and equation 4, to generalize equation 2 to incorporate supervision:

$$\mathcal{L}^{sup}_{out} = \sum_{i \in I} \mathcal{L}^{sup}_{out,i} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p/\tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a/\tau)} \qquad (3)$$

$$\mathcal{L}^{sup}_{in} = \sum_{i \in I} \mathcal{L}^{sup}_{in,i} = \sum_{i \in I} -\log \left\{ \frac{1}{|P(i)|} \sum_{p \in P(i)} \frac{\exp(z_i \cdot z_p/\tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a/\tau)} \right\} \qquad (4)$$

Here $P(i) \equiv \{p \in A(i) : \tilde{y}_p = \tilde{y}_i\}$ is the set of indices of all positives in the batch distinct from $i$. The authors showed that $\mathcal{L}^{sup}_{in} \le \mathcal{L}^{sup}_{out}$ and that $\mathcal{L}^{sup}_{out}$ is the superior supervised loss function. Thus, we use SupCon with equation 3 as the default loss. The temperature $\tau$ is also set to 0.07.

In our model FACILE-SupCon, the input sample is a set of randomly sampled patches and the labels are slide properties, i.e., organs or TCGA projects. Each patch is augmented randomly by the augmentation module described in section D. The feature map $e$ and set function $g$ serve as the encoder $\mathrm{Enc}(\cdot)$. We also use an MLP as a projection head with two fully-connected layers, a hidden dimension of 512, and an output dimension of 512.
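For reference, the following is a didactic PyTorch sketch of the supervised contrastive loss in equation 3; restricting the positives $P(i)$ of each anchor to the single index $j(i)$ of its other augmented view recovers the self-supervised loss in equation 2. This is a sketch rather than the authors' released implementation, and the toy batch at the end is illustrative.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss L^sup_out of equation 3.

    z:      (2N, d) projections of the 2N augmented views in a batch.
    labels: (2N,) labels; the positives P(i) are all other views sharing i's label.
    """
    z = F.normalize(z, dim=1)                      # project onto the unit hypersphere
    logits = z @ z.t() / tau                       # z_i . z_a / tau for all pairs
    logits = logits - logits.max(dim=1, keepdim=True).values.detach()  # numerical stability
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    exp_logits = torch.exp(logits).masked_fill(self_mask, 0.0)  # denominator over A(i) = I \ {i}
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True))
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # average log-probability over the positives P(i), then average over all anchors i
    mean_log_prob_pos = (log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return -mean_log_prob_pos.mean()

# Toy usage: N = 8 source samples, two augmented views each, 128-d projections.
feats = torch.randn(16, 128)
labels = torch.randint(0, 4, (8,)).repeat_interleave(2)
loss = supcon_loss(feats, labels)
```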
When employing set-input data with the SupCon method, the standard practice of augmenting each instance within a set poses significant challenges for training SupCon models. These challenges stem from two main aspects: 1) Complexity in maximizing agreement with set-input data: SupCon is traditionally trained to maximize agreement between differently augmented views of the same data point using labeled data. In our application, using set-input data means that we apply conventional data augmentation to each instance within a set, which yields an independently augmented set of images rather than an augmented single instance; this makes it harder to achieve the desired maximization of agreement. 2) Constraints on batch size due to set inputs: set-input models take a batch of sets as input instead of a batch of instances, which forces relatively smaller batch sizes on the same hardware configuration. It is important to emphasize that the batch size is a critical factor for the effectiveness of the SupCon model. Despite these challenges, we have observed that the performance of FACILE-SupCon is commendable in contexts involving smaller datasets or less complex models, i.e., CIFAR-100 in sections 3.2 and 3.3, or smaller trainable models as discussed in appendices B.1 and B.3. We believe that our approach, with its nuanced application of SupCon in a set-input context, offers a valuable contribution to the field and shows the versatility of the FACILE algorithm.

G.3 SIMSIAM

Non-contrastive models, e.g., SimSiam and BYOL, achieve strong results in the typical ImageNet (Deng et al., 2009) pretraining setting (Chen & He, 2021; Grill et al., 2020; Li et al., 2022). Among the non-contrastive models, SimSiam removes the negatives and uses stop-gradient to avoid collapse. Besides, it trains faster, requires less GPU memory, and works well with small batch sizes (Chen & He, 2021; Li et al., 2022), which makes it extremely appealing to academics. The abstraction of the SimSiam structure is shown in figure 14.

Figure 14: Abstraction of the SimSiam structure.

Given two augmented views $x_1$ and $x_2$ of the same image $x$, SimSiam learns to use $x_1$ to predict the representation of $x_2$. Specifically, $x_1$ is passed into the online backbone network on the upper branch and $x_2$ is passed into the target backbone network on the lower branch. The outputs of the two backbone networks are passed to the projection MLPs, and a prediction MLP is then used to predict the projected representation of $x_2$ from the projected representation of $x_1$. SimSiam uses the same network for the online and target backbones and projection networks. In our paper, the projection MLP has 3 fully-connected layers with a hidden dimension of 512 and an output dimension of 512; batch normalization (BN) is applied to each fully-connected layer, including the output fully-connected layer. The prediction MLP has 2 layers, with BN applied to its hidden fully-connected layer; its output fully-connected layer has neither BN nor ReLU.

H EXCESS RISK BOUND OF FACILE

Our proof framework closely follows the work of Robinson et al. (2020). We consider the setting where we have coarse-grained labels for sets, rather than instances, and the downstream classifiers only use the learned embeddings to train and test on the downstream tasks. Robinson et al. (2020) consider a different setting where each instance has a weak label and a strong label, and the strong label predictor learns to predict the strong labels from the instances and their corresponding embeddings learned with weak labels.
The diagram of only using trained embeddings for downstream tasks is more often used in self-supervised learning and representation learning for FSL literature (Du et al., 2020; Yang et al., 2021; Bachman et al., 2019; He et al., 2020; Chen et al., 2020; Grill et al., 2020; Caron et al., 2020; Chen & He, 2021). The coarse-grained data contains useful information, which is characterized by our defined Lipschitzness, to pretrain a instance feature map that can be leveraged for downstream FSL. We include the full proof of our key result as follows. In order to prove theorem 4, we first split the excess risk by the following proposition. Proposition 5. Suppose that f is L-Lipschitz relative to E. The excess risk E h ℓfg ˆ f ˆe(X, Y ) ℓfg f e (X, Y ) i is bounded by, 2LRatem(ℓcg, PS,W , E) + Raten(ℓfg, ˆPZ,Y , F) Proof. We split the excess risk into three parts EPX,Y h ℓfg ˆ f ˆe(X, Y ) ℓfg f e (X, Y ) i =EPX,Y h ℓfg ˆ f ˆe(X, Y ) ℓfg f ˆe(X, Y ) i + EPX,Y h ℓfg f ˆe(X, Y ) ℓfg f e0(X, Y ) i + EPX,Y h ℓfg f e0(X, Y ) ℓfg f e (X, Y ) i For the second term and third term, relative Lipschitzness of f to E delivers EPX,Y h ℓfg f ˆe(X, Y ) ℓfg f e0(X, Y ) i = EPX,Y,S,W h ℓfg f ˆe(X, Y ) ℓfg f e0(X, Y ) i LEPX,Y,S,W ℓcg (gˆe ˆe(S), ge0 e0(S)) = LEPS,W ℓcg (gˆe ˆe(S), ge0 e0(S)) , EPX,Y h ℓfg f e0(X, Y ) ℓfg f e (X, Y ) i = EPX,Y,S,W h ℓfg f e0(X, Y ) ℓfg f e (X, Y ) i LEPX,Y,S,W ℓcg (ge0 e0(S), ge e (S)) = LEPS,W ℓcg (ge0 e0(S), ge e (S)) Since e attains minimal risk and W = ge0 e0(S), the sum of the two terms can be bounded by, LEPS,W ℓcg (gˆe ˆe(S), ge0 e0(S)) + LEPS,W ℓcg (ge0 e0(S), ge e (S)) 2LEPS,W ℓcg (gˆe ˆe(S), W) 2LRatem(ℓcg, PS,W , E) By combining the bounds on the three terms we can get the claim. The central condition is well-known to yield fast rates for supervised learning (Van Erven et al., 2015). It directly implies that we could learn a map Z Y with e O(1/n) excess risk. The difficulty is that at test time we would need access to latent value Z = e(X). To circumnavigate this hurdle, we replace e0 with ˆe and solve the supervised learning problem (ℓfg, ˆPZ,Y , F). It is not clear whether this surrogate problem satisfies the central condition. We show that (ℓfg, ˆPZ,Y , F) indeed satisfies a weak central condition and shows weak central condition still enables strong excess risk guarantees. Following Robinson et al. (2020); Van Erven et al. (2015), we define the central condition on F. Published as a conference paper at ICLR 2024 Definition 6. (The central condition). A learning problem (ℓfg, PZ,Y , F) on Z Y is said to satisfy the ϵ-weak η-central condition if there exists an f F such that E(Z,Y ) PZ,Y h eη(ℓfg f (Z,Y ) ℓfg f (Z,Y ))i eηϵ for all f F. The 0-weak central condition is known as the strong central condition. Capturing relatedness of pretraining and downstream task with the central condition. Intuitively, the strong central condition requires that the minimal risk model f attains a higher loss than f F on a set of Z, Y with an exponentially small probability. This is likely to happen when Z is highly predictive of Y so that the probability of P(Y |Z) concentrates in a single location for most Z. If f in F such that f (Z) maps into this concentration, ℓfg f (Z, Y ) will be close to zero most of the time. We assume that the strong central condition holds for the learning problem (ℓfg, PZ,Y , F) where Z = e0(X). Similar to Robinson et al. (2020), we split the learning procedure into two supervised tasks as depicted in Algorithm 1. 
In the algorithm, we replace (ℓfg, PZ,Y , F) with (ℓfg, ˆPZ,Y , F). We will show that (ℓfg, ˆPZ,Y , F) satisfies the weak central condition. Proposition 7. Assume that ℓcg(w, w ) = 1 {w = w } and that ℓfg is bounded by B > 0, F is L-Lipschitz relative to E, and that (ℓfg, PZ,Y , F) satisfies ϵ-weak central condition. Then (ℓfg, ˆPZ,Y , F) satisfies the ϵ + O exp(ηB) η Ratem (E, PS,W ) -weak central condition with probability at least 1 δ. Proof. Note that 1 η log E ˆ PZ,Y exp η(ℓfg f ℓfg f ) = 1 η log EPX,Y exp η(ℓfg f ˆe ℓfg f ˆe) To prove that (ℓfg, ˆPZ,Y , F) satisfies the central condition we therefore need to bound 1 η log EPX,Y exp η(ℓfg f ˆe ℓfg f ˆe) by some constant. 1 η log EPX,Y exp η(ℓfg f ˆe ℓfg f ˆe) η log EPX,Y,S,W exp η(ℓfg f ˆe ℓfg f ˆe) η log EPX,Y,S,W h exp(η(ℓfg f ˆe ℓfg f ˆe))1{ˆgˆe ˆe(S) = W} i + 1 η log EPX,Y,S,W h exp(η(ℓfg f ˆe ℓfg f ˆe))1{ˆgˆe ˆe(S) = W} i η log EPX,Y,S,W h exp(η(ℓfg f e0 ℓfg f e0))1{ˆgˆe ˆe(S) = W} i | {z } first term 1 η log EPX,Y,S,W h exp(η(ℓfg f ˆe ℓfg f ˆe))1{ˆgˆe ˆe(S) = W} i | {z } second term The third line follows from the fact that for any f in the event {ˆgˆe ˆe(S) = W} we have ℓfg f ˆg = This is because |ℓfg f ˆe(X, Y ) ℓfg f e0(X, Y )| Lℓcg(gˆe ˆe(S), ge0 e0(S)) = Lℓcg(W, W) = 0. η log EPX,Y,S,W h exp(η(ℓfg f e0 ℓfg f e0)) i after we drop the 1{ˆgˆe ˆe(S) = W}. It is bounded by ϵ with the weak central condition. The second term is bounded by Published as a conference paper at ICLR 2024 1 η log EPX,Y,S,W h exp(η(ℓfg f ˆe ℓfg f ˆe))1{ˆgˆe ˆe(S) = W} i η log EPX,Y,S,W [exp(ηB)1{ˆgˆe ˆe(S) = W}] η log EPS,W [exp(ηB)1{ˆgˆe ˆe(S) = W}] η EPS,W [exp(ηB)1{ˆgˆe ˆe(S) = W}] η PS,W (ˆgˆe ˆe(S) = W) η Ratem(ℓcg, PS,W , E) The first inequality uses the fact that ℓfg is bounded by B. The forth line is because that log x < x. By combining this bound with ϵ bound on the first term we can get the claimed result of proposition 7. The proof of the main theorem further relies on a proposition provided by Robinson et al. (2020), as we show below: Proposition 8. Robinson et al. (2020) Suppose (ℓfg, QZ,Y , F) satisfies the ϵ-weak central condition, ℓfg is bounded by B > 0, F is L -Lipschitz in its d-dimensional parameters in the l2 norm, F is contained in Euclidean ball of radius R, and Y is compact. Then when An(ℓfg, QZ,Y , F) is ERM, the excess risk EZ,Y QZ,Y h ℓfg ˆ f (Z, Y ) ℓfg f (Z, Y ) i is bounded by, with probability at least 1 δ, where V = B + ϵ. Proof of the main theorem: If m = Ω(nβ), the Ratem(ℓcg, PS,W , E) = O( 1 mα ) = O( 1 nαβ ). proposition 7 concludes that (ℓfg, ˆPZ,Y , F) satisfies the O( 1 nαβ )-weak central condition with probability at least 1 δ. Thus by proposition 8, we can get Raten(ℓfg, ˆPZ,Y , F) = O dαβ log RL n+log 1 δ n + B nαβ . Combining bounds with proposition 5 we conclude that E h ℓfg ˆ f ˆe(X, Y ) ℓfg f e (X, Y ) i 2LRatem(ℓcg, PS,W , E) + Raten(ℓfg, ˆPZ,Y , F) O dαβ log RL n + log 1