# Robust Self-Supervised Multi-Instance Learning with Structure Awareness

Yejiang Wang¹, Yuhai Zhao¹·*, Zhengkui Wang², Meixia Wang¹
¹School of Computer Science and Engineering, Northeastern University, China
²Infocomm Technology Cluster, Singapore Institute of Technology, Singapore
wangyejiang@stumail.neu.edu.cn, zhaoyuhai@mail.neu.edu.cn, zhengkui.wang@singaporetech.edu.sg, wangmeixia@stumail.neu.edu.cn
*Corresponding author.

Multi-instance learning (MIL) is a form of supervised learning in which each example is a labeled bag containing many instances. Typical MIL strategies train an instance-level feature extractor and then aggregate the instance features into a bag-level representation using the label information. However, learning such a bag-level representation depends heavily on large amounts of labeled data, which are difficult to obtain in real-world scenarios. In this paper, we make the first attempt to propose a robust Self-supervised Multi-Instance LEarning architecture with Structure awareness (SMILES) that learns unsupervised bag representations. Our proposed approach is: 1) permutation invariant to the order of instances in a bag; 2) structure-aware, encoding the topological structure among the instances; and 3) robust against instance noise or perturbation. Specifically, to yield a robust MIL model without label information, we augment the multi-instance bag and train the representation encoder to maximize the agreement between the representations of the same bag in its different augmented forms. Moreover, to capture topological structure among nearby instances in bags, our framework learns optimal graph structures for the bags, and these graphs are optimized together with message passing layers and an ordered weighted averaging operator towards the contrastive loss. Our main theorem characterizes the permutation invariance of the bag representation. Compared with state-of-the-art supervised MIL baselines, SMILES achieves average improvements of 4.9% and 4.4% in classification accuracy on 5 benchmark datasets and the 20 Newsgroups datasets, respectively. In addition, we show that the model is robust to input corruption.

## Introduction

Multi-instance learning (MIL) is a form of supervised learning where training instances are arranged in sets, namely bags, and each bag is assigned a binary label (Yuan et al. 2021; Pal et al. 2022; Huang et al. 2022). The standard MIL assumption is that a bag is positive if it contains at least one positive instance, and negative otherwise. In the past decades, much research effort has been devoted to improving the performance of MIL by learning the bag-level representation, which implicitly utilizes bag-to-bag similarity or explicitly trains a bag classifier (Feng et al. 2021; Wang et al. 2018). For large-scale MIL scenarios like drug activity prediction, where each molecule can be represented as a bag and the instances correspond to different conformations (molecular structures) of that compound, these methods often implement bag-level MIL models with a two-stage strategy: first training an instance-level feature extractor, and then aggregating the features into bag-level representations using label information (Ilse, Tomczak, and Welling 2018).
However, it may be difficult to collect multi-instance learning datasets composed of fully labeled bags in real-world applications due to the significant labeling costs. For example, high-quality molecule data with human labeling can be costly, and it is difficult, if not impossible, to create fully labeled datasets with millions of molecules (Rong et al. 2020; Zhang et al. 2021). To tackle this challenge, in this paper we attempt to learn bag representations in a self-supervised manner, without requiring any label information.

A robust self-supervised MIL method for bag representation should fulfill the following important properties. First, it should generate a bag representation that is invariant to permutations of the set of instances. An example (i.e., a bag) in MIL is described by a set of feature instances, and the order independence of sets can be used to design models with improved efficiency and generalization (Wagstaff et al. 2022; Maron et al. 2019). Second, it should be able to capture topological structure information on tasks where the objects have inherent interactions. Previous studies on multi-instance learning typically treated instances in the bags as independently and identically distributed (Huang et al. 2022; Feng et al. 2021; Ilse, Tomczak, and Welling 2018). However, the local structures and the proximity information from nearby instances are important for MIL models (Zhang 2021). Third, it should be able to handle bag noise. It is inevitable that the provided multi-instance bags are incomplete and noisy in real-world scenarios (Chevaleyre and Zucker 2000; Luengo et al. 2021). Hence, developing robust multi-instance learning models that resist unnoticeable perturbations (e.g., missing or erroneous instances in bags) is of significant importance.

In this paper, we provide a full characterization of multi-instance learning and present a robust Self-supervised Multi-Instance LEarning architecture with Structure awareness (SMILES) that captures all the above properties. Specifically, to yield a robust MIL model without using labels, we augment the multi-instance bag and train the representation encoder to maximize the correspondence between the representations of the same bag in its different augmented forms. By maximizing their consistency, our model becomes robust to noise and perturbations of the data. To capture the geometric structure of the instances in bags, we generate learnable graph adjacency matrices for the bags, which respect the node proximity conveyed by the original instance features; these graphs are optimized together with message passing layers and the ordered weighted averaging operator towards the contrastive loss, where informative hidden connections can be discovered. Our main theorem characterizes the permutation invariance of the node and bag representations. In summary, our core contributions are three-fold:

- We propose an unsupervised learning paradigm, SMILES, for multi-instance learning, which is more practical and challenging than the existing supervised counterpart. To the best of our knowledge, this is the first attempt to learn bag representations in an unsupervised setting.
- Our self-supervised model provides approaches to augment the bag, offering robustness when the MIL data possibly contain noise. Together with graph generation for the multiple instances, the bag representation is endowed with structure awareness while remaining invariant to permutations of its constituent instances.
- Extensive experiments show that SMILES, the self-supervised model, significantly outperforms state-of-the-art supervised MIL models and is robust against common injected noise/perturbations.

## Related Studies and Preliminaries

In this section, we briefly review related studies and introduce preliminary knowledge.

**Multi-Instance Learning.** In traditional supervised learning, each learning example consists of a fixed number of values (i.e., an instance) with a label. However, in many applications, only a bag of instances is given a label, which is referred to as multi-instance learning (MIL) (Dietterich, Lathrop, and Lozano-Pérez 1997). In this paper, we follow the notation of (Gärtner et al. 2002). $\mathcal{X} \subseteq \mathbb{R}^{d_{in}}$ denotes the instance space and $\Omega$ the set of labels $y$. In MIL, the label is assumed to be binary, so $\Omega = \{\top, \bot\}$. A multi-instance concept is a map $\nu_{mi} : 2^{\mathcal{X}} \to \Omega$ defined as

$$\nu_{mi}(X) = \top \iff \exists\, x \in X : c(x) = \top \tag{1}$$

where $c \in \mathcal{C}$ is a concept from a concept space $\mathcal{C}$, and $X \subseteq \mathcal{X}$ is a set of instances. Supervised MIL aims to predict the labels of new bags based on a labeled training dataset $\mathcal{D} = \{(X, y)\}$ (Lin et al. 2022; Chu et al. 2020). In this work, we are interested in learning from multi-instance data with no supervision. Self-supervised learning (SSL), emerging as a learning paradigm that enables training on massive unlabeled data, has recently received considerable attention (Von Kügelgen et al. 2021; Reed et al. 2022). However, as far as we know, no prior work has explored self-supervised learning for the multi-instance problem. Formally, self-supervised multi-instance learning should learn a bag representation via a function $f_{rep} : 2^{\mathcal{X}} \to \mathbb{R}^{d_{out}}$ that transforms the multi-instance bag $X$ into a $d_{out}$-dimensional vector $f_{rep}(X) = (a_1, \ldots, a_{d_{out}})$ without labels.

**Multi-Instance Noise.** Obtaining multi-instance models that are robust to perturbation/noise has been an active research topic (Chevaleyre and Zucker 2000; Luengo et al. 2021). In this paper, we study the set of training perturbations $U_p(\Lambda) = \{\delta \in \mathbb{R}^{d_{in}} : \|\delta\|_p \le \Lambda\}$, where $\delta$ denotes the measurement error, $\|\cdot\|_p$ is the $\ell_p$-norm, and $\Lambda$ controls the amplitude of the perturbations. Formally, the generation of noise is a black-box feedback mechanism which, when called at instance $x$, returns a random vector $g(x; \delta)$ with $\delta$ drawn from some (complete) probability space $(U_p(\Lambda), \mathcal{F}, P)$ that is independent of the value of $x$. Therefore, the oracle draws an i.i.d. sample $\delta \sim U_p(\Lambda)$ and returns an observed instance $g(x; \delta) = x + \delta$. In the supervised setting, the simplest and most straightforward way to defend against such noise is to minimize the loss on measurement-error examples:

$$\operatorname*{argmin}_{\theta} \; \mathbb{E}_{(X,y)\sim\mathcal{D},\,\delta\sim U_p(\Lambda)} \; \mathcal{L}_{ce}\big(\theta, \{g(x;\delta) \mid x \in X\}, y\big) \tag{2}$$

where $\mathcal{L}_{ce}$ is the cross-entropy loss and $\theta$ denotes the parameters.

**Multi-Instance Structure.** Multi-instance structure learning targets jointly learning a graph structure and the corresponding representation to improve the expressiveness of MIL models (Pal et al. 2022; Zhao et al. 2021). It aims to learn functions from sets of $n$ instances $X$ to graphs with $n$ nodes. Let $G = (V, E, X')$ be an undirected graph with respect to $X$, where $V = \{v_1, \ldots, v_n\}$ is a set of vertices, every vertex $v_i$ corresponds to a $d'$-dimensional vector $x'_i \in X'$, and $E$ is a similarity matrix over the vertices whose element $e_{i,j}$ denotes the weight of the edge between $v_i$ and $v_j$.
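To make these preliminaries concrete, here is a minimal numpy sketch of the multi-instance concept in Eq.(1) and the noise oracle $g(x;\delta)$; the threshold concept $c$ and the rescaled-Gaussian draw from $U_p(\Lambda)$ are illustrative assumptions of this sketch rather than choices made in the paper.

```python
import numpy as np

def nu_mi(bag, concept):
    """Eq.(1): a bag is positive (top) iff at least one instance satisfies c."""
    return any(concept(x) for x in bag)

def noise_oracle(x, p=2, lam=0.1, rng=None):
    """g(x; delta) = x + delta, with delta drawn from U_p(lam) = {d : ||d||_p <= lam}."""
    rng = rng or np.random.default_rng()
    delta = rng.normal(size=x.shape)
    norm = np.linalg.norm(delta, ord=p)
    if norm > lam:
        delta *= lam / norm  # rescale into the l_p ball of radius lam
    return x + delta

# toy usage with a hypothetical concept: c(x) holds when the first feature > 0.5
bag = [np.array([0.7, 0.1]), np.array([0.2, 0.3])]
print(nu_mi(bag, lambda x: x[0] > 0.5))  # True -> positive bag
print(noise_oracle(bag[0]))              # an observed (corrupted) instance
```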
Another way to see a graph with features is as a tensor of order 2: $G \in (F^n, F^{n^2})$, where $F$ denotes an arbitrary finite-dimensional space of the form $\mathbb{R}^q$ (for various values of $q$), typically representing the feature space. Here, $X' \in F^n$ and $E \in F^{n^2}$. Multi-instance structure learning considers learning a function $f_{sl} : 2^{\mathcal{X}} \to (F^n, F^{n^2})$ that maps the input space $2^{\mathcal{X}}$ to the graph space. Intuitively, if $x_i$ and $x_j$ are nearest neighbors in $X \in 2^{\mathcal{X}}$ with a high degree of similarity, the corresponding vertices should be close to one another.

**Permutation Invariance.** In multi-instance learning tasks, the bag representation we want to learn should be invariant to any permutation of the instances in the bag. In addition, for the learned multi-instance structure (i.e., the graph), it is important to ensure that the model remains permutation invariant to the structure (Zhang et al. 2022; Zaheer et al. 2017; Wagstaff et al. 2019). For a multi-instance bag, an instance permutation action $\pi \in S_B$ is a left action $\phi : S_B \times 2^{\mathcal{X}} \to 2^{\mathcal{X}}$ applying the element $\pi$ to a sorted sequence of $n$ instances $X = (x_1, \ldots, x_n)$ of a bag, outputting the corresponding permuted sequence of instances, i.e., $\phi(\pi, X) = (x_{\pi(1)}, \ldots, x_{\pi(n)})$. A map $f : 2^{\mathcal{X}} \to \mathbb{R}^{d_{out}}$ satisfying $f \circ \phi(\pi, X) = f(X)$ for all $\pi \in S_B$ and $X \in 2^{\mathcal{X}}$ is called permutation invariant. For the graph generated from a bag, a vertex permutation action $\pi \in S_G$ is defined in a similar way: $\phi : S_G \times V \to V$ with $\phi(\pi, V) = (v_{\pi(1)}, \ldots, v_{\pi(n)})$. The permutation action $\pi \in S_G$ also acts on any vector defined over the nodes $V$, i.e., $(x_i) \in F^n$, and outputs an equivalent vector with the order of the nodes permuted, i.e., $(x_{\pi(i)}) \in F^n$. A function $f$ acting on a graph $G$, given by $f : (F^n, F^{n^2}) \to \mathbb{R}^{d_{out}}$, is $G$-invariant whenever it is invariant to any vertex permutation action $\pi \in S_G$ in the $(F^n, F^{n^2})$ graph space, i.e., $f \circ \phi(\pi, V) = f(V)$, and all isomorphic graphs obtain the same representation.

## Methodology

In this section, we present a robust self-supervised multi-instance learning method with structural awareness, named SMILES. Given an input bag, SMILES aims to learn a self-supervised representation of the bag by maximizing the consistency between two augmented views of the input bag via a contrastive loss in the latent space. To capture the structural relations among instances, we generate a multi-instance graph in a learnable manner. The bag representation is obtained by encoding the graph with message passing layers and the ordered weighted averaging operator, and the permutation invariance of the unified representation encoder for the bag is theoretically guaranteed. We summarize all the steps of our framework in Algorithm 1.

### Bag Augmentation

A way of inducing inductive bias for multi-instance learning is data augmentation, which we apply to the bag data and which plays a prominent role in robustness learning overall. Define $\mathcal{A} : 2^{\mathcal{X}} \to 2^{\mathcal{X}}$ as the function class of augmentations and $\mathcal{F}_{rep}$ as the class of representation encoders. For $f_a \in \mathcal{A}$ and any $X \in 2^{\mathcal{X}}$, we define

$$f_a(X) = \big\{\, x' \mid x' = g(x; \delta),\; x \in X,\; \delta \sim U_p(\Lambda) \,\big\} \tag{3}$$

Suppose there exists $f \in \mathcal{F}_{rep}$ that satisfies $f(f_a(X)) = f(X)$ for $X$. The noise perturbation can provide contrastive information of various magnitudes for the encoder to learn the representations.
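As a hedged sketch of Eq.(3), the following `f_a` perturbs every instance of a bag independently with bounded noise; the Gaussian draw rescaled into the $\ell_2$ ball is one assumed way of instantiating a sample from $U_p(\Lambda)$.

```python
import numpy as np

def f_a(bag, lam=0.1, rng=None):
    """Eq.(3): return an augmented view of a bag, one noise draw per instance."""
    rng = rng or np.random.default_rng()
    view = []
    for x in bag:
        delta = rng.normal(size=x.shape)
        norm = np.linalg.norm(delta)
        if norm > lam:
            delta *= lam / norm  # enforce ||delta||_2 <= lam, i.e. delta in U_2(lam)
        view.append(x + delta)
    return view

# two independent augmentations of the same bag form a positive pair for Eq.(4)
bag = [np.ones(5), np.zeros(5), np.full(5, 0.5)]
view1, view2 = f_a(bag), f_a(bag)
```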
Thus, following the supervised formulation of Eq.(2), the self-supervised noise-resistant multi-instance learning objective for an instance-wise perturbation can be given as

$$\operatorname*{argmin}_{\theta} \; \mathbb{E}_{X \sim \mathcal{D}} \; \mathcal{L}_{\theta}\big(X, \{f_a(X)\}, \{X^-\}\big) \tag{4}$$

where $\{X^-\}$ are the negative bags for $X$, i.e., bags of other examples, and the contrastive loss $\mathcal{L}_{\theta}$ can be defined as

$$\mathcal{L}_{\theta}\big(X, \{X^+\}, \{X^-\}\big) = -\log \frac{\sum_{z^+ \in \{z^+\}} \exp\big(\cos(z, z^+)/\tau\big)}{\sum_{z' \in \{z^+\} \cup \{z^-\}} \exp\big(\cos(z, z')/\tau\big)} \tag{5}$$

where $\tau$ is a temperature, $\cos(u, v) = u^{\top}v / (\|u\|\,\|v\|)$ denotes cosine similarity, and $z$, $\{z^+\}$ and $\{z^-\}$ are the latent vectors obtained by the representation encoder from $X$, $\{X^+\}$ and $\{X^-\}$, respectively, e.g., $z = f_{rep}(X)$.

### Bag Structure Awareness

The construction of a meaningful graph topology plays a crucial role in the effective representation and analysis of multi-instance data. However, a natural choice of graph is not readily available from a bag, and it is thus desirable to infer or learn a graph topology from the instances in the bag. We generate the feature similarity matrix $E \in F^{n^2}$, which determines the possibility of an edge between two nodes based on node features. Specifically, for each node $v_i$ with feature vector $x_i \in X$, we adopt a non-linear feature mapping layer $f_{nl} : \mathbb{R}^{d_{in}} \to F$ to project the feature $x_i$ to a $d'$-dimensional latent feature

$$x'_i = f_{nl}(x_i) := \sigma\big(x_i W_{nl} + b_{nl}\big) \tag{6}$$

where $\sigma(\cdot)$ denotes a non-linear activation function, and $W_{nl} \in \mathbb{R}^{d_{in} \times d'}$ and $b_{nl} \in \mathbb{R}^{1 \times d'}$ denote the mapping matrix and the bias vector, respectively. Then, we perform metric learning on the latent features and obtain the learned feature similarity graph $E \in F^{n^2}$, where the edge between nodes $v_i$ and $v_j$ is obtained by

$$E[i, j] = s\big(x'_i, x'_j\big)\,\big\llbracket s\big(x'_i, x'_j\big) \ge \epsilon \big\rrbracket \tag{7}$$

where $\llbracket \cdot \rrbracket$ is the Iverson bracket, i.e., 1 whenever the condition in the bracket is satisfied, and 0 otherwise. $\epsilon \in [0, 1]$ is the threshold that controls the sparsity of the feature similarity graph; a larger $\epsilon$ implies a sparser feature similarity graph. $s$ is a $K$-head weighted cosine similarity function

$$s\big(x'_i, x'_j\big) = \frac{1}{K} \sum_{k=1}^{K} \cos\big(w_k \odot x'_i,\; w_k \odot x'_j\big) \tag{8}$$

where $\odot$ denotes the Hadamard product, and $W_h = [w_k]$ is the learnable parameter matrix of $s$ that weights the importance of the different dimensions of the latent feature vectors. By performing metric learning as in Eq.(7) and ruling out edges with little feature similarity via the threshold $\epsilon$, we learn the candidate feature similarity graph $(X', E) = f_{sl}(X)$.

### Bag Representation

Given a graph $G = (V, E, X')$ generated above, to learn vertex representations for every vertex $v \in V$ we use a message passing framework on the generated graph, which preserves adjacency information between nodes as follows. Let $h^{\ell}_i \in F_{\ell}$ denote the feature at layer $\ell$ associated with node $i$; the updated feature $h^{\ell+1}_i$ is obtained as $h^{\ell+1}_i = f_{upd}(h^{\ell}_i, \{\{h^{\ell}_j \mid j \in N_i\}\})$, where $j \in N_i$ means that nodes $j$ and $i$ are neighbors in the graph $G$, i.e., $(i, j) \in E$, and the function $f_{upd} : 2^{F_{\ell}} \to F_{\ell+1}$ is a learnable function taking as input the feature vector of the center vertex $h^{\ell}_i$ and the multiset of features of the neighboring vertices $\{\{h^{\ell}_j \mid j \in N_i\}\}$. Indeed, any such function $f_{upd}$ can be approximated by a layer of the form

$$h^{\ell+1}_i = \sigma\Big(W^{\ell} \cdot \big[\, h^{\ell}_i \,\|\, f^{\ell}\big(h^{\ell}_i, \{\{h^{\ell}_j \mid j \in N_i\}\}\big) \big]\Big) \tag{9}$$

where $f^{\ell} : 2^{F_{\ell}} \to F_{\ell+1}$ is an injective set function in the $\ell$-th layer, $\|$ denotes vector concatenation, $W^{\ell}$ is a learnable weight matrix, and $\sigma$ is an element-wise activation function. We thus get the $\ell$-th message passing layer $f^{\ell}_{mp} : F_{\ell} \to F_{\ell+1}$ (note that $f_{mp}$ depends implicitly on the graph/edges). Then, by composing the $f^{\ell}_{mp}$, we obtain the novel representation of each instance in the bag,

$$x'_i = f^{L}_{mp} \circ \cdots \circ f^{2}_{mp} \circ f^{1}_{mp}(x'_i) \tag{10}$$

where $L$ denotes the total number of layers used and, with a slight abuse of notation, $x'_i$ on the left-hand side denotes the final-layer representation of instance $i$.
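The sketch below walks through Eqs.(6)-(9) on a single bag: the non-linear feature map, the $K$-head weighted cosine similarity with its Iverson-bracket threshold, and one message passing layer whose set function is a plain neighbor mean; the random weights and tanh activation stand in for learned parameters and are assumptions of this illustration, not choices fixed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d, K, eps = 4, 8, 16, 4, 0.2  # instances, dims, similarity heads, threshold

X = rng.normal(size=(n, d_in))                     # instances of one bag
W_nl, b_nl = rng.normal(size=(d_in, d)), np.zeros(d)
Xp = np.tanh(X @ W_nl + b_nl)                      # Eq.(6): x'_i = sigma(x_i W_nl + b_nl)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

W_h = rng.normal(size=(K, d))                      # one weight vector w_k per head
E = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        s = np.mean([cos(W_h[k] * Xp[i], W_h[k] * Xp[j]) for k in range(K)])  # Eq.(8)
        E[i, j] = s * (s >= eps)                   # Eq.(7): Iverson bracket sparsifies

# Eq.(9) with f^l = neighbor mean (self-loops kept for simplicity)
W = rng.normal(size=(2 * d, d))
agg = np.stack([Xp[E[i] > 0].mean(axis=0) if (E[i] > 0).any() else np.zeros(d)
                for i in range(n)])
H1 = np.tanh(np.concatenate([Xp, agg], axis=1) @ W)  # h^1_i, one layer of f_mp
```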
**Theorem 1 (Node Representation).** Let $G = (V, E, X') \in (F^n, F^{n^2})$ be a node-featured graph and let $\psi(v, V, E, X') : V \times (F^n, F^{n^2}) \to \mathbb{R}^{d_{out}}$ be the vertex representation function on $v \in V$ given by Eq.(10). Then, for all permutation actions $\pi \in S_G$, $\psi(v, V, E, X') = \psi(\phi(\pi, v), \phi(\pi, V), E, \phi(\pi, X'))$. This implies that the map $\psi$ is $G$-invariant for any node.

*Proof.* Suppose, for contradiction, that there exist two different vertex permutation actions $\pi, \pi' \in S_G$ with $\psi(\phi(\pi, v), \phi(\pi, V), E, \phi(\pi, X')) \ne \psi(\phi(\pi', v), \phi(\pi', V), E, \phi(\pi', X'))$. This would mean that, for different orderings of the nodes in the graph, the same vertex may get different representations. For each layer $f^{\ell}_{mp}$ on the vertex $v$ there is a corresponding map $f^{\ell}_{mp,v} : \mathbb{R}^{d_{\ell}} \to \mathbb{R}^{d_{\ell+1}}$. Let $\ell = 1$; expanding $f^{\ell}_{mp,v}$ for the two vertex permutation actions and applying the cancellation law of groups, since $h^0_v = x'_v$ is identical for these two permutations, $h^1_v$ is equivalent as well (the set function $f^{\ell}$ acts on the multiset of neighbor features, which is unchanged by a reordering). However, according to the previous assumption this is not possible. By induction on $\ell \ge 2$, if the contradiction holds for some $\ell$, it holds for $\ell + 1$ as well, which concludes the proof.

However, the above process generates only representations of the instances in a bag; hence an aggregation operation is required that summarizes those representations into a single element. To do this, the representation of any bag is obtained using the ordered weighted averaging (OWA) operator (Yager 1988) $f_{owa} : F^{n}_{L+1} \to \mathbb{R}^{d_{out}}$ as

$$z = f_{owa}\big(\{x'_i \mid i \in [n]\};\, \zeta\big) = \sum_{i=1}^{n} \zeta_i\, x'_{(i)} \tag{11}$$

where $x'_{(i)}$ is the $i$-th largest element in the set $\{x'_i \mid i \in [n]\}$, and $\zeta = [\zeta_1, \ldots, \zeta_n]$ is a parameter vector associated with $f_{owa}$ such that each $\zeta_i$ is nonnegative and $\sum_{i=1}^{n} \zeta_i = 1$. The OWA operator can be seen as a generalization of any aggregation operation over a set of values. For example, the maximum operator over a set of values can be modeled with the weight vector $[1, 0, \ldots, 0]$. In this work we choose $[1/n, \ldots, 1/n]$ for averaging aggregation. Based on Theorem 1 and the permutation invariance of the OWA operator, we easily obtain the following result.

**Theorem 2 (Bag Representation).** The representation encoder $f_{rep}(X) = f_{owa}(\{f^{L}_{mp} \circ \cdots \circ f^{2}_{mp} \circ f^{1}_{mp} \circ f_{sl} \circ g(x_i; \delta) \mid x_i \in X\})$ is permutation invariant for bag $X$.

We further introduce several practical data transformations to approximate the perturbation distribution in Eq.(3).

**Remark (Bag Augmentation).** For the model $f_{rep}$, randomly dropping, masking, replacing, or randomizing instances in bags are all special cases of bag augmentation in Eq.(3): dropping randomly removes a certain ratio of instances; masking randomly sets a certain ratio of instance elements to zero; replacing randomly replaces a certain ratio of the elements of an instance with the corresponding elements of another randomly chosen instance in the bag; randomizing randomly assigns random vectors to a certain ratio of instances.

**Algorithm 1: SMILES**
1: **input:** unlabeled training data $X \subseteq \mathcal{X}$, batch size $N$, temperature $\tau$, augmentation ratio $c$, encoder network $f_{rep}$, pre-train head network $g$.
2: **for** sampled mini-batch $\{X_i\}_{i=1}^{N} \subseteq X$ **do**
3: for $i \in [N]$, $\check{X}_i = f_a(X_i)$, $\tilde{X}_i = f_a(X_i)$. # generate corrupted views
4: let $(\check{X}'_i, \check{E}_i) = f_{sl}(\check{X}_i)$, $(\tilde{X}'_i, \tilde{E}_i) = f_{sl}(\tilde{X}_i)$, $i \in [N]$. # generate multi-instance graphs
5: let $\check{z}'_i = g(f_{rep}(\check{X}'_i))$, $\tilde{z}'_i = g(f_{rep}(\tilde{X}'_i))$, $i \in [N]$. # embeddings for views
6: let $t_{i,j} = \check{z}'^{\top}_i \tilde{z}'_j / (\|\check{z}'_i\|_2 \|\tilde{z}'_j\|_2)$, $i, j \in [N]$. # pairwise similarity
7: define $\mathcal{L}_{\theta} := -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(t_{i,i}/\tau)}{\frac{1}{N}\sum_{k=1}^{N} \exp(t_{i,k}/\tau)}$
8: update networks $f_{rep}$ and $g$ to minimize $\mathcal{L}_{\theta}$ by SGD.
9: **end for**
10: **return** encoder network $f_{rep}$.
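To tie Eq.(11) to Algorithm 1, here is a minimal numpy sketch of the OWA readout and the mini-batch contrastive loss of lines 6-7; the encoder is abstracted into precomputed view embeddings, the mean weights $\zeta_i = 1/n$ follow the paper's choice, and sorting instances by norm is only an assumed way of realizing the "$i$-th largest element" for vector-valued inputs.

```python
import numpy as np

def f_owa(H, zeta=None):
    """Eq.(11): ordered weighted average of instance representations H (n x d)."""
    n = len(H)
    zeta = np.full(n, 1.0 / n) if zeta is None else np.asarray(zeta)
    order = np.argsort(-np.linalg.norm(H, axis=1))  # 'largest first'; moot for mean weights
    return (zeta[:, None] * H[order]).sum(axis=0)

def contrastive_loss(Z1, Z2, tau=0.2):
    """Algorithm 1, lines 6-7: diagonal entries are the positive pairs."""
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    T = np.exp(Z1 @ Z2.T / tau)                     # exp(t_{i,j} / tau)
    return float(-np.mean(np.log(np.diag(T) / T.sum(axis=1))))

# mock bag-level view embeddings for a batch of N bags
N, d_out = 8, 32
z = f_owa(np.random.randn(5, d_out))                # one bag readout
Z1, Z2 = np.random.randn(N, d_out), np.random.randn(N, d_out)
print(contrastive_loss(Z1, Z2))
```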
**Architecture.** Based on the above analysis, we propose the SMILES framework (Algorithm 1) as follows.
(i) *Bag data augmentation.* The given bag $X$ undergoes bag augmentations to obtain two correlated views $\check{X}, \tilde{X} := f_a(X)$ as a positive pair. In practice, according to the Remark, our view augmentation methods include instance dropping, masking, replacing, and randomizing.
(ii) *Bag structure awareness.* The multi-instance graph is generated by $(X', E) = f_{sl}(X)$ according to Eqs.(6) and (7) for the augmented bags, which captures the structural interactions between instances in the bags.
(iii) *Encoder.* A representation encoder $f_{rep}(\cdot)$ extracts bag-level representation vectors $\check{z}, \tilde{z}$ for the augmented bags using the above graphs $\check{X}', \tilde{X}'$. The message passing layers in the two encoders share parameters during pre-training.
(iv) *Projection head.* A non-linear function $g(\cdot)$ named the projection head maps the representations to another latent space where the contrastive loss is calculated. In our work, a two-layer perceptron is applied to obtain $\check{z}' = g(\check{z})$, $\tilde{z}' = g(\tilde{z})$.
(v) *Contrastive loss.* A contrastive loss function $\mathcal{L}_{\theta}(\cdot)$ (Eq.(5)) is defined to enforce maximizing the consistency between the positive pair $\check{z}', \tilde{z}'$ compared with negative pairs.

## Experiments

We empirically evaluate SMILES against state-of-the-art supervised multi-instance learning algorithms on five popular benchmark datasets, twenty text datasets from the 20 Newsgroups corpus, and three datasets for the task of biocreative text categorization (see the Appendix for details).

| Algorithm | Average | MUSK1 | MUSK2 | FOX | TIGER | ELEPHANT |
|---|---|---|---|---|---|---|
| mi-SVM | 77.9 | 87.4 | 83.6 | 58.2 | 78.4 | 82.2 |
| MI-SVM | 77.7 | 77.9 | 84.3 | 57.8 | 84.2 | 84.3 |
| MI-Kernel | 81.2 ± 2.0 | 88.0 ± 3.1 | 89.3 ± 1.5 | 60.3 ± 2.8 | 84.2 ± 1.0 | 84.3 ± 1.6 |
| EM-DD | 76.5 ± 4.4 | 84.9 ± 4.4 | 86.9 ± 4.8 | 60.9 ± 4.5 | 73.0 ± 4.3 | 77.1 ± 4.3 |
| mi-Graph | 82.8 ± 3.7 | 88.9 ± 3.3 | 90.3 ± 3.9 | 62.0 ± 4.4 | 86.0 ± 3.7 | 86.9 ± 3.5 |
| MI-VLAD | 80.4 ± 4.0 | 87.1 ± 4.3 | 87.2 ± 4.2 | 62.0 ± 4.4 | 81.1 ± 3.9 | 85.0 ± 3.6 |
| mi-FV | 81.5 ± 4.1 | 90.9 ± 4.2 | 88.4 ± 4.2 | 62.1 ± 4.9 | 81.3 ± 3.7 | 85.2 ± 3.6 |
| MI-SDB | 87.4 ± 3.7 | **93.1 ± 4.0** | 91.2 ± 4.1 | 78.9 ± 3.4 | 86.5 ± 4.2 | 87.5 ± 3.0 |
| BDR | 84.5 ± 3.5 | 92.4 ± 2.7 | 90.3 ± 5.2 | 62.8 ± 3.4 | 86.9 ± 3.6 | 90.2 ± 2.9 |
| mi-Net | 80.8 ± 3.8 | 88.9 ± 3.9 | 85.8 ± 4.9 | 61.3 ± 3.5 | 82.4 ± 3.4 | 85.8 ± 3.7 |
| MI-Net | 81.2 ± 3.8 | 88.7 ± 4.1 | 85.9 ± 4.6 | 62.2 ± 3.8 | 83.0 ± 3.2 | 86.2 ± 3.4 |
| MI-Net (DS) | 82.3 ± 3.8 | 89.4 ± 4.2 | 87.4 ± 4.3 | 63.0 ± 3.7 | 84.5 ± 3.9 | 87.2 ± 3.2 |
| MI-Net (RC) | 81.6 ± 4.2 | 89.8 ± 4.3 | 87.3 ± 4.4 | 61.9 ± 4.7 | 83.6 ± 3.7 | 85.7 ± 4.0 |
| Attention | 81.4 ± 3.5 | 89.2 ± 4.0 | 85.8 ± 4.8 | 61.5 ± 4.3 | 83.9 ± 2.2 | 86.8 ± 2.2 |
| Gated-Attention | 81.3 ± 3.3 | 90.0 ± 5.0 | 86.3 ± 4.2 | 60.3 ± 2.9 | 84.5 ± 1.8 | 85.7 ± 2.7 |
| B-Graph | 82.2 ± 3.0 | 89.7 ± 3.7 | 87.1 ± 2.8 | 64.0 ± 4.1 | 82.9 ± 2.2 | 87.5 ± 2.4 |
| SMILES | **92.9 ± 2.1** | 92.7 ± 1.2 | **96.2 ± 1.6** | **85.5 ± 4.3** | **92.0 ± 1.5** | **98.2 ± 2.0** |

Table 1: Mean and standard error (when available) of classification accuracy (in %) for the benchmark MIL datasets. The best results in each column are shown in bold. Higher accuracies are better.

Since there is currently no unsupervised MIL algorithm, to evaluate the proposed SMILES we use three categories of supervised baselines: (i) the instance space approaches, including mi-SVM and MI-SVM (Andrews, Tsochantaridis, and Hofmann 2002), EM-DD (Zhang and Goldman 2001), MI-VLAD and mi-FV (Wei, Wu, and Zhou 2017); (ii) the bag space methods, including MI-Kernel (Gärtner et al. 2002),
mi-Graph (Zhou, Sun, and Li 2009), BDR (Huang et al. 2022) and MI-SDB (Feng et al. 2021); and (iii) the embedding space methods mi-Net and MI-Net (Wang et al. 2018), the Attention and Gated-Attention neural networks (Ilse, Tomczak, and Welling 2018), and B-Graph (Pal et al. 2022); these methods use neural networks or attention to learn embeddings of the bags. For the baselines, we set the hyper-parameters as suggested by their authors. For SMILES, we report the mean 10-fold cross-validation accuracy after 5 runs, followed by a linear SVM. The linear SVM is trained by applying cross-validation on the training folds, and the best mean accuracy is reported. We conduct experiments with the number of message passing layers, the number of epochs, the batch size, the SVM parameter $C$, the threshold $\epsilon$, the augmentation ratio $c$, and the temperature $\tau$ chosen from the sets $\{2, 4, 8, 12\}$, $\{10, 20, 40, 100\}$, $\{32, 64, 128, 256\}$, $\{10^{-3}, \ldots, 10^{2}, 10^{3}\}$, $\{0.1, \ldots, 0.5\}$, $\{10\%, \ldots, 50\%\}$ and $\{0.05, 0.1, 0.2, 0.5, 1.0, 2.0\}$, respectively. The hidden dimension of each layer is set to 128. Based on the Remark, the augmentation strategies include dropping, masking, replacing, and randomizing instances.

### MIL Benchmark Datasets

We first evaluate our proposed framework on the benchmark datasets MUSK1 and MUSK2 (Dietterich, Lathrop, and Lozano-Pérez 1997) for drug activity prediction, and FOX, TIGER, and ELEPHANT (Andrews, Tsochantaridis, and Hofmann 2002) for image classification. Table 1 shows the MIL result of each algorithm. It is observed that the deep learning approaches are not well suited to these datasets, as the datasets consist of precomputed features and the bags are relatively small. Surprisingly, however, SMILES not only outperforms all deep supervised models but also achieves state-of-the-art results with respect to the traditional supervised MIL algorithms on these small datasets. For example, on the FOX dataset the accuracy of SMILES is 21.5% higher than B-Graph, the best deep baseline, and 6.6% higher than MI-SDB, the best traditional algorithm. Table 1 also lists the average accuracy over the 5 benchmark datasets, on which SMILES achieves the best performance as well.

### 20 Newsgroups

In this section, we conduct experiments on the 20 Newsgroups corpus (Zhou, Sun, and Li 2009). It contains posts from newsgroups on 20 subjects. When one of the subjects is selected as the positive class, all 19 other subjects are used as the negative class. The bags are collections of posts from different subjects. The classification accuracy in comparison to supervised models is summarized in Table 2. From Table 2 we observe that all neural network based models outperform the classical MIL models on average in this task, which suggests that using neural networks yields better performance on these corpus data. We can also see that our method significantly outperforms all the baselines in all cases except misc.forsale, sci.electronics and rec.motorcycles. For example, SMILES outperforms the second best algorithm by 13.4% on talk.politics.mideast, 11.4% on sci.crypt, 11.2% on rec.autos, 10.0% on sci.space, 10.4% on talk.politics.guns, and 10.8% on talk.politics.misc.
Moreover, the average classification accuracy over all 20 multi-instance datasets indicates that our method outperforms the other baselines, including MI-Kernel, mi-Graph, mi-FV, MI-SDB, mi-Net, MI-Net and its variants, and B-Graph, by about 12.5% in performance. For example, the average accuracy of SMILES is better than that of B-Graph by 5.5%, MI-SDB by 9.8%, and mi-Graph by 19.3%.

| Dataset | SMILES | MI-Kernel | mi-Graph | mi-FV | MI-SDB | mi-Net | MI-Net | MI-Net (DS) | MI-Net (RC) | B-Graph |
|---|---|---|---|---|---|---|---|---|---|---|
| alt.atheism | **89.1 ± 2.4** | 60.2 ± 3.9 | 65.5 ± 4.0 | 84.5 ± 1.4 | 85.7 ± 1.2 | 83.1 ± 2.3 | 84.7 ± 1.8 | 84.4 ± 2.0 | 83.6 ± 1.5 | 88.5 ± 2.2 |
| comp.graphics | **90.3 ± 1.3** | 47.0 ± 3.3 | 77.8 ± 1.6 | 59.6 ± 5.8 | 74.6 ± 1.1 | 81.7 ± 0.6 | 82.0 ± 1.5 | 81.9 ± 0.5 | 81.5 ± 0.9 | 80.1 ± 3.2 |
| comp.os.ms-windows.misc | **72.5 ± 3.7** | 51.0 ± 5.2 | 63.1 ± 1.5 | 61.3 ± 1.2 | 67.1 ± 3.2 | 70.4 ± 1.7 | 70.7 ± 1.1 | 70.9 ± 1.1 | 70.7 ± 1.4 | 71.9 ± 3.6 |
| comp.sys.ibm.pc.hardware | **80.0 ± 2.7** | 46.9 ± 3.6 | 59.5 ± 2.7 | 65.9 ± 3.4 | 67.8 ± 2.8 | 79.0 ± 1.8 | 78.6 ± 1.0 | 78.3 ± 1.3 | 78.5 ± 1.0 | 75.5 ± 3.4 |
| comp.sys.mac.hardware | **85.0 ± 1.1** | 44.5 ± 3.2 | 61.7 ± 4.8 | 65.9 ± 2.4 | 65.7 ± 2.3 | 79.4 ± 1.6 | 79.1 ± 1.5 | 79.7 ± 1.1 | 79.2 ± 1.9 | 79.2 ± 3.1 |
| comp.windows.x | **91.2 ± 2.4** | 50.8 ± 4.3 | 69.8 ± 2.1 | 76.9 ± 3.5 | 79.0 ± 1.9 | 79.9 ± 1.8 | 80.9 ± 1.9 | 80.1 ± 1.1 | 81.2 ± 2.7 | 86.1 ± 2.7 |
| misc.forsale | 68.4 ± 3.6 | 51.8 ± 2.5 | 55.2 ± 2.7 | 56.6 ± 2.7 | 57.2 ± 2.4 | 67.1 ± 0.9 | 66.7 ± 1.2 | 66.0 ± 1.6 | 67.2 ± 1.2 | **75.8 ± 3.5** |
| rec.autos | **90.1 ± 2.7** | 52.9 ± 3.3 | 72.0 ± 3.7 | 66.7 ± 5.4 | 77.5 ± 2.3 | 76.5 ± 1.2 | 76.9 ± 1.6 | 76.4 ± 1.6 | 76.1 ± 1.6 | 78.9 ± 3.3 |
| rec.motorcycles | 72.5 ± 2.7 | 50.6 ± 3.5 | 64.0 ± 2.8 | 80.0 ± 1.6 | **85.8 ± 1.9** | 83.4 ± 1.1 | 84.2 ± 1.0 | 83.5 ± 1.5 | 83.3 ± 1.3 | 85.5 ± 2.4 |
| rec.sport.baseball | **95.0 ± 3.6** | 51.7 ± 2.8 | 64.7 ± 3.1 | 78.0 ± 2.7 | 82.1 ± 2.5 | 86.0 ± 1.6 | 86.7 ± 1.7 | 85.7 ± 2.5 | 87.1 ± 1.4 | 83.5 ± 3.1 |
| rec.sport.hockey | **93.4 ± 2.6** | 51.3 ± 3.4 | 85.0 ± 2.5 | 82.4 ± 4.2 | 90.8 ± 1.8 | 89.0 ± 1.7 | 90.2 ± 1.4 | 91.1 ± 1.6 | 89.8 ± 1.1 | 90.0 ± 2.3 |
| sci.crypt | **93.3 ± 3.1** | 56.3 ± 3.6 | 69.6 ± 2.1 | 76.1 ± 3.0 | 78.6 ± 2.1 | 79.5 ± 1.4 | 77.9 ± 1.5 | 77.8 ± 2.6 | 78.6 ± 2.3 | 81.9 ± 3.7 |
| sci.electronics | 87.5 ± 3.1 | 50.6 ± 2.0 | 87.1 ± 1.7 | 55.5 ± 1.4 | 90.1 ± 2.2 | 92.1 ± 0.8 | **93.2 ± 0.4** | 92.7 ± 0.5 | 93.1 ± 0.7 | 91.4 ± 2.9 |
| sci.med | **90.5 ± 2.4** | 50.6 ± 1.9 | 62.1 ± 3.9 | 78.4 ± 1.8 | 78.2 ± 2.7 | 85.5 ± 0.9 | 84.2 ± 0.7 | 84.7 ± 1.3 | 83.8 ± 1.4 | 79.8 ± 3.2 |
| sci.space | **98.9 ± 2.9** | 54.7 ± 2.5 | 75.7 ± 3.4 | 81.7 ± 1.5 | 83.9 ± 1.5 | 79.8 ± 1.3 | 79.5 ± 2.8 | 80.1 ± 2.6 | 80.3 ± 2.6 | 88.9 ± 2.6 |
| soc.religion.christian | **87.5 ± 3.4** | 49.2 ± 3.4 | 59.0 ± 4.7 | 81.5 ± 2.2 | 81.4 ± 2.0 | 79.9 ± 1.5 | 80.7 ± 1.7 | 80.1 ± 1.4 | 80.5 ± 2.0 | 79.6 ± 3.6 |
| talk.politics.guns | **88.6 ± 2.8** | 47.7 ± 3.8 | 58.5 ± 6.0 | 74.7 ± 1.9 | 75.8 ± 3.5 | 76.1 ± 1.9 | 78.2 ± 1.8 | 77.0 ± 2.4 | 77.3 ± 1.0 | 77.7 ± 4.9 |
| talk.politics.mideast | **97.4 ± 3.9** | 55.9 ± 2.8 | 73.6 ± 2.6 | 79.9 ± 3.4 | 80.6 ± 2.0 | 83.9 ± 1.0 | 84.0 ± 1.2 | 83.8 ± 1.0 | 83.3 ± 2.0 | 82.0 ± 3.4 |
| talk.politics.misc | **88.8 ± 4.5** | 51.5 ± 3.7 | 70.4 ± 3.6 | 69.9 ± 2.1 | 72.3 ± 2.8 | 76.5 ± 1.5 | 75.8 ± 2.3 | 76.8 ± 2.2 | 75.6 ± 1.9 | 78.0 ± 5.0 |
| talk.religion.misc | **82.5 ± 2.2** | 55.4 ± 4.3 | 63.3 ± 3.5 | 74.0 ± 3.8 | 73.9 ± 2.6 | 74.4 ± 1.5 | 76.2 ± 1.7 | 76.2 ± 1.5 | 74.3 ± 1.2 | 80.1 ± 3.6 |
| Average | **87.2 ± 2.8** | 51.5 ± 3.3 | 67.9 ± 3.1 | 72.5 ± 2.7 | 77.4 ± 2.2 | 80.1 ± 1.4 | 80.5 ± 1.5 | 80.3 ± 1.5 | 80.2 ± 1.5 | 81.7 ± 3.2 |

Table 2: Mean and standard error of classification accuracy (in %), along with the average accuracy of the algorithms, for the 20 Newsgroups datasets. The best results in each row are shown in bold. Higher accuracies are better.
### Ablation Analysis

Here we show that the augmentation selection policy and the intensity of augmentation really matter to the final results. Note that we fix one of the two augmentations as No Aug (i.e., $\check{X} = X$), and all the other augmentation methods require a hyper-parameter aug_ratio that controls the portion of instances/elements selected for augmentation. The aug_ratio is set to a constant in every experiment (20% by default). We perform ablation studies of different augmentation policies on the four datasets MUSK2, FOX, TIGER, and ELEPHANT, and of different intensities on FOX, as shown in Table 3 and Table 5, respectively.

| Augmentation | MUSK2 | FOX | TIGER | ELEPHANT |
|---|---|---|---|---|
| No Aug | 88.3 ± 1.7 | 80.5 ± 1.3 | 90.1 ± 2.9 | 92.2 ± 2.3 |
| 20% Drop | 89.9 ± 3.2 | 83.2 ± 2.8 | 90.3 ± 3.3 | 94.5 ± 3.1 |
| 20% Mask | 95.1 ± 2.0 | 85.3 ± 3.2 | 91.1 ± 2.4 | 95.0 ± 2.4 |
| 20% Replace | 92.7 ± 3.1 | 85.3 ± 1.6 | 91.0 ± 1.5 | 96.8 ± 2.0 |
| 20% Random | 91.6 ± 3.8 | 81.4 ± 2.2 | 90.7 ± 3.0 | 93.9 ± 1.5 |

Table 3: Ablation study of bag augmentations on 4 datasets.

| Aug Ratio | Drop | Mask | Replace | Random |
|---|---|---|---|---|
| 10% | 83.0 ± 1.3 | 84.5 ± 2.1 | 81.3 ± 3.2 | 82.1 ± 1.7 |
| 20% | 83.2 ± 2.8 | 85.3 ± 3.2 | 85.3 ± 1.6 | 81.4 ± 2.2 |
| 30% | 82.5 ± 2.0 | 85.0 ± 4.1 | 85.5 ± 1.6 | 82.0 ± 0.3 |
| 40% | 76.1 ± 1.6 | 83.5 ± 1.9 | 84.1 ± 3.1 | 81.4 ± 2.7 |
| 50% | 73.7 ± 1.4 | 81.3 ± 1.6 | 82.8 ± 2.0 | 79.5 ± 1.8 |

Table 5: Ablation study of the aug_ratio of bag augmentations on the FOX dataset.

We conclude that: augmenting the bag data indeed boosts the performance of the proposed algorithm; the choice of aug_ratio has a considerable effect on the final performance, and the classification performance degenerates as the intensity of augmentation grows overly high; and it is inappropriate to apply the same aug_ratio to different augmentations.

In addition, we analyze the impact of using structure for learning high-quality bag representations in unsupervised MIL. We conduct an additional ablation study over SMILES with (With Graph) or without (No Graph) bag structure awareness on the four datasets MUSK2, FOX, TIGER, and ELEPHANT, shown in Table 4.

| Structure | MUSK2 | FOX | TIGER | ELEPHANT |
|---|---|---|---|---|
| No Graph | 80.2 ± 1.4 | 65.8 ± 2.1 | 62.1 ± 4.2 | 69.2 ± 3.6 |
| With Graph | 95.5 ± 2.1 | 85.1 ± 1.3 | 91.2 ± 1.2 | 97.5 ± 1.5 |

Table 4: Ablation study of bag structure generation.

From Table 4, we observe that the accuracy of With Graph is better than that of No Graph by 15.3%, 19.3%, 29.1%, and 28.3% on these datasets, respectively, which demonstrates the importance of structure awareness in multi-instance learning.

### Robustness with Injected Noise

We compare SMILES to other state-of-the-art supervised MIL methods on clean and noise-injected versions of the three datasets COMPONENT, PROCESS, and FUNCTION for the task of biocreative text categorization (Feng et al. 2021). To produce a noisy setting close to the real world, each bag in a dataset has probability $\eta$ of being corrupted by noise. Specifically, all instances $x$ in a noisy bag $X$ are corrupted by $g(x; \delta)$, where $\delta$ is drawn from a Gaussian distribution.
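A small sketch of this corruption protocol follows, assuming zero-mean isotropic Gaussian $\delta$ with a standard deviation `sigma` that the paper does not specify:

```python
import numpy as np

def corrupt_dataset(bags, eta=0.3, sigma=0.1, rng=None):
    """Corrupt each bag with probability eta: every instance x becomes
    g(x; delta) = x + delta with delta ~ N(0, sigma^2 I); otherwise keep it clean."""
    rng = rng or np.random.default_rng()
    noisy = []
    for bag in bags:
        if rng.random() < eta:
            bag = [x + rng.normal(scale=sigma, size=x.shape) for x in bag]
        noisy.append(bag)
    return noisy

# e.g., 30% of bags corrupted, matching one setting of Table 6
bags = [[np.random.randn(5) for _ in range(4)] for _ in range(10)]
noisy_bags = corrupt_dataset(bags, eta=0.3)
```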
Table 6 reports the classification accuracy of each method on these three datasets.

| Dataset | Method | Clean ($\eta$ = 0.0) | $\eta$ = 0.1 | $\eta$ = 0.2 | $\eta$ = 0.3 | $\eta$ = 0.4 |
|---|---|---|---|---|---|---|
| COMPONENT | MI-Kernel | 86.4 ± 1.1 | 86.1 ± 1.6 | 84.7 ± 2.6 | 83.4 ± 4.2 | 81.3 ± 2.4 |
| | mi-Graph | 90.4 ± 2.6 | 89.7 ± 2.0 | 88.3 ± 2.4 | 89.0 ± 1.1 | 88.5 ± 2.6 |
| | mi-FV | 91.0 ± 1.5 | **90.8 ± 2.1** | 89.1 ± 1.8 | 89.2 ± 2.7 | 88.9 ± 2.8 |
| | mi-Net | 89.5 ± 2.4 | 89.0 ± 3.0 | 88.4 ± 2.3 | 87.6 ± 1.7 | 87.3 ± 3.4 |
| | MI-Net | 89.8 ± 2.9 | 89.3 ± 2.7 | 88.8 ± 1.0 | 87.9 ± 2.5 | 87.3 ± 1.6 |
| | Attention | 90.8 ± 1.4 | 90.1 ± 3.2 | 89.3 ± 2.9 | 89.1 ± 2.1 | 88.3 ± 1.9 |
| | Gated-Attention | **91.1 ± 3.0** | 90.0 ± 1.6 | 89.3 ± 4.3 | 89.4 ± 2.4 | 88.5 ± 1.8 |
| | B-Graph | 87.4 ± 2.9 | 87.1 ± 1.4 | 86.4 ± 2.1 | 86.3 ± 1.0 | 86.0 ± 3.5 |
| | MI-SDB | 85.3 ± 4.0 | 84.6 ± 2.8 | 84.0 ± 3.6 | 82.6 ± 1.8 | 81.7 ± 2.9 |
| | SMILES | **91.1 ± 1.5** | 90.4 ± 2.4 | **89.4 ± 2.1** | **91.0 ± 3.0** | **89.1 ± 4.1** |
| FUNCTION | MI-Kernel | 91.5 ± 2.7 | 90.6 ± 3.2 | 89.7 ± 2.6 | 89.5 ± 1.7 | 87.4 ± 1.5 |
| | mi-Graph | 92.3 ± 3.3 | 91.9 ± 1.7 | 91.6 ± 4.1 | 91.7 ± 2.6 | 90.9 ± 2.8 |
| | mi-FV | 94.0 ± 2.1 | 93.8 ± 1.6 | 93.4 ± 1.4 | 93.7 ± 2.8 | 92.6 ± 2.5 |
| | MI-Net | 91.6 ± 3.2 | 91.6 ± 1.6 | 91.3 ± 2.4 | 90.8 ± 1.9 | 89.9 ± 1.4 |
| | mi-Net | 91.7 ± 2.1 | 91.6 ± 2.2 | 91.4 ± 4.0 | 91.0 ± 1.7 | 90.3 ± 2.8 |
| | Attention | 94.3 ± 1.7 | 94.0 ± 1.2 | 93.6 ± 1.6 | 92.6 ± 1.6 | 92.3 ± 1.2 |
| | Gated-Attention | 94.6 ± 4.0 | **94.1 ± 0.7** | 93.5 ± 1.2 | 92.9 ± 2.9 | 92.6 ± 2.5 |
| | B-Graph | 91.9 ± 2.3 | 91.5 ± 2.6 | 91.4 ± 3.1 | 91.4 ± 2.9 | 91.0 ± 2.2 |
| | MI-SDB | 81.7 ± 2.7 | 74.0 ± 2.3 | 72.1 ± 1.9 | 73.0 ± 3.5 | 71.1 ± 2.8 |
| | SMILES | **94.7 ± 2.7** | **94.1 ± 2.1** | **93.7 ± 2.3** | **94.6 ± 3.2** | **92.7 ± 1.2** |
| PROCESS | MI-Kernel | 92.3 ± 2.1 | 92.1 ± 2.3 | 91.5 ± 1.6 | 91.9 ± 2.8 | 90.8 ± 1.5 |
| | mi-Graph | 93.5 ± 2.7 | 93.1 ± 2.0 | 92.8 ± 1.4 | 91.0 ± 3.1 | 90.6 ± 2.1 |
| | mi-FV | 94.3 ± 1.7 | 94.0 ± 2.3 | 92.9 ± 3.0 | 92.6 ± 2.2 | 91.7 ± 2.6 |
| | MI-Net | 94.1 ± 1.5 | 93.8 ± 1.6 | 93.7 ± 2.5 | 93.5 ± 2.3 | 92.7 ± 1.9 |
| | mi-Net | 94.7 ± 3.0 | 94.2 ± 2.9 | 93.5 ± 1.6 | 93.0 ± 2.0 | 92.8 ± 2.4 |
| | Attention | 95.7 ± 2.7 | 95.5 ± 1.3 | 95.3 ± 2.2 | 94.9 ± 1.5 | 94.5 ± 2.3 |
| | Gated-Attention | 95.9 ± 1.7 | 95.6 ± 3.4 | **95.5 ± 1.3** | 95.3 ± 2.8 | 94.2 ± 2.0 |
| | B-Graph | 93.9 ± 2.9 | 93.5 ± 2.4 | 93.2 ± 3.2 | 92.4 ± 1.4 | 91.9 ± 2.7 |
| | MI-SDB | 84.8 ± 2.0 | 84.0 ± 1.7 | 83.1 ± 1.9 | 82.5 ± 3.5 | 75.5 ± 2.6 |
| | SMILES | **96.2 ± 3.6** | **96.2 ± 1.5** | **95.5 ± 2.1** | **95.9 ± 1.8** | **95.8 ± 2.2** |

Table 6: Test accuracies (%) of different methods on the benchmark datasets, clean or with injected noise ($\eta \in \{0.1, 0.2, 0.3, 0.4\}$). The results (mean ± std) are reported, and the best results in each column per dataset are boldfaced.

We can observe that our proposed unsupervised bag representation method is clearly superior to the compared supervised MIL baselines. In most cases, we obtain a substantial improvement. On COMPONENT and FUNCTION, light noise (e.g., 10% or 20%) does not lead to much drop in the classification results; the results are even comparable to those of the other state-of-the-art methods on the clean datasets. On PROCESS, with 30% noise we improve the accuracy over the best state-of-the-art baseline by nearly 0.6%, and with 40% noise by 1.3%. This indicates that as the noise becomes more complex, the performance gap between the best supervised algorithm and SMILES further increases. Therefore, contrastive bag augmentation is a simple yet very effective technique against noise. To summarize, compared with the supervised baselines, SMILES produces improvements on the clean datasets; moreover, when the data are noisy, SMILES is more robust than its supervised counterparts.

## Conclusion

Self-supervised learning has seen success in various domains, but little progress has been made for multi-instance data. In this paper, we proposed a self-supervised multi-instance learning method, SMILES, focused on learning representations of bags that are effective in downstream classification tasks. SMILES provides a unified approach that meets a number of fundamental postulates, including permutation invariance, structure awareness and robustness, together with a theoretical analysis towards understanding our framework.
Specifically, we augment the bags and train the encoder to maximize the agreement of two jointly sampled positive bag pairs, yielding a robust MIL model without labels. To capture the topological structure of bags, our framework learns graphs for the bags, and these graphs are optimized together with message passing layers and the ordered weighted averaging operator towards the contrastive loss. Experimental results verify the state-of-the-art performance of our proposed framework in both generalizability and robustness.

## Acknowledgments

This research acknowledges the support from the National Natural Science Foundation of China (62032013 and 62076143), the Singapore Institute of Technology Ignition Grant (R-IE2-A405-0001), Innovative Talents of Higher Education in Liaoning Province (No. LR2020076) and Basic Research Operating Funds for National Defence Major Incubation Projects (No. N2116017).

## References

Andrews, S.; Tsochantaridis, I.; and Hofmann, T. 2002. Support vector machines for multiple-instance learning. Advances in Neural Information Processing Systems, 15.

Chevaleyre, Y.; and Zucker, J.-D. 2000. Noise-tolerant rule induction from multi-instance data. In Proceedings of the ICML-2000 Workshop on Attribute-Value and Relational Learning: Crossing the Boundaries.

Chu, Y.; Yue, X.; Yu, L.; Sergei, M.; and Wang, Z. 2020. Automatic image captioning based on ResNet50 and LSTM with soft attention. Wireless Communications and Mobile Computing, 2020: 1–7.

Dietterich, T. G.; Lathrop, R. H.; and Lozano-Pérez, T. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1): 31–71.

Feng, L.; Shu, S.; Cao, Y.; Tao, L.; Wei, H.; Xiang, T.; An, B.; and Niu, G. 2021. Multiple-instance learning from similar and dissimilar bags. In ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 374–382.

Gärtner, T.; Flach, P. A.; Kowalczyk, A.; and Smola, A. J. 2002. Multi-instance kernels. In International Conference on Machine Learning, 179–186.

Huang, S.; Liu, Z.; Jin, W.; and Mu, Y. 2022. Bag dissimilarity regularized multi-instance learning. Pattern Recognition, 126: 108583.

Ilse, M.; Tomczak, J.; and Welling, M. 2018. Attention-based deep multiple instance learning. In International Conference on Machine Learning.

Lin, T.; Xu, H.; Yang, C.; and Xu, Y. 2022. Interventional multi-instance learning with deconfounded instance-level prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 36(2): 1601–1609.

Luengo, J.; Sánchez-Tarragó, D.; Prati, R. C.; and Herrera, F. 2021. Multiple instance classification: Bag noise filtering for negative instance noise cleaning. Information Sciences, 579: 388–400.

Maron, H.; Fetaya, E.; Segol, N.; and Lipman, Y. 2019. On the universality of invariant networks. In International Conference on Machine Learning, 4363–4371. PMLR.

Pal, S.; Valkanas, A.; Regol, F.; and Coates, M. 2022. Bag graph: Multiple instance learning using Bayesian graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.

Reed, C. J.; Yue, X.; Nrusimha, A.; Ebrahimi, S.; Vijaykumar, V.; Mao, R.; Li, B.; Zhang, S.; Guillory, D.; Metzger, S.; et al. 2022. Self-supervised pretraining improves self-supervised pretraining. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2584–2594.

Rong, Y.; Bian, Y.; Xu, T.; Xie, W.; Wei, Y.; Huang, W.; and Huang, J. 2020. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems, 33: 12559–12571.
Von Kügelgen, J.; Sharma, Y.; Gresele, L.; Brendel, W.; Schölkopf, B.; Besserve, M.; and Locatello, F. 2021. Self-supervised learning with data augmentations provably isolates content from style. Advances in Neural Information Processing Systems, 34: 16451–16467.

Wagstaff, E.; Fuchs, F.; Engelcke, M.; Posner, I.; and Osborne, M. A. 2019. On the limitations of representing functions on sets. In International Conference on Machine Learning, 6487–6494. PMLR.

Wagstaff, E.; Fuchs, F. B.; Engelcke, M.; Osborne, M. A.; and Posner, I. 2022. Universal approximation of functions on sets. JMLR, 23(151): 1–56.

Wang, X.; Yan, Y.; Tang, P.; Bai, X.; and Liu, W. 2018. Revisiting multiple instance neural networks. Pattern Recognition, 74: 15–24.

Wei, X.; Wu, J.; and Zhou, Z. 2017. Scalable algorithms for multi-instance learning. IEEE Transactions on Neural Networks and Learning Systems, 28(4): 975–987.

Yager, R. R. 1988. On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man, and Cybernetics, 18(1): 183–190.

Yuan, T.; Wan, F.; Fu, M.; Liu, J.; Xu, S.; Ji, X.; and Ye, Q. 2021. Multiple instance active learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5330–5339.

Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R. R.; and Smola, A. J. 2017. Deep sets. Advances in Neural Information Processing Systems, 30.

Zhang, L.; Tozzo, V.; Higgins, J.; and Ranganath, R. 2022. Set norm and equivariant skip connections: Putting the deep in deep sets. In International Conference on Machine Learning, 26559–26574. PMLR.

Zhang, Q.; and Goldman, S. A. 2001. EM-DD: An improved multiple-instance learning technique. Advances in Neural Information Processing Systems, 14.

Zhang, W. 2021. Non-i.i.d. multi-instance learning for predicting instance and bag labels with variational autoencoder. In Zhou, Z.-H., ed., International Joint Conference on Artificial Intelligence, 3377–3383. International Joint Conferences on Artificial Intelligence Organization.

Zhang, Z.; Liu, Q.; Wang, H.; Lu, C.; and Lee, C.-K. 2021. Motif-based graph self-supervised learning for molecular property prediction. Advances in Neural Information Processing Systems, 34: 15870–15882.

Zhao, Y.; Wang, Y.; Wang, Z.; and Zhang, C. 2021. Multi-graph multi-label learning with dual-granularity labeling. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2327–2337.

Zhou, Z.-H.; Sun, Y.-Y.; and Li, Y.-F. 2009. Multi-instance learning by treating instances as non-i.i.d. samples. In International Conference on Machine Learning.