# Residual Relaxation for Multi-view Representation Learning

Yifei Wang¹, Zhengyang Geng², Feng Jiang², Chuming Li³, Yisen Wang²,⁴, Jiansheng Yang¹, Zhouchen Lin²,⁴,⁵

¹ School of Mathematical Sciences, Peking University, China
² Key Lab. of Machine Perception, School of Artificial Intelligence, Peking University, Beijing, China
³ School of Engineering, The University of Sydney, Australia
⁴ Institute for Artificial Intelligence, Peking University, Beijing, China
⁵ Pazhou Lab, Guangzhou, China

**Abstract.** Multi-view methods learn representations by aligning multiple views of the same image, and their performance largely depends on the choice of data augmentation. In this paper, we notice that some other useful augmentations, such as image rotation, are harmful for multi-view methods because they cause a semantic shift that is too large to be aligned well. This observation motivates us to relax the exact alignment objective to better cultivate stronger augmentations. Taking image rotation as a case study, we develop a generic approach, Pretext-aware Residual Relaxation (Prelax), that relaxes the exact alignment by allowing an adaptive residual vector between different views and encoding the semantic shift through pretext-aware learning. Extensive experiments on different backbones show that our method can not only improve multi-view methods with existing augmentations, but also benefit from stronger image augmentations like rotation.

## 1 Introduction

Without access to labels, self-supervised learning relies on surrogate objectives to extract meaningful representations from unlabeled data, and the chosen surrogate objectives largely determine the quality and properties of the learned representations [24, 19]. Recently, multi-view methods have become a dominant approach for self-supervised representation learning that achieves impressive downstream performance, and many modern variants have been proposed [22, 14, 1, 23, 2, 12, 3, 4, 11, 5]. Nevertheless, most multi-view methods can be abstracted and summarized as the following pipeline: for each input $x$, we apply several (typically two) random augmentations to it, and learn to align the resulting views $(x_1, x_2, \dots)$ of $x$ by minimizing their distance in the representation space.

In multi-view methods, the pretext, i.e., the image augmentation, has a large effect on the final performance. Typical choices include image re-scaling, cropping, color jitter, etc. [2]. However, we find that some augmentations, for example image rotation, are seldom utilized in state-of-the-art multi-view methods. Among these augmentations, Figure 1a shows that rotation causes a severe accuracy drop in a standard supervised model. In fact, image rotation is a stronger augmentation that largely affects the image semantics, and as a result, enforcing exact alignment of two different rotation angles could degrade the representation ability of existing multi-view methods.

Nevertheless, this does not mean that strong augmentations cannot provide useful semantics for representation learning. In fact, rotation is known to be an effective signal for predictive learning [10, 30, 21]. Differently, predictive methods learn representations by predicting the pretext (e.g., the rotation angle) from the corresponding view. In this way, the model is encouraged to encode pretext-aware image semantics, which also yields good representations.

Corresponding author: Yisen Wang (yisen.wang@pku.edu.cn).

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

[Figure 1: (a) comparison of augmentations (test accuracy under Raw, Horizontal Flip, Gray Scale, Color Jitter, Rotation); (b) a toy example of residual relaxation (exact alignment vs. residual alignment).]

Figure 1: Left: the effect of different augmentations on CIFAR-10 test images with a supervised model (trained without using any data augmentation; more details in Appendix A). Right: an illustration of the exact alignment objective of multi-view methods ($z' \approx z$) and the relaxed residual alignment of our Prelax ($z' - r \approx z$). As rotation largely modifies the image semantics, our Prelax adopts a rotation-aware residual vector $r$ to bridge the representations of the two different views.
To summarize, strong augmentations like rotation carry meaningful semantics, while being harmful for existing multi-view methods due to the large semantic shift they induce. To address this dilemma, we propose in this paper a generic approach that generalizes multi-view methods to cultivate stronger augmentations. Drawing inspiration from the soft-margin SVM, we propose residual alignment, which relaxes the exact alignment in multi-view methods by incorporating a residual vector between two views. Besides, we develop a predictive loss for the residual to ensure that it encodes the semantic shift between views (e.g., image rotation). We name this technique Pretext-aware REsidual RELAXation (Prelax); an illustration is shown in Figure 1b. Prelax serves as a generalized multi-view method that is adaptive to large semantic shifts and combines image semantics extracted by both pretext-invariant and pretext-aware methods.

We summarize our contributions as follows:

- We propose a generic technique, Pretext-aware Residual Relaxation (Prelax), that generalizes multi-view representation learning to benefit from stronger image augmentations.
- Prelax not only extracts pretext-invariant features as in multi-view methods, but also encodes pretext-aware features into the pretext-aware residuals. Thus, it can serve as a unified approach that bridges the two existing methodologies for representation learning.
- Experiments show that Prelax brings significant improvements over both multi-view and predictive methods on a wide range of benchmark datasets.

## 2 Related Work

**Multi-view Learning.** Although multi-view learning could refer to a wider literature [16], here we restrict our discussion to the context of Self-Supervised Learning (SSL), where multi-view methods learn representations by aligning multiple views of the same image generated through random data augmentation [1]. There are two kinds of methods for keeping the representations well separated: contrastive methods, which maximize the difference between different samples [2, 12], and similarity-based methods, which prevent representation collapse via implicit mechanisms like a predictor and gradient stopping [11, 5]. Despite having many modern variants, multi-view methods share the same methodology, which is to extract features that are invariant to the predefined augmentations, i.e., pretext-invariant features [20].

**Predictive Learning.** Another thread of methods learns representations by predicting self-generated surrogate labels. Specifically, a transformation (e.g., image rotation) is applied to the input image and the learner is required to predict properties of the transformation (e.g., the rotation angle) from the transformed image. As a result, the extracted image representations are encouraged to become aware of the applied pretext (e.g., image rotation).
Thus, we also refer to them as pretext-aware methods. The pretext tasks are various; to name a few: Rotation [10], Jigsaw [21], Relative Patch Location [8], and Colorization [30].

**Generalized Multi-view Learning.** Although there is plenty of work on each branch, how to bridge the two methodologies remains under-explored. Prior to our work, there are only a few works in this direction. Some directly combine the AMDIM (pretext-invariant) [1] and Rotation (pretext-aware) [10] objectives [9]. However, a direct combination of the two contradictory objectives may harm the final representation. LooC [28] proposes to separate the embedding space into several parts, where each subspace learns local invariance w.r.t. a specific augmentation. But this is achieved at the cost of limiting the representation flexibility of each pretext to its predefined subspace. Different from them, our proposed Prelax provides a more general solution by allowing an adaptive residual vector to encode the semantic shift. In this way, both kinds of features are encoded in the same representation space.

## 3 The Proposed Pretext-aware Residual Relaxation (Prelax) Method

### 3.1 Preliminary

**Problem Formulation.** Given unlabeled data $\{x_i\}$, unsupervised representation learning aims to learn an encoder network $F_\theta$ that extracts meaningful low-dimensional representations $z \in \mathbb{R}^{d_z}$ from high-dimensional input images $x \in \mathbb{R}^{d_x}$. The learned representation is typically evaluated on a downstream classification task by learning a linear classifier with labeled data $\{x_i, y_i\}$.

**Multi-view Representation Learning.** For an input image $x \in \mathbb{R}^{d_x}$, we can generate a different view by data augmentation, $x' = t(x)$, where $t \in \mathcal{T}$ is a randomly drawn augmentation operator from the pretext set $\mathcal{T}$. Then, the transformed input $x'$ and the original input $x$ are passed into an online network $F_\theta$ and a target network $F_\phi$, respectively. Optionally, the output of the online network is further processed by an MLP predictor network $G_\theta$ to match the output of the target network. As two different views of the same image (i.e., positive samples), $x'$ and $x$ should have similar representations, so we align their representations with the following similarity loss,

$$\mathcal{L}_{\mathrm{sim}}(x', x; \theta) = \big\| G_\theta\!\left(F_\theta(x')\right) - F_\phi(x) \big\|_2^2. \tag{1}$$

The representations, e.g., $z = F_\theta(x)$, are typically projected onto the unit sphere before calculating the distance ($z / \|z\|_2$), which makes the $\ell_2$ distance equivalent to the cosine similarity [2].

**Remark.** Aside from the similarity loss between positive samples, contrastive methods [22, 14, 23, 2] further encourage representation uniformity with an additional regularization term that minimizes the similarity between the input and an independently drawn negative sample. Nevertheless, some recent works find that the similarity loss alone already suffices [11, 5]. In this paper, we mainly focus on improving the alignment between positive samples in the similarity loss; our approach can also be easily extended to contrastive methods by including the dissimilarity regularization.
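To make the alignment objective concrete, below is a minimal PyTorch-style sketch of the similarity loss in Eq. (1) with unit-norm projection. The module handles (`online`, `predictor`, `target`) are illustrative placeholders rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def similarity_loss(online, predictor, target, x_aug, x):
    """Exact-alignment loss of Eq. (1): ||G(F_theta(x')) - F_phi(x)||^2 on the unit sphere.

    `online`, `predictor`, `target` stand in for F_theta, G_theta, F_phi.
    """
    p = F.normalize(predictor(online(x_aug)), dim=-1)   # G_theta(F_theta(x')), unit-normalized
    with torch.no_grad():                                # no gradient through the target branch
        z = F.normalize(target(x), dim=-1)               # F_phi(x), unit-normalized
    # On the unit sphere, ||p - z||^2 = 2 - 2 * cos(p, z), so this matches cosine similarity.
    return (p - z).pow(2).sum(dim=-1).mean()
```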
### 3.2 Objective Formulation

As we have noticed, an augmentation may bring a certain amount of semantic shift. Thus, enforcing exact alignment of different views may hurt the representation quality, particularly when the data augmentation is too strong for the positive pair to be matched exactly. Therefore, we need to relax the exact alignment in Eq. (1) to account for the semantic shift brought by the data augmentation.

**Residual Relaxed Similarity Loss.** Although the representations may not align exactly, i.e., $z' \neq z$, the representation identity always holds: $z' - (z' - z) = z$, where $z' - z$ represents the semantics shifted by the augmentation. This makes the identity a proper candidate for multi-view alignment under various augmentations, as long as the shifted semantics are taken into consideration. Specifically, we replace the exact alignment (denoted $\approx$) in the similarity loss (Eq. (1)) by the proposed identity alignment, i.e.,

$$G_\theta(z'_\theta) \approx z_\phi \quad\Longrightarrow\quad G_\theta(z'_\theta) - G_\theta(r) \approx z_\phi, \tag{2}$$

where we include a residual vector $r = z'_\theta - z_\theta = F_\theta(x') - F_\theta(x)$ to represent the difference between the representations. To further enable a better tradeoff between the exact and identity alignments, we adopt the following residual alignment:

$$G_\theta(z'_\theta) - \alpha G_\theta(r) \approx z_\phi, \tag{3}$$

where $\alpha \in [0, 1]$ is an interpolation parameter. When $\alpha = 0$, we recover the exact alignment; when $\alpha = 1$, we recover the identity alignment. We name the corresponding learning objective the Residual Relaxed Similarity (R2S) loss, which minimizes the squared $\ell_2$ distance between the two sides:

$$\mathcal{L}^\alpha_{\mathrm{R2S}}(x', x; \theta) = \big\| G_\theta(F_\theta(x')) - \alpha G_\theta(r) - F_\phi(x) \big\|_2^2. \tag{4}$$

**Predictive Learning (PL) Loss.** To ensure the relaxation works as expected, the residual $r$ should encode the semantic shift caused by the augmentation, i.e., the pretext. Inspired by predictive learning [10], we use the residual to predict the corresponding augmentation, making it pretext-aware. In practice, the assigned parameters of a random augmentation $t$ can generally be divided into discrete categorical variables $t_d$ (e.g., flipping or not, graying or not) and continuous variables $t_c$ (e.g., scale, aspect ratio, jittered brightness). Thus, we learn a PL predictor $H_\theta$ to predict $(t_d, t_c)$ with a cross-entropy (CE) loss and a mean squared error (MSE) loss, respectively:

$$\mathcal{L}_{\mathrm{PL}}(x', x, t; \theta) = \mathrm{CE}\!\left(H^d_\theta(r), t_d\right) + \big\| H^c_\theta(r) - t_c \big\|_2^2. \tag{5}$$

**Constraint on the Similarity.** Different from the exact alignment, the residual vector can be unbounded, i.e., the difference between views can be arbitrarily large. This is not reasonable, as the two views indeed share many common semantics. Therefore, we should utilize this prior knowledge to prevent such bad cases under residual similarity, and we add the following constraint:

$$\mathcal{L}_{\mathrm{sim}} = \big\| G_\theta(F_\theta(x')) - F_\phi(x) \big\|_2^2 \le \varepsilon, \tag{6}$$

where $\varepsilon$ denotes the maximal degree of mismatch allowed between positive samples.

**The Overall Objective of Prelax.** By combining the three components above, we can reliably encode the semantic shift between augmentations while ensuring good alignment between views:

$$\min_\theta \ \mathcal{L}^\alpha_{\mathrm{R2S}}(x', x; \theta) + \gamma \mathcal{L}_{\mathrm{PL}}(x', x, t; \theta), \quad \text{s.t.}\ \big\| G_\theta(F_\theta(x')) - F_\phi(x) \big\|_2^2 \le \varepsilon. \tag{7}$$

For simplicity, we transform it into a Lagrangian objective with a fixed multiplier $\beta \ge 0$, and obtain the overall Pretext-aware REsidual RELAXation (Prelax) objective,

$$\mathcal{L}^\alpha_{\mathrm{R2S}}(x', x; \theta) + \gamma \mathcal{L}_{\mathrm{PL}}(x', x, t; \theta) + \beta \mathcal{L}_{\mathrm{sim}}(x', x; \theta), \tag{8}$$

where $\alpha$ trades off between the exact and identity alignments, $\gamma$ adjusts the amount of pretext-awareness of the residual, and $\beta$ controls the degree of similarity between positive pairs. An illustrative diagram of the Prelax objective is shown in Figure 2.

**Discussions.** In fact, there are other alternatives for relaxing the exact alignment. For example, we can utilize a margin loss

$$\mathcal{L}_{\mathrm{margin}}(x', x; \theta) = \max\!\left( \big\| G_\theta(F_\theta(x')) - F_\phi(x) \big\|_2^2 - \eta,\ 0 \right), \tag{9}$$

where $\eta > 0$ is a threshold for the mismatch tolerance. However, it has two main drawbacks: 1) as each image and augmentation carries different semantics, it is hard to choose a universal threshold for all images; and 2) the representation keeps shifting as training progresses, making it even harder to maintain a proper threshold dynamically. Thus, a good relaxation should be adaptive to the training progress and to the alignment of different views. In contrast, our Prelax adopts a pretext-aware residual vector, which is learnable, flexible, and semantically meaningful.
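The following PyTorch-style sketch puts Eqs. (4), (5), and (8) together for one positive pair. The network handles (`online`, `target`, `predictor`, `pl_head_d`, `pl_head_c`), the normalization placement, and the default coefficients are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def prelax_loss(online, target, predictor, pl_head_d, pl_head_c,
                x_aug, x, t_d, t_c, alpha=1.0, beta=1.0, gamma=0.1):
    """One-pair Prelax objective (Eq. 8) = R2S (Eq. 4) + gamma * PL (Eq. 5) + beta * sim (Eq. 1)."""
    z_aug, z = online(x_aug), online(x)          # z'_theta and z_theta
    r = z_aug - z                                 # residual vector r = z'_theta - z_theta
    with torch.no_grad():
        z_tgt = F.normalize(target(x), dim=-1)    # stop-gradient target F_phi(x)

    p_aug = F.normalize(predictor(z_aug), dim=-1)
    p_r = F.normalize(predictor(r), dim=-1)

    # R2S loss (Eq. 4): align G(z') - alpha * G(r) with the target.
    r2s = (p_aug - alpha * p_r - z_tgt).pow(2).sum(-1).mean()
    # PL loss (Eq. 5): the residual predicts discrete and continuous augmentation parameters
    # (t_d as class indices, t_c as a real-valued vector).
    pl = F.cross_entropy(pl_head_d(r), t_d) + F.mse_loss(pl_head_c(r), t_c)
    # Similarity constraint (Eq. 1), used as the Lagrangian term with multiplier beta.
    sim = (p_aug - z_tgt).pow(2).sum(-1).mean()
    return r2s + gamma * pl + beta * sim
```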
[Figure 2 diagram.]

Figure 2: A diagram of our proposed Prelax objective. An image $x$ is first augmented into $x'$. Then the positive pair $(x, x')$ is processed by the online network $F_\theta$ and the target network $F_\phi$, respectively. The output of the online network is further processed by the predictor network $G_\theta$, and the gradient of $F_\phi$ is detached, i.e., stop-gradient, denoted sg. The outputs are then used to compute the three objectives, $\mathcal{L}_{\mathrm{R2S}}$ (Eq. 4), $\mathcal{L}_{\mathrm{PL}}$ (Eq. 5), and $\mathcal{L}_{\mathrm{sim}}$ (Eq. 1), in the Prelax objective (Eq. 7).

### 3.3 Theoretical Analysis

As Prelax encodes both pretext-invariant and pretext-aware features, it can be semantically richer than both multi-view learning and predictive learning. Following the information-theoretic framework developed by [25], we show that Prelax provably enjoys better downstream performance.

We denote the random variable of the input as $X$ and learn a representation $Z$ through a deterministic encoder $F_\theta$: $Z = F_\theta(X)$.² The representation $Z$ is evaluated for a downstream task $T$ by learning a classifier on top of $Z$. From an information-theoretic perspective, a desirable algorithm should maximize the Mutual Information (MI) between $Z$ and $T$, i.e., $I(Z; T)$ [6]. Supervised learning on task $T$ can learn representations by directly maximizing $I(Z; T)$. Without access to the labels $T$, unsupervised learning resorts to maximizing $I(Z; S)$, where $S$ denotes the surrogate signal designed by each method. Specifically, multi-view learning matches $Z$ with the randomly augmented view, denoted $S_v$, while predictive learning uses $Z$ to predict the applied augmentation, denoted $S_a$. In Prelax, as we combine both semantics, we actually maximize the MI w.r.t. their joint distribution, i.e., $I(Z; S_v, S_a)$. We denote the representations learned by supervised learning, multi-view learning, predictive learning, and Prelax as $Z_{\mathrm{sup}}$, $Z_{\mathrm{mv}}$, $Z_{\mathrm{PL}}$, and $Z_{\mathrm{Prelax}}$, respectively.

² We use capitals to denote a random variable, e.g., $X$, and lower case to denote its outcome, e.g., $x$.
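As a reading aid (not part of the authors' proof, which is in Appendix B), the basic intuition follows from the chain rule of mutual information: maximizing MI with the joint surrogate signal can never retain less information than using either signal alone,

$$I(Z; S_v, S_a) = I(Z; S_v) + I(Z; S_a \mid S_v) \;\ge\; \max\big(I(Z; S_v),\, I(Z; S_a)\big),$$

since conditional mutual information is non-negative (and symmetrically when conditioning on $S_a$).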
**Theorem 1.** Assume that by maximizing the mutual information, each method can retain all information in $X$ about its learning signal $S$ (or $T$), i.e., $I(X; S) = \max_Z I(Z; S)$. Then we have the following inequalities on the task-relevant information $I(Z; T)$:

$$I(X; T) = I(Z_{\mathrm{sup}}; T) \ \ge\ I(Z_{\mathrm{Prelax}}; T) \ \ge\ \max\!\left( I(Z_{\mathrm{mv}}; T),\, I(Z_{\mathrm{PL}}; T) \right). \tag{10}$$

**Theorem 2.** Further assume that $T$ is a $K$-class categorical variable. In general, we have the following upper bound $u^e$ on the downstream Bayes error $P^e := \mathbb{E}_z\!\left[ 1 - \max_{t \in T} P(T = t \mid z) \right]$,

$$\bar{P}^e \ \le\ u^e := \log 2 + \bar{P}^e_{\mathrm{sup}} \cdot \log K + I(X; T \mid S), \tag{11}$$

where $\bar{P}^e = \mathrm{Th}(P^e) = \min\{\max\{P^e, 0\},\, 1 - 1/K\}$ denotes the thresholded Bayes error. Accordingly, we have the following inequalities on the upper bounds of the different unsupervised methods:

$$u^e_{\mathrm{sup}} \ \le\ u^e_{\mathrm{Prelax}} \ \le\ \min\!\left(u^e_{\mathrm{mv}}, u^e_{\mathrm{PL}}\right) \ \le\ \max\!\left(u^e_{\mathrm{mv}}, u^e_{\mathrm{PL}}\right). \tag{12}$$

Theorem 1 shows that Prelax extracts more task-relevant information than multi-view and predictive methods, and Theorem 2 further shows that Prelax has a tighter upper bound on the downstream Bayes error. Therefore, Prelax is indeed theoretically superior to previous unsupervised methods through its use of both pretext-invariant and pretext-aware features. Proofs are in Appendix B.

## 4 Practical Implementation

In this part, we present three practical variants of Prelax that generalize existing multi-view backbones: 1) one with existing multi-view augmentations (Prelax-std); 2) one with a stronger augmentation, image rotation (Prelax-rot); and 3) one combining the previous two strategies (Prelax-all).

### 4.1 Backbone

BYOL [11] and SimSiam [5] are both similarity-based methods, and they differ mainly in the design of the target network $F_\phi$. BYOL [11] maintains the target parameters $\phi$ as a momentum update of the online parameters $\theta$, i.e., $\phi \leftarrow \tau \phi + (1 - \tau)\theta$, where $\tau \in [0, 1]$ is the target decay rate. SimSiam [5] simply regards the (stop-gradient) online network as the target network, i.e., $\phi \leftarrow \mathrm{sg}(\theta)$. We mainly take SimSiam for the discussion below; our analysis also applies to BYOL.

For a given training image $x$, SimSiam draws two random augmentations $(t_1, t_2)$ and obtains two views $(x_1, x_2)$, respectively. Then, SimSiam maximizes the similarity of their representations with a dual objective, in which the two views serve as both input and target for each other:

$$\mathcal{L}_{\mathrm{SimSiam}}(x; \theta) = \big\| G_\theta(F_\theta(x_1)) - F_\phi(x_2) \big\|_2^2 + \big\| G_\theta(F_\theta(x_2)) - F_\phi(x_1) \big\|_2^2. \tag{13}$$
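For concreteness, a minimal sketch of the two target-network strategies just described; the helper function and its default `tau` are illustrative, following the standard BYOL recipe rather than code from the paper.

```python
import torch

@torch.no_grad()
def update_target_network(online, target, tau=0.996):
    """BYOL-style momentum update of the target parameters: phi <- tau * phi + (1 - tau) * theta.

    SimSiam skips this step entirely: it reuses the online network as the target and
    simply detaches gradients on the target branch (stop-gradient).
    """
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.data.mul_(tau).add_(p_online.data, alpha=1.0 - tau)
```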
### 4.2 Prelax-std

To begin with, we can directly generalize the baseline method with our Prelax method under existing multi-view augmentation strategies [2, 11]. For the same positive pair $(x_1, x_2)$, we calculate their residual vector $r_{12} = F_\theta(x_1) - F_\theta(x_2)$ and use it in the R2S loss (Eq. (4)):

$$\mathcal{L}^\alpha_{\mathrm{R2S}}(x_1, x_2; \theta) = \big\| G_\theta(F_\theta(x_1)) - \alpha G_\theta(r_{12}) - F_\phi(x_2) \big\|_2^2. \tag{14}$$

We note that there is no essential difference between using $r_{12}$ and $r_{21}$, as the two views are dual. Then, we adopt the similarity loss in the reverse direction as our similarity constraint loss,

$$\mathcal{L}_{\mathrm{sim}}(x_2, x_1; \theta) = \big\| G_\theta(F_\theta(x_2)) - F_\phi(x_1) \big\|_2^2. \tag{15}$$

Lastly, we use the residual $r_{12}$ in the PL loss to predict the augmentation parameters of $x_1$, i.e., $t_1$, because $r_{12} = z_1 - z_2$ points towards $z_1$. Combining the three losses above, we obtain our Prelax-std objective:

$$\mathcal{L}_{\mathrm{Prelax\text{-}std}}(x; \theta) = \mathcal{L}^\alpha_{\mathrm{R2S}}(x_1, x_2; \theta) + \gamma \mathcal{L}_{\mathrm{PL}}(x_1, x_2, t_1; \theta) + \beta \mathcal{L}_{\mathrm{sim}}(x_2, x_1; \theta). \tag{16}$$

### 4.3 Prelax-rot

As mentioned previously, with our residual relaxation we can benefit from stronger augmentations that are harmful for multi-view methods. Here, we focus on the image rotation example and propose the Prelax-rot objective with a rotation-aware residual vector. To achieve this, we further generalize existing dual-view methods by incorporating a third, rotated view. Specifically, given two views $(x_1, x_2)$ generated with existing multi-view augmentations, we additionally draw a random rotation angle $a \in \mathcal{R} = \{0°, 90°, 180°, 270°\}$ and apply it to rotate $x_1$ clockwise, yielding the third view $x_3$. Note that the only difference between $x_3$ and $x_1$ is the rotation semantic $a$. Therefore, if we substitute $x_1$ with $x_3$ in the similarity loss, we should add a rotation-aware residual $r_{31} = z_3 - z_1$ to bridge the gap. Motivated by this analysis, we propose the Rotation Residual Relaxation Similarity (R3S) loss,

$$\mathcal{L}^\alpha_{\mathrm{R3S}}(x_{1:3}; \theta) = \big\| G_\theta(F_\theta(x_3)) - \alpha G_\theta(r_{31}) - F_\phi(x_2) \big\|_2^2, \tag{17}$$

which replaces $G_\theta(F_\theta(x_1))$ in the similarity loss with its rotation-relaxed version $G_\theta(F_\theta(x_3)) - \alpha G_\theta(r_{31})$. Comparing the R2S loss (Eq. (14)) and the R3S loss, we note that the relaxation in the R2S loss accounts for all the semantic shift between $x_1$ and $x_2$, while that in the R3S loss only accounts for the rotation augmentation between $x_1$ and $x_3$. Therefore, we can use the residual $r_{31}$ to predict the rotation angle $a$ with the following RotPL loss, making it rotation-aware:

$$\mathcal{L}_{\mathrm{rotPL}}(x_1, x_3, a; \theta) = \mathrm{CE}(H_\theta(r_{31}), a). \tag{18}$$

Combining with the similarity constraint, we obtain the Prelax-rot objective:

$$\mathcal{L}_{\mathrm{Prelax\text{-}rot}}(x; \theta) = \mathcal{L}^\alpha_{\mathrm{R3S}}(x_{1:3}; \theta) + \gamma \mathcal{L}_{\mathrm{rotPL}}(x_1, x_3, a; \theta) + \beta \mathcal{L}_{\mathrm{sim}}(x_2, x_1; \theta). \tag{19}$$
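A minimal sketch of how the third, rotated view and the rotation-aware residual of Section 4.3 could be formed; `torch.rot90` handles the 90°-multiple rotations, images are assumed to be CHW tensors, and the network handles are again placeholders rather than the paper's code.

```python
import torch
import torch.nn.functional as F

def prelax_rot_loss(online, target, predictor, rot_head, x1, x2,
                    alpha=1.0, beta=1.0, gamma=0.1):
    """Prelax-rot (Eq. 19): R3S on (x3, x2) + rotation prediction from r31 + sim constraint."""
    # Draw one rotation label per image from {0, 90, 180, 270} degrees and rotate x1 accordingly.
    a = torch.randint(0, 4, (x1.size(0),), device=x1.device)
    x3 = torch.stack([torch.rot90(img, int(k), dims=(1, 2))   # rotate the H, W dims of a CHW image
                      for img, k in zip(x1, a)])

    z1, z3 = online(x1), online(x3)
    r31 = z3 - z1                                              # rotation-aware residual
    with torch.no_grad():
        y1 = F.normalize(target(x1), dim=-1)
        y2 = F.normalize(target(x2), dim=-1)

    p3 = F.normalize(predictor(z3), dim=-1)
    p_r = F.normalize(predictor(r31), dim=-1)
    p2 = F.normalize(predictor(online(x2)), dim=-1)

    r3s = (p3 - alpha * p_r - y2).pow(2).sum(-1).mean()        # Eq. (17)
    rot_pl = F.cross_entropy(rot_head(r31), a)                 # Eq. (18), 4-way angle prediction
    sim = (p2 - y1).pow(2).sum(-1).mean()                      # Eq. (15), reverse-direction constraint
    return r3s + gamma * rot_pl + beta * sim
```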
### 4.4 Prelax-all

We have developed Prelax-std, which cultivates existing multi-view augmentations, and Prelax-rot, which incorporates image rotation. Here, we further utilize both existing augmentations and image rotation by combining the two objectives, which we denote Prelax-all:

$$\mathcal{L}_{\mathrm{Prelax\text{-}all}}(x; \theta) = \tfrac{1}{2}\left( \mathcal{L}^{\alpha_1}_{\mathrm{R2S}}(x_1, x_2; \theta) + \mathcal{L}^{\alpha_2}_{\mathrm{R3S}}(x_{1:3}; \theta) \right) + \tfrac{\gamma_1}{2}\, \mathcal{L}_{\mathrm{PL}}(x_1, x_2, t_1; \theta) + \tfrac{\gamma_2}{2}\, \mathcal{L}_{\mathrm{rotPL}}(x_1, x_3, a; \theta) + \beta \mathcal{L}_{\mathrm{sim}}(x_2, x_1; \theta), \tag{20}$$

where $\alpha_1, \alpha_2, \gamma_1, \gamma_2$ are the coefficients for the R2S, R3S, PL, and RotPL losses, respectively.

### 4.5 Discussions

Here we have designed three practical versions as different implementations of our generic framework of residual relaxation. Among them, Prelax-std focuses on further cultivating existing augmentation strategies, Prelax-rot incorporates the stronger (potentially harmful) rotation augmentation, and Prelax-all combines them all. Through the three versions, we demonstrate the wide applicability of Prelax as a generic framework. Practical users could also adapt Prelax to their own applications by incorporating specific domain knowledge. In this paper, as we focus on natural images, we take rotation as a motivating example because it is harmful for natural images. Nevertheless, rotation is not necessarily harmful in other domains, e.g., medical images. Instead, random cropping could be very harmful for medical images, as the important part could lie in a corner. In such a scenario, our residual relaxation mechanism could also be used to encode the semantic shift caused by cropping and alleviate its negative effects.

## 5 Experiments

**Datasets.** Due to computational constraints, we carry out experiments on a range of medium-sized real-world image datasets, including well-known benchmarks like CIFAR-10 [15] and CIFAR-100 [15], and two ImageNet variants: Tiny-ImageNet-200 (200 classes with images resized to 32×32) [27] and ImageNette (10 classes with image size 128×128)³.

**Backbones.** As Prelax is designed to be a generic method for generalizing existing multi-view methods, we implement it on two different multi-view methods, SimSiam [5] and BYOL [11]. Specifically, we notice that SimSiam reported results on CIFAR-10, while the official code of BYOL included results on ImageNette. For a fair comparison, we evaluate SimSiam and its Prelax variant on CIFAR-10, and evaluate BYOL and its Prelax variant on ImageNette. In addition, we evaluate SimSiam and its Prelax variant on two additional datasets, CIFAR-100 and Tiny-ImageNet-200, which are more challenging because they include a larger number of classes. For computational efficiency, we adopt the ResNet-18 [13] backbone (used by SimSiam [5] for CIFAR-10) to benchmark our experiments. For a comprehensive comparison, we also experiment with larger backbones, like ResNet-34 [13]; the results are included in Appendix C.

**Setup.** For Prelax-std, we use the same data augmentations as SimSiam [2, 5] (or BYOL [11]), including RandomResizedCrop, RandomHorizontalFlip, ColorJitter, and RandomGrayscale, etc., using the PyTorch notation. For Prelax-rot and Prelax-all, we additionally apply a random image rotation at the end of the transformation pipeline, where the angles are drawn randomly from {0°, 90°, 180°, 270°}. To generate targets for the PL objective in Prelax, for each image we collect the assigned parameters of each random augmentation, such as crop centers, aspect ratios, and rotation angles. More details can be found in Appendix A.

**Training.** For SimSiam and its Prelax variants, we follow the same hyperparameters as [5] on CIFAR-10. Specifically, we use ResNet-18 as the backbone network, followed by a 3-layer projection MLP whose hidden and output dimensions are both 2048. The predictor is a 2-layer MLP whose hidden and output dimensions are 512 and 2048, respectively. We use SGD for pre-training with batch size 512, learning rate 0.03, momentum 0.9, weight decay 5×10⁻⁴, and a cosine decay schedule [18] for 800 epochs. For BYOL and its Prelax variants, we also adopt the ResNet-18 backbone; the projector and predictor are 2-layer MLPs whose hidden and output dimensions are 256 and 4096, respectively. Following the default hyper-parameters on ImageNette⁴, we use the LARS optimizer [29] to train for 1000 epochs with batch size 256, learning rate 2.0, and weight decay 1×10⁻⁶, while excluding the biases and batch-normalization parameters from both LARS adaptation and weight decay. For the target network, the exponential moving average parameter τ starts from τ_base = 0.996 and increases to 1 during training. As for the Prelax objective, we notice that adopting the reverse residual r₂₁ in the R2S loss (Eq. (14)) sometimes brings slightly better results, which could be due to the swapped prediction in SimSiam's dual objective (Eq. (13)). Besides, a naïve choice of Prelax coefficients already works well: α = 1, β = 1, γ = 0.1 for Prelax-std and Prelax-rot, and α₁ = α₂ = 1, β = 1, γ₁ = γ₂ = 0.1 for Prelax-all. More discussion of the hyper-parameters of Prelax can be found in Appendix E.

³ https://github.com/fastai/imagenette
⁴ https://github.com/deepmind/deepmind-research/tree/master/byol

Table 1: Linear evaluation on CIFAR-10 (a) and ImageNette (b) with the ResNet-18 backbone. TTA: Test-Time Augmentation.

(a) CIFAR-10.

| Method | Acc. (%) |
| --- | --- |
| Supervised [13] (reproduced) | 95.0 |
| Rotation [10] (reproduced) | 88.3 |
| BYOL [11] (reproduced) | 91.1 |
| SimCLR [2] | 91.1 |
| SimSiam [5] | 91.8 |
| SimSiam + Prelax | 93.4 |

(b) ImageNette.

| Method | Acc. (%) |
| --- | --- |
| Supervised | 91.0 |
| Supervised + TTA | 92.2 |
| BYOL [11] (ResNet-18) | 91.9 |
| BYOL [11] (ResNet-50) | 92.3 |
| BYOL + Prelax (ResNet-18) | 92.6 |

**Evaluation.** After unsupervised training, we evaluate the backbone network by fitting a linear classifier on top of its representation with all other model weights held fixed. For SimSiam and its Prelax variants, the linear classifier is trained on labeled data from scratch using SGD with batch size 256, learning rate 30.0, and momentum 0.9 for 100 epochs; the learning rate decays by a factor of 0.1 at the 60th and 80th epochs. For BYOL and its Prelax variants, we use SGD with Nesterov momentum for 80 epochs with batch size 25, learning rate 0.5, and momentum 0.9. Besides the in-domain linear evaluation, we also evaluate transfer learning performance on out-of-domain data by learning a linear classifier on labeled target-domain data.
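As a reference for the protocol described above, here is a minimal sketch of frozen-backbone linear evaluation; the optimizer settings mirror the SimSiam-style numbers quoted in the text, while `backbone`, `feat_dim`, and `train_loader` are assumed inputs rather than artifacts of the paper's code.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes, train_loader, epochs=100, device="cuda"):
    """Freeze the pretrained encoder and fit a linear classifier on top of its features."""
    backbone.eval().to(device)
    for p in backbone.parameters():
        p.requires_grad = False                       # backbone weights stay fixed

    clf = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.SGD(clf.parameters(), lr=30.0, momentum=0.9)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[60, 80], gamma=0.1)

    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                feats = backbone(x)                   # frozen features
            loss = nn.functional.cross_entropy(clf(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return clf
```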
### 5.1 Performance on Benchmark Datasets

**CIFAR-10.** In Table 1a, we compare Prelax against previous multi-view methods (SimCLR [2], SimSiam [5], and BYOL [11]) and a predictive method (Rotation [10]) on CIFAR-10. We can see that multi-view methods are indeed better than predictive ones. Nevertheless, predictive learning alone (e.g., Rotation) achieves quite good performance, indicating that pretext-aware features are also very useful. By encoding both pretext-invariant and pretext-aware features, Prelax outperforms previous methods by a large margin and achieves state-of-the-art performance on CIFAR-10. A comparison of the learning dynamics of SimSiam and Prelax can be found in Appendix F.

**ImageNette.** Besides the SimSiam backbone, we further apply our Prelax loss to the BYOL framework [11] and evaluate the two methods on the ImageNette dataset. In Table 1b, Prelax also shows a clear advantage over BYOL. Specifically, it improves the ResNet-18 version of BYOL by 0.7%, and even outperforms the ResNet-50 version by 0.3%. Thus, Prelax yields significant improvements on two different datasets with two different backbone methods, and could serve as a generic method for improving existing multi-view methods by encoding pretext-aware features into the residual relaxation. For completeness, we also evaluate Prelax on the large-scale ImageNet dataset [7], as well as its transferability to other kinds of downstream tasks, such as object detection and instance segmentation on MS COCO [17]. As shown in Appendix D, Prelax still consistently outperforms the baselines across all tasks.

### 5.2 Effectiveness of Prelax Variants

For a comprehensive comparison of the three Prelax variants (Prelax-std, Prelax-rot, and Prelax-all), we conduct controlled experiments on a range of datasets with the SimSiam backbone. Besides CIFAR-10, we also conduct experiments on CIFAR-100 and Tiny-ImageNet-200, which are more challenging as they contain more classes. For a fair comparison, we use the same training and evaluation protocols across all tasks.

**In-domain Linear Evaluation.** As shown in Table 2a, our Prelax objectives consistently outperform the multi-view objective on all three datasets, with Prelax-all improving over SimSiam by 1.6% on CIFAR-10, 3.1% on CIFAR-100, and 1.5% on Tiny-ImageNet-200. Besides, Prelax-std and Prelax-rot are also better than SimSiam in most cases. Thus, the pretext-aware residuals in Prelax indeed help encode more useful semantics.

Table 2: A detailed comparison of SimSiam [5] and Prelax (ours) across three datasets, CIFAR-10 (C10), CIFAR-100 (C100), and Tiny-ImageNet-200 (Tiny200), with the same hyper-parameters.

(a) In-domain linear evaluation.

| Method | CIFAR-10 | CIFAR-100 | Tiny-ImageNet-200 |
| --- | --- | --- | --- |
| SimSiam [5] | 91.8 | 66.9 | 47.7 |
| SimSiam + Prelax-std | 92.5 | 67.5 | 47.9 |
| SimSiam + Prelax-rot | 92.4 | 67.3 | 47.1 |
| SimSiam + Prelax-all | 93.4 | 70.0 | 49.2 |

(b) Out-of-domain linear evaluation.

| Method | C100 → C10 | Tiny200 → C10 | Tiny200 → C100 |
| --- | --- | --- | --- |
| SimSiam [5] | 44.1 | 43.9 | 21.8 |
| SimSiam + Prelax-std | 45.0 | 45.1 | 21.8 |
| SimSiam + Prelax-rot | 45.0 | 45.1 | 22.0 |
| SimSiam + Prelax-all | 44.9 | 44.6 | 22.1 |

Table 3: Linear evaluation results of possible mechanisms for generalized multi-view learning on CIFAR-10 with the SimSiam backbone.

(a) Comparison against alternative options.

| Method | Acc. (%) |
| --- | --- |
| SimSiam [5] | 91.8 |
| SimSiam + margin loss | 91.9 |
| Rotation [10] | 88.3 |
| SimSiam + rotation aug. | 87.9 |
| SimSiam + Rotation loss | 91.7 |
| SimSiam + Prelax (ours) | 93.4 |
(b) Ablation study.

| Method | Acc. (%) |
| --- | --- |
| Sim (i.e., SimSiam [5]) | 91.8 |
| Sim + PL | 92.2 |
| Sim + R2S | 91.5 |
| R2S + PL | 91.7 |
| Sim + PL + R2S (Prelax-std) | 92.5 |
| Sim + RotPL | 91.1 |
| Sim + R3S | 91.9 |
| R3S + RotPL | 79.8 |
| Sim + RotPL + R3S (Prelax-rot) | 92.4 |

**Out-of-domain Linear Evaluation.** Besides the in-domain linear evaluation, we also transfer the representations to a target domain. In the out-of-domain linear evaluation results shown in Table 2b, the Prelax objectives still have a clear advantage over the multi-view objective (SimSiam), while Prelax-std and Prelax-rot sometimes enjoy better transferred accuracy than Prelax-all.

### 5.3 Empirical Understandings of Prelax

**Comparison against Alternative Options.** In Table 3a, we compare Prelax against several other relaxation options. SimSiam + margin loss refers to the margin loss discussed in Eq. (9), which uses a scalar threshold $\eta$ to relax the exact alignment in multi-view methods. We tune the margin and report the best-performing value, $\eta = 0.5$; nevertheless, it shows no clear advantage over SimSiam. We then try several options for incorporating a strong augmentation, image rotation: 1) Rotation is the PL baseline that predicts rotation angles [10], which is inferior to multi-view methods (SimSiam). 2) SimSiam + rotation aug. directly applies a random rotation augmentation to each view and trains with the SimSiam loss. However, it leads to lower accuracy, showing that image rotation, as a strong augmentation, hurts the performance of multi-view methods. 3) SimSiam + Rotation loss directly combines the SimSiam loss and the Rotation loss for training, which is still ineffective. 4) Our Prelax shows a significant improvement over SimSiam and the other variants, showing that residual alignment is an effective mechanism for utilizing strong augmentations like rotation.

Figure 3: (a) Representation visualization of our Prelax on the CIFAR-10 test set. Each point represents an image representation and its color denotes the class of the image. (b) On the CIFAR-10 test set, given 10 random queries (not cherry-picked), we retrieve the 15 nearest images in the representation space learned by Prelax (ours).

**Ablation Study.** We perform an ablation study of each component of the Prelax objectives on CIFAR-10. From Table 3b, we notice that simply adding a PL loss alone does not consistently improve over SimSiam; for example, Sim + RotPL causes a 0.7-point drop in test accuracy. With the help of our residual relaxation, however, we improve over the baselines significantly and consistently; for example, Prelax-rot (Sim + RotPL + R3S) brings a 0.6-point improvement in test accuracy. Besides, we can see that the PL loss is necessary for making the residual pretext-aware, without which the performance drops a lot, and the similarity constraint (Sim loss) is also important for avoiding bad cases where augmented images drift far from the anchor image. Therefore, the ablation study shows that the residual relaxation loss, the similarity loss, and the PL loss all matter in our Prelax objectives.

### 5.4 Qualitative Analysis

**Representation Visualization.** To provide an intuitive understanding of the learned representations, we visualize them with t-SNE [26] in Figure 3a. We can see that, in general, our Prelax learns well-separated clusters of representations corresponding to the ground-truth image classes.

**Image Retrieval.** In Figure 3b, we evaluate Prelax on an image retrieval task. Given a random query image (not cherry-picked), the 15 most similar images in representation space are retrieved, with the query itself shown in the first column. We can see that although the unsupervised training of Prelax has no access to labels, the retrieved nearest images are all correctly from the same class as the query and semantically consistent with it.
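A small sketch of the kind of nearest-neighbor retrieval shown in Figure 3b, assuming a frozen encoder and a gallery of test images; the cosine-similarity ranking is our assumption of a standard implementation, not a detail specified in the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_nearest(encoder, query, gallery_images, k=15):
    """Return indices of the k gallery images closest to a single query in representation space."""
    q = F.normalize(encoder(query.unsqueeze(0)), dim=-1)    # 1 x d query feature (query is one CHW image)
    g = F.normalize(encoder(gallery_images), dim=-1)        # N x d gallery features
    sims = (g @ q.t()).squeeze(1)                           # cosine similarity to the query
    return sims.topk(k).indices                             # top-k most similar gallery indices
```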
## 6 Conclusion

In this paper, we proposed a generic method, Prelax (Pretext-aware Residual Relaxation), to account for the (possibly large) semantic shift caused by image augmentations. With pretext-aware learning of the residual relaxation, our method generalizes existing multi-view learning by encoding both pretext-aware and pretext-invariant representations. Experiments show that Prelax significantly outperforms existing multi-view methods on a variety of benchmark datasets.

## Acknowledgement

Yisen Wang is partially supported by the National Natural Science Foundation of China under Grant 62006153, and Project 2020BD006 supported by PKU-Baidu Fund. Jiansheng Yang is supported by the National Science Foundation of China under Grant No. 11961141007. Zhouchen Lin is supported by the NSF China (Nos. 61625301 and 61731018), the NSFC Tianyuan Fund for Mathematics (No. 12026606), and Project 2020BD006 supported by PKU-Baidu Fund.

## References

[1] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019.
[2] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. ICML, 2020.
[3] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020.
[4] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[5] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.
[6] Thomas M Cover. Elements of Information Theory. John Wiley & Sons, 1999.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009.
[8] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. ICCV, 2015.
[9] Zeyu Feng, Chang Xu, and Dacheng Tao. Self-supervised representation learning from multi-domain data. ICCV, 2019.
[10] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. ICLR, 2018.
[11] Jean-Bastien Grill, Florian Strub, Florent Altché, C. Tallec, Pierre H. Richemond, Elena Buchatskaya, C. Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS, 2020.
[12] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. CVPR, 2020.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016.
[14] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. ICLR, 2019.
[15] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[16] Yingming Li, Ming Yang, and Zhongfei Zhang. A survey of multi-view representation learning. IEEE Transactions on Knowledge and Data Engineering, 31(10):1863–1883, 2018.
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014.
[18] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. ICLR, 2017.
[19] Sean Metzger, Aravind Srinivas, Trevor Darrell, and Kurt Keutzer. Evaluating self-supervised pretraining without using labels. arXiv preprint arXiv:2009.07724, 2020.
[20] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. CVPR, 2020.
[21] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV, 2016.
[22] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[23] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
[24] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. NeurIPS, 2020.
[25] Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self-supervised learning from a multi-view perspective. ICLR, 2021.
[26] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
[27] Jiayu Wu, Qixiang Zhang, and Guoxi Xu. Tiny ImageNet challenge, 2017.
[28] Tete Xiao, Xiaolong Wang, Alexei A Efros, and Trevor Darrell. What should not be contrastive in contrastive learning. ICLR, 2021.
[29] Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888, 2017.
[30] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. ECCV, 2016.