ON DATA-AUGMENTATION AND CONSISTENCY-BASED SEMI-SUPERVISED LEARNING

Atin Ghosh & Alexandre H. Thiery
Department of Statistics and Applied Probability
National University of Singapore
atin.ghosh@u.nus.edu, a.h.thiery@nus.edu.sg

Recently proposed consistency-based Semi-Supervised Learning (SSL) methods, such as the $\Pi$-model, temporal ensembling, the mean teacher, or virtual adversarial training, have advanced the state of the art in several SSL tasks. These methods can typically reach performances comparable to their fully supervised counterparts while using only a fraction of labelled examples. Despite these methodological advances, the understanding of these methods is still relatively limited. In this text, we analyse (variations of) the $\Pi$-model in settings where analytically tractable results can be obtained. We establish links with Manifold Tangent Classifiers and demonstrate that the quality of the perturbations is key to obtaining reasonable SSL performance. Importantly, we propose a simple extension of the Hidden Manifold Model that naturally incorporates data-augmentation schemes and offers a framework for understanding and experimenting with SSL methods.

1 INTRODUCTION

Consider a dataset $\mathcal{D} = \mathcal{D}_L \cup \mathcal{D}_U$ comprised of labelled samples $\mathcal{D}_L = \{x_i, y_i\}_{i \in I_L}$ as well as unlabelled samples $\mathcal{D}_U = \{x_i\}_{i \in I_U}$. Semi-Supervised Learning (SSL) is concerned with the use of both the labelled and the unlabelled data for training. In many scenarios, collecting labelled data is difficult, time-consuming or expensive, so that the amount of labelled data can be relatively small when compared to the amount of unlabelled data. The main challenge of SSL is the design of methods that can exploit the information contained in the distribution of the unlabelled data (Zhu05; CSZ09). In modern high-dimensional settings that are common in computer vision, signal processing, Natural Language Processing (NLP) or genomics, standard graph/distance-based methods (BC01; ZG02; ZGL03; BNS06; DSST19) that are successful in low-dimensional scenarios are difficult to implement: in high-dimensional spaces, it is often difficult to design sensible notions of distance that can be exploited within these methods. We refer the interested reader to the book-length treatments (Zhu05; CSZ09) for a discussion of other approaches.

The manifold assumption is the fundamental structural property exploited in most modern approaches to SSL: high-dimensional data samples lie in a small neighbourhood of a low-dimensional manifold (TP91; BJ03; Pey09; Cay05; RDV+11). In computer vision, the presence of this low-dimensional structure is instrumental to the success of (variational) autoencoders and generative adversarial networks: large datasets of images can often be parametrized by a relatively small number of degrees of freedom. Exploiting the unlabelled data to uncover this low-dimensional structure is crucial to the design of efficient SSL methods. A recent and independent evaluation of several modern methods for SSL can be found in (OOR+18); it is found there that consistency-based methods (BAP14; SJT16; LA16; TV17; MMIK18; LZL+18; GSA+20), the topic of this paper, achieve state-of-the-art performance in many realistic scenarios.

Contributions: consistency-based semi-supervised learning methods have recently been shown to achieve state-of-the-art results.
Despite these methodological advances, the understanding of these methods is still relatively limited when compared to the fully-supervised setting (SMG13; AS17; SBD+18; TZ15; SZT17). In this article, we do not propose a new SSL method. Instead, we analyse consistency-based methods in settings where analytically tractable results can be obtained, where the data samples lie in the neighbourhood of well-defined and tractable low-dimensional manifolds, and where simple and controlled experiments can be carried out. We establish links with Manifold Tangent Classifiers and demonstrate that consistency-based SSL methods are in general more powerful, since they can better exploit the local geometry of the data-manifold if efficient data-augmentation/perturbation schemes are used. Furthermore, in Section 4.1 we show that the popular Mean Teacher method and the conceptually simpler $\Pi$-model share the same solutions in the regime where the data-augmentations are small; this confirms the often-reported claim that the data-augmentation schemes leveraged by recent SSL, as well as fully unsupervised, algorithms are instrumental to their success. Finally, in Section 4.3 we propose an extension of the Hidden Manifold Model (GMKZ19; GLK+20). This generative model allows us to investigate the properties of consistency-based SSL methods, taking into account the data-augmentation process and the underlying low-dimensionality of the data, in a simple and principled manner, and without relying on a specific dataset.

To gain understanding of SSL, as well as of self-supervised learning methods, we believe it important to develop a framework that (i) can take into account the geometry of the data, (ii) allows the study of the influence of the quality of the data-augmentation schemes, and (iii) does not rely on any particular dataset. While the understanding of fully-supervised methods has largely been driven by the analysis of simplified model architectures (e.g. linear and two-layered models, large-dimension asymptotics such as the Neural Tangent Kernel), these analytical tools alone are unlikely to be enough to explain the mechanisms responsible for the success of SSL and self-supervised learning methods (CKNH20; GSA+20), since they do not, and cannot easily be extended to, account for the geometry of the data and data-augmentation schemes. Our proposed framework offers a small step in that direction.

2 CONSISTENCY-BASED SEMI-SUPERVISED LEARNING

For concreteness and clarity of exposition, we focus the discussion on classification problems. The arguments described in the remainder of this article can be adapted without any difficulty to other situations such as regression or image segmentation. Assume that the samples $x_i \in \mathcal{X} \subset \mathbb{R}^D$ can be represented as $D$-dimensional vectors and that the labels belong to $C \ge 2$ possible classes, $y_i \in \mathcal{Y} \equiv \{1, \ldots, C\}$. Consider a mapping $F_\theta : \mathbb{R}^D \to \mathbb{R}^C$ parametrized by $\theta \in \Theta \subset \mathbb{R}^{|\Theta|}$. This can be a neural network, although that is not necessary. For $x \in \mathcal{X}$, the quantity $F_\theta(x)$ can represent the probabilistic output of the classifier or, for example, the pre-softmax activations. Empirical risk minimization consists in minimizing the function
$$\mathcal{L}_L(\theta) = \frac{1}{|\mathcal{D}_L|} \sum_{i \in I_L} \ell\big( F_\theta(x_i), y_i \big)$$
for a loss function $\ell : \mathbb{R}^C \times \mathcal{Y} \to \mathbb{R}$. Maximum likelihood estimation corresponds to choosing the loss function as the cross-entropy; a minimal sketch is given below.
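The following sketch illustrates the supervised objective $\mathcal{L}_L$ with the cross-entropy loss in PyTorch; the network architecture, dimensions and variable names are placeholders for illustration and are not the architecture used later in the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical classifier F_theta : R^D -> R^C returning pre-softmax activations.
D, C = 100, 10
F_theta = torch.nn.Sequential(torch.nn.Linear(D, 64), torch.nn.ELU(), torch.nn.Linear(64, C))

def empirical_risk(x_lab, y_lab):
    """L_L(theta) = (1/|D_L|) sum_i ell(F_theta(x_i), y_i), with ell the cross-entropy."""
    return F.cross_entropy(F_theta(x_lab), y_lab)  # cross_entropy averages over the batch

# Toy labelled mini-batch.
x = torch.randn(16, D)
y = torch.randint(0, C, (16,))
loss = empirical_risk(x, y)
loss.backward()
```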
The optimal parameter $\theta \in \Theta$ is found by a variant of stochastic gradient descent (RM51) with estimated gradient $\sum_{i \in B_L} \nabla_\theta\, \ell(F_\theta(x_i), y_i)$ for a mini-batch $B_L$ of labelled samples. Consistency-based SSL algorithms regularize the learning by enforcing that the learned function $x \mapsto F_\theta(x)$ respects local derivative and invariance constraints. For simplicity, assume that the mapping $x \mapsto F_\theta(x)$ is deterministic, although the use of dropout (SHK+14) and other sources of stochasticity are popular in practice. The $\Pi$-model (LA16; SJT16) makes use of a stochastic mapping $S : \mathcal{X} \times \Omega \to \mathcal{X}$ that maps a sample $x \in \mathcal{X}$ and a source of randomness $\omega \in \Omega \subset \mathbb{R}^{d_\Omega}$ to another sample $S_\omega(x) \in \mathcal{X}$. The mapping $S$ describes a stochastic data-augmentation process. In computer vision, popular data-augmentation schemes include random translations, rotations, dilatations, croppings, flippings, elastic deformations, colour jittering, addition of speckle noise, and many more domain-specific variants. In NLP, synonym replacements, insertions and deletions, and back-translations are often used, although it is often more difficult to implement these data-augmentation strategies. In a purely supervised setting, data-augmentation can be used as a regularizer: instead of directly minimizing $\mathcal{L}_L$, one can minimize
$$\frac{1}{|\mathcal{D}_L|} \sum_{i \in I_L} \mathbb{E}_\omega\big[ \ell\big( F_\theta[S_\omega(x_i)], y_i \big) \big].$$
In practice, data-augmentation regularization, although a simple strategy, is often crucial to obtaining good generalization properties (PW17; CZM+18; LBC17; PCZ+19). The idea of regularizing by enforcing robustness to the injection of noise can be traced back at least to (Bis95). In the $\Pi$-model, the data-augmentation mapping $S$ is used to define a consistency regularization term,
$$R(\theta) = \frac{1}{|\mathcal{D}|} \sum_{i \in I_L \cup I_U} \mathbb{E}_\omega\Big\{ \big\| F_\theta[S_\omega(x_i)] - F_{\theta'}(x_i) \big\|^2 \Big\}. \qquad (1)$$
The notation $\theta'$ designates a copy of the parameter $\theta$, i.e. $\theta' = \theta$, and emphasizes that when differentiating the consistency regularization term $\theta \mapsto R(\theta)$, one does not differentiate through $\theta'$. In practice, a stochastic estimate of $R(\theta)$ is obtained as follows: for a mini-batch $B$ of samples $\{x_i\}_{i \in B}$, the current value $\theta' \in \Theta$ of the parameter and the current predictions $f_i \equiv F_{\theta'}(x_i)$, the quantity $\frac{1}{|B|} \sum_{i \in B} \big\| F_\theta[S_\omega(x_i)] - f_i \big\|^2$ is an approximation of $R(\theta)$; a sketch is given below. There are many variants (e.g. use of different norms, different manners of injecting noise), but the general idea is to force the learned function $x \mapsto F_\theta(x)$ to be locally invariant to the data-augmentation scheme $S$. Several extensions such as the Mean Teacher (TV17) and VAT (MMIK18) schemes have recently been proposed and have been shown to lead to good results in many SSL tasks. The recently proposed and state-of-the-art BYOL approach (GSA+20) relies on mechanisms that are very close to the consistency regularization methods discussed in this text.

If one recalls the manifold assumption, this approach is natural: since the samples corresponding to different classes lie on separate manifolds, the function $F_\theta : \mathcal{X} \to \mathbb{R}^C$ should be constant on each one of these manifolds. Since the correct value of $F_\theta$ is typically well approximated or known for labelled samples $(x_i, y_i) \in \mathcal{D}_L$, the consistency regularization term in equation (1) helps propagate these known values across these manifolds. This mechanism is indeed similar to standard SSL graph-based approaches such as label propagation (ZG02). Graph-based methods are difficult to implement directly in computer vision or NLP, where a meaningful notion of distance is not available.
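One possible way to compute the mini-batch estimate of $R(\theta)$ in equation (1) is sketched below, with the stop-gradient copy $\theta'$ realised through torch.no_grad(). The isotropic Gaussian perturbation used as `augment` is only a stand-in for a generic $S_\omega$ (Section 3.1 argues that such isotropic noise is a poor augmentation in practice), and the network and variable names are illustrative assumptions.

```python
import torch

def consistency_loss(F, x_batch, augment):
    """Mini-batch estimate of the consistency term R(theta) in equation (1):
    (1/|B|) * sum_i || F_theta[S_omega(x_i)] - f_i ||^2 with f_i = F_{theta'}(x_i).
    Targets are computed under torch.no_grad(), playing the role of theta'."""
    with torch.no_grad():
        f_i = F(x_batch)                      # f_i = F_{theta'}(x_i), no gradient
    f_aug = F(augment(x_batch))               # F_theta[S_omega(x_i)]
    return ((f_aug - f_i) ** 2).sum(dim=1).mean()

# Toy usage with a random network and a placeholder Gaussian perturbation.
D, C = 100, 10
F_theta = torch.nn.Sequential(torch.nn.Linear(D, 64), torch.nn.ELU(), torch.nn.Linear(64, C))
augment = lambda x: x + 0.1 * torch.randn_like(x)
x_unlab = torch.randn(32, D)
r_hat = consistency_loss(F_theta, x_unlab, augment)
```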
This interpretation reveals that it is crucial to include the labelled samples in the regularization term of equation (1) in order to help propagate the information contained in the labelled samples to the unlabelled samples. Our numerical experiments suggest that, in the standard setting where the number of labelled samples is much lower than the number of unlabelled samples, i.e. $|\mathcal{D}_L| \ll |\mathcal{D}_U|$, the formulation (1) of the consistency regularization leads to sub-optimal results and convergence issues: the information contained in the labelled data is swamped by the number of unlabelled samples. In all our experiments, we have adopted instead the following regularization term,
$$R(\theta) = \frac{1}{|\mathcal{D}_L|} \sum_{i \in I_L} \mathbb{E}_\omega\Big\{ \big\| F_\theta[S_\omega(x_i)] - F_{\theta'}(x_i) \big\|^2 \Big\} + \frac{1}{|\mathcal{D}_U|} \sum_{j \in I_U} \mathbb{E}_\omega\Big\{ \big\| F_\theta[S_\omega(x_j)] - F_{\theta'}(x_j) \big\|^2 \Big\}, \qquad (2)$$
which balances the labelled and unlabelled data samples more efficiently. Furthermore, it is clear that the quality and variety of the data-augmentation scheme $S : \mathcal{X} \times \Omega \to \mathcal{X}$ is pivotal to the success of consistency-based SSL methods. We argue in this article that it is the dominant factor contributing to the success of this class of methods. Effort spent on building efficient local data-augmentation schemes will be rewarded in terms of generalization performance. Designing good data-augmentation schemes is an efficient manner of injecting expert/prior knowledge into the learning process; it is done by leveraging an understanding of the local geometry of the data manifold. As usual and not surprisingly (NGP98; MHF+12), in data-scarce settings any type of domain knowledge needs to be exploited, and we argue that consistency regularization approaches to SSL are instances of this general principle.

3 APPROXIMATE MANIFOLD TANGENT CLASSIFIER

It has long been known (SLDV98) that exploiting the knowledge of derivatives, or more generally enforcing local invariance properties, can greatly enhance the performance of standard classifiers/regressors (HK02; CS02). In the context of deep learning, the Manifold Tangent Classifier (RDV+11) is yet another illustration of this idea. Consider the data manifold $\mathcal{M} \subset \mathcal{X} \subset \mathbb{R}^D$ and assume that the data samples lie in a neighbourhood of it. For $x \in \mathcal{M}$, consider as well the tangent plane $\mathcal{T}_x$ to $\mathcal{M}$ at $x$. Assuming that the manifold $\mathcal{M}$ is of dimension $1 \le d \le D$, the tangent plane $\mathcal{T}_x$ is also of dimension $d$, with an orthonormal basis $e^x_1, \ldots, e^x_d \in \mathbb{R}^D$. This informally means that, for suitably small coefficients $\omega_1, \ldots, \omega_d \in \mathbb{R}$, the transformed sample $\widetilde{x} \in \mathcal{X}$ defined as $\widetilde{x} = x + \sum_{j=1}^d \omega_j\, e^x_j$ also lies on, or is very close to, the data manifold $\mathcal{M}$. A possible stochastic data-augmentation scheme can therefore be defined as $S_\omega(x) = x + V_\omega$ where $V_\omega = \sum_{j=1}^d \omega_j\, e^x_j$. If $\omega$ is a multivariate $d$-dimensional centred Gaussian random vector with suitably small covariance matrix, the perturbation vector $V_\omega$ is also centred and normally distributed. To enforce that the function $x \mapsto F_\theta(x)$ is locally approximately constant along the manifold $\mathcal{M}$, one can thus penalize the derivatives of $F_\theta$ at $x$ in the directions $V_\omega$. Denoting by $J_x \in \mathbb{R}^{C \times D}$ the Jacobian with respect to $x \in \mathbb{R}^D$ of $F_\theta$ at $x \in \mathcal{M}$, this can be implemented by adding a penalization term of the type $\mathbb{E}_\omega\big[ \| J_x V_\omega \|^2 \big] = \mathrm{Tr}\big( \Gamma\, J_x^\top J_x \big)$, where $\Gamma \in \mathbb{R}^{D \times D}$ is the covariance matrix of the random vector $\omega \mapsto V_\omega$. This type of regularization of the Jacobian along the data manifold is for example used in (BNS06); a sketch of this penalty is given below.
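The following is a hedged sketch of the tangent-direction penalty $\mathbb{E}_\omega[\|J_x V_\omega\|^2]$ described above. It assumes the tangent directions $e^x_j$ are already available (in the Manifold Tangent Classifier they would come from a contractive auto-encoder); here a random orthonormal basis stands in for them, and the product $J_x V_\omega$ is obtained with a Jacobian-vector product rather than by forming $J_x$ explicitly.

```python
import torch
from torch.autograd.functional import jvp

def tangent_jacobian_penalty(F, x, tangent_dirs, sigma=0.1):
    """Monte-Carlo estimate of E_w[ || J_x V_w ||^2 ] with V_w = sum_j w_j e^x_j.

    `tangent_dirs` is a (d, D) matrix whose rows approximate tangent vectors
    e^x_j at x; in the MTC these would come from a contractive auto-encoder,
    here they are simply assumed given.
    """
    w = sigma * torch.randn(tangent_dirs.shape[0])        # centred Gaussian coefficients
    v = (w[:, None] * tangent_dirs).sum(dim=0)            # perturbation vector V_w in R^D
    _, jv = jvp(F, (x,), (v,), create_graph=True)         # J_x V_w via a Jacobian-vector product
    return (jv ** 2).sum()

# Toy usage: a random point and a random (placeholder) orthonormal tangent basis.
D, C, d = 100, 10, 5
F_theta = torch.nn.Sequential(torch.nn.Linear(D, 64), torch.nn.Tanh(), torch.nn.Linear(64, C))
x = torch.randn(D)
e_x = torch.linalg.qr(torch.randn(D, d)).Q.T              # d orthonormal rows in R^D
penalty = tangent_jacobian_penalty(F_theta, x, e_x)
```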
More generally, if one assumes that for any $(x, \omega) \in \mathcal{X} \times \Omega$ we have $S^\varepsilon_\omega(x) = x + \varepsilon\, D(x, \omega) + \mathcal{O}(\varepsilon^2)$ for some derivative mapping $D : \mathcal{X} \times \Omega \to \mathcal{X}$, it follows that
$$\lim_{\varepsilon \to 0}\; \frac{1}{\varepsilon^2}\, \mathbb{E}_\omega \big\| F_\theta[S^\varepsilon_\omega(x)] - F_\theta(x) \big\|^2 \;=\; \mathbb{E}_\omega \big\| J_x\, D(x, \omega) \big\|^2 \;=\; \mathrm{Tr}\big( \Gamma_{x,S}\, J_x^\top J_x \big),$$
where $\Gamma_{x,S}$ is the covariance matrix of the $\mathcal{X}$-valued random vector $\omega \mapsto D(x, \omega) \in \mathcal{X}$. This shows that consistency-based methods can be understood as approximate Jacobian regularization methods, as proposed in (SLDV98; RDV+11).

3.1 LIMITATIONS

In practice, even if many local dimension-reduction techniques have been proposed, it is still relatively difficult to obtain a good parametrization of the data manifold. The Manifold Tangent Classifier (MTC) (RDV+11) implements this idea by first extracting, in an unsupervised manner, a good representation of the dataset $\mathcal{D}$ using a Contractive Auto-Encoder (CAE) (RVM+11). This CAE can subsequently be leveraged to obtain an approximate basis of each tangent plane $\mathcal{T}_{x_i}$ for $x_i \in \mathcal{D}$, which can then be used for penalizing the Jacobian of the mapping $x \mapsto F_\theta(x)$ in the direction of the tangent plane to $\mathcal{M}$ at $x$.

The above discussion shows that the somewhat simplistic approach consisting in adding isotropic Gaussian noise to the data samples is unlikely to deliver satisfying results. It is equivalent to penalizing the Frobenius norm $\|J_x\|_F^2$ of the Jacobian of the mapping $x \mapsto F_\theta(x)$; in a linear model, that is equivalent to the standard ridge regularization. This mechanism does not take into account the local geometry of the data-manifold at all. Nevertheless, in medical imaging applications where scans are often contaminated by speckle noise, this class of approaches, which can be thought of as adding artificial speckle noise, can help mitigate over-fitting (DRS+18). There are many situations where, because of data scarcity or the sheer difficulty of unsupervised representation learning in general, domain-specific data-augmentation schemes lead to much better regularization than Jacobian penalization. Furthermore, as schematically illustrated in Figure 1, Jacobian penalization techniques are not efficient at learning the highly non-linear manifolds that are common, for example, in computer vision. For instance, in "pixel space", a simple image translation is a highly non-linear transformation that is only well approximated by a first-order expansion for very small translations. In other words, if $x \in \mathcal{X}$ represents an image and $g(x, v)$ is its translated version by a vector $v$, the approximation $g(x, v) \approx x + \nabla_v g(x)$, with $\nabla_v g(x) \equiv \lim_{\varepsilon \to 0}\, \big( g(x, \varepsilon v) - g(x) \big)/\varepsilon$, becomes poor as soon as the translation vector $v$ is not extremely small.

Figure 1: Left: Jacobian (i.e. first-order) penalization methods are short-sighted and do not fully exploit the data-manifold. Right: data-augmentation respecting the geometry of the data-manifold.

In computer vision, translations, rotations and dilatations are often used as the sole data-augmentation schemes: this leads to a poor local exploration of the data-manifold since this type of transformation only generates a very low-dimensional exploration manifold. More precisely, the exploration manifold emanating from a sample $x_0 \in \mathcal{X}$, i.e. $\{S(x_0, \omega) : \omega \in \Omega\}$, is very low-dimensional: its dimension is much lower than the dimension $d$ of the data-manifold $\mathcal{M}$.
Enriching the set of data-augmentation degrees of freedom with transformations such as elastic deformations or non-linear pixel-intensity shifts is crucial to obtaining a high-dimensional local exploration manifold that can help propagate the information on the data-manifold efficiently (CZM+19; PCZ+19).

4 ASYMPTOTIC PROPERTIES

4.1 FLUID LIMIT

Consider the standard $\Pi$-model trained with a standard Stochastic Gradient Descent (SGD). Denote by $\theta_k \in \Theta$ the current value of the parameter and by $\eta > 0$ the learning rate. We have
$$\theta_{k+1} = \theta_k - \eta\, \nabla_\theta \left[ \frac{1}{|B_L|} \sum_{i \in B_L} \ell\big( F_{\theta_k}(x_i), y_i \big) + \frac{\lambda}{|B_L|} \sum_{j \in B_L} \big\| F_{\theta_k}(S_\omega[x_j]) - f_j \big\|^2 + \frac{\lambda}{|B_U|} \sum_{j \in B_U} \big\| F_{\theta_k}(S_\omega[x_j]) - f_j \big\|^2 \right] \qquad (3)$$
for a parameter $\lambda > 0$ that controls the trade-off between the supervised and consistency losses, subsets $B_L$ and $B_U$ of labelled and unlabelled data samples, and $f_j \equiv F_{\theta'}(x_j)$ with $\theta' \equiv \theta_k$, as discussed in Section 2. The increment on the right-hand side is an unbiased estimate of $-\eta\, \nabla_\theta \big[ \mathcal{L}_L(\theta_k) + \lambda\, R(\theta_k) \big]$ with variance of order $\mathcal{O}(\eta^2)$, where the regularization term $R(\theta_k)$ is described in equation (2). It follows from standard fluid-limit approximations for Markov processes (EK09, Section 4.8) that, under mild regularity and growth assumptions and as $\eta \to 0$, the appropriately time-rescaled trajectory $\{\theta_k\}_{k \ge 0}$ can be approximated by the trajectory of an Ordinary Differential Equation (ODE), as stated below.

Proposition 4.1. Let $D([0, T], \mathbb{R}^{|\Theta|})$ be the usual space of càdlàg $\mathbb{R}^{|\Theta|}$-valued functions on a bounded time interval $[0, T]$, endowed with the standard Skorohod topology. Consider the update equation (3) with learning rate $\eta > 0$ and define the continuous-time process $\bar{\theta}^\eta(t) = \theta_{[t/\eta]}$. The sequence of processes $\bar{\theta}^\eta \in D([0, T], \mathbb{R}^{|\Theta|})$ converges weakly in $D([0, T], \mathbb{R}^{|\Theta|})$, as $\eta \to 0$, to the solution of the ordinary differential equation
$$\dot{\theta}_t = -\nabla \big[ \mathcal{L}_L(\theta_t) + \lambda\, R(\theta_t) \big]. \qquad (4)$$

The article (TV17) proposes the Mean Teacher model, an averaging approach related to the standard Polyak-Ruppert averaging scheme (Pol90; PJ92), which modifies the consistency regularization term of equation (2) by replacing the parameter $\theta'$ by an exponential moving average (EMA). In practical terms, this simply means that, instead of defining $f_j = F_{\theta'}(x_j)$ with $\theta' = \theta_k$ in equation (3), one sets $f_j = F_{\theta_{\mathrm{avg},k}}(x_j)$, where the EMA process $\{\theta_{\mathrm{avg},k}\}_{k \ge 0}$ is defined through the recursion
$$\theta_{\mathrm{avg},k} = (1 - \alpha \eta)\, \theta_{\mathrm{avg},k-1} + \alpha \eta\, \theta_k,$$
where the coefficient $\alpha > 0$ controls the time-scale of the averaging process. The use of the EMA process $\{\theta_{\mathrm{avg},k}\}_{k \ge 0}$ helps smooth out the stochasticity of the process $\theta_k$. Similarly to Proposition 4.1, as $\eta \to 0$, the joint process $(\bar{\theta}^\eta_t, \bar{\theta}^\eta_{\mathrm{avg},t}) \equiv (\theta^\eta_{[t/\eta]}, \theta^\eta_{\mathrm{avg},[t/\eta]})$ converges to the solution of the following ordinary differential equation,
$$\dot{\theta}_t = -\nabla \big[ \mathcal{L}_L(\theta_t) + \lambda\, R(\theta_t, \theta_{\mathrm{avg},t}) \big], \qquad \dot{\theta}_{\mathrm{avg},t} = -\alpha\, (\theta_{\mathrm{avg},t} - \theta_t), \qquad (5)$$
where the notation $R(\theta_t, \theta_{\mathrm{avg},t})$ designates the same quantity as the one described in equation (2), but with an emphasis on the dependency on the EMA process. At convergence, $(\theta_t, \theta_{\mathrm{avg},t}) \to (\theta_\infty, \theta_{\mathrm{avg},\infty})$, one must necessarily have $\theta_\infty = \theta_{\mathrm{avg},\infty}$, confirming that, in the regime of small learning rate $\eta \to 0$, the Mean Teacher method converges, albeit often more rapidly, towards the same solution as the more standard $\Pi$-model. This indicates that the improved performances of the Mean Teacher approach sometimes reported in the literature are either not statistically meaningful, or due to poorly executed comparisons, or due to mechanisms not captured by the $\eta \to 0$ asymptotic.
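For concreteness, one possible implementation sketch of the SGD update of equation (3), together with the Mean Teacher EMA recursion $\theta_{\mathrm{avg},k} = (1 - \alpha\eta)\,\theta_{\mathrm{avg},k-1} + \alpha\eta\,\theta_k$, is given below. All names (`train_step`, `augment`, the optimizer settings) are illustrative assumptions and not the authors' code.

```python
import copy
import torch

def train_step(F_theta, F_avg, opt, x_lab, y_lab, x_unlab, augment,
               lam=10.0, alpha_eta=0.01, use_mean_teacher=False):
    """One SGD step of the Pi-model (equation 3); optionally the Mean Teacher
    variant, in which targets come from the EMA copy of the parameters."""
    teacher = F_avg if use_mean_teacher else F_theta
    with torch.no_grad():                       # targets f_j use theta' (or theta_avg), no gradient
        f_lab, f_unlab = teacher(x_lab), teacher(x_unlab)

    sup = torch.nn.functional.cross_entropy(F_theta(x_lab), y_lab)
    cons = ((F_theta(augment(x_lab)) - f_lab) ** 2).sum(1).mean() \
         + ((F_theta(augment(x_unlab)) - f_unlab) ** 2).sum(1).mean()

    opt.zero_grad()
    (sup + lam * cons).backward()
    opt.step()

    # EMA recursion: theta_avg <- (1 - alpha*eta) * theta_avg + alpha*eta * theta
    with torch.no_grad():
        for p_avg, p in zip(F_avg.parameters(), F_theta.parameters()):
            p_avg.mul_(1 - alpha_eta).add_(alpha_eta * p)

# Toy usage on random data (2 classes, D = 100); augment is a placeholder for S_omega.
F_theta = torch.nn.Sequential(torch.nn.Linear(100, 64), torch.nn.ELU(), torch.nn.Linear(64, 2))
F_avg = copy.deepcopy(F_theta)
opt = torch.optim.SGD(F_theta.parameters(), lr=0.01, momentum=0.9)
augment = lambda x: x + 0.1 * torch.randn_like(x)
train_step(F_theta, F_avg, opt,
           x_lab=torch.randn(10, 100), y_lab=torch.randint(0, 2, (10,)),
           x_unlab=torch.randn(64, 100), augment=augment, use_mean_teacher=True)
```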
Indeed, several recently proposed consistency-based SSL algorithms (BCG+19; SBL+20; XDH+19) achieve state-of-the-art performance across diverse datasets without employing any exponential averaging process. These results are achieved by leveraging more sophisticated data-augmentation schemes such as RandAugment (CZSL19), back-translation (ALAC17) or Mixup (ZCDLP17).

4.2 MINIMIZERS ARE HARMONIC FUNCTIONS

To understand better the properties of the solutions, we consider a simplified setting further exploited in Section 4.3. Assume that $F : \mathcal{X} \subset \mathbb{R}^D \to \mathbb{R}$, that $\mathcal{Y} \subset \mathbb{R}$, and that, for every $y_i \in \mathcal{Y} \subset \mathbb{R}$, the loss function $f \mapsto \ell(f, y_i)$ is uniquely minimized at $f = y_i$. We further assume that the data-manifold $\mathcal{M} \subset \mathbb{R}^D$ can be globally parametrized by a smooth and bijective mapping $\Phi : \mathbb{R}^d \to \mathcal{M} \subset \mathbb{R}^D$. Similarly to Section 2, we consider a data-augmentation scheme that can be described as $S_{\varepsilon\omega}(x) = \Phi(z + \varepsilon\omega)$ for $z = \Phi^{-1}(x)$ and a sample $\omega$ from an $\mathbb{R}^d$-valued centred and isotropic Gaussian distribution. We consider a finite set of labelled samples $\{x_i, y_i\}_{i \in I_L}$, with $x_i = \Phi(z_i)$ and $z_i \in \mathbb{R}^d$ for $i \in I_L$. We choose to model the large number of unlabelled data samples as a continuum distributed on the data manifold $\mathcal{M}$, namely as the push-forward through the mapping $\Phi$ of a probability distribution $\mu(dz)$ whose support is $\mathbb{R}^d$. This means that an empirical average of the type $(1/|\mathcal{D}_U|) \sum_{i \in I_U} \varphi(x_i)$ can be replaced by $\int \varphi[\Phi(z)]\, \mu(dz)$. We investigate the regime $\varepsilon \to 0$ and, similarly to Section 2, the minimization of the consistency-regularized objective
$$\frac{1}{|\mathcal{D}_L|} \sum_{i \in I_L} \ell\big( F_\theta(x_i), y_i \big) + \frac{\lambda}{\varepsilon^2} \int_{\mathbb{R}^d} \mathbb{E}_\omega\Big\{ \big\| F_\theta[S_{\varepsilon\omega}(\Phi(z))] - F_\theta(\Phi(z)) \big\|^2 \Big\}\, \mu(dz). \qquad (6)$$
For notational convenience, set $f_\theta \equiv F_\theta \circ \Phi$. Since $S_{\varepsilon\omega}[\Phi(z)] = \Phi(z + \varepsilon\omega)$, as $\varepsilon \to 0$ the quantity $\frac{1}{\varepsilon^2}\, \mathbb{E}_\omega\big\{ \| F_\theta[S_{\varepsilon\omega}(\Phi(z))] - F_\theta(\Phi(z)) \|^2 \big\}$ converges to $\| \nabla_z f_\theta \|^2$ and the objective function (6) approaches the quantity
$$G(f_\theta) \equiv \frac{1}{|\mathcal{D}_L|} \sum_{i \in I_L} \ell\big( f_\theta(z_i), y_i \big) + \lambda \int_{\mathbb{R}^d} \| \nabla_z f_\theta(z) \|^2\, \mu(dz). \qquad (7)$$
A minimizer $f : \mathbb{R}^d \to \mathbb{R}$ of the functional $G$ that is consistent with the labelled data, i.e. $f(z_i) = y_i$ for $i \in I_L$, is a minimizer of the energy functional $f \mapsto \int_{\mathbb{R}^d} \| \nabla_z f(z) \|^2\, \mu(dz)$ subject to the constraints $f(z_i) = y_i$. It is the variational formulation of the Poisson equation
$$\Delta f(z) = 0 \ \ \text{for } z \in \mathbb{R}^d \setminus \{z_i\}_{i \in I_L}, \qquad f(z_i) = y_i \ \ \text{for } i \in I_L. \qquad (8)$$
Note that the solution does not depend on the regularization parameter $\lambda$ in the regime $\varepsilon \to 0$: this indicates, as will be discussed in Section 4.3 in detail, that the generalization properties of consistency-based SSL methods will typically be insensitive to this parameter, at least in the regime of small data-augmentation. Furthermore, equation (8) shows that consistency-based SSL methods are indeed based on the same principles as more standard graph-based approaches such as Label Propagation (ZG02): solutions are gradient/Laplacian-penalized interpolating functions.

In Figure 2, we consider the case $D = d = 2$ with the trivial mapping $\Phi(x) = x$. We consider labelled data situated on the left (resp. right) boundary of the unit square and corresponding to the label $y = 0$ (resp. $y = 1$). For simplicity, we choose the loss function $\ell(f, y) = \frac{1}{2}(f - y)^2$ and parametrize $F_\theta \equiv f_\theta$ with a neural network with a single hidden layer of $N = 100$ neurons. As expected, the $\Pi$-model converges to the solution of the Poisson equation (8) in the unit square with boundary conditions $f(u, v) = 0$ for $u = 0$ and $f(u, v) = 1$ for $u = 1$; a training sketch is given below.

Figure 2: Labelled data samples with class $y = 0$ (green triangle) and $y = +1$ (red dot) are placed on the left/right boundary of the unit square. Unlabelled data samples (blue stars) are uniformly placed within the unit square. We consider a simple regression setting with loss function $\ell(f, y) = \frac{1}{2}(f - y)^2$. Left: randomly initialized neural network. Middle: labelled/unlabelled data. Right: solution $f$ obtained by training a standard $\Pi$-model; it is the harmonic function $f(u, v) = u$, as described by equation (8).
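The experiment of Figure 2 can be reproduced, at least qualitatively, with a short script along the following lines. The step count, learning rate, noise scale and activation are guesses rather than the settings used by the authors, and the trained network is only expected to approximate the harmonic interpolant $f(u, v) = u$.

```python
import torch

torch.manual_seed(0)

# D = d = 2, Phi = identity. Labelled points on the left boundary (u = 0, y = 0)
# and right boundary (u = 1, y = 1); unlabelled points uniform in the unit square.
n_lab, n_unlab = 20, 1000
x_left = torch.stack([torch.zeros(n_lab // 2), torch.rand(n_lab // 2)], dim=1)
x_right = torch.stack([torch.ones(n_lab // 2), torch.rand(n_lab // 2)], dim=1)
x_lab = torch.cat([x_left, x_right])
y_lab = torch.cat([torch.zeros(n_lab // 2), torch.ones(n_lab // 2)])
x_unlab = torch.rand(n_unlab, 2)

# Single hidden layer with N = 100 neurons, scalar output f_theta(u, v).
f = torch.nn.Sequential(torch.nn.Linear(2, 100), torch.nn.Tanh(), torch.nn.Linear(100, 1))
opt = torch.optim.SGD(f.parameters(), lr=0.05, momentum=0.9)

lam, eps = 1.0, 0.05
for step in range(5000):
    with torch.no_grad():                      # stop-gradient targets f_i
        t_lab, t_unlab = f(x_lab), f(x_unlab)
    sup = 0.5 * ((f(x_lab).squeeze(1) - y_lab) ** 2).mean()
    cons = ((f(x_lab + eps * torch.randn_like(x_lab)) - t_lab) ** 2).mean() \
         + ((f(x_unlab + eps * torch.randn_like(x_unlab)) - t_unlab) ** 2).mean()
    opt.zero_grad()
    (sup + lam * cons).backward()
    opt.step()

# After training, f(u, v) should be close to the harmonic interpolant u.
grid = torch.tensor([[0.25, 0.5], [0.5, 0.5], [0.75, 0.5]])
print(f(grid).squeeze(1))   # expected to be roughly (0.25, 0.5, 0.75)
```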
4.3 GENERATIVE MODEL FOR SEMI-SUPERVISED LEARNING

As has been made clear throughout this text, SSL methods crucially rely on the dependence structure of the data. The existence and exploitation of a much lower-dimensional manifold $\mathcal{M}$ supporting the data samples is instrumental to this class of methods. Furthermore, the performance of consistency-based SSL approaches is intimately related to the data-augmentation schemes they are based upon. Consequently, in order to understand the mechanisms at play when consistency-based SSL methods are used to uncover the structures present in real datasets, it is important to build simplified and tractable generative models of data that (1) respect these low-dimensional structures and (2) allow the design of efficient data-augmentation schemes. Several articles have investigated the influence of the dependence structures present in the data on the learning algorithm (BM13; Mos16). Here, we follow the Hidden Manifold Model (HMM) framework proposed in (GMKZ19; GLK+20), where the authors describe a model of synthetic data concentrating near low-dimensional structures and analyze the learning curve associated to a class of two-layered neural networks.

Figure 3: Left: for a fixed data-augmentation scheme, generalization properties (test NLL as a function of the epoch) for $\lambda \in \{1, 10, 100\}$, spanning two orders of magnitude, compared to the unregularized baseline. Right: influence of the amount of data-augmentation ($\varepsilon \in \{0.03, 0.1, 0.3, 1.0\}$) on the generalization properties.

Low-dimensional structure: similarly to Section 4.2, assume that the $D$-dimensional data samples $x_i \in \mathcal{X}$ can be expressed as $x_i = \Phi(z_i) \in \mathbb{R}^D$ for a fixed smooth mapping $\Phi : \mathbb{R}^d \to \mathbb{R}^D$. In other words, the data-manifold $\mathcal{M}$ is $d$-dimensional and the mapping $\Phi$ can be used to parametrize it. The mapping $\Phi$ is chosen to be a neural network with a single hidden layer of $H$ neurons, although other choices are indeed possible. For $z = (z_1, \ldots, z_d) \in \mathbb{R}^d$, set
$$\Phi(z) = A_{1 \to 2}\, \varphi\big( A_{0 \to 1}\, z + b_1 \big)$$
for matrices $A_{0 \to 1} \in \mathbb{R}^{H \times d}$ and $A_{1 \to 2} \in \mathbb{R}^{D \times H}$, a bias vector $b_1 \in \mathbb{R}^H$, and a non-linearity $\varphi : \mathbb{R} \to \mathbb{R}$ applied element-wise. In all our experiments, we use the ELU non-linearity. We adopt the standard normalization $(A_{0 \to 1})_{i,j} = w^{(1)}_{i,j} / \sqrt{d}$ and $(A_{1 \to 2})_{i,j} = w^{(2)}_{i,j} / \sqrt{H}$ for weights $w^{(k)}_{i,j}$ drawn i.i.d. from a centred Gaussian distribution with unit variance; this ensures that, if the coordinates of the input vector $z \in \mathbb{R}^d$ are all of order $\mathcal{O}(1)$, so are the coordinates of $x = \Phi(z)$.

Data-augmentation: consider a data sample $x_i \in \mathcal{M}$ on the data-manifold; it can be expressed as $x_i = \Phi(z_i)$. We consider the natural data-augmentation process which consists in setting $S_{\varepsilon\omega}(x_i) = \Phi(z_i + \varepsilon\omega)$ for a sample $\omega \in \mathbb{R}^d$ from an isotropic Gaussian distribution with unit covariance and $\varepsilon > 0$. Crucially, the data-augmentation scheme respects the low-dimensional structure of the data: the perturbed sample $S_{\varepsilon\omega}(x_i)$ belongs to the data-manifold $\mathcal{M}$ for any perturbation vector $\varepsilon\omega$. Note that, for any value of $\varepsilon$, the data-augmentation preserves the low-dimensional manifold: perturbed samples $S_{\varepsilon\omega}(x_i)$ exactly lie on the data-manifold.
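A sketch of the Hidden Manifold Model generator and of the manifold-respecting augmentation described above is given next (NumPy, with assumed variable names); it also includes the restricted perturbation $\omega[k]$ used in the data-augmentation quality experiment described further below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden Manifold Model generator Phi : R^d -> R^D, a one-hidden-layer network
# with ELU non-linearity and the normalisation described in the text
# (weights i.i.d. standard Gaussian, scaled by 1/sqrt(fan-in)).
d, H, D = 10, 30, 100
A01 = rng.standard_normal((H, d)) / np.sqrt(d)
A12 = rng.standard_normal((D, H)) / np.sqrt(H)
b1 = rng.standard_normal(H)
elu = lambda u: np.where(u > 0, u, np.exp(u) - 1.0)

def Phi(z):
    """Map latent coordinates z in R^d to a sample x = Phi(z) on the manifold M."""
    return A12 @ elu(A01 @ z + b1)

def augment(z, eps=0.3, k=None):
    """Manifold-respecting augmentation S_{eps*omega}(x) = Phi(z + eps*omega).

    If k is given, only the first k latent coordinates are perturbed, which
    mimics the lower-quality augmentation schemes of the quality experiment.
    """
    omega = rng.standard_normal(d)
    if k is not None:
        omega[k:] = 0.0           # explore only k of the d manifold dimensions
    return Phi(z + eps * omega)

# Usage: a latent point, the corresponding sample, and an augmented version.
z0 = rng.standard_normal(d)
x0 = Phi(z0)
x_aug = augment(z0, eps=0.3, k=5)
```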
The parameter $\varepsilon$ controls the magnitude of the data-augmentation: the larger $\varepsilon$, the stronger the augmentation. This property is important since it allows us to study the influence of the amount of data-augmentation.

Classification: we consider a balanced binary classification problem with $|\mathcal{D}_L| \ge 2$ labelled training examples $\{x_i, y_i\}_{i \in I_L}$, where $x_i = \Phi(z_i)$ and $y_i \in \mathcal{Y} \equiv \{-1, +1\}$. The samples $z_i \in \mathbb{R}^d$ corresponding to the positive (resp. negative) class are assumed to have been drawn i.i.d. from a Gaussian distribution with identity covariance matrix and mean $\mu_+ \in \mathbb{R}^d$ (resp. mean $\mu_- \in \mathbb{R}^d$). The distance $\|\mu_+ - \mu_-\|$ quantifies the hardness of the classification task.

Neural architecture and optimization: consider fitting a two-layered neural network $F_\theta : \mathbb{R}^D \to \mathbb{R}$ by minimising the negative log-likelihood $\mathcal{L}_L(\theta) \equiv (1/|\mathcal{D}_L|) \sum_{i \in I_L} \ell[F_\theta(x_i), y_i]$, where $\ell(f, y) = \log(1 + \exp[-y f])$. We assume that there are $|\mathcal{D}_L| = 10$ labelled data pairs $\{x_i, y_i\}_{i \in I_L}$, as well as $|\mathcal{D}_U| = 1000$ unlabelled data samples, that the ambient space has dimension $D = 100$, and that the data manifold $\mathcal{M}$ has dimension $d = 10$. The function $\Phi$ uses $H = 30$ neurons in its hidden layer. In all our experiments, we use a standard Stochastic Gradient Descent (SGD) method with constant learning rate and momentum $\beta = 0.9$. For minimizing the consistency-based SSL objective $\mathcal{L}_L(\theta) + \lambda R(\theta)$, with regularization $R(\theta)$ given in equation (2), we use the standard strategy (TV17) consisting in first minimizing the un-regularized objective $\mathcal{L}_L$ alone for a few epochs, in order for the function $F_\theta$ to be learned in the neighbourhood of the few labelled data samples, before switching on the consistency-based regularization, whose role is to propagate the information contained in the labelled samples along the data manifold $\mathcal{M}$.

Insensitivity to $\lambda$: Figure 3 (Left) shows that this method is relatively insensitive to the parameter $\lambda$, as long as it is within reasonable bounds. This phenomenon can be read from equation (8), which does not depend on $\lambda$. Much larger or smaller values of $\lambda$ (not shown in Figure 3) do lead, unsurprisingly, to convergence and stability issues.

Amount of data-augmentation: as is reported in many tasks (CZM+18; ZCG+19; KYF20), tuning the amount of data-augmentation in deep-learning applications is often a delicate exercise that can greatly influence the resulting performance. Figure 3 (Right) reports the generalization properties of the method for different amounts of data-augmentation. With too small an amount of data-augmentation (i.e. $\varepsilon = 0.03$), the final performance is equivalent to the un-regularized method. Too large an amount of data-augmentation (i.e. $\varepsilon = 1.0$) also leads to poor generalization properties: the choice $\varepsilon = 1.0$ produces augmented samples that are very different from the distribution of the training dataset (i.e. a distributional shift), although these samples are still supported by the data-manifold.

Quality of the data-augmentation: to study the influence of the quality of the data-augmentation scheme, we consider a perturbation process implemented as $S_{\varepsilon\omega[k]}(x_i) = \Phi(z_i + \varepsilon\,\omega[k])$ for $x_i = \Phi(z_i)$, where the noise term $\omega[k]$ is defined as follows. For a data-augmentation dimension parameter $1 \le k \le d$, we have $\omega[k] = (\xi_1, \ldots, \xi_k, 0, \ldots, 0)$ for i.i.d. standard Gaussian samples $\xi_1, \ldots, \xi_k \in \mathbb{R}$. This data-augmentation scheme only explores the first $k$ dimensions of the $d$-dimensional data-manifold: the lower $k$, the poorer the exploration of the data-manifold.
As demonstrated in Figure 4, lower-quality data-augmentation schemes (i.e. lower values of $k$) hurt the generalization performance of the $\Pi$-model.

Figure 4: Learning curves (test NLL) of the $\Pi$-model with $\lambda = 10$ for different "qualities" of data-augmentation. The data manifold is of dimension $d = 10$ in an ambient space of dimension $D = 100$. For $x_i = \Phi(z_i)$ and $1 \le k \le d$, the data-augmentation scheme is implemented as $S_{\varepsilon\omega[k]}(x_i) = \Phi(z_i + \varepsilon\,\omega[k])$, where $\omega[k]$ is a sample from a Gaussian distribution whose last $(d - k)$ coordinates are zero. In other words, the data-augmentation scheme only explores $k$ dimensions out of the $d$ dimensions of the data-manifold. We use $\varepsilon = 0.3$ in all the experiments. Left: learning curves (test NLL) for data-augmentation dimension $k \in [5, 10]$. Right: test NLL at epoch $N = 200$ (see left plot) for data-augmentation dimension $k \in [5, 10]$.

Mean-Teacher versus $\Pi$-model: we implemented the Mean-Teacher (MT) approach with an exponential moving average (EMA) process $\theta_{\mathrm{avg},k} = \beta_{\mathrm{MT}}\, \theta_{\mathrm{avg},k-1} + (1 - \beta_{\mathrm{MT}})\, \theta_k$ for the MT parameter $\theta_{\mathrm{avg},k}$, with different scales $\beta_{\mathrm{MT}} \in \{0.9, 0.95, 0.99, 0.995\}$, as well as a $\Pi$-model approach, with $\lambda = 10$ and $\varepsilon = 0.3$. Figure 5 shows, in accordance with Section 4.1, that the different EMA schemes lead to generalization performances similar to a standard $\Pi$-model.

Figure 5: Mean-Teacher (MT) learning curves (test NLL) for different values of the exponential smoothing parameter $\beta_{\mathrm{MT}} \in (0, 1)$. For $\beta_{\mathrm{MT}} \in \{0.9, 0.95, 0.99, 0.995\}$, the final test NLL obtained through the MT approach is identical to the test NLL obtained through the $\Pi$-model. In all the experiments, we used $\lambda = 10$ and SGD with momentum $\beta = 0.9$.

5 CONCLUSION

Consistency-based SSL methods rely on a subtle trade-off between the exploitation of the labelled samples and the discovery of the low-dimensional data-manifold. The results presented in this article highlight the connections with more standard methods such as Jacobian penalization and graph-based approaches, and emphasize the crucial role of the data-augmentation scheme. The analysis of consistency-based SSL methods is still in its infancy, and our numerical simulations suggest that the variant of the Hidden Manifold Model described in this text is a natural framework to make progress in this direction.

REFERENCES

[ALAC17] Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017.

[AS17] Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

[BAP14] Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems, pages 3365–3373, 2014.

[BC01] Avrim Blum and Shuchi Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 19–26. Morgan Kaufmann Publishers Inc., 2001.

[BCG+19] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pages 5050–5060, 2019.
[Bis95] Chris M Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.

[BJ03] Ronen Basri and David W Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis & Machine Intelligence, (2):218–233, 2003.

[BM13] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.

[BNS06] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.

[Cay05] Lawrence Cayton. Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep, 12(1-17):1, 2005.

[CKNH20] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

[CS02] Olivier Chapelle and Bernhard Schölkopf. Incorporating invariances in non-linear support vector machines. In Advances in Neural Information Processing Systems, pages 609–616, 2002.

[CSZ09] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.

[CZM+18] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.

[CZM+19] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.

[CZSL19] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019.

[DRS+18] Sripad Krishna Devalla, Prajwal K Renukanand, Bharathwaj K Sreedhar, Giridhar Subramanian, Liang Zhang, Shamira Perera, Jean-Martial Mari, Khai Sing Chin, Tin A Tun, Nicholas G Strouthidis, et al. Drunet: a dilated-residual U-Net deep learning network to segment optic nerve head tissues in optical coherence tomography images. Biomedical Optics Express, 9(7):3244–3265, 2018.

[DSST19] Matthew M Dunlop, Dejan Slepčev, Andrew M Stuart, and Matthew Thorpe. Large data and zero noise limits of graph-based semi-supervised learning algorithms. Applied and Computational Harmonic Analysis, 2019.

[EK09] Stewart N Ethier and Thomas G Kurtz. Markov Processes: Characterization and Convergence, volume 282. John Wiley & Sons, 2009.

[GLK+20] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Generalisation error in learning with random features and the hidden manifold model. arXiv preprint arXiv:2002.09339, 2020.

[GMKZ19] Sebastian Goldt, Marc Mézard, Florent Krzakala, and Lenka Zdeborová. Modelling the influence of data structure on learning in neural networks. arXiv preprint arXiv:1909.11500, 2019.

[GSA+20] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
[HK02] Bernard Haasdonk and Daniel Keysers. Tangent distance kernels for support vector machines. In Object Recognition Supported by User Interaction for Service Robots, volume 2, pages 864–868. IEEE, 2002.

[KYF20] Ilya Kostrikov, Denis Yarats, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649, 2020.

[LA16] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.

[LBC17] Joseph Lemley, Shabab Bazrafkan, and Peter Corcoran. Smart augmentation learning an optimal data augmentation strategy. IEEE Access, 5:5858–5869, 2017.

[LZL+18] Yucen Luo, Jun Zhu, Mengxi Li, Yong Ren, and Bo Zhang. Smooth neighbors on teacher graphs for semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8896–8905, 2018.

[MHF+12] Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V Lilienfeld, and Klaus-Robert Müller. Learning invariant representations of molecules for atomization energy prediction. In Advances in Neural Information Processing Systems, pages 440–448, 2012.

[MMIK18] Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[Mos16] Elchanan Mossel. Deep learning and hierarchal generative models. arXiv preprint arXiv:1612.09057, 2016.

[NGP98] Partha Niyogi, Federico Girosi, and Tomaso Poggio. Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE, 86(11):2196–2209, 1998.

[OOR+18] Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3235–3246, 2018.

[PCZ+19] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.

[Pey09] Gabriel Peyré. Manifold models for signals and images. Computer Vision and Image Understanding, 113(2):249–260, 2009.

[PJ92] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[Pol90] Boris T Polyak. New stochastic approximation type procedures. Automat. i Telemekh, 7(98-107):2, 1990.

[PW17] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621, 2017.

[RDV+11] Salah Rifai, Yann N Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems, pages 2294–2302, 2011.

[RM51] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[RVM+11] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning, pages 833–840. Omnipress, 2011.
[SBD+18] Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. On the information bottleneck theory of deep learning. 2018.

[SBL+20] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020.

[SHK+14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[SJT16] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pages 1163–1171, 2016.

[SLDV98] Patrice Y Simard, Yann A LeCun, John S Denker, and Bernard Victorri. Transformation invariance in pattern recognition: tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, pages 239–274. Springer, 1998.

[SMG13] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

[SZT17] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

[TP91] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.

[TV17] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pages 1195–1204, 2017.

[TZ15] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.

[XDH+19] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation for consistency training. 2019.

[ZCDLP17] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

[ZCG+19] Barret Zoph, Ekin D Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, and Quoc V Le. Learning data augmentation strategies for object detection. arXiv preprint arXiv:1906.11172, 2019.

[ZG02] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Citeseer, 2002.

[ZGL03] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.

[Zhu05] Xiaojin Jerry Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2005.