# MixUp as Locally Linear Out-of-Manifold Regularization

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Hongyu Guo, National Research Council Canada, 1200 Montreal Road, Ottawa, hongyu.guo@nrc-cnrc.gc.ca
Yongyi Mao, School of Electrical Engineering & Computer Science, University of Ottawa, Ottawa, Ontario, yymao@eecs.uottawa.ca
Richong Zhang, BDBC, School of Computer Science and Engineering, Beihang University, Beijing, China, zhangrc@act.buaa.edu.cn

MixUp (Zhang et al. 2017) is a recently proposed data-augmentation scheme, which linearly interpolates a random pair of training examples and, correspondingly, the one-hot representations of their labels. Training deep neural networks with such additional data is shown to be capable of significantly improving the predictive accuracy of the current art. The power of MixUp, however, is primarily established empirically, and its working and effectiveness have not been explained in any depth. In this paper, we develop an understanding of MixUp as a form of out-of-manifold regularization, which imposes certain local linearity constraints on the model's input space beyond the data manifold. This analysis enables us to identify a limitation of MixUp, which we call manifold intrusion. In a nutshell, manifold intrusion in MixUp is a form of under-fitting resulting from conflicts between the synthetic labels of the mixed-up examples and the labels of the original training data. Such a phenomenon usually happens when the parameters controlling the generation of mixing policies are not sufficiently fine-tuned on the training data. To address this issue, we propose a novel adaptive version of MixUp, where the mixing policies are automatically learned from the data using an additional network and objective function designed to avoid manifold intrusion. The proposed regularizer, AdaMixUp, is empirically evaluated on several benchmark datasets. Extensive experiments demonstrate that AdaMixUp improves upon MixUp when applied to the current art of deep classification models.

## Introduction

Deep learning techniques have achieved profound success in many challenging real-world applications, including image recognition (Krizhevsky, Sutskever, and Hinton 2012; He et al. 2016), speech recognition (Hinton et al. 2012; Graves, Mohamed, and Hinton 2013), and machine translation (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014), amongst many others (Goodfellow et al. 2014; Silver et al. 2016; Kipf and Welling 2016). One such recently proposed technique, MixUp (Zhang et al. 2017), is a simple and yet very effective data-augmentation approach for enhancing the performance of deep classification models. Through linearly interpolating random pairs of data samples and their training targets, MixUp generates a synthetic set of examples and uses these examples to augment the training set. In (Zhang et al. 2017), MixUp is shown to dramatically improve the predictive accuracy of the current art of deep neural networks. Despite its demonstrated effectiveness, the power of MixUp is mostly established empirically. To date, the working of MixUp has not been well explained. In addition, the mixing (i.e., interpolation) policies in MixUp are controlled by a global hyper-parameter α, which needs to be tuned by trial and error on the data set.
This, on one hand, makes MixUp inconvenient to use, and on the other hand, adds to the mystery of what role such a parameter plays and how to tune it properly. The study in this paper is motivated by developing a deeper understanding of MixUp, specifically pertaining to why it works. To that end, we formulate MixUp as a new form of regularization. Specifically, we categorize regularization schemes as data-independent regularization and data-dependent regularization. Data-independent regularization imposes constraints on the model without exploiting the structure (e.g., the distribution) of the data; typical examples include penalizing various forms of the norm of the network parameters (e.g., weight decay (Hanson and Pratt 1988)) or dropout (Srivastava et al. 2014). Data-dependent regularization constrains the parameter space of the model in a way that depends on the structure of the data; previous examples of such schemes include data augmentation schemes (Simard et al. 1998; Lecun et al. 1998; Simonyan and Zisserman 2014), adversarial training schemes (e.g., (Goodfellow, Shlens, and Szegedy 2014)), and some information-bottleneck-based regularization schemes (e.g., (Alemi et al. 2016)). In this paper, we show that MixUp is also a data-dependent regularization scheme in the sense that the imposed constraints on the model explicitly exploit the data distribution. But different from all previous data-dependent regularization schemes, MixUp imposes its constraints by making use of the regions in the input space of the model that are outside of the data manifold. Identifying the constraints imposed by MixUp with a set of locally linear constraints, we call such a data-dependent regularization scheme a locally linear out-of-manifold regularization scheme.

Under this perspective, we identify an intrinsic problem in MixUp, which we refer to as manifold intrusion. Briefly, manifold intrusion occurs when a mixed example collides with a real example in the data manifold, but is given a soft label that is different from the label of the real example. Figure 1 (left) shows some examples of manifold intrusion. In the figure, the mixed images (top row) look close to a hand-written "8" (which indeed lives in the data set), and yet MixUp assigns a soft label that is not "8". For example, the soft label of the top-left image is given as the half-half mix of labels "1" and "5". We explain in this work that manifold intrusion necessarily results in under-fitting and degradation of the model performance.

Figure 1: Left: Linearly interpolated images (top row) from the original images (bottom two rows). Right: Performance of MixUp on a reduced Cifar100 data set (reduced to containing 20% of the data samples) vs. various values of α.

This is confirmed by our experiments, where we see that MixUp with an inappropriate hyper-parameter α actually deteriorates the model performance (Figure 1, right). Indeed, with MixUp, one has to select the hyper-parameter α carefully via trial and error in order to enhance the base model. Building on our understanding, and motivated by the potential manifold intrusion in MixUp, we then propose a generalized and adaptive version of MixUp, which we call AdaMixUp. On one hand, AdaMixUp extends the mixing policies in the standard MixUp from 2-fold to higher-fold; more importantly, it adaptively learns good policy regions from the data and automatically avoids manifold intrusion.
Our experiments on several image classification tasks show that AdaMixUp significantly improves the current art of deep classification models and outperforms the standard MixUp by a large margin.

## Preliminaries

In a standard classification setting, let X denote the vector space in which each example x lives. Throughout the paper, the elements of X will be treated as vectors, and all vectors in this paper are taken as column vectors. Let Y denote the set of all class labels. The objective of our classification problem is to develop a classifier which assigns every example a label in Y. It is worth noting, however, that not every x ∈ X is a valid example. That is, some x cannot be associated with a label in Y. Specifically, the set of all valid examples is only a subset of X, which we refer to as the data manifold, or in short, the manifold, and denote by M. The following hypothesis is adopted for nearly every classification problem.

Hypothesis 1 (Basic Hypothesis) Each x ∈ M has a unique label in Y. We use g(x) to denote the label of x, where g is a function mapping M onto Y.

We stress that on X outside M, g is not defined. In the classification problem, we are given a subset D ⊆ M as the training examples, where for each x ∈ D, g(x) is given.

A distribution over a set of m elements will be denoted by a vector in R^m. It is well known that the set of all such distributions forms the probability simplex in R^m, which we will denote by S_m. The set of all distributions on Y is obviously a probability simplex, but we will give it a separate notation P(Y) for distinction. Specifically, each distribution in P(Y) is regarded as a soft label, whereas each distribution in the other probability simplexes potentially serves as a set of coefficients, which will be used to mix training examples via a convex combination. For example, a Bernoulli distribution [α, 1−α]^T, as a distribution in S_2, may potentially be used to mix two examples x, x′ ∈ X by αx + (1−α)x′. Thus the distributions in these simplexes will be referred to as a mixing policy, or simply a policy. Any subset Λ of these probability simplexes with non-zero Lebesgue measure will then be called a policy region. A distribution or policy will be called degenerate if it is a distribution containing a single non-zero probability mass.

Let F(X, Y) denote the space of all functions mapping X to P(Y). A discriminative classification model (such as a neural network model) that outputs a predictive distribution over Y for an example x ∈ X can be characterized as a subset H ⊆ F(X, Y). Given the model H, the objective of learning is then finding a member function H ∈ H which, hopefully, for each x ∈ M, returns a predictive distribution that puts the highest probability on g(x). Usually such an H is found by optimizing a well-defined loss function L_D(H) over all H in the space H:

$$\hat{H} := \arg\min_{H \in \mathcal{H}} L_D(H). \quad (1)$$

We note that we subscript the loss with D to emphasize its dependency on the training data D.

### Regularization

It is well known that when the space H is large (for example, high dimensional), or when the model is too complex, the model tends to overfit and generalize poorly. The main technique to cure such a problem is regularization. Regularization, in our view, may refer to any technique that effectively reduces the model capacity or reduces the space H. In this paper, we wish to divide regularization into two categories: data-dependent and data-independent. Data-independent regularization is the usual notion of regularization.
Typical such techniques directly impose a certain constraint, say C(H) < A, on the functions H ∈ H. This turns the optimization problem (1) into a constrained optimization problem, which, under the Lagrangian formulation, can be expressed as

$$\hat{H} := \arg\min_{H \in \mathcal{H}} \{L_D(H) + \beta C(H)\} \quad (2)$$

For example, weight-decay regularization is a data-independent regularization, which essentially imposes an L2-norm constraint on the model parameters. Dropout (Srivastava et al. 2014) can be regarded as another example of such a regularizer, which can be understood as reducing the model capacity in a Bayesian style (Gal and Ghahramani 2015) and which has an equivalent loss function similar to the form of (2). A key feature of data-independent regularization is that the training data do not enter the regularization term C(H) in such a scheme.

Data-dependent regularization imposes constraints or makes additional assumptions on H with respect to the training data D. The most well-known such techniques are data augmentation. Specifically, in a data augmentation scheme, another set D′ ⊆ X, believed to be inside M but beyond the training set D, is introduced for training. For example, if D is a set of images, D′ can be obtained by rotating each image in D; the label g(x′) of an image x′ in D′ is taken as the label of the image x before rotation. Data augmentation turns the optimization problem (1) into

$$\hat{H} := \arg\min_{H \in \mathcal{H}} \{L_D(H) + L_{D'}(H)\} \quad (3)$$

where L_{D′} is the same loss function but on the data D′. Aligning (3) with (2), one can equivalently regard data augmentation as imposing a constraint on the space H (hence a regularization scheme), and such a constraint, depending on the data set D, is precisely due to the hypothesis that D′ ⊆ M and that g(·) is rotation-invariant.

Another data-dependent regularization scheme is adversarial training (see, e.g., (Goodfellow, Shlens, and Szegedy 2014)). In such a scheme, for each training example x, an adversarial example x′ is found, and it is assumed that x′ has the same label as x. The adversarial examples are then also used for training purposes. Although delicate procedures have been proposed to find adversarial examples, the found adversarial examples are used in the same way as the additional data set D′ in data augmentation. In both data augmentation and adversarial training, the constraints imposed on the label function g are within the data manifold; the constraints essentially specify how the function g should behave on additional points in the manifold M. Next we will argue that MixUp may be viewed as another data-dependent regularization scheme, one which introduces constraints on g outside the manifold M.

## Locally Linear Out-of-Manifold Regularization

For each y ∈ Y, let δ_y ∈ P(Y) be the single-point-mass (namely, degenerate) distribution on Y which puts all the probability mass on y. Now we associate with g a function G : X → P(Y) satisfying

$$G(x) := \delta_{g(x)}, \quad \text{for any } x \in M. \quad (4)$$

Obviously, on the data manifold M, G is simply a representation of g in F(X, Y). Additionally, any G satisfying the above equation is as perfect as g for any valid example. Thus, any such function G can be taken as a ground-truth classifier. Then the classification problem can be reformulated as constructing a model H ⊆ F(X, Y) and finding a member H ∈ H that well approximates a function G satisfying (4). Note that the condition (4) on G is only within M; it imposes no condition on the space X outside M. However, the functions in H are defined on X, well beyond M.
Thus, if one chooses to regularize the model in a data-dependent manner, there are many degrees of freedom beyond M which one can exploit. We will show that the MixUp strategy presented in (Zhang et al. 2017) can be seen as such a technique.

To begin, for any subset Ω ⊆ X, we will use Ω^(k) to denote the set of all k-column matrices in which each column is a point (vector) in Ω. This also defines the notations M^(k) and D^(k); however, with these latter two notations, we require each matrix in M^(k) and D^(k) to satisfy an additional condition, namely, the labels of the columns of the matrix must be all distinct.

Let k be a given integer not less than 2. Let Λ ⊆ S_k be a policy region, and X ∈ X^(k). We say that a function F ∈ F(X, Y) is Λ-linear with respect to X if for any λ ∈ Λ,

$$F(X\lambda) = F(X)\lambda$$

where F(·), when having its domain extended to matrices, acts on the matrix X column-wise. We also refer to such a notion of linearity as a k-fold local linearity. We note that if we allow Λ to be R^k, then R^k-linearity with respect to all X ∈ X^(k) for every integer k ≥ 1 implies standard linearity.

### Standard MixUp

Hypothesis 2 (Standard MixUp Hypothesis) There exists a policy region Λ ⊆ S_2 such that for any matrix X ∈ M^(2), the function G is Λ-linear with respect to X.

We note that this hypothesis only assumes a fold-2 local linearity and a global choice of Λ. If Hypothesis 2 holds and one is given Λ, then one can implement the hypothesis as a constraint on H, by imposing on the optimization problem the following constraint.

Standard MixUp Constraint: Every H ∈ H is Λ-linear w.r.t. every X ∈ D^(2).

This gives rise to a regularization scheme, which can be implemented by repeating the following process: draw a random pair of data points (Footnote 1) in D to form X, and draw a random λ from Λ; take Xλ as an additional training example and G(X)λ as its training target. This scheme, generating more training examples, is essentially the standard MixUp scheme (Zhang et al. 2017); a minimal code sketch of this process is given below. In this paper, we however point out that one needs to be cautious with such a regularization scheme.

Lemma 1 Let Λ be a policy region in S_2 and X ∈ M^(2). If there is a non-degenerate λ ∈ Λ with Xλ ∈ M, then Hypothesis 2 cannot hold with this choice of Λ (Footnote 2).

Proof: The condition of the lemma, stated more precisely in Footnote 2, suggests that there is a region Λ′ ⊆ Λ with non-zero Lebesgue measure such that every λ ∈ Λ′ is non-degenerate and Xλ ∈ M. The fact that Xλ ∈ M suggests, by Hypothesis 1, that g(Xλ) ∈ Y, or that G(Xλ) is a single-point-mass distribution, for every λ ∈ Λ′. To induce a contradiction, suppose that Hypothesis 2 holds for Λ, namely, G is Λ-linear with respect to every X ∈ M^(2). Then G(Xλ) = G(X)λ for every λ ∈ Λ′ ⊆ Λ. Since the two columns (points) in X have different labels, the two columns of G(X) must be two different one-hot vectors (degenerate distributions). But every λ ∈ Λ′ is non-degenerate; as a consequence, every G(X)λ must be non-degenerate. This contradicts the previous argument that G(Xλ) is a single-point-mass distribution. Therefore Hypothesis 2 fails on a region in Λ with non-zero measure. □

Geometrically, the failure of Hypothesis 2 in this lemma is due to the fact that mixing x1 and x2 by λ causes the mixed point Xλ to intrude into M and collide with a point, say x′, in M. We call such a phenomenon manifold intrusion. Note that in this case, Xλ and x′ are in fact the same point, but they have different labels. Obviously, when such intrusion occurs, regularization using X and λ will contradict the original training data. This essentially induces a form of under-fitting.
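As a concrete illustration of the scheme just described, the following is a minimal sketch (our own illustration, not the authors' released code) of generating the mixed training examples Xλ and their soft labels G(X)λ for a mini-batch. It assumes a PyTorch setting with one-hot labels, a generic classifier `model`, and, following the practice in (Zhang et al. 2017), a Beta(α, α) prior for drawing the mixing coefficient; the helper names `mixup_batch` and `mixup_loss` are illustrative.

```python
# Minimal sketch of the standard MixUp scheme (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def mixup_batch(x, y_onehot, alpha=0.2):
    """Mix a batch with a shuffled copy of itself.

    x:        (batch, ...) inputs, the columns of X
    y_onehot: (batch, num_classes) one-hot targets G(X)
    Returns the mixed inputs X*lambda and the mixed soft labels G(X)*lambda.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # lambda drawn from Lambda
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]                 # X * lambda
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]   # G(X) * lambda (soft label)
    return x_mix, y_mix

def mixup_loss(model, x, y_onehot, alpha=0.2):
    """Cross-entropy of the model's prediction against the mixed soft label."""
    x_mix, y_mix = mixup_batch(x, y_onehot, alpha)
    log_probs = F.log_softmax(model(x_mix), dim=1)
    return -(y_mix * log_probs).sum(dim=1).mean()
```

Training on these extra examples is the mechanism by which the Λ-linearity constraint is (softly) imposed on H; whether a drawn λ causes manifold intrusion is precisely what the lemma above is concerned with.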
Figure 2 depicts the effect of MixUp on constraining the space H, where the region G denotes the space of functions in F(X, Y) satisfying (4). When the space H is large, there is a large region of H satisfying (4). This gives a large intersection of H and G, each member of which is a perfect classifier on the training set but performs questionably on the testing set. Thus the model without MixUp tends to result in large prediction variances, or over-fitting (left figure). When MixUp is applied and manifold intrusion does not occur (middle figure), the MixUp constraint effectively reduces the space H. Since the MixUp constraint does not conflict with (4) (at least explicitly), the reduced H would contain members compatible with (4). Thus the intersection of G and H shrinks while remaining non-empty. Each member in the intersection is a perfect classifier on the training set and the mixed examples. But the significantly reduced intersection gives rise to significantly lower prediction variances on the testing set, thereby serving to regularize the model and reduce over-fitting. When the application of MixUp results in manifold intrusion (right figure), there is no member in H that satisfies both the MixUp constraint and the condition (4). The intersection of H and G then becomes empty. In this case, no member in H can fit both the original training data and the (intruding) mixed examples. This gives rise to under-fitting and high prediction biases.

Footnote 1: Although we have required in the definition of D^(2) that the two points in matrix X should have different labels, this need not be rigorously followed in implementation. Relaxing this condition is in fact expected to decrease the chance of manifold intrusion. Similar comments apply to AdaMixUp later.

Footnote 2: More precisely, in order to make the hypothesis fail, instead of requiring a single non-degenerate λ, we in fact require a subset of Λ having non-zero measure. But we abbreviate it here for simplicity. Similar comments apply to Lemma 2.

From the viewpoint of restricting the space H, one wishes to make Λ as large as possible. But this lemma sets a limit on the possible Λ that makes Hypothesis 2 hold. Due to this lemma, we need to ensure that the Λ we choose does not cause such an intrusion. Let Λ* denote the largest policy region in S_2 such that manifold intrusion does not occur. In the original MixUp, recall that an inappropriate hyper-parameter results in no gain and only a negative impact. We believe that much of this is due to manifold intrusion. On the other hand, the search for a good hyper-parameter α in the original MixUp can be seen as searching for Λ* by trial and error.

Figure 2: Effect of MixUp on constraining H.

In this paper, we generalize the standard MixUp to regularization with higher-fold local linearities, where Λ* (more precisely, its generalization) is learned from the data.

## Adaptive MixUp (AdaMixUp)

Hypothesis 3 (Generalized MixUp Hypothesis) For any integer k not less than 2 and not greater than some k_max, and for any matrix X ∈ M^(k), there exists a policy region Λ ⊆ S_k such that the function G is Λ-linear w.r.t. X.

Note that here Λ depends on X, which we will denote by Λ(X). If this hypothesis holds, and if for every k ∈ {2, 3, . . . , k_max} and every X ∈ M^(k) the Λ(X) in the hypothesis is given, then the hypothesis can be implemented as the following constraint on H.

Generalized MixUp Constraint: For every k ∈ {2, 3, . . . , k_max} and every X ∈ D^(k), with its given Λ(X), every H ∈ H is Λ(X)-linear w.r.t. X.
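To make the k-fold constraint concrete, the short sketch below (our illustration; `mix_k_fold` and its arguments are assumed names) shows how a policy λ ∈ S_k applied to k examples with distinct labels yields the virtual training pair (Xλ, G(X)λ) through which Λ(X)-linearity is imposed; in AdaMixUp the region from which λ is drawn is learned rather than fixed.

```python
# Sketch of imposing the k-fold constraint via a virtual training pair (illustrative).
import torch

def mix_k_fold(xs, ys_onehot, lam):
    """xs: (k, d) stacked examples (the columns of X, transposed);
    ys_onehot: (k, num_classes) their one-hot labels G(X);
    lam: (k,) a mixing policy in the simplex S_k.

    Returns X*lambda (a single virtual example) and G(X)*lambda (its soft label).
    """
    x_mix = (lam.unsqueeze(1) * xs).sum(dim=0)         # convex combination of the k examples
    y_mix = (lam.unsqueeze(1) * ys_onehot).sum(dim=0)  # matching soft label
    return x_mix, y_mix

# Example: a 3-fold policy lam = [0.6, 0.3, 0.1]; the constraint asks the classifier's
# prediction at x_mix to equal the same mixture of its predictions at the three originals.
```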
Imposing such a constraint in training provides another regularization scheme, which will be one ingredient of the proposed AdaMixUp scheme. We will present AdaMixUp in the next section, after developing its other ingredients.

Lemma 2 Let k ∈ {2, 3, . . . , k_max} and X ∈ M^(k). Suppose that Λ is a policy region in S_k. If there is a non-degenerate λ ∈ Λ with Xλ ∈ M, then Hypothesis 3 with Λ(X) = Λ cannot hold.

This lemma parallels Lemma 1 and can be proved similarly. Lemma 2 similarly suggests that when the imposed constraint results in intrusion into the manifold M, Hypothesis 3 necessarily fails. The lemma also sets a limit for each Λ(X). For each X, we will denote the largest Λ(X) by Λ*(X). The following result immediately follows.

Lemma 3 For any X ∈ M^(2), Λ* ⊆ Λ*(X).

Proof: By the definitions of Λ* and Λ*(X), Λ* = ∩_{X ∈ M^(2)} Λ*(X). Thus Λ* ⊆ Λ*(X). □

In reality, one expects that Λ* is strictly contained in Λ*(X). That is, even when we take k_max = 2, the Generalized MixUp Hypothesis imposes a stronger constraint than the Standard MixUp Hypothesis, and hence can provide stronger regularization without causing under-fitting. When k_max > 2, the Generalized MixUp Hypothesis imposes additional constraints, which further regularize the model and improve generalization.

### AdaMixUp

From here on, we stay in the Generalized MixUp regime, namely assuming that Hypothesis 3 holds. We now explore the possibility of learning Λ(X) for each X. Suppose that each Λ*(X) ⊆ S_k can be sufficiently parametrized by π_k(X) in some space Π. Then π_k(·) can be regarded as a function mapping M^(k) to Π, and learning each Λ(X) reduces to learning the function π_k(·). We now consider using a neural network to represent the function π_k. This network will be referred to as a policy region generator. With a slight abuse of notation, we will use π_k(X) to denote any candidate choice of Λ*(X) ⊆ S_k generated by the network π_k(·).

To train the networks {π_k}, consider the construction of another network φ(·). The network takes any x ∈ X as input and outputs a predictive distribution on whether x lies in the manifold M. This network, which we refer to as an intrusion discriminator, aims at distinguishing elements inside the manifold M from those outside. By ensuring that the virtual examples obtained by mixing with the policies in π_k(X) are outside the manifold and that the original examples are inside the manifold, the supervising signal trains both the networks {π_k} and φ. More details are given below.

For a given x ∈ X, let p(·|x; φ) be the predictive distribution on {1, 0} generated by the network φ, where 1 denotes that x is outside M. The loss function for training {π_k} and φ is given by the following cross-entropy loss, which we call the intrusion loss:

$$L_{\mathrm{intr}} := \frac{1}{k_{\max}-1} \sum_{k=2}^{k_{\max}} \mathbb{E}_{X \sim D^{(k)},\, \lambda \sim \pi_k(X)} \left[ -\log p(1 \mid X\lambda; \phi) \right] + \mathbb{E}_{x \sim D} \left[ -\log p(0 \mid x; \phi) \right] \quad (5)$$

This intrusion loss L_intr, depending on both the π_k's and φ, will be responsible for providing training signals for the π_k's and φ. Given the π_k's, let D′ be the set of all the virtual examples Xλ that can be obtained by drawing an X from D^(k) for some k and mixing with some policy λ in the region π_k(X). Let the MixUp loss L_{D′} be the average cross-entropy loss for the virtual examples Xλ in D′ with respect to their respective soft labels G(X)λ. Then we arrive at an overall loss function

$$L_{\mathrm{total}} := L_D(H) + L_{D'}(H, \{\pi_k\}) + L_{\mathrm{intr}}(\{\pi_k\}, \phi) \quad (6)$$

Training the networks H, {π_k} and φ on L_total then gives rise to the proposed AdaMixUp framework.
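For concreteness, the following sketch (ours, under simplifying assumptions) assembles the three terms of (6) for one mini-batch, with `classifier` playing the role of H and `discriminator` the role of φ; the mixed batch (`x_mix`, `y_mix`) is assumed to be produced by the policy region generator sketched in the implementation section below.

```python
# Sketch of the AdaMixUp loss terms (5)-(6) for a single mini-batch (illustrative).
import torch
import torch.nn.functional as F

def soft_ce(logits, soft_targets):
    """Cross-entropy against a soft (possibly mixed) label."""
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def adamixup_losses(classifier, discriminator, x, y_onehot, x_mix, y_mix):
    loss_D = soft_ce(classifier(x), y_onehot)        # L_D: fit the original data
    loss_Dprime = soft_ce(classifier(x_mix), y_mix)  # L_D': fit the mixed (virtual) examples

    # Intrusion loss (5): mixed points should be judged outside the manifold (class 1),
    # original points inside (class 0).
    ones = torch.ones(x_mix.size(0), dtype=torch.long, device=x_mix.device)
    zeros = torch.zeros(x.size(0), dtype=torch.long, device=x.device)
    loss_intr = (F.cross_entropy(discriminator(x_mix), ones)
                 + F.cross_entropy(discriminator(x), zeros))

    # L_total in (6); training alternates between minimizing loss_D + loss_Dprime
    # and minimizing loss_intr, as described in the implementation section.
    return loss_D, loss_Dprime, loss_intr
```

Note that the gradient of the intrusion loss can reach the policy region generator through `x_mix`, which is what the reparametrization trick described below makes possible.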
Comparing the loss in (6) with the objective function in (2), one can equate L_{D′}(H, {π_k}) + L_intr({π_k}, φ) in (6) with the regularization term βC(H) in (2). Specifically, here L_{D′}(H, {π_k}) serves to regularize the model, and L_intr({π_k}, φ) serves to prevent the model from over-regularization, namely intrusion or under-fitting. This justifies AdaMixUp (and the standard MixUp) as a regularization technique. The difference between AdaMixUp (6) and the standard regularization (2) is however two-fold: first, unlike the standard regularization, which is data-independent, AdaMixUp depends on the dataset; second, AdaMixUp contains parameters which are adaptively learned from the dataset D.

Given that each π_k has sufficient capacity and is properly trained, in theory π_k(X) can approximate Λ*(X) well. Then, if φ also has sufficient capacity, the intrusion loss L_intr can be driven to near zero. In practice, however, one usually needs to reduce the capacities of the π_k's and φ. The intrusion loss may then be seen as an approximate measure of the extent to which the data allow for mixing up without causing intrusion, under the architectural choices of the π_k's and φ.

## Implementation of AdaMixUp

### Implementation of {π_k}

Instead of constructing k_max − 1 networks π_k, we in fact implement these networks recursively using a single network π_2. For this purpose, we recursively parametrize Λ(X) ⊆ S_k, which we will rewrite as Λ_k(X) for clarity. For an X ∈ X^(k), we may express it as [X^(k−1), x_k], where X^(k−1) is the sub-matrix of X containing the first k − 1 columns. Then, for k > 2, we parametrize

$$\Lambda_k(X) := \left\{ \begin{bmatrix} \gamma\lambda \\ 1-\gamma \end{bmatrix} : \lambda \in \Lambda_{k-1}\!\left(X^{(k-1)}\right),\ \begin{bmatrix} \gamma \\ 1-\gamma \end{bmatrix} \in \Lambda_2\!\left(\left[X^{(k-1)}\lambda,\ x_k\right]\right) \right\} \quad (7)$$

This then allows the use of the single network π_2 to express Λ_k(X) for all X and all k. Specifically, a Fold-2 mixing policy is parameterized by a single number γ. For each input X ∈ X^(2), π_2 returns two positive values α and Δ with α + Δ ≤ 1, which specifies Λ(X) as the interval (α, α + Δ), serving as the range for γ. To this end, the last layer of the π_2 network is implemented as a softmax function, which generates a triplet of values (α, Δ, α′) summing to one, where the third element α′ is discarded. Using the network π_2, we generate a set of Fold-2 mixed examples by repeatedly drawing a pair of samples in D and mixing them using a random policy drawn from the policy region generated by π_2 on the pair of samples.

Figure 3: Fold-2 AdaMixUp for a single triplet (x1, x2, x3): the Policy Region Generator produces γ, and the mixed example x̂ = γx1 + (1 − γ)x2 is fed to the Intrusion Discriminator as a positive example, while x1, x2, x3 serve as negative examples. Each batch is implemented to contain multiple such triplets. + and − indicate positive and negative examples, respectively.

To generate mixed samples with an increased maximum fold number, we consider the original examples and the previously generated mixed examples as a new training set D_new. We then repeat the following process: draw a pair of samples, one from D_new and the other from D; submit the pair to the network π_2 to generate a policy region; draw a random policy from the region and obtain a new sample by mixing the pair of samples using the drawn policy.

### Reparametrization Trick

To allow the gradient signal to back-propagate through the policy sampler, a reparametrization trick similar to that of (Kingma and Welling 2013) is used. Specifically, drawing γ from (α, α + Δ) is implemented by drawing ε from the uniform distribution over (0, 1) and then letting γ := εΔ + α.
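A minimal sketch of the Fold-2 Policy Region Generator and the reparametrized draw of γ follows (our illustration; the `PolicyRegionGenerator` architecture shown here is a placeholder, whereas the paper reuses the classifier's backbone):

```python
# Sketch of pi_2 and the reparametrized sampling of gamma from (alpha, alpha + Delta).
import torch
import torch.nn as nn

class PolicyRegionGenerator(nn.Module):
    """Fold-2 policy region generator pi_2 (placeholder architecture)."""
    def __init__(self, in_features, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))  # softmax triplet (alpha, Delta, discarded)

    def forward(self, x1, x2):
        triplet = torch.softmax(self.net(torch.cat([x1, x2], dim=1)), dim=1)
        alpha, delta = triplet[:, 0], triplet[:, 1]   # alpha + Delta <= 1 by construction
        return alpha, delta

def mix_pair(pi2, x1, x2):
    """Mix a pair using a policy drawn from the learned region (alpha, alpha + Delta)."""
    alpha, delta = pi2(x1, x2)
    eps = torch.rand_like(alpha)       # reparametrization: eps ~ Uniform(0, 1)
    gamma = alpha + eps * delta        # gamma = eps * Delta + alpha, differentiable in pi_2
    g = gamma.unsqueeze(1)
    return g * x1 + (1.0 - g) * x2, gamma

# Higher folds reuse pi_2 recursively, as in (7): the previously mixed example is paired
# with a fresh sample from D and mixed again.
```

The returned γ also determines the soft label of the mixed example, exactly as in the Fold-2 MixUp loss sketched earlier.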
### Implementation of φ(·)

The network φ is a neural network that distinguishes original examples from mixed examples. In our implementation, φ shares with the classifier network all but the final layers. The final layer of φ is a simple logistic regression binary classifier.

### Training

The networks φ and π_2 are trained jointly, where we iterate between minimizing L_D + L_{D′} and minimizing L_intr in (6) via mini-batched SGD. Fold-2 AdaMixUp is illustrated in Figure 3.

## Experiments

### Data Sets and Experimental Settings

We evaluate AdaMixUp on eight benchmarks. MNIST is the popular digit (1-10) recognition dataset with 60,000 training and 10,000 test gray-level, 784-dimensional images. Fashion is an image recognition dataset having the same scale as MNIST, containing 10 classes of fashion product pictures. SVHN is the Google Street View house numbers recognition data set with 73,257 digit (1-10) 32x32 color images for training, 26,032 for testing, and 531,131 additional, easier samples; we did not use the additional images. Cifar10 is an image classification dataset with 10 classes, 50,000 training and 10,000 test samples. Cifar100 is similar to Cifar10 but with 100 classes and 600 images each. Cifar10-S and Cifar100-S are, respectively, Cifar10 and Cifar100 reduced to containing only 20% of the training samples. ImageNet-R is the ImageNet-2012 classification dataset (Russakovsky et al. 2014) with 1.3 million training images, 50,000 validation images, and 1,000 classes. We follow the data processing approaches used in MixUp (Zhang et al. 2017), except that the crop size is 100x100 instead of 224x224 due to our limited computational resources. We report both top-1 and top-5 error rates.

In our experiments, the Intrusion Discriminator, Classifier, and Policy Region Generator have the same network architecture, and the first two share the same set of parameters. We test AdaMixUp on two types of baseline networks: a three-layer CNN as implemented in (Wu et al. 2016) as the baseline for the easier tasks MNIST and Fashion, and a ResNet-18 as implemented in (Zagoruyko and Komodakis 2016) for the other six more difficult tasks. All models examined are trained using mini-batched backprop, as specified in (Wu et al. 2016) and (Zagoruyko and Komodakis 2016), for 400 epochs. Each reported performance value (accuracy or error rate) is the median of the performance values obtained in the final 10 epochs.

### Predictive Performance

Our first experiment compares Fold-2 AdaMixUp (i.e., mixing image pairs) to two networks: the baseline networks (i.e., the 3-layer CNN for MNIST and Fashion and ResNet-18 for the remaining datasets) and MixUp on the baseline networks. We present the error rates obtained by the Baseline, MixUp, and AdaMixUp, along with the relative error reduction of AdaMixUp over the Baseline, in Table 1. Table 1 indicates that AdaMixUp outperforms both the Baseline and MixUp on all eight test datasets. Also, the relative error reduction of AdaMixUp over the Baseline is at least 5.7%, and in some cases the margin is large. For example, for the SVHN and Cifar10 datasets, the relative error reductions achieved by AdaMixUp are over 30%. Table 1 also suggests that, with the default α value, MixUp failed to improve the baseline systems' accuracy on four out of the eight datasets, namely MNIST, Cifar10-S, Cifar100-S, and ImageNet-R (where MixUp's error rate exceeds the Baseline's in Table 1), suggesting over-regularization or under-fitting.
| Data Set | Baseline | MixUp | AdaMixUp | Relative Impro. (%) |
|---|---|---|---|---|
| mnist | 0.52 | 0.57 | 0.49 | 5.77 |
| fashion | 7.37 | 6.92 | 6.21 | 15.74 |
| svhn | 4.50 | 3.80 | 3.12 | 30.67 |
| cifar10 | 5.53 | 4.24 | 3.52 | 36.35 |
| cifar100 | 25.6 | 21.14 | 20.97 | 18.09 |
| cifar10-S | 7.68 | 7.88 | 6.85 | 10.81 |
| cifar100-S | 28.47 | 29.39 | 26.72 | 6.15 |
| ImageNet-R top1 | 53.00 | 54.89 | 49.17 | 7.22 |
| ImageNet-R top5 | 29.41 | 31.02 | 25.78 | 12.34 |

Table 1: Error rates (%) obtained by the testing methods.

In Table 2, we also present the α and Δ values learned by the Policy Region Generator, along with the losses of both the Intrusion Discriminator and the Classifier of the AdaMixUp method. The results in Table 2 indicate that the values of α and Δ vary for different datasets; the former ranges from 0.4 to 0.9, but the range for Δ is much smaller, with values between 0.010 and 0.035. Note that the intrusion losses are fairly close to 0 on all the datasets, suggesting that these datasets well support Fold-2 AdaMixUp under the proposed structure of the Policy Region Generator.

| Data Set | α | Δ | L_intr | L_D + L_D′ |
|---|---|---|---|---|
| mnist | 0.497 | 0.010 | 0.002 | 0.201 |
| fashion | 0.511 | 0.011 | 0.003 | 0.290 |
| svhn | 0.924 | 0.011 | 0.002 | 0.116 |
| cifar10 | 0.484 | 0.029 | 0.003 | 0.371 |
| cifar100 | 0.484 | 0.034 | 0.002 | 0.679 |
| cifar10-S | 0.486 | 0.027 | 0.003 | 0.479 |
| cifar100-S | 0.482 | 0.035 | 0.007 | 0.712 |
| ImageNet-R | 0.493 | 0.004 | 0.038 | 2.382 |

Table 2: Fold-2 AdaMixUp: mean α, mean Δ, and training losses.

### Training Characteristics

Figure 4 depicts the behavior of the Policy Region Generator in Fold-2 AdaMixUp over training iterations. It appears that the Policy Region Generator initially explores a wide range of the policy space before settling down at around 40K iterations (when the means of α and Δ both stabilize, at 0.48 and 0.03, respectively). Figure 5 shows the average of the drawn mixing policy γ in Fold-2 AdaMixUp over training iterations and the corresponding classification performance of the model. The range of γ appears to stabilize at around 40K iterations (left plot), consistent with the observation in Figure 4. Interestingly, at the same time, the classification error drops and begins to stabilize (right plot).

Figure 4: The mean of α (left) and the mean of Δ (right) in Fold-2 AdaMixUp on Cifar10.

Figure 5: Fold-2 AdaMixUp on Cifar10: mean of the mixing policy γ (left) and training/testing error rates (right).

Figure 6 shows some typical mixed images in Fold-2 AdaMixUp and their corresponding original images in MNIST. We note that these mixed images obviously distinguish themselves from the original images. Thus they do not intrude into the data manifold.

Figure 6: Mixed images (top row) in Fold-2 AdaMixUp from the original images (bottom two rows) in MNIST.

### Impact of Mixing Fold

Table 3 presents the error rates obtained by Fold-3 and Fold-4 AdaMixUp on the seven testing datasets (excluding ImageNet due to limited computational resources). We see that when increasing the mixing fold from 2 to higher values, the performance of AdaMixUp improves. This is because stronger constraints are imposed on the model, resulting in stronger regularization. One however expects that imposing stronger constraints makes the model more difficult to fit, and that further increasing the mixing fold may eventually lead to diminishing gains and even under-fitting. This is hinted at by Table 4, where the averages of the learned mixing parameters and the training losses of Fold-3 AdaMixUp are presented. In Table 4, we see that the intrusion losses of Fold-3 AdaMixUp are much higher than those of Fold-2 AdaMixUp (Table 2).
This implies that in higher-fold AdaMixUp, it is more difficult for the Policy Region Generator to carve out a good policy region that is not regarded by the Intrusion Discriminator as causing intrusion. This difficulty is also suggested by the high α2 values in Table 4 (often close to 1). They indicate that in these cases, mixing only occurs slightly.

| AdaMixUp | Fold-2 | Fold-3 | Fold-4 |
|---|---|---|---|
| mnist | 0.49 | 0.42 | 0.41 |
| fashion | 6.21 | 5.88 | 5.71 |
| svhn | 3.12 | 2.98 | 2.91 |
| cifar10 | 3.52 | 3.25 | 3.21 |
| cifar100 | 20.97 | 20.68 | 20.29 |
| cifar10-S | 6.85 | 6.14 | 5.96 |
| cifar100-S | 26.72 | 25.72 | 25.49 |

Table 3: Error rates (%) of Fold-3 and Fold-4 AdaMixUp.

| Fold-3 | α1/Δ1 | α2/Δ2 | L_intr | L_D + L_D′ |
|---|---|---|---|---|
| mnist | 0.533/0.013 | 0.662/0.013 | 0.002 | 0.337 |
| fashion | 0.517/0.015 | 0.678/0.019 | 0.024 | 0.327 |
| svhn | 0.637/0.011 | 0.943/0.020 | 0.028 | 0.482 |
| cifar10 | 0.516/0.028 | 0.938/0.041 | 0.003 | 0.403 |
| cifar100 | 0.121/0.010 | 0.959/0.015 | 0.004 | 0.305 |
| cifar10-S | 0.504/0.039 | 0.849/0.068 | 0.028 | 0.505 |
| cifar100-S | 0.639/0.011 | 0.943/0.018 | 0.023 | 0.542 |

Table 4: Fold-3 AdaMixUp: average training losses and policy region parameters. (α1, Δ1): initial Fold-2 mixing parameters; (α2, Δ2): further mixing parameters.

### Sensitivity to Model Capacity

We vary the number of filters on each layer of the ResNet-18, using half-quarter, quarter, half, and three-quarters of the original number of filters (denoted as the base filter count), and present the test error rates on Cifar100 in Figure 7 (green curves). The green curves in Figure 7 indicate that Fold-2 AdaMixUp benefits from a larger number of deployed filters: its accuracy keeps improving as the number of filters is enlarged. On the contrary, ResNet-18 received dramatic accuracy improvements up to half of the base number of filters, but obtained no further improvement after that.

Figure 7: Error rates obtained by ResNet-18 and AdaMixUp on Cifar100, when varying the number of filters (green curves) and varying the size of the samples (red curves).

### Effect of Data Size

We down-sample the Cifar100 data to 20%, 40%, 60%, and 80% of the training samples, and present the test error rates obtained by Fold-2 AdaMixUp in Figure 7 (red curves). The red curves in Figure 7 indicate that more training samples help both ResNet-18 and AdaMixUp, but the accuracy gap between the two methods widens as the training sample size increases.

### Benefit of the Intrusion Discriminator

Table 5 lists the test error rates obtained by Fold-2 AdaMixUp with ResNet-18 on the Cifar10 and Cifar100 data, but without the Intrusion Discriminator.

| Data Set | Baseline (ResNet-18) | AdaMixUp | AdaMixUp w/o Intru. Discr. |
|---|---|---|---|
| cifar10 | 5.53 | 3.52 | 3.83 |
| cifar100 | 25.6 | 20.97 | 24.75 |

Table 5: Error rates obtained by Fold-2 AdaMixUp without the Intrusion Discriminator on Cifar10 and Cifar100.

Table 5 shows that, excluding the Intrusion Discriminator component, the AdaMixUp method was still able to improve accuracy over the baseline, but obtained much lower accuracy than when the Intrusion Discriminator unit is included.

### Interpolating on Hidden Layer

We also evaluate an alternative structure for AdaMixUp, where mixing happens on the layer before the final softmax layer in ResNet-18. Our results suggest that the test errors of Fold-2 AdaMixUp increased dramatically due to the high cost of the Intrusion Discriminator. The error rates increased from 20.97% to 22.21% and from 3.52% to 4.94%, respectively, for Cifar100 and Cifar10.
Notably, both have a large intrusion loss of around 0.49, which may be due to the fact that images in the hidden embedding space may not be as perceptually distinguishable as they are in the input space. As a result, the Policy Region Generator failed to find good mixing policies that avoid colliding with the data manifold.

## Related Work

Common data augmentation methods have been designed based on substantial domain knowledge (Lecun et al. 1998; Simonyan and Zisserman 2014; Amodei et al. 2015; Zhong et al. 2017), have relied on specific network architectures (Gastaldi 2017; Devries and Taylor 2017; Yamada, Iwamura, and Kise 2018), or have leveraged feedback signals to search for optimal augmentation strategies (Lemley, Bazrafkan, and Corcoran 2017; Cubuk et al. 2018). Our method excludes those requirements and only leverages a simple linear interpolation for data augmentation. Our work closely relates to approaches that linearly interpolate examples and labels (Zhang et al. 2017; De Vries and Taylor 2017; Verma et al. 2018; Tokozume, Ushiku, and Harada 2017). Nevertheless, these approaches depend on correct, user-predefined mixing policies. Also, their interpolation typically lies along the lines connecting sample pairs. On the contrary, our approach automatically learns the mixing policy regions and benefits from mixing multiple images.

## Conclusions and Outlook

This work justifies the effectiveness of MixUp from the perspective of out-of-manifold regularization. We also identify the inherent problem with standard MixUp in causing manifold intrusion. This motivates us to develop AdaMixUp, which generalizes MixUp to higher-fold mixing policies and automatically learns the policy regions that avoid manifold intrusion. To us, this work is only the beginning of a rich research theme. The justification of MixUp and AdaMixUp we provide thus far is primarily qualitative. It is desirable to quantitatively characterize the generalization capability of AdaMixUp. Moreover, the out-of-manifold regularization perspective potentially opens a door to new regularization techniques. Beyond local linearization, we believe that there is a rich space of other possibilities.

## Acknowledgments

The authors wish to thank the anonymous reviewers for their valuable comments, some of which are likely to help in furthering this work. This work is supported partly by the China 973 program (2015CB358700), by the National Natural Science Foundation of China (61772059), and by the Beijing Advanced Innovation Center for Big Data and Brain Computing.

## References

Alemi, A. A.; Fischer, I.; Dillon, J. V.; and Murphy, K. 2016. Deep variational information bottleneck. CoRR abs/1612.00410.

Amodei, D.; Anubhai, R.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Chen, J.; Chrzanowski, M.; Coates, A.; Diamos, G.; Elsen, E.; Engel, J.; Fan, L.; Fougner, C.; Han, T.; Hannun, A. Y.; Jun, B.; LeGresley, P.; Lin, L.; Narang, S.; Ng, A. Y.; Ozair, S.; Prenger, R.; Raiman, J.; Satheesh, S.; Seetapun, D.; Sengupta, S.; Wang, Y.; Wang, Z.; Wang, C.; Xiao, B.; Yogatama, D.; Zhan, J.; and Zhu, Z. 2015. Deep speech 2: End-to-end speech recognition in English and Mandarin. CoRR abs/1512.02595.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.

Cubuk, E. D.; Zoph, B.; Mané, D.; Vasudevan, V.; and Le, Q. V. 2018. AutoAugment: Learning augmentation policies from data. CoRR abs/1805.09501.

Devries, T., and Taylor, G. W. 2017. Improved regularization of convolutional neural networks with cutout. CoRR abs/1708.04552.
De Vries, T., and Taylor, G. W. 2017. Dataset augmentation in feature space. CoRR.

Gal, Y., and Ghahramani, Z. 2015. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. CoRR abs/1506.02142.

Gastaldi, X. 2017. Shake-shake regularization. CoRR abs/1705.07485.

Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A. C.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS, 2672-2680.

Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. CoRR abs/1412.6572.

Graves, A.; Mohamed, A.; and Hinton, G. E. 2013. Speech recognition with deep recurrent neural networks. CoRR abs/1303.5778.

Hanson, S. J., and Pratt, L. Y. 1988. Comparing biases for minimal network construction with back-propagation. In NIPS, 177-185.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Identity mappings in deep residual networks. CoRR abs/1603.05027.

Hinton, G.; Deng, L.; Yu, D.; Dahl, G.; Mohamed, A.-r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.; and Kingsbury, B. 2012. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine.

Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. CoRR abs/1312.6114.

Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 1097-1105. USA: Curran Associates Inc.

Lecun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278-2324.

Lemley, J.; Bazrafkan, S.; and Corcoran, P. 2017. Smart augmentation - learning an optimal data augmentation strategy. CoRR abs/1703.08383.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. S.; Berg, A. C.; and Li, F. 2014. ImageNet large scale visual recognition challenge. CoRR abs/1409.0575.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T. P.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484-489.

Simard, P.; LeCun, Y.; Denker, J. S.; and Victorri, B. 1998. Transformation invariance in pattern recognition - tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade (This Book is an Outgrowth of a 1996 NIPS Workshop), 239-274.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.

Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1):1929-1958.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. CoRR abs/1409.3215.

Tokozume, Y.; Ushiku, Y.; and Harada, T. 2017. Between-class learning for image classification. CoRR abs/1711.10284.

Verma, V.; Lamb, A.; Beckham, C.; Courville, A.; Mitliagkas, I.; and Bengio, Y. 2018. Manifold mixup: Encouraging meaningful on-manifold interpolation as a regularizer. CoRR.
Wu, Y., et al. 2016. Tensorpack. https://github.com/tensorpack/.

Yamada, Y.; Iwamura, M.; and Kise, K. 2018. ShakeDrop regularization. CoRR abs/1802.02375.

Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. https://github.com/szagoruyko/wide-residual-networks.

Zhang, H.; Cissé, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017. mixup: Beyond empirical risk minimization.

Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; and Yang, Y. 2017. Random erasing data augmentation. CoRR abs/1708.04896.