# Breaking Inter-Layer Co-Adaptation by Classifier Anonymization

Ikuro Sato¹, Kohta Ishikawa¹, Guoqing Liu¹, Masayuki Tanaka²

¹Denso IT Laboratory, Inc., Japan. ²National Institute of Advanced Industrial Science and Technology, Japan. Correspondence to: Ikuro Sato.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

## Abstract

This study addresses an issue of co-adaptation between a feature extractor and a classifier in a neural network. A naïve joint optimization of a feature extractor and a classifier often brings situations in which an excessively complex feature distribution, adapted to a very specific classifier, degrades the test performance. We introduce a method called Feature-extractor Optimization through Classifier Anonymization (FOCA), which is designed to avoid explicit co-adaptation between a feature extractor and a particular classifier by using many randomly generated, weak classifiers during optimization. We put forth a mathematical proposition stating that, under special conditions, the FOCA features form a point-like distribution within each class in a class-separable fashion. Real-data experiments under more general conditions provide supportive evidence.

Figure 1. Visualization of typical 2D features of two-class training data. (a) A naïve joint optimization of a feature extractor and a classifier, with single- (left) and multi-layered (right) classifiers; (b) FOCA (ours), with single- (left) and multi-layered (right) classifiers. Features in (b) form nearly point-like distributions per class, whereas those in (a) form more complex distributions. An L2 loss is minimized in each case. Black (white) dots indicate +1 (−1)-class data points, and the colored maps indicate the classifiers' outputs; in (b), averaged outputs of 256 weak classifiers are shown.

## 1. Introduction

When specific signal patterns are repeatedly delivered by hidden neurons in a neural network during training, the network parameters are updated in a strongly tied way, or co-adapted, so that the network becomes vulnerable to small input perturbations (Hinton et al., 2012; Srivastava et al., 2014). To discourage co-adaptation, Hinton et al. proposed a method called Dropout that randomly deactivates neurons during training. Properties of Dropout training have been studied intensively (Helmbold & Long, 2015; Baldi & Sadowski, 2013; Gal & Ghahramani, 2016; Wager et al., 2013; Ren et al., 2016; Warde-Farley et al., 2013; Bengio et al., 2013), although there is a critique arguing that it does not necessarily prevent co-adaptation (Helmbold & Long, 2018).

Yosinski et al. studied the degree of inter-layer co-adaptation by examining the test performance that mid-layer features can yield (Yosinski et al., 2014). In part of their experiments, they split an end-to-end trained network into two blocks of layers, initialized the second-block parameters with random numbers, and trained the second block from scratch with the first-block parameters held fixed. They found that the secondary optimization often degrades the test performance compared with the preceding primary joint optimization, even when Dropout is adopted. In these cases, inter-layer co-adaptation (or fragile co-adaptation, in their words) happens between the two blocks.
Potentially, there is a chance that the secondary optimization finds the same minimum achieved by the primary optimization; in reality, however, that chance is usually not high. An excessively complex feature distribution, like the ones shown in Fig. 1 (a), would be a major factor inducing inter-layer co-adaptation. Yosinski et al. also showed that inter-layer co-adaptation tends to cause negative effects in cross-domain transfer.

Is there a way to fundamentally avoid inter-layer co-adaptation? Based on the thought that a naïve joint optimization of a feature extractor and a classifier results in unwanted co-adaptation between them, we seek a more fundamental approach to break this adhesion, rather than empirically searching for the best-performing feature layers (Yosinski et al., 2014; Kobayashi, 2017). The questions we try to answer in this work are: a) Is it possible to train a feature extractor without inter-layer co-adaptation to a particular classifier? b) After such training, what characteristics, along with robustness against unwanted inter-layer co-adaptation, does the feature extractor acquire?

Regarding the first question, we introduce a particular feature-extractor optimization method called Feature-extractor Optimization through Classifier Anonymization (FOCA) in Section 2. FOCA is designed so that the feature extractor does not explicitly co-adapt to a particular classifier. Instead, it uses randomly generated, weak classifiers during the feature-extractor training. FOCA belongs to the family of network randomization methods (Srivastava et al., 2014; Wan et al., 2013; Zeiler & Fergus, 2013; Huang et al., 2016; Singh et al., 2016), but differs from the others in that it does not employ a joint optimization of a feature extractor and a classifier. The classifier part is anonymized by marginalizing over independently generated, weak classifiers; in this way, explicit co-adaptation to a particular classifier is avoided.

Regarding the second question, we obtained an intriguing mathematical proposition (Section 3) and experimental evidence about the simplicity of FOCA feature distributions. Suppose class-c features form a point-like distribution in a class-separable fashion. In that case, a strong classifier for a partial dataset must also be strong for the entire dataset. This characteristic is largely confirmed for the FOCA features (Section 4.1). The distance between the large-dataset solution and the small-dataset solution in the classifier parameter space is indeed very small when FOCA is adopted (Section 4.2). Low-dimensional analyses of the FOCA features exhibit nearly point-like distributions (Section 4.3).

## 2. Optimization Method to Break Inter-Layer Co-Adaptation

In this section, we introduce FOCA, which aims at training a feature extractor without inter-layer co-adaptation to a particular classifier. We first review the basic joint optimization method, then introduce FOCA.

### 2.1. Joint Optimization: a Review

Let $(x, t)$ be a pair of a $d_I$-dimensional input and the corresponding $d_O$-dimensional target. The training dataset $D$ contains $n_D$ such pairs. The feature extractor $F_\phi : \mathbb{R}^{d_I} \to \mathbb{R}^{d_F}$ transforms an input into a $d_F$-dimensional feature with parameter set $\phi$, and the classifier $C_\theta : \mathbb{R}^{d_F} \to \mathbb{R}^{d_O}$ transforms a feature into a $d_O$-dimensional output vector with parameter set $\theta$.
The joint optimization problem is given as
$$(\phi^\star, \theta^\star) = \operatorname*{arg\,min}_{\phi,\theta} \sum_{(x,t)\in D} L\big(C_\theta(F_\phi(x)), t\big), \tag{1}$$
where $L(\cdot, t) : \mathbb{R}^{d_O} \to \mathbb{R}$ defines the sample-wise loss between the network output and the target. When SGD training is naïvely applied, at each step the classifier is updated so as to become more discriminative for the presented features, no matter how complex the feature distribution is. The feature extractor, on the other hand, is updated so that the classifier at that moment becomes stronger, no matter how complex the decision boundary is. The toy example in Fig. 1 (a) demonstrates such a case, where training results in an excessively complex feature distribution.

### 2.2. Feature-extractor Optimization through Classifier Anonymization (FOCA)

Below, we introduce FOCA for optimizing a feature extractor without explicit co-adaptation to a particular classifier. The optimization problem is defined as
$$\phi^\star = \operatorname*{arg\,min}_{\phi} \sum_{(x,t)\in D} \mathbb{E}_{\theta\sim\Theta_\phi}\, L\big(C_\theta(F_\phi(x)), t\big), \tag{2}$$
where $\Theta_\phi$ represents a predefined distribution of weak classifiers for a given parameter set $\phi$, and $\mathbb{E}_{\theta\sim\Theta_\phi}$ represents the expectation over $\theta \sim \Theta_\phi$. The feature extractor is optimized with respect to a set of weak classifiers that are independently sampled from $\Theta_\phi$, and thus cannot co-adapt to a particular classifier, as long as $\Theta_\phi$ generates distinct weak classifiers.

The weakness of the discriminative power of $\theta \sim \Theta_\phi$ is essential in this formulation. If $\theta \sim \Theta_\phi$ is designed to be too strong for $D$, its decision boundary likely becomes fairly complex during training, and the feature extractor would update itself to better fit the complex decision boundary, resulting in a vicious cycle. On the other hand, if $\theta \sim \Theta_\phi$ is too weak or even adversarial, the optimization process would not converge.

The marginalization over weak classifiers likely prevents the feature distribution from becoming excessively complex. Even at the end of the optimization, there is generally a large number of distinct weak classifiers, and the feature extractor is optimized with respect to the ensemble of these weak classifiers. Although some of the weak classifiers may have excessively complex decision boundaries, marginalization over the classifier ensemble likely smoothens those out. This likely yields a relatively simple decision boundary and reasonably strong classification power, as is essential in the classical classifier bagging algorithm (Breiman, 1996) and other ensemble learning algorithms (Hara et al., 2017; Zahavy et al., 2018). Therefore, the form of the feature distribution likely becomes as simple as the feature extractor's description ability allows.

There is some room in defining $\Theta_\phi$, and here we introduce a particular definition. Let
$$\Theta_\phi = U(\{\theta_{\phi,b};\ b = b_1, b_2, \dots\}), \tag{3}$$
where $U(s)$ is a discrete uniform distribution over all elements of a set $s$, and $\theta_{\phi,b}$ is a solution that minimizes a batch-wise loss function with a norm regularization,
$$\theta_{\phi,b} = \operatorname*{arg\,min}_{\theta} \sum_{(x,t)\in b} L\big(C_\theta(F_\phi(x)), t\big) + \lambda \|\theta\|_2^2. \tag{4}$$
Here, batch $b$ comprises $n_b$ training samples that cover all classes, and $\lambda > 0$. We further assume that the classifier parameters are initialized with random numbers prior to optimization; therefore, there is almost no chance of continuity between $\theta_{\phi,b}$ and $\theta_{\phi+\delta\phi,b}$ for $\|\delta\phi\| \ll 1$. A solution $\theta_{\phi,b}$, which is a strong classifier for the batch $b$, is not generally strong for the entire dataset $D$ for a given $\phi$, because it does not see training samples other than the ones in $b$.
However, there is no guarantee that $\theta_{\phi,b}$ is always a weak classifier for $D$ in the classical sense, that is, a classifier that performs only slightly better than random guessing. Indeed, $\theta_{\phi,b}$ can even act adversarially on $D$, meaning that its accuracy falls below the chance rate. For brevity, we nevertheless simply call $\theta_{\phi,b}$ a weak classifier in this work.

The norm regularization term in Eq. (4) helps to avoid blow-ups during the feature-extractor training. Without the regularizer, the scale of $\theta_{\phi,b}$ can become very large when two feature vectors in $b$ stand close to each other, and instability likely occurs in such a case.

After a feature extractor is obtained by Eq. (2), the following secondary optimization using the entire dataset provides a final, single classifier:
$$\theta^\star = \operatorname*{arg\,min}_{\theta} \sum_{(x,t)\in D} L\big(C_\theta(F_{\phi^\star}(x)), t\big). \tag{5}$$
Here, the classifier is trained with fixed features. Note that the classifier architecture in this secondary optimization can differ from the one used in the primary optimization.

Our method and meta-learning (Finn et al., 2017) share a similarity, although the goals are different (co-adaptation prevention vs. transferable multi-task learning): our feature extractor acts like a task-generic base network, and our classifiers act like task-wise fine-tuned models.

**Approximate minimization.** Regarding the number of weak classifiers used in a single update of $\phi$, it is impossible to prepare a complete set of possible weak classifiers due to the huge number of distinct batches of the same size, so we must adopt an approximation instead. Algorithm 1 gives an approximate solution of $\phi^\star$ in two senses: 1) a single weak classifier is sampled from $\Theta_\phi$ per $\phi$-update instead of taking a marginalization over $\Theta_\phi$, and 2) $\Theta_\phi$ is held fixed in the computation of gradients with respect to $\phi$.

**Algorithm 1** Approximate minimization in Eq. (2)

Input: total number of iterations $T$; number of classes $C$; number of class-$c$ samples $n_c$ $(c = 1, \dots, C)$; number of samples per class for the $\theta$-update $k$; total number of samples $n_D$; minibatch size for the $\phi$-update $m$; learning rate $\eta$

1: Begin
2: Initialize $\phi$ with random variables.
3: for $t = 1 : T$ do
4: &nbsp;&nbsp; $I_c = [\mathrm{randi}(n_1, k), \dots, \mathrm{randi}(n_C, k)]$
5: &nbsp;&nbsp; $\theta = \operatorname{arg\,min}_{\theta} \sum_{i \in I_c} L(C_\theta(F_\phi(x_i)), t_i) + \lambda \|\theta\|_2^2$
6: &nbsp;&nbsp; $I_f = \mathrm{randi}(n_D, m)$
7: &nbsp;&nbsp; $\phi \leftarrow \phi - \eta \sum_{i \in I_f} \partial L(C_\theta(F_\phi(x_i)), t_i) / \partial\phi$
8: end for
9: End
Output: feature-extractor parameters $\phi^\star = \phi$

(In the pseudocode, $\mathrm{randi}(i, j)$ returns a $j$-dimensional vector with each element drawn from $U(\{1, 2, \dots, i\})$.)

It is worth mentioning that there is another way of generating a reasonably weak classifier: take a batch (which can be large) and optimize the batch-wise loss in an incomplete fashion by stopping after a relatively small number of iterations, say 20. This works fine, though the definition of $\theta_{\phi,b}$ then becomes mathematically less clear.
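To make the control flow of Algorithm 1 concrete, the following PyTorch-style sketch performs one iteration. The helper callables `sample_per_class` and `sample_minibatch`, the use of cross-entropy, and all hyper-parameter values are illustrative assumptions of ours rather than details fixed by the algorithm.

```python
import torch
import torch.nn.functional as F

def foca_iteration(feat_ext, make_classifier, sample_per_class, sample_minibatch,
                   opt_phi, k=1, m=128, inner_steps=32, lam=1e-3, weak_lr=0.1):
    """One iteration of Algorithm 1 (lines 4-7), sketched with assumed helpers.

    feat_ext         : feature extractor F_phi (torch.nn.Module)
    make_classifier  : returns a freshly, randomly initialized classifier C_theta
    sample_per_class : callable(k) -> class-balanced batch (x, t) for the theta-update
    sample_minibatch : callable(m) -> minibatch (x, t) for the phi-update
    opt_phi          : optimizer over feat_ext.parameters()
    """
    # Lines 4-5: draw k samples per class and fit a weak classifier on frozen features.
    xb, tb = sample_per_class(k)
    with torch.no_grad():
        fb = feat_ext(xb)                         # features are fixed during the theta-update
    clf = make_classifier()                       # random re-initialization each iteration
    opt_theta = torch.optim.SGD(clf.parameters(), lr=weak_lr)
    for _ in range(inner_steps):                  # incomplete minimization of Eq. (4)
        loss = F.cross_entropy(clf(fb), tb)
        loss = loss + lam * sum((p ** 2).sum() for p in clf.parameters())
        opt_theta.zero_grad()
        loss.backward()
        opt_theta.step()

    # Lines 6-7: update phi against the now-frozen weak classifier.
    for p in clf.parameters():
        p.requires_grad_(False)
    xm, tm = sample_minibatch(m)
    loss = F.cross_entropy(clf(feat_ext(xm)), tm)
    opt_phi.zero_grad()
    loss.backward()
    opt_phi.step()
```

Strictly speaking, line 5 of Algorithm 1 asks for the arg min; the fixed-length inner loop above corresponds to the incomplete-optimization variant mentioned at the end of this subsection.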
## 3. Mathematical Property

We now show a proposition about the simplicity of FOCA feature distributions. It will be proven that, under some special conditions, any two samples have exactly the same features when their target classes are the same, but have different features when their target classes are different. Let us first introduce a lemma about implicit optimality for individual features, and then put forth the proposition.

**Lemma 3.1.** Suppose that a multi-layered feature extractor with two restrictions is used: 1) the activation function $a$ satisfies
$$a : \mathbb{R} \to \mathbb{R}_+, \qquad \frac{\partial a(z)}{\partial z} \neq 0 \quad \forall z; \tag{6}$$
2) the last layer is fully-connected. If $\phi^\star$ simultaneously minimizes the sample-wise losses $L(C_\theta(F_\phi(x)), t)$ for all $(x, t) \in D$, then
$$\frac{\partial C_\theta}{\partial F_\phi}\, \frac{\partial L(C_\theta(F_{\phi^\star}(x)), t)}{\partial C_\theta} = 0, \quad \forall (x, t) \in D. \tag{7}$$
($\partial C_\theta / \partial F_\phi$ is a short-hand notation for $\partial C_\theta(f)/\partial f\,\big|_{f = F_{\phi^\star}(x)}$. A summation symbol over the indices of $C_\theta$ is omitted in Eq. (7).)

*Proof.* Let $\phi_\ell$ be the parameter set of the last weight layer in the feature extractor, and let $x_\ell$ be its input. The $i$-th element of the feature layer, which is fully-connected from the previous layer, is given as $F_\phi(i) = a\big(\sum_j \phi_\ell(i, j)\, x_\ell(j)\big)$. Let $z = \sum_j \phi_\ell(i, j)\, x_\ell(j)$. Then $\partial F_\phi(i)/\partial \phi_\ell(i, j) = (\partial a(z)/\partial z)\, x_\ell(j) \neq 0$, since $\partial a(z)/\partial z \neq 0$ and $x_\ell > 0$. The inequality $\partial F_\phi(i)/\partial \phi_\ell(i, j) \neq 0$, the supposition $\partial L/\partial \phi_\ell(i, j)\,\big|_{\phi=\phi^\star} = 0,\ \forall (x, t) \in D$, and the chain rule immediately lead to Eq. (7). □

In the following discussion, we assume that the conditions stated below hold.

(C1) A multi-layered feature extractor with two restrictions is used: 1) the activation function satisfies Eq. (6); 2) the last layer is fully-connected.
(C2) The target values are $t \in \{t_1, t_2\}$ for all samples.
(C3) A sample-wise loss function of the form $L_{\phi,\theta}(x, t) = (C_\theta(F_\phi(x)) - t)^2$ is adopted.
(C4) A linear classifier $C_\theta(F_\phi(x)) = \tilde\theta^\top F_\phi(x) + \theta_0$ is used.
(C5) $\Theta_\phi = U(\{\theta_{\phi,b};\ b = b_1, b_2, \dots\})$, where
$$\theta_{\phi,b} = \operatorname*{arg\,min}_{\theta} \sum_{(x,t)\in b} L_{\phi,\theta}(x, t) + \tfrac{1}{2}\lambda \|\tilde\theta\|_2^2,$$
and $b_1, b_2, \dots$ are distinct batches, each of which comprises one sample from the $t_1$ class and one sample from the $t_2$ class.

**Proposition 3.2.** Suppose that $\phi^\star$ simultaneously minimizes the classifier-anonymized, sample-wise losses $\mathbb{E}_{\theta\sim\Theta_\phi} L_{\phi,\theta}(x, t)$ in a class-separable fashion for all $(x, t) \in D$. Then, samples from the same class share the same features, i.e., $F_{\phi^\star}(x) = F_{\phi^\star}(x')$, $\forall x, x' \in X_c$, but samples from different classes do not, i.e., $F_{\phi^\star}(x) \neq F_{\phi^\star}(x')$, $\forall x \in X_c,\ x' \in X_{c' \neq c}$.

*Proof.* Lemma 3.1 about the implicit optimality of individual features with respect to the sample-wise losses yields
$$\mathbb{E}_\theta\!\left[\frac{\partial C_\theta}{\partial F_{\phi^\star}}\, \frac{\partial L_{\phi^\star,\theta}(x, t)}{\partial C_\theta}\right] = 0, \quad \forall (x, t) \in D, \tag{8}$$
where $\mathbb{E}_\theta$ is a short-hand notation for $\mathbb{E}_{\theta\sim\Theta_{\phi^\star}}$. By taking the partial derivatives in Eq. (8), one obtains
$$\mathbb{E}_\theta\!\left[\tilde\theta \tilde\theta^\top\right] F_{\phi^\star}(x) = \mathbb{E}_\theta\!\left[\tilde\theta\,(t - \theta_0)\right], \quad \forall (x, t) \in D. \tag{9}$$
The singular-value decomposition of the matrix consisting of the column vectors $\tilde\theta_b$ sampled from $\Theta_{\phi^\star}$ yields $[\tilde\theta_{b_1}, \tilde\theta_{b_2}, \dots] = U S V^\top$, where the diagonal elements of the positive diagonal matrix $S$ are the singular values aligned in decreasing order. Then $U^\top \tilde\theta_b = [\tilde\theta_b^{n\top}, 0, \dots, 0]^\top$, where $\tilde\theta_b^n$ denotes the non-singular components of $\tilde\theta_b$. Taking the non-singular part of Eq. (9),
$$\mathbb{E}_\theta\!\left[\tilde\theta^n \tilde\theta^{n\top}\right] F^n_{\phi^\star}(x) = \mathbb{E}_\theta\!\left[\tilde\theta^n (t - \theta_0)\right], \quad \forall (x, t) \in D. \tag{10}$$
Here, $U^\top F_{\phi^\star}(x) = \big[F^n_{\phi^\star}(x)^\top, F^s_{\phi^\star}(x)^\top\big]^\top$, where $F^n_{\phi^\star}(x)$ is the corresponding non-singular part, meaning that $F^n_{\phi^\star}(x)$ and $\tilde\theta^n$ share the same dimension. The matrix $\mathbb{E}_\theta[\tilde\theta^n \tilde\theta^{n\top}]$ is obviously invertible; therefore, Eq. (10) can be solved for $F^n_{\phi^\star}(x)$. It then tells us that (a): $F^n_{\phi^\star}(x)$ depends only on $t$ and the $\tilde\theta_b$'s.

On the other hand, the minimum-norm solution $\tilde\theta_b$ satisfies
$$\big[(f_1 - f_2)(f_1 - f_2)^\top + \lambda I\big]\, \tilde\theta_b = (t_1 - t_2)(f_1 - f_2), \tag{11}$$
where $I$ is the identity matrix and $f_{1(2)} = F_{\phi^\star}(x)$ with target $t = t_{1(2)}$ as a short-hand notation. Taking the non-singular part of Eq. (11), the minimum-norm solution $\tilde\theta^n_b$ satisfies (b):
$$\big[(f^n_1 - f^n_2)(f^n_1 - f^n_2)^\top + \lambda I\big]\, \tilde\theta^n_b = (t_1 - t_2)(f^n_1 - f^n_2), \tag{12}$$
where the superscript $n$ denotes the non-singular part. Given the definition that $[\tilde\theta^n_{b_1}, \tilde\theta^n_{b_2}, \dots]$ is full-rank, statements (a) and (b) do not contradict each other only if
$$\exists v \in \mathbb{R}, \quad \tilde\theta^n_b = v, \quad \forall b. \tag{13}$$
That is, $\tilde\theta^n_b$ is one-dimensional and constant for all $b$. Then, statement (a) yields $F^n_{\phi^\star}(x) = F^n_{\phi^\star}(x')$, $\forall x, x' \in X_c$. Note that $F^n_{\phi^\star}(x) \neq F^n_{\phi^\star}(x')$, $\forall x \in X_c,\ x' \in X_{c' \neq c}$; otherwise $\phi^\star$ would not be a class-separable solution. Because $\theta_{0,b} = \tfrac{1}{2}\sum_{(x,t)\in b} \big(t - \tilde\theta^{n\top} F^n_{\phi^\star}(x)\big)$, $\theta_{0,b}$ must be the same for all $b$. Since $\tilde\theta_b = U[\tilde\theta^{n\top}_b, 0, \dots, 0]^\top$, $\tilde\theta_b$ must also be the same for all $b$. The fact that the minimum-norm solutions are the same for all combinations of $t = t_1$ and $t = t_2$ data points tells us that $F_{\phi^\star}(x) = F_{\phi^\star}(x')$, $\forall x, x' \in X_c$, and $F_{\phi^\star}(x) \neq F_{\phi^\star}(x')$, $\forall x \in X_c,\ x' \in X_{c' \neq c}$. □
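As a quick numerical sanity check of Eq. (11) under (C2)-(C5), the NumPy snippet below solves the two-sample, ridge-regularized problem of (C5) directly via its normal equations and verifies that the resulting $\tilde\theta_b$ satisfies Eq. (11). The feature dimension, the targets, and $\lambda$ are arbitrary illustrative values of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d_F, lam = 8, 0.1                                      # illustrative feature dim. and lambda
f1, f2 = rng.normal(size=d_F), rng.normal(size=d_F)    # features of the two batch samples
t1, t2 = 1.0, -1.0                                     # the two targets of (C2)

# Minimize (C3)+(C5) directly: stack w = [theta_tilde; theta_0] and solve the normal
# equations of sum_i (f_i^T theta_tilde + theta_0 - t_i)^2 + (lam/2) ||theta_tilde||^2.
A = np.stack([np.append(f1, 1.0), np.append(f2, 1.0)])     # rows [f_i^T, 1]
t = np.array([t1, t2])
R = np.diag(np.append(np.full(d_F, lam / 2), 0.0))         # no penalty on theta_0
w = np.linalg.solve(A.T @ A + R, A.T @ t)
theta_tilde, theta_0 = w[:-1], w[-1]

# Eq. (11): [(f1 - f2)(f1 - f2)^T + lam * I] theta_tilde = (t1 - t2)(f1 - f2)
df = f1 - f2
lhs = (np.outer(df, df) + lam * np.eye(d_F)) @ theta_tilde
rhs = (t1 - t2) * df
print(np.allclose(lhs, rhs))      # True

# theta_0 also matches the closed form used at the end of the proof.
print(np.isclose(theta_0, 0.5 * ((t1 - theta_tilde @ f1) + (t2 - theta_tilde @ f2))))  # True
```

The same closed-form, batch-wise solution is what the toy experiment of Fig. 1 (b) uses for $\theta_{\phi,b}$.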
According to this proposition, if the feature extractor has enough representation ability, then under the stated conditions all the input data of class $c$ are projected to a single point in the feature space in a class-separable way. The left side of Fig. 1 (b) visualizes 2D features of the toy data optimized by FOCA under conditions (C1)-(C5), where $\theta_{\phi,b}$ is the analytical solution. Features of the same class are confined to a vicinity whose size is much smaller than the distance between the class centroids. It is intriguing to observe such a point-like distribution property, even though we do not explicitly impose anything like maximization of the between-class scatter with respect to the sum of within-class scatters. Although we have not succeeded in proving the proposition for multi-layered classifier cases, the toy experiment still exhibits the point-like distribution property; see the right side of Fig. 1 (b).

## 4. Experiment

Let us first state our motivations for the series of experiments. We saw in Section 3 that the FOCA features obey a point-like distribution per class under the special conditions. The question we try to answer in this section is: do the FOCA features form a point-like distribution, or some similar distribution, under more realistic conditions?

Let us suppose that features form a point-like distribution. Then, a subsequent secondary classifier optimization with the feature extractor held fixed should yield a similar decision boundary no matter what subset of the entire dataset it learns from, as long as all classes are covered. We indeed confirmed that high test performances are achieved by FOCA even when the secondary optimization uses the smallest possible partial datasets, namely only one sample from each class (see Section 4.1). When FOCA is used, the secondary optimization leads the classifier parameter vector to almost the same point regardless of the size of the partial dataset it uses (see Section 4.2). Lastly, low-dimensional analyses reveal that FOCA features projected onto a hypersphere form a nearly point-like distribution in a class-separable fashion (see Section 4.3).

**Datasets.** We use the CIFAR-10 dataset, a 10-class image classification dataset with $5 \times 10^4$ training samples, and the CIFAR-100 dataset, a 100-class image classification dataset with the same number of samples (Krizhevsky & Hinton, 2009). Both datasets have similar properties except for the number of classes (10 vs. 100) and the number of samples per class (5000 vs. 500). The idea is to see how these differences affect the feature distribution properties.

**Methods.** In each experiment, FOCA is compared with the methods below. Plain: a vanilla mini-batch SGD training. Noisy (Graves, 2011): the same training rule as Plain, except that zero-mean Gaussian noise is added to each of the classifier parameters during training. Dropout (Hinton et al., 2012): applied only to the classifier part. Batch Normalization (Ioffe & Szegedy, 2015): applied to the entire architecture. Dropout is claimed to reduce co-adaptation (Hinton et al., 2012), though there is a counterpoint to this view (Helmbold & Long, 2018).
FOCA, Dropout, and Noisy share the same characteristic in the sense that the classifier's discriminative power is weakened and a classifier ensemble is implicitly taken during training. Unlike FOCA, however, Dropout and Noisy employ joint optimization, and we are interested to see how this affects the robustness against inter-layer co-adaptation. Batch Normalization is included in the comparison based on the thought that the way it propagates signals from one layer to the next may have some functionality mitigating inter-layer co-adaptation.

**Architecture.** The architecture used for the primary optimization in each CIFAR-10 experiment is the one introduced in (Lee et al., 2016), except that we replaced the last two layers by three fully-connected (FC) layers of the form 4096 (feature dim.)-β-β-10, where β = 1024 for Dropout and β = 128 otherwise. The architecture for the secondary optimization is 4096-128-128-10 for all methods. The architecture for the CIFAR-100 primary optimizations is VGG-16 (Simonyan & Zisserman, 2015), except that the last three FC layers are replaced by 512 (feature dim.)-β-β-100, where β = 512 for Dropout and β = 128 otherwise. The architecture for the secondary optimization is 512-128-128-100 for all methods.

**Training details.** SGD with momentum is used in each baseline experiment. In each FOCA experiment, the feature-extractor part uses SGD with momentum, and the classifier part uses gradient descent with momentum. In each training run, we tested a couple of different initial learning rates and chose the best-performing one on the validation set. A manual learning-rate schedule is adopted; the learning rate is dropped by a fixed factor one to three times. The weak classifiers are randomly re-initialized each time from a zero-mean Gaussian distribution with standard deviation 0.1 for both CIFAR-10 and CIFAR-100. Cross-entropy loss with softmax normalization and ReLU activation (Nair & Hinton, 2010) are used in every case. No data augmentation is adopted. The batch size $n_b$ used in the weak-classifier training is 100 for the CIFAR-10 and 1000 for the CIFAR-100 experiments. The number of updates used to generate $\theta$ is 32 for the CIFAR-10 and 64 for the CIFAR-100 experiments. Max-norm regularization (Srivastava et al., 2014) is used in the FOCA training to stabilize it. We found that the FOCA training can be made even more stable by updating the feature-extractor parameters $u$ times for a given set of weak-classifier parameters; we used this trick with $u = 8$ in the CIFAR-100 experiments.

### 4.1. Test Performances of Classifiers Trained on Partial Datasets

**Motivation.** We are interested in two aspects. 1) Is the test performance produced by the primary joint optimization reproducible by the secondary classifier optimization? (FOCA is excluded here because it is not a joint optimization method.) 2) How does the test performance degrade when smaller datasets are learned in the secondary optimization? Recall that point-like distributed features are expected to show little degradation.
Figure 2. Test error rates of classifiers trained on partial datasets, for (a) CIFAR-10 and (b) CIFAR-100. For each method, partial datasets $D'$ are constructed $r(n_{D'})$ times, where $r(5 \times 10^4) = 1$, $r(10^4) = 5$, $r(10^3) = 15$, $r(10^2) = 50$, and $r(10) = 150$. The solid lines indicate the mean values and the error bars indicate $\pm 1$ standard deviation of the test error rates. The test error rates (%) of the end-to-end optimizations and of the secondary optimizations at $n_{D'} = 5 \times 10^4$ (also shown by the bullet points in the figure) are:

| Method | end-to-end (CIFAR-10) | secondary @ $5\times10^4$ (CIFAR-10) | end-to-end (CIFAR-100) | secondary @ $5\times10^4$ (CIFAR-100) |
|---|---|---|---|---|
| FOCA (ours) | – | 11.06 | – | 43.33 |
| Plain | 13.51 | 14.15 | 44.04 | 51.81 |
| Noisy | 11.87 | 12.66 | 43.77 | 50.57 |
| Dropout | 13.43 | 14.95 | 43.24 | 51.14 |
| Batch Norm | 9.65 | 10.32 | 39.28 | 47.91 |

**Experimental procedure.** We trained feature extractors on the entire training dataset using FOCA and the other methods. All methods except FOCA employ the end-to-end learning scheme. We then detach the feature extractors and the classifiers after learning. Next, for all methods, we fix the feature-extractor parameters and train classifiers from scratch by orthodox backpropagation on reduced datasets $D'$ of size $n_{D'}$. For the CIFAR-10 experiments, $n_{D'} = 5 \times 10^4, 10^4, 10^3, 10^2, 10$ ($10$ is the smallest possible dataset size). For the CIFAR-100 experiments, $n_{D'} = 5 \times 10^4, 10^4, 10^3, 10^2$ ($10^2$ is the smallest possible). Figure 2 shows the test error rates vs. $n_{D'}$.

**Cases of $n_{D'} = n_D$.** For all end-to-end optimization methods tested here, the test performances after the secondary optimizations at $n_{D'} = 5 \times 10^4$ happen to be worse than the corresponding end-to-end optimization results (see the right side of Fig. 2). This is one indication that inter-layer co-adaptation occurs to some degree in each of these joint optimization methods.

**Cases of $n_{D'} < n_D$.** For CIFAR-10, when $n_{D'} \le 10^3$, FOCA outperforms the other methods. To our surprise, the average test error rate of FOCA at $n_{D'} = 10$ is 14.45%, which is better than the test error rates of Dropout (14.95%) at $n_{D'} = 5 \times 10^4$, Plain (14.50%) at $n_{D'} = 10^4$, Noisy (14.75%) at $n_{D'} = 10^3$, and Batch Normalization (16.93%) at $n_{D'} = 10^2$. For CIFAR-100, FOCA outperforms the other methods at every $n_{D'}$. The average error rate of FOCA at $n_{D'} = 10^2$ is 46.03%, which is better than the test error rates of all the other methods at $n_{D'} = n_D = 5 \times 10^4$. We think that the high test performance of FOCA for $n_{D'} \ll n_D$ is one indication of a relatively simple form of feature distribution.
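For concreteness, a minimal sketch of this protocol follows, assuming the features have already been extracted once and cached as tensors. The classifier width, optimizer settings, and the full-batch training loop are simplifications of ours, not the exact training recipe of the paper.

```python
import torch
from torch import nn

def secondary_error_rate(train_feats, train_labels, test_feats, test_labels,
                         n_sub, n_classes=10, steps=500, lr=0.1, seed=0):
    """Train a classifier from scratch (Eq. (5)) on a class-covering subset of
    n_sub frozen, pre-extracted features, then report its test error rate."""
    g = torch.Generator().manual_seed(seed)
    # Draw a class-balanced subset D' with n_sub // n_classes samples per class.
    per_class = n_sub // n_classes
    idx = torch.cat([
        torch.nonzero(train_labels == c).flatten()[
            torch.randperm(int((train_labels == c).sum()), generator=g)[:per_class]]
        for c in range(n_classes)])
    clf = nn.Sequential(nn.Linear(train_feats.shape[1], 128), nn.ReLU(),
                        nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, n_classes))
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(clf(train_feats[idx]), train_labels[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        pred = clf(test_feats).argmax(dim=1)
    return (pred != test_labels).float().mean().item()
```

With `n_sub = n_classes`, this reduces to the smallest-possible-dataset case of one sample per class.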
### 4.2. Approximate Geodesic Distances between Solutions

**Motivation.** We just saw that classifiers trained with the largest possible dataset ($n_{D'} = n_D$) and classifiers trained with the smallest possible partial dataset ($n_{D'} = 10$ for CIFAR-10, $n_{D'} = 100$ for CIFAR-100) exhibit similar test performances when the FOCA features are used. Let us call the former the large-dataset solution $\theta_{LD}$ and the latter the small-dataset solution $\theta_{SD}$. This observation drove us to investigate distances between $\theta_{LD}$ and $\theta_{SD}$. The distance should be small when the features form a point-like distribution per class, and we expect FOCA to have this characteristic. We employ an approximate geodesic distance here to take changes in the loss landscape into account. If $\theta_{LD}$ and $\theta_{SD}$ are virtually the same point under FOCA, we further expect that the test error rate is almost unchanged at any intermediate point between $\theta_{LD}$ and $\theta_{SD}$. Since two neural networks having the same architecture but different parameters can produce the same output for an arbitrary input (Watanabe, 2009), we initialized the networks with the same set of random numbers in the experiments conducted in this subsection.

**Experimental procedure.** After optimizing a feature extractor on the full training dataset, $\theta_{LD}$ is optimized with features of the full dataset of size $n_{D'} = n_D = 5 \times 10^4$, and $\theta_{SD}$ is optimized with features of the smallest possible partial dataset: $n_{D'} = 10$ for CIFAR-10 and $n_{D'} = 100$ for CIFAR-100. To quantify the separation between $\theta_{LD}$ and $\theta_{SD}$, we partition the straight line connecting $\theta_{LD}$ and $\theta_{SD}$ into $P$ line segments of equal length in the parameter space, i.e.,
$$\theta_\alpha = \frac{\alpha\, \theta_{SD} + (P - \alpha)\, \theta_{LD}}{P}, \quad \alpha = 0, 1, \dots, P. \tag{14}$$
We define in this article an approximate geodesic distance $d(\theta_{LD}, \theta_{SD})$ between $\theta_{LD}$ and $\theta_{SD}$ as
$$d(\theta_{LD}, \theta_{SD}) = \left[\sum_{\alpha=0}^{P-1} d(\theta_\alpha, \theta_{\alpha+1})^2\right]^{1/2}. \tag{15}$$
Here, $d(\theta_\alpha, \theta_{\alpha+1})$ is the distance between $\theta_\alpha$ and $\theta_{\alpha+1}$ with respect to the Fisher information metric $I_\alpha$ evaluated at $\theta_\alpha$,
$$d(\theta_\alpha, \theta_{\alpha+1})^2 = (\theta_{\alpha+1} - \theta_\alpha)^\top I_\alpha\, (\theta_{\alpha+1} - \theta_\alpha), \tag{16}$$
$$I_\alpha = \mathbb{E}_{(x,t)\in \tilde D}\!\left[\frac{\partial L_{\phi^\star,\theta}(x, t)}{\partial \theta} \frac{\partial L_{\phi^\star,\theta}(x, t)}{\partial \theta}^{\!\top}\right]_{\theta = \theta_\alpha},$$
where $\tilde D$ is either $D$ or a subset of $D$ for ease of computation, and $L_{\phi^\star,\theta}(x, t)$ is a short-hand notation for $L(C_\theta(F_{\phi^\star}(x)), t)$. To compute a genuine geodesic distance, one would need to sum the Fisher-metric distances between pairs of infinitesimally separated points along the curve that minimizes this sum. This is computationally infeasible; we instead approximate the curve by the straight line, as explained. In the experiment, we let $\tilde D$ be a randomly chosen subset consisting of 5% of the training samples. We set $P = 15$. We evaluated $d(\theta_{LD}, \theta_{SD})$ layer by layer.

Figure 3. The approximate geodesic distances between $\theta_{LD}$ and $\theta_{SD}$ for the 1st, 2nd, and 3rd classifier layers (panels a-c for CIFAR-10, panels e-g for CIFAR-100), and the test error rates at $\theta_\alpha$ (panel d for CIFAR-10, panel h for CIFAR-100). In (a-c, e-g), the solid lines indicate the segment-wise distances $d(\theta_\alpha, \theta_{\alpha+1})$, $\alpha \in \{0, \dots, P-1\}$, and the dashed lines indicate the total distances $d(\theta_{LD}, \theta_{SD})$.

Figure 3 shows the approximate geodesic distances and the test error rates at $\theta_\alpha$.

**Approximate geodesic distances.** For CIFAR-10, FOCA exhibits some orders-of-magnitude (roughly 40-180 times) smaller distances $d(\theta_{LD}, \theta_{SD})$ than the other methods. For CIFAR-100, FOCA exhibits 3-9 times smaller distances than the other methods. We suspect the differences are more moderate in the CIFAR-100 cases because a point-like distribution is harder to obtain for CIFAR-100. Nevertheless, FOCA exhibits the smallest approximate geodesic distances in all cases, and we regard this as implicit evidence that the distribution of the FOCA features is simple enough that the discriminative function generated with $n_{D'} = n_D$ is virtually reproducible when $n_{D'} \ll n_D$.

**Test error rates at $\theta_\alpha$.** For FOCA, the test error rates are almost constant at all $\theta_\alpha$. Together with the small $d(\theta_{LD}, \theta_{SD})$, the two points $\theta_{LD}$ and $\theta_{SD}$ can be viewed as virtually the same point when FOCA is used. For the other methods, the test error rates increase much more rapidly toward $\theta_{SD} = \theta_{15}$.
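For illustration, a minimal NumPy sketch of Eqs. (14)-(16) follows. The `grad_loss` callback, the flattened parameter vectors, and the trick of accumulating $(g^\top \delta)^2$ instead of forming $I_\alpha$ explicitly are our own assumptions, not details given in the paper.

```python
import numpy as np

def approx_geodesic_distance(theta_LD, theta_SD, grad_loss, data, P=15):
    """Approximate geodesic distance between two classifier solutions.

    theta_LD, theta_SD : 1-D arrays holding the flattened parameters of the
                         large-dataset and small-dataset solutions
    grad_loss(theta, x, t) : 1-D gradient of the sample-wise loss w.r.t. theta,
                             with the feature extractor held fixed at phi*
    data : iterable of (x, t) pairs playing the role of D-tilde (e.g. a 5% subset)
    """
    seg_sq = []
    for alpha in range(P):
        # Eq. (14): points on the straight line between the two solutions.
        th_a = (alpha * theta_SD + (P - alpha) * theta_LD) / P
        th_b = ((alpha + 1) * theta_SD + (P - alpha - 1) * theta_LD) / P
        delta = th_b - th_a
        # Eq. (16): delta^T I_alpha delta, with I_alpha the empirical outer product
        # of loss gradients; only the quadratic form is needed, so accumulate
        # (g . delta)^2 rather than building I_alpha explicitly.
        seg_sq.append(np.mean([(grad_loss(th_a, x, t) @ delta) ** 2 for x, t in data]))
    # Eq. (15): combine the P segment-wise squared distances.
    return float(np.sqrt(np.sum(seg_sq)))
```

The paper evaluates the distance layer by layer; the sketch above treats whichever parameters are passed in as a single flattened vector, so it can be applied per layer by passing one layer's parameters at a time.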
### 4.3. Low-Dimensional Properties

**Motivation.** We now use classical component analyses to clarify the low-dimensional structure of the FOCA features. In this subsection we only show the CIFAR-10 results, because the CIFAR-100 results are qualitatively similar.

**Principal Component Analysis (PCA).** Figure 4 (a) shows scatter plots of training-data features projected onto the 2D bases of a PCA applied to all features. For a given class, the projected FOCA features look nearly one-dimensional, not point-like. This one-dimensional characteristic is probably due to the use of softmax normalization at the last layer, though we have no proof so far. In contrast, the projected features of the other methods clearly span two dimensions, roughly confined to an ellipse-like region, for a given class.

**Linear Discriminant Analysis (LDA) with normalization.** Next, we examine LDA on features normalized to unit length. The normalization is based on the observation that the 2D features in Fig. 4 (a) are distributed mostly along the radial direction about a point close to the origin.

Figure 4. Two-dimensional visualization of the 4096-dimensional CIFAR-10 features. Methods are (from the left): FOCA (ours), Plain, Noisy, Dropout, and Batch Normalization. Colors indicate true classes. (a) Training-data features projected by the PCA bases. (b) Training-data features first normalized to unit length and then projected by the LDA bases, constructed on the class-1 normalized features vs. the rest of the normalized features. (c) Test-data features first normalized to unit length and then projected by the same LDA bases used in (b).

Figure 4 (b) shows the 2D features that are normalized and then projected by the LDA bases described above. Only class-1-vs-rest results are shown because no significant differences are observed when class 1 is replaced by another class. Here we observe a remarkable difference: the projected FOCA features are linearly separable with a fairly large margin compared to the characteristic scales of the class-1 distribution and of the rest-of-the-classes distribution. The form of the feature distribution is close to point-like per class, somewhat similar to the observation in the toy experiment shown in Fig. 1 (b). In contrast, the other methods exhibit linearly non-separable feature distributions.

**Linear separability by the LDA with normalization.** The generalized eigenvalues computed in the LDA discussed above are given in Table 1. These values are the largest ratios of the between-class scatter to the within-class scatter after linear projection. FOCA exhibits an orders-of-magnitude larger generalized eigenvalue than the other methods, which supports the high degree of linear separability of the FOCA features. Figure 4 (c) shows the normalized features of the test data, projected by the same LDA bases, to assess generalizability. Table 1 also shows the smallest possible test error rates (class-1 vs. rest) obtained by setting a threshold along the principal axis. FOCA yields the lowest test error rate.

Table 1. Results of the class-1-vs-rest LDA with normalization.

| Method | Eigenvalue | Test error rate |
|---|---|---|
| FOCA (ours) | 247.28 | 2.01% |
| Plain | 5.74 | 2.71% |
| Noisy | 7.49 | 2.86% |
| Dropout | 5.81 | 2.78% |
| Batch Norm | 7.28 | 2.43% |
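A minimal sketch of the class-1-vs-rest LDA with normalization used here is given below; the scatter-matrix construction and the small ridge added to $S_w$ are standard choices of ours, not details specified in the paper.

```python
import numpy as np
from scipy.linalg import eigh

def one_vs_rest_lda(features, labels, positive_class=1, eps=1e-6):
    """Largest generalized eigenvalue (between-class over within-class scatter)
    and its direction, for length-normalized features, class-1 vs. the rest."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)   # project onto the unit hypersphere
    pos, neg = f[labels == positive_class], f[labels != positive_class]
    mu_p, mu_n, mu = pos.mean(axis=0), neg.mean(axis=0), f.mean(axis=0)
    # Within-class and between-class scatter matrices.
    Sw = (pos - mu_p).T @ (pos - mu_p) + (neg - mu_n).T @ (neg - mu_n)
    Sb = (len(pos) * np.outer(mu_p - mu, mu_p - mu)
          + len(neg) * np.outer(mu_n - mu, mu_n - mu))
    # Solve the generalized eigenproblem Sb w = lambda Sw w; scipy's eigh returns
    # eigenvalues in ascending order, so the last one is the Table-1 quantity.
    vals, vecs = eigh(Sb, Sw + eps * np.eye(f.shape[1]))
    return vals[-1], vecs[:, -1]
```

Under this reading, the "Eigenvalue" column of Table 1 corresponds to the returned largest generalized eigenvalue, and thresholding the projection onto the returned direction gives the reported class-1-vs-rest test error rates.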
## 5. Conclusion

A naïve joint optimization of a feature extractor and a classifier in a neural network often brings cases where both sets of parameters are tied in such a complex way that the classifier cannot be replaced without degrading the test performance. We introduced a method called Feature-extractor Optimization through Classifier Anonymization (FOCA) that is designed to break unwanted inter-layer co-adaptation. FOCA produces a feature extractor that does not explicitly adapt to a particular classifier. We gave a mathematical proposition that guarantees a simple form of feature distribution under special conditions; indeed, features form a point-like distribution in a class-separable way. Different kinds of real-dataset experiments under more general conditions provide supportive evidence.

## References

Baldi, P. and Sadowski, P. J. Understanding dropout. In Neural Information Processing Systems (NIPS), pp. 2814–2822, 2013.

Bengio, Y., Léonard, N., and Courville, A. C. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432, 2013.

Breiman, L. Bagging predictors. Machine Learning, 24:123–140, 1996.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), 2017.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), 2016.

Graves, A. Practical variational inference for neural networks. In Neural Information Processing Systems (NIPS), 2011.

Hara, K., Saitoh, D., and Shouno, H. Analysis of dropout learning regarded as ensemble learning. arXiv:1706.06859, 2017.

Helmbold, D. P. and Long, P. M. Surprising properties of dropout in deep networks. Journal of Machine Learning Research, 18:1–28, 2018.

Helmbold, D. P. and Long, P. M. On the inductive bias of dropout. Journal of Machine Learning Research, 16:3403–3454, 2015.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.

Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. In European Conference on Computer Vision (ECCV), 2016.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.

Kobayashi, T. Sharing ConvNet across heterogeneous tasks. In International Conference on Neural Information Processing (ICONIP), 2017.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, 2009.

Lee, C.-Y., Gallagher, P., and Tu, Z. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), 2010.

Ren, Y., Zhang, L., and Suganthan, P. N. Ensemble classification and regression-recent developments, applications and future directions [review article]. IEEE Computational Intelligence Magazine, 11(1):41–53, 2016.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.

Singh, S., Hoiem, D., and Forsyth, D. Swapout: Learning an ensemble of deep architectures. In Neural Information Processing Systems (NIPS), 2016.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Wager, S., Wang, S., and Liang, P. S.
Dropout training as adaptive regularization. In Neural Information Processing Systems (NIPS), 2013.

Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. Regularization of neural networks using DropConnect. In International Conference on Machine Learning (ICML), 2013.

Warde-Farley, D., Goodfellow, I. J., Courville, A. C., and Bengio, Y. An empirical analysis of dropout in piecewise linear networks. arXiv:1312.6197, 2013.

Watanabe, S. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, 2009.

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In Neural Information Processing Systems (NIPS), 2014.

Zahavy, T., Sivak, A., Kang, B., Feng, J., Xu, H., and Mannor, S. Ensemble robustness and generalization of stochastic learning algorithms. In Workshop of the International Conference on Learning Representations (ICLR), 2018.

Zeiler, M. and Fergus, R. Stochastic pooling for regularization of deep convolutional neural networks. In International Conference on Learning Representations (ICLR), 2013.