# Exploring Generalization in Deep Learning

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, Nathan Srebro
Toyota Technological Institute at Chicago
{bneyshabur, srinadh, mcallester, nati}@ttic.edu

With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.

1 Introduction

Learning with deep neural networks has enjoyed huge empirical success in recent years across a wide variety of tasks. Despite being a complex, non-convex optimization problem, simple methods such as stochastic gradient descent (SGD) are able to recover good solutions that minimize the training error. More surprisingly, the networks learned this way exhibit good generalization behavior, even when the number of parameters is significantly larger than the amount of training data [20, 30]. In such an over-parametrized setting, the objective has multiple global minima, all of which minimize the training error, but many of which do not generalize well. Hence, just minimizing the training error is not sufficient for learning: picking the wrong global minimum can lead to bad generalization behavior. In such situations, generalization behavior depends implicitly on the algorithm used to minimize the training error. Different algorithmic choices for optimization, such as the initialization, update rules, learning rate, and stopping condition, will lead to different global minima with different generalization behavior [7, 12, 18]. For example, Neyshabur et al. [18] introduced Path-SGD, an optimization algorithm that is invariant to rescaling of weights, and showed better generalization behavior over SGD for both feedforward and recurrent neural networks [18, 22]. Keskar et al. [12] noticed that the solutions found by stochastic gradient descent with large batch sizes generalize worse than those found with smaller batch sizes, and Hardt et al. [10] discuss how stochastic gradient descent ensures uniform stability, thereby helping generalization for convex objectives.

What is the bias introduced by these algorithmic choices for neural networks? What ensures generalization in neural networks? What is the relevant notion of complexity or capacity control? As mentioned above, simply accounting for complexity in terms of the number of parameters, or any measure which is uniform across all functions representable by a given architecture, is not sufficient to explain the generalization ability of neural networks trained in practice. For linear models, norms and margin-based measures, and not the number of parameters, are commonly used for capacity control [5, 9, 25]. Similarly, norms such as the trace norm and max norm are considered sensible inductive biases in matrix factorization and are often more appropriate than parameter-counting measures such as the rank [27, 28]. In a similar spirit, Bartlett [3], Neyshabur et al. [20] and, in parallel to this work, Bartlett et al. [2] suggested different norms of network parameters to measure the capacity of neural networks. In a different line of work, Keskar et al. [12] suggested sharpness (robustness of the training error to perturbations in the parameters) as a complexity measure for neural networks. Others, including Langford and Caruana [13] and more recently Dziugaite and Roy [8], propose a PAC-Bayes analysis.
What makes a complexity measure appropriate for explaining generalization in deep learning? First, an appropriate complexity measure must be sufficient for ensuring generalization. Second, networks learned in practice should be of low complexity under this measure. This can happen if our optimization algorithms bias us toward lower complexity models under this measure and it is possible to capture real data using networks of low complexity. In particular, the complexity measure should help explain several recently observed empirical phenomena that are not explained by a uniform notion of complexity:

- It is possible to obtain zero training error on random labels using the same architecture for which training with real labels leads to good generalization [30]. We would expect the networks learned using real labels (which generalize well) to have much lower complexity, under the suggested measure, than those learned using random labels (which obviously do not generalize well).
- Increasing the number of hidden units, thereby increasing the number of parameters, can lead to a decrease in generalization error even when the training error does not decrease [20]. We would expect to see the complexity measure decrease as we increase the number of hidden units.
- When training the same architecture, with the same training set, using two different optimization methods (or different algorithmic or parameter choices), one method results in better generalization even though both lead to zero training error [18, 12]. We would expect to see a correlation between the complexity measure and generalization ability among zero-training-error models.

In this paper we examine complexity measures that have recently been suggested, or could be considered, in explaining generalization in deep learning. We evaluate the measures based on their ability to theoretically guarantee generalization, and their empirical ability to explain the above phenomena. Studying how each measure can guarantee generalization also lets us better understand how it should be computed and compared in order to explain the empirical phenomena. We investigate complexity measures including norms, robustness and sharpness of the network. We emphasize in our theoretical and empirical study the importance of relating the scale of the parameters and the scale of the output of the network, e.g. by relating norm and margin. In this light, we discuss how sharpness by itself is not sufficient for ensuring generalization, but can be combined, through PAC-Bayes analysis, with the norm of the weights to obtain an appropriate complexity measure. The role of sharpness in PAC-Bayesian analysis of neural networks was also recently noted by Dziugaite and Roy [8], who numerically optimized the overall PAC-Bayes bound; here we emphasize the distinct role of sharpness as a balance for norm.

Let $f_w(x)$ be the function computed by a $d$-layer feed-forward network with parameters $w$ and Rectified Linear Unit (ReLU) activations, $f_w(x) = W_d\,\phi(W_{d-1}\,\phi(\cdots\phi(W_1 x)))$, where $\phi(z) = \max\{0, z\}$. Let $h_i$ be the number of nodes in layer $i$, with $h_0 = n$. Therefore, for any layer $i$, we have $W_i \in \mathbb{R}^{h_i \times h_{i-1}}$.
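For concreteness, a minimal numpy sketch of this forward computation (the layer sizes and helper names here are illustrative, not taken from the experiments):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f_w(weights, x):
    """f_w(x) = W_d phi(W_{d-1} phi(... phi(W_1 x))) with phi = ReLU."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)        # hidden layers: linear map followed by ReLU
    return weights[-1] @ h     # output layer: linear only

# Example: a d = 3 layer network with h_0 = 4 inputs, hidden widths 5 and 5, and 3 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(3, 5))]
x = rng.normal(size=4)
print(f_w(weights, x))         # a vector of 3 output scores
```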
Given any input $x$, the loss of the prediction by the function $f_w$ is then given by $\ell(w, x)$. We also denote by $L(w)$ the expected loss and by $\hat{L}(w)$ the empirical loss over the training set. For any integer $k$, $[k]$ denotes the set $\{1, 2, \dots, k\}$. Finally, $\|\cdot\|_F$, $\|\cdot\|_2$, $\|\cdot\|_1$ and $\|\cdot\|_\infty$ denote the Frobenius norm, the spectral norm, the element-wise $\ell_1$-norm and the element-wise $\ell_\infty$-norm, respectively.

2 Generalization and Capacity Control in Deep Learning

In this section, we discuss complexity measures that have been suggested, or could be used, for capacity control in neural networks. We discuss advantages and weaknesses of each of these complexity measures and examine their ability to explain the observed generalization phenomena in deep learning. We consider the statistical capacity of a model class in terms of the number of examples required to ensure generalization, i.e. that the population (or test) error is close to the training error, even when minimizing the training error. This also roughly corresponds to the maximum number of examples on which one can obtain small training error even with random labels.

Given a model class $\mathcal{H}$, such as all the functions representable by some feedforward or convolutional network, one can consider the capacity of the entire class $\mathcal{H}$; this corresponds to learning with a uniform prior or notion of complexity over all models in the class. Alternatively, we can consider some complexity measure, which we take as a mapping that assigns a non-negative number to every hypothesis in the class, $M : \{\mathcal{H}, S\} \to \mathbb{R}^{+}$, where $S$ is the training set. It is then sufficient to consider the capacity of the restricted class $\mathcal{H}_{M,\alpha} = \{h : h \in \mathcal{H},\ M(h) \le \alpha\}$ for a given $\alpha \ge 0$. One can then ensure generalization of a learned hypothesis $h$ in terms of the capacity of $\mathcal{H}_{M,M(h)}$. Having a good hypothesis with low complexity, and being biased toward low complexity (in terms of $M$), can then be sufficient for learning, even if the capacity of the entire $\mathcal{H}$ is high. And if we are indeed relying on $M$ for ensuring generalization (and in particular, biasing toward models with lower complexity under $M$), we would expect a learned $h$ with lower value of $M(h)$ to generalize better. For some of the measures discussed, we allow $M$ to depend also on the training set. If this is done carefully, we can still ensure generalization for the restricted class $\mathcal{H}_{M,\alpha}$.

We will consider several possible complexity measures. For each candidate measure, we first investigate whether it is sufficient for generalization, and analyze the capacity of $\mathcal{H}_{M,\alpha}$. Understanding the capacity corresponding to different complexity measures also allows us to relate different measures to each other and provides guidance as to what and how we should measure: from the above discussion, it is clear that any monotone transformation of a complexity measure leads to an equivalent notion of complexity. Furthermore, complexity is meaningful only in the context of a specific hypothesis class $\mathcal{H}$, e.g. a specific architecture or network size. The capacity, as we consider it (in units of sample complexity), provides a yardstick by which to measure complexity. We should be clear, though, that we are vague regarding the scaling of the generalization error itself, and only consider the scaling in terms of complexity and model class; we thus obtain only a very crude yardstick, sufficient for investigating trends and relative phenomena, not a quantitative one.

2.1 Network Size

For any model, if its parameters have finite precision, its capacity is linear in the total number of parameters.
Even without making an assumption on the precision of parameters, the VC dimension of feedforward networks can be bounded in terms of the number of parameters $\dim(w)$ [1, 3, 6, 23]. In particular, Bartlett [4] and Harvey et al. [11], following Bartlett et al. [6], give the following tight (up to logarithmic factors) bound on the VC dimension, and hence capacity, of feedforward networks with ReLU activations:

$$\text{VC-dim} = \tilde{O}(d \cdot \dim(w)) \qquad (1)$$

In over-parametrized settings, where the number of parameters is larger than the number of samples, complexity measures that depend on the total number of parameters are too weak and cannot explain the generalization behavior. Neural networks used in practice often have significantly more parameters than samples, and indeed can perfectly fit even random labels, obviously without generalizing [30]. Moreover, measuring complexity in terms of the number of parameters cannot explain the reduction in generalization error as the number of hidden units increases [20] (see also Figure 4).

2.2 Norms and Margins

The capacity of linear predictors can be controlled independently of the number of parameters, e.g. through regularization of the $\ell_2$ norm. Similar norm-based complexity measures have also been established for feedforward neural networks with ReLU activations. For example, capacity can be bounded based on the $\ell_1$ norm of the weights of hidden units in each layer, and is proportional to $\prod_{i=1}^{d} \|W_i\|_{1,\infty}^2$, where $\|W_i\|_{1,\infty}$ is the maximum over hidden units in layer $i$ of the $\ell_1$ norm of incoming weights to the hidden unit [5]. More generally, Neyshabur et al. [19] considered group norms $\ell_{p,q}$, corresponding to the $\ell_q$ norm over hidden units of the $\ell_p$ norm of incoming weights to each hidden unit. This includes $\ell_{2,2}$, which is equivalent to the Frobenius norm, where the capacity of the network is proportional to $\prod_{i=1}^{d} \|W_i\|_F^2$. They further motivated a complexity measure that is invariant to node-wise rescaling reparametrizations (node-rescaling can be defined as a sequence of reparametrizations, each of which multiplies the incoming weights and divides the outgoing weights of a hidden unit by a positive scalar $\alpha$; the resulting network computes the same function as the network before the reparametrization), suggesting $\ell_p$ path norms: the minimum over all node-wise rescalings of $\prod_{i=1}^{d} \|W_i\|_{p,\infty}$, which equals the $\ell_p$ norm of a vector whose coordinates are each the product of weights along a path from an input node to an output node of the network. While preparing this manuscript, we became aware of parallel work by Bartlett et al. [2] that proves generalization bounds with capacity proportional to $\prod_{i=1}^{d} \|W_i\|_2^2 \left(\sum_{j=1}^{d} \left(\|W_j\|_1 / \|W_j\|_2\right)^{2/3}\right)^{3}$.
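To make the node-wise rescaling invariance above concrete, here is a small numpy sketch (a two-layer toy network; the function names are ours) contrasting the Frobenius-norm product, which changes under rescaling, with the squared $\ell_2$ path norm, which does not, while the computed function stays the same:

```python
import numpy as np

def f_w(weights, x):
    """Same ReLU forward pass as in the earlier sketch."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(0.0, W @ h)
    return weights[-1] @ h

def frobenius_product(weights):
    """prod_i ||W_i||_F^2, the Frobenius-norm capacity term."""
    return float(np.prod([np.sum(W ** 2) for W in weights]))

def l2_path_norm_sq(weights):
    """Squared l2 path norm: sum over input-output paths of the squared product
    of weights along the path, computed by propagating element-wise squares."""
    M = weights[0] ** 2
    for W in weights[1:]:
        M = (W ** 2) @ M
    return float(M.sum())

rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
x = rng.normal(size=4)

# Node-wise rescaling of hidden unit 0: multiply its incoming weights by alpha
# and divide its outgoing weights by alpha. Since ReLU is positively homogeneous,
# the computed function is unchanged.
alpha = 10.0
rescaled = [weights[0].copy(), weights[1].copy()]
rescaled[0][0, :] *= alpha
rescaled[1][:, 0] /= alpha

print(np.allclose(f_w(weights, x), f_w(rescaled, x)))           # True: same function
print(frobenius_product(weights), frobenius_product(rescaled))  # changes under rescaling
print(l2_path_norm_sq(weights), l2_path_norm_sq(rescaled))      # invariant to rescaling
```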
Capacity control in terms of norm, when using a zero/one loss (i.e. counting errors), requires us in addition to account for the scaling of the output of the network, as the loss is insensitive to this scaling while the norm only makes sense in the context of such scaling. For example, dividing all the weights by the same number will scale down the output of the network but does not change the 0/1 loss, and hence it is possible to get a network with arbitrarily small norm and the same 0/1 loss. Using a scale-sensitive loss, such as the cross-entropy loss, does address this issue (if the outputs are scaled down toward zero, the loss becomes trivially bad), and one can obtain generalization guarantees in terms of the norm and the cross-entropy loss.

However, we should be careful when comparing the norms of different models learned by minimizing the cross-entropy loss, in particular when the training error goes to zero. When the training error goes to zero, in order to push the cross-entropy loss (or any other positive loss that diminishes at infinity) to zero, the outputs of the network must go to infinity, and thus the norm of the weights (under any norm) must also go to infinity. This means that minimizing the cross-entropy loss will drive the norm toward infinity. In practice, the search is terminated at some finite time, resulting in a large, but finite, norm. But the value of this norm is mostly an indication of how far the optimization was allowed to progress; using a stricter stopping criterion (or a higher allowed number of iterations) would yield a higher norm. In particular, comparing the norms of models found using different optimization approaches is meaningless, as they would all go toward infinity. Instead, to meaningfully compare norms of the network, we should explicitly take into account the scaling of the outputs of the network. One way this can be done, when the training error is indeed zero, is to consider the margin of the predictions in addition to the norms of the parameters. We refer to the margin for a single data point $x$ as the difference between the score of the correct label and the maximum score of the other labels, i.e.

$$f_w(x)[y_{\text{true}}] - \max_{y \neq y_{\text{true}}} f_w(x)[y] \qquad (2)$$

In order to measure scale over an entire training set, one simple approach is to consider the hard margin, which is the minimum margin among all training points. However, this definition is very sensitive to extreme points as well as to the size of the training set. We consider instead a more robust notion that allows a small portion of data points to violate the margin. For a given training set and a small value $\epsilon > 0$, we define the margin $\gamma_{\text{margin}}$ as the lowest value of $\gamma$ such that $\lceil \epsilon m \rceil$ data points have margin lower than $\gamma$, where $m$ is the size of the training set. We found empirically that the qualitative and relative nature of our empirical results is almost unaffected by reasonable choices of $\epsilon$ (e.g. between 0.001 and 0.1).

The measures we investigate in this work, and their corresponding capacity bounds, are as follows:

- $\ell_2$ norm, with capacity proportional to $\frac{1}{\gamma_{\text{margin}}^2} \prod_{i=1}^{d} 4\|W_i\|_F^2$ [19].
- $\ell_1$-path norm, with capacity proportional to $\frac{1}{\gamma_{\text{margin}}^2} \Big(\sum_{j \in \prod_{k=0}^{d}[h_k]} \big|\prod_{i=1}^{d} 2\,W_i[j_i, j_{i-1}]\big|\Big)^{2}$.
- $\ell_2$-path norm, with capacity proportional to $\frac{1}{\gamma_{\text{margin}}^2} \sum_{j \in \prod_{k=0}^{d}[h_k]} \prod_{i=1}^{d} 4\,h_i\, W_i^2[j_i, j_{i-1}]$.
- spectral norm, with capacity proportional to $\frac{1}{\gamma_{\text{margin}}^2} \prod_{i=1}^{d} h_i \|W_i\|_2^2$.

Here $\prod_{k=0}^{d}[h_k]$ is the Cartesian product over the sets $[h_k]$. In the above bounds we have dropped the term that only depends on the norm of the input. The bounds based on the $\ell_2$-path norm and the spectral norm can be derived directly from those based on the $\ell_1$-path norm and the $\ell_2$ norm, respectively. Without further conditions on the weights, the exponential dependence on depth is tight, but the $4^d$ dependence might be loose [19]. As discussed at the beginning of this subsection, in parallel work Bartlett et al. [2] have improved the spectral bound. The above bounds indicate that capacity can be bounded in terms of either the $\ell_2$ norm or the $\ell_1$-path norm, independent of the number of parameters. The $\ell_2$-path norm's dependence on the number of hidden units in each layer is unavoidable. However, it is not clear whether a bound that depends only on the product of spectral norms is possible.
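As a rough sketch of how these scale-normalized measures might be computed (the function names are ours, class scores are assumed to be precomputed, the quantile is a close stand-in for the $\lceil \epsilon m \rceil$-th smallest margin, and depth- and width-dependent constants are dropped, as in Figure 1 below):

```python
import numpy as np

def margins(scores, labels):
    """Per-example margin (Eq. 2): score of the correct label minus the
    largest score among the other labels; scores has shape (m, n_classes)."""
    idx = np.arange(len(labels))
    correct = scores[idx, labels]
    others = scores.copy()
    others[idx, labels] = -np.inf
    return correct - others.max(axis=1)

def gamma_margin(scores, labels, eps=0.05):
    """gamma_margin: the eps-quantile of the per-example margins, so that
    roughly an eps fraction of the training points violate the margin."""
    return np.quantile(margins(scores, labels), eps)

def l2_measure(weights, gamma):
    """prod_i ||W_i||_F^2 divided by the squared margin."""
    return float(np.prod([np.sum(W ** 2) for W in weights])) / gamma ** 2

def spectral_measure(weights, gamma):
    """prod_i ||W_i||_2^2 (spectral norms) divided by the squared margin."""
    return float(np.prod([np.linalg.norm(W, 2) ** 2 for W in weights])) / gamma ** 2
```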
As an initial empirical investigation of the appropriateness of the different complexity measures, we compared the complexity (under each of the above measures) of models trained on true versus random labels. We would expect to see two phenomena: first, the complexity of models trained on true labels should be substantially lower than that of models trained on random labels, corresponding to their better generalization ability. Second, when training on random labels, we expect capacity to increase almost linearly with the number of training examples, since every extra example requires new capacity in order to fit its random label. However, when training on true labels we expect the model to capture the true functional dependence between input and output, so fitting more training examples should only require small increases in the capacity of the network. The results are reported in Figure 1. We indeed observe a gap between the complexity of models learned on real and random labels for all four norms, with the difference in the increase in capacity between true and random labels being most pronounced for the $\ell_2$ norm and $\ell_2$-path norm.

Figure 1: Comparing different complexity measures ($\ell_2$ norm, $\ell_1$-path norm, $\ell_2$-path norm, spectral norm) on a VGG network trained on subsets of the CIFAR10 dataset (10K-50K points) with true (blue line) or random (red line) labels. We plot the norm divided by the margin to avoid scaling issues (see Section 2), where for each complexity measure we drop the terms that only depend on depth or number of hidden units; e.g. for the $\ell_2$-path norm we plot $\gamma_{\text{margin}}^{-2} \sum_{j \in \prod_{k=0}^{d}[h_k]} \prod_{i=1}^{d} W_i^2[j_i, j_{i-1}]$. We also set the margin over a training set $S$ to be the 5th percentile of the margins of the data points in $S$, i.e. $\mathrm{Prc}_5\{f_w(x_i)[y_i] - \max_{y \neq y_i} f_w(x_i)[y] \mid (x_i, y_i) \in S\}$. In all experiments, the training error of the learned network is zero. The plots indicate that these measures can explain the generalization, as the complexity of the model learned with random labels is always higher than that of the one learned with true labels. Moreover, the gap between the complexity of models learned with true and random labels increases as we increase the size of the training set.

Lipschitz Continuity and Robustness

The measures/norms we discussed so far also control the Lipschitz constant of the network with respect to its input. Is the capacity control achieved through the bound on the Lipschitz constant? Is bounding the Lipschitz constant alone enough for generalization? In Appendix A, we show that the current bounds relying only on the Lipschitz constant have an exponential dependence on the input dimension, and therefore the capacity bounds discussed above are not merely a consequence of bounding the Lipschitz constant. In Section 3 we present further empirical investigations of the appropriateness of these complexity measures to explain other phenomena.

2.3 Sharpness

The notion of sharpness as a generalization measure was recently suggested by Keskar et al. [12] and corresponds to robustness to adversarial perturbations on the parameter space:
$$\zeta_\alpha(w) = \frac{\max_{|\nu_i| \le \alpha(|w_i|+1)} \hat{L}(f_{w+\nu}) - \hat{L}(f_w)}{1 + \hat{L}(f_w)} \approx \max_{|\nu_i| \le \alpha(|w_i|+1)} \hat{L}(f_{w+\nu}) - \hat{L}(f_w), \qquad (3)$$

where the training error $\hat{L}(f_w)$ is generally very small for neural networks in practice, so we can simply drop it from the denominator without a significant change in the sharpness value. As we will explain below, sharpness defined this way does not capture the generalization behavior. To see this, we first examine whether sharpness can predict the generalization behavior for networks trained on true vs. random labels. In the left plot of Figure 2, we plot the sharpness for networks trained on true vs. random labels. While sharpness correctly predicts the generalization behavior for bigger networks, for networks of smaller size, those trained on random labels have less sharpness than the ones trained on true labels. Furthermore, sharpness defined above depends on the scale of $w$ and can be artificially increased or decreased by changing the scale of the parameters. Therefore, sharpness alone is not sufficient to control the capacity of the network.

Figure 2: Sharpness and PAC-Bayes measures on a VGG network trained on subsets of the CIFAR10 dataset (10K-50K points) with true or random labels. In the left panel, we plot max sharpness, calculated as suggested by Keskar et al. [12], where the perturbation for parameter $w_i$ has magnitude $5 \times 10^{-4}(|w_i| + 1)$. The middle and right plots show the relationship between expected sharpness and KL divergence in the PAC-Bayes bound for true and random labels respectively, for training set sizes 5K, 10K, 30K and 50K. For the PAC-Bayes plots, each point corresponds to a choice of $\alpha$, where the standard deviation of the perturbation of parameter $w_i$ is $\alpha(10|w_i| + 1)$. The KL corresponding to each $\alpha$ is a weighted $\ell_2$ norm, where the weight for each parameter is the inverse of the standard deviation of its perturbation.
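A crude way to estimate the max sharpness of Eq. (3) is to sample perturbations at the corners of the allowed box and keep the worst one. This is only a sketch (Keskar et al. search for the worst perturbation with an optimizer), and it assumes a loss_fn that evaluates the empirical loss at a flat parameter vector:

```python
import numpy as np

def max_sharpness(loss_fn, w, alpha=5e-4, n_trials=200, rng=None):
    """Monte-Carlo estimate of Eq. (3): the largest increase in training loss
    over perturbations nu with |nu_i| <= alpha * (|w_i| + 1), approximated by
    sampling sign vectors at the corners of the allowed box."""
    rng = np.random.default_rng() if rng is None else rng
    base = loss_fn(w)
    bound = alpha * (np.abs(w) + 1.0)
    worst = -np.inf
    for _ in range(n_trials):
        nu = rng.choice([-1.0, 1.0], size=w.shape) * bound
        worst = max(worst, loss_fn(w + nu) - base)
    return worst / (1.0 + base)   # the denominator is ~1 when the training error is small
```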
Instead, we advocate viewing a related notion of expected sharpness in the context of the PAC-Bayesian framework. Viewed this way, it becomes clear that sharpness controls only one of two relevant terms, and must be balanced with some other measure such as a norm. Together, sharpness and norm do provide capacity control and can explain many of the observed phenomena. This connection between sharpness and the PAC-Bayes framework was also recently noted by Dziugaite and Roy [8].

The PAC-Bayesian framework [16, 17] provides guarantees on the expected error of a randomized predictor (hypothesis), drawn from a distribution denoted $Q$ and sometimes referred to as a posterior (although it need not be the Bayesian posterior), that depends on the training data. Let $f_w$ be any predictor (not necessarily a neural network) learned from the training data. We consider a distribution $Q$ over predictors with weights of the form $w + \nu$, where $w$ is a single predictor learned from the training set and $\nu$ is a random variable. Then, given a prior distribution $P$ over the hypotheses that is independent of the training data, with probability at least $1 - \delta$ over the draw of the training data, the expected error of $f_{w+\nu}$ can be bounded as follows [15]:

$$\mathbb{E}_\nu[L(f_{w+\nu})] \le \mathbb{E}_\nu[\hat{L}(f_{w+\nu})] + 4\sqrt{\frac{\mathrm{KL}\left(w + \nu \,\|\, P\right) + \ln\frac{2m}{\delta}}{m}} \qquad (4)$$

Substituting $\mathbb{E}_\nu[\hat{L}(f_{w+\nu})]$ with $\hat{L}(f_w) + \big(\mathbb{E}_\nu[\hat{L}(f_{w+\nu})] - \hat{L}(f_w)\big)$, we can see that the PAC-Bayes bound depends on two quantities: (i) the expected sharpness, and (ii) the Kullback-Leibler (KL) divergence to the prior $P$. The bound is valid for any prior distribution $P$, any perturbation distribution $\nu$, and any method of choosing $w$ dependent on the training set. A simple way to instantiate the bound is to set $P$ to be a zero-mean, $\sigma^2$-variance Gaussian distribution. Choosing the perturbation $\nu$ to also be a zero-mean spherical Gaussian with variance $\sigma^2$ in every direction yields the following guarantee (w.p. $1 - \delta$ over the training set):

$$\mathbb{E}_{\nu \sim N(0,\sigma)^n}[L(f_{w+\nu})] \le \hat{L}(f_w) + \underbrace{\mathbb{E}_{\nu \sim N(0,\sigma)^n}[\hat{L}(f_{w+\nu})] - \hat{L}(f_w)}_{\text{expected sharpness}} + 4\sqrt{\frac{1}{m}\Big(\underbrace{\frac{\|w\|_2^2}{2\sigma^2}}_{\text{KL}} + \ln\frac{2m}{\delta}\Big)} \qquad (5)$$

Another interesting approach is to set the variance of the perturbation of each parameter with respect to the magnitude of the parameter. For example, if $\sigma_i = \alpha|w_i| + \beta$, then the KL term in the above expression changes to $\sum_i \frac{w_i^2}{2\sigma_i^2}$. The above generalization guarantees give a clear way to think about capacity control jointly, in terms of both the expected sharpness and the norm, and, as we discussed earlier, indicate that sharpness by itself cannot control the capacity without considering the scaling. In the above generalization bound, norms and sharpness interact in a direct way depending on $\sigma$: increasing the norm by decreasing $\sigma$ causes a decrease in sharpness, and vice versa. It is therefore important to find the right balance between the norm and sharpness by choosing $\sigma$ appropriately in order to get a reasonable bound on the capacity. In our experiments we observe that looking at both these measures jointly indeed makes for a better predictor of the generalization error.

Figure 3: Experiments on global minima with poor generalization. For each experiment, a VGG network is trained on the union of a subset of CIFAR10 of size 10000 containing samples with true labels and another subset of CIFAR10 of varying size containing samples with random labels. The learned networks are all global minima for the objective function on the subset with true labels. The left plot shows the training and test errors as a function of the size of the random-label set. The middle plot shows the change in different measures ($\ell_2$ norm, spectral norm, $\ell_1$-path norm, $\ell_2$-path norm, sharpness) as a function of the size of the random-label set. The right plot shows the relationship between expected sharpness and KL in the PAC-Bayes bound for each of the experiments. Measures are calculated as explained in Figures 1 and 2.

As discussed earlier, Dziugaite and Roy [8] numerically optimize the overall PAC-Bayes generalization bound over a family of multivariate Gaussian distributions (different choices of perturbations and priors). Since the precise way the sharpness and KL-divergence are combined is not tight, certainly not in (5), nor in the more refined bound used by Dziugaite and Roy [8], we prefer to shy away from numerically optimizing the balance between sharpness and the KL-divergence.
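To make the two quantities in this trade-off concrete, the sketch below (same loss_fn convention as in the previous sketch; the Monte-Carlo estimate and function name are ours) sweeps the perturbation scale $\sigma$ and records the expected sharpness together with the KL term of (5), producing exactly the kind of (sharpness, KL) pairs from which the bi-criteria plots described next are built:

```python
import numpy as np

def pac_bayes_curve(loss_fn, w, sigmas, n_samples=20, rng=None):
    """For each perturbation scale sigma, estimate the expected sharpness
    E_nu[L_hat(w + nu)] - L_hat(w) with nu ~ N(0, sigma^2 I), together with
    the KL term ||w||_2^2 / (2 sigma^2) from Eq. (5). Returns a list of
    (expected_sharpness, kl) pairs, one bi-criteria point per sigma."""
    rng = np.random.default_rng() if rng is None else rng
    base = loss_fn(w)
    points = []
    for sigma in sigmas:
        losses = [loss_fn(w + sigma * rng.standard_normal(w.shape))
                  for _ in range(n_samples)]
        expected_sharpness = float(np.mean(losses)) - base
        kl = float(np.sum(w ** 2)) / (2.0 * sigma ** 2)
        points.append((expected_sharpness, kl))
    return points
```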
Instead, we propose using bi-criteria plots, where sharpness and KL-divergence are plotted against each other as we vary the perturbation variance. For example, in the center and right panels of Figure 2 we show such plots for networks trained on true and random labels respectively. We see that although sharpness by itself is not sufficient for explaining generalization in this setting (as we saw in the left panel), the bi-criteria plots are significantly lower for the true labels. Even more so, the change in the bi-criteria plot as we increase the number of samples is significantly larger with random labels, correctly capturing the required increase in capacity. For example, to achieve a fixed value of expected sharpness such as $\epsilon = 0.05$, networks trained with random labels require a higher norm compared to those trained with true labels. This behavior is in agreement with our earlier discussion: sharpness is sensitive to the scaling of the parameters and is not a capacity control measure on its own, as it can be artificially changed by scaling the network. However, combined with the norm, sharpness does seem to provide a capacity measure.

3 Empirical Investigation

In this section we investigate the ability of the discussed measures to explain the generalization phenomena discussed in the Introduction. We already saw in Figures 1 and 2 that these measures capture the difference in generalization behavior of models trained on true or random labels, including the increase in capacity as the sample size increases, and the difference in this increase between true and random labels.

Different Global Minima

Given different global minima of the training loss on the same training set and with the same model class, can these measures indicate which model is going to generalize better? In order to verify this property, we can calculate each measure on several different global minima and check whether lower values of the measure imply lower generalization error. In order to find different global minima of the training loss, we design an experiment where we force the optimization methods to converge to different global minima with varying generalization abilities, by forming a confusion set that includes samples with random labels. The optimization is done on the loss that includes examples from both the confusion set and the training set. Since deep learning models have very high capacity, the optimization over the union of the confusion set and the training set generally leads to a point with zero error over both sets, which is thus a global minimum for the training set. We randomly select a subset of the CIFAR10 dataset with 10000 data points as the training set, and our goal is to find networks that have zero error on this set but different generalization abilities on the test set.
To do this, we train networks on the union of the training set, with fixed size 10000, and confusion sets of varying sizes consisting of CIFAR10 samples with random labels; we then evaluate the learned models on an independent test set. The trained networks achieve zero training error, but as shown in Figure 3, the test error increases with the size of the confusion set. The middle panel of this figure suggests that the norm of the learned networks can indeed be predictive of their generalization behavior. However, we again observe that sharpness behaves poorly in these experiments. The right panel of this figure also suggests that the PAC-Bayes measure, of joint sharpness and KL divergence, behaves better: for a fixed expected sharpness, networks that have higher generalization error have higher norms.

Increasing Network Size

We also repeat the experiments conducted by Neyshabur et al. [20], where a fully connected feedforward network is trained on the MNIST dataset with varying numbers of hidden units, and we check the values of different complexity measures for each of the learned networks.

Figure 4: The generalization of a two-layer perceptron trained on MNIST with a varying number of hidden units. The left plot shows the training and test errors; the test error decreases as the size increases. The middle plot shows the measures for each of the trained networks. The right plot shows the relationship between sharpness and KL in the PAC-Bayes bound for each experiment. Measures are calculated as explained in Figures 1 and 2.

The left panel in Figure 4 shows the training and test error for this experiment. While 32 hidden units are enough to fit the training data, we observe that networks with more hidden units generalize better. Since the optimization is done without any explicit regularization, the only possible explanation for this phenomenon is the implicit regularization by the optimization algorithm. Therefore, we expect a sensible complexity measure to decrease beyond 32 hidden units and to behave similarly to the test error. Different measures are reported for the learned networks in the middle panel. It suggests that all margin/norm-based complexity measures decrease for larger networks up to 128 hidden units. For networks with more hidden units, the $\ell_2$ norm and $\ell_1$-path norm increase with the size of the network. The middle panel suggests that the $\ell_2$-path norm and the spectral norm can provide some explanation for this phenomenon. However, as we discussed in Section 2, the actual complexity measures based on the $\ell_2$-path norm and the spectral norm also depend on the number of hidden units, and taking this into account indicates that these measures cannot explain this phenomenon. In Appendix A, we discuss another complexity measure that also depends on the spectral norm, through a Lipschitz continuity or robustness argument. Even though this bound is very loose (exponential in the input dimension), it is monotonic with respect to the spectral norm reported in the plots. The right panel shows that the joint PAC-Bayes measure decreases for larger networks up to size 128, but fails to explain this generalization behavior for larger networks. This suggests that the measures considered so far are not sufficient to explain all the generalization phenomena observed in neural networks.

4 Conclusion

Learning with deep neural networks displays good generalization behavior in practice, a phenomenon that remains largely unexplained. In this paper we discussed different candidate complexity measures that might explain generalization in neural networks. We outlined a concrete methodology for investigating such measures, and reported on experiments studying how well the measures explain different phenomena. While there is no clear choice yet, some combination of expected sharpness and norms does seem to capture much of the generalization behavior of neural networks.
A major issue still left unresolved is how the choice of optimization algorithm biases such complexity to be low, and what is the precise relationship between optimization and implicit regularization.

References

[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[2] P. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.
[3] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, 1998.
[4] P. L. Bartlett. The impact of the nonlinearity on the VC-dimension of a deep network. Preprint, 2017.
[5] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.
[6] P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC dimension bounds for piecewise polynomial networks. Neural Computation, 10(8):2159-2173, 1998.
[7] P. Chaudhari, A. Choromanska, S. Soatto, and Y. LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
[8] G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
[9] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1-50, 2000.
[10] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, 2016.
[11] N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930, 2017.
[12] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[13] J. Langford and R. Caruana. (Not) bounding the true error. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, pages 809-816. MIT Press, 2001.
[14] U. von Luxburg and O. Bousquet. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5(Jun):669-695, 2004.
[15] D. McAllester. Simplified PAC-Bayesian margin bounds. Lecture Notes in Computer Science, pages 203-215, 2003.
[16] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230-234. ACM, 1998.
[17] D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164-170. ACM, 1999.
[18] B. Neyshabur, R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[19] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory (COLT), 2015.
[20] B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In Proceedings of the International Conference on Learning Representations, workshop track, 2015.
[21] B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro. Data-dependent path normalization in neural networks. In International Conference on Learning Representations, 2016.
[22] B. Neyshabur, Y. Wu, R. Salakhutdinov, and N. Srebro. Path-normalized optimization of recurrent neural networks with ReLU activations. In Advances in Neural Information Processing Systems, 2016.
[23] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[25] A. J. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11(4):637-649, 1998.
[26] J. Sokolic, R. Giryes, G. Sapiro, and M. R. Rodrigues. Generalization error of invariant classifiers. arXiv preprint arXiv:1610.04574, 2016.
[27] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pages 545-560. Springer Berlin Heidelberg, 2005.
[28] N. Srebro, J. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pages 1329-1336, 2005.
[29] H. Xu and S. Mannor. Robustness and generalization. Machine Learning, 86(3):391-423, 2012.
[30] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.