Collegial Ensembles

Etai Littwin, Ben Myara, Sima Sabah, Joshua Susskind, Shuangfei Zhai, Oren Golan
Apple Inc.
{elittwin, bmyara, sima, jsusskind, szhai, ogolan}@apple.com

Abstract

Modern neural network performance typically improves as model size increases. A recent line of research on the Neural Tangent Kernel (NTK) of over-parameterized networks indicates that the improvement with size increase is a product of a better conditioned loss landscape. In this work, we investigate a form of over-parameterization achieved through ensembling, where we define collegial ensembles (CE) as the aggregation of multiple independent models with identical architectures, trained as a single model. We show that the optimization dynamics of CE simplify dramatically when the number of models in the ensemble is large, resembling the dynamics of wide models, yet scale much more favorably. We use recent theoretical results on the finite width corrections of the NTK to perform efficient architecture search in a space of finite width CE that aims to either minimize capacity or maximize trainability under a set of constraints. The resulting ensembles can be efficiently implemented in practical architectures using group convolutions and block diagonal layers. Finally, we show how our framework can be used to analytically derive optimal group convolution modules originally found using expensive grid searches, without having to train a single model.

1 Introduction

Figure 1: Collegial ensemble: $m$ parallel paths of identical architecture (width $n$, depth $L$) whose outputs are aggregated.

Neural networks exhibit generalization behavior in the over-parameterized regime, a phenomenon that has been well observed in practice [23, 2, 18, 17]. Recent theoretical advancements have been made to try and understand the trainability and generalization properties of over-parameterized neural networks by observing their nearly convex behaviour at large width [13, 15]. For a wide neural network $F(x)$ with parameters $\theta$ and a convex loss $\mathcal{L}$, the parameter updates $-\mu \nabla_\theta \mathcal{L}$ can be represented in the space of functions as kernel gradient descent (GD) updates $-\mu \nabla_F \mathcal{L}$, with the Neural Tangent Kernel [10] (NTK) function $K(x, x_j) = \nabla_\theta F(x)^\top \nabla_\theta F(x_j)$ operating on $x, x_j$:

$$\dot{\theta} = -\mu \nabla_\theta \mathcal{L} \;\Longrightarrow\; \dot{F}(x) = -\mu \sum_j K(x, x_j)\, \nabla_{F(x_j)} \mathcal{L}. \qquad (1)$$

In the infinite width limit, the NTK remains constant during training, and GD reduces to kernel GD, rendering the optimization a convex problem. Hence, over-parameterized models in the large width sense both generalize and are simpler to optimize. In this work, we consider a different type of over-parameterization achieved through ensembling. We denote by collegial ensembles (CE) models where the output, either intermediate or final, is constructed from the aggregation of multiple identical pathways (see illustration in Fig. 1). We show that the training dynamics of CE simplify when the ensemble multiplicity is large, in a similar sense as wide models, yet at a much cheaper cost in terms of parameter count. Our results indicate the existence of a sweet spot for the correct ratio between width and multiplicity, where close to convex behaviour does not come at the expense of size. To find said sweet spot, we rely on recent findings on finite corrections of gradients [7, 16], though we use them in a more general manner than their original presentation. Specifically, we use the following assumption, stated informally:
Assumption 1 (Informal). Denote by $K$ the NTK at initialization of a fully connected ANN with hidden layer widths $n_1, \dots, n_L$ and depth $L$. There exist positive constants $\alpha, C$ such that:

$$\mathrm{Var}(K) \le C\left(e^{\alpha \sum_{l=1}^{L} n_l^{-1}} - 1\right), \qquad (2)$$

where the variance is measured on the individual entries of $K$, with respect to the random sampling of the weights.

In [7] and [16], Assumption 1 is proven for the on-diagonal entries of $K$ in fully connected architectures. In this work, we assume it holds in a more general sense for practical architectures, with different constants $\alpha, C$ (we refer the reader to Sec. 4 in the appendix for further empirical validation of this assumption). Since $\mathrm{Var}(K)$ diminishes with width, we hypothesize that a small width neural network behaves closer to its large width counterpart as $\mathrm{Var}(K)$ decreases. Notably, similar observations using activation and gradient variance as predictors of successful initializations were presented in [6, 8]. Motivated by this hypothesis, we formulate a primal-dual constrained optimization problem that aims to find an optimal CE with respect to the following objectives:

1. Primal (optimally smooth): minimize $\mathrm{Var}(K)$ for a fixed number of parameters.
2. Dual (optimally compact): minimize the number of parameters for a fixed $\mathrm{Var}(K)$.

The primal formulation seeks to find a CE which mimics the simplified dynamics of a wide model using a fixed budget of parameters, while the dual formulation seeks to minimize the number of parameters without suffering the optimization and performance consequences typically found in the narrow regime. Using both formulations, we find excellent agreement between our theoretical predictions and empirical results, on both small and large scale models. Our main contributions are the following:

1. We adapt the popular over-parameterization analysis to collegial ensembles (CE), in which the output units of multiple architecturally identical models are aggregated, scaled, and trained as a single model. For ensembles with $m$ models, each of width $n$, we show that under gradient flow and the L2 loss, the NTK remains close to its initial value up to a $O\big((mn)^{-1}\big)$ correction.
2. We formulate two optimization problems that seek to find optimal ensembles given a baseline architecture, in the primal and dual form, respectively. The optimally smooth ensemble achieves higher accuracy than the baseline, using the same budget of parameters. The optimally compact ensemble achieves a similar performance as the baseline, with significantly fewer trainable parameters.
3. We show how optimal grouping in ResNeXt [20] architectures can be derived and improved upon using our framework, without the need for an expensive search over hyper-parameters.

The remainder of the paper is structured as follows. In Sec. 2 we formally define collegial ensembles and present our results for their training dynamics in the large $m, n$ regime. In Sec. 3 we present our framework for architecture search in the space of collegial ensembles, and in Sec. 4 we present further experimental results on the CIFAR-10/CIFAR-100 [11] and ImageNet [4] datasets using large scale models.

2 Collegial Ensembles

We dedicate this section to formally defining collegial ensembles and analyzing their properties from the NTK perspective. Specifically, we would like to formalize the notion of the large ensemble regime, where the dynamic behaviour is reminiscent of wide single models.
In the following analysis we consider a simple feed forward fully connected neural network $F_n(x, \theta) : \mathbb{R}^{n_0} \to \mathbb{R}$, where the widths of the hidden layers are given by $n = \{n_l\}_{l=1}^{L} \in \mathbb{Z}_+^L$, adopting the common NTK parameterization:

$$F_n(x, \theta) = \sqrt{\tfrac{1}{n_L}}\, \theta_L \Big(\cdots \phi\Big(\sqrt{\tfrac{2}{n_1}}\, \theta_2\, \phi\Big(\sqrt{\tfrac{2}{n_0}}\, \theta_1 x\Big)\Big)\Big), \qquad (3)$$

where $x \in \mathbb{R}^{n_0}$ is the input, $\phi(\cdot)$ is the ReLU activation function, $\theta_l$ is the weight matrix associated with layer $l$, and $\theta$ denotes the concatenation of the weights of all layers, which are initialized iid from a standard normal distribution. Given a dataset $X = \{x_i\}_{i=1}^{N}$, the empirical NTK, denoted by $K_n(\theta) \in \mathbb{R}^{N \times N}$, is given by $K_n(\theta) = \nabla_\theta F_n(\theta)^\top \nabla_\theta F_n(\theta)$, where $F_n(\theta) = [F_n(x_1, \theta), \dots, F_n(x_N, \theta)]^\top$. Given the network $F_n$, we parameterize a space of ensembles $F_e(\Theta)$ by a multiplicity parameter $m \ge 1$ and a width vector $n$, such that:

$$F_e(\Theta) = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} F_n(\theta_j), \qquad K_e(\Theta) = \frac{1}{m} \sum_{j=1}^{m} K_n(\theta_j), \qquad (4)$$

where $\Theta = [\theta_1^\top \dots \theta_m^\top]^\top$ is the concatenation of the weights of all the ensemble members, and $K_e(\Theta) = \nabla_\Theta F_e^\top \nabla_\Theta F_e$ is the NTK of the ensemble. Plainly speaking, the network $F_n$ defines a space of ensembles given by the scaled sum of $m$ neural networks of the same $F_n$ architecture, with weights initialized independently from a normal distribution. Since each model $F_n(\theta_j)$ in the ensemble is architecturally equivalent to $F_n(\theta)$, it is easy to show that the infinite width kernel is equal for both models: $K_\infty = \lim_{\min(n) \to \infty} K_n(\theta) = \lim_{\min(n) \to \infty} K_e(\Theta)$. We define the Neural Mean Kernel (NMK) $\bar{K}_n$ as the mean of the empirical NTK:

$$\bar{K}_n = \mathbb{E}_\theta\big[K_n(\theta)\big]. \qquad (5)$$

The NMK is defined by an expectation over the normally distributed weights, and does not immediately equal the infinite width limit of the NTK given by $K_\infty$. The following lemma stems from the application of the strong law of large numbers (LLN):

Lemma 1 (Infinite ensemble). The following holds:

$$K_e(\Theta) \xrightarrow{a.s.} \bar{K}_n \quad \text{as} \quad m \to \infty. \qquad (6)$$

We refer the reader to Sec. 5 in the appendix for the full proof. While both $\bar{K}_n$ and $K_\infty$ do not depend on the weights, they are defined differently. $\bar{K}_n$ is defined by an expectation over the weights and depends on the width of the architecture, whereas $K_\infty$ is defined by an infinite width limit. However, empirical observations using Monte Carlo sampling, as presented in Fig. 2, show little to no dependence of the NMK on the widths $n$. Moreover, we empirically observe that $\bar{K}_n \approx K_\infty$ (note that similar observations have been reported in [10]).

Figure 2: Convergence of the ensemble NTK to the NMK when increasing the number of models in the ensemble. $\mathrm{NTK}[x_0, x_\gamma]$ is computed for both the diagonal and off-diagonal elements, $x_\gamma = [\cos(\gamma), \sin(\gamma)]$ for $\gamma \in [-\pi, \pi]$. The smoother surface as $m$ increases in (a) demonstrates the convergence of the NTK. The black line in (b), computed with 100 wider models, shows that the convergence is indeed to the NMK, and that the NTK mean does not depend on the width of the model. Each model in the ensemble is fully connected with $L = 4$ layers and $n = 1000$.
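To make Eq. 4 and the convergence in Lemma 1 concrete, the following minimal PyTorch sketch samples a small collegial ensemble of fully connected paths in the NTK parameterization and computes a Monte Carlo estimate of a single entry of $K_e(\Theta)$ for growing multiplicity $m$. The widths, inputs, and sample sizes are illustrative choices, not the exact setting of Fig. 2.

```python
import torch
import torch.nn as nn

class NTKLinear(nn.Module):
    """Linear layer in the NTK parameterization of Eq. 3: weights drawn from N(0, 1),
    with an explicit sqrt(c / fan_in) scaling applied in the forward pass."""
    def __init__(self, d_in, d_out, c=2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in))
        self.scale = (c / d_in) ** 0.5

    def forward(self, x):
        return self.scale * (x @ self.weight.t())

def make_path(n0, widths):
    """One fully connected path F_n(x, theta): R^{n0} -> R with ReLU activations."""
    dims = [n0] + list(widths)
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [NTKLinear(d_in, d_out, c=2.0), nn.ReLU()]
    layers.append(NTKLinear(dims[-1], 1, c=1.0))  # scalar output head
    return nn.Sequential(*layers)

def ensemble_ntk_entry(m, n0, widths, x1, x2):
    """One entry of K_e(Theta) = (1/m) * sum_j K_n(theta_j) (Eq. 4), with the
    weights theta_j of each path drawn independently."""
    total = 0.0
    for _ in range(m):
        path = make_path(n0, widths)
        params = list(path.parameters())
        g1 = torch.autograd.grad(path(x1).sum(), params)
        g2 = torch.autograd.grad(path(x2).sum(), params)
        total += sum((a * b).sum() for a, b in zip(g1, g2)).item()
    return total / m

# Lemma 1: as m grows, the entry concentrates around the NMK entry E_theta[K_n(x1, x2)].
torch.manual_seed(0)
x1, x2 = torch.randn(1, 2), torch.randn(1, 2)
for m in (1, 10, 100):
    print(m, ensemble_ntk_entry(m, n0=2, widths=[64, 64, 64], x1=x1, x2=x2))
```

Repeated runs of the last loop fluctuate less and cluster more tightly as $m$ grows, mirroring the smoothing visible in Fig. 2.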
We next show that under gradient flow, $K_e(\Theta)$ remains close to its initial value for the duration of the optimization process when $mn$ is large. Given the labels vector $y \in \mathbb{R}^N$ and the L2 cost at time $t$, $\mathcal{L}_t = \frac{1}{2}\|F_e(\Theta_t) - y\|_2^2$, under gradient flow with learning rate $\mu$ the weights evolve over continuous time according to $\dot{\Theta}_t = -\mu \nabla_{\Theta_t} \mathcal{L}_t$. The following theorem gives an asymptotic bound on the leading correction of the ensemble NTK over time. For simplicity, we state our result for constant width networks.

Theorem 1 (NTK evolution over time). Assuming analytic activation functions $\phi(\cdot)$ with bounded derivatives of all orders, and the L2 cost function, it holds for any finite $t$:

$$K_e(\Theta_t) - K_e(\Theta_0) = O_p\!\left(\frac{1}{mn}\right), \qquad (7)$$

where the notation $x_n = O_p(y_n)$ states that $x_n / y_n$ is stochastically bounded.

We refer the reader to Sec. 5 in the appendix for the full proof, as well as empirical validation of Thm. 1. Large collegial ensembles therefore refer to a regime where $mn$ is large. In the case of infinite multiplicity, the optimization dynamics reduce to kernel gradient descent with $\bar{K}_n$, rather than $K_\infty$, as the relevant kernel function. A striking implication arises from Theorem 1: the total parameter count in the ensemble is linear in $m$ and quadratic in $n$, hence it is much more parameter efficient to increase $m$ rather than $n$. Since the large regime depends on both $n$ and $m$, CE possess an inherent degree of flexibility in their practical design. As we show in the following section, this increased flexibility allows the design of both parameter efficient ensembles and better performing ensembles, when compared with the baseline model.

3 Efficient Ensembles

In this section, we use Assumption 1 to derive optimally smooth and optimally compact ensembles. We parameterize the space of ensembles using $m$, $n$, and a baseline architecture $F_{\bar{n}}$, where $\bar{n}$ is the width vector of the baseline model. Denoting by $\beta(n)$ the total parameter count in $F_n$, we define the parameter efficiency $\rho(m, n)$ as the ratio between the parameter count in the baseline model, $\beta_s \triangleq \beta(\bar{n})$, and the parameter count in the ensemble, $\beta_e \triangleq m\beta(n)$:

$$\rho(m, n) = \frac{\beta_s}{\beta_e} = \frac{\beta_s}{m\beta(n)}. \qquad (8)$$

Using Assumption 1, the variance of $K_n$ as a function of the widths $n$ and depth $L$ is given by:

$$\mathrm{Var}\big(K_n(\theta)\big) \propto \left(e^{\alpha \sum_{l=1}^{L} n_l^{-1}} - 1\right) \qquad (9)$$

for some value of $\alpha$.

Primal formulation: We cast the primal objective as an optimization problem, where we would like to find the parameters $m_p, n_p$ that correspond to the smoothest ensemble:

$$m_p, n_p = \arg\min_{m, n} \mathrm{Var}\big(K_e(\Theta)\big) \quad \text{s.t.} \quad \rho(m, n) = 1. \qquad (10)$$

Since the weights of each model are sampled independently, it holds that:

$$\mathrm{Var}\big(K_e(\Theta)\big) = \frac{1}{m^2} \sum_{j=1}^{m} \mathrm{Var}\big(K_n(\theta_j)\big) \propto \frac{e^{\alpha \sum_{l=1}^{L} n_l^{-1}} - 1}{m}. \qquad (11)$$

Equating the parameter count in both models to maintain a fixed efficiency, we can derive for each $n$ the number of models $m_p(n)$ in the primal formulation:

$$m_p(n) = \frac{\beta_s}{\beta(n)}, \qquad n_p = \arg\min_{n} \left[\frac{e^{\alpha \sum_{l=1}^{L} n_l^{-1}} - 1}{m_p(n)}\right]. \qquad (12)$$

The optimal parameters $n_p$ can be found using a grid search.

Figure 3: Primal and dual objective curves for a baseline feedforward fully connected network with $L = 6$ layers, $\bar{n} = 500$, and $n_0 = 748$. (a) The minimizer of the primal objective (red) is achieved for $n = 48$ and $m(48) \approx 30$. (b) The maximizer of the dual objective (red) is achieved for $n = 48$ and $m(48) \approx 12$, achieving an efficiency value $\rho(48) \approx 2.45$.

Dual formulation: The dual formulation can be cast as an optimization problem with the following objective:

$$m_d, n_d = \arg\max_{m, n} \rho(m, n) \quad \text{s.t.} \quad \mathrm{Var}\big(K_e(\Theta)\big) = \mathrm{Var}\big(K_{\bar{n}}(\theta)\big). \qquad (13)$$

Matching the smoothness criterion using Eq. 11, we can derive for each $n$ the number of models $m_d(n)$ in the dual formulation:

$$m_d(n) = \frac{e^{\alpha \sum_{l=1}^{L} n_l^{-1}} - 1}{e^{\alpha \sum_{l=1}^{L} \bar{n}_l^{-1}} - 1}, \qquad n_d = \arg\max_{n} \left[\frac{1}{m_d(n)} \cdot \frac{\beta_s}{\beta(n)}\right]. \qquad (14)$$

Ideally, we can find $m_d, n_d$ such that the total parameter count in the ensemble is considerably reduced. Equating the solutions of the primal and dual problems in Eq. 12 and Eq. 14, it is straightforward to verify that $n_d = n_p$, implying strong duality. Therefore, the primal and dual solutions differ only in the optimal multiplicity $m(n)$ of the ensemble.
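To make the grid searches in Eq. 12 and Eq. 14 concrete, the NumPy sketch below evaluates both objectives for a constant-width fully connected baseline (the setting of Fig. 3). The value of $\alpha$ is a placeholder that would normally come from the fitting procedure of Sec. 4.2 (Algorithm 1), and the parameter-count formula is a simplified approximation that ignores biases.

```python
import numpy as np

def primal_dual_curves(alpha, L, n_bar, n0, widths):
    """Evaluate the primal (Eq. 12) and dual (Eq. 14) objectives for a
    constant-width fully connected baseline with L layers of width n_bar
    and input dimension n0. `widths` is the grid of candidate widths n."""
    def n_params(n):
        # Approximate parameter count of one path of constant width n:
        # input layer + (L - 1) hidden-to-hidden layers + scalar output head.
        return n0 * n + (L - 1) * n * n + n

    def var_exponent(n):
        # Variance factor of Eq. 9 at constant width: e^{alpha * L / n} - 1.
        return np.expm1(alpha * L / n)

    beta_s = n_params(n_bar)                          # baseline parameter count
    results = []
    for n in widths:
        m_p = beta_s / n_params(n)                    # Eq. 12: multiplicity at fixed budget
        primal_obj = var_exponent(n) / m_p            # Var(K_e), up to a constant
        m_d = var_exponent(n) / var_exponent(n_bar)   # Eq. 14: multiplicity at fixed variance
        dual_obj = beta_s / (m_d * n_params(n))       # parameter efficiency rho(m_d, n)
        results.append((n, m_p, primal_obj, m_d, dual_obj))
    return results

# Example in the setting of Fig. 3 (alpha = 1.0 is an assumed placeholder value).
curves = primal_dual_curves(alpha=1.0, L=6, n_bar=500, n0=748, widths=range(4, 129))
n_p = min(curves, key=lambda r: r[2])[0]   # primal optimum
n_d = max(curves, key=lambda r: r[4])[0]   # dual optimum
print(f"n_p = {n_p}, n_d = {n_d}")         # the two optima coincide (strong duality)
```

Because the dual efficiency is, up to a constant, the reciprocal of the primal objective, the two argmins coincide, which is exactly the strong duality noted above.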
Both objectives are plotted in Fig. 3 using a feedforward fully connected network baseline with $L = 6$ and constant width $\bar{n} = 500$. Note that the efficient ensembles framework outlined in this section can readily be applied with different efficiency metrics. For instance, instead of using the parameter efficiency, one could consider the FLOPs efficiency (see Appendix Sec. 3).

4 Experiments

In the following section we conduct experiments to both validate our assumptions and evaluate our framework for efficient ensemble search. Starting with a toy example, we evaluate the effect of $\mathrm{Var}(K)$ and $\beta_e$ on test performance using fully connected models trained on the MNIST dataset. For the later experiments, we move to larger scale models trained on CIFAR-10/100 and the ImageNet [4] datasets.

4.1 Ablation Study: MNIST

An effective way to improve the performance of a neural network is to increase its size. Recent slim architectures, such as ResNeXt, demonstrate that it is possible to maintain accuracy while significantly reducing the parameter count. In Fig. 4 we provide further empirical evidence that the capacity of a network by itself is not a good predictor of performance, when decoupled from other factors. Specifically, we demonstrate a strong correlation between the empirical test error and the variance $\mathrm{Var}(K)$, while $\beta_e$ is kept constant (primal). On the other hand, increasing $\beta_e$ while keeping $\mathrm{Var}(K)$ constant (dual) does not improve the performance. For both experiments we use as a baseline a fully connected model with $L = 6$ layers and width $\bar{n} = 200$ for each layer. The width of a layer for each of the $m$ models in the ensemble is $n$. Each ensemble was trained on MNIST for 70 epochs with the Adam optimizer, and the accuracy was averaged over 100 repetitions.

Figure 4: Decoupling capacity and variance. The error (blue) is highly correlated with $\mathrm{Var}(K)$, and less sensitive to $\beta_e$. (a) and (b) show that the theoretical variance of the model correlates well with accuracy. (c) and (d) show the corresponding number of parameters $\beta_e$. Decreasing the variance (a) improves performance when $\beta_e$ is fixed (c). Increasing $\beta_e$ significantly (d) without reducing the variance (b) can cause degradation in performance due to overfitting.

4.2 Aggregated Residual Transformations

ResNeXt [20] introduces the concept of aggregated residual transformations utilizing group convolutions, which achieves a better parameter/capacity trade-off than its ResNet counterpart. In this paper, we view ResNeXt as a special case of CE, where the ensembling happens at the block level (a grouped-convolution sketch of such a block is given below). We hypothesize that the optimal blocks identified with CE will lead to an overall better model when stacked up, and by doing so we get the benefit of factorizing the design space of a network to modular levels. See Algorithm 2 for the detailed recipe. For these experiments, we use both the CIFAR-10/100 and the ImageNet datasets, following the implementation details described in [20]. We also report results on ImageNet64×64 and ImageNet32×32, datasets introduced in [3] that preserve the number of samples and classes of the original ImageNet1K [4], but downsample the image resolutions to 64×64 and 32×32 respectively (see Appendix Sec. 1).
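To make the block-level ensembling view concrete, the following PyTorch sketch shows one way a CE residual block with $m$ paths of width $n$ can be realized with a single grouped 3×3 convolution, in the spirit of the ResNeXt bottleneck. The exact layer layout, channel counts, and module name are illustrative assumptions rather than the exact block of [20].

```python
import torch
import torch.nn as nn

class CEBlock(nn.Module):
    """Illustrative collegial-ensemble residual block: m paths of width n,
    computed jointly via a grouped 3x3 convolution (ResNeXt-style bottleneck)."""

    def __init__(self, in_channels, out_channels, m, n, stride=1):
        super().__init__()
        width = m * n  # total width of the m parallel paths
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            # groups=m keeps the m paths independent inside the 3x3 convolution.
            nn.Conv2d(width, width, kernel_size=3, stride=stride,
                      padding=1, groups=m, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            # The 1x1 projection aggregates the path outputs back to out_channels.
            nn.Conv2d(width, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

# Example: a block in the "37x10d" configuration discussed in Sec. 4.2.1.
block = CEBlock(in_channels=256, out_channels=256, m=37, n=10)
y = block(torch.randn(2, 256, 32, 32))  # -> torch.Size([2, 256, 32, 32])
```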
Fitting α to a ResNet block. The first step in the optimization required for both the primal and dual objectives is to approximate the $\alpha$ parameter in Eq. 9. For convolutional layers, the coefficient multiplying $\alpha$ becomes $\sum_{l=1}^{L} \text{fan-in}_l^{-1}$, where $\text{fan-in}_l$ is the fan-in of layer $l$. Following Algorithm 1, we approximate the $\alpha$ corresponding to a ResNet block parametrized by $n = [n, n]$ as depicted in Fig. 5. We compute a Monte Carlo estimate of the second moment of one diagonal entry of the NTK matrix for increasing width $n \in \{1, \dots, 256\}$ and fixed $n_{in} = n_{out} = 256$. For simplicity, we fit the second moment normalized by the squared first moment, given by $e^{\alpha \sum_{l=1}^{L} \text{fan-in}_l^{-1}}$, which can easily be fitted with a first degree polynomial when considering its natural logarithm. We find $\alpha \approx 1.60$ and show the fitted second moment in Appendix Fig. 2.

4.2.1 CIFAR-10/100

Primal formulation. As a baseline architecture, we use a 1×128d ResNet, following the notation of [20], Section 5.1. Following Algorithm 2, we compute $m(n)$ for $n \in \{1, \dots, 128\}$ and find the optimum $n_p = 10$ and $m_p \approx 37$, after adjusting $m_p$ to match the number of parameters of the baseline and to account for rounding errors and different block topology approximations. As can be seen in Table 1a, the model achieving the primal optimum, 37×10d, attains a better test error on CIFAR-10/100 than the ResNeXt baseline 3×64d at a similar parameter budget. We also report results for a wider baseline, 8×64d, from [20] and show similar trends. The test error for multiple models sitting on the primal curve is depicted in Fig. 6a for CIFAR-100 and in Appendix Fig. 1 for CIFAR-10. Test errors are averaged over the last 10 epochs over 10 runs.

Dual formulation. Using the same ResNet base block as for the primal experiment, and thus the same fitted $\alpha$, we compute the optimal $n_d$ and $m_d$ maximizing the parameter efficiency curve $\rho$ and find the same $n$ as in the primal case, $n_d = 10$, with $m_d \approx 10$. The resulting ResNeXt network has 3.3 times fewer parameters than the baseline and achieves similar or slightly degraded performance on CIFAR-10/100, as shown in Table 1b. The efficiency curve $\rho$, depicted in red in Fig. 6b, is constructed using a single ResNet block topology and with non-integer values of $m$ as described above. Thus it only approximates the real parameter efficiency, explaining why some models in the close vicinity of the optimum have a slightly higher real efficiency, as can be seen in Table 1b. The test error for multiple models sitting on the dual curve is depicted in Fig. 6b for CIFAR-100 and in Appendix Fig. 1 for CIFAR-10.

4.2.2 ImageNet

Primal formulation. Following [20], we use ResNet-50 and ResNet-101 as baselines and report results in Table 2a. Our ResNet-50 based optimal model, 12×10d, obtains slightly better top-1 and top-5 errors than the baseline 32×4d reported in [20]. This is quite remarkable given that [20] converged to this architecture via an expensive grid search. Our ResNet-101 based optimal model achieves a significantly better top-1 and top-5 error than the ResNet-101 baseline, and a slightly higher top-1 and top-5 error than the ResNeXt baseline 32×4d.

Dual formulation. Using ResNet-50 and ResNet-101 as baselines, we find models that achieve similar top-1 and top-5 errors with significantly fewer parameters. Detailed results can be found in Table 2b.

Implementation details. We follow [20] for the implementation details of ResNet-50, ResNet-101 and their ResNeXt counterparts. We use SGD with 0.9 momentum and a batch size of 256 on 8 GPUs (32 per GPU). The weight decay is 0.0001 and the initial learning rate is 0.1.
We train the models for 100 epochs and divide the learning rate by a factor of 10 at epochs 30, 60 and 90. We use the same data normalization and augmentations as in [20], except for lighting, which we do not use.

5 Related Work

Various forms of multi-pathway neural architectures have surfaced over the years. In the seminal AlexNet architecture [12], group convolutions were used as a method to distribute training over multiple GPUs. More recently, group convolutions were popularized by ResNeXt [20], empirically demonstrating the benefit of aggregating multiple residual branches. In [24], a channel shuffle unit was introduced in order to promote information transfer between different groups. In [9] and [19], the connections between pre-defined sets of groups are learned in a differentiable manner, and in [25], grouping is achieved through pruning of full convolutional blocks. On a seemingly unrelated front, the theoretical study of wide neural networks has seen considerable progress recently. A number of papers [13, 21, 1, 5, 14] have followed in the footsteps of the original NTK paper [10]. In [13], it is shown that wide models of any architecture evolve as linear models, and in [21], a general framework for computing the NTK of a broad family of architectures is proposed. Finite width corrections to the NTK are derived in [7, 15, 5]. In this work, we extend the wide regime to the multiplicity dimension, and show two distinct regimes where different kernels reign. We then use finite corrections of the NTK to formulate two optimality criteria, and demonstrate their usefulness in predicting efficient and performant ensembles.

Algorithm 1: Fitting α per architecture
Input: Baseline module with $n = \{n_l\}_{l=1}^{L}$, a set of width ratios $\{r_j\}_{j=1}^{R}$, number of Monte Carlo samples $T$, samples $\{x_1, x_2\}$.
Output: Fitted α.
1: for $j = 1, \dots, R$ do
2:   Construct module $n_j = \{r_j n_l\}_{l=1}^{L}$.
3:   for $t = 1, \dots, T$ do
4:     Sample weights of $n_j$.
5:     Compute $K_{n_j}(x_1, x_2)$.
6:   Estimate $\mathrm{Var}\big(K_{n_j}\big)$.
7: Fit α using Eq. 9.

Algorithm 2: Find optimal CE module
Input: Baseline module with $n = \{n_l\}_{l=1}^{L}$, a set of width ratios $\{r_j\}_{j=1}^{R}$, $T$, samples $\{x_1, x_2\}$, $\beta_s$.
Output: Optimal $n^*$, $m(n^*)$.
1: Fit α using Algorithm 1.
2: if Primal then
3:   Find $n^* = n_p$ and $m^* = m_p$ using Eq. 12.
4: else if Dual then
5:   Find $n^* = n_d$ and $m^* = m_d$ using Eq. 14.
6: Correct $n^*$ and $m^*$ to the nearest integer values.

Figure 6: Test errors (in %) on CIFAR-100. A clear correlation is observed between the test error and the primal curve. (a) Different models sitting on the primal curve. The model achieving the primal minimum, 37×10d, achieves the best error. (b) Different models sitting on the dual curve. The model 10×10d achieves the dual maximum ($\rho \approx 3.3$) and a test error comparable to the baseline with 3.3 times fewer parameters. Results are reported over 10 runs and shown with standard error bars.
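As a complement to Algorithm 1 above, the sketch below shows one possible realization in PyTorch/NumPy: for each width ratio it estimates, by Monte Carlo sampling, the normalized second moment of a single NTK entry and fits $\alpha$ by least squares on its logarithm against $\sum_l n_l^{-1}$ (the normalized form of Eq. 9). The constructor `build_module` and the sample count are assumptions to be supplied by the user, and the fit is taken through the origin as a simplification.

```python
import numpy as np
import torch

def ntk_entry(model, x1, x2):
    """Single empirical NTK entry: <grad_theta f(x1), grad_theta f(x2)>."""
    params = [p for p in model.parameters() if p.requires_grad]
    g1 = torch.autograd.grad(model(x1).sum(), params)
    g2 = torch.autograd.grad(model(x2).sum(), params)
    return sum((a * b).sum() for a, b in zip(g1, g2)).item()

def fit_alpha(build_module, base_widths, width_ratios, x1, x2, num_samples=200):
    """Algorithm 1 (sketch): estimate E[K^2] / E[K]^2 for each width ratio and fit
    alpha in log(E[K^2] / E[K]^2) = alpha * sum_l 1/n_l.
    `build_module(widths)` is an assumed user-supplied constructor that returns a
    freshly initialized module with the requested layer widths (or fan-ins for
    convolutional layers)."""
    xs, ys = [], []
    for r in width_ratios:
        widths = [max(1, int(r * n)) for n in base_widths]
        k = np.array([ntk_entry(build_module(widths), x1, x2)   # re-sample weights
                      for _ in range(num_samples)])
        xs.append(sum(1.0 / n for n in widths))
        ys.append(np.log(np.mean(k ** 2) / np.mean(k) ** 2))
    xs, ys = np.array(xs), np.array(ys)
    # Least-squares fit through the origin: alpha = <x, y> / <x, x>.
    return float(xs @ ys / (xs @ xs))
```

A fit with an intercept, i.e. the full first degree polynomial described in Sec. 4.2, is a one-line change via `np.polyfit(xs, ys, 1)`.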
(a) Primal formulation:

Model               Params   C10    C100
1×128d [20]         13.8M    4.08   19.39
3×64d [20]          13.3M    3.96   18.57
37×10d (Ours)       13.7M    3.82   17.94
28×12d (Ours)       12.9M    3.74   18.05
Wide ResNet [22]    36.5M    4.17   20.50
1×226d              36.3M    3.88   18.36
8×64d [20]          34.4M    3.65   17.77
101×10d (Ours)      36.3M    3.65   17.44

(b) Dual formulation:

Model               ρ      Params   C10    C100
1×128d [20]         1      13.8M    4.75   20.74
10×10d (Ours)       3.3    4.22M    4.70   20.84
1×128d [20]         1      13.8M    4.08   19.39
10×10d (Ours)       3.3    4.22M    4.21   19.67
8×12d (Ours)        3.3    4.19M    4.20   19.58
Wide ResNet [22]    1      36.5M    4.17   20.50
1×226d              1      36.3M    3.88   18.36
2×64d [20]          4.0    9.13M    4.02   -
14×10d (Ours)       6.5    5.63M    4.01   19.04
6×25d (Ours)        5.0    7.26M    3.94   18.92
3×58d (Ours)        3.1    11.6M    3.87   18.75
2×98d (Ours)        2.1    17.4M    3.99   18.32

Table 1: Primal and dual results for ResNeXt-29 baselines on CIFAR-10/100. Test errors (in %) for CIFAR-10 (C10) and CIFAR-100 (C100) along with model sizes are reported. All models are variants of ResNeXt-29 except for Wide ResNet. (a) The optimally smooth models, 37×10d and 101×10d, surpass the baselines with the same number of parameters. (b) The optimally compact models only use a fraction of the parameters, yet attain similar or slightly degraded test errors. ρ indicates the parameter efficiency. Some baseline results are our reproductions of the cited architectures, and some of our models lie in the close vicinity of the optima rather than exactly at them. Results are averaged over 10 runs. Models were trained on 8 GPUs, except for a few that were trained on a single GPU.

6 Conclusion

Understanding the effects of model architecture on training and test performance is a longstanding goal in the deep learning community. In this work we analyzed collegial ensembling, a general technique used in practice where multiple functionally identical pathways are aggregated. We showed that collegial ensembles exhibit two distinct regimes of over-parameterization, defined by large width and large multiplicity, with two distinct kernels governing the dynamics of each. In between these two regimes, we introduced a framework for deriving optimal ensembles in the sense of minimum capacity or maximum trainability. Empirical results on practical models demonstrate the predictive power of our framework, paving the way for more principled architecture search algorithms.

(a) Primal formulation:

Model                         Params   Top-1 error   Top-5 error
ResNet-50, 1×64d [20]         25.6M    23.93         7.11
ResNeXt-50, 32×4d [20]        25.0M    22.42         6.36
ResNeXt-50, 12×10d (Ours)     25.8M    22.37         6.30
ResNeXt-50, 15×8d (Ours)      25.1M    22.39         6.38
ResNet-101, 1×64d [20]        44.5M    22.32         6.25
ResNeXt-101, 32×4d [20]       44.2M    21.01         5.72
ResNeXt-101, 12×10d (Ours)    45.5M    21.16         5.74
ResNeXt-101, 15×8d (Ours)     44.2M    21.20         5.76

(b) Dual formulation:

Model                         ρ      Params   Top-1 error   Top-5 error
ResNet-50, 1×64d [20]         1      25.6M    23.93         7.11
ResNeXt-50, 3×23d (Ours)      1.3    19.4M    23.80         7.11
ResNeXt-50, 4×16d (Ours)      1.5    17.1M    24.00         7.11
ResNet-101, 1×64d [20]        1      44.5M    22.32         6.25
ResNeXt-101, 3×23d (Ours)     1.4    32.9M    22.06         6.11
ResNeXt-101, 5×12d (Ours)     1.7    25.8M    22.30         6.27

Table 2: Primal and dual results for ResNet baselines on ImageNet. Top-1 and top-5 errors (in %) and model sizes are reported. ρ indicates the parameter efficiency. Some baseline results are our reproductions of the cited architectures, and some of our models lie in the close vicinity of the optimum. Results are averaged over 5 runs.

Broader Impact

We introduce Collegial Ensembles (CE), a principled framework for neural network architecture search. The paper makes theoretical contributions backed up by empirical results.
In addition, there is a practical message: in order to find a good model with a fixed budget of parameters, one would optimize the primal form of the CE objective, which will minimize the variance of the NTK (maximize smoothness) and thereby have a better chance of improving generalization performance. This essentially aims to make optimal use of a fixed model capacity. Instead, if the goal is to find a model that minimizes the number of parameters of a base architecture while preserving performance, one would optimize the dual form. This is a form of principled model pruning backed by theory and empirical analysis. Crucially, the CE method allows a user to make smart choices to construct neural network models without needing to train a large search space of models. Instead, they can plug a base architecture into a simple formula corresponding to either the primal or dual objective, and get back a reasonable model that only needs to be trained once. Given that the typical architecture selection process involves an expensive grid search or evolutionary strategy, advances like CE have the potential to reduce data center costs and developer time. In terms of scientific contributions, CE introduces theory and analysis that furthers the growing body of work on understanding deep learning in wide networks, including insights on the neural tangent kernel, which is becoming an important tool in the analysis of neural network training behavior. In addition, this work leads to insights into how capacity can be used judiciously to produce improved generalization on held-out data, an aspect that is still poorly understood in machine learning, a field that has traditionally focused on optimization. Finally, the empirical studies demonstrate that the theoretical framework can explain certain design choices behind go-to neural network architectures, including ResNeXt, which can lead to further principled advancements. Considering that our work is fundamental science, we are not advocating a particular application, and are instead trying to ensure that the practitioner has more control over how to effectively design models.

References

[1] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv, abs/1904.11955, 2019.
[2] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. 2017.
[3] Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv preprint arXiv:1707.08819, 2017.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[5] Ethan Dyer and Guy Gur-Ari. Asymptotics of wide networks from Feynman diagrams. arXiv, abs/1909.11304, 2020.
[6] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In NeurIPS, 2018.
[7] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. arXiv, abs/1909.05989, 2020.
[8] Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. arXiv, abs/1803.01719, 2018.
[9] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. CondenseNet: An efficient DenseNet using learned group convolutions.
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2752-2761, 2018.
[10] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, 2018.
[11] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
[13] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv, abs/1902.06720, 2019.
[14] Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S. Du, Wei Hu, Ruslan Salakhutdinov, and Sanjeev Arora. Enhanced convolutional neural tangent kernels. arXiv, abs/1911.00809, 2019.
[15] Etai Littwin and L. Wolf. On the convex behavior of deep neural networks in relation to the layers' width. arXiv, abs/2001.04878, 2019.
[16] Etai Littwin and L. Wolf. Residual tangent kernels. arXiv, abs/2001.10460, 2020.
[17] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. 2017.
[18] Matthew Olson, Abraham J. Wyner, and Richard Berk. Modern neural networks generalize on small data sets. In NeurIPS, 2018.
[19] Xijun Wang, Meina Kan, Shiguang Shan, and Xilin Chen. Fully learnable group convolution for acceleration of deep neural networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9041-9050, 2019.
[20] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987-5995, 2017.
[21] Greg Yang. Tensor Programs I: Wide feedforward or recurrent neural networks of any architecture are Gaussian processes. arXiv, abs/1910.12478, 2019.
[22] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv, abs/1605.07146, 2016.
[23] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2016.
[24] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6848-6856, 2018.
[25] Ruizhe Zhao and Wayne Luk. Efficient structured pruning and architecture searching for group convolution. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 1961-1970, 2019.