# Interpolation between Residual and Non-Residual Networks

Zonghan Yang¹, Yang Liu¹, Chenglong Bao², Zuoqiang Shi³

¹Institute for Artificial Intelligence, Beijing National Research Center for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University. ²Yau Mathematical Sciences Center, Tsinghua University. ³Department of Mathematical Sciences, Tsinghua University. Correspondence to: Chenglong Bao.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

**Abstract**

Although ordinary differential equations (ODEs) provide insights for designing network architectures, their relationship with non-residual convolutional neural networks (CNNs) is still unclear. In this paper, we present a novel ODE model by adding a damping term. It can be shown that the proposed model can recover both a ResNet and a CNN by adjusting an interpolation coefficient. Therefore, the damped ODE model provides a unified framework for the interpretation of residual and non-residual networks. The Lyapunov analysis reveals better stability of the proposed model, and thus yields robustness improvements of the learned networks. Experiments on a number of image classification benchmarks show that the proposed model substantially improves the accuracy of ResNet and ResNeXt over perturbed inputs from both stochastic noise and adversarial attack methods. Moreover, the loss landscape analysis demonstrates the improved robustness of our method along the attack direction.

## 1. Introduction

Although deep learning has achieved remarkable success in many machine learning tasks, the theory behind it has remained elusive. In recent years, developing new theories for deep learning has attracted increasing research interest. One important direction is to connect deep neural networks (DNNs) with differential equations (E, 2017), which have been extensively explored in mathematics. This line of research mainly contains three perspectives: solving high-dimensional differential equations with the help of DNNs due to their high expressive power (Han et al., 2018), discovering a differential equation that identifies the rule of the observed data based on the standard blocks of existing DNNs (Chen et al., 2018), and designing new architectures based on the numerical schemes of differential equations (Haber & Ruthotto, 2017; Lu et al., 2018; Zhu et al., 2018; Chang et al., 2018; Tao et al., 2018; Lu et al., 2019).

While each attempt in the above directions has strengthened the theoretical understanding of deep learning, many open questions remain. Among them, one important question is: what is the relationship between differential equations and non-residual convolutional neural networks? Most prior studies have focused on associating residual networks (ResNets) (He et al., 2016) with differential equations (Lu et al., 2018; Chen et al., 2018), not only because ResNets are relatively easy to optimize and achieve better classification accuracy than CNNs, but also because the skip connections among layers can be easily induced by the discretization of difference operators in differential equations. However, residual neural networks only account for a small fraction of the entire neural network family and have their own limitations. For example, Su et al.
(2018) indicate that ResNets are more sensitive to input perturbations than shallow CNNs. As a result, it is important to take a further step and investigate the relationship between differential equations and non-residual convolutional neural networks.

In this paper, we present a new ordinary differential equation (ODE) that interpolates between non-residual and residual CNNs. The ODE is controlled by an interpolation parameter λ ranging from 0 to +∞. It is equivalent to a residual network when λ is 0. On the contrary, the ODE amounts to a non-residual network when λ approaches +∞. Hence, our work provides a unified framework for understanding both non-residual and residual neural networks from the perspective of ODEs. The interpolation is able to improve over both non-residual and residual networks. Compared with non-residual networks, our ODE is much easier to optimize, especially for deep architectures. Compared with residual networks, we use the Lyapunov analysis to show that the interpolation results in improved robustness. To achieve the interpolation, a key difference of our work from existing methods is that we discretize integral operators instead of difference operators to obtain neural networks. Experiments on image classification benchmarks show that our approach substantially improves the accuracies of ResNet (He et al., 2016) and ResNeXt (Xie et al., 2017) when inputs are perturbed by both stochastic noise and adversarial attack methods. Furthermore, the visualization of the loss landscape of our model validates our Lyapunov analysis.

## 2. Related Work

Interpreting machine learning from the perspective of dynamical systems was first advocated by E (2017) and Haber & Ruthotto (2017). Recently, there have been many exciting works in this direction (Lu et al., 2018; Chen et al., 2018). We briefly review previous methods closely related to architecture design and model robustness.

**ODE-inspired architecture design.** Inspired by the relationship between ODEs and neural networks, Lu et al. (2018) use a linear multi-step method to improve the model capacity of ResNet-like networks. Zhu et al. (2018) utilize the Runge-Kutta method to interpret and improve DenseNets and CliqueNets. Chang et al. (2018) and Haber & Ruthotto (2017) leverage the leap-frog method to design novel reversible neural networks. Tao et al. (2018) propose to model non-local neural networks with non-local differential equations. Lu et al. (2019) design a novel Transformer-like architecture with the Strang-Marchuk splitting scheme. Chen et al. (2018) show that blocks of a neural network can be instantiated by arbitrary ODE solvers, in which parameters can be directly optimized with the adjoint sensitivity method. Dupont et al. (2019) improve the expressive power of a neural ODE by mitigating the trajectory intersection problem. Compared to the above works, our work provides a new ODE that unifies the analysis of residual and non-residual networks, which leads to an interpolated architecture. The experiments validate the advantages of the proposed method using this framework.

**ODE and model robustness.** A number of previous methods have also been proposed to improve adversarial robustness from the perspective of ODEs. Zhang et al. (2019b) propose to use a smaller step factor in the Euler method for ResNet. Reshniak & Webster (2019) utilize an implicit discretization scheme for ResNet. Hanshu et al.
(2019) propose to train a time-invariant neural ODE regularized by a steady-state loss. Liu et al. (2019) and Wang et al. (2019) introduce stochastic noise, inspired by stochastic differential equations, to enhance model robustness. The aforementioned works have concentrated on improving numerical discretization schemes or introducing stochasticity into ODE modeling to gain robustness. From the Lyapunov stability perspective, Chang et al. (2019) propose to use anti-symmetric weight matrices to parametrize an RNN, which enhances its ability to capture long-term dependencies. Zhang et al. (2019a) also accelerate adversarial training by recasting it as a differential game from an ODE perspective. In this work, we provide the Lyapunov analysis of the proposed ODE model, which shows the robustness improvements over ResNets in terms of local stability.

## 3. Methodology

In this section, we first introduce the background on the relationship between ODEs and ResNets, and then present the proposed ODE model and its stability analysis.

### 3.1. Background

Consider the ordinary differential equation:

$$\frac{dx(t)}{dt} = f(x(t), t), \quad x(0) = x_0, \tag{1}$$

where $x : [0, T] \to \mathbb{R}^d$ represents the state of the system. Given the discretization step $\Delta t$ and defining $t_n = n\Delta t$, the forward Euler method for Eq. (1) becomes

$$x(t_{n+1}) = x(t_n) + \Delta t \, f(x(t_n), t_n). \tag{2}$$

Letting $x_n = x(t_n)$ and $\Delta t = 1$, it recovers a residual block:

$$x_{n+1} = x_n + f_n(x_n), \tag{3}$$

where $f_n$ is the $n$-th layer operation in ResNets. Thus, the output of the network is equivalent to the evolution of the state variable at terminal time $T$, i.e., $x(T) = x_N$ is the output of the last layer in a ResNet, assuming $N = T/\Delta t$. The dynamic formulation of ResNets (see Eq. (1)) was initially established in (E, 2017). It inspired many interesting neural network architectures obtained by using different discretization methods for the first-order derivative in Eq. (1), such as the linear multi-step network (Lu et al., 2018) and the Runge-Kutta network (Zhu et al., 2018). From Eq. (1), the skip connection from the current step $x_n$ to the next step estimate $x_{n+1}$ always exists, no matter which kind of discretization is applied. Thus, a feedforward CNN without skip connections cannot be directly explained under this framework, which inspired the current work. In the next section, we introduce a damped ODE which bridges non-residual CNNs and ResNets.

### 3.2. The Proposed ODE Model

Based on the ODE formulation, we add a damping term to the model in Eq. (1), which leads to the following model:

$$\frac{dx(t)}{dt} = -\lambda x(t) + \rho(\lambda) f(x(t), t), \tag{4}$$

starting from $x(0) = x_0$. The constant $\lambda \in [0, +\infty)$ is called the interpolation coefficient and $\rho : [0, +\infty) \to [0, +\infty)$ is the weight function. The following proposition shows that the model in Eq. (4) has a closed-form solution.

**Proposition 3.1.** For any $T > 0$, the solution of the ODE (4) is

$$x(T) = e^{-\lambda T} x_0 + e^{-\lambda T} \rho(\lambda) \int_0^T e^{\lambda t} f(x(t), t) \, dt. \tag{5}$$

*Proof.* Multiplying both sides by $e^{\lambda t}$, we have

$$\frac{d\left(e^{\lambda t} x(t)\right)}{dt} = e^{\lambda t} \frac{dx(t)}{dt} + \lambda e^{\lambda t} x(t) = \rho(\lambda) e^{\lambda t} f(x(t), t).$$

Integrating over $[0, T]$ yields

$$e^{\lambda T} x(T) - x(0) = \rho(\lambda) \int_0^T e^{\lambda t} f(x(t), t) \, dt, \tag{6}$$

which induces the equality (5).

Following Proposition 3.1 and the notations in Section 3.1, the iterative formula for $x_n$ is

$$x_{n+1} = e^{-\lambda \Delta t} x_n + e^{-\lambda t_{n+1}} \rho(\lambda) \int_{t_n}^{t_{n+1}} e^{\lambda t} f(x(t), t) \, dt. \tag{7}$$

Assuming $f(x(t), t) = f(x_n, t_n)$ for all $t \in [t_n, t_{n+1})$, the iterative scheme in Eq. (7) reduces to

$$x_{n+1} = e^{-\lambda \Delta t} x_n + \frac{1 - e^{-\lambda \Delta t}}{\lambda} \rho(\lambda) f_n(x_n), \tag{8}$$

where $f_n(x_n) = f(x(t_n), t_n)$ denotes the convolutions in the $n$-th layer.
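As a concrete illustration (not part of the original paper), Eq. (8) can be simulated directly once a layer function is fixed. The snippet below is a minimal sketch with a toy layer function and arbitrary values for λ and Δt; `rho` stands in for the weight function ρ(λ).

```python
import numpy as np

def damped_step(x_n, f_n, lam, dt, rho):
    # One step of Eq. (8): x_{n+1} = e^{-lam*dt} x_n + (1 - e^{-lam*dt}) / lam * rho(lam) * f_n(x_n)
    decay = np.exp(-lam * dt)
    return decay * x_n + (1.0 - decay) / lam * rho(lam) * f_n(x_n)

# Toy example: a fixed nonlinear map stands in for the n-th layer's convolutions f_n.
f = lambda x: np.tanh(x)      # hypothetical layer function
rho = lambda lam: 1.0         # first choice of the weight function, rho(lam) = 1
x = np.array([1.0, -0.5])
for _ in range(5):            # propagate through a few "layers"
    x = damped_step(x, f, lam=0.3, dt=1.0, rho=rho)
print(x)
```

Taking λ → 0⁺ recovers the residual update x_n + Δt f_n(x_n), while taking λ large with the second weight function ρ(λ) = λ + 1 suppresses the skip term, mirroring Eq. (10) below.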
Now, we are ready to analyze Eq. (8) by choosing an appropriate weight function $\rho(\lambda)$. When the weight function $\rho(\lambda)$ satisfies

$$\rho(\lambda) \to 1 \ \text{as} \ \lambda \to 0^+ \quad \text{and} \quad \rho(\lambda) \sim \lambda \ \text{as} \ \lambda \to +\infty, \tag{9}$$

the output of the $n$-th layer tends to

$$x_{n+1} \to \begin{cases} x_n + f_n(x_n), & \text{if } \lambda \to 0^+, \\ \Delta t \, f_n(x_n), & \text{if } \lambda \to +\infty. \end{cases} \tag{10}$$

The above equation clearly shows that our model recovers ResNets when the interpolation parameter $\lambda$ approaches 0 and non-residual CNNs when it approaches $+\infty$. Therefore, the ODE in Eq. (4) bridges residual and non-residual CNNs and inspires the design of new network architectures.

### 3.3. Interpolated Network Design

Based on the unified ODE model in Eq. (4), two types of $\rho(\lambda)$ are chosen and the corresponding network architectures are proposed. Considering the case when $\lambda$ is small, we choose $\rho(\lambda) = 1$ and substitute the damping factor $e^{-\lambda \Delta t}$ by its first-order approximation:

$$e^{-\lambda \Delta t} \approx 1 - \lambda \Delta t. \tag{11}$$

Then, from Eq. (8), the output of the $n$-th layer is

$$x_{n+1} = (1 - \lambda \Delta t) x_n + \Delta t \, f_n(x_n). \tag{12}$$

To guarantee the positiveness of $\lambda$, we apply the ReLU function to the interpolation parameter $\lambda$ and absorb $\Delta t$ into it. Thus the $n$-th layer of the network is

$$x_{n+1} = (1 - \mathrm{ReLU}(\lambda_n)) x_n + f_n(x_n). \tag{13}$$

Each $\lambda_n$ is a trainable parameter for the $n$-th layer. It is known that the forward Euler discretization is stable when $\lambda \Delta t \in (0, 2)$, i.e., $\lambda \in (0, 2/\Delta t)$. As $\Delta t$ in a continuous-time dynamical system is small, the stable range of $\lambda$ can be viewed as approaching $(0, +\infty)$, which coincides with the boundary condition in Eq. (9).

The second choice of the weight function is $\rho(\lambda) = \lambda + 1$, which satisfies the assumption in Eq. (9). Using the same approximation as in Eq. (11), the scheme in Eq. (7) reduces to

$$x_{n+1} = (1 - \lambda \Delta t) x_n + (1 + \lambda \Delta t) f_n(x_n). \tag{14}$$

Similar to the first choice, the second interpolated network is given by

$$x_{n+1} = (1 - \mathrm{ReLU}(\lambda_n)) x_n + (1 + \mathrm{ReLU}(\lambda_n)) f_n(x_n). \tag{15}$$

It is easy to see that the interpolated networks in Eqs. (13) and (15) recover a non-residual CNN if $\lambda_n = 1$ and a residual network if $\lambda_n = 0$. As claimed in (He et al., 2016; Li et al., 2018), the identity shortcut connection helps mitigate the gradient vanishing problem and makes the loss landscape smoother. It is natural that when $\lambda_n \approx 0$ in Eq. (13), the optimization process of the interpolated model is much better than that of a non-residual CNN with the same number of layers.
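To make Eqs. (13) and (15) concrete, the following is a minimal PyTorch-style sketch of an interpolated block with a trainable scalar λ_n. The residual branch here is a placeholder two-convolution block rather than the exact pre-activation design used in the paper, so treat those details as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterpolatedBlock(nn.Module):
    """Sketch of Eq. (13) (use_rho=False) and Eq. (15) (use_rho=True)."""

    def __init__(self, channels, lam_init=0.2, use_rho=False):
        super().__init__()
        # Placeholder residual branch f_n; the paper builds on pre-activation ResNet blocks.
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )
        self.lam = nn.Parameter(torch.tensor(float(lam_init)))  # trainable lambda_n
        self.use_rho = use_rho

    def forward(self, x):
        lam = F.relu(self.lam)              # keep the interpolation coefficient non-negative
        out = (1.0 - lam) * x               # damped skip connection
        if self.use_rho:                    # Eq. (15): rho(lambda) = lambda + 1
            return out + (1.0 + lam) * self.f(x)
        return out + self.f(x)              # Eq. (13): rho(lambda) = 1
```

Setting λ_n = 0 recovers a standard residual block, while λ_n = 1 removes the skip connection, matching the interpolation described above.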
### 3.4. Interpolated Network Improves Robustness

Despite the high accuracy of ResNets, they are sensitive to small input perturbations due to the existence of adversarial examples. That is, for a fragile neural network, a minor perturbation can accumulate dramatically during layer propagation, resulting in a giant shift of the prediction. In this section, we show the improvement of the proposed interpolated networks over ResNets. The added damping term in our model weakens the amplitude of the solution of the original ODE. As a result, adding a damping term to the ODE model damps the error propagation process of a ResNet, which improves model robustness. In the following, we show the robustness improvement of our proposed networks by using the stability analysis of the ODE.

**Definition 3.2.** Let $x^*$ be an equilibrium point of the ODE model (1). Then $x^*$ is called asymptotically locally stable if there exists $\delta > 0$ such that $\lim_{t \to +\infty} \|x(t) - x^*\| = 0$ for all starting points $x_0$ with $\|x_0 - x^*\| \le \delta$.

Therefore, a perturbation around the equilibrium $x^*$ does not change the output of the network if $x^*$ is asymptotically locally stable. The next proposition from (Lyapunov, 1992; Chen, 2001) presents a classical method that checks the stability of a nonlinear system around the equilibrium when $f$ is time invariant. It is noted that this time-invariance assumption may hold as the learned filters in the deep layers converge.

**Proposition 3.3.** The equilibrium $x^*$ of the ODE model

$$\frac{dx(t)}{dt} = f(x(t)) \tag{16}$$

is asymptotically locally stable if and only if $\mathrm{Re}(\nu) < 0$ for every eigenvalue $\nu$ of $\nabla_x f(x^*)$, the Jacobian matrix of $f$ at $x^*$.

Considering the damped ODE

$$\frac{dx(t)}{dt} = -\lambda x(t) + \rho(\lambda) f(x(t)), \tag{17}$$

the Jacobian matrix at the equilibrium $x^*$ is $J_\lambda(x^*) = \rho(\lambda) \nabla_x f(x^*) - \lambda I$. Then, the eigenvalues $\hat{\nu}$ of $J_\lambda(x^*)$ are

$$\hat{\nu} = \rho(\lambda) \nu - \lambda, \tag{18}$$

where $\nu$ is an eigenvalue of $\nabla_x f(x^*)$. When $\rho(\lambda) = 1$, we have $\mathrm{Re}(\hat{\nu}) = \mathrm{Re}(\nu) - \lambda < \mathrm{Re}(\nu)$. By choosing a positive $\lambda$ properly (e.g., $\lambda$ larger than the largest $\mathrm{Re}(\nu)$ when $\rho(\lambda) = 1$), the ODE in Eq. (17) becomes asymptotically locally stable at $x^*$. In general, $\mathrm{Re}(\hat{\nu}) < \mathrm{Re}(\nu)$ if and only if $\rho(\lambda) < 1 + \lambda/\mathrm{Re}(\nu)$ (for $\mathrm{Re}(\nu) > 0$), which coincides with our assumption in Eq. (9). The above analysis shows that the stationary point of our proposed damped ODE model is more likely to be locally stable, which improves its robustness when the input is perturbed. In the experiments, our loss landscape visualization further validates this analysis.
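As a quick numerical illustration of Eq. (18) (not taken from the paper), one can check on a toy linear system how the damping term shifts the real parts of the Jacobian's eigenvalues to the left; the matrix below is arbitrary.

```python
import numpy as np

# Hypothetical Jacobian of f at an equilibrium, with eigenvalues 0.3 +/- 2i (unstable: Re > 0).
J = np.array([[0.3, -2.0],
              [2.0,  0.3]])

lam, rho = 0.5, 1.0                          # rho(lambda) = 1, as in Eq. (13)
J_damped = rho * J - lam * np.eye(2)         # Jacobian of the damped ODE, Eq. (17)

print(np.linalg.eigvals(J).real)             # [0.3, 0.3]   -> unstable equilibrium
print(np.linalg.eigvals(J_damped).real)      # [-0.2, -0.2] -> shifted left by lambda, now stable
```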
## 4. Experiments

We evaluate our proposed model on the CIFAR-10 and CIFAR-100 benchmarks, training and testing with the originally given datasets. Following (He et al., 2016), we adopt a simple data augmentation technique: padding 4 pixels on each side of the image and sampling a 32×32 crop from it or its horizontal flip. For the ResNet experiments, we select the pre-activated versions of ResNet-110 and ResNet-164 as baseline architectures. For the ResNeXt experiments, we select ResNeXt-29, 8×64d as the baseline from (Xie et al., 2017). We apply Eq. (13) to ResNet-110, ResNet-164, and ResNeXt, and refer to them as In-ResNet-110, In-ResNet-164, and In-ResNeXt. We also apply Eq. (15) to ResNet-110 and ResNet-164, referring to them as λ-In-ResNet-110, λ-In-ResNet-164, and λ-In-ResNeXt. The parameters λ_n of our interpolation models are initialized by randomly sampling from U[0.2, 0.25] in (λ-)In-ResNet-110 and (λ-)In-ResNeXt, and from U[0.1, 0.2] in In-ResNet-164. The initialization of the other parameters in ResNet and ResNeXt follows (He et al., 2016) and (Xie et al., 2017), respectively. For all of the experiments, we use the SGD optimizer with a batch size of 128. For the ResNet and (λ-)In-ResNet experiments, we train for 160 (300) epochs on the CIFAR-10 (CIFAR-100) benchmark; the learning rate starts at 0.1 and is divided by 10 at 80 (150) and 120 (225) epochs. We apply a weight decay of 1e-4 and a momentum of 0.9. For the ResNeXt and (λ-)In-ResNeXt experiments, the learning rate starts at 0.05 and is divided by 10 at 150 and 225 epochs. We apply a weight decay of 5e-4 and a momentum of 0.9.

We focus on two types of performance: optimization difficulty and model robustness. For optimization difficulty, we test our models on the CIFAR testing datasets. For model robustness, we evaluate the accuracy of our models over perturbed inputs, details of which are given in the next section. For each experiment, we conduct 5 runs with different random seeds and report the averaged result to reduce the impact of random variations. The standard deviations of the reported results can be found in Appendix D.

### 4.2. Measuring Robustness

In this section we introduce the two types of perturbation methods that we use: stochastic noise perturbations and adversarial attacks. For stochastic noise, we leverage the stochastic noise groups in the CIFAR-10-C and CIFAR-100-C datasets (Hendrycks & Dietterich, 2019) for testing. The four groups of stochastic noise are impulse noise, speckle noise, Gaussian noise, and shot noise. For adversarial attacks, we consider three classical methods: the Fast Gradient Sign Method (FGSM), the Iterated Fast Gradient Sign Method (IFGSM), and Projected Gradient Descent (PGD). For a given data point $(x, y)$:

FGSM induces the adversarial example $\tilde{x}$ by moving with step size $\epsilon$ along each component of the sign of the loss gradient, namely

$$\tilde{x} = x + \epsilon \, \mathrm{sign}(\nabla_x L(x, y)). \tag{19}$$

IFGSM performs FGSM with step size $\alpha$ and clips the perturbed images within $[x - \epsilon, x + \epsilon]$ iteratively, namely

$$x^{(m+1)} = \mathrm{Clip}_{x, \epsilon}\left\{ x^{(m)} + \alpha \, \mathrm{sign}\left(\nabla_x L(x^{(m)}, y)\right) \right\}, \tag{20}$$

where $m = 0, 1, \ldots, M-1$, $x^{(0)} = x$, and $x^{(M)}$ is the induced adversarial image. In our experiments, we set $\alpha = 2/255$ and the number of iterations $M = 20$. The PGD attack is the same as IFGSM, except that $x^{(0)} = x + \delta$ with $\delta \sim U[-\epsilon, \epsilon]$.
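For reference, the following is a minimal PyTorch sketch of how such perturbations can be generated for a classifier `model` under a cross-entropy loss. It is an illustrative implementation of the attacks described above, not the authors' evaluation code, and it assumes inputs scaled to [0, 1].

```python
import torch
import torch.nn.functional as F

def iterative_attack(model, x, y, eps, alpha=2 / 255, steps=20, random_start=False):
    """IFGSM (random_start=False) or PGD (random_start=True); FGSM is steps=1 with alpha=eps."""
    x_adv = x.clone().detach()
    if random_start:                                    # PGD: start from a random point in the eps-ball
        x_adv = x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()    # ascend the loss, as in Eq. (20)
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # Clip_{x, eps}
        x_adv = x_adv.clamp(0.0, 1.0)                   # keep a valid image (assumed input range)
    return x_adv.detach()
```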
### 4.3. Results

| Model | CIFAR-10 | CIFAR-100 |
|---|---|---|
| ResNet-110 | 93.58 | 72.73 |
| In-ResNet-110 | 92.28 | 70.55 |
| λ-In-ResNet-110 | 92.15 | 70.39 |
| ResNet-164 | 94.46 | 76.06 |
| In-ResNet-164 | 92.69 | 72.94 |
| λ-In-ResNet-164 | 92.55 | 73.22 |
| ResNeXt | 96.35 | 81.63 |
| In-ResNeXt | 96.48 | 81.64 |
| λ-In-ResNeXt | 96.22 | 81.29 |

*Table 2. Accuracy over the CIFAR-10 and CIFAR-100 testing data, representing the optimization difficulty of each model. All of the reported results are averaged over 5 runs.*

**Optimization difficulty.** Table 2 shows the results of In-ResNet-110 and In-ResNet-164 as well as the baselines on the CIFAR-10 and CIFAR-100 testing sets. On one hand, it can be seen that for (λ-)In-ResNet-110 and (λ-)In-ResNet-164, there is an accuracy drop within 3 percent compared with the ResNet baselines. This agrees with the fact that the interpolation model may be harder to optimize than ResNet. However, the performance of the interpolation models is still much better than that of deep non-residual CNN models.

| Benchmark | Model | Impulse | Speckle | Gaussian | Shot | Avg. |
|---|---|---|---|---|---|---|
| CIFAR-10-C | ResNet-110 | 56.38 | 59.12 | 43.82 | 55.47 | 53.70 |
| | In-ResNet-110 | 66.32 | 76.81 | 71.01 | 76.55 | 72.67 |
| | λ-In-ResNet-110 | 65.67 | 76.59 | 70.72 | 76.40 | 72.35 |
| | ResNet-164 | 60.88 | 61.77 | 45.66 | 57.75 | 56.51 |
| | In-ResNet-164 | 67.95 | 75.96 | 68.95 | 75.31 | 72.05 |
| | λ-In-ResNet-164 | 65.72 | 76.27 | 69.74 | 75.80 | 71.88 |
| | ResNeXt | 55.12 | 58.21 | 39.14 | 52.06 | 51.13 |
| | In-ResNeXt | 55.26 | 59.87 | 39.75 | 54.12 | 52.25 |
| | λ-In-ResNeXt | 51.27 | 57.20 | 37.23 | 51.25 | 49.24 |
| CIFAR-100-C | ResNet-110 | 25.36 | 29.69 | 20.16 | 27.81 | 25.76 |
| | In-ResNet-110 | 32.00 | 38.81 | 30.00 | 37.71 | 34.63 |
| | λ-In-ResNet-110 | 32.15 | 38.77 | 30.02 | 37.82 | 34.69 |
| | ResNet-164 | 27.55 | 30.90 | 20.40 | 28.97 | 26.95 |
| | In-ResNet-164 | 33.05 | 39.50 | 29.77 | 38.17 | 35.12 |
| | λ-In-ResNet-164 | 32.92 | 38.79 | 29.08 | 37.53 | 34.58 |
| | ResNeXt | 26.83 | 28.29 | 17.09 | 25.67 | 24.47 |
| | In-ResNeXt | 25.85 | 29.90 | 18.59 | 27.72 | 25.52 |
| | λ-In-ResNeXt | 25.33 | 31.18 | 19.88 | 28.75 | 26.29 |

*Table 1. Accuracy over the stochastic noise groups from the CIFAR-10-C and CIFAR-100-C datasets, corresponding to perturbed CIFAR-10 and CIFAR-100 images with four types of stochastic noise, respectively. All of the reported results are averaged over 5 runs.*

**Robustness against stochastic noise.** Table 1 shows the accuracies of all models over the perturbed CIFAR-10 and CIFAR-100 images with four types of stochastic noise. Our In-ResNet-110 and In-ResNet-164 models achieve substantial improvements over the ResNet-110 and ResNet-164 baselines. For perturbed CIFAR-10 images, the accuracies of (λ-)In-ResNet-110 and (λ-)In-ResNet-164 are over 15% higher than the ResNet-110 and ResNet-164 baselines on average. For perturbed CIFAR-100 images, the accuracies of (λ-)In-ResNet-110 and (λ-)In-ResNet-164 are over 5% higher than the ResNet-110 and ResNet-164 baselines on average. The In-ResNeXt model improves the accuracy on the perturbed images over ResNeXt as well.

| Benchmark | Model | FGSM 1/255 | FGSM 2/255 | FGSM 4/255 | IFGSM 1/255 | IFGSM 2/255 | IFGSM 4/255 | PGD 1/255 | PGD 2/255 | PGD 4/255 |
|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | ResNet-110 | 58.59 | 41.48 | 29.45 | 39.45 | 5.93 | 0.06 | 38.91 | 5.60 | 0.06 |
| | In-ResNet-110 | 71.97 | 55.24 | 38.26 | 65.70 | 32.05 | 5.14 | 65.66 | 31.74 | 5.01 |
| | λ-In-ResNet-110 | 71.06 | 50.84 | 30.05 | 65.93 | 30.72 | 3.52 | 65.81 | 30.45 | 3.41 |
| | ResNet-164 | 63.32 | 44.37 | 30.21 | 46.79 | 8.19 | 0.09 | 46.43 | 7.77 | 0.07 |
| | In-ResNet-164 | 70.88 | 51.84 | 32.81 | 64.34 | 27.43 | 2.27 | 64.20 | 26.95 | 2.15 |
| | λ-In-ResNet-164 | 70.01 | 50.53 | 31.77 | 63.33 | 26.50 | 2.01 | 63.19 | 26.04 | 1.91 |
| CIFAR-100 | ResNet-110 | 28.01 | 18.74 | 14.12 | 15.05 | 2.18 | 0.28 | 14.69 | 2.11 | 0.26 |
| | In-ResNet-110 | 32.24 | 18.74 | 11.84 | 23.44 | 4.92 | 0.55 | 23.22 | 4.81 | 0.53 |
| | λ-In-ResNet-110 | 32.79 | 18.40 | 11.24 | 24.17 | 5.17 | 0.53 | 24.03 | 5.00 | 0.51 |
| | ResNet-164 | 35.15 | 23.58 | 17.04 | 21.23 | 3.45 | 0.29 | 20.78 | 3.31 | 0.22 |
| | In-ResNet-164 | 37.21 | 22.30 | 13.93 | 28.05 | 6.59 | 0.73 | 27.75 | 6.34 | 0.67 |
| | λ-In-ResNet-164 | 37.37 | 22.50 | 13.94 | 28.25 | 6.64 | 0.69 | 28.03 | 6.46 | 0.64 |

*Table 3. Accuracy over perturbed CIFAR-10 and CIFAR-100 images from FGSM, IFGSM, and PGD adversarial attacks with different attack radii. All of the reported results are averaged over 5 runs.*

**Robustness against adversarial attacks.** Table 3 shows the accuracies of all models over perturbed CIFAR-10 and CIFAR-100 images from FGSM, IFGSM, and PGD attacks at attack radii of 1/255, 2/255, and 4/255. Most of the robustness results of our (λ-)In-ResNet-110 and (λ-)In-ResNet-164 models are higher than those of the ResNet-110 and ResNet-164 models, which is empirically consistent with our Lyapunov analysis. Especially on the CIFAR-10 benchmark, our In-ResNet-110 and In-ResNet-164 models obtain significant robustness improvements against the strong IFGSM and PGD attacks at radii of 1/255 and 2/255.

*Figure 1. Learned interpolation coefficients in the In-ResNet-110 and In-ResNet-164 models trained on the CIFAR-10 benchmark.*

**Learned interpolation coefficients.** To get a better understanding of the interpolation model, we plot the interpolation coefficients {ReLU(λ_n)} of the In-ResNet-110 and In-ResNet-164 models trained on the CIFAR-10 benchmark. As shown in Figure 1, most of the interpolation coefficients lie within the range [0, 1], suggesting an interpolating behaviour. According to Eq. (13), interpolation coefficients lying within [1, 2] represent negative skip connections with an absolute weight of less than 1. Very few of the interpolation coefficients are larger than 2, which is in line with the stability range of the forward Euler scheme. In general, 79.6% (72.2%) of the λ_n's in In-ResNet-110 (In-ResNet-164) are larger than 0.01, which accounts for the significant robustness improvement. More visualizations of learned interpolation coefficients can be found in Appendix A.
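Statistics like the ones above can be reproduced by reading the learned λ_n values back from a trained model. The snippet below is a sketch that assumes each interpolated block stores its coefficient in a scalar parameter whose name ends with `lam` (a hypothetical naming convention, not the authors' code).

```python
import torch.nn.functional as F

def interpolation_coefficients(model):
    # Collect ReLU(lambda_n) from every interpolated block of a trained model.
    return [float(F.relu(p.detach())) for name, p in model.named_parameters()
            if name.endswith("lam")]

# Fraction of coefficients above 0.01, as discussed in the text:
# coeffs = interpolation_coefficients(model); sum(c > 0.01 for c in coeffs) / len(coeffs)
```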
**Loss landscape analysis.** As suggested by the Lyapunov analysis, the robustness improvement is theoretically attributed to the damped models enjoying more locally stable points than the original ones. To further verify this, we visualize the loss landscapes of the In-ResNet-110 and ResNet-110 models trained on the CIFAR-10 benchmark along the attack direction. For an instance $(x, y)$, we plot the loss function $L(x, y)$ along the FGSM attack direction. We also select a random direction orthogonal to the FGSM attack direction and plot the model predictions at each grid point. The unit of each axis in the figures is at the scale of 1/255. To better analyze model robustness, we select the data instance from the CIFAR-10-C dataset, namely $(x', y)$, where $x'$ is $x$ with injected stochastic noise.

Figure 2 illustrates the loss landscapes of the ResNet-110 and In-ResNet-110 models along the FGSM attack direction. We select two input data instances: for Figure 2-{(a)-(c)}, the input is the 3rd image in the shot noise group of the CIFAR-10-C dataset, whose ground-truth label is ship; for Figure 2-{(d)-(f)}, the input is the 8th image in the speckle noise group of the CIFAR-10-C dataset, whose ground-truth label is horse. For the first input example, ResNet-110 and In-ResNet-110 both make the correct prediction; for the second input example, they both make the wrong one. It can be seen that the added damping term has damped the loss landscape along the FGSM attack direction, resulting in a much weaker amplitude (the first example), or has even turned the amplifying loss landscape of ResNet-110 into a damping one for In-ResNet-110 (the second example). Whether ResNet-110 and In-ResNet-110 both make the correct prediction or the wrong one, it is clear that the In-ResNet-110 model enjoys better robustness than ResNet-110, which agrees with our Lyapunov analysis that the damping term has introduced more locally stable points.

| Model | Acc. | noise | FGSM | IFGSM | PGD |
|---|---|---|---|---|---|
| ResNet-110 | 93.58 | 53.70 | 41.48 | 5.93 | 5.60 |
| In-ResNet-110 | 92.28 | 72.67 | 55.24 | 32.05 | 31.74 |
| In-ResNet-sig-110 | 93.49 | 55.04 | 44.65 | 6.29 | 5.94 |
| In-ResNet-gating-110 | 93.46 | 54.53 | 41.25 | 5.65 | 5.33 |
| In-ResNet-gating-sig-110 | 90.68 | 68.04 | 46.17 | 21.89 | 21.65 |

*Table 4. Accuracy and robustness of the In-ResNet-110, In-ResNet-sig-110, In-ResNet-gating-110, and In-ResNet-gating-sig-110 models, as well as the ResNet-110 baseline, on the CIFAR-10 benchmark. Acc. denotes the accuracy over the CIFAR-10 testing set. noise denotes the average accuracy over the four stochastic noise groups from CIFAR-10-C. FGSM, IFGSM, and PGD represent accuracy under the corresponding attacks at an attack radius of 2/255. All of the reported results are averaged over 5 runs.*

### 4.4. Comparison among In-ResNet Variants

While Eq. (13) depicts the In-ResNet structure, in this section we propose several variants of In-ResNet and compare their performance. To facilitate the discussion, In-ResNet can be written in the general form:

$$x_{n+1} = (1 - \mathrm{act}(d(x_n))) x_n + \Delta t \, f_n(x_n), \tag{21}$$

where $d(x_n)$ is the function determining the interpolation coefficients and $\mathrm{act}$ is the activation function. For In-ResNet, $d(x_n)$ is a learnable scalar parameter $\lambda_n$ and $\mathrm{act}$ is the ReLU function. We propose several In-ResNet variants (a sketch of the general form follows this list):

- $d(x_n) = \lambda_n$, $\mathrm{act} = \mathrm{sigmoid}$: we replace the activation function with the sigmoid, which restricts the interpolation coefficients to lie within [0, 1] and thus guarantees that the learned model is an interpolation. We refer to it as In-ResNet-sig.
- $d(x_n) = W_d x_n + b_d$, $\mathrm{act} = \mathrm{ReLU}$: we let the learnable scalar parameters be determined by a linear transformation of the input $x_n$, yielding a gating mechanism. We refer to it as In-ResNet-gating.
- $d(x_n) = W_d x_n + b_d$, $\mathrm{act} = \mathrm{sigmoid}$: based on the previous variant, we further replace the activation function with the sigmoid. It is noteworthy that this variant is the shortcut-only gating mechanism discussed in (He et al., 2016). We refer to it as In-ResNet-gating-sig.
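The variants can be expressed compactly under the general form of Eq. (21) (with Δt absorbed, as in Eq. (13)). The sketch below assumes the residual branch f_n is given, and it reads d(x_n) = W_d x_n + b_d as one scalar gate per example computed from globally average-pooled features; that pooling choice is our assumption, since the paper does not spell out this detail.

```python
import torch
import torch.nn as nn

class InResNetVariantBlock(nn.Module):
    """x_{n+1} = (1 - act(d(x_n))) * x_n + f_n(x_n): the general form of Eq. (21)."""

    def __init__(self, f_n, channels, gating=False, sigmoid=False):
        super().__init__()
        self.f_n = f_n                                   # the residual branch f_n (given)
        self.gating, self.sigmoid = gating, sigmoid
        if gating:
            # Assumed reading of d(x) = W_d x + b_d: a per-example scalar gate computed from
            # globally average-pooled features.
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.linear = nn.Linear(channels, 1)
        else:
            self.lam = nn.Parameter(torch.tensor(0.2))   # scalar lambda_n

    def forward(self, x):
        if self.gating:
            d = self.linear(self.pool(x).flatten(1)).view(-1, 1, 1, 1)
        else:
            d = self.lam
        coeff = torch.sigmoid(d) if self.sigmoid else torch.relu(d)
        return (1.0 - coeff) * x + self.f_n(x)
```

The four settings (gating=False, sigmoid=False), (False, True), (True, False), and (True, True) correspond to In-ResNet, In-ResNet-sig, In-ResNet-gating, and In-ResNet-gating-sig, respectively.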
We use In-ResNet-110 as the basic In-ResNet model and experiment on the CIFAR-10 benchmark to compare their performance. The accuracy and robustness results, averaged over 5 runs, are shown in Table 4. We carefully tune the initialization intervals and report the model with the largest sum of the accuracy over the CIFAR-10 testing set and the noise groups in the CIFAR-10-C dataset. It can be seen that In-ResNet-110 leads to the largest robustness improvements over the ResNet-110 baseline, with a relatively small accuracy drop. The In-ResNet-sig-110 model achieves a better accuracy result than In-ResNet-110; however, its robustness improvements are marginal. This is because the learned interpolation coefficients in In-ResNet-sig-110 are close to 0, resulting in nearly identity skip connections. Similarly, the performance of In-ResNet-gating-110 is very close to the ResNet-110 baseline due to the degeneration of its damped skip connections. The In-ResNet-gating-sig-110 model also improves over the ResNet-110 baseline by a large margin in terms of robustness. The improvement, however, is less significant than that of our In-ResNet-110 model. The accuracy of the In-ResNet-gating-sig-110 model also lags behind In-ResNet-110, which may be attributed to the extra optimization difficulty introduced by the gating mechanism.

### 4.5. Trade-off between Optimization and Robustness

As shown in Table 2, while (λ-)In-ResNet enjoys better robustness, it suffers from optimization difficulty: our (λ-)In-ResNet models incur an accuracy drop of around 2%. In this section, we show that the initialization of λ_n is of great importance to the optimization process. We use the In-ResNet-110 and λ-In-ResNet-110 models trained on the CIFAR-10 benchmark as the basic models, initializing λ_n by randomly sampling from U[x, y]. For the basic models, we have U[x, y] = U[0.2, 0.25]. We also try the following initialization schemes: U[x, y] = U[0, 0.1], U[0.1, 0.2], U[0.2, 0.25], U[0.25, 0.3], and U[0.3, 0.4]. The accuracy and robustness results, averaged over 5 runs, are shown in Table 5. From the experimental results, we can see that the performance of (λ-)In-ResNet is sensitive to the initialization of λ_n. On one hand, as the initialization becomes larger, model robustness goes up. This agrees with our Lyapunov analysis, as larger initializations of λ_n tend to help the model converge to larger final λ_n values, yielding larger damping terms and better robustness. On the other hand, larger initializations lead to worse accuracy results. Especially for U[0.30, 0.40], 1 (3) out of 5 runs of In-ResNet-110 (λ-In-ResNet-110) fail to optimize, with a final accuracy of 10%. This can be interpreted as the damped shortcuts hampering information propagation and leading to optimization difficulty (He et al., 2016). More results on the CIFAR-100 benchmark can be found in Appendix B.

*Figure 2. Panels (a)-(c)/(d)-(f): the input data instance is the 3rd/8th image in the shot/speckle noise group of the CIFAR-10-C dataset, whose ground-truth label is ship/horse; ResNet-110 and In-ResNet-110 both make the correct/wrong prediction. (a) and (d) depict the loss landscapes of ResNet-110 and In-ResNet-110 along the FGSM attack direction. (b) and (e) / (c) and (f) illustrate the model predictions of ResNet-110 / In-ResNet-110 at each grid point determined by the FGSM attack direction and a random orthogonal direction.*
| Model | Initialization | Acc. | noise | FGSM | IFGSM | PGD |
|---|---|---|---|---|---|---|
| ResNet | - | 93.58 | 53.70 | 41.48 | 5.93 | 5.60 |
| In-ResNet | U[0.00, 0.10] | 93.51 | 55.15 | 46.74 | 8.39 | 7.96 |
| | U[0.10, 0.20] | 93.25 | 62.88 | 49.58 | 16.89 | 16.46 |
| | U[0.20, 0.25] | 92.28 | 72.67 | 55.24 | 32.05 | 31.74 |
| | U[0.25, 0.30] | 91.63 | 76.20 | 55.79 | 36.53 | 36.28 |
| | U[0.30, 0.40] | 90.62 | 79.35 | 55.95 | 41.07 | 40.84 |
| λ-In-ResNet | U[0.00, 0.10] | 93.41 | 54.18 | 42.28 | 6.78 | 6.48 |
| | U[0.10, 0.20] | 92.86 | 63.58 | 46.07 | 16.99 | 16.60 |
| | U[0.20, 0.25] | 92.15 | 72.35 | 50.84 | 30.72 | 30.45 |
| | U[0.25, 0.30] | 91.30 | 75.65 | 53.29 | 36.90 | 36.74 |
| | U[0.30, 0.40] | 90.17 | 79.66 | 55.03 | 41.06 | 40.94 |

*Table 5. Accuracy and robustness results of In-ResNet-110 and λ-In-ResNet-110 with different initialization schemes. Acc. denotes the accuracy over the CIFAR-10 testing set. noise denotes the average accuracy over the four stochastic noise groups from CIFAR-10-C. FGSM, IFGSM, and PGD represent model accuracy under the corresponding attacks at a radius of 2/255. All of the reported results are averaged over 5 runs except for U[0.3, 0.4]: they are averaged over 4 (2) runs, as 1 (3) out of 5 runs for In-ResNet-110 (λ-In-ResNet-110) failed with a final accuracy of 10% on the CIFAR-10 test set.*

| Model | Acc. | noise | FGSM | IFGSM | PGD |
|---|---|---|---|---|---|
| ResNet-110 | 93.58 | 53.70 | 41.48 | 5.93 | 5.60 |
| ResNet-110, ens | 95.03 | 55.70 | 43.99 | 6.26 | 5.93 |
| In-ResNet-110 | 92.28 | 72.67 | 55.24 | 32.05 | 31.74 |
| In-ResNet-110, ens | 94.03 | 75.86 | 58.42 | 34.44 | 34.03 |
| λ-In-ResNet-110 | 92.15 | 72.35 | 50.84 | 30.72 | 30.45 |
| λ-In-ResNet-110, ens | 94.00 | 75.29 | 53.66 | 32.95 | 32.77 |
| ResNet-164 | 94.46 | 56.51 | 44.37 | 8.19 | 7.77 |
| ResNet-164, ens | 95.44 | 58.76 | 46.54 | 8.53 | 8.14 |
| In-ResNet-164 | 92.69 | 72.05 | 51.84 | 27.43 | 26.95 |
| In-ResNet-164, ens | 94.26 | 75.26 | 54.72 | 28.97 | 28.51 |
| λ-In-ResNet-164 | 92.55 | 71.88 | 50.53 | 26.50 | 26.04 |
| λ-In-ResNet-164, ens | 94.20 | 74.97 | 53.17 | 27.74 | 27.30 |

*Table 6. Comparison between the accuracy and robustness results of the ensemble model over 5 different runs and those of the single model (scores are averaged). Acc. denotes the accuracy over the CIFAR-10 testing set. noise denotes the average accuracy over the four stochastic noise groups from CIFAR-10-C. FGSM, IFGSM, and PGD represent model accuracy under the corresponding attacks at a radius of 2/255.*

*Figure 3. The accuracy improvements over single models for the ensembled ResNet-110, In-ResNet-110, and λ-In-ResNet-110 on the CIFAR-10 dataset. Both ensembles of our models have more significant accuracy improvements than the ensemble of the baseline ResNet-110 model.*

### 4.6. Effect of Model Ensemble

It is known that an ensemble model is more robust than a single model (Wang et al., 2019). To further improve accuracy and robustness, we perform model ensembling over the 5 different runs of the baseline and our models. Table 6 shows the comparison between ensemble models and single models for ResNet-110, In-ResNet-110, and λ-In-ResNet-110 on the CIFAR-10 dataset. It can be seen that all of the ensemble models are more robust and more accurate than the corresponding single models.
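The paper does not state how the ensemble prediction is formed; a common choice, sketched below as an assumption, is to average the softmax outputs of the independently trained runs.

```python
import torch

@torch.no_grad()
def ensemble_predict(models, x):
    # Average the softmax probabilities of independently trained models (assumed ensembling rule).
    probs = torch.stack([torch.softmax(m(x), dim=1) for m in models]).mean(dim=0)
    return probs.argmax(dim=1)
```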
We also plot the accuracy improvements over single models for the ensembled ResNet-110, In-ResNet-110, and λ-In-ResNet-110. As shown in Figure 3, both ensembles of our models have more significant accuracy improvements than the ensemble of the baseline ResNet-110 model. This can be attributed to the performance difference among different runs of our model due to optimization difficulty. More results and visualizations of the effect of the ensemble method can be found in Appendix C.

## 5. Conclusion

While the relationship between ODEs and non-residual networks remains unclear, in this paper we present a novel ODE model by adding a damping term. By adjusting the interpolation coefficient, the proposed model unifies the interpretation of both residual and non-residual networks. Lyapunov analysis and experimental results on the CIFAR-10 and CIFAR-100 benchmarks reveal better robustness of the proposed interpolated networks against both stochastic noise and several adversarial attack methods. Loss landscape analysis reveals the improved robustness of our method along the attack direction. Furthermore, experiments show that the performance of the proposed model is sensitive to the initialization of the interpolation coefficients, demonstrating a trade-off between optimization difficulty and robustness. The significance of the design of the interpolated networks is shown by comparing several model variants. Future work includes determining the interpolation coefficients as a black-box process and leveraging data augmentation techniques to improve our models.

## Acknowledgements

We thank all the anonymous reviewers for their suggestions. Yang Liu is supported by the National Key R&D Program of China (No. 2017YFB0202204), the National Natural Science Foundation of China (No. 61925601, No. 61761166008), and Huawei Technologies Group Co., Ltd. Chenglong Bao is supported by the National Natural Science Foundation of China (No. 11901338) and the Tsinghua University Initiative Scientific Research Program. Zuoqiang Shi is supported by the National Natural Science Foundation of China (No. 11671005). This work is also supported by the Beijing Academy of Artificial Intelligence.

## References

Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., and Holtham, E. Reversible architectures for arbitrarily deep residual neural networks. In AAAI, 2018.

Chang, B., Chen, M., Haber, E., and Chi, E. H. AntisymmetricRNN: A dynamical system view on recurrent neural networks. In ICLR, 2019.

Chen, G. Stability of nonlinear systems. Wiley Encyclopedia of Electrical and Electronics Engineering, 2001.

Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In NeurIPS, 2018.

Dupont, E., Doucet, A., and Teh, Y. W. Augmented neural ODEs. In NeurIPS, pp. 3134–3144, 2019.

E, W. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.

Han, J., Jentzen, A., and E, W. Solving high-dimensional partial differential equations using deep learning. PNAS, 115(34):8505–8510, 2018.

Hanshu, Y., Jiawei, D., Vincent, T., and Jiashi, F. On robustness of neural ordinary differential equations. In ICLR, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In ECCV, pp. 630–645. Springer, 2016.

Hendrycks, D. and Dietterich, T.
Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In NeurIPS, pp. 6389–6399, 2018.

Liu, X., Si, S., Cao, Q., Kumar, S., and Hsieh, C.-J. Neural SDE: Stabilizing neural ODE networks with stochastic noise. arXiv:1906.02355, 2019.

Lu, Y., Zhong, A., Li, Q., and Dong, B. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In ICML, 2018.

Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., Wang, L., and Liu, T.-Y. Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv:1906.02762, 2019.

Lyapunov, A. M. The general problem of the stability of motion. International Journal of Control, 55(3):531–534, 1992.

Reshniak, V. and Webster, C. Robust learning with implicit residual networks. arXiv:1905.10479, 2019.

Su, D., Zhang, H., Chen, H., Yi, J., Chen, P.-Y., and Gao, Y. Is robustness the cost of accuracy? A comprehensive study on the robustness of 18 deep image classification models. In ECCV, pp. 631–648, 2018.

Tao, Y., Sun, Q., Du, Q., and Liu, W. Nonlocal neural networks, nonlocal diffusion and nonlocal modeling. In NeurIPS, pp. 496–506, 2018.

Wang, B., Yuan, B., Shi, Z., and Osher, S. J. EnResNet: ResNet ensemble via the Feynman-Kac formalism. In NeurIPS, 2019.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In CVPR, pp. 1492–1500, 2017.

Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. In NeurIPS, 2019a.

Zhang, J., Han, B., Wynter, L., Low, K. H., and Kankanhalli, M. Towards robust ResNet: A small step but a giant leap. In IJCAI, 2019b.

Zhu, M., Chang, B., and Fu, C. Convolutional neural networks combined with Runge-Kutta methods. arXiv preprint, February 2018.