# Deep Layers as Stochastic Solvers

Published as a conference paper at ICLR 2019

Adel Bibi (KAUST), Bernard Ghanem (KAUST), Vladlen Koltun (Intel Labs), René Ranftl (Intel Labs). The work was done during an internship at Intel Labs.

ABSTRACT

We provide a novel perspective on the forward pass through a block of layers in a deep network. In particular, we show that a forward pass through a standard dropout layer followed by a linear layer and a non-linear activation is equivalent to optimizing a convex objective with a single iteration of a τ-nice Proximal Stochastic Gradient method. We further show that replacing standard Bernoulli dropout with additive dropout is equivalent to optimizing the same convex objective with a variance-reduced proximal method. By expressing both fully-connected and convolutional layers as special cases of a high-order tensor product, we unify the underlying convex optimization problem in the tensor setting and derive a formula for the Lipschitz constant L used to determine the optimal step size of the above proximal methods. We conduct experiments with standard convolutional networks applied to the CIFAR-10 and CIFAR-100 datasets and show that replacing a block of layers with multiple iterations of the corresponding solver, with step size set via L, consistently improves classification accuracy.

1 INTRODUCTION

Deep learning has revolutionized computer vision and natural language processing and is increasingly applied throughout science and engineering (LeCun et al., 2015). This has motivated the mathematical analysis of various aspects of deep networks, such as the capacity and uniqueness of their representations (Soatto & Chiuso, 2014; Papyan et al., 2018) and their global training convergence properties (Haeffele & Vidal, 2017). However, a complete characterization of deep networks remains elusive. For example, Bernoulli dropout layers are known to improve generalization (Srivastava et al., 2014), but a thorough theoretical understanding of their behavior remains an open problem. While basic dropout layers have proven to be effective, there are many other types of dropout with various desirable properties (Molchanov et al., 2017). This raises many questions. Can the fundamental block of layers that consists of a dropout layer followed by a linear transformation and a non-linear activation be further improved for better generalization? Can the choice of dropout layer be made independently from the linear transformation and non-linear activation? Are there systematic ways to propose new types of dropout?

We attempt to address some of these questions by establishing a strong connection between the forward pass through a block of layers in a deep network and the solution of convex optimization problems of the following form:

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \;\; F(x) + g(x), \qquad F(x) \overset{\mathrm{def}}{=} \frac{1}{n}\sum_i f_i\big(a_i^\top x\big). \qquad (1)$$

Note that when $f_i(a_i^\top x) = \frac{1}{2}(a_i^\top x - y_i)^2$ and $g(x) = \|x\|_2^2$, Eq. (1) is standard ridge regression. When $g(x) = \|x\|_1$, Eq. (1) has the form of LASSO regression.
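To make the objective in (1) concrete, the following minimal NumPy sketch (our own illustration, not code released with the paper) solves the LASSO instance of (1) with proximal gradient descent; the function names and the synthetic data are placeholders chosen for the example.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||x||_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gd_lasso(A, y, lam, iters=200):
    """Proximal gradient (ISTA) on F(x) + g(x) with
    F(x) = (1/(2n)) * ||A x - y||^2 and g(x) = lam * ||x||_1."""
    n, d = A.shape
    L = np.linalg.eigvalsh(A.T @ A / n).max()   # Lipschitz constant of grad F
    x = np.zeros(d)
    for _ in range(iters):
        grad = A.T @ (A @ x - y) / n
        x = soft_threshold(x - grad / L, lam / L)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 20))
    x_true = np.zeros(20); x_true[:3] = [2.0, -1.5, 1.0]
    y = A @ x_true + 0.01 * rng.standard_normal(100)
    print(np.round(prox_gd_lasso(A, y, lam=0.1), 2))
```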
We show that a block of layers that consists of dropout followed by a linear transformation (fully-connected or convolutional) and a non-linear activation has close connections to applying stochastic solvers to (1). Interestingly, the choice of the stochastic optimization algorithm gives rise to commonly used dropout layers, such as Bernoulli and additive dropout, and to a family of other types of dropout layers that have not been explored before. As a special case, when the block in question does not include dropout, the stochastic algorithm reduces to a deterministic one.

Our contributions can be summarized as follows. (i) We show that a forward pass through a block that consists of Bernoulli dropout followed by a linear transformation and a non-linear activation is equivalent to a single iteration of τ-nice Proximal Stochastic Gradient, Prox-SG (Xiao & Zhang, 2014), applied to an instance of (1). We provide various conditions on g that recover (either exactly or approximately) common non-linearities used in practice. (ii) We show that the same block with additive dropout instead of Bernoulli dropout is equivalent to a single iteration of mS2GD (Konečný et al., 2016), a mini-batching form of variance-reduced SGD (Johnson & Zhang, 2013), applied to an instance of (1). (iii) By expressing both fully-connected and convolutional layers (referred to as linear layers throughout) as special cases of a high-order tensor product (Bibi & Ghanem, 2017), we derive a formula for the Lipschitz constant L of F(x). As a consequence, we can compute the optimal step size for the stochastic solvers that correspond to blocks of layers. We note that concurrent work (Sedghi et al., 2019) used a different analysis strategy to derive an equivalent result for computing the singular values of convolutional layers. (iv) We validate our theoretical analysis experimentally by replacing blocks of layers in standard image classification networks with corresponding solvers and show that this improves the accuracy of the models.

2 RELATED WORK

Optimization algorithms can provide insight and guidance in the design of deep network architectures (Vogel & Pock, 2017; Kobler et al., 2017; Yang et al., 2016; Zhang & Ghanem, 2018). For example, Yang et al. (2016) proposed a deep network architecture for compressed sensing. Their network, dubbed ADMM-Net, is inspired by ADMM updates (Boyd et al., 2011) on the compressed sensing objective. Similarly, Zhang & Ghanem (2018) demonstrated that unrolling a proximal gradient descent solver (Beck & Teboulle, 2009) on the same problem can further improve performance. The work of Kobler et al. (2017) demonstrated a relation between incremental proximal methods and ResNet blocks; based on this observation, they proposed a new architecture (variational networks) for the task of image reconstruction. Amos & Kolter (2017) proposed to embed optimization problems, in particular linearly-constrained quadratic programs, as structured layers in deep networks. Meinhardt et al. (2017) replaced proximal operators in optimization algorithms by neural networks. Huang & Van Gool (2017) proposed a new matrix layer, dubbed ReEig, that applies a thresholding operation to the eigenvalues of intermediate feature representations that are stacked in matrix form. ReEig is tightly connected to a proximal operator of the set of positive semi-definite matrices. Sulam et al. (2018) proposed a new architecture based on a sparse representation construct, Multi-Layer Convolutional Sparse Coding (ML-CSC), initially introduced by Papyan et al. (2017). Sparsity of the intermediate representations was enforced by a multi-layer form of basis pursuit. This body of work has demonstrated the merits of connecting the design of deep networks with optimization algorithms in the form of structured layers.
Yet, with few exceptions (Amos & Kolter, 2017; Sulam et al., 2018), previous works propose specialized architectures for specific tasks. Our work aims to contribute to a unified framework that relates optimization algorithms to deep layers.

A line of work aims to provide rigorous interpretation for dropout layers. For example, Wager et al. (2013) showed that dropout is linked to an adaptively balanced ℓ2-regularized loss. Wang & Manning (2013) showed that approximating the loss with a normal distribution leads to a faster form of dropout. Gal & Ghahramani (2016a;b) developed a framework that connects dropout with approximate variational inference in Bayesian models. We provide a complementary perspective, in which dropout layers arise naturally in an optimization-driven framework for network design.

3 UNIFIED FRAMEWORK

This section is organized as follows. We introduce our notation and preliminaries in Section 3.1. In Section 3.2, we present a motivational example relating a single iteration of proximal gradient descent (Prox-GD) on (1) to the forward pass through a fully-connected layer followed by a non-linear activation. We will show that several commonly used non-linear activations can be exactly or approximately represented as proximal operators of g(x). In Section 3.3, we unify fully-connected and convolutional layers as special cases of a high-order tensor product. We propose a generic instance of (1) in a tensor setting, where we provide a formula for the Lipschitz constant L of the finite-sum structure of (1). In Section 3.4, we derive an intimate relation between stochastic solvers, namely τ-nice Prox-SG and mS2GD, and two types of dropout layers. Figure 1 shows an overview of the connections that will be developed.

Figure 1: An overview of the tight relation between a single iteration of a stochastic solver and the forward pass through the $l$th layer in a network that consists of dropout followed by a linear transformation and a non-linear activation. We study an instance of problem (1) with quadratic $F(x)$, where $x^{l-1}$ are the input activations and $x^l$, the variables being optimized, correspond to the output activations. Varying the type of stochastic solver changes the nature of the dropout layer, while the prior $g(x)$ on the output activations determines the non-linearity $\text{Prox}_{\frac{1}{L}g}$.

3.1 NOTATION AND PRELIMINARIES

As we will be working with tensors, we follow the tensor notation of Kolda & Bader (2009). The order of a tensor is the number of its dimensions. In particular, scalars are tensors of order zero, vectors are tensors of order one, and matrices are tensors of order two. We denote scalars by lowercase letters $a$, vectors by bold lowercase letters $\mathbf{a}$, and matrices by bold capital letters $\mathbf{A}$. We use subscripts $a_i$ to refer to individual elements in a vector. Tensors of order three or more are denoted by cursive capital letters $\mathcal{A} \in \mathbb{R}^{J_1\times J_2\times\cdots\times J_N}$. Throughout the paper, we handle tensors of at most order four. High-order tensors with a second dimension of size equal to one are traditionally called vector tensors and denoted $\mathcal{A} \in \mathbb{R}^{J_1\times 1\times J_3\times J_4}$. We use $\mathcal{A}(i,j,k,z)$ to refer to an element in a tensor and $\mathcal{A}(i,j,k,:)$ to refer to a slice of a tensor. The inner product between tensors of the same size is $\langle\mathcal{A},\mathcal{B}\rangle = \sum_{i_1,\dots,i_N}\mathcal{A}(i_1,\dots,i_N)\,\mathcal{B}(i_1,\dots,i_N)$. The squared Frobenius norm of a tensor $\mathcal{A}$ is defined as $\|\mathcal{A}\|_F^2 = \langle\mathcal{A},\mathcal{A}\rangle$. Lastly, the superscripts $\top$ and $H$ are used to denote the transpose and the Hermitian transpose, respectively.
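The short NumPy snippet below (ours, not part of the paper) illustrates the tensor notation above; the array shapes are arbitrary examples.

```python
import numpy as np

# Order-4 tensor A in R^{J1 x J2 x J3 x J4}; a "vector tensor" has J2 = 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5, 5))      # order-4 tensor
Xvec = rng.standard_normal((4, 1, 5, 5))   # vector tensor (second dimension = 1)

elem = A[0, 1, 2, 3]        # A(1, 2, 3, 4) in the paper's 1-based notation
slc = A[0, 1, 2, :]         # A(1, 2, 3, :), a tensor slice

B = rng.standard_normal(A.shape)
inner = np.sum(A * B)       # <A, B>: sum of elementwise products
frob_sq = np.sum(A * A)     # ||A||_F^2 = <A, A>
assert np.isclose(frob_sq, np.linalg.norm(A) ** 2)
print(elem, slc.shape, inner, frob_sq)
```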
3.2 MOTIVATIONAL INSIGHT: NON-LINEAR ACTIVATIONS AS PROXIMAL OPERATORS

As a motivating example, we consider the $l$th linear layer in a deep network that is followed by a non-linear activation $\rho$, i.e. $x^l = \rho(Ax^{l-1} + b)$, where $A\in\mathbb{R}^{n_2\times n_1}$ and $b\in\mathbb{R}^{n_2}$ are the weights and biases of the layer and $x^{l-1}$ and $x^l$ are the input and output activations, respectively. Now consider an instance of (1) with a convex function $g(x)$ and

$$F(x^l) = \frac{1}{2}\big\|A^\top x^l - x^{l-1}\big\|_2^2 - b^\top x^l = \frac{1}{2}\sum_i\big(A^\top(i,:)\,x^l - x^{l-1}_i\big)^2 - b^\top x^l, \qquad (2)$$

where $A^\top(i,:)$ is the $i$th row of $A^\top$. Such an objective can be optimized iteratively in $x^l$ using Prox-GD with the following update equation:

$$x^l \leftarrow \text{Prox}_{\frac{1}{L}g}\Big(x^l - \tfrac{1}{L}\big(AA^\top x^l - Ax^{l-1} - b\big)\Big), \qquad (3)$$

where the Lipschitz constant $L = \lambda_{\max}(AA^\top)$ and $\lambda_{\max}(\cdot)$ denotes the maximum eigenvalue. By initializing the iterative optimization at $x^l = 0$, it becomes clear that a single iteration of (3) is equivalent to a fully-connected layer followed by a non-linearity that is implemented by the proximal operator (Fawzi et al., 2015).

The choice of $g(x)$ determines the specific form of the non-linearity $\rho$. Several popular activation functions can be traced back to their corresponding $g(x)$. The ReLU, which enforces non-negative output activations, corresponds to the indicator function $g(x) = \mathbb{1}_{x\geq 0}$; the corresponding instance of problem (1) is a non-negative quadratic program. Similar observations for the ReLU have been made in other contexts (Amos & Kolter, 2017; Papyan et al., 2017). We observe that many other activation functions fit this framework. For example, when $g(x)$ is a squared hinge loss, i.e. $\frac{\gamma}{2}\sum_i\max^2(-x_i-\lambda, 0)$, a single update of (3) is equivalent to a linear layer followed by a Leaky ReLU. Table 1 lists some other choices of $g(x)$ and their induced activations.

| $g(x)$ | $\text{Prox}_g(\eta)$ | Activation |
|---|---|---|
| $\mathbb{1}_{x\geq 0}$ | $\max(0,\eta)$ | $=\text{ReLU}(\eta)$ |
| $\frac{\gamma}{2}\sum_i\max^2(-x_i-\lambda,0)$ | $\eta$ if $\eta\geq-\lambda$; $\frac{\eta-\gamma\lambda}{1+\gamma}$ if $\eta\leq-\lambda$ | $=\text{Leaky ReLU}(\eta)$ (generalized) |
| $-\gamma\sum_i\log(x_i)$ | $\frac{\eta}{2}+\sqrt{\frac{\eta^2}{4}+\gamma}$ | $\approx\text{SoftPlus}(\eta)$ |
| $-\gamma\sum_i\log(1-x_i^2)$ | root of a cubic polynomial | $\approx\text{Tanh}(\eta)$ |

Table 1: Different choices of g(x), their corresponding proximal operators, and their relation to common activation functions. Squared hinge loss regularization of the activations yields a generalized Leaky ReLU. Log-barriers recover smooth activations, such as SoftPlus, Tanh, or Sigmoid. Derivations can be found in the supplementary material.

Note that g(x) is not required to exhibit a simple, coordinate-wise separable structure. More complex functions can be used, as long as the proximal operator is easy to evaluate. Interesting examples arise when the output activations have matrix structure. For instance, one can impose nuclear norm regularization $g(X) = \|X\|_*$ to encourage $X$ to be low rank. Alternatively, one can enforce positive semi-definite structure on the matrix $X$ by defining $g(X) = \mathbb{1}_{X\succeq 0}$. A similar activation has been used for higher-order pooling (Huang & Van Gool, 2017).

In what follows, we will show that this connection can be further extended to explain dropout layers. Interestingly, specific forms of dropout do not arise from particular forms of objective (1), but from different stochastic optimization algorithms that are applied to it.
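For concreteness, the following minimal NumPy sketch (ours; not the paper's implementation) shows that a single Prox-GD step on (2) with zero initialization reproduces a fully-connected layer followed by an activation, here for the ReLU and the generalized Leaky ReLU proximal operators from Table 1.

```python
import numpy as np

def prox_relu(v):
    # Prox of the indicator of the non-negative orthant, g(x) = 1_{x >= 0}.
    return np.maximum(v, 0.0)

def prox_leaky(v, gamma=3.0, lam=0.0):
    # Prox of g(x) = (gamma/2) * sum_i max(-x_i - lam, 0)^2:
    # identity for v >= -lam, slope 1/(1+gamma) below the shift.
    return np.where(v >= -lam, v, (v - gamma * lam) / (1.0 + gamma))

def prox_gd_step(A, b, x_prev, prox, x=None):
    """One Prox-GD step on F(x) = 0.5*||A.T x - x_prev||^2 - b.T x (Eq. (2))."""
    L = np.linalg.eigvalsh(A @ A.T).max()
    if x is None:
        x = np.zeros(A.shape[0])                 # zero initialization
    grad = A @ (A.T @ x - x_prev) - b
    return prox(x - grad / L)                    # prox parameter 1/L absorbed into the prox choice

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 10)); b = rng.standard_normal(6)
x_prev = rng.standard_normal(10)
L = np.linalg.eigvalsh(A @ A.T).max()

# With zero init, one step equals the layer output ReLU((A x_prev + b) / L).
assert np.allclose(prox_gd_step(A, b, x_prev, prox_relu),
                   np.maximum((A @ x_prev + b) / L, 0.0))
```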
3.3 UNIFYING FULLY-CONNECTED AND CONVOLUTIONAL LAYERS

Before presenting our main results on the equivalence between a forward pass through a block of layers and solving (1) with stochastic algorithms, we provide some key lemmas. These lemmas will be necessary for a unified treatment of fully-connected and convolutional layers as generic linear layers. This generic treatment will enable efficient computation of the Lipschitz constant for both fully-connected and convolutional layers.

Lemma 1. Consider the $l$th convolutional layer in a deep network with some non-linear activation, e.g. $\text{Prox}_g(\cdot)$, where the weights $\mathcal{A}\in\mathbb{R}^{n_2\times n_1\times W\times H}$, biases $\mathcal{B}\in\mathbb{R}^{n_2\times 1\times W\times H}$, and input activations $\mathcal{X}^{l-1}\in\mathbb{R}^{n_1\times 1\times W\times H}$ are stacked into 4th-order tensors. We can describe the layer as

$$\mathcal{X}^l = \text{Prox}_g\big(\mathcal{A} *_{HO} \mathcal{X}^{l-1} + \mathcal{B}\big), \qquad (4)$$

where $*_{HO}$ is the high-order tensor product. Here $n_1$ is the number of input features, $n_2$ is the number of output features (number of filters), and $W$ and $H$ are the spatial dimensions of the features. As a special case, a fully-connected layer follows naturally, since $*_{HO}$ reduces to a matrix-vector multiplication when $W = H = 1$.

The proof can be found in the supplementary material. Note that the order of the dimensions is essential in this notation, as the first dimension in $\mathcal{A}$ corresponds to the number of independent filters while the second corresponds to the input features that will be aggregated after the 2D convolutions. Also note that, according to the definition of $*_{HO}$ in (Bibi & Ghanem, 2017), the spatial size of the filters in $\mathcal{A}$, namely $W$ and $H$, has to match the spatial dimensions of the input activations $\mathcal{X}^{l-1}$, since the operator $*_{HO}$ performs 2D circular convolutions while convolutions in deep networks are 2D linear convolutions. This is not a restriction, since one can perform linear convolution through a zero-padded circular convolution. Lastly, we assume that the values in $\mathcal{B}$ are replicated along the spatial dimensions $W$ and $H$ in order to recover the behaviour of biases in deep networks.

Given this notation, we will refer to either a fully-connected or a convolutional layer as a linear layer throughout the rest of the paper. Since we are interested in a generic linear layer followed by a non-linearity, we will consider the tensor quadratic version of $F(x)$, denoted $F(\mathcal{X})$:

$$\min_{\mathcal{X}} \; \underbrace{\frac{1}{n_1}\sum_{i=1}^{n_1}\Big(\frac{n_1}{2}\big\|\mathcal{A}^H(i,:,:,:) *_{HO} \mathcal{X} - \mathcal{X}^{l-1}(i,:,:,:)\big\|_F^2 - \langle\mathcal{B},\mathcal{X}\rangle\Big)}_{F(\mathcal{X})} + g(\mathcal{X}). \qquad (5)$$

Note that if $\mathcal{A}\in\mathbb{R}^{n_2\times n_1\times W\times H}$, then $\mathcal{A}^H\in\mathbb{R}^{n_1\times n_2\times W\times H}$, where each of the frontal slices $\mathcal{A}(:,:,i,j)$ is transposed and each filter $\mathcal{A}(i,j,:,:)$ is rotated by 180°. This means that $\mathcal{A}^H *_{HO} \mathcal{X}$ aggregates the $n_2$ filters after performing 2D correlations; this is performed $n_1$ times independently. This operation is commonly referred to as a transposed convolution. Details can be found in the supplementary material. Next, the following lemma provides a practical formula for the computation of the Lipschitz constant $L$ of the finite-sum part of (5):

Lemma 2. The Lipschitz constant $L$ of $F(\mathcal{X})$ as defined in (5) is given by

$$L = \max_{i\in\{1,\dots,W\},\,j\in\{1,\dots,H\}}\Big\{\lambda_{\max}\big(\hat{\mathcal{A}}(:,:,i,j)\,\hat{\mathcal{A}}^H(:,:,i,j)\big)\Big\}, \qquad (6)$$

where $\hat{\mathcal{A}}$ is the 2D discrete Fourier transform of $\mathcal{A}$ along the spatial dimensions $W$ and $H$.

The proof can be found in the supplementary material. Lemma 2 states that the Lipschitz constant $L$ is the maximum among the set of maximum eigenvalues of all the possible $W\!H$ combinations of the outer product of frontal slices $\hat{\mathcal{A}}(:,:,i,j)\,\hat{\mathcal{A}}^H(:,:,i,j)$. Note that if $W = H = 1$, then $\hat{\mathcal{A}} = A\in\mathbb{R}^{n_2\times n_1}$, since the 2D discrete Fourier transform of scalars (i.e. matrices of size $1\times 1$) is an identity mapping. As a consequence, we can simplify (6) to $L = \lambda_{\max}\big(\mathcal{A}(:,:,1,1)\,\mathcal{A}^H(:,:,1,1)\big) = \lambda_{\max}(AA^\top)$, which recovers the Lipschitz constant for fully-connected layers.
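A sketch of the Lipschitz-constant computation in Lemma 2, assuming the kernel tensor has already been zero-padded to the input's spatial size so that circular convolution matches linear convolution (our code, for illustration only):

```python
import numpy as np

def conv_lipschitz(A):
    """Lipschitz constant of the quadratic part for a convolutional layer with
    kernel tensor A of shape (n2, n1, W, H):
    L = max over spatial frequencies (i, j) of
        lambda_max( A_hat(:, :, i, j) @ A_hat(:, :, i, j)^H )."""
    A_hat = np.fft.fft2(A, axes=(2, 3))          # 2D DFT along the spatial dimensions
    n2, n1, W, H = A.shape
    L = 0.0
    for i in range(W):
        for j in range(H):
            S = A_hat[:, :, i, j]                # frontal slice at frequency (i, j)
            # lambda_max(S S^H) equals the squared largest singular value of S.
            L = max(L, np.linalg.svd(S, compute_uv=False)[0] ** 2)
    return L

# Fully-connected special case (W = H = 1) recovers lambda_max(A A^T).
A_fc = np.random.default_rng(0).standard_normal((6, 10, 1, 1))
L_fc = np.linalg.eigvalsh(A_fc[:, :, 0, 0] @ A_fc[:, :, 0, 0].T).max()
assert np.isclose(conv_lipschitz(A_fc), L_fc)
```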
3.4 DROPOUT LAYERS AS VARIANTS OF STOCHASTIC SOLVERS

In this subsection, we present two propositions. The first shows the relation between standard Bernoulli dropout with dropout rate $p$, $\text{BerDropout}_p$ (Srivastava et al., 2014), and τ-nice Prox-SG. The second proposition relates additive dropout, AddDropout, to mS2GD (Konečný et al., 2016). We first introduce a generic notion of sampling from a set. This is essential, as the stochastic algorithms sample unbiased function estimates from the set of $n_1$ functions in (5).

Definition 3.1. (Gower et al., 2018). A sampling is a random set-valued mapping with values being the subsets of $[n_1] = \{1,\dots,n_1\}$. A sampling $S$ is τ-nice if it is uniform, i.e. $\text{Prob}(i\in S) = \text{Prob}(j\in S)$ for all $i, j$, and assigns equal probabilities to all subsets of $[n_1]$ of cardinality τ and zero probability to all others.

Various other types of sampling can be found in (Gower et al., 2018). We are now ready to present our first proposition.

Proposition 1. A single iteration of Prox-SG with τ-nice sampling $S$ on (5) with $\tau = (1-p)n_1$, zero initialization, and unit step size can be shown to exhibit the update

$$\mathcal{X}^l = \text{Prox}_{\frac{1}{L}g}\Big(\frac{n_1}{\tau}\sum_{i\in S}\mathcal{A}(:,i,:,:) *_{HO} \mathcal{X}^{l-1}(i,:,:,:) + \mathcal{B}\Big), \qquad (7)$$

which is equivalent to a forward pass through a $\text{BerDropout}_p$ layer that drops exactly $n_1 p$ input activations, followed by a linear layer and a non-linear activation.

We provide a simplified sketch for fully-connected layers here; the detailed proof is in the supplement. To see how (7) reduces to the functional form of $\text{BerDropout}_p$ followed by a fully-connected layer and a non-linear activation, consider $W = H = 1$. The argument of $\text{Prox}_{\frac{1}{L}g}$ in (7) (without the bias term) reduces to

$$\frac{n_1}{\tau}\sum_{i\in S}\mathcal{A}(:,i,:,:) *_{HO} \mathcal{X}^{l-1}(i,:,:,:) = \frac{n_1}{\tau}\sum_{i\in S}A(:,i)\,\mathcal{X}^{l-1}(i) = \frac{n_1}{\tau}\,A\,\text{BerDropout}_p\big(\mathcal{X}^{l-1}\big). \qquad (8)$$

The first equality follows from the definition of $*_{HO}$, while the second equality follows from trivially reparameterizing the sum, with $\text{BerDropout}_p(\cdot)$ being equivalent to a mask that zeroes out exactly $pn_1$ input activations. Note that if τ-nice Prox-SG were replaced with Prox-GD, i.e. $\tau = n_1$, this corresponds to having a $\text{BerDropout}_p$ layer with dropout rate $p = 0$; thus, (8) reduces to $A\,\text{BerDropout}_p(\mathcal{X}^{l-1}) = A\,\mathcal{X}^{l-1}$, which recovers our motivating example (3) that relates Prox-GD with the forward pass through a fully-connected layer followed by a non-linearity.

Note that Proposition 1 directly suggests how to apply dropout to convolutional layers. Specifically, complete input features from the $n_1$ should be dropped and the 2D convolutions should be performed only on the τ-sampled subset, where $\tau = (1-p)n_1$.

Similarly, the following proposition shows that a form of additive dropout, AddDropout, can be recovered from a different choice of stochastic solver.

Proposition 2. A single outer-loop iteration of mS2GD (Konečný et al., 2016) with unit step size and zero initialization is equivalent to a forward pass through an AddDropout layer followed by a linear layer and a non-linear activation.

The proof is given in the supplement. It is similar to Proposition 1, with mS2GD replacing τ-nice Prox-SG. Note that any variance-reduced algorithm where one full gradient is computed at least once can be used here as a replacement for mS2GD. For instance, one can show that the serial sampling version of mS2GD, S2GD (Konečný et al., 2016), and SVRG (Johnson & Zhang, 2013) can also be used. Other algorithms, such as Stochastic Coordinate Descent (Richtárik & Takáč, 2016) with arbitrary sampling, are discussed in the supplement.
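The equality asserted by Proposition 1 can be checked numerically for the fully-connected case with the short sketch below (our illustration; the ReLU prox and random data are assumptions made for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 16, 8, 0.5
tau = int((1 - p) * n1)                       # tau-nice sample size, tau = (1 - p) * n1
A = rng.standard_normal((n2, n1))
b = rng.standard_normal(n2)
x_prev = rng.standard_normal(n1)

# tau-nice sampling: a uniformly random subset of [n1] of cardinality tau.
S = rng.choice(n1, size=tau, replace=False)

# One Prox-SG step on the finite-sum objective, zero init, unit step size
# (g = indicator of x >= 0, so the 1/L prox parameter is irrelevant here).
arg_sg = (n1 / tau) * (A[:, S] @ x_prev[S]) + b
x_solver = np.maximum(arg_sg, 0.0)

# Forward pass: mask that zeroes exactly p*n1 inputs, scaled by n1/tau as in (8),
# followed by the linear layer and the ReLU.
mask = np.zeros(n1); mask[S] = 1.0
x_layer = np.maximum((n1 / tau) * (A @ (mask * x_prev)) + b, 0.0)

assert np.allclose(x_solver, x_layer)
```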
4 EXPERIMENTS

A natural question arises as a consequence of our framework: if common layers in deep networks can be understood as a single iteration of an optimization algorithm, what happens if the algorithm is applied for multiple iterations? We empirically answer this question in our experiments. In particular, we embed solvers as a replacement for their corresponding blocks of layers and show that this improves the accuracy of the models without an increase in the number of network parameters.

Experimental setup. We perform experiments on CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). In all experiments, training was conducted on 90% of the training set while 10% was left for validation. The networks used in the experiments are variants of LeNet (LeCun et al., 1999), AlexNet (Krizhevsky et al., 2012), and VGG16 (Simonyan & Zisserman, 2014). We used stochastic gradient descent with a momentum of 0.9 and a weight decay of $5\times 10^{-4}$. The learning rate was set to $(10^{-2}, 10^{-3}, 10^{-4})$ for the first, second, and third 100 epochs, respectively. For finetuning, the learning rate was initially set to $10^{-3}$ and reduced to $10^{-4}$ after 100 epochs. Moreover, when a block of layers is replaced with a deterministic solver, i.e. Prox-GD, the step size is set to the optimal constant $1/L$, where $L$ is computed according to Lemma 2 and updated every epoch. $L$ is computed without any zero padding, since a circular convolution operator approximates a linear convolution in large dimensions (Zhu & Wakin, 2017). In Prox-SG, a decaying step size is necessary for convergence; therefore, the step size is exponentially decayed as suggested by Bottou (2012), where the initial step size is again set according to Lemma 2. Finally, to guarantee convergence of the stochastic solvers, we add the strongly convex function $\frac{\lambda}{2}\|\mathcal{X}\|_F^2$ to the finite sum in (5), where we set $\lambda = 10^{-3}$ in all experiments. Note that for networks that include a stochastic solver, the network will be stochastic at test time. We thus report the average accuracy and standard deviation over 20 trials.

Replacing fully-connected layers with solvers. In this experiment, we demonstrate that (i) training networks with solvers replacing one or more blocks of layers can improve accuracy when trained from scratch, and (ii) the improvement is consistently present when one or more blocks are replaced with solvers at different layers in the network. To do so, we train a variant of LeNet on the CIFAR-10 dataset with two $\text{BerDropout}_p$ layers. The last two layers are fully-connected layers with ReLU activation. We consider three variants of this network: both fully-connected layers are augmented with $\text{BerDropout}_p$ (LeNet-D-D), only the last layer is augmented with $\text{BerDropout}_p$ (LeNet-ND-D), and finally only the penultimate layer is augmented with $\text{BerDropout}_p$ (LeNet-D-ND). In all cases, we set the dropout rate to $p = 0.5$. We replace the $\text{BerDropout}_p$ layers with their corresponding stochastic solvers and run them for 10 iterations with $\tau = n_1/2$ (the setting corresponding to a dropout rate of $p = 0.5$). We train these networks from scratch using the same procedure as the baseline networks.
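The sketch below indicates how such a multi-iteration solver block might look for a fully-connected layer. It is our own illustration under stated assumptions (the 0.9 decay factor, variable names, and the regularized objective form are our choices), not the implementation used in the experiments.

```python
import numpy as np

def solver_block(A, b, x_prev, p=0.5, iters=10, lam=1e-3, seed=0):
    """Run `iters` Prox-SG steps (instead of a single dropout+linear+ReLU pass) on
    0.5*||A.T x - x_prev||^2 - b.T x + (lam/2)*||x||^2 with g = 1_{x >= 0},
    using tau = (1 - p) * n1 and a decaying step size starting at 1/L."""
    rng = np.random.default_rng(seed)
    n2, n1 = A.shape
    tau = int((1 - p) * n1)
    L = np.linalg.eigvalsh(A @ A.T).max() + lam
    x = np.zeros(n2)
    for k in range(iters):
        S = rng.choice(n1, size=tau, replace=False)                        # tau-nice sample
        grad = (n1 / tau) * A[:, S] @ (A[:, S].T @ x - x_prev[S]) - b + lam * x
        step = (1.0 / L) * 0.9 ** k                                        # exponentially decayed step
        x = np.maximum(x - step * grad, 0.0)                               # prox of 1_{x >= 0}
    return x

out = solver_block(np.random.default_rng(1).standard_normal((8, 16)),
                   np.random.default_rng(2).standard_normal(8),
                   np.random.default_rng(3).standard_normal(16))
print(out.shape)
```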
The results are summarized in Table 2. It can be seen that replacing $\text{BerDropout}_p$ with the corresponding stochastic solver (τ-nice Prox-SG) improves performance significantly, for any choice of layer. The results indicate that networks that incorporate stochastic solvers can be trained stably and achieve desirable generalization performance.

| | LeNet-D-D | LeNet-D-ND | LeNet-ND-D |
|---|---|---|---|
| Baseline | 64.39% | 71.72% | 68.54% |
| Prox-SG | 72.86% ± 0.177 | 75.20% ± 0.205 | 76.23% ± 0.206 |

Table 2: Comparison in accuracy between variants of the LeNet architecture on the CIFAR-10 dataset. The variants differ in the location (D or ND) and number of $\text{BerDropout}_p$ layers for both the baseline networks and their stochastic solver counterpart Prox-SG. Accuracy consistently improves when Prox-SG is used. Accuracy is reported on the test set.

Convolutional layers and larger networks. We now demonstrate that solvers can be used to improve larger networks. We conduct experiments with variants of AlexNet and VGG16 on both CIFAR-10 and CIFAR-100. (AlexNet (Krizhevsky et al., 2012) was adapted to account for the difference in spatial size of the images in CIFAR-10 and ImageNet (Deng et al., 2009): the first convolutional layer has a padding of 5, all max-pooling layers have a kernel size of 2, and a single fully-connected layer follows at the end.) We start by training strong baselines for both AlexNet and VGG16, achieving 77.3% and 92.56% test accuracy on CIFAR-10, respectively. Note that performance on this dataset is nearly saturated. We then replace the first convolutional layer in AlexNet with the deterministic Prox-GD solver, since this layer is not preceded by a dropout layer. The results are summarized in Table 3. We observe that finetuning the baseline network with the solver leads to an improvement of 1.2%, without any change in the network's capacity. A similar improvement is observed on the harder CIFAR-100 dataset.

| | AlexNet | AlexNet-Prox-GD |
|---|---|---|
| CIFAR-10 | 77.30% | 78.51% |
| CIFAR-100 | 44.20% | 45.53% |

Table 3: Replacing the first convolutional layer of AlexNet by the deterministic Prox-GD solver yields consistent improvement in test accuracy on CIFAR-10 and CIFAR-100.

Results on VGG16 are summarized in Table 4. Note that VGG16 has two fully-connected layers, which are preceded by a $\text{BerDropout}_p$ layer with dropout rate $p = 0.5$. We start by replacing only the last layer with Prox-SG with 30 iterations and $\tau = n_1/2$ (VGG16-Prox-SG-ND-D). We further replace both fully-connected layers that include $\text{BerDropout}_p$ with solvers (VGG16-Prox-SG-D-D). We observe comparable performance for both settings on CIFAR-10. We conjecture that this might be due to the dataset being close to saturation. On CIFAR-100, a more pronounced increase in accuracy is observed, where VGG16-Prox-SG-ND-D outperforms the baseline by about 0.7%. We further replace the stochastic solver with a deterministic solver and leave the dropout layers unchanged. We denote this setting as VGG16-Prox-GD in Table 4. Interestingly, this setting performs the best on CIFAR-10 and comparably to VGG16-Prox-SG-ND-D on CIFAR-100.

| | VGG16 | VGG16-Prox-SG-ND-D | VGG16-Prox-SG-D-D | VGG16-Prox-GD |
|---|---|---|---|---|
| CIFAR-10 | 92.56% | 92.44% ± 0.028 | 92.57% ± 0.029 | 92.80% |
| CIFAR-100 | 70.27% | 70.95% ± 0.042 | 70.44% ± 0.077 | 71.10% |

Table 4: Experiments with the VGG16 architecture on CIFAR-10 and CIFAR-100. Accuracy is reported on the test set.

Dropout rate vs. τ-nice sampling. In this experiment, we demonstrate that the improvement in performance is consistently present across varying dropout rates. Since Proposition 1 has established a tight connection between the dropout rate $p$ and the sampling rate $\tau$ in (5), we observe that for different choices of dropout rate the baseline performance improves upon replacing a block of layers with a stochastic solver with the corresponding sampling rate $\tau$. We conduct experiments with VGG16 on CIFAR-100.
We train four different baseline models with varying choices of dropout rate $p\in\{0, 0.1, 0.9, 0.95\}$ for the last layer. We then replace this block with a stochastic solver with a sampling rate $\tau$ and finetune the network. Table 5 reports the accuracy of the baselines for varying dropout rates $p$ and compares it to the accuracy of the stochastic solver with corresponding $\tau$ (Prox-SG). With a high dropout rate, the performance of the baseline network drops drastically. When using the stochastic solver, we observe a much more graceful degradation. For example, with a sampling rate $\tau$ that corresponds to an extreme dropout rate of $p = 0.95$ (i.e. 95% of all input activations are masked out), the baseline network with $\text{BerDropout}_p$ suffers a 56% reduction in accuracy, while the stochastic solver declines by only 5%.

| Dropout rate p (Baseline) | Baseline accuracy | Sampling rate τ (Prox-SG) | Prox-SG accuracy |
|---|---|---|---|
| 0 | 70.57% | 512 | 70.87% |
| 0.10 | 70.56% | 461 | 70.51% ± 0.0198 |
| 0.50 | 70.27% | 256 | 70.95% ± 0.0419 |
| 0.90 | 68.34% | 51 | 69.19% ± 0.0589 |
| 0.95 | 30.61% | 26 | 67.42% ± 0.0774 |

Table 5: Comparison of the VGG16 architecture trained on CIFAR-100 with varying dropout rates $p$ in the last $\text{BerDropout}_p$ layer. We compare the baseline to its stochastic solver counterpart with corresponding sampling rate $\tau = (1-p)n_1$. Accuracy is reported on the test set.

In summary, our experiments show that replacing common layers in deep networks with stochastic solvers can lead to better performance without increasing the number of parameters in the network. The resulting networks are stable to train and exhibit high accuracy in cases where standard dropout is problematic, such as high dropout rates.

5 DISCUSSION

We have presented equivalences between layers in deep networks and stochastic solvers, and have shown that this can be leveraged to improve accuracy. The presented relationships open many doors for future work. For instance, our framework shows an intimate relation between a dropout layer and the sampling $S$ from the set $[n_1]$ in a stochastic algorithm. As a consequence, one can borrow theory from the stochastic optimization literature to propose new types of dropout layers. For example, consider a serial importance sampling strategy with Prox-SG to solve (5) (Zhao & Zhang, 2015; Xiao & Zhang, 2014), where serial sampling is a sampling that satisfies $\text{Prob}(i\in S, j\in S) = 0$ for $i\neq j$. A serial importance sampling $S$ from the set of functions $f_i(\mathcal{X})$ is the sampling such that $\text{Prob}(i\in S) \propto \|\nabla f_i(\mathcal{X})\| \leq L_i$, where $L_i$ is the Lipschitz constant of $\nabla f_i(\mathcal{X})$, i.e. each function from the set $[n_1]$ is sampled with a probability proportional to the norm of the gradient of the function. This sampling strategy is the optimal serial sampling $S$ that maximizes the rate of convergence when solving (5) (Zhao & Zhang, 2015). From a deep layer perspective, performing Prox-SG with importance sampling for a single iteration is equivalent to a forward pass through the same block of layers with a new dropout layer. Such a dropout layer will keep each input activation with a non-uniform probability proportional to the norm of the gradient. This is in contrast to $\text{BerDropout}_p$, where all input activations are kept with an equal probability $1-p$. Other types of dropout arise when considering non-serial importance sampling where $|S| = \tau > 1$.
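A rough sketch of what such an importance-sampling dropout layer could look like is given below. This is our own illustration, not a construction from the paper: the score used for the keep probabilities and the reweighting are simplifying assumptions.

```python
import numpy as np

def importance_dropout_mask(A, x_prev, tau, rng):
    """Keep input activation i with probability proportional to a proxy for the
    gradient norm of f_i at zero (here ||A(:, i)|| * |x_prev_i|), instead of
    uniformly as in BerDropout_p. Non-serial sampling of size tau without
    replacement; the score and reweighting are our own choices."""
    scores = np.linalg.norm(A, axis=0) * np.abs(x_prev) + 1e-12
    probs = scores / scores.sum()
    keep = rng.choice(len(x_prev), size=tau, replace=False, p=probs)
    mask = np.zeros_like(x_prev)
    # Reweight kept activations by 1/(n1 * p_i); this only approximately compensates
    # for the non-uniform, without-replacement sampling.
    mask[keep] = 1.0 / (len(x_prev) * probs[keep])
    return mask

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16)); x_prev = rng.standard_normal(16)
m = importance_dropout_mask(A, x_prev, tau=8, rng=rng)
print(np.maximum(A @ (m * x_prev), 0.0))   # dropout + linear + ReLU forward pass
```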
In summary, we have presented equivalences between stochastic solvers on a particular class of convex optimization problems and a forward pass through a dropout layer followed by a linear layer and a non-linear activation. Inspired by these equivalences, we have demonstrated empirically on multiple datasets and network architectures that replacing such network blocks with their corresponding stochastic solvers improves the accuracy of the model. We hope that the presented framework will contribute to a principled understanding of the theory and practice of deep network architectures.

Acknowledgments. This work was partially supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research.

REFERENCES

Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning (ICML), 2017.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2009.

Adel Bibi and Bernard Ghanem. High order tensor formulation for convolutional sparse coding. In International Conference on Computer Vision (ICCV), 2017.

Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade. Springer, 2012.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), 2009.

Alhussein Fawzi, Mike Davies, and Pascal Frossard. Dictionary learning for fast classification based on soft-thresholding. International Journal of Computer Vision, 2015.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), 2016a.

Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems (NIPS), 2016b.

Robert M. Gower, Peter Richtárik, and Francis Bach. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. arXiv preprint arXiv:1805.02632, 2018.

Benjamin D. Haeffele and René Vidal. Global optimality in neural network training. In Computer Vision and Pattern Recognition (CVPR), 2017.

Zhiwu Huang and Luc J. Van Gool. A Riemannian network for SPD matrix learning. In Association for the Advancement of Artificial Intelligence (AAAI), 2017.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), 2013.

Misha E. Kilmer and Carla D. Martin. Factorization strategies for third-order tensors. Linear Algebra and its Applications, 2011.

Erich Kobler, Teresa Klatzer, Kerstin Hammernik, and Thomas Pock. Variational networks: Connecting variational methods and deep learning. In German Conference on Pattern Recognition, 2017.

Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 2009.

Jakub Konečný, Jie Liu, Peter Richtárik, and Martin Takáč. Mini-batch semi-stochastic gradient descent in the proximal setting. Journal of Selected Topics in Signal Processing, 2016.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision. Springer, 1999.

Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 2015.

Tim Meinhardt, Michael Möller, Caner Hazirbas, and Daniel Cremers. Learning proximal operators: Using denoising networks for regularizing inverse imaging problems. In International Conference on Computer Vision (ICCV), 2017.

Dmitry Molchanov, Arsenii Ashukha, and Dmitry P. Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning (ICML), 2017.

Vardan Papyan, Yaniv Romano, and Michael Elad. Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research, 2017.

Vardan Papyan, Yaniv Romano, Jeremias Sulam, and Michael Elad. Theoretical foundations of deep learning via sparse representations: A multilayer sparse model and its connection to convolutional neural networks. IEEE Signal Processing Magazine, 2018.

Peter Richtárik and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 2016.

Hanie Sedghi, Vineet Gupta, and Philip M. Long. The singular values of convolutional layers. In International Conference on Learning Representations (ICLR), 2019.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Stefano Soatto and Alessandro Chiuso. Visual representations: Defining properties and deep approximations. arXiv preprint arXiv:1411.7676, 2014.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.

Jeremias Sulam, Aviad Aberdam, and Michael Elad. On multi-layer basis pursuit, efficient algorithms and convolutional neural networks. arXiv preprint arXiv:1806.00701, 2018.

Christoph Vogel and Thomas Pock. A primal dual network for low-level vision problems. In German Conference on Pattern Recognition, 2017.

Stefan Wager, Sida Wang, and Percy S. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems (NIPS), 2013.

Sida Wang and Christopher Manning. Fast dropout training. In International Conference on Machine Learning (ICML), 2013.

Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 2014.

Yan Yang, Jian Sun, Huibin Li, and Zongben Xu. Deep ADMM-Net for compressive sensing MRI. In Advances in Neural Information Processing Systems (NIPS), 2016.

Jian Zhang and Bernard Ghanem. ISTA-Net: Iterative shrinkage-thresholding algorithm inspired deep network for image compressive sensing. In Computer Vision and Pattern Recognition (CVPR), 2018.

Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning (ICML), 2015.

Zhihui Zhu and Michael B. Wakin. On the asymptotic equivalence of circulant and Toeplitz matrices. IEEE Transactions on Information Theory, 2017.
A LEAKY RELU AS A PROXIMAL OPERATOR

Proposition 3. The Leaky ReLU is the solution of the proximal operator of the function $g(x) = \frac{\gamma}{2}\sum_i\max^2(-x_i-\lambda, 0)$, with slope $\frac{1}{1+\gamma}$ and shift $\lambda$.

Proof. The proximal operator is defined as

$$\text{Prox}_g(a) = \arg\min_x \; \frac{1}{2}\|x-a\|_2^2 + \underbrace{\frac{\gamma}{2}\sum_i\max^2(-x_i-\lambda, 0)}_{\text{smooth}}.$$

Note that the problem is both convex and smooth. The optimality conditions are given by

$$x - a - \gamma\max(-x-\lambda\mathbf{1}, 0) = 0.$$

Since the problem is separable in coordinates, we have

$$x_i = \begin{cases} a_i & \text{if } x_i \geq -\lambda \\ \frac{a_i-\gamma\lambda}{1+\gamma} & \text{if } x_i \leq -\lambda \end{cases}
\qquad\Longrightarrow\qquad
\text{Prox}_g(a_i) = \begin{cases} a_i & \text{if } a_i \geq -\lambda \\ \frac{a_i-\gamma\lambda}{1+\gamma} & \text{if } a_i \leq -\lambda. \end{cases}$$

The Leaky ReLU is defined as

$$\text{LeakyReLU}(a_i) = \begin{cases} a_i & \text{if } a_i \geq 0 \\ \alpha a_i & \text{if } a_i \leq 0, \end{cases}$$

which shows that $\text{Prox}_g$ is a generalized form of the Leaky ReLU with a shift of $\lambda$ and a slope $\alpha = \frac{1}{1+\gamma}$. With $\lambda = 0$ and $\gamma = \frac{1-\alpha}{\alpha}$, the standard form of the Leaky ReLU is recovered.

B APPROXIMATION OF A SOFTPLUS AS A PROXIMAL OPERATOR

Proposition 4. The proximal operator of $g(x) = -\gamma\sum_i\log(x_i)$ approximates the SoftPlus activation.

Proof. The proximal operator is defined as

$$\text{Prox}_g(a) = \arg\min_x \; \frac{1}{2}\|x-a\|_2^2 - \gamma\sum_i\log(x_i). \qquad (9)$$

Note that the function $g(x)$ is elementwise separable, convex, and smooth. By equating the gradient to zero and taking the positive solution of the resulting quadratic polynomial, we arrive at the closed-form solution

$$\text{Prox}_g(a) = \frac{1}{2}a + \sqrt{\frac{1}{4}\,a\odot a + \gamma}, \qquad (10)$$

where $\odot$ denotes elementwise multiplication and the square root is taken elementwise. It is easy to see that this operator is close to zero for $a_i \ll 0$ and close to $a_i$ for $a_i \gg 0$, with a smooth transition for small $|a_i|$. Note that the function $\text{Prox}_g(a)$ approximates the activation $\text{SoftPlus}(a) = \log(1+\exp(a))$ very well. An illustrative example is shown in Figure 2.

C APPROXIMATION OF TANH AND SIGMOID AS A PROXIMAL OPERATOR

Proposition 5. The proximal operator of $g(x) = -\gamma\sum_i\log(1-x_i^2)$ approximates the Tanh non-linearity.

Figure 2: Comparison of common activation functions and their approximation via the corresponding proximal operator. Left: SoftPlus activation and corresponding proximal operator with γ = 1. Right: Tanh activation and corresponding proximal operator with γ = 0.1.

Proof. To simplify the exposition, we derive the proximal operator for the case $\gamma = 1$; the general case $\gamma > 0$ follows analogously. The proximal operator is defined as

$$\text{Prox}_g(a) = \arg\min_x \; \frac{1}{2}\|x-a\|_2^2 - \sum_i\log(1-x_i^2). \qquad (11)$$

Note that the logarithm is taken elementwise, and the objective is convex and smooth. By equating the gradient to zero, it can be seen that the optimal solution is a root of a cubic equation,

$$x_i^3 - a_i x_i^2 - 3x_i + a_i = 0, \qquad (12)$$

which is defined in each coordinate $i$ separately. Let

$$p = \frac{a_i^2}{9} + 1, \qquad q = \frac{a_i^3}{27}. \qquad (13)$$

Since $q^2 - p^3 < 0$ for all $a_i\in\mathbb{R}$, it is guaranteed that all roots are real and distinct. Consequently, the roots can be described as

$$r_k = \frac{a_i}{3} + 2\sqrt{p}\,\cos\Big(\frac{1}{3}\cos^{-1}\Big(\frac{q}{p^{3/2}}\Big) + \frac{2\pi k}{3}\Big), \qquad k\in\{0,1,2\}. \qquad (14)$$

Since $g(x)$ is only defined on $x\in[-1,1]^d$, the root that minimizes (11) has to satisfy

$$-1 \leq r_{\text{opt}} \leq 1. \qquad (15)$$

Note that $y = \cos^{-1}\big(q/p^{3/2}\big)$ implies $0\leq y\leq\pi$, so the cosine terms in (14) fall into disjoint ranges for $k\in\{0,1,2\}$. By substituting into (14) and checking inequality (15), it becomes clear that the root corresponding to $k=2$ minimizes (11). By using trigonometric identities, this root can be further simplified to

$$\big(\text{Prox}_g(a)\big)_i = \frac{a_i}{3} - 2\sqrt{p}\,\sin\Big(\frac{1}{3}\sin^{-1}\Big(\frac{q}{p^{3/2}}\Big)\Big),$$

which has the approximate shape of the Tanh activation. An example of this operator is shown in Figure 2. The proximal operator corresponding to the Sigmoid activation can be derived in a similar fashion by setting $g(x) = -\gamma\log(x) - \gamma\log(1-x)$.
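A small numerical check of the closed form in (10) (our own sketch; the stationarity test and sample points are arbitrary choices):

```python
import numpy as np

def prox_neglog(a, gamma=1.0):
    """Closed form of Prox_g for g(x) = -gamma * sum_i log(x_i), Eq. (10):
    the positive root of x^2 - a*x - gamma = 0, applied elementwise."""
    a = np.asarray(a, dtype=float)
    return 0.5 * a + np.sqrt(0.25 * a ** 2 + gamma)

a = np.linspace(-8.0, 8.0, 9)
x = prox_neglog(a)

# Stationarity of 0.5*(x - a)^2 - gamma*log(x):  (x - a) - gamma/x = 0.
assert np.allclose(x - a - 1.0 / x, 0.0, atol=1e-9)

# Rough comparison with SoftPlus(a) = log(1 + exp(a)), the approximated activation.
print(np.round(x, 3))
print(np.round(np.log1p(np.exp(a)), 3))
```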
D TENSOR PRELIMINARIES

The exposition presented in the paper requires some definitions related to tensors and operators on tensors, which we summarize here. In all subsequent definitions, we assume $\mathcal{D}\in\mathbb{R}^{n_1\times n_2\times n_3\times n_4}$ and $\mathcal{X}\in\mathbb{R}^{n_2\times 1\times n_3\times n_4}$.

Definition D.1. (Bibi & Ghanem, 2017) The t-product between high-order tensors is defined as

$$\mathcal{D} *_{HO} \mathcal{X} = \text{fold}_{HO}\big(\text{circ}_{HO}(\mathcal{D})\,\text{MatVec}_{HO}(\mathcal{X})\big), \qquad (19)$$

where $\text{circ}_{HO}(\mathcal{D})\in\mathbb{R}^{n_1n_3n_4\times n_2n_3n_4}$ and $\text{MatVec}_{HO}(\mathcal{X})\in\mathbb{R}^{n_2n_3n_4\times 1}$. The operator $\text{circ}_{HO}(\cdot)$ unfolds an input tensor into a structured matrix, while $\text{MatVec}_{HO}(\cdot)$ unfolds an input tensor into a vector. The fold and unfold procedures are detailed in Bibi & Ghanem (2017).

Definition D.2. (Bibi & Ghanem, 2017) The operator $\text{bdiag}(\cdot):\mathbb{C}^{n_1\times n_2\times n_3\times n_4}\to\mathbb{C}^{n_1n_3n_4\times n_2n_3n_4}$ maps a tensor to the block-diagonal matrix of all its frontal faces,

$$\text{bdiag}(\mathcal{D}) = \begin{bmatrix} \mathcal{D}(:,:,1,1) & & & \\ & \ddots & & \\ & & \mathcal{D}(:,:,i,j) & \\ & & & \ddots \\ & & & & \mathcal{D}(:,:,n_3,n_4) \end{bmatrix}, \qquad (20)$$

with zero blocks $0_{n_1\times n_2}$ off the diagonal. Note that if $n_3 = n_4 = 1$, $\text{bdiag}(\cdot)$ is an identity mapping. Moreover, if $n_1 = n_2 = 1$, $\text{bdiag}(\mathcal{D})$ is a diagonal matrix.

Due to the structure of the tensor unfold $\text{circ}_{HO}(\cdot)$, the resultant matrix $\text{circ}_{HO}(\mathcal{D})$ exhibits the following blockwise diagonalization:

$$\text{circ}_{HO}(\mathcal{D}) = (F_{n_4}\otimes F_{n_3}\otimes I_{n_1})\,\text{bdiag}(\hat{\mathcal{D}})\,(F_{n_4}\otimes F_{n_3}\otimes I_{n_2})^H, \qquad (21)$$

where $F_n$ is the $n\times n$ normalized Discrete Fourier Transform matrix. Note that $\hat{\mathcal{D}}$ has the dimensions $n_3$ and $n_4$ replaced with the corresponding 2D Discrete Fourier Transforms; that is, $\hat{\mathcal{D}}(i,j,:,:)$ is the 2D Discrete Fourier Transform of $\mathcal{D}(i,j,:,:)$. For more details, the reader is advised to start with third-order tensors in the work of Kilmer & Martin (2011) and move to the work of Bibi & Ghanem (2017) for the extension to higher orders.
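The block diagonalization in (21) suggests a simple Fourier-domain implementation of the t-product. The sketch below is our own code under that reading (per spatial frequency, an ordinary matrix product of frontal slices), not the authors' implementation.

```python
import numpy as np

def ho_product(D, X):
    """High-order t-product D *_HO X for D of shape (n2, n1, W, H) and a vector
    tensor X of shape (n1, 1, W, H), computed per spatial frequency."""
    D_hat = np.fft.fft2(D, axes=(2, 3))
    X_hat = np.fft.fft2(X, axes=(2, 3))
    n2, n1, W, H = D.shape
    Y_hat = np.zeros((n2, 1, W, H), dtype=complex)
    for i in range(W):
        for j in range(H):
            # Frontal-slice matrix product at frequency (i, j), as in (21).
            Y_hat[:, :, i, j] = D_hat[:, :, i, j] @ X_hat[:, :, i, j]
    return np.real(np.fft.ifft2(Y_hat, axes=(2, 3)))

# Sanity check: for W = H = 1 the product reduces to a matrix-vector product.
rng = np.random.default_rng(0)
D = rng.standard_normal((3, 4, 1, 1)); X = rng.standard_normal((4, 1, 1, 1))
assert np.allclose(ho_product(D, X)[:, 0, 0, 0], D[:, :, 0, 0] @ X[:, 0, 0, 0])
```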
E PROOF OF LEMMA 1

Since non-linear activations commonly used in deep learning are elementwise, we only need to show that $\mathcal{A} *_{HO} \mathcal{X}^{l-1}$ performs a convolution. In particular, we need to show that $\mathcal{A} *_{HO} \mathcal{X}^{l-1}$ (i) performs 2D convolutions spatially along the third and fourth dimensions, (ii) aggregates the result along the feature dimension $n_1$, and (iii) repeats the procedure for each of the $n_2$ filters independently. We show this by direct manipulation of the properties of $*_{HO}$ (Bibi & Ghanem, 2017):

$$\mathcal{A} *_{HO} \mathcal{X}^{l-1} = \sum_i \mathcal{A}(:,i,:,:) *_{HO} \mathcal{X}^{l-1}(i,:,:,:) = \sum_i \text{fold}_{HO}\big(\text{circ}_{HO}(\mathcal{A}(:,i,:,:))\,\text{MatVec}_{HO}(\mathcal{X}^{l-1}(i,:,:,:))\big). \qquad (22)$$

Equation (22) shows that features are aggregated along the $n_1$ dimension. It remains to show that $\mathcal{A}(:,i,:,:) *_{HO} \mathcal{X}^{l-1}(i,:,:,:)$ performs $n_2$ independent 2D convolutions on the $i$th channel; this proves Lemma 1. For ease of notation, consider two tensors $\mathcal{U}\in\mathbb{R}^{n_2\times 1\times W\times H}$ and $\mathcal{Y}\in\mathbb{R}^{1\times 1\times W\times H}$. Then we have

$$\mathcal{U} *_{HO} \mathcal{Y} \overset{(19)}{=} \text{fold}_{HO}\big(\text{circ}_{HO}(\mathcal{U})\,\text{MatVec}_{HO}(\mathcal{Y})\big) \overset{(21)}{=} \text{fold}_{HO}\big((F_H\otimes F_W\otimes I_{n_2})\,\text{bdiag}(\hat{\mathcal{U}})\,(F_H\otimes F_W)^H\,\text{MatVec}_{HO}(\mathcal{Y})\big)$$
$$= \text{fold}_{HO}\big((F_H\otimes F_W\otimes I_{n_2})\,\text{bdiag}(\hat{\mathcal{U}})\,\text{MatVec}_{HO}(\hat{\mathcal{Y}})\big) \overset{(20)}{=} \text{fold}_{HO}\big((F_H\otimes F_W\otimes I_{n_2})\,\hat{\mathcal{G}}\big),$$

where $\hat{\mathcal{G}}$ stacks the products $\hat{\mathcal{U}}(:,1,i,j)\,\hat{\mathcal{Y}}(1,1,i,j)$ over all spatial frequencies $(i,j)$, and $(F_H\otimes F_W)^H\,\text{MatVec}_{HO}(\mathcal{Y})$ is the 2D Fourier transform of $\mathcal{Y}$. Note that $\hat{\mathcal{G}}$ is the elementwise product of the 2D Discrete Fourier Transform of the input activation feature $\mathcal{Y}$ with the 2D Discrete Fourier Transform of every one of the $n_2$ filters in $\mathcal{U}$. Since $(F_H\otimes F_W\otimes I_{n_2})$ applies the inverse 2D Fourier transform along each of the $n_2$ filters of $\hat{\mathcal{G}}$, $\mathcal{U} *_{HO} \mathcal{Y}$ performs 2D convolutions independently along each of the $n_2$ filters. Combined with (22), Lemma 1 is proven.

F PROOF OF LEMMA 2

$$L = \lambda_{\max}\big(\nabla^2 F(\mathcal{X})\big) = \lambda_{\max}\Big(\nabla^2_{\mathcal{X}}\sum_i \tfrac{1}{2}\big\|\mathcal{A}^H(i,:,:,:) *_{HO} \mathcal{X} - \mathcal{X}^{l-1}(i,:,:,:)\big\|_F^2\Big)$$
$$\overset{(19)}{=} \lambda_{\max}\Big(\nabla^2\sum_i \tfrac{1}{2}\big\|\text{circ}_{HO}(\mathcal{A}^H(i,:,:,:))\,\text{MatVec}(\mathcal{X}) - \text{MatVec}(\mathcal{X}^{l-1}(i,:,:,:))\big\|_2^2\Big)$$
$$= \lambda_{\max}\Big(\sum_i \text{circ}_{HO}^H(\mathcal{A}^H(i,:,:,:))\,\text{circ}_{HO}(\mathcal{A}^H(i,:,:,:))\Big) = \lambda_{\max}\Big(\text{circ}_{HO}^H(\mathcal{A}^H)\,\text{circ}_{HO}(\mathcal{A}^H)\Big). \qquad (23)$$

The second equality follows from the separability of $\|\cdot\|_F^2$, where we dropped $\text{fold}_{HO}$. The last equality follows from the linearity of $*_{HO}$. Moreover, from (21), we have

$$\text{circ}_{HO}^H(\mathcal{A}^H)\,\text{circ}_{HO}(\mathcal{A}^H) = \Big[(F_H\otimes F_W\otimes I_{n_1})\,\text{bdiag}(\hat{\mathcal{A}}^H)\,(F_H\otimes F_W\otimes I_{n_2})^H\Big]^H\Big[(F_H\otimes F_W\otimes I_{n_1})\,\text{bdiag}(\hat{\mathcal{A}}^H)\,(F_H\otimes F_W\otimes I_{n_2})^H\Big]$$
$$= (F_H\otimes F_W\otimes I_{n_2})\,\text{bdiag}(\hat{\mathcal{A}})\,\text{bdiag}(\hat{\mathcal{A}}^H)\,(F_H\otimes F_W\otimes I_{n_2})^H$$
$$\overset{(20)}{=} (F_H\otimes F_W\otimes I_{n_2})\,\text{blkdiag}\big(\hat{\mathcal{A}}(:,:,i,j)\,\hat{\mathcal{A}}^H(:,:,i,j)\big)_{i,j}\,(F_H\otimes F_W\otimes I_{n_2})^H$$
$$= (F_H\otimes F_W\otimes I_{n_2})\,\text{blkdiag}\big(U_{ij}\,\Sigma_{ij}\,U_{ij}^H\big)_{i,j}\,(F_H\otimes F_W\otimes I_{n_2})^H. \qquad (24)$$

The second equality follows from the orthogonality of $F_H\otimes F_W\otimes I_{n_1}$. The last equality follows from the eigendecomposition $\hat{\mathcal{A}}(:,:,i,j)\,\hat{\mathcal{A}}^H(:,:,i,j) = U_{ij}\Sigma_{ij}U_{ij}^H$, so that the columns of $(F_H\otimes F_W\otimes I_{n_2})\,\text{blkdiag}(U_{ij})_{i,j}$ are eigenvectors. Thus, by combining (23) and (24), we have

$$L = \lambda_{\max}\big(\nabla^2 F(\mathcal{X})\big) = \lambda_{\max}\big(\text{blkdiag}(\Sigma_{ij})_{i,j}\big) = \max_{i\in\{1,\dots,W\},\,j\in\{1,\dots,H\}}\lambda_{\max}(\Sigma_{ij}) = \max_{i,j}\Big\{\lambda_{\max}\big(\hat{\mathcal{A}}(:,:,i,j)\,\hat{\mathcal{A}}^H(:,:,i,j)\big)\Big\}.$$

G PROOF OF PROPOSITION 1

Lemma 3. For τ-nice Prox-SG,

$$\Psi_\tau(\mathcal{X}) = \frac{n_1}{2\tau}\sum_{i\in S}\big\|\mathcal{A}^H(i,:,:,:) *_{HO} \mathcal{X} - \mathcal{X}^{l-1}(i,:,:,:)\big\|_F^2 - \langle\mathcal{B},\mathcal{X}\rangle$$

is an unbiased estimator of $F(\mathcal{X})$.

Proof.

$$\mathbb{E}\big[\Psi_\tau(\mathcal{X})\big] = \mathbb{E}\Big[\frac{n_1}{2\tau}\sum_{i\in S}\big\|\mathcal{A}^H(i,:,:,:) *_{HO} \mathcal{X} - \mathcal{X}^{l-1}(i,:,:,:)\big\|_F^2\Big] - \langle\mathcal{B},\mathcal{X}\rangle$$
$$= \frac{n_1}{2\tau}\sum_{i=1}^{n_1}\big\|\mathcal{A}^H(i,:,:,:) *_{HO} \mathcal{X} - \mathcal{X}^{l-1}(i,:,:,:)\big\|_F^2\;\mathbb{E}[\mathbb{1}_{i\in S}] - \langle\mathcal{B},\mathcal{X}\rangle$$
$$= \frac{1}{2}\sum_{i=1}^{n_1}\big\|\mathcal{A}^H(i,:,:,:) *_{HO} \mathcal{X} - \mathcal{X}^{l-1}(i,:,:,:)\big\|_F^2 - \langle\mathcal{B},\mathcal{X}\rangle = F(\mathcal{X}).$$

The first equality follows by introducing an indicator function with $\mathbb{1}_{i\in S} = 1$ if $i\in S$ and zero otherwise. The last equality follows from the uniformity across elements of the τ-nice sampling $S$, for which $\mathbb{E}[\mathbb{1}_{i\in S}] = \tau/n_1$.

From Lemma 3, and with zero initialization, it follows that

$$\nabla\Psi_\tau(\mathcal{X})\big|_{\mathcal{X}=0} = \nabla\Big[\frac{n_1}{2\tau}\sum_{i\in S}\big\|\text{circ}_{HO}(\mathcal{A}^H(i,:,:,:))\,\text{MatVec}(\mathcal{X}) - \text{MatVec}(\mathcal{X}^{l-1}(i,:,:,:))\big\|_2^2 - \langle\mathcal{B},\mathcal{X}\rangle\Big]\Big|_{\mathcal{X}=0}$$
$$= -\frac{n_1}{\tau}\sum_{i\in S}\text{circ}_{HO}^H(\mathcal{A}^H(i,:,:,:))\,\text{MatVec}(\mathcal{X}^{l-1}(i,:,:,:)) - \mathcal{B} = -\frac{n_1}{\tau}\sum_{i\in S}\mathcal{A}(:,i,:,:) *_{HO} \mathcal{X}^{l-1}(i,:,:,:) - \mathcal{B} \qquad (25)$$

is an unbiased estimator of $\nabla F(\mathcal{X})\big|_{\mathcal{X}=0}$. The last equality follows by noting that $\mathcal{A} = (\mathcal{A}^H)^H$. Therefore, the first iteration of τ-nice Prox-SG with zero initialization and unit step size is

$$\mathcal{X}^l = \text{Prox}_{\frac{1}{L}g}\big(-\nabla\Psi_\tau(\mathcal{X})\big|_{\mathcal{X}=0}\big) = \text{Prox}_{\frac{1}{L}g}\Big(\frac{n_1}{\tau}\sum_{i\in S}\mathcal{A}(:,i,:,:) *_{HO} \mathcal{X}^{l-1}(i,:,:,:) + \mathcal{B}\Big). \qquad (26)$$

Note that the stochastic sum in (26) with $\tau = (1-p)n_1$ can be reparameterized as follows:

$$\frac{n_1}{\tau}\sum_{i=1}^{n_1}\mathcal{A}(:,i,:,:) *_{HO} \mathcal{X}^{l-1}(i,:,:,:)\,\mathbb{1}_{i\in S} = \frac{n_1}{\tau}\,\mathcal{A} *_{HO}\big(\mathcal{M}\odot\mathcal{X}^{l-1}\big) = \frac{n_1}{\tau}\,\mathcal{A} *_{HO}\text{BerDropout}_p\big(\mathcal{X}^{l-1}\big), \qquad (27)$$

where $\mathcal{M}\in\mathbb{R}^{n_1\times 1\times W\times H}$ is a mask tensor and $\odot$ denotes elementwise multiplication. Note that since $\tau = (1-p)n_1$, $\mathcal{M}$ has exactly $pn_1$ slices $\mathcal{M}(i,:,:,:)$ that are completely zero. This is equivalent to a dropout layer that drops exactly $pn_1$ input activations. It follows that (27) is equivalent to a forward pass through a $\text{BerDropout}_p$ layer followed by a linear layer and a non-linear activation.

H PROOF OF PROPOSITION 2

mS2GD (Konečný et al., 2016) with unit step size defines the following update for the inner iterate $\mathcal{X}$, where $\mathcal{Y}$ is the reference point at which the full gradient is computed:

$$\mathcal{X}^l = \text{Prox}_{\frac{1}{L}g}\Big(\mathcal{X} - \Big(\nabla F(\mathcal{Y}) + \frac{1}{\tau}\sum_{i\in S}\big(\nabla f_i(\mathcal{X}) - \nabla f_i(\mathcal{Y})\big)\Big)\Big). \qquad (28)$$
With zero initialization at the first epoch we have $\mathcal{Y} = 0$; therefore,

$$\nabla F(\mathcal{Y})\big|_{\mathcal{Y}=0} = \frac{1}{n_1}\sum_i\nabla f_i(\mathcal{Y})\big|_{\mathcal{Y}=0} \overset{(25),\;\tau=n_1}{=} -\mathcal{A} *_{HO} \mathcal{X}^{l-1} - \mathcal{B}. \qquad (29)$$

Moreover,

$$\frac{1}{\tau}\sum_{i\in S}\big(\nabla f_i(\mathcal{X}) - \nabla f_i(\mathcal{Y})\big|_{\mathcal{Y}=0}\big) = \frac{1}{\tau}\sum_{i\in S}\Big(n_1\,\text{circ}_{HO}^H(\mathcal{A}^H(i,:,:,:))\big[\text{circ}_{HO}(\mathcal{A}^H(i,:,:,:))\,\text{MatVec}(\mathcal{X}) - \text{MatVec}(\mathcal{X}^{l-1}(i,:,:,:))\big] - \mathcal{B}\Big)$$
$$\qquad - \frac{1}{\tau}\sum_{i\in S}\Big(-n_1\,\text{circ}_{HO}^H(\mathcal{A}^H(i,:,:,:))\,\text{MatVec}(\mathcal{X}^{l-1}(i,:,:,:)) - \mathcal{B}\Big)$$
$$= \frac{n_1}{\tau}\sum_{i\in S}\text{circ}_{HO}^H(\mathcal{A}^H(i,:,:,:))\,\text{circ}_{HO}(\mathcal{A}^H(i,:,:,:))\,\text{MatVec}(\mathcal{X}) = \frac{n_1}{\tau}\sum_{i\in S}\mathcal{A}(:,i,:,:) *_{HO} \mathcal{A}^H(i,:,:,:) *_{HO} \mathcal{X}. \qquad (30)$$

Substituting both (29) and (30) into (28), we have

$$\mathcal{X}^l = \text{Prox}_{\frac{1}{L}g}\Big(\mathcal{X} + \mathcal{A} *_{HO} \mathcal{X}^{l-1} + \mathcal{B} - \frac{n_1}{\tau}\sum_{i\in S}\mathcal{A}(:,i,:,:) *_{HO} \mathcal{A}^H(i,:,:,:) *_{HO} \mathcal{X}\Big)$$
$$= \text{Prox}_{\frac{1}{L}g}\Big(\underbrace{\mathcal{A} *_{HO} \mathcal{X}^{l-1} + \mathcal{B}}_{\text{Fully-connected/Convolutional}} + \Big(\mathcal{I} - \frac{n_1}{\tau}\sum_{i\in S}\mathcal{A}(:,i,:,:) *_{HO} \mathcal{A}^H(i,:,:,:)\Big) *_{HO} \mathcal{X}\Big).$$

Note that $\mathcal{I}\in\mathbb{R}^{n_2\times n_2\times W\times H}$ is the identity tensor, whose first frontal slice is the identity matrix, $\mathcal{I}(:,:,1,1) = I_{n_2\times n_2}$, while all other slices are zeros. Following the definition in (19), we have $\mathcal{I} *_{HO} \mathcal{X} = \mathcal{X}$.

I RANDOMIZED COORDINATE DESCENT PERMUTES DROPOUT AND LINEAR LAYER

We present an additional insight into the role of stochastic solvers on (5) in network design. In particular, we show that performing randomized coordinate descent (RCD) on (5), ignoring the finite-sum structure, is equivalent to a linear transformation followed by $\text{BerDropout}_p$ and a non-linear activation. That is, performing RCD permutes the order of the linear transformation and dropout. For ease of notation, we show this for the special case of fully-connected layers.

Proposition 6. A single iteration of randomized coordinate descent, e.g. NSync (Richtárik & Takáč, 2016), with τ-nice sampling of the coordinates of (5), $\tau = (1-p)n_2$, unit step sizes along each partial derivative, and zero initialization is equivalent to

$$\mathcal{X}^l = \text{Prox}_g\Big(\sum_{i\in S}\mathcal{A}(i,:,:,:) *_{HO} \mathcal{X}^{l-1} + \mathcal{B}(i,:,:,:)\Big), \qquad (31)$$

which is equivalent to a forward pass through a linear layer followed by a $\text{BerDropout}_p$ layer (that drops exactly $n_2 p$ output activations) followed by a non-linear activation.

Proof. We provide a sketch of the proof for the simple quadratic $F(x) = \frac{1}{2}\|A^\top x - x^{l-1}\|_2^2 - b^\top x$, where the linear layer is a fully-connected layer. Considering randomized coordinate descent, e.g. NSync, with τ-nice sampling of the coordinates, we have

$$x_1 = \text{Prox}_g\Big(x - \sum_{i\in S}\frac{1}{v_i}\big(e_i^\top\nabla F(x)\big)e_i\Big)\Big|_{x=0} = \text{Prox}_g\Big(-\sum_{i\in S}\frac{1}{v_i}\big(e_i^\top\big(AA^\top x - Ax^{l-1} - b\big)\big)e_i\Big)\Big|_{x=0}$$
$$= \text{Prox}_g\Big(\sum_{i\in S}\frac{1}{v_i}\big(e_i^\top\big(Ax^{l-1} + b\big)\big)e_i\Big) = \text{Prox}_g\Big(\sum_{i\in S}\Big(e_i^\top\big[\text{Diag}(v)^{-1}\big(Ax^{l-1} + b\big)\big]\Big)e_i\Big) \overset{v=\mathbf{1}}{=} \text{Prox}_g\Big(\text{BerDropout}_p\big(Ax^{l-1} + b\big)\Big). \qquad (32)$$

Note that $e_i$ is a vector of all zeros except the $i$th coordinate, which is equal to 1. Moreover, since the step size along each partial derivative is 1, $v = \mathbf{1}$. Equation (32) is equivalent to a forward pass through a linear layer followed by a $\text{BerDropout}_p$ layer and a non-linear activation.

J RESNET BLOCKS

Here we briefly discuss how a ResNet block can be incorporated as an iteration of a solver. The key observation is that a ResNet block has two non-linearities connected through a skip connection. In particular, consider the $l$th ResNet block with two linear operators, denoted $\mathcal{A}^{(1)}$ and $\mathcal{A}^{(2)}$, with biases $\mathcal{B}^{(1)}$ and $\mathcal{B}^{(2)}$, respectively. The input activations to the ResNet block are $\mathcal{X}^{l-1}$, the output activations of the previous layer, and $\mathcal{X}^l$ are the output activations of the block. A forward pass through a ResNet block exhibits the following functional form:

$$\mathcal{X}^l = \text{ReLU}\Big(\mathcal{A}^{(2)} *_{HO} \text{ReLU}\big(\mathcal{A}^{(1)} *_{HO} \mathcal{X}^{l-1} + \mathcal{B}^{(1)}\big) + \mathcal{B}^{(2)} + \mathcal{X}^{l-1}\Big).$$

This can be seen as two blocks of a linear operation followed by a non-linearity, composed such that

$$\tilde{\mathcal{X}} = \text{ReLU}\big(\mathcal{A}^{(1)} *_{HO} \mathcal{X}^{l-1} + \mathcal{B}^{(1)}\big), \qquad \mathcal{X}^l = \text{ReLU}\big(\mathcal{A}^{(2)} *_{HO} \tilde{\mathcal{X}} + \mathcal{B}^{(2)} + \mathcal{X}^{l-1}\big).$$

Note that a batch normalization layer is linear during the forward pass through the network and can thus be absorbed into $\mathcal{A}$ and $\mathcal{B}$.
Consequently, a forward pass through a ResNet block can be viewed as performing a single iteration of two consecutive stochastic or deterministic solvers (depending on the presence or absence of dropout layers) on (5).
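A minimal sketch of that view for fully-connected sub-blocks is given below. It is our own illustration under stated assumptions: the ReLU prox is used as the non-linearity, the blocks are deterministic (no dropout), and the $1/L$ scaling is kept explicit even though the paper's informal equivalence absorbs it.

```python
import numpy as np

def prox_block(A, b, x_prev, skip=None):
    """One deterministic Prox-GD step (zero init, g = indicator of x >= 0) on the
    fully-connected quadratic: ReLU((A @ x_prev + b) / L), optionally with a
    residual term added to the prox argument as in the ResNet form above."""
    L = np.linalg.eigvalsh(A @ A.T).max()
    arg = (A @ x_prev + b) / L
    if skip is not None:
        arg = arg + skip
    return np.maximum(arg, 0.0)

# A ResNet-style block viewed as two consecutive solver iterations.
rng = np.random.default_rng(0)
n = 8
A1, A2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
b1, b2 = rng.standard_normal(n), rng.standard_normal(n)
x_prev = rng.standard_normal(n)

x_mid = prox_block(A1, b1, x_prev)              # first linear + ReLU sub-block
x_out = prox_block(A2, b2, x_mid, skip=x_prev)  # second sub-block with skip connection
print(x_out)
```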