Understanding Deep Architectures with Reasoning Layer

Xinshi Chen (Georgia Institute of Technology, xinshi.chen@gatech.edu), Yufei Zhang (University of Oxford, yufei.zhang@maths.ox.ac.uk), Christoph Reisinger (University of Oxford, christoph.reisinger@maths.ox.ac.uk), Le Song (Georgia Institute of Technology, lsong@cc.gatech.edu)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

Recently, there has been a surge of interest in combining deep learning models with reasoning in order to handle more sophisticated learning tasks. In many cases, a reasoning task can be solved by an iterative algorithm. This algorithm is often unrolled and used as a specialized layer in the deep architecture, which can be trained end-to-end with other neural components. Although such hybrid deep architectures have led to many empirical successes, the theoretical foundation of such architectures, especially the interplay between algorithm layers and other neural layers, remains largely unexplored. In this paper, we take an initial step towards an understanding of such hybrid deep architectures by showing that properties of the algorithm layers, such as convergence, stability and sensitivity, are intimately related to the approximation and generalization abilities of the end-to-end model. Furthermore, our analysis matches closely our experimental observations under various conditions, suggesting that our theory can provide useful guidelines for designing deep architectures with reasoning modules (i.e., algorithm layers).

1 Introduction

Figure 1: Hybrid architecture: a reasoning module $\mathrm{Alg}_\varphi(E_\theta(x,\cdot))$ stacked on top of a neural module.

Many real world applications require perception and reasoning to work together to solve a problem. Perception refers to the ability to understand and represent inputs, while reasoning refers to the ability to follow prescribed steps and derive answers satisfying certain constraints. To tackle such sophisticated learning tasks, there has recently been a surge of interest in combining deep perception models with reasoning modules (or algorithm layers). Typically, a reasoning module is stacked on top of a neural module and treated as an additional layer of the overall deep architecture; then all the parameters in the architecture are optimized end-to-end with loss gradients (Fig 1). Very often these reasoning modules can be implemented as unrolled iterative algorithms, which can solve more sophisticated tasks with carefully designed and interpretable operations. For instance, SATNet [1] integrated a satisfiability solver into its deep model as a reasoning module; E2Efold [2] used a constrained optimization algorithm on top of a neural energy network to predict and reason about RNA structures, while [3] used an optimal transport algorithm as a reasoning module for learning to sort. Other algorithms such as ADMM [4, 5], Langevin dynamics [6], inductive logic programming [7], DP [8], k-means clustering [9], message passing [10, 11], and power iterations [12] are also used as differentiable reasoning modules in deep models for various learning tasks. Thus, in the remainder of the paper, we use reasoning module and algorithm layer interchangeably.

While these previous works have demonstrated the effectiveness of combining deep learning with reasoning, the theoretical underpinning of such hybrid deep architectures remains largely unexplored.
For instance, what is the benefit of using a reasoning module based on unrolled algorithms compared to generic architectures such as recurrent neural networks (RNN)? How exactly will the reasoning module affect the generalization ability of the deep architecture? For different algorithms that can solve the same task, what are their differences when used as reasoning modules in deep models? Despite the rich literature on rigorous analysis of algorithm properties, there is a paucity of work leveraging these analyses to formally study the learning behavior of deep architectures containing algorithm layers. This motivates us to ask the crucial and timely question: How will the algorithm properties of an algorithm layer affect the learning behavior of deep architectures containing such layers?

In this paper, we provide a first step towards an answer to this question by analyzing the approximation and generalization abilities of such hybrid deep architectures. To the best of our knowledge, such an analysis has not been done before, and it faces several difficulties: 1) the analysis of certain algorithm properties such as convergence can be complex by itself; 2) models based on highly structured iterative algorithms have rarely been analyzed before; 3) the bound needs to be sharp enough to match empirical observations. In this new setting, the complexity of the algorithm analysis and that of the generalization analysis are intertwined, making the analysis even more challenging.

Summary of results. We find that standard Rademacher complexity analysis, widely used for neural networks [13, 14, 15], is insufficient for explaining the behavior of these hybrid architectures. Thus we resort to a more refined local Rademacher complexity analysis [16, 17], and find the following:

Relation to algorithm properties. Algorithm properties such as convergence, stability and sensitivity all play important roles in the generalization ability of the hybrid architecture. Generally speaking, an algorithm layer that converges faster, is more stable and is less sensitive will better approximate the joint perception and reasoning task, while at the same time generalizing better.

Which algorithm? There is a tradeoff: a faster converging algorithm has to be less stable [18]. Therefore, depending on the precise setting, the best choice of algorithm layer may differ. Our theorem reveals that when the neural module is over- or under-parameterized, the stability of the algorithm layer can be more important than its convergence; but when the neural module has an about-right parameterization, a faster converging algorithm layer may give better generalization.

What depth? With deeper algorithm layers, the representation ability gets better, but the generalization becomes worse if the neural module is over- or under-parameterized. Only when it has about-right complexity can deeper algorithm layers induce both better representation and better generalization.

What if RNN? It has been shown that RNN (or graph neural networks, GNN) can represent reasoning and iterative algorithms [19, 15]. Using RNN as an example, we demonstrate in Sec 6 that such generic reasoning modules can also be analyzed under our framework, revealing that RNN layers induce better representation but worse generalization compared to traditional algorithm layers.

Experiments. We conduct empirical studies to validate our theory and show that it matches well with experimental observations under various conditions.
These results suggest that our theory can provide useful practical guidelines for designing deep architectures with algorithm layers.

Contributions and limitations. To the best of our knowledge, this is the first result to quantitatively characterize the effects of algorithm properties on the learning behavior of hybrid deep architectures with reasoning modules (algorithm layers), showing that algorithmic biases can help reduce the sample complexity of such architectures. Our result also reveals a subtle and previously unknown interplay between algorithm convergence, stability and sensitivity in their effect on model generalization, and thus provides design principles for deep architectures with algorithm layers. To simplify the analysis, our initial study is limited to a setting where the reasoning module is an unconstrained optimization algorithm and the neural module outputs a quadratic energy function. However, our analysis framework can be extended to more complicated cases, and the insights can be expected to apply beyond our current setting.

Related theoretical works. Our analysis borrows proof techniques for analyzing algorithm properties from the optimization literature [18, 20] and for bounding Rademacher complexity from the statistical learning literature [13, 16, 17, 21, 22], but our focus and results are new. More precisely, the leave-one-out stability of optimization algorithms has been used to derive generalization bounds [23, 24, 25, 18, 26, 27]. However, all existing analyses are in the context where the optimization algorithm is used to train and select the model, while our analysis is based on a fundamentally different viewpoint where the algorithm itself is unrolled and integrated as a layer in the deep model. Also, existing works on the generalization of deep learning mainly focus on generic neural architectures such as feed-forward neural networks, RNNs, GNNs, etc. [13, 14, 15]. The complexity of models based on highly structured iterative algorithms and its relation to algorithm properties have not been investigated. Furthermore, we are not aware of any previous use of local Rademacher complexity analysis for deep learning models.

2 Setting: Optimization Algorithms as Reasoning Modules

In many applications, reasoning can be accomplished by solving an optimization problem defined by a neural perceptual module. For instance, a visual SUDOKU puzzle can be solved using a neural module to perceive the digits, followed by a quadratic optimization module that maximizes a logic satisfiability objective [1]. The RNA folding problem can be tackled by a neural energy model that captures pairwise relations between RNA bases and a constrained optimization module that minimizes the energy, under additional pairing constraints, to obtain a folding [2]. In a broader context, MAML [28, 29] also has a neural module for joint initialization and a reasoning module that performs optimization steps for task-specific adaptation. Other examples include [6, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]. More specifically, perception and reasoning can be jointly formulated in the form

$y^\star(x) = \arg\min_{y\in\mathcal{Y}} E_\theta(x, y),$  (1)

where $x$ is an input and $E_\theta(x, y)$ is a neural energy function with parameters $\theta$, which specifies the type of information needed for performing reasoning and, together with the constraints $\mathcal{Y}$ on the output $y$, specifies the style of reasoning. Very often, the optimizer can be approximated by iterative algorithms, so the mapping in Eq. 1 can be approximated by the following end-to-end hybrid model:
$f_{\varphi,\theta}(x) := \mathrm{Alg}^k_\varphi\big(E_\theta(x,\cdot)\big) : \mathcal{X} \mapsto \mathcal{Y}.$  (2)

$\mathrm{Alg}^k_\varphi$ is the reasoning module with parameters $\varphi$. Given a neural energy, it performs $k$-step iterative updates to produce the output (Fig 1). When $k$ is finite, $\mathrm{Alg}^k_\varphi$ corresponds to approximate reasoning.

As an initial attempt to analyze deep architectures with reasoning layers, we restrict our analysis to a simple case where $E_\theta(x, y)$ is quadratic in $y$. One reason is that the analysis of advanced algorithms such as Nesterov's accelerated gradient method becomes very complex in the general case. Similar problems occur in [18], which also restricts its proofs to quadratic objectives. Specifically:

Problem setting: Consider a hybrid architecture where the neural module is an energy function of the form $E_\theta((x, b), y) = \tfrac{1}{2} y^\top Q_\theta(x)\, y + b^\top y$, with $Q_\theta$ a neural network that maps $x$ to a matrix. Each energy can be uniquely represented by $(Q_\theta(x), b)$, so we can write the overall architecture as

$f_{\varphi,\theta}(x, b) := \mathrm{Alg}^k_\varphi\big(Q_\theta(x), b\big).$  (3)

Assume we are given a set of $n$ i.i.d. samples $S_n = \{((x_1, b_1), y_1^\star), \ldots, ((x_n, b_n), y_n^\star)\}$, where the labels $y^\star$ are given by the exact minimizer $\mathrm{Opt}$ of the corresponding $Q^\star$, i.e.,

$y^\star = \mathrm{Opt}\big(Q^\star(x), b\big).$  (4)

Then the learning problem is to find the best model $f_{\varphi,\theta}$ from the space $\mathcal{F} := \{f_{\varphi,\theta} : (\varphi, \theta) \in \Phi \times \Theta\}$ by minimizing the empirical loss function

$\min_{f_{\varphi,\theta}\in\mathcal{F}} \ \tfrac{1}{n}\sum_{i=1}^n \ell_{\varphi,\theta}(x_i, b_i),$  (5)

where $\ell_{\varphi,\theta}(x, b) := \|\mathrm{Alg}^k_\varphi(Q_\theta(x), b) - \mathrm{Opt}(Q^\star(x), b)\|_2$. Furthermore, we assume:

- $\mathcal{Y} = \mathbb{R}^d$, and both $Q_\theta$ and $Q^\star$ map $\mathcal{X}$ to $\mathcal{S}^{d\times d}_{\mu,L}$, the space of symmetric positive definite (SPD) matrices with $\mu, L > 0$ as its smallest and largest singular values, respectively. Thus the induced energy function $E_\theta$ is $\mu$-strongly convex and $L$-smooth, and the output of $\mathrm{Opt}$ is unique.
- The input $(x, b)$ is a pair of random variables with $x \in \mathcal{X} \subseteq \mathbb{R}^m$ and $b \in \mathcal{B} \subseteq \mathbb{R}^d$. Assume $b$ satisfies $\mathbb{E}[bb^\top] = \sigma_b^2 I$, that $x$ and $b$ are independent, and that their joint distribution follows a probability measure $P$. The samples in $S_n$ are drawn i.i.d. from $P$.
- $\mathcal{B}$ is bounded, and let $M = \sup_{(Q,b)\in\mathcal{S}^{d\times d}_{\mu,L}\times\mathcal{B}} \|\mathrm{Opt}(Q, b)\|_2$.

Though this setting does not encompass the full complexity of hybrid deep architectures, it already reveals interesting connections between the algorithm properties of the reasoning module and the learning behavior of hybrid architectures.
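To make the setting concrete, the following is a minimal NumPy sketch of the pipeline in Eqs. (3)–(5); it is an illustration under assumed toy choices, not the released implementation. Here neural_Q is a stand-in for $Q_\theta$ (and, with separate fixed weights, for $Q^\star$), gd_layer is a $k$-step gradient-descent reasoning layer, and opt is the exact minimizer used to generate labels; all dimensions and the step size are assumptions.

```python
import numpy as np

d, m, k, step = 5, 3, 30, 1.0   # output dim, input dim, unrolled steps, step size (assumed)

def neural_Q(x, W1, W2, mu=0.1, L=1.0):
    """Toy perception module: maps x to an SPD matrix with eigenvalues in (mu, L)."""
    h = np.tanh(W2 @ np.tanh(W1 @ x))
    eigvals = mu + (L - mu) / (1.0 + np.exp(-h))
    U, _ = np.linalg.qr(np.outer(h, h) + np.eye(d))   # an x-dependent orthogonal matrix
    return U @ np.diag(eigvals) @ U.T

def gd_layer(Q, b, k, step):
    """Reasoning module Alg^k_phi: k unrolled gradient steps on 1/2 y'Qy + b'y, y_0 = 0."""
    y = np.zeros_like(b)
    for _ in range(k):
        y = y - step * (Q @ y + b)
    return y

def opt(Q, b):
    """Exact reasoning Opt(Q, b) = argmin_y 1/2 y'Qy + b'y = -Q^{-1} b."""
    return -np.linalg.solve(Q, b)

rng = np.random.default_rng(0)
theta = [rng.normal(size=(d, m)), rng.normal(size=(d, d))]       # learnable neural module
theta_star = [rng.normal(size=(d, m)), rng.normal(size=(d, d))]  # fixed ground-truth Q*
x, b = rng.normal(size=m), rng.normal(size=d)

y_label = opt(neural_Q(x, *theta_star), b)          # label y* = Opt(Q*(x), b), Eq. (4)
y_hat = gd_layer(neural_Q(x, *theta), b, k, step)   # end-to-end prediction f_{phi,theta}(x, b)
print(np.linalg.norm(y_hat - y_label))              # per-sample loss in Eq. (5)
```

In an end-to-end implementation, both the step size of gd_layer and the weights of neural_Q would be trained jointly on $S_n$ by backpropagating through the unrolled loop.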
3 Properties of Algorithms

In this section, we formally define the algorithm properties of the reasoning module $\mathrm{Alg}^k_\varphi$ under the problem setting presented in Sec 2. After that, we compare the corresponding properties of gradient descent, $\mathrm{GD}^k_\varphi$, and Nesterov's accelerated gradient method, $\mathrm{NAG}^k_\varphi$, as concrete examples.

(I) The convergence rate of an algorithm expresses how fast the optimization error decreases as $k$ grows. Formally, we say $\mathrm{Alg}^k_\varphi$ has a convergence rate $\mathrm{Cvg}(k,\varphi)$ if for any $Q \in \mathcal{S}^{d\times d}_{\mu,L}$ and $b \in \mathcal{B}$,

$\|\mathrm{Alg}^k_\varphi(Q, b) - \mathrm{Opt}(Q, b)\|_2 \le \mathrm{Cvg}(k,\varphi)\,\|\mathrm{Alg}^0_\varphi(Q, b) - \mathrm{Opt}(Q, b)\|_2.$  (6)

(II) Stability of an algorithm characterizes its robustness to small perturbations of the optimization objective, which corresponds to perturbations of $Q$ and $b$ in the quadratic case. For the purpose of this paper, we say an algorithm $\mathrm{Alg}^k_\varphi$ is $\mathrm{Stab}(k,\varphi)$-stable if for any $Q, Q' \in \mathcal{S}^{d\times d}_{\mu,L}$ and $b, b' \in \mathcal{B}$,

$\|\mathrm{Alg}^k_\varphi(Q, b) - \mathrm{Alg}^k_\varphi(Q', b')\|_2 \le \mathrm{Stab}(k,\varphi)\,\|Q - Q'\|_2 + \mathrm{Stab}(k,\varphi)\,\|b - b'\|_2,$  (7)

where $\|Q - Q'\|_2$ is the spectral norm of the matrix $Q - Q'$.

(III) Sensitivity characterizes the robustness to small perturbations of the algorithm parameters $\varphi$. We say the sensitivity of $\mathrm{Alg}^k_\varphi$ is $\mathrm{Sens}(k)$ if it holds for all $Q \in \mathcal{S}^{d\times d}_{\mu,L}$, $b \in \mathcal{B}$, and $\varphi, \varphi' \in \Phi$ that

$\|\mathrm{Alg}^k_\varphi(Q, b) - \mathrm{Alg}^k_{\varphi'}(Q, b)\|_2 \le \mathrm{Sens}(k)\,\|\varphi - \varphi'\|_2.$  (8)

This concept is referred to in the deep learning community as parameter perturbation error or sharpness [44, 45, 46]. It has been used for deriving generalization bounds of neural networks, both in the Rademacher complexity framework [13] and in the PAC-Bayes framework [47].

(IV) The stable region is the range $\Phi$ of the parameters $\varphi$ for which the algorithm output remains bounded as $k$ grows to infinity, i.e., for which the algorithm is numerically stable. Only when the algorithm operates in the stable region do the corresponding $\mathrm{Cvg}(k,\varphi)$, $\mathrm{Stab}(k,\varphi)$ and $\mathrm{Sens}(k)$ remain finite for all $k$. It is usually very difficult to identify the exact stable region, but a sufficient range can be provided.

GD and NAG. We now compare the above four algorithm properties for gradient descent and Nesterov's accelerated gradient method, both of which can be used to solve the quadratic optimization problem in our setting. First, the algorithm update steps are summarized below:

$\mathrm{GD}_\varphi:\ \ y_{k+1} \leftarrow y_k - \varphi(Qy_k + b), \qquad \mathrm{NAG}_\varphi:\ \ \begin{cases} y_{k+1} \leftarrow z_k - \varphi(Qz_k + b) \\ z_{k+1} \leftarrow y_{k+1} + \frac{1-\sqrt{\mu\varphi}}{1+\sqrt{\mu\varphi}}\,(y_{k+1} - y_k) \end{cases}$  (9)

where the hyperparameter $\varphi$ corresponds to the step size. The initializations $y_0, z_0$ are set to zero vectors throughout this paper. Denote the result of the $k$-step update, $y_k$, of GD and NAG by $\mathrm{GD}^k_\varphi(Q, b)$ and $\mathrm{NAG}^k_\varphi(Q, b)$, respectively. Their algorithm properties are summarized in Table 1.

Table 1: Comparison of algorithm properties between GD and NAG. For simplicity, only the order in $k$ is presented. Complete statements with detailed coefficients and proofs are given in Appendix A.

| Alg | Cvg(k, φ) | Stab(k, φ) | Sens(k) | Stable region Φ |
|---|---|---|---|---|
| $\mathrm{GD}^k_\varphi$ | $O\big((1-\varphi\mu)^k\big)$ | $O\big(1-(1-\varphi\mu)^k\big)$ | $O\big(k(1-c_0\mu)^{k-1}\big)$ | $[c_0,\ \tfrac{2}{\mu+L}]$ |
| $\mathrm{NAG}^k_\varphi$ | $O\big(k(1-\sqrt{\varphi\mu})^k\big)$ | $O\big(1-(1-\sqrt{\varphi\mu})^k\big)$ | $O\big(k^3(1-\sqrt{c_0\mu})^k\big)$ | $[c_0,\ \tfrac{4}{\mu+3L}]$ |

Table 1 shows: (i) Convergence: NAG converges faster than GD, especially when $\mu$ is very small, which is a well-known result. (ii) Stability: However, for any fixed $k$, NAG is less stable than GD, in contrast to their convergence behaviors. This is pointed out in [18], which proves that a faster converging algorithm has to be less stable. (iii) Sensitivity: The sensitivity behaves similarly to the convergence: NAG is less sensitive to step-size perturbations than GD, and the sensitivity of both algorithms shrinks as $k$ grows larger. (iv) Stable region: Since $\mu < L$, the stable region of GD is larger than that of NAG, meaning that a larger step size is allowable for GD without leading to exploding outputs even when $k$ is large. Note that all the other algorithm properties rest on the assumption that $\varphi$ lies in the stable region $\Phi$. Furthermore, as $k$ goes to infinity, the space $\{\mathrm{Alg}^k_\varphi : \varphi \in \Phi\}$ eventually shrinks to a single function, namely the exact minimizer $\{\mathrm{Opt}\}$.

Our purpose in comparing the algorithm properties of GD and NAG is to show, in a later section, their difference when used as reasoning layers in deep architectures. However, some results in Table 1 are new by themselves and may be of independent interest. For instance, we are not aware of other analyses of the sensitivity of GD and NAG to step-size perturbations. Besides, for the stability results, we provide a proof under a weaker assumption in which $\varphi$ can be larger than $1/L$, which is not allowed in [18]. This is necessary since in practice the learned step size $\varphi$ is usually larger than $1/L$.
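For reference, here is a minimal NumPy sketch of the two unrolled reasoning layers in Eq. (9), with the NAG momentum coefficient $(1-\sqrt{\mu\varphi})/(1+\sqrt{\mu\varphi})$ as stated above; the toy quadratic, the step size $\varphi = 1/L$, and all other constants are assumptions for illustration.

```python
import numpy as np

def gd_k(Q, b, k, phi):
    """GD^k_phi(Q, b): k steps of  y <- y - phi (Q y + b),  starting from y_0 = 0."""
    y = np.zeros_like(b)
    for _ in range(k):
        y = y - phi * (Q @ y + b)
    return y

def nag_k(Q, b, k, phi, mu):
    """NAG^k_phi(Q, b): Nesterov's accelerated gradient with constant momentum."""
    beta = (1.0 - np.sqrt(mu * phi)) / (1.0 + np.sqrt(mu * phi))
    y = z = np.zeros_like(b)
    for _ in range(k):
        y_next = z - phi * (Q @ z + b)
        z = y_next + beta * (y_next - y)
        y = y_next
    return y

# quick check on a random mu-strongly convex, L-smooth quadratic
rng = np.random.default_rng(0)
d, mu, L = 10, 0.05, 1.0
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
Q = U @ np.diag(np.linspace(mu, L, d)) @ U.T
b = rng.normal(size=d)
y_star, phi = -np.linalg.solve(Q, b), 1.0 / L
for k in (5, 20, 80):
    print(k,
          np.linalg.norm(gd_k(Q, b, k, phi) - y_star),
          np.linalg.norm(nag_k(Q, b, k, phi, mu) - y_star))
```

The printed errors decay for both layers, with NAG decaying noticeably faster, reflecting the convergence orders in Table 1.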
4 Approximation Ability

How will the algorithm properties affect the approximation ability of deep architectures with reasoning layers? Given a model space $\mathcal{F} := \{\mathrm{Alg}^k_\varphi(Q_\theta(\cdot), \cdot) : \varphi \in \Phi, \theta \in \Theta\}$, we are interested in its ability to approximate functions of the form $\mathrm{Opt}(Q^\star(x), b)$. More specifically, we define the loss

$\ell_{\varphi,\theta}(x, b) := \|\mathrm{Alg}^k_\varphi(Q_\theta(x), b) - \mathrm{Opt}(Q^\star(x), b)\|_2,$  (10)

and measure the approximation ability by $\inf_{\varphi\in\Phi,\theta\in\Theta}\sup_{Q^\star\in\mathcal{Q}} P\ell_{\varphi,\theta}$, where $\mathcal{Q} := \{\mathcal{X} \mapsto \mathcal{S}^{d\times d}_{\mu,L}\}$ and $P\ell_{\varphi,\theta} = \mathbb{E}_{x,b}[\ell_{\varphi,\theta}(x, b)]$. Intuitively, with a faster converging algorithm, the model $\mathrm{Alg}^k_\varphi$ can represent the reasoning-task structure, $\mathrm{Opt}$, better and improve the overall approximation ability. Indeed, we can prove the following lemma confirming this intuition.

Lemma 4.1 (Faster Convergence ⇒ Better Approximation Ability). Assume the problem setting in Sec 2. The approximation ability can be bounded by two terms:

$\inf_{\varphi\in\Phi,\theta\in\Theta}\sup_{Q^\star\in\mathcal{Q}} P\ell_{\varphi,\theta} \;\le\; \sigma_b\,\mu^{-2}\underbrace{\inf_{\theta\in\Theta}\sup_{Q^\star\in\mathcal{Q}} P\|Q_\theta - Q^\star\|_F}_{\text{approximation ability of the neural module}} \;+\; M\underbrace{\inf_{\varphi\in\Phi}\mathrm{Cvg}(k,\varphi)}_{\text{best convergence}}.$  (11)

With Lemma 4.1, we conclude that a faster converging algorithm can define a model with better approximation ability. For example, for a fixed $k$ and $Q_\theta$, NAG converges faster than GD, so $\mathrm{NAG}^k_\varphi(Q_\theta,\cdot)$ can approximate $\mathrm{Opt}$ more accurately than $\mathrm{GD}^k_\varphi(Q_\theta,\cdot)$, which is experimentally validated in Sec 7.

Similarly, we can reverse the reasoning and ask: given two hybrid architectures with the same approximation error, which one has a smaller error in representing the energy function $Q^\star$? We show that this error is also intimately related to the convergence of the algorithm.

Lemma 4.2 (Faster Convergence ⇒ Better Representation of $Q^\star$). Assume the problem setting in Sec 2. Then for all $\varphi\in\Phi$, $\theta\in\Theta$ and $Q^\star\in\mathcal{Q}$, it holds that

$P\|Q_\theta - Q^\star\|_F^2 \;\le\; \sigma_b^{-2} L^4 \Big(\sqrt{P\ell^2_{\varphi,\theta}} + M\,\mathrm{Cvg}(k,\varphi)\Big)^2.$  (12)

Lemma 4.2 highlights the benefit of using an algorithm layer that aligns with the reasoning-task structure. Here the task structure is represented by the minimizer $\mathrm{Opt}$, and convergence measures how well $\mathrm{Alg}^k_\varphi$ is aligned with $\mathrm{Opt}$. Lemma 4.2 essentially indicates that if the structure of a reasoning module better aligns with the task structure, then it can better constrain the search space of the underlying neural module $Q_\theta$, making it easier to learn and further leading to better sample complexity, which we explain in the next section. As a concrete example of Lemma 4.2, if $\mathrm{GD}^k_\varphi(Q_\theta,\cdot)$ and $\mathrm{NAG}^k_\varphi(Q_\theta,\cdot)$ achieve the same accuracy in approximating $\mathrm{Opt}(Q^\star,\cdot)$, then the neural module $Q_\theta$ in $\mathrm{NAG}^k_\varphi(Q_\theta,\cdot)$ will approximate $Q^\star$ more accurately than the $Q_\theta$ in $\mathrm{GD}^k_\varphi(Q_\theta,\cdot)$. In other words, a faster converging algorithm imposes more constraints on the energy function $Q_\theta$, making it approach $Q^\star$ faster.
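For intuition, the display below sketches the triangle-inequality decomposition that underlies Lemma 4.1; this is only a sketch under the assumptions of Sec 2, and the exact constants (including the $\sigma_b\mu^{-2}$ factor) are worked out in the paper's appendix.

```latex
% Sketch of the decomposition behind Lemma 4.1 (constants deferred to the appendix).
\begin{aligned}
\ell_{\varphi,\theta}(x,b)
 &= \big\|\mathrm{Alg}^k_\varphi(Q_\theta(x), b) - \mathrm{Opt}(Q^\star(x), b)\big\|_2 \\
 &\le \underbrace{\big\|\mathrm{Alg}^k_\varphi(Q_\theta(x), b) - \mathrm{Opt}(Q_\theta(x), b)\big\|_2}_{
        \le\, \mathrm{Cvg}(k,\varphi)\,\|\mathrm{Opt}(Q_\theta(x), b)\|_2 \,\le\, \mathrm{Cvg}(k,\varphi)\, M
        \ \ (\text{since } y_0 = 0)}
  \;+\; \underbrace{\big\|Q_\theta(x)^{-1} b - Q^\star(x)^{-1} b\big\|_2}_{
        \le\, \mu^{-2}\,\|Q_\theta(x) - Q^\star(x)\|_2\,\|b\|_2}.
\end{aligned}
```

Taking expectations over $(x, b)$, using the independence of $x$ and $b$ together with $\mathbb{E}[bb^\top] = \sigma_b^2 I$, and then taking the infimum over $(\varphi, \theta)$ and the supremum over $Q^\star$ recovers the two terms of Lemma 4.1 up to its constants.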
5 Generalization Ability

How will algorithm properties affect the generalization ability of deep architectures with reasoning layers? We show theoretically that the generalization bound is determined by both the algorithm properties and the complexity of the neural module. Moreover, it has interesting implications: when the neural module is over- or under-parameterized, the generalization bound is dominated by algorithm stability; but when the neural module has an about-right parameterization, the bound is dominated by the product of algorithm stability and convergence. More specifically, we analyze the generalization gap between the expected loss and the empirical loss,

$P\ell_{\varphi,\theta} = \mathbb{E}_{x,b}\,\ell_{\varphi,\theta}(x, b) \quad\text{and}\quad P_n\ell_{\varphi,\theta} = \tfrac{1}{n}\sum_{i=1}^n \ell_{\varphi,\theta}(x_i, b_i),\ \text{respectively},$  (13)

where $P_n$ is the empirical probability measure induced by the samples $S_n$. Let $\ell_{\mathcal{F}} := \{\ell_{\varphi,\theta} : \varphi\in\Phi, \theta\in\Theta\}$ be the function space of losses of the models. The generalization gap, $P\ell_{\varphi,\theta} - P_n\ell_{\varphi,\theta}$, can be bounded by the Rademacher complexity $\mathbb{E}\,\mathcal{R}_n\ell_{\mathcal{F}}$, defined as the expectation of the empirical Rademacher complexity $\mathcal{R}_n\ell_{\mathcal{F}} := \mathbb{E}_\sigma \sup_{\varphi\in\Phi,\theta\in\Theta} \tfrac{1}{n}\sum_{i=1}^n \sigma_i\,\ell_{\varphi,\theta}(x_i, b_i)$, where $\{\sigma_i\}_{i=1}^n$ are $n$ independent Rademacher random variables uniformly distributed over $\{\pm 1\}$. Generalization bounds derived from Rademacher complexity have been studied in many works [48, 49, 50, 51]. However, deriving the Rademacher complexity of $\ell_{\mathcal{F}}$ is highly nontrivial in our case, and we are not aware of prior bounds for deep learning models with reasoning layers.

Aiming at a relation between algorithm properties and generalization ability that can explain experimental observations, we find that standard Rademacher complexity analysis is insufficient. Its shortcoming is that it provides a global estimate of the complexity of the model space, ignoring the fact that the training process is likely to pick models with small errors. Taking this factor into account, we resort to a more refined analysis using local Rademacher complexity [13, 16, 17]. Remarkably, we find that the bounds derived via global and local Rademacher complexity lead to different conclusions about the effects of algorithm layers. That is, an algorithm that converges faster could lead to a model space that has a larger global Rademacher complexity but a smaller local Rademacher complexity. Also, the global Rademacher complexity is dominated by algorithmic stability, whereas its local counterpart contains a trade-off term between stability and convergence, which aligns better with the experimental observations.

Main Result. More specifically, the local Rademacher complexity of $\ell_{\mathcal{F}}$ at level $r$ is defined as $\mathbb{E}\,\mathcal{R}_n\ell^{\mathrm{loc}}_{\mathcal{F}}(r)$, where

$\ell^{\mathrm{loc}}_{\mathcal{F}}(r) := \{\ell_{\varphi,\theta} : \varphi\in\Phi, \theta\in\Theta, P\ell^2_{\varphi,\theta} \le r\}.$  (14)

This notion is less general than the one defined in [16, 17] but is sufficient for our purpose. Here we also define a loss function space $\ell_{\mathcal{Q}} := \{\|Q_\theta - Q^\star\|_F : \theta\in\Theta\}$ for the neural module $Q_\theta$, and introduce its local Rademacher complexity $\mathbb{E}\,\mathcal{R}_n\ell^{\mathrm{loc}}_{\mathcal{Q}}(r_q)$, where $\ell^{\mathrm{loc}}_{\mathcal{Q}}(r_q) = \{\|Q_\theta - Q^\star\|_F \in \ell_{\mathcal{Q}} : P\|Q_\theta - Q^\star\|_F^2 \le r_q\}$. With these definitions, we can show that the local Rademacher complexity of the hybrid architecture is explicitly related to all considered algorithm properties, namely convergence, stability and sensitivity, and that there is an intricate trade-off.

Theorem 5.1. Assume the problem setting in Sec 2. Then we have for any $t > 0$ that

$\mathbb{E}\,\mathcal{R}_n\ell^{\mathrm{loc}}_{\mathcal{F}}(r) \;\le\; \mathrm{Stab}(k)\Big[\big(\mathrm{Cvg}(k)\,M + \sqrt{r}\big)^2 C_1(n) + C_2(n,t) + C_3(n,t) + 4\Big] \;+\; \mathrm{Sens}(k)\,B_\Phi,$  (15, 16)

where $\mathrm{Stab}(k) = \sup_\varphi \mathrm{Stab}(k,\varphi)$ and $\mathrm{Cvg}(k) = \sup_\varphi \mathrm{Cvg}(k,\varphi)$ are the worst-case stability and convergence, $B_\Phi = \tfrac{1}{2}\sup_{\varphi,\varphi'\in\Phi}\|\varphi - \varphi'\|_2$, $C_1(n) = O(\log N(n))$, $C_2(n,t) = O\big(t\log N(n)/n + (C_3(n,t)+1)\sqrt{\log N(n)/n}\big)$, $C_3(n,t) = O\big(\sqrt{\log N(n)/n} + t/n\big)$, and $N(n) = N(\tfrac{1}{n}, \ell_{\mathcal{Q}}, L_\infty)$ is the covering number of $\ell_{\mathcal{Q}}$ with radius $\tfrac{1}{n}$ in the $L_\infty$ norm.

Proof Sketch. We explain the key steps here; the full details are deferred to Appendix C. The essence of the proof is to find the relation between $\mathcal{R}_n\ell^{\mathrm{loc}}_{\mathcal{F}}(r)$ and $\mathcal{R}_n\ell^{\mathrm{loc}}_{\mathcal{Q}}(r_q)$, as well as the relation between the local levels $r$ and $r_q$. Then the analysis of the local Rademacher complexity of the end-to-end model $\mathrm{Alg}^k_\varphi(Q_\theta, \cdot)$ can be reduced to that of the neural module $Q_\theta$. More specifically, we first show that the loss $\ell_{\varphi,\theta}$ is $\mathrm{Stab}(k)$-Lipschitz in $Q_\theta$ and $\mathrm{Sens}(k)$-Lipschitz in $\varphi$. By the triangle inequality and the algorithm properties, we can bound the sensitivity of the loss by

$|\ell_{\varphi,\theta}(x) - \ell_{\varphi',\theta'}(x)| \;\le\; \mathrm{Stab}(k)\,\|Q_\theta(x) - Q_{\theta'}(x)\|_2 + \mathrm{Sens}(k)\,\|\varphi - \varphi'\|_2.$  (17)
Second, by leveraging the vector-contraction inequality for the Rademacher complexity of vector-valued hypotheses [21, 22] and our earlier observation in Lemma 4.2, we can turn the sensitivity bound on the loss function in Eq. 17 into a local Rademacher complexity bound

$\mathcal{R}_n\ell^{\mathrm{loc}}_{\mathcal{F}}(r) \;\le\; \sqrt{2d}\,\mathrm{Stab}(k)\,\mathcal{R}_n\ell^{\mathrm{loc}}_{\mathcal{Q}}(r_q) + \mathrm{Sens}(k)\,B_\Phi \quad\text{with}\quad r_q = \sigma_b^{-2} L^4 \big(\sqrt{r} + M\,\mathrm{Cvg}(k)\big)^2.$  (18)

Therefore, bounding the local Rademacher complexity of $\ell^{\mathrm{loc}}_{\mathcal{F}}$ at level $r$ reduces to bounding that of $\ell^{\mathrm{loc}}_{\mathcal{Q}}$ at level $r_q$. This inequality already reveals the roles of stability, convergence, and sensitivity in bounding the local Rademacher complexity, and it is the key step of the proof. Third, based on an extension of Talagrand's inequality for empirical processes [52, 16], we can bound the empirical error $P_n\|Q_\theta - Q^\star\|_F^2$ in terms of $r_q$ and some other terms with high probability. Then $\mathcal{R}_n\ell^{\mathrm{loc}}_{\mathcal{Q}}(r_q)$ can be bounded using the covering number of $\ell_{\mathcal{Q}}$ via the classical Dudley entropy integral [53], where the upper integration limit is given by the upper bound on $P_n\|Q_\theta - Q^\star\|_F^2$.

Trade-offs between convergence, stability and sensitivity. Generally speaking, the convergence rate Cvg(k) and the sensitivity Sens(k) have similar behavior, while Stab(k) behaves opposite to them; see the illustrations in Fig 2. Therefore, the way these three quantities interact in Theorem 5.1 suggests that different regimes may exhibit different generalization behavior. More specifically, depending on the parameterization of $Q_\theta$, the coefficients $C_1$, $C_2$ and $C_3$ in Eq. 15 may have different scales, making the local Rademacher complexity bound dominated by different algorithm properties. Since the coefficients $C_i$ are monotonically increasing in the covering number of $\ell_{\mathcal{Q}}$, we expect the following: (i) When $Q_\theta$ is over-parameterized, the covering number of $\ell_{\mathcal{Q}}$ becomes large, as do the three coefficients. Large $C_i$ reduce the effect of Cvg(k) and make Eq. 15 dominated by Stab(k). (ii) When $Q_\theta$ is under-parameterized, the three coefficients become small, but they still reduce the effect of Cvg(k) given the constant 4 in Eq. 15, again making it dominated by Stab(k). (iii) When the parameterization of $Q_\theta$ is about-right, we can expect Cvg(k) to play a critical role in Eq. 15, which will then behave similarly to the product Stab(k)Cvg(k), as illustrated schematically in Fig 2. We experimentally validate these implications in Sec 7.

Figure 2: Overall trend of the algorithm properties (schematic curves for "Cvg(k) or Sens(k)" and "Cvg(k) * Stab(k)" as functions of k).

Trade-off of the depth. Combining the above implications with the approximation ability analysis in Sec 4, we see that in the above cases (i) and (ii), deeper algorithm layers lead to better approximation accuracy but worse generalization. Only in the ideal case (iii) can a deeper reasoning module induce both better representation and generalization abilities. This result provides practical guidelines for some recently proposed infinite-depth models [54, 55].

Standard Rademacher complexity analysis. If we instead consider the standard Rademacher complexity and directly bound it by the covering number of $\ell_{\mathcal{F}}$ via Dudley's entropy integral, in the way some existing generalization bounds for deep learning are derived [13, 14, 15], we obtain the following upper bound for the covering number, in which Cvg(k) plays no role:

$N(\epsilon, \ell_{\mathcal{F}}, L_2(P_n)) \;\le\; N\big(\epsilon/(2\,\mathrm{Stab}(k)),\, \mathcal{Q},\, L_2(P_n)\big)\cdot N\big(\epsilon/(2\,\mathrm{Sens}(k)),\, \Phi,\, \|\cdot\|_2\big).$  (19)

Since $\Phi$ only contains the hyperparameters of the algorithm and $\mathcal{Q} := \{Q_\theta : \theta\in\Theta\}$ is often highly expressive, stability typically dominates this bound. Alternatively, consider the case where the algorithm layer is fixed, so that $\Phi$ contains only a single element. Then this covering number is determined by stability, which implies that $\mathrm{NAG}^k(Q_\theta,\cdot)$ has a larger Rademacher complexity than $\mathrm{GD}^k(Q_\theta,\cdot)$, since it is less stable. However, in the local Rademacher complexity bound of Theorem 5.1, even if the Sens(k) term in Eq. 16 is ignored, there is still a trade-off between convergence and stability, which implies that $\mathrm{NAG}^k(Q_\theta,\cdot)$ can have a smaller local Rademacher complexity than $\mathrm{GD}^k(Q_\theta,\cdot)$, leading to a different conclusion. Our experiments show that the local Rademacher complexity bound better explains the actual observations.
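The qualitative trends in Fig 2 can be reproduced directly from the orders in Table 1; the short NumPy sketch below does this for GD, using purely illustrative constants (the values of $\varphi$ and $\mu$ are assumptions).

```python
import numpy as np

phi, mu = 0.5, 0.1
k = np.arange(1, 41)
cvg = (1.0 - phi * mu) ** k          # Cvg(k): order in k from Table 1, decays with k
stab = 1.0 - (1.0 - phi * mu) ** k   # Stab(k): order in k from Table 1, grows toward a constant
for name, curve in [("Cvg", cvg), ("Stab", stab), ("Stab*Cvg", stab * cvg)]:
    print(name, np.round(curve[[0, 9, 19, 39]], 3))   # values at k = 1, 10, 20, 40
```

With these orders, Cvg(k) decays, Stab(k) saturates, and the product Stab(k)·Cvg(k) first rises and then decays with k, which is the shape that Theorem 5.1 associates with the about-right parameterization regime.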
6 Pros and Cons of RNN as a Reasoning Layer

It has been shown that RNN (or GNN) can represent reasoning and iterative algorithms over structures [19, 15]. Can our analysis framework also be used to understand RNN (or GNN)? How does its behavior compare with more interpretable algorithm layers such as $\mathrm{GD}^k_\varphi$ and $\mathrm{NAG}^k_\varphi$? In the case of RNN, the algorithm update in each iteration is given by an RNN cell,

$y_{k+1} \leftarrow \mathrm{RNNcell}(Q, b, y_k) := V\,\sigma\Big(W_L\,\sigma\big(W_{L-1}\cdots W_2\,\sigma(W_1^1 y_k + W_1^2 g_k)\big)\Big),$  (20)

where the activation function $\sigma = \mathrm{ReLU}$ and the cell takes $y_k$ and the gradient $g_k = Q y_k + b$ as inputs. A recurrent neural network $\mathrm{RNN}^k_\varphi$ consisting of $k$ unrolled RNN cells can then be viewed as a neural algorithm. The algorithm properties of $\mathrm{RNN}^k_\varphi$ are summarized in Table 2. Assume $\varphi = \{V, W_1^1, W_1^2, W_{2:L}\}$ lies in a stable region with $c_\varphi := \sup_Q \|V\|_2\,\|W_1^1 + W_1^2 Q\|_2 \prod_{l=2}^L \|W_l\|_2 < 1$, so that the operations in RNNcell are strictly contractive, i.e., $\|y_{k+1} - y_k\|_2 < \|y_k - y_{k-1}\|_2$. In this case, the stability and sensitivity of $\mathrm{RNN}^k_\varphi$ are guaranteed to be bounded.

Table 2: Properties of $\mathrm{RNN}^k_\varphi$. (Details are given in Appendix D.)

| Property | Order |
|---|---|
| Stable region Φ | $c_\varphi < 1$ |
| Stab(k, φ) | $O(1 - c_\varphi^k)$ |
| Sens(k) | $O\big(1 - (\inf_\varphi c_\varphi)^k\big)$ |
| $\min_\varphi$ Cvg(k, φ) | $O(\rho^k)$ with $\rho < 1$ |

However, the fundamental disadvantage of RNN is its lack of a worst-case guarantee for convergence. In general, the outputs of $\mathrm{RNN}^k_\varphi$ may not converge to the minimizer $\mathrm{Opt}$, meaning that its worst-case convergence rate can be much larger than 1. According to our theory, this leads to a worse generalization bound compared to $\mathrm{GD}^k_\varphi$ and $\mathrm{NAG}^k_\varphi$. The advantage of RNN is its expressiveness, especially given the universal approximation ability of the MLP in the RNNcell. One can show that $\mathrm{RNN}^k_\varphi$ can express $\mathrm{GD}^k_\varphi$ or $\mathrm{NAG}^k_\varphi$ with suitable choices of $\varphi$. Therefore, its best-case convergence can be as small as $O(\rho^k)$ for some $\rho < 1$. When the needed type of reasoning is unknown or beyond what existing algorithms are capable of, RNN has the potential to learn new reasoning types given sufficient data.
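For concreteness, here is a NumPy sketch of the RNN reasoning layer in Eq. (20). The hidden width, depth, and weight scales are assumptions for illustration; the weights are kept small so that the cell is roughly contractive in the sense of the $c_\varphi < 1$ condition above.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def rnn_cell(Q, b, y, phi):
    """One step y_{k+1} = V s(W_L s(... W_2 s(W1_y y_k + W1_g g_k)))  with  g_k = Q y_k + b."""
    V, W1_y, W1_g, Ws = phi
    g = Q @ y + b
    h = relu(W1_y @ y + W1_g @ g)
    for W in Ws:               # layers W_2, ..., W_L
        h = relu(W @ h)
    return V @ h

def rnn_k(Q, b, k, phi):
    """RNN^k_phi(Q, b): k unrolled cells starting from y_0 = 0."""
    y = np.zeros_like(b)
    for _ in range(k):
        y = rnn_cell(Q, b, y, phi)
    return y

rng = np.random.default_rng(0)
d, hidden, L_layers = 5, 16, 3
phi = (0.1 * rng.normal(size=(d, hidden)),      # V
       0.1 * rng.normal(size=(hidden, d)),      # W1_y, acts on y_k
       0.1 * rng.normal(size=(hidden, d)),      # W1_g, acts on the gradient g_k
       [0.1 * rng.normal(size=(hidden, hidden)) for _ in range(L_layers - 1)])
print(rnn_k(np.eye(d), rng.normal(size=d), 10, phi))
```

In the unrolled deep model, these weights play the role of the learnable parameters $\varphi$ of the reasoning layer.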
7 Experimental Validation

Our experiments aim to validate our theoretical predictions with computational simulations, rather than to obtain state-of-the-art results. We hope the theory, together with these experiments, can lead to practical guidelines for designing deep architectures with reasoning layers. We conduct two sets of experiments: the first strictly follows the problem setting described in Sec 2, and the second is conducted on the BSD500 dataset [56] to demonstrate the possibility of generalizing the theorem to more realistic applications. Implementations in Python are released.¹

¹ https://github.com/xinshi-chen/Deep-Architecture-With-Reasoning-Layer

7.1 Synthetic Experiments

The experiments follow the problem setting in Sec 2. We sample 10000 pairs of $(x, b)$ uniformly as the overall dataset. During training, $n$ samples are randomly drawn from these 10000 data points as the training set. Each $Q^\star(x)$ is produced by a rotation matrix and a vector of eigenvalues parameterized by a randomly initialized and fixed 2-layer dense neural network with hidden dimension 3. The labels are then generated according to $y^\star = \mathrm{Opt}(Q^\star(x), b)$. We train the model $\mathrm{Alg}^k_\varphi(Q_\theta, \cdot)$ on $S_n$ using the loss in Eq. 10. Here, $Q_\theta$ has the same overall architecture as $Q^\star$, but its hidden dimension can vary. Note that in all figures, each $k$ corresponds to an independently trained model with $k$ iterations in the algorithm layer, rather than the sequential outputs of a single model. Each model is trained by ADAM and SGD with the learning rate grid-searched over [1e-2, 5e-3, 1e-3, 5e-4, 1e-4], and only the best result is reported. Error bars are produced by 20 independent instantiations of the experiments. See Appendix E for more details.

Approximation ability. To validate Lemma 4.1, we compare $\mathrm{GD}^k_\varphi(Q_\theta, \cdot)$ and $\mathrm{NAG}^k_\varphi(Q_\theta, \cdot)$ in terms of approximation accuracy. The results are similar for various hidden sizes of $Q_\theta$, so we report one representative case in Fig 3. The approximation accuracy aligns with the convergence of the algorithms, showing that a faster converging algorithm can induce better approximation ability.

Figure 3: Empirical approximation error against the number of algorithm steps k.

Faster convergence ⇒ better $Q_\theta$. We report the error of the neural module $Q_\theta$ in Fig 4. Note that $\mathrm{Alg}^k_\varphi(Q_\theta, \cdot)$ is trained end-to-end, without supervision on $Q_\theta$. In Fig 4, the error of $Q_\theta$ decreases as $k$ grows, at a rate similar to the algorithm convergence. This validates the implication of Lemma 4.2 that, when $\mathrm{Alg}^k_\varphi$ is closer to $\mathrm{Opt}$, it helps the underlying neural module $Q_\theta$ get closer to $Q^\star$.

Figure 4: $P\|Q_\theta - Q^\star\|_F^2$ against k.

Generalization gap. In Fig 5, we report the generalization gaps with hidden sizes of $Q_\theta$ equal to 0, 16, and 32, which correspond to the three cases (ii), (iii), and (i) discussed under Theorem 5.1, respectively. Comparing Fig 5 to Fig 2, we can see that the experimental results match the theoretical implications very well.

Figure 5: Generalization gap against k.

RNN. As discussed in Sec 6, RNN can be viewed as a neural algorithm. For a cleaner comparison, we report its behavior in the learning-to-optimize scenario where the objectives $(Q, b)$ are given. Fig 6 shows that RNN has better representation power but worse generalization ability.

Figure 6: Algorithm layers vs. RNN (training error and generalization gap against k).

7.2 Experiments on a Real Dataset

[Figure: (a) original image; (b) noisy image; (c) image denoised by $\mathrm{GD}^{12}_\varphi(E_\theta(X, \cdot))$.]

To show the real-world applicability of our theoretical framework, we consider a locally adaptive image denoising task. Details are given below.

Dataset. We split BSD500 (400 images) into a training set (100 images) and a test set (300 images). Gaussian noise is added to each pixel, with noise levels depending on the local smoothness of the image, making the noise levels on edges lower than in non-edge regions. The task is to restore the original image from the noisy version $X \in [0, 1]^{180\times 180}$.

Architecture. In $\mathrm{Alg}^k_\varphi(E_\theta(X, \cdot))$, $\mathrm{Alg}^k_\varphi$ is a $k$-step unrolled minimization algorithm applied to the $\ell_2$-regularized reconstruction objective $E_\theta(X, Y) := \tfrac{1}{2}\|Y + g_\theta(X) - X\|_F^2 + \tfrac{1}{2}\sum_{i,j}\big|[f_\theta(X)]_{i,j}\, Y_{i,j}\big|^2$, where the residual $g_\theta(X)$ and the position-wise regularization coefficients $f_\theta(X)$ are both DnCNN networks as in [57]. The optimization objective $E_\theta(X, Y)$ is quadratic in $Y$.
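As a rough illustration of this reasoning layer (not the released implementation), the sketch below unrolls $k$ gradient steps on the energy as reconstructed above, with the two DnCNN networks replaced by trivial stand-ins; the stand-ins, the step size, and the choice $k = 12$ (matching the visualization figure) are assumptions.

```python
import numpy as np

def g_theta(X):
    """Stand-in for the DnCNN residual network g_theta; the real model is a CNN as in [57]."""
    return np.zeros_like(X)

def f_theta(X):
    """Stand-in for the DnCNN producing position-wise regularization coefficients f_theta."""
    return 0.1 * np.ones_like(X)

def denoise_gd(X, k=12, phi=0.5):
    """k unrolled gradient steps on E_theta(X, Y) = 1/2 ||Y + g(X) - X||_F^2
       + 1/2 sum_ij (f(X)_ij * Y_ij)^2, which is quadratic in Y."""
    g, f2 = g_theta(X), f_theta(X) ** 2
    Y = np.zeros_like(X)
    for _ in range(k):
        grad = (Y + g - X) + f2 * Y      # gradient of the quadratic energy w.r.t. Y
        Y = Y - phi * grad
    return np.clip(Y, 0.0, 1.0)

X_noisy = np.clip(np.random.default_rng(0).normal(0.5, 0.1, size=(180, 180)), 0.0, 1.0)
X_denoised = denoise_gd(X_noisy)
```

In the actual experiments, the CNN stand-ins are replaced by trained DnCNN modules and the whole pipeline is optimized end-to-end, exactly as in the synthetic setting.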
Generalization gap. We instantiate the hybrid architecture into different models using the GD and NAG algorithms with different numbers of unrolled steps $k$. Each model is trained for 3000 epochs, and the generalization gaps are reported in Fig 7. The results again show good consistency with our theory: the more stable algorithm (GD) generalizes better given an over- or under-parameterized neural module, while in the about-right parameterization case the generalization gap behaves similarly to Stab(k) · Cvg(k).

Figure 7: Generalization gap on the denoising task (panel titles: dim=20-2-3-2, dim=3-2-3-2, dim=3-2-0-0). Each k corresponds to a separately trained model. Left (under-parameterized): $f_\theta$ is a DnCNN with 3 channels and 2 hidden layers and $g_\theta = 0$. Middle (about-right): both $f_\theta$ and $g_\theta$ have 3 channels and 2 hidden layers. Right (over-parameterized): $f_\theta$ has 20 channels.

Visualization. To show that the learned hybrid model performs well in this real application, we include a visualization of the original, noisy, and denoised images.

8 Conclusion and Discussion

In this paper, we take an initial step towards a theoretical understanding of deep architectures with reasoning layers. Our theorems indicate an intriguing relation between the algorithm properties of the reasoning module and the approximation and generalization abilities of the end-to-end model, which in turn provides practical guidelines for designing reasoning layers. The current analysis is limited by the simplified problem setting. However, the assumptions we made only serve to avoid the non-uniqueness of the reasoning solution and the instability of the mapping from the reasoning solution to the neural module. These assumptions could be relaxed by bringing in other techniques to resolve these issues, which could potentially generalize the results to more complex cases.

Broader Impact

A common ethical concern about deep learning models is that they may not perform well on unseen examples, which could lead to the risk of producing biased content reflective of the training data. Our work, which learns an energy optimization model from data, is not an exception. The approach we adopt to address this issue is to design hybrid deep architectures containing specialized reasoning modules. In the setting of quadratic energy functions, our theoretical analysis and numerical experiments show that hybrid deep models produce more reliable results than generic deep models on unseen data. More work is needed to determine the extent to which such hybrid models prevent biased outputs in more sophisticated tasks.

Acknowledgements

We would like to thank Professor Vladimir Koltchinskii for providing valuable suggestions and the anonymous reviewers for their constructive feedback. This work is supported in part by NSF grants CDS&E-1900017 D3SC, CCF-1836936 FMitF, IIS-1841351, and CAREER IIS-1350983 to L.S.

References

[1] Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning, pages 6545–6554, 2019.
[2] Xinshi Chen, Yu Li, Ramzan Umarov, Xin Gao, and Le Song. RNA secondary structure prediction by learning unrolled algorithms. arXiv preprint arXiv:2002.05810, 2020.
[3] Marco Cuturi, Olivier Teboul, and Jean-Philippe Vert. Differentiable sorting using optimal transport: The Sinkhorn CDF and quantile operator. arXiv preprint arXiv:1905.11885, 2019.
[4] Harsh Shrivastava, Xinshi Chen, Binghong Chen, Guanghui Lan, Srinivas Aluru, Han Liu, and Le Song. GLAD: Learning sparse graph recovery. In International Conference on Learning Representations, 2020.
[5] Y Yang, J Sun, H Li, and Z Xu. ADMM-Net: A deep learning approach for compressive sensing MRI. arXiv preprint arXiv:1705.06869, 2017.
[6] John Ingraham, Adam Riesselman, Chris Sander, and Debora Marks. Learning protein structure with a differentiable simulator. In International Conference on Learning Representations, 2019.
[7] Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. DeepProbLog: Neural probabilistic logic programming. In Advances in Neural Information Processing Systems, pages 3749–3759, 2018.
[8] Arthur Mensch and Mathieu Blondel. Differentiable dynamic programming for structured prediction and attention. In 35th International Conference on Machine Learning, volume 80, 2018.
[9] Bryan Wilder, Eric Ewing, Bistra Dilkina, and Milind Tambe. End to end learning and optimization on graphs. In Advances in Neural Information Processing Systems, pages 4674–4685, 2019.
[10] Justin Domke. Parameter learning with truncated message-passing. In CVPR 2011, pages 2937–2943. IEEE, 2011.
[11] Despoina Paschalidou, Osman Ulusoy, Carolin Schmitt, Luc Van Gool, and Andreas Geiger. RayNet: Learning volumetric 3D reconstruction with ray potentials. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3897–3906, 2018.
[12] Wei Wang, Zheng Dang, Yinlin Hu, Pascal Fua, and Mathieu Salzmann. Backpropagation-friendly eigendecomposition. In Advances in Neural Information Processing Systems, pages 3156–3164, 2019.
[13] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
[14] Minshuo Chen, Xingguo Li, and Tuo Zhao. On generalization bounds of a family of recurrent neural networks. arXiv preprint arXiv:1910.12947, 2019.
[15] Vikas K Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. arXiv preprint arXiv:2002.06157, 2020.
[16] Peter L Bartlett, Olivier Bousquet, Shahar Mendelson, et al. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
[17] Vladimir Koltchinskii et al. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
[18] Yuansi Chen, Chi Jin, and Bin Yu. Stability and convergence trade-off of iterative optimization algorithms. arXiv preprint arXiv:1804.01619, 2018.
[19] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
[20] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[21] Andreas Maurer. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3–17. Springer, 2016.
[22] Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Structured prediction theory based on factor graph complexity. In Advances in Neural Information Processing Systems, pages 2514–2522, 2016.
[23] Olivier Bousquet and André Elisseeff. Stability and generalization.
Journal of Machine Learning Research, 2(Mar):499–526, 2002.
[24] Shivani Agarwal and Partha Niyogi. Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research, 10(Feb):441–474, 2009.
[25] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
[26] Omar Rivasplata, Emilio Parrado-Hernández, John S Shawe-Taylor, Shiliang Sun, and Csaba Szepesvári. PAC-Bayes bounds for stable algorithms with instance-dependent priors. In Advances in Neural Information Processing Systems, pages 9214–9224, 2018.
[27] Saurabh Verma and Zhi-Li Zhang. Stability and generalization of graph convolutional neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1539–1548, 2019.
[28] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org, 2017.
[29] Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, pages 113–124, 2019.
[30] David Belanger, Bishan Yang, and Andrew McCallum. End-to-end learning for structured prediction energy networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 429–439. JMLR.org, 2017.
[31] Priya Donti, Brandon Amos, and J Zico Kolter. Task-based end-to-end model learning in stochastic optimization. In Advances in Neural Information Processing Systems, pages 5484–5494, 2017.
[32] Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 136–145. JMLR.org, 2017.
[33] Marin Vlastelica Pogančić, Anselm Paulus, Vit Musil, Georg Martius, and Michal Rolinek. Differentiation of blackbox combinatorial solvers. In International Conference on Learning Representations, 2019.
[34] Michal Rolínek, Paul Swoboda, Dominik Zietlow, Anselm Paulus, Vít Musil, and Georg Martius. Deep graph matching via blackbox differentiation of combinatorial solvers. arXiv preprint arXiv:2003.11657, 2020.
[35] Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, and Francis Bach. Learning with differentiable perturbed optimizers. arXiv preprint arXiv:2002.08676, 2020.
[36] Xiaohan Chen, Jialin Liu, Zhangyang Wang, and Wotao Yin. Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds. In Advances in Neural Information Processing Systems, pages 9061–9071, 2018.
[37] Aaron Ferber, Bryan Wilder, Bistra Dilkina, and Milind Tambe. MIPaaL: Mixed integer program as a layer. In AAAI, pages 1504–1511, 2020.
[38] Patrick Knobelreiter, Christian Reinbacher, Alexander Shekhovtsov, and Thomas Pock. End-to-end training of hybrid CNN-CRF models for stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2339–2348, 2017.
[39] Vlad Niculae, Andre Martins, Mathieu Blondel, and Claire Cardie. SparseMAP: Differentiable sparse structured inference. In International Conference on Machine Learning, pages 3799–3808, 2018.
[40] Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse.
Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. arXiv preprint arXiv:1903.03088, 2019.
[41] Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated backpropagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1723–1732. PMLR, 2019.
[42] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910, 2018.
[43] Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566–1575, 2019.
[44] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[45] Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.
[46] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
[47] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.
[48] Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pages 443–457. Springer, 2000.
[49] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
[50] Vladimir Koltchinskii, Dmitry Panchenko, et al. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.
[51] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[52] Michel Talagrand. Sharper bounds for Gaussian and empirical processes. The Annals of Probability, pages 28–76, 1994.
[53] Richard M Dudley. Uniform Central Limit Theorems, volume 142. Cambridge University Press, 2014.
[54] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, pages 688–699, 2019.
[55] Laurent El Ghaoui, Fangda Gu, Bertrand Travacca, and Armin Askari. Implicit deep learning. arXiv preprint arXiv:1908.06315, 2019.
[56] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2010.
[57] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.