# On Training Implicit Models

Zhengyang Geng (1,2), Xin-Yu Zhang (2), Shaojie Bai (4), Yisen Wang (2,3), Zhouchen Lin (2,3,5)

(1) Zhejiang Lab, China; (2) Key Lab. of Machine Perception, School of AI, Peking University; (3) Institute for Artificial Intelligence, Peking University; (4) Carnegie Mellon University; (5) Pazhou Lab, China

*Equal contribution. Corresponding author: zlin@pku.edu.cn.

This paper focuses on training implicit models of infinite layers. Specifically, previous works employ implicit differentiation and solve the exact gradient for the backward propagation. However, is it necessary to compute such an exact but expensive gradient for training? In this work, we propose a novel gradient estimate for implicit models, named phantom gradient, that 1) forgoes the costly computation of the exact gradient; and 2) provides an update direction empirically preferable for training implicit models. We theoretically analyze the condition under which an ascent direction of the loss landscape can be found, and provide two specific instantiations of the phantom gradient based on damped unrolling and the Neumann series. Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7×, and even boost the performance over approaches based on the exact gradient on ImageNet.

1 Introduction

Conventional neural networks are typically constructed by explicitly stacking multiple linear and non-linear operators in a feed-forward manner. Recently, implicitly-defined models [1, 2, 3, 4, 5] have attracted increasing attention and are able to match the state-of-the-art results of explicit models on several vision [3, 6], language [2], and graph [4] tasks. These works treat the evolution of the intermediate hidden states as a certain form of dynamics, such as fixed-point equations [2, 3] or ordinary differential equations (ODEs) [1, 7], which represent infinitely many latent states. The forward passes of implicit models are therefore formulated as solving the underlying dynamics, by either black-box ODE solvers [1, 7] or root-finding algorithms [2, 3]. As for the backward passes, however, directly differentiating through the forward-pass trajectories could induce a heavy memory overhead [8, 9]. To this end, researchers have developed memory-efficient backpropagation via implicit differentiation, such as solving a Jacobian-based linear fixed-point equation for the backward pass of deep equilibrium models (DEQs) [2], which eventually makes the backpropagation trajectories independent of the forward passes. This technique allows one to train implicit models with essentially constant memory consumption, as we only need to store the final output and the layer itself without saving any intermediate states. However, in order to estimate the exact gradient by implicit differentiation, implicit models have to rely on expensive black-box solvers for the backward passes, e.g., ODE solvers or root-solving algorithms. These black-box solvers usually make the gradient computation very costly in practice; training state-of-the-art implicit models on ImageNet [10] can take weeks even with 8 GPUs.

This work investigates fast approximate gradients for training implicit models. We found that a first-order oracle that produces good gradient estimates is enough to efficiently and effectively train implicit models, circumventing the laborious computation of the exact gradient as in prior arts [2, 3, 4, 11, 12].
We develop a framework in which a balanced trade-off is made between the precision and conditioning of the gradient estimate. Specifically, we provide the general condition under which the phantom gradient provides an ascent direction of the loss landscape. We further propose two instantiations of phantom gradients in the context of DEQ models, based on the damped fixed-point unrolling and the Neumann series, respectively. Importantly, we show that our proposed instantiations satisfy the theoretical condition, and that the stochastic gradient descent (SGD) algorithm based on the phantom gradient enjoys a sound convergence property as long as the relevant hyperparameters, e.g., the damping factor, are wisely selected. Note that our method only affects, and thus accelerates, the backward formulation of the implicit models, leaving the forward-pass formulation (i.e., the root-solving process) and the inference behavior unchanged, so that our method is applicable to a wide range of implicit models, forward solvers, and inference strategies.

We conduct an extensive set of synthetic, ablation, and large-scale experiments to both analyze the theoretical properties of the phantom gradient and validate its speedup and performance on various tasks, such as ImageNet [10] classification and Wikitext-103 [13] language modeling. Overall, our results suggest that: 1) the phantom gradient estimates an ascent direction; 2) it is applicable to large-scale tasks and is capable of achieving a strong performance which is comparable with or even better than that of the exact gradient; and 3) it significantly shortens the total training time needed for implicit models, roughly by a factor of 1.4–1.7×, and even accelerates the backward passes by an astonishing 12× on ImageNet. We believe that our results provide strong evidence for effectively training implicit models with the lightweight phantom gradient.

2.1 Inspection of Implicit Differentiation

In this work, we primarily focus on the formulation of implicit models based on root-solving, represented by the DEQ models [2]. The table of notations is provided in Appendix A. Specifically, given an equilibrium module F, the output of the implicit model is characterized by the solution h* to the following fixed-point equation:

$$h^* = F(h^*, z), \tag{1}$$

where z ∈ ℝ^{d_u+d_θ} is the union of the module's input u ∈ ℝ^{d_u} and parameters θ ∈ ℝ^{d_θ}, i.e., z = [u^⊤, θ^⊤]^⊤. Here, u is usually a projection of the original data point x ∈ ℝ^{d_x}, e.g., u = M(x). In this section, we assume F is a contraction mapping w.r.t. h so that its Lipschitz constant L_h w.r.t. h is less than one, i.e., L_h < 1, a setting that has been analyzed in recent works [14, 15].¹

To differentiate through the fixed point of Eq. (1), we need to calculate the gradient of h* w.r.t. the input z. By the Implicit Function Theorem (IFT), we have

$$\frac{\partial h^*}{\partial z} = \frac{\partial F}{\partial z}\left(I - \frac{\partial F}{\partial h}\right)^{-1}\bigg|_{h=h^*}. \tag{2}$$

Here, (∂a/∂b)_{ij} = ∂a_j/∂b_i. The equilibrium point h* of Eq. (1) is then passed to a post-processing function G to obtain a prediction ŷ = G(h*). In the generic learning scenario, the training objective is the following expected loss:

$$R(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{P}}\left[\mathcal{L}(\hat{y}(x;\theta), y)\right], \tag{3}$$

where y is the ground truth corresponding to the training example x, and 𝒫 is the data distribution. Here, we omit the parameters of G, because given the output h* of the implicit module F, training the post-processing part G is the same as training explicit neural networks.
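To make Eq. (2) concrete, the following is a minimal PyTorch-style sketch (not the authors' implementation) of how the exact gradient of the loss w.r.t. z is typically obtained: the Jacobian-inverse is never materialized, but a linear fixed-point system still has to be solved with repeated vector-Jacobian products. The names `F`, `z`, `dL_dh`, and the plain iteration used as the inner linear solver are illustrative assumptions; real DEQ code typically runs Broyden's method here instead.

```python
import torch

def implicit_backward(F, h_star, z, dL_dh, n_iters=30, tol=1e-6):
    """Exact gradient w.r.t. z via implicit differentiation (Eq. (2)) -- a sketch.

    Solves the linear fixed point  u = dL/dh + (dF/dh)^T u  by plain iteration,
    then returns the vector-Jacobian product of dF/dz with u, i.e., dL/dz.
    Assumes z is a tensor with requires_grad=True that F actually uses.
    """
    h_star = h_star.detach().requires_grad_(True)
    with torch.enable_grad():
        f = F(h_star, z)                       # re-evaluate F once at the equilibrium
    u = torch.zeros_like(dL_dh)
    for _ in range(n_iters):                   # iterative linear solve for the backward pass
        vjp = torch.autograd.grad(f, h_star, grad_outputs=u, retain_graph=True)[0]
        u_next = dL_dh + vjp
        if (u_next - u).norm() <= tol * (u_next.norm() + 1e-12):
            u = u_next
            break
        u = u_next
    return torch.autograd.grad(f, z, grad_outputs=u)[0]   # (dF/dz)^T u = dL/dz
```

Each iteration of this inner solve costs one vector-Jacobian product through F, which is precisely the backward-pass expense that motivates the cheaper estimate developed in the rest of the paper.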
The most crucial component is the gradient of the loss function L w.r.t. the input vector z = [u^⊤, θ^⊤]^⊤, which is used to train both the implicit module F and the input projection module M. Using Eq. (2) with the condition h = h*, we have

$$\frac{\partial \mathcal{L}}{\partial u} = \frac{\partial F}{\partial u}\left(I - \frac{\partial F}{\partial h}\right)^{-1}\frac{\partial \mathcal{L}}{\partial h}, \qquad \frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial F}{\partial \theta}\left(I - \frac{\partial F}{\partial h}\right)^{-1}\frac{\partial \mathcal{L}}{\partial h}. \tag{4}$$

The gradients in Eq. (4) take the same form w.r.t. u and θ. Without loss of generality, we only discuss the gradient w.r.t. θ in the following sections.

¹Note that the contraction condition could be largely relaxed to requiring that the spectral radius of ∂F/∂h on the given data is less than 1, i.e., ρ(∂F/∂h) < 1, as indicated by the well-posedness condition in [5].

2.2 Motivation

The most intriguing part lies in the Jacobian-inverse term, i.e., (I − ∂F/∂h)^{-1}. Computing the inverse term by brute force is intractable due to the O(n³) complexity. Previous implicit models [2] approach this by solving a linear system involving Jacobian-vector products iteratively with a black-box solver, introducing over 30 Broyden [16] iterations in the backward pass. However, the scale of the Jacobian matrix can exceed 10⁶ × 10⁶ in real scenarios, leading to a prohibitive cost in computing the exact gradient. For example, training a small-scale state-of-the-art implicit model on ImageNet can consume weeks using 8 GPUs, while training explicit models usually takes days, demonstrating that pursuing the exact gradient severely slows down the training of implicit models compared with explicit models.

Secondly, because of the inversion operation, we cast doubt on the conditioning of the gradient and the stability of the training process from a numerical perspective. The Jacobian-inverse can be numerically unstable when ill-conditioning occurs, and the conditioning problem might further undermine training stability, as studied in the recent work [17]. Moreover, inexact gradients [9, 18, 19, 20, 21, 22] have been widely applied in previous learning protocols, such as linear backpropagation [23] and synthetic gradients [24]. Here, the Jacobian-inverse serves only to calculate the exact gradient, which is not always optimal for model training. In addition, previous research has used moderate gradient noise as a regularization approach [25], which has been shown to play a central role in escaping poor local minima and improving generalization ability [26, 27, 28].

These concerns and observations motivate us to rethink the possibility of replacing the Jacobian-inverse term in standard implicit differentiation with a cheaper and more stable counterpart. We believe that an exact gradient estimate is not always required, especially for a black-box layer like those in implicit models. Hence, this work designs an inexact, theoretically sound, and practically efficient gradient for training implicit models under various settings. We name the proposed gradient estimate the phantom gradient. Suppose the Jacobian ∂h*/∂θ is replaced with a matrix A; the corresponding phantom gradient is defined as

$$\widehat{\frac{\partial \mathcal{L}}{\partial \theta}} := A\,\frac{\partial \mathcal{L}}{\partial h}. \tag{5}$$

Next, we give the general condition on A so that the phantom gradient can be guaranteed valid for optimization (Sec. 2.3), and provide two concrete instantiations of A based on either damped fixed-point unrolling or the Neumann series (Sec. 2.4). The proofs of all our theoretical results are presented in Appendix C.
2.3 General Condition on the Phantom Gradient

Previous research on the theoretical properties of inexact gradients covers several aspects, such as the gradient direction [22], the unbiasedness of the estimator [29], and the convergence theory of the stochastic algorithm [19, 30]. The following theorem formulates a sufficient condition under which the phantom gradient gives an ascent direction of the loss landscape.

Theorem 1. Suppose the exact gradient and the phantom gradient are given by Eq. (4) and Eq. (5), respectively. Let σ_max and σ_min be the maximal and minimal singular values of ∂F/∂θ. If

$$\left\| A\left(I - \frac{\partial F}{\partial h}\right) - \frac{\partial F}{\partial \theta} \right\| < \frac{\sigma_{\min}^2}{\sigma_{\max}}, \tag{6}$$

then the phantom gradient provides an ascent direction of the function L, i.e.,

$$\left\langle \widehat{\frac{\partial \mathcal{L}}{\partial \theta}},\ \frac{\partial \mathcal{L}}{\partial \theta} \right\rangle > 0. \tag{7}$$

Remark 1. Suppose only the (I − ∂F/∂h)^{-1} term is replaced with a matrix D, namely, A = (∂F/∂θ) D. Then, the condition in (6) can be reduced to

$$\left\| D\left(I - \frac{\partial F}{\partial h}\right) - I \right\| < \frac{1}{\kappa^2}, \tag{8}$$

where κ is the condition number of ∂F/∂θ. (See Appendix C.1 for the derivation.)

2.4 Instantiations of the Phantom Gradient

In this section, we present two practical instantiations of the phantom gradient. We also verify that the general condition in Theorem 1 can be satisfied if the hyperparameters in our instantiations are wisely selected. Suppose we hope to differentiate through implicit dynamics, e.g., either a root-solving process or an optimization problem. In the context of hyperparameter optimization (HO), previous solutions include differentiating through the unrolled steps of the dynamics [19] or employing the Neumann series of the Jacobian-inverse term [9]. In our case, if we solve the root of Eq. (1) via the fixed-point iteration

$$h_{t+1} = F(h_t, z), \quad t = 0, 1, \ldots, T-1, \tag{9}$$

then by differentiating through the unrolled steps of Eq. (9), we have

$$\frac{\partial h_T}{\partial \theta} = \sum_{t=0}^{T-1} \frac{\partial F(h_t, z)}{\partial \theta} \prod_{s=t+1}^{T-1} \frac{\partial F(h_s, z)}{\partial h}. \tag{10}$$

Besides, the Neumann series of the Jacobian-inverse (I − ∂F/∂h)^{-1} is

$$\left(I - \frac{\partial F}{\partial h}\right)^{-1} = I + \frac{\partial F}{\partial h} + \left(\frac{\partial F}{\partial h}\right)^2 + \left(\frac{\partial F}{\partial h}\right)^3 + \cdots. \tag{11}$$

Notably, computing the Jacobian ∂h*/∂θ using the Neumann series in (11) is equivalent to differentiating through the unrolled steps of Eq. (9) at the exact equilibrium point h* and taking the limit of infinitely many steps [9]. Without altering the root of Eq. (1), we consider a damped variant of the fixed-point iteration:

$$h_{t+1} = F_\lambda(h_t, z) = \lambda F(h_t, z) + (1-\lambda) h_t, \quad t = 0, 1, \ldots, T-1. \tag{12}$$

Differentiating through the unrolled steps of Eq. (12), Eq. (10) is adapted as

$$\frac{\partial h_T}{\partial \theta} = \lambda \sum_{t=0}^{T-1} \frac{\partial F(h_t, z)}{\partial \theta} \prod_{s=t+1}^{T-1} \left( \lambda\,\frac{\partial F(h_s, z)}{\partial h} + (1-\lambda) I \right). \tag{13}$$

The Neumann series of (I − ∂F/∂h)^{-1} is correspondingly adapted as

$$\lambda\left(I + B + B^2 + B^3 + \cdots\right), \quad \text{where } B = \lambda\,\frac{\partial F}{\partial h} + (1-\lambda) I. \tag{14}$$

The next theorem shows that under mild conditions, the Jacobian from the damped unrolling in Eq. (13) converges to the exact Jacobian, and the Neumann series in (14) converges to the Jacobian-inverse (I − ∂F/∂h)^{-1} as well.

Theorem 2. Suppose the Jacobian ∂F/∂h is a contraction mapping. Then, (i) the Neumann series in (14) converges to the Jacobian-inverse (I − ∂F/∂h)^{-1}; and (ii) if the function F is continuously differentiable w.r.t. both h and θ, the sequence in Eq. (13) converges to the exact Jacobian ∂h*/∂θ as T → ∞, i.e.,

$$\lim_{T\to\infty} \frac{\partial h_T}{\partial \theta} = \frac{\partial h^*}{\partial \theta}. \tag{15}$$

However, as discussed in Sec. 2.2, it is unnecessary to compute the exact gradient with infinitely many terms. In the following, we introduce two instantiations of the phantom gradient based on the finite-term truncation of Eq. (13) or (14).

Unrolling-based Phantom Gradient (UPG). In the unrolling form, the matrix A is defined as

$$A^{\mathrm{unr}}_{k,\lambda} = \lambda \sum_{t=0}^{k-1} \frac{\partial F(h_t, z)}{\partial \theta} \prod_{s=t+1}^{k-1} \left( \lambda\,\frac{\partial F(h_s, z)}{\partial h} + (1-\lambda) I \right). \tag{16}$$

Neumann-series-based Phantom Gradient (NPG). In the Neumann form, the matrix A is defined as

$$A^{\mathrm{neu}}_{k,\lambda} = \lambda\,\frac{\partial F}{\partial \theta}\left(I + B + B^2 + \cdots + B^{k-1}\right), \quad \text{where } B = \lambda\,\frac{\partial F}{\partial h} + (1-\lambda) I. \tag{17}$$

Note that both the initial point of the fixed-point iteration (i.e., h_0 in Eq. (16)) and the point at which the Neumann series is evaluated (i.e., h* in Eq. (17)) are the solution returned by the root-finding solver. (See Appendix B for the implementation of the phantom gradient.)
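To make Eq. (17) concrete, the following is a minimal PyTorch-style sketch of the NPG (not the authors' implementation; `F`, `z`, and `dL_dh` are illustrative placeholders, and `z` is assumed to be a tensor with requires_grad=True). The matrix A^neu_{k,λ} is never formed explicitly; its action on ∂L/∂h is accumulated with k−1 vector-Jacobian products evaluated at the solver output h*:

```python
import torch

def neumann_phantom_grad(F, h_star, z, dL_dh, k=5, lam=0.5):
    """Sketch of the Neumann-series-based phantom gradient (NPG), Eq. (17).

    Returns an approximation of dL/dz built from the first k terms of the
    Neumann series, using only vector-Jacobian products at the solver output.
    """
    h_star = h_star.detach().requires_grad_(True)
    with torch.enable_grad():
        f = F(h_star, z)                      # one extra evaluation of F at h*
    v = dL_dh                                 # current term   B^i dL/dh
    s = dL_dh                                 # running sum    (I + B + ... + B^i) dL/dh
    for _ in range(k - 1):
        vjp = torch.autograd.grad(f, h_star, grad_outputs=v, retain_graph=True)[0]
        v = lam * vjp + (1.0 - lam) * v       # v <- B v,  B = lam * dF/dh + (1 - lam) * I
        s = s + v
    # phantom gradient:  lam * (dF/dz applied to the accumulated sum)
    return lam * torch.autograd.grad(f, z, grad_outputs=s)[0]
```

The UPG in Eq. (16) is obtained analogously by letting automatic differentiation unroll k damped steps of Eq. (12) starting from the solver output.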
According to Theorem 2, the matrix A defined by either Eq. (16) or Eq. (17) converges to the exact Jacobian ∂h*/∂θ as k → ∞ for any λ ∈ (0, 1]. Therefore, the condition in (6) can be satisfied if a sufficiently large step number k is selected, since

$$\left\| A\left(I - \frac{\partial F}{\partial h}\right) - \frac{\partial F}{\partial \theta} \right\| \to 0 \quad \text{as } k \to \infty. \tag{18}$$

Next, we characterize the impact of the two hyperparameters, i.e., k and λ, on the precision and conditioning of A. Take the NPG (Eq. (17)) as an example. (i) On the precision of the phantom gradient, a large k makes the gradient estimate more accurate, as higher-order terms of the Neumann series are included, while a small λ slows down the convergence of the Neumann series because the norm ‖B‖ increases as λ decreases. (ii) On the conditioning of the phantom gradient, a large k impairs the conditioning of A since the condition number of B^k grows exponentially as k increases, while a small λ helps maintain a small condition number of A because the singular values of ∂F/∂h are smoothed by the identity matrix.

In a word, a large k is preferable for a more accurate A, while a small λ contributes to the well-conditioning of A. In practice, these hyperparameters should be selected in consideration of a balanced trade-off between the precision and conditioning of A. See Sec. 3 for experimental results.

2.5 Convergence Theory

In this section, we provide the convergence guarantee of the SGD algorithm using the phantom gradient. We prove that under mild conditions, if the approximation error of the phantom gradient is sufficiently small, the SGD algorithm converges to an ϵ-approximate stationary point in the expectation sense. We discuss the feasibility of our assumptions in Appendix C.3.

Theorem 3. Suppose the loss function R in Eq. (3) is ℓ-smooth, lower-bounded, and has bounded gradients almost surely in the training process. Besides, assume the gradient in Eq. (4) is an unbiased estimator of ∇R(θ) with bounded covariance. If the phantom gradient in Eq. (5) is an ϵ-approximation to the gradient in Eq. (4), i.e.,

$$\left\| \widehat{\frac{\partial \mathcal{L}}{\partial \theta}} - \frac{\partial \mathcal{L}}{\partial \theta} \right\| \le \epsilon \quad \text{almost surely}, \tag{19}$$

then, using Eq. (5) as a stochastic first-order oracle with a step size of η_n = O(1/√n) to update θ with gradient descent, it follows after N iterations that

$$\mathbb{E}\left[\frac{\sum_{n=1}^{N} \eta_n \left\|\nabla R(\theta_n)\right\|^2}{\sum_{n=1}^{N} \eta_n}\right] \le O\left(\epsilon + \frac{\log N}{\sqrt{N}}\right). \tag{20}$$

Remark 2. Consider the condition in (19): suppose the gradient ∂L/∂h is almost surely bounded. By Theorem 2, the condition in (19) can be guaranteed as long as a sufficiently large k is selected.

Figure 1: Cosine similarity between the phantom gradient and the exact gradient in the synthetic setting, for k = 1, …, 5 and λ ∈ {0.5, 1.0}. (a) Neumann-series-based phantom gradient; (b) unrolling-based phantom gradient.

3 Experiments

In this section, we aim to answer the following questions via empirical results: (1) How precise is the phantom gradient? (2) What is the difference between the unrolling-based and the Neumann-series-based phantom gradients? (3) How is the phantom gradient influenced by the hyperparameters k and λ? (4) How does the computational cost of the phantom gradient compare with that of implicit differentiation? (5) Can the phantom gradient work in large-scale settings for various tasks?
We have provided some theoretical analysis and intuitions for (1), (2), and (3) in Sec. 2.4. Now we answer (1) and (2) and demonstrate the performance curves under different hyperparameters k and λ on CIFAR-10 [31]. Besides, we also study other factors that potentially influence the training process of state-of-the-art implicit models [2, 3], such as pretraining. For (4) and (5), we conduct experiments on large-scale datasets to highlight the ultra-fast speed and competitive performance, including image classification on ImageNet [10] and language modeling on Wikitext-103 [13].

We start by introducing two experiment settings. We adopt a single-layer neural network with spectral normalization [32] as the function F and fixed-point iterations as the equilibrium solver, which is the synthetic setting. Moreover, on the CIFAR-10 dataset, we use the MDEQ-Tiny model [3] (170K parameters) as the backbone model, denoted as the ablation setting. Additional implementation details and experimental results are presented in Appendix D.²

²All training sources of this work are available at https://github.com/Gsunshine/phantom_grad.

Precision of the Phantom Gradient. The precision of the phantom gradient is measured by its angle against the exact gradient, indicated by the cosine similarity between the two. We discuss its precision in both the synthetic setting and the ablation setting. The former uses static, randomly generated weights, while the latter characterizes the training dynamics.

In the synthetic setting, the function F is restricted to be a contraction mapping. Specifically, we directly set the Lipschitz constant of F as L_h = 0.9, and use 100 fixed-point iterations to solve the root h* of Eq. (1) until the relative error satisfies ‖h − F(h, z)‖/‖h‖ < 10⁻⁵. Here, the exact gradient is estimated by backpropagation through the fixed-point iterations, and cross-validated by implicit differentiation solved with 20 iterations of Broyden's method [16]. In our experiment, the cosine similarity between these two gradient estimates consistently exceeds 0.9999, indicating that the gradient estimate is quite accurate when the relative error of the forward solver is small. The cosine similarity between phantom gradients and exact gradients is shown in Fig. 1. It shows that the cosine similarity tends to increase as k grows and that a small λ tends to slow down the convergence of the phantom gradient, allowing it to explore a wider range of angles against the exact gradient.
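A rough, self-contained sketch of this synthetic precision check is given below, assuming PyTorch. The dimensions, the tanh layer, the explicit rescaling of the weight's spectral norm (standing in for spectral normalization), and the solver budget are illustrative choices that merely mimic the described setting; the "exact" gradient is approximated by backpropagation through the fixed-point trajectory as in the text, and the phantom gradient is the UPG-style k-step damped unrolling from the solver output.

```python
import torch

torch.manual_seed(0)
d = 64
W = torch.randn(d, 2 * d)
W = 0.9 * W / torch.linalg.matrix_norm(W, ord=2)        # upper-bound the Lipschitz constant w.r.t. h by 0.9
F = lambda h, z: torch.tanh(torch.cat([h, z], dim=-1) @ W.T)

z = torch.randn(1, d, requires_grad=True)
v = torch.randn(1, d)                                   # stand-in for dL/dh at the equilibrium

# "exact" gradient: backprop through a long fixed-point trajectory
h = torch.zeros(1, d)
for _ in range(100):
    h = F(h, z)
g_exact = torch.autograd.grad(h, z, grad_outputs=v)[0]

# phantom gradient (UPG-style): differentiate through k damped steps from the solver output
k, lam = 3, 0.5
h_k = h.detach()
for _ in range(k):
    h_k = lam * F(h_k, z) + (1 - lam) * h_k
g_phantom = torch.autograd.grad(h_k, z, grad_outputs=v)[0]

cos = torch.nn.functional.cosine_similarity(g_exact.flatten(), g_phantom.flatten(), dim=0)
print(f"cosine similarity: {cos.item():.4f}")           # typically close to 1 in this contractive setting
```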
In the ablation setting, the precision of the phantom gradient during the training process is shown in Fig. 2. The model is trained by implicit differentiation under the official schedule.³ It shows that the phantom gradient still provides an ascent direction in the real training process, as indicated by the considerable cosine similarity against the exact gradient. Interestingly, the cosine similarity slightly decays as the training progresses, which suggests the possibility of constructing an adaptive gradient solver for implicit models.

³Code available at https://github.com/locuslab/mdeq.

Figure 2: Cosine similarity between the phantom gradient and the exact gradient in the real scenario. (a) Neumann-series-based phantom gradient; (b) unrolling-based phantom gradient. The horizontal axis corresponds to the cosine similarity, and the vertical axis to the training step.

To Pretrain, or not to Pretrain? To better understand the components of the training schedule of implicit models, we first present a detailed ablation study of the baseline model in the ablation setting. The average accuracy with standard deviation is reported in Tab. 1.

Table 1: Ablation settings on CIFAR-10.

| Method | Acc. (%) |
| --- | --- |
| Implicit Differentiation | 85.0 ± 0.2 |
| w/o Pretraining | 82.3 ± 1.3 |
| w/o Dropout | 83.7 ± 0.1 |
| Adam → SGD | 84.5 ± 0.3 |
| SGD w/o Pretraining | 82.9 ± 1.5 |
| UPG (A_{5,0.5}, w/o Dropout) | 85.8 ± 0.5 |
| NPG (A_{5,0.5}, w/o Dropout) | 85.6 ± 0.5 |
| UPG (A_{9,0.5}, w/ Dropout) | 86.1 ± 0.5 |

The MDEQ model employs a pretraining stage in which the model F is unrolled as a recurrent network. We study the impact of the pretraining stage, the Dropout [33] operation, and the optimizer separately. It can be seen that the unrolled pretraining stabilizes the training of the MDEQ model. Removing the pretraining stage leads to a severe performance drop and apparent training instability across different trials, because the solver cannot obtain an accurate fixed point h* when the model is not adequately trained. This ablation study also suggests that the MDEQ model is a strong baseline for our method to compare with. However, pretraining is not always indispensable for training implicit models, and it introduces an extra hyperparameter, i.e., how many steps should be involved in the pretraining stage. Next, we discuss how the UPG could circumvent this issue.

Trade-offs between Unrolling and Neumann. For an exact fixed point h*, i.e., h* = F(h*, z), there is no difference between the UPG and the NPG. However, when numerical error exists in solving h*, i.e., ‖h* − F(h*, z)‖ > 0, these two instantiations of the phantom gradient can behave differently.

Table 2: Complexity comparison. Mem means the memory cost, and K and k denote the solver's steps and the unrolling/Neumann steps, respectively. Here, K ≫ k ≥ 1.

| Method | Time | Mem | Peak Mem |
| --- | --- | --- | --- |
| Implicit | O(K) | O(1) | O(k) |
| UPG | O(k) | O(k) | O(k) |
| NPG | O(k) | O(1) | O(1) |

We note that a particular benefit of the UPG is its ability to automatically switch between the pretraining and training stages for implicit models. When the model is not sufficiently trained and the solver converges poorly (see [3]), the UPG defines a forward computation graph that is essentially equivalent to a shallow weight-tied network to refine the coarse equilibrium states. In this stage, the phantom gradient serves as a backpropagation-through-time (BPTT) algorithm and hence behaves as in the pretraining stage. Then, as training progresses, the solver becomes more stable and converges to the fixed point h* better, which makes the UPG behave more like the NPG. Therefore, the unrolled pretraining is gradually transitioned into the regular training phase based on implicit differentiation, and the hyperparameter tuning of pretraining steps can be waived. We argue that such an ability to adaptively switch training stages is benign to the training protocol of implicit models, which is also supported by the performance gain in Tab. 1.

Figure 3: Ablation studies on (a) the hyperparameters λ and k, and (b) the two forms of the phantom gradient.

Although the UPG requires a higher memory overhead than implicit differentiation or the NPG, it does not surpass the peak memory usage of the entire training protocol based on implicit differentiation, owing to the latter's pretraining stage. In the ablation setting, the MDEQ model employs a 10-layer unrolling for pretraining, which actually consumes double the memory of a 5-step unrolling scheme, e.g., A_{5,0.5} in Tab. 1.
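As a rough illustration of this memory profile and of how the UPG subsumes the unrolled pretraining, the sketch below shows how the UPG is typically wired into a single training step, assuming a PyTorch-style setup; `equilibrium_solver`, `F`, `input_proj`, `head`, and `loss_fn` are illustrative placeholders rather than the authors' API. Only the k damped steps unrolled from the solver output are recorded by autograd, so the activation memory grows with k while the solver itself runs gradient-free.

```python
import torch

def upg_training_step(F, equilibrium_solver, input_proj, head, loss_fn, x, y, k=5, lam=0.5):
    """One training step with the unrolling-based phantom gradient (UPG) -- a sketch."""
    z = input_proj(x)                                  # u = M(x); its gradients flow through the k steps
    with torch.no_grad():
        h = equilibrium_solver(F, z)                   # forward: solve h* = F(h*, z) by a black-box solver
    h = h.detach()
    for _ in range(k):                                 # damped unrolling (Eq. (12)) started at h*
        h = lam * F(h, z) + (1.0 - lam) * h
    loss = loss_fn(head(h), y)                         # prediction y_hat = G(h)
    loss.backward()                                    # backprop through the k-step graph only
    return loss
```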
In Tab. 2, we also demonstrate the time and memory complexity of implicit differentiation and the two forms of the phantom gradient.

In addition, leaving the context of approximate and exact gradients aside, we also develop insights into the subtle differences between the UPG and the NPG in terms of state-dependent and state-free gradients. The UPG is state-dependent, which means it corresponds to the exact gradient of a computational sub-graph. Both the NPG and the gradient solved by implicit differentiation, however, do not exactly match the gradient of any forward computation graph, unless the numerical error is entirely eliminated in both the forward and the backward passes for implicit differentiation, or the one-step gradient [34, 21, 30, 22] is used for the NPG, i.e., k = 1. Interestingly, we observe that models trained on the state-dependent gradient demonstrate an additional training stability regarding the Jacobian spectral radius, compared with those trained on the state-free counterpart. We empirically note that this can be seen as a certain form of implicit Jacobian regularization for implicit models, a supplement to the explicit counterpart [17], indicated by the stable estimated Jacobian spectral radius during training, i.e., ρ(∂F_λ/∂h) ≈ 1.

The experimental results also cohere with our insight. The performance curves in Fig. 3 demonstrate the influence of λ and k and further validate that the UPG is more robust to a wide range of steps k than the NPG. In fact, when the Jacobian spectral radius ρ(∂F/∂h) increases freely without proper regularization, the exploding high-order terms in the NPG could exert a negative impact on the overall direction of the phantom gradient, leading to performance degradation when a large k is selected (see Fig. 3(b)). We also observe a surge of the estimated Jacobian spectral radius as well as the gradient explosion issue in the NPG experiments. On the contrary, the UPG can circumvent these problems thanks to its implicit Jacobian regularization.

Table 3: Experiments using DEQ [2] and MDEQ [3] on vision and language tasks. Metrics stand for accuracy (%) for image classification on CIFAR-10 and ImageNet, and perplexity for language modeling on Wikitext-103. JR stands for Jacobian Regularization [17]. † indicates additional steps in the forward equilibrium solver.

| Datasets | Model | Method | Params | Metrics | Speed |
| --- | --- | --- | --- | --- | --- |
| CIFAR-10 | MDEQ | Implicit | 10M | 93.8 ± 0.17 | 1.0× |
| CIFAR-10 | MDEQ | UPG A_{5,0.5} | 10M | 95.0 ± 0.16 | 1.4× |
| ImageNet | MDEQ | Implicit | 18M | 75.3 | 1.0× |
| ImageNet | MDEQ | UPG A_{5,0.6} | 18M | 75.7 | 1.7× |
| Wikitext-103 | DEQ (Post-LN) | Implicit | 98M | 24.0 | 1.0× |
| Wikitext-103 | DEQ (Post-LN) | UPG A_{5,0.8} | 98M | 25.7 | 1.7× |
| Wikitext-103 | DEQ (Pre-LN) | JR + Implicit | 98M | 24.5 | 1.7× |
| Wikitext-103 | DEQ (Pre-LN) | JR + UPG A_{5,0.8} | 98M | 24.4 | 2.2× |
| Wikitext-103 | DEQ (Pre-LN) | JR + UPG A_{5,0.8}† | 98M | 24.0 | 1.7× |

Phantom Gradient at Scale. We conduct large-scale experiments to verify the advantages of the phantom gradient on vision, graph, and language benchmarks, adopting the UPG throughout. The results are illustrated in Tab. 3 and Tab. 4. Our method matches or surpasses the implicit-differentiation training protocol on state-of-the-art implicit models with a visible reduction in training time. When only the backward pass is considered, the acceleration for MDEQ can reach a remarkable 12× on ImageNet classification.
Table 4: Experiments using IGNN [4] on graph tasks. Metrics stand for accuracy (%) for graph classification on COX2 and PROTEINS, and Micro-F1 (%) for node classification on PPI.

| Datasets | Model | Method | Params | Metrics (%) |
| --- | --- | --- | --- | --- |
| COX2 | IGNN | Implicit | 38K | 84.1 ± 2.9 |
| COX2 | IGNN | UPG A_{5,0.5} | 38K | 83.9 ± 3.0 |
| COX2 | IGNN | UPG A_{5,0.8} | 38K | 83.9 ± 2.7 |
| COX2 | IGNN | UPG A_{5,1.0} | 38K | 83.0 ± 2.9 |
| PROTEINS | IGNN | Implicit | 34K | 78.6 ± 4.1 |
| PROTEINS | IGNN | UPG A_{5,0.5} | 34K | 78.4 ± 4.2 |
| PROTEINS | IGNN | UPG A_{5,0.8} | 34K | 78.6 ± 4.2 |
| PROTEINS | IGNN | UPG A_{5,1.0} | 34K | 78.8 ± 4.2 |
| PPI | IGNN | Implicit | 4.7M | 97.6 |
| PPI | IGNN | UPG A_{5,0.5} | 4.7M | 98.2 |
| PPI | IGNN | UPG A_{5,0.8} | 4.7M | 97.4 |
| PPI | IGNN | UPG A_{5,1.0} | 4.7M | 96.2 |

4 Related Work

Implicit Models. Implicit models generalize the recursive forward/backward rules of neural networks and characterize their internal mechanism by some pre-specified dynamics. Based on the dynamics, the implicit mechanisms can be broadly categorized into three classes: ODE-based [1, 7], root-solving-based [2, 3, 4, 5, 11], and optimization-based [35, 36, 37] implicit models. The ODE-based implicit models [1, 7] treat the iterative update rules of residual networks as the Euler discretization of an ODE, which can be solved by any black-box ODE solver. The gradient of the differential equation is calculated using the adjoint method [38], in which the adjoint state is obtained by solving another ODE. The root-solving-based implicit models [2, 5, 3, 4, 6, 11, 22] characterize layers of neural networks by solving fixed-point equations, which are solved by either a black-box root-finding solver [2, 3] or fixed-point iterations [4, 22]. The optimization-based implicit models [35, 36, 37, 39, 34, 21, 40, 41] leverage optimization programs as layers of neural networks. Previous works have studied differentiable layers of quadratic programming [35], submodular optimization [36], maximum satisfiability (MAXSAT) problems [37], and structured decomposition [21]. As for the backward passes, implicit differentiation is applied to the problem-defining equations of the root-solving-based models [2, 3] or the KKT conditions of the optimization-based models [35]. As such, the gradient can be obtained by solving the backward linear system. In this work, we focus on the root-solving-based implicit models.

Theoretical works on root-solving-based implicit models include well-posedness [5, 4], monotone operators [11], global convergence [42, 41], and Lipschitz analysis [15]. We look into the theoretical aspects of gradient-based algorithms for training implicit models and provide efficient practical guidance. With these considerations, we show that implicit models of the same architecture can enjoy faster training and strong generalization in practical applications by using the phantom gradient.

Non-End-to-End Optimization in Deep Learning. Non-end-to-end optimization aims to replace the standard gradient-based training of deep architectures with modular or weakly modular training without the entire forward and backward passes. Currently, there are mainly three research directions in this field, namely, auxiliary variable methods [43, 44, 45, 46, 47, 48, 49], target propagation [50, 51, 52], and synthetic gradients [24, 53, 54]. The auxiliary variable methods [43, 44, 45, 46, 47, 48, 49] formulate the optimization of neural networks as constrained optimization problems, in which the layer-wise activations are considered as trainable auxiliary variables.
Then, the equality constraints are relaxed into penalty terms added to the objectives, so that the parameters and auxiliary variables can be divided into blocks and thus optimized in parallel. The target propagation method [50, 51, 52] trains each module by having its activations regress to pre-assigned targets, which are propagated backwards from the downstream modules. Specifically, an auto-encoder architecture is used to reconstruct targets at each layer. Finally, the synthetic gradient method [24, 53, 54] estimates the local gradient of neural networks using auxiliary models, and employs the synthetic gradient in place of the exact gradient to perform parameter updates. In this way, the forward and backward passes are decoupled and can be executed in an asynchronous manner. Our work is in line with the non-end-to-end optimization research since we also aim to decouple the forward and backward passes of neural networks. However, we show that finding a reasonable target or a precise gradient estimate is not always necessary in training deep architectures. Our paper paves a path showing that an inexact but well-conditioned gradient estimate can contribute to both fast training and competitive generalization of implicit models.

Differentiation through Implicit Dynamics. Differentiation through certain implicit dynamics is an important aspect of a wide range of research fields, including bilevel optimization [19, 9], meta-learning [18, 55, 30], and sensitivity analysis [56]. Since the gradient usually cannot be computed analytically, researchers have to implicitly differentiate the dynamics at the converged point. The formula of the gradient typically contains a Jacobian-inverse (or Hessian-inverse) term, which is computationally prohibitive for large-scale models. (See Eq. (2) in our case.) Hence, several techniques have been developed in the previous literature to approximate the matrix inverse. An intuitive solution is to differentiate through the unrolled steps of a numerical solver of the dynamics [57, 58, 8]. In particular, if a single step is unrolled, it reduces to the well-known one-step gradient [59, 18, 60, 34, 30, 21, 22], in which the inverse of the Jacobian/Hessian is simply approximated by an identity matrix. On the contrary, unrolling a small number of steps may induce a bias [9], while the memory and computational costs grow linearly as the number of unrolled steps increases. To address this issue, Shaban et al. [19] propose to truncate the long-term dependencies and differentiate through only the last L steps. In fact, if the dynamics have converged to a stationary point, the finite-term truncation in Shaban et al. [19] is exactly the Neumann approximation of the Jacobian-inverse with the first L terms. Based on this, Lorraine et al. [9] directly use the truncated Neumann series as an approximation of the Jacobian-inverse. Besides the unrolling-based methods, optimization-based approaches [61, 55] have been studied in this field as well. Since the Jacobian-inverse-vector product can be viewed as the solution of a linear system, algorithms like the conjugate gradient method can be used to solve it.

5 Limitation and Future Work

The main limitation of this work lies in the hyperparameter tuning of the phantom gradient, especially the damping factor λ, which directly controls the gradient's precision and conditioning, the implicit Jacobian regularization for the UPG, the stability of the NPG, and the final generalization behavior.
However, this has not been a hindrance to the application of phantom gradients in training implicit models, as one can tune the hyperparameter according to the validation loss in the early training stage. Regarding future work, we would like to highlight the following aspects: (1) eliminating the bias of the current phantom gradient, (2) constructing an adaptive gradient solver for implicit models, (3) analyzing the damping factor to provide practical guidance, (4) investigating the implicit Jacobian regularization, and (5) understanding how different kinds of gradient noise can impact the training of implicit models under different loss landscapes.

6 Conclusion

In this work, we explore the possibility of training implicit models via the efficient, approximate phantom gradient. We systematically analyze the general condition on a gradient estimate under which the implicit model is guaranteed to converge to an approximate stationary point of the loss function. Specifically, we give a sufficient condition under which a first-order oracle can always find an ascent direction of the loss landscape in the training process. Moreover, we introduce two instantiations of the proposed phantom gradient, based on either the damped fixed-point unrolling or the Neumann series. The proposed method shows a 1.4–1.7× acceleration with comparable or better performance on large-scale benchmarks. Overall, this paper provides a practical perspective on training implicit models with theoretical guarantees.

Acknowledgments

Zhouchen Lin was supported by the NSF China (Nos. 61625301 and 61731018), the NSFC Tianyuan Fund for Mathematics (No. 12026606), and Project 2020BD006 supported by the PKU-Baidu Fund. Yisen Wang was partially supported by the National Natural Science Foundation of China under Grant 62006153 and Project 2020BD006 supported by the PKU-Baidu Fund. Shaojie Bai was sponsored by a grant from the Bosch Center for Artificial Intelligence.

References

[1] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural Ordinary Differential Equations. In Neural Information Processing Systems (NeurIPS), 2018.

[2] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep Equilibrium Models. In Neural Information Processing Systems (NeurIPS), 2019.

[3] Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Multiscale Deep Equilibrium Models. In Neural Information Processing Systems (NeurIPS), pages 5238–5250, 2020.

[4] Fangda Gu, Heng Chang, Wenwu Zhu, Somayeh Sojoudi, and Laurent El Ghaoui. Implicit Graph Neural Networks. In Neural Information Processing Systems (NeurIPS), pages 11984–11995, 2020.

[5] Laurent El Ghaoui, Fangda Gu, Bertrand Travacca, Armin Askari, and Alicia Tsai. Implicit Deep Learning. SIAM Journal on Mathematics of Data Science, 3(3):930–958, 2021.

[6] Tiancai Wang, Xiangyu Zhang, and Jian Sun. Implicit Feature Pyramid Network for Object Detection. arXiv preprint arXiv:2012.13563, 2020.

[7] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented Neural ODEs. In Neural Information Processing Systems (NeurIPS), 2019.

[8] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and Reverse Gradient-Based Hyperparameter Optimization. In International Conference on Machine Learning (ICML), pages 1165–1173, 2017.

[9] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing Millions of Hyperparameters by Implicit Differentiation. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1540–1552, 2020.
[10] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet: Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[11] Ezra Winston and J. Zico Kolter. Monotone operator equilibrium networks. In Neural Information Processing Systems (NeurIPS), pages 10718–10728, 2020.

[12] Cheng Lu, Jianfei Chen, Chongxuan Li, Qiuhao Wang, and Jun Zhu. Implicit Normalizing Flows. In International Conference on Learning Representations (ICLR), 2021.

[13] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer Sentinel Mixture Models. In International Conference on Learning Representations (ICLR), 2017.

[14] Chirag Pabbaraju, Ezra Winston, and J. Zico Kolter. Estimating Lipschitz constants of monotone deep equilibrium models. In International Conference on Learning Representations (ICLR), 2021.

[15] Max Revay, Ruigang Wang, and Ian R. Manchester. Lipschitz Bounded Equilibrium Networks. arXiv preprint arXiv:2010.01732, 2020.

[16] Charles G. Broyden. A Class of Methods for Solving Nonlinear Simultaneous Equations. Mathematics of Computation, 19(92):577–593, 1965.

[17] Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Stabilizing Equilibrium Models by Jacobian Regularization. In International Conference on Machine Learning (ICML), 2021.

[18] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In International Conference on Machine Learning (ICML), pages 1126–1135, 2017.

[19] Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated Backpropagation for Bilevel Optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1723–1732, 2019.

[20] Jens Behrmann, David Kristjanson Duvenaud, and Jörn-Henrik Jacobsen. Invertible Residual Networks. In International Conference on Machine Learning (ICML), 2019.

[21] Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is Attention Better Than Matrix Decomposition? In International Conference on Learning Representations (ICLR), 2021.

[22] Samy Wu Fung, Howard Heaton, Qiuwei Li, Daniel McKenzie, Stanley J. Osher, and Wotao Yin. Fixed Point Networks: Implicit Depth Models with Jacobian-Free Backprop. arXiv preprint arXiv:2103.12803, 2021.

[23] Mehrdad Yazdani. Linear Backprop in non-linear networks. In Neural Information Processing Systems Workshops, 2018.

[24] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled Neural Interfaces using Synthetic Gradients. In International Conference on Machine Learning (ICML), pages 1627–1635, 2017.

[25] Xavier Gastaldi. Shake-Shake Regularization. arXiv preprint arXiv:1705.07485, 2017.

[26] Guozhong An. The Effects of Adding Noise During Backpropagation Training on a Generalization Performance. Neural Computation, 8(3):643–674, 1996.

[27] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects. In International Conference on Machine Learning (ICML), pages 7654–7663, 2019.
[28] Jingfeng Wu, Wenqing Hu, Haoyi Xiong, Jun Huan, Vladimir Braverman, and Zhanxing Zhu. On the Noisy Gradient Descent that Generalizes as SGD. In International Conference on Machine Learning (ICML), pages 10367–10376, 2020.

[29] Ricky T. Q. Chen, Jens Behrmann, David K. Duvenaud, and Jörn-Henrik Jacobsen. Residual Flows for Invertible Generative Modeling. In Neural Information Processing Systems (NeurIPS), pages 9916–9926, 2019.

[30] Xin-Yu Zhang, Taihong Xiao, Haolin Jia, Ming-Ming Cheng, and Ming-Hsuan Yang. Semi-Supervised Learning with Meta-Gradient. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 73–81, 2021.

[31] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, Citeseer, 2009.

[32] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral Normalization for Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), 2018.

[33] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[34] Bryan Wilder, Eric Ewing, Bistra Dilkina, and Milind Tambe. End to end learning and optimization on graphs. In Neural Information Processing Systems (NeurIPS), pages 4672–4683, 2019.

[35] Brandon Amos and J. Zico Kolter. OptNet: Differentiable Optimization as a Layer in Neural Networks. In International Conference on Machine Learning (ICML), pages 136–145, 2017.

[36] Josip Djolonga and Andreas Krause. Differentiable Learning of Submodular Models. In Neural Information Processing Systems (NeurIPS), pages 1013–1023, 2017.

[37] Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning (ICML), pages 6545–6554, 2019.

[38] V. G. Boltyanskiy, Revaz V. Gamkrelidze, Y. E. F. Mishchenko, and L. S. Pontryagin. Mathematical theory of optimal processes. 1962.

[39] Marin Vlastelica, Anselm Paulus, Vít Musil, Georg Martius, and Michal Rolínek. Differentiation of Blackbox Combinatorial Solvers. In International Conference on Learning Representations (ICLR), 2020.

[40] Stephen Gould, Richard Hartley, and Dylan Campbell. Deep Declarative Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021.

[41] Xingyu Xie, Qiuhao Wang, Zenan Ling, Xia Li, Yisen Wang, Guangcan Liu, and Zhouchen Lin. Optimization Induced Equilibrium Networks. arXiv preprint arXiv:2105.13228, 2021.

[42] Kenji Kawaguchi. On the Theory of Implicit Deep Learning: Global Convergence with Implicit Layers. In International Conference on Learning Representations (ICLR), 2020.

[43] Miguel Carreira-Perpinan and Weiran Wang. Distributed Optimization of Deeply Nested Systems. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 10–19, 2014.

[44] Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training Neural Networks Without Gradients: A Scalable ADMM Approach. In International Conference on Machine Learning (ICML), pages 2722–2731, 2016.

[45] Ziming Zhang, Yuting Chen, and Venkatesh Saligrama. Efficient Training of Very Deep Neural Networks for Supervised Hashing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1487–1495, 2016.
[46] Ziming Zhang and Matthew Brand. Convergent Block Coordinate Descent for Training Tikhonov Regularized Deep Neural Networks. In Neural Information Processing Systems (NeurIPS), 2017.

[47] Jinshan Zeng, Shikang Ouyang, Tim Tsz-Kit Lau, Shaobo Lin, and Yuan Yao. Global Convergence in Deep Learning with Variable Splitting via the Kurdyka-Łojasiewicz Property. arXiv preprint arXiv:1803.00225, 2018.

[48] Jia Li, Cong Fang, and Zhouchen Lin. Lifted Proximal Operator Machines. In Association for the Advancement of Artificial Intelligence (AAAI), pages 4181–4188, 2019.

[49] Fangda Gu, Armin Askari, and Laurent El Ghaoui. Fenchel Lifted Networks: A Lagrange Relaxation of Neural Network Training. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3362–3371, 2020.

[50] Yoshua Bengio. How Auto-Encoders Could Provide Credit Assignment in Deep Networks via Target Propagation. arXiv preprint arXiv:1407.7906, 2014.

[51] Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference Target Propagation. In International Conference on Learning Representations (ICLR), pages 498–515, 2015.

[52] Alexander Meulemans, Francesco Carzaniga, Johan Suykens, João Sacramento, and Benjamin F. Grewe. A Theoretical Framework for Target Propagation. In Neural Information Processing Systems (NeurIPS), pages 20024–20036, 2020.

[53] Wojciech Marian Czarnecki, Grzegorz Swirszcz, Max Jaderberg, Simon Osindero, Oriol Vinyals, and Koray Kavukcuoglu. Understanding Synthetic Gradients and Decoupled Neural Interfaces. In International Conference on Machine Learning (ICML), pages 904–912, 2017.

[54] Benjamin James Lansdell, Prashanth Ravi Prakash, and Konrad Paul Kording. Learning to solve the credit assignment problem. In International Conference on Learning Representations (ICLR), 2020.

[55] Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-Learning with Implicit Gradients. In Neural Information Processing Systems (NeurIPS), 2019.

[56] J. Frédéric Bonnans and Alexander Shapiro. Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.

[57] Justin Domke. Generic Methods for Optimization-Based Modeling. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 318–326, 2012.

[58] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based Hyperparameter Optimization through Reversible Learning. In International Conference on Machine Learning (ICML), pages 2113–2122, 2015.

[59] Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters. In International Conference on Machine Learning (ICML), pages 2952–2960, 2016.

[60] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable Architecture Search. In International Conference on Learning Representations (ICLR), 2018.

[61] Fabian Pedregosa. Hyperparameter Optimization with Approximate Gradient. In International Conference on Machine Learning (ICML), pages 737–746, 2016.