# Two Tales of Single-Phase Contrastive Hebbian Learning

Rasmus Høier 1, Christopher Zach 1

## Abstract

The search for biologically plausible learning algorithms has converged on the idea of representing gradients as activity differences. However, most approaches require a high degree of synchronization (distinct phases during learning) and introduce substantial computational overhead, which raises doubts regarding their biological plausibility as well as their potential utility for neuromorphic computing. Furthermore, they commonly rely on applying infinitesimal perturbations (nudges) to output units, which is impractical in noisy environments. Recently it has been shown that by modelling artificial neurons as dyads with two oppositely nudged compartments, it is possible for a fully local learning algorithm named dual propagation to bridge the performance gap to backpropagation, without requiring separate learning phases or infinitesimal nudging. However, the algorithm has the drawback that its numerical stability relies on symmetric nudging, which may be restrictive in biological and analog implementations. In this work we first provide a solid foundation for the objective underlying the dual propagation method, which also reveals a surprising connection with adversarial robustness. Second, we demonstrate how dual propagation is related to a particular adjoint state method, which is stable regardless of asymmetric nudging.

1 Department of Electrical Engineering, Chalmers University of Technology, Sweden. Correspondence to: Rasmus Høier.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## 1. Introduction

Credit assignment using fully local alternatives to backpropagation is interesting both for its potential to model biological learning and for its applicability to energy-efficient analog neuromorphic computing (Kendall & Kumar, 2020; Yi et al., 2022). A pervasive idea in this field is to represent error signals via activity differences, referred to as NGRAD (Neural Gradient Representation by Activity Differences) approaches (Lillicrap et al., 2020). However, a commonly overlooked issue in NGRAD approaches is the requirement of applying infinitesimal perturbations (or nudging) to output neurons in order to propagate error information through the network. This is problematic as analog and biological neural networks are inherently noisy, potentially causing the error signal to vanish if insufficient nudging is applied. In many local learning methods the output units are positively nudged to reduce a target loss, but utilizing additional negative nudging (output units increase a target loss) can be beneficial for accuracy (e.g. (Laborieux et al., 2021)). The vanishing error signal problem is addressed by coupled learning (Stern et al., 2021), which proposes to replace the clamped output units of contrastive Hebbian learning with a convex combination of the label and the free-phase outputs. Unfortunately, coupled learning has been shown to perform worse than equilibrium propagation on CIFAR10 and CIFAR100 (Scellier et al., 2023), and it does not necessarily approximate gradient descent on the output loss function (Stern et al., 2021).
Holomorphic equilibrium propagation (Laborieux & Zenke, 2022; 2023) mitigates the need for infinitesimal nudging required in standard equilibrium propagation (Scellier & Bengio, 2017) at the cost of introducing complex-valued parameters. Whether this is a suitable approach for either biological or analog neural networks is an open question. Dual propagation (DP) (Høier et al., 2023), an algorithm similar in spirit to contrastive Hebbian learning, equilibrium propagation and coupled learning, is compatible with non-infinitesimal nudging by default. This method infers two sets of oppositely nudged and mutually tethered states simultaneously. However, symmetric nudging is a necessary condition for the convergence of its inference step.

Contributions. DP is compatible with strong feedback and only requires a single inference phase, which are appealing features with regard to biological plausibility and potential applications to analog neuromorphic computing. However, the lack of convergence guarantees in the case of asymmetric nudging is clearly unsettling, as exact symmetry is hard to realize outside of digital computers. Further, unlike digital computers, neuromorphic, analog or otherwise highly distributed computing hardware typically performs continuous computations and runs asynchronously. Consequently, numerical stability of an energy-based inference and learning method is of essential importance. For this reason we derive an improved variant of dual propagation, which overcomes this strict symmetry requirement. In summary, the contributions of this work are:

- A derivation of DP based on repeated relaxations of the optimal value reformulation (Outrata, 1988; Dempe & Zemkoho, 2013).
- A new Lagrangian-based derivation of dual propagation, which recovers the original dual propagation algorithm in the case of symmetric nudging, but leads to a slightly altered (and more robust) method in the case of asymmetric nudging.
- An experimental investigation of the robustness of these algorithms to asymmetric nudging and strong feedback, including a demonstration of the impact of asymmetric nudging on the estimated Lipschitz constant.

## 2. Related Work

CHL, EP and lifted networks. In contrastive Hebbian learning (CHL) (Movellan, 1991; Xie & Seung, 2003) and equilibrium propagation (EP) (Scellier & Bengio, 2017) neuronal activations are found via an energy minimization procedure. Inference is carried out twice, once with and once without injecting label information at the output layer. While CHL clamps the output units to the true targets, EP applies nudging towards a lower loss. The difference between the activity in each of these two inference phases is used to represent neuronal error vectors. To speed up convergence and to ensure that the inferred states represent the same local energy basin, the phases are typically (but not always) run sequentially, e.g. the second inference phase is initialized with the solution found in the first phase. A better gradient estimate can be obtained by introducing an additional, oppositely nudged inference phase, yielding a three-phase algorithm (Laborieux et al., 2021). Dual propagation (DP) (Høier et al., 2023) is another algorithm in the NGRAD family, in which each neuron has two intrinsic states corresponding to positively and negatively nudged compartments, as illustrated in Fig. 1.
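In schematic form, each dyad combines its two states into a weighted mean (sent upstream as the neural activity) and a difference (sent downstream as the error signal). The following minimal Python sketch illustrates this bookkeeping for a single scalar dyad; the tanh activation and the variable names are illustrative assumptions and not the authors' implementation, and the exact layer-wise update rules appear later as Eq. (2).

```python
import numpy as np

def dyad_update(a, delta_top, alpha=0.5, f=np.tanh):
    """One conceptual update of a dyadic neuron.

    a         : bottom-up drive (shared by both compartments)
    delta_top : top-down error signal received from the layer above
    alpha     : weighting coefficient in [0, 1]; alpha = 1/2 is the symmetric case
    f         : activation function (tanh is purely an illustrative choice)
    """
    abar = 1.0 - alpha
    s_plus = f(a + abar * delta_top)    # positively nudged compartment (cf. Fig. 1a)
    s_minus = f(a - alpha * delta_top)  # negatively nudged compartment
    mean = alpha * s_plus + abar * s_minus  # propagated upstream as the activity
    diff = s_plus - s_minus                 # propagated downstream as the error
    return mean, diff
```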
The algorithm is based on a specific network potential and uses the respective stationarity conditions to obtain an easy-to-implement inference method: each neuron maintains two internal states, representing the neural activity as their (weighted) mean and the error signal as their difference, respectively (which makes it an instance of an NGRAD method). The mean state is sent upstream to the next layer while the difference is passed downstream to the preceding layer, where it nudges the receiving neural state. The method essentially interleaves or braids the two inference phases and makes it possible to infer both states simultaneously.

[Figure 1. (a) Sketch of a dyadic DP neuron with internal states s+ = f(a + ᾱΔ) and s− = f(a − αΔ), receiving bottom-up and top-down input and emitting an out-going signal. (b) Sketch of a pyramidal neuron with basal and apical dendrites. Caption: (a) Illustration of a dyadic neuron (note that all quantities are scalar). The two internal states, s+ and s−, receive the same bottom-up input a, but the top-down input nudges them in opposite directions. The difference and weighted mean of these internal states are then propagated downstream and upstream, respectively. (b) In a pyramidal neuron, bottom-up signals arrive at the basal dendrites and top-down signals arrive at the apical dendrites. Concerns regarding DP and biological plausibility are discussed in Section 7.]

When the update sequence is chosen appropriately, as few as two updates per neuron are sufficient, making the algorithm comparable to backpropagation in terms of runtime and 10-100x faster than CHL and EP. In practice, DP is applied to feed-forward networks and EP typically to Hopfield models, which has implications for how inference is carried out (fixed-point method vs. energy minimization) and for the hardware the algorithms are suitable for. Another difference to EP and CHL is that dual propagation infers both sets of states simultaneously (and not in sequential phases). The underlying network potential used in DP is parametrized by a coefficient α ∈ [0, 1] determining the weighted mean. The stability of the resulting inference method hinges on choosing α = 1/2.

Casting deep learning as an optimization task over explicit activations and weights is the focus of a diverse set of backpropagation alternatives sometimes collectively referred to as lifted neural networks (Carreira-Perpinan & Wang, 2014; Askari et al., 2018; Gu et al., 2020; Li et al., 2020; Choromanska et al., 2019; Zach & Estellers, 2019; Høier & Zach, 2020). Predictive coding networks (e.g. (Whittington & Bogacz, 2017; Salvatori et al., 2023)) can also be understood as instances of lifted neural networks. Although members of this group have different origins, they are algorithmically closely related to CHL and EP (Zach, 2021), but vary e.g. in their suitability for digital hardware.

Weak and strong feedback. While a number of CHL-inspired learning methods for neural networks are shown to be equivalent to back-propagation when the feedback parameter β approaches zero (i.e. infinitesimal nudging takes place, as discussed in e.g. (Xie & Seung, 2003; Scellier & Bengio, 2017; Zach & Estellers, 2019; Zach, 2021)), practical implementations use a finite but small value for β, whose magnitude is further limited either explicitly or implicitly.
CHL implicitly introducing weak feedback via its layer discounting in order to approximate a feed-forward neural network, and both CHL and EP rely on weak feedback to stay in the same energy basin for the free and the clamped solutions. The iterative solver suggested for the LPOM model (Li et al., 2020) also depends on sufficiently weak feedback to ensure convergence of the fixed-point inference scheme. In contrast to these restrictions, the feedback parameter β in dual propagation is weakly constrained and its main effect is to influence the resulting finite difference approximation for activation function derivatives. 3. Background A feed-forward network is a composition of L layer computations sk 1 7 fk(Wk 1sk 1), where fk is the activation function of layer k and Wk 1 is the associated weight matrix. s0 is the input provided to the network and sk := fk(Wk 1sk 1) is the vector of neural states in layer k. The set of weight matrices θ = {Wk}L 1 k=1 form the trainable parameters, which are optimized during the training stage to minimize a target loss ℓ. Training is usually at least on digitial hardware conducted via back-propagation of errors, which is a specific instance of the chain rule. In (Høier et al., 2023) a localized, derivative-free approach for supervised learning in feed-forward neural networks is proposed, which is based on the following network potential LDP α (Eq. 1) for a parameter α [0, 1], LDP α (θ) = min s+ max s αℓ(s+ L) + αℓ(s L) k=1 Ek(s+ k , sk 1) Ek(s k , sk 1) . (1) The terms Ek in LDP α are specifically chosen as Ek(sk, sk 1) = Gk(sk) s k Wk 1sk 1, where Gk is typically a strongly convex function for the resting energy (and relates to the activation function via Gk = f 1 k and G k = fk). This choice leads via the corresponding stationarity conditions to particularly simple fixed-point updates for s k for the inference of neural states, s+ k fk Wk 1 sk 1 + αW k (s+ k+1 s k+1) s k fk Wk 1 sk 1 αW k (s+ k+1 s k+1) , (2) where α := 1 α and sk := αs+ k + αs k . The state of the output units is determined by solving s+ L arg min s L αℓ(s L) + EL(s L, s L 1) s L arg min s L αℓ(s L) + EL(s L, s L 1). (3) Gradient-based learning of the weight matrices θ = (Wk)L 1 k=0 is based on the quantity Wk LDP α (s+ k+1 s k+1) s k once the neural states s k are inferred. The choice of α = 1 is equivalent to the LPOM formulation (Li et al., 2020), that builds on the following network potential, LLP OM(θ) = min s+ ℓ(s+ L) Ek(s+ k , s+ k 1) Ek(fk(Wk 1s+ k 1), s+ k 1) . Eq. 4 can be obtained from Eq. 1 by closed form maximization w.r.t. all s k , as s k in this setting solely depends on s+ k 1. Analogously, the choice α = 0 can be interpreted as inverted LPOM model (after minimizing w.r.t. all s+ k variables). Unfortunately, the straightforward update rules in Eq. 2 lead to a stable method only when α = 1/2. In Section F we discuss the necessary adaptation of Eq. 2 (which are also used in our experiments) when α {0, 1}. In this work we (i) provide explanations why the choice α = 1/2 is special (via connecting it to the adjoint state method in Section 5 and via a more explicit analysis in Section G) and (ii) demonstrate why selecting α = 1/2 can be beneficial. 4. A Relaxation Perspective on Dual Propagation In this section we show how LDP α can be obtained exactly by repeated application of certain relaxations of the optimal value reformulation (Outrata, 1988; Dempe & Zemkoho, 2013) (summarized in Section 4.1). 
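Before turning to that derivation, the following minimal NumPy sketch illustrates how the fixed-point updates of Eq. (2) can be iterated over the layers of a small MLP. It is a simplified illustration under explicit assumptions: an element-wise activation, a linear output layer with a β-weighted least-squares loss, and a first-order output nudge standing in for the exact minimization of Eq. (3); the placement of α versus ᾱ follows the convention of Fig. 1, and none of the names refer to the authors' released code.

```python
import numpy as np

def dp_inference(W, x, y, alpha=0.5, beta=0.5, n_sweeps=5, f=np.tanh):
    """Iterate the dyadic fixed-point updates of Eq. (2) on a small MLP.

    W        : list of weight matrices [W_0, ..., W_{L-1}]
    x, y     : input and target vectors
    alpha    : weighting coefficient in [0, 1] (alpha = 1/2 is the symmetric case)
    beta     : nudging strength entering the output loss
    n_sweeps : number of full passes over the layers
    """
    abar = 1.0 - alpha
    L = len(W)
    # Initialize both compartments with a plain forward pass (linear output layer).
    s_plus, s_minus, s = [], [], x
    for k in range(L):
        s = f(W[k] @ s) if k < L - 1 else W[k] @ s
        s_plus.append(s.copy()); s_minus.append(s.copy())

    for _ in range(n_sweeps):
        s_bar = [alpha * p + abar * m for p, m in zip(s_plus, s_minus)]
        inputs = [x] + s_bar[:-1]          # bottom-up input to each layer
        for k in range(L - 1):             # hidden layers, Eq. (2)
            delta_up = s_plus[k + 1] - s_minus[k + 1]
            a = W[k] @ inputs[k]
            fb = W[k + 1].T @ delta_up
            s_plus[k] = f(a + abar * fb)
            s_minus[k] = f(a - alpha * fb)
        # Output layer: first-order nudge on the least-squares loss,
        # standing in for the exact minimization of Eq. (3).
        a_L = W[L - 1] @ inputs[L - 1]
        g = beta * (a_L - y)               # gradient of beta/2 * ||a_L - y||^2
        s_plus[L - 1] = a_L - abar * g
        s_minus[L - 1] = a_L + alpha * g
    return s_plus, s_minus
```

Once the states have settled, the weight-gradient estimate for layer k is, as stated above, proportional to the outer product of the difference (s+_{k+1} − s−_{k+1}) with the mean state s̄_k.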
Obtaining LDP α requires the application of two (to our knowledge) novel relaxations for bilevel programs as described in Sections 4.2 and 4.3. 4.1. The Optimal Value Reformulation and its Relaxation First we describe a basic (and known) reformulation for a class of bilevel minmization problems. We consider a general bilevel program of the form min θ ℓ(s ) s.t. s = arg min s E(s; θ) (5) In the context of this work we assume that the minimizer of E(s; θ) is unique for all θ, which corresponds to layer computation in deep neural networks to be proper (singlevalued) functions. In a first step the constraint in Eq. 5 can restated as min θ,s ℓ(s) s.t. E(s; θ) min s E(s ; θ), (6) which is the optimal value reformulation (Outrata, 1988; Dempe & Zemkoho, 2013). As E(s; θ) mins E(s ; θ) Two Tales of Single-Phase Contrastive Hebbian Learning by construction, the inequality constraint is always (possibly only weakly) active. By fixing the corresponding non-negative Lagrange multiplier to 1/β for a β > 0 we obtain a relaxed optimal value reformulation ( ROVR ), min θ,s ℓ(s) + 1 β E(s; θ) min s E(s ; θ) = min θ,s max s ℓ(s) + 1 β E(s; θ) E(s ; θ) . (7) Repeated application of this conversion from lower-level minimization task to higher-level penalizers on deeply nested problems yields a large variety of known learning methods (Zach, 2021). When β 0, a number of works have shown the connection between Eq. 7, back-propagation and implicit differentiation (Xie & Seung, 2003; Scellier & Bengio, 2017; Gould et al., 2019; Zach, 2021). 4.2. A Saddlepoint Relaxed Optimal Value Reformulation A different relaxation for a bilevel program can be obtained by rewriting Eq. 5 as follows, min s+ max s ℓ(αs++ αs ) s.t. s+ = arg min s E(s; θ) s = arg min s E(s; θ), (8) where α [0, 1] and α = 1 α as before. The reformulation above replaces the solution s of the inner problem with an (apparently superfluous) convex combination of two related solutions s+ and s . Under the uniqueness assumption on the minimizer of E( ; θ), this program is equivalent to Eq. 5 (as s = s+ = s by construction). The main effect of introducing s in Eq. 8 is, that it allows two independent applications of the ROVR. Applying the standard ROVR on s (a maximization task) yields the first relaxation, min s+ max s ℓ(αs++ αs ) 1 β (E(s ; θ) mins E(s; θ)) s.t. s+ = arg mins E(s; θ), (9) and a second application of the ROVR on s+ (with the same multiplier 1/β) results in JSPROVR α,β (θ) := min s+ max s ℓ(αs++ αs ) β (E(s+; θ) E(s ; θ)) (10) as the mins E(s; θ) terms cancel each other. We can reverse the order of min and max in Eq. 8, which also induces a reversal of min and max in Eq. 10. Strong duality may not be satisfied in general, but the stationarity conditions w.r.t. s+ and s are unaffected by permuting min and max. We call the algebraic conversion from Eq. 5 to Eq. 10 the saddle-point ROVR or SPROVR. In order to model deep feed-forward networks we extend the bilevel program to a multi-level program with with L nesting levels, min θ ℓ(s L) s.t. s k = arg minsk Ek(sk, s k 1; θ) s 1 = arg mins1 E1(s1; θ), (11) where each Ek( , sk 1; θ) is assumed to have a unique minimizer for all sk 1 and θ. By applying SPROVR repeatedly from the outermost level to inner ones, we arrive at the following relaxed reformulation, min θ min {s+ k } max {s k } ℓ( s L) + 1 β E1(s+ 1 ; θ) E1(s 1 ; θ) k=2 Ek(s+ k ; sk 1; θ) Ek(s k ; sk 1; θ) . 
(12) We recall that the short-hand notation sk stands for sk = αs+ k + αs k , but the choice of the coefficient α can in principle be layer-dependent (by replacing α with αk). Similarly, the value of multiplier 1/β may depend on the layer k. Traversing the SPROVR from the inside towards the outermost level leads to a different (and less relevant) relaxation. 4.3. An Adversarial Relaxed Optimal Value Reformulation In this section we present a different instance of an optimal value reformulation, that yields centered approaches to contrastive Hebbian learning and is further closely linked to adversarial training. We start with the bilevel program, Jadv α,β(θ) = ℓ(s ) s.t. s = arg min s E(s; θ) αβℓ(s), where β > 0 is a step size parameter and α [0, 1] further modulates the step length. The solution s is a minimizer of an adversarially perturbed inner problem (and assumed to exist), and therefore Eq. 13 is generally not equivalent to Eq. 5. A simple illustrative example is given by setting E(s; θ) = s θ 2/2 and ℓ(s) = g s, then arg mins E(s; θ) = θ but s in Eq. 13 is given by s = θ + αβg, i.e. s takes a gradient ascent step from θ = arg mins E(s; θ) to increase the (linear) target loss ℓ (somewhat in analogy with the fast gradient method). On the other hand, the outer problem aims to reduce the main loss ℓfor this perturbed sample s . Using the optimal value reformulation (cf. Section 4.1) this is equivalent to an inequality constrained program, min s ℓ(s ) s.t. E(s ; θ) αβℓ(s ) min s E(s; θ) αβℓ(s). (14) Fixing the (non-negative) multiplier of the Lagrangian relaxation to 1/β yields (after algebraic manipulations) JAROVR α,β (θ) := min s+ max s αℓ(s+) + αℓ(s ) β E(s+; θ) E(s ; θ) , (15) Two Tales of Single-Phase Contrastive Hebbian Learning which e.g. generalizes the centered variant of EP. Increasing the effective step length αβ in Jadv α,β by increasing α allows larger adversarial perturbations and therefore makes the learning task harder. This property carries over to JAROVR α,β as shown in the following proposition (proven in the appendix). Proposition 4.1. Let 0 < β β and α, α [0, 1] with α α . Then the following holds: (a) JAROVR α,β (θ) JAROVR α ,β (θ) and (b) βJAROVR α,β (θ) β JAROVR α,β (θ). Original EP (α = 1) and centered EP (Laborieux et al., 2021; Scellier et al., 2023) (α = 1/2) are two particular instances of JAROVR α,β . Hence, Prop. 4.1 raises the question of how much the better gradient estimate and how much the stricter loss JAROVR 1/2,β contribute to the advantages of centered EP. Without further assumption we cannot expect claim (b) to be strengthened to JAROVR α,β (θ) JAROVR α,β (θ) as a larger adversarial step is offset by a smaller multiplier 1/β . If the total step size αβ remains constant, then a stronger relation can nevertheless be shown: Proposition 4.2. Let 0 < β < β and α such that αβ = α β . Then JAROVR α,β (θ) JAROVR α ,β (θ). The two relaxations, JAROVR α,β and JSPROVR α,β (Eq. 10) look very similar, the only difference being the location of the averaging step (averaging the argument or the function value of the target loss ℓ). Generally we expect JAROVR α,β to upper bound JSPROVR α,β , which is always the case when ℓis convex. The huge advantage of SPROVR is that it leads to tractable objectives when applied to deeply nested problems (linear in the problem depth L). Applying AROVR in a similar manner on Eq. 11 yields an exponential number of terms in the resulting objective unless α = 1 or α = 0 (where it actually coincides with SPROVR). 
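The quadratic example given in Section 4.3 above is easy to check numerically. The sketch below verifies that, for E(s; θ) = ||s − θ||²/2 and ℓ(s) = gᵀs, gradient descent on the adversarially perturbed inner objective of Eq. (13) converges to θ plus a step of length ᾱβ along g. The step size, iteration count and random data are incidental choices, and the sign convention follows the text's observation that decreasing α enlarges the adversarial step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta = rng.normal(size=d)   # parameter of the inner energy E(s; theta) = ||s - theta||^2 / 2
g = rng.normal(size=d)       # defines the linear target loss l(s) = g^T s
alpha, beta = 0.25, 0.5
step = (1.0 - alpha) * beta  # effective adversarial step length (abar * beta)

# Closed-form claim from the text: the perturbed inner solution is theta + abar*beta*g.
s_closed = theta + step * g

# Numerical check: gradient descent on the inner objective E(s) - abar*beta*l(s).
s = theta.copy()
for _ in range(2000):
    grad = (s - theta) - step * g
    s -= 0.1 * grad

print(np.allclose(s, s_closed, atol=1e-6))  # True: the minimizer takes an ascent step on l
```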
We now can easily identify LDP α (Eq. 1) as result of applying both SPROVR and AROVR on a deeply nested program: Corollary 4.3. The DP objective LDP α (Eq. 1) can be obtained by first applying AROVR on the outermost level of Eq. 11, followed by subsequent applications of SPROVR for the remaining levels. In Section 6.1 we verify whether different choices for α have any impact on the robustness of a trained model. Since any differences are expected to vanish for small choices of β, the large β setting (we choose β = 1/2) is of main interest. Note that unlike standard adversarial training (which perturbs solely the input to the network), the lifted network potential LDP α is based on perturbations of the network s hidden activations. Adversarial perturbations of the inputs can be achieved by adding an auxiliary constraint s0 = x, e.g. by extending LDP α with e.g. E0(s0, x) = s0 x 2. In our experiments, a significant reduction of the Lipschitz estimate is in fact observed when decreasing α (and there- fore increasing the adversarial step size αβ). We do of course not suggest to replace advanced adversarial training with the loss JAROVR 0,β , but the main goal of this section is to demonstrate that a non-infinitesimal choice of β has a theoretical and practical impact and therefore goes beyond a pure approximation of back-propagation (respectively implicit differentiation). One may also speculate that the hypothesized NGRAD-based learning in biological systems gains some robustness by leveraging a similar mechanism. 5. A Lagrangian Perspective on Dual Propagation In this section we derive a variant of dual propagation, which we refer to as DP , which turns out to be robust to asymmetric nudging (i.e. choosing α [0, 1] not necessarily equal to 1/2). Our starting point is a modified version of Le Cun s classic Lagrangian-based derivation of backpropagation (Lecun, 1988). We assume (i) that the activation functions fk are all invertible (which can be relaxed), and (ii) that fk has a symmetric derivative (i.e. f k(x) = f k(x) ). The second assumption clearly holds e.g. for element-wise activation functions. Our initial Lagrangian relaxation can now be stated as follows, L(θ) = min s max δ ℓ(s L) k=1 δ k f 1 k (sk) Wk 1sk 1 . (16) Note that the multiplier δk corresponds to the constraint f 1 k (sk) = Wk 1sk 1 (in contrast to the constraint sk = fk(Wk 1sk 1) employed in (Lecun, 1988)). The main step is to reparamtrize sk and δk in terms of s+ k and s k , sk = αs+ k + αs k δk = s+ k s k (17) for a parameter α [0, 1] (and α := 1 α). In the following we use the short-hand notations sk := αs+ k + αs k and ak := Wk 1 sk 1. Proposition 5.1. Assume that the activation functions fk, k = 1, . . . , L 1, are invertible, have symmetric derivatives and behave locally linear. Then the Lagrangian corresponding to the reparametrization in Eq. 17 is given by LDP α (θ) = min s+ max s ℓ( s L) k=1 (s+ k s k ) f 1 k ( sk) Wk 1 sk 1 , (18) and the optimal s k in (18) satisfy s+ k fk Wk 1 sk 1 + αW k (s+ k+1 s k+1) s k fk Wk 1 sk 1 αW k (s+ k+1 s k+1) (19) for internal layers k = 1, . . . , L 1. Two Tales of Single-Phase Contrastive Hebbian Learning For the output layer activities s L we absorb f L into the target loss (if necessary) and therefore solve min s+ L max s L ℓ( s L) + (s+ L s L) ( s L a L) (20) with corresponding necessary optimality conditions 0 = αℓ ( s L) + s L a L + α(s+ L s L) 0 = αℓ ( s L) s L + a L + α(s+ L s L). (21) Adding these equations reveals ℓ ( s L) + s+ L s L = 0. 
Inserting this in any of the equations in (21) also implies s L = a L = WL 1 s L 1. Thus, by combining these relations and with g := ℓ (WL 1 s L 1), the updates for s L are given by s+ L WL 1 s L 1 αg s L WL 1 s L 1 + αg. (22) For the β-weighted least-squares loss, ℓ(s L) = β 2 s L y 2, the above updates reduce to s+ L a L αβ(a L y) and s L a L + αβ(a L y). The neural activities s k together encode the forward and the adjoint state. Since the updates in Eq. 19 are based on the adjoint state method, we denote the resulting algorithm adjoint-DP or just DP . Relation to the original dual propagation The update equations in (19) coincide with the original dual propagation rules if α = 1/2 (Høier et al., 2023), although the underlying objectives LDP α (18) and LDP α are fundamentally different. If α = 1/2, then α and α switch places w.r.t. the error signals (but not in the definition of s) compared to the original dual propagation method.1 The updates in (19) for α = 0 correspond to an algorithm called Fenchel backpropagation discussed in (Zach, 2021). Both the objective of the original dual propagation (1) and the objective of the improved variant (18) can be expressed in a general contrastive form, but the underlying potentials Ek are somewhat different, DP: Ek(s+ k , sk 1) Ek(s k , sk 1) = Gk(s+ k ) Gk(s k ) (s+ k s k ) Wk 1 sk 1 DP : Ek(s k , sk 1) Ek(s k , sk 1) = (s+ k s k ) Gk( sk) (s+ k s k ) Wk 1 sk 1. Here Gk is a convex mapping ensuring that arg minsk Ek(sk, sk 1) provides the desired activation function fk. The relation between fk and Gk is given by fk = G k and f 1 k = Gk (where G k is the convex 1This partial exchange of roles of α and α is somewhat analogous to the observation that e.g. forward differences in the primal domain become backward differences in the adjoint. conjugate of Gk). Activations function induced by Gk have automatically symmetric Jacobians as f k = 2G k under mild assumptions. The main difference between these two flavors of DP can be restated as follows, Ek = Gk(s+ k ) Gk(s k ) (s+ k s k ) Gk( sk) = DGk(s+ k sk) DGk(s k sk), (23) where DGk is the Bregman divergence induced by Gk. The difference of Bregman divergences can have any sign, and Ek = 0 if DGk(s+ k sk) = DGk(s k sk). Hence, for the choice α {0, 1} (i.e. sk = s+ k or sk = s k ), Ek can only be zero when s+ k = s k . Fixing α = 1/2 and Gk to any convex quadratic function implies Ek = 0 (as DGk becomes the squared Mahalanobis distance in this setting). Non-invertible activation function If fk is not invertible at the linearization point (such as the softmax function), then Dk is singular and the constraint f 1 k (sk) = Wk 1sk 1 is converted into a constraint that D+ k sk is restricted to a linear subspace, D+ k (sk s0 k) = Wk 1sk 1 + Nkvk, (24) where Nk spans the null space of Dk and vk is an arbitrary vector. Going through the similar steps as above leads to the same updates for s k . Starting from Lecun s Lagrangian If we start from the more natural Lagrangian L(θ) = min s max δ ℓ(s L) k=1 δ k (sk fk(Wk 1sk 1)) , (25) instead from (16), then the back-propagated signal s k+1 s+ k+1 cannot be easily moved inside the activation function fk. The update equations for s k are now of the lessconvenient form s+ k fk(Wk 1 sk 1) + αW k f k+1( sk)(s k+1 s+ k+1) s k fk(Wk 1 sk 1) αW k f k+1( sk)(s k+1 s+ k+1), which still require derivatives of the (next-layer) activation functions, which makes it less appealing to serve as an NGRAD method. 6. 
Numerical Validation

Our numerical experiments aim to verify two claims: first, in Section 6.1 we demonstrate that choosing different values for α makes a significant difference in the learned weight matrices. Second, Section 6.2 validates that the proposed DP⊤ method is efficient enough to enable successful training beyond toy-sized DNNs for arbitrary choices of α. (The code used in our experiments is available at github.com/Rasmuskh/dualprop_icml_2024.)

Table 1. Test accuracy and Lipschitz properties of an MLP trained with DP and DP⊤ for different choices of α (and fixed β = 1/2). L̂ is the spectral norm of the product of weight matrices. Results are averaged over 3 random seeds. The additional results under DP and BP correspond to adding input layer robustness (DP) and FGSM-based adversarial training (BP), respectively.

| Dataset | Metric | DP α=1 | DP α=½ | DP α=0 | DP⊤ α=1 | DP⊤ α=½ | DP⊤ α=0 | BP | DP (α=0, + input robust.) | BP (FGSM) |
|---|---|---|---|---|---|---|---|---|---|---|
| MNIST | Acc | 95.85 ± 0.68 | 98.09 ± 0.06 | 98.46 ± 0.07 | 98.06 ± 0.15 | 97.9 ± 0.1 | 98.21 ± 0.13 | 98.05 ± 0.12 | 98.43 ± 0.11 | 98.43 ± 0.04 |
| MNIST | L̂ | 522 ± 288 | 67.2 ± 12.6 | 29.9 ± 4.2 | 73.4 ± 2.7 | 78.1 ± 9.9 | 43.2 ± 19.4 | 55.1 ± 14.4 | 15.87 ± 6.34 | 36.16 ± 4.23 |
| F-MNIST | Acc | 83.36 ± 0.74 | 88.32 ± 0.05 | 89.16 ± 0.43 | 88.26 ± 0.46 | 88.35 ± 0.37 | 88.18 ± 0.27 | 88.37 ± 0.56 | 88.29 ± 0.79 | 88.64 ± 0.25 |
| F-MNIST | L̂ | 157.7 ± 57.8 | 34.3 ± 4.0 | 19.97 ± 3.9 | 52.0 ± 23.6 | 35.1 ± 7.0 | 29.4 ± 2.7 | 37.7 ± 10.2 | 8.05 ± 1.03 | 12.68 ± 0.81 |

Implementation. The adjoint variant DP⊤ literally implements the updates stated in Eq. 19. In order to enable DP for values α ≠ 1/2, we leverage stabilized fixed-point updates (Section F). Not using stabilized updates yields failures to learn successfully, as indicated in Table 2 (with details given in Section A.1). Using stable fixed-point updates comes at a significant computational cost as more iterations are required to obtain sufficiently accurate neural states. Consequently we apply these only in the experiments related to the Lipschitz behavior described in Section 6.1.

Table 2. Mean test accuracy in percent for the original and the improved dual propagation methods using α ∈ {0, 1}. (X) indicates that the particular experiment did not converge. Results are averaged over five random seeds.

| Iters | DP⊤ (α=0) | DP (α=0) | DP⊤ (α=1) | DP (α=1) |
|---|---|---|---|---|
| 1 | 98.3 ± 0.1 | 98.50 ± 0.1 | 98.4 ± 0.1 | 97.9 ± 0.1 |
| 30 | 98.4 ± 0.0 | X | 98.5 ± 0.1 | X |

6.1. The impact of α on the Lipschitz continuity

Section 4.3 suggests that the choice of α matters in particular in the strong feedback setting (i.e. β is not close to zero). We ran experiments on MNIST and Fashion-MNIST using a 784-512(×2)-10 MLP with ReLU activation functions, and Table 1 reports test accuracy and Lipschitz estimates (computed as the spectral norm of the accumulated weight matrix W2W1W0, as ReLU is 1-Lipschitz) after 20 epochs of ADAM-based training (with learning rate 0.001 and default parameters otherwise). No additional loss terms such as weight decay were used. As indicated by Prop. 4.1, a lower value of α yields a significantly smaller Lipschitz constant (as well as improved test accuracy) for the DP method. The DP⊤ approach leads to a similar but far less pronounced behavior regarding the Lipschitz estimate. The small differences between the Lipschitz estimates for the different choices of α have no consistent impact on the test accuracy. As pointed out in Section 4.3, the DP objective L^DP_{α=0} is robust only w.r.t. perturbations in the internal layers. Consequently, the very first layer is not necessarily adversarially robust, and additional robustness can be achieved by extending L^DP_0 with a term for the input layer, introducing another term E0(s, x) = ||s − x||²/2. Table 1 also includes the resulting accuracy and Lipschitz estimates for this extended variant of L^DP_0 as well as the results for simple adversarial training (using FGSM (Goodfellow et al., 2015) as approximate inner maximization; the step size ε = 0.01 was chosen to roughly match the test accuracy of DP). Adding input layer robustness significantly reduces the Lipschitz estimate of the DP-trained network (at a small reduction of test accuracy). In the setting α = 1/2, DP and DP⊤ use identical update rules for the hidden units, but treat the output units differently (Eq. 3 vs. Eq. 22), hence the two methods behave very similarly but not identically in practice. Finally, we remark that the difference between all tested methods largely vanishes in the weak feedback setting, as they all mimic back-propagation when β → 0+.

6.2. VGG16 experiments

We also train a 16-layer VGG network using DP with a cross-entropy classification loss on the CIFAR10 and CIFAR100 datasets. In contrast to (Høier et al., 2023) we do not employ the efficient, backpropagation-like forward-backward traversal of the layers. Instead we apply 17 forward-only passes through the layers, which due to the dyadic neurons allows errors to gradually propagate to all layers. We employ standard data augmentation (random crops and horizontal flips), carry out all experiments with 3 random seeds and report mean and std. deviation. 10% of the training data is held out as a validation set for model selection. The hyperparameters are listed in Section B. Unlike the MLP setting, we find now that the choice of β has some influence on the stability of the training method. Note that (Høier et al., 2023) actually reported results for β = 1/batch-size (instead of β = 1) as the (linearized) mean CE loss was used. We instead use the summed CE loss (so the feedback signal is not scaled by the batch size).

Table 3. Top-1 and top-5 accuracies in percent for VGG16 networks trained with DP for different choices of β and α. (∗) indicates an experiment where a reasonable model was found before training eventually diverged. (X) indicates an experiment where no reasonable model was learned.

| | β=1.0, α=0.0 | β=1.0, α=0.5 | β=1.0, α=1.0 | β=0.1, α=0.0 | β=0.1, α=0.5 | β=0.1, α=1.0 | β=0.01, α=0.0 | β=0.01, α=0.5 | β=0.01, α=1.0 | BP |
|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR10 Top-1 | 91.34 ± 0.16 | X | X | 92.41 ± 0.07 | 89.3 ± 0.17 (∗) | X | 92.26 ± 0.22 | 92.19 ± 0.32 | 92.19 ± 0.08 | 92.36 ± 0.16 |
| CIFAR100 Top-1 | 69.3 ± 0.08 | X | X | 70.05 ± 0.23 | 69.25 ± 0.28 | X | 69.33 ± 0.24 | 69.38 ± 0.10 | 68.83 ± 0.06 | 69.1 ± 0.1 |
| CIFAR100 Top-5 | 89.6 ± 0.11 | X | X | 89.42 ± 0.16 | 88.54 ± 0.18 | X | 88.46 ± 0.17 | 88.23 ± 0.18 | 88.28 ± 0.13 | 88.24 ± 0.22 |

[Figure 2. CIFAR100: Angle of gradient estimates relative to BP gradients, for DP with α = 0 and β = 0.01 and β = 1.0. Angles (in degrees) are plotted per layer across epochs (left column); the right column zooms in on the first 200 minibatches (i.e. a fifth of an epoch). Panels: (a) α = 0.0, β = 0.01; (b) α = 0.0, β = 0.01 (zoom); (c) α = 0.0, β = 1.0; (d) α = 0.0, β = 1.0 (zoom).]

Results for DP and β ∈ {0.01, 0.1, 1.0} in Table 3 show that the choice of α has no noticeable impact in the weak feedback setting (β = 0.01), which is to be expected. In the moderate feedback setting (β = 0.1), DP with α = 1 fails, and DP with α = 0.5 eventually diverges on CIFAR10 after reaching an acceptable model. The choice β = 1.0 requires α = 0 for successful training and leads to a drop in performance.
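Fig. 2, discussed next, reports the angle between DP gradient estimates and the corresponding backpropagation gradients. A minimal sketch of how such a per-layer angle can be computed (the flattening and the degree conversion are incidental implementation choices, not taken from the released code):

```python
import numpy as np

def grad_angle_deg(g_dp, g_bp, eps=1e-12):
    """Angle (in degrees) between two gradient estimates for one layer.

    g_dp, g_bp : arrays of identical shape, e.g. a weight-gradient matrix
                 estimated by DP and the one computed by backpropagation.
    """
    u = np.asarray(g_dp, dtype=float).ravel()
    v = np.asarray(g_bp, dtype=float).ravel()
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: identical gradients give 0 degrees, opposite gradients give 180 degrees.
print(grad_angle_deg(np.ones((3, 3)), np.ones((3, 3))))   # 0.0
print(grad_angle_deg(np.ones((3, 3)), -np.ones((3, 3))))  # 180.0
```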
Fig. 2 compares the angle between DP gradient estimates and backprop gradient estimates when α = 0 in the strong (β = 1) and weak (β = 0.01) feedback setting. It can be seen in Figs. 2a and 2c that DP with β = 0.01 always provides excellent gradient estimates, whereas the gradients are initially not well aligned when β = 1. The difference is especially pronounced during the first 200 batches (a fifth of an epoch) in Figs. 2b and 2d.

ImageNet32x32. We restrict the more compute-intensive ImageNet32x32 experiments to the setting α = 1/2 and β = 0.01. We use 5% of the training data for validation and model selection, and use the public validation dataset to evaluate the selected model. With a top-1 accuracy of 41.59 ± 0.11 and a top-5 accuracy of 65.63 ± 0.19, it is apparent that the repeated inference scheme maintains performance on par with (or slightly above) the results reported for DP and BP in (Høier et al., 2023).

## 7. Discussion

Biological plausibility of the dyadic neuron model. In the DP framework neurons have two internal compartments, which integrate both bottom-up and top-down input signals. Introducing distinct neural compartments is somewhat biologically plausible, as recent neuroscience research suggests that dendrites are compartmental structures capable of performing fairly complex operations (Chavlis & Poirazi, 2021; Beniaguev et al., 2021). It has been proposed in (Guerguiev et al., 2017; Richards & Lillicrap, 2019) that distal apical dendrites (integrating top-down feedback signals) might be responsible for credit assignment in pyramidal neurons, whereas basal dendrites integrate local feed-forward and recurrent inputs. The DP algorithm does not fully fit this picture as the state updates require integrating bottom-up input and feedback input jointly. Further, DP relies on some form of multiplexing within a neuron in order to communicate both the neural state difference downstream and the mean state upstream. This distinction is illustrated in Fig. 1. In spite of these caveats, the dyadic neuron model shows that by taking a small step away from the single-compartment McCulloch-Pitts neuron model one can nevertheless achieve asynchronous propagation of meaningful errors entirely driven by local neural activities. While still far from biological reality, this may be useful for goals such as on-chip training of neuromorphic hardware.

Compartmental interpretation of related models. The contrastive, lifted and difference target propagation based models can all be interpreted as compartmental models (either compartments within a neuron or within a small neural circuit). In EP and CHL, neurons need to store the activities of the two temporally distinct inference phases. In dual propagation and in LPOM, neurons are required to maintain the nudged internal states. Neurons in difference target propagation are expected to have compartments for storing predictions, targets and possibly even noise-perturbed states (Lee et al., 2015). Works such as (Guerguiev et al., 2017; Sacramento et al., 2018) explicitly focus on building biologically inspired compartmental neuron models, although these methods incur even higher computational costs by also modelling spiking behaviour. The segregated dendrite model (Guerguiev et al., 2017) requires two temporally distinct phases, making it closer to a neuroscientist's realization of CHL or EP.
The dendritic cortical microcircuit model (Sacramento et al., 2018) is single phased and implementation-wise similar to fully asymmetric DP. Timescales of neural dynamics When training a network with BP, the weights of a layer can not be updated until the forward pass has completed, and the error has at least been back-propagated to the layer in question. This strict synchronization requirement has motivated several algorithms (Jaderberg et al., 2017; Nøkland & Eidnes, 2019), which to varying degree relax the need for precise synchronization. One way to avoid synchronisation across layers entirely is to use auxiliary losses at each layer, which often is applied in unsupervised learning methods such as ART (Grossberg, 1987), deep belief networks (Hinton et al., 2006; Bengio et al., 2006) or more recently soft Hebbian learning (Journ e et al., 2023). In the supervised learning setting, many lifted neural network approaches (such as (Carreira-Perpinan & Wang, 2014; Whittington & Bogacz, 2017; Li et al., 2020), that result in a joint minimization problem over neural states and synaptic weights) in principle rely on minimal synchronization (though in practice, inference of the neural activities is usually run to convergence to avoid storing intermediate neural states). Among biologically inspired alternatives to BP, difference target propagation (Lee et al., 2015) has similar synchronization constraints as BP, while CHL and EP require temporally distinct inference phases to converge before any weight update. DP only has a single inference phase and does not impose a strict schedule on the order in which layers are traversed, but still requires that meaningful error signals are able to reach layers before their weights are updated (see e.g. the random-DP variant in (Høier et al., 2023)). In a setting with asynchronous neurons this corresponds to a requirement that synaptic plasticity takes place at a (much) slower timescale than neural activities (which is at the core of continuous-time neuron models used in computational neuroscience). This agrees with the observation that neural activity, short-term plasticity (e.g. working memory and attention) and long-term plasticity take place at entirely different timescales (10ms, 100ms-minutes and minutes-hours respectively, e.g. (Tsodyks et al., 1998; Abbott & Regehr, 2004; Citri & Malenka, 2008)). Weight transport One of the key objections to BP is that it is biologically implausible for the forwards and backwards pathways to be symmetric (using W to propagate forwards and W to propagate errors backwards). This is known as the weight transport problem (Crick, 1989; Grossberg, 1987), and variations of this issue is present in many NGRAD algorithms, including DP. The weight transport problem is not addressed in the current work, but we note that DP has previously been shown to be compatible with Kolen-Pollack learning (Kolen & Pollack, 1994) of the feedback weights (Høier et al., 2023). There are a number of other interesting attempts at solving the weight transport problem in the literature such as random feedback alignment (Lillicrap et al., 2014; Nøkland, 2016; Frenkel et al., 2021) and stochastic alignment methods (Akrout et al., 2019; Ernoult et al., 2022). 8. Conclusion Fully local activity difference based learning algorithms are essential for achieving on-device training of neuromorphic hardware. However, the majority of works require distinct inference phases and are costly to simulate. 
The dual propagation formulation overcomes these issues but also relies on symmetric nudging (at each layer), which may itself be too strict a requirement in noisy hardware. Furthermore, a rigorous derivation of the DP objective has so far been missing. In this work we first present the theoretical foundation for dual propagation and argue (in theory and practice) why asymmetric nudging can be beneficial, despite requiring slow inference methods (instead of fast fixed-point dynamics) on digital hardware. Further, we present an improved variant of dual propagation derived from a suitable Lagrangian, which recovers the dynamics of dual propagation in the case of symmetric nudging (α = 1/2). In the case of asymmetric nudging (such as α {0, 1}) the new variant leads to slightly modified fixed-point dynamics, which are in fact significantly more stable as demonstrated in our experiments. Acknowledgements This work was supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, and by the Chalmers AI Research Centre (CHAIR). The experiments were enabled by the supercomputing resource Berzelius provided by National Supercomputer Cen- Two Tales of Single-Phase Contrastive Hebbian Learning tre at Link oping University and the Knut and Alice Wallenberg foundation. We would also like to thank Maxence Ernoult for helpful discussions. Impact statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Abbott, L. and Regehr, W. G. Synaptic computation. Nature, 431(7010):796 803, 2004. Akrout, M., Wilson, C., Humphreys, P., Lillicrap, T., and Tweed, D. B. Deep learning without weight transport. In Advances in Neural Information Processing Systems, volume 32, 2019. Askari, A., Negiar, G., Sambharya, R., and Ghaoui, L. E. Lifted neural networks. ar Xiv preprint ar Xiv:1805.01532, 2018. Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19, 2006. Beniaguev, D., Segev, I., and London, M. Single cortical neurons as deep artificial neural networks. Neuron, 109 (17):2727 2739, 2021. Carreira-Perpinan, M. and Wang, W. Distributed optimization of deeply nested systems. In Artificial Intelligence and Statistics, pp. 10 19, 2014. Chavlis, S. and Poirazi, P. Drawing inspiration from biological dendrites to empower artificial neural networks. Current opinion in neurobiology, 70:1 10, 2021. Choromanska, A., Cowen, B., Kumaravel, S., Luss, R., Rigotti, M., Rish, I., Diachille, P., Gurev, V., Kingsbury, B., Tejwani, R., et al. Beyond backprop: Online alternating minimization with auxiliary variables. In International Conference on Machine Learning, pp. 1193 1202. PMLR, 2019. Citri, A. and Malenka, R. C. Synaptic plasticity: multiple forms, functions, and mechanisms. Neuropsychopharmacology, 33(1):18 41, 2008. Crick, F. The recent excitement about neural networks. Nature, 337:129 132, 1989. Dempe, S. and Zemkoho, A. B. The bilevel programming problem: reformulations, constraint qualifications and optimality conditions. Mathematical Programming, 138 (1):447 473, 2013. Ernoult, M. M., Normandin, F., Moudgil, A., Spinney, S., Belilovsky, E., Rish, I., Richards, B., and Bengio, Y. Towards scaling difference target propagation by learning backprop targets. In International Conference on Machine Learning, pp. 
5968 5987. PMLR, 2022. Frenkel, C., Lefebvre, M., and Bol, D. Learning without feedback: Fixed random learning signals allow for feedforward training of deep neural networks. Frontiers in neuroscience, 15:629892, 2021. Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. Gould, S., Hartley, R., and Campbell, D. Deep declarative networks: A new hope. ar Xiv preprint ar Xiv:1909.04866, 2019. Grossberg, S. Competitive learning: From interactive activation to adaptive resonance. Cognitive science, 11(1): 23 63, 1987. Gu, F., Askari, A., and El Ghaoui, L. Fenchel lifted networks: A lagrange relaxation of neural network training. In International Conference on Artificial Intelligence and Statistics, pp. 3362 3371. PMLR, 2020. Guerguiev, J., Lillicrap, T. P., and Richards, B. A. Towards deep learning with segregated dendrites. e Life, 6:e22901, dec 2017. ISSN 2050-084X. doi: 10.7554/e Life.22901. Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural computation, 18 (7):1527 1554, 2006. Høier, R. and Zach, C. Lifted regression/reconstruction networks. In Proceedings of the British Machine Vision Conference (BMVC), 2020. Høier, R., Staudt, D., and Zach, C. Dual propagation: Accelerating contrastive hebbian learning with dyadic neurons. In International Conference on Machine Learning, 2023. Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1627 1635. JMLR. org, 2017. Journ e, A., Rodriguez, H. G., Guo, Q., and Moraitis, T. Hebbian deep learning without feedback. In The Eleventh International Conference on Learning Representations, 2023. Kendall, J. D. and Kumar, S. The building blocks of a brain-inspired computer. Applied Physics Reviews, 7(1): 011305, 2020. Two Tales of Single-Phase Contrastive Hebbian Learning Kolen, J. and Pollack, J. Backpropagation without weight transport. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN 94), volume 3, pp. 1375 1380 vol.3, 1994. doi: 10.1109/ICNN.1994. 374486. Laborieux, A. and Zenke, F. Holomorphic equilibrium propagation computes exact gradients through finite size oscillations. Advances in Neural Information Processing Systems, 35:12950 12963, 2022. Laborieux, A. and Zenke, F. Improving equilibrium propagation without weight symmetry through jacobian homeostasis. ar Xiv preprint ar Xiv:2309.02214, 2023. Laborieux, A., Ernoult, M., Scellier, B., Bengio, Y., Grollier, J., and Querlioz, D. Scaling equilibrium propagation to deep convnets by drastically reducing its gradient estimator bias. Frontiers in neuroscience, 15:129, 2021. Lecun, Y. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, CMU, Pittsburg, PA, pp. 21 28. Morgan Kaufmann, 1988. Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. Difference target propagation. In Joint european conference on machine learning and knowledge discovery in databases, pp. 498 515. Springer, 2015. Li, J., Xiao, M., Fang, C., Dai, Y., Xu, C., and Lin, Z. Training neural networks by lifted proximal operator machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. 
Random feedback weights support learning in deep neural networks. ar Xiv preprint ar Xiv:1411.0247, 2014. Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., and Hinton, G. Backpropagation and the brain. Nature Reviews Neuroscience, 21(6):335 346, 2020. Movellan, J. R. Contrastive hebbian learning in the continuous hopfield model. In Connectionist models, pp. 10 17. Elsevier, 1991. Nøkland, A. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, volume 29, 2016. Nøkland, A. and Eidnes, L. H. Training neural networks with local error signals. In International conference on machine learning, pp. 4839 4850. PMLR, 2019. Outrata, J. V. A note on the usage of nondifferentiable exact penalties in some special optimization problems. Kybernetika, 24(4):251 258, 1988. Richards, B. A. and Lillicrap, T. P. Dendritic solutions to the credit assignment problem. Current opinion in neurobiology, 54:28 36, 2019. Sacramento, J., Ponte Costa, R., Bengio, Y., and Senn, W. Dendritic cortical microcircuits approximate the backpropagation algorithm. Advances in neural information processing systems, 31, 2018. Salvatori, T., Mali, A., Buckley, C. L., Lukasiewicz, T., Rao, R. P., Friston, K., and Ororbia, A. Brain-inspired computational intelligence via predictive coding. ar Xiv preprint ar Xiv:2308.07870, 2023. Scellier, B. and Bengio, Y. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in computational neuroscience, 11:24, 2017. Scellier, B., Ernoult, M., Kendall, J., and Kumar, S. Energybased learning algorithms for analog computing: a comparative study. ar Xiv preprint ar Xiv:2312.15103, 2023. Stern, M., Hexner, D., Rocks, J. W., and Liu, A. J. Supervised learning in physical networks: From machine learning to learning machines. Physical Review X, 11(2): 021045, 2021. Tsodyks, M., Pawelzik, K., and Markram, H. Neural networks with dynamic synapses. Neural computation, 10 (4):821 835, 1998. Whittington, J. C. and Bogacz, R. An approximation of the error backpropagation algorithm in a predictive coding network with local hebbian synaptic plasticity. Neural computation, 29(5):1229 1262, 2017. Xie, X. and Seung, H. S. Equivalence of backpropagation and contrastive hebbian learning in a layered network. Neural computation, 15(2):441 454, 2003. Yi, S.-i., Kendall, J. D., Williams, R. S., and Kumar, S. Activity-difference training of deep neural networks using memristor crossbars. Nature Electronics, pp. 1 7, 2022. Zach, C. Bilevel programs meet deep learning: A unifying view on inference learning methods. ar Xiv preprint ar Xiv:2105.07231, 2021. Zach, C. and Estellers, V. Contrastive learning for lifted networks. In Sidorov, K. and Hicks, Y. (eds.), Proceedings of the British Machine Vision Conference (BMVC), pp. 163.1 163.12. BMVA Press, 9 2019. doi: 10.5244/C.33. 163. Two Tales of Single-Phase Contrastive Hebbian Learning A. Additional Results Fig. 3 illustrate additional statistics on the alignment of back-propagation induced gradients (w.r.t. the network parameters) and the ones obtained by the proposed DP improvement. A.1. Divergence of DP The original dual propagation and the improved variant differ in the case α [0, 1], α = 1/2, with the boundary cases α {0, 1} being of particular interest. 
The most efficient way to implement dual propagation is by sequentially updating neurons in every layer from input to output and then from output to input (akin to a forward and a backward traversal, which are nevertheless part of the sole inference phase using the same dynamics) before performing a weight update. However, to show the unstable behaviour of the original dual propagation formulation in the case of asymmetric nudging we instead perform repeated inference passes through the layers (which also better matches a continuous/asynchronous compute model). Each pass (or iteration) corresponds to updating all neurons from input to output layer and back. The results of these experiments are summarized in Tab 2, where we trained a 784-1000( 4)-10 MLP with Re LU activation functions on MNIST using the least-squares loss. The nudging strength β is 1/2, which is also compatible with the original DP method. We observe that inference in the original variant of DP diverges when applying asymmetric nudging and multiple inference iterations. This is not surprising as inference for dual propagation is only guaranteed to converge in the symmetric setting. The new variant of DP on the other hand yields stable inference results in all cases. B. CNN experimental details The experiments of Section 6.2 were carried out with a VGG16 network and the following hyper parameters: Table 4. Hyper parameters Epochs Learning rate Momentum Weight-decay Batchsize Dataset Warmup Decay Total Initial Peak Final CIFAR10 10 120 130 0.001 0.025 2e-6 0.9 5e-4 100 CIFAR100 10 190 200 0.001 0.015 2e-6 0.9 5e-4 50 Imagenet32x32 10 120 130 0.001 0.020 2e-6 0.9 5e-4 250 C. Proof of Proposition 5.1 Proof. The first-order optimality conditions for sk and δk (k = 1, . . . , L 1) in Eq. 16 are given by (I) 0 = (f 1 k ) (sk)δk W k δk+1 (II) 0 = f 1 k (sk) Wk 1sk 1. (26) Reparametrization in terms of s k (Eq. 17) and expanding (II) + α(I) and (II) α(I) yields 0 = f 1 k (αs+ k + αs k ) ak + α(f 1 k ) ( sk) (s+ k s k ) αW k (s+ k+1 s k+1) 0 = f 1 k (αs+ k + αs k ) + ak + α(f 1 k ) ( sk) (s+ k s k ) αW k (s+ k+1 s k+1), (27) which are also the KKT conditions for LDP + α in Eq. 18. It remains to manipulate these to obtain the desired update equations. Adding the equations above results in 0 = (f 1 k ) ( sk) (s+ k s k ) W k (s+ k+1 s k+1). (28) Dual propagation is (via its use of finite differences) intrinsically linked to a (locally) linear assumption on fk. Hence, we assume fk(a) = s0 k +Dka+O( a ak 2) with s0 k = fk(ak) Dkak and Dk = f k(ak). The local linear assumption allows us to neglect the higher order terms. Consequently we also assume that f 1 k is locally linear and therefore (f 1 k ) ( sk) D 1 k is independent of sk. Hence, we arrive at 0 D k (s+ k s k ) W k (s+ k+1 s k+1) s+ k s k D k W k (s+ k+1 s k+1). (29) Two Tales of Single-Phase Contrastive Hebbian Learning (a) α = 0 (b) α = 1 (c) α = 0.0 (d) α = 1.0 (e) α = 0.0 (f) α = 1.0 Figure 3. MNIST experiments employing asymetric α. Results are averaged over five random seeds. Top: Alignment between the parameter updates obtained with back-propagation and with the improved DP variant (using 30 inference iterations and asymmetric nudging with α {0, 1}). Middle: L2 norm of difference between BP and DP gradients. Bottom: L2 norms of BP and DP gradients when using 30 inference iterations and asymmetric nudging (α = 0.0 and α = 1.0). Results are averaged over five random seeds. 
Two Tales of Single-Phase Contrastive Hebbian Learning We insert this into the second of the necessary optimality conditions (27) and obtain 0 = f 1 k (αs+ k + αs k ) ak = f 1 k s k + α(s+ k s k ) ak = f 1 k s k + αD k W k (s+ k+1 s k+1) ak = D 1 k s k s0 k + αD k W k (s+ k+1 s k+1) ak. The last line is equivalent to s k = Dkak + s0 k αD k W k (s+ k+1 s k+1) fk Wk 1 sk 1 αW k (s+ k+1 s k+1) . (30) Analogously, s+ k fk(Wk 1 sk 1 + αW k (s+ k+1 s k+1)). In summary we obtain the update rules shown in (19). D. Proof of Proposition 4.1 For convenience we restate Prop. 4.1: Proposition D.1. Let 0 < β β and α, α [0, 1] with α α . Then the following holds: (a) JEP α,β(θ) JEP α ,β(θ) and (b) βJEP α,β(θ) β JEP α,β (θ). This is a consequence of the following lemma. Lemma D.2. Let s1 = arg min s α1f(z) + g(z) s2 = arg min s α2f(z) + g(z) (31) for α2 > α1. Then f(s2) f(s1). Proof. Optimality of s1,2 implies α1f(s1) + g(s1) α1f(s2) + g(s2) α2f(s2) + g(s2) α2f(s1) + g(s1). Adding these inequalities yields α1f(s1) + α2f(s2) α1f(s2) + α2f(s1) (α2 α1)f(s2) (α2 α1)f(s1), (32) i.e. f(s2) f(s1) since α2 α1 > 0. Proof of the proposition. Claim (a) In the following we absorb 1/β into E and omit the dependency of E on θ for notational brevity. α α implies α α 0 α α . Further, we define sα := arg min s αℓ(s) + E(u) s α := arg min s αℓ(s) + E(s) sα := arg min s α ℓ(s) + E(u) s α := arg min s α ℓ(s) + E(s). (33) We identify f(s) = ℓ(s) and g(s) = E(s, θ) and apply the lemma to deduce that ℓ(sα ) ℓ(sα) ℓ(s α ) ℓ(s α). (34) JEP α,β(θ) = αℓ(sα) + E(sα) + αℓ(s α) E(s α) αℓ(sα ) + E(sα ) + αℓ(s α) E(s α) = α ℓ(sα ) + E(sα ) + (α α )ℓ(sα ) + αℓ(s α) E(s α) α ℓ(sα ) + E(sα , ) + (α α )ℓ(s α) + αℓ(s α) E(s α) = α ℓ(sα ) + E(sα ) + α ℓ(s α) E(s α) α ℓ(sα ) + E(sα ) + α ℓ(s α ) E(s α ) = JEP α ,β(θ), Two Tales of Single-Phase Contrastive Hebbian Learning where we used the definition of sα in the first inequality, ℓ(sα ) ℓ(s α) (together with α α 0) for the second one, and finally the definition of s α to obtain the last inequality. Claim (b) We proceed similar to above and define the non-negative quantities γ := αβ γ := αβ γ := αβ γ := αβ . (36) The assumption β β implies γ γ 0 γ γ . (37) We also introduce sγ := arg min s γℓ(s) + E(s) s γ := arg min s γℓ(s) + E(s) sγ := arg min s γ ℓ(s) + E(s) s γ := arg min s γ ℓ(s) + E(s). (38) From the lemma we conclude that ℓ(sγ ) ℓ(sγ) ℓ(s γ) ℓ(s γ ). (39) βJEP α,β(θ) = γℓ(sγ) + E(sγ) + γℓ(s γ) E(s γ) γℓ(sγ ) + E(sγ ) + γℓ(s γ) E(s γ) = γ ℓ(sγ ) + E(sγ ) + (γ γ )ℓ(sγ ) + γℓ(s γ) E(s γ) γ ℓ(sγ ) + E(sγ ) + (γ γ )ℓ(s γ) + γℓ(s γ) E(s γ) = γ ℓ(sγ ) + E(sγ ) + (γ γ + γ) | {z } = γ ℓ(s γ) E(s γ) γ ℓ(sγ ) + E(sγ ) + γ ℓ(s γ ) E(s γ ) = β JEP α,β (θ). We applied the definition of sγ (first inequality), ℓ(sγ ) ℓ(s γ) (2nd inequality) and the definition of s γ . E. Proof of Proposition 4.2 For convenience we restate Prop. 4.2: Proposition E.1. Let 0 < β < β and α such that αβ = α β . Then JEP α,β(θ) JEP α ,β (θ). Proof. Let γ := αβ = α β . We further introduce U γ(s) := γℓ(s) + U(s) (41) sαβ = arg min s αℓ(s) + 1 sα β = arg min s α ℓ(s) + 1 s γ = arg min s αℓ(s) + 1 β U(s) = arg min s U γ(s). Two Tales of Single-Phase Contrastive Hebbian Learning JEP α,β(θ) = αℓ(sαβ) + 1 β U(sαβ) + αℓ(s γ) 1 = ℓ(sαβ) + 1 β U γ(sαβ) 1 ℓ(sα β ) + 1 β U γ(sα β ) 1 = ℓ(sα β ) + 1 β U γ(sα β ) U γ(s γ) ℓ(sα β ) + 1 β U γ(sα β ) U γ(s γ) = α ℓ(sα β ) + 1 β U(sα β ) + α ℓ(s γ) 1 β U(s γ) = JEP α ,β (θ), where we used the minimality of sαβ and s γ, and that 1/β < 1/β to obtain the inequalities. F. 
F. Stabilized Fixed-Point Iterations

The LPOM model considered in (Li et al., 2020) uses specific layerwise potentials E_k and also layer-dependent values of β,

L^LPOM(θ) = min_{{s_k}} ℓ(s_L) + Σ_k (1/β_k) [ G_k(s_k) − s_k^⊤ W_{k−1} s_{k−1} + G_k^*(W_{k−1} s_{k−1}) ],   (44)

where we use s_k instead of s_k^+ as the s_k^- are already eliminated. The LPOM authors suggest fixed-point iterations of the form

s_k^{(t)} ← f_k( W_{k−1} s_{k−1} + (β_k/β_{k+1}) W_k^⊤ ( s_{k+1} − f_{k+1}(W_k s_k^{(t−1)}) ) ),   (45)

which can be derived from the stationarity condition w.r.t. s_k,

0 = (1/β_k) (∇G_k(s_k) − W_{k−1} s_{k−1}) + (1/β_{k+1}) W_k^⊤ (∇G_{k+1}^*(W_k s_k) − s_{k+1})
  = (1/β_k) (f_k^{-1}(s_k) − W_{k−1} s_{k−1}) + (1/β_{k+1}) W_k^⊤ (f_{k+1}(W_k s_k) − s_{k+1}).   (46)

In (Li et al., 2020) it is assumed that each f_k is Lipschitz continuous (w.l.o.g. 1-Lipschitz continuous) and that the {β_k}_k are chosen such that

(β_k/β_{k+1}) ‖W_k^⊤ W_k‖_2 < 1,   (47)

i.e. β_k ≪ β_{k+1}, thereby discounting later layers (and enabling only weak feedback from the target loss ℓ). In our setting we choose β_0 = ... = β_{L−1} = 1 and β_L = β. Hence, the scheme in Eq. 45 is only guaranteed to converge as long as ‖W_k^⊤ W_k‖_2 < 1. For many standard activation functions (in particular ReLU, leaky ReLU and hard sigmoid) the underlying G_k is of the form G_k(s_k) = ‖s_k‖²/2 + ı_C(s_k) for a convex set C, where ı_C denotes the indicator function of the constraint s_k ∈ C. Therefore it is convenient to add a quadratic dampening term when optimizing w.r.t. s_k. Let

Φ_k(s_k; s_{k−1}, s_{k+1}, θ) := G_k(s_k) − s_k^⊤ W_{k−1} s_{k−1} + G_{k+1}^*(W_k s_k) − s_{k+1}^⊤ W_k s_k   (48)
Φ̂_k(s_k; s_k^0, s_{k−1}, s_{k+1}, θ) := G_k(s_k) − s_k^⊤ W_{k−1} s_{k−1} + (s_k − s_k^0)^⊤ W_k^⊤ f_{k+1}(W_k s_k^0) − s_{k+1}^⊤ W_k s_k   (49)

be the terms in L^LPOM depending on s_k and their (partially) linearized surrogate, respectively. Now consider the iteration

s_k^{(t+1)} := argmin_{s_k} Φ̂_k(s_k; s_k^{(t)}, s_{k−1}, s_{k+1}, θ) + (L/2) ‖s_k − s_k^{(t)}‖²
            = f_k( ( W_{k−1} s_{k−1} + W_k^⊤ ( s_{k+1} − f_{k+1}(W_k s_k^{(t)}) ) + L s_k^{(t)} ) / (1 + L) ).   (50)

If s_k^{(t)} is a stationary point of Φ_k, then s_k^{(t+1)} = s_k^{(t)}, i.e. stationary points of Φ_k are fixed points of this iteration. If G_{k+1} is 1-strongly convex (and therefore ∇G_{k+1}^* = f_{k+1} is 1-Lipschitz continuous) and L ≥ ‖W_k^⊤ W_k‖_2, then Φ̂_k(·; s_k^0, s_{k−1}, s_{k+1}, θ) + (L/2) ‖· − s_k^0‖² is a majorizer of Φ_k, and the sequence (s_k^{(t)})_t decreases Φ_k monotonically. It can further be shown that Eq. 50 is a contraction if L > max(0, ‖W_k^⊤ W_k‖_2 − 1)/2. We use the stabilized fixed-point scheme in Eq. 50 to determine the neural states s_k in the setting α ∈ {0, 1} (i.e. the LPOM and inverted LPOM formulations), where we set L to the estimate of ‖W_k^⊤ W_k‖_2 obtained by 5 power iterations.
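As a concrete illustration of the damped scheme in Eq. 50, the sketch below runs the stabilized update for a single hidden layer with ReLU units (so that f_k is the projection onto the non-negative orthant) and sets L via a few power iterations as described above. Layer sizes, weights and helper names are our own assumptions, not the configuration used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)  # f_k = projection onto the nonneg. orthant (G_k = ||.||^2/2 + indicator)

def spectral_norm_sq(W, iters=5):
    """Estimate ||W^T W||_2 by a few power iterations (used to set L)."""
    v = rng.standard_normal(W.shape[1])
    for _ in range(iters):
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    return v @ (W.T @ (W @ v))

# One hidden layer with fixed neighbouring states s_{k-1}, s_{k+1} (toy sizes).
W_prev = rng.standard_normal((64, 32)) / np.sqrt(32)   # W_{k-1}
W_next = rng.standard_normal((16, 64)) / np.sqrt(64)   # W_k
s_prev = relu(rng.standard_normal(32))
s_next = relu(rng.standard_normal(16))

L = spectral_norm_sq(W_next)       # damping constant L ~ ||W_k^T W_k||_2
s = relu(W_prev @ s_prev)          # initialise s_k at the feedforward value

for t in range(50):
    # Stabilized fixed-point update, Eq. (50):
    # s^{(t+1)} = f_k( (W_{k-1}s_{k-1} + W_k^T(s_{k+1} - f_{k+1}(W_k s^{(t)})) + L s^{(t)}) / (1 + L) )
    s_new = relu((W_prev @ s_prev
                  + W_next.T @ (s_next - relu(W_next @ s))
                  + L * s) / (1.0 + L))
    if np.linalg.norm(s_new - s) < 1e-9:
        break
    s = s_new

print("final update norm:", np.linalg.norm(s_new - s), "after", t + 1, "iterations")
```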
G. Analysis of the fixed point iterations

In the following we use an idealized setup, where we make two assumptions: 1. we consider a trilevel program with one hidden and one output layer, and 2. we use second-order Taylor expansions for the relevant mappings. The trilevel program is given by

min ℓ(y*)  s.t.  y* = argmin_y F(y) − y^⊤ W z*,   z* = argmin_z G(z) − z^⊤ x,   (51)

where ℓ, F and G are given by

ℓ(y) = ½ y^⊤ H_ℓ y + b_ℓ^⊤ y,   F(y) = ½ y^⊤ H_F y + b_F^⊤ y,   G(z) = ½ z^⊤ H_G z + b_G^⊤ z.   (52)

In the following subsections we derive the closed-form expression for the update of z in case of the AROVR and SPROVR relaxations.

G.1. Fixed-point updates based on AROVR

Applying the AROVR step on the outermost level in Eq. 51 yields

U(y^+, y^-, z^+, z^-) = α ℓ(y^+) + ᾱ ℓ(y^-) + F(y^+) − F(y^-) − (y^+ − y^-)^⊤ W z̄ + G(z^+) − G(z^-) − (z^+ − z^-)^⊤ x,   (53)

where z̄ = α′ z^+ + ᾱ′ z^- uses a potentially different coefficient α′. By using the assumption in Eq. 52, y^± are given by

0 = α (H_ℓ y^+ + b_ℓ) + H_F y^+ + b_F − W z̄   ⟹   y^+ = (H_F + α H_ℓ)^{-1} (W z̄ − b_F − α b_ℓ)
0 = −ᾱ (H_ℓ y^- + b_ℓ) + H_F y^- + b_F − W z̄   ⟹   y^- = (H_F − ᾱ H_ℓ)^{-1} (W z̄ − b_F + ᾱ b_ℓ).   (54)

Consequently,

y^+ − y^- = (H_F + α H_ℓ)^{-1} (W z̄ − b_F − α b_ℓ) − (H_F − ᾱ H_ℓ)^{-1} (W z̄ − b_F + ᾱ b_ℓ)
 = [ (H_F + α H_ℓ)^{-1} − (H_F − ᾱ H_ℓ)^{-1} ] (W z̄ − b_F) − [ α (H_F + α H_ℓ)^{-1} + ᾱ (H_F − ᾱ H_ℓ)^{-1} ] b_ℓ.   (55)

Only the term involving W z̄ is of interest as the remaining ones are independent of z. Now define

M := (H_F + α H_ℓ)^{-1} − (H_F − ᾱ H_ℓ)^{-1}.   (56)

From

(H_F + α H_ℓ) M = I − (H_F + α H_ℓ)(H_F − ᾱ H_ℓ)^{-1}
(H_F + α H_ℓ) M (H_F − ᾱ H_ℓ) = (H_F − ᾱ H_ℓ) − (H_F + α H_ℓ) = −H_ℓ   (57)

we obtain

M = −(H_F + α H_ℓ)^{-1} H_ℓ (H_F − ᾱ H_ℓ)^{-1}.   (58)

A similar calculation shows that

M′ := α (H_F + α H_ℓ)^{-1} + ᾱ (H_F − ᾱ H_ℓ)^{-1} = (H_F + α H_ℓ)^{-1} H_F (H_F − ᾱ H_ℓ)^{-1}.

Hence

y^+ − y^- = M (W z̄ − b_F) − M′ b_ℓ = (H_F + α H_ℓ)^{-1} [ H_ℓ (H_F − ᾱ H_ℓ)^{-1} (b_F − W z̄) − H_F (H_F − ᾱ H_ℓ)^{-1} b_ℓ ].   (59)

With H_F = I and H_ℓ = βI we obtain

M = −β / ((1 + αβ)(1 − ᾱβ)) I.   (60)

Using the quadratic approximation for G, G(z) = ½ z^⊤ H_G z + b_G^⊤ z, and writing g(u) := H_G^{-1}(u − b_G), the fixed-point updates for z^± read

z^+ ≈ H_G^{-1} ( x − b_G + α′ W^⊤ (y^+ − y^-) ) = g( x + α′ W^⊤ (y^+ − y^-) )
z^- ≈ H_G^{-1} ( x − b_G − ᾱ′ W^⊤ (y^+ − y^-) ) = g( x − ᾱ′ W^⊤ (y^+ − y^-) ),   (61)

and therefore

α′ z^+ + ᾱ′ z^- ≈ H_G^{-1} ( x − b_G + ((α′)² − (ᾱ′)²) W^⊤ (y^+ − y^-) )
 = H_G^{-1} ( x − b_G + (2α′ − 1) W^⊤ (y^+ − y^-) )
 = g( x + (2α′ − 1) W^⊤ (y^+ − y^-) )
 = g( x + (2α′ − 1) W^⊤ (M W z̄ + v) )   (62)

for a vector v containing all terms in y^+ − y^- that do not depend on z̄. The choice α′ = 1/2 therefore implies that z̄ is a fixed point of the above relation.

G.2. Fixed-point updates based on SPROVR

The above reasoning applies to the AROVR formulation. The SPROVR model uses a different objective,

U(y^+, y^-, z^+, z^-) = ℓ(α y^+ + ᾱ y^-) + F(y^+) − F(y^-) − (y^+ − y^-)^⊤ W z̄ + G(z^+) − G(z^-) − (z^+ − z^-)^⊤ x,   (63)

with corresponding stationarity conditions

α ∇ℓ(ȳ) + ∇F(y^+) − W z̄ = 0
ᾱ ∇ℓ(ȳ) − ∇F(y^-) + W z̄ = 0.   (64)

By using a quadratic model for ℓ and F, this translates to

(I)  α (H_ℓ ȳ + b_ℓ) + H_F y^+ + b_F − W z̄ = 0
(II) ᾱ (H_ℓ ȳ + b_ℓ) − H_F y^- − b_F + W z̄ = 0.   (65)

Adding these relations yields

H_ℓ ȳ + b_ℓ + H_F (y^+ − y^-) = 0   ⟹   y^+ − y^- = −H_F^{-1} (H_ℓ ȳ + b_ℓ).   (66)

Further, α(I) − ᾱ(II) results in

0 = (α² − ᾱ²) H_ℓ ȳ + (α² − ᾱ²) b_ℓ + H_F ȳ + b_F − W z̄ = ((2α − 1) H_ℓ + H_F) ȳ + (2α − 1) b_ℓ + b_F − W z̄,

hence

ȳ = ((2α − 1) H_ℓ + H_F)^{-1} ( W z̄ − (2α − 1) b_ℓ − b_F )   (67)

and

y^+ − y^- = −H_F^{-1} [ H_ℓ ((2α − 1) H_ℓ + H_F)^{-1} ( W z̄ − (2α − 1) b_ℓ − b_F ) + b_ℓ ].   (68)

Let M be the matrix applied to W z̄, i.e.

M = −H_F^{-1} H_ℓ ((2α − 1) H_ℓ + H_F)^{-1};   (69)

then for H_F = I and H_ℓ = βI we obtain

M = −β / (1 + (2α − 1)β) I = −β / (1 − β + 2αβ) I.   (70)

Recall the value of M obtained before (Eq. 60). As (1 + αβ)(1 − ᾱβ) = 1 − β + 2αβ − αᾱβ² and αᾱβ² ≥ 0, we conclude that SPROVR leads to a dampened error signal compared to AROVR (which is intuitively expected). The updates for z are given by

z^+ ≈ H_G^{-1} ( x − b_G + α′ W^⊤ (y^+ − y^-) ) = g( x + α′ W^⊤ (M W z̄ + v) )
z^- ≈ H_G^{-1} ( x − b_G − ᾱ′ W^⊤ (y^+ − y^-) ) = g( x − ᾱ′ W^⊤ (M W z̄ + v) )   (71)

for a suitable vector v independent of z̄. Finally,

α′ z^+ + ᾱ′ z^- ≈ H_G^{-1} ( x − b_G + ((α′)² − (ᾱ′)²) W^⊤ (y^+ − y^-) )
 = H_G^{-1} ( x − b_G + (2α′ − 1) W^⊤ (y^+ − y^-) )
 = g( x + (2α′ − 1) W^⊤ (M W z̄ + v) ),   (72)

and we conclude that z̄ is again a fixed point of the updates when α′ = 1/2.
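The role of α′ = 1/2 can be made concrete with a scalar instantiation of the AROVR relation (62). The sketch below sets H_F = H_G = 1, H_ℓ = β and b_F = b_G = 0, uses the same coefficient at the y- and z-level (α′ = α), and iterates z̄ ← g(x + (2α − 1) W^⊤ (M W z̄ + v)); all concrete numbers and names are illustrative assumptions. For α = 1/2 the iterate never leaves the lower-level equilibrium g(x), whereas for α ≠ 1/2 it settles at a different point.

```python
# Scalar instantiation of the AROVR fixed-point relation (Sec. G.1):
#   zbar <- g( x + (2a - 1) * w * (M * w * zbar + v) ),   g(u) = u  (H_G = 1, b_G = 0),
# with H_F = 1, H_ell = beta, b_F = 0 and b_ell = -beta * target (least-squares loss).
beta, w, x, target = 0.5, 1.0, 0.3, 1.0

def drift(alpha, steps=20):
    abar = 1.0 - alpha
    denom = (1.0 + alpha * beta) * (1.0 - abar * beta)
    M = -beta / denom               # Eq. (60), scalar case
    v = beta * target / denom       # terms of y+ - y- that do not depend on zbar
    zbar0 = x                       # lower-level equilibrium g(x)
    zbar = zbar0
    for _ in range(steps):
        zbar = x + (2.0 * alpha - 1.0) * w * (M * w * zbar + v)
    return abs(zbar - zbar0)

print("alpha = 0.5 :", drift(0.5))   # zbar stays at the equilibrium
print("alpha = 1.0 :", drift(1.0))   # zbar moves away under asymmetric nudging
```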