Published as a conference paper at ICLR 2020

LEARNING REPRESENTATIONS FOR BINARY CLASSIFICATION WITHOUT BACKPROPAGATION

Mathias Lechner
IST Austria, Am Campus 1, Klosterneuburg, Austria
mlechner@ist.ac.at

ABSTRACT

The family of feedback alignment (FA) algorithms aims to provide a more biologically motivated alternative to backpropagation (BP) by substituting the computations that are unrealistic to implement in physical brains. While FA algorithms have been shown to work well in practice, there is a lack of rigorous theory proving their learning capabilities. Here we introduce the first feedback alignment algorithm with provable learning guarantees. In contrast to existing work, we do not require any assumption about the size or depth of the network, except that it has a single output neuron, as is the case for binary classification tasks. We show that our FA algorithm delivers its theoretical promises in practice, surpassing the learning performance of existing FA methods and matching backpropagation in binary classification tasks. Finally, we demonstrate the limits of our FA variant when the number of output neurons grows beyond a certain quantity.

1 INTRODUCTION

A key factor enabling the successes of deep learning is the backpropagation of error (BP) algorithm (Rumelhart et al., 1986). Since its introduction, BP has sparked several discussions on whether physical brains realize BP-like learning or not (Grossberg, 1987; Crick, 1989). Today, most researchers agree that two distinct characteristics of BP render the idea of BP-based learning in brains implausible: 1) the use of symmetric forward and backward connections, and 2) the strict separation of activity and error propagation (Bartunov et al., 2018).

These two objections have led researchers to search for more biologically motivated alternatives to BP. The three most influential families of BP alternatives distilled so far are Contrastive Hebbian Learning (CHL) (Movellan, 1991), target propagation (TP) (Le Cun, 1986; Hinton, 2007; Bengio, 2014), and feedback alignment (FA) (Lillicrap et al., 2016).

The idea of CHL is to propagate the target activities, instead of the errors, backward through the network. For this purpose, a temporal dimension is added to the neuron activities. Each neuron then adapts its parameters based on the temporal differences of its forward and backward activity. The two significant criticisms of CHL are the requirement for symmetric forward-backward connections and the use of alternating forward and backward phases (Baldi & Pineda, 1991; Bartunov et al., 2018).

TP shares with CHL the idea of propagating target activities instead of errors. However, rather than keeping symmetric forward and backward paths, the reciprocal propagation of the activities is realized through learned connections. Consequently, each layer is assigned two objectives: learning the inverse of the layer's forward function, and minimizing the difference to the back-projected target activity. Variants of TP differ in how exactly the target activity is projected backward (Le Cun, 1986; Bengio, 2014; Bartunov et al., 2018). Theoretical guarantees of TP rely on the assumption that each reciprocal connection implements the perfect inverse of the corresponding forward function. This imperfect-inverse issue was also found to be the bottleneck of TP in practice (Bartunov et al., 2018).
When the output of a layer has a significantly lower dimension than its input, reconstructing the input from the output becomes challenging, resulting in poor learning performance.

Feedback alignment algorithms eliminate the weight-sharing implausibility of BP by replacing the symmetric weights in the error backpropagation path by random matrices. The second objection, i.e., separate activity and error channels, is attenuated by Direct Feedback Alignment (Nøkland, 2016), which drastically reduces the number of channels carrying an error signal. While feedback alignment algorithms work well on small and medium-sized benchmarks, a recent study found that they are unable to provide learning on more challenging datasets like ImageNet (Bartunov et al., 2018). Another criticism of FA algorithms is the lack of rigorous mathematical justification and convergence guarantees for the performed computations.

In this work, we investigate feed-forward networks in which the weights of all layers except the first are constrained to positive values. We prove that this constraint does not invalidate the universal approximation capabilities of neural networks. Next, we show that, in combination with monotonic activation functions, all layers from the second layer on realize monotonically increasing functions. The backpropagation of a scalar error signal through these layers only affects the magnitude of the error signal but does not change its sign. Consequently, we prove that algorithms that bypass the error backpropagation steps, such as Direct Feedback Alignment, can compute the sign of the true gradient with respect to the weights of our constrained networks without the need for backpropagation. Finally, we show that our algorithm, which we call monotone Direct Feedback Alignment (mDFA), delivers its theoretical promises in practice by surpassing the learning performance of existing feedback alignment algorithms in binary classification tasks, i.e., when the error signal is scalar, and providing decent performance even when the error signal is not scalar.

We make the following key contributions:

- The first FA algorithm that has provable learning capabilities for non-linear networks of arbitrary depth
- An experimental evaluation showing that our FA algorithm outperforms the learning performance of existing FA algorithms and matches backpropagation in binary classification tasks
- An efficient TensorFlow implementation of all tested algorithms, which we make publicly available[1]

2 BACKPROPAGATION AND FEEDBACK ALIGNMENT

We consider the feed-forward neural network

$$h_l(h_{l-1}) := \begin{cases} f(W_l h_{l-1} + b_l) & \text{if } l < L \\ W_l h_{l-1} + b_l & \text{if } l = L \end{cases} \qquad W_l \in \mathbb{R}^{n_l \times n_{l-1}},\; b_l \in \mathbb{R}^{n_l}, \quad (1)$$

where f is the non-linear activation function, x the input, and h_L the output of the network. For classification tasks, h_L is usually transformed into a probability distribution with discrete support by a sigmoid or softmax function. During training, the parameters W_l, b_l, l = 1, ..., L are adjusted to minimize a loss function L(y, h_L) on samples of a given training distribution p(y, x). This is usually done by performing gradient descent

$$\theta \leftarrow \theta - \alpha \frac{dL}{d\theta}, \qquad \alpha \in \mathbb{R}^+, \quad (2)$$

with respect to the parameters θ_l ∈ {W_l, b_l}, 1 ≤ l ≤ L, of the network.

[1] https://github.com/mlech26l/iclr_paper_mdfa
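To make the setting concrete, the following is a minimal NumPy sketch of the forward pass in equation (1) and a plain gradient-descent step as in equation (2). The function names and the choice of tanh as f are our own, not part of the paper.

```python
import numpy as np

def forward(x, Ws, bs, f=np.tanh):
    """Forward pass of the network in equation (1): the non-linearity f
    is applied on every layer except the last, which stays linear."""
    h, L = x, len(Ws)
    for l, (W, b) in enumerate(zip(Ws, bs), start=1):
        a = W @ h + b
        h = f(a) if l < L else a
    return h

def sgd_step(theta, grad, alpha=0.01):
    """Vanilla gradient descent, equation (2): theta <- theta - alpha * dL/dtheta."""
    return theta - alpha * grad
```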
Figure 1: Graphical overview of various error transport methods, adapted from Nøkland (2016). Black arrows represent the forward path, whereas green arrows represent the error feedback. Weights that are learned are denoted by W_i, and fixed weights are denoted by B_i. Weights that are restricted to positive values are annotated by a +. a) Backpropagation, b) Feedback Alignment, c) Direct Feedback Alignment, d) monotone Direct Feedback Alignment.

2.1 BACKPROPAGATION

Backpropagation (Rumelhart et al., 1986) is the primary method to compute the gradients needed by the updates in equation (2), by iteratively applying the chain rule:

$$\frac{dL}{d\theta_l} = \left(\frac{dh_l}{d\theta_l}\right)^T \frac{dL}{dh_l} \quad (3)$$

$$\frac{dL}{dh_l} = \left(\frac{dh_{l+1}}{dh_l}\right)^T \frac{dL}{dh_{l+1}} \quad (4)$$

$$\frac{dh_{l+1}}{dh_l} = \mathrm{diag}\big(f'(W_{l+1} h_l + b_{l+1})\big)\, W_{l+1}. \quad (5)$$

A graphical representation of how information first flows forward and then backward through each layer in BP is shown in Figure 1a.

Two major concerns argue against the idea that biological neural networks implement BP-based learning: I) the weight matrix W_l of the forward path is reused in the backward path in the form of W_l^T (weight sharing), and II) the strict separation of activity-carrying forward connections and error-carrying backward connections (reciprocal error transport).

2.2 FEEDBACK ALIGNMENT ALGORITHMS

Feedback alignment addresses the implausibility of reusing W_l^T in the backward path by replacing W_l^T with a fixed random matrix B_l. Lillicrap et al. (2016) showed that this somewhat counterintuitive approach works remarkably well in practice. The term feedback alignment originates from the observation made by Lillicrap et al. (2016) that the angle between the FA update vector and the true gradient starts to decrease, i.e., align, after a few epochs of the training algorithm. Theoretical groundwork on this alignment principle of FA relies on strong assumptions, such as a linearized network with one hidden layer (Lillicrap et al., 2016). FA avoids any weight sharing but does not address the reciprocal error transport implausibility, due to its strict separation of forward and backward pathways, as shown in Figure 1b.

Direct Feedback Alignment (DFA) (Nøkland, 2016) relaxes this issue by replacing all backward paths with a direct feed from the output layer error-gradient dL/dh_L. Consequently, there is only a single error signal that is distributed across the entire network, which is arguably more biologically plausible than reciprocal error connections. The resulting parameter updates of DFA are of the form

$$\delta\theta_l := \begin{cases} \dfrac{dL}{d\theta_l} & \text{if } l = L \\[2mm] \left(\dfrac{dh_l}{d\theta_l}\right)^T B_l^T \dfrac{dL}{dh_L} & \text{if } l < L, \end{cases} \quad (6)$$

where B_l ∈ R^{n_L × n_l} is a fixed random matrix. A graphical schematic of DFA is shown in Figure 1c. Similar to FA, DFA shows decent learning performance on mid-sized classification tasks (Nøkland, 2016) but fails on more complex datasets such as ImageNet (Bartunov et al., 2018). Theory on adapting the alignment principle to DFA shows that, under the strong assumptions of constant DFA update directions and a layer-wise criterion minimization, the DFA update vector aligns with the true gradient (Nøkland, 2016; Gilmer et al., 2017).

Recently, Frenkel et al. (2019) proposed to combine ideas from feedback alignment and target propagation in their Direct Random Target Projection (DRTP) algorithm. While DRTP shows decent empirical performance, theoretical guarantees about DRTP rely on linearized networks.
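As a concrete reference for equation (6), here is a minimal NumPy sketch of one DFA training step. The squared-error loss, shape conventions, and names are our own assumptions; they are not the paper's implementation.

```python
import numpy as np

def dfa_step(x, y, Ws, bs, Bs, lr=0.01, f=np.tanh, df=lambda a: 1 - np.tanh(a) ** 2):
    """One DFA update in the spirit of equation (6), assuming the loss
    L = 0.5 * ||h_L - y||^2. Bs[l-1] is the fixed random feedback matrix
    B_l of shape (n_L, n_l); it is unused for the output layer l = L."""
    h, hs, acts = x, [x], []
    L = len(Ws)
    for l, (W, b) in enumerate(zip(Ws, bs), start=1):  # forward pass
        a = W @ h + b
        acts.append(a)
        h = f(a) if l < L else a
        hs.append(h)
    e = h - y  # dL/dh_L for the squared-error loss
    for l in range(L, 0, -1):
        # Output layer: true gradient; hidden layers: error fed through B_l.
        delta = e if l == L else (Bs[l - 1].T @ e) * df(acts[l - 1])
        Ws[l - 1] -= lr * np.outer(delta, hs[l - 1])
        bs[l - 1] -= lr * delta
    return 0.5 * float(e @ e)
```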
2.3 SIGN-SYMMETRY ALGORITHMS

Liao et al. (2016) introduced the sign-symmetry algorithm, a hybrid of BP and FA. Sign-symmetry locks the signs of the feedback weights B_l to the signs of W_l^T, but with random absolute values. The authors showed that this approach drastically improves learning performance compared to standard FA. Furthermore, Moskovitz et al. (2018) and Xiao et al. (2019) demonstrated that the sign-symmetry algorithm is even able to match backpropagation when training deep network architectures on large datasets such as ImageNet. While these empirical observations suggest that the polarity of the error feedback is more important than its magnitude, a mathematical justification of sign-symmetry remains absent. Similar to FA, sign-symmetry relaxes the strict weight-sharing implausibility but still relies on an unrealistic reciprocal error transport.

3 MONOTONE DIRECT FEEDBACK ALIGNMENT

In this section, we first introduce a new class of feed-forward networks, in which all layers except the first are constrained to realize monotone functions. We call such networks mono-nets and show that they are as expressive as unconstrained feed-forward networks. Next, we prove that for mono-nets with a single output, e.g., in binary classification, feedback alignment algorithms provide the sign of the gradient. The sign of the gradient is interesting for learning, as it tells us whether the value of a parameter should be increased or decreased in order to reduce the loss. At the end of this section, we highlight similarities to algorithms from the literature that can provide resilient learning by relying only on the sign of the gradient. Neural networks with monotonicity constraints have already been studied in the literature (You et al., 2017), however not in the context of learning algorithms.

Definition 1 (mono-net). A mono-net is a feed-forward neural network with L layers h_1, ..., h_L, each layer l composed of n_l units, and the semantics

$$h_l(h_{l-1}) := \begin{cases} f(W_l h_{l-1} + b_l) & \text{if } l < L \\ W_l h_{l-1} + b_l & \text{if } l = L \end{cases} \quad (7)$$

$$h_0 := x \quad (8)$$

$$W_1 \in \mathbb{R}^{n_1 \times n_0} \quad (9)$$

$$W_l \in \mathbb{R}_+^{n_l \times n_{l-1}} \quad \text{for } l > 1 \quad (10)$$

$$b_l \in \mathbb{R}^{n_l} \quad (11)$$

where R_+ are the non-negative reals, i.e., R_+ = {x ∈ R | x ≥ 0}, f is a non-linear, monotonically increasing activation function, x the input, and h_L the output of the network.

The major difference between mono-nets and general feed-forward neural networks is the restriction to non-negative weight values from the second layer on. Combined with the monotonically increasing activation function, this means that each layer h_l(h_{l-1}), l ≥ 2, realizes a monotone increasing function. Because functional composition preserves monotonicity, the complete network above the first layer,

$$h_L \circ h_{L-1} \circ \dots \circ h_2(h_1), \quad (12)$$

implements a monotone increasing function.

mono-nets are universal approximators. At first glance, this restriction seems counterproductive, as it might interfere with the expressiveness of the networks. However, we prove in Theorem 1 that mono-nets with hyperbolic tangent activation are universal approximators, meaning that they can approximate any continuous function arbitrarily closely. A potential drawback of the monotonicity constraint is that we might need a larger number of units in the hidden layers to achieve the same expressiveness as a general feed-forward network, as illustrated in our proof of Theorem 1.

Theorem 1 (mono-nets are universal approximators). Let I_n be the n-dimensional unit hypercube [0, 1]^n and C(I_n) denote the set of continuous functions f : I_n → R. We define ||f|| as the supremum norm of f ∈ C(I_n) over its domain I_n. For any given f ∈ C(I_n) and ε > 0, there exists a function m : I_n → R of the form

$$m(x) = \sum_{i=1}^{M} v_i \tanh\big(w_i^T x + \hat{w}_i^T (-x) + b_i\big) + c \quad (13)$$

with v ∈ R_+^M, w_i ∈ R_+^n, ŵ_i ∈ R_+^n, b ∈ R^M, c ∈ R, and M < ∞, such that ||m(x) − f(x)|| < ε. In essence, the set of functions m(x) of the form given in (13) is dense in C(I_n).

Proof. See supplementary materials, Section A.1.
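For illustration, a NumPy sketch of a mono-net per Definition 1 follows. Taking absolute values of Gaussian draws for l > 1 is one way to satisfy constraint (10) at initialization; how positivity is maintained during training (e.g., projecting weights back onto R_+ after each update) is our assumption and not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mono_net(sizes):
    """Initialize a mono-net (Definition 1): W_1 is unconstrained,
    all later weight matrices are non-negative (equation (10))."""
    Ws, bs = [], []
    for l, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:]), start=1):
        W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
        if l > 1:
            W = np.abs(W)  # enforce W_l >= 0 for l > 1
        Ws.append(W)
        bs.append(np.zeros(n_out))
    return Ws, bs

def mono_forward(x, Ws, bs):
    """Equations (7)-(8): tanh on all layers except the linear last one.
    Since tanh is monotonically increasing and W_l >= 0 for l > 1, every
    layer after the first realizes a monotone increasing function."""
    h = x
    for l, (W, b) in enumerate(zip(Ws, bs), start=1):
        a = W @ h + b
        h = np.tanh(a) if l < len(Ws) else a
    return h
```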
mDFA provides the sign of the gradient. Here, we prove that for 1-dimensional outputs, DFA applied to a mono-net, which we simply call mDFA, provides the sign of the true gradient. Note that we focus our method on DFA instead of vanilla FA due to the superiority of DFA in terms of biological plausibility and empirical performance (Nøkland, 2016; Bartunov et al., 2018).

Theorem 2 (For 1-dimensional outputs, mDFA computes the sign of the gradient). Let L(y, h_L) be a loss function and m(x) := h_L ∘ h_{L−1} ∘ ... ∘ h_2 ∘ h_1 ∘ h_0(x) be a mono-net according to Definition 1 with parameters Θ := {W_l, b_l | l = 1, ..., L}. We denote by δθ the update value computed by mDFA and by ∇θ the gradient dL/dθ, for any θ ∈ {W_l, b_l} with 1 ≤ l ≤ L. If n_L = 1, it follows that

$$\delta\theta_{i,j} \cdot \nabla\theta_{i,j} \ge 0 \quad (14)$$

for each coordinate (i, j) of θ.

Proof. See supplementary materials, Section A.2.

A graphical illustration of how activities and errors propagate in mDFA is shown in Figure 1d.

Literature on learning by relying only on the sign of the gradient. Two learning concepts related to mDFA are RPROP (Riedmiller & Braun, 1993; Riedmiller, 1994) and signSGD (Bernstein et al., 2018). RPROP aims to be a more resilient alternative to gradient descent by decoupling the amplitude of the gradient from the step size of the parameter updates. In essence, RPROP adapts the step size of each coordinate based on the signs of the most recently computed gradients. Riedmiller & Braun (1993) showed that their approach can stabilize the training of a neural network compared to standard gradient descent. Performing gradient descent with the sign of each gradient coordinate is, on an algorithmic level, equivalent to the well-known steepest descent method with respect to the L∞ norm (Boyd & Vandenberghe, 2004; Bernstein et al., 2018). signSGD (Bernstein et al., 2018) studies the convergence properties of the stochastic approximation of this algorithm.
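Returning to Theorem 2, the following sketch checks the sign-agreement property numerically on a one-hidden-layer mono-net. The squared-error loss and all names here are our own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny mono-net with one hidden layer and a single output neuron.
n_in, n_hid = 3, 4
W1 = rng.normal(size=(n_hid, n_in))        # unconstrained first layer
W2 = np.abs(rng.normal(size=(1, n_hid)))   # non-negative output weights
B1 = np.abs(rng.normal(size=(1, n_hid)))   # non-negative feedback, B_1 in R^{n_L x n_1}

x, y = rng.normal(size=n_in), np.array([0.7])
h1 = np.tanh(W1 @ x)
e = W2 @ h1 - y                            # dL/dh_L for L = 0.5 * (h_L - y)^2

# True gradient of L w.r.t. W1 (backpropagation) vs. the mDFA update direction.
grad_W1 = np.outer((W2.T @ e) * (1 - h1 ** 2), x)
mdfa_W1 = np.outer((B1.T @ e) * (1 - h1 ** 2), x)

# Theorem 2: every coordinate-wise product is non-negative, i.e. the signs agree.
assert np.all(grad_W1 * mdfa_W1 >= 0)
```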
What about networks with more than one output neuron? Theorem 2 applies only to networks with scalar output. As a natural consequence, one may ask whether such theoretical guarantees can be extended to higher-dimensional output variables. The simple answer is: unfortunately not. In the supplementary materials, Section A.3, we provide a counterexample showing that Theorem 2, naively extended to two output neurons, no longer holds. We want to note that the requirement of a neural network to have only a single output neuron is biologically unjustified: it is known that sub-circuits of biological neuronal networks can feed into multiple motor neuron groups (Cook et al., 2019).

How does mDFA relate to non-negative matrix factorization? A seemingly related concept to mDFA is the non-negative matrix factorization (NMF) algorithm. NMF decomposes an observation matrix V into a weight matrix W and a latent variable matrix H such that V ≈ WH. In contrast to other decomposition-based unsupervised learning methods, all three matrices V, W, and H are restricted to non-negative entries. While NMF can effectively model data that is inherently non-negative, such as semantic features of images and text (Yuan & Oja, 2005; Shahnaz et al., 2006), the method is unable to learn subtractive and non-linear structures present in the data (Lee & Seung, 1999). Semi-non-negative matrix factorization (Ding et al., 2008) relaxes NMF's original restriction to non-negative observations by constraining only the weight matrix W to be non-negative. Deep semi-NMF (Trigeorgis et al., 2014) further enhances the expressiveness of NMF by adding multiple layers, with non-linearities between them, to the decomposition. In relation to this work, the semantics of mono-nets from the second layer on is equivalent to that of deep semi-NMF models. However, the unconstrained first layer of mono-nets provides universal approximation capabilities, enabling mono-nets to learn subtractive and non-monotonic input dependencies. Moreover, while deep NMF models are mostly trained via layer-wise learning in an unsupervised context (Trigeorgis et al., 2014; Yu et al., 2018), the sole purpose of mono-nets is to investigate alternatives to backpropagation for training multi-layer classifiers.

4 EXPERIMENTS

So far, we have only proven the learning capabilities of mDFA. What remains unclear is whether mDFA can deliver its theoretical promises in practice. In this section, we experimentally evaluate the learning performance of mDFA on a set of empirical benchmarks. We aim to answer the following two questions: How well does mDFA perform compared to DFA, FA, and backpropagation in its natural conditions, i.e., binary classification tasks, and how much does the performance of mDFA degrade in multi-class classification tasks?

Our performance measure is the classification accuracy achieved by a network trained with a particular method. First, we report the highest achieved accuracy on the training set, which tells us how well the algorithm could fit the model to the training data. Second, for each method, we tuned the hyperparameters on a separate validation set and selected the best performing configuration to be evaluated on the test data. The obtained test accuracy tells us how well the model generalizes to data outside the training set.

We evaluate fully-connected networks (FC) and convolutional networks (CNNs) in the form of modified all-convolutional nets (Springenberg et al., 2015) with the hyperbolic tangent, ReLU (Nair & Hinton, 2010), and hard-tanh non-linearities. The hard-tanh function is defined as

$$\text{hard-tanh}(x) := \min(\max(x, -1), 1). \quad (15)$$

Hyperparameters. For all training methods, we fixed the batch size to 64 and applied no regularization, no normalization, and no data augmentation. The optimizer, i.e., vanilla gradient descent, Adam (Kingma & Ba, 2014), or RMSprop (Tieleman & Hinton, 2012), as well as the learning rate, number of training epochs, and weight initialization method, were tuned on the validation set. We tested three different weight initialization schemes: all zeros, a scaled uniform, and a normal distribution. Note that the all-zeros scheme was only tested on the forward weights. Our uniform initialization followed the methodology of Nøkland (2016), i.e., scaling the bounds of the distribution inversely with the square root of the number of incoming connections of a neuron. To comply with the weight constraints of mono-nets, for mDFA the lower bound of the uniform distribution was set to ε = 10^{-3}. Moreover, for mDFA we post-processed the initial weights of the normal distribution by taking their absolute values. Input variables are scaled to [0, 1]. Detailed network architectures and a brief discussion of the best performing hyperparameter configurations are given in the supplementary materials, Sections B.1 and B.3.
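A sketch of the two non-zero initialization schemes just described follows; the exact scale of the normal distribution is our assumption, as the paper only specifies the uniform scaling and the absolute-value post-processing for mDFA.

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 1e-3  # lower bound used for mDFA's uniform scheme

def init_uniform(n_out, n_in, positive=False):
    """Scaled-uniform initialization following Nøkland (2016): bounds scale
    inversely with the square root of the fan-in. For mDFA (positive=True),
    the lower bound is lifted to EPS to satisfy the mono-net constraint."""
    bound = 1.0 / np.sqrt(n_in)
    low = EPS if positive else -bound
    return rng.uniform(low, bound, size=(n_out, n_in))

def init_normal(n_out, n_in, positive=False):
    """Normal initialization; for mDFA the draws are replaced by their
    absolute values, as described above."""
    W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
    return np.abs(W) if positive else W
```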
4.1 BINARY CLASSIFICATION

We created a series of binary classification benchmarks by randomly sampling two classes from the well-studied CIFAR-100 and ImageNet datasets. We then train and test a 1-vs-1 classifier on the samples of the two classes. For each dataset, we create five such binarized sub-datasets and report the mean and standard deviation over these experiments. CIFAR-100 (Krizhevsky et al., 2014) poses a challenging classification dataset consisting of 32-by-32 RGB images in 100 real-world object categories. ImageNet (Russakovsky et al., 2015) is a large-scale object classification benchmark and the de-facto standard to assess new deep learning methods. Each sample is a high-resolution image representing one out of 1000 possible object classes. We pre-processed all samples by cropping and resizing each image to 224-by-224 pixels. Because ImageNet lacks a public test set, we report the validation accuracy, following Bartunov et al. (2018).

The results on the binarized benchmarks are shown in Table 1 for CIFAR-100 and Table 2 for ImageNet. mDFA could bring the training error to zero for fully-connected networks and match the test/validation accuracy of backpropagation for convolutional networks.

Poor learning performance of ReLU networks. One surprising characteristic of our results is that mDFA fails to provide the same level of learning performance as backpropagation for ReLU networks. Recall that Theorem 1 proves universal approximation capabilities only for mono-nets with tanh activation. A mono-net with ReLU non-linearity restricts both the activations and the weight values to positive values. These constraints arguably limit the approximation capabilities of mono-nets combined with ReLU and thus explain the poor performance of mDFA for ReLU networks. We validated this hypothesis by testing a rectifier non-linearity whose image also contains negative values. Coincidentally, the hard-tanh function matches exactly this criterion. The decent learning performance of mDFA for hard-tanh networks therefore confirms our hypothesis. Note that FA and DFA also improved in performance when switching from ReLU to hard-tanh activation. This observation suggests that feedback alignment algorithms in general benefit from symmetric activation functions.

Discrepancy between fully-connected and convolutional networks. Though mDFA achieved the same training accuracy as backpropagation for fully-connected networks, its generalization ability, i.e., test and validation accuracy, slightly lags behind BP on our binarized CIFAR-100 benchmark. This observation aligns with the studies by Nøkland (2016) and Bartunov et al. (2018). For convolutional neural networks, this effect is reversed, i.e., decent test/validation accuracy but a higher training error than BP. We speculate that the two tested initialization schemes for the feedback weights caused this discrepancy. Both backpropagation and feedback alignment have been shown to be sensitive to the employed initialization method (Zhang et al., 2019; Bartunov et al., 2018). The restriction of the weights in mDFA to positive values requires a rethinking of the initialization methods examined in the literature. However, we leave this study open for future work.
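For concreteness, a sketch of the sub-dataset construction described at the beginning of this section follows; it also covers the n-class variant used in Section 4.2. The paper does not give the sampling code, so seeding and relabeling details are our own assumptions.

```python
import numpy as np

def make_subtask(x, y, n_classes, rng):
    """Build one n-class sub-dataset by sampling n of the original labels
    and relabeling them 0..n-1 (n = 2 gives the binarized benchmarks)."""
    chosen = rng.choice(np.unique(y), size=n_classes, replace=False)
    mask = np.isin(y, chosen)
    relabel = {c: i for i, c in enumerate(chosen)}
    return x[mask], np.array([relabel[c] for c in y[mask]])
```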
| Activation | Model | Training acc. (FC) | Test acc. (FC) | Training acc. (CNN) | Test acc. (CNN) |
|---|---|---|---|---|---|
| tanh | BP | 100.0 ± 0.0% | 81.3 ± 11.4% | 100.0 ± 0.0% | 80.9 ± 10.4% |
| tanh | DFA | 84.1 ± 8.2% | 76.8 ± 11.5% | 84.4 ± 7.9% | 80.0 ± 10.3% |
| tanh | FA | 85.3 ± 8.0% | 76.9 ± 12.0% | 82.5 ± 10.2% | 77.5 ± 12.2% |
| tanh | mDFA | 100.0 ± 0.0% | 77.3 ± 12.2% | 87.4 ± 7.0% | 80.9 ± 9.2% |
| ReLU | BP | 100.0 ± 0.0% | 80.0 ± 10.8% | 100.0 ± 0.0% | 82.2 ± 9.5% |
| ReLU | DFA | 76.2 ± 9.6% | 72.9 ± 9.2% | 73.0 ± 15.5% | 72.0 ± 16.2% |
| ReLU | FA | 77.9 ± 10.2% | 71.0 ± 12.0% | 68.6 ± 15.9% | 69.8 ± 15.7% |
| ReLU | mDFA | 78.2 ± 9.9% | 62.1 ± 12.7% | 78.0 ± 11.7% | 75.5 ± 11.7% |
| hard-tanh | BP | 100.0 ± 0.0% | 81.4 ± 10.5% | 100.0 ± 0.0% | 81.1 ± 11.9% |
| hard-tanh | DFA | 84.2 ± 8.2% | 77.5 ± 11.2% | 81.6 ± 9.3% | 79.7 ± 10.5% |
| hard-tanh | FA | 84.9 ± 8.0% | 77.6 ± 12.2% | 94.2 ± 3.5% | 83.4 ± 8.7% |
| hard-tanh | mDFA | 100.0 ± 0.0% | 77.2 ± 12.2% | 89.0 ± 6.0% | 81.1 ± 9.8% |

Table 1: Accuracies on binarized CIFAR-100 (mean ± standard deviation).

| Activation | Model | Training acc. (FC) | Validation acc. (FC) | Training acc. (CNN) | Validation acc. (CNN) |
|---|---|---|---|---|---|
| tanh | BP | 98.4 ± 3.6% | 77.3 ± 6.8% | 99.8 ± 0.5% | 81.7 ± 11.6% |
| tanh | DFA | 93.8 ± 13.0% | 74.0 ± 8.5% | 100.0 ± 0.0% | 83.0 ± 8.6% |
| tanh | FA | 97.5 ± 0.8% | 74.7 ± 7.7% | 51.9 ± 0.5% | 54.7 ± 4.6% |
| tanh | mDFA | 95.3 ± 10.4% | 79.5 ± 9.8% | 79.5 ± 9.8% | 83.8 ± 7.2% |
| ReLU | BP | 100.0 ± 0.0% | 78.3 ± 7.7% | 90.0 ± 18.1% | 72.5 ± 13.7% |
| ReLU | DFA | 75.1 ± 8.1% | 73.2 ± 8.0% | 50.4 ± 0.9% | 52.7 ± 2.4% |
| ReLU | FA | 68.3 ± 7.9% | 71.0 ± 6.7% | 52.3 ± 0.9% | 54.8 ± 2.1% |
| ReLU | mDFA | 56.9 ± 0.9% | 59.7 ± 2.4% | 52.6 ± 1.6% | 58.4 ± 4.5% |
| hard-tanh | BP | 99.9 ± 0.2% | 75.8 ± 7.7% | 99.9 ± 0.1% | 76.2 ± 9.8% |
| hard-tanh | DFA | 97.9 ± 2.2% | 76.8 ± 7.2% | 99.3 ± 1.5% | 83.2 ± 8.8% |
| hard-tanh | FA | 98.1 ± 2.3% | 76.5 ± 7.2% | 62.4 ± 17.3% | 60.8 ± 13.8% |
| hard-tanh | mDFA | 99.9 ± 0.1% | 78.0 ± 6.8% | 78.2 ± 9.4% | 83.0 ± 6.8% |

Table 2: Accuracies on binarized ImageNet (mean ± standard deviation).

| Set | Classes | BP | DFA | FA | mDFA |
|---|---|---|---|---|---|
| Training | 3 | 100.0 ± 0.0% | 50.5 ± 9.5% | 79.0 ± 6.1% | 53.3 ± 8.9% |
| Training | 5 | 100.0 ± 0.0% | 33.8 ± 9.7% | 68.4 ± 6.0% | 34.1 ± 9.8% |
| Training | 10 | 100.0 ± 0.0% | 24.7 ± 1.8% | 52.4 ± 4.0% | 25.9 ± 1.2% |
| Training | 25 | 99.9 ± 0.1% | 13.4 ± 1.7% | 31.3 ± 1.6% | 13.1 ± 1.4% |
| Training | 50 | 99.7 ± 0.1% | 8.5 ± 0.1% | 20.0 ± 1.1% | 7.9 ± 0.2% |
| Test | 3 | 77.4 ± 8.7% | 49.1 ± 7.2% | 74.2 ± 8.1% | 50.9 ± 7.2% |
| Test | 5 | 68.5 ± 8.1% | 31.2 ± 11.6% | 60.7 ± 8.4% | 33.9 ± 9.8% |
| Test | 10 | 57.3 ± 4.9% | 25.0 ± 3.2% | 46.9 ± 3.8% | 26.3 ± 2.3% |
| Test | 25 | 41.2 ± 2.6% | 13.4 ± 1.9% | 28.2 ± 1.4% | 13.1 ± 1.3% |
| Test | 50 | 31.8 ± 1.8% | 8.8 ± 0.4% | 18.1 ± 0.9% | 8.0 ± 0.4% |

Table 3: Multi-class classification accuracies of a fully-connected network with tanh activation on n-class CIFAR-100 (mean ± standard deviation).

4.2 MULTI-CLASS CLASSIFICATION

Here we modify the classification benchmark creation procedure used above by randomly sampling n classes from the datasets instead of just two. Due to their compelling results in our binary classification benchmark, we restrict our evaluation to networks with tanh activation. Table 3 and Table 4 show the results on our n-class CIFAR-100 benchmark for fully-connected networks and CNNs, respectively. The results on n-class ImageNet can be found in the supplementary materials in Tables 5 and 6.

mDFA can provide learning for networks with more than one output neuron. Though we do not have a complete theory of mDFA for multi-dimensional outputs, our experiments indicate that mDFA can provide learning for networks with more than one output neuron. In particular, the accuracies achieved by mDFA outperform the other feedback alignment methods for convolutional networks with ten or fewer output neurons. However, mDFA falls behind standard DFA for networks with more than ten output neurons. This observation suggests that our restriction to positive weights can be beneficial even for multi-class tasks but eventually hurts learning performance when the number of classes grows larger.

Feedback alignment algorithms, in general, tend to struggle with increasing output dimension. Our results show the trend that, as the number of classes increases, feedback alignment algorithms struggle to fit the training data.
While BP can reduce the training error to almost zero independently of the output dimension, the training errors achieved by FA algorithms are significantly higher and correlate with the number of output neurons. Our observations suggest that the dimension of the error signal affects the training convergence of FA algorithms, in contrast to BP, which appears to be unaffected by it. This relation potentially provides a step towards understanding why FA algorithms fail on challenging datasets, as described by Bartunov et al. (2018).

| Set | Classes | BP | DFA | FA | mDFA |
|---|---|---|---|---|---|
| Training | 3 | 100.0 ± 0.0% | 86.3 ± 5.0% | 84.1 ± 10.2% | 100.0 ± 0.0% |
| Training | 5 | 100.0 ± 0.0% | 84.6 ± 4.6% | 74.3 ± 8.6% | 100.0 ± 0.0% |
| Training | 10 | 100.0 ± 0.0% | 83.0 ± 3.5% | 57.9 ± 3.4% | 94.4 ± 1.7% |
| Training | 25 | 99.9 ± 0.0% | 81.0 ± 1.5% | 36.3 ± 1.8% | 64.8 ± 1.4% |
| Training | 50 | 98.9 ± 0.0% | 80.5 ± 1.0% | 22.7 ± 1.4% | 44.5 ± 0.9% |
| Test | 3 | 76.2 ± 10.1% | 79.8 ± 6.2% | 79.6 ± 9.7% | 82.0 ± 7.6% |
| Test | 5 | 65.4 ± 9.8% | 71.8 ± 7.6% | 70.1 ± 9.0% | 73.3 ± 8.8% |
| Test | 10 | 57.5 ± 5.2% | 65.0 ± 4.8% | 55.1 ± 4.1% | 63.9 ± 4.2% |
| Test | 25 | 41.8 ± 2.2% | 48.3 ± 2.2% | 35.0 ± 1.4% | 44.0 ± 2.5% |
| Test | 50 | 33.3 ± 1.5% | 37.1 ± 1.5% | 22.3 ± 1.1% | 34.5 ± 1.5% |

Table 4: Multi-class classification accuracies of a convolutional neural network with tanh activation on n-class CIFAR-100 (mean ± standard deviation).

5 CONCLUSION

Feedback alignment algorithms are promising biologically motivated alternatives to backpropagation. While the existing literature provides empirical evidence that FA algorithms can work well in practice, there is still a lack of rigorous theory formalizing their learning capabilities. Here we contributed to the field of biologically motivated learning algorithms by introducing the first feedback alignment algorithm that provably provides learning for non-linear networks of arbitrary depth and a single output neuron. We showed that our FA algorithm outperforms existing FA algorithms in binary classification tasks and even provides decent learning on multi-class problems.

Limitations. We demonstrated on empirical benchmarks, as well as with theoretical examples, that our method is limited to networks with scalar output. Indeed, uncovering the mathematical principles behind the decent learning performance of FA and DFA on multi-class tasks remains an open challenge.

Is this really useful? In terms of scientific significance, this work provides theoretical contributions toward understanding the capabilities and limits of feedback alignment algorithms. From a practical point of view, our mDFA algorithm has an advantage over backpropagation concerning training latency, as all weight updates can be computed in parallel (see Figure 1). Furthermore, we have shown that mDFA is superior to other feedback alignment algorithms in binary classification tasks. Consequently, mDFA provides an effective solution for binary classification problems with training latency constraints. Dynamic branch prediction in microprocessor pipelines is one such problem instance, where program-specific binary branch outcomes, i.e., branch taken/not taken, need to be learned in real time. Because of this real-time constraint, existing branch predictors often employ only shallow perceptron modules (Jiménez & Lin, 2002; Egan et al., 2003). mDFA could enable deeper branch predictor networks to be learned in real time.

ACKNOWLEDGMENTS

This research was supported in part by the Austrian Science Fund (FWF) under grant Z211-N23 (Wittgenstein Award).

REFERENCES

Pierre Baldi and Fernando Pineda. Contrastive learning and neural oscillations. Neural Computation, 3(4):526–545, 1991.
Sergey Bartunov, Adam Santoro, Blake Richards, Luke Marris, Geoffrey E Hinton, and Timothy Lillicrap. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Conference on Neural Information Processing Systems (NeurIPS), 2018.

Yoshua Bengio. How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906, 2014.

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning (ICML), 2018.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Steven J Cook, Travis A Jarrell, Christopher A Brittin, Yi Wang, Adam E Bloniarz, Maksim A Yakovlev, Ken CQ Nguyen, Leo T-H Tang, Emily A Bayer, Janet S Duerr, et al. Whole-animal connectomes of both Caenorhabditis elegans sexes. Nature, 571(7763):63–71, 2019.

Francis Crick. The recent excitement about neural networks. Nature, 337(6203):129–132, 1989.

Chris HQ Ding, Tao Li, and Michael I Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):45–55, 2008.

Colin Egan, Gordon Steven, Patrick Quick, Rubén Anguera, Fleur Steven, and Lucian Vintan. Two-level branch prediction using neural networks. Journal of Systems Architecture, 49(12-15):557–570, 2003.

Charlotte Frenkel, Martin Lefebvre, and David Bol. Learning without feedback: Direct random target projection as a feedback-alignment algorithm with layerwise feedforward training. arXiv preprint arXiv:1909.01311, 2019.

Justin Gilmer, Colin Raffel, Samuel S Schoenholz, Maithra Raghu, and Jascha Sohl-Dickstein. Explaining the learning dynamics of direct feedback alignment. International Conference on Learning Representations (ICLR), Workshop Track, 2017.

Stephen Grossberg. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11(1):23–63, 1987.

Geoffrey Hinton. How to do backpropagation in a brain. Invited talk at the NIPS 2007 Deep Learning Workshop, 2007.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Daniel A Jiménez and Calvin Lin. Neural methods for dynamic branch prediction. ACM Transactions on Computer Systems (TOCS), 20(4):369–397, 2002.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 2010.

Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-100 dataset. Online: http://www.cs.toronto.edu/~kriz/cifar.html, 2014.

Yann Le Cun. Learning process in an asymmetric threshold network. In Disordered Systems and Biological Organization, pp. 233–240. Springer, 1986.

Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.

Qianli Liao, Joel Z Leibo, and Tomaso Poggio. How important is weight symmetry in backpropagation? In AAAI Conference on Artificial Intelligence (AAAI), 2016.

Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7, 2016.

Theodore H Moskovitz, Ashok Litwin-Kumar, and LF Abbott.
Feedback alignment in deep convolutional networks. arXiv preprint arXiv:1812.06488, 2018.

Javier R Movellan. Contrastive Hebbian learning in the continuous Hopfield model. In Connectionist Models, pp. 10–17. Elsevier, 1991.

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), 2010.

Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Conference on Neural Information Processing Systems (NeurIPS), 2016.

Martin Riedmiller. Rprop - description and implementation details. Citeseer, 1994.

Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE International Conference on Neural Networks, 1993.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

Farial Shahnaz, Michael W Berry, V Paul Pauca, and Robert J Plemmons. Document clustering using nonnegative matrix factorization. Information Processing & Management, 42(2):373–386, 2006.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations (ICLR), 2015.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

George Trigeorgis, Konstantinos Bousmalis, Stefanos Zafeiriou, and Bjoern Schuller. A deep semi-NMF model for learning hidden representations. In International Conference on Machine Learning, pp. 1692–1700, 2014.

Will Xiao, Honglin Chen, Qianli Liao, and Tomaso Poggio. Biologically-plausible learning algorithms can scale to large datasets. In International Conference on Learning Representations (ICLR), 2019.

Seungil You, David Ding, Kevin Canini, Jan Pfeifer, and Maya Gupta. Deep lattice networks and partial monotonic functions. In Conference on Neural Information Processing Systems (NeurIPS), 2017.

Jinshi Yu, Guoxu Zhou, Andrzej Cichocki, and Shengli Xie. Learning the hierarchical parts of objects by deep non-smooth nonnegative matrix factorization. IEEE Access, 6:58096–58105, 2018.

Zhijian Yuan and Erkki Oja. Projective nonnegative matrix factorization for image compression and feature extraction. In Scandinavian Conference on Image Analysis, pp. 333–342. Springer, 2005.

Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Residual learning without normalization via better initialization. In International Conference on Learning Representations (ICLR), 2019.

A.1 MONO-NETS ARE UNIVERSAL APPROXIMATORS

We denote by σ(x) := 1/(1 + exp(−x)) the sigmoid function, by tanh(x) := (e^x − e^{−x})/(e^x + e^{−x}) the hyperbolic tangent, and by

$$\mathbb{1}[\text{expr}] := \begin{cases} 1 & \text{if expr is true} \\ 0 & \text{if expr is false} \end{cases}$$

the indicator function. From the definitions above we can derive the equalities

$$\sigma(x) = \frac{1 + \tanh(x/2)}{2} \quad (16)$$

$$\tanh(-x) = -\tanh(x), \quad (17)$$

which we will need for our proof.
Lemma 1 (Universal Approximation Theorem). For any given f ∈ C(I_n) and ε > 0, there exists a function g : I_n → R of the form

$$g(x) = \sum_{i=1}^{N} v_i \, \sigma(w_i^T x + b_i) \quad (18)$$

with v ∈ R^N, w_i ∈ R^n, b ∈ R^N, and N < ∞, such that ||g(x) − f(x)|| < ε. In essence, the set of functions g(x) of the form given in (18) is dense in C(I_n).

Proof. See Hornik et al. (1989).

Theorem 3 (mono-nets are universal approximators). For any given f ∈ C(I_n) and ε > 0, there exists a function m : I_n → R of the form

$$m(x) = \sum_{i=1}^{M} v_i \tanh(w_i^T x + b_i) + c \quad (19)$$

with v ∈ R_+^M, w_i ∈ R^n, b ∈ R^M, c ∈ R, and M < ∞, such that ||m(x) − f(x)|| < ε. In essence, the set of functions m(x) of the form given in (19) is dense in C(I_n).

Proof. By Lemma 1, we know that there exists a g(x) with ||g(x) − f(x)|| < ε of the form

$$g(x) = \sum_{i=1}^{N} v_i \, \sigma(w_i^T x + b_i).$$

We show that we can reformulate g(x) into the form of equation (19). The basic idea is to propagate all negative output coefficients into the first layer, where negative weight values are allowed. Using equality (16),

$$g(x) = \sum_{i=1}^{N} v_i \, \frac{1 + \tanh\big((w_i^T x + b_i)/2\big)}{2}$$

$$= \sum_{i=1}^{N} \mathbb{1}[v_i \ge 0] \, \frac{v_i}{2} \tanh\Big(\frac{w_i^T x + b_i}{2}\Big) + \sum_{i=1}^{N} \mathbb{1}[v_i < 0] \, \frac{v_i}{2} \tanh\Big(\frac{w_i^T x + b_i}{2}\Big) + \sum_{i=1}^{N} \frac{v_i}{2},$$

and then, applying equality (17) to the units with negative v_i,

$$= \sum_{i=1}^{N} \mathbb{1}[v_i \ge 0] \, \frac{v_i}{2} \tanh\Big(\frac{w_i^T x + b_i}{2}\Big) + \sum_{i=1}^{N} \mathbb{1}[v_i < 0] \, \Big(-\frac{v_i}{2}\Big) \tanh\Big(\frac{-w_i^T x - b_i}{2}\Big) + c$$

$$= \sum_{i=1}^{N} \bar{v}_i \tanh(\bar{w}_i^T x + \bar{b}_i) + \sum_{i=1}^{N} \bar{v}'_i \tanh(\bar{w}'^T_i x + \bar{b}'_i) + c$$

with

$$\bar{v}_i = \mathbb{1}[v_i \ge 0] \frac{v_i}{2} \ge 0, \quad \bar{w}_i = \frac{w_i}{2}, \quad \bar{b}_i = \frac{b_i}{2},$$
$$\bar{v}'_i = -\mathbb{1}[v_i < 0] \frac{v_i}{2} \ge 0, \quad \bar{w}'_i = -\frac{w_i}{2}, \quad \bar{b}'_i = -\frac{b_i}{2}, \quad c = \sum_{i=1}^{N} \frac{v_i}{2}.$$

Therefore, we have shown that there exists an m(x) of the form of equation (19) that satisfies ||m(x) − f(x)|| < ε.

This proof shows how to translate any neural network with sigmoid activation function and one hidden layer of size N into a mono-net with 2N hidden units.
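The construction in this proof can be written out programmatically. The paper's proof uses 2N hidden units by splitting positive and negative output coefficients; the sketch below folds the split into a sign flip of the affected units, which realizes the same function of the form (19) with N units. The final numerical check is our own addition.

```python
import numpy as np

def sigmoid_to_mono(v, W, b):
    """Convert a one-hidden-layer sigmoid network
        g(x) = sum_i v_i * sigmoid(w_i . x + b_i)
    into the mono-net form of Theorem 3,
        m(x) = sum_i v'_i * tanh(w'_i . x + b'_i) + c,  with v'_i >= 0,
    using sigmoid(z) = (1 + tanh(z / 2)) / 2 and tanh(-z) = -tanh(z)."""
    s = np.where(v >= 0, 1.0, -1.0)   # flip the units with negative v_i
    v_new = np.abs(v) / 2.0           # non-negative output weights
    W_new = s[:, None] * W / 2.0      # first-layer weights may stay signed
    b_new = s * b / 2.0
    c = float(np.sum(v)) / 2.0
    return v_new, W_new, b_new, c

# Quick check on random data: both networks compute the same function.
rng = np.random.default_rng(0)
v, W, b = rng.normal(size=5), rng.normal(size=(5, 3)), rng.normal(size=5)
x = rng.normal(size=3)
g = np.sum(v / (1 + np.exp(-(W @ x + b))))
v2, W2, b2, c = sigmoid_to_mono(v, W, b)
m = np.sum(v2 * np.tanh(W2 @ x + b2)) + c
assert np.isclose(g, m)
```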
A.2 FOR 1-DIMENSIONAL OUTPUTS MDFA COMPUTES THE SIGN OF THE GRADIENT

Lemma 2 (The gradient of a monotone layer is non-negative). Let m : R^{n_0} → R with m(x) := h_L ∘ h_{L−1} ∘ ... ∘ h_2 ∘ h_1 ∘ h_0(x) be a mono-net according to Definition 1. Then

$$\left(\frac{dh_l}{dh_{l-1}}\right)_{i,j} \ge 0 \quad (20)$$

for any l, i, j with 2 ≤ l ≤ L, 1 ≤ i ≤ n_l, and 1 ≤ j ≤ n_{l−1}.

Proof. We have to distinguish two cases.

Case 1: l = L, i.e., there is no activation function. We have

$$\left(\frac{dh_L}{dh_{L-1}}\right)_{i,j} = (W_L)_{i,j} \ge 0 \quad (21)$$

according to the definition in equation (10).

Case 2: l < L, i.e., there is an activation function. We have

$$\frac{dh_l}{dh_{l-1}} = \mathrm{diag}\big(f'(W_l h_{l-1} + b_l)\big)\, W_l,$$

where f' is the derivative of the activation function. Because f is a monotonically increasing function, its derivative is non-negative everywhere. As a result, each entry (dh_l/dh_{l−1})_{i,j} = f'(W_l h_{l−1} + b_l)_i (W_l)_{i,j} is a product of non-negative values. Ergo (dh_l/dh_{l−1})_{i,j} ≥ 0.

Lemma 3 (The gradient of a composition of monotone layers is non-negative). Let m : R^{n_0} → R with m(x) := h_L ∘ h_{L−1} ∘ ... ∘ h_2 ∘ h_1 ∘ h_0(x) be a mono-net according to Definition 1. Then

$$\left(\frac{dh_l}{dh_k}\right)_{i,j} \ge 0 \quad (25)$$

for any l, k, i, j with 2 ≤ l ≤ L, 1 ≤ k < l, 1 ≤ i ≤ n_l, and 1 ≤ j ≤ n_k.

Proof. By applying the chain rule we get

$$\frac{dh_l}{dh_k} = \prod_{m=k}^{l-1} \frac{dh_{m+1}}{dh_m}.$$

According to Lemma 2, this is a product of matrices with non-negative entries. Because products and sums of non-negative values are non-negative, we have (dh_l/dh_k)_{i,j} ≥ 0.

Theorem 4 (For 1-dimensional outputs, mDFA computes the sign of the gradient). Let L(y, h_L) be a loss function and m(x) := h_L ∘ h_{L−1} ∘ ... ∘ h_2 ∘ h_1 ∘ h_0(x) be a mono-net according to Definition 1 with parameters Θ := {W_l, b_l | l = 1, ..., L}. We denote by δθ the update value computed by mDFA and by ∇θ the gradient dL/dθ, for any θ ∈ {W_l, b_l} with 1 ≤ l ≤ L. If n_L = 1, it follows that

$$\delta\theta_{i,j} \cdot \nabla\theta_{i,j} \ge 0 \quad (28)$$

for each coordinate (i, j) of θ.

Proof. We distinguish two cases.

Case 1: l = L, i.e., θ is a parameter of the last layer. From the definition of mDFA, the last layer is updated with the true error,

$$\delta\theta = \left(\frac{dh_L}{d\theta}\right)^T \frac{dL}{dh_L},$$

and by applying the chain rule, the gradient has the same form. Thus, in the last layer, the mDFA update equals the gradient, and δθ_{i,j} · ∇θ_{i,j} = (∇θ_{i,j})² ≥ 0.

Case 2: l < L, i.e., θ is a parameter of a hidden layer. From the definition of mDFA we have

$$\delta\theta_{i,j} = \sum_{k=1}^{n_L} \frac{d(h_l)_i}{d\theta_{i,j}} \left(\frac{dL}{dh_L}\right)_k B_{k,i} \quad (33)$$

with B ∈ R_+^{n_L × n_l}. We assumed that the output dimension is 1, i.e., n_L = 1. Therefore,

$$\delta\theta_{i,j} = \frac{d(h_l)_i}{d\theta_{i,j}} \left(\frac{dL}{dh_L}\right)_1 B_{1,i}. \quad (34)$$

For the gradient, applying the chain rule and expanding the multiplication in the same way gives

$$\nabla\theta_{i,j} = \sum_{k=1}^{n_L} \frac{d(h_l)_i}{d\theta_{i,j}} \left(\frac{dL}{dh_L}\right)_k \left(\frac{dh_L}{dh_l}\right)_{k,i} \quad (38)$$

and, with n_L = 1,

$$\nabla\theta_{i,j} = \frac{d(h_l)_i}{d\theta_{i,j}} \left(\frac{dL}{dh_L}\right)_1 \left(\frac{dh_L}{dh_l}\right)_{1,i}. \quad (39)$$

For equation (28) we then get

$$\delta\theta_{i,j} \cdot \nabla\theta_{i,j} = \left(\frac{d(h_l)_i}{d\theta_{i,j}}\right)^2 \left(\frac{dL}{dh_L}\right)_1^2 \underbrace{B_{1,i}}_{\ge 0} \underbrace{\left(\frac{dh_L}{dh_l}\right)_{1,i}}_{\ge 0} \ge 0, \quad (40)$$

where B_{1,i} ≥ 0 by construction and (dh_L/dh_l)_{1,i} ≥ 0 by Lemma 3.

A.3 FOR k ≥ 2-DIMENSIONAL OUTPUTS MDFA UPDATES MAY NOT ALIGN WITH THE GRADIENT

Theorem 5 (For k ≥ 2-dimensional outputs, mDFA updates may not align with the gradient). Let L(y, h_L) be a loss function and m(x) := h_L ∘ h_{L−1} ∘ ... ∘ h_2 ∘ h_1 ∘ h_0(x) be a mono-net according to Definition 1 with parameters Θ := {W_l, b_l | l = 1, ..., L}. We denote by δθ the update value computed by mDFA and by ∇θ the gradient dL/dθ, for any θ ∈ {W_l, b_l} with 1 ≤ l < L. If n_L ≥ 2, there exists the possibility that

$$\delta\theta_{i,j} \cdot \nabla\theta_{i,j} < 0 \quad (43)$$

for at least one coordinate (i, j) of θ.

Proof. We construct a minimal counterexample consisting of a network with two output neurons and one hidden layer of two neurons, with output weights and feedback matrix

$$W_2 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad B_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

The error signal reaching the hidden layer is W_2^T dL/dh_L for the true gradient but B_2^T dL/dh_L for mDFA. For an error signal such as dL/dh_L = (1, 0)^T, these two feedback vectors,

$$W_2^T \frac{dL}{dh_L} = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \quad \text{and} \quad B_2^T \frac{dL}{dh_L} = \begin{pmatrix} 1 \\ 0 \end{pmatrix},$$

are orthogonal. Choosing, e.g., dL/dh_L = (1, −1)^T even yields the feedback vectors (−1, 1)^T and (1, −1)^T, whose coordinates have strictly opposite signs; since the common factor d(h_1)_i/dθ_{i,j} enters both δθ_{i,j} and ∇θ_{i,j} identically (and is positive, e.g., for the bias parameters of a tanh layer), this yields δθ_{i,j} · ∇θ_{i,j} < 0.
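A quick numerical check of this counterexample follows; W_2 and B_2 are taken from the proof, while the concrete error vectors are our own choices.

```python
import numpy as np

W2 = np.array([[0.0, 1.0],
               [1.0, 0.0]])  # forward weights of the output layer
B2 = np.eye(2)               # fixed non-negative feedback matrix

e = np.array([1.0, 0.0])     # one choice of dL/dh_L
print(W2.T @ e, B2.T @ e)    # [0. 1.] vs. [1. 0.] -> orthogonal feedback

e = np.array([1.0, -1.0])    # another choice of dL/dh_L
print(W2.T @ e, B2.T @ e)    # [-1. 1.] vs. [1. -1.] -> strictly opposite signs
```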
B EXPERIMENT SETUP

B.1 NETWORK ARCHITECTURES

| Dataset | Fully-connected | Convolutional |
|---|---|---|
| CIFAR-100 | 1024, 1024 | (96,5,2), (96,3,2), (96,3,1) |
| ImageNet | 1024, 1024, 1024, 1024 | (96,3,2), (96,5,1), (128,3,2), (192,3,1), (192,3,2), (384,3,1) |

Network architectures; layers are separated by commas. The fully-connected column specifies the number of neurons in each layer. The convolutional column specifies the number of filters, kernel size, and stride of each layer.

B.2 n-CLASS IMAGENET

| Set | Classes | BP | DFA | FA | mDFA |
|---|---|---|---|---|---|
| Training | 3 | 100.0 ± 0.0% | 59.6 ± 7.2% | 50.4 ± 5.9% | 51.4 ± 4.4% |
| Training | 5 | 100.0 ± 0.0% | 48.9 ± 5.0% | 41.4 ± 5.5% | 41.0 ± 7.1% |
| Training | 10 | 99.7 ± 0.5% | 33.4 ± 4.5% | 24.7 ± 3.3% | 38.7 ± 10.1% |
| Validation | 3 | 74.5 ± 8.8% | 62.8 ± 11.9% | 64.2 ± 10.9% | 62.0 ± 9.6% |
| Validation | 5 | 63.0 ± 7.4% | 43.8 ± 6.2% | 42.3 ± 7.1% | 37.3 ± 5.2% |
| Validation | 10 | 41.2 ± 5.0% | 30.8 ± 4.9% | 26.0 ± 4.5% | 33.3 ± 8.0% |

Table 5: Multi-class classification accuracies of a fully-connected network with tanh activation on n-class ImageNet (mean ± standard deviation).

| Set | Classes | BP | DFA | FA | mDFA |
|---|---|---|---|---|---|
| Training | 3 | 100.0 ± 0.0% | 100.0 ± 0.0% | 35.8 ± 0.7% | 68.1 ± 8.2% |
| Training | 5 | 100.0 ± 0.0% | 100.0 ± 0.0% | 21.4 ± 0.4% | 60.4 ± 6.2% |
| Training | 10 | 100.0 ± 0.0% | 90.4 ± 10.0% | 11.3 ± 0.7% | 42.9 ± 3.9% |
| Validation | 3 | 68.5 ± 13.0% | 76.5 ± 9.9% | 53.3 ± 1.7% | 74.8 ± 8.8% |
| Validation | 5 | 61.6 ± 7.3% | 68.7 ± 6.3% | 29.0 ± 2.0% | 61.9 ± 5.3% |
| Validation | 10 | 52.8 ± 4.6% | 53.6 ± 5.3% | 10.6 ± 0.8% | 44.1 ± 4.4% |

Table 6: Multi-class classification accuracies of a convolutional neural network with tanh activation on n-class ImageNet (mean ± standard deviation).

B.3 DISCUSSION ON HYPERPARAMETERS

We observed that all FA algorithms yield a more stable convergence with vanilla stochastic gradient descent, i.e., no post-processing of the FA updates, than with Adam (Kingma & Ba, 2014) or RMSprop (Tieleman & Hinton, 2012). This may be unsurprising, as these acceleration methods were developed for gradient-based optimization, whereas FA updates only roughly align with the gradients at best. Furthermore, we observed that FA algorithms achieve a decent validation accuracy within the first few training epochs. However, in contrast to BP, these methods may require over a hundred training epochs to converge fully. This fast-start, slow-convergence behavior aligns with the observations made by Lillicrap et al. (2016).

We found that FA, DFA, and mDFA perform best when all forward weights are initialized to zero. Notable exceptions are networks with ReLU activation, which show poor performance with the all-zeros initialization scheme. This partially explains the poor observed performance of FA and DFA for ReLU networks. In contrast to Nøkland (2016), we observed that backward weights initialized from a normal distribution perform slightly better than with the scaled uniform scheme proposed by Nøkland (2016). We confirmed the observation made by Bartunov et al. (2018) that feedback alignment algorithms are, in general, relatively sensitive to the choice of hyperparameters.