Journal of Machine Learning Research 20 (2019) 1-42. Submitted 5/18; Revised 11/18; Published 1/19.

Monotone Learning with Rectified Wire Networks

Veit Elser ve10@cornell.edu
Department of Physics, Cornell University, Ithaca, NY 14853-2501, USA

Dan Schmidt Daniel.Schmidt@analog.com
Jonathan Yedidia Jonathan.Yedidia@analog.com
Analog Devices, Inc., Boston, MA, USA

Editor: Samy Bengio

Abstract

We introduce a new neural network model, together with a tractable and monotone online learning algorithm. Our model describes feed-forward networks for classification, with one output node for each class. The only nonlinear operation is rectification using a ReLU function with a bias. However, there is a rectifier on every edge rather than at the nodes of the network. There are also weights, but these are positive, static, and associated with the nodes. Our rectified wire networks are able to represent arbitrary Boolean functions. Only the bias parameters, on the edges of the network, are learned. Another departure in our approach from standard neural networks is that the loss function is replaced by a constraint: the value of the output node associated with the correct class should be zero. Our model has the property that the exact norm-minimizing parameter update, required to correctly classify a training item, is the solution to a quadratic program that can be computed with a few passes through the network. We demonstrate a training algorithm using this update, called sequential deactivation (SDA), on MNIST and some synthetic datasets. Upon adopting a natural choice for the nodal weights, SDA has no hyperparameters other than those describing the network structure. Our experiments explore behavior with respect to network size and depth in a family of sparse expander networks.
Keywords: Neural Networks, Online Training Algorithms, Rectified Linear Unit

Some of the simplest modes of machine learning are monotone in character, where the representation of knowledge changes unidirectionally. In the k-nearest-neighbor classification algorithm (Cover and Hart, 1967), for example, exemplars of the classes are amassed, monotonically, over the course of training. It is also possible to have monotone learning where memory does not grow, but decreases over time. Suppose we try to classify Boolean feature vectors using conjunctive normal form formulas, one for each class. Starting with large and randomly initialized formulas, during training we simply discard, from the formula of the feature vector's class, all the inconsistent clauses (Mooney, 1995). An attractive feature of both of these monotone algorithms is the computational ease of reaching the desired outcome, i.e. additional exemplars for improved discrimination, or discarded clauses to better accommodate a class.

©2019 Veit Elser, Dan Schmidt and Jonathan Yedidia. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/18-281.html.

We report on a new kind of monotone learning where during training only the values of a fixed number of parameters are changed, and in a monotone fashion. Like the two examples above, there is a simple and computationally tractable objective when learning each new item. But perhaps the most interesting feature of our scheme is that it operates on deep networks of rectifiers, a setting where training is not normally seen as computationally tractable. Monotone learning with deep rectifier networks is made possible by two changes to the standard paradigm: (i) conservative updates that minimize the norm of the parameter changes, and (ii) eliminating all but the bias variables as learned parameters.
Although the feedforward computations in standard neural network models are inspired by biological neuronal computations, the training protocols in widespread use seem far from natural. Humans are able to generalize the shape of the digit 5 without suffering through many training epochs through a data set, and are capable of building a rudimentary representation of digits or other classes of objects from relatively few examples. The conservative learning principle, introduced to the best of our knowledge by Widrow et al. (1988) as the minimal disturbance principle, comes closer to our experience of natural learning. In this mode of learning the parameters of the network are minimally changed to accommodate each example as it is received, with the rationale that the attention to minimality preserves the representation created by earlier examples. The minimal disturbance principle was reprised by Crammer et al. (2006) as the passive-aggressive algorithm, in the context of support-vector machines. There are relatively few models where such parameter updates, minimal yet achieving immediate results, are easily computed. To emphasize the norm-minimizing characteristic of this learning mode we use the term conservative learning. An approximate implementation of conservative learning was successfully used to discover the Strassen rules of matrix multiplication (Elser, 2016). In the present work we make changes to the standard neural network model to enable conservative learning. In broad outline, the changes guarantee monotonicity, making the computation of the conservative updates tractable. A key step was elevating the role of the additive parameter, or bias. The metaphor that replaces Hebbian synapse (weight) learning is that of a silting river delta, with myriad channels whose levels (biases) rise differentially, but monotonically, over time. Below is an overview of our contributions.
- Inspired by analog implementations of logic, we propose replacing the conventional rectified sum computed by a standard neuron using a ReLU function by a sum of rectifications:

      x = w Σ_i max(0, x_i − b_i).  (1)

  Only the bias parameters are learned; the weights are static.
- Even with just positive weights w in (1), we show that networks of rectified wires can represent arbitrary Boolean functions. This construction makes use of doubled inputs, where true is encoded as (1, 0) and false as (0, 1), and generalizes, for symbolic data, to one-hot encoding.
- The node values of our networks are non-negative. We propose defining class membership of data by the corresponding output node having value zero.
- Because the biases on all the wires are non-negative and can only increase during training, we choose to initialize them at zero.
- We show that the 2-norm minimizing change of the bias parameters, that sets the class output node for a given input to zero, is the solution to a quadratic program.
- An iterative algorithm similar to stochastic gradient descent, including both backward propagation and a new kind of forward propagation, is proposed to approximately find the 2-norm minimizing bias changes. Executing the minimum number of iterations required to make the class output node the smallest in value defines the sequential deactivation algorithm (SDA).
- We show that a particular limit of SDA, on networks with a single hidden layer, learns arbitrary Boolean functions.
- Networks with better scaling have multiple layers, and their training is sensitive to the settings of the static weights. We propose balanced weights based just on the in-degrees and out-degrees in this general setting.
- A two-parameter family of random sparse expander networks is introduced to explore the size and depth behavior of learning in our model.
- Conservative learning on rectified wire networks with the SDA algorithm is demonstrated for MNIST and synthetic datasets.
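The sum-of-rectifications in (1) is easy to contrast with a standard ReLU neuron in code. The following is a minimal sketch; the function names are ours, not the paper's, and the node weight is taken as a single scalar w.

```python
# Contrast of a standard ReLU neuron with the sum-of-rectifications of Eq. (1):
# standard: max(0, sum_i w_i*x_i - b);  rectified wires: w * sum_i max(0, x_i - b_i).

def relu_neuron(x, w, b):
    """Standard neuron: one rectifier applied after the weighted sum."""
    return max(0.0, sum(wi * xi for wi, xi in zip(w, x)) - b)

def rectified_wire_neuron(x, b, w=1.0):
    """Eq. (1): one rectifier (with its own learned bias) on every incoming
    wire, and a single static positive weight w on the node."""
    return w * sum(max(0.0, xi - bi) for xi, bi in zip(x, b))

x = [0.9, 0.2, 0.7]
print(rectified_wire_neuron(x, [0.0, 0.0, 0.0]))  # all biases zero: plain sum
print(rectified_wire_neuron(x, [0.5, 0.5, 0.5]))  # per-wire biases clip each input
```

Note that raising any single b_i can only lower the output, which is the monotonicity the rest of the paper relies on.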
Despite our model's capacity for depth in its representations, we find that the best experimental results with the SDA algorithm are obtained in a limit where deactivation occurs only in the final layer, where the model is equivalent to a network with just a single hidden layer of rectifier gates. The pursuit of conditions that fully utilize activation/deactivation in all layers is identified as the goal of future research.

1. Rectified wire networks

One motivation for our network model is the implementation of logic by analog computations without multiplications. The analog counterparts of true and false are, respectively, the numbers 1 and 0. Before presenting our rectified wire network model, we consider networks constructed from biased rectifier gates. A biased rectifier gate with K inputs,

    R[b](x_1, . . . , x_K) = max(0, x_1 + ··· + x_K − b),

generalizes and and or gates. and gates are realized with bias b = K − 1. We would get the or gate with b = 0 if we could saturate the output at the value 1. We show below how this detail can be fixed with additional rectifiers and a suitable network design. Negations are completely absent from our rectifier implementations of general logic circuits. This is made possible by the process of demorganization, where not gates are pushed through the circuit from output to input, exchanging and and or gates as dictated by De Morgan's laws. After all the not gates have been pushed through, the only remaining not gates act directly on the inputs. We accommodate this by allowing each input to the logic circuit to be replaced by an analog pair with values (1, 0) for true and (0, 1) for false. We refer to this encoding scheme as input doubling. It is relatively straightforward to show, as we do in theorem 1.1, that networks comprising only rectifier gates, with bias variables as the only parameters, can mimic any logic circuit.
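The biased rectifier gate R[b] can be checked exhaustively on 0/1 truth values. The sketch below (function name ours) confirms that b = K − 1 reproduces and, and exhibits the unsaturated "or" output that the paper's network construction fixes with additional rectifiers.

```python
# Biased rectifier gate R[b](x_1,...,x_K) = max(0, x_1 + ... + x_K - b) acting
# on analog truth values (1 = true, 0 = false). With b = K - 1 it reproduces
# AND exactly; with b = 0 it reproduces OR except that the output is not
# saturated at 1.

from itertools import product

def rectifier_gate(b, *xs):
    return max(0.0, sum(xs) - b)

# AND on two inputs: bias K - 1 = 1, checked over the whole truth table
for x1, x2 in product([0, 1], repeat=2):
    assert rectifier_gate(1, x1, x2) == (x1 and x2)

print(rectifier_gate(0, 1, 1))  # unsaturated "OR": outputs 2, not 1
```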
From the machine learning perspective, our network model has at least the capacity to represent the classes defined by the truth value of arbitrary Boolean functions. However, the parameter settings that realize these Boolean classes are very special points in the continuous parameter space of the model, and are not claimed to be directly relevant for the intended use of these networks.

Theorem 1.1 Any Boolean function on N inputs computed with M binary and/or gates and any number of not gates can be implemented by an analog network comprising at most 5M biased rectifier gates taking the corresponding 2N doubled analog inputs.

Proof As a circuit, the function takes N Boolean inputs and produces one Boolean output. We associate negations in the circuit with gate outputs as shown for the case of the and gate in the left panel of Figure 1. Upon output, both the gate output y_1 and its negation y_2 by a not gate (rendered as an open circle) are made available to gates receiving inputs. One of the output forks might not be used, as for the final gate of the circuit whose single output is the value of the Boolean function. The diagram also shows how each of the gate inputs is derived from one fork of the output of another gate, or an input to the circuit. The body of the proof consists of confirming that the Boolean gate relationship between the inputs (x_1, x_2) and the output y_1 and its negation y_2 is exactly reproduced by a corresponding rectifier gate network. The replacement for the and gate is shown in the right panel of Figure 1. All bias parameters are either 0 or 1, and the only values that arise on the nodes are also 0 and 1. The replacement rule for the or gate is shown in Figure 2. For both of the circuit replacement rules just described there are three others, where one or both of the inputs are negated.
These are trivially generated from the ones shown by inserting a negation right after the inputs (x_1 or x_2 or both), which is equivalent to swapping the branches of the input forks. For completeness we include the cases where a gate receives both inputs from the same output fork. The case of the and gate is shown in Figure 3. The replacement rule shown on the left applies when both inputs come from the same branch of the input fork, while the rule on the right applies when the inputs come from both branches. As before, the rule for the other choice of input branch is obtained by swapping branches in the replacement. For the or gate the rule on the left of Figure 3 is unchanged, while the one on the right has the two bias values swapped.

Figure 1: and gate and its replacement with rectifier gates in the analog network.

Figure 2: or gate and its replacement with rectifier gates in the analog network.

The only negations remaining after all and and or gates have been replaced are those in the N input forks to the circuit. These are taken care of by defining the input to the analog network as the 2N values on the branches of those forks. The resulting network will have only rectifier gates. From the gate replacement rules (Figs. 1-3) we see that the number of rectifier gates is at most 5M.

We now generalize our rectifier gate model in two ways. The first follows from the observation that a rectifier on K inputs with bias b is equivalent to summing those inputs and placing a (single-input) rectifier with bias b on each of the wires leaving the rectifier.

Figure 3: and gate replacement rules when both inputs come from the same output fork, either the same branch (left) or both branches (right). The rule for the or gate is the same for the case on the left, and has R[0] and R[1] exchanged for the case on the right.
We give these biases the freedom to take different values, increasing the expressivity of the network and boosting the parameter count commensurate with standard model networks, which have a weight on each edge. By setting the biases on selected edges in a completely connected network to large values, thereby disabling them, our model has the capacity to select its architecture. The second way we generalize the model is to introduce weights at all the summations. These are static (not intended to be learned) and positive. We show in section 6 that in layered networks one can apply a rescaling to the biases that has the effect of restoring all the weights to 1 without changing the classification behavior of the network. The weights therefore only play a role during training, through the gain/decay they impart to the analog signal as it propagates from input to output, and the response this elicits from the bias parameters. We now formally define our model.

Definition 1.2 A rectified wire network is a model of computation on a directed acyclic graph. Associated with each wire or edge i→j joining node i to node j is the node value x_i and edge output y_{i→j} related by

    y_{i→j} = max(0, x_i − b_{i→j}),

where b_{i→j} is the bias parameter for the edge. The edge outputs y_{i→j}, from all edges i→j incident on node j, are summed to give the value x_j of node j:

    x_j = w_j Σ_{i→j} y_{i→j}.

The positive constant w_j is the weight of the node. The node values x, edge outputs y, and bias parameters b of a rectified wire network are general real numbers. The number of edge outputs y summed by a node j is its in-degree |→j|, and the number of edges receiving x_j as input is its out-degree |j→|. Nodes with in-degree zero are input nodes and have their x values assigned. The outputs of the network are the x values of the nodes with out-degree zero, the output nodes. Nodes which are neither input nor output nodes are the hidden nodes of the network.
The connectivity of rectified wire networks is such that there always exists a path between any input node and any output node. The content of theorem 1.1, re-expressed in terms of rectified wires, takes the following form:

Corollary 1.3 The truth value of a Boolean function on N variables that can be computed with M binary and/or gates and any number of not gates can be represented by a two-output rectified wire network taking 2N doubled inputs and having at most 7M hidden nodes. Like the doubled inputs, the two output nodes encode the truth of the function with values (1, 0) or (0, 1).

Proof A logic circuit with M binary and/or gates has M nodes, one at each gate output. When re-expressed as a rectified wire network, each gate output is replaced by two nodes and, in the worst case (Figs. 1-2), the gate itself results in five additional nodes. The number of hidden nodes in the resulting rectified wire network is therefore bounded by 7M. The wiring of the nodes, including the input and output nodes, can be the complete acyclic graph, because large bias settings, by deselection, realize any network on the given number of nodes.

While possible in principle, we should not expect it to be easy for a rectified wire network to learn bias values on the modest networks promised by corollary 1.3 when presented with data generated by a Boolean function. On the other hand, as we show in section 3, this particular network model has a significant advantage over the standard model in having tractable conservative learning. A different, equally valid representation might be learned instead, most likely on a much larger network. We render rectified wire networks as simple directed graphs, with input nodes at the bottom, output nodes at the top, and all edges directed in the upward direction. The rendering of the bias values (the only learned parameters) is through the edge thickness. Edges with low bias are thick, those with high bias are thin.
When the bias on an edge is so great that the rectified value is zero for any input to the network, the edge is effectively absent (zero thickness). Figure 4 shows this style of rendering of a network with four input, four hidden, and two output nodes.

We can generalize both the inputs and the outputs of rectified wire networks to expand their scope beyond Boolean function-value classifiers. In many applications the feature vectors are strings of symbols from a finite alphabet: {true, false}, {a,c,g,t}, etc. As in the Boolean case, for an alphabet of size K we use the one-hot encoding scheme where the kth symbol is expanded into K numerical values, all zero except for a 1 in the kth position. Numerical (non-symbolic) data may be encoded in the same style. Suppose the data vectors have length L: v = (v_1, . . . , v_L). We first obtain the cumulative probability function θ_i for each component i = 1, . . . , L over the dataset. The data vector for input to the network, corresponding to v, is then

    d(v) = [θ_1(v_1), 1 − θ_1(v_1), . . . , θ_L(v_L), 1 − θ_L(v_L)].  (2)

With this convention, the input to the network is always a vector of numbers between 0 and 1 whose sum equals the number of components, or string length, of the data. The mapping by the cumulative probability functions, in the case of analog data, ensures that the distribution in each channel is uniform. The number of output nodes equals the number of data classes. A Boolean function classifier would have two output nodes to classify the truth value of the input. To classify MNIST data (LeCun et al., 1998), the network would have 10 output nodes.

Figure 4: Rectified wire network with four input nodes (bottom row), four hidden nodes, and two output nodes (top row). Thinnest edges have the highest bias.
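The encoding of Eq. (2) can be sketched with an empirical cumulative distribution. The helper names below are ours, and the empirical CDF is one reasonable reading of "cumulative probability function over the dataset".

```python
# Hedged sketch of the doubled-input encoding of Eq. (2): each component v_i
# maps to the pair (theta_i(v_i), 1 - theta_i(v_i)), with theta_i the empirical
# cumulative probability of component i over the dataset.

def empirical_cdf(column):
    """Return theta(v): fraction of dataset values in this column <= v."""
    vals = list(column)
    n = len(vals)
    def theta(v):
        return sum(1 for u in vals if u <= v) / n
    return theta

def encode(dataset, v):
    """Doubled input d(v): len(d) == 2*len(v), and sum(d) == len(v)."""
    d = []
    for i, vi in enumerate(v):
        theta = empirical_cdf(row[i] for row in dataset)
        t = theta(vi)
        d.extend([t, 1.0 - t])
    return d

dataset = [[0.1, 5.0], [0.4, 2.0], [0.9, 8.0], [0.6, 1.0]]
d = encode(dataset, dataset[1])  # encode the second data vector
print(d, sum(d))                 # components in [0,1], sum equals L = 2
```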
Before we describe how these outputs are interpreted we make a trivial observation:

Lemma 1.4 In a rectified wire network with input vector d ≥ 0, there exist settings of the bias parameters where, for any of the output nodes c ∈ C, we have x_c = 0 and x_j > 0 for j ∈ C \ {c}.

Proof All the output nodes will have positive values for zero bias, because the weights are positive and each output is connected by a path to one of the inputs with positive value (see definition 1.2). To exclusively arrange for x_c = 0, we set b_{i→c} = x_i on all the edges i→c incident on c (keeping all other biases at zero).

We will explain in section 3 why zero is the appropriate indicator function value for defining classes on a conservatively trained rectified wire network.

2. Notation

Now that most of the elements of our network model have been introduced, we summarize our notational conventions. Lower case symbols represent variables or parameters on the network. These bear a latin index when defined on a node of the network. Examples are the node values, x_i, or the weights w_i. The index is written as a directed edge between nodes when defined on an edge of the network. Examples are the bias parameters, b_{i→j}, and edge outputs y_{i→j}. The same lower case symbols, both for nodes and edges, when written without subscripts denote the vector of all the indexed values. For example, b is the vector of all biases in the network and d the vector of data values on the input nodes. When such symbols appear in an equality/inequality, the relation applies componentwise. Round brackets attached to a variable identify its functional dependence, when relevant. The value of node k, for example, might be written x_k(b, d) to emphasize its dependence on the biases and input data. Upper case symbols are used for sets. The symbols D, H, C are reserved for the data (input), hidden, and class (output) nodes respectively, while E is the set of directed edges.
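The bias setting used in the proof of lemma 1.4 can be illustrated numerically. The toy network below (our own construction: two inputs wired directly to an output node, unit weight) shows that setting b_{i→c} = x_i on every edge into c drives x_c exactly to zero.

```python
# Numerical illustration of Lemma 1.4: setting the bias on every edge into
# output c equal to its source value x_i drives x_c to zero, while an output
# whose biases stay at zero remains positive.

def output_value(d, biases, w=1.0):
    """Value of one output node: w times the sum of rectified edge outputs."""
    return w * sum(max(0.0, xi - b) for xi, b in zip(d, biases))

d = [0.7, 0.3]                      # non-negative input vector
zero_bias = [0.0, 0.0]
print(output_value(d, zero_bias))   # output positive at zero bias

b_into_c = list(d)                  # b_{i->c} = x_i on every edge into c
print(output_value(d, b_into_c))    # class output driven exactly to zero
print(output_value(d, zero_bias))   # the other output is untouched
```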
Upper case symbols are also used for continuous sets, with the same convention for functional dependence as variables. For example, B(d) might be the set of bias values b such that for b ∈ B(d) the node value x_i(b, d) is zero for input d. Cardinality is denoted with vertical bars, e.g. |E| for the number of edges in the network. Node i in a network has in-degree |→i| and out-degree |i→|.

3. Counterfactual classification and conservative learning

Monotone learning on rectified wire networks is made possible by the fact that, for a fixed input d ≥ 0 to the network, the value x_k(b, d) at any node k is a non-negative and non-increasing function of the bias parameters b. This, combined with the fact (lemma 1.4) that zero is always a feasible value on any of the output nodes, motivates the following definition of a classifier.

Definition 3.1 The output nodes C of a rectified wire network serve as a classifier of |C| classes of inputs when there are biases b such that for inputs d in class c ∈ C, x_c(b, d) = 0 and x_i(b, d) > 0 for i ∈ C \ {c}.

To see why this definition makes sense in the context of conservative learning, suppose that at some stage of training all the output nodes are positive for some input d in class c. By increasing the biases b, the output value x_c(b, d) can always be driven to zero for input d, and by increasing them as little as possible, i.e. conservatively, there is a good chance the other output nodes can be kept positive. If some of the other outputs are also set to zero in the process of increasing biases, the classification is ambiguous. In any case, and after any amount of subsequent training where biases are increased, monotonicity of x_c(b, d) with respect to b ensures x_c(b, d) = 0 continues to be a valid indicator for class c. The outputs of the proposed classifier are counterfactual in the sense that a large positive value x_c on the node for class c represents strong evidence that c is not the correct class.
In the case of a class defined by the truth value of a Boolean function, we would use the function that returns false for members of the class, so the value of the corresponding analog circuit output node is zero. We shall see that we do not need to separately insist that, during training, biases are never decreased, as it will follow automatically from the conservative learning principle, which assigns the following cost when biases are changed:

Definition 3.2 The cost, in a rectified wire network, for changing the biases from b to b′, is the p-norm ‖b′ − b‖_p.

The uniqueness of the cost-minimizing bias change, for fixing a misclassification, depends on the norm exponent p. We mostly use p = 2 in this study, where uniqueness can be proved. Our definition of the conservative learning rule, which sidesteps uniqueness, is the following:

Definition 3.3 Suppose output node c ∈ C for class c has value x_c(b, d) > 0 for an input d, in a network with biases set at b. A conservative bias update is any b′ that minimizes ‖b′ − b‖_p and produces output value x_c(b′, d) = 0.

Lemma 3.4 If b′ is any conservative update of b, then b′ ≥ b.

Proof Biases b′ have just learned an input d in some class c ∈ C, so x_c(b′, d) = 0. Now suppose b′_{i→j} < b_{i→j} for some edge i→j ∈ E. The update b″, where b″_{i→j} = b_{i→j} and all other components agree with b′, has lower cost than b′ and yet, by monotonicity of node values with respect to b, all output nodes with value zero will still be zero, including x_c(b″, d). This contradicts the minimality of b′.

The outputs of a rectified wire network are compositions of rectifier functions. Compositions of convex functions are in general not convex, but for the rectifier function convexity is preserved. We give an inductive proof based on the following.

Lemma 3.5 If x(b) : R^K → R is convex, then y(a, b) : R × R^K → R, defined by y(a, b) = max(0, x(b) − a), is also convex.

Proof Because x is convex, for arbitrary b_1, b_2 ∈ R^K we have

    x(λb_1 + (1 − λ)b_2) ≤ λ x(b_1) + (1 − λ) x(b_2)  (3)

for 0 ≤ λ ≤ 1.
For arbitrary a_1, a_2 ∈ R,

    y(λa_1 + (1 − λ)a_2, λb_1 + (1 − λ)b_2)
        = max(0, x(λb_1 + (1 − λ)b_2) − λa_1 − (1 − λ)a_2)
        ≤ max(0, λ x(b_1) + (1 − λ) x(b_2) − λa_1 − (1 − λ)a_2)
        = max(0, λ z_1 + (1 − λ) z_2),

where we have used (3) and defined z_1 = x(b_1) − a_1 and z_2 = x(b_2) − a_2. Since for arbitrary s, t ∈ R

    max(0, s + t) ≤ max(0, s) + max(0, t),

we have

    max(0, λ z_1 + (1 − λ) z_2) ≤ max(0, λ z_1) + max(0, (1 − λ) z_2)
        = λ max(0, z_1) + (1 − λ) max(0, z_2)
        = λ y(a_1, b_1) + (1 − λ) y(a_2, b_2),

as claimed.

Theorem 3.6 The node values x(b, d) of a rectified wire network, given an input d, are convex functions of the bias parameters b.

Proof The values of the input nodes D, set to the constant values x_i = d_i, i ∈ D, are trivially convex. Since our network is an acyclic graph, the nodes can be indexed by consecutive integers so that all edge outputs y_{i→j} incident on a node j ∈ H ∪ C are from input nodes i ∈ D or from hidden nodes i ∈ H that have i < j. For any j ∈ H ∪ C, we use induction and suppose all x_i, i ∈ H with i < j, are convex functions of the bias parameters. Consider any of the edge outputs y_{i→j}(b) incident on j:

    y_{i→j}(b) = max(0, x_i(b) − b_{i→j}).

Seen as a function of the bias parameters, where x_i(b) is a convex function of biases not including b_{i→j}, y_{i→j}(b) fits the hypothesis of lemma 3.5 and is therefore convex. This conclusion also applies to the base case, the hidden node numbered j = 1, where all the x_i are inputs i ∈ D and constant. Since the node value

    x_j(b) = w_j Σ_{i→j} y_{i→j}(b)

is a positive multiple of a sum of convex functions, it too is convex, and this completes the induction.

For the case p = 2, we can prove uniqueness of the conservative bias update.

Corollary 3.7 The set B(d), of biases b′ of a rectified wire network with fixed input d for which x_c(b′, d) = 0 for a given output node c, is non-empty and convex. Consequently, the conservative bias update b′ ∈ B(d) that minimizes the cost ‖b′ − b‖_2 with respect to the biases b is unique.
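Theorem 3.6 can be spot-checked numerically. The sketch below (our own toy network and names, not the paper's) verifies the convexity inequality along many random segments in bias space.

```python
# Numerical spot-check of Theorem 3.6: the output value is convex in the bias
# vector, so along any segment b(lam) = lam*b1 + (1-lam)*b2 the value lies
# below the chord (up to floating-point slack).

import random

def out(d, b):
    h = max(0.0, d[0] - b[0]) + max(0.0, d[1] - b[1])
    return max(0.0, h - b[2])

random.seed(0)
d = [1.0, 0.5]
for _ in range(1000):
    b1 = [random.uniform(0, 2) for _ in range(3)]
    b2 = [random.uniform(0, 2) for _ in range(3)]
    lam = random.random()
    mix = [lam * u + (1 - lam) * v for u, v in zip(b1, b2)]
    assert out(d, mix) <= lam * out(d, b1) + (1 - lam) * out(d, b2) + 1e-12
print("convexity inequality held on 1000 random segments")
```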
Proof The set B(d) may be defined as the feasible set of x_c(b′, d) ≤ 0 (since this function is non-negative). By lemma 1.4, B(d) is non-empty, and by theorem 3.6, B(d) is closed and convex because x_c(b′, d) is a convex function of b′. Now suppose b′_1 ∈ B(d) and b′_2 ∈ B(d) are distinct minimizers of ‖b′ − b‖_2 for some bias b. Because B(d) is convex we have a contradiction: b′_3 = (b′_1 + b′_2)/2 ∈ B(d) and ‖b′_3 − b‖_2 < ‖b′_1 − b‖_2 = ‖b′_2 − b‖_2 implies b′_1 and b′_2 were not minimizers.

To design algorithms for computing the conservative bias update, and to establish the complexity of this task, we cast the problem as a mathematical program. Consider a rectified wire network with edges E on which we are given bias parameters b, say from previous training. Suppose we now wish to learn the pair (c, d), a vector of input data d in the class associated with output node c. Unless x_c(b, d) = 0, the biases must be conservatively updated to b′ so this is true. Before we do this we partition E into the set of active edges A and its complement, Ā.

Definition 3.8 The set A of active edges of a rectified wire network, given input data d and bias parameters b, are those edges i→j where x_i(b, d) > b_{i→j}, or equivalently, where y_{i→j}(b, d) > 0.

The biases b_{i→j} on the inactive edges i→j ∈ Ā, when increased, have no effect on any node values, in particular the output nodes, because the corresponding edge outputs y_{i→j} are zero by monotonicity of x(b) (non-increasing). To find b′ we may therefore work with the network induced by the active edges A, called the active network. Our mathematical program makes use of a set of reduced bias variables.

Definition 3.9 Let A be the active edges of a rectified wire network for input data d and bias parameters b. For each edge i→j of the network, define the reduced bias by

    b̄_{i→j} = b_{i→j} for i→j ∈ A, and b̄_{i→j} = x_i(b, d) for i→j ∈ Ā.

Now consider a conservative update b′ of b, and the reduced biases b̄′ for b′.
We always have b̄′ ≤ b′, and the reduced values have the property that x(b̄′, d) = x(b′, d). In particular, for the class output node c we have x_c(b̄′, d) = x_c(b′, d) = 0. The only reason that a conservative update b′ of b would not already be reduced (b̄′ = b′) is that the freedom b̄′ ≤ b′ may allow a reduction in cost, that is, give a more conservative update. We now define the mathematical program. For given data d, biases b and weights w on a rectified wire network, we are given the corresponding active edges A and the node c of the correct class. The unknowns are the updated biases b′, reduced biases b̄′, the edge outputs y on A, and the values of x on c and the input and hidden nodes:

    minimize:  ‖b′ − b‖_p^p                               (4a)
    such that: b′ ≥ b̄′,                                   (4b)
               b̄′ ≥ b̄,                                    (4c)
               x_i = d_i,  i ∈ D                          (4d)
               y_{i→j} = x_i − b̄′_{i→j},  i→j ∈ A          (4e)
               0 ≤ y_{i→j},  i→j ∈ A                       (4f)
               x_j = w_j Σ_{i→j∈A} y_{i→j},  j ∈ H ∪ {c}   (4g)
               x_c = 0.                                   (4h)

Theorem 3.10 The mathematical program (4), defined for the network A that is active for given biases b and data d, as part of its solution b′ gives a conservative bias update for the class associated with output node c.

Proof First observe that any valid (x_c = 0) assignment of node values x and edge outputs y, in the corresponding rectified wire network, is realized by a feasible point of the linear system (4c)-(4h) involving just the reduced biases (not b′). For the variables b′ to be rectified wire biases consistent with the reduced biases b̄′, it is sufficient for them to satisfy (4b), as monotonicity ensures x_c(b′) = 0. Optimizing (4a) subject to (4b) and the constraints that define the feasible set of reduced biases gives the most conservative update.

For p = 2, (4) is a positive semi-definite quadratic program and can be solved in time that grows polynomially (Kozlov et al., 1980) in the size of the active network, |A|.
The same conclusion applies for p = 1, a linear program, because the objective can be replaced by the sum of all the updated biases b′, provided we impose the result of lemma 3.4, b′ ≥ b, as a constraint. While these results allow us to add rectified wire networks to the list of models for which there is a tractable conservative learning rule, in most applications it is also important that training scales nearly linearly with the size of the network. The sequential deactivation algorithm (SDA) described in the next section is designed to meet that goal.

We close this section by reviewing the features of our model that make training tractable. The insistence on exactly learning individual items should not be counted as a tractability-enabling feature. Indeed, it is easy to see how the mathematical program (4), to compute the conservative update for one data item, would be generalized to find the norm-minimizing update that classifies an entire mini-batch. All the variables and data, with the exception of the biases b and b′, now carry a data index m ∈ M, where M is a mini-batch. For example, (4b) would be replaced by

    b′_{i→j} ≥ b̄′_{i→j,m},  i→j ∈ A, m ∈ M,

which imposes consistency of the bias updates for all instantiations of the reduced biases over the mini-batch. While the size of the mathematical program has grown to O(|A||M|) equations and inequalities, it is still tractable. Convexity of the variables x and y with respect to the bias parameters is clearly important for tractable learning, and the proofs of lemma 3.5 and theorem 3.6 show that this relies on the non-negativity of the static weights and the form of the activation function. Less obvious in facilitating tractability is the proposal to replace the loss function of neural networks by a constraint, and in particular, the simple constraint (4h).
For example, one might consider replacing this single constraint with the following set of inequalities,

    x_i > x_c + Δ,  i ∈ C \ {c},                               (5)

where the fixed positive parameter Δ specifies a margin for avoiding ambiguous classification. After all, we still have a tractable mathematical program after this substitution. Unfortunately, this proposal creates a conflict in the relationship between the reduced biases b̄ and the actual bias updates b′. While the b̄ variables provide an exhaustive parameterization of the node variables x, even under the constraint (5), we cannot count on the inequality b′ ≥ b̄ to preserve this constraint when b̄ biases are replaced by b′ biases in the rectified wire model (the monotonic decrease of a non-class output may exceed that of the class output).

4. Sequential deactivation algorithm

This section describes in detail the SDA algorithm for computing bias updates in the rectified wire model. While these updates are not guaranteed to be the most conservative possible, their computation is significantly faster than solving a general quadratic program. The SDA algorithm resembles the gradient methods used for optimizing standard model networks, the most widely used being stochastic gradient descent (SGD) (Rumelhart et al., 1986). The program (4) would be greatly simplified if inequality (4f) were somehow automatically satisfied. All active edges would remain active in the update and the updated biases b′ on them would be equal to the reduced biases b̄. The node variables x, in particular x_c, would be linear functions of the b′:

    x_c(b′) = x_c(b) + ∇x_c · (b′ − b).

For the p = 2 norm on bias updates, the optimal update that achieves x_c = 0 is then

    b′ = b − (x_c(b)/‖∇x_c‖²) ∇x_c.

The conservative update, in this simplification, is the same as gradient descent. In the general case, where edges may become inactive, from the monotonicity of x_c we know at least that the gradient ∇x_c never has positive components.
If we had the stronger property, that ∇x_c is also never zero while x_c > 0, then gradient descent would always find an update b′ that satisfies x_c(b′) = 0. But the latter property follows immediately from the fact that x_c > 0 is only possible when there are active edges incident to output node c, so that x_c has non-zero derivatives at least with respect to the biases on them. While gradient descent cannot guarantee the norm-minimizing update promised by the solution of program (4), it has some attractive features. First, the computation of the (piecewise constant) gradient is fast, as is the computation of the step size to the next gradient discontinuity (deactivation event). Second, there is an opportunity to make the gradient descent update b′ more conservative on the inactive edges. Using the condition y_{i→j}(b′) = 0 as the indicator of an inactive edge, the improved update is, for all i→j ∈ A,

    b′_{i→j} ← max(b_{i→j}, x_i(b′))   if y_{i→j}(b′) = 0,
    b′_{i→j} unchanged                 if y_{i→j}(b′) > 0.    (6)

The SDA algorithm is the efficient implementation of gradient descent on the rectified wire model followed by (6). Its name draws attention to the fact that the steps in the descent are defined by deactivation events. We now turn to the implementation of the SDA algorithm. The elementary operations are defined in Algorithm 1. Procedure eval is simply a forward pass through the network, from data supplied at the input nodes, to the output nodes whose values directly encode the class.

Algorithm 1 Elementary network procedures

    procedure eval(E, b, d) → (A, x, y)
        x_i ← d_i,  i ∈ D
        y_{i→j} ← max(0, x_i − b_{i→j}),  i→j ∈ E
        x_j ← w_j Σ_{i→j ∈ E} y_{i→j},  j ∈ H ∪ C
        A ← {i→j ∈ E : y_{i→j} ≠ 0}
    end procedure

    procedure grad(A, c) → −∇x_c
        (−∇x_c)_j ← w_c δ_{jc},  j ∈ C
        (−∇x_c)_i ← w_i Σ_{i→j ∈ A} (−∇x_c)_j,  i ∈ H
    end procedure

    procedure velocity(A, −∇x_c) → −ẏ
        (−ẏ)_{i→j} ← (−∇x_c)_j + w_i Σ_{k→i ∈ A} (−ẏ)_{k→i},  i→j ∈ A
    end procedure
In addition to the output node values {x_i : i ∈ C}, eval also provides the edge outputs y and the set of active edges A (on which the edge outputs are non-zero). Suppose the current bias parameters are b(0). Denote by (−∇x_c)_{i→j} the negative gradient component for the bias on edge i→j ∈ A. Because the derivatives with respect to all the b_{i→j} on active edges into the same node j are equal, the node-indexed variables (−∇x_c)_j := (−∇x_c)_{i→j}, i→j ∈ A, are well defined. In particular, if j is an output node, then (−∇x_c)_j = w_c δ_{jc}. As long as the gradient is constant, the biases evolve as

    b_{i→j}(t) = b_{i→j}(0) + t (−∇x_c)_j,    (7)

where t is a continuous "time". The gradient component (−∇x_c)_i is positive only if node i is connected by active edges to node c, with value given by the sum over all paths on active edges, each contributing the product of weights along the path. The effect on x_c of a bias change on an active edge into i is the same as if the same bias change was instead applied to all the active edges leaving node i, but multiplied by w_i. This implies the recursion

    (−∇x_c)_i = w_i Σ_{i→j ∈ A} (−∇x_c)_j,  i ∈ H,    (8)

in the procedure grad of Algorithm 1, and corresponds to back propagation in the network. The third procedure, velocity, is derived from the recursion for the edge-outputs y:

    y_{i→j} = w_i Σ_{k→i ∈ A} y_{k→i} − b_{i→j}.

Taking the time derivative and using (7),

    (−ẏ)_{i→j} = w_i Σ_{k→i ∈ A} (−ẏ)_{k→i} + (−∇x_c)_j,    (9)

we obtain the velocities of the edge-outputs by forward propagation. The initialization occurs at edges i→j from all the input nodes i, for which the sum in (9) is absent and (−∇x_c)_j is set by grad. By construction, both y_{i→j}(0) and the (constant) velocities (−ẏ)_{i→j} on the active edges i→j ∈ A are positive, and the first deactivation event occurs at time

    t = min_{i→j ∈ A} y_{i→j}(0) / (−ẏ)_{i→j}.    (10)

This is the time step in one iteration of the SDA algorithm. After the biases are incremented by (7) and the newly deactivated edges are removed from A, another round is begun.
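As an illustration, the three elementary procedures can be sketched in a few lines of Python. This is a minimal sketch under our own conventions (edges as (i, j) tuples, dicts for b, w, x, y, and a topological node order supplied by the caller), not the authors' implementation; only the update rules of Algorithm 1 and equations (7)-(10) come from the text.

```python
# Sketch of the elementary procedures of Algorithm 1 (eval, grad, velocity)
# on a small rectified wire network. The graph encoding and names are our
# own illustrative choices; the update rules follow the text:
# y_ij = max(0, x_i - b_ij), x_j = w_j * (sum of incoming y).

def eval_net(edges, b, w, d, order):
    """Forward pass: node values x, edge outputs y, active edge set A."""
    x = dict(d)                          # input nodes take their data values
    y, A = {}, set()
    for j in order:                      # hidden/output nodes, topological order
        total = 0.0
        for (i, jj) in edges:
            if jj != j:
                continue
            y[(i, j)] = max(0.0, x[i] - b[(i, j)])
            if y[(i, j)] > 0.0:
                A.add((i, j))
            total += y[(i, j)]
        x[j] = w[j] * total
    return x, y, A

def grad(A, w, c, hidden_rev_order):
    """Back propagation (8): node-indexed components of -grad x_c."""
    g = {c: w[c]}                        # (-grad x_c)_j = w_c * delta_jc on outputs
    for i in hidden_rev_order:           # hidden nodes, reverse topological order
        g[i] = w[i] * sum(g.get(j, 0.0) for (ii, j) in A if ii == i)
    return g

def velocity(A, g, w, order):
    """Forward propagation (9): rates -ydot at which edge outputs decrease."""
    v = {}
    for j in order:
        for (i, jj) in A:
            if jj != j:
                continue
            inflow = sum(v[(k, ii)] for (k, ii) in A if ii == i)
            v[(i, j)] = g.get(j, 0.0) + w[i] * inflow
    return v
```

On a toy network with two inputs wired directly to one output node, a single deactivation step (10) is then t = min over active edges of y/(-ydot).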
Iterations are terminated when eval returns x_c = 0. The final biases are obtained by applying (6) to the gradient descent biases. We now revisit the classification rule of definition 3.1, where data is declared learned when x_c = 0. Recall that this has the desired property of not having the supremacy (smallest value among all classes) of node c spoiled by subsequent training. The only way that the latter can have a negative effect, through the general decrease in output values with increasing biases, is when output nodes other than c are in a tie at value zero. To mitigate this effect, and also make the bias changes even more conservative, we introduce the ultra-conservative learning rule:

Definition 4.1 In the ultra-conservative mode of learning with the SDA algorithm, iterations are terminated when x_c, the value of the output node c of the correct class, is either zero or smaller than the values of all the other output nodes.

This termination criterion is conservative from the point of view of testing. Consider the early stages of training, when we might want to test the algorithm on which class is being favored. The natural candidate for the latter is the output node with the smallest value. By terminating SDA as soon as this criterion is met, the test (when performed right after training) will succeed with a smaller change to the biases than is required by the x_c = 0 criterion. Because the property of data d producing the smallest output value on node c can be spoiled by training with data d′ ≠ d, it may be necessary for the network to retrain on d, or data similar to d, in order to properly learn the combination (d, c). The implied heuristic is that the net bias change for this mode of learning class c may be more conservative than always insisting on output x_c = 0 for every d in this class. The SDA algorithm with the ultra-conservative termination criterion is summarized in Algorithm 2.
As soon as x_c becomes the smallest output or is zero, the algorithm returns the most conservatively updated biases b′ consistent with the activity of the network edges, (6). Estimating the work required to correct a misclassification is complicated by two factors. While the work in one iteration scales linearly with the size of the active network, we do not know how many iterations are needed on average to satisfy the ultra-conservative termination rule. Empirically (section 9) we find the number of iterations is often quite small and depends only weakly on network size. A second effect, serving to lessen the work, is the sparsification of the active network with time. In any event, the total computation in learning each data item would be at most O(|A|²), since in each iteration at least one of the active edges is deactivated (and eliminated from subsequent rounds). A possible direction for future research is the design of an algorithm that solves the program (4), for the most conservative update, with only a modest amount of extra work beyond that used by SDA.

Algorithm 2 Sequential Deactivation (SDA)

    input (E, b0), (d, c)                        ▷ (network edges, initial biases), (data vector, class)
    b ← b0                                       ▷ initialize biases
    (A, x, y) ← eval(E, b, d)                    ▷ active edges, node and edge output values
    iter ← 0                                     ▷ zero the iteration counter
    while x_c > 0 and argmin_{i∈C} x_i ≠ {c} do  ▷ iterate until node c is smallest
        −∇x_c ← grad(A, c)                       ▷ gradient with respect to active biases
        −ẏ ← velocity(A, −∇x_c)                  ▷ edge output velocity
        t ← min_{i→j∈A} y_{i→j}/(−ẏ)_{i→j}       ▷ step size
        for i→j ∈ A do
            b_{i→j} ← b_{i→j} + t (−∇x_c)_j      ▷ increase biases
        end for
        (A, x, y) ← eval(A, b, d)                ▷ new active edges, node and edge output values
        iter ← iter + 1                          ▷ increment iteration counter
    end while
    for i→j ∈ E do
        if i→j ∈ A then
            b′_{i→j} ← b_{i→j}                   ▷ keep biases on active edges
        else
            b′_{i→j} ← max(b0_{i→j}, x_i)        ▷ conservative update for inactive edges
        end if
    end for
    output b′, iter                              ▷ updated biases, iteration count
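The main loop of Algorithm 2 can be sketched in Python on a deliberately tiny network. This is a hedged, self-contained toy, not the authors' code: the network is restricted to direct input-to-output edges so that eval, grad and velocity collapse to one-liners, and all names and the defensive v > 0 filter are our own choices.

```python
# Toy implementation of the SDA loop (Algorithm 2) with ultra-conservative
# termination, on a network whose only edges run from inputs to outputs.
# Edges are (i, j) tuples; b0, d, w are dicts. Illustrative sketch only.

def sda(edges, w, b0, d, c, outputs, max_iter=100):
    """Return (updated biases, iteration count) for data d with class c."""
    b = dict(b0)
    it = 0
    while it < max_iter:
        # forward pass (eval), specialized to input->output edges
        y = {(i, j): max(0.0, d[i] - b[(i, j)]) for (i, j) in edges}
        A = {e for e in edges if y[e] > 0.0}
        x = {j: w[j] * sum(y[(i, jj)] for (i, jj) in A if jj == j)
             for j in outputs}
        # ultra-conservative rule: stop when x_c is zero or strictly smallest
        if x[c] == 0.0 or all(x[c] < x[j] for j in outputs if j != c):
            break
        g = {j: (w[c] if j == c else 0.0) for j in outputs}  # -grad on head nodes
        v = {e: g[e[1]] for e in A}                          # edge-output decrease rates
        t = min(y[e] / v[e] for e in A if v[e] > 0.0)        # next deactivation, (10)
        for (i, j) in A:
            b[(i, j)] += t * g[j]                            # raise biases, (7)
        it += 1
    # final conservative assignment (6) for inactive edges
    b_out = {}
    for (i, j) in edges:
        if max(0.0, d[i] - b[(i, j)]) > 0.0:
            b_out[(i, j)] = b[(i, j)]
        else:
            b_out[(i, j)] = max(b0[(i, j)], d[i])
    return b_out, it
```

On a two-input, two-output example with all biases zero, the two outputs start in a tie, and a single iteration suffices to make the class output strictly smallest.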
In this connection we present the simplest instance we know of where the update found by SDA is non-optimal. Figure 5 shows a fragment of a network where the bias updates on four edges can be analyzed in isolation. We consider the original optimization problem (4), where the bias updates must give zero on node c. There may be additional edges incident on the nodes shown, but these are inactive for the data item being learned. Input node 1 has been zero for all previous data, which explains why biases b_{1→2}, b_{2→3} and b_{2→c} are still zero. The inactive edge incident on node 3 was active on previous data and accounts for the positive bias b_{3→c} = b.

Figure 5: A network fragment on which the SDA update is not optimal. The initial bias on the thick edges is zero, while b_{3→c} = b. Another parameter is the weight w of node 2; all other nodes have unit weight. The dashed edge is inactive initially and irrelevant, but was responsible for b > 0 from prior training.

The nodes all have unit weight except node 2, whose weight is w. Data vector component d_1 = 1 (on node 1) is the only one relevant for the fragment shown. It is easy to see that for the case w < b the SDA solution is optimal, that is, it coincides with the solution of (4). However, with a bit of work one can show that the SDA updates are suboptimal when w > b, even when taking advantage of (6). This example only shows there is room for improvement. We do not know how pervasive this type of instance is in practice, let alone how seriously it impacts learning. We close this section with the observation that SDA is not that different from SGD applied to the loss function x_c. The main differences are that the step size (learning rate) is not a free parameter but is set by deactivation events, and that multiple steps may be executed when learning each data item.
The benefit imparted by the rectified wire model, in this context, is just that the gradients and step sizes are very easy to compute. If we set aside monotonicity and solution guarantees, these attributes carry over to loss functions that are much stronger than x_c. To get a sense of how small a change moves the model into difficult territory, consider the problem of learning the truth value of a Boolean function f(z) on N variables generated by M binary and/or gates, as discussed in section 1. The rectified wire network would receive 2N doubled inputs and have output nodes f and f̄, trained so that (x_f, x_f̄) = (f̄(z), f(z)). From corollary 1.3 we know such a rectified wire network exists, with unit weights and having at most 7M hidden nodes. The outstanding question, of course, is whether a gradient-based algorithm can find the appropriate bias settings. Instead of the monotone loss function x_{f(z)}, fraught with the problem that training cannot prevent (x_f, x_f̄) = (0, 0), an ambiguity, we might try the hinge-loss function

    max(0, x_{f(z)} − x_{f̄(z)} + Δ).

By corollary 1.3 we know that it is possible to achieve zero loss over all z with margin Δ = 1, and that the data will then be correctly and unambiguously classified. Like x_{f(z)}, this loss function is piecewise linear on the rectified wire model and can likewise benefit from fast gradient and step size calculations. Because this loss is no longer monotone decreasing in the bias parameters, there is now no reason to initialize biases at zero. The loss of monotonicity, of course, brings with it the zero-gradient problem. We suspect this may be more serious for rectified wires than it is in standard network models.

5. Clause learning

Although the SDA algorithm provides a tractable computation for learning individual data in the rectified wire model, we do not yet have any reason to believe this algorithm can learn any interesting functions.
To address that concern, in this section we show that, at least in a particular limit, the SDA algorithm is able to learn classes defined by any Boolean function. We already saw in section 1 that rectified wire networks can represent such classes, even with favorable size scaling. To also demonstrate learning we have to rely on an admittedly impractical family of networks, with size growing exponentially with the number of Boolean variables.

Definition 5.1 The complete Boolean network, for learning a Boolean function f(z_1, . . . , z_N), has N pairs of input nodes, 2^N hidden nodes, and two output nodes. The input nodes correspond to the literals z_1, z̄_1, . . . , z_N, z̄_N, the hidden nodes to all possible N-clauses {z_1, z̄_1} × ⋯ × {z_N, z̄_N} formed from N literals, and the output nodes correspond to the truth value of f. Each hidden node has N edges, one to each of its constituent literals, and edges to both of the output nodes.

Figure 6 shows the complete Boolean network for functions of two variables (rendered for zero bias on all edges). In general, the hidden nodes all lie in the same layer. With the weights at the output nodes fixed at 1, the limit of the SDA algorithm we will analyze is where the weight w shared by all the hidden nodes approaches zero. We refer to this limit on complete Boolean networks as clause learning. The SDA bias changes in clause learning are very simple. While (−∇x_c)_i ∈ {0, 1} when i is an output node, by (8) we have (−∇x_c)_i ∈ {0, w} at all the hidden nodes. From (8) it also follows that the bias changes on edges into the hidden nodes are smaller by a factor w relative to edges leaving the hidden nodes. Recall that an SDA iteration is defined by the deactivation of some edge as biases are increased. In clause learning this always occurs in the second layer, in the edges to the outputs. As the hidden node values are O(w), the biases on the output edges will also be O(w), as that suffices for deactivation.
While there are corresponding bias changes on edges into the hidden nodes, these are smaller by a factor of w, or O(w²), and far from making these edges inactive (since x_i ∈ {0, 1} on the input nodes).

Figure 6: Complete Boolean network for 2-variable Boolean functions f(z_1, z_2). The function value is classified by the smaller of the two output node values, with the convention that node f̄ is smaller when f(z_1, z_2) = 0.

Lemma 5.2 In clause learning, the hidden nodes, and the biases on edges leaving them, always have values of the form wK + O(w²), K ∈ ℤ≥0.

Proof We use induction, with the base case being the start of training, where all the biases are zero. For any (doubled) input d ∈ {0, 1}^{2N}, the value of a hidden node h is x_h = w K(d, h), where K(d, h) ∈ {0, . . . , N} counts the number of literals associated with node h that are 1 for input d. Next suppose that a slightly modified statement continues to hold after T iterations of SDA, that is, for any hidden node h and any edge h→i,

    x_h, b_{h→i} ∈ {wK + O(w²) : K ∈ ℤ≥0}.

Consider what happens in iteration T + 1, invoked when the network makes the wrong prediction and some edge is deactivated by an increase in bias. The deactivated edge will always be an edge leaving a hidden node, and the change of its bias will therefore be the difference of two numbers in the set {wK + O(w²) : K ∈ ℤ≥0}, thus keeping the changed bias in this set. Since the corresponding bias changes on edges into the hidden nodes are O(w²), the statement about the values of the hidden nodes also continues to hold.

Theorem 5.3 Clause learning succeeds for any Boolean function.

Proof Consider the values of the two output nodes when clause learning encounters the doubled input d ∈ {0, 1}^{2N} associated with the function argument z ∈ {0, 1}^N. By lemma 5.2,

    x_f = w K_f + O(w²),   x_f̄ = w K_f̄ + O(w²),

where K_f, K_f̄ ∈ ℤ≥0.
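The base case of lemma 5.2 is easy to check numerically. The sketch below (our own encoding: a clause is a tuple of signs, one per variable) evaluates every hidden node of the complete Boolean network at zero bias and confirms that the clause matching the input is the unique one attaining the value wN.

```python
from itertools import product

# Base case of lemma 5.2: with all biases zero and hidden-node weight w,
# the hidden node for a clause takes the value w * K(d, h), where K counts
# the clause's literals set to 1 by the doubled input d. The tuple-of-signs
# clause representation is our own illustrative choice.

def doubled_input(z):
    """(z1, ..., zN) -> (z1, ~z1, ..., zN, ~zN) as 0/1 values."""
    d = []
    for zi in z:
        d += [zi, 1 - zi]
    return tuple(d)

def hidden_values(z, w):
    """Value of every hidden node (one per N-literal clause) at zero bias."""
    N = len(z)
    d = doubled_input(z)
    vals = {}
    for signs in product((0, 1), repeat=N):   # sign 0: z_i, sign 1: ~z_i
        K = sum(d[2 * i + s] for i, s in enumerate(signs))
        vals[signs] = w * K
    return vals
```

For N = 2 and z = (1, 0), exactly one of the four clauses has both of its literals satisfied, so exactly one hidden node reaches the maximal value 2w.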
As long as the integer for the wrong output is greater than zero, SDA will make bias changes, if necessary, to identify the correct class. For example, suppose f(z) = 0 and K_f > 0. Through bias increases SDA can bring about x_f̄ < x_f, because all inactivations will be on edges h→f̄, that is, from hidden nodes to the correct output node. The only effect these changes can have on x_f is a decrease of order O(w²), due to O(w²) bias increases on edges into the hidden nodes. Since each edge h→f̄ can become inactive just once, the number of SDA iterations required to produce the desired outcome is bounded by the number of hidden nodes. The above argument does not apply in the case K_f = 0. However, we can argue that this never arises when f(z) = 0. If it did, consider the input d ∈ {0, 1}^{2N} corresponding to z and the unique hidden node h whose combination of input-literals is such that x_h = wN + O(w²) (complete Boolean networks sample all N-literal combinations). Since x_f is a sum of non-negative contributions from all the hidden nodes, K_f = 0 is only possible if b_{h→f} = wN + O(w²). To finish the proof, we show that this value is inconsistent with conservative learning. Fixing the hidden node h from above, consider all the arguments z for which f(z) = 1 and the bias b_{h→f} required to correctly assign them to the correct class by the smallness of x_f. The corresponding inputs d for such z would produce x_h = wK + O(w²), with K < N, because the clause associated with h is unique in producing x_h = wN + O(w²) only for an input for which f(z) = 0. But to achieve the smallest contribution, O(w²), to x_f from edge h→f, we only need bias b_{h→f} = wK + O(w²), K < N. This is smaller (more conservative) than the hypothesized value from above. The same argument, with interchanged output nodes, applies to learning data for which f(z) = 1.
6. Balanced weights

We say a network is layered when the nodes can be partitioned into a sequence of layers ℓ = 0, 1, . . . , L, such that all edges in the network are between nodes in adjacent layers. When a rectified wire network has the property that the weights are only layer-dependent, a suitable rescaling applied to the biases will eliminate all the weights, replacing them by 1, without changing the classification behavior of the network. Denoting the layer of node i as ℓ(i), and writing W_ℓ for the product of the layer weights up to and including layer ℓ, we see from definition 1.2 that the rescalings

    x̃_i = x_i / W_ℓ(i),   ỹ_{i→j} = y_{i→j} / W_ℓ(i),   b̃_{i→j} = b_{i→j} / W_ℓ(i),    (11)

leave the equations unchanged while replacing all the weights by 1. Here ℓ = 0 and ℓ = L correspond to the input nodes and output (class) nodes, respectively. Classification is unchanged because all the output node values are rescaled by the same (positive) factor. The weights do have an effect on the way the network is trained by the conservative learning rule. In the following we motivate a particular setting of the weights, called balanced, derived for each node i from its in-degree, |→i|, and out-degree, |i→|. From definition 1.2 we see that the choice

    w_i = 1 / |→i|    (12)

will have the effect that there is no net gain or decay in the typical node values x when moving from one layer to the next. The appearance of the in-degree is associated with the forward propagation of x. A very different choice is suggested by recursion (8), which relates the bias changes (proportional to −∇x_c) throughout the network in conservative learning. If we wish the biases to have equal influence on the class output node, so that again there is no net gain/decay moving between layers, then the correct choice is

    w_i = 1 / |i→|.    (13)

Here we have the out-degree because the bias changes are derived from backward propagation. As edges become inactive over the course of training, the in-degree in (12) and out-degree in (13) should count only the active edges.
However, both (12) and (13) are problematic when they are unequal, even at the onset of edges becoming inactive. When (12) and (13) are unequal, the node values x and accumulated bias changes will grow at different rates from layer to layer. Since edge deactivation is determined by x − b, it will not be uniform in the network, but will be concentrated either at the input layer or the output layer. To promote a more uniform distribution of edge deactivations in the network we use balanced weights:

    w_i = 1 / √(|→i| |i→|),  i ∈ H.    (14)

With this choice, the propagation of x in the limit of small bias takes the form

    x_i = (|→i| / √(|→i| |i→|)) ⟨x⟩_{→i},

where ⟨x⟩_{→i} denotes an average over nodes on the in-edges to i. From (8) we have

    (−∇x_c)_i = (|i→| / √(|→i| |i→|)) ⟨−∇x_c⟩_{i→},

where ⟨−∇x_c⟩_{i→} denotes an average over nodes on out-edges of i. The two growth rates, adjusted for the same direction through the network, are now equal:

    x_i / ⟨x⟩_{→i} = √(|→i| / |i→|) = ⟨−∇x_c⟩_{i→} / (−∇x_c)_i.

Whereas neither x nor the accumulated bias change will have zero layer-to-layer growth when there is a systematic imbalance in the in-degrees and out-degrees, by having equal growth there is a better chance that deactivations will occur uniformly across layers. Non-zero layer-to-layer growth/decay of variables in a rectified wire network does not present a problem because the equations have no intrinsic scale. Suitable rescalings of the type given at the start of this section, applied after training, can neutralize the layer-to-layer growth without changing the classification behavior of the network. So far we have only considered the weights of the hidden nodes. The only other nodes with weights are the output nodes. These might be weighted differently based on prior information about the classes, such as their frequency in the data. However, when there is no distinguishing prior information about the classes, the weights on the output nodes should be given equal values. By scale invariance we are free to impose w_i = 1, i ∈ C.
As a tool for the study of rectified wire networks, we introduce a global weight-multiplier hyperparameter q, where formula (14) is replaced by

    w_i = q / √(|→i| |i→|),  i ∈ H,    (15)

and q = 1 corresponds to balanced weights. The limit q → 0 is interesting because it collapses the rectified wire model to a much simpler one. As explained in section 5, and easily generalized to arbitrary numbers of layers, two things happen in the q → 0 limit: (i) only edges in the final layer ever become inactive, and (ii) all biases except those in the final layer can be neglected. Because of (i), all the layers below the final hidden layer of nodes combine into a single linear layer, resulting in a model with just a single hidden layer. After scaling away the weights, the expression for the value of an output node k ∈ C takes the form

    x_k = Σ_j max(0, Σ_{i∈D} a_{ji} x_i − b_{j→k}),

where the integers a_{ji} count the number of paths in the network from an input node i to a hidden-layer node j. This model, comprising a single layer of rectifier gates, is of course much easier to analyze than a general, multi-layered rectified wire model. By decreasing q in experiments one can assess the value of depth in the network. In particular, if q → 0 does not compromise performance, then depth is not being utilized in an essential way.

7. Small networks

Because both our network model (rectified wires) and training method (conservative learning) are unconventional, we use this section to demonstrate the model and method on very small networks before turning to more standard demonstrations in section 9. We first describe how the SDA algorithm trains the complete Boolean network on functions of two variables. This is followed by the study of a two-hidden-layer network for functions of three Boolean variables. We already proved (section 5) that the 16-edge network shown in Figure 6 learns all 2^{2^2} = 16 two-variable Boolean functions with SDA in the limit of vanishing weights in the hidden layer.
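The path-count coefficients a_{ji} of the collapsed q → 0 model can be accumulated by a single forward sweep over the network. The sketch below uses our own graph encoding (edge tuples and a caller-supplied topological order); it is an illustration of the counting, not code from the paper.

```python
# In the q -> 0 limit, everything below the final hidden layer acts as a
# single linear map with integer coefficients a_ji = number of directed
# paths from input node i to node j. Accumulate them by dynamic programming
# over a topological order.

def path_counts(edges, inputs, order):
    """counts[j][i] = number of directed paths from input node i to node j."""
    counts = {i: {i: 1} for i in inputs}
    for j in order:                        # non-input nodes, topological order
        acc = {}
        for (a, b) in edges:
            if b != j:
                continue
            for i, c in counts.get(a, {}).items():
                acc[i] = acc.get(i, 0) + c
        counts[j] = acc
    return counts
```

On a diamond-shaped graph (one input splitting into two nodes that rejoin), the rejoining node collects two paths from the input, i.e. a coefficient of 2.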
Here we check that this continues to be true for balanced weights. Thanks to the symmetry of this network, we only need to check four equivalence classes of functions. Here equivalence is with respect to negating inputs or the output, or swapping the two variables. In all cases, these operations on the function correspond to relabelings of the input and output nodes of the network. Using the algebra of GF(2) for Boolean operations, representatives of the four equivalence classes are

    f_0(z_1, z_2) = 0
    f_1(z_1, z_2) = z_1
    f_+(z_1, z_2) = z_1 + z_2
    f_×(z_1, z_2) = z_1 z_2.

Figure 7: Two examples of final bias settings, in columns, for networks trained on the functions (left to right) f_0, f_1, f_+, f_× with different data orderings.

We have verified that the four functions above are learned on the 16-edge network with SDA and balanced weights (q = 1). The biases were always initialized at zero, as rendered by equal-thickness wires in Figure 6. Recall how the network is trained on a function f with the SDA algorithm. After a data vector d = (z_1, z̄_1, z_2, z̄_2) is forward-propagated, the two output nodes will have equal or unequal values. If they are unequal, and the correct node corresponding to class f(z_1, z_2) is smallest, no biases are changed and the next item is processed. If the smallest output is positive but wrong, or equal to the other output, then the SDA algorithm executes iterations until the correct output is smallest or zero. The network will then have learned d, except for the case where the iterations drive both outputs to zero. This irreversible mode of classification ambiguity does not arise with the 16-edge network and any of the four functions. In ultra-conservative learning there is no guarantee that successfully learned data is not unlearned in subsequent training. However, by testing on all four data each time biases are changed by SDA, one can establish whether the function is learned.
The order in which data are processed matters, as manifested in the final bias values. Figure 7 shows two examples of final bias settings for each of the four functions. Define a trial as a single training experiment where, starting with zero bias, the data are processed in some prescribed order. All trials, for any of the four functions above, are successful on the 16-edge network. That is, there always is a point where the biases stop changing and the data are correctly and unambiguously classified. Two numbers of interest in a successful trial are (i) #err, the total number of wrong or ambiguously classified data (invocations of SDA) over the course of training, and (ii) #iter, the total number of SDA iterations performed. These numbers are equal on the 16-edge network trained on any of the 2-variable functions, as one SDA iteration is sufficient to fix any incorrect classification in this case. Not surprisingly, the constant function f_0 is the easiest to learn, with #err = 1. All trials with f_1 have #err equal to 4 or 5. The xor function, f_+, always has #err = 3, while the and function, f_×, has the greatest variation, with #err = 2, 3 or 4. The more aggressive variant of SDA, where iterations continue until the class output node is zero, also succeeds for all four functions on the 16-edge network with balanced weights. In this mode of training one pass through the data always suffices. We find, for all four functions, that the network must see all the data (#err = 4) and 6 ≤ #iter ≤ 9. Complete Boolean networks, such as the 16-edge network, quickly become too large for functions of many Boolean variables, and alternative network designs must be considered. For example, the nodes in the single hidden layer could be clauses formed from all small subsets of the variables, or even an incomplete sampling of such clauses.
The success of clause learning (w → 0) with such networks would of course depend on the nature of the function. The opposite limit applied to the hidden-layer weights, w → ∞, suggests a different design for single-hidden-layer networks. Because biases on edges to the outputs never change in this limit, we connect each hidden-layer node to a single output. The hidden-layer nodes are now interpreted as proto-clauses, because their composition in literals is modified during training by changes to the input biases. For example, a proto-clause could sample both a variable and its negation, and rely on training (bias change) to select one or the other. Moving bias changes to the input side of the network alleviates the exponential explosion of static clauses one faces in the w → 0 limit. The departure from complete Boolean networks we feel is most interesting is increasing the number of hidden node layers. Indeed, keeping training tractable in this setting was the primary motivation for the rectified wire model. To explore this feature we trained a network on three-variable Boolean functions where the hidden-layer nodes only have in-degree two. To potentially express relationships among three variables, the hidden nodes are arranged in two layers. The three-variable network, shown in Figure 8, is complete in the following sense. The nodes in the first hidden layer exhaust the 3 · 4 = 12 ways of collecting inputs from distinct Boolean pairs, with and without negation. One attribute of these nodes is the identity of the Boolean variable (1, 2 or 3) that was not sampled. The second layer of hidden nodes exhausts all 3 · 4 · 4 = 48 combinations of input pairs for which this first-hidden-layer attribute is different. As in the 16-edge network, the wiring between the final hidden layer and the output nodes is complete in the conventional sense. The resulting network has 216 edges. Bias renderings, after training, of networks with over 200 edges are not very comprehensible.
However, the statistics of the numbers #err and #iter appear to correlate well with the naive complexity of the Boolean function. We considered three functions that evaluate to 0 and 1 with equal frequency:

    f_1(z_1, z_2, z_3) = z_1
    f_>(z_1, z_2, z_3) = z_1 z_2 + z_2 z_3 + z_3 z_1
    f_+(z_1, z_2, z_3) = z_1 + z_2 + z_3.

Figure 8: A rectified wire network with two hidden layers that, we conjecture, learns all Boolean functions f(z_1, z_2, z_3) on three variables. All hidden nodes in this network have two inputs.

Learning f_1, or learning to ignore z_2 and z_3, should be easiest. Function f_>, the logical majority function, we expect to be harder because its value, in some but not all cases, is sensitive to all three variables. By the same argument, the parity function f_+ should be the hardest of the three. All trials we performed on f_1, f_> and f_+, using SDA on the 216-edge network with balanced weights, were successful. Figure 9 shows the distribution of #err and #iter in 50 trials for each, differing only in the order in which the data are processed. By either metric (the total number of false classifications, or the total work in training) the difficulty ranking of the functions is f_1 < f_> < f_+. Less extensive trials on the full set of 2^{2^3} = 256 Boolean functions all proved successful and lead us to conjecture that the 216-edge network of Figure 8 learns all 3-variable Boolean functions regardless of data order. We note that the complete Boolean network, with only 2^3 (3 + 2) = 40 edges, would have been much smaller in this instance and by theorem 5.3 can learn all the Boolean functions in the w → 0 limit. Still, it is reassuring to see that SDA also works in deep networks. The three-variable Boolean functions also highlight the advantage of the ultra-conservative termination rule for SDA iterations.
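The three test functions, written in GF(2) algebra as above, can be checked exhaustively in a few lines. The function names below are our own; the check confirms that f_> is the logical majority function, f_+ is parity, and all three are balanced (equally many 0 and 1 outputs).

```python
from itertools import product

# The three-variable test functions in GF(2) algebra
# (addition mod 2, multiplication = AND).

def f_1(z1, z2, z3):
    return z1

def f_maj(z1, z2, z3):                  # f_> in the text
    return (z1 * z2 + z2 * z3 + z3 * z1) % 2

def f_par(z1, z2, z3):                  # f_+ in the text
    return (z1 + z2 + z3) % 2
```

An exhaustive sweep over the 8 arguments verifies the majority and parity identities.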
When we repeat the 50 trials with the stricter zero-output-value rule (xc = 0), we find that training is not always successful.

Figure 9: Distribution of #err and #iter in 50 trials when learning the three-variable Boolean functions f1, f> and f+ on the network shown in Figure 8.

While training on f1 was always successful, the success rate on f> was 76% and on f+ only 22%. The success rate improves for smaller q and appears to reach 100% for all three functions when q = 0.5. Another point in favor of ultra-conservative termination is the greatly reduced number of iterations. For example, by forcing zero outputs, the total number of iterations needed to learn f1, averaged over data ordering, is about 140, or about 15 times the number needed by the ultra-conservative rule.

8. Sparse expander networks

A nice simplification provided by rectified wire networks is that the training algorithm has very few hyperparameters. Learning succeeds or fails mostly on the basis of the network we choose. However, it is possible that the tractability of rectified wire training comes at the cost of a greater sensitivity to network design. We therefore anticipate that network characteristics (size, depth, sparsity) will take over the role of hyperparameters. As a general rule, learning improves with larger networks. Since the edge biases are the only learned parameters of a rectified wire network, the appropriate measure of network size is the number of edges. In this section we introduce a family of networks that adds depth (the number of hidden layers) as a second design parameter. Layered networks that are completely connected in the conventional sense suffer from a symmetry-breaking problem. Since we start with all biases set at zero, a completely connected network has perfect permutation symmetry of its hidden nodes.
This symmetry extends, during training, to the bias updates on all edges between the hidden nodes. To allow for independent settings of the biases, the initial biases must break the symmetry in some data-neutral fashion. Incompletely connected, or even sparse, networks are an extreme form of symmetry breaking, and correspond to infinite bias settings. Our network design, called sparse expander networks, generalizes the networks of section 7. Sparsity in our case is the property that all the hidden nodes have in-degree of only two. At the same time, the representation of the data is expanded, layer to layer, by a constant growth factor g in the size of layers. The expander structure cannot be applied to the final two layers because of the fixed (and small) number of output nodes. Here the layers are completely connected, thereby making all the information in the last (largest) hidden layer available to each output node. Letting h denote the number of hidden layers, the number of edges in a sparse expander network for |D| inputs and |C| classes, with parameters (g, h), is

|E| = |D| (2g + · · · + 2g^(h−1) + (2 + |C|) g^h).    (17)

To introduce a degree of uniformity, our networks have a constant out-degree of 2g on all input and hidden nodes. This requires that g is an integer. The in-edges to the nodes in a hidden layer are generated by the simplest random algorithm. In a first pass through the hidden layer we create one edge per node to the layer below, drawing uniformly from the nodes that currently have the smaller of the two possible out-degrees. A second edge is added in a second pass, at the completion of which all the nodes on the layer below have out-degree 2g. Edges from the first hidden layer to the input layer are constructed no differently. This construction is implemented by the publicly available1 C program expander and was used for all the experiments reported in the next section. We can use the hyperparameter q to assess the role of depth.
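The edge count (17) can be cross-checked against the network sizes quoted in the experiments; a sketch (the function name is ours, and |D| = 60 for the NMF data assumes two input channels per Boolean variable, as described later):

```python
def num_edges(D, C, g, h):
    """Edge count of a sparse expander network, eq. (17):
    |E| = |D| (2g + ... + 2g^(h-1) + (2 + |C|) g^h)."""
    return D * (sum(2 * g**k for k in range(1, h)) + (2 + C) * g**h)

# NMF data: |D| = 60 inputs (30 Boolean variables, two channels each),
# |C| = 2 classes.
assert num_edges(60, 2, 14, 3) == 683760   # the (h, g) = (3, 14) network
assert num_edges(60, 2, 21, 2) == 108360   # the (h, g) = (2, 21) network
assert num_edges(60, 2, 6, 2) == 9360      # the (h, g) = (2, 6) network

# Analog MNIST: r = 50 channels -> |D| = 100 inputs, |C| = 10 classes.
assert num_edges(100, 10, 13, 2) == 205400
```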
If performance does not degrade in the limit q → 0, then the network could have been replaced by the much simpler single-hidden-layer model (16). When this limit is applied to a sparse expander network, the number of hidden nodes is |H| = |D|g^h. The sampling of the fixed weights aji in the rectifier gates is also controlled by g and h.

9. Experiments

To describe experiments with conservatively trained rectified wire networks it suffices to specify the network. The SDA algorithm has no other hyperparameters, since there is even a balancing principle that sets the weights. The two-parameter sparse expander networks offer a convenient way to study behavior both with respect to network size and depth. These characteristics were our main focus and guided the choice of experiments going into this study. Follow-up experiments, which featured the weight-multiplier hyperparameter q, later provided critical insights for network design that have yet to be explored. All our experiments were carried out with a publicly available1 C implementation of the SDA algorithm called rainman. This program requires that the user-supplied network is layered and that the number of inputs and outputs is compatible with the data. Our networks were all created by the program expander.

1. github.com/veitelser/rectified-wires

Although rainman reports results at regular intervals, after a specified number of data have been processed, this batch size has no bearing on the actual training because bias parameters are only updated in response to wrong classifications. The batch size merely sets the frequency with which the network is tested against a body of test data. The main result of this test is the classification accuracy: the test-data average of the indicator variable that is 1 for correct, unambiguous classifications, and 1/m when the correct class node is in an m-fold tie.
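The accuracy indicator just described can be sketched as follows, assuming the predicted class is the output node with the smallest value (consistent with the later discussion of ties at zero); the function name is a hypothetical helper of ours:

```python
def accuracy_contribution(outputs, correct_class):
    """Test-item contribution to the classification accuracy: 1 for a
    correct, unambiguous classification (the correct class is the unique
    minimum output), 1/m for an m-fold tie that includes the correct
    class, and 0 otherwise."""
    best = min(outputs)
    winners = [c for c, x in enumerate(outputs) if x == best]
    return 1.0 / len(winners) if correct_class in winners else 0.0

assert accuracy_contribution([0.0, 0.3, 0.5], 0) == 1.0   # unambiguous
assert accuracy_contribution([0.0, 0.0, 0.5], 0) == 0.5   # two-fold tie at zero
assert accuracy_contribution([0.2, 0.0, 0.5], 0) == 0.0   # misclassified
```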
In addition to the classification accuracy, rainman also reports a number of other quantities of interest. Two we have already seen in section 7: the running total of misclassifications, #err, and the work (iterations) performed so far by SDA, #iter. These should increase while the classification accuracy is below 100%. However, in an overfitting situation #err and #iter stop increasing before the accuracy has reached 100%. Evidence of this phenomenon is more directly discerned through another statistic reported by rainman: the average number of SDA iterations required to fix each of the incorrect classifications encountered in the batch, ⟨#iter⟩. As defined, ⟨#iter⟩ is never less than 1. However, when there are no misclassifications in the batch at all (keeping #err and #iter static), the value ⟨#iter⟩ = 0 is reported. Information more revealing of the internal workings of the network is provided by the layer averages of the edge activations, ⟨α1⟩, . . . , ⟨αh⟩. Here ⟨αh⟩ is the fraction of edges, in the layer leading to the output nodes, that are active in an average test-data item. These numbers decrease during training, a result of biases being increased. Lastly, a statistic relevant to the termination of training is the frequency, in the test data, of irreversible misclassifications, 0err. The latter arise whenever an output node, not of the correct class, has value zero. The onset of 0err > 0 signals that training should terminate, because activation levels are so low that SDA is forced to make multiple outputs zero even while targeting just the class output. A simple remedy for preventing 0err > 0 is to increase the size of the network. The case of m-fold classification ambiguities, mentioned above, occurs in practice only when m output nodes are tied at zero. Ambiguous classification is therefore not a problem as long as 0err = 0. On a single Intel Xeon 2.00 GHz core, rainman runs at a rate of 50 ns per iteration per network edge.
Multiplying this number by the product of #iter and |E| (17) gives a good estimate of the wall-clock time in all our experiments.

9.1. Nested majority functions

We first describe experiments with a synthetic data set. Synthetic data has some advantages over natural data. Our nested majority function (NMF) data has two nice features. First, through the level of nesting we can systematically control the complexity of the data. Second, because these Boolean functions are defined for all possible Boolean arguments, it is easy to generate large training sets for supervised learning. NMFs are a parameterized family of Boolean functions that evaluate to 0 and 1 with equal frequency. The first parameter, a prime p, sets the number of arguments at p − 1. Instances are defined by three integers a, b, c ∈ {1, . . . , p − 1} and a nesting level n = 0, 1, 2, . . . . Starting with

f^0_a(z1, . . . , z_(p−1)) = z_a,

the other functions (taking values in GF(2)) are defined via the recurrence

f^n_a = f>( f^(n−1)_(ab mod p) + z(ac), f^(n−1)_(2ab mod p) + z(2ac), f^(n−1)_(3ab mod p) + z(3ac) ),

where f> is the 3-argument majority function and z(x) = (x mod p) mod 2 should be interpreted as an element of GF(2).

Figure 10: An instance of a depth-5 nested majority function on 6 variables. The nodes are majority gates and the two edge colors indicate the presence/absence of negation. The depth 4, . . . , 1 functions correspond to removing 1, . . . , 4 layers from the bottom of the circuit.

The functions f^n at level n correspond to depth-n Boolean circuits taking p − 1 inputs and constructed from not and majority gates. Figure 10 shows an instance of the function f^5 for p = 7. From the identity

f>(z1 + 1, z2 + 1, z3 + 1) = f>(z1, z2, z3) + 1

it follows that negating all the arguments of an NMF negates its value. The two classes, defined by the NMF value, are therefore equal in size. Our experiments used data generated by the functions f^n_1 for p = 31 and b = 2, c = 3.
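The NMF recurrence can be sketched directly from the definitions above (the function name is ours; the balance check follows from the negation identity):

```python
from itertools import product

def make_nmf(p, a, b, c, n):
    """Nested majority function f^n_a on p-1 Boolean arguments
    (parameters as defined in the text)."""
    z = lambda x: (x % p) % 2                        # z(x) as an element of GF(2)
    maj = lambda u, v, w: (u & v) | (v & w) | (w & u)
    def f(zs, a=a, n=n):
        if n == 0:
            return zs[a - 1]                         # f^0_a(z_1, ..., z_{p-1}) = z_a
        return maj(f(zs, (a * b) % p, n - 1) ^ z(a * c),
                   f(zs, (2 * a * b) % p, n - 1) ^ z(2 * a * c),
                   f(zs, (3 * a * b) % p, n - 1) ^ z(3 * a * c))
    return f

# For p = 5 the functions take 4 arguments; every NMF evaluates to 0 and 1
# with equal frequency over all 2^(p-1) inputs, since negating all the
# arguments negates the value.
for n in range(3):
    f = make_nmf(5, 1, 2, 3, n)
    assert sum(f(zs) for zs in product((0, 1), repeat=4)) == 8
```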
The easiest of these, with n = 0, only depends on a single argument. To learn this function the rectified wire network (taking 2 × 30 inputs) must identify the correct argument and ignore all the others. In the case of f^1_1 it must select three of the arguments and also decide whether to apply negations. The functions f^2_1, f^3_1, etc. become progressively harder as they depend on more arguments and with more complicated rules. Generalizing the value of f^n, from many fewer than 2^(p−1) data, is clearly only possible for small n. Our results go as far as n = 4. We generated training and test data samples (of the 30 Boolean variables) with a pseudorandom number generator. In the longest experiment, with 10^7 training samples, less than 1% of the data was seen more than once by the training algorithm. We first present results for the function f^3_1. Fixing the number of hidden layers at h = 3, Figure 11 shows the effect of increasing the network size via the growth factor g.

Figure 11: Improvement of the classification accuracy with network size, for the function f^3_1, on rectified wire networks with three hidden layers.

The accuracy is plotted as a function of #err, rather than the number of training data, because learning (change of bias) occurs only when there has been a misclassification. In all except the g = 14 network (683760 edges), the accuracy reaches a maximum after which there is a sharp decline that coincides with the onset of 0err > 0. The smaller networks could still serve as imperfect classifiers by terminating their training at the empirical maximum test accuracy, or, in the absence of test data, when 0err exceeds a threshold. We have no explanation of the intriguing similarity of the unsuccessful accuracy plots. We also find that many of the seemingly random features in the accuracy plots are nearly reproduced in different random instances of expander networks with the same (h, g).
We emphasize that it is not really surprising that generalization performance can drastically deteriorate with more training examples for small networks. Conservative learning uses a fundamentally different principle than the standard approach, which tries to minimize an overall loss function across training examples. Conservative learning merely tries to myopically account for the most recently seen example, which can have negative consequences for other training examples, to say nothing of unseen test examples. The effect of changing network depth is shown in Figures 12 and 13. Although the three featured networks have about the same number of edges, only the two deeper networks are successful.

Figure 12: Of three networks with comparable size trained on the function f^3_1, the deeper networks are more successful and require more data.

Figure 12, with accuracy plotted against #err, shows that learning is faster on shallower networks. On the other hand, when plotted against #iter in Figure 13, we see that less work is required to train the deeper networks. Work, as measured in SDA iterations, is not evenly distributed over the course of training. Figure 14 shows ⟨#iter⟩ plotted against #err. More iterations are needed early in training, especially so in shallow networks. The balanced-weight heuristic is far from successful in keeping the layer-averaged activation levels uniform. After training the (h, g) = (3, 14) network, we find ⟨α3⟩ = 1.1% and ⟨α2⟩ = 47%, with the rest at 100%. For the deeper and also successful (h, g) = (4, 7) network, the diminished activations are ⟨α4⟩ = 4.1%, ⟨α3⟩ = 57%, ⟨α2⟩ = 76%. The lower complexity of the functions f^n_1 for n < 3 is reflected in the smaller networks required to learn them. On networks with two hidden layers (h = 2), f^2_1 is successfully learned with g = 21 (108360 edges) after only 5000 misclassifications. For f^1_1 the corresponding numbers are g = 6 (9360 edges) and #err = 900.
On the other hand, if we train f^4_1 on the same network (h = 4, g = 7) that learned f^3_1 with #err ≈ 4 × 10^5, the best accuracy, achieved for roughly the same #err, is only 91.9%.

Figure 13: Same as Figure 12, except plotted against the total work (SDA iterations).

Figure 14: Average number of iterations used by the SDA algorithm to fix a wrong classification over the course of training for the three networks featured in Figures 12 and 13.

The prediction accuracy on NMF data, for all of our networks, is improved by decreasing the weight-multiplier hyperparameter q. In fact, it appears that q → 0 is optimal, or that the best performance can be achieved with the model (16) comprising a single layer of rectifier gates with fixed weights aji. A striking case is the function f^3_1 trained on the sparse expander graph with h = 2, g = 57, already encountered in Figure 12. Figure 15 shows the effect of decreasing q from the balanced setting, q = 1. We see that the 94% accuracy of the q = 1 network reaches 100% already for q = 0.7. It seems the only downside in decreasing q is an increase in the number of training data (greater #err). The q → 0 weight codewords aji associated with the sparse expander network are very simple. Because the network has in-degree two and depth h = 2, almost all the rectifiers see four random inputs that are given equal weight. It is likely that the performance of model (16) can be further improved with weight codewords that go beyond those generated by sparse expander networks. In light of the q → 0 findings, it is not surprising that conventional neural networks, even with just a single hidden layer, easily outperform the rectified wire model on the NMF data. A fully-connected ReLU network with one layer of only 50 hidden nodes was sufficient to learn f^3_1 completely after seeing 5 × 10^5 samples.
This can be compared with the 2 × 10^4 samples (#err) that informed bias changes in the rectified wire model when learning the same data (Figure 15). The vast reduction in the number of hidden nodes, from 2 × 10^5 (penultimate layer of the h = 2, g = 57 network) to 50, no doubt is attributable to the standard model's capacity for designing the weight codewords that in model (16) are static.

Figure 15: Evolution of the prediction accuracy with the weight-multiplier hyperparameter q on the h = 2, g = 57 network for the function f^3_1.

9.2. MNIST

Seen as images, the MNIST handwritten digits (LeCun et al., 1998) are analog data. Because the contrast in these images is well approximated as two-valued, it makes sense to map pixels to a two-symbol alphabet and train a rectified wire network on data of that type. We report results with this approach at the end of this section. However, we first try processing MNIST non-symbolically, using the analog data vectors described in equation (2). One motivation for the analog approach is that MNIST images are highly compressible. Playing to a possible strength of rectified wire networks, it might be advantageous to first map MNIST images, and images in general, to data vectors whose components are more nearly statistically independent. We use a method based on principal component analysis. Using singular value decomposition, we approximate the 60000 × 28^2 matrix X of flattened MNIST training images as the product X ≈ UV, where the rows of the r × 28^2 matrix V are the orthonormal eigen-images associated with the r largest singular values of X. A flattened image x (training or testing) is mapped to r analog channels by v = V x. Applying the component-wise cumulative probability distribution functions to this v then gives us the 2r-component data vector as in equation (2).
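The PCA-plus-CDF preprocessing just described can be sketched as follows; a minimal sketch, in which the function names are ours and the pairing of each channel u with its complement 1 − u is one plausible reading of the two-channels-per-component data vectors of equation (2):

```python
import numpy as np

def fit_pca_cdf(X, r):
    """Map flattened images to r analog channels via the top-r right
    singular vectors, then to [0, 1] by each channel's empirical CDF
    computed on the training data."""
    # Rows of V are the r orthonormal eigen-images of the training matrix X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:r]
    sorted_ch = np.sort(X @ V.T, axis=0)           # per-channel training values

    def transform(x):
        v = V @ x                                   # v = V x
        # Empirical CDF per channel; test values below the training minimum
        # or above the maximum map to 0 and 1, respectively.
        ranks = np.array([np.searchsorted(sorted_ch[:, j], v[j])
                          for j in range(r)])
        u = ranks / len(X)
        return np.concatenate([u, 1.0 - u])         # 2r-component data vector
    return transform

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                      # stand-in for flattened images
transform = fit_pca_cdf(X, r=4)
d = transform(X[0])
assert d.shape == (8,) and d.min() >= 0.0 and d.max() <= 1.0
```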
We compute the cumulative probability function from the training data and use the same function when processing the test data (with test samples below the minimum or above the maximum training samples mapped to 0 and 1, respectively). Figure 16 shows the test accuracy when training a sparse expander network on analog MNIST data with r = 50 channels. This network has 205400 edges, the same number a conventional network would have with a single fully-connected layer of 259 hidden nodes. After making about #err = 25000 misclassifications, there are no further misclassifications on the training data and training ceases. The test accuracy attains a maximum of 94.54% at this point. Figure 17 shows the evolution of ⟨#iter⟩ over the course of training. Generalization (test accuracy) improves when we decrease the q hyperparameter from the value q = 1 for balanced weights. However, the improvement, reaching 94.99% at q = 0.2, is much more modest than we found for the NMF data. The time used by rainman in any of these training runs was about 65 minutes. Although about one million data (16 epochs) are processed, only for about 25000 of these does SDA do anything. We also used the analog MNIST data to compare SDA training with stochastic gradient descent (SGD) training on the same model and network architecture. Because the sparse networks we use for SDA are a poor fit for GPU-based software, we wrote a C++ SGD optimizer that, like SDA, runs on a single core without calls to a GPU. In the strictest comparison SGD used the same monotone loss function, xc, used by SDA. For SGD we also tried multi-class hinge loss, with margin parameter 0.1. As discussed in section 3, this loss spoils tractability (monotone bias changes) and is therefore not applicable for SDA. The mini-batch size for SGD was fixed at 100 and we employed standard stochastic gradient descent without momentum. SGD does poorly when forced to use SDA's monotone loss.
With the learning rate set at 0.002, training accuracy reaches a maximum of 91.5% in 6 minutes; this is also the test accuracy for this mode of operation. Since SDA uses adaptive gradient steps (set by deactivations), we also tried SGD with a ten-fold reduction in learning rate. While training now goes through 97 epochs of data and takes about as long as SDA, there is no improvement in either the training or test accuracy. We interpret this as supporting the conservative learning principle: there are gains when parameters are optimally adjusted in response to individual data. The performance of SGD improves significantly when the rectified wire model is trained with multi-class hinge loss. However, due to the structure of expander networks, this loss is problematic when applied exclusively. When evaluated on the initial expander network (all edges active), the partial derivatives of the loss with respect to all biases, except those on edges to the outputs, are identically zero. This occurs because the effect of all bias changes (except in the final layer) passes through the penultimate layer of nodes, which is completely connected to the outputs, and equal changes on these leave the hinge loss unchanged. To avoid the zero-derivative problem we used a hybrid approach, where hinge loss was used only after some number of data had first been processed with the monotone xc loss, to create deactivations that break the symmetry in the final layer. Switching to hinge loss after 5 epochs had no effect on accuracy, but switching after 10 epochs improved training and test accuracies to 94.4% and 94.1%, respectively. This trend continues and reaches 96.8% (training) and 95.6% (test) when the switch is made after 20 epochs (121 minutes). We attribute the modest improvement in generalization (0.6%), over the SDA-trained model, to the margins that hinge loss is able to insert in the class boundaries.
We do not believe the overfitting we saw with SDA was caused by limiting the analog data vectors to only 50 channels, because essentially the same classification accuracy is obtained with the uncompressed symbolic data.

Figure 16: Classification accuracy of MNIST images from 50-principal-component data vectors with the SDA algorithm on a sparse expander network with parameters h = 2, g = 13.

Figure 17: SDA iterations per misclassification for the same data and network as in Figure 16.

The symbolic data was generated from the raw MNIST images by thresholding pixel values at 40% of the maximum (the minimum point of the pixel histogram) to define the two symbols. Figure 18 shows the evolution of the accuracy at fixed depth h = 4 as the network size increases.

Figure 18: Classification accuracy of binarized MNIST images on rectified wire networks with four hidden layers.

In the smallest network (g = 1), as expected, the accuracy drops precipitously at the onset of 0err > 0. However, already in the next network (g = 2) we have the situation where training terminates because most (training) data are correctly classified and the remaining ambiguous cases are ties at zero (in output) and untrainable. At termination we find 0err = 2.7%, contributing to the reduction in accuracy. Increasing the network size further (g = 3) gives accuracy 94.8% at termination, where now 0err = 0. This accuracy is close to that obtained with the analog data. Although our results for MNIST are an improvement over the simple linear classifier with accuracy 88%, and even a pairwise-boosted linear classifier (92.4%), the appropriate comparison should be with other tractable methods. Of these, we note that support vector machines with a Gaussian kernel already achieve 98.6% accuracy on MNIST. These and other techniques for MNIST are reviewed in LeCun et al. (1998).

9.3.
Markov sequences

As our final application we study a synthetic symbolic dataset where, unlike the nested majority function data, samples are generated probabilistically. The data are strings in the alphabet {A, B, C, D} generated by Markov chains. Examples of 12-symbol strings from the two classes are shown in Table 1. Those in class 1 are extracts from the Markov chain defined by the transition matrix

T = (1/10) ×
    [ 1 2 1 6 ]
    [ 3 1 4 2 ]
    [ 4 1 4 1 ]
    [ 2 6 1 1 ]    (18)

Figure 19: SDA iterations per misclassification for the same data and networks shown in Figure 18.

    class 1         class 2
    CADCCCACCADA    ACCBDBABDBCC
    ADABDCCCDACB    ABDAABACCDDB
    CBCCADBBCADB    BDBDADACDACD
    DADBDADDBDCA    DBADCCADADDA
    BCAADBBABCDB    BDACCCDABBDB
    ABDDBADCABBD    CCCBDABCADBD

Table 1: Examples of 12-symbol strings extracted from the Markov chain with transition matrix (18), in class 1, and its transpose, in class 2.

Since T is doubly stochastic, its transpose defines another Markov chain and is the source of the strings in class 2. The doubly stochastic property also ensures that the four symbols occur with equal frequency. Finally, the absence of zeros means that all bigrams have finite probability. Even a machine that managed to reconstruct T from the data could never achieve perfect classification accuracy, because every 12-symbol string occurs with finite probability in either chain. The optimal classifier selects the chain (class) that gives each string the higher probability. With this criterion applied to our T, the true-positive rate for 12-symbol strings is 95.1%. We also trained networks on 25-symbol strings, for which the true-positive rate is 99.1%. These numbers set the maximum achievable classification accuracy. In Figures 20 and 21 we compare classification accuracies for 12- and 25-symbol strings as learned on networks having approximately 2 × 10^5 edges. We see that here the depth of the network has very little effect on the maximum accuracy.
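The data generation and the optimal likelihood classifier can be sketched as follows; a sketch assuming the entries of (18) read row-wise in tenths (the arrangement that makes T doubly stochastic with no zeros), with helper names of our own:

```python
import numpy as np

# Transition matrix T of eq. (18): rows and columns each sum to 1,
# and no entry is zero.
T = np.array([[1, 2, 1, 6],
              [3, 1, 4, 2],
              [4, 1, 4, 1],
              [2, 6, 1, 1]]) / 10.0
assert np.allclose(T.sum(axis=0), 1) and np.allclose(T.sum(axis=1), 1)

rng = np.random.default_rng(0)

def sample_string(T, length):
    """Sample a symbol string (0..3 for A..D); the stationary distribution
    of a doubly stochastic chain is uniform, so the start is uniform."""
    s = [rng.integers(4)]
    for _ in range(length - 1):
        s.append(rng.choice(4, p=T[s[-1]]))
    return s

def log_likelihood(T, s):
    # log P(s) = log(1/4) + sum of log transition probabilities
    return np.log(0.25) + sum(np.log(T[a, b]) for a, b in zip(s, s[1:]))

# The optimal classifier picks the chain (T for class 1, its transpose
# for class 2) under which the string is more probable.
s = sample_string(T, 12)
predicted = 1 if log_likelihood(T, s) >= log_likelihood(T.T, s) else 2
assert predicted in (1, 2)
```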
Figure 20: Classification accuracy of 12-symbol strings generated by the Markov chain (18) or its transpose.

Figure 21: Classification accuracy of 25-symbol strings generated by the Markov chain (18) or its transpose.

By the probabilistic nature of the data, these accuracies must eventually decline as strings having higher probability in the class with the other label are encountered. Deeper networks are less vulnerable to this phenomenon. The highest accuracy achieved for 12-symbol strings is 6% below that of the optimal classifier; for 25-symbol strings the shortfall is 3%. We did not explore the effect of decreasing the q hyperparameter for this data set.

10. Discussion

The rectified wire model is a significant departure from the standard model of neural networks. Even so, the training algorithm we derived for this model, SDA, deserves comparison with the leading algorithm for the standard model: stochastic gradient descent (SGD) (Rumelhart et al., 1986). In both algorithms the network parameters are updated after only a small amount of data has been seen: a single item in the case of SDA, a mini-batch for SGD. The computation of updates involves propagating data vectors in a forward pass, and the class of the data or the associated loss in a backward pass. SDA makes an extra forward pass that determines the size of each update, signaled by a deactivation event. In SGD the size of the update is set by the learning rate, a hyperparameter, and the update itself is an average over the mini-batch, whose size is another hyperparameter. The characteristic of SDA that most differentiates it from SGD is the monotone evolution of the network parameters. Ostensibly this is problematic: there is no going back on any updates, including those made early, when very little of the data had been seen. This tacitly assumes the learned state of the network is unique, or nearly so.
But an alternative hypothesis is also worth considering: a high multiplicity of learned states, all equally viable. Our experiments with SDA support this hypothesis. Very different final bias settings, all giving perfect classification, were obtained even for simple classes just by changing the order of the data. Our experiments with Markov-chain generated strings showed that SDA is also not tripped up by outlier data. Although the final network in this probabilistic setting must have 100% classification ambiguity, the accuracy in practice is near that of the optimal classifier long before this asymptotic state is reached. Not all our experiments with SDA were successful. On the MNIST dataset the algorithm suffered from overfitting: whereas the network had learned to perfectly classify the training data, the classification accuracy on the test data was only 95%. Does this reflect a fundamental deficiency in the rectified wire model, and the conservative learning principle more generally, or is it a result of poor execution? An even more serious problem with the rectified wire model was exposed, in experiments, upon decreasing the weight-multiplier hyperparameter q. Recall that the primary motivation for the model was tractable training in the presence of network depth. Significant simplifications of the standard neural network model, such as fixing the weights and replacing loss by a constraint, were key in facilitating tractable training. Even so, the expressivity provided by the remaining bias parameters remains high (arbitrary Boolean functions), and the training algorithm (SDA) fully utilizes activations (ReLU nonlinearities) in every layer of the network. The q parameter was introduced to explore behavior with respect to depth: small/large q favors bias changes near the output/input nodes. Though we studied a very limited class of networks (sparse expander nets), the experimental results pointed uniformly in favor of q → 0.
In this limit the rectified wire model reduces to a much simpler model (16), comprising a single layer of rectifier gates. While this limiting model is still non-trivial and interesting, it definitely lacks depth. Our experience with the rectified wire model highlights the difference between the capacity for depth and the learnability of depth. Although there exist bias settings in the model that efficiently represent complex Boolean functions and utilize activations on all levels, experiments show that these are not the kinds of bias settings that are learned by the tractable, conservative route. At least that is the conclusion we should draw from the experiments performed so far. There are still some directions to explore before we can conclude that conservatively trained rectified wire networks lack the capacity to learn "in depth". The most obvious is to try architectures very different from the sparse expander networks considered so far. These need not even be layered, although they must break symmetry if we continue to initialize all the bias parameters at zero. Specialized architectures may be inherently better at exploiting depth, as in the case of standard-model networks with convolutional layers. Next, we recall that the bias updates given by the SDA algorithm can fall short of the most conservative possible. While we do not know whether the quadratic program (4) can be solved by something as efficient as the SDA algorithm, experiments with general-purpose solvers could study the effect of update quality on training. By the same route one could study the effect of learning, in aggregate, entire mini-batches (section 3, conclusions). Another avenue is to define conservative by the 1-norm on the bias updates. Our choice of the 2-norm was motivated by the uniqueness of the optimal updates. However, it seems likely that in the mini-batch setting even the 1-norm will have unique optima.
Finally, it might be interesting to allow the weights in the network to vary slowly over the course of training. This could be as simple as favoring different depths over time using the q parameter, or a more sophisticated scheme based on running averages.

Acknowledgments

We thank Alex Alemi, Sergei Burkov, Pat Coady, Cris Moore and Mehul Tikekar for discussions over the course of this project. We also thank the three reviewers whose comments brought significant improvements to the paper.

References

Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21-27, 1967.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar):551-585, 2006.

Veit Elser. A network that learns Strassen multiplication. Journal of Machine Learning Research, 17(116):1-13, 2016.

Mikhail K. Kozlov, Sergei P. Tarasov, and Leonid G. Khachiyan. The polynomial solvability of convex quadratic programming. USSR Computational Mathematics and Mathematical Physics, 20(5):223-228, 1980.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Raymond J. Mooney. Encouraging experimental results on learning CNF. Machine Learning, 19(1):79-92, 1995.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

Bernard Widrow, Rodney G. Winter, and Robert A. Baxter. Layered neural nets for pattern recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(7):1109-1118, 1988.