# Training Neural Machines with Trace-Based Supervision

Matthew B. Mirman, Dimitar Dimitrov, Pavle Djordjevic, Timon Gehr, Martin Vechev
Department of Computer Science, ETH Zurich, Switzerland

**Abstract.** We investigate the effectiveness of trace-based supervision methods for training existing neural abstract machines. To define the class of neural machines amenable to trace-based supervision, we introduce the concept of a differential neural computational machine (∂NCM) and show that several existing architectures (NTMs, NRAMs) can be described as ∂NCMs. We performed a detailed experimental evaluation with NTM and NRAM machines, showing that additional supervision on the interpretable portions of these architectures leads to better convergence and generalization of the learning phase than standard training, in both noise-free and noisy scenarios.

## 1. Introduction

Recently, there has been substantial interest in neural machines that can induce programs from examples (Graves et al., 2014; Reed & de Freitas, 2016; Graves et al., 2016; Zhang et al., 2015; Zaremba & Sutskever, 2015; Kaiser & Sutskever, 2015; Gaunt et al., 2016; Vinyals et al., 2015; Feser et al., 2016; 2015; Frankle et al., 2016; Kurach et al., 2016; Bošnjak et al., 2017). While significant progress has been made towards learning interesting algorithms (Graves et al., 2016), ensuring that these machines converge to the desired solution during training remains challenging. Interestingly, however, even though they differ architecturally, most of these machines rely on components (e.g., neural memory) that are more interpretable than typical neural networks (e.g., an LSTM). This allows one to provide additional supervision and to bias the learning towards the desired solution.

In this work, we investigate whether (and by how much) additional supervision provided to these interpretable components during training can improve learning. The particular type of supervision we consider is partial trace supervision, which provides more detailed information, beyond input-output examples, during learning. To help describe the type of architectures our results apply to, we introduce the notion of a differential neural computational machine (∂NCM), a formalism which allows one to cleanly characterize the neural machines that can benefit from any amount of extra trace-based supervision. We show that common existing architectures such as Neural Turing Machines (NTMs) and Neural Random Access Machines (NRAMs) can be described as ∂NCMs. We also explain why other machines such as the Neural Programmer-Interpreter (NPI) (Reed & de Freitas, 2016) or its recent extensions (e.g., the Neural Program Lattice, Li et al. (2017)) cannot be instantiated as a ∂NCM and are thus restricted to require large (and potentially prohibitive) amounts of supervision.

We performed an extensive evaluation investigating how partial trace information affects the training of both NTMs and NRAMs. Overall, our experimental results indicate that additional supervision can substantially improve convergence while leading to better generalization and interpretability, under both noise-free and noisy supervision scenarios.
Interestingly, we also show that on a more complex task, the NRAM architecture trained with additional supervision can generalize better than more recent architectures such as the DNGPU (Freivalds & Liepins, 2017), which are difficult to interpret and thus to provide additional supervision to.

The work closest to ours is the Differentiable Forth of Bošnjak et al. (2017), which also investigates the effects of extra supervision. The main difference between the two works is the kind of supervision provided. In Bošnjak et al. (2017), the teacher provides a sketch of the program to be learned, and the machine then needs to fill in the holes of the sketch. In our work, there is no program sketch: the teacher hints at (part of) the operations that the machine should execute in specific executions (i.e., hints on specific execution traces), and the machine then has to learn these operations. We believe that both training methods are interesting and worth investigating in the context of neural machines. We also remark that both methods have already been investigated in the context of traditional (non-neural) programming, for example by Lau et al. (2003) and Solar-Lezama et al. (2006).

*Figure 1. Traces of locations accessed by read/write heads for the Flip3rd task in three different training setups: (a) overfitting & no traces; (b) generalizing & no traces; (c) generalizing & traces. The y-axis represents time (descending); the x-axis represents head locations. White represents the distribution of the write head; orange the distribution of the read head. The first two NTMs are trained without partial trace information; (b) and (c) generalize and are more interpretable than (a); (c) was trained using partial trace information and is more interpretable than (b).*

To intuitively illustrate our approach, consider the task of training an NTM to flip the third bit in a bit stream (called Flip3rd); such tasks have been extensively studied in the area of program synthesis with noise (Raychev et al., 2016) and without (Jha et al., 2010). An example input-output pair for this task is [0, 1, 0, 0] → [0, 1, 1, 0]. Given a set of such examples, our goal is to train an NTM that solves this task. Figure 1c shows an example NTM run (on an input longer than those shown during training) that demonstrates good generalization and has an understandable trace. The y-axis indicates time (descending), the x-axis indicates the accessed memory location, the white squares represent the write head of the NTM, and the orange squares represent the read head. As we can see, the model writes the input sequence to the tape and then reads from the tape in the same order.

However, in the absence of richer supervision, the NTM (and other neural architectures) can easily overfit to the training set; an example of an overfitting NTM is shown in Figure 1a. Here, the traces are chaotic and difficult to interpret. Further, even if the NTM generalizes, it can do so with erratic reads and writes; an example is shown in Figure 1b. Here, the NTM learns to read from the third bit (circled) with a smaller weight than from other locations, and also reads and writes erratically near the end of the sequence. This model is less interpretable than the one in Figure 1c because it is unclear why the NTM reads the third bit with less weight, and whether this helps flip it. In this work we develop systematic ways to guide the training of a neural machine (e.g., NTM, NRAM) towards more interpretable behavior.¹
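To make the setup concrete, the following is a minimal sketch (not taken from the paper's released code) of how Flip3rd training data with optional read-head hints could be generated. The helper names `flip3rd_example` and `read_head_hint`, the hint encoding (the desired read location per output step), and the sequence lengths are our own illustrative assumptions:

```python
import random

def flip3rd_example(length):
    """One Flip3rd input-output pair: the output equals the input
    with its third bit flipped."""
    x = [random.randint(0, 1) for _ in range(length)]
    y = list(x)
    y[2] ^= 1  # flip the third bit (0-indexed position 2)
    return x, y

def read_head_hint(length):
    """A partial trace hint for the read head: during the output
    phase the machine should read locations 0, 1, 2, ... in order."""
    return list(range(length))

# Supply read-head hints for roughly 10% of the examples, matching the
# Flip3rd experiment described above; the rest carry no trace data.
dataset = []
for _ in range(1000):
    x, y = flip3rd_example(random.randint(4, 16))
    hint = read_head_hint(len(x)) if random.random() < 0.1 else None
    dataset.append((x, y, hint))
```

The point of the sketch is that trace supervision is optional and per-example: most examples remain plain input-output pairs, and only a small fraction carry hints on the interpretable components.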
For instance, for Flip3rd, providing partial trace information on the NTM's read heads for 10% of the input-output examples is sufficient to bias learning towards the NTM shown in Figure 1c 100% of the time.

¹ All of the code, tasks and experiments are available at: https://github.com/eth-sri/ncm

## 2. Neural Computational Machines

We now define the notion of a neural computational machine (NCM), which we use throughout the paper to formalize our ideas. We introduce NCMs for two reasons. First, they allow one to capture the class of neural machines to which our supervision methods apply, and to explain why other machines fall outside this class. That is, the NCM abstraction clearly delineates end-to-end differentiable architectures such as the NTM (Graves et al., 2014) and the NRAM (Kurach et al., 2016), which can train with little to no trace supervision, from architectures that are not end-to-end differentiable, such as the NPI (Reed & de Freitas, 2016), which hence require a certain minimum amount of trace information. In Section 3, we show how to phrase two existing neural architectures (NTM and NRAM) as NCMs. Second, it enables us to cleanly define a general, abstract loss function that captures different kinds of trace-based supervision at the NCM level and makes it clear which components of that loss function need to be instantiated for the particular NCM architecture (e.g., NTM, NRAM). In Section 4, we show how to specify the general trace-based loss function at the NCM level and how to instantiate key components of that loss function for NTMs and NRAMs.

We note that while NCMs allow for a clean formalization of the ideas and help overall understanding, they are not meant to be a complete neural model where one specifies the additional supervision only at the NCM level without any awareness of the details of the underlying architecture. Indeed, the user still has to be aware of the interpretable portions of the particular architecture (e.g., NRAM, NTM) and have some intuition for how a solution to the task at hand would use those components. We do believe, however, that this is a reasonable trade-off to investigate: a little bit of extra knowledge can enable substantially better results.

*Figure 2. (a) depicts the generic NCM structure, (b) is a high-level overview of an NTM, and (c) is a high-level overview of an NRAM. For NRAM, the controller outputs a circuit, which in this case contains the modules READ, ADD, SUB, LT, and WRITE. The controller encodes the two inputs to the modules as probability vectors, a and b, over the possible choices. Darker colors capture higher probability, e.g., dark green is the most likely choice. The only inputs to the controller are the registers r1 and r2.*

### 2.1. NCM design

The design of an NCM mimics classic computational machines with a controller and a memory. Concretely, an NCM is a triple of functions: a processor, a controller, and a loss. We define these components below.

**Processor** The processor $\pi_\theta \colon C \times M \to B \times M$ performs a pre-defined set of commands $C$, which might involve manipulating memories in $M$. The commands may produce additional feedback in $B$. The processor's operation may also depend on parameters $\theta$.

**Controller** The controller $\kappa_\theta \colon B \times Q \times I \to C \times Q \times O$ decides which operations the machine performs at each step.
It receives external inputs from $I$ and returns external outputs in $O$. It can also receive feedback in $B$ from the processor and command it to perform certain operations (e.g., a memory read). The decisions the controller takes may depend on its internal state in $Q$. The controller can also depend on additional parameters $\theta$; for instance, if the controller is a neural network, then $\theta$ gives the network's weights.

**Loss Function** The loss function $L_E \colon \mathrm{Trace} \times E \to \mathbb{R}$ indicates how close an execution trace $\tau \in \mathrm{Trace}$ of the machine (defined below) is to a behavior from a set $E$. Plugging the behavior of the machine into the loss, we get:

$$l(\theta; x, e) = L_E(\tau_\theta(x), e) \tag{1}$$

Averaging $l$ over a set of input-output examples $(x, e)$ gives us the loss surface that is minimized during training. As is standard, we require $l$ to be continuous and piecewise differentiable w.r.t. $\theta$ for any given example $(x, e)$.

**Traces** The execution of the machine begins with an input sequence $x = \{x_t\}_{t=1}^n$ and initial values of the controller state $q_0$, memory $M_0$, and processor feedback $b_0$. At each time step $t = 1 \dots n$, the controller and the processor take turns:

$$(c_t, q_t, y_t) = \kappa(b_{t-1}, q_{t-1}, x_t), \qquad (b_t, M_t) = \pi(c_t, M_{t-1}) \tag{2}$$

To avoid clutter, we keep the parameters $\theta$ implicit. A trace $\tau(x, b_0, M_0, q_0) = \{(c_t, b_t, q_t, y_t, M_t)\}_{t=1}^n$ keeps track of the values of these quantities at every time step. We occasionally write $\tau_C, \tau_B, \dots$ for the trace projected onto one of its components $c, b, \dots$.

**∂NCMs** Note that the differentiability conditions we impose on $l$ do not imply that the functions $\pi$, $\kappa$ and $L_E$ are continuous or differentiable w.r.t. all parameters. They can indeed be highly discontinuous, as in NCMs like Memory Networks (Weston et al., 2014) or Neural Programmer-Interpreters (Reed & de Freitas, 2016). In order to ensure the differentiability of $l$, these architectures train with strong supervision: in this case the loss function $L_E$ requires examples $e \in E$ that provide a value for each discontinuous quantity in the traces. In contrast, what we call differentiable neural computational machines (∂NCMs) have $\kappa$, $\pi$ and $L_E$ continuous and piecewise differentiable. Then there is no need to specify corresponding values in the examples, and we can train with as much trace information as is available.

## 3. NTMs and NRAMs as ∂NCMs

We now describe NTMs and NRAMs as ∂NCMs.

**NTM as ∂NCM** An NTM (Graves et al., 2014) (Figure 2b) has access to a memory $M \in \mathbb{R}^{c \times n}$ of $c$ cells of $n$ real numbers each. We suppose the machine has one read head and one write head, whose addresses are, respectively, the probability vectors $r, w \in [0, 1]^{\{1 \dots c\}}$. At every time step, the read head computes the expected value $m \in \mathbb{R}^n$ of a random cell at index $i \sim r$. This value, together with the current input, is fed into a controller neural network, which then decides on the command. The command consists of several pieces: (i) a decision on what fraction $e \in \mathbb{R}^n$ to erase, (ii) how much $a \in \mathbb{R}^n$ to add to the cells underneath the write head, and (iii) an indication of the head movement given by two probability vectors $\Delta r, \Delta w \in [0, 1]^{\{-1, 0, +1\}}$. Finally, the controller produces the current output value. In terms of NCMs, the NTM variables fall into the following classes:

| I/O | State | Communication |
| --- | --- | --- |
| $x \in I$, $y \in O$ | $q \in Q$ | $m \in B$, $(r, w, M) \in M$, $(e, a, \Delta r, \Delta w) \in C$ |

Each of these variables changes over time according to equations given in the supplementary material.
The processor $\pi$ and the controller $\kappa$ satisfy, at each time step:

$$((e_t, a_t, \Delta r_t, \Delta w_t), q_t, y_t) = \kappa(m_t, q_{t-1}, x_t) \tag{3}$$

$$(m_{t+1}, (r_t, w_t, M_t)) = \pi((e_t, a_t, \Delta r_t, \Delta w_t), (r_{t-1}, w_{t-1}, M_{t-1})) \tag{4}$$

That is, the processor computes the new values of the read and write heads by convolving their old values with $\Delta r$ and $\Delta w$, updates the memory using $e$, $a$, the write head $w$ and the previous value of the memory, and also produces the output read value using the memory and the read head $r$. Formal details are provided in the supplementary material.

The standard loss function $L_E$ for the NTM includes a term, such as cross-entropy or $L_2$ distance, for the machine output at every time step. Each of these terms compares the machine output to the respective value contained in the examples $e \in E$.

**NRAM as ∂NCM** A Neural Random Access Machine (NRAM) (Kurach et al., 2016) is a neural machine designed for ease of pointer access. The NRAM has a memory and registers that store probability vectors over a fixed set $\{1 \dots c\}$. The memory is therefore a matrix $M \in \mathbb{R}^{c \times c}$ of $c$ cells, while the register file is a matrix $r \in \mathbb{R}^{n \times c}$ of $n$ registers. The controller receives no external inputs, but it takes as feedback the probability of $0$ for each register. It also produces no external output, except for a halting probability $f_t \in [0, 1]$ produced at every time step; the output of a run is taken to be the final memory.

At each time step, the NRAM can execute several modules from a fixed sequence. Each module implements a simple integer operation or memory manipulation lifted to probability vectors; for example, addition lifts to convolution, while memory access is as in the NTM. At every time step, the controller organizes the sequence of modules into a circuit, which is then executed. The circuit is encoded by a pair of probability distributions per module (see Figure 2c). These distributions specify, respectively, which previous modules or registers provide a given module's first and second arguments. The distributions are stacked in the matrices $a$ and $b$. A similar matrix $c$ specifies which values should be written to the registers at the end of the time step. The NCM instantiation of an NRAM is:

| I/O | State | Communication |
| --- | --- | --- |
| $f \in O$ | $q \in Q$ | $r_{\cdot,0} \in B$, $(r, M) \in M$, $(a, b, c) \in C$ |

The equations for these quantities are found in the supplementary material. The processor $\pi$ and the controller $\kappa$ satisfy:

$$((a_t, b_t, c_t), q_t, f_t) = \kappa(r_{t-1,\cdot,0}, q_{t-1}, 1), \qquad (r_{t,\cdot,0}, (r_t, M_t)) = \pi((a_t, b_t, c_t), (r_{t-1}, M_{t-1})) \tag{5}$$

The NRAM loss is an expectation with respect to the distribution $p$ of the halting time step, as determined by the halting probabilities $f_t$ (see the supplementary material). Each example is a vector $e \in \{1 \dots c\}^c$ holding the desired value of each memory cell. The loss considers the negative log likelihood that the $i$-th memory cell at time step $t$ equals the value $e_i$ from the example, independently for each $i$:

$$L_E(\tau, e) = \mathbb{E}_{t \sim p}\Big[-\sum_i \log\,(M_t)_{i,\, e_i}\Big]$$
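To illustrate how the pieces fit together, here is a minimal sketch, with hypothetical function names (`unroll_ncm`, `nram_loss`), of the generic NCM unrolling from Eq. (2) and an NRAM-style loss in the spirit of the equation above. The per-step definitions of $\kappa$ and $\pi$ are architecture-specific (supplementary material), and the halting distribution $p$ is assumed to be given; its derivation from the $f_t$ follows Kurach et al. (2016):

```python
import torch

def unroll_ncm(controller, processor, xs, b0, q0, M0):
    """Unroll an NCM per Eq. (2): the controller emits a command c_t,
    a new state q_t and an output y_t; the processor executes c_t on
    the memory and returns feedback b_t. Returns the full trace."""
    b, q, M = b0, q0, M0
    trace = []
    for x in xs:
        c, q, y = controller(b, q, x)
        b, M = processor(c, M)
        trace.append((c, b, q, y, M))
    return trace

def nram_loss(memories, halt_dist, e):
    """NRAM-style loss: expected negative log likelihood, under the
    halting distribution p, that memory cell i holds value e[i].
    memories[t] is a row-stochastic (c x c) tensor; halt_dist[t] = p_t."""
    loss = torch.zeros(())
    for p_t, M_t in zip(halt_dist, memories):
        nll = -sum(torch.log(M_t[i, e[i]]) for i in range(len(e)))
        loss = loss + p_t * nll
    return loss
```

In a ∂NCM, `controller` and `processor` would both be differentiable modules, so gradients of this loss flow through the entire trace; trace-based supervision then simply adds further terms over the interpretable trace components (e.g., head positions or circuit choices).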