# Sobolev Training for Neural Networks

Wojciech Marian Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan Pascanu
DeepMind, London, UK
{lejlot,osindero,jaderberg,swirszcz,razp}@google.com

At the heart of deep learning we aim to use neural networks as function approximators, training them to produce outputs from inputs in emulation of a ground truth function or data generation process. In many cases we only have access to input-output pairs from the ground truth; however, it is becoming more common to have access to derivatives of the target output with respect to the input, for example when the ground truth function is itself a neural network, as in network compression or distillation. Generally these target derivatives are not computed, or are ignored. This paper introduces Sobolev Training for neural networks, a method for incorporating these target derivatives in addition to the target values while training. By optimising neural networks to approximate not only the function's outputs but also the function's derivatives, we encode additional information about the target function within the parameters of the neural network. Thereby we can improve the quality of our predictors, as well as the data-efficiency and generalisation capabilities of our learned function approximators. We provide theoretical justifications for such an approach as well as empirical evidence on three distinct domains: regression on classical optimisation datasets, distilling policies of an agent playing Atari, and large-scale applications of synthetic gradients. In all three domains the use of Sobolev Training, employing target derivatives in addition to target values, results in models with higher accuracy and stronger generalisation.

## 1 Introduction

Deep Neural Networks (DNNs) are one of the main tools of modern machine learning. They are consistently proven to be powerful function approximators, able to model a wide variety of functional forms, from image recognition [8, 24], through audio synthesis [27], to human-beating policies in the ancient game of Go [22]. In many applications the process of training a neural network consists of receiving a dataset of input-output pairs from a ground truth function, and minimising some loss with respect to the network's parameters. This loss is usually designed to encourage the network to produce the same output, for a given input, as that from the target ground truth function. Many of the ground truth functions we care about in practice have an unknown analytic form, e.g. because they are the result of a natural physical process, and therefore we only have the observed input-output pairs for supervision. However, there are scenarios where we do know the analytic form and so are able to compute the ground truth gradients (or higher order derivatives); alternatively, these quantities may sometimes simply be observable. A common example is when the ground truth function is itself a neural network; for instance this is the case for distillation [9, 20], compressing neural networks [7], and the prediction of synthetic gradients [12]. Additionally, if we are dealing with an environment/data-generation process (as opposed to a pre-determined set of data points), then even though we may be dealing with a black box, we can still approximate derivatives using finite differences, as the short sketch below illustrates.
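For a black-box function, one such scheme is the central difference $\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon}$. Below is a minimal sketch; the quadratic test function and the step size are our own illustrative choices, not anything prescribed in the paper.

```python
import numpy as np

def finite_difference_gradient(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of a black-box
    scalar function f at point x (two extra evaluations of f per
    input dimension)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        # (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)  ->  df/dx_i
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

# Sanity check on f(x) = sum(x**2), whose true gradient is 2x.
x = np.array([1.0, -2.0, 0.5])
print(finite_difference_gradient(lambda z: np.sum(z**2), x))  # ~[2., -4., 1.]
```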
In this work, we consider how this additional information can be incorporated in the learning process, and what advantages it can provide in terms of data efficiency and performance. We propose Sobolev Training (ST) for neural networks as a simple and efficient technique for leveraging derivative information about the desired function in a way that can easily be incorporated into any training pipeline using modern machine learning libraries.

Figure 1: a) Sobolev Training of order 2. Diamond nodes m and f indicate parameterised functions, where m is trained to approximate f. Green nodes receive supervision. Solid lines indicate connections through which error signals from losses $\ell$, $\ell_1$, and $\ell_2$ are backpropagated to train m. b) Stochastic Sobolev Training of order 2. If f and m are multivariate functions, the gradients are Jacobian matrices. To avoid computing these high-dimensional objects, we can efficiently compute and fit their projections $\langle D^j_x m, v^j \rangle$ and $\langle D^j_x f, v^j \rangle$ onto random vectors $v^j$ sampled from the unit sphere.

The approach is inspired by the work of Hornik [10], which proved universal approximation theorems for neural networks in Sobolev spaces: metric spaces where distances between functions are defined in terms of differences both in their values and in the values of their derivatives. In particular, it was shown that a sigmoid network can not only approximate a function's value arbitrarily well, but that the network's derivatives with respect to its inputs can approximate the corresponding derivatives of the ground truth function arbitrarily well too. Sobolev Training exploits this property, and tries to match not only the output of the function being trained but also its derivatives.

Several related works have also exploited derivative information for function approximation. For instance, Wu et al. [30] and antecedents propose a technique for Bayesian optimisation with Gaussian Processes (GPs), where it was demonstrated that the use of information about gradients and Hessians can improve the predictive power of GPs. In previous work on neural networks, derivatives of predictors have usually been used either to penalise model complexity (e.g. by pushing the Jacobian norm to 0 [19]), to encode additional, hand-crafted invariances to some transformations (for instance, as in Tangentprop [23]), to fit estimated derivatives of dynamical systems [6], or, very recently, to provide an additional learning signal during attention distillation [31] (see Section 5 of the Supplementary Materials for details). Similar techniques have also been used in critic-based Reinforcement Learning (RL), where a critic's derivatives are trained to match its target's derivatives [29, 15, 5, 4, 26], using small, sigmoid-based models. Finally, Hyvärinen proposed Score Matching Networks [11], based on the somewhat surprising observation that one can model the unknown derivatives of a function without access to its values: all that is needed is a sampling-based strategy and a specific penalty. However, such an estimator has high variance [28], so it is not really useful when true derivatives are given. To the best of our knowledge, and despite its simplicity, the proposal to directly match network derivatives to the true derivatives of the target function has been minimally explored for deep networks, especially modern ReLU-based models.
In our method, we show that by using the additional knowledge of derivatives with Sobolev Training we are able to train better models: models which achieve lower approximation errors, generalise better to test data, and reduce the sample complexity of learning. The contributions of our paper are therefore threefold: (1) We introduce Sobolev Training, a new paradigm for training neural networks. (2) We look formally at the implications of matching derivatives, extending previous results of Hornik [10] and showing that modern architectures are well suited to such training regimes. (3) We present empirical evidence demonstrating that Sobolev Training leads to improved performance and generalisation, particularly in low data regimes. Example domains are: regression on classical optimisation problems; policy distillation from RL agents trained on the Atari domain; and training deep, complex models using synthetic gradients, where we report the first successful attempt to train a large-scale ImageNet model using synthetic gradients.

## 2 Sobolev Training

We begin by introducing the idea of training using Sobolev spaces. When learning a function $f$, we may have access to not only the output values $f(x_i)$ for training points $x_i$, but also the values of its $j$-th order derivatives with respect to the input, $D^j_x f(x_i)$. In other words, instead of the typical training set consisting of pairs $\{(x_i, f(x_i))\}_{i=1}^N$, we have access to $(K+2)$-tuples $\{(x_i, f(x_i), D^1_x f(x_i), \ldots, D^K_x f(x_i))\}_{i=1}^N$. In this situation, the derivative information can easily be incorporated into training a neural network model of $f$, by making the derivatives of the neural network match the ones given by $f$.

Considering a neural network model $m$ parameterised with $\theta$, one typically seeks to minimise the empirical error in relation to $f$ according to some loss function $\ell$:

$$\sum_{i=1}^N \ell\big(m(x_i|\theta), f(x_i)\big).$$

When learning in Sobolev spaces, this is replaced with:

$$\sum_{i=1}^N \left[ \ell\big(m(x_i|\theta), f(x_i)\big) + \sum_{j=1}^K \ell_j\Big(D^j_x m(x_i|\theta),\; D^j_x f(x_i)\Big) \right],$$

where $\ell_j$ are loss functions measuring error on the $j$-th order derivatives. This causes the neural network to encode derivatives of the target function in its own derivatives. Such a model can still be trained using backpropagation and off-the-shelf optimisers.

A potential concern is that this optimisation might be expensive when either the output dimensionality of $f$ or the order $K$ is high; however, one can reduce this cost through stochastic approximations. Specifically, if $f$ is a multivariate function, instead of a vector gradient one ends up with a full Jacobian matrix, which can be large. To avoid adding computational complexity to the training process, one can use an efficient, stochastic version of Sobolev Training: instead of computing a full Jacobian/Hessian, one just computes its projection onto a random vector (a direct application of a known estimation trick [19]). In practice, this means that during training we have random variables $v^j$ sampled uniformly from the unit sphere, and we match these random projections instead:

$$\sum_{i=1}^N \left[ \ell\big(m(x_i|\theta), f(x_i)\big) + \sum_{j=1}^K \mathbb{E}_{v^j}\Big[ \ell_j\Big(\big\langle D^j_x m(x_i|\theta), v^j\big\rangle,\; \big\langle D^j_x f(x_i), v^j\big\rangle\Big) \Big] \right].$$

Figure 1 illustrates compute graphs for non-stochastic and stochastic Sobolev Training of order 2.
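To make the order-1 objective concrete, here is a minimal sketch in JAX. The two-layer ReLU network, the stand-in analytic target $f$, and the choice of squared L2 errors for both $\ell$ and $\ell_1$ are our own illustrative assumptions rather than the paper's exact setup.

```python
import jax
import jax.numpy as jnp

def init_params(key, d_in, d_hidden):
    k1, k2 = jax.random.split(key)
    return {
        "W1": jax.random.normal(k1, (d_in, d_hidden)) / jnp.sqrt(d_in),
        "b1": jnp.zeros(d_hidden),
        "W2": jax.random.normal(k2, (d_hidden, 1)) / jnp.sqrt(d_hidden),
        "b2": jnp.zeros(1),
    }

def m(params, x):
    # Scalar-output ReLU network m(x | theta).
    h = jax.nn.relu(x @ params["W1"] + params["b1"])
    return (h @ params["W2"] + params["b2"])[0]

def f(x):
    # Stand-in ground truth with analytic derivatives.
    return jnp.sum(jnp.sin(x))

def sobolev_loss(params, xs):
    def per_example(x):
        # Value and input-gradient of the model in one pass.
        value, grad = jax.value_and_grad(lambda x_: m(params, x_))(x)
        l_value = (value - f(x)) ** 2                    # l: squared error on values
        l_grad = jnp.sum((grad - jax.grad(f)(x)) ** 2)   # l_1: squared error on gradients
        return l_value + l_grad
    return jnp.mean(jax.vmap(per_example)(xs))

key, data_key = jax.random.split(jax.random.PRNGKey(0))
params = init_params(key, d_in=2, d_hidden=64)
xs = jax.random.uniform(data_key, (32, 2), minval=-2.0, maxval=2.0)
loss, grads = jax.value_and_grad(sobolev_loss)(params, xs)
```

The returned `grads` can be fed to any off-the-shelf optimiser; the stochastic variant simply replaces the full gradient match with random projections (see the distillation sketch in Section 4.2).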
## 3 Theory and motivation

While in the previous section we defined Sobolev Training, it is not obvious that modelling the derivatives of the target function $f$ is beneficial to function approximation, or that optimising such an objective is even feasible. In this section we motivate and explore these questions theoretically, showing that the Sobolev Training objective is well posed, and that incorporating derivative information has the potential to drastically reduce the sample complexity of learning.

Hornik showed [10] that neural networks with non-constant, bounded, continuous activation functions, with continuous derivatives up to order $K$, are universal approximators in the Sobolev spaces of order $K$, thus showing that sigmoid networks are indeed capable of approximating elements of these spaces arbitrarily well.

Figure 2: Left, from top: an example piece-wise linear function; two (out of a continuum of) hypotheses consistent with 3 training points, showing that one needs two points to identify each linear segment; the only hypothesis consistent with the 3 training points enriched with derivative information. Right: logarithm of test error (MSE) for various optimisation benchmarks with varied training set size (20, 100 and 10000 points) sampled uniformly from the problem's domain.

However, nowadays we often use activation functions such as ReLU, which are neither bounded nor have continuous derivatives. The following theorem shows that for $K = 1$ we can use the ReLU function (or a similar one, like leaky ReLU) to create neural networks that are universal approximators in Sobolev spaces. We use the standard symbol $C^1(S)$ (or simply $C^1$) to denote the space of functions which are continuous, differentiable, and have a continuous derivative on a space $S$ [14]. All proofs are given in the Supplementary Materials (SM).

**Theorem 1.** Let $f$ be a $C^1$ function on a compact set. Then, for every positive $\varepsilon$ there exists a single hidden layer neural network with a ReLU (or a leaky ReLU) activation which approximates $f$ in the Sobolev space $S_1$ up to $\varepsilon$ error.

This suggests that the Sobolev Training objective is achievable, and that we can seek to encode the values and derivatives of the target function in the values and derivatives of a ReLU neural network model. Interestingly, we can show that if we seek to encode an arbitrary function in the derivatives of the model, then this is impossible; not only for neural networks, but for any differentiable predictor on compact sets.

**Theorem 2.** Let $f$ be a $C^1$ function. Let $g$ be a continuous function satisfying $\|g - \partial f / \partial x\| > 0$. Then there exists an $\eta > 0$ such that for every $C^1$ function $h$, either $\|f - h\| \geq \eta$ or $\|g - \partial h / \partial x\| \geq \eta$.

However, when we move to the regime of finite training data, we can encode any arbitrary function in the derivatives (as well as higher order signals, if the resulting Sobolev spaces are not degenerate), as shown in the following proposition.

**Proposition 1.** Given any two functions $f : S \to \mathbb{R}$ and $g : S \to \mathbb{R}^d$ on $S \subset \mathbb{R}^d$ and a finite set $\Sigma \subset S$, there exists a neural network $h$ with a ReLU (or a leaky ReLU) activation such that $\forall x \in \Sigma: f(x) = h(x)$ and $g(x) = \frac{\partial h}{\partial x}(x)$ (it has 0 training loss).

Having shown that it is possible to train neural networks to encode both the values and derivatives of a target function, we now formalise one possible way of showing that Sobolev Training has lower sample complexity than regular training. Let $\mathcal{F}$ denote a family of functions parametrised by $\omega$. We define $K_{reg} = K_{reg}(\mathcal{F})$ to be a measure of the amount of data needed to learn some target function $f$. That is, $K_{reg}$ is the smallest number for which the following holds: for every $f_\omega \in \mathcal{F}$ and every set of $K_{reg}$ distinct points $(x_1, \ldots, x_{K_{reg}})$, if $f(x_i) = f_\omega(x_i)$ for all $i = 1, \ldots, K_{reg}$, then $f = f_\omega$.
$K_{sob}$ is defined analogously, but with the final implication of the form: if $f(x_i) = f_\omega(x_i)$ and $\frac{\partial f}{\partial x}(x_i) = \frac{\partial f_\omega}{\partial x}(x_i)$ for all $i$, then $f = f_\omega$. Straight from the definitions there follows:

**Proposition 2.** For any $\mathcal{F}$, there holds $K_{sob}(\mathcal{F}) \leq K_{reg}(\mathcal{F})$.

For many families, the above inequality becomes strict. For example, to determine the coefficients of a polynomial of degree $n$, one needs to know its values at no fewer than $n+1$ distinct points. If we know both the values and the derivatives at each point, it is a well-known fact (Hermite interpolation) that $\lceil (n+1)/2 \rceil$ points suffice to determine all the coefficients: each point now supplies two of the $n+1$ required constraints. We present two more examples in a slightly more formal way. Let $\mathcal{F}_G$ denote the family of Gaussian PDFs (parametrised by $\mu$, $\sigma$). Let $D = D_1 \times \ldots \times D_n \subset \mathbb{R}^d$, and let $\mathcal{F}_{PL}$ be the family of functions from $D$ (a Cartesian product of the sets $D_i$) to $\mathbb{R}^n$ of the form $f(x) = [A_1 x_1 + b_1, \ldots, A_n x_n + b_n]$ (element-wise linear; Figure 2, left).

**Proposition 3.** There holds $K_{sob}(\mathcal{F}_G) < K_{reg}(\mathcal{F}_G)$ and $K_{sob}(\mathcal{F}_{PL}) < K_{reg}(\mathcal{F}_{PL})$.

This result relates to deep ReLU networks, as they build a hyperplane-based model of the target function. If those hyperplanes were parametrised independently, one could expect a reduction of sample complexity by a factor of $d+1$, where $d$ is the dimension of the function domain. In practice the parameters of the hyperplanes in such networks are not independent, and furthermore the hinge positions change during training, so the proposition cannot be applied directly; it can, however, be seen as an intuitive explanation for why the sample complexity drops significantly for deep ReLU networks too.

## 4 Experimental Results

We consider three domains where information about derivatives is available during training. All experiments were performed using TensorFlow [2] and the Sonnet neural network library [1].

### 4.1 Artificial Data

First, we consider the task of regression on a set of well known low-dimensional functions used for benchmarking optimisation methods. We train two-hidden-layer neural networks with 256 hidden units per layer and ReLU activations to regress towards function values, and verify generalisation capabilities by evaluating the mean squared error on a held-out test set. Since the task is standard regression, we choose all the losses of Sobolev Training to be L2 errors, and use a first-order Sobolev method (second-order derivatives of ReLU networks with a linear output layer are zero, since such networks are piece-wise linear). The optimisation objective is therefore:

$$\sum_{i=1}^N \Big[ \big\| f(x_i) - m(x_i|\theta) \big\|_2^2 + \big\| \nabla_x f(x_i) - \nabla_x m(x_i|\theta) \big\|_2^2 \Big].$$

Figure 2 (right) shows the results for the optimisation benchmarks. As expected, Sobolev-trained networks perform extremely well: for six out of seven benchmark problems they significantly reduce the testing error, with the obtained errors orders of magnitude smaller than the corresponding errors of the regularly trained networks. The stark difference in approximation error is highlighted in Figure 3, where we show the Styblinski-Tang function and its approximations with both regular and Sobolev Training. It is clear that even in very low data regimes, the Sobolev-trained networks can capture the functional shape.

Figure 3: Styblinski-Tang function and its models using regular neural network training (left part of each plot) and Sobolev Training (right part), for 20 and 100 training samples. We also plot the vector field of the gradients of each predictor underneath the function plot.
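Since these benchmark functions are analytic, the ground-truth gradient targets can be generated by automatic differentiation. A minimal sketch in JAX follows; the sampling range and dataset size are illustrative, not the paper's exact protocol.

```python
import jax
import jax.numpy as jnp

def styblinski_tang(x):
    # Scalar-valued benchmark function on R^d.
    return 0.5 * jnp.sum(x**4 - 16.0 * x**2 + 5.0 * x)

key = jax.random.PRNGKey(42)
xs = jax.random.uniform(key, (20, 2), minval=-5.0, maxval=5.0)  # 20-point regime

values = jax.vmap(styblinski_tang)(xs)           # f(x_i) targets
grads = jax.vmap(jax.grad(styblinski_tang))(xs)  # D_x f(x_i) targets

# The tuples (x_i, f(x_i), grad f(x_i)) feed the order-1 objective
# above with l = l_1 = squared L2 error.
```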
Looking at the results, we make two important observations. First, the effect of Sobolev Training is stronger in low-data regimes; however, it does not disappear even in the high-data regime, when one has 10,000 training examples for a two-dimensional function. Second, the only case where regular regression performed better is the regression towards Ackley's function. This particular example was chosen to show that one possible weak point of our approach might be approximating functions with a very high frequency signal component in a relatively low data regime. Ackley's function is composed of exponents of high frequency cosine waves, creating an extremely bumpy surface; consequently, a method that tries to match derivatives can behave badly at test time if one does not have enough data to capture this complexity. However, once we have enough training data points, Sobolev-trained networks approximate this function better too.

### 4.2 Distillation

Another possible application of Sobolev Training is model distillation. This technique has many applications, such as network compression [21], ensemble merging [9], or, more recently, policy distillation in reinforcement learning [20]. We focus here on the task of distilling a policy. We aim to distill a target policy $\pi^*(s)$ (a trained neural network which outputs a probability distribution over actions) into a smaller neural network $\pi(s|\theta)$, such that the two policies $\pi^*$ and $\pi$ have the same behaviour. In practice this is often done by minimising an expected divergence measure between $\pi^*$ and $\pi$, for example the Kullback-Leibler divergence $D_{KL}(\pi(s) \,\|\, \pi^*(s))$, over states gathered while following $\pi^*$. Since policies are multivariate functions, a direct application of Sobolev Training would mean producing full Jacobian matrices with respect to $s$, which for large action spaces is computationally expensive. To avoid this issue we employ the stochastic approximation described in Section 2, resulting in the objective

$$\min_\theta\; D_{KL}\big(\pi(s|\theta) \,\|\, \pi^*(s)\big) + \alpha\, \mathbb{E}_v\Big[ \ell_1\Big( \big\langle \nabla_s \log \pi^*(s), v \big\rangle,\; \big\langle \nabla_s \log \pi(s|\theta), v \big\rangle \Big) \Big],$$

where the expectation is taken with respect to $v$ drawn from a uniform distribution over the unit sphere, and Monte Carlo sampling is used to approximate it. As target policies $\pi^*$, we use agents playing Atari games [17] that have been trained with A3C [16] on three well known games: Pong, Breakout and Space Invaders. The agent's policy is a neural network consisting of 3 convolutional layers followed by two fully-connected layers, which we distill to a smaller network with 2 convolutional layers and a single smaller fully-connected layer (see the SM for details). Distillation is treated here as a purely supervised learning problem, as our aim is not to re-evaluate known distillation techniques, but rather to show that if the aim is to minimise a given divergence measure, we can improve distillation using Sobolev Training.

Figure 4: Test results of distillation of RL agents on three Atari games. Reported test action prediction error (left) is the error of the most probable action predicted between the distilled policy and target policy, and test $D_{KL}$ (right) is the Kullback-Leibler divergence between policies. Numbers in the column titles represent the percentage of the 100K recorded states used for training (the remainder are used for testing). In all scenarios the Sobolev-distilled networks are significantly more similar to the target policy.

Figure 4 shows test error during training with and without Sobolev Training; a minimal sketch of the projected-gradient term follows.
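The sketch below assumes placeholder `teacher_logp` and `student_logp` functions that map a state to a vector of log action probabilities; these names and the squared-error discrepancy between projections are our own illustrative choices.

```python
import jax
import jax.numpy as jnp

def projected_grad_loss(key, student_logp, teacher_logp, s):
    # Sample v uniformly from the unit sphere in state space.
    v = jax.random.normal(key, s.shape)
    v = v / jnp.linalg.norm(v)
    # jvp yields the Jacobian-vector products <d/ds log pi(s), v>
    # without forming the full (actions x state-dims) Jacobians.
    _, proj_teacher = jax.jvp(teacher_logp, (s,), (v,))
    _, proj_student = jax.jvp(student_logp, (s,), (v,))
    return jnp.sum((proj_teacher - proj_student) ** 2)
```

Because `jax.jvp` evaluates Jacobian-vector products directly, the full Jacobians are never materialised, which is exactly the motivation for the stochastic variant.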
The introduction of Sobolev Training leads to similar effects as in the previous section: the network generalises much more effectively, and this is especially true in low data regimes. (Testing is performed on a held-out set of episodes, so there are no temporal or causal relations between training and testing.) Note that the performance gap on Pong is small, due to the fact that the optimal policy for this game is quite degenerate: for the majority of the time the policy in Pong is uniform, since actions taken when the ball is far away from the player do not matter at all; only in crucial situations does it peak so that the ball hits the paddle. In all remaining games one can see a significant performance increase from using our proposed method, as well as minor to no overfitting. Despite looking like a regularisation effect, we stress that Sobolev Training is not trying to find the simplest models for the data or to suppress the expressivity of the model. This training method aims at matching the original function's smoothness/complexity, and so reduces overfitting by effectively extending the information content of the training set, rather than by imposing a data-independent prior as with regularisation.

Table 1: Various techniques for producing synthetic gradients. (In the original figure, green-shaded nodes denote nodes that receive supervision from the corresponding object, gradient or loss value, in the main network.) We report accuracy on the test set ± standard deviation. Backpropagation results are given in parentheses.

| | Noprop | Direct SG [12] | VFBN [25] | Critic | Sobolev |
|---|---|---|---|---|---|
| **CIFAR-10 with 3 synthetic gradient modules** | | | | | |
| Top 1 (94.3%) | 54.5% ± 1.15 | 79.2% ± 0.01 | 88.5% ± 2.70 | 93.2% ± 0.02 | 93.5% ± 0.01 |
| **ImageNet with 1 synthetic gradient module** | | | | | |
| Top 1 (75.0%) | 54.0% ± 0.29 | - | 57.9% ± 2.03 | 71.7% ± 0.23 | 72.0% ± 0.05 |
| Top 5 (92.3%) | 77.3% ± 0.06 | - | 81.5% ± 1.20 | 90.5% ± 0.15 | 90.8% ± 0.01 |
| **ImageNet with 3 synthetic gradient modules** | | | | | |
| Top 1 (75.0%) | 18.7% ± 0.18 | - | 28.3% ± 5.24 | 65.7% ± 0.56 | 66.5% ± 0.22 |
| Top 5 (92.3%) | 38.0% ± 0.34 | - | 52.9% ± 6.62 | 86.9% ± 0.33 | 87.4% ± 0.11 |

### 4.3 Synthetic Gradients

The previous experiments have shown how information about derivatives can boost the approximation of function values. However, the core idea of Sobolev Training is broader than that, and can be employed in both directions: if one ultimately cares about approximating derivatives, then additionally approximating values can help this process too. One recent technique which requires a model of gradients is Synthetic Gradients (SG) [12], a method for training complex neural networks in a decoupled, asynchronous fashion. In this section we show how we can use Sobolev Training for SG. The principle behind SG is that, instead of doing full backpropagation using the chain rule, one splits a network into two (or more) parts, and approximates the partial derivatives of the loss $L$ with respect to some hidden layer activations $h$ with a trainable function $SG(h, y|\theta)$. In other words, given that the network parameters up to $h$ are denoted by $\Theta$:

$$\frac{\partial L}{\partial \Theta} = \frac{\partial L}{\partial h} \frac{\partial h}{\partial \Theta} \approx SG(h, y|\theta)\, \frac{\partial h}{\partial \Theta}.$$

In the original SG paper, this module is trained to minimise $L_{SG}(\theta) = \big\| SG(h, y|\theta) - \partial L(p_h, y)/\partial h \big\|$, where $p_h$ is the final prediction of the main network for hidden activations $h$. For the case of learning a classifier, in order to apply Sobolev Training in this context we construct a loss predictor, composed
of a class predictor $p(\cdot|\theta)$ followed by the log loss, which gets supervision from the true loss, while the gradient of the prediction gets supervision from the true gradient:

$$m(h, y|\theta) := L\big(p(h|\theta), y\big), \qquad SG(h, y|\theta) := \frac{\partial m(h, y|\theta)}{\partial h},$$

$$L^{sob}_{SG}(\theta) = \ell\big(m(h, y|\theta), L(p_h, y)\big) + \ell_1\!\left(\frac{\partial m(h, y|\theta)}{\partial h},\; \frac{\partial L(p_h, y)}{\partial h}\right).$$

In the Sobolev Training framework, the target function is the loss of the main network, $L(p_h, y)$, for which we train a model $m(h, y|\theta)$, while additionally ensuring that the model's derivatives $\partial m(h, y|\theta)/\partial h$ are matched to the true derivatives $\partial L(p_h, y)/\partial h$. The model's derivatives $\partial m(h, y|\theta)/\partial h$ are then used as the synthetic gradient to decouple the main network.

This setting closely resembles what is known in reinforcement learning as critic methods [13]. In particular, if we do not provide supervision on the gradient part, we end up with a loss critic. Similarly, if we do not provide supervision at the loss level, but only on the gradient component, we end up with a method that resembles VFBN [25]. In light of these connections, our approach in this application setting can be seen as a generalisation and unification of several existing ones (see Table 1 for a summary of these approaches).

One could ask why we need these additional constraints, and what is gained over using a neural network based approximator directly [12]. The answer lies in the fact that gradient vector fields are a tiny subset of all vector fields: while every neural network produces a valid vector field, almost no (standard) neural network produces a valid gradient vector field. Using non-gradient vector fields as update directions for learning can have catastrophic consequences: learning divergence, oscillations, chaotic behaviour, etc. The following proposition makes this observation more formal:

**Proposition 4.** If an approximator $SG(h, y|\theta)$ produces a valid gradient vector field of some scalar function $L$, then the approximator's Jacobian matrix must be symmetric.

It is worth noting that having a symmetric Jacobian is an extremely rare property for a neural network model. For example, a linear model has a symmetric Jacobian if and only if its weight matrix is symmetric. If we sample weights iid from a typical distribution (such as a Gaussian, or a uniform distribution on an interval), the probability of sampling such a matrix is 0, though it could be learned with strong, symmetry-enforcing updates. For highly non-linear neural networks, on the other hand, it is not only improbable to randomly find such a model, but enforcing this constraint during learning also becomes much harder. This might be one of the reasons why linear SG modules work well in Jaderberg et al. [12], while non-linear convolutional SG struggled to achieve state-of-the-art performance. With the Sobolev-like approach, SG always produces a valid gradient vector field by construction, thus avoiding the problem described. A sketch of such a Sobolev-trained loss critic is given below.
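The following minimal sketch assumes a placeholder class predictor `predict_logits` (the critic's $p(h|\theta)$), one-hot labels, and squared errors for $\ell$ and $\ell_1$; all of these are our own illustrative choices.

```python
import jax
import jax.numpy as jnp

def critic_loss_value(theta, predict_logits, h, y):
    # m(h, y | theta): log loss of the critic's class prediction,
    # with y a one-hot label vector.
    logits = predict_logits(theta, h)
    return -jnp.sum(y * jax.nn.log_softmax(logits))

def sobolev_sg_loss(theta, predict_logits, h, y, true_loss, true_grad_h):
    # Value of the critic and its derivative w.r.t. activations h;
    # sg = dm/dh is the synthetic gradient, a valid gradient field
    # by construction.
    m_value, sg = jax.value_and_grad(critic_loss_value, argnums=2)(
        theta, predict_logits, h, y)
    # Match the true loss L(p_h, y) and the true gradient dL/dh.
    return (m_value - true_loss) ** 2 + jnp.sum((sg - true_grad_h) ** 2)
```

At decoupled-update time, the synthetic gradient is obtained as `jax.grad(critic_loss_value, argnums=2)(theta, predict_logits, h, y)`, with no true loss or gradient required.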
We perform experiments on decoupling deep convolutional neural network image classifiers using synthetic gradients produced by loss critics that are trained with Sobolev Training, and compare to regular loss-critic training and regular synthetic gradient training. We report results on CIFAR-10 for three network splits (and therefore three synthetic gradient modules), and on ImageNet with one and three network splits. The results are shown in Table 1. With a naive SG model, we obtain 79.2% test accuracy on CIFAR-10. Using an SG architecture which resembles a small version of the rest of the model makes learning much easier, leading to 88.5% accuracy, while Sobolev Training achieves 93.5% final performance. The regular critic also trains well, achieving 93.2%, as the critic forces the lower part of the network to provide a representation which it can use to reduce the classification (and not just prediction) error. Consequently it provides a learning signal which is well aligned with the main optimisation. However, this can lead to building representations which are suboptimal for the rest of the network. Adding the additional gradient supervision by constructing our Sobolev SG module avoids this issue by making sure that the synthetic gradients are truly aligned, and gives an additional boost to the final accuracy.

For the ImageNet [3] experiments, based on ResNet50 [8], we obtain qualitatively similar results. Due to the complexity of the model and an almost 40% gap between the no-backpropagation and full-backpropagation results, the difference between methods with and without loss supervision grows significantly. This suggests that, at least for ResNet-like architectures, loss supervision is a crucial component of an SG module. (N.b. the experiments presented use learning rates, annealing schedules, etc. optimised to maximise the backpropagation baseline, rather than the synthetic-gradient decoupled result; details are in the SM.) After splitting ResNet50 into four parts, the Sobolev SG achieves 87.4% top-5 accuracy, while the regular critic SG achieves 86.9%, confirming our claim about suboptimal representations being enforced by gradients from a regular critic. Sobolev Training results were also much more reliable in all experiments (significantly smaller standard deviation of the results).

## 5 Discussion and Conclusion

In this paper we have introduced Sobolev Training for neural networks: a simple and effective way of incorporating knowledge about the derivatives of a target function into the training of a neural network function approximator. We provided theoretical justification that encoding both a target function's value and its derivatives within a ReLU neural network is possible, and that this results in more data-efficient learning. Additionally, we showed that our proposal can be trained efficiently using stochastic approximations when computationally expensive Jacobians or Hessians are encountered. In addition to toy experiments which validate our theoretical claims, we performed experiments to highlight two very promising areas of application for such models: one being distillation/compression of models; the other being the application to various meta-optimisation techniques that build models of other models' dynamics (such as synthetic gradients, learning-to-learn, etc.). In both cases we obtain significant improvements over classical techniques, and we believe there are many other application domains in which our proposal should give a solid performance boost.

In this work we focused on encoding true derivatives in the corresponding ones of the neural network. Another possibility for future work is to encode information which one believes to be highly correlated with derivatives. For example, curvature [18] is believed to be connected to uncertainty; therefore, given a problem with known uncertainty at training points, one could use Sobolev Training to match the second-order signal to the provided uncertainty signal.
Finite differences can also be used to approximate gradients for black-box target functions, which could help when, for example, learning a generative temporal model. Another unexplored path would be to apply Sobolev Training to internal derivatives, rather than just derivatives with respect to the inputs.

## References

[1] Sonnet. https://github.com/deepmind/sonnet, 2017.

[2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

[4] Michael Fairbank and Eduardo Alonso. Value-gradient learning. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pages 1–8. IEEE, 2012.

[5] Michael Fairbank, Eduardo Alonso, and Danil Prokhorov. Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks. IEEE Transactions on Neural Networks and Learning Systems, 23(10):1671–1676, 2012.

[6] A Ronald Gallant and Halbert White. On learning the derivatives of an unknown mapping with multilayer feedforward networks. Neural Networks, 5(1):129–138, 1992.

[7] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[10] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

[11] Aapo Hyvärinen. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, pages 695–709, 2005.

[12] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.

[13] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In NIPS, volume 13, pages 1008–1014, 1999.

[14] Steven G Krantz. Handbook of Complex Variables. Springer Science & Business Media, 2012.

[15] W Thomas Miller, Paul J Werbos, and Richard S Sutton. Neural Networks for Control. MIT Press, 1995.

[16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[17] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[18] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.

[19] Salah Rifai, Grégoire Mesnil, Pascal Vincent, Xavier Muller, Yoshua Bengio, Yann Dauphin, and Xavier Glorot. Higher order contractive auto-encoder.
In Machine Learning and Knowledge Discovery in Databases, pages 645–660, 2011.

[20] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.

[21] Bharat Bhusan Sau and Vineeth N Balasubramanian. Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650, 2016.

[22] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[23] Patrice Simard, Bernard Victorri, Yann LeCun, and John S Denker. Tangent prop: a formalism for specifying selected invariances in an adaptive network. In NIPS, volume 91, pages 895–903, 1991.

[24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[25] Takeru Miyato, Daisuke Okanohara, Shin-ichi Maeda, and Masanori Koyama. Synthetic gradient methods with virtual forward-backward networks. ICLR Workshop Proceedings, 2017.

[26] Yuval Tassa and Tom Erez. Least squares solutions of the HJB equation with neural network value-function approximators. IEEE Transactions on Neural Networks, 18(4):1031–1041, 2007.

[27] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.

[28] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

[29] Paul J Werbos. Approximate dynamic programming for real-time control and neural modeling. Handbook of Intelligent Control, 1992.

[30] Anqi Wu, Mikio C Aoi, and Jonathan W Pillow. Exploiting gradients and hessians in bayesian optimization and bayesian quadrature. arXiv preprint arXiv:1704.00060, 2017.

[31] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.