# Sampling weights of deep neural networks

Erik Lien Bolager, Iryna Burak, Chinmay Datar, Qing Sun, Felix Dietrich
Technical University of Munich, School of Computation, Information and Technology; Institute for Advanced Study
Corresponding author: felix.dietrich@tum.de. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

We introduce a probability distribution, combined with an efficient sampling algorithm, for weights and biases of fully-connected neural networks. In a supervised learning context, no iterative optimization or gradient computations of internal network parameters are needed to obtain a trained network. The sampling is based on the idea of random feature models. However, instead of a data-agnostic distribution, e.g., a normal distribution, we use both the input and the output training data to sample shallow and deep networks. We prove that sampled networks are universal approximators. For Barron functions, we show that the L2-approximation error of sampled shallow networks decreases with the square root of the number of neurons. Our sampling scheme is invariant to rigid body transformations and scaling of the input data, which implies many popular pre-processing techniques are not required. In numerical experiments, we demonstrate that sampled networks achieve accuracy comparable to iteratively trained ones, but can be constructed orders of magnitude faster. Our test cases involve a classification benchmark from OpenML, sampling of neural operators to represent maps in function spaces, and transfer learning using well-known architectures.

1 Introduction

Training deep neural networks involves finding all weights and biases. Typically, iterative, gradient-based methods are employed to solve this high-dimensional optimization problem. Randomly sampling all weights and biases before the last, linear layer circumvents this optimization and results in much shorter training time. However, the drawback of this approach is that the probability distribution of the parameters must be chosen. Random projection networks [54] or extreme learning machines [30] involve weight distributions that are completely problem- and data-agnostic, e.g., a normal distribution. In this work, we introduce a data-driven sampling scheme to construct weights and biases close to gradients of the target function (cf. Figure 1). This idea provides a solution to three main challenges that have prevented randomly sampled networks from competing successfully against iterative training in the setting of supervised learning: deep networks, accuracy, and interpretability.

Figure 1: Random feature models choose weights in a data-agnostic way, compared to sampling them where it matters: at large gradients. The arrows illustrate where the network weights are placed.

Deep neural networks. Random feature models and extreme learning machines are typically defined for networks with a single hidden layer. Our sampling scheme accounts for the high-dimensional ambient space that is introduced after this layer, and thus deep networks can be constructed efficiently.

Approximation accuracy. Gradient-based, iterative approximation can find accurate solutions with a relatively small number of neurons. Randomly sampling weights using a data-agnostic distribution often requires thousands of neurons to compete. Our sampling scheme takes into account the given training data points and function values, leading to accurate and width-efficient approximations.
The distribution also leads to invariance to orthogonal transformations and scaling of the input data, which makes many common pre-processing techniques redundant.

Interpretability. Sampling weights and biases completely randomly, i.e., without taking into account the given data, leads to networks that do not incorporate any information about the given problem. We analyze and extend a recently introduced weight construction technique [27] that uses the direction between pairs of data points to construct individual weights and biases. In addition, we propose a sampling distribution over these data pairs that leads to efficient use of weights; cf. Figure 1.

2 Related work

Regarding random Fourier features, Li et al. [41] and Liu et al. [43] review and unify theory and algorithms of this approach. Random features have been used to approximate input-output maps in Banach spaces [50] and to solve partial differential equations [16, 48, 10]. Gallicchio and Scardapane [28] provide a review of deep random feature models, and discuss autoencoders and reservoir computing (resp. echo-state networks [34]). The latter are randomly sampled, recurrent networks to model dynamical systems [6].

Regarding the construction of features, Monte Carlo approximation of data-dependent parameter distributions is used towards faster kernel approximation [1, 59, 47]. Our work differs in that we do not start with a kernel and decompose it into random features; instead, we start with a practical and interpretable construction of random features and then discuss their approximation properties. This may also help to construct activation functions similar to collocation [62]. Fiedler et al. [23] and Fornasier et al. [24] prove that for given, comparatively small networks with one hidden layer, all weights and biases can be recovered exactly by evaluating the network at specific points in the input space. The work of Spek et al. [60] showed a certain duality between weight spaces and data spaces, albeit in a purely theoretical setting. Recent work from Bollt [7] analyzes individual weights in networks by visualizing the placement of ReLU activation functions in space.

Regarding approximation errors and convergence rates of networks, Barron spaces are very useful [2, 20], also to study regularization techniques, especially Tikhonov and total variation [40]. A lot of work [54, 19, 17, 57, 65] surrounds the approximation rate of $O(m^{-1/2})$ for neural networks with one hidden layer of width $m$, originally proved by Barron [2]. The rate, but not the constant, is independent of the input space dimension. This implies that neural networks can mitigate the curse of dimensionality, as opposed to many approximation methods with fixed, non-trained basis functions [14], including random feature models with data-agnostic probability distributions. The convergence rates of over-parameterized networks with one hidden layer are considered in [19], with a comparison to the Monte Carlo approximation. In our work, we prove the same convergence rate for our networks. Regarding deep networks, E and Wojtowytsch [17, 18] discuss simple examples that are not Barron functions, i.e., cannot be represented by shallow networks. Shallow [15] and deep random feature networks [29] have also been analyzed regarding classification accuracy.

Regarding different sampling techniques, Bayesian neural networks are prominent examples [49, 5, 26, 61].
The goal is to learn a good posterior distribution and ultimately express uncertainty around both the weights and the output of the network. These methods are computationally often on par with or worse than iterative optimization. In this work, we directly relate data points and weights, while Bayesian neural networks mostly employ distributions only over the weights. Generative modeling has been proposed as a way to sample weights from existing, trained networks [56, 51]. It may be interesting to consider our sampled weights as a training set in this context. In the lottery ticket hypothesis [25, 8], winning subnetworks are often not trained, but selected from a randomly initialized starting network, which is similar to our approach. Still, the score computation during selection requires gradient updates. Most relevant to our work is the weight construction method by Galaris et al. [27], who proposed to use pairs of data points to construct weights. Their primary goal was to randomly sample weights that capture low-dimensional structures. No further analysis was provided, and only a uniform distribution over the data pairs was used. We expand and analyze their setting here.

3 Mathematical framework

We introduce sampled networks, which are neural networks where each pair of weight and bias of all hidden layers is completely determined by two points from the input space. This duality between weights and data has been shown theoretically [60]; here, we provide an explicit relation. The weights are constructed using the difference between the two points, and the bias is the inner product between the weight and one of the two points. After all hidden layers are constructed, we must only solve an optimization problem for the coefficients of a linear layer, mapping the output from the last hidden layer to the final output.

We start to formalize this construction by introducing some notation. Let $X \subseteq \mathbb{R}^D$ be the input space, with $\|\cdot\|$ being the Euclidean norm with inner product $\langle \cdot, \cdot \rangle$. Further, let $\Phi$ be a neural network with $L$ hidden layers, parameters $\{W_l, b_l\}_{l=1}^{L+1}$, and activation function $\phi \colon \mathbb{R} \to \mathbb{R}$. For $x \in X$, we write $\Phi^{(l)}(x) = \phi(W_l \Phi^{(l-1)}(x) - b_l)$ as the output of the $l$th layer, with $\Phi^{(0)}(x) = x$. The two activation functions we focus on are the rectified linear unit (ReLU), $\phi(x) = \max\{x, 0\}$, and the hyperbolic tangent (tanh). We set $N_l$ to be the number of neurons in the $l$th layer, with $N_0 = D$ and $N_{L+1}$ as the output dimension. We write $w_{l,i}$ for the $i$th row of $W_l$ and $b_{l,i}$ for the $i$th entry of $b_l$.

Building upon work of Galaris et al. [27], we now introduce sampled networks. The probability distribution to sample pairs of data points is arbitrary here, but we will refine it in Definition 2. We use $\mathcal{L}$ to denote the loss of our network we would like to minimize.

Definition 1. Let $\Phi$ be a neural network with $L$ hidden layers. For $l = 1, \dots, L$, let $(x^{(1)}_{0,i}, x^{(2)}_{0,i})_{i=1}^{N_l}$ be pairs of points sampled over $X \times X$. We say $\Phi$ is a sampled network if the weights and biases of every layer $l = 1, 2, \dots, L$ and neurons $i = 1, 2, \dots, N_l$ are of the form
$$w_{l,i} = s_1 \frac{x^{(2)}_{l-1,i} - x^{(1)}_{l-1,i}}{\big\|x^{(2)}_{l-1,i} - x^{(1)}_{l-1,i}\big\|^2}, \qquad b_{l,i} = \big\langle w_{l,i}, x^{(1)}_{l-1,i} \big\rangle + s_2, \qquad (1)$$
where $s_1, s_2 \in \mathbb{R}$ are constants, $x^{(j)}_{l-1,i} = \Phi^{(l-1)}(x^{(j)}_{0,i})$ for $j = 1, 2$, and $x^{(1)}_{l-1,i} \neq x^{(2)}_{l-1,i}$. The last set of weights and biases is $W_{L+1}, b_{L+1} = \arg\min \mathcal{L}(W_{L+1}\Phi^{(L)}(\cdot) - b_{L+1})$.

The constants $s_1, s_2$ are used to fix what values the activation function takes on when it is applied to the points $x^{(1)}, x^{(2)}$; cf. Figure 2.
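To make Equation (1) concrete, the following is a minimal NumPy sketch of the weight and bias construction for one hidden layer, assuming the data pairs have already been chosen and mapped through the previous layers; the function and variable names are illustrative and not taken from the paper's code base.

```python
import numpy as np

def sample_layer(x1, x2, s1=1.0, s2=0.0):
    """Construct (W, b) from data pairs as in Equation (1).

    x1, x2 : arrays of shape (n_neurons, dim) holding the pairs
             x^(1)_i, x^(2)_i (already mapped through previous layers).
    s1, s2 : activation-dependent constants (ReLU: s1=1, s2=0).
    """
    diff = x2 - x1                                    # x^(2) - x^(1)
    norms_sq = np.sum(diff**2, axis=1, keepdims=True)
    W = s1 * diff / norms_sq                          # rows are the weights w_i
    b = np.sum(W * x1, axis=1) + s2                   # b_i = <w_i, x^(1)_i> + s2
    return W, b

# Example: one hidden ReLU layer built from three random 2D data pairs.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
W, b = sample_layer(x1, x2)
relu = lambda z: np.maximum(z, 0.0)
# By construction, each neuron maps its own x^(1) to 0 and its x^(2) to 1.
print(relu(W @ x1.T - b[:, None]).diagonal())  # ~[0, 0, 0]
print(relu(W @ x2.T - b[:, None]).diagonal())  # ~[1, 1, 1]
```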
For ReLU, we set $s_1 = 1$ and $s_2 = 0$, so that the corresponding neuron maps $x^{(1)}$ to $0$ and $x^{(2)}$ to $1$. For tanh, we set $s_1 = 2s_2$ and $s_2 = \ln(3)/2$, which implies the neuron maps $x^{(1)}$ to $-1/2$ and $x^{(2)}$ to $1/2$, respectively, and the midpoint $\tfrac{1}{2}(x^{(1)} + x^{(2)})$ to $0$. This means that in a regression problem with ReLU, we linearly interpolate values between the two points. For classification, the tanh construction introduces a boundary if $x^{(1)}$ belongs to a different class than $x^{(2)}$. We will use this idea later to define a useful distribution over pairs of points (cf. Definition 2).

Figure 2: Placement of the point pairs $x^{(1)}, x^{(2)}$ for activation functions ReLU (left) and tanh (right). Two data pairs are chosen in each subfigure, resulting in two activation functions on each data domain.

The space of functions that sampled networks can approximate is not immediately clear. First, we are only using points in the input space to construct both the weights and the biases, instead of letting them take on any value. Second, there is a dependence between the bias and the direction of the weight. Third, for deep networks, the sampling space changes after each layer. These apparent restrictions require investigation into which functions we can approximate.

We assume that the input space in Theorem 1 and Theorem 2 extends into its ambient space $\mathbb{R}^D$ as follows. Let $X'$ be any compact subset of $\mathbb{R}^D$ with finite reach $\tau > 0$. Informally, such a set has a boundary that does not change too quickly [12]. We then set the input space $X$ to be the set of points including $X'$ and those that are at most $\epsilon_I$ away from $X'$, given the canonical distance function in $\mathbb{R}^D$, where $0 < \epsilon_I < \tau$. In Theorem 1, we also consider $L$ layers to show that the construction of weights in deep networks does not destroy the space of functions that networks with one hidden layer can approximate, even though we alter the space of weights we can construct when $L > 1$.

Theorem 1. For any number of layers $L \in \mathbb{N}_{>0}$, the space of sampled networks with $L$ hidden layers, $\Phi \colon X \to \mathbb{R}^{N_{L+1}}$, with activation function ReLU, is dense in $C(X, \mathbb{R}^{N_{L+1}})$.

Sketch of proof: We show that the possible biases that sampled networks can produce are all we need inside a neuron, and the rest can be added in the last linear part and with additional neurons. We then show that we can approximate any neural network with one hidden layer of width $N_1$ using at most $6N_1$ neurons, which is not much, considering the cost of sampling versus backpropagation. We then show that we can construct weights so that we preserve the information of $X$ through the first $L-1$ layers, and then we use the arbitrary-width result applied to the last hidden layer. The full proof can be found in Appendix A.1.

We also prove a similar theorem for networks with tanh activation function and one hidden layer. The proof differs fundamentally, because tanh is not positive homogeneous.

We now show the existence of sampled networks for which the $L^2$ approximation error of Barron functions is bounded (cf. Theorem 2). We later demonstrate that we can actually construct such networks (cf. Section 4.1). The Barron space [2, 20] is defined as
$$\mathcal{B} = \Big\{ f : f(x) = \int_\Omega w_2\, \phi(\langle w_1, x \rangle - b)\, d\mu(b, w_1, w_2) \ \text{and}\ \|f\|_{\mathcal{B}} < \infty \Big\},$$
with $\phi$ being the ReLU function, $\Omega = \mathbb{R} \times \mathbb{R}^D \times \mathbb{R}$, and $\mu$ a probability measure over $\Omega$. The Barron norm is defined as $\|f\|_{\mathcal{B}} = \inf_\mu \max_{(b, w_1, w_2) \in \operatorname{supp}(\mu)} \{ |w_2| (\|w_1\|_1 + |b|) \}$.
Theorem 2. Let $f \in \mathcal{B}$ and $X = [0,1]^D$. For any $N_1 \in \mathbb{N}_{>0}$, $\epsilon > 0$, and an arbitrary probability measure $\pi$, there exist sampled networks $\Phi$ with one hidden layer, $N_1$ neurons, and ReLU activation function, such that
$$\|f - \Phi\|_2^2 = \int_X |f(x) - \Phi(x)|^2 \, d\pi(x) < \frac{(3 + \epsilon)\|f\|_{\mathcal{B}}^2}{N_1}.$$

Sketch of proof: It quickly follows from the results of E et al. [20], who showed it for regular neural networks, and Theorem 1. The full proof can be found in Appendix A.2.

Up until now, we have been concerned with the space of sampled networks, but not with the distribution of the parameters. We found that putting emphasis on pairs of points that are close to each other but differ a lot with respect to the output of the true function works well. As we want to sample layers sequentially, and neurons in each layer independently from each other, we give a layer-wise conditional definition below. The norms $\|\cdot\|_{X_{l-1}}$ and $\|\cdot\|_Y$ that define the following densities are arbitrary over their respective spaces, denoted by the subscript.

Definition 2. Let $f \colon \mathbb{R}^D \to \mathbb{R}^{N_{L+1}}$ be Lipschitz-continuous and set $Y = f(X)$. For any $l \in \{1, 2, \dots, L\}$, setting $\epsilon = 0$ when $l = 1$ and otherwise $\epsilon > 0$, we define
$$q_l^\epsilon\big(x_0^{(1)}, x_0^{(2)} \,\big|\, \{W_j, b_j\}_{j=1}^{l-1}\big) = \begin{cases} \dfrac{\big\|f(x_0^{(2)}) - f(x_0^{(1)})\big\|_Y}{\max\big\{\big\|x_{l-1}^{(2)} - x_{l-1}^{(1)}\big\|_{X_{l-1}},\ \epsilon\big\}}, & x_{l-1}^{(1)} \neq x_{l-1}^{(2)}, \\ 0, & \text{otherwise}, \end{cases} \qquad (2)$$
where $x_0^{(1)}, x_0^{(2)} \in X$, $x_{l-1}^{(1)} = \Phi^{(l-1)}(x_0^{(1)})$, and $x_{l-1}^{(2)} = \Phi^{(l-1)}(x_0^{(2)})$, with the network $\Phi^{(l-1)}$ parameterized by the sampled $\{W_j, b_j\}_{j=1}^{l-1}$. Then, using $\lambda$ as the Lebesgue measure, we define the integration constant $C_l = \int_{X \times X} q_l^\epsilon \, d\lambda$. The density $p_l^\epsilon$ to sample pairs of points for weights and biases in layer $l$ is equal to $q_l^\epsilon / C_l$ if $C_l > 0$, and uniform over $X \times X$ otherwise.

Note that here, a distribution over pairs of points is equivalent to a distribution over weights and biases, and the additional $\epsilon$ is a regularization term. Now we can sample each layer sequentially, starting with $l = 1$, using the conditional density $p_l^\epsilon$. This induces a probability distribution $P$ over the full parameter space, which, with the given regularity conditions on $X$ and $f$, is a valid probability distribution. For a complete definition of $P$ and a proof of validity, see Appendix A.3.

Using this distribution also comes with the benefit that sampling and training are not affected by rigid body transformations (affine transformations with an orthogonal matrix) and scaling, as long as the true function $f$ is equivariant w.r.t. the transformation. That is, if $H$ is such a transformation, we say $f$ is equivariant with respect to $H$ if there exists a scalar and rigid body transformation $H'$ such that $H'(f(x)) = f(H(x))$ for all $x \in X$, and invariant if $H'$ is the identity function. We also assume that the norms $\|\cdot\|_Y$ and $\|\cdot\|_{X_0}$ in Equation (2) are chosen such that orthogonal matrix multiplication is an isometry.

Theorem 3. Let $f$ be the target function, equivariant w.r.t. a scalar and rigid body transformation $H$. If we have two sampled networks $\Phi, \hat{\Phi}$ with the same number of hidden layers $L$ and neurons $N_1, \dots, N_L$, where $\Phi \colon X \to \mathbb{R}^{N_{L+1}}$ and $\hat{\Phi} \colon H(X) \to \mathbb{R}^{N_{L+1}}$, then the following statements hold for all $x \in X$: (1) If $\hat{x}_{0,i}^{(1)} = H(x_{0,i}^{(1)})$ and $\hat{x}_{0,i}^{(2)} = H(x_{0,i}^{(2)})$ for all $i = 1, 2, \dots, N_1$, then $\Phi^{(1)}(x) = \hat{\Phi}^{(1)}(H(x))$. (2) If $f$ is invariant w.r.t. $H$, then for any parameters of $\Phi$, there exist parameters of $\hat{\Phi}$ such that $\Phi(x) = \hat{\Phi}(H(x))$, and vice versa. (3) The probability measure $P$ over the parameters is invariant under $H$.

Sketch of proof: Any neuron in the sampled network can be written as $\phi\big(s_1 \big\langle \tfrac{w}{\|w\|^2},\, x - x^{(1)} \big\rangle - s_2\big)$ with $w = x^{(2)} - x^{(1)}$. As we divide by the square of the norm of $w$, the scalar in $H$ cancels.
There is a difference between two points in both arguments of the inner product, which means the translation cancels. Orthogonal matrices cancel due to isometry. When $f$ is invariant with respect to $H$, the loss function is also unchanged and leads to the same output. A similar argument applies to $P$, and the theorem follows (cf. Appendix A.3).

If the input is embedded in a higher-dimensional ambient space $\mathbb{R}^{D'}$, with $D < D'$, we sample from a subspace with dimension $D = \dim(\operatorname{span}\{X\}) \leq D'$. All the results presented in this section still hold, due to orthogonal decomposition. However, the standard approach of backpropagation and initialization allows the weights to take on any value in $\mathbb{R}^{D'}$, which implies a lot of redundancy when $D \ll D'$. The biases are also more relevant to the input space than when initialized to zero, potentially avoiding the issues highlighted by Holzmüller and Steinwart [32]. For these reasons, we have named the proposed method Sampling Where It Matters (SWIM), which is summarized in Algorithm 1. For computational reasons, we consider a random subset of all possible pairs of training points when sampling weights and biases.

We end this section with a time and memory complexity analysis of Algorithm 1. In Table 1, we list runtime and memory usage for three increasingly strict assumptions. The main assumption is that the dimension of the output is less than or equal to the largest dimension of the hidden layers. This is true for the problems we consider, and the difference in runtime without this assumption is only reflected in the linear optimization part. The term $\lceil N/M \rceil$, i.e., the integer ceiling of $N/M$, is required because only a subset of points is considered when sampling. For the full analysis, see Appendix F.

4 Numerical experiments

We now demonstrate the performance of Algorithm 1 on numerical examples. Our implementation is based on the numpy and scipy Python libraries, and we run all experiments on a machine with 32GB system RAM (256GB in Section 4.3 and Section 4.4) and a GeForce RTX 3080 Turbo GPU with 10GB RAM. The Appendix contains detailed information on all experiments. In Section 4.1 we compare sampling to random Fourier feature models regarding the approximation of a Barron function. In Section 4.2 we compare the classification accuracy of sampled networks to iterative, gradient-based optimization in a classification benchmark with real datasets. In Section 4.3 we demonstrate that more specialized architectures can be sampled, by constructing deep neural architectures as PDE solution operators. In Section 4.4 we show how to use sampling of fully-connected layers for transfer learning. For the probability distribution over the pairs in Algorithm 1, we always choose the $L^\infty$ norm for $\|\cdot\|_Y$ and, for $l = 1, 2, \dots, L$, the $L^2$ norm for $\|\cdot\|_{X_{l-1}}$. The code to reproduce the experiments from the paper, and an up-to-date code base, can be found at https://gitlab.com/felix.dietrich/swimnetworks-paper, https://gitlab.com/felix.dietrich/swimnetworks.
Algorithm 1: The SWIM algorithm, for activation function $\phi$ and norms on the input, the output of the hidden layers, and the output space, $\|\cdot\|_{X_0}$, $\|\cdot\|_{X_l}$, and $\|\cdot\|_Y$, respectively. $\mathcal{L}$ is a loss function, which in our case is always the $L^2$ loss, so that $\arg\min \mathcal{L}(\cdot, \cdot)$ becomes a linear optimization problem.

Constants: $\epsilon \in \mathbb{R}_{>0}$, $\varsigma \in \mathbb{N}_{>0}$, $L \in \mathbb{N}_{>0}$, $\{N_l \in \mathbb{N}_{>0}\}_{l=1}^{L+1}$, and $s_1, s_2 \in \mathbb{R}$
Data: $X = \{x_i : x_i \in \mathbb{R}^D,\ i = 1, 2, \dots, M\}$, $Y = \{y_i : f(x_i) = y_i \in \mathbb{R}^{N_{L+1}},\ i = 1, 2, \dots, M\}$

$\Phi^{(0)}(x) = x$
for $l = 1, 2, \dots, L$ do
    $\tilde{M} \leftarrow \min\{\varsigma N_l,\ M \cdot M\}$;  $P^{(l)} \in \mathbb{R}^{\tilde{M}}$, $P^{(l)}_i \leftarrow 0$ for all $i$
    $\tilde{X} = \{(x^{(1)}_i, x^{(2)}_i) : \Phi^{(l-1)}(x^{(1)}_i) \neq \Phi^{(l-1)}(x^{(2)}_i)\}_{i=1}^{\tilde{M}} \sim \mathrm{Uniform}(X \times X)$
    for $i = 1, 2, \dots, \tilde{M}$ do
        $\tilde{x}^{(1)}_i, \tilde{x}^{(2)}_i \leftarrow \Phi^{(l-1)}(x^{(1)}_i), \Phi^{(l-1)}(x^{(2)}_i)$
        $y^{(1)}_i, y^{(2)}_i = f(x^{(1)}_i), f(x^{(2)}_i)$
        $P^{(l)}_i \leftarrow \|y^{(2)}_i - y^{(1)}_i\|_Y \,/\, \max\{\|\tilde{x}^{(2)}_i - \tilde{x}^{(1)}_i\|_{X_{l-1}},\ \epsilon\}$
    end
    $W_l \in \mathbb{R}^{N_l \times N_{l-1}}$, $b_l \in \mathbb{R}^{N_l}$
    for $k = 1, 2, \dots, N_l$ do
        Sample $(x^{(1)}, x^{(2)})$ from $\tilde{X}$, with replacement and with probability proportional to $P^{(l)}$
        $\tilde{x}^{(1)}, \tilde{x}^{(2)} \leftarrow \Phi^{(l-1)}(x^{(1)}), \Phi^{(l-1)}(x^{(2)})$
        $W^{(k,:)}_l \leftarrow s_1 (\tilde{x}^{(2)} - \tilde{x}^{(1)}) / \|\tilde{x}^{(2)} - \tilde{x}^{(1)}\|^2$;  $b^{(k)}_l \leftarrow \langle W^{(k,:)}_l, \tilde{x}^{(1)} \rangle + s_2$
    end
    $\Phi^{(l)}(\cdot) \leftarrow \phi(W_l \Phi^{(l-1)}(\cdot) - b_l)$
end
$W_{L+1}, b_{L+1} \leftarrow \arg\min \mathcal{L}(\Phi^{(L)}(X), Y)$
return $\{W_l, b_l\}_{l=1}^{L+1}$

Table 1: Runtime and memory requirements for training sampled neural networks with the SWIM algorithm, where $N = \max\{N_0, N_1, N_2, \dots, N_L\}$. Assumption (I) is that the output dimension is less than or equal to $N$. Assumption (II) adds that $N < M^2$, i.e., the number of neurons and the input dimension are less than the size of the dataset squared. Assumption (III) additionally requires a fixed architecture.

| Assumption | Runtime | Memory |
| --- | --- | --- |
| (I) | $O(LM(\min\{\lceil N/M \rceil, M\} + N^2))$ | $O(M \min\{\lceil N/M \rceil, M\} + LN^2)$ |
| (II) | $O(LM(\lceil N/M \rceil + N^2))$ | $O(M\lceil N/M \rceil + LN^2)$ |
| (III) | $O(M)$ | $O(M)$ |

4.1 Illustrative example: approximating a Barron function

We compare random Fourier features and our sampling procedure on a test function for neural networks [64]: $f(x) = \sqrt{3/2}\,(\|x - a\| - \|x + a\|)$, with $x \in \mathbb{R}^D$ and the vector $a \in \mathbb{R}^D$ defined by $a_j = 2j/D - 1$, $j = 1, 2, \dots, D$. The Barron norm of $f$ is equal to one for all input dimensions, and it can be represented exactly with a network with one infinitely wide hidden layer, ReLU activation, and weights uniformly distributed on a sphere of radius $D^{1/2}$. We approximate $f$ using networks $\hat{f}$ of up to three hidden layers. The error is defined by $e_{\mathrm{rel}}^2 = \sum_{x \in X}(f(x) - \hat{f}(x))^2 / \sum_{x \in X} f(x)^2$. We compare this error over the domain $X = [-1, 1]^D$, with 10,000 points sampled uniformly, separately for training and test sets. For random features, we use $w \sim N(0, 1)$, $b \sim U(-\pi, \pi)$, as proposed in [54], and $\phi = \sin$. For sampling, we also use $\phi = \sin$ to obtain a fair comparison. We also observed similar accuracy results when repeating the experiment with the tanh function. The number of neurons $m$ is the same in each hidden layer and ranges from $m = 64$ up to $m = 4096$.

Figure 3: Relative L2 approximation error of a Barron function (test set), using random features and sampling, both with sine activation, for one to three hidden layers; reference lines $m^{-1/2}$ and $m^{-1}$ are shown. Left: input dimension D = 5. Right: input dimension D = 10.

Figure 3 shows results for D = 5, 10 (results are similar for D = 2, 3, 4, and sampled networks can be constructed as fast as the random feature method, cf. Appendix B). Random features here have comparable accuracy for networks with one hidden layer, but very poor performance for deeper networks. This may be explained by the much larger ambient space dimension of the data after it is processed through the first hidden layer. With our sampling method, we obtain accurate results even with more layers. The convergence rate for D < 10 seems to be faster than the theoretical rate.
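To connect Definition 2, Algorithm 1, and the experiment above, the following self-contained sketch samples a one-hidden-layer network for the Barron test function. It uses ReLU instead of the sine activation of Figure 3, a simplified candidate pool of pairs, and illustrative names that are not taken from the paper's code base.

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, n_train, n_test = 5, 512, 10_000, 10_000

# Barron test function from Section 4.1: f(x) = sqrt(3/2) * (||x - a|| - ||x + a||).
a = 2.0 * np.arange(1, D + 1) / D - 1.0
f = lambda x: np.sqrt(1.5) * (np.linalg.norm(x - a, axis=1) - np.linalg.norm(x + a, axis=1))

X = rng.uniform(-1.0, 1.0, size=(n_train, D))
y = f(X)

# Step 1: draw a pool of candidate pairs uniformly and weight them by
# ||f(x2) - f(x1)|| / ||x2 - x1|| (Definition 2, first layer, eps = 0).
pool = 10 * m
i1, i2 = rng.integers(0, n_train, size=pool), rng.integers(0, n_train, size=pool)
keep = np.linalg.norm(X[i1] - X[i2], axis=1) > 1e-12
i1, i2 = i1[keep], i2[keep]
p = np.abs(y[i2] - y[i1]) / np.linalg.norm(X[i2] - X[i1], axis=1)
p /= p.sum()

# Step 2: sample m pairs proportional to p and build the hidden layer
# via Equation (1) with ReLU constants s1 = 1, s2 = 0.
idx = rng.choice(len(p), size=m, replace=True, p=p)
x1, x2 = X[i1[idx]], X[i2[idx]]
diff = x2 - x1
W = diff / np.sum(diff**2, axis=1, keepdims=True)
b = np.sum(W * x1, axis=1)
hidden = lambda x: np.maximum(x @ W.T - b, 0.0)

# Step 3: fit only the last linear layer by least squares.
H = np.hstack([hidden(X), np.ones((n_train, 1))])
coef, *_ = np.linalg.lstsq(H, y, rcond=None)

# Relative L2 error on an independent test set.
Xt = rng.uniform(-1.0, 1.0, size=(n_test, D))
yt = f(Xt)
pred = np.hstack([hidden(Xt), np.ones((n_test, 1))]) @ coef
print("relative L2 error:", np.linalg.norm(pred - yt) / np.linalg.norm(yt))
```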
4.2 Classification benchmark from OpenML

We use the OpenML-CC18 Curated Classification benchmark [4] with all its 72 tasks to compare our sampling method to the Adam optimizer [36]. With both methods, we separately perform neural architecture search, changing the number of hidden layers from 1 to 5. All layers always have 500 neurons. Details of the training are in Appendix C. Figure 4 shows the benchmark results. On all tasks, sampling networks is faster than training them iteratively (on average, 30 times faster). The classification accuracy is comparable (cf. Figure 4, second and third plot). The best number of layers for each problem is slightly higher for the Adam optimizer (cf. Figure 4, fourth plot).

Figure 4: Fitting time, accuracy, and number of layers using weight sampling, compared to training with the Adam optimizer. The best architecture is chosen separately for each method and each problem, by evaluating 10-fold cross-validation error over 1-5 layers with 500 neurons each.

4.3 Deep neural operators

We sample deep neural operators and compare their speed and accuracy to iterative gradient-based training of the same architectures. As a test problem, we consider the Burgers equation, $\partial_t u + u\, \partial_x u = \nu\, \partial_x^2 u$, for $x \in (0, 2\pi)$, $t \in (0, 1]$, with periodic boundary conditions and viscosity $\nu = 0.1$. The goal is to predict the solution at $t = 1$ from the initial condition at $t = 0$. Thus, we construct neural operators that represent the map $\mathcal{G} \colon u(x, 0) \mapsto u(x, 1)$. We generate initial conditions by sampling five Fourier coefficients of lowest frequency and restoring the function values from these coefficients. Using a classical numerical solver, we generate 15000 pairs of $(u(x, 0), u(x, 1))$, and split them into the train (60%), validation (20%), and test (20%) sets. Figure 5 shows samples from the generated dataset.

4.3.1 Fully-connected network in signal space

The first baseline for the task is a fully-connected network (FCN) with tanh activation, trained to predict the discretized solution from the discretized initial condition. We trained the classical version using the Adam optimizer and the mean squared error as a loss function. We also performed early stopping based on the mean relative L2-error on the validation set. For sampling, we use Algorithm 1 to construct a fully-connected network with tanh as the activation function.

4.3.2 Fully-connected network in Fourier space

Similarly to Poli et al. [53], we train a fully-connected network in Fourier space. For training, we perform a Fourier transform on the initial condition and the solution, keeping only the lowest frequencies. We always split the complex coefficients into real and imaginary parts, and train a standard FCN on the transformed data. The reported metrics are in signal space, i.e., after the inverse Fourier transform. For sampling, we perform exactly the same pre-processing steps.

4.3.3 POD-DeepONet

The third architecture considered here is a variation of the deep operator network (DeepONet) architecture [44]. The original DeepONet consists of two trainable components: the trunk net, which transforms the coordinates of an evaluation point, and the branch net, which transforms the function values on some grid.
The outputs of these nets are then combined into the prediction of the whole network,
$$\mathcal{G}(u)(y) = \sum_{k=1}^{p} b_k(u)\, t_k(y) + b_0,$$
where $u$ is a discretized input function; $y$ is an evaluation point; $[t_1, \dots, t_p]^T \in \mathbb{R}^p$ are the $p$ outputs of the trunk net; $[b_1, \dots, b_p]^T \in \mathbb{R}^p$ are the $p$ outputs of the branch net; and $b_0$ is a bias. DeepONet sets no restrictions on the architecture of the two nets, but often fully-connected networks are used for one-dimensional input. POD-DeepONet, proposed by Lu et al. [46], first assumes that the evaluation points lie on the input grid. It performs proper orthogonal decomposition (POD) of the discretized solutions in the training data and uses its components instead of the trunk net to compute the outputs
$$\mathcal{G}(u)(\xi_j) = \sum_{k=1}^{p} b_k(u)\, \psi_k(\xi_j) + \psi_0(\xi_j).$$
Here, $[\psi_1(\xi_j), \dots, \psi_p(\xi_j)]$ are $p$ precomputed POD components for a point $\xi_j$, and $\psi_0(\xi_j)$ is the mean of the discretized solutions evaluated at $\xi_j$. Hence, only the branch net is trained in POD-DeepONet. We followed Lu et al. [46] and applied a scaling of $1/p$ to the network output.

For sampling, we exploit the orthogonality of the components and turn POD-DeepONet into a fully-connected network. Let $\xi = [\xi_1, \dots, \xi_n]$ be the grid used to discretize the input function $u$ and to evaluate the output function $\mathcal{G}(u)$. Then the POD components of the training data are $\Psi(\xi) = [\psi_1(\xi), \dots, \psi_p(\xi)] \in \mathbb{R}^{n \times p}$. If $b(u) \in \mathbb{R}^p$ is the output vector of the branch net, the POD-DeepONet transformation can be written as $\mathcal{G}(u)(\xi) = \Psi b(u) + \psi_0(\xi)$. As $\Psi^T \Psi = I_p$, we can express the output of the branch net as $b(u) = \Psi^T(\mathcal{G}(u)(\xi) - \psi_0(\xi))$. Using this equation, we can transform the training data to sample a fully-connected network for $b(u)$. We again use tanh as the activation function for sampling.

Figure 5: Left: Samples of initial conditions $u_0$ and corresponding solutions $u_1$ for the Burgers equation. Right: Parameters of the best model for each architecture, the mean relative L2 error on the test set, and the training time. We average the metrics across three runs with different random seeds.

| Model | Width | Layers | Mean rel. L2 error | Time |
| --- | --- | --- | --- | --- |
| FCN; signal space (Adam) | 1024 | 2 | $4.48 \times 10^{-3}$ | 644 s |
| FCN; Fourier space (Adam) | 1024 | 1 | $3.29 \times 10^{-3}$ | 1725 s |
| POD-DeepONet (Adam) | 2048 | 4 | $1.62 \times 10^{-3}$ | 4217 s |
| FNO (Adam) | n/a | 4 | $0.38 \times 10^{-3}$ | 3119 s |
| FCN; signal space (sampled) | 4096 | 1 | $0.92 \times 10^{-3}$ | 20 s |
| FCN; Fourier space (sampled) | 4096 | 1 | $1.08 \times 10^{-3}$ | 16 s |
| POD-DeepONet (sampled) | 4096 | 1 | $0.85 \times 10^{-3}$ | 21 s |
| FNO (sampled) | 4096 | 1 | $0.94 \times 10^{-3}$ | 387 s |

4.3.4 Fourier Neural Operator

The concept of a Fourier Neural Operator (FNO) was introduced by Li et al. [42] to represent maps in function spaces. An FNO consists of Fourier blocks, combining a linear operator in Fourier space and a skip connection in signal space. As a first step, FNO lifts the input signal to a higher-dimensional channel space. Let $v_t \in \mathbb{R}^{n \times d_v}$ be an input to a Fourier block, having $d_v$ channels and discretized with $n$ points. Then, the output $v_{t+1}$ of the Fourier block is computed as
$$v_{t+1} = \phi\big(\mathcal{F}_k^{-1}(R \cdot \mathcal{F}_k(v_t)) + W * v_t\big) \in \mathbb{R}^{n \times d_v}.$$
Here, $\mathcal{F}_k$ is a discrete Fourier transform keeping only the $k$ lowest frequencies and $\mathcal{F}_k^{-1}$ is the corresponding inverse transform; $\phi$ is an activation function; $\cdot$ denotes the spectral convolution; $*$ denotes a $1 \times 1$ convolution with bias; and $R \in \mathbb{C}^{k \times d_v \times d_v}$, $W \in \mathbb{R}^{d_v \times d_v}$ are learnable parameters. FNO stacks several Fourier blocks and then projects the output signal to the target dimension. The projection and the lifting operators are parameterized with neural networks. For sampling, the construction of convolution kernels is not possible yet, so we cannot sample FNO directly. Instead, we use the idea of FNO to construct a neural operator with comparable accuracy.
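As a concrete illustration of the Fourier-space pre-processing shared by Sections 4.3.2 and the FNO-inspired construction described next, the sketch below truncates a discretized signal to its $k$ lowest frequencies and splits the complex coefficients into real and imaginary parts, so that a standard (sampled) fully-connected network can be trained on the transformed data. The helper names are illustrative and not from the paper's code.

```python
import numpy as np

def to_fourier_features(u, k):
    """Map real signals u (shape: batch x n) to truncated Fourier features.

    Keeps only the k lowest non-negative frequencies of the real FFT and splits
    the complex coefficients into real and imaginary parts (shape: batch x 2k).
    """
    coeffs = np.fft.rfft(u, axis=-1)[:, :k]
    return np.concatenate([coeffs.real, coeffs.imag], axis=-1)

def from_fourier_features(features, n):
    """Inverse of to_fourier_features: restore signals of length n."""
    k = features.shape[-1] // 2
    coeffs = np.zeros((features.shape[0], n // 2 + 1), dtype=complex)
    coeffs[:, :k] = features[:, :k] + 1j * features[:, k:]
    return np.fft.irfft(coeffs, n=n, axis=-1)

# Round trip on a smooth test signal (only low frequencies -> near-exact recovery).
x = np.linspace(0.0, 2.0 * np.pi, 128, endpoint=False)
u = np.sin(x)[None, :] + 0.5 * np.cos(2.0 * x)[None, :]
u_rec = from_fourier_features(to_fourier_features(u, k=5), n=128)
print(np.max(np.abs(u - u_rec)))  # ~1e-15
```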
Similar to the original FNO, we normalize the input data and append grid coordinates to it before lifting. Then, we draw the weights from a uniform distribution on $[-1, 1]$ to compute the $1 \times 1$ lifting convolution. We first apply the Fourier transform to both input and target data, and then train a fully-connected network for each channel in Fourier space. We use skip connections, as in the original FNO, by removing the input data from the lifted target function during training, and then adding it back before moving to the output of the block. After sampling and transforming the input data with the sampled networks, we apply the inverse Fourier transform. After the Fourier block(s), we sample a fully-connected network that maps the signal to the solution.

The results of the experiments in Figure 5 show that sampled models are comparable to the Adam-trained ones. The sampled FNO model does not directly follow the original FNO architecture, as we are only able to sample fully-connected layers. This shows the advantage of gradient-based methods: as of now, they are applicable to much broader use cases. These experiments showcase one of the main advantages of sampled networks: speed of training. We run sampling on the CPU; nevertheless, we see a significant speed-up compared to Adam training performed on the GPU.

4.4 Transfer learning

Training deep neural networks from scratch involves finding a suitable neural network architecture [21] and hyper-parameters [3, 66]. Transfer learning aims to improve performance on the target task by leveraging learned feature representations from the source task. This has been successful in image classification [35], multi-language text classification, and sentiment analysis [63, 9]. Here, we compare the performance of sampling with iterative training on an image classification transfer learning task. We choose the CIFAR-10 dataset [39], with 50000 training and 10000 test images. Each image has dimension $32 \times 32 \times 3$ and must be classified into one of ten classes. We consider ResNet50 [31], VGG19 [58], and Xception [11], all pre-trained on the ImageNet dataset [55]. We freeze the weights of all convolutional layers and append one fully-connected hidden layer and one output layer. We refer to these two layers as the classification head, which we sample with Algorithm 1 and compare the classification accuracy against iterative training with the Adam optimizer.

Figure 6 (left) shows that for a pre-trained ResNet50, the test accuracy using the sampling approach is higher than with the Adam training approach for widths greater than 1024. We observe similar qualitative behavior for VGG19 and Xception (figures in Appendix E). Figure 6 (middle) shows that the sampling approach results in a higher test accuracy with all three pre-trained models. Furthermore, the deviation in test accuracy obtained with the sampling algorithm is very low, demonstrating that sampling is more robust to changing random seeds than iterative training. After fine-tuning the whole neural network with the Adam optimizer with a learning rate of $10^{-5}$, the test accuracies of sampled networks are close to the iterative approach. Thus, sampling provides a good starting point for fine-tuning the entire model. A comparison for the three models before and after fine-tuning is contained in Appendix E.
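Before turning to runtimes, here is a sketch of the frozen-backbone feature extraction underlying this comparison. TensorFlow/Keras is assumed only as a convenient source of a pre-trained ResNet50; the paper's actual pipeline and hyper-parameters are in Appendix E, and `sample_classification_head` is a hypothetical placeholder for Algorithm 1 applied to the extracted features (e.g., built as in the earlier sketches).

```python
import numpy as np
import tensorflow as tf  # assumption: TF/Keras provides the pre-trained backbone

# Frozen ResNet50 backbone (ImageNet weights) used as a fixed feature extractor.
backbone = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
backbone.trainable = False

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

def extract_features(images, batch_size=64):
    # Resize CIFAR-10 images to the backbone's expected resolution and preprocess.
    images = tf.image.resize(images.astype("float32"), (224, 224))
    images = tf.keras.applications.resnet50.preprocess_input(images)
    return backbone.predict(images, batch_size=batch_size, verbose=0)

# Small subset for illustration only; the paper uses the full dataset.
feats = extract_features(x_train[:512])        # shape (512, 2048)
labels = np.eye(10)[y_train[:512].ravel()]     # one-hot targets for the linear solve

# head = sample_classification_head(feats, labels)  # hypothetical: one sampled
# hidden layer plus a least-squares output layer, as in Algorithm 1.
```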
Figure 6 (right) shows that sampling is up to two orders of magnitude faster than iterative training for smaller widths, and around ten times faster for a width of 2048. In summary, Algorithm 1 is much faster than iterative training, yields a higher test accuracy for certain widths before fine-tuning, and is more robust with respect to changing random seeds. The sampled weights also provide a good starting point for fine-tuning the entire model.

Figure 6: Left: Train and test accuracies for different widths for ResNet50 (averaged over 5 random seeds). Middle: Test accuracy for the different models with and without fine-tuning (width = 2048). Right: Adam training and sampling times of the classification head (averaged over 5 experiments).

5 Broader Impact

Sampling weights through data pairs at large gradients of the target function offers improvements over random feature models. In terms of accuracy, networks with relatively large widths can even be competitive with iterative, gradient-based optimization. Constructing the weights through pairs of points also allows us to sample deep architectures efficiently. Sampled networks offer a straightforward interpretation of the internal weights and biases, namely, which data pairs are important. Given the recent critical discussions around fast advancement in artificial intelligence, and calls to slow it down, publishing work that potentially speeds up the development (concretely, training speed) in this area by orders of magnitude may seem irresponsible. The solid mathematical underpinning of random feature models and, now, sampled networks, combined with much greater interpretability of the individual steps during network construction, should mitigate some of these concerns.

6 Conclusion

We present a data-driven sampling method for fully-connected neural networks that outperforms random feature models in terms of accuracy, and in many cases is competitive with gradient-based optimization. The time to obtain a trained network is orders of magnitude shorter compared to gradient-based optimization. In addition, far fewer hyperparameters need to be optimized, as opposed to learning rate, number of training epochs, and type of optimizer. Several open issues remain; we list the most pressing here. Many architectures, like convolutional or transformer networks, cannot be sampled with our method yet, and thus must still be trained with iterative methods. Implicit problems, such as the solution of a PDE without any training data, are a challenge, as our distribution over the data pairs relies on known function values from a supervised learning setting. Iteratively refining a random initial guess may prove useful here. On the theory side, convergence rates for Algorithm 1 beyond the default Monte Carlo estimate are not available yet, but are important for robust applications in engineering. In the future, hyperparameter optimization, including neural architecture search, could benefit from the fast training time of sampled networks. We already demonstrate benefits for transfer learning here, which may be exploited for other pre-trained models and tasks. Analyzing which data pairs are sampled during training may help to understand the datasets better.
We did not show that our sampling distribution results in optimal weights, so there is a possibility of even more efficient heuristics. Applications in scientific computing may benefit most from sampling networks, as accuracy and speed requirements are much higher than for many tasks in machine learning. Acknowledgments and Disclosure of Funding We are grateful for discussions with Erik Bollt, Constantinos Siettos, Edmilson Roque Dos Santos, Anna Veselovska, and Massimo Fornasier. We also thank the anonymous reviewers at Neur IPS for their constructive feedback. E.B., I.B., and F.D. are funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - project no. 468830823, and also acknowledge association to DFG-SPP-229. C.D. is partially funded by the Institute for Advanced Study (IAS) at the Technical University of Munich. F.D. and Q.S. are supported by the TUM Georg Nemetschek Institute - Artificial Intelligence for the Built World. [1] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1 38, 2017. URL http://jmlr.org/papers/v18/15-178. html. [2] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930 945, 1993. ISSN 0018-9448, 1557-9654. doi: 10.1109/18.256500. [3] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. Advances in neural information processing systems, 24, 2011. [4] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael Gomes Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. Open ML benchmarking suites. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=OCr D8yc Kj G. [5] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of the 32nd International Conference on Machine Learning, ICML, 2015. URL http://proceedings.mlr.press/v37/blundell15.html. [6] Erik Bollt. On explaining the surprising success of reservoir computing forecaster of chaos? The universal machine learning dynamical system with contrast to VAR and DMD. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31(1):013108, 2021. doi: 10.1063/5.0024890. [7] Erik M. Bollt. How Neural Networks Work: Unraveling the Mystery of Randomized Neural Networks For Functions and Chaotic Dynamical Systems. unpublished, submitted, 2023. [8] Rebekka Burkholz, Nilanjana Laha, Rajarshi Mukherjee, and Alkis Gotovos. On the Existence of Universal Lottery Tickets. In International Conference on Learning Representations, 2022. URL https: //openreview.net/forum?id=SYB4Wr Jql1n. [9] Jireh Yi-Le Chan, Khean Thye Bea, Steven Mun Hong Leow, Seuk Wai Phoong, and Wai Khuen Cheng. State of the art: a review of sentiment analysis based on sequential transfer learning. Artificial Intelligence Review, 56(1):749 780, 2023. [10] Jingrun Chen, Xurong Chi, Weinan E, and Zhouwang Yang. Bridging traditional and machine learningbased algorithms for solving PDEs: The random feature method. (ar Xiv:2207.13380), 2022. [11] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251 1258, 2017. [12] A. Cuevas, R. Fraiman, and B. Pateiro-López. 
On Statistical Properties of Sets Fulfilling Rolling-Type Conditions. Advances in Applied Probability, 44(2):311 329, 2012. ISSN 0001-8678, 1475-6064. doi: 10.1239/aap/1339878713. [13] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303 314, 1989. ISSN 1435-568X. doi: 10.1007/BF02551274. [14] Ronald De Vore, Boris Hanin, and Guergana Petrova. Neural network approximation. Acta Numerica, 30: 327 444, 2021. ISSN 0962-4929, 1474-0508. doi: 10.1017/S0962492921000052. [15] Sjoerd Dirksen, Martin Genzel, Laurent Jacques, and Alexander Stollenwerk. The separation capacity of random neural networks. Journal of Machine Learning Research, 23(309):1 47, 2022. [16] Suchuan Dong and Zongwei Li. Local extreme learning machines and domain decomposition for solving linear and nonlinear partial differential equations. Computer Methods in Applied Mechanics and Engineering, 387:114129, 2021. ISSN 00457825. doi: 10.1016/j.cma.2021.114129. [17] Weinan E. Towards a Mathematical Understanding of Neural Network-Based Machine Learning: What We Know and What We Don t. CSIAM Transactions on Applied Mathematics, 1(4):561 615, 2020. ISSN 2708-0560, 2708-0579. doi: 10.4208/csiam-am.SO-2020-0002. [18] Weinan E and Stephan Wojtowytsch. Representation formulas and pointwise properties for Barron functions. Calculus of Variations and Partial Differential Equations, 61(2):46, 2022. ISSN 0944-2669, 1432-0835. doi: 10.1007/s00526-021-02156-6. [19] Weinan E, Chao Ma, and Lei Wu. A Priori estimates of the population risk for two-layer neural networks. Communications in Mathematical Sciences, 17(5):1407 1425, 2019. ISSN 15396746, 19450796. doi: 10.4310/CMS.2019.v17.n5.a11. [20] Weinan E, Chao Ma, and Lei Wu. The Barron Space and the Flow-Induced Function Spaces for Neural Network Models. Constructive Approximation, 55(1):369 406, 2022. ISSN 0176-4276, 1432-0940. doi: 10.1007/s00365-021-09549-y. [21] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. The Journal of Machine Learning Research, 20(1):1997 2017, 2019. [22] Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Müller, Joaquin Vanschoren, and Frank Hutter. Open ML-Python: An extensible Python API for Open ML. (ar Xiv:1911.02490), 2021. [23] Christian Fiedler, Massimo Fornasier, Timo Klock, and Michael Rauchensteiner. Stable Recovery of Entangled Weights: Towards Robust Identification of Deep Neural Networks from Minimal Samples. (ar Xiv:2101.07150), 2021. [24] Massimo Fornasier, Timo Klock, Marco Mondelli, and Michael Rauchensteiner. Finite Sample Identification of Wide Shallow Neural Networks with Biases. (ar Xiv:2211.04589), 2022. [25] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. [26] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of The 33rd International Conference on Machine Learning, pages 1050 1059. PMLR, 2016. [27] Evangelos Galaris, Gianluca Fabiani, Ioannis Gallos, Ioannis Kevrekidis, and Constantinos Siettos. Numerical Bifurcation Analysis of PDEs From Lattice Boltzmann Model Simulations: A Parsimonious Machine Learning Approach. Journal of Scientific Computing, 92(2):34, 2022. ISSN 0885-7474, 1573-7691. doi: 10.1007/s10915-022-01883-y. [28] Claudio Gallicchio and Simone Scardapane. 
Deep Randomized Neural Networks. In Recent Trends in Learning From Data, volume 896, pages 43 68. Springer International Publishing, Cham, 2020. ISBN 978-3-030-43882-1 978-3-030-43883-8. doi: 10.1007/978-3-030-43883-8_3. [29] Raja Giryes, Guillermo Sapiro, and Alex M. Bronstein. Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? IEEE Transactions on Signal Processing, 64(13):3444 3457, 2016. ISSN 1941-0476. doi: 10.1109/TSP.2016.2546221. [30] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: A new learning scheme of feedforward neural networks. In IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541), volume 2, pages 985 990. IEEE, 2004. ISBN 978-0-7803-8359-3. doi: 10.1109/IJCNN.2004.1380068. [31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016. [32] David Holzmüller and Ingo Steinwart. Training two-layer Re LU networks with gradient descent is inconsistent. Journal of Machine Learning Research, 23(181):1 82, 2022. [33] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359 366, 1989. doi: 10.1016/0893-6080(89)90020-8. [34] Herbert Jaeger and Harald Haas. Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication. Science, 304(5667):78 80, 2004. ISSN 0036-8075, 1095-9203. doi: 10.1126/science.1091277. [35] Hee E Kim, Alejandro Cosa-Linan, Nandhini Santhanam, Mahboubeh Jannesari, Mate E Maros, and Thomas Ganslandt. Transfer learning for medical image classification: A literature review. BMC medical imaging, 22(1):69, 2022. [36] D. P. Kingma and L. J. Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations ICLR, 2015. [37] Klaus Greff, Aaron Klein, Martin Chovanec, Frank Hutter, and Jürgen Schmidhuber. The Sacred Infrastructure for Computational Research. In Proceedings of the 16th Python in Science Conference, pages 49 56, 2017. doi: 10.25080/shinma-7f4c6e7-008. [38] Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural Operator: Learning Maps Between Function Spaces. (ar Xiv:2108.08481), 2022. [39] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Ph D thesis, University of Toronto, 2009. [40] Lingfeng Li, Xue-Cheng Tai Null, and Jiang Yang. Generalization Error Analysis of Neural Networks with Gradient Based Regularization. Communications in Computational Physics, 32(4):1007 1038, 2022. ISSN 1815-2406, 1991-7120. doi: 10.4208/cicp.OA-2021-0211. [41] Zhu Li, Jean-Francois Ton, Dino Oglic, and Dino Sejdinovic. Towards a unified analysis of random fourier features. Journal of Machine Learning Research, 22(108):1 51, 2021. [42] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier Neural Operator for Parametric Partial Differential Equations. In International Conference on Learning Representations, 2021. doi: 10.48550/ar Xiv.2010.08895. [43] Fanghui Liu, Xiaolin Huang, Yudong Chen, and Johan A. K. Suykens. Random Features for Kernel Approximation: A Survey on Algorithms, Theory, and Beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7128 7148, 2022. ISSN 1939-3539. 
doi: 10.1109/TPAMI.2021.3097011. [44] Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via Deep ONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218 229, March 2021. doi: 10.1038/s42256-021-00302-5. [45] Lu Lu, Xuhui Meng, Zhiping Mao, and George Em Karniadakis. Deep XDE: A Deep Learning Library for Solving Differential Equations. SIAM Review, 63(1):208 228, 2021. ISSN 0036-1445, 1095-7200. doi: 10.1137/19M1274067. [46] Lu Lu, Xuhui Meng, Shengze Cai, Zhiping Mao, Somdatta Goswami, Zhongqiang Zhang, and George Em Karniadakis. A comprehensive and fair comparison of two neural operators (with practical extensions) based on FAIR data. Computer Methods in Applied Mechanics and Engineering, 393:114778, 2022. ISSN 00457825. doi: 10.1016/j.cma.2022.114778. [47] Takuo Matsubara, Chris J. Oates, and François-Xavier Briol. The ridgelet prior: A covariance function approach to prior specification for bayesian neural networks. Journal of Machine Learning Research, 22 (157):1 57, 2021. URL http://jmlr.org/papers/v22/20-1300.html. [48] A.J. Meade and A.A. Fernandez. Solution of nonlinear ordinary differential equations by feedforward neural networks. Mathematical and Computer Modelling, 20(9):19 44, 1994. ISSN 08957177. doi: 10.1016/0895-7177(94)00160-X. [49] Radford M. Neal. Bayesian Learning for Neural Networks, volume 118 of Lecture Notes in Statistics. Springer New York, 1996. ISBN 978-0-387-94724-2 978-1-4612-0745-0. doi: 10.1007/ 978-1-4612-0745-0. [50] Nicholas H. Nelsen and Andrew M. Stuart. The Random Feature Model for Input-Output Maps between Banach Spaces. SIAM Journal on Scientific Computing, 43(5):A3212 A3243, 2021. ISSN 1064-8275, 1095-7197. doi: 10.1137/20M133957X. [51] William Peebles, Ilija Radosavovic, Tim Brooks, Alexei A. Efros, and Jitendra Malik. Learning to Learn with Generative Models of Neural Network Checkpoints. (ar Xiv:2209.12892), 2022. [52] Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143 195, 1999. ISSN 0962-4929, 1474-0508. doi: 10.1017/S0962492900002919. [53] Michael Poli, Stefano Massaroli, Federico Berto, Jinkyoo Park, Tri Dao, Christopher Ré, and Stefano Ermon. Transform once: Efficient operator learning in frequency domain. In Advances in Neural Information Processing Systems, volume 35, pages 7947 7959, 2022. [54] Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. In 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555 561. IEEE, 2008. ISBN 978-1-4244-2925-7. doi: 10.1109/ALLERTON.2008.4797607. [55] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211 252, 2015. [56] Konstantin Schürholt, Boris Knyazev, Xavier Giró-i Nieto, and Damian Borth. Hyper-representations as generative models: Sampling unseen neural network weights. In Advances in Neural Information Processing Systems, volume 35, pages 27906 27920, 2022. URL https://proceedings.neurips.cc/paper_ files/paper/2022/file/b2c4b7d34b3d96b9dc12f7bce424b7ae-Paper-Conference.pdf. [57] Jonathan W. Siegel and Jinchao Xu. Approximation rates for neural networks with general activation functions. Neural Networks, 128:313 321, 2020. ISSN 08936080. doi: 10.1016/j.neunet.2020.05.019. 
[58] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. (ar Xiv:1409.1556), 2014. [59] Sho Sonoda. Fast Approximation and Estimation Bounds of Kernel Quadrature for Infinitely Wide Models. (ar Xiv:1902.00648), 2020. [60] Len Spek, Tjeerd Jan Heeringa, and Christoph Brune. Duality for neural networks through Reproducing Kernel Banach Spaces. (ar Xiv:2211.05020), 2022. [61] Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger Grosse. Functional variational Bayesian neural networks. In International Conference on Learning Representations, 2019. [62] Michael Unser. A representer theorem for deep neural networks. Journal of Machine Learning Research, 20(110):1 30, 2019. URL http://jmlr.org/papers/v20/18-418.html. [63] Karl Weiss, Taghi M. Khoshgoftaar, and Ding Ding Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016. ISSN 2196-1115. doi: 10.1186/s40537-016-0043-6. [64] Stephan Wojtowytsch and Weinan E. Can Shallow Neural Networks Beat the Curse of Dimensionality? A Mean Field Training Perspective. IEEE Transactions on Artificial Intelligence, 1(2):121 129, 2020. ISSN 2691-4581. doi: 10.1109/TAI.2021.3051357. [65] Lei Wu and Jihao Long. A Spectral-Based Analysis of the Separation between Two-Layer Neural Networks and Linear Methods. Journal of Machine Learning Research, 23(1), 2022. ISSN 1532-4435. doi: 10.5555/3586589.3586708. [66] Li Yang and Abdallah Shami. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing, 415:295 316, 2020. Here we include full proofs of all theorems in the paper, and then details on all numerical experiments. In the last section, we briefly explain the code repository (URL in section Appendix F) and provide the analysis of the algorithm runtime and memory complexity. A Sampled networks theory 1 A.1 Universality of sampled networks . . . . . . . . . . . . . . . . . . . . . . . . . . 3 A.1.1 Input space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 A.1.2 Networks with a single hidden layer . . . . . . . . . . . . . . . . . . . . . 5 A.1.3 Deep networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 A.1.4 Networks with a single hidden layer, tanh activation . . . . . . . . . . . . 11 A.2 Barron spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 A.3 Distribution of sampled networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 B Illustrative example: approximating a Barron function 18 C Classification benchmark from Open ML 18 D Deep neural operators 18 D.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 D.2 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 D.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 E Transfer learning 23 E.1 Sampling Vs Adam training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 E.2 Sampling with tanh Vs Re LU activation functions . . . . . . . . . . . . . . . . . . 24 E.3 Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 E.4 One vs two hidden layers in the classification head . . . . . . . . . . . . . . . . . 25 F Code repository and algorithm complexity 26 A Sampled networks theory In this section, we provide complete proofs of the results stated in the main paper. 
In Appendix A.1 we consider sampled networks and the universal approximation property, followed by the result bounding the $L^2$ approximation error between sampled networks and Barron functions in Appendix A.2. We end with the proof regarding equivariance and invariance in Appendix A.3. We start by defining neural networks and sampled neural networks, and show in more depth the choices of the scalars in sampled networks for ReLU and tanh. Let the input space $X$ be a subset of $\mathbb{R}^D$. We now define a fully connected, feed-forward neural network, mostly to introduce some notation.

Definition 3. Let $\phi \colon \mathbb{R} \to \mathbb{R}$ be non-polynomial, $L \geq 1$, $N_0 = D$, and $N_1, \dots, N_{L+1} \in \mathbb{N}_{>0}$. A (fully connected) neural network with $L$ hidden layers, activation function $\phi$, and weight space $\Theta = \{W_l \in \mathbb{R}^{N_l \times N_{l-1}}, b_l \in \mathbb{R}^{N_l}\}_{l=1}^{L+1}$, is a function $\Phi \colon \mathbb{R}^D \to \mathbb{R}^{N_{L+1}}$ of the form
$$\Phi(x) = W_{L+1} x_L - b_{L+1}, \qquad (3)$$
where $x_l$ is defined recursively as $x_0 = x \in X$ and
$$x_l = \phi(W_l x_{l-1} - b_l), \quad l = 1, 2, \dots, L. \qquad (4)$$
The image after $l$ layers is denoted as $X_l = \Phi^{(l)}(X)$.

As we study the space of these networks, we find it useful to introduce the following notation. The space of all neural networks with $L$ hidden layers, with $N = [N_1, \dots, N_L]$ neurons in the separate layers, is denoted as $\mathcal{F}_{L,N}$, assuming the input dimension and the output dimension are implicitly given. If each hidden layer has $N$ neurons, and we let the input dimension and the output dimension be fixed, we write the space of all such neural networks as $\mathcal{F}_{L,N}$. We then take the union over all depths to obtain the set of neural networks with width $N$ and arbitrary depth, and similarly the union over all widths, $\mathcal{F}_{L,\infty}$, for arbitrary width at fixed depth $L$. Finally, the analogous spaces of sampled networks carry a superscript $S$, e.g., $\mathcal{F}^S_{1,\infty}$ for sampled networks with one hidden layer and arbitrary width.

We will now introduce sampled networks again, and go into depth on the choice of constants. The input space $X$ is a subset of Euclidean space, and we will work with the canonical inner product $\langle \cdot, \cdot \rangle$ and its induced norm if we do not explicitly specify the norm.

Definition 4. Let $\Phi$ be a neural network with $L$ hidden layers. For $l = 1, \dots, L$, let $(x^{(1)}_{0,i}, x^{(2)}_{0,i})_{i=1}^{N_l}$ be pairs of points sampled over $X \times X$. We say $\Phi$ is a sampled network if the weights and biases of every layer $l = 1, 2, \dots, L$ and neurons $i = 1, 2, \dots, N_l$ are of the form
$$w_{l,i} = s_1 \frac{x^{(2)}_{l-1,i} - x^{(1)}_{l-1,i}}{\big\|x^{(2)}_{l-1,i} - x^{(1)}_{l-1,i}\big\|^2}, \qquad b_{l,i} = \big\langle w_{l,i}, x^{(1)}_{l-1,i} \big\rangle + s_2, \qquad (5)$$
where $s_1, s_2 \in \mathbb{R}$ are constants related to the activation function, and $x^{(j)}_{l-1,i} = \Phi^{(l-1)}(x^{(j)}_{0,i})$ for $j = 1, 2$, assuming $x^{(1)}_{l-1,i} \neq x^{(2)}_{l-1,i}$. The last set of weights and biases is $W_{L+1}, b_{L+1} = \arg\min \mathcal{L}(W_{L+1}\Phi^{(L)}(\cdot) - b_{L+1})$, where $\mathcal{L}$ is a suitable loss.

Remark 1. The constants $s_1, s_2$ can be fixed such that, for a given activation function, we can specify values at specific points, e.g., we can specify what value to map the two points $x^{(1)}$ and $x^{(2)}$ to; cf. Figure 7. Also note that the points sampled from $X \times X$ are sampled such that we have unique points after mapping the two sampled points through the first $l-1$ layers. This is enforced by constructing the density of the distribution we use to sample the points so that zero density is assigned to points that map to the same output of $\Phi^{(l-1)}$ (see Definition 2).

Figure 7: Illustration of the placement of point pairs $x^{(1)}, x^{(2)}$ for activation functions ReLU (left) and tanh (right).

Example A.1. We consider the construction of the constants $s_1, s_2$ for ReLU and tanh, as they are the two activation functions used for both the proofs and the experiments. We start by considering the activation function tanh, i.e., $\phi(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$. Consider an arbitrary weight $w$, where we drop the subscripts for simplicity.
Setting $s_1 = 2 s_2$, the input to the corresponding activation function at the mid-point $x = x^{(1)} + \frac{x^{(2)} - x^{(1)}}{2} = \frac{x^{(2)} + x^{(1)}}{2}$ is
$$\langle w, x \rangle - b = \Big\langle w, \frac{x^{(2)} - x^{(1)}}{2} \Big\rangle - s_2 = \frac{2 s_2}{\|x^{(2)} - x^{(1)}\|^2} \cdot \frac{1}{2} \big\langle x^{(2)} - x^{(1)}, x^{(2)} - x^{(1)} \big\rangle - s_2 = s_2 - s_2 = 0.$$
The output of the neuron corresponding to the input above is then zero, regardless of the value of the constant $s_2$. Another aspect of the constants is that we can decide the activation values for the two sampled points. This can be seen with a similar calculation as above, $\langle w, x^{(1)} \rangle - b = -s_2$. Letting $s_2 = \frac{\ln 3}{2}$, we see that the output of the corresponding neuron at the point $x^{(1)}$ is $\phi(-s_2) = \frac{e^{-\ln 3} - 1}{e^{-\ln 3} + 1} = -\frac{1}{2}$, and for $x^{(2)}$ we get $\frac{1}{2}$. We can also consider the ReLU activation function, where we choose to set $s_1 = 1$ and $s_2 = 0$. The center of the real line is then placed at $x^{(1)}$, and $x^{(2)}$ is mapped to one.

A.1 Universality of sampled networks

In this section we show that universal approximation results also hold for our sampled networks. We start by considering the ReLU function. To separate it from tanh, which is the second activation function we consider, we denote by $\phi$ the ReLU function and by $\psi$ the tanh function. Networks with ReLU as activation function are then denoted by $\Phi$, and $\Psi$ denotes networks with tanh activation function. When we use ReLU, we set $s_1 = 1$ and $s_2 = 0$, as already discussed. We provide the notation used throughout Appendix A in Table 2. The rest of this section is structured as follows:
1. We introduce the type of input spaces we are working with.
2. We then aim to rewrite an arbitrary fully connected neural network with one hidden layer into a sampled network.
3. This is done by first constructing neurons that add a constant to parts of the input space while leaving the rest untouched.
4. We then show that we can translate between an arbitrary neural network and a sampled network, if the weights of the former are given. This gives us the first universal approximation result, by combining this step with the former.
5. We go on to construct deep sampled networks with arbitrary width, showing that we can retain the required information through the first $L - 1$ layers, and then apply the result for sampled networks with one hidden layer to the last hidden layer.
6. We conclude the section by showing that sampled networks with tanh activation functions are also universal approximators. Concretely, we show that they are dense in the space of sampled networks with ReLU activation function. The reason we need a different proof for the tanh case is that tanh is not positively homogeneous, a property of ReLU that we rely on heavily in the proof for sampled networks with ReLU activation.
Before we can start with the proof that $\mathcal{F}_{1,\infty}$ is dense in the space of continuous functions, we begin by specifying the domain/input space for the functions.

Table 2: Notation used throughout the theory section of this appendix.
$X'$: Pre-noise input space. Subset of $\mathbb{R}^D$ and compact for Appendix A.1 and Appendix A.2.
$X$: Input space. Subset of $\mathbb{R}^D$, $X' \subseteq X$, and compact for Appendix A.1 and Appendix A.2.
$\epsilon_I$: Noise level for $X$; see Definition 5.
$\phi$: Activation function ReLU, $\phi(x) = \max\{x, 0\}$.
$\Phi$: Neural network with activation function ReLU.
$\psi$: Activation function tanh.
$\Psi$: Neural network with activation function tanh.
$\Phi_c$: Constant block. Five neurons that combined add a constant value $c$ to parts of the input space $X$, while leaving the rest untouched; see Definition 6.
$N_l$: Number of neurons in layer $l$ of a neural network. $N_0 = D$ and $N_{L+1}$ is the output dimension of the network.
$w_{l,i}$: The weight of the $i$-th neuron in the $l$-th layer.
$b_{l,i}$: The bias of the $i$-th neuron in the $l$-th layer.
$\Phi^{(l)}$: Function that maps $x \in X$ to the output of the $l$-th hidden layer.
$\Phi^{(l,i)}$: The $i$-th neuron of the $l$-th layer.
$X_l$: The image of $X$ after $l$ layers, i.e., $X_l = \Phi^{(l)}(X)$. We set $X_0 = X$.
$x_l$: $x_l = \Phi^{(l)}(x)$.
$\mathcal{F}_{L,[N_1,N_2,\dots,N_L]}$: Space of neural networks with $L$ hidden layers and $N_l$ neurons in layer $l$.
$\mathcal{F}_{1,\infty}$: Space of all neural networks with one hidden layer, i.e., networks with arbitrary width.
$\mathcal{F}^S_{1,\infty}$: Space of all sampled networks with one hidden layer and arbitrary width.
$\zeta$: Function that maps points from $X$ to $X'$, mapping each point to its closest point in $X'$, which is unique.
$\eta$: Maps $(x, \delta) \in X \times [0, 1]$ to $X$, by $\eta(x, \delta) = x + \delta(\zeta(x) - x)$.
$\|\cdot\|$: Canonical norm of $\mathbb{R}^D$, with inner product $\langle \cdot, \cdot \rangle$.
$B_\epsilon(x)$: The open ball $B_\epsilon(x) = \{x' \in \mathbb{R}^D : \|x' - x\| < \epsilon\}$.
$\bar{B}_\epsilon(x)$: The closed ball $\bar{B}_\epsilon(x) = \{x' \in \mathbb{R}^D : \|x' - x\| \leq \epsilon\}$.
$\tau$: Reach of the set $X'$.
$\lambda$: Lebesgue measure.

A.1.1 Input space

We again assume that the input space $X$ is a subset of $\mathbb{R}^D$. We also assume that $X$ contains some additional additive noise. Before we can define this noisy input space, we recall two concepts. Let $(Z, d)$ be a metric space, and let the distance between a subset $A \subseteq Z$ and a point $z \in Z$ be defined as $d_Z(z, A) = \inf\{d(z, a) : a \in A\}$. If $Z$ is also a normed space, the medial axis is then defined as
$$\mathrm{Med}(A) = \{z \in Z : \exists\, p \neq q \in A,\ \|p - z\| = \|q - z\| = d_Z(z, A)\}.$$
The reach of $A$ is defined as
$$\tau_A = \inf_{a \in A} d_Z(a, \mathrm{Med}(A)).$$
Informally, the reach is the smallest distance from a point in the subset $A$ to a non-unique projection of it in the complement $A^c$. This means that the reach of convex subsets is infinite (all projections are unique), while other sets can have zero reach, so $0 \leq \tau_A \leq \infty$. Let $Z = \mathbb{R}^D$, and let $d$ be the canonical Euclidean distance.

Definition 5. Let $X'$ be a nonempty compact subset of $\mathbb{R}^D$ with reach $\tau_{X'} > 0$. The input space $X$ is defined as $X = \{x \in \mathbb{R}^D : d_Z(x, X') \leq \epsilon_I\}$, where $0 < \epsilon_I < \min\{\tau_{X'}, 1\}$. We refer to $X'$ as the pre-noise input space.

Remark 2. As $d_Z(x, X') = 0$ for all $x \in X'$, we have $X' \subseteq X$. Due to $d_Z(x, X') \leq \epsilon_I$, the set $X$ remains closed, and since $\epsilon_I < \infty$ it is bounded, hence compact. Informally, we have enlarged the input space with new points at most $\epsilon_I$ away from $X'$. This preserves many properties of the pre-noise input space $X'$, while also being helpful in the proofs to come. We also argue that in practice we are often faced with noisy datasets $X$ anyway, rather than non-perturbed data $X'$.

We also define a function that maps elements in $X \setminus X'$ down to $X'$, using the uniqueness property given by the medial axis. That is, $\zeta\colon X \to X'$ is the mapping defined as $\zeta(x) = x'$, for $x \in X$ and $x' \in X'$, such that $d(x, x') = d_Z(x, X')$. As we also want to work along the line between these two points, we set $\eta\colon X \times [0, 1] \to X$, where $\eta(x, \delta) = x + \delta(\zeta(x) - x)$. We conclude this part with an elementary result.

Lemma 1. Let $x \in X$, $\delta \in [0, 1]$, and $x' = \eta(x, \delta) \in X$. Then $\|x - x'\| \leq \delta$, with strict inequality when $\delta > 0$. Furthermore, when $\delta > 0$, we have $x' \notin \partial X$.

Proof. We start by noticing that $d(\zeta(x), x') = \|\zeta(x) - x'\| = (1 - \delta)\|\zeta(x) - x\| \leq (1 - \delta)\epsilon_I \leq \epsilon_I$, which means $x' \in X$. When $\delta > 0$, we have strict inequality, which implies $x' \in \operatorname{int} X = X \setminus \partial X$. Furthermore, $d(x, x') = \|x - x'\| = \delta\|\zeta(x) - x\| \leq \delta \epsilon_I \leq \delta$, where the last inequality holds due to $\epsilon_I < 1$, and is an equality only when $\delta = 0$.
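Before moving on to the universality results, the weight construction of Definition 4, together with the constants from Example A.1, can be summarized in a short numerical sketch. The snippet below is a minimal numpy illustration, not the reference implementation; the helper name sampled_weight_bias and the concrete points are chosen only for this example.

```python
import numpy as np

def sampled_weight_bias(x1, x2, s1, s2):
    """One weight/bias pair from a pair of distinct points, as in Equation (5):
    w = s1 * (x2 - x1) / ||x2 - x1||^2 and b = <w, x1> + s2,
    so the pre-activation of the neuron is <w, x> - b."""
    diff = x2 - x1
    w = s1 * diff / np.dot(diff, diff)
    return w, np.dot(w, x1) + s2

x1, x2 = np.array([0.0, 1.0, -1.0]), np.array([1.0, 3.0, 1.0])

# ReLU constants (s1 = 1, s2 = 0): x1 is mapped to 0 and x2 to 1.
w, b = sampled_weight_bias(x1, x2, s1=1.0, s2=0.0)
print(np.maximum(w @ x1 - b, 0.0), np.maximum(w @ x2 - b, 0.0))

# tanh constants (s2 = ln(3)/2, s1 = 2*s2): x1 -> -1/2, x2 -> 1/2, mid-point -> 0.
s2 = np.log(3) / 2
w, b = sampled_weight_bias(x1, x2, s1=2 * s2, s2=s2)
mid = 0.5 * (x1 + x2)
print(np.tanh(w @ x1 - b), np.tanh(w @ x2 - b), np.tanh(w @ mid - b))
```

The printed values reproduce the placements discussed in Example A.1 and illustrated in Figure 7.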
A.1.2 Networks with a single hidden layer

We can now proceed to the proof that sampled networks with one hidden layer are indeed universal approximators. The main idea is to start with an arbitrary network with a single hidden layer and show that we can approximate it arbitrarily well. Then we can rely on previous universal approximation theorems [13, 33, 52] to finalize the proof. We start by showing some results for a different type of neural network, very similar in form. We consider networks where the bias is of the same form as in a sampled network. However, the weight is normalized to be a unit weight, that is, divided by the norm, and not by the square of the norm. We show in Lemma 4 that the results also hold for any positive scalar multiple of the unit weight, and therefore for our sampled networks, where we divide by the square of the norm. To be more precise, the weights are of the form
$$w_{l,i} = \frac{x^{(2)}_{l-1,i} - x^{(1)}_{l-1,i}}{\|x^{(2)}_{l-1,i} - x^{(1)}_{l-1,i}\|}, \qquad b_{l,i} = \langle w_{l,i}, x^{(1)}_{l-1,i} \rangle.$$
Networks with weights/biases of this form in the hidden layers are referred to as unit sampled networks. In addition, when the output dimension is $N_{L+1} = 1$, we split the bias of the last layer, $b_{L+1}$, into $N_L$ parts, to make the proof easier to follow. This is of no consequence for the final result, as we can always sum up the parts to form the original bias. We write the different parts of the split as $b_{L+1,i}$ for $i = 1, 2, \dots, N_L$.

We start by defining a constant block. This is crucial for handling the bias, as we can add constants to the output of certain parts of the input space, while leaving the rest untouched. This is important when proving Lemma 2.

Definition 6. Let $c_1 < c_2 < c_3$, and $c \in \mathbb{R}_+$. A constant block $\Phi_c$ is defined as five neurons summed together as follows. For $x \in \mathbb{R}$,
$$f_1(x) = a_1 \phi(x - c_2), \quad f_2(x) = a_1 \phi(-(x - c_3)), \quad f_3(x) = a_1 \phi(x - c_3), \quad f_4(x) = a_2 \phi(-(x - c_1)), \quad f_5(x) = a_3 \phi(x - c_1),$$
with $a_1 = \frac{c}{c_3 - c_2}$, $a_2 = a_1 \frac{c_1 - c_3}{c_1 - c_2}$, and $a_3 = a_2 - a_1$.

Remark 3. The function $\Phi_c$ is constructed using neurons, but can also be written as the continuous function
$$\Phi_c(x) = \begin{cases} 0, & x \leq c_1, \\ a_3 x + d, & c_1 < x \leq c_2, \\ c, & x > c_2, \end{cases}$$
where $d = a_1 c_3 - a_2 c_2$. Obviously, if $c$ needs to be negative, one can simply swap the sign on each of the three parameters $a_i$. We can see from the construction of $\Phi_c$ that we might need the negative of some original weight. That is, the input to $\Phi_c$ is of the form $\langle w_{1,i}, x \rangle$, and for $f_2$ and $f_4$ we require $\langle -w_{1,i}, x \rangle$. In Lemma 3, we shall see that this is not an issue and that we can construct neurons that approximate constant blocks arbitrarily well, as long as $c_1, c_2, c_3$ can be produced by the inner product between a weight and points in $X$, that is, if we can produce biases equal to the constants $c_1, c_2, c_3$.

Let $\hat{\Phi} \in \mathcal{F}_{1,K}$, with parameters $\{\hat{w}_{l,i}, \hat{b}_{l,i}\}_{l=1,i=1}^{2,K}$, be an arbitrary neural network. Unless otherwise stated, the weights in this arbitrary network $\hat{\Phi}$ are always nonzero. Sampled networks cannot have zero weights, as the point pairs used to construct weights, both for unit and regular sampled networks, are distinct. However, one can always construct the same output as neurons with zero weights by setting certain weights in $W_{L+1}$ to zero. We start by showing that, given all weights of a network, we can construct all biases in a unit sampled network so that the function values agree. More precisely, we want to construct a network with weights $\hat{w}_{l,i}$, and show that we can find points in $X$ to construct biases of the form used in unit sampled networks, such that the resulting neural network output on $X$ equals exactly the values of the arbitrary network $\hat{\Phi}$.

Lemma 2. Let $\hat{\Phi} \in \mathcal{F}_{1,K}\colon X \to \mathbb{R}^{N_2}$. There exists a set of at most $6 N_1$ points $x_i \in X$ and biases $b_{2,i} \in \mathbb{R}$, such that a network $\Phi$ with weights $w_{1,i} = \hat{w}_{1,i}$, $w_{2,i} = \hat{w}_{2,i}$, and biases $b_{2,i} \in \mathbb{R}$, $b_{1,i} = \langle w_{1,i}, x_i \rangle$, for $i = 1, 2, \dots$
, N1, satisfies Φ(x) ˆΦ(x) = 0 for all x X. Proof. W.l.o.g., we assume N2 = 1. For any weight/bias pair ˆw1,i,ˆb1,i, we let w1,i = ˆw1,i, w2,i = ˆw2,i, Bi = { w1,i, x : x X}, b(i) = inf Bi, and b(i) = sup Bi. As w1,i, is continuous, means Bi is compact and b(i) , b(i) Bi. We have four different cases, depending on ˆb1,i. (1) If ˆb1,i Bi, then we simply choose a corresponding xi X such that b1,i = w1,i, xi = ˆb1,i. Letting b2,i = ˆb2,i, we have w2,iϕ( w1,i, x b1,i) b2,i = ˆw2,iϕ( ˆw1,i, x ˆb1,i) ˆb2,i. (2) If ˆb1,i > b(i) , we choose xi such that b1,i = w1,i, xi = b(i) and b2,i = ˆb2,i. As ϕ( w1,i, x b1,i) = ϕ( ˆw1,i, x ˆb1,i) = 0, for all x X, we are done. (3) If ˆb1,i < b(i) , we choose corresponding xi such that b1,i = w1,i, xi = b(i) , and set b2,i = ˆb2,i + w2,i ˆb1,i b(i) . We then have w2,iϕ( w1,i, x b1,i) b2,i = w2,i w1,i, x w2,i b1,i b2,i = w2,i w1,i, x ˆb2,i w2,i ˆb1,i w2,ib(i) = ˆw2,i ˆw1,i, x ˆb2,i ˆw2,i ˆb1,i = ˆw2,iϕ( ˆw1,i, x ˆb1,i) ˆb2,i, where first and last equality holds due to w1,i, x > b1,i > ˆb1,i for all x X. (4) If b(i) < ˆb1,i < b(i) , and ˆb1,i / Bi, things are a bit more involved. First notice that B(1) i = Bi [b(i) ,ˆb1,i] and B(2) i = Bi [ˆb1,i, b(i) ] are both non-empty compact sets. We therefore have that supremum and infimum of both sets are members of their respective set, and thus also part of Bi. We therefore choose xi such that b1,i = w1,i, xi = inf B(2) i . To make up for the difference between ˆb1,i < b1,i, we add a constant to all x X where w1,i, x > b1,i. To do this we add some additional neurons, using our constant block Φ(i) c ( w1,i, ). Letting c = w2,i b1,i ˆb1,i , c1 = sup B(1) i , c2 = b1,i, and c3 = b(i) . We have now added five more neurons, and the weights and bias in second layer corresponding to the neurons is set to be a and 0, respectively, where both a and the sign of a depends on Definition 6. In case we require a negative sign in front of w1,i, , we simply set it as we are only concerned with finding biases given weights. We then have that for all x X, Φ(i) c ( w1,i, x ) = c, w1,i, x > b1,i 0, otherwise. Finally, by letting b2,i = ˆb2,i, we have w2,iϕ( w1,i, x b1,i) b2,i + Φ(i) c ( w1,i, x ) =w2,i w1,i, x w2,i b1,i ˆb2,i + w2,i b1,i ˆw2,i ˆb1,i = ˆw1,i, x ˆw2,i ˆb1,i ˆb2,i = ˆw2,iϕ( ˆw1,i, x ˆb1,i) ˆb2,i, when w1,i, x > b1,i, and w2,iϕ( w1,i, x b1,i) b2,i + Φ(i) c ( w1,i, x ) = b2,i = ˆb2,i = ˆw2,iϕ( w1,i, x ˆb1,i) ˆb2,i, otherwise. And thus, Φ(x) = ˆΦ(x) for all x X. As we add five additional neurons for each constant block, and we may need one for each neuron, means we need to construct our network with at most 6 N1 neurons. Now that we know we can construct suitable biases in our unit sampled networks for all types of biases ˆbl,i in the arbitrary network ˆΦ, we show how to construct the weights. Lemma 3. Let ˆΦ F1,K : X RN2, with biases of the form ˆb1,i = ˆw1,i, xi , where xi X for i = 1, 2, . . . , N1. For any ϵ > 0, there exist unit sampled network Φ, such that Φ ˆΦ < ϵ. Proof. W.l.o.g., we assume N2 = 1. For all i = 1, 2, . . . , N1, we construct the weights and biases as follows: If xi / X, then we set ϵ > 0 such that Bϵ (xi) X. Let x(1) 0,i = xi and x(2) 0,i = x(1) 0,i + ϵ 2 w1,i Bϵ (xi) X. Setting w2,i = ˆw1,i ˆw2,i, b2,i = ˆb2,i, and w1,i = x(2) 0,i x(1) 0,i x(2) 0,i x(1) 0,i = ˆ w1,i ˆ w1,i , implies w2,iϕ( w1,i, x x(1) 0,i ) b2,i = ˆw1,i ˆw2,iϕ ˆw1,i ˆw1,i , x xi = ˆw2,iϕ( ˆwi, x xi ) ˆb2,i, where last equality follows by ϕ being positive homogeneous. If xi X, by continuity we find δ > 0 such that for all i = 1, 2, . . . 
, $N_1$, and $x, x' \in X$ with $\|x' - x_i\| < \delta$, we have $|\phi(\langle \hat{w}_{1,i}, x - x' \rangle) - \phi(\langle \hat{w}_{1,i}, x - x_i \rangle)| < \frac{\epsilon}{N_1 \bar{w}_2}$, where $\bar{w}_2 = \max\{|\hat{w}_{2,i}|\}_{i=1}^{N_1}$. We set $x^{(1)}_{0,i} = \eta(x_i, \min\{\delta, 1\})$, with $x^{(1)}_{0,i} \in \operatorname{int} X$ and $\|x_i - x^{(1)}_{0,i}\| < \delta$, due to $\delta > 0$ and Lemma 1. We may now proceed by constructing $x^{(2)}_{0,i}$ as above, and similarly setting $w_{2,i} = \|\hat{w}_{1,i}\| \hat{w}_{2,i}$, $b_{2,i} = \hat{b}_{2,i}$, which gives
$$\Big|\sum_{i=1}^{N_1}\big(w_{2,i}\phi(\langle w_{1,i}, x - x^{(1)}_{0,i}\rangle) - b_{2,i}\big) - \sum_{i=1}^{N_1}\big(\hat{w}_{2,i}\phi(\langle \hat{w}_{1,i}, x - x_i\rangle) - \hat{b}_{2,i}\big)\Big| \leq \sum_{i=1}^{N_1} |\hat{w}_{2,i}|\,\big|\phi(\langle \hat{w}_{1,i}, x - x^{(1)}_{0,i}\rangle) - \phi(\langle \hat{w}_{1,i}, x - x_i\rangle)\big| \leq \sum_{i=1}^{N_1} \bar{w}_2\,\big|\phi(\langle \hat{w}_{1,i}, x - x^{(1)}_{0,i}\rangle) - \phi(\langle \hat{w}_{1,i}, x - x_i\rangle)\big| < \sum_{i=1}^{N_1} \bar{w}_2 \frac{\epsilon}{N_1 \bar{w}_2} = \epsilon,$$
and thus $\|\Phi - \hat{\Phi}\| < \epsilon$.

Until now, we have worked with weights of the form $w = \frac{x^{(2)} - x^{(1)}}{\|x^{(2)} - x^{(1)}\|}$; however, the weights in a sampled network are divided by the norm squared, not just the norm. We now show that, for all the results so far and also for any later results, differing by a positive scalar (such as this norm) is irrelevant when ReLU is the activation function.

Lemma 4. Let $\Phi$ be a network with one hidden layer, with weights and biases of the form
$$w_{1,i} = \frac{x^{(2)}_{0,i} - x^{(1)}_{0,i}}{\|x^{(2)}_{0,i} - x^{(1)}_{0,i}\|}, \qquad b_{1,i} = \langle w_{1,i}, x^{(1)}_{0,i} \rangle, \qquad i = 1, 2, \dots, N_1.$$
For any weights and biases in the last layer, $\{w_{2,i}, b_{2,i}\}_{i=1}^{N_2}$, and any set of strictly positive scalars $\{\omega_i\}_{i=1}^{N_1}$, there exists a sampled network $\Phi_\omega$ whose weights and biases in the hidden layer, $\{w'_{1,i}, b'_{1,i}\}_{i=1}^{N_1}$, are of the form $w'_{1,i} = \omega_i w_{1,i}$, $b'_{1,i} = \langle w'_{1,i}, x^{(1)}_{0,i} \rangle$, such that $\Phi_\omega(x) = \Phi(x)$ for all $x \in X$.

Proof. We set $w'_{2,i} = \frac{w_{2,i}}{\omega_i}$ and $b'_{2,i} = b_{2,i}$. As ReLU is a positively homogeneous function, we have for all $x \in X$,
$$w'_{2,i}\phi(\langle w'_{1,i}, x \rangle - b'_{1,i}) - b'_{2,i} = \frac{w_{2,i}}{\omega_i}\,\omega_i\,\phi(\langle w_{1,i}, x \rangle - b_{1,i}) - b_{2,i} = w_{2,i}\phi(\langle w_{1,i}, x \rangle - b_{1,i}) - b_{2,i}.$$

The result itself is not too exciting, but it allows us to apply the results proven earlier to sampled networks by setting the scalar to $\omega_i = \frac{1}{\|x^{(2)}_{0,i} - x^{(1)}_{0,i}\|}$. We are now ready to show the universal approximation property for sampled networks with one hidden layer. We let $\mathcal{F}^S_{1,\infty}$ be defined similarly to $\mathcal{F}_{1,\infty}$, with every $\Phi \in \mathcal{F}^S_{1,\infty}$ being a sampled neural network.

Theorem 4. Let $g \in C(X, \mathbb{R}^{N_{L+1}})$. Then, for any $\epsilon > 0$, there exists $\Phi \in \mathcal{F}^S_{1,\infty}$ with ReLU activation function, such that $\|g - \Phi\| < \epsilon$. That is, $\mathcal{F}^S_{1,\infty}$ is dense in $C(X, \mathbb{R}^{N_{L+1}})$.

Proof. W.l.o.g., let $N_{L+1} = 1$, and in addition, let $\epsilon > 0$ and $g \in C(X, \mathbb{R}^{N_{L+1}})$. Using the universal approximation theorem of Pinkus [52], we have that for $\epsilon > 0$ there exists a network $\hat{\Phi} \in \mathcal{F}_{1,\infty}$ such that $\|g - \hat{\Phi}\| < \frac{\epsilon}{2}$. Let $K$ be the number of neurons in $\hat{\Phi}$. We then create a new network $\Phi$, by first keeping the weights fixed to the original ones, $\{\hat{w}_{1,i}, \hat{w}_{2,i}\}_{i=1}^K$, and setting the biases of $\Phi$ according to Lemma 2 using $\{\hat{b}_{1,i}, \hat{b}_{2,i}\}_{i=1}^K$, adding constant blocks if necessary. We then change the weights of our $\Phi$ with respect to the new biases, according to Lemma 3 (with the epsilon set to $\epsilon/2$). It follows from the two lemmas that $\|\Phi - \hat{\Phi}\| < \frac{\epsilon}{2}$. That means
$$\|g - \Phi\| = \|g - \hat{\Phi} + \hat{\Phi} - \Phi\| \leq \|g - \hat{\Phi}\| + \|\hat{\Phi} - \Phi\| < \epsilon.$$
As the weights of the first layer of $\Phi$ can be written as $w_{1,i} = \frac{x^{(2)}_{0,i} - x^{(1)}_{0,i}}{\|x^{(2)}_{0,i} - x^{(1)}_{0,i}\|^2}$ and the biases as $b_{1,i} = \langle w_{1,i}, x^{(1)}_{0,i} \rangle$, both guaranteed by Lemma 4, we have $\Phi \in \mathcal{F}^S_{1,\infty}$. Thus, $\mathcal{F}^S_{1,\infty}$ is dense in $C(X, \mathbb{R}^{N_{L+1}})$.

Remark 4. By the same two lemmas, Lemma 2 and Lemma 3, one can show that other results regarding networks with one hidden layer with at most $K$ neurons also hold for sampled networks, but with $6K$ neurons, due to the possibility of one constant block for each neuron. When $X$ is a connected set, we only need $K$ neurons, as no additional constant blocks must be added; see the proof of Lemma 2 for details.
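Before turning to deep networks, the rescaling argument of Lemma 4 is easy to check numerically. The sketch below is an assumption-laden toy example (random data, generic biases), not code from the paper, and uses the $\phi(\langle w, x\rangle - b)$ convention from above.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda t: np.maximum(t, 0.0)

D, N1, n = 3, 8, 100
X = rng.normal(size=(n, D))
W1, b1 = rng.normal(size=(N1, D)), rng.normal(size=N1)   # hidden layer
w2, b2 = rng.normal(size=N1), rng.normal()               # linear output layer

omega = rng.uniform(0.5, 2.0, size=N1)       # strictly positive scalars
W1s, b1s = omega[:, None] * W1, omega * b1   # rescaled hidden weights and biases
w2s = w2 / omega                             # compensated outer weights

out = relu(X @ W1.T - b1) @ w2 - b2
out_scaled = relu(X @ W1s.T - b1s) @ w2s - b2
print(np.allclose(out, out_scaled))          # True: ReLU is positively homogeneous
```

The same cancellation is what allows the unit-norm weights used in the proofs to be replaced by the squared-norm weights of Definition 4.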
A.1.3 Deep networks The extension of sampled networks into several layers is not obvious, as the choice of weights in the first layer affects the sampling space for the weights in the next layer. This additional complexity raises the question, letting NL = [N1, N2, . . . , NL 1, NL], is the space S NL=1 FS L,NL dense in C(X, RNL+1), when L > 1? We aim to answer this question in this section. With dimensions in each layer being NL = [D, D, . . . , D, NL], we start by showing it holds for S NL=1 FS L,NL, i.e., the space of sampled networks with L hidden layers, and D neurons in the first L 1 layers and arbitrary width in the last layer. Lemma 5. Let NL = [D, D, . . . , D, NL] and L > 1. Then S NL=1 FS L,NL is dense in C(X, RNL+1). Proof. Let ϵ > 0, and g C(X, RNL+1). Basic linear algebra implies there exists a set of D linearly independent unit vectors v = {vj}D j=1, such that X span{vj}D j=1, with vj X for all vj v. For l = 1 and i {1, 2, . . . , Nl}, the bias is set to bl,i = inf{ vi, x : x X}. Due to continuity and compactness, we can set x(1) l 1,i to correspond to an x X such that vi, x(1) l 1,i = bl,i. We must have x(1) l 1,i X otherwise the inner product between some points in X and vi is smaller than the bias, which contradicts the construction of bl,i. We now need to show that ζ(x(1) l 1,i) x(1) l 1,i = c vi+x(1) l 1,i for c R>0, to proceed constructing x(2) l 1,i in similar fashion as in Lemma 3. Let U = BϵI(ζ(x(1) l 1,i)), with x(1) l 1,i U. Also, U is the only closed ball with center ζ(x(1) l 1,i) and {x(1) l 1,i} = U X otherwise ζ would not give a unique element, which is guaranteed by Definition 5. We define the hyperplane Ker = {x RD : vi, x vi, x(1) l 1,i = 0}, and the projection matrix P from RD onto Ker. Let x = Pζ(x(1) l 1,i). If x = x(1) l 1,i, then x int U, as x(1) l 1,i U and P is a unique projection minimizing the distance. As x int U, means there is an open ball around x , where there are points in BϵI(ζ(x(1) l 1,i)) separated by Ker, which implies vi, x(1) l 1,i is not the minimum. This is a contradiction, and hence x = x(1) l 1,i. This means Ker is a tangent hyperplane, and as any vectors along the hyperplane is orthogonal to ζ(x(1) l,i ) x(1) l,i , implies the angle between vi and ζ(x(1) l 1,i) x(1) l 1,i is 0 or π. As vi, x(1) l 1,i is the minimum, means there exist a c R>0, such that ζ(x(1) l 1,i) x(1) l 1,i = c vi + x(1) l 1,i. We may now construct x(2) l 1,i = x(1) l 1,i + c vi, assured that x(2) l 1,i X due to the last paragraph. We then have wl,i = vi vi and bl,i = vi, x(1) l 1,i for all neurons in the first hidden layer. The image after this first hidden layer Ψ(l)(X) is injective, and therefore bijective. To show this, let u1, u2 X such that u1 = u2. As the vectors in v spans X, means there exists two unique set of coefficients, c1, c2 RD, such that u1 = P c(i) 1 vi and u2 = P c(i) 2 vi. We then have Φ(l)(u1) = V c1 bl = V c2 bl = Φ(l)(u2), where V is the Gram matrix of v. As vectors in v are linearly independent, V is positive definite, and combined with c1 = c2, this implies the inequality. That means Φ(l) is a bijective mapping of X. As the mapping is bijective and continuous, we have that for any x X , there is an ϵ(l) I > 0, such that Bϵ I(Φ(l)(x)) X. For 1 < l < L, we repeat the procedure, but swap X with Xl 1. As Φ(l 1) is a bijective mapping, we may find similar linear independent vectors and construct similar points x(1) l 1,i, x(2) l 1,i, but now with noise level ϵ(l) I . 
For $l = L$, as we have a subset of $X_{L-1}$ that is a closed ball around each point $\Phi^{(L-1)}(x)$, for every $x \in X'$, we can proceed by constructing the last hidden layer and the last layer in the same way as explained when proving Theorem 4. The only caveat is that we are approximating a network with one hidden layer on the domain $X_{L-1}$, and the function we approximate is $\tilde{g} = g \circ [\Phi^{(L-1)}]^{-1}$. Given this, denoting by $\Phi^{(L:L+1)}$ the composition of the last hidden layer and the last layer, there exists a number of nodes, weights, and biases in the last hidden layer and the last layer such that $\|g - \Phi\| = \|\tilde{g} - \Phi^{(L:L+1)}\| < \epsilon$, due to the construction above and Theorem 4. As $\Phi$ is a unit sampled network, it follows by Lemma 4 that $\bigcup_{N_L=1}^{\infty} \mathcal{F}^S_{L,N_L}$ is dense in $C(X, \mathbb{R}^{N_{L+1}})$.

We can now prove that sampled networks with $L$ layers, with different dimensions in all layers and arbitrary width in the last hidden layer, are universal approximators, with the obvious caveat that each hidden layer $l = 1, 2, \dots, L - 1$ needs at least $D$ neurons; otherwise we lose information regardless of how we construct the network.

Theorem 5. Let $N_L = [N_1, N_2, \dots, N_{L-1}, N_L]$, where $\min\{N_l : l = 1, 2, \dots, L - 1\} \geq D$, and $L \geq 1$. Then $\bigcup_{N_L=1}^{\infty} \mathcal{F}^S_{L,N_L}$ is dense in $C(X, \mathbb{R}^{N_{L+1}})$.

Proof. Let $\epsilon > 0$ and $g \in C(X, \mathbb{R}^{N_{L+1}})$. For $L = 1$, Theorem 4 is enough, and we therefore assume $L > 1$. We start by constructing a network $\tilde{\Phi} \in \bigcup_{N_L=1}^{\infty} \mathcal{F}^S_{L,\tilde{N}_L}$, where $\tilde{N}_L = [D, D, \dots, D, N_L]$, according to Lemma 5, such that $\|\tilde{\Phi} - g\| < \epsilon$. To construct $\Phi \in \bigcup_{N_L=1}^{\infty} \mathcal{F}^S_{L,N_L}$, let $l = 1$, and start by constructing weights/biases for the first $D$ nodes according to $\tilde{\Phi}$. For the additional nodes in the first hidden layer, select an arbitrary direction $w$. Let $\tilde{X}_1 = \{\tilde{x}^{(j)}_{1,i} : i = 1, 2, \dots, D \text{ and } j = 1, 2\}$ be the set of all points needed to construct the $D$ neurons in the next layer of $\tilde{\Phi}$. Then for each additional node $i = D + 1, \dots, N_1$, we set
$$x^{(1)}_{0,i} = \arg\max\{\langle w, x \rangle : x \in X \text{ and } \Phi^{(1)}(x) \in \tilde{X}_1\},$$
and choose $x^{(2)}_{0,i} \in X$ such that $x^{(2)}_{0,i} - x^{(1)}_{0,i} = a w$, where $a \in \mathbb{R}_{>0}$, similar to what is done in the proof of Lemma 3. Using these points to define the weights and biases of the last $N_1 - D$ nodes, the space $X_1$ now contains the points $[\tilde{x}^{(j)}_{1,i}, 0, 0, \dots, 0]$, for $j = 1, 2$ and $i = 1, 2, \dots, D$. For $1 < l < L$, repeat the process above, setting $x^{(j)}_{l,i} = [\tilde{x}^{(j)}_{l,i}, 0, 0, \dots, 0]$ for the first $D$ nodes, and construct the weights of the additional nodes as described above, but with the sampling space being $X_{l-1}$. When $l = L$, set the number of nodes to the same as in $\tilde{\Phi}$, and choose the points to construct the weights and biases as $x^{(j)}_{L-1,i} = [\tilde{x}^{(j)}_{L-1,i}, 0, 0, \dots, 0]$, for $j = 1, 2$ and $i = 1, 2, \dots, N_L$. The weights and biases in the last layer are the same as in $\tilde{\Phi}$. This implies $\|\Phi - g\| = \|\tilde{\Phi} - g\| < \epsilon$, and thus $\bigcup_{N_L=1}^{\infty} \mathcal{F}^S_{L,N_L}$ is dense in $C(X, \mathbb{R}^{N_{L+1}})$.

Remark 5. It is important to note that the proof only shows existence, and that we expect networks to have a more interesting representation after the first $L - 1$ layers. With this theorem, we can conclude that stacking layers is not necessarily detrimental to the expressiveness of the networks, even though it may alter the sampling space in non-trivial ways. Empirically, we also confirm this, with several cases performing better with deep networks, very similar to iteratively trained neural networks.

Corollary 1. $\mathcal{F}^S_{\infty,\infty}$ is dense in $C(X, \mathbb{R}^{N_{L+1}})$.

A.1.4 Networks with a single hidden layer, tanh activation

We now turn to using tanh as the activation function, which we find useful both for prediction tasks and when we need the activation function to be smooth.
We will use the results for sampled networks with Re LU as activation function, and show we can arbitrarily well approximate these. The reason for this, instead of using arbitrary network with Re LU as activation function, is that we are using weights and biases of the correct form in the former, such that the tanh networks we construct will more easily have the correct form. We set s2 = ln(3) 2 , s1 = 2 s2 as already discussed and let ψ be the tanh function, with Ψ being neural networks with ψ as activation function simply to separate from the Re LU ϕ, as we are using both in this section. Note that Φ is still network with Re LU as activation function and s1 = 1, s2 = 0. We start by showing how a sum of tanh functions can approximate a set of particular functions. Lemma 6. Let f : [c0, c M+1] R+, defined as f(x) = PM i=1 ai 1[ci,c M+1](x), with 1 being the indicator function, c0 < c1 < < c M < c M+1, and for all i = 0, 1, . . . , M + 1, ci R and ai R>0. Then there exists strictly positive scalars ω = {ωi}M i=1 such that i=1 gi(x) = 2 [ψ(ωi(x ci) s2) + 1] , fulfills f(ci 1) < g(x) < f(ci+1) whenever x [ci 1, ci+1], for all i = 1, 2, . . . , M. Proof. We start by observing that both functions, f and g, are increasing, with the latter strictly increasing. We also have that f(c0) = 0 < g(x) < PM i=1 ai = f(c M), for all x [c0, c M+1], regardless choice of ω. We then fix constants 3 4ai M i 0 < ϵi+1 < for i = 1, 2, . . . , M 1. We have, due to s2, that gi(ci) = ai 4 , for all i = 1, 2, . . . , M. In addition, we can always increase ωi to make sure gi(ci+1) is large enough, and gi(ci 1) small enough for our purposes, as ψ is bijective and strictly increasing. We set ω1 large enough such that g1(cj) > a1 ϵj, for all j = 2, 3, . . . , M 1. For i = 2, 3, . . . , M 1, we set ωi large enough such that gi(cj) < δj, where j = 1, 2, . . . , i 1, and gi(cj) > ai ϵj, where j = i + 1, i + 2, . . . , M 1. Finally, let ωM be large enough such that g M(cj) < δj, for j = 1, 2, . . . , M 1. With the strictly increasing property of every gi, we see that j=1 gj(ci) + ai j=i+1 gj(ci) j=1 aj + ai j=1 aj + ai j=1 gj(ci) + ai j=i+1 gj(ci) j=1 (aj ϵi) + ai 4 = f(ci 1). Combing the observations at the start with f(ci 1) < g(ci) < f(ci), for i = 1, 2, . . . , M, and the property sought after follows quickly. We can now show that we can approximate a neuron with Re LU ϕ activation function and with unit sampled weights arbitrarily well. Lemma 7. Let ˆx(1), ˆx(2) X and ˆw2 R. For any ϵ > 0, there exist a M N>0, and M pairs of distinct points {(x(2) i , x(1) i ) X X}M i=1, such that ˆw2ϕ( ˆw1, x ˆx(1) ) i=1 wi h ψ D wi, x x(1) i E s2 + 1 i < ϵ, where ˆw1 = ˆx(2) ˆx(1) ˆx(2) ˆx(1) , wi R, and wi = s1 x(2) i x(1) i x(2) i x(1) i 2 . Proof. Let ϵ > 0 and, w.l.o.g., ˆw2 > 0. Let B = { ˆw1, x : x X}, as well as ˆf(x) = ˆw2ϕ( ˆw1, x ˆx(1)). We start by partitioning ˆf into ϵ/4-chunks. More specifically, let c = max B, and M = l 4 ˆ w2 |c| ϵ m . Set dk = (k 1)c M , for k = 1, 2, . . . , M , M + 1. We will now define points cj, with the goal of constructing a function f as in Lemma 6. Still, because we require cj B to define biases in our tanh functions later, we must define the cjs iteratively, by setting c1 = 0, k = j = 2, and define every other cj as follows: 1. If dk = c, we are done, otherwise proceed to 2. d k = sup B [cj 1, dk] d k = inf B [dk, c]. 3. If d k = d k, set cj = dk, and increase j and k, and go to 1. 4. If d k > cj 1, set cj = d k, and increase j, otherwise discard the point d k. 5. 
Set cj = d k, and increase j. Set k = arg min{dk cj : dk cj > 0} and go to 1. We have M < 2 M points, and can now construct the ajs of f. For j = 1, 2, . . . , M, with c M+1 = c, let ( ˆf(cj+1) ˆf(cj), cj+1 cj c M ˆf(ρ(cj)) ˆf(cj), otherwise, where ρ(cj) = arg min{dk cj : dk cj 0 and k = 1, 2, . . . , M +1}. Note that 0 < c c M c M , by Definition 5 and continuity of the inner product ˆw1, x(1) . We then construct f as in Lemma 6, namely, i=1 ai1[ci,c M+1](x). Letting C = {[cj, cj+1]: j = 0, 1, . . . , M and cj+1 cj c M }, it is easy to see |f(x) ˆf(x)| < ϵ for all x S c C c . For any x outside said set, we are not concerned with, as it is not part of B, and hence nothing from X is mapped to said points. Construct ω = {ωi}M i=1 according to Lemma 6. We will now construct a sum of tanh functions, using only weights/biases allowed in sampled networks. For all i = 1, 2, . . . , M, define wi = ai 2 and set x(1) i = η(x, δi), with δi 0 and x X, such that ˆw1, x = ci where δi = 0 iff x / X. We specify δi and ϵ i > 0 such that s1 ϵ i ωi, B2ϵ i(x(1) i ) X, and |ψ( wi, x x(1) i s2) ψ( wi, x ci s2)| < ϵ 4| wi|M , with wi = s1 x(2) i x(1) i x(2) x(1) 2 , x(2) i = x(1) i + ϵ i ˆw1, and for all x X. It is clear that x(1) i , x(2) i X. We may now rewrite the sum of tanh functions as i=1 wiψ wi, x x(1) i s2 + wi i=1 wiψ s1 ϵ i ˆw1 ϵ i ˆw1 ϵ i ˆw1 , x x(1) i ϵ i ˆw1, x ci s2 2 [ψ ( ωi ˆw1, x ci s2) + 1], where ci = ˆw1, x(1) i . As ωi ωi for all i = 1, 2, . . . , M, it follows from Lemma 6 and the construction above that |g(x) ˆf(x)| |g(x) 2 [ψ ( ωi ˆw1, x ci s2) + 1]| 2 [ψ ( ωi ˆw1, x ci s2) + 1] f(x)| + |f(x) ˆf(x)| We are now ready to prove that sampled networks with one hidden layer with tanh as activation function are universal approximators. Theorem 6. Let FS 1, be the set of all sampled networks with one hidden layer of arbitrary width and activation function ψ. F S 1, is dense in C(X, RN2), with respect to the uniform norm. Proof. Let g C(X, RN2), ϵ > 0, and w.l.o.g., N2 = 1. By Theorem 4, we know there exists a network Φ, with ˆN1 neurons and parameters { ˆw1,i, ˆw2,i,ˆb1,i,ˆb2,i} ˆ N1 i=1, and Re LU as activation function, such that Φ g < ϵ 2. We can then construct a new network Ψ, with ψ as activation function, where for each neuron Φ(1,n), we construct Mi neurons in Ψ, according to Lemma 7, with ϵ 2 ˆ N1 . Setting the biases in last layer of Ψ based on Φ, i.e., for every i = 1, 2, . . . , ˆN1, b2,j = ˆb2,i where j = 1, 2, . . . , Mi. We then have, letting N1 be the number of neurons in Ψ, |Ψ(x) Φ(x)| = i=1 w2,iΨ(1,i)(x) b2,i i=1 ˆw2,iΦ(1,i)(x) ˆb2,i j=1 w2,jψ( w1,j, x b1,j) ˆw2,iϕ( ˆw1,i, x ˆb1,i) j=1 w2,jψ( w1,j, x b1,j) ˆw2,iϕ( ˆw1,i, x ˆb1,i) for all x X. The last inequality follows from Lemma 7. This implies that Ψ g Ψ Φ + Φ g < ϵ and F S 1, is dense in C(X, RN2). A.2 Barron spaces Working with neural networks and sampling makes it very natural to connect our theory to Barron spaces [2, 20]. This space of functions can be considered a continuum analog of neural networks with one hidden layer of arbitrary width. We start by considering all functions f : X R that can be written as Ω w2ϕ( w1, x b)dµ(b, w1, w2), where µ is a probability measure over (Ω, ΣΩ), with Ω= R RD R. A Barron space Bp is equipped with a norm of the form, f Bp = inf µ {Eµ[|w2|p( w1 1 + |b|)p]1/p}, 1 p , taken over the space of probability measure µ over (Ω, ΣΩ). When p = , we have f B = inf µ max (b,w1,w2) supp(µ){|w2|( w1 1 + |b|)}. 
The Barron space can then be defined as
$$\mathcal{B}_p = \Big\{f : f(x) = \int_\Omega w_2\,\phi(\langle w_1, x \rangle - b)\, d\mu(b, w_1, w_2) \text{ and } \|f\|_{\mathcal{B}_p} < \infty\Big\}.$$
For any $1 \leq p \leq \infty$ we have $\mathcal{B}_p = \mathcal{B}_\infty$, and so we may drop the subscript $p$ [20]. Given our previous results, we can easily show approximation bounds between our sampled networks and Barron functions.

Theorem 7. Let $f \in \mathcal{B}$ and $X = [0, 1]^D$. For any $N_1 \in \mathbb{N}_{>0}$, $\epsilon > 0$, and an arbitrary probability measure $\pi$, there exists a sampled network $\Phi$ with one hidden layer, $N_1$ neurons, and ReLU activation function, such that
$$\|f - \Phi\|_2^2 = \int_X |f(x) - \Phi(x)|^2\, d\pi(x) < \frac{(3 + \epsilon)\,\|f\|_{\mathcal{B}}^2}{N_1}.$$

Proof. Let $N_1 \in \mathbb{N}_{>0}$ and $\epsilon > 0$. By E et al. [20], we know there exists a network $\hat{\Phi} \in \mathcal{F}_{1,N_1}$, where $\hat{\Phi}(\cdot) = \sum_{i=1}^{N_1} \hat{w}_{2,i}\,\phi(\langle \hat{w}_{1,i}, \cdot \rangle - \hat{b}_{1,i})$, such that $\|f - \hat{\Phi}\|_2^2 \leq \frac{3\|f\|_{\mathcal{B}}^2}{N_1}$. By Theorem 4, letting $X' = [\epsilon_I, 1 - \epsilon_I]^D$, we know there exists a network $\Phi \in \mathcal{F}^S_{1,N_1}$ such that $\|\Phi - \hat{\Phi}\| < \sqrt{\frac{\epsilon \|f\|_{\mathcal{B}}^2}{N_1}}$. We do not need constant blocks, because $X$ is connected. We then have
$$\|f - \Phi\|_2^2 \leq \int_X |\Phi(x) - \hat{\Phi}(x)|^2\, d\pi(x) + \int_X |\hat{\Phi}(x) - f(x)|^2\, d\pi(x) < \frac{\epsilon \|f\|_{\mathcal{B}}^2}{N_1} \int_X d\pi(x) + \frac{3\|f\|_{\mathcal{B}}^2}{N_1} = \frac{(3 + \epsilon)\|f\|_{\mathcal{B}}^2}{N_1}.$$

A.3 Distribution of sampled networks

In this section we prove certain invariance properties of sampled networks and our proposed distribution. First, we define the distribution used to sample the pairs of points, or equivalently, the parameters. Note that $X$ is now any compact subset of $\mathbb{R}^D$, as long as $\lambda_D(X) > 0$, where $\lambda_D$ is the $D$-dimensional Lebesgue measure. As we are in a supervised setting, we assume access to values of the true function $f$, and define $Y = f(X)$. We also choose a norm $\|\cdot\|_Y$ over $\mathbb{R}^{N_{L+1}}$, and norms $\|\cdot\|_{X_{l-1}}$ over $\mathbb{R}^{N_{l-1}}$, for each $l = 1, 2, \dots, L$. For the experimental part of the paper, we choose the $L^\infty$ norm for $\|\cdot\|_Y$ and, for $l = 1, 2, \dots, L$, the $L^2$ norm for $\|\cdot\|_{X_{l-1}}$. We also denote $\bar{N} = \sum_{l=1}^L N_l$ as the total number of neurons in a given network, and $\bar{N}_l = \sum_{k=1}^l N_k$. Due to the nature of sampled networks and because we sample each layer sequentially, we start by giving a more precise definition of the conditional density given in Definition 2. As a pair of points from $X$ identifies a weight and bias, we need a distribution over $X^{2\bar{N}}$, and for each layer $l$ we condition on sets of $2\bar{N}_{l-1}$ points, which then parameterize the network $\Phi^{(l-1)}$ by constructing weights and biases according to Definition 4.

Definition 7. Let $X$ be compact, $\lambda_D(X) > 0$, and $f\colon \mathbb{R}^D \to \mathbb{R}^{N_{L+1}}$ be Lipschitz continuous w.r.t. the metric spaces induced by $\|\cdot\|_Y$ and $\|\cdot\|_{X_0}$. For any $l \in \{1, 2, \dots, L\}$, setting $\epsilon = 0$ when $l = 1$ and otherwise $\epsilon > 0$, we define
$$q^\epsilon_l\big(x^{(1)}_0, x^{(2)}_0 \,\big|\, X_{l-1}\big) = \begin{cases} \dfrac{\|f(x^{(2)}_0) - f(x^{(1)}_0)\|_Y}{\max\{\|x^{(2)}_{l-1} - x^{(1)}_{l-1}\|_{X_{l-1}},\, \epsilon\}}, & x^{(1)}_{l-1} \neq x^{(2)}_{l-1}, \\ 0, & \text{otherwise}, \end{cases}$$
where $x^{(1)}_0, x^{(2)}_0 \in X$, $x^{(1)}_{l-1} = \Phi^{(l-1)}(x^{(1)}_0)$, and $x^{(2)}_{l-1} = \Phi^{(l-1)}(x^{(2)}_0)$, with the network $\Phi^{(l-1)}$ parameterized by pairs of points in $X^{\bar{N}_{l-1}}$. Then, we define the integration constant $C_l = \int_{X \times X} q^\epsilon_l\, d\lambda$. The $l$-layered density $p^\epsilon_l$ is defined as
$$p^\epsilon_l = \begin{cases} \frac{q^\epsilon_l}{C_l}, & \text{if } C_l > 0, \\ \frac{1}{\lambda_{2D}(X \times X)}, & \text{otherwise}. \end{cases} \qquad (6)$$

Remark 6. The added $\epsilon$ is there to ensure the density is bounded, but it is not needed for the first layer, due to the Lipschitz assumption. Adding $\epsilon > 0$ for $l = 1$ is both unnecessary and affects the equivariance/invariance results in Theorem 8. We drop the $\epsilon$ superscript wherever it is unambiguously included.
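To make Definition 7 concrete, the following sketch evaluates the unnormalized density $q^\epsilon_l$ for a batch of candidate pairs and draws pairs proportionally to it. The function names, the uniform candidate proposal, and the toy target are illustrative assumptions, and the normalization over $X \times X$ is left out.

```python
import numpy as np

def unnormalized_pair_density(x1_0, x2_0, x1_prev, x2_prev, f, eps):
    """q_l^eps of Definition 7 for one candidate pair (x1_0, x2_0):
    ||f(x2_0) - f(x1_0)||_inf / max(||x2_prev - x1_prev||_2, eps),
    and 0 if the pair collapses after the first l-1 layers."""
    gap = np.linalg.norm(x2_prev - x1_prev)
    if gap == 0.0:
        return 0.0
    return np.linalg.norm(f(x2_0) - f(x1_0), ord=np.inf) / max(gap, eps)

# Illustration for the first layer (eps = 0, x_prev = x_0): score uniformly
# proposed pairs and sample pairs proportionally to the resulting weights.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
f = lambda x: np.atleast_1d(np.sin(3.0 * x[..., 0]) * x[..., 1])
i, j = rng.integers(0, len(X), 500), rng.integers(0, len(X), 500)
q = np.array([unnormalized_pair_density(X[a], X[b], X[a], X[b], f, eps=0.0)
              for a, b in zip(i, j)])
picks = rng.choice(len(q), size=10, p=q / q.sum())   # indices of sampled pairs
```

Pairs with a large change in the target relative to their distance receive high weight, which is exactly the "place weights where the gradients are large" behavior discussed in the main paper.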
We can now use this definition to define a probability distribution over the whole parameter space $X^{2\bar{N}}$, i.e., given an architecture, provide a distribution over all weights and biases the network requires. Let $\tilde{X} = X^{2\bar{N}}$ be the sampling space of $P$ with the product topology, and $\tilde{D} = 2\bar{N}D$ the dimension of this space. Since all parameters are defined through points of $X \times X$, we may choose an arbitrary ordering of the points, which means one set of weights and biases for the whole network can be written as $\{x^{(1)}_i, x^{(2)}_i\}_{i=1}^{\bar{N}}$.

Definition 8. The probability distribution $\rho_f$ over $\tilde{X}$ has density $p$,
$$p\big(x^{(1)}_1, x^{(2)}_1, x^{(1)}_2, x^{(2)}_2, \dots, x^{(1)}_{\bar{N}}, x^{(2)}_{\bar{N}}\big) = \prod_{l=1}^{L}\prod_{i=1}^{N_l} p_l\big(x^{(1)}_{\bar{N}_{l-1}+i}, x^{(2)}_{\bar{N}_{l-1}+i} \,\big|\, X_{l-1}\big),$$
with $X_{l-1} = \bigcup_{k=1}^{l-1}\bigcup_{j=1}^{N_k}\{x^{(1)}_{\bar{N}_{k-1}+j}, x^{(2)}_{\bar{N}_{k-1}+j}\}$.

It is not immediately clear from the above that $P$ is a valid distribution and, in particular, that the density is integrable. This is what we show next.

Proposition 1. Let $X \subseteq \mathbb{R}^D$ be compact, $\lambda_D(X) > 0$, and $f$ be Lipschitz continuous w.r.t. the metric spaces induced by $\|\cdot\|_Y$ and $\|\cdot\|_{X_0}$. For a fixed architecture $\{N_l\}_{l=1}^L$, the proposed function $p$ is integrable and
$$\int_{\tilde{X}} p\, d\lambda_{\tilde{D}} = 1.$$
It therefore follows that $P$ is a valid probability distribution.

Proof. We show for each $l = 1, 2, \dots, L$ that $p_l$ is bounded a.e. and is nonzero on at least one subset with nonzero measure. There exists $A \subseteq X \times X$ such that $p_l(A) \neq 0$ and $\lambda_{2D}(A) > 0$, as either $C_l > 0$ or $p_l$ is the uniform density by Definition 7, and $\lambda_{2D}(X \times X) > 0$ by assumption and the product topology. For $l = 1$, let $K_l > 0$ be the Lipschitz constant. Then $q_l(x^{(1)}, x^{(2)}) \leq K_l$ by the assumption on $f$, for all $x^{(1)}, x^{(2)} \in X$. When $l > 1$, for all $X_{l-1} \subseteq X^{2\bar{N}_{l-1}}$, there exists a constant $K_l > 0$, due to continuity and compactness, such that
$$p_l\big(x^{(1)}, x^{(2)} \,\big|\, X_{l-1}\big) \leq \max\Big\{\frac{K_l}{\epsilon}, \frac{1}{\lambda_{2D}(X \times X)}\Big\}.$$
As $p$ is a product of a finite number of such factors, we end up with
$$\int_{\tilde{X}} p\, d\lambda_{\tilde{D}} \leq \prod_{l=1}^{L}\Big(\frac{K_l}{\epsilon} + \frac{1}{\lambda_{2D}(X \times X)}\Big)^{N_l} \lambda_{\tilde{D}}(\tilde{X}) < \infty,$$
using the fact that $0 < \lambda_D(X) < \infty$ and that $\lambda_{\tilde{D}}$ is the product measure, which implies $0 < \lambda_{\tilde{D}}(\tilde{X}) < \infty$. Due to the normalization constants added to $q_l$, we see that each $p_l$ integrates to one. This means $P$ is a valid distribution over $\tilde{X}$, with implied independence between the neurons in the same layer.

One special property of sampled networks, and in particular of the distribution $P$, is their invariance under both linear isometries and translations (together forming the Euclidean group, i.e., rigid body transformations), as well as scaling. We denote the set of possible transformations as $\mathcal{H}(D) = (\mathbb{R}\setminus\{0\}) \times O(D) \times \mathbb{R}^D$, with $O(D)$ being the orthogonal group of dimension $D$. For $(a, A, c) = H \in \mathcal{H}(D)$, we write $H(x) = aAx + c$, where $x \in X$. The space $\mathcal{H}_f(D) \subseteq \mathcal{H}(D)$ consists of all transformations such that $f\colon X \to \mathbb{R}^{N_{L+1}}$ is equivariant with respect to the transformation, with the underlying metric space given by $\|\cdot\|_Y$. That is, for any $H \in \mathcal{H}_f(D)$, there exists an $H' \in \mathcal{H}(N_{L+1})$ such that $f(H(x)) = H'(f(x))$, where $x \in X$, and the orthogonal matrix part of $H'$ is isometric w.r.t. $\|\cdot\|_Y$. Note that often $H'$ will be the identity transformation, for example when the transformed data has the same labels. When $H'$ is the identity function, we say $f$ is invariant with respect to $H$. In the next result, we assume we choose norms $\|\cdot\|_{X_0}$ and $\|\cdot\|_Y$ such that the orthogonal matrix part of $H$ is isometric w.r.t. those norms and the canonical norm $\|\cdot\|$, and we continue to assume that $f$ is Lipschitz continuous.

Theorem 8. Let $H \in \mathcal{H}_f(D)$, and let $\Phi, \hat{\Phi}$ be two sampled networks with the same number of layers $L$ and neurons $N_1, \dots, N_L$, where $\Phi\colon X \to \mathbb{R}^{N_{L+1}}$, $\hat{\Phi}\colon H(X) \to \mathbb{R}^{N_{L+1}}$, and $f\colon X \to \mathbb{R}^{N_{L+1}}$ is the true function. Then the following statements hold: (1) If $\hat{x}^{(1)}_{0,i} = H(x^{(1)}_{0,i})$ and $\hat{x}^{(2)}_{0,i} = H(x^{(2)}_{0,i})$ for all $i = 1, 2, \dots, N_1$, then $\Phi^{(1)}(x) = \hat{\Phi}^{(1)}(H(x))$ for all $x \in X$. (2) If $f$ is invariant w.r.t. $H$: $\Phi \in \mathcal{F}^S_{L,[N_1,\dots,N_L]}(X)$ if and only if $\hat{\Phi} \in \mathcal{F}^S_{L,[N_1,\dots,N_L]}(H(X))$, such that $\Phi(x) = \hat{\Phi}(H(x))$ for all $x \in X$.
(3) The probability measure $P$ over the parameters is invariant under $H$.

Proof. Let $H = (a, A, c)$. Assume we have sampled $\hat{x}^{(1)}_{0,i} = H(x^{(1)}_{0,i})$ and $\hat{x}^{(2)}_{0,i} = H(x^{(2)}_{0,i})$, for all $i = 1, 2, \dots, N_1$. The points sampled determine the weights and biases in the usual way, giving
$$\phi\big(\langle w_{1,i}, x - x^{(1)}_{0,i} \rangle\big) = \phi\Big(\Big\langle \frac{x^{(2)}_{0,i} - x^{(1)}_{0,i}}{\|x^{(2)}_{0,i} - x^{(1)}_{0,i}\|^2}, x - x^{(1)}_{0,i} \Big\rangle\Big) = \phi\Big(\Big\langle \frac{A(x^{(2)}_{0,i} - x^{(1)}_{0,i})}{\|A(x^{(2)}_{0,i} - x^{(1)}_{0,i})\|^2}, A(x - x^{(1)}_{0,i}) \Big\rangle\Big) = \phi\Big(\Big\langle \frac{aA(x^{(2)}_{0,i} - x^{(1)}_{0,i})}{\|aA(x^{(2)}_{0,i} - x^{(1)}_{0,i})\|^2}, aA(x - x^{(1)}_{0,i}) \Big\rangle\Big) = \phi\Big(\Big\langle \frac{aAx^{(2)}_{0,i} + c - aAx^{(1)}_{0,i} - c}{\|aAx^{(2)}_{0,i} + c - aAx^{(1)}_{0,i} - c\|^2}, aAx + c - aAx^{(1)}_{0,i} - c \Big\rangle\Big) = \phi\Big(\Big\langle \frac{H(x^{(2)}_{0,i}) - H(x^{(1)}_{0,i})}{\|H(x^{(2)}_{0,i}) - H(x^{(1)}_{0,i})\|^2}, H(x) - H(x^{(1)}_{0,i}) \Big\rangle\Big) = \phi\big(\langle \hat{w}_{1,i}, \hat{x} - \hat{x}^{(1)}_{0,i} \rangle\big)$$
for all $x \in X$, $\hat{x} = H(x) \in H(X)$, and $i = 1, 2, \dots, N_1$, which implies that (1) holds.

Assuming $f$ is invariant w.r.t. $H$, then for any $\Phi \in \mathcal{F}^S_{L,[N_1,\dots,N_L]}(X)$, let $\bar{X} = \{x^{(1)}_{0,i}, x^{(2)}_{0,i}\}_{i=1}^{N_1}$ be the pairs used in the first layer. We can then choose $H(\bar{X})$ as the points to construct the weights and biases in the first layer of $\hat{\Phi}$, and by (1), we have $X_1 = \Phi^{(1)}(X) = \hat{\Phi}^{(1)}(H(X)) = \hat{X}_1$. As the sampling space for the next layer is then the same, we see that we can choose the points for the weights and biases of $\hat{\Phi}$ such that $X_l = \hat{X}_l$ for $l = 1, 2, \dots, L$. As the input to the final layer is also the same, by the same argument the weights in the last layer must be the same, because the loss function $\mathcal{L}$ in Definition 4 is the same due to the invariance assumption. Thus, $\Phi(x) = \hat{\Phi}(H(x))$ for all $x \in X$. As $H$ is bijective, the same must hold when starting with $\hat{\Phi}$ and constructing $\Phi$, and we conclude that (2) holds.

For any distinct points $x^{(1)}, x^{(2)} \in X$, let $(a', A', c')$ be such that $f(aAx + c) = a'A'f(x) + c'$, and let $\hat{C}_l, C_l$ be the normalization constants of the conditional density $p_l$ for $H(X)$ and $X$, respectively. For the conditional density with $l = 1$, we have
$$p_1\big(H(x^{(1)}), H(x^{(2)})\big) = \frac{1}{\hat{C}_1}\,\frac{\|f(aAx^{(2)} + c) - f(aAx^{(1)} + c)\|_Y}{\|aAx^{(2)} + c - aAx^{(1)} - c\|_{X_0}} = \frac{1}{\hat{C}_1}\,\frac{\|a'A'f(x^{(2)}) + c' - a'A'f(x^{(1)}) - c'\|_Y}{\|aAx^{(2)} - aAx^{(1)}\|_{X_0}} = \frac{|a'|}{|a|\,\hat{C}_1}\,\frac{\|f(x^{(2)}) - f(x^{(1)})\|_Y}{\|x^{(2)} - x^{(1)}\|_{X_0}} = \frac{|a'|\,C_1}{|a|\,\hat{C}_1}\,p_1\big(x^{(1)}, x^{(2)}\big).$$
With similar calculations, we have
$$\frac{|a|}{|a'|}\,\hat{C}_1 = \frac{|a|}{|a'|}\int_{X \times X} q_1\big(H(x), H(z)\big)\, dx\, dz = \int_{X \times X} q_1(x, z)\, dx\, dz = C_1.$$
Hence, the conditional probability distribution for the first layer is invariant under $H$, and then by (1) and (2), the possible sampling spaces are equal for the following layers; therefore the conditional distributions for each layer are the same, and $P$ is invariant under $H$.

Remark 7. Point (2) in the theorem requires $f$ to be invariant w.r.t. $H$. This is due to the fact that the parameters in the last layer minimize the implicitly given loss function $\mathcal{L}$, seen in Definition 4. We have rarely discussed the loss function, as it depends on which function we are approximating. Technically, (2) also holds as long as the loss function is not altered by $H$ in a way that changes the final layer, but we simplified the statement to invariance, as this is also the most likely case to encounter in practice. The loss function also appears in the proofs given in Appendix A.1 and Appendix A.2. Both implicitly use the loss function based on the uniform norm.

B Illustrative example: approximating a Barron function

All computations for this experiment were performed on an Intel Core i7-7700 CPU @ 3.60GHz x 8 with 32GB RAM. We compare random Fourier features and our sampling procedure on a test function for neural networks [64]:
$$f(x) = \sqrt{3/2}\,\big(\|x - a\|_{\ell^2} - \|x + a\|_{\ell^2}\big),$$
with the constant vector $a \in \mathbb{R}^d$ defined by $a_j = 2j/d - 1$.
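For reproducibility, the test function and the training data described here and below can be generated along the following lines; this is a hedged sketch (the indexing j = 1, ..., d of the vector a is our reading of the definition), not the experiment code.

```python
import numpy as np

def barron_test_function(x):
    """f(x) = sqrt(3/2) * (||x - a|| - ||x + a||), with a_j = 2j/d - 1 [64]."""
    d = x.shape[-1]
    a = 2.0 * np.arange(1, d + 1) / d - 1.0
    return np.sqrt(1.5) * (np.linalg.norm(x - a, axis=-1)
                           - np.linalg.norm(x + a, axis=-1))

rng = np.random.default_rng(0)
d = 5
X_train = rng.uniform(-1.0, 1.0, size=(10_000, d))   # points in the cube [-1, 1]^d
X_test = rng.uniform(-1.0, 1.0, size=(10_000, d))
y_train, y_test = barron_test_function(X_train), barron_test_function(X_test)
```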
For all dimensions d, we sample 10,000 points uniformly at random in the cube [-1, 1]^d for training (i.e., sampling and solving the last, linear layer problem), and another 10,000 points for evaluating the error. We re-run the same experiment with five different random seeds, but with the same train/test datasets. This means the random seed only influences the weight distributions in sampled networks and random feature models, not the data. Figure 8 shows the relative L2 error in the full hyperparameter study, i.e., in dimensions d in {1, 2, 3, 4, 5, 10} (rows) and for tanh (left column) and sine activation functions (right column). For the random feature method, we always use the sine activation, because we did not find any data-agnostic probability distributions for tanh. Figure 9 shows the fit times for the same experiments, demonstrating that sampled networks are not slower than random feature models. This is obvious from the complexity analysis in Appendix F, and confirmed experimentally here. Interesting observations regarding the full hyperparameter study are that in one dimension, d = 1, random feature models outperform sampled networks and can even use up to two hidden layers. In higher dimensions, and with larger widths / more layers, sampled networks consistently outperform random features. In d = 10, a sampled network with even a single hidden layer is about one order of magnitude more accurate than the same network with normally distributed weights. The convergence rate of the error of sampled networks with respect to increasing layer widths is consistent over all dimensions and layers, sometimes even outperforming the theoretical bound of O(N^-1/2).

C Classification benchmark from OpenML

The Adam training was done on a GeForce 4x RTX 3080 Turbo GPU with 10 GB RAM, while sampling was performed on an Intel Core i7-7700 CPU @ 3.60GHz x 8 with 32GB RAM. We use all 72 datasets in the OpenML-CC18 Curated Classification benchmark from the OpenML Benchmarking Suites (CC-BY) [4], which is available on openml.org: https://openml.org/search?type=benchmark&sort=tasks_included&study_type=task&id=99. We use the OpenML Python API (BSD-3 Clause) [22] to download the benchmark datasets. The hyperparameters used in the study are listed in Table 3. From all datasets, we use at most 5000 points. This was done for all datasets before applying any sampled or gradient-based approach, because we wanted to reduce the training time when using the Adam optimizer. We pre-process the input features using one-hot encoding for categorical variables, and robust whitening (removal of the mean, division by the standard deviation, using the RobustScaler class from scikit-learn). Missing features are imputed with the median of the feature. All layers of all networks always have 500 neurons. For Adam, we employ early stopping with patience=3, monitoring the loss value. The least-squares regularization constant for sampling is 10^-10. We split the data sets for 10-fold cross-validation using stratified k-fold (StratifiedKFold class from scikit-learn), and report the average of the accuracy scores and fit times over the ten folds (Figure 4 in the main paper).
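A sketch of the pre-processing and cross-validation setup described above is given below; the split into categorical and numerical columns, and applying the median imputation only to the numerical columns, are assumptions made for this illustration.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

def make_preprocessor(categorical_cols, numerical_cols):
    """One-hot encode categorical features; impute missing values with the
    median and apply robust whitening (RobustScaler) to numerical features."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ])
    return ColumnTransformer([
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("numerical", numeric, numerical_cols),
    ])

# 10-fold stratified cross-validation; accuracy and fit time are averaged over folds.
cv = StratifiedKFold(n_splits=10)
```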
D Deep neural operators

D.1 Dataset

To generate an initial condition u0, we first sample five Fourier coefficients for the lowest frequencies. We sample both real and imaginary parts from a normal distribution with zero mean and standard deviation equal to five. We then scale the k-th coefficient by 1/(k + 1)^2 to create a smoother initial condition. For the first coefficient, corresponding to the zero frequency, we also set the imaginary part to zero, as our initial condition should be real-valued.

Figure 8: Relative L2 approximation error for random features (using sine activation) and sampling (left: using tanh activation, right: using sine activation), with input dimensions d = 1, 2, 3, 4, 5 and 10. Results are averaged over five runs with different random seeds.
Then, we use the real inverse Fourier transform with orthonormal normalization (np.fft.irfft) to generate an initial condition with a discretization over 256 grid points. To solve the Burgers equation, we use a classical numerical solver (scipy.integrate.solve_ivp) and obtain the solution u1 at t = 1. This way, we generate 15000 pairs (u0, u1) and split them between train (9000 pairs), validation (3000 pairs), and test sets (3000 pairs).

Figure 9: Fit times in seconds for random features (using sine activation) and sampling (left: using tanh activation, right: using sine activation), with input dimensions d = 1, 2, 3, 4, 5 and 10. Results are averaged over five runs with different random seeds.

Table 3: Network hyperparameters used in the OpenML benchmark study.
Shared: number of layers [1, 2, 3, 4, 5]; layer width 500; activation tanh.
Sampling: L2-regularization 10^-5.
Adam: learning rate 10^-3; max. epochs 100; early stopping patience 3; batch size 64; loss mean-squared error.

Table 4: Network hyperparameters used to train the deep neural operators.
Shared: number of layers [1, 2, 4]; layer width [256, 512, 1024, 2048, 4096]; number of modes 8, 16, 32; number of channels 16, 32; activation tanh.
Sampling: L2-regularization 10^-10, 10^-8, 10^-6.
Adam: learning rate 10^-5, 5*10^-5, 10^-4, 5*10^-4 (FCNs and POD-DeepONet) or 10^-4, 5*10^-4, 10^-3, 5*10^-3 (FNO); weight decay 0; max. epochs 2000 (FNO, FCN in signal space), 5000 (FCN in Fourier space), 90000 (POD-DeepONet); patience 100 (FNO, FCN in signal space), 200 (FCN in Fourier space), 4500 (POD-DeepONet); batch size 64 (FCNs and FNO) or full data (POD-DeepONet); loss mean-squared error (FCNs and POD-DeepONet) or mean relative H1-loss (FNO).
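The initial conditions of Appendix D.1 can be sampled along the following lines; this is a sketch under stated assumptions (in particular, how the five lowest frequencies are indexed and how the 1/(k+1)^2 damping is applied), not the actual data-generation script.

```python
import numpy as np

def sample_initial_condition(n_grid=256, n_modes=5, std=5.0, rng=None):
    """Draw the lowest n_modes Fourier coefficients from N(0, std^2), damp the
    k-th coefficient by 1/(k+1)^2, keep the zero-frequency mode real, and map
    back to signal space with the orthonormal inverse real FFT."""
    rng = np.random.default_rng() if rng is None else rng
    coeffs = np.zeros(n_grid // 2 + 1, dtype=complex)
    re, im = rng.normal(0.0, std, n_modes), rng.normal(0.0, std, n_modes)
    im[0] = 0.0                                   # initial condition must be real-valued
    k = np.arange(n_modes)
    coeffs[:n_modes] = (re + 1j * im) / (k + 1) ** 2
    return np.fft.irfft(coeffs, n=n_grid, norm="ortho")

u0 = sample_initial_condition(rng=np.random.default_rng(42))
# u1 is then obtained by integrating the Burgers equation with
# scipy.integrate.solve_ivp up to t = 1.
```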
D.2 Experiment setup

The Adam training was done on a GeForce 4x RTX 3080 Turbo GPU with 10 GB RAM, while sampling was performed on a 1x AMD EPYC 7402 @ 2.80GHz x 8 with 256GB RAM. We use the sacred package (https://github.com/IDSIA/sacred, MIT license, [37]) to conduct the hyperparameter search. We perform a grid search over several hyperparameters (where applicable): the number of layers/Fourier blocks, the width of the layers, the number of hidden channels, the regularization scale, the learning rate, and the number of frequencies/modes to keep. The full search grid is listed in Table 4.

Adam. With the exception of FNO, we use the mean squared error as the loss function when training with Adam. For FNO, we train with the default option, that is, with the mean relative H1-loss. This loss is based on the H1-norm, which for a one-dimensional function f is defined as ||f||_H1 = ||f||_L2 + ||f'||_L2. For all the experiments, we use the mean relative L2 error as the validation metric. We perform early stopping on the validation set and restore the model with the best metric.

DeepONet. We use the deepxde package (https://github.com/lululxvi/deepxde, LGPL-2.1 license, [45]) to define and train POD-DeepONet for our dataset. After defining the model, we apply an output transform by centering and scaling the output with 1/p, where p is the number of modes we keep in the POD. To compute the POD, we first center the data and then use np.linalg.eigh to obtain the modes. We add two callbacks to the standard training: early stopping and the model checkpoint. We use the default parameters for the optimization and only set the learning rate. The default training loop uses the number of iterations instead of the number of epochs, and we train for 90000 iterations with patience set to 4500. By default, POD-DeepONet uses all the available data as one batch. We ran several experiments with the batch size set to 64, but the results were significantly worse compared to the default setting. Hence, we stick to using the full data as one batch.

FNO. We use the neuralop package (https://github.com/neuraloperator/neuraloperator, MIT license, [42, 38]) with slight modifications. We changed trainer.py to add early stopping and to store the results of the experiments. Apart from these changes, we use the functionality provided by the library to define and run the experiments. We use batch size 64 and train for 2000 epochs with patience 100. In addition to the hyperparameters considered for the other architectures, we also searched over the number of hidden channels (32 or 64). We use the same number of channels for the lifting and projection operators.

FCN. For both fully-connected networks in our experiments, we use the PyTorch framework to define the models. For training a network in Fourier space, we prepend a forward Fourier transform before starting the training. During inference, we apply the inverse Fourier transform to compute the validation metric. We considered using the full data as one batch, but several experiments indicated that this change worsens the final metric. Hence, we use batch size 64 for both architectures.

Figure 10: The architecture of the sampled FNO.

Sampling FNO. Here, we want to give more details about the sampled FNO. First, we start with the input u0 and append the coordinate information to it as an additional channel. Then, we lift this input to a higher-dimensional channel space by drawing 1x1 convolution filters from a normal distribution. Then, we apply a Fourier block, which consists of fully-connected networks in Fourier space for each channel.
After applying the inverse Fourier transform, we add the input of the block to the restored signal as a skip connection. After applying several Fourier blocks, we learn another FCN in signal space to obtain the final prediction. A schematic depiction of the model is shown in Figure 10. We note that, to train the fully-connected networks in Fourier space, we apply the Fourier transform to the labels u1 and use them during sampling as target functions. Similarly to the Adam-trained FNO, we search over the number of hidden channels used for the lifting.

D.3 Results

The results of the hyperparameter search are shown in Figure 11. We see that the sampled FNO and the FCN in Fourier space perform well even with a smaller number of frequencies available, while DeepONet requires more modes to achieve comparable accuracy. We also see that adding more layers is not beneficial for sampled networks. A more important factor is the width of the layers: all of the sampled networks achieved the best results with the largest number of neurons. For the Adam-trained models, we note that we could not significantly improve the results by increasing the widths of the layers. Possibly, this is due to the limited range of learning rates considered in the experiments.

Figure 11: Comparison of different deep neural operators trained with Adam and with the SWIM algorithm. FNO trained with Adam has no dependency on a layer width; thus, we show its results as a straight line in all rows. For the hyperparameters not present in the plot, we chose the values that produce the lowest error on the validation set. We repeat each experiment three times and average the resulting metrics for plotting. The error is computed on the validation set.

E Transfer learning

This section contains details about the results shown in the main paper and the supporting hyperparameter search concerning the transfer learning task. The source and target tasks of transfer learning, the train-test split, and the pre-trained models are the same as in the main paper. We describe the software, hardware, data pre-processing, and details of the sampling and the Adam training approach. Note that the results shown here are not averaged over five runs, as was done for the plots in the main paper.

Software and hardware. We performed all the experiments concerning the transfer learning task in Python, using the TensorFlow framework. We used a GeForce 4x RTX 3080 Turbo GPU with 10 GB RAM for the Adam training and a 1x AMD EPYC 7402 socket with 24 cores @ 2.80GHz for sampling. We use the pre-trained models from Keras Applications, which are licensed under the Apache 2.0 License, available at https://github.com/keras-team/keras/blob/v2.12.0/LICENSE.

Pre-processing of data. The size of images in the ImageNet data set is larger than the ones in CIFAR-10, so we use bicubic interpolation to upscale the CIFAR-10 images to dimensions 224 x 224 x 3. We pre-process the input data for each model the same way it was pre-processed during the pre-training on ImageNet. Moreover, we apply some standard data augmentation techniques before training, such as vertical and horizontal shifts, rotation, and horizontal flips. After pre-processing, the images are first passed through the pre-trained classification layers and a global average pooling layer. The output of the global average pooling layer and the classification labels serve as the input and output data for the classification head, respectively.
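A minimal TensorFlow sketch of this feature-extraction step for the ResNet50 case is shown below; the resizing call and variable names are illustrative, and the same pattern applies to VGG19 and Xception with their respective preprocess_input functions.

```python
import tensorflow as tf

# Pre-trained backbone without its classification top, plus global average pooling.
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
pooling = tf.keras.layers.GlobalAveragePooling2D()

def extract_features(images):
    """Upscale CIFAR-10 images, apply the model's own pre-processing, and return
    pooled feature vectors used as inputs for the classification head."""
    x = tf.image.resize(images, (224, 224), method="bicubic")
    x = tf.keras.applications.resnet50.preprocess_input(x)
    return pooling(backbone(x, training=False))
```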
Details of the Adam training and sampling approaches  We find the weights and biases of the hidden layer of the classification head using both the proposed sampling algorithm and the usual gradient-based training. First, in the Adam training approach, we find the parameters of the classification head by iterative training with the Adam optimizer. We use a learning rate of $10^{-3}$, a batch size of 32, and train for 20 epochs with an early-stopping patience of 10 epochs. We store the parameters that yield the lowest loss on the test data. We use the tanh activation function for the hidden layer, the softmax activation function for the output layer, and the categorical cross-entropy loss function with no regularization. Second, in the sampling approach, we use the proposed sampling algorithm to sample the parameters of the hidden layer of the classification head. We use the tanh activation function for the hidden layer. Once the parameters of the hidden layer are sampled, an optimization problem for the coefficients of a linear output layer is solved. For this, the mean squared error loss function without regularization is used. Unless mentioned otherwise, this setup is used for all the experiments in this section.

E.1 Sampling vs. Adam training

Figure 12 compares the train and test accuracy for different widths (number of neurons in the hidden layer of the classification head) using three pre-trained neural network architectures, for the Adam training and sampling approaches. As in the main paper, we observe that for all the models, the sampling approach results in a higher test accuracy than the Adam training approach for sufficiently large widths. The sampling algorithm tends to over-fit for very large widths. The loss on the train data for the iterative training approach decreases with width, particularly for the Xception network. This suggests that an extensive hyperparameter search for the iterative training approach could yield higher classification accuracy on the train data. However, as shown in the main paper, the iterative training approach can be orders of magnitude slower than the sampling approach.

Figure 12: Train and test accuracy over the width of the classification head for the Adam training and sampling approaches. Left: ResNet50. Middle: VGG19. Right: Xception.

E.2 Sampling with tanh vs. ReLU activation functions

This sub-section compares the performance of the sampling algorithm used to sample the weights of the hidden layer of the classification head for the tanh and ReLU activation functions. Figure 13 shows that for ResNet50, the test accuracy with tanh is similar to that obtained with ReLU for widths smaller than 2048. As the width increases beyond 4096, the test accuracy with ReLU is slightly better than with tanh. However, for VGG19 and Xception, the test accuracy with tanh is higher than or equal to that with ReLU for all widths. On the training dataset, tanh and ReLU yield similar accuracies for all widths for ResNet50 and Xception. For VGG19, using the ReLU activation function yields much lower accuracies on the train and test data sets, especially as the width of the fully connected layer is increased. Thus, we observe that the sampling algorithm with the tanh activation function is more suitable for classification tasks than with ReLU.

Figure 13: Train and test accuracy of the sampled classification head with tanh and ReLU activation functions. Left: ResNet50. Middle: VGG19. Right: Xception.
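For concreteness, the following is a minimal sketch of the two classification-head setups compared in this section. The width, the variable names, and the helper fit_readout are our own illustrative choices; the hidden-layer parameters W and b in the sampled variant come from the SWIM algorithm and are not reproduced here.

```python
import numpy as np
import tensorflow as tf

width = 2048  # one of the widths considered in the experiments

# (a) Adam-trained classification head on the extracted features.
head = tf.keras.Sequential([
    tf.keras.layers.Dense(width, activation="tanh"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
head.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
             loss="categorical_crossentropy", metrics=["accuracy"])
stopper = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# head.fit(features, labels, batch_size=32, epochs=20,
#          validation_data=(test_features, test_labels), callbacks=[stopper])

# (b) Sampled head: with hidden-layer parameters W, b obtained from the sampling
# algorithm (not shown here), only the linear readout is fitted by least squares.
def fit_readout(features, labels, W, b):
    H = np.tanh(features @ W + b)                  # sampled tanh hidden layer
    H1 = np.hstack([H, np.ones((H.shape[0], 1))])  # append a bias column
    coeffs, *_ = np.linalg.lstsq(H1, labels, rcond=None)
    return coeffs                                  # linear output layer
```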
E.3 Fine-tuning

There are two typical approaches to transfer learning: feature extraction and fine-tuning. In feature extraction, there is no need to retrain the pre-trained model. The pre-trained models capture the essential representations/features from a similar task. One only needs to add a new classifier (one or more fully connected layers) and find the parameters of this new classifier. In fine-tuning, after the feature extraction, some or all of the parameters of the pre-trained model (a few of the last layers or the entire model) are re-trained with a much smaller learning rate using typical gradient-based optimization algorithms; a minimal sketch of this step is given at the end of this section. Fine-tuning is not the focus of this work. Nevertheless, it is essential to verify that the weights of the last layer sampled by the proposed sampling algorithm can also be trained effectively in the fine-tuning phase. The results are shown in Figure 14. We observe that for certain widths (512, 1024, and 4096), sampling the last layer followed by fine-tuning the entire model yields slightly higher test accuracies than training the last layer followed by fine-tuning. For the widths 2048, 6144, and 8192, for some models, sampling the last layer followed by fine-tuning is better; for others, training the last layer followed by fine-tuning is better. Nevertheless, these experiments validate that the parameters of the last layer sampled with the proposed algorithm serve as a good starting point for fine-tuning. Moreover, the test accuracy after fine-tuning is comparable irrespective of whether the last layer was sampled or trained. However, as we show in the main paper, sampling the weights in the feature extraction phase takes much less time and gives better accuracy than training with the Adam optimizer for appropriate widths.

E.4 One vs. two hidden layers in the classification head

The goal of this sub-section is to explore whether adding an additional hidden layer in the classification head improves the classification accuracy. We keep the same width for the extra layer in this experiment. Figure 15 compares the train and test accuracies obtained with one and two hidden layers in the classification head for different widths. Figure 15 (left) shows that unless one chooses a very large width (>= 6144) with the sampling approach, adding an extra hidden layer yields a lower test accuracy. On the train data, the sampling approach with two hidden layers results in lower accuracy for all widths under consideration. Figure 15 (right) shows that with the Adam training approach, adding an extra hidden layer yields a lower test accuracy for all widths under consideration. For lower widths, adding more layers results in over-fitting. We believe that the accuracy on the train data for larger widths could be improved with an extensive hyperparameter search. However, adding an extra layer also increases the training time.

Figure 14: Comparison of test accuracies obtained by the sampling approach followed by fine-tuning and the Adam training approach followed by fine-tuning, for the widths 512, 1024, 2048, 4096, 6144, and 8192.

Figure 15: Train and test accuracy with one vs. two hidden layers in the classification head. Left: Sampling. Right: Adam training.
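To complement the description of fine-tuning in Section E.3, here is a minimal Keras sketch; the width of 2048, the learning rate of 1e-5, and the number of epochs are our own illustrative choices rather than the configuration used for Figure 14. A sampled classification head would first be loaded into ordinary Dense layers so that its parameters can be updated as well.

```python
import tensorflow as tf

# Backbone plus classification head, trained end-to-end with a much smaller
# learning rate than in the feature-extraction phase.
backbone = tf.keras.applications.ResNet50(weights="imagenet", include_top=False)

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.GlobalAveragePooling2D()(backbone(inputs))
x = tf.keras.layers.Dense(2048, activation="tanh")(x)   # hidden layer (sampled or trained)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

backbone.trainable = True  # unfreeze the pre-trained layers (all of them, in this sketch)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # illustrative value
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, batch_size=32, epochs=5)
```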
F Code repository and algorithm complexity

The code to reproduce the experiments from the paper can be found at https://gitlab.com/felix.dietrich/swimnetworks-paper. An up-to-date code base is maintained at https://gitlab.com/felix.dietrich/swimnetworks. Python 3.8 is required to run the computational experiments and install the software. Installation works with pip install -e . in the main code repository. The required Python packages can be installed with pip from the file requirements.txt. The code base is contained in the folder swimnetworks; the experiments and the code for the hyperparameter studies are in the folder experiments.

We now include a short complexity analysis of the SWIM algorithm (Algorithm 1 in the main paper), which is implemented in the code used in the experiments. The main points of the analysis can be found in Table 5.

Table 5: Runtime and memory requirements for training a sampled neural network with the SWIM algorithm, where $N = \max\{N_0, N_1, \ldots, N_L\}$. Assumption (I) assumes the output dimension is less than or equal to $N$. Assumption (II) assumes in addition that $N < M^2$, i.e., the number of neurons and the input dimension are less than the size of the dataset squared. Assumption (III) assumes a fixed architecture.

Assumption (I):   runtime $O(LM(\min\{\lceil N/M \rceil, M\} + N^2))$,   memory $O(M \min\{\lceil N/M \rceil, M\} + LN^2)$
Assumption (II):  runtime $O(LM(\lceil N/M \rceil + N^2))$,   memory $O(M \lceil N/M \rceil + LN^2)$
Assumption (III): runtime $O(M)$,   memory $O(M)$

There are two separate parts that contribute to the runtime. First, we consider sampling the parameters of the hidden layers. Let $N = \max\{N_0, N_1, \ldots, N_L\}$, i.e., the maximum of the number of neurons in any layer and the input dimension, and let $M$ be the size of the training set. The size of the training set also limits the number of possible weights/biases that we can sample, namely $M^2$. The time complexity to sample the parameters of the hidden layers is $O(LM(\min\{\lceil N/M \rceil, M\} + N^2))$. We can refine this expression to $O\big(\sum_{l=1}^{L} M(\min\{\lceil N_l/M \rceil, M\} + N_l N_{l-1})\big)$. Several factors contribute to this runtime. (1) The number of layers increases the number of samples. (2) We compute the output after $l-1$ layers for the entire dataset to obtain the space from which we construct the weights, hence the term $N_l N_{l-1} M$. (3) We compute $M \cdot M_l$ probabilities, where $M_l = \min\{\lceil N_l/M \rceil, M\}$. When the size of the hidden layer is less than the number of training points, which is often the case, we compute on the order of $2M$ probabilities, depending on a scaling constant. On the other hand, when the size of the layer is greater than or equal to $M^2$, we compute in the worst case all possible probabilities, that is, $M^2$ probabilities. The last contributing factor is sampling $N_l$ pairs of points, which in the expression is dominated by the term $N_l N_{l-1} M$. We are often interested in a fixed architecture, or at least in bounding the number of neurons in each hidden layer by the square of the number of training points, i.e., $N < M^2$. Adding the latter as an assumption, we end up with the runtime $O(LM(\lceil N/M \rceil + N^2))$.

For the second part, we optimize the weights/biases $W_{L+1}, b_{L+1}$ that map from the last hidden layer to the output. Assume that we use the SVD to compute the pseudoinverse and subsequently solve the least-squares problem. The time complexity in general is then $O(M N_L \min\{M, N_L\} + N_L M N_{L+1})$. Again, if we make the reasonable assumption that the number of training points is larger than the width of the last hidden layer, and that the output dimension is smaller than the dimension of the last hidden layer, we find $O(M N_L^2)$, which can be bounded by $O(M N^2)$.
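A minimal NumPy sketch of this SVD-based least-squares step follows, under the assumption that the bias of the output layer is handled by appending a constant column to the hidden-layer activations; the function name and the rank cutoff are our own.

```python
import numpy as np

def solve_output_layer(H, Y, cutoff=1e-12):
    """Least-squares fit of the linear output layer via an SVD-based pseudoinverse.

    H: (M, N_L) activations of the last hidden layer (with a bias column appended),
    Y: (M, N_{L+1}) training targets.
    """
    # The SVD costs O(M * N_L * min(M, N_L)).
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    s_inv = np.zeros_like(s)
    mask = s > cutoff * s[0]          # truncate tiny singular values
    s_inv[mask] = 1.0 / s[mask]
    # Applying the pseudoinverse costs O(N_L * M * N_{L+1}).
    return Vt.T @ (s_inv[:, None] * (U.T @ Y))
```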
With these assumptions in mind, the runtime for the full training procedure, from start to finish, is $O(LM(\lceil N/M \rceil + N^2))$; when the architecture is fixed, this reduces to $O(M)$. In terms of memory, at every layer $l$ we need to store the probability matrix, the output of the training set when mapped through the previous $l-1$ layers, and the points we sample. This means the memory required is $O(M \lceil N_l/M \rceil + N_{L+1} N_L)$. For the last layer, we only need the image of the data passed through all the hidden layers, as well as the weights/biases, which leads to $O(M + N_{L+1} N_L)$. We end up with a total memory requirement for the SWIM algorithm of $O(M \lceil N/M \rceil + LN^2)$.
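To see how these expressions scale for a concrete architecture, one can evaluate them directly; the small helper below is our own illustration of the layer-wise runtime sum and the memory order under Assumption (I), and it covers only the sampling of the hidden layers, not the final least-squares solve.

```python
import math

def swim_cost_orders(M, widths):
    """Evaluate the runtime and memory orders derived above, dropping all constants.

    M: number of training points; widths = [N_0, ..., N_L], input dimension first.
    This is an illustration of the formulas, not a measurement of the implementation.
    """
    N, L = max(widths), len(widths) - 1
    runtime = sum(M * (min(math.ceil(Nl / M), M) + Nl * Nprev)
                  for Nprev, Nl in zip(widths[:-1], widths[1:]))
    memory = M * min(math.ceil(N / M), M) + L * N * N
    return runtime, memory

# Example: 5000 training points, input dimension 10, three hidden layers of width 512.
print(swim_cost_orders(5000, [10, 512, 512, 512]))
```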