# Controlling Neural Level Sets

Matan Atzmon, Niv Haim, Lior Yariv, Ofer Israelov, Haggai Maron, Yaron Lipman
Weizmann Institute of Science, Rehovot, Israel

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The level sets of neural networks represent fundamental properties such as decision boundaries of classifiers and are used to model non-linear manifold data such as curves and surfaces. Thus, methods for controlling the neural level sets could find many applications in machine learning. In this paper we present a simple and scalable approach to directly control level sets of a deep neural network. Our method consists of two parts: (i) sampling of the neural level sets, and (ii) relating the sample positions to the network parameters. The latter is achieved by a sample network that is constructed by adding a single fixed linear layer to the original network. In turn, the sample network can be used to incorporate the level set samples into a loss function of interest. We have tested our method on three different learning tasks: improving generalization to unseen data, training networks robust to adversarial attacks, and curve and surface reconstruction from point clouds. For surface reconstruction, we produce high fidelity surfaces directly from raw 3D point clouds. When training small to medium networks to be robust to adversarial attacks we obtain robust accuracy comparable to state-of-the-art methods.

## 1 Introduction

The level sets of a Deep Neural Network (DNN) are known to capture important characteristics and properties of the network. A popular example is when the network $F(x;\theta): \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^l$ represents a classifier, $\theta$ are its learnable parameters, $f_i(x;\theta)$ are its logits (the outputs of the final linear layer), and the level set

$$S(\theta) = \Big\{ x \in \mathbb{R}^d \;\Big|\; f_j(x;\theta) - \max_{i \ne j} f_i(x;\theta) = 0 \Big\} \tag{1}$$

represents the decision boundary of the $j$-th class. Another recent example is modeling a manifold (e.g., a curve or a surface in $\mathbb{R}^3$) using a level set of a neural network (e.g., [24]). That is,

$$S(\theta) = \big\{ x \in \mathbb{R}^d \mid F(x;\theta) = 0 \big\} \tag{2}$$

represents (generically) a manifold of dimension $d - l$ in $\mathbb{R}^d$.

The goal of this work is to provide practical means to directly control and manipulate neural level sets $S(\theta)$, as exemplified in Equations 1 and 2. The main challenge is how to incorporate $S(\theta)$ in a differentiable loss function. Our key observation is that, given a sample $p \in S(\theta)$, its position can be associated to the network parameters, $p = p(\theta)$, in a differentiable and scalable way. In fact, $p(\theta)$ is itself a neural network, obtained by adding a single linear layer to $F(x;\theta)$; we call these networks sample networks. Sample networks, together with an efficient mechanism for sampling the level set, $\{p_j(\theta)\} \subset S(\theta)$, can be incorporated in general loss functions as a proxy for the level set $S(\theta)$.

Figure 1: Our method applied to binary classification in 2D. Blue and red dots represent positive and negative examples, respectively. (a) standard cross-entropy loss baseline; (b) using our method to move the decision boundary at least $\varepsilon$ away from the training set in the $L_\infty$ norm; (c) the same for the $L_2$ norm; (d) a geometric adaptation of the SVM soft-margin loss to neural networks using our method; the $+1$, $-1$ level sets are marked in light red and blue, respectively. Note that in (b), (c), (d) we achieve decision boundaries that seem to better explain the training examples compared to (a).
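As a concrete illustration of Equation 1, the decision-boundary function of class $j$ can be evaluated directly from a classifier's logits. The following is a minimal PyTorch sketch; the helper name and batching convention are our own, not code from the paper.

```python
import torch

def decision_boundary_fn(logits, j):
    """Level-set function of Equation 1: F_j(x; theta) = f_j - max_{i != j} f_i.
    Its zero level set is the decision boundary of class j.
    logits: classifier outputs f(x; theta), shape (batch, num_classes)."""
    fj = logits[:, j]
    others = torch.cat([logits[:, :j], logits[:, j + 1:]], dim=1)
    return fj - others.max(dim=1).values
```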
We apply our method of controlling neural level sets to two seemingly different learning tasks: controlling decision boundaries of classifiers (Equation 1) and reconstructing curves and surfaces from point cloud data (Equation 2).

First, we use our method to improve generalization and adversarial robustness in classification tasks. In these tasks, the distance between the training examples and the decision boundary is an important quantity called the margin. Margins are traditionally desired to be as large as possible to improve generalization [7, 10] and adversarial robustness [10]. Usually, margins are only controlled indirectly by loss functions that measure network output values of training examples (e.g., the cross-entropy loss). Recently, researchers have been working on optimizations with more direct control of the margin, using linearization techniques [13, 21, 10], regularization [26], output-layer margin [27], or margin gradients [9]. We suggest controlling the margin by sampling the decision boundary, constructing the sample network, and measuring distances between samples and training examples directly. By applying this technique to train small to medium-size networks against adversarial perturbations, we achieved robust accuracy comparable to state-of-the-art methods.

To improve generalization when learning from small amounts of data, we devise a novel geometric formulation of the soft-margin SVM loss for neural networks. This loss aims at directly increasing the input space margin, in contrast to standard deep network hinge losses that deal with the output space margin [28, 27]. The authors of [10] also suggest increasing the input space margin to improve generalization. Figure 1 shows 2D examples of training networks with our adversarial robustness and geometric SVM losses.

In a different application, we use our method for the reconstruction of manifolds such as curves and surfaces in $\mathbb{R}^3$ from point cloud data. The use of neural networks for modeling surfaces has recently become popular [12, 29, 6, 2, 24, 22]. There are two main approaches: parametric and implicit. The parametric approach uses networks as parameterization functions of the manifolds [12, 29], while the implicit approach represents the manifold as a level set of a neural network [24, 6, 22]. So far, implicit representations have been learned using regression, given a signed distance function or an occupancy function computed directly from a ground truth surface. Unfortunately, for raw point clouds in $\mathbb{R}^3$, computing the signed distance function or an occupancy function is a notoriously difficult task [4]. In this paper we show that by using our sample network to control the neural level sets we can reconstruct curves and surfaces directly from point clouds in $\mathbb{R}^3$.

Lastly, to theoretically justify neural level sets for modeling manifolds or arbitrary decision boundaries, we prove a geometric version of the universality property of MLPs [8, 14]: any piecewise linear hyper-surface in $\mathbb{R}^d$ (i.e., a $(d-1)$-dimensional manifold built out of a finite number of linear pieces, not necessarily bounded) can be precisely represented as a neural level set of a suitable MLP.

## 2 Sample Network

Given a neural network $F(x;\theta): \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^l$, its $0 \in \mathbb{R}^l$ level set is defined by

$$S(\theta) := \big\{ x \mid F(x;\theta) = 0 \big\}. \tag{3}$$

We denote by $D_x F(p;\theta) \in \mathbb{R}^{l \times d}$ the matrix of partial derivatives of $F$ with respect to $x$.
Assuming that $\theta$ is fixed, $F(p;\theta) = 0$, and that $D_x F(p;\theta)$ is of full rank $l$ ($l \le d$), a corollary of the Implicit Function Theorem [15] implies that $S(\theta)$ is a $(d-l)$-dimensional manifold in the vicinity of $p \in S(\theta)$.

Our goal is to incorporate the neural level set $S(\theta)$ in a differentiable loss. We accomplish that by performing the following procedure at each training iteration: (i) sample $n$ points on the level set, $p_i \in S(\theta)$, $i \in [n]$; (ii) build the sample network $p_i(\theta)$, $i \in [n]$, by adding a fixed linear layer to the network $F(x;\theta)$; and (iii) incorporate the sample network in a loss function as a proxy for $S(\theta)$.

### 2.1 Level set sampling

To sample $S(\theta)$ at some $\theta = \theta_0$, we start with a set of $n$ points $p_i$, $i \in [n]$, sampled from some probability measure in $\mathbb{R}^d$. Next, we perform generalized Newton iterations [3] to move each point $p$ towards $S(\theta_0)$:

$$p^{\text{next}} = p - D_x F(p;\theta_0)^{\dagger}\, F(p;\theta_0), \tag{4}$$

where $D_x F(p;\theta_0)^{\dagger}$ is the Moore–Penrose pseudo-inverse of $D_x F(p;\theta_0)$. The generalized Newton step solves the under-determined ($l < d$) linear system $F(p;\theta_0) + D_x F(p;\theta_0)(p^{\text{next}} - p) = 0$. To motivate this particular solution choice we show that the generalized Newton step applied to a linear function is an orthogonal projection onto its zero level set (see proof in the supplementary material):

Lemma 1. Let $\ell(x) = Ax + b$, $A \in \mathbb{R}^{l\times d}$, $b \in \mathbb{R}^l$, $l < d$, and $A$ of full rank $l$. Then Equation 4 applied to $F(x) = \ell(x)$ is an orthogonal projection onto the zero level set of $\ell$, namely, onto $\{x \mid \ell(x) = 0\}$.

For $l > 1$ the computation of $D_x F(p;\theta_0)^{\dagger}$ requires inverting an $l\times l$ matrix; in this paper $l \in \{1, 2\}$. The directions $D_x F(p_i;\theta_0)$ can be computed efficiently using back-propagation, where the points $p_i$, $i \in [n]$, are grouped into batches. We performed 10–20 iterations of Equation 4 for each $p_i$, $i \in [n]$.

Scalar networks. For scalar networks, i.e., $l = 1$ and $D_x F \in \mathbb{R}^{1\times d}$, a direct computation shows that

$$D_x F(p;\theta_0)^{\dagger} = \frac{D_x F(p;\theta_0)^T}{\|D_x F(p;\theta_0)\|^2}. \tag{5}$$

That is, the point $p$ moves towards the level set $S(\theta_0)$ in the direction of steepest descent (or ascent) of $F$. It is worth mentioning that the projection-on-level-set formula for the case $l = 1$ has already been developed in [23] and was used to find adversarial perturbations; our result generalizes to the intersection of several level sets and shows that this procedure is an instantiation of the generalized Newton algorithm.

The generalized Newton method (similarly to the Newton method) is not guaranteed to find a point on the zero level set. We denote by $c_i = F(p_i;\theta_0)$ the level set values of the final points $p_i$; in the following, we also use points that failed to be projected, together with their level set values $c_i$. Furthermore, we found that the generalized Newton method usually does find zeros of neural networks, but these can be far from the initial projection point. Other, less efficient but sometimes more robust, ways to project onto neural level sets are gradient descent on $|F(x;\theta)|$ or zero finding in the direction of a successful PGD attack [17, 20], as done in [9]. In our robust-networks application (Section 3.2) we have used a similar approach with the false position method.

Relation to Elsayed et al. [10]. The authors of [10] replace the margin distance with the distance to the linearized network, while our approach is to directly sample the actual network's level set and move it explicitly. Specifically, for the $L_2$ norm ($p = 2$), the method of [10] is similar to our method using a single generalized Newton iteration, Equation 4.
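To make the sampling step concrete, the following is a minimal PyTorch sketch of the projection iterations for a scalar network ($l = 1$), where the pseudo-inverse reduces to Equation 5. The function name, the fixed iteration count, and the numerical epsilon are our illustrative choices, not the authors' implementation.

```python
import torch

def project_to_level_set(F, points, n_iters=20, level=0.0):
    """Move sample points toward the `level` level set of a scalar network F
    via generalized Newton iterations (Equation 4); for l = 1 the pseudo-inverse
    step reduces to Equation 5. F maps points (n, d) to values (n,)."""
    p = points.detach().clone()
    for _ in range(n_iters):
        p.requires_grad_(True)
        residual = F(p) - level                           # F(p; theta_0) - t, shape (n,)
        # per-sample input gradients D_x F(p; theta_0); rows are independent because
        # each output depends only on its own input point
        grad = torch.autograd.grad(residual.sum(), p)[0]  # shape (n, d)
        with torch.no_grad():
            # generalized Newton step: p <- p - grad * residual / ||grad||^2
            p = p - residual.unsqueeze(1) * grad / (grad.pow(2).sum(dim=1, keepdim=True) + 1e-12)
    with torch.no_grad():
        c = F(p) - level                                  # residual values c_i of the final points
    return p, c
```

In the vector-valued case ($l > 1$) the same loop would use the full Moore–Penrose pseudo-inverse of the $l \times d$ Jacobian instead of the scalar formula.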
### 2.2 Differentiable sample position

Next, we would like to relate each sample $p$, belonging to some level set $F(p;\theta_0) = c$, to the network parameters $\theta$; namely, $p = p(\theta)$. The functional dependence of a sample $p$ on $\theta$ is defined by $p(\theta_0) = p$ and $F(p(\theta);\theta) = c$ for $\theta$ in some neighborhood of $\theta_0$. The latter condition ensures that $p$ stays on the $c$ level set as the network parameters $\theta$ change. As only first derivatives are used in the optimization of neural networks, it is enough to replace this condition with its first-order version. We get the following two equations:

$$p(\theta_0) = p\ ; \qquad \frac{d}{d\theta}\Big|_{\theta=\theta_0} F(p(\theta);\theta) = 0. \tag{6}$$

Using the chain rule, the second condition in Equation 6 reads:

$$D_x F(p;\theta_0)\, D_\theta p(\theta_0) + D_\theta F(p;\theta_0) = 0. \tag{7}$$

This is a system of linear equations with $d\cdot m$ unknowns (the components of $D_\theta p(\theta_0)$) and $l\cdot m$ equations. When $d > l$, this linear system is under-determined. Similarly to what is used in the generalized Newton method, a minimal-norm solution is given by the Moore–Penrose inverse:

$$D_\theta p(\theta_0) = -D_x F(p;\theta_0)^{\dagger}\, D_\theta F(p;\theta_0). \tag{8}$$

The columns of the matrix $D_\theta p(\theta_0) \in \mathbb{R}^{d\times m}$ describe the velocity of $p(\theta)$ with respect to each of the parameters in $\theta$. The pseudo-inverse selects the minimal-norm solution, which can be shown to represent, in this case, a movement in the direction orthogonal to the level set (see the supplementary material for a proof). We reiterate that for scalar networks, where $l = 1$, $D_x F(p_i;\theta_0)^{\dagger}$ has a simple closed-form expression, as shown in Equation 5.

The sample network. A possible simple solution to Equation 6 would be to use the linear function $p(\theta) = p + D_\theta p(\theta_0)(\theta - \theta_0)$, with $D_\theta p(\theta_0)$ as defined in Equation 8. Unfortunately, this would require storing $D_\theta p(\theta_0)$, using at least $O(m)$ space (i.e., the number of network parameters), for every projection point $p$. (We assume the number of output channels $l$ of $F$ is constant, which is the case in this paper.) A much more efficient solution is

$$p(\theta) = p - D_x F(p;\theta_0)^{\dagger}\, \big[ F(p;\theta) - c \big], \tag{9}$$

which requires storing only $D_x F(p;\theta_0)^{\dagger}$, i.e., $O(d)$ space, where $d$ is the input space dimension, for every projection point $p$. Furthermore, Equation 9 allows an efficient implementation with a single network $G(p, D_x F(p;\theta_0)^{\dagger};\theta) := p(\theta)$. We call $G$ the sample network. Note that a collection of samples $p_i$, $i \in [n]$, can be treated as a batch input to $G$.

## 3 Incorporation of samples in loss functions

Once we have the sample network $p_i(\theta)$, $i \in [n]$, we can incorporate it in a loss function to control the neural level set $S(\theta)$ in a desired way. We give three examples in this paper.
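Before turning to the specific losses, here is a minimal PyTorch sketch of the sample network of Equation 9 for the scalar case ($l = 1$); the helper name and the recomputation of the gradient direction are our simplifications, not the paper's code. Only the term $F(p;\theta)$ carries gradients back to the network parameters; the projected points and the pseudo-inverse directions are held fixed.

```python
import torch

def scalar_sample_network(F, p, c):
    """A sketch of the sample network of Equation 9 for a scalar network (l = 1).
    p: projected level-set samples (treated as constants), shape (n, d);
    c: their level-set values F(p; theta_0), shape (n,).
    Only F(p; theta) below is differentiated w.r.t. the network parameters, so the
    returned points move according to Equation 8 as theta changes."""
    p = p.detach()
    # D_x F(p; theta_0) and its pseudo-inverse direction (Equation 5), held fixed
    p_req = p.clone().requires_grad_(True)
    grad = torch.autograd.grad(F(p_req).sum(), p_req)[0].detach()     # (n, d)
    pinv_dir = grad / (grad.pow(2).sum(dim=1, keepdim=True) + 1e-12)  # (n, d)
    return p - pinv_dir * (F(p) - c.detach()).unsqueeze(1)
```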
### 3.1 Geometric SVM

The support-vector machine (SVM) is a model that trains a linear binary classifier to generalize well to new data by combining the hinge loss with a large-margin term. It can be interpreted as encouraging large distances between training examples and the decision boundary. Specifically, the soft SVM loss takes the form [7]:

$$\mathrm{loss}(w,b) = \frac{1}{N}\sum_{j=1}^{N} \max\big\{0,\ 1 - y_j\,\ell(x_j; w, b)\big\} + \lambda \|w\|_2^2, \tag{10}$$

where $(x_j, y_j) \in \mathbb{R}^d \times \{-1, 1\}$, $j \in [N]$, is the binary classification training data and $\ell(x; w, b) = w^T x + b$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, is the linear classifier.

We would like to generalize Equation 10 to a deep network $F : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}$ towards the goal of increasing the network's input space margin, defined as the minimal distance of the training examples to the decision boundary $S(\theta) = \{x \in \mathbb{R}^d \mid F(x;\theta) = 0\}$. Note that this is in strong contrast to the standard deep network hinge loss, which works with the output space margin [28, 27], namely, measuring differences of output logits when evaluating the network on training examples. Such a loss does not penalize a small input space margin, so long as this does not damage the output-level classification performance on the training data. Using the input margin instead of the output margin may also provide robustness to perturbations [10].

We now describe a new, geometric formulation of Equation 10 and use it to define a soft-SVM loss for general neural networks. In the linear case, the following quantity serves as the margin:
$$\|w\|_2^{-1} = d\big(S(\theta), S_1(\theta)\big) = d\big(S(\theta), S_{-1}(\theta)\big),$$
where $S_t(\theta) = \{x \in \mathbb{R}^d \mid F(x;\theta) = t\}$ and $d(S(\theta), S_t(\theta))$ is the distance between the level sets, which are two parallel hyper-planes. In the general case, however, level sets are arbitrary hypersurfaces that are not necessarily equidistant (i.e., the distance when traveling from $S$ to $S_t$ does not have to be constant across $S$). Hence, for each data sample $x$, we define the following margin function:
$$\Delta(x;\theta) = \min\Big\{ d\big(p(\theta), S_1(\theta)\big),\ d\big(p(\theta), S_{-1}(\theta)\big) \Big\},$$
where $p(\theta)$ is the sample network of the projection of $x$ onto $S(\theta_0)$. Additionally, note that in the linear case $|w^T x + b| = d(x, S(\theta)) / \Delta(x;\theta)$. With these definitions in mind, Equation 10 can be given the geometric generalized form:

$$\mathrm{loss}(\theta) = \frac{1}{N}\sum_{j=1}^{N} \max\left\{0,\ 1 - \frac{\operatorname{sign}\!\big(y_j\,F(x_j;\theta)\big)\, d\big(x_j, p_j(\theta)\big)}{\Delta(x_j;\theta)}\right\} + \frac{\lambda}{N}\sum_{j=1}^{N} \Delta(x_j;\theta)^{-\alpha}, \tag{11}$$

where $F(x;\theta)$ is a general classifier (such as a neural network, in our applications) and $p_j(\theta)$ is the sample network of the projection of $x_j$ onto $S(\theta_0)$. Note that in the case where $F(x;\theta)$ is affine, $\alpha = 2$ and $d = L_2$, Equation 11 reduces back to the regular SVM loss, Equation 10. Figure 1d depicts the result of optimizing this loss in a 2D case, i.e., $x_j \in \mathbb{R}^2$; the light blue and red curves represent $S_{-1}$ and $S_1$.

### 3.2 Robustness to adversarial perturbations

The goal of robust training is to prevent a change in a model's classification result when small perturbations are applied to the input. Following [20], the attack model is specified by some set $\mathcal{S} \subset \mathbb{R}^d$ of allowed perturbations; in this paper we focus on the popular choice of $L_\infty$ perturbations, that is, $\mathcal{S}$ is taken to be the $\varepsilon$-radius $L_\infty$ ball, $\{x \mid \|x\|_\infty \le \varepsilon\}$. Let $(x_j, y_j) \in \mathbb{R}^d \times [L]$ denote the training examples and labels, and let $S^j(\theta) = \{x \in \mathbb{R}^d \mid F^j(x;\theta) = 0\}$, where $F^j(x;\theta) = f_j - \max_{i \ne j} f_i$, denote the decision boundary of label $j$. We define the loss

$$\mathrm{loss}(\theta) = \frac{1}{N}\sum_{j=1}^{N} \lambda_j \max\Big\{0,\ \varepsilon_j - \operatorname{sign}\!\big(F^{y_j}(x_j;\theta)\big)\, d\big(x_j, S^{y_j}(\theta)\big)\Big\}, \tag{12}$$

where $d(x, S^j)$ is some notion of a distance between $x$ and $S^j$, e.g., $\min_{y \in S^j} \|x - y\|_p$ or $\int_{S^j(\theta)} \|x - y\|_p\, d\mu(y)$, where $\mu$ is some probability measure on $S^j(\theta)$. The parameter $\lambda_j$ controls the weighting between correctly (i.e., $F^{y_j}(x_j;\theta) > 0$) and incorrectly (i.e., $F^{y_j}(x_j;\theta) < 0$) classified samples; we fix $\lambda_j = 1$ for incorrectly classified samples and set $\lambda_j$ to the same constant for all correctly classified samples. The parameter $\varepsilon_j$ controls the desired target distances. Similarly to [10], the idea of this loss is: (i) if $x_j$ is incorrectly classified, pull the decision boundary $S^{y_j}$ toward $x_j$; (ii) if $x_j$ is classified correctly, push the decision boundary $S^{y_j}$ to be at a distance of at least $\varepsilon_j$ from $x_j$.

In our implementation we have used $d(x, S^j) = \rho(x, p(\theta))$, where $p = p(\theta_0) \in S^j$ is a sample of this level set; $\rho(x, p) = |x_{i^*} - p_{i^*}|$, where $x_{i^*}$, $p_{i^*}$ denote the $i^*$-th coordinates of $x$, $p$ (resp.), and $i^* = \arg\max_{i \in [d]} |D_x F(p;\theta_0)_i|$. This loss encourages $p(\theta)$ to move in the direction of the axis (i.e., $i^*$) that corresponds to the largest component of the gradient $D_x F(p;\theta_0)$. Intuitively, as $\rho(x, p) \le \|x - p\|_\infty$, the loss pushes the level set $S^j$ to leave the $\varepsilon_j$-radius $L_\infty$-ball in the direction of the axis that corresponds to the maximal speed of $p(\theta)$.
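A minimal PyTorch sketch of the loss in Equation 12 with $d = \rho$ follows; it assumes the differentiable boundary samples $p(\theta)$ come from the sample network above, and the tensor names, shapes, and the exposed weighting parameter are our illustrative choices.

```python
import torch

def robustness_loss(x, p_theta, grad_at_p, f_y_at_x, eps, lam_correct=1.0):
    """A sketch of Equation 12 with d = rho (the single-coordinate distance).
    x:          training examples, shape (n, d)
    p_theta:    differentiable boundary samples p(theta) on S^{y_j}, shape (n, d)
    grad_at_p:  D_x F^{y_j}(p; theta_0), detached, shape (n, d)
    f_y_at_x:   F^{y_j}(x_j; theta_0), shape (n,) -- only its sign is used
    eps:        target margins epsilon_j (scalar or shape (n,))."""
    i_star = grad_at_p.abs().argmax(dim=1)                 # axis of the largest gradient component
    rows = torch.arange(x.shape[0])
    rho = (x[rows, i_star] - p_theta[rows, i_star]).abs()  # rho(x, p(theta))
    sign = torch.sign(f_y_at_x.detach())                   # +1 correctly classified, -1 otherwise
    lam = torch.where(sign > 0, torch.full_like(rho, lam_correct), torch.ones_like(rho))
    return (lam * torch.clamp(eps - sign * rho, min=0.0)).mean()
```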
The inset depicts an example where this distance measure is more effective than the standard $L_\infty$ distance: $D_x F(p;\theta)$ is shown as a black arrow and the selected axis $i^*$ as a dashed line. In the case where the distance function $\min_{y \in S(\theta)} \|x - y\|_p$ is used, the computation of its gradient using Equation 8 coincides with the gradient derivation of [9] up to a sign difference. Still, our derivation allows working with general level set points (i.e., not just the closest) on the decision boundary $S^j$, and our sample network offers an efficient implementation of these samples in a loss function. Furthermore, we use the same loss for both correctly and incorrectly classified examples.

### 3.3 Manifold reconstruction

Surface reconstruction. Given a point cloud $\mathcal{X} = \{x_j\}_{j=1}^N \subset \mathbb{R}^3$ that samples, possibly with noise, some surface $\mathcal{M} \subset \mathbb{R}^3$, our goal is to find parameters $\theta$ of a network $F : \mathbb{R}^3 \times \mathbb{R}^m \to \mathbb{R}$ so that the neural level set $S(\theta)$ approximates $\mathcal{M}$. Even more desirable is to have $F$ approximate the signed distance function to the unknown surface sampled by $\mathcal{X}$. To that end, we would like the neural level sets $S_t(\theta)$, $t \in T$, to be of distance $|t|$ to $\mathcal{X}$, where $T \subset \mathbb{R}$ is some collection of desired level set values. Let $d(x, \mathcal{X}) = \min_{j \in [N]} \|x - x_j\|_2$ be the distance between $x$ and $\mathcal{X}$. We consider the reconstruction loss

$$\mathrm{loss}(\theta) = \sum_{t \in T} \int_{S_t(\theta)} \big|\, d(x, \mathcal{X}) - |t| \,\big|^{p}\, dv(x) \;+\; \lambda \sum_{j=1}^{N} \big|F(x_j;\theta)\big|, \tag{13}$$

where $dv(x)$ is the normalized volume element on $S_t(\theta)$ and $\lambda > 0$ is a parameter. The first part of the loss encourages the $t$ level set of $F$ to be of distance $|t|$ to $\mathcal{X}$; note that for $t = 0$ this reconstruction error was used in level set surface reconstruction methods [33]. The second part of the loss penalizes samples of $\mathcal{X}$ lying outside the zero level set $S(\theta)$.

Curve reconstruction. In the case of approximating a manifold $\mathcal{M} \subset \mathbb{R}^d$ with co-dimension greater than 1, e.g., a curve in $\mathbb{R}^3$, one cannot expect $F$ to approximate the signed distance function, as no such function exists. Instead, we model the manifold via the level set of a vector-valued network $F : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^l$, whose zero level set is an intersection of $l$ hyper-surfaces. As explained in Section 2, this generically defines a $(d-l)$-dimensional manifold. In that case we used the loss in Equation 13 with $T = \{0\}$, namely, only encouraging the zero level set to be as close as possible to the samples $\mathcal{X}$.

## 4 Universality

To theoretically support the usage of neural level sets for modeling manifolds or controlling decision boundaries, we provide a geometric universality result for multilayer perceptrons (MLPs) with ReLU activations. That is, the level sets of MLPs can represent any watertight piecewise linear hyper-surface (i.e., a manifold of co-dimension 1 in $\mathbb{R}^d$ that is the boundary of a $d$-dimensional polytope). More specifically, we prove:

Theorem 1. Any watertight, not necessarily bounded, piecewise linear hypersurface $\mathcal{M} \subset \mathbb{R}^d$ can be exactly represented as the neural level set $S$ of a multilayer perceptron with ReLU activations, $F : \mathbb{R}^d \to \mathbb{R}$.

The proof of this theorem is given in the supplementary material. Note that this theorem is a geometric version of Theorem 2.1 in [1], which asserts that MLPs with ReLU activations can represent any piecewise linear continuous function.

## 5 Experiments

### 5.1 Classification generalization

Figure 2: Generalization from small fractions of the data. Test accuracy (%) as a function of the fraction of training data (%) for the hinge loss, cross-entropy (Xent), and our loss on (a) MNIST, (b) Fashion-MNIST, and (c) CIFAR10.
In this experiment, we show that when training on small amounts of data, our geometric SVM loss (see Equation 11) generalizes better than the cross-entropy loss and the hinge loss. Experiments were done on three datasets: MNIST [18], Fashion-MNIST [31] and CIFAR10 [16]. For all datasets we arbitrarily merged the labels into two classes, resulting in a binary classification problem. We randomly sampled a fraction of the original training examples and evaluated on the original test set. Due to the variability in the results, we reran the experiment 100 times for MNIST and 20 times for Fashion-MNIST and CIFAR10, and we report the mean accuracy along with the standard deviation.

Figure 2 shows the test accuracy of our loss compared to the cross-entropy loss and the hinge loss over different training set sizes for MNIST (a), Fashion-MNIST (b) and CIFAR10 (c). Our loss function outperforms the standard methods. For the implementation of Equation 11 we used $\alpha = 1$, $d = L_\infty$, and approximated $d(x, S_t) \approx \|x - p_t\|_\infty$, where $p_t$ denotes the projection of $p$ onto the level set $S_t$, $t \in \{-1, 0, 1\}$ (see Section 2.1). The approximation of the term $\Delta(x;\theta)$, where $x = x_j$ is a training example, is therefore $\min\{\|p_0 - p_{-1}\|_\infty, \|p_0 - p_1\|_\infty\}$. See the supplementary material for further implementation details.

### 5.2 Robustness to adversarial examples

In this experiment we used our method with the loss in Equation 12 to train robust models on the MNIST [18] and CIFAR10 [16] datasets. For MNIST we used ConvNet-4a (312K parameters) from [32], and for CIFAR10 we used two architectures: ConvNet-4b (2.5M parameters) from [30] and ResNet-18 (11.2M parameters) from [32]. We report results using the loss in Equation 12 with $\varepsilon_j$ fixed to the value $\varepsilon_{\text{train}}$ listed in Table 1, $\lambda_j$ set to 1 and 11 for MNIST and CIFAR10 (resp.), and $d = \rho$ as explained in Section 3.2.

We evaluated our trained networks on $L_\infty$-bounded attacks with radius $\varepsilon_{\text{attack}}$ using Projected Gradient Descent (PGD) [17, 20], and compared to networks with the same architectures trained using the methods of Madry et al. [20] and TRADES [32].
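For reference, the following is a generic $L_\infty$ PGD attack sketch in PyTorch; it is our illustration of the standard procedure of [17, 20] under the assumption that inputs lie in $[0, 1]$, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as nnf

def pgd_linf(model, x, y, eps, alpha, n_steps, loss_fn=nnf.cross_entropy):
    """Generic L-infinity PGD attack: random start in the eps-ball, signed-gradient
    ascent steps on loss_fn, projection back onto the ball. Assumes inputs in [0, 1]."""
    delta = (torch.rand_like(x) * 2 - 1) * eps
    for _ in range(n_steps):
        delta.requires_grad_(True)
        loss = loss_fn(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
            delta = (x + delta).clamp(0.0, 1.0) - x      # stay in the valid input range
    return (x + delta).detach()
```

A margin-based attack, as reported in Table 1 below, corresponds to swapping `loss_fn` for the negative margin $-F^{y_j}$ of the true class.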
We found our models to be robust to PGD attacks based on the Xent (cross-entropy) loss; during the revision of this paper we discovered a weakness of our trained models to PGD attacks based on the margin loss, i.e., minimizing $F^j$, where $F^j = f_j - \max_{i \ne j} f_i$; we attribute this fact to the large gradients created at the level set. Consequently, we added margin loss attacks to our evaluation. The results are summarized in Table 1.

| Method | Dataset | Arch. | Attack | ε_train | Test Acc. | Rob. Acc. (Xent) | Rob. Acc. (Margin) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Standard | MNIST | ConvNet-4a | PGD40 (ε_attack = 0.3) | – | 99.34% | 13.59% | 0.00% |
| Madry et al. [20] | MNIST | ConvNet-4a | PGD40 (ε_attack = 0.3) | 0.3 | 99.35% | 96.04% | 96.11% |
| Madry et al. [20] | MNIST | ConvNet-4a | PGD40 (ε_attack = 0.3) | 0.4 | 99.16% | 96.54% | 96.53% |
| TRADES [32] | MNIST | ConvNet-4a | PGD40 (ε_attack = 0.3) | 0.3 | 98.97% | 96.75% | 96.74% |
| TRADES [32] | MNIST | ConvNet-4a | PGD40 (ε_attack = 0.3) | 0.4 | 98.62% | 96.78% | 96.76% |
| Ours | MNIST | ConvNet-4a | PGD40 (ε_attack = 0.3) | 0.4 | 99.35% | 99.23% | 97.35% |
| Standard | CIFAR10 | ConvNet-4b | PGD20 (ε_attack = 0.031) | – | 83.67% | 0.00% | 0.00% |
| Madry et al. [20] | CIFAR10 | ConvNet-4b | PGD20 (ε_attack = 0.031) | 0.031 | 71.86% | 39.84% | 38.18% |
| Madry et al. [20] | CIFAR10 | ConvNet-4b | PGD20 (ε_attack = 0.031) | 0.045 | 63.66% | 41.53% | 39.13% |
| TRADES [32] | CIFAR10 | ConvNet-4b | PGD20 (ε_attack = 0.031) | 0.031 | 71.24% | 41.89% | 38.4% |
| TRADES [32] | CIFAR10 | ConvNet-4b | PGD20 (ε_attack = 0.031) | 0.045 | 68.24% | 42.04% | 38.18% |
| Ours | CIFAR10 | ConvNet-4b | PGD20 (ε_attack = 0.031) | 0.045 | 71.96% | 38.45% | 38.54% |
| Standard | CIFAR10 | ResNet-18 | PGD20 (ε_attack = 0.031) | – | 93.18% | 0.00% | 0.00% |
| Madry et al. [20] | CIFAR10 | ResNet-18 | PGD20 (ε_attack = 0.031) | 0.031 | 81.0% | 47.29% | 46.58% |
| Madry et al. [20] | CIFAR10 | ResNet-18 | PGD20 (ε_attack = 0.031) | 0.045 | 74.97% | 49.84% | 48.02% |
| TRADES [32] | CIFAR10 | ResNet-18 | PGD20 (ε_attack = 0.031) | 0.031 | 83.04% | 53.31% | 51.36% |
| TRADES [32] | CIFAR10 | ResNet-18 | PGD20 (ε_attack = 0.031) | 0.045 | 79.52% | 53.49% | 51.22% |
| Ours | CIFAR10 | ResNet-18 | PGD20 (ε_attack = 0.031) | 0.045 | 81.3% | 79.74% | 43.17% |

Table 1: Results of different $L_\infty$-bounded attacks on models trained using our method (described in Section 3.2) compared to other methods.

Note that although superior in robustness to Xent attacks, we are comparable to the baseline methods under margin attacks. Furthermore, we believe the relatively low robust accuracy of our model with the ResNet-18 architecture is due to the fact that we did not specifically adapt our method to Batch-Norm layers. In the supplementary material we provide tables summarizing the robustness of our trained models (MNIST ConvNet-4a and CIFAR10 ConvNet-4b) to black-box attacks; we log black-box attacks of our and the baseline methods [20, 32] in an all-versus-all fashion. In general, we found that all black-box attacks are less effective than the corresponding white-box attacks, that our method performs better under standard-model black-box attacks, and that the three compared methods are in general similar in their black-box robustness.

Figure 3: Point cloud reconstruction. Surface: (a) input point cloud; (b) AtlasNet [12] reconstruction; (c) our result; (d) blow-ups; (e) more examples of our reconstruction. Curve: (f) the bottom image shows a curve (black line) reconstructed from a point cloud (green points) as an intersection of two scalar level sets (top image); (g) the bottom image shows curve reconstruction from a point cloud with large noise, while the top image demonstrates the graceful completion of an open-curve point cloud.

### 5.3 Surface and curve reconstruction

In this experiment we used our method to reconstruct curves and surfaces in $\mathbb{R}^3$ using only incomplete point cloud data $\mathcal{X} \subset \mathbb{R}^3$, an important task in 3D shape acquisition and processing. Each point cloud is processed independently using the loss function described in Equation 13, which encourages the zero level set of the network to pass through the point cloud. For surfaces, it also moves other level sets to be of the correct distance to the point cloud.

For surface reconstruction, we trained on 10 human raw scans from the FAUST dataset [5], where each scan consists of 170K points in $\mathbb{R}^3$. The scans include partial connectivity information, which we do not use. After convergence, we reconstruct the mesh using the marching cubes algorithm [19], sampled at a resolution of $100^3$. Table 2 compares our method with the recent method of [12], which also works directly with point clouds. Evaluation is done using the Chamfer distance [11] computed between 30K points sampled uniformly from our and [12]'s reconstructed surfaces and the ground truth registrations provided by the dataset, using both the $L_1$ and $L_2$ norms. Numbers in the table are multiplied by $10^3$.

| Method | Chamfer $L_1$ | Chamfer $L_2$ |
| --- | --- | --- |
| AtlasNet-1 sphere | 23.56 ± 2.91 | 17.69 ± 2.45 |
| AtlasNet-1 patch | 18.67 ± 3.45 | 13.38 ± 2.66 |
| AtlasNet-25 patches | 11.54 ± 0.53 | 7.89 ± 0.42 |
| Ours | 10.71 ± 0.63 | 7.32 ± 0.46 |

Table 2: Surface reconstruction results.

We can see that our method outperforms its competitor; Figure 3b–3e show examples of surfaces reconstructed from a point cloud (a batch of 10K points is shown in Figure 3a) using our method (3c, 3d-right, 3e) and the method of [12] (3b, 3d-left).
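For concreteness, here is a minimal PyTorch sketch of the reconstruction objective of Equation 13 in its simplest setting, $T = \{0\}$ and $p = 1$ (the setting used for curves; the surface experiments add terms for non-zero $t$). Tensor names and the Monte Carlo estimate of the integral are our illustrative choices.

```python
import torch

def reconstruction_loss(level_set_samples, point_cloud, values_at_cloud, lam=1.0):
    """A sketch of Equation 13 for T = {0} and p = 1.
    level_set_samples: differentiable sample-network points on S(theta), shape (n, 3)
    point_cloud:       the input samples X = {x_j}, shape (N, 3)
    values_at_cloud:   F(x_j; theta), shape (N,) or (N, l)."""
    # d(x, X) = min_j ||x - x_j||_2 for each level-set sample
    dists = torch.cdist(level_set_samples, point_cloud).min(dim=1).values
    fidelity = dists.mean()                    # Monte Carlo estimate of the integral over S(theta)
    attachment = values_at_cloud.abs().mean()  # pulls the point cloud onto the zero level set
    return fidelity + lam * attachment
```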
Importantly, we note that there are recent methods for implicit surface representation using deep neural networks [6, 24, 22]. These methods use signed distance information and/or the occupancy function of the ground truth surfaces and perform regression on these values. Our formulation, in contrast, allows working directly on the more common, raw input of point clouds.

For curve reconstruction, we took noisy samples of parametric curves in $\mathbb{R}^3$ and used a network similar to the surface case, except that its output layer consists of two values. We trained the network with the loss in Equation 13, where $T = \{0\}$, using settings similar to the surface case. Figure 3f shows an example of the input point cloud (in green) and the reconstructed curve (in black) in the bottom image, as well as the two hyper-surfaces of the trained network, the intersection of which defines the final reconstructed curve, in the top image; Figure 3g shows two more examples: reconstruction of a curve from samples with higher noise (bottom image), and reconstruction of a curve from partial curve data (top image); note how the network gracefully completes the curve.

## 6 Conclusions

We have introduced a simple and scalable method to incorporate level sets of neural networks into a general family of loss functions. Testing this method on a wide range of learning tasks, we found it particularly easy to use and applicable in different settings. Current limitations and interesting avenues for future work include: applying our method with the batch normalization layer (which requires a generalization from points to batches); investigating control of the level sets of intermediate layers; developing sampling conditions that ensure coverage of the neural level sets; and employing additional geometric regularization of the neural level sets (e.g., penalizing curvature).

## Acknowledgments

This research was supported in part by the European Research Council (ERC Consolidator Grant "LiftMatch" 771136) and the Israel Science Foundation (Grant No. 1830/17).

## References

[1] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.
[2] H. Ben-Hamu, H. Maron, I. Kezurer, G. Avineri, and Y. Lipman. Multi-chart generative surface modeling. In SIGGRAPH Asia 2018 Technical Papers, page 215. ACM, 2018.
[3] A. Ben-Israel. A Newton-Raphson method for the solution of systems of equations. Journal of Mathematical Analysis and Applications, 15(2):243–252, 1966.
[4] M. Berger, A. Tagliasacchi, L. M. Seversky, P. Alliez, G. Guennebaud, J. A. Levine, A. Sharf, and C. T. Silva. A survey of surface reconstruction from point clouds. In Computer Graphics Forum, volume 36, pages 301–329. Wiley Online Library, 2017.
[5] F. Bogo, J. Romero, M. Loper, and M. J. Black. FAUST: Dataset and evaluation for 3D mesh registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Piscataway, NJ, USA, June 2014. IEEE.
[6] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. arXiv preprint arXiv:1812.02822, 2018.
[7] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[8] G. Cybenko. Approximation by superpositions of a sigmoidal function.
Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[9] G. W. Ding, Y. Sharma, K. Y. C. Lui, and R. Huang. Max-margin adversarial (MMA) training: Direct input space margin maximization through adversarial training. arXiv preprint arXiv:1812.02637, 2018.
[10] G. Elsayed, D. Krishnan, H. Mobahi, K. Regan, and S. Bengio. Large margin deep networks for classification. In Advances in Neural Information Processing Systems, pages 842–852, 2018.
[11] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 605–613, 2017.
[12] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry. A papier-mâché approach to learning 3D surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 216–224, 2018.
[13] M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, pages 2266–2276, 2017.
[14] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[15] S. G. Krantz and H. R. Parks. The Implicit Function Theorem: History, Theory, and Applications. Springer Science & Business Media, 2012.
[16] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[17] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
[18] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[19] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In ACM SIGGRAPH Computer Graphics, volume 21, pages 163–169. ACM, 1987.
[20] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[21] A. Matyasko and L.-P. Chau. Margin maximization for robust classification using deep learning. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 300–307. IEEE, 2017.
[22] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks: Learning 3D reconstruction in function space. arXiv preprint arXiv:1812.03828, 2018.
[23] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
[24] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. arXiv preprint arXiv:1901.05103, 2019.
[25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
[26] J. Sokolić, R. Giryes, G. Sapiro, and M. R. Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.
[27] S. Sun, W. Chen, L. Wang, and T.-Y. Liu. Large margin deep neural networks: Theory and algorithms. arXiv preprint arXiv:1506.05232, 2015.
[28] Y. Tang. Deep learning using support vector machines. CoRR, abs/1306.0239, 2013.
[29] F. Williams, T. Schneider, C. Silva, D. Zorin, J. Bruna, and D. Panozzo.
Deep geometric prior for surface reconstruction. arXiv preprint arXiv:1811.10943, 2018.
[30] E. Wong, F. Schmidt, J. H. Metzen, and J. Z. Kolter. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems, pages 8400–8409, 2018.
[31] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[32] H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. El Ghaoui, and M. I. Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019.
[33] H.-K. Zhao, S. Osher, and R. Fedkiw. Fast surface reconstruction using the level set method. In Proceedings of the IEEE Workshop on Variational and Level Set Methods in Computer Vision, pages 194–201. IEEE, 2001.