# Learning Globally Smooth Functions on Manifolds

Juan Cerviño 1, Luiz F. O. Chamon 2, Benjamin D. Haeffele 3, René Vidal 3 4, Alejandro Ribeiro 1

1 Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, USA. 2 Excellence Cluster for Simulation Technology, University of Stuttgart, Germany. 3 Mathematical Institute for Data Science, Johns Hopkins University, Baltimore, USA. 4 Innovation in Data Engineering and Science (IDEAS), University of Pennsylvania, Philadelphia, USA. Correspondence to: Juan Cerviño.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract: Smoothness and low dimensional structures play central roles in improving generalization and stability in learning and statistics. This work combines techniques from semi-infinite constrained learning and manifold regularization to learn representations that are globally smooth on a manifold. To do so, it shows that under typical conditions the problem of learning a Lipschitz continuous function on a manifold is equivalent to a dynamically weighted manifold regularization problem. This observation leads to a practical algorithm based on a weighted Laplacian penalty whose weights are adapted using stochastic gradient techniques. It is shown that under mild conditions, this method estimates the Lipschitz constant of the solution, learning a globally smooth solution as a byproduct. Experiments on real world data illustrate the advantages of the proposed method relative to existing alternatives. Our code is available here.

1. Introduction

Learning smooth functions has been shown to be advantageous in general and is of particular interest in physical systems. This is because of the general observation that close input features tend to be associated with close outputs, and of the particular fact that, in physical systems, Lipschitz continuity of input-output maps translates to stability and safety (Oberman & Calder, 2018; Finlay et al., 2018b; Couellan, 2021; Finlay et al., 2018a; Pauli et al., 2021; Krishnan et al., 2020; Shi et al., 2019; Lindemann et al., 2021; Arghal et al., 2021).

To learn smooth functions one can require the parameterization to be smooth. Such is the idea, e.g., of spectral normalization of weights in neural networks (Miyato et al., 2018; Zhao & Liu, 2020). Smooth parameterizations have the advantage of being globally smooth, but they may be restrictive because they impose smoothness for inputs that are not necessarily realized in the data. This drawback motivates the use of Lipschitz penalties in risk minimization (Oberman & Calder, 2018; Finlay et al., 2018b; Couellan, 2021; Pauli et al., 2021; Bungert et al., 2021), which offers the opposite tradeoff. Since penalties encourage but do not enforce small Lipschitz constants, we may learn functions that are smooth on average but have no global guarantee of smoothness at every point in the support of the data (Bubeck & Sellke, 2021; Bubeck et al., 2021). Formulations that guarantee global smoothness can be obtained if the risk minimization problem is modified by the addition of a Lipschitz constant constraint (Krishnan et al., 2020; Shi et al., 2019; Lindemann et al., 2021; Arghal et al., 2021). This yields formulations that guarantee Lipschitz smoothness at all possible inputs without the drawback of enforcing smoothness outside of the input data distribution.
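As a concrete instance of the smooth-parameterization approach mentioned above (our own example, not taken from the paper), spectral normalization caps the Lipschitz constant of each layer in the ambient space, so the resulting network is globally smooth irrespective of where the data actually lies; this is the restrictiveness alluded to in the previous paragraph.

```python
# Example (ours) of the smooth-parameterization approach: spectral normalization
# (Miyato et al., 2018) rescales each weight matrix to approximately unit spectral
# norm (power-iteration estimate), so every layer, and hence the whole network below,
# is roughly 1-Lipschitz in the ambient space, independently of the data distribution.
import torch.nn as nn
from torch.nn.utils import spectral_norm

model = nn.Sequential(
    spectral_norm(nn.Linear(2, 64)),   # approximately 1-Lipschitz linear layer
    nn.Tanh(),                         # tanh is 1-Lipschitz
    spectral_norm(nn.Linear(64, 1)),   # composition of 1-Lipschitz maps is 1-Lipschitz
)
```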
Several empirical studies (Krishnan et al., 2020; Shi et al., 2019; Lindemann et al., 2021; Arghal et al., 2021) have demonstrated the advantage of imposing global smoothness constraints on observed inputs. In this paper we exploit the fact that data can often be modeled as points on a low-dimensional manifold. We therefore consider manifold Lipschitz constants, in which function smoothness is assessed with respect to distances measured over the data manifold (Definition 1). Although this looks like a minor difference, controlling Lipschitz constants over data manifolds is quite different from controlling Lipschitz constants in the ambient space. In Figure 1, we look at a classification problem with classes arranged in two separate half moons. Constraining Lipschitz constants in the ambient space effectively assumes the underlying data is uniformly distributed in space [cf. Figure 1(d)]. Constraining Lipschitz constants on the data manifold, however, properly accounts for the data distribution [cf. Figure 1(b)]. This example also illustrates how constraining manifold Lipschitz constants is related to manifold regularization (Belkin et al., 2005; Niyogi, 2013; Li et al., 2022). The difference is that manifold regularization penalizes the average norm of the manifold gradient. This distinction is significant because regularizing is more brittle than imposing constraints. In the example in Figure 1, manifold regularization fails to separate the dataset when the moons are close [cf. Figure 1(c), bottom]. Classification with a manifold Lipschitz constant constraint is more robust to this change in the data distribution [cf. Figure 1(b), bottom].

Figure 1: Two moons dataset. The setting consists of a two-dimensional classification problem with two classes, 1 labeled and 200 unlabeled samples per class. The objective is to correctly classify the 200 unlabeled samples. We consider two cases: (top) the estimated manifold has two connected components, and (bottom) the manifold is weakly connected. Panels: (a) Dataset, (b) Manifold Lipschitz (ours), (c) Manifold Regularization, (d) Ambient Regularization; each panel plots the output of a one-layer neural network trained with the corresponding method.

The advantages of constraining Lipschitz constants on manifolds that we showcase in the synthetic example of Figure 1 also manifest in the wild. To illustrate this, we demonstrate it empirically with two physical experiments. The first experiment entails learning a model for a differential drive ground robot on a mix of complex terrains. The second experiment consists of learning a dynamical model for a quadrotor. In both of these experiments, constraining Lipschitz constants on manifolds improves upon standard risk minimization, manifold regularization, and the imposition of Lipschitz constraints in ambient space. Global constraints on the manifold gradient yield a statistical constrained learning problem with an infinite, dense set of constraints, which is challenging to approximate and solve.
Here, we approach the solution of this problem in the Lagrangian dual domain and establish connections with manifold regularization that allow for the use of point-cloud Laplacians. Our contributions include:

(C1) We introduce a constrained statistical risk minimization problem in which we learn a function that: (i) attains a target loss and (ii) attains the smallest possible manifold Lipschitz constant among functions that achieve this target loss (Section 2).

(C2) We introduce the Lagrangian dual problem and show that its empirical version is a statistically consistent approximation of the primal. These results do not require the learning parameterization to be linear (Section 3.1).

(C3) We generalize results from the manifold regularization literature to show that, under regularity conditions, the evaluation of manifold Lipschitz constants can be recast in a more amenable form utilizing a weighted point-cloud Laplacian (Proposition 3 in Section 3.2).

(C4) We present a dual ascent algorithm to find optimal multipliers. The function that attains the target loss and minimizes the manifold Lipschitz constant follows as a byproduct (Section 3.3).

(C5) We illustrate the merits of learning with global manifold smoothness guarantees, relative to ambient space and standard manifold regularization, through two physical experiments: (i) learning model mismatches in differential drive steering over non-ideal surfaces and (ii) learning the motion dynamics of a quadrotor (Section 4).

Related Work. This paper is at the intersection of learning with Lipschitz constant constraints (Oberman & Calder, 2018; Finlay et al., 2018b; Couellan, 2021; Pauli et al., 2021; Bungert et al., 2021; Miyato et al., 2018; Zhao & Liu, 2020; Krishnan et al., 2020; Shi et al., 2019; Lindemann et al., 2021; Arghal et al., 2021) and manifold regularization (Belkin et al., 2005; Niyogi, 2013; Li et al., 2022; Hein et al., 2005; Belkin & Niyogi, 2005). Relative to learning with Lipschitz constraints, we offer the ability to leverage data manifolds. Since data manifolds are often obtained from unlabeled data, as in (Kejani et al., 2020; Belkin & Niyogi, 2004; Jiang et al., 2019; Kipf & Welling, 2016; Yang et al., 2016; Zhu, 2005; Lecouat et al., 2018; Ouali et al., 2020; Cabannes et al., 2021), we also use point-cloud Laplacian techniques to compute the integral of the norm of the gradient. Relative to the literature on manifold regularization, we offer global smoothness assurances instead of an average penalty on large manifold gradients. Similar to us, (Krishnan et al., 2020) poses the problem of minimizing a Lipschitz constant. However, they utilize a softer surrogate (a p-norm loss), which is a smoother version of the Lipschitz constant. Their approach therefore trades off numerical stability (small p) against accurate Lipschitz constant estimation (p = ∞). We do not work with surrogates and seek to minimize the maximum norm of the gradient utilizing an epigraph technique.

2. Globally Constraining Manifold Lipschitz Constants

We consider data pairs (x, y) in which the input features x ∈ M ⊂ R^D lie on a compact oriented Riemannian manifold M and the output features are real valued, y ∈ R. We study the regression problem of finding a function f_θ : M → R, parameterized by θ ∈ Θ ⊆ R^Q, that minimizes the expectation of a nonnegative loss ℓ : R × R → R_+, where ℓ(f_θ(x), y) represents the loss of predicting output f_θ(x) when the world realizes the pair (x, y).
Data pairs (x, y) are drawn according to an unknown probability distribution p(x, y) on M × R, which we can factor as p(x, y) = p(x)p(y|x). We are interested in learning smooth functions, i.e., functions with controlled variability over the manifold M. We therefore let ∇_M f_θ(x) denote the manifold gradient of f_θ and introduce the following definition.

Definition 1 (Manifold Lipschitz Constant) Given a Riemannian manifold M, the function f_θ : M → R is said to be L-Lipschitz continuous if there exists a strictly positive constant L > 0 such that for all pairs of points x_1, x_2 ∈ M,

|f_θ(x_1) − f_θ(x_2)| ≤ L d_M(x_1, x_2),   (1)

where d_M(x_1, x_2) denotes the distance between x_1 and x_2 on the manifold M.

If the function f_θ is differentiable on the manifold, (1) is equivalent to requiring the norm of the manifold gradient to be bounded by L,

‖∇_M f_θ(x)‖ := lim_{δ→0} sup_{x′ ∈ M : x′ ≠ x, d_M(x, x′) ≤ δ} |f_θ(x) − f_θ(x′)| / d_M(x, x′) ≤ L, for all x ∈ M.   (2)

With Definition 1 in place, and restricting attention to differentiable functions f_θ, our stated goal of learning functions f_θ with controlled variability over the manifold M can be written as

P⋆ = min_{θ ∈ Θ, ρ ≥ 0} ρ,   (3)
subject to E_{p(x,y)}[ℓ(f_θ(x), y)] ≤ ϵ,
‖∇_M f_θ(z)‖² ≤ ρ, p(z)-a.e. z ∈ M.

In this formulation, the statistical loss E_{p(x,y)}[ℓ(f_θ(x), y)] is required to be below a target level ϵ. Of all functions f_θ that satisfy this loss requirement, Problem (3) defines as optimal those that have the smallest Lipschitz constant, L⋆ = √ρ⋆. The goal of this paper is to develop methodologies to solve (3) when the data distribution and the manifold are unknown. To characterize the distribution, we use N sample pairs (x_n, y_n) drawn independently from the joint distribution p(x, y). To characterize the manifold, we use samples z_n drawn from the marginal distribution p(x). These include the N samples x_n from the (labeled) data pairs (x_n, y_n), but may also include Ñ additional (unlabeled) samples z_n, n = N + 1, . . . , N + Ñ.

Observe from (3) that the problems of manifold regularization (Belkin et al., 2005; Niyogi, 2013; Kejani et al., 2020; Li et al., 2022) and Lipschitz constant control (Oberman & Calder, 2018; Finlay et al., 2018b; Couellan, 2021; Finlay et al., 2018a; Pauli et al., 2021; Krishnan et al., 2020; Shi et al., 2019; Lindemann et al., 2021; Arghal et al., 2021) are related. This connection is important to understand the merit of (3). To explain this better, observe that there are three motivations for the problem formulation in (3): (i) it is often the case that if samples x_1 and x_2 are close, then the conditional distributions p(y | x_1) and p(y | x_2) are close as well; a function f_θ with a small Lipschitz constant leverages this property. (ii) The Lipschitz constant of f_θ is guaranteed to be no larger than √ρ; this provides advantages in, e.g., physical systems, where Lipschitz constant guarantees translate to stability and safety assurances. (iii) It leverages the intrinsic low-dimensional structure of the manifold M embedded in the ambient space; in particular, this permits taking advantage of unlabeled data. Motivations (i) and (ii) are tropes of the Lipschitz regularization literature; e.g., (Oberman & Calder, 2018; Finlay et al., 2018b; Couellan, 2021; Finlay et al., 2018a; Pauli et al., 2021; Krishnan et al., 2020; Shi et al., 2019; Lindemann et al., 2021; Arghal et al., 2021).
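The manifold gradient in (2) is not directly observable from samples. As a rough illustration of Definition 1 (our own sketch, not part of the paper's method), one can estimate the manifold Lipschitz constant from data by taking difference quotients over a k-nearest-neighbor graph, whose edge lengths act as a local proxy for d_M; the function name and the choice of k are ours.

```python
# Sketch (ours): empirical estimate of the manifold Lipschitz constant of Definition 1.
# Neighboring-pair distances are used as a local surrogate for the manifold distance d_M.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def empirical_manifold_lipschitz(X, f_vals, k=8):
    """Largest difference quotient |f(x_i) - f(x_j)| / d(x_i, x_j) over neighboring pairs."""
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    L = 0.0
    for i in range(X.shape[0]):
        for d_ij, j in zip(dist[i, 1:], idx[i, 1:]):   # column 0 is the point itself
            if d_ij > 0:
                L = max(L, abs(f_vals[i] - f_vals[j]) / d_ij)
    return L

# Usage: X is an (N, D) array of inputs lying on the manifold, f_vals = f_theta(X) an (N,) array.
```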
The problem formulation in (3) is, indeed, inspired by similar formulations in which the Lipschitz constant is controlled in the ambient space,

minimize_{θ ∈ Θ, L ≥ 0} L,   (4)
subject to E_{p(x,y)}[ℓ(f_θ(x), y)] ≤ ϵ,
|f_θ(w) − f_θ(z)| ≤ L ‖w − z‖, for almost all (w, z) ∼ p(w)p(z).

A key difference between (3) and (4) is that the latter uses a Lipschitz condition that does not require differentiability. A more important difference is that in (4) the Lipschitz constant is controlled in the ambient space: the distance between features w and z in (4) is the Euclidean distance ‖w − z‖. This is disparate from the manifold metric d_M(w, z) that is implicit in the manifold gradient constraint in (3). Thus, the formulation in (3) improves upon (4) because it leverages the structure of the manifold M [cf. motivation (iii)].

Motivations (i) and (iii) are themes of the manifold regularization literature (Belkin et al., 2005; Niyogi, 2013; Kejani et al., 2020; Li et al., 2022). And, indeed, it is straightforward to conclude by invoking Green's first identity (see Section 3.2) that the formulation in (3) is also inspired by the manifold regularization problem,

minimize_{θ ∈ Θ} E_{p(x,y)}[ℓ(f_θ(x), y)] + ∫_M ‖∇_M f_θ(z)‖² p(z) dV(z).   (5)

The difference between (3) and (5) is that the latter adds the manifold Lipschitz constant as a regularization penalty. This is disparate from the imposition of a manifold Lipschitz constraint in (3). The regularization in (5) favors solutions with small Lipschitz constants by penalizing large Lipschitz constants, while the constraint in (3) guarantees that the Lipschitz constant is bounded by √ρ. This is the distinction between regularizing a Lipschitz constant and constraining a Lipschitz constant. Moreover, the constraint in (3) is imposed at all points of the manifold, whereas the regularization in (5) is an average over the manifold. Taking an average allows for large Lipschitz constants at some specific points if this is canceled out by small Lipschitz constants at other points of the manifold (Bubeck & Sellke, 2021; Bubeck et al., 2021). Both of these observations imply that (3) improves upon (5) because it offers global smoothness guarantees that are important in, e.g., physical systems [cf. motivation (ii)].

Remark 1 (Manifold Lipschitz formulations) There are three arbitrary choices in (3): (a) we choose to constrain the average statistical loss, E_{p(x,y)}[ℓ(f_θ(x), y)] ≤ ϵ; (b) we choose to constrain the pointwise Lipschitz constant, ‖∇_M f_θ(z)‖² ≤ ρ; and (c) we choose as our objective to require a target loss ϵ and minimize the Lipschitz constant L = √ρ. We could alternatively choose to constrain the pointwise loss, ℓ(f_θ(x), y) ≤ ϵ, to constrain the average Lipschitz constant, ∫_M ‖∇_M f_θ(z)‖² p(z) dV(z) ≤ ρ [cf. (5)], or to require a target smoothness L = √ρ and minimize the loss ϵ. All eight possible combinations of these choices are of interest. We formulate (3) because it is the most natural intersection between the regularization of Lipschitz constants in ambient spaces [cf. (4)] and manifold regularization [cf. (5)]. The techniques we develop in this paper can be adapted to any of the other seven alternative formulations.

3. Learning with Global Lipschitz Constraints

Problem (3) is a constrained learning problem that we will solve in the dual domain (Chamon & Ribeiro, 2020). To that end, observe that (3) has statistical and pointwise constraints. The loss constraint E_{p(x,y)}[ℓ(f_θ(x), y)] ≤ ϵ is said to be statistical because it restricts the expected loss over the data distribution.
The Lipschitz constant constraints ‖∇_M f_θ(z)‖² ≤ ρ are said to be pointwise because they are imposed at all individual points of the manifold, except perhaps for a set of zero measure. Consider then a Lagrange multiplier µ associated with the statistical constraint E_{p(x,y)}[ℓ(f_θ(x), y)] ≤ ϵ and a Lagrange multiplier distribution λ(z) associated with the set of pointwise constraints ‖∇_M f_θ(z)‖² ≤ ρ. We define the Lagrangian L(θ, µ, λ) associated with (3) as

L(θ, µ, λ) := µ ( E[ℓ(f_θ(x), y)] − ϵ ) + ∫_M λ(z) ‖∇_M f_θ(z)‖² p(z) dV(z).   (6)

The dual problem associated with (3) can then be written as

D⋆ = max_{µ, λ ≥ 0} min_θ L(θ, µ, λ),   (7)
subject to ∫_M λ(z) p(z) dV(z) = 1.

We point out that in (7) we remove ρ from the Lagrangian by incorporating the dual variable constraint ∫_M λ(z) p(z) dV(z) = 1 (see Appendix A for details). We henceforth use the dual problem (7) in lieu of (3). Since we are interested in situations in which we do not have access to the data distribution p(x, y), we further consider empirical versions of (7). Given N i.i.d. samples (x_n, y_n) drawn from p(x, y), define the empirical Lagrangian L̂(θ, µ̂, λ̂) as

L̂(θ, µ̂, λ̂) := µ̂ ( (1/N) Σ_{n=1}^N ℓ(f_θ(x_n), y_n) − ϵ ) + (1/N) Σ_{n=1}^N λ̂(x_n) ‖∇_M f_θ(x_n)‖²,   (8)

and the empirical dual problem as

D̂⋆ = max_{µ̂, λ̂ ≥ 0} min_θ L̂(θ, µ̂, λ̂),   (9)
subject to (1/N) Σ_{n=1}^N λ̂(x_n) = 1.

To simplify notation, we assume from now on that no unlabeled samples are available. If unlabeled samples are given, the modification is straightforward (see Appendix B). The remainder of this section provides three technical contributions. In Section 3.1 we address contribution C2: to justify the use of (9), we must show statistical consistency with respect to the primal problem (3). This is challenging for two reasons: (i) since we do not assume the use of a linear parameterization, (3) is not a linear problem in θ, so the primal (3) and dual (7) are not necessarily equivalent; (ii) since we are maximizing over the dual variables µ and λ(z), we do not know if the empirical dual formulation in (9) is close to the statistical dual formulation in (7). We overcome these two challenges and show that the empirical dual problem (9) is a consistent approximation of the statistical primal problem (Proposition 1). In Section 3.2 we address contribution C3: solving (9) requires evaluating the gradient norm sum Σ_{n=1}^N λ̂(x_n) ‖∇_M f_θ(x_n)‖². We generalize results from the manifold regularization literature to show that, under regularity conditions on λ, the gradient norm integral can be computed in a more amenable form utilizing a weighted point-cloud Laplacian (Proposition 3). In Section 3.3 we address contribution C4: we introduce a primal-dual algorithm to solve (9).

3.1. Statistical consistency of the empirical dual problem

In this section, we show that (9) is close to (3) under the following assumptions.

Assumption 1 The loss ℓ(·, y) is M-Lipschitz continuous, [0, B]-valued, and convex for all y ∈ R.

Assumption 2 Let H = {f_θ | θ ∈ Θ}, with compact Θ ⊂ R^Q, be the hypothesis class and let H̄ = conv(H) be the closure of its convex hull. For each ν > 0 and φ ∈ H̄, there exists θ ∈ Θ such that simultaneously sup_{z ∈ M} |φ(z) − f_θ(z)| ≤ ν and sup_{z ∈ M} ‖∇_M φ(z) − ∇_M f_θ(z)‖ ≤ ν.

Assumption 3 The functions in the hypothesis class H satisfy ‖∇_M f_θ1(z) − ∇_M f_θ2(z)‖ ≤ G |f_θ1(z) − f_θ2(z)|, for all θ1, θ2 ∈ Θ.
Assumption 4 There exists ζ(N, δ) 0, and ˆζ(N, δ) 0 monotonically decreasing with N, such that with probability 1 δ over independent draws (xn, yn) p, E[ℓ(fθ(x), y)] 1 n=1 ℓ fθ(xn), yn ζ(N, δ), n=1 fθ(xn) ˆζ(N, δ), (10) for all θ Θ and all distributions p. Assumption 5 There exists a feasible solution θ Θ such that E[ℓ(f θ(x), y)] < ϵ Mν. Assumption 1 holds for most losses utilized in practice. Assumption 2 is a requirement on the richness of the parametrization. In the particular case of neural networks, the covering constant ν is upper bounded by the universal approximation bound of the neural network. Assumption 3 also holds for neural networks with smooth nonlinearities e.g., hyperbolic tangents. The uniform convergence property (10) is customary in learning theory to prove PAC learnability and is implied by bounds on complexity measures such as VC dimension or Rademacher complexity (Mohri et al., 2018; Vapnik, 1999; Shalev-Shwartz & Ben-David, 2014). In fact, if it has bounded Rademacher complexity, both ζ and ˆζ are bounded due to Assumption 1. The following proposition provides the desired bound. Proposition 1 Let ˆµ , ˆλ be solutions of the empirical dual problem (9). Under assumptions 1 5, there exists ˆθ argminθ ˆL(θ, ˆµ , ˆλ ) such that, with probability 1 5δ, |P ˆD | O(ν) + (1 + )ζ(N, δ) + O(ˆζ(N, δ)), (11) where = max(ˆµ , µ ) < and µ , ˆµ are solutions of (7), (9) respectively. Proposition 1 shows that the empirical dual problem 9 is statistically consistent. That is to say, for any draw of N samples according to p, the difference between the empirical dual problem (9) and the statistical smooth learning problem (3) decreases as N increases. This difference is bounded in terms of the richness of the parametrization (ν), the difficulty of the fit requirement (as expressed by the optimal dual variables µ , ˆµ ), and the number of samples (N). Learning Globally Smooth Functions on Manifolds The guarantee has a form typical for constrained learning problems (Chamon et al., 2023). Proposition 1 states that we are able to predict what is the minimum norm of the gradient that a function class can have while achieving an expected loss of at most ϵ. This is important because we do not require access to the distribution p, only a set of N samples from this distribution. On the other hand, Proposition 1 does not state that by solving the dual problem (9) we will obtain a solution of the primal problem (3). The following proposition provides a bound on the near feasibility of the solution of (9) with respect to the solution of (3). Proposition 2 Let ˆµ , ˆλ be solutions of the empirical dual problem (9). Under assumptions 1 5, there exists ˆθ argminθ ˆL(θ, ˆµ , ˆλ ) such that, with probability 1 5δ, max n [N] Mfˆθ (xn) 2 P + O(ν) + (1 + )ζ(N, δ) + O(ˆζ(N, δ)) (12) n=1 ℓ(fˆθ (xn), yn) ϵ and E[ℓ(fˆθ (x), y)] ϵ + ζ(N, δ). (13) where = max(ˆµ , µ ) < and µ , ˆµ are solutions of (7), (9) respectively. Proposition 2 provides near optimality and near feasibility conditions for solutions ˆθ obtained through the empirical dual problem (9). The difference between the maximum gradient of the obtained solution θ and the optimal value P is bounded by the number of samples N as well as the empirical constraint violation. Notice that even though the optimal dual variable ˆµ is not known, the constraint violation can be evaluated in practice as it only requires to evaluate the obtained function over the N given samples. 
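Both data-dependent quantities in Proposition 2 can be computed from the N samples alone, so the near-feasibility certificate can be monitored during training. Below is a minimal PyTorch sketch of this check (our illustration, not the authors' code); it uses the ambient input gradient of f_θ at each sample as a stand-in for the manifold gradient, whereas the paper estimates the manifold term through the weighted point-cloud Laplacian of Section 3.2. The function and argument names are ours.

```python
import torch

def empirical_certificates(model, x, y, loss_fn, eps):
    """Empirical loss slack and largest squared gradient norm over the N samples,
    the two quantities that appear in the bounds of Proposition 2."""
    x = x.clone().requires_grad_(True)
    f = model(x).squeeze(-1)                        # f_theta(x_n), shape (N,)
    slack = loss_fn(f, y) - eps                     # (1/N) sum_n loss - eps
    grads = torch.autograd.grad(f.sum(), x)[0]      # per-sample input gradients
    max_grad_sq = grads.pow(2).sum(dim=1).max()     # max_n ||grad f(x_n)||^2
    return slack.item(), max_grad_sq.item()
```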
Remark 2 (Interpolators) In practice, the number of parameters in a parametric function (e.g., Neural Network) tends to exceed the dimension of the input, which allows functions to interpolate the data, i.e., to attain zero loss on the dataset. Proposition 2 presents a connection to interpolating functions. By setting ϵ = 0, if the function achieves zero error over the empirical distribution i.e. ℓ(fˆθ (xn), yn) = 0, for all n [N], then the dependency on µ disappears. This implies that within the classifiers that interpolate the data, the one with the minimum Lipschitz constant over the available samples will probably be the one with the minimum Lipschitz constant. 3.2. From Manifold Gradient to Discrete Laplacian We derive an alternative way of computing the integral of the norm of the gradient utilizing samples. To do so, we define the normalized point-cloud Laplacian according a probability distribution λ. Definition 2 (Point-cloud Laplacian) Consider a set of points x1, . . . , x N M, sampled according to the density λ : M R. The normalized graph Laplacian of fθ at z M is defined as Lt λ,Nfθ(z) = 1 n=1 W(z, xn) fθ(z) fθ(xn) , (14) for W(z, xn) = 1 t Gt(z, xn) q ˆw(z) ˆW(xn) , with Gt(z, xn) = 1 (4πt)d/2 e z xn 2 n=1 Gt(z, xn), and ˆW(xn) = 1 N 1 m =n Gt(xm, xn). As long as the function considered is smooth enough, the following convergence result holds: Proposition 3 (Point-Cloud Estimate) Let Λ be the set of probability densities defined on a compact d-dimensional differentiable manifold M isometrically embedded in RD such that Λ = {λ : 0 < a λ(z) b < , | λ x| c < and | 2λ x2 | d < for all z M}, and let fθ, with θ Θ, be a family of functions with uniformly bounded derivatives up to order 3 vanishing at the boundary M. For any ϵ > 0, δ > 0, there exists N , N such that P sup λ Λ,θ Θ Z Mfθ(z) 2λ(z)d V (z) i=1 fθ(zi)Lt λ,N fθ(zi)λ(zi) > ϵ δ (15) where the point-cloud Laplacian Lt λ,N fθ is as in Definition 2, with t = N 1 d+2+α for any α > 0. The proof of Proposition 3 relies on two steps. First, we relate the integral of the norm of the gradient over the manifold with the integral of the continuous Laplace-Beltrami operator by virtue of Green s identity. Second, we approximate the value of the Laplace-Beltrami operator by the point-cloud Laplacian. Proposition 3 connects the integral of the norm of the gradient of function fθ with a point-cloud Laplacian operator. This result connects the dual problem (9), with the primal (3) while allowing for a more amenable way of computing the integral. Remark 3 (Laplacian Regularization) The dual problem (7) is closely related to the manifold regularization Learning Globally Smooth Functions on Manifolds problem (5). In particular, the two become equivalent by substituting ρ = µ 1 and utilizing a uniform distribution for λ. The key difference between the problems is given by the dual variable λ, which can be thought as a probability distribution over the manifold that penalizes regions of the manifold where the norm of the gradient of fθ is larger. The standard procedure in Laplacian regularization is to calculate the graph-Laplacian of the set of points L utilizing the heat kernel and compute the integral by f T Lf, where f = [fθ(x1), . . . , fθ(xn)]T . In the case of Manifold Lipschitz, the same product can be computed, but utilizing the re-weighted point-cloud laplacian. 3.3. Dual Ascent Algorithm We outline an iterative and empirical primal-dual algorithm to solve the dual problem. 
Upon initializing θ_0, λ_0, and µ_0, we set out to minimize the Lagrangian over the primal variable using

θ_{k+1} = θ_k − η_θ ∇_θ L̂(θ_k, µ_k, λ_k),   (16)

where η_θ is a positive stepsize. Note that to compute (16), we can either utilize the gradient-norm version of the Lagrangian (9) or the point-cloud Laplacian form (15). After updating θ, we update the dual variables by

µ_{k+1} = µ_k + η_µ ( (1/N) Σ_{n=1}^N ℓ(f_{θ_{k+1}}(x_n), y_n) − ϵ ),   (17)

λ̃_{k+1}(x_n) = λ_k(x_n) + η_λ ‖∇_M f_{θ_{k+1}}(x_n)‖², n = 1, . . . , N,   (18)

λ_{k+1} = argmin_{λ ≥ 0 : Σ_{n=1}^N λ(x_n) = N} ‖λ − λ̃_{k+1}‖,   (19)

where η_µ, η_λ are again positive stepsizes. Note that we require a convex projection over λ to satisfy the normalization constraint. In step (18), we need to estimate the norm of the gradient at data point x_n, which we do using neighboring data points. Intuitively, the primal-dual procedure increases the value of λ(x_n) at points at which the norm of the gradient is larger. The role of µ is to enforce the loss ℓ to be smaller than ϵ by adjusting the relative importance of the loss ℓ over the integral of the gradient norm (see Appendix F).
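The following is a minimal PyTorch sketch of one iteration of (16)-(19) (our illustration, not the authors' released implementation). The manifold gradient norms are replaced by ambient input-gradient norms of f_θ at the samples, and the exact Euclidean projection in (19) is replaced by clamping at zero followed by rescaling to mean one, a simplification. Here `mu` is a nonnegative float, `lam` a length-N tensor, and all names and stepsizes are illustrative.

```python
import torch

def grad_norm_sq(model, x):
    """Per-sample squared input-gradient norm, a stand-in here for ||grad_M f(x_n)||^2."""
    x = x.clone().requires_grad_(True)
    f = model(x).squeeze(-1)
    g = torch.autograd.grad(f.sum(), x, create_graph=True)[0]
    return g.pow(2).sum(dim=1)

def primal_dual_step(model, opt, x, y, loss_fn, mu, lam, eps, eta_mu, eta_lam):
    # (16): one descent step on the empirical Lagrangian (8) over theta.
    opt.zero_grad()
    slack = loss_fn(model(x).squeeze(-1), y) - eps
    lagrangian = mu * slack + (lam * grad_norm_sq(model, x)).mean()
    lagrangian.backward()
    opt.step()
    # (17): dual ascent on mu, driven by the empirical loss slack (kept nonnegative).
    with torch.no_grad():
        slack = loss_fn(model(x).squeeze(-1), y) - eps
    mu = max(0.0, mu + eta_mu * slack.item())
    # (18)-(19): dual ascent on lambda, then clamp and rescale so that (1/N) sum lam = 1
    # (a simple surrogate for the exact projection in (19)).
    lam = lam + eta_lam * grad_norm_sq(model, x).detach()
    lam = torch.clamp(lam, min=0.0)
    lam = lam * lam.numel() / lam.sum().clamp(min=1e-12)
    return mu, lam
```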
4. Experiments

To demonstrate the effectiveness of our method, we conduct two real-world experiments with physical systems. In Section 4.1, we tackle a regression problem in which we predict the error made when estimating the state of a ground robot making turns on both pavement and grass. In Section 4.2, we learn the dynamics of a quadrotor taking off and flying circles in open air. In both experiments, data is acquired from real-world robots, and we compare our method against standard Empirical Risk Minimization (ERM), Ambient Regularization, and Manifold Regularization.

4.1. Ground Robot Error Prediction

In this experiment, we seek to learn the error that a model would make in predicting the dynamics of a ground robot by looking at the states throughout a trajectory (Koppel et al., 2016). The data acquisition for this experiment involves an iRobot PackBot equipped with a high-resolution camera, driven through turns on both pavement and grass. The dynamics of the system are governed by the discrete-time nonlinear state-space system

x_{k+1} = f(x_k, u_k) + g(u_k),   (20)

where x_k is the state of the system, u_k is the control input, f(x_k, u_k) is the model prediction, and g(u_k) is the non-modelable error of the prediction. In this setting, the robot state involves the position, and the control inputs are linear and angular velocities. The dynamics of the system are modeled by f(x_k, u_k), which involves the mass of the robot, the radius of the wheels, and other known parameters of the robot. In practice, the model f(x_k, u_k) is not perfect, and the model mismatch is due to differences in friction with the ground, delays in communications with the sensors, and discrepancies in the robot specifications. To make matters worse, some of these discrepancies are difficult to model or intractable to compute in practice. We possess trajectories in the form of time series of the control signals, i.e., the linear and angular velocity of the robot. For each trajectory, the average and variance of the mismatch between the real and the model-predicted states are quantified. Given a dataset of trajectories and errors made by the model, the objective is to predict the error that the model will make on a given trajectory.

In order to construct the point-cloud Laplacian, we measured the Euclidean distance between samples and computed (14). In this case we added a threshold, i.e., a minimum magnitude of L_ij below which the value of L_ij is set to 0. This is done to disregard the effect of samples that are far away, by only considering neighborhoods of each sample. By doing this, we construct a manifold in which samples that are sufficiently close are forced to have similar outputs; if samples are not sufficiently close, the similarity between their outputs is not enforced (a toy example illustrating this point is given in Appendix G.5). The manifold in this case is constructed from time series that are sufficiently similar to each other. Further details of the experiment and sample trajectories can be found in Appendix G.2.

| Method | Grass | Pavement |
| --- | --- | --- |
| ERM | 0.42 | 0.0120 |
| Ambient Regularization | 0.31 | 0.0065 |
| Manifold Regularization | 0.38 | 0.0045 |
| Manifold Lipschitz (ours) | 0.25 | 0.0032 |

Table 1: Error prediction accuracy for the Ground Robot experiment.

As seen in Table 1, regularization improves the accuracy over the standard ERM framework. This is related to the fact that, since the underlying predictive model is the same for all trajectories, similar trajectories will have similar errors. However, our method improves upon both regularization techniques. This is related to the idea that between experiments only the velocities change (the environment is fixed), so we can intuitively imagine a continuous transition in the error made by the model. This explains why our method (Manifold Lipschitz), which forces the function to be smooth over the trajectory space, is significantly better on both terrains. Our method also improves upon ambient regularization because the Euclidean distance is able to approximate the distance on the manifold locally, but not globally. For further clarification of this last point, see Appendix G.5. In all, this experiment shows that our method improves the generalization capabilities of the learned function.

4.2. Quadrotor Model Mismatch

In this section we introduce a state prediction problem based on real-world dynamics. The setting consists of a quadrotor taking off and flying in circles for 12 seconds.

Figure 2: Quadrotor sample trajectories (training trajectory and test trajectory).

The experimental setup targets a speed of 0.4 m/s and tracks a circular trajectory of radius 0.5 m. We consider 2 trajectories of 12 seconds each, and the starting position of the quadrotor is the same for both trajectories (the ground). For each time stamp t, we have measurements of position, velocity, and acceleration in R^3, i.e., [x_t, y_t, z_t, ẋ_t, ẏ_t, ż_t, ẍ_t, ÿ_t, z̈_t]. For the learning procedure we utilize the 6000 samples, and we seek to minimize the mean squared error loss between the next state and the prediction given the current state. We train a two-layer neural network with the different methods shown in Table 2. To test the accuracy of the learned function, we compute the difference between the predicted state and the next state on the test trajectory.

| Method | Error on Test Trajectory |
| --- | --- |
| ERM | 0.00666 |
| Ambient Regularization | 0.00734 |
| Manifold Regularization | 0.00625 |
| Manifold Lipschitz (ours) | 0.00036 |

Table 2: State prediction error of a quadrotor flying in circles on an unseen trajectory.

Analogously to the experiment in Section 4.1, to construct the point-cloud Laplacian we compute the Euclidean distances within a neighborhood of each sample. Intuitively, looking at the trajectories in Figure 2, we can expect next states to be similar only for samples that are sufficiently close.
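The construction used in both experiments, heat-kernel weights restricted to a neighborhood of each sample, can be sketched as follows (our simplified illustration: the normalization of Definition 2 is replaced here by a standard symmetric degree normalization, and `t`, `tau` are illustrative hyperparameters rather than values from the paper). The resulting re-weighted quadratic form plays the role of the Laplacian penalty discussed in Remark 3.

```python
import numpy as np

def thresholded_laplacian(X, t, tau):
    """Point-cloud Laplacian with heat-kernel weights, zeroing the weights of far-away pairs."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared Euclidean distances
    W = np.exp(-d2 / (4.0 * t))                           # heat-kernel affinities
    np.fill_diagonal(W, 0.0)
    W[W < tau] = 0.0                  # drop far-away pairs: no smoothness enforced between them
    deg = W.sum(axis=1) + 1e-12
    Dm = np.diag(1.0 / np.sqrt(deg))
    return np.eye(X.shape[0]) - Dm @ W @ Dm               # symmetrically normalized Laplacian

def weighted_smoothness(f_vals, L, lam):
    """Re-weighted quadratic form, a proxy (up to constants) for the integral of ||grad_M f||^2."""
    return float(f_vals @ (lam * (L @ f_vals)))
```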
The euclidean distance is able to approximate the manifold locally but fails when it is too large. By looking at Figure 2 we can see that even though the training and testing trajectories are not the same, there is a close resemblance between the two of them. This allows a manifold model to be effective in predicting the next state. However, by looking at the results in Table 2, we see that adding regularization does not always help, as ambient regularization does not improve upon ERM. This is related to the fact that the simple euclidean distance is not necessarily a good measure of the how related two states are. Instead, one conclusion of the results shown in Table 2 is that adding regularization on the manifold space always improves upon ERM. In particular, our method give performance an order of magnitude better than standard ERM and standard Laplacian Regularization. This is a consequence of adding extra information about the problem by grouping similar states together, which reduces the impact of the noise in a particular sample. To conclude, we show that our method obtains an improvement over all the techniques considered and that by predicting the next state of a quadrotor from the current state utilizing smooth functions improves generalization for noisy measurements. 5. Conclusion In this work, we presented a constraint learning method to obtain smooth functions over manifold data. We showed that under mild conditions, the problem of finding smooth Learning Globally Smooth Functions on Manifolds functions over a manifold can be reformulated as a weighted point-cloud Laplacian penalty over varying probability distributions whose dynamics are govern by the constraint violations. Two experiments on real world data validate the empirical advantages of obtaining functions that vary smoothly over the data. ACKNOWLEDGMENTS The authors would like to thank Alec Koppel, Garrett Warnell, Jonathan Fink, Ethan Stump collecting and providing the dataset utilized for experiment 4.1. The authors would also like to thank Tom Zhang and M. Ani Hsieh for collecting and providing the dataset utilized for experiment 4.2. The work of Luiz F. O. Chamon was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany s Excellence Strategy Grant EXC 2075-390740016. Arghal, R., Lei, E., and Bidokhti, S. S. Robust graph neural networks via probabilistic lipschitz constraints. ar Xiv preprint ar Xiv:2112.07575, 2021. Bai, Q., Bedi, A. S., Agarwal, M., Koppel, A., and Aggarwal, V. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. ar Xiv preprint ar Xiv:2109.06332, 2021. Belkin, M. and Niyogi, P. Semi-supervised learning on riemannian manifolds. Machine learning, 56(1):209 239, 2004. Belkin, M. and Niyogi, P. Towards a theoretical foundation for laplacian-based manifold methods. In International Conference on Computational Learning Theory, pp. 486 500. Springer, 2005. Belkin, M., Niyogi, P., and Sindhwani, V. On manifold regularization. In International Workshop on Artificial Intelligence and Statistics, pp. 17 24. PMLR, 2005. Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex optimization. Cambridge university press, 2004. Bubeck, S. and Sellke, M. A universal law of robustness via isoperimetry. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https: //openreview.net/forum?id=z71OSKq TFh7. Bubeck, S., Li, Y., and Nagaraj, D. M. 
A law of robustness for two-layers neural networks. In Conference on Learning Theory, pp. 804 820. PMLR, 2021. Bungert, L., Raab, R., Roith, T., Schwinn, L., and Tenbrinck, D. Clip: Cheap lipschitz training of neural networks. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 307 319. Springer, 2021. Cabannes, V., Pillaud-Vivien, L., Bach, F., and Rudi, A. Overcoming the curse of dimensionality with laplacian regularization in semi-supervised learning. Advances in Neural Information Processing Systems, 34, 2021. Castellano, A., Min, H., Bazerque, J., and Mallada, E. Reinforcement learning with almost sure constraints. ar Xiv preprint ar Xiv:2112.05198, 2021. Cervino, J., Ruiz, L., and Ribeiro, A. Training stable graph neural networks through constrained learning. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4223 4227. IEEE, 2022. Chamon, L. and Ribeiro, A. Probably approximately correct constrained learning. Advances in Neural Information Processing Systems, 33:16722 16735, 2020. Chamon, L. F. O., Paternain, S., Calvo-Fullana, M., and Ribeiro, A. Constrained learning with non-convex losses. IEEE Transactions on Information Theory, 69(3):1739 1760, 2023. doi: 10.1109/TIT.2022.3187948. Chee, K. Y., Jiahao, T. Z., and Hsieh, M. A. Knode-mpc: A knowledge-based data-driven predictive control framework for aerial robots. IEEE Robotics and Automation Letters, 7(2):2819 2826, 2022. Couellan, N. The coupling effect of lipschitz regularization in neural networks. SN Computer Science, 2(2):1 9, 2021. Do Carmo, M. P. and Flaherty Francis, J. Riemannian geometry, volume 6. Springer, 1992. Dunson, D. B., Wu, H.-T., and Wu, N. Spectral convergence of graph laplacian and heat kernel reconstruction in l from random samples. Applied and Computational Harmonic Analysis, 55:282 336, 2021. Durrett, R. Probability: theory and examples, volume 49. Cambridge university press, 2019. Eisen, M., Zhang, C., Chamon, L. F., Lee, D. D., and Ribeiro, A. Learning optimal resource allocations in wireless systems. IEEE Transactions on Signal Processing, 67(10):2775 2790, 2019. Elenter, J., Naderi Alizadeh, N., and Ribeiro, A. A lagrangian duality approach to active learning. ar Xiv preprint ar Xiv:2202.04108, 2022. Learning Globally Smooth Functions on Manifolds Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. Efficient and accurate estimation of lipschitz constants for deep neural networks. Advances in Neural Information Processing Systems, 32, 2019. Finlay, C., Calder, J., Abbasi, B., and Oberman, A. Lipschitz regularized deep neural networks generalize and are adversarially robust. ar Xiv preprint ar Xiv:1808.09540, 2018a. Finlay, C., Oberman, A. M., and Abbasi, B. Improved robustness to adversarial examples using lipschitz regularization of the loss. Co RR, abs/1810.00953, 2018b. URL http://arxiv.org/abs/1810.00953. Hasanbeig, M., Abate, A., and Kroening, D. Logicallyconstrained reinforcement learning. ar Xiv preprint ar Xiv:1801.08099, 2018. Hein, M., Audibert, J.-Y., and Luxburg, U. v. From graphs to manifolds weak and strong pointwise consistency of graph laplacians. In International Conference on Computational Learning Theory, pp. 470 485. Springer, 2005. Hein, M., Audibert, J.-Y., and Luxburg, U. v. Graph laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8(6), 2007. Jiahao, T. Z., Hsieh, M. A., and Forgoston, E. 
Knowledgebased learning of nonlinear dynamics and chaos. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31 (11):111101, 2021. Jiahao, T. Z., Chee, K. Y., and Hsieh, M. A. Online dynamics learning for predictive control with an application to aerial robots. ar Xiv preprint ar Xiv:2207.09344, 2022. Jiang, B., Zhang, Z., Lin, D., Tang, J., and Luo, B. Semisupervised learning with graph learning-convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11313 11320, 2019. Jin, C. and Rinard, M. Manifold regularization for locally stable deep neural networks. ar Xiv preprint ar Xiv:2003.04286, 2020. Kejani, M. T., Dornaika, F., and Talebi, H. Graph convolution networks with manifold regularization for semisupervised learning. Neural Networks, 127:160 167, 2020. Khoury, M. and Hadfield-Menell, D. On the geometry of adversarial examples. ar Xiv preprint ar Xiv:1811.00525, 2018. Khrulkov, V., Mirvakhabova, L., Ustinova, E., Oseledets, I., and Lempitsky, V. Hyperbolic image embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. ar Xiv preprint ar Xiv:1609.02907, 2016. Koppel, A., Fink, J., Warnell, G., Stump, E., and Ribeiro, A. Online learning for characterizing unknown environments in ground robotic vehicle models. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 626 633, 2016. doi: 10.1109/IROS.2016. 7759118. Krishnan, V., Makdah, A., Al Rahman, A., and Pasqualetti, F. Lipschitz bounds and provably robust training by laplacian smoothing. Advances in Neural Information Processing Systems, 33:10924 10935, 2020. Krogh, A. and Hertz, J. A simple weight decay can improve generalization. Advances in neural information processing systems, 4, 1991. Lassance, C., Gripon, V., and Ortega, A. Laplacian networks: Bounding indicator function smoothness for neural networks robustness. APSIPA Transactions on Signal and Information Processing, 10, 2021. Lecouat, B., Foo, C.-S., Zenati, H., and Chandrasekhar, V. R. Semi-supervised learning with gans: Revisiting manifold regularization. ar Xiv preprint ar Xiv:1805.08957, 2018. Li, Z., Chen, Y., Le Cun, Y., and Sommer, F. T. Neural manifold clustering and embedding. ar Xiv preprint ar Xiv:2201.10000, 2022. Lindemann, L., Hu, H., Robey, A., Zhang, H., Dimarogonas, D., Tu, S., and Matni, N. Learning hybrid control barrier functions from data. In Conference on Robot Learning, pp. 1351 1370. PMLR, 2021. Ma, X., Li, B., Wang, Y., Erfani, S. M., Wijewickrema, S., Schoenebeck, G., Song, D., Houle, M. E., and Bailey, J. Characterizing adversarial subspaces using local intrinsic dimensionality. ar Xiv preprint ar Xiv:1801.02613, 2018. Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. ar Xiv preprint ar Xiv:1802.05957, 2018. Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of machine learning. MIT press, 2018. Moosavi-Dezfooli, S.-M., Fawzi, A., Uesato, J., and Frossard, P. Robustness via curvature regularization, and vice versa. In Proceedings of the IEEE/CVF Conference Learning Globally Smooth Functions on Manifolds on Computer Vision and Pattern Recognition, pp. 9078 9086, 2019. Niyogi, P. Manifold regularization and semi-supervised learning: Some theoretical analyses. Journal of Machine Learning Research, 14(5), 2013. Oberman, A. M. 
and Calder, J. Lipschitz regularized deep neural networks converge and generalize. ar Xiv preprint ar Xiv:1808.09540, 2018. Ouali, Y., Hudelot, C., and Tami, M. An overview of deep semi-supervised learning. ar Xiv preprint ar Xiv:2006.05278, 2020. Paternain, S., Chamon, L., Calvo-Fullana, M., and Ribeiro, A. Constrained reinforcement learning has zero duality gap. Advances in Neural Information Processing Systems, 32, 2019. Paternain, S., Calvo-Fullana, M., Chamon, L. F., and Ribeiro, A. Safe policies for reinforcement learning via primal-dual methods. IEEE Transactions on Automatic Control, 2022. Pauli, P., Koch, A., Berberich, J., Kohler, P., and Allg ower, F. Training robust neural networks using lipschitz bounds. IEEE Control Systems Letters, 6:121 126, 2021. Robey, A., Chamon, L., Pappas, G. J., Hassani, H., and Ribeiro, A. Adversarial robustness with semi-infinite constrained learning. Advances in Neural Information Processing Systems, 34:6198 6215, 2021. Rosca, M., Weber, T., Gretton, A., and Mohamed, S. A case for new neural network smoothness constraints. In Zosa Forde, J., Ruiz, F., Pradier, M. F., and Schein, A. (eds.), Proceedings on I Can t Believe It s Not Better! at Neur IPS Workshops, volume 137 of Proceedings of Machine Learning Research, pp. 21 32. PMLR, 12 Dec 2020. URL https://proceedings.mlr.press/ v137/rosca20a.html. Rosenberg, S. The Laplacian on a Riemannian Manifold: An Introduction to Analysis on Manifolds. London Mathematical Society Student Texts. Cambridge University Press, 1997. doi: 10.1017/CBO9780511623783. Ruszczynski, A. Nonlinear optimization. Princeton university press, 2011. Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. Shen, Z., Cervino, J., Hassani, H., and Ribeiro, A. An agnostic approach to federated learning with class imbalance. In International Conference on Learning Representations, 2021. Shi, G., Shi, X., O Connell, M., Yu, R., Azizzadenesheli, K., Anandkumar, A., Yue, Y., and Chung, S.-J. Neural lander: Stable drone landing control using learned dynamics. In 2019 International Conference on Robotics and Automation (ICRA), pp. 9784 9790. IEEE, 2019. Stutz, D., Hein, M., and Schiele, B. Disentangling adversarial robustness and generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6976 6987, 2019. Vapnik, V. The nature of statistical learning theory. Springer science & business media, 1999. Wang, W. and Carreira-Perpin an, M. A. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. ar Xiv preprint ar Xiv:1309.1541, 2013. Wu, H.-T. and Wu, N. Think globally, fit locally under the manifold setup: Asymptotic analysis of locally linear embedding. The Annals of Statistics, 46(6B):3805 3837, 2018. Yang, T. Advancing non-convex and constrained learning: Challenges and opportunities. AI Matters, 5(3):29 39, 2019. Yang, Z., Cohen, W., and Salakhudinov, R. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, pp. 40 48. PMLR, 2016. Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586 595, 2018. Zhang, S., Huang, K., Zhu, J., and Liu, Y. Manifold adversarial training for supervised and semi-supervised learning. Neural Networks, 140:282 293, 2021. 
ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2021.03. 031. URL https://www.sciencedirect.com/ science/article/pii/S0893608021001192. Zhao, L. and Liu, Y. Spectral normalization for domain adaptation. Information, 11(2):68, 2020. Zhu, X. J. Semi-supervised learning literature survey. University of Wisconsin-Madison Department of Computer Sciences, 2005. Learning Globally Smooth Functions on Manifolds A. Dual Problem Formulation In Section 3 we introduce the optimization program in (7) as the dual of (3). Strictly speaking (7) is equivalent to the actual dual problem of (3). This follows from a ready reformulation of the dual problem as we show in the following proposition. Proposition 4 The optimization program in (7) is equivalent to the Lagrangian dual of (3). Proof: The result is true because the dual problem is linear in ρ. To see this, recall that µ is the dual variable associated with the statistical constraint Ep(x,y)[ℓ(fθ(x), y)] ϵ and that λ(z) is the Lagrange multiplier distribution associated with the set of pointwise constraints Mfθ(z) 2 ρ. The Lagrangian L(ρ, θ, µ, λ) of (3) is therefore written as L(θ, ρ, µ, λ) =ρ + µ E[ℓ(fθ(x), y)] ϵ + Z M λ(z) fθ(z) 2 ρ p(z)d V (z) (21) Reorder terms in (21) to group the two summands that involve ρ to write L(θ, ρ, µ, λ) =ρ 1 Z λ(z)p(z)d V (z)dz + µ Ep(x,y)[ℓ(fθ(x), y)] ϵ + Z λ(z) Mfθ(z) 2p(z)d V (z). (22) An important observation to make is that the Lagrangian decomposes in a term that involves only ρ and a term that involves only fθ. Define then L1(ρ, λ) := ρ 1 Z λ(z)p(z)d V (z)dz , L2(θ, µ, λ) := µ Ep(x,y)[ℓ(fθ(x), y)] ϵ + Z λ(z) Mfθ(z) 2p(z)d V (z), (23) so that we can write the Lagrangian in (22) as L(θ, ρ, µ, λ) = L1(ρ, λ) + L2(θ, µ, λ). The dual problem can now be written as the maximization over multipliers of the minimum of the Lagrangian over primal variables D = max µ,λ min θ,ρ L(ρ, θ, µ, λ) min ρ L1(ρ, µ, λ) + min θ L2(θ, µ, λ) . (24) where we utilized the decomposition of the Lagrangian to write the second equality. The important observation to make is that the minimization over ρ of L1(ρ, µ, λ) has an elementary solution. Indeed, as per its definition we have min ρ L1(ρ, µ, λ) = min ρ ρ 1 Z λ(z)p(z)d V (z)dz . (25) This minimization yields when R λ(z)p(z)d V (z)dz = 1 and 0 when R λ(z)p(z)d V (z)dz = 1. Since in the dual problem we are interested in the maximum over all dual variables, we know that: (i) The maximum will be attained for a dual distribution that satisfies R λ(z)p(z)d V (z)dz = 1. (ii) When dual variables satisfy this property we know that minρ ρ(1 R λ(z)p(z)d V (z)dz) = 0. It then follows that the dual problem in (24) is equivalent to D = max µ,λ min θ L2(θ, µ, λ), subject to Z M λ(z)p(z)d V (z) = 1. (26) This is the problem in (7) given the definition of L2(θ, µ, λ) in (23) which is the same as the definition of L(θ, µ, λ) in (7). Learning Globally Smooth Functions on Manifolds Notice that the constraint R M λ(z)p(z)d V (z) = 1 implies that λ(z) is a probability distribution over the manifold M. This is an important observation for the connections we establish to manifold regularization in Section 3.2. It also implies that even though the dual problem 7 is not an unconstrained problem, the constraint is easy to enforce as an orthogonal projection on the space of probability distributions over the manifold M. This is a simple normalization. B. Empirical Dual Problem with unlabeled samples In Section 3 we introduce the empirical dual program in (9) in the case in which unlabeled samples are unavailable. 
The ability to leverage unlabeled samples is an important feature of this work and ready to incorporate in (9). Indeed, if in addition to N i.i.d. labeled samples (xn, yn) with N [1, N] drawn from p(x, y) we are also given N i.i.d. unlabeled samples xn with n [N + 1, N + N ] drawn from the input distribution p(x) we redefine (9) as ˆD = max ˆµ,ˆλ 0 min θ ˆL(θ, ˆµ, ˆλ) := ˆµ 1 n=1 ℓ fθ(xn), yn ϵ + 1 N +N n=1 ˆλ(xn) Mfθ(xn) 2, subject to 1 N + N n=1 ˆλ(xn) = 1, (27) Results in Section 3 hold with proper modifications. C. Proof of Proposition 1 The proof in this appendix is a generalization of the proof in (Chamon et al., 2023). In order to show Proposition 1, we need to introduce an auxiliary problem formulation over a functional domain of functions. In this case, we take the optimization problem (3) over the convex hull of the domain of parametric functions ϕ H, P α = min ϕ H,ρ 0 ρ, (28) subject to Ep(x,y)[ℓ(ϕ(x), y)] ϵ Mα, Mϕ(z) 2 ρ, p(z)-a.e., z M. We can now show that problem (28) with α = ν is strongly dual as follows, Lemma 1 Under the assumptions of Proposition 1, the functional smooth learning problem (28) with α = ν has zero duality gap, i.e. P ν = D ν. Proof Lemma 1: To begin with, the non-parametric space H is convex as it is the convex hull of the space of parametric functions H, therefore the domain of the optimization problem is a convex set. Note that the objective is linear. Given that the gradient of ϕ is a linear function of ϕ, and that by taking the norm we preserve convexity, the constraint Mϕ(z) 2 ρ is convex. By Assumption 1, the loss ℓis convex. Therefore, (3) is a semi-infinite convex problem. Moreover, since θ from Assumption 5 belongs to the relative interior of the feasible set, and H ˆH, it suffices to take ϕ ( ) = f( θ, ), and ρ > supz M || Mf θ(z)||2, and by Slater s condition as ϕ , ρ belongs to the interior of the feasible domain, problem (28) has zero duality gap, i.e. it is strongly dual. Now we need to define the supergradient of function d(µ, λ). Definition 3 (Supergradient and Superdifferential) We say that c R is a supergradient of d at µ if, d(µ ) d(µ) + c(µ µ) for all µ R. (29) The set of all supergradients of d at µ is called the superdifferential, and we denote it d(µ). Lemma 2 (Danskin s Theorem) Consider the function, F(x) = sup y Y f(x, y). (30) Learning Globally Smooth Functions on Manifolds where f : Rn Y R { , + }, if the following conditions are satisfied 1. Function f( , y) is convex for all y Y . 2. Function f(x, ) is upper semicontinuous for all x in a certain neighborhood of a point x0. 3. The set Y Rm is compact F(x0) = conv y ˆY (x0) xf(x0, y) where xf(x0, y) denotes the subdifferential of the function f( , y) at x0 Proof of Lemma 2: The proof can be found in (Ruszczynski, 2011)[Theorem 2.87]. Lemma 3 Let µ , λ be a solution of dual problem D (cf. (7)). Under the conditions of Proposition 1, there exists a feasible θ argminθ L(θ, µ , λ ), such that the value D is bounded by, P ν(µ νM 2 q P ν ν) D P (32) where P ν and µ ν are the optimal value, and optimal dual variable of the functional version of problem (28) with constraint α = ν. Proof: First, we need to show that P is finite, which is equivalent to showing that there is a feasible θ Θ for problem (3), which is verified by Assumption 4. Now, we need to verify that there exists a Lagrangian minimizer θ Θ (µ , λ ) that is feasible. 
To do so, We begin by considering the set of Lagrange minimizers Θ (µ , λ ) as follows, Θ (µ , λ ) = argmin θ L(θ, µ , λ ) (33) we can also define the constraint slack associated with parameters θ as follows, c(θ) = [E[ℓ(fθ(x), y)] ϵ]+. (34) Therefore, to show that there exists a feasible solution, it is analogous to show that there is an element θ of Θ (ˇµ , λ ) whose slack is equal to zero, i.e. c(θ ) = 0. To show that this element exists, by contradiction, we can say that if no element of Θ(µ , λ ) is feasible, then E[ℓ(fθ(x), y)] ϵ > 0, for all θ Θ(µ , λ ). From Lemma 2, we get that 0 / Θ (µ , λ ), which contradicts the optimality of µ , λ . Given that the dual variable µ is finite, considering the existence of a feasible solution by Assumption 2. Hence, there must be at least one element of Θ(µ , λ ) that is feasible. Now we need to show that the inequality (32) holds. The upper bound is trivially verified by weak duality (Boyd et al., 2004), i.e., Now note that problem (28) with α = 0 is strongly dual by Lemma 1, i.e., P ν = min ϕ maximize µ,λ 0 Lν(ϕ, µ, λ) = maximize µ,λ 0 min ϕ Lν(ϕ, µ, λ) = D ν (36) with the Lagrangian defined as, Lν(ϕ, µ, λ) = µ E[ℓ ϕ(x), y ] (ϵ Mν) + Z M λ(x) Mϕ(x) 2p(x)d V (x)), for all λ that satisfy Z M λ(x)p(x)d V (x) = 1, (37) Learning Globally Smooth Functions on Manifolds Coming back to the parametric dual problem (7), by optimality we know that, D min θ Θ L(θ, µ, λ), for all µ, λ. (38) We thus utilize the optimal dual variables of the functional problem with constraints ϵ Mν, i.e., µ ν, λ ν, as follows, D min θ Θ L(θ, µ ν, λ ν) (39) min ϕ H L(ϕ, µ ν, λ ν) (40) min ϕ H Lν(ϕ, µ ν, λ ν) µ νMν (41) = P ν µ νMν. (42) Equation (39) holds given that H H, and problem P ν is strongly dual by Lemma 1. To complete the proof we need to show that, P ν ν2. (43) Denoting ϕ ν a solution to problem P ν , by Assumption 2, we know there is a parameterization θ ν such that simultaneously, sup z M |ϕ ν(z) f θ ν(z)| ν (44) sup z M Mϕ ν(z) Mf θ ν(z) ν (45) therefore, E[ℓ(ϕ ν(x), y)] E[ℓ(f θ ν(x), y)] E h |ℓ(ϕ ν(x), y)] ℓ(f θ ν(x), y)| i (46) ME h |ϕ ν(x) f θ ν(x)| i Mν (47) Since ϕ ν is feasible for Pν, this implies that f θ ν is feasible for Problem 3. We can now define the ρ θ ν as the minimum ρ obtained with f θ ν as follows, P ν = min ρ 0 ρ, (48) subject to Ep(x,y)[ℓ f θ ν(x), y ] ϵ, Mf θ ν(z) 2 ρ, p(z)-a.e., z M. Returning to (39), and by optimality, P P ν , D P ν µ νMν (49) P ν + P P ν µ νMν (50) Now we need to bound the difference P ν P ν . By compactness of M, we know that there exist z1, z2 M such that Mϕ ν(z1) 2 = P ν and Mf θ ν(z2) 2 = P ν , by optimality, P ν P ν = Mϕ ν(z1) 2 Mf θ ν(z2) 2 (51) Mϕ ν(z2) 2 Mf θ ν(z2) 2 (52) Now we can add a subtract Mϕ ν(z2) to the second term, and use the triangle inequality twice to obtain Mϕ ν(z2) 2 Mf θ ν(z2) 2 Mϕ ν(z2) 2 ( Mf θ ν(z2) Mϕ ν(z2) + Mϕ ν(z2) ) ( Mf θ ν(z2) Mϕ ν(z2) + Mϕ ν(z2) ) (53) = Mf θ ν(z2) Mϕ ν(z2) 2 2 Mf θ ν(z2) Mϕ ν(z2) Mϕ ν(z2) (54) Learning Globally Smooth Functions on Manifolds Now, by noting that Mf θ ν(z2) Mϕ ν(z2) 2 ν2 by construction of θ ν, and that Mϕ ν(z2) q P ν by optimality, we end up obtaining, P ν P ν 2ν q P ν ν2 (55) Putting (50) and (55) together, we attain the desired result. Proof of Proposition 1: This proof follows the lines of (Chamon et al., 2023)[Proposition III.4]. 
We begin by considering µ , λ , and ˆµ , ˆλ , solutions of (7), and (9), and we define the set of optimal dual minimizers as, Θ(µ , λ ) = argmin θ Θ L(θ, µ , λ ) (56) ˆΘ(ˆµ , ˆλ ) = argmin θ Θ ˆL(θ, ˆµ , ˆλ ) (57) where L, and ˆL are as defined in 7, and 9. We can proceed to bound the difference between the values of the dual problems as follows, D ˆD = min θ Θ L(θ, µ , λ ) min θ Θ ˆL(θ, ˆµ , ˆλ ) (58) min θ Θ L(θ, µ , λ ) min θ Θ ˆL(θ, µ , λ ) (by optimality) (59) L(ˆθ , µ , λ ) ˆL(ˆθ , µ , λ ) (60) where [ λ (z)]n = λ (zn)/ PN n=1 λ (zn), and we define ˆθ ˆΘ(µ , λ ). By utilizing the definition of the Lagrangian, we obtain, D ˆD |µ | E[ℓ(fˆθ (x), y)] 1 i=1 ℓ(fˆθ (xn), yn) (61) + Ez λ [ Mfˆθ (z) 2] 1 n=1 λ (zn) Mfˆθ (zn) 2 (62) Note that the sampled dual variables converge to the continuous values as follows, lim N λ (z) 1 N PN n=1 λ (zn) = λ (z) (63) Given that lim N 1 N PN n=1 λ (zn) = 1. Utilizing the same argument on the other direction of the inequality it yields, D ˆD L(ˆθ , ˆµ , ˆλ ) ˆL(ˆθ , ˆµ , ˆλ ) (64) |ˆµ | E[ℓ(fˆθ (x), y)] 1 i=1 ℓ(fˆθ (xn), yn) (65) + Ez λ [ Mfˆθ (z) 2] 1 n=1 ˆλ (zn) Mfˆθ (zn) 2 (66) Where λ = 1 N PN n=1 ˆλ (xn)Bα(xn), and Bα(xn) is a ball of center xn, and radius α. By tending α 0, and N , given that λ is a simple function that approximates ˆλ , by the Monotone Convergence Theorem (Durrett, 2019), λ converges to ˆλ . Therefore, it holds, D ˆD max |L(ˆθ , µ , λ ) ˆL(ˆθ , µ , λ )|, |L(ˆθ , ˆµ , ˆλ ) ˆL(ˆθ , ˆµ , ˆλ )| (67) Learning Globally Smooth Functions on Manifolds Now we utilize the property of the gradients of Mfθ given by Assumption 2, and the fact that the set Θ is compact, and the manifold M is compact, Mfθ1(x) 2 Mfθ2(x) 2 (68) = Mfθ1(x) 2 Mfθ2(x) + Mfθ1(x) Mfθ1(x) 2 (69) = Mfθ1(x) 2 Mfθ2(x) Mfθ1(x) 2 || Mfθ1(x) 2 + 2 Mfθ2(x) Mfθ1(x) || Mfθ1(x) = Mfθ2(x) Mfθ1(x) 2 + 2 Mfθ2(x) Mfθ1(x) || Mfθ1(x) (71) We can now bound Mfθ2(x) Mfθ1(x) , and 2 Mfθ1(x) by 2 maxθ {θ1,θ2},z M Mfθ(z) , and therefore, we obtain, Mfθ1(x) 2 Mfθ2(x) 2 (72) 2 max θ {θ1,θ2},z M Mfθ(z) Mfθ1(x) Mfθ2(x) (73) We can now use Assumption 3 to obtain, Mfθ1(x) 2 Mfθ2(x) 2 (74) 2 max θ {θ1,θ2},z M Mfθ(z) G|fθ1(x) fθ2(x)| (by compactness and Assumption 3.) (75) To complete the proof, we leverage Talagrand s lemma (Mohri et al., 2018)[Lemma 5.7], and with probability 1 2δ, D ˆD max |µ |, |ˆµ | ζ(N, δ) + 2 max P Gˆζ(N, δ) (76) By leveraging Lemma 3 we attain the desired result. Learning Globally Smooth Functions on Manifolds D. Proof Of Proposition 2 Proof of Proposition 2: To prove that ˆθ is approximately feasible, we use the same argument that in Lemma 3. By contradiction, if there exists no ˆθ ˆΘ(ˆµ , ˆλ ) that is feasible, the supergradient, n=1 ℓ(fˆθ (xn), yn) > ϵ (77) and therefore 0 / ˆΘ(ˆθ , ˆλ ) which contradicts the optimality of ˆθ , and ˆλ . Therefore, there exists ˆθ ˆΘ(ˆµ , ˆλ ) such that, E[ℓ(fˆθ (x), y)] ϵ + ζ(N, δ) (78) with probability 1 δ. We can analyze the term given by the summation of the dual variables λ, noting that if there is no solution ˆλ , such that n=1 λ(xn) Mfθ(xn) 2 = max n [N] fθ(xn) 2 (79) then, utilizing the same argument, 0 / ˆΘ(ˆθ , ˆλ ), contradicting the optimality of ˆθ . 
Now, we can evaluate ˆD at ˆθ as follows, n=1 ℓ(fˆθ (xn), yn) ϵ + 1 n=1 ˆλ (xn) Mfˆθ (xn) 2 (80) n=1 ℓ(fˆθ (xn), yn) ϵ + max n [N] Mfˆθ (xn) 2 (by optimality of ˆλ ) (81) To conclude the proof, we leverage Proposition 1, in conjunction with (81) and we get that, |P max n [N] Mfˆθ (xn)||2| |ˆµ | N X n=1 ℓ(fˆθ (xn), yn) ϵ + O(ν) + max |µ |, |ˆµ | ζ(N, δ) + 2 max P Gˆζ(N, δ) (82) Completing the proof. Corollary 1 Under the same conditions of Proposition 2, if additionally, the gradients of the parameterized functions satisfy, || Mfθ(z1) Mfθ(z2)|| Ld M(z1, z2), (83) and for any dataset sampled according to p the following holds, max z M min n [N] d M(z, xn) ζ(N). (84) max z M Mfˆθ (z) 2 P + O(ν) + (1 + )ζ(N, δ) + O(ˆζ(N, δ)) (85) ˆP ζ(N) + L2 ζ2(N) + ˆµ 1 N n=1 ℓ(fˆθ (xn), yn) ϵ and E[ℓ(fˆθ (x), y)] ϵ + ζ(N, δ). (86) ˆP = maxn [N] || Mfˆθ (xn)||. Learning Globally Smooth Functions on Manifolds Proof of Corollary 1: Building from Proposition 2, we need to relate the samples xn to the whole manifold M. To do so, we start from the tautological equality that holds for all z M, Mfˆθ (z) = Mfˆθ (z) + Mfˆθ (xn) Mfˆθ (xn) (87) || Mfˆθ (xn) + Mfˆθ (z) Mfˆθ (xn)|| (88) || Mfˆθ (xn) + Ld M(z, xn) (89) || Mfˆθ ( xn) + L min n [N] d M(z, xn) (90) max x [N] || Mfˆθ (xn) + L min n [N] d M(z, xn) (91) max x [N] || Mfˆθ (xn) + L max z M min xn d M(z, xn) (92) max x [N] || Mfˆθ (xn) + Lˆζ(N) (93) where equation (88) holds by triangle inequality, and equation (89) by assumption 83. Given that equation (89) holds for all xn, it also holds for xn = arg minn [N] d M(z, xn) i.e. equation (90). Equation (91) holds given that || Mfˆθ (xn) maxx [N] || Mfˆθ (xn) , and equation (92) holds by taking the maximum. Finally, equation (93) holds by assumption (84). We can now take squares as follows, Mfˆθ (z) 2 max x [N] || Mfˆθ (xn) + Lˆζ(N) 2 (94) max n [N] || Mfˆθ (xn) 2 + L2ˆζ(N)2 + 2L p ˆP ζ(N). (95) were we used that the norm is non-negative, and the definition of p ˆP . By substituting equation (95) in equation (82), we complete the proof. Learning Globally Smooth Functions on Manifolds E. Proof of Proposition 3 The proof of Proposition 3 utilizes the following Definitions. In this proof, we use the fact that the total volume of the smooth Manifold M is bounded given its compactness ,i.e., R M vol(x) = V. We also use the fact that the functions fθ(z) are bounded up to the third derivative, and we denote F, F1, F2, and F3 their bounds respectively. We also require to define the degree probability dt(z) and the continuous heat kearnel Laplacian Lt λfθ(z). Definition 4 Given a probability distribution λ Λ defined over the manifold M, we define the degree dt(x) as, M Gt(p, z)λ(z)vol(z) (96) Definition 5 We define the continuous heat kernel laplacian Lt λfθ(z) associated with density function λ Λ, function fθ, θ Θ and point z M as, Lt λfθ(z) = 1 dt(w) (fθ(z) fθ(w))λ(w)vol(w) (97) were dt is the degree function (cf. Definition 4). Lemma 4 For a given point in the interior of the manifold, i.e., z int(M), and any open set B M, z B, in the same conditions as in Proposition 3, we have that 4t λ(w)fθ(w)vol(w) Z 4t λ(w)fθ(w)vol(w) b VFe r2 where r = infx/ B,x M z w , and b in Λ. Proof of Lemma 4: The proof is similar to Lemma 4.1 in (Belkin & Niyogi, 2005). 
The proof follows from bounding λ by b from assumption, fθ by F, and the volume of M B by the volume of M given by V, and bounding e p z 2 Lemma 5 The limit of the degree function dt(x) (defined in Definition 4) when t 0 is lower bounded for probabilities λ Λ as in Proposition 3,i.e., min p M,λ Λ lim t 0 dt(x) = a. (99) Proof of Lemma 5: Now we can write down the definition of the degree function dt(x), min x M,λ Λ lim t 0 dt(x) = min x M,λ Λ lim t 0 M Gt(x, z)λ(z)d V (z) (100) = min x M,λ Λ lim t 0 1 (4πt)d/2 e x z 2 4t λ(z)d V (z) (101) min x M lim t 0 a Z 1 (4πt)d/2 e x z 2 4t d V (z) (102) min x M lim t 0 a Z 1 (4πt)d/2 e (x z) ( 1 2 d V (z) (103) = min x M lim t 0 a Z 1 (4πt)d/2 e (x z) ( 1 2 d V (z) (104) = min x M lim t 0 a Z 1 (4πt)d/2 e (v) ( 1 2 d V (z) (105) = a (4πt)d/2 ((2π)d(2t)d) 1 2 = a (106) Learning Globally Smooth Functions on Manifolds Now, we now that minz M,λ Λ λ(z) = a, which means that, min z M,λ Λ lim t 0 λ(z) a (107) Noting that the integral of limt 0 R M Gt(x, z)d V (z) = 1, we have, min z M,λ Λ lim t 0 λ(z) Z M Gt(x, z)d V (z) a (108) min z M,λ Λ lim t 0 M λ(z)Gt(x, z)d V (z) a (109) min z M,λ Λ lim t 0 dt(z) a (110) By symmetry of Gt, we recover the desired result. Noting that both a dt, and dt a, then the limit holds and is equal to a. Lemma 6 For any two sufficiently close points p, q M, such that q = expp(v) = Rk, the relationship between the Euclidean distance and geodesic distance is given by, d2 M(p, q) = v 2 Rk (111) Proof of Lemma 6: Consider a geodesic curve γ, that goes from γ(0) = p to γ(1) = q, this geodesic can be obtained by γ(t) = expp(tv). The distance between p, q can be expressed as, d M(p, q) = Z 1 0 γ (t) dt = v . (112) given that the geodesic has constant derivative, and it is given by v by the definition of exponential map (see (Do Carmo & Flaherty Francis, 1992)[Proposition 3.6]). Lemma 7 Let Λ be the set of probability distributions defined on a compact d-dimensional differentiable manifold M isometrically embedded in RD such that Λ = {λ : 0 < a λ(z) b < , | λ x| c < and | 2λ x2 | d < for all z M}, and let fθ, with θ Θ, be a family of functions with uniformly bounded derivatives up to order 3 vanishing at the boundary M. For a point z M, the probability of the difference between the point-cloud Laplacian operator Lt λ,Nfθ(z) defined in definition (2) and the continuous heat kernel Laplacian is given by sup θ Θ,λ Λ | Lt λfθ(z) Lλ,Nfθ(z)| ϵ where N is such that 2e ϵ2N Fb )2 + 2e ϵ2N 32 ta2 2FGb2 2 + 2(N 1)e ϵ2(N 1) 32 ta2 2FGb2 2 δ. Proof of Lemma 7: The proof is similar to section 5.1 of Theorem 5.2 from (Belkin & Niyogi, 2005). To begin with, we define a intermediate operator ˆLλ,nfθ(z) as, ˆLλ,Nfθ(z) = 1 Gt(z, xi) p dt(xi) (fθ(z) fθ(xi)) (114) Learning Globally Smooth Functions on Manifolds Now we consider equation (113), adding and subtracting the intermediate operator ˆLλ,nfθ(z) as follows, sup θ Θ,λ Λ | Lt λfθ(z) Lλ,Nfθ(z)| ϵ sup θ Θ,λ Λ | Lt λfθ(z) ˆLλ,nfθ(z) + ˆLλ,nfθ(z) Lλ,Nfθ(z)| ϵ sup θ Θ,λ Λ | Lt λfθ(z) ˆLλ,nfθ(z) + ˆLλ,nfθ(z) Lλ,Nfθ(z)| ϵ sup θ Θ,λ Λ | Lt λfθ(z) ˆLλ,nfθ(z)| ϵ/2 sup θ Θ,z M,λ Λ |ˆLλ,nfθ(z) Lλ,Nfθ(z)| ϵ/2 sup θ Θ,λ Λ | Lt λfθ(z) ˆLλ,nfθ(z)| > ϵ/2 sup θ Θ,z M,λ Λ |ˆLλ,nfθ(z) Lλ,Nfθ(z)| > ϵ/2 sup θ Θ,λ Λ | Lt λfθ(z) ˆLλ,nfθ(z)| > ϵ/2 sup θ Θ,λ Λ |ˆLλ,nfθ(z) Lλ,Nfθ(z)| > ϵ/2 where equation (117) holds by taking the complement, equation (118) holds given that it is a subset of the total probability, equation (119) by taking complement again, and equation (120) by the union bound. 
We will now focus on the complement of the first term of (120), i.e. P(supθ Θ,λ Λ | Lt λfθ(z) ˆLλ,Nfθ(z)| ϵ/2), as follows, sup θ Θ,λ Λ | Lt λfθ(z) ˆLλ,nfθ(z)| ϵ/2 = P sup θ Θ,λ Λ t Gt(z, w) p dt(w) (fθ(z) fθ(w))λ(w)d V (w) Gt(z, xi) p dt(xi) (fθ(z) fθ(xi)) ϵ/2 . Now we can use the fact that sampling according to λ(z)d V (z) is equivalent to sampling xi according to d V (z) and then multiplying by λ(xi), as follows, sup θ Θ,λ Λ | Lt λfθ(z) ˆLλ,nfθ(z)| ϵ/2 = P sup θ Θ,λ Λ t Gt(z, w) p dt(w) (fθ(z) fθ(w))λ(w)d V (w) Gt(z, xi) p dt( xi) (fθ(z) fθ( xi))λ( xi) ϵ/2 . (123) We can now bound |fθ(z)| F, a < λ(z) < b, a < dt(z) < b with , Gt 1/(4πt)D/2 := G, to express (123) as, sup θ Θ,λ Λ | Lt λfθ(z) ˆLλ,nfθ(z)| ϵ/2 M Gt(z, w)d V (w) 1 i=1 Gt(z, xi) ϵ/2 Now, using the Hoeffding s inequality we obtain, sup θ Θ,λ Λ | Lt λfθ(z) ˆLλ,nfθ(z)| ϵ/2 32 ( at FGb )2 (125) Learning Globally Smooth Functions on Manifolds We will now focus on the second term of (120), i.e. P supθ Θ,λ Λ |ˆLt λ,Nfθ(z) Lλ,Nfθ(z)| > ϵ/2 , as follows, sup θ Θ,λ Λ |ˆLt λ,Nfθ(z) Lλ,Nfθ(z)| ϵ/2 sup θ Θ,λ Λ Gt(z, xi) p dt(xi) (fθ(z) fθ(xi)) 1 Gt(z, xi) p ˆW(xi) (fθ(z) fθ(xi)) ϵ/2 Now, by bounding over fθ, and λ we obtain, sup θ Θ,λ Λ |ˆLt λ,Nfθ(z) Lλ,Nfθ(z)| ϵ/2 dt(xi)) + q Now bounding over d, ˆw, and ˆW we obtain, sup θ Θ,λ Λ |ˆLt λ,Nfθ(z) Lλ,Nfθ(z)| ϵ/2 dt(xi)) + ( p Now, we can split the probability as follows, sup θ Θ,λ Λ |ˆLt λ,Nfθ(z) Lλ,Nfθ(z)| ϵ/2 Now, we can take a closer look at d(z), w(z), which we can express as, M Gt(z, w)λ(w)vol(w) n=1 Gt(z, xn) ϵ/4 We can once again use the fact that sampling according to λ(z)d V (z) is equivalent to sampling xi according to d V (z) and Learning Globally Smooth Functions on Manifolds then multiplying by λ(xi) and then bound λ(z) to obtain, M Gt(z, w)λ(w)d V (w) n=1 Gt(z, xn)λ( xn) ϵ/4 M Gt(z, w)d V (w) n=1 Gt(z, xn) ϵ/4 Now, given that Gt is bounded by G, and given that R M Gt(z, w)d V (w) = 1 > 0, we can use the Hoeffding s inequality as follows, M Gt(z, w)d V (w) 1 n=1 Gt(z, xn) ϵ/4 Returning to (132), we can repeat the steps, and by the union bound we get, M Gt(xi, w)d V (w) 1 N 1 n =i Gt(xi, xn) ϵ/4 2(N 1)e ϵ2(N 1) Finally, we obtain that with probability, sup θ Θ,λ Λ | Lt λfθ(z) Lλ,Nfθ(z)| ϵ 32 ( at FGb )2 + 2e ϵ2N b2 2 + 2(N 1)e ϵ2(N 1) It suffices to pick N0 such that the right term is small than δ. Lemma 8 Consider fθ with θ Θ, and λ Λ, for a point z M, there is a uniform bound between the continuous heat kernel Laplacian (cf .definition 5) and the the Laplace-Beltrami operator λfθ = fθ + Mfθ, Mλ , where is the Laplace-Beltrami operator, i.e. sup θ Θ,λ Λ | Lt λfθ(z) λfθ(z)| O(t1/2) (141) with Lt λfθ(z) = 1 dt(w)(fθ(z) fθ(w))λ(w)d V (w) . Proof of Lemma 8: The proof is as follows, first we will compare the heat kernel Laplacian Lt λfθ(p) with the heat kernel Laplacian with λ instead of dt. Next, we will consider the integral over a ball B, as opposed to the integral over the whole manifold M. Then, we will exploit the euclidean properties of the manifold at point p, which will allow us to convert an integral over a manifold into an integral over the low dimensional structure of the manifold. In this low dimensional manifold, we can compute the integrals, and obtain the desired result. 
Learning Globally Smooth Functions on Manifolds To begin with, we will introduce the continuous heat kernel laplacian, but using the probability distribution λ as follows, Lt λfθ(p) = 1 dt(z) (fθ(p) fθ(z))λ(z)d V (z) (142) λ(z) (fθ(p) fθ(z))λ(z)d V (z) (143) dt(z) (fθ(p) fθ(z))λ(z)d V (z) (144) λ(z) (fθ(p) fθ(z))λ(z)d V (z)| (145) λ(z) (fθ(p) fθ(z))λ(z)d V (z) (146) M Gt(p, z)| 1 p λ(z) ||fθ(p) fθ(z)|λ(z)d V (z) (147) By compactness of manifold M, and the fact that limt 0 dt(p) = λ(p) the following holds for any point p M, dt(p) = λ(p) + O(tg(p)) (148) where g is a smooth function depending upon higher derivatives of λ, which is bounded by c. Therefore, we can write d 1/2 t (z) = (λ(z) + O(tg(z))) 1/2 = λ(z) 1/2 + O(t), (149) where O(t) c|t|. Therefore, we can bound the right hand side of equation (147) by, M Gt(p, z)| 1 p λ(z) ||fθ(p) fθ(z)|λ(z)d V (z) (150) M Gt(p, z)ct|fθ(p) fθ(z)|λ(z)d V (z) (151) M Gt(p, z)c|fθ(p) fθ(z)|λ(z)d V (z) O(t) (152) where the bound holds uniformly given the compactness of the manifold M, and the upper bound on the derivative of fθ. We now return to equation (147), by virtue of Lemma 4, we can convert the integral over the manifold M, to an integral over the ball B as follows, Learning Globally Smooth Functions on Manifolds λ(y) (fθ(x) fθ(y))λ(y)d V (y) (153) λ(y) (fθ(x) fθ(y))λ(y)d V (y) (154) λ(y) (fθ(x) fθ(y))λ(y)d V (y) (155) λ(y) (fθ(x) fθ(y))λ(y)d V (y) (156) λ(y) (fθ(x) fθ(y))λ(y)d V (y) (157) λ(y) (fθ(x) fθ(y))λ(y)d V (y) (158) λ(y) (fθ(x) fθ(y))λ(y)d V (y) (159) λ(y) (fθ(x) fθ(y))λ(y)d V (y) + b VFe r2 We now consider the exponential coordinates exp around p, after introducing a ball B of radius r. Letting the change of variables be expp : Tp M M. Letting B be a ball in Tp M, then B is the image of the ball under the exponential map. We will use the exponential coordinates as v = expp(z), and fθ(v) = fθ(expp(v)), and fθ(0) = fθ(expp(0)). λ(y) (fθ(x) fθ(y))λ(y)d V (y) = (161) expp(v) expp(0) 2 λ(v)( fθ(0) fθ(v))det(g(v))dv (162) where det(g(v)) is the determinant of the metric tensor in exponential coordinates. We can now expand p λ(v), and fθ(v) by their Taylor expansions as follows, λ1/2(v) λ1/2(0) + 1 2 λ 1/2(0)v λ(0) + O( v 2) (163) fθ(v) = fθ(0) + v fθ(0) + 1 2v Hv + O( v 3) (164) fθ(0) fθ(v) v fθ(0) 1 2v Hv + O( v 3) (165) Where f = ( f x1 , . . . , f xk ) , and H its corresponding hessian matrix. Now note that given that the norm of the second derivative λ and third derivative of fθ are uniformly bounded for all p M, θ Θ and λ Λ. With these two approximations in hand, we also approximate the heat kernel e expp(v) expp(0) 2 4t as follows, e d M(expp(v) p)2 given that y is close enough to p. We also approximate the determinant of the metric tensor in exponential coordinates det(g(v)) using the fact that det(g(v)) = 1 1 6z Rv + O( v 3) (167) 1 + k0 v 2. (168) Learning Globally Smooth Functions on Manifolds where R is the Ricci curvature tensor which is uniformly bounded given the compactness of M. The last inequality holds uniformly for all p M given the compactness of M. 
Returning to 162, putting all the pieces together we obtain, λ(y) (fθ(x) fθ(y))λ(y)d V (y) (169) 4t ( λ1/2(0) + 1 2 λ 1/2(0)v λ(v) + k1 v 2) (170) (v fθ(0) + 1 2vt Hv + k2 v 3)dv (171) 4t ( f(0) f(v))k0 v 2dv (172) Which can be split into 5 integrals as follows: 4t λ1/2(0)v fθ(0)dv (173) 4t λ1/2(0)1 2vt Hvdv (174) 2 λ 1/2(0)v λ(0) fθ(0) vdv (175) 2 λ 1/2(0)v λ(0)1 2v Hvdv (176) 4t ( f(0) f(v))k0 v 2d Rk (177) 4t k0 v 3d Rk (178) Where for Et we have used the fact that the first derivative has a uniform bound. Note that given the Gaussian kernel integration, the following conditions are true, Z 4t dv = 0 (179) Z B vivje v 2 4t dv = 0 (180) Z B viv2 j e v 2 4t dv = 0 (181) t 1 (4πt)k/2 4t dv = O(t1/2) (182) t 1 (4πt)k/2 B v2 i e v 2 4t dv = 2 (183) for i = j given the zero mean Gaussian distribution with zero mean and diagonal covariance matrix. Using these conditions, we can conclude that At = Dt = 0. From Bt and Ct, the only non-zero elements are given by the diagonal elements, therefore, we obtain, At + Bt + Ct + Dt + Et = x2 i 1 λ(0) i=1 [ λ(0)]i[ fθ(0)]i + O(t1/2) (184) λ λ, fθ Tp + O(t1/2) (185) Learning Globally Smooth Functions on Manifolds Where we have used that Mf(p) = Rk f(0) = Pk i=1 2 f x2 i (0) (see Chapter 3 of (Rosenberg, 1997)). Proof of Proposition 3: To begin with, utilizing Green s identity, for the Laplacian λ = fθ + 1 λ λ, fθ , given that the function vanishes at the boundary, the following holds Z M fθ(z) λfθ(z)λ(z)d V (z) = Z M Mfθ(z) 2λ(z)d V (z) (186) An equivalent statement of the proof is that for any ϵ > 0, and δ > 0, there exist N0 such that, for N N0 P sup λ Λ,θ Θ i=1 fθ(xi)λ(xi)Lt N λ,Nfθ(xi) Z M fθ(z)( λfθ(z))λ(x)d V (z) > ϵ δ (187) By considering the complement of the event in (187), we obtain, P sup λ Λ,θ Θ i=1 fθ(xi)λ(xi)Lt N λ,Nfθ(xi) Z M fθ(z)( λfθ(z))λ(x)d V (z) ϵ 1 δ (188) We can now add and subtract 1 N PN i=1 fθ(xi)λ(xi) λfθ(xi), and split the probability as follows, P sup λ Λ,θ Θ i=1 fθ(xi)λ(xi)Lt N λ,Nfθ(xi) Z M fθ(z)( λfθ(z))λ(x)d V (z) ϵ (189) P sup λ Λ,θ Θ i=1 fθ(xi)λ(xi)Lt N λ,Nfθ(xi) 1 i=1 fθ(xi)λ(xi)( λfθ(xi)) ϵ/2 + P sup λ Λ,θ Θ i=1 fθ(xi)λ(xi)( λfθ(xi)) Z M fθ(z)( λfθ(z))λ(x)d V (z) ϵ/2 (190) Where equation (190) holds given that it is a subset of the event in (189). We can now focus on the second term of (190). Given that the second derivatives of fθ are uniformly bounded for θ Θ, we can bound | fθ| F2, to obtain, P sup λ Λ,θ Θ i=1 fθ(xi)λ(xi)( λfθ(xi)) Z M fθ(z)( λfθ(z))λ(x)d V (z) ϵ/2 P sup λ Λ,θ Θ i=1 fθ(xi)λ(xi) Z M fθ(z)λ(x)d V (z) ϵ 2K Now, for equation (191), we can use Assumption 4 with probability δ/2 as follows, ˆζ(N, δ/2) ϵ 2K . (192) Returning to (190), we can also bound the value of |fθ(z)| F, and a λ b to obtain, P sup λ Λ,θ Θ i=1 fθ(xi)λ(xi)Lt N λ,Nfθ(xi) 1 i=1 fθ(xi)λ(xi)( λfθ(xi)) ϵ/2 (193) P sup λ Λ,θ Θ i=1 Lt N λ,Nfθ(xi) 1 i=1 ( λfθ(xi)) ϵ 2b F P sup λ Λ,θ Θ Lt N λ,Nfθ(xi) ( λfθ(xi)) ϵ 2b F i=1 sup λ Λ,θ Θ Lt N λ,Nfθ(xi) ( λfθ(xi)) ϵ 2b F i=1 P sup λ Λ,θ Θ Lt N λ,Nfθ(xi) ( λfθ(xi)) ϵ 2b F Learning Globally Smooth Functions on Manifolds Where equation (195) holds by the triangle inequality, equation (196) holds by taking the intersection of the events, and equation (197) holds by the union bound. Now we add and subtract the continuous Lt λfθ(xi) (cf. 
definition 5), which yields
\mathbb{P}\Big[\sup_{\lambda\in\Lambda,\,\theta\in\Theta}\big|L^{t_N}_{\lambda,N}f_\theta(x_i)-\Delta_\lambda f_\theta(x_i)\big|\geq \tfrac{\epsilon}{2bF}\Big]
= \mathbb{P}\Big[\sup_{\lambda\in\Lambda,\,\theta\in\Theta}\big|L^{t_N}_{\lambda,N}f_\theta(x_i)-L^{t}_{\lambda}f_\theta(x_i)+L^{t}_{\lambda}f_\theta(x_i)-\Delta_\lambda f_\theta(x_i)\big|\geq \tfrac{\epsilon}{2bF}\Big]
\leq \mathbb{P}\Big[\sup_{\lambda\in\Lambda,\,\theta\in\Theta}\big|L^{t_N}_{\lambda,N}f_\theta(x_i)-L^{t}_{\lambda}f_\theta(x_i)\big|\geq \tfrac{\epsilon}{4bF}\Big] (200)
+ \mathbb{P}\Big[\sup_{\lambda\in\Lambda,\,\theta\in\Theta}\big|L^{t}_{\lambda}f_\theta(x_i)-\Delta_\lambda f_\theta(x_i)\big|\geq \tfrac{\epsilon}{4bF}\Big]. (201)
Term (200) can be bounded using Lemma 7 with confidence \delta/4 and accuracy \epsilon'=\epsilon/(4bF), which requires N_0' to satisfy
2e^{-\frac{\epsilon'^2 N_0'}{32}\left(\frac{at}{FGb}\right)^2}+2e^{-\frac{\epsilon'^2 N_0'}{32}\left(\frac{ta^{2}}{2FGb^{2}}\right)^{2}}+2(N_0'-1)e^{-\frac{\epsilon'^2 (N_0'-1)}{32}\left(\frac{ta^{2}}{2FGb^{2}}\right)^{2}}\leq\frac{\delta}{4N}. (202)
Term (201) can be bounded utilizing Lemma 8, and therefore t and N need to verify
t^{1/2}\leq\frac{\delta}{4N}\quad\text{and}\quad N^{-\frac{1}{d+2+\alpha}}=t. (203)
To conclude, it suffices to pick N such that condition (192) is verified, and such that conditions (202) and (203) are met. The total number of samples is then N_0 + N_0'.

F. Dual Ascent Algorithm

In this section we describe the two algorithms that we propose to solve problem (7): the gradient-based primal-dual method (Algorithm 1) in Subsection F.1, and the point-cloud Laplacian primal-dual method (Algorithm 2) in Subsection F.2. The Laplacian variant presents a computational advantage, but requires conditions on the dual variables λ that might be difficult to secure in practice. The gradient-based method, on the other hand, is less restrictive at the expense of a higher computational cost. It is important to note that the sole difference between the two lies in the gradient of the Lagrangian with respect to θ; the updates on the dual variables λ and µ remain the same. In this appendix we provide a more detailed explanation of the two procedures, and we elaborate on the estimator of the norm of the gradient in Subsection F.3.

F.1. Gradient-Based Primal-Dual Ascent

In this subsection we elaborate on the gradient-based method to update the primal variable θ. Having shown that under certain conditions problems (3) and (9) are close (see Propositions 1 and 2), we introduce a dual ascent algorithm to solve the latter. The problem that we seek to solve is given by
\max_{\hat\lambda,\hat\mu\geq 0}\ \hat d_G(\hat\lambda,\hat\mu):=\min_\theta\ \hat\mu\Big[\frac{1}{N}\sum_{n=1}^{N}\ell\big(f_\theta(x_n),y_n\big)-\epsilon\Big]+\frac{1}{N}\sum_{n=1}^{N}\hat\lambda(x_n)\,\|\nabla_{\mathcal M}f_\theta(x_n)\|^2, (204)
\text{subject to}\quad \frac{1}{N}\sum_{n=1}^{N}\hat\lambda(x_n)=1. (205)
We proceed with an iterative process that minimizes over θ and maximizes over λ and µ. In short, for each value of µ, λ, we seek to minimize the Lagrangian ˆL(θ, ˆµ, ˆλ) by taking gradient steps as follows,
\theta_{k+1}=\theta_k-\eta_\theta\nabla_\theta\hat L(\theta_k,\hat\mu,\hat\lambda) (206)
=\theta_k-\eta_\theta\nabla_\theta\Big[\hat\mu\frac{1}{N}\sum_{n=1}^{N}\ell\big(f_{\theta_k}(x_n),y_n\big)+\frac{1}{N}\sum_{n=1}^{N}\hat\lambda(x_n)\,\|\nabla_{\mathcal M}f_{\theta_k}(x_n)\|^2\Big]. (207)
The gradient-based approach thus computes gradients with respect to the loss ℓ and with respect to the norm of the manifold gradient. After the dual function ˆd_G(ˆλ, ˆµ) is minimized, in order to solve (7) we need to maximize ˆd_G(ˆλ, ˆµ) over both ˆλ and ˆµ, which can be done by evaluating the constraint violations as follows,
\hat\lambda_n\leftarrow\hat\lambda_n+\eta_\lambda\,\partial_{\hat\lambda_n}\hat L(\theta,\hat\mu,\hat\lambda),\quad\text{with}\quad \partial_{\hat\lambda_n}\hat L(\theta,\hat\mu,\hat\lambda)=\|\nabla_{\mathcal M}f_\theta(x_n)\|^2, (208)
\hat\mu\leftarrow\hat\mu+\eta_\mu\,\partial_{\hat\mu}\hat L(\theta,\hat\mu,\hat\lambda),\quad\text{with}\quad \partial_{\hat\mu}\hat L(\theta,\hat\mu,\hat\lambda)=\frac{1}{N}\sum_{n=1}^{N}\ell\big(f_\theta(x_n),y_n\big)-\epsilon, (209)
where the step sizes ηλ, ηµ are positive numbers. Note that to evaluate the norm of the gradient at a particular point, ||∇_M fθ(x_n)||, we can look at its neighboring points N(x_n) and estimate the norm from them. After updating the dual variables λ using gradient ascent, we require the normalization constraint (205) to hold, which can be enforced by either normalizing λ or projecting it onto the (scaled) simplex (Wang & Carreira-Perpiñán, 2013). How to compute the norm of the gradient is explained in Subsection F.3. The overall procedure is summarized in Algorithm 1.
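For concreteness, the following is a minimal PyTorch-style sketch of one outer iteration of the primal-dual updates (206)-(209). The helper names, the neighborhood and distance inputs, and the step sizes are illustrative assumptions rather than the released implementation; the gradient-norm estimate uses the difference quotient described in Subsection F.3.

```python
# Minimal sketch of the gradient-based primal-dual updates (206)-(209); names and
# defaults are illustrative assumptions, not the paper's released code.
import torch

def manifold_grad_norm_sq(f, X, neighbors, dists):
    """Estimate ||grad_M f(x_i)||^2 with the difference quotient of (213)."""
    out = f(X)                                            # predictions at all samples
    norms = []
    for i, (idx, d) in enumerate(zip(neighbors, dists)):
        d = torch.as_tensor(d, dtype=out.dtype)
        diff = (out[i] - out[idx]).reshape(len(idx), -1).norm(dim=1)
        norms.append(((diff / d) ** 2).max())             # max over the neighborhood N(x_i)
    return torch.stack(norms)                             # shape (N,)

def primal_dual_step(f, X, Y, loss_fn, lam, mu, eps, neighbors, dists,
                     eta_theta=1e-3, eta_lam=0.1, eta_mu=0.5, primal_steps=10):
    opt = torch.optim.SGD(f.parameters(), lr=eta_theta)
    for _ in range(primal_steps):                         # primal descent on the Lagrangian (207)
        opt.zero_grad()
        fit = loss_fn(f(X), Y).mean()
        penalty = (lam * manifold_grad_norm_sq(f, X, neighbors, dists)).mean()
        (mu * fit + penalty).backward()
        opt.step()
    with torch.no_grad():                                 # dual ascent steps, cf. (208)-(209)
        mu = torch.clamp(mu + eta_mu * (loss_fn(f(X), Y).mean() - eps), min=0.0)
        lam = torch.clamp(lam + eta_lam * manifold_grad_norm_sq(f, X, neighbors, dists), min=0.0)
        lam = lam * len(lam) / lam.sum().clamp_min(1e-12) # renormalize so that |lam|_1 = N
    return lam, mu
```

Here loss_fn could be, e.g., torch.nn.MSELoss(reduction='none'), lam a tensor of shape (N,) initialized uniformly, and mu a nonnegative scalar. The final renormalization corresponds to the normalization option mentioned above; a projection onto the simplex could be used instead.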
F.2. Point-Cloud Laplacian Dual Ascent

This algorithm takes advantage of the Laplacian formulation given in Proposition 3. That is, we estimate the gradient of the Lagrangian with respect to θ utilizing the point-cloud Laplacian. Formally, for a vector ˆλ ∈ R^N_+ such that (1/N)\sum_{n=1}^{N}\hat\lambda_n = 1 and a constant ˆµ ∈ R_+, the empirical dual function associated with the Lagrangian ˆL(θ, ˆµ, ˆλ) of problem (7) is defined as
\hat d(\hat\lambda,\hat\mu)=\min_\theta\ \frac{1}{N}\sum_{n=1}^{N}\ell\big(f_\theta(x_n),y_n\big)+\hat\mu^{-1}\frac{1}{N}\sum_{n=1}^{N}f_\theta(x_n)\,\hat\lambda_n\,L^{t_N}_{\lambda_p,N}f_\theta(x_n). (210)
Note that in (210) we have omitted the term ˆµϵ, as it is constant for a given ˆµ, and we have divided by ˆµ, which renders an equivalent problem as long as ˆµ > 0. By taking the maximum of the dual function over ˆµ and ˆλ, we recover the dual problem (9). For a given choice of dual variables ˆλ, ˆµ, the dual function (210) is an unconstrained problem that depends only on the parameters θ. The link with manifold regularization (5) is now apparent: taking λ(x_n) = 1/N and µ = γ, the two problems become equivalent. In order to minimize (210), we can update the parameters θ following the gradient,
\theta_{k+1}=\theta_k-\eta_\theta\nabla_\theta\Big[\frac{1}{N}\sum_{n=1}^{N}\ell\big(f_{\theta_k}(x_n),y_n\big)+\hat\mu^{-1}\frac{1}{N}\sum_{n=1}^{N}f_{\theta_k}(x_n)\,\hat\lambda_n\,L^{t_N}_{\lambda_p,N}f_{\theta_k}(x_n)\Big], (211)
where ηθ > 0 is a step size. Note that the update in (211) has two parts: one that relies on the loss ℓ and utilizes labeled data, and another, given by ˆλ_n L^{t_N}_{λ_p,N}, that penalizes the Lipschitz constant of the function and can be computed utilizing unlabeled data. The overall procedure is summarized in Algorithm 2.

Algorithm 1 Gradient-Based Smooth Learning on the Data Manifold
1: Initialize the parametric function θ; define neighborhoods N(p) for every p ∈ D
2: repeat
3:   for primal steps k do
4:     Estimate the gradient norm for all x_i: ||∇_M fθ(x_i)|| = max_{z∈N(x_i)} ||fθ(x_i) − fθ(z)|| / d(x_i, z)
5:     Update θ: θ ← θ − ηθ ∇θ [ µ (1/N) Σ_{i=1}^N ℓ(fθ(x_i), y_i) + (1/N) Σ_{i=1}^N λ(x_i) ||∇_M fθ(x_i)||² ]
6:   end for
7:   Update the dual variable µ: µ ← [ µ + ηµ( (1/N) Σ_{i=1}^N ℓ(fθ(x_i), y_i) − ϵ ) ]_+
8:   Update the dual variables λ(x_i): λ(x_i) ← [ λ(x_i) + ηλ max_{z∈N(x_i)} ||fθ(x_i) − fθ(z)||² / d(x_i, z)² ]_+
9:   Project λ: λ ← argmin_{λ̃ ≥ 0} ||λ̃ − λ|| s.t. |λ̃|₁ = N
10:  e ← e + 1
11: until convergence

Algorithm 2 Point-Cloud Laplacian Smooth Learning on the Data Manifold
1: Initialize the parametric function θ; fix the temperature t
2: repeat
3:   for primal steps k do
4:     Update θ: θ ← θ − ηθ ∇θ [ µ (1/N) Σ_{i=1}^N ℓ(fθ(x_i), y_i) + (1/N) Σ_{i=1}^N fθ(x_i) λ(x_i) L^t_{λ_p,N} fθ(x_i) ]
5:   end for
6:   Update the dual variable µ: µ ← [ µ + ηµ( (1/N) Σ_{i=1}^N ℓ(fθ(x_i), y_i) − ϵ ) ]_+
7:   Update the dual variables λ(x_i): λ(x_i) ← [ λ(x_i) + ηλ max_{z∈N(x_i)} ||fθ(x_i) − fθ(z)||² / d(x_i, z)² ]_+
8:   Project λ: λ ← argmin_{λ̃ ≥ 0} ||λ̃ − λ|| s.t. |λ̃|₁ = N
9:   e ← e + 1
10: until convergence

F.3. Gradient Norm Estimate

For both Algorithms 1 and 2, we need to compute the norm of the gradient of fθ at a sample point x_n with respect to the manifold M. Leveraging the Lipschitz constant of Definition 1, we estimate the maximum norm of the gradient by evaluating the Lipschitz quotient over the sample points in our dataset. To do so, we set a metric distance d_M over the manifold, with which we define the neighborhood of sample x_n as the samples that are sufficiently close, i.e.,
N(x_n) = \{z \in M : d_M(z, x_n) \leq \delta\}. (212)
Note that the definition of neighborhood in (212) requires a maximum distance δ. In practice, we can either choose δ or set the degree of the neighborhood, i.e., choose the k nearest neighbors. Upon selecting the neighborhood, the norm of the gradient is computed as
\|\nabla_{\mathcal M}f_\theta(x_i)\| = \max_{z\in N(x_i)}\frac{|f_\theta(x_i)-f_\theta(z)|}{d(x_i,z)}. (213)
The choice of distance over the manifold is application dependent. In control problems that involve physical systems, the metric might encode the work required to move between two states. In computer vision applications, common choices include perceptual losses (Zhang et al., 2018) or distances over embeddings (Khrulkov et al., 2020).
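As a complement to the sketch of Algorithm 1 above, the following is a minimal PyTorch-style sketch of the weighted Laplacian penalty that drives the primal update in Algorithm 2. The heat-kernel graph construction, the unnormalized Laplacian L = D − W, and the helper names are assumptions for illustration; the paper's point-cloud Laplacian of Definition 2 additionally involves degree and temperature normalizations.

```python
# Minimal sketch of the weighted point-cloud Laplacian penalty of Algorithm 2, step 4:
# (1/N) sum_i lambda_i * f(x_i) * (L f)(x_i). Construction details are assumptions.
import torch

def heat_kernel_laplacian(X, t):
    """Dense graph Laplacian L = D - W with W_ij = exp(-||x_i - x_j||^2 / (4t))."""
    sq_dists = torch.cdist(X, X) ** 2
    W = torch.exp(-sq_dists / (4.0 * t))
    W.fill_diagonal_(0.0)                       # no self-loops
    D = torch.diag(W.sum(dim=1))
    return D - W

def laplacian_penalty(f, X, lam, L):
    """Weighted quadratic form (1/N) * sum_i lam_i * f(x_i) * (L f)(x_i)."""
    out = f(X).reshape(len(X), -1)              # (N, d_out) predictions
    Lf = L @ out                                # apply the point-cloud Laplacian
    return (lam.unsqueeze(1) * out * Lf).sum() / len(X)

# Usage: penalty = laplacian_penalty(model, X, lam, heat_kernel_laplacian(X, t=0.01)),
# and the primal update of Algorithm 2 descends mu * fit_loss + penalty.
```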
G. Experiments

G.1. Baselines

In all experiments we compare our method with empirical risk minimization (ERM) (Vapnik, 1999), ambient regularization (Krogh & Hertz, 1991), and manifold regularization (Belkin et al., 2005). Regarding memory and computation, both manifold regularization and Manifold Lipschitz require computing and storing the point-cloud Laplacian matrix; our method additionally stores one real-valued dual variable per sample. Regarding training time, our experiments show that our method takes roughly 10% longer than Laplacian regularization, which in turn takes about twice as long as ERM and ambient regularization. We ran experiments on both NVIDIA 2080 and 3090 GPUs.

G.2. Ground Robotic Vehicle Residual Learning

In this section we explain the residual learning experiment involving a ground robot. The data were acquired with an iRobot Packbot equipped with a high-resolution camera, with the robot making turns on both pavement and grass. The dynamics of the system are governed by the discrete-time nonlinear state-space equation
x_{k+1} = f(x_k, u_k) + g(u_k), (214)
where x_k is the state of the system, u_k is the action taken, f(x_k, u_k) is the model prediction, and g(u_k) is a non-modeled error of the prediction. In this setting, the robot state involves the position, and the action taken is given by the linear and angular velocities. The dynamics modeled by f(x_k, u_k) involve the mass of the robot, the radius of the wheels, and other known parameters of the robot. In practice, the model f(x_k, u_k) is not perfect, and the model mismatch is due to differences in friction with the ground, delays in communication with the sensors, and discrepancies in the robot specifications. To make matters worse, some of these discrepancies are difficult to model or intractable to compute in practice. Therefore, we seek to learn the model mismatch g(u_k). Figure 3 shows examples of trajectories given by the time series x_k, each of which is associated with an average model-mismatch error and its corresponding variance,
\mu_i = \frac{1}{K}\sum_{k=1}^{K}\big\|x_{k+1}-f(x_k,u_k)\big\|, (215)
\sigma_i = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}\Big(\mu_i-\big\|x_{k+1}-f(x_k,u_k)\big\|\Big)^2}. (216)
In practice, samples from the grass dataset tend to have a larger disturbance mean µ, since pavement presents a more uniform setting. For example, comparing Figures 3b and 3f, we can see that for similar trajectories the error is larger on grass. Moreover, trajectories that involve larger velocity magnitudes also present larger disturbance means and variances (cf. Figures 3a and 3c). This is because at high speeds the non-modeled effects, such as drift and friction, are more significant. For more details on the data acquisition, please refer to the original paper (Koppel et al., 2016).
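To make the labels of this experiment concrete, the following sketch computes the per-trajectory disturbance statistics of (215)-(216). The array layout and the nominal_model callable are assumptions for illustration, not the paper's data-processing code.

```python
# Minimal sketch of the disturbance statistics in (215)-(216), assuming `states` has
# shape (K+1, d), `actions` has shape (K, m), and `nominal_model` implements f(x_k, u_k).
import numpy as np

def disturbance_stats(states, actions, nominal_model):
    residuals = np.array([
        np.linalg.norm(states[k + 1] - nominal_model(states[k], actions[k]))
        for k in range(len(actions))
    ])
    mu = residuals.mean()                                                   # (215)
    sigma = np.sqrt(((mu - residuals) ** 2).sum() / (len(residuals) - 1))   # (216)
    return mu, sigma
```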
Figure 3: Sample linear and angular velocity trajectories for grass and pavement, with their corresponding average and variance errors. (a) Grass sample 10, µ = 0.288, σ = 0.088; (b) grass sample 100, µ = 0.919, σ = 0.113; (c) grass sample 120, µ = 1.01, σ = 0.298; (d) pavement sample 10, µ = 0.370, σ = 0.172; (e) pavement sample 100, µ = 0.424, σ = 0.313; (f) pavement sample 120, µ = 0.652, σ = 0.747.

G.2.1. NUMERICAL IMPLEMENTATION

We are equipped with two datasets: one for grass with 195 samples and one for pavement with 224 samples. We partition them into 170 training and 25 test samples for grass, and 200 training and 24 test samples for pavement. Each sample has an associated disturbance mean and disturbance variance vector [µ_i, σ_i] ∈ R². Regarding the optimization, we use the mean squared error as the loss and train a two-layer fully connected neural network with 256 hidden units and the hyperbolic tangent as the nonlinearity. We train for 10000 epochs with a learning rate of 0.0015 for pavement and 0.00015 for grass, using the whole training set as the batch. For ambient Lipschitz, we use weight decay w_D = 0.3 and w_D = 0.1 for grass and pavement, respectively. For the point-cloud Laplacian, we use a heat kernel with temperature t = 15 and t = 35 for grass and pavement, respectively; this yields a point-cloud Laplacian with one connected component. For Laplacian regularization we use 1 × 10⁻³ and 1 × 10⁻⁴ for grass and pavement, respectively. For Manifold Lipschitz, we use ηµ = 0.01, ηλ = 0.1, and ϵ = 0.005 and 0.01 for grass and pavement, respectively. We initialize µ = 5 and λ uniform. The final results are summarized in Table 1.
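For reference, the following is a minimal PyTorch-style sketch of the regression model described above: a two-layer fully connected network with 256 hidden units and a tanh nonlinearity, trained with the mean squared error. The input dimension, the choice of plain SGD, and the assumption that the targets are the per-sample disturbance statistics [µ_i, σ_i] are illustrative, not details taken from the released code.

```python
# Minimal sketch of the two-layer tanh regression model used in this experiment;
# the input size, optimizer, and target layout are assumptions for illustration.
import torch
import torch.nn as nn

def make_residual_model(in_dim, hidden=256, out_dim=2):
    # out_dim = 2 assuming the targets are the disturbance statistics [mu_i, sigma_i]
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, out_dim))

model = make_residual_model(in_dim=100)                     # input size is dataset dependent
optimizer = torch.optim.SGD(model.parameters(), lr=1.5e-3)  # 1.5e-4 for the grass dataset
loss_fn = nn.MSELoss()
# Full-batch training for 10000 epochs; the weighted manifold penalty of Algorithms 1/2
# would be added to `loss` below.
# for epoch in range(10000):
#     optimizer.zero_grad()
#     loss = loss_fn(model(X_train), Y_train)
#     loss.backward()
#     optimizer.step()
```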
G.3. Quadrotor state prediction

In this section we present a state prediction problem based on real-world data collected from a quadrotor taking off and flying in a circle for 12 seconds. The aerial robot is the open-source Crazyflie 2.1 quadrotor (https://www.bitcraze.io/products/crazyflie-2-1/), which has a mass of 32 g and a size of 9 cm². The quadrotor communicates with a computer running on an Intel i7 CPU; the communication is established with the Crazyradio PA at a nominal rate of 500 Hz (T_s = 1/500). The position of the quadrotor is measured with a VICON motion-capture system, whereas the accelerations are obtained from the on-board accelerometer and gyroscope. For further information on this setup, please refer to (Jiahao et al., 2022; Chee et al., 2022; Jiahao et al., 2021). The experimental setup targets a speed of 0.4 m/s and a circular trajectory of radius 0.5 m. We consider 2 trajectories of 12 seconds each (each trajectory has 6000 time stamps), and the starting position of the quadrotor is the same for both trajectories. For each time stamp t, we have measurements of position, velocity, and acceleration in R³, i.e., [x_t, y_t, z_t, \dot x_t, \dot y_t, \dot z_t, \ddot x_t, \ddot y_t, \ddot z_t].

The problem consists of learning the dynamical system of the quadrotor, whose dynamics we model as
[x_{t+1}, y_{t+1}, z_{t+1}, \dot x_{t+1}, \dot y_{t+1}, \dot z_{t+1}] = f\big([x_t, y_t, z_t, \dot x_t, \dot y_t, \dot z_t, \ddot x_t, \ddot y_t, \ddot z_t]\big). (217)
For the learning procedure we utilize the 6000 samples, and we seek to minimize the mean squared error between the next state and the prediction given the current state (cf. (217)). We train a two-layer neural network with the different methods reported in Table 2. To test the neural network, we compute the difference between the predicted state and the next state on the test trajectory. Note that even though the training and test trajectories are not the same, there is a resemblance between the two. To begin with, we can conclude that adding regularization does not always help, as ambient regularization does not improve upon the ERM prediction. A salient conclusion from Table 2 is that adding regularization on the manifold always improves upon ERM. In particular, our method is almost 3 times better than standard ERM, and more than 2 times better than standard Laplacian regularization. To conclude, our method improves over all the techniques considered, which allows us to conclude that, when predicting the next state of a quadrotor from the current state under noisy measurements, utilizing smooth functions improves generalization.

Figure 4: Point-cloud Laplacian.

G.3.1. DETAILS ON NUMERICAL IMPLEMENTATION

For the quadrotor state prediction problem we utilized a two-layer fully connected neural network with 8192 hidden units, the hyperbolic tangent as the nonlinearity, and a bias term. For the optimizer, we utilized a learning rate of 10⁻⁵ and the full dataset per batch. We trained until convergence in all cases, with e = 1000 epochs. For ambient regularization we used weight decay 0.1. For the construction of the Laplacian, we utilized a temperature coefficient t = 0.01. For Laplacian regularization we utilized γ = 10⁻⁶. For our method, we utilized a µ dual step of 0.5, ϵ = 0.003, and a λ dual step of 0.1. In all cases we trained until convergence with e = 10000 epochs.

G.4. Two-Moons Dataset

In this subsection we provide the details of the experiment with the two-moons dataset used in Figure 1. To generate the data we utilized the sklearn library, with 1 labeled and 200 unlabeled samples per class (i.e., per moon), and we added noise σ = {0.05, 0.1}. For the neural network, we utilized a two-layer fully connected neural network with 64 hidden neurons, a bias term, and the hyperbolic tangent as the nonlinearity. For the optimizer, we utilized a learning rate of 0.9 and no momentum. For ambient regularization, we added a weight decay of 0.1. For the construction of the Laplacian of Figure 1 we utilized a heat kernel temperature of t = 0.005, and we normalized it. For Laplacian regularization, we set γ = 0.5. For Manifold Lipschitz (our method), we used a µ dual step of 0.5 and a λ dual step of 0.1. An ablation study over different values of the temperature coefficient t can be found in Section I.
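For reproducibility of the two-moons setup just described, the following sketch generates the semi-supervised split with sklearn. The way the single labeled point per moon is selected is an assumption for illustration.

```python
# Minimal sketch of the semi-supervised two-moons setup: 1 labeled and 200 unlabeled
# samples per class; the labeled-point selection rule is an illustrative assumption.
import numpy as np
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=402, noise=0.05, random_state=0)  # noise in {0.05, 0.1}
labeled_idx = [np.flatnonzero(y == c)[0] for c in (0, 1)]     # one labeled point per moon
unlabeled_mask = np.ones(len(X), dtype=bool)
unlabeled_mask[labeled_idx] = False
X_lab, y_lab = X[labeled_idx], y[labeled_idx]                 # 1 labeled sample per class
X_unlab = X[unlabeled_mask]                                   # 200 unlabeled samples per class
```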
As seen in Figure 1, ambient regularization fails to classify the unlabeled samples, since it ignores the distribution of samples induced by the manifold. In the case in which the manifold has two connected components (cf. Figure 1a), our method works as well as manifold regularization, since the Lipschitz constant is made small in each component separately. However, when the manifold is weakly connected, manifold regularization fails to recognize the transition between the components: it penalizes large gradients across the manifold, converging to a plane that connects the two labeled samples. Our Manifold Lipschitz method, since it only requires the Lipschitz constant to be small, forces a sharp transition at the point of maximal separation.

G.5. Navigation Controls Problem

In this section, we consider the problem of continuous navigation of an agent whose objective is to reach a goal while avoiding obstacles. The state space of the agent is S = [0, 20] × [0, 10], representing the x and y axes, respectively. The agent navigates by taking actions on the velocity v ∈ R², and the state evolves according to the dynamics s_{t+1} = s_t + v_t T_s, where T_s = 0.1 s. We construct a square grid of points in the free space of the environment, i.e., outside the obstacles, and use Dijkstra's algorithm to find the shortest paths along the grid from the two starting positions [1, 9]^T and [14, 1]^T to the goal [19, 1]^T. For those two grid trajectories, we compute the optimal action to be taken at each point in order to follow the trajectory. The learner is equipped with both the labeled trajectories and the unlabeled point grid. To leverage the manifold structure of the data, we construct the point-cloud Laplacian over the grid of points, connecting adjacent points in the grid. We train a two-layer neural network using the mean squared error loss over the optimal set of points and actions for ERM, ERM with an ambient Lipschitz regularizer, manifold regularization, and our Manifold Lipschitz method.

Method | Successful trajectories
ERM | 85
Ambient Reg. | 66
Manifold Reg. | 77
Manifold Lipschitz | 94
Table 3: Number of successful trajectories from 100 random starting points.

To evaluate the performance, we randomly chose 100 starting points and computed the trajectories generated by each learned function. A trajectory is successful if it reaches the goal without colliding with the obstacles or the walls. The results are summarized in Table 3 and showcase the benefit of imposing manifold Lipschitz constraints. Our method outperforms the three other methods because it minimizes the gradient of the function over the domain of the data. As opposed to ERM, our method generates a smooth function outside of the labeled trajectories. Ambient Lipschitz regularization fails because the Euclidean distance ignores the true distance between samples across walls, forcing similar outputs for points that should take different actions. Manifold regularization is able to capture the similarity between points, but it fails to properly capture the sharp turns near the edges of the obstacles. The success of Manifold Lipschitz can be explained by its dual variables λ, shown in Figure 5c. In this figure, the radius of each ball represents the value of the dual variable, which is larger close to the corners of the obstacles, since the problem requires larger gradients to make sharp turns there. Moreover, the fact that we can disentangle the loss on the labeled data from the Lipschitz constant allows us to overfit the labeled data as much as we require.
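The evaluation protocol just described can be summarized by the following sketch, which rolls out a learned policy under the dynamics s_{t+1} = s_t + v_t T_s and counts trajectories that reach the goal without leaving the free space. The policy interface, the obstacle check, and the tolerances are assumptions for illustration.

```python
# Minimal sketch of the success-rate evaluation for the navigation problem; the
# `policy` and `in_free_space` callables and the tolerances are assumptions.
import numpy as np

def rollout_success(policy, start, goal, in_free_space, Ts=0.1, max_steps=2000, tol=0.3):
    s = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    for _ in range(max_steps):
        if not in_free_space(s):                 # collided with an obstacle or a wall
            return False
        if np.linalg.norm(s - goal) <= tol:      # reached the goal
            return True
        s = s + Ts * np.asarray(policy(s))       # dynamics s_{t+1} = s_t + v_t * Ts
    return False

def success_count(policy, starts, goal, in_free_space):
    return sum(rollout_success(policy, s, goal, in_free_space) for s in starts)
```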
The learned functions for each method are shown in Figure 5.

Figure 5: (a) Training dataset, where blue stars depict unlabeled points and the blue arrow the optimal action at the red star; (b) function learned with Manifold Lipschitz; (c) its associated dual variables λ; (d), (e), and (f) functions learned with ERM, ambient regularization, and manifold regularization, respectively.

H. Further References

Since we introduce the Lipschitz constant as a constraint in the learning problem, our reformulation and solution methodologies are framed within the constrained learning paradigm (Chamon & Ribeiro, 2020; Chamon et al., 2023; Yang, 2019). Central to the solution of constrained learning problems is the use of dual formulations and dual ascent learning algorithms. These are finding increasing applicability, as evidenced by their use in, e.g., adversarial robustness (Robey et al., 2021), graph neural networks (Cervino et al., 2022; Arghal et al., 2021), federated learning (Shen et al., 2021), active learning (Elenter et al., 2022), reinforcement learning (Paternain et al., 2019; 2022; Castellano et al., 2021; Bai et al., 2021; Hasanbeig et al., 2018), and wireless communications (Eisen et al., 2019). In the context of adversarial attacks on neural networks, manifold-based regularization techniques have shown substantial empirical and theoretical evidence of their utility in improving adversarial robustness (Zhang et al., 2021; Khoury & Hadfield-Menell, 2018; Ma et al., 2018; Moosavi-Dezfooli et al., 2019; Jin & Rinard, 2020; Lassance et al., 2021). Some works seek on-manifold attacks, which are more realistic than norm-ball attacks given the high dimensionality of the input and the low-dimensional structure of the data (Stutz et al., 2019). Smooth functions have also been studied in the context of robustness (Rosca et al., 2020; Bubeck & Sellke, 2021; Bubeck et al., 2021). Our work builds on previous results that show convergence of graph Laplacians to Laplace-Beltrami operators; there is a vast amount of work validating convergence results for point-cloud operators over manifolds (Hein et al., 2005; 2007; Dunson et al., 2021; Wu & Wu, 2018). Regarding Lipschitz constant estimation for neural networks, Fazlyab et al. (2019) formulate the problem as a convex optimization problem.

I. Ablation Study on Laplacian Construction

In this section we study the impact of the temperature coefficient t on the construction of the Laplacian. To do so, we repeat the setting of Figure 1 and vary the value of t. We consider the two-moons problem with 1 labeled and 200 unlabeled samples per class. Varying the temperature coefficient changes the number of cross-manifold edges, and therefore makes the problem more challenging. Table 4 summarizes the results.
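For reference, the connected-component and cross-manifold-edge counts reported in Table 4 can be reproduced with a construction along the following lines. The thresholding rule used to turn heat-kernel weights into graph edges is an assumption; the paper's exact graph construction may differ.

```python
# Minimal sketch of the quantities in Table 4: connected components and cross-manifold
# edges of the heat-kernel graph for a given temperature t; the threshold is an assumption.
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def laplacian_graph_stats(X, y, t, weight_threshold=1e-3):
    W = np.exp(-cdist(X, X) ** 2 / (4.0 * t))
    np.fill_diagonal(W, 0.0)
    A = (W > weight_threshold).astype(float)                # adjacency after thresholding
    n_components, _ = connected_components(A, directed=False)
    cross_edges = int(np.sum(A[np.ix_(y == 0, y == 1)]))    # edges joining the two moons
    return n_components, cross_edges
```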
Heat kernel t | Connected components | Cross-manifold edges | Manifold Regularization | Manifold Gradient (Ours)
0.0040 | 2 | 0 | 100% | 100%
0.0050 | 1 | 2 | N/A | 100%
0.0060 | 1 | 3 | N/A | 100%
0.0070 | 1 | 10 | N/A | 100%
0.0080 | 1 | 20 | N/A | 100%
0.0090 | 1 | 27 | N/A | 100%
0.0100 | 1 | 42 | N/A | 100%
0.0150 | 1 | 124 | N/A | 100%
0.0175 | 1 | 176 | N/A | 100%
0.0180 | 1 | 192 | N/A | 100%
0.0190 | 1 | 217 | N/A | 100%
0.0200 | 1 | 251 | N/A | N/A
Table 4: Ablation study on the temperature t of the heat kernel. We report the accuracy when it reaches 100%, and N/A otherwise, since an accuracy below 100% indicates that the method fails to capture the manifold structure of the problem.

As seen in Table 4, Laplacian regularization fails to achieve perfect accuracy when the estimated manifold has fewer than two connected components; that is, manifold regularization achieves perfect accuracy only when each class has its own connected component. Once the components become connected, Laplacian regularization smooths the integral of the gradient and therefore does not properly identify the transition between components. As can also be seen in Table 4, our method is more robust to inexact manifold estimates: if the manifold is not computed perfectly and, as a result, we obtain one connected component instead of two separate moons, our method still works. It is worth remarking that our method also works when the connected components have cross-manifold edges in different places of the manifold. For example, for the Laplacian with heat kernel t = 0.007 there are edges on both sides of the manifold. Moreover, our method is able to distinguish between the two classes even with edges in the middle of the two moons, as can be seen for t = 0.0150 and t = 0.0175 (cf. Figure 6h). Our method breaks once the manifold structure vanishes and most of the points are interconnected, as can be seen for t = 0.02; in this case there are 251 cross-manifold edges and the low-dimensional structure of the problem disappears. In all, our proposed solution is more robust to imperfect estimation of the manifold. Even when the number of cross-manifold edges is large, our method is able to create a partition between classes, since it finds the points with maximal separation and allows the function to change values between them.

Figure 6: Laplacian for different values of the temperature coefficient t: panels (a)-(l) correspond to t = 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.01, 0.015, 0.0175, 0.018, 0.019, and 0.02.

J. Laplacian Spectral Embedding

In Figure 7 we show the result of projecting the two-moons dataset onto the two eigenvectors associated with the two smallest nonzero eigenvalues of the unweighted Laplacian L (Figure 7b) and of the Laplacian re-weighted by the dual variables, diag(λ)L (Figure 7a). The dual variable λ is obtained by running our method until it reaches 100% accuracy.
If we compare the spectral embedding induced by the unweighted Laplacian (i.e., the Laplacian eigenmap) with that of the re-weighted Laplacian for the two-moons dataset, we notice that the latter displays better separation between the classes. It is worth noting, however, that this comparison is unfair in our favour, since the weighted Laplacian uses extra information from the primary task, namely, labeled data. Nevertheless, comparing the Laplacian spectral embeddings provides another illustration of the importance of considering the underlying data geometry (manifold) when imposing smoothness.

Figure 7: Laplacian eigenmaps. (a) Our re-weighted Laplacian diag(λ)L; (b) unweighted Laplacian L; (c) dataset. In Figures 7a and 7b we project the points in the dataset onto the two eigenvectors associated with the two smallest nonzero eigenvalues of diag(λ)L and L, respectively; the colors correspond to the predicted classes. Figure 7c shows the dataset.
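The embeddings compared in Figure 7 can be computed along the following lines. The handling of the re-weighted Laplacian diag(λ)L, which is not symmetric in general, via a plain eigendecomposition is an assumption for illustration.

```python
# Minimal sketch of the spectral embeddings of Figure 7: coordinates given by the
# eigenvectors of the two smallest nonzero eigenvalues of L and of diag(lambda) L.
import numpy as np
from scipy.linalg import eigh

def laplacian_embedding(L, n_components=2, tol=1e-8):
    vals, vecs = eigh(L)                              # ascending eigenvalues (symmetric L)
    nonzero = np.flatnonzero(vals > tol)              # skip the (near-)zero eigenvalues
    return vecs[:, nonzero[:n_components]]            # (N, 2) embedding coordinates

def weighted_laplacian_embedding(L, lam, n_components=2, tol=1e-8):
    M = np.diag(lam) @ L                              # re-weighted Laplacian diag(lambda) L
    vals, vecs = np.linalg.eig(M)                     # generally non-symmetric
    order = np.argsort(vals.real)
    keep = [i for i in order if vals.real[i] > tol][:n_components]
    return vecs[:, keep].real
```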