Journal of Machine Learning Research 16 (2015) 3637-3686. Submitted 7/13; Revised 11/14; Published 12/15

Supervised Learning via Euler's Elastica Models

Tong Lin (lintong@pku.edu.cn), Hanlin Xue (linlie312@gmail.com), Ling Wang (ling.wang.nj@gmail.com), Bo Huang (bohuang0321@gmail.com), Hongbin Zha (zha@cis.pku.edu.cn)
Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China

Editor: Mikhail Belkin

Abstract

This paper investigates the Euler's elastica (EE) model for high-dimensional supervised learning problems in a function approximation framework. In 1744 Euler introduced the elastica energy for a 2D curve in modeling torsion-free thin elastic rods. Together with its degenerate form, total variation (TV), Euler's elastica has been successfully applied to low-dimensional data processing tasks such as image denoising and image inpainting over the last two decades. Our motivation is to apply Euler's elastica to high-dimensional supervised learning problems. To this end, a supervised learning problem is modeled as an energy functional minimization under a new geometric regularization scheme, where the energy is composed of a squared loss and an elastica penalty. The elastica penalty regularizes the approximated function by heavily penalizing large gradients and high curvature values on all level curves. We take a computational PDE approach to minimize the energy functional. Using variational principles, the energy minimization problem is transformed into an Euler-Lagrange PDE. However, this PDE is usually high-dimensional and cannot be directly handled by common low-dimensional solvers. To circumvent this difficulty, we use radial basis functions (RBF) to approximate the target function, which reduces the optimization problem to finding the linear coefficients of these basis functions.
Some theoretical properties of this new model, including the existence and uniqueness of solutions and universal consistency, are analyzed. Extensive experiments demonstrate the effectiveness of the proposed model for binary classification, multi-class classification, and regression tasks.

Keywords: supervised learning, Euler's elastica, total variation, geometric regularization, Euler-Lagrange PDE, function approximation, universal consistency

(c) 2015 Tong Lin, Hanlin Xue, Ling Wang, Bo Huang, Hongbin Zha.

"Read Euler, read Euler, he is our master in everything." (Pierre-Simon Laplace, 1749-1827)

1. Introduction

Supervised learning (Murphy, 2012; Hastie et al., 2009; Bishop, 2006) aims at inferring a function that maps inputs to desired outputs under the guidance of training data. The two main tasks in supervised learning are classification and regression. Numerous supervised learning methods have been developed over several decades; Caruana and Niculescu-Mizil (2006) gave a comprehensive empirical comparison of these methods. A recent large-scale evaluation of classification methods was conducted by Fernández-Delgado et al. (2014): 179 classifiers arising from 17 families were compared on 121 data sets, showing that random forests, support vector machines (SVM), neural networks, and boosting are among the current top methods. Roughly speaking, existing methods can be divided into two main categories: statistics based and function learning based. One advantage of function learning methods is that powerful mathematical theories in functional analysis can be exploited, rather than performing optimization on discrete data points. Most function learning methods can be derived from the energy regularization framework, which minimizes a fitting loss term plus a smoothing penalty.
It is arguable that the most successful classification and regression method is the support vector machine (SVM) (Vapnik, 1998; Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002), whose cost function is composed of a hinge loss and an RKHS norm penalty determined by a kernel. There are several variants of SVM combining different losses with different penalties (Steinwart, 2005; Bartlett et al., 2006; Huang et al., 2014). In particular, when the hinge loss is replaced by a squared loss, the modified algorithm is called the Regularized Least Squares (RLS) method (Rifkin, 2002). Instead of considering a variety of loss terms, manifold regularization (Belkin et al., 2006) introduced a geometric regularizer of squared gradient magnitude on a manifold. Its discrete version corresponds to graph Laplacian regularization (Zhou and Schölkopf, 2005; Nadler et al., 2009). A more recent work is the geometric level set (GLS) classifier (Varshney and Willsky, 2010), with an energy functional composed of a margin-based loss and a geometric regularization term based on the surface area of the decision boundary. The GLS classifier was motivated by the study of minimal surfaces and their applications in image processing. Experiments showed that GLS is competitive with SVM and other state-of-the-art classifiers. Following the geometric regularization approach, in this paper we propose to use Euler's elastica for supervised learning problems. The energy functional is composed of a squared loss and an Euler's elastica (EE in the sequel) regularizer. Briefly, an elastica regularizer integrates two important geometric factors, gradients and curvatures, in a unified manner. In particular, its degenerate form is the well-known total variation (TV) if one considers only gradients and disregards the influence of curvatures.
Since both TV and EE models have achieved great success in image denoising and image inpainting (Chan and Shen, 2005; Aubert and Kornprobst, 2006), a natural question is whether this success on image processing applications can be transferred to high-dimensional data analysis such as supervised learning. This paper investigates the question by extending TV and EE models to supervised learning settings, and evaluating their performance on benchmark data sets against state-of-the-art methods.

Figure 1: Results on the two-moon data using the EE classifier. (a) Decision boundary (in blue) that separates two classes of points (represented by red stars or green circles); (b) learned target function illustrated as a surface in 3D space.

Figure 1 shows the classification result and the learned target function on the popular two-moon data set using the EE classifier. Note that the three important factors considered in the EE classifier, namely gradient, curvature, and margin between two classes, are depicted in different directions at one data point of the produced decision boundary in Figure 1(b). Although some researchers in the machine learning community may think that supervised learning problems have been widely studied, and that several leading algorithms such as SVM (Vapnik, 1998; Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002), boosting (Schapire and Freund, 2012), and random forests (Breiman, 2001) already achieve superb classification performance, we argue that this work provides a new perspective on understanding supervised learning problems. In particular, the contributions of this paper are:

1. A proper balance of three important factors in supervised learning: margin, gradient, and curvature.
Here the term margin refers to the original geometric meaning used in SVM for binary classification problems, namely, the perpendicular distance from a data point to the decision boundary in the input space. The margin of an SVM classifier $\mathrm{sign}(w^\top h(x))$ can be written as $y(w^\top h(x)) / (\|w\|_2 \|h(x)\|_2)$, where $w$ denotes the coefficients of the separating hyperplane, and $h(x)$ is the high-dimensional feature vector representation of a data point $x$. Similarly, the margin in boosting can be defined as $y(w^\top h(x)) / (\|w\|_1 \|h(x)\|_\infty)$, or simply $yf(x)$ if the combined classifier $f(x)$ has been properly normalized (see Schapire and Freund, 2012, chap. 5). Large margins play a central role in several state-of-the-art classifiers. Following the traditions of image processing, in this work the squared loss $(y - f(x))^2$ is used for easier derivative calculations on both classification and regression tasks. Note that the squared loss is equivalent to a margin-based loss $(1 - yf(x))^2$, called the quadratic loss (Bartlett et al., 2006, table 1), since $y \in \{-1, +1\}$ in binary classification. On the other hand, the term gradient is related to the slope of function values in a continuous setting, while the curvature measures the degree to which all the level curves (including the decision boundary) are curved. Both gradients and curvatures are geometric measurements that reflect the complexity of the output classifier. The tradeoff between the squared loss and the complexity involving gradients and curvatures in this work is new to the machine learning community.

2. Euler-Lagrange PDEs that characterize the optimal solution for supervised learning problems. Historically, PDEs have been used to describe a wide range of physical phenomena such as sound, heat, fluid flow, electrostatics, electrodynamics, and elasticity.
Surprisingly, these seemingly distinct physical phenomena can be unified under a PDE framework, which implies that they are essentially governed by the same or similar mechanisms of nature. A natural question is: can PDEs be applied to high-dimensional supervised learning problems? To the best of our knowledge, Varshney and Willsky (2010) made the first attempt, proposing level set based PDEs for classification. Following this research line, we propose Euler-Lagrange PDEs, derived from Euler's elastica model and its degenerate total variation model, for classification and regression. These PDEs reveal equilibrium conditions of the desired fitting process for supervised learning.

3. Two numerical algorithms for solving the elastica based supervised learning problem in high dimensions. Using radial basis function approximation, we present two PDE solvers: the gradient descent time marching method and the lagged linear equation iteration method.

The remainder of this paper is organized as follows. In Section 2 we begin with a brief review of TV and EE models used in image processing. The proposed models for supervised learning are described in Section 3, followed by the corresponding numerical solutions in Section 4. Some theoretical properties of the proposed models are discussed in Section 5. Section 6 presents the experimental results, and Section 7 concludes the paper.

2. Preliminaries

For a better understanding of the proposed method, we first review the notions of total variation and Euler's elastica from an image processing perspective, and point out some connections with prior work in the machine learning literature.

2.1 Total Variation (TV)

A function is said to have bounded variation (a BV function in the sequel) if its total variation is finite. For simplicity we begin with the classical definition of total variation (TV) for a function of one real variable.
The total variation of a real-valued function $f$ defined on an interval $[a, b] \subset \mathbb{R}$ is the quantity

$$V_a^b(f) = \sup_P \sum_{i=0}^{n_P - 1} |f(x_{i+1}) - f(x_i)|, \qquad (1)$$

where the supremum runs over the set of all partitions $P$ of the given interval $[a, b]$, with $n_P$ being the number of points in a specific partition $P$. If $f$ is differentiable and its derivative is Riemann-integrable, the total variation can be written as

$$V_a^b(f) = \int_a^b |f'(x)|\, dx.$$

Intuitively, it measures the total distance traveled along the direction of the y-axis by a point moving along the graph, neglecting the contribution of motion along the x-axis. Notice that if $f'(x) > 0$ for all $x \in [a, b]$, it is simply equal to $f(b) - f(a)$ by the fundamental theorem of calculus. The modern definition is based on the concept of distributional derivatives. Let $\Omega \subset \mathbb{R}$ be a bounded open interval. A function $f \in L^1(\Omega)$ is said to be of bounded variation (BV) if

$$\sup_{\varphi} \left\{ \int_\Omega f(x)\, \varphi'(x)\, dx \;:\; \varphi \in C_c^1(\Omega),\; \|\varphi\|_{L^\infty(\Omega)} \le 1 \right\} < \infty, \qquad (2)$$

where $C_c^1(\Omega)$ is the space of continuously differentiable functions with compact support in $\Omega$, and $\|\cdot\|_{L^\infty(\Omega)}$ is the essential supremum norm. Note that this definition has some variants, e.g., requiring the test function to satisfy $\varphi \in C_c^\infty(\Omega)$ and $\|\varphi\|_{C^0(\Omega)} \le 1$ (Golubov and Vitushkin, 2001). An equivalent definition is that BV functions are functions whose distributional derivative is a finite Radon measure. The two definitions (1) and (2) are consistent. It is natural to generalize definition (2) to functions of several variables. For an open $\Omega \subset \mathbb{R}^d$, the total variation of $f \in L^1(\Omega)$ is given by

$$\sup_{\boldsymbol\varphi} \left\{ \int_\Omega f\, \nabla \cdot \boldsymbol\varphi\, dx \;:\; \boldsymbol\varphi = (\varphi_1, \ldots, \varphi_d) \in C_c^1(\Omega, \mathbb{R}^d),\; \|\boldsymbol\varphi\|_{L^\infty(\Omega)} \le 1 \right\}, \qquad (3)$$

where $\boldsymbol\varphi$ is a vector-valued test function, $\nabla \cdot \boldsymbol\varphi = \sum_i \partial\varphi_i/\partial x_i$ is the divergence operator, and all the components of $\boldsymbol\varphi$ have $L^\infty(\Omega)$-norm bounded by one. For more details of TV definitions and the BV function space, one can refer to Chan and Shen (2005), Aubert and Kornprobst (2006), Ambrosio et al.
(2000), Giusti (1994), and Golubov and Vitushkin (2001). By penalizing large gradients of the target function, total variation has been widely used for image processing tasks such as denoising and inpainting. The pioneering work is Rudin, Osher, and Fatemi's image denoising model (Rudin et al., 1992):

$$\min_u \int_\Omega \left[ (u - u_0)^2 + \lambda |\nabla u| \right] dx,$$

where $u_0$ is the input image with noise, $u$ is the desired output image, $\lambda$ is a regularization parameter that balances the two terms, $\nabla u$ is the gradient vector $(\partial u/\partial x, \partial u/\partial y)$ for a function $u(x, y)$, $|\nabla u|$ is the $l^2$-length of the gradient vector, and $\Omega$ denotes a 2D rectangular image domain. The first (fitting) term measures the fidelity to the input, while the second is a p-Sobolev regularization term (p = 1), where the gradient $\nabla u$ is understood in the distributional sense. The main benefit is that significant image edges are preserved during the denoising procedure (Chan and Shen, 2005; Aubert and Kornprobst, 2006), as image edges are important features that should be faithfully retained in image processing. The common downside of TV-based methods is that piecewise constant images with $|\nabla u| = 0$ almost everywhere are favored over piecewise smooth images, the so-called staircasing effect (Duan et al., 2013). Euler's elastica is one of the high-order approaches developed to overcome this drawback, as described in the next subsection. In the machine learning literature, p-Sobolev regularizers can be found in work on nonparametric smoothing splines, generalized additive models, and projection pursuit regression models (Hastie et al., 2009). Specifically, Belkin et al. (2006) proposed the manifold regularization term

$$\int_{x \in \mathcal{M}} |\nabla_{\mathcal{M}} u|^2\, dx,$$

for any smooth function $u(x)$ on a manifold $\mathcal{M}$. On the other hand, discrete graph Laplacian regularization was discussed in Zhou and Schölkopf (2005) as

$$\sum_{v \in V} |\nabla_v u|^p,$$

where $v$ is a vertex from a vertex set $V$, and $p$ is an arbitrary number.
This penalty measures the roughness of the discrete function $u$ over a graph.

2.2 Euler's Elastica (EE)

The elastica energy first appeared in Euler's work of 1744 on modeling torsion-free thin elastic rods (for the history see Levien, 2008; Fraser, 1991). Mumford (1994) then reintroduced elastica into computer vision for measuring the quality of interpolating curves in disocclusion. Later, elastica based image inpainting methods were developed in Masnou and Morel (1998) and Chan et al. (2002). A smooth curve $\gamma$ is said to be Euler's elastica if it is the equilibrium curve of the elasticity energy

$$E[\gamma] = \int_\gamma (a + b\kappa^2)\, ds, \qquad (4)$$

where $a$ and $b$ are two non-negative constant weights, $\kappa$ denotes the scalar curvature (see Appendix A for its definition), and $ds$ is an infinitesimal arc length element. Euler obtained this energy in studying the steady shape of a thin, torsion-free rod under external forces. The equilibrium curve attains the lowest elastica energy, hence the name. The ratio $a/b$ (if $b \neq 0$) indicates the relative importance of the total length versus the total squared curvature (Chan and Shen, 2005, chap. 2.1). According to Mumford (1994), the key link between elastica curves and image inpainting lies in the interpolation capability of elasticas. Elasticas were found to comply with the connectivity principle (Chan and Shen, 2001; Kanizsa, 1979) in visual perception better than total variation. This principle in vision psychology shows that humans mostly prefer to psychologically connect two disjoint parts occluded by another object, even when they are far apart. Such "nonlinear splines", like classical polynomial splines, are natural tools for completing missing or occluded edges. Besides, there is an interesting Bayesian rationale revealed by Mumford (1994) (see also Chan et al., 2002) obtained by considering the random walk of a drunk.
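For intuition, the elastica energy (4) of a sampled planar curve can be estimated numerically. The sketch below is our own illustrative discretization, not taken from the paper: it approximates $\kappa$ by the change of the polyline's tangent angle per unit arc length.

```python
import numpy as np

def elastica_energy(points, a=1.0, b=1.0):
    """Approximate E = integral (a + b*kappa^2) ds for an open planar
    curve sampled as an (n, 2) array of points.  Curvature is estimated
    from finite differences of the tangent angle (illustrative sketch)."""
    d = np.diff(points, axis=0)              # edge vectors
    ds = np.linalg.norm(d, axis=1)           # edge lengths
    theta = np.arctan2(d[:, 1], d[:, 0])     # tangent angles
    dtheta = np.diff(np.unwrap(theta))       # turning angle at interior nodes
    seg = 0.5 * (ds[:-1] + ds[1:])           # arc length per interior node
    kappa = dtheta / seg                     # kappa ~ d(theta)/ds
    return a * ds.sum() + b * np.sum(kappa**2 * seg)

# A straight segment has zero curvature, so the energy is a * length.
line = np.stack([np.linspace(0.0, 1.0, 50), np.zeros(50)], axis=1)
print(elastica_energy(line, a=1.0, b=1.0))  # ≈ 1.0
```

As expected, increasing $b$ penalizes bent curves while leaving straight segments unaffected.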
Suppose the drunk starts from the origin of a 2D ground plane and each step is straight. Under some distributional assumptions on the step size and the orientation of each step, the maximum likelihood estimate (MLE) of such a discrete random walk is approximately equivalent to the minimization of the elastica energy (4) in a continuous fashion. This drunk-walking model also sheds light on the choice of the power 2 for the curvature in (4). For any $p > 1$, one could consider the general p-elastica energy

$$\int_\gamma (a + b|\kappa|^p)\, ds.$$

Notice that the situation $p = 1$ is less ideal, since in this case the total curvature energy permits sudden turns. Chan et al. (2002) pointed out that generic stationary points of the p-elastica energy are forbidden when $p \ge 3$, implying that $p \in (1, 3)$ seems to be a good choice. A common approach to bridge the gap between a prior energy model for curves and one for images is to use level sets (also called isophotes), pioneered by Osher and Sethian (1988). By lifting a curve prior model into a 2D space, one can construct an image prior model imposed on all the level curves of an image (corresponding to a 2D function). Formally, the Euler's elastica energy of all the level curves of an image $u$ can be expressed as

$$\int_0^L \int_{\gamma_l : u = l} (a + b\kappa^2)\, ds\, dl, \qquad (5)$$

where $\gamma_l$ is the level curve determined by $u(x) = l$, and the level value $l$ varies over the image range $[0, L]$. Let $dt$ denote an infinitesimal length element along the normal direction $\mathbf{n}$ of the level curve (or along the steepest ascent curve); then we have $dl/dt = |\nabla u|$, or $dl = |\nabla u|\, dt$. Thus, by the co-area formula (Giusti, 1994), the integrated elastica energy (5) passes on to $u$ via

$$\int_0^L \int_{\gamma_l : u = l} (a + b\kappa^2)\, |\nabla u|\, dt\, ds = \int_\Omega (a + b\kappa^2)\, |\nabla u|\, dx,$$

since $dt$ and $ds$ form a pair of orthogonal length elements. Here $\Omega$ denotes the whole rectangular image domain.
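The lifted energy $\int_\Omega (a + b\kappa^2)|\nabla u|\, dx$ can be evaluated on a discrete image with finite differences, using the curvature expression $\kappa = \nabla\cdot(\nabla u/|\nabla u|)$ of the next paragraph. The following is a plain illustrative discretization of our own (the small `eps` guarding the division in flat regions is our choice), not the numerical scheme used later in the paper.

```python
import numpy as np

def image_elastica_energy(u, a=1.0, b=1.0, eps=1e-8):
    """Finite-difference estimate of integral (a + b*kappa^2)|grad u| dx
    on a 2D grid with unit spacing, where kappa = div(grad u / |grad u|)."""
    uy, ux = np.gradient(u)                  # grad u (axis 0 = y, axis 1 = x)
    mag = np.sqrt(ux**2 + uy**2) + eps       # |grad u|, regularized
    nx, ny = ux / mag, uy / mag              # unit normal field N
    ny_y, _ = np.gradient(ny)                # d(ny)/dy
    _, nx_x = np.gradient(nx)                # d(nx)/dx
    kappa = nx_x + ny_y                      # kappa = div(N)
    return float(np.sum((a + b * kappa**2) * mag))

flat = np.ones((32, 32))                     # zero gradient: near-zero energy
ramp = np.tile(np.arange(32.0), (32, 1))     # |grad u| = 1, straight level lines
print(image_elastica_energy(flat), image_elastica_energy(ramp))
```

On the ramp, all level lines are straight ($\kappa = 0$), so the energy reduces to $a \int_\Omega |\nabla u|\, dx$, i.e., the total variation of the image.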
Now the elastica energy of an image is completely expressed in terms of $u$, using the well-known curvature formula (Morel and Solimini, 1995) for any level curve $\gamma_l : u(x) = l$:

$$\kappa = \nabla \cdot \mathbf{N} = \nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right), \qquad (6)$$

where $\nabla\cdot$ denotes the divergence operator, defined as $\nabla \cdot \mathbf{V} = \partial A/\partial x + \partial B/\partial y$ for a vector $\mathbf{V} = (A, B)$, and $\mathbf{N}$ is the ascending unit normal field $\nabla u / |\nabla u|$. See Appendix A for a short derivation of (6). Of course, this curvature expression makes sense only for a certain class of smooth functions (such as $C^2(\Omega)$) and needs to be relaxed in order to handle more general functions (like BV or $L^1$ functions). Given a small image region $D$ to be inpainted within the whole image domain $\Omega$, Chan and Shen (2005) proposed an inpainting model based on Euler's elastica:

$$\int_{\Omega \setminus D} (u - u_0)^2\, dx + \lambda \int_\Omega (a + b\kappa^2)\, |\nabla u|\, dx, \qquad (7)$$

where $\lambda$ is a trade-off parameter that balances the first (fitting) term and the second (smoothing) term. Notice that the second term in (7) is an elastica regularizer that penalizes high elastica energy on all the level curves of $u(x)$, as expressed in (5). By the calculus of variations (van Brunt, 2004), its minimization is reduced to a nonlinear Euler-Lagrange equation. The numerical method can be implemented by a finite difference scheme, and experimental results show that this elastica based inpainting method performs better than TV based approaches. Note that total variation can be regarded as a degenerate form of Euler's elastica obtained by setting $a = 1$ and $b = 0$ in (7). In fact, elastica is a combination of total variation, which suppresses oscillations in the gradient direction, and a curvature regularizer, which penalizes non-smooth level set curves (see Figure 1).

3. The Proposed Framework

We first set up the supervised learning problem, and then introduce three models, Laplacian, total variation, and Euler's elastica, in increasing order of computational complexity.
3.1 Problem Setup

The general supervised learning problem can be described as follows: given a training data set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where each data point $x_i \in \Omega \subset \mathbb{R}^d$ is a d-dimensional column vector and $y_i$ is the corresponding target variable, the goal is to estimate an unknown function $u(x)$ for predicting the desired $y$ on a newly arriving point $x$. The difference between classification and regression lies only in the corresponding target values, with one discrete and the other continuous. For regression, we simply use $u(x)$ to approximate the target values; for binary classification, the decision boundary is given by the zero level set of $u(x)$, and the predicted label by $\mathrm{sign}(u(x))$. Most popular multi-class classifiers are based on some type of reduction to binary classification; we defer the discussion of multi-class problems to the experiments section. The widely used functional regularization framework for supervised learning can be formulated as

$$\min_u\ \lambda S(u) + \sum_{i=1}^n L(y_i, u(x_i)), \qquad (8)$$

where $S(u)$ is a smoothing term (also called a penalty) and $L(\cdot)$ denotes a loss function. The penalty term is used to control the complexity of the learned function, which has proven essential in statistical learning theory (Vapnik, 1998; Bousquet et al., 2004; Boucheron et al., 2005; von Luxburg and Schölkopf, 2008). The misclassification risk corresponds to the use of the 0-1 loss: $L_{0\text{-}1}(y, u(x)) = \mathbf{1}[y \neq \mathrm{sign}\, u(x)]$, where $\mathbf{1}[\alpha]$ denotes an indicator function that is 1 if $\alpha$ holds true and 0 otherwise. Alternatively, we can slightly abuse the notation to allow a margin-based representation: $L_{0\text{-}1}(y, u(x)) = \mathbf{1}(yu(x))$, where $\mathbf{1}(\alpha)$ is 1 if $\alpha \le 0$ and 0 otherwise. It is well known that directly minimizing the 0-1 loss is computationally intractable for many nontrivial classes of functions, and often some nonnegative convex nondecreasing loss function is considered instead for computational efficiency.
Another advantage of such convex surrogates for the 0-1 loss is that they make it possible to demonstrate Bayes-risk consistency and to obtain uniform upper bounds on the generalization risk. See Bartlett et al. (2006) and Boucheron et al. (2005, Section 4.2) for more discussion. In the literature, a variety of convex surrogate loss functions $L(\cdot)$ have been proposed for binary classification with $y \in \{-1, +1\}$, such as:

1. the hinge loss $L_{\mathrm{hinge}}(y, u(x)) = \max\{0, 1 - yu(x)\}$ for SVM;
2. the squared loss $L_{\mathrm{squared}}(y, u(x)) = (y - u(x))^2$ for RLS;
3. the logistic loss $L_{\mathrm{logistic}}(y, u(x)) = \log(1 + \exp(-yu(x)))$ for logistic regression;
4. the exponential loss $L_{\mathrm{exponential}}(y, u(x)) = \exp(-yu(x))$ in boosting.

Except for the squared loss, the losses above are margin-based, since the classification margin $yu(x)$ is explicitly used. When the discussion is restricted to binary classification with $y \in \{-1, +1\}$, the squared loss is actually equivalent to the quadratic loss $(1 - yu(x))^2$, which is margin-based. Throughout the paper, the squared loss is used in all our models for several reasons: (1) the derivative of a squared loss is very simple to calculate; (2) it can be applied to both classification and regression without any modification; (3) for classification, Rifkin (2002) showed that the RLS method based on the squared loss can offer comparable or slightly better accuracy than hinge loss based SVM; (4) using the squared loss is consistent with the related work in image processing, leading to identical or similar PDEs; (5) we have no intention of exhaustively trying and comparing different loss functions; instead our focus is on the second term, which is a new geometric regularization for supervised learning. For more loss functions and penalties, one can refer to Steinwart (2005), Bartlett et al. (2006), and Huang et al. (2014). Our goal is to explore how TV and EE can be applied to classification and regression problems on high-dimensional data sets.
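The four surrogate losses above are straightforward to write down. The snippet below (illustrative only) also checks the claimed equivalence between the squared loss and the margin-based quadratic loss when $y \in \{-1, +1\}$:

```python
import numpy as np

# The four convex surrogate losses listed above (vectorized).
def hinge(y, u):       return np.maximum(0.0, 1.0 - y * u)   # SVM
def squared(y, u):     return (y - u) ** 2                   # RLS
def logistic(y, u):    return np.log1p(np.exp(-y * u))       # logistic regression
def exponential(y, u): return np.exp(-y * u)                 # boosting

# For y in {-1, +1}, y^2 = 1, hence (y - u)^2 = (1 - y*u)^2:
y, u = -1.0, 0.3
assert np.isclose(squared(y, u), (1.0 - y * u) ** 2)
```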
To this end, we prefer a continuous integral form rather than the discrete summation form in (8). In contrast to discrete methods such as SVM and graph Laplacian, the proposed framework operates in a continuous fashion where powerful mathematical analysis tools can play a role. Specifically, the calculus of variations is used to minimize the energy functional, leading to an Euler-Lagrange PDE. A typical procedure of this computational PDE approach has three steps: (1) set up the function learning problem in a continuous setting by designing a proper energy functional; (2) derive the Euler-Lagrange PDE via the calculus of variations; (3) finally, solve the PDE numerically on discrete data points.

3.2 Laplacian Regularization (LR)

A commonly used model with squared loss can be written as

$$\min_u\ \lambda S(u) + \sum_{i=1}^n \big( u(x_i) - y_i \big)^2.$$

If the RKHS norm is used as the smoothing term $S(u)$, the model is called regularized least squares (RLS) (Rifkin, 2002). Another natural choice is the squared $L^2$-norm of the gradient, $S(u) = \int |\nabla u|^2\, dx$, as proposed in Belkin et al. (2006). We need to move from the discrete cost function to a continuous functional to leverage powerful mathematical tools. Suppose $\Omega \subset \mathbb{R}^d$ is a regular region that contains all the given data points. In the continuous setting, we have the following Laplacian regularization (LR) model:

$$\int_\Omega \left[ \lambda |\nabla u|^2 + (u - y)^2 \right] dx. \qquad (9)$$

This LR model has been widely used in the image processing literature. By the calculus of variations (see Appendix B), the minimization is reduced to the following Euler-Lagrange PDE with a natural boundary condition on the boundary $\partial\Omega$:

$$-\lambda \Delta u + (u - y) = 0, \qquad \left. \frac{\partial u}{\partial \mathbf{n}} \right|_{\partial\Omega} = 0, \qquad (10)$$

where $\Delta u$ is the Laplacian of $u$, defined as

$$\Delta u := \nabla^2 u = \nabla \cdot \nabla u = \sum_{i=1}^d \frac{\partial^2 u}{\partial (x^{(i)})^2},$$

and $\mathbf{n}$ denotes the outer normal of $\partial\Omega$. The PDE (10) is relatively simple and can be easily solved by common methods in two and three dimensions. The next section provides a function approximation method for solving the PDE in high dimensions.
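In low dimensions, the PDE (10) can indeed be handled by standard grid methods. As a hedged illustration (a plain Jacobi iteration of our own, not one of the solvers proposed in this paper), the following sketch solves $-\lambda \Delta u + (u - y) = 0$ on a 2D grid, with the natural boundary condition imposed by replicating values at the border:

```python
import numpy as np

def solve_lr_pde(y, lam=1.0, h=1.0, iters=500):
    """Jacobi iteration for -lam * Laplacian(u) + (u - y) = 0 on a 2D grid.
    Border values are replicated, so the one-sided normal derivative
    vanishes (zero Neumann boundary condition)."""
    u = y.astype(float).copy()
    c = lam / h**2
    for _ in range(iters):
        up = np.pad(u, 1, mode='edge')      # replicate border: du/dn = 0
        nbrs = up[:-2, 1:-1] + up[2:, 1:-1] + up[1:-1, :-2] + up[1:-1, 2:]
        u = (y + c * nbrs) / (1.0 + 4.0 * c)  # Jacobi update
    return u

# Smoothing a binary step: u stays within [0, 1] and blurs across the
# edge; as lam -> 0, the solution approaches y itself.
y = np.zeros((16, 16)); y[:, 8:] = 1.0
u = solve_lr_pde(y, lam=0.5)
```

Each update is a convex combination of the data term and the neighbor average, which is why the iterates respect the range of $y$; the ratio is controlled by $\lambda/h^2$, mirroring the role of $\lambda$ in (10).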
One can observe that the PDE (10) is very similar to Poisson's equation $\Delta u = f$ in mathematical physics, where $f$ is a given function. Hence its behavior shares a certain degree of similarity with Poisson's equation. In particular, if $u$ fits $y$ perfectly (satisfying $u - y = 0$) in a small neighborhood of a particular point $x$, then by (10) we have $\Delta u = 0$, and further, by $u - y = 0$, we also have $\Delta y = 0$ in this neighborhood. Conversely, if $\Delta y \neq 0$ (implying that $y(x)$ is not a harmonic function), then we cannot obtain $u - y = 0$: otherwise by (10) we would have $\Delta u = 0$ and hence $\Delta y = 0$, contradicting our assumption $\Delta y \neq 0$. Therefore, the smoothness of the target variable $y(x)$ determines the fitting degree for supervised learning, and the regularization parameter $\lambda$ controls the strength of this connection. Throughout the paper, the natural boundary condition is adopted for ease of treatment. It is well known that boundary conditions can play a significant role in traditional low-dimensional PDE problems, where the shape of the domain boundary is explicitly determined. In those situations, boundary conditions are given by the underlying real problems and their physical meanings are clear. However, in our case of high-dimensional spaces for supervised learning, there is no need to specify the exact domain boundary as long as the domain contains all the data points. Often the input data is preprocessed by scaling each attribute into the range $[-1, +1]$ or $[0, 1]$, and hence in practice we define the domain of our TV/EE models as a d-dimensional hypercube. Scaling has long been an important step for using neural networks and SVM, with some advantages discussed in Hsu et al. (2007). Most of these considerations also apply to our algorithms.
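The attribute scaling mentioned above can be sketched as follows (a minimal utility of our own; the function name and the guard for constant columns are our choices):

```python
import numpy as np

def scale_to_unit(X, lo=-1.0, hi=1.0):
    """Scale each attribute (column) of X into [lo, hi], as is common
    practice before training SVMs or the TV/EE models."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    span = np.where(xmax > xmin, xmax - xmin, 1.0)  # guard constant columns
    return lo + (hi - lo) * (X - xmin) / span

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
Xs = scale_to_unit(X)   # each column now spans [-1, 1]
```

The same column-wise minima and maxima computed on the training set should be reused when scaling test points, so that train and test data live in the same hypercube.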
Recall that our focus is to learn the target function $u(x)$ on an active region that contains both the given training data and the future test data, whereas this active region is usually far away from the boundary of the hypercube domain in our settings. Hence boundary values in our high-dimensional models are not as important as in low-dimensional spaces, and we use the natural boundary condition purely for computational convenience, just like the related work in image processing. Note that in the GLS classifier (Varshney and Willsky, 2010), the issue of PDE boundary conditions was treated in a similar way.

3.3 Total Variation (TV)

Similar to image denoising, the total variation (TV) model for supervised learning can be formulated as

$$E_{TV}[u] = \int_\Omega \left[ \lambda |\nabla u| + \frac{1}{2}(u - y)^2 \right] dx. \qquad (11)$$

The only difference between LR and TV lies in the p-Sobolev regularizer, with p = 2 for LR and p = 1 for TV. Intuitively, LR penalizes gradients on edges too heavily due to the squared gradient magnitude, while TV is milder and permits sharper edges near the decision boundaries between two classes. Similarly, by the calculus of variations (see Appendix B), we get the following PDE, which is exactly the same as that in image denoising:

$$-\lambda \nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right) + (u - y) = 0. \qquad (12)$$

Note that with the curvature notation (6) for the associated level hypersurfaces, (12) can be compactly written as

$$-\lambda \kappa + (u - y) = 0. \qquad (13)$$

See Appendix A for this curvature notation in $\mathbb{R}^d$, which amounts to the mean curvature up to a constant factor $1/(d-1)$. The PDE (13) implies that the mean curvature $\kappa$ of all level hypersurfaces of the approximating function $u(x)$ imposes an equilibrium condition on the fitting process $u - y = 0$.

3.4 Euler's Elastica (EE)

The more sophisticated elastica model for supervised learning can be formulated as

$$E_{EE}[u] = \int_\Omega \left[ \lambda (a + b\kappa^2) |\nabla u| + \frac{1}{2}(u - y)^2 \right] dx, \qquad (14)$$

where $\kappa$ is given by (6). Due to the elastica regularizer, the final decision boundary and all level sets of $u(x)$ should have a low elastica energy.
If we set $a = 1$ and $b = 0$, this model degenerates to the total variation model. Therefore, a unified solution can be implemented for both TV and EE models, as described in the next section. Using the calculus of variations, we obtain the following PDE for the elastica model:

$$-\lambda \nabla \cdot \mathbf{V}(u) + (u - y) = 0, \qquad (15)$$

where the vector field $\mathbf{V}(u)$ is called the flux of the elastica energy related to $u(x)$, and can be expressed as a decomposition in the natural orthogonal frame $(\mathbf{N}, \mathbf{T})$:

$$\begin{aligned}
\mathbf{V}(u) &:= f(\kappa)\mathbf{N} - \frac{\mathbf{T}}{|\nabla u|} \frac{\partial}{\partial \mathbf{T}}\big(f'(\kappa)|\nabla u|\big) \\
&= f(\kappa)\mathbf{N} - \frac{1}{|\nabla u|} \Big\{ \nabla\big(f'(\kappa)|\nabla u|\big) - \big\langle \mathbf{N}, \nabla(f'(\kappa)|\nabla u|) \big\rangle \mathbf{N} \Big\} \\
&= f(\kappa)\mathbf{N} - \frac{1}{|\nabla u|} \nabla\big(f'(\kappa)|\nabla u|\big) + \frac{1}{|\nabla u|^3} \big\langle \nabla u, \nabla(f'(\kappa)|\nabla u|) \big\rangle \nabla u. \qquad (16)
\end{aligned}$$

Here $f(\kappa) := 1 + b\kappa^2$, fixing $a = 1$ for simplicity, and $\mathbf{N}$, $\mathbf{T}$ are the normal and tangent vectors given by

$$\mathbf{N} = \frac{\nabla u}{|\nabla u|}, \qquad \mathbf{T} = \mathbf{N}^\perp.$$

The directional derivative along $\mathbf{T}$ for a function $u$ is defined as the inner product of $\nabla u$ and $\mathbf{T}$: $\partial u/\partial \mathbf{T} := \nabla u \cdot \mathbf{T} = \langle \nabla u, \mathbf{T} \rangle$. See Appendix B for the detailed derivation from (14) to (15), which originates from Chan et al. (2002). When $b = 0$, (15) degenerates to (12), since $f'(\kappa) = 0$ and $\kappa = \nabla \cdot \mathbf{N}$. Again, the PDE (15) indicates that the divergence of the flux vector field, namely the first term $\nabla \cdot \mathbf{V}(u)$, imposes an equilibrium condition on the fitting process $u - y = 0$.

4. Numerical Algorithms

Due to the nonlinearity of the regularizers in the TV and EE models, the corresponding PDEs (12) and (15) are too complicated to be solved efficiently in high-dimensional spaces. Even though the PDE (10) associated with the LR model can be solved by the Finite Difference Method (FDM) or the Finite Element Method (FEM) in 2D or 3D spaces, currently we have no PDE tools to deal with such high-dimensional problems. Therefore we take a function approximation approach using radial basis functions (RBF), similar to the treatment in GLS (Varshney and Willsky, 2010). The computational PDE problem is then reduced to finding the expansion coefficients.
In the literature of image denoising and inpainting, dynamic programming was first employed to solve elastica-related image processing problems by Masnou and Morel (1998). The most widely used method is the computational PDE approach (Chan and Shen, 2005; Aubert and Kornprobst, 2006), partly for the following reasons: 1. The theory of PDEs is well established; 2. Many variational problems, or their regularized approximations, can often be computed effectively from their Euler-Lagrange equations; 3. As in classical mathematical physics, PDEs are powerful tools to describe, model, and simulate many dynamic as well as equilibrium phenomena. Later, graph-cut methods were applied to elastica models in Bae et al. (2011) and Komodakis and Paragios (2009). Several numerical solutions (Tai et al., 2011; Hahn et al., 2011; Duan et al., 2013) are based on the operator splitting technique and the augmented Lagrangian method (ALM), which decompose the original problem into a series of subproblems; each subproblem is either linear, and thus efficiently solvable by iterative solvers, or admits a closed-form solution. Recently, Bredies et al. (2013) proposed a convex, lower semicontinuous approximation of Euler's elastica energy for image processing tasks via functional lifting, which can be expressed as a linear program. However, it is still unclear whether these newly developed numerical methods are applicable to high-dimensional elastica problems. 4.1 Approximation by Radial Basis Functions The function approximation idea relies on the fact that a function u(x) can be expressed as a sum of weighted basis functions {φi(x)}. For instance, a Taylor expansion represents a function using polynomials as basis functions. The Ritz method is a direct method for solving variational problems by means of a linear combination of known basis functions.
In the machine learning literature, the most widely used basis functions are Gaussian radial basis function (RBF) kernels, which have simple expressions but powerful fitting ability. Hence we assume that the function u(x) to be learned has the representation
$$u(x) = \sum_{i=1}^{n} w_i\,\phi_i(x), \quad (17)$$
where {φi(x)} is a set of Gaussian RBF kernels φi(x) := exp(−(c/2)||x − xi||²). Here {xi} are the training samples in supervised learning, and c is a tunable parameter. Note that the granularity of this representation is well matched to the data size, since the number of RBFs equals the number of training samples. With the RBF approximation, the problem reduces to finding the coefficients {wi}. Our approach is thus similar to kernel machines with Gaussian RBF kernels, since the decision function is a linear combination of RBFs. The main difference is that our approach is based on the Euler's elastica regularization term, while kernel methods in the literature employ a squared norm of a reproducing kernel Hilbert space for regularization. Though numerous basis functions (also known as kernels) have been proposed, four basic types are often considered in the SVM literature and related books: linear, polynomial, sigmoid, and Gaussian RBF. In Hsu et al. (2007) the Gaussian RBF kernel is suggested as a reasonable first choice for training SVMs, for several reasons. Most of these considerations, such as the number of hyperparameters and the difficulties in numerical computations, also apply to our algorithms. In addition, one might consider other types of RBFs instead of Gaussians, such as the compactly supported RBFs used in scattered data interpolation (Wendland, 1995; Floater and Iske, 1996). The main purpose of compactly supported RBFs is to reduce computational complexity.
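As a concrete illustration of the representation (17), the following is a minimal NumPy sketch of the Gaussian RBF expansion and the regularized least-squares fit used later for initialization (function names and toy parameter values are our own, not from the paper):

```python
import numpy as np

def rbf_matrix(X, centers, c):
    """Psi[i, j] = phi_j(x_i) = exp(-(c/2) * ||x_i - x_j||^2)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * c * d2)

def fit_rbf_weights(X, y, c, eta):
    """Regularized least squares: w = (Psi^T Psi + eta I)^{-1} Psi^T y."""
    Psi = rbf_matrix(X, X, c)
    n = X.shape[0]
    return np.linalg.solve(Psi.T @ Psi + eta * np.eye(n), Psi.T @ y)

def predict(X_new, X_train, w, c):
    """u(x) = sum_i w_i phi_i(x), evaluated at the rows of X_new."""
    return rbf_matrix(X_new, X_train, c) @ w
```

With a small η the fit is nearly interpolatory on the training points, since the Gaussian kernel matrix is positive definite for distinct centers.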
However, compactly supported RBFs might cause numerical difficulties in the derivative calculations required by our algorithms. Let H(u) denote the Hessian matrix of u, and let I be an identity matrix of appropriate size. For brevity we also write φi for φi(x). Based on the RBF approximation (17), the following analytical expressions and handy notations will be used frequently later; see Appendix C for some of the derivations. Note that d is the dimension of the feature space.
$$\nabla\phi_i = -c(x - x_i)\phi_i, \qquad \Delta\phi_i = c(c|x - x_i|^2 - d)\phi_i, \quad (18)$$
$$H(\phi_i) = -c\phi_i I + c^2(x - x_i)(x - x_i)^T\phi_i, \quad (19)$$
$$\nabla u = \sum_i w_i\nabla\phi_i = -c\sum_i w_i(x - x_i)\phi_i = -c\,g, \qquad g := \sum_i w_i(x - x_i)\phi_i, \quad (20)$$
$$\Delta u = \sum_i w_i\Delta\phi_i = c\sum_i w_i(c|x - x_i|^2 - d)\phi_i, \qquad H(u) = -c\Big(\sum_i w_i\phi_i\Big)I + c^2\Phi, \qquad \Phi := \sum_i w_i(x - x_i)(x - x_i)^T\phi_i, \quad (21)$$
$$N := \frac{\nabla u}{|\nabla u|} = -\frac{g}{|g|}, \quad (22)$$
$$\kappa = \nabla\cdot N = \frac{\Delta u}{|\nabla u|} - \frac{\nabla u^T H(u)\nabla u}{|\nabla u|^3} = \frac{1}{|g|}\sum_i w_i(c|x - x_i|^2 - d + 1)\phi_i - \frac{c\,g^T\Phi g}{|g|^3}. \quad (23)$$
4.2 Algorithm for LR First, let us consider the simplest LR model, which requires solving the linear elliptic PDE (10): −λΔu + (u − y) = 0. Using the RBF approximation (17) and the linearity of the Laplacian operator, the goal reduces to finding a set of weights {wi} such that
$$\sum_i w_i(\phi_i - \lambda\Delta\phi_i) = y.$$
Let w := (w1, w2, ..., wn)^T and y := (y1, y2, ..., yn)^T, where n is the number of training samples. Then w can be found from the system of linear equations
$$Aw = y, \qquad A_{ij} = \phi_j(x_i) - \lambda\Delta\phi_j(x_i).$$
Numerically, the following regularized least squares solution is adopted in practice to avoid ill-posedness:
$$\min_w |Aw - y|^2 + \eta|w|^2.$$
The closed-form solution is simply w = (A^T A + ηI)^{-1}A^T y, which can be computed quickly. It is interesting to see that both classification and regression problems can be solved by fitting a system of linear equations. Naturally, the LR method can be regarded as a generalization of linear regression Xw = y, or of ridge regression min_w |Xw − y|² + η|w|² (Hastie et al., 2009, chap. 3), in which the original data matrix X is replaced by the new data matrix A(X).
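The LR algorithm above can be sketched in a few lines of NumPy (our own toy implementation; the Laplacian formula from (18) can also be sanity-checked against central finite differences):

```python
import numpy as np

def gauss(x, xi, c):
    """phi_i(x) = exp(-(c/2) * ||x - xi||^2)."""
    return np.exp(-0.5 * c * np.sum((x - xi) ** 2))

def lap_gauss(x, xi, c):
    """Laplacian of a Gaussian RBF, as in (18): c*(c*||x - xi||^2 - d)*phi_i(x)."""
    d = x.shape[0]
    return c * (c * np.sum((x - xi) ** 2) - d) * gauss(x, xi, c)

def lr_fit(X, y, c, lam, eta):
    """Build A_ij = phi_j(x_i) - lam * Laplacian(phi_j)(x_i),
    then solve the regularized least-squares problem for w."""
    n = X.shape[0]
    A = np.array([[gauss(X[i], X[j], c) - lam * lap_gauss(X[i], X[j], c)
                   for j in range(n)] for i in range(n)])
    return np.linalg.solve(A.T @ A + eta * np.eye(n), A.T @ y)
```

A quick finite-difference check of `lap_gauss` against the sum of second partials confirms the closed-form Laplacian.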
4.3 Algorithms for the TV and EE Models As the TV model is a degenerate case of the EE model, we describe solutions for the more complicated EE model in this section. Two algorithms are developed to tackle the nonlinearity in (15): (1) gradient descent time marching, and (2) lagged linear equation iteration. 4.3.1 Gradient Descent Time Marching A standard solution is steepest gradient descent marching with an artificial time t:
$$u_t = \lambda\,\nabla\cdot\Big(\frac{\nabla u}{|\nabla u|}\Big) - (u - y) \quad (24)$$
for the total variation PDE (12), and
$$u_t = \lambda\,\nabla\cdot V - (u - y) \quad (25)$$
for the elastica PDE (15). Note that by setting u_t = −∂E/∂u, the energy functional E decreases along the gradient direction as time marches on. Here the partial derivative ∂E/∂u can be obtained from the first variation of E (see Appendix A). For image processing tasks, these gradient descent flows can be computed on the natural regular grid of the image domain; for high-dimensional data spaces such a computation is prohibitive. With the function approximation (17), a more practical way is to run the gradient descent flow on the weight vector w. Consider the matrix form of (17) on all training data points:
$$\big(u(x_1), \ldots, u(x_n)\big)^T = \Psi w, \qquad \Psi_{ij} := \phi_j(x_i).$$
Thus we have the gradient descent flow for w:
$$\Psi\frac{dw}{dt} = \Big(\frac{\partial u}{\partial t}\Big|_{x=x_1}, \ldots, \frac{\partial u}{\partial t}\Big|_{x=x_n}\Big)^T.$$
Then in each iteration the weight vector w is updated by
$$w^{(k+1)} = w^{(k)} + \tau\,\Psi^{-1}\Big(u_t^{(k)}\big|_{x=x_1}, \ldots, u_t^{(k)}\big|_{x=x_n}\Big)^T,$$
where τ is a small time step. We first initialize the weight vector as w^{(0)} = (Ψ^TΨ + ηI)^{-1}Ψ^T y, by solving the regularized least squares problem Ψw = y with regularization parameter η. Then we set u^{(0)} = Ψw^{(0)} and run the iteration by computing w^{(k+1)} and u^{(k+1)} alternately. We now give some details on the computation of the partial derivatives. Clearly the partial derivative u_t in (24) can be obtained from (23).
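The TV time-marching loop at the training points can be sketched as follows (a toy NumPy implementation with our own variable names, using the closed-form curvature of Section 4.1; the small ridge term in the solves and the curvature clipping are our own stabilizations, not spelled out in the paper):

```python
import numpy as np

def tv_time_marching(X, y, c=2.0, lam=0.01, tau=0.05, eta=1e-6, iters=10):
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]             # diff[p, j] = x_p - x_j
    r2 = (diff ** 2).sum(axis=-1)
    Psi = np.exp(-0.5 * c * r2)                      # Psi[p, j] = phi_j(x_p)
    # initialization: w^(0) = (Psi^T Psi + eta I)^{-1} Psi^T y
    w = np.linalg.solve(Psi.T @ Psi + eta * np.eye(n), Psi.T @ y)
    for _ in range(iters):
        g = np.einsum('j,pjk,pj->pk', w, diff, Psi)  # g at each training point, Eq. (20)
        gn = np.linalg.norm(g, axis=1) + 1e-8        # |g|, floored to avoid division by zero
        s = (Psi * (c * r2 - d + 1)) @ w             # sum_j w_j (c r^2 - d + 1) phi_j
        proj = np.einsum('pk,pjk->pj', g, diff)      # g_p . (x_p - x_j)
        q = ((proj ** 2) * Psi) @ w                  # g^T Phi g at each point
        kappa = s / gn - c * q / gn ** 3             # curvature, Eq. (23)
        kappa = np.clip(kappa, -10.0, 10.0)          # our own stabilization
        u = Psi @ w
        ut = lam * kappa - (u - y)                   # Eq. (24) at the training points
        w = w + tau * np.linalg.solve(Psi + eta * np.eye(n), ut)
    return w, Psi @ w
```

On a toy two-blob classification problem the marching keeps the training signs essentially intact while smoothing the level sets.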
By omitting the third- and higher-order terms, ∇·V can be expanded into the following expression (see Appendix D):
$$\nabla\cdot V = \kappa + b\kappa^3 - \frac{2b(\Delta u)^2}{|\nabla u|^5}\alpha + \frac{6b\,\Delta u}{|\nabla u|^7}\alpha^2 - \frac{6b\,\kappa}{|\nabla u|^6}\alpha^2 + \frac{6b}{|\nabla u|^7}\alpha\beta + \frac{2b}{|\nabla u|^5}\gamma, \quad (26)$$
where α := ∇u^T H(u)∇u, β := ∇u^T H(u)²∇u, and γ := ∇u^T H(u)³∇u. We can see that setting b = 0 reduces ∇·V to κ = ∇·(∇u/|∇u|), which is exactly the expression of the TV model. The time complexity of each iteration is O(n²d), where n is the number of data points and d is the dimension. There are three parameters in the algorithm: the RBF parameter c, the regularization parameter λ, and the elastica weight parameter b. Note that we always set a = 1, since a can be absorbed into λ. 4.3.2 Lagged Linear Equation Iteration Following the spirit of the lagged diffusivity fixed-point iteration method (Chan and Shen, 2005), we develop the following lagged linear equation iteration method; empirically, the original lagged diffusivity fixed-point iteration often yields poor performance due to its brute-force linearization of the nonlinear PDE. For the simpler TV model, expanding the curvature term with (23) gives
$$-\lambda\Big\{\frac{\Delta u}{|\nabla u|} - \frac{\nabla u^T H(u)\nabla u}{|\nabla u|^3}\Big\} + (u - y) = 0,$$
or equivalently, by the RBF approximation,
$$-\lambda\Big\{\sum_i w_i(1 - d + c|x - x_i|^2)\phi_i - \frac{c\,g^T\Phi g}{|g|^2}\Big\} + |g|\Big\{\sum_i w_i\phi_i - y\Big\} = 0.$$
The above nonlinear equation in w is rather complex, as g and Φ contain the unknown w. To simplify it, we use an iteration method that computes w or g alternately while fixing the other. First, w is initialized as a random vector; then g can be computed according to (20).
Now assuming that g is fixed, we have
$$g^T\Phi g = g^T\Big[\sum_i w_i(x - x_i)(x - x_i)^T\phi_i\Big]g = \sum_i w_i\phi_i\,g^T(x - x_i)(x - x_i)^T g.$$
Thus the original nonlinear equation in w becomes the linear equation
$$-\lambda\sum_i w_i h_i\phi_i + |g|\Big\{\sum_i w_i\phi_i - y\Big\} = 0, \qquad h_i := 1 - d + c|x - x_i|^2 - \frac{c\,g^T(x - x_i)(x - x_i)^T g}{|g|^2}.$$
Using the lagged idea, we obtain the method of lagged linear equation iteration: (1) fixing g, solve the system of linear equations for w to get a new w; (2) compute g with the updated w; (3) iterate until convergence or until a maximal iteration number is reached. For the more complicated EE model, the corresponding PDE has to be simplified greatly. Following the lagged idea again, we first assume the curvature term K := a + bκ² to be fixed. Then K can be absorbed into λ, leading in a similar way to the linear equation
$$-\lambda\sum_i w_i K h_i\phi_i + |g|\Big\{\sum_i w_i\phi_i - y\Big\} = 0.$$
Similarly, a two-step lagged iteration procedure can be developed for the EE model: (1) fixing g and K, solve the linear system for w; (2) compute g and K with the updated w; (3) iterate until convergence or until a maximal iteration number is reached. There are three parameters: c, λ, and the regularization parameter η (chosen empirically in experiments) in the least squares problems. 5. Theoretical Properties In this section we carry out some theoretical analysis of elastica based supervised learning algorithms within the framework of statistical learning theory (SLT) (Vapnik, 1998; Bousquet et al., 2004; Boucheron et al., 2005; von Luxburg and Schölkopf, 2008). First we present the existence and uniqueness analysis of our TV/EE solutions. Then we prove that elastica based classifiers are universally consistent, building mainly on the pioneering work of Steinwart (2005) for SVMs and other regularized kernel classifiers. 5.1 Existence and Uniqueness of TV We first consider the TV model (11), a special yet useful case of the elastica model (14). It is well known that the existence and uniqueness analysis can be carried out for the TV model in image processing tasks.
Since most properties of a BV function are independent of the data dimension, the following proof in R^d is a straightforward but detailed extension of the highly condensed proof for the TV-based image denoising model in (Chan and Shen, 2005, Theorem 4.14 in chap. 4). Before giving the theorem on existence and uniqueness, we first review several major properties of BV functions (Chan and Shen, 2005, Section 2.2.2; Aubert and Kornprobst, 2006, Section 2.2.3) that are frequently used in the following proofs.

Theorem 1 (1) (Completeness) BV(Ω) ⊂ L¹(Ω) is a Banach space under the BV norm ||u||_BV = ∫_Ω (|u| + |∇u|) dx. (2) (Weak Compactness) Let {un} be a bounded sequence in BV(Ω), where Ω is a Lipschitz domain. Then there exists a subsequence which converges in L¹(Ω). (3) (L¹-Lower Semicontinuity) Suppose a sequence {un} converges to u in L¹(Ω). Then
$$\int_\Omega |\nabla u|\,dx \le \liminf_{n\to\infty}\int_\Omega |\nabla u_n|\,dx.$$
In particular, if {un} is a bounded sequence in BV(Ω), then u belongs to BV(Ω) as well.

Theorem 2 (Existence and Uniqueness of TV) Under the assumption that the given target function y(x) ∈ L²(Ω) with x ∈ R^d, the minimization problem
$$E_{TV}[u] = \int_\Omega \Big[\frac{1}{2}(u - y)^2 + \lambda|\nabla u|\Big]dx$$
admits a unique solution û(x) ∈ BV(Ω).

Proof We first show existence. E_TV is finite for at least one BV function, for instance the constant function u(x) ≡ (1/|Ω|)∫_Ω y(x)dx over Ω, for which |∇u| = 0. Thus there exist BV functions with finite E_TV values, and clearly 0 is a lower bound of these values. Hence this nonempty set of E_TV values, bounded below by 0, has an infimum E_0 (≥ 0). Since E_0 is an infimum, we can select a sequence of BV functions {ui} with bounded E_TV values such that their E_TV values converge to E_0. Such a sequence {ui} must also be bounded in BV(Ω) with respect to the BV norm, since the TV seminorm ∫_Ω|∇u|dx is contained in E_TV and BV(Ω) ⊂ L¹(Ω).
According to the weak compactness of the BV space, the bounded sequence {ui} in BV(Ω) has a subsequence, indexed by i(k), k = 1, 2, ..., which converges in L¹(Ω). By the completeness of L¹(Ω), let û ∈ L¹(Ω) be its limit. By the L¹-lower semicontinuity of the TV seminorm, we have
$$\int_\Omega |\nabla\hat u|\,dx \le \liminf_{k\to\infty}\int_\Omega |\nabla u_{i(k)}|\,dx,$$
and also û ∈ BV(Ω), since {ui} is a bounded sequence in BV(Ω). Observe that E_TV is lower semicontinuous with respect to the L¹(Ω) topology, because both of its components, the squared loss term and the TV seminorm, are lower semicontinuous. That is,
$$E_{TV}[\hat u] \le \liminf_{k\to\infty} E_{TV}[u_{i(k)}] = \inf_{u\in BV(\Omega)} E_{TV}[u] = E_0,$$
indicating that û ∈ BV(Ω) attains the minimum of E_TV. Uniqueness follows directly from the strict convexity of E_TV. By the Minkowski inequality ||f + g||_{L^p} ≤ ||f||_{L^p} + ||g||_{L^p}, the TV seminorm is convex (but not strictly convex):
$$\int_\Omega |\nabla(\alpha u + (1 - \alpha)v)| = \int_\Omega |\alpha\nabla u + (1 - \alpha)\nabla v| \le \alpha\int_\Omega |\nabla u| + (1 - \alpha)\int_\Omega |\nabla v|,$$
where α ∈ [0, 1]. The squared loss term ∫_Ω(u − y)² is clearly strictly convex. Combining the two components, E_TV is strictly convex, and therefore its minimizer û ∈ BV(Ω) is unique.

In the image processing literature there are several variants of the existence and uniqueness analysis for different TV models. Chan et al. (2002) discussed the existence of TV inpainting models in the noise-free and noisy cases, but neglected uniqueness. Aubert and Kornprobst (2006) present the existence and uniqueness analysis for the TV-based image restoration problem
$$\min_u E_{TV}[u] = \int_\Omega \Big[\frac{1}{2}(Ru - y)^2 + \lambda\varphi(|\nabla u|)\Big]dx,$$
where R is a linear blurring operator and φ is a strictly convex and nondecreasing cost function. 5.2 Existence of EE We now consider the more complicated elastica model. Ambrosio and Masnou (2003) proved that a relaxed version of elastica-based image inpainting has at least one solution in BV(Ω).
Here we give an existence proof for a discrete elastica model for binary classification, adapted from the elegant proof in Steinwart (2005) for SVMs and other regularized kernel classifiers. Existence is the first step toward the consistency proof in the next subsection; however, the solution of the elastica model can be non-unique, owing to the lack of convexity of this energy functional. We begin with some preliminary notation. In the following, let R̄ = [−∞, +∞], R+ = [0, +∞), and R̄+ = [0, +∞]. A binary classifier is a rule that assigns to every training set T = {(x1, y1), ..., (xn, yn)} ∈ (X × Y)^n (with Y = {−1, +1} for binary problems) a measurable function f : X → R̄, with the final decision given by sign f(x). Similar to the gray scale constraint in image processing tasks, we assume that f takes values in a bounded interval (e.g., [−2, 2]), since f should approximate y ∈ {−1, +1} and the classification decision depends only on the sign of f. Sometimes we use the looser condition f ∈ L∞(X). For a given loss function L(y, f(x)), write the cost function C(α, t) := αL(1, t) + (1 − α)L(−1, t) for α := P(Y = 1|X = x) ∈ [0, 1] and t ∈ R̄. For fixed α, define M(α) and the corresponding tα by M(α) := C(α, tα) := min_t C(α, t). We then give the basic condition on the loss function L guaranteeing that the minimizer tα of C(α, t) tends to have the same sign as the Bayes decision rule.

Definition 3 A continuous function L(y, f(x)) is called an admissible loss function if for every α ∈ [0, 1] and tα ∈ R̄ we have tα < 0 if α < 1/2 and tα > 0 if α > 1/2.

A similar concept called classification-calibrated can be found in Bartlett et al. (2006), requiring that an incorrect sign of tα always leads to a strictly larger M(α). The classification-calibrated condition generalizes the requirement of an admissible loss that the minimizer of C(α, t) (if it exists) has the correct sign.
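As an example of Definition 3, the quadratic loss L(y, t) = (1 − yt)², which is used later in our classification models, yields C(α, t) = α(1 − t)² + (1 − α)(1 + t)² with closed-form minimizer tα = 2α − 1, negative for α < 1/2 and positive for α > 1/2. A quick numerical check (our own toy script, not from the paper):

```python
import numpy as np

def cost(alpha, t):
    """C(alpha, t) = alpha*L(1, t) + (1 - alpha)*L(-1, t) for the squared loss."""
    return alpha * (1 - t) ** 2 + (1 - alpha) * (1 + t) ** 2

ts = np.linspace(-2, 2, 4001)        # grid search over t with step 0.001
for alpha in [0.1, 0.3, 0.49, 0.51, 0.8]:
    t_star = ts[np.argmin(cost(alpha, ts))]
    # closed-form minimizer is t_alpha = 2*alpha - 1 ...
    assert abs(t_star - (2 * alpha - 1)) < 1e-3
    # ... whose sign agrees with the Bayes decision rule
    assert np.sign(t_star) == np.sign(alpha - 0.5)
```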
The admissibility of L is necessary for developing universally consistent classifiers (Steinwart, 2005). In particular, the quadratic loss L(y, f(x)) = (1 − yf(x))² used in our classification models is admissible and classification-calibrated; other examples can be found in Steinwart (2005) and Bartlett et al. (2006). In the following we always assume that L(y, f(x)) is a margin-based admissible loss function that is continuous with respect to the margin yf(x).

Definition 4 Let S(λ, t) : R+ × R̄+ → R̄+ be a function that is increasing in λ and t, continuous at 0 with respect to λ, and unbounded with respect to t; moreover, for all λ > 0 there exists a t > 0 such that S(λ, t) < ∞. We call S(λ, t) a regularization function if S(λ, 0) = S(0, s) = 0 for all λ > 0 and s ∈ R+, and if for all λ > 0, t ∈ R̄+, and all sequences {tn} ⊂ R̄+ with tn → t and S(λ, tn) < ∞, we have S(λ, tn) → S(λ, t). In our TV/EE models, S(λ, t) = λt² clearly satisfies these requirements. This regularization function is a typical setting in several variants of SVMs (Steinwart, 2005), which differ mainly in their loss functions.

Definition 5 The (0-1) risk of a measurable function f : X → R̄ is defined by
$$R_P(f) := P\big(\{(x, y) : \mathrm{sign}\,f(x) \ne y\}\big) = E_{(x,y)\sim P}\,\mathbf{1}\big(y \ne \mathrm{sign}\,f(x)\big).$$
The smallest achievable risk R*_P := inf{R_P(f) : f : X → R̄ measurable} is called the Bayes risk of P.

Definition 6 Given an admissible loss function L and a probability measure P, the L-risk of a measurable function f : X → R̄ is defined by
$$R_{L,P}(f) := E_{(x,y)\sim P}\,L(y, f(x)) = \int_X C\big(P(Y = 1|X = x), f(x)\big)\,P_X(dx).$$
The smallest possible L-risk is denoted by R*_{L,P}. Furthermore, given a regularization function S, the regularized L-risk is defined by
$$R^{reg}_{L,P,\lambda}(f) := S(\lambda, \|f\|_{EE}) + R_{L,P}(f)$$
for all λ > 0. Here ||f||²_EE := ∫_X (1 + bκ²)|∇f| dx, with κ = ∇·(∇f/|∇f|), is the Euler's elastica regularizer with a slightly abused norm notation.
If the curvature term is dropped, it degenerates to the TV seminorm ||f||²_TV := ∫_X |∇f| dx. If P is an empirical measure with respect to T ∈ (X × Y)^n, we write R_{L,T}(f) and R^{reg}_{L,T,λ}(f), respectively.

Theorem 7 (Existence of EE) For every Borel probability measure P on X × Y and all λ > 0, there exists a function f_{P,λ} ∈ BV(X) minimizing the regularized L-risk R^{reg}_{L,P,λ}(f). Moreover, every such f_{P,λ} ∈ BV(X) satisfies ||f_{P,λ}||_EE ≤ δ_λ, where δ_λ := sup{t : S(λ, t) ≤ 2[L(1, 0) + L(−1, 0)]}.

Proof The following proof is adapted from Steinwart (2005, Lemma 3.1); the difference lies in R^{reg}_{L,P,λ}(f), where the original RKHS norm ||f||_H for SVMs is replaced by the pseudo-norm ||f||_EE for EE. The proof consists of five steps.

A. Clearly R^{reg}_{L,P,λ}(f) is finite for the constant BV function f(x) ≡ E_{(x,y)∼P} y or f(x) ≡ 0, for which |∇f| = 0. Thus there exist BV functions with finite R^{reg}_{L,P,λ} values. For all ε ∈ (0, L(1, 0) + L(−1, 0)], by the definition of an infimum we can select a function f_ε ∈ L¹(X) with
$$R^{reg}_{L,P,\lambda}(f_\varepsilon) \le \inf_{f\in L^1(X)} R^{reg}_{L,P,\lambda}(f) + \varepsilon.$$
Now we have
$$\inf_{f\in L^1(X)} R^{reg}_{L,P,\lambda}(f) \le R^{reg}_{L,P,\lambda}(0) = S(\lambda, \|0\|_{EE}) + R_{L,P}(0) = 0 + E_{(x,y)\sim P}\,L(y, 0) = E_x\big[P(y = 1|x)L(1, 0) + P(y = -1|x)L(-1, 0)\big] \le L(1, 0) + L(-1, 0),$$
where the condition S(λ, 0) = 0 is used in the second equality. Furthermore,
$$S(\lambda, \|f_\varepsilon\|_{EE}) \le S(\lambda, \|f_\varepsilon\|_{EE}) + R_{L,P}(f_\varepsilon) = R^{reg}_{L,P,\lambda}(f_\varepsilon) \le \inf_{f\in L^1(X)} R^{reg}_{L,P,\lambda}(f) + \varepsilon \le 2[L(1, 0) + L(-1, 0)].$$
As S(λ, t) is increasing in t, we obtain the boundedness of ||f_ε||²_EE. Since ||f||²_TV ≤ ||f||²_EE, we also have the boundedness of ||f_ε||²_TV, and f_ε ∈ BV(X).

B. The Bolzano-Weierstrass theorem states that every bounded sequence in R^n has a convergent subsequence. In functional analysis, the Eberlein-Smulian theorem (Conway, 1990, Theorem 13.1 in chap. 5) states that three kinds of weak compactness are equivalent in a Banach space.
In particular, we will use the sequential compactness property of a subset A of a Banach space: every sequence in A has a subsequence converging weakly to a limit in A. Recall that BV(X) is a Banach space. By the two theorems, there exist f_{P,λ} ∈ BV(X), a sequence {f_{εn}} ⊂ BV(X), and two finite numbers c1, c2 ∈ R+ such that ||f_{εn}||_EE ≤ c1, ||f_{εn}||_TV ≤ c2, and f_{εn} → f_{P,λ} weakly. Note that weak convergence implies that f_{P,λ} is uniquely determined, that ||f_{P,λ}||_BV ≤ lim inf_n ||f_{εn}||_BV, and that ||f_{P,λ}||_{L¹} ≤ lim inf_n ||f_{εn}||_{L¹}, since BV(X) ⊂ L¹(X) (Yosida, 1999, Theorems 5 and 9 in Chapter V.1). In particular, by the weak compactness of the BV space, {f_{εn}} converges to f_{P,λ} in L¹(X). Thus y f_{εn}(x) → y f_{P,λ}(x), since the margin is a linear functional of f. As L is continuous with respect to the margin, we obtain L(y, f_{εn}(x)) → L(y, f_{P,λ}(x)) for all (x, y) ∈ X × Y. Recall that |L(y, f_{εn}(x))| is uniformly bounded, by the boundedness assumption on |f| and the continuity of L. Therefore the bounded convergence theorem (a special case of the Lebesgue dominated convergence theorem) implies
$$R_{L,P}(f_{\varepsilon_n}) = \int L(y, f_{\varepsilon_n}(x))\,P(dx, dy) \to \int L(y, f_{P,\lambda}(x))\,P(dx, dy) = R_{L,P}(f_{P,\lambda}).$$

C. By R_{L,P}(f_{εn}) → R_{L,P}(f_{P,λ}), for a fixed ρ > 0 there exists an index n0 such that for all n ≥ n0 we have both εn ≤ ρ and R_{L,P}(f_{P,λ}) − R_{L,P}(f_{εn}) ≤ ρ. In other words, we obtain the inequalities
$$S(\lambda, \|f_{\varepsilon_n}\|_{EE}) + R_{L,P}(f_{P,\lambda}) - \rho \le S(\lambda, \|f_{\varepsilon_n}\|_{EE}) + R_{L,P}(f_{\varepsilon_n}) = R^{reg}_{L,P,\lambda}(f_{\varepsilon_n}) \le \inf_{f\in L^1(X)} R^{reg}_{L,P,\lambda}(f) + \varepsilon_n \le R^{reg}_{L,P,\lambda}(f_{P,\lambda}) + \varepsilon_n = S(\lambda, \|f_{P,\lambda}\|_{EE}) + R_{L,P}(f_{P,\lambda}) + \varepsilon_n,$$
where the second inequality follows from the definition of f_{εn}. This implies
$$S(\lambda, \|f_{\varepsilon_n}\|_{EE}) \le S(\lambda, \|f_{P,\lambda}\|_{EE}) + \varepsilon_n + \rho \le S(\lambda, \|f_{P,\lambda}\|_{EE}) + 2\rho.$$
On the other hand, we need an inequality in the opposite direction. By weak convergence we already have ||f_{P,λ}||_BV ≤ lim inf_n ||f_{εn}||_BV and ||f_{P,λ}||_{L¹} ≤ lim inf_n ||f_{εn}||_{L¹}; however, these two inequalities say nothing about ||f||_EE.
Concerning the lower semicontinuity of the L^p norm of the mean curvature, Leonardi and Masnou (2009, Theorem 4.4) proved that
$$F_p(f) = \int_X |\nabla f|\Big(1 + \Big|\nabla\cdot\frac{\nabla f}{|\nabla f|}\Big|^p\Big)dx$$
is lower semicontinuous in the class of C²(R^d) functions whenever p ≥ 1 for d = 2 or p ≥ 2 for d ≥ 3. An earlier result (Ambrosio and Masnou, 2003, Theorem 6) required p > d − 1 for d ≥ 2. Of course the definition of F_p(f) is valid only for a certain class of smooth functions, and we use the relaxed functional (Ambrosio and Masnou, 2003; Leonardi and Masnou, 2009)
$$\bar F_p(f) = \inf\Big\{\liminf_{h\to\infty} F_p(f_h) : f_h \to f \text{ in } L^1\Big\}$$
to extend it to the whole space L¹(R^d) (including BV(X)). We also have lower semicontinuity of F̄_p(f) (Ambrosio and Masnou, 2003, Theorem 5), and F̄_p(f) = F_p(f) whenever f ∈ C²(X) (Leonardi and Masnou, 2009, Theorem 4.4). Immediately we obtain
$$\|f_{P,\lambda}\|_{EE} \le \liminf_{n\to\infty}\|f_{\varepsilon_n}\|_{EE},$$
and thus, by the increasing property of S(λ, t),
$$S(\lambda, \|f_{P,\lambda}\|_{EE}) \le \lim_{n\to\infty} S(\lambda, \|f_{\varepsilon_n}\|_{EE}).$$
Combining the inequalities in the two directions yields
$$\lim_{n\to\infty} S(\lambda, \|f_{\varepsilon_n}\|_{EE}) = S(\lambda, \|f_{P,\lambda}\|_{EE}).$$

D. Combining R_{L,P}(f_{εn}) → R_{L,P}(f_{P,λ}) with S(λ, ||f_{εn}||_EE) → S(λ, ||f_{P,λ}||_EE), we have R^{reg}_{L,P,λ}(f_{εn}) → R^{reg}_{L,P,λ}(f_{P,λ}). Because the definition of {f_{εn}} gives R^{reg}_{L,P,λ}(f_{εn}) → inf_{f∈L¹(X)} R^{reg}_{L,P,λ}(f), we have found an f_{P,λ} ∈ BV(X) ⊂ L¹(X) such that
$$R^{reg}_{L,P,\lambda}(f_{P,\lambda}) = \inf_{f\in L^1(X)} R^{reg}_{L,P,\lambda}(f).$$

E. The second assertion, ||f_{P,λ}||_EE ≤ δ_λ, follows from the boundedness of f_ε established in step A.

5.3 Binary Classification Consistency In classical statistics, a statistic θ̂_n is a consistent estimator of a parameter θ based on a sample of size n if and only if for any ε > 0, lim_{n→∞} P(|θ̂_n − θ| > ε) = 0. In the same spirit, it is natural to require that a learning algorithm eventually converge to an optimal solution as more and more training examples are presented.
In the machine learning literature, there exist two different types of consistency, depending on whether the optimal solution belongs to some particular function space or to the space of all measurable functions (von Luxburg and Schölkopf, 2008). The latter is often called Bayes consistency: the risk of the learned classifier converges to the risk of the Bayes optimal decision rule. It is well accepted that a good learning algorithm should satisfy this asymptotic property when the data size is sufficiently large. The literature on the consistency analysis of learning algorithms can be roughly classified into the following categories: (1) binary classification (Zhang, 2004a; Bartlett et al., 2006), in particular for SVMs (Steinwart, 2005), for boosting (Bartlett and Traskin, 2007), and for random forests (Biau et al., 2008); (2) multi-class classification (Zhang, 2004b; Tewari and Bartlett, 2007; Glasmachers, 2010); (3) regression (Zakai and Ritov, 2009); (4) learning to rank (Cossock and Zhang, 2008; Xia et al., 2008; Duchi et al., 2010); (5) multi-label learning (Gao and Zhou, 2013). The work of Biau et al. (2008) showed that some popular classifiers, including Breiman's original random forest classifier, are not consistent. We first formalize the definitions of several kinds of consistency used in this section, following von Luxburg and Schölkopf (2008) and Steinwart (2005).

Definition 8 A classifier f_n is said to be (Bayes) consistent with respect to a given probability measure P if the risk R(f_n) converges in probability to the Bayes risk, that is, for all ε > 0,
$$P\big(R(f_n) - R(f^*) > \varepsilon\big) \to 0 \quad \text{as } n \to \infty,$$
where R(f) := P({(x, y) : sign f(x) ≠ y}) is the risk of a classifier f and f* denotes the Bayes classifier. Furthermore, f_n is said to be universally consistent if it is consistent for all distributions P on X × Y. It is called strongly universally consistent if the limiting property even holds almost surely (a.s.), that is,
$$P\Big(\lim_{n\to\infty} R(f_n) = R(f^*)\Big) = 1.$$
Note that the Bayes risk is the minimum achievable over the space of all measurable functions, so we always have R(f_n) ≥ R(f*), and there is no need for the absolute value used in classical statistics. We also need the notion of simple functions to approximate functions in L^p(X).

Definition 9 A simple function is a function ψ : X → R of the form ψ(x) = Σ_{i=1}^m c_i χ_{A_i}(x), where χ_A is the indicator function of the set A and {c_i} ⊂ R. Equivalently, a simple function is one that takes only finitely many values.

Proposition 10 (From Regularized to Unregularized) For every Borel probability measure P on X × Y, we have
$$\lim_{\lambda\to 0} R^{reg}_{L,P,\lambda}(f_{P,\lambda}) = R^*_{L,P},$$
where f_{P,λ} ∈ BV(X) minimizes the regularized L-risk R^{reg}_{L,P,λ}(f), and R*_{L,P} is the smallest possible L-risk achievable by any measurable function f : X → R̄.

Proof First, by the definition of f_{P,λ} we have
$$\lim_{\lambda\to 0} R^{reg}_{L,P,\lambda}(f_{P,\lambda}) = \lim_{\lambda\to 0}\inf_{f\in BV(X)} R^{reg}_{L,P,\lambda}(f) = \lim_{\lambda\to 0}\inf_{f\in BV(X)}\big\{S(\lambda, \|f\|_{EE}) + R_{L,P}(f)\big\} = \inf_{f\in BV(X)}\big\{\lim_{\lambda\to 0} S(\lambda, \|f\|_{EE}) + R_{L,P}(f)\big\} = \inf_{f\in BV(X)} R_{L,P}(f),$$
since S(λ, ·) is continuous at 0 with respect to λ and S(0, ·) = 0. Next we show that the identities
$$\inf_{f\in BV(X)} R_{L,P}(f) = \inf_{f\in L^1(X)} R_{L,P}(f) = R^*_{L,P}$$
hold for the sequence of embedded spaces BV(X) ⊂ L¹(X) ⊂ {f : X → R̄ measurable}, which suffices to prove the assertion. We first check the first identity. Recall that the simple functions belonging to L^p(X) are dense in L^p(X) for 1 ≤ p ≤ ∞ (Hunter, 2011, Theorem 7.8). Note that an integrable simple function belongs to L^p(X) for 1 ≤ p < ∞ if and only if µ(A_i) < ∞ for each A_i ⊂ X with c_i ≠ 0, that is, its support has finite µ-measure; on the other hand, every simple function belongs to L^∞. We restrict the discussion to bounded functions in L^p(X), since any unbounded f ∈ L^p(X) can be replaced by a modified bounded f ∈ L^p(X) that makes the loss L smaller.
Hence the density property implies that for every bounded f ∈ L^p(X) (1 ≤ p ≤ ∞), there exists a sequence of simple functions g_n such that ||f − g_n||_{L^p} → 0 and |g_n(x)| ≤ |f(x)| pointwise. Strong convergence in the L^p norm implies convergence in measure: P_X({x ∈ X : |f − g_n| ≥ ε}) → 0. Since L(y, t) is uniformly continuous in its second variable on the closed interval [−||f||_∞, ||f||_∞], for any fixed y we have P_X({x ∈ X : |L(y, f(x)) − L(y, g_n(x))| ≥ ε}) → 0. By the standing assumption that L(y, f(x)) is a margin-based admissible loss function continuous with respect to the margin yf(x), there exists a dominating function L̂(yf(x)) ∈ L¹(X) such that |L(y, g_n(x))| ≤ L̂(yf(x)). By Lebesgue's dominated convergence theorem, the expectation in R_{L,P}(f) and the limit can be interchanged:
$$\lim_{n\to\infty}\int L(y, g_n(x))\,P(dx, dy) = \int L(y, f(x))\,P(dx, dy) = E_{(x,y)\sim P}\,L(y, f(x)) = R_{L,P}(f).$$
Thus, fixing p = 1, we have inf{R_{L,P}(f) : f simple} = inf_{f∈L¹(X)} R_{L,P}(f). Clearly such simple functions belong to BV(X), and by the definition of BV functions we have BV(X) ⊂ L¹(X). The relation of embedded spaces then implies
$$\inf\{R_{L,P}(f) : f \text{ simple}\} \ge \inf_{f\in BV(X)} R_{L,P}(f) \ge \inf_{f\in L^1(X)} R_{L,P}(f).$$
Together with the previous identity between simple functions and L¹(X) functions, the first identity inf_{f∈BV(X)} R_{L,P}(f) = inf_{f∈L¹(X)} R_{L,P}(f) follows. The second identity comes from the fact
$$\inf_{f\in L^\infty(X)} R_{L,P}(f) = R^*_{L,P},$$
proved in Steinwart (2005, Proposition 3.2). On the other hand, the embedding relationship L^∞(X) ⊂ L¹(X) ⊂ {f : X → R̄ measurable} leads to
$$\inf_{f\in L^\infty(X)} R_{L,P}(f) \ge \inf_{f\in L^1(X)} R_{L,P}(f) \ge R^*_{L,P}.$$
Therefore the second identity inf_{f∈L¹(X)} R_{L,P}(f) = R*_{L,P} holds.

Following the framework of the consistency proof in Steinwart (2005), the final piece of the puzzle is to show that suitable concentration inequalities hold for our proposed algorithms.
These concentration inequalities bridge the gap between the expected L-risk and the empirical L-risk of f_{P,λ}. Steinwart's framework is modular: each tuple of concentration inequality, loss function, and function space gives a condition on {λ_n} ensuring |R_{L,P}(f_{P,λ}) − R_{L,T}(f_{P,λ})| → 0, and each different combination of this tuple leads to new consistency results. Steinwart (2005) gives several concentration inequalities, based on covering numbers, localized covering numbers, and algorithmic stability. Among these three, algorithmic stability (Bousquet and Elisseeff, 2002; Kutin and Niyogi, 2002; Poggio et al., 2004) is an elegant approach that does not depend on any complexity measure of the underlying hypothesis space, but rather on how the learning algorithm searches this space. However, stability-based concentration inequalities (Bousquet and Elisseeff, 2002) rely heavily on the reproducing property of the RKHS and typically require a convex regularization term; neither condition holds for our elastica based learning algorithms. In the following we therefore give a concentration inequality based on covering numbers. For a metric space (M, d) we define the covering number N((M, d), ε) as the minimal l such that M can be covered by l disks in M of radius ε:
$$N((M, d), \varepsilon) := \min\Big\{l \in \mathbb{N} : \exists\,x_1, \ldots, x_l \in M,\ M \subset \bigcup_{i=1}^{l} B(x_i, \varepsilon)\Big\},$$
where B(x, ε) denotes the closed ball with center x and radius ε ≥ 0. We also have to measure the continuity of a given loss function L. The modulus of continuity of L is defined by
$$\omega(L, \delta) := \sup\big\{|L(y, t) - L(y, t')| : y \in Y,\ t, t' \in \mathbb{R},\ |t - t'| \le \delta\big\},$$
and the inverted modulus of continuity by
$$\omega^{-1}(L, \varepsilon) := \sup\big\{\delta > 0 : \omega(L, \delta) \le \varepsilon\big\}.$$
Moreover, since only $f_{P,\lambda} \in BV(X)$ and $f_{T,\lambda} \in BV(X)$ are considered in the consistency results, we define the restricted loss class
$$\mathcal{L}_\lambda := \{ L(y, f(x)) : y \in Y,\ f \in BV(X) \cap L^\infty(X),\ \|f\|_{TV} \le \delta_\lambda \},$$
where $\delta_\lambda$, given in Theorem 7, is a simple upper bound on the TV semi-norm of the solutions of $R^{reg}_{L,P,\lambda}(f)$.

Lemma 11 (Concentration) For all Borel probability measures $P$ on $X \times Y$, all $\varepsilon > 0$, $\lambda > 0$, and all $n \ge 1$ we have
$$P\big(|R_{L,T}(f_{T,\lambda}) - R_{L,P}(f_{T,\lambda})| \ge \varepsilon\big) \le 2\, N\big(\delta_\lambda I, \omega^{-1}(\mathcal{L}_\lambda, \varepsilon/3)\big) \exp\Big( \frac{-2n\varepsilon^2}{9\|\mathcal{L}_\lambda\|_\infty^2} \Big),$$
where $\delta_\lambda I := \{f \in BV(X) \cap L^\infty(X) : \|f\|_{TV} \le \delta_\lambda\}$ is a metric space equipped with the supremum norm.

Proof Write the loss class as $\mathcal{F} := \{L(\cdot, f(\cdot)) : f \in BV(X) \cap L^\infty(X),\ \|f\|_{TV} \le \delta_\lambda\}$. Note that $\mathcal{F}$ is a subset of $C(X \times Y)$ of nonnegative functions that are bounded by $\|\mathcal{L}_\lambda\|_\infty$. Let $l = N(\mathcal{F}, \varepsilon/3)$ and consider $f_1, \ldots, f_l$ such that the disks $D_j$ centered at $f_j$ and with radius $\varepsilon/3$ cover $\mathcal{F}$. Recall that Hoeffding's inequality (Bousquet et al., 2004, Theorem 1; see also the book by Boucheron et al., 2013), perhaps the most elegant quantitative version of the law of large numbers, states that for all $\varepsilon > 0$,
$$P\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} f(Z_i) - \mathbb{E}[f(Z)] \Big| > \varepsilon \Big) \le 2 \exp\Big( \frac{-2n\varepsilon^2}{(b-a)^2} \Big),$$
where $Z_1, \ldots, Z_n$ are $n$ i.i.d. random variables with $f(Z) \in [a, b]$. For each fixed $f_j$, applying Hoeffding's inequality yields
$$P\big(|R_{L,T}(f_j) - R_{L,P}(f_j)| \le \varepsilon/3\big) \ge 1 - 2 \exp\Big( \frac{-2n(\varepsilon/3)^2}{\|\mathcal{L}_\lambda\|_\infty^2} \Big)$$
with $R_{L,P}(f_j) = \mathbb{E}_{(x,y) \sim P}\, L(y, f_j(x))$ and $L(y, f_j(x)) \in [0, \|\mathcal{L}_\lambda\|_\infty]$. As the disks $D_j$ form an $\varepsilon/3$ cover of $\mathcal{F}$, the following inequalities hold:
$$\sup_{f \in D_j} |R_{L,T}(f) - R_{L,P}(f)| = \sup_{f \in D_j} |R_{L,T}(f) - R_{L,T}(f_j) + R_{L,T}(f_j) - R_{L,P}(f_j) + R_{L,P}(f_j) - R_{L,P}(f)| \le \varepsilon/3 + |R_{L,T}(f_j) - R_{L,P}(f_j)| + \varepsilon/3$$
with probability at least $1 - 2\exp\big(-2n\varepsilon^2/(9\|\mathcal{L}_\lambda\|_\infty^2)\big)$ over the random choice of the training set $T$. Since $\|f\|_{EE} \le \delta_\lambda$ implies $\|f\|_{TV} \le \delta_\lambda$, using the union bound we get
$$P\Big( \sup_{\|f\|_{EE} \le \delta_\lambda} |R_{L,T}(f) - R_{L,P}(f)| \ge \varepsilon \Big) \le 2\, N(\mathcal{F}, \varepsilon/3) \exp\Big( \frac{-2n\varepsilon^2}{9\|\mathcal{L}_\lambda\|_\infty^2} \Big).$$
By the definition of the modulus of continuity, every $\varepsilon$ cover $f_1, \ldots, f_l$ with $\|f_j\|_{TV} \le \delta_\lambda$ defines an $\omega(\mathcal{L}_\lambda, \varepsilon)$ cover $L(\cdot, f_1(\cdot)), \ldots, L(\cdot, f_l(\cdot))$ of $\mathcal{F}$ with respect to the supremum norm.
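As a quick sanity check of the Hoeffding bound invoked above (an illustrative experiment of ours, not from the paper; the sample size and threshold are arbitrary), one can estimate the deviation probability by Monte Carlo and compare it with the right-hand side $2\exp(-2n\varepsilon^2/(b-a)^2)$:

```python
import numpy as np

# Monte Carlo check of Hoeffding's inequality for [0, 1]-valued variables:
# P(|mean_n - E[Z]| > eps) <= 2 exp(-2 n eps^2 / (b - a)^2).
rng = np.random.default_rng(0)
n, eps, trials = 200, 0.1, 20000
a, b = 0.0, 1.0
samples = rng.uniform(a, b, size=(trials, n))           # E[Z] = 0.5
deviation = np.abs(samples.mean(axis=1) - 0.5)
empirical = float(np.mean(deviation > eps))             # observed frequency
bound = 2.0 * np.exp(-2.0 * n * eps**2 / (b - a) ** 2)  # Hoeffding bound
print(empirical, round(bound, 4))
```

The observed frequency sits well below the bound; Hoeffding is loose for any fixed distribution, which is exactly why it can be applied uniformly over the cover in the lemma.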
Thus we have $N(\mathcal{F}, \varepsilon/3) \le N(\delta_\lambda I, \omega^{-1}(\mathcal{L}_\lambda, \varepsilon/3))$, which immediately yields
$$P\Big( \sup_{f \in \delta_\lambda I} |R_{L,T}(f) - R_{L,P}(f)| \ge \varepsilon \Big) \le 2\, N(\delta_\lambda I, \omega^{-1}(\mathcal{L}_\lambda, \varepsilon/3)) \exp\Big( \frac{-2n\varepsilon^2}{9\|\mathcal{L}_\lambda\|_\infty^2} \Big).$$
Since Lemma 7 guarantees that $\|f_{P,\lambda}\|_{TV} \le \delta_\lambda$ and $\|f_{T,\lambda}\|_{TV} \le \delta_\lambda$, the assertion follows.

Theorem 12 (Universal Consistency) The classifier $f_{T,\lambda_n} \in BV(X)$ minimizing the regularized empirical $L$-risk $R^{reg}_{L,T,\lambda_n}(f)$ is universally consistent for a positive sequence $\{\lambda_n\}$ with $\lambda_n \to 0$ and
$$\frac{1}{n} \|\mathcal{L}_{\lambda_n}\|_\infty^2 \ln N\big(\delta_{\lambda_n} I, \omega^{-1}(\mathcal{L}_{\lambda_n}, \varepsilon)\big) \to 0 \quad \text{for all } \varepsilon > 0.$$
Proof Proposition 3.3 of Steinwart (2005) states that for any Borel probability measure $P$ on $X \times Y$ and for all $\varepsilon > 0$, there exists a $\delta > 0$ such that for all measurable $f : X \to \mathbb{R}$ with $R_{L,P}(f) \le R^*_{L,P} + \delta$ we have $R_P(f) \le R^*_P + \varepsilon$. Here $L$ in $R_{L,P}(f)$ is required to be an admissible loss function. Therefore, in order to prove the 0-1 risk bound $R_P(f_{T,\lambda_n}) \le R^*_P + \varepsilon$, it suffices to show the $L$-risk bound $R_{L,P}(f_{T,\lambda_n}) \le R^*_{L,P} + \delta$. The outline is as follows (the penalty term $S$ is nonnegative):
$$R_{L,P}(f_{T,\lambda_n}) \le S(\lambda_n, \|f_{T,\lambda_n}\|_{EE}) + R_{L,P}(f_{T,\lambda_n})$$
$$\le S(\lambda_n, \|f_{T,\lambda_n}\|_{EE}) + R_{L,T}(f_{T,\lambda_n}) + \delta/3 \quad (27)$$
$$\le S(\lambda_n, \|f_{P,\lambda_n}\|_{EE}) + R_{L,T}(f_{P,\lambda_n}) + \delta/3 \quad (28)$$
$$\le S(\lambda_n, \|f_{P,\lambda_n}\|_{EE}) + R_{L,P}(f_{P,\lambda_n}) + 2\delta/3 \quad (29)$$
$$= R^{reg}_{L,P,\lambda_n}(f_{P,\lambda_n}) + 2\delta/3 \le R^*_{L,P} + \delta. \quad (30)$$
Among the above inequalities, (27) and (29) hold by the concentration inequality in Lemma 11 with probability at least $1 - 2\, N(\delta_\lambda I, \omega^{-1}(\mathcal{L}_\lambda, \varepsilon/3)) \exp(-2n\varepsilon^2/(9\|\mathcal{L}_\lambda\|_\infty^2))$ over the random choice of the training set $T$, while (28) is obtained from the fact that $f_{T,\lambda_n}$ minimizes the regularized empirical $L$-risk $R^{reg}_{L,T,\lambda_n}(f)$. Proposition 10 together with $\lambda_n \to 0$ immediately implies (30): there exists an integer $n_0 \ge 1$ such that for all $n \ge n_0$ we have $|R^{reg}_{L,P,\lambda_n}(f_{P,\lambda_n}) - R^*_{L,P}| \le \delta/3$. Note that the condition $\frac{1}{n}\|\mathcal{L}_{\lambda_n}\|_\infty^2 \ln N(\delta_{\lambda_n} I, \omega^{-1}(\mathcal{L}_{\lambda_n}, \varepsilon)) \to 0$ ensures that $R_{L,P}(f_{T,\lambda_n}) \le R^*_{L,P} + \delta$ holds with probability approaching 1 as $n \to \infty$. The universal consistency then follows from $P(R_P(f_{T,\lambda_n}) \le R^*_P + \varepsilon) \to 1$ for all distributions $P$ on $X \times Y$.
Figure 2: Decision boundaries produced by SVM and EE with common parameters on the two moon data.

6. Experimental Results

The two proposed models (TV and EE) are compared with LR, SVM with RBF kernels using the LIBSVM implementation (Chang and Lin, 2011), and back-propagation neural networks (BPNN) from the Matlab neural network toolbox. Two implementations of our methods are also compared: the gradient descent method (GD) and the lagged linear equation method (LAG). The maximum number of iterations in GD and LAG is empirically set to 40. Binary classification, multi-class classification, and regression tasks are tested on synthetic and real-world data sets. We collected real data sets from the LIBSVM website (Chang and Lin, 2011) and the UCI machine learning repository (Asuncion and Newman, 2013). Some attributes have been removed due to missing entries. Some data sets have a huge number of instances, so we use only 1000 instances in our experiments. All data sets are scaled into [0,1] before training and testing.

6.1 Synthetic Data

We first compare our EE model and SVM for binary classification on two synthetic data sets: the two moon data and a data set constructed by ourselves. Fig. 2 and Fig. 3 show the decision boundaries produced by SVM and EE with common parameters. We can see that SVM tends to yield curved or even wiggly decision boundaries in pursuit of low training errors. In contrast, EE favors smooth or even straight decision boundaries with low curvature, hence reducing the risk of overfitting. One may argue that SVM can produce smooth, low-curvature decision boundaries by tuning its parameters. Fig. 4 shows the results of SVM with different combinations of the kernel parameter g and the slack parameter C. For comparison, Fig. 5 displays the results of EE with different combinations of the regularization parameter λ and the kernel parameter c.
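The kind of (c, λ) search used for the function-learning baselines can be sketched in a few lines. The following is our own stand-in, not the paper's LR/TV/EE solver: a regularized least-squares classifier on the same RBF expansion $u(x) = \sum_i w_i \exp(-c|x - x_i|^2/2)$, with a logarithmic grid over (c, λ) selected by 5-fold cross-validation on two-moons-like data (the data generator and grid ranges are our assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

# two-moons-like data, scaled into [0, 1]
t = rng.uniform(0, np.pi, 200)
X = np.r_[np.c_[np.cos(t), np.sin(t)],
          np.c_[1 - np.cos(t), 0.5 - np.sin(t)]]
X += 0.1 * rng.standard_normal(X.shape)
y = np.r_[np.ones(200), -np.ones(200)]
X = (X - X.min(0)) / (X.max(0) - X.min(0))

def fit_predict(Xtr, ytr, Xte, c, lam):
    # regularized least squares on the Gaussian kernel matrix
    K = np.exp(-c * ((Xtr[:, None] - Xtr[None]) ** 2).sum(-1) / 2)
    w = np.linalg.solve(K + lam * np.eye(len(Xtr)), ytr)
    Kte = np.exp(-c * ((Xte[:, None] - Xtr[None]) ** 2).sum(-1) / 2)
    return np.sign(Kte @ w)

def cv_accuracy(c, lam, folds=5):
    idx = rng.permutation(len(X))
    accs = []
    for f in range(folds):
        te = idx[f::folds]
        tr = np.setdiff1d(idx, te)
        accs.append(np.mean(fit_predict(X[tr], y[tr], X[te], c, lam) == y[te]))
    return float(np.mean(accs))

scores = {(c, lam): cv_accuracy(c, lam)
          for c in 10.0 ** np.arange(-2, 3)      # kernel parameter c
          for lam in 10.0 ** np.arange(-4, 1)}   # regularization lambda
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The same two-parameter grid structure is what makes the cross-validated search in the experiments below practical.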
We can see that most decision boundaries produced by EE have lower curvature and are smoother than those produced by SVM. In fact, the elastica term in EE may be interpreted as the accumulated bending energy of all level lines, including the level line on the decision boundary.

Figure 3: Decision boundaries produced by SVM and EE with common parameters on our synthetic data.

Figure 4: Decision boundaries produced by SVM with different parameter combinations on the two moon data.

6.2 Binary Classification

We use eleven data sets for binary classification. The optimal parameters for each algorithm are selected by grid search using 5-fold cross-validation. To make the grid search practical, only two common parameters are searched for all methods except BPNN: (C, g) for SVM, and (c, λ) for LR, TV, and EE. Empirically, the parameter η is set to 1 for LR, and the parameter b is fixed at 0.01 for EE. Excluding BPNN, the two common parameters are searched over powers of ten from 10^-10 to 10^10, with step 2 in the exponent. For each data set, we run the 5-fold cross-validation ten times with random partitions to reduce the influence of data splits.

Figure 5: Decision boundaries produced by EE with different parameter combinations on the two moon data.

Table 1 gives the average classification accuracies (with standard deviations) for the five methods. The results indicate that BPNN performs the worst, while the LAG version of EE achieves the best accuracies on six data sets. LR and the other implementations of TV and EE are comparable with SVM. Comparing EE-LAG and SVM in a pairwise fashion, EE-LAG achieves improvements over SVM on 10 data sets (though the improvements are not statistically significant, as the difference between the two average accuracies is often less than one standard deviation).

6.3 Multi-Class Classification

For multi-class tasks, we collected twelve data sets.
For the 256-dimensional USPS data, PCA is used as a preprocessing step to reduce the dimension to 30, and we randomly select 1000 samples for the experiments. As in the binary setting, we use ten runs of 5-fold cross-validation to choose the optimal parameters for each method. All methods except BPNN have two common parameters, which are searched over powers of ten from 10^-10 to 10^10, with step 1 in the exponent. Aside from BPNN, which has a built-in ability to handle multi-class tasks, almost all function learning approaches are originally designed for binary classification. To handle multi-class problems, one-versus-all (OVA) or one-versus-one (OVO) strategies are usually adopted. With OVA, one needs to learn M scoring functions, where M is the number of classes; the final decision is the label whose scoring function achieves the largest value or confidence score. However, these scoring functions are learned independently, often suffering from the so-called calibration problem (Mohri et al., 2012, chap. 8). LIBSVM uses the OVO strategy, with the reasons and detailed comparisons given by Hsu and Lin (2002). See also Mohri et al. (2012, chap. 8) for discussions of OVA versus OVO.

Data | Dim | Num | SVM | BPNN | LR | TV-GD | TV-LAG | EE-GD | EE-LAG
Australian | 14 | 690 | 85.94±2.70 | 85.34±1.97 | 87.06±2.45 | 87.11±2.06 | 87.01±2.46 | 86.54±2.31 | 87.25±2.01
Blood transfusion | 4 | 748 | 79.01±3.01 | 79.08±3.36 | 79.32±3.74 | 79.55±2.38 | 79.42±2.60 | 79.73±2.18 | 79.73±2.03
Breast cancer | 10 | 683 | 97.36±1.59 | 96.40±1.14 | 97.60±1.27 | 97.36±1.28 | 97.72±1.43 | 97.13±1.37 | 97.83±1.29
Diabetes | 8 | 768 | 77.73±3.03 | 76.85±4.22 | 77.96±3.50 | 77.83±3.19 | 77.81±2.73 | 78.23±2.54 | 78.10±2.63
German.number | 24 | 1000 | 77.10±1.61 | 76.37±1.61 | 77.10±1.36 | 76.19±1.47 | 77.10±1.29 | 76.50±1.59 | 77.22±1.30
Haberman's survival | 3 | 306 | 74.51±4.31 | 74.52±3.53 | 75.77±3.00 | 75.30±3.31 | 75.28±3.78 | 75.65±3.42 | 75.34±3.32
Heart | 13 | 270 | 83.70±2.72 | 81.76±3.16 | 84.26±2.22 | 84.45±2.82 | 84.58±2.73 | 84.78±2.69 | 84.96±2.79
Liver disorders | 6 | 345 | 73.62±5.72 | 71.52±4.44 | 73.20±2.95 | 74.81±2.49 | 73.62±2.65 | 74.32±2.29 | 73.91±2.83
Planning relax | 12 | 182 | 73.63±4.41 | 67.62±4.93 | 72.22±4.46 | 71.67±4.93 | 71.67±4.08 | 72.22±4.25 | 71.67±4.79
Sonar | 60 | 208 | 89.90±4.41 | 88.99±4.79 | 90.88±3.83 | 90.30±4.47 | 90.27±4.72 | 90.07±3.27 | 90.50±3.37
Vertebral column | 6 | 310 | 85.81±4.26 | 85.16±3.12 | 84.52±3.90 | 84.55±4.14 | 84.75±4.37 | 85.83±3.38 | 85.92±3.68

Table 1: Average accuracies (%) for binary classification with 5-fold cross-validation.

Recently, Varshney and Willsky (2010) proposed an efficient binary encoding strategy that represents the decision boundary using only m = ⌈log2 M⌉ functions. We empirically compared the log2 M strategy and the OVA strategy for LR, TV, and EE, and found that in most cases the log2 M strategy performs slightly better. As the codewords for making decisions are represented as 0-1 bits of length m, the log2 M strategy may somewhat favor methods with good function approximation ability. In the multi-class experiments, the log2 M strategy is used for LR, TV, and EE, while LIBSVM runs with the OVO strategy. The multi-class classification accuracies are shown in Table 2. The results demonstrate that SVM and EE-GD each offer the best accuracies on four (different) data sets, and EE-LAG and TV-GD each take first place on two (different) data sets. If we compare SVM and EE-GD in a pairwise fashion, excluding the other competing methods, SVM wins on only five data sets while EE-GD performs better on the other seven data sets.
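The log2 M encoding idea attributed to Varshney and Willsky (2010) can be sketched as follows. This is an illustrative reconstruction of ours, not the authors' implementation: each of the M classes gets an m = ⌈log2 M⌉-bit codeword, one binary scorer is trained per bit, and a prediction is the class whose codeword is nearest in Hamming distance to the thresholded scores (the bit assignment and tie-breaking below are our own choices).

```python
import numpy as np

def codewords(M):
    """0-1 codewords of length m = ceil(log2 M), one per class."""
    m = int(np.ceil(np.log2(M)))
    return np.array([[(k >> j) & 1 for j in range(m)] for k in range(M)])

def decode(scores, codes):
    """Threshold the m real-valued bit-function outputs and return the
    class whose codeword is nearest in Hamming distance."""
    bits = (np.asarray(scores) > 0).astype(int)
    return int(np.argmin(np.abs(codes - bits).sum(axis=1)))

codes = codewords(5)                    # 5 classes -> m = 3 bit functions
print(codes.shape)                      # (5, 3)
print(decode([0.5, 0.8, -0.3], codes))  # bits (1, 1, 0) -> class 3
```

With M classes this needs only ⌈log2 M⌉ trained functions instead of M (OVA) or M(M-1)/2 (OVO), which is why the strategy rewards methods with strong function approximation ability.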
Therefore, on multi-class tasks, Table 2 implies that our EE-GD version offers competitive results, and can perform slightly better than SVM.

Data | Cls | Dim | Num | SVM | BPNN | LR | TV-GD | TV-LAG | EE-GD | EE-LAG
Balance scale | 3 | 4 | 625 | 98.40±1.13 | 92.48±1.77 | 89.44±1.90 | 90.88±1.25 | 89.92±1.36 | 90.40±1.80 | 91.36±1.35
Flags | 8 | 29 | 194 | 52.06±7.57 | 46.90±7.25 | 53.13±7.34 | 51.50±7.29 | 52.10±7.10 | 53.55±6.39 | 52.10±7.22
Glass | 6 | 9 | 214 | 73.83±9.22 | 63.99±11.83 | 73.81±7.34 | 75.59±8.73 | 76.19±8.62 | 75.82±9.07 | 75.71±9.15
Hayes-roth | 3 | 5 | 132 | 81.82±4.12 | 74.26±4.62 | 73.63±4.31 | 77.87±4.59 | 77.08±4.67 | 78.90±4.29 | 78.15±4.31
Iris | 3 | 4 | 150 | 96.67±3.65 | 96.00±3.37 | 95.33±3.42 | 96.00±3.50 | 96.00±3.27 | 96.00±3.33 | 96.00±3.10
Statlog imageseg | 7 | 19 | 2310 | 97.27±0.91 | 96.74±1.12 | 97.31±0.95 | 97.31±0.93 | 97.21±0.88 | 97.45±0.81 | 97.44±0.83
Seeds | 3 | 7 | 210 | 94.76±1.78 | 95.71±1.56 | 92.38±1.85 | 92.86±1.74 | 92.65±1.62 | 92.86±1.93 | 92.75±1.87
Teaching assist | 3 | 5 | 151 | 60.93±20.97 | 56.63±19.44 | 63.47±17.28 | 65.18±14.26 | 66.00±15.37 | 65.41±16.23 | 67.33±17.41
USPS | 10 | 30 | 1000 | 94.10±1.39 | 89.72±2.79 | 94.90±1.28 | 94.40±1.32 | 94.80±1.73 | 94.40±1.54 | 95.00±1.27
Vehicle | 4 | 18 | 846 | 84.40±0.70 | 79.18±1.41 | 82.75±1.33 | 85.00±0.82 | 84.25±0.93 | 85.00±0.78 | 84.84±0.90
Wine | 3 | 13 | 178 | 98.88±1.27 | 97.78±1.43 | 99.44±0.83 | 99.44±0.83 | 99.43±0.85 | 99.44±0.83 | 98.86±1.31
Yeast | 10 | 8 | 1484 | 60.78±3.26 | 54.49±4.57 | 58.22±3.79 | 57.95±3.34 | 57.91±3.27 | 57.95±3.64 | 57.97±3.52

Table 2: Average accuracies (%) for multi-class classification with 5-fold cross-validation.

6.4 Regression

We use ten regression data sets to validate the proposed TV/EE methods against SVM, BPNN, and LR. All data sets are scaled into [0,1]. The same experimental settings are repeated, running ten rounds of 5-fold cross-validation for each data set. Table 3 shows the regression results as mean squared errors (MSE) with standard deviations. Clearly, TV-LAG and the two versions of EE achieve the best regression results, each winning three of the ten data sets.
BPNN yields the lowest errors on two data sets. Surprisingly, SVM takes first place on only one data set. If we compare SVM and LR in a pairwise fashion, excluding the other methods, LR offers lower errors on seven data sets while SVM performs better on only the other three. If we compare SVM and TV-GD separately, neglecting the other methods, TV-GD performs better on nine data sets. Note that TV-GD performs the worst among all versions of TV/EE. These results demonstrate that, compared with the other competing methods, the performance of SVM on regression tasks is rather unsatisfactory. The reason might be that SVM was originally designed for classification, not regression. In contrast, our TV/EE methods exhibit excellent regression ability on these data sets.

6.5 Running Times

To compare the actual computational burden, Table 4 lists the running times of the competing methods on five data sets for binary classification. The running times are obtained for 5-fold cross-validation in one single round, averaged over ten rounds. The experiments are conducted on a PC server with two Intel Xeon 5620 CPUs and 8GB RAM.

Data | Dim | Num | SVM | BPNN | LR | TV-GD | TV-LAG | EE-GD | EE-LAG
Auto MPG | 7 | 392 | 7.11±0.56 | 5.63±0.57 | 6.07±0.55 | 5.67±0.56 | 5.62±0.53 | 5.47±0.51 | 5.69±0.56
Concrete comp. str. | 8 | 1030 | 6.42±0.62 | 4.88±0.56 | 6.02±0.64 | 5.98±0.83 | 5.43±0.60 | 5.83±0.78 | 5.24±0.61
Concrete slump test | 9 | 103 | 4.48±2.00 | 14.30±7.14 | 5.01±1.70 | 1.86±0.81 | 1.61±0.70 | 1.76±0.71 | 1.61±0.70
Forest fires | 12 | 517 | 5.95±3.62 | 6.20±3.73 | 3.41±3.69 | 3.43±3.61 | 3.41±3.69 | 3.37±3.54 | 3.41±3.69
Housing | 13 | 506 | 5.88±2.28 | 7.54±2.41 | 5.13±2.36 | 4.92±2.47 | 4.90±2.29 | 5.14±2.39 | 4.95±2.33
Machine CPU | 6 | 209 | 3.32±2.81 | 5.18±3.23 | 1.78±1.70 | 2.37±1.80 | 1.91±1.78 | 1.75±1.72 | 1.75±1.72
Pyrim | 27 | 74 | 9.32±9.75 | 23.06±9.97 | 6.59±6.22 | 5.81±5.17 | 5.89±5.85 | 5.90±6.09 | 5.93±6.12
Servo | 4 | 167 | 9.93±5.09 | 5.62±5.24 | 7.29±5.80 | 8.81±5.49 | 8.34±5.63 | 8.87±5.29 | 7.86±5.83
Triazines | 60 | 186 | 19.24±6.61 | 41.90±8.71 | 20.73±4.46 | 19.67±3.93 | 20.32±4.08 | 19.63±2.50 | 19.95±2.79
Yacht hydrodynamics | 6 | 308 | 4.52±0.31 | 3.70±0.29 | 7.75±1.91 | 2.33±0.47 | 1.45±0.32 | 2.07±0.43 | 1.45±0.32

Table 3: Regression errors measured by MSE (×10^-3) with 5-fold cross-validation.

We can see that the computational burdens of the TV/EE algorithms are similar to that of BPNN in the Matlab toolbox, but much heavier than LIBSVM's. The computational PDE approach of our TV/EE models is implemented by gradient descent or lagged iteration, which often requires a long time to ensure that the iterations converge. In the current implementations, all the data points participate in the computations at each iteration. In contrast, the solutions of SVM are essentially sparse, and several recent improvements show that carefully selecting a small representative subset of the training data can greatly speed up the optimization of SVM (Nandan et al., 2014; Wang et al., 2014). Our intention in this paper is not to develop a fully-fledged, highly optimized algorithm for supervised learning problems. Instead, this work serves as a starting point for applying Euler's elastica to classification and regression tasks. The above experiments have demonstrated the excellent accuracy of our elastica-based algorithms, though the numerical solutions are rather slow.
Hence there is an opportunity to dramatically improve the computational efficiency by considering the following techniques: (1) First-order numerical methods, such as the augmented Lagrangian method (ALM). The operator splitting method and ALM have been successfully used to solve Euler's elastica models in image applications (Tai et al., 2011; Hahn et al., 2011; Duan et al., 2013), with spectacular speed-ups compared with prior approaches. Interestingly, the ALM has also been applied to optimize the primal SVM problem with linear computational cost (Nie et al., 2014). (2) Imposing a sparsity constraint on the coefficients w. The sparsity property may reduce the cost of each iteration. (3) Selecting a small representative subset of the training data, in a manner similar to Nandan et al. (2014) and Wang et al. (2014).

Data | Dim | Num | SVM | BPNN | LR | TV-GD | TV-LAG | EE-GD | EE-LAG
Australian | 14 | 690 | 0.859 | 30.673 | 4.734 | 24.734 | 32.453 | 25.734 | 33.197
Blood transfusion | 4 | 748 | 0.297 | 22.247 | 5.938 | 28.467 | 35.746 | 27.481 | 36.497
Breast-cancer | 10 | 683 | 0.453 | 20.318 | 4.609 | 18.953 | 19.447 | 19.732 | 20.278
Diabetes | 8 | 768 | 0.547 | 23.142 | 6.453 | 20.120 | 20.981 | 22.145 | 21.519
German.number | 24 | 1000 | 1.266 | 31.452 | 14.156 | 29.266 | 32.145 | 34.497 | 31.876

Table 4: Running times (in seconds) for binary classification with 5-fold cross-validation in one single round.

7. Conclusion

Regularization frameworks and function learning approaches have become very popular in the recent machine learning literature. Motivated by the great success of total variation and Euler's elastica models in the image processing area, we extend these two models to supervised classification and regression on high-dimensional data sets. The TV regularizer permits steeper edges near the decision boundaries, while the elastica smoothing term penalizes non-smooth level-set hypersurfaces of the target function.
Compared with SVM and BPNN, our proposed methods have demonstrated competitive performance on commonly used benchmark data sets. Specifically, TV and EE offer superb results on binary classification and regression tasks, and perform slightly better than SVM on multi-class problems. In comparison, SVM often yields excellent accuracies for multi-class classification, but offers poor results on regression problems. Our future work is to explore other possibilities in using different basis functions and to speed up the training. Recently, several fast augmented Lagrangian methods (ALM) (Tai et al., 2011; Duan et al., 2013) have been applied to solve Euler's elastica models in image denoising, inpainting, and zooming applications. In particular, in Duan et al. (2013) the Euler's elastica functional is reformulated as a series of subproblems, each of which can be solved efficiently either in closed form or by a fast iterative method. Whether these methods can be extended to high-dimensional problems needs further investigation. Another interesting direction is to extend the work of Zakai and Ritov (2009) on regression consistency to the TV and EE models.

Acknowledgments

The authors thank the anonymous reviewers for valuable comments and insightful suggestions. This work was supported by the National Basic Research Program of China (973 Program) 2011CB302202, the Natural Science Foundation of China (NSFC Grants 61075119 and 61375051), and the Seeding Grant for Medicine and Information Sciences of Peking University (2014-MI-21).

Appendix A: Curvature

The following material comes from Aubert and Kornprobst (2006, chap. 2.4), with slightly different notations. Readers are also referred to the classical geometry book of do Carmo (1976). Let c(p) = (x(p), y(p)) be a regular planar oriented curve in R^2 with parameter p ∈ [0, 1].
Then $T(p) = c'(p) = (x'(p), y'(p))$ is the tangent vector, $N(p) = (-y'(p), x'(p))$ is the normal vector, and
$$s(p) = \int_0^p |c'(q)|\, dq = \int_0^p \sqrt{(x'(q))^2 + (y'(q))^2}\, dq$$
is the arc length. By the regularity condition $c'(p) \ne 0$, the arc length $s$ is a differentiable function of $p$ and $ds/dp = |c'(p)|$. If we parametrize the regular curve $c$ by $s$, then $T(s) = dc(s)/ds$ is the unit tangent vector satisfying $|T(s)| = 1$. The number $\kappa(s) := |dT(s)/ds|$ is called the curvature at $s$, measuring the rate of change of the angle that neighboring tangents make. Since $|T(s)| = 1$, we have $T(s) \cdot dT(s)/ds = 0$, indicating that $dT(s)/ds$ is collinear with the unit normal vector $N(s)$. That is, under the arc length parametrization, $dT(s)/ds = \kappa(s) N(s)$, or $\kappa(s) = |T \wedge dT/ds| = |c'(s) \wedge c''(s)|$, where $\wedge$ is the exterior product. Back in the general parametrization $c(p)$, we have
$$\kappa(p) = \frac{|c'(p) \wedge c''(p)|}{|c'(p)|^3} = \frac{x' y'' - x'' y'}{((x')^2 + (y')^2)^{3/2}}. \quad (31)$$
Now we derive the divergence expression (6) of the curvature on a level curve. Consider the case where $c(s)$ is the $l$-level curve of a function $u : \mathbb{R}^2 \to \mathbb{R}$, denoted by $c(s) = \{(x(s), y(s)) : u(x(s), y(s)) = l\}$. Differentiating the equality $u(x(s), y(s)) = l$ with respect to $s$, we obtain
$$u_x x'(s) + u_y y'(s) = 0. \quad (32)$$
Hence the vectors $(x'(s), y'(s))$ and $(-u_y, u_x)$ are collinear, or equivalently, for some $\lambda$ we have
$$x'(s) = -\lambda u_y, \qquad y'(s) = \lambda u_x. \quad (33)$$
Note that since $|c'(s)| = 1$, from (33) we get $\lambda = 1/|\nabla u|$ (supposing $|\nabla u| \ne 0$). Differentiating (32) again with respect to $s$, we obtain
$$u_{xx}(x'(s))^2 + u_{yy}(y'(s))^2 + 2u_{xy} x'(s) y'(s) + u_x x''(s) + u_y y''(s) = 0.$$
Plugging (33) into this equality leads to
$$\lambda^2 \big[ u_{xx}(u_y)^2 + u_{yy}(u_x)^2 - 2u_{xy} u_x u_y \big] - \frac{1}{\lambda}\big[ x'(s) y''(s) - x''(s) y'(s) \big] = 0.$$
By (31) we can deduce the curvature expression
$$\kappa(s) = \frac{|c'(s) \wedge c''(s)|}{|c'(s)|^3} = x'(s) y''(s) - x''(s) y'(s) = \lambda^3 \big[ u_{xx}(u_y)^2 + u_{yy}(u_x)^2 - 2u_{xy} u_x u_y \big] = \frac{u_{xx}(u_y)^2 + u_{yy}(u_x)^2 - 2u_{xy} u_x u_y}{|\nabla u|^3}.$$
Denoting $f := |\nabla u| = \sqrt{(u_x)^2 + (u_y)^2}$, we can verify directly that
$$\nabla \cdot \Big( \frac{\nabla u}{|\nabla u|} \Big) = \frac{\partial}{\partial x}\Big(\frac{u_x}{f}\Big) + \frac{\partial}{\partial y}\Big(\frac{u_y}{f}\Big) = \frac{1}{f}(u_{xx} + u_{yy}) - \frac{1}{f^2}\big( f_x u_x + f_y u_y \big),$$
and, with $f_x = (u_x u_{xx} + u_y u_{yx})/f$ and $f_y = (u_x u_{xy} + u_y u_{yy})/f$,
$$\nabla \cdot \Big( \frac{\nabla u}{|\nabla u|} \Big) = \frac{1}{f^3}\Big\{ \big[(u_x)^2 + (u_y)^2\big](u_{xx} + u_{yy}) - \big[(u_x)^2 u_{xx} + (u_y)^2 u_{yy} + 2 u_x u_y u_{xy}\big] \Big\} = \frac{u_{xx}(u_y)^2 + u_{yy}(u_x)^2 - 2u_{xy} u_x u_y}{|\nabla u|^3},$$
thus obtaining the curvature expression (6). Since the above derivations only consider level curves of a 2D function $u(x, y)$, we give some remarks on the curvature expression (6) in higher-dimensional spaces. For a level surface defined in 3D space, the curvature expression (6) at a point amounts to the mean curvature of this surface, the divergence $\nabla \cdot N$ of a unit normal $N$ of the surface (see Chan and Shen, 2005, chap. 2.1.2). Formally, the mean curvature is defined as the average of the principal curvatures (Spivak, 1999, vol. 3, chap. 2): $H = (\kappa_1 + \kappa_2)/2$, where $\kappa_1$ and $\kappa_2$ are the two principal curvatures. In this case, the Gaussian curvature is given by $K = \kappa_1 \kappa_2$. More generally (Spivak, 1999, vol. 4, chap. 7), for a $(d-1)$-dimensional level hypersurface embedded in $\mathbb{R}^d$ the mean curvature is $H = (\kappa_1 + \cdots + \kappa_{d-1})/(d-1)$ in terms of the principal curvatures. More abstractly, the mean curvature is the trace of the second fundamental form (equivalently, of the shape operator or Weingarten map) divided by $d-1$. The shape operator $s$ (Lee, 1997, chap. 8) is an extrinsic curvature, and the Gaussian curvature is given by the determinant of $s$. Mean curvature is closely related to the first variation of surface area; in particular, a minimal surface such as a soap film has mean curvature zero, and a soap bubble has constant mean curvature.
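The level-curve curvature expression just derived can be verified numerically. The following check is our own illustration, not part of the paper: it evaluates the divergence form of the curvature with finite differences for u(x, y) = sqrt(x^2 + y^2), whose level curves are circles, so the curvature at (x, y) should be 1/sqrt(x^2 + y^2).

```python
import numpy as np

# Numerical check of the level-curve curvature expression (6):
# kappa = div(grad u / |grad u|)
#       = (u_xx u_y^2 + u_yy u_x^2 - 2 u_xy u_x u_y) / |grad u|^3.
h = 1e-4

def level_curvature(u, x, y):
    ux = (u(x + h, y) - u(x - h, y)) / (2 * h)
    uy = (u(x, y + h) - u(x, y - h)) / (2 * h)
    uxx = (u(x + h, y) - 2 * u(x, y) + u(x - h, y)) / h**2
    uyy = (u(x, y + h) - 2 * u(x, y) + u(x, y - h)) / h**2
    uxy = (u(x + h, y + h) - u(x + h, y - h)
           - u(x - h, y + h) + u(x - h, y - h)) / (4 * h**2)
    g2 = ux**2 + uy**2
    return (uxx * uy**2 + uyy * ux**2 - 2 * uxy * ux * uy) / g2**1.5

u = lambda x, y: np.hypot(x, y)
print(level_curvature(u, 3.0, 4.0))   # circle of radius 5 -> about 0.2
```

The same expression, written as the divergence of the unit normal, is exactly the curvature term that the elastica penalty accumulates over all level curves.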
Unlike the Gaussian curvature, the mean curvature is extrinsic and depends on the embedding: for instance, a cylinder and a plane are locally isometric, but the mean curvature of a plane is zero while that of a cylinder is nonzero (see http://en.wikipedia.org/wiki/Curvature). One can also refer to Ambrosio and Masnou (2003) for a description of this high-dimensional representation.

Appendix B: PDEs Derived by Calculus of Variations

We present the following derivations of the Euler-Lagrange PDEs by the calculus of variations (van Brunt, 2004). Note that the variation operator $\delta$ acts much like a differentiation operator. First we list some expressions involving $\delta$ that are useful in the derivations below (where $F$ is a $d$-dimensional differentiable vector field):
$$\delta(\nabla u) = \nabla(\delta u), \qquad \delta(\nabla \cdot F) = \nabla \cdot (\delta F) = \sum_{i=1}^{d} \frac{\partial\, \delta F^{(i)}}{\partial x^{(i)}},$$
$$\delta(|\nabla u|^2) = \delta\Big( \sum_{i=1}^d \Big(\frac{\partial u}{\partial x^{(i)}}\Big)^2 \Big) = 2\langle \nabla u, \delta \nabla u \rangle = 2\langle \nabla u, \nabla \delta u \rangle,$$
$$\delta(|\nabla u|) = \delta\Big( \Big[ \sum_{i=1}^d \Big(\frac{\partial u}{\partial x^{(i)}}\Big)^2 \Big]^{1/2} \Big) = \frac{1}{|\nabla u|} \langle \nabla u, \nabla \delta u \rangle,$$
$$\delta\Big( \frac{1}{|\nabla u|} \Big) = \delta\Big( \Big[ \sum_{i=1}^d \Big(\frac{\partial u}{\partial x^{(i)}}\Big)^2 \Big]^{-1/2} \Big) = -\frac{1}{|\nabla u|^3} \langle \nabla u, \nabla \delta u \rangle.$$
Proof of (9)-(10). Suppose $u : \mathbb{R}^d \to \mathbb{R}$ is a differentiable function. The first variation, $E_{LR} \to E_{LR} + \delta E_{LR}$, under $u \to u + \delta u$ is given by
$$\delta E_{LR} = \delta \int_\Omega \big[ (u - y)^2 + \lambda |\nabla u|^2 \big]\, dx = \int_\Omega \big[ 2(u - y)\delta u + 2\lambda \langle \nabla u, \nabla \delta u \rangle \big]\, dx$$
$$= 2\Big[ \int_\Omega (u - y)\delta u\, dx + \oint_{\partial\Omega} \lambda \frac{\partial u}{\partial n}\, \delta u\, dS - \int_\Omega \lambda (\Delta u)\delta u\, dx \Big] \quad (34)$$
$$= 2\Big[ \int_\Omega (u - y)\delta u\, dx - \int_\Omega \lambda (\Delta u)\delta u\, dx \Big] \quad (35)$$
$$= 2\int_\Omega \big[ (u - y) - \lambda \Delta u \big]\, \delta u\, dx. \quad (36)$$
Here $\nabla \cdot$ is the divergence operator, $\Delta$ is the Laplacian operator, and $n$ denotes the outer normal along the boundary $\partial\Omega$. Equation (34) is obtained from the Gauss-Green divergence theorem of vector calculus (Spiegel and Lipschutz, 2009), a special case of the more general Stokes theorem:
$$\int_V (\nabla \cdot F)\, dV = \oint_S (F \cdot n)\, dS,$$
where $V$ is a compact subset of $\mathbb{R}^d$ with a piecewise smooth boundary $S$ (also written $\partial V = S$; in the case $d = 3$, $V$ is a volume in 3D space), $F$ is a continuously differentiable vector field, and $n$ is the outward-pointing unit normal field of the boundary $\partial V$.
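The Euler-Lagrange computation (34)-(36) can be sanity-checked numerically. The following is our own discretized check, not from the paper: on a periodic 1-D grid, the finite-difference gradient of a discretized $E_{LR}$ should match the discrete analogue of $2[(u - y) - \lambda \Delta u]$.

```python
import numpy as np

# Compare the numerical gradient of a discretized E_LR with the
# analytic first variation 2[(u - y) - lam * Laplacian(u)] (times the
# cell width h, since the discrete energy carries a factor h).
n, lam = 64, 0.3
h = 1.0 / n
rng = np.random.default_rng(1)
y = rng.standard_normal(n)
u = rng.standard_normal(n)

def energy(u):
    du = (np.roll(u, -1) - u) / h                    # forward difference
    return h * np.sum((u - y) ** 2 + lam * du ** 2)

lap = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / h ** 2
analytic = 2 * h * ((u - y) - lam * lap)             # discrete (36)

numeric = np.empty(n)
eps = 1e-6
for i in range(n):
    e = np.zeros(n); e[i] = eps
    numeric[i] = (energy(u + e) - energy(u - e)) / (2 * eps)

print(np.max(np.abs(numeric - analytic)))            # tiny discrepancy
```

Because the energy is quadratic in u, the central-difference gradient has no truncation error here, so the discrepancy is at the level of floating-point roundoff.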
In fact, we use the following corollary of the divergence theorem, applied to the product of a scalar function $g$ ($\delta u$ in our context) and a vector field $F$ ($\nabla u$ in our context):
$$\int_V \big[ F \cdot \nabla g + g\, (\nabla \cdot F) \big]\, dV = \oint_S g\, F \cdot n\, dS.$$
Integration by parts then implies (34). Equation (35) is written with the directional derivative notation $\partial u/\partial n := \nabla u \cdot n = \langle \nabla u, n \rangle$ and follows from the assumption of natural boundary conditions; the last equation (36) collects terms. According to the fundamental lemma of the calculus of variations, the integrand factor in brackets must vanish because $\delta u$ is an arbitrary function. Hence we obtain (10).

Proof of (11)-(12). The first variation, $E_{TV} \to E_{TV} + \delta E_{TV}$, under $u \to u + \delta u$ is given by
$$\delta E_{TV} = \delta \int_\Omega \Big[ \frac{1}{2}(u - y)^2 + \lambda |\nabla u| \Big]\, dx = \int_\Omega \Big[ (u - y)\delta u + \lambda \Big\langle \frac{\nabla u}{|\nabla u|}, \nabla \delta u \Big\rangle \Big]\, dx$$
$$= \int_\Omega (u - y)\delta u\, dx + \oint_{\partial\Omega} \frac{\lambda}{|\nabla u|} (\nabla u \cdot n)\, \delta u\, dS - \int_\Omega \lambda\, \nabla \cdot \Big( \frac{\nabla u}{|\nabla u|} \Big) \delta u\, dx \quad (37)$$
$$= \int_\Omega \Big[ (u - y) - \lambda\, \nabla \cdot \Big( \frac{\nabla u}{|\nabla u|} \Big) \Big]\, \delta u\, dx. \quad (38)$$
Again the integral over the boundary $\partial\Omega$ in (37) can be removed by the natural boundary conditions. By the fundamental lemma of the calculus of variations, the integrand factor in brackets in (38) must equal zero. Thus we get (12).

Proof of (14)-(15). The original derivation comes from Chan et al. (2002). Let $f(\kappa) = a + b\kappa^2$ and let the elastica regularization term be $R(u) = \int_\Omega f(\kappa)|\nabla u|\, dx$. We need to prove that the first variation, $R(u) \to R(u) + \delta R(u)$, under $u \to u + \delta u$ is given by
$$\delta R(u) = -\int_\Omega \big[ \nabla \cdot V(u) \big]\, \delta u\, dx,$$
where $V(u)$ is a flux field defined as
$$V(u) = f(\kappa)\, N - \frac{T}{|\nabla u|} \frac{\partial}{\partial T}\big( f'(\kappa)\, |\nabla u| \big).$$
Here $N$ is the ascending normal field $\nabla u/|\nabla u|$, and $T$ is the tangent field defined as $T = N^\perp$. Note that the exact orientation of $T$ does not matter, due to the coupling of $T$ and $\partial/\partial T$ in the expression. Since the curvature $\kappa$ is a function of $u$, the variational rules give
$$\delta R(u) = \int_\Omega \Big\{ |\nabla u|\, f'(\kappa)\, \delta\kappa + f(\kappa) \Big\langle \frac{\nabla u}{|\nabla u|}, \nabla \delta u \Big\rangle \Big\}\, dx$$
$$= \int_\Omega |\nabla u|\, f'(\kappa)\, \delta\kappa\, dx + \oint_{\partial\Omega} \frac{f(\kappa)}{|\nabla u|} (\nabla u \cdot n)\, \delta u\, dS - \int_\Omega \big[ \nabla \cdot (f(\kappa) N) \big]\, \delta u\, dx$$
$$= \int_\Omega \Big\{ |\nabla u|\, f'(\kappa)\, \delta\kappa - \big[ \nabla \cdot (f(\kappa) N) \big]\, \delta u \Big\}\, dx.$$
Here the integral over the boundary $\partial\Omega$ can be removed by the natural boundary conditions.
The variation of the curvature $\kappa = \nabla \cdot N$ is a function of $\delta u$, which can be written as
$$\delta\kappa = \nabla \cdot \Big[ \frac{1}{|\nabla u|} \nabla(\delta u) + \nabla u\, \delta\Big(\frac{1}{|\nabla u|}\Big) \Big] = \nabla \cdot \Big[ \frac{1}{|\nabla u|} \nabla(\delta u) - \frac{1}{|\nabla u|^3} \langle \nabla u, \nabla(\delta u) \rangle\, \nabla u \Big]$$
$$= \nabla \cdot \Big[ \frac{1}{|\nabla u|} \big( \nabla(\delta u) - N \langle N, \nabla(\delta u) \rangle \big) \Big] = \nabla \cdot \Big[ \frac{1}{|\nabla u|} (I - N \otimes N)\, \nabla(\delta u) \Big] = \nabla \cdot \Big[ \frac{1}{|\nabla u|} P_T\big( \nabla(\delta u) \big) \Big].$$
Here $I$ denotes the identity transform, $P_N := N \otimes N$ is the orthogonal projection onto the ascending normal direction of $u$, and $P_T := I - N \otimes N = T \otimes T$ is the orthogonal projection onto the tangent direction of $u$. Therefore, by the Gauss-Green divergence theorem, we have
$$\int_\Omega |\nabla u|\, f'(\kappa)\, \delta\kappa\, dx = \int_\Omega f'(\kappa)|\nabla u|\, \nabla \cdot \Big[ \frac{1}{|\nabla u|} P_T(\nabla(\delta u)) \Big]\, dx$$
$$= \oint_{\partial\Omega} f'(\kappa)\, P_T(\nabla(\delta u)) \cdot n\, dS - \int_\Omega \Big\langle \nabla\big( f'(\kappa)|\nabla u| \big), \frac{1}{|\nabla u|} P_T(\nabla(\delta u)) \Big\rangle\, dx$$
$$= -\int_\Omega \Big\langle \frac{1}{|\nabla u|} P_T\big[ \nabla\big( f'(\kappa)|\nabla u| \big) \big], \nabla(\delta u) \Big\rangle\, dx \quad (40)$$
$$= -\oint_{\partial\Omega} \frac{1}{|\nabla u|} P_T\big[ \nabla\big( f'(\kappa)|\nabla u| \big) \big] \cdot n\, \delta u\, dS + \int_\Omega \nabla \cdot \Big\{ \frac{1}{|\nabla u|} P_T\big[ \nabla\big( f'(\kappa)|\nabla u| \big) \big] \Big\}\, \delta u\, dx$$
$$= \int_\Omega \nabla \cdot \Big\{ \frac{1}{|\nabla u|} P_T\big[ \nabla\big( f'(\kappa)|\nabla u| \big) \big] \Big\}\, \delta u\, dx,$$
where proper natural boundary conditions are imposed to remove the integrals over the boundary $\partial\Omega$, and equation (40) follows from the symmetry of the projection operator $P_T$ in an inner product. Finally, using the definition of the directional derivative, $P_T(\nabla g) = T\, (\partial g/\partial T)$, we complete the derivation of (14)-(15):
$$\delta R(u) = \int_\Omega \Big\{ \nabla \cdot \Big[ \frac{1}{|\nabla u|} P_T\big( \nabla( f'(\kappa)|\nabla u| ) \big) \Big] - \nabla \cdot \big( f(\kappa) N \big) \Big\}\, \delta u\, dx = -\int_\Omega \nabla \cdot \Big\{ f(\kappa) N - \frac{T}{|\nabla u|} \frac{\partial}{\partial T}\big( f'(\kappa)|\nabla u| \big) \Big\}\, \delta u\, dx.$$

Appendix C: Expressions in Terms of RBF Approximations

The following are some useful expressions for the Laplacian, Hessian, and curvature of $u(x)$ in terms of the RBF approximation $u(x) = \sum_{i=1}^n w_i \phi_i(x)$, where $\phi_i(x) = \exp(-c|x - x_i|^2/2)$.

Proof of (18) for the Laplacian:
$$\frac{\partial^2 \phi_k}{\partial x^{(i)} \partial x^{(i)}} = \frac{\partial}{\partial x^{(i)}} \big[ -c(x^{(i)} - x_k^{(i)})\, \phi_k \big] = -c(x^{(i)} - x_k^{(i)}) \frac{\partial \phi_k}{\partial x^{(i)}} - c\phi_k = c\big[ c(x^{(i)} - x_k^{(i)})^2 - 1 \big]\, \phi_k,$$
and summing over $i$ gives
$$\Delta \phi_k = \sum_{i=1}^{d} \frac{\partial^2 \phi_k}{\partial x^{(i)} \partial x^{(i)}} = c\big( c|x - x_k|^2 - d \big)\, \phi_k.$$
Proof of (19) for the Hessian (for $i \ne j$):
$$\frac{\partial^2 \phi_k}{\partial x^{(i)} \partial x^{(j)}} = \frac{\partial}{\partial x^{(j)}} \big[ -c(x^{(i)} - x_k^{(i)})\, \phi_k \big] = -c(x^{(i)} - x_k^{(i)}) \big[ -c(x^{(j)} - x_k^{(j)})\, \phi_k \big] = c^2 (x^{(i)} - x_k^{(i)})(x^{(j)} - x_k^{(j)})\, \phi_k.$$
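The closed-form RBF derivatives of this appendix are easy to validate against finite differences. The following check is our own illustration, not from the paper; the dimensions, number of centers, and test point are arbitrary.

```python
import numpy as np

# Check the closed-form RBF derivatives: for
#   u(x) = sum_i w_i exp(-c|x - x_i|^2 / 2),
#   grad u(x) = -c sum_i w_i (x - x_i) phi_i(x),
#   H(u)(x)   = -c (sum_i w_i phi_i) I
#               + c^2 sum_i w_i (x - x_i)(x - x_i)^T phi_i,
# against central finite differences at a random test point.
rng = np.random.default_rng(0)
d, n, c = 3, 5, 1.7
centers = rng.standard_normal((n, d))
w = rng.standard_normal(n)

def u(x):
    return w @ np.exp(-c * np.sum((x - centers) ** 2, axis=1) / 2.0)

def grad_hess(x):
    phi = np.exp(-c * np.sum((x - centers) ** 2, axis=1) / 2.0)
    g = -c * (w * phi) @ (x - centers)
    Phi = np.einsum("i,ij,ik->jk", w * phi, x - centers, x - centers)
    H = -c * np.sum(w * phi) * np.eye(d) + c ** 2 * Phi
    return g, H

x0 = rng.standard_normal(d)
g, H = grad_hess(x0)
h = 1e-5
I = np.eye(d)
g_fd = np.array([(u(x0 + h * e) - u(x0 - h * e)) / (2 * h) for e in I])
H_fd = np.array([(grad_hess(x0 + h * e)[0] - grad_hess(x0 - h * e)[0]) / (2 * h)
                 for e in I])
print(np.max(np.abs(g - g_fd)), np.max(np.abs(H - H_fd)))  # both tiny
```

With the gradient and Hessian in hand, the curvature expression (23) follows directly from the divergence form used throughout the paper.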
(For $i = j$ we recover the diagonal case above, $\partial^2 \phi_k / \partial x^{(i)} \partial x^{(j)} = c[c(x^{(i)} - x_k^{(i)})^2 - 1]\phi_k$.) In matrix form,
$$H(\phi_k) = c^2 (x - x_k)(x - x_k)^T \phi_k - c\phi_k I.$$
Proof of (21) for the Hessian, in terms of the notation $\Phi := \sum_{i=1}^n w_i (x - x_i)(x - x_i)^T \phi_i$:
$$H(u) = \sum_{i=1}^n w_i H(\phi_i) = \sum_{i=1}^n w_i \big[ -c\phi_i I + c^2 (x - x_i)(x - x_i)^T \phi_i \big] = -c\Big( \sum_{i=1}^n w_i \phi_i \Big) I + c^2 \Phi.$$
To prove (23), and for the other derivations involving gradients in Appendix D, we list some useful expressions (note that we do not distinguish $H(u)$ from $H(u)^T$ due to symmetry):
$$\nabla(|\nabla u|^2) = 2 H(u) \nabla u, \qquad \nabla(|\nabla u|) = \frac{1}{|\nabla u|} H(u) \nabla u,$$
$$\nabla\Big( \frac{1}{|\nabla u|} \Big) = -\frac{1}{|\nabla u|^3} H(u) \nabla u, \qquad \nabla\Big( \frac{1}{|\nabla u|^3} \Big) = -\frac{3}{|\nabla u|^5} H(u) \nabla u.$$
Proof of (23) for the curvature. Writing $g := \nabla u = -c\sum_i w_i (x - x_i)\phi_i$ and $\Delta u = c\sum_i w_i (c|x - x_i|^2 - d)\phi_i$, we have
$$\kappa = \nabla \cdot \Big( \frac{\nabla u}{|\nabla u|} \Big) = \frac{\Delta u}{|\nabla u|} + \Big\langle \nabla \frac{1}{|\nabla u|}, \nabla u \Big\rangle = \frac{\Delta u}{|\nabla u|} - \frac{\nabla u^T H(u) \nabla u}{|\nabla u|^3}$$
$$= \frac{c\sum_i w_i (c|x - x_i|^2 - d)\phi_i}{|g|} - \frac{g^T \big( c^2\Phi - c(\sum_i w_i \phi_i) I \big) g}{|g|^3} = \frac{c}{|g|} \sum_i w_i \big( c|x - x_i|^2 - d + 1 \big)\phi_i - \frac{c^2\, g^T \Phi g}{|g|^3}.$$

Appendix D: Expansion of $\nabla \cdot V$ in Gradient Descent Time Marching

Here we give the derivation of the expansion (26) of $\nabla \cdot V$ used in gradient descent time marching. For simpler notation we define
$$\alpha := \nabla u^T H(u) \nabla u, \qquad \beta := \nabla u^T H(u)^2 \nabla u, \qquad \gamma := \nabla u^T H(u)^3 \nabla u.$$
By definition (16),
$$V(u) := f(\kappa) N - \frac{T}{|\nabla u|}\frac{\partial}{\partial T}\big( f'(\kappa)|\nabla u| \big) = f(\kappa) N - \frac{1}{|\nabla u|}\nabla\big( f'(\kappa)|\nabla u| \big) + \frac{1}{|\nabla u|^3}\nabla u\, \big\langle \nabla u, \nabla\big( f'(\kappa)|\nabla u| \big) \big\rangle$$
$$= (1 + b\kappa^2) N - \frac{1}{|\nabla u|}\nabla\big( 2b\kappa|\nabla u| \big) + \frac{1}{|\nabla u|^3}\nabla u\, \big\langle \nabla u, \nabla\big( 2b\kappa|\nabla u| \big) \big\rangle,$$
so that
$$\nabla \cdot V = \nabla \cdot \big[ (1 + b\kappa^2) N \big] - 2b\, \nabla \cdot \Big[ \frac{1}{|\nabla u|}\nabla\big( \kappa|\nabla u| \big) \Big] + 2b\, \nabla \cdot \Big[ \frac{1}{|\nabla u|^3}\nabla u\, \big( \nabla u^T \nabla( \kappa|\nabla u| ) \big) \Big]. \quad (41)$$
We expand the three parts on the right side of (41) in turn, repeatedly using the formula for the gradient of a dot product, $\nabla(a \cdot b) = (\nabla a)\, b + (\nabla b)\, a$, where $\nabla a$ denotes the $d \times d$ Jacobian matrix with entries $\partial a_j / \partial x_i$, and omitting third-order derivatives of $u$ (denoted by $\doteq$) for easier calculation.

Part 1: The first term can be expanded as
$$\nabla \cdot N + b\, \nabla \cdot (\kappa^2 N) = \kappa + b\big[ \nabla(\kappa^2) \cdot N + \kappa^2\, \nabla \cdot N \big] = \kappa + b\big( 2\kappa\, \nabla\kappa \cdot N + \kappa^3 \big),$$
where, omitting third-order derivatives, the gradient of the curvature is
$$\nabla\kappa \doteq -\frac{1}{|\nabla u|^3}\big[ H(u)^2 \nabla u + \Delta u\, H(u)\nabla u \big].$$
Therefore the first term on the right side of (41) can be written as
$$\nabla \cdot \big[ (1 + b\kappa^2) N \big] \doteq \kappa + b\kappa^3 - \frac{2b\kappa}{|\nabla u|^4}\big( \alpha\, \Delta u + \beta \big).$$
Part 2: The second term on the right side of (41) expands, after cancellations, to
$$\nabla \cdot \Big[ \frac{1}{|\nabla u|}\nabla\big( \kappa|\nabla u| \big) \Big] \doteq \frac{1}{|\nabla u|^2}\nabla u^T H(u)\nabla\kappa - \frac{\kappa}{|\nabla u|^4}\, \beta \doteq -\frac{\Delta u\, \beta + \gamma}{|\nabla u|^5} - \frac{\kappa\beta}{|\nabla u|^4}.$$
Part 3: For the third term, with the notation $v := \nabla(\kappa|\nabla u|)$, we have
$$\nabla \cdot \Big[ \frac{1}{|\nabla u|^3}\nabla u\, \big( \nabla u \cdot v \big) \Big] = \frac{1}{|\nabla u|^3}\nabla(\nabla u \cdot v) \cdot \nabla u + \Big( \frac{\Delta u}{|\nabla u|^3} - \frac{3\alpha}{|\nabla u|^5} \Big)\big( \nabla u \cdot v \big),$$
where
$$\nabla u \cdot v = \nabla u \cdot \big[ |\nabla u|\, \nabla\kappa + \kappa\, \nabla(|\nabla u|) \big] \doteq \frac{\kappa\alpha}{|\nabla u|} - \frac{\Delta u\, \alpha + \beta}{|\nabla u|^2}.$$
Putting the three parts together and simplifying yields the expansion (26):
$$\nabla \cdot V \doteq \kappa + b\kappa^3 - \frac{2b(\Delta u)^2}{|\nabla u|^5}\alpha + 6b\Big( \frac{\Delta u}{|\nabla u|^7} - \frac{\kappa}{|\nabla u|^6} \Big)\alpha^2 + \frac{6b}{|\nabla u|^7}\alpha\beta + \frac{2b}{|\nabla u|^5}\gamma.$$

References

L. Ambrosio and S. Masnou. A direct variational approach to a problem arising in image reconstruction. Interfaces and Free Boundaries, 5:63-81, 2003.

L. Ambrosio, N. Fusco, and D. Pallara. Functions of Bounded Variation and Free Discontinuity Problems. Oxford University Press, 2000.

A. Asuncion and D.J. Newman.
UCI Machine Learning Repository, 2013. URL http://archive.ics.uci.edu/ml/.
G. Aubert and P. Kornprobst. Mathematical Problems in Image Processing: Partial Differential Equations and the Calculus of Variations. Springer-Verlag, 2nd edition, 2006.
E. Bae, J. Shi, and X.C. Tai. Graph cuts for curvature based image denoising. IEEE Transactions on Image Processing, 20(5):1199-1210, 2011.
P.L. Bartlett and M. Traskin. AdaBoost is consistent. Journal of Machine Learning Research, 8:2347-2368, 2007.
P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, March 2006.
M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(48):2399-2434, 2006.
G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9:2015-2033, 2008.
C.M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.
S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.
O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Advanced Lectures in Machine Learning. Springer, 2004.
K. Bredies, T. Pock, and B. Wirth. A convex, lower semi-continuous approximation of Euler's elastica energy. Preprint, 2013.
L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, pages 161-168, Pittsburgh, Pennsylvania, 2006.
T.F. Chan and J. Shen. Nontexture inpainting by curvature-driven diffusions (CDD). Journal of Visual Communication and Image Representation, 12:436-449, 2001.
T.F. Chan and J. Shen. Image Processing and Analysis: Variational, PDE, Wavelet, and Stochastic Methods. SIAM, 2005.
T.F. Chan, S.H. Kang, and J. Shen. Euler's elastica and curvature-based inpaintings. SIAM Journal on Applied Mathematics, 63:564-592, 2002.
C.C. Chang and C.J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:1-27, 2011. URL http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
J.B. Conway. A Course in Functional Analysis. Springer-Verlag, 2nd edition, 1990.
D. Cossock and T. Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11):5140-5154, 2008.
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
M. do Carmo. Differential Geometry of Curves and Surfaces. Prentice-Hall, 1976.
Y. Duan, Y. Wang, and J. Hahn. A fast augmented Lagrangian algorithm for Euler's elastica models. Numerical Mathematics: Theory, Methods and Applications, 6(1):47-71, 2013.
J.C. Duchi, L.W. Mackey, and M.I. Jordan. On the consistency of ranking algorithms. In Proceedings of the 27th International Conference on Machine Learning, pages 327-334, Haifa, Israel, 2010.
M. Fernández-Delgado, E. Cernadas, and S. Barro. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133-3181, 2014.
M.S. Floater and A. Iske. Multistep scattered data interpolation using compactly supported radial basis functions. Journal of Computational and Applied Mathematics, 73(1-2):65-78, 1996.
C.G. Fraser. Mathematical technique and physical conception in Euler's investigation of the elastica. Centaurus, 34(3):211-246, 1991.
W. Gao and Z.H. Zhou. On the consistency of multi-label learning. Artificial Intelligence, 199-200:22-44, 2013.
E. Giusti. Minimal Surfaces and Functions of Bounded Variation. Birkhäuser, Boston, 1994.
T. Glasmachers. Universal consistency of multi-class support vector classification. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems, Vancouver, Canada, 2010.
B.I. Golubov and A.G. Vitushkin. Variation of a function. In M. Hazewinkel, editor, Encyclopedia of Mathematics. Springer, 2001. URL http://www.encyclopediaofmath.org/index.php/Function_of_bounded_variation, updated in 2013.
J. Hahn, G.J. Chung, Y. Wang, and X.C. Tai. Fast algorithms for p-elastica energy with the application to image inpainting and curve reconstruction. In Proceedings of the International Conference on Scale Space and Variational Methods in Computer Vision, pages 169-182. Springer, 2011.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2009.
C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415-425, 2002.
C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector classification. 2007. URL http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
X. Huang, L. Shi, and J.A.K. Suykens. Ramp loss linear programming support vector machine. Journal of Machine Learning Research, 15(6):2185-2211, 2014.
J.K. Hunter. Chapter 7: Lp spaces. Course Notes of Measure Theory, 2011. URL http://www.math.ucdavis.edu/~hunter/m206/ch6_measure_notes.pdf.
G. Kanizsa. Organization in Vision. Praeger, New York, 1979.
N. Komodakis and N. Paragios. Beyond pairwise energies: Efficient optimization for higher-order MRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2985-2992, 2009.
S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 275-282, Alberta, Canada, 2002.
J.M. Lee. Riemannian Manifolds: An Introduction to Curvature. Springer-Verlag, 1997.
G.P. Leonardi and S. Masnou. Locality of the mean curvature of rectifiable varifolds. Advances in Calculus of Variations, 2(1):17-42, 2009.
R. Levien. The elastica: A mathematical history. Technical report, EECS Department, University of California, Berkeley, 2008.
S. Masnou and J.-M. Morel. Level lines based disocclusion. In Proceedings of the 5th IEEE International Conference on Image Processing, pages 259-263, Chicago, Illinois, October 1998.
M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
J.M. Morel and S. Solimini. Variational methods in image segmentation. In Progress in Nonlinear Differential Equations and Their Applications. Birkhäuser, Boston, 1995.
D. Mumford. Elastica and computer vision. In C.L. Bajaj, editor, Algebraic Geometry and Its Applications, pages 491-506. Springer-Verlag, New York, 1994.
K.P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
B. Nadler, N. Srebro, and X. Zhou. Semi-supervised learning with the graph Laplacian: The limit of infinite unlabelled data. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems, pages 1330-1338, Vancouver, B.C., Canada, 2009.
M. Nandan, P.P. Khargonekar, and S.S. Talathi. Fast SVM training using approximate extreme points. Journal of Machine Learning Research, 15(1):59-98, 2014.
F. Nie, Y. Huang, and H. Huang. Linear time solver for primal SVM. In Proceedings of the 31st International Conference on Machine Learning, pages 505-513, Beijing, 2014.
S. Osher and J.A. Sethian. Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics, 79(1):12-49, 1988.
T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428:419-422, 2004.
R.M. Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. PhD thesis, MIT, 2002.
L.I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60:259-268, 1992.
R.E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
M.R. Spiegel and S. Lipschutz. Vector Analysis. McGraw-Hill, 2nd edition, 2009.
M. Spivak. A Comprehensive Introduction to Differential Geometry, volumes 3-4. Publish or Perish Press, 3rd edition, 1999.
I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128-142, 2005.
X.C. Tai, J. Hahn, and G.J. Chung. A fast algorithm for Euler's elastica model using augmented Lagrangian method. SIAM Journal on Imaging Sciences, 4(1):313-344, 2011.
A. Tewari and P.L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007-1025, 2007.
B. van Brunt. The Calculus of Variations. Springer-Verlag, 2004.
V.N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
K.R. Varshney and A.S. Willsky. Classification using geometric level sets. Journal of Machine Learning Research, 11(2):491-516, 2010.
U. von Luxburg and B. Schölkopf. Statistical learning theory: Models, concepts, and results. Technical report, arXiv:0810.4752, 2008.
J. Wang, P. Wonka, and J. Ye. Scaling SVM and least absolute deviations via exact data reduction. In Proceedings of the 31st International Conference on Machine Learning, pages 523-531, Beijing, 2014.
H. Wendland. Piecewise polynomial, positive definite and compactly supported radial functions of minimal degree. Advances in Computational Mathematics, 4(1):389-396, 1995.
F. Xia, T.Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: Theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning, pages 1192-1199, Helsinki, Finland, 2008.
K. Yosida. Functional Analysis. Springer-Verlag, 6th edition, 1999.
A. Zakai and Y. Ritov. Consistency and localizability. Journal of Machine Learning Research, 10:827-856, 2009.
T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56-85, 2004a.
T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, 2004b.
D. Zhou and B. Schölkopf. Regularization on discrete spaces. In Proceedings of the 27th DAGM Symposium on Pattern Recognition, pages 361-368. Springer, Berlin, 2005.