# Learning with Fredholm Kernels

Qichao Que, Mikhail Belkin, Yusu Wang
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
{que,mbelkin,yusu}@cse.ohio-state.edu

**Abstract.** In this paper we propose a framework for supervised and semi-supervised learning based on reformulating the learning problem as a regularized Fredholm integral equation. Our approach fits naturally into the kernel framework and can be interpreted as constructing new data-dependent kernels, which we call Fredholm kernels. We proceed to discuss the noise assumption for semi-supervised learning and provide both theoretical and experimental evidence that Fredholm kernels can effectively utilize unlabeled data under the noise assumption. We demonstrate that methods based on Fredholm learning show very competitive performance in the standard semi-supervised learning setting.

## 1 Introduction

Kernel methods and methods based on integral operators have become one of the central areas of machine learning and learning theory. These methods combine rich mathematical foundations with strong empirical performance. In this paper we propose a framework for supervised and semi-supervised learning as an inverse problem, based on solving the integral equation known as the Fredholm problem of the first kind. We develop regularization-based algorithms for solving these systems, leading to what we call Fredholm kernels.

In the basic setting of supervised learning we are given a data set $(x_i, y_i)$, where $x_i \in X$ and $y_i \in \mathbb{R}$. We would like to construct a function $f: X \to \mathbb{R}$ such that $f(x_i) \approx y_i$ and $f$ is nice enough to generalize to new data points. This is typically done by choosing $f$ from a class of functions (a Reproducing Kernel Hilbert Space (RKHS) corresponding to a positive definite kernel, in the case of kernel methods) and optimizing a certain loss function, such as the square loss or hinge loss.

In this paper we formulate a new framework for learning based on interpreting the learning problem as a Fredholm integral equation. This formulation shares some similarities with the usual kernel learning framework but, unlike the standard methods, also allows for easy incorporation of unlabeled data. We also show how to interpret the resulting algorithm as a standard kernel method with a non-standard data-dependent kernel (somewhat resembling the approach taken in [13]). We discuss reasons why incorporation of unlabeled data may be desirable, concentrating in particular on what may be termed the noise assumption for semi-supervised learning, which is related to but distinct from the manifold and cluster assumptions popular in the semi-supervised learning literature. We provide both theoretical and empirical results showing that the Fredholm formulation allows for efficient denoising of classifiers.

To summarize, the main contributions of the paper are as follows: (1) We formulate a new framework based on solving a regularized Fredholm equation. The framework naturally combines labeled and unlabeled data. We show how this framework can be expressed as a kernel method with a non-standard data-dependent kernel. (2) We discuss the noise assumption in semi-supervised learning and provide some theoretical evidence that Fredholm kernels are able to improve the performance of classifiers under this assumption. More specifically, we analyze the behavior of several versions of Fredholm kernels, based on combining linear and Gaussian kernels.
We demonstrate that for some models of the noise assumption, the Fredholm kernel provides better estimators than the traditional data-independent kernel, and thus unlabeled data provably improves inference. (3) We show that Fredholm kernels perform well on synthetic examples designed to illustrate the noise assumption as well as on a number of real-world datasets.

**Related work.** Kernel and integral methods in machine learning have a large and diverse literature (e.g., [12, 11]). The work most directly related to our approach is [10], where Fredholm integral equations were introduced to address the problem of density ratio estimation and covariate shift. In that work the problem of density ratio estimation was expressed as a Fredholm integral equation and solved using regularization in RKHS. This setting also relates to a line of work on kernel mean embedding, where data points are embedded in Reproducing Kernel Hilbert Spaces using integral operators, with applications to density ratio estimation and other tasks [5, 6, 7]. A very interesting recent work [9] explores a shrinkage estimator for estimating means in RKHS, following the James-Stein estimator originally used for estimating the mean in a Euclidean space. The results obtained in [9] show how such estimators can reduce variance. There is some similarity between that work and our theoretical results presented in Section 4, which also show variance reduction for certain estimators of the kernel, although in a different setting. Another line of related work is the class of semi-supervised learning techniques (see [15, 2] for a comprehensive overview) related to manifold regularization [1], where an additional graph Laplacian regularizer is added to take advantage of the geometric/manifold structure of the data. Our reformulation of Fredholm learning as a kernel, addressing what we call the noise assumption, parallels the data-dependent kernels for manifold regularization proposed in [13].

## 2 Fredholm Kernels

We start by formulating the learning framework proposed in this paper. Suppose we are given $l$ labeled pairs $(x_1, y_1), \ldots, (x_l, y_l)$ from the data distribution $p(x, y)$ defined on $X \times Y$, and $u$ unlabeled points $x_{l+1}, \ldots, x_{l+u}$ from the marginal distribution $p_X(x)$ on $X$. For simplicity we will assume that the feature space $X$ is a Euclidean space $\mathbb{R}^D$, and the label set $Y$ is either $\{-1, 1\}$ for binary classification or the real line $\mathbb{R}$ for regression. Semi-supervised learning algorithms aim to construct a (predictor) function $f: X \to Y$ by incorporating the information of the unlabeled data distribution.

To this end, we introduce the integral operator $\mathcal{K}_{p_X}$ associated with a kernel function $k(x, z)$. In our setting $k(x, z)$ does not have to be a positive semi-definite (or even symmetric) kernel. We define $\mathcal{K}_{p_X}: L_2 \to L_2$ by

$$\mathcal{K}_{p_X} f(x) = \int k(x, z)\, f(z)\, p_X(z)\, dz, \qquad (1)$$

where $L_2$ is the space of square-integrable functions. By the law of large numbers, the above operator can be approximated using labeled and unlabeled data from $p_X$ as

$$\hat{\mathcal{K}}_{p_X} f(x) = \frac{1}{l+u} \sum_{i=1}^{l+u} k(x, x_i)\, f(x_i).$$

This approximation provides a natural way of incorporating unlabeled data into algorithms. In our Fredholm learning framework, we will use functions in $\mathcal{K}_{p_X}\mathcal{H} = \{\mathcal{K}_{p_X} f : f \in \mathcal{H}\}$, where $\mathcal{H}$ is an appropriate Reproducing Kernel Hilbert Space (RKHS), as classification or regression functions. Note that unlike the RKHS itself, this space of functions, $\mathcal{K}_{p_X}\mathcal{H}$, is density dependent.
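As a concrete illustration of the empirical operator $\hat{\mathcal{K}}_{p_X}$, here is a minimal Python sketch (not from the paper; the Gaussian choice of $k$, the toy distribution, and the function $f$ are placeholder assumptions):

```python
import numpy as np

def k_gauss(x, data, t=1.0):
    """Outer kernel k(x, x_i) evaluated against all sample points x_i; width t is a placeholder."""
    return np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * t))

def empirical_operator(f, data, t=1.0):
    """Return x -> (1/(l+u)) * sum_i k(x, x_i) f(x_i), the discretization of K_{p_X} f."""
    f_vals = np.array([f(xi) for xi in data])   # f evaluated on the labeled + unlabeled sample
    def Kf(x):
        return np.mean(k_gauss(x, data, t) * f_vals)
    return Kf

# Toy usage on 200 points from a placeholder 2-D distribution.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
f = lambda x: x[0] - x[1]                       # an arbitrary stand-in for a function in H
Kf = empirical_operator(f, X)
print(Kf(np.zeros(2)))                          # value of the smoothed function at the origin
```

The point is only that $\hat{\mathcal{K}}_{p_X} f$ can be evaluated anywhere using the sample, which is what makes the resulting space of classifiers density dependent.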
In particular, this now allows us to formulate the following optimization problem for semi-supervised classification/regression in a way similar to many supervised learning algorithms. The Fredholm learning framework solves

$$f^* = \arg\min_{f \in \mathcal{H}}\ \frac{1}{l} \sum_{i=1}^{l} \big((\hat{\mathcal{K}}_{p_X} f)(x_i) - y_i\big)^2 + \lambda \|f\|_{\mathcal{H}}^2. \qquad (2)$$

(We use the square loss to simplify the exposition; other loss functions can also be used in Eqn 2.) The final classifier is $c(x) = (\hat{\mathcal{K}}_{p_X} f^*)(x)$, where $\hat{\mathcal{K}}_{p_X}$ is the operator defined above. Eqn 2 is a discretized and regularized version of the Fredholm integral equation $\mathcal{K}_{p_X} f = y$, which gives the Fredholm learning framework its name. Even though at first glance this setting looks similar to conventional kernel methods, the extra layer introduced by $\hat{\mathcal{K}}_{p_X}$ makes a significant difference, in particular by allowing the integration of information from the unlabeled data distribution. In contrast, solutions to standard kernel methods for most kernels, e.g., linear, polynomial or Gaussian kernels, are completely independent of the unlabeled data. We note that our approach is closely related to [10], where a Fredholm equation is used to estimate the density ratio of two probability distributions.

The Fredholm learning framework is a generalization of the standard kernel framework. In fact, if the kernel $k$ is the $\delta$-function, then our formulation above is equivalent to Regularized Kernel Least Squares, $f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{l}\sum_{i=1}^{l}(f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}}^2$. We could also replace the $L_2$ loss in Eqn 2 by other loss functions, such as the hinge loss, resulting in an SVM-like classifier. Finally, even though Eqn 2 is an optimization problem in a potentially infinite-dimensional function space $\mathcal{H}$, a standard derivation using the Representer Theorem (see the full version for details) yields a computationally accessible solution:

$$f^*(x) = \frac{1}{l+u} \sum_{j=1}^{l+u} k_{\mathcal{H}}(x, x_j)\, v_j, \qquad v = \big(K_{l+u}^T K_{l+u} K_{\mathcal{H}} + \lambda I\big)^{-1} K_{l+u}^T y, \qquad (3)$$

where $(K_{l+u})_{ij} = k(x_i, x_j)$ for $1 \le i \le l$, $1 \le j \le l+u$, and $(K_{\mathcal{H}})_{ij} = k_{\mathcal{H}}(x_i, x_j)$ for $1 \le i, j \le l+u$. Note that $K_{l+u}$ is an $l \times (l+u)$ matrix.

**Fredholm kernels: a convenient reformulation.** In fact we will see that the Fredholm learning problem induces a new data-dependent kernel, which we will refer to as the Fredholm kernel. (We note that the term "Fredholm kernel" has been used in mathematics ([8], page 103) and also in a different learning context [14]; our usage represents a different object.) To show this connection, we use the following identity, which can be easily verified:

$$\big(K_{l+u}^T K_{l+u} K_{\mathcal{H}} + \lambda I\big)^{-1} K_{l+u}^T = K_{l+u}^T \big(K_{l+u} K_{\mathcal{H}} K_{l+u}^T + \lambda I\big)^{-1}.$$

Define $K_F = K_{l+u} K_{\mathcal{H}} K_{l+u}^T$ to be the $l \times l$ kernel matrix associated with the new kernel

$$\hat{k}_F(x, z) = \frac{1}{(l+u)^2} \sum_{i,j=1}^{l+u} k(x, x_i)\, k_{\mathcal{H}}(x_i, x_j)\, k(z, x_j), \qquad (4)$$

where the unlabeled data are considered fixed for the purpose of computing this new kernel. Using $\hat{k}_F$, the final classifying function from Eqn 3 can be rewritten as

$$c^*(x) = \frac{1}{l+u} \sum_{i=1}^{l+u} k(x, x_i)\, f^*(x_i) = \sum_{s=1}^{l} \hat{k}_F(x, x_s)\, \alpha_s, \qquad \alpha = (K_F + \lambda I)^{-1} y.$$

Because of Eqn 4 we will sometimes refer to the kernels $k_{\mathcal{H}}$ and $k$ as the inner and outer kernels, respectively. It can be observed that this solution is equivalent to a standard kernel method, but with a new data-dependent kernel $\hat{k}_F$, which we call the Fredholm kernel, since it is induced from the Fredholm problem formulated in Eqn 2.

**Proposition 1.** The Fredholm kernel defined in Eqn 4 is positive semi-definite as long as $K_{\mathcal{H}}$ is positive semi-definite, for any set of data $x_1, \ldots, x_{l+u}$.

The proof is given in the full version. The outer kernel $k$ does not have to be either positive definite or even symmetric.
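To make the matrix form of Eqns 3 and 4 concrete, here is a minimal sketch of training and prediction with the Fredholm kernel (a hedged illustration, not the paper's implementation: the Gaussian choices for $k$ and $k_{\mathcal{H}}$, the widths `t` and `r`, and the regularization value `lam` are placeholders):

```python
import numpy as np

def gauss(A, B, width):
    """Pairwise Gaussian kernel matrix between rows of A and rows of B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * width))

def fredholm_fit(X_lab, y, X_all, t=1.0, r=1.0, lam=1e-2):
    """Regularized Fredholm least squares (square-loss version of Eqn 2).

    X_lab: l labeled points, y: their labels, X_all: all l+u points."""
    n = X_all.shape[0]
    K_lu = gauss(X_lab, X_all, t)          # l x (l+u), outer kernel k
    K_H = gauss(X_all, X_all, r)           # (l+u) x (l+u), inner kernel k_H
    K_F = (K_lu @ K_H @ K_lu.T) / n**2     # l x l Fredholm kernel matrix (Eqn 4)
    alpha = np.linalg.solve(K_F + lam * np.eye(len(y)), y)
    return alpha, X_lab, X_all, K_H, t

def fredholm_predict(model, X_new):
    """Evaluate c(x) = sum_s k_F(x, x_s) alpha_s at new points."""
    alpha, X_lab, X_all, K_H, t = model
    n = X_all.shape[0]
    K_new = gauss(X_new, X_all, t)                 # k(x, x_i)
    K_lab = gauss(X_lab, X_all, t)                 # k(x_s, x_j)
    K_F_new = (K_new @ K_H @ K_lab.T) / n**2       # k_F(x, x_s)
    return K_F_new @ alpha
```

The $1/(l+u)^2$ factor in Eqn 4 only rescales $K_F$; as remarked later in the paper, such normalization is effectively absorbed into the choice of the regularization parameter $\lambda$.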
When the Gaussian kernel is used for $k$, the discrete approximation in Eqn 4 can be unstable when the kernel width is small, so we also introduce the normalized Fredholm kernel

$$\hat{k}_F^N(x, z) = \sum_{i,j=1}^{l+u} \frac{k(x, x_i)}{\sum_n k(x, x_n)}\, k_{\mathcal{H}}(x_i, x_j)\, \frac{k(z, x_j)}{\sum_n k(z, x_n)}. \qquad (5)$$

It is easy to check that the resulting Fredholm kernel $\hat{k}_F^N$ is still symmetric positive semi-definite. Even though the Fredholm kernel was derived here using the $L_2$ loss, it can also be derived when the hinge loss is used, as explained in the full version.

## 3 The Noise Assumption and Semi-supervised Learning

In order for unlabeled data to be useful in classification tasks it is necessary for the marginal distribution of the unlabeled data to contain information about the conditional distribution of the labels. Several ways in which such information can be encoded have been proposed, including the cluster assumption [3] and the manifold assumption [1]. The cluster assumption states that a cluster (or a high-density area) contains only (or mostly) points belonging to the same class; that is, if $x_1$ and $x_2$ belong to the same cluster, the corresponding labels $y_1, y_2$ should be the same. The manifold assumption states that the regression function is smooth with respect to the underlying manifold structure of the data, which can be interpreted as saying that the geodesic distance should be used instead of the ambient distance for optimal classification. The success of algorithms based on these ideas indicates that these assumptions do capture certain characteristics of real data. Still, a better understanding of unlabeled data may lead to further progress in data analysis.

*Figure 1: Left: only labeled points. Right: with unlabeled points.*

**The noise assumption.** We propose to formulate a new assumption, the noise assumption: in the neighborhood of every point, the directions with low variance (for the unlabeled data) are uninformative with respect to the class labels and can be regarded as noise. While intuitive, as far as we know it has not been explicitly formulated in the context of semi-supervised learning algorithms, nor applied to theoretical analysis. Note that even if the noise variance is small along any single direction, it can still significantly decrease the performance of a supervised learning algorithm if the noise is high-dimensional. These accumulated non-informative variations in particular increase the difficulty of learning a good classifier when the amount of labeled data is small. The left panel of Figure 1 illustrates the issue of noise with two labeled points: the seemingly optimal classification boundary (the red line) differs from the correct one (in black) due to the noisy variation along the y axis for the two labeled points. Intuitively, the unlabeled data shown in the right panel of Figure 1 can be helpful in this setting, since low-variance directions can be estimated locally, allowing algorithms to suppress the influence of the noisy variation when learning a classifier.

**Connection to cluster and manifold assumptions.** The noise assumption is compatible with the manifold assumption within the manifold+noise model. Specifically, we can assume that the functions of interest vary along the manifold and are constant in the orthogonal direction.
Alternatively, we can think of the directions with high variance as signal/manifold and the directions with low variance as noise. We note that the noise assumption does not require the data to conform to a low-dimensional manifold in the strict mathematical sense of the word. The noise assumption is also orthogonal to the cluster assumption: for example, Figure 1 illustrates a situation where the data has no clusters but the noise assumption applies.

## 4 Theoretical Results for Fredholm Kernels

Non-informative variation in the data can degrade traditional supervised learning algorithms. We will now show that Fredholm kernels can be used in place of traditional kernels to inject them with noise-suppression power with the help of unlabeled data. In this section we present two views of how such noise suppression can be achieved. Specifically, in Section 4.1 we show that, under a certain setup, the linear Fredholm kernel suppresses principal components with small variance. In Section 4.2 we prove that under certain conditions we can provide good approximations to the true kernel on the hidden underlying space. To make the arguments cleaner, we assume that there is an infinite amount of unlabeled data; that is, we know the marginal distribution of the data exactly. We will then consider the following continuous versions of the un-normalized and normalized Fredholm kernels from Eqns 4 and 5:

$$k_F^U(x, z) = \int\!\!\int k(x, u)\, k_{\mathcal{H}}(u, v)\, k(z, v)\, p(u)\, p(v)\, du\, dv, \qquad (6)$$

$$k_F^N(x, z) = \int\!\!\int \frac{k(x, u)}{\int k(x, w)\, p(w)\, dw}\, k_{\mathcal{H}}(u, v)\, \frac{k(z, v)}{\int k(z, w)\, p(w)\, dw}\, p(u)\, p(v)\, du\, dv. \qquad (7)$$

In the above equations and in what follows, we sometimes write $p$ instead of $p_X$ for the marginal distribution when the choice is clear from context. We will typically use $k_F$ to denote the appropriate normalized or un-normalized kernel, depending on the context.

### 4.1 Linear Fredholm kernels and inner products

For this section, we consider the un-normalized Fredholm kernel, that is, $k_F = k_F^U$. If the outer kernel $k(u, v)$ is linear, i.e. $k(u, v) = \langle u, v \rangle$, the resulting Fredholm kernel can be viewed as an inner product. Specifically, the un-normalized Fredholm kernel from Eqn 6 can be rewritten as

$$k_F(x, z) = x^T \Sigma_F z, \qquad \text{where } \Sigma_F = \int\!\!\int u\, k_{\mathcal{H}}(u, v)\, v^T\, p(u)\, p(v)\, du\, dv.$$

Thus $k_F(x, z)$ is simply an inner product which depends on both the unlabeled data distribution $p(x)$ and the inner kernel $k_{\mathcal{H}}$. This inner product re-weights the standard norm in feature space based on the variances along the principal directions of the matrix $\Sigma_F$. We show that, when the unlabeled data is sampled from a normal distribution, this kernel can be viewed as a soft-thresholding PCA, suppressing the directions with low variance. Specifically, we have the following result. (The proof of this and other results can be found in the full version.)

**Theorem 2.** Let $k_{\mathcal{H}}(x, z) = \exp\left(-\frac{\|x - z\|^2}{2t}\right)$ and assume the distribution $p_X$ of the unlabeled data is a single multivariate normal distribution $N(\mu, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_D^2))$. Then

$$\Sigma_F = \prod_{i=1}^{D}\left(\frac{t}{2\sigma_i^2 + t}\right)^{1/2} \left(\mu\mu^T + \mathrm{diag}\left(\frac{\sigma_1^4}{2\sigma_1^2 + t}, \ldots, \frac{\sigma_D^4}{2\sigma_D^2 + t}\right)\right).$$

Assuming that the data is mean-subtracted, i.e. $\mu = 0$, we see that $x^T \Sigma_F z$ re-scales the projections along the principal components when computing the inner product; up to the common constant above, the rescaling factor for the $i$-th principal direction is $\sqrt{\frac{\sigma_i^4}{2\sigma_i^2 + t}}$. Note that $\frac{\sigma_i^4}{2\sigma_i^2 + t} \to 0$ when $\sigma_i^2 \ll t$. On the other hand, when $\sigma_i^2 \gg t$, we have $\frac{\sigma_i^4}{2\sigma_i^2 + t} \approx \frac{\sigma_i^2}{2}$. Hence $t$ can be considered a soft threshold that eliminates the effects of principal components with small variances. When $t$ is small, the rescaling factors are approximately proportional to $\mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_D^2)$, in which case $\Sigma_F$ is proportional to the covariance matrix of the data, $XX^T$.
### 4.2 Kernel Approximation With Noise

We have seen that one special case of the Fredholm kernel can achieve the effect of principal-component re-scaling by using a linear kernel as the outer kernel $k$. In this section we give a more general interpretation of noise suppression by the Fredholm kernel.

First, we describe a simple scenario to provide some intuition behind the definition of the Fredholm kernel. Consider a standard supervised learning setting which uses the solution $f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{l}\sum_{i=1}^{l}(f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}}^2$ as the classifier. Let $k_{\mathcal{H}}^{target}$ denote the ideal kernel that we intend to use on the clean data, which we call the target kernel from now on. Now suppose what we have are two noisy labeled points $x_e$ and $z_e$ for true points $\bar{x}$ and $\bar{z}$, i.e. $x_e = \bar{x} + \varepsilon_x$, $z_e = \bar{z} + \varepsilon_z$. The evaluation of $k_{\mathcal{H}}^{target}(x_e, z_e)$ can be quite different from the true signal $k_{\mathcal{H}}^{target}(\bar{x}, \bar{z})$, leading to a suboptimal final classifier (the red line in Figure 1, left). On the other hand, consider the Fredholm kernel from Eqn 6 (or similarly from Eqn 7),

$$k_F(x_e, z_e) = \int\!\!\int k(x_e, u)\, p(u)\, k_{\mathcal{H}}(u, v)\, k(z_e, v)\, p(v)\, du\, dv,$$

and set the outer kernel $k$ to be the Gaussian kernel and the inner kernel $k_{\mathcal{H}}$ to be the same as the target kernel $k_{\mathcal{H}}^{target}$. We can think of $k_F(x_e, z_e)$ as an average of $k_{\mathcal{H}}(u, v)$ over all possible pairs of data $u, v$, weighted by $k(x_e, u)\, p(u)$ and $k(z_e, v)\, p(v)$ respectively. Specifically, points that are close to $x_e$ (resp. $z_e$) and lie in high-density regions receive larger weights. Hence the weighted averages are biased towards $\bar{x}$ and $\bar{z}$ respectively (which presumably lie in high-density regions near $x_e$ and $z_e$), and the value of $k_F(x_e, z_e)$ tends to provide a more accurate estimate of $k_{\mathcal{H}}(\bar{x}, \bar{z})$. (In the accompanying figure of the original paper, arrows indicate points with stronger influence on the computation of $k_F(x_e, z_e)$ than on $k_{\mathcal{H}}(x_e, z_e)$.) As a result, the classifier obtained using the Fredholm kernel will also be more resilient to noise and closer to the optimum.

The Fredholm learning framework is rather flexible in the choice of the kernels $k$ and $k_{\mathcal{H}}$. In the remainder of this section, we consider a few specific scenarios and provide quantitative analysis showing the noise robustness of the Fredholm kernel.

**Problem setup.** Assume that the ground-truth distribution lies in the subspace spanned by the first $d$ dimensions of the Euclidean space $\mathbb{R}^D$, and that this distribution is a single Gaussian $N(0, \lambda^2 I_d)$. Suppose this distribution is corrupted with Gaussian noise along the orthogonal subspace of dimension $D - d$. That is, for any true point $\bar{x}$ drawn from $N(0, \lambda^2 I_d)$, its observation $x_e$ is drawn from $N(\bar{x}, \sigma^2 I_{D-d})$, where the noise lies in the subspace orthogonal to the data distribution. This means that any observed point, labeled or unlabeled, is sampled from $p_X = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d}))$. We will show that, given unlabeled data, the Fredholm kernel provides a better approximation to the original kernel than simply computing the kernel on the noisy points. We choose this basic setting to be able to state the theoretical results in a clean manner. Even though this is a Gaussian distribution over a linear subspace with noise, the framework has more general implications, since local neighborhoods of manifolds are (almost) linear spaces.

*Note:* In this section we use the normalized Fredholm kernel given in Eqn 7, that is, $k_F = k_F^N$ from now on. The un-normalized Fredholm kernel displays similar behavior, but the bounds are trickier.
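Before stating the results, here is a minimal Monte Carlo sketch of the problem setup just described, comparing a plain linear-kernel evaluation on the noisy observations with the plug-in normalized Fredholm estimate computed from unlabeled data. All dimensions, variances, the kernel width, and the sample size below are arbitrary illustrative choices, and a single run is not a substitute for the statements of Theorems 3 and 4:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 2, 50                 # signal and ambient dimensions (illustrative choices)
lam2, sig2 = 1.0, 0.1        # signal variance lambda^2 and noise variance sigma^2
t = 0.5                      # width of the outer Gaussian kernel k

def sample_clean(n):
    """Clean points supported on the first d coordinates, from N(0, lambda^2 I_d)."""
    x = np.zeros((n, D))
    x[:, :d] = rng.normal(scale=np.sqrt(lam2), size=(n, d))
    return x

def add_noise(points):
    """Add N(0, sigma^2) noise along the D-d coordinates orthogonal to the signal."""
    noisy = points.copy()
    noisy[:, d:] += rng.normal(scale=np.sqrt(sig2), size=(len(points), D - d))
    return noisy

def k_outer(A, B):
    """Outer Gaussian kernel matrix between rows of A and rows of B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * t))

# One clean pair, its noisy observations, and a pool of unlabeled observations.
xbar, zbar = sample_clean(1), sample_clean(1)
xe, ze = add_noise(xbar), add_noise(zbar)
U = add_noise(sample_clean(2000))

# Normalized Fredholm estimate (Eqns 5/7) with linear inner kernel k_H(u, v) = u^T v:
# the inner product of the kernel-weighted means of the unlabeled data around xe and ze.
wx, wz = k_outer(xe, U), k_outer(ze, U)
mx, mz = (wx @ U) / wx.sum(), (wz @ U) / wz.sum()

print("clean    k(xbar, zbar):", (xbar @ zbar.T).item())
print("noisy    k(xe, ze):    ", (xe @ ze.T).item())
print("Fredholm estimate:     ", (mx @ mz.T).item())   # up to the scaling constant of Theorem 3
```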
**Linear kernel.** First we consider the case where the target kernel is the linear kernel, $k_{\mathcal{H}}^{target}(u, v) = u^T v$. We set $k_{\mathcal{H}}$ in the Fredholm kernel to also be linear, and $k$ to be the Gaussian kernel $k(u, v) = e^{-\frac{\|u - v\|^2}{2t}}$. We will compare $k_F(x_e, z_e)$ with the target kernel evaluated on the two observed points, that is, with $k_{\mathcal{H}}^{target}(x_e, z_e)$; the goal is to estimate $k_{\mathcal{H}}^{target}(\bar{x}, \bar{z})$. We will see that (1) both $k_F(x_e, z_e)$ and the (appropriately scaled) $k_{\mathcal{H}}^{target}(x_e, z_e)$ are unbiased estimators of $k_{\mathcal{H}}^{target}(\bar{x}, \bar{z})$, but (2) the variance of $k_F(x_e, z_e)$ is smaller than that of $k_{\mathcal{H}}^{target}(x_e, z_e)$, making it a more precise estimator.

**Theorem 3.** Suppose the probability distribution for the unlabeled data is $p_X = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d}))$. For the Fredholm kernel defined in Eqn 7, we have

$$E_{x_e, z_e}\big(k_{\mathcal{H}}^{target}(x_e, z_e)\big) = E_{x_e, z_e}\left(\left(\frac{t + \lambda^2}{\lambda^2}\right)^2 k_F(x_e, z_e)\right).$$

Moreover, when $\lambda > \sigma$,

$$\mathrm{Var}_{x_e, z_e}\left(\left(\frac{t + \lambda^2}{\lambda^2}\right)^2 k_F(x_e, z_e)\right) < \mathrm{Var}_{x_e, z_e}\big(k_{\mathcal{H}}^{target}(x_e, z_e)\big).$$

*Remark.* Note that we need a normalization constant for the Fredholm kernel to make it an unbiased estimator of $\bar{x}^T \bar{z}$. In practice, choosing this normalization is subsumed in selecting the regularization parameter for kernel methods. Thus the Fredholm kernel provides an approximation of the true linear kernel, but with smaller variance than the actual linear kernel evaluated on the noisy data.

**Gaussian kernel.** We now consider the case where the target kernel is the Gaussian kernel, $k_{\mathcal{H}}^{target}(u, v) = \exp\left(-\frac{\|u - v\|^2}{2r}\right)$. To approximate this kernel, we set both $k$ and $k_{\mathcal{H}}$ to be Gaussian kernels. To simplify the presentation of the results, we assume that $k$ and $k_{\mathcal{H}}$ have the same kernel width $t$. The resulting Fredholm kernel turns out to also be a Gaussian kernel, whose width depends on the choice of $t$. Our main result is the following. Again, as in the linear-kernel case, the Fredholm estimate $k_F(x_e, z_e)$ and $k_{\mathcal{H}}^{target}(x_e, z_e)$ are both unbiased estimators of the target $k_{\mathcal{H}}^{target}(\bar{x}, \bar{z})$ up to a constant, but $k_F(x_e, z_e)$ has smaller variance.

**Theorem 4.** Suppose the probability distribution for the unlabeled data is $p_X = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d}))$. Given the target kernel $k_{\mathcal{H}}^{target}(u, v) = \exp\left(-\frac{\|u - v\|^2}{2r}\right)$ with kernel width $r > 0$, we can choose $t$, given by the equation $\frac{t(t + \lambda^2)(t + 3\lambda^2)}{\lambda^4} = r$, and two scaling constants $c_1, c_2$, such that

$$E_{x_e, z_e}\big(c_1^{-1} k_{\mathcal{H}}^{target}(x_e, z_e)\big) = E_{x_e, z_e}\big(c_2^{-1} k_F(x_e, z_e)\big) = k_{\mathcal{H}}^{target}(\bar{x}, \bar{z}),$$

and, when $\lambda > \sigma$,

$$\mathrm{Var}_{x_e, z_e}\big(c_1^{-1} k_{\mathcal{H}}^{target}(x_e, z_e)\big) > \mathrm{Var}_{x_e, z_e}\big(c_2^{-1} k_F(x_e, z_e)\big).$$

*Remark.* In practice, when applying kernel methods to real-world applications, the optimal kernel width $r$ is usually unknown and is chosen by cross-validation or other methods. Similarly, for the Fredholm kernel, one can use cross-validation to choose the optimal $t$ for $k_F$.

## 5 Experiments

Using the linear and Gaussian kernels for $k$ and $k_{\mathcal{H}}$, we define three instances of the Fredholm kernel as follows.

*Figure 2: Noise but not cluster assumption. Gaussian noise in $\mathbb{R}^{100}$ is added. Linear (above) and non-linear (below) class boundaries.*

(1) Fred Lin1: $k(x, z) = x^T z$ and $k_{\mathcal{H}}(x, z) = \exp\left(-\frac{\|x - z\|^2}{2r}\right)$.
(2) Fred Lin2: $k(x, z) = \exp\left(-\frac{\|x - z\|^2}{2r}\right)$ and $k_{\mathcal{H}}(x, z) = x^T z$.
(3) Fred Gauss: $k(x, z) = k_{\mathcal{H}}(x, z) = \exp\left(-\frac{\|x - z\|^2}{2r}\right)$.

For the kernels in (2) and (3), which use the Gaussian kernel as the outer kernel $k$, we can also define their normalized versions, denoted by Fred Lin2(N) and Fred Gauss(N) respectively.
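Assuming the matrix construction from Section 2, these three instances simply correspond to different (outer, inner) kernel pairs; a minimal sketch follows (the shared width `r` and the normalization switch are illustrative simplifications, not choices made in the paper):

```python
import numpy as np

def linear(A, B, r=None):
    """Linear kernel matrix; the width argument is unused, kept for a uniform signature."""
    return A @ B.T

def gaussian(A, B, r=1.0):
    """Gaussian kernel matrix with width r."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * r))

def fredholm(outer, inner, X_eval, X_lab, X_all, r=1.0, normalize=False):
    """Fredholm kernel matrix between rows of X_eval and of X_lab (Eqn 4; Eqn 5 if normalize)."""
    K1, K2 = outer(X_eval, X_all, r), outer(X_lab, X_all, r)
    if normalize:                        # row-normalize the outer-kernel weights, as in Eqn 5
        K1, K2 = K1 / K1.sum(1, keepdims=True), K2 / K2.sum(1, keepdims=True)
    else:                                # 1/(l+u)^2 normalization, as in Eqn 4
        K1, K2 = K1 / X_all.shape[0], K2 / X_all.shape[0]
    return K1 @ inner(X_all, X_all, r) @ K2.T

# The three instances from the text; normalize=True gives the (N) variants of (2) and (3).
fred_lin1  = lambda Xe, Xl, Xa: fredholm(linear,   gaussian, Xe, Xl, Xa)
fred_lin2  = lambda Xe, Xl, Xa: fredholm(gaussian, linear,   Xe, Xl, Xa)
fred_gauss = lambda Xe, Xl, Xa: fredholm(gaussian, gaussian, Xe, Xl, Xa)
```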
**Synthetic examples: noise without the cluster assumption.** To isolate the ability of Fredholm kernels to deal with noise from the cluster assumption, we construct two synthetic examples that violate the cluster assumption, shown in Figure 2. The figures show the first two dimensions; multivariate Gaussian noise with variance $\sigma^2 = 0.01$ in $\mathbb{R}^{100}$ is added. The classification boundaries are indicated by the color. For each class, we provide several labeled points and a large amount of unlabeled data. Note that the classification boundary in the circle example is non-linear. We compare the Fredholm kernel based classifier with RLSC (Regularized Least Squares Classifier) and two widely used semi-supervised methods, the transductive support vector machine (TSVM) and Lap-RLSC. Since the examples violate the cluster assumption, the two existing semi-supervised learning algorithms, TSVM and Lap-RLSC, should not gain much from the unlabeled data. For TSVM, we use the primal TSVM proposed in [4], and for Lap-RLSC we use the implementation given in [1]. Different numbers of labeled points are given for each class, together with another 2000 unlabeled points. To choose the optimal parameters for each method, we pick the parameters based on their performance on a validation set, while the final classification error is computed on a held-out testing set. Results are reported in Tables 1 and 2, in which Fredholm kernels show a clear improvement over the other methods on the synthetic examples in terms of classification error.

| # Labeled (linear methods) | RLSC | TSVM | Lap-RLSC | Fred Lin1 | Fred Lin2(N) |
|---|---|---|---|---|---|
| 8  | 10.0 (±3.9) | 5.2 (±2.2) | 10.0 (±3.5) | 3.7 (±2.6) | 4.5 (±2.1) |
| 16 | 9.1 (±1.9)  | 5.1 (±1.1) | 9.1 (±2.2)  | 2.9 (±2.0) | 3.6 (±1.9) |
| 32 | 5.8 (±3.2)  | 4.5 (±0.8) | 6.0 (±3.2)  | 2.3 (±2.3) | 2.6 (±2.2) |

Table 1: Prediction error of different classifiers for the two-lines example.

| # Labeled (Gaussian methods) | K-RLSC | TSVM | Lap-RLSC | Fred Gauss(N) |
|---|---|---|---|---|
| 16 | 17.4 (±5.0) | 32.2 (±5.2) | 17.0 (±4.6) | 7.1 (±2.4) |
| 32 | 16.5 (±7.1) | 29.9 (±9.3) | 18.0 (±6.8) | 6.0 (±1.6) |
| 64 | 8.7 (±1.7)  | 20.3 (±4.2) | 9.7 (±2.0)  | 5.5 (±0.7) |

Table 2: Prediction error of different classifiers for the circle example.

**Real-world data sets.** Unlike artificial examples, it is usually difficult to verify whether certain assumptions are satisfied in real-world problems. Here we examine the performance of Fredholm kernels on several real-world data sets and compare it with the baseline algorithms mentioned above.

**Linear kernels.** We consider text categorization and sentiment analysis, where linear methods are known to perform well. We use the following data sets (represented by TF-IDF features): (1) 20 Newsgroups: 11269 documents with 20 classes; we select the first 10 categories for our experiment. (2) WebKB: the original data set contains 7746 documents with 7 unbalanced classes; we pick the two largest classes, with 1511 and 1079 instances respectively. (3) IMDB movie reviews: 1000 positive and 1000 negative movie reviews from IMDB.com. (4) Twitter sentiment data from SemEval 2013: 5173 tweets with positive, neutral and negative sentiment; we combine the neutral and negative classes to set up a binary classification problem. Results are reported in Table 3. In Table 4, we use WebKB as an example to illustrate how the performance changes as the number of labeled points increases.
**Gaussian kernels.** We test our methods on handwritten digit recognition, using subsets of two handwritten digit data sets, MNIST and USPS: the MNIST subset contains 10k digits in total with balanced examples for each class, and the USPS subset is the original testing set (USPST), containing about 2k images. The pixel values are normalized to [0, 1] and used as features. Results are reported in Table 5. In Table 6, we show that as additional Gaussian noise is added to the MNIST data, Fredholm kernels start to show significant improvement.

| Data set (linear methods) | RLSC | TSVM | Fred Lin1 | Fred Lin2 | Fred Lin2(N) |
|---|---|---|---|---|---|
| WebKB   | 16.9 (±1.4) | 12.7 (±0.8) | 13.0 (±1.3) | 12.0 (±1.6) | 12.0 (±1.6) |
| 20news  | 22.2 (±1.0) | 21.0 (±0.9) | 20.5 (±0.7) | 20.5 (±0.7) | 20.5 (±0.7) |
| IMDB    | 30.0 (±2.0) | 20.2 (±2.6) | 19.9 (±2.3) | 21.7 (±2.9) | 21.7 (±2.7) |
| Twitter | 38.7 (±1.1) | 37.6 (±1.4) | 37.4 (±1.2) | 37.4 (±1.2) | 37.5 (±1.2) |

Table 3: Error of various methods on the text data sets. 20 labeled points per class are given, with the rest of the data set as unlabeled points. Optimal parameters for each method are used.

| # Labeled (linear methods) | RLSC | TSVM | Fred Lin1 | Fred Lin2 | Fred Lin2(N) |
|---|---|---|---|---|---|
| 10 | 20.7 (±2.4) | 13.5 (±0.5) | 14.8 (±2.4) | 14.6 (±2.4) | 14.6 (±2.3) |
| 20 | 16.9 (±1.4) | 12.7 (±0.8) | 13.0 (±1.3) | 12.0 (±1.6) | 12.0 (±1.6) |
| 80 | 10.9 (±1.4) | 9.7 (±1.0)  | 8.1 (±1.0)  | 7.9 (±0.9)  | 7.9 (±0.9)  |

Table 4: Prediction error on WebKB with different numbers of labeled points.

| Data set (Gaussian methods) | K-RLSC | Lap-RLSC | Fred Gauss | Fred Gauss(N) |
|---|---|---|---|---|
| USPST | 11.8 (±1.4) | 10.2 (±0.5) | 12.4 (±1.8) | 10.8 (±1.1) |
| MNIST | 14.3 (±1.2) | 8.6 (±1.2)  | 12.2 (±1.0) | 13.0 (±0.9) |

Table 5: Prediction error of nonlinear classifiers on MNIST and USPS. 20 labeled points per class are given, with the rest of the data set as unlabeled points. Optimal parameters for each method are used.

| # Labeled (Gaussian methods) | K-RLSC | Lap-RLSC | Fred Gauss | Fred Gauss(N) |
|---|---|---|---|---|
| 10 | 34.1 (±2.1) | 35.6 (±3.5) | 27.9 (±1.6) | 29.0 (±1.5) |
| 20 | 27.2 (±1.1) | 27.3 (±1.8) | 21.9 (±1.2) | 22.9 (±1.2) |
| 40 | 20.0 (±0.7) | 20.3 (±0.8) | 17.3 (±0.5) | 18.4 (±0.4) |
| 80 | 15.6 (±0.4) | 15.6 (±0.5) | 14.8 (±0.6) | 15.4 (±0.5) |

Table 6: Prediction error of nonlinear classifiers on MNIST corrupted with Gaussian noise with standard deviation 0.3, for different numbers of labeled points (from 10 to 80). Optimal parameters for each method are used.

**Acknowledgments.** The work was partially supported by NSF Grants CCF-1319406 and RI 1117707. We thank the anonymous NIPS reviewers for insightful comments.

## References

[1] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, 2006.
[2] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
[3] Olivier Chapelle, Jason Weston, and Bernhard Schölkopf. Cluster kernels for semi-supervised learning. In Advances in Neural Information Processing Systems 17, pages 585-592, 2003.
[4] Olivier Chapelle and Alexander Zien. Semi-supervised classification by low density separation. In Robert G. Cowell and Zoubin Ghahramani, editors, AISTATS, pages 57-64, 2005.
[5] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, pages 131-160, 2009.
[6] S. Grünewälder, G. Lever, L. Baldassarre, S. Patterson, A. Gretton, and M. Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning, volume 2, pages 1823-1830, 2012.
[7] Steffen Grünewälder, Arthur Gretton, and John Shawe-Taylor. Smooth operators. In Proceedings of the 30th International Conference on Machine Learning, pages 1184-1192, 2013.
[8] Michiel Hazewinkel. Encyclopaedia of Mathematics, volume 4. Springer, 1989.
[9] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Arthur Gretton, and Bernhard Schölkopf. Kernel mean shrinkage estimators. arXiv preprint arXiv:1405.5505, 2014.
[10] Qichao Que and Mikhail Belkin. Inverse density as an inverse problem: the Fredholm equation approach. In Advances in Neural Information Processing Systems 26, pages 1484-1492, 2013.
[11] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[12] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[13] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 824-831, New York, NY, USA, 2005. ACM Press.
[14] S. V. N. Vishwanathan, Alexander J. Smola, and René Vidal. Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision, 73(1):95-119, 2007.
[15] Xiaojin Zhu. Semi-supervised learning literature survey. Technical report, Computer Science, University of Wisconsin-Madison, 2005.