Published as a conference paper at ICLR 2024

SPECTRALLY TRANSFORMED KERNEL REGRESSION

Runtian Zhai, Rattana Pukdee, Roger Jin, Maria-Florina Balcan, Pradeep Ravikumar
Carnegie Mellon University
{rzhai,rpukdee,rrjin,ninamf,pradeepr}@cs.cmu.edu

ABSTRACT

Unlabeled data is a key component of modern machine learning. In general, the role of unlabeled data is to impose a form of smoothness, usually from the similarity information encoded in a base kernel, such as the ϵ-neighbor kernel or the adjacency matrix of a graph. This work revisits the classical idea of spectrally transformed kernel regression (STKR), and provides a new class of general and scalable STKR estimators able to leverage unlabeled data. Intuitively, via spectral transformation, STKR exploits the data distribution, for which unlabeled data can provide additional information. First, we show that STKR is a principled and general approach, by characterizing a universal type of "target smoothness", and proving that any sufficiently smooth function can be learned by STKR. Second, we provide scalable STKR implementations for the inductive setting and a general transformation function, while prior work is mostly limited to the transductive setting. Third, we derive statistical guarantees for two scenarios: STKR with a known polynomial transformation, and STKR with kernel PCA when the transformation is unknown. Overall, we believe that this work helps deepen our understanding of how to work with unlabeled data, and its generality makes it easier to inspire new methods.

1 INTRODUCTION

The past decade has witnessed a surge of new and powerful algorithms and architectures for learning representations (Vaswani et al., 2017; Devlin et al., 2019; Chen et al., 2020; He et al., 2022), spurred in part by a boost in computational power as well as increasing sizes of datasets. Due to their empirical successes, providing an improved theoretical understanding of such representation learning methods has become an important open problem. Towards this, a big advance was made recently by HaoChen et al. (2021), who showed that when using a slight variant of popular contrastive learning approaches, termed spectral contrastive learning, the optimal learned features are the top-$d$ eigenfunctions of a population augmentation graph. This was further extended to other contrastive learning approaches (Johnson et al., 2023; Cabannes et al., 2023), as well as more generally to all augmentation-based self-supervised learning methods (Zhai et al., 2024).

A high-level summary of this recent line of work is as follows: The self-supervised learning approaches implicitly specify inter-sample similarity encoded via a Mercer base kernel. Suppose this kernel has the spectral decomposition $K(x, x') = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(x')$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$. The above line of work then showed that recent representation learning objectives can learn the optimal $d$ features, which are simply the top-$d$ eigenfunctions $[\psi_1, \dots, \psi_d]$ of this base kernel. Given these $d$ features, a linear probe is learned atop via regression. It can be seen that this procedure is equivalent to kernel regression with the truncated kernel $K_d(x, x') = \sum_{i=1}^{d} \lambda_i \psi_i(x) \psi_i(x')$. More generally, one can extend this to regression with a spectrally transformed kernel (STK) $K_s(x, x') = \sum_{i=1}^{\infty} s(\lambda_i) \psi_i(x) \psi_i(x')$, where $s : [0, +\infty) \to [0, +\infty)$ is a general transformation function. We call this generalized method spectrally transformed kernel regression (STKR).
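Concretely, on a finite sample the spectral transformation can be carried out on the Gram matrix itself: eigendecompose it, map each eigenvalue through $s$, and reassemble. The sketch below is a transductive illustration of the definition above in plain NumPy, not the paper's inductive algorithm from Section 3; the Gram matrix and the choice $d = 3$ are made up for the example.

```python
import numpy as np

def spectrally_transform_gram(G, s):
    """Finite-sample analogue of K_s(x, x') = sum_i s(lambda_i) psi_i(x) psi_i(x'):
    eigendecompose the PSD Gram matrix G, transform its spectrum, and rebuild."""
    eigvals, eigvecs = np.linalg.eigh(G)          # G = V diag(eigvals) V^T
    eigvals = np.clip(eigvals, 0.0, None)         # guard against tiny negative round-off
    return (eigvecs * s(eigvals)) @ eigvecs.T     # V diag(s(eigvals)) V^T

def truncate_top_d(lam, d=3):
    """Truncation transform: keep the d largest eigenvalues, zero out the rest."""
    thresh = np.sort(lam)[-d]
    return np.where(lam >= thresh, lam, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
G = X @ X.T                                                      # toy linear-kernel Gram matrix

G_squared = spectrally_transform_gram(G, lambda lam: lam ** 2)   # STK with s(λ) = λ^2
G_top3 = spectrally_transform_gram(G, truncate_top_d)            # truncated kernel K_d with d = 3
```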
Then, Kd amounts to an STK with the truncation function s(λi) = λi1{i d}. In fact, STK and STKR were quite popular two decades ago in the context of semi-supervised learning, which similar to more recent representation learning approaches, aims to leverage unlabeled data. Their starting point again was a base kernel encoding inter-sample similarity, but unlike recent representation learning approaches, at that period this base kernel was often explicitly rather than implicitly specified. For manifold learning this was typically the ϵ-neighbor or the heat kernel (Belkin & Niyogi, 2003). For unlabeled data with clusters, this was the cluster kernel (Chapelle et al., 2002). Published as a conference paper at ICLR 2024 And for graph structured data, this was typically the (normalized) adjacency or Laplacian matrix of an explicitly specified adjacency graph (Chung, 1997; Belkin & Niyogi, 2003). A range of popular approaches then either extracted top eigenfunctions, or learned kernel machines. These methods include LLE (Roweis & Saul, 2000), Isomap (Tenenbaum et al., 2000), Laplacian eigenmap (Belkin & Niyogi, 2003) for manifold learning; spectral clustering (Ng et al., 2001) for clustered data; and label propagation (Zhu & Ghahramani, 2002; Zhou et al., 2003) for graph structured data. With respect to kernel machines, Bengio et al. (2004) linked these approaches to kernel PCA, and Chapelle et al. (2002); Smola & Kondor (2003); Zhu et al. (2006) proposed various types of STK. In this work, we revisit STK and STKR, and provide three sets of novel results. Our first contribution is elevating STKR to be a principled and general way of using unlabeled data. Unlabeled data is useful as it provides additional information about the data distribution PX , but the kernel could be independent of PX . STKR implicitly mixes the information of PX and the kernel in the process of constructing the STK. We then prove the generality of STKR via an existence result (Theorem 1): Suppose the target function satisfies a certain unknown target smoothness that preserves the relative smoothness at multiple scales, then there must exist an STK that describes this target smoothness. Our second contribution is implementing STKR with general transformations for the inductive setting. Most prior work is limited to the transductive setting where test samples are known at train time (Zhou et al., 2003; Johnson & Zhang, 2008), in large part because it is easier to carry out spectral transformation of the finite-dimensional Gram matrix than the entire kernel function itself. But for practical use and a comprehensive analysis of STKR, we need inductive approaches as well. Towards this, Chapelle et al. (2002) solved an optimization problem for each test point, which is not scalable; Chapelle et al. (2006, Chapter 11.4) provided a more scalable extension that propagates the labels to unseen test points after transductive learning, but they still needed to implicitly solve a quadratic optimization program for each set of test points. These approaches moreover do not come with strong guarantees. Modern representation learning approaches that use deep neural networks to represent the STK eigenfunctions inductively do provide scalable approaches, but no longer have rigorous guarantees. 
To the best of our knowledge, this work develops the first inductive STKR implementation that (a) has closed-form formulas for the predictor, (b) works for very general STKs, (c) is scalable, and importantly, (d) comes with strong statistical guarantees. We offer detailed implementations with complexity analysis, and verify their efficiency with experiments on real tasks in Section 5.

Our third contribution is developing rigorous theory for this general inductive STKR, and proving nonparametric statistical learning bounds. Suppose the target function $f^*$ is smooth w.r.t. an STK $K_s$, and there are $n$ labeled and $m$ unlabeled samples, both i.i.d.. We prove estimation and approximation error bounds for the STKR predictor (in $L^2$ norm) when $s(\lambda)$ is known or completely unknown. By incorporating recent theoretical progress, three of our four bounds have tightness results.

In a nutshell, this work conceptually establishes STKR as a general and principled way of learning with labeled and unlabeled data together with a similarity base kernel; algorithmically, we provide scalable implementations for general inductive STKR, and verify them on real datasets; statistically, we prove statistical guarantees, with technical improvements over prior work. Limitations and open problems are discussed in Section 6, and more related work can be found in Appendix A. We also provide a table of notations at the beginning of the Appendix for the convenience of our readers.

2 DERIVING STKR FROM DIFFUSION INDUCED MULTISCALE SMOOTHNESS

Let the input space $\mathcal{X}$ be a compact Hausdorff space, $\mathcal{Y} = \mathbb{R}$ be the label space, and $P_{XY}$ be the underlying data distribution over $\mathcal{X} \times \mathcal{Y}$, whose marginal distribution $P_X$ is a Borel measure with support $\mathcal{X}$. We will use the shorthand $dp(x)$ to denote $dP_X(x)$. Let $L^2(P_X)$ be the Hilbert space of $L^2$ functions w.r.t. $P_X$, i.e. those satisfying $\int f(x)^2 \, dp(x) < +\infty$, with $\langle f_1, f_2 \rangle_{P_X} = \int f_1(x) f_2(x) \, dp(x)$ and $\|f\|_{P_X} = \sqrt{\langle f, f \rangle_{P_X}}$. $f \in L^2(P_X)$ also implies $f \in L^1(P_X)$, which guarantees that $\mathbb{E}_{X \sim P_X}[f(X)]$ exists and is finite. Let a base kernel $K(x, x')$ encode inter-sample similarity information over $\mathcal{X}$. We assume full access to $K$ (i.e. we can compute $K(x, x')$ for all $x, x'$), and that $K$ satisfies:

(i) $K$ is a Mercer kernel, so it has the spectral decomposition $K(x, x') = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(x')$, where the convergence is absolute and uniform. Here $\lambda_i, \psi_i$ are the eigenvalues and orthonormal eigenfunctions of the integral operator $T_K : L^2(P_X) \to L^2(P_X)$ defined as $(T_K f)(x) = \int f(x') K(x, x') \, dp(x')$, such that $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$, and $\langle \psi_i, \psi_j \rangle_{P_X} = \delta_{i,j} = \mathbf{1}\{i = j\}$.

(ii) $K$ is centered: defined as $T_K \mathbf{1} = \mathbf{0}$, where $\mathbf{1}(x) \equiv 1$ and $\mathbf{0}(x) \equiv 0$. One can center any $K$ by $\bar{K}(x_0, y_0) = K(x_0, y_0) - \int K(x, y_0) \, dp(x) - \int K(x_0, y) \, dp(y) + \iint K(x, y) \, dp(x) \, dp(y)$.

Why assume centeredness? In this work, we view the smoothness and the scale of a function $f$ as two orthogonal axes, since our smoothness pertains to inter-sample similarity. Thus, we view $f_1$ and $f_2$ as equally smooth if they differ by a constant a.e.. If $K$ is not centered, then this will not be true under the RKHS norm (see Section 2.1). In practice centering is not a necessary step, though it is often recommended in kernel PCA.

This work investigates the regression function estimation problem in nonparametric regression, with error measured in the $L^2$ norm (see Györfi et al. (2002) for an introduction to regression problems):

Problem. Let $f^*(x) := \int y \, dP_{XY}(y \mid x) \in L^2(P_X)$ be the target regression function. Given $n$ labeled samples $(x_1, y_1), \dots, (x_n, y_n) \overset{\text{i.i.d.}}{\sim} P_{XY}$, $m$ unlabeled samples $x_{n+1}, \dots, x_{n+m}$ drawn i.i.d. from
$P_X$, and access to $K(x, x')$ for any $x, x' \in \mathcal{X}$, find a predictor $\hat{f} \in L^2(P_X)$ with low prediction error:
$$\mathrm{err}(\hat{f}, f^*) := \mathbb{E}_{X \sim P_X}\big[ (\hat{f}(X) - f^*(X))^2 \big] = \| \hat{f} - f^* \|_{P_X}^2 .$$
One can also think of $f^*$ as the target function, and $y = f^*(x) + \epsilon$, where $\epsilon$ is random noise with zero mean.

Let $\{\lambda_i : i \in I\}$ be the set of non-zero eigenvalues of $T_K$, and define $K^p(x, x') := \sum_{i \in I} \lambda_i^p \psi_i(x) \psi_i(x')$ for all $p \in \mathbb{R}$, which corresponds to an STK with $s(\lambda) = \lambda^p$. The set $\{K^p\}$ delineates a diffusion process w.r.t. $K$, because $K^{p+1}(x, x') = \int K^p(x, x_0) K(x', x_0) \, dp(x_0)$, so that $K^{p+1}$ captures similarity with one additional "hop" compared to $K^p$. For continuous diffusion, $p$ can be real-valued. Then, the reproducing kernel Hilbert space (RKHS) associated with $K^p$ for any $p \ge 1$ is:
$$\mathcal{H}_{K^p} = \Big\{ f = \sum_{i \in I} u_i \psi_i \; : \; \sum_{i \in I} \frac{u_i^2}{\lambda_i^p} < \infty \Big\}, \qquad \Big\langle \sum_i u_i \psi_i, \sum_i v_i \psi_i \Big\rangle_{\mathcal{H}_{K^p}} = \sum_{i \in I} \frac{u_i v_i}{\lambda_i^p}, \qquad \|f\|_{\mathcal{H}_{K^p}}^2 = \langle f, f \rangle_{\mathcal{H}_{K^p}} .$$
$K^p$ is the reproducing kernel of $\mathcal{H}_{K^p}$, as one can verify for all $f \in \mathcal{H}_{K^p}$ and $x$ that $\langle f, K^p_x \rangle_{\mathcal{H}_{K^p}} = f(x)$, for $K^p_x(z) := K^p(x, z)$. $\mathcal{H}_{K^1}$ is also denoted by $\mathcal{H}_K$. The spaces $\mathcal{H}_{K^p}$ are called power spaces (Fischer & Steinwart, 2020) or interpolation Sobolev spaces (Jin et al., 2023).

Kernel ridge regression (KRR) is a classical least-squares algorithm. KRR with $K$ is given by:
$$\hat{f} \in \operatorname*{arg\,min}_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \beta_n \|f\|_{\mathcal{H}_K}^2 \qquad \text{for some } \beta_n > 0 .$$
Although KRR is very widely used, the problem is that it does not leverage the unlabeled data, because the optimal solution of KRR only depends on $x_1, \dots, x_n$ but not $x_{n+1}, \dots, x_{n+m}$, as is explicitly shown by the Representer Theorem (Schölkopf & Smola, 2002, Theorem 4.2): All minimizers of KRR admit the form $\hat{f}(x) = \sum_{j=1}^{n} \alpha^*_j K(x, x_j)$, where
$$\alpha^* \in \operatorname*{arg\,inf}_{\alpha \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} \alpha_j K(x_i, x_j) - y_i \Big)^2 + \beta_n \sum_{i,j=1}^{n} \alpha_i \alpha_j K(x_i, x_j) .$$

Figure 1: Sample graph.

One consequence is that for KRR, the whole base kernel could be useless. Consider the graph example in Figure 1, where only the three shaded nodes are labeled, and $K$ is the adjacency matrix. With KRR, the unlabeled nodes are useless and can be removed. Then, the graph becomes three isolated nodes, so it has zero impact on the learned predictor.

2.1 DIFFUSION INDUCED MULTISCALE SMOOTHNESS

Let us use this graph example to motivate STKR. Unlabeled samples are useful as they offer more information about the marginal distribution $P_X$. The problem is that we don't know the connection between $K$ and $P_X$. So while KRR can leverage $K$, it does not necessarily exploit more information about $P_X$ than supervised learning over the $n$ labeled samples, which is why the unlabeled samples are useless in our graph example. To address this, the seminal work of Belkin et al. (2006) proposed the elegant idea of explicitly including another regularizer $\|f\|_I^2$ that reflects the intrinsic structure of $P_X$. For instance, $\|f\|_I^2$ can be defined with the Laplace-Beltrami operator in manifold learning, or the graph Laplacian for graphs. In comparison, STKR also exploits $P_X$, but in an implicit way: the construction of the STK mixes $K$ with $P_X$. To see this, in our graph example, suppose we were to use the STK $K^2$, i.e. a two-step random walk. Then, (a) the graph would be useful again because the three labeled nodes are connected in $K^2$, and (b) we mix $K$ with $P_X$ since $K^2$ is essentially an integral of $K \cdot K$ over $P_X$ (a small numerical sketch of this example is given below). The main takeaway from the above analysis is: With STKR, we impose another kind of smoothness that we call the "target smoothness", and it mixes the information of $K$ with the information of $P_X$. In the rest of this section, we formally characterize this target smoothness. We start with formally defining "smoothness".
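To make the graph example concrete, here is a small hypothetical sketch (the six-node graph below is made up for illustration and is not the graph in Figure 1): with the adjacency matrix as the base kernel, the labeled-by-labeled block of $K$ is all zeros, so KRR sees three isolated nodes, whereas $K^2$ (a two-step walk) connects the labeled nodes through the unlabeled ones.

```python
import numpy as np

# Hypothetical toy graph: nodes 0, 1, 2 are labeled; nodes 3, 4, 5 are unlabeled
# "bridges". No two labeled nodes share an edge.
A = np.zeros((6, 6))
for i, j in [(0, 3), (1, 3), (1, 4), (2, 4), (2, 5), (0, 5)]:
    A[i, j] = A[j, i] = 1.0

K = A            # base kernel = adjacency matrix
K2 = A @ A       # two-step random walk (up to normalization), an STK with s(λ) = λ^2

labeled = [0, 1, 2]
print(K[np.ix_(labeled, labeled)])    # all zeros: KRR with K gets no cross-label similarity,
                                      # so the unlabeled nodes contribute nothing.
print(K2[np.ix_(labeled, labeled)])   # nonzero off-diagonals: K^2 links labeled nodes
                                      # through the unlabeled bridges, i.e. it mixes K with P_X.
```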
Suppose the inter-sample similarity is characterized by a metric d(x, x ) over the input space X, then one can naturally measure the smoothness of f by its Lipschitz constant Lipd(f) = supx,x X,x =x |f(x) f(x )| d(x,x ) . So it suffices to specify d(x, x ). If X is an Euclidean space, then one can choose d to be the Euclidean distance, which is used in lots of prior work (Tenenbaum et al., 2000; Belkin & Niyogi, 2003). The caveat is that the Euclidean distance is not guaranteed to be correlated with similarity, and X is not necessarily Euclidean in the first place. Instead, one can use K to define the metric, which should align better with inter-sample similarity by the definition of K. And if one further assumes transitivity of similarity, i.e. (a, b) and (b, c) being similar implies that (a, c) are similar, then Kp also aligns with similarity. The kernel metric of Kp is given by d Kp(x, x ) := Kp x Kp x HKp = P i λp i (ψi(x) ψi(x ))2, which is equivalent to the diffusion distance defined in Coifman & Lafon (2006), and p can be real-valued. Thus, kernel diffusion {Kp} induces a multiscale metric geometry over X, where a larger p induces a weaker metric. Here weaker means d Kp = O(d Kq) if p > q. One can also think of {Kp}p 1 as forming a chain of smooth function classes: L2(PX ) HK1 HK2 , and for continuous diffusion we can also have sets like HK1.5. A larger p imposes a stronger constraint since HKp is smaller. Now we show: f HKp is equal to its Lipschitz constant. But this is not true for Lipd(f), which is not very tractable under the topological structure of X. Thus, we consider the space of finite signed measures over X, denoted by X. For any function f on X, define its mean f as a linear functional over X, such that f(µ) = R X f(x)µ(x). Then, define d Kp(µ, ν) := R Kp xµ(x) R Kp xν(x) HKp for µ, ν X, and Lipd Kp(f) := supµ,ν X,µ =ν | f(µ) f(ν)| d Kp(µ,ν) . In other words, f is smooth if its mean w.r.t. µ does not change too much when the measure µ over X changes by a little bit. Then, we have: Proposition 1 (Proofs in Appendix B). This Lipd Kp(f) satisfies: Lipd Kp(f) = f HKp , f HKp. We define r Kp(f) := f EPX [f] 2 PX / Lipd Kp(f)2 = f EPX [f] 2 PX / f 2 HKp, and use it to measure the smoothness of any f L2(PX ) at scale p 1. Here f HKp is extended to all f L2(PX ): If fp HKp such that f EPX [f] = fp (PX -a.e.), then f HKp := fp HKp; If there is no such fp, then f HKp := + . Since K is centered, for any f1 and f2 that differ by a constant PX -a.e., there is r Kp(f1) = r Kp(f2). This would not be true without the centeredness assumption. We define r Kp(f) as a ratio to make it scale-invariant, i.e. f and 2f are equally smooth, for the same purpose of decoupling smoothness and scale. And in Appendix B.2, we will discuss the connection between r Kp(f) and discriminant analysis, as well as the Poincaré constant. Now we characterize target smoothness , an unknown property that the target f possesses. We assume that it has the same form rt(f) := f EPX [f] 2 PX /Lipdt(f)2, for some metric dt over X. Then, we assume all functions with target smoothness belong to a Hilbert space Ht, and x, x are similar if all functions in Ht give them similar predictions, i.e. dt(µ, ν) = sup f Ht=1 | f(µ) f(ν)|. We also assume that target smoothness implies base smoothness, i.e. Ht HK (this is relaxable). 2.2 TARGET SMOOTHNESS CAN ALWAYS BE OBTAINED FROM STK: SUFFICIENT CONDITION Let rt(f) be defined as above. 
Our first theorem gives the following sufficient condition: If the target smoothness preserves relative multiscale smoothness, then it must be attainable with an STK. Theorem 1. If rt(f) preserves relative smoothness: f1, f2 L2(PX ), if r Kp(f1) r Kp(f2) for all p 1, then rt(f1) rt(f2) , and Ht HK, then rt(f) = f EPX [f] 2 PX / f 2 Ht, and Ht must be an RKHS, whose reproducing kernel is an STK that admits the following form: Ks(x, x ) = X i:λi>0 s(λi)ψi(x)ψi(x ), (3) for a transformation function s : [0, + ) [0, + ) that is: (i) monotonically non-decreasing, (ii) s(λ) Mλ for some constant M > 0, (iii) continuous on [0, + ), and (iv) C on (0, + ). The proof is done by sequentially showing that (i) Ht is an RKHS; (ii) Its reproducing kernel is Ks(x, x ) := P i siψi(x)ψi(x ), with s1 s2 0; (iii) si = O(λi); (iv) There exists such a function s(λ) that interpolates all si. From now on, we will use HKs to denote Ht. This theorem Published as a conference paper at ICLR 2024 implies that s(λ) makes the eigenvalues decay faster than the base kernel, but it does not imply that Ks is a linear combination of {Kp}p 1. This result naturally leads to KRR with f 2 HKs = f 2 Ht: f arg min f EPX [f] HKs i=1 (f(xi) yi)2 + βn f EX PX [f(X)] 2 HKs which we term spectrally transformed kernel regression (STKR). One could also relax the assumption Ht HK by considering HKp for p p0 where p0 < 1. Assuming that HKp0 is still an RKHS and Ht HKp0, one can prove the same result as Theorem 1, with (ii) changed to s(λ) Mλp0. Now we develop theory for STKR, and show how it exploits the unlabeled data. Here is a road map: (a) We first study the easier transform-aware setting in Section 3, where a good s(λ) is given by an oracle. But even though s(λ) is known, Ks is inaccessible as one cannot obtain ψi with finite samples. Unlabeled data becomes useful when one constructs a kernel ˆKs to approximate Ks. (b) In reality, such an oracle need not exist. So in Section 4, we study the harder transform-agnostic setting where we have no knowledge of s(λ) apart from Theorem 1. We examine two methods: (i) STKR with inverse Laplacian (Example 1), which is popular in semi-supervised learning and empirically works well on lots of tasks though the real s might not be inverse Laplacian. (ii) STKR with kernel PCA, which extracts the top-d eigenfunctions to be an encoder and then learns a linear probe atop. This is used in many manifold and representation learning methods. Here, unlabeled data is useful when approximating ψ1, , ψd in kernel PCA. Notation: For any kernel K, we use GK R(n+m) (n+m), GK,n Rn n, GK,m Rm m to respectively denote its Gram matrix on all, labeled and unlabeled samples, i.e. GK[i, j] = K(xi, xj). 3 TRANSFORM-AWARE: STKR WITH KNOWN POLYNOMIAL TRANSFORM Let the scale of f be measured by B. This section supposes that s(λ) is known, and the following: Assumption 1. s(λ) = P p=1 πpλp is a polynomial, with πp 0. Assumption 2. There exists a constant κ > 0 such that K(x, x) κ2 for PX -almost all x. Assumption 3. EPX [f ] = 0, and there exist constants B, ϵ > 0 such that: f PX B, f HKs, and f 2 HKs ϵ f 2 PX (i.e. rt(f ) ϵ 1). (cf. the isometry property in Zhai et al. (2024)) Assumption 4. PXY satisfies the moment condition for σ, L > 0: E[|y f (x)|r] 1 for all r 2 and PX -almost all x. (e.g. For y f (x) N(0, σ2), this holds with L = σ.) Assumption 1 is a natural condition for discrete diffusion, such as a multi-step random walk on a graph, and p starts from 1 because s(0) = 0. 
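As a concrete instance, and anticipating the inverse Laplacian of Example 1 in Section 4, the short derivation below (under the assumption $\eta \in (0, \lambda_1^{-1})$ used there) writes that transformation as a power series with nonnegative coefficients of exactly the form in Assumption 1:

```latex
% Inverse Laplacian as a power-series transformation (cf. Example 1):
% for \eta \in (0, \lambda_1^{-1}), every eigenvalue satisfies \eta\lambda_i < 1, so
\[
  s(\lambda) \;=\; \frac{\lambda}{1 - \eta\lambda}
  \;=\; \lambda \sum_{p=0}^{\infty} (\eta\lambda)^{p}
  \;=\; \sum_{p=1}^{\infty} \eta^{\,p-1} \lambda^{p},
  \qquad \pi_p = \eta^{\,p-1} \ge 0 .
\]
% A finite-degree s(\lambda) = \sum_{p=1}^{q} \pi_p \lambda^p, e.g. a q-step random
% walk, is the truncation of such a series and satisfies Assumption 1 directly.
```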
The assumption EPX [f ] = 0 in Assumption 3 is solely for the simplicity of the results, without which one can prove the same but more verbose bounds. The moment condition Assumption 4 is essentially used to control the size of the label noise. Method: We implement inductive STKR by constructing a computable kernel ˆKs(x, x ) to approximate the inaccessible Ks. For example, if Ks(x, x ) = K2(x, x ) = R K(x, x0)K(x , x0)dp(x0), then a Monte-Carlo approximation can be done by replacing the integral over x0 with an average over x1, , xn+m. Computing this average leverages the unlabeled data. Specifically, we define: ˆf arg min f H ˆ Ks i=1 (f(xi) yi)2 + βn f 2 H ˆ Ks where ˆKs(x, x ) := p=1 πp ˆKp(x, x ); ˆK1 = K; p 2, ˆKp(x, x ) = v K(x) Gp 2 K v K(x ) (n + m)p 1 . Here, v K(x) Rn+m such that v K(x)[i] = K(x, xi), i [n + m]. One can compute ˆKs(x, x ) for any x, x with full access to K(x, x ). Let y = [y1, , yn] Rn, and v Ks,n(x) Rn be defined as v Ks,n(x)[i] = Ks(x, xi) for i [n]. The following closed-form solutions can be derived from the Representer Theorem. While they are not necessarily unique, we will use them throughout this work: ( f(x) = v Ks,n(x) α, α = (GKs,n + nβn In) 1y; ˆf(x) = v ˆ Ks,n(x) ˆα, ˆα = (G ˆ Ks,n + nβn In) 1y. Published as a conference paper at ICLR 2024 Results overview: Now, for all s and f that satisfy the above assumptions, we bound the prediction error ˆf f 2 PX . The bound has two parts and here is a synopsis: In Part 1 (Theorem 2), we assume access to Ks, and use the general results in Fischer & Steinwart (2020) to bound the estimation error entailed by KRR with finite samples and label noise; In Part 2 (Theorem 3), we bound the approximation error entailed by using ˆKs to approximate the inaccessible Ks. Theorem 2. Let M be given by Theorem 1. If eigenvalues of Ks decay by order p 1 for p (0, 1], i.e. s(λi) = O(i 1 p ) for all i, then under Assumptions 2 and 4, for a sequence of {βn}n 1 with βn = Θ(n 1 1+p ), there is a constant c0 > 0 independent of n 1 and τ κ 1M 1 2 such that f f 2 PX c0τ 2κ2M h ϵB2 + σ2 n 1 1+p + max L2, κ2MϵB2 n 1+2p holds for all f satisfying Assumption 3 and sufficiently large n with probability at least 1 4e τ. Remark. The O(n 1 1+p ) learning rate is minimax optimal as shown in Fischer & Steinwart (2020), i.e. one can construct an example where the learning rate is at most Ω(n 1 1+p ). And under Assumption 2, one can always choose p = 1 since i s(λi) Pi j=1 s(λj) M P λj = M Tr(TK) Mκ2. So one statistical benefit of using an appropriate s is to make the eigenvalues decay faster (i.e. make p smaller). Also note that the random noise should scale with f , which means that σ, L = Θ(B). Theorem 3. Let ˆλ1 be the largest eigenvalue of GK n+m, and denote λmax := max n λ1, ˆλ1 o . Then, under Assumptions 1 and 2, for any δ > 0, it holds with probability at least 1 δ that: ˆf f 2 PX 8s(λmax) λ β 2 n κ4 n + m ! y 2 2 n . Remark. The key to prove this is to first prove a uniform bound for | ˆKs(x, xj) Ks(x, xj)| over all x and j. With Assumptions 3 and 4, an O(B2 + σ2 + L2) bound for y 2 2 n can be easily obtained. If βn = Θ(n 1 1+p ) as in Theorem 2, then with m = ω(n 4 1+p ) this bound vanishes, so more unlabeled samples than labeled ones are needed. Moreover, ˆλ1 is known to be close to λ1 when n + m is large: Lemma 2. (Shawe-Taylor et al., 2005, Theorem 2) For any δ > 0, with probability at least 1 δ, ˆλ1 λ1 + κ2 n + m 19 log 2(n + m + 1) Implementation: STKR amounts to solving A ˆα = y for A = G ˆ Ks,n + nβn In by Eqn. (7). 
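As a minimal sketch of what solving Eqn. (7) looks like for a finite-degree polynomial $s(\lambda) = \sum_{p=1}^{q} \pi_p \lambda^p$, the NumPy code below forms $\hat{G}_{K_p} = G_K^p / (n+m)^{p-1}$ on the training samples (which is what Eqn. (5) reduces to on in-sample points), builds $A = \hat{G}_{K_s,n} + n\beta_n I_n$, and solves for $\hat{\alpha}$. This dense construction is only in the spirit of the direct approach (Algorithm 3 in Appendix C); prediction at a genuinely new $x$ would instead use $v_K(x)$ as in Eqn. (5).

```python
import numpy as np

def stkr_fit_direct(G, y, coeffs, beta_n):
    """Dense STKR fit for s(λ) = Σ_{p=1..q} π_p λ^p (Assumption 1, finite q).

    G      : (n+m, n+m) base-kernel Gram matrix, labeled samples first.
    y      : (n,) labels of the first n samples.
    coeffs : [π_1, ..., π_q], nonnegative.
    beta_n : ridge coefficient.
    Returns (alpha_hat, G_s), where G_s approximates the Gram matrix of K_s.
    """
    n, N = len(y), G.shape[0]                  # N = n + m
    G_p = np.array(G)                          # Ĝ_{K_1} = G
    G_s = coeffs[0] * G_p
    for pi_p in coeffs[1:]:
        G_p = G_p @ G / N                      # Ĝ_{K_p} = G^p / (n+m)^(p-1)
        G_s = G_s + pi_p * G_p
    A = G_s[:n, :n] + n * beta_n * np.eye(n)   # A = Ĝ_{K_s,n} + n β_n I_n
    alpha_hat = np.linalg.solve(A, y)          # Eqn. (7)
    return alpha_hat, G_s

def stkr_predict_transductive(G_s, alpha_hat, idx):
    """Predict for samples already among the n+m points (transductive use)."""
    n = alpha_hat.shape[0]
    return G_s[np.ix_(idx, range(n))] @ alpha_hat
```

With $m \gg n$, the repeated $(n+m) \times (n+m)$ matrix products in this sketch dominate the cost, which is exactly what the matrix-vector iterations of STKR-Prop described next are designed to avoid.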
There are two approaches: (i) Directly computing A (Algorithm 3 in Appendix C) can be slow due to costly matrix multiplication; (ii) Iterative methods are faster by only performing matrix-vector multiplication. Algorithm 1 solves A ˆα = y via Richardson iteration. We name it STKR-Prop as it is very similar to label propagation (Label-Prop) (Zhou et al., 2003). If s(λ) = Pq p=1 πpλp and q < , and computing K(x, x ) for any x, x takes O(1) time, then Algorithm 1 has a time complexity of O(q(n + m)2β 1 n s(λ) log 1 ϵ ) for achieving error less than ϵ, where λ is a known upper bound of λ1 (see derivation in Appendix C). Besides, STKR-Prop is much faster when K is sparse. In particular, for a graph with |E| edges, STKR-Prop runs in O(q|E|β 1 n ) time, which is as fast as Label-Prop. At inference time, one can store v computed in line 4 of Algorithm 1 in the memory. Then for any x, there is ˆf(x) = Pn+m i=1 K(xi, x)vi + π1 Pn j=1 K(xj, x)ˆαj, which takes O(n + m) time to compute. This is much faster than Chapelle et al. (2002) who solved an optimization problem for each new x. For some other transformations, including the inverse Laplacian we are about to discuss, s is complex, but s 1(λ) = Pq 1 p=0 ξpλp r is simple. For this type of s(λ), Algorithm 1 is infeasible, but there is a viable method in Algorithm 2: It finds θ Rn+m such that Qθ = [ ˆα, 0m] and Mθ = y, where Q := Pq 1 p=0 ξp GK n+m p , M := (n+m) In GK n+m r +nβn Q, In := diag{1, , 1, 0, , 0} with n ones and m zeros, and y := [y, 0m] . In Appendix C we will derive these formulas step by step, and prove its time complexity to be O(max{q, r}(n + m)2β 1 n ). And at inference time, one can compute ˆf(x) = v K(x) GK n+m r 1 θ in O(n + m) time for any x by storing GK n+m r 1 θ in the Published as a conference paper at ICLR 2024 Algorithm 1 STKR-Prop for simple s Input: GK, s(λ), βn, y, γ, ϵ 1: Initialize: ˆα 0 Rn 2: while True do # Compute u = (G ˆ Ks,n + nβn In) ˆα 3: α 1 n+m GK,n+m,n ˆα, v 0 Rn+m 4: for p = q, , 2 do v GKv 5: u G K,n+m,nv + π1GK,n ˆα + nβn ˆα 6: if u y 2 < ϵ y 2 then return ˆα 7: ˆα ˆα γ(u y) Algorithm 2 STKR-Prop for simple s 1 Input: GK, s 1(λ), βn, y, γ, ϵ 1: Initialize: θ 0 Rn+m, y [y, 0m] 2: while True do # Compute u = Mθ 3: v 0 Rn+m 4: for p = q 1, , 0 do v GKv 5: u h Gr K (n+m)r 1 θ [1 : n], 0m i +nβnv 6: a u y, θ θ γa 7: if a 2 < ϵ y 2 then return θ memory, where v K is defined as in Eqn. (5). Once again, for a graph with |E| edges, STKR-Prop has a time complexity of O(max{q, r}|E|β 1 n ), which is as fast as Label-Prop. Finally, here we showed the existence of a good solver (Richardson), but practitioners could surely use other linear solvers. 4 TRANSFORM-AGNOSTIC: INVERSE LAPLACIAN AND KERNEL PCA We have derived learning guarantees for general inductive STKR when s is known. This is useful, but in reality, it is unreasonable to presume that such an oracle s will be given. What should one do if one has zero knowledge of s(λ) but still want to enforce target smoothness? Here we provide two parallel methods. The first option one can try is STKR with the canonical inverse Laplacian transformation. Laplacian as a regularizer has been widely used in various context (Zhou et al., 2003; Johnson & Zhang, 2008; Hao Chen et al., 2021; Zhai et al., 2024). For our problem, we want f 2 HKs = f Ks 1f to be the Laplacian, so the kernel Ks should be the inverse Laplacian: Example 1 (Inverse Laplacian for the inductive setting). For η (0, λ 1 1 ), define Ks such that K 1 s (x, x ) = K 1(x, x ) ηK0(x, x ). 
K 1 and K0 are STKs with s(λ) = λ 1 and s(λ) = λ0. Then, s 1(λ) = λ 1 η > 0 for λ (0, λ1] (s 1 is the reciprocal, not inverse), s(λ) = λ 1 ηλ = P p=1 ηp 1λp, and f 2 HKs = f 2 HK η f 2 PX . Classical Laplacian has η = 1 and λ1 < 1. For the connection between transductive and inductive versions of Laplacian, see Appendix B.3. This canonical transformation empirically works well on lots of tasks, and also have this guarantee: Proposition 3. Let s be the inverse Laplacian (Example 1), and s be an arbitrary oracle satisfying Theorem 1. Suppose f satisfies Assumption 3 w.r.t. s , but STKR (Eqn. (7)) is performed with s. Then, Theorem 3 still holds for f given by Eqn. (6), and Theorem 2 holds by replacing ϵ with Mϵ . Note that this result does not explain why inverse Laplacian is so good its superiority is mainly an empirical observation, so it could still be bad on some tasks, for which there is the second option. The key observation here is that since s is proved in Theorem 1 to be monotonic, the order of ψ1, ψ2, must remain unchanged. So if one is asked to choose d functions to represent the target function, regardless of s the best choice with the lowest worst-case approximation error must be ψ1, , ψd: Proposition 4. Let s be any transformation function that satisfies Theorem 1. Let Fs be the set of functions that satisfy Assumption 3 for this s. Then, the following holds for all ˆΨ = [ ˆψ1, , ˆψd] such that ˆψi L2(PX ), as long as s(λ1)ϵ > 1 and s(λd+1) s(λ1) s(λd+1)[s(λ1)ϵ 1] 1 max f Fs min w Rd PX s(λd+1) s(λ1) s(λd+1)[s(λ1)ϵ 1]B2. To attain equality, it is sufficient for ˆΨ to span span{ψ1, , ψd}, and necessary if s(λd) > s(λd+1). Method: This result motivates using representation learning with two stages: A self-supervised pretraining stage that learns a d-dimensional encoder ˆΨ = [ ˆψ1, , ˆψd] with the unlabeled samples, and a supervised fitting stage that fits a linear probe on ˆΨ with the labeled samples. The final predictor is ˆfd(x) = ˆw ˆΨ(x), for which we do not include a bias term since f is assumed to be centered. For pretraining, the problem boils down to extracting the top-d eigenfunctions of TK, for which a classical method is kernel PCA (Schölkopf & Smola, 2002, Chapter 14). Indeed, kernel PCA has been widely applied in manifold learning (Belkin & Niyogi, 2003; Bengio et al., 2004), and more recently self-supervised pretraining (Johnson et al., 2023). Suppose that GK,m Rm m, the Gram matrix of K over xn+1, , xn+m, is at least rank-d. Then, kernel PCA can be formulated as: Published as a conference paper at ICLR 2024 j=1 vi[j]K(xn+j, x), (8) where GK,mvi = m λivi; λ1 λd > 0; vi Rm; i, j [d], vi, vj = δi,j For any i, j [d], there is ˆψi, ˆψj HK = v i GK,mvj = δi,j. Consider running KRR w.r.t. K over all f = w ˆΨ. For ˆf = ˆw ˆΨ, there is ˆf 2 HK = Pd i,j=1 ˆwi ˆwj ˆψi, ˆψj HK = ˆw 2 2. So it amounts to minimize 1 n Pn i=1( ˆw ˆΨ(xi) yi)2 + βn ˆw 2 2 as in ridge regression, which is an approximation of STKR with a truncation function s(λi) = λi if i d, and 0 otherwise (not a real function if λd = λd+1). Denote ˆΨ(Xn) = [ˆΨ(x1), , ˆΨ(xn)] Rd n. Then, the final predictor is given by: ˆfd = ˆw ˆΨ, ˆw = ˆΨ(Xn)ˆΨ(Xn) + nβn Id 1 ˆΨ(Xn)y. (9) Results overview: We now bound the prediction error of ˆfd for all f satisfying Assumption 3, with no extra knowledge about s(λ). The bound also has two parts. In Part 1 (Theorem 4), we bound the estimation error entailed by KRR over H ˆΨ given by Eqn. 
(9), where H ˆΨ is the RKHS spanned by ˆΨ = [ ˆψ1, , ˆψd], which is a subspace of HK; In Part 2 (Theorem 5), we bound the approximation error, which is the distance from f to this subspace ˆΨ. Note that if ˆΨ has insufficient representation capacity (e.g. d is small), then the approximation error will not vanish. Specifically, let fd be the projection of f onto H ˆΨ, i.e. fd = w ˆΨ, and fd, f fd HK = 0. Then, Part 1 bounds the KRR estimation error with fd being the target function, and Part 2 bounds f fd 2. Theorem 4. Let M be given by Theorem 1. Then, under Assumptions 2 and 4, for Eqn. (9) with a sequence of {βn}n 1 with βn = Θ(n 1 1+p ) for any p (0, 1], and any δ > 0 and τ κ 1, if n 16κ4 λ 2 d 2 + q δ 2 , then there is a constant c0 > 0 independent of n, τ such that: ˆfd fd 2 + c0τ 2h κ2MϵB2 + κ2σ2 n 1 1+p + κ2 max L2, κ2MϵB2 n 1+2p holds for all f under Assumption 3 and sufficiently large n with probability at least 1 4e τ δ. Remark. This bound has two terms. The first term bounds the gap between y and new labels y, where yi = yi f (xi) + fd(xi). The second term again comes from the results in Fischer & Steinwart (2020). Comparing the second term to Theorem 2, we can see that it achieves the fastest minimax optimal learning rate (i.e. p can be arbitrarily close to 0), as the eigenvalues decay the fastest with s being the truncation function . But the side effect of this statistical benefit is the first term, as the d-dimensional ˆΨ has limited capacity. The coefficient 3 can be arbitrarily close to 1 with larger n, c0. Our astute readers might ask why ˆΨ is learned only with the unlabeled samples, while in the last section STKR was done with both labeled and unlabeled samples. This is because in the supervised fitting stage, the function class is the subspace spanned by ˆΨ. To apply uniform deviation bounds in Theorem 4, this function class, and therefore ˆΨ, must not see x1, , xn during pretraining. On the contrary, the function class in Theorem 2 is HKs, which is independent of x1, , xn by definition. Theorem 5. Let M be given by Theorem 1. Let f fd = bg, where b R, and g HK such that g HK = 1 and g, ˆψi HK = 0 for i [d]. Then, f fd 2 PX = b2 g 2 PX , and f fd 2 HK = b2 ϵMλ1 1 2 λ1 g 2 PX B2 for all f satisfying Assumption 3. (10) And if Assumption 2 holds, then for any δ > 0, it holds with probability at least 1 δ that: λd+1 g 2 PX λd+1 + κ2 Remark. When m is sufficiently large, g 2 PX can be very close to λd+1. Compared to Proposition 4, one can see that the bound for f fd 2 PX = b2 g 2 PX given by this result is near tight provided λ1 = s(λd+1) λd+1 = M: The only difference is that Eqn. (10) has ϵMλ1 1 2 instead of ϵMλ1 1. Published as a conference paper at ICLR 2024 Table 1: Experiment results. We compare Label-Prop (LP) to STKR-Prop (SP) with inverse Laplacian (Lap), with polynomial s(λ) = λ8 (poly), with kernel PCA (topd), and with s(λ) = λ (KRR) (i.e. KRR with base kernel). (t) and (i) indicate transductive and inductive settings. Test samples account for 1% of all samples. We report the accuracies of the argmax prediction of the estimators (%). Optimal hyperparameters are selected using a validation set (see Appendix D for details). Standard deviations are given across ten random seeds. 
Dataset     LP (t)        SP-Lap (t)    SP-poly (t)   SP-topd (t)   SP-Lap (i)    SP-poly (i)   SP-topd (i)   KRR (i)
Computers   77.30±3.05    77.81±3.94    76.72±4.12    80.80±3.06    77.15±2.64    71.91±4.13    80.80±3.28    26.35±4.34
Cora        73.33±6.00    77.04±5.74    71.48±5.80    69.26±7.82    67.78±7.62    65.19±9.11    63.70±6.00    28.52±8.56
DBLP        66.44±3.78    65.42±5.02    64.52±4.20    64.86±4.60    65.20±4.92    64.51±4.05    63.16±3.41    44.80±3.86

Our analysis in this section follows the framework of Zhai et al. (2024), but we have the following technical improvements: (a) Estimation error: They bound it with classical local Gaussian and localized Rademacher complexities, while we use the tighter bound in Fischer & Steinwart (2020) that is minimax optimal; (b) Approximation error: Our Theorem 5 has three improvements. (i) $\|g\|_{P_X}^2 - \lambda_{d+1}$ is $O(\sqrt{d})$ instead of $O(d)$; (ii) It does not require delocalization of the top-$d$ eigenfunctions, thereby removing the dependence on the covariance matrix; (iii) Our bound does not depend on $\lambda_d^{-1}$. Eigen-decomposition of $G_{K,m}$ takes $O(m^3)$ time in general, though as of today the fastest algorithm takes $O(m^{\omega})$ time with $\omega < 2.38$ (Demmel et al., 2007), and could be faster if the kernel is sparse.

5 EXPERIMENTS

We implement STKR-Prop (SP) with inverse Laplacian (Lap), polynomial (poly) $s(\lambda) = \lambda^p$, and kernel PCA (topd). We run them on several node classification tasks, and compare them to Label-Prop (LP) and KRR with the base kernel (i.e. STKR with $s(\lambda) = \lambda$). Details and full results are deferred to Appendix D, and here we report a portion of the results in Table 1, in which the best and second-best performances for each dataset are marked in red and blue. We make the following observations: (a) STKR works pretty well with general polynomial $s(\lambda)$ in the inductive setting. In the transductive setting, the performance of SP-Lap is similar to LP, and SP-poly is slightly worse. The inductive performance is slightly worse than transductive, which is reasonable since there is less information at train time in the inductive setting. Note that LP does not work in the inductive setting. (b) STKR with $s(\lambda) = \lambda^p$ for $p > 1$ is much better than KRR (i.e. $p = 1$). In fact, we observe that for STKR with $s(\lambda) = \lambda^p$, a larger $p$ performs better (see Figure 2 in Appendix D). This suggests one possible reason why the inverse Laplacian works so well empirically: It contains $K^p$ for $p = 1, 2, \dots$, so it can use multi-step similarity information up to infinitely many steps. (c) STKR also works pretty well with kernel PCA. Specifically, on 3 of the 9 datasets we use, such as Computers, kernel PCA is better than LP and STKR with inverse Laplacian. This shows that inverse Laplacian and kernel PCA are two parallel methods; neither is superior.

6 CONCLUSION

This work revisited the classical idea of STKR, and proposed a new class of general and scalable STKR estimators able to leverage unlabeled data with a base kernel. We established STKR as a general and principled approach, provided scalable implementations for general transformations and inductive settings, and proved statistical bounds with technical improvements over prior work.

Limitations and open problems. This work assumes full access to $K(x, x')$, but in some cases computing $K(x, x')$ might be slow or impossible. The positive-pair kernel in contrastive learning (Johnson et al., 2023) is such an example, for which computing $K$ is hard but computing $\|f\|_{\mathcal{H}_K}^2$ is easy, so our methods need to be modified accordingly. Also, this work does not talk about how to choose the right base kernel $K$, which is a critical yet difficult open problem.
For graph tasks, STKR like label propagation only leverages the graph, but it does not utilize the node features that are usually provided, which are important for achieving high performances in practice. Finally, this work focuses on the theoretical part, and a more extensive empirical study on STKR is desired, especially within the context of manifold learning, and modern self-supervised and semi-supervised learning. There are three open problems from this work. (i) Improving the minimax optimal learning rate: In this work, we provided statistical bounds w.r.t. n, m jointly, but one question we did not answer is: If m is sufficiently large, can we improve the minimax optimal learning rate w.r.t. n proved in prior work on supervised learning? (ii) Distribution shift: Diffusion induces a chain of smooth function classes L2(PX ) HK1 HK2 , but this chain will collapse if PX changes. Can one learn predictors or encoders that are robust to the shift in PX ? (iii) Combining multiple kernels: In practice, usually the predictor is expected to satisfy multiple constraints. For example, an image classifier should be invariant to small rotation, translation, perturbation, etc. When each constraint induces a kernel, how should a predictor or encoder be learned? We leave these problems to future work. Published as a conference paper at ICLR 2024 The code of Section 5 can be found at https://colab.research.google.com/drive/ 1m8OENF2lvx W3BB6CVEu45SGe K9Io Ypd1?usp=sharing. ACKNOWLEDGMENTS We would like to thank Zico Kolter, Andrej Risteski, Bingbin Liu, Elan Rosenfeld, Shanda Li, Yuchen Li, Tanya Marwah, Ashwini Pokle, Amrith Setlur and Xiaoyu Huang for their feedback on the early draft of this work, and Yiping Lu and Fanghui Liu for their useful discussions. We are grateful to our anonymous ICLR reviewers, with whose help this work has been greatly improved. We acknowledge the support of NSF via IIS-1909816, IIS-2211907, ONR via N00014-23-1-2368, DARPA under cooperative agreement HR00112020003, and Bloomberg Data Science Ph D fellowship. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373 1396, 2003. Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research, 7 (11), 2006. Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pp. 541 549. PMLR, 2018. Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux, Jean-François Paiement, Pascal Vincent, and Marie Ouimet. Learning eigenfunctions links spectral embedding and kernel pca. Neural computation, 16(10):2197 2219, 2004. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798 1828, 2013. David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. Advances in neural information processing systems, 32, 2019. David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution matching and augmentation anchoring. In International Conference on Learning Representations, 2020. URL https: //openreview.net/forum?id=Hklke R4KPB. 
Gilles Blanchard, Olivier Bousquet, and Laurent Zwald. Statistical properties of Kernel Prinicipal Component Analysis. Machine Learning, 66(2-3):259 294, March 2007. doi: 10.1007/ s10994-006-6895-9. URL https://hal.science/hal-00373789. Aleksandar Bojchevski and Stephan Günnemann. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. In International Conference on Learning Representations, 2018. Haim Brezis. Functional analysis, Sobolev spaces and partial differential equations. Springer, 2011. Simon Buchholz. Kernel interpolation in sobolev spaces is not consistent in low dimensions. In Conference on Learning Theory, pp. 3410 3440. PMLR, 2022. Vivien Cabannes, Bobak Kiani, Randall Balestriero, Yann Lecun, and Alberto Bietti. The SSL interplay: Augmentations, inductive bias, and generalization. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 3252 3298. PMLR, 23 29 Jul 2023. URL https://proceedings. mlr.press/v202/cabannes23a.html. Published as a conference paper at ICLR 2024 Olivier Chapelle, Jason Weston, and Bernhard Schölkopf. Cluster kernels for semi-supervised learning. Advances in neural information processing systems, 15, 2002. Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-supervised learning. MIT Press, 2, 2006. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597 1607. PMLR, 2020. Fan RK Chung. Spectral graph theory, volume 92. American Mathematical Soc., 1997. Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5 30, 2006. Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 113 123, 2019. Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 702 703, 2020. Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems, 34:10131 10143, 2021. Maarten V de Hoop, Nikola B Kovachki, Nicholas H Nelsen, and Andrew M Stuart. Convergence rates for learning linear operators from noisy data. SIAM/ASA Journal on Uncertainty Quantification, 11(2):480 513, 2023. James Demmel, Ioana Dumitriu, and Olga Holtz. Fast linear algebra is stable. Numerische Mathematik, 108(1):59 91, 2007. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171 4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https: //aclanthology.org/N19-1423. Terrance De Vries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. 
ar Xiv preprint ar Xiv:1708.04552, 2017. David Elworthy. Does baum-welch re-estimation help taggers? Co RR, abs/cmp-lg/9410012, 1994. URL http://arxiv.org/abs/cmp-lg/9410012. Matthias Fey and Jan E. Lenssen. Fast graph representation learning with Py Torch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. Simon Fischer and Ingo Steinwart. Sobolev norm learning rates for regularized least-squares algorithms. The Journal of Machine Learning Research, 21(1):8464 8501, 2020. Johannes Gasteiger, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. Advances in neural information processing systems, 32, 2019. Ian Goodfellow, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Multi-prediction deep boltzmann machines. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/ 2013/file/0bb4aec1710521c12ee76289d9440817-Paper.pdf. László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002. Published as a conference paper at ICLR 2024 Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017. Jeff Z Hao Chen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems, 34:5000 5011, 2021. Kaveh Hassani and Amir Hosein Khasahmadi. Contrastive multi-view representation learning on graphs. In International conference on machine learning, pp. 4116 4126. PMLR, 2020. Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000 16009, 2022. R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019. URL https: //openreview.net/forum?id=Bklr3j0c KX. Carl J Huberty and Stephen Olejnik. Applied MANOVA and discriminant analysis. John Wiley & Sons, 2006. Jikai Jin, Yiping Lu, Jose Blanchet, and Lexing Ying. Minimax optimal kernel operator learning via multilevel training. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=z En1Bha NYs C. Daniel D. Johnson, Ayoub El Hanchi, and Chris J. Maddison. Contrastive learning can find an optimal basis for approximately view-invariant functions. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id= Aj C0KBji Mu. Rie Johnson and Tong Zhang. Graph-based semi-supervised learning and spectral kernel design. IEEE Transactions on Information Theory, 54(1):275 288, 2008. Kwang-Sung Jun, Ashok Cutkosky, and Francesco Orabona. Kernel truncated randomized ridge regression: Optimal rates and low noise acceleration. Advances in neural information processing systems, 32, 2019. Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2016. Risi Imre Kondor and John Lafferty. 
Diffusion kernels on graphs and other discrete structures. In Proceedings of the 19th international conference on machine learning, volume 2002, pp. 315 322, 2002. Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum? id=BJ6o Ofqge. Dong-Hyun Lee. Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks. In ICML 2013 Workshop on Challenges in Representation Learning, 2013. Zhu Li, Dimitri Meunier, Mattes Mollenhauer, and Arthur Gretton. Optimal rates for regularized conditional mean embedding learning. Advances in Neural Information Processing Systems, 35: 4433 4445, 2022. Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel Ridgeless regression can generalize. The Annals of Statistics, 48(3):1329 1347, 2020. doi: 10.1214/19-AOS1849. URL https://doi.org/10.1214/19-AOS1849. Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates for spectral algorithms with least-squares regression over hilbert spaces. Applied and Computational Harmonic Analysis, 48(3):868 890, 2020. Published as a conference paper at ICLR 2024 Fanghui Liu, Xiaolin Huang, Yudong Chen, and Johan AK Suykens. Random features for kernel approximation: A survey on algorithms, theory, and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7128 7148, 2021. Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Generalization error of random feature and kernel methods: Hypercontractivity and kernel matrix concentration. Applied and Computational Harmonic Analysis, 59:3 84, 2022. Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979 1993, 2018. Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In T. Dietterich, S. Becker, and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems, volume 14. MIT Press, 2001. URL https://proceedings.neurips.cc/paper_ files/paper/2001/file/801272ee79cfde7fa5960571fee36b9b-Paper.pdf. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking : Bringing order to the web. In The Web Conference, 1999. URL https://api. semanticscholar.org/Corpus ID:1508503. Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40): 24652 24663, 2020. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in neural information processing systems, 20, 2007. Alexander Rakhlin and Xiyu Zhai. Consistency of interpolation with laplace kernels is a highdimensional phenomenon. In Conference on Learning Theory, pp. 2595 2623. PMLR, 2019. Marc Aurelio Ranzato and Martin Szummer. Semi-supervised learning of compact document representations with deep networks. In Proceedings of the 25th international conference on Machine learning, pp. 792 799, 2008. Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. Advances in neural information processing systems, 28, 2015. 
Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500):2323 2326, 2000. Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002. John Shawe-Taylor, Christopher KI Williams, Nello Cristianini, and Jaz Kandola. On the eigenspectrum of the gram matrix and the generalization error of kernel-pca. IEEE Transactions on Information Theory, 51(7):2510 2522, 2005. Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. Relational Representation Learning Workshop @ Neur IPS, 2018. Mingguang Shi and Bing Zhang. Semi-supervised learning improves gene expression-based prediction of cancer recurrence. Bioinformatics, 27(21):3017 3023, 2011. Aman Sinha and John C Duchi. Learning kernels with random features. Advances in neural information processing systems, 29, 2016. Alexander J Smola and Risi Kondor. Kernels and regularization on graphs. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pp. 144 158. Springer, 2003. Published as a conference paper at ICLR 2024 Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596 608, 2020. Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008. Ingo Steinwart, Don R Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In The 22nd Conference on Learning Theory, pp. 79 93, 2009. Prem Talwai, Ali Shameli, and David Simchi-Levi. Sobolev norm learning rates for conditional mean embeddings. In International conference on artificial intelligence and statistics, pp. 10422 10447. PMLR, 2022. Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017. Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319 2323, 2000. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. Petar Veliˇckovi c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data augmentation for consistency training. Advances in neural information processing systems, 33:6256 6268, 2020a. Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10687 10698, 2020b. Zhilin Yang, William Cohen, and Ruslan Salakhudinov. Revisiting semi-supervised learning with graph embeddings. 
In International conference on machine learning, pp. 40 48. PMLR, 2016. Runtian Zhai, Bingbin Liu, Andrej Risteski, Zico Kolter, and Pradeep Ravikumar. Understanding augmentation-based self-supervised representation learning via rkhs approximation and regression. In International Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=Ax2y Rh CQr1. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy8gd B9xx. Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. URL https: //openreview.net/forum?id=r1Ddp1-Rb. Dengyong Zhou, Olivier Bousquet, Thomas Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. Advances in neural information processing systems, 16, 2003. Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. In CMU CALD tech report CMU-CALD-02-107, 2002. Xiaojin Zhu, Jaz S Kandola, John Lafferty, and Zoubin Ghahramani. Graph kernels by spectral transforms. In Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien (eds.), Semi-supervised learning, chapter 15. MIT Press, 2006. Published as a conference paper at ICLR 2024 Table 2: Table of notations. Notation Description δi,j Kronecker delta. Equal to 1 if i = j, and 0 otherwise X Input space, assumed to be a compact Hausdorff space Y Label space, assumed to be R PXY Underlying data distribution over X Y PX , PY Marginal distributions of PXY dp(x) A shorthand of d PX (x) n, m Numbers of labeled and unlabeled i.i.d. samples from PXY and PX , respectively x1, , xn+m x1, , xn are the labeled samples, and xn+1, , xn+m are unlabeled y1, , yn The observed labels. Also define y = [y1, , yn] Rn L2(PX ) L2 function space w.r.t. PX , with inner product , PX and norm PX K(x, x ) A Mercer base kernel that encodes inter-sample similarity information GK (n + m) (n + m) Gram matrix over all samples. GK[i, j] = K(xi, xj) GK,n n n Gram matrix over the n labeled samples GK,m m m Gram matrix over the m unlabeled samples TK An integral operator w.r.t. K, (TKf)(x) = R f(x )K(x, x )dp(x) (λi, ψi) Eigenvalue and eigenfunction of TK such that TKψi = λiψi. λ1 λ2 0 I The set {i N : λi > 0} Ks(x, x ) A spectrally transformed kernel of K, whose formula is given by Eqn. (3) s(λ) The transformation function of Ks(x, x ) M s(λ) Mλ as proved in Theorem 1. Frequently used in other theorems Kp(x, x ) Equivalent to Ks(x, x ) with s(λ) = λp, for all p R HKp The RKHS associated with Kp(x, x ) when p 1. HK1 is also denoted by HK HKs The RKHS associated with Ks(x, x ) that encodes target smoothness d Kp The kernel metric of Kp, defined as d Kp(x, x ) = P i λp i (ψi(x) ψi(x )) Lipd Kp(f) The Lipschitz constant of f w.r.t. metric d Kp over X defined in Section 2.1 r Kp(f) Defined as f EPX [f] 2 PX / Lipd Kp(f)2, similar to the discriminant function rt(f) Defined as f EPX [f] 2 PX /Lipdt(f)2. dt is defined in Section 2.1 f Regression function defined as f (x) := R y d PXY(y|x) L2(PX ) ˆf A predictor that approximates f . For STKR, its formula is given by Eqn. (4) B The scale of f , defined as f PX B πp The coefficient of Ks when it is assumed to be polynomial in Assumption 1 κ2 The upper bound of K(x, x) for PX -a.e. x X. 
See Assumption 2 ϵ f is assumed to satisfy rt(f ) ϵ 1 in Assumption 3 σ, L Used in the moment condition in Assumption 4 βn The regularization coefficient in KRR and STKR. Can change with n ˆKs A computable kernel used to approximate Ks, defined in Eqn. (5) v K(x) An Rn+m vector. An important component of ˆKs defined after Eqn. (5) f A predictor defined in Eqn. (6) that is inaccessible, but important in our analysis α, ˆα Defined in Eqns. (6) and (7) ˆλ1 The largest eigenvalue of GK n+m γ The step size in STKR-Prop implemented with Richardson iteration η A hyperparameter of inverse Laplacian, which is defined with s 1(λ) = λ 1 η ˆΨ A pretrained d-dimensional representation [ ˆψ1, , ˆψd] in representation learning ˆw A linear probe on top of ˆΨ trained during downstream, and ˆfd(x) = ˆw ˆΨ(x) H ˆΨ A d-dimensional RKHS spanned by ˆΨ, and it is a subspace of HK fd Projection of f onto H ˆΨ Published as a conference paper at ICLR 2024 A RELATED WORK A.1 LEARNING WITH UNLABELED DATA There are two broad classes of methods of learning with unlabeled data, namely semi-supervised learning and representation learning. Semi-supervised learning focuses on how to improve supervised learning by incorporating unlabeled samples, while representation learning focuses on how to extract a low-dimensional representation of data using huge unlabeled data. Semi-supervised learning has long been used in different domains to enhance the performance of machine learning (Elworthy, 1994; Ranzato & Szummer, 2008; Shi & Zhang, 2011). While there are a wide variety of semi-supervised learning algorithms, the main differences between them lie in their way of realizing the assumption of consistency (Zhou et al., 2003), i.e. soft constraints that the predictor is expected to satisfy by prior knowledge. Enforcing consistency can be viewed as an auxiliary task (Goodfellow et al., 2013; Rasmus et al., 2015), since the unlabeled samples cannot be used in the main task involving the labels. The smoothness studied in this work can be regarded as one type of consistency. While there are various types of consistency, most of them can be put into one of three categories: Euclidean-based consistency, prediction-based consistency and similarity-based consistency. Euclidean-based consistency supposes X to be an Euclidean space, and relies on the assumption that x, x have the same label with high probability if they are close in the Euclidean distance, which is the foundation of many classical methods including nearest neighbor, clustering and RBF kernels. In the context of semi-supervised learning, virtual adversarial training (VAT) learns with unlabeled samples by perturbing them with little noise and forcing the model to give consistent outputs (Miyato et al., 2018). Mixup (Zhang et al., 2018) uses a variant of Euclidean-based consistency, which assumes that for two samples (x1, y1) and (x2, y2) and any θ (0, 1), the label of θx1 + (1 θ)x2 should be close to θy1 + (1 θ)y2. Mixup has been proved to be empirically successful, and was later further improved by Mixmatch (Berthelot et al., 2019). Prediction-based consistency assumes that deep learning models enjoy good generalization: When a deep model is well trained on a small set of labeled samples, it should be able to achieve fairly good accuracy on the much larger set of unlabeled samples. 
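As a toy illustration of this prediction-based consistency idea, the following is a minimal self-training sketch (our own example, not code from any of the methods surveyed next; the classifier choice and the 0.9 confidence threshold are arbitrary assumptions):

```python
# Minimal self-training / pseudo-labeling sketch:
# 1) fit on labeled data, 2) pseudo-label confident unlabeled points, 3) refit on the union.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(20, 5))
y_lab = (X_lab[:, 0] > 0).astype(int)          # small labeled set
X_unl = rng.normal(size=(500, 5))              # large unlabeled pool

base = LogisticRegression().fit(X_lab, y_lab)  # trained on the labels only
proba = base.predict_proba(X_unl)
confident = proba.max(axis=1) > 0.9            # keep only confident pseudo-labels
X_aug = np.vstack([X_lab, X_unl[confident]])
y_aug = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
final = LogisticRegression().fit(X_aug, y_aug) # second model trained on all samples
```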
The simplest such method is pseudo-labeling (Lee, 2013), which first trains a model only on the labeled samples, then uses it to label the unlabeled samples, and finally trains a second model on all samples. There are a large number of variants of pseudo-labeling, also known as self-training. For instance, temporal ensembling (Laine & Aila, 2017) pseudo-labels the unlabeled samples with models from previous epochs; Mean Teacher (Tarvainen & Valpola, 2017) improves temporal ensembling for large datasets by using a model that averages the weights of previous models to generate pseudo-labels; and Noisy Student (Xie et al., 2020b), which performs self-training iteratively with noise added at each iteration, reportedly achieves as high as 88.4% top-1 accuracy on ImageNet.

Finally, similarity-based consistency assumes prior knowledge about some kind of similarity over samples or transformations of the samples. A classical type of similarity-based consistency is based on graphs, and there is a rich body of literature on semi-supervised learning on graphs (Zhou et al., 2003; Johnson & Zhang, 2008) and GNNs (Kipf & Welling, 2016; Hamilton et al., 2017; Veličković et al., 2018). The most popular type of similarity-based consistency in deep learning is based on random data augmentation, also known as invariance-based consistency, where all augmentations of the same sample are assumed to be similar and thus should share the same label. This simple idea has been reported to achieve good performance in a large body of work, including UDA (Xie et al., 2020a), ReMixMatch (Berthelot et al., 2020) and FixMatch (Sohn et al., 2020). A variety of augmentation techniques have also been explored, ranging from simple ones like image translation, flipping and rotation to more complicated ones like Cutout (DeVries & Taylor, 2017), AutoAugment (Cubuk et al., 2019) and RandAugment (Cubuk et al., 2020).

Representation learning aims to learn low-dimensional representations of data that concisely encode relevant information useful for building classifiers or other predictors (Bengio et al., 2013). By this definition, in order to learn a good representation, we need information about what classifiers or predictors we want to build, for which consistency also plays a very important role. Popular representation learning algorithms based on consistency can be roughly classified into Euclidean-based or similarity-based consistency. There is no prediction-based consistency for unsupervised representation learning because there is nothing to predict in the first place. Euclidean-based consistency has been widely applied in manifold learning, such as LLE (Roweis & Saul, 2000), Isomap (Tenenbaum et al., 2000) and Laplacian eigenmaps (Belkin & Niyogi, 2003), whose goal is to recover the low-dimensional manifold on which the observed high-dimensional data resides. See Bengio et al. (2004) for descriptions of these methods. Similarity-based consistency is the basis of most deep representation learning algorithms. In fact, Euclidean-based consistency can be viewed as a special case of similarity-based consistency which assumes that samples close in Euclidean distance are similar. The most prominent similarity-based method is contrastive learning (Oord et al., 2018; Hjelm et al., 2019; Chen et al., 2020), which requires the model to assign similar representations to augmentations of the same sample.
But in fact, lots of representation learning methods that make use of certain auxiliary tasks can be viewed as enforcing similarity-based consistency. If the auxiliary task is to learn a certain multi-dimensional target function g(x), then by defining kernel Kg(x, x ) := g(x), g(x ) , the task can also be seen as approximating Kg, and Kg encodes some kind of inter-sample similarity. For example, even supervised pretraining such as Image Net pretraining, can be viewed as enforcing similarity-based consistency (defined by the Image Net labels) over the data. In fact, Papyan et al. (2020) showed that if supervised pretraining with cross entropy loss is run for sufficiently long time, then the representations will neural collapse to the class labels (i.e. samples of the same class share exactly the same representation), which is a natural consequence of enforcing class-based similarity. Zhai et al. (2024) also showed that mask-based auxiliary tasks such as BERT (Devlin et al., 2019) and MAE (He et al., 2022) can also be viewed as approximating a similarity kernel. Finally, for semi-supervised and self-supervised learning on graphs, diffusion has been widely used, from the classical works of Page et al. (1999); Kondor & Lafferty (2002) to more recent papers such as Gasteiger et al. (2019); Hassani & Khasahmadi (2020). For graph tasks, STKR can be viewed as a generalization of these diffusion-based methods. A.2 STATISTICAL LEARNING THEORY ON KERNEL METHODS Kernel methods have a very long history and there is a very rich body of literature on theory pertaining to kernels, which we shall not make an exhaustive list here. We point our readers who want to learn more about kernels to two books that are referred to a lot in this work: Schölkopf & Smola (2002); Steinwart & Christmann (2008). Also, Chapters 12-14 of Wainwright (2019) can be more helpful and easier to read for elementary readers. Here we would like to briefly talk about recent theoretical progress on kernel methods, especially on interpolation Sobolev spaces. This work uses the results in Fischer & Steinwart (2020), in which minimax optimal learning rates for regularized least-squares regression under Sobolev norms were proved. Meanwhile, Lin et al. (2020) proved that KRR converges to the Bayes risk at the best known rate among kernel-based algorithms. Addition follow-up work includes Jun et al. (2019); Cui et al. (2021); Talwai et al. (2022); Li et al. (2022); de Hoop et al. (2023); Jin et al. (2023). However, there are also counter arguments to this line of work, and kernel-based learning is by no means a solved problem. As Jun et al. (2019) pointed out, these optimal rates only match the lower bounds under certain assumptions (Steinwart et al., 2009). Besides, most of these minimax optimal learning rates are for KRR, which requires a greater-than-zero regularization term, but empirical observations suggest that kernel regression with regularization can perform equally well, if not better (Zhang et al., 2017; Belkin et al., 2018; Liang & Rakhlin, 2020). What makes things even more complicated is that kernel ridgeless regression seems to only work for high-dimensional data under some conditions, as argued in Rakhlin & Zhai (2019); Buchholz (2022) who showed that minimum-norm interpolation in the RKHS w.r.t. Laplace kernel is not consistent if the input dimension is constant. Overall, while kernel methods have a very long history, they still have lots of mysteries, which we hope that future work can shed light on. 
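To make the contrast concrete, here is a minimal numerical sketch (our own illustration with an arbitrary Laplace kernel bandwidth and toy data, not code from the works cited above) of kernel ridge regression with β_n > 0 versus minimum-norm (ridgeless) interpolation:

```python
# Kernel ridge regression vs. ridgeless (minimum-norm) interpolation with a Laplace kernel.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)

def laplace_kernel(A, B, bandwidth=1.0):
    # K(x, x') = exp(-||x - x'||_1 / bandwidth)
    d = np.abs(A[:, None, :] - B[None, :, :]).sum(-1)
    return np.exp(-d / bandwidth)

G = laplace_kernel(X, X)
beta_n = 1e-2                                   # ridge coefficient (> 0), as in KRR
alpha_ridge = np.linalg.solve(G + len(y) * beta_n * np.eye(len(y)), y)
alpha_ridgeless = np.linalg.pinv(G) @ y         # minimum-norm interpolant (beta_n -> 0 limit)

X_test = np.linspace(-1, 1, 200)[:, None]
K_test = laplace_kernel(X_test, X)
f_ridge, f_interp = K_test @ alpha_ridge, K_test @ alpha_ridgeless
```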
Another class of methods in parallel with STKR is random features, first introduced in the seminal work Rahimi & Recht (2007). We refer our readers to Liu et al. (2021) for a survey on random features. The high-level idea of random features is the following: One first constructs a class of features {φ(x, θ)} parameterized by θ, e.g. Fourier features. Then, randomly sample D features from Published as a conference paper at ICLR 2024 this class, and perform KRR or other learning algorithms on this set of random features. Now, why would such method have good generalization? As explained in Mei et al. (2022), if D is very large (overparameterized regime), then the approximation error is small since f should be well represented by the D features, but the estimation error is large because more features means more complexity; If D is very small (underparameterized regime), then the model is simple so the estimation error is small, but the approximation error could be large because the D features might be unable to represent f . However, since the class of features {φ(x, θ)} is of our choice, we can choose them to be good features (e.g. with low variance) so that the estimation error will not be very large even if D is big. Rahimi & Recht (2007) used Fourier features that are good, and there are lots of other good features we could use. One can even try to learn the random features: For example, Sinha & Duchi (2016) proposed to learn a distribution over a pre-defined class of random features such that the resulting kernel aligns with the labels. B.1 PROOF OF PROPOSITION 1 Proof. First, for all f HKp, and µ, ν X we have f(µ) f(ν) = X f(x)dµ(x) Z X f(x)dν(x) = X Kp xdµ(x) Z X Kp xdν(x) X Kp xdµ(x) Z X Kp xdν(x) HKp = f HKpd Kp(µ, ν), which means that Lipd Kp(f) f HKp. Second, {Kp x Kp x }x,x X span the entire HKp, because if f HKp satisfies f, Kp x Kp x HKp = 0 for all x, x , then f c for some c, which means that f 0 since 0 is the only constant function in HKp (because K is centered). Thus, any f HKp can be written as f = RR (Kp x Kp x )dξ(x, x ) for some finite signed measure ξ over X X, and thus f = R X Kp xdµ(x) R X Kp xdν(x), where µ, ν are the marginal measures of ξ. By using such defined µ, ν in the above formula, we can see that Lipd Kp(f) = f HKp. B.2 PROOF OF THEOREM 1 First, we show that Ht must be an RKHS. Since ψ1 is the top-1 eigenfunction of Kp for all p 1, we have r Kp(ψ1) r Kp(f) for all f HK, which implies that rt(ψ1) rt(f) for all f Ht HK. Let C0 := rt(ψ1). Then, for all f Ht, since f must be centered, we have f 2 PX C0 Lipdt(f)2. Let f = f/ f Ht, then: C0 sup x,x X,x =x inf f1 Ht=1 |f(x) f(x )| |f1(x) f1(x )| C0 sup x,x X,x =x |f(x) f(x )| | f(x) f(x )| = p This implies that Ht is stronger than PX . Thus, if there is a sequence of points {xi} such that xi x w.r.t. Ht, then (a) x Ht because Ht is a Hilbert space, and (b) xi x w.r.t. PX . Meanwhile, we know that HK is stronger than PX , so xi x w.r.t. HK also implies xi x w.r.t. PX . Now, consider the inclusion map I : Ht HK, such that Ix = x. For any sequence {xi} Ht such that xi x w.r.t. Ht and Ixi y w.r.t. HK, we have xi x w.r.t. PX , and xi y w.r.t. PX . Thus, we must have y = x = Ix, meaning that the graph of I is closed. So the closed graph theorem says that I must be a bounded operator, meaning that there exists a constant C > 0 such that f HK C f Ht for all f Ht. (For an introduction of the closed graph theorem, see Chapter 2 of Brezis (2011).) For all f HK, let δx : f 7 f(x) be the evaluation functional at point x. 
Since HK is an RKHS, for any x X, δx is a bounded functional on HK, which means that there exists a constant Published as a conference paper at ICLR 2024 Mx > 0 such that |f(x)| Mx f HK for all f HK. So for any f Ht HK, there is |f(x)| Mx f HK Mx C f Ht, which means that δx is also a bounded functional on Ht. Thus, Ht must be an RKHS. Let Ks be the reproducing kernel of Ht. From now on, we will use HKs to denote Ht. Now that HKs is an RKHS, we can use the proof of Proposition 1 to show that Lipdt(f) = f HKs , which implies that rt(f) = f EPX [f] 2 PX / f 2 HKs . Second, we prove by induction that ψ1, , ψd are the top-d eigenfunctions of Ks, and they are pairwise orthogonal w.r.t. HKs. When d = 1, ψ1 is the top-1 eigenfunction of Kp for all p 1, so ψ1 arg maxf L2(PX ) r Kp(f). By the relative smoothness preserving assumption, this implies that ψ1 arg maxf L2(PX ) rt(f). Therefore, ψ1 must be the top-1 eigenfunction of Ks. Now for d 2, suppose ψ1, , ψd 1 are the top-(d 1) eigenfunctions of Ks with eigenvalues s1, , sd 1, and they are pairwise orthogonal w.r.t. HKs. Let H0 = {f : f, ψi PX = 0, i [d 1]}. Obviously, H0 HKp is a closed subspace of HKp for all p 1. And for any f H0 HKs and any i [d 1], we have f, ψi HKs = f (K 1 s ψi) = f s 1 i ψi = 0, so HKs H is a closed subspace of HKs. Applying the assumption with this H0, we can see that ψd is the topd eigenfunction of Ks, and is orthogonal to ψ1, , ψd 1 w.r.t. both HKs. Thus, we complete our proof by induction. And if λd = λd+1, then we can show that both ψ1, , ψd 1, ψd and ψ1, , ψd 1, ψd+1 are the top-d eigenfunctions of Ks, meaning that sd = sd+1. Thus, Ks can be written as Ks(x, x ) = P i siψi(x)ψi(x ), where s1 s2 0, and si = si+1 if λi = λi+1. Moreover, by HKs HK, it is obviously true that λi = 0 implies si = 0. Third, we prove by contradiction that si Mλi for all i. If this statement is false, then obviously one can find t1 < t2 < such that sti i λti for all i. Consider f = P i 1λtiψi, for which f 2 HK = P i i 1 = + . Since HKs HK, this implies that f 2 HKs = + , so we have i λti i sti P i λti i2λti = P i 1 i2 < + , which gives a contradiction and proves the claim. Fourth, we find a function s(λ) that satisfies the conditions in the theorem to interpolate those points. Before interpolation, we first point out that we can WLOG assume that λi < 2λi+1 for all i: If there is an i that does not satisfy this condition, we simply insert some new λ s between λi and λi+1, whose corresponding s s are the linear interpolations between si and si+1, so that si Mλi still holds. With this assumption, it suffices to construct a series of bump functions {fi} i=1, where fi 0 if λi = λi+1; otherwise, fi(λ) = si si+1 for λ λi and fi(λ) = 0 for λ λi+1. Such bump functions are C and monotonically non-decreasing. Then, define s(λ) = P i fi(λ) for λ > 0, and s(0) = 0. This sum of bump functions converges everywhere on (0, + ), since it is a finite sum locally everywhere. Clearly this s is monotonic, interpolates all the points, continuous on [0, + ) and C on (0, + ). And for all λ that is not λi, for instance λ (λi+1, λi), there is s(λ) s(λi) Mλi 2Mλi+1 2Mλ. Thus, s(λ) = O(λ) for λ [0, + ). Remark. In general, we cannot guarantee that s(λ) is differentiable at λ = 0. Here is a counterexample: λi = 3 i, and si = 3 i if i is odd and 2 3 i if i is even. Were s(λ) to be differentiable at λ = 0, its derivative would be 1 and also would be 2, which leads to a contradiction. Nevertheless, if the target smoothness is strictly stronger than base smoothness, i.e. 
Lipd K(f) = o(Lipdt(f)), then s can be differentiable at λ = 0 but still not C . Link with discriminant analysis and the Poincaré constant. First, we point out the connection between r Kp(f) and the discriminant function in discriminant analysis (see Chapters 3-4 of Huberty & Olejnik (2006)). We can see that r Kp(f) is defined as the proportion of variance of f w.r.t. L2(PX ) and w.r.t. HKp, so essentially r Kp(f) measures how much variance of f is kept by the inclusion map HKp , L2(PX ). Meanwhile, the discriminant function is the proportion of variance of f in the grouping variable and in total. Thus, similar to PCA which extracts the d features that keep the most variance (i.e. the top-d singular vectors), kernel PCA also extracts the d features that keep the most variance (i.e. the top-d eigenfunctions). This is also closely related to the ratio trace defined in Zhai et al. (2024), whose Appendix C showed that lots of existing contrastive learning methods can be viewed as maximizing the ratio trace, i.e. maximizing the variance kept. Zhai et al. (2024) also showed that for supervised pretraining with multi-class classification, we can also define an augmentation kernel , and then r Kp(f) is equivalent to the discriminant function (i.e. the η2 defined in Section 4.2.1 of Huberty & Olejnik (2006)). Moreover, r Kp(f) defined for the interpolation Sobolev space HKp is analogous to the Poincaré constant for the Sobolev space. Specifically, the Poincaré-Wirtinger inequality states that for any Published as a conference paper at ICLR 2024 1 p < and any bounded connected open set of C1 functions Ω, there exists a constant C depending on p and Ωsuch that for all u in the Sobolev space W 1,p(Ω), there is u u Lp(Ω) C u Lp(Ω) where u = 1 |Ω| R Ωu is the mean, and the smallest value of such C is called the Poincaré constant (see Chapter 9.4, Brezis (2011)). Consider the special case p = 2, and replace L2(Ω) with L2(µ) for a probability measure dµ such that dµ(x) = exp( V (x))dx, where V is called the potential function. Then, with proper boundary conditions, we can show with integration by parts that u 2 L2(µ) = u, Lu L2(µ), where L is the diffusion operator defined as Lf = f + V f for all f. Here, = P x2 j is the Laplace-Beltrami operator. Now, by replacing L with the integral operator TKp, we can see that C2 is equivalent to supf r Kp(f) for the interpolation Sobolev space HKp, which is known to be r Kp(ψ1) = λ1 if f L2(PX ), and r Kp(ψj) = λj if f is orthogonal to ψ1, , ψj 1 (i.e. H0 defined in the proof above). B.3 CONNECTION BETWEEN TRANSDUCTIVE AND INDUCTIVE FOR INVERSE LAPLACIAN Proposition 5. Let X = {x1, , xn+m}, and G be a graph with node set X and edge weights wij. Define W , D R(n+m) (n+m) as W [i, j] = wij, and D be a diagonal matrix such that D[i, i] = P j wij. Suppose W is p.s.d., and D[i, i] > 0 for all i. Let L := In+m ηD 1/2W D 1/2 for a constant η. Let PX be a distribution over X such that p(xi) = D[i,i] Tr(D). Define a kernel K : X X R as K(xi, xj) = Tr(D)(D 1W D 1)[i, j]. For any u Rn+m and ˆy = D 1 define f : X R as: [f(x1), , f(xn+m)] = p 2 ˆy. Then, we have ˆy Lˆy = f 2 HK η f 2 PX = X i,j f(xi)f(xj)K 1(xi, xj)p(xi)p(xj) η X i f(xi)2p(xi). Proof. Denote f := [f(x1), , f(xn+m)]. Let GK 1[i, j] = K 1(xi, xj), i.e. GK 1 is the Gram matrix of K 1. Let P = D/ Tr(D) = diag{p(x1), , p(xn+m)}. Then, we have f 2 HK η f 2 PX = f P GK 1P f ηf P f. Let us first characterize GK 1. 
For any f HK, there is (GKP GK 1P f)[t] = X i,j f(xi)K 1(xi, xj)K(xj, xt)p(xi)p(xj) i f(xi)K0(xi, xt)p(xi) = f(xt), meaning that GKP GK 1P f = f. Moreover, let w = D 1 2 u, then f = p Tr(D)D 1W D 1w = Tr(D) 1 2 GKw, so we have f P GK 1P f = Tr(D) 1 2 w GKP GK 1P f = Tr(D) 1 2 w f = u D 1 2 ˆy = ˆy 2 2. Besides, f P f = f D Tr(D) 1f = ˆy D 1 2 ˆy. So f 2 HK η f 2 PX = ˆy Lˆy. Remark. The definition of K, i.e. GK = Tr(D)D 1W D 1, has a similar form as the positive-pair kernel in Johnson et al. (2023), Eqn. (1). The important difference between this and the normalized adjacency matrix D 1 2 is that it has D 1 instead of D 1 2 . However, the above result says that using a kernel with Gram matrix D 1W D 1 in the inductive setting is equivalent to using the matrix D 1 2 in the transductive setting. Moreover, this result assumes that ˆy belongs to the column space of D 1 2 (which is what u is used for). This is necessary for ˆy to be representable by an f HK; otherwise, ˆy ˆy cannot be expressed by any f HK. Published as a conference paper at ICLR 2024 B.4 PROOF OF THEOREM 2 First of all, note that for any f = P i uiψi HK such that f HK T, there is f(x)2 = i uiψi(x))2 P i u2 i λi i λiψi(x)2 T 2κ2, i.e. |f(x)| κT for PX -almost all x. And for any f HKs such that f HKs T, by Theorem 1 we have f HK MT, so there is |f(x)| κ MT for PX -almost all x. The main tool to prove this result is Theorem 3.1 in Fischer & Steinwart (2020), stated below: Theorem 6 (Theorem 3.1, Fischer & Steinwart (2020)). Let PXY, PX and the regression function f be defined as in Section 2. Let HK be a separable RKHS on X with respect to a measurable and bounded kernel K, and K(x, x) κ2 for PX -almost all x. Define the integral operator TK : L2(PX ) L2(PX ) as (TKf)(x) = R f(x )K(x, x )dp(x ). Let the eigenvalues/functions of TK be λi, ψi, with λ1 λ2 . Let HKp be defined as Eqn. (1). Assume that there exists a constant B > 0 such that f L (PX ) B , and the following four conditions holds: Eigenvalue decay (EVD): λi c1i 1 p for some constant c1 > 0 and p (0, 1]. Embedding condition (EMB) for α (0, 1]: The inclusion map HKα , L (PX ) is bounded, with HKα , L (PX ) c2 for some constant c2 > 0. Source condition (SRC) for β (0, 2]: f HKβ c3 for some constant c3 > 0. Moment condition (MOM): There exist constants σ, L > 0 such that for PX -almost all x X and all r 2, R |y f (x)|rp(dy|x) 1 Let f be the KRR predictor with βn > 0. Let γ be any constant such that γ [0, 1] and γ < β. If β + p > α, and βn = Θ(n 1 β+p ), then there is a constant A > 0 independent of n 1 and τ > 0 such that: f f HKγ 2c2 3ββ γ n + Aτ 2 σ2κ2 + c2 2c2 3 nβγ+p n + c2 2 max n L2, (B + c2c3)2o n2βα+γ+(α β)+ n holds for sufficiently large n with PX n-probability at least 1 4e τ. This exact bound can be derived from the proof in Sections 6.1-6.10 of Fischer & Steinwart (2020). We apply this result by substituting K = Ks, α = 1, β = 1, and γ = 0. Note that f HK0 = f PX for f HK0, and we have proved that f f HKs HK0. For the four conditions, we have: Eigenvalue decay (EVD): This is assumed to be satisfied by condition. Embedding condition (EMB) for α = 1: HKs , L (PX ) c2 for some constant c2 > 0. This condition is satisfied with c2 = κ M, as mentioned at the beginning of this proof. Source condition (SRC) for β = 1: f HKs c3 for some constant c3 > 0. By Assumption 3, this is satisfied with c3 = ϵB. Moment condition (MOM): This is assumed to be satisfied by condition. Finally, we have Ks(x, x) Mκ2 a.e., and B = κ MϵB, as mentioned at the beginning of this proof. 
Thus, applying this result yields the desired bound. B.5 BOUNDING THE GAP BETWEEN ˆKs AND Ks Lemma 6. For any δ > 0, the following holds with probability at least 1 δ for all p 1: ˆKp(x, xj) Kp(x, xj) (p 1)λp 2 max κ4 n + m Published as a conference paper at ICLR 2024 for all x X, j [n + m], and λmax = max n λ1, ˆλ1 o , which implies that ˆKs(x, xj) Ks(x, xj) λ Proof. For any x X and any p 1, Kp(x, x ) as a function of x satisfies Kp(x, x ) 2 HK = i λp i ψi(x)ψi(x ) λ2p i ψi(x )2 λi λ2p 2 1 κ2. Now, for any u Rn+m such that u 1 1, consider Fp(x) = u GK n+m p v K(x). Since K(xi, ), K(xj, ) HK = K(xi, xj), we have v K, v K HK = GK, which implies that Fp 2 HK = u GK p v K, u GK HK = u G2p+1 K (n + m)2p u. We now provide a bound for Fp HK, which uses the following exercise from linear algebra: Proposition 7. For any p.s.d. matrices A, B Rd d, there is Tr(AB) A 2 Tr(B).* Since GK is p.s.d., we can define G1/2 K . Then, using the above exercise, we have u G2p+1 K (n + m)2p u = Tr 2p G1/2 K u 2p G1/2 K uu G1/2 K ˆλ2p 1 Tr G1/2 K uu G1/2 K . And Tr G1/2 K uu G1/2 K = u GKu = Pn+m i,j=1 uiuj K(xi, xj) Pn+m i,j=1 |uiuj K(xi, xj)| κ2 u 2 1 κ2. Thus, we have Fp HK ˆλp 1κ for all p 0. Define F := {f = g1g2 | g1, g2 HK, g1 HK, g2 HK 1}. Then, as we proved in the proof of Theorem 2, g1 κ and g2 κ, which means that for all f F, f κ2. Moreover, by Proposition 13 of Zhai et al. (2024), we have Rn(F) κ2 n, where Rn is the Rademacher complexity. Thus, by Theorem 4.10 of Wainwright (2019), for any δ > 0, 1 n + m i=1 f(xi) EX PX [f(X)] for all f F (14) holds with probability at least 1 δ. In what follows, we suppose that this inequality holds. For any p, define v Kp(x) Rn+m as v Kp(x)[i] = Kp(x, xi) for all i [n + m]. Then, Kp(x, xj) ˆKp(x, xj) = Kp(x, xj) 1 (n + m)p 1 v K(x) Gp 2 K v K(xj) Kp(x, xj) 1 n + mv Kp 1(x) v K(xj) v Kp q(x) Gq 1 K v K(xj) v Kp q 1(x) Gq K n + mv K(xj) . *An elementary proof can be found at https://math.stackexchange.com/questions/ 2241879/reference-for-trace-norm-inequality. Published as a conference paper at ICLR 2024 Let us start with bounding the first term: Kp(x, xj) 1 n + mv Kp 1(x) v K(xj) X Kp 1(x, z)K(xj, z)dp(z) 1 n + m i=1 Kp 1(x, xi)K(xj, xi) λp 2 1 κ4 n + m because Kp 1(x, ) HK λp 2 1 κ, and K(xj, ) HK κ. For the second term, note that v K(xj) = GKej, where ej = [0, , 0, 1, 0, , 0]. So we have: v Kp q(x) Gq 1 K v K(xj) 1 n + mv Kp q 1(x) Gq Kv K(xj) X Kp q 1(x, z) e j q v K(z) dp(z) 1 n + m j=1 Kp q 1(x, xj) e j λp q 2 1 ˆλq 1 κ4 n + m because Kp q 1(x, ) HK λp q 2 1 κ, and e j GK n+m q v K HK ˆλq 1κ since ej 1 = 1. Combining the above two inequalities yields Eqn. (12). Then, note that p=1 πp(p 1)λp 2, which together with πp 0 for all p yields Eqn. (13). Corollary 8. If Eqn. (14) holds, then for all i, j [n + m], and λmax = max n λ1, ˆλ1 o , Ks2(xi, xj) ˆKs(xi, ), Ks(xj, ) PX + ˆKs(xi, ), ˆKs(xj, ) PX ˆKs(xi, ), Ks(xj, ) PX Proof. Consider Fp,q(x) = u GK n+m p v Kq(x) for any u 1 1 and any p 0, q 1. If Eqn. (14) holds, then by Proposition 7, we have Fp,q 2 HK = u GK p GK2q 1 GK p 1/2 GK2q 1 p 1/2 G1/2 K uu G1/2 K 2 Tr G1/2 K uu G1/2 K 2 u GKu ˆλ2p 1 1 For any unit vector w Rn+m, we have n + mw = 1 n + m i,j=1 wiwj K(xi, xj) = 1 n + m Published as a conference paper at ICLR 2024 where Ψt R(n+m) (n+m) such that Ψt[i, j] = ψt(xi)ψt(xj), so Ψt is p.s.d.. Thus, we have n + m w = 1 n + m t λ2q 1 t w Ψtw λ2q 2 1 1 n + m t λtw Ψtw λ2q 2 1 ˆλ1, which implies that GK2q 1 n+m 2 λ2q 2 1 ˆλ1. Thus, Fp,q 2 HK λ2q 2 1 ˆλ2p 1 κ2. Note that v K, v K PX = GK2. 
So for any p, q 1 and any i, j [n + m], there is: Kp+q(xi, xj) D ˆKp(xi, ), Kq(xj, ) E e i GKp+qej e i Gp 1 K (n + m)p 1 GKq+1ej e i Gp t K (n + m)p t GKq+tej e i Gp t 1 K (n + m)p t 1 GKq+t+1ej (xl) e j v Kq+t (xl) p t 1 v K, e j v Kq+t t=1 λq+t 1 1 ˆλp t 1 1 κ4 n + m (p 1)λp+q 2 max κ4 n + m Thus, we have Ks2(xi, xj) ˆKs(xi, ), Ks(xj, ) PX = Kp+q(xi, xj) D ˆKp(xi, ), Kq(xj, ) E p,q=1 πpπq(p 1)λp+q 2 max κ4 n + m Similarly, we can show that: D ˆKp(xi, ), ˆKq(xj, ) E PX D ˆKp(xi, ), Kq(xj, ) E e i Gp 1 K (n + m)p 1 GK2 Gq 1 K (n + m)q 1 ej e i Gp 1 K (n + m)p 1 GKq+1ej e i Gp 1 K (n + m)p 1 GKt+1 Gq t K (n + m)q t ej e i Gp 1 K (n + m)p 1 GKt+2 Gq t 1 K (n + m)q t 1 ej p 1 v Kt+1, e j t=1 λt 1ˆλp+q t 2 1 κ4 n + m (q 1)λp+q 2 max κ4 n + m which implies that ˆKs(xi, ), ˆKs(xj, ) PX ˆKs(xi, ), Ks(xj, ) PX D ˆKp(xi, ), ˆKq(xj, ) E PX D ˆKp(xi, ), Kq(xj, ) E p,q=1 πpπq(q 1)λp+q 2 max κ4 n + m Published as a conference paper at ICLR 2024 Combining the above inequalities, we obtain Ks2(xi, xj) ˆKs(xi, ), Ks(xj, ) PX + ˆKs(xi, ), ˆKs(xj, ) PX ˆKs(xi, ), Ks(xj, ) PX p,q=1 πpπq(p + q 2)λp+q 2 max κ4 n + m so we get the result by expanding the derivative. B.6 PROOF OF THEOREM 3 Define v Ks,n(x) Rn such that v Ks,n(x)[i] = Ks(x, xi). Define v ˆ Ks,n(x) similarly. Recall the formulas f = α v Ks,n and ˆf = ˆα v ˆ Ks,n. Define f := ˆα v Ks,n. Since G ˆ Ks,n is p.s.d., we can see that ˆα 2 y 2 nβn , and ˆα 1 n ˆα 2. So if Eqn. (13) in Lemma 6 holds, then by Corollary 8, we have: ˆf f 2 PX = ˆα D v ˆ Ks,n v Ks,n, v ˆ Ks,n v Ks,n E = ˆα ˆKs(xi, ), ˆKs(xj, ) PX + Ks2(xi, xj) 2 ˆKs(xi, ), Ks(xj, ) PX ˆα β 2 n κ4 n + m ! y 2 2 n . By the definitions of α and ˆα, we can also see that: (GKs,n + nβn In)( ˆα α) = GKs,n G ˆ Ks,n ˆα. (15) Note that GKs,n G ˆ Ks,n 2 n GKs,n G ˆ Ks,n max. Thus, by Eqn. (15), we have: f f 2 HKs = ( ˆα α) GKs,n( ˆα α) = ( ˆα α) GKs,n G ˆ Ks,n ˆα nβn( ˆα α) ( ˆα α) ˆα 2 GKs,n G ˆ Ks,n 2 ˆα 2 + α 2 GKs,n G ˆ Ks,n 2 ˆα 2 0 β 2 n κ4 n + m ! y 2 2 n . And note that we have f f 2 PX s(λ1) f f 2 HKs s(λmax) f f 2 HKs . Thus, PX 2 ˆf f 2 β 2 n κ4 n + m ! y 2 2 n , as desired. B.7 PROOF OF PROPOSITION 3 By Assumption 3, there is f 2 HKs ϵ f 2 PX . Let f = P i uiψi, then this is equivalent to u2 i s (λi) ϵ X Published as a conference paper at ICLR 2024 Let s(λ) = λ 1 ηλ be the inverse Laplacian, then we have u2 i s(λi) X u2 i s (λi)/M Mϵ X which means that f also satisfies Assumption 3 w.r.t. s by replacing ϵ with Mϵ. All other conditions are the same, so we can continue to apply Theorems 2 and 3. B.8 PROOF OF PROPOSITION 4 This proof is similar to the proof of Proposition 4 in Zhai et al. (2024). Since ˆΨ is at most rank-d, there must be a function in span{ψ1, , ψd+1} that is orthogonal to ˆΨ. Thus, we can find two functions f1, f2 span{ψ1, , ψd+1} such that: f1 PX = f2 PX = 1, f1 is orthogonal to ˆΨ, f2 = u ˆΨ (which means that f1 f2), and ψ1 span{f1, f2}. Let ψ1 = α1f1 + α2f2, and without loss of generality suppose that α1, α2 [0, 1]. Then, α2 1 + α2 2 = 1. Let f0 = α2f1 α1f2, then f0 PX = 1 and f0, ψ1 PX = 0. This also implies that f0, ψ1 HKs = 0. Let β1, β2 [0, 1] be any value such that β2 1 + β2 2 = 1. Let f = B(β1ψ1 + β2f0), then f PX = B. And we have f 2 HKs = B2 β1ψ1 2 HKs + β2f0 2 HKs B2 β2 1 s(λ1) + β2 2 s(λd+1) ϵB2, as long as β2 2 s(λd+1) s(λ1) s(λd+1)[s(λ1)ϵ 1]. This means that f Fs. Let F(α1) := α1β1 +α2β2 = α1β1 + p 1 α2 1β2 for α1 [0, 1]. 
It is easy to show that F(α1) first increases and then decreases on [0, 1], so F(α1)2 min F(0)2, F(1)2 = min β2 1, β2 2 , which can be s(λd+1) s(λ1) s(λd+1)[s(λ1)ϵ 1] given that it is at most 1 2. Thus, for this f, we have PX = B(α1β1 + α2β2)f1 2 PX = B2F(α1)2 s(λd+1) s(λ1) s(λd+1)[s(λ1)ϵ 1]B2. If the equality is attained, then we must have f0 2 HKs = s(λd+1) 1. So if s(λd) > s(λd+1), then ˆΨ must span the linear span of ψ1, , ψd. Finally, we prove that maxf Fs minw Rd w ˆΨ f 2 PX s(λd+1) s(λ1) s(λd+1)[s(λ1)ϵ 1]B2 if ˆΨ spans span{ψ1, , ψd}. For any f = P i uiψi Fs, we have P i u2 i B2, and P i u2 i s(λi) ϵ P i u2 i . Let a = P i d+1 u2 i , and b = Pd i=1 u2 i . Then, a + b B2. So we have u2 i s(λi) ϵ X i u2 i 1 s(λ1) ϵ b + 1 s(λd+1) ϵ a (since s is monotonic) 1 s(λ1) ϵ (B2 a) + 1 s(λd+1) ϵ a = 1 s(λ1) ϵ B2 + 1 s(λd+1) 1 s(λ1) which implies that minw Rd w ˆΨ f 2 PX = a s(λd+1) s(λ1) s(λd+1)[s(λ1)ϵ 1]B2. B.9 PROOF OF THEOREM 4 For any f = P i uiψi HKs satisfying Assumption 3, f 2 HK = P i u2 i λi P i u2 i s(λi)/M ϵMB2. Define fd as the projection of f onto ˆΨ w.r.t. HK. Define RKHS H ˆΨ := span n ˆψ1, , ˆψd o as a subspace of HK, then fd H ˆΨ. Let K ˆΨ be the reproducing kernel of H ˆΨ. Let y := [ y1, , yn], where yi := fd(xi) + yi f (xi). Then, the KRR of fd with K ˆΨ is given by ˆf d = w ˆΨ, w = ˆΨ(Xn)ˆΨ(Xn) + nβn Id 1 ˆΨ(Xn) y. Published as a conference paper at ICLR 2024 First, we show a lower bound for the eigenvalues of ˆΨ(Xn)ˆΨ(Xn) . Similar to Eqn. (14), define F := {f = g1g2 | g1, g2 HK, g1 HK, g2 HK 1}, then we have: i=1 f(xi) EX PX [f(X)] for all f F; i=n+1 f(xi) EX PX [f(X)] for all f F hold simultaneously with probability at least 1 δ for any δ > 0. In what follows, we assume them to hold. Then for all f F, we have 1 n Pn i=1 f(xi) 1 m Pn+m i=n+1 f(xi) κ2 For any unit vector u Rd, let f = u ˆΨ. Then, f HK 1, so f 2 F. And we have i=n+1 f(xi)2 = GK,m[v1, , vd]u 2 2 = [m λ1v1, , m λdvd]u 2 j=1 m λju2 j m λd, which implies that i=1 f(xi)2 λd κ2 for all f = u ˆΨ where u Rd is a unit vector. Thus, for any unit vector u Rd, u ˆΨ(Xn) 2 2 n λd κ2 n 4 + 2 q Second, we bound ˆfd ˆf d 2 HK. Denote y := [f (x1) fd(x1), , f (xn) fd(xn)] Rn. Note that ˆΨ, ˆΨ HK = Id, so we have ˆΨ(Xn)ˆΨ(Xn) + nβn Id 1 ˆΨ(Xn)(y y) ˆΨ = ˆΨ(Xn)ˆΨ(Xn) + nβn Id 1 ˆΨ(Xn) y So it suffices to bound Q 1 ˆΨ(Xn) 2 2 where Q = ˆΨ(Xn)ˆΨ(Xn) + nβn Id, which is equal to the largest eigenvalue of ˆΨ(Xn) Q 2 ˆΨ(Xn), which is further equal to the largest eigenvalue of Q 2 ˆΨ(Xn)ˆΨ(Xn) by Sylvester s theorem. Let the eigenvalues of ˆΨ(Xn)ˆΨ(Xn) be µ1 µd 0, with corresponding eigenvectors α1, , αd that form an orthonormal basis of Rd. For all i [d], if µi = 0, then Q 2 ˆΨ(Xn)ˆΨ(Xn) αi = 0, meaning that αi is also an eigenvector of Q 2 ˆΨ(Xn)ˆΨ(Xn) with eigenvalue 0. And if µi > 0, then we have Qαi = ˆΨ(Xn)ˆΨ(Xn) αi + nβnαi = (µi + nβn)αi, which implies that Q2αi = Q(µi + nβn)αi = (µi + nβn)2αi = (µi+nβn)2 µi ˆΨ(Xn)ˆΨ(Xn) αi. Thus, αi is an eigenvector of Q 2 ˆΨ(Xn)ˆΨ(Xn) with eigenvalue µi (µi+nβn)2 . This means that α1, , αd are all eigenvectors of Q 2 ˆΨ(Xn)ˆΨ(Xn) . On the other hand, we have µd n λd κ2 n 4 + 2 q δ , and suppose that n is large enough so that µd n λd δ 2 . Then, we have Q 1 ˆΨ(Xn) 2 2 max i [d] µi (µi + nβn)2 max i [d] 1 µi 2 n λd . Thus, ˆfd ˆf d 2 Published as a conference paper at ICLR 2024 Third, we bound ˆfd ˆf d 2 PX . Denote f := [ ˆfd(x1) ˆf d(x1), , ˆfd(xn) ˆf d(xn)] Rn. Then, we have f = ˆΨ(Xn) ( ˆw w ) = ˆΨ(Xn) ˆΨ(Xn)ˆΨ(Xn) + nβn Id 1 ˆΨ(Xn) y. 
Similarly, we can show that the eigenvalues of ˆΨ(Xn) ˆΨ(Xn)ˆΨ(Xn) + nβn Id 1 ˆΨ(Xn) are µi µi+nβn , which are no larger than 1. Thus, f 2 y 2. And by Eqn. (16), we have y 2 2 n f fd 2 PX + f fd 2 HK κ2 n So by Eqn. (16), we have ˆfd ˆf d 2 PX f 2 2 n + ˆfd ˆf d 2 HK κ2 n f fd 2 PX + f fd 2 HK κ2 n f fd 2 PX + λd 4 f fd 2 HK where the final step uses n 4κ4 Finally, we bound ˆf d fd PX using Theorem 6 with K = K ˆΨ, α = β = 1, and γ = 0. Recall the four conditions: (EVD) is satisfied for any p (0, 1] since K ˆΨ has at most d non-zero eigenvalues. (EMB) is satisfied with c2 = κ since f H ˆ Ψ = f HK for all f H ˆΨ. (SRC) is satisfied with c3 = ϵMB since fd H ˆ Ψ f HK ϵMB. (MOM) is satisfied by condition. And we have B = κ ϵMB. Thus, when τ κ 1, it holds with probability at least 1 4e τ that ˆf d fd 2 2 τ 2h κ2MϵB2 + κ2σ2 n 1 1+p + κ2 max L2, κ2MϵB2 n 1+2p Combining the two inequalities with (a + b)2 2(a2 + b2) yields the result. B.10 PROOF OF THEOREM 5 We start with the following lemma: Lemma 9. For any g HK such that g HK = 1 and g, ˆψi HK = 0 for all i [d], there is ˆψ1 2 PX + + ˆψd 2 PX + g 2 PX λ1 + + λd+1. (18) Proof. Let h ˆψ1, , ˆψd, g i = QD1/2 λ Ψ , where Dλ = diag(λ1, λ2, ), and Ψ = [ψ1, ψ2, ]. Then, QQ = Dh ˆψ1, , ˆψd, g i , h ˆψ1, , ˆψd, g i E HK = Id+1, and ˆψ1 2 PX + + ˆψd 2 PX + g 2 PX = Tr QDλQ . So we obtain the result by applying Lemma 9 in Zhai et al. (2024). Published as a conference paper at ICLR 2024 Define Fd := n f = Pd i=1 g2 i gi HK, gi, gj HK = δi,j o . We now bound its Rademacher com- plexity. For any x, denote Ψ(x) = [λ1/2 1 ψ1(x), λ1/2 2 ψ2(x), ]. For any S = {x1, , xm}, denote Ψk = Ψ(xk) for k [m]. Let gi(x) = u i Ψ(x), and denote U = [u1, , ud]. Then, U U = Id. Let σ = [σ1, , σm] be Rademacher variates. Thus, the empirical Rademacher complexity satisfies: ˆRS(Fd) E σ sup u1, ,ud i=1 σku i ΨkΨ k ui sup U:U U=Id k=1 σkΨkΨ k sup U:U U=Id k=1 σkΨkΨ k Let µ1 µ2 be the singular values of 1 m Pm k=1 σkΨkΨ k . For any U, the singular values of UU are d ones and then zeros. So by von Neumann s trace inequality, we have: sup U:U U=Id k=1 σkΨkΨ k UU ! µ1 + + µd k=1 σkΨkΨ k where the last step is Cauchy-Schwarz inequality applied to the diagonalized matrix. So we have: k=1 σkΨkΨ k d m almost surely, where the last step was proved in Proposition 13 of Zhai et al. (2024). Thus, for the Rademacher complexity we have Rm(Fd) = ES[ ˆRS(Fd)] κ2 d m . Moreover, for PX -almost all x, we have: i=1 gi(x)2 = Ψ(x) d X Ψ(x) = Ψ(x) UU Ψ(x) Ψ(x) 2 2 UU 2 = Ψ(x) 2 2 κ2, where the last step is because Ψ(x) Ψ(x) = P i λiψi(x)2 κ2 for all x. Hence, by Theorem 4.10 of Wainwright (2019), we have: 1 m i=n+1 f(xi) EX PX [f(X)] for all f Fd (19) holds with probability at least 1 δ 2. Let F(x) = Pd i=1 ˆψi(x)2. Then, F Fd. And for all i [d], there is [ ˆψi(xn+1), , ˆψi(xn+m)] = Gk,mvi = m λivi, so ˆψi(xn+1)2 + + ˆψi(xn+m)2 = m2 λ2 i vi 2 2 = m λi. Thus, 1 m Pn+m i=1 F(xi) = λ1 + + λd. So if Eqn. (19) holds, then ˆψ1 2 PX + + ˆψd 2 PX λ1 + + λd κ2 Since ˆλ1, , ˆλd are the eigenvalues of Gk,m m , by Theorem 3.2 of Blanchard et al. (2007), we have: λ1 + + λd λ1 + + λd κ2 holds with probability at least 1 δ 2. By union bound, it holds with probability at least 1 δ that ˆψ1 2 PX + + ˆψd 2 PX λ1 + + λd κ2 Let f fd = bg, where b R, and g HK satisfies g HK = 1 and g, ˆψi HK = 0 for i [d]. Then, by Lemma 9, we have g 2 PX λd+1 + κ2 Published as a conference paper at ICLR 2024 Let a = fd HK. Then, f 2 HK = a2 + b2. 
Since f 2 HK ϵM f 2 PX , we have: ϵM f 2 PX fd PX + b g PX 2 a p λ1 + b g PX 2 2(a2λ1 + b2 g 2 PX ), which implies that (λ1 g 2 PX )b2 λ1 1 2ϵM (a2 + b2) λ1ϵM 1 which completes the proof. C DETAILS OF NUMERICAL IMPLEMENTATIONS Algorithm 3 Directly solving STKR Input: K(x, x ), s(λ) = Pq p=1 πpλp, βn, y # GK,n+m,n is the left n columns of GK 1: Initialize: M GK,n+m,n R(n+m) n 2: A nβn In + π1GK,n Rn n 3: for p = 2, , q do 4: A A + πp n+m G K,n+m,n M 5: M 1 n+m GKM 6: Solve A ˆα = y STKR amounts to solving A ˆα = y for A = G ˆ Ks,n + nβn In. First, consider s(λ) = Pq p=1 πpλp with q < . Provided that computing K(x, x ) for any x, x takes O(1) time, directly computing A and then solving A ˆα = y as described above has a time complexity of O((n+m)2nq) as it performs O(q) matrix multiplications, and a space complexity of O((n + m)n). Calculating A directly could be expensive since it may require many matrix-matrix multiplications. Alternatively we can use iterative methods, such as Richardson iteration that solves A ˆα = y with ˆαt+1 = ˆαt γ(A ˆαt y) for some γ > 0, as described in Algorithm 1 in the main text. This is faster than the direct method since it replaces matrix-matrix multiplication with matrix-vector multiplication. It is well-known that with a proper γ, it takes O(κ(A) log 1 ϵ ) Richardson iterations to ensure ˆαt ˆα 2 < ϵ ˆα 2, where ˆα is the real solution, and κ(A) is the condition number of A. Let λ be a known upper bound of λ1 (e.g. for augmentation-based pretraining, λ = 1 (Zhai et al., 2024)). With a sufficiently large n, we have κ(A) = O(β 1 n s(λ)) almost surely, so Richardson iteration has a total time complexity of O((n + m)2β 1 n s(λ)q log 1 ϵ ), and a space complexity of O(n + m). The method can be much faster if K is sparse. For instance, if K is the adjacency matrix of a graph with |E| edges, then each iteration only takes O(q |E|) time instead of O(q(n + m)2). Next, we consider the case where s could be complex, but s 1(λ) = Pq 1 p=0 ξpλp r is simple, such as the inverse Laplacian (Example 1). In this case we cannot compute G ˆ Ks,n, but if we define Q := Pq 1 p=0 ξp GK n+m p , then there is G ˆ Ks Q = (n + m) GK n+m r . With this observation, we use an indirect approach to find ˆα, which is to find a θ Rn+m such that Qθ = [ ˆα, 0m] . Note that by the definition of ˆα, the first n elements of G ˆ Ks + nβn In+m [ ˆα, 0m] = G ˆ Ks + nβn In+m Qθ = h (n + m) GK n+m r + nβn Q i θ is y, which provides n linear equations. The last m elements of Qθ are zeros, which gives m linear equations. Combining these two gives an (n + m)-dimensional linear system, which we simplify as: Mθ = y, where M := (n + m) In r + nβn Q, and y := [y, 0m] . (20) Here, In := diag{1, , 1, 0, , 0}, with n ones and m zeros. Again, we can find θ by Richardson iteration, as described in Algorithm 2, with O(n + m) space complexity. Let v K be defined as in Published as a conference paper at ICLR 2024 Eqn. (5). Then, at inference time, one can compute ˆf(x) = v K(x) GK n+m r 1 θ in O(n + m) time by storing GK n+m r 1 θ in the memory. We now discuss the time complexity of Algorithm 2. We cannot use the previous argument because now M is not symmetrical. Let ρ(λ) := λr s(λ) = Pq 1 p=0 ξpλp, where ρ(0) = ξ0 > 0 . Then, ρ(λ) is continuous on [0, + ). Denote its maximum and minimum on [0, λ] by ρmax and ρmin, where again λ is a known upper bound of λ1. Then, we have the following: Proposition 10. 
With γ = (nλr) 1, Algorithm 2 takes O λr βnρmin log max n 1 ϵ , λrρmax y 2 nβ2nρ2 min ˆα 2 iterations so that ˆα ˆα 2 ϵ ˆα 2 almost surely for sufficiently large n, where ˆα is the ground truth solution and ˆα is the output of the algorithm. Each iteration takes O(max {q, r}(n+m)2) time. Thus, the total time complexity is O (n + m)2β 1 n max {q,r}λr ρmin log max n 1 ϵ , λrρmax y 2 nβ2nρ2 min ˆα 2 Proof. Let ˆλ1 ˆλn+m be the eigenvalues of GK n+m. It is easy to show that Q has the same eigenvectors as GK n+m, with eigenvalues g(ˆλ1), , g(ˆλ2). By Lemma 2 and Borel-Cantelli lemma, as n , ˆλ1 a.s. λ1. For simplicity, let us assume that λ is slightly larger than λ1, so almost surely there is ˆλ1 λ. Then, all eigenvalues of Q are in [ρmin, ρmax]. The first part of this proof is to bound ut 2, where ut := (n + m) In GK n+m r (θ θt). Let θt be the θ at iteration t, and θ be the solution to Eqn. (20). Since θ0 = 0, we have θ θt = In+m γ (n + m) In r + nβn Q θ + γ y In+m γ (n + m) In r + nβn Q θt 1 + γ y = In+m γ (n + m) In r + nβn Q (θ θt 1) = In+m γ (n + m) In r + nβn Q t θ . Note that GK r/2 In+m γ (n + m) In r/2 + nβn Q Thus, by propagating GK n+m r/2 from left to right, we get r/2 (θ θt) = (In+m γR)t GK where R := (n + m) GK n+m r/2 In GK n+m r/2 + nβn Q is a p.s.d. matrix. Denote the smallest and largest eigenvalues of R by λmin and λmax. Then, λmin nβnρmin. In terms of λmax, we have By Sylvester s theorem, all non-zero eigenvalues of G 1 2 K are the eigenvalues of In GK In, i.e. the non-zero eigenvalues of GK,n. By Lemma 2, 1 n GK,n 2 a.s. λ1, so suppose GK,n 2 nλ. Then, λmax nλr + nβnρmax. Published as a conference paper at ICLR 2024 Since Mθ = y, and GK n+m r/2 M = R GK n+m r/2 , we have R GK n+m r/2 θ = GK n+m r/2 y. Note that R(In+m γR) = (In+m γR)R. Thus, we have GK r/2 (θ θt) = (In+m γR)t GK = R 1(In+m γR)t R GK = R 1(In+m γR)t GK Now we bound ut 2. First, note that for any matrices A, B Rd d where B is p.s.d., there is u A BAu B 2 Au 2 2 B 2 A A 2 u 2 2 for any u Rd, so A BA 2 B 2 A A 2. Second, note that the last m elements of y are zeros, which means that y = In y. Thus, we have ut 2 = (n + m) In r/2 R 1(In+m γR)t GK r/2 (In+m γR)t/2R 1(In+m γR)t/2 GK (In+m γR)t/2R 1(In+m γR)t/2 2 1 λmin In+m γR t 2(nλr 1) y 2, where the last step is because we have already proved (n + m) In GK n+m r In 2 nλr 1. Now, for γ = 1 nλr , when n is sufficiently large it is less than 2 λmax+ λmin , because βn = o(1). Thus, In+m γR 2 1 λmin nλr 1 βnρmin λr . Thus, we have ut 2 1 βnρmin βnρmin y 2. The second part of this proof is to bound Q(θ θt) 2. Let us return to Eqn. (21), which says that Q(θ θt+1) 2 = (In+m γnβn Q)Q(θ θt) γQut 2 Q(θ θt) 2 + ρmax Here again, we assume that n is large enough so that λr > βnρmin. This implies that Q(θ θt+1) 2 t 1 βnρmin Q(θ θt) 2 (t 1) 1 βnρmin t 1 ρmax y 2 Qθ 2 + ρmax y 2 Thus, there is Q(θ θt) 2 1 βnρmin λr t Qθ 2 + t 1 βnρmin λr t 1 ρmax y 2 nβnρmin . Using 1 x e x, we have Q(θ θt) 2 exp βnρmint Qθ 2 + t exp βnρmin(t 1) Published as a conference paper at ICLR 2024 When t = t0 := 4λr βnρmin log 2λrρmax y 2 nβ2nρ2 min Qθ 2 , by log(2x) x for x > 0, we have 2λrρmax y 2 nβ2nρ2 min Qθ 2 2 4λrρmax y 2 nβ2nρ2 min Qθ 2 log 2λrρmax y 2 nβ2nρ2 min Qθ 2 Let F(t) := exp βnρmin 2λr t ρmax y 2 nβnρmin Qθ 2 t. Then we have F(t0) 0. And it is easy to show that for all t t0 2 , there is F (t) 0. 
This means that when t t0, there is F(t) 0, so we have Q(θ θt) 2 exp βnρmint Qθ 2 + exp βnρmin Hence, when t max n 2λr βnρmin log 2 ϵ + 2, t0 o , we have Q(θ θt) 2 ϵ Qθ 2, which implies that the relative estimation error of ˆα is less than ϵ. D EXPERIMENTS The purpose of our experiments is threefold: (i) Verify that STKR-Prop (Algorithms 1 and 2) works with general polynomial s(λ) including inverse Laplacian under transductive and inductive settings on real datasets, and compare them to label propagation (for the transductive setting). (ii) Explore possible reasons why canonical Laplacian works so well empirically, by examining the effect of p on the performance when using STKR with s(λ) = λp. (iii) Verify that STKR-Prop works with kernel PCA on real datasets, and compare it to other methods. Datasets. We focus on graph node classification tasks, and work with the publicly available datasets in the Py Torch Geometric library (Fey & Lenssen, 2019), among which Cora, Cite Seer and Pub Med are based on Yang et al. (2016); Computers, Photos, CS and Physics are based on Shchur et al. (2018); DBLP and Cora Full are based on Bojchevski & Günnemann (2018). See Table 3 for a summary of the dataset statistics. Train/val/test/other splits. We split a dataset into four sets: train, validation (val), test and other. Among them, train and val contain labeled samples, while test and other contain unlabeled samples. The test performance which we will report later is only evaluated on the test set. The val set is used to select the best model, so it is used in a similar way as the test set as explained below: In the transductive setting, the learner can see all four sets at train time. The learner manually hides the labels of the samples in the val set (so that the val performance approximates the test performance). Thus, n is the size of the train set, while m is the size of the other three combined. In the inductive setting, the learner can see train, val and other, but not test. Neither can it see any edges connected to test nodes. Then, the learner manually hides the entire val set (nodes, outcoming edges and labels), so that the val performance approximates the test performance. Thus, n is the size of the train set, while m is the size of the other set. In all our experiments, these four sets are randomly split. This means that with the same random seed, the four splits are exactly the same; With a different random seed, the four splits are different, but their sizes are kept the same for the same dataset. Sizes of the splits. First, we specify a hyperparameter ptest, and then ptest of all the samples will be in the test set. For Cora, Cite Seer and Pub Med, we use the default train/validation/test split size, and from the test set we take out ptest #(all samples) of the samples to be the real test set, and the rest of the samples go to the other set. For the other six datasets, we set the train and validation set size to be 20 number of classes. For example, the Physics dataset has 5 classes, so we randomly sample 100 samples to be train data, and another 100 samples to be validation data. We also do an ablation study for ptest in our experiments, where ptest could range from 1% to 50%. Published as a conference paper at ICLR 2024 Table 3: Number of classes, nodes, edges, and fractions (%) of train and validation sets. 
Dataset | Classes | Nodes | Edges | Train (%) | Validation (%)
Cora | 7 | 2,708 | 10,556 | 5.17 | 18.46
CiteSeer | 6 | 3,327 | 9,104 | 3.61 | 15.03
PubMed | 3 | 19,717 | 88,648 | 0.3 | 2.54
Amazon - Computers | 10 | 13,752 | 491,722 | 1.45 | 1.45
Amazon - Photos | 8 | 7,650 | 238,162 | 2.09 | 2.09
Coauthor - CS | 15 | 18,333 | 163,788 | 1.64 | 1.64
Coauthor - Physics | 5 | 34,493 | 495,924 | 0.29 | 0.29
DBLP | 4 | 17,716 | 105,734 | 0.45 | 0.45
CoraFull | 70 | 19,793 | 126,842 | 7.07 | 7.07

Implementations. For label propagation, we use the version in Zhou et al. (2003), which solves:
(I_{n+m} - ηS) ŷ = ỹ, where ỹ := [y, 0_m].   (22)
Here ŷ contains the predicted labels for all n + m samples under the transductive setting, and S is defined as S := D^{-1/2} W D^{-1/2}, where W is the adjacency matrix such that W[i, j] = 1 if x_i is connected to x_j and 0 otherwise, and D is a diagonal matrix defined as D[i, i] = Σ_{j=1}^{n+m} W[i, j] for i ∈ [n + m]. For STKR, we define the base kernel K as:
K(x, x') = (n + m) W(x, x') / √(D(x) D(x')),   (23)
where W(x_i, x_j) = W[i, j]. For the transductive setting, D(x_i) = D[i, i], so that S = G_K / (n + m). For the inductive setting, D(x_i) = Σ_{j ∉ test nodes} W(x_i, x_j) for all i ∈ [n + m], i.e. the sum is only taken over the nodes visible at train time.

Hyperparameters. Below are the hyperparameters we use in the experiments. The best hyperparameters are selected with the validation split as detailed above.
Label propagation: number of iterations T ∈ [1, 2, 4, 8, 16, 32]; η ∈ [0.7, 0.8, 0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999].
STKR transductive: number of iterations T ∈ [1, 2, 4, 8, 16, 32]; inverse Laplacian s^{-1}(λ) = λ^{-1} - η with η ∈ [0.7, 0.8, 0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999]; polynomial s(λ) = λ^k with k ∈ [1, 2, 4, 6, 8]; β ∈ [10^3, 10^2, 10^1, 10^0, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}, 10^{-8}].
STKR inductive: inverse Laplacian s^{-1}(λ) = λ^{-1} - η with η ∈ [0.7, 0.8, 0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999]; polynomial s(λ) = λ^k with k ∈ [1, 2, 4, 6, 8]; β ∈ [10^3, 10^2, 10^1, 10^0, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}, 10^{-8}].
Kernel PCA: representation dimension d ∈ [32, 64, 128, 256, 512]; β ∈ [10^3, 10^2, 10^1, 10^0, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}, 10^{-8}].

D.2 RESULTS We report the test accuracy of STKR-Prop with different transformations and the Label-Prop algorithm in Table 4. To make a fair comparison between the transductive and inductive settings, we report the test accuracy on the same ptest = 0.01 fraction of the data (for the same random seed). This data is part of the unlabeled data in the transductive setting, but is completely hidden from the learner in the inductive setting at train time. First, our results show that STKR-Prop works well with a general s(λ) under the inductive setting. The drops in accuracy of inductive STKR-Prop compared to transductive STKR-Prop are small. In
Table 4: The test accuracy (%) of Label-Prop (LP), STKR-Prop (SP) with inverse Laplacian (Lap), with polynomial s(λ) = λ^8 (poly), with kernel PCA (topd), and with s(λ) = λ (KRR). (t) and (i) indicate the transductive and inductive settings. We report the test accuracy when ptest = 0.01, i.e. test samples account for 1% of all samples. Standard deviations are given across ten random seeds. The best and second-best results for each dataset are marked in red and blue, respectively.
CS Cite Seer Computers Cora Cora Full DBLP Photo Physics Pub Med LP (t) 79.072.19 52.737.72 77.303.05 73.336.00 54.473.24 66.443.78 83.955.78 84.334.86 72.285.55 SP-Lap (t) 78.962.53 52.127.67 77.813.94 77.045.74 53.812.34 65.425.02 84.086.52 84.224.86 71.934.86 SP-poly (t) 79.132.29 48.798.51 76.724.12 71.485.80 53.253.54 64.524.20 79.217.20 84.454.89 72.184.66 SP-topd (t) 78.803.22 46.061.08 80.803.06 69.267.82 50.362.85 64.864.60 84.616.30 83.202.25 65.385.66 SP-Lap (i) 78.422.81 46.066.97 77.152.64 67.787.62 53.303.24 65.204.92 84.875.66 83.115.09 70.364.80 SP-poly (i) 79.022.42 44.559.15 71.974.13 65.199.11 51.983.88 64.524.05 78.427.80 84.684.83 70.764.28 SP-topd (i) 79.133.35 41.526.71 80.803.28 63.706.00 47.413.39 63.163.41 85.535.68 82.443.88 64.314.95 KRR (i) 13.112.29 13.645.93 26.354.34 28.528.56 19.802.22 44.803.86 33.957.07 19.741.46 20.762.06 Table 5: Test accuracy (%) of STKR-Prop (SP) with inverse Laplacian (Lap), with polynomial s(λ) = λ8 (poly) and with kernel PCA (topd) for inductive setting with different value of ptest {0.01, 0.05, 0.1, 0.2, 0.3}. Standard deviations are given across ten random seeds. Methods ptest CS Cite Seer Computers Cora Cora Full DBLP Photo Physics Pub Med 0.01 78.422.81 46.066.97 77.152.64 67.787.62 53.303.24 65.204.92 84.875.66 83.115.09 70.364.80 0.05 77.931.41 41.754.82 77.422.25 62.373.66 51.630.90 65.572.52 84.902.50 82.724.26 67.813.56 SP-Lap 0.1 76.451.17 39.972.68 77.401.88 59.742.37 50.220.75 65.342.09 84.101.84 82.024.01 66.433.62 0.2 75.061.01 35.501.94 77.182.03 55.102.92 47.840.79 63.241.96 83.811.23 81.333.59 63.773.36 0.3 72.890.95 30.921.14 76.591.71 50.222.87 44.960.77 60.441.68 83.281.11 80.273.45 60.252.75 0.01 79.133.35 44.559.15 71.974.13 65.199.11 51.983.88 64.524.05 78.427.80 84.684.83 70.764.28 0.05 78.741.42 40.365.51 73.041.80 61.854.15 50.211.69 64.642.07 79.034.61 84.053.99 67.682.91 SP-poly 0.1 77.511.03 38.582.90 73.101.58 58.562.59 49.060.90 64.211.93 78.594.49 83.133.69 66.082.96 0.2 75.741.02 33.651.78 72.701.84 53.112.39 46.530.79 61.892.03 78.823.05 82.363.27 62.973.10 0.3 73.350.76 28.981.21 72.351.60 47.992.62 43.420.78 59.552.02 78.202.73 81.083.25 59.322.57 0.01 79.133.35 41.526.71 80.803.28 63.706.00 47.413.39 63.163.41 85.535.68 82.443.88 64.314.95 0.05 78.371.58 40.005.14 80.172.30 61.703.53 47.171.63 62.793.36 85.472.07 82.261.88 62.793.28 SP-topd 0.1 77.171.02 38.674.17 79.352.57 58.813.26 45.220.89 61.512.60 84.851.70 80.591.86 61.752.26 0.2 75.610.64 35.102.64 79.002.02 55.514.21 42.310.91 60.302.60 84.631.45 80.151.86 59.603.22 0.3 72.590.70 32.921.92 78.091.27 51.714.39 38.620.79 59.182.29 83.911.69 79.072.33 57.502.50 Table 5, we further provide an ablation on the test accuracy as we increase ptest. As ptest is larger, the performance of inductive STKR decreases across the board. Nevertheless, there are many datasets such as Photo, Physics, Computer where the performance drop is fairly small at around 2 3 percent even when ptest = 0.3. Our ablation study shows that inductive STKR is quite robust to the number of available unlabeled data at the training time. Our experiment clearly demonstrates that one can implement STKR with a general transformation such as s(λ) = λp efficiently. The running time of STKR-Prop is similar to that of Label-Prop with the same number of iterations. Second, we explore the impact of the number of hop p on the performance of STKR with s(λ) = λp (Figure 2). 
We consider p ∈ {1, 2, 4, 6, 8}, and note that for p = 1, this STKR is equivalent to performing KRR with the base kernel. As already seen in Table 4, the performance of such KRR is extremely poor compared to the other methods. This is consistent with our analysis in Section 2 that KRR with the base kernel is not sufficient. We find that increasing p, which in turn strengthens the additional smoothness requirement, improves the performance of STKR by a large margin on all datasets. This clearly illustrates the benefits of the transitivity of similarity, and offers a possible explanation of why the inverse Laplacian transformation performs so well in practice: it uses multi-hop similarity information up to infinitely many hops.

Third, the results show that STKR also works well with kernel PCA. Comparing kernel PCA with LP (or STKR with inverse Laplacian), the former is better on 3 of the 9 datasets we use. This experiment thus demonstrates the parallel nature of these two methodologies: STKR with inverse Laplacian, and kernel PCA.

Finally, we provide an ablation study on the effect of η in the inverse Laplacian on the performance of SP-Lap. Recall that for the inverse Laplacian we have s^{-1}(λ) = λ^{-1} - η. Our observation is that when η is very close to 0, the performance is low; once η is bounded away from 0, the performance is fairly consistent, and gets slightly better with a larger η. Only on the CS dataset do we observe a significant performance boost as η increases.

Figure 2 (plot omitted; one panel per dataset, y-axis: test accuracy): Test accuracy (%) of STKR-Prop (SP) with polynomial s(λ) = λ^p for p ∈ {1, 2, 4, 6, 8}. The test accuracy increases significantly once p is larger than 1, illustrating the benefits of the transitivity of similarity.

Figure 3 (plot omitted; one curve per dataset, y-axis: test accuracy): Test accuracy of SP-Lap with different values of η. The test accuracy is fairly consistent as long as η is not too close to 0, and gets slightly better with a larger η. All reported performances are averaged across ten random seeds.
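To make the pipeline of Appendices C and D concrete, below is a minimal transductive sketch of STKR-Prop with the graph base kernel of Eqn. (23) and a polynomial transformation s(λ) = λ^8, solved by Richardson iteration using matrix-vector products only. The toy random graph, the labels, β_n, the step size and the iteration count are our own assumptions, not the paper's settings; the inductive and inverse-Laplacian variants (Algorithm 2) and kernel PCA are omitted.

```python
# Minimal transductive STKR-Prop sketch for a polynomial transformation
# s(lambda) = sum_p pi_p * lambda^p, solved with Richardson iteration
# (cf. Algorithms 1 and 3, and the base kernel of Eqn. (23)).
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 20                            # N = n + m nodes; the first n are labeled
W = (rng.random((N, N)) < 0.05).astype(float)
W = np.triu(W, 1)
W = W + W.T                               # toy undirected graph (an assumption)
deg = W.sum(axis=1) + 1e-12
S = W / np.sqrt(np.outer(deg, deg))       # normalized adjacency = G_K / (n + m)

pi = {8: 1.0}                             # s(lambda) = lambda^8, as in SP-poly
beta_n, T = 0.1, 1000                     # regularization and iteration count (assumptions)
gamma = 1.0 / (N * sum(pi.values()) + n * beta_n)  # step size below 1 / lambda_max(A)
y = rng.choice([-1.0, 1.0], size=n)       # labels of the n labeled nodes

def apply_A(alpha):
    """Compute (G_{Ks,n} + n * beta_n * I_n) @ alpha using only matvecs with S."""
    v = np.zeros(N)
    v[:n] = alpha                         # pad with zeros for the unlabeled nodes
    out = n * beta_n * alpha
    Sv = v
    for p in range(1, max(pi) + 1):
        Sv = S @ Sv                       # Sv = S^p v after p products
        if p in pi:
            out = out + pi[p] * N * Sv[:n]
    return out

alpha = np.zeros(n)
for _ in range(T):                        # Richardson iteration for A alpha = y
    alpha -= gamma * (apply_A(alpha) - y)

# Transductive predictions on all N nodes: f_hat = (n + m) * sum_p pi_p * S^p [alpha; 0]
v = np.zeros(N)
v[:n] = alpha
pred, Sv = np.zeros(N), v
for p in range(1, max(pi) + 1):
    Sv = S @ Sv
    if p in pi:
        pred += pi[p] * N * Sv
```

Because each update only requires products with S, the same sketch extends directly to sparse adjacency matrices, where one product costs O(|E|), as noted in Appendix C.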