# nonparametric_teaching_of_implicit_neural_representations__48af9982.pdf Nonparametric Teaching of Implicit Neural Representations Chen Zhang 1 * Steven Tin Sui Luo 2 * Jason Chun Lok Li 1 Yik-Chung Wu 1 Ngai Wong 1 We investigate the learning of implicit neural representation (INR) using an overparameterized multilayer perceptron (MLP) via a novel nonparametric teaching perspective. The latter offers an efficient example selection framework for teaching nonparametrically defined (viz. non-closedform) target functions, such as image functions defined by 2D grids of pixels. To address the costly training of INRs, we propose a paradigm called Implicit Neural Teaching (INT) that treats INR learning as a nonparametric teaching problem, where the given signal being fitted serves as the target function. The teacher then selects signal fragments for iterative training of the MLP to achieve fast convergence. By establishing a connection between MLP evolution through parameter-based gradient descent and that of function evolution through functional gradient descent in nonparametric teaching, we show for the first time that teaching an overparameterized MLP is consistent with teaching a nonparametric learner. This new discovery readily permits a convenient drop-in of nonparametric teaching algorithms to broadly enhance INR training efficiency, demonstrating 30%+ training time savings across various input modalities. 1. Introduction Implicit neural representation (INR) (Sitzmann et al., 2020b; Tancik et al., 2020) focuses on modeling a given signal, which is often discrete, through the use of an overparameterized multilayer perceptron (MLP) such that the signal is accurately fitted by this MLP preserving great details. Such an overparameterized MLP inputs low-dimensional *Equal contribution 1Department of Electrical and Electronic Engineering, The University of Hong Kong, HKSAR, China 2Department of Computer Science, The University of Toronto, Ontario, Canada. Correspondence to: Ngai Wong . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). 0th Iteration Tth Iteration 4999th Iteration Selected Fragments Nonparametric Teacher MLP Learner 5000th Iteration T+1th Iteration 1st Iteration Figure 1. Fitting a 2D grayscale image signal with Implicit Neural Teaching (INT): By comparing the disparity between the given signal and the current MLP output (a), the nonparametric teacher (b) selectively chooses examples (pixels) of the greatest disparity (red boxes), instead of a raster scan, to feed to the MLP learner (c) who undergoes learning (d) and outputs the final (e). coordinates of the given signal and outputs corresponding values for each input location, e.g., the MLP maps 2D input coordinates to their respective 8-bit levels for a grayscale image. INR has proven to be promising in various domains, including vision data representation (Sitzmann et al., 2020b; Reddy et al., 2021), view synthesis (Martin-Brualla et al., 2021; Mildenhall et al., 2021) and signal compression (Dupont et al., 2021; Pistilli et al., 2022; Str umpler et al., 2022; Schwarz et al., 2023). Nevertheless, the training of an overparameterized multilayer perceptron (MLP) in INR can be costly, especially when dealing with high-definition signals. For instance, consider the case of a 2D grayscale image with a resolution of 1024 1024, which leads to a training set comprising 106 pixels. 
Additionally, for long videos, the scale of the Our project page is available at https://chen2hang.github. io/_publications/nonparametric_teaching_of_implicit_ neural_representations/int.html. Nonparametric Teaching of Implicit Neural Representations training set can become prohibitively large. Consequently, it becomes imperative to lower the training cost and enhance the training efficiency of INR. A recent investigation on nonparametric teaching (Zhang et al., 2023b;a) presents a theoretical framework to facilitate efficient example selection when the target function is nonparametric, i.e., implicitly defined. This inspires a fresh perspective on universally enhancing training efficiency of INR herein. Specifically, machine teaching (Zhu, 2015; Liu et al., 2017; Zhu et al., 2018) considers the design of a training set (dubbed the teaching set) for the learner, with the goal of enabling speedy convergence towards target functions. Nonparametric teaching (Zhang et al., 2023b;a) relaxes the assumption of target functions being parametric (Liu et al., 2017; 2018) to encompass the teaching of nonparametric target functions. In the context of INR, an overparameterized MLP f is akin to a nonparametric function due to its nonlinear activation functions (Leshno et al., 1993) and the inability to be represented solely by its weights w as f(x) = w, x with input x (Liu et al., 2017; Zhang et al., 2023b), despite appearing to be a parametric function with w. Unfortunately, the evolution of an MLP is typically achieved by gradient descent on its parameters, whereas nonparametric teaching involves functional gradient descent as the means of function evolution. Bridging this (theoretical + practical) gap is of great value and calls for more examination prior to the application of nonparametric teaching algorithms in the context of INR. To this end, we recast the evolution achieved through parameter-based gradient descent of an MLP by using dynamic neural tangent kernel (NTK)1 (Jacot et al., 2018; Lee et al., 2019; Bietti & Mairal, 2019; Dou & Liang, 2021). We express this evolution, from a high-level standpoint of function variation, using functional gradient descent. We show that this dynamic NTK converges to the canonical kernel used in functional gradient descent, indicating that the evolution of the MLP using parameter gradient descent aligns with that using functional gradient descent2 (Geifman et al., 2020; Chen & Xu, 2020). Therefore, it is natural to cast INR as a nonparametric teaching problem: The given signal to be fitted serves as the target function, and the teacher chooses specific signal fragments prior to providing them to an overparameterized MLP learner, ensuring the MLP fits the signal accurately and efficiently. Consequently, to improve the training efficiency of INR without scenario specification, we propose a novel paradigm called Implicit Neural Teaching (INT), where the teacher leverages the 1Although NTK for an infinite width MLP remains unchanged during training (Jacot et al., 2018), we do not restrict the width of the MLP to be infinite, and instead consider the dynamic NTK. 2Another example of the alignment is that teaching a parametric function is a special case of nonparametric teaching by using a linear kernel (Zhang et al., 2023b). counterpart of the greedy teaching algorithm in nonparametric teaching (Zhang et al., 2023b;a) for INR, namely, selecting examples of the greatest disparity between the given signal and the MLP output (Arbel et al., 2019; Cormen et al., 2022). 
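To make this selection rule concrete before the formal treatment in Section 4, a minimal PyTorch-style sketch of disparity-based example selection is given below. The names `model`, `coords`, and `target` are illustrative placeholders rather than identifiers from the paper, and the tensor shapes (a flat list of coordinates with one value per coordinate) are assumptions.

```python
import torch

def select_fragments(model, coords, target, k):
    """Pick the k signal fragments (e.g., pixels) on which the current MLP
    output disagrees most with the given signal -- a sketch of the greedy,
    disparity-based selection described above."""
    with torch.no_grad():
        pred = model(coords)                           # current MLP output at all coordinates
        disparity = (pred - target).abs().squeeze(-1)  # per-example |f_theta(x) - f*(x)|
    _, idx = torch.topk(disparity, k)                  # indices of the k largest disparities
    return coords[idx], target[idx]
```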
Figure 1 depicts an intuitive illustration of INT. Lastly, we conduct extensive experiments to validate the effectiveness of INT. Our key contributions are: We propose Implicit Neural Teaching (INT) that novelly interprets implicit neural representation (INR) via the theoretical lens of nonparametric teaching, which in turn enables the utilization of greedy algorithms from the latter to effectively bolster the training efficiency of INRs. We unveil a strong link between the evolution of a multilayer perceptron (MLP) using gradient descent on its parameters and that of a function using functional gradient descent in nonparametric teaching. This connects nonparametric teaching to MLP training, thus expanding the applicability of nonparametric teaching towards deep learning. We further show that the dynamic NTK, derived from gradient descent on the parameters, converges to the canonical kernel of functional gradient descent. We showcase the effectiveness of INT through extensive experiments in INR training across multiple modalities. Specifically, INT saves training time for 1D audio (- 31.63%), 2D images (-38.88%) and 3D shapes (-35.54%), while upkeeping its reconstruction quality. 2. Related Works Implicit neural representation. There has been a recent surge of interest in implicit neural representation (INR) (Park et al., 2019; Atzmon & Lipman, 2020; Gropp et al., 2020; Grattarola & Vandergheynst, 2022; Lindell et al., 2022; Xie et al., 2023; Li et al., 2023; Molaei et al., 2023; Li et al., 2024a;b) due to its ability to represent discrete signals continuously. Such representation typically is achieved by training an overparameterized MLP, which offers various practical benefits, including memory efficiency (Sitzmann et al., 2020b; Xie et al., 2023) and enhanced training efficiency for downstream computer vision tasks (Dupont et al., 2022; Chen et al., 2023). There have been various efforts to the accuracy of MLP representation, such as using sinusoidal activation function (Sitzmann et al., 2020b) and positional encoding with Fourier mapping (Tancik et al., 2020), and to the learning efficiency, such as using a method of dictionary training (Y uce et al., 2022; Wang et al., 2022) and relying on meta-learning framework (Sitzmann et al., 2020a; Tancik et al., 2021; Tack et al., 2023). Differently, we frame INR from a new perspective as a nonparametric teaching problem (Zhang et al., 2023b;a) and aim to improve the training efficiency by adopting the greedy algorithm from the latter. Nonparametric Teaching of Implicit Neural Representations Nonparametric teaching. Machine teaching (Zhu, 2015; Zhu et al., 2018) delves into designing a teaching set that leads to a rapid convergence of the learner towards a target model function. It can be seen as an inverse problem of machine learning, in the sense that machine learning aims to learn a function from a given training set while machine teaching aims to construct the set based on a target function. Its applicability has been proven over various domains, such as computer vision (Wang et al., 2021; Wang & Vasconcelos, 2021), robustness (Alfeld et al., 2017; Ma et al., 2019; Rakhsha et al., 2020), and crowd sourcing (Singla et al., 2014; Zhou et al., 2018). Nonparametric teaching (Zhang et al., 2023b;a) improves upon iterative machine teaching (Liu et al., 2017; 2018) by extending the parameterized family of target functions to a general nonparametric one. 
Nevertheless, there are difficulties in directly applying the findings of nonparametric teaching into broadly practical tasks that involves neural networks (Zhang et al., 2023b;a), which arises due to the gap between nonparametric functions implicitly defined by dense points and overparameterized MLPs. This work bridges this gap using the NTK machinery (Jacot et al., 2018; Lee et al., 2019; Bietti & Mairal, 2019; Bietti et al., 2019; Dou & Liang, 2021), and shows that teaching an overparameterized MLP is consistent with teaching a nonparametric target function (Gao et al., 2019; Geifman et al., 2020; Chen & Xu, 2020). Such insight immediately permits adaptation of tools from the latter to broadly accelerate INR training in the former. 3. Background Notation. To simplify notations, the function being discussed is regarded as scalar-valued without specific emphasis3. Let X Rn denote an n dimensional input (i.e., the coordinate) space and Y R be an output (i.e., the corresponding value) space. Let a d dimensional column vector with entries ai indexed by i Nd be [ai]d = (a1, , ad)T , where Nd := {1, , d}. One may denote it by a for simplicity. Likewise, let {ai}d be a set comprising d elements. Moreover, if the relationship {ai}d {ai}n is given, then {ai}d denotes a subset of {ai}n of size d with the index i Nn. By M(i, ) and M( ,j) we refer to the i-th row and j-th column vector of a matrix M, respectively. Consider K(x, x ) : X X 7 R as a positive definite kernel function. It can be equivalently denoted as K(x, x ) = Kx(x ) = Kx (x), and Kx( ) can be shortened as Kx for brevity. The reproducing kernel Hilbert space (RKHS) H defined by K(x, x ) is the closure of linear span {f : f( ) = Pr i=1 ai K(xi, ), ai R, r N, xi X} equipped with inner product f, g H = P ij aibj K(xi, xj) when 3In nonparametric teaching, the extension from scalar-valued functions to vector-valued ones, which corresponds to multi-output MLPs, is a well-established generalization in Zhang et al., 2023a. j bj Kxj (Liu & Wang, 2016; Arbel et al., 2019; Shen et al., 2020; Zhang et al., 2023b). Given the target signal f : X 7 Y, it can uniquely return y using the corresponding coordinate x as y = f (x ). By means of the Riesz Fr echet representation theorem (Lax, 2002; Sch olkopf et al., 2002; Zhang et al., 2023b), the evaluation functional is defined as below: Definition 1. For a reproducing kernel Hilbert space H with the positive definite kernel Kx H where x X, the evaluation functional Ex( ) : H 7 R is defined with the reproducing property as Ex(f) = f, Kx( ) H = f(x), f H. (1) Furthermore, in the case of a functional F : H 7 R, the Fr echet derivative (Coleman, 2012; Liu, 2017; Shen et al., 2020; Zhang et al., 2023b) of F is presented as follows: Definition 2. (Fr echet derivative in RKHS) The Fr echet derivative of a functional F : H 7 R at f H, which is represented by f F(f), is defined implicitly as F(f + ϵg) = F(f) + f F(f), ϵg H + o(ϵ) for any g H and ϵ R. This derivative is also a function in H. Nonparametric teaching. Zhang et al., 2023b presents the formulation of nonparametric teaching as a functional minimization over teaching sequence D = {(x1, y1), . . . (x T , y T )}, with the collection of all possible teaching sequences denoted as D: D = arg min D D M( ˆf, f ) + λ len(D) s.t. ˆf = A(D). 
(2) In the above formulation, there are three key elements: M which measures the disagreement between ˆf and f (e.g., L2 distance in RKHS M( ˆf , f ) = ˆf f H), len( ) referring to the length of the teaching sequence D (i.e., the iterative teaching dimension introduced in Liu et al., 2017) regularized by a constant λ, and A which denotes the learning algorithm of learners. Typically, A(D) employs empirical risk minimization: ˆf = arg min f H Ex P(x) (L(f(x), f (x))) (3) with a convex (w.r.t. f) loss L, which is optimized by functional gradient descent: f t+1 f t ηG(L, f ; f t, xt), (4) where t = 0, 1, . . . , T denotes the time index, η > 0 signifies the learning rate, and G represents the functional gradient computed at time t. To obtain the functional gradient, which is derived as G(L, f ; f , x) = Ex Nonparametric Teaching of Implicit Neural Representations Zhang et al., 2023b;a introduce the Chain Rule for functional gradients (Gelfand et al., 2000) (refer to Lemma 3) and the derivative of the evaluation functional using Fr echet derivative in RKHS (Coleman, 2012) (cf. Lemma 4). Lemma 3. (Chain rule for functional gradients) For differentiable functions G(F) : R 7 R that depends on functionals F(f) : H 7 R, the formula f G(F(f)) = G(F(f)) F(f) f F(f) (6) commonly refers to the chain rule. Lemma 4. The gradient of an evaluation functional Ex(f) = f(x) : H 7 R is f Ex(f) = Kx. 4. Implicit Neural Teaching We commence by linking the evolution of an MLP that is based on parametric variation with the one that is perceived from a high-level standpoint of function variation. Next, by solving the formulation of MLP evolution as an ordinary differential equation (ODE), we obtain a deeper understanding of this evolution and the underlying cause for its slow convergence. Lastly, we introduce the greedy INT algorithm, which effectively selects examples with steeper gradients at an adaptive batch size and frequency. 4.1. Evolution of an overparameterized MLP The function represented by an overparameterized MLP fθ H with the real-valued parameters θ Rm (where m denotes the number of parameters in the MLP) is of significant interest (Leshno et al., 1993; Gao et al., 2019; Geifman et al., 2020; Chen & Xu, 2020). Typically, such an MLP is optimized in terms of a task-specific loss by the method of gradient descent on its parameters (Ruder, 2016). Given a training set of size N {(xi, yi)|xi X, yi Y}N, the parameter evolves as: i=1 θL(fθt(xi), yi). (7) When governed by an extremely small learning rate η, the update is minute enough over multiple iterations, allowing it to be approximated as a derivative on the time dimension and subsequently transformed into a differential equation: Based on Taylor s theorem, it can obtain the evolution of fθ (a variational representing the variation of fθ caused by changes in θ) as: f(θt+1) f(θt) = θf(θt), θt+1 θt + o(θt+1 θt), (9) where f(θ ) := fθ . Similar to the transformation of parameter evolution, it can be converted into a differential form in a comparable manner: It is important to underscore that the nonlinearity of f(θ) with respect to θ, attributed to the inclusion of nonlinear activation functions, often leads to the remainder o(θt+1 θt) not being equal to zero. By substituting the specific parameter evolution into the first-order approximation term ( ) of the variational, we obtain N [Kθt(xi, )]N + o θt where the symmetric and positive definite Kθt(xi, ) = fθ (cf. detailed derivation in Ap- pendix A). 
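Since K_θt(x_i, ·) is built from parameter Jacobians, its entries can be probed directly with automatic differentiation. The sketch below computes one entry of the empirical (dynamic) NTK, K_θ(x_i, x_j) = ⟨∂f_θ/∂θ|_{x_i}, ∂f_θ/∂θ|_{x_j}⟩, for a small toy MLP; this is the standard empirical-NTK construction rather than code from the paper, and the architecture is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

def flat_grad(model, x):
    """Gradient of the scalar output f_theta(x) w.r.t. all parameters, flattened."""
    out = model(x).squeeze()
    grads = torch.autograd.grad(out, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def empirical_ntk(model, x1, x2):
    """K_theta(x1, x2) = <df/dtheta at x1, df/dtheta at x2> for the current theta."""
    return flat_grad(model, x1) @ flat_grad(model, x2)

# Toy usage: a small ReLU MLP on 2D coordinates.
mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 1))
xa, xb = torch.rand(1, 2), torch.rand(1, 2)
print(empirical_ntk(mlp, xa, xb).item())
```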
In a minor distinction, Jacot et al., 2018 directly apply the chain rule, paying less heed to the convexity of L with respect to θ, resulting in the derivation of the firstorder approximation as the variational. Meanwhile, Kθ is referred to as the NTK and is demonstrated to remain constant during training by constraining the width of the MLP to be infinite (Jacot et al., 2018). In practical terms, it is not necessary for the width of the MLP to be infinitely large, prompting us to explore the dynamic NKT (Appendix A provides an illustration of NTK computation in Figure 7). Let the variational be expressed from a high-level standpoint of function variation. Using functional gradient descent, t = ηG(L, f ; fθt, {xi}N), (12) where the specific functional gradient is G(L, f ; fθt, {xi}N) = 1 N [K(xi, )]N. (13) The asymptotic relationship between NTK and the canonical kernel in functional gradient is presented in Theorem 5 below, whose proof is in Appendix B. Theorem 5. For a convex loss L and a given training set {(xi, yi)|xi X, yi Y}N, the dynamic NTK obtained through gradient descent on the parameters of an overparameterized MLP achieves point-wise convergence to the canonical kernel present in the dual functional gradient with respect to training examples, that is, lim t Kθt(xi, ) = K(xi, ), i NN. (14) It suggests that NTK serves as a dynamic substitute to the canonical kernel used in functional gradient descent, and the Nonparametric Teaching of Implicit Neural Representations evolution of the MLP through parameter gradient descent aligns with that via functional gradient descent (Kuk, 1995; Geifman et al., 2020; Chen & Xu, 2020). This functional insight not only connects the teaching of overparameterized MLPs with that of nonparametric target functions, but also simplifies additional analysis (e.g., a convex functional L retains the convexity regarding fθ in the functional viewpoint, while it is typically nonconvex when considering θ). Through the functional insight and the use of the canonical kernel (Dou & Liang, 2021) instead of NTK in conjunction with the remainder, it facilitates the derivation of sufficient reduction concerning L in Proposition 6, with its proof deferred to Appendix B. Proposition 6. (Sufficient Loss Reduction) Assuming that the convex loss L is Lipschitz smooth with a constant ξ > 0 and the canonical kernel is bounded above by a constant ζ > 0, if learning rate η satisfies η 1/(2ξζ), then there exists a sufficient reduction in L as It shows that the variation of L over time is upper bounded by a negative value, which indicates that it decreases by at least the magnitude of this upper bound over time, thereby ensuring convergence. 4.2. Spectral understanding of the evolution The square loss L(fθ(x), f (x)) = 1 2(fθ(x) f (x))2, commonly used in fitting tasks, is typically used in INR (Sitzmann et al., 2020b; Tancik et al., 2020; Li et al., 2023). Using this specification for illustration, one obtains the variational of fθ from a high-level functional viewpoint: t = ηG(L, f ; fθt, {xi}N) N [fθt(xi) f (xi)]T N [K(xi, )]N .(16) Prior to solving this differential equation, a Lemma of matrix ODE (Godunov, 1997; Hartman, 2002) is in place, with its proof given in Appendix B. Lemma 7. Let A be an n n matrix and α(t) be a timedependent column vector of size n 1. 
The unique solution of the matrix ODE $\frac{\partial \alpha(t)}{\partial t} = A\alpha(t)$ with initial value $\alpha(0)$ is $\alpha(t) = e^{At}\alpha(0)$, where $e^{At} = \sum_{i=0}^{\infty} \frac{t^i A^i}{i!}$.

Using Lemma 7, Equation 16 can be resolved as follows:

$$[f_{\theta_t}(x_i) - f^*(x_i)]_N = e^{-\eta \bar{K} t}\,[f_{\theta_0}(x_i) - f^*(x_i)]_N, \qquad (17)$$

where $\bar{K} = K/N$, and $K$ is a symmetric and positive definite matrix of size $N \times N$ with entries $K(x_i, x_j)$ at the $i$-th row and $j$-th column. The comprehensive solution procedure is available in Appendix A. Due to the symmetric and positive definite nature of $\bar{K}$, it can be orthogonally diagonalized as $\bar{K} = V \Lambda V^T$ based on the spectral theorem (Hall, 2013), where $V = [v_1, \cdots, v_N]$ with column vectors $v_i$ representing the eigenvectors corresponding to eigenvalues $\lambda_i$, and $\Lambda = \mathrm{diag}(\lambda_1, \cdots, \lambda_N)$ is an ordered diagonal matrix ($\lambda_1 \geq \cdots \geq \lambda_N$). Hence, we can express $e^{-\eta \bar{K} t}$ in a spectral decomposition form as:

$$e^{-\eta \bar{K} t} = I - \eta t\, V \Lambda V^T + \tfrac{1}{2!}\eta^2 t^2 (V \Lambda V^T)^2 - \cdots = V e^{-\eta \Lambda t} V^T. \qquad (18)$$

After rearrangement, Equation 17 can be reformulated as:

$$V^T [f_{\theta_t}(x_i) - f^*(x_i)]_N = D_t\, V^T [f_{\theta_0}(x_i) - f^*(x_i)]_N, \qquad (19)$$

with a diagonal matrix $D_t = \mathrm{diag}(e^{-\eta\lambda_1 t}, \cdots, e^{-\eta\lambda_N t})$. To be specific, $[f_{\theta_0}(x_i) - f^*(x_i)]_N$ refers to the difference vector between $f_{\theta_0}$ and $f^*$ at the initial time, evaluated at all training examples, whereas $[f_{\theta_t}(x_i) - f^*(x_i)]_N$ denotes the difference vector at time $t$. Additionally, $V^T [f_{\theta_0}(x_i) - f^*(x_i)]_N$ can be interpreted as the projection of the difference vector onto the eigenvectors (i.e., the principal components) at the beginning, while $V^T [f_{\theta_t}(x_i) - f^*(x_i)]_N$ represents the projection at time $t$. Figure 2 provides a lucid illustration in a 2D function coordinate system.

Based on the above, Equation 19 reveals the connection between the training set and the convergence of $f_{\theta_0}$ towards $f^*$: when evaluated on the training set, the discrepancy between $f_{\theta_0}$ and $f^*$ along the $i$-th component converges to zero exponentially at a rate of $e^{-\eta\lambda_i t}$, which is itself dependent on the training set (Jacot et al., 2018). Meanwhile, this insight uncovers the reason for the sluggish convergence that empirically arises after training for an extended period: small eigenvalues hinder the speed of convergence when continuously training on a static training set. This prompts us to dynamically select examples for fast convergence, as described in the next section.

Figure 2. An illustration of the spectral understanding in a 2D function coordinate system (i.e., RKHS) with the $\{K(x_i, \cdot)\}_2$ basis. The basis can be non-orthogonal if $K(x_i, x_j) \neq 0$ for $i \neq j$. The coordinate of $f_{\theta_t} - f^*$ represents its projection on each axis, which is given by $\langle f_{\theta_t} - f^*, [K(x_i, \cdot)]^T_2 \rangle_{\mathcal{H}} = [f_{\theta_t}(x_i) - f^*(x_i)]^T_2$, and that of $K(x_\star, \cdot)$ is $\langle K(x_\star, \cdot), [K(x_i, \cdot)]^T_2 \rangle_{\mathcal{H}} = [K(x_\star, x_i)]^T_2$, which is stored in the $\star$-th row of $K$. Assuming $\bar{K} = \begin{pmatrix} 0.5 & 0.25 \\ 0.25 & 0.5 \end{pmatrix}$, the eigenvalues and the respective eigenvectors can be computed as $\lambda_1 = 0.75$, $\lambda_2 = 0.25$ and $v_1 = (\tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2})^T$, $v_2 = (\tfrac{\sqrt{2}}{2}, -\tfrac{\sqrt{2}}{2})^T$, respectively. Assuming $[f_{\theta_t}(x_i) - f^*(x_i)]_2$ equals $(1, 0.5)$, its first and second principal component projections are $\tfrac{3\sqrt{2}}{4}$ and $\tfrac{\sqrt{2}}{4}$, respectively. Moreover, the discrepancy between $f_{\theta_t}$ and $f^*$ diminishes at rates of $e^{-\frac{3\eta t}{4}}$ and $e^{-\frac{\eta t}{4}}$ for the first and second principal components, respectively.
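Equations 17–19 can be checked numerically on a toy problem: form the Gram matrix K̄ for an assumed kernel, eigendecompose it, and compare the simulated decay of the projected error against the closed form. The sketch below uses a Gaussian kernel and a discretized version of Equation 16; both choices are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

# Assumed toy setup: N training points of a 1D signal and a Gaussian kernel.
N, eta, T = 64, 0.5, 2000
x = np.linspace(-np.pi, np.pi, N)
f_star = np.sin(x)
f0 = np.zeros(N)                                    # initial function values on the training set

K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # Gram matrix K with entries K(x_i, x_j)
K_bar = K / N
lam, V = np.linalg.eigh(K_bar)                      # eigenvalues/eigenvectors of K_bar

# Simulate functional gradient descent restricted to the training set (Eq. 16),
# then compare against the closed-form decay of Eq. 19.
err = f0 - f_star
for _ in range(T):
    err = err - eta * K_bar @ err                   # discrete analogue of d(err)/dt = -eta * K_bar @ err

proj_simulated = V.T @ err
proj_closed_form = np.exp(-eta * lam * T) * (V.T @ (f0 - f_star))
print(np.abs(proj_simulated - proj_closed_form).max())   # small: the two agree for small eta * lambda_i
```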
4.3. INT algorithm

Intending to make the gradient steeper, the greedy functional teaching algorithm in nonparametric teaching chooses examples by greedily maximizing the gradient norm:

$$\{x_i\}^k = \arg\max_{\{x_i\}^k \subseteq \{x_i\}^N} \left\| \mathcal{G}(\mathcal{L}, f^*; f_\theta, \{x_i\}^k) \right\|_{\mathcal{H}}, \qquad (20)$$

where $\mathcal{G}(\mathcal{L}, f^*; f_\theta, \{x_i\}^k) = \frac{1}{k}\big[\frac{\partial \mathcal{L}}{\partial f}\big|_{f_\theta, x_i}\big]^T_k [K(x_i, \cdot)]_k$ and $k \leq N$ denotes the size of the selected training set. Drawing from the consistency between an MLP and a nonparametric learner, as explored in Section 4.1 (Geifman et al., 2020; Chen & Xu, 2020), we present the INT algorithm, which also aims to increase the steepness of gradients. Differently, INT circumvents the potentially cumbersome computation of $\|K(x_i, \cdot)\|_{\mathcal{H}}$ in $\|\mathcal{G}\|_{\mathcal{H}}$ by adopting a projection view. To be specific, for $i \in \mathbb{N}_N$, $\frac{\partial \mathcal{L}}{\partial f}\big|_{f_\theta, x_i}$ can be seen as the component of $\frac{\partial \mathcal{L}}{\partial f}\big|_{f_\theta}$ projected onto the corresponding element of the basis $\{K(x_i, \cdot)\}_N$. Hence, the gradient is the sum of updates along $\{K(x_i, \cdot)\}_k$, the elements associated with the selected examples, each weighted by $\frac{\partial \mathcal{L}}{\partial f}\big|_{f_\theta, x_i}$ (Wright, 2015). Consequently, steepening the gradient simply requires maximizing the coefficients $\frac{\partial \mathcal{L}}{\partial f}\big|_{f_\theta, x_i}$, bypassing the need to calculate $\|K(x_i, \cdot)\|_{\mathcal{H}}$. This indicates that selecting examples that enlarge $\frac{\partial \mathcal{L}}{\partial f}\big|_{f_\theta, x}$, i.e., those corresponding to larger components of $\frac{\partial \mathcal{L}}{\partial f}\big|_{f_\theta}$, is sufficient to increase the gradient, which means

$$\{x_i\}^k = \arg\max_{\{x_i\}^k \subseteq \{x_i\}^N} \left\| \Big[\frac{\partial \mathcal{L}}{\partial f}\Big|_{f_\theta, x_i}\Big]_k \right\|_2. \qquad (21)$$

From a functional perspective, when dealing with a convex loss functional $\mathcal{L}$, the norm of the partial derivative of $\mathcal{L}$ with respect to $f$ at $f_\theta$, denoted $\|\frac{\partial \mathcal{L}}{\partial f}\big|_{f_\theta}\|_{\mathcal{H}}$, is positively correlated with $\|f_\theta - f^*\|_{\mathcal{H}}$; as $f_\theta$ gradually approaches $f^*$, $\|\frac{\partial \mathcal{L}}{\partial f}\big|_{f_\theta}\|_{\mathcal{H}}$ decreases (Boyd et al., 2004; Coleman, 2012). This relationship becomes particularly significant when $\mathcal{L}$ is strongly convex with a larger strong convexity constant (Kakade & Tewari, 2008; Arjevani et al., 2016). Based on these findings, the INT algorithm selects examples by

$$\{x_i\}^k = \arg\max_{\{x_i\}^k \subseteq \{x_i\}^N} \left\| [f_\theta(x_i) - f^*(x_i)]_k \right\|_2. \qquad (22)$$

Pseudo code is given in Algorithm 1. When considering the square loss commonly employed in INR, the aforementioned correlation becomes $\|\frac{\partial \mathcal{L}}{\partial f}\big|_{f_\theta}\|_{\mathcal{H}} = \|f_\theta - f^*\|_{\mathcal{H}}$. Besides, it is intriguing that the INT algorithm aligns with the applied variant of the greedy functional teaching algorithm, wherein $\|K(x_i, \cdot)\|_{\mathcal{H}}$ is required to be uniform, e.g., $\|K(x_i, \cdot)\|_{\mathcal{H}} = 1$ for all $x_i$ (Zhang et al., 2023b). The convergence analysis of the INT algorithm also aligns with that of the greedy functional teaching algorithm obtained in Zhang et al., 2023b;a.

Algorithm 1 Implicit Neural Teaching
  Input: Target signal $f^*$, initial MLP $f_{\theta_0}$, size $k \leq N$ of the selected training set, small constant $\epsilon > 0$, and maximal iteration number $T$.
  Set $f_{\theta_t} \leftarrow f_{\theta_0}$, $t = 0$.
  while $t \leq T$ and $\|[f_{\theta_t}(x_i) - f^*(x_i)]_N\|_2 \geq \epsilon$ do
    The teacher selects $k$ teaching examples, i.e., those corresponding to the $k$ largest $|f_{\theta_t}(x_i) - f^*(x_i)|$:
      $\{x_i\}^k = \arg\max_{\{x_i\}^k \subseteq \{x_i\}^N} \|[f_{\theta_t}(x_i) - f^*(x_i)]_k\|_2$.
    Provide $\{x_i\}^k$ to the MLP learner.
    The learner updates $f_{\theta_t}$ based on the received $\{x_i\}^k$ via parameter-based gradient descent:
      $\theta_{t+1} \leftarrow \theta_t - \eta \sum_{x_i \in \{x_i\}^k} \nabla_\theta \mathcal{L}(f_{\theta_t}(x_i), f^*(x_i))$.
    Set $t \leftarrow t + 1$.
  end while

With the spectral analysis in Section 4.2, a deeper understanding of INT follows. First, we define the entire space as the one spanned by the basis corresponding to the whole training set, $\{K(x_i, \cdot)\}_N$. Similarly, $\{K(x_i, \cdot)\}_k \subseteq \{K(x_i, \cdot)\}_N$ spans the subspace associated with the selected examples. The eigenvalue of the transformation from the entire space to the subspace of concern (i.e., the one spanned by $\{K(x_i, \cdot)\}_k$ associated with the selected examples) is one, while it is zero for the subspace without interest (Watanabe & Katagiri, 1995; Burgess & Van Veen, 1996). The spectral understanding indicates that $f_{\theta_t}$ approaches $f^*$ swiftly at the early stage within the current subspace, owing to the large eigenvalues (Jacot et al., 2018). Hence, the INT algorithm can be interpreted as dynamically altering the subspace of interest so as to fully exploit the period when $f_{\theta_t}$ approaches $f^*$ rapidly. Meanwhile, by selecting examples based on Equation 22, the subspace of interest is precisely the one where $f_{\theta_t}$ remains significantly distant from $f^*$. In a nutshell, by dynamically altering the subspace of interest, the INT algorithm not only maximizes the benefits of the fast convergence stage but also updates $f_{\theta_t}$ in the most urgent direction towards $f^*$, thereby saving computational resources compared to training on the entire dataset.
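A compact PyTorch-style rendering of Algorithm 1 for signal fitting might look as follows. The optimizer choice, learning rate, and stopping threshold are illustrative placeholders (the analysis above assumes plain gradient descent), and `model`, `coords`, and `target` are hypothetical names.

```python
import torch

def train_int(model, coords, target, k, lr=1e-4, max_iters=5000, eps=1e-4):
    """Sketch of Algorithm 1: greedy top-k selection by residual, followed by a
    parameter-based gradient step on the selected examples only."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # plain gradient descent, no momentum
    for t in range(max_iters):
        with torch.no_grad():
            residual = (model(coords) - target).squeeze(-1)  # f_theta(x_i) - f*(x_i) on the full set
        if residual.norm() < eps:
            break
        _, idx = torch.topk(residual.abs(), k)               # teacher: k largest |f_theta(x_i) - f*(x_i)|
        pred = model(coords[idx])                            # learner: gradient step on selected fragments
        loss = 0.5 * ((pred - target[idx]) ** 2).sum()       # square loss, as used in INR fitting
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```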
5. Experiments and Results

We begin by using a synthetic signal to empirically show the evolution consistency between parameter-based gradient descent (PGD) and functional gradient descent (FGD). Next, we assess the behavior of INT on a toy image-fitting instance and explore diverse algorithms with different INT frequencies and ratios. Lastly, we validate the efficiency of INT on multiple modalities such as audio (-31.63% training time), images (-38.88%), and 3D shapes (-35.54%), while maintaining reconstruction quality. Detailed settings are given in Appendix C.

Synthetic 1D signal. For an intuitive visualization, we utilize a synthetic 1D signal and present the training dynamics of f obtained through both PGD and FGD. Specifically, the signal (i.e., the target function) is f*(x) = sin(x), where x ∈ {x_i}_100 is uniformly distributed in the range [−π, π]. The function corresponding to PGD is obtained by inputting {x_i}_100 into the Fourier Feature network (FFN) trained using PGD, while the function corresponding to FGD is represented by dense points of the nonparametric function updated using FGD. As depicted in Figure 3, f* is well fitted by both PGD and FGD. Moreover, the function obtained through PGD closely mirrors the one obtained through FGD. This observation indicates the consistency in the evolution of the function under both PGD and FGD, suggesting that teaching an overparameterized MLP aligns with teaching a nonparametric target function.

Figure 3. Training dynamics of f using PGD and FGD at iterations 0, 100, 200, 300, 500, and 1000. Apparently, f_PGD closely follows f_FGD, empirically showing the evolution consistency between PGD training and FGD training.

Toy 2D Cameraman fitting. In practice, SIREN (Sitzmann et al., 2020b) is commonly used to encode various modalities of signal such as images. Here, we test the effectiveness of INT in a real-life setting where a SIREN model is used to fit the Cameraman image (Van der Walt et al., 2014).
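For reference, a SIREN-style coordinate network is an MLP whose hidden layers apply a sine activation, x ↦ sin(ω0(Wx + b)). The sketch below is a minimal illustration using the commonly cited default ω0 = 30; it omits SIREN's specialized weight initialization and is not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by a sine activation: x -> sin(omega_0 * (Wx + b))."""
    def __init__(self, in_dim, out_dim, omega_0=30.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.omega_0 = omega_0

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

class Siren(nn.Module):
    """Coordinate MLP: 2D pixel coordinates in, grayscale intensity out."""
    def __init__(self, hidden=256, layers=5):
        super().__init__()
        blocks = [SineLayer(2, hidden)] + [SineLayer(hidden, hidden) for _ in range(layers - 1)]
        self.net = nn.Sequential(*blocks, nn.Linear(hidden, 1))

    def forward(self, coords):
        return self.net(coords)
```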
We compare the reconstruction quality of SIREN trained with INT at a 20% selection rate (i.e., the selected training set is 20% of the entire set comprising all pixels) against SIREN trained without INT (i.e., using all pixels) and SIREN trained with random sampling of 20% of the pixels at each iteration. INT training results in a higher PSNR and SSIM but exhibits visible artifacts in the background. As shown in Figure 5, which presents the selected pixels throughout training, we hypothesize that this is due to the over-emphasis of INT on boundary pixels, where color changes are usually abrupt and hence loss values are larger, leaving the background pixels under-optimized. On the contrary, using a higher selection rate permits INT to select more examples on the flat surfaces (background), which serves as a regularizer that alleviates the artifacts. Thus, we train an additional SIREN with a progressively increasing INT selection rate from 20% to nearly 100%, which achieves superior reconstruction quality without the artifacts.

Figure 4. Reconstruction quality of SIREN on the Cameraman image: (a) ground truth (GT); (b) SIREN trained without (w/o) INT using all pixels (PSNR 26.78 dB, SSIM 0.7234); (c) trained w/o INT using 20% randomly selected pixels (26.78 dB, 0.7236); (d) trained using INT with a 20% selection rate (28.86 dB, 0.7364); (e) trained using progressive INT, i.e., increasing the selection rate progressively from 20% to 100% (28.73 dB, 0.7756).

Figure 5. Progression of INT-selected pixels (marked as black) at corresponding iterations when training with INT 20% (top) and 40% (bottom).

INT with different frequencies and ratios. While using INT can train an INR with fewer examples without sacrificing reconstruction quality, it should be noted that each selection step requires inferencing all data through the network to rank the differences between the outputs and f* from higher to lower, which is rather time-consuming. This counters the reduction in training time that would otherwise be brought by the reduction in training examples. Consequently, following the observation that increasing the selection ratio leads to increasing overlaps between consecutive INT selections, we devise several INT algorithms that space out the INT frequency (i.e., selection frequency) and vary the INT ratio (i.e., the size of the selected training set) dynamically throughout training. Namely, for the selection ratio, we test a constant ratio, step-wise increments of the ratio at fixed intervals, and gradually increasing/decreasing the ratio in a cosine-annealing manner. For the sampling interval, we test dense sampling at every iteration, and step-wise increments/decrements of the sampling interval between 1 and 100 steps. Figure 6 visualizes the various algorithms we tested against each other.

Figure 6. Selection ratio and interval of the various INT algorithms. (Left) Red: decremental; Blue: incremental; Yellow: dense. (Right) Red: R-Cosine; Blue: Cosine; Yellow: incremental.

| Ratio | Interval | Time (s) | PSNR (dB) | SSIM |
|---|---|---|---|---|
| - | - | 345.22 | 35.95 ± 1.89 | 0.935 ± 0.03 |
| Cosine | Dense | 337.00 | 36.39 ± 2.40 | 0.941 ± 0.02 |
| Cosine | Incremental | 227.84 | 36.61 ± 2.55 | 0.942 ± 0.02 |
| R-Cosine | Dense | 346.64 | 35.18 ± 1.44 | 0.920 ± 0.02 |
| R-Cosine | Decremental | 225.30 | 33.56 ± 2.53 | 0.894 ± 0.03 |
| Incremental | Dense | 468.01 | 36.84 ± 2.70 | 0.946 ± 0.02 |
| Incremental | Incremental | 211.04 | 37.04 ± 2.51 | 0.946 ± 0.02 |

Table 1. Performance and training time for different INT strategies on the Kodak dataset. The first line ("-" in both Ratio and Interval) corresponds to training without INT.

In particular, as presented in Table 1, our experiment on a subset of 8 representative images from the Kodak dataset (Eastman Kodak Company, 1999) shows that combining an incrementally increasing sampling ratio with an incrementally increasing sampling interval leads to the best performance in terms of both training speed and reconstruction quality.
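One way to realize this best-performing strategy (step-wise increments of both the selection ratio and the re-selection interval) is a pair of simple schedules, sketched below. The breakpoints, start/end values, and the helper `top_k_by_residual` are illustrative assumptions, not the exact values or code used in the experiments.

```python
def int_schedule(t, max_iters, r_start=0.2, r_end=1.0, i_start=1, i_end=100, steps=8):
    """Step-wise schedules: both the INT selection ratio and the re-selection
    interval grow from their start to end values over `steps` equal phases."""
    phase = min(t * steps // max_iters, steps - 1)
    ratio = r_start + (r_end - r_start) * phase / (steps - 1)
    interval = round(i_start + (i_end - i_start) * phase / (steps - 1))
    return ratio, max(1, interval)

# Usage inside a training loop (sketch); top_k_by_residual is a hypothetical helper
# that ranks coordinates by |f_theta(x_i) - f*(x_i)| as in Algorithm 1.
# ratio, interval = int_schedule(t, max_iters)
# if t % interval == 0:
#     selected = top_k_by_residual(model, coords, target, k=int(ratio * len(coords)))
```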
We also want to highlight the severe degradation in reconstruction quality that comes with training an INR via a decremental sampling ratio and intervals (compare rows 4 and 5 in Table 1). We attribute this to the tendency of INRs to progressively learn signal components from lower to higher frequencies, as shown in Rahaman et al., 2019, while the decremental strategy goes against it. Specifically, at the beginning of training, the MLP may not be able to learn all the information provided by densely sampled examples; towards the end of training, when the MLP is trying to fit the remaining details of the signal, the decremental INT algorithm provides sparser and sparser samples that are not updated frequently. This serves as a counter-example that explains the effectiveness of incremental INT for training general INRs, as we shall see in the following section.

INT on multiple real-world modalities. To demonstrate the practicality of INT in real-world applications, we conduct experiments on signal-fitting tasks across datasets of various modalities, including 1D audio (Librispeech (Panayotov et al., 2015)), 2D images (Kodak (Eastman Kodak Company, 1999)), megapixel images (Pluto (NASA, 2018)), and 3D shapes (Stanford 3D Scanning Repository (Stanford Computer Graphics Laboratory, 2007)). We selected the optimal strategy from Table 1 (i.e., step-wise increments of both sampling ratio and interval) as the default INT setting and evaluated it against the baseline without INT. The implementation details of the experiment for each modality can be found in Appendix C.

| Modality | Time (s) | PSNR (dB) / IoU (%) |
|---|---|---|
| Audio (w/o INT) | 23.05 | 48.38 ± 3.50 |
| Image (w/o INT) | 345.22 | 36.09 ± 2.51 |
| Megapixel (w/o INT) | 16.78K | 31.82 |
| 3D Shape (w/o INT) | 144.58 | 97.07 ± 0.84 |
| Audio (INT) | 15.76 (-31.63%) | 48.15 ± 3.39 |
| Image (INT) | 211.04 (-38.88%) | 36.97 ± 3.59 |
| Megapixel (INT) | 11.87K (-29.26%) | 33.01 |
| 3D Shape (INT) | 93.19 (-35.54%) | 96.68 ± 0.83 |

Table 2. Signal-fitting results for different data modalities, without INT (top four rows) and with INT (bottom four rows). The encoding time is measured excluding data I/O latency.

As shown in Table 2, it is evident that INT can effectively speed up encoding for all modalities, ranging from 1.41× to 1.64×, with minimal degradation in performance (< 1 dB PSNR or < 1% IoU). In the case of 2D images, the PSNR with INT even improves from 36.09 dB to 36.97 dB with a nearly 40% decrease in training time. We also highlight the results for fitting 3D shapes and the megapixel Pluto image (8192 × 8192), which instead require mini-batch INT (Zhang et al., 2023a) due to hardware constraints. That is, for each optimization iteration, we randomly sample a subset of points from the training set and run the INT algorithm on it to train our model, making sure that all pixels in the image are sampled in each epoch. This is analogous to combining stochastic gradient descent with INT, and it demonstrates the robustness of the INT algorithm in improving training efficiency.

6. Concluding Remarks and Future Work

This paper has proposed Implicit Neural Teaching (INT), a novel paradigm that enhances the learning efficiency of implicit neural representation (INR) through nonparametric machine teaching. Using an overparameterized multilayer perceptron (MLP) to fit a given signal, INT reduces the wall-clock time for learning INR by over 30%, as demonstrated by extensive experiments.
Moreover, INT establishes a theoretically rich connection between the evolution of an MLP using parameter-based gradient descent and that of a function using functional gradient descent in nonparametric teaching. This bridge between nonparametric teaching and MLP training readily expands the applicability of nonparametric teaching in the realm of deep learning. Moving forward, it could be more intriguing to explore other practical utilities related to INT towards data efficiency (Henaff, 2020; Touvron et al., 2021; Arandjelovi c & Zisserman, 2021; M uller et al., 2022). This will involve developing a deeper theoretical understanding of INT, with the neural tangent kernel playing a crucial role. Additionally, exploring more efficient example selection algorithms tailored to specific tasks, such as fine-tuning and prompt training in large language models, holds promise for future advancements. Impact Statement Implicit neural representation (INR) has emerged as a promising paradigm in vision data representation, view synthesis and signal compression, domains with significant societal impacts, for its ability of representing discrete signals continuously. This work focuses on enhancing the training efficiency of INR via a novel nonparametric teaching perspective, which can bring positive impacts to INR-related fields and society. Meanwhile, this work connects nonparametric teaching to MLP training, which expands the applicability of nonparametric teaching towards deep learning. Thus, it also makes positive contributions to the community of machine teaching. Lastly, we are confident that the proposed framework, Implicit Neural Teaching (INT), is highly relevant for enhancing data efficiency and has broader applicability to machine learning tasks, especially in scenarios where the target is known and overfitting is desired, as exhibited in INRs and nonparametric teaching. Acknowledgements We thank all anonymous reviewers for their constructive feedback to improve our paper. This work was supported by the Theme-based Research Scheme (TRS) project T45- 701/22-R, and in part by ACCESS AI Chip Center for Emerging Smart Systems, sponsored by Inno HK funding, Hong Kong SAR. Alfeld, S., Zhu, X., and Barford, P. Explicit defense actions against test-set attacks. In AAAI, 2017. Arandjelovi c, R. and Zisserman, A. Nerf in detail: Learning to sample for view synthesis. ar Xiv preprint ar Xiv:2106.05264, 2021. Arbel, M., Korba, A., Salim, A., and Gretton, A. Maximum mean discrepancy gradient flow. In Neur IPS, 2019. Arjevani, Y., Shalev-Shwartz, S., and Shamir, O. On lower and upper bounds in smooth and strongly convex optimization. The Journal of Machine Learning Research, 17 (1):4303 4353, 2016. Atzmon, M. and Lipman, Y. Sal: Sign agnostic learning of shapes from raw data. In CVPR, 2020. Bietti, A. and Mairal, J. On the inductive bias of neural tangent kernels. In Neur IPS, 2019. Bietti, A., Mialon, G., Chen, D., and Mairal, J. A kernel perspective for regularizing deep neural networks. In ICML, 2019. Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex optimization. Cambridge university press, 2004. Burgess, K. A. and Van Veen, B. D. Subspace-based adaptive generalized likelihood ratio detection. IEEE Transactions on Signal Processing, 44(4):912 927, 1996. Chen, H., Yang, H., Fitzmeyer, S., and Hao, C. Rapidinr: Storage efficient cpu-free dnn training using implicit neural representation. In ICCAD, 2023. Chen, L. and Xu, S. Deep neural tangent kernel and laplace kernel have the same rkhs. In ICLR, 2020. 
Coleman, R. Calculus on normed vector spaces. Springer Science & Business Media, 2012. Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to algorithms. MIT press, 2022. Dou, X. and Liang, T. Training neural networks as learning data-adaptive kernels: Provable representation and approximation benefits. Journal of the American Statistical Association, 116(535):1507 1520, 2021. Dupont, E., Goli nski, A., Alizadeh, M., Teh, Y. W., and Doucet, A. Coin: Compression with implicit neural representations. In ICLR Neural Compression Workshop, 2021. Nonparametric Teaching of Implicit Neural Representations Dupont, E., Kim, H., Eslami, S., Rezende, D., and Rosenbaum, D. From data to functa: Your data point is a function and you can treat it like one. In ICML, 2022. Eastman Kodak Company. Kodak lossless true color image suite. http://r0k.us/graphics/kodak/, 1999. [Accessed 14-08-2023]. Gao, R., Cai, T., Li, H., Hsieh, C.-J., Wang, L., and Lee, J. D. Convergence of adversarial training in overparametrized neural networks. In Neur IPS, 2019. Geifman, A., Yadav, A., Kasten, Y., Galun, M., Jacobs, D., and Ronen, B. On the similarity between the laplace and neural tangent kernels. In Neur IPS, 2020. Gelfand, I. M., Silverman, R. A., et al. Calculus of variations. Courier Corporation, 2000. Godunov, S. K. Ordinary differential equations with constant coefficient, volume 169. American Mathematical Soc., 1997. Grattarola, D. and Vandergheynst, P. Generalised implicit neural representations. In Neur IPS, 2022. Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. Automated curriculum learning for neural networks. In ICML, 2017. Gropp, A., Yariv, L., Haim, N., Atzmon, M., and Lipman, Y. Implicit geometric regularization for learning shapes. In ICML, 2020. Hall, B. C. Quantum theory for mathematicians. Springer, 2013. Hartman, P. Ordinary differential equations. SIAM, 2002. Henaff, O. Data-efficient image recognition with contrastive predictive coding. In ICML, 2020. Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Neur IPS, 2018. Kakade, S. M. and Tewari, A. On the generalization ability of online strongly convex programming algorithms. In Neur IPS, 2008. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015. Kuk, A. Y. Asymptotically unbiased estimation in generalized linear models with random effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 57(2):395 407, 1995. Lax, P. D. Functional analysis, volume 55. John Wiley & Sons, 2002. Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. In Neur IPS, 2019. Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861 867, 1993. Li, J. C. L., Liu, C., Huang, B., and Wong, N. Learning spatially collaged fourier bases for implicit neural representation. In AAAI, 2024a. Li, J. C. L., Luo, S. T. S., Xu, L., and Wong, N. Asmr: Activation-sharing multi-resolution coordinate networks for efficient inference. In ICLR, 2024b. Li, Z., Wang, H., and Meng, D. Regularize implicit neural representation by itself. In CVPR, 2023. Lindell, D. B., Van Veen, D., Park, J. J., and Wetzstein, G. Bacon: Band-limited coordinate networks for multiscale scene representation. 
In CVPR, 2022. Liu, Q. Stein variational gradient descent as gradient flow. In Neur IPS, 2017. Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Neur IPS, 2016. Liu, W., Dai, B., Humayun, A., Tay, C., Yu, C., Smith, L. B., Rehg, J. M., and Song, L. Iterative machine teaching. In ICML, 2017. Liu, W., Dai, B., Li, X., Liu, Z., Rehg, J., and Song, L. Towards black-box iterative machine teaching. In ICML, 2018. Loshchilov, I. and Hutter, F. Online batch selection for faster training of neural networks. In ICLR Workshop, 2015. Ma, Y., Zhang, X., Sun, W., and Zhu, J. Policy poisoning in batch reinforcement learning and control. In Neur IPS, 2019. Martin-Brualla, R., Radwan, N., Sajjadi, M. S., Barron, J. T., Dosovitskiy, A., and Duckworth, D. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In CVPR, 2021. Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2021. Nonparametric Teaching of Implicit Neural Representations Mindermann, S., Brauner, J. M., Razzak, M. T., Sharma, M., Kirsch, A., Xu, W., H oltgen, B., Gomez, A. N., Morisot, A., Farquhar, S., et al. Prioritized training on points that are learnable, worth learning, and not yet learnt. In ICML, 2022. Molaei, A., Aminimehr, A., Tavakoli, A., Kazerouni, A., Azad, B., Azad, R., and Merhof, D. Implicit neural representation in medical imaging: A comparative survey. In ICCV, 2023. M uller, T., Evans, A., Schied, C., and Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG), 41(4): 1 15, 2022. NASA. True colors of pluto. https: //solarsystem.nasa.gov/resources/ 933/true-colors-of-pluto/?category= planets/dwarf-planets_pluto, 2018. Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. Librispeech: an asr corpus based on public domain audio books. In ICASSP, 2015. Park, J. J., Florence, P., Straub, J., Newcombe, R., and Lovegrove, S. Deepsdf: Learning continuous signed distance functions for shape representation. In CVPR, 2019. Pistilli, F., Valsesia, D., Fracastoro, G., and Magli, E. Signal compression via neural implicit representations. In ICASSP, 2022. Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., and Courville, A. On the spectral bias of neural networks. In ICML, 2019. Rakhsha, A., Radanovic, G., Devidze, R., Zhu, X., and Singla, A. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. In ICML, 2020. Reddy, P., Zhang, Z., Wang, Z., Fisher, M., Jin, H., and Mitra, N. A multi-implicit neural representation for fonts. In Neur IPS, 2021. Ruder, S. An overview of gradient descent optimization algorithms. ar Xiv preprint ar Xiv:1609.04747, 2016. Sch olkopf, B., Smola, A. J., Bach, F., et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002. Schwarz, J. R., Tack, J., Teh, Y. W., Lee, J., and Shin, J. Modality-agnostic variational compression of implicit neural representations. In ICML, 2023. Shen, Z., Wang, Z., Ribeiro, A., and Hassani, H. Sinkhorn barycenter via functional gradient descent. In Neur IPS, 2020. Singla, A., Bogunovic, I., Bart ok, G., Karbasi, A., and Krause, A. Near-optimally teaching the crowd to classify. In ICML, 2014. Sitzmann, V., Chan, E., Tucker, R., Snavely, N., and Wetzstein, G. 
Metasdf: Meta-learning signed distance functions. In Neur IPS, 2020a. Sitzmann, V., Martel, J., Bergman, A., Lindell, D., and Wetzstein, G. Implicit neural representations with periodic activation functions. In Neur IPS, 2020b. Stanford Computer Graphics Laboratory. The stanford 3d scanning repository. https://graphics. stanford.edu/data/3Dscanrep/, 2007. Str umpler, Y., Postels, J., Yang, R., Gool, L. V., and Tombari, F. Implicit neural representations for image compression. In ECCV, 2022. Tack, J., Kim, S., Yu, S., Lee, J., Shin, J., and Schwarz, J. R. Learning large-scale neural fields via context pruned meta-learning. In Neur IPS, 2023. Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., and Ng, R. Fourier features let networks learn high frequency functions in low dimensional domains. In Neur IPS, 2020. Tancik, M., Mildenhall, B., Wang, T., Schmidt, D., Srinivasan, P. P., Barron, J. T., and Ng, R. Learned initializations for optimizing coordinate-based neural representations. In CVPR, 2021. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J egou, H. Training data-efficient image transformers & distillation through attention. In ICML, 2021. Van der Walt, S., Sch onberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., Gouillart, E., and Yu, T. scikit-image: image processing in python. Peer J, 2:e453, 2014. Wang, P. and Vasconcelos, N. A machine teaching framework for scalable recognition. In ICCV, 2021. Wang, P., Nagrecha, K., and Vasconcelos, N. Gradientbased algorithms for machine teaching. In CVPR, 2021. Wang, P., Fan, Z., Chen, T., and Wang, Z. Neural implicit dictionary learning via mixture-of-expert training. In ICML, 2022. Nonparametric Teaching of Implicit Neural Representations Watanabe, H. and Katagiri, S. Discriminative subspace method for minimum error pattern recognition. In IEEE Workshop on Neural Networks for Signal Processing, 1995. Wright, S. J. Coordinate descent algorithms. Mathematical programming, 151(1):3 34, 2015. Xie, S., Zhu, H., Liu, Z., Zhang, Q., Zhou, Y., Cao, X., and Ma, Z. Diner: Disorder-invariant implicit neural representation. In CVPR, 2023. Y uce, G., Ortiz-Jim enez, G., Besbinar, B., and Frossard, P. A structured dictionary perspective on implicit neural representations. In CVPR, 2022. Zhang, C., Cao, X., Liu, W., Tsang, I., and Kwok, J. Nonparametric teaching for multiple learners. In Neur IPS, 2023a. Zhang, C., Cao, X., Liu, W., Tsang, I., and Kwok, J. Nonparametric iterative machine teaching. In ICML, 2023b. Zhou, Y., Nelakurthi, A. R., and He, J. Unlearn what you have learned: Adaptive crowd teaching with exponentially decayed memory learners. In SIGKDD, 2018. Zhu, X. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, 2015. Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. An overview of machine teaching. ar Xiv preprint ar Xiv:1801.05927, 2018. A. Additional Discussions Neural Tangent Kernel (NTK) By substituting the parameter evolution into the first-order approximation term ( ) of Equation 10, it obtains N [Kθt(xi, )]N , (24) which derives Equation 11 as N [Kθt(xi, )]N + o θt and Kθt is referred to as neural tangent kernel (NTK) (Jacot et al., 2018). Figure 7 provides a visual representation that explains the calculation process of NTK in a clear and understandable way. 
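Read together with the main text, Equations 24 and 25 appear to be the usual first-order expansion of the MLP update; under that reading (a reconstruction of the garbled typesetting, not a verbatim quote), they take the form

$$
\frac{\partial f_{\theta_t}}{\partial t} \;\approx\; -\frac{\eta}{N} \sum_{i=1}^{N} \frac{\partial \mathcal{L}}{\partial f}\Big|_{f_{\theta_t}, x_i} K_{\theta_t}(x_i, \cdot),
\qquad
K_{\theta_t}(x_i, \cdot) \;=\; \Big\langle \frac{\partial f_\theta}{\partial \theta}\Big|_{\cdot,\,\theta_t},\; \frac{\partial f_\theta}{\partial \theta}\Big|_{x_i,\,\theta_t} \Big\rangle,
$$

up to the remainder term $o(\theta_{t+1} - \theta_t)$ that the main text keeps explicit.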
Informally speaking, studying how a model behaves by focusing on the model itself rather than its parameters typically entails the use of kernel functions. It can be observed that the quantity fθ θ ,θt, present in Kθt(xi, ) = fθ , represents the partial derivative of the MLP with respect to its parameters, determined by both the structure and specific θt, but independent of the input. On the other hand, fθ θ |xi,θtoriginates from the parameter evolution, which relies not only on the MLP structure and specific θt, but also on the input example. Assuming the input of fθ θ |xi,θt is not known, the NTK becomes Kθt( , ). On the other hand, if we specify xj as the input for fθ θ | ,θt, NTK becomes a scalar as Kθt(xi, xj) = fθ θ |xj,θt, fθ θ |xi,θt . This indicates that the NTK is a bivariate function represented by X X 7 R, and this form aligns with the kernel used in functional gradient descent. By feeding the input example xi, one coordinate of Kθt is fixed, causing the MLP to update along Kθt(xi, ) based on the magnitude of fθ θ |xi,θt , which is consistent with the underlying mechanism of functional gradient descent. In a nutshell, NTK and the canonical kernel not only maintain consistency in their mathematical representation, but also exhibit alignment in how they influence the evolution of the corresponding MLP. Additionally, Theorem 5 demonstrates the asymptotic relationship between the NTK and the canonical kernel used in functional gradient descent. Jacot et al., 2018 introduce kernel gradient descent, which can be considered as an extension of parameter-based gradient descent. Although kernel gradient descent appears to bear resemblance to functional gradient descent (Zhang et al., 2023b;a), they fundamentally differ in terms of specific details. In kernel gradient descent, the kernel gradient is derived by incorporating a kernel weighting (Jacot et al., 2018), where the NTK serves as the weight to modify the conventional gradient of a real-valued loss L(f(x), y) with respect to f(x), which is limited to the training set, thus allowing the weighted gradient (kernel gradient) to be extrapolated to values beyond the training set. Differently, functional gradient descent takes a higher-level perspective on the evolution of the MLP in function space (Zhang et al., 2023b;a). Specifically, f(x) = Ex(f) represents the result of evaluating the function f at the example x, which is defined as the inner product in RKHS between the function f and K(x, ) (the corresponding kernel with one argument x) based on the reproducing property. By applying the functional chain rule and Fr echet derivative, the functional gradient is derived accordingly. Nonparametric Teaching of Implicit Neural Representations 𝑓! 𝜃"," 𝑓! 𝜃$," NTK = [ !"# ',%,' ]1x1 = [ No. of layers: 𝐿 No. of weights (edges) per layer: 𝑃! 𝑓! 𝜃$,$ 𝑓! 𝜃",+ Figure 7. Graphical illustration of NTK computation. Due to the discrete nature of computer operations, functional gradient descent relies on dense pairwise points {(xi, K(x , xi))}n for representing the kernel K(x , ), and in order to express f, it is necessary to store all Kxis as dense points, resulting in significant storage requirements. This issue mirrors the challenge encountered when storing discrete signals, and the solution lies in INR, employing overparameterized MLPs to continuously represent functions, eliminating the need for storing dense points by utilizing a relatively small-sized parameter storage. 
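The preceding claim, that a parameter step on example x_i moves the MLP output along K_θt(x_i, ·) with a magnitude set by the loss derivative at x_i, can be checked numerically. The sketch below (an assumed toy model with the square loss and a deliberately tiny learning rate so that the first-order term dominates) compares the measured change of the output at a probe coordinate with the NTK prediction −η(f_θ(x_i) − y_i)K_θ(x_i, x).

```python
import torch
import torch.nn as nn

def flat_grad(model, x):
    """Flattened gradient of the scalar output f_theta(x) w.r.t. all parameters."""
    out = model(x).squeeze()
    return torch.cat([g.reshape(-1) for g in torch.autograd.grad(out, list(model.parameters()))])

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
xi, yi = torch.tensor([[0.3]]), torch.tensor([[1.0]])   # training example fed by the teacher
xp = torch.tensor([[-0.7]])                             # arbitrary probe coordinate
eta = 1e-3

ntk = flat_grad(model, xi) @ flat_grad(model, xp)       # K_theta(x_i, x_p) at the current theta
residual = (model(xi) - yi).item()                      # dL/df at x_i for the square loss
before = model(xp).item()

# One plain gradient-descent step on the single example x_i.
loss = 0.5 * (model(xi) - yi).pow(2).sum()
grads = torch.autograd.grad(loss, list(model.parameters()))
with torch.no_grad():
    for p, g in zip(model.parameters(), grads):
        p -= eta * g

after = model(xp).item()
print(after - before, -eta * residual * ntk.item())     # the two agree closely for small eta
```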
Besides, in terms of evolution, functional gradient descent requires updating all dense points to derive f t based on the functional gradient that also relies on K(xi, ), whereas training an MLP only necessitates updating the parameter θ, providing practical convenience compared to the theoretical analysis facilitated by functional gradient descent. This work establishes a correlation between nonparametric teaching and MLP training, which involves training an MLP to represent general functions, thereby increasing the theoretical framework s potential scope for implementation in deep learning. The Relationship between Nonparametric Teaching, Implicit Neural Teaching, and Parametric Teaching In simpler terms, nonparametric teaching (Zhang et al., 2023b; Ma et al., 2019) offers a comprehensive framework that encompasses other paradigms, where these paradigms can be viewed as special cases with specific kernels. For instance, this paper focuses on implicit neural teaching, which corresponds to a distinct paradigm by specifying the neural tangent kernel, while parametric teaching (Liu et al., 2017; 2018) considers a particular paradigm utilizing a linear kernel. Furthermore, when the MLP is reduced to a single-layer architecture without nonlinear activation functions, it becomes the linear case examined in parametric teaching (Liu et al., 2017; 2018), resulting in a zero remainder in Equation 10. Figure 8 provides a visualization of these relationships. Solution of ODE for training with a fixed single input If we allow the MLP to evolve based on a single example x, we have t = η(fθt(x) f (x)) K(x, ). (26) t = 0, we can rewrite the above differential equation as: t = η(fθt(x) f (x)) K(x, ). (27) Nonparametric Teaching of Implicit Neural Representations Nonparametric Teaching Task-specific Kernel Implicit Neural Teaching Neural Tangent Kernel Parametric Teaching Linear Kernel 𝑓𝑥= 𝑤 𝜎 𝑤ଵ𝑥+ 𝑏ଵ + 𝑏 One-layer MLP Figure 8. Illustration of the relationship between nonparametric teaching, implicit neural teaching and parametric teaching. Nonparametric teaching deals with general functions corresponding to task-specific kernels. As an instance, implicit neural teaching focuses on neural tangent kernels (Jacot et al., 2018) and is concerned with the functions expressed by an overparameterized MLP. On the other hand, parametric teaching concentrates on parameterized functions of the form f(x) = θ, x , which is a specific case of nonparametric teaching that uses a linear kernel as the task-specific kernel. Additionally, teaching a one-layer MLP without nonlinear activation functions is essentially equivalent to parametric teaching. By manipulating both sides of the equation using K(x, ), H (K(x, x) = 0) and rearranging, we can obtain d (fθt(x) f (x)) = η(fθt(x) f (x)) K(x, x)dt d (fθt(x) f (x)) fθt(x) f (x) = ηK(x, x)dt Z d (fθt(x) f (x)) fθt(x) f (x) = ηK(x, x) Z dt ln |fθt(x) f (x)| = ηK(x, x)t + C. (28) When fθt(x) approaches f (x) from below, that is, fθt(x) f (x) < 0, we have ln (f (x) fθt(x)) = ηK(x, x)t + C. (29) Let t=0, we attain C = ln (f (x) fθ0(x)) . (30) Therefore, we have fθt(x) = f (x) e ηK(x,x)t (f (x) fθ0(x)) . (31) If fθt(x) approaches f (x) from above, which indicates fθt(x) f (x) > 0, we have fθt(x) = f (x) + e ηK(x,x)t (fθ0(x) f (x)) , (32) which is equivalent to the case of fθt(x) f (x) < 0 because e ηK(x,x)t (f (x) fθ0(x)) = e ηK(x,x)t (fθ0(x) f (x)) . 
Detailed solution procedure of the matrix ODE corresponding to Equation 16
The case of $f^*(x)-f_{\theta^t}(x)>0$. Since $\frac{\partial f^*}{\partial t}=0$, we can rewrite Equation 16 as
$$\frac{\partial f_{\theta^t}}{\partial t} = -\frac{\eta}{N}\,[f_{\theta^t}(x_i)-f^*(x_i)]^T_N\,[K(x_i,\cdot)]_N \quad (34)$$
$$\frac{\partial (f_{\theta^t}-f^*)}{\partial t} = -\frac{\eta}{N}\,[f_{\theta^t}(x_i)-f^*(x_i)]^T_N\,[K(x_i,\cdot)]_N. \quad (35)$$
By applying the inner product $\langle\cdot,[K(x_j,\cdot)]^T_N\rangle_{\mathcal H}$, for each $j=1,\dots,N$, to both sides of the equation and rearranging, we can derive
$$\mathrm{d}\,[f_{\theta^t}(x_j)-f^*(x_j)]^T_N = -\frac{\eta}{N}\,[f_{\theta^t}(x_i)-f^*(x_i)]^T_N\,\left\langle[K(x_i,\cdot)]_N,[K(x_j,\cdot)]^T_N\right\rangle_{\mathcal H}\,\mathrm{d}t$$
$$\mathrm{d}\,[f_{\theta^t}(x_j)-f^*(x_j)]^T_N = -\frac{\eta}{N}\,[f_{\theta^t}(x_i)-f^*(x_i)]^T_N\,K\,\mathrm{d}t, \quad (36)$$
where $K$ is a symmetric and positive definite matrix of size $N\times N$ with entries $K(x_i,x_j)$ at the $i$-th row and $j$-th column. By substituting the index $j$ with $i$, we can equivalently derive
$$\mathrm{d}\,[f_{\theta^t}(x_i)-f^*(x_i)]^T_N = -\frac{\eta}{N}\,[f_{\theta^t}(x_i)-f^*(x_i)]^T_N\,K\,\mathrm{d}t, \quad (37)$$
whose expanded version is
$$\mathrm{d}\,[f_{\theta^t}(x_1)-f^*(x_1),\cdots,f_{\theta^t}(x_N)-f^*(x_N)] = -\frac{\eta}{N}\,[f_{\theta^t}(x_1)-f^*(x_1),\cdots,f_{\theta^t}(x_N)-f^*(x_N)]\begin{pmatrix} K(x_1,x_1) & K(x_1,x_2) & \cdots & K(x_1,x_N)\\ K(x_2,x_1) & K(x_2,x_2) & \cdots & K(x_2,x_N)\\ \vdots & \vdots & \ddots & \vdots\\ K(x_N,x_1) & K(x_N,x_2) & \cdots & K(x_N,x_N)\end{pmatrix}\mathrm{d}t. \quad (38)$$
Lemma 7 provides the solution of this first-order matrix ordinary differential equation, with $\alpha(t)=[f_{\theta^t}(x_i)-f^*(x_i)]_N$, $\alpha(0)=[f_{\theta^0}(x_i)-f^*(x_i)]_N$ and $A=-\eta\bar K$, where $\bar K=K/N$:
$$[f^*(x_i)-f_{\theta^t}(x_i)]^T_N = [f^*(x_i)-f_{\theta^0}(x_i)]^T_N\,e^{-\eta\bar Kt}. \quad (39)$$
We can obtain an equivalent result by transposing it as
$$[f^*(x_i)-f_{\theta^t}(x_i)]_N = e^{-\eta\bar Kt}\,[f^*(x_i)-f_{\theta^0}(x_i)]_N. \quad (40)$$
After rearrangement, it is
$$[f_{\theta^t}(x_i)]_N = [f^*(x_i)]_N - e^{-\eta\bar Kt}\,[f^*(x_i)-f_{\theta^0}(x_i)]_N. \quad (41)$$
For the case of $f^*(x)-f_{\theta^t}(x)<0$, we similarly have
$$[f_{\theta^t}(x_i)-f^*(x_i)]_N = e^{-\eta\bar Kt}\,[f_{\theta^0}(x_i)-f^*(x_i)]_N. \quad (42)$$
After rearrangement, we have
$$[f_{\theta^t}(x_i)]_N = [f^*(x_i)]_N + e^{-\eta\bar Kt}\,[f_{\theta^0}(x_i)-f^*(x_i)]_N, \quad (43)$$
which is equivalent to the case of $f^*(x)-f_{\theta^t}(x)>0$ since
$$e^{-\eta\bar Kt}\,[f_{\theta^0}(x_i)-f^*(x_i)]_N = -e^{-\eta\bar Kt}\,[f^*(x_i)-f_{\theta^0}(x_i)]_N. \quad (44)$$
This concludes the solution.
In the sense that a function can be seen as an infinite-dimensional generalization of a Euclidean vector, Equation 17 can be generalized as
$$f_{\theta^t}(\cdot) = f^*(\cdot) + \sum_{i=1}^{\infty} e^{-\eta\lambda_i t}\,\nu_i(\cdot)\,\underbrace{\left\langle\nu_i,\,(f_{\theta^0}-f^*)\right\rangle_{\mathcal H}}_{\text{a constant}},$$
where $\nu_i$ denotes the corresponding eigenfunction obtained from the spectral decomposition (i.e., infinitely many eigenvectors).
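The following minimal numerical sketch (with a synthetic kernel and residuals, not data from the paper) illustrates the closed-form trajectory of Equation 41: the residual at the teaching examples is propagated by a matrix exponential and decays toward zero.

```python
import numpy as np
from scipy.linalg import expm

# Synthetic stand-ins for the kernel matrix, targets, and initial outputs.
rng = np.random.default_rng(0)
N = 5
A = rng.normal(size=(N, N))
K = A @ A.T + N * np.eye(N)      # symmetric positive definite kernel matrix
Kbar = K / N
eta = 0.05

f_star = rng.normal(size=N)      # target values f*(x_i)
f_0 = rng.normal(size=N)         # initial MLP outputs f_{theta_0}(x_i)

def f_t(t):
    # [f_{theta_t}(x_i)]_N = [f*(x_i)]_N - expm(-eta*Kbar*t) [f*(x_i) - f_{theta_0}(x_i)]_N
    return f_star - expm(-eta * Kbar * t) @ (f_star - f_0)

for t in [0, 10, 100, 1000]:
    print(t, np.linalg.norm(f_t(t) - f_star))  # residual norm decays toward 0
```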
B. Detailed Proofs
Proof of Theorem 5. By representing the evolution of the MLP both through the variation of its parameters and from the high-level standpoint of function variation (where $G^t$ collects the derivative of the loss $\mathcal{L}$ with respect to the MLP output at the teaching examples, as before), we have
$$-\frac{\eta}{N}\,[G^t(x_i)]^T_N\,[K(x_i,\cdot)]_N = -\frac{\eta}{N}\,[G^t(x_i)]^T_N\,[K_{\theta^t}(x_i,\cdot)]_N + o\!\left(\|\theta^{t+1}-\theta^t\|\right).$$
Following the reorganization, we obtain
$$\frac{\eta}{N}\,[G^t(x_i)]^T_N\,\left[K(x_i,\cdot)-K_{\theta^t}(x_i,\cdot)\right]_N = o\!\left(\|\theta^{t+1}-\theta^t\|\right).$$
By substituting the evolution of the parameters into the remainder, we obtain
$$\frac{\eta}{N}\,[G^t(x_i)]^T_N\,\left[K(x_i,\cdot)-K_{\theta^t}(x_i,\cdot)\right]_N = o\!\left(\Big\|\frac{\eta}{N}\,[G^t(x_i)]^T_N\,\Big[\tfrac{\partial f_\theta}{\partial\theta}\big|_{x_i,\theta^t}\Big]_N\Big\|\right).$$
During the training of an MLP with a convex loss $\mathcal{L}$ (convex with respect to $f_\theta$, though usually nonconvex with respect to $\theta$), the vector $\lim_{t\to\infty}[G^t(x_i)]_N=0$. Since the right-hand side of the equation is an infinitesimal of higher order than the left-hand side, to maintain the equality we can conclude that
$$\lim_{t\to\infty}\,\left[K(x_i,\cdot)-K_{\theta^t}(x_i,\cdot)\right]_N = 0. \quad (49)$$
This implies that, for each $x\in\{x_i\}_N$, the NTK converges pointwise to the canonical kernel.
Proof of Proposition 6. Recalling the definition of the Fréchet derivative in Definition 2, the convexity of $\mathcal{L}$ implies that
$$\mathcal{L}(f_{\theta^{t+1}}) - \mathcal{L}(f_{\theta^t}) \le \underbrace{\left\langle \nabla_f\mathcal{L}(f_{\theta^{t+1}}),\, f_{\theta^{t+1}}-f_{\theta^t}\right\rangle_{\mathcal H}}_{\Xi}.$$
By specifying the Fréchet derivative of $\mathcal{L}(f_{\theta^{t+1}})$ (denoted $G^{t+1}$) and the evolution of $f_{\theta^t}$, the r.h.s. term $\Xi$ can be expressed as
$$\Xi = \left\langle G^{t+1},\, -\frac{\eta}{N}\,[G^t(x_i)]^T_N\,[K_{x_i}]_N\right\rangle_{\mathcal H},$$
whose expansion involves the Gram-type quantity $\left\langle[K_{x_i}]_N,[K_{x_i}]^T_N\right\rangle_{\mathcal H}$ and the scaled kernel matrix $\bar K = K/N$, where $K$ is a symmetric and positive definite matrix of size $N\times N$ with elements $K(x_i,x_j)$ located at the $i$-th row and $j$-th column. Furthermore, the last term in Equation 51 can be rewritten, and the last term in the resulting Equation 52 can be elaborated further. Since $K$ is positive definite, the resulting quadratic-form term is non-negative; therefore, combining Equations 51, 52 and 53 yields Equation 54. Given the definition of the evaluation functional and the assumption that $\mathcal{L}$ is Lipschitz smooth with a constant $\xi>0$, the second term in the last term of Equation 54 is upper bounded via
$$\xi^2\,\left[E_{x_i}(f_{\theta^{t+1}}-f_{\theta^t})\right]^T_N\,\bar K\,\left[E_{x_i}(f_{\theta^{t+1}}-f_{\theta^t})\right]_N = \xi^2\,\left\langle (f_{\theta^{t+1}}-f_{\theta^t}),\,[K_{x_i}]^T_N\right\rangle_{\mathcal H}\,\bar K\,\left\langle [K_{x_i}]_N,\,(f_{\theta^{t+1}}-f_{\theta^t})\right\rangle_{\mathcal H},$$
with the resulting inner products controlled by $\left\langle[K_{x_i}]_N,[K_{x_i}]^T_N\right\rangle_{\mathcal H}$. Based on the assumption that the canonical kernel is bounded above by a constant $\zeta>0$, we have
$$\left\langle[K_{x_i}]_N,[K_{x_i}]^T_N\right\rangle_{\mathcal H} \preceq \zeta\,[1]_N[1]^T_N,$$
so the first term is bounded above accordingly, and the last term in Equation 55 is likewise bounded from above. Therefore, by combining Equations 54, 55, 56 and 57, we obtain a bound on $\mathcal{L}(f_{\theta^{t+1}})-\mathcal{L}(f_{\theta^t})$ whose leading terms scale with $\eta^2\xi^2\zeta^2$. Hence, if $\eta\le\frac{1}{2\xi\zeta}$, the inequality stated in Proposition 6 follows, which completes the proof.
Proof of Lemma 7. For $\alpha(t)=e^{At}c$, where $e^{At}=\sum_{i=0}^{\infty}\frac{t^iA^i}{i!}$ and $c$ is a time-independent column vector of size $n\times 1$, we have
$$\frac{\partial\alpha(t)}{\partial t} = \frac{\partial}{\partial t}\left(\sum_{i=0}^{\infty}\frac{t^iA^i}{i!}\right)c = Ae^{At}c = A\alpha(t). \quad (61)$$
Meanwhile, by setting $t=0$, we have
$$\alpha(0)=e^{0}c, \quad (62)$$
which means $c=\alpha(0)$. Therefore, $\alpha(t)=e^{At}\alpha(0)$ is the unique solution of the matrix ODE $\frac{\partial\alpha(t)}{\partial t}=A\alpha(t)$ with initial value $\alpha(0)$.
C. Experiment Details
C.1. Synthetic 1D signal
The FFN consists of 4 layers, each with 256 hidden units, and the value of σ is set to 2 for the random Fourier features used in the FFN. Based on Theorem 5, the canonical kernel used in FGD is approximated by the empirical NTK of the INR obtained through PGD after 5000 iterations.
C.2. Toy 2D Cameraman Fitting
We train SIREN models with 6 layers, each with 256 hidden units, using the default settings of Sitzmann et al. (2020b), to fit the 512×512 Cameraman grayscale image from scikit-image (Van der Walt et al., 2014). To stay close to the theoretical analysis of INT, we train the models with vanilla gradient descent (no momentum) for 5000 iterations. All models are trained with a cosine annealing scheduler, starting from a learning rate of 1e-4 and decaying to a minimum of 1e-6. The INT sampling strategies of the 4 SIREN models presented in Figure 4 are as follows (a code sketch of the selection rule follows this list):
- w/o INT: at each optimization step, the entire image is used.
- w/o INT (20%): at each optimization step, a random 20% of pixels are used.
- With INT (20%): at each optimization step, the pixels with the top 20% error rates from the previous training iteration are used to train the current iteration.
- With INT (incre.): the same scheme as "With INT (20%)", except that the sampling rate is increased by 8% every 500 iterations, from 20% to 92%.
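As a minimal sketch of the greedy selection rule used by the "With INT" strategies above (an illustrative reimplementation, not the paper's code; the tensor names in the usage comments are placeholders):

```python
import torch

def int_select(pred, target, ratio=0.2):
    """Greedy INT-style selection: return the indices of the pixels with the
    largest pointwise error between the current MLP output and the given signal.
    `pred` and `target` are tensors of equal shape; `ratio` is the fraction of
    pixels to keep (0.2 mirrors the 20% setting described above)."""
    err = (pred - target).abs().flatten()
    k = max(1, int(ratio * err.numel()))
    return torch.topk(err, k).indices

# Hypothetical usage inside a training loop (model, coords, pixels are placeholders):
# idx = int_select(model(coords).detach(), pixels, ratio=0.2)
# loss = ((model(coords[idx]) - pixels[idx]) ** 2).mean()
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```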
C.3. INT Strategy Experiment
We train identical SIREN models as in the previous section on 8 of the 24 images from the Kodak dataset (Eastman Kodak Company, 1999). Because numerous strategies were tested and we wanted a robust strategy that works not only across image datasets but also across other modalities, we chose a representative subset of the Kodak dataset for experimental efficiency. As shown in Figure 9, this subset includes simple images (e.g., a single object), complex images (e.g., multiple objects or high-frequency content such as grass), and images with humans. To better simulate real-world usage of INRs, we test our strategies with the Adam optimizer (Kingma & Ba, 2015), a learning rate of 1e-3, and the same cosine annealing learning rate scheduler as in the previous section. All models are trained for 5000 iterations. We note that logging PSNR/SSIM values and saving visualization results during training takes significant time. Thus, to record realistic training times, we retrain all models with the same seed and configuration but without any logging except the loss value, on a single image. As all images have the same dimensions, this suffices to represent the general trend of training times across strategies. The specific INT strategies presented in Figure 6 are as follows:
- Cosine: increase the sampling ratio from 20% to 100% following a cosine annealing schedule.
- R-Cosine: decrease the sampling ratio from 100% to 20% following a cosine annealing schedule.
- Step: increment the sampling ratio from 20% to 92% in 10 equal intervals, i.e., every 500 iterations when training for a total of 5000 iterations.
- Dense: sample the points with the top-% error rates at every training iteration, where the error rates are obtained from the previous iteration.
- Decremental: the sampling interval decreases from once every 90 iterations to once every iteration over 10 stages of 500 iterations each; at every stage we decrease the interval by 10, except for the last stage where we decrease it by 9, from 10 to 1.
- Incremental: the sampling interval increases from once every iteration to once every 90 iterations over 10 stages of 500 iterations each; at every stage we increase the interval by 10, except for the first stage where we increase it by 9, from 1 to 10.
Figure 9. The selected 8/24 images from the Kodak dataset.
Figure 10. The selected 8/24 images from the Kodak dataset and their selected training points at a particular instance.
Figure 11 presents the sampling progression of SIREN trained with SGD and Adam on the kodak05 image.
Figure 11. Visualizing the progression of sampled points when trained with SGD vs. Adam on the kodak05 image.
Besides, examining the applicability of active data selection methods (Loshchilov & Hutter, 2015; Graves et al., 2017; Mindermann et al., 2022) to efficient INR learning could be interesting.
C.4. Multi-modality Signal Fitting
For all modalities, we train a SIREN model with the Adam optimizer and a cosine annealing learning rate scheduler, and set ω0 = 30 for the SIREN model. All modalities except the 2D Kodak images start with a learning rate of 1e-4, while the 2D Kodak images start with 1e-3. We select the best INT strategy found in the previous section, step-incremental, for all modalities. Note that we always partition training into 10 equally sized stages, each with its own INT sampling ratio and sampling interval. For instance, if we train audio samples for 10K iterations, we start with a 20% sampling ratio and a sampling interval of 1, and progressively add 8% to the sampling ratio and 10 to the sampling interval every 1K iterations (a schedule sketch follows below).
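As a concrete reading of the step-incremental schedule just described, the sketch below is an illustrative reimplementation, not the paper's code; the defaults approximate the reported settings (ratio 20% to 92% over 10 stages, growing resampling interval), and the exact stage boundaries and first interval increment may differ slightly.

```python
def step_incremental_schedule(iteration, total_iters=5000, n_stages=10,
                              start_ratio=0.2, ratio_step=0.08,
                              start_interval=1, interval_step=10):
    """Map an iteration index to (sampling ratio, resampling interval) for a
    step-incremental INT schedule: the ratio grows in equal steps while the
    interval between re-selections of training points also grows stagewise."""
    stage = min(iteration // (total_iters // n_stages), n_stages - 1)
    ratio = min(start_ratio + ratio_step * stage, 1.0)
    interval = start_interval + interval_step * stage
    return ratio, interval

# e.g. at iteration 2600 (stage 5): sample the top 60% of pixels,
# re-selecting them once every 51 iterations under these assumed defaults.
print(step_incremental_schedule(2600))
```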
1D Audio. The Librispeech dataset (Panayotov et al., 2015) is chosen for the audio-fitting task. We select the first 100 samples from the test-clean split with a duration greater than 2 seconds and, for our evaluation benchmark, clip the first 2 seconds of each sample. We train a SIREN with 5 layers, each with 128 hidden units, for a total of approximately 50K parameters. Each sample is trained for 10K iterations.
2D Image. The entire Kodak dataset (Eastman Kodak Company, 1999) is used in this case. The model configuration and training parameters are identical to the previous section, except that we select the step-incremental strategy for INT training. The resulting SIREN model has approximately 265K parameters.
Megapixel Image. We fit the 8192×8192 Pluto image (NASA, 2018) with a SIREN model of 6 layers, each with 512 hidden units, for a total of approximately 1M parameters; this model size is necessary to fit the image at a PSNR above 30 dB. Model configurations are otherwise identical to 2D image fitting. We again train with the Adam optimizer and a cosine annealing learning rate scheduler, but start with a learning rate of 1e-4. During training, we break the image into mini-batches of 524,288 pixels and let the INT algorithm sample training pixels within each mini-batch at every optimization step; this is necessary due to the VRAM constraints of consumer-grade GPUs such as the NVIDIA RTX 3090 (24GB). We train for a total of 500 epochs, where each epoch consists of 128 training steps that progressively cover the entire image. Accordingly, we also tune the incremental INT sampling interval to increase from 1 to 10 instead of from 1 to 100.
Figure 12. PSNR-training time curves of Kodak image training with and without INT.
Figure 13. PSNR-training time curves of megapixel training with and without INT.
The non-smoothness of the training curves in Figures 12 and 13 is due to the increase in sampling intervals. In particular, the drops in reconstruction quality occur when switching from densely selecting optimal training points at every iteration to sampling once every several iterations (a measure that saves training time without sacrificing much final reconstruction quality). Sampling at sparser intervals is analogous to training on dynamic mini-batches of the data. Hence, at early stages of training, when the model has not yet properly learnt the underlying signal, these mini-batch training steps may lead to temporary overfitting and jumpier training curves. However, our results show that this does not affect the final reconstruction quality. In fact, pairing an increasing INT ratio with increasing sampling intervals is the optimal method of balancing fewer training samples (faster training time) against retained reconstruction quality.
3D Shape. We conduct 3D shape experiments on the Stanford 3D Scanning Repository dataset (Stanford Computer Graphics Laboratory, 2007), using 4 scenes: Asian Dragon, Thai Statue, Lucy, and Armadillo. We utilize an 8-layer SIREN with 256 hidden units, resulting in approximately 400K parameters, and train each scene for 10K iterations. Following the approach of Bacon (Lindell et al., 2022) and Scone (Li et al., 2024a), we sample points from the surface using a coarse and fine sampling procedure, adding two levels of Laplacian noise with variances of 1e-1 and 1e-3 for the coarse and fine samples, respectively. During each iteration, we randomly select a batch of 50K points; if INT is utilized, it is applied within each batch.
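A minimal sketch of the coarse/fine off-surface sampling just described (an illustrative helper, not the paper's implementation; the function name, batching, and even split between coarse and fine points are assumptions):

```python
import numpy as np

def sample_sdf_points(surface_pts, n_coarse, n_fine,
                      coarse_var=1e-1, fine_var=1e-3, rng=None):
    """Perturb on-surface points with Laplacian noise of two variances to obtain
    coarse and fine off-surface samples. `surface_pts` is an (M, 3) array of
    points on the mesh surface."""
    rng = np.random.default_rng() if rng is None else rng

    def perturb(n, var):
        idx = rng.integers(0, len(surface_pts), size=n)
        # A Laplace distribution with scale b has variance 2 * b^2.
        b = np.sqrt(var / 2.0)
        return surface_pts[idx] + rng.laplace(scale=b, size=(n, 3))

    return np.concatenate([perturb(n_coarse, coarse_var),
                           perturb(n_fine, fine_var)], axis=0)

# e.g. a 50K-point batch split evenly between coarse and fine samples:
# batch = sample_sdf_points(mesh_vertices, 25_000, 25_000)
```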
IoU is computed by first transforming the learned signed distance function (SDF) into an occupancy grid of shape 512×512×512 bounded by $[-0.5, 0.5]^3$. Below, we present the complete results for each scene:
Scene | IoU (%) w/o INT | IoU (%) w/ INT
Asian Dragon | 96.46 | 96.05
Armadillo | 98.48 | 98.25
Thai Statue | 96.43 | 96.22
Lucy | 96.91 | 96.19
Table 3. 3D shape representation results for all scenes.
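For reference, a minimal sketch of the IoU computation described above (an illustrative helper, not the paper's evaluation code; the input names are assumptions):

```python
import numpy as np

def iou_from_sdf(pred_sdf_grid, gt_occupancy):
    """IoU between the occupancy implied by a learned SDF sampled on a regular
    grid (e.g. 512^3 over [-0.5, 0.5]^3) and a ground-truth occupancy grid.
    Both inputs are arrays of identical shape."""
    pred_occ = pred_sdf_grid <= 0.0          # inside the surface where SDF <= 0
    gt_occ = gt_occupancy.astype(bool)
    inter = np.logical_and(pred_occ, gt_occ).sum()
    union = np.logical_or(pred_occ, gt_occ).sum()
    return float(inter) / float(union)

# e.g. iou = iou_from_sdf(sdf_on_grid, gt_grid); print(f"IoU: {100 * iou:.2f}%")
```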