# fourier_learning_with_cyclical_data__82d4c204.pdf Fourier Learning with Cyclical Data Yingxiang Yang 1 Zhihan Xiong 2 Tianyi Liu 1 Taiqing Wang 1 Chong Wang 1 Many machine learning models for online applications, such as recommender systems, are often trained on data with cyclical properties. These data sequentially arrive from a time-varying distribution that is periodic in time. Existing algorithms either use streaming learning to track a time-varying set of optimal model parameters, yielding a dynamic regret that scales linearly in time; or partition the data of each cycle into multiple segments and train a separate model for each a pluralistic approach that is computationally and storage-wise expensive. In this paper, we have designed a novel approach to overcome the aforementioned shortcomings. Our method, named Fourier learning , encodes the periodicity into the model representation using a partial Fourier sequence, and trains the coefficient functions modeled by neural networks. Particularly, we design a Fourier multi-layer perceptron (F-MLP) that can be trained on streaming data with stochastic gradient descent (streaming-SGD), and we derive its convergence guarantees. We demonstrate Fourier learning s better performance with extensive experiments on synthetic and public datasets, as well as on a large-scale recommender system that is updated in real-time, and trained with tens of millions of samples per day. 1. Introduction Cyclical data are an important component in a wide range of machine learning applications. In large-scale recommender systems such as You Tube and Tik Tok, users usually log into the platform during a relatively fixed time window each day (e.g. before bed or after work), resulting in a strong cyclical pattern in the system s revenue. In federated learning *Equal contribution 1Byte Dance Inc 2Paul G. Allen School of Computer Science & Engineering, University of Washington, WA. Correspondence to: Yingxiang Yang, Chong Wang <{yingxiang.yang, chong.wang}@bytedance.com>. Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). 0 10 20 30 40 50 60 70 Hour Normalized Revenue 0 20 40 60 80 100 120 140 Hour Figure 1. Top: Normalized revenue of a recommender system over 3 days. Bottom: Normalized trend of the word dinner over 6 days (data collected from https://trends.google.com). (Kairouz et al., 2019), the training data are cyclical in nature as the availability of each device is usually fixed throughout the day (Eichner et al., 2019). In financial markets, asset prices rise and fall periodically on a yearly basis, a phenomenon commonly known to the investors as seasonality (Gultekin & Gultekin, 1983). In search engines, the number of hits for certain keywords can also display periodic patterns (Trac a et al., 2021). Figure 1 depicts a few of these applications. How to exploit the periodicity within the training data to learn a better prediction model is an important issue for these applications. Problem setup. Given samples denoted by (x, y, t), with x X Rd being the feature, y Y R being the label, and t R being the time at which the sample is generated, we wish to maintain a model f that can consistently and accurately predict y with x at any given t. In this paper, we focus on the scenario where the model is updated under an online learning (Hazan, 2019) or continual learning (Lopez Paz & Ranzato, 2017) framework, while the data arrive in a streaming and cyclical fashion. More specifically, between two consecutive updates of the model at t and t + δ, only samples arrived within the interval [t, t + δ) is available for training. In addition, if (x, y) is generated from a timedependent distribution Dt, then there exists a T such that Dt = Dt T for all t. Assuming that T is given, and that, for Fourier Learning with Cyclical Data any (x, y, t), the triplet (x, y, mod(t, T)) is sampled from a joint distribution p(mod(t, T))Dmod(t,T )(x, y),1 our goal is to solve the following set of optimization problems: f t (x) argmin f L2(X) Ex,y Dt[ℓ(f(x), y)] t R. (1) Here, we assume that X and Y are convex and compact sets, and ℓis strongly convex with respect to f for all y Y. The optimization is conducted within L2(X) = {f : X R| R X f 2(x)dx < }, a function space that contains all finite-energy functions defined over X. Motivation. The periodicity of the data distribution plays an important role in (1). Since Dt = Dt+n T for all n N, a solution f t (x) at t is also guaranteed to be a solution at t + n T. This immediately implies that the model learned at time t may offer useful information to improve the prediction accuracy at t + n T. This intuition motivates the design of a learning algorithm that can effectively exploit this information offered by the cyclical nature of the data. Surprisingly, existing literature in optimization and machine learning offers little insight on how to exploit periodicity within the data distribution to solve (1) efficiently. This is particularly pronounced under a big-data setting, where industrial practices implement algorithms that simply underrate the periodicity within the training data. Below, we summarize existing approaches to the best of our knowledge from both academia and industry. 1.1. Related Works Learning with a time feature. A straight-forward design to encode periodicity into the model structure is to include t as a model input and learn a model f(x, t). Unfortunately, this approach is not guaranteed to learn a periodic function out-of-the-box, especially when f(x, t) is approximated by a neural network (Ziyin et al., 2020). When f(x, t) is assumed to belong to a non-parametric family, such as a reproducing kernel Hilbert space (RKHS) with a periodic kernel (Fukumizu et al., 2008; Wahba, 1990), the periodicity is usually encoded across all input dimensions, whereas in (1), f(x, t) may be aperiodic in x. An enhanced version of the above approach is to focus on a single period of f(x, t) and learn a model f(x, mod(t, T)) instead. Although the pre-processing of t into mod(t, T) guarantees periodicity during the inference stage, it still often requires laborious feature engineering, especially when x is high-dimensional and f(x, t) has a complicated design, e.g., (Cheng et al., 2016). The pluralistic approach. An alternative approach to (1) is to partition the time axis, and learn a model for every 1For convenience, we denote p(mod(t, T)) and Dmod(t,T ) as p(t) and Dt interchangeably in the rest of the paper. time interval (Eichner et al., 2019). When Dt is piecewise constant in time, e.g., Dt = D01[0 mod(t, T) < T/2] + D11[T/2 mod(t, T) < T], this approach allows each separate model to converge to its optimal as t/T . On the other hand, however, the pluralistic approach suffers from an approximation error when the partition is crude (Eichner et al., 2019). More importantly, it is hard to scale as it requires storing multiple models. Although computationally efficient methods exist, e.g., partially sharing the network structure between the models, they typically compromise the theoretical guarantees as a trade-off. Online learning. A standard industrial practice for training large-scale systems using sequential data is to follow the online learning protocol (Hazan, 2019). The performance of the learning algorithm is typically evaluated using the concept of dynamic regret (Mokhtari et al., 2016), which measures the model s capability to consistently and accurately predict the labels of the latest batch of arriving data. Crudely speaking, when t takes a set of discretized values, the dynamic regret measures P t EDt[ℓ(ft(x), y) ℓ(f t (x), y)], the cumulative sum of the differences between the loss under the learned model and the optimal loss under f t (x) defined in (1). Although a plethora of optimization algorithms have been proposed to improve the dynamic regret analysis (Zhang et al., 2018b; 2016), none of them shed light on how to exploit periodicity within the training data. What is more, when Dt does not converge to a fixed distribution as t diverges, the dynamic regret scales linearly in t, implying that the gap between the learned model and the desired optimal does not vanish even when we know that the data is cyclical. To summarize, learning with cyclical data remains largely an open problem for large-scale machine learning models that are trained with an online learning setup. 1.2. Our Contributions In this paper, we address the challenge of learning with cyclical data by proposing a novel learning framework called Fourier learning . Simply put, Fourier learning reduces (1) into a single optimization problem in a function space that naturally contains time-periodic functions, and learns the coefficients of a partial Fourier expansion for these functions using streaming-SGD. Theoretically, we support the Fourier learning framework from two different aspects: (i) from a modeling perspective, we derive Fourier learning naturally from a functional optimization problem that is equivalent to problem (1) under a strongly convex and a realizable setting; (ii) in terms of optimization, we show that the coefficient functions updated with streaming-SGD provably converge in the frequency domain. Practically, for large-scale learning systems with a neural network architecture, we introduce FMLP, a deep learning component that can be incrementally Fourier Learning with Cyclical Data added to a wide range of existing model designs to increase capacity. Last but not least, we provide extensive numerical evidence on synthetic, public, and real-time production data of a recommender system to demonstrate the advantage of Fourier learning over the prior state-of-the-arts. 2. Learning Cyclical Data in Hilbert Spaces A Hilbert space formulation. In this section, we lay the theoretical foundations for Fourier learning by deriving it as a natural solution to a function optimization problem. We start by reformulating the set of learning problems in (1) as one single learning problem in a Hilbert space. This allows us to learn a unified model that takes both x and t as inputs. Our learning objective takes the form of (2) below, where the expectation can be replaced by the empirical mean over datasets in practice: min f H L(f) := Ex,y,t Dt(x,y)p(t) ℓ(f(x, t), y) . (2) The unified objective (2) is related to (1) via the following lemma, the proof of which is given in Appendix A. Lemma 1. For f t (x) in (1), let f0(x, t) = f t (x), t R. If f0(x, t) H, then f0(x, t) minimizes L(f) in H. Lemma 1 implies that, if (2) has a unique minimizer, and if f t (x) belongs to H when treated as a function of both x and t, then the minimizer of (2) leads to the solution of (1). Hence, under a realizable setting, (2) serves as a proxy to solving (1). In the rest of the paper, we adopt this realizable learning setup and focus on solving (1) via (2). Another critical element in (2) is the design of H. Here, we focus particularly on functions that are continuous, periodic in time, and have finite energy in a single period. In addition, we require that the functions in H degenerate to L2(X) as specified in (1) for every fixed t. We now introduce two important elements required for designing such an H. Functions on circles. Defining functions on circles is a common way to characterize periodic functions. These functions take the angular information of a point on a circle as input, and naturally have a period that is proportional to the circle s circumference. In our problem, we further define a Hilbert space structure over these functions by viewing a circle as a line segment with its end-points glued together: 2 L2(ST ) = f : R R 0 |f(t)|2dt < and f(t + T) = f(t) t R . (3) 2More strictly speaking, ST = R/TZ is a quotient space, and therefore could be treated more rigorously with group theory (Rudin, 2017). As it turns out, if we define f, g def = R T 0 f(t)g(t)dt, then (L2(ST ), , ) forms a Hilbert space.3 This Hilbert space meets our needs in the special case when there is no input feature to the model, i.e., when f(x, t) depends on t only. Tensor product between Hilbert spaces. To further augment L2(ST ) into a Hilbert space that contains functions dependent on both x and t, we turn to the concept of tensor product between Hilbert spaces, which is a direct metric space extension to the concept of Kronecker product between vectors in Euclidean spaces. Specifically, given two Hilbert spaces denoted by (H1, , 1) and (H2, , 2), respectively, the tensor product of H1 and H2 is a Hilbert space (H H1 H2, , ) coupled by a bi-linear mapping ϕ : H1 H2 H. Together, H and ϕ must satisfy the following properties. (i) The set of vectors ϕ(u1, u2) with u1 H1 and u2 H2 must form a total subset of H. That is, H = Span{ϕ(u1, u2)|u1 H1, u2 H2}. (ii) The inner product of H, , , must satisfy ϕ(u1, u2), ϕ(v1, v2) = u1, v1 1 u2, v2 2 for any u1, v1 H1 and u2, v2 H2. If we adopt two orthonormal sets of basis functions, {e1i}dim(H1) i=1 and {e2j}dim(H2) j=1 , for H1 and H2, respectively, these aforementioned properties would allow us to expand any element ϕ(u1, u2) H into ϕ(u1, u2) = Pdim(H1) i=1 Pdim(H2) j=1 u1iu2jϕ(e1i, e2j), where u1i = u1, e1i 1 and u2j = u2, e2j 2, respectively. Furthermore, when H1 = L2(X) and H2(Y) = L2(Y), as is the case for our problem, an isomorphism exists such that ϕ(e1i, e2j) = e1ie2j. This implies that we can consider H as an isomorphism of H1 H2 containing functions that are linear combinations of {e1ie2j} i=1,j=1, i.e, H = L2(X) L2(Y) = L2(X Y). We refer interested readers to (Reed, 2012) for more details on this topic. 2.1. A Tensor-Product-Based Design of H Augmenting L2(ST ) by its tensor product with L2(X), a natural choice of H is to set H L2(X ST ), where L2(X ST ) = f : X R R f L2(X [0,T ]) < and f(x, t) = f(x, t T) t . (4) This L2(X ST ) expands L2(X) with an additional dimension in t defined on a circle with a circumference T, which naturally restricts f(x, t) H to be a periodic function over t for any fixed x X. The following lemma certifies that H is a Hilbert space and characterizes its basis functions using the isomorphism between H and L2(X) L2(ST ). Lemma 2. Let H be defined in (4). For f, g H, let f, g def = R X R T 0 f(x, t)g(x, t)dxdt, then (H, , ) is a 3For the simplicity of notations, we denote the inner product as , when its associated Hilbert space is clear by context. Fourier Learning with Cyclical Data Hilbert space. Furthermore, there exists an isomorphism between H and L2(X) L2(ST ), i.e., if {ϕi} i=1 and {ψj} j=1 are two orthonormal sets of bases for L2(X) and L2(ST ), respectively, then {φij} i,j=1, where φij(x, t) def = ϕi(x)ψj(t), is an orthonormal set of basis functions for H. The proof of Lemma 2 is given in Appendix B. This lemma paves the way to a framework that learns the model f via its basis expansion in H, which we introduce in the following section. Meanwhile, we argue that H is general enough for our learning purpose in the sense that the function f0(x, t) defined pointwise by solutions in (1) belongs to H under mild assumptions. We start by introducing necessary definitions and assumptions as below. Definition 3 (Continuity under total variation). Let Dt(x) be the conditional distribution of y given x under Dt. We say Dt(x) is continuous in t under total variation distance if, for any fixed t and any ϵ > 0, there exists δ > 0 such that Dt (x) Dt(x) TV ϵ whenever |t t| δ. Assumption 4. Suppose: (i) X and Y are compact and convex sets; (ii) Dt(x) in Definition 3 is continuous under total variation for all x X; (iii) the loss function ℓ(f(x), y) is σ-strongly-convex in its first argument for all y Y; (iv) f(x) in (1) is bounded and maxx X,y Y ℓ(f(x), y) K for some constant K. In practice, Assumption 4 can be easily satisfied by a wide range of machine learning systems. For instance, deep neural networks (DNNs) typically have bounded outputs when a clipping on the final output is enforced. The uniform strong convexity of the loss function also holds for a wide range of ℓsuch as the the mean squared loss. With the above definition and assumption, we now state the following lemma. Lemma 5. Under Assumption 4, f t (x) is continuous in t for any given x X. In addition, f0(x, t) def = f t (x) H. The proof of Lemma 5 is given in Appendix C. This lemma implies that, under Assumpiton 4, the optimal solution of (2), f0(x, t), belongs to H. Combining Lemmas 1 and 5, we immediately see that the satisfaction of Assumption 4 allows us to acquire a set of desired solution of (1) by solving (2). 2.2. Fourier Learning with Cyclical Data We now introduce Fourier learning, a learning framework that hard-wires the periodicity into the model s structure via a partial Fourier expansion and learns the model by learning its Fourier coefficient functions. From a modeling aspect, we invoke Lemma 2, and represent f(x, t) H by j=1 ci,jϕi(x)ψj(t) = j=1 cj(x)ψj(t), (5) where cj(x) L2(X) is the sum of ci,jϕi(x) over i. Noticing that a set of bases for L2(ST ) are the trigonometric functions with a base frequency 1/T, we immediately have the following Theorem (the proof is given in Appendix D). Theorem 6. Any f(x, t) H can be represented by an(x) sin 2πnt + bn(x) cos 2πnt where an(x), bn(x) L2(X), and a0(x) 0. Theorem 6 provides an explicit way of designing periodic models and specifies how the time-feature could be exploited. Note that, it is entirely possible to construct H with a weighted L2 space defined on circles to guarantee periodicity, e.g., an RKHS with a periodic kernel. This allows us to deviate from the trigonometric functions and use potentially other periodic functions to encode periodicity, e.g., the periodic Gaussian kernel (Mac Kay et al., 1998). Our goal now shifts towards learning the coefficient functions in the frequency domain. For tractable learning, we introduce a cutoff frequency N/T, and learn a truncated expansion of f(x, t) instead: f N(x, t) = an(x) sin 2πnt + bn(x) cos 2πnt By (Carleson, 1966), the approximation error for all f H, EN(f) = f(x, t) f N(x, t) satisfies lim N EN(f) = 0. Hence, with a properly selected N, we can control the approximation error while limiting the amount of model parameters at the same time. The coefficient functions, an(x) and bn(x), can be learned under a variety of regimes. For example, they can be learned non-parametrically using function optimization algorithms (Yang et al., 2019) (see Section 4 for more details). In the remainder of the paper, we focus primarily on parameterizing an(x) and bn(x) with neural networks, so as to apply Fourier learning to large-scale machine learning scenarios. 2.3. Discussions Fourier analysis in machine learning. Fourier analysis, as a prominent branch of study in signal processing and digital communications (Oppenheim et al., 1983), has been widely adopted in machine learning and the designs of neural networks (Gallant & White, 1988; Uteuliyeva et al., 2020). In many applications, such as computer vision and time series analysis (Tancik et al., 2020; Zhang et al., 2018a), Fourier analysis has attracted substantial research interests as it yields more expressive features (Rahimi et al., 2007), Fourier Learning with Cyclical Data Fourier Learning (time domain) Fourier Learning (frequency domain) Pluralistic (1 model) (2 models) Online Learning Optimal Optimal Optimal Optimal Weight Weight Figure 2. Illustration on how online learning, the pluralistic approach, and Fourier learning learn a sine function. {an(x)}N n=1 Element-Wise Product Last Hidden Layer SUM Fourier Layer Output dim = 2N + 1 Final Output dim = 1 SIN dim = N COS dim = N + 1 Fourier Layer {bn(x)}N n=0 Figure 3. The design of a Fourier-MLP with a scalar output. and improves the predictive power of the neural networks (Tancik et al., 2020; Sitzmann et al., 2020). Unlike existing works, Fourier learning focuses on an online learning setup, and improves the model performance by exploiting periodicity within the data distribution. To further illustrate this difference, we compare Fourier learning, online learning, and the pluralistic approach in Example 1. Example 1. Consider learning f t (x) = sin(2πt/T) given a stream of samples {ti, sin(2πti/T)}100 i=1, with T = 10 and ti = i. Under a mean squared error (MSE) loss, we sketched the output of online learning, the pluralistic approach, and Fourier learning in Figure 2. As illustrated, online learning learns a periodic function with a delay. This is because at time t, online learning can only learn with the samples collected prior to t. The pluralistic approach learns a fixed model for each interval. It suffers from an approximation error when the partition is crude. By comparison, Fourier learning learns the frequency representation of the model a1(x), and yields a near-optimal solution in time domain. By exploiting periodicity within the data distribution, Fourier learning mitigates the delay in online learning, and the approximation error in the pluralistic approach. In Example 1, Fourier learning is only tasked to learn a frequency component that is independent of x. In practice, ft(x) often highly depends on x, and is typically modeled by neural networks. This motivates us to design a deep learning solution to incorporate periodicity into the existing deep learning models, which we focus on in the next section. 3. Fourier Multi-Layer Perceptron (F-MLP) To integrate Fourier learning in large-scale machine learning systems, we introduce F-MLP, a neural network structure that can be incrementally added to existing neural network designs to better learn a periodic model. 3.1. Architecture of F-MLP F-MLP is designed by the following intuition: if we view x as the output of a neural network s last hidden layer, then (7) can be viewed as the network s output layer with an architecture shown in Figure 3. Specifically, F-MLP first transforms x into an(x) and bn(x), and then element-wise multiplies them with basis vectors SIN and COS, yielding a (2N + 1)-dimension layer. This layer is then added up, yielding a scalar output.4 Notably, when an(x) = bn(x) = 0 for all n 1, the final output equals b0(x), which, by itself, can be interpreted as the original model s output. This implies that replacing the original model s output layer with an F-MLP increases its capacity, avoiding the need for laborious feature engineering. 3.2. Training F-MLP with Streaming-SGD The training of F-MLP is performed jointly with the original model, following the procedure of streaming-SGD. This procedure is different from the standard SGD, which in practice would need sample data (x, y, t) Dt(x, y)p(t). However, sampling from p(t) is difficult for many online applications due to the real-time update requirement, where data arrive sequentially. Here we show that with streaming-SGD we can avoid this issue while still having good practical performances and convergence guarantees. The training procedure is as follows. We parameterize an(x) and bn(x) by an(x; θn) and bn(x; ρn), respectively, with θn and ρn being the neural network parameters. For cyclical data, we collect the τ-th mini-batch of data in the k-th cycle, and update the model with the following update rule: θ(k,τ+1) n = θ(k,τ) n ηk,τg(k,τ) θn , ρ(k,τ+1) n = ρ(k,τ) n ηk,τg(k,τ) ρn . (9) Here, g(k,τ) θn and g(k,τ) ρn are gradients calculated using the 4We refer interested readers to Appendix H for the design of F-MLP with a vector output. Fourier Learning with Cyclical Data collected mini-batch of data: g(k,τ) θn = θn b Lk,τ(f (k,τ) N ), g(k,τ) ρn = ρn b Lk,τ(f (k,τ) N ), (10) where f (k,τ) N is computed with (7), while b Lk,τ is the empirical version of the loss in (2) over this mini-batch of data. The overall training procedure is summarized in Algorithm 1, and we present the convergence analysis of it below. 3.3. Convergence Analysis In this section, we discuss the convergence properties when training Fourier learning models with streaming-SGD. Recall that, using a truncated f N(x, t), problem (2) reduces into finding the optimal of min {an}N n=1,{bn}N n=0 Ex,y,t Dt(x,y)p(t) ℓ(f N(x, t), y) (11) in the frequency domain. We denote the optimal set of coefficient functions of (11) as a n(x) and b n(x), for which the corresponding model f N(x, t) can be expressed as f N(x, t) = a n(x) sin 2πnt + b n(x) cos 2πnt In the following, we first show a gradient norm convergence result for streaming-SGD under a general non-convex setting, and then introduce a global convergence result under the assumption of strong convexity. Prior to that, we introduce some additional assumptions. Assumption 7. Suppose: (i) for all n, k, τ, the second moment of the update directions are bounded: max{E[ g(k,τ) ρn 2 2|F(k)], E[ g(k,τ) θn 2 2|F(k)]} G2 for some G R, where F(k) is the minimum σ-algebra generated by a(κ,τ) n and b(κ,τ) n for all n, κ < k and 1 τ Γ. In addition, we assume (ii) there exists Λ > 0 such that L(f1) L(f2) Λ f1 f2 for all f1, f2 H. This assumption assumes bounded second moment of the update directions, and the Lipschitzness of the gradient, which are usually required in the convergence analysis of SGDtype algorithms. The following result shows that streaming SGD with a proper learning rate achieves convergence under both non-convex and strongly convex settings. Theorem 8 (Convergence of Streaming-SGD). Let (i), (ii) of Assumption 4 and Assumption 7 hold, and define L(f (k,1) N ) as the gradient with respect to a joint parameter vector combining all θ(k,1) n and ρ(k,1) n . Let ηk,τ ηk = Θ(1/ Tmax + 1), then min 0 k Tmax L(f (k,1) N ) 2 = O(1/ p Algorithm 1 Streaming-SGD for F-MLP 1: Input: Macro-iteration number Tmax; loss L; step sizes {ηk,τ}Tmax,Γ k=0,τ=1; period T; cutoff frequency N/T. 2: Initialize {a(0,1) n }N n=1 and {b(0,1) n }N n=0. 3: for (k, τ) = (0, 1) to (Tmax, 1) do 4: Compute f (k,τ) N with a(k,τ) n and b(k,τ) n with (7). 5: for n {0, . . . , N} do 6: Compute g(k,τ) θn and g(k,τ) ρn . 7: Update θ(k,τ) n and ρ(k,τ) n . 8: end for 9: end for 10: Output: f (Tmax,1) N , a(Tmax,1) n and b(Tmax,1) n . Let ηk,τ ηk = Θ(1/ min 0 k Tmax L(f (k,1) N ) 2 = O log(Tmax + 1) Tmax + 1 Moreover, if L is σ-strongly convex with respect to θn and ρn, we can take ηk,τ = ψ/k with ψ < (2σ2) 1 and obtain E θ(Tmax,1) n θ n 2 2 = O(T 1 max), E ρ(Tmax,1) n ρ n 2 2 = O(T 1 max), where we assume an(x, θ n) = a n(x), bn(x, ρ n) = b n(x). The proof of Theorem 8 is given in Appendix F. Simply put, our learning framework offers a convergence rate of O(1/ Tmax) under a general non-convex setting and O(1/Tmax) under a strongly convex setting. If we further drive N , the overall learning error of f0(x, t) can be driven to arbitrarily small. Compared to the online learning benchmark whose dynamic regret is affected by both the changing speed of the data-generating distribution and the variance of the stochastic gradients, Fourier learning yields a much smaller learning error and hence offers a potentially much better performance in many practical scenarios. 4. Extension to the Non-Parametric Regime Apart from the parametric framework introduced in the prior sections, Fourier learning also fits into the non-parametric regime, where an and bn are updated directly: a(k,τ+1) n (x) = a(k,τ) n (x) ηk,τ g(k,τ) an (x), b(k,τ+1) n (x) = b(k,τ) n (x) ηk,τ g(k,τ) bn (x). (13) As the functional gradients in L2 often contain Dirac s δfunctions, causing discontinuous updates, we substitute the functional gradients with their kernel embeddings instead. Specifically, with K( , ) : X X R being a positivedefinite kernel whose minimum eigen-value is bounded Fourier Learning with Cyclical Data away from 0, we let g(k,τ) an (x) = an b Lk,τ(f (k,τ) N )]( ), K(x, ) , g(k,τ) bn (x) = [ bn b Lk,τ(f (k,τ) N )]( ), K(x, ) . (14) It is easy to verify that these kernel embeddings yield continuous updates of an(x) and bn(x) at each iteration. At the same time, g(k,τ) an and g(k,τ) bn are close enough to the exact gradients and retain the convergence guarantees (Yang et al., 2019). If we initialize a(0,τ) n and b(0,τ) n to be zeros, then a(k,τ) n and b(k,τ) n can be written as a linear combination of a finite set of kernels. An example of calculating the kernel embeddings of functional gradients is given in Appendix E. The convergence result for the non-parametric case is given in Theorem 9 below. The proof is given in Appendix G. Theorem 9. Let Assumption 4 and Assumption 7 hold. Let g(k,τ) an and g(k,τ) bn be the kernel embeddings of functional gradient with K( , ) at iteration (k, τ), as defined in (14). Let ηk,τ = σ(k + 0.5)λ 1(k + 1) 2. Then, E a(Tmax,1) n a n 2 2 = O(T 1 max) and E b(Tmax,1) n b n 2 2 = O(T 1 max), with a n and b n defined in (12). 5. Numerical Simulations In this section, we numerically demonstrate the superiority of Fourier learning over the prior state-of-the-arts on synthetic and public datasets. Our major benchmarks include: (i) Time-feature: a benchmark that encodes periodicity with f(x, mod(t, T)) using neural networks; (ii) Pluralistic (Eichner et al., 2019); and (iii): Online learning (Hazan, 2019), as adopted by common industrial systems. For each simulated method, we record the instantaneous loss at each iteration prior to updating the model. This allows us to evaluate the model s ability on tracking a constantly changing set of optimal model parameters and on maintaining a good prediction accuracy consistently. The source code and logs used to generate the reported experiments can be found at https://github.com/ Yangyx891121/Fourier-Learning. 5.1. Synthetic Dataset: Linear Model Experiment settings. We considered a toy experiment with T = 1 and p(t) = 1 for t [0, T] in (1). For t [0, T], let x N(sin(2πt), 0.1), and y N(P6 n=1 sin(2πnt)x, 0.01). We generated samples over t [0, 50T], and fitted a linear time-varying system y = α(t)x using the mean squared error (MSE) loss. The optimal α(t) under this setting is α (t) = P6 n=1 sin(2πnt), allowing us to evaluate the performance of each algorithm using the MSE between its estimated α(t) and α (t). We implemented the aforementioned benchmarks under Online Learning Pluralistic Fourier Learning 0 10 20 30 40 50 Number of Periods Averaged Estimation Error (log scale) Time-Feature Design A Time-Feature Design B Figure 4. Aggregated MSE (in log scale) in each period between the estimated and ground truth α(t) over 50 periods. the following settings. For (i), we simulated two separate designs, one using f(x, t) = α(mod(t, T))x where α(mod(t, T)) is a two-layer neural network, while the other models f(x, t) with a two-layer neural network directly. For (ii), we used a vector α to store m separate α s. The i-th value of α is responsible for learning samples with mod(t, T) [(i 1) interval, i interval) where interval = T/m. The number of models, m, is set as a tuning parameter. For (iii), we directly optimized over α using online gradient descent. We grid-searched the learning rate, hidden layer sizes, and the number of Fourier bases. Results. We plotted the MSE of the simulated algorithms in Figure 4, where we aggregated the MSE over each period so that the horizontal axis shows how an algorithm s performance improves as t/T increases. We see that Fourier learning has a much better performance than all the benchmarks. In addition, the online learning benchmark has almost no MSE reduction as t/T increases. This is because the online learning regime lacks exploitation of the cyclical nature of the data, which leaves behind a constant gap between the learned model and the desired optimal at each iteration. 5.2. Real Dataset: Sentiment140 Experiment settings. We classified the sentiment of tweets using a bag-of-words model over the Sentiment140 Twitter (Go et al., 2009) dataset. Following the experiment settings of Eichner et al. (2019), we manually created a cyclical data stream with T = 1 day = 86400 seconds by first dividing the samples into four blocks based on their associated timestamps, and then down-sampled the positive (or negative) samples of each block based on the sign and value of a randomly-generated ratio within a range of [ 0.7, 0.7]. For the bag-of-words model, we used a three-layers neural network, with a 4096-dimensional input layer, a 64dimensional hidden layer, and a two-dimensional output Fourier Learning with Cyclical Data Test Accuracy Pluralistic Online Learning Fourier Learning 0.500 12 14 Time-Feature Figure 5. Testing error of Fourier learning and benchmarks, averaged over 10 trials. layer, largely following the settings of (Eichner et al., 2019). The time feature was added as a one-dimensional input to the input layer. For the pluralistic approach, we reduced the hidden layer size to keep the number of parameters in line with the other methods. For Fourier learning, a 64dimensional Fourier layer is added at the output end. Results. We conducted the experiment over 15 days of data, and plotted the results for lr = 0.1 in Figure 5 (see Appendix I for more results). We see that Fourier learning has a better performance than all the benchmarks. We also see that adding a time feature and online learning have very similar performances. This suggests that directly adding a time-feature may not bring immediate performance improvement, which warrants the need for further feature engineering. Furthermore, the gap between Fourier learning and the aforementioned benchmarks cannot be easily closed. In fact, even upon trippling the network size, the performance of online learning is still inferior than Fourier learning (see Appendix I). Lastly, despite the competitiveness of the pluralistic approach (Eichner et al., 2019), its performance lags as the storage constraint forces a reduction to its model size. 6. Fourier Learning in Recommender Systems In this section, we report the performance of Fourier learning implemented on a conversion rate (CVR) prediction model in an industrial recommender system. As demonstrated in Figure 1, the system s revenue displays a periodic pattern due to the periodic patterns in the lifestyles of its users. The model architecture follows the design of a wideand-deep network (Cheng et al., 2016), with thousands of features, and several sub-networks that are either wide or deep. The model output at (k, τ)-th iteration has the form foco(x; (k, τ)) = PQ q=1 fq(x; (k, τ)). Here, fq(x; (k, τ)) is the output of the q-th (out of a total of Q) sub-network at iteration (k, τ). Wide Model Deep Model Last Hidden Layer Original Model Model Output F-MLP + Convex Combination Figure 6. Deploying F-MLP to a wide and deep neural network. Design. We added Fourier learning to this model by mixing foco(x; (k, τ)) with f FL(x, (k, τ)), the output of Fourier learning which uses t as an input. This output of Fourier learning is obtained by first aligning the dimensions of the last hidden layers of all the sub-networks, and then passing their sum through a single Fourier layer. Exploiting the linearity of the Fourier transform, this design avoids the need of learning a Fourier layer for each subnetwork. The final output has the form f(x, (k, τ)) = ξfoco(x; (k, τ)) + (1 ξ)f FL(x, (k, τ)), where ξ [0, 1] is a hyperparameter. We visualized this design in Figure 6. Intuitively, the mixture of logits has two advantages. First, it circumvents drastic alterations to the original model architecture. In fact, setting ξ = 1 retains the original model: f(x, (k, τ)) = foco(x; (k, τ)). Secondly, it handles the practical situation that the data distribution may not strictly follow a periodic pattern. When the periodicity pattern is relatively strong, we expect Fourier learning to improve the performance of the original model; when the periodicity pattern is relatively weak, the presence of foco(x; (k, τ)) limits the model mismatch introduced by Fourier learning. It turns out that this strategy is highly effective. Learning setting. We used production-level data spanned over three months, denoted by months F, G, and H, respectively. Each day has thirty to sixty million training samples. Under the aforementioned design, we set the base frequency of the Fourier learning model to 2.4192MHz, or (28 days) 1, and set N = 28 4 in (7), providing a frequency band of up to (6 hours) 1. The coefficient ξ is set to 0.5. The rest of the model parameters are the same as baseline. We trained the model on a distributed machine learning platform with 3,600 CPU cores. The training took one day. A total of five benchmarks are implemented, with the hyperparameters carefully tuned via grid search. Fourier Learning with Cyclical Data Online learning: baseline model, the current production model that serves hundreds of millions of users every single day. It does not use time features. Time-feature: learning f(x, mod(t, T)) with a time of day feature mod(t, T). This feature is inserted into the model the same way as all other features. Positional encoding: learning f(x, g(mod(t, T))) with the time of day feature (mod(t, T)) embedded in a positional encoder g( ) (Vaswani et al., 2017). This encoded feature is inserted into the model as an input to the deep sub-network. Pluralistic (Eichner et al., 2019): a total of six models are trained but the models only differ in the last few layers. Since a single model in the production system costs a tremendous amount of CPU cores and memory to train and serve, the pluralistic approach is simply impossible to implement in its original form. Online learning (large): the online learning approach with a larger network that has the same number of parameters as Fourier learning. We use this benchmark to show that Fourier learning s performance gain is not due to a simple increment in the model size. Results. We reported the experiment results in Figure 7, measured by the Area under the Curve (Au C) aggregated on a monthly basis. The variance of the readings due to the distributed training environment is typically < 0.02%. Table 4 shows that Fourier learning has a clear and consistent advantage of > 0.1% over all benchmarks, none of which possesses a significant (> 0.02%) Au C improvement over the baseline model. Furthermore, this improvement is not due to the increase in the model size, as online learning (large), which has the same number of model parameters as Fourier learning, does not show a consistent Au C gain over the baseline. This is significant for such a large-scale system, since a 0.1% improvement in Au C may bring in millions of dollars of revenue growth on a monthly basis. Apart from Au C, we also examined the calibration of the trained models, aggregated montly. The calibration is defined as Calibration = p CVR/CVR, where p CVR and CVR are the predicted and empirical conversion ratio, averaged over all samples. Note that a well calibrated Ads CVR prediction model should have a calibration reading close to one. As is shown in Table 1, the Fourier learning model s calibration is very close to one and is comparable to others. 7. Conclusion and Discussions In this paper, we introduced a novel framework, Fourier learning , to efficiently train large-scale machine learning Month F Month G Month H Area under the Curve (Au C) Online Learning Online Learning (Large) Positional Encoding Pluralistic Fourier Learning (Ours) Time-Feature Figure 7. Au C for the implemented methods, aggregated monthly. Algorithm Month F Month G Month H Online learning 0.9927 0.9974 0.9977 Online learning (L) 0.9918 0.9974 0.9979 Time-feature 0.9926 0.9974 0.9976 Positional encoding 0.9928 0.9975 0.9977 Pluralistic 0.9929 0.9976 0.9978 Fourier learning (O) 0.9922 0.9973 0.9975 Table 1. Calibration of the algorithms, aggregated monthly. Here, L and O are abbreviations for Large and Ours , respectively. models with cyclical data. Using Fourier analysis, we transformed the learning problem into the frequency domain, and proposed a theoretically guaranteed optimization algorithm to learn the coefficient functions. We introduced F-MLP in the context of deep learning which strictly increases the original model s capacity. We also demonstrated the superiority of Fourier learning over the state-of-the-arts on synthetic, public, and production-level data. This work opens up several research directions. Among them, an important one is to measure the periodicity by the degree to which the data distribution oscillates. This is necessary because, under the current definition provided in this paper, even i.i.d. data can be deemed periodic as T can be arbitrary. However, as explained in Section 1.1 and illustrated in Example 1, Fourier learning may have an advantage over online learning only when the data distribution oscillates. In light of this, how the degree of oscillation relates to the advantage of Fourier learning needs to be answered. Acknowledgements We sincerely thank Maryam Fazel, Kevin Jamieson, Junchi Yang, and Pengkun Yang for useful discussions. We also thank anonymous reviewers for their comments and suggestions during the review process, and pointers to prior literature. Fourier Learning with Cyclical Data Carleson, L. On convergence and growth of partial sums of Fourier series. Acta Mathematica, 116(1):135 157, 1966. Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., et al. Wide & deep learning for recommender systems. In Workshop on Deep Learning for Recommender Systems, pp. 7 10, 2016. Eichner, H., Koren, T., Mc Mahan, B., Srebro, N., and Talwar, K. Semi-cyclic stochastic gradient descent. In International Conference on Machine Learning, pp. 1764 1773. PMLR, 2019. Fukumizu, K., Sriperumbudur, B. K., Gretton, A., and Sch olkopf, B. Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems, pp. 473 480, 2008. Gallant, A. R. and White, H. There exists a neural network that does not make avoidable mistakes. In International Conference on Neural Networks, pp. 657 664, 1988. Go, A., Bhayani, R., and Huang, L. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009, 2009. Gultekin, M. N. and Gultekin, N. B. Stock market seasonality: International evidence. Journal of Financial Economics, 12(4):469 481, 1983. Hazan, E. Introduction to online convex optimization. ar Xiv preprint ar Xiv:1909.05207, 2019. Kairouz, P., Mc Mahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. ar Xiv preprint ar Xiv:1912.04977, 2019. Lopez-Paz, D. and Ranzato, M. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, volume 30, pp. 6467 6476, 2017. Mac Kay, D. J. et al. Introduction to Gaussian processes. NATO ASI series F computer and systems sciences, 168: 133 166, 1998. Mokhtari, A., Shahrampour, S., Jadbabaie, A., and Ribeiro, A. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In Conference on Decision and Control, pp. 7195 7201. IEEE, 2016. Oppenheim, A., Willsky, A., Young, I., and Nawad, H. Signals and Systems. Prentice-Hall signal processing series. Prentice-Hall, 1983. ISBN 9780138097318. Rahimi, A., Recht, B., et al. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 3, pp. 5. Citeseer, 2007. Reed, M. Methods of modern mathematical physics: Functional analysis. Elsevier, 2012. Rudin, W. Fourier analysis on groups. Courier Dover Publications, 2017. Sitzmann, V., Martel, J., Bergman, A., Lindell, D., and Wetzstein, G. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33:7462 7473, 2020. Tancik, M., Srinivasan, P. P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J. T., and Ng, R. Fourier features let networks learn high frequency functions in low dimensional domains. ar Xiv preprint ar Xiv:2006.10739, 2020. Trac a, S., Rudin, C., and Yan, W. Regulating greed over time in multi-armed bandits. Journal of Machine Learning Research, 22:3 1, 2021. Uteuliyeva, M., Zhumekenov, A., Takhanov, R., Assylbekov, Z., Castro, A. J., and Kabdolov, O. Fourier neural networks: A comparative study. Intelligent Data Analysis, 24(5):1107 1120, 2020. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998 6008, 2017. Wahba, G. Spline models for observational data. SIAM, 1990. Yang, Y., Wang, H., Kiyavash, N., and He, N. Learning positive functions with pseudo mirror descent. In Advances in Neural Information Processing Systems, volume 32, pp. 14144 14154, 2019. Zhang, J., Lin, Y., Song, Z., and Dhillon, I. Learning long term dependencies via Fourier recurrent units. In International Conference on Machine Learning, pp. 5815 5823. PMLR, 2018a. Zhang, L., Yang, T., Yi, J., Jin, R., and Zhou, Z.-H. Improved dynamic regret for non-degenerate functions. ar Xiv preprint ar Xiv:1608.03933, 2016. Zhang, L., Yang, T., Zhou, Z.-H., et al. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning, pp. 5882 5891. PMLR, 2018b. Ziyin, L., Hartwig, T., and Ueda, M. Neural networks fail to learn periodic functions and how to fix it. In Advances in Neural Information Processing Systems, volume 33, 2020. Fourier Learning with Cyclical Data A. Proof of Lemma 1 Starting from (2), we have, for any f(x, t) H, L(f0(x, t)) = Ex,y,t Dmod(t,T )(x,y)p(mod(t,T )) ℓ(f0(x, t), y) = Ep(mod(t,T )) Ex,y Dmod(t,T )(x,y) ℓ(f0(x, t), y) Ep(mod(t,T )) Ex,y Dmod(t,T )(x,y) ℓ(f(x, t), y) = L(f(x, t)), (15) where the inequality follows from the assumption that Ex,y Dmod(t,T )(x,y) ℓ(f0(x, t), y) Ex,y Dmod(t,T )(x,y) ℓ(f(x, t), y) (16) for any f(x, t) H. Hence, f0(x, t) is a minimizer of (2). B. Proof of Lemma 2 We prove the lemma in two parts. First, we show that H is a Hilbert space. Then, we show the existence of an isomorphism between H and L2(X) L2(ST ). B.1. Proving H is a Hilbert space. H is a pre-Hilbert space. It is relatively straight-forward that H is a linear vector space, where the zero-vector is f(x) 0 almost everywhere under the uniform measure over X [0, T]. We now show that , is an inner product. First, , is symmetric: 0 f(x, t)g(x, t)dxdt Z 0 g(x, t)f(x, t)dxdt = g, f . It is bi-linear: f1 + f2, g = Z 0 (f1(x, t) + f2(x, t))g(x, t)dxdt 0 f1(x, t)g(x, t)dxdt + Z 0 f2(x, t)g(x, t)dxdt = f1, g + f2, g ; 0 λf(x, t)g(x, t)dxdt = λ Z 0 f(x, t)g(x, t)dxdt = λ f, g . It is positive-definite: 0 f 2(x, t)dxdt 0, Fourier Learning with Cyclical Data where the equality holds if and only if f(x, t) 0 almost everywhere on X [0, T] under the uniform measure. Therefore, , is an inner product, and (H, , ) is a pre-Hilbert space. H is a Hilbert space. A pre-Hilbert space H is a Hilbert space if it is also complete, i.e., every Cauchy sequence {fi} i=1 in H has a limit in H. To prove this, it suffices to find a convergent subsequence of {fi} i=1. Let nk be such that fnk fj 2 2 k, for all j nk. Let Bj = {(x, t) : |fnj+1(x, t) fnj(x, t)| 2 j/3}, and B = k=1 j k Bj. Let µ be the uniform probability measure over X [0, T], then by Chebyshev s inequality, we have (subject to a multiplicative volume constant) µ(Bj) 22j/3 fnj+1 fnj 2 2 j/3. (17) Noticing that P j 2 j/3 is convergent, we can apply the Borel-Cantelli lemma to reach the conclusion that µ(B) = 0. This means that the set of (x, t) that belongs to infinitely many Bj s has 0 measure, which further implies that {fnj} j=1 converges point-wise almost everywhere. Define the point-wise limit as f, and as a standard result in real analysis, we know that f is measurable. Then, it only remains to show that f H. Using Fatou s lemma and the definition of the subsequence {fnj} j=1, we have fnj f 2 = Z 0 lim k |fnj(x, t) fk(x, t)|2dxdt 0 |fnj(x, t) fk(x, t)|2dxdt = lim inf k fnj fk 2 Hence, we have both f 2 by triangle inequality of the norm, and the convergence of fnj to f in H norm. Therefore, H is a Hilbert space. B.2. Proving the existence of an isohorphism. We define the following mapping U such that U(ϕi ψj)(x, t) = ϕi(x)ψj(t) for x X, t [0, T]. Then, U is a unitary mapping of L2(X) L2(ST ) onto H. For any f L2(X) and g L2(ST ), assume i aiϕi(x), and g(t) = X then, by linearity, U(f g)(x, t) = U i,j aibj(ϕi ψj) i,j aibjϕi(x)ψj(t) = f(x)g(t). This shows that there exists an isomorphism between L2(X) L2(ST ) and H. Fourier Learning with Cyclical Data C. Proof of Lemma 5 We prove the lemma in two steps: First, we show that f t (x) is continuous in t for any x X. Exploiting the continuity of f t (x), we show f0(x, t) def = f t (x) for every t is in H. In addition, we need the following lemma. Lemma 10. Let X be some metric space and eh : X 7 R be some function such that eh def = maxx X eh(x) K. Let p, q be two distributions over X. Then EX p[eh(X)] EX q[eh(X)] 2K p q TV. Proof. The proof follows from the following reasoning: EX p[eh(X)] EX q[eh(X)] sup |EX p[h(X)] EX q[h(X)]| h K = K sup EX p = K sup |EX p[h(X)] EX q[h(X)]| h 1 = 2K p q TV. (By the dual representation of total variation distance) C.1. Showing f t (x) is continuous in t for any x X. Without loss of generality, consider a fixed x X and a fixed ϵ > 0. Under a realizable setting, we can define ux(t) def = f t (x) argmin u R Ey Dt(x)[ℓ(u, y)], t [0, T]. Here, u(t) is a well-defined function because ℓ(u, y) is strongly convex in u for each y Y. Since Dt(x) is continuous in t under total variation distance, we can find some δ > 0 such that whenever |t t | δ, we have Dt(x) Dt (x) TV ϵ. Now, for arbitrary t such that |t t | δ, the function gt (u) def = Ey Dt (x)[ℓ(u, y)] is σ-strongly convex in u. By definition, we can see that ux(t ) minimizes gt (u). Thus, the property of strongly convex function yields the following: |ux(t) ux(t )| σ 2 gt (ux(t)) gt (ux(t )) = gt (ux(t)) gt(ux(t)) + gt(ux(t)) gt (ux(t )) gt (ux(t)) gt(ux(t)) + gt(ux(t )) gt (ux(t )) (Since u(t) minimizes gt(u)) = Ey Dt (x)[ℓ(ux(t), y)] Ey Dt(x)[ℓ(ux(t), y)] + Ey Dt(x)[ℓ(ux(t ), y)] Ey Dt (x)[ℓ(ux(t ), y)] 4K Dt(x) Dt (x) TV 4Kϵ. (By Lemma 10 and boundedness assumption of loss function) Since σ > 0, we have |u(t) u(t )| = |f t (x) f t (x)| 8Kσ 1ϵ for all |t t | < δ. Hence, driving ϵ and δ towards 0 leads to the continuity of f t (x) as a function of t for any x X. C.2. Showing f t (x) H. Since f t (x) is continuous in t for any x X, h(t) def = R X |f t (x)|2dx is also continuous in t. Since f t L2(X), h(t) is finite for any t. Therefore, h(t) is bounded over [0, T]. Let M 2 = maxt [0,T ] h(t). We have f0(x, t) 2 H = Z 0 f 2 0 (x, t)dxdt Z T 0 f t (x) 2 L2(X)dt M 2T. (18) This implies that f0(x, t) has finite H-norm, and therefore we have f0(x, t) H. Fourier Learning with Cyclical Data D. Proof of Theorem 6 The proof follows directly upon noticing that {exp(2πjnt/T)} n= , where j is the imaginary number, is a complete orthonormal set of basis for L2(ST ) and applying Eq. (5). E. Computing the Kernel Embeddings of a Functional Gradient In this section, we illustrate the computation of kernel embeddings for functional gradients through an example. Adopting a simplified setting, we consider the following loss function: ℓ(f(x, t), y) = 1 2(f(x, t) y)2. Denoting this functional as L(an, bn) with an and bn defined in (7), we can derive the (stochastic) functional gradient for an as follows given samples (xm, ym, tm)M m=1 that arrive between the (k, τ)-th and the (k, τ + 1)-th iterations: [ an b Lk,τ(an, bn)](x) = m=1 (f(xm, tm) ym) sin(2πnt/T)δ(x xm). Let K( , ) : X X R be a positive-definite kernel, we have the kernel embedding of X [ an b Lk,τ(an, bn)](y)K(x, y)dy m=1 (f(xm, tm) ym) sin(2πnt/T)K(xm, x), where the second equality follows from the property of the Dirac s δ-function. F. Proof of Theorem 8 In this section, we prove the major Theorems related to the convergence of streaming-SGD for Fourier learning. In particular, we first prove 8 under both strongly convex and non-convex setting. Then we prove 9, which extends the result to non-parametric case. F.1. Strongly Convex Case For the sake of simplicity, denote the concatenation of all θ(k,τ) n for n {0, . . . , N} and ρ(k,τ) n for n {1, . . . , N} as U (k,τ), and denote its optimal as U . By the moment assumption, we immediately have max k,τ E b Lk,τ(U (k,τ)) 2 (2N + 1)G2, (19) max k,τ E b Lk,τ(U (k,τ)) p (2N + 1)G2. (20) In addition, denoting ηk = ηk,τ for τ {1, . . . , Γ}, we also have U (k+1,1) = U (k,1) ηk τ=1 b Lk,τ(U (k,τ)), (21) where b Lk,τ is the incremental gradient of L with respect to U calculated using the data that arrived between (k, τ)-th and (k, τ 1)-th iterations.5 Since b Lk,τ(U (k,τ) is computed using only the data between the (k, τ)-th and the (k, τ + 1)-th iterations, we can alternatively write (21) as U (k+1,1) = U (k,1) ηk( b Lk(U (k,1)) e(k)), (22) 5Note that for a batch with size n, we have chosen to compute the empirical average of b Lk,τ using coefficient 1 nΓ instead of 1 design ensures E[b Lk] def = E h PΓ τ=1 b Lk,τ i = L. Fourier Learning with Cyclical Data where b Lk(U (k,1)) is now the stochastic gradient of L evaluated at U (k,1), calculated using the data that arrived in macro-iteration k: τ=1 b Lk,τ(U). (23) Meanwhile, the error term e(k) has the following expression: b Lk,τ(U (k,1)) b Lk,τ(U (k,τ)) . (24) Taking norm on both sides and using triangle inequality, we have τ=1 b Lk,τ(U (k,1)) b Lk,τ(U (k,τ)) . (25) Further invoking the Lipschitzness of the gradient gives us τ=1 Λ U (k,1) U (k,τ) . (26) Since U (k,1) U (k,τ) ηk(τ 1) maxκ,ν b Lκ,ν(U (κ,ν)) ηk(τ 1)G 2N + 1, we can invoke (20) and get τ=1 Λ(τ 1)ηk G 2ΛΓ(Γ 1)ηk G 2N + 1, (27) 6η2 kΛ2Γ2(Γ 1)(2Γ 1)G2(2N + 1) (28) upon taking expectations on both sides. This bound implies that the expected error of the micro-iterations over τ vanishes asymptotically if we choose a set of diminishing step sizes ηk. In the meantime, using (22), we have U (k+1,1) U 2 = U (k,1) U ηk( b Lk(U (k,1)) e(k)) 2 (29) = U (k,1) U 2 + η2 k b Lk(U (k,1)) e(k) 2 2ηk U (k,1) U , b Lk(U (k,1)) e(k) . (30) By the rule of total expectation and the assumption on strong convexity of L, we have E 2ηk U (k,1) U , b Lk(U (k,1)) e(k) = E 2ηk U (k,1) U , b Lk(U (k,1)) E 2ηk U (k,1) U , e(k) (31) = E 2ηk U (k,1) U , L(U (k,1)) E 2ηk U (k,1) U , e(k) (32) 2ηkσE U (k,1) U 2 2ηk E U (k,1) U E e(k) (33) 2ηkσE U (k,1) U 2 2η2 k GΓ 2N + 1 E U (k,1) U (34) Combining (29) and (31), we have E U (k+1,1) U 2 E U (k,1) U 2 + 2η2 k(E b Lk(U (k,1)) 2 + E e(k) 2) 2ηkσE U (k,1) U 2+ 2N + 1 E U (k,1) U . (35) Fourier Learning with Cyclical Data Rearranging and assuming D to be the diameter of the space containing U (k,τ), we can further upper bound the right-hand side of (35) by E U (k+1,1) U 2 (1 2ηkσ)E U (k,1) U 2 + 2η2 k(E b Lk(U (k,1)) 2 + E e(k) 2) + 2η2 k GΓD 2N + 1 (36) = (1 2ηkσ)E U (k,1) U 2 + 2η2 k(2N + 1)G2 + 2η2 k GΓD 3η4 kΛ2Γ2(Γ 1)(2Γ 1)G2(2N + 1) (37) C1 = (2N + 1)G2 + GΓD 6η2 kΓ2Λ2(Γ 1)(2Γ 1)G2(2N + 1), (38) E U (k+1,1) U 2 (1 2ηkσ)E U (k,1) U 2 + 2η2 k C1. (39) Finally, selecting ηk = O(1/k) and using mathematical induction yields E U (k+1,1) U 2 = O(1/k). F.2. Non-Convex Case In this section, we show that gradient norm vanishes for streaming-SGD under a constant or a cycle-wise diminishing learning rate. Following the notations defined in Theorem 8, we invoke the smoothness assumption, and obtain L(U (k+1,1)) L(U (k,1)) ηk L(U (k,1)), V (k) + Λ2η2 k 2 V (k) 2, (40) V (k) = b Lk(U (k,1)) e(k) (41) with e(k) defined in (24). Taking the unconditional expectations on both sides, we have E[L(U (k+1,1))] E[L(U (k,1))] ηk E L(U (k,1)) 2 + ηk E L(U (k,1)), e(k) + Λ2η2 k 2 E V (k) 2 (42) E[L(U (k,1))] ηk E L(U (k,1)) 2 + ηk E L(U (k,1)) E e(k) + Λ2η2 k 2 E V (k) 2, (43) where we have invoked the Cauchy-Schwartz inequality in the second step. Once again, invoking (20) and (27), we obtain the following bounds: E L(U (k,1)) = E E[ b Lk(U (k,1))|F(k)] E Γ max k,τ E[ b Lk,τ(U (k,τ))|F(k)] Γ p (2N + 1)G2, (44) E[ e(k) |F(k)] 1 2Γ(Γ 1)ηk G 2N + 1. (45) Therefore, there exists a constant C7 such that E L(U (k,1)) E e(k) ηk C7. (46) In the meantime, we can upper bound E V (k) 2 by E V (k) 2 2E b Lk(U (k,1)) 2 + 2E e(k) 2, (47) invoking (20) and (28), we see that there exists a constant C8 such that E V (k) 2 C8. (48) Fourier Learning with Cyclical Data Plugging (48) and (46) into (43), we see that there exists a constant C9 such that E[L(U (k+1,1))] E[L(U (k,1))] ηk E L(U (k,1)) 2 + η2 k C9. (49) k=0 ηk E L(U (k,1)) 2 C9 k=0 η2 k + E[L(U (0,1))] E[L(U (Tmax+1,1))] (50) k=0 η2 k + E[L(U (0,1))] L(U ). (51) Constant learning rate. Let ηk = 1/ Tmax + 1. Then 1 Tmax + 1E L(U (k,1)) 2 C9 + max(E[L(U (0,1))] L(U )). (52) Assuming max(E[L(U (0,1))] L(U )) C10, we have min 0 k Tmax E L(U (k,1)) 2 C9 + C10 Tmax + 1. (53) Diminishing learning rate. Let ηk = 1/ p (k + 1). Then, (50) can be re-written as ηk PTmax κ=0 ηκ E L(U (k,1)) 2 C9 PTmax k=0 1 k+1 + C10 PTmax κ=0 ηκ . (54) Here, the left-hand side is lower bounded by ηk PTmax κ=0 ηκ E L(U (k,1)) 2 min 1 k Tmax E L(U (k,1)) 2. (55) In the meantime, the numerator of the right-hand side s numerator is upper bounded by O(log Tmax), whereas the denominator is bounded by Θ( Tmax). Hence, we see that min 1 k Tmax E L(U (k,1)) 2 = O(log Tmax/ p Tmax). (56) G. Proof of Theorem 9 The non-parametric case is more challenging as the update rule involves the notion of pseudo-gradients. Unlike the parametric case, where the update is performed in a Euclidean space, the updates for the non-parametric case are performed in H. Nevertheless, the majority of the proof for the parametric case can be retained as follows. Auxiliary results and lemmas. Similar to the parametric case, we combine all an and bn into a vector U (k,τ)(x) = [a(k,τ) 0 (x), . . . , a(k,τ) N (x), b(k,τ) 1 (x), . . . , b(k,τ) N (x)] H2N+1. (57) In addition, we define a composite norm for such vectors, denoted by . Denoting the i-th component of U (k,τ) as U (k,τ) i , we define, with a slight abuse of notations, U (k,τ) def = U (k,τ) 1 H, . . . , U (k,τ) 2N+1 H This composite norm first computes the H-norm of each element of U (k,τ), and then compute the vector norm of the resulting vector. With this composite norm, we can easily check that (19) and (20) hold for the non-parametric case as well. Fourier Learning with Cyclical Data Next, we proceed to write the composite update for the non-parametric case. Since we have chosen the pseudo-gradient to be the kernel embedding of the (incremental) functional gradient, we have U (k+1,1)(x) = U (k,1)(x) ηk τ=1 [ b Lk,τ(U (k,τ))]( ), K(x, ) (59) = U (k,1)(x) ηk [ b Lk(U (k,1)) e(k)]( ), K(x, ) , (60) where, once again, b Lk(U (k,1)) is the empirical gradient function calculated using the samples that arrive during the macro-iteration k, whereas b Lk,τ(U (k,τ)) = [ a0 b Lk,τ(U (k,τ)), . . . , a N b Lk,τ(U (k,τ)), b1 b Lk,τ(U (k,τ)), . . . , bn b Lk,τ(U (k,τ))] . (61) Here, K( , ) is a (2N + 1)-dimensional vector stacked up by the kernel K( , ): K( , ) = [K( , ), . . . , K( , )] . (62) Similar to the definition of the composite norm, the inner product in (60) should be interpreted as a composite inner product, which first takes the inner product of H over each element of b Lk,τ(U (k,τ)) and K( , ), and then takes the vector dot product over the resulting vectors in the (2N + 1)-dimensional Euclidean space. Lastly, in (60), e(k) is the accumulated error function: b Lk,τ(U (k,1)) b Lk,τ(U (k,τ)) . (63) Note that e(k) is a function of x, as are b Lk,τ(U (k,1)) and b Lk,τ(U (k,τ)). Taking the composite norm on both sides, and invoking the Lipschitzness of the gradient for the H-norm, we have τ=1 b Lk,τ(U (k,1)) b Lk,τ(U (k,τ)) τ=1 Λ U (k,1) U (k,τ) . (64) Since U (k,1) U (k,τ) ηk(τ 1) maxκ,ν b Lκ,ν(U (κ,ν)) ηk(τ 1)G 2N + 1, we can invoke (20) and get E[ e(k) |F(k)] 1 2Γ(Γ 1)ηk G 2N + 1, (65) E[ e(k) 2|F(k)] 1 6η2 kΛ2Γ2(Γ 1)(2Γ 1)G2(2N + 1), (66) upon taking expectations on both sides. Similar to the parametric case, this bound again implies that the approximation error of U (k,τ) using U (k,1) vanishes if we choose a set of diminishing step sizes ηk. The remaining part of the proof deviates from the parametric case as the updates use pseudo-gradients. Before introducing the main proof, we first need the following lemma. Lemma 11 (Lemma 11 of (Yang et al., 2019)). Letting be a norm and , be its dual norm. In addition, suppose f H is continuously differentiable, and satisfies f(x + y) f(y) L y , . Then f(x + y) f(x) f(x), y L 2 y , . (67) Main proof. Letting , be the composite norm, and denoting V (k)(x) = [ b Lk(U (k,1)) e(k)]( ), K(x, ) (68) Fourier Learning with Cyclical Data as the pseudo-gradient with error during the macro-iteration k, we can invoke Lemma 11 and (60) on the loss function L, and obtain L(U (k+1,1)) L(U (k,1)) ηk L(U (k,1)), V (k) + Λ2η2 k 2 V (k) 2. (69) Taking unconditional expectations on both sides, we have E L(U (k+1,1)) L(U ) E L(U (k,1)) L(U ) ηk E L(U (k,1)), E[V (k)|F(k)] + Λ2η2 k 2 E V (k) 2 (70) E L(U (k,1)) L(U ) ηkλ L(U (k,1)) 2 + Λ2η2 k 2 E V (k) 2+ + ηk E [ L(U (k,1))]( ), E[e(k)|F(k)]( ), K( , ) (71) where F(k) is the minimum σ-algebra generated by U (κ,τ) for all κ k and 1 τ Γ. In addition, the second inequality is obtained by invoking the assumption that the minimum eigenvalue of K( , ) is λ > 0. The last term on the right-hand side has two inner products: the outer one is taken with respect to , while the inner is taken with respect to . Note that, both inner products are composite. By Cauchy s inequality, and the upper bound on E[ e(k) |F(k)], we get E [ L(U (k,1))]( ), E[e(k)|F(k)]( ), K( , ) E L(U (k,1)) E K( , ) 1 2Γ(Γ 1)ηk G 2N + 1. (72) C2 = E K( , ) 1 2N + 1, (73) E [ L(U (k,1))]( ), E[e(k)|F(k)]( ), K( , ) C2ηk E L(U (k,1)) . (74) Note that, in (74), we can further upper bound E L(U (k,1)) by the convexity of the composite norm: E L(U (k,1)) = E E[ b Lk(U (k,1))|F(k)] E Γ max k,τ E[ b Lk,τ(U (k,τ))|F(k)] Γ p (2N + 1)G2. (75) Therefore, letting C3 = C2Γ p (2N + 1)G2 gives us E [ L(U (k,1))]( ), E[e(k)|F(k)]( ), K( , ) ηk C3. (76) In the meantime, we can invoke the Cauchy s inequality again, and, according to the expression of V (k) given in (68), we have V (k)(x) K(x, ) b Lk(U (k,1)) e(k) max x X K(x, ) ( b Lk(U (k,1)) + e(k) ). (77) Since X is compact, there exists a constant C4 such that V (k) 2 C4( b Lk(U (k,1)) + e(k) )2 2C4( b Lk(U (k,1)) 2 + e(k) 2). (78) Taking the unconditional expectations on both sides, and invoking the upper bounds in (19) and (66), we get E V (k) 2 2C4 (2N + 1)G2 + 1 6η2 kΛ2Γ2(Γ 1)(2Γ 1)G2(2N + 1) . (79) Fourier Learning with Cyclical Data For simplicity, we denote the right-hand side of (79) as C5, which gives the following succinct relationship: E V (k) 2 C5. (80) Combining (80), (76) and (71), we get E L(U (k+1,1)) L(U ) E L(U (k,1)) L(U ) ηkλE L(U (k,1)) 2 + η2 k C6, (81) 2 + C3. (82) Since L is σ-strongly-convex, we have E L(U (k,1)) 2 2σ 1E L(U (k,1)) L(U ) . (83) E L(U (k+1,1)) L(U ) (1 2σ 1ληk)E L(U (k,1)) L(U ) + η2 k C6. (84) ηk = σ(2k + 1) 2λ(k + 1)2 , (85) E L(U (k+1,1)) L(U ) k2 (k + 1)2 E L(U (k,1)) L(U ) + C6 (2k + 1)2 (k + 1)4 , (86) or equivalently, (k + 1)2E L(U (k+1,1)) L(U ) k2E L(U (k,1)) L(U ) + C6 (2k + 1)2 (k + 1)2 (87) k2E L(U (k,1)) L(U ) + 4C6. (88) δ(k) = k2E L(U (k,1)) L(U ) , (89) we immediately have δ(k + 1) δ(k) + 4C6. (90) Noticing that δ(0) = 0, we can sum both sides over 0 to Tmax, and get δ(Tmax) 4Tmax C6, (91) E L(U (Tmax,1)) L(U ) = O(T 1 max). (92) The bound on the norm difference follows from the strong convexity assumption of L. Fourier Learning with Cyclical Data 2 4 6 8 10 12 14 Day Test Accuracy Pluralistic Fourier Learning Online Learning Time-Feature 2 4 6 8 10 12 14 Day Test Accuracy Pluralistic Fourier Learning Online Learning Time-Feature 2 4 6 8 10 12 14 Day Test Accuracy Pluralistic Fourier Learning Online Learning Time-Feature 2 4 6 8 10 12 14 Day Test Accuracy Pluralistic Fourier Learning Online Learning Time-Feature Figure 8. Fourier learning over Sentiment140 Twitter dataset with learning rates 0.05 (top left), 0.1 (top right), 0.2 (bottom left), 0.4 (bottom right), respectively. H. Multi-Dimensional Fourier-MLP If the Fourier layer is to replace a hidden layer with a multi-dimensional output, we can substitute the summation of the Fourier layer in Figure 3 with another MLP, so that each output neuron has a unique time interpolation. Using F-MLP[N] d1 d2(x) Rd2 1 as a general expression for an F-MLP with input dimension d1 and output dimension d2, we have: F-MLP[N] d1 d2(x) = (W2 COS) MLPd1 (N+1)(x) + (W1 SIN) MLPd1 N(x), (93) where x Rd1 1 is the input to the Fourier layer; MLPd1 N(x) RN 1 is a regular MLP that maps x into a vector of dimension N, having no activations; W1 Rd2 N and W2 Rd2 (N+1); while SIN and COS are matrices stacked up by row vectors [sin(2πt/T), . . . , sin(2πNt/T)] and [1, cos(2πt/T), . . . , cos(2πNt/T)] a total of d2 times. The operator is the Hadamard product. When d2 = 1, W1 and W2 can be merged into MLPd1 N and MLPd1 (N+1), which serve the role of an(x) and bn(x) in (7), respectively. I. Additional Experiment Results I.1. Sentiment140 We report additional results for Fourier learning over the Sentiment140 Twitter dataset in Figure 8. Fourier learning has a better performance across the board. In Figure 9, online learning is still inferior than Fourier learning even after we trippled its network size. This suggests that the gap in Figures 5 and 8 cannot be easily mitigated by tweaking the benchmark. I.2. Fourier Learning in Recommender Systems We report the raw data used to plot Figure 7 in Table 2. Fourier Learning with Cyclical Data 2 4 6 8 10 12 14 Day Test Accuracy Fourier Learning Online Learning Online Learning (3x Model Size) Figure 9. F-MLP against a large online learning model. Algorithm Month F Month G Month H Online learning 0.85504 0.85812 0.86350 Online learning (large) 0.85538 0.85766 0.86254 Time-feature 0.85502 0.85810 0.86350 Positional encoding 0.85481 0.85813 0.86351 Pluralistic 0.85497 0.85832 0.86367 Fourier learning (ours) 0.85627 0.85935 0.86456 Table 2. Au C for the implemented algorithms, aggregated monthly.