Journal of Machine Learning Research 23 (2022) 1-94. Submitted 9/21; Revised 7/22; Published 7/22

Functional Linear Regression with Mixed Predictors

Daren Wang (dwang24@nd.edu), Department of ACMS, University of Notre Dame, Indiana, USA
Zifeng Zhao (zifeng.zhao@nd.edu), Mendoza College of Business, University of Notre Dame, Indiana, USA
Yi Yu (Yi.Yu.2@warwick.ac.uk), Department of Statistics, University of Warwick, Coventry, UK
Rebecca Willett (willett@uchicago.edu), Department of Statistics, University of Chicago, Illinois, USA

Editor: Pradeep Ravikumar

Abstract

We study a functional linear regression model that deals with functional responses and allows for both functional covariates and high-dimensional vector covariates. The proposed model is flexible and nests several functional regression models in the literature as special cases. Based on the theory of reproducing kernel Hilbert spaces (RKHS), we propose a penalized least squares estimator that can accommodate functional variables observed on discrete sample points. Besides a conventional smoothness penalty, a group Lasso-type penalty is further imposed to induce sparsity in the high-dimensional vector predictors. We derive finite sample theoretical guarantees and show that the excess prediction risk of our estimator is minimax optimal. Furthermore, our analysis reveals an interesting phase transition phenomenon: the optimal excess risk is determined jointly by the smoothness and the sparsity of the functional regression coefficients. A novel efficient optimization algorithm based on iterative coordinate descent is devised to handle the smoothness and group penalties simultaneously. Simulation studies and real data applications illustrate the promising performance of the proposed approach compared to the state-of-the-art methods in the literature.

Keywords: Reproducing kernel Hilbert space, functional data analysis, high-dimensional regression, minimax optimality.
© 2022 Daren Wang, Zifeng Zhao, Yi Yu and Rebecca Willett. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-1091.html.

1. Introduction

Functional data analysis is the collection of statistical tools and results revolving around the analysis of data in the form of (possibly discrete samples of) functions, images and more general objects. Recent technological advances in various application areas, including neuroscience (e.g. Petersen et al., 2019; Dai et al., 2019), medicine (e.g. Chen et al., 2017; Ratcliffe et al., 2002), linguistics (e.g. Hadjipantelis et al., 2015; Tavakoli et al., 2019), finance (e.g. Fan et al., 2014; Benko, 2007), economics (e.g. Yan, 2007; Ramsay and Ramsey, 2002), transportation (e.g. Chiou et al., 2014; Wagner-Muns et al., 2017), climatology (e.g. Fraiman et al., 2014; Bonner et al., 2014), and others, have spurred an increase in the popularity of functional data analysis. The statistical research in functional data analysis has covered a wide range of topics and areas. We refer the readers to the recent comprehensive review of Wang et al. (2016).

In this paper, we are concerned with a general functional linear regression model that deals with functional responses and accommodates both functional and vector covariates. By rescaling if necessary, we assume without loss of generality that the domain of the functional variables is $[0, 1]$. Let $A^*(\cdot, \cdot): [0,1] \times [0,1] \to \mathbb{R}$ be a bivariate coefficient function and $\{\beta_j^*(\cdot): [0,1] \to \mathbb{R}\}_{j=1}^p$ be a collection of $p$ univariate coefficient functions.
The functional linear regression model considered in this paper is
$$Y_t(r) = \int_{[0,1]} A^*(r, s) X_t(s)\, ds + \sum_{j=1}^{p} \beta_j^*(r) Z_{tj} + \epsilon_t(r), \quad r \in [0,1], \qquad (1)$$
where $Y_t(r): [0,1] \to \mathbb{R}$ is the functional response, $X_t(s): [0,1] \to \mathbb{R}$ is the functional covariate, $Z_t = (Z_{tj})_{j=1}^p \in \mathbb{R}^p$ is the vector covariate and $\epsilon_t(r): [0,1] \to \mathbb{R}$ is the functional noise such that $\mathbb{E}(\epsilon_t(r)) = 0$ and $\mathrm{Var}(\epsilon_t(r)) < \infty$ for all $r \in [0,1]$. The index $t \in \{1, 2, \dots, T\}$ denotes $T$ independent and identically distributed samples. The vector dimension $p$ is allowed to diverge as the sample size $T$ grows unbounded. Furthermore, instead of assuming the functional variables are fully observed, we consider a more realistic setting where the functional covariates $\{X_t\}_{t=1}^T$ and responses $\{Y_t\}_{t=1}^T$ are only observed on discrete sample points $\{s_i\}_{i=1}^{n_1}$ and $\{r_j\}_{j=1}^{n_2}$, respectively. The two collections of sample points do not need to coincide.

As an important real-world example, consider a dataset collected from the popular crowdfunding platform kickstarter.com (see more details in Section 5.4). The website provides a platform for start-ups to create fundraising campaigns and charges a 5% service fee on the final funds raised by each campaign over its 30-day campaign duration. Let $N_t(r)$ denote the pledged fund for the campaign indexed by $t$ at time $r$; then $\{N_t(r), r \in [0, 30]\}$ forms a fundraising curve. For both the platform and the campaign creators, it is of vital interest to generate an accurate prediction of the future fundraising path $\{N_t(r), r \in (s, 30]\}$ at an early time $s \in (0, 30)$: knowledge of $\{N_t(r), r \in (s, 30]\}$ helps the platform better assess its future revenue, and further suggests timing along $(s, 30]$ for potential intervention by the creators and the platform to boost the fundraising campaign and achieve a better outcome. At time $s$, to predict the functional response $Y_t$, i.e. $\{N_t(r), r \in (s, 30]\}$, a functional regression as proposed in model (1) can be built based on the functional covariate $X_t$, i.e.
$\{N_t(r), r \in [0, s]\}$, and vector covariates $Z_t$, such as the number of creators and product updates of campaign $t$. Figure 1 plots the normalized fundraising curves of six representative campaigns and the functional predictions given by our proposed method (RKHS) and two competitors, FDA in Ramsay and Silverman (2005) and PFFR in Ivanescu et al. (2015). In general, the proposed method achieves more favorable performance. A more detailed real data analysis is presented in Section 5.4.

Figure 1: Observed (normalized) fundraising curves $\{N_t(r)\}$ (dots) of six representative campaigns and functional predictions (solid lines) given by RKHS, FDA and PFFR at $s = $ the 14th day (dashed vertical line), based on the functional covariate $\{N_t(r), r \in [0, s]\}$.

1.1 Literature review

As mentioned in recent reviews, e.g. Wang et al. (2016) and Morris (2015), there are in general three types of functional linear regression models: 1) functional covariates and scalar or vector responses (e.g. Cardot et al., 2003; Cai and Hall, 2006; Hall and Horowitz, 2007; Yuan and Cai, 2010; Raskutti et al., 2012; Cai and Yuan, 2012; Jiang et al., 2014; Fan et al., 2015); 2) functional covariates and functional responses (e.g.
Wu et al., 1998; Liang et al., 2003; Yao et al., 2005; Fan et al., 2014; Ivanescu et al., 2015; Sun et al., 2018); and 3) vector covariates and functional responses (e.g. Laird and Ware, 1982; Li and Hsing, 2010; Faraway, 1997; Wu and Chiang, 2000). Another closely related area is functional time series; for instance, functional autoregressive models also preserve the regression format. The literature on functional time series is also abundant, including van Delft et al. (2017), Aue et al. (2015), Bathia et al. (2010), Wang et al. (2020a) and many others.

Regarding the sampling scheme, to establish theoretical guarantees it is often assumed in the literature that functions are fully observed, which is generally untrue in practice. To study the properties of functional regression under the more realistic scenario where functions are observed only on discrete sample points, one usually imposes certain smoothness conditions on the underlying functions. The errors introduced by the discretization of functions can therefore be controlled as a function of the smoothness level and $n$, the number of discretized observations available for each function. Some fundamental tools regarding this aspect were established in Mendelson (2002) and Bartlett et al. (2005), and were further applied to various regression analysis problems (e.g. Raskutti et al., 2012; Koltchinskii and Yuan, 2010; Cai and Yuan, 2012).

Regarding the statistical inference task, in the regression context, estimation and prediction are two indispensable pillars. In the functional regression literature, Valencia and Yuan (2013), Cai and Yuan (2011), Yuan and Cai (2010), Lin and Yuan (2006), Park et al. (2018), Fan et al. (2014), Fan et al. (2015), and Reimherr et al. (2019), among many others, have studied different aspects of the estimation problem. As for prediction, the existing literature includes Cai and Yuan (2012), Cai and Hall (2006), Ferraty and Vieu (2009), Sun et al.
(2018) and Reimherr et al. (2018), to name but a few.

In contrast to the aforementioned three paradigms of functional regression, in this paper we study the functional linear regression problem in model (1), with functional responses and mixed predictors that consist of both functional and vector covariates. We assume the functions are only observed on discrete sample points and aim to derive optimal upper bounds on the excess prediction risk (see Definition 2). More detailed comparisons with the existing literature are deferred until we present the main results in Section 3.

1.2 Main contributions

The main contributions of this paper are summarized as follows.

Firstly, to the best of our knowledge, the model we study in this paper, as defined in (1), is among the most flexible in the functional linear regression literature. In terms of predictors, we allow for both functional and vector covariates. In terms of model dimensionality, we allow the dimension $p$ of the vector covariates to grow exponentially as the sample size $T$ diverges. In terms of function spaces, as elaborated later, we allow the coefficients of the functional and vector covariates to be from different reproducing kernel Hilbert spaces (RKHS). In terms of dependence between the functional and vector covariates, we allow the correlation to be of order up to $O(1)$, which is a rather weak condition, especially considering that the vector covariates are high dimensional. In terms of the sampling scheme, we allow the functional covariate and response to be observed on (possibly different) discretized sample points. This general and unified framework poses new challenges for both theory and optimization, as we elaborate in detail later.

Secondly, we develop new peeling techniques in our theoretical analysis, which are crucial and fundamental in dealing with the potentially exponentially growing vector covariate dimension.
Existing asymptotic analysis techniques and results in the functional regression literature are insufficient to provide Lasso-type guarantees in the presence of high-dimensional covariates. See Remark 5 for more details.

Thirdly, we demonstrate an interesting phase transition phenomenon in terms of the smoothness of the functional covariate and the dimensionality of the vector covariate. To be specific, let $s$ be the sparsity of $\{\beta_j^*\}_{j=1}^p$, i.e. the number of non-zero univariate coefficient functions. Let $\delta_T$ be a quantity jointly determined by the complexities and the alignment of the two RKHSs where the functional covariates $\{X_t\}_{t=1}^T$ and the bivariate coefficient function $A^*$ reside; see Theorem 4 for the detailed definition of $\delta_T$. We show that the excess prediction risk of our proposed estimator is upper bounded by $O_P\{s\log(pT)T^{-1} + \delta_T\}$ plus a discretization error. When $\delta_T \lesssim s\log(pT)T^{-1}$, the excess risk is dominated by $s\log(pT)T^{-1}$, which is the standard excess risk rate in the high-dimensional parametric statistics literature (e.g. Bühlmann and van de Geer, 2011); the difficulty is therefore dominated by estimating $\{\beta_j^*\}_{j=1}^p$, the $p$ univariate coefficient functions. When $\delta_T \gtrsim s\log(pT)T^{-1}$, the excess risk is dominated by $\delta_T$, and the difficulty is dominated by estimating the bivariate coefficient function $A^*$; we show that in this regime the optimal excess risk of model (1) is of order $\delta_T$. We further develop matching lower bounds to justify that this phase transition between the high-dimensional parametric rate and the nonparametric rate is indeed minimax optimal; see Section 3.3. To the best of our knowledge, this is the first time such a phenomenon has been observed in the functional regression literature.
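The phase transition can be illustrated numerically. The following sketch (our illustration, not code from the paper) compares the two rates for given $T$, $p$, $s$ and eigen-decay exponent $r$, ignoring constants and the discretization term:

```python
import math

def dominant_term(T, p, s, r):
    """Compare the nonparametric rate delta_T = T^(-2r/(2r+1)) against the
    high-dimensional parametric rate s * log(p*T) / T (constants and the
    discretization term are ignored; illustrative only)."""
    delta_T = T ** (-2 * r / (2 * r + 1))
    param = s * math.log(p * T) / T
    return ("nonparametric", delta_T) if delta_T >= param else ("parametric", param)

# Smooth setting (fast eigen-decay, r = 4): the parametric rate dominates.
print(dominant_term(T=10_000, p=1_000, s=2, r=4))
# Rough setting (slow eigen-decay, r close to 1/2): the nonparametric rate dominates.
print(dominant_term(T=10_000, p=1_000, s=2, r=0.6))
```

Larger $r$ (a faster-decaying spectrum) shrinks $\delta_T$ and pushes the problem into the high-dimensional parametric regime, while larger $s$ or $p$ pushes it the other way.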
Note that this phase transition and its associated optimality are not with respect to the discretization error, i.e. with respect to $n$, the minimum number of observations obtained for each function.

Lastly, we derive a representer theorem and further propose a novel optimization algorithm via iterative coordinate descent, which efficiently solves a sophisticated penalized least squares problem with the presence of both a ridge-type smoothing penalty and a group Lasso-type sparsity penalty.

The rest of the paper is organized as follows. In Section 2, we introduce relevant definitions and quantities. The main theoretical results are presented in Section 3. The optimization procedure is discussed in Section 4. Numerical experiments and real data applications are conducted in Section 5 to illustrate the favorable performance of the proposed estimator over existing approaches in the literature. Section 6 concludes with a discussion.

Let $\mathbb{N}$ denote the collection of positive integers. Let $a(n)$ and $b(n)$ be two quantities depending on $n$. We write $a(n) \lesssim b(n)$ if there exist $n_0 \in \mathbb{N}$ and an absolute constant $C > 0$ such that $a(n) \le C b(n)$ for any $n \ge n_0$, and $a(n) \asymp b(n)$ if $a(n) \lesssim b(n)$ and $b(n) \lesssim a(n)$. For any function $g: [0,1] \to \mathbb{R}$, denote $\|g\|_{L^2}^2 = \int_{[0,1]} g^2(r)\, dr$ and $\|g\|_\infty = \sup_{r \in [0,1]} |g(r)|$. For any $p \in \mathbb{N}$ and $a, b \in \mathbb{R}^p$, let $\langle a, b \rangle_p = \sum_{i=1}^p a_i b_i$. Let $L^2 = L^2([0,1])$ be the space of all square-integrable functions with respect to the uniform distribution on $[0,1]$, i.e. $L^2 = \{f: [0,1] \to \mathbb{R},\ \|f\|_{L^2} < \infty\}$. For any $\alpha > 0$, let $W^{\alpha,2}$ be the Sobolev space of order $\alpha$. For a positive integer $\alpha$,
$$W^{\alpha,2} = \{f \in L^2[0,1]: f^{(\alpha-1)} \text{ is absolutely continuous and } \|f^{(\alpha)}\|_{L^2} < \infty\},$$
where $f^{(k)}$ is the $k$-th weak derivative of $f$, $k \in \mathbb{N}$. In this case the Sobolev norm of $f$ is defined as $\|f\|_{W^{\alpha,2}}^2 = \|f\|_{L^2}^2 + \|f^{(\alpha)}\|_{L^2}^2$. For non-integer $\alpha > 0$, the Sobolev space $W^{\alpha,2}$ and the norm $\|\cdot\|_{W^{\alpha,2}}$ are also well defined (see e.g. Brezis, 2010). Note that if $0 < \alpha_1 < \alpha_2$, then $W^{\alpha_2,2} \subseteq W^{\alpha_1,2}$.
2. Background

In this section, we provide some background on the fundamental tools used in this paper, with RKHS and compact linear operators studied in Section 2.1 and Appendix A.1, respectively.

2.1 Reproducing kernel Hilbert spaces (RKHS)

Consider a Hilbert space $\mathcal{H} \subseteq L^2$ and its associated inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, under which $\mathcal{H}$ is complete. We assume that there exists a continuous symmetric nonnegative-definite kernel function $K: [0,1] \times [0,1] \to \mathbb{R}^+$ such that the space $\mathcal{H}$ is an RKHS, in the sense that for each $r \in [0,1]$, the function $K(\cdot, r) \in \mathcal{H}$ and $g(r) = \langle g(\cdot), K(\cdot, r) \rangle_{\mathcal{H}}$ for all $g \in \mathcal{H}$. To emphasize this relationship, we write $\mathcal{H}(K)$ for an RKHS with $K$ as the associated kernel. We say $K$ is a bounded kernel if $\sup_{r \in [0,1]} K(r, r) < \infty$.

It follows from Mercer's theorem (Mercer, 1909) that there exists an orthonormal basis $\{\varphi_k\}_{k=1}^\infty \subseteq L^2$ of $L^2$ such that a nonnegative-definite kernel function $K(\cdot, \cdot)$ has the representation
$$K(s, r) = \sum_{k=1}^{\infty} \mu_k \varphi_k(s) \varphi_k(r), \quad s, r \in [0,1],$$
where $\mu_1 \ge \mu_2 \ge \cdots \ge 0$ are the eigenvalues of $K$ and $\{\varphi_k\}_{k=1}^\infty$ are the corresponding eigenfunctions. In the rest of this paper, when there is no ambiguity, we drop the dependence of the $\mu_k$'s and $\varphi_k$'s on their associated kernel function $K$ for ease of notation. Note that any function $f \in \mathcal{H}(K)$ can be written as
$$f(r) = \sum_{k=1}^{\infty} a_k \varphi_k(r), \quad r \in [0,1], \quad \text{where } a_k = \int_{[0,1]} f(r) \varphi_k(r)\, dr,$$
and its RKHS norm is defined as $\|f\|_{\mathcal{H}(K)} = \sqrt{\sum_{k=1}^\infty a_k^2 / \mu_k}$. Thus, for the eigenfunctions, we have $\|\varphi_k\|_{\mathcal{H}(K)}^2 = \mu_k^{-1}$. Throughout this section, we further denote $\psi_k = \sqrt{\mu_k}\, \varphi_k$ and note that $\|\psi_k\|_{\mathcal{H}(K)} = 1$ for $k \in \mathbb{N}$.

Define the linear map $L_K: L^2 \to L^2$, associated with $K$, as $L_K(f)(\cdot) = \int_{[0,1]} K(\cdot, r) f(r)\, dr$. It holds that $L_K(\varphi_k) = \mu_k \varphi_k$ for $k \in \mathbb{N}$. Furthermore, define $L_{K^{1/2}}: L^2 \to \mathcal{H}(K)$ such that $L_{K^{1/2}}(\varphi_k) = \sqrt{\mu_k}\, \varphi_k$, and define $L_{K^{-1/2}}: \mathcal{H}(K) \to L^2$ such that $L_{K^{-1/2}}(\varphi_k) = \mu_k^{-1/2} \varphi_k$, for $k \in \mathbb{N}$. For any two bivariate functions $R_1(\cdot,\cdot), R_2(\cdot,\cdot): [0,1] \times [0,1] \to \mathbb{R}$, define $R_1 R_2(r, s) = \int_{[0,1]} R_1(r, u) R_2(u, s)\, du$, $r, s \in [0,1]$.
It holds that $L_{R_1 R_2} = L_{R_1} \circ L_{R_2}$, where $L_{R_1} \circ L_{R_2}$ denotes the composition of $L_{R_1}$ and $L_{R_2}$, which is a linear map from $L^2$ to $L^2$ given that both $L_{R_1}$ and $L_{R_2}$ are linear maps from $L^2$ to $L^2$.

2.2 Bivariate functions and compact linear operators

To regulate the bivariate coefficient function $A^*$ in model (1), we consider a class of compact linear operators on $\mathcal{H}(K)$. Specifically, for any compact linear operator $A_2: \mathcal{H}(K) \to \mathcal{H}(K)$, denote $A_2[f, g] = \langle A_2[g], f \rangle_{\mathcal{H}(K)}$, $f, g \in \mathcal{H}(K)$. Note that $A_2[f, g]$ is well defined for any $f, g \in \mathcal{H}(K)$ due to the compactness of $A_2$. Let $\{\psi_i\}_{i=1}^\infty$ be the eigenbasis of $\mathcal{H}(K)$ and $a_{ij} = A_2[\psi_i, \psi_j] = \langle A_2[\psi_j], \psi_i \rangle_{\mathcal{H}(K)}$, $i, j \in \mathbb{N}$. It then follows from Theorem 9 in the appendix that, for any $f, g \in \mathcal{H}(K)$,
$$A_2[f, g] = \sum_{i,j=1}^{\infty} a_{ij} \langle f, \psi_i \rangle_{\mathcal{H}(K)} \langle g, \psi_j \rangle_{\mathcal{H}(K)}. \qquad (2)$$
Note that via (2), we can define a bivariate function $A_1(r, s)$, $r, s \in [0,1]$, affiliated with the compact operator $A_2$. Specifically, plugging $f = K(r, \cdot)$ and $g = K(s, \cdot)$ into (2), we have
$$A_2[K(r, \cdot), K(s, \cdot)] = \sum_{i,j=1}^{\infty} a_{ij} \langle K(r, \cdot), \psi_i \rangle_{\mathcal{H}(K)} \langle K(s, \cdot), \psi_j \rangle_{\mathcal{H}(K)} = \sum_{i,j=1}^{\infty} a_{ij} \psi_i(r) \psi_j(s) = A_1(r, s).$$
From the above setup, it holds that $A_2[v](r) = \langle A_1(r, \cdot), v(\cdot) \rangle_{\mathcal{H}(K)}$ for all $v \in \mathcal{H}(K)$ and $r \in [0,1]$. We have thus established an equivalence between a compact linear operator $A_2$ and its corresponding bivariate function $A_1(r, s)$: any compact linear operator $A_2: \mathcal{H}(K) \to \mathcal{H}(K)$ can be viewed as a bivariate function $A_1: [0,1] \times [0,1] \to \mathbb{R}$.

To facilitate the later formulation of the penalized convex optimization and the derivation of the representer theorem, we further focus on Hilbert-Schmidt operators, an important subclass of compact operators. A compact operator $A_2$ is Hilbert-Schmidt if
$$\|A_2\|_{F(K)}^2 = \sum_{i,j=1}^{\infty} \langle A_2[\psi_j], \psi_i \rangle_{\mathcal{H}(K)}^2 = \sum_{i,j=1}^{\infty} a_{ij}^2 < \infty. \qquad (3)$$
In the rest of this paper, for ease of presentation, we slightly abuse notation and refer to a compact linear operator and its corresponding bivariate function by the same symbol $A$.
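As a numerical companion to Section 2.1, the following sketch approximates the Mercer eigenpairs of a kernel on a quadrature grid and checks the expansion $K(s, r) = \sum_k \mu_k \varphi_k(s) \varphi_k(r)$ and the eigen-relation $L_K \varphi_k = \mu_k \varphi_k$. The kernel $K(s, r) = \min(s, r)$ is an illustrative choice, not one used in the paper:

```python
import numpy as np

# Midpoint quadrature grid on [0, 1].
n = 200
grid = (np.arange(n) + 0.5) / n
w = 1.0 / n                              # uniform quadrature weight

# Illustrative bounded kernel: K(s, r) = min(s, r).
K = np.minimum.outer(grid, grid)

# The integral operator L_K f = int K(., r) f(r) dr becomes the matrix K * w;
# its eigenpairs approximate the Mercer eigenvalues mu_k and eigenfunctions phi_k.
evals, evecs = np.linalg.eigh(K * w)
order = np.argsort(evals)[::-1]
mu = evals[order]
phi = evecs[:, order] / np.sqrt(w)       # normalized so that int phi_k^2 dr = 1

# Mercer expansion check: K(s, r) = sum_k mu_k phi_k(s) phi_k(r).
K_hat = (phi * mu) @ phi.T
print(np.max(np.abs(K - K_hat)))         # near machine precision

# Eigen-relation check: L_K phi_1 = mu_1 phi_1.
print(np.max(np.abs((K * w) @ phi[:, 0] - mu[0] * phi[:, 0])))

# For this kernel, mu_k = ((k - 1/2) * pi)^{-2}; compare the leading eigenvalue.
print(mu[0], 4 / np.pi ** 2)
```

The grid eigenvalues converge to the continuous Mercer eigenvalues as the grid is refined; for $\min(s, r)$ the known eigenfunctions are $\sqrt{2}\sin((k - 1/2)\pi t)$, which makes the check concrete.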
3. Main results

3.1 The constrained/penalized estimator and the representer theorem

Recall that in model (1), the response is the function $Y$, the covariates include the function $X$ and the vector $Z$, and the unknown parameters are the bivariate coefficient function $A^*$ and the univariate coefficient functions $\beta^* = \{\beta_l^*\}_{l=1}^p$. Given the observations $\{X_t(s_i), Z_t, Y_t(r_j)\}_{t=1, i=1, j=1}^{T, n_1, n_2}$, our main task is to estimate $A^*$ and $\beta^*$.

Define the weight functions $w_s(i) = (n_1 + 1)(s_i - s_{i-1})$ for $1 \le i \le n_1$ and $w_r(j) = (n_2 + 1)(r_j - r_{j-1})$ for $1 \le j \le n_2$, where by convention we set $s_0 = r_0 = 0$. In addition, for any $1 \le l \le p$, define the empirical norm
$$\|\beta_l\|_{n_2}^2 = \frac{1}{n_2} \sum_{j=1}^{n_2} w_r(j)\, \beta_l^2(r_j).$$
We propose the following constrained/penalized least squares estimator
$$(\widehat{A}, \widehat{\beta}) = \underset{A \in \mathcal{C}_A,\ \beta = \{\beta_l\}_{l=1}^p \in \mathcal{C}_\beta}{\arg\min}\ \frac{1}{T n_2} \sum_{t=1}^{T} \sum_{j=1}^{n_2} w_r(j) \Big\{ Y_t(r_j) - \frac{1}{n_1} \sum_{i=1}^{n_1} w_s(i) A(r_j, s_i) X_t(s_i) - \langle \beta(r_j), Z_t \rangle_p \Big\}^2 + \lambda \sum_{l=1}^{p} \|\beta_l\|_{n_2}, \qquad (4)$$
where $\lambda > 0$ is a tuning parameter that controls the group Lasso penalty, and $\mathcal{C}_A$ and $\mathcal{C}_\beta$ characterize the spaces of the coefficient functions $A$ and $\beta$, respectively:
$$\mathcal{C}_A = \{A: \mathcal{H}(K) \to \mathcal{H}(K),\ \|A\|_{F(K)}^2 \le C_A\}, \quad \mathcal{C}_\beta = \Big\{ \{\beta_l\}_{l=1}^p \subseteq \mathcal{H}(K_\beta): \max_{1 \le l \le p} \|\beta_l\|_{\mathcal{H}(K_\beta)} \le C_\beta \Big\}. \qquad (5)$$
Here $C_A, C_\beta > 0$ are two absolute constants and $\|\cdot\|_{F(K)}$ is defined in (3). Note that we allow the RKHSs $\mathcal{H}(K)$ and $\mathcal{H}(K_\beta)$ to be generated by two different kernels $K$ and $K_\beta$. The detailed optimization scheme is deferred to Section 4.

The above optimization problem involves three tuning parameters. The tuning parameters $C_A$ and $C_\beta$ control the smoothness of the regression coefficient functions. These constraints are standard and ubiquitous in the functional data analysis and nonparametric statistics literature. The tuning parameter $\lambda$ controls the group Lasso penalty, which is used to encourage sparsity when the dimensionality $p$ diverges faster than the sample size $T$.

The optimization problem in (4) makes use of the weight functions $w_s(i)$ and $w_r(j)$, which are determined by the discrete sample points $\{s_i\}_{i=1}^{n_1}$ and $\{r_j\}_{j=1}^{n_2}$, respectively.
This is designed to handle the scenario where the functional variables are observed on unevenly spaced sample points. Indeed, for evenly spaced sample points $\{s_i\}_{i=1}^{n_1}$ and $\{r_j\}_{j=1}^{n_2}$, we have $w_s(i) = w_r(j) = 1$ for all $i, j$. We remark that it is well known that the integral of a regular function can be well approximated by a weighted sum of its values at discrete sample points.

Note that (4) is an infinite-dimensional optimization problem. Fortunately, Proposition 1 states that the estimator $(\widehat{A}, \widehat{\beta})$ can in fact be written as linear combinations of the corresponding kernel functions evaluated at the discrete sample points $\{s_i\}_{i=1}^{n_1}$ and $\{r_j\}_{j=1}^{n_2}$.

Proposition 1 Denote $K$ and $K_\beta$ as the RKHS kernels of $\mathcal{H}(K)$ and $\mathcal{H}(K_\beta)$, respectively. There always exists a minimizer $(\widehat{A}, \widehat{\beta})$ of (4) such that
$$\widehat{A}(r, s) = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \widehat{a}_{ij} K(r, r_j) K(s, s_i), \quad (r, s) \in [0,1] \times [0,1], \quad \{\widehat{a}_{ij}\}_{i,j=1}^{n_1, n_2} \subseteq \mathbb{R},$$
$$\widehat{\beta}_l(r) = \sum_{j=1}^{n_2} \widehat{b}_{lj} K_\beta(r_j, r), \quad r \in [0,1],\ l \in \{1, \dots, p\}, \quad \{\widehat{b}_{lj}\}_{l=1, j=1}^{p, n_2} \subseteq \mathbb{R}.$$

Proposition 1 is a generalization of the well-known representer theorem for RKHS (Wahba, 1990). Various versions of representer theorems have been derived and used in the functional data analysis literature (e.g. Yuan and Cai, 2010).

3.2 Model assumptions

To establish the optimal theoretical guarantees for the estimator $(\widehat{A}, \widehat{\beta})$, we impose some mild model assumptions on the coefficient functions (Assumption 1), the functional and vector covariates (Assumption 2) and the sampling scheme (Assumption 3).

Assumption 1 (Coefficient functions) (a) The bivariate coefficient function $A^*$ belongs to $\mathcal{C}_A$ defined in (5), where $\mathcal{H}(K)$ is the RKHS associated with a bounded kernel $K$ and $C_A > 0$ is an absolute constant.

(b) The univariate coefficient functions $\{\beta_l^*\}_{l=1}^p$ belong to $\mathcal{C}_\beta$ defined in (5), where $\mathcal{H}(K_\beta)$ is the RKHS associated with a bounded kernel $K_\beta$ and $C_\beta > 0$ is an absolute constant. In addition, there exists a set $S \subseteq \{1, \dots, p\}$ such that $\beta_j^* = 0$ for all $j \in \{1, \dots$
$, p\} \setminus S$, and there exists a sufficiently large absolute constant $C_{\mathrm{snr}} > 0$ such that
$$T \ge C_{\mathrm{snr}}\, s \log(pT), \qquad (6)$$
where $s = |S|$ denotes the cardinality of the set $S$.

Assumption 1(a) requires the bivariate coefficient function $A^*$ to be a Hilbert-Schmidt operator mapping from $\mathcal{H}(K)$ to $\mathcal{H}(K)$. Assumption 1(b) imposes smoothness and sparsity conditions on the univariate coefficient functions $\{\beta_l^*\}_{l=1}^p$ to handle the potential high dimensionality of the vector covariate. Note that we allow $\{\beta_l^*\}_{l=1}^p$ to be from a possibly different RKHS $\mathcal{H}(K_\beta)$ than $\mathcal{H}(K)$, which allows users to choose kernels based on their practical needs.

Assumption 2 (Covariates) (a) The functional noise $\{\epsilon_t\}_{t=1}^T$ is a collection of independent and identically distributed Gaussian processes such that $\mathbb{E}(\epsilon_1(r)) = 0$ and $\mathrm{Var}(\epsilon_1(r)) < C_\epsilon$ for all $r \in [0,1]$. In addition, $\{\epsilon_t\}_{t=1}^T$ are independent of $\{X_t, Z_t\}_{t=1}^T$.

(b) The functional covariate $\{X_t\}_{t=1}^T$ is a collection of independent and identically distributed centered Gaussian processes with covariance operator $\Sigma_X$ and $\mathbb{E}\|X_1\|_{L^2}^2 = C_X < \infty$.

(c) The vector covariate $\{Z_t\}_{t=1}^T \subseteq \mathbb{R}^p$ is a collection of independent and identically distributed Gaussian random vectors from $N(0, \Sigma_Z)$, where $\Sigma_Z \in \mathbb{R}^{p \times p}$ is a positive definite matrix such that $c_z \|v\|_2^2 \le v^\top \Sigma_Z v \le C_z \|v\|_2^2$ for all $v \in \mathbb{R}^p$, and $c_z, C_z > 0$ are absolute constants.

(d) For any deterministic $f \in \mathcal{H}(K)$ and deterministic $v \in \mathbb{R}^p$, it holds that
$$\mathbb{E}\big( \langle X_1, f \rangle_{L^2}\, Z_1^\top v \big) \le \frac{3}{4} \sqrt{\mathbb{E}\{\langle X_1, f \rangle_{L^2}^2\}\, (v^\top \Sigma_Z v)} = \frac{3}{4} \sqrt{\Sigma_X[f, f]\, (v^\top \Sigma_Z v)}.$$

Assumption 2(a) and (b) state that both the functional noise $\{\epsilon_t\}_{t=1}^T$ and the functional covariates $\{X_t\}_{t=1}^T$ are Gaussian processes. In fact, one could further relax them to be sub-Gaussian processes. Such assumptions are frequently used in the high-dimensional functional data literature, such as Kneip et al. (2016) and Wang et al. (2020b).
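For concreteness, the following sketch generates discretized data from a model of the form (1) under Gaussian designs in the spirit of Assumption 2, using quadrature weights of the form $w_s(i) = (n_1 + 1)(s_i - s_{i-1})$. All specific choices here (the covariance kernel, the coefficient functions, the noise level, the dimensions) are illustrative assumptions, not the paper's simulation design:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n1, n2, p, s_sparse = 200, 50, 40, 30, 3

# Unevenly spaced sample points on [0, 1] with the last point equal to 1.
s_pts = np.sort(rng.uniform(size=n1)); s_pts[-1] = 1.0
r_pts = np.sort(rng.uniform(size=n2)); r_pts[-1] = 1.0

# Quadrature weights w_s(i) = (n1 + 1)(s_i - s_{i-1}) with s_0 = 0 (likewise w_r).
w_s = (n1 + 1) * np.diff(s_pts, prepend=0.0)
w_r = (n2 + 1) * np.diff(r_pts, prepend=0.0)

# Smooth Gaussian-process covariates X_t observed at s_pts (squared-exponential
# covariance; a small jitter keeps the matrix numerically positive definite).
cov_X = np.exp(-(s_pts[:, None] - s_pts[None, :]) ** 2 / 0.1) + 1e-6 * np.eye(n1)
X = rng.multivariate_normal(np.zeros(n1), cov_X, size=T)

# High-dimensional vector covariates and s-sparse univariate coefficient functions.
Z = rng.standard_normal((T, p))
beta = np.zeros((p, n2))
beta[:s_sparse] = np.sin(np.pi * np.outer(np.arange(1, s_sparse + 1), r_pts))

# Bivariate coefficient A(r, s); the integral int A(r, s) X_t(s) ds is
# approximated by the weighted sum (1/n1) * sum_i w_s(i) A(r, s_i) X_t(s_i).
A = np.cos(np.pi * np.outer(r_pts, s_pts))        # shape (n2, n1)
integral = X @ (A * w_s).T / n1                   # shape (T, n2)

eps = 0.1 * rng.standard_normal((T, n2))
Y = integral + Z @ beta + eps                     # responses Y_t(r_j), shape (T, n2)
```

Since the increments $s_i - s_{i-1}$ telescope to $s_{n_1} = 1$, the weights satisfy $\sum_i w_s(i) = n_1 + 1$, so the weighted sum behaves like a Riemann sum for the integral.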
Assumption 2(d) allows the functional and vector covariates to be correlated, with correlation up to $3/4$; that is, despite the functional and high-dimensional nature of the problem, the correlation can be of order $O(1)$. We do not claim the optimality of the constant $3/4$, but emphasize that this correlation cannot equal one, as detailed in Appendix A.2.

Assumptions 1 and 2 are sufficient for establishing theoretical guarantees if we require all functions (i.e. the functional responses and functional covariates) to be fully observed, which is typically not realistic in practice. To allow for discretized observations, we further introduce assumptions on the sampling scheme and on the smoothness of the functional covariates.

Assumption 3 (Sampling scheme) (a) The discrete sample points $\{s_i\}_{i=1}^{n_1}$ and $\{r_j\}_{j=1}^{n_2}$ satisfy $0 \le s_1 < s_2 < \dots < s_{n_1} = 1$ and $0 \le r_1 < r_2 < \dots < r_{n_2} = 1$, and there exists an absolute constant $C_d$ such that $s_i - s_{i-1} \le C_d/n_1$ for all $1 \le i \le n_1$ and $r_j - r_{j-1} \le C_d/n_2$ for all $1 \le j \le n_2$, where by convention we set $s_0 = r_0 = 0$.

(b) Suppose that $\mathcal{H}(K), \mathcal{H}(K_\beta) \subseteq W^{\alpha,2}$ for some $\alpha > 1/2$. In addition, suppose that
$$\mathbb{E}\big( \|X_1\|_{W^{\alpha,2}}^2 \big) < \infty. \qquad (7)$$

Assumption 3(a) allows the functional variables to be partially observed on discrete sample points. Importantly, Assumption 3(a) can accommodate both fixed and random sampling schemes. In particular, if the sample points are randomly generated from an unknown distribution on $[0,1]$ with a density function $\mu: [0,1] \to \mathbb{R}$ such that $\inf_{r \in [0,1]} \mu(r) > 0$, then Assumption 3(a) holds with high probability. We refer to Theorem 1 of Wang et al. (2014) for more details.

To handle the partially observed functional variables, Assumption 3(b) imposes the smoothness assumption that $X$, $A^*$ and $\{\beta_j^*\}_{j=1}^p$ can be enclosed by a common superset, the Sobolev space $W^{\alpha,2}$. In particular, (7) is a commonly used assumption for bivariate function estimation in the functional data analysis literature.
See, for example, Cai and Yuan (2010), Wang et al. (2020b) and the references therein. We emphasize that Assumption 3(b) indeed allows different smoothness levels for $X$, $A^*$ and $\{\beta_j^*\}_{j=1}^p$, and only requires that the least smooth space among those of $X$, $A^*$ and $\{\beta_j^*\}_{j=1}^p$ be covered by $W^{\alpha,2}$.

The condition (7) requires that the second moment of $\|X\|_{W^{\alpha,2}}$ is finite, which implies $X \in W^{\alpha,2}$ almost surely. Because the functional covariates $\{X_t\}_{t=1}^T$ are only partially observed, to derive finite-sample guarantees we need to establish uniform control over the approximation error
$$\sup_{r \in [0,1]} \Big| \int_{[0,1]} A(r, s) X_t(s)\, ds - \frac{1}{n_1} \sum_{i=1}^{n_1} w_s(i) A(r, s_i) X_t(s_i) \Big|$$
for all $t \in \{1, \dots, T\}$. This requires the realized sample paths $\{X_t\}_{t=1}^T$ to be regular. In particular, the second moment condition in (7) can be used to show that $\{\|X_t\|_{W^{\alpha,2}}\}_{t=1}^T$ are bounded with high probability, which implies that $\{X_t\}_{t=1}^T$ are Hölder smooth with Hölder parameter $\alpha - 1/2 > 0$ by the Morrey inequality (see Theorem 39 in Appendix E.1 for more details). In Example 1 in Appendix A, we further provide a concrete example to illustrate a sufficient condition for (7).

3.3 Theoretical guarantees

With the assumptions in hand, we establish theoretical guarantees and investigate the minimax optimality of the proposed estimator (4), through the lens of the excess risk defined in Definition 2. The derivation of (8) is collected in Theorem 37.

Definition 2 (Excess prediction risk) Let $(\widehat{A}, \widehat{\beta})$ be any estimator of $(A^*, \beta^*)$ in model (1). The excess risk of $(\widehat{A}, \widehat{\beta})$ is defined as
$$\mathcal{E}^*(\widehat{A}, \widehat{\beta}) = \mathbb{E}_{X^*, Z^*, Y^*} \bigg[ \int_{[0,1]} \Big\{ Y^*(r) - \int_{[0,1]} \widehat{A}(r, s) X^*(s)\, ds - \langle Z^*, \widehat{\beta}(r) \rangle_p \Big\}^2 dr - \int_{[0,1]} \Big\{ Y^*(r) - \int_{[0,1]} A^*(r, s) X^*(s)\, ds - \langle Z^*, \beta^*(r) \rangle_p \Big\}^2 dr \bigg] = \mathbb{E}_{X^*, Z^*} \int_{[0,1]} \Big\{ \int_{[0,1]} \Delta A(r, s) X^*(s)\, ds + \langle Z^*, \Delta\beta(r) \rangle_p \Big\}^2 dr, \qquad (8)$$
where $\Delta A(r, s) = \widehat{A}(r, s) - A^*(r, s)$ and $\Delta\beta(r) = \widehat{\beta}(r) - \beta^*(r)$ for $s, r \in [0,1]$, and the random objects $(X^*, Z^*, Y^*)$ are independent from, and identically distributed as, the observed data generated from model (1).

Remark 3 Definition 2 is ubiquitously used to measure prediction accuracy in regression settings.
Excess risks are usually considered to provide a better quantification of prediction accuracy for the estimators of interest than $\ell_2$ error bounds. Consequently, throughout the paper we evaluate our proposed estimators using excess risks. We remark that, as pointed out by Cai and Hall (2006) and Cai and Yuan (2012), much of the practical interest in the functional regression parameters is centered around applications for the purpose of prediction.

We are now ready to present our main results, which provide upper and lower bounds on the excess risk $\mathcal{E}^*(\widehat{A}, \widehat{\beta})$ of the proposed estimator $(\widehat{A}, \widehat{\beta})$ defined in (4). For notational simplicity, in the following we assume without loss of generality that $n_1 = n_2 = n$. For $n_1 \ne n_2$, all the statistical guarantees continue to hold by setting $n = \min\{n_1, n_2\}$.

Upper bounds

Theorem 4 Under Assumptions 1, 2 and 3, suppose that the eigenvalues $\{\xi_k\}_{k=1}^\infty$ of the linear operator $L_{K^{1/2} \Sigma_X K^{1/2}}$ satisfy
$$\xi_k \asymp k^{-2r} \qquad (9)$$
for some $r > 1/2$. Let $(\widehat{A}, \widehat{\beta})$ be any solution to (4) with the tuning parameter $\lambda = C_\lambda \sqrt{\log(pT)/T}$ for some sufficiently large constant $C_\lambda$. For any $T \gtrsim \log(n)$, there exists an absolute constant $C > 0$ such that, with probability at least $1 - 8T^{-4}$,
$$\mathcal{E}^*(\widehat{A}, \widehat{\beta}) \le C \log(T) \big\{ \delta_T + s \log(pT) T^{-1} + \zeta_n \big\}, \qquad (10)$$
where $\zeta_n = n^{-\alpha + 1/2}$, $s$ is the sparsity parameter defined in Assumption 1(b) and $\delta_T = T^{-2r/(2r+1)}$.

Theorem 4 provides a high-probability upper bound on the excess risk $\mathcal{E}^*(\widehat{A}, \widehat{\beta})$. The result is stated in a general way for any RKHSs $\mathcal{H}(K_\beta)$ and $\mathcal{H}(K)$, provided that the spectrum of $L_{K^{1/2} \Sigma_X K^{1/2}}$ satisfies the polynomial decay in (9). There are three components in the upper bound (10). The term $\delta_T$ is an upper bound on the error associated with the estimation of the bivariate coefficient function $A^*$ in the presence of functional noise. As formalized in (9), $\delta_T$ is determined by the alignment between the kernel $K$ of the RKHS in which $A^*$ resides and the covariance operator of $X$.
This is the well-known nonparametric rate frequently seen in the functional regression literature; see e.g. Cai and Yuan (2012). The term $s \log(pT) T^{-1}$ is an upper bound on the error associated with the estimation of the high-dimensional sparse univariate coefficient functions $\beta^*$, and is a parametric rate frequently seen in high-dimensional linear regression settings (e.g. Bühlmann and van de Geer, 2011). The term $\zeta_n$ is an upper bound on the error due to the fact that the functional variables $\{X_t, Y_t\}_{t=1}^T$ are only observed on discrete sample points. Recall that model (1) consists of two components, $\int_{[0,1]} A^*(\cdot, s) X_t(s)\, ds$ and $\langle Z_t, \beta^*(\cdot) \rangle_p$. The discretization errors are captured through $\zeta_n = n^{-\alpha + 1/2}$, which only depends on the smoothness level $\alpha$ of $W^{\alpha,2}$ with $\alpha > 1/2$. We show later in Proposition 7 that, provided the sample points are dense enough, e.g. $n \gtrsim T$, this upper bound achieves minimax optimality up to a logarithmic factor of $T$. In Appendix L, we further provide a numerical illustration of the nonparametric rate $\delta_T$ and the high-dimensional parametric rate $s \log(pT) T^{-1}$ in Theorem 4.

Remark 5 (New peeling techniques) To prove Theorem 4, we develop new peeling techniques to obtain new exponential tail bounds, which are crucial in dealing with the potentially exponentially growing dimension $p$.

Remark 6 (Phase transition) For sufficiently many sample points, i.e. large $n$, the upper bound in (10) implies that there exists an absolute constant $C > 0$ such that, with large probability,
$$\mathcal{E}^*(\widehat{A}, \widehat{\beta}) \le C \log(T) \big\{ \delta_T + s \log(pT) T^{-1} \big\}.$$
This unveils a phase transition between the nonparametric regime and the high-dimensional parametric regime, governed by the eigen-decay of the linear operator $L_{K^{1/2} \Sigma_X K^{1/2}}$ and the sparsity of the univariate coefficient functions $\beta^*$.
To be specific, if $\delta_T \gtrsim s \log(p \vee T)/T$, the high-probability upper bound on the excess risk is determined by the nonparametric rate $\delta_T$; otherwise, the high-dimensional parametric rate dominates. Theorem 16 in Appendix B.2 further presents a formal theoretical guarantee which quantifies a discretized version of the excess risk defined in Theorem 2 for the proposed estimator $(\widehat{A}, \widehat{\beta})$. Since the functional variables are only partially observed on discrete sample points, the discretized version of the excess risk can be more relevant in certain practical applications.

Lower bounds

In this section, we derive a matching lower bound on the excess risk $\mathcal{E}(\widehat{A}, \widehat{\beta})$ and thus show that the upper bound provided in Theorem 4 is nearly minimax optimal in terms of the sample size $T$, the dimension $p$ and the sparsity parameter $s$, save for a logarithmic factor. We establish the lower bound under the assumption that the functional variables $\{X_t, Y_t\}_{t=1}^T$ are fully observed, which is equivalent to setting $n = \infty$. Thus, we do not claim optimality in terms of the number of sample points $n$ for the result in Theorem 4.

Proposition 7 Under Assumptions 1 and 2, suppose that the functional variables $\{X_t, Y_t\}_{t=1}^T$ are fully observed and the eigenvalues $\{\xi_k\}_{k=1}^{\infty}$ of the linear operator $L_{K^{1/2}\Sigma_X K^{1/2}}$ satisfy $\xi_k \asymp k^{-2r}$ for some $r > 1/2$. There exists a sufficiently small constant $c > 0$ such that
$$\inf_{\widehat{A}, \widehat{\beta}} \sup_{A^* \in \mathcal{C}_A,\, \beta^* \in \mathcal{C}_\beta} \mathbb{E}\big\{\mathcal{E}(\widehat{A}, \widehat{\beta})\big\} \ge c\big\{T^{-2r/(2r+1)} + s \log(p)/T\big\},$$
where the infimum is taken over all possible estimators of $A^*$ and $\beta^*$ based on the observations $\{X_t, Z_t, Y_t\}_{t=1}^T$.

3.4 Extension to functional responses with measurement errors

In this subsection, we consider the setting where the functional responses are corrupted with measurement errors. More precisely, let the functional response $Y_t(r)$ be generated as in (1). Assume we observe $y_{t,j} = Y_t(r_j) + E_{t,j}$, $t \in \{1, \ldots, T\}$, $j \in \{1, . . .
, $n_2\}$, where $\{E_{t,j}\}_{t=1,j=1}^{T,n_2}$ is a collection of independent and identically distributed sub-Gaussian measurement errors with mean zero and $\mathrm{Var}(E_{t,j}) \le C_E$. Given the observations $\{X_t(s_i), Z_t, y_{t,j}\}_{t=1,i=1,j=1}^{T,n_1,n_2}$, consider
$$(\widehat{A}, \widehat{\beta}) = \arg\min_{A \in \mathcal{C}_A,\ \beta = \{\beta_l\}_{l=1}^p \in \mathcal{C}_\beta}\ \frac{1}{Tn_2}\sum_{t=1}^{T}\sum_{j=1}^{n_2} w_r(j)\Big\{y_{t,j} - \frac{1}{n_1}\sum_{i=1}^{n_1} w_s(i) A(r_j, s_i) X_t(s_i) - \big\langle \beta(r_j), Z_t \big\rangle\Big\}^2, \quad (11)$$
where $\mathcal{C}_A$ and $\mathcal{C}_\beta$ are defined in (5). Note that in the absence of measurement errors, i.e. $E_{t,j} = 0$ for all $t$ and $j$, the estimator (11) is identical to (4), proposed for the setting with only functional noise. In what follows, we show that the excess risk of the estimator (11) achieves the same convergence rate as that in Theorem 4.

Theorem 8 Suppose all the assumptions in Theorem 4 hold. Let $(\widehat{A}, \widehat{\beta})$ be any solution to (11) with the tuning parameter $\lambda = C_\lambda\sqrt{\log(p \vee T)/T}$ for some sufficiently large constant $C_\lambda$. For any $T \gtrsim \log(n)$, there exists an absolute constant $C > 0$ such that with probability at least $1 - 8T^{-4}$, it holds that
$$\mathcal{E}(\widehat{A}, \widehat{\beta}) \le C\log(T)\big\{\delta_T + s\log(p \vee T)/T + \zeta_n\big\}, \quad (12)$$
where $\zeta_n = n^{-\alpha+1/2}$, $s$ is the sparsity parameter defined in Assumption 1(b) and $\delta_T = T^{-2r/(2r+1)}$.

4. Optimization

In this section, we propose an efficient convex optimization algorithm for solving (4) given the observations $\{X_t(s_i), Z_t, Y_t(r_j)\}_{t=1,i=1,j=1}^{T,n_1,n_2}$. We remark that an identical optimization procedure can be applied to (11). To ease presentation, in the following we assume without loss of generality that $H(K) = H(K_\beta)$, and thus $K = K_\beta$. The general case where $K \neq K_\beta$ can be handled in exactly the same way with more tedious notation. Section 4.1 formulates (4) as a convex optimization problem and Section 4.2 further proposes a novel iterative coordinate descent algorithm to solve it efficiently.

4.1 Formulation of the convex optimization

By the equivalence between constrained and penalized optimization (see e.g.
Hastie et al., 2009), we can reformulate (4) into a penalized optimization such that
$$(\widehat{A}, \widehat{\beta}) = \arg\min_{A, \beta}\ \sum_{t=1}^{T}\sum_{j=1}^{n_2} w_r(j)\Big\{Y_t(r_j) - \frac{1}{n_1}\sum_{i=1}^{n_1} w_s(i) A(r_j, s_i) X_t(s_i) - \big\langle \beta(r_j), Z_t\big\rangle\Big\}^2 + \lambda_1 \|A\|^2_{F(K)} + \lambda_2 \sum_{l=1}^p \|\beta_l\|_{H(K)} + \lambda_3 \sum_{l=1}^p \Big\{\frac{1}{n_2}\sum_{j=1}^{n_2} w_r(j)\beta_l(r_j)^2\Big\}^{1/2}, \quad (13)$$
where $\lambda_1, \lambda_2, \lambda_3 > 0$ denote tuning parameters. Note that for notational simplicity, we drop the factor $1/(Tn_2)$ of the squared loss. In the following, we show that (13) can be solved via convex optimization.

We first define some necessary notation. For $t = 1, \ldots, T$, denote the functional curves observed on discrete grids as $Y_t = [Y_t(r_1), Y_t(r_2), \ldots, Y_t(r_{n_2})]^\top \in \mathbb{R}^{n_2}$ and $X_t = [X_t(s_1), X_t(s_2), \ldots, X_t(s_{n_1})]^\top \in \mathbb{R}^{n_1}$. Define $Y = [Y_1, \ldots, Y_T] \in \mathbb{R}^{n_2 \times T}$, $X = [X_1, \ldots, X_T] \in \mathbb{R}^{n_1 \times T}$ and $Z = [Z_1, \ldots, Z_T] \in \mathbb{R}^{p \times T}$. For any $r, s \in [0, 1]$, denote the kernel evaluation vectors as $k_1(r) = [K(r, r_1), K(r, r_2), \ldots, K(r, r_{n_2})]^\top \in \mathbb{R}^{n_2}$ and $k_2(s) = [K(s, s_1), K(s, s_2), \ldots, K(s, s_{n_1})]^\top \in \mathbb{R}^{n_1}$. Denote $K_1 = [k_1(r_1), k_1(r_2), \ldots, k_1(r_{n_2})] \in \mathbb{R}^{n_2 \times n_2}$ and $K_2 = [k_2(s_1), k_2(s_2), \ldots, k_2(s_{n_1})] \in \mathbb{R}^{n_1 \times n_1}$; both are symmetric and positive definite matrices. Furthermore, denote $W_S = \mathrm{diag}(w_s(1), w_s(2), \ldots, w_s(n_1)) \in \mathbb{R}^{n_1 \times n_1}$ and $W_R = \mathrm{diag}(\sqrt{w_r(1)}, \ldots, \sqrt{w_r(n_2)}) \in \mathbb{R}^{n_2 \times n_2}$ as the diagonal weight matrices. Define $\bar{Y} = W_R Y$, $\bar{K}_1 = W_R K_1$ and $\bar{K}_2 = K_2 W_S$. By the representer theorem (Theorem 1), the minimizer of (13) takes the form
$$A(r, s) = k_1(r)^\top R\, k_2(s) \quad \text{and} \quad \beta_l(r) = k_1(r)^\top b_l, \quad l = 1, \ldots, p,$$
where $R \in \mathbb{R}^{n_2 \times n_1}$ is an $n_2 \times n_1$ matrix and $b_l = [b_{l1}, b_{l2}, \ldots, b_{ln_2}]^\top \in \mathbb{R}^{n_2}$ is an $n_2$-dimensional vector for $l = 1, \ldots, p$. Denote $\beta(r) = [\beta_1(r), \beta_2(r), \ldots, \beta_p(r)]^\top = [b_1, b_2, \ldots, b_p]^\top k_1(r) = B^\top k_1(r)$, where $B = [b_1, b_2, \ldots, b_p] \in \mathbb{R}^{n_2 \times p}$. By straightforward algebra, we can rewrite the optimization problem in (13) as
$$\Big\|\bar{Y} - \frac{1}{n_1}\bar{K}_1 R \bar{K}_2 X - \bar{K}_1 B Z\Big\|_F^2 + \lambda_1 \mathrm{tr}(R^\top K_1 R K_2) + \lambda_2 \sum_{l=1}^p \sqrt{b_l^\top K_1 b_l} + \lambda_3 \sum_{l=1}^p \sqrt{\frac{1}{n_2}\, b_l^\top \bar{K}_1^\top \bar{K}_1 b_l}, \quad (14)$$
where $\|\cdot\|_F$ and $\mathrm{tr}(\cdot)$ are the Frobenius norm and trace of a matrix. We refer to Appendix F.1 for the detailed derivation. It is easy to see that (14) is a convex function of R and B.
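To make the finite-dimensional parametrization concrete, a minimal sketch of evaluating the representer-theorem form $A(r,s) = k_1(r)^\top R\, k_2(s)$ and $\beta_l(r) = k_1(r)^\top b_l$; the Gaussian kernel and the random $R$, $B$ below are illustrative placeholders, not the paper's choices:

```python
import numpy as np

# A generic Gaussian kernel stands in for the RKHS kernel K (assumption
# for illustration only; the paper uses a Bernoulli-polynomial kernel).
def K(x, y, h=0.2):
    return np.exp(-(x - y) ** 2 / (2.0 * h ** 2))

rng = np.random.default_rng(0)
n1, n2, p = 8, 6, 3
s_grid = np.linspace(0.0, 1.0, n1)         # sample points s_1, ..., s_{n1}
r_grid = np.linspace(0.0, 1.0, n2)         # sample points r_1, ..., r_{n2}
R = rng.standard_normal((n2, n1))          # coefficient matrix for A
B = rng.standard_normal((n2, p))           # columns b_l for beta_l

def k1(r): return K(r, r_grid)             # k1(r) in R^{n2}
def k2(s): return K(s, s_grid)             # k2(s) in R^{n1}

def A_hat(r, s):                           # bivariate coefficient surface
    return k1(r) @ R @ k2(s)

def beta_hat(r):                           # all p univariate coefficients at r
    return B.T @ k1(r)

assert beta_hat(0.3).shape == (p,)
```

Once $R$ and $B$ are estimated on the grids, both coefficient functions can thus be evaluated at any $(r, s) \in [0,1]^2$.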
Note that the first two terms of (14) are quadratic and can be handled easily; the main difficulty of the optimization lies in the two group Lasso-type penalties $\lambda_2 \sum_{l=1}^p \sqrt{b_l^\top K_1 b_l}$ and $\lambda_3 \sum_{l=1}^p \sqrt{b_l^\top \bar{K}_1^\top \bar{K}_1 b_l / n_2}$.

4.2 Iterative coordinate descent

In this section, we propose an efficient iterative coordinate descent algorithm which solves (14) by alternating between the optimization of R and B.

Optimization of R (i.e. the bivariate coefficient function A): Given a fixed $B = [b_1, b_2, \ldots, b_p]$, denote $\widetilde{Y} = \bar{Y} - \bar{K}_1 B Z \in \mathbb{R}^{n_2 \times T}$. Then (14) reduces to a function of R,
$$\Big\|\widetilde{Y} - \frac{1}{n_1}\bar{K}_1 R \bar{K}_2 X\Big\|_F^2 + \lambda_1 \mathrm{tr}(R^\top K_1 R K_2). \quad (15)$$
Define $E = K_1^{1/2} R K_2^{1/2} \in \mathbb{R}^{n_2 \times n_1}$ and $e = \mathrm{vec}(E)$. Denote $\widetilde{y} = \mathrm{vec}(\widetilde{Y}) = [\widetilde{Y}_1^\top, \widetilde{Y}_2^\top, \ldots, \widetilde{Y}_T^\top]^\top$ and $S_1 = n_1^{-1}(X^\top W_S K_2^{1/2}) \otimes (W_R K_1^{1/2}) \in \mathbb{R}^{Tn_2 \times n_1 n_2}$. We can rewrite (15) as
$$\|\widetilde{y} - S_1 e\|_2^2 + \lambda_1 \|e\|_2^2, \quad (16)$$
which is a classical ridge regression with a structured design matrix $S_1$. This ridge regression has a closed-form solution and can be solved efficiently by exploiting the Kronecker structure of $S_1$; see Appendix F.2 for more details. Thus, given B, R can be updated efficiently.

Optimization of B (i.e. the univariate coefficient functions β): Given a fixed R, with some abuse of notation, denote $\widetilde{Y} = \bar{Y} - n_1^{-1}\bar{K}_1 R \bar{K}_2 X \in \mathbb{R}^{n_2 \times T}$. For $l = 1, \ldots, p$, define $h_l = \bar{K}_1 b_l \in \mathbb{R}^{n_2}$ and $H = \bar{K}_1 B = [h_1, h_2, \ldots, h_p] \in \mathbb{R}^{n_2 \times p}$. Further define $K_3 = (\bar{K}_1^\top)^{-1} K_1 \bar{K}_1^{-1}$. Then (14) reduces to a function of H (and thus B), i.e.
$$\|\widetilde{Y} - HZ\|_F^2 + \lambda_2 \sum_{l=1}^p \sqrt{h_l^\top K_3 h_l} + \lambda_3 \sum_{l=1}^p \sqrt{\frac{1}{n_2}\, h_l^\top h_l} = \|\widetilde{Y} - HZ\|_F^2 + \lambda_2 \sum_{l=1}^p \sqrt{h_l^\top K_3 h_l} + \frac{\lambda_3}{\sqrt{n_2}}\sum_{l=1}^p \|h_l\|_2, \quad (17)$$
which is a linear regression with two group penalties: a weighted group Lasso penalty and a standard group Lasso penalty. We solve (17) by performing coordinate descent on $h_l$, $l = 1, 2, \ldots, p$. See Friedman et al. (2010) for a similar strategy used to solve the sparse group Lasso problem, a linear regression with both a Lasso and a group Lasso penalty.
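The Kronecker structure makes the ridge step (16) cheap: since $S_1^\top S_1 = (A^\top A) \otimes (B^\top B)$ for $S_1 = A \otimes B$, two small eigendecompositions diagonalize the whole system. A sketch with random placeholders for the two factors (which stand in for $n_1^{-1}X^\top W_S K_2^{1/2}$ and $W_R K_1^{1/2}$):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n1, n2 = 12, 5, 4
lam = 0.3

A = rng.standard_normal((T, n1))          # placeholder for n1^{-1} X' W_S K2^{1/2}
B = rng.standard_normal((n2, n2))         # placeholder for W_R K1^{1/2}
Ymat = rng.standard_normal((n2, T))       # \tilde{Y} on the grid
y = Ymat.flatten(order="F")               # column-major vec, matching S1 = A (x) B

# Reference: dense ridge solve e = (S1'S1 + lam I)^{-1} S1' y.
S1 = np.kron(A, B)
e_dense = np.linalg.solve(S1.T @ S1 + lam * np.eye(n1 * n2), S1.T @ y)

# Kronecker shortcut: never form S1.  S1'S1 = (A'A) (x) (B'B), so two small
# symmetric eigendecompositions diagonalize the ridge system.
d1, U1 = np.linalg.eigh(A.T @ A)          # n1 x n1
d2, U2 = np.linalg.eigh(B.T @ B)          # n2 x n2
M = B.T @ Ymat @ A                        # n2 x n1, equals unvec(S1' y)
C = U2.T @ M @ U1
E = U2 @ (C / (np.outer(d2, d1) + lam)) @ U1.T
e_kron = E.flatten(order="F")

assert np.allclose(e_dense, e_kron)
```

The dense solve costs $O((n_1 n_2)^3)$, while the shortcut costs only two eigendecompositions of size $n_1$ and $n_2$ plus small matrix products, which is what makes repeated R-updates inside Algorithm 1 feasible.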
Specifically, for each $l = 1, 2, \ldots, p$, the Karush-Kuhn-Tucker condition for $h_l$ is
$$-2\sum_{t=1}^T Z_{tl}\big(\widetilde{Y}_t^{-l} - Z_{tl} h_l\big) + \lambda_2 K_4 s_l^{(1)} + \frac{\lambda_3}{\sqrt{n_2}}\, s_l^{(2)} = 0,$$
where $\widetilde{Y}_t^{-l} = \widetilde{Y}_t - \sum_{j \neq l} Z_{tj} h_j$ and $K_4$ is defined as the root of $K_3$, i.e. $K_4 K_4 = K_3$. Here $K_4 s_l^{(1)}$ and $s_l^{(2)}$ are the subgradients of $\sqrt{h_l^\top K_3 h_l}$ and $\|h_l\|_2$ respectively, such that
$$s_l^{(1)} = \begin{cases} K_4 h_l / \|K_4 h_l\|_2, & h_l \neq 0,\\ \text{a vector with } \|s_l^{(1)}\|_2 \le 1, & h_l = 0,\end{cases} \qquad s_l^{(2)} = \begin{cases} h_l/\|h_l\|_2, & h_l \neq 0,\\ \text{a vector with } \|s_l^{(2)}\|_2 \le 1, & h_l = 0.\end{cases}$$
Thus, given $\{h_j, j \neq l\}$, we have that $h_l = 0$ is the optimal solution if there exist two vectors $s_l^{(1)}$ and $s_l^{(2)}$ with $\|s_l^{(1)}\|_2 \le 1$ and $\|s_l^{(2)}\|_2 \le 1$ such that
$$2\sum_{t=1}^T Z_{tl}\widetilde{Y}_t^{-l} = \lambda_2 K_4 s_l^{(1)} + \frac{\lambda_3}{\sqrt{n_2}}\, s_l^{(2)}.$$
This is equivalent to checking whether
$$\min_{\|s\|_2 \le 1}\Big\|K_4^{-1}\Big(2\sum_{t=1}^T Z_{tl}\widetilde{Y}_t^{-l} - \frac{\lambda_3}{\sqrt{n_2}}\, s\Big)\Big\|_2 \le \lambda_2,$$
which is a standard constrained optimization problem and can be solved efficiently. Otherwise, $h_l \neq 0$ and to update $h_l$, we need to optimize
$$\sum_{t=1}^T \big\|\widetilde{Y}_t^{-l} - Z_{tl} h_l\big\|_2^2 + \lambda_2\sqrt{h_l^\top K_3 h_l} + \frac{\lambda_3}{\sqrt{n_2}}\|h_l\|_2. \quad (18)$$
We again solve this optimization by performing coordinate descent on $h_{lk}$, $k = 1, 2, \ldots, n_2$, where for each $k$, given $\{h_{lj}, j \neq k\}$, we update $h_{lk}$ by solving the simple one-dimensional optimization
$$\sum_{t=1}^T \big(\widetilde{Y}_{tk}^{-l} - Z_{tl} h_{lk}\big)^2 + \lambda_2\sqrt{K_{3,kk} h_{lk}^2 + 2 h_{lk}\sum_{j\neq k} K_{3,kj} h_{lj} + \sum_{i,j\neq k} h_{li} K_{3,ij} h_{lj}} + \frac{\lambda_3}{\sqrt{n_2}}\sqrt{h_{lk}^2 + \sum_{j \neq k} h_{lj}^2}. \quad (19)$$
Thus, given R, B can also be updated efficiently.

The iterative coordinate descent algorithm: Algorithm 1 formalizes the above discussion and outlines the proposed iterative coordinate descent algorithm. The convergence of coordinate descent for convex optimization is guaranteed under mild conditions; see e.g. Wright (2015). Empirically, the proposed algorithm is found to be efficient and stable, and typically reaches a reasonable convergence tolerance within a few iterations.

5. Numerical results

In this section, we conduct extensive numerical experiments to investigate the performance of the proposed RKHS-based penalized estimator (hereafter RKHS) for functional linear regression with mixed predictors.
Sections 5.1-5.3 compare RKHS with popular methods in the literature via simulation studies. Section 5.4 presents a real data application on crowdfunding prediction to further illustrate the potential utility of the proposed method. The implementations of our numerical experiments can be found at https://github.com/darenwang/functional_regression.

5.1 Simulation settings

Data generating process: We simulate data from the functional linear regression model
$$Y_t(r) = \int_{[0,1]} A^*(r, s) X_t(s)\, ds + \sum_{j=1}^p \beta_j^*(r) Z_{tj} + \epsilon_t(r), \quad r \in [0, 1], \quad (20)$$
for $t = 1, 2, \ldots, T$. Note that for $p = 0$, (20) reduces to the classical function-on-function regression. We generate the functional covariate $\{X_t\}$ and the functional noise $\{\epsilon_t\}$ from a q-dimensional subspace spanned by basis functions $\{u_i(s)\}_{i=1}^q$, where $\{u_i(s)\}_{i=1}^q$ is an orthonormal basis of $L^2[0, 1]$. Following Yuan and Cai (2010), we set $u_i(s) = 1$ if $i = 1$ and $u_i(s) = \sqrt{2}\cos((i-1)\pi s)$ for $i = 2, \ldots, q$. Thus, we have $X_t(r) = \sum_{i=1}^q x_{ti} u_i(r)$ and $\epsilon_t(r) = \sum_{i=1}^q e_{ti} u_i(r)$. For $x_t = (x_{t1}, \ldots, x_{tq})^\top$, we simulate $x_{ti}$ i.i.d. Unif[−1/i, 1/i],

Algorithm 1 Iterative coordinate descent
1: input: Observations $\{X_t(s_i), Z_t, Y_t(r_j)\}_{t=1,i=1,j=1}^{T,n_1,n_2}$, tuning parameters $(\lambda_1, \lambda_2, \lambda_3)$, the maximum number of iterations $L_{\max}$ and tolerance $\epsilon$.
2: initialization: $L = 1$, $B_0 = R_0 = 0$.
3: repeat ▷ First-level block coordinate descent
4:   Given $B = B_{L-1}$, update $R_L$ via the ridge regression formulation (16).
5:   Given $R = R_L$, set $\widetilde{Y} = \bar{Y} - n_1^{-1}\bar{K}_1 R \bar{K}_2 X$ and initialize $H = \bar{K}_1 B_{L-1}$.
6:   repeat ▷ Second-level coordinate descent
7:     for $l = 1, 2, \ldots, p$ do
8:       Given $\{h_j, j \neq l\}$, set $\widetilde{Y}_t^{-l} = \widetilde{Y}_t - \sum_{j \neq l} Z_{tj} h_j$, for $t = 1, \ldots, T$.
9:       if $\min_{\|s\|_2 \le 1}\big\|K_4^{-1}\big(2\sum_{t=1}^T Z_{tl}\widetilde{Y}_t^{-l} - \lambda_3 s/\sqrt{n_2}\big)\big\|_2 \le \lambda_2$ then
10:        Update $h_l = 0$.
11:      else
12:        repeat ▷ Third-level coordinate descent
13:          for $k = 1, 2, \ldots, n_2$ do
14:            Given $\{h_{lj}, j \neq k\}$, update $h_{lk}$ via the one-dimensional optimization (19).
15:          end for
16:        until Decrease of function value (18) < $\epsilon$.
17:      end if
18:    end for
19:  until Decrease of function value (17) < $\epsilon$.
20:  Update $B_L = \bar{K}_1^{-1} H$ and set $L \leftarrow L + 1$.
21: until Decrease of function value (14) < $\epsilon$ or $L \ge L_{\max}$.
22: output: $\widehat{R} = R_L$ and $\widehat{B} = B_L$.

$i = 1, \ldots, q$. For $e_t = (e_{t1}, \ldots, e_{tq})^\top$, we simulate $e_{ti}$ i.i.d. Unif[−0.2/i, 0.2/i], $i = 1, \ldots, q$. For the vector covariate $\{Z_{tj}\}$, we simulate $Z_{tj}$ i.i.d. Unif[−1/√3, 1/√3], $j = 1, \ldots, p$. For the coefficient functions $A^*(r, s)$ and $\{\beta_j^*(r)\}_{j=1}^p$, we consider Scenarios A and B.

Scenario A (Exponential): We set $A^*(r, s) = \sqrt{3}\kappa e^{-(r+s)}$, $\beta_1^*(r) = \sqrt{3}\kappa e^{-r}$ and $\beta_j^*(r) \equiv 0$, $j = 2, \ldots, p$.

Scenario B (Random): We set $A^*(r, s) = \kappa\sum_{i,j=1}^q \lambda_{ij} u_i(r) u_j(s)$, $\beta_1^*(r) = \kappa\sum_{i=1}^q b_i u_i(r)$ and $\beta_j^*(r) \equiv 0$, $j = 2, \ldots, p$. Define the matrix $\Lambda = (\lambda_{ij})$ and $b = (b_1, \ldots, b_q)^\top$. We simulate $\lambda_{ij}$ i.i.d. N(0, 1) and rescale $\Lambda$ such that its spectral norm $\|\Lambda\|_{\mathrm{op}} = 1$. For $b$, we simulate $b_i$ i.i.d. N(0, 1) and rescale $b$ such that $\|b\|_2 = 1$.

Note that $(A^*, \beta^*)$ in Scenario B is more complex than that in Scenario A, especially when q is large, while the complexity of Scenario A is insensitive to q. The parameter κ is later used to control the signal-to-noise ratio (SNR). Here, we define the SNR for $A^*(r, s)$ as
$$\sqrt{\mathbb{E}\int_{[0,1]}\Big\{\int_{[0,1]} A^*(r, s) X_t(s)\, ds\Big\}^2 dr \Big/ \mathbb{E}\int_{[0,1]}\epsilon_t(r)^2\, dr},$$
which roughly equals 1, 2 and 4 as we vary κ = 0.5, 1, 2. Similarly, we define the SNR for $\beta^*$ (note that only $\beta_1^*$ is a non-zero function) as
$$\sqrt{\mathbb{E}\int_{[0,1]}\big\{\beta_1^*(r) Z_{t1}\big\}^2 dr \Big/ \mathbb{E}\int_{[0,1]}\epsilon_t(r)^2\, dr},$$
which also roughly equals 1, 2 and 4 as we vary κ = 0.5, 1, 2. For simplicity, we set the discrete sample points $\{r_j\}_{j=1}^{n_2}$ for $Y_t$ and $\{s_i\}_{i=1}^{n_1}$ for $X_t$ to be evenly spaced grids on [0, 1] with the same number of grid points $n = n_1 = n_2$. The simulation results for random sample points, where $\{r_j\}_{j=1}^{n_2}$ and $\{s_i\}_{i=1}^{n_1}$ are generated independently from the uniform distribution on [0, 1], are similar and thus omitted.

Evaluation criteria: We evaluate the performance of the estimator by its excess risk.
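The data generating process above can be sketched directly; grid sizes follow the smallest setting $(n, T, q) = (5, 50, 5)$, and the $\sqrt{3}\kappa$ scale in Scenario A is as read from the text, so treat it as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, p, q, kappa = 50, 5, 3, 5, 1.0

def u(i, s):  # orthonormal cosine basis of L2[0, 1]
    return np.ones_like(s) if i == 1 else np.sqrt(2.0) * np.cos((i - 1) * np.pi * s)

s_grid = np.linspace(0.0, 1.0, n)   # sample points for X_t
r_grid = np.linspace(0.0, 1.0, n)   # sample points for Y_t

idx = np.arange(1, q + 1)
x = rng.uniform(-1.0, 1.0, (T, q)) / idx        # x_ti ~ Unif[-1/i, 1/i]
e = rng.uniform(-0.2, 0.2, (T, q)) / idx        # e_ti ~ Unif[-0.2/i, 0.2/i]
Z = rng.uniform(-1.0, 1.0, (T, p)) / np.sqrt(3.0)

U_s = np.stack([u(i, s_grid) for i in idx])     # q x n basis values on s-grid
U_r = np.stack([u(i, r_grid) for i in idx])
X = x @ U_s                                     # T x n, X_t on the grid
eps = e @ U_r                                   # functional noise on the grid

# Scenario A coefficients (sqrt(3)*kappa scale is an assumption).
A_star = np.sqrt(3.0) * kappa * np.exp(-(r_grid[:, None] + s_grid[None, :]))
beta1 = np.sqrt(3.0) * kappa * np.exp(-r_grid)

# Trapezoid quadrature weights on the evenly spaced s-grid for the integral.
w = np.full(n, 1.0 / (n - 1)); w[0] *= 0.5; w[-1] *= 0.5
Y = (X * w) @ A_star.T + np.outer(Z[:, 0], beta1) + eps   # model (20)
assert Y.shape == (T, n)
```

Only the first scalar covariate enters the response, so the remaining columns of `Z` are pure noise, as in the sparsity setup of Section 5.3.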
Specifically, given the sample size (n, T), we simulate observations $\{X_t(s_i), Z_t, Y_t(r_j)\}_{t=1,i=1,j=1}^{1.5T,n,n}$, which are then split into the training data $\{X_t(s_i), Z_t, Y_t(r_j)\}_{t=1,i=1,j=1}^{T,n,n}$ for constructing the estimator $(\widehat{A}, \widehat{\beta})$ and the test data $\{X_t(s_i), Z_t, Y_t(r_j)\}_{t=T+1,i=1,j=1}^{1.5T,n,n}$ for the evaluation of the excess risk. Based on $(\widehat{A}, \widehat{\beta})$ and the predictors $\{X_t(s_i), Z_t\}_{t=T+1,i=1}^{1.5T,n}$, we generate the predictions $\{\widehat{Y}_t(r)\}_{t=T+1}^{1.5T}$ and define
$$\mathrm{RMISE}(\widehat{A}, \widehat{\beta}) = \sqrt{\frac{1}{0.5T}\sum_{t=T+1}^{1.5T}\int_{[0,1]}\big\{Y_t^{\mathrm{oracle}}(r) - \widehat{Y}_t(r)\big\}^2\, dr}, \quad (21)$$
$$\mathrm{n\text{-}RMISE}(\widehat{A}, \widehat{\beta}) = \frac{\frac{1}{0.5T}\sum_{t=T+1}^{1.5T}\int_{[0,1]}\big\{Y_t^{\mathrm{oracle}}(r) - \widehat{Y}_t(r)\big\}^2\, dr}{\frac{1}{0.5T}\sum_{t=T+1}^{1.5T}\int_{[0,1]}\big\{Y_t^{\mathrm{oracle}}(r) - 0\big\}^2\, dr}, \quad (22)$$
where $Y_t^{\mathrm{oracle}}(r) = \int_{[0,1]} A^*(r, s) X_t(s)\, ds + \beta_1^*(r) Z_{t1}$ is the oracle prediction of $Y_t(r)$. Note that n-RMISE is a normalized RMISE, which can be viewed as a percentage error and thus is easy to assess and interpret. Smaller RMISE and n-RMISE indicate a better recovery of the signal.

Simulation settings: We consider three simulation settings for (n, T, q), namely $(n, T, q) \in \{(5, 50, 5), (20, 100, 20), (40, 200, 50)\}$. For each setting, we vary $\kappa \in \{0.5, 1, 2\}$, which roughly corresponds to SNR ∈ {1, 2, 4}. As for the number of scalar predictors p, Section 5.2 considers the classical function-on-function regression, which is a special case of (20) with p = 0, and Section 5.3 considers functional regression with mixed predictors and sets p = 3, 10, 50, 100, 200. For each setting, we conduct 500 experiments.

Implementation details of the RKHS estimator: We set $K = K_\beta$ and use the rescaled Bernoulli polynomial as the reproducing kernel, such that
$$K(x, y) = 1 + k_1(x)k_1(y) + k_2(x)k_2(y) - k_4(x - y),$$
where $k_1(x) = x - 0.5$, $k_2(x) = \frac{1}{2}\{k_1^2(x) - 1/12\}$, $k_4(x) = \frac{1}{24}\{k_1^4(x) - k_1^2(x)/2 + 7/240\}$ for $x \in [0, 1]$, and $k_4(x - y) = k_4(|x - y|)$ for $x, y \in [0, 1]$. Such K is the reproducing kernel for $W^{2,2}$. See Chapter 2.3.3 of Gu (2013) for more details.
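A direct implementation of the rescaled Bernoulli-polynomial kernel above, checking that the induced Gram matrix (as used for $K_1$ and $K_2$) is symmetric and positive semi-definite:

```python
import numpy as np

# The rescaled Bernoulli-polynomial kernel for W^{2,2} (Gu, 2013, Ch. 2.3.3).
def k1(x): return x - 0.5
def k2(x): return 0.5 * (k1(x) ** 2 - 1.0 / 12.0)
def k4(x): return (k1(x) ** 4 - k1(x) ** 2 / 2.0 + 7.0 / 240.0) / 24.0

def K(x, y):
    return 1.0 + k1(x) * k1(y) + k2(x) * k2(y) - k4(np.abs(x - y))

# Gram matrix on an evenly spaced grid, as used for K1 and K2 in Section 4.1.
grid = np.linspace(0.0, 1.0, 21)
G = K(grid[:, None], grid[None, :])

assert np.allclose(G, G.T)                          # symmetric
assert np.linalg.eigvalsh(G).min() > -1e-8          # numerically PSD
```

The smallest eigenvalues of such spline-kernel Gram matrices decay quickly, which is consistent with the polynomial eigen-decay condition (9) driving the nonparametric rate.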
In Algorithm 1, we set the tolerance parameter $\epsilon = 10^{-8}$ and the maximum number of iterations $L_{\max} = 10^4$. A standard 5-fold cross-validation (CV) on the training data is used to select the tuning parameters $(\lambda_1, \lambda_2, \lambda_3)$. Note that for p = 0, we explicitly set $\beta \equiv 0$ in the penalized optimization (13), which reduces to the structured ridge regression in (16) and can be solved efficiently with a closed-form solution, as detailed in Section 4.2.

5.2 Function-on-function regression

In this subsection, we set p = 0 and compare the proposed RKHS estimator with two popular methods in the literature for function-on-function regression. The first competitor (Ramsay and Silverman, 2005) estimates the coefficient function A(r, s) based on a penalized B-spline basis function expansion. We denote this approach as FDA; it is implemented via the R package fda.usc (the fregre.basis.fr function). The second competitor (Ivanescu et al., 2015) is the penalized flexible functional regression (PFFR), implemented via the R package refund (the pffr function). Unlike our proposed RKHS method, neither FDA nor PFFR has optimal theoretical guarantees. Both FDA and PFFR are based on penalized basis function expansions and require a hyper-parameter $N_b$, the number of basis functions. Intuitively, the choice of $N_b$ is related to the complexity of $A^*(r, s)$, which is unknown in practice. In the literature, it is recommended to set $N_b$ to a large number to guard against underfitting of $A^*(r, s)$. On the other hand, a larger $N_b$ can potentially incur higher estimation variance and also increases the computational cost. For a fair comparison, for both FDA and PFFR we set $N_b = 20$, which is sufficient to accommodate the model complexity across all simulation settings except for Scenario B with q = 50; see more discussion below. We use a standard 5-fold CV to select the tuning parameter $\lambda_1$ of RKHS and the roughness penalty of FDA from the range $10^{(-15:0.5:2)}$.
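The 5-fold CV used to select the tuning parameters can be sketched generically; `fit` and `predict_risk` below are hypothetical stand-ins for Algorithm 1 and the out-of-fold risk evaluation (not the paper's code), and the toy check at the end uses a shrunken sample mean:

```python
import numpy as np
from itertools import product

def cv_select(fit, predict_risk, data, grids, n_folds=5):
    """Grid search over tuning parameters by K-fold cross-validation."""
    T = len(data)
    folds = np.array_split(np.random.default_rng(0).permutation(T), n_folds)
    best, best_risk = None, np.inf
    for lams in product(*grids):
        risk = 0.0
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            model = fit([data[i] for i in train], lams)
            risk += predict_risk(model, [data[i] for i in test])
        if risk < best_risk:
            best, best_risk = lams, risk
    return best

# Toy check: a ridge-style shrunken mean with one tuning parameter; the
# unshrunken estimate (lambda = 0) should win on well-centered data.
data = list(np.random.default_rng(1).normal(1.0, 0.1, 50))
fit = lambda d, lams: np.mean(d) / (1.0 + lams[0])
predict_risk = lambda m, d: float(np.mean((np.array(d) - m) ** 2))
best = cv_select(fit, predict_risk, data, grids=[[0.0, 0.5, 5.0]])
assert best == (0.0,)
```

For the actual estimator the grid is three-dimensional over $(\lambda_1, \lambda_2, \lambda_3)$, so the loop runs over the Cartesian product of the three per-parameter ranges.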
PFFR uses a restricted maximum likelihood (REML) approach to automatically select the roughness penalty. Besides penalized methods based on basis function expansions (i.e. FDA and PFFR), another popular approach to function-on-function regression in the literature is based on functional PCA (i.e. the Karhunen-Loève expansion, FPCA); see for example Yao et al. (2005). However, FPCA does not seem to perform competitively in our simulations and real data analysis; see similar observations in Cai and Yuan (2012) and Sun et al. (2018). For completeness, we report the performance of FPCA in Appendix H and Appendix K.

Numerical results: For each method, Table 1 reports its average n-RMISE (n-RMISEavg) across 500 experiments under all simulation settings. For each simulation setting, we further report Ravg, the percentage improvement in excess risk given by RKHS, defined as
Ravg = [min{n-RMISEavg(FDA), n-RMISEavg(PFFR)} / n-RMISEavg(RKHS) − 1] × 100%.
In addition, we report Rw, the percentage of experiments (among 500) in which RKHS gives the lowest RMISE. We further give boxplots of RMISE in Figure 2 under κ = 1. The boxplots of RMISE under κ ∈ {0.5, 2} can be found in Appendix G.

As can be seen, RKHS in general delivers the best performance across almost all simulation settings. For all methods, the estimation performance improves with a larger SNR (reflected in κ). Since the complexity of Scenario A is insensitive to q, the performance of all methods improves as (n, T) increases under Scenario A. However, this is not the case for Scenario B, as a larger q increases the estimation difficulty. Compared to Scenario A, the excess risk (RMISE) of Scenario B is larger, especially for q ∈ {20, 50}, due to the more complex nature of the operator $A^*$. In general, the improvement of RKHS is more notable under high model complexity and high SNR.
Note that for Scenario B with q = 50, both FDA and PFFR incur significantly higher excess risks due to the underfitting bias caused by the insufficient basis dimension $N_b$, indicating the potential sensitivity of penalized basis function approaches to this hyper-parameter. In comparison, RKHS automatically adapts to different levels of model complexity. In Appendix J, we further conduct the same simulation but with a larger basis dimension, $N_b = 50$ for FDA and $N_b = 30$ for PFFR. For Scenario A, where the bivariate function $A^*(r, s)$ is a simple exponential function, FDA and PFFR give essentially the same performance, while for Scenario B with q = 50, FDA and PFFR give much improved performance due to lower underfitting bias, though still with a notable performance gap compared to RKHS. In addition, note that a larger $N_b$ can significantly increase the computational cost of the penalized basis function approaches. We refer to Appendix J for more details.

In addition, Appendix I collects the results of the same simulation but with the observed functional responses $Y_t(r)$ additionally corrupted with measurement errors, i.e. the scenario discussed in Section 3.4 (see Theorem 8). The performance of RKHS only worsens slightly with the additional measurement errors, providing numerical support for Theorem 8, which states that measurement errors do not affect the convergence rate of RKHS.

5.3 Functional regression with mixed predictors

In this subsection, we set p = 3 and compare the performance of RKHS with PFFR, as FDA cannot handle mixed predictors. For RKHS, we use standard 5-fold CV to select the tuning parameters $(\lambda_1, \lambda_2, \lambda_3)$ from the range $10^{(-4:1:0)} \times 10^{(-4:1:0)} \times 10^{(-1:1:3)}$ for Scenario A and from the range $10^{(-17:1:-13)} \times 10^{(-17:1:-13)} \times 10^{(-1:1:3)}$ for Scenario B. PFFR uses REML to automatically select the roughness penalty and we set $N_b = 20$ for PFFR as before.
Scenario A: n = 5, T = 50, q = 5            Scenario B: n = 5, T = 50, q = 5
 κ    RKHS   FDA    PFFR   Ravg(%)  Rw(%)    RKHS   FDA    PFFR   Ravg(%)  Rw(%)
 0.5  15.98  17.97  19.80  12.46    64       20.86  22.98  22.28  6.79     63
 1    9.14   10.31  10.98  12.81    68       10.42  11.50  11.12  6.63     71
 2    4.99   6.18   5.70   14.37    72       5.21   5.75   5.56   6.68     74

Scenario A: n = 20, T = 100, q = 20         Scenario B: n = 20, T = 100, q = 20
 κ    RKHS   FDA    PFFR   Ravg(%)  Rw(%)    RKHS   FDA    PFFR   Ravg(%)  Rw(%)
 0.5  10.61  12.15  24.84  14.57    80       37.41  45.91  36.51  −2.41    47
 1    5.89   6.75   12.84  14.66    81       18.39  32.41  22.32  21.37    92
 2    3.60   4.26   6.89   18.45    84       9.19   27.71  12.56  36.58    100

Scenario A: n = 40, T = 200, q = 50         Scenario B: n = 40, T = 200, q = 50
 κ    RKHS   FDA    PFFR   Ravg(%)  Rw(%)    RKHS   FDA    PFFR   Ravg(%)  Rw(%)
 0.5  7.37   8.53   18.93  15.72    81       39.22  79.37  78.92  101.21   100
 1    3.98   4.41   9.68   10.96    75       20.75  76.74  77.40  269.73   100
 2    2.34   2.36   5.05   0.71     49       12.59  75.95  77.09  503.19   100

Table 1: Numerical performance of RKHS, FDA and PFFR under function-on-function regression. The reported n-RMISEavg is multiplied by 100. Ravg reflects the percentage improvement of RKHS over the best-performing competitor, and Rw reflects the percentage of experiments in which RKHS achieves the lowest RMISE.

Numerical results: Table 2 reports the average n-RMISE (n-RMISEavg) across 500 experiments for RKHS and PFFR under all simulation settings. Table 2 further reports Ravg, the percentage improvement of RKHS over PFFR, and Rw, the percentage of experiments (among 500) in which RKHS returns the lower RMISE. We further give boxplots of RMISE in Figure 3 under κ = 1. The boxplots of RMISE under κ = 0.5, 2 can be found in Appendix G. The results are consistent with those for function-on-function regression: RKHS delivers the best performance across almost all simulation settings, with notable improvement. In the following, we further examine the performance of RKHS for high-dimensional scalar predictors.
Specifically, keeping the simulation setting identical to the above, we increase the number of scalar predictors to p = 10, 50, 100, 200. As a reminder, the sparsity of $\beta^* = (\beta_1^*, \beta_2^*, \ldots, \beta_p^*)$ is 1, as only $\beta_1^*$ is a non-zero function. In other words, the additionally introduced scalar predictors $(Z_{t2}, Z_{t3}, \ldots, Z_{tp})$ are pure noise. Thus, compared to the case of p = 3, the performance of RKHS and PFFR is expected to worsen with increasing dimension p. However, thanks to the group Lasso-type penalty, which induces sparsity in the estimated coefficient functions $\widehat{\beta}$, we expect the excess risk of RKHS to grow at the rate of $O(\log(p))$, as suggested by Theorem 4. To conserve space, we present the results under the simulation setting with (n, T, q) = (20, 100, 20) and κ = 1. Results under other settings are similar and thus omitted. Due to the lack of a sparsity penalty, PFFR may not be suitable for the setting of high-dimensional predictors; for comparison, we only implement PFFR for p = 10. Figure 4 (A) and (B) give boxplots of RMISE for RKHS and PFFR across 500 experiments.

Figure 2: Boxplots of RMISE of RKHS, FDA and PFFR across 500 experiments under function-on-function regression with κ = 1 (Scenarios A and B at the three settings of (n, T, q)). Red points denote the average RMISE.

As expected, the RMISE of RKHS increases as the dimension p increases, though at a rate slower than p. Indeed, RKHS at p = 200 still gives better performance than PFFR at p = 10.
Figure 4 (C) plots the average MISE¹ across 500 experiments against log(p), where the relationship is seen to be roughly linear. This confirms that the excess risk of RKHS increases in the order of $O(\log(p))$, as suggested by Theorem 4.

5.4 Real data applications

In this section, we conduct a real data analysis to further demonstrate the promising utility of the proposed RKHS-based functional regression in the context of crowdfunding. In recent years, crowdfunding has become a flexible and cost-effective financing channel for start-ups and entrepreneurs, which helps expedite product development and the diffusion of innovations. We consider a novel dataset collected from one of the largest crowdfunding websites, kickstarter.com, which provides an online platform for creators, e.g. start-ups, to launch fundraising campaigns for developing a new product such as electronic devices or card games. A fundraising campaign takes place on a webpage set up by the creators, where information about the new product is provided, and typically has a 30-day duration with a goal amount G preset by the creators. Over the course of the 30 days, backers can pledge funds to the campaign, resulting in a functional curve of pledged funds {P(r), r ∈ [0, 30]}. At

1. To match the result in Theorem 4, we compute MISE, which is the squared RMISE, i.e. MISE = RMISE².
Scenario A: n = 5, T = 50, q = 5            Scenario B: n = 5, T = 50, q = 5
 κ    RKHS   PFFR   Ravg(%)  Rw(%)           RKHS   PFFR   Ravg(%)  Rw(%)
 0.5  17.16  20.14  17.35    77              14.81  15.99  8.02     70
 1    9.43   10.89  15.53    76              7.42   7.97   7.46     75
 2    5.03   5.60   11.35    74              3.72   3.99   7.29     75

Scenario A: n = 20, T = 100, q = 20         Scenario B: n = 20, T = 100, q = 20
 κ    RKHS   PFFR   Ravg(%)  Rw(%)           RKHS   PFFR   Ravg(%)  Rw(%)
 0.5  11.25  21.95  95.11    100             23.95  25.74  7.48     63
 1    5.95   11.25  89.31    100             11.83  14.44  22.03    90
 2    3.38   5.94   75.77    100             5.92   7.85   32.47    99

Scenario A: n = 40, T = 200, q = 50         Scenario B: n = 40, T = 200, q = 50
 κ    RKHS   PFFR   Ravg(%)  Rw(%)           RKHS   PFFR   Ravg(%)  Rw(%)
 0.5  8.50   15.76  85.46    99              27.52  77.13  180.22   100
 1    4.36   8.02   84.03    100             14.23  76.19  435.38   100
 2    2.33   4.16   78.36    100             8.20   75.99  827.09   100

Table 2: Numerical performance of RKHS and PFFR under functional regression with mixed predictors. The reported n-RMISEavg is multiplied by 100. Ravg reflects the percentage improvement of RKHS over PFFR and Rw reflects the percentage of experiments in which RKHS achieves the lower RMISE.

time r, the webpage displays real-time P(r) along with other information about the campaign, such as its number of creators Z1, number of product updates Z2 and number of product FAQs Z3. A fundraising campaign succeeds if P(30) ≥ G. Importantly, only creators of successful campaigns are awarded the final raised funds P(30), and the platform kickstarter.com charges a service fee (5% of P(30)) on successful campaigns. Thus, for both the platform and the creators, an accurate prediction of the future curve {P(r), r ∈ (s, 30]} at an early time s is valuable, as it not only reveals whether the campaign will succeed but, more importantly, suggests timing along (s, 30] for potential intervention by the creators and the platform to boost the fundraising campaign and achieve greater revenue.

The dataset consists of T = 454 campaigns launched between Dec-01-2019 and Dec-31-2019. For each campaign t = 1, . . .
, T, we observe its normalized curve² $N_t(r) = P_t(r)/G_t$ at 60 evenly spaced time points over its 30-day duration and denote it as $\{N_t(r_i)\}_{i=1}^{60}$. See Figure 1 for the normalized curves of six representative campaigns. At a time s ∈ (0, 30), to predict $\{N_t(r), r \in (s, 30]\}$ for campaign t, we employ functional regression, where we treat $\{N_t(r_i), r_i \in (s, 30]\}$ as the functional response $Y_t$, use $\{N_t(r_i), r_i \in [0, s]\}$ as the functional covariate $X_t$ and $(Z_1, Z_2, Z_3)$ as the vector covariate. We compare the performance of RKHS, FDA and PFFR. Note that FDA only allows for one functional covariate; thus for RKHS and PFFR, we implement both the function-on-function regression and the functional regression with mixed predictors (denoted by RKHSmixed and PFFRmixed). The implementation of each method is the same as that in Sections 5.2 and 5.3.

2. Note that the normalized curve $N_t(r)$ is as sufficient as the original curve $P_t(r)$ for monitoring the fundraising progress of each campaign.

Figure 3: Boxplots of RMISE of RKHS and PFFR across 500 experiments under functional regression with mixed predictors with κ = 1 (Scenarios A and B at the three settings of (n, T, q)). Red points denote the average RMISE.

Figure 4: Boxplots of RMISE of RKHS and PFFR across 500 experiments under functional regression with mixed predictors with κ = 1 for p = 10, 50, 100, 200. Red points denote the average RMISE. For plot (C), which plots MISE against log(p) for RKHS, MISE = RMISE² and the two lines are fitted via OLS.

We vary the prediction time s such that s = 7th, 8th, . . . , 20th day of a campaign. Note that we stop at s = 20 as predictions made in the late stage of a campaign are not as useful as early forecasts. To assess the out-of-sample performance of each method, we use a 2-fold CV, where we partition the 454 campaigns into two equal-sized sets, use one set to train the functional regression and the other to test the prediction performance, and then switch the roles of the two sets. For each campaign t in the test set, given its prediction $\{\widehat{N}_t(r), r \in (s, 30]\}$ generated at time s, we calculate its RMSE and MAE with respect to the true values $\{N_t(r_i), r_i \in (s, 30]\}$, where
$$\mathrm{RMSE}_{t,s} = \sqrt{\frac{1}{\#\{r_i \in (s, 30]\}}\sum_{r_i \in (s, 30]}\big(\widehat{N}_t(r_i) - N_t(r_i)\big)^2} \quad \text{and} \quad \mathrm{MAE}_{t,s} = \frac{1}{\#\{r_i \in (s, 30]\}}\sum_{r_i \in (s, 30]}\big|\widehat{N}_t(r_i) - N_t(r_i)\big|.$$
Figure 5 visualizes $\mathrm{RMSE}_s = T^{-1}\sum_{t=1}^T \mathrm{RMSE}_{t,s}$ and $\mathrm{MAE}_s = T^{-1}\sum_{t=1}^T \mathrm{MAE}_{t,s}$ achieved by the different functional regression methods across s ∈ {7, 8, . . . , 20}. In general, the two RKHS-based estimators consistently achieve the best prediction accuracy. As expected, the performance of all methods improves as s approaches 20. Interestingly, the functional regression with mixed predictors does not seem to improve the prediction performance, which is especially evident for PFFR. Thanks to the group Lasso-type penalty, RKHS can perform variable selection on the vector covariate. Indeed, among the 28 (2 folds × 14 days) estimated functional regression models based on RKHSmixed, 19 models select no vector covariate and thus reduce to the function-on-function regression, suggesting the potential irrelevance of the vector covariate.
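The per-campaign metrics above can be computed directly on the observation grid; the curves below are hypothetical placeholders, not data from the application:

```python
import numpy as np

r = np.linspace(0.5, 30.0, 60)            # 60 evenly spaced time points on (0, 30]
s = 14.0                                  # day on which the prediction is made
rng = np.random.default_rng(0)
N_true = np.cumsum(rng.uniform(0.0, 0.02, 60))    # a normalized pledge curve (toy)
N_hat = N_true + rng.normal(0.0, 0.01, 60)        # its prediction (toy)

# RMSE_{t,s} and MAE_{t,s} over the grid points r_i in (s, 30].
mask = r > s
rmse_ts = np.sqrt(np.mean((N_hat[mask] - N_true[mask]) ** 2))
mae_ts = np.mean(np.abs(N_hat[mask] - N_true[mask]))
assert mae_ts <= rmse_ts                  # Jensen: MAE never exceeds RMSE
```

Averaging these quantities over the test-set campaigns for each $s$ yields the $\mathrm{RMSE}_s$ and $\mathrm{MAE}_s$ curves shown in Figure 5.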
We further provide a robustness check of the above analysis in Appendix H, where we repeat the 2-fold CV procedure 100 times for RKHS, FDA and PFFR. It is seen that RKHS consistently provides the best performance, confirming the robustness of our findings. We refer to Appendix H for more details. For more intuition, Figure 1 plots the normalized fundraising curves $\{N_t(r_i)\}_{i=1}^{60}$ of six representative campaigns and further visualizes the functional predictions given by RKHS, FDA and PFFR at day $s = 14$. Note that FDA and RKHS provide similar predictions while the prediction of PFFR seems to be more volatile. This is indeed consistent with the estimated bivariate coefficient functions visualized in Figure 6, where the estimates $\widehat{A}(r, s)$ of RKHS and FDA resemble each other while PFFR seems to suffer from under-smoothing. Figure 11 in Appendix G further gives the prediction performance of the classical time series model ARIMA, which is significantly worse than the predictions given by RKHS, FDA and PFFR, suggesting the advantage of functional regression for handling the current application.

6. Conclusion

In this paper, we study a functional linear regression model with functional responses that accommodates both functional and high-dimensional vector covariates. We provide a minimax lower bound on its excess risk. To match the lower bound, we propose an RKHS-based penalized least squares estimator, which is based on a generalization of the representer lemma and achieves the optimal upper bound on the excess risk. Our framework allows for partially observed functional variables and provides finite sample probability bounds.
Figure 5: Average RMSE and MAE achieved by different functional regression methods (RKHS, RKHS_mixed, PFFR, PFFR_mixed and FDA), plotted against the day $s$ when the prediction is made.

Figure 6: Estimated bivariate coefficient functions $\widehat{A}(r, s)$ by RKHS, FDA and PFFR at Day 14 ($(r, s)$ is rescaled to $[0, 1]^2$).

Furthermore, the result unveils an interesting phase transition between a high-dimensional parametric rate and a nonparametric rate in terms of the excess risk. A novel iterative coordinate descent based algorithm is proposed to efficiently solve the penalized least squares problem. Simulation studies and real data applications are further conducted, where the proposed estimator is seen to provide favorable performance compared to existing methods in the literature.

Throughout the paper, we assume knowledge of the kernel functions $K$ and $K_{\beta}$. In practice, it would be ideal if one could learn the kernel functions from data. However, to the best of our knowledge, knowledge of the true kernel functions is assumed throughout the RKHS-based functional data analysis literature. Learning kernels is beyond the scope of our paper and we would like to pursue this direction in future research. We would also like to point out that our estimator $(\widehat{A}, \widehat{\beta})$ is a constrained estimator, with the constraints related to the true kernels. A mis-specification might lead to over- or under-penalization, which would affect the final prediction error rate. Finally, we briefly discuss important distinctions between non-parametric regression and functional regression for interested readers.
To make the comparison easier, we consider a simpler setting: the scalar response functional linear regression and the classical nonparametric regression. In non-parametric regression, we have $y_i = f(x_i) + \epsilon_i$, where $\{x_i\}_{i=1}^{n}$ are assumed to be random or fixed grid points in $[0, 1]$. In scalar response functional linear regression with fully observed functional data, we have
$$y_i = \int X_i(s)\beta(s)\,ds + \epsilon_i, \tag{23}$$
where $\{X_i\}_{i=1}^{n}$ are assumed to be i.i.d. stochastic processes in $L^2$. Note that we can treat $X_i$ and $\beta$ as infinite-dimensional vectors w.r.t. some $L^2$ basis system; thus model (23) can be rewritten as $y_i = \widetilde{X}_i^{\top}\widetilde{\beta} + \epsilon_i$, where $\widetilde{X}_i \in \mathbb{R}^{\infty}$, $\widetilde{\beta} \in \mathbb{R}^{\infty}$. Under the standard assumption that $E(\|X_i\|_{L^2}^2) < \infty$, the eigenvalue sequence of the covariance matrix $\widetilde{\Sigma}_X = E(\widetilde{X}_i\widetilde{X}_i^{\top})$ converges to 0. As such, model (23) was described as a high-dimensional or infinitely-dimensional ill-posed linear regression model in Hall and Horowitz (2007), which is fundamentally different from the standard non-parametric regression.

Acknowledgments

We would like to thank the action editor, Dr. Pradeep Ravikumar, as well as the four anonymous reviewers for their thoughtful assessment and constructive comments, which helped us to improve the quality and the presentation of our paper. Z.Z. would like to acknowledge support from NSF DMS-2014053. Y.Y. would like to acknowledge support from DMS-EPSRC EP/V013432/1 and EPSRC EP/W003716/1. R.W. would like to acknowledge support from AFOSR FA9550-18-1-0166, DOE DE-AC02-06CH113575, NSF DMS-1925101, ARO W911NF-17-1-0357, and NGA HM0476-17-1-2003.

References

Alexander Aue, Diogo Dubart Norinho, and Siegfried Hörmann. On the prediction of stationary functional time series. Journal of the American Statistical Association, 110(509):378–392, 2015.

Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

Neil Bathia, Qiwei Yao, and Flavio Ziegelmann.
Identifying the finite dimensionality of curve time series. The Annals of Statistics, 38(6):3352–3386, 2010.

Michal Benko. Functional data analysis with applications in finance. Humboldt-Universität zu Berlin, Wirtschaftswissenschaftliche Fakultät, 2007.

Simon J Bonner, Nathaniel K Newlands, and Nancy E Heckman. Modeling regional impacts of climate teleconnections using functional data analysis. Environmental and Ecological Statistics, 21(1):1–26, 2014.

Haim Brezis. Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer Science & Business Media, 2010.

Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.

T Tony Cai and Peter Hall. Prediction in functional linear regression. The Annals of Statistics, 34(5):2159–2179, 2006.

T Tony Cai and Ming Yuan. Optimal estimation of the mean function based on discretely sampled functional data: Phase transition. The Annals of Statistics, 39(5):2330–2355, 2011.

T Tony Cai and Ming Yuan. Minimax and adaptive prediction for functional linear regression. Journal of the American Statistical Association, 107(499):1201–1216, 2012.

Tony Cai and Ming Yuan. Nonparametric covariance function estimation for functional and longitudinal data. University of Pennsylvania and Georgia Institute of Technology, 2010.

Hervé Cardot, Frédéric Ferraty, and Pascal Sarda. Spline estimators for the functional linear model. Statistica Sinica, pages 571–591, 2003.

Kehui Chen, Pedro Francisco Delicado Useros, and Hans-Georg Müller. Modelling function-valued stochastic processes, with applications to fertility dynamics. Journal of the Royal Statistical Society, Series B, Statistical Methodology, 79(1):177–196, 2017.

Jeng-Min Chiou, Yi-Chen Zhang, Wan-Hui Chen, and Chiung-Wen Chang. A functional data approach to missing value imputation and outlier detection for traffic flow data.
Transportmetrica B: Transport Dynamics, 2(2):106–129, 2014.

Xiongtao Dai, Hans-Georg Müller, Jane-Ling Wang, and Sean CL Deoni. Age-dynamic networks and functional correlation for early white matter myelination. Brain Structure and Function, 224(2):535–551, 2019.

Lawrence C Evans. Partial Differential Equations. Second edition. Vol. 19, Graduate Studies in Mathematics. American Mathematical Society, 2010.

Yingying Fan, Natasha Foutz, Gareth M James, and Wolfgang Jank. Functional response additive model estimation with online virtual stock markets. The Annals of Applied Statistics, 8(4):2435–2460, 2014.

Yingying Fan, Gareth M James, and Peter Radchenko. Functional additive regression. The Annals of Statistics, 43(5):2296–2325, 2015.

Julian J Faraway. Regression analysis for a functional response. Technometrics, 39(3):254–261, 1997.

Frédéric Ferraty and Philippe Vieu. Additive prediction and boosting for functional data. Computational Statistics & Data Analysis, 53(4):1400–1413, 2009.

Ricardo Fraiman, Ana Justel, Regina Liu, and Pamela Llop. Detecting trends in time series of functional data: A study of Antarctic climate change. Canadian Journal of Statistics, 42(4):597–609, 2014.

Jerome H. Friedman, Trevor Hastie, and Robert Tibshirani. A note on the group lasso and a sparse group lasso. arXiv:1001.0736, 2010.

Chong Gu. Smoothing Spline ANOVA Models. Springer-Verlag, New York, 2nd edition, 2013.

Pantelis Z Hadjipantelis, John AD Aston, Hans-Georg Müller, and Jonathan P Evans. Unifying amplitude and phase analysis: A compositional data approach to functional multivariate mixed-effects modeling of Mandarin Chinese. Journal of the American Statistical Association, 110(510):545–559, 2015.

Peter Hall and Joel L Horowitz. Methodology and convergence rates for functional linear regression. The Annals of Statistics, 35(1):70–91, 2007.

Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning.
Springer-Verlag New York, 2nd edition, 2009.

Andrada E Ivanescu, Ana-Maria Staicu, Fabian Scheipl, and Sonja Greven. Penalized function-on-function regression. Computational Statistics, 30(2):539–568, 2015.

Ci-Ren Jiang, Wei Yu, Jane-Ling Wang, et al. Inverse regression for longitudinal data. The Annals of Statistics, 42(2):563–591, 2014.

Alois Kneip, Dominik Poß, and Pascal Sarda. Functional linear regression with points of impact. The Annals of Statistics, 44(1):1–30, 2016.

Vladimir Koltchinskii and Ming Yuan. Sparsity in multiple kernel learning. The Annals of Statistics, 38(6):3660–3695, 2010.

Nan M Laird and James H Ware. Random-effects models for longitudinal data. Biometrics, pages 963–974, 1982.

Yehua Li and Tailen Hsing. Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. The Annals of Statistics, 38(6):3321–3351, 2010.

Hua Liang, Hulin Wu, and Raymond J Carroll. The relationship between virologic and immunologic responses in AIDS clinical research using mixed-effects varying-coefficient models with measurement error. Biostatistics, 4(2):297–312, 2003.

Yi Lin and Ming Yuan. Convergence rates of compactly supported radial basis function regularization. Statistica Sinica, pages 425–439, 2006.

Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637–1664, 2012.

Shahar Mendelson. Geometric parameters of kernel machines. In International Conference on Computational Learning Theory, pages 29–43. Springer, 2002.

James Mercer. XVI. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, Series A, 209(441–458):415–446, 1909.

Jeffrey S Morris. Functional regression.
Annual Review of Statistics and Its Application, 2:321–359, 2015.

Byeong U Park, Chun-Jui Chen, Wenwen Tao, and Hans-Georg Müller. Singular additive models for function to function regression. Statistica Sinica, 28(4):2497–2520, 2018.

Alexander Petersen, Sean Deoni, and Hans-Georg Müller. Fréchet estimation of time-varying covariance matrices from sparse data, with application to the regional co-evolution of myelination in the developing brain. The Annals of Applied Statistics, 13(1):393–419, 2019.

James O Ramsay and James B Ramsey. Functional data analysis of the dynamics of the monthly index of nondurable goods production. Journal of Econometrics, 107(1-2):327–344, 2002.

J.O. Ramsay and B.W. Silverman. Functional Data Analysis. Springer-Verlag, New York, 2005.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13(2), 2012.

Sarah J Ratcliffe, Gillian Z Heller, and Leo R Leader. Functional data analysis with application to periodically stimulated foetal heart rate data. II: Functional logistic regression. Statistics in Medicine, 21(8):1115–1127, 2002.

Matthew Reimherr, Bharath Sriperumbudur, Bahaeddine Taoufik, et al. Optimal prediction for additive function-on-function regression. Electronic Journal of Statistics, 12(2):4571–4601, 2018.

Matthew Reimherr, Bharath Sriperumbudur, and Hyun Bin Kang. Optimal function-on-scalar regression over complex domains. arXiv preprint arXiv:1902.07284, 2019.
Xiaoxiao Sun, Pang Du, Xiao Wang, and Ping Ma. Optimal penalized function-on-function regression under a reproducing kernel Hilbert space framework. Journal of the American Statistical Association, 113(524):1601–1611, 2018.

Shahin Tavakoli, Davide Pigoli, John AD Aston, and John S Coleman. A spatial modeling approach for linguistic object data: Analyzing dialect sound variations across Great Britain. Journal of the American Statistical Association, 114(527):1081–1096, 2019.

Carlos Valencia and Ming Yuan. Radial basis function regularization for linear inverse problems with random noise. Journal of Multivariate Analysis, 116:92–108, 2013.

Anne van Delft, Vaidotas Characiejus, and Holger Dette. A nonparametric test for stationarity in functional time series. arXiv preprint arXiv:1708.05248, 2017.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Isaac Michael Wagner-Muns, Ivan G Guardiola, VA Samaranayke, and Wasim Irshad Kayani. A functional data analysis approach to traffic volume forecasting. IEEE Transactions on Intelligent Transportation Systems, 19(3):878–888, 2017.

Grace Wahba. Spline Models for Observational Data. SIAM, 1990.

Daren Wang, Zifeng Zhao, Rebecca Willett, and Chun Yip Yau. Functional autoregressive processes in reproducing kernel Hilbert spaces. arXiv preprint arXiv:2011.13993, 2020a.

Jane-Ling Wang, Jeng-Min Chiou, and Hans-Georg Müller. Functional data analysis. Annual Review of Statistics and Its Application, 3:257–295, 2016.

Jiayi Wang, Raymond KW Wong, and Xiaoke Zhang. Low-rank covariance function estimation for multidimensional functional data. Journal of the American Statistical Association, pages 1–14, 2020b.

Yu-Xiang Wang, Alex Smola, and Ryan Tibshirani. The falling factorial basis and its statistical applications. In International Conference on Machine Learning, pages 730–738. PMLR, 2014.

Stephen J. Wright. Coordinate descent algorithms.
Mathematical Programming, 151:3–34, 2015.

Colin O Wu and Chin-Tsang Chiang. Kernel smoothing on varying coefficient models with longitudinal dependent variable. Statistica Sinica, pages 433–456, 2000.

Colin O Wu, Chin-Tsang Chiang, and Donald R Hoover. Asymptotic confidence regions for kernel smoothing of a varying-coefficient model with longitudinal data. Journal of the American Statistical Association, 93(444):1388–1402, 1998.

Ming-Yi Yan. Economic data analysis: an approach based on functional viewpoint of data. Modern Economic Science, 1, 2007.

Fang Yao, Hans-Georg Müller, and Jane-Ling Wang. Functional linear regression analysis for longitudinal data. The Annals of Statistics, pages 2873–2903, 2005.

Bin Yu. Assouad, Fano, and Le Cam. In D. Pollard, E. Torgersen, and G.L. Yang, editors, Festschrift for Lucien Le Cam, pages 423–435. Springer, New York, NY, 1997.

Ming Yuan and T Tony Cai. A reproducing kernel Hilbert space approach to functional linear regression. The Annals of Statistics, 38(6):3412–3444, 2010.

Appendix A provides more discussions on the connection between bivariate functions and compact linear operators, and on Assumptions 2 and 3. Appendix B collects proofs of results in Section 3. Appendix C contains a series of deviation bounds, which are interesting per se. Appendix D contains the proof of Theorem 7. Appendix E includes additional details on technical proofs. Appendix F collects additional details on optimization. Additional numerical details are exhibited in Appendices G, H, I, J, K and L.

Appendix A. More discussions on bivariate functions, Assumptions 2 and 3

A.1 Bivariate functions and compact linear operators

Remark 9 For any compact linear operator $A_2 : H(K) \to H(K)$, denote $A_2[f, g] = \langle A_2[g], f\rangle_{H(K)}$ for $f, g \in H(K)$. Note that $A_2[f, g]$ is well defined for any $f, g \in H(K)$ due to the compactness of $A_2$. Define
$$a_{ij} = A_2[\psi_i, \psi_j] = \langle A_2[\psi_j], \psi_i\rangle_{H(K)}, \quad i, j \in \mathbb{N}.$$
We thus have, for any $f, g \in H(K)$,
$$A_2[f, g] = \langle A_2[g], f\rangle_{H(K)} = \sum_{i=1}^{\infty}\langle f, \psi_i\rangle_{H(K)}\langle A_2[g], \psi_i\rangle_{H(K)} = \sum_{i=1}^{\infty}\langle f, \psi_i\rangle_{H(K)}\Big\langle A_2\Big[\sum_{j=1}^{\infty}\langle g, \psi_j\rangle_{H(K)}\psi_j\Big], \psi_i\Big\rangle_{H(K)}$$
$$= \sum_{i,j=1}^{\infty}\langle f, \psi_i\rangle_{H(K)}\langle g, \psi_j\rangle_{H(K)}\langle A_2[\psi_j], \psi_i\rangle_{H(K)} = \sum_{i,j=1}^{\infty}a_{ij}\langle f, \psi_i\rangle_{H(K)}\langle g, \psi_j\rangle_{H(K)},$$
where the fourth identity follows from the compactness of $A_2$. This justifies Equation (2).

A.2 Assumption 2(d)

In Assumption 2(d), we assume that $E(\langle X_1, f\rangle_{L^2}Z_1^{\top}v) \leq \frac{\sqrt{3}}{2}\sqrt{E\{\langle X_1, f\rangle_{L^2}^2\}(v^{\top}\Sigma_Z v)}$, i.e. the functional and vector covariates are allowed to be correlated up to $O(1)$. In this subsection, we emphasize that the correlation cannot be equal to one. We first note that
$$E(\langle X_1, f\rangle_{L^2}Z_1^{\top}v) \leq \sqrt{E\{\langle X_1, f\rangle_{L^2}^2\}(v^{\top}\Sigma_Z v)}. \tag{24}$$
For model identification, we require the inequality in (24) to be strict. To see this, we suppose that there exist $g \in H(K)$ and $u \in \mathbb{R}^p$ such that
$$E(\langle X_1, g\rangle_{L^2}Z_1^{\top}u) = \sqrt{E\{\langle X_1, g\rangle_{L^2}^2\}(u^{\top}\Sigma_Z u)}.$$
Since $\langle X_1, g\rangle_{L^2}$ and $Z_1^{\top}u$ are both normal random variables, the above equality implies that the correlation between $\langle X_1, g\rangle_{L^2}$ and $Z_1^{\top}u$ is 1 and thus $\langle X_1, g\rangle_{L^2} = aZ_1^{\top}u$ for some constant $a \in \mathbb{R}$. This means that the model defined in (1) is no longer identifiable, because the covariates are perfectly correlated. Even in finite-dimensional regression problems, if the covariates are perfectly correlated, the solution to the linear system is ill-defined.

A.3 Assumption 3(b)

Recall that Assumption 3 is required for handling the case where the functional observations are on discrete sample points. Assumption 3(a) formalizes the sampling scheme and Assumption 3(b) requires that the second moment of the random variable $\|X\|_{W^{\alpha,2}}$ is finite. In the following, via a concrete example, we show that the second moment assumption in (7) holds under mild conditions. Specifically, in Example 1 we provide a simple sufficient condition on the eigen-decay of the covariance operator $\Sigma_X$, under which (7) holds for $W^{\alpha,2}$ with $\alpha > 1/2$.
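As a quick numerical sanity check of this sufficient condition (a sketch we add here, not part of the paper's analysis), one can verify that under the assumed eigen-decay $\sigma_k^2 = k^{-2\alpha-1-\nu}$ the series $\sum_k \sigma_k^2 k^{2\alpha} = \sum_k k^{-1-\nu}$ has bounded partial sums:

```python
import numpy as np

def second_moment_partial_sums(alpha, nu, K):
    """Partial sums of sum_k sigma_k^2 * k^(2*alpha), under the assumed
    eigen-decay sigma_k^2 = k^(-2*alpha - 1 - nu); the summand reduces
    to k^(-1 - nu), which is summable for any nu > 0."""
    k = np.arange(1, K + 1, dtype=float)
    terms = k ** (-2 * alpha - 1 - nu) * k ** (2 * alpha)
    return np.cumsum(terms)

# Illustrative values: alpha slightly above 1/2 and a small nu.
alpha, nu = 0.6, 0.5
partial = second_moment_partial_sums(alpha, nu, 100_000)
# The increments shrink rapidly, so the bound E||X_1||^2_{W^{alpha,2}} < infinity
# holds under this decay; with nu = 0 the series would be the divergent harmonic sum.
```

This mirrors the conclusion of Example 1: convergence only requires the eigenvalues of $\Sigma_X$ to decay polynomially a bit faster than $k^{-2\alpha-1}$.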
Example 1 Let $\alpha > 1/2$ and suppose the Sobolev space $W^{\alpha,2}$ is generated by the kernel
$$K_{\alpha}(r, s) = \sum_{k=1}^{\infty}\omega_k\psi_k^{\alpha}(r)\psi_k^{\alpha}(s),$$
where, due to the property of the Sobolev space, we have $\omega_k \asymp k^{-2\alpha}$ and $\{\psi_k^{\alpha}\}_{k=1}^{\infty}$ can be taken as the Fourier basis such that $\|\psi_k^{\alpha}\|_{W^{\alpha,2}}^2 \asymp k^{2\alpha}$. To better understand the second moment condition $E(\|X_1\|_{W^{\alpha,2}}^2) < \infty$, we proceed by following the same strategy as in Cai and Hall (2006) and Yuan and Cai (2010) and assume that $K_{\alpha}$ and $\Sigma_X$ are perfectly aligned, i.e., they share the same set of eigen-functions. Let $z_k = \langle X_1, \psi_k^{\alpha}\rangle_{L^2}$. We have that $\{z_k\}_{k=1}^{\infty}$ is a collection of independent Gaussian random variables such that $z_k \sim N(0, \sigma_k^2)$, where $\{\sigma_k^2\}_{k=1}^{\infty}$ are the eigenvalues of $\Sigma_X$. Thus we have
$$\|X_1\|_{W^{\alpha,2}}^2 = \sum_{k=1}^{\infty}z_k^2\|\psi_k^{\alpha}\|_{W^{\alpha,2}}^2 \asymp \sum_{k=1}^{\infty}z_k^2k^{2\alpha}, \qquad E(\|X_1\|_{W^{\alpha,2}}^2) \asymp \sum_{k=1}^{\infty}E(z_k^2)k^{2\alpha} = \sum_{k=1}^{\infty}\sigma_k^2k^{2\alpha}.$$
To ensure that $E(\|X_1\|_{W^{\alpha,2}}^2) < \infty$, it suffices to have $\sigma_k^2 \lesssim k^{-2\alpha-1-\nu}$ for any constant $\nu > 0$. Note that when $\alpha$ is close to $1/2$, this means that the eigenvalues of $\Sigma_X$ decay at a polynomial rate slightly faster than 2.

Appendix B. Proofs of the results in Section 3

B.1 Proof of Theorem 1

Proof [Proof of Theorem 1] For any linear subspaces $R, S \subseteq H(K)$, let $P_R$ and $P_S$ be the projection mappings onto the spaces $R$ and $S$, respectively, with respect to $H(K)$. For any $f, g \in H(K)$ and any compact linear operator $A : H(K) \to H(K)$, denote $A|_{R \times S}[f, g] = A[P_Rf, P_Sg]$. Let $(\widehat{B}, \widehat{\alpha})$ be any solution to (4). Let $Q = \operatorname{span}\{K_{\beta}(r_j, \cdot)\}_{j=1}^{n} \subseteq H(K_{\beta})$, $R = \operatorname{span}\{K(r_j, \cdot)\}_{j=1}^{n} \subseteq H(K)$ and $S = \operatorname{span}\{K(s_j, \cdot)\}_{j=1}^{n} \subseteq H(K)$. Denote $\widehat{\beta}_l = P_Q(\widehat{\alpha}_l)$, $l \in \{1, \ldots, p\}$, and $\widehat{A}[\cdot, \cdot] = \widehat{B}|_{R \times S}[\cdot, \cdot] = \widehat{B}[P_R\,\cdot, P_S\,\cdot]$. Let $S^{\perp}$ and $R^{\perp}$ be the orthogonal complements of $S$ and $R$ in $H$, respectively. Then for any compact linear operator $A$, we have the decomposition
$$A = A|_{R \times S} + A|_{R^{\perp} \times S} + A|_{R \times S^{\perp}} + A|_{R^{\perp} \times S^{\perp}}.$$
Due to Lemma 10, there exist $\{a_{ij}\}_{i,j=1}^{n} \subseteq \mathbb{R}$ such that
$$A|_{R \times S}[f, g] = \sum_{i,j=1}^{n}a_{ij}\langle K(r_i, \cdot), f\rangle_{H(K)}\langle K(s_j, \cdot), g\rangle_{H(K)}.$$
Step 1.
In this step, it is shown that for any compact linear operator $A$, its associated bivariate function $A(\cdot, \cdot)$ satisfies that $\{A(r_i, s_j)\}_{i,j=1}^{n}$ only depends on $A|_{R \times S}$. The details of the bivariate function are explained in Appendix A.1. Observe that
$$A|_{R^{\perp} \times S^{\perp}}(r_i, s_j) = A|_{R^{\perp} \times S^{\perp}}[K(r_i, \cdot), K(s_j, \cdot)] = A[P_{R^{\perp}}K(r_i, \cdot), P_{S^{\perp}}K(s_j, \cdot)] = 0.$$
Similar arguments also lead to $A|_{R^{\perp} \times S}[K(r_i, \cdot), K(s_j, \cdot)] = 0$ and $A|_{R \times S^{\perp}}[K(r_i, \cdot), K(s_j, \cdot)] = 0$, for $i, j = 1, \ldots, n$.

Step 2. By Step 1, it holds that $\widehat{A}(r_i, s_j) = \widehat{B}(r_i, s_j)$, $i, j = 1, \ldots, n$. By Lemma 10, there exist $\{\widehat{a}_{ij}\}_{i,j=1}^{n} \subseteq \mathbb{R}$ such that for any $f, g \in H(K)$,
$$\widehat{A}[f, g] = \sum_{i,j=1}^{n}\widehat{a}_{ij}\langle K(r_i, \cdot), f\rangle_{H(K)}\langle K(s_j, \cdot), g\rangle_{H(K)}.$$
Therefore the associated bivariate function satisfies that, for all $r, s \in [0, 1]$,
$$\widehat{A}(r, s) = \sum_{i,j=1}^{\infty}\widehat{A}[\psi_i, \psi_j]\psi_i(r)\psi_j(s) = \sum_{i,j=1}^{\infty}\sum_{k,l=1}^{n}\widehat{a}_{kl}\psi_i(r_k)\psi_j(s_l)\psi_i(r)\psi_j(s) = \sum_{k,l=1}^{n}\widehat{a}_{kl}\Big(\sum_{i=1}^{\infty}\psi_i(r)\psi_i(r_k)\Big)\Big(\sum_{j=1}^{\infty}\psi_j(s)\psi_j(s_l)\Big) = \sum_{k,l=1}^{n}\widehat{a}_{kl}K(r, r_k)K(s, s_l),$$
where $K(r, s) = \sum_{i=1}^{\infty}\psi_i(r)\psi_i(s)$ is used in the last identity.

Step 3. For any $j \in \{1, \ldots, n\}$ and $l \in \{1, \ldots, p\}$, by the definition of $\widehat{\beta}_l$, we have that
$$\widehat{\alpha}_l(r_j) = \langle\widehat{\alpha}_l, K_{\beta}(r_j, \cdot)\rangle_{H(K_{\beta})} = \langle\widehat{\beta}_l, K_{\beta}(r_j, \cdot)\rangle_{H(K_{\beta})} = \widehat{\beta}_l(r_j).$$
Therefore
$$\frac{1}{n}\sum_{j=1}^{n}w_s(j)\widehat{A}(r_i, s_j)X_t(s_j) + \langle\widehat{\beta}(r_i), Z_t\rangle_p = \frac{1}{n}\sum_{j=1}^{n}w_s(j)\widehat{B}(r_i, s_j)X_t(s_j) + \langle\widehat{\alpha}(r_i), Z_t\rangle_p.$$
In addition, by Lemma 10 we have that $\|\widehat{A}\|_{F(K)} \leq \|\widehat{B}\|_{F(K)}$ and $\|\widehat{\beta}_l\|_{H(K_{\beta})} \leq \|\widehat{\alpha}_l\|_{H(K_{\beta})}$ for all $l \in \{1, \ldots, p\}$, which completes the proof.

Lemma 10 Let $A : H(K) \to H(K)$ be any Hilbert–Schmidt operator. Let $\{v_i, v_i'\}_{i=1}^{m} \subseteq H(K)$ be a collection of functions in $H(K)$. Let $R$ and $S$ be the subspaces of $H(K)$ spanned by $\{v_i\}_{i=1}^{m}$ and $\{v_i'\}_{i=1}^{m}$, respectively. Let $P_R$ and $P_S$ be the projection mappings onto the spaces $R$ and $S$, respectively. Then there exist $\{a_{ij}\}_{i,j=1}^{m} \subseteq \mathbb{R}$, which are not necessarily unique, such that for any $f, g \in H(K)$,
$$A|_{R \times S}[f, g] = A[P_Rf, P_Sg] = \sum_{i,j=1}^{m}a_{ij}\langle f, v_i\rangle_{H(K)}\langle g, v_j'\rangle_{H(K)},$$
and $\|A|_{R \times S}\|_{F(K)} \leq \|A\|_{F(K)}$.

Proof Let $\{u_k\}_{k=1}^{K}$ and $\{u_l'\}_{l=1}^{L}$ be orthogonal bases of the subspaces $R$ and $S$ of $L^2$, respectively, with $K, L \leq m$.
Since each $u_i$ (respectively $u_i'$) can be written as a linear combination of $\{v_k\}_{k=1}^{m}$ (respectively $\{v_k'\}_{k=1}^{m}$), it suffices to show that there exist $\{b_{ij}\}_{i=1,j=1}^{K,L} \subseteq \mathbb{R}$ such that for any $f, g \in H(K)$,
$$A|_{R \times S}[f, g] = \sum_{i=1}^{K}\sum_{j=1}^{L}b_{ij}\langle u_i, f\rangle_{H(K)}\langle u_j', g\rangle_{H(K)}.$$
Since $R$ and $S$ are linear subspaces of $H(K)$, there exist $\{u_k\}_{k=K+1}^{\infty}, \{u_l'\}_{l=L+1}^{\infty} \subseteq H(K)$ such that $\{u_k\}_{k=1}^{\infty}$ and $\{u_l'\}_{l=1}^{\infty}$ are two orthogonal bases of $H(K)$, and that
$$A[f, g] = \sum_{i,j=1}^{\infty}b_{ij}\langle u_i, f\rangle_{H(K)}\langle u_j', g\rangle_{H(K)},$$
where $b_{ij} = A[u_i, u_j'] = \langle A[u_j'], u_i\rangle_{H(K)}$. Therefore
$$A|_{R \times S}[f, g] = \sum_{i,j=1}^{\infty}b_{ij}\langle u_i, P_Rf\rangle_{H(K)}\langle u_j', P_Sg\rangle_{H(K)} = \sum_{i,j=1}^{\infty}b_{ij}\langle P_Ru_i, f\rangle_{H(K)}\langle P_Su_j', g\rangle_{H(K)} = \sum_{k=1}^{K}\sum_{l=1}^{L}b_{kl}\langle u_k, f\rangle_{H(K)}\langle u_l', g\rangle_{H(K)}.$$
Moreover, we have that
$$\|A|_{R \times S}\|_{F(K)}^2 = \sum_{k=1}^{K}\sum_{l=1}^{L}b_{kl}^2 \leq \sum_{k,l=1}^{\infty}b_{kl}^2 = \|A\|_{F(K)}^2,$$
which concludes the proof.

B.2 Proof of Theorem 4

Proof [Proof of Theorem 4] First note that $0 < \delta_T, \zeta_n < 1$; then $\delta_T^2 \leq \delta_T$ and $\zeta_n^2 \leq \zeta_n$. These inequalities will be used repeatedly in the rest of this proof. Recall the notation $\Delta_{\beta}(r) = \widehat{\beta}(r) - \beta^{*}(r)$ and $\Delta_A(r, s) = \widehat{A}(r, s) - A^{*}(r, s)$, $r, s \in [0, 1]$. Note that for any $g \in L^2$, it holds that
$$E(\langle g, X\rangle_{L^2}^2) = \iint_{[0,1]^2}g(r)\Sigma_X(r, s)g(s)\,dr\,ds.$$
The above expression will also be used repeatedly in the rest of the proof. Observe that for any $r \in [0, 1]$, it holds that
$$\|\Delta_A(r, \cdot)\|_{W^{\alpha,2}} \leq \|\Delta_A(r, \cdot)\|_{H(K)} \leq C_K\|\Delta_A\|_{F(K)} \leq 2C_KC_A,$$
where the first inequality follows from Assumption 3(b), which assumes that $H(K) \subseteq W^{\alpha,2}$, the second inequality follows from Theorem 36 and the last inequality follows from the assumption that $\|A^{*}\|_{F(K)}, \|\widehat{A}\|_{F(K)} \leq C_A$. Throughout the proof, we assume without loss of generality that $C_K = 1$. In addition, observe that by Theorem 33, it holds that
$$P\big(\|X_t\|_{W^{\alpha,2}} \geq 4C_X\sqrt{\log(T)}\big) \leq T^{-6}.$$
By a union bound argument,
$$P(\mathcal{D}) = P\big(\|X_t\|_{W^{\alpha,2}} \leq 4C_X\sqrt{\log(T)} \text{ for all } 1 \leq t \leq T\big) \geq 1 - T^{-5}.$$
The rest of the proof is shown on the event $\mathcal{D}$. Let $\lambda = C_{\lambda}\sqrt{\log(pT)/T}$ for a sufficiently large constant $C_{\lambda}$.
From the minimizer property, we have that
$$\sum_{t=1}^{T}\sum_{i=1}^{n}w_r(i)\Big(Y_t(r_i) - \frac{1}{n}\sum_{j=1}^{n}w_s(j)\widehat{A}(r_i, s_j)X_t(s_j) - \langle Z_t, \widehat{\beta}(r_i)\rangle_p\Big)^2 + Tn\lambda\sum_{j=1}^{p}\|\widehat{\beta}_j\|_n \leq \sum_{t=1}^{T}\sum_{i=1}^{n}w_r(i)\Big(Y_t(r_i) - \frac{1}{n}\sum_{j=1}^{n}w_s(j)A^{*}(r_i, s_j)X_t(s_j) - \langle Z_t, \beta^{*}(r_i)\rangle_p\Big)^2 + Tn\lambda\sum_{j=1}^{p}\|\beta^{*}_j\|_n,$$
which implies that (recall $\Delta_A = \widehat{A} - A^{*}$ and $\Delta_{\beta} = \widehat{\beta} - \beta^{*}$)
$$\frac{1}{Tn}\sum_{t=1}^{T}\sum_{i=1}^{n}w_r(i)\Big\{\Big(\frac{1}{n}\sum_{j=1}^{n}w_s(j)\Delta_A(r_i, s_j)X_t(s_j)\Big)^2 + \langle\Delta_{\beta}(r_i), Z_t\rangle_p^2\Big\} \tag{26}$$
$$\leq \frac{2}{Tn}\Big|\sum_{t=1}^{T}\sum_{i=1}^{n}w_r(i)\Big(\frac{1}{n}\sum_{j=1}^{n}w_s(j)\Delta_A(r_i, s_j)X_t(s_j)\Big)\langle\Delta_{\beta}(r_i), Z_t\rangle_p\Big| \tag{27}$$
$$+ \frac{2}{Tn}\sum_{t=1}^{T}\sum_{i=1}^{n}w_r(i)\Big(\frac{1}{n}\sum_{j=1}^{n}w_s(j)\Delta_A(r_i, s_j)X_t(s_j) + \langle\Delta_{\beta}(r_i), Z_t\rangle_p\Big)\epsilon_t(r_i) \tag{28}$$
$$+ \lambda\sum_{j=1}^{p}\|\beta^{*}_j\|_n - \lambda\sum_{j=1}^{p}\|\widehat{\beta}_j\|_n. \tag{29}$$
In the rest of the proof, for readability, we will assume that $w_r(i) = 1$ for all $1 \leq i \leq n$ and $w_s(j) = 1$ for all $1 \leq j \leq n$. This assumption is equivalent to the case that $\{r_i\}_{i=1}^{n}$ and $\{s_j\}_{j=1}^{n}$ are equally spaced grids. We note that, in view of Theorem 40, the general case follows straightforwardly from the same argument and is thus omitted for readability.

Step 1. Observe that
$$\frac{1}{n}\sum_{j=1}^{n}\Delta_A(r_i, s_j)X_t(s_j) = \Big(\frac{1}{n}\sum_{j=1}^{n}\Delta_A(r_i, s_j)X_t(s_j) - \int_{[0,1]}\Delta_A(r_i, s)X_t(s)\,ds\Big) \tag{30}$$
$$+ \int_{[0,1]}\Delta_A(r_i, s)X_t(s)\,ds, \tag{31}$$
so the term (27) splits into a discretization part driven by (30) and an integral part driven by (31). As for the term (30), we have that for any $j \in \{1, \ldots, n\}$ and $t \in \{1, \ldots, T\}$,
$$\|\Delta_A(r_j, \cdot)X_t(\cdot)\|_{W^{\alpha,2}} \leq \|\Delta_A(r_j, \cdot)\|_{W^{\alpha,2}}\|X_t\|_{W^{\alpha,2}} \leq 8C_AC_X\sqrt{\log(T)},$$
where the first inequality follows from Theorem 41 and the event $\mathcal{D}$. By Theorem 40, it holds that for all $1 \leq i \leq n$ and $1 \leq t \leq T$,
$$\Big|\frac{1}{n}\sum_{j=1}^{n}\Delta_A(r_i, s_j)X_t(s_j) - \int_{[0,1]}\Delta_A(r_i, s)X_t(s)\,ds\Big| \leq C_1\zeta_n\sqrt{\log(T)},$$
where $\zeta_n = n^{-\alpha+1/2}$ and $C_1$ depends on $C_A$ and $C_X$ only. So,
$$\frac{2}{Tn}\sum_{t=1}^{T}\sum_{i=1}^{n}C_1\zeta_n\sqrt{\log(T)}\,|\langle\Delta_{\beta}(r_i), Z_t\rangle_p| \leq \frac{1}{640Tn}\sum_{t=1}^{T}\sum_{i=1}^{n}\langle\Delta_{\beta}(r_i), Z_t\rangle_p^2 + 640C_1^2\zeta_n^2\log(T) \leq \frac{1}{640Tn}\sum_{t=1}^{T}\sum_{i=1}^{n}\langle\Delta_{\beta}(r_i), Z_t\rangle_p^2 + C_1'\zeta_n^2\log(T),$$
for some absolute constant $C_1' > 0$. In addition, for Equation (31), note that with probability at least $1 - (pT)^{-4}$,
$$\Big|\frac{1}{n}\sum_{i=1}^{n}\int_{[0,1]}\Delta_A(r_i, s)\Big(\frac{1}{T}\sum_{t=1}^{T}X_t(s)Z_t^{\top} - E\{X_t(s)Z_t^{\top}\}\Big)\Delta_{\beta}(r_i)\,ds\Big| \leq \frac{1}{n}\sum_{i=1}^{n}\|\Delta_A(r_i, \cdot)\|_{L^2}\max_{1 \leq j \leq p}\Big\|\frac{1}{T}\sum_{t=1}^{T}X_t(\cdot)Z_{t,j} - E\{X_t(\cdot)Z_{t,j}\}\Big\|_{L^2}\sum_{j=1}^{p}|\Delta_{\beta,j}(r_i)| \leq \lambda\cdot\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{p}|\Delta_{\beta,j}(r_i)|, \tag{32}$$
where the first inequality holds because $\|\Delta_A(r_i, \cdot)\|_{L^2} \leq \|\Delta_A(r_i, \cdot)\|_{H(K)} \leq 2C_A$ together with Theorem 34, and the last inequality follows from the assumption that $\lambda = C_{\lambda}\sqrt{\log(pT)/T}$ with sufficiently large $C_{\lambda}$.
In addition, by Assumption 2(c),
$$\Big|\int_{[0,1]}\Delta_A(r_i, s)E\{X_t(s)Z_t^{\top}\}\Delta_{\beta}(r_i)\,ds\Big| \leq \frac{\sqrt{3}}{2}\sqrt{\Sigma_X[\Delta_A(r_i, \cdot), \Delta_A(r_i, \cdot)]}\sqrt{\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i)} \leq 20\,\Sigma_X[\Delta_A(r_i, \cdot), \Delta_A(r_i, \cdot)] + \frac{5}{16}\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i) = 20\iint_{[0,1]^2}\Delta_A(r_i, r)\Sigma_X(r, s)\Delta_A(r_i, s)\,dr\,ds + \frac{5}{16}\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i).$$
Moreover, note that $\lambda n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{p}|\Delta_{\beta,j}(r_i)| \leq \lambda\sum_{j=1}^{p}\|\Delta_{\beta,j}\|_n \leq \lambda\sum_{j \in S}\|\Delta_{\beta,j}\|_n + \lambda\sum_{j \in S^c}\|\widehat{\beta}_j\|_n$, since $\beta^{*}_j = 0$ for $j \in S^c$. Putting the calculations for Equation (30) and Equation (31) together, we have that
$$(27) \leq \frac{1}{640Tn}\sum_{t=1}^{T}\sum_{i=1}^{n}\langle\Delta_{\beta}(r_i), Z_t\rangle_p^2 + C_1'\zeta_n^2\log(T) + \lambda\sum_{j \in S}\|\Delta_{\beta,j}\|_n + \lambda\sum_{j \in S^c}\|\widehat{\beta}_j\|_n + \frac{1}{n}\sum_{i=1}^{n}\Big\{20\iint_{[0,1]^2}\Delta_A(r_i, r)\Sigma_X(r, s)\Delta_A(r_i, s)\,dr\,ds + \frac{5}{8}\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i)\Big\}. \tag{33}$$

Step 2. It follows from (43) that with probability at least $1 - T^{-4}$, it holds that
$$\Big|\frac{1}{Tn}\sum_{t=1}^{T}\sum_{i=1}^{n}\Big(\frac{1}{n}\sum_{j=1}^{n}\Delta_A(r_i, s_j)X_t(s_j)\Big)^2 - \frac{1}{n}\sum_{i=1}^{n}\iint_{[0,1]^2}\Delta_A(r_i, r)\Sigma_X(r, s)\Delta_A(r_i, s)\,dr\,ds\Big| \leq C_2\big(\log(T)\zeta_n^2 + \log(T)\delta_T\big), \tag{34}$$
where $C_2 > 0$ is an absolute constant. It follows from (47) and (51) that, with probability at least $1 - 2T^{-4}$, it holds that
$$(28) \leq \frac{1}{320n}\sum_{i=1}^{n}\iint_{[0,1]^2}\Delta_A(r_i, r)\Sigma_X(r, s)\Delta_A(r_i, s)\,dr\,ds + C_2'\big(\log(T)\zeta_n + \log(T)\delta_T\big) + \frac{\lambda}{320}\Big(\sum_{j \in S}\|\Delta_{\beta,j}\|_n + \sum_{j \in S^c}\|\widehat{\beta}_j\|_n\Big),$$
where $C_2' > 0$ is an absolute constant.

Step 3. Note that for Equation (29),
$$\lambda\sum_{j=1}^{p}\|\beta^{*}_j\|_n - \lambda\sum_{j=1}^{p}\|\widehat{\beta}_j\|_n = \lambda\sum_{j \in S}\big(\|\beta^{*}_j\|_n - \|\widehat{\beta}_j\|_n\big) - \lambda\sum_{j \in S^c}\|\widehat{\beta}_j\|_n \leq \lambda\sum_{j \in S}\|\Delta_{\beta,j}\|_n - \lambda\sum_{j \in S^c}\|\widehat{\beta}_j\|_n.$$

Step 4. Putting all previous calculations together leads to
$$\frac{1}{n}\sum_{i=1}^{n}\iint_{[0,1]^2}\Delta_A(r_i, r)\Sigma_X(r, s)\Delta_A(r_i, s)\,dr\,ds + \frac{1}{Tn}\sum_{t=1}^{T}\sum_{i=1}^{n}\langle\Delta_{\beta}(r_i), Z_t\rangle_p^2 - \frac{5}{8n}\sum_{i=1}^{n}\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i) \tag{35}$$
$$\leq \Big(1 + \frac{1}{100} + \frac{1}{320}\Big)\lambda\sum_{j \in S}\|\Delta_{\beta,j}\|_n - \Big(1 - \frac{1}{100} - \frac{1}{320}\Big)\lambda\sum_{j \in S^c}\|\widehat{\beta}_j\|_n + C_3\big(\log(T)\zeta_n + \log(T)\delta_T\big) \leq 2\lambda\sum_{j \in S}\|\Delta_{\beta,j}\|_n + C_3\big(\log(T)\zeta_n + \log(T)\delta_T\big), \tag{36}$$
where $C_3 > 0$ is some absolute constant. For Equation (35), note that by the constraint set $\mathcal{C}_{\beta}$, we have that
$$\sum_{1 \leq j \leq p}\|\widehat{\beta}_j\|_n \leq \sum_{1 \leq j \leq p}\|\widehat{\beta}_j\|_{\infty} \leq \sum_{1 \leq j \leq p}\|\widehat{\beta}_j\|_{H(K_{\beta})} \leq C_{\beta}.$$
Therefore
$$\sum_{1 \leq j \leq p}\|\Delta_{\beta,j}\|_n \leq \sum_{1 \leq j \leq p}\|\widehat{\beta}_j\|_n + \sum_{1 \leq j \leq p}\|\beta^{*}_j\|_n \leq 2C_{\beta}.$$
From Theorem 27, it holds that with probability at least $1 - \exp(-c\sqrt{T})$,
$$\frac{1}{T}\sum_{t=1}^{T}\langle\Delta_{\beta}(r_i), Z_t\rangle_p^2 \geq \frac{639}{640}\,\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i) - C_3'\sqrt{\frac{\log(p)}{T}}\Big(\sum_{j=1}^{p}|\Delta_{\beta,j}(r_i)|\Big)^2 \geq \frac{319}{320}\,\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i) - 4C_3'C_{\beta}^2\sqrt{\frac{\log(p)}{T}}.$$
Substituting the above inequality into Equation (35) gives
$$\frac{1}{n}\sum_{i=1}^{n}\iint_{[0,1]^2}\Delta_A(r_i, r)\Sigma_X(r, s)\Delta_A(r_i, s)\,dr\,ds + \frac{1}{n}\sum_{i=1}^{n}\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i) \leq 2\lambda\sum_{j \in S}\|\Delta_{\beta,j}\|_n + C_3'\Big(\log(T)\zeta_n + \log(T)\delta_T + \sqrt{\frac{\log(p)}{T}}\Big). \tag{37}$$

Step 5. Note that by Assumption 2(c),
$$\Big(\sum_{j \in S}\|\Delta_{\beta,j}\|_n\Big)^2 \leq s\sum_{j \in S}\|\Delta_{\beta,j}\|_n^2 = \frac{s}{n}\sum_{i=1}^{n}\sum_{j \in S}\{\Delta_{\beta,j}(r_i)\}^2 \leq \frac{s}{c_zn}\sum_{i=1}^{n}\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i), \tag{38}$$
where $s = |S|$ is the cardinality of the support set $S$. This implies that
$$2\lambda\sum_{j \in S}\|\Delta_{\beta,j}\|_n \leq \frac{1}{320n}\sum_{i=1}^{n}\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i) + C_4s\lambda^2 \leq \frac{1}{320n}\sum_{i=1}^{n}\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i) + C_4C_{\lambda}^2\frac{s\log(pT)}{T}.$$
Substituting the above inequality into (37) gives
$$\frac{1}{n}\sum_{i=1}^{n}\iint_{[0,1]^2}\Delta_A(r_i, r)\Sigma_X(r, s)\Delta_A(r_i, s)\,dr\,ds + \frac{12}{320n}\sum_{i=1}^{n}\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i) \leq C_4'\Big(\log(T)\zeta_n + \log(T)\delta_T + \sqrt{\frac{\log(p)}{T}} + \frac{s\log(pT)}{T}\Big). \tag{39}$$

Step 6. Note that by definition, $\mathcal{E}(\widehat{A}, \widehat{\beta}) \geq 0$ for any estimators $\widehat{A}$ and $\widehat{\beta}$, and that
$$\mathcal{E}(\widehat{A}, \widehat{\beta}) \leq 4E\int_{[0,1]}\Big(\int_{[0,1]}\Delta_A(r, s)X^{*}(s)\,ds\Big)^2dr + 4E\int_{[0,1]}\big\{(Z^{*})^{\top}\Delta_{\beta}(r)\big\}^2dr = 4\int_{[0,1]}\Big\{\iint_{[0,1]^2}\Delta_A(r, s)\Sigma_X(s, u)\Delta_A(r, u)\,ds\,du + \Delta_{\beta}(r)^{\top}\Sigma_Z\Delta_{\beta}(r)\Big\}dr, \tag{40}$$
where $\Sigma_X$ and $\Sigma_Z$ are the covariances of $X_t$ and $Z_t$, as defined in Assumption 2. It follows from (54) that with probability at least $1 - T^{-4}$,
$$\Big|\frac{1}{n}\sum_{i=1}^{n}\Big(\int_{[0,1]}\Delta_A(r_i, s)X^{*}(s)\,ds\Big)^2 - \int_{[0,1]}\Big(\int_{[0,1]}\Delta_A(r, s)X^{*}(s)\,ds\Big)^2dr\Big| \leq C_5\log(T)\zeta_n, \tag{41}$$
with $C_5 > 0$ being an absolute constant. In addition, with absolute constants $c_5', C_5' > 0$, due to (55), we have that
$$\frac{1}{n}\sum_{i=1}^{n}\Delta_{\beta}(r_i)^{\top}\Sigma_Z\Delta_{\beta}(r_i) \geq c_5'\int_{[0,1]}\Delta_{\beta}(r)^{\top}\Sigma_Z\Delta_{\beta}(r)\,dr - C_5'\zeta_n. \tag{42}$$
Finally, (39), (40), (41) and (42) together imply that with probability at least $1 - 6T^{-4}$, it holds that
$$\mathcal{E}(\widehat{A}, \widehat{\beta}) \leq C\Big(\log(T)\zeta_n + \log(T)\delta_T + \sqrt{\frac{\log(p)}{T}} + \frac{s\log(pT)}{T}\Big),$$
where $C$ is some absolute constant. This directly leads to the desired result.

B.3 Additional proofs related to Theorem 4

In this section, we present the technical results related to Theorem 4. Recall that in the proof of Theorem 4 we set $\Delta_{\beta}(r) = \widehat{\beta}(r) - \beta^{*}(r)$ and $\Delta_A(r, s) = \widehat{A}(r, s) - A^{*}(r, s)$, $r, s \in [0, 1]$.
Note that for any $g \in L^2$, it holds that
$$E(\langle g, X\rangle_{L^2}^2) = \iint_{[0,1]^2}g(r)\Sigma_X(r, s)g(s)\,dr\,ds.$$
We also assume that the following good event holds:
$$\mathcal{D} = \big\{\|X_t\|_{W^{\alpha,2}} \leq 4C_X\sqrt{\log(T)} \text{ for all } 1 \leq t \leq T\big\}.$$
It was justified in the proof of Theorem 4 that $P(\mathcal{D}) \geq 1 - T^{-5}$.

Lemma 11 Let $\zeta_n = n^{-\alpha+1/2}$. Under the same conditions as in Theorem 4, with probability at least $1 - T^{-4}$, it holds that
$$\Big|\frac{1}{Tn}\sum_{t=1}^{T}\sum_{i=1}^{n}\Big(\frac{1}{n}\sum_{j=1}^{n}\Delta_A(r_i, s_j)X_t(s_j)\Big)^2 - \frac{1}{n}\sum_{i=1}^{n}\iint_{[0,1]^2}\Delta_A(r_i, r)\Sigma_X(r, s)\Delta_A(r_i, s)\,dr\,ds\Big| \leq C_2\big(\log(T)\zeta_n^2 + \log(T)\delta_T\big). \tag{43}$$

Proof Let $\zeta_n = n^{-\alpha+1/2}$ and $\delta_T = T^{-2r/(2r+1)}$. Observe that for any $i \in \{1, \ldots, n\}$ and $t \in \{1, \ldots, T\}$, we have
$$\|\Delta_A(r_i, \cdot)X_t(\cdot)\|_{W^{\alpha,2}} \leq \|\Delta_A(r_i, \cdot)\|_{W^{\alpha,2}}\|X_t(\cdot)\|_{W^{\alpha,2}} \leq 8C_AC_X\sqrt{\log(T)},$$
where the first inequality follows from Theorem 41. By Theorem 40, we have that
$$\Big|\frac{1}{n}\sum_{j=1}^{n}\Delta_A(r_i, s_j)X_t(s_j) - \int_{[0,1]}\Delta_A(r_i, s)X_t(s)\,ds\Big| \leq C_4\zeta_n\sqrt{\log(T)},$$
where $C_4 > 0$ is an absolute constant. Therefore, we have that
$$\Big|\Big(\frac{1}{n}\sum_{j=1}^{n}\Delta_A(r_i, s_j)X_t(s_j)\Big)^2 - \langle\Delta_A(r_i, \cdot), X_t(\cdot)\rangle_{L^2}^2\Big| \leq \frac{1}{320}\langle\Delta_A(r_i, \cdot), X_t(\cdot)\rangle_{L^2}^2 + C_4'\zeta_n^2\log(T). \tag{44}$$
By Theorem 19 and the fact that $\|\Delta_A(r_i, \cdot)\|_{H(K)} \leq \|A^{*}\|_{H(K)} + \|\widehat{A}\|_{H(K)} \leq 2C_A$, we have that with probability at least $1 - T^{-4}$, it holds uniformly for all $1 \leq i \leq n$ that
$$\Big|\frac{1}{T}\sum_{t=1}^{T}\langle\Delta_A(r_i, \cdot), X_t\rangle_{L^2}^2 - \iint_{[0,1]^2}\Delta_A(r_i, r)\Sigma_X(r, s)\Delta_A(r_i, s)\,dr\,ds\Big| \leq \frac{1}{320}\iint_{[0,1]^2}\Delta_A(r_i, r)\Sigma_X(r, s)\Delta_A(r_i, s)\,dr\,ds + C_4''\log(T)\delta_T,$$
where $C_4'' > 0$ is an absolute constant. Thus the above display implies that
$$\frac{1}{T}\sum_{t=1}^{T}\langle\Delta_A(r_i, \cdot), X_t\rangle_{L^2}^2 \geq \frac{319}{320}\iint_{[0,1]^2}\Delta_A(r_i, r)\Sigma_X(r, s)\Delta_A(r_i, s)\,dr\,ds - C_4''\log(T)\delta_T. \tag{45}$$
Therefore,
$$\frac{1}{Tn}\sum_{t=1}^{T}\sum_{i=1}^{n}\Big(\frac{1}{n}\sum_{j=1}^{n}\Delta_A(r_i, s_j)X_t(s_j)\Big)^2 \geq \frac{319}{320}\cdot\frac{1}{Tn}\sum_{t=1}^{T}\sum_{i=1}^{n}\langle\Delta_A(r_i, \cdot), X_t\rangle_{L^2}^2 - C_4'\log(T)\zeta_n^2 \geq \frac{318}{320}\cdot\frac{1}{n}\sum_{i=1}^{n}\iint_{[0,1]^2}\Delta_A(r_i, r)\Sigma_X(r, s)\Delta_A(r_i, s)\,dr\,ds - C_4''\big(\log(T)\zeta_n^2 + \log(T)\delta_T\big), \tag{46}$$
where the first inequality follows from (44) and the second inequality follows from (45); the matching upper bound follows from the same argument.

Lemma 12 Let $\zeta_n = n^{-\alpha+1/2}$ and $\delta_T = T^{-2r/(2r+1)}$.
Under the same conditions as in Theorem 4, if in addition, δT log(T) T , then with probability at least 1 T 4, it holds that j=1 A(ri, sj)Xt(sj) [0,1]2 A(ri, r)ΣX(r, s) A(ri, s) dr ds + C ζn + log(T)δT In addition, let {Et,i}T,n t=1,i=1 be a collection of standard normal random variables independent of {Xt}T t=1,{ri}n i=1 and {si}n i=1. Then with probability at least 1 T 4, it holds that j=1 A(ri, sj)Xt(sj) [0,1]2 A(ri, r)ΣX(r, s) A(ri, s) dr ds + C ζn + log(T)δT Proof Let ζn = n α+1/2. For Equation (47), note that j=1 A(ri, sj)Xt(sj) j=1 A(ri, sj)Xt(sj) Z [0,1] A(ri, s)Xt(s) ds ϵt(ri) (49) i=1 ϵt(ri) Z [0,1] A(ri, s)Xt(s) ds. (50) Since for all ri A(ri, )Xt( ) W α,2 8CACX p Theorem 17 implies that with probability 1 T 4, it holds that uniformly for all 1 i n, j=1 A(ri, sj)Xt(sj) Z [0,1] A(ri, s)Xt(s) ds ϵt(ri) C2ζn log(T), where C2 > 0 is an absolute constant. Therefore (49) 2C2Tnζn log(T). To control the term (50), we deploy Theorem 18. Since A(ri, ) H(K) A H(K) + b A H(K) 2CA, Functional Linear Regression with Mixed Predictors then with probability at least 1 T 4, it holds that that [0,1]2 A(ri, r)ΣX(r, s) A(ri, s) dr ds + log(T)δT 320C 2 log(T)δT + 1 320C 2n [0,1]2 A(ri, r)ΣX(r, s) A(ri, s) dr ds where C 2 > 0 is an absolute constant. Therefore, we have that with probability at least 1 2T 4 that j=1 A(ri, sj)Xt(sj) [0,1]2 A(ri, r)ΣX(r, s) A(ri, s) dr ds + C 2 Tn ζn + log2(T) T + log(T)δT where C3 > 0 is an absolute constant. Equation (47) follows from the assumption that δT log(T)/T. The argument of Equation (48) is the same as that of Equation (47) and will be omitted for brevity. Lemma 13 Let ζn = n α+1/2 and δT = T 2r/(2r+1). Under the same conditions as in Theorem 4, with probability at least 1 (T p) 4, it holds that i=1 β(ri), Zt pϵt(ri) λ 320 l S βl n + λ l Sc bβl n. (51) In addition, let {Et,i}T,n t=1,i=1 be a collection of standard normal random variables independent of {Xt}T t=1, {ri}n i=1 and {si}n i=1. 
Then with probability at least 1 T 4, it holds that i=1 β(ri), Zt p Et,i λ 320 l S βl n + λ l Sc bβl n. (52) Proof Let ζn = n α+1/2. For Equation (51), note that since Zt,j is centered Gaussian with variance bounded by CZ and ϵt(ri) is Gaussian with variance bounded by Cϵ, Zt,jϵt(ri) is sub-exponential with parameter CϵCZ. Therefore by a union bound argument, there exists an absolute constant C1 > 0 such that, for any i {1, . . . , n}, t=1 Ztϵt(ri) (Tp) 5. (53) Wang, Zhao, Yu and Willett Therefore, with probability at least 1 (T p) 5, it holds that i=1 β(ri), Zt pϵt(ri) t=1 Ztϵt(ri) l=1 | βl(ri)| l=1 | βl(ri)| l S βl n + λ where the second inequality follows from (53), the third inequality follows from H older s inequality, and the last inequality follows from the assumption that λ = Cλ q T for sufficiently large constant Cλ. The argument of Equation (52) is the same as that of Equation (51) and will be omitted for brevity. Lemma 14 Let ζn = n α+1/2. Under the same conditions as in Theorem 4, with probability at least 1 T 4, it holds that [0,1]2 A(ri, r)ΣX(r, s) A(ri, s) dr ds [0,1] A(s, r)X (r) dr C log(T)ζn. (54) Proof We have [0,1] A(ri, s)X (s) ds [0,1]2 A(ri, r)ΣX(r, s) A(ri, s) dr ds where X is the predictor in the test set. By Theorem 33, we have with probability at least 1 T 4 that X W α,2 4CX p log(T). So it holds that A(ri, ), X L2 W α,2 = [0,1] A(ri, s)X (s) ds [0,1] A( , s) W α,2|X (s)| ds 2CA X L2 8CACX p Functional Linear Regression with Mixed Predictors Let f(r) = A(r, ), X ( ) L2. Then f2 W α,2 64C2 AC2 X log(T). Applying Theorem 40 to f2, we have that i=1 A(ri, ), X 2 L2 Z [0,1] A(r, ), X 2 L2 dr C 5 log(T)ζn, and this implies that [0,1] A(ri, s)X (s) ds [0,1] A(s, r)X (r) dr C 5 log(T)ζn Lemma 15 Let ζn = n α+1/2. Under the same conditions as in Theorem 4, it holds that i=1 β(ri) ΣZ β(ri) c Z [0,1] β(r) ΣZ β(r) dr Cζn. (55) Proof Note that since the minimal eigenvalue of ΣZ is lower bounded by cz, i=1 β(ri) ΣZ β(ri) cz l=1 2 βl(ri). 
By Theorem 40, there exists an absolute constant C4 such that i=1 2 βl(ri) Z [0,1] 2 βl(r) dr C4ζn 2 βl W α,2, l = 1, . . . , p. (56) As a result, i=1 β(ri) ΣZ β(ri) cz l=1 2 βl(ri) c4cz [0,1] 2 βl(r) dr C4cz l=1 ζn 2 βl W α,2 [0,1] 2 βl(r) dr C 4ζn, where the second inequality follows from (56), and the third inequality follows from l=1 2 βl W α,2 l=1 βl 2 W α,2 l=1 βl 2 H(Kβ) 2 β l 2 H(Kβ) + bβl 2 H(Kβ) Wang, Zhao, Yu and Willett l=1 β l H(Kβ) 2 + 2 p X l=1 bβl H(Kβ) 2 4C2 β. i=1 β(ri) ΣZ β(ri) c 4 [0,1] 2 βl(r) dr C 4ζn [0,1] β(r) ΣZ β(r) dr C 4ζn (57) where the last inequality follows from the fact that w Σw Cz w 2 2 for all w Rp. B.4 Extensions Corollary 16 Define the discretized excess risk as E dis( b A, bβ) =EX ,Z ,Y i=1 b A(rj, si)X (si) Z , bβ(rj) p i=1 A (rj, si)X (si) ds Z , β (rj) p Suppose that Assumptions 1, 2 and 3 hold. Let ( b A, bβ) be any solution to (4) with the tuning parameter λ = Cλ q T for some sufficiently large constant Cλ. Define n = min{n1, n2}. For T log(n), there exists absolute constants C > 0 such that with probability at least 1 8T 4, it holds that E dis( b A, bβ) C log(T) δT + s log(p T)/T + ζn (59) where ζn = n α+1/2 and δT = T 2r 2r+1 . Theorem 16 is a direct consequence of Theorem 4 and provides a formal theoretical guarantee for using E dis( b A, bβ) to evaluate the proposed estimators in practice. B.5 Proof of Theorem 8 Proof [Proof of Theorem 8] The proof is almost identical to the proof of Theorem 4. As a result, we only point out the difference. From the minimizer property, we have that j=1 ws(j) b A(ri, sj)Xt(sj) Zt, bβ(ri) p Functional Linear Regression with Mixed Predictors j=1 ws(j)A (ri, sj)Xt(sj) Zt, β (ri) p which implies that j=1 ws(j) A(ri, sj)Xt(sj) i=1 wr(i) β(ri), Zt 2 p (61) j=1 ws(j) A(ri, sj)Xt(sj) β(ri), Zt p j=1 ws(j) A(ri, sj)Xt(sj) + β(ri), Zt p ϵt(ri) (63) j=1 β j n λ j=1 bβj n (64) j=1 ws(j) A(ri, sj)Xt(sj) + β(ri), Zt p Note that Equation (60) - (64) are identical to Equation (25) - (29). 
So it suffices to analyze Equation (65). Without loss of generality, we will assume that wr(i) = 1 for all 1 i n and ws(j) = 1 for all 1 j n. It follows from (48) and (52) that with probability at least 1 2T 4, it holds that (65) 1 320n [0,1]2 A(ri, r)ΣX(r, s) A(ri, s) dr ds + C1 ζn + log(T)δT j S βj n + λ where C1 > 0 is an absolute constant. The rest of the argument is identical to that of Theorem 4 and is therefore omitted. Appendix C. Deviation bounds C.1 Functional deviation bounds Lemma 17 Let {sj}n j=1 [0, 1] be independent uniform random variables. Let {εt}T t=1 be a collection of independent standard Gaussian random variables independent of {Xt}T t=1 and {si}n i=1. Under Assumption 3, it holds that j=1 g(sj)Xt(sj) Z [0,1] g(s)Xt(s) ds εt C log(T)ζn for all g W α,2 1 where C, c > 0 are absolute constants depending only on CX and Cε and ζn = n α+1/2. Proof Step 1. Note that by Theorem 33, it holds that P Xt W α,2 4CX p log(T) T 6. By a union bound argument, P(E) = P Xt W α,2 4CX p log(T) for all 1 t T T 5. Under the event E, for any g such that g W α,2, it holds that g Xt W α,2 g W α,2 Xt W α,2 4CX p So under the event E , for all 1 t T and all g W α,2 1, it follows from Theorem 40 that 1 n j=1 g(sj)Xt(sj) Z [0,1] g(s)Xt(s) ds C1n α+1/2p log(T) for all g W α,2 1. Since P(|εt| C2 p log(T)) T 5, the desired result immediately follows. Throughout this section, denote the eigen-expansion of the linear map LK1/2ΣXK1/2 as LK1/2ΣXK1/2(Φk) = In addition, for any bilinear operator Σ and any L2 functions f, g, denote Σ[f, g] = ZZ [0,1]2 Σ(r, s)f(r)g(s) dr ds. Lemma 18 Suppose {Xt}T t=1 are independent and identically distributed centered Gaussian random processes and that the eigenvalues {ξk} k=1 of the linear operator LK1/2ΣXK1/2 satisfy that for some r > 1/2. Let {εt}T t=1 be i.i.d. N(0, 1) and independent of {Xt}T t=1.
Then with probability at least 1 T 4, 1 T t=1 Xt, β L2εt ΣX[β, β] log(T)δT + log(T)δT for all β H(K) 1, where δT = T 2r 2r+1 and C is some absolute constant independent of T. Proof Note that for any deterministic α H(K). E( Xt, α L2εt) = 0. In addition, since Xt, α L2 N(0, ΣX[α, α]) and εt N(0, 1), where ΣX[α, α] = ZZ [0,1]2 α(s)ΣX(s, t)α(t) ds dt. Thus Xt, α L2εt is a centered sub-exponential random variable with parameter ΣX[α, α]. By Theorem 29, for γ < 1. t=1 Xt, α L2εt exp( 2γ2T). (66) F := span {LK1/2(Φk)}D k=1 H(K) and F := span {LK1/2(Φk)} k=D+1 H(K). Denote PF to be the projection operator from H(K) to F with respect to the H(K) topology. By Theorem 21, LK1/2(Φk) is a collection of H(K) basis. For any β such that β H(K) 1, since {LK1/2(Φk)} k=1 form a H(K) basis, PF(β) 2 H(K) + PF (β) 2 H(K) = β 2 H(K) 1. Note that t=1 Xt, β L2εt = 1 t=1 Xt, PF(β) L2εt + 1 t=1 Xt, PF (β) L2εt. Step 1. Let D T to be chosen later. For any J Z, J 1 and 2 J T 6, consider the sets k=1 fk LK1/2(Φk) : 2 J 1 p ΣX[α, α] 2 J, k=1 f2 k = α 2 H(K) 1 Wang, Zhao, Yu and Willett So FJ can be identified as a unit ball in RD. This means that for every δ > 0, there exists a collection {αm}M m=1 such that for any α FJ, αm α H(K) δ Therefore for given m, αm FJ, and so t=1 Xt, αm L2εt exp( 2γ2T). So for any α FJ, t=1 Xt, α L2εt t=1 Xt, α αm L2εt + sup 1 m M t=1 Xt, αm L2εt L2 + sup 1 m M t=1 Xt, αm L2εt T + sup 1 m M t=1 Xt, αm L2εt. t=1 Xt, α L2εt ΣX[α, α] + δ T for all α FJ exp D log(2/δ) γ2T . Letting γ = C q T and δ = T 9/2 gives t=1 Xt, α L2εt ΣX[α, α]D log(T) T 5 for all α FJ k=1 fk LK1/2(Φk) : p k=1 f2 k = α 2 H(K) 1 The similar argument shows that t=1 Xt, α L2εt ΣX[α, α]D log(T) T 5 for all α E Functional Linear Regression with Mixed Predictors Since ΣX[α, α] T 6 and D T, t=1 Xt, α L2εt T 5 for all α E Since {α F : α H(K) 1} = [ by a union bound argument, t=1 Xt, α L2εt DΣX[α, α] log(T) T 5 for all α F, α H(K) 1 log(T)T 6 (67) Step 2. 
For any β such that β H(K) 1, it holds that PF(β) H(K) 1. Therefore t=1 Xt, PF(β) L2εt DΣX[PF(β), PF(β)] log(T) DΣX[β, β] log(T) where the first inequality follows from (67), and the second inequality holds because ΣX[β, β] =ΣX[PF(β), PF(β)] + 2ΣX[PF (β), PF(β)] + ΣX[PF (β), PF (β)] =ΣX[PF(β), PF(β)] + ΣX[PF (β), PF (β)] ΣX[PF(β), PF(β)] where the second equality follows from Theorem 25, which implies that ΣX[PF (β), PF(β)] = 0 Step 3. Denote the eigen-expansion of linear map LK1/2ΣXK1/2 as LK1/2ΣXK1/2(Φk) = wt,k = LK 1/2(Xt), Φk L2 ξk . sup β H(K) 1 t=1 Xt, PF (β) L2εt Wang, Zhao, Yu and Willett sup β2 F , β2 H(K) 1 t=1 Xt, β2 L2εt sup β2 F , β2 H(K) 1 t=1 LK1/2(Xt), LK 1/2(β2) L2εt = sup β2 F , β2 H(K) 1 k=1 LK1/2(Xt), Φk L2 LK 1/2(β2), Φk L2εt = sup β2 F , β2 H(K) 1 k=D+1 LK1/2(Xt), Φk L2 LK 1/2(β2), Φk L2εt = sup P k=D+1 f2 k 1 ξkwt,kfkεt, where the fourth inequality holds because β2 F and the last inequality follows from Theorem 22. Note that since wt,k and εt are both centered Gaussian with variance 1, 4 log(k) + 12 log(T) T + 4 log(k) + 12 log(T) T for all k 1 4 log(k) + 12 log(T) T + 4 log(k) + 12 log(T) So for any D D, sup P k=D+1 f2 k 1 = sup P k=D+1 f2 k 1 sup P k=D+1 f2 k 1 k=D+1 k r r 4 log(k) + 12 log(T) T + 4 log(k) + 12 log(T) sup P k=D+1 f2 k 1 4 log(k) + 12 log(T) 4 log2(k) + 12 log2(T) sup P k=D+1 f2 k 1 4 log(k) + 12 log(T) Functional Linear Regression with Mixed Predictors 4 log(p) D2r 1T + 1 (D )r 1/2T + where the first inequality follows from Theorem 30. The second term in Equation (68) can be bounded in a similar way. Step 4. 
Putting Step 2 and Step 3 together, it holds that with high probability, for all β H(K) such that β H(K) 1, t=1 Xt, β L2εt C r DΣX[α, α] log(T) 4 log(p) D2r 1T + 1 (D )r 1/2T + Setting D = max{T 10 r 1/2 , D} gives t=1 Xt, β L2εt C r DΣX[α, α] log(T) T D 2r+1 + 1 Taking D = T 1 2r+1 gives t=1 Xt, β L2εt C s ΣX[α, α] log(T) T 2r 2r+1 + log(T) T 2r 2r+1 + 1 Lemma 19 Suppose {Xt}T t=1 are independent and identically distributed centered Gaussian random processes and that the eigenvalues {ξk} k=1 of the linear operator LK1/2ΣXK1/2 satisfy for some r > 1/2. Then with probability at least 1 T 4, it holds that for any 0 < τ < 1, 1 T t=1 Xt, β 2 L2 ΣX[β, β] τΣX[β, β] + Cτ log(T)δT for all β H(K) 1, where δT = T 2r 2r+1 , and Cτ is some constant only depending on τ and independent of T. Proof Note that for any deterministic α H(K), Xt, α L2 is SG(ΣX[α, α]). Thus by the Hanson-Wright inequality, it holds that for all γ < 1, t=1 Xt, α 2 L2 ΣX[α, α] exp( 2γ2T). (69) For D T to be chosen later, denote F := span {LK1/2(Φk)}D k=1 and F := span {LK1/2(Φk)} k=D+1 . Denote PF to be the projection operator from H(K) to F with respect to the H(K) topology. Then it holds that for all β H(K), 1 T t=1 Xt, β 2 L2 ΣX[β, β] t=1 Xt, PF(β) 2 L2 ΣX[PF(β), PF(β)] t=1 Xt, PF(β) L2 Xt, PF (β) L2 ΣX[PF(β), PF (β)] t=1 Xt, PF (β) 2 L2 ΣX[PF (β), PF (β)] Step 1. For any J Z, J 1 and 2 J T 5, consider the set k=1 fk LK1/2(Φk) : 2 J 1 C[α, α] 2 J, k=1 f2 k = α 2 H(K) 1 So FJ can be viewed as a subset of the unit ball in RD. This means that for every δ > 0, there exists a collection {αm}M m=1 such that for any α FJ, αm α H(K) δ δ D. Therefore for a given m, αm FJ, t=1 Xt, αm 2 L2 ΣX[αm, αm] γΣX[αm, αm] exp( 2γ2T). (70) Denote bΣX(r, s) = 1 T PT t=1 Xt(r)Xt(s).
We have that for any α FJ, 1 T t=1 Xt, α 2 L2 ΣX[α, α] t=1 Xt(r)Xt(s) ΣX(r, s) Functional Linear Regression with Mixed Predictors t=1 Xt(r)Xt(s) ΣX(r, s) t=1 Xt(r)Xt(s) ΣX(r, s) αm(s) dr ds t=1 Xt(r)Xt(s) ΣX(r, s) αm(s) dr ds 2 α αm L2 α L2 t=1 Xt(r)Xt(s) ΣX(r, s) + sup 1 m M t=1 Xt(r)Xt(s) ΣX(r, s) αm(s) dr ds =2 α αm L2 α L2 bΣX ΣX F + sup 1 m M t=1 Xt(r)Xt(s) ΣX(r, s) αm(s) dr ds T + sup 1 m M t=1 Xt(r)Xt(s) ΣX(r, s) αm(s) dr ds T + sup 1 m M t=1 Xt, αm 2 L2 ΣX[αm, αm] t=1 Xt, α 2 L2 ΣX[α, α] 2γΣX[α, α] + 2CXδ T for all α FJ exp D log(2/δ) γ2T . T and δ = T 9/2 gives t=1 Xt, α 2 L2 ΣX[α, α] T ΣX[α, α] + for all α FJ k=1 fk LK1/2(Φk) H(K) : ΣX[α, α] 1 k=1 f2 k = α 2 H(K) 1 The similar argument shows that t=1 Xt, α 2 L2 ΣX[α, α] T ΣX[α, α] + for all α E Wang, Zhao, Yu and Willett Since ΣX[α, α] T 5 when α E, and D T, t=1 Xt, α 2 L2 ΣX[α, α] T 5 for all α E Since {α F : α H(K) 1} = [ 2 J T 5 FJ [ E, by union bound, t=1 Xt, α 2 L2 ΣX[α, α] T ΣX[α, α] + for all α F, α H(K) 1 Step 2. For any β such that β H(K) 1, it holds that PF(β) H(K) 1. Therefore 1 T t=1 Xt, PF(β) 2 L2 ΣX[PF(β), PF(β)] T ΣX[PF(β), PF(β)] + T ΣX[β, β] + where the first inequality follows from (71), and the last inequality holds because ΣX[β, β] =ΣX[PF(β), PF(β)] + 2ΣX[PF (β), PF(β)] + ΣX[PF (β), PF (β)] =ΣX[PF(β), PF(β)] + ΣX[PF (β), PF (β)] ΣX[PF(β), PF(β)] where by Theorem 25, ΣX[PF (β), PF(β)] = 0. Step 3. Denote the eigen-expansion of linear map LK1/2ΣXK1/2 as LK1/2ΣXK1/2(Φk) = wt,k = LK1/2(Xt), Φk L2 ξk . For any β, let fk = LK 1/2(Φk), β L2. Note that if β 2 H(K) 1, then PF (β) H(K) 1. Xt, PF (β) L2 = k=D+1 LK1/2(Xt), Φk L2 LK 1/2(β), Φk L2 = Functional Linear Regression with Mixed Predictors Note that if β H(K) 1, then PF (β) H(K) 1. Therefore sup β H(K) 1 t=1 Xt, PF (β) 2 L2 ΣX[PF (β), PF (β)] sup P k=D+1 f2 k 1 t=1 wi,kwi,l 1 t=1 E(wi,kwi,l) where the last inequality follows from Theorem 22. 4 log(k)+4 log(l)+12 log(T) T . 
Note that since wt,kwi,l is SE(1), i=1 wt,kwi,l 1 t=1 E(wi,kwi,l) η + η2 for all k, l 1 i=1 wt,kwi,l 1 t=1 E(wi,kwi,l) 1 T 6k2l2 1 As a result sup P k=D+1 f 2 k 1 t=1 wi,kwi,l E(wi,kwi,l) sup P k=D+1 f 2 k 1 ξlfl η + η2 sup P k=D+1 f 2 k 1 4 log(k) + p 4 log(l) + p 12 log(T) + 4 log(k) + 4 log(l) + 12 log(T) Observe that for D D, sup P k=D+1 f2 k 1 sup P k=D+1 f2 k 1 sup P k=D+1 f2 k 1 k=D+1 k 2r log(k) D2r 1 + 1 (D )r 1/2 Wang, Zhao, Yu and Willett where the last inequality follows from Theorem 30. In addition, sup P k=D+1 f2 k 1 log(T) sup P k=D+1 f2 k 1 log(T) sup P k=D+1 f2 k 1 log(T)D 2r+1. The rest terms in Equation (72) can be handled in a similar way. So sup β H(K) 1 t=1 Xt, PF (β) 2 L2 ΣX[PF (β), PF (β)] D 2r+1 log(p) D2r 1 + 1 (D )r 1/2 log(T)D 2r+1 ! D 4r+2 log(T) T D 2r+1 C 2 where the second inequality follows by setting D = max{n 9 r 1/2 , D}. Step 4. So with high probability, it holds that for all β such that β H(K) 1, t=1 Xt, PF(β) 2 L2 ΣX[PF(β), PF(β)] T ΣX[β, β] + t=1 Xt, PF (β) 2 L2 ΣX[PF (β), PF (β)] Taking D = c T log(T) for sufficiently small constant c gives, t=1 Xt, PF(β) 2 L2 ΣX[PF(β), PF(β)] 2ΣX[β, β] + C4T 4 (76) t=1 Xt, PF (β) 2 L2 ΣX[PF (β), PF (β)] 2r 2r+1 + 1 Functional Linear Regression with Mixed Predictors where r > 1/2 is used in the last inequality. Step 5. Denote δT = log(T) Note that if k = l, E(wi,kwi,l) =E LK1/2(Xt), Φk L2 ξk LK1/2(Xt), Φl L2 ξl = 1 ξkξl E ( Xt, LK1/2(Φk) L2 Xt, LK1/2(Φl) L2) = 1 ξkξl ΣX[LK1/2(Φk), LK1/2(Φl)] = 1 ξkξl LK1/2CK1/2(Φk), Φl L2 = 0. sup β H(K) 1 ΣX[PF (β), PF (β)] sup P k=D+1 f2 k 1 t=1 E(wi,kwi,l) = sup P k=D+1 f2 k 1 t=1 E(wi,kwi,k) 2 sup P k=D+1 f2 k 1 k=D+1 f2 kk 2r 2 sup P k=D+1 f2 k 1 2D 2r+1/2 C5 2r 2r+1 C5 log(T)δT . (78) Observe that ΣX[PF(β), PF (β)] = 0. 
So for any β H(K) 1, 1 T t=1 Xt, PF(β) L2 Xt, PF (β) L2 ΣX[PF(β), PF (β)] t=1 Xt, PF(β) 2 L2 t=1 Xt, PF (β) 2 L2 ΣX[PF(β), PF(β)] + τ 2ΣX[β, β] + C4T 4 1/2 ΣX[PF (β), PF (β)] + C4 δT + T 4 1/2 2ΣX[β, β] + T 4 1/2 C6δT + T 4 1/2 2ΣX[β, β] + C7δT , (79) where the second inequality follows from (76) and (77), the third inequality follows from (78), and the last inequality holds if C7 is a sufficiently large constant. The desired result follows from Equation (76), Equation (77) and Equation (79). C.2 Additional lemmas for kernel alignment Remark 20 Note that following the same argument as that in the proof of Theorem 2 in Cai and Yuan (2012), the function LK1/2 : L2 H(K) admits a well-defined inverse function LK 1/2 : H(K) L2 such that LK 1/2(φk) = µ 1/2 k φk, where the eigen-expansion of K(r, s) satisfies K(r, s) = P k=1 µkφk(r)φk(s). Recall that the eigen-expansion of the linear map LK1/2ΣXK1/2 takes the form K1/2ΣXK1/2 = Lemma 21 For any f, g H(K), f, g H(K) = LK 1/2(f), LK 1/2(g) L2. In addition, for any L2 basis {Φk} k=1, we have that {LK1/2(Φk)} k=1 forms a basis of H(K). (However, {LK1/2(Φk)} k=1 is not necessarily a basis of L2.) Proof Let f = P k=1 akφk H(K) and g = P k=1 bkφk H(K). Note that LK 1/2(f) = P k=1 ak µk φk. As a result LK 1/2(f) 2 L2 = P k=1 a2 k/µk = f 2 H(K). So LK 1/2(f) L2. Then f, g H(K) = P k=1 akbk/µk = LK 1/2(f), LK 1/2(g) L2. In addition, from the above equality, LK1/2(Φi), LK1/2(Φj) H(K) = LK 1/2LK1/2(Φi), LK 1/2LK1/2(Φj) L2 = ( 1 if i = j; 0 if i ≠ j. Lemma 22 For any β H(K), it holds that k=1 LK 1/2(β), Φk 2 L2 = β 2 H(K). If in addition β F , where F := span {LK1/2(Φk)} k=D+1 , and D is any positive integer, then k=D+1 LK 1/2(β), Φk 2 L2 = β 2 H(K) Proof Denote β = P k=1 LK1/2(Φk)bk. Then β 2 H(K) = P k=1 b2 k. So k=1 LK 1/2(β), Φk 2 L2 = k=1 b2 k = β 2 H(K). For the second part, it suffices to observe that β = P k=D+1 LK1/2(Φk)bk.
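Lemma 21 is a purely spectral statement, so it can be verified in coefficient space: with K(r, s) = Σ_k µ_k φ_k(r)φ_k(s), a function f = Σ_k a_k φ_k satisfies ⟨f, g⟩_H(K) = Σ_k a_k b_k / µ_k, and L_K^{-1/2} scales coefficients by µ_k^{-1/2}. A small numerical sketch follows; the truncation level and the polynomial eigen-decay µ_k = k^{-2} are illustrative assumptions, not choices from the paper.

```python
import numpy as np

# Coefficient-space check of Lemma 21: the H(K) inner product of f and g
# equals the L2 inner product of L_K^{-1/2} f and L_K^{-1/2} g.
# Truncation level and eigen-decay below are illustrative assumptions.

K_TRUNC = 200
ks = np.arange(1, K_TRUNC + 1, dtype=float)
mu = ks ** (-2.0)                      # assumed polynomial eigen-decay of K

rng = np.random.default_rng(0)
# draw coefficients decaying fast enough that f and g lie in H(K)
a = rng.normal(size=K_TRUNC) * ks ** (-2.0)
b = rng.normal(size=K_TRUNC) * ks ** (-2.0)

inner_HK = np.sum(a * b / mu)                        # <f, g>_{H(K)}
a_tilde, b_tilde = a / np.sqrt(mu), b / np.sqrt(mu)  # coefficients of L_K^{-1/2} f, g
inner_L2 = np.sum(a_tilde * b_tilde)                 # <L_K^{-1/2} f, L_K^{-1/2} g>_{L2}
```

The two inner products agree up to floating-point rounding, mirroring the proof's identity Σ_k a_k b_k / µ_k = ⟨L_K^{-1/2} f, L_K^{-1/2} g⟩_{L2}.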
Suppose W(r) is any Gaussian process with covariance operator E(W(r)W(s)) = ΣW (r, s), and the eigen-expansion of ΣW satisfies ΣW (r, s) = k=1 σ2 khk(r)hk(s). Then it holds that k=1 αkhk(r), where {αk} k=1 are independent Gaussian random variables with mean 0 and variance σ2 k. Denote f(r) = P k=1 bkhk(r) L2. Then W, f L2 is Gaussian with mean 0 and variance k=1 b2 kσ2 k = E( W, f 2 L2) = ZZ [0,1]2 f(r)ΣW (r, s)f(s) dr ds = LΣW (f), f L2. Wang, Zhao, Yu and Willett Lemma 23 Suppose R is symmetric. Let W = LR(X). Then E(W(s)W(r)) = (RΣXR)(s, r). Thus if X is a Gaussian process, then LR(X) is also a Gaussian process with covariance function RΣXR. Proof For the first part, it suffices to observe that E(W(s)W(r)) = ZZ [0,1]2 R(s, u)E(X(u)X(v))R(r, v) du dv = ZZ [0,1]2 R(s, u)ΣX(u, v)R(r, v) du dv. For the second part, it suffices to observe that LR(X), f L2 = X, LR(f) L2 is Gaussian with standard deviation E( X, LR(f) 2 L2) = q E( LR(X), f 2 L2) = [0,1]2 f(s)(RΣXR)(s, r)f(r) ds dr. Lemma 24 Let X be a centered Gaussian process. Suppose eigenvalues of {ξk} k=1 of the linear operator LK1/2ΣXK1/2 satisfy for some r > 1/2. Let wk = LK1/2(X),Φk L2 ξk . Then wk is Gaussian with variance 1. Proof Since LK1/2(X), Φk L2 is Gaussian with standard deviation s ZZ [0,1]2 Φk(s)LK1/2ΣXK1/2(s, r)Φk(r) ds dr = p So wk is N(0, 1). Lemma 25 Under the same conditions as in Theorem 24, it holds that ΣX[LK1/2(Φk), LK1/2(Φl)] = ( ξk, if k = l; 0, if k = l; k, l = 1, 2, 3, . . . Proof It suffices to observe that ΣX[LK1/2(Φk), LK1/2(Φl)] = LΣX(LK1/2Φk), LK1/2Φl L2 = LK1/2ΣXK1/2Φk, Φl L2 = ξkΦk, Φl L2. Functional Linear Regression with Mixed Predictors Appendix D. Proof of the lower bound result Proof [Proof of Theorem 7] Since the proof is about lower bounds, it suffices to assume that 1 s < p/3 and T s2 log(2p)/24. Step 1. Consider the following two special cases of model (1): [0,1] A (r, s)X(s) ds + ϵ(r), r [0, 1] (80) j=1 Zjβ j (r) + ϵ(r), r [0, 1]. 
(81) E ( b A) = Z [0,1] [0,1] A(r, s1)ΣX(s1, s2) A(r, s2) ds1 ds2 [0,1] β (r)ΣZ β(r) dr, as the excess risks of model (80) and model (81) respectively. Since both model (80) and model (81) are special cases of model (1), standard minimax analysis shows that inf b A,bβ sup A CA, β Cβ E{E ( b A, bβ)} max inf b A,bβ sup A CA, β =0 E{E ( b A)}, inf b A,bβ sup A =0 β Cβ = max inf b A sup A CA, β =0 E{E ( b A)}, inf bβ sup A =0, β Cβ E{E (bβ)} . By the above arguments, the task of finding a lower bound on the excess risk E ( b A, bβ) can be separated into two tasks of finding lower bounds on inf b A sup A CA, β =0 E{E ( b A)} and inf bβ sup A =0, β Cβ E{E (bβ)}. (82) Theorem 2 in Sun et al. (2018) shows that under the same eigen-decay assumption in condition (9) of Theorem 4, it holds that inf b A sup A CA, β =0 E{E ( b A)} c T 2r 2r+1 , (83) where c > 0 is a sufficiently small constant. Therefore, to provide a lower bound on the excess risk E ( b A, bβ), the only remaining task is to provide a lower bound of inf bβ sup A =0, β Cβ E{E (bβ)} based on model (81). Wang, Zhao, Yu and Willett Step 2. For 0 < δ < p 1/(2s) to be specified later, let eΘ(δ, s) denote {βj}p j=1 H1; j=1 1{βj = 0} s, max j=1,...,p βj H1 δ p , s {1, . . . , p}. 2s 1, B(s) Cβ with Cβ = 1. Let φ H(Kβ) be the leading eigenfunction of H(Kβ). Since H(Kβ) is generated by a bounded kernel, we have that φ 1 and φ L2 = 1. For any θ Rp such that θ 1, denote βθ = (βθ j , j = 1, . . . , p) H(Kβ) be such that βθ j ( ) = θjφ( ), j = 1, . . . , p. Provided that θ 0 s, we have that βθ B(s) Cβ. For t {1, . . . , T}, let ϵt(s) = εtφ(s), where {εt}T t=1| i.i.d. N(0, 1). We thus have εt is a Gaussian process with bounded second moment. Denote Pθ as the joint distribution of {Yt, Zt}T t=1 with β = βθ. Since Pθ is supported on the subspace spanned by φ, for any θ, θ , the Kullback Leibler divergence between Pθ and Pθ is KL(Pθ|Pθ ) = EPθ where Qθ is the joint distribution of {yt, Zt}T t=1 with yt = Zt, θ p + εt, t {1, . . 
. , T}. Since {Zt}T t=1 and {εt}T t=1 are independent, we have that yt Zt, θ p 1 2 Zt, θ 2 p yt Zt, θ p + 1 2 Zt, θ 2 p Conditioning on {Zt}T t=1, we have that yt N( Zt, θ p, 1). Therefore, KL(Pθ|Pθ ) =EQθ = E{Zt}T t=1 EQθ|{Zt}T t=1 log d Qθ =E{Zt}T t=1 t=1 Zt, θ θ 2 p = T(θ θ ) ΣZ(θ θ ) = T θ θ 2 2. For 0 < δ < p 1/(2s) to be specified later, let eΘ(δ, s) be defined as in Theorem 26. Then for any θ eΘ(δ, s), we have that and thus β(θ) Cβ. For θ = θ in eΘ(δ, s), it holds that j=1 βj(θ) βj(θ ) 2 L2 = j=1 (θj θ j)2 φ 2 L2 = θ θ 2 2 δ2 Functional Linear Regression with Mixed Predictors KL(Pθ|Pθ ) = T θ θ 2 2 8Tδ2. Using Fano s lemma (e.g. Yu, 1997), we have that inf bβ sup β B(s) j=1 bβj β j 2 L2 δ2/4 θ,θ eΘ(δ,s) θ =θ KL(Pθ|Pθ ) + log(2) M = |eΘ(δ, s)| exp s θ,θ eΘ(δ,s) θ =θ KL(Pθ|Pθ ) + log(2) log(M) 8Tδ2 + log(2) s 2 log(p s Provided that δ2 = s 48T log(p s s/2 ), we have that inf bβ sup β B(s) j=1 bβj β j 2 L2 s 200T log p s Since T s2 log(2p) 24 , it holds that 24T log p s hence δ < p 1/(2s) as desired. Since ΣZ is positive definite , it holds that v ΣZv κ v 2 for any v Rp. Finally, we have that inf bβ sup β B(s) [0,1] β (r)ΣZ β(r) dr s 200T log p s which directly implies the desired result. Lemma 26 For every δ > 0 and s < p/3, denote Θ(δ, s) = n θ { δ p 2/s, 0, δ p 2/s}p : θ 0 = s o . Wang, Zhao, Yu and Willett There exists a set eΘ(δ, s) Θ(δ, s), such that |eΘ(δ, s)| exp s This implies that for any θ, θ eΘ(δ, s), θ = θ , it holds that θ θ 2 2 δ2 and θ θ 2 2 8δ2. Proof Theorem 26 is Lemma 4 in Raskutti et al. (2011). Appendix E. Additional technical results Theorem 27 Suppose that Zi N(0, ΣZ). Then there exist universal constants c, C > 0 such that t=1 (Z t v)2 2 3v ΣZv C log(p) T v 2 1 for all v Rp ! Proof This is the well known Restricted Eigenvalue condition. The proof can be found in Theorem 1 of Raskutti et al. (2010). See Loh and Wainwright (2012) for better constants. Lemma 28 Suppose {wi}n i=1 are i.i.d. 
centered sub-Gaussian random variables with variance 1 (denoted as SG(1)). Then 2 exp( 2γ2n). Lemma 29 Suppose {zi}n i=1 are i.i.d. centered sub-Exponential random variables with parameter 1. Then 2 exp( 2n min{γ2, γ}). See Vershynin (2018) for the proofs of these two lemmas. Lemma 30 Suppose r > 1/2. For any p D, it holds that D2r 1 + 1 pr 1/2 Proof Let ν = r 1/2 > 0. D2r 1 + Cr 1 p2r 1 ν , where the second inequality holds because there exists a constant Cν such that log(k) Cνkν for all k 1. Theorem 31 (Hanson-Wright) Let X = (X1, . . . , Xn) Rn be a random vector where the components Xi are independent SG(K). Let A be any n n matrix. Then for every t 0, P X AX E(X AX) t 2 exp c min t2 K4 A 2 F , t K2 A op Proof The proof of this theorem can be found in Vershynin (2018). Lemma 32 Suppose X is a centered Gaussian process with covariance function ΣX. Suppose that E( X 2 L2) = CX < . Then there exist absolute constants c, C > 0 only depending on CX such that P X 2 L2 Cη exp( c min{η, η2}). Proof Let ΣX(r, s) = k=1 σ2 kψx k(r)ψx k(s) be the eigen-expansion of ΣX. Let zk = X, ψx k L2. Observe that {zk} k=1 is a collection of independent Gaussian random variables such that zk N(0, σ2 k). Since E( X 2 L2) = CX, we have that E( X 2 L2) = k=1 E(z2 k) = k=1 σ2 k = CX. Observe that X 2 L2 = P k=1 z2 k is a sub-Exponential random variable with parameter P k=1 σ4 k C2 X. So by the sub-Exponential tail bound, it holds that for any η > 0, P X 2 L2 C X k=1 σ4 k η exp( c min{η, η2}), where C and c are absolute constants. Lemma 33 Suppose X is a centered Gaussian process with E( X 2 W α,2) = CX < . Then there exist absolute constants c, C > 0 depending only on CX such that P X 2 W α,2 Cη exp( c min{η, η2}). Proof Let Kα be the generating kernel of W α,2 with eigen-expansion k=1 ωkψα k (r)ψα k (s). Note that we have ωk k 2α and ψk 2 W α,2 = ω 1 k k2α due to the properties of the Sobolev space. Let bk = X, ψα k L2.
So {bk} k=1 is a collection of centered correlated Gaussian random variables and X = P k=1 bkψα k . Step 1. Since X W α,2 almost surely, by Theorem 20, Y := LK 1/2 α (X) L2 is well-defined. So X = LK1/2 α (Y ) and E( LK1/2 α (Y ) 2 W α,2) = CX. From Theorem 21, LK1/2 α (Y ) 2 W α,2 = Y 2 L2. So E( Y 2 L2) = CX. For any generic function f L2 with f L2 < , Y, f L2 = LK 1/2 α X k=1 bkψα k , f LK 1/2 α ψα k , f where LK 1/2 α ψα k L2 because ψα k W α,2. Therefore Y, f L2 is a centered Gaussian random variable with V ar( Y, f L2) = E( Y, f 2 L2) E{ Y 2 L2 f 2 L2} = CX f 2 L2 < . Step 2. Let ΣY be the covariance function of Y . Since E( Y 2 L2) < , by the Hilbert-Schmidt theorem, the eigen-expansion of ΣY can be written as ΣY (r, s) = k=1 δ2 kψy k(r)ψy k(s). Then by Step 1, Y, ψy k L2 is Gaussian and E Y, ψy k L2 Y, ψy l L2 = ΣY [ψy k, ψy l ] = 0 whenever l ≠ k. So { Y, ψy k L2} k=1 is a collection of independent Gaussian random variables. By Theorem 32, P Y 2 L2 Cη exp( c min{η, η2}). The desired result follows by observing that Y 2 L2 = LK 1/2 α (X) 2 L2 = X 2 W α,2. Lemma 34 Suppose that {Xt}T t=1 is a collection of independent centered Gaussian processes with covariance operator ΣX. Let ΣX(r, s) = k=1 σ2 kψx k(r)ψx k(s) be the eigen-expansion of ΣX. Suppose E( Xt 2 L2) = P k=1 σ2 k < and that {σk} k=1 decay to 0 at a polynomial rate. Let {zt}T t=1 i.i.d. N(0, 1). If in addition, T log(p), then with probability at least 1 10p 4 t=1 Xtzt E(X1z1) L2 C Proof Let wt,k = Xt, ψx k L2. Then {wt,k}1 t T,1 k< is a collection of centered independent random variables with V ar(wt,k) = σ2 k. Note that t=1 Xtzt E(X1z1) wt,kzt E(w1,kz1) ψx k wt,kzt E(w1,kz1) 2 , where the last inequality follows from the fact that {ψx k} is a collection of basis functions in L2. Note that wt,kzt is SE with parameter σ2 k. So t=1 wt,kzt E(w1,kz1) ησk exp( cT min{η, η2}).
t=1 wt,kzt E(w1,kz1) Cσk log(p) + log(k) T + log(p) + log(k) Since P k=1 k 3 < 3, with probability at least 3p 4, t=1 wt,kzt E(w1,kz1) Cσk log(p) + log(k) T +log(p) + log(k) for all 1 k < . Wang, Zhao, Yu and Willett Under this good event, Equation (84) implies that t=1 Xtzt E(X1z1) log(p) + log(k) T + log(p) + log(k) Since P k=1 σ2 k < , σk k 1 ν for some ν > 0. Therefore P k=1 σ2 k log2(k) < . Since in addition, log(p) T < 1, Equation (85) implies that t=1 Xtzt E(X1z1) L2 C log(p) Theorem 35 Suppose A : H(K) H(K) is a compact linear operator. There exist {ψk} k=1, {ωk} k=1 H(K), two orthogonal basis in L2, such that the associated bivariate function A can be written as k=1 akψk(s)ωk(r), r, s [0, 1], where {ak} k=1 R. Proof This is the well known spectral theory for compact operators on Hilbert space. See Chapter 5 of Brezis (2010) for a detailed proof. Lemma 36 Let A : H(K) H(K) be any Hilbert-Schmidt operator. We have that sup r [0,1] A(r, ) H(K), sup s [0,1] A( , s) H(K), A( , ) where CK is some constant only depending on K. Proof Since H(K) is generated by a bounded kernel K, we have that for any f H(K) and any r [0, 1], f(r) = f, K(r, ) H(K) f H(K) K(r, ) H(K) = f H(K) q K( , r), K( , r) H(K) f H(K) p Since K is a bounded kernel, letting CK := supr K(r, r), we have that f f H(K)CK. Without loss of generality, it suffices to assume that CK = 1. By Theorem 35, there exist {ψk} k=1, {ωk} k=1 H(K), two orthogonal basis in L2, and {ak} k=1 R, such that k=1 akψk(r)ωk(s). Functional Linear Regression with Mixed Predictors Since ψk H(K) = ωk H(K) = 1, k N , A 2 F(K) = P k=1 a2 k. Observe that for any f H(K) such that f H = 1, it holds that f = P k=1 bkωk, where P k=1 b2 k = 1. Therefore, for any r [0, 1], we have that A(r, ) H(K) = sup f H(K)=1 k=1 akψk(r)ωk( ), f( ) E H(K) = sup P k=1 b2 k=1 k=1 akψk(r)bk sup P k=1 b2 k=1 k=1 |ak||bk| sup P k=1 b2 k=1 k=1 b2 k = A F(K), where the first inequality is due to that ψk ψk H(K) = 1, k N . 
Similar arguments show that sups [0,1] A( , s) H(K) A F(K). Moreover, observe that for any fixed r, s [0, 1], it holds that A(r, s) = A(r, ), K(s, ) H(K). Therefore it holds that sup s [0,1] |A(r, s)| sup s [0,1] | A(r, ), K(s, ) H(K)| A(r, ) H(K) sup s [0,1] K(s, ) H(K) A(r, ) H(K), where K(s, ) 2 H(K) = K(s, s) 1, s [0, 1], is used in the last inequality. Finally, we have that A( , ) sup r [0,1] A(r, ) H(K) A F(K), which concludes the proof. Lemma 37 It holds that E ( b A, bβ) = EX ,Z ,Y [0,1] A(r, s)X (s) ds + Z , β(r) p Proof We have that E ( b A, bβ) = EX ,Z ,Y [0,1] b A(r, s)X (s) ds Z , bβ(r) p [0,1] A (r, s)X (s) ds Z , β (r) p =EX ,Z ,Y Z [0,1] A (r, s)X (s) ds + Z , β (r) p + ϵ t (r) [0,1] b A(r, s)X (s) ds Z , bβ(r) p 2 dr EX ,Z ,Y [0,1] {ϵ t (r)}2 dr [0,1] A(r, s)X (s) ds + Z , β(r) p + 2EX ,Z ,Y [0,1] A(r, s)X (s) ds + Z , β(r) p [0,1] A(r, s)X (s) ds + Z , β(r) p [0,1] EX ,Z ,Y [0,1] A(r, s)X (s) ds + Z , β(r) p EX ,Z ,Y {ϵ t (r)} dr [0,1] A(r, s)X (s) ds + Z , β(r) p where the second identity is due to the definition of Y (r), the fourth identity is due to the independence between ϵt( ) and the rest, and the final identity is due to the mean-zero assumption of ϵ t ( ). E.1 Properties of Sobolev Spaces Definition 38 Let f : [0, 1] R be any real function. Then f is said to be Hölder continuous of order κ (0, 1] if sup x,x [0,1] |f(x) f(x )| |x x |κ < . Theorem 39 Suppose that f : [0, 1] R is such that f W α,2 with α > 1/2. Then f is Hölder continuous with κ = α 1/2 and there exists an absolute constant C such that sup x,x [0,1] |f(x) f(x )| |x x |α 1/2 < C f W α,2. Proof This is the well-known Morrey's inequality. See Evans (2010) for a proof. In the following lemma, we show that the integral of a Hölder smooth function f can be approximated by the discrete sum of f evaluated at the sample points {si}n i=1. Lemma 40 Suppose that {si}n i=1 is a collection of sample points satisfying Assumption 3a.
Suppose in addition that $f: [0,1] \to \mathbb{R}$ is Hölder continuous with
$$\sup_{x \ne x' \in [0,1]}\frac{|f(x) - f(x')|}{|x - x'|^{\alpha-1/2}} \le C_f.$$
Let $w_s(i) = (s_i - s_{i-1})n$. Then there exists an absolute constant $C$ such that
$$\Big|\int_{[0,1]} f(s)\,ds - \frac{1}{n}\sum_{i=1}^n w_s(i)f(s_i)\Big| \le CC_f n^{-\alpha+1/2}.$$
Consequently, if $f \in W^{\alpha,2}$, then
$$\Big|\int_{[0,1]} f(s)\,ds - \frac{1}{n}\sum_{i=1}^n w_s(i)f(s_i)\Big| \le C\|f\|_{W^{\alpha,2}}\,n^{-\alpha+1/2}.$$

Proof Let $s_0 = 0$. Observe that
$$\Big|\int_{[0,1]} f(s)\,ds - \frac{1}{n}\sum_{i=1}^n w_s(i)f(s_i)\Big| = \Big|\sum_{i=1}^n\int_{s_{i-1}}^{s_i} f(s)\,ds - \sum_{i=1}^n (s_i - s_{i-1})f(s_i)\Big| \le \sum_{i=1}^n\int_{s_{i-1}}^{s_i}|f(s) - f(s_i)|\,ds \le \sum_{i=1}^n (s_i - s_{i-1})\,C_f\,C_d^{\alpha-1/2}n^{-\alpha+1/2} = C_fC_d^{\alpha-1/2}n^{-\alpha+1/2}.$$

Lemma 41 Suppose that $f, g: [0,1] \to \mathbb{R}$ are such that $f, g \in W^{\alpha,2}$ for some $\alpha > 1/2$. Then there exist absolute constants $C_1$ and $C_2$ such that
$$\|fg\|_{W^{\alpha,2}} \le C_1\|f\|_{W^{\alpha,2}}\|g\|_{W^{\alpha,2}} \quad \text{and} \quad \|f\|_{\infty} \le C_2\|f\|_{W^{\alpha,2}}. \qquad (86)$$

Proof These are well-known Sobolev inequalities. See Evans (2010) for proofs.

Appendix F. Technical details of the optimization

F.1 Detailed formulation of the convex optimization

With the notation defined as in Section 4.1, in the following, we provide the details for rewriting the penalized optimization problem in (13) into (14), which is a convex function of $R$ and $B = [b_1, b_2, \dots, b_p]$. Specifically, we examine the three components of (13) one by one. The squared loss, whose $(t,j)$-th summand is $\big(Y_t(r_j) - \frac{1}{n_1}\sum_{i=1}^{n_1} w_s(i)A(r_j, s_i)X_t(s_i) - \langle\beta(r_j), Z_t\rangle_p\big)^2$, can be rewritten as
$$\frac{1}{T}\sum_{t=1}^T\Big\|Y_t - \frac{1}{n_1}K_1RK_2W_SX_t - K_1BZ_t\Big\|_2^2 = \frac{1}{T}\Big\|Y - \frac{1}{n_1}K_1RK_2W_SX - K_1BZ\Big\|_F^2,$$
where $\|\cdot\|_2$ and $\|\cdot\|_F$ are the $\ell_2$-norm of a vector and the Frobenius norm of a matrix. As for the Frobenius norm penalty $\|A\|_{F(K)}^2$, first, it is easy to see that $A^*(r,s) = k_1(s)^{\top}Rk_2(r)$, where $A^*$ is the adjoint operator of $A$. Let $u(s) = k_2(s)^{\top}c$ for any $c \in \mathbb{R}^{n_1}$. We have that
$$A^*A[u](s) = \langle A^*(s,\cdot), A[u](\cdot)\rangle_H = \langle k_1(\cdot)^{\top}Rk_2(s), A[u](\cdot)\rangle_H = k_2(s)^{\top}R^{\top}\langle k_1(\cdot), k_1(\cdot)^{\top}\rangle_H R\,\langle k_2(\cdot), u(\cdot)\rangle_H = k_2(s)^{\top}R^{\top}K_1RK_2c.$$
Thus, the eigenvalues of $A^*A$ are the same as those of $R^{\top}K_1RK_2$, and $\|A\|_{F(K)}^2 = \mathrm{tr}(R^{\top}K_1RK_2)$.
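The quadrature rule of Lemma 40 above can be checked numerically. In the sketch below, the perturbed random grid and the test function $f = \sin$ are illustrative choices, not the paper's design; the weights follow the definition $w_s(i) = (s_i - s_{i-1})n$ with $s_0 = 0$.

```python
import numpy as np

# Riemann-sum quadrature from Lemma 40: with weights w_s(i) = (s_i - s_{i-1}) * n,
# (1/n) * sum_i w_s(i) f(s_i) approximates the integral of f over [0, 1].
rng = np.random.default_rng(0)

def quad_error(f, n, exact):
    # Illustrative grid: sorted uniform draws with s_n = 1 (Assumption 3a-style).
    si = np.append(np.sort(rng.uniform(0.0, 1.0, n - 1)), 1.0)
    ws = np.diff(np.insert(si, 0, 0.0)) * n   # w_s(i) = (s_i - s_{i-1}) n, s_0 = 0
    approx = np.mean(ws * f(si))              # equals sum_i (s_i - s_{i-1}) f(s_i)
    return abs(approx - exact)

exact = 1.0 - np.cos(1.0)                     # integral of sin over [0, 1]
print(quad_error(np.sin, 100, exact), quad_error(np.sin, 10000, exact))
```

As Lemma 40 predicts for a smooth integrand, the error shrinks as the grid refines.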
As for the $H(K)$-norm penalty $\|\beta_l\|_{H(K)}$ and the group Lasso-type penalty $\|\beta_l\|_{n_2}$, we have
$$\|\beta_l\|_{H(K)} = \sqrt{\langle\beta_l(\cdot), \beta_l(\cdot)\rangle_{H(K)}} = \sqrt{b_l^{\top}K_1b_l}$$
and
$$\|\beta_l\|_{n_2} = \sqrt{\frac{1}{n_2}\sum_{j=1}^{n_2} w_r(j)\beta_l^2(r_j)} = \sqrt{b_l^{\top}\Big(\frac{1}{n_2}\sum_{j=1}^{n_2} w_r(j)k_1(r_j)k_1(r_j)^{\top}\Big)b_l} = \sqrt{\frac{1}{n_2}\,b_l^{\top}K_1W_R^2K_1b_l},$$
for $l = 1, \dots, p$. Combining all three components together, it is easy to see that the optimization problem in (13) can be written as (14).

F.2 Detailed optimization of the structured ridge regression (16)

By simple linear algebra, the first-order condition forms the linear system
$$(S_1^{\top}S_1 + \lambda_1I_{n_1n_2})\,e = S_1^{\top}\tilde{y}.$$
Let $S(\lambda_1) = S_1^{\top}S_1 + \lambda_1I_{n_1n_2}$. Note that $S(\lambda_1)$ is a positive definite matrix, so the optimization has a unique solution. However, when $n_1n_2$ is large, the inverse of the matrix $S(\lambda_1)$ becomes computationally expensive. We thus further exploit the structure of $S(\lambda_1)$. Denote $S_4 = \frac{1}{n_1}K_2^{1/2}W_SX \in \mathbb{R}^{n_1 \times T}$; we have $S_1^{\top}S_1 = (S_4S_4^{\top}) \otimes (K_1^{1/2}W_R^2K_1^{1/2})$, and thus
$$S(\lambda_1) = (S_4S_4^{\top}) \otimes (K_1^{1/2}W_R^2K_1^{1/2}) + \lambda_1I_{n_1n_2}.$$
Denote the singular value decomposition (SVD) of $S_4S_4^{\top} = \frac{1}{n_1^2}K_2^{1/2}W_SXX^{\top}W_SK_2^{1/2}$ as $S_4S_4^{\top} = U_1D_1U_1^{\top}$ and the SVD of $K_1^{1/2}W_R^2K_1^{1/2}$ as $K_1^{1/2}W_R^2K_1^{1/2} = U_2D_2U_2^{\top}$. We have
$$S(\lambda_1) = (U_1 \otimes U_2)(D_1 \otimes D_2 + \lambda_1I_{n_1n_2})(U_1^{\top} \otimes U_2^{\top}).$$
Thus, we have
$$S^{-1}(\lambda_1) = (U_1 \otimes U_2)(D_1 \otimes D_2 + \lambda_1I_{n_1n_2})^{-1}(U_1^{\top} \otimes U_2^{\top}) = \sum_{i,j}\frac{1}{\lambda_1 + D_{1i}D_{2j}}(U_{1i} \otimes U_{2j})(U_{1i}^{\top} \otimes U_{2j}^{\top}) = \sum_{i,j}\frac{1}{\lambda_1 + D_{1i}D_{2j}}(U_{1i}U_{1i}^{\top}) \otimes (U_{2j}U_{2j}^{\top}).$$
Thus, we have
$$e = S^{-1}(\lambda_1)S_1^{\top}\tilde{y} = \sum_{i,j}\frac{1}{\lambda_1 + D_{1i}D_{2j}}\big(U_{1i}U_{1i}^{\top}S_4\big) \otimes \big(U_{2j}U_{2j}^{\top}K_1^{1/2}W_R\big)\,\tilde{y},$$
which implies that, writing $e = \mathrm{vec}(E)$,
$$E = \sum_{i,j}\frac{1}{\lambda_1 + D_{1i}D_{2j}}\,(U_{2j}U_{2j}^{\top}K_1^{1/2}W_R)\,\tilde{Y}\,(S_4^{\top}U_{1i}U_{1i}^{\top}) = \sum_{i,j}\frac{1}{n_1(\lambda_1 + D_{1i}D_{2j})}\,(U_{2j}U_{2j}^{\top})(K_1^{1/2}W_R\tilde{Y}X^{\top}W_SK_2^{1/2})(U_{1i}U_{1i}^{\top}),$$
and we update $R = K_1^{-1/2}EK_2^{-1/2}$.

Appendix G. Additional numerical analysis
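The key computational trick in Appendix F.2, inverting a Kronecker-plus-ridge matrix via the spectral decompositions of its two small factors, can be verified numerically. In the sketch below, `A` and `B` are random symmetric positive definite stand-ins for $S_4S_4^{\top}$ and $K_1^{1/2}W_R^2K_1^{1/2}$ (illustrative, not the paper's data), and the structured solve is checked against a direct dense solve.

```python
import numpy as np

rng = np.random.default_rng(1)
m, q, lam = 6, 5, 0.1

def spd(k):
    M = rng.standard_normal((k, k))
    return M @ M.T  # symmetric positive definite

A = spd(m)   # stand-in for S4 S4^T
B = spd(q)   # stand-in for K1^{1/2} W_R^2 K1^{1/2}
y = rng.standard_normal(m * q)

# Direct solve: O((mq)^3) work, infeasible when n1 * n2 is large.
e_direct = np.linalg.solve(np.kron(A, B) + lam * np.eye(m * q), y)

# Structured solve: eigendecompose only the two small factors.
dA, UA = np.linalg.eigh(A)
dB, UB = np.linalg.eigh(B)
# With row-major reshape, kron(A, B) @ vec(Y) equals vec(A @ Y @ B.T),
# so the solve reduces to elementwise division in the joint eigenbasis.
Y = y.reshape(m, q)
Ytil = UA.T @ Y @ UB
E = Ytil / (dA[:, None] * dB[None, :] + lam)
e_fast = (UA @ E @ UB.T).ravel()

print(np.max(np.abs(e_fast - e_direct)))  # agreement up to numerical precision
```

The structured path costs $O(m^3 + q^3)$ for the eigendecompositions plus matrix multiplies, instead of $O((mq)^3)$ for the dense solve.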
[Figure 7 panels: boxplots for Scenarios A and B with (n, T, q) ∈ {(5, 50, 5), (20, 100, 20), (40, 200, 50)}.]

Figure 7: Boxplot of RMISE of RKHS, FDA and PFFR across 500 experiments under function-on-function regression with κ = 0.5. Red points denote the average RMISE.

[Figure 8 panels: same layout as Figure 7.]

Figure 8: Boxplot of RMISE of RKHS, FDA and PFFR across 500 experiments under function-on-function regression with κ = 2. Red points denote the average RMISE.

[Figure 9 panels: boxplots for RKHS and PFFR over the same (n, T, q) settings.]

Figure 9: Boxplot of RMISE of RKHS and PFFR across 500 experiments under mixed functional regression with κ = 0.5. Red points denote the average RMISE.
Figure 10: Boxplot of RMISE of RKHS and PFFR across 500 experiments under mixed functional regression with κ = 2. Red points denote the average RMISE.

[Figure 11 panels: average RMSE and average MAE plotted against the day s when the prediction is made, for RKHS, RKHS_mixed, PFFR, PFFR_mixed, FDA and ARIMA.]

Figure 11: Average RMSE and MAE achieved by different functional regression methods. For campaign t at prediction time s, ARIMA is trained on {N_t(r_i), r_i ∈ [0, s]} for the prediction of {N_t(r_i), r_i ∈ (s, 30]}. Note that ARIMA can be used because {N_t(r_i)}_{i=1}^{60} can be seen as an evenly spaced univariate time series.

Appendix H. Robustness check for real data analysis

In this section, we further provide a robustness check for the real data analysis presented in Section 5.4. Specifically, we repeat the two-fold CV procedure in Section 5.4 independently 100 times for RKHS, FDA and PFFR.³ In addition, we also implement FDA with 50 basis functions (FDA50) and PFFR with 30 basis functions (PFFR30). We further compare with the popular functional PCA based regression method (FPCA) proposed in Yao et al. (2005). Thus, for every method (RKHS, FDA, PFFR, FPCA), on each prediction day s ∈ {7, 8, 9, ..., 20}, we have 100 RMSEs and 100 MAEs obtained from the 100 repetitions of two-fold CV. We refer to Section 5.4 for the detailed definitions of RMSE and MAE. Figure 12 visualizes the average RMSEs and MAEs over the 100 CV repetitions achieved by different functional regression methods across s ∈ {7, 8, 9, ..., 20}. As can be seen clearly, the pattern exhibited in Figure 12 is the same as the one in Figure 5 in Section 5.4, and RKHS is the winner by a notable margin.
Note that FDA20 and FDA50 give essentially the same performance, while PFFR30 indeed performs worse than PFFR20. This suggests that a basis dimension of 20 is sufficient for the current real data application. Table 3 further gives the number of two-fold CV repetitions (out of 100) in which RKHS achieves the smallest RMSEs and MAEs for each s ∈ {7, 8, 9, ..., 20}. As can be seen, RKHS is almost always the winner for every s ∈ {7, 8, 9, ..., 20} except for s = 19, where RKHS still provides the best performance more than two thirds of the time.

[Figure 12 panels: average RMSE and average MAE plotted against the prediction day s, for RKHS, FDA20, FDA50, PFFR20, PFFR30 and FPCA.]

Figure 12: RMSEs and MAEs achieved by different functional regression methods averaged over 100 two-fold cross-validations for each s ∈ {7, 8, 9, ..., 20}.

3. For simplicity, we do not include RKHSmixed and PFFRmixed in the comparison, as the result in Section 5.4 (see Figure 5) clearly indicates that the three scalar predictors do not seem to improve the prediction performance.

s       7    8    9   10   11   12   13   14   15   16   17   18   19   20
RMSEs 100  100  100  100  100  100  100  100  100  100  100  100   77   97
MAEs  100  100  100  100  100  100  100  100  100  100  100  100   66   96

Table 3: Number of two-fold cross-validations (out of 100) in which RKHS achieves the smallest RMSEs and MAEs for each s ∈ {7, 8, 9, ..., 20}.

Appendix I. Numerical results for observations with measurement error

In this section, we conduct simulation experiments to examine the performance of different methods, including FDA, PFFR and the proposed RKHS, under the setting where the observed functional responses are additionally corrupted with measurement error. The simulation analysis is done under the function-on-function regression setting as in Section 5.2.⁴
Specifically, for ease of comparison, we use exactly the same simulation setting that generates Table 1 in Section 5.2. The only difference is that we do not observe the functional response Y_t(r_j) directly but instead a noisy version of it, denoted by y_tj, corrupted with measurement error such that y_tj = Y_t(r_j) + E_tj, where the E_tj are independent Gaussian random variables with mean 0 and variance σ_E² for t = 1, ..., T and j = 1, ..., n₂. Note that for σ_E² = 0, we recover the identical setting as in Section 5.2. It is easy to derive that the overall variability of the functional noise (i.e. E∫_{[0,1]} ε_t(r)² dr) under the simulation setting in Section 5.2 is around 0.4²/12 ≈ 0.115², which can be calculated easily using the fact that ε_t(r) = Σ_{i=1}^q e_{ti}u_i(r). We refer to Sections 5.1 and 5.2 for more details. Thus, in the following, we set the variance of the measurement error to σ_E² = 0.12², so that the variability of the functional noise and that of the measurement error are comparable. For each setting, we conduct 500 experiments. We again use RMISE to evaluate the excess risk of each estimator. Due to limited space, we only report the result for the case where the spectral norm κ = 1; the results for κ = 0.5 and κ = 2 are similar and omitted. For each method, Table 4 reports its average RMISE across 500 experiments under all simulation settings. Note that Table 4 is directly comparable with Table 1 in Section 5.2 (for κ = 1), which reports the performance of each method for σ_E² = 0. Thus, for ease of comparison, we copy the result reported in Table 1 (for κ = 1) into Table 4 under the row σ_E² = 0. Examining Table 4, it can be seen clearly that, compared to the case of no measurement error (i.e. σ_E² = 0), the performance of all three estimators (RKHS, FDA and PFFR) worsens only slightly when the variability of the measurement error and of the functional noise are comparable (i.e.
σ_E² = 0.12²), suggesting that functional noise is more difficult to handle than independent measurement error. This can also be viewed as numerical support for the result in Theorem 8, i.e. the convergence rate of RKHS is not affected when the functional response is corrupted by independent mean-zero measurement errors with a bounded variance.

4. Note that we use the function-on-function regression setting so that we can include FDA in the analysis, as FDA cannot handle the general setting of functional regression with mixed covariates.

        Scenario A: n = 5, T = 50, q = 5           Scenario B: n = 5, T = 50, q = 5
σ_E²    RKHS   FDA    PFFR   Ravg(%)  Rw(%)        RKHS   FDA    PFFR   Ravg(%)  Rw(%)
0       9.14   10.31  10.98  12.81    68           10.42  11.50  11.12  6.63     71
0.12²   10.14  11.40  12.20  12.45    64           12.61  13.87  13.23  4.96     64
        Scenario A: n = 20, T = 100, q = 20        Scenario B: n = 20, T = 100, q = 20
0       5.89   6.75   12.84  14.66    81           18.39  32.41  22.32  21.37    92
0.12²   5.87   6.86   12.44  16.93    84           22.09  34.67  25.22  14.17    83
        Scenario A: n = 40, T = 200, q = 50        Scenario B: n = 40, T = 200, q = 50
0       3.98   4.41   9.68   10.96    75           20.75  76.74  77.40  269.73   100
0.12²   4.02   4.47   9.75   11.32    77           23.71  76.95  77.52  224.49   100

Table 4: Numerical performance of RKHS, FDA and PFFR under function-on-function regression with no measurement error (σ_E² = 0) and with measurement error (σ_E² = 0.12²). The reported average RMISE is multiplied by 100. Ravg reflects the percent improvement of RKHS over the best performing competitor, and Rw reflects the percentage of experiments in which RKHS achieves the lowest RMISE.

Appendix J.
FDA and PFFR with a larger number of basis functions

As discussed in Section 5.2, the performance of penalized basis function approaches such as FDA and PFFR may be sensitive to the hyper-parameter N_b, the basis dimension used in the estimation. In this section, we further examine the performance of FDA and PFFR with a larger N_b. Specifically, we use the identical simulation setting as that in Section 5.2. We keep the implementation of RKHS the same; the only difference is that we implement FDA with N_b = 50 and PFFR with N_b = 30 instead of N_b = 20. (For PFFR we can only go up to N_b = 30, as the computational cost of PFFR with N_b > 30 is prohibitive for large-scale comparison; see below for more details.) For each method, Table 5 reports the average RMISE (RMISE_avg) across 500 experiments under all simulation settings. Table 5 is directly comparable with Table 1 in Section 5.2. Cross-examining Table 5 and Table 1, we have the following observations. For Scenario A, where the bivariate function A*(r, s) is a simple exponential function, FDA50 and PFFR30 give essentially the same performance as N_b = 20. The same applies to Scenario B with q = 5 and with q = 20 (for the low SNR parameter κ = 0.5). This indicates that N_b = 20 is sufficient for simulation settings with low model complexity. On the other hand, for Scenario B with q = 20 (for the high SNR parameters κ = 1, 2) and q = 50, compared to N_b = 20, FDA50 and PFFR30 give much improved performance due to lower approximation bias, though still with a notable performance gap compared to RKHS. We further give the boxplot of RMISE for each method in Figure 13 under κ = 1. The general pattern is the same as that exhibited in Figure 2 for N_b = 20, where RKHS provides the best overall performance.

Computational time: Note that a larger N_b can significantly increase the computational cost of the penalized basis function approaches.
For example, for FDA with the same number of basis functions N_b for the first and second arguments of the bivariate coefficient function A(r, s), the computation of FDA involves the inversion of an N_b² × N_b² matrix (see equations (16.14) and (16.15) in Ramsay and Silverman (2005) for more details), which may not scale well with the number of basis functions N_b. PFFR uses restricted MLE (REML) for its model estimation by recasting the functional regression model as a penalized additive model with mixed effects (see Ivanescu et al. (2015) for more details), which is slow in terms of computation and also does not scale well with N_b.

        Scenario A: n = 5, T = 50, q = 5           Scenario B: n = 5, T = 50, q = 5
κ       RKHS   FDA50  PFFR30 Ravg(%)  Rw(%)        RKHS   FDA50  PFFR30 Ravg(%)  Rw(%)
0.5     15.98  17.98  19.77  12.41    64           20.86  22.97  22.30  6.88     64
1       9.14   10.32  10.97  12.74    68           10.42  11.51  11.12  6.66     71
2       4.99   6.17   5.69   14.22    72           5.21   5.75   5.55   6.69     74
        Scenario A: n = 20, T = 100, q = 20        Scenario B: n = 20, T = 100, q = 20
0.5     10.61  11.94  25.93  12.55    77           37.41  40.65  40.53  8.34     58
1       5.89   6.77   13.32  14.96    81           18.39  20.63  19.75  7.43     73
2       3.60   3.89   6.98   7.99     72           9.19   11.75  9.98   8.52     87
        Scenario A: n = 40, T = 200, q = 50        Scenario B: n = 40, T = 200, q = 50
0.5     7.37   8.54   21.00  15.86    82           39.22  48.27  72.42  23.40    94
1       3.98   4.40   10.70  10.75    75           20.75  35.82  70.61  76.05    100
2       2.34   2.34   5.57   0.11     47           12.59  29.77  70.20  151.92   100

Table 5: Numerical performance of RKHS, FDA and PFFR under function-on-function regression. The reported average RMISE is multiplied by 100. Ravg reflects the percent improvement of RKHS over the best performing competitor, and Rw reflects the percentage of experiments in which RKHS achieves the lowest RMISE.
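The scaling concern can be made concrete with a back-of-the-envelope calculation: a dense factorization of the N_b² × N_b² matrix costs on the order of (N_b²)³ = N_b⁶ flops, so the cost grows very quickly in N_b. The values of N_b below mirror those used in this section; the flop count is a leading-order estimate, not a measured runtime.

```python
# Leading-order cost of inverting the Nb^2 x Nb^2 matrix in FDA-style estimation.
for nb in (20, 30, 50):
    dim = nb ** 2           # matrix dimension Nb^2
    flops = dim ** 3        # dense factorization is cubic in the dimension
    print(nb, dim, f"{flops:.1e}")

# Going from Nb = 20 to Nb = 50 multiplies the cost by (50/20)^6, roughly 244x.
```

This is consistent with the sharp increase in computational time observed for FDA50 in Figure 14.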
For illustration, Figure 14 gives the boxplot of computational time (in log-scale) for RKHS, FDA20 and FDA50 with a fixed tuning parameter (λ = 10⁻¹⁵), and for PFFR20 and PFFR30, across 500 experiments. To conserve space, we present the case of Scenario B with κ = 1; results under other settings are similar and thus omitted. Note that for FDA and RKHS, the computational time does not depend on λ, as we have closed-form solutions. PFFR uses REML to automatically select the roughness penalty and does not require cross-validation. Thus, for a fair comparison, if RKHS and FDA require a 5-fold cross-validation to search for the best tuning parameter among 50 candidates, we need to add log(250) ≈ 5.5 to the log-time for FDA and RKHS in Figure 14. As can be seen from Figure 14, FDA50 incurs a much higher computational cost than FDA20. Similarly, PFFR30 has a noticeably higher computational cost than PFFR20. For the current simulation settings, RKHS has the lowest computational cost.

[Figure 13 panels: boxplots for Scenarios A and B with (n, T, q) ∈ {(5, 50, 5), (20, 100, 20), (40, 200, 50)}.]

Figure 13: Boxplots of RMISE of RKHS, FDA50 and PFFR30 across 500 experiments under function-on-function regression with κ = 1. Red points denote the average RMISE.

[Figure 14 panels: boxplots of log computational time for RKHS, FDA20, FDA50, PFFR20 and PFFR30 under Scenario B with κ = 1.]
Figure 14: Boxplots of computational time (in log-scale) for RKHS, FDA20 and FDA50 with a fixed tuning parameter (λ = 10⁻¹⁵), and for PFFR20 and PFFR30, across 500 experiments.

Appendix K. Additional numerical comparison with FPCA

In this section, we further provide numerical comparison with the popular functional PCA based regression method (FPCA) proposed in Yao et al. (2005), which estimates the functional linear model based on the Karhunen-Loève expansion. FPCA is implemented via the Matlab package PACE (FPCreg function). We keep all tuning parameters at the default values suggested in the PACE package.⁵

Function-on-function regression: We first consider the function-on-function regression setting as in Section 5.2 of the main text. Specifically, we use the identical simulation setting that generates Figure 2 in Section 5.2; the only difference is that besides RKHS, FDA and PFFR, we further implement FPCA in the simulation. For each setting, we conduct 500 experiments. We again use RMISE to evaluate the excess risk of each estimator. Due to limited space, we only report the result for the case where the spectral norm κ = 1; the results for κ = 0.5 and κ = 2 are similar and omitted. For each method (RKHS, FDA, PFFR and FPCA), Figure 15 visualizes the boxplot of its RMISE across 500 experiments under each simulation setting. Note that the only difference between Figure 2 and Figure 15 is that Figure 15 further includes the boxplot of RMISE for FPCA. As can be seen, for Scenario A, where the coefficient function A(s, t) is the simple exponential function, FPCA provides reasonable performance. However, its performance is not satisfactory when the coefficient function A(s, t) becomes more complex as in Scenario B.
This phenomenon is well known in the literature: the functional space spanned by the eigenfunctions associated with the leading principal components may not be able to approximate the coefficient function A(s, t) well; see Cai and Yuan (2012) and Sun et al. (2018) for similar observations.

Functional regression with mixed predictors: We then consider the functional regression with mixed covariates setting as in Section 5.3 of the main text. Specifically, we use the identical simulation setting that generates Figure 3 in Section 5.3; the only difference is that besides RKHS and PFFR, we further implement FPCA in the simulation. Note that the original FPCreg function in the Matlab package PACE only supports function-on-function regression and does not support the general setting of functional regression with mixed covariates. Thus, we modify the original code in FPCreg to implement FPCA for functional regression with mixed covariates. The modification is straightforward: we further include the observed scalar predictors (Z_t1, Z_t2, Z_t3) together with the estimated FPCs of the functional predictor X_t to predict the FPCs of the functional response Y_t. For each setting, we conduct 500 experiments. We again use RMISE to evaluate the excess risk of each estimator. Due to limited space, we only report the result for the case where the spectral norm κ = 1; the results for κ = 0.5 and κ = 2 are similar and omitted. For each method (RKHS, PFFR and FPCA), Figure 16 visualizes the boxplot of its RMISE across 500 experiments under each simulation setting. Note that the only difference between Figure 3 and Figure 16 is that Figure 16 further includes the boxplot of RMISE for FPCA. As can be seen, the phenomenon is essentially the same as the one observed under the function-on-function regression setting. Specifically, for Scenario A, where the coefficient function A(s, t) is the simple exponential function, FPCA provides reasonable performance.
However, its performance is not satisfactory when the coefficient function A(s, t) becomes more complex as in Scenario B. Again, this phenomenon is due to the fact that the functional space spanned by the eigenfunctions associated with the leading principal components may not be able to approximate the coefficient function A(s, t) well; see Cai and Yuan (2012) and Sun et al. (2018) for similar observations.

5. See http://www.stat.ucdavis.edu/PACE for more details.

[Figure 15 panels: boxplots for Scenarios A and B with (n, T, q) ∈ {(5, 50, 5), (20, 100, 20), (40, 200, 50)}.]

Figure 15: Boxplots of RMISE of RKHS, FDA, PFFR and FPCA across 500 experiments under function-on-function regression with κ = 1. Red points denote the average RMISE.

Appendix L. Illustration of phase transition

In this section, we provide an indirect numerical illustration of the phase transition phenomenon discovered in Theorem 4. Recall that Theorem 4 suggests that, putting aside the discretization error ζ_n, there is a phase transition between a nonparametric rate δ_T and a high-dimensional parametric rate s·log(p)/T. Here, the nonparametric rate takes the form δ_T = T^{-2r/(2r+1)} for some r > 1/2 and is associated with the estimation of the bivariate coefficient function A*. In other words, δ_T = T^{-c} for some c > 0, with c strictly smaller than 1. The high-dimensional parametric rate s·log(p)/T is associated with the estimation of the high-dimensional sparse univariate coefficient functions β*.
In the following, we numerically verify these two rates by considering two special cases of the proposed functional linear model with mixed predictors. Specifically, we follow the same simulation setting as Scenario B in Section 5.1 of the main text and simulate data from the functional linear regression model
$$Y_t(r) = \int_{[0,1]} A^*(r, s)X_t(s)\,ds + \sum_{j=1}^p \beta_j^*(r)Z_{tj} + \epsilon_t(r), \quad r \in [0, 1]. \qquad (87)$$

[Figure 16 panels: boxplots for Scenarios A and B with (n, T, q) ∈ {(5, 50, 5), (20, 100, 20), (40, 200, 50)}.]

Figure 16: Boxplots of RMISE of RKHS, PFFR and FPCA across 500 experiments under functional regression with mixed predictors with κ = 1. Red points denote the average RMISE.

The functional predictor X_t, the scalar predictor Z_t and the functional error ε_t are generated using the same setting as in Section 5.1. Following Scenario B, the coefficient functions A* and β*_1, ..., β*_p are generated from a q-dimensional subspace spanned by the basis functions {u_i(s)}_{i=1}^q. We refer to Section 5.1 for a more detailed description of Scenario B. In the following, we fix the dimension q = 20 and the SNR parameter κ = 2. To minimize the impact of the discretization error ζ_n, we set the discrete sample points {r_j}_{j=1}^{n₂} for Y_t and {s_i}_{i=1}^{n₁} for X_t to be evenly spaced dense grids on [0, 1] with a large number of grid points n = n₁ = n₂ = 50. To evaluate the excess risk given by the RKHS estimator, we follow Section 5.1 and use the RMISE defined in (21). To match the result in Theorem 4, we compute the MISE, which is the squared RMISE: MISE = RMISE².
Nonparametric rate δ_T: We first focus on the nonparametric rate δ_T and consider the special case of (87) obtained by setting p = 0, i.e. function-on-function regression. We vary the sample size T = 100, 200, 300, ..., 2000, and for each T we conduct 500 experiments. For each sample size T, we compute the average MISE across the 500 experiments, and Figure 17(A) plots log(average MISE) against log(T); the relationship is seen to be roughly linear. We further fit an ordinary least squares (OLS) estimator to the points, which gives an estimated slope of -0.811, confirming that δ_T takes the nonparametric rate O(T^{-c}) with c being a positive constant less than 1.

High-dimensional parametric rate s·log(p)/T: We now focus on the high-dimensional parametric rate and consider the special case of (87) obtained by setting A* = 0, i.e. functional linear regression with only the scalar predictors Z_t1, Z_t2, ..., Z_tp. We first fix s = 1 and T = 200, and vary the dimension p = 10, 50, 100, 150, ..., 450, 500. In other words, the coefficient functions β*_j ≡ 0 for j = 2, 3, ..., p, and β*_1 is generated as in Scenario B of Section 5.1 (see the more detailed description above). For each dimension p, we conduct 500 experiments and compute the average MISE. Figure 17(B) plots the average MISE against log(p); the relationship is seen to be roughly linear. This confirms that, fixing the sparsity s and the sample size T, the excess risk increases in the order of O(log(p)). We next fix s = 1, p = 10, and vary the sample size T = 100, 200, 300, ..., 2000. For each sample size T, we conduct 500 experiments and compute the average MISE. Figure 17(C) plots log(average MISE) against log(T); the relationship is seen to be roughly linear.
We further fit an OLS estimator to the points, which gives an estimated slope of -1.012, confirming that, fixing the sparsity s and the dimension p, the excess risk decreases in the order of roughly O(1/T). Finally, we fix T = 200, p = 100, and vary the sparsity s = 1, 2, 5, 10, 15, 20. In other words, the coefficient functions β*_j ≡ 0 for j = s + 1, s + 2, ..., p, and β*_1 = β*_2 = ... = β*_s are generated as in Scenario B of Section 5.1. For each sparsity s, we conduct 500 experiments and compute the average MISE. Figure 17(D) plots log(average MISE) against log(s); the relationship is seen to be roughly linear. We further fit an OLS estimator to the points, which gives an estimated slope of 1.087, confirming that, fixing the sample size T and the dimension p, the excess risk increases in the order of roughly O(s). Thus, we numerically illustrate that the prediction error associated with the estimation of the high-dimensional sparse univariate coefficient functions β* is of order O(s·log(p)/T). Putting these two cases together, we can conclude that there is a phase transition phenomenon between a nonparametric rate O(T^{-c}) and a high-dimensional parametric rate O(s·log(p)/T).

[Figure 17 panels: (A) Nonparametric rate: prediction error vs. sample size T, estimated slope -0.811; (B) High-dimensional rate: prediction error vs. dimension p, estimated slope 0.989 × 10⁻⁴; (C) High-dimensional rate: prediction error vs. sample size T, estimated slope -1.012; (D) High-dimensional rate: prediction error vs. sparsity s, estimated slope 1.087.]

Figure 17: Numerical illustration of the nonparametric rate (A) and the high-dimensional parametric rate (B-D).
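The slope-fitting procedure behind panels (A) and (C) can be sketched as follows: simulate average errors decaying at a known rate, regress log(error) on log(T), and read off the exponent. The decay rate 0.8 and the small log-scale noise below are illustrative assumptions, not the paper's simulation output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate average errors decaying as T^(-0.8), mimicking panel (A).
T = np.arange(100, 2001, 100)
mise = T ** -0.8 * np.exp(rng.normal(0.0, 0.02, T.size))  # mild log-scale noise

# OLS of log(MISE) on log(T): the fitted slope estimates the rate exponent -c.
X = np.column_stack([np.ones(T.size), np.log(T)])
slope = np.linalg.lstsq(X, np.log(mise), rcond=None)[0][1]
print(slope)  # close to -0.8
```

An estimated slope near the true exponent is exactly the evidence used in Figure 17 to identify the nonparametric and parametric regimes.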