# kernel_density_estimation_for_dynamical_systems__07e1fcb0.pdf Journal of Machine Learning Research 19 (2018) 1-49 Submitted 7/16; Revised 8/18; Published 9/18 Kernel Density Estimation for Dynamical Systems Hanyuan Hang hans2017@ruc.edu.cn Institute of Statistics and Big Data Renmin University of China 100872 Beijing, China Ingo Steinwart ingo.steinwart@mathematik.uni-stuttgart.de Institute for Stochastics and Applications University of Stuttgart 70569 Stuttgart, Germany Yunlong Feng ylfeng@albany.edu Department of Mathematics and Statistics State University of New York The University at Albany Albany, New York 12222, USA Johan A.K. Suykens johan.suykens@esat.kuleuven.be Department of Electrical Engineering, ESAT-STADIUS, KU Leuven Kasteelpark Arenberg 10, Leuven, B-3001, Belgium Editor: John Shawe-Taylor We study the density estimation problem with observations generated by certain dynamical systems that admit a unique underlying invariant Lebesgue density. Observations drawn from dynamical systems are not independent and moreover, usual mixing concepts may not be appropriate for measuring the dependence among these observations. By employing the C-mixing concept to measure the dependence, we conduct statistical analysis on the consistency and convergence of the kernel density estimator. Our main results are as follows: First, we show that with properly chosen bandwidth, the kernel density estimator is universally consistent under L1-norm; Second, we establish convergence rates for the estimator with respect to several classes of dynamical systems under L1-norm. In the analysis, the density function f is only assumed to be H older continuous or pointwise H older controllable which is a weak assumption in the literature of nonparametric density estimation and also more realistic in the dynamical system context. Last but not least, we prove that the same convergence rates of the estimator under L -norm and L1-norm can be achieved when the density function is H older continuous, compactly supported, and bounded. The bandwidth selection problem of the kernel density estimator for dynamical system is also discussed in our study via numerical simulations. Keywords: Kernel density estimation, dynamical system, dependent observations, Cmixing process, universal consistency, convergence rates, covering number, learning theory c 2018 Hanyuan Hang, Ingo Steinwart, Yunlong Feng and Johan A.K. Suykens. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v19/16-349.html. Hang, Steinwart, Feng, Suykens 1. Introduction Dynamical systems are now ubiquitous and are vital in modeling complex systems, especially when they admit recurrence relations. Statistical inference for dynamical systems has drawn continuous attention across various fields, the topics of which include parameter estimation, invariant measure estimation, forecasting, noise detection, among others. For instance, in the statistics and machine learning community, the statistical inference for certain dynamical systems have been recently studied in Suykens et al. (1995); Suykens and Vandewalle (2000); Suykens et al. (2002); Zoeter and Heskes (2005); Anghel and Steinwart (2007); Steinwart and Anghel (2009); Deisenroth and Mohamed (2012); Mc Goffet al. (2015a); Hang and Steinwart (2017), just to name a few. We refer the reader to a recent survey in Mc Goffet al. (2015b) for a general depiction of this topic. The purpose of this study is to investigate the density estimation problem for dynamical systems via a classical nonparametric approach, i.e., kernel density estimation. The commonly considered density estimation problem can be stated as follows. Let x1, x2, . . . , xn be observations drawn independently from an unknown distribution P on Rd with the density f. Density estimation is concerned with the estimation of the underlying density f. Accurate estimation of the density is important for many machine learning tasks such as regression, classification, and clustering problems and also plays an important role in many real-world applications. Nonparametric density estimators are popular since weaker assumptions are applied to the underlying probability distribution. Typical nonparametric density estimators include the histogram and kernel density estimator. In this study, we are interested in the latter one, namely, kernel density estimator, which is also termed as Parzen-Rosenblatt estimator (Parzen, 1962; Rosenblatt, 1956) and takes the following form fn(x) = 1 nhd Here, h := hn > 0 is a bandwidth parameter and K is a smoothing kernel. In the literature, point-wise and uniform consistency and convergence rates of the estimator fn to the unknown truth density f under various distance measurements, e.g., L1, L2, and L , have been established by resorting to the regularity assumptions on the smoothing kernel K, the density f, and the decay of the bandwidth sequence {hn}. Besides the theoretical concerns on the consistency and convergence rates, another practical issue one usually needs to address is the choice of the bandwidth parameter hn, which is also called the smoothing parameter. It plays a crucial role in the bias-variance trade-offin kernel density estimation. In the literature, approaches to choosing the smoothing parameter include least-squares cross-validation (Bowman, 1984; Rudemo, 1982), biased cross-validation (Scott and Terrell, 1987), plug-in method (Park and Marron, 1990; Sheather and Jones, 1991), the double kernel method (Devroye, 1989), and also the method based on a discrepancy principle (Eggermont and La Riccia, 2001). We refer the reader to Jones et al. (1996a) for a general overview and to Wand and Jones (1994); Cao et al. (1994); Jones et al. (1996b); Devroye (1997) for more detailed reviews. Note that studies on the kernel density estimator (1) mentioned above heavily rely on the assumption that the observations are drawn in an i.i.d fashion. In the literature of statistics and machine learning, it is commonly accepted that the i.i.d assumption on the Kernel Density Estimation for Dynamical Systems given data can be very much restrictive in real-world applications. Having realized this, researchers turn to weaken this i.i.d assumption by assuming that the observations are weakly dependent under various notions of weakly dependence which include α-mixing, β-mixing, and φ-mixing (Bradley, 2005). There has been a flurry of work to attack this problem with theoretical and practical concerns, see e.g., Gy orfi(1981); Masry (1983, 1986); Robinson (1983); Masry and Gy orfi(1987); Tran (1989b); Gy orfiand Masry (1990); Tran (1989a); Hart and Vieu (1990); Yu (1993) and Hall et al. (1995), under the above notions of dependence. These studies were conducted under various notions of sample dependence. In fact, as Gy orfiand Lugosi (1992) pointed out, for samples that are ergodic, kernel density estimation is not universally consistent under the usual conditions. A counter example was devised there showing the existence of an ergodic sequence of uniformly distributed random variables based on which the kernel density estimation almost surely does not tend to zero in the L1 sense. On the other hand, the assumed correlation among the observations complicates the kernel density estimation problem from a technical as well as practical view and also brings inherent barriers. This is because, more frequently, the analysis on the consistency and convergence rates of the kernel density estimator (1) is proceeded by decomposing the error term into bias and variance terms, which correspond to data-free and data-dependent error terms, respectively. The data-free error term can be tackled by using techniques from the approximation theory while the data-dependent error term is usually dealt with by exploiting arguments from the empirical process theory such as concentration inequalities. As a result, due to the existence of dependence among observations and various notions of the dependence measurement, the techniques, and results concerning the datadependent error term are in general not universally applicable. On the other hand, it has been also pointed out that the bandwidth selection in kernel density estimation under dependence also departures from the independent case, see e.g., Hart and Vieu (1990); Hall et al. (1995). In fact, when the observations x1, x2, . . . , xn Rd are generated by certain ergodic measure-preserving dynamical systems, the problem of kernel density estimation can be even more involved. To explain, let us consider a discrete-time ergodic measure-preserving dynamical system described by the sequence (T n)n 1 of iterates of an unknown map T : Ω Ωwith Ω Rd and a unique invariant measure P which possesses a density f with respect to the Lebesgue measure (rigorous definitions will be given in the sequel). That is, we have xi = T i(x0), i = 1, 2, . . . , n, (2) where x0 is the initial state. It is noticed that in this case the usual mixing concepts are not general enough to characterize the dependence among observations generated by (2) (Maume-Deschamps, 2006; Steinwart and Anghel, 2009; Hang and Steinwart, 2017). On the other hand, existing theoretical studies on the consistency or convergence rates of the kernel density estimator for i.i.d. observations frequently assume that the density function f is sufficiently smooth, e.g., first-order or even second-order smoothness. However, more often than not, this requirement can be stringent in the dynamical system context. It is well-known that (see e.g., Liverani (1995); Baladi (2000)) piecewise expanding maps (or Lasota-Yorke maps) admit a density f which only belongs to the space BV , i.e., functions Hang, Steinwart, Feng, Suykens of bounded variation. Typical examples are the Gauss map in Example 3 and the β-maps in Example 1 (see Subsection 2.2). In this study, the kernel density estimation problem with observations generated by dynamical systems (2) is approached by making use of a more general concept for measuring the dependence of observations, namely, the so-called C-mixing process (refer to Section 2 for the definition). Proposed in Maume-Deschamps (2006) and recently investigated in Hang and Steinwart (2017), the C-mixing concept is shown to be more general and powerful in measuring dependence among observations generated by dynamical systems and can accommodate a large class of dynamical systems. There, a Bernstein-type exponential inequality for C-mixing processes was established and its applications to some learning schemes were explored. Our main purpose in this paper is to conduct some theoretical analysis and practical implementations on the kernel density estimator for dynamical systems. The primary concern is the consistency and convergence rates of the kernel density estimator (1) with observations generated by dynamical systems (2). The consistency and convergence analysis is conducted under L1-norm, and L -norm, respectively. We show that under mild assumptions on the smoothing kernel, with properly chosen bandwidth, the estimator is universally consistent under L1-norm. When the probability distribution P possesses a polynomial or exponential decay outside of a radius-r ball in its support, under the H older continuity assumptions on the kernel function and the density, we obtain almost optimal convergence rates under L1-norm. Moreover, when the probability distribution P is compactly supported, which is a frequently encountered setting in the dynamical system context, we prove that stronger convergence results of the estimator can be developed, i.e., convergence results under L - norm which are shown to be of the same order with its L1-norm convergence rates. Finally, with regard to the practical implementation of the estimator, we also discuss the bandwidth selection problem by performing numerical comparisons among several typical existing selectors that include least squares cross-validation and its variants for dependent observations, and the double kernel method. We show that the double kernel bandwidth selector proposed in Devroye (1989) can in general work well. Moreover, according to our numerical experiments, we find that bandwidth selection for kernel density estimator of dynamical systems is usually ad-hoc in the sense that its performance may depend on the considered dynamical system. The rest of this paper is organized as follows. Section 2 is a warm-up section for the introduction of some notations, definitions and assumptions that are related to the kernel density estimation problem and dynamical systems. We provide our main results on the consistency and convergence rates of the kernel density estimator in Section 3. Some comments and discussions concerning the main results will be also provided in this section. The main analysis on bounding error terms is presented in Section 4. We discuss the bandwidth selection problem in Section 5 and provide some numerical simulations. All the proofs of Section 3 and Section 4 can be found in Section 6. We end this paper in Section 7. Kernel Density Estimation for Dynamical Systems 2. Preliminaries 2.1 Notations Throughout this paper, λd is denoted as the Lebesgue measure on Rd and is an arbitrary norm on Rd. We denote Br as the centered ball of Rd with radius r, that is, Br := {x = (x1, . . . , xd) Rd : x r}. Recall that for 1 p < , the ℓd p-norm is defined as x ℓdp := (xp 1 + + xp d)1/p, and the ℓd -norm is defined as x ℓd := maxi=1,...,d |xi|. Let (Ω, A, µ) be a probability space. We denote Lp(µ) as the space of (equivalence classes of) measurable functions g : Ω R with finite Lp-norm g p. Then Lp(µ) together with g p forms a Banach space. Moreover, if A A is a sub-σ-algebra, then Lp(A , µ) denotes the space of all A -measurable functions g Lp(µ). Finally, for a Banach space E, we write BE for its closed unit ball. In what follows, the notation an bn means that there exists a positive constant c such that an c bn, for all n N. With a slight abuse of notation, in this paper, c, c and C are used interchangeably for positive constants while their values may vary across different lemmas, propositions, theorems, and corollaries. 2.2 Dynamical Systems and C-mixing Processes In this subsection, we first introduce the dynamical systems of interest, namely, ergodic measure-preserving dynamical systems. Mathematically, an ergodic measure-preserving dynamical system is a system (Ω, A, µ, T) with a mapping T : Ω Ωthat is measurepreserving, i.e., µ(A) = µ(T 1A) for all A A, and ergodic, i.e., T 1A = A implies µ(A) = 0 or 1. In this study, we are confined to the dynamical systems in which Ωis a subset of Rd, µ is a probability measure that is absolutely continuous with respect to the Lebesgue measure λd and admits a unique invariant Lebesgue density f. In order to include a larger variety of density functions that commonly appear in dynamical systems, we introduce a new measurement of continuity that is a generalization of the α-H older continuity, which is defined as follows: Definition 1 (Pointwise α-H older Controllable) A density function f : Rd [0, ) is called pointwise α-H older controllable, if for λd-almost all x Rd there exists a constant c(x) 0 and a radius r(x) > 0 such that for all x Rd with x < r(x) we have |f(x + x ) f(x)| c(x) x α . Moreover, f is called uniformly pointwise α-H older controllable, if r0 := ess inf x Ωr(x) > 0 . Note that the an α-H older continuous function can be recognized as a special case of the α-H older controllable functions with c(x) and r(x) being some universal constant c > 0 and r > 0. In our study, it is assumed that the observations x1, x2, , xn are generated by the discrete-time dynamical system (2). Below we list several typical examples of discrete-time dynamical systems that satisfy the above assumptions (Lasota and Mackey, 1985): Hang, Steinwart, Feng, Suykens Example 1 (β-Map) For β > 1, the β-map is defined as T(x) = βx mod 1, x (0, 1), with a unique invariant Lebesgue density given by f(x) = cβ X i 0 β (i+1)1[0,T i(1)](x), where cβ is a constant chosen such that f has integral 1. Example 2 (Logistic Map) The Logistic map defined by T(x) = 4x(1 x), x (0, 1) admits the unique invariant Lebesgue density x(1 x) 1(0,1)(x), x R. Moreover, for all α (0, 1/2), the density f is α-H older controllable with 0 if x < 0 x 1/2 α if 0 < x < 1/4 4 if 1/4 x 3/4 (1 x) 1/2 α if 3/4 x < 1 0 if x > 1 . x if x < 0 x/2 if 0 < x < 1/4 1/4 if 1/4 x 3/4 1 x/2 if 3/4 x < 1 x 1 if x > 1 . Example 3 (Gauss Map) The Gauss map is defined by x mod 1, x (0, 1), with a unique invariant Lebesgue density f(x) = 1 log 2 1 1 + x 1[0,1](x), x R. Moreover, since the restriction f|[0,1] is Lipschitz continuous with Lipschitz constant 1 log 2, it is easy to check that f is pointwise 1-H older controllable with r(x) given by (3) and c(x) := 1 log 2 1(0,1)(x). Kernel Density Estimation for Dynamical Systems We now introduce the notion for measuring the dependence among observations from dynamical systems, namely, C-mixing process. To this end, let us assume that (X, B) is a measurable space with X Rd. Let X := (Xn)n 1 be an X-valued stochastic process on (Ω, A, µ), and for 1 i j , denote by Aj i the σ-algebra generated by (Xi, . . . , Xj). Let Γ : Ω X be a measurable map. µΓ is denoted as the Γ-image measure of µ, which is defined as µΓ(B) := µ(Γ 1(B)), B X measurable. The process X is called stationary if µ(Xi1+j,...,Xin+j) = µ(Xi1,...,Xin) for all n, j, i1, . . . , in 1. Denote P := µX1. Moreover, for ψ, ϕ L1(µ) satisfying ψϕ L1(µ), we denote the correlation of ψ and ϕ by cor(ψ, ϕ) := Z It is shown that several dependency coefficients for X can be expressed in terms of such correlations for restricted sets of functions ψ and ϕ. In order to introduce the notion, we also need to define a new norm which introduces restrictions on ψ and ϕ considered here. Throughout this paper, C(X) is denoted as a subspace of bounded measurable functions g : X R and that we have a semi-norm ||| ||| on C(X). For g C(X), we define the C-norm C by g C := g + |||g|||. (4) Additionally, we need to introduce the following restrictions on the semi-norm ||| |||. Assumption A We assume that the following two restrictions on the semi-norm ||| ||| hold: (i) |||g||| = 0 for all constant functions g C(X); (ii) |||eg||| eg |||g|||, g C(X). Note that the first constraint on the semi-norm in Assumption A implies its shift invariance on R while the inequality constraint can be viewed as an abstract chain rule if one views the semi-norm as a norm describing aspects of the smoothness of g. In fact, it is easy to show that the following function classes, which are probably also the most frequently considered in the dynamical system context, satisfy Condition (i) in Assumption A. Moreover, they also satisfy Condition (ii) in Assumption A, as shown in Hang and Steinwart (2017): L (X): The class of bounded functions on X; BV (X): The class of bounded variation functions on X; Cb,α(X): The class of bounded and α-H older continuous functions on X; Lip(X): The class of Lipschitz continuous functions on X; C1(X): The class of continuously differentiable functions on X. Hang, Steinwart, Feng, Suykens The corresponding semi-norms are |||g|||L (X) := 0, |||g|||BV (X) := g BV (X), |||g|||Cb,α(X) := |g|α := sup x =x |g(x) g(x )| |||g|||Lip(X) := |g|1, |||g|||C1(X) := Throughout this paper, we assume that for all r 1, there exists a function g C and a constant K such that 1Bc r/2 g 1 and |||g||| K. It is easily to see that this holds for function sets C = Lip, Cb,α, C1 etc. Definition 2 (C-mixing Process) Let (Ω, A, µ) be a probability space, (X, B) be a measurable space, X := (Xi)i 1 be an X-valued, stationary process on Ω, and C be defined by (4) for some semi-norm ||| |||. Then, for n 1, we define the C-mixing coefficients by φC(X, n) := sup cor(ψ, g(Xk+n)) : k 1, ψ BL1(Ak 1,µ), g BC(X) , and the time-reversed C-mixing coefficients by φC,rev(X, n) := sup cor(g(Xk), ϕ) : k 1, g BC(X), ϕ BL1(A k+n,µ) . Let (dn)n 1 be a strictly positive sequence converging to 0. We say that X is (timereversed) C-mixing with rate (dn)n 1, if we have φC,rev(X, n) dn for all n 1. Moreover, if (dn)n 1 is of the form dn := c0 exp bnγ , n 1, for some constants c0 > 0, b > 0, and γ > 0, then X is called geometrically (timereversed) C-mixing. Remark 3 In Definition 2, if ||| ||| = 0, we obtain the classical φ-mixing coefficients. If ||| ||| = 0, the resulting C-norm satisfies C and therefore, the mixing coefficients admit fewer functions. Thus, the considered functions must be smoother than the ones in the φ-mixing case and therefore statistical changes of small spatial nature in x do not have such a large impact on h, if h is smooth. In other words, even if the trajectory x1, . . . , xn stays in a certain region for a while, this does not impact the empirical average 1 n P i=1 h(xi) as much as it would for non-smooth h. As a result, the concentration properties in this case hold similarly as in the i.i.d. case. From the above definition, we see that a C-mixing process is defined in association with an underlying function space. For the above listed function spaces, i.e., L (X), BV (X), Cb,α(X), Lip(X), and C1(X), the increase of the smoothness enlarges the class of the Kernel Density Estimation for Dynamical Systems associated stochastic processes, as illustrated in Hang and Steinwart (2017). Note that the classical φ-mixing process is essentially a C-mixing process associated with the function space L (X). Note also that not all α-mixing processes are C-mixing, and vice versa. We refer the reader to Hang and Steinwart (2017) for the relations among α-, φand C-mixing processes. On the other hand, under the above notations and definitions, from Theorem 4.7 in Maume-Deschamps (2006), we know that Logistic map in Example 2 is geometrically timereversed C-mixing with C = Lip(0, 1) while Theorem 4.4 in Maume-Deschamps (2006) (see also Chapter 3 in Baladi (2000)) indicates that Gauss map in Example 3 is geometrically time-reversed C-mixing with C = BV (0, 1). Example 1 is also geometrically time-reversed C-mixing with C = BV (0, 1) according to Maume-Deschamps (2006). For more examples of geometrically time-reversed C-mixing dynamical systems, the reader is referred to Section 2 in Hang and Steinwart (2017). Besides, there also exist several high-dimensional examples. For instance, piecewise expanding maps (Baladi, 2000, Chapter 3), hyperbolic and Smale s Axiom A diffeomorphisms (Baladi, 2000, Chapter 4), and Anosov diffeomorphisms (Baladi, 2001; Lasota and Mackey, 1985). 2.3 Kernel Density Estimators for Dynamical Systems In this subsection, we formulate kernel density estimators for dynamical systems that admit a unique underlying invariant Lebesgue density. The existence and uniqueness of an invariant measure (and the corrsponding invariant density) and smooth invariant measure is a classical problem in the theory of dynamical systems (Katok and Hasselblatt, 1995; Baladi, 2000; Lasota and Mackey, 1985). Perhaps the first existence theorem for continuous maps goes back to Krylov and Bogolyubov, see e.g. (Katok and Hasselblatt, 1995, Theorem 4.1.1). Then Lasota and Yorke (1973) proved the existence theorem for piecewise expanding maps. Since then, results concerning the existence for many other dynamical systems such as uniformly hyperbolic attractors and nonuniformly hyperbolic uni-modal maps have been established, see e.g. Baladi (2000). From the discussion after Theorem 2.1 in Baladi (2000) we know that mixing (thus ergodic) implies the uniqueness among all absolutely continuous invariant measures and smoothness ensures the existence of the invariant measure. For the smoothing kernel K in the kernel density estimator, in this paper we consider its following general form, namely, d-dimensional smoothing kernel: Definition 4 A bounded, monotonically decreasing function K : [0, ) [0, ) is a ddimensional smoothing kernel if Z Rd K( x ) dx =: κ (0, ). (5) The choice of the norm in Definition 4 does not matter since all norms on Rd are equivalent. To see this, let be another norm on Rd satisfying κ (0, ). From the equivalence of the two norms on Rd, one can find a positive constant c such that x c x holds for all x R. Therefore, easily we have Z Rd K( x ) dx Z Rd K ( x /c) dx = cd Z Rd K( x ) dx < . Hang, Steinwart, Feng, Suykens In what follows, without loss of generality, we assume that the constant κ in Definition 4 equals to 1. Lemma 5 A bounded, monotonically decreasing function K : [0, ) [0, ) is a ddimensional smoothing kernel if and only if 0 K(r)rd 1 dr (0, ). Proof From the above discussions, it suffices to consider the integration constraint for the kernel function K with respect to the Euclidean norm ℓd 2. We thus have Rd K x ℓd 2 dx = dτd 0 K(r)rd 1 dr, where τd = πd/2 Γ d 2 +1 is the volume of the unit ball Bℓd 2 of the Euclidean space ℓd 2. This completes the proof of Lemma 5. Let r [0, + ) and denote 1A as the indicator function. Several common examples of d-dimensional smoothing kernels K(r) include the Naive kernel 1[0,1](r), the Triangle kernel (1 r)1[0,1](r), the Epanechnikov kernel (1 r2)1[0,1](r), and the Gaussian kernel e r2. In this paper, we are interested in the kernels that satisfy the following restrictions on their shape and regularity: Assumption B For a fixed function space C(X), we make the following assumptions on the d-dimensional smoothing kernel K: (i) K is H older continuous with exponent β with β (0, 1]; (ii) κυ := R 0 K(r)rυ+d 1 dr < for some υ (0, ); (iii) For some R > 0, K satisfies K(r) = 0 for all r > R; (iv) For all x Rd, we have K( x /h) C(X) and there exists a function ϕ : (0, ) (0, ) such that sup x Rd |||K( x /h)||| ϕ(h). It is easy to verify that for C = Lip, Assumption B is met for the Triangle kernel, the Epanechnikov kernel, and the Gaussian kernel. Particularly, Condition (iv) holds for all these kernels with ||| ||| being the Lipschitz norm and ϕ(h) O(h 1). Moreover, as we shall see below, not all the conditions in Assumption B are required for the analysis conducted in this study and conditions assumed on the kernel will be specified explicitly. We now show that given a d-dimensional smoothing kernel K as in Definition 4, one can easily construct a probability density on Rd. Kernel Density Estimation for Dynamical Systems Definition 6 (K-Smoothing of a Measure) Let K be a d-dimensional smoothing kernel and Q be a probability measure on Rd. Then, for h > 0, f Q,h(x) := f Q,K,h(x) := h d Z Rd K x x /h d Q(x ), x Rd, is called a K-smoothing of Q. It is not difficult to see that f Q,h defines a probability density on Rd since Fubini s theorem yields that Z Rd f Q,h(x) dx = Z Rd h d K x x /h d Q(x ) dx Rd K( x ) dx d Q(x ) = 1. Let us denote Kh : Rd [0, + ) as Kh(x) := h d K ( x /h) , x Rd. (6) Note that Kh also induces a density function on Rd since there holds Kh 1 = 1. For the sake of notational simplification, in what follows, we introduce the convolution operator . Under this notation, we then see that f Q,h is the density of the measure that is the convolution of the measure Q and νh = Kh dλd. Recalling that P is a probability measure on Rd with the corresponding density function f, by taking Q := P with d P = f dλd, we have f P,h = Kh f = f Kh = Kh d P. (7) Since Kh L (Rd) and f L1(Rd), from Proposition 8.8 in Folland (1999), we know that f P,h is uniformly continuous and bounded. Specifically, when Q is the empirical measure Dn = 1 n Pn i=1 δxi, the kernel density estimator for dynamical systems in this study can be expressed as f Dn,h(x) = Kh d Dn(x) = n 1h d n X i=1 K ( x xi /h) . (8) From now on, for notational simplicity, we will suppress the subscript n of Dn and denote D := Dn, e.g., f D,h := f Dn,h. 3. Main Results and Statements In this section, we present main results on the consistency and convergence rates of f D,h to the true density f under L1-norm and also L -norm for some special cases. We also present some comments and discussions on the obtained main results. Recall that f D,h is a nonparametric density estimator and so the criterion that measures its goodness-of-fit matters, which, for instance, includes L1-distance, L2-distance, and L -distance. In the literature of kernel density estimation, probably the most frequently Hang, Steinwart, Feng, Suykens employed criterion is the L2-distance of the difference between f D,h and f, since it entails an exact bias-variance decomposition and can be analyzed relatively easily by using Taylor expansion involved arguments. However, it is argued in Devroye and Gy orfi(1985) (see also Devroye and Lugosi (2001)) that L1-distance could be a more reasonable choice since: it is invariant under monotone transformations; it is always well-defined as a metric on the space of density functions; it is also proportional to the total variation metric and so leads to better visualization of the closeness to the true density function than L2-distance. As to the L -distance, it measures the worst-case goodness-of-fit of the estimator. 3.1 Results on Consistency We first present results on the consistency property of the kernel density estimator f D,h in the sense of L1-norm. A kernel density estimator f D,h is said to be consistent in the sense of L1-norm if f D,h converges to f almost surely under L1-norm. Theorem 7 Let K be a d-dimensional smoothing kernel that satisfies Conditions (i) and (iv) in Assumption B. Let X := (Xn)n 1 be an X-valued stationary geometrically (timereversed) C-mixing process on (Ω, A, µ) with C being defined for some semi-norm ||| ||| that satisfies Assumption A. If hn 0 and nhd n (log n)(2+γ)/γ , as n , then the kernel density estimator f D,hn is universally consistent in the sense of L1-norm. The consistency result in Theorem 7 is independent of the probability distribution P and is therefore of the universal type. In particular, these results apply to the examples provided in Section 2.2, i.e., β-map, Logistic map, and Gauss map. 3.2 Results on Convergence Rates under L1-Norm We now show that if certain tail assumptions on P are imposed, convergence rates can be obtained under L1-norm. Here, we consider three different situations, namely, the tail of the probability distribution P has a polynomial decay, exponential decay and disappears, respectively. Theorem 8 Let K be a d-dimensional smoothing kernel that satisfies Assumption B. Assume that the density f is α-H older continuous with α β. Let X := (Xn)n 1 be an X-valued stationary geometrically (time-reversed) C-mixing process on (Ω, A, µ) with C being defined for some semi-norm ||| ||| that satisfies Assumption A. We consider the following cases: (i) P Bc r r ηd for some η > 0 and for all r 1; (ii) P Bc r e arη for some a > 0, η > 0 and for all r 1; (iii) P Bc r0 = 0 for some r0 1. For the above cases, if n n0 with n0 given in the proof, and the sequences hn are of the following forms: Kernel Density Estimation for Dynamical Systems (i) hn = (log n)(2+γ)/γ n 1+η (1+η)(2α+d) α ; (ii) hn = (log n)(2+γ)/γ n 1 2α+d (log n) d (iii) hn = (log n)(2+γ)/γ/n 1 2α+d ; then with probability µ at least 1 1 n, there holds f D,hn f 1 εn, where the convergence rates (i) εn (log n)(2+γ)/γ n αη (1+η)(2α+d) α ; (ii) εn (log n)(2+γ)/γ n α 2α+d (log n) d γ α+d (iii) εn (log n)(2+γ)/γ/n α 2α+d . Notice that the above Theorem 8 holds only when the underlying density function f is α-H older continuous. However, as shown in Examples 2 and 3, instead of being α-H older continuous, density functions of commonly used dynamical systems often only satisfy a weaker continuity condition such as the pointwise α-H older controllable condition. Therefore, for this case, we are encouraged to establish convergence rates under certain tail condition of the probability distribution. Unfortunately, as will be shown in Proposition 11 later, in general, we are not able to give explicit expressions of the convergence rates as in Theorem 8. Nevertheless, for certain dynamical systems mentioned in this paper, we are still able to derive convergence rates explicitly as follows: Example 4 Consider the Gauss map from Example 3 with the resulting density f(x) = 1 log 2 1 1 + x, x (0, 1) and functions r( ) and c( ) as specified in Example 3. Then, for Ω:= (0, 1), if we pick a smoothing kernel with K(r) = 0 for all r > 1 and hn = (log n)(2+γ)/γ/n 1/(2+d), then we obtain the convergence rate (log n)(2+γ)/γ/n 1 2+d . Example 5 Consider the logistic map of Example 2 with resulting density x(1 x) , x (0, 1), an α (0, 1/2) and the corresponding functions r( ) and c( ) specified in Example 2. As in Example 4, we choose Ω:= (0, 1), and pick a smoothing kernel with K(r) = 0 for all r > 1, then we obtain approximately the convergence rate (log n)(2+γ)/γ/n 1 2+2d . Hang, Steinwart, Feng, Suykens Both of the above-mentioned two densities are not continuous at the end points 0 and 1, but f P,h turns out to be continuous everywhere. Therefore, uniform approximation is not be possible. However, as shown in the above examples, the uniform approximation can be achieved if we remove both of the neighbourhood around the critical points 0 and 1. This phenomenon can be apparently observed from Figures 1 and 2 in Section 5. 3.3 Results on Convergence Rates under L -Norm We now state our main results on the convergence of f D,h to f under L -norm. Theorem 9 Let X := (Xn)n 1 be an X-valued stationary geometrically (time-reversed) Cmixing process on (Ω, A, µ) with C being defined for some semi-norm ||| ||| that satisfies Assumption A. (i) Let K be a d-dimensional smoothing kernel that satisfies Conditions (i) and (iv) in Assumption B. Assume that there exists a constant r0 1 such that Ω Br0 Rd and the density function f is α-H older continuous with α β and f < . (ii) Let K be a d-dimensional smoothing kernel that satisfies Conditions (iii) and (iv) in Assumption B, and f is pointwise α-H older controllable. Fix an Ω Rd with {x Rd : f(x) > 0 and r(x) exists} Ωand define Ω+h R := {x Rd : infx Ω x x h R}. Moreover, we define X h := {x Rd : r(x) > h R} and assume that function x 7 c(x) is bounded on X h Ω+h R. Then, for both cases (i) and (ii), all n n 0 with n 0 that will be given in the proof, by choosing hn = (log n)(2+γ)/γ/n 1 2α+d , with probability µ at least 1 1 n, there holds f D,hn f (log n)(2+γ)/γ/n α 2α+d . (9) In Theorems 8 and 9, one needs to ensure that n n0 with n0 and n n 0 being specified later. One may also note that due to the involvement of the term ϕ(hn), the numbers n0 and n 0 depend on the hn. However, recalling that for the Triangle kernel, the Epanechnikov kernel, and the Gaussian kernel, we have ϕ(hn) O(h 1 n ), which, together with the choices of hn in Theorems 8 and 9, implies that n0 and n 0 are well-defined. It should be also remarked that in the scenario where the density function f is compactly supported and bounded, the convergence rate of f D,h to f is not only obtainable, but also the same with that derived under L1-norm. This is indeed an interesting observation since convergence under L -norm implies convergence under L1-norm. 3.4 Comments and Discussions This section presents some comments on the obtained theoretical results on the consistency and convergence rates of f D,h and compares them with related findings in the literature. Kernel Density Estimation for Dynamical Systems We highlight that in our analysis the density function f is only assumed to be H older continuous. As pointed out in the introduction, in the context of dynamical systems, this seems to be more than a reasonable assumption. On the other hand, the consistency and the convergence results obtained in our study, are of type with high probability due to the use of the Bernstein-type exponential inequality that takes into account the variance information of the random variables. From our analysis and the obtained theoretical results, one can also easily observe the influence of the dependence among observations. For instance, from Theorem 7 we see that with increasing dependence among observations (corresponding to smaller γ), in order to ensure the universal consistency of f D,hn, the decay of hn (with respect to n 1) is required to be faster. This is in fact also the case if we look at results on the convergence rates in Theorems 8 and 9. Moreover, the influence of the dependence among observations is also indicated there. That is, an increase of the dependence among observations may slow down the convergence of f D,h in the sense of both L1-norm and L -norm. It is also interesting to note that when γ tends to infinity, which corresponds to the case where observations can be roughly treated as independent ones, meaningful convergence rates can be also deduced. It turns out that, up to a logarithmic factor, the established convergence rates (9) under L -norm, namely, O(((log n)(2+γ)/γ/n)α/(2α+d)), match the optimal rates in the i.i.d. case, see, e.g., Khas minskii (1979) and Stone (1983). As mentioned in the introduction, there exist several studies in the literature that address the kernel density estimation problem for dynamical systems. For example, Bosq and Gu egan (1995) conducted some first studies and showed the point-wise consistency and the convergence (in expectation) of the kernel density estimator. The convergence rates obtained in their study are of the type O(n 4/(4+2d)), which are conducted in terms of the variance of f D,h. The notion they used for measuring the dependence among observations is α-mixing coefficient (see A3 in Bosq and Gu egan (1995)). Considering the density estimation problem for one-dimensional dynamical systems, Prieur (2001) presented some studies on the kernel density estimator f D,h by developing a central limit theorem and apply it to bound the variance of the estimator. Further some studies on the kernel density estimation of the invariant Lebesgue density for dynamical systems were conducted in Blanke et al. (2003). By considering both dynamical noise and observational noise, point-wise convergence of the estimator f D,h in expectation was established, i.e., the convergence of Ef D,h(x) f(x) for any x Rd. Note further that these results rely on the second-order smoothness and boundedness of f. Therefore, the second-order smoothness assumption on the density function together with the point-wise convergence in expectation makes it different from our work. In particular, under the additional assumption on the tail of the noise distribution, the convergence of E(f D,h(x) f(x))2 for any fixed x Rd is of the order O(n 2/(2+βd)) with β 1. Concerning the convergence of f D,h in a dynamical system setup, Maume-Deschamps (2006) also presented some interesting studies which in some sense also motivated our work here. By using also the C-mixing concept as adopted in our study to measure the dependence among observations from dynamical systems, she presented the point-wise convergence of f D,h with the help of Hoeffding-type exponential inequality (see Proposition 3.1 in Maume-Deschamps (2006)). The assumption applied on f is that it is bounded from below and also α-H older continuous (more precisely, f is assumed to be α-regular, see Assumption 2.3 in Maume-Deschamps (2006)). Hence, from the above dis- Hang, Steinwart, Feng, Suykens cussions, we suggest that the work we present in this study is essentially different from that in Maume-Deschamps (2006). 4. Error Analysis We conduct error analysis for the kernel density estimator f D,h in this section by establishing its consistency and convergence rates, which are stated in the above section in terms of the L1-distance and L -distance. The downside of using L1-distance is that it does not admit an exact bias-variance decomposition and the usual Taylor expansion involved techniques for error estimation may not apply directly. Nonetheless, if we introduce the intermediate estimator f P,h in (7), obviously the following inequality holds f D,h f 1 f D,h f P,h 1 + f P,h f 1. (10) The consistency and convergence analysis in our study will be mainly conducted in the L1 sense with the help of inequality (10). Besides, for some specific case, i.e., when the density f is compactly supported, we are also concerned with the consistency and convergence of f D,h to f under L -norm. In this case, there also holds the following inequality f D,h f f D,h f P,h + f P,h f . (11) It is easy to see that the first error term on the right-hand side of (10) or (11) is stochastic due to the empirical measure D while the second one is deterministic because of its sampling-free nature. Loosely speaking, the first error term corresponds to the variance of the estimator f D,h, while the second one can be treated as its bias although (10) or (11) is not an exact error decomposition. In our study, we proceed with the consistency and convergence analysis on f D,h by bounding the two error terms, respectively. 4.1 Bounding the Deterministic Error Term Our first theoretical result on bounding the deterministic error term shows that, given a d-dimensional kernel K, the L1-distance between its K-smooth of the measure P, i.e., f P,h, and f can be arbitrarily small by choosing the bandwidth appropriately. Moreover, under mild assumptions on the regularity of f and K, the L -distance between the two quantities possesses a polynomial decay with respect to the bandwidth h. Proposition 10 Let K be a d-dimensional smoothing kernel. (i) For any ε > 0, there exists 0 < hε 1 such that for any h (0, hε] we have f P,h f 1 ε. (ii) If K satisfies Condition (ii) in Assumption B and f is α-H older continuous with α υ, then there exists a constant c > 0 such that for all h > 0 we have f P,h f chα. Kernel Density Estimation for Dynamical Systems (iii) If K satisfies Condition (ii) in Assumption B and f is uniformly pointwise α-H older controllable with α υ, then there exists a constant c > 0 such that for all h > 0 and λd-almost all x Rd we have |f P,h(x) f(x)| c c(x)hα + cf(x)hυ + Z Bc r0/(2h) K( x ) f(x + hx ) dx . (iv) Assume that K satisfies Condition (iii) in Assumption B and f is pointwise α-H older controllable. We fix an Ω Rd such that {x Rd : f(x) > 0 and r(x) exists } Ω, and we define Ω+h R := {x Rd : infx Ω x x h R}. For λd-almost all x Ω+h R we then have |f P,h(x) f(x)| = 0. Moreover, there exists a constant c > 0 such that for all x Ω+h R for which r(x) exists we have |f P,h(x) f(x)| c c(x)hα + c K f(x) R r(x) r(x) x h R f(x + x ) dx . We now show that the L1-distance between f P,h and f can be upper bounded by their difference (in the sense of L -distance) on a compact domain of Rd together with their difference (in the sense of L1-distance) outside this domain. As we shall see later, this observation will entail us to consider different classes of the true density f. The following result is crucial in our subsequent analysis on the consistency and convergence rates of f D,h. Proposition 11 Assume that K is a d-dimensional smoothing kernel that satisfies Condition (ii) in Assumption B. (i) There exists a constant c1 > 0 such that for all h 1 and r 1, we have f P,h f 1 c1rd f P,h f + c1P(Bc r/2) + c1(h/r)υ. (ii) Assume that f is uniformly pointwise α-H older controllable with α υ. For r > 0 we define Br c(x) dx. Then there exists a constant c2 > 0 such that for all h > 0 and r 1 we have f P,h f 1 c2hαH(r) + c2P(Bc r/2) + c2hυ . (iii) Assume that K satisfies Condition (iii) in Assumption B and f is pointwise α-H older controllable. Furthermore, we pick an Ωas in Proposition 10. Then there exists a constant c3 > 0 such that for all h > 0 and r > 0 we have f P,h f 1 c3 H( ) hα + c3P {x : r(x) h R} r(x) x h R f(x + x ) dx dx . Hang, Steinwart, Feng, Suykens (iv) Let K, f, and Ωsatisfy the conditions in (iii). We define X h := {x Rd : r(x) > h R} and assume that function x 7 c(x) is bounded on X h Ωh R, then there is another constant c4 > 0 such that for all h > 0 we have sup x X h |f P,h(x) f(x)| c4 sup x X h Ωh R |c(x)| hα . 4.2 Bounding the Stochastic Error Term We now proceed with the estimation of the stochastic error term f D,h f P,h 1 by establishing probabilistic oracle inequalities. For the sake of readability, let us start with an overview of the analysis conducted in this subsection for bounding the stochastic error term. 4.2.1 An Overview of the Analysis In this study, the stochastic error term is tackled by using capacity-involved arguments and the Bernstein-type inequality established in Hang and Steinwart (2017). In the sequel, for any fixed x Ω Rd, we write kx,h := h d K( x /h), (12) and we further denote the centered random variable ekx,h on Ωas ekx,h := kx,h EP kx,h. (13) It thus follows that EDekx,h = EDkx,h EP kx,h = f D,h(x) f P,h(x), and consequently we have f D,h f P,h 1 = Z Rd |EDekx,h| dx, f D,h f P,h = sup x Ω |EDekx,h|. As a result, in order to bound f D,h f P,h 1, it suffices to bound the supremum of the empirical process EDekx,h indexed by x Rd. For any r > 0, there holds f D,h f P,h 1 = Z Br |EDekx,h| dx + Z Bcr |EDekx,h| dx. The second term of the right-hand side of the above equality can be similarly dealt with as in the proof of Proposition 11. In order to bound the first term, we define e Kh,r as the function set of ekx,h that corresponds to x which lies on a radius-r ball of Rd: e Kh,r := ekx,h : x Br L (Rd). The idea here is to apply capacity-involved arguments and the Bernstein-type exponential inequality in Hang and Steinwart (2017) to the function set e Kh,r and the associated empirical process EDekx,h. The difference between f D,h and f P,h under the L -norm can be bounded analogously. Therefore, to further our analysis, we first need to bound the capacity of e Kh,r in terms of covering numbers. Kernel Density Estimation for Dynamical Systems 4.2.2 Bounding the Capacity of the Function Set e Kh,r Definition 12 (Covering Number) Let (X, d) be a metric space and A X. For ε > 0, the ε-covering number of A is denoted as N(A, d, ε) := min n 1 : x1, , xn X such that A i=1 Bd(xi, ε) where Bd(x, ε) := {x X : d(x, x ) ε}. For any fixed r 1, we consider the function set Kh,r := {kx,h : x Br} L (Rd). The following proposition provides an estimate of the covering number of Kh,r. Proposition 13 Let K be a d-dimensional smoothing kernel that satisfies Condition (i) in Assumption B and h (0, 1]. Then there exists a positive constant c such that for all ε (0, 1], we have N(Kh,r, , ε) c rdh d d2 4.2.3 Oracle Inequalities under L1-Norm, and L -Norm We now establish oracle inequalities for the kernel density estimator (8) under L1-norm, and L -norm, respectively. These oracle inequalities will be crucial in establishing the consistency and convergence results of the estimator. Recall that the considered kernel density estimation problem is based on samples from an X-valued C-mixing process which is associated with an underlying function class C(X). As shown below, the established oracle inequality holds without further restrictions on the support of the density function. Proposition 14 Suppose that Assumption B holds. Let X := (Xn)n 1 be an X-valued stationary geometrically (time-reversed) C-mixing process on (Ω, A, µ) with C being defined for some semi-norm ||| ||| that satisfies Assumption A. Then for all 0 < h 1, r 1 and τ 1, there exists an n0 N such that for all n n0, with probability µ at least 1 3e τ, there holds f D,h f P,h 1 (log n)2/γrd(τ + log nr h ) hdn + (log n)2/γrd(τ + log nr + P(Bc r/4) + 32τ(log n)2/γ Here n0 will be given explicitly in the proof. Our next result shows that when the density function f is compactly supported and bounded, an oracle inequality under L -norm can be also derived. Hang, Steinwart, Feng, Suykens Proposition 15 Let K be a d-dimensional kernel function that satisfies Conditions (i) and (iii) in Assumption B. Let X := (Xn)n 1 be an X-valued stationary geometrically (time-reversed) C-mixing process on (Ω, A, µ) with C being defined for some semi-norm ||| ||| that satisfies Assumption A. Assume that there exists a constant r0 1 such that Ω Br0 Rd and the density function f satisfies f < . Then for all 0 < h 1 and τ > 0, there exists an n 0 N such that for all n n 0, with probability µ at least 1 e τ, there holds f D,h f P,h f (τ + log(nr0 h ))(log n)2/γ hdn + K(0)(τ + log(nr0 h ))(log n)2/γ Here n 0 will be given explicitly in the proof. In Proposition 15, the kernel K is only required to satisfy Conditions (i) and (iii) in Assumption B whereas the condition that R 0 K(r)rβ+d 1 dr < for some β > 0 is not needed. This is again due to the compact support assumption of the density function f as stated in Proposition 15. 5. Bandwidth Selection and Simulation Studies This section discusses the model selection problem of the kernel density estimator (8) by performing numerical simulation studies. In the context of kernel density estimation, model selection is mainly referred to the choice of the smoothing kernel K and the selection of the kernel bandwidth h, which are of crucial importance for the practical implementation of the data-driven density estimator. According to our experimental experience and also the empirical observations reported in Maume-Deschamps (2006), it seems that the choice of the kernel or the noise does not have a significant influence on the performance of the estimator. Therefore, our emphasis will be placed on the bandwidth selection problem in our simulation studies. 5.1 Several Bandwidth Selectors In the literature of kernel density estimation, various bandwidth selectors have been proposed, several typical examples of which have been alluded to in the introduction. When turning to the case with dependent observations, the bandwidth selection problem has been also drawing much attention, see e.g., Hart and Vieu (1990); Chu and Marron (1991); Hall et al. (1995); Yao and Tong (1998). Among existing bandwidth selectors, probably the most frequently employed ones are based on the cross-validation ideas. For cross-validation bandwidth selectors, one tries to minimize the integrated squared error (ISE) of the empirical estimator f D,h where ISE(h) := Z (f D,h f)2 = Z f2 D,h 2 Z f D,h Z f + Z f2. Note that on the right-hand side of the above equality, the last term R f2 is independent of h and so the minimization of ISE(h) is equivalent to minimize Z f2 D,h 2 Z f D,h Z f. Kernel Density Estimation for Dynamical Systems It is shown that with i.i.d observations, an unbiased estimator of the above quantity, which is termed as least squares cross-validation (LSCV), is given as follows: LSCV(h) := Z f2 D,h 2 i=1 ˆf i,h(xi), (14) where the leave-one-out density estimator ˆf i,h is defined as ˆf i,h(x) := 1 n 1 j =i Kh(x xj). When the observations are dependent, it is shown that cross-validation can produce much under-smoothed estimates, see e.g., Hart and Wehrly (1986); Hart and Vieu (1990). Observing this, Hart and Vieu (1990) proposed the modified least squares cross-validation (MLSCV), which is defined as follows MLSCV(h) := Z f2 D,h 2 i=1 ˆf i,h,ln(xi), (15) where ln is set to 1 or 2 as suggested in Hart and Vieu (1990) and ˆf i,h,ln(x) := 1 #{j : |j i| > ln} |j i|>ln Kh(x xj). The underlying intuition of proposing MLSCV is that when estimating the density of a fixed point, ignoring observations in the vicinity of this point may be help in reducing the influence of dependence among observations. However, when turning to the L1 point of view, the above bandwidth selectors may not work well due to the use of the least squares criterion. Alternatively, Devroye (1989) proposed the double kernel bandwidth selector that minimizes the following quantity DKM(h) := Z |f D,h,K f D,h,L|, (16) where f D,h,K and f D,h,L are kernel density estimators based on the kernels K and L, respectively. Some rigorous theoretical treatments on the effectiveness of the above bandwidth selector were made in Devroye (1989). Our purpose in simulation studies is to conduct empirical comparisons among the above bandwidth selectors in the dynamical system context instead of proposing new approaches. 5.2 Experimental Setup In our experiments, observations x1, , xn are generated from the following model1 ( xi = T i(x0), xi = xi + εi, i = 1, , n, (17) 1. Note that here the observational noise is assumed for the considered dynamical system (17), which differs from (2) and can be a more realistic setup from an empirical and experimental viewpoint. In fact, it is observed also in Maume-Deschamps (2006) that the influence of low SNR noise is not obvious in density estimation. We therefore adopt this setup in our experiments. All the observations reported in this experimental section apply to the noiseless case (2). Hang, Steinwart, Feng, Suykens where εi N(0, σ2), σ is set to 0.01 and the initial state x0 is randomly generated based on the density f. For the map T in (17), we choose Logistic map in Example 2 and Gauss map in Example 3. We vary the sample size among {5 102, 103, 5 103, 104}, implement bandwidth selection procedures over 20 replications and select the bandwidth from a grid of values in the interval [h L, h U] with 100 equispaced points. Here, h L is set as the minimum distance between consecutive points xi, i = 1, , n (Devroye and Lugosi, 1997), while h U is chosen according to the maximal smoothing principle proposed in Terrell (1990). Throughout our experiments, we use the Gaussian kernel for the kernel density estimators. In our experiments, we conduct comparisons among the above-mentioned bandwidth selectors which are, respectively, denoted as follows: LSCV: the least squares cross-validation given in (14); MLSCV-1: the modified least squares cross-validation in (15) with ln = 1; MLSCV-2: the modified least squares cross-validation in (15) with ln = 2; DKM: the double kernel method defined in (16) where the two kernels used here are the Epanechnikov kernel and the Triangle kernel, respectively. In the experiments, due to the known density functions for Logistic map and Gauss map, and in accordance with our previous analysis from the L1 point of view, the criterion of comparing different selected bandwidths is the following absolute mean error (AME): i=1 |f D,h(ui) f(ui)|, where u1, , um are m equispaced points in the interval [0, 1] and m is set to 10000. We also compare the selected bandwidth with the one that has the minimum absolute mean error which serves as a baseline method in our experiments. 5.3 Simulation Results and Observations The AMEs of the above bandwidth selectors for Logistic map in Example 2 and Gauss map in Example 3 over 20 replications are averaged and recorded in Tables 1 and 2 below. In Figs. 1 and 2, we also plot the kernel density estimators for Logistic map in Example 2 and Gauss map in Example 3 with different bandwidths and their true density functions with different sample sizes. The sample size of each panel, in Figs. 1 and 2, from up to bottom, is 103, 104, and 105, respectively. In each panel, the densely dashed black curve represents the true density, the dotted blue curve is the estimated density function with the bandwidth selected by the baseline method while the solid red curve stands for the estimated density with the bandwidth selected by the double kernel method. All density functions in Figs. 1 and 2 are plotted with 100 equispaced points in the interval (0, 1). From Tables 1 and 2, and Figs. 1 and 2, we see that the true density functions of Logistic map and Gauss map can be approximated well with enough observations and the double kernel method works slightly better than the other three methods for the two dynamical systems. In fact, according to our experimental experience, we find that the bandwidth selector of the kernel density estimator for a dynamical system is usually ad-hoc. That Kernel Density Estimation for Dynamical Systems 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 Figure 1: Plots of the kernel density estimators f D,h for Logistic map in Example 2 with different bandwidths and its true density with different sample sizes. The sample size of each panel, from up to bottom, is 103, 104, and 105, respectively. In each panel, the dashed black curve represents the true density of Logistic map, the dotted blue curve is the estimated density of Logistic map with the bandwidth selected by the baseline method while the solid red curve stands for the estimated density of Logistic map with the bandwidth selected by the double kernel method. All curves are plotted with 100 equispaced points in the interval (0, 1). Hang, Steinwart, Feng, Suykens 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 2: Plots of the kernel density estimators f D,h for Gauss map in Example 3 with different bandwidths and its true density with different sample sizes. The sample size of each panel, from up to bottom, is 103, 104, and 105, respectively. In each panel, the dashed black curve represents the true density of Gauss map, the dotted blue curve is the estimated density of Gauss map with the bandwidth selected by the baseline method while the solid red curve stands for the estimated density of Gauss map with the bandwidth selected by the double kernel method. All curves are plotted with 100 equispaced points in the interval (0, 1). Kernel Density Estimation for Dynamical Systems Table 1: The AMEs of Different Bandwidth Selectors for Logistic Map in Example 2 sample size LSCV MLSCV-1 MLSCV-2 DKM Baseline 5 102 .3372 .3369 .3372 .3117 .3013 1 103 .2994 .2994 .2994 .2804 .2770 5 103 .2422 .2422 .2422 .2340 .2326 1 104 .2235 .2235 .2235 .2220 .2192 Table 2: The AMEs of Different Bandwidth Selectors for Gauss Map in Example 3 sample size LSCV MLSCV-1 MLSCV-2 DKM Baseline 5 102 .1027 .1026 .1059 .1181 .0941 1 103 .0925 .0933 .0926 .0925 .0878 5 103 .0626 .0626 .0626 .0586 .0585 1 104 .0454 .0454 .0454 .0440 .0439 is, for existing bandwidth selectors, there seems no a universal optimal one that can be applicable to all dynamical systems and outperforms the others. Therefore, further exploration and insights on the bandwidth selection problem in the dynamical system context certainly deserve future study. On the other hand, we also notice that due to the presence of dependence among observations generated by dynamical systems, the sample size usually needs to be large enough to approximate the density function well. This can be also seen from the plotted density functions in Figs. 1 and 2 with varying sample sizes. Aside from the above observations, not surprisingly, from Figs. 1 and 2, we also observe the boundary effect (Gasser et al., 1985) from the kernel density estimators for dynamical systems, which seems to be even more significant than the i.i.d case. From a practical implementation view, some special studies are arguably called for addressing this problem. The following lemma is needed for the proof of Example 2. Lemma 16 For x, h 0 with x 2h and p (0, 1) we have x h 2p 1x1 php . Proof [of Lemma 16] For c := 2p 1, x 0, and h 0 we define gh(x) := cx1 php x + h, so that our goal is to show gh(x) 0 for all x 2h. To this end, we first note that gh(2h) = c(2h)1 php h = c21 ph1 php h = 0 , that is, the assertion is true for x = 2h. To show the inequality for x > 2h, we note that g h(x) = (1 p)cx php 1. A simple calculation then shows that g h has its only zero at x := ((1 p)c)1/ph. Moreover, since g h(x) = (1 p)( p)cx 1 php < 0 we see that gh has a global maximum at x and no (local) minimum. Consequently, gh is decreasing on [x , ), Hang, Steinwart, Feng, Suykens so that it suffices to show that x 2h. The latter, however is equivalent to (1 p)c 2p, and by the definition of c, this inequality is satisfied if and only if (1 p)/2 1. Since the latter is obviously true, we obtain the assertion. Proof [of Example 2] Here we only need to show that the density f is α-controllable with the specified functions. Moreover, this is obvious for x < 0 and x > 1. Moreover, the first and second derivatives of f on (1/8, 7/8) are f (x) = 1 2x 2π(x(1 x))3/2 and f (x) = 8x2 8x + 3 4π(x(1 x))5/2 . Clearly, this gives f (x) 1 for all x (1/8, 7/8) and hence f is increasing on this interval. Using the symmetry of f around x = 1/2 we then obtain sup x (1/8,7/8 |f (x)| = |f (1/8)| = 3 Consequently, the restriction f|[1/8,7/8] is Lipschitz continuous with constant not exceeding 4. This shows the assertion for x [1/4, 3/4]. Therefore, it remains to consider the cases 0 < x < 1/4 and 3/4 < x < 1, and due to the symmetry of f we may actually restrict our considerations to 0 < x < 1/4. Let us therefore fix an x (0, 1/4). For h with |h| < r(x), that is h ( x/2, x/2), we then find π|f(x + h) f(x)| = (x + h)(1 x h) p x(1 x)(1 x h)(x + h) . (18) Moreover, we have 1 x > 3/4 and 1 x h > 5/8 and hence we find x(x + h)(1 x h)(1 x) x(x + h) . (19) Our next goal is to bound the numerator in (18). To this end, we define g(x, h) := p (x + h)(1 x h) , h, x [0, 1/4] . Note that the function t 7 p t(1 t) is increasing on [0, 1/2] and hence we have g(x, h) 0 if h [ 1/4, 0] and g(x, h) 0 if h [0, 1/4]. Now, our goal is to establish g(x, h) |h|1/2 , x [0, 1/4], h ( x/2, 1/4] (20) In the case h [0, 1/4] our preliminary considerations on g show that (20) is equivalent to (x + h)(1 x h) h + x(1 x) + 2 p Moreover, simple algebraic transformations show that the latter is equivalent to 2xh h2 2 p hx(1 x), which is obviously true. This shows (20) for h [0, 1/4]. Now, in the case h ( x/2, 0] we first observe that we have g(x, h) = g(x + h, h) . Kernel Density Estimation for Dynamical Systems Let us write x := x + h and h := h. Then we have h [0, x/2) [0, 1/4] and x (x/2, x] (0, 1/4]. Using (20) in the already proven case h [0, 1/4] we then find g(x, h) = g( x, h) | h|1/2 = |h|1/2 , which finishes the proof of (20). Now combining (19) with (20) we find π |f(x + h) f(x)| Let us first consider the case h ( x/2, 0]. Then Lemma 16 applied to p := 2ε and h := h gives 1 x + h = 1 p x h 2(1 p)/2x(p 1)/2 h p/2 = 21/2 εx 1/2+ε|h| ε Inserting this inequality in the previous inequality thus shows |f(x + h) f(x)| 1 64 15 x 1+ε |h|1/2 ε x 1+ε |h|1/2 ε . (21) Moreover, for h [0, x/2) we have x+h x h, and hence the previous considerations actually give (21) for all h ( x/2, x/2). Since for ε := 1/2 α we have both 1+ε = 1/2 α and 1/2 ε = α, we then obtain the assertion. The following lemma, which will be used several times in the sequel, supplies the key to the proof of Propositions 11 and 10. Lemma 17 Assume that K is a d-dimensional smoothing kernel that satisfies Condition (ii) of Assumption B. Then there exists a constant c1 > 0 such that for all r > 0 we have Z Bcr K( x ) dx c1r υ . (22) Moreover, for all probability measures Q on Rd and all h > 0 and r > 0 the functions kx,h, which are defined in (12) for all x Rd, satisfy Z Bcr EQkx,h dx κ Q(Bc r/2) + c12υ (h/r)υ . (23) Proof [of Lemma 17] For the proof of (22) we fix a constant c 1 such that we have c 1 ℓd 2 c ℓd 2. Since K is monotonically decreasing, we then find Z Bcr K( x ) dx Z Bcr K( c 1 x ℓd 2) dx Z c 1r K( c 1t)td 1 dt c 2 r K(t)td 1 dt c 2 r K(t)r υtd+υ 1 dt cd+2υ κυ r υ , Hang, Steinwart, Feng, Suykens where in the last step we used Condition (ii) of Assumption B. For the proof of (23) we first observe that for t0 > 0, we have Z Bcr EQkx,h dx = Z Rd h d K x x /h d Q(x ) dx Rd K( x )1Bcr(hx + x ) dx d Q(x ) Rd K( x ) Z Rd 1Bcr(hx + x ) d Q(x ) dx Bt0 K( x ) Z Rd 1Bcr(hx + x ) d Q(x )dx + Z Bc t0 K( x ) dx. Moreover, it is easy to see that 1Bcr(hx + x ) = 1 if and only if hx + x r. Now we set t0 := r 2h. In this case, if we additionally have x Bt0, then x r h x r ht0 = r/2. Using (22) we thus find Z Bcr EQkx,h dx Z Bt0 K( x )Q(Bc r/2) dx + Z Bc t0 K( x ) dx κ Q(Bc r/2) + c1 t υ 0 . (24) By t0 = r 2h we then find the assertion. The following technical lemma is needed in the proof of Proposition 10. Lemma 18 For all 0 a b and d 1 we have bd ad d bd 1 (b a) . (25) Proof [of Lemma 18] In the case b = 0 there is nothing to prove. Moreover, in the case b > 0, we first observe by deviding by bd that (25) is equivalent to Consequently, it suffices to show 1 td d (1 t) , t [0, 1] . (26) To this end, we define h(t) = d (1 t) 1 + td for t [0, 1]. This gives h (t) = d + dtd 1 and a simple check then shows h (t) 0 for all t [0, 1]. Moreover, we have h(1) = 0 and combining both we thus find h(t) 0 for all t [0, 1]. This shows (26). Proof [of Proposition 10] (i). Since the space of continuous and compactly supported functions Cc(Rd) is dense in L1(Rd), we can find f Cc(Rd) such that f f 1 ε/3, ε > 0. Kernel Density Estimation for Dynamical Systems Therefore, for any ε > 0, we have f P,h f 1 = Z Rd |f Kh f| dx Rd |f Kh f Kh| dx + Z Rd | f Kh f| dx + Z Rd |f f| dx Rd | f Kh f| dx, where Kh is defined in (6) and the last inequality follows from the fact that f Kh f Kh 1 f f 1 ε/3. The above inequality is due to Young s inequality (8.7) in Folland (1999). Moreover, there exist a constant M > 0 such that supp( f) BM and a constant r > 0 such that Z Bcr K( x ) dx ε 9 f 1 . Now we define L : Rd [0, ) by L(x) := 1[ r,r]( x )K( x ) and Lh : Rd [0, ) by Lh(x) := h d L(x/h). Then we have Z Rd | f Kh f| dx Z Rd | f Kh f Lh| dx + Z Rd | f Lh f| dx f 1 Kh Lh 1 + Z Rd Lh dx dx Rd(Lh Kh) dx dx 2 f 1 Kh Lh 1 + Z Rd Lh dx dx. Moreover, we have Kh Lh 1 = Z 1[ r,r]( x )K( x ) K( x ) dx Bcr K( x ) dx ε 9 f 1 . Hang, Steinwart, Feng, Suykens Finally, for h 1, we have Rd Lh dx dx = Z f(x x ) f(x) Lh(x ) dx dx Rd | f(x x ) f(x)|Lh(x ) dx dx. Since f is uniformly continuous, there exists a constant hε > 0 such that for all h hε and x rh, we have | f(x x ) f(x)| ε := ε 9(r + M)dλd(B1). Consequently we obtain Rd | f(x x ) f(x)|Lh(x ) dx ε Z Brh Lh(x ) dx ε Z Rd Kh dx = ε . Therefore, we obtain Rd Lh dx dx Z Br+M ε dx = ε and consequently the assertion can be proved by combining estimates in (27) and (28). (ii). The α-H older continuity of f tells us that for any x Rd, there holds |f P,h(x) f(x)| = 1 hd f(x ) dx f(x) Rd K( x )f(x + hx ) dx f(x) Rd K( x ) f(x + hx ) f(x) dx Rd K( x ) h x α dx Rd K x ℓd 2 hα x α ℓd 2 dx 0 K(r)rα+d 1 dr where c1, c2, c3 > 0 are suitable constants. (iii). For fixed h > 0 we define r := r0 2h. A quick calculation then yields hr = r0/2 < r0. Following the proof of (ii) and using that f is uniformly pointwise α-H older controllable, Kernel Density Estimation for Dynamical Systems we thus find for λd-almost all x Rd: |f P,h(x) f(x)| = Rd K( x ) f(x + hx ) f(x) dx Br K( x ) f(x + hx ) f(x) dx + Z Bcr K( x ) f(x + hx ) f(x) dx Br K( x ) h x α dx + f(x) Z Bcr K( x ) dx Bcr K( x ) f(x + hx ) dx c2 c(x)hα + c1f(x)r υ + Z Bcr K( x ) f(x + hx ) dx where the first term is estimated similarly to the first part of the proof of (22), and the second term is directly estimated by (22). The definition of r then yields the result. (iv). Let us fix an h > 0. Following the proof of (ii) and using that f is pointwise α-H older controllable, we then obtain for x Rd with r(x) > h R: |f P,h(x) f(x)| = Rd K( x ) f(x + hx ) f(x) dx BR K( x ) f(x + hx ) f(x) dx (29) BR K( x ) x α dx c c(x) hα , (30) where c is a suitable constant bounding the integral in the second to last line. In particular, this shows the assertion for all x Ω+h R for which r(x) exists and satisfies r(x) > h R. Let us now consider the case x Ω+h R with r(x) h R. Here, (29) and a repetition of the estimates around (30) give |f P,h(x) f(x)| Z BR K( x ) f(x + hx ) f(x) dx h K( x ) f(x + hx ) f(x) dx h x R K( x ) f(x + hx ) f(x) dx c c(x) hα + Z h x R K( x ) f(x + hx ) f(x) dx . Hang, Steinwart, Feng, Suykens To bound the second integral, we first observe that Z h x R K( x ) f(x + hx ) f(x) dx h x R K( x ) f(x + hx ) dx + Z h x R K( x ) f(x) dx r(x) x h R f(x + x ) dx + K f(x) Z h x R 1 dx . It thus remains to bound the second integral. To this end, observe that for C := vold(B ) we have Z h x R 1 dx = C Rd r(x) d d CRd 1 R r(x) where the last inequality is due to Lemma 18. Combining all estimates yields the assertion for all x Ω+h R for which r(x) exists and satisfies r(x) h R. Let us finally consider the case of an x Ω+h R for which r(x) exists. Then we obviously have x Ωand since r(x) exists we conclude that f(x) = 0. Moreover, the definition of Ω+h R gives x x > h R for all x Ω. Let us now fix an x with x R. Assume we had x + hx Ω. For x := x + hx we would then find h R < x x = hx h R, which is impossible. Consequently, we have x + hx Ω, and thus we obtain f(x + hx ) = 0 for λd-almost all x with x R. Combining our considerations with (29) leads to |f P,h(x) f(x)| Z x R K( x ) f(x + hx ) f(x) dx = 0 , and hence we have shown the assertion. Proof [of Proposition 11] (i). We decompose f P,h f 1 as follows f P,h f 1 = Z Br |f P,h(x) f(x)| dx + Z Bcr |f P,h(x) f(x)| dx (31) λd(Br) f P,h f + Z Bcr EP kx,h dx + Z λd(B1) rd f P,h f + Z Bcr EP kx,h dx + P (Bc r) . (32) Combining (32) and (24), we obtain the desired assertion. (ii). Since f is pointwise α-H older controllable, (iii) of Proposition 10 tells us that Z Br |f P,h(x) f(x)| dx c hα Z Br c(x) dx + chυ + Z Bc r0/(2h) K( x ) f(x + hx ) dx dx = c H(r) hα + chυ + Z Bc r0/(2h) K( x ) Z Br f(x + hx ) dx dx c H(r) hα + chυ + Z Bc r0/(2h) K( x ) dx c H(r) hα + c hυ , Kernel Density Estimation for Dynamical Systems where in the last step we used (22). Combining this estimate with the decomposition (31) and the inequality (24) then yields the assertion. (iii). In this case, (iv) of Proposition 10 gives Z Rd |f P,h(x) f(x)| dx = Z Ω+h R |f P,h(x) f(x)| dx Ω+h R c(x) dx + c K Ω+h R f(x) R r(x) r(x) x h R f(x + x ) dx dx . Moreover, we have Z Ω+h R f(x) R r(x) Rd f(x) R r(x) R P({x : r(x) h R}) , which, together with our previous estimate gives the assertion. (iv). Let us consider the partition X h = X h Ω+h R X h (Rd \ Ω+h R) . For x X h (Rd\Ω+h R), the first assertion of (iv) of Proposition 10 then shows |f P,h(x) f(x)| = 0, and hece the assertion is true for these x. Moreover, for X h Ω+h R, the second assertion of (iv) of Proposition 10 gives |f P,h(x) f(x)| c c(x)hα + c K f(x) R r(x) r(x) x h R f(x + x ) dx , and since the second and third term vanish on X h we obtain the assertion. To prove Proposition 13, we need the following lemmas. Lemma 19 Let (X, d) and (Y, e) be metric spaces and T : X Y be an α-H older continuous function with constant c. Then, for A X and all ε > 0 we have N(T(A), e, cεα) N(A, d, ε). Proof [of Lemma 19] Let x1, . . . , xn be an ε-net of A, that is, A Sn i=1 Bd(xi, ε). For i = 1, , n, we set yi := T(xi). Now, it only suffices to show that this gives a cεα-net of T(A). In fact, supposing that y T(Bd(xi, ε)), then there exists x Bd(xi, ε) such that T(x) = y. This implies e(T(x), T(xi)) cdα(x, xi) cεα. Hang, Steinwart, Feng, Suykens Therefore, we have T(Bd(xi, ε)) Be(yi, cεα). That is, y1, . . . , yn is a cεα-net of T(A). This completes the proof of Lemma 19. Remark 20 We remark that when X is a Banach space with the norm , then for any c > 0 there holds N(c A, , ε) = N (A, , ε/c) . Lemma 21 Let be another norm on Rd. Then for all ε (0, 1] we have N(B1, , ε) ε d. Proof [of Lemma 21] It is a straightforward conclusion of Proposition 1.3.1 in Carl and Stephani (1990) and Lemma 6.21 in Steinwart and Christmann (2008). Lemma 22 Let K be a d-dimensional smoothing kernel that satisfies Conditions (i) in Assumption B. Let h > 0 be the bandwidth parameter, and kx,h be defined in (12) for any x Rd. Then we have sup y Rd |kx,h(y) kx ,h(y)| c hβ+d x x β, x, x Rd, where c is a positive constant. Proof [of Lemma 22] From the definition of kx,h and the fact that K is a d-dimensional β-H older continuous kernel, we have |kx,h(y) kx ,h(y)| = 1 ch (β+d) x x β, where c is a positive constant. The desired conclusion is thus obtained. Proof [of Proposition 13] Lemma 22 reveals that Kh,r is the image of H older continuous map Br L (Rd) with the constant ch (β+d). By Lemmas 19 and 21 we obtain N(Kh,r, , ε) N Br, , εhβ+d = N B1, , εhβ+d Kernel Density Estimation for Dynamical Systems where c is a constant independent of ε. This completes the proof of Proposition 13. The following Bernstein-type exponential inequality, which was developed recently in Hang and Steinwart (2017), will serve as one of the main ingredients in the consistency and convergence analysis of the kernel density estimator (8). It can be stated in the following general form: Theorem 23 (Bernstein Inequality (Hang and Steinwart, 2017)) Assume that X := (Xn)n 1 is an X-valued stationary geometrically (time-reversed) C-mixing process on (Ω, A, µ) with C be defined by (4) for some semi-norm ||| ||| satisfying Condition (ii) in Assumption A, and P := µX1. Moreover, let g : X R be a function such that g C(X) with EP g = 0 and assume that there exist some A > 0, B > 0, and σ 0 such that |||g||| A, g B, and EP g2 σ2. Then, for all τ > 0, k N, and n n0 := max min m 3 : m 808c0(3A+B) (log m) 2 γ 4 , e k+1 with probability µ at least 1 4e τ, there holds 8(log n) 2 γ σ2τ n + 8(log n) 2 γ Bτ 3n . Proof [of Proposition 14] Let the notations kx,h and ekx,h be defined in (12) and (13), respectively, that is, kx,h := h d K( x /h), and ekx,h := kx,h EP kx,h. We first assume that x Rd is fixed and then estimate EDekx,h by using Bernstein s inequality in Theorem 23. For this purpose, we shall verify the following conditions: Obviously, we have EP ekx,h = 0. Moreover, simple estimates yield ekx,h 2 kx,h 2h d K 2h d K(0) EP ek2 x,h EP k2 x,h = Z Rd k2 x,h(x )d P(x ). Finally, the first condition in Assumption A and Condition (iv) in Assumption B imply |||ekx,h||| |||kx,h||| h d sup x Rd |||K ( x /h) ||| h dϕ(h). Now we can apply the Bernstein-type inequality in Theorem 23 and obtain that for n n1, for any fixed x Rd, with probability µ at most 4e τ, there holds 8τ(log n)2/γ R Rd k2 x,h(x )d P(x ) n + 16τ(log n)2/γK(0) 3hdn , (33) Hang, Steinwart, Feng, Suykens min m 3 : m 808c0(3h dϕ(h) + K(0)) 1 d+1 and m (log m) 2 γ 4 , e d+1 Consider the function set e Kh,r := {ekx,h : x Br}. We choose y1, . . . , ym Br such that {ky1,h, . . . , kym,h} is a minimal ε/2-net of Kh,r = {kx,h : x Br} with respect to . Noticing the following relation ekx,h ekyj,h 2 kx,h kyj,h ε, we know that {eky1,h, . . . , ekym,h} is an ε-net of e Kh,r with respect to . Note that here we have m = N(Kh,r, , ε 2), since the net is minimal. From Proposition 13, we know that there exists a positive constant c independent of ε such that log m c log r hε. From the estimate in (33) and a union bound argument, with probability µ at least 1 4me τ, the following estimate holds sup j=1,...,m |EDekyj,h| 8τ(log n)2/γ R Rd k2 yj,h(x )d P(x ) n + 16τ(log n)2/γK(0) By a simple variable transformation, we see that with probability µ at least 1 e τ, there holds sup j=1,...,m |EDekyj,h| 8(log n)2/γ R Rd k2 yj,h(x )d P(x )(τ + log(4m)) + 16(log n)2/γK(0)(τ + log(4m)) Recalling that {ky1,h, . . . , kym,h} is an ε/2-net of Kh,r, this implies that, for any x Br, there exists yj such that kx,h kyj,h ε/2. Then we have |EDekx,h| |EDekyj,h| EDekx,h EDekyj,h |EDkx,h EDkyj,h| + |EP kx,h EP kyj,h| kx,h kyj,h L1(D) + kx,h kyj,h L1(P) ε, and consequently |EDekx,h| |EDekyj,h| + ε. (35) By setting a := 8(log n)2/γ(τ + log(4m))/n, we have Rd k2 x,h(x ) d P(x ) Rd k2 yj,h(x ) d P(x ) = akx,h L2(P) akyj,h L2(P) a kx,h kyj,h L2(P) aε/2. Kernel Density Estimation for Dynamical Systems This together with inequality (35) implies that for any x Br, there holds |EDekx,h| |EDekyj,h| + 2ε Rd k2 yj,h(x ) d P(x ) + 2a K(0) Rd k2 x,h(x ) d P(x ) + aε 2 + 2a K(0) Consequently we have Br |EDekx,h| dx Z Rd k2 x,h(x ) d P(x ) dx + rdλd(B1) 2a K(0) hd + rdλd(B1)( a/2 + 1)ε. Now recall that for E Rd and g : E R, H older s inequality implies Rd |1E| 1 2 |g| 1 2 dx 2 Z Rd |1E| dx Z Rd |g| dx = µ(E) g 1. This tells us that Rd k2 x,h(x ) d P(x ) dx p Rd k2 x,h(x ) d P(x ) dx. Moreover, there holds Z Rd k2 x,h(x ) d P(x ) µ(dx) = Z Br h 2d K2 x x /h dx d P(x ) Rd h 2d K2 ( x /h) dx d P(x ) Rd K2( x ) dx We now set ε = 1 n and obtain log(4m) c log nr h . Thus we have Br |EDekx,h| dx (log n)2/γrd(τ + log(4m)) hdn + (log n)2/γrd(τ + log(4m)) (log n)2/γ(τ + log(4m)) (log n)2/γrd(τ + log nr h ) hdn + (log n)2/γrd(τ + log nr Hang, Steinwart, Feng, Suykens Now we need to estimate the corresponding integral over Bc r. By definition we have Z Bcr |EDekx,h| dx Z Bcr EDkx,h dx + Z Bcr EP kx,h dx. From Lemma 17 we obtain Bcr EDkx,h dx D(Bc r/2) + h i=1 1Bc r/2(xi) + h Bcr EP kx,h dx P(Bc r/2) + h Since r 1, we can construct a function g with 1Bc r/2 g 1Bc r/4 and there exists a function ψ(r) such that |||g||| ψ(r). Applying Bernstein inequality in Theorem 23 with respect to this function g, it is easy to see that when n n2, with probability µ at least 1 2e τ, there holds 8τ(log n)2/γ n + 8τ(log n)2/γ min m 3 : m2 808c0(3ψ(r) + 1) and m (log m) 2 γ 4 , e 3 b This implies that with probability µ at least 1 2e τ, there holds D(Bc r/2) = 1 i=1 1Bc r/2(xi) EDg 8τ(log n)2/γ n + 8τ(log n)2/γ EP 1Bc r/4(xi) + 8τ(log n)2/γ n + 8τ(log n)2/γ and consequently we obtain Bcr |EDekx,h| dx P(Bc r/4) + 32τ(log n)2/γ By combining estimates in (36) and (37), and taking n0 = max{n1, n2}, we have accomplished the proof of Proposition 14. Kernel Density Estimation for Dynamical Systems Remark 24 Let us briefly discuss the choice of the function ψ(r) in the proof of Proposition 14. For example, in the case C(X) = Lip(R), we can choose 1, for |x| > r, 0, for |x| < r/4, 4x 3, for r x r/4, 3, for r/4 x r. Then we have |||g||| 4 3r 4/3 and therefore, n2 is well-defined. Moreover, it is easily seen that even for smoother underlying functions classes like C1 we can construct a function g such that |||g||| < . Proof [of Proposition 15] Recalling the definitions of kx,h and ekx,h given in (12) and (13), we have f D,h f P,h = sup x Ω |EDekx,h|. To prove the assertion, we first estimate EDfx,h for fixed x Rd using the Bernstein inequality in Theorem 23. For this purpose, we first verify the following conditions: Obviously, we have EP ekx,h = 0. Then, simple estimates imply ekx,h 2 kx,h 2h d K 2h d K(0) EP ek2 x,h EP k2 x,h = h d Z Rd K2 x x /h f(x )h d dx f h d. Finally, the first condition in Assumption A and Condition (iv) in Assumption B yield |||ekx,h||| |||kx,h||| h d sup x Rd |||K ( x /h) ||| h dϕ(h). Therefore, we can apply the Bernstein inequality in Theorem 23 and obtain that for n n 0, for any fixed x Rd, with probability µ at least 1 4e τ, there holds τ f (log n)2/γ hdn + K(0)τ(log n)2/γ 3hdn , (38) min m 3 : m 808c0(3h dϕ(h) + K(0)) 1 d+1 and m (log m) 2 γ 4 , e d+1 Let us consider the following function set K h,r0 := ekx,h : x Br0 Hang, Steinwart, Feng, Suykens and choose y1, . . . , ym Br0 such that {ky1,h, . . . , kym,h} is a minimal ε/2-net of Kh,r0 with respect to and m = N(Kh,r0, , ε 2). As in the proof of Proposition 14, one can show that eky1,h, . . . , ekym,h is an ε-net of K h,r0. Again from Proposition 13 we know that there holds log(4m) log r0 hε. This in connection with (38) implies that the following union bound sup j=1,...,m |EDekyj,h| f (τ + log(4m))(log n)2/γ hdn + K(0)(τ + log(4m))(log n)2/γ holds with probability µ at least 1 e τ. For any x Br0, there exists a yj such that kx,h kyj,h ε. Then we have |EDekx,h| |EDekyj,h| |EDekx,h EDekyj,h| |EDkx,h EDkyj,h| + |EP kx,h EP kyj,h| kx,h kyj,h L1(D) + kx,h kyj,h L1(P) ε, and consequently with probability µ at least 1 e τ, there holds |EDekx,h| |EDekyj,h| + ε f (τ + log(4m))(log n)2/γ hdn + K(0)(τ + log(4m))(log n)2/γ for any x Br0. By setting ε = 1 n, we obtain log(4m) log nr0 h . Thus, with probability µ at least 1 e τ, we have f (τ + log(nr0 h ))(log n)2/γ hdn + K(0)(τ + log(nr0 h ))(log n)2/γ f (τ + log(nr0 h ))(log n)2/γ hdn + K(0)(τ + log(nr0 h ))(log n)2/γ By taking the supremum of the left hand side of the above inequality over x, we complete the proof of Proposition 15. Proof [of Theorem 7] Without loss of generality, we assume that hn 1. Since hn 0, Proposition 10 implies that f P,h f 1 ε. We set rn := nhd n (log n)(2+2γ)/γ and we can also assume w.l.o.g that rn 2. Moreover, there exists a constant n 1 such that P(Bc rn/2) ε, n n 1. Kernel Density Estimation for Dynamical Systems For any 0 < δ < 1, we select τ := log(1/δ). Then there exists a constant n 2 such that log nrn hn τ for all n n 2. On the other hand, with the above choice of rn, we have hn log n1/dhn (log n)(2+2γ)/γ n (1 + d 1) log n log n. Thus, for all n max{n 1, n 2}, we have (log n)2/γrd n log(nrn nhdn (log n)2/γrd n log n nhdn = 1 log n 0. Thus, following from Proposition 14, when n is sufficient large, for any ε > 0, with probability µ at least 1 3δ, there holds f D,hn f 1 ε. Therefore, with properly chosen δ, one can show that f D,hn converges to f under L1-norm almost surely. We have completed the proof of Theorem 7. Proof [of Theorem 8] (i) Combining the estimates in Proposition 14 and Proposition 11, we know that with probability µ at least 1 2e τ, there holds (log n)2/γrd(τ + log( nr hdnn + (log n)2/γrd(τ + log( nr + τ(log n)2/γ n + P Bc r + rdhα n + hn (log n)2/γrd(τ + log( nr hdnn + τ(log n)2/γ n + P Bc r + rdhα n + hn Let τ := log n and later we will see from the choices of hn and rn that there exists some constant c such that log(nr h ) can be bounded by c log n. Therefore, with probability µ at least 1 1 n there holds rd(log n)(2+γ)/γ hdnn + r ηd + rdhα n α 2α+d + r ηd (log n)(2+γ)/γ ! αη (1+η)(2α+d) α , by choosing hn = (log n)(2+γ)/γ 1+η (1+η)(2α+d) α and r := rn = n (log n)(2+γ)/γ α d(1+η)(2α+d) αd . Hang, Steinwart, Feng, Suykens (ii) Similar to case (i), one can show that with probability µ at least 1 1 n there holds rd(log n)(2+γ)/γ hdnn + e arη + rdhα n rd (log n)(2+γ)/γ α 2α+d + e arη (log n)(2+γ)/γ α 2α+d (log n) d η α+d by choosing hn = (log n)(2+γ)/γ 1 2α+d (log n) d η 1 2α+d and rn = (log n) 1 η . (iii) From Proposition 11 we see that with confidence 1 1 n, there holds f D,h f P,h 1 rd 0(log n)(2+γ)/γ hdnn + hα n (log n)(2+γ)/γ/n α 2α+d , where hn is chosen as hn = (log n)(2+γ)/γ/n 1 2α+d . The proof of Theorem 8 is completed. Proof [of Theorem 9] The desired estimate is an easy consequence if we combine the estimates in Proposition 15 and Proposition 10 (ii) for case (i), Proposition 15 and Proposition 11 (iv) for case (ii), respectively, and choose hn = (log n)(2+γ)/γ/n 1 2α+d . We omit the details of the proof here. To prove the initial Examples 4 and 5, we need the following lemma. Lemma 25 Let g : R [0, ) be an Lebesgue integrable function with f(x) = 0 for all x [0, 1] and f(x) cx γ for some constants c > 0 and γ [0, 1) and all x (0, 1/2]. We define r(x) := |x|/2 for x R. Then for all h (0, 1/8] we have the following two inequalities. Z r(x) h f(x) dx 2c 1 γ h1 γ r(x) |x | h f(x + x ) dx dx 10c 1 γ h1 γ . Kernel Density Estimation for Dynamical Systems Proof [of Lemma 25] The first inequality simply follows from r(x) h f(x) dx = Z 2h 0 f(x) dx c Z 2h 0 x γ dx 2c 1 γ h1 γ . For the proof of the second inequality, we first observe that, for x [0, 2h], we have r(x) |x | h f(x + x ) dx = Z f(x + x ) dx f(x + x ) dx + f(x + x ) dx f(x ) dx . (41) Let us consider the first term in the last equation. For the case where x [0, h), we have h+x f(x )dx = Z x/2 0 f(x )dx c Z x/2 0 (x ) γdx = c 1 γ 1 γ c 1 γ h1 γ, and for the case where x (h, 2h], we have h+x f(x )dx c Z x/2 h+x (x ) γdx = c 1 γ 1 γ (x h)1 γ c 1 γ h1 γ. Therefore, for all x [0, 2h], we obtain h+x f(x )dx c 1 γ h1 γ. (42) On the other hand, for the second term, there holds 3x/2 f(x )dx c Z h+x 3x/2 (x ) γdx = c 1 γ (h + x)1 γ 3x c 1 γ (3h)1 γ 3c 1 γ h1 γ. (43) Combining the estimates (41), (42), and (43), we obtain r(x) |x | h f(x + x ) dx 4c 1 γ h1 γ , Hang, Steinwart, Feng, Suykens while for x (2h, 1/2] the integral vanishes since {x : r(x) |x | h} = . Moreover, for x [ h, 0), we have r(x) |x | h f(x + x ) dx = Z f(x + x ) dx = Z f(x + x ) dx + x f(x + x ) dx x f(x + x ) dx 0 (x ) γ dx 2c 1 γ h1 γ . We these estimates we then conclude that Z 1/2 r(x) |x | h f(x + x ) dx dx = Z 2h r(x) |x | h f(x + x ) dx dx (44) 10c 1 γ h2 γ , (45) and hence we have shown the assertion. Proof [of Example 4] We can choose R = 1, Ω+h R = Ω+h = [ h, 1 + h], and X h = ( , h) (2h, 1 2h) (1 + h, ) . Moreover, we have already seen in Example 3 that we can pick α = 1 with bounded function x 7 c(x). Consequently, (iv) of Proposition 11 shows sup x X h |f P,h(x) f(x)| c h for some constant c > 0 and all h > 0. Moreover, Lemma 25 applied to both critical points x = 0 and x = 1 with γ = 0 in combination with (iii) of Proposition 11 shows f P,h f 1 c h for another constant c > 0 and all h (0, 1/8]. Proposition 14 then yields the assertion. Proof [of Example 5] We again choose R = 1, Ω+h R = Ω+h = [ h, 1 + h], and R c(x) dx < Consequently, Lemma 25 applied to both critical points x = 0 and x = 1 with γ = 1/2 in combination with (iii) of Proposition 11 shows f P,h f 1 c hα for some constant c > 0 and all h (0, 1/8]. Again, Proposition 14 then yields the desired result. Kernel Density Estimation for Dynamical Systems 7. Conclusion In the present paper, we studied the kernel density estimation problem for dynamical systems admitting a unique invariant Lebesgue density by using the C-mixing coefficient to measure the dependence among observations. The main results presented in this paper are the consistency and convergence rates of the kernel density estimator in the sense of L1norm and L -norm. With a properly chosen bandwidth, we showed that the kernel density estimator is universally consistent. Under mild assumptions on the kernel function and the density function, we established convergence rates for the estimator. For instance, when the density function is bounded and compactly supported, both L1-norm and L -norm convergence rates with the same order can be achieved for general geometrically time-reversed C-mixing dynamical systems. The convergence mentioned here is of type with high probability due to the use of a Bernstein-type exponential inequality and this makes the present study different from the existing related studies. We also discussed the model selection problem of the kernel density estimation in the dynamical system context by carrying out numerical experiments. Acknowledgments The authors are grateful to Professor L aszl o Gy orfi, the reviewers, and the action editor for helpful comments that helped improve the quality and the presentation of this paper. The research leading to these results has received funding from the European Research Council under the European Union s Seventh Framework Programme (FP7/2007-2013) / ERC Ad G A-DATADRIVE-B (290923). Hanyuan Hang s research is supported by the National Natural Science Foundation of China (Project No.: 11731011). This paper reflects only the authors views, the Union is not liable for any use that may be made of the contained information. Research Council KUL: GOA/10/09 Ma Net, Co E PFV/10/002 (OPTEC), BIL12/11T; Ph D/Postdoc grants. Flemish Government: FWO: projects: G.0377.12 (Structured systems), G.088114N (Tensor based data similarity); Ph D/Postdoc grants. IWT: projects: SBO POM (100031); Ph D/Postdoc grants. i Minds Medical Information Technologies SBO 2014. Belgian Federal Science Policy Office: IUAP P7/19 (DYSCO, Dynamical systems, control and optimization, 2012-2017). The corresponding author is Yunlong Feng. Marian Anghel and Ingo Steinwart. Forecasting the evolution of dynamical systems from noisy observations. ar Xiv preprint ar Xiv:0707.4146, 2007. Viviane Baladi. Positive Transfer Operators and Decay of Correlations, volume 16 of Advanced Series in Nonlinear Dynamics. World Scientific Publishing Co., Inc., River Edge, NJ, 2000. Viviane Baladi. Decay of correlations. In Smooth Ergodic Theory and its Applications (Seattle, WA, 1999), volume 69 of Proc. Sympos. Pure Math., pages 297 325. Amer. Math. Soc., Providence, RI, 2001. Hang, Steinwart, Feng, Suykens Delphine Blanke, Denis Bosq, and Dominique Gu egan. Modelization and nonparametric estimation for dynamical systems with noise. Statistical Inference for Stochastic Processes, 6(3):267 290, 2003. Denis Bosq and Dominique Gu egan. Nonparametric estimation of the chaotic function and the invariant measure of a dynamical system. Statistics & Probability Letters, 25(3):201 212, 1995. Adrian W. Bowman. An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71(2):353 360, 1984. Richard C. Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2(2):107 144, 2005. Ricardo Cao, Antonio Cuevas, and Wensceslao Gonz alez Manteiga. A comparative study of several smoothing methods in density estimation. Computational Statistics & Data Analysis, 17(2):153 176, 1994. Bernd Carl and Irmtraud Stephani. Entropy, Compactness and the Approximation of Operators. Cambridge University Press, Cambridge, 1990. Chih-Kang Chu and James S. Marron. Comparison of two bandwidth selectors with dependent errors. The Annals of Statistics, 19(4):1906 1918, 1991. Marc Deisenroth and Shakir Mohamed. Expectation propagation in Gaussian process dynamical systems. In Advances in Neural Information Processing Systems, pages 2609 2617. NIPS Foundation, 2012. Luc Devroye. The double kernel method in density estimation. Annales de l Institut Henri Poincar e (B) Probabilit es et Statistiques, 25(4):533 580, 1989. Luc Devroye. Universal smoothing factor selection in density estimation: theory and practice. Test, 6(2):223 320, 1997. Luc Devroye and L aszl o Gy orfi. Nonparametric Density Estimation: The L1 View, volume 119. John Wiley & Sons Incorporated, 1985. Luc Devroye and G abor Lugosi. Nonasymptotic universal smoothing factors, kernel complexity and Yatracos classes. The Annals of Statistics, 25(6):2626 2637, 1997. Luc Devroye and G abor Lugosi. Combinatorial Methods in Density Estimation. Springer Science & Business Media, 2001. Paul Eggermont and Vince La Riccia. Maximum Penalized Likelihood Estimation: Volume I: Density Estimation. Springer, New York, 2001. Gerald B. Folland. Real Analysis. John Wiley & Sons, New York, 1999. Theo Gasser, Hans-Georg M uller, and Volker Mammitzsch. Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 47(2):238 252, 1985. L aszl o Gy orfi. Strong consistent density estimate from ergodic sample. Journal of Multivariate Analysis, 11(1):81 84, 1981. L aszl o Gy orfiand G abor Lugosi. Kernel density estimation from ergodic sample is not universally consistent. Computational statistics & data analysis, 14(4):437 442, 1992. Kernel Density Estimation for Dynamical Systems L aszl o Gy orfiand Elias Masry. The L1 and L2 strong consistency of recursive kernel density estimation from dependent samples. IEEE Transactions on Information Theory, 36(3):531 539, 1990. Peter Hall, Soumendra Nath Lahiri, and J org Polzehl. On bandwidth choice in nonparametric regression with both short-and long-range dependent errors. The Annals of Statistics, 23(6): 1921 1936, 1995. Hanyuan Hang and Ingo Steinwart. A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning. The Annals of Statistics, 45(2):708 743, 2017. Jeffrey D. Hart and Philippe Vieu. Data-driven bandwidth choice for density estimation based on dependent data. The Annals of Statistics, 18(2):873 890, 1990. Jeffrey D. Hart and Thomas E. Wehrly. Kernel regression estimation using repeated measurements data. Journal of the American Statistical Association, 81(396):1080 1088, 1986. Michael C. Jones, James S. Marron, and Simon J. Sheather. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91(433):401 407, 1996a. Michael C. Jones, James S. Marron, and Simon J. Sheather. Progress in data-based bandwidth selection for kernel density estimation. Computational Statistics, 11(3):337 381, 1996b. Anatole Katok and Boris Hasselblatt. Introduction to the Modern Theory of Dynamical Systems, volume 54 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 1995. Rafail Z. Khas minskii. A lower bound on the risks of non-parametric estimates of densities in the uniform metric. Theory of Probability & Its Applications, 23(4):794 798, 1979. Andrzej Lasota and Michael C. Mackey. Probabilistic Properties of Deterministic Systems. Cambridge University Press, 1985. Andrzej Lasota and James A. Yorke. On the existence of invariant measures for piecewise monotonic transformations. Transactions of the American Mathematical Society, 186:481 488, 1973. Carlangelo Liverani. Decay of correlations for piecewise expanding maps. Journal of Statistical Physics, 78(3):1111 1129, 1995. Elias Masry. Probability density estimation from sampled data. Information Theory, IEEE Transactions on, 29(5):696 709, 1983. Elias Masry. Recursive probability density estimation for weakly dependent stationary processes. Information Theory, IEEE Transactions on, 32(2):254 267, 1986. Elias Masry and L aszl o Gy orfi. Strong consistency and rates for recursive probability density estimators of stationary processes. Journal of Multivariate Analysis, 22(1):79 93, 1987. V eronique Maume-Deschamps. Exponential inequalities and functional estimation for weak dependent data: applications to dynamical systems. Stochastics and Dynamics, 6(4):535 560, 2006. Kevin Mc Goff, Sayan Mukherjee, Andrew Nobel, and Natesh Pillai. Consistency of maximum likelihood estimation for some dynamical systems. The Annals of Statistics, 43(1):1 29, 2015a. Kevin Mc Goff, Sayan Mukherjee, and Natesh Pillai. Statistical inference for dynamical systems: a review. Statistics Surveys, 9:209 252, 2015b. Hang, Steinwart, Feng, Suykens Byeong U. Park and James S. Marron. Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association, 85(409):66 72, 1990. Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065 1076, 1962. Cl ementine Prieur. Density estimation for one-dimensional dynamical systems. ESAIM: Probability and Statistics, 5:51 76, 2001. Peter M. Robinson. Nonparametric estimators for time series. Journal of Time Series Analysis, 4 (3):185 207, 1983. Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832 837, 1956. Mats Rudemo. Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics, 9:65 78, 1982. David W. Scott and George R. Terrell. Biased and unbiased cross-validation in density estimation. Journal of the American Statistical Association, 82(400):1131 1146, 1987. Simon J. Sheather and Michael C. Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 53(3):683 690, 1991. Ingo Steinwart and Marian Anghel. Consistency of support vector machines for forecasting the evolution of an unknown ergodic dynamical system from observations with unknown noise. The Annals of Statistics, 37(2):841 875, 2009. Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, New York, 2008. Charles J. Stone. Optimal uniform rate of convergence for nonparametric estimators of a density function or its derivatives. In Recent Advances in Statistics, pages 393 406. Academic Press, New York, 1983. Johan A.K. Suykens and Joos Vandewalle. Recurrent least squares support vector machines. Circuits and Systems I: Fundamental Theory and Applications, IEEE Transactions on, 47(7):1109 1114, 2000. Johan A.K. Suykens, Joos Vandewalle, and Bart De Moor. Artificial Neural Networks for Modelling and Control of Non-Linear Systems. Springer Science & Business Media, 1995. Johan A.K. Suykens, Tony Van Gestel, Jos De Brabanter, Bart De Moor, and Joos Vandewalle. Least Squares Support Vector Machines. World Scientific, Singapore, 2002. George R. Terrell. The maximal smoothing principle in density estimation. Journal of the American Statistical Association, 85(410):470 477, 1990. Lanh Tat Tran. The L1 convergence of kernel density estimates under dependence. The Canadian Journal of Statistics, 17(2):197 208, 1989a. Lanh Tat Tran. Recursive density estimation under dependence. Information Theory, IEEE Transactions on, 35(5):1103 1108, 1989b. Matt P. Wand and Chris M. Jones. Kernel Smoothing. Chapman & Hall, London, 1994. Kernel Density Estimation for Dynamical Systems Qiwei Yao and Howell Tong. Cross-validatory bandwidth selections for regression estimation based on dependent data. Journal of Statistical Planning and Inference, 68(2):387 415, 1998. Bin Yu. Density estimation in the L -norm for dependent data with applications to the Gibbs sampler. The Annals of Statistics, 21(2):711 735, 1993. Onno Zoeter and Tom Heskes. Change point problems in linear dynamical systems. The Journal of Machine Learning Research, 6:1999 2026, 2005.