# Nonparametric Density Estimation under Distribution Drift

Alessio Mazzetto (Brown University), Eli Upfal (Brown University)

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). Correspondence to: Alessio Mazzetto.

Abstract

We study nonparametric density estimation in non-stationary drift settings. Given a sequence of independent samples taken from a distribution that gradually changes in time, the goal is to compute the best estimate for the current distribution. We prove tight minimax risk bounds for both discrete and continuous smooth densities, where the minimum is over all possible estimators and the maximum is over all possible distributions that satisfy the drift constraints. Our technique handles a broad class of drift models and generalizes previous results on agnostic learning under drift.

1. Introduction

Density estimation is a fundamental problem in statistics with numerous applications in data analysis and machine learning. Given a set of samples, the goal is to best estimate the probability distribution that generated these samples, often subject to some parametric or nonparametric assumptions on the family of candidate distributions. This problem has been studied extensively, for both discrete and continuous distribution functions, under the assumption that the samples are independent and identically distributed according to the distribution that we aim to estimate (Devroye & Lugosi, 2001).

In many data analysis applications, ranging from customer preferences to weather conditions, the assumption that the samples are identically distributed is unrealistic. The underlying distribution gradually changes over time, and estimating the current distribution inevitably relies on past data from related but not identical distributions. This work presents the first tight bounds for density estimation of both discrete and continuous smooth distributions from samples that are independent but are generated by distributions that drift in time.

Distribution drift has been studied in the context of agnostic learning under the assumption of an equal, bounded drift at each step (Bartlett, 1992), i.e. there is a constant $\Delta > 0$ such that the drift between two distributions $i$ steps apart is bounded by $i\Delta$. In this case, it has been shown that the minimax risk for learning a family of binary classifiers with VC dimension $\nu$ is $\Theta((\nu\Delta)^{1/3})$ (Barve & Long, 1996; Long, 1999), where the minimax risk characterizes the maximal expected error of the best algorithm that solves the problem within the specified drift assumptions. The minimax risk has not been studied for other drift patterns in this context.

In this work, we study the more general problem of density estimation, under a more detailed family of drift models. In particular, our results apply to any regular drift sequence, where the $i$-th element of this sequence provides an upper bound on the distance between the current distribution and the $i$-th distribution in the sequence, and a regularity assumption prevents abrupt changes in the drift (Definition 3.1). The distance used depends on the specific estimation problem: we use the total variation distance in the case of discrete densities, and the $L_2$ distance in the case of smooth densities. Bounded drift per step is one possible model in our setting. However, when more information is available, we obtain more informative tight bounds.
For the problem of estimating a discrete distribution with support size $k$ using $n$ samples from a drifting distribution, we show a minimax risk of $\Theta(\sqrt{k/r})$ with respect to the total variation distance, where $r \le n$ is an integer that is easily derived from the drift sequence. For the special case of no drift, we retrieve the known minimax risk $\Theta(\sqrt{k/n})$ for the estimation of a discrete density with $n$ independent and identically distributed samples. Since the problem of estimating a discrete density with support size $k$ with respect to the total variation distance can be reduced to the problem of agnostically learning a family of binary classifiers with VC dimension $k$, we also show that our results imply a lower bound on the minimax risk for the latter problem. In particular, our results generalize the previously known lower bound that only holds for the case of bounded drift at each step (Barve & Long, 1996).

The following simple example demonstrates the power of our approach. Assume a sample space of size $k$, and a distribution drift that follows this pattern: a probability mass of $1-\Delta$ is distributed among the $k$ elements and does not change in time, while the remaining probability mass $\Delta$ is redistributed among the $k$ elements at each step. The drift at each step is bounded by $\Delta$, and based only on this bound the best estimate can only guarantee a $\Theta((k\Delta)^{1/3})$ error. However, the drift between the current distribution and any past distribution is also bounded by $\Delta$. Incorporating this extra information, our technique provides a tighter $\Theta(\Delta)$ error estimate. In Section 4.1, we show a similar gap for the agnostic learning problem.

For the smooth density estimation problem, we seek to estimate a probability density that is $\beta$-smooth (Definition 5.1). We focus on the nonparametric case, i.e. we do not make any parametric assumption on the target density. We establish a minimax risk of $\Theta\big(r^{-2\beta/(2\beta+1)}\big)$ with respect to the integrated squared loss, where again $r \le n$ depends on the drift sequence. This result extends the known minimax risk $\Theta\big(n^{-2\beta/(2\beta+1)}\big)$ for estimating a density from $n$ independent and identically distributed samples (Van der Vaart, 2000).

The results we have discussed so far establish the minimax risk for learning the current distribution at a given specific time based on past data. However, we are also interested in characterizing the minimax rate of the average risk for the online version of those problems, where we are required to provide an estimate of the current distribution at each step. This is a more challenging problem, as in the lower bound construction we need to show that we frequently incur a high estimation error. Nonetheless, we show that in the case of bounded drift at each step, the minimax rate for the online estimation of a discrete density with support size $k$ is $\Theta((k\Delta)^{1/3})$. As previously discussed, this result also applies to the problem of agnostically learning a family of binary classifiers. This is the first work in the literature to provide a characterization of the minimax rate for the online version of these problems in a distribution drift setting.

Following previous work in this setting, our upper bounds on the minimax risk are obtained by considering an estimator over a window of recent samples of a properly chosen size.
The size of the window is chosen to balance the trade-off between the variance of the estimator and the error introduced by considering samples from distributions that are further away in time and exhibit a larger drift. In the literature, the only lower bound construction for drifting distributions is specific to the problem of agnostic learning of binary classifiers (Barve & Long, 1996), and it assumes a bounded drift at each step. In this paper, we develop a novel proof strategy that allows us to obtain tight lower bounds for both the problem of discrete density estimation and the problem of smooth density estimation under any arbitrary regular drift sequence. We believe that our method is of independent interest, and that it can possibly be applied to other estimation problems with drifting distributions.

1.1. Related Work

The distribution drift setting was introduced in the context of supervised learning of binary predictors (Helmbold & Long, 1991; Bartlett, 1992; Helmbold & Long, 1994). In this line of work, it has been shown that there exists an algorithm that finds a binary predictor whose expected prediction error with respect to the current distribution is at most $O((\Delta\nu)^{1/3})$ larger than the expected error of the best predictor in the family, where $\nu$ is the VC dimension of the considered family of binary predictors and $\Delta$ is an upper bound on the total variation distance between two consecutive distributions (Long, 1999). This upper bound is tight (Barve & Long, 1996). More recent work generalized this analysis to provide upper bounds for learning any family of predictors with bounded Rademacher complexity, and introduced a finer measure of distance between consecutive distributions (Mohri & Muñoz Medina, 2012): it uses tools from transfer learning theory (Mansour et al., 2009), as learning with distribution drift can be seen as a special case of learning with domain shift (Ben-David et al., 2010).

The problem of density estimation with independent and identically distributed samples has been extensively studied in the literature; see (Silverman, 1986; Groeneboom & Jongbloed, 2014; Scott, 2015) for an overview of older and recent work on this topic. In the case of estimating a distribution with finite support size $k$, it is folklore that we can achieve an expected error $O(\sqrt{k/n})$ in total variation distance with $n$ samples. This upper bound is tight (Anthony & Bartlett, 2002), and the minimax risk bound has been computed with exact constants (Kamath et al., 2015). Recent work provided an analysis for the estimation of a discrete distribution with countably infinite support (Berend & Kontorovich, 2013; Han et al., 2015), also using data-dependent bounds (Cohen et al., 2020). In the case of the estimation of a $\beta$-smooth density from $n$ independent and identically distributed samples, it is possible to obtain an expected squared error of $O\big(n^{-2\beta/(2\beta+1)}\big)$; we refer to (Tsybakov, 2009) for a recent book on the topic. This upper bound is achieved by different methods such as kernel density estimation (e.g., see Van der Vaart (2000)), and it can be proven to be tight by using information-theoretic methods from minimax theory (Devroye & Györfi, 1987; Yu, 1997).

The problem of relaxing the independent and identically distributed assumption on the samples for density estimation has been studied in the literature from a theoretical perspective. However, these works significantly diverge from our setting, as they use different sets of assumptions.
Multiple works have addressed the problem of estimating the stationary distribution of a Markov process (Roussas, 1969; Wen et al., 2020), even for an arbitrary initial distribution (Gillert & Wartenberg, 1984). In (Phillips & Park, 1998), the authors developed an asymptotic theory for the kernel density estimate of a random walk and the kernel regression estimator of a non-stationary first-order autoregression. Closer to our setting, recent work (Gokcesu & Kozat, 2017) provides parametric density estimation results for a family of exponential distributions where the parameters of the distributions are allowed to slowly change at each step. Many other algorithms have also been proposed for online learning of densities (Kristan et al., 2011; García-Treviño & Barria, 2012); however, they do not come with a theoretical analysis.

The problem of characterizing the minimax rate of the average risk for the online version of the density estimation problems can also be related to recent work on online forecasting (Baby & Wang, 2019; 2020). In the online forecasting problem, the goal is to predict the current position of a vector that moves over time, given a sequence of independent noisy observations of the past positions of this vector. For this problem, they provide an adaptive algorithm that achieves an optimal minimax regret (up to logarithmic factors) with respect to the total variation of the position of the vector over time. It is possible to use their algorithm to adaptively estimate a discrete density over $[k]$ in a non-stationary setting. In fact, we can describe each discrete density as a vector in $\mathbb{R}^k$, where coordinate $i$ represents the probability of sampling the $i$-th element, and we observe a noisy observation of this vector. However, their approach estimates each coordinate independently, thus a direct application of their work does not achieve the optimal dependency on $k$ for the problem of discrete density estimation with a bounded drift at each step.

1.2. Our Contributions

1. We introduce the concept of a regular drift sequence, a general framework for characterizing distribution drift (Section 3).
2. We establish the minimax risk for discrete density estimation with respect to any regular drift sequence (Section 4).
3. We generalize the previous lower bound for the problem of agnostically learning a family of binary classifiers to any regular drift sequence (Section 4.1).
4. We establish the minimax rate for the online problem of estimating a discrete density with a bounded drift at each step (Section 4.4).
5. We establish the minimax risk for estimating a smooth density with respect to any regular drift sequence (Section 5).

2. Preliminaries

Let $[n] = \{1, \ldots, n\}$ for $n \in \mathbb{N}$. Consider a non-empty sample space $\mathcal{X}$ equipped with a $\sigma$-algebra. Let $(X_i)_{i \in \mathbb{N}}$ be an independent stochastic process defined over $\mathcal{X}$ (a stochastic process is independent iff every finite subset of its random variables is mutually independent), and let $P_i$ be the probability distribution of the random variable $X_i$. Given $n \in \mathbb{N}$, let $X^n = (X_1, \ldots, X_n)$ be the random vector of the first $n$ elements of the random process. Since the $X_i$'s are independent, the distribution of $X^n$ can be written as a product distribution $S = P_1 \otimes \cdots \otimes P_n$ over $\mathcal{X}^n$. Given $i \le n$, we denote by $\theta_i(S) = P_i$ the $i$-th component of $S$. Let $\mathcal{S}_n$ be a family of candidate probability distributions for the (unknown) distribution $S$ of the random vector $X^n$. Given an observation $X^n \sim S$, our goal is to estimate $P_n = \theta_n(S)$. Let $\hat\theta_n = \hat\theta_n(X^n)$ be an estimator of this property.
Given a suitable metric $d(\cdot,\cdot)$ that quantifies the error of the estimation, the minimax risk at time $n$ is

$$\inf_{\hat\theta_n}\, \sup_{S \in \mathcal{S}_n}\, \mathbb{E}_{X^n \sim S}\Big[ d\big(\hat\theta_n(X^n), \theta_n(S)\big) \Big] , \qquad (1)$$

where we take the supremum (worst case) over all the product distributions $S$ in $\mathcal{S}_n$, and the infimum over all possible estimators $\hat\theta_n$. The minimax risk quantifies the largest estimation error that the best estimator can possibly achieve with respect to $\mathcal{S}_n$ at a given time $n$. We omit the subscript $n$ when it is clear from the context. For each estimation problem, we adopt the most commonly used distance in the literature for that problem, that is, the total variation distance for discrete densities (probability mass functions), and the $L_2$ distance for smooth densities.

We are also interested in the minimax rate of the average risk, which quantifies the average of the estimation errors of an online algorithm that at each step observes a new random variable and outputs an estimate of its distribution based on all the previous observations. Given $i \le n$, we let $\hat\theta_i(X^i)$ be an estimator of $\theta_i(S) = P_i$. Given a metric $d$, the minimax rate of the average risk is defined as

$$\inf_{\hat\theta_1, \ldots, \hat\theta_n}\, \sup_{S \in \mathcal{S}_n}\, \mathbb{E}_{X^n \sim S}\Bigg[ \frac{1}{n}\sum_{i=1}^{n} d\big(\hat\theta_i(X^i), \theta_i(S)\big) \Bigg] . \qquad (2)$$

Due to space constraints, some of the proofs are deferred to the appendix.

3. Distribution Drift

The family of candidate probability distributions $\mathcal{S}_n$ is defined by the assumptions on the distribution drift in the stochastic process. The most widely used assumption in the literature is a bound on the drift at each step (Bartlett, 1992). In our setting, this is formally expressed as follows: there exists $\Delta > 0$ such that $d(P_i, P_{i+1}) \le \Delta$ for any $i \le n-1$. In the context of learning binary functions, minimax risk lower bounds are known under this setting (Barve & Long, 1996; Long, 1999). However, this is a very pessimistic assumption, as in the worst case the distance from $P_i$ to $P_n$ is $(n-i)\Delta$, since we can accumulate an additive error at each step. As an illustrative example that shows the drawback of this assumption, let $\Delta \in (0,1)$ and consider a sequence of distributions $P_1, \ldots, P_n$ over $\mathbb{N}$ such that $P_i$ takes the value $0$ with probability $1-\Delta$, and the value $i$ with probability $\Delta$. This sequence has bounded drift at each step, i.e. the total variation distance between $P_i$ and $P_{i+1}$ is equal to $\Delta$. However, the total variation distance between $P_n$ and $P_i$ is also equal to $\Delta$, rather than $(n-i)\Delta$. The minimax risk subject to only the weak condition $d(P_i, P_{i+1}) \le \Delta$ can therefore be arbitrarily far from the best estimation error.

An alternative assumption is a polynomial drift (Hanneke et al., 2015). That is, we assume that there exists a value $\alpha \in [0,1]$ such that $d(P_i, P_n) \le (n-i)^{\alpha}\Delta$. While this assumption allows one to obtain closed-form upper bounds on the error for the problem of agnostic learning, no lower bounds are known in this setting. In this work, we introduce a more detailed approach for defining drift, and present upper and lower bounds for it.

Definition 3.1. A vector $\Delta_n = (\Delta_1, \ldots, \Delta_n) \in \mathbb{R}^n_{\ge 0}$ is a regular drift sequence for a product distribution $S = P_1 \otimes \cdots \otimes P_n$ with respect to a metric $d(\cdot,\cdot)$ if:
1. $\Delta_i$ is an upper bound on the drift between $P_i$ and $P_n$, i.e. $d(P_i, P_n) \le \Delta_i$;
2. the sequence $\Delta_1, \ldots, \Delta_n$ is non-increasing;
3. there is no abrupt change in the drift: there is a constant $c$ such that $\Delta_{i-1}/\Delta_i \le c$ for any $i = 2, \ldots, n-1$;
4. $\Delta_i = 0$ if and only if $i = n$.

Comment. Our results also hold if we substitute in the above definition the requirement that $d(P_i, P_n) \le \Delta_i$ with the requirement that $d(P_i, P_{i+1}) \le \Delta_i - \Delta_{i+1}$ for any $i \le n-1$. The latter property is stronger, as it implies the former; however, our lower bound proof works with either definition, and we will use them interchangeably.
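To make Definition 3.1 concrete, here is a minimal Python sketch (the function name and the handling of the constant $c$ are our own, not from the paper) that checks conditions 2-4 for a candidate drift vector; condition 1 involves the unknown distributions and cannot be verified from the vector alone.

```python
import numpy as np

def is_regular_drift_sequence(deltas, c):
    """Check conditions 2-4 of Definition 3.1 for a candidate
    drift sequence (Delta_1, ..., Delta_n); `c` is the regularity constant."""
    deltas = np.asarray(deltas, dtype=float)
    n = len(deltas)
    nonnegative = np.all(deltas >= 0)
    nonincreasing = np.all(np.diff(deltas) <= 0)              # condition 2
    # condition 3: Delta_{i-1} / Delta_i <= c for i = 2, ..., n-1
    regular = np.all(deltas[:-2] <= c * deltas[1:-1]) if n > 2 else True
    # condition 4: Delta_i = 0 exactly when i = n
    zero_only_at_end = np.all(deltas[:-1] > 0) and deltas[-1] == 0
    return bool(nonnegative and nonincreasing and regular and zero_only_at_end)

# Example: bounded drift at each step, Delta_i = (n - i) * step
n, step = 1000, 1e-3
deltas = step * np.arange(n - 1, -1, -1)                      # ((n-1)Δ, ..., Δ, 0)
print(is_regular_drift_sequence(deltas, c=2))                 # True
```

For instance, the bounded-drift sequence $((n-1)\Delta, \ldots, \Delta, 0)$ used in Section 4 passes this check with $c = 2$.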
In our work, we characterize the minimax risk of density estimation problems under an arbitrary regular drift sequence $\Delta_n$. Notably, this is the first work to provide lower bounds for estimation problems under such a general model of drift, as the previous lower bound construction assumed a bounded drift at each step (Barve & Long, 1996).

4. Discrete Density Estimation

In this section, we show the minimax risk and the minimax rate of the average risk for the problem of estimating discrete distributions with finite support under distribution drift. Without loss of generality, we can let the sample space be $\mathcal{X} = [k]$, where $k$ denotes the support size. Since the sample space is discrete, a distribution over $\mathcal{X}$ is a probability mass function $P$ such that $P(j) = \Pr_{X \sim P}(X = j)$ for $j \in [k]$. Following previous work on estimating discrete distributions, we evaluate the quality of the estimation by the total variation distance. Given two distributions $P$ and $Q$ over $[k]$, their total variation distance is defined as
$$\mathrm{TV}(P, Q) \doteq \frac{1}{2}\sum_{j \in [k]} |P(j) - Q(j)| .$$
We consider the following family of probability distributions $\mathcal{S}_n(\Delta_n, k)$ over $\mathcal{X}^n$ with regular drift $\Delta_n$.

Definition 4.1. Let $\mathcal{X} = [k]$. Let $\Delta_n \in \mathbb{R}^n_{\ge 0}$, and let $k > 0$. A product distribution $S = P_1 \otimes \cdots \otimes P_n$ over $\mathcal{X}^n$ belongs to $\mathcal{S}_n(\Delta_n, k)$ if and only if $\Delta_n$ is a regular drift sequence for $S$ with respect to the metric $\mathrm{TV}$.

We establish the following minimax risk in this setting.

Theorem 4.2. Let $\mathcal{S}_n(\Delta_n, k)$ be defined as in Definition 4.1, and let
$$r^* = \max\Big\{ r \in [n] : \Delta_{n-r+1} \le \sqrt{k/r} \Big\} .$$
If $r^*$ is well-defined and $r^* > k$, then:
$$\inf_{\hat\theta_n}\, \sup_{S \in \mathcal{S}_n(\Delta_n, k)}\, \mathbb{E}_{X^n \sim S}\Big[ \mathrm{TV}\big(\hat\theta_n(X^n), \theta_n(S)\big) \Big] = \Theta\Big(\sqrt{k/r^*}\Big) .$$

This is the first work to characterize the minimax risk for the discrete density estimation problem under distribution drift. In Section 4.2, we analyze a simple algorithm that achieves the upper bound of the theorem. Our results extend the known minimax risk of $\Theta(\sqrt{k/n})$ for estimating a discrete distribution with $n$ independent and identically distributed samples: for $\Delta_n \to 0$, we have that $r^* = n$. Notably, Theorem 4.2 provides matching upper and lower bounds for any regular drift sequence $\Delta_n$, and this is the first theoretical work within the distribution drift literature that provides a lower bound in such a general drift setting. As a simple corollary of this theorem, we can obtain the minimax risk for more specific drift assumptions previously used in the literature.

Bounded drift at each step. Assume that there exists a constant $\Delta > 0$ such that for any $S = P_1 \otimes \cdots \otimes P_n$, it holds that $\mathrm{TV}(P_i, P_{i+1}) \le \Delta$. We can invoke Theorem 4.2 with the regular drift sequence $\Delta_n = ((n-1)\Delta, \ldots, \Delta, 0)$. As $r^* = \max\{ r \in [n] : (r-1)\Delta \le \sqrt{k/r} \}$, we can observe that for $n \ge (k/\Delta^2)^{1/3}$ we have $r^* = \Theta\big((k/\Delta^2)^{1/3}\big)$, and thus the minimax risk in this setting is $\Theta((k\Delta)^{1/3})$.

Polynomial drift. Assume that there exists an $\alpha \in (0, 1]$ such that $\mathrm{TV}(P_i, P_n) \le (n-i)^{\alpha}\Delta$ for all $i \in [n]$. We can invoke Theorem 4.2 with the regular drift sequence $\Delta_n = ((n-1)^{\alpha}\Delta, (n-2)^{\alpha}\Delta, \ldots, \Delta, 0)$, and we obtain that for $n \ge (k/\Delta^2)^{1/(2\alpha+1)}$, the minimax risk in this setting is $\Theta\big((k^{\alpha}\Delta)^{1/(2\alpha+1)}\big)$. This is the first work to show a lower bound under this drift assumption.
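The following sketch (the helper name is ours, not from the paper) shows how $r^*$ from Theorem 4.2 can be computed by a direct scan of the drift sequence; on the bounded-drift sequence of the first corollary it recovers $r^* = \Theta((k/\Delta^2)^{1/3})$ and a risk scale of $\Theta((k\Delta)^{1/3})$.

```python
import math

def optimal_window_discrete(deltas, k):
    """Largest r in [n] with Delta_{n-r+1} <= sqrt(k/r), as in Theorem 4.2.
    `deltas` is the regular drift sequence (Delta_1, ..., Delta_n).
    Returns None when no r satisfies the condition."""
    n = len(deltas)
    best = None
    for r in range(1, n + 1):
        if deltas[n - r] <= math.sqrt(k / r):   # deltas[n-r] is Delta_{n-r+1}
            best = r
    return best

# Bounded drift at each step: Delta_{n-r+1} = (r-1)*step.
k, step, n = 100, 1e-3, 10**5
deltas = [step * (n - i) for i in range(1, n + 1)]
r_star = optimal_window_discrete(deltas, k)
print(r_star, (k / step**2) ** (1 / 3))    # r_star is Theta((k/step^2)^(1/3))
print(math.sqrt(k / r_star))               # risk scale Theta((k*step)^(1/3))
```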
4.1. Connection to Agnostic Learning

We can easily show that the lower bound of Theorem 4.2 also applies to the problem of agnostically learning a family of binary functions with VC dimension $k$. In fact, consider the family of binary functions $\mathcal{F} = \{f_A : A \subseteq [k]\}$, where $f_A(j) = \mathbf{1}\{j \in A\}$ for any $j \in [k]$, and observe that the VC dimension of $\mathcal{F}$ is $k$. Let $\hat P$ be any estimator of $P_n$, and let $\hat P(A) = \sum_{j \in A}\hat P(j)$ for any $A \subseteq [k]$. By using the definition of the total variation distance, we have that
$$\sup_{A \subseteq [k]} \Big| \mathbb{E}_{X \sim P_n}\big[f_A(X)\big] - \hat P(A) \Big| = \mathrm{TV}(P_n, \hat P) ,$$
which shows that the problem of estimating the distribution $P_n$ under the total variation distance can be reduced to the problem of agnostically learning the family $\mathcal{F}$ with respect to the distribution $P_n$. We can conclude that the lower bound of Theorem 4.2 applies to the problem of agnostically learning a family of binary functions with VC dimension $k$ in a distribution drift setting. For the case of bounded drift at each step, as observed in the previous subsection, the lower bound is $\Omega((k\Delta)^{1/3})$ for sufficiently large $n$, and we retrieve the result of Barve & Long (1996). Theorem 4.2 generalizes this lower bound to a more general model of drift, giving tighter bounds when possible, as shown in the example in the Introduction.

4.2. Upper Bound

To prove the upper bound of Theorem 4.2, fix a parameter $r \le n$ and consider the empirical distribution $\hat P^r$ over the latest $r$ random variables:
$$\hat P^r(j) = \frac{1}{r}\sum_{i=n-r+1}^{n} \mathbf{1}\{X_i = j\} \qquad j \in [k] . \qquad (3)$$
Analogously, we define the average of the latest $r$ distributions as $\bar P^r = (1/r)\sum_{i=n-r+1}^{n} P_i$. In order to evaluate the expected error obtained by using $\hat P^r$ as an estimate, we use the triangle inequality to decompose the error into two terms:
$$\mathbb{E}\big[\mathrm{TV}(P_n, \hat P^r)\big] \le \mathbb{E}\big[\mathrm{TV}(\bar P^r, \hat P^r)\big] + \mathrm{TV}(\bar P^r, P_n) . \qquad (4)$$
The first error term of this upper bound is the statistical error of estimating the distribution $\bar P^r$ by its empirical distribution $\hat P^r$. This error is related to the variance of the estimator, which depends on the support size of the estimated distributions.

Proposition 4.3. $\mathbb{E}\big[\mathrm{TV}(\bar P^r, \hat P^r)\big] \le (1/2)\sqrt{k/r}$.

Proof. By definition, $\mathbb{E}\big[\mathrm{TV}(\bar P^r, \hat P^r)\big] = (1/2)\sum_{j \in [k]} \mathbb{E}\,\big|\hat P^r(j) - \bar P^r(j)\big|$, and using Jensen's inequality we have that, for any $j$, $\mathbb{E}\,\big|\hat P^r(j) - \bar P^r(j)\big| \le \sqrt{\mathbb{V}\big(\hat P^r(j)\big)}$. Since $\hat P^r(j)$ is an average of $0$-$1$ random variables, we have
$$\mathbb{V}\big(\hat P^r(j)\big) = \frac{1}{r^2}\sum_{i=n-r+1}^{n} P_i(j)\big(1 - P_i(j)\big) \le \frac{1}{r^2}\sum_{i=n-r+1}^{n} P_i(j) = \frac{\bar P^r(j)}{r} .$$
Thus, we have
$$\mathbb{E}\big[\mathrm{TV}(\bar P^r, \hat P^r)\big] \le \frac{1}{2}\sum_{j \in [k]} \sqrt{\mathbb{V}\big(\hat P^r(j)\big)} \le \frac{1}{2\sqrt{r}}\sum_{j \in [k]} \sqrt{\bar P^r(j)} .$$
We conclude the proof using the Cauchy-Schwarz inequality:
$$\sum_{j \in [k]} \sqrt{\bar P^r(j)} \le \sqrt{k\sum_{j \in [k]} \bar P^r(j)} = \sqrt{k} .$$

The second error term of the upper bound (4) is the drift error, which describes how far the distribution $\bar P^r$ is from $P_n$. Observe that if the samples were identically distributed, this error would be zero. The drift error can be upper bounded by using the information on the drift sequence $\Delta_n$.

Proposition 4.4. $\mathrm{TV}(\bar P^r, P_n) \le \Delta_{n-r+1}$.

Proof. We can rewrite the total variation distance as
$$\mathrm{TV}(\bar P^r, P_n) = \frac{1}{2}\sum_{j \in [k]} \big|\bar P^r(j) - P_n(j)\big| \le \frac{1}{2r}\sum_{j \in [k]}\sum_{i=n-r+1}^{n} \big|P_i(j) - P_n(j)\big| ,$$
where in the last inequality we used the triangle inequality and the definition of $\bar P^r$. Therefore, we obtain that $\mathrm{TV}(\bar P^r, P_n) \le (1/r)\sum_{i=n-r+1}^{n} \mathrm{TV}(P_i, P_n)$. We conclude the proof by observing that, by the monotonicity of the drift sequence, $\mathrm{TV}(P_i, P_n) \le \Delta_i \le \Delta_{n-r+1}$ for any $i \ge n-r+1$.

By using Propositions 4.3 and 4.4, we can upper bound the estimation error (4) as $\mathbb{E}\big[\mathrm{TV}(P_n, \hat P^r)\big] \le (1/2)\sqrt{k/r} + \Delta_{n-r+1}$. There is a trade-off: by choosing a larger $r$, we obtain a smaller statistical error, but potentially a larger drift error.
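As a concrete illustration of the estimator in Equation (3) and of the bound $\mathbb{E}[\mathrm{TV}(P_n, \hat P^r)] \le (1/2)\sqrt{k/r} + \Delta_{n-r+1}$, the following sketch (our own helper names; samples are encoded as $0, \ldots, k-1$) computes the windowed empirical distribution and evaluates the bound for a few window sizes, making the statistical/drift trade-off visible.

```python
import numpy as np

def windowed_empirical_distribution(samples, k, r):
    """Estimator of Equation (3): empirical distribution of the latest r
    samples; values are assumed to be encoded as 0, ..., k-1."""
    window = np.asarray(samples[-r:])
    return np.bincount(window, minlength=k) / r

def tv_error_bound(deltas, k, r):
    """Upper bound (1/2)*sqrt(k/r) + Delta_{n-r+1} from Propositions 4.3-4.4."""
    return 0.5 * np.sqrt(k / r) + deltas[len(deltas) - r]

# Bounded drift at each step: Delta_{n-r+1} = (r-1)*step.
k, n, step = 50, 5000, 2e-3
deltas = [step * (n - i) for i in range(1, n + 1)]
rng = np.random.default_rng(0)
samples = rng.integers(0, k, size=n)          # placeholder data stream
print(windowed_empirical_distribution(samples, k, r=100)[:5])
for r in (10, 100, 1000):                     # larger r: smaller statistical
    print(r, tv_error_bound(deltas, k, r))    # term, larger drift term
```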
The value $r^*$ of Theorem 4.2 represents an optimal criterion (up to constants) to solve this trade-off for any regular drift sequence $\Delta_n$. The upper bound of the theorem follows by setting the parameter $r$ equal to $r^*$, for which $\Delta_{n-r^*+1} \le \sqrt{k/r^*}$.

4.3. Lower Bound

The lower bound of Theorem 4.2 is proven by using information-theoretic tools from minimax theory (Yu, 1997). In particular, we select a particular family of product probability distributions over $\mathcal{X}^n$ from $\mathcal{S}_n(\Delta_n, k)$, and obtain our lower bound by arguing that the observed values do not provide enough information to distinguish among those distributions (Lemma 4.5). This family of product probability distributions is constructed as follows. Let $r$ be a parameter such that $1 \le r \le n$. Each product probability distribution in our family has the same distribution for the first $n-r$ random variables. That is, the first $n-r$ random variables provide no information to decide among the family. For each product distribution in this family, the last $r$ random variables steadily drift in distribution in a distinct direction, subject to the constraint of the drift sequence $\Delta_n$. We obtain a trade-off: if the value of $r$ is large, it is easier to decide among the family, as we have more time to drift apart, but we make a bigger error if we cannot decide correctly. Conversely, if the value of $r$ is small, it is harder to decide among the family, but we make a smaller error, as there is less time to drift apart. Similarly to the upper bound, we obtain a tight lower bound by setting the parameter $r$ equal to $r^*$ as defined in Theorem 4.2. This choice is adopted throughout this subsection.

We distinguish two cases: (a) $r^* = n$ and (b) $r^* < n$. In case (a), we argue that the minimax error is at least equal to the lower bound $\Omega(\sqrt{k/r^*})$ for discrete density estimation with $n = r^*$ independent and identically distributed samples. In the remainder of this subsection, we focus on case (b).

In order to establish the lower bound for the minimax risk and the minimax cumulative risk, we use Assouad's Lemma as the main technical tool. This is the first work to use this information-theoretic tool to provide lower bounds in a drift setting. Assouad's Lemma uses a family of probability distributions $\{S_w : w \in \{0,1\}^m\}$ indexed over a hypercube $\{0,1\}^m$ for some $m \ge 1$. For two binary strings $v, w \in \{0,1\}^m$, their Hamming distance is defined as $h(v, w) = \sum_{i=1}^{m} \mathbf{1}\{v_i \ne w_i\}$.

Lemma 4.5 (Assouad's Lemma). Let $\theta(\cdot)$ be a target property to estimate. Let $\{S_w : w \in \{0,1\}^m\} \subseteq \mathcal{S}$ be a family of probability distributions indexed by $w$. Let $p \ge 1$. Then:
$$\inf_{\hat\theta}\, \sup_{S \in \mathcal{S}}\, \mathbb{E}_{X \sim S}\Big[ 2^p d^p\big(\hat\theta(X), \theta(S)\big) \Big] \;\ge\; \min_{v \ne w} \frac{d^p\big(\theta(S_w), \theta(S_v)\big)}{h(v, w)} \cdot \frac{m}{4} \cdot \min_{\substack{v,w:\, \|w\|_1 > \|v\|_1 \\ h(v,w)=1}} e^{-\mathrm{KL}(S_w \| S_v)} ,$$
where $\mathrm{KL}$ is the Kullback-Leibler divergence and $\hat\theta$ is any estimator of $\theta(S)$.

Our statement of Assouad's Lemma follows immediately by adapting its classic statement (Van der Vaart, 2000, Lemma 24.3) to our notation. Differently from the latter statement, we state it with the KL divergence by using the known inequality $\|P \wedge Q\| \ge (1/2)\exp(-\mathrm{KL}(P\|Q))$, which holds for any distributions $P$ and $Q$. Our formulation is more convenient for the computations of this paper.

Without loss of generality, assume that $k$ is even. Our goal is to properly construct a family of sequences of drifting distributions and apply Assouad's Lemma. We construct a family of product distributions $\{S_w : w \in \{0,1\}^{k/2}\}$ as follows. For each $w \in \{0,1\}^{k/2}$, $S_w = P_{w,1} \otimes \cdots \otimes P_{w,n}$ is the product distribution of $n$ discrete probability distributions over $[k]$.
For any $j \in [k]$ and $w \in \{0,1\}^{k/2}$, we define
$$P_{w,i}(j) = \begin{cases} \dfrac{1}{k} & \text{if } i \le n - r^* \\[4pt] \dfrac{1}{k} + (-1)^j\, w_{\lceil j/2 \rceil}\, \dfrac{\Delta_{n-r^*+1} - \Delta_i}{k} & \text{if } i > n - r^* \end{cases}$$
Intuitively, for any $i \ge n-r^*+1$, if $w_j = 1$, then the probabilities of the elements $2j-1$ and $2j$ change as follows: $P_{w,i}(2j-1)$ decreases, while the probability $P_{w,i}(2j)$ increases by the same amount. The following proposition shows that our family of product distributions is well-defined.

Proposition 4.6. $\{S_w : w \in \{0,1\}^{k/2}\} \subseteq \mathcal{S}_n(\Delta_n, k)$.

Proof. First, we have that each $P_{w,i}$ is a well-defined probability distribution for any $w$ and $i$, as $\sum_{j \in [k]} P_{w,i}(j) = 1$ by construction, and $0 \le P_{w,i}(j) \le 1$ for any $j \in [k]$, since $\Delta_{n-r^*+1} \le \sqrt{k/r^*} < 1$ by the assumption of the theorem. Second, $S_w$ satisfies the assumptions on the drift sequence $\Delta_n$ of Definition 4.1. In fact, for any $i < n-r^*+1$, we have that $\mathrm{TV}(P_{w,i}, P_{w,i+1}) = 0 \le \Delta_i - \Delta_{i+1}$, and for any $i \ge n-r^*+1$, we have that
$$\mathrm{TV}(P_{w,i}, P_{w,i+1}) = \frac{1}{2}\sum_{\ell \in [k/2]:\, w_\ell = 1} \frac{2}{k}\,|\Delta_i - \Delta_{i+1}| = \frac{(\Delta_i - \Delta_{i+1})\,\|w\|_1}{k} \le \Delta_i - \Delta_{i+1} .$$
By using the triangle inequality, this also implies that $\mathrm{TV}(P_{w,i}, P_{w,n}) \le \Delta_i$ for any $i \in [n]$.

Let $\theta_n(\cdot)$ be defined as in Section 2. The next two technical propositions show how to compute the quantities required to apply Assouad's Lemma in our setting.

Proposition 4.7. Given $w, w' \in \{0,1\}^{k/2}$, we have that $\mathrm{TV}\big(\theta_n(S_w), \theta_n(S_{w'})\big) = (\Delta_{n-r^*+1}/k)\, h(w, w')$.

Proof. By definition of $\theta_n(\cdot)$, we have that $\mathrm{TV}(\theta_n(S_w), \theta_n(S_{w'})) = \mathrm{TV}(P_{w,n}, P_{w',n})$. Thus,
$$\mathrm{TV}(P_{w,n}, P_{w',n}) = \frac{1}{2}\sum_{\ell=1}^{k/2} \frac{2\,\Delta_{n-r^*+1}}{k}\,|w'_\ell - w_\ell| ,$$
and the statement follows by observing that $\Delta_n = 0$ and that $\sum_\ell |w'_\ell - w_\ell| = h(w, w')$.

Proposition 4.8. Let $w, w' \in \{0,1\}^{k/2}$ be such that $h(w, w') = 1$, and let $q$ be the bit in which they differ. Assume that $w_q = 1$. Then $\mathrm{KL}(S_w \| S_{w'}) \le 2$.

Proof. By using the factorization property of the KL divergence (see Proposition A.1 in the appendix), we have that $\mathrm{KL}(S_w \| S_{w'}) = \sum_{i=n-r^*+1}^{n} \mathrm{KL}(P_{w,i} \| P_{w',i})$, since $P_{w,i} = P_{w',i}$ for $i < n-r^*+1$. By using the definition of KL and the definition of $S_w$, we obtain
$$\mathrm{KL}(P_{w,i} \| P_{w',i}) = P_{w,i}(2q) \log\frac{P_{w,i}(2q)}{P_{w',i}(2q)} + P_{w,i}(2q-1) \log\frac{P_{w,i}(2q-1)}{P_{w',i}(2q-1)} ,$$
as $P_{w,i}$ and $P_{w',i}$ only differ on the elements $2q-1$ and $2q$ for $i \ge n-r^*+1$. If we expand the computation above, writing $\delta_i = \Delta_{n-r^*+1} - \Delta_i$ for brevity, we have
$$\mathrm{KL}(S_w \| S_{w'}) = \frac{1}{k}\sum_{i=n-r^*+1}^{n} \big[(1 + \delta_i)\log(1 + \delta_i) + (1 - \delta_i)\log(1 - \delta_i)\big] .$$
For any $i \ge n-r^*+1$, the following chain of inequalities holds: $\delta_i \le \Delta_{n-r^*+1} \le \sqrt{k/r^*} < 1$. Thus, we can use the inequality $(1+x)\log(1+x) + (1-x)\log(1-x) \le 2x^2$, which holds for any $|x| < 1$ (see Proposition A.2). We obtain:
$$\mathrm{KL}(S_w \| S_{w'}) \le \frac{2}{k}\sum_{i=n-r^*+1}^{n} \delta_i^2 \le \frac{2 r^*}{k}\,\Delta_{n-r^*+1}^2 ,$$
where the last inequality follows from the assumption that the sequence $\Delta_1, \ldots, \Delta_n$ is non-increasing, so that $0 \le \delta_i \le \Delta_{n-r^*+1}$. We can conclude the proof by observing that, due to the definition of $r^*$, it holds that $\Delta_{n-r^*+1}^2 \le k/r^*$.

We apply Assouad's Lemma with the family of product distributions $\{S_w : w \in \{0,1\}^{k/2}\} \subseteq \mathcal{S}_n(\Delta_n, k)$ (with $p = 1$, $d = \mathrm{TV}$ and $m = k/2$), and obtain the following lower bound on the minimax risk:
$$\inf_{\hat\theta_n}\, \sup_{S \in \mathcal{S}_n(\Delta_n, k)}\, \mathbb{E}_{X^n \sim S}\Big[\mathrm{TV}\big(\hat\theta_n, \theta_n(S)\big)\Big] \;\ge\; \frac{1}{2}\cdot\min_{v \ne w} \frac{\mathrm{TV}\big(\theta_n(S_w), \theta_n(S_v)\big)}{h(v, w)} \cdot \frac{k}{8} \cdot \min_{\substack{v,w:\, \|w\|_1 > \|v\|_1 \\ h(v,w)=1}} e^{-\mathrm{KL}(S_w \| S_v)} .$$
We use Propositions 4.7 and 4.8 to lower bound the above expression, and obtain
$$\inf_{\hat\theta_n}\, \sup_{S \in \mathcal{S}_n(\Delta_n, k)}\, \mathbb{E}_{X^n \sim S}\Big[\mathrm{TV}\big(\hat\theta_n, \theta_n(S)\big)\Big] \;\ge\; \frac{\Delta_{n-r^*+1}}{16 e^2} = \Omega\big(\Delta_{n-r^*+1}\big) .$$
Note that since $r^* < n$, by using the definition of $r^*$, we have that $\Delta_{n-r^*} > \sqrt{k/(r^*+1)}$. Due to the regularity assumption on the drift sequence $\Delta_n$, there exists a constant $c$ such that $\Delta_{n-r^*}/\Delta_{n-r^*+1} \le c$. Therefore, we have that $\Delta_{n-r^*+1} \ge \frac{1}{c}\Delta_{n-r^*} = \Omega\big(\sqrt{k/r^*}\big)$, and this concludes the proof of the lower bound of Theorem 4.2.
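For readers who wish to experiment with the lower-bound family, the following sketch builds the marginals $P_{w,i}$ as reconstructed above (the helper name and sanity checks are ours, not from the paper) and verifies that they are valid distributions satisfying the drift constraint $\mathrm{TV}(P_{w,i}, P_{w,n}) \le \Delta_i$.

```python
import numpy as np

def lower_bound_marginal(w, i, n, r_star, deltas, k):
    """Marginal P_{w,i} of the lower-bound family of Section 4.3 (as
    reconstructed above): uniform for i <= n - r_star, and perturbed on the
    pairs (2j-1, 2j) with w_j = 1 afterwards. Indices i and j are 1-based;
    k is assumed even and w has length k/2."""
    p = np.full(k, 1.0 / k)
    if i > n - r_star:
        shift = (deltas[n - r_star] - deltas[i - 1]) / k   # (Delta_{n-r*+1} - Delta_i)/k
        for j, bit in enumerate(w, start=1):               # j = 1, ..., k/2
            if bit:
                p[2 * j - 2] -= shift                      # element 2j-1 decreases
                p[2 * j - 1] += shift                      # element 2j   increases
    return p

# Sanity checks: valid distribution, and TV(P_{w,i}, P_{w,n}) <= Delta_i.
k, n, step = 8, 200, 1e-2
deltas = [step * (n - i) for i in range(1, n + 1)]
r_star, w = 40, (1, 0, 1, 1)
for i in (1, n - r_star + 1, n):
    p_i, p_n = (lower_bound_marginal(w, t, n, r_star, deltas, k) for t in (i, n))
    tv = 0.5 * np.abs(p_i - p_n).sum()
    print(i, round(p_i.sum(), 6), tv <= deltas[i - 1] + 1e-12)
```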
4.4. Minimax Rate for the Average Risk

Theorem 4.2 provides the minimax risk for estimating a distribution at a given time $n$. In this subsection, we want to characterize the minimax rate of the average risk, defined as in (2), for the online version of this problem. In particular, we want to show that the lower bound proven in Theorem 4.2 for a specific time $n$ is not a rare event, but can hold on average over arbitrarily long sequences of estimates. We study this problem for the case of bounded drift at each step, i.e. there exists a constant $\Delta > 0$ such that $\mathrm{TV}(P_i, P_{i+1}) \le \Delta$ for any $i \le n-1$. Let $\mathcal{S}_n(\Delta, k)$ denote the family of product distributions $S = P_1 \otimes \cdots \otimes P_n$ over $\mathcal{X}^n$ for which this property holds. The minimax rate of the average risk over $n$ steps is
$$\Pi_n(\Delta, k) \doteq \inf_{\hat\theta_1, \ldots, \hat\theta_n}\, \sup_{S \in \mathcal{S}_n(\Delta, k)}\, \mathbb{E}_{X^n \sim S}\Bigg[ \frac{1}{n}\sum_{t=1}^{n} \mathrm{TV}\big(\hat\theta_t(X^t), \theta_t(S)\big) \Bigg] ,$$
and we let $\Pi(\Delta, k) \doteq \lim_{n \to \infty} \Pi_n(\Delta, k)$. The value $\Pi(\Delta, k)$ represents the limit minimax rate over an arbitrarily large number of steps for the online estimation of a discrete density over $[k]$ under the assumption that the distance between two consecutive distributions is upper bounded by $\Delta$. We can show the following result.

Theorem 4.9. Let $\Delta \in (0, 1/k)$. Then, we have that $\Pi(\Delta, k) = \Theta((k\Delta)^{1/3})$.

As noted in Section 4.1, this result also applies to the online problem of agnostically learning a family of binary classifiers with VC dimension $k$ in the setting of bounded drift at each step. The upper bound is an immediate corollary of Theorem 4.2. The difficulty is in obtaining the lower bound. We note that the lower bound construction of Section 4.3 cannot be used to prove a lower bound on the minimax rate for the average risk. In fact, that construction relies on the fact that at a given time $n$, a drift has occurred only over the latest $r^*$ distributions, and it does not provide a tight lower bound for the estimation at all times $i \le n$.

In order to prove the lower bound of Theorem 4.9, we adopt a different construction. We provide a sketch of the proof. Let $n = m\nu$. We consider product distributions $S = P_1 \otimes \cdots \otimes P_n$ that can be partitioned into $m$ blocks of length $\nu$. Let $B_\ell = P_{(\ell-1)\nu+1} \otimes \cdots \otimes P_{\ell\nu}$ be the product distribution of block $\ell$, i.e. the distribution of the random variables $(X_{(\ell-1)\nu+1}, \ldots, X_{\ell\nu})$. We let $S$ exhibit a periodic structure. In particular, we guarantee that the first distribution and the last distribution of each block $B_\ell$ are uniform distributions over $[k]$ for every $\ell \in [m]$. This property plays a double role: it allows us to construct $S$ by considering a sequence of blocks, and the estimation of each block is independent, since samples outside of block $\ell$ do not help to decide the distribution $B_\ell$, due to the periodic structure of $S$. The proof of the lower bound revolves around the fact that estimating each individual block is hard. In particular, we can show a lower bound of $\Omega\big((k\Delta)^{1/3}\big)$ on the average error of estimating a block $B_\ell$ given the observations up to time $\nu\ell$. This result is obtained by using Assouad's Lemma on a properly defined family of blocks $\mathcal{B}$. For any sequence of estimators $\hat\theta_1, \ldots, \hat\theta_n$, we use this result to show how to iteratively construct a sequence of blocks $B_1, \ldots, B_m$ from $\mathcal{B}$ such that the average risk of those estimators with respect to the distribution $S = B_1 \otimes \cdots \otimes B_m$ is $\Omega\big((k\Delta)^{1/3}\big)$. By taking $m \to \infty$, this is sufficient to prove the lower bound. The details of the full proof are deferred to the appendix.
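A natural online algorithm matching the upper bound of Theorem 4.9 simply reuses the Section 4.2 estimator at every step, with a window capped at roughly $(k/\Delta^2)^{1/3}$, which is the choice made in the upper-bound proof in the appendix. The sketch below (the function name is ours, not from the paper) is one way to implement it.

```python
import math
import numpy as np

def online_window_estimates(samples, k, step):
    """At each time t, apply the Section 4.2 estimator with window size
    r_t = min(t, ceil((k / step**2) ** (1/3))), the choice suggested by
    Theorem 4.2 under bounded drift `step` at each time step."""
    r_cap = max(1, math.ceil((k / step**2) ** (1 / 3)))
    estimates = []
    for t in range(1, len(samples) + 1):
        r = min(t, r_cap)
        window = np.asarray(samples[t - r:t])
        estimates.append(np.bincount(window, minlength=k) / r)
    return estimates   # estimates[t-1] approximates P_t in total variation
```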
5. Smooth Density Estimation

In this section, we establish the minimax risk for the problem of estimating smooth densities under distribution drift. Let our sample space be an arbitrary interval $I \subseteq \mathbb{R}$, i.e. $\mathcal{X} = I$. Given $X^n$, our goal is to estimate the density of the distribution $P_n$. In this setting, we also use $P(x)$ to refer to the continuous density of a distribution $P$ at $x \in \mathcal{X}$. Following previous work on nonparametric density estimation, we characterize the smoothness of a density in a Sobolev sense (Tsybakov, 2009).

Definition 5.1. Let $\beta \in \mathbb{N}_+$. A probability density $P$ over $\mathcal{X}$ is $\beta$-smooth if $P$ is differentiable $\beta$ times, $P^{(\beta-1)}$ is absolutely continuous, and $\int \big(P^{(\beta)}(x)\big)^2 dx < \infty$.

In order to evaluate the error of our estimate, we use the squared $L_2$ distance between densities. Given two densities $f$ and $g$ over $\mathcal{X}$, their $L_2$ distance is defined as $L_2(f, g) \doteq \|f - g\| = \sqrt{\int (f(x) - g(x))^2\, dx}$. In the density estimation literature, the quantity $L_2^2$ is also referred to as the mean integrated squared error, and it is the most commonly used measure of error. In our work, we consider the following family of smooth probability measures $\mathcal{S}_n(\Delta_n, \beta)$ over $\mathcal{X}^n$ with regular drift $\Delta_n$.

Definition 5.2. Let $\Delta_n \in \mathbb{R}^n_{\ge 0}$ be a regular drift sequence, and let $\beta > 0$. A product distribution $S = P_1 \otimes \cdots \otimes P_n$ over $\mathcal{X}^n$ belongs to $\mathcal{S}_n(\Delta_n, \beta)$ if and only if: (a) $P_i$ is $\beta$-smooth for every $i \in [n]$; (b) $\Delta_n$ is a regular drift sequence for $S$ with respect to the metric $L_2$.

We can establish the following minimax risk in this setting.

Theorem 5.3. Let $\Delta_n \in \mathbb{R}^n_{\ge 0}$ be a regular drift sequence and let $\beta > 0$. Let $\mathcal{S}_n(\Delta_n, \beta)$ be defined as in Definition 5.2, and let
$$r^* = \max\Big\{ r \in [n] : \Delta_{n-r+1} \le (1/r)^{\beta/(2\beta+1)} \Big\} .$$
Assume that $r^* \ge 1$ is well-defined. We have:
$$\inf_{\hat\theta_n}\, \sup_{S \in \mathcal{S}_n(\Delta_n, \beta)}\, \mathbb{E}_{X^n \sim S}\Big[ \big\|\theta_n(S) - \hat\theta_n(X^n)\big\|^2 \Big] = \Theta\Big((r^*)^{-2\beta/(2\beta+1)}\Big) .$$

If $\Delta_n \to 0$, we have that $r^* = n$, and we retrieve the known minimax rate $\Theta\big(n^{-2\beta/(2\beta+1)}\big)$ for estimating a $\beta$-smooth density from $n$ independent and identically distributed samples.

We can achieve the upper bound of Theorem 5.3 with a properly constructed kernel density estimator. A kernel $K$ is a function $K : \mathbb{R} \to \mathbb{R}$ such that $\int K(u)\,du = 1$. Given a kernel $K$ and a smoothing parameter $h$, the Parzen-Rosenblatt kernel density estimator (Rosenblatt, 1956; Parzen, 1962) over the previous $r$ samples is defined as
$$\hat P^r_{K,h}(x) = \frac{1}{rh}\sum_{i=n-r+1}^{n} K\Big(\frac{X_i - x}{h}\Big) .$$
The parameter $h$ is also referred to as the bandwidth. In order to obtain an accurate estimator for highly smooth functions, we need to define a special class of kernel functions.

Definition 5.4. Let $\beta \ge 1$ be an integer. We say that $K : \mathbb{R} \to \mathbb{R}$ is a kernel of order $\beta$ if the functions $u \mapsto u^j K(u)$, with $j = 0, 1, \ldots, \beta$, are integrable and satisfy
$$\int K(u)\,du = 1, \qquad \int u^j K(u)\,du = 0 \ \text{ for } j = 1, \ldots, \beta, \qquad \int K^2(u)\,du < \infty, \qquad \int |u|^\beta |K(u)|\,du < \infty .$$

We refer to the work of Tsybakov (2009) for a discussion of kernels of order $\beta > 1$. It can be proven that a kernel of order $\beta \ge 2$ cannot be non-negative, and therefore we could obtain an estimate of the density that is negative. This problem can be addressed by taking only the positive part of the estimate, as described in the previous reference. If we let $K$ be a kernel of order $\beta$, we can prove that for any $P_1 \otimes \cdots \otimes P_n = S \in \mathcal{S}_n(\Delta_n, \beta)$, it holds that
$$\mathbb{E}\,\big\|\hat P^r_{K,h} - P_n\big\|^2 = O\Big( \Delta^2_{n-r+1} + \frac{1}{rh} + h^{2\beta} \Big) .$$
Observe that the optimal choice of the bandwidth $h$ to minimize the above upper bound is independent of the drift sequence $\Delta_n$. By choosing the value $h = \Theta\big(r^{-1/(2\beta+1)}\big)$, the previous upper bound becomes
$$\mathbb{E}\,\big\|\hat P^r_{K,h} - P_n\big\|^2 = O\Big( \Delta^2_{n-r+1} + r^{-2\beta/(2\beta+1)} \Big) .$$
This bound represents a trade-off between the drift error and the statistical error of the estimation. If we choose $r = r^*$, we obtain the upper bound of the theorem.
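The following sketch (helper name ours, not from the paper) implements the windowed Parzen-Rosenblatt estimator with the bandwidth $h = r^{-1/(2\beta+1)}$ discussed above. It defaults to a Gaussian kernel, which is of order 1 and hence only matches the analysis for $\beta = 1$; a kernel of order $\beta$ as in Definition 5.4 would be needed for smoother densities.

```python
import numpy as np

def windowed_kde(x, samples, r, beta=1, kernel=None):
    """Parzen-Rosenblatt estimate over the latest r samples with the
    bandwidth h = r**(-1/(2*beta+1)) suggested by the analysis above."""
    if kernel is None:
        kernel = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # order-1 kernel
    window = np.asarray(samples[-r:], dtype=float)
    h = r ** (-1.0 / (2 * beta + 1))
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # (1/(r*h)) * sum_i K((X_i - x)/h), evaluated at each query point x
    u = (window[None, :] - x[:, None]) / h
    return kernel(u).sum(axis=1) / (r * h)

# Example: samples whose mean drifts slowly; estimate the current density.
rng = np.random.default_rng(0)
n = 2000
samples = [rng.normal(loc=0.5 + 1e-4 * i, scale=0.2) for i in range(n)]
grid = np.linspace(0, 1.5, 5)
print(windowed_kde(grid, samples, r=400))
```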
The lower bound of the theorem is proven by using a strategy similar to the one used for discrete densities in Section 4.3: we construct a family of product distributions that satisfy the assumption on the drift and use Assouad's Lemma. We defer the details of the proof to the appendix. We point out that it is also possible to prove an average minimax risk result similar to Theorem 4.9 for smooth densities by modifying the proof of the discrete case.

6. Conclusion and Open Questions

We obtain tight minimax risk bounds for the discrete and smooth density estimation problems under a general model of distribution drift. We also present the first average minimax risk rate in the drift setting. Our results also apply to the important problem of agnostic learning of a family of binary classifiers, improving the known state-of-the-art bounds in the drift setting.

In this work, we focus on the univariate case for smooth density estimation. Univariate kernel density estimation methods naturally extend to the multivariate setting with similar assumptions (Ibragimov & Khas'minskii, 1983). In the i.i.d. case, the minimax rate becomes $O\big(n^{-2\beta/(2\beta+d)}\big)$ with $n$ samples, where $\beta$ is the smoothness of the density and $d$ is the dimensionality of the space. We believe our framework for analysis with drift can provide a characterization of the minimax rate for the multivariate case, and this is an interesting future direction. Another interesting open problem is to provide a competitive algorithm that is oblivious to the drift sequence (Hanneke & Yang, 2019). We refer to (Hanneke et al., 2015) for preliminary results in this direction for the problem of realizable supervised learning under distribution drift.

Acknowledgements. This material is based on research sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under agreement number FA8750-19-2-1006, and by the National Science Foundation (NSF) under award IIS-1813444. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, AFRL, or the U.S. Government.

References

Anthony, M. and Bartlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2002.

Baby, D. and Wang, Y.-X. Online forecasting of total-variation-bounded sequences. Advances in Neural Information Processing Systems, 2019.

Baby, D. and Wang, Y.-X. Adaptive online estimation of piecewise polynomial trends. Advances in Neural Information Processing Systems, 2020.

Bartlett, P. L. Learning with a slowly changing distribution. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992.

Barve, R. D. and Long, P. M. On the complexity of learning from drifting distributions. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, 1996.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79(1):151-175, 2010.

Berend, D. and Kontorovich, A. A sharp estimate of the binomial mean absolute deviation with applications. Statistics & Probability Letters, 83(4):1254-1259, 2013.

Cohen, D., Kontorovich, A., and Wolfer, G.
Learning discrete distributions with infinite support. Advances in Neural Information Processing Systems, 2020.

Devroye, L. and Györfi, L. Nonparametric density estimation: the L1 view. Journal of the American Statistical Association, 1987.

Devroye, L. and Lugosi, G. Combinatorial methods in density estimation. Springer Science & Business Media, 2001.

García-Treviño, E. S. and Barria, J. A. Online wavelet-based density estimation for non-stationary streaming data. Computational Statistics & Data Analysis, 56(2):327-344, 2012.

Gillert, H. and Wartenberg, A. Density estimation for non-stationary Markov processes. Statistics: A Journal of Theoretical and Applied Statistics, 15(2):263-275, 1984.

Gokcesu, K. and Kozat, S. S. Online density estimation of nonstationary sources using exponential family of distributions. IEEE Transactions on Neural Networks and Learning Systems, 29(9):4473-4478, 2017.

Groeneboom, P. and Jongbloed, G. Nonparametric estimation under shape constraints, volume 38. Cambridge University Press, 2014.

Han, Y., Jiao, J., and Weissman, T. Minimax estimation of discrete distributions. In IEEE International Symposium on Information Theory, 2015.

Hanneke, S. and Yang, L. Statistical learning under nonstationary mixing processes. In The 22nd International Conference on Artificial Intelligence and Statistics, 2019.

Hanneke, S., Kanade, V., and Yang, L. Learning with a drifting target concept. In International Conference on Algorithmic Learning Theory, 2015.

Helmbold, D. P. and Long, P. M. Tracking drifting concepts using random examples. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory, 1991.

Helmbold, D. P. and Long, P. M. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1):27-45, 1994.

Ibragimov, I. and Khas'minskii, R. Estimation of distribution density. Journal of Soviet Mathematics, 21:40-57, 1983.

Kamath, S., Orlitsky, A., Pichapati, D., and Suresh, A. T. On learning distributions from their samples. In Conference on Learning Theory, 2015.

Kristan, M., Leonardis, A., and Skočaj, D. Multivariate online kernel density estimation with Gaussian kernels. Pattern Recognition, 44(10-11):2630-2642, 2011.

Long, P. M. The complexity of learning according to two models of a drifting environment. Machine Learning, 37(3):337-354, 1999.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. In 22nd Conference on Learning Theory, 2009.

Mohri, M. and Muñoz Medina, A. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, 2012.

Parzen, E. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.

Phillips, P. C. B. and Park, J. Y. Nonstationary density estimation and kernel autoregression. Research Papers in Economics, 1998.

Rosenblatt, M. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, pp. 832-837, 1956.

Roussas, G. G. Nonparametric estimation in Markov processes. Annals of the Institute of Statistical Mathematics, 21(1):73-87, 1969.

Scott, D. W. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.

Silverman, B. W. Density Estimation for Statistics and Data Analysis. Springer, 1986.

Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009.
Van der Vaart, A. W. Asymptotic statistics, volume 3. Cambridge University Press, 2000. Wen, J., Dai, B., Li, L., and Schuurmans, D. Batch stationary distribution estimation. In International Conference on Machine Learning, 2020. Yu, B. Assouad, fano, and le cam. Festschrift for Lucien Le Cam: research papers in probability and statistics, pp. 423 435, 1997. Nonparametric Density Estimation under Distribution Drift A. Technical Propositions Definition of KL-divergence. Let P and Q be two distributions over X. In the continuous case (probability density function, X = R), their Kullback Leibler divergence is defined as KL(P Q) = Z R P(x) log P(x) In the discrete case (probability mass function, X is finite), their KL divergence is defined as KL(P Q) = X x X P(x) log P(x) The following proposition on the KL-divergence is folklore. Proposition A.1 (Factorization Property). Let P and Q two distributions over Rd n such that the distributions can be factorized, i.e. P = P1 . . . Pn and Q = Q1 . . . Qn. Then, we have that i=1 KL(Pi Qi) The following relation will prove useful. Proposition A.2. For any 1 < x < 1, we have that (1 + x) log(1 + x) + (1 x) log(1 x) 2x2 B. Average minimax risk of online estimation of discrete densities (Theorem 4.9) B.1. Proof of the upper bound Since the supremum is a convex function, we have that Πn( , k) inf ˆθ1,...,ˆθn t=1 sup S St( ,k) E Xt S TV(ˆθt, θt(S)) Let θt(Xt) be the estimator described in Section 4.2 for the estimation at time t that achieves the upper bound of Theorem 4.2. By using those estimators for t = 1, . . . , n, the above expression can be upper bounded as t=1 sup S St( ,k) E Xt S TV( θt(Xt), θt(S)) As previously discussed, for any t (k/ 2)1/3, we have that the expected error of the estimator θt(Xt) to estimate Pt is upper bounded by O((k )1/3) in the case of a bounded drift at each step. By taking n , we can conclude that Π( , k) = O((k )1/3). B.2. Proof of the lower bound Let n = νm, and let ν = 2(k/ 2)1/3. Let B = {Bw : w {0, 1}k/2} be a family of product distributions over X ν constructed as follows. The set B represents the family of possible candidate block distributions that we will use to build our lower bound. For any Bw = P w,1 . . . P w,ν and j [k], we let ( 1 k + ( 1)jw j/2 (i 1) k if i ν/2 P w,ν i+1(j) if i > ν/2 (5) We can observe that the first ν/2 components of each product distribution share similarities with the family of product distributions defined for the lower bound construction of Section 4.3 for the special case of bounded drift at each step. Nonparametric Density Estimation under Distribution Drift This is a properly defined family of product distributions. In fact, for any w and i, we have that P j Pw,i(j) = 1. Moreover, we have that |( 1)jw j/2 (i 1)/k| ν/(2k) = 1/3k1/3/k 1/k due to the assumption (0, 1/k). Hence, Pw,i(j) [0, 1]. We can also observe that B Sν( , k). In fact, for any two consecutive distributions Pw,i and Pw,i+1, we have that TV(Pw,i, Pw,i+1) 1 Observe that for any sequence of blocks B1, . . . , Bℓ B, the product distribution B1 . . . Bℓ Sνℓ( , k). This property will be exploited later in the proof. The next lemma shows that estimating the last block of a distribution S = B1 . . . Bm is hard for any m 1. It is proven by using Assouad s Lemma, and it is the core technical result that enables the analysis. Lemma B.1. Let ℓ [m]. Let B1, . . . , Bℓ 1 be any ℓ 1 blocks from B, and let S = B1 . . . Bℓ 1. We have: inf ˆθν(ℓ 1)+1,...,ˆθνℓ max S=S B: B B E Xℓν S TV(ˆθi(Xi), θi(S)) = Ω (k )1/3 Proof. 
It is convenient to rewrite the left-hand side of the equation of the lemma. Let V = V1 . . . Vν and W = W1 . . . Wν be two product distributions. We define the distance db(V, W) .= (1/ν) Pν i=1 TV(Wi, Vi). Given a product distribution S = P1 . . . Pℓν, we define θ#(S) = P(ℓ 1)ν+1 . . . Pℓν as the product distribution of the last ν components of S. Let θ(Xνℓ) be any estimator of the product distribution θ#(S) B. Observe that θ(Xℓν) = ˆθν(ℓ 1)+1(Xν(ℓ 1)+1) . . . ˆθνℓ(Xνℓ) is a possible estimator for θ#(S). Hence, we have that inf ˆθν(ℓ 1)+1,...,ˆθνℓ max S=S B: B B E Xℓν S TV(ˆθi(Xi), θi(S)) θ max S=S B: B B E Xℓν S db(θ(Xνℓ), θ#(S)) (6) Consider the family of product distributions {Sw = S Bw : w {0, 1}k/2} over the hypercube {0, 1}k/2, where S is defined in the statement of the Lemma. We want to invoke Assouad s Lemma on the estimation problem θ max Sw:w {0,1}k/2 E Xℓν S db(θ(Xνℓ), θ#(S)) that is equal to the right-hand side of (6). Let w, w {0, 1}k. We want to compute db(θ#(Sw), θ#(Sw )). By using the definition of θ#, we have that db(θ#(Sw), θ#(Sw )) = db(Bw, Bw ). Let Bw = P w,1 . . . P w,ν and Bw = P w ,1 . . . P w ,ν be defined as in (5). We have that db(θ#(Sw), θ#(Sw )) = 1 i=1 TV(P w,i, P w ,i) = 2 i=1 TV(P w,i, P w ,i) = 1 2k h(w, w ) Let w, w {0, 1}k such that h(w, w ) = 1. Let q denote the bit in which w and w differ, and assume that wq = 1 and w q = 0. We have that KL(Sw Sw ) = i=1 KL(P w,i P w ,i) Nonparametric Density Estimation under Distribution Drift due to the factorization property of the KL-divergence Proposition A.1, and the fact that the first ν(ℓ 1) components of the product distributions Sw and Sw coincide by construction. We have that i=1 KL(P w,i P w ,i) = i=1 KL(P w,i P w ,i) = 2 j=1 P w,i(j) log P w,i(j) P w ,i(j) Since w and w only differ on the bit q, we have P w,i(j) = P w ,i(j) only if j = 2q or j = 2q 1. By using the definition of P w,i and P w ,i, we obtain KL(Sw Sw ) = 2 log (1 + i ) + 1 i=0 {(1 + i) log (1 + i ) + (1 i) log (1 i )} i=0 i2 = O 2ν3 where in the first inequality we used Proposition A.2. By definition of ν, we have that ν3 = 8k/ 2, hence KL(Sw Sw ) = O(1). Therefore, we can apply Assouad s Lemma (Lemma 4.5), and obtain that θ max Sw:w {0,1}k/2 E Xℓν S db(θ(Xνℓ), θ#(S)) ν 16e O(1) = Ω (k )1/3 where the last equality follows by substituting the definition of ν. The lower bound of Theorem 4.9 is obtained by iteratively applying Lemma B.1 to construct a hard distribution given any sequence of estimators ˆθ1, . . . ˆθn. Proof of the lower bound of Theorem 4.9. Let n = νm, and let ν and B be defined as within this section. We want to show a lower bound to Πn(k, ) = inf ˆθ1,...,ˆθn sup S Sn( ,k) E Xn S TV(ˆθt(Xt), θt(S)) We prove a lower bound as follows. For any sequence of estimators ˆθ1, . . . , ˆθn, we show how to construct a sequence of blocks S = B 1 . . . B m from B such that TV(ˆθt(Xt), θt(S )) n = Ω (k )1/3 . Since S Sn( , k), this is sufficient to prove the lower bound of the theorem. Fix a sequence of estimators ˆθ1, . . . , ˆθn. We construct B 1, . . . , B m iteratively as follows. We let B 1 = argmax B B E Xν B TV(ˆθi(Xi), θi(B)) For any ℓ {2, . . . , m}, we let S (ℓ 1) = B 1 . . . B ℓ 1 and B ℓ= argmax B B E Xℓν S (ℓ 1) B TV(ˆθi(Xi), θi(S (ℓ 1) B)) Nonparametric Density Estimation under Distribution Drift Let S = S (m) = B 1 . . . B m. We have that due to the way that B 1 . . . 
B ℓ 1 are defined, Lemma B.1 applies, and for any ℓ [m], it holds E Xℓν S (ℓ) TV(ˆθi(Xi), θi(S (ℓ))) = Ω (k )1/3 (7) By using linearity of expectation, we have that TV(ˆθt(Xt), θt(S )) ℓ=1 E Xνℓ S (ℓ) i=(ℓ 1)ν+1 TV(ˆθi(Xi), θi(S (ℓ)) ℓ=1 Ω (k )1/3 = Ω (k )1/3 . C. Smooth Density Estimation under Distribution Drift C.1. Upper Bound of Theorem 5.3 The structure of the proof of the upper bound uses technical ideas from the analysis of kernel density estimator in a non drift setting (Tsybakov, 2009). Due to the length of the analysis, the proof is broken down into multiple lemmas. For the remaining of this subsection, let P1 . . . Pn = S Sn( n, β), where S is defined as in Definition 5.2. For any 1 r n, let P[r] be the average of the densities Pn r+1, . . . , Pn, i.e. P[r] = (1/r) Pn i=n r+1 Pi. For any x R, we define the following quantities b(x) = E Xn S h ˆP r K,h(x) i P[r](x) Bias of the estimator σ2(x) = E Xn S ˆP r K,h(x) E Xn S ˆP r K,h(x) 2 Variance of the estimator d2(x) = P[r](x) Pn(x) 2 Drift error We can obtain the following error decomposition based on the above quantities. Proposition C.1 (Error Decomposition). We have that E Pn ˆP r K,h 2 = E Z ˆP r K,h(x) Pn(x) 2 dx 2 Z d2(x)dx + 2 Z b2(x)dx + 2 Z σ2(x)dx Proof. By Tonelli-Fubini theorem, we can swap the integral and the expectation. We observe that E ˆP r K,h(x) Pn(x) 2 = E ˆP r K,h(x) P[r](x) + P[r](x) Pn(x) 2 2 E( ˆP r K,h(x) P[r](x))2 + 2(P[r](x) Pn(x))2 = 2 E( ˆP r K,h(x) P[r](x))2 + 2d2(x) The first inequality is due to the fact that (x + y)2 2x2 + 2y2 for any x, y R. We can use bias-variance decomposition and obtain that E( ˆP r K,h(x) P[r](x))2 = b2(x) + σ2(x) . In the next three propositions, we will now individually upper bound each source of error. Nonparametric Density Estimation under Distribution Drift Proposition C.2 (Upper bound on drift error). Z d2(x)dx = O( 2 n r+1) Proof. To prove the statement, we use Cauchy-Schwarz inequality: [P[r](x) Pn(x)]2 = 1 r (Pi(x) Pn(x)) 1 r (Pi(x) Pn(x)) i=n r+1 1/r i=n r+1 (1/r)(Pi(x) Pn(x))2 1 r |Pi(x) Pn(x)|2 Therefore, we have that Z d2(x)dx Z n X 1 r |Pi(x) Pn(x)|2dx Z |Pi(x) Pn(x)|2dx where the last inequality is due to the assumption on the regular drift n. Since n r+1 i for any i n r + 1, we can conclude that R d2(x)dx 2 n r+1 . The next two propositions follow by a slight modification of the the proofs from Tsybakov (2009, Proposition 1.4 and 1.5), as we adapt those results in the setting where each random variable is sampled from a different distribution. For completeness, we report the full proofs, and refer the reader to the previous reference for additional details. Proposition C.3 (Upper bound on variance error). Suppose that the kernel K satisfies m K = R K2(u)du < . Then: Z σ2(x)dx m K Proof. For i [n] and x X, consider the random variables ηi(x) = K((Xi x)/h) E K((Xi x)/h). The random variables η1(x), . . . , ηn(x) are independent and have zero mean, and their variance can be upper bounded as η2 i (x) E K2 Xi x Therefore, we have that i=n r+1 ηi(x) i=n r+1 E η2 i (x) 1 r2h2 i=n r+1 E K2 Xi x Nonparametric Density Estimation under Distribution Drift By integrating the above expression, we obtain that Z σ2(x)dx 1 r2h2 Z E K2 Xi x Z Z K2 zi x Z Z K2 zi x dx Pi(zi)dzi We remind that the Taylor expansion of a function f on R that is differentiable β times in each point of its domain can be written as follows. Let x R, u R, and h > 0, then f(x + uh) = f(x) + f (x)uh + . . . + (uh)β 0 (1 τ)β 1f (β)(x + τuh)dτ Proposition C.4 (Upper bound Bias Error). 
Let K be a kerner of order β. We have that Z b2(x)dx = O h2β Proof. Consider the bias error term b(x) = E h ˆP r K,h(x) i P[r](x) The function b(x) can be rewritten as Pi(z)dz h Pi(x) Z K(u) [Pi(x + uh) Pi(x)] du By using the Taylor expansion of Pi and the fact that K is a kernel of order β, we obtain that Z K(u) (uh)β 0 (1 τ)β 1P (β) i (x + τuh)dτ du Since the Kernel is of order β, we have that R K(u)uβP (β)(x)du = 0 for any density P, therefore we have that Z K(u) (uh)β 0 (1 τ)β 1P (β) i (x + τuh)dτ du Z K(u) (uh)β 0 (1 τ)β 1 P (β) i (x + τuh) P (β) i (x) dτ du Nonparametric Density Estimation under Distribution Drift We use Cauchy-Schwarz inequality to separate the contribution of each Pi, and obtain that Z K(u) (uh)β 0 (1 τ)β 1 P (β) i (x + τuh) P (β) i (x) dτ du Z K(u) (uh)β 0 (1 τ)β 1 P (β) i (x + τuh) P (β) i (x) dτ du 2 We use the above inequality to upper bound R b2(x)dx as follows Z Z K(u) (uh)β 0 (1 τ)β 1 P (β) i (x + τuh) P (β) i (x) dτ du 2 dx Now, we can proceed as in Tsybakov (2009, Proposition 1.5) to show an upper bound of O(h2β) for each integral within the sum over i. By doing so, we can conclude that Z b2(x)dx = O C.1.1. PROOF OF THE UPPER BOUND OF THEOREM 5.3 Let S Sn( n, β). Let K be a Kernel of order β as in Definition 5.4, and let h > 0. We consider the estimator ˆP r K,h defined as in Section 5. We use Proposition C.1 and have that E Pn ˆP r K,h 2 2 Z d2(x) + b2(x) + σ2(x) dx We upper bound each term of the right-hand side of the above inequality with Propositions C.2, C.3 and C.4: E Pn ˆP r K,h 2 = O h2β + 1 rh + 2 n r+1 We choose the bandwidth as h = r 1/(2β+1). In this case, the above upper bound becomes E Pn ˆP r K,h 2 = O r 2β/(β+1) + 2 n r+1 If we use as r the value r defined as in the theorem statement, it holds that 2 n r +1 (r ) 2β/(2β+1). We can conclude that E Pn ˆP r K,h 2 = O (1/r )2β/(2β+1) C.2. Lower Bound of Theorem 5.3 Let K : R 7 [0, ) be a function such that K(x) > 0 only if x ( 1/2, 1/2), R K(x)dx = 0, K is infinitely differentiable, K is β-smooth, K = supx |K(x)| 1, and K 2 = q R K2(x)dx 1. By following the example of Tsybakov (2009, p.93), we adopt the function K(x) = K0(4x + 1) K0(4x 1) (8) K0(x) = exp 1 1 u2 1{ 1 0. This property will be used throughout this section. In the next three propositions, we show that the family {Sw : w {0, 1}m} is well-defined, and we compute the quantities required to apply Assouad s Lemma. Proposition C.5. We have that {Sw : w {0, 1}m} Sn( n, β). Proof. For any w w {0, 1}m and i < n r +1, we have that Pw,i is trivially a well-defined density. If i n r +1, we have that Z Pw,i(x)dx = Z 1{0 x 1}dx + j=1 wj ( n r +1 i) Z K(m(x xj))dx = 1 where the last equality is due to the fact that R K(x)dx = 0. We also want to prove that Pw,i(x) 0 (the density is non-negative). We can show that for any x [0, 1], we have that j=1 wj ( n r +1 i) mβ K(m(x xj)) where in the first inequality, we used the fact that for any x, there exists at most one j such that K(m(x xj)) > 0. By definition of m and r , we have that n r +1/mβ (1/r )2β/(2β+1) 1, and this is sufficient to prove that the distributions Pw,i are non-negative. We also need to show that we satisfy the condition on the regular drift n. For any i < n r + 1, we have that Nonparametric Density Estimation under Distribution Drift Pw,i Pw,i+1 = 0 i i+1. If i n r + 1, we have that Pw,i Pw,i+1 = w 1 mβ | i i+1| s Z K2(mx)dx = w 1 K mβ+1/2 ( i i+1) i i+1 where the last inequality is due to the fact that p w 1 m1/2, K 1, and mβ 1 (as m 1). 
Finally, we show that the distributions Pw,i are β-smooth over [0, 1]. For i n r + 1, this is trivially true. In order to prove this for i > n r + 1, we consider P (β) w,i (x) = j=1 wj( n r +1 i)K(β)(m(x xj)) Therefore, by integrating the square of the above expression, we obtain that Z P (β) w,i (x) 2 dx = w 1( n r +1 i)2 Z K(β)(mx)2dx = w 1( n r +1 i)2 1 Z K(β)(x)2dx We have that w 1/m 1, and ( n r +1 i)2 2 n r +1 1 by assumption of the theorem (as r 1). Since R K(β)(x)2dx < as K is β-smooth, we can conclude that R P (β) w,i (x) 2 dx < , and Pw,i is β-smooth. Proposition C.6. Let w, w {0, 1}m. We have that θn(Sw) θn(Sw ) = p h(w, w ) K n r +1 Proof. We have that θn(Sw) θn(Sw ) = Pn,w Pn,w = n r +1 Z K2(m(x xj))dx h(w, w ) K n r +1 Proposition C.7. Let w, w {0, 1}m such that h(w, w ) = 1. Let q be the bit in which w and w differ, and assume that wq = 1. We have that KL(Sw Sw ) = O(1) Proof. By using the factorization property of the KL-divergence (Proposition A.1), we have that KL(Sw Sw ) = i=1 KL(Pw,i Pw ,i) Z 1 + ( n r +1 i)K(m(x xq)) log 1 + ( n r +1 i)K(m(x xq)) where in the last equality we used the definition of KL-divergence and the observation that Pw,i and Pw ,i can only differ on the interval [(q 1)/m, q/m]. By using the definition of K( ) of (8), for any i n r + 1 we can rewrite the above Nonparametric Density Estimation under Distribution Drift integral as Z 1 + ( n r +1 i)K(m(x xq)) log 1 + ( n r +1 i)K(m(x xq)) = Z h 1 + ( n r +1 i)K0(4mx) log 1 + ( n r +1 i)K0(4mx) + 1 ( n r +1 i)K0(4mx) log 1 ( n r +1 i)K0(4mx) We can observe that for any x R, we have that n r +1 i mβ K0(4mx) n r +1 where in the last inequality, we used n r +1/mβ (1/r )2β/(2β+1) 1, and the fact that K 1. Therefore, we can use Proposition A.2 and show that KL(Sw Sw) 2 m2β i=n r +1 ( n r +1 i)2 Z K0(4mx)2dx m2β+1 2 n r +1 K0 2 We have that K0 2 K 2 1. Also, m2β+1 = r by definition of m, and 2 n r +1 1. We can conclude that KL(Sw Sw) = O(1). C.2.1. PROOF OF THE LOWER BOUND OF THEOREM 5.3 We can assume that r < n. In fact, if r = n, the lower bound immediately follows from the lower bound Ω n 2β/(2β+1) for learning a β-smooth density with n independent and identically distributed samples. Let S = {Sw : w {0, 1}m} be defined as at the beginning of Appendix C.2. Due to Proposition C.5, we have that inf ˆθn sup S Sn( n,β) E Xn S θn(S) ˆθn(Xn) 2 inf ˆθn sup S S E Xn S θn(S) ˆθn(Xn) 2 We can lower bound the right-hand side of the above inequality by using Assouad s Lemma (Lemma 4.5) with metric d = L2 and p = 2. By using Proposition C.6 and C.7, we obtain that inf ˆθn sup S S E Xn S θn(S) ˆθn(Xn) 2 = Ω min w =w h(w, w ) n r +1 mβ 1/2 1 e O(1) We have that minw =w 1/ p h(w, w ) = 1/ m. By substituting the definition of m, we obtain that inf ˆθn sup S S E Xn S θn(S) ˆθn(Xn) 2 = Ω n r +1r β/(2β+1) By assumption, we have that r < n, hence n r = Ω (r ) β/(2β+1) . Using the definition of n, we have that n r +1 1 c n r = Ω (r ) β/(2β+1) . We can conclude that inf ˆθn sup S S E Xn S θn(S) ˆθn(Xn) 2 = Ω (r ) β/(2β+1)(r ) β/(2β+1) = Ω (r ) 2β/(2β+1)