# functional_linear_regression_of_cumulative_distribution_functions__2c66574c.pdf Published in Transactions on Machine Learning Research (02/2024) Functional Linear Regression of Cumulative Distribution Functions Qian Zhang zhan3761@purdue.edu Department of Statistics Purdue University Anuran Makur amakur@purdue.edu Department of CS and School of ECE Purdue University Kamyar Azizzadenesheli kamyara@nvidia.com Nvidia Corporation Reviewed on Open Review: https: // openreview. net/ forum? id= ZOq JCP4e Mk The estimation of cumulative distribution functions (CDF) is an important learning task with a great variety of downstream applications, such as risk assessments in predictions and decision making. In this paper, we study functional regression of contextual CDFs where each data point is sampled from a linear combination of context dependent CDF basis functions. We propose functional ridge-regression-based estimation methods that estimate CDFs accurately everywhere. In particular, given n samples with d basis functions, we show estimation error upper bounds of r Op a d{nq for fixed design, random design, and adversarial context cases. We also derive matching information theoretic lower bounds, establishing minimax optimality for CDF functional regression. Furthermore, we remove the burn-in time in the random design setting using an alternative penalized estimator. Then, we consider agnostic settings where there is a mismatch in the data generation process. We characterize the error of the proposed estimators in terms of the mismatched error, and show that the estimators are well-behaved under model mismatch. Moreover, to complete our study, we formalize infinite dimensional models where the parameter space is an infinite dimensional Hilbert space, and establish a self-normalized estimation error upper bound for this setting. Notably, the upper bound reduces to the r Op a d{nq bound when the parameter space is constrained to be d-dimensional. Our comprehensive numerical experiments validate the efficacy of our estimation methods in both synthetic and practical settings. 1 Introduction Estimating cumulative distribution functions (CDF) of random variables is a salient theoretical problem that underlies the study of many real-world phenomena. For example, Huang et al. (2021) and Liu et al. (2022) recently showed that estimating CDFs is sufficient for risk assessment, thereby making CDF estimation a key building block for such decision-making problems. In a similar vein, it is known that CDFs can also be used to directly compute distorted risk functions (Wirch & Hardy, 2001), coherent risks (Artzner et al., 1999), conditional value-at-risk and mean-variance (Cassel et al., 2023), and cumulative prospect theory risks (Prashanth et al., 2016). Furthermore, CDFs are also useful in calculating various risk functionals appearing in insurance premium design, portfolio design, behavioral economics, behavioral finance, and healthcare applications (Rockafellar et al., 2000; Shapiro et al., 2014; Prashanth et al., 2016; Wong et al., 2022). Given the broad utility of estimating CDFs, there is a vast (and fairly classical) literature that tries to understand this problem. Published in Transactions on Machine Learning Research (02/2024) In particular, the renowned Glivenko-Cantelli theorem (Cantelli, 1933; Glivenko, 1933) states that given independent samples of a random variable, one can construct a consistent estimator for its CDF. Tight non-asymptotic sample complexity rates for such estimation using the Kolmogorov-Smirnov (KS) distance as the loss have also been established in the literature (Cantelli, 1933; Glivenko, 1933; Dvoretzky et al., 1956; Massart, 1990). However, these results are all limited to the setting of a single random variable. In contrast, many modern learning problems, such as doubly-robust estimators in contextual bandits, treatment effects, and Markov decision processes (Huang et al., 2021; Kallus et al., 2019; Huang et al., 2022), require us to simultaneously learn the CDFs of potentially infinitely many random variables from limited data. Hence, the classical results on CDF estimation do not address the needs of such emerging learning applications. Contributions. In this work, as a first step towards developing general CDF estimation methods that fulfill the needs of the aforementioned learning problems, we study functional linear regression of CDFs, where samples are generated from CDFs that are convex combinations of context-dependent CDF bases. Our model resembles the well-studied linear regression and stochastic linear bandits problem. In linear regression, researchers analyzed finite-dimensional parametric models with pre-selected feature functions. These pre-designed features result from extensive feature engineering processes carried out for the underlying task. Similarly, within the domain of contextual bandits, researchers studied the stochastic linear bandit problem using a linear model (Lattimore & Szepesvári, 2020, Equation (19.1)) with finite dimension and known feature map. Thus, it is natural to commence the analysis assuming the access to known feature CDFs, which ultimately bestows the advantages intrinsic to linear regression. As our main contribution, we define both least-squares regression and ridge regression estimators for the unknown linear weight parameter, and establish corresponding estimation error bounds for the fixed design, random design, adversarial, and self-normalized settings. In particular, given n samples with d CDF bases, we prove estimation error upper bounds that scale like r Op a d{nq (neglecting sub-dominant factors). Our derivations are inspired by the classical finite dimensional fixed design, random design, and adversarial self-normalized theories (Peña et al., 2008; Abbasi-Yadkori et al., 2011b). Our results achieve the same problem-dependent scaling as in canonical finite dimensional linear regression (Abbasi-Yadkori et al., 2011b;a; Hsu et al., 2012b), and importantly, in contrast to the mentioned works, do not depend on the label/reward/response magnitude. Moreover, we derive Ωp a d{nq information theoretic lower bounds for functional linear regression of CDFs. This establishes minimax estimation rates of rΘp a d{nq for the CDF functional regression problem. We later show that this result directly implies the concentration of CDFs in KS distance. We also propose a new penalized estimator that theoretically eliminates the requirement on the burn-in time of sample size in the random design setting. Then, we consider agnostic settings where there is a mismatch between our linear model and the actual data generation process. We characterize the estimation error of the proposed estimator in terms of the mismatch error, and demonstrate that the estimator is well-behaved under model mismatch. To complete our study, we generalize the parameter space in the linear model from finite-dimensional Euclidean spaces to general infinite-dimensional Hilbert spaces, extend the ridge regression estimator to the infinite-dimensional model with proper regularization, and establish a corresponding self-normalized estimation error upper bound which immediately recovers our previous r Op a d{nq upper bound when the parameter space is restricted to be d-dimensional. Finally, we present numerical results for synthetic and real data experiments to illustrate the performance of our estimation methods. Related works. A complementary approach to the proposed CDF regression framework is quantile regression (Koenker & Bassett Jr, 1978). Although quantile regression may appear to be closely related to CDF regression at first glance, the two problems have very different flavors. Indeed, unlike CDFs, quantiles are not sufficient for law invariant risk assessment. Besides, due to their infinite range, quantile estimation is quite challenging, resulting in analyses that only consider pointwise estimation (Takeuchi et al., 2006). However, as it is necessary to estimate multiple quantiles for CDF estimation, a simultaneous analysis of multiple quantile estimates is needed theoretically, which typically requires a union bound on the failure probability that increases linearly with the number of estimates. Furthermore, the estimated multiple quantiles may not be monotonically increasing with respect to the probability values, requiring extra effort to construct a valid CDF from a finite series of quantile estimates. Moreover, any such construction will incur a non-convergent KS distance between the estimated CDF and the true CDF for some distribution, as a general CDF can exhibit jumps or flat regions at any position. Additionally, the quality of the estimated CDF from multiple estimated Published in Transactions on Machine Learning Research (02/2024) quantiles relies heavily on the selection of grid points of probability values, which is instance-dependent and may require knowledge of the distribution the learner seeks to estimate. Thus, establishing a universal rule for choosing grid points that yield reasonable CDF estimates via quantile regression across diverse distributions proves challenging. In practice, the introduction of grid points introduces numerous hyperparameters to tune, adding artificial complexity to the methodology. Perhaps more importantly, quantile regression can be ill-posed in many machine learning settings. For example, quantiles are not estimatable in decision-making problems and games with mixed random variables (which take both discrete and continuous values). For these reasons, our focus in this paper will be on CDF regression. Several works have delved into the realm of conditional CDF estimation. Hall et al. (1999) estimated conditional CDFs for fixed cutoff y and context x using local logistic methods and adjusted Nadaraya-Watson estimators. However, their analysis necessitates the assumption of strong regularity conditions on the conditional CDF (including at least continuous second-order derivatives), the marginal CDF of the context, and the data generating process. They established asymptotic convergence only for fixed cutoff and context. Ferraty et al. (2006) introduced a kernel-type nonparametric estimator for conditional CDFs at a fixed context x. Their analysis mandates that the samples are independent and identically distributed (iid), in addition to some regularity assumptions concerning the marginal distribution of x and the smoothness of the conditional CDF. Their theoretical findings, too, revolve around asymptotic scenarios and apply solely to fixed contexts. Chung & Dunson (2009) proposed a special class of conditional CDFs based on probit stick-breaking process mixture models. They developed an MCMC algorithm for posterior sampling of parameters but did not furnish theoretical assurances regarding consistency. Distinguishing itself from existing endeavors, this paper introduces a novel linear model (1) or (18) where we presume knowledge of an arbitrary family of contextual CDFs and aim to estimate the weight parameter θ . Consequently, our model possesses the capability to encompass any conditional CDF, enabling the estimation of the conditional CDF across all values of the context x and cutoff y by estimating one parameter. Furthermore, we embrace an adversarial data generation process (see Scheme I in Section 2), which surpasses the limitations of the iid setting in terms of generality. We provide tight non-asymptotic analysis of the estimation error by showing matching upper bounds and lower bounds of the error. Additionally, our model (1) or (18) readily accommodates the integration of estimated CDFs from previous works on conditional CDF estimation into the family of feature contextual CDFs, thereby enhancing the overall quality of the conditional CDF estimates. Furthermore, the probability approximately correct and Vapnik Chervonenkis theory (Devroye et al., 2013) has been extended to CDF with new measures of complexities (Liu et al., 2022). Chernozhukov et al. (2013) and Koenker et al. (2013) study distribution regression where for a fixed cutoff y, they estimate parameters in conditional CDF models by maximizing log likelihood of 1ty ě Yiu for outcome samples Y1, . . . , Yn. Thus, both works require specific models for conditional CDFs. Chernozhukov et al. (2013) introduced a distribution regression model where the conditional CDF takes the form of a link function evaluated at the inner product of vector transformations of the context X and outcome Y . However, due to the dependence of the log likelihood on the cutoff y within this model, their estimator is inherently pointwise. They established asymptotic convergence of the estimated conditional CDF. Nonetheless, this hinges on certain assumptions concerning the true parameter functions, which is challenging to validate. Koenker et al. (2013) considered the linear local-scale model where the outcome is the summation of a linear local function of the context and the product of a linear scale function of the context and an independent random error boasting a smooth density. Their convergence results are of an asymptotic nature, assuming iid samples, alongside other conditions on the expected log likelihood and the asymptotic covariance function which also pose substantial verification challenges. Furthermore, the maximum likelihood estimation (MLE) used in both papers only accesses the indicators denoting whether the samples Y1, . . . , Yn surpass a fixed cutoff y, which underutilizes the wealth of information inherent in the samples. In stark contrast, our estimator (2) or (21) uses the one-sample empirical CDFs (1t Yi ď u) which fully exploit the sample information. Moreover, as previously mentioned, the estimated CDFs derived in the above distribution regression problems can be seamlessly integrated into our proposed model. In some literature, distribution regression takes on a distinctive meaning, referring to the model where the context is a sequence of samples from some distribution which, together with the outcome, is sampled from some meta joint distribution (Póczos et al., 2013; Szabó et al., 2016). The task is to learn a mapping from Published in Transactions on Machine Learning Research (02/2024) the distribution of the context to the outcome. Contrastingly, our model (1) or (18) operates in a different realm: the outcome is a sample from a mixture of contextual CDFs and the task is to learn the weight parameter θ . Thus, our model diverges from the above notion of distribution regression. Our focus is not on estimating a mapping from distributions to outcomes but on estimating a parameter that governs the condition distribution. Moreover, there is no meta distribution that the samples follow in our adversarial data generating process. Our results on CDF regression have the potential for downstream applications in stochastic bandits (Thompson, 1933; Robbins, 1952; Lattimore & Szepesvári, 2020) where learning algorithms necessitate the estimation of reward distributions under selected actions and the adaptive exploration of the action space. Then, under the linear assumption of the reward distributions, our CDF regression method can serve to estimate the CDFs of the rewards in stochastic bandit algorithms, with readily available theoretical results for integration into the analysis. For instance, in the infinite-armed bandit problem (Berry et al., 1997; Wang et al., 2022), assuming that the underlying distribution of arms satisfies our linear model, our method, in conjunction with an exploration algorithm for arm selection, can be employed to estimate the CDF of the underlying distribution, which actually enables the estimation of any distribution functional, thereby broadening the class of indicator-based functionals considered in Wang et al. (2022). Furthermore, since estimating the CDF of the reward under a target policy in stochastic bandits is adequate for assessing various risk functionals associated with the target policy (Huang et al., 2021), with our linear assumption on the reward distribution, our method can be applied to the risk assessment of policies in stochastic bandits. Then, combined with an exploration algorithm to select policies, our method becomes a valuable tool for minimizing diverse risks in stochastic bandits, which also extends the conventional scope of minimizing expected regret in stochastic bandits. Outline. We briefly outline the rest of the paper. Notation and formal setup for our problem are given in Section 2. We propose our estimation paradigm and analyze its theoretical performance in Section 3. We derive corresponding lower bounds on the estimation error in Section 4. We establish upper bounds on the estimation error under the existence of a mismatch in our proposed model in Section 5. We generalize the problem from estimating finite dimensional parameters to estimating infinite dimensional parameters, extend our estimation paradigm to this infinite dimensional setting, and prove an upper bound on estimation error in Section 6. Numerical results are displayed in Section 7. Conclusions are drawn and future research directions are suggested in Section 8. All the proofs and additional results are presented in the appendices. 2 Preliminaries In this section, we introduce the notation used in the paper and set up the learning problem of contextual CDF regression. Notation. Let N denote the set of positive integers. For any n P N, let rns denote the set t1, . . . , nu. For any measure space pΩ, F, mq, define the Hilbert space L2pΩ, mq : tf : ΩÑ R ˇˇ ş Ω|f|2dm ă 8u with L2-norm }f}L2pΩ,mq : bş Ω|f|2dm for f P L2pΩ, mq. For any positive definite matrix A P Rdˆd, define } }A to be the weighted ℓ2-norm in Rd induced by A, i.e., }x}A ? x JAx for x P Rd. For the standard Euclidean (or ℓ2-) norm } }Id, where Id denotes the d ˆ d identity matrix, we omit the subscript Id and simply write } }. For any square matrix A, let µminp Aq denote the smallest eigenvalue of A, µmaxp Aq denote the largest eigenvalue of A and }A}2 denote the spectral norm of the matrix A, i.e., }A}2 : a µmaxp AJAq. Let KSp F1, F2q : supx PR |F1pxq F2pxq| denote the KS distance between two CDFs F1 and F2. Finally, let 1t u denote the indicator function. More technical notation dealing with measurability issues is provided at the beginning of Appendix B. Problem setup. In this paper, we consider the problem of functional linear regression of CDFs. To define this problem, let X denote the context space, and let Fpx, q : R Ñ r0, 1s be the CDF of some R-valued random variable for any x P X. We assume that X is a Polish space throughout the paper. For a context Published in Transactions on Machine Learning Research (02/2024) x P X, we observe a sample y from its corresponding CDF Fpx, q. We next summarize two schemes to generate px, yq samples: Scheme I (Adversarial). For each j P N, an adversary picks xpjq P X (either deterministically or randomly) in an adaptive way given knowledge of the previous ypiq s for i ă j, and then ypjq P R is sampled from Fpxpjq, q. This includes the canonical fixed design setting as a special case, where all xpjq s are fixed a priori without knowledge of ypjq s. Scheme II (Random). For each j P N, xpjq P X is sampled from some probability distribution P pjq X on X independently, and then ypjq P R is sampled from Fpxpjq, q independently. This is known as the random design setting in the regression context. Scheme I and Scheme II generalize the assumptions of the data generation process in canonical ridge regression in Abbasi-Yadkori et al. (2011a) and Hsu et al. (2012b) to the problem of CDF estimation, respectively. Note that although the random design setting in Scheme II is a special case of Scheme I, we emphasize it because it has specific properties that deserve a separate treatment. The adversarial setting in Scheme I is more general than what is typically considered for regression, and our corresponding self-normalized analysis has several potential future applications in risk assessment for reinforcement learning, e.g., in contextual bandits (Abbasi-Yadkori et al., 2011a). The task of contextual CDF regression is to recover F from a sample tpxpjq, ypjqquj Prns of size n. As an initial step towards this problem, inspired by the well-studied linear regression and linear contextual bandits problems (Lattimore & Szepesvári, 2020, Equation (19.1)), where finite-dimensional parametric models with pre-selected feature functions are assumed, we consider a linear model for F. Let d be a fixed positive integer. For each i P rds and x P X, let ϕipx, q : R Ñ r0, 1s be a feature function that is a CDF of a R-valued random variable with range contained in some Borel set S Ď R, and assume that ϕi is measurable. Then, we define the vector-valued function Φ : X ˆ R Ñ r0, 1sd, Φpx, tq rϕ1px, tq, . . . , ϕdpx, tqs J. We assume that there exists some unknown θ P d 1, where d 1 : tpθ1, . . . , θdq P Rd : řd i 1 θi 1, θi ě 0 for 1 ď i ď du denotes the probability simplex in Rd, such that, Fpx, tq θJ Φpx, tq, @ x P X, t P R. (1) Thus, we can view Φ as a basis for contextual CDF learning. We visualize the sample generation process in Figure 1 where the contextual CDFs are shown in the left column and the one-sample empirical CDFs (1ty ď u for sample y) are shown in the right column. It is worth mentioning the differences between our model and the mixture model with known basis distributions in the statistics literature. First, the basis distributions in our model depend on the context of the sample and are not fixed. Second, in mixture models, the samples are assumed to be independent while in our Scheme I, the samples can be dependent since xpjq is picked adversarially given knowledge of the previous ypiq s. Thus, the mixture model with known basis distributions only corresponds to the fixed design setting with the same context xpjq x for all samples. As explained in the sampling schemes above, given xpjq at the jth sample, the observation ypjq is generated according to the CDF Fpxpjq, q θJ Φpxpjq, q. For notational convenience, we will often refer to the vector-valued function Φpxpjq, q as Φjp q for all j P rns, so that Fpxpjq, q θJ Φjp q. Under the linear model in (1), our goal is to estimate the unknown parameter θ from the sample tpxpjq, ypjqquj Prns in a (regularized) least-squares error sense. This in turn recovers the contextual CDF function F. 3 Upper bounds on estimation error In this section, we propose an estimation paradigm for the unknown parameter θ in Section 3.1, derive the upper bounds on the associated estimation error in Section 3.2, and propose a new penalized estimator that theoretically eliminates the burn-in time of the sample size in the random setting in Section 3.3. Published in Transactions on Machine Learning Research (02/2024) j = 2 j = 3 j = 1 j = 4 j = 5 j = 6 Figure 1: A visualization of the data generating process. For each j P r6s with context xpjq P X, the upper row shows the d contextual CDFs (ϕipxpjq, q, i P rds) under the context xpjq. For ypjq drawn from the CDF Fpxpjq, q θJ Φpxpjq, q where Φpxpjq, q : rϕipxpjq, q, . . . , ϕdpxpjq, qs J, the bottom row shows the sample empirical CDF Iypjqp q : 1typjq ď u. 3.1 Ridge regression estimator We begin by formally stating our least-squares functional regression optimization problem to learn θ . Given a probability measure m on S, the sample tpxpjq, ypjqquj Prns, and the set of basis functions tΦjuj Prns, we propose to estimate θ by minimizing the (ridge or) ℓ2-regularized squared L2p S, mq-distance between the estimated and empirical CDFs: pθλ : arg min θPRd j 1 }Iypjq θJΦj}2 L2p S,mq λ}θ}2, (2) where λ ě 0 is the hyper-parameter that determines the level of regularization, and the function observation Iypjqptq : 1typjq ď tu is an empirical CDF of ypjq that forms an unbiased estimator for Fpxpjq, q conditioned on past contexts and observations. Hence, in Scheme I, we only require that Iypjq θJΦj is a zero-mean function given past contexts and observations, making our analysis suitable for online learning problems where the later contexts can depend on the past contexts and observations. We remark that the adoption of L2-distance in (2) is natural. Indeed, researchers have considered the L2-distance between a one-sample empirical CDF and a CDF estimate in the definition of Continuous Ranked Probability Score (CRPS) (Hersbach, 2000) to assess the performance of the CDF estimate in approximating data distributions. In fact, viewing the one-sample empirical CDF as the response and the basis contextual CDFs as the feature in linear regression, it is natural to consider the least squares method, precisely corresponding to minimizing the L2-distance in our functional setting. Notice further that pθλ in (2) is an improper estimator since it may not lie in d 1. However, since d 1 is compact in Rd, rθλ : arg minϑP d 1 }ϑ pθλ}A exists for any positive definite A P Rdˆd. Moreover, since d 1 is also convex, we have }rθλ θ}A ď }pθλ θ}A (Beck, 2014, Theorem 9.9) for any θ P d 1 including θ . This means that an upper bound on }pθλ θ }A is also an upper bound on }rθλ θ }A. Additionally, as we will see later, the pθλ has a closed-form analytic solution which benefits the analysis of the estimation error. Therefore, we focus our analysis on the improper estimator pθλ, noting that its projection onto d 1 yields an estimator rθλ for which the same upper bounds hold. Published in Transactions on Machine Learning Research (02/2024) When λ ą 0, the objective function in (2) is a p2λq-strongly convex function of θ P Rd (see, e.g., Bertsekas et al., 2003, for the definition), and is uniquely minimized at S ΦjΦJ j dm λId S IypjqΦjdm For the unregularized case where λ 0, we omit the subscript λ and write pθ to denote a corresponding estimator in (2). Note that when λ 0, if µminpřn j 1 ş S ΦjΦJ j dmq ą 0, the objective function in (2) is still strongly convex, and is uniquely minimized at pθ given in (3) with λ 0. In practice, one can deploy standard numerical methods to compute the integral in (3), and the computational complexity of the matrix inversion is cubic in the dimension d. However, iterative methods can be used to obtain better dimension dependence in the running time. As a remark, since the probability density functions (PDFs) of the basis distributions may not exist, the samples in Scheme I can be dependent, and the distributions of the contexts in Scheme II are unknown, the likelihood function of the samples generally does not exist in our problem setting, which rules out the usage of MLE. But our estimator (2) always exists. Moreover, we focus on non-asymptotic analysis of our estimator and prove self-normalized upper bounds for the estimation error, which is rarely analyzed for MLEs. Lastly, it is worth remarking upon the choice of measure m used above. In order for the estimator in (2) to be well-defined, since Iyptq, θJΦpx, tq P r0, 1s for any t, y P R and x P X, it suffices to ensure that mp Sq ă 8 (i.e., m is a finite measure). This is the reason why we restrict m to be a probability measure on S. Furthermore, the probability measure m can in general be chosen to adapt to specific problem settings. For example, the uniform measure m U on S is often easy to compute for some choices of S. Specifically, if 0 ă Lebp Sq ă 8, where Leb denotes the Lebesgue measure, m U is defined by dm U d Leb 1 Lebp Sq, where dm U d Leb is the Radon-Nikodym derivative. If S is a finite set with cardinality #S, m U 1 #S ř s PS δs, where δs denotes the Dirac measure at s. On the other hand, when S R, m can be set to the Gaussian measure γc,σ2 defined by γc,σ2pdxq 1 ? 2πσ2 e px cq2{p2σ2qdx with c P R and σ2 ą 0. 3.2 Self-normalized bounds in various settings For samples generated according to Scheme I, we prove self-normalized upper bounds on the error pθλ θ . For any probability measure m on S, define Un : řn j 1 ş S ΦjΦJ j dm and Unpλq Un λId for n P N and λ ě 0. For n, d P N, λ, τ P p0, 8q, and δ P p0, 1q, define ελpn, d, δq : a d log p1 n{λq 2 logp1{δq ? λ}θ } and (4) εpn, d, δ, τq : ˆ? 8d logp1{δq 4 d{n logp1{δq {?τ. (5) The next theorem states our self-normalized upper bound on the estimation error. Theorem 1 (Self-normalized bound in adversarial setting). Assume m is a probability measure on S and tpxpjq, ypjqquj PN is sampled according to Scheme I with F defined in (1). For any λ ą 0 and δ P p0, 1q, with probability at least 1 δ, for all n P N, the estimator defined in (2) satisfies }pθλ θ }Unpλq ď ελpn, d, δq. (6) Moreover, for the unregularized case, we have the following result. Proposition 2 (Self-normalized bound in adversarial setting for unregularized estimator). Under the same assumptions as Theorem 1, if UN is positive definite for a fixed N P N, then for any δ P p0, 1q and n ě N, with probability at least 1 δ, the estimator defined in (2) with λ 0 satisfies }pθ θ }Un ď ε pn, d, δ, µminp Unq{nq . (7) The proofs of Theorem 1 and Proposition 2 are provided in Appendix B.1. Informally, Theorem 1 and Proposition 2 convey that with high probability, the self-normalized errors }pθλ θ }Unpλq and }pθ θ }Un Published in Transactions on Machine Learning Research (02/2024) scale as r Op ? dq in the ℓ2-regularized and unregularized cases, where r Op q ignores logarithmic and other subdominant factors. We note that Theorem 1 and Proposition 2 also imply upper bounds on the (un-normalized) error }pθλ θ }. Indeed, for any positive definite matrix A P Rdˆd and vector a P Rd, we have }a} ď µminp Aq 1{2}a}A. Thus, for example, (6) in Theorem 1 implies that }pθλ θ } ď µminp Unpλqq 1{2ελpn, d, δq r O a d{p1 µminp Unqq with high probability. Then, for the projected estimator rθλ P d 1, we have }rθλ θ } ď r O mint1, a d{p1 µminp Unqqu by the property of d 1. When µminp Unq Θpnq, we have }pθλ θ } r O a The key idea in the proof of Theorem 1 is to first notice that pθλ θ Unpλq 1Wn Unpλq 1pλθ q, where Wn : řn j 1 ş Sp IypjqΦj θJ ΦjΦjqdm. We next show that tĎ Mnuně0 where Ď Mn : detp Unpλqq1{2 exp 1 2}Wn}2 Unpλq 1 is a super-martingale. Doob s maximal inequality for super-martingales is then used in conjunction with some careful algebra to establish (6). To prove Proposition 2, we use a vector Bernstein inequality for bounded martingale difference sequences (Hsu et al., 2012a, Proposition 1.2) to show a high probability upper bound for }Wn}. Note that UN being positive definite implies that Un is positive definite for n ě N. Since }pθ θ }Un }Wn}U 1 n ď }Wn}{ a µminp Unq, we establish (7). Since the fixed design is a special case of the adversarial setting, Theorem 1 and Proposition 2 imply the same r O ? d -style upper bounds as a corollary in the fixed design setting. Corollary 3 (Self-normalized bound in fixed design setting). For an arbitrary probability measure m on S and an arbitrary sequence txpjquj PN P X N, assume that ypjq is sampled from Fpxpjq, q independently for each j P N with F defined in (1). For any λ ą 0 and δ P p0, 1q, with probability at least 1 δ, the estimator defined in (2) satisfies (6) for all n P N. If UN is positive definite for some fixed N P N, then for any δ P p0, 1q and n ě N, with probability at least 1 δ, the estimator defined in (2) with λ 0 satisfies (7). The proof of Corollary 3 is inline with those of Theorem 1 and Proposition 2. Furthermore, based on Theorem 1 and Proposition 2, we prove self-normalized upper bounds on the estimation error under Scheme II, which corresponds to the random design setting in linear regression. For any probability measure m on S Ď R, define Σpjq : Expjq P pjq X ş S ΦjΦJ j dm and Σn : řn j 1 Σpjq for j, n P N. Theorem 4 (Self-normalized bound in random design setting). Assume m is a probability measure on S, tpxpjq, ypjqquj PN is sampled according to Scheme II with F defined in (1), and µmin Σpjq ě σmin for some constant σmin ą 0 and all j P N. For any δ P p0, 1{2q and n ě 32d2 σ2 min logp d δ q, with probability at least 1 2δ, the estimator in (2) with λ 0 satisfies }pθ θ }Σn ď 2ε pn, d, δ, σminq . (8) Moreover, for regularized estimators, we have the following result. Proposition 5 (Self-normalized bound in random design setting for regularized estimator). Under the same assumptions as Theorem 4, for any λ ą 0, δ P p0, 1{2q, and n ě 32d2 σ2 min log d δ , with probability at least 1 2δ, the estimator defined in (2) satisfies }pθλ θ }Σn ď ? 2ελpn, d, δq. (9) The proofs of Theorem 4 and Proposition 5 are given in Appendix B.2. As before, they convey that in the random design setting, the self-normalized errors }pθλ θ }Σn and }pθ θ }Σn scale as r O ? d with high probability in the ℓ2-regularized and unregularized cases. Moreover, we once again note that Theorem 4 and Proposition 5 imply upper bounds on the (un-normalized) error }pθλ θ }. For example, since σmin is a positive constant, (8) implies that }pθ θ } ď 2µminpΣnq 1{2ε pn, d, δ, σminq r O a d{n with high probability since µminpΣnq ě nσmin by Weyl s inequality (Weyl, 1912). Moreover, it is not hard to show that for general Σn and λ ą 0, (9) can be generalized to }pθλ θ }Σnpλq r Op ? dq which again implies that }pθλ θ } r Opmint1, a d{pµminpΣnq 1quq. Published in Transactions on Machine Learning Research (02/2024) The main idea in the proofs of Theorem 4 and Proposition 5 is to establish a high probability lower bound on µminp nq, where n : Σ 1 2 n p Un Σnq Σ 1 2 n . This can be achieved using the matrix Hoeffding s inequality (Tropp, 2012, Theorem 1.3). Then, we show that for any λ ě 0, }pθλ θ }Σn ď p1 µmin p nqq 1{2}pθλ θ }Unpλq. For Theorem 4, we prove that µminp Unq ě µminpΣnqpµminp nq 1q. Then, we can lower bound µminp Unq in (7) by a multiple of µminpΣnq with high probability. Thus, (8) follows from (7) and the high probability lower bound on µminp nq. For Proposition 5, (9) follows from (6) and the high probability lower bound on µminp nq. We briefly compare our results in this section with related results in the literature. In the (canonical, finite dimensional) adversarial linear regression setting, Abbasi-Yadkori et al. (2011a) and Zhou et al. (2021) show an r O ? d upper bound for the self-normalized error of the ridge least-squares estimator. Specifically, the upper bound in Hsu et al. (2012b) is r Op R ? dq for the case where the noise term is R-sub-Gaussian and the upper bound in Zhou et al. (2021) is r Opσ ? d Rq for the case where the noise term is bounded by R with variance bounded by σ2. The functional regression upper bound in (6) aligns precisely with this scaling (neglecting sub-dominant factors) with respect to d and n. Moreover, the upper bounds of Abbasi-Yadkori et al. (2011a) and Zhou et al. (2021) are susceptible to the magnitudes of the responses, as evidenced by their multiplicative constants of d. In contrast, the multiplicative constant in our upper bound is 1, ensuring that our upper bound remains independent of response scales. This independence constitutes a notable advantage, distinguishing our linear model from those explored in previous works. In the (canonical, finite dimensional) random design linear regression setting, Hsu et al. (2012b) show r O ? d upper bounds for the self-normalized error of the unregularized least-squares estimator under some conditions on the distribution of covariates. The upper bound in (8) for the unregularized case also matches this scaling (neglecting sub-dominant factors). Nevertheless, it s crucial to acknowledge that our linear model (1) is characterized by a unique complexity. Unlike the canonical linear regression framework, where the features are finite-dimensional vectors, and the response is a scalar, both features and response are functions in our model. Consequently, the theoretical results of Abbasi-Yadkori et al. (2011a), Zhou et al. (2021), and Hsu et al. (2012b) are not applicable to our estimators. This intricacy introduces numerous analytical challenges, setting it apart from the conventional linear regression paradigm. Furthermore, in our infinite dimensional model (18) studied in latter chapters, we elevate the parameter from a finite-dimensional vector to a function (infinite dimensional vector), ushering in even more formidable complexities and challenges during the analysis. Finally, we note that an upper bound on }pθλ θ } immediately implies an upper bound on the KS distance between our estimated CDF and the true one. Let p Fλpx, q : rθJ λ Φpx, q denote the estimated CDF for any x P X. Then, under the linear model (1), we have sup x PX KSp p Fλpx, q, Fpx, qq sup x PX,t PS |prθλ θ q JΦpx, tq| ď}rθλ θ } sup x PX,t PS }Φpx, tq} where we use the Cauchy-Schwarz inequality and the fact that supx PX,t PS }Φpx, tq} ď ? d. Since }pθλ θ } r O mint1, a d{p1 µminp Unqqu (see discussion below Proposition 2 and 5) and p Fλ, F P r0, 1s, we have supx PX KSp p Fλpx, q, Fpx, qq r O mint1, d{ a p1 µminp Unqqu . It is worth mentioning that the above upper bound on the estimation error in KS distance may not be sharp because we focus on a tight analysis of the estimation of θ instead of Fpx, q for some x P X. Nevertheless, in Appendix A, we show that when µminp Unq 0 (µminpΣnq 0), the minimax risk in terms of the uniform KS distance for the estimation of F is lower bounded by Ωp1q for the adversarial (random) setting. 3.3 Burn-in-time-free upper bound Note that the theoretical guarantees in Theorem 4 and Proposition 5 require a burn-in time of the sample size n: n ě 32d2 σ2 min logp d δ q. Motivated by Pires & Szepesvári (2012), we propose a new estimator qθλ in (10) to eliminate the burn-in time of n: qθλ P arg min θPRd }Unpλqθ un} U n pδq}θ} , (10) Published in Transactions on Machine Learning Research (02/2024) where λ ě 0, δ P p0, 1q, un : řn j 1 ş S IypjqΦjdm, and U n pδq is a positive number such that U n pδq ě }Un Σn} with probability at least 1 δ. For notatoinal convenience, we use qθ to denote qθ0. To calculate qθλ in (10), it is necessary to first choose U n pδq for which we prove a lower bound in the following lemma. Lemma 6. Assume m is a probability measure on S and tpxpjq, ypjqquj PN is sampled according to Scheme II with F defined in (1). For any δ P p0, 1q and n P N, any U n pδq ě d a 8n logpd{δq satisfies U n pδq ě U n with probability at least 1 δ. The proof of Lemma 6 follows from the matrix Hoeffding s inequality (Tropp, 2012, Theorem 1.3) and the boundedness of CDFs, and is provided in Appendix E. Then, we show the following upper bound on the estimation error of qθλ. Theorem 7 (Self-normalized bound in random setting without burn-in time). Under the same assumptions as Lemma 6, for any δ P p0, 1{2q and n P N, if µmin pΣnq ą 0, then, with probability at least 1 2δ, the estimator defined in (10) with λ 0 satisfies }qθ θ } ď 1 µminpΣnq 8n logpd{δq}θ } 2 ˆ? 8nd logp1{δq 4 d logp1{δq ȷ . (11) The proof of Theorem 7 is provided in Appendix B.3. It conveys that for any n P N, as long as µpΣnq ą 0, }qθ θ } ď r O d?n µminpΣnq holds with high probability. Under the assumption that µminpΣpjqq ě σmin for any j P N as in Theorem 4 and Proposition 5, we have that }qθ θ } ď r O d{?n with high probability for any n P N. Compared with the r O a d{n upper bound of the estimation error of pθ in Theorem 4, qθ suffers a larger error rate wrt the dimension d in order to eliminate the burn-in time of the sample size n. Thus, qθ is more applicable to the estimation of θ for small sample size and small dimension. However, it is worth mentioning that since the estimation errors of proper estimators which are contained in the probability simplex are always bounded by 2, our upper bound in (11) is only non-trivial for the projection of qθ to the probability simplex when n Ωpd2 logpd{δq{σ2 minq which aligns with the scale of the burn-in time of pθ. Thus, the estimator (10) only eliminates the burn-in time among improper estimators. The proof of Theorem 7 builds on the upper bound shown in Pires & Szepesvári (2012) for the estimator that minimizes the unsquared penalized loss as in (10). By Pires & Szepesvári (2012, Theorem 3.4), we have that with probability at least 1 δ, }Σnpλqqθλ Σnθ } ď pλ 2 U n pδqq}θ } 2}un Eruns}. Then, we can bound }un Eruns} with high probability by the vector Bernstein inequality (Hsu et al., 2012a, Proposition 1.2). By setting λ 0 and U n pδq d a 8n logpd{δq as is guaranteed by Lemma 6, we obtain (11) after some derivation. 4 Minimax lower bounds To show that our estimator (2) is minimax optimal, we prove information theoretic lower bounds on the ℓ2-norm of the estimation error for any estimator. Recall that for a distribution family Q and (parameter) function ξ : Q Ñ Rd, the minimax ℓ2-risk is defined as, Rpξp Qqq : inf ˆξ sup QPQ Ez Qr}ˆξpzq ξp Qq}s, (12) where the infimum is over all (possibly randomized) estimators ˆξ of ξ based on a sample z, and the supremum is over all distributions in the family Q. To specialize this definition for our problem, for any x P X and θ P Rd, let P Φ Y |x;θ denote the probability measure defined by the CDF θJΦpx, q. Moreover, for any sequence x1:n : pxp1q, . . . , xpnqq P X n, define the collection of product measures, Pd x1:n : ! bn j 1P Φ Y |xpjq;θ : θ P d 1, Φ P Bd ) , where Bd : trϕ1, . . . , ϕds J : ϕi : X ˆ R Ñ r0, 1s is measurable and ϕipx, q is a CDF on R, @i P rdsu. Published in Transactions on Machine Learning Research (02/2024) For any distribution P P Pd x1:n, let θp Pq be a parameter in d 1 such that P bn j 1P Φ Y |xpjq;θ. Then, we have the following theorem in the adversarial setting. Theorem 8 (Information theoretic lower bound in adversarial setting). For any d ě 2 and any sequence x1:n pxp1q, . . . , xpnqq P X n, we have Rpθp Pd x1:nqq Ω mint1, a d{p1 µminp Unqqu . (13) The proof uses Fano s method (Fano, 1961) and is given in Appendix C.1. Note that strictly speaking, the above theorem is written for the fixed design setting. However, a lower bound in the fixed design setting also implies the same lower bound in adversarial setting. Furthermore, by our discussion below Theorem 1, (6) implies that in the adversarial setting, P }pθλ θ }2 ě C1d logpnq C2 C3r 1 µminp Unq for r ą 0 and some constants C1, C2, and C3, which immediately implies that Er}pθλ θ }s r Op a d{p1 µminp Unqqq and Er}rθλ θ }s r Opmint1, a d{p1 µminp Unqquq. Thus, our estimator rθλ is minimax optimal. When µminp Unq Θpnq, the optimal rate is rΘp a d{nq in the adversarial setting. In the proof of Theorem 8, we construct a family of Ωpa{ ? dq-packing subsets of d 1 for a P p0, 1q under ℓ2-distance. We then show that when ϕ1, . . . , ϕd are the CDFs of d Bernoulli distributions, for any θp1q θp2q in such a packing subset, the Kullback-Leibler (KL) divergence (see definition in Appendix C.1) satisfies Dp PY |xpjq;θp1q}PY |xpjq;θp2qq Opa2p1 µminp Unqq{dq for any j P rns. Since the above family of Bernoulli distributions is a subset of Pd x1:n, we are able to show that Rpθp Px1:nqq Ω a d{p1 µminp Unqq using Fano s method and the aforementioned bound on KL divergence. Next, to analyze minimax ℓ2-risk under the random setting, let DX denote the set of all probability distributions on X. For any PX P DX , let PXP Φ Y |X;θ denote the joint distribution of p X, Y q such that the marginal distribution of X is PX and the conditional distribution of Y given X x is P Φ Y |x;θ. Define the distribution family Pd n : bn j 1 P pjq X P Φ Y |X;θ : θ P Rd, Φ P Bd, P pjq X P DX ( , and for any P P Pd n, let θp Pq denote the parameter in d 1 such that P bn j 1P pjq X P Φ Y |X;θ. Clearly, for any x1:n P X n, we have bn j 1δxpjq PY |X;θ : θ P d 1( Ď Pd n. Thus, each Pd x1:n is a collection of marginal distributions of elements belonging to such subsets of Pd n. Then, by the definition of minimax ℓ2-risk, Theorem 8 immediately implies the following corollary. Corollary 9 (Information theoretic lower bound in random setting). For any d ě 2, Rpθp Pd nqq Ω mint1, a d{p1 µminpΣnqqu (14) The proof is given in Appendix C.2. By the discussion below Proposition 5, our estimator rθλ (λ ą 0) is minimax optimal. When µminpΣnq Θpnq as in Theorem 4 and Corollary 5, the lower bound on the Euclidean norm of the estimation error is also Ω a d{n in random setting. Following the discussion below Theorem 4, (8) implies that in random setting, Pr}pθ θ } ě C1 a d{ns ď e r for r ą 0 and constants C1, C2, and C3, which immediately implies that Er}pθ θ }s r Op a d{nq. Thus, the estimator (2) is minimax optimal with rate rΘp a d{nq in random setting when µminpΣnq Θpnq. 5 Mismatched model In general, a mismatch may exist between the true target function and our linear model (1) with basis Φ. So, in analogy with canonical linear regression where additive Gaussian random variables are used to model the Published in Transactions on Machine Learning Research (02/2024) error term (Montgomery et al., 2021), we consider the following mismatched model: Fpx, tq θJ Φpx, tq epx, tq, @ x P X, t P R, (15) where an additive error function depending on the context is included to model the mismatch in (1). Note that in (15), each Fpx, q is a CDF and e : X ˆS Ñ r 1, 1s is a measurable function. One equivalent interpretation of (15) is as follows. Suppose that their exists another contextual CDF function ϕe such that Fpx, q is a mixture of the linear model θJ Φpx, q and the new feature function ϕepx, q, i.e., for some q P r0, 1s, Fpx, tq p1 qqθJ Φpx, tq qϕepx, tq θJ Φpx, tq q ϕepx, tq θJ Φpx, tq , @ x P X, t P R. Then, we naturally obtain an additive error function epx, tq q ϕepx, tq θJ Φpx, tq . Given a sample tpxpjq, ypjqquj Prns generated using the mismatched model (15), let ejptq denote epxpjq, tq for j P rns. Moreover, define En : řn j 1 ş S ejΦjdm and Bn : Er Ens řn j 1 E ş S ejΦjdm . Then, we have the following theoretical guarantees for the task of estimating θ using the estimator in (2) in the adversarial and random settings. Theorem 10 (Self-normalized bound in mismatched adversarial setting). Assume m is a probability measure on S and tpxpjq, ypjqquj PN is sampled according to Scheme I with F defined in (15). For any λ ą 0 and δ P p0, 1q, with probability at least 1 δ, the estimator defined in (2) satisfies that for all n P N, }pθλ θ }Unpλq ď ελpn, d, δq }En}{ ? The proof of Theorem 10 follows the same approach as the proof of Theorem 1, and it is provided in Appendix I.1. Furthermore, Theorem 10 implies Corollary 11 for the mismatched random setting. Corollary 11 (Self-normalized bound in mismatched random setting). Assume m is a probability measure on S, tpxpjq, ypjqquj PN is sampled according to Scheme II with F defined in (15), and µminpΣpjqq ě σmin for some σmin ą 0 and all j P N. For any λ ą 0, δ P p0, 1{2q, and n ě 32d2 σ2 min log d δ , with probability at least 1 2δ, the estimator defined in (2) satisfies }pθλ θ }Σn ď ? 2ελpn, d, δq a 2{λ}Bn}. (17) The proof of Corollary 11 is given in Appendix I.2. It follows from the proofs of Theorem 10 and Proposition 5. In the adversarial setting, comparing (16) in Theorem 10 with (6) in Theorem 1, we see that the effect of the additive error in the mismatched model is captured by the additional }En}{ ? λ term in our self-normalized error upper bound. Similarly, in the random setting, comparing (17) in Corollary 11 with (9), we again see that the effect of the additive error is captured by the additional a 2{λ}Bn} term in the self-normalized upper bound. 6 Infinite dimensional model So far, we have been assuming finite-dimensional models where the number of base contextual CDFs ϕi s per sample is finite. It is natural to consider generalizing the linear model to be infinite-dimensional and estimating an infinite dimensional parameter θ which shall be considered as a function on the index space of the base functions. In Section 6.1, we formally introduce the infinite-dimensional linear model. We present necessary definitions and technical facts for the statement of the estimator and theorem in Section 6.2. We extend the estimator pθλ in (2) with properly chosen regularization and provide a high probability upper bound on the estimation error of the generalized estimator in Section 6.3. 6.1 Formal model First, we introduce the infinite dimensional index space Ωand the generalized basis function Φ. Assume that pΩ, FΩ, nq is a measure space with npΩq ă 8 and Φ : X ˆ Ωˆ R Ñ r0, 1s, px, ω, tq ÞÑ Φpx, ω, tq is a Published in Transactions on Machine Learning Research (02/2024) p Bp Xq b FΩb Bp Rqq{Bpr0, 1sq-measurable function (see Appendix B for the explanations of notation) such that for any x P X and n-a.e. ω P Ω, Φpx, ω, q is the CDF of some R-valued random variable with its range contained in some Borel set S Ď R. Define the following mapping, x , y : L2pΩ, nq ˆ L2pΩ, nq Ñ R, pf, gq ÞÑ ż Then, x , y is an inner product on L2pΩ, nq and p L2pΩ, nq, x , yq is a Hilbert space. Let } } denote the norm by induced x , y on L2pΩ, nq. Assume that p L2pΩ, nq, x , yq is separable. Then, there exists a countable orthonormal basis on p L2pΩ, nq, x , yq. For notational convenience, we write L2pΩ, nq to represent the Hilbert space p L2pΩ, nq, x , yq. Let e teiu8 i 1 be an arbitrary countable orthonormal basis of L2pΩ, nq and σ tσiui PN be an arbitrary real sequence such that ř8 i 1 |σi| ă 8. Assume that there exists some unknown θ P Hσ,e such that θ ě 0 n-a.e., ş Ωθ n 1, and the target function F satisfies the following model Fpx, tq xθ p q, Φpx, , tqy, @ x P X, t P R. (18) 6.2 Technical preliminaries The proofs of the theoretical results in this section are provided in Appendix F. Given the sample tpxpjq, ypjqquj PN Ď X ˆ R, define the function Φj : Ωˆ R Ñ r0, 1s, pω, tq ÞÑ Φpxj, ω, tq for any j P N. Since npΩq ă 8 and |Φpx, ω, tq| ď 1 for any x P X, n-a.e. ω P Ω, and any t P R, we have Φj P L2pΩ, nq for any j P N. Then, for any j P N, we define, Ψj : L2pΩ, nq ˆ S Ñ R, pθ, tq ÞÑ xθp q, Φjp , tqy. Then, by Holder s inequality, for any j P N and θ P L2pΩ, nq, we have supt PR |Ψjpθ, tq| ď npΩq ş Ω|θ|2dn ă 8. It follows that Ψjpθ, q P L2p S, mq. Moreover, we have that for any n P N, any θ P L2pΩ, nq, and n-a.e. ω P Ω, ˇˇˇˇˇ S Ψjpθ, tqΦjpω, tqmpdtq S |Ψjpθ, tqΦjpω, tq| mpdtq ď nnpΩq ż Since npΩq ă 8, it follows that the function ω ÞÑ řn j 1 ş S Ψjpθ, tqΦjpω, tqmpdtq is in L2pΩ, nq. Thus, for any n P N, we can define an operator Un : L2pΩ, nq Ñ L2pΩ, nq by p Unθqpωq : S Ψjpθ, tqΦjpω, tqmpdtq S xθp q, Φjp , tqyΦjpω, tqmpdtq (19) for any θ P L2pΩ, nq. We show the following properties of Un. Lemma 12. For any n P N, Un is a self-adjoint positive Hilbert-Schmidt integral operator with }Un} ď nnpΩq. Thus, it is also a compact operator. Now, we assume that Un satisfies Assumption 13 for some n P N. Assumption 13. Assume that ei is an eigenfunction of Un with the corresponding eigenvalue denoted with λi for any i P N. Under the Assumption 13 on Un, we can conclude from Lemma 12 that: Corollary 14. Assume that Un satisfies Assumption 13 for some n P N. Then, we have 0 ď λi ď nnpΩq for any i P N and λi Ñ 0. Define the set L2 σpΩ, nq : ! θ P L2pΩ, nq : ř8 i 1 |xei,θy|2 σ4 i ă 8 ) . Then, we have that Lemma 15. For any σ tσiui PN Ď R satisfying ř8 i 1 |σi| ă 8, L2 σpΩ, nq is a linear subspace of L2pΩ, nq. For any θ P L2 σpΩ, nq, we have 2 |xei, θy|2 ď i 1 2λ2 i |xei, θy|2 2 σ4 i |xei, θy|2 ď 2}Unθ}2 2 Published in Transactions on Machine Learning Research (02/2024) which implies that třm i 1pλi 1 σ2 i qxei, θyeium PN is a Cauchy sequence and thus converges in L2pΩ, nq to ř8 i 1pλi 1 σ2 i qxei, θyei P L2pΩ, nq. Therefore, we can define the operator Un,σ : L2 σpΩ, nq Ñ L2pΩ, nq, θ ÞÑ ř8 i 1 λi 1 σ2 i xei, θyei for which we show the following lemma. Lemma 16. Un,σ is bijective linear operator from L2 σpΩ, nq onto L2pΩ, nq. U 1 n,σ is a bounded linear operator on L2pΩ, nq with }U 1 n,σ} ď supi PN σ2 i and U 1 n,σθ ř8 i 1 σ2 i xei,θy 1 λiσ2 i ei for any θ P L2pΩ, nq. Moreover, U 1 n,σ is positive and self-adjoint. Consequently, we can define the following mapping } }Un,σ : L2 σpΩ, nq Ñ r0, 8q, θ ÞÑ a g f f e}θ}2 Un σ2 i , (20) where }θ}Un : a xθ, Unθy for any θ P L2pΩ, nq. Define the set θ P L2pΩq : i 1 |xei, θy|2{σ2 i ă 8 and the mapping x , yσ,e : Hσ,e ˆ Hσ,e Ñ R, pf, gq ÞÑ ř8 i 1 xei,fyxei,gy σ2 i . Similar to the proofs of Lemma 15, we can show that Hσ,e is a linear subspace of L2pΩ, nq. Moreover, p Hσ,e, x , yσ,eq is also a separable Hilbert space with tσieiui PN being an orthonormal basis. For notational convenience, we write Hσ,e to represent the Hilbert space p Hσ,e, , x yσ,eq and use } }σ,e to denote the induced norm on Hσ,e. Moreover, we show the following lemma. Lemma 17. For any real sequence σ tσiui PN with limiÑ8 σi 0, Hσ,e Ď L2 σpΩ, nq. 6.3 Self-normalized upper bound Since we have proved that Ψjpθ, q P L2p S, mq for any j P N and θ P L2pΩ, nq, the following loss function is well-defined on Hσ,e, j 1 }Iypjqp q Ψjpθ, tq}2 L2p S,mq In fact, assuming the convention that 0{0 0 and 1{0 8, we can extend the domain of Lp ; σq to L2pΩ, nq by extending its codomain from r0, 8q to r0, 8s. We propose to estimate θ by minimizing the above loss function over Hσ,e: pθσ : arg min θPHσ,e Lpθ; σq. (21) Since ř8 i 1 |xei,θy|2 σ2 i 8 for any θ P L2pΩ, nq, we also have pθσ arg minθPL2pΩ,nq Lpθ; λq. We have the following formula for pθσ in (21). Proposition 18. The solution to the optimization problem (21) is given as the following, pθσ U 1 n,σ S IypjqptqΦjp , tqmpdtq The proof of Proposition 18 is provided in Appendix G. Under the adversarial setting, we show the following upper bound for the self-normalized estimation error of pθσ in (21). Published in Transactions on Machine Learning Research (02/2024) Theorem 19 (Self-normalized bound in adversarial setting for infinite dimensional model). Assume m is a probability measure on p S, Bp Sqq, n is a finite measure on pΩ, FΩq, e teiu8 i 1 is an orthonormal basis of L2pΩ, nq, σ tσiui PN is a real sequence satisfying ř8 i 1 |σi| ă 8, θ P Hσ,e satisfies θ ě 0 n-a.e. and ş Ωθ n 1, and tpxpjq, ypjqquj PN is sampled according to Scheme I with F defined in (18). For any given n P N and any δ P p0, 1q, if Un defined in (19) satisfies Assumption 13 and σ satisfies that |σi| ă 1 ?λi for any i P N, then, with probability at least 1 δ, the estimator pθσ defined in (21) satisfies }pθσ θ }Un,σ ď i 1 log p1 λiσ2 i q δ }θ }σ,e. (23) In particular, for any given n P N and any δ P p0, 1q, if Un defined in (19) satisfies Assumption 13 and σ satisfies that |σi| ă 1 ? nnpΩq for any i P N, then, with probability at least 1 δ, the estimator pθσ defined in (21) satisfies }pθσ θ }Un,σ ď i 1 log p1 nnpΩqσ2 i q δ }θ }σ,e. (24) The detailed proof of Theorem 19 is provided in Appendix D. Since ř8 i 1 |σi| ă 8 and θ P Hσ,e, we have that }θ }σ,e ă 8 and ř8 i 1 |σi|2 ă 8 which implies that i 1 log 1 λiσ2 i ď i 1 log 1 nnpΩqσ2 i ă 8. Thus, the RHS terms in (23) and (24) are finite and pθσ θ P L2 σpΩ, nq. (24) conveys that with high probability, }pθσ θ }Un,σ ď r O i 1 logp1 nnpΩqσ2 i q When Ω rds for some d P N and n is the counting measure on Ω, (21) reduces to (2) after setting ei 1tiu and σi 1 ? λ for any i P rds and some λ ą 0. Then, by (20) and (24), we have }pθσ θ }Un ď r O ? }pθσ θ }Un ď r O a d{p1 µminp Unqq , which also recovers the result in Theorem 1. Thus, Theorem 19 is a generalization of Theorem 1 for the possibly infinite dimensional model (18). The proof of Theorem 19 generalizes the approach used in the proof of Theorem 1 to the setting of the infinite dimensional model (18). However, there are plenty of technical challenges in dealing with the infinite dimensional L2 space. First of all, since the vectors in the proof of Theorem 1 are generalized to functions and the matrices are generalized to operators, we need to ensure that these functions are well-defined in some proper spaces and figure out the domain/codomain and properties (e.g., linearity, boundedness, self-adjointness, positivity, compactness, invertibility, etc) of those operators. As in the proof of Theorem 1, we would like to write pθσ θ U 1 n,σWn U 1 n,σpςθ q where, S IypjqptqΦjpω, tqmpdtq ż S Ψjpθ , tqΦjpω, tqmpdtq, and ςθ : ř8 i 1 xei,θ y σ2 i ei. However, this sequence třm i 1 xei,θ y σ2 i eium PN only converges for θ P L2 σpΩ, nq but not Hσ,e. Thus, for general θ P Hσ,e, ςθ does not exist and we instead consider the finite-rank operator ςm : θ ÞÑ řm i 1 xei,θy σ2 i ei on L2pΩ, nq and the sequence tθ ,m : U 1 n,σp Unθ ςmθ qum PN which we show satisfies }θ ,m θ }Un,σ Ñ 0 as m Ñ 8. Then, since it suffices to bound }pθσ θ ,m}Un,σ ď }U 1 n,σWn}Un,σ }U 1 n,σςmθ }Un,σ ď }Wn}U 1 n,σ }θ }σ,e. Published in Transactions on Machine Learning Research (02/2024) To bound }Wn}U 1 n,σ, we use the martingale approach as in the proof of Theorem 1. However, after proving that Mnpαq : exp xα, Wny 1 ně0 is a super-martingale for any α P L2pΩ, nq wrt the natural filtration t Fn : σpx1, y1, . . . , xn, yn, xn 1quně0, it is difficult to pick a properly defined Gaussian random variable in L2pΩ, nq. Inspired by Lifshits (2012, Example 2.2), we define β ř8 i 1 σiζiei with tζiui PN being a sequence of independent Np0, 1q-random variables. Note that β P L2pΩ, nq a.s. if ř8 i 1 σ2 i ă 8. Thus, we can define Ď Mn : Er Mnpβq|F8s with F8 : σp Y8 n 1Fnq. Then, we prove that t Mnuně0 is also a super-martingale wrt t Fnuně0 and the question remained is to calculate Mn. However, directly generalizing (41), we would get }Wn}2 U 1 n,σ }β U 1 n,σWn}2 Un,σ 2xβ, Wny }β}2 Un,σ which does not make sense because }β}Un,σ could be 8 with positive probability. Since it is hard to deal with this in the integration over the the law of β, we instead adopt the similar approach as we do for θ . Define βm : řm i 1 σiζiei and Wn,m : řm i 1xei, Wnyei. Then, after some calculation, we get }Wn,m}2 U 1 n,σ }βm U 1 n,σWn,m}2 Un,σ 2xβm, Wn,my }βm}2 Un,σ and Erexpp Hmq|F8s 1 aśm i 1p1 λiσ2 i q exp ˆ1 2}Wn,m}2 U 1 n,σ where expp Hmq : exp xβm, Wn,my 1 2}βm}2 Un ( . Afterwards, we use dominated convergence theorem to conclude that, lim mÑ8 Erexpp Hmq|F8s Er Mn|F8s Ď Mn, a.s.. The verification the integrability of the dominating function exp n ř8 i 1 |σiζi| 1 2 ř8 i 1 λiσ2 i ζ2 i is also quite technical, during which the condition that ř8 i 1 |σi| ă 8 is used. Finally, we obtain that Ď Mn 1 ?ś8 i 1p1 λiσ2 i q exp 1 2}Wn}2 U 1 n,σ . Then, by applying Doob s maximal inequality for super-martingales, we can bound }Wn}2 U 1 n,σ which yields the final bound on }pθσ θ }Un,σ in (23). (24) immediately follows from (23) and Corollary 14. 7 Numerical studies In this section, we demonstrate the scaling of estimation errors of the proposed estimator empirically in our synthetic data experiments in Section 7.1 and illustrate the practical utility of the proposed estimator in our real data experiments in Section 7.2. 7.1 Synthetic data experiments This section contains the experimental results on discrete and continuous synthetic data. Bernoulli data experiments. To illustrate that our estimator (2) achieves the ℓ2-error rate of rΘp a d{p1 µminp Unqqq in the estimation of θ under model (1), we consider the Bernoulli data generated according to the hard instance used to show the lower bound in the proof of Theorem 8 in Appendix C.1. Specifically, after choosing a true parameter θ P d 1 of dimension d P N, for any j P N, we set ϕipxj, q as the CDF of Bernoullippjiq for i P rds, where pj : rpj1, . . . , pjds J P r0, 1sd is defined as follows. When j P rds, we set pji 1 cj 2d3 cj1ti ju 2d3 ; when j ą d, we set pji 1 cjµminp Rj 1q 2d2 cjµminp Rj 1q1ti pj mod dqu 2d2 , where mod denotes the modulo operation, cj s are constants independent of d, and Rj : qjq J j 1 n řj 1 k 1 qkq J k for any j ě d with qj : r1 pj1, . . . , 1 pjds J. Then, we sample yj independently from BernoullipθJ pjq whose CDF is θJ Φj. Given n samples, we calculate pθλ using different values of λ according to (21) with S r0, 1s and m Lebpr0, 1sq. We evaluate the performance using the un-normalized ℓ2-error }pθλ θ }, the self-normalized error }pθλ θ }Unpλq, and the KS distance KSp p Fλpx, q, Fpx, qq (for KS distance, we consider the family of Bernoulli distributions with parameters in r1 1 d2 , 1 1 2d2 s to align with the setting of pj in Published in Transactions on Machine Learning Research (02/2024) 9.5 10.5 11.5 12.5 13.5 log of sample size (log n) log of error ℓ2 error KS distance (a) λ 0.001 9.5 10.5 11.5 12.5 13.5 log of sample size (log n) log of error ℓ2 error KS distance 9.5 10.5 11.5 12.5 13.5 log of sample size (log n) log of error ℓ2 error KS distance 9.5 10.5 11.5 12.5 13.5 log of sample size (log n) log of self-normalized error (d) λ 0.001 9.5 10.5 11.5 12.5 13.5 log of sample size (log n) log of self-normalized error 9.5 10.5 11.5 12.5 13.5 log of sample size (log n) log of self-normalized error Figure 2: Means and 90% confidence intervals of un-normalized ℓ2-errors }pθλ θ }, KS distances KSp p Fλpx, q, Fpx, qq, and self-normalized errors }pθλ θ }Unpλq against sample size n in logarithmic scale in Bernoulli synthetic data experiments. data generation). We repeat the experiments 100 times to calculate means and 90% confidence intervals of the errors. We first study the dependence of estimation errors of our estimator (2) on sample size n with the dimension d 5. Specifically, for λ 0.001, 0.1, and 10, we run the experiments with n ranging from 104 to 106 and plot the means and 90% confidence intervals of the errors against n (both in logarithmic scale) in Figure 2. According to Figure 2, for different values of λ, the slopes of the curves of log }pθλ θ }, log KSp p Fλpx, q, Fpx, qq, and log }pθλ θ }Unpλq against log n are around 0.5, 0.5, and 0.025, which obeys the rΘp a d{nq, r Opd{?nq (assuming µminp Unq grows linearly with n), and Op a d logp1 n{λqq upper bounds on the errors }pθλ θ }, KSp p Fλpx, q, Fpx, qq, and }pθλ θ }Unpλq respectively according to Theorem 1 and 8. Then, we study the dependence of estimation errors of estimator (2) on dimension d with the sample size n 106. For λ 0.001, 0.1, and 10, we run the experiments with d ranging from 10 to 100. Then, we plot the means and 90% confidence intervals of log }pθλ θ } and log KSp p Fλpx, q, Fpx, qq against log d log µminp Unpλqq as well as log }pθλ θ }Unpλq against log d in Figure 3. According to Figure 3, for different values of λ, the slopes of the curves of log }pθλ θ } and log KSp p Fλpx, q, Fpx, qq against log d log µminp Unpλqq are around 0.5 and 0.5 respectively, and the slopes of the curves of log }pθλ θ }Unpλq against log d are around 0.5. These results also obey the rΘp a d{p1 µminp Unqqq, r Opd{ a p1 µminp Unqquq , and Op a d logp1 n{λqq upper bounds on the errors }pθλ θ }, KSp p Fλpx, q, Fpx, qq, and }pθλ θ }Unpλq respectively according to Theorem 1 and 8. Polynomial CDF data experiments. For d P N, rpiq : i if 1 ď i ď d 1 2 , and rpiq : 2 2i d 1 if d 1 2 ă i ď d, we consider the following basis CDFs: ϕipx, tq 1tt P r0, 1{xsupxtqrpiq 1tt ą 1{xu, i P rds. (25) Published in Transactions on Machine Learning Research (02/2024) 7 6 5 4 3 log d log µmin(Un(λ)) log of error ℓ2 error KS distance (a) λ 0.001 7 6 5 4 3 log d log µmin(Un(λ)) log of error ℓ2 error KS distance 7 6 5 4 3 log d log µmin(Un(λ)) log of error ℓ2 error KS distance 2.5 3.0 3.5 4.0 4.5 log of dimension (log d) log of self-normalized error (d) λ 0.001 2.5 3.0 3.5 4.0 4.5 log of dimension (log d) log of self-normalized error 2.5 3.0 3.5 4.0 4.5 log of dimension (log d) log of self-normalized error Figure 3: Means and 90% confidence intervals of un-normalized ℓ2-errors }pθλ θ }, KS distances KSp p Fλpx, q, Fpx, qq, and self-normalized errors }pθλ θ }Unpλq against d{µminp Unpλqq and dimension d in logarithmic scale in Bernoulli synthetic data experiments. To simulate n samples, we first choose a true parameter θ . For each j P rns, xj is sampled independently from the uniform distribution on r0.5, 2s. Then, we sample yj independently from the CDF θJ Φpxj, q using the inverse CDF method for j P rns. Given the simulated sample, we calculate pθλ using (3) with S r0, 2s, m chosen as the uniformly distribution m U on S, and different values of λ. We evaluate the performance by calculating ℓ2-error }pθλ θ }, the self-normalized errors }pθλ θ }Unpλq and }pθλ θ }Σn, and the KS distance KSp p Fλpx, q, Fpx, qq. To obtain stable results, we repeat the simulation independently 100 times in each setting to calculate 90% confidence intervals and means of the errors. Fixing d 5, we study the dependence of estimation errors of our estimator (2) on sample size n using λ 0.001, 0.1, and 10. We run the experiments with n ranging from 104 to 106 and plot the means and 90% confidence intervals of the errors against n (both in logarithmic scale) in Figure 4. According to Figure 4, for different values of λ, the slopes of the curves of log }pθλ θ }Unpλq and }pθλ θ }Σn against log n are around 0, which obeys the Op a d logp1 n{λqq upper bounds proved in Theorem 1 and Proposition 5. When λ is negligible compared to µminp Unqq, the r Op a d{pλ µminp Unqqq bound on ℓ2-error followed from Theorem 1 implies the r Op a d{nq ℓ2-error bound if µminp Unqq grows linearly with n. Indeed, for small λ 0.001, the slope of the curve of log }pθλ θ } against log n in Figure 4a is around 0.5. When λ is comparable with µminp Unqq, as is observed in Figure 4b and 4c, the slopes of the curves of ℓ2-errors are larger than 0.5, which is expected from the r Op a d{pλ µminp Unqqq bound. The slopes of the curves of the KS distances against log n are smaller than 0.5, also obeying the r Opd{ a pλ µminp Unqqq bound implied by Theorem 1. Next, fixing n 105, we run the experiments with d ranging from 10 to 100 using λ 0.001, 0.1, and 10. We plot the means and 90% confidence intervals of the errors against d (both in logarithmic scale) in Figure 5. According to Figure 5, for different values of λ, the slopes of the curves of log }pθλ θ }, log }pθλ θ }Unpλq, and log }pθλ θ }Σn against log d are around 0, obeying the respective r Op a d{pλ µminp Unqqq, Op a d logp1 n{λqq, and Op a d logp1 n{λqq bounds proved in Theorem 1 and Proposition 5. The slopes of the curves of the Published in Transactions on Machine Learning Research (02/2024) 9.5 10.5 11.5 12.5 13.5 log of sample size (log n) log of error ℓ2 error KS distance (a) λ 0.001 9.5 10.5 11.5 12.5 13.5 log of sample size (log n) log of error ℓ2 error KS distance 9.5 10.5 11.5 12.5 13.5 log of sample size (log n) log of error ℓ2 error KS distance 9.5 10.5 11.5 12.5 13.5 log of sample size (log n) log of self-normalized error (d) λ 0.001 9.5 10.5 11.5 12.5 13.5 log of sample size (log n) log of self-normalized error 9.5 10.5 11.5 12.5 13.5 log of sample size (log n) log of self-normalized error Figure 4: Means and 90% confidence intervals of un-normalized ℓ2-errors }pθλ θ }, KS distances KSp p Fλpx, q, Fpx, qq, and self-normalized errors }pθλ θ }Unpλq and }pθλ θ }Σn against sample size n in logarithmic scale in polynomial CDF synthetic data experiments. KS distances are smaller than 1, which also obeys the r Opd{ a pλ µminp Unqqq bound implied by Theorem 1. Since the lower bounds are proved for the worst case of any estimator, the results above do not violate our theoretical results on lower bound. 7.2 Real data experiments We compare the empirical performance of our estimator (2) and other methods on two real-world datasets: the California house price dataset and the adult income dataset. California house price dataset. We evaluate the performance of estimator (2) on the California house price dataset (Mohapatra, 2022) of size n 20, 640 from Kaggle. There are 10 attributes among which we use median house value as the samples y from target CDFs and all other attributes as the contexts x (d 9). We standardize all the ordinal variables. We apply the proposed estimator (2) and three other methods, MLE, empirical CDF (ECDF), and kernel density estimation (KDE) to estimate contextual CDFs for this dataset. Specifically, for ECDF, given samples yp1q, . . . , ypnq, the empirical CDF is as follows, p FEptq : 1 j 1 Iypjqptq 1 j 1 1typjq ď tu. (26) For KDE, we apply the function density in the R package stats with Gaussian, rectangular, and triangular kernels. Note that only the samples y are used to estimate one CDF without considering the contexts x in ECDF and KDE. For the proposed estimator and MLE, we assume the linear model (1). We consider the following family of basis CDFs: ϕipx, tq p1 wq FNpt; βp1q N,ixi βp0q N,i, σ2 i q w FLpt; βp1q L,ixi βp0q L,i, biq, t P R, i P rds, (27) Published in Transactions on Machine Learning Research (02/2024) 2.5 3.0 3.5 4.0 4.5 log of dimension (log d) log of error ℓ2 error KS distance (a) λ 0.001 2.5 3.0 3.5 4.0 4.5 log of dimension (log d) log of error ℓ2 error KS distance 2.5 3.0 3.5 4.0 4.5 log of dimension (log d) log of error ℓ2 error KS distance 2.5 3.0 3.5 4.0 4.5 log of dimension (log d) log of self-normalized error (d) λ 0.001 2.5 3.0 3.5 4.0 4.5 log of dimension (log d) log of self-normalized error 2.5 3.0 3.5 4.0 4.5 log of dimension (log d) log of self-normalized error Figure 5: Means and 90% confidence intervals of un-normalized ℓ2-errors }pθλ θ }, KS distances KSp p Fλpx, q, Fpx, qq, and self-normalized errors }pθλ θ }Unpλq and }pθλ θ }Σn against dimension d in logarithmic scale in polynomial CDF synthetic data experiments. where x px1, . . . , xdq is the context, FNp ; µ, σ2q is the CDF of the Gaussian distribution Npµ, σ2q, FLp ; µ, bq is the CDF of the Laplace distribution Laplacepµ, bq, w is the weight of Laplace distributions, βp1q N,i (βp0q N,i) is the coefficient (intercept) in the Gaussian linear model of xi, and βp1q L,i (βp0q L,i) is the coefficient (intercept) in the Laplace linear model of xi. We split the whole dataset into subsets of fractions 1{3, 1{2, and 1{6. 1{3 data points are used to estimate the coefficients and intercepts under Gaussian or Laplace linear models separately by maximizing log likelihood. For Laplace linear model, it corresponds to the least absolute residual regression which we solve using the function lad in the R package L1pack (Osorio & Wolodzko, 2023). Afterwards, we estimate σ2 i s and bi s using the sample variance and the mean absolute deviation of the their corresponding residuals respectively. Then, we apply different methods on the second subset (training dataset) of 1{2 data points. For the proposed estimator, we calculate pθλ using (3) with S R, m γ0,100, and λ 0.1, 1, 5. For MLE, we can formulate the likelihood function of the parameter θ in (1) with Φ specified in (27). Let pθMLE denote a maximizer of likelihood function. Since under (1), MLE corresponds to solving a convex minimization problem in a convex set (the probability simplex d 1), we use the solver SCS in the R package CVXR (Fu et al., 2020) to calculate pθMLE. Let p FE denote the ECDF calculated by (26) using the training dataset. Let p FKG, p FKR, and p FKT denote the CDF calculated by KDE with Gaussian, rectangular, and triangular kernels respectively using the training dataset. Given samples yp1q, . . . , ypnq, we define the L2-error of an estimated CDF p F as j 1 }Iypjq p F}2 L2p S,mq, (28) where we also set S R and m γ0,100. Note that when S R and m Lebp Rq, the L2-error in (28) corresponds to the renowned Continuous Ranked Probability Score (CRPS) (Hersbach, 2000) used to assess the performance of a CDF in approximating data distribution. We calculate L2-errors on the third subset Published in Transactions on Machine Learning Research (02/2024) ECDF KG KR KT MLE 0.1 1.0 5.0 estimators ECDF KG KR KT MLE 0.1 1.0 5.0 estimators ECDF KG KR KT MLE 0.1 1.0 5.0 estimators Figure 6: Box plots of L2-errors in the California house price data experiment. ECDF refers to the empirical CDF defined in (26). KG , KR , and KT refer to the kernel density estimation method using Gaussian, rectangular, and triangular kernels respectively. 0.1 , 1.0 , and 5.0 refer to our estimator pθλ in (2) with λ 0.1, 1.0, and 5.0 respectively. (test dataset) of 1{6 data points for the four methods described previously. For ECDF and KDE, we plug p FE, p FKG, p FKR, and p FKT in (28). For MLE and the proposed estimator, we calculate L2-errors using 1 n řn j 1 }Iypjq pθJ MLEΦj}2 L2p S,mq and 1 n řn j 1 }Iypjq pθJ λ Φj}2 L2p S,mq with different values of λ. We run the experiments with w 0, 0.5, and 1 in (27). To get stable results, we permute the dataset uniformly at random independently and repeat the experiments 100 times to calculate L2-errors. We draw the box plots of the L2-errors of different methods with different values of w in Figure 6. As is shown in the figure, ECDF and KDE have comparable L2-errors which are much larger than the other two methods. For all choices of w and λ, our estimator (2) achieves the smallest L2-error than any other method, indicating that its performance is very robust in the choices of basis CDFs and regularization level. Also, L2-error of our estimator decreases with the value of λ as expected. Thus, with different basis contextual CDFs, our estimator (2) has better performance in approximating target data distributions and the performance is stable wrt the value of λ in (2). Adult income dataset. The adult income dataset (Becker & Kohavi, 1996) was extracted from the 1994 census bureau database. The typical learning task is to predict whether income exceeds $50K/yr based on other attributes in the census data. Thus, we use the attributes age, workclass, education, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country as the contexts x (d 12), and use income (i.e., whether income exceeds $50K/yr) as the samples y from target CDFs. We standardize all of the ordinal attributes. The total number of samples is n 48, 842. Since the samples follow Bernoulli distributions, KDE is not considered. We apply our estimator (2), MLE, and ECDF on this dataset. For our estimator and MLE, we assume model (1) and use the following mixtures of logistic and probit models as basis CDFs: ϕipx, tq w FBpt; flogipβp1q L,ixi βp0q L,iqq p1 wq FBpt; FNpβp1q P,ixi βp0q P,i; 0, 1qq, i P rds, (29) where t P R, x px1, . . . , xdq denotes the context, FBp ; pq denotes the CDF of the Bernoulli distribution with parameter p, flogipaq : 1{p1 e aq for any a P R, w is the weight of the logistic model, βp1q L,i (βp0q L,i) denotes the coefficient (intercept) in the logistic model of xi, and βp1q P,i (βp0q P,i) denotes the coefficient (intercept) in the probit model of xi. We split the whole dataset into subsets of fractions 1{3, 1{2, and 1{6. 1{3 data points are used to estimate the coefficients and intercepts in (29) with the function glm in the R package stats . We apply all methods on the second subset (training dataset) of 1{2 data points. For our estimator, we calculate pθλ using (3) with S r0, 1s, m Lebpr0, 1sq, and λ 0.1, 1, 5. For MLE, we also use the solver Published in Transactions on Machine Learning Research (02/2024) ECDF MLE 0.1 1 5 estimators ECDF MLE 0.1 1 5 estimators ECDF MLE 0.1 1 5 estimators Figure 7: Box plots of L2-errors in adult income data experiments. ECDF refers to the empirical CDF defined in (26). 0.1 , 1.0 , and 5.0 refer to the estimator pθλ (2) with λ 0.1, 1.0, and 5.0 respectively. SCS in the R package CVXR (Fu et al., 2020) to calculate pθMLE as in the previous example. We use p FE to denote the ECDF calculated by (26) using the training dataset. Then, we calculate L2-errors (28) with S r0, 1s and m Lebpr0, 1sq for the three methods on the third subset (test dataset) of 1/6 data points. We run the experiments described above using w 0, 0.5, and 1 in (29) 100 times with the dataset permuted randomly in each run to get stable results. In Figure 7, we report the calculated L2-errors in box plots. According to the figure, our estimator (2) achieves the smallest L2-errors for all choices of λ and weight w and ECDF has the largest L2-error for all choices of w. Moreover, the performance of our estimator is very robust wrt λ and w. Thus, with a wide range of the basis contextual CDFs, our estimator (2) achieves good and robust performance in approximating target data distributions. 8 Conclusion In this paper, we propose a linear model for contextual CDFs and estimators for the coefficient parameter in this model. We prove r Op a d{nq upper bounds on the estimation error of our estimator under the adversarial and random settings, and show that the upper bounds are tight up to logarithmic factors by proving Ωp a d{nq information theoretic lower bounds. Additionally, when a mismatch exists in the linear model, we prove that the estimation error of our estimator only increases by an amount commensurate with the mismatch error. Furthermore, we increase the generality of our linear model by expanding the parameter space into an infinite dimensional Hilbert space. Within this framework, we generalize our estimator and subsequently establish self-normalized upper bounds for this general estimator. Moreover, we elucidate the scaling of the estimation error of our estimator empirically and showcase its practical utility on real-world datasets. Our current work assumes that the bases are known a priori. So, a fruitful future research direction would be to focus on the basis selection problem for CDF regression with possibly infinitely many base functions. More generally, it is a promising future direction to consider the adaptive setting where the learner seeks to learn both the basis Φ and the weight parameter θ in (1) by querying samples from ϕi s and F θJ Φ. A closely related challenge is presented in multi-distribution learning where the learner strategically queries samples from different distributions with the objective of minimizing the expected risk uniformly across all distributions (Haghtalab et al., 2022; Awasthi et al., 2023). Leveraging our existing results on estimating θ based on Φ, we can decompose this learning problem into two subproblems: (i) the estimation of the basis CDFs ϕi s from their respective samples, and (ii) the planning of the queries to samples from different distributions. As discussed in related works, numerous existing results address the estimation of a single contextual CDF based on certain distributional assumptions, placing the primary challenge in solving (ii). A natural idea is to initiate from the offline algorithm, akin to Wang et al. (2022): for each i P rds, the learner queries a pre-specified number of samples from each distribution to estimate ϕi, and then queries samples from F to estimate θ with the learned Φ, where the allocation of queries to each distribution is determined Published in Transactions on Machine Learning Research (02/2024) by minimizing the upper bound of the estimation error of θ under the constraint of a constant sum. Though this approach is straightforward, we can conjecture whether the offline algorithm is optimal, considering that the learning of θ may not contribute much to the learning of each individual basis CDF. Moving forward, we can explore online algorithms which adaptively determine the next oracle to query. The design of the online algorithms in Wang et al. (2022) and Awasthi et al. (2023) are anticipated to offer valuable guidance in this direction. Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011a. Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online least squares estimation with self-normalized processes: An application to bandit problems. ar Xiv preprint ar Xiv:1102.2670, 2011b. Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical finance, 9(3):203 228, 1999. Pranjal Awasthi, Nika Haghtalab, and Eric Zhao. Open problem: The sample complexity of multi-distribution learning for vc classes. In The Thirty Sixth Annual Conference on Learning Theory, pp. 5943 5949. PMLR, 2023. Kamyar Azizzadenesheli. Importance weight estimation and generalization in domain adaptation under label shift, 2020. URL https://arxiv.org/abs/2011.14251. Necdet Batir. Inequalities for the gamma function. Archiv der Mathematik, 91(6):554 563, Dec 2008. ISSN 1420-8938. doi: 10.1007/s00013-008-2856-9. URL https://doi.org/10.1007/s00013-008-2856-9. Amir Beck. Introduction to nonlinear optimization: Theory, algorithms, and applications with MATLAB. SIAM, 2014. Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20. Donald A Berry, Robert W Chen, Alan Zame, David C Heath, and Larry A Shepp. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103 2116, 1997. Dimitri Bertsekas, Angelia Nedic, and Asuman Ozdaglar. Convex analysis and optimization, volume 1. Athena Scientific, 2003. Vladimir Bogachev. Measure Theory, volume 2. 01 2007. ISBN 978-3-540-34513-8. doi: 10.1007/ 978-3-540-34514-5. Francesco Paolo Cantelli. Sulla determinazione empirica delle leggi di probabilita. Giorn. Ist. Ital. Attuari, 4 (421-424), 1933. Asaf Cassel, Shie Mannor, and Assaf Zeevi. A general framework for bandit problems beyond cumulative objectives. Mathematics of Operations Research, 2023. Victor Chernozhukov, Iván Fernández-Val, and Blaise Melly. Inference on counterfactual distributions. Econometrica, 81(6):2205 2268, 2013. Yeonseung Chung and David B Dunson. Nonparametric bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association, 104(488):1646 1660, 2009. Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013. DLMF. NIST Digital Library of Mathematical Functions. http://dlmf.nist.gov/, Release 1.1.5 of 2022-03-15. URL http://dlmf.nist.gov/. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. Mc Clain, eds. Published in Transactions on Machine Learning Research (02/2024) Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pp. 642 669, 1956. R.M. Fano. Transmission of Information: A Statistical Theory of Communication. MIT Press Classics. MIT Press, 1961. ISBN 9780262561693. Frédéric Ferraty, Ali Laksaci, and Philippe Vieu. Estimating some characteristics of the conditional distribution in nonparametric functional models. Statistical Inference for Stochastic Processes, 9:47 76, 2006. Anqi Fu, Balasubramanian Narasimhan, and Stephen Boyd. CVXR: An R package for disciplined convex optimization. Journal of Statistical Software, 94(14):1 34, 2020. doi: 10.18637/jss.v094.i14. Valery Glivenko. Sulla determinazione empirica delle leggi di probabilita. Gion. Ist. Ital. Attauri., 4:92 99, 1933. Nika Haghtalab, Michael Jordan, and Eric Zhao. On-demand sampling: Learning optimally from multiple distributions. Advances in Neural Information Processing Systems, 35:406 419, 2022. Peter Hall, Rodney CL Wolff, and Qiwei Yao. Methods for estimating a conditional distribution function. Journal of the American Statistical association, 94:154 163, 1999. Hans Hersbach. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15(5):559 570, 2000. doi: https://doi.org/10.1175/1520-0434(2000)015<0559: DOTCRP>2.0.CO;2. URL https://journals.ametsoc.org/view/journals/wefo/15/5/1520-0434_ 2000_015_0559_dotcrp_2_0_co_2.xml. Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963. Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17(none):1 6, 2012a. doi: 10.1214/ECP.v17-2079. URL https://doi.org/10.1214/ECP.v17-2079. Daniel Hsu, Sham M Kakade, and Tong Zhang. Random design analysis of ridge regression. In Conference on learning theory, pp. 9 1. JMLR Workshop and Conference Proceedings, 2012b. Audrey Huang, Leqi Liu, Zachary Lipton, and Kamyar Azizzadenesheli. Off-policy risk assessment in contextual bandits. Advances in Neural Information Processing Systems, 34, 2021. Audrey Huang, Liu Leqi, Zachary C Lipton, and Kamyar Azizzadenesheli. Off-policy risk assessment for markov decision processes. In Artificial Intelligence and Statistics, 2022. Nathan Kallus, Xiaojie Mao, and Masatoshi Uehara. Localized debiased machine learning: Efficient inference on quantile treatment effects and beyond. ar Xiv preprint ar Xiv:1912.12945, 2019. Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: journal of the Econometric Society, pp. 33 50, 1978. Roger Koenker, Samantha Leorato, and Franco Peracchi. Distributional vs. quantile regression. 2013. Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. doi: 10.1017/ 9781108571401. Mikhail Lifshits. Lectures on gaussian processes. In Lectures on Gaussian Processes, pp. 1 117. Springer, 2012. Leqi Liu, Audrey Huang, Zachary Lipton, and Kamyar Azizzadenesheli. Supervised learning with general risk functionals. In International Conference on Machine Learning, pp. 12570 12592. PMLR, 2022. Published in Transactions on Machine Learning Research (02/2024) Anuran Makur. Information Contraction and Decomposition. Sc.D. thesis in Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA, May 2019. Anuran Makur and Lizhong Zheng. Comparison of contraction coefficients for f-divergences. Problems of Information Transmission, 56(2):103 156, April 2020. Pascal Massart. The tight constant in the dvoretzky-kiefer-wolfowitz inequality. The annals of Probability, pp. 1269 1283, 1990. Shibu Mohapatra. California House Price, 2022. Retrieved August 2023 from https://www.kaggle.com/ datasets/shibumohapatra/house-price. Douglas C Montgomery, Elizabeth A Peck, and G Geoffrey Vining. Introduction to linear regression analysis. John Wiley & Sons, 2021. F. Osorio and T. Wolodzko. Routines for L1 estimation, 2023. URL http://l1pack.mat.utfsm.cl. R package version 0.41-24. Victor H Peña, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media, 2008. Bernardo Ávila Pires and Csaba Szepesvári. Statistical linear estimation with penalized estimators: an application to reinforcement learning. In Proceedings of the 29th International Coference on International Conference on Machine Learning, 2012. Barnabás Póczos, Aarti Singh, Alessandro Rinaldo, and Larry Wasserman. Distribution-free distribution regression. In artificial intelligence and statistics. PMLR, 2013. LA Prashanth, Cheng Jie, Michael Fu, Steve Marcus, and Csaba Szepesvári. Cumulative prospect theory meets reinforcement learning: Prediction and control. In International Conference on Machine Learning, pp. 1406 1415. PMLR, 2016. Michael Reed and Barry Simon. Vi - bounded operators. In I: Functional Analysis, pp. 182 220. Elsevier Inc, 1972. ISBN 9780125850018. Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527 535, 1952. R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of risk, 2: 21 42, 2000. Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2014. P. Stein. A note on the volume of a simplex. The American Mathematical Monthly, 73(3):299 301, 1966. ISSN 00029890, 19300972. URL http://www.jstor.org/stable/2315353. Francis Edward Su. Methods for quantifying rates of convergence for random walks on groups. Harvard University, 1995. Zoltán Szabó, Bharath K Sriperumbudur, Barnabás Póczos, and Arthur Gretton. Learning theory for distribution regression. The Journal of Machine Learning Research, 17(1):5272 5311, 2016. Ichiro Takeuchi, Quoc V Le, Timothy D Sears, and Alexander J Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7(45):1231 1264, 2006. William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285 294, 1933. ISSN 00063444. URL http://www.jstor. org/stable/2332286. Published in Transactions on Machine Learning Research (02/2024) Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389 434, 2012. Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018. doi: 10.1017/9781108231596. Yifei Wang, Tavor Baharav, Yanjun Han, Jiantao Jiao, and David Tse. Beyond the best: Distribution functional estimation in infinite-armed bandits. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 9262 9273. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/ file/3c2fd72a28fb98facf98546727320249-Paper-Conference.pdf. Hermann Weyl. Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differentialgleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung). Mathematische Annalen, 71(4):441 479, 1912. Julia L Wirch and Mary R Hardy. Distortion risk measures: Coherence and stochastic dominance. In International congress on insurance: Mathematics and economics, 2001. William Wong, Audrey Huang, Liu Leqi, Kamyar Azizzadenesheli, and Zachary C Lipton. Riskyzoo: A library for risk-sensitive supervised learning. 2022. Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pp. 4532 4576. PMLR, 2021. Published in Transactions on Machine Learning Research (02/2024) A Discussion on the minimax lower bound for the estimation of CDFs First, for any contextual CDFs F1 and F2, define the uniform KS distance by KSp F1, F2q : sup x PX KSp F1px, q, F2px, qq. Similar to the minimax ℓ2-risk defined in (12), we can define the minimax risk in terms of the uniform KS distance for the estimation of the contextual CDF F. For any distribution family Q and the contextual CDF function Ξ : Q Ñ r0, 1s XˆR, the minimax risk in terms of the uniform KS distance is defined as RpΞp Qq; KSq : inf ˆΞ sup QPQ Ez Qr KSpˆΞpzq, Ξp Qqqs. We follow the notation in Section 4. With a slight abuse of notation, let Fp Pq θp Pq JΦ. For the random setting, define the distribution family P0 : bn j 1 P pjq X PY |X;θ : θ P Rd, P pjq X P DX such that µminpΣnq 0 ( . Then, we have the following results. Proposition 20. For any sequence x1:n pxp1q, . . . , xpnqq P X n such that µminp Unq 0, we have Rp Fp Px1:nq; KSq Ωp1q . (30) For the random setting (Scheme II), we have Rp Fp P0q; KSq Ωp1q . (31) Proof of Proposition 20. According to the discussion below Theorem 8, the discussion above Corollary 9, and Appendix C.2, it suffices to show (30) under the fixed design setting. Let us consider the fixed design setting where ϕipx, q are the CDFs of Bernoulli distributions for i P rds, d ě 2. Let qji denote the zero probability of the Bernoulli distribution with CDF ϕipxpjq, q Φjip q. We set S r0, 1s and m Leb. Then, we have Un řn j 1 qjq J j where qj rqj1, . . . , qjds J. For any θ P d 1, we have Fpxpjq, tq θJ qj under model (1) for any t P r0, 1q. Suppose that qji q11 P r0, 1s for any i P rds and j P rns. Then for any θ P r0, 1s, the samples ypjq s for j P rns are generated from the same distribution which is the Bernoulli distribution with success probability 1 q11. We have µminp Unq 0. Thus, the condition of the proposition is satisfied. For n 1, suppose that qn 1,1 1 and qn 1,2 0. Then, for any estimate ˇFn of F, we have ˇFnpxpn 1q, 1{2q P r0, 1s. If ˇFnpxpn 1q, 1{2q P r0, 1{2s, consider the case where θ θp1q r1, 0, . . . , 0s J. Then, we have | ˇFnpxpn 1q, 1{2q Fpxpn 1q, 1{2q| ě 1{2. If ˇFnpxpn 1q, 1{2q P p1{2, 1s, consider the case where θ θp2q r0, 1, . . . , 0s J. Then, we also have | ˇFnpxpn 1q, 1{2q Fpxpn 1q, 1{2q| ě 1{2. Thus, we have Rp Fp Px1:nq; KSq inf ˇ Fn sup P PPx1:pnq KSp ˇFn, Fq Ωp1q. Thus, the minimax risk in terms of the uniform KS distance of any estimate of F is Ωp1q. Recall that according to the discussion at the end of Section 3.2, for the plug-in estimate p Fλ of F using our projected estimator rθλ, we have the r Opmint1, d{ a 1 µminp Unquq upper bound in terms of the uniform KS distance. Proposition 20 implies that this plug-in estimate p Fλ is minimax optimal when µminp Unq 0. It is worth noting that with the assumption that µminp Unq Θpnq, the r Opmint1, d{ a 1 µminp Unquq r Opd{?nq upper bound of p Fλ implies that the minimax lower bound in estimating F is improved. Thus, we can see that µminp Unq or µminpΣnq plays an important role in the estimation of F. Published in Transactions on Machine Learning Research (02/2024) B Proofs of upper bounds for the finite dimensional model We first briefly expand on the notation for the proofs of our theoretical results. For any topological space A, let Bp Aq denote the Borel σ-algebra of A. For any two measurable spaces, p A1, A1q and p A2, A2q, a function f : A1 Ñ A2 is A1{A2-measurable if for any E P A2, we have f 1p Eq P A1. When A2 is the Borel σ-algebra on A2, we sometimes write f is A1-measurable to mean that f is A1{A2-measurable for brevity. When A1 is the Borel σ-algebra on A1 and A2 is the Borel σ-algebra on A2, we sometimes simply write f is measurable to mean that f is A1{A2-measurable for brevity. For any two σ-finite measure spaces p A1, A1, ν1q and p A2, A2, ν2q, let A1 ˆ A1 : tpa1, a2q : a1 P A1, a2 P A2u denote the product space, let A1 b A2 : σpt E1 ˆ E2 : E1 P A1, E2 P A2uq denote the product σ-algebra of A1 and A2 on A1 ˆ A2, and let ν1 bν2 denote the product measure of ν1 and ν2 on p A1 ˆA2, A1 b A2q (i.e., ν1 bν2p E1 ˆE2q ν1p E1qν2p E2q for any E1 P A1 and E2 P A2) whose existence is guaranteed by Carathéodory s extension theorem. Then, p A1 ˆ A2, A1 b A2, ν1 b ν2q is the product measure space of p A1, A1, ν1q and p A2, A2, ν2q. When A1 A2 and A1 A2, we will write A2 1 to represent A1 b A1. When A1 A2, A1 A2, and ν1 ν2, we will write ν2 1 to represent ν1 b ν1. Note that according to our assumptions, X is a Polish space equipped with the Borel σ-algebra Bp Xq, ϕi : X ˆ S Ñ r0, 1s is p Bp Xq b Bp Sqq{Bpr0, 1sq-measurable for each i P rds, and e : X ˆ S Ñ r 1, 1s is p Bp Xq b Bp Sqq{Bpr 1, 1sq-measurable. In the proofs of the main results, we consider an arbitrary probability measure m on p S, Bp Sqq. Since there is no ambiguity, for brevity, we omit dm in the notation for integrals. Note that some quantities defined below depend on the chosen probability measure m. B.1 Proofs of Theorem 1 and Proposition 2 In the proofs of Theorem 1 (Appendix B.1.1) and Proposition 2 (Appendix B.1.2), we use the following measure-theoretic treatment of probability spaces. (The notation we use can be found at the beginning of Appendix B.) The underlying probability space for the sample tpxpjq, ypjqquj PN is pr0, 1s N, Bpr0, 1sq N, Pq, where r0, 1s N tpξp1q, ξp2q, . . . q : ξpjq P r0, 1su, and, Bpr0, 1sq N : σpt B1 ˆ ˆ Bn : B1, . . . , Bn P Bpr0, 1sq, n P Nuq is the σ-algebra generated by all finite products of Borel sets on r0, 1s, and P|r0,1sn Lebn bn j 1Leb with Leb being the Lebesgue measure on pr0, 1s, Bpr0, 1sqq. The existence of the above probability space is guaranteed by Kolmogorov s extension theorem. Define the random vector Ξ pΞpjqqj PN on pr0, 1s N, Bpr0, 1sq Nq to be the identity mapping, i.e., Ξ : r0, 1s N Ñ r0, 1s N, pξpjqqj PN ÞÑ pξpjqqj PN. Then, P is also the probability measure on pr0, 1s N, Bpr0, 1s Nqq induced by Ξ, and Ξ follows the uniform distribution on r0, 1s N. Suppose tpxpjq, ypjqquj PN is sampled according to Scheme I with F defined in (1). Then, according to Bogachev (2007, Proposition 10.7.6), for each j P N, there exist some p Bp Xq b Bp Sqqj 1 b Bpr0, 1sq{Bp Xq-measurable function, hpjq X : p X ˆ Sqj 1 ˆ r0, 1s Ñ X, and p Bp Xq b Bp Sqqj 1 b Bp Xq b Bpr0, 1sq{Bp Sq-measurable function hpjq Y : p X ˆ Sqj 1 ˆ X ˆ r0, 1s Ñ S xpjq hpjq X pxp1q, yp1q, . . . , xpj 1q, ypj 1q, Ξp2j 1qq, ypjq hpjq Y pxp1q, yp1q, . . . , xpj 1q, ypj 1q, xpjq, Ξp2jqq, E 1 ! hpjq Y pxp1q, yp1q, . . . , xpj 1q, ypj 1q, xpjq, Ξp2jqq ď t ) ˇˇFj 1 ı θJ Φpxpjq, tq (32) Published in Transactions on Machine Learning Research (02/2024) for any t P S and j P N, where Fj : σ Ξpkq : k P r2j 1s ( is the sub σ-algebra of Bpr0, 1sq N generated by the random variables Ξp1q, . . . , Ξp2j 1q. By definition, we have that ypjq is Fj{Bp Sq-measurable for each j P N and t Fju8 j 0 forms a filtration of pr0, 1s N, Bpr0, 1sq N, Pq. Therefore, ypjq( j PN is t Fjuj PN-adapted. By the above construction, for each j P N, xpjq : r0, 1s N Ñ X, ξ ÞÑ xpjqpξq is a Fj 1{Bp Xq-measurable function. Thus, for each j P N, the function rh X : r0, 1s NˆS Ñ X ˆS, pξ, tq ÞÑ pxpjqpξq, tq is p Fj 1b Bp Sqq{p Bp Xqb Bp Sqqmeasurable. Since ϕi : X ˆ S Ñ r0, 1s, px, tq ÞÑ ϕipx, tq is Bp Xq b Bp Sq{Bpr0, 1sq-measurable, we know that rϕpjq i : r0, 1s N ˆ S Ñ r0, 1s, pξ, tq ÞÑ ϕipxpjqpξq, tq is p Fj 1 b Bp Sqq{Bpr0, 1sq-measurable. Therefore, the vectorvalued function Φj : r0, 1s N ˆS Ñ r0, 1sd, pξ, tq ÞÑ rϕ1pxpjqpξq, tq, . . . , ϕdpxpjqpξq, tqs rrϕpjq 1 pξ, tq, . . . , rϕpjq d pξ, tqs is p Fj 1 b Bp Sqq{Bpr0, 1sdq-measurable for each j P N. B.1.1 Proof of Theorem 1 Proof. Define Vj : ş S IypjqΦj ş S θJ ΦjΦj. Since we have proved above that for each j P N, ypjq is Fjmeasurable and the function S ˆ S Q py, tq ÞÑ 1ty ď tu P r0, 1s is Bp Sq2-measurable, we have that Iypjq : r0, 1s N ˆ S Ñ r0, 1s, Iypjqpξ, tq 1typjqpξq ď tu is Fj b Bp Sq-measurable. Since we have also proved above that for each j P N, Φj is Fj 1 b Bp Sq-measurable, by Fubini s theorem and (32), we have that ş S θJ ΦjΦj is Fj 1-measurable, Vj is Fj-measurable, and Er Vj|Fj 1s E ż S IypjqΦj ˇˇFj 1 S E Iypjq ˇˇFj 1 Φj ż S θJ ΦjΦj ż For any α P Rd, define M0pαq 1. Then, M0pαq is F0-measurable for any α P Rd. For n P N, define Mnpαq : exp αJWn 1 2}α}2 Un ( with Wn : řn j 1 Vj and Un řn j 1 ş S ΦjΦJ j . Since Φj is Fj 1 b Bp Sqmeasurable and Vj is Fj-measurable, by Fubini s theorem, Un is Fn 1-measurable and Wn is Fn-measurable for each n P N. Thus, Mnpαq is also Fn-measurable for any α P Rd and n P N. Moreover, note that the function Rd ˆ Rd ˆ Rdˆd Ñ p0, 8q, pα, W, Uq ÞÑ exp αJW 1 2}α}2 U ( is measurable. Hence, Mn : r0, 1s N ˆ Rd Ñ p0, 8q, pξ, αq ÞÑ exp αJWnpξq 1 2}α}2 Unpξq ( is Fn b Bp Rdq-measurable. Thus, for any α P Rd, t Mnpαquně0 is t Fnuně0-adapted. Besides, for any α P Rd and n P N, we have Er Mnpαq|Fn 1s Mn 1pαq E exp " αJVn 1 α *ˇˇˇˇFn 1 Mn 1pαq E exp αJVn ( |Fn 1 exp ! 1 2 ş S pαJΦnq2) . (34) S |αJΦn| ď αJVn ď ş S |αJΦn| almost surely (a.s.), we have E exp αJVn ( |Fn 1 ď exp S |αJΦn| 2+ αJΦn 2* , (36) where (35) follows from Hoeffding s lemma (Hoeffding, 1963), and (36) follows from the Cauchy-Schwarz inequality and the fact that ş S 1 mp Sq 1. Then, by (34) and (36), we have Er Mnpαq|Fn 1s ďMn 1pαq exp ! 1 2 ş exp ! 1 2 ş S pαJΦnq2) Mn 1pαq. (37) Published in Transactions on Machine Learning Research (02/2024) Since M0pαq 1 and Mnpαq ě 0, for any α P Rd, t Mnpαquně0 is a super-martingale. Now for any n ě 0, define Ď Mn : ş Rd Mnpαqhpαqdα, with dα denoting Lebpdαq where the Lebesgue measure is on p Rd, Bp Rdqq and 2 αJα * ˆ λ Recall that Unpλq Un λId. Then, for n ě 1, we have Rd exp " αJWn 1 2}Wn}2 Unpλq 1 1 2}α Unpλq 1Wn}2 Unpλq detp Unpλqq 1 2 exp ˆ1 2}Wn}2 Unpλq 1 p2πq d 2 detp Unpλqq 1 2}α Unpλq 1Wn}2 Unpλq detp Unpλqq 1 2 exp ˆ1 2}Wn}2 Unpλq 1 where (39) follows from the calculation below: }Wn}2 Unpλq 1 }α Unpλq 1Wn}2 Unpλq }Wn}2 Unpλq 1 αJ W J n Unpλq 1 Unpλq α Unpλq 1Wn }Wn}2 Unpλq 1 }α}Unpλq }Wn}2 Unpλq 1 2αJWn 2αJWn }α}λId }α}Un. (41) For n 0, Ď M0 ş Rd M0pαqhpαqdα ş Rd hpαqdα 1. Moreover, since we have shown that Mn is Fn b Bp Rdq-measurable, by Fubini s theorem and (37), Ď Mn is Fn-measurable for any n ě 0 and for any n P N, E Ď Mn|Fn 1 E ż Rd Mnpαqhpαqdα ˇˇˇˇFn 1 Rd E r Mnpαq|Fn 1s hpαqdα Rd Mn 1pαqhpαqdα Ď Mn 1. (42) Thus, tĎ Mnuně0 is also a super-martingale. By Doob s maximal inequality for super-martingales, P sup n PN Ď Mn ě δ ȷ ď ErĎ M0s which, together with (40), implies that Dn P N s.t. }Wn}Unpλq 1 ě log detp Unpλqq Published in Transactions on Machine Learning Research (02/2024) S ΦjΦJ j λId S ΦjΦJ j θ λθ by (3), we have pθλ θ Unpλq 1 n ÿ Unpλq 1Wn Unpλq 1pλθ q. Thus, by the triangle inequality, }pθλ θ }Unpλq ď }Unpλq 1Wn}Unpλq λ}Unpλq 1θ }Unpλq }Wn}Unpλq 1 }λθ }Unpλq 1 ď }Wn}Unpλq 1 ? λ}θ }, (45) where the last inequality follows from the facts that Unpλq 1 1 λ I Unpλq 1Un and }I Unpλq 1Un}2 ď 1. By (43) and (45), with probability at least 1 δ, for all n P N, we have }pθλ θ }Unpλq ď log detp Unpλqq λ}θ }. (46) By the arithmetic mean-geometric mean (AM GM) inequality, we have log detp Unpλqq ď d log ˆtrace p Unpλqq S ΦjΦJ j λId S ΦjΦJ j λId S trace ΦjΦJ j log detp Unpλqq ď d log ˆ1 d pdλ ndq d log pλ nq . (47) By (46) and (47), for any λ ą 0, δ P p0, 1q, with probability at least 1 δ, for all n P N, we have }pθλ θ }Unpλq ď λ}θ }. (48) Thus, Theorem 1 is proved for any probability measure m on p S, Bp Sqq. B.1.2 Proof of Proposition 2 Proof. When UN is non-singular for some fixed N P N, since ş S ΦjΦJ j is positive semi-definite for any j P N, it immediately follows that Un are non-singular for any n ě N. Then, pθ is unique and is given by (3) with λ 0 for any n ě N, i.e., Published in Transactions on Machine Learning Research (02/2024) for any n ě N. Since we have pθ θ U 1 n Wn. (50) By definition and the triangle inequality for integrals, we have S |Iypjq θJ Φj|}Φj} ď ż which also implies that j 1 Er}Vj}2|Fj 1s ď j 1 d nd. (52) Since Wn řn j 1 Vj, by (33), (51), (52), and Hsu et al. (2012a, Proposition 1.2), we have 8nda p4{3q ? for any a ą 0. Thus, for any δ P p0, 1q and n P N, with probability at least 1 δ, we have Since Un is positive definite, by (53), we have }Wn}U 1 n b W J n U 1 n Wn ď }Wn} a µminp Unq ď µminp Unq (54) with probability at least 1 δ. Hence, by (50), and (54), we have that for any n ě N, }pθ θ }Un }U 1 n Wn}Un }Wn}U 1 n ď with probability at least 1 δ. In conclusion, Proposition 2 is proved for any probability measure m on p S, Bp Sqq. B.2 Proofs of Theorem 4 and Proposition 5 In this section, we follow the same construction of the probability space as in Appendix B.1. In particular, noting that Scheme II is a special case of Scheme I, we consider the underlying probability space for the sample tpxpjq, ypjqquj PN to be pr0, 1s N, Bpr0, 1sq N, Pq. Define the random vector Ξ to be the identity mapping from r0, 1s N onto itself as in Appendix B.1. Then, Ξ follows the uniform distribution on r0, 1s N. Suppose tpxpjq, ypjqquj PN is sampled according to Scheme II with F defined in (1). Then, according to Bogachev (2007, Proposition 10.7.6), for each j P N, there exist some Bpr0, 1sq{Bp Xq-measurable function hpjq X : r0, 1s Ñ X and Bp Xq b Bpr0, 1sq{Bp Sq-measurable function hpjq Y : X ˆ r0, 1s Ñ S such that xpjq hpjq X pΞp2j 1qq, ypjq hpjq Y pxpjq, Ξp2jqq, and E 1 ! hpjq Y pxpjq, Ξp2jqq ď t )ˇˇˇFj 1 ı θJ Φpxpjq, tq for any t P S and j P N, where Fj : σ Ξpkq : k P r2j 1s ( is the sub σ-algebra of Bpr0, 1sq N generated by the random variables Ξp1q, . . . , Ξp2j 1q. With the same proof provided at the beginning of Appendix B.1, ypjq( j PN is t Fjuj PN-adapted and Φj is p Fj 1 b Bp Sqq{Bpr0, 1sdq-measurable for each j P N. Moreover, txpjquj PN is independent, which implies that tΦjptquj PN is independent for any t P S and typjquj PN is independent. Published in Transactions on Machine Learning Research (02/2024) B.2.1 Proof of Theorem 4 Proof. By definition and Fubini s theorem, we have Σpjq E ş S E ΦjΦJ j for each j P rns, Σn řn j 1 Σpjq E r Uns. For the proof, we need to define n : Σ 1 2 n p Un Σnq Σ 1 2 n , rΣpjq n : Σ 1 2 n ΣpjqΣ 1 2 n , and rΦjptq : Σ 1 2 n Φjptq for any t P R and j P rns. For any j P N, we have }Σpjq}2 µmax Σpjq µmax ȷ ď d. (55) By the assumption that µminpΣpjqq ě σmin for all j P N and Weyl s inequality (Weyl, 1912), we have µmin pΣnq ě nσmin. (56) By (55) and (56), for each j P rns, we have µmax rΣpjq n ď µmax Σpjq µmin pΣnq ď d nσmin . (57) Consider the following random matrix for j P rns: S rΦj rΦJ j rΣpjq n Σ 1 S ΦjΦJ j Σpjq Σ 1 We have that j 1 Zj, (58) and for any j P rns, we have, Er Zjs 0, (59) and, furthermore, we have, }Zj}2 maxtµmaxp Zjq, µminp Zjqu ď max " µmax S rΦj rΦJ j , µmax rΣpjq n * ď d nσmin (60) where (60) follows from (57) and S rΦj rΦJ j S }rΦj}2 ď 1 µminpΣnq S }Φj}2 ď d nσmin . By (58), (59), (60), and Tropp (2012, Theorem 1.3), we have P rµmin p nq ď as ď d exp ˆ nσ2 mina2 for any a ě 0. Thus, with probability at least 1 δ, µmin p nq ě d σmin Published in Transactions on Machine Learning Research (02/2024) Since n Σ 1 2 n Id, we have µminpΣ 1 2 n q µminp nq 1 which together with the fact that 1 2n implies that µminp Unq ě µminpΣnqµminpΣ 1 2 n q µminpΣnqpµminp nq 1q. (63) By (63), we have µminp nq ě 1 2 ùñ µminp Unq ě 1 2µminpΣnq ě n 2 σmin ą 0. (64) Note that when Un is positive definite, we have 1 2n U 1 n Σ 1 2n U 1 n Σ 2 n 1 2 }p Id nq 1}2 1 µminp Id nq 1 1 µmin p nq. (65) By (64) and (65), we have µminp nq ě 1 2 ùñ µminp Unq ě n 2 σmin and }U 1 2 n }2 ď 2. (66) By (62), for any δ P p0, 1q, if n ě 32d2 σ2 min logpd{δq, we have µminp nq ě 1 2 with probability at least 1 δ. Then, by (66), we have 2 n }2 ď 2 and µminp Unq ě n 2 σmin. (67) with probability at least 1 δ. Still define Wn : řn i 1 ş S IypjqΦj ş S θJ ΦjΦj . By (50), we have pθ θ U 1 n Wn and }pθ θ }Un }Wn}U 1 n . Published in Transactions on Machine Learning Research (02/2024) By (7), (67), and the union bound, for any δ1 P p0, 1q and δ2 P p0, 1 δ1q, if n ě 32d2 σ2 min log d δ1 , we have }pθ θ }Σn b 2 n }2}Wn}2 U 1 n 2 n }2}pθ θ }2 Un ď ? with probability at least 1 δ1 δ2. By letting δ1 δ2 δ, (8) is proved. In conclusion, Theorem 4 is proved for any probability measure m on p S, Bp Sqq. B.2.2 Proof of Proposition 5 Proof. Since 2 n UnpλqΣ 1 2 n pΣn λId Un Σnq Σ 1 2 n Id λΣ 1 n n and λΣ 1 n is positive semi-definite for any λ ě 0, we have 1 2n Unpλq 1Σ 2 n UnpλqΣ 1 2 n 1 2 }p Id λΣ 1 n nq 1}2 1 µminp Id λΣ 1 n nq ď 1 1 µmin p nq. (68) 1 2n Unpλq 1Σ 1 2n Unpλq 1 1 2n Unpλq 1 2 Σn Unpλq 1 1 2n Unpλq 1 1 2n Unpλq 1 by (68), we have 2 Σn Unpλq 1 1 2n Unpλq 1Σ 1 2n}2 ď 1 1 µmin p nq. Define Rn řn i 1 Vj λθ where Vj : ş S IypjqΦj ş S θJ ΦjΦj. Then, by (3) and (44), we have pθλ θ Unpλq 1Rn. Published in Transactions on Machine Learning Research (02/2024) }pθλ θ }2 Σn RJ n Unpλq 1 2 Σn Unpλq 1 2 Σn Unpλq 1 2 }2}Rn}2 Unpλq 1 2 Σn Unpλq 1 2 }2}pθλ θ }2 Unpλq. By the above inequality, (62), and (6) in Theorem 1, for any n P N, δ1 P p0, 1q, δ2 P p0, 1 δ1q, we have }pθλ θ }Σn ď }pθλ θ }Unpλq a with probability at least 1 δ1 δ2. Then, when n ě 32d2 σ2 min logpd{δ1q, by the above inequality, we have }pθλ θ}Σn ď 2 ˆ d log 1 n 2λ}θ } (69) with probability at least 1 δ1 δ2. Thus, (9) is obtained from (69) by setting δ1 δ2 δ P p0, 1{2q. Proposition 5 is proved for any probability measure m on p S, Bp Sqq. B.3 Proof of Theorem 7 Proof. Notice that by Fubini s theorem, we have Er Unpλqs Σnpλq and S ΦjΦJ j θ dm S ΦjΦJ j dm Similar to the proof of Azizzadenesheli (2020, Lemma 4.3), by Pires & Szepesvári (2012, Theorem 3.4), we have that with probability at least 1 δ, }Σnpλqqθλ Σnθ } ď}Σnpλqθ Σnθ } 2 U n pδq}θ } 2}un Eruns} pλ 2 U n pδqq}θ } 2}un Eruns}. }Σnpλqqθλ Σnθ } }Σnpqθλ θ q λqθλ}, }Σnpqθλ θ q} ď λ}qθλ} pλ 2 U n pδqq}θ } 2}un Eruns} which also implies that }qθλ θ } ď 1 µminpΣnq λ}qθλ} pλ 2 U n pδqq}θ } 2}un Eruns} ı (70) Note that for any j P rns, S IypjqΦj ż S E θJ ΦjΦj ď ? Published in Transactions on Machine Learning Research (02/2024) According to Hsu et al. (2012a, Proposition 1.2), for any δ P p0, 1q and n P N, with probability at least 1 δ, we have }un Eruns} ď ? 8nd logp1{δq 4 d logp1{δq. Then, by (70), for any δ P p0, 1q, δ1 P p0, 1 δq, and any λ ě 0, with probability at least 1 δ δ1, we have }qθλ θ } ď 1 µminpΣnq λ}qθλ} pλ 2d a 8n logpd{δ1qq}θ } 8nd logp1{δ2q 4 d logp1{δ2q ı . For λ 0, we have }qθ θ } ď 1 µminpΣnq 8n logpd{δ1q}θ } 2 ˆ? 8nd logp1{δ2q 4 d logp1{δ2q ȷ . Setting δ δ1, we obtain (11). C Proofs of minimax lower bounds In this section, we prove Theorem 8 and Proposition 9. C.1 Proof of Theorem 8 Proof. First, we show that Rpθp Pd x1:nqq Ωp1q under the regime that µminp Unq 0. Suppose that ϕip , q ϕ1p , q for any 1 ď i ď d. In this case, we have µminp Unq 0 and θp Pq can be arbitrary θ P d 1 for any P P Pd x1:n. For any estimator ˇθ P d 1, there exists θ1 P d 1 such that }ˇθ θ1} Ωp1q by the property of d 1. Then, there always exists P P Pd x1:n such that θp Pq θ1 and hence, EP }ˇθpyp1q, . . . , ypnqq θp Pq} ı ď supθp1q,θp2q P d 1 }θp1q θp2q} Ωp1q. Thus, we have, Rpθp Pd x1:nqq Ωp1q under the regime that µminp Unq 0. Next, we show that Rpθp Pd x1:nqq Ωp b d 1 µminp Unqq under the regime that µminp Unq ą 0 using Fano s method (Fano, 1961). In order to apply Fano s method, we first construct separated subset for d 1. Let dℓ2 denote the ℓ2 distance. For δ P p0, 1q, let Pp d 1, dℓ2, δq denote the δ-packing number of the set d 1. Then, we have the following lower bound on Pp d 1, dℓ2, δq. Lemma 21. For any d ě 2, we have Pp d 1, dℓ2, δ0q ą 2d. (71) δ0 : ?e 2 ? 1 d 1 ˆ 1 ? The proof of Lemma 21 uses the volume method and is provided in Appendix H. Lemma 21 implies that there exits a δ0-separated subset V1 of d 1 of size |V1| ě 2d. Define Va : tlapθq : θ P V1u where lapθq aθ1, . . . , aθd 1, 1 a řd 1 i 1 θi ıJ for 0 ď a ă 1 supθPV1 řd 1 i 1 θi . Then, for any θp1q, θp2q P V1 and any j P rns, we have }lapθp1qq lapθp2qq} g f f ea2 dÿ θp1q i θp2q i 2 a}θp1q θp2q} Published in Transactions on Machine Learning Research (02/2024) Thus, we have }lapθp1qq lapθp2qq} ď a sup x,y P d 1 }x y} ? and }lapθp1qq lapθp2qq} ě aδ0 which implies that Va is a paδ0q-separated subset of d 1 of size |Va| ě 2d. Let Dp Q1}Q2q and χ2p Q1}Q2q denote the Kullback-Leibler (KL) divergence and χ2-divergence between two probability measures Q1 and Q2 on R, respectively, where Q1 is absolutely continuous w.r.t. Q2. Their definitions are given below: Dp Q1}Q2q : ż R log ˆd Q1 d Q1 and χ2p Q1}Q2q : ż d Q2 1 2 d Q2, d Q2 denotes the Radon-Nikodym derivative of Q1 w.r.t. Q2. Lemma 22. For any d ě 2, there exists some nonempty subset BB d Ď Bd such that for any n ě d, we have D bn j 1P Φ Y |xpjq,θp1q bn j 1 P Φ Y |xpjq,θp2q ď 8a2 d p1 2µmin p Unqq (74) and µminp Unq ą 0 for any θp1q, θp2q P Va and any Φ P BB d . Proof of Lemma 22. For any θp1q, θp2q P Va, and Φ P Bd, we have D P Φ Y |xpjq,θp1q}P Φ Y |xpjq,θp2q ď χ2 P Φ Y |xpjq,θp1q}P Φ Y |xpjq,θp2q (75) where (75) follows from the bound on KL divergence w.r.t. χ2-divergence (Su, 1995) (also see Makur (2019, Lemma 2.3) or Makur & Zheng (2020, Lemma 3) and the references therein). By the tensorization of KL divergence, we have D bn j 1P Φ Y |xpjq,θp1q bn j 1 P Φ Y |xpjq,θp2q j 1 D P Φ Y |xpjq,θp1q}P Φ Y |xpjq,θp2q . (76) Now, we consider a special case where Φ consists of CDFs of Bernoulli distributions. Under this Bernoulli setting, we set S r0, 1s and m Leb. Specifically, for any p ppjiqj Prns,i Prds P r0, 1snˆd, define Φp jiptq : Pr Zi ď ts with Zj Bernoullippjiq for any i P rds and j P rns. Then, for any θ P d 1 and j P rns, we have that řd i 1 θiΦp jiptq Pr Zpjq θ ď ts with Zpjq θ Bernoullipp J j θq where pj rpj1, . . . , pjds J. Let P B ρ be the probability measure induced by the Bernoulli distribution with parameter ρ P r0, 1s. Define qji : 1 pji and qj : rqj1, . . . , qjds J for any j P rns and i P rds. By definition, the χ2-divergence between two different Bernoulli distributions with parameters p J j θp1q and p J j θp2q is χ2 P Φp Y |xpjq;θp1q P Φp Y |xpjq;θp2q χ2 P B p J j θp1q P B p J j θp2q q J j θp1q θp2q 2 q J j θp1q θp2q 2 q J j θp1q θp2q 2 q J j θp2q p J j θp2q ď 2a2 řd i 1 q2 ji q J j θp2q p J j θp2q (77) Published in Transactions on Machine Learning Research (02/2024) where (77) is by Cauchy-Schwarz inequality and (73). Since S r0, 1s, m Leb, and Φjptq qj for any t P r0, 1q and j P rns, we have S ΦjΦJ j dm j 1 qjq J j . Assume d ě 2. Suppose that for any j P rds and i P rds, pj satisfies that d3 ď pji ď 1 1 2d3 and µmin j 1 qjq J j Since µmin řd j 1 qjq J j ą 0 if tqjuj Prds is linearly independent, such vectors pj s exist. For example, we can set pjj 1 1 d3 for any j P rds and pji 1 1 2d3 for any i, j P rds with i j. Then, it is clear that qj s are linearly and thus, µmin řd j 1 qjq J j ą 0. Therefore, µminp Unq ą 0 for any d ě n. Now, for any j ě d 1 and i P rds, suppose that pj satisfies 1 µminp Rj 1q d2 ď pji ď 1 µminp Rj 1q where Rj : qjq J j 1 n řj 1 k 1 qkq J k for any j ě d. Then, according to the condition that µminp Udq ą 0, we have ď µminp Rjq ď 1 dtracep Rjq 1 k 1 q J k qk for any j ě d, which implies that 0 ă µminp Rj 1q d2 ď 2 d2 ď 1 d. Thus, for any j ě d and i P rds, the above pji s are indeed defined in r0, 1s and qji satisfies µminp Rj 1q 2d2 ď qji ď µminp Rj 1q For notational convenience, define Rj : 1 d Id for any 0 ď j ď n 1. Then, we have řd i 1 q2 ji q J j θp2q ď 2 dµminp Rj 1q p J j θp2q ě 1 µminp Rj 1q It follows that 2a2 řd i 1 q2 ji q J j θp2q p J j θp2q ď 8a2µminp Rj 1q which, together with (75) and (77), implies that D P Φp Y |xpjq,θp1q}P Φp Y |xpjq,θp2q ď 8a2µminp Rj 1q Then, by (76), we have D bn j 1P Φp Y |xpjq,θp1q bn j 1 P Φp Y |xpjq,θp2q ď8a2 j 1 µminp Rj 1qq j d µminp Rjq Published in Transactions on Machine Learning Research (02/2024) where the second inequality follows from the fact that µminp Rjq 1 d for any 0 ď j ď d 1 and µminp Rnq ě 0. The last inequality follows from Weyl s inequality (Weyl, 1912). Note that j d qjq J j n qjq J j ĺ 2 j 1 qjq J j 2Un where we say A ĺ B for two square matrices A and B of the same size if µminp B Aq ě 0. Therefore, by Weyl s inequality (Weyl, 1912) again, we have µmin řn j d Rj ď 2µminp Unq and D bn j 1P Φp Y |xpjq,θp1q bn j 1 P Φp Y |xpjq,θp2q ď 8a2 d p1 2µmin p Unqq . for any θp1q, θp2q P Va. In conclusion, we have proved that µminp Unq ą 0 and (74) holds for any θp1q, θp2q P Va and any Φ P BB d with BB d : tΦp : pji s satisfy (78) for any i, j P rds and (79) for any j ě d 1 and i P rdsu . As is shown in the discussions below (78) and (79), BB d H. Now, define PB,d x1:n : ! bn j 1P Φ Y |xpjq;θ : θ P d 1, Φ P BB d ) Ď PB,d x1:n with BB d specified in Lemma 22. Then, by Lemma 22, (72), (81), and Fano s method (Fano, 1961), we have Rpθp Pd x1:nqq ěRpθp PB,d x1:nqq (82) inf ˆθ sup P PPB,d x1:n EP r}ˆθpyp1q, . . . , ypnqq θp Pq}s 1 supθp1q,θp2q PVa D bn j 1Fθp1qpxpjq, q bn j 1 Fθp2qpxpjq, q log 2 d p1 2µminp Unqq log 2 ˆ 1 8a2p1 2µminp Unqq d log 2 where (82) follows from the fact that PB,d x1:n Ď Px1:n. Choosing a Θp d ? 1 p1 µminp Unqqq, by (83), we have Rpθp Pd x1:nqq Ωp b d 1 µminp Unqq under the regime that µminp Unq ą 0. Given the above results, we can conclude that Rpθp Px1:nqq Ω d 1 µminp Unq C.2 Proof of Corollary 9 Proof. Assume that Xp1q, . . . , Xpnq are independent random variables in X. For any fixed sequence x1:n pxp1q, . . . , xpnqq P X n, denote by Pd X,Y ;x1:n Ď Pd n the family of the joint distributions of Published in Transactions on Machine Learning Research (02/2024) p Y p1q, Xp2q, . . . , Y pnq, Xpnqq whose marginal distribution on p Xp1q, . . . , Xpnqq is 1x1:n, i.e., the delta mass on x1:n. Then, we have Σn Un almost surely (a.s.) and Rpθp Pqq inf ˆθ sup P PPd n EP r}ˆθpyp1q, . . . , ypnqq θp Pq}s inf ˆθ sup P PPd n EP EP }ˆθpyp1q, . . . , ypnqq θp Pq}|Xp1q, . . . , Xpnqıı ě inf ˆθ sup P PPd X,Y ;x1:n EP }ˆθpyp1q, . . . , ypnqq θp Pq}|xp1q, . . . , xpnqı d 1 µminpΣnq Thus, Rpθp Pd nqq Ω min ! 1, b d 1 µminpΣnq ) . D Proofs of upper bounds for the infinite dimensional model In this section, we prove Theorem 19. Proof of Theorem 19. For θ P Hσ,e, define the function for any m P N rθ ,m : Unθ Then, we have rθ ,m P L2pΩ, nq. For any i P rms, we have xei, rθ ,my xei, Unθ y xei, θ y x Unei, θ y xei, θ y For any i ě m 1, we have xei, rθ ,my xei, Unθ y λixei, θ y. Thus, we have U 1 n,σrθ ,m σ2 i xei, rθ ,my 1 λiσ2 i ei i 1 xei, θ yei λiσ2 i xei, θ y 1 λiσ2 i ei }θ U 1 n,σrθ ,m}2 |xei, θ y|2 p1 λiσ2 i q2 Since limiÑ8 λi 0 limiÑ σi and ř8 i 1 |xei, θ y|2 ă 8, we have |xei, θ y|2 p1 λiσ2 i q2 0. Published in Transactions on Machine Learning Research (02/2024) Thus, defining θ ,m : U 1 n,σrθ ,m, we have θ ,m Ñ θ as m Ñ 8 in L2pΩ, nq. Moreover, by the definition of Un, we have θ ,mpωq U 1 n,σrθ ,m U 1 n,σ S Ψjpθ , tqΦjpω, tqmpdtq for n-a.e. ω P Ω. We follow the same probability space constructed in Appendix B.1. Define S IypjqptqΦjpω, tqmpdtq ż S Ψjpθ , tqΦjpω, tqmpdtq for any j P rns and n-a.e. ω P Ω. Since Φ is p Bp Xqb FΩb Bp Rqq{Bpr0, 1sq-measurable, according to the similar arguments as in Appendix B.1.1, we have that Vj is FΩb Fj-measurable for any j P rns. For n-a.e. ω P Ω, we have | ş S IypjqptqΦjpω, tqmpdtq| ď ş S mpdtq 1 and | ş S Ψjpθ , tqΦjpω, tqmpdtq| ď ş S mpdtq 1. Thus, We have 1 ď Vjpωq ď 1 for n-a.e. ω P Ωand Vj P L2pΩ, nq because npΩq ă 8. By Fubini s theorem, for any j P rns and n-a.e. ω P Ω, we have Er Vjpωq|Fj 1s ż S Er Iypjqptq|Fj 1sΦjpω, tqmpdtq ż S Ψjpθ , tqΦjpω, tqmpdtq S Ψjpθ , tqΦjpω, tqmpdtq ż S Ψjpθ , tqΦjpω, tqmpdtq For any α P L2pΩ, nq, define M0pαq : 1. For any n P N and α P L2pΩ, nq, define Wn : řn j 1 Vj and Mnpαq : exp xα, Wny 1 2}α}2 Un ( with }α}2 Un xα, Unαy. We have that Lemma 23. For any α P L2pΩ, nq, Mnpαqně0 is a non-negative super-martingale. The proof of Lemma 23 is similar to that in Appendix B.1.1 and is provided in Appendix J. By Lemma 23, we have Er Mnpαqs ď Er M0pαqs 1. Since U 1 n,σ is a bounded linear operator on L2pΩ, nq, we have U 1 n,σWn P L2pΩ, nq. There exists pwiqi PN P RN such that U 1 n,σWn ř8 i 1 wiei. Let tζiuiě1 be a sequence of independent normal distributed random variables such that ζi Np0, 1q for any i P N and Fζ : σ pζ1, ζ2, . . . q is independent of F8 : σ p Yj 1Fjq. Define βi : σiζi for any i P N. By the monotone convergence theorem, we have i 1 σ2 i ζ2 i i 1 σ2 i E ζ2 i i 1 σ2 i ă 8. Thus, we have ř8 i 1 β2 i ă 8 a.s., which implies that třm i 1 βieiumě1 and třm i 1 σiζieiumě1 converges in L2pΩ, nq a.s.. In particular, we have β : ř8 i 1 σiβiei P L2pΩ, nq with }β}2 ă 8 a.s.. Define Ď Mn : Er Mnpβq|F8s. Then, Ď Mn ě 0. Since Mnpαq is Fn-measurable for any fixed α P L2pΩ, nq and Fβ and F8 are independent, we have that Ď Mn is Fn-measurable. Then, we have ErĎ Mn|Fn 1s Er Mnpβq|Fn 1s Er Er Mnpβq|Fζ, Fn 1s|Fn 1s ďEr Er Mn 1pβq|Fζ, Fn 1s|Fn 1s Er Er Mn 1pβq|Fζ, F8s|Fn 1s Er Er Mn 1pβq|F8s|Fn 1s Published in Transactions on Machine Learning Research (02/2024) Er|Ď Mn|s ErĎ Mns Er Mnpβqs Er Er Mnpβq|Fβss ď 1. Thus, tĎ Mnuně0 is a non-negative super-martingale. Since }Un} ď nnpΩq and Un is positive, we have 0 ď λi xei, Uneiy ď nnpΩq for any i P N. Define w1 i : xei, Wny for any i P N. Define Hm : řm i 1 βiw1 i 1 2 řm i 1 λiβ2 i for any m P N and H8 : xβ, Wny 1 2}β}2 Un ř8 i 1 βiw1 i 1 2 ř8 i 1 λiβ2 i . Then, we have Mnpβq expp H8q. Moreover, we prove the following convergence result. Erexpp Hmq|F8s Ñ Er Mnpβq|F8s Ď Mn as m Ñ 8 a.s.. The proof of Lemma 24 uses the conditional dominated convergence theorem and is provided in Appendix J. Specifically, we first show that limmÑ8 |Hm H8| 0 a.s.. Then, we verify that the dominating function of expp Hmq, i 1 |σiζi| 1 i 1 λiσ2 i ζ2 i , is integrable. Since ζ1, ζ2, . . . are independent and Np0, 1q-distributed random variables, we can use the monotone convergence theorem to calculate Erexp n ř8 i 1 |σiζi| 1 2 ř8 i 1 λiσ2 i ζ2 i s. Then, it suffices to verify the convergence of the resulting series; e.g., the conditions that |σi| ă 1 ?λi , @i P N and ř8 i 1 |σi| ă 8 are needed to show that 8 ź ˆ 2ΦNp0,1q n|σi|{ b exists as a non-negative real number, where ΦNp0,1q denotes the CDF of the Np0, 1q distribution. For any m P N, define βm řm i 1 βiei and Wn,m řm i 1 w1 iei. Then, we have βm, Wn,m P L2 σpΩ, nq and }Wn,m}2 U 1 n,σ }βm U 1 n,σWn,m}2 Un,σ 2xβm, Wn,my }βm}2 Un,σ For any m P N, we have Erexpp Hmq|F8s E exp " xβm, Wn,my 1 xβm, Wn,my 1 2}βm}2 Un,σ 1 2}Wn,m}2 U 1 n,σ 2}βm U 1 n,σWn,m}2 Un,σ 2}Wn,m}2 U 1 n,σ 2}βm U 1 n,σWn,m}2 Un,σ 2}Wn,m}2 U 1 n,σ 2}Wn,m}2 U 1 n,σ 1 λiσ2 i ζ2 i 1 aśm i 1p1 λiσ2 i q exp ˆ1 2}Wn,m}2 U 1 n,σ Published in Transactions on Machine Learning Research (02/2024) Since λiσ2 i ě 0 for any i P N and ř8 i 1 λiσ2 i ă 8, we have i 1 p1 λiσ2 i q ă 8. }Wn,m}2 U 1 n,σ σ2 i pw1 iq2 σ2 i pw1 iq2 1 λiσ2 i ď sup k PN σ2 k i 1 pw1 iq2 sup k PN σ2 k}Wn}2 ă 8, lim mÑ8 }Wn,m}2 U 1 n,σ σ2 i pw1 iq2 1 λiσ2 i }Wn}2 U 1 n,σ ă 8. In conclusion, we have Ď Mn lim mÑ8 Erexpp Hmq|F8s 1 bś8 i 1p1 λiσ2 i q exp ˆ1 2}Wn}2 U 1 n,σ Since tĎ Mnuně0 is a super-martingale. By Doob s maximal inequality for super-martingales, P sup n PN Ď Mn ě δ ȷ ď ErĎ M0s which, implies that, Dn P N s.t. }Wn}U 1 n,σ ě i 1 p1 λiσ2 i q fl ď δ. (85) Define the finite rank operator ςm : Hσ,e Ñ L2pΩ, nq, θ ÞÑ řm i 1 xei,θy σ2 i ei. Since pθσ θ ,m U 1 n,σ p Wn ςmθ q U 1 n,σWn U 1 n,σςmθ , by the triangle inequality, we have }pθσ θ ,m}Un,σ ď }U 1 n,σWn}Un,σ }U 1 n,σςmθ }Un,σ }Wn}U 1 n,σ }ςmθ }U 1 n,σ }Wn}U 1 n,σ p1 λiσ2 i qσ2 i ď }Wn}U 1 n,σ Besides, we have }θ ,m θ }2 Un,σ |xei, θ y|2 p1 λiσ2 i q2 |xei, θ y|2 σ2 i p1 λiσ2 i q Published in Transactions on Machine Learning Research (02/2024) Since limiÑ8 λi 0 limiÑ σi and ř8 i 1 |xei,θ y|2 σ2 i ă 8, we have lim mÑ8 }θ ,m θ }2 Un,σ lim mÑ8 |xei, θ y|2 σ2 i p1 λiσ2 i q2 0. }pθσ θ }Un,σ ď lim sup mÑ8 }pθσ θ ,m}Un,σ ď }Wn}U 1 n,σ σ2 i }Wn}U 1 n,σ }θ }σ,e. With probability at least 1 δ, for all n P N, we have }pθσ θ }Un,σ ď i 1 log p1 λiσ2 i q Since 0 ď λi xei, Uneiy ď nnpΩq for any i P N, the above inequality implies that }pθσ θ }Un,σ ď i 1 log p1 nnpΩqσ2 i q E Proof of Lemma 6 Proof. For any n P N, define U n : }Un Σn} and Zj : ş S ΦjΦJ j Σn for j P rns. We have Er Zjs 0 and S }Φj}2 2 ď d, }Σpjq}2 µmax Σpjq µmax Thus, for each j P rns, we have }Zj} ď max " 2 , }Σpjq}2 By (Tropp, 2012, Theorem 1.3), for any a ě 0, we have Pr U n ě as ď d exp ˆ a2 In other words, for any δ P p0, 1q, with probability at least 1 δ, we have 8n logpd{δq. Thus, we can set U n pδq ě d a 8n logpd{δq. F Proofs of theoretical results in Section 6.2 In this section, we provide the proofs of the stated theoretical results in Section 6.2. Published in Transactions on Machine Learning Research (02/2024) Proof of Lemma 12. By Fubini s theorem, for any θ P L2pΩ, nq, we have p Unθqpωq ż S Φjpω, tqΦjpω1, tqmpdtqnpdω1q. Define the function un : Ω2 Ñ R, pω, ω1q ÞÑ S Φjpω, tqΦjpω1, tqmpdtq. Then, we have p Unθqpωq ş Ωunpω, ω1qθpω1qnpdω1q. Since Φjpω, tq P r0, 1s for any ω P Ω, t P S and m is a probability measure on S, we have unpω, ω1q P r0, ns for any ω, ω1 P Ωand ż Ω |unpω, ω1q|2npdωqnpdω1q ď n2npΩq2. Thus, un P L2pΩ2, n2q and Un is Hilbert-Schmidt integral operator for any n P N. Thus, it is also a compact operator. Because unpω, ω1q řn j 1 ş S Φjpω, tqΦjpω1, tqmpdtq unpω1, ωq P R, Un is self-adjoint. For any θ P L2pΩ, nq, we have Ω θpω1qunpω, ω1qnpdω1q ˇˇˇˇ Ω |θpω1q|2npdω1q ż Ω |unpω, ω1q|2npdω1qnpdωq ďn2npΩq2}θ}2 x Unθ, θy ż Ω unpω, ω1qθpω1qθpωqnpω1qnpωq S xθp q, Φjp , tqy2mpdtq Thus, Un is a positive operator with }Un} ď nnpΩq. Note that x Unθ, θy 0 iff xθp q, Φjp , tqy 0 for m-a.e. t P S for all j P rns. Since Un is compact, if dimp L2pΩ, nqq 8, Un is not invertible. Proof of Corollary 14. By Lemma 12, since }Un} ď nnpΩq and Un is positive, we have 0 ď λi xei, Uneiy ď nnpΩq for any i P N. Since Un is a compact operator and teiui PN is an orthonormal basis consisting of eigenfunctions of Un, by the Riesz-Schauder theorem (see e.g., Reed & Simon, 1972), we have that λi Ñ 0. Proof of Lemma 15. For any f, g P L2 σpΩ, nq and α P R, we have |αxei, fy xei, gy|2 |α|2|xei, fy|2 |xei, gy|2 2|α||xei, fy||xei, gy| Then, we have αf g P L2 σpΩ, nq and L2 σpΩ, nq is a linear subspace of L2pΩ, nq. Published in Transactions on Machine Learning Research (02/2024) Proof of Lemma 16. For any α P R and f, g P L2 σpΩ, nq, we have αf g P L2 σpΩ, nq, Un,σpαf gq lim mÑ8 xei, αf gyei αUn,σf Un,σg, 2 |xei, fy|2 i 1 |xei, fy|2 ě 1 supk PN σ4 k }f}2 Thus, Un,σ is a linear operator. Since 0 ď supi PN σ2 i ă 8, we can conclude that }Un,σf} 0 iff f 0. Therefore, Un,σ is injective. For any θ ř8 i 1xei, θyei P L2pΩ, nq, we have σ4 i |xei, θy|2 p1 λiσ2 i q2 ď sup k PN σ4 k i 1 |xei, θy|2 ă 8. p1 λiσ2 i q2 ď i 1 |xei, θy|2 ă 8. Thus, we know that ˇθ : ř8 i 1 σ2 i xei,θy 1 λiσ2 i ei P L2 σpΩ, nq. Since σ2 i xei, θy 1 λiσ2 i ei i 1 xei, θyei θ, we can conclude that U 1 n,σθ ˇθ σ2 i xei, θy 1 λiσ2 i ei and Un,σ is a bijective linear operator from L2 σpΩ, nq onto L2pΩ, nq. Then, U 1 n,σ exists as a bijective linear operator from L2pΩ, nq onto L2 σpΩ, nq. Since for any f P L2 σpΩ, nq, we have proved that }f} ď supk PN σ2 k}Un,σf}, we have }U 1 n,σ} ď supi PN σ2 i and U 1 n,σ is a bounded linear operator on L2pΩ, nq. Proof of Lemma 17. Since limiÑ8 σi 0, there exists some N P N such that σi ď 1 for any i ě N. Then, for any θ P Hσ,e, we have Thus, we have θ P L2 σpΩ, nq. Published in Transactions on Machine Learning Research (02/2024) G Proof of Proposition 18 Proof. First, we show that řn j 1 ş S IypjqptqΦjp , tqmpdtq P L2pΩ, nq. Indeed, we have S IypjqptqΦjp , tqmpdtq}2 ď S IypjqptqΦjp , tqmpdtq}2 S |IypjqptqΦjp , tq|mpdtq}2 Define θ0 : U 1 n,σ řn j 1 ş S IypjqptqΦjp , tqmpdtq . We show that θ0 P L2 σpΩ, nq. Then, by Lemma 17, we have θ0 P Hσ,e. Since λi ě 0 for any i P N, we have 1 σ4 i |xei, θy|2 ď 2 |xei, θ0y|2 ď }Un,σθ0}2 S IypjqptqΦjp , tqmpdtq}2 ă 8. Thus, θ0 P L2 σpΩ, nq. Then, for any j P rns, we have |Ψjpθ0, tq| |xθ0p q, Φjp , tqy| Ω |θ0pωqΦjpω, tq|npdωq ď}θ0}}Φjp , tq} which implies that j 1 }Iypjqp q Ψjpθ0, tq}L2p S,mq }θ0}σ,e ă 8 since mp Sq 1 and |Iypjqptq| ď 1 for any j P rns and t P S. For any θ P Hσ,e, we have Lpθ0 θ; σq Lpθ0q S |Ψjpθ, tq|2mpdtq S Ψjpθ, tqpΨjpθ0, tq Iypjqptqqmpdtq xei, θyxei, θ0y Published in Transactions on Machine Learning Research (02/2024) Notice that by Fubini s theorem, we have S Ψjpθ, tqΨjpθ0, tqmpdtq Ω θpωqΦjpω, tqθ0pω1qΦjpω1, tqnpdω1qnpdωqmpdtq S Φjpω, tqΦjpω1, tqmpdtq npdω1qnpdωq Ω θ0pω1qunpω, ω1qnpdω1qnpdωq S Ψjpθ, tq Iypjqptqmpdtq Ω θpωqΦjpω, tq Iypjqptqnpdωqmpdtq S IypjqptqΦjpω, tqmpdtq S IypjqptqΦjp , tqmpdtqy, xei, θyxei, θ0y i 1 xθ, xei, θ0y Since we have proved that |xei, θ0y|2 we can conclude that ř8 i 1xθ, xei,θ0y σ2 i eiy xθ, ř8 i 1 xei,θ0y σ2 i eiy. Thus, S Ψjpθ, tqpΨjpθ0, tq Iypjqptqqmpdtq xei, θyxei, θ0y xθp q, p Unθ0qp q S IypjqptqΦjp , tqmpdtqy xθp q, p Un,σθ0qp q S IypjqptqΦjp , tqmpdtqy S IypjqptqΦjp , tqmpdtq S IypjqptqΦjp , tqmpdtqy Then, we have Lpθ0 θ; σq Lpθ0q S |Ψjpθ, tq|2mpdtq σ2 i ě Lpθ0; σq. Since 1 σ2 i ą 0 for all i P N, we have that ř8 i 1 |xei,θy|2 σ2 i ą 0 for any θ P Hσ,e with θ 0. Since Lpθ0; σq ă 8, we can conclude that Lpθ; σq ą Lpθ0; σq for any θ P Hσ,eztθ0u and pθσ θ0. Published in Transactions on Machine Learning Research (02/2024) H Proof of Lemma 21 Proof. By Vershynin (2018, Proposition 4.2.12), we have Pp d 1, dℓ2, δq ě Volp d 1q Volp Bd 1 δ p0qq (86) where Bd 1 δ p0q : tx P Rd 1 : }x}2 ď δu and for any E Ď Rd 1, Volp Eq is the volume of E under the Lebesgue measure in Rd 1. According to Stein (1966); DLMF, We have d pd 1q!, (87) Volp Bd 1 δ p0qq p?πδqd 1 Pp d 1, dℓ2, δq ě Γp d 1 d pd 1q!p?πδqd 1 d Γpdqp?πδqd 1 . (89) When d ě 3, we have d 1 2 ě 2 and d ě 2. According to Batir (2008, Theorem 1.5), we have 2ppx 1{2q{eqx 1{2 ă Γpxq ă 3ppx 1{2q{eqx 1{2 for any x ě 2. Thus, for d ě 3, we have 2 q Γpdq ą 2 We verify that the above inequality also holds when d 2. Therefore, for any d ě 2, we have 2 q Γpdq ą 2 which implies that Pp d 1, dℓ2, δq ą 2 ? d 3p?πδqd 1 d 3p?πδqd 1 d 3p?πδqd 1 1 2d{2 e d 1 Let δ δ0 ?e 2 ? d 3 1 d 1 1 ? d d 1 . Then, by (90), we have Pp d 1, dℓ2, δ0q ą 2d which is exactly (71). For d ě 2, we have δ0 ě ?e 4 ? d 3 1 d 1 . Consider the function fpxq 1 x 1 log ?x with x ě 2. We have that x log x 2 log 3 Published in Transactions on Machine Learning Research (02/2024) Since the function g : x ÞÑ 1 x log x is a decreasing function when x ě 2 and f 1p2q ą 0, f 1pe5q ă 0, we have that f first increases and then decreases when x increases from 2 to infinity. Since limxÑ8 fpxq 0, fp2q logp ? 2{3q, we have that fpxq ě fp2q logp ? 2{3q. Therefore, for any d ě 2, we have ? d 3 1 d 1 ě ? πd which gives (72). I Proofs of upper bounds for the mismatched model In this Section, we prove Theorem 10 in Appendix I.1 and Corollary 11 in Appendix I.2. I.1 Proof of Theorem 10 Proof. In the setting of Theorem 10, the sample tpxpjq, ypjqquj PN is generated according to Scheme I, and similar to setting of Appendix B.1, we consider the underlying probability space for the sample to be pr0, 1s N, Bpr0, 1sq N, Pq which is already defined at the beginning of Appendix B.1. Define the random vector Ξ to be the identity mapping from r0, 1s N onto itself as in Appendix B.1. Then, Ξ follows the uniform distribution on r0, 1s N. Suppose tpxpjq, ypjqquj PN is sampled according to Scheme I with F defined in (15). Then, according to Bogachev (2007, Proposition 10.7.6), for each j P N, there exist some p Bp Xq b Bp Sqqj 1 b Bpr0, 1sq{Bp Xqmeasurable function hpjq X : p X ˆSqj 1ˆr0, 1s Ñ X and p Bp Xqb Bp Sqqj 1b Bp Xqb Bpr0, 1sq{Bp Sq-measurable function hpjq Y : p X ˆ Sqj 1 ˆ X ˆ r0, 1s Ñ S such that xpjq hpjq X pxp1q, yp1q, . . . , xpj 1q, ypj 1q, Ξp2j 1qq, ypjq hpjq Y pxp1q, yp1q, . . . , xpj 1q, ypj 1q, xpjq, Ξp2jqq, and E 1 ! hpjq Y pxp1q, yp1q, . . . , xpj 1q, ypj 1q, xpjq, Ξp2jqq ď t ) ˇˇFj 1 ı θJ Φpxpjq, tq epxpjq, tq (91) for any t P S and j P N, where Fj : σ Ξpkq : k P r2j 1s ( . With the same proof provided at the beginning of Appendix B.1, ypjq( j PN is t Fjuj PN-adapted, xpjq is Fj 1{Bp Xq-measurable, and Φj is p Fj 1 b Bp Sqq{Bpr0, 1sdq-measurable for each j P N. Since e : X ˆ S Ñ r 1, 1s, px, tq ÞÑ epx, tq is Bp Xq b Bp Sqmeasurable and xpjq is Fj 1{Bp Xq-measurable, we have that ej : r0, 1s N ˆ S Ñ r 1, 1s, pξ, tq ÞÑ epxpjqpξq, tq is Fj 1 b Bp Sq-measurable. Define Vj : ş S IypjqΦj ş SpθJ Φj ejqΦj. Since Iypjq is Fj b Bp Sq-measurable and ej and Φj are Fj 1 b Bp Sqmeasurable, by Fubini s theorem and (91), We have Er Vj|Fj 1s E ż S IypjqΦj ˇˇˇFj 1 S pθJ Φj ejqΦj S E Iypjq|Fj 1 Φj ż S pθJ Φj ejqΦj S pθJ Φj ejqΦj ż S pθJ Φj ejqΦj For any α P Rd, if n 0, define Mnpαq 1. If n ě 1, define Mnpαq : exp αJWn 1 2}α}2 Un ( for Wn : řn j 1 Vj and Un řn j 1 ş S ΦjΦJ j . Then, with the similar proof as in Appendix B.1.1, we can show that Wn is Fn-measurable, Un is Fn 1-measurable, and Mn is Fn b Bp Rdq-measurable for any n P N. Thus, for any α P Rd, t Mnpαquně0 is t Fnuně0-adapted. Moreover, for any α P Rd and n P N, we have Er Mnpαq|Fn 1s Mn 1pαq E exp " αJVn 1 α * ˇˇˇFn 1 Mn 1pαq E exp αJVn ( |Fn 1 exp ! 1 2 ş S pαJΦnq2) . (92) Published in Transactions on Machine Learning Research (02/2024) S |αJΦn| ď αJVn ď ş S |αJΦn| a.s., we have E exp αJVn ( |Fn 1 ď exp S |αJΦn| 2+ αJΦn 2* (93) αJΦn 2* (94) where (93) is by Cauchy-Schwarz inequality and ş S 1 mp Sq 1. Then, by (92) and (94), we have Er Mnpαq|Fn 1s ďMn 1pαq exp ! 1 2 ş exp ! 1 2 ş S pαJΦnq2) Mn 1pαq. Thus, for any α P Rd, t Mnpαquně0 is a super-martingale. Now define Ď Mn : ş Rd Mnpαqhpαqdα for 2 αJα * ˆ λ Then, with the same calculation as (40) in Appendix B.1.1, we have Ď Mn λd{2 detp Unpλqq1{2 exp 1 2}Wn}2 Unpλq 1 . By Fubini s theorem, Mn is Fnb Bp Rdq-measurable implies that Ď Mn is Fn-measurable for any n ě 0. With the same analysis as (42), tĎ Mnuně0 is a super-martingale. By Doob s maximal inequality for super-martingales, we have that P sup n PN Ď Mn ě δ ȷ ď ErĎ M0s which implies that for any N P N, Dn P r Ns s.t. }Wn}Unpλq 1 ě log detp Unpλqq According to (47), we have }Wn}Unpλq 1 ď for all n P N with probability at least 1 δ. By (3), (44), and the definition of Vj, we have pθλ θ Unpλq 1 n ÿ j 1 Vj En λθ Unpλq 1Wn Unpλq 1p En λθ q. (96) where En řn j 1 ş S ejΦj by definition. Thus, }pθλ θ }Unpλq ď }Unpλq 1Wn}Unpλq }Unpλq 1p En λθ q}Unpλq ď }Wn}Unpλq 1 }λθ }Unpλq 1 }En}Unpλq 1 ď }Wn}Unpλq 1 ? λ}θ } }En}Unpλq 1 (97) where (97) is because of Unpλq 1 1 λ I Unpλq 1Un and }I Unpλq 1Un}2 ď 1. By (95) and (97), for any δ P p0, 1q, with probability at least 1 δ, we have }pθλ θ }Unpλq ď λ}θ } }En}Unpλq 1. (98) Published in Transactions on Machine Learning Research (02/2024) for all n P N. Since Unpλq λId is positive semi-definite, (98) immediately implies that }pθλ θ}Unpλq ď λ }En} (99) which is exactly (16). I.2 Proof of Corollary 11 Proof. In the setting of Corollary 11, the sample tpxpjq, ypjqquj PN is generated according to Scheme II. In the following proof, we consider the underlying probability space for the sample tpxpjq, ypjqquj PN to be pr0, 1s N, Bpr0, 1sq N, Pq which has already been defined at the beginning of Appendix B.1. Define the random vector Ξ to be the identity mapping from r0, 1s N onto itself as in Appendix B.1. Then, Ξ follows the uniform distribution on r0, 1s N. Suppose tpxpjq, ypjqquj PN is sampled according to Scheme II with F defined in (15). Then, according to Bogachev (2007, Proposition 10.7.6), for each j P N, there exist some Bpr0, 1sq{Bp Xqmeasurable function hpjq X : r0, 1s Ñ X and Bp Xq b Bpr0, 1sq{Bp Sq-measurable function hpjq Y : X ˆ r0, 1s Ñ S such that xpjq hpjq X pΞp2j 1qq, ypjq hpjq Y pxpjq, Ξp2jqq, and E 1 ! hpjq Y pxpjq, Ξp2jqq ď t ) ˇˇFj 1 ı θJ Φpxpjq, tq epxpjq, tq (100) for any t P S and j P N, where Fj : σ Ξpkq : k P r2j 1s ( . With the same proof provided at the beginning of Appendix B.1, ypjq( j PN is t Fjuj PN-adapted and Φj is p Fj 1 b Bp Sqq{Bpr0, 1sdq-measurable for each j P N. Moreover, txpjquj PN is independent, which implies that tΦjptquj PN is independent for any t P S, tejptquj PN is independent for any t P S, and typjquj PN is independent. Let bjptq : ErejptqΦjptqs for t P S and j P rns. Then, by Fubini s theorem, bj is measurable with bjiptq P r 1, 1s for t P S, j P N and i P rds. By definition and Fubini s theorem, we have Bn řn j 1 ş Define Vj : ş S IypjqΦj ş S θJ ΦjΦj ş S bj. By Fubini s theorem and (100), we have S p Iypjq θJ ΦjqΦj S E p Iypjq θJ ΦjqΦj ż S E Er Iypjq θJ Φj|Fj 1sΦj ż S E rejΦjs ż For any α P Rd, if n 0, define Mnpαq 1. If n ě 1, define Mnpαq : exp αJWn 1 2}α}2 Un ( for Wn : řn j 1 Vj and Un řn j 1 ş S ΦjΦJ j . Similar to Appendix I.1, we can show that Mn is Fn b Bp Rdqmeasurable for any n ě 0. Moreover, for any n P N, Er Mnpαq|Fn 1s Mn 1pαq E exp " αJVn 1 α * ˇˇˇFn 1 Mn 1pαq E exp αJVn ( |Fn 1 exp ! 1 2 ş S pαJΦnq2) (101) Published in Transactions on Machine Learning Research (02/2024) S αJbn ď αJVn ď ş S αJbn a.s.. Thus, E exp αJVn ( |Fn 1 ď exp S |αJΦn| 2+ αJΦn 2* (102) Then, by (101) and (102), we have Er Mnpαq|Fn 1s ďMn 1pαq exp ! 1 2 ş exp ! 1 2 ş S pαJΦnq2) Mn 1pαq. Thus, for any α P Rd, t Mnpαquně0 is a super-martingale. With the same approach as in Appendix I.1, for any λ P p0, 8q, we can show that }pθλ θ }Unpλq ď λ }Bn} (103) for all n P N with probability at least 1 δ. Then, using the same analysis as in Appendix B.2.2, we can show that for any δ1 P p0, 1q, δ2 P p0, 1 δ1q, and n ě 32d2 σ2 min logpd{δ1q, we have }pθλ θ }Σn ď 2 ˆ d log ˆ 1 1 with probability at least 1 δ1 δ2. Then, (17) is obtained by setting δ1 δ2 δ P p0, 1{2q. J Proofs of the lemmas in Appendix D In this section, we provide the proofs of the technical lemmas in Appendix D. Proof of Lemma 23. For any n P N, since Vj is FΩb Fj-measurable for any j P rns and Wn řn j 1 Vj, we have that Wn is also FΩb Fj-measurable. According to the similar arguments as in Appendix B.1.1, we know that Un is FΩb Fn 1-measurable. Thus, by Fubini s theorem, for any α P L2pΩ, nq, Mnpαq exp xα, Wny 1 is Fn-measurable, which implies that t Mnpαquně0 is t Fnuně0-adapted. For any n P N, we have that Er Mnpαq|Fn 1s Mn 1pαq E exp " xα, Vny 1 S Ψnpα, tq2mpdtq * ˇˇˇFn 1 Mn 1pαq E rexp txα, Vnyu |Fn 1s S Ψnpα, tq2mpdtq (. S |Ψnpα, tq|mpdtq ď xα, Vny ď ş S |Ψnpα, tq|mpdtq, according to Hoeffding s lemma (Hoeffding, 1963) and Cauchy-Schwarz inequality, we have E rexp txα, Vnyu |Fn 1s ď exp S |Ψnpα, tq|mpdtq 2+ S Ψnpα, tq2mpdtq * S Ψnpα, tq2mpdtq * . Published in Transactions on Machine Learning Research (02/2024) Then, we have Er Mnpαq|Fn 1s ďMn 1pαqexp 1 S Ψnpα, tq2mpdtq ( S Ψnpα, tq2mpdtq ( Mn 1pαq. Since M0pαq 1 and Mnpαq ě 0, for any α P L2pΩ, nq, t Mnpαquně0 is a non-negative super-martingale. Proof of Lemma 24. For any m P N, we have i m 1 pβiw1 i 1 i m 1 pw1 iq2 nnpΩq i m 1 β2 i . Since ř8 i 1pw1 iq2 }Wn}2 ă 8 and ř8 i 1 β2 i ă 8 a.s., we have that limmÑ8 |Hm H8| 0 a.s.. Thus, lim mÑ8 |expp Hmq Mnpβq| lim mÑ8 |expp Hmq expp H8q| 0. Since |Wn| ď n a.s., we have | expp Hmq| ď exp ˆ |xβ, Wny| 1 i 1 |σiζi| 1 i 1 λiσ2 i ζ2 i for all m P N a.s.. Moreover, for any m P N, by the independence of tζiui PN, if |σi| ă 1 ?λi for all i P N, then we have i 1 |σiζi| 1 i 1 λiσ2 i ζ2 i i 1 E exp ˆ |σiζi| 1 2λiσ2 i ζ2 i i 1 exp ˆ n|σiζi| 1 2pλiσ2 i 1qζ2 i 1 λiσ2 i exp ˆ n2σ2 i 2p1 λiσ2 i q where ΦNp0,1q denotes the CDF of the Np0, 1q distribution. By the monotone convergence theorem, we have i 1 |σiζi| 1 i 1 λiσ2 i ζ2 i 1 λiσ2 i exp ˆ n2σ2 i 2p1 λiσ2 i q ď 1 bś8 i 1p1 λiσ2 i q exp σ2 i 1 λiσ2 i Published in Transactions on Machine Learning Research (02/2024) Since limiÑ8 λiσ2 i 0 and ř8 i 1 σ2 i ă 8, we have σ2 i 1 λiσ2 i ă 8 and i 1 λiσ2 i ă 8, which also implies that i 1 p1 λiσ2 i q ă 8. For any sequence taiui PN such that ai ą 0 and ř8 i 1 ai ă 8, we have limiÑ8 ai 0 and lim iÑ8 2ΦNp0,1qpaiq 1 ai lim iÑ8 1 ?π exp a2 i 2 pai opaiqq Since ř8 i 1 |ai| ă 8, we can conclude that i 1 p2ΦNp0,1qpaiq 1q ă 8. Therefore, if we assume that ř8 i 1 |σi| ă 8, we have ř8 i 1 n|σi| ? 1 λiσ2 i ă 8 and In conclusion, we have i 1 |σiζi| 1 i 1 λiσ2 i ζ2 i Then, by the conditional dominated convergence theorem, we have Erexpp Hmq|F8s Ñ Er Mnpβq|F8s Ď Mn as m Ñ 8 a.s..