Learning Functional Distributions with Private Labels

Changlong Wu, Yifan Wang, Ananth Grama, Wojciech Szpankowski
Center for Science of Information, Department of Computer Science, Purdue University. Correspondence to: Changlong Wu.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

We study the problem of learning functional distributions in the presence of noise. A functional is a map from the space of features to distributions over a set of labels, and is often assumed to belong to a known class of hypotheses $\mathcal{F}$. Features are generated by a general random process and labels are sampled independently from feature-dependent distributions. In privacy-sensitive applications, labels are passed through a noisy kernel. We consider online learning, where at each time step a predictor attempts to predict the actual (label) distribution given only the features and the noisy labels from prior steps. The performance of the predictor is measured by the expected KL-risk, which compares the predicted distributions to the underlying truth. We show that the minimax expected KL-risk is of order $\tilde{\Theta}(\sqrt{T\log|\mathcal{F}|})$ for any finite hypothesis class $\mathcal{F}$ and any non-trivial noise level. We then extend this result to general infinite classes via the concept of stochastic sequential covering and provide matching lower and upper bounds for a wide range of natural classes.

1. Introduction

Consider the problem of determining how clinical factors (such as age, gender, smoking history, etc.) affect the probability of manifesting various sequelae after catching a disease. We represent clinical factors as a set of features $\mathcal{X}$, the outcomes (i.e., sequelae) as a set of labels $\mathcal{Y}$, and write $\Delta(\mathcal{Y})$ for the set of all probability distributions over $\mathcal{Y}$. Our goal is to find a mapping (a functional distribution) from $\mathcal{X}$ to $\Delta(\mathcal{Y})$ by observing sample-label pairs $(x_1, y_1), \dots, (x_T, y_T)$ drawn from a group of real patients. We aim to recover the true relationship (i.e., a map $\mathcal{X} \to \Delta(\mathcal{Y})$) with minimal loss. Since clinical data is sensitive, one typically does not reveal labels publicly, choosing instead to reveal perturbed labels $\tilde{y}^T = \{\tilde{y}_1, \dots, \tilde{y}_T\}$ generated by adding noise to $y^T$. Note that in this setting clinical factors are not considered sensitive, and therefore the features $x^T = \{x_1, \dots, x_T\}$ are revealed exactly. Our goal is to design a noisy process that prevents inferring $y^T$ from $(x^T, \tilde{y}^T)$ (i.e., one that prevents inference of the true labels) while still allowing us to learn the underlying relationship from the noisy labels. Our setup can be understood as an extension of the classical randomized response survey technique proposed by Warner (1965) to a functional scenario with features.

We model the functional dependency as a map $p : \mathcal{X} \to \Delta(\mathcal{Y})$. We assume that the underlying true relationship comes from a class of hypotheses $\mathcal{F} \subseteq \Delta(\mathcal{Y})^{\mathcal{X}}$, i.e., we consider the well-specified case. While the clinical example above describes a batch (supervised) learning scenario, in this paper we consider the more general online learning case. This not only covers more application scenarios (e.g., online advertisement) but also resolves batch learning as a special case. Let $p \in \mathcal{F}$ be the underlying true mapping. We consider the following online learning scenario over a time horizon of $T$ steps. At each time $t$, Nature generates a feature $x_t$ and reveals it to a predictor.
The predictor then makes a prediction $\hat{p}_t \in \Delta(\mathcal{Y})$ based on the history observed so far, i.e., $x^t$ and $\tilde{y}^{t-1}$. Nature then generates $y_t \sim p(x_t)$ independently of all previous data and reveals the noisy label $\tilde{y}_t = K_\eta(y_t)$, where $K_\eta$ is a noisy kernel (channel). The goal of the predictor is to minimize the following expected KL-risk:
$$\mathbb{E}\left[\sum_{t=1}^{T} \mathrm{KL}\big(p(x_t),\, \hat{p}_t(x^t, \tilde{y}^{t-1})\big)\right],$$
where KL is the KL-divergence and the expectation is over all randomness in this process. By a standard online-to-batch conversion, any online predictor $\hat{p}$ implies a batch learner achieving the same performance bound. See Section 2 for formal statements. The goal of this paper is to understand how the class $\mathcal{F}$ and the process generating $x^T$ affect the expected KL-risk in the presence of the noisy labels $\tilde{y}^T$.

Our contributions. We formulate the problem of learning functional distributions from noisy labels by allowing features to influence outcome distributions. This offers a natural extension of the classical randomized response scenario (Warner, 1965) to a learning-theoretic context. Our formulation differs substantially from classical label differential privacy (Chaudhuri & Hsu, 2011), where the goal is to recover a classification function that best fits the actual data. Our goal is to recover the underlying distribution itself, and the learning quality is measured by the expected KL-risk. This formulation also provides a resolution to label inference attacks (Wu et al., 2023) in the label differential privacy scenario. Specifically, we introduce a noisy kernel (channel) $K_\eta$ parameterized by a noise level $\eta > 0$. We show that by tuning $\eta$, one can make the probability of recovering the actual labels $y^T$ from the noisy labels $\tilde{y}^T$ arbitrarily close to the optimal recovery probability attainable by knowing only the underlying distribution.

We design a general algorithm (Algorithm 1) for learning distributions in the presence of such noisy labels, based on Bayesian averaging (i.e., exponentially weighted averaging). We show that this algorithm achieves tight expected KL-risk for a wide range of hypothesis classes. Specifically: (i) for any finite class $\mathcal{F}$ and adversarially presented features, the expected KL-risk grows as $\Theta(\sqrt{T\log|\mathcal{F}|})$ up to a $\log T$ factor for any non-trivial noise level $\eta$; furthermore, the expected KL-risk becomes $O(\log|\mathcal{F}|)$ if the probabilities in $\mathcal{F}$ are bounded away from 0; (ii) we provide a general approach for reducing the expected KL-risk of general infinite classes to that of finite classes using the concept of stochastic sequential covering, which yields tight bounds for a wide range of natural classes, including logistic regression and classes that can be kernelized by a function class of finite pseudo-dimension. Our main technique for establishing upper bounds is a reduction to sequential probability assignment under logarithmic loss via a novel denoising approach that relates KL-divergence and total variation distance. For lower bounds, we construct hard classes with probabilities close to 0 via classical lower bounding techniques such as Le Cam's and Fano's methods.
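To make the online protocol and the KL-risk metric concrete before the formal development in Section 2, the following is a minimal simulation sketch, not from the paper. The instance (`true_p`, the feature process, the noise level, and the trivial uniform predictor) is an illustrative choice of ours; it shows how per-round risk is accumulated against the unobserved truth while only noisy labels are revealed.

```python
import numpy as np

rng = np.random.default_rng(0)
M, eta, T = 3, 0.5, 2000                        # illustrative sizes

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A toy well-specified instance: two features, fixed true functional p*.
true_p = {0: np.array([0.7, 0.2, 0.1]), 1: np.array([0.1, 0.3, 0.6])}

risk = 0.0
for t in range(T):
    x_t = int(rng.integers(2))                  # Nature reveals the feature
    p_hat = np.full(M, 1.0 / M)                 # placeholder predictor: uniform
    risk += kl(true_p[x_t], p_hat)              # risk compares to the hidden truth
    y_t = int(rng.choice(M, p=true_p[x_t]))     # true label, never revealed
    if rng.random() < 1 - eta:                  # noisy label distributed as K_eta(y_t)
        y_noisy = y_t
    else:
        y_noisy = int(rng.choice([m for m in range(M) if m != y_t]))
    # A real predictor would now update its internal state using (x_t, y_noisy).

print(risk / T)  # per-step KL-risk of the uniform predictor stays constant, not o(1)
```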
In summary, our main contributions are: (i) a fundamentally new formulation that extends the classical randomized response scenario to a learning context; (ii) tight lower and upper bounds on the expected KL-risk for a wide range of hypothesis classes; and (iii) novel algorithmic and analysis techniques for establishing upper and lower bounds, which may be of independent interest beyond our target problem.

Related work. Our setup is related to label differential privacy as studied in (Chaudhuri & Hsu, 2011; Esfandiari et al., 2022; Ghazi et al., 2021). However, our framework differs in that our learning goal is to recover the actual underlying distribution that generates the data, not a classification function that fits the data. For instance, suppose the underlying distribution is uniform over $\{0,1\}$ for all $x \in \mathcal{X}$; then no classification function can achieve cumulative error better than $T/2$, yet one can still learn the distribution with sublinear KL-risk. The learned distribution can then be used in applications beyond classification. Learning functional distributions has also been extensively studied in the context of sequential probability assignment under logarithmic loss (Yang & Barron, 1998; Cesa-Bianchi & Lugosi, 2006; Rakhlin & Sridharan, 2015; Bilodeau et al., 2020; Wu et al., 2022b; Bhatt & Kim, 2021; Bilodeau et al., 2021). However, these efforts assume that the labels are noiseless. In some of these results the labels are non-realizable, but the regret is still evaluated on the observed labels. In our work, we evaluate the quality of the learned model by comparing it to the actual underlying truth, even though we observe only noisy labels.

2. Problem Formulation

Let $\mathcal{X}$ be the feature space and $\mathcal{Y}$ the label space. We assume throughout the paper that $\mathcal{Y}$ is finite with $|\mathcal{Y}| = M$ for some integer $M \ge 2$. We denote by $\Delta(\mathcal{Y}) = \{(u_1, \dots, u_M) \in [0,1]^M : \sum_{m=1}^{M} u_m = 1\}$ the set of all probability distributions over $\mathcal{Y}$. A function $p : \mathcal{X} \to \Delta(\mathcal{Y})$ is called a hypothesis; it maps each $x \in \mathcal{X}$ to a distribution $p(x) \in \Delta(\mathcal{Y})$ over $\mathcal{Y}$. We write $p(x)[m]$ for the probability mass that $p(x)$ assigns to the $m$th element of $\mathcal{Y}$. A set $\mathcal{F} \subseteq \Delta(\mathcal{Y})^{\mathcal{X}}$ is called a hypothesis class. We introduce several natural hypothesis classes below; they are analyzed in Section 3.2.

Example 2.1 (Constant functions). Let $\mathcal{X}$ be any feature space. Define $\mathcal{F} = \{p_q(\cdot) : \forall x \in \mathcal{X},\ p_q(x) = q,\ q \in \Delta(\mathcal{Y})\}$, i.e., every $x \in \mathcal{X}$ is mapped to the same distribution $q$ under $p_q$. When $|\mathcal{Y}| = 2$, this setup recovers the classical randomized response scenario of (Warner, 1965).

Example 2.2 (Logistic regression). Let $\mathcal{X} = \mathbb{R}^d$ and $W = (\mathbf{w}_1, \dots, \mathbf{w}_M) \in \mathbb{R}^{d\times M}$. The function
$$p(W, x)[m] = \frac{e^{\langle \mathbf{w}_m, x\rangle}}{\sum_{i=1}^{M} e^{\langle \mathbf{w}_i, x\rangle}}, \qquad m \in [M] \overset{\mathrm{def}}{=} \{1, \dots, M\},$$
defines a hypothesis map $\mathcal{X} \to \Delta(\mathcal{Y})$. The set $\mathcal{F}$ of all such hypotheses $p(W, \cdot)$ parameterized by $W \in \mathbb{R}^{d\times M}$ is known as the logistic hypothesis class.

Example 2.3 (Hidden classification model). Let $\mathcal{H} \subseteq [N]^{\mathcal{X}}$ be a class of functions $\mathcal{X} \to [N]$ with $N \in \mathbb{N}$. For any tuple $\mathbf{q} = (q_1, \dots, q_N) \in \Delta(\mathcal{Y})^N$ and $h \in \mathcal{H}$, we define a hypothesis $p(h, \mathbf{q}, x) = q_{h(x)}$, i.e., for any $x \in \mathcal{X}$, if $h(x) = n \in [N]$ then $p(h, \mathbf{q}, x) = q_n$. The class $\mathcal{F} = \{p(h, \mathbf{q}, \cdot) : h \in \mathcal{H},\ \mathbf{q} = (q_1, \dots, q_N) \in \Delta(\mathcal{Y})^N\}$ defines a hypothesis class, which we call the hidden classification model w.r.t. $\mathcal{H}$. Intuitively, the functions in $\mathcal{F}$ classify the features into $N$ classes such that each class corresponds to the same (but not fixed) distribution over $\mathcal{Y}$. We will often assume that $\mathcal{H}$ has bounded complexity, e.g., finite pseudo-dimension. This can be viewed as a generalization of the setup in (Bhatt & Kim, 2021) to the multi-class case.
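The hypothesis maps of Examples 2.2 and 2.3 translate directly into code. Below is a minimal sketch under our own naming (the paper defines only the mathematical maps); the max-subtraction in the softmax is a standard numerical-stability choice of ours.

```python
import numpy as np

def logistic_hypothesis(W, x):
    """Example 2.2: p(W, x)[m] = exp(<w_m, x>) / sum_i exp(<w_i, x>), W of shape (d, M)."""
    scores = W.T @ x              # (M,) inner products <w_m, x>
    scores -= scores.max()        # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()            # a point of Delta(Y)

def hidden_classification_hypothesis(h, q, x):
    """Example 2.3: p(h, q, x) = q_{h(x)}; h maps features to one of N groups,
    each group n sharing a fixed label distribution q[n]."""
    return q[h(x)]

# Usage: d = 4 features, M = 3 labels.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
x = rng.normal(size=4)
print(logistic_hypothesis(W, x))  # sums to 1
```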
For any two probability distributions $p, q \in \Delta(\mathcal{Y})$:
1. The KL-divergence is $\mathrm{KL}(p, q) = \sum_{m=1}^{M} p[m]\log\frac{p[m]}{q[m]}$.
2. The total variation is $\mathrm{TV}(p, q) = \frac12\sum_{m=1}^{M}|p[m] - q[m]|$.
3. The $\chi^2$-divergence is $\chi^2(p, q) = \sum_{m=1}^{M}\frac{(p[m] - q[m])^2}{q[m]}$.

We assume throughout this paper that $\log(x)$ is in base $e$.

Let $\eta \in [0,1]$ be fixed and known. We define the noisy kernel $K_\eta$ to be a random map $\mathcal{Y} \to \mathcal{Y}$ such that for all $y \ne y' \in \mathcal{Y}$ we have $\Pr[K_\eta(y) = y] = 1 - \eta$ and $\Pr[K_\eta(y) = y'] = \frac{\eta}{M-1}$. With a slight abuse of notation, we also interpret $K_\eta$ as a map $\Delta(\mathcal{Y}) \to \Delta(\mathcal{Y})$ such that for any $p \in \Delta(\mathcal{Y})$,
$$K_\eta(p)[m] = (1-\eta)\,p[m] + \frac{\eta\,(1 - p[m])}{M-1}.$$
We assume here that $\eta \in [0, (M-1)/M)$. The inverse kernel $K_\eta^{-1}$ is the map $\Delta(\mathcal{Y}) \to \Delta(\mathcal{Y})$ such that for any $p \in \Delta(\mathcal{Y})$ in the image of $K_\eta$,
$$K_\eta^{-1}(p)[m] = \frac{p[m] - \eta/(M-1)}{1 - M\eta/(M-1)}.$$
It is easy to verify that for any $p \in \Delta(\mathcal{Y})$ we have $K_\eta^{-1}(K_\eta(p)) = p$. Note that both $K_\eta$ and $K_\eta^{-1}$ are (restrictions of) linear maps $\mathbb{R}^M \to \mathbb{R}^M$.

Let $T$ be the time horizon. We denote by $\mathcal{P}$ a set of distributions over $\mathcal{X}^T$, which models the statistical generation mechanism of the feature vectors $x^T$; for instance, $\mathcal{P}$ could be the class of all i.i.d. distributions.

Online learning with private labels. For any hypothesis class $\mathcal{F}$ and distribution class $\mathcal{P}$, we consider the following online learning game between Nature and a predictor. At the beginning of the game, Nature selects some $p \in \mathcal{F}$ (i.e., the well-specified case) and $\nu^T \in \mathcal{P}$, and samples $x^T \sim \nu^T$. At each time step $t \le T$, Nature reveals the $t$th sample $x_t$ of $x^T$. The predictor then makes a prediction $\hat{p}_t \in \Delta(\mathcal{Y})$ using a strategy $\Phi$ based on the history $x^t = \{x_1, \dots, x_t\}$ and $\tilde{y}^{t-1} = \{\tilde{y}_1, \dots, \tilde{y}_{t-1}\}$, i.e., $\hat{p}_t = \Phi(x^t, \tilde{y}^{t-1})$. Nature then samples $y_t \sim p(x_t)$ (independently of all previous data) and reveals the noisy label $\tilde{y}_t = K_\eta(y_t)$ to the predictor. We emphasize that the parameter $\eta$ is fixed and known to the predictor. We are interested in the following (minimax) expected KL-risk:
$$r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{P}) = \inf_{\Phi}\ \sup_{p\in\mathcal{F},\,\nu^T\in\mathcal{P}}\ \mathbb{E}\left[\sum_{t=1}^{T}\mathrm{KL}(p(x_t), \hat{p}_t)\right],$$
where the expectation is over the joint distribution of $x^T$ and $\tilde{y}^T$ generated by the above process, and $\Phi$ ranges over all possible (randomized) prediction rules.
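The kernel $K_\eta$ and its inverse defined above are simple affine maps, and their round-trip identity $K_\eta^{-1}(K_\eta(p)) = p$ can be checked in a few lines. The following is a minimal sketch; the helper names (`K`, `K_inv`, `sample_noisy_label`) are ours, and the specific $p$ and $\eta$ are illustrative.

```python
import numpy as np

def K(p, eta):
    """Forward kernel on distributions: (1-eta) p[m] + eta (1-p[m])/(M-1)."""
    M = len(p)
    return (1 - eta) * p + eta * (1 - p) / (M - 1)

def K_inv(p_tilde, eta):
    """Inverse kernel: (p_tilde[m] - eta/(M-1)) / (1 - M*eta/(M-1)).
    Returns a valid distribution only when p_tilde lies in the image of K_eta."""
    M = len(p_tilde)
    c_eta = 1 - M * eta / (M - 1)
    return (p_tilde - eta / (M - 1)) / c_eta

p = np.array([0.5, 0.3, 0.2])
eta = 0.6                                    # valid since eta < (M-1)/M = 2/3
assert np.allclose(K_inv(K(p, eta), eta), p)  # round trip recovers p exactly

def sample_noisy_label(y, M, eta, rng):
    """Channel view of K_eta acting on a single label y."""
    if rng.random() < 1 - eta:
        return y
    others = [m for m in range(M) if m != y]
    return others[rng.integers(M - 1)]
```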
Privacy of the noisy labels $\tilde{y}^T$. Note that if we take $\eta = \left(\frac{e^{\epsilon}}{M-1} + 1\right)^{-1}$ for some $\epsilon > 0$, i.e., $\epsilon = \log\frac{1-\eta}{\eta/(M-1)}$, then the kernel $K_\eta$ automatically provides $(\epsilon, 0)$-differential privacy for the true labels $y^T$. However, as observed in recent work (Wu et al., 2023), label differential privacy cannot prevent label inference attacks at any non-trivial learning accuracy. We note that this argument holds only for the classification problem, where the learning goal is to recover the labeling function. It does not apply to our distribution recovery problem, since our learning goal is to estimate the distribution of the true labels $y^T$, not the labeling itself. Indeed, let $p \in \Delta(\mathcal{Y})$ and $y \sim p$. We show that knowing the additional information $\tilde{y} = K_\eta(y)$ does not provide much advantage for recovering $y$ compared to knowing only $p$. Observe that the best strategy for recovering $y$ from $p$ alone is to predict the label $y'$ for which $p[y']$ is maximal (i.e., the Bayes-optimal predictor), which has prediction accuracy $p_{\max} = \max\{p[y] : y \in \mathcal{Y}\}$. Suppose now we additionally know $\tilde{y}$; then the best strategy is the maximum a posteriori prediction (i.e., predict the $y'$ maximizing $p[y' \mid \tilde{y}]$), whose prediction accuracy is upper bounded by $p_{\max}\cdot\frac{1-\eta}{\eta/(M-1)}$; see Appendix A for a complete derivation. This implies that knowing the noisy label $\tilde{y}$ provides at most a factor of $\frac{1-\eta}{\eta/(M-1)}$ in the accuracy of recovering $y$. By taking $\eta \to \frac{M-1}{M}$, the factor $\frac{1-\eta}{\eta/(M-1)}$ can be made arbitrarily close to 1.

Online-to-batch conversion. Let $\mu$ be an arbitrary distribution over $\mathcal{X}$ and $\mathcal{F}$ a hypothesis class. Suppose we observe $x^T \overset{\text{i.i.d.}}{\sim} \mu$ and $\tilde{y}^T$ generated by some $p \in \mathcal{F}$ as in the online case. The batch learning problem is to find a function $\Psi : \mathcal{X}^T \times \mathcal{Y}^T \to \Delta(\mathcal{Y})^{\mathcal{X}}$ minimizing
$$R^{\mathrm{KL}}_T(\Psi, \mathcal{F}) = \sup_{\mu, p}\ \mathbb{E}_{x^T, \tilde{y}^T,\, x\sim\mu}\left[\mathrm{KL}\big(p(x),\, \Psi(x^T, \tilde{y}^T)[x]\big)\right].$$
We show in Appendix B that any online strategy $\Phi$ attaining $r^{\mathrm{KL}}_T(\mathcal{F}, \mathrm{IID})$ automatically yields a strategy $\bar{\Phi}$ with $R^{\mathrm{KL}}_T(\bar{\Phi}, \mathcal{F}) \le \frac{1}{T}\,r^{\mathrm{KL}}_T(\mathcal{F}, \mathrm{IID})$, where IID is the class of all i.i.d. distributions over $\mathcal{X}^T$ and
$$\bar{\Phi}(x^T, \tilde{y}^T)[x] = \frac{1}{T}\sum_{t=1}^{T}\Phi(x^{t-1}x, \tilde{y}^{t-1}).$$
Here the summation of probability distributions is understood as summation of the corresponding vectors in $\Delta(\mathcal{Y})$, and $x^{t-1}x$ denotes the concatenation of $x^{t-1}$ and $x$. We also note that the expected guarantee on $R^{\mathrm{KL}}_T(\Psi, \mathcal{F})$ can be boosted to a high-probability guarantee by splitting the sample $x^T$ into blocks and performing cross-validation to select the best predictor from one of the blocks; see Appendix B for a detailed discussion. The main focus of this paper is therefore to bound $r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{P})$ for general classes $\mathcal{F}$ and $\mathcal{P}$, and to understand how the complexities of $\mathcal{F}$ and $\mathcal{P}$ affect the precise KL-risk.

3. Main Results

This is the main section of our paper, where we provide general lower and upper bounds on $r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{P})$ for various classes $\mathcal{F}$ and $\mathcal{P}$ under the noise kernel $K_\eta$.

3.1. Finite hypothesis class $\mathcal{F}$

We assume that $\mathcal{F}$ is finite and that $\mathcal{S} = \{\delta_{x^T} : x^T \in \mathcal{X}^T\}$ is the class of singleton distributions over $\mathcal{X}^T$, where $\delta_{x^T}$ assigns probability 1 to $x^T$. Note that $\mathcal{S}$ is equivalent to the features $x^T$ being presented adversarially, and hence provides a general upper bound for any class of distributions over $\mathcal{X}^T$ including, for example, the class IID of i.i.d. processes. Our first main result is the following theorem.

Theorem 3.1. Let $\mathcal{F} \subseteq \Delta(\mathcal{Y})^{\mathcal{X}}$ be a finite class and $\mathcal{S}$ the class of all singleton distributions over $\mathcal{X}^T$. Then
$$r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{S}) \le O\left(\frac{\log(TM)\sqrt{T\log|\mathcal{F}|}}{1 - \frac{M\eta}{M-1}}\right). \quad (3)$$
Furthermore, for $\mathcal{Y} = \{0,1\}$ and $\mathcal{X} = \{1, \dots, k\}$ with $1 \le k \le T$, there exists a function class $\mathcal{F} \subseteq \Delta(\mathcal{Y})^{\mathcal{X}}$ and noise level $\eta = \frac14$ such that $|\mathcal{F}| = 2^k$ and
$$r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{S}) \ge \Omega\left(\sqrt{T\log|\mathcal{F}|}\right). \quad (4)$$

Note that for $\eta < \frac{M-1}{M}$ the denominator of (3) is strictly positive, and therefore for any such $\eta$ the bound of Theorem 3.1 is of the form $\tilde{O}(\sqrt{T\log|\mathcal{F}|})$, where $\tilde{O}$ hides a poly-$\log T$ factor. Compared to the privacy loss $\frac{1-\eta}{\eta/(M-1)}$ (i.e., the advantage in recovering $y^T$ when observing $\tilde{y}^T$) established in Section 2, we see that by tuning $\eta$ close to $\frac{M-1}{M}$, one can make the privacy loss arbitrarily close to 1 while achieving a sublinear expected KL-risk of the form $\tilde{O}(\sqrt{T\log|\mathcal{F}|})$. The privacy-accuracy trade-off is governed by the constants $\frac{1-\eta}{\eta/(M-1)}$ and $(1 - \frac{M\eta}{M-1})^{-1}$.

Remark 3.2. Theorem 3.1 establishes a fundamental distinction between the noisy and noiseless cases, i.e., $\tilde{O}(\sqrt{T\log|\mathcal{F}|})$ vs. $O(\log|\mathcal{F}|)$ (see (Wu et al., 2022b)). Perhaps surprisingly, this distinction is real: the lower bound $\Omega(\sqrt{T\log|\mathcal{F}|})$ is attained by certain classes $\mathcal{F}$. This differs from the classification problem, where benign noise (such as our $K_\eta$) does not affect the performance substantially; see, e.g., (Ben-David et al., 2009, Thm 15).
Before presenting the proof of Theorem 3.1, we emphasize that knowledge of $\eta$ is key to achieving sub-linear risk (this is true even in the constant-function case, as in (Warner, 1965)). Indeed, consider the two constant functions $p_1 = (0, 1)$ and $p_2 = (\frac14, \frac34)$ (with $\mathcal{Y} = \{0,1\}$). Selecting $\eta_1 = \frac14$ and $\eta_2 = 0$, we have $K_{\eta_1}(p_1) = K_{\eta_2}(p_2)$. Therefore, no predictor can distinguish noisy samples from $p_1$ with parameter $\eta_1$ from noisy samples from $p_2$ with parameter $\eta_2$. Hence, the expected KL-risk must be $\Omega(T)$ for either $p_1$ or $p_2$, since $\mathrm{KL}(p_1, p_2) = \Omega(1)$.

Our proof of Theorem 3.1 is based on the Noisy Smooth Bayesian Predictor described in Algorithm 1, which is the main algorithmic contribution of this paper. Recall the definition of logarithmic loss: for any $y \in \mathcal{Y}$ and $p \in \Delta(\mathcal{Y})$, the logarithmic loss (log-loss) is $\ell_{\log}(p, y) = -\log(p[y])$, i.e., the negative logarithm of the probability that $p$ assigns to label $y$. For any online prediction rule $\Phi$ and $x^T, y^T$, the point-wise regret (Wu et al., 2022b) is
$$R(\Phi, y^T, \mathcal{F} \mid x^T) = \sum_{t=1}^{T}\ell_{\log}(\hat{p}_t, y_t) - \inf_{p\in\mathcal{F}}\sum_{t=1}^{T}\ell_{\log}(p(x_t), y_t),$$
where $\hat{p}_t = \Phi(x^t, y^{t-1})$. We start with the following technical lemmas, whose proofs are presented in Appendix E.

Lemma 3.3. Let $\mathcal{F} \subseteq \Delta(\mathcal{Y})^{\mathcal{X}}$ and $x^T \in \mathcal{X}^T$. For any $p \in \mathcal{F}$ and any online prediction rule $\Phi$,
$$\mathbb{E}_{y^T}\left[R(\Phi, y^T, \mathcal{F} \mid x^T)\right] \ge \mathbb{E}_{y^T}\left[\sum_{t=1}^{T}\mathrm{KL}(p(x_t), \hat{p}_t)\right],$$
where $y^T$ is independently generated with $y_t \sim p(x_t)$ for all $t \in [T]$, and $\hat{p}_t = \Phi(x^t, y^{t-1})$.

Algorithm 1 Noisy Smooth Bayesian Predictor
Input: finite class $\mathcal{F} \subseteq \Delta(\mathcal{Y})^{\mathcal{X}}$ and $\eta < (M-1)/M$
1: Let $\mathcal{F} = \{p_1, \dots, p_K\}$ and $w^1 = (1, \dots, 1) \in \mathbb{R}^K$
2: for $t = 1, \dots, T$ do
3: Receive feature $x_t$.
4: For all $k \in [K]$, set $\tilde{p}_k(x_t) = K_\eta(p_k(x_t))$.
5: For all $y \in \mathcal{Y}$, compute $\bar{p}_t[y] = \frac{\sum_{k=1}^{K}\tilde{p}_k(x_t)[y]\,w^t_k}{\sum_{k=1}^{K}w^t_k}$, where $\bar{p}_t$ defines a distribution over $\mathcal{Y}$.
6: Make the prediction $\hat{p}_t = \frac{K_\eta^{-1}(\bar{p}_t) + 1/(TM^2)}{1 + 1/(MT)}$.
7: Receive noisy label $\tilde{y}_t$.
8: For all $k \in [K]$, update $w^{t+1}_k = w^t_k\cdot\tilde{p}_k(x_t)[\tilde{y}_t]$.

Lemma 3.4. For any distributions $p, q \in \Delta(\mathcal{Y})$:
1. $\mathrm{TV}(p, q) \le \sqrt{\frac12\mathrm{KL}(p, q)}$;
2. $\mathrm{KL}(p, q) \le (2 - \log(q_{\min}))\,\mathrm{TV}(p, q)$, where $q_{\min} = \min\{q[y] : y \in \mathcal{Y}\}$;
3. $\mathrm{KL}(p, q) \le \chi^2(p, q)$.
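Algorithm 1 is short enough to state directly in code. The following is a minimal numpy sketch under our own naming; hypotheses are passed as callables $x \mapsto \Delta(\mathcal{Y})$, and we keep the mixture weights in log space (a design choice of ours, since the products in step 8 underflow quickly for large $T$).

```python
import numpy as np

def noisy_smooth_bayesian_predictor(hypotheses, xs, noisy_labels, eta):
    """Sketch of Algorithm 1. `hypotheses` is a list of functions x -> Delta(Y)
    (the finite class F); `xs` and `noisy_labels` are revealed one step at a time."""
    M = len(hypotheses[0](xs[0]))
    T = len(xs)
    K = len(hypotheses)
    log_w = np.zeros(K)                          # log-weights, all ones initially
    preds = []
    for t in range(T):
        x_t = xs[t]
        # Step 4: push every hypothesis through the kernel K_eta.
        tilde = np.array([(1 - eta) * h(x_t) + eta * (1 - h(x_t)) / (M - 1)
                          for h in hypotheses])  # shape (K, M)
        # Step 5: Bayesian mixture over the noisy class.
        w = np.exp(log_w - log_w.max())
        p_bar = (w[:, None] * tilde).sum(axis=0) / w.sum()
        # Step 6: de-noise with the inverse kernel, then smooth away from 0.
        c_eta = 1 - M * eta / (M - 1)
        p_hat = ((p_bar - eta / (M - 1)) / c_eta + 1 / (T * M**2)) / (1 + 1 / (M * T))
        preds.append(p_hat)
        # Step 8: multiplicative weight update on the observed noisy label.
        log_w += np.log(tilde[:, noisy_labels[t]])
    return preds
```

Note that `p_hat` always sums to 1: the inverse kernel preserves the simplex on the image of $K_\eta$, and the smoothing adds total mass $1/(MT)$, which the denominator removes.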
Proof of Theorem 3.1 (upper bound). First observe that for any $p \in \Delta(\mathcal{Y})$, if $y \sim p$, then $\tilde{y} = K_\eta(y)$ is distributed according to $K_\eta(p)$. For any class $\mathcal{F} \subseteq \Delta(\mathcal{Y})^{\mathcal{X}}$, denote $\tilde{\mathcal{F}} = \{K_\eta(p) : p \in \mathcal{F}\}$. By the construction of Algorithm 1, the predictor $\bar{p}_t$ (step 5) is simply the Bayesian predictor (a.k.a. the aggregating algorithm) over $\tilde{\mathcal{F}}$. By (Wu et al., 2022b, Lemma 3), for all $x^T, \tilde{y}^T$ we have $R(\bar{p}^T, \tilde{y}^T, \tilde{\mathcal{F}} \mid x^T) \le \log|\tilde{\mathcal{F}}|$. Invoking Lemma 3.3, we conclude:
$$\sup_{\tilde{p}\in\tilde{\mathcal{F}}}\ \mathbb{E}_{\tilde{y}^T}\left[\sum_{t=1}^{T}\mathrm{KL}(\tilde{p}(x_t), \bar{p}_t)\right] \le \log|\tilde{\mathcal{F}}|, \quad (5)$$
where $\tilde{y}_t \sim \tilde{p}(x_t)$. We now derive the expected KL-risk of the predictor $\hat{p}_t$ (step 6) from (5), leveraging the relations between information divergences in Lemma 3.4. Our first observation is that $\bar{p}_t$ (step 5 of Algorithm 1) is a convex combination of the functions $\tilde{p}_k \in \tilde{\mathcal{F}}$; therefore the $\bar{p}_t$'s lie in the image of $K_\eta$, since $K_\eta$ is a linear map. For any distributions $p, q \in \Delta(\mathcal{Y})$ such that $q$ is in the image of $K_\eta$, we claim that
$$\mathrm{TV}(K_\eta(p), q) = \left(1 - \frac{M\eta}{M-1}\right)\mathrm{TV}(p, K_\eta^{-1}(q)). \quad (6)$$
To see this, analyze the total variation entry-wise: for any $a, b \in [0,1]$ we have $|a + b\,p[m] - q[m]| = b\,|p[m] - (q[m]-a)/b|$. Letting $a = \frac{\eta}{M-1}$, $b = 1 - \frac{M\eta}{M-1}$, and noting that $(x-a)/b$ is the inverse of $a + bx$, (6) clearly follows. We henceforth abbreviate $c_\eta = 1 - \frac{M\eta}{M-1}$. Therefore, for all $t \in [T]$ and $k \in [K]$,
$$\mathrm{TV}(\tilde{p}_k(x_t), \bar{p}_t) = c_\eta\,\mathrm{TV}\big(p_k(x_t), K_\eta^{-1}(\bar{p}_t)\big).$$
By the elementary inequality $\left|a - \frac{b + 1/(M^2T)}{1 + 1/(MT)}\right| \le |a - b| + \frac{2}{MT}$ for $a, b \in [0,1]$, we have $\mathrm{TV}(p_k(x_t), \hat{p}_t) \le \mathrm{TV}(p_k(x_t), K_\eta^{-1}(\bar{p}_t)) + \frac{2}{T}$, where $\hat{p}_t$ is defined in step 6 of Algorithm 1. Combining the inequalities, for all $t \in [T]$ and $k \in [K]$ we find:
$$\mathrm{TV}(p_k(x_t), \hat{p}_t) \le c_\eta^{-1}\,\mathrm{TV}(\tilde{p}_k(x_t), \bar{p}_t) + \frac{2}{T}. \quad (7)$$
We now upper and lower bound the total variations in (7) via KL-divergence using Lemma 3.4. By inequality 1 of Lemma 3.4, $\mathrm{TV}(\tilde{p}_k(x_t), \bar{p}_t) \le \sqrt{\frac12\mathrm{KL}(\tilde{p}_k(x_t), \bar{p}_t)}$. By inequality 2, we obtain
$$\mathrm{TV}(p_k(x_t), \hat{p}_t) \ge \big(2 + \log(2TM^2)\big)^{-1}\mathrm{KL}(p_k(x_t), \hat{p}_t),$$
where we use the fact that $\hat{p}_t^{\min} \ge \frac{1}{2TM^2}$ due to the smoothing in step 6 of Algorithm 1. Therefore, for all $k \in [K]$,
$$\sum_{t=1}^{T}\mathrm{KL}(p_k(x_t), \hat{p}_t) \le O\big(c_\eta^{-1}\log(TM)\big)\sum_{t=1}^{T}\sqrt{\mathrm{KL}(\tilde{p}_k(x_t), \bar{p}_t)} \overset{(a)}{\le} O\big(c_\eta^{-1}\log(TM)\big)\sqrt{T\sum_{t=1}^{T}\mathrm{KL}(\tilde{p}_k(x_t), \bar{p}_t)},$$
where (a) follows by the Cauchy-Schwarz inequality $\sum_{t=1}^{T}\sqrt{a_t} \le \sqrt{T\sum_{t=1}^{T}a_t}$ for all $a_t \ge 0$. Taking expectation over $\tilde{y}^T$ and noting that $\mathbb{E}[\sqrt{X}] \le \sqrt{\mathbb{E}[X]}$, the expected KL-risk $r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{S})$ is upper bounded by
$$O\big(c_\eta^{-1}\log(TM)\big)\sqrt{T\,\sup_{\tilde{p}_k\in\tilde{\mathcal{F}}}\mathbb{E}_{\tilde{y}^T}\left[\sum_{t=1}^{T}\mathrm{KL}(\tilde{p}_k(x_t), \bar{p}_t)\right]} \le O\big(c_\eta^{-1}\log(TM)\big)\sqrt{T\log|\mathcal{F}|},$$
where the last inequality follows by (5), and therefore (3) follows. $\square$

Remark 3.5. Our proof presented above actually establishes the upper bound for the total variation risk as well, i.e.,
$$\mathbb{E}\left[\sum_{t=1}^{T}\mathrm{TV}\big(p(x_t), K_\eta^{-1}(\bar{p}_t)\big)\right] \le O\left(c_\eta^{-1}\sqrt{T\log|\mathcal{F}|}\right),$$
where $\bar{p}_t$ is as in step 5 of Algorithm 1 (note that here we use $O$, not $\tilde{O}$). For certain application scenarios this is more useful than the KL-risk.

Proof of Theorem 3.1 (lower bound). We now assume the label space $\mathcal{Y} = \{0,1\}$ and feature space $\mathcal{X} = [k]$ with $k \le T$. For any binary sequence $b \in \{0,1\}^k$, define a function $p_b$ such that for all $i \in [k]$:
$$p_b(i)[1] = \begin{cases} 0, & \text{if } b_i = 0, \\ 0.1/\sqrt{T/k}, & \text{otherwise}, \end{cases}$$
and $p_b(i)[0] = 1 - p_b(i)[1]$. Let $\mathcal{F} = \{p_b(\cdot) : b \in \{0,1\}^k\}$. We partition the sequence $x^T$ into $k$ blocks, each of size $T/k$, such that the $i$th block $x(i)$ takes the value $i \in [k]$. We show below that the expected KL-risk is lower bounded by $\Omega(\sqrt{T/k})$ on each block. Suppose this holds, and let $g_i(\Phi, p(i))$ be the expected KL-risk on block $i$ for strategy $\Phi$ when the labels are generated by $p(i)$. Then
$$\inf_\Phi\sup_{p\in\mathcal{F}}\mathbb{E}\left[\sum_{t=1}^{T}\mathrm{KL}(p(x_t), \hat{p}_t)\right] \overset{(a)}{=} \inf_\Phi\sup_{p\in\mathcal{F}}\sum_{i\in[k]}g_i(\Phi, p(i)) \overset{(b)}{=} \inf_\Phi\sum_{i\in[k]}\sup_{p(i)[1]\in\{0,\,0.1/\sqrt{T/k}\}}g_i(\Phi, p(i)) \overset{(c)}{\ge} \sum_{i\in[k]}\inf_{\Phi_i}\sup_{p(i)[1]\in\{0,\,0.1/\sqrt{T/k}\}}g_i(\Phi_i, p(i)) \ge k\cdot\Omega(\sqrt{T/k}) = \Omega(\sqrt{Tk}),$$
where (a) follows by linearity of expectation, (b) follows by the fact that the $p(i)$'s maximizing $g_i(\Phi, p(i))$ locally can be glued into a function $p_b \in \mathcal{F}$ for some $b$, and (c) follows since $\inf\sum \ge \sum\inf$. Since $\log|\mathcal{F}| = k\log 2$, this proves (4).

We now focus on lower bounding the expected KL-risk $g_i(\Phi_i, p(i))$ for each block $i$. Our proof is based on Le Cam's two-point method. Let $T' = T/k$ and fix $\eta = \frac14$ in the sequel. Recall that the features in block $i$ all equal $i$; we therefore fix $x^{T'} = \{i\}^{T'}$. We select two sources $p_1, p_2 \in \Delta(\{0,1\})$ with
$$p_1[1] = 0, \qquad p_2[1] = 0.1/\sqrt{T'}.$$
By Lemma C.2 (in Appendix C), for any prediction $\hat{p}_t$,
$$\max\{\mathrm{KL}(p_1, \hat{p}_t),\ \mathrm{KL}(p_2, \hat{p}_t)\} \ge \frac{0.02}{\sqrt{T'}} \quad\text{and}\quad \mathrm{KL}\big(K_\eta(p_1), K_\eta(p_2)\big) \le \frac{0.08}{T'}, \quad (8)$$
where $K_\eta$ is the noise kernel. Denote by $\tilde{y}^{T'}_1, \tilde{y}^{T'}_2$ the noisy samples from $p_1, p_2$ under kernel $K_\eta$, respectively, i.e., $\tilde{y}^{T'}_i \overset{\text{i.i.d.}}{\sim} K_\eta(p_i)$. By the product property of KL-divergence, $\mathrm{KL}(\tilde{y}^{T'}_1, \tilde{y}^{T'}_2) \le T'\cdot\frac{0.08}{T'} = 0.08$. By Lemma 3.4 (1), this implies
$$\mathrm{TV}(\tilde{y}^{T'}_1, \tilde{y}^{T'}_2) \le 0.2. \quad (9)$$
Now suppose there exists a predictor $\Phi$ that achieves KL-risk $< 0.01\sqrt{T'}$ with probability $\ge 0.7$ for every underlying source in $\{p_1, p_2\}$, i.e., for all $p \in \{p_1, p_2\}$, with $y^{T'} \overset{\text{i.i.d.}}{\sim} p$,
$$\sum_{t=1}^{T'}\mathrm{KL}\big(p, \Phi(x^t, \tilde{y}^{t-1})\big) < 0.01\sqrt{T'}.$$
For any $p_i$, $\Phi$, and $\tilde{y}^{T'}$, define the empirical KL-risk
$$r(p_i, \Phi) = \sum_{t=1}^{T'}\mathrm{KL}\big(p_i, \Phi(x^t, \tilde{y}^{t-1})\big).$$
Let $\phi(\tilde{y}^{T'}) = \arg\min_{p\in\{p_1,p_2\}}r(p, \Phi)$ be a source identifier. We claim that
$$\sup_{i\in\{1,2\}}\ \Pr_{\tilde{y}^{T'}_i\overset{\text{i.i.d.}}{\sim}K_\eta(p_i)}\left[\phi(\tilde{y}^{T'}_i) \ne p_i\right] < 0.3. \quad (10)$$
To see this, note that by (8), for any predictor $\Phi$,
$$\sup_{p\in\{p_1,p_2\}}r(p, \Phi) \ge \frac{T'}{2}\cdot\frac{0.02}{\sqrt{T'}} = 0.01\sqrt{T'},$$
where we used the fact that one of $p_1, p_2$ must attain the maximum in (8) for at least $T'/2$ steps. Therefore, if the KL-risk of $\Phi$ against the true source is $< 0.01\sqrt{T'}$, the rule $\phi$ must identify the true source, which happens with probability $\ge 0.7$ by our assumption. The claim (10) follows. However, this contradicts Le Cam's two-point lemma (Yu, 1997, Lemma 1), which asserts that for any identifier $\phi$,
$$\sup_{i\in\{1,2\}}\Pr\left[\phi(\tilde{y}^{T'}_i) \ne p_i\right] \ge \frac{1 - \mathrm{TV}(\tilde{y}^{T'}_1, \tilde{y}^{T'}_2)}{2} \ge 0.4,$$
using (9). This contradiction implies that for any $\Phi$ there must be some $p \in \{p_1, p_2\}$ such that with probability $\ge 0.3$ the KL-risk is lower bounded by $0.01\sqrt{T'}$; i.e., the expected KL-risk must be lower bounded by $0.3\times 0.01\sqrt{T'} = \Omega(\sqrt{T'})$. This completes the proof. $\square$

One may observe that the main ingredient of the lower bound proof of Theorem 3.1 is taking probability masses near 0. A natural question is whether better bounds are possible when the probability masses are bounded away from 0. Perhaaps surprisingly, this turns out to be the case. We have the following theorem complementing Theorem 3.1.

Theorem 3.6. Let $\mathcal{F} \subseteq \Delta(\mathcal{Y})^{\mathcal{X}}$ be a finite class and $\mathcal{S}$ the class of all singleton distributions over $\mathcal{X}^T$. If there exists a number $\delta_{\min} > 0$ such that $p(x)[y] \ge \delta_{\min}$ for all $x \in \mathcal{X}$, $p \in \mathcal{F}$, and $y \in \mathcal{Y}$, then
$$r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{S}) \le O\left(\frac{\log|\mathcal{F}|}{\delta_{\min}\left(1 - \frac{M\eta}{M-1}\right)^2}\right).$$

Proof. We use the predictor $\hat{p}'_t = K_\eta^{-1}(\bar{p}_t)$ with $\bar{p}_t$ as defined in step 5 of Algorithm 1, i.e., we skip the smoothing used for $\hat{p}_t$. From the proof of Theorem 3.1, it suffices to bound $\mathrm{KL}(p_k(x_t), \hat{p}'_t)$ by $\mathrm{KL}(\tilde{p}_k(x_t), \bar{p}_t)$. Let $p = p_k(x_t)$ and $q = \hat{p}'_t$, and write $\tilde{p} = K_\eta(p)$, $\tilde{q} = K_\eta(q) = \bar{p}_t$. Our goal is to relate $\mathrm{KL}(p, q)$ and $\mathrm{KL}(\tilde{p}, \tilde{q})$. To do so, we exploit properties of the $\chi^2$-divergence. By inequality 3 of Lemma 3.4, $\mathrm{KL}(p, q) \le \chi^2(p, q)$. Therefore,
$$\mathrm{KL}(p, q) \le \frac{1}{\delta_{\min}}\sum_{m=1}^{M}(p[m] - q[m])^2,$$
since $q[m] \ge \delta_{\min}$: by linearity of $K_\eta^{-1}$, $q$ is a convex combination of the $p_k(x_t)$'s, each of which is bounded below by $\delta_{\min}$ by assumption. Invoking inequality 1 of Lemma 3.4, we conclude
$$\mathrm{KL}(\tilde{p}, \tilde{q}) \ge 2\,\mathrm{TV}(\tilde{p}, \tilde{q})^2 \ge \frac12\sum_{m=1}^{M}(\tilde{p}[m] - \tilde{q}[m])^2,$$
where the last inequality follows from $(\sum_m |a_m|)^2 \ge \sum_m a_m^2$. By linearity of $K_\eta$, we have $(\tilde{p}[m] - \tilde{q}[m])^2 = c_\eta^2\,(p[m] - q[m])^2$, where $c_\eta = 1 - \frac{M\eta}{M-1}$. This implies
$$\mathrm{KL}(p_k(x_t), \hat{p}'_t) \le \frac{2}{\delta_{\min}c_\eta^2}\,\mathrm{KL}(\tilde{p}_k(x_t), \bar{p}_t).$$
The theorem now follows from the reduction from KL-risk to log-loss as in Equation (5). $\square$

Remark 3.7. Theorem 3.6 shows that if the probability lower bound $\delta_{\min}$ is a constant, then we essentially achieve the same bound as in the noiseless case! If $\delta_{\min} \gg \sqrt{\log|\mathcal{F}|/T}$, we obtain better bounds than Theorem 3.1. However, the upper bound of Theorem 3.1 holds even with $\delta_{\min} = 0$, in which case Theorem 3.6 provides only a vacuous bound.

Example 3.8. Let $\mathcal{F} = \{p_w : p_w(x)[1] = (1 + e^{-\langle w, x\rangle})^{-1},\ w, x \in \mathbb{R}^d,\ \|w\|_2 \le R,\ \|x\|_2 \le 1\}$ be the class of binary logistic functions. We have $\delta_{\min}^{-1} = 1 + e^{R}$ for $\mathcal{F}$. Using a covering argument as in Theorem 3.13 (in Section 3.2) together with Theorem 3.6, we obtain $r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{S}) \le O\big((e^R + 1)\,d\log(RT)\big)$. When $R = 1$, i.e., $w$ lies in a unit ball, this matches the noiseless $O(d\log T)$ bounds of (Foster et al., 2018; Shamir, 2020).

3.2. General classes $\mathcal{F}$ via covering

We established in Theorem 3.1 the (near) optimal dependency of the expected KL-risk for finite classes. In this section we show that these techniques generalize to broad classes $\mathcal{F}$ via the powerful technique of covering. We need the following extension of the notion of stochastic sequential covering introduced recently in (Wu et al., 2022a).
Definition 3.9. For any class $\mathcal{F} \subseteq \Delta(\mathcal{Y})^{\mathcal{X}}$ and class of distributions $\mathcal{P}$ over $\mathcal{X}^T$, we say that a class $\mathcal{G} \subseteq \Delta(\mathcal{Y})^{\mathcal{X}^*}$ (where $\mathcal{X}^*$ denotes the set of all finite sequences over $\mathcal{X}$) is a stochastic sequential cover of $\mathcal{F}$ w.r.t. $\mathcal{P}$ at scale $\alpha \ge 0$ and confidence $\delta > 0$ if for all $\nu^T \in \mathcal{P}$,
$$\Pr_{x^T\sim\nu^T}\left[\exists p \in \mathcal{F}\ \forall q \in \mathcal{G}\ \exists t \in [T]:\ \mathrm{TV}\big(p(x_t), q(x^t)\big) > \alpha\right] \le \delta.$$

The following covering bound generalizes Theorem 3.1 to general infinite classes; the detailed proof is in Appendix F.

Theorem 3.10. Let $\mathcal{F} \subseteq \Delta(\mathcal{Y})^{\mathcal{X}}$ be any hypothesis class and $\mathcal{P}$ any distribution class. If for every $\alpha \ge 0$ there exists a stochastic sequential covering set $\mathcal{G}_\alpha$ of $\mathcal{F}$ w.r.t. $\mathcal{P}$ at scale $\alpha$ and confidence $\delta = \frac{1}{TM}$, then
$$r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{P}) \le \tilde{O}\left(\sqrt{T\,\inf_{\alpha\ge 0}\{M\alpha^2T/\eta + \log|\mathcal{G}_\alpha|\}}\right),$$
where $\tilde{O}$ hides a factor of $O\big(\log(MT)(1 - M\eta/(M-1))^{-1}\big)$.

Remark 3.11. At a high level, the proof of Theorem 3.10 exploits the following general reduction from KL-risk to log-loss. For any predictor $\tilde{\Phi}$ that achieves regret $r_T(\tilde{\mathcal{F}}, \mathcal{P})$ under log-loss for the class $\tilde{\mathcal{F}} = \{K_\eta(p) : p \in \mathcal{F}\}$ and $\mathcal{P}$, if $\tilde{\Phi}$ makes predictions within the image of $K_\eta$, then the prediction $\big(K_\eta^{-1}(\tilde{\Phi}) + 1/(TM^2)\big)/\big(1 + 1/(TM)\big)$, as in step 6 of Algorithm 1, achieves $r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{P}) \le \tilde{O}\big(\sqrt{T\,r_T(\tilde{\mathcal{F}}, \mathcal{P})}\big)$ by the same argument as in Theorem 3.1. Note that this requires $\tilde{\Phi}$ to make predictions in the image of $K_\eta$, which is satisfied by our Bayesian-averaging-based approach. Moreover, Theorem 3.10 has dependency $M\alpha^2T + \log|\mathcal{G}_\alpha|$ on the regret, which is tighter than the worst-case regret in (Wu et al., 2022b) and matches the average-case regret in (Bilodeau et al., 2021). We note that $\eta$ is a constant close to $\frac{M-1}{M}$.

We now prove tight expected KL-risk bounds for several natural infinite classes using Theorem 3.10. We abbreviate $c_\eta = 1 - M\eta/(M-1)$ in the sequel.

Theorem 3.12. Let $\mathcal{F}$ be the class of all constant functions as in Example 2.1 and $\mathcal{S}$ the class of all singleton distributions over $\mathcal{X}^T$. Then
$$r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{S}) \le O\big(c_\eta^{-1}\log^{3/2}(TM)\sqrt{TM}\big).$$
Moreover, for $\eta = \frac14$, we have $r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{S}) \ge \Omega(\sqrt{TM})$.

Sketch of proof. Note that the set $\Delta(\mathcal{Y})$ can be uniformly covered by a set $\mathcal{G}$ at scale $\alpha = \frac{1}{TM}$ with $|\mathcal{G}| \le (TM^2)^{M-1}$ under the $L_1$ distance. To see this, construct an $\alpha$-grid of $\Delta(\mathcal{Y})$ with step size $\frac{1}{TM^2}$ in each coordinate. Since $\Delta(\mathcal{Y})$ is determined by only $M-1$ free parameters, the upper bound on $|\mathcal{G}|$ follows. The upper bound on the expected KL-risk follows directly from Theorem 3.10, noting that a uniform cover implies a stochastic sequential cover under $\mathcal{S}$. To prove the lower bound, we apply Fano's method to a hard subclass of $\mathcal{F}$ achieving the tight lower bound; see Appendix G for the detailed proof. $\square$

Theorem 3.13. Let $\mathcal{F}$ be the logistic hypothesis class of Example 2.2 with $\|W\|_2 \le B$ and $\|x\|_2 \le 1$, and let $\mathcal{S}$ be the class of all singleton distributions over $\mathcal{X}^T$. Then
$$r^{\mathrm{KL}}_T(\mathcal{F}, \mathcal{S}) \le O\big(c_\eta^{-1}\log(TM)\sqrt{dMT\log(TMB)}\big).$$

Sketch of proof. This follows from the fact that the set $\mathcal{W} = \{W \in \mathbb{R}^{d\times M} : \|W\|_2 \le B\}$ admits a $\frac{1}{MT}$-cover of size $(TMB)^{dM}$ under the $L_2$ norm. Since the logistic function is Lipschitz on $\mathcal{W}$ under the $L_2$ norm, the $L_2$ cover of $\mathcal{W}$ implies a (uniform) $L_1$ cover in the sense of Definition 3.9. The theorem then follows from Theorem 3.10. $\square$
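The grid cover used in the proof sketch of Theorem 3.12 is easy to realize explicitly. Below is a minimal sketch of one such construction (the function name and the exact rounding convention are ours; the paper only needs the counting bound $|\mathcal{G}| \le (TM^2)^{M-1}$).

```python
import itertools
import numpy as np

def simplex_grid_cover(M, step):
    """All points of Delta(Y) whose first M-1 coordinates lie on a grid of
    spacing `step`, with the last coordinate taking the remaining mass.
    Mirrors the covering construction in Theorem 3.12."""
    levels = np.arange(0.0, 1.0 + step / 2, step)
    cover = []
    for head in itertools.product(levels, repeat=M - 1):
        rest = 1.0 - sum(head)
        if rest >= -1e-12:                      # keep only points inside the simplex
            cover.append(np.array(list(head) + [max(rest, 0.0)]))
    return cover

# Any q in Delta(Y) has a grid point within O(M * step) in L1 distance.
G = simplex_grid_cover(M=3, step=0.1)
print(len(G))  # |G| grows like (1/step)^(M-1), matching the (T M^2)^(M-1) bound
```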
The following result provides tight bounds for the hidden classification model under i.i.d. processes.

Theorem 3.14. Let $\mathcal{F}$ be the hidden classification model of Example 2.3 with reference class $\mathcal{H} \subseteq [N]^{\mathcal{X}}$ of pseudo-dimension $d$, and let IID be the class of all i.i.d. distributions over $\mathcal{X}^T$. Then
$$r^{\mathrm{KL}}_T(\mathcal{F}, \mathrm{IID}) \le O\big(c_\eta^{-1}\log^2(TMN)\sqrt{T(d + NM)}\big).$$
Moreover, there exists a class $\mathcal{H}$ of pseudo-dimension $d$ such that
$$r^{\mathrm{KL}}_T(\mathcal{F}, \mathrm{IID}) \ge \Omega\big(\sqrt{T\max\{d, NM\}}\big),$$
provided $d, N = O(T/\log T)$ and $\eta = \frac14$.

Our proof is based on the following key lemma, which bounds the stochastic sequential covering of $\mathcal{F}$ w.r.t. i.i.d. processes; see Appendix H for the detailed proof.

Lemma 3.15. Let $\mathcal{H} \subseteq [N]^{\mathcal{X}}$ be a class with pseudo-dimension $d$ and let $\mathcal{F}$ be the hidden classification model w.r.t. $\mathcal{H}$. Then there exists a stochastic sequential cover $\mathcal{G}$ of $\mathcal{F}$ w.r.t. IID at scale $\frac{1}{TM}$ and confidence $\delta > 0$ such that
$$\log|\mathcal{G}| \le O\big(d\log^2(TN) + \log(TN)\log(1/\delta) + N(M-1)\log(TM)\big).$$

Proof of Theorem 3.14. The upper bound follows directly from Theorem 3.10 and Lemma 3.15 with $\alpha = \delta = \frac{1}{TM}$. We now prove the lower bound by a novel combination of the lower bounds of Theorems 3.1 and 3.12. We first prove the $\Omega(\sqrt{Td})$ bound. Denote by $x_1, \dots, x_d$ the samples that are pseudo-shattered by $\mathcal{H}$, witnessed by $r_1, \dots, r_d$ (Mohri et al., 2018, Def. 11.4). We construct a hard class as in Theorem 3.1 (with $\mathcal{Y} = \{0,1\}$) by defining, for each $h$, a function $p_h$ with $p_h(x_i)[1] = 0.1/\sqrt{T/d}$ if $h(x_i) \ge r_i$ and $p_h(x_i)[1] = 0$ otherwise. Let $\mu$ be the uniform distribution over $x^d$, and let the features be generated i.i.d. from $\mu$. By the multiplicative Chernoff bound (Mitzenmacher & Upfal, 2017, Thm 4.5(2)), with probability $\ge 1/2$ every $x_i$ with $i \in [d]$ appears $\Theta(T/d)$ times, provided $d \le T/\log T$. Using the same argument as in the lower bound proof of Theorem 3.1, the expected KL-risk is lower bounded by $d\cdot\Omega(\sqrt{T/d}) = \Omega(\sqrt{Td})$.

To establish the lower bound $\Omega(\sqrt{TNM})$, we assume there exists a function $h \in \mathcal{H}$ taking values over the full range $[N]$. Denote by $x_1, \dots, x_N$ points such that $h(x_i) = i$. We again choose $\mu$ to be uniform over $x^N$, with features generated i.i.d. from $\mu$. By the argument above, each $x_i$ appears $\Theta(T/N)$ times in the sample with probability $\ge 1/2$, provided $N \le T/\log T$. For each $x_i$ with $i \in [N]$, we construct the hard constant-function class $\mathcal{F}_i$ as in Theorem 3.12. Now, for any tuple $q_1, \dots, q_N$ with $q_i \in \mathcal{F}_i$, we define a function $p_q(x_i) = q_i$. Let $\mathcal{F}' \subseteq \mathcal{F}$ be the class consisting of all such $p_q$'s. By the lower bound proof of Theorem 3.12 (in Appendix G), the expected KL-risk of $\mathcal{F}'$ is lower bounded by $N\cdot\Omega\big(\sqrt{(T/N)M}\big) = \Omega(\sqrt{TNM})$. $\square$

Example 3.16. Let $\mathcal{Y} = \{0,1\}$, $N = 2$, and let the VC-dimension of $\mathcal{H}$ be $d$. Then we recover the setup of (Bhatt & Kim, 2021). By Theorem 3.14, we have $r^{\mathrm{KL}}_T(\mathcal{F}, \mathrm{IID}) = \Theta(\sqrt{Td})$ (up to logarithmic factors) for the (worst-case) hidden classification model $\mathcal{F}$ w.r.t. $\mathcal{H}$. Here we use the fact that the pseudo-dimension degenerates to the VC-dimension for a binary-valued class. This differs substantially from the $O(d\log^2 T)$ noiseless bound established in (Bilodeau et al., 2021; Wu et al., 2022a). Moreover, if $\mathcal{F} \subseteq \Delta(\{0,1\})^{\mathcal{X}}$ is a class with $\alpha$-fat-shattering number $O(\alpha^{-s})$ (viewing the functions in $\mathcal{F}$ as $[0,1]$-valued, interpreted as the probability of label 1), then $r^{\mathrm{KL}}_T(\mathcal{F}, \mathrm{IID}) \le \tilde{O}\big(T^{(s+1)/(s+2)}\big)$ by Theorem 3.10 and the sequential covering estimates of (Wu et al., 2022a, Thm 17).

4. Discussion and Extension

In this paper, we established tight expected KL-risk bounds for learning functional distributions in the presence of noisy labels. Our main technique for establishing the upper bounds is a reduction to sequential probability assignment, achieved by relating $\mathrm{KL}(p, q)$ to $\mathrm{KL}(\tilde{p}, \tilde{q})$ (where $\tilde{p} = K_\eta(p)$). For instance, our Theorem 3.1 relies on the relation $\mathrm{KL}(p, q) \le \tilde{O}\big(\sqrt{\mathrm{KL}(\tilde{p}, \tilde{q})}\big)$, while Theorem 3.6 relies on $\mathrm{KL}(p, q) \le O\big(\delta_{\min}^{-1}\mathrm{KL}(\tilde{p}, \tilde{q})\big)$. We believe investigating other such relations will be an interesting future direction and may result in better bounds.
Another extension is the generalization to misspecified cases, i.e., when the underlying function is not covered by the hypothesis class. However, we stress that such a generalization (with sublinear regret) is not an easy task, since in the misspecified case the relations between information divergences in Lemma 3.4 may not hold, and substantially new techniques would be required. Furthermore, one can also examine general noisy kernels and risk-loss functions, beyond the kernel $K_\eta$ and KL-risk studied in this paper.

Acknowledgements

The authors would like to thank Chih-Hao Fang for helpful discussions in the earlier stage of formulating this problem. This work was partially supported by the NSF Center for Science of Information (CSoI) Grant CCF-0939370, and also by NSF Grants CCF-2006440, CCF-2007238, CCF-2211423, and a Google Research Award.

References

Ben-David, S., Pál, D., and Shalev-Shwartz, S. Agnostic online learning. In Conference on Learning Theory, volume 3, 2009.

Bhatt, A. and Kim, Y.-H. Sequential prediction under log-loss with side information. In Algorithmic Learning Theory, pp. 340-344. PMLR, 2021.

Bilodeau, B., Foster, D., and Roy, D. Tight bounds on minimax regret under logarithmic loss via self-concordance. In International Conference on Machine Learning, pp. 919-929. PMLR, 2020.

Bilodeau, B., Foster, D. J., and Roy, D. M. Minimax rates for conditional density estimation via empirical entropy. arXiv preprint arXiv:2109.10461, 2021.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning and Games. Cambridge University Press, 2006.

Chaudhuri, K. and Hsu, D. Sample complexity bounds for differentially private learning. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 155-186. JMLR Workshop and Conference Proceedings, 2011.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley-Interscience, 2006.

Daniely, A., Sabato, S., Ben-David, S., and Shalev-Shwartz, S. Multiclass learnability and the ERM principle. J. Mach. Learn. Res., 16(1):2377-2404, 2015.

Esfandiari, H., Mirrokni, V., Syed, U., and Vassilvitskii, S. Label differential privacy via clustering. In International Conference on Artificial Intelligence and Statistics, pp. 7055-7075. PMLR, 2022.

Foster, D. J., Kale, S., Luo, H., Mohri, M., and Sridharan, K. Logistic regression: The importance of being improper. In Conference On Learning Theory, pp. 167-208. PMLR, 2018.

Ghazi, B., Golowich, N., Kumar, R., Manurangsi, P., and Zhang, C. Deep learning with label differential privacy. Advances in Neural Information Processing Systems, 34:27131-27145, 2021.

Haussler, D. and Long, P. M. A generalization of Sauer's lemma. Journal of Combinatorial Theory, Series A, 71(2):219-240, 1995.

Mitzenmacher, M. and Upfal, E. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2017.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2018.

Nielsen, F. On a variational definition for the Jensen-Shannon symmetrization of distances based on the information radius. Entropy, 23(4):464, 2021.

Polyanskiy, Y. and Wu, Y. Information Theory: From Coding to Learning. Cambridge University Press, 2022.

Rakhlin, A. and Sridharan, K. Sequential probability assignment with binary alphabets and large classes of experts. arXiv preprint arXiv:1501.07340, 2015.

Rubinstein, B., Bartlett, P., and Rubinstein, J. Shifting, one-inclusion mistake bounds and tight multiclass expected risk bounds. Advances in Neural Information Processing Systems, 19, 2006.
Shamir, G. I. Logistic regression regret: What's the catch? In Conference on Learning Theory, pp. 3296-3319. PMLR, 2020.

Warner, S. L. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63-69, 1965.

Wu, C., Heidari, M., Grama, A., and Szpankowski, W. Expected worst case regret via stochastic sequential covering. arXiv preprint arXiv:2209.04417, 2022a.

Wu, C., Heidari, M., Grama, A., and Szpankowski, W. Precise regret bounds for log-loss via a truncated Bayesian algorithm. In Advances in Neural Information Processing Systems, volume 35, pp. 26903-26914, 2022b.

Wu, R., Zhou, J. P., Weinberger, K. Q., and Guo, C. Does label differential privacy prevent label inference attacks? In International Conference on Artificial Intelligence and Statistics, pp. 4336-4347. PMLR, 2023.

Yang, Y. and Barron, A. R. An asymptotic property of model selection criteria. IEEE Transactions on Information Theory, 44(1):95-116, 1998.

Yu, B. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pp. 423-435. Springer, 1997.

A. Accuracy of Recovering True Labels from Noisy Labels

We provide a derivation of the prediction accuracy for recovering $y \sim p$ when observing $\tilde{y} = K_\eta(y)$. Let $\phi$ be any recovery rule. We have
$$\Pr[\phi(\tilde{y}) = y] = \mathbb{E}_{\tilde{y}}\big[\Pr[\phi(\tilde{y}) = y \mid \tilde{y}]\big] \le \mathbb{E}_{\tilde{y}}\big[\max\{p[y' \mid \tilde{y}] : y' \in \mathcal{Y}\}\big].$$
We now analyze the maximum a posteriori probability $\max\{p[y' \mid \tilde{y}] : y' \in \mathcal{Y}\}$. We assume $\mathcal{Y} = [M]$ for simplicity. Let $\tilde{y} = m$. For $n \ne m$ we have
$$\Pr[y = n \mid \tilde{y} = m] = \frac{p[n]\,\eta/(M-1)}{p[m](1-\eta) + (1 - p[m])\,\eta/(M-1)},$$
$$\Pr[y = m \mid \tilde{y} = m] = \frac{p[m](1-\eta)}{p[m](1-\eta) + (1 - p[m])\,\eta/(M-1)}.$$
Now, the key observation is that if $\eta \le (M-1)/M$, we have $\eta/(M-1) \le 1 - \eta$, meaning that $p[m](1-\eta) + (1 - p[m])\,\eta/(M-1) \ge \eta/(M-1)$. Hence
$$\Pr[y' \mid \tilde{y}] \le \max_{m,n}\left\{p[n],\ \frac{p[m](1-\eta)}{\eta/(M-1)}\right\} \le p_{\max}\cdot\frac{1-\eta}{\eta/(M-1)},$$
since $\frac{1-\eta}{\eta/(M-1)} \ge 1$.

B. Online-to-Batch Conversion

We establish the online-to-batch conversion results under KL-risk. Let $\Phi$ be an online predictor achieving $r^{\mathrm{KL}}_T(\mathcal{F}, \mathrm{IID})$, and define the batch learner $\bar{\Phi}(x^T, \tilde{y}^T)[x] = \frac1T\sum_{t=1}^{T}\Phi(x^{t-1}x, \tilde{y}^{t-1})$. Then
$$R^{\mathrm{KL}}_T(\bar{\Phi}, \mathcal{F}) = \sup_{\mu,p}\,\mathbb{E}_{x^T,\tilde{y}^T}\,\mathbb{E}_{x\sim\mu}\left[\mathrm{KL}\big(p(x), \bar{\Phi}(x^T,\tilde{y}^T)[x]\big)\right] \overset{(a)}{\le} \sup_{\mu,p}\,\mathbb{E}_{x^T,\tilde{y}^T}\left[\frac1T\sum_{t=1}^{T}\mathbb{E}_{x\sim\mu}\,\mathrm{KL}\big(p(x), \Phi(x^{t-1}x, \tilde{y}^{t-1})\big)\right] \overset{(b)}{=} \sup_{\mu,p}\,\mathbb{E}_{x^T,\tilde{y}^T}\left[\frac1T\sum_{t=1}^{T}\mathrm{KL}\big(p(x_t), \Phi(x^t, \tilde{y}^{t-1})\big)\right] \overset{(c)}{\le} \frac1T\,r^{\mathrm{KL}}_T(\mathcal{F}, \mathrm{IID}),$$
where (a) follows by convexity of the KL-divergence (Polyanskiy & Wu, 2022, Theorem 7.5(b)); (b) follows by the fact that $x_t$ is independent of $x^{t-1}$ and distributed as $x$, with $x^T \overset{\text{i.i.d.}}{\sim}\mu$ and $\tilde{y}^T$ being the noisy sample of $p$ on $x^T$; and (c) follows by the law of total probability and the definition of $r^{\mathrm{KL}}_T$. Note that the only property used in the above derivation is the convexity of the KL-divergence, which holds for any $f$-divergence (Polyanskiy & Wu, 2022, Theorem 7.5(b)).
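The conversion above is a one-liner in code. Below is a minimal sketch under our own interface, where the online strategy `Phi` takes the feature and noisy-label histories and returns a distribution; averaging distributions coordinate-wise yields a valid distribution, which is exactly why convexity of KL drives the argument.

```python
import numpy as np

def online_to_batch(Phi, xs, noisy_labels):
    """Given an online strategy Phi(history_x, history_y) -> Delta(Y), return
    the batch learner of Appendix B:
    bar_Phi(x) = (1/T) * sum_t Phi(x^{t-1} + [x], y~^{t-1})."""
    T = len(xs)
    def bar_Phi(x):
        preds = [Phi(list(xs[:t]) + [x], list(noisy_labels[:t])) for t in range(T)]
        return np.mean(preds, axis=0)  # average of distributions is a distribution
    return bar_Phi
```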
B.1. Boosting the expected guarantee to a high-probability guarantee

We now describe a general approach for boosting the expected guarantee of a batch learner to a high-probability guarantee. We establish the result only for total variation, since the result for KL-divergence follows by Remark 3.5 and Lemma 3.4. Let $\mathcal{F} \subseteq \Delta(\mathcal{Y})^{\mathcal{X}}$ be an arbitrary class and $\mu$ any distribution over $\mathcal{X}$. Suppose there exists a learning rule $\Psi$ such that
$$\sup_{p\in\mathcal{F}}\,\mathbb{E}_{x^T,\tilde{y}^T}\,\mathbb{E}_{x\sim\mu}\left[\mathrm{TV}\big(p(x), \Psi(x^T,\tilde{y}^T)[x]\big)\right] \le R(T)$$
for some function $R(T)$. We partition the sample $x^T$ into $k$ blocks, each of size $T/k$, where $k$ is to be determined later. We now fix some $p \in \mathcal{F}$, denote by $\Psi_i$ the function generated by $\Psi$ on the $i$th block, and let
$$E_i = \mathbb{E}_{x\sim\mu}\left[\mathrm{TV}(p(x), \Psi_i(x))\right].$$
By definition, the $E_i$'s are independent across $i \in [k]$, and for all $i \in [k]$, $\mathbb{E}_{x^T,\tilde{y}^T}[E_i] \le R(T/k)$. Let $I_i$ be the indicator of the event $\{E_i \le 3R(T/k)\}$; by Markov's inequality, $\mathbb{E}[I_i] \ge \frac23$. By the Hoeffding bound (Cesa-Bianchi & Lugosi, 2006, Corollary A.1), with probability $\ge 1 - e^{-k/3}$, at least half of the indicators $I_i$ equal 1. Now, for any pair $i, j \in [k]$, we compute the distance
$$d(\Psi_i, \Psi_j) = \frac{1}{|J_{i,j}|}\sum_{x_t\in J_{i,j}}\mathrm{TV}\big(\Psi_i(x_t), \Psi_j(x_t)\big),$$
where $J_{i,j}$ is the subset of $x^T$ corresponding to the blocks other than $i$ and $j$, and is therefore independent of $\Psi_i, \Psi_j$. Note that for any $i, j$, if $I_i = I_j = 1$, then with probability $\ge 1 - 2e^{-2r}$ we have
$$d(\Psi_i, \Psi_j) \le 6R(T/k) + \sqrt{\frac{r}{T(k-2)/k}},$$
by the Hoeffding bound and the triangle inequality for total variation. Similarly, for any $i$ with $E_i > 9R(T/k) + 2\sqrt{\frac{r}{T(k-2)/k}}$ and any $j$ with $I_j = 1$, we have with probability $\ge 1 - 2e^{-2r}$ that
$$d(\Psi_i, \Psi_j) > 6R(T/k) + \sqrt{\frac{r}{T(k-2)/k}}.$$
We now define the learner $\Psi^*$ to be any $\Psi_i$ such that $d(\Psi_i, \Psi_j) \le 6R(T/k) + \sqrt{\frac{r}{T(k-2)/k}}$ for at least half of the $\Psi_j$'s. Taking $k = 3\log(2/\delta)$ and $r = \log(4k^2/\delta)/2$, we have by a union bound (over all the events above), with probability $\ge 1-\delta$ over $x^T, \tilde{y}^T$,
$$\mathbb{E}_{x\sim\mu}[\mathrm{TV}(p(x), \Psi^*(x))] \le 9R(T/k) + 2\sqrt{\frac{r}{T(k-2)/k}} \le 9R\left(\frac{T}{3\log(2/\delta)}\right) + O\left(\sqrt{\frac{\log(1/\delta)}{T}}\right). \quad (12)$$

Proposition B.1. Let $\mathcal{H} \subseteq [N]^{\mathcal{X}}$ be a class of pseudo-dimension $d$, $\mu$ an arbitrary distribution over $\mathcal{X}$, and $\mathcal{F}$ the hidden classification model w.r.t. $\mathcal{H}$. Then there exists a learning rule $\Psi$ such that for any $p \in \mathcal{F}$, with probability $\ge 1-\delta$ over $(x^T, \tilde{y}^T)$, where $x^T \overset{\text{i.i.d.}}{\sim}\mu$ and $\tilde{y}_t \sim K_\eta(p(x_t))$ are the noisy labels, we have
$$\mathbb{E}_{x\sim\mu}\left[\mathrm{KL}\big(p(x), \Psi(x^T,\tilde{y}^T)[x]\big)\right] \le O\left(c_\eta^{-1}\log^2(TMN)\sqrt{\frac{(d + NM)\log(1/\delta)}{T}}\right),$$
where $c_\eta = 1 - M\eta/(M-1)$ and $\eta \in [0, (M-1)/M)$.

Proof. Let $\mathcal{G}$ be the stochastic sequential cover of $\mathcal{F}$ w.r.t. IID at $\alpha = \delta = \frac{1}{TM}$ as in Lemma 3.15. Let $\hat{p}'_t = K_\eta^{-1}(\bar{p}_t)$, where $\bar{p}_t$ is the predictor at step 5 of Algorithm 1 with input $\mathcal{G}$. By (14) and Lemma 3.4 (1), the expected TV-risk (see Remark 3.5) of $\hat{p}'_t$ is upper bounded by $O\big(c_\eta^{-1}\sqrt{T\log|\mathcal{G}|}\big)$. Since total variation is convex, this implies that the batch learner $\Psi'(x^T,\tilde{y}^T)[x] = \frac1T\sum_{t=1}^{T}\hat{p}'_t(x^{t-1}x, \tilde{y}^{t-1})$ achieves expected (batch) TV-risk
$$\sup_{p\in\mathcal{F}}\,\mathbb{E}_{x^T,\tilde{y}^T}\,\mathbb{E}_{x\sim\mu}\left[\mathrm{TV}\big(p(x), \Psi'(x^T,\tilde{y}^T)[x]\big)\right] \le O\left(c_\eta^{-1}\sqrt{\frac{\log|\mathcal{G}|}{T}}\right).$$
By (12) and $\log|\mathcal{G}| \le O\big((d + NM)\log^2(TNM)\big)$ (Lemma 3.15), we have for all $p \in \mathcal{F}$, with probability $\ge 1-\delta$ over $x^T, \tilde{y}^T$, that
$$\mathbb{E}_{x\sim\mu}\left[\mathrm{TV}\big(p(x), \Psi^*(x^T,\tilde{y}^T)[x]\big)\right] \le O\left(c_\eta^{-1}\log(TMN)\sqrt{\frac{(d + NM)\log(1/\delta)}{T}}\right).$$
We now define the smoothed learner $\Psi = (\Psi^* + 1/(TM^2))/(1 + 1/(TM))$. Invoking Lemma 3.4 (2) and noting that $\Psi_{\min} \ge \frac{1}{2TM^2}$, we have $\mathrm{KL}\big(p(x), \Psi(x^T,\tilde{y}^T)[x]\big) \le O\big(\log(TM)\,\mathrm{TV}(p(x), \Psi^*(x^T,\tilde{y}^T)[x])\big)$, and therefore the result follows. $\square$

C. Supporting Lemmas for the Proof of Theorem 3.1 (Lower Bound)

We now prove the following technical lemmas, which are crucial in the proof of Theorem 3.1 (lower bound).

Lemma C.1. For any distributions $p, q \in \Delta(\mathcal{Y})$, we have
$$\arg\min_{r\in\Delta(\mathcal{Y})}\{\mathrm{KL}(p, r) + \mathrm{KL}(q, r)\} = \frac{p+q}{2}.$$
This implies $\inf_{r\in\Delta(\mathcal{Y})}\max\{\mathrm{KL}(p, r), \mathrm{KL}(q, r)\} \ge \frac12\left[\mathrm{KL}\big(p, \frac{p+q}{2}\big) + \mathrm{KL}\big(q, \frac{p+q}{2}\big)\right]$.

Proof. This result already appeared in (Nielsen, 2021, Equation (32)); we provide an alternative, simpler proof here. Observe that
$$\mathrm{KL}(p, r) + \mathrm{KL}(q, r) = 2\sum_{m=1}^{M}\frac{p[m]+q[m]}{2}\log\frac{1}{r[m]} - H(p) - H(q),$$
where $H(\cdot)$ denotes the Shannon entropy. It is therefore sufficient to minimize $\sum_{m=1}^{M}\frac{p[m]+q[m]}{2}\log\frac{1}{r[m]} = \mathrm{KL}\big(\frac{p+q}{2}, r\big) + H\big(\frac{p+q}{2}\big)$, which attains its minimum when $r = \frac{p+q}{2}$. The last part of the lemma follows from $\max\{a, b\} \ge \frac{a+b}{2}$ for all $a, b \ge 0$. $\square$
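Lemma C.1 is easy to sanity-check numerically. The following small grid search on the binary simplex (our own illustrative values of $p$ and $q$) confirms that the midpoint $(p+q)/2$ minimizes $\mathrm{KL}(p, r) + \mathrm{KL}(q, r)$.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.9, 0.1])
q = np.array([0.4, 0.6])
grid = np.linspace(1e-6, 1 - 1e-6, 10001)
objective = [kl(p, np.array([r, 1 - r])) + kl(q, np.array([r, 1 - r])) for r in grid]
best_r = grid[int(np.argmin(objective))]
print(best_r, (p[0] + q[0]) / 2)  # both approximately 0.65
```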
Lemma C.2. Let $p_1, p_2 \in \Delta(\{0,1\})$ be two distributions such that $p_1[1] = 0$ and $p_2[1] = 0.1/\sqrt{T}$. Then:
1. For any $r \in \Delta(\{0,1\})$, $\max\{\mathrm{KL}(p_1, r), \mathrm{KL}(p_2, r)\} \ge \frac12\mathrm{KL}\big(p_1, \frac{p_1+p_2}{2}\big) \ge \frac{0.02}{\sqrt{T}}$.
2. Let $\tilde{p}_1 = K_\eta(p_1)$ and $\tilde{p}_2 = K_\eta(p_2)$ with $\eta \in [\frac14, \frac12)$. Then $\mathrm{KL}(\tilde{p}_1, \tilde{p}_2) \le \frac{0.08}{T}$.

Proof. The first inequality of statement (1) follows by Lemma C.1 (dropping the non-negative second KL term). To prove the second inequality, we have by direct computation
$$\frac12\mathrm{KL}\left(p_1, \frac{p_1+p_2}{2}\right) = \frac12\log\frac{1}{1 - 0.05/\sqrt{T}} \ge \frac{0.025}{\sqrt{T}} \ge \frac{0.02}{\sqrt{T}},$$
where the first inequality follows by $\log(x) \ge 1 - \frac1x$ for all $x > 0$. To prove statement (2), we observe that for any $a, b \in [0,1]$,
$$a\log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b} \le a\left(\frac{a}{b} - 1\right) + (1-a)\left(\frac{1-a}{1-b} - 1\right) = \frac{(a-b)^2}{b(1-b)},$$
where the inequality follows by $\log(x) \le x - 1$ for all $x \ge 0$. This implies
$$\mathrm{KL}(\tilde{p}_1, \tilde{p}_2) \le \frac{(\tilde{p}_1[1] - \tilde{p}_2[1])^2}{\tilde{p}_2[1](1 - \tilde{p}_2[1])} \le \frac{16}{3}\left(\frac12\cdot\frac{0.1}{\sqrt{T}}\right)^2 \le \frac{0.08}{T},$$
where the second inequality follows since $\tilde{p}_2[1](1 - \tilde{p}_2[1]) \ge \frac{3}{16}$ and $|\tilde{p}_1[1] - \tilde{p}_2[1]| = |1 - 2\eta|\cdot\frac{0.1}{\sqrt{T}} \le \frac12\cdot\frac{0.1}{\sqrt{T}}$ by the definition of $K_\eta$ and $\eta \in [\frac14, \frac12)$; the constant 0.08 is selected for ease of computation. $\square$

Remark C.3. Note that if we take $p_1[1] = 0$ and $p_2[1] = 0.1\,c_\eta/\sqrt{T}$ with $c_\eta = (1-2\eta)^{-1}$, then statement (2) of Lemma C.2 still holds, while the bound in statement (1) becomes $0.02\,c_\eta/\sqrt{T}$. This can be exploited to establish the tight dependency on $\eta$ of the lower bound in Theorem 3.1. Moreover, the upper bound established for our $K_\eta$ in statement (2) can be extended to any noisy kernel $K$ with $\chi^2(K(0), K(1)) \le 16$, by the locally $\chi^2$-like property of the KL-divergence (Polyanskiy & Wu, 2022, Prop. 2.19), i.e., $\mathrm{KL}\big(\lambda p + (1-\lambda)q,\ q\big) \le \frac{\lambda^2}{2}\chi^2(p, q) + o(\lambda^2)$, taking $\lambda = 0.1/\sqrt{T}$, $p = K(1)$, and $q = K(0)$ for sufficiently large $T$.

D. Packing Number of the Boolean Cube with Given Hamming Weight

We now establish a lower bound for the packing number of the Boolean cube with given Hamming weight, which is crucial for our lower bound proof of Theorem 3.12.

Theorem D.1. There exists a set $V \subseteq \{0,1\}^{2n}$ with $|V| \ge \frac{1}{4\sqrt{n}}e^{n/16}$ such that for all $v_1 \ne v_2 \in V$ we have
$$\mathrm{Ham}(v_1, \mathbf{0}) = \mathrm{Ham}(v_2, \mathbf{0}) = n, \qquad \mathrm{Ham}(v_1, v_2) \ge \frac{n}{2},$$
where Ham denotes the Hamming distance and $\mathbf{0}$ the all-zero vector.

Proof. We use the probabilistic method, as in the usual packing number estimates without the Hamming weight restriction. We select the vectors of $V$ uniformly at random from all vectors of Hamming weight $n$. For any fixed pair $v_1, v_2$, we have
$$\Pr\left[\mathrm{Ham}(v_1, v_2) \le \frac{n}{2}\right] \le \frac{2n\sum_{j\ge 3n/4}\binom{n}{j}\binom{n}{n-j}}{2^{2n}} \le 2n\Pr\left[\sum_{i=1}^{n}X_i \le \frac{n}{4}\right] \le 2n\,e^{-n/8},$$
where $X_i \overset{\text{i.i.d.}}{\sim}\mathrm{Bernoulli}(\frac12)$. The first inequality follows by conditioning on $v_1$, computing the probability over the randomness of $v_2$, and using $\binom{2n}{n} \ge \frac{2^{2n}}{2n}$; the second follows since the inner sum is at most $2^n\cdot 2^n\Pr[\sum_i X_i \le n/4]$ (by symmetry of the binomial around $n/2$); the last is the Hoeffding bound. By a union bound,
$$\Pr\left[\exists\, v_1 \ne v_2 \in V:\ \mathrm{Ham}(v_1, v_2) \le \frac{n}{2}\right] \le |V|^2\cdot 2n\,e^{-n/8}.$$
Therefore, taking $|V| = \lfloor e^{n/16}/(2\sqrt{n})\rfloor$, this probability is upper bounded by $\frac12$, meaning that a set $V$ with the required property must exist. $\square$
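The probabilistic argument behind Theorem D.1 can be illustrated empirically: two independent uniform weight-$n$ vectors in $\{0,1\}^{2n}$ are far apart in Hamming distance with overwhelming probability. A Monte Carlo sketch (parameters are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 64, 2000
far = 0
for _ in range(trials):
    v1 = rng.permutation(np.repeat([0, 1], n))  # uniform vector of Hamming weight n
    v2 = rng.permutation(np.repeat([0, 1], n))
    if np.sum(v1 != v2) >= n // 2:              # pairwise distance at least n/2
        far += 1
print(far / trials)  # close to 1, consistent with the 2n * exp(-n/8) pair bound
```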
E. Proofs of Lemmas 3.3 and 3.4

Proof of Lemma 3.3. By the definition of log-loss, we have for any $p \in \mathcal{F}$:
$$R(\Phi, y^T, \mathcal{F} \mid x^T) = \sup_{p'\in\mathcal{F}}\left[\sum_{t=1}^{T}\log\frac{1}{\hat{p}_t[y_t]} - \sum_{t=1}^{T}\log\frac{1}{p'(x_t)[y_t]}\right] \ge \sum_{t=1}^{T}\log\frac{p(x_t)[y_t]}{\hat{p}_t[y_t]}.$$
For any random variable $Z$, we denote by $\mathbb{E}_t[Z] = \mathbb{E}[Z \mid y^{t-1}]$ the conditional expectation given $y^{t-1}$. By the definition of KL-divergence,
$$\mathbb{E}_t\left[\log\frac{p(x_t)[y_t]}{\hat{p}_t[y_t]}\right] = \mathrm{KL}(p(x_t), \hat{p}_t),$$
since $y_t \sim p(x_t)$. The lemma follows by the law of total probability, $\mathbb{E}[Z_1 + \cdots + Z_T] = \mathbb{E}[\mathbb{E}_1[Z_1] + \cdots + \mathbb{E}_T[Z_T]]$, taking $Z_t = \log\frac{p(x_t)[y_t]}{\hat{p}_t[y_t]}$. $\square$

Proof of Lemma 3.4. The first inequality is Pinsker's inequality (Cover & Thomas, 2006, Lemma 11.6.1). The second inequality follows by (Yang & Barron, 1998, Lemma 4), which gives $\mathrm{KL}(p, q) \le (2 - \log(q_{\min}))\,H^2(p, q)$, together with $H^2(p, q) \le \mathrm{TV}(p, q)$, where $H^2(p, q) = \frac12\sum_{m=1}^{M}\big(\sqrt{p[m]} - \sqrt{q[m]}\big)^2$ is the squared Hellinger distance; the latter inequality follows from $(\sqrt{a} - \sqrt{b})^2 \le |\sqrt{a} - \sqrt{b}|\,(\sqrt{a} + \sqrt{b}) = |a - b|$ for all $a, b \in [0,1]$. The third inequality is standard; see (Polyanskiy & Wu, 2022, Equation 7.31). $\square$

F. Proof of Theorem 3.10

The proof follows a similar path to the upper bound proof of Theorem 3.1. For any stochastic sequential covering set $\mathcal{G}_\alpha$ of $\mathcal{F}$ w.r.t. $\mathcal{P}$ at scale $\alpha$ and confidence $\delta = \frac{1}{TM}$, we denote by $\hat{p}_t$ the predictor obtained by running Algorithm 1 with input $\mathcal{G}_\alpha$ (the adaptation to sequential functions $g \in \mathcal{G}_\alpha$ is straightforward: replace every occurrence of $p(x_t)$ with $g(x^t)$). We show that for this predictor,
$$\sup_{p\in\mathcal{F},\,\nu^T\in\mathcal{P}}\mathbb{E}\left[\sum_{t=1}^{T}\mathrm{KL}(p(x_t), \hat{p}_t)\right] \le \tilde{O}\left(\sqrt{T\big(M\alpha^2T/\eta + \log|\mathcal{G}_\alpha|\big)}\right).$$
The theorem will then follow by optimizing over $\alpha \ge 0$. By the argument in the upper bound proof of Theorem 3.1, it is sufficient to prove that
$$\sup_{p\in\mathcal{F},\,\nu^T\in\mathcal{P}}\mathbb{E}\left[\sum_{t=1}^{T}\mathrm{KL}(\tilde{p}(x_t), \bar{p}_t)\right] \le O\big(M\alpha^2T/\eta + \log|\mathcal{G}_\alpha|\big), \quad (14)$$
where $\tilde{p} = K_\eta(p)$ and $\bar{p}_t$ is the Bayesian predictor at step 5 of Algorithm 1 with input $\mathcal{G}_\alpha$. To achieve this, by the standard regret bound of the Bayesian algorithm (Wu et al., 2022b, Lemma 3), we have for any $x^T$ and $\tilde{y}^T$:
$$\sum_{t=1}^{T}\big[\ell_{\log}(\bar{p}_t, \tilde{y}_t) - \ell_{\log}(\tilde{g}(x^t), \tilde{y}_t)\big] \le \log|\mathcal{G}_\alpha|, \quad (15)$$
where $\tilde{g} = K_\eta(g)$ for any $g \in \mathcal{G}_\alpha$. We now fix any $\nu^T \in \mathcal{P}$. By the definition of stochastic sequential covering, with probability $\ge 1 - \frac{1}{TM}$ over $x^T \sim \nu^T$, for every $p \in \mathcal{F}$ there exists $g \in \mathcal{G}_\alpha$ such that $\mathrm{TV}(p(x_t), g(x^t)) \le \alpha$ for all $t \in [T]$. Denote this event by $A$, and fix any realization $x^T \in A$. For any $p \in \mathcal{F}$ and $g \in \mathcal{G}_\alpha$, with $\tilde{y}_t \sim \tilde{p}(x_t)$, the following holds:
$$\mathbb{E}_{\tilde{y}^T}\left[\sum_{t=1}^{T}\mathrm{KL}(\tilde{p}(x_t), \bar{p}_t) - \mathrm{KL}(\tilde{p}(x_t), \tilde{g}(x^t))\right] = \mathbb{E}_{\tilde{y}^T}\left[\sum_{t=1}^{T}\ell_{\log}(\bar{p}_t, \tilde{y}_t) - \ell_{\log}(\tilde{g}(x^t), \tilde{y}_t)\right] \le \log|\mathcal{G}_\alpha|, \quad (17)$$
where the equality follows by the definition of KL-divergence and the law of total probability as in Lemma 3.3, and the inequality follows by (15). For any $p \in \mathcal{F}$, take $g \in \mathcal{G}_\alpha$ to be the function such that $\mathrm{TV}(p(x_t), g(x^t)) \le \alpha$ for all $t \in [T]$. By the data processing inequality for $f$-divergences (Polyanskiy & Wu, 2022, Theorem 7.4), we have $\mathrm{TV}(\tilde{p}(x_t), \tilde{g}(x^t)) \le \alpha$ for all $t \in [T]$. Now, the key observation is that $\tilde{g}(x^t)[y] \ge \frac{\eta}{M-1}$ for all $y \in \mathcal{Y}$, by the definition of $K_\eta$. By the relationship between KL-divergence and $\chi^2$-divergence (Lemma 3.4 (3)),
$$\mathrm{KL}(\tilde{p}(x_t), \tilde{g}(x^t)) \le \chi^2(\tilde{p}(x_t), \tilde{g}(x^t)) = \sum_{m=1}^{M}\frac{\big(\tilde{p}(x_t)[m] - \tilde{g}(x^t)[m]\big)^2}{\tilde{g}(x^t)[m]} \le \frac{M-1}{\eta}\left(\sum_{m=1}^{M}\big|\tilde{p}(x_t)[m] - \tilde{g}(x^t)[m]\big|\right)^2 \le \frac{4(M-1)\alpha^2}{\eta}, \quad (20)$$
where the last two inequalities follow by the fact that $\sum_m a_m^2 \le (\sum_m |a_m|)^2$ for $a_m \ge 0$, $\tilde{g}(x^t)[m] \ge \frac{\eta}{M-1}$, and $\mathrm{TV}(\tilde{p}(x_t), \tilde{g}(x^t)) \le \alpha$.
This implies that, conditioned on the event $A$,
$$\sup_{p\in\mathcal{F}}\mathbb{E}_{\tilde{y}^T}\left[\sum_{t=1}^{T}\mathrm{KL}(\tilde{p}(x_t), \bar{p}_t)\right] = \sup_{p\in\mathcal{F}}\mathbb{E}_{\tilde{y}^T}\left[\sum_{t=1}^{T}\big(\mathrm{KL}(\tilde{p}(x_t), \bar{p}_t) - \mathrm{KL}(\tilde{p}(x_t), \tilde{g}(x^t))\big) + \sum_{t=1}^{T}\mathrm{KL}(\tilde{p}(x_t), \tilde{g}(x^t))\right] \le \log|\mathcal{G}_\alpha| + \frac{4(M-1)\alpha^2T}{\eta},$$
where the inequality follows by (17) and (20). We now remove the conditioning on event $A$. Note that $\bar{p}_t^{\min} \ge \frac{\eta}{M-1}$, since $\bar{p}_t$ is a convex combination of the $\tilde{g}$'s. We thus have $\mathrm{KL}(\tilde{p}(x_t), \bar{p}_t) \le O\big(\log((M-1)/\eta)\big)$ by inequality 2 of Lemma 3.4. Therefore, the expected KL-risk contributed by the complement of $A$ (which has probability at most $\frac{1}{TM}$) is upper bounded by $O\big(\log((M-1)/\eta)/M\big)$, a constant independent of $T$. Therefore,
$$\sup_{p\in\mathcal{F},\,\nu^T\in\mathcal{P}}\mathbb{E}_{x^T\sim\nu^T,\,\tilde{y}^T}\left[\sum_{t=1}^{T}\mathrm{KL}(\tilde{p}(x_t), \bar{p}_t)\right] \le O\left(\log|\mathcal{G}_\alpha| + \frac{4(M-1)\alpha^2T}{\eta}\right).$$
This completes the proof of (14) and therefore the theorem. $\square$

G. Proof of Theorem 3.12 (Lower Bound)

We prove the lower bound of Theorem 3.12 using the so-called Fano's method. By Theorem D.1, there exists a set $V \subseteq \{0,1\}^{M-1}$ such that every vector in $V$ has exactly $\frac{M-1}{2}$ ones, any $v_1 \ne v_2 \in V$ differ in at least $\frac{M-1}{4}$ positions, and $|V| \ge \frac{1}{4(M-1)}e^{(M-1)/32}$. We assume here w.l.o.g. that $M-1$ is even (otherwise we can leave one of the coordinates zero). For any $v \in V$, we construct a distribution $q_v \in \Delta(\mathcal{Y})$ such that $q_v[M] = 1 - c_1\sqrt{M/T}$ and for all $m < M$,
$$q_v[m] = \begin{cases} 0, & \text{if } v[m] = 0, \\ \dfrac{2c_1\sqrt{M}}{(M-1)\sqrt{T}}, & \text{otherwise}, \end{cases}$$
where $c_1 > 0$ is a constant to be determined later. By Lemma C.1, for any prediction $\hat{p}_t$ and $v_1 \ne v_2 \in V$, we have by direct computation that
$$\max\{\mathrm{KL}(q_{v_1}, \hat{p}_t),\ \mathrm{KL}(q_{v_2}, \hat{p}_t)\} \ge c_2\sqrt{\frac{M}{T}}, \quad (21)$$
where $c_2 = c_1\log 2/2$ and we have used the fact that $v_1, v_2$ differ in at least $\frac{M-1}{4}$ positions. By Lemma 3.4 (3), we have
$$\mathrm{KL}(\tilde{y}^T_1, \tilde{y}^T_2) = T\,\mathrm{KL}\big(K_\eta(q_{v_1}), K_\eta(q_{v_2})\big) \le T\,\chi^2\big(K_\eta(q_{v_1}), K_\eta(q_{v_2})\big) \le 16\,c_1^2\,M,$$
where $\tilde{y}^T_1, \tilde{y}^T_2$ are the noisy samples from $q_{v_1}$ and $q_{v_2}$ under kernel $K_\eta$, respectively, and the last inequality follows by direct computation, noting that $\eta = \frac14$ and $K_\eta(q)_{\min} \ge \frac{\eta}{M-1}$ for all $q \in \Delta(\mathcal{Y})$. We now assume there exists a prediction rule $\Phi$ that achieves KL-risk $< \frac{c_2\sqrt{TM}}{2}$ with probability $\ge c_3$, for some constant $0 < c_3 < 1$ independent of $c_1$. Conditioning on such an event, we would be able to identify the true source from the noisy labels $\tilde{y}^T$ by selecting a $v \in V$ such that $q_v$ has minimal empirical KL-risk incurred by $\Phi$: the true source has empirical KL-risk $< \frac{c_2\sqrt{TM}}{2}$ by assumption, but any other source must have empirical KL-risk lower bounded by $\frac{c_2\sqrt{TM}}{2}$ due to (21). Here, for any $x^T, \tilde{y}^T$ and predictor $\Phi$, the empirical KL-risk against a source $q \in \Delta(\mathcal{Y})$ is defined as $\sum_{t=1}^{T}\mathrm{KL}\big(q, \Phi(x^t, \tilde{y}^{t-1})\big)$. However, this contradicts Fano's inequality (Polyanskiy & Wu, 2022, Theorem 31.3), which asserts that any identifier (of the true source) must have error probability lower bounded by
$$c_4 = 1 - \frac{16c_1^2M + \log 2}{\log|V|}.$$
By selecting $c_1, c_3$ small enough, one can ensure $1 - c_3 < c_4 < 1$ for sufficiently large $M$, since $\log|V| = \frac{M-1}{32} - O(\log M)$. This implies that any predictor $\Phi$ must incur expected KL-risk lower bounded by $(1 - c_3)\cdot\frac{c_2\sqrt{TM}}{2} = \Omega(\sqrt{TM})$. $\square$

H. Proof of Lemma 3.15

Our proof follows (Wu et al., 2022a, Theorem 6), with an extension to the multi-label case. To do so, we consider the multiclass one-inclusion graph predictor introduced in (Rubinstein et al., 2006), which maps $(\mathcal{X}\times[N])^*\times\mathcal{X} \to [N]$. Let $\Phi$ be the multiclass one-inclusion graph predictor and $\mathcal{H} \subseteq [N]^{\mathcal{X}}$ be any class. We define the quantity
$$\hat{M}_{\Phi,\mathcal{H}}(t) = \sup_{x^t\in\mathcal{X}^t}\sup_{h\in\mathcal{H}}\mathbb{E}_\sigma\left[\mathbb{1}\big\{\Phi\big(x^{\sigma(t-1)}, h(x^{\sigma(t-1)}), x_{\sigma(t)}\big) \ne h(x_{\sigma(t)})\big\}\right],$$
where $h(x^{\sigma(t-1)}) = \{h(x_{\sigma(1)}), \dots, h(x_{\sigma(t-1)})\}$ and $\sigma$ is a uniformly random permutation over $[t]$. By (Rubinstein et al., 2006, Thm 5.2), for any class $\mathcal{H}$ with pseudo-dimension $d$ and the multiclass one-inclusion predictor $\Phi$, we have $\hat{M}_{\Phi,\mathcal{H}}(t) \le \frac{d}{t}$.
Since the multiclass one-inclusion predictor is permutation-invariant, by (Wu et al., 2022a, Lemma 7), for any $x^T \in \mathcal{X}^T$ and $\delta > 0$ the following holds with probability $\ge 1-\delta$ over the uniformly random permutation $\sigma$ of $[T]$:
$$\sum_{t=1}^{T}\mathbb{1}\big\{\Phi\big(x^{\sigma(t)}, h(x^{\sigma(t-1)})\big) \ne h(x_{\sigma(t)})\big\} \le c\big(d\log T + \log(1/\delta)\big) \quad (22)$$
for some constant $c$. For any $x^T \in \mathcal{X}^T$, by the generalized Sauer's lemma (Haussler & Long, 1995, Corollary 3), the number of functions of $\mathcal{H}$ restricted to $x^T$ is upper bounded by $(TN)^d$. Taking $\delta := \frac{\delta}{(TN)^d}$ in (22) and a union bound over these restrictions, we arrive at: with probability $\ge 1-\delta$, for all $h \in \mathcal{H}$,
$$\sum_{t=1}^{T}\mathbb{1}\big\{\Phi\big(x^{\sigma(t)}, h(x^{\sigma(t-1)})\big) \ne h(x_{\sigma(t)})\big\} \le c\big(d\log(T^2N) + \log(1/\delta)\big).$$
By the symmetry of i.i.d. distributions, this implies that for any distribution $\mu$ over $\mathcal{X}$,
$$\Pr_{x^T\overset{\text{i.i.d.}}{\sim}\mu}\left[\exists h\in\mathcal{H}:\ \sum_{t=1}^{T}\mathbb{1}\big\{\Phi\big(x^t, h(x^{t-1})\big) \ne h(x_t)\big\} > c\big(d\log(T^2N) + \log(1/\delta)\big)\right] \le \delta. \quad (23)$$

We now exploit (23) to construct a stochastic sequential cover $\mathcal{G} \subseteq [N]^{\mathcal{X}^*}$ of $\mathcal{H}$ w.r.t. IID at scale 0 and confidence $\delta$. The construction goes along a similar path as (Daniely et al., 2015, Theorem 25). Let $\Phi$ be the multiclass one-inclusion predictor. For any $I \subseteq [T]$ with $|I| \le c\big(d\log(T^2N) + \log(1/\delta)\big)$ and $K = \{k_i\}_{i\in I} \in [N]^{|I|}$, we define a sequential function $g_{I,K}$ recursively in the following way: for any $t \notin I$, we set
$$g_{I,K}(x^t) = \Phi\big(x^t, \{g_{I,K}(x^1), g_{I,K}(x^2), \dots, g_{I,K}(x^{t-1})\}\big),$$
and otherwise (for $t \in I$) we set $g_{I,K}(x^t) = k_t$. It is easy to verify that the class $\mathcal{G}$ consisting of all such $g_{I,K}$ is the desired covering set, and
$$\log|\mathcal{G}| \le \log\left[\sum_{b\,\le\, c(d\log(T^2N)+\log(1/\delta))}\binom{T}{b}N^b\right] \le O\big(d\log^2(TN) + \log(TN)\log(1/\delta)\big).$$
We now use the covering set $\mathcal{G}$ of $\mathcal{H}$ to construct a stochastic sequential covering set $\mathcal{G}'$ of $\mathcal{F}$ w.r.t. IID at scale $\frac{1}{TM}$ and confidence $\delta$. To do so, let $Q$ be a $\frac{1}{M^2T}$-cover of $\Delta(\mathcal{Y})$ under the $L_\infty$ norm, where $|Q| \le (TM^2)^{M-1}$ (this implies a $\frac{1}{TM}$-cover under total variation). Now, for any $N$-tuple $q_1, \dots, q_N \in Q$ and $g \in \mathcal{G}$, we construct a function $g'_{\mathbf{q},g}(x^t) = q_{g(x^t)}$ for all $x^t \in \mathcal{X}^*$. Let $\mathcal{G}'$ be the class of all such functions $g'_{\mathbf{q},g}(\cdot)$. Then $\mathcal{G}'$ forms a stochastic sequential $\frac{1}{TM}$-cover of $\mathcal{F}$ at confidence $\delta$ w.r.t. IID in the sense of Definition 3.9, and
$$\log|\mathcal{G}'| \le O\big(d\log^2(TN) + \log(TN)\log(1/\delta) + N(M-1)\log(TM)\big),$$
which completes the proof of Lemma 3.15. $\square$