# Learning Distributions over Quantum Measurement Outcomes

Weiyuan Gong¹  Scott Aaronson²

Abstract

Shadow tomography for quantum states provides a sample-efficient approach for predicting the measurement outcomes of quantum systems. However, these shadow tomography procedures yield poor bounds if there are more than two outcomes per measurement. In this paper, we consider a general problem of learning properties from quantum states: given an unknown d-dimensional quantum state ρ and M unknown quantum measurements M_1, ..., M_M with K ≥ 2 outcomes, estimate the probability distribution of applying M_i to ρ to within total variation distance ϵ. Compared to the special case K = 2, we have to learn unknown distributions instead of values. Here, we propose an online shadow tomography procedure that solves this problem with high success probability using O(K log^2 M log d / ϵ^4) copies of ρ. We further prove an information-theoretic lower bound showing that at least Ω(min{d^2, K + log M}/ϵ^2) copies of ρ are required to solve this problem with high success probability. Our shadow tomography procedure requires sample complexity with only logarithmic dependence on M and d and is sample-optimal with respect to the dependence on K.

1. Introduction

The statistical learning theory problem of extracting information from empirical observations is of fundamental importance in a number of fields. In quantum physics, a central problem is to obtain the properties of a quantum system from the statistical results of quantum measurements. A general method to obtain the full information of an unknown d-dimensional quantum state ρ, called quantum state tomography, completely recovers the density matrix to within a small error. This task is proved to require

¹IIIS, Tsinghua University ²Department of Computer Science, University of Texas at Austin. Correspondence to: Weiyuan Gong.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Ω(d^2) copies of ρ from the information-theoretic perspective (Haah et al., 2017; O'Donnell & Wright, 2016; Chen et al., 2022b). These state tomography procedures have been pushed to the limit of their capabilities by recent advances in experimental quantum platforms (Preskill, 2018), recalling that the dimension of the quantum state d = 2^n increases exponentially in the number of qubits n. However, demanding full descriptions of quantum states may be excessive for concrete quantum problems. Following this conceptually different line of research, the quantum shadow tomography problem developed by (Aaronson, 2019; 2007) considers the case where we are given an unknown d-dimensional quantum state and M known quantum events, each regarded as a two-outcome quantum measurement that outputs 1 (or "accept") with probability Tr(E_i ρ) and outputs 0 (or "reject") otherwise. The goal is to estimate each expectation E_ρ[E_i] = Tr(E_i ρ) to within additive error ϵ. This shadow tomography problem for two-outcome POVMs is a quantum analog of classical adaptive data analysis (Dwork et al., 2015; 2010), which can be solved with poly(log M, log d, 1/ϵ) samples (Bassily et al., 2021). Recently, (Bădescu & O'Donnell, 2021) proved the best known upper bound on the sample complexity for this problem, N = O(log^2 M log d / ϵ^4).

In quantum mechanics, the prediction of some intriguing properties requires quantum measurements with K > 2 measurement outcomes. In addition, in practical tasks, many quantum properties are represented by the expectation values of quantum observables, which are obtained by computing the probability distribution over the measurement outcomes and averaging the eigenvalues corresponding to the outcomes.
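The averaging recipe just described can be sketched numerically: measuring in the eigenbasis of an observable O and averaging its eigenvalues against the outcome distribution reproduces Tr(Oρ). The state and observable below are arbitrary illustrative choices.

```python
import numpy as np

# An observable's expectation from the outcome distribution of its
# eigenbasis measurement: E_rho[O] = sum_j lambda_j * Tr(E_j rho).
rho = np.array([[0.6, 0.1], [0.1, 0.4]])   # an illustrative quantum state
O = np.array([[1.0, 0.5], [0.5, -1.0]])    # an illustrative observable

lam, V = np.linalg.eigh(O)                 # eigenvalues and eigenvectors of O
# Projectors onto the eigenvectors form a POVM (here a projective measurement).
events = [np.outer(V[:, j], V[:, j].conj()) for j in range(len(lam))]
p = np.array([np.trace(E @ rho).real for E in events])   # outcome distribution

# Averaging eigenvalues over the outcome distribution recovers Tr(O @ rho).
assert np.isclose(lam @ p, np.trace(O @ rho).real)
```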
In this case, the measurement M outputs result j ∈ {1, ..., K} with probability E_ρ[E_j] = Tr(E_j ρ); these are expectations of quantum events E_1, ..., E_K that satisfy Σ_{j=1}^K E_j = I. Our goal is to approximate the probability distribution over the outcomes of M to within total variation distance ϵ. Recalling that K can be as large as Θ(d) in the extreme case, it is an important parameter in practical shadow tomography. In this paper, we study the shadow tomography problem for K-outcome quantum measurements, which can be formulated as follows.

Problem 1.1 (Shadow Tomography of K-outcome Measurements). We consider an unknown d-dimensional quantum state ρ, as well as M quantum measurements M_1, ..., M_M, each of which has K results and outputs the j-th result with probability Tr(E_{i,j} ρ) for i ∈ [M] and j ∈ [K]. We denote by p_i the probability distribution (Tr(E_{i,1} ρ), ..., Tr(E_{i,K} ρ)) of measurement M_i. Our goal is to output M probability distributions b_1, ..., b_M on the K outcomes such that the total variation distance satisfies d_TV(p_i, b_i) ≤ ϵ for all i ∈ [M], with success probability at least 1 − δ. The quantum events E_{i,j} for j ∈ [K] corresponding to the quantum measurement M_i satisfy the constraints 0 ⪯ E_{i,j} ⪯ I and Σ_{j=1}^K E_{i,j} = I.

We remark that Problem 1.1 can also be posed in other metrics, such as the Euclidean norm or the infinity norm, yielding a sample complexity without K dependence. We choose the total variation distance because it has a direct connection with predicting expectation values of observables: by bounding the total variation distance, the error in the expectation value of any observable O can be bounded with poly(‖O‖) overhead.

1.1. Main Results

The first main result of this paper is an algorithm that solves this shadow tomography problem for K-outcome measurements. We prove the following sample-complexity upper bound for our algorithm.

Theorem 1.2 (Shadow Tomography of K-outcome Measurements).
Problem 1.1 is solvable using

N = O((log(1/δ)/ϵ^4) · K log^2 M log d)

copies of ρ. Here, the O hides a poly(log log M, log log d, log(1/ϵ), log K) factor. The procedure is fully explicit and online.

We provide an overview of the proof of this theorem in Section 1.2. The detailed proof is technically involved and is provided in Section 3 and Section 4. Theorem 1.2 indicates that we can learn the probability distributions of M quantum measurements with K outcomes using a number of samples that depends logarithmically on M and d but linearly on K. With respect to the parameters M, d, and ϵ, our algorithm has the same dependence as the best known upper bound for the 2-outcome case (Bădescu & O'Donnell, 2021). The dependence on K is the most important result in this work. Compared to directly regarding each quantum event E_{i,j} as a two-outcome quantum measurement and approximating the expectation Tr(E_{i,j} ρ) to within additive error 2ϵ/K, our algorithm reduces the dependence on K from O(K^4) to O(K). Noticing that in some extreme cases K can be as large as Θ(d), which is exponential in the system size n, our algorithm significantly reduces the number of copies required to perform the shadow tomography task. Although the complexity of our algorithm still depends linearly on K, we emphasize that this dependence is necessary, by the following information-theoretic lower bound for Problem 1.1:

Theorem 1.3. Any strategy for Problem 1.1, i.e., for estimating all p_i = (Tr(E_{i,1} ρ), ..., Tr(E_{i,K} ρ)) of M_i to within total variation distance ϵ for all i ∈ [M] with success probability at least (say) 2/3, requires at least

N = Ω(min{d^2, K + log M}/ϵ^2)

copies of the unknown d-dimensional quantum state ρ.

We provide a sketch of the proof of this lower bound in Section 1.2 and leave the detailed proof to Section 5. The lower bound is obtained by an information-theoretic argument developed by (Flammia et al., 2012) and refined by later works (Aaronson, 2019; Huang et al., 2020; Haah et al., 2017).
The proof exploits Holevo's theorem (Holevo, 1973) and Fano's inequality (Fano, 1949). Even in the special case where there is only M = 1 entirely classical measurement and the unknown quantum state is also classical, learning a distribution on [K] to within total variation distance ϵ still requires Θ(K/ϵ^2) samples (Canonne, 2020). This result can be understood as saying that the information required to approximate a probability distribution scales linearly with the dimension of the distribution. Comparing the lower bound in Theorem 1.3 with the upper bound in Theorem 1.2, we conclude that our algorithm for shadow tomography of K-outcome quantum measurements is optimal with respect to the dependence on K.

1.2. Technical Overview

Our shadow tomography procedure combines two ideas: solving a quantum distribution threshold search problem using O(K log^2 M/ϵ^2) samples in each iteration, and performing an online learning procedure that has at most O(log d/ϵ^2) iterations. The first step in this work concerns a problem which we call the quantum distribution threshold search problem. We formulate it as follows:

Problem 1.4 (Quantum Distribution Threshold Search). Suppose we are given:

- Parameters 0 < ϵ, δ < 1.
- Unentangled copies of an unknown d-dimensional quantum state ρ.
- A list of M d-dimensional POVMs M_1, ..., M_M, each with K outcomes corresponding to quantum events E_{i,j}, where i ∈ [M], j ∈ [K], and Σ_{j=1}^K E_{i,j} = I. We denote by p_i = (Tr(E_{i,1} ρ), ..., Tr(E_{i,K} ρ)) the actual distribution over the measurement outcomes of M_i.
- A list of M threshold vectors θ_i = (θ_{i,1}, ..., θ_{i,K}), where θ_{i,j} ∈ [0, 1] and Σ_{j=1}^K θ_{i,j} = 1.

The algorithm outputs either:

- d_TV(p_{i*}, θ_{i*}) > 3ϵ/4 for some particular i*; or
- d_TV(p_i, θ_i) ≤ ϵ for every i.

Our goal is to minimize the number of copies required to ensure that we output correctly with success probability at least 1 − δ.
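For intuition, here is a purely classical stand-in for the output contract of Problem 1.4, with sampling access to each p_i instead of copies of ρ. The function names, the naive per-measurement estimation, and the 7ϵ/8 decision threshold are our illustrative choices, not the quantum procedure of Theorem 1.5.

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    # total variation distance between two distributions
    return 0.5 * np.sum(np.abs(p - q))

def classical_threshold_search(ps, thetas, eps, n_samples):
    """Classical stand-in: estimate each p_i from samples, then either
    return an index i* flagged as d_TV(p_i*, theta_i*) > 3*eps/4,
    or return None, declaring every d_TV(p_i, theta_i) <= eps."""
    for i, (p, theta) in enumerate(zip(ps, thetas)):
        draws = rng.choice(len(p), size=n_samples, p=p)
        emp = np.bincount(draws, minlength=len(p)) / n_samples
        if tv(emp, theta) > 7 * eps / 8:   # midpoint between 3*eps/4 and eps
            return i
    return None

K, eps = 4, 0.2
theta = np.full(K, 1 / K)                    # threshold vector: uniform
p_good = np.array([0.26, 0.24, 0.25, 0.25])  # well within eps of theta
p_bad = np.array([0.55, 0.15, 0.15, 0.15])   # far from theta
assert classical_threshold_search([p_good], [theta], eps, 4000) is None
assert classical_threshold_search([p_good, p_bad], [theta, theta], eps, 4000) == 1
```

The quantum problem is harder than this sketch suggests: measurements disturb the state, so the quantum procedure cannot simply re-estimate every p_i from fresh samples without paying for all M measurements.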
A similar problem for the case K = 2 was originally called a gentle search procedure in (Aaronson, 2019). It was later renamed the quantum threshold search problem in (Bădescu & O'Donnell, 2021), since the gentle measurement assumption (Aaronson & Rothblum, 2019) is not necessary. It is proven that the quantum threshold search problem can be solved using O(log^2 M/ϵ^2) copies of ρ with probability at least (say) 3/4 (Bădescu & O'Donnell, 2021). Yet it is worth mentioning that Problem 1.4 is not a direct extension of the quantum threshold search problem: even at K = 2, Problem 1.4 requires a two-sided bound instead of the one-sided bound in the quantum threshold search problem. In this paper, we provide an algorithm that solves Problem 1.4 for any K ≥ 2:

Theorem 1.5. Problem 1.4 (Quantum Distribution Threshold Search) is solvable using

N = O((log(1/δ)/ϵ^2) · K log^2 M)

copies of ρ.

We provide the proof of this theorem in Section 3. When K = 2, our upper bound for the quantum distribution threshold search problem reduces to the known bound for the quantum threshold search problem. We remark that the K dependence in the sample complexity bound of Theorem 1.2 comes directly from the K dependence in solving the quantum distribution threshold search problem. Given our quantum distribution threshold search algorithm, the second step is a black-box reduction to an online quantum state learning algorithm. In the special case K = 2, the corresponding bound was obtained by (Aaronson et al., 2018). The formal version of our result on online learning of distributions is as follows:

Theorem 1.6. Let ρ be an unknown d-dimensional quantum state, and let M_1, M_2, ..., M_t, ... be a sequence of K-outcome POVMs, each consisting of quantum events E_{t,j} for j ∈ [K]. We denote by p_t = (Tr(E_{t,1} ρ), ..., Tr(E_{t,K} ρ)) the actual probability distribution when we apply M_t to ρ. We are provided with a probability distribution b_t after each measurement M_t with d_TV(p_t, b_t) ≤ ϵ/4.
There exists a strategy for outputting hypothesis states ω_1, ω_2, ... such that the probability distribution µ_t, obtained by applying M_t to ω_t, deviates from p_t by more than 3ϵ/4 for at most T = O(log d/ϵ^2) iterations t (also called "bad" iterations).

We prove Theorem 1.6 in Section 4.1 by adapting the template of the Regularized Follow-the-Leader algorithm (RFTL; see, for example, (Hazan et al., 2016)). However, our online learning procedure is not a direct extension of the standard template, as we have to modify the loss function to measure the total variation distance between two distributions instead of the ℓ1 loss between two values as in (Aaronson et al., 2018; Chen et al., 2022c). We exploit the properties of the POVM to provide an upper bound on the total regret of the learning procedure. We can then combine our quantum distribution threshold search algorithm with this online setting to prove the sample complexity of our shadow tomography procedure for K-outcome quantum measurements. We start with the maximally mixed state I/d. In each iteration, we first run the quantum distribution threshold search algorithm to find an i such that the total variation distance between the predicted and the actual outcome distribution of M_i is larger than 3ϵ/4. We can then use O(K/ϵ^2) samples to estimate b_t with high success probability and update the hypothesis. As there are at most O(log d/ϵ^2) "bad" iterations, we arrive at the complexity bound of Theorem 1.2. The overall computational complexity of our shadow tomography protocol is estimated to be O(KM poly(K, 1/ϵ, log d)) + d^{O(1)}. We further show the extension of Theorem 1.2 to the case of predicting quantum observables in Appendix D.

To prove the lower bound, we first fix M different quantum measurements.
We then find a set (known as a packing net (Haah et al., 2017)) of size 2^{K/2}·M consisting of mixed states {ρ_1, ..., ρ_N} such that our shadow tomography procedure can distinguish between any pair of states chosen from this packing net, which requires log(2^{K/2}·M) = Θ(K + log M) bits of information. We further show, using Holevo's theorem (Nielsen & Chuang, 2002), that we can obtain at most O(ϵ^2) bits of information from each copy of a quantum state chosen from this set. Therefore, the sample complexity is bounded below by Ω((K + log M)/ϵ^2). We emphasize that, different from previous lower bounds (Aaronson, 2019; Aaronson & Rothblum, 2019; Huang et al., 2020), which either adapt a classical bound (Ullman et al., 2018; Bun et al., 2018; Vadhan, 2017) or consider a set containing M quantum states corresponding to the M measurements (observables), our construction of the set exploits a coding parameter z for the states to introduce a K dependence.

1.3. Related Works

Shadow tomography of two-outcome quantum measurements. A first related topic is the shadow tomography of two-outcome quantum measurements (Aaronson, 2019; Bădescu & O'Donnell, 2021). It is proven that only O(log^2 M log d/ϵ^4) copies of ρ suffice to estimate the expectation values of M two-outcome quantum measurements. When K = 2, the sample complexity in Theorem 1.2 reduces to this bound. To extend this result to the case K > 2, a straightforward approach is to regard each quantum event as a two-outcome quantum measurement and estimate each expectation value Tr(E_{i,j} ρ) to within additive error 2ϵ/K. However, this approach incurs an additional cost of O(K^4) using the state-of-the-art shadow tomography algorithms. Compared to this direct extension, our approach requires sample complexity that increases only linearly with K and is proven optimal with respect to the dependence on K.
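The gap between the naive reduction and Theorem 1.2 can be made concrete. The accounting below is purely illustrative: constants and logarithmic factors are dropped, and only the ϵ- and K-dependence of the two approaches is compared.

```python
# Naive reduction: treat all M*K events as two-outcome measurements and
# demand accuracy 2*eps/K per event; the 1/eps^4 term of two-outcome
# shadow tomography then becomes (K/(2*eps))^4, an extra O(K^4) factor.
def naive_cost(K, eps):
    return (K / (2 * eps)) ** 4   # ~ K^4 / eps^4, log factors dropped

def our_cost(K, eps):
    return K / eps ** 4           # Theorem 1.2: linear in K

K, eps = 16, 0.1
ratio = naive_cost(K, eps) / our_cost(K, eps)
# the naive approach overpays by a factor of order K^3 (here K^3 / 16):
assert abs(ratio - K ** 3 / 16) < 1e-6
```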
Quantum observable estimation using classical shadows. (Huang et al., 2020) and (Chen et al., 2022a) considered the task of estimating quantum functions (the expectations of quantum operators). Given an unknown d-dimensional state ρ, as well as M quantum operators O_1, ..., O_M, they provided a strategy that approximates the expectation value Tr(O_i ρ) of each operator to within additive error ϵ with high success probability using O(log M · 2^{O(k)}/ϵ^2) copies of ρ, where k is the locality of the observables. Their protocol requires neither quantum memory nor joint measurements that simultaneously measure states of the form ρ^{⊗k}. However, the sample complexity may increase exponentially with the system size n = log d. Our algorithm, in contrast, provides sample-efficient shadow tomography whenever the number of measurement outcomes scales polynomially with the system size n, regardless of whether the measurement is global.

Quantum state tomography. It is proven that there exists a sample-optimal algorithm that performs state tomography for an unknown quantum state ρ of rank r ≤ d using O(dr/ϵ^2) copies of ρ (Haah et al., 2017). Although the shadow tomography procedure in this paper does not require full information about ρ, the information obtained by this procedure increases linearly with K. Consider the extreme case of a quantum measurement in the computational basis, i.e., with d possible outcomes corresponding to all n-bit classical strings x ∈ {0, 1}^n. To perform shadow tomography on this measurement, we require Ω(d/ϵ^2) copies of ρ. By performing this measurement, we can obtain a full description of any pure state. This bound is the same as the sample complexity required for state tomography of pure states.

1.4. Broader impact

This work focuses on the theory of quantum shadow tomography based on online learning and, as far as we can see, we do not anticipate any potential negative societal impact.
Nevertheless, it might have a positive impact on researchers interested in understanding the theoretical underpinnings of online learning applications and statistical learning.

2. Preliminaries

2.1. Classical Probability Theory

We consider two probability distributions D = (p_x)_x and D' = (q_x)_x on a K-dimensional space, and we will use the following two distance measures between them. The total variation distance between D and D' is defined by

d_TV(D, D') = (1/2) Σ_x |p_x − q_x|.

We also consider another distance measure that is commonly used for vectors. The Euclidean norm of the difference between the two distributions is defined by

‖D − D'‖_2 = (Σ_x (p_x − q_x)^2)^{1/2}.

The Euclidean norm is not commonly used in probability theory; we employ it as an intermediate tool when applying concentration inequalities to random vectors. To connect these norms, we note that for any probability distributions D and D', the following inequality holds:

d_TV(D, D') ≤ (√K/2) ‖D − D'‖_2.   (1)

2.2. Quantum Preliminaries

Here we briefly review some basic notations and concepts in quantum information. More details can be found, for example, in (Nielsen & Chuang, 2002). A matrix A ∈ C^{d×d} is said to be Hermitian if A† = A, where A† denotes the conjugate transpose of A. We write A ⪰ 0 when the Hermitian operator A is positive semidefinite, and A ⪰ B when A − B ⪰ 0. We use I for the identity matrix; its dimension can be understood from the context. In quantum mechanics, a d-dimensional quantum state vector is described by a unit vector ψ = (ψ_1, ..., ψ_d), denoted |ψ⟩ in Dirac notation, in a complex Hilbert space C^d. The computational basis of C^d is defined as {|i⟩}_{i=1}^d, where |i⟩ = (0, ..., 0, 1, 0, ..., 0) is the vector whose i-th entry is 1 and whose other entries are 0. A d-dimensional general quantum state can be written as a matrix ρ ∈ C^{d×d} with ρ ⪰ 0 and Tr(ρ) = 1.
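Inequality (1) above follows from the Cauchy–Schwarz inequality applied to the vector |p − q|. A quick numerical spot-check on random distribution pairs (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

def tv(p, q):
    # total variation distance between two distributions
    return 0.5 * np.sum(np.abs(p - q))

# Check d_TV(D, D') <= (sqrt(K)/2) * ||D - D'||_2 on random pairs of
# K-dimensional distributions drawn from a Dirichlet prior.
for K in (2, 8, 64):
    p = rng.dirichlet(np.ones(K))
    q = rng.dirichlet(np.ones(K))
    assert tv(p, q) <= (np.sqrt(K) / 2) * np.linalg.norm(p - q) + 1e-12
```

At K = 2 the bound is tight whenever the two entries of p − q have equal magnitude, which is always the case for distributions.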
If ρ has rank 1, it is a pure state and can be written as an outer product |ψ⟩⟨ψ| of a complex vector |ψ⟩. Equivalently, we can write ρ as a convex combination of outer products of pure states (without loss of generality, at most d orthogonal pure states):

ρ = Σ_{i=1}^d p_i |ψ_i⟩⟨ψ_i|,

where Σ_{i=1}^d p_i = 1 and p_i ≥ 0 for every i ∈ [d]. This representation can be interpreted as a probability distribution over the pure states |ψ_i⟩. In the special case when ρ is diagonal, it represents a classical probability distribution over the orthogonal computational basis states |1⟩, ..., |d⟩. The maximally mixed state I/d corresponds to the uniform distribution over |1⟩⟨1|, ..., |d⟩⟨d|.

A quantum observable, or quantum operator, is a d-dimensional Hermitian matrix O ∈ C^{d×d}; it represents a real-valued property of the physical system. Given a quantum state ρ, the expectation of O with respect to ρ is defined by E_ρ[O] = Tr(Oρ). A quantum event is a quantum operator satisfying 0 ⪯ E ⪯ I, i.e., a Hermitian operator with eigenvalues in [0, 1]. The expectation value E_ρ[E] of a quantum event can be interpreted as the probability that the state ρ assigns to E. We further call E a projector in the special case when E^2 = E, i.e., when all eigenvalues of E are Boolean values 0 and 1. A quantum measurement M, or positive operator-valued measure (POVM), is a sequence M = (E_1, ..., E_K) of quantum events with Σ_{j=1}^K E_j = I. By the linearity of the trace, we obtain

Σ_{j=1}^K E_ρ[E_j] = E_ρ[Σ_{j=1}^K E_j] = E_ρ[I] = 1.

Given a quantum state ρ, a POVM thus determines a probability distribution D = {p_j}_j on [K] defined by p_j = E_ρ[E_j] = Tr(E_j ρ).

2.3. Online Learning Settings and Regret

In the context of online learning of quantum states considered in Theorem 1.6, we are given a quantum measurement M_t in each iteration t, from a sequence M_1, M_2, .... The learner constructs a hypothesis state ω_t ∈ C^{d×d} in each iteration.
Given the quantum measurement M_t, the learner calculates the distribution obtained by applying M_t to the hypothesis state ω_t, namely µ_t = (Tr(E_{t,1} ω_t), ..., Tr(E_{t,K} ω_t)), which is known as a "prediction". The learner then obtains feedback from the measurement M_t. The simplest feedback would be a random variable Y_t taking values in [K] = {1, ..., K} for the different outcomes. In this paper, the learner obtains feedback by performing a quantum distribution threshold search to determine whether d_TV(µ_t, p_t) is larger than some tolerance threshold, where p_t = (Tr(E_{t,1} ρ), ..., Tr(E_{t,K} ρ)) is the actual probability distribution for the unknown state ρ. If the quantum distribution threshold search procedure does not output a t such that d_TV(µ_t, p_t) > 3ϵ/4, the learner accepts the prediction and sets it as the final result. If the quantum distribution threshold search procedure outputs such a t, the learner starts an update procedure. The learner first estimates a probability distribution b_t. According to Eq. (7), the learner can guarantee that d_TV(b_t, p_t) ≤ ϵ/4 with high probability using O(K/ϵ^2) copies of ρ. Then the learner defines a loss function that measures the total variation distance between the bad prediction µ_t and b_t as

ℓ_t(µ_t) := (1/2) Σ_{j=1}^K |Tr(E_{t,j} ω_t) − b_{t,j}|,   (2)

where b_{t,j} denotes the j-th entry of b_t. The learner updates the hypothesis ω_t → ω_{t+1} based on the loss functions, measurements, and feedback received before the current iteration. Our goal is to design a strategy such that the learner's total loss is minimized. Supposing there are T iterations in total, we want a strategy such that the learner's total loss is not much more than that of the strategy which outputs the same quantum hypothesis φ in each iteration, where φ minimizes the total loss with perfect hindsight.
Formally, we define the regret R_T as the difference between the total losses of these two strategies:

R_T = Σ_{t=1}^T ℓ_t(µ_t) − min_{φ∈H} Σ_{t=1}^T ℓ_t(µ_φ),   (3)

where µ_φ = (Tr(E_{t,1} φ), ..., Tr(E_{t,K} φ)) is the probability distribution after applying M_t to φ, and H denotes the set of all d-dimensional quantum states. We remark that the sequence of measurements M_t can be arbitrary, even adversarial, based on the learner's prior actions.

3. Quantum Distribution Threshold Search

In this section, we prove Theorem 1.5. In Section 4, we will use this procedure as feedback in the online learning procedure of our shadow tomography algorithm for K-outcome POVMs. Our starting point is the following expectation estimation lemma.

Lemma 3.1. Let ρ be an unknown d-dimensional state and M a K-outcome POVM. The probability distribution over the outcomes of applying M to ρ is p = (Tr(E_1 ρ), ..., Tr(E_K ρ)). Choose parameters 0 < ϵ, δ < 1/2. Then there exists N = O(K log(1/δ)/ϵ^2) such that, for any d-dimensional quantum state ρ,

Pr[d_TV(p, p') ≥ ϵ] ≤ δ,

where p' = (p'_1, ..., p'_K) is the empirical distribution obtained by applying M to the joint state ρ^{⊗N}. Moreover, there exists a quantum event B such that, for any K-dimensional distribution τ,

d_TV(p, τ) > ϵ implies E_{ρ^{⊗N}}[B] > 1 − δ, and
d_TV(p, τ) ≤ 3ϵ/4 implies E_{ρ^{⊗N}}[B] ≤ δ.

We provide the proof of Lemma 3.1 in Appendix B. Moreover, we can observe that if the E_i's are projectors, then the A_k's are also projectors; we can prove that B is a summation of the A_k's and is thus also a projector. Using this lemma, we reduce the shadow tomography procedure for a K-outcome POVM to two-outcome ones. We now begin the proof of Theorem 1.5. Noticing that each E_{i,j} is assumed to be a quantum event in Problem 1.4 while Lemma A.1 assumes each E_{i,j} is a projector, we first have to reduce the theorem to the case of projectors. Let ρ ∈ C^{d×d} be the unknown quantum state, and let E_{i,j} be the quantum events for i ∈ [M] and j ∈ [K].
We can achieve this through Naimark's theorem (see, for example, (Riesz & Nagy, 2012; Akhiezer & Glazman, 2013)). This theorem shows that a quantum event E ∈ C^{d×d} can be lifted to a projector Π on the space C^{2d×2d} such that, for arbitrary ρ, E_{ρ⊗|0⟩⟨0|}[Π] = E_ρ[E]. Therefore, we assume that the E_{i,j} are projectors in the following proofs. Suppose we are given M K-outcome POVMs M_1, ..., M_M and M threshold vectors θ_1, ..., θ_M. We first apply Lemma 3.1 with parameters δ = 1/4 and τ = θ_i for each measurement M_i. Hence we can find some N_0 = O(K/ϵ^2) such that each measurement M_i can be replaced by a quantum event B_i ∈ (C^{d×d})^{⊗N_0} satisfying:

- if d_TV(p_i, θ_i) > ϵ, then E_{ρ^{⊗N_0}}[B_i] > 3/4;
- if d_TV(p_i, θ_i) ≤ 3ϵ/4, then E_{ρ^{⊗N_0}}[B_i] ≤ 1/4.

Here p_i is the actual distribution after applying M_i to ρ. Since the E_{i,j}'s are projectors, the quantum events B_i are also projectors.

Algorithm 1 RFTL for Quantum Tomography of K-outcome POVMs
1: Input: T, η < 1/2.
2: Set ω_1 := I/d.
3: for t = 1, ..., T do
4:   Predict ω_t. Consider the loss function ℓ_t : R^{K−1} → R given by measurement M_t: ℓ_t(Tr(E_{t,1} φ), ..., Tr(E_{t,K−1} φ)), which takes the same value as the loss function defined in Eq. (2). Let ∂ℓ_t/∂x_j be a sub-derivative of ℓ_t with respect to x_j for j ∈ [K − 1]. Define

∇_t := Σ_{j=1}^{K−1} (∂ℓ_t/∂x_j)(Tr(E_{t,1} ω_t), ..., Tr(E_{t,K−1} ω_t)) E_{t,j}.   (4)

5:   Update the decision according to the RFTL rule with the von Neumann entropy:

ω_{t+1} := arg min_{φ∈H} { η Σ_{s=1}^t Tr(∇_s φ) + Σ_{i=1}^d λ_i(φ) log λ_i(φ) },   (5)

where λ_i(A) denotes the i-th eigenvalue of a Hermitian matrix A ∈ C^{d×d}, and H ⊂ C^{d×d} denotes the set of all d-dimensional quantum states.
6: end for

We then apply Lemma A.1, setting each B_i to be the projector we have just constructed and the unknown state to be ρ' = ρ^{⊗N_0}. If the algorithm outputs i* such that E_{ρ'}[B_{i*}] > 1/4, we have d_TV(p_{i*}, θ_{i*}) > 3ϵ/4. Otherwise, we can guarantee that d_TV(p_i, θ_i) ≤ ϵ for all i ∈ [M] with high probability.

4. Shadow Tomography of K-outcome POVMs

In this section, we first prove Theorem 1.6 in Section 4.1. We then prove the upper bound in Theorem 1.2.

4.1. Online Learning of Quantum States

We suppose there are T iterations in total in which the learner performs an update procedure. In the update procedure, the learner follows the template of the Regularized Follow-the-Leader (RFTL) algorithm, given as Algorithm 1. Algorithm 1 employs the von Neumann entropy, which relates it to the matrix exponentiated gradient algorithm (Tsuda et al., 2005). We remark that the loss function used in Eq. (4) of the RFTL algorithm is slightly different from the definition in Eq. (2), in that it takes a vector of K − 1 entries instead of K entries. This is because the input vectors in Eq. (2) are supposed to be probability distributions, whose entries sum to 1; therefore there are only K − 1 free parameters. We rewrite the loss function with an input vector containing only the free entries as in Eq. (4). According to the definition of regret in Eq. (3), we now provide the following regret bound for this RFTL algorithm.

Lemma 4.1. Setting η = √(log d/(8T)), the regret R_T of Algorithm 1 is bounded by 4√((2 log 2) T log d).

The proof of Lemma 4.1 is provided in Appendix C. We now prove Theorem 1.6. Recall that the update procedure is triggered in iteration t when the prediction µ_t = (Tr(E_{t,1} ω_t), ..., Tr(E_{t,K} ω_t)) deviates from the actual probability distribution p_t = (Tr(E_{t,1} ρ), ..., Tr(E_{t,K} ρ)) by more than 3ϵ/4, i.e., d_TV(µ_t, p_t) > 3ϵ/4. As the provided distribution b_t satisfies d_TV(b_t, p_t) ≤ ϵ/4, the loss ℓ_t(µ_t) is at least ϵ/2 by the triangle inequality. On the other hand, the comparator that plays the true state ρ in every iteration incurs loss d_TV(p_t, b_t) ≤ ϵ/4 in each iteration. By the regret bound, we have

Tϵ/4 ≤ R_T ≤ 4√((2 log 2) T log d),

and therefore we obtain the upper bound T = O(log d/ϵ^2).

4.2. Online Shadow Tomography of K-outcome POVMs

We now prove Theorem 1.2 using Theorem 1.5 and Theorem 1.6. We describe our online shadow tomography procedure for K-outcome POVMs below.
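Before detailing the procedure, one remark on Eq. (5): with the von Neumann entropy regularizer, the minimizer over density matrices has the closed form ω_{t+1} ∝ exp(−η Σ_{s≤t} ∇_s), which is the matrix exponentiated gradient / multiplicative-weights update associated with (Tsuda et al., 2005). A minimal numpy sketch, where the gradient is an illustrative placeholder rather than the ∇_t of Eq. (4):

```python
import numpy as np

def herm_expm(H):
    # exponential of a Hermitian matrix via its eigendecomposition
    w, V = np.linalg.eigh(H)
    return (V * np.exp(w)) @ V.conj().T

def rftl_state(grads, eta):
    """Closed-form minimizer of eta * sum_s Tr(grad_s @ phi) plus the
    negative von Neumann entropy, over density matrices:
    omega = exp(-eta * G) / Tr(exp(-eta * G)) with G = sum_s grad_s."""
    G = sum(grads)
    W = herm_expm(-eta * G)
    return W / np.trace(W).real

d = 4
omega1 = rftl_state([np.zeros((d, d))], eta=0.5)  # no gradient -> I/d
assert np.allclose(omega1, np.eye(d) / d)

grad = np.diag([1.0, 0.0, 0.0, 0.0])   # penalize weight on the first basis state
omega2 = rftl_state([grad], eta=0.5)
assert np.isclose(np.trace(omega2).real, 1.0)  # still a valid density matrix
assert omega2[0, 0].real < 1 / d       # mass moves off the penalized direction
```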
Given the requirement parameters ϵ, δ and the number of measurements M, we first define the following ancillary parameters:

T_0 = C_0 log d/ϵ^2 + 1,  δ_0 = δ/(2T_0),  N_0 = C_1 K log^2 M log(1/δ_0)/ϵ^2,  N_b = C_2 K log(1/δ_0)/ϵ^2,

where C_0, C_1, and C_2 are three parameters that scale at most as poly(log log M, log log d, log(1/ϵ), log K). The total number of copies of ρ will be N = T_0(N_0 + N_b), which is indeed

N = O((log(1/δ)/ϵ^4) · K log^2 M log d),

where the O hides a poly(log log M, log log d, log(1/ϵ), log K) factor. After receiving N copies of ρ, our algorithm first divides these states equally into T_0 batches, each consisting of N_0 + N_b states, and prepares two joint states ρ^{⊗N_0} and ρ^{⊗N_b} from each batch. Each batch is used for the update procedure in one "bad" iteration of our online learning procedure. To begin, the learner initializes the hypothesis state ω_1 = I/d. In each iteration t, it takes a fresh batch of states and runs the quantum distribution threshold search algorithm on the joint state ρ^{⊗N_0}. The threshold vectors are chosen as the probability distributions µ_i obtained by applying each M_i, i ∈ [M], to the hypothesis ω_t. According to Theorem 1.5, we can always choose C_1 so that this quantum distribution threshold search problem is solved with success probability at least 1 − δ_0. If the quantum distribution threshold search declares that d_TV(µ_i, p_i) ≤ ϵ for all i ∈ [M], then we have successfully found a hypothesis such that the probability distributions after applying all K-outcome POVMs to this hypothesis are within ϵ of those for the unknown state ρ. If the quantum distribution threshold search outputs i* with d_TV(p_{i*}, µ_{i*}) > 3ϵ/4, we use ρ^{⊗N_b} to compute an estimate b_{i*} of the probability distribution after applying M_{i*} to ρ. According to Eq. (7), we can always choose C_2 such that, with probability at least 1 − δ_0, the total variation distance is bounded as d_TV(p_{i*}, b_{i*}) ≤ ϵ/4. We supply this b_{i*} to the learner, and the learner employs Algorithm 1 to update the hypothesis state to ω_{t+1}.
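The parameter accounting above can be sanity-checked numerically. The constants C_0, C_1, C_2 below are hypothetical placeholders (set to 1), and the check only confirms the structural claim of Theorem 1.2 that the total copy count N = T_0(N_0 + N_b) is linear in K:

```python
import numpy as np

def total_copies(M, d, K, eps, delta, C0=1.0, C1=1.0, C2=1.0):
    # illustrative accounting of the batching in the proof of Theorem 1.2
    T0 = C0 * np.log(d) / eps**2 + 1         # budget of "bad" iterations
    delta0 = delta / (2 * T0)                # per-iteration failure probability
    N0 = C1 * K * np.log(M)**2 * np.log(1 / delta0) / eps**2  # threshold search
    Nb = C2 * K * np.log(1 / delta0) / eps**2                 # distribution estimate
    return T0 * (N0 + Nb)

args = dict(M=10**6, d=2**30, eps=0.1, delta=0.01)
# doubling K doubles the total number of copies (linear K dependence):
assert np.isclose(total_copies(K=32, **args) / total_copies(K=16, **args), 2.0)
```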
Furthermore, the remaining copies in the current batch are discarded, and the learner moves to the next iteration with a new batch. According to Theorem 1.6, the number of "bad" iterations is bounded by O(log d/ϵ^2). If no round fails, we can always choose C_0 such that d_TV(µ_i, p_i) ≤ ϵ for all i ∈ [M] after the online procedure, where µ_i is obtained by applying M_i, i ∈ [M], to the final hypothesis ω_{T_0}. We now calculate the failure probability of this procedure. In each iteration, the success probabilities of the quantum distribution threshold search and of the computation of b_{i*} are both at least 1 − δ_0. By the union bound, the probability of failure after T_0 iterations is bounded by 2T_0 δ_0 = δ. Finally, we consider the computational complexity of our shadow tomography procedure. In each iteration, we have to implement a series of O(M) measurements on a batch of O(K/ϵ^2) joint samples to perform the quantum distribution threshold search, and compute K terms of the gradient ∇_t to update the hypothesis in the RFTL protocol. As the number of iterations is bounded by O(log d/ϵ^2), the overall computational complexity is bounded by O(KM poly(K, 1/ϵ, log d)) + d^{O(1)}.

5. The Lower Bound

We now show that any shadow tomography procedure for K-outcome POVMs requires at least Ω(min{d^2, K + log M}/ϵ^2) copies of ρ. We set D := min{d, √(log_2 M + K)} and suppose the unknown state is a D-dimensional mixed state. We choose some constant c ∈ (0, 1) and set L = 2^{c(D^2 − K)}, so that L ≤ M. We will use L quantum measurements with K outcomes each. Noticing that the probability of each outcome of a quantum measurement can be regarded as the expectation of a quantum event, there are in total L·K quantum events. We choose L·K subspaces {S_{1,1}, ..., S_{1,K}}, ..., {S_{L,1}, ..., S_{L,K}} of C^D for the L POVMs, independently and Haar-randomly, such that dim(S_{i,j}) = D/K for all i ∈ [L] and j ∈ [K].
The K subspaces in each set {S_{i,1}, ..., S_{i,K}} are mutually orthogonal. We denote by P_{i,j} the projector onto S_{i,j}, and by ρ_{i,j} = (K/D) P_{i,j} the maximally mixed state projected onto S_{i,j}. As long as we choose c close enough to 1, according to Lemma A.2 we can, with success probability 1 − o(1), find a choice of the S_{i,j} such that Tr(P_{i,j} ρ_{i',j'}) is close to 1/K whenever i ≠ i'. We fix such a choice of the S_{i,j}. Without loss of generality, we assume that K is even. Now we construct the following states, indexed by a string z = (z_1, ..., z_{K/2}) of K/2 signs, z_j ∈ {−1, +1}:

ρ_i(z) = (1/K) Σ_{j=1}^{K/2} [(1 − 50ϵ z_j) ρ_{i,2j−1} + (1 + 50ϵ z_j) ρ_{i,2j}].

We consider applying the measurement M_i to the state ρ_i(z). The (2j−1)-th and 2j-th entries of the resulting probability distribution are

Tr(E_{i,2j−1} ρ_i(z)) = (1 − 50ϵ z_j)/K, Tr(E_{i,2j} ρ_i(z)) = (1 + 50ϵ z_j)/K.

We next consider applying M_{i'} to ρ_i(z) for i' ≠ i. According to Eq. (6), the j-th entry of the resulting probability distribution satisfies

(1 − 25ϵ)/K ≤ Tr(E_{i',j} ρ_i(z)) ≤ (1 + 25ϵ)/K.

Now fix a measurement M_i. If we apply this measurement to the two quantum states ρ_i(z_1) and ρ_{i'}(z_2) for i ≠ i', the total variation distance between the two resulting probability distributions is at least 25ϵ/2. It follows that, if we can estimate the probability distribution after applying each M_i to within total variation distance ϵ, we can immediately identify the index i ∈ [L] of the unknown state ρ_i(z). We then consider applying M_i to the two quantum states ρ_i(z_1) and ρ_i(z_2) for z_1 ≠ z_2. Each entry on which z_1 and z_2 differ contributes 100ϵ/K to the total variation distance between the two probability distributions, so the distance is at least ϵ/2 whenever more than 1% of the entries differ. Therefore, if we can estimate the probability distribution after applying M_i to within total variation distance ϵ/4, we can distinguish any such z_1 and z_2 that differ on more than 1% of their entries. Suppose we choose i and z uniformly at random; this choice contains log₂(2^{K/2}M) = Ω(K + log₂M) bits of classical information.
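As a sanity check on these outcome distributions, the snippet below (illustrative parameters; plain numpy, no quantum simulation) builds the K-entry distribution of M_i applied to ρ_i(z) directly from the formula above, and verifies that it is normalized and that flipping a single entry of z moves the distribution by exactly 100ϵ/K in total variation distance.

```python
import numpy as np

K, eps = 8, 0.01            # illustrative sizes; the construction takes K even
rng = np.random.default_rng(1)

def outcome_dist(z):
    """Distribution of M_i on rho_i(z): entries (1 -/+ 50*eps*z_j)/K."""
    p = np.empty(K)
    p[0::2] = (1 - 50 * eps * z) / K   # entries 2j-1
    p[1::2] = (1 + 50 * eps * z) / K   # entries 2j
    return p

z1 = rng.choice([-1, 1], size=K // 2)
z2 = z1.copy()
z2[0] = -z2[0]                          # differ in exactly one entry

p1, p2 = outcome_dist(z1), outcome_dist(z2)
assert np.isclose(p1.sum(), 1.0) and np.isclose(p2.sum(), 1.0)

tv = 0.5 * np.abs(p1 - p2).sum()
assert np.isclose(tv, 100 * eps / K)    # one flipped sign contributes 100*eps/K
```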
Suppose we require N copies of ρ to perform a shadow tomography procedure for K-outcome POVMs. Let ζ := E_{i∈[L], z} [ρ_i(z)^⊗N]. In order to make learning i and 99% of the entries of z from ζ information-theoretically possible, the mutual information I(ζ : i, z) must be at least Ω(K + log₂M). As both i and z are classical, we have

I(ζ : i, z) = S(ζ) − S(ζ | i, z) = S(ζ) − S(ρ_i(z)^⊗N) ≤ N(log₂D − S(ρ_i(z))),

where S(·) is the von Neumann entropy. Now we calculate the term S(ρ_i(z)). Let λ_{i,z,1}, ..., λ_{i,z,D} be the eigenvalues of ρ_i(z). By applying a unitary transformation that diagonalizes ρ_i(z), we observe that half of the λ_{i,z,j} are (1 + 50ϵ)/D and the other half are (1 − 50ϵ)/D. Hence,

S(ρ_i(z)) = Σ_{j=1}^{D} λ_{i,z,j} log₂(1/λ_{i,z,j}) ≥ log₂D − O(ϵ²).

Therefore, the mutual information can be bounded by I(ζ : i, z) = O(Nϵ²). As learning i and 99% of the entries of z requires Ω(K + log₂M) bits of classical information, we conclude that

N = Ω(min{d², K + log M}/ϵ²).

6. Conclusion

This work theoretically established the exact dependence on K of the sample complexity of shadow tomography for K-outcome quantum measurements, and proposed an explicit algorithm that learns these distributions with sample complexity optimal in K. To the best of our knowledge, K is the only parameter for which an exact dependence of the sample complexity is known in the context of shadow tomography. We conclude by discussing a few possible future directions. Can we develop an algorithm, or prove a tight lower bound, with smaller dependence on log M, log d, and 1/ϵ? In addition, it is interesting to explore whether the true dependence in our setting is multiplicative (Ω(K log M)) or additive (Ω(K + log M)). Also, can we find a trade-off between sample complexity and time complexity? Can we achieve a better query complexity for shadow tomography with access to a unitary oracle that prepares the state?
It has been shown that quantum speedups are achievable in similar tasks (Huggins et al., 2021; van Apeldoorn et al., 2022) using quantum mean estimation (Hamoudi, 2021; Cornelissen et al., 2022). Can we find a shadow tomography procedure with polynomial dependence on log d, log M, and log K for families of Mi's and ρ's that are commonly considered in practical experiments?

Acknowledgements

S.A. was supported by a Vannevar Bush Fellowship from the US Department of Defense, the Berkeley NSF-QLCI CIQC Center, a Simons Investigator Award, and the Simons It from Qubit collaboration. The authors thank William Kretschmer, Sitan Chen, Tongyang Li, Dong-Ling Deng, and Weikang Li for helpful discussions, and anonymous reviewers for constructive comments.

References

Aaronson, S. The learnability of quantum states. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 463(2088):3089-3114, 2007.

Aaronson, S. Shadow tomography of quantum states. SIAM Journal on Computing, 49(5):STOC18-368, 2019.

Aaronson, S. and Rothblum, G. N. Gentle measurement of quantum states and differential privacy. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 322-333, 2019.

Aaronson, S., Chen, X., Hazan, E., Kale, S., and Nayak, A. Online learning of quantum states. Advances in Neural Information Processing Systems, 31, 2018.

Akhiezer, N. I. and Glazman, I. M. Theory of Linear Operators in Hilbert Space. Courier Corporation, 2013.

Audenaert, K. M. and Eisert, J. Continuity bounds on the quantum relative entropy. Journal of Mathematical Physics, 46(10):102104, 2005.

Bădescu, C. and O'Donnell, R. Improved quantum data analysis. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 1398-1411, 2021.

Bassily, R., Nissim, K., Smith, A., Steinke, T., Stemmer, U., and Ullman, J. Algorithmic stability for adaptive data analysis. SIAM Journal on Computing, 50(3):STOC16-377, 2021.

Bhatia, R. Matrix Analysis, volume 169.
Springer Science & Business Media, 2013.

Bun, M., Ullman, J., and Vadhan, S. Fingerprinting codes and the price of approximate differential privacy. SIAM Journal on Computing, 47(5):1888-1938, 2018.

Canonne, C. L. A short note on learning discrete distributions. arXiv preprint arXiv:2002.11457, 2020.

Carlen, E. A. and Lieb, E. H. Remainder terms for some quantum entropy inequalities. Journal of Mathematical Physics, 55(4):042201, 2014.

Chen, S., Cotler, J., Huang, H.-Y., and Li, J. Exponential separations between learning with and without quantum memory. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 574-585. IEEE, 2022a.

Chen, S., Huang, B., Li, J., Liu, A., and Sellke, M. Tight bounds for state tomography with incoherent measurements. arXiv preprint arXiv:2206.05265, 2022b.

Chen, X., Hazan, E., Li, T., Lu, Z., Wang, X., and Yang, R. Adaptive online learning of quantum states. arXiv preprint arXiv:2206.00220, 2022c.

Cornelissen, A., Hamoudi, Y., and Jerbi, S. Near-optimal quantum algorithms for multivariate mean estimation. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pp. 33-43, 2022.

Dwork, C., Rothblum, G. N., and Vadhan, S. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 51-60. IEEE, 2010.

Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. L. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 117-126, 2015.

Fano, R. M. The transmission of information. Massachusetts Institute of Technology, Research Laboratory of Electronics, 1949.

Flammia, S. T., Gross, D., Liu, Y.-K., and Eisert, J. Quantum tomography via compressed sensing: error bounds, sample complexity and efficient estimators. New Journal of Physics, 14(9):095022, 2012.

Gross, D.
Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548-1566, 2011.

Haah, J., Harrow, A. W., Ji, Z., Wu, X., and Yu, N. Sample-optimal tomography of quantum states. IEEE Transactions on Information Theory, 63(9):5628-5641, 2017.

Hamoudi, Y. Quantum sub-Gaussian mean estimator. arXiv preprint arXiv:2108.12172, 2021.

Hayden, P., Leung, D. W., and Winter, A. Aspects of generic entanglement. Communications in Mathematical Physics, 265(1):95-117, 2006.

Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

Holevo, A. S. Bounds for the quantity of information transmitted by a quantum communication channel. Problemy Peredachi Informatsii, 9(3):3-11, 1973.

Huang, H.-Y., Kueng, R., and Preskill, J. Predicting many properties of a quantum system from very few measurements. Nature Physics, 16(10):1050-1057, 2020.

Huggins, W. J., Wan, K., McClean, J., O'Brien, T. E., Wiebe, N., and Babbush, R. Nearly optimal quantum algorithm for estimating multiple expectation values. arXiv preprint arXiv:2111.09283, 2021.

Kohler, J. M. and Lucchi, A. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, pp. 1895-1904. PMLR, 2017.

Nielsen, M. A. and Chuang, I. Quantum Computation and Quantum Information, 2002.

O'Donnell, R. and Wright, J. Efficient quantum tomography. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 899-912, 2016.

Preskill, J. Quantum computing in the NISQ era and beyond. Quantum, 2:79, 2018.

Riesz, F. and Nagy, B. S. Functional Analysis. Courier Corporation, 2012.

Smith, A. Lecture notes for The Algorithmic Foundations of Adaptive Data Analysis. Stability and adaptive analysis I, Lectures 7-10, 2017.

Tsuda, K., Rätsch, G., and Warmuth, M. K. Matrix exponentiated gradient updates for on-line learning and Bregman projection.
Journal of Machine Learning Research, 6(Jun):995-1018, 2005.

Ullman, J., Smith, A., Nissim, K., Stemmer, U., and Steinke, T. The limits of post-selection generalization. Advances in Neural Information Processing Systems, 31, 2018.

Vadhan, S. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, pp. 347-450. Springer, 2017.

van Apeldoorn, J., Cornelissen, A., Gilyén, A., and Nannicini, G. Quantum tomography using state-preparation unitaries. arXiv preprint arXiv:2207.08800, 2022.

A. Auxiliary Lemmas

Lemma A.1 (Lemma 4.2, (Bădescu & O'Donnell, 2021)). Suppose we are given an unknown d-dimensional quantum state ρ and M quantum projectors B1, ..., BM ∈ C^{d×d}. There exists an algorithm using O(log²M log(1/δ)) copies of ρ that, with success probability at least 1 − δ, either outputs a particular i* such that Eρ[B_{i*}] = Tr(B_{i*}ρ) > 1/4, or declares that Eρ[Bi] ≤ 3/4 for every i.

The proof of this lemma employs the χ²-stable threshold reporting technique, which is a quantum version of classical statistical results fitting into the adaptive data analysis framework. We omit the details here and refer to (Smith, 2017), for example, for the related background.

Lemma A.2 (Hayden, Leung, and Winter (Hayden et al., 2006)). Let S and T be two subspaces of C^d with dimensions d1 and d2, respectively. Denote by PS and PT the projectors onto S and T, and let ρS = PS/d1 be the maximally mixed state projected onto S. If we fix T and choose S Haar-randomly, then

Pr[Tr(PT ρS) ≥ (1 + c0) d2/d] ≤ exp(−c0² d1 d2/(6 ln 2)).

B. Expectation Estimation

We will need a concentration lemma for random vectors, which is an extension of the vector Bernstein inequality (Theorem 6) in (Gross, 2011; Kohler & Lucchi, 2017).

Lemma B.1. Let x1, ..., xm be independent K-dimensional vector-valued random variables.
We assume that each random vector is zero-mean and uniformly bounded, and has bounded variance, i.e., E[xi] = 0, ‖xi‖₂ ≤ µ, and E[‖xi‖₂²] ≤ σ² for some constants µ, σ > 0. Suppose the parameters satisfy 0 < ϵ < σ²/µ. Then we have

Pr[‖(1/m) Σ_{i=1}^m xi‖₂ ≥ ϵ] ≤ exp(−C · mϵ²/σ² + 1/4)

for some positive constant C.

Proof. Theorem 6 in (Gross, 2011) indicates that, for independent zero-mean random vectors,

Pr[‖Σ_{i=1}^m xi‖₂ ≥ √V + t] ≤ exp(−t²/(4V)),

where V = Σ_{i=1}^m E[‖xi‖₂²] is the sum of the variances of the random vectors. We define ϵ = (√V + t)/m and rewrite the above inequality as

Pr[‖(1/m) Σ_{i=1}^m xi‖₂ ≥ ϵ] ≤ exp(−(mϵ − √V)²/(4V)).

Since the sum of variances V can be bounded by mσ² according to our assumption, we finally obtain

Pr[‖(1/m) Σ_{i=1}^m xi‖₂ ≥ ϵ] ≤ exp(−(mϵ − √(mσ²))²/(4mσ²)).

By choosing the constant C = 1/4, we finish the proof of this lemma.

Now we consider sampling from a probability distribution p = (p1, ..., pK) m times. For the i-th sample, i ∈ [m], we obtain one sample p̂i = (p̂i¹, ..., p̂i^K) with exactly one entry equal to 1 and the other entries 0. We set xi = p − p̂i. Then xi is centered, because E[xi] = E[p − p̂i] = 0. Each entry of xi is bounded by 1 in absolute value, and E[‖xi‖₂²] = σ² = 1 − Σ_{j=1}^K pj² < 1. Therefore, by Lemma B.1, we can guarantee that

Pr[‖(1/m) Σ_{i=1}^m p̂i − p‖₂ ≥ ϵ] ≤ δ

as long as we choose m = O(log(1/δ)/ϵ²). To bound the total variation distance dTV((1/m) Σ_{i=1}^m p̂i, p) between the empirical distribution and the actual distribution, we combine the bound in Eq. (1), dTV(q, p) ≤ (√K/2)‖q − p‖₂, with Lemma B.1. Hence, we can bound dTV((1/m) Σ_{i=1}^m p̂i, p) below ϵ with probability at least 1 − δ if

m = O(K log(1/δ)/ϵ²). (7)

We assign an index to every single copy of ρ in the joint state ρ^⊗N and assume each single ρ occupies one register. For every length-N string x = (x1, ..., xN) ∈ [K]^N, we define the quantum event Ex = E_{x1} ⊗ ... ⊗ E_{xN}, the tensor product of the quantum events E_{xi} on the i-th register. It is easy to verify that Σ_{x∈[K]^N} Ex = I.
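The sampling bound just derived, m = O(K log(1/δ)/ϵ²) samples for an ϵ-accurate empirical distribution in total variation distance, is easy to probe numerically. The snippet below (illustrative constants, one random trial with the constant in the O(·) set to 1) draws a multinomial sample of that size and checks the resulting distance.

```python
import numpy as np

rng = np.random.default_rng(7)
K, eps, delta = 8, 0.05, 0.01
m = int(np.ceil(K * np.log(1 / delta) / eps**2))  # m = O(K log(1/δ)/ε²)

p = rng.dirichlet(np.ones(K))       # an arbitrary K-outcome distribution
counts = rng.multinomial(m, p)      # m i.i.d. samples from p
p_hat = counts / m                  # empirical distribution

tv = 0.5 * np.abs(p_hat - p).sum()
assert tv <= eps                    # holds except with small probability
```

With these parameters m is around 1.5 × 10⁴, and the typical empirical distance is roughly √(K/m)/2, well inside the ϵ target.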
For every K-dimensional array of nonnegative integers k = (k1, ..., kK) with Σ_{j=1}^K kj = N, we define the quantum event

Ak = Σ_{x∈[K]^N : |{i : xi = j}| = kj for all j∈[K]} Ex.

The empirical approximation p̄ is then chosen as p̄ = k/N. Since each entry kj of k is distributed as Binomial(N, Tr(Ejρ)), we can bound the following probability using Eq. (7):

Pr[dTV(p, p̄) ≥ ϵ/8] ≤ δ (8)

for N = O(K log(1/δ)/ϵ²). We define a function f : [0, 1]^K → {0, 1} by

f(t) = 1 if dTV(t, τ) ≥ 7ϵ/8, and f(t) = 0 otherwise.

Based on this function, we define the quantum event B by

B = Σ_{k : k1+...+kK=N, f(k/N)=1} Ak.

As each entry kj of k is distributed as Binomial(N, Tr(Ejρ)), we observe that

E_{ρ⊗N}[B] = Pr[dTV(p̄, τ) ≥ 7ϵ/8].

Recall the guarantee in Eq. (8). The condition dTV(p, τ) > ϵ together with dTV(p, p̄) ≤ ϵ/8 implies dTV(p̄, τ) ≥ 7ϵ/8 by the triangle inequality. Hence,

E_{ρ⊗N}[B] = Pr[dTV(p̄, τ) ≥ 7ϵ/8] ≥ Pr[dTV(p, p̄) ≤ ϵ/8] ≥ 1 − δ.

Similarly, the condition dTV(p, τ) ≤ 3ϵ/4 together with dTV(p, p̄) ≤ ϵ/8 implies dTV(p̄, τ) < 7ϵ/8. Hence,

E_{ρ⊗N}[B] = Pr[dTV(p̄, τ) ≥ 7ϵ/8] ≤ Pr[dTV(p, p̄) > ϵ/8] ≤ δ.

C. Proof of Lemma 4.1

We mainly follow the template of the proof of Theorem 3 in (Aaronson et al., 2018), but there are some differences since the loss function is different. We first observe that the loss function ℓt(Tr(Et,1φ), ..., Tr(Et,K−1φ)) is convex. When computing the sub-derivative with respect to each value Tr(Et,jφ), at most two terms of the loss function involve Tr(Et,jφ): the term for the j-th entry, (1/2)|Tr(Et,jφ) − bt,j|, and the term for the last entry, (1/2)|Tr(Et,Kφ) − bt,K|, since Tr(Et,Kφ) = 1 − Σ_{j=1}^{K−1} Tr(Et,jφ). Therefore, the value of the sub-derivative ∂ℓt/∂Tr(Et,jφ) lies in {−1, 0, 1}. We can divide the indices j of the Et,j into three subsets St,1, St,−1, and St,0 such that the value of ∂ℓt/∂Tr(Et,jφ) is 1, −1, and 0 for j chosen from St,1, St,−1, and St,0, respectively. We thus rewrite ∇t as

∇t = Σ_{j∈St,1} Et,j − Σ_{j∈St,−1} Et,j.
Notice that the Et,j are projectors corresponding to different measurement outcomes with Σ_{j=1}^K Et,j = I, so the Et,j are mutually orthogonal and the spectral norm of any partial sum satisfies ‖Σ_{j∈S} Et,j‖ ≤ 1 for S ⊆ [K]. We can thus bound the spectral norm of ∇t by

‖∇t‖ ≤ ‖Σ_{j∈St,1} Et,j‖ + ‖Σ_{j∈St,−1} Et,j‖ ≤ 2.

In the following, we denote µt = (Tr(Et,1ωt), ..., Tr(Et,K−1ωt)) and τt = (Tr(Et,1φ), ..., Tr(Et,K−1φ)) for simplicity. Since ℓt is convex, ℓt(µt) − ℓt(τt) ≤ ∇t • (ωt − φ) holds for all φ ∈ H, where • denotes the trace inner product between complex matrices. Summing over t, we obtain

Σ_{t=1}^T [ℓt(µt) − ℓt(τt)] ≤ Σ_{t=1}^T [Tr(∇tωt) − Tr(∇tφ)].

We define gt(X) = ∇t • X for X ∈ H, and R(X) to be the negative von Neumann entropy of X. By Lemma 5.2 in (Hazan et al., 2016), we have

Σ_{t=1}^T [gt(ωt) − gt(φ)] ≤ Σ_{t=1}^T ∇t • (ωt − ωt+1) + D²_R/η (9)

for any φ ∈ H, where D²_R := max_{φ,φ'∈H} {R(φ) − R(φ')}. We define Φt(X) = η Σ_{s=1}^t ∇s • X + R(X); then line 5 of Algorithm 1 finds the minimizer of Φt(X) over H. To prove the theorem, we need the following two claims.

Claim C.1. For all t ∈ {1, ..., T}, we have ωt ≻ 0.

Proof. Consider a Hermitian matrix P ∈ C^{d×d} with minimal eigenvalue zero, i.e., λmin(P) = 0. Suppose P = V QV†, where Q is a diagonal matrix whose real entries are the eigenvalues of P. Assume Q_{1,1} = λmax(P) and Q_{d,d} = λmin(P) = 0. We consider a different matrix P' = V Q'V† such that Q'_{1,1} = Q_{1,1} − ϵ, Q'_{i,i} = Q_{i,i} for i ∈ {2, ..., d−1}, and Q'_{d,d} = ϵ, for some ϵ < λmax(P). We then prove that there exists ϵ > 0 satisfying Φt(P') ≤ Φt(P). By expanding both sides of the inequality, it suffices to prove the equivalent inequality

A • (P' − P) ≤ α log α − (α − ϵ) log(α − ϵ) − ϵ log ϵ,

where A = η Σ_{s=1}^t ∇s and α = λmax(P) = Q_{1,1}. Notice that ‖A‖ ≤ η Σ_{s=1}^t ‖∇s‖ ≤ 2ηt. The left-hand side of the inequality can be bounded using the generalized Cauchy-Schwarz inequality (Bhatia, 2013) as

A • (P' − P) ≤ 2ηt ‖P' − P‖_Tr ≤ 4ϵηt,

where ‖·‖_Tr is the trace norm. As −log ϵ → ∞ when ϵ → 0, there exists a small enough ϵ such that 4ηt ≤ log α − log ϵ.
Therefore, we have

4ηtϵ ≤ ϵ log α − ϵ log ϵ ≤ α log α − (α − ϵ) log(α − ϵ) − ϵ log ϵ.

This indicates that there exists a small enough ϵ such that Φt(P') ≤ Φt(P). If P has more than one zero eigenvalue, we can repeat the argument to construct the matrix P'. As ωt is a minimizer of Φt−1 and ω1 ≻ 0, we have ωt ≻ 0 for all t.

Now we can focus on X ≻ 0 and write R(X) = Tr(X log X). We can further calculate the gradient of Φt(X) as

∇Φt(X) = η Σ_{s=1}^t ∇s + I + log X.

Here, we assume that the function Φt(X) is defined over real symmetric matrices. We can further prove the following claim.

Claim C.2. For all t ∈ {1, ..., T−1}, ∇Φt(ωt+1) • (ωt − ωt+1) ≥ 0.

Proof. Assume for contradiction that ∇Φt(ωt+1) • (ωt − ωt+1) < 0. We choose a parameter a ∈ (0, 1) and construct X = (1 − a)ωt+1 + aωt. Then X ⪰ 0 is also a density matrix. We denote Δ = X − ωt+1 = a(ωt − ωt+1). According to Theorem 2 in (Audenaert & Eisert, 2005), we have

Φt(X) − Φt(ωt+1) ≤ a ∇Φt(ωt+1) • (ωt − ωt+1) + a² Tr[(ωt − ωt+1)²]/λmin(ωt+1).

Dividing both sides of the above inequality by a gives

[Φt(X) − Φt(ωt+1)]/a ≤ ∇Φt(ωt+1) • (ωt − ωt+1) + a Tr[(ωt − ωt+1)²]/λmin(ωt+1).

Since we assume that ∇Φt(ωt+1) • (ωt − ωt+1) < 0, we can always choose some small enough a such that the right-hand side is negative, while the left-hand side is always nonnegative, since Φt(X) ≥ Φt(ωt+1) because ωt+1 minimizes Φt. This leads to a contradiction. Therefore, we have proved that ∇Φt(ωt+1) • (ωt − ωt+1) ≥ 0.

We define the Bregman divergence

B_{Φt}(ωt||ωt+1) := Φt(ωt) − Φt(ωt+1) − ∇Φt(ωt+1) • (ωt − ωt+1).

By Pinsker's inequality (Carlen & Lieb, 2014), we have

(1/2)‖ωt − ωt+1‖²_Tr ≤ Tr(ωt log ωt) − Tr(ωt log ωt+1) = B_{Φt}(ωt||ωt+1).

Using Claim C.2 and Φt−1(ωt) ≤ Φt−1(ωt+1), we have

B_{Φt}(ωt||ωt+1) = Φt(ωt) − Φt(ωt+1) − ∇Φt(ωt+1) • (ωt − ωt+1) ≤ Φt(ωt) − Φt(ωt+1) = Φt−1(ωt) − Φt−1(ωt+1) + η∇t • (ωt − ωt+1) ≤ η∇t • (ωt − ωt+1).

Hence,

(1/2)‖ωt − ωt+1‖²_Tr ≤ η∇t • (ωt − ωt+1).

By the generalized Cauchy-Schwarz inequality, we have

∇t • (ωt − ωt+1) ≤ ‖∇t‖ · ‖ωt − ωt+1‖_Tr ≤ ‖∇t‖ √(2η ∇t • (ωt − ωt+1)),

which gives ∇t • (ωt − ωt+1) ≤ 2η‖∇t‖² ≤ 8η. We combine this inequality with Eq.
(9) and reach the following bound:

Σ_{t=1}^T ∇t • (ωt − φ) ≤ 8ηT + D²_R/η.

We take η = D_R/(2√(2T)). Observing that D²_R ≤ log d by the definition of the von Neumann entropy, the corresponding regret bound is

Σ_{t=1}^T [ℓt(µt) − ℓt(τt)] ≤ Σ_{t=1}^T ∇t • (ωt − φ) ≤ 4√(2T log d).

D. An Exemplary Application

Here, we provide an application of our shadow tomography procedure for K-outcome POVMs. In quantum mechanics, we are often interested in the expectation values of quantum operators {Oi}_{i=1}^M, oi = ⟨Oi⟩ = Tr(Oiρ), given an unknown quantum state ρ. Suppose we perform a quantum measurement Mi with K outcomes to estimate the expectation value oi. Then the following corollary holds by using our shadow tomography procedure.

Corollary D.1. Consider an unknown d-dimensional quantum state ρ, as well as M quantum operators O1, ..., OM. Assume we can measure each operator Oi using a quantum measurement Mi with K outcomes. Then there exists a strategy that approximates the expectation Tr(Oiρ) of each operator to within additive error ϵ using

N = O(max_i ‖Oi‖⁴ · K log²M log d · log(1/δ)/ϵ⁴)

copies of ρ, where ‖·‖ is the spectral norm. The success probability is at least 1 − δ.

To prove this corollary, we divide the procedure into two steps. In the first step, we approximate the distribution after applying each measurement Mi to within total variation distance ϵ/max_i ‖Oi‖, which requires N copies of ρ according to Theorem 1.2. Next, we calculate the expectation value using the distribution we obtained. The additive error in the expectation of Oi is bounded above by ‖Oi‖ · ϵ/max_i ‖Oi‖ ≤ ϵ.

As an example, we consider n-qubit quantum states, which are d = 2^n-dimensional. We want to measure the expectation values of the operators {S_{n̂i}}_{i=1}^M, which measure the total spin along the directions n̂i:

S_{n̂i} = Σ_{k=1}^n σ^k_{n̂i} ⊗ (⊗_{l≠k} I_l),

where σ^k_{n̂i} denotes the spin operator along n̂i on the k-th qubit and I_l denotes the identity operator on the l-th qubit. Each measurement Mi has K = n + 1 outcomes.
The quantum event corresponding to the outcome n − 2k, for k = 0, 1, ..., n, can be written as the projector

E_k = Σ_{x∈{0,1}^n : |x|=k} |x⟩⟨x|,

where |x| represents the Hamming weight of the string x and the basis is the eigenbasis of S_{n̂i}. We can calculate the spectral norm ‖·‖ and the Hilbert-Schmidt norm ‖·‖_HS of S_{n̂i} as

‖S_{n̂i}‖ = n, ‖S_{n̂i}‖_HS = √(n·2^n).

Therefore, we can approximate the expectation values of {S_{n̂i}}_{i=1}^M using N = O(log⁷d/ϵ⁴) copies of ρ according to Corollary D.1, which scales only poly-logarithmically in d. However, directly using classical shadows would require an exponential number of samples, since the Hilbert-Schmidt norm of S_{n̂i} grows exponentially in n.
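As a check on the counts and norms used in this example, the snippet below builds the total-spin operator explicitly for a small number of qubits, taking the ẑ direction and the Pauli convention (eigenvalues ±1 per qubit, so the outcomes are n − 2k); the helper name `embed` is ours.

```python
import numpy as np
from functools import reduce

n = 4                       # toy qubit count (illustrative)
sz = np.diag([1.0, -1.0])   # Pauli sigma_z
I2 = np.eye(2)

def embed(k):
    """sigma_z acting on qubit k, identity on all other qubits."""
    ops = [I2] * n
    ops[k] = sz
    return reduce(np.kron, ops)

S = sum(embed(k) for k in range(n))              # total spin along z
eig = np.unique(np.linalg.eigvalsh(S).round(9))  # distinct outcomes

assert len(eig) == n + 1                         # K = n + 1 outcomes
assert set(eig) == {n - 2 * k for k in range(n + 1)}
assert np.isclose(np.abs(eig).max(), n)          # spectral norm n
assert np.isclose(np.linalg.norm(S), np.sqrt(n * 2**n))  # HS norm sqrt(n*2^n)
```

The last assertion makes the closing point concrete: the Hilbert-Schmidt norm grows like √(n·2^n), i.e., exponentially in n, even though the spectral norm stays at n.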