# on_adaptivity_in_quantum_testing__af712f3e.pdf Published in Transactions on Machine Learning Research (September/2023) On Adaptivity in Quantum Testing Omar Fawzi omar.fawzi@ens-lyon.fr Univ Lyon, Inria, ENS Lyon, UCBL, LIP, F-69342, Lyon Cedex 07, France Nicolas Flammarion nicolas.flammarion@epfl.ch EPFL, Lausanne, Switzerland Aurélien Garivier aurelien.garivier@ens-lyon.fr Univ Lyon, ENS de Lyon, UMPA UMR 5669, F-69364 Lyon Cedex 07, France Aadil Oufkir aadil.oufkir@ens-lyon.fr Univ Lyon, Inria, ENS Lyon, UCBL, LIP, F-69342, Lyon Cedex 07, France Reviewed on Open Review: https: // openreview. net/ forum? id= Hf95z Fn Q7H Can adaptive strategies outperform non-adaptive ones for quantum hypothesis selection? We exhibit problems where adaptive strategies provably reduce the number of required samples by a factor four in the worst case, and possibly more when the actual difficulty of the problem makes it possible. In addition, we exhibit specific hypotheses classes for which there is a provable polynomial separation between adaptive and non-adaptive strategies a specificity of the quantum framework that does not appear in classical testing. 1 Introduction Testing properties of quantum states is an important question in quantum learning theory which generalizes testing properties of discrete probability distributions (Nielsen & Chuang, 2002; Montanaro & de Wolf, 2013; Arunachalam & de Wolf, 2017; Anshu & Arunachalam, 2023). Indeed, a quantum state admits an intrinsic probability description given by the list of its eigenvalues. Learning completely a quantum state is expensive and requires a significant number of copies even at the presence of quantum memory (Haah et al., 2016). Since quantum resources are costly, it is crucial to design procedures to efficiently test the important properties of quantum systems. A simple way of reducing the sample complexity is to allow the testing algorithm to update its stopping rule (sequential) and/or its way of acquiring new data (adaptive) depending on the previously observed data. Here we focus on the question of the effect of sequential and adaptive strategies on the sample efficiency. This is a very active area in statistics and machine learning starting from the works of Wald (Wald, 1945) and in the learning literature under the name of bandit problems (Lattimore & Szepesvári, 2020). Here, we consider this question for quantum testing problems. Specifically, we focus on the hypothesis selection problem using incoherent measurements, where the tester is asked to determine the hypothesis set containing the unknown quantum state ρ with high probability. This problem is ubiquitous in the quantum learning theory literature, and several variants are considered: testing identity (O Donnell & Wright, 2015; Bubeck et al., 2020; Chen et al., 2022c), testing closeness (Yu, 2020), binary hypothesis testing (Hiai & Petz, 1991), (Audenaert et al., 2007; Nussbaum & Szkoła, 2009), composite quantum hypothesis testing (Bjelaković et al., 2005). If the tester is limited to incoherent measurements, the problem is very related to classical testing problems. Indeed, on the one hand, every classical testing problem on discrete distributions can be cast into a quantum testing problem by taking diagonal quantum states corresponding to the discrete distributions. Measuring these quantum states is equivalent to sampling from the classical distributions. On the other hand, the quantum hypothesis selection problem can be seen as a bandit problem (see e.g. (Garivier Published in Transactions on Machine Learning Research (September/2023) Non Sequential Sequential Non adaptive 2 log(1/δ) ε2 (Prop. 3.1) log(1/δ) 2ε2 (Prop. 3.3) Adaptive 2 log(1/δ) ε2 (Prop. 3.1) log(1/δ) 2ε2 (Prop. 3.3) Table 1: Copy complexity for testing whether ρ = σ1 or ρ = σ2 where ε = σ1 σ2 tr and δ is the error probability. Non Sequential Sequential Non adaptive Θ d3/2 log(1/δ) ε2 (Bubeck et al., 2020) O min n d3/2 log(1/δ) ε2 , d1/2 log(1/δ) o (Prop. 3.4) Adaptive Θ d3/2 log(1/δ) ε2 (Chen et al., 2022a) O min n d3/2 log(1/δ) ε2 , d1/2 log(1/δ) o (Prop. 3.4) Table 2: Copy complexity for testing whether ρ = I d tr ε where d is the dimension of quantum states and δ is the error probability. & Kaufmann, 2019; Lumbreras et al., 2022; Brahmachari et al., 2023)). Born s rule defines exactly the classical distribution of the reward when pulling a particular arm (performing a measurement). Note that these probability distributions are not arbitrary: they are governed by the unknown quantum state. This connection leads to an important question: Can sequential strategies outperform non-sequential ones for some hypothesis selection problem with incoherent measurements? In other words, if the tester is allowed to choose the measurement device at a given step depending on the previous observations, would it require fewer copies of the unknown quantum state? Note that from a practical point of view in quantum experiments, adaptive and sequential strategies can be implemented without much overhead. Thus, finding efficient adaptive/sequential strategies can lead to practical savings for quantum experiments as shown by Granade et al. (2017). Classically, sequential strategies prove to have an advantage over non-sequential ones for instance for binary hypothesis testing problems (see (Wald, 1945)), testing continuous distributions (see (Zhao et al., 2016; Balsubramani & Ramdas, 2015)), testing identity and closeness problems with small alphabet size (see (Fawzi et al., 2021; 2022)). This speedup comes, mainly, from the fact that a sequential algorithm can make comparisons at each step and can respond earlier once it has the enough confidence. However, sequential strategies in the quantum setting have not only the capacity to choose the stopping time, but also to change the measurement devices adaptively. We expect then a larger gap between sequential and non-sequential strategies. To avoid confusion, sequential strategies can choose the stopping time according to the previous observations and thus they have random stopping times, while adaptive strategies are allowed to adapt their measurement devices at each step according to past observations. With these definitions, a strategy can be sequential and adaptive, sequential and non-adaptive, non-sequential and adaptive, or non-sequential and non-adaptive. When we don t specify whether the strategy is non-sequential or sequential (resp. non-adaptive or adaptive), it can be both and the statement remains true. On the other hand, non-adaptive strategies have been shown to be optimal for many interesting quantum testing problems, including testing identity by (Chen et al., 2022a), purity testing and shadow tomography by (Chen et al., 2021), tomography by (Chen et al., 2022b). These works suggest that adaptive/sequential strategies cannot outperform non-adaptive non-sequential ones. The goal of the article is to show the contrary: there are some situations where sequential or adaptive strategies require fewer measurements than non-adaptive non-sequential ones. Contributions When the number of hypotheses m is equal to 2 and the hypotheses are simple (i.e., only one possible state), we can precisely characterize the optimal worst case complexity for non-sequential and sequential strategies. We show that sequential strategies outperform non-sequential ones by a factor 4. For the lower bounds, we show how to reduce this problem to the classical testing identity problem, then apply the lower bounds of Fawzi et al. (2022). For the sequential upper bound, we design stopping rules inspired by time uniform concentration inequalities. We refer to Table. 1 for a summary of these bounds. Published in Transactions on Machine Learning Research (September/2023) Non Sequential Sequential Non adaptive Θ min n md ε2 o (Prop. 4.4, 4.5) Θ min n md ε2 o (Prop. 4.4, 4.5) Adaptive Θ d ε2 (Prop. 4.3, 4.9) Θ d ε2 (Prop. 4.3, 4.9) Table 3: Copy complexity for the hypothesis selection problem (P) (4.1) where d is the dimension of quantum states, m is the number of hypotheses and ε is the precision parameter. There is a polynomial separation between non adaptive and adaptive strategies when exp(O(d)) m Ω(d). Moreover, we show that sequential algorithms can adapt to the actual difficulty for the testing mixedness and testing closeness problems. For this, we show a lower bound on the TV-distance between the probability distributions after measurement depending on the actual 1-norm between the quantum states (see Lemma 3.5). This inequality helps to reduce quantum testing to classical testing at the cost of a factor 1/ d (d is the dimension of the quantum states) and can be useful for other applications. We refer to Table. 2 for a summary of these bounds. For a number of hypotheses m 2, we prove a separation between adaptive and non-adaptive strategies for a specific problem. The learner has the information that the unknown quantum state can be diagonalised in a basis amongst m known orthonormal bases and would like to approximate it. See Def. 4.1 for a formal definition of this problem. We show that this problem can be solved by adaptive algorithms using O(d log(m)/ε2) copies of ρ. On the other hand, every non-adaptive algorithm solving this problem will require Ω(min{md/ log(m)ε2, d2/ε2}) copies of ρ. The upper bounds follows from the shadow tomography algorithm of Huang et al. (2020). For the lower bounds, we construct an ε-separated family of quantum states close to the maximally mixed state (I/d) and use it to encode a message from [meΩ(d)]. A learning algorithm can be used to decode this message with the same success probability. Hence, the encoder and decoder should share at least Ω(log(m)+d) bits of information (Fano s inequality (Fano, 1961)). On the other hand, after each step, we show that the correlation between the encoder and decoder can only increase by at most O(ε2 log(m)/m + ε2/d) bits for non-adaptive strategies and it can only increase by at most O(ε2) bits for adaptive strategies. We obtain an improvement by a factor d or m/ log(m) for non-adaptive strategies by exploiting the randomness in the construction and the independence of the observations at different steps. We refer to Table. 3 for a summary of these bounds. Related work Quantum testing identity using entangled measurements is well understood (O Donnell & Wright, 2015; Bădescu et al., 2019): it is known that Θ(d/ε2) copies are necessary and sufficient. For incoherent measurements, it starts with the work of Bubeck et al. (2020) where we have two different lower bounds for testing mixedness problem using independent adaptive and non-adaptive measurements. This result is generalized for general testing identity to some quantum state σ by Chen et al. (2022c). Recently Chen et al. (2022a) show that adaptive algorithms cannot significantly outperform non-adaptive ones neither for testing mixedness nor testing identity. If entangled measurements are allowed, the quantum hypothesis selection problem can be solved using poly(log(m)) copies of ρ (see (Bădescu & O Donnell, 2021)). This poly-logarithmic complexity in m can be explained by the fact that ρ N can be reused after measurement. In contrast, this is not possible using incoherent measurements for which the state collapses after performing the measurement. In general, the quantum hypothesis selection problem, where each hypothesis contains only one quantum state, is highly related to the shadow tomography problem where the learner is asked to uniformly approximate the expected values {tr(ρOi)}i [m] of m known observables {Oi}i [m] by measuring the unknown quantum state ρ. A popular algorithm for the shadow tomography problem is given by Huang et al. (2020) and uses at most O(log(m)d/ε2) non-sequential non-adaptive incoherent measurements. On the other hand, independent adaptive strategies are shown to be useless for shadow tomography (and purity testing) by Chen et al. (2021). Moreover, sequential adaptive strategies have been used by Li et al. (2022b) (see (Li et al., 2022a) for quantum channel discrimination) to achieve the optimal rates given by the quantum relative entropy for both type I and type II errors at the same time for binary hypothesis testing problem using entangled measurements. Published in Transactions on Machine Learning Research (September/2023) Adaptive strategies have been considered for testing quantum channels in (Harrow et al., 2010; Pirandola et al., 2019; Salek et al., 2022). In particular, Harrow et al. (2010) and Salek et al. (2022) provide examples for which adaptive strategies outperform non-adaptive ones for testing quantum channels. We note that for channels, one has the possibility to adapt the input of the channel to the previous observations, but this is not the case for testing states. As such, it is more challenging to find a separation between adaptive and non-adaptive strategies for testing quantum states than it is for channels. For the tomography problem, Chen et al. (2022b) shows that adaptive independent strategies cannot beat non-sequential non-adaptive ones and thus need at least Ω(d3/ε2) copies to learn the quantum state ρ. However, it is unclear whether adaptivity can help for learning restricted families of states such as graph states (Ouyang & Tomamichel, 2022). On the other hand, sequential strategies were used for online state tomography by Kueng & Ferrie (2015); Youssry et al. (2019); Stricker et al. (2022); Rambach et al. (2022). Note that the word online learning is usually used to refer to a learning task where the properties or the state we want to estimate change on the fly. The problems we consider here are not of this type: the task is fixed in advance. For this reason we use the words adaptive and sequential instead of online. Finally, other works consider testing properties of quantum states/distributions with different access models. For instance, (Acharya et al., 2020) studies the copy complexity of estimating entropies of a quantum state, (Gilyén & Li, 2019) studies testing closeness between unknown distributions with coherent access and (Belovs, 2019) studies the quantum query complexity of discriminating two probability distributions encoded by quantum oracles. 2 Preliminaries Throughout the paper, d is the dimension of the quantum states. A quantum state is a positive semi-definite Hermitian matrix of trace 1. We use the bra-ket notation: a column vector is denoted |ϕ and its adjoint is denoted ϕ| = |ϕ . With this notation, ϕ|ψ is the dot product of the vectors ϕ and ψ and, for a unit vector |ϕ , |ϕ ϕ| is the rank-1 projector on the space spanned by the vector ϕ. The canonical basis {ei}i [d] is denoted {|i }i [d] := {|ei }i [d]. We define the trace norm or the 1-norm of a matrix M as M tr = 1 2tr M M and the 2-norm as tr (M M). An observable is a Hermitian matrix O satisfying O 0 and I O 0 where I is the identity matrix. Given two quantum states ρ and σ, we can compare them using the quantum relative entropy defined as: D(ρ σ) = tr(ρ(log(ρ) log(σ))) or the quantum Chernoff divergence defined as: C(ρ, σ) = log inf 0 s 1 tr(ρ1 sσs) . The total variation (TV) distance between two probability distributions P and Q on [d] is defined as: TV(P, Q) = 1 i=1 |Pi Qi| and the Kullback-Leibler (KL) divergence is defined as: i=1 Pi log Pi Finally, for two numbers p, q [0, 1], we denote KL(p q) = KL(Ber(p) Ber(q)). All the problems discussed in this article are special cases of the general hypothesis selection problem. Given an unknown quantum state ρ Cd d and m hypothesis classes {Hi}i [m], the learner is asked to find one of the hypothesis classes containing ρ with high probability. Formally: Published in Transactions on Machine Learning Research (September/2023) Definition 2.1 (Quantum hypothesis selection). Let ρ Cd d be an unknown quantum state. Let {Hi}i [m] be m hypothesis classes. We have the promise that at least one of the following assertions is satisfied: ρ H1, ρ H2, . . . , ρ Hm . An algorithm A is δ-correct for this problem if it verifies the following property: i [m] : ρ / Hi = P (A = i) δ . The difference between quantum and classical testing is that in the quantum case we have the possibility to choose a measurement (given by positive operators summing to the identity). If the quantum states are restricted to be diagonal, we may assume the measurement is always the same and so the problem becomes a classical testing problem (see Lemma 3.2). The quantum state ρ is unknown, but the learner can extract classical information from it by performing a measurement. The way the unknown quantum state ρ is measured is important and can lead to different results about the number of copies needed for this task. Definition 2.2. A measurement is defined by a POVM (positive operator-valued measure) with a finite number of elements: this is a set of positive semi-definite matrices M = {Mi}i X acting on a Hilbert space H and satisfying P i X Mi = IH. Each element Mi in the POVM M is associated with the outcome i X. The tuple {tr(ρMi)}i X is non-negative and sums to 1: it thus defines a probability denoted by ρ(M). Born s rule (Born, 1926) says that the probability that the measurement on a quantum state ρ using the POVM M will output i is exactly tr(ρMi). We distinguish between two types of measurements depending on the considered Hilbert space: Definition 2.3 (Entangled measurement). An entangled measurement is given by a POVM on the Hilbert space H = (Cd) N, where N is the number of copies available of the quantum state ρ. We can measure the whole state ρ N at once. An interesting POVM related to the observable O on Cd is given by M(O) = {Mk}0 k N where Mk = P x {0,1}N,|x|=k Ox1 Ox N . Measuring ρ N with the POVM M(O) outputs a sample from the binomial distribution Bin(n, tr(ρO)). Definition 2.4 (Incoherent measurement). An incoherent measurement is given by a sequence of POVMs {Mt}t [N], each of them acts on the Hilbert space H = Cd. In this case, we measure at step t the quantum state ρ using the POVM Mt. For instance, for an observable O, measuring ρ with the POVM M(O) = (I O, O) outputs a sample from the Bernoulli distribution Ber(tr(ρO)). In this article, we focus on algorithms using incoherent measurements. In this case, we can distinguish between two four types of strategies depending whether the number of measurements and the POVMs {Mt}t are fixed in advance or not. Definition 2.5 (Adaptive strategies). A strategy is called non-adaptive when the POVMs {Mt}t are fixed in advance (i.e., do not depend on the outcomes of the previous measurements). When Mt can be chosen depending on the results of the previous measurements with the (Ms)s ϕ(δ, t)) δ where ϕ(δ, t) = r 1 2t log 2t(t+1) The lower bound follows from the previous proof s reduction and the lower bound on the expected number of samples for testing uniform using sequential algorithms: Ber(1/2) vs Ber(1/2 ε) (see (Fawzi et al., 2022)). The detailed proof can be found in App. A.1. Note that Li et al. (2022b) have also established an advantage of sequential adaptive strategies over nonadaptive non-sequential ones in terms of the error exponents. The type I error is the probability that the testing algorithm answers the hypothesis H1 while the hypothesis H0 is the correct one while the type II error is the probability that the testing algorithm answers the hypothesis H0 while the hypothesis H1 is the correct one: αN = PH0(AN = 1) and βN = PH1(AN = 0) Published in Transactions on Machine Learning Research (September/2023) where N is the number of copies used. The error exponents (rates) are then given by R0 = lim N log(αN) N and R1 = lim N log(βN) Concretely, Li et al. (2022b) show that adaptive sequential strategies can achieve the best rates (at the same time) given by the quantum relative entropy between two states for both type I and II errors. On the other hand, it is known that non sequential non adaptive strategies can only achieve the quantum Chernoff rate exponent when the error probabilities are equal (Nussbaum & Szkoła, 2009; Audenaert et al., 2007). For the particular states σ1 = I2 2 and σ2 = diag( 1 2 ε), we can show that the quantum relative entropy and the quantum Chernoff divergence between σ1 and σ2 are asymptotically equivalent to: D(σ1 σ2), D(σ2 σ1) ε 0 2ε2 and C(σ1, σ2) ε 0 ε2 Therefore, we can recover the factor 4 improvement by comparing the quantum relative entropy and the quantum Chernoff divergence: C(σ1, σ2) , D(σ2 σ1) C(σ1, σ2) ε 0 2ε2 We refer to App. A.1.3 for the detailed computations. 3.2 Sequential strategies adapt on the actual difficulty of the problem without prior knowledge In this section, we change the previous setting by letting the second hypothesis be multiple. Precisely, we consider the problem of testing identity with H1 = {I/d} and H2 = {ρ : ρ I/d tr ε} where ε is a positive parameter. (Chen et al., 2022a) has proved that the optimal non-sequential adaptive copy complexity is Θ(d3/2/ε2). We show that while non-sequential adaptive algorithms cannot improve the copy complexity, sequential non-adaptive algorithms can be used to adapt to the actual difficulty of the problem. Mainly we show the following result: Proposition 3.4. There is a sequential non adaptive algorithm for testing identity problem using a number of measurements satisfying: E (N) = O min d3/2 log(1/δ) ε2 , d1/2 log(1/δ) In particular, the expected copy complexity can be reduced to O(rd1/2 log(1/δ)) if the quantum state ρ has low rank r d/2 or O rd1/2 log(1/δ) if the trace-less matrix ρ I/d has low rank r even if the algorithm does not have any information about these ranks (see App. A.2). The algorithm uses random measurements and a time-dependent stopping rule. Since we have already sequential algorithms for the classical testing identity problem, it is sufficient to show how to reduce the quantum problem to the classical one. For a POVM M and a quantum state ρ, let ρ(M) denotes the classical probability distribution {tr(ρMi)}i. The following lemma captures the main ingredient of the reduction: Lemma 3.5. For all δ > 0, let l = 1 4 log(2/δ) and U 1, U 2, . . . , U l Cd d be Haar-random unitary matrices of columns {|U j i }1 i d,1 j l, M = { 1 l |U j i U j i |}i,j is a POVM and for all quantum states ρ and σ we have with a probability at least 1 δ: TV(ρ(M), σ(M)) ρ σ 2 where r is the rank of (ρ σ). It is, in general, difficult to compute the expected value of the 1-norm under Haar measure, but the 2 and 4-norms can be computed exactly with the Weingarten calculus (see Lemma D.1 and Lemma D.2), so we Published in Transactions on Machine Learning Research (September/2023) use Hölder s inequality to lower bound the 1-norm by an expression involving the 2 and 4-norms. Moreover, this will only give a lower bound in expectation, so we sample more Haar-distributed unitaries and construct a new measurement device by concatenating a fraction of the columns of each unitary. Then, we need to show that the TV-distance is Lipschitz to be able to apply a concentration inequality for functions of Haardistributed unitaries (Theorem D.3). This is done by carefully applying the Cauchy Schwarz inequality. The complete proof can be found at App. A.2. Sen (2006) proved a slightly weaker (by a logarithmic factor) lower bound on the TV distance using a POVM constructed with Gaussian random variables. Also, a similar lower bound (in expectation) can be found in (Matthews et al., 2009) where the authors analyze the uniform POVM and a POVM defined by a spherical 4-designs. However, for our reduction, it is important to minimize the number of outcomes of the POVM. Lemma 3.5 gives a POVM for which our problem reduces to testing identity: P = Un vs TV(P, Un) ε 20 d with high probability, where n = 1 4d log(2/δ) and P = M(ρ). Under the alternative hypothesis H2, the TV distance between P and Un can be lower bounded by TV(P, Un) 1 20 ρ I/d 2. Therefore we can apply the sequential classical testing uniform result of (Fawzi et al., 2022) to obtain a copy complexity O d3/2 log(1/δ) max{ε2, d ρ I/d 2 2} A matching lower bound can be obtained in the worst case setting where we are interested only in the parameters d, ε and ρ I/d tr. This can be done using Markov s inequality to transform the algorithm to a deterministic-time one then invoking the lower bound of (Chen et al., 2022a): Any adaptive algorithm for testing identity would require Ω(d3/2/ε2) copies of ρ. Note that, using Lemma 3.5 and the sequential tester of (Fawzi et al., 2022), we obtain the same copy complexity for testing closeness (i.e., testing ρ = σ vs ρ σ tr ε where we can measure the unknown quantum states ρ and σ) as for testing identity. This is different from the classical case where testing identity (Diakonikolas et al., 2017) can be done with much less copies than testing closeness (Diakonikolas et al., 2020). 4 Provable separation between adaptive and non-adaptive strategies In this section, we fix the error probability to δ = 1/3. We construct a problem for which we have a separation between adaptive and non-adaptive algorithms. Definition 4.1. [Hypothesis selection problem (P)] Let {σ1, . . . , σm} be a set of ε-separated known quantum states. The unknown quantum state ρ is ε/3-close to (at most) one of the quantum states σi {σ1, . . . , σm} and has the same diagonalisation basis than σi . We aim to learn the quantum state ρ to within ε/10 with high probability. Formally, the goal is to design an algorithm that measures a number of copies of ρ and returns a quantum state ρ (an ε/10-approximation of ρ) such that with probability (the randomness comes from the measurements and possibly the algorithm) at least 1 δ: The problem described above is not a hypothesis selection problem in the strict sense of the term. However it is equivalent to the following hypothesis selection problem which has the same order of copy complexity. For i [m], let σi = P λk|ϕi k ϕi k| and {σi,j}j [M] an ε/10-covering of the set {ρ = P k µk|ϕi k ϕi k| : TV(λ, µ) ε/3}. Our problem is equivalent to the hypothesis selection problem for {Hi,j = {B(σi,j, ε/10)} {ρ : ρσi,j = σi,jρ}}i [m],j [M]. For simplicity, we use the first formulation of the problem and refer to it as (P). 4.1 Upper bound In this section, we present an adaptive algorithm for the problem (P) achieving a copy complexity strictly less than the lower bound which holds for all non-adaptive algorithms. The first step is to determine with high probability the closest quantum state σi to ρ, then it remains to approximate ρ by measuring it in its basis of diagonalization. Published in Transactions on Machine Learning Research (September/2023) Algorithm 1 Hypothesis selection problem (P). Input: N = O(d log(m/δ)/ε2) incoherent measurements on ρ and m quantum states σ1, . . . , σm. Output: Two quantum states σi and ρ satisfying with a probability at least 1 δ: σi ρ tr ε/3 and ρ ρ tr ε/10 . For all i = j [m], let Oi,j an observable satisfying σi σj tr = tr Oi,j(σi σj). For all i = j [m], let µi,j an ε/10 approximation of tr(ρOi,j) given by classical shadow tomography of (Huang et al., 2020). Let k = argminl maxi,j µi,j tr(σl Oi,j). Let M = {|ϕi ϕi|}i [d] the POVM corresponding to the basis of diagonalisation of σk . Measure ρ independently M = 200 log(2d+2/δ)/ε2 times using the POVM M and denote the outcomes {Ei}1 i M. return ρ = P i [d] j [M] 1Ej =i 4.1.1 Adaptive strategies. For all i = j [m], let Oi,j an observable satisfying σi σj tr = tr Oi,j(σi σj). In Sec. 3.1, we have seen that such observable Oi,j can be used to distinguish between ρ = σi and ρ = σj if one of the two hypotheses is satisfied. The quantum state σi has the property to minimize the 1-norm between ρ and {σi}i, so it is natural to take the state minimizing the statistics of expected value roughly maxi,j tr Oi,j(ρ σl) for l [m]. To do this, we need to approximate tr ρOi,j for all i = j. We can use the classical shadow tomography algorithm of (Huang et al., 2020) to predict all these events using a few number of copies of ρ: Theorem 4.2. (Huang et al., 2020) Let (O1, . . . , Om) be a tuple of observables. There is an algorithm using non-adaptive incoherent measurements requiring: N = O d log(m/δ) copies of ρ to predict tr(ρOi) to within ε-error for all i = 1, . . . , m with at most an error probability of δ. Once we find the quantum state σi , we know the basis of diagonalization of ρ. Hence we can learn the eigenvalues of the unknown quantum state ρ by measuring it using the measurement device corresponding to its basis of diagonalization. This requires O(d/ε2) incoherent copies. The algorithm is summarized in Alg. 1. This algorithm can be split in two phases. The first phase can be seen as an exploration phase, where the algorithm looks for the optimal eigen-basis. It collects (non-adaptively) the information given by the approximations µi,j of tr(ρOi,j). Then it uses this information to choose k = argminl maxi,j µi,j tr(σl Oi,j). After this step, in the second exploitation phase, the algorithm adapts its measurement device M according to the previous information k and measures only with the POVM corresponding the the eigen-basis of σk . Alg. 1 is δ-correct (detailed proof deferred to App. B). It can be split in two parts for which we independently upper bound the copy complexity. The first part relies on the shadow tomography algorithm of (Huang et al., 2020) and needs a number N1 = O d log(m(m 1)/δ) = O d log(m/δ) of copies of ρ. The second part requires a number N2 = 200 log(2d+1/δ) ε2 of copies of ρ. Finally, since N = N1 + N2 and δ = 1/3, we have proven the following proposition: Proposition 4.3. Alg. 1 has a total copy complexity satisfying: N = O d log(m) Published in Transactions on Machine Learning Research (September/2023) 4.1.2 Non-adaptive strategies. We can slightly modify Alg. 1 to have a non-adaptive algorithm for the problem (P) with incoherent measurements. It amounts to first measuring ρ in all the basis corresponding to the known quantum states (σi)i and preparing m approximated quantum states ( ρi)i. Then the tester can look for the closest quantum state σi and finally returns the approximated quantum state ρi . This non-adaptive algorithm has a copy complexity m N2 + N1 = O md + m log(1/δ) + d log(m/δ) This complexity is almost optimal for m d (see Prop. 4.5). However, it is no longer optimal for m d since md/ε2 d2/ε2. In that case, we can still design an almost optimal non-adaptive algorithm as follows: for each k [m], let {|ϕk i }i an orthonormal basis of diagonalization for σk. For each k [m] and B [m], let Ok B = P i B |ϕk i ϕk i |. We use the classical shadow tomography of (Huang et al., 2020) to predict (tr(ρOi,j))i,j [m] (tr(ρOk B)k [m],B [m] to within ε/40 simultaneously using O(d log(m2 + m2d)/ε2) = O((d2 + log(m))/ε2) copies of ρ. We find the closest quantum state σi to ρ the same way as the Alg. 1 does. Next, we look for a probability distribution λ satisfying for all B [m] : λ(B) µi B ε/40, where µi B is the prediction of shadow tomography algorithm for tr(ρOi B ). Such λ exists since the vector λ of eigenvalues of ρ satisfies the following property: tr(ρOi B ) = tr i [d] λi|ϕi i ϕi i | X i B |ϕi i ϕi i | i [d],j B λi| ϕi i |ϕi j |2 = X i B λi = λ(B), and λ(B) µi B = tr(ρOi B ) µi B ε/40. We can thus return the quantum state ρ = P i [d] λi|ϕi i ϕi i | as an approximation of ρ. We can verify that it is indeed an ε/10 approximation of ρ: i=1 |λi λi| = 2 max B [d] λ(B) λ(B) 2 max B [d] λ(B) µi B + 2 max B [d] µi B λ(B) 2ε/40 + 2ε/40 ε/10. The copy complexity of this algorithm is O((d2 + log(m))/ε2 which matches (up to logarithmic factors) the lower bound for m d. Proposition 4.4. There is a non-adaptive algorithm for the the hypothesis selection problem (P) with a total copy complexity satisfying: N = O min d2 + log(m) 4.2 Lower bound In this section, we derive lower bounds for the problem (P) both with adaptive and non-adaptive incoherent measurements. Note that the same lower bounds (up to constants) could be proven for sequential strategies as well. This is because if a sequential algorithm uses N copies in expectation, then by Markov s inequality, it uses at most 10 E(N) copies with probability at least 9/10. We start with a lower bound for non-adaptive algorithms that matches the copy complexity of the algorithm presented in Sec. 4.1.2. Proposition 4.5. There is a tuple of quantum states (σ1, . . . , σm) such that any learning algorithm with non-adaptive incoherent measurements requires N = Ω min md log(m)ε2 , d2 Published in Transactions on Machine Learning Research (September/2023) copies of ρ to approximate ρ to at most ε/10 with at least a probability 2/3. This result with m = d, together with the analysis of the adaptive Alg. 1 gives a nearly quadratic advantage for adaptive algorithms over non-adaptive ones. Sketch of the proof. We construct a large set of quantum states randomly as follows: for y {1, . . . , m}, σy = UyΛU y = 2 I d σm+1 y, where Uy is a d d unitary matrix Haar(d)-distributed and Λ is a diagonal matrix with entries (1 10ε)/d. Using the concentration inequality for Lipschitz functions of unitaries chosen according to the Haar measure, we can prove that this family is ε-separated with high probability: Lemma 4.6. Suppose that m exp(d2/3000). For y [m/2], let Uy Haar(d) and σy = UyΛU y. We have with at least a probability 9/10, for all y = z: σy σz tr > ε. Then, for each y, we construct an ε/10-separated family of quantum states on the sphere of center σy and radius ε/3 which have the same eigen-basis as σy. This can be done by taking random eigenvalues and using Hoeffding s inequality. This leads to a family of ecd states (for some constant c) that we denote by {ρx,y}x [ecd]. By definition of the problem (P), any δ-correct non-adaptive algorithm for the problem (P) can be used to distinguish between the states {ρx,y}x,y with probability at least 1 δ. Thus, we can use these quantum states to encode a message in [ecd] [m] to a quantum state ρ = ρx,y in the family constructed above. The decoder receives this unknown quantum state, performs non-adaptive incoherent measurements, and learns it. Therefore a δ-correct algorithm can decode with a probability of failure at most δ. By Fano s inequality, the encoder and decoder should share at least Ω(log(m) + d) bits of information. Lemma 4.7 ((Fano, 1961)). The mutual information between the encoder and the decoder is at least I 2/3 log(mecd) log(2) Ω(log(m) + d). The remaining and crucial part of the proof is to upper bound the mutual information for a non-adaptive algorithm. After some manipulations, the use of Jensen s inequality, and some elementary inequalities of the logarithm function, we obtain an upper bound on the mutual information I of the form: y [m/2],x [ecd] ϕ|(dρx,y I)|ϕ 2ε2 The next step is to show that the right hand side of the previous inequality cannot be bigger than O (Nε2 log(m) d with a probability at least 9/10. This is the object of the following lemma. Lemma 4.8. By writing ρx,y = I d Uy Ox,y U y and M = 2 m ecd , we have with at least a probability 9/10, for all unit vector |ϕ : x [ecd],y [m/2] ϕ|(dρx,y I)|ϕ 2 C log(m) d + 2 Ox,y 2 This lemma can be proven by considering a concentration inequality for the function (Uy)y 7 2 m ecd X x [ecd],y [m/2] ϕ|(dρx,y I)|ϕ 2 , then considering a 1/m-net on the unit sphere to deduce the required inequality for the previous function uniformly on the sphere. The proof uses techniques similar to the ones of (Haah et al., 2016) and (Chen et al., 2021). The detailed proof can be found in App. C.1. A similar proof strategy allows to derive a lower bound on the copy complexity of adaptive strategies. The result is stated in the next proposition. Published in Transactions on Machine Learning Research (September/2023) Proposition 4.9. There is a tuple of quantum states (σ1, . . . , σm) such that any learning algorithm with possibly adaptive incoherent measurements requires N = Ω d + log(m) copies of ρ to approximate ρ to at most ε/10 with at least a probability 2/3. The proof is similar to the one for non-adaptive strategies, with the minor difference that the adaptivity makes difficult to simplify some products, thus we cannot upper bound the mutual information as in Ineq. 2. We use instead a Cauchy Schwarz inequality to break the dependencies created by the adaptiveness of the algorithm. We obtain then an upper bound on the mutual information I O(Nε2). The detailed proof can be found in App. C.2. This proposition along with the analysis of Alg. 1 show that the near optimal copy complexity of the problem (P) using adaptive incoherent measurements is Θ d ε2 . This latter along with Prop. 4.5 imply the separation between adaptive and non-adaptive strategies for the problem (P) for m 1. In other words, knowing that the eigen-basis of the quantum state belongs to some family of bases gives an advantage to adaptive strategies since they can find the eigen-basis, and then focus on measuring the quantum state with the corresponding POVM. Up to our knowledge, this is the first example for which adaptive independent strategies outperform non-adaptive ones for testing quantum states. 5 Conclusion We have constructed hypothesis selection problems for which sequential adaptive strategies are more efficient than non-sequential non-adaptive ones. The problem for which the advantage is the most significant is the one presented in Sec. 4. It would be interesting to see if there are other natural problems for which such a separation exists. We conjecture the separation would be polynomial in m for the composite hypothesis selection problem: distinguishing between ρ {σ1, . . . , σm} and ρ {σm+1, . . . , σ2m} with high probability. Jayadev Acharya, Ibrahim Issa, Nirmal V Shende, and Aaron B Wagner. Estimating quantum entropy. IEEE Journal on Selected Areas in Information Theory, 1(2):454 468, 2020. Anurag Anshu and Srinivasan Arunachalam. A survey on the complexity of learning quantum states. ar Xiv preprint ar Xiv:2305.20069, 2023. Srinivasan Arunachalam and Ronald de Wolf. Guest column: A survey of quantum learning theory. ACM Sigact News, 48(2):41 67, 2017. Koenraad MR Audenaert, John Calsamiglia, Ramón Munoz-Tapia, Emilio Bagan, Ll Masanes, Antonio Acin, and Frank Verstraete. Discriminating states: The quantum chernoff bound. Physical review letters, 98 (16):160501, 2007. Costin Bădescu, Ryan O Donnell, and John Wright. Quantum state certification. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 503 514, 2019. Akshay Balsubramani and Aaditya Ramdas. Sequential nonparametric testing with the law of the iterated logarithm. ar Xiv preprint ar Xiv:1506.03486, 2015. Aleksandrs Belovs. Quantum algorithms for classical probability distributions. ar Xiv preprint ar Xiv:1904.02192, 2019. Igor Bjelaković, Jean-Dominique Deuschel, Tyll Krüger, Ruedi Seiler, Rainer Siegmund-Schultze, and Arleta Szkoła. A quantum version of sanov s theorem. Communications in mathematical physics, 260(3):659 671, 2005. Published in Transactions on Machine Learning Research (September/2023) Max Born. Zur Quantenmechanik der Stoßvorgänge. Zeitschrift fur Physik, 37(12):863 867, December 1926. doi: 10.1007/BF01397477. Shrigyan Brahmachari, Josep Lumbreras, and Marco Tomamichel. Quantum contextual bandits and recommender systems for quantum data. ar Xiv preprint ar Xiv:2301.13524, 2023. Costin Bădescu and Ryan O Donnell. Improved Quantum Data Analysis, pp. 1398 1411. Association for Computing Machinery, New York, NY, USA, 2021. ISBN 9781450380539. URL https://doi.org/10. 1145/3406325.3451109. Sebastien Bubeck, Sitan Chen, and Jerry Li. Entanglement is necessary for optimal quantum property testing. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pp. 692 703. IEEE, 2020. Sitan Chen, Jordan Cotler, Hsin-Yuan Huang, and Jerry Li. Exponential separations between learning with and without quantum memory. ar Xiv preprint ar Xiv:2111.05881, 2021. Sitan Chen, Brice Huang, Jerry Li, and Allen Liu. Tight bounds for quantum state certification with incoherent measurements. ar Xiv preprint ar Xiv:2204.07155, 2022a. Sitan Chen, Brice Huang, Jerry Li, Allen Liu, and Mark Sellke. Tight bounds for state tomography with incoherent measurements. ar Xiv preprint ar Xiv:2206.05265, 2022b. Sitan Chen, Jerry Li, and Ryan O Donnell. Toward instance-optimal state certification with incoherent measurements. In Conference on Learning Theory, pp. 2541 2596. PMLR, 2022c. Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Optimal identity testing with high probability. ar Xiv preprint ar Xiv:1708.02728, 2017. Ilias Diakonikolas, Themis Gouleakis, Daniel M Kane, John Peebles, and Eric Price. Optimal testing of discrete distributions with high probability. ar Xiv preprint ar Xiv:2009.06540, 2020. Robert M Fano. Transmission of information: A statistical theory of communications. American Journal of Physics, 29(11):793 794, 1961. Omar Fawzi, Nicolas Flammarion, Aurélien Garivier, and Aadil Oufkir. Sequential algorithms for testing closeness of distributions. Advances in Neural Information Processing Systems, 34, 2021. Omar Fawzi, Nicolas Flammarion, Aurélien Garivier, and Aadil Oufkir. Sequential algorithms for testing identity and closeness of distributions. ar Xiv preprint ar Xiv:2205.06069, 2022. Alexei A Fedotov, Peter Harremoës, and Flemming Topsoe. Refinements of pinsker s inequality. IEEE Transactions on Information Theory, 49(6):1491 1498, 2003. Aurélien Garivier and Emilie Kaufmann. Non-asymptotic sequential tests for overlapping hypotheses and application to near optimal arm identification in bandit models. ar Xiv preprint ar Xiv:1905.03495, 2019. András Gilyén and Tongyang Li. Distributional property testing in a quantum world. ar Xiv preprint ar Xiv:1902.00814, 2019. Christopher Granade, Christopher Ferrie, and Steven T Flammia. Practical adaptive quantum tomography. New Journal of Physics, 19(11):113017, 2017. Yinzheng Gu. Moments of random matrices and weingarten functions. Ph D thesis, 2013. Jeongwan Haah, Aram W. Harrow, Zhengfeng Ji, Xiaodi Wu, and Nengkun Yu. Sample-optimal tomography of quantum states. In STOC 16 Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pp. 913 925. ACM, New York, 2016. Aram W Harrow, Avinatan Hassidim, Debbie W Leung, and John Watrous. Adaptive versus nonadaptive strategies for quantum channel discrimination. Physical Review A, 81(3):032339, 2010. Published in Transactions on Machine Learning Research (September/2023) Fumio Hiai and Dénes Petz. The proper formula for relative entropy and its asymptotics in quantum probability. Communications in mathematical physics, 143(1):99 114, 1991. Wassily Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13 30, 1963. ISSN 0162-1459. URL http://links.jstor.org/sici?sici=0162-1459(196303)58: 301<13:PIFSOB>2.0.CO;2-D&origin=MSN. Hsin-Yuan Huang, Richard Kueng, and John Preskill. Predicting many properties of a quantum system from very few measurements. Nature Physics, 16(10):1050 1057, 2020. Richard Kueng and Christopher Ferrie. Near-optimal quantum tomography: estimators and bounds. New Journal of Physics, 17(12):123013, 2015. Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020. Yonglong Li, Christoph Hirche, and Marco Tomamichel. Sequential quantum channel discrimination. In 2022 IEEE International Symposium on Information Theory (ISIT), pp. 270 275. IEEE, 2022a. Yonglong Li, Vincent YF Tan, and Marco Tomamichel. Optimal adaptive strategies for sequential quantum hypothesis testing. Communications in Mathematical Physics, pp. 1 35, 2022b. Josep Lumbreras, Erkka Haapasalo, and Marco Tomamichel. Multi-armed quantum bandits: Exploration versus exploitation when learning properties of quantum states. Quantum, 6:749, 2022. William Matthews, Stephanie Wehner, and Andreas Winter. Distinguishability of quantum states under restricted families of measurements with an application to quantum data hiding. Communications in Mathematical Physics, 291(3):813 843, 2009. Elizabeth Meckes, Mark Meckes, et al. Spectral measures of powers of random matrices. Electronic communications in probability, 18, 2013. Ashley Montanaro and Ronald de Wolf. A survey of quantum property testing. ar Xiv preprint ar Xiv:1310.2035, 2013. Michael A Nielsen and Isaac Chuang. Quantum computation and quantum information, 2002. Michael Nussbaum and Arleta Szkoła. The chernoff lower bound for symmetric quantum hypothesis testing. 2009. Ryan O Donnell and John Wright. Quantum spectrum testing. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pp. 529 538, 2015. Yingkai Ouyang and Marco Tomamichel. Learning quantum graph states with product measurements. In 2022 IEEE International Symposium on Information Theory (ISIT), pp. 2963 2968. IEEE, 2022. Stefano Pirandola, Riccardo Laurenza, Cosmo Lupo, and Jason L Pereira. Fundamental limits to quantum channel discrimination. npj Quantum Information, 5(1):50, 2019. Markus Rambach, Akram Youssry, Marco Tomamichel, and Jacquiline Romero. Efficient quantum state tracking in noisy environments. Quantum Science and Technology, 8(1):015010, 2022. Farzin Salek, Masahito Hayashi, and Andreas Winter. Usefulness of adaptive strategies in asymptotic quantum channel discrimination. Physical Review A, 105(2):022419, 2022. Pranab Sen. Random measurement bases, quantum state distinction and applications to the hidden subgroup problem. In 21st Annual IEEE Conference on Computational Complexity (CCC 06), pp. 14 pp. IEEE, 2006. Roman Stricker, Michael Meth, Lukas Postler, Claire Edmunds, Chris Ferrie, Rainer Blatt, Philipp Schindler, Thomas Monz, Richard Kueng, and Martin Ringbauer. Experimental single-setting quantum state tomography. PRX Quantum, 3(4):040310, 2022. Published in Transactions on Machine Learning Research (September/2023) Abraham Wald. Sequential tests of statistical hypotheses. The annals of mathematical statistics, 16(2): 117 186, 1945. John Watrous. The theory of quantum information. Cambridge university press, 2018. Akram Youssry, Christopher Ferrie, and Marco Tomamichel. Efficient online quantum state estimation using a matrix-exponentiated gradient method. New Journal of Physics, 21(3):033006, 2019. Nengkun Yu. Sample optimal quantum identity testing via pauli measurements. ar Xiv preprint ar Xiv:2009.11518, 2020. Shengjia Zhao, Enze Zhou, Ashish Sabharwal, and Stefano Ermon. Adaptive concentration inequalities for sequential decision problems. Advances in Neural Information Processing Systems, 29:1343 1351, 2016. Published in Transactions on Machine Learning Research (September/2023) In this appendix, we give more details and the technical lemmas needed for the proofs of the article s main results. We start by stating general tools to reduce quantum testing to classical testing in App. A along with the proofs for upper/lower bounds on the problem of discriminating two quantum states using sequential/non adaptive strategies. Then we move to prove the correctness of Alg. 1 in App. B. Next, we prove the lower bounds for the hypothesis selection problem (P) in App. C. Finally, we group the technical lemmas we often need in App. D. A Reduction to classical testing problems One of the main difficulties in quantum testing is the freedom in the choice of measurement at each step. So, to simplify the analysis of quantum testing problems, we provide some techniques permitting the reduction to classical testing problems which are well understood. A.1 Testing a single hypothesis vs a single hypothesis A.1.1 Non-sequential strategies We start by the simple remark that for diagonal quantum states, incoherent measurements can be seen as a post-processing of samples from the distribution given by the diagonal elements of the quantum state. Moreover, the stochastic map for this post-processing does not depend on the quantum state. Lemma A.1. Let D be a discrete distribution and ρ its corresponding diagonal quantum state. Let M be a POVM. Measuring the quantum state ρ with the POVM M can be seen as post-processing of samples from the distribution D. Proof. Let M = {M i}i [k]. For each i [k], we can write x,y M i x,y|x y|. By Born s rule, the probability distribution of the outcomes of the measurement of ρ by the POVM M is: M(ρ) = {tr(ρM i)}i d = x Dx|x x| X x,y M i x,y|x y| x M i x,x Dx where P = (M i x,x)i,x is a stochastic matrix. Indeed, M i 0 implies M i x,x = x|M i|x 0 and P i M i = I implies X i M i x,x = X i x|M i|x = x|x = 1. Note that the post-processing map is independent of the quantum state, hence we can generalize the statement to any number of discrete distributions. Corollary A.2. Let D1 and D2 be two discrete distributions and ρ1 and ρ2 their corresponding diagonal quantum states. Let M be a POVM. Measuring the quantum state ρ1 (resp. ρ2) with the POVM M can be seen as post-processing (independent of the quantum states) of samples from the distribution D1 (resp. D2). Published in Transactions on Machine Learning Research (September/2023) We move now to the proof of the following upper and lower bound on discriminating two quantum states using non sequential incoherent measurements: Proposition A.3. There is a non-sequential algorithm for testing H1 : ρ = σ1 vs H2 : ρ = σ2 using a number of measurements N 2 log(1/δ) Moreover, there exists two quantum states σ1 and σ2 satisfying σ1 σ2 tr = ε so that every non-sequential algorithm distinguishing between H1 : ρ = σ1 and H2 : ρ = σ2 needs a number of measurements satisfying lim inf δ 0 N log(1/δ) max 1 KL(1/2 + αε 1/2), 1 KL(1/2 βε 1/2) where α (0, 1) and β (0, 1) are defined such that KL(1/2 + αε 1/2) = KL(1/2 + αε 1/2 + ε) and KL(1/2 βε 1/2) = KL(1/2 βε 1/2 ε). Proof. The correctness of the batch algorithm presented in Sec. 3.1.1 can be done using Chernoff-Hoeffding inequality, if ρ = σ1 the error probability can be upper bounded as follows: P (S tr(σ2O) ε/2) = P (S tr(σ1O) ε/2 ε) P (S tr(σ1O) ε/2) exp( N KL(tr(σ1O) ε/2 tr(σ1O))). On the other hand, if ρ = σ2: P (S tr(σ2O) ε/2) exp( N KL(tr(σ2O) + ε/2 tr(σ2O))). Therefore to ensure that the batch algorithm is δ-correct we need N to satisfy ( log(1/δ) KL(tr(σ1O) ε/2 tr(σ1O)), log(1/δ) KL(tr(σ2O) + ε/2 tr(σ2O)) Moreover by Pinsker s inequality (Fedotov et al., 2003), the right hand side is upper bounded by: ( log(1/δ) KL(tr(σ1O) ε/2 tr(σ1O)), log(1/δ) KL(tr(σ2O) + ε/2 tr(σ2O)) For the lower bound, let d = 2, σ1 = I/2 and σ2 = diag((1 + 2ε)/2, (1 2ε)/2) = I/2 + εO where O = diag(1, 1). Let A be a non sequential algorithm that distinguishes between H1 and H2 using N measurements. Let the ith measurement be Mi = (I Oi, Oi). Measuring ρ = σ1 (resp. σ2) with the POVM Mi = (I Oi, Oi) is equivalent to sampling from Ber(tr(Oi)/2) (resp. Ber(tr(Oi)/2 + εtr(Oi O)). The optimal sample complexity of testing identity: H0 : p = tr(Oi)/2 vs H1 : p = tr(Oi)/2 + εtr(Oi O) is asymptotically equivalent to (when ε 0) (Fawzi et al., 2022): 8(tr(Oi)/2)(1 tr(Oi)/2) log(1/δ) ε2tr(Oi O)2 = 4tr(Oi)(1 tr(Oi)/2) log(1/δ) ε2tr(Oi O)2 . Let s write Oi = λ1 β β λ2 , we have tr(Oi O) = λ1 λ2. Since 0 Oi I, we have 0 λi 1 for i = 1, 2. 4tr(Oi)(1 tr(Oi)/2) log(1/δ) ε2tr(Oi O)2 = 2(λ1 + λ2)(2 λ1 λ2) log(1/δ) ε2(λ1 λ2)2 2 log(1/δ) This latter inequality is true since (λ1 + λ2)(2 λ1 λ2) (λ1 λ2)2 λ1(1 λ1) + λ2(1 λ2) 0, Published in Transactions on Machine Learning Research (September/2023) with equality iff λi = 0, 1 for i = 1, 2. The cases λ1 = λ2 are eliminated because the sample complexity has a denominator (λ1 λ2). It remains the cases λ1 = 1 λ2 {0, 1} for which Oi is a rank 1 projector. Therefore, the optimal measurement reduces to testing uniform: Ber(1/2) vs Ber(1/2 ε). This problem requires a sample complexity asymptotically equivalent to 2 log(1/δ) ε2 . Note that we can also use Lemma A.1 to make the desired reduction. We show how this reduction works for entangled strategies. We have σ N 1 = I 2N and σ N 2 = 1 2N diag (1 + 2ε)|i|(1 2ε)N |i| i {0,1}N where |i| = i1 + + i N. By Lemma A.1, measuring the quantum states σ N 1 (resp. σ N 2 ) can be seen as post-processing of samples from the distribution D1 = {1/2N}i {0,1}N (resp. D2 = (1/2 ε)|i|(1/2 + ε)N |i| i {0,1}N ). Observe that a sample i = (i1, . . . , i N) D1 is given by N i.i.d. random variables {ik Ber(1/2)}k [N]. Similarly, a sample i = (i1, . . . , i N) D2 is given by N i.i.d. random variables {ik Ber(1/2 ε)}k [N]. Therefore, distinguishing σ1 from σ2 using N entangled copies can be reduced to testing Ber(1/2) vs Ber(1/2 ε) using N samples. This latter requires a number of samples (Fawzi et al., 2022): lim inf δ 0 N log(1/δ) max 1 KL(1/2 + αε 1/2), 1 KL(1/2 βε 1/2) where α (0, 1) and β (0, 1) are defined such that KL(1/2 + αε 1/2) = KL(1/2 + αε 1/2 + ε) and KL(1/2 βε 1/2) = KL(1/2 βε 1/2 ε). A.1.2 Sequential strategies Discriminating two quantum states using sequential strategies can be done with fewer measurements than non-sequential strategies. Since the reduction to lower bound is similar, we give only the proof for the upper bound. Proposition A.4. There is a sequential algorithm for testing H1 : ρ = σ1 vs H2 : ρ = σ2 using an expected number of measurements: E(N) log(1/δ) 2ε2 + log(1/δ)2/3 + 2 log(1/δ)1/3 + log(log(1/δ)/2ε2) + 1 Moreover, there are two quantum states σ1 and σ2 satisfying σ1 σ2 = ε so that every sequential algorithm distinguishing between H1 : ρ = σ1 and H2 : ρ = σ2 with high probability needs in expectation a number E(N) log(1/δ) min {KL (1/2 ε 1/2)} of measurements. Proof. The algorithm is presented in Sec. 3.1.2. Correctness. Let s start by showing that this algorithm is δ-correct. To this end, we need a time uniform concentration inequality which can be obtained by Hoeffding inequality along with the union bound, recall that St = (Pt i=1 Xi)/t and Xi Ber(tr(ρO)): P ( t 1 : |St E (St) | > ϕ(δ, t)) X t 1 P (|St E (St) | > ϕ(δ, t)) t 1 exp( 2tϕ(δ, t)2) Published in Transactions on Machine Learning Research (September/2023) Complexity. To obtain an upper bound on the complexity, we use the following lemma: Lemma A.5. N a random variable taking values in N, we have for all k N t k P(N t) . This inequality can be proved by writing E(N) = P t 0 P(N t) then upper bounding the first k terms by 1. Let α (0, 1) and k the smallest integer so that for all t k : ϕ(δ, t) αε. We focus only on the case ρ = σ1 (the other being similar), the expected stopping time of the algorithm can be controlled as follows: E (N) k + X t k P(St 1 < tr(σ2O) + ϕ(δ, t 1)) t k 1 P(St tr(σ1O) < ε + αε) t k 1 P(St tr(σ1O) < (1 α)ε) t k 1 2 exp( 2t(1 α)2ε2) k + 2 exp( 2(k 1)(1 α)2ε2) 1 exp( 2(1 α)2ε2) k + 2 exp( 2(k 1)(1 α)2ε2) On the other hand we have ϕ(δ, k) αε and ϕ(δ, k 1) αε so 2(k 1)α2ε2. k 1 log(1/δ) 2α2ε2 + 2log(log(1/δ)/(αε)2) E (N) log(1/δ) 1 2α2ε2 + 2log(log(1/δ)/(αε)2) log(1/δ)α2ε2 + 1 log(1/δ) + 2 exp( 2(k 1)(1 α)2ε2) log(1/δ)(1 α)2ε2 , and by taking δ 0, then α 1 we obtain: lim sup δ 0 E (N) log(1/δ) 1 2ε2 . A non asymptotic upper bound can be obtained by choosing α = (1 + log(1/δ) 1/3) 1: E(N) log(1/δ) 2ε2 + log(1/δ)2/3 + 2 log(1/δ)1/3 + log(log(1/δ)/2ε2) + 1 The lower bound follows from the previous reduction to testing Ber(1/2) vs Ber(1/2 ε) and (Fawzi et al., 2022). Published in Transactions on Machine Learning Research (September/2023) A.1.3 Asymptotics of the quantum relative entropy and Chernoff divergence. Recall that we consider the particular states σ1 = I2 2 and σ2 = diag( 1 2 ε). An asymptotic (when ε 0) of the quantum relative entropy between σ1 and σ2 is given by: D(σ1 σ2) = 1 2 log 1 1 + 2ε 2 log 1 1 2ε D(σ2 σ1) = 1 2 + ε log(1 + 2ε) + 1 2 ε log(1 2ε) ε 0 2ε2. On the other hand, an asymptotic (when ε 0) of the quantum Chernoff divergence between the states σ1 and σ2 can be upper bounded using the inequality log(x) (x 1) (x 1)2 valid for x ( 1 C(σ1, σ2) = sup 0 s 1 log(tr(σs 1σ1 s 2 )) = sup 0 s 1 log 1 2(1 + 2ε)s + 1 2(1 + 2ε)s 1 2(1 2ε)si + h1 2(1 + 2ε)s + 1 2(1 2ε)s 1 i2 sup 0 s 1 [2s(1 s)ε2 + o(ε2)] + o(ε4) ε 0 ε2 Moreover, it can be lower bounded using the inequality log(x) x 1: C(σ1, σ2) = sup 0 s 1 log 1 2(1 + 2ε)s + 1 2(1 + 2ε)s 1 sup 0 s 1 [2s(1 s)ε2 + o(ε2)] ε 0 ε2 Finally C(σ1, σ2) ε 0 ε2 A.2 Testing a single hypothesis vs a multiple hypothesis In this section, we relate the TV-distance between the distributions obtained after the measurements and the 1-norm between the quantum states. Lemma A.6. Let U Cd d be a Haar-random unitary matrix of columns {|Ui }1 i d, M(U) = {|Ui Ui|}i is a POVM and there exists a universal constant c such that for all quantum states ρ and σ we have: E [TV(ρ(M), σ(M))] c ρ σ tr Proof. Let ξ = ρ σ, we have U|ei = |Ui and we use Weingarten Calculus D.1 and D.2 to calculate E Ui|ξ|Ui 2 = E [ Ui|ξ|Ui Ui|ξ|Ui ] = E [tr(ξ|Ui Ui|ξ|Ui Ui|)] = E [tr(ξU|ei ei|U ξU|ei ei|U )] = E [tr(U ξU|ei ei|U ξU|ei ei|)] α,β S2 Wg(βα 1, d)trβ 1(ξ, ξ)trα(|ei ei|, |ei ei|) = 1 d(d + 1)tr(ξ2). Published in Transactions on Machine Learning Research (September/2023) E Ui|ξ|Ui 4 = E [ Ui|ξ|Ui Ui|ξ|Ui Ui|ξ|Ui Ui|ξ|Ui ] = E [tr(ξ|Ui Ui|ξ|Ui Ui|ξ|Ui Ui|ξ|Ui Ui|)] = E [tr(ξU|ei ei|U ξU|ei ei|U ξU|ei ei|U ξU|ei ei|U )] = E [tr(U ξU|ei ei|U ξU|ei ei|U ξU|ei ei|U ξU|ei ei|)] α,β S4 Wg(βα 1, d)trβ 1(ξ, ξ, ξ, ξ)trα(|ei ei|, |ei ei|, |ei ei|, |ei ei|) = 1 d(d + 1)(d + 2)(d + 3)(6tr(ξ2)2 + 6tr(ξ4)). d(d + 1)(d + 2)(d + 3)tr(ξ2)2. We can now conclude by Hölder s inequality: 2E [TV(ρ(M), σ(M))] = i=1 E [| Ui|ξ|Ui |] (E [ Ui|ξ|Ui 2])3 E [ Ui|ξ|Ui 4] (d 1(d + 1) 1tr(ξ2))3 c d 1(d + 1) 1(d + 2) 1(d + 3) 1tr(ξ2)2 This Lemma is about the expected TV distance. Actually, we can prove that we have the same inequality with high probability. Lemma A.7. Let l = Θ (log(1/δ)) and U 1, U 2, . . . , U l Cd d be Haar-random unitary matrices of columns {|U j i }1 i d,1 j l, M = { 1 l |U j i U j i |}i,j is a POVM and there exists a universal constant c such that for all quantum states ρ and σ we have with a probability at least 1 δ: TV(ρ(M), σ(M)) c ρ σ tr Published in Transactions on Machine Learning Research (September/2023) Proof. Let f(U) = TV(ρ(M), σ(M)), we first show that f is Lipschitz by using the triangle and Cauchy Schwarz inequalities: 2|f(U) f(V )| = 1 i d,1 j l l |tr(|U j i U j i |ξ)| |tr(|V j i V j i |ξ)| 1 i d,1 j l tr((|U j i U j i | |V j i V j i |)ξ) 1 i d,1 j l tr((|U j i U j i | |V j i V j i |)2) 1 i d,1 j l tr((|U j i U j i | |V j i V j i |)2) 1 j l tr((U j V j)2) tr(ξ2) U V 2,HS, hence f is L = q tr(ξ2)-Lipschitz, therefore by Theorem D.3: P |f(U) E (f(U))| > c tr(ξ2) e dc2tr(ξ2) 48L2 = e lc2/24 = δ/2, for l = 24 log(2/δ)/c2. Finally with high probability (at least 1 δ/2) we have TV(ρ(M), σ(M)) E (TV(ρ(M), σ(M))) | TV(ρ(M), σ(M)) E (TV(ρ(M), σ(M))) | 2 ρ σ tr r , where r is the rank of (ρ I/d). Once we have the lower bound on the TV distance between the distributions obtained after performing the measurements, we can deduce upper bounds on sequential algorithms for testing identity depending on the rank of ρ or ρ I/d. Dependence in the rank of ρ I/d From the previous lower bound on the TV-distance, we can achieve an upper bound using the sequential tester of (Fawzi et al., 2022): O min n1/2 log(1/δ)1/2 d, ρ I/d 2})2 , log(1/δ) (max{ε/ d, ρ I/d 2})2 = O d3/2 log(1/δ) max{ε2, d ρ I/d 2 2} = O min d3/2 log(1/δ) ε2 , rd1/2 log(1/δ) where r is the rank of (ρ I/d) and we use Cauchy Schwarz to obtain the latter inequality. Published in Transactions on Machine Learning Research (September/2023) Dependence in the rank of ρ The proof of Lemma 3.5 permits to deduce that with high probability: TV(P, Un) c ρ I/d 2 c where r is the rank of ρ supposed to be less than d/2 and we use Cauchy Schwarz inequality. Therefore we can test whether ρ = σ or ρ σ tr > ε with probability at least 1 δ using d1/2 log(1/δ) max n ε/ = O min d3/2 log(1/δ) ε2 , rd1/2 log(1/δ) copies of ρ. B Analysis of Alg. 1 In this section we prove the correctness of the Alg. 1. We need to show that with probability at least 1 δ/2, Alg. 1 finds the closest quantum state σi to ρ. Lemma B.1. For all i = j [m], let µi,j an ε/10 approximation of tr(ρOi,j) given by classical shadow tomography of (Huang et al., 2020). Let k = argminl maxi,j µi,j tr(σl Oi,j). We have with at least a probability 1 δ/2: ρ σk tr ε/3. Proof. Classical shadow tomography of (Huang et al., 2020) permits to have the following approximations i = j [m] : |µi,j tr(ρOi,j)| ε/10, with a probability at least 1 δ/2 using only N = O(d log(m)/ε2) copies of ρ. Let σi the closest quantum state to ρ. We want to prove that with high probability k = i . We have for all l = i : σi σl tr > ε hence: max i,j µi,j tr(σl Oi,j) µi ,l tr(σl Oi ,l) tr(ρOi ,l) tr(σl Oi ,l) ε/10 tr(σi Oi ,l) tr(σl Oi ,l) + tr(ρOi ,l) tr(σi Oi ,l) ε/10 σi σl tr ρ σi tr ε/10 On the other hand max i,j µi,j tr(σk Oi,j) max i,j µi,j tr(σi Oi,j) max i,j tr(ρOi,j) tr(σi Oi,j) + ε/10 ρ σi tr + ε/10 Therefore, with high probability, k cannot be different from i . Published in Transactions on Machine Learning Research (September/2023) Once we know, with high probability, the closest quantum state to ρ we can read its basis and use it to to learn ρ. The following lemma indicates how to construct this approximation along with the number of copies/measurements needed for this learning task. Lemma B.2. Let ρ = Pd i=1 λi|ϕi ϕi|. Let A1, . . . , AN the outcomes of the measurement of ρ independently by the POVM M = {|ϕi ϕi|}i. The quantum state PN j=1 1Aj=i is ε/10-close in 1-norm to ρ with a probability at least 1 δ/2 if N = 200 log(2d+2/δ)/ε2. Proof. ρ is a quantum state so it is a Hermitian matrix positive semi definite of trace 1. Hence, we can write ρ = Pd i=1 λi|ϕi ϕi| where{λi}i is a probability distribution and {ϕi}i is an orthonormal basis. Therefore Pd i=1 |ϕi ϕi| = I and M is a valid POVM. Measuring ρ via the POVM M is equivalent to sampling from the distribution {tr(|ϕi ϕi|ρ}i = {P j λjtr(|ϕi ϕi||ϕj ϕj|}i = {λi}i hence A1, . . . , AN i.i.d. {λi}i. On the other hand ρ and ρ have the same basis of diagonalization so the 1 norm between them is simply i=1 λi|ϕi ϕi| i=1 λi|ϕi ϕi| i=1 (λi λi)|ϕi ϕi| i=1 |λi λi| = 2 TV(λ, λ), where { λi}i = { PN N }i. It is well known that the TV distance can be written as: TV(λ, λ) = max B [d]( λ(B) λ(B)). Chernoff-Hoeffding((Hoeffding, 1963)) inequality implies for all B [d] : PN j=1 1Aj B Therefore by union bound we obtain P ( ρ ρ tr > ε/10) = P 2 TV(λ, λ) > ε/10 PN j=1 1Aj B 2d+1 exp 2N ε Finally for N = 200 log(2d+1/δ)/ε2, we have with at least a probability 1 δ : ρ ρ tr ε/10. Grouping the two previous Lemmas, Alg. 1 finds the closest quantum state σi and returns an ε/10approximation of ρ with a probability at least 1 (δ/2 + δ/2) = 1 δ. Finally, Alg. 1 is δ-correct. Published in Transactions on Machine Learning Research (September/2023) C Lower bound for the problem (P) In this section, we focus on proving lower bounds for hypothesis selection problem (P) for non-adaptive and adaptive strategies. C.1 Non-adaptive strategies We recall the theorem we want to prove: Theorem C.1. There is a tuple of quantum states (σ1, . . . , σm) such that any learning algorithm for problem (P) with non-adaptive incoherent measurements requires N = Ω min md log(m)ε2 , d2 copies of ρ. Construction For the construction, we choose m unitary matrices {Uy}y chosen randomly from the Haar(d) distribution, then we choose for each unitary (orthonormal basis) random eigenvalues: Lemma C.2. Let m exp(d2/3000). Let {Uy}y [m/2] m/2 unitaries Haar(d) distributed. For y [m/2], let σy = 2I/d σm+1 y = UyΛU y where Λ = I d + diag {λi}i [d] = diag n 1+( 1)d10ε with a probability at least 9/10, for all y = z [m]: σy σz tr ε. Proof. Let y = z [m/2] and 0 O I satisfying tr diag {λi}i [d] O = 5ε. Let f(U) = tr U diag {λi}i [d] U diag {λi}i [d] O where U Haar(d), we have E (f(U)) = tr diag {λi}i [d] O = 5ε (see Weingarten Calculus D.1). The function f is 20ε d -Lipschitz: |f(U) f(V )| = |tr(U diag {λi}i [d] U diag {λi}i [d] )O tr(V diag {λi}i [d] V diag {λi}i [d] )O| |tr(U diag {λi}i [d] U V diag {λi}i [d] V )O| (U V ) diag {λi}i [d] U tr + V diag {λi}i [d] (U V ) tr U V 2 diag {λi}i [d] U 2 + V diag {λi}i [d] 2 (U V ) 2 d ( U V 2 diag ( 1)i U 2 + V diag ( 1)d 2 (U V ) 2) where we have used Cauchy Schwarz inequality. Using the fact that the Haar distribution is invariant under the multiplication by a unitary and the concentration inequality for Lipschitz functions D.3, the probability that the states {σy}y [m/2] are not ε-separated Published in Transactions on Machine Learning Research (September/2023) is upper bounded by P ( y, z [m/2] : σy σz tr ε) m2 4 P ( σy σz tr ε) 4 P Uy diag {λi}i [d] U y Uz diag {λi}i [d] U z tr ε 4 P U z Uy diag {λi}i [d] U y Uz diag {λi}i [d] tr ε 4 P U diag {λi}i [d] U diag {λi}i [d] tr ε 4 P tr U diag {λi}i [d] U diag {λi}i [d] O ε 4 P (f(U) E (f(U)) ε 5ε) 4 P (E (f(U)) f(U) 4ε) 4 exp 16d2ε2 which is smaller than 1/10 if m2 2 exp(d2/1000)/5. For the case when y [m/2] and z [m] \ [m/2], let x = m + 1 z [m/2] we have σy σz tr = σy 2I/d + σx tr σx 2I/d + σx tr σy σx tr 2 σx I/d tr + tr σy σx tr ε. Finally, for the case when y [m]\[m/2] and z [m]\[m/2], let y = m+1 y [m/2] and z = m+1 z [m/2] we have σy σz tr = 2I/d σy 2I/d + σz tr σy σz tr ε. We have shown how to construct the unitaries, we move to prove the existence of the eigenvalues: Lemma C.3. There exists family of quantum states {ρx,y}|x| [ecd],y [m] (where c is a universal constant) such that for each y [m], {ρx,y}|x| [ecd] is ε/5-separated and commute. Proof. We start by writing the eigen-decomposition of the known quantum states σy as i=1 λy i |i i| We claim that we can choose αx i to construct an ε/5-separated family of mecd quantum states (c is a constant to be chosen later) of the form λy i + αx i (2ε/3) Published in Transactions on Machine Learning Research (September/2023) for |x| [ecd] and y [m]. Note that for convenience of notation, the labels x can be positive and negative. Moreover the distance between ρx,y and σy is exactly: ρx,y σy tr = ε Concretely, we look for {αx i }1 i d,1 |x| ecd/2 such that 1. αx i = 1, 2. α x i = αx i , 3. αx i + αx i+d/2 = 0 (we suppose d is even) and 4. x = x : Pd/2 i=1 |αx i αx i | > d(1/2 1/200). The third point ensures that ρ has trace 1 while the fourth one implies ρx,y ρx ,y tr > ε/3 ε/100 > ε/5. Starting by the simple quantum states ρ1,y = σy + Pd/2 i=1 (2ε/3) d Uy|i i|U y Pd i=d/2+1 (2ε/3) d Uy|i i|U y and ρ 1,y = 2I/d ρ1,y = σm+1 y Pd/2 i=1 (2ε/3) d Uy|i i|U y + Pd i=d/2+1 (2ε/3) d Uy|i i|U y and we suppose that we have constructed Q an ε-separated family of the form described above of cardinality M < ecd. Let α1, . . . , αd/2 i.i.d. random variables taken values in { 1} with probability 1/2 each. We have by Hoeffding s inequality i=1 |αx i αi| d(1/2 1/200) OR i=1 |α x i αi| d(1/2 1/200) i=1 |αx i αi| d(1/2 1/200) OR i=1 |αx i + αi| d(1/2 1/200) i=1 1αi=αx i > d/4 + d/400 i=1 1αi=αx i d/4 d/400 which is strictly less than 1 if M < ed/2000. So let s take c = 1/2000, we deduce that i=1 |αx i αi| > d(1/2 1/200) therefore there exists some α { 1}d verifying the desired conditions. We can repeat this construction until Card(Q) ecd. We have constructed the ε-separated family of quantum states {σy}y an the corresponding ε/5-separated {ρx,y}x for all y, we can use tools from communication theory to deduce the lower bound (see (Haah et al., 2016)). Alice encodes a message (x, y) {1, . . . , ecd} [m] in ρx,y and sends it to Bob. To read the message, Bob tries to approximate the quantum state that he received from Alice. We suppose that Bob can approximate (up to ε/10 in trace norm) a state ε/3 close to one of {σy} and diagonalized in the same basis of this quantum state with a probability at least 2/3. Bob uses N copies to decode Alice s message and returns (x , y ) {1, . . . , ecd} [m] , therefore by Fano s inequality ((Fano, 1961)) we have the following lower bound on the mutual information: Lemma C.4 (Fano). The mutual information can be lower bounded: I(X, Y : X , Y ) 2/3 log(mecd) log(2) Ω(log(m) + d). Published in Transactions on Machine Learning Research (September/2023) On the other hand we can upper bound the mutual information between (X, Y ) and (X , Y ). Let I1, . . . , IN be the outcomes of a non adaptive algorithm solving the problem (P). By using the data-processing inequality for the Kullback-Leibler divergence and the fact that every non adaptive algorithm for the problem (P) can be used as a 2/3-correct decoder we can upper bound the mutual information as follows: Lemma C.5 (Data-processing). . The mutual information between (X, Y ) and (X , Y ) is smaller than the mutual information between (X, Y ) and (I1, . . . , IN): I(X, Y : X , Y ) I(X, Y : I1, . . . , IN). The next step is to upper bound the mutual information between (X, Y ) and (I1, . . . , IN). This latter depends on the quantum states {σy}y, therefore it is a random variable. We will show that with at least a probability 9/10, it is upper bounded by an expression involving the parameters of the problem. First we start by proving the following upper bound relating the mutual information with the unitaries {Uy}y defining the quantum states {σy}y. Lemma C.6. For all unitaries {Uy}y, we have: I(X, Y : I1, . . . , IN) 4N sup ϕ, ϕ 2 1 |x|,y m/2 ϕ|Uy Ox,y U y|ϕ 2ε2, where for (x, y), Ox,y = U y(dρx,y I)Uy. Proof. We suppose that the eigenvalues of σy have the form λy i = 1 + 10βy i ε d , where βy i = 1 satisfying P i βy i = 0 (exactly half are equal to +1) and βy = βm+1 y (we suppose m even). The diagonalizing matrices {Uy}y are chosen randomly so as they satisfy Um+1 y = Uy for all y m/2 and other conditions to be specified later. Let us denote by Mt the POVM used at step t. Without loss of generality, we can suppose that the non-adaptive algorithm performs only measurements of the following form: Mt = {|ϕt i ϕt i|}i. where we have the condition P i |ϕt i ϕt i| = I implying for all i and t: ϕt i 2 1. Let M = 2mecd, we can write the mutual information as follows: I(X, Y : I1, . . . , IN) = H x,y tr((ρx,y) N N t=1 Mt) x,y H tr((ρx,y) N N t=1 Mt) t=1 ϕt it|ρx,y|ϕt it log QN t=1 ϕt it|ρx,y|ϕt it x,y QN t=1 ϕt it|ρx,y|ϕt it where Σ1 and Σ2 are defined as follows: t=1 ϕt it|ρx,y|ϕt it log t=1 ϕt it|dρx,y|ϕt it t=1 ϕt it|ρx,y|ϕt it log t=1 ϕt it|dρx,y|ϕt it ρx,y = Uy diag ( 1 + (10βy i + 2αx i /3)ε d Published in Transactions on Machine Learning Research (September/2023) we can write ϕt it|ρx,y|ϕt it = 1 + ut,x,y it ε d , where ut,x,y it = ϕt it|Uy diag {10βy i + 2αx i /3}i [d] Uy|ϕt it ( 11, 11). Denote by Ox,y = diag {10βy i + 2αx i /3}i [d] , we remark that it=1 ut,x,y it = it=1 ϕt it|Uy diag {10βy i + 2αx i /3}i [d] Uy|ϕt it = tr Uy diag {10βy i + 2αx i /3}i [d] Uy = tr diag {10βy i + 2αx i /3}i [d] = i=1 10βy i + 2αx i /3 = 0. Moreover, the couples of quantum states (ρx,y, ρ x,y) and (ρx,y, ρx,m+1 y) are symmetric with respect to I/d by the construction of (αx i )i,x and (βx i )i,x hence ut, x,m+1 y it = ϕt it|Um+1 y diag n 10βm+1 y i + 2α x i /3 o Um+1 y|ϕt it = ϕt it|Uy diag { 10βy i 2αx i /3}i [d] Uy|ϕt it = ϕt it|Uy diag {10βy i + 2αx i /3}i [d] Uy|ϕt it = ut,x,y it . Suppose that ε 0.05. We can start by controlling Σ2 using Jensen s inequality: 1 + ut,x,y it ε d t=1 (1 + ut,x,y it ε) 1 + ut,x,y it ε d t=1 (1 + ut,x,y it ε) 1 + ut,x,y it ε d x,y,t log 1 + ut,x,y it ε ! 1 + ut,x,y it ε d |x|,y m/2,t log 1 + ut,x,y it ε + log 1 ut,x,y it ε 1 + ut,x,y it ε d |x|,y m/2,t log 1 (ut,x,y it )2ε2 Published in Transactions on Machine Learning Research (September/2023) Now, we can use the inequality log(1 x2) 2x2 for |x| 1/ 1 + ut,x,y it ε d |x|,y m/2,t log 1 (ut,x,y it )2ε2 1 + ut,x,y it ε d |x|,y m/2,t 2(ut,x,y it )2ε2 1 + ut,x,y it ε d t sup ϕ, ϕ 2 1 |x|,y m/2 2 ϕ|Uy Ox,y U y|ϕ 2ε2 N sup ϕ, ϕ 2 1 |x|,y m/2 2 ϕ|Uy Ox,y U y|ϕ 2ε2. Using the fact that P it ut,x,y it = 0 for all t, x, y along with the inequality (1+x) log(1+x)+(1 x) log(1 x) 2x2 for |x| 1/ 2 we can upper bound the first sum Σ1: 1 + ut,x,y it ε d 1 + ut,x,y it ε ! 1 + ut,x,y it ε d k log 1 + uk,x,y ik ε i1,...,ik 1,ik+1,...,i N 1 + ut,x,y it ε d log 1 + uk,x,y ik ε 1 + uk,x,y ik ε d log 1 + uk,x,y ik ε |x|,y m/2,k 1 + uk,x,y ik ε log 1 + uk,x,y ik ε + 1 uk,x,y ik ε log 1 uk,x,y ik ε |x|,y m/2,k,ik 2(uk,x,y ik ε)2 k,ik sup ϕ, ϕ 2 1 |x|,y m/2 2 ϕ|Uy Ox,y U y|ϕ 2ε2 2N sup ϕ, ϕ 2 1 |x|,y m/2 ϕ|Uy Ox,y U y|ϕ 2ε2. Finally the upper bounds on Σ1 and Σ2 imply the required upper bound on their sum Σ1 + Σ2 = I(X, Y : I1, . . . , IN). Note that we need to take a supremum over all possible vectors ϕ because the learner knows the quantum states {σy}y and so it can choose measurements dependent on the unitaries {Uy}y. We can now show that with high probability on the choice of the unitaries {Uy}y, the latter supremum can bounded and so the mutual information too. Lemma C.7. Let {Uy}y m unitary matrices Haar(d) distributed. We have with a probability at least 9/10: 4N sup ϕ, ϕ 2 1 |x|,y m/2 ϕ|Uy Ox,y U y|ϕ 2ε2 = O Nε2 log(m) Published in Transactions on Machine Learning Research (September/2023) Proof. To upper bound the previous supremum, we use a similar approach to (Chen et al., 2021): For U Haar(d), ϕ B(0, 1) and a trace-less Hermitian matrix O, let f(ϕ, U) = ϕ|UOU |ϕ , we have E (f(ϕ, U)) = 1 dtr(O)tr(|ϕ ϕ|) = 0 (see Weingarten Calculus D.1) and f is 2 O -Lipschitz: |f(U) f(V )| 2| ϕ|(U V )OU |ϕ | 2 O U V 2. Therefore by the concentration inequality D.3: P (|f(U)| > t) exp( dt2/48). P |f(U)|2 > t exp( dt/48). For m/2 unitaries U1, . . . , Um/2 and λ = 2d/C for sufficiently large C. Denote by X = |f(U)|2, by Markov s inequality: 1 y m/2 |f(Uy)|2 > t exp( λmt/2)E eλX m/2 exp( λmt/2) 1 + Z 0 dxλeλxe dx/48 m/2 exp( dmt/2C) (C )m/2 exp( dmt/C + m log(C )), with C another constant. In order to prove an inequality valid for all ϕ B(0, 1), let s take an η-net {ϕi}i of size at most (1 + 2/η)2d. For ϕ B(0, 1), there is ϕi such that ϕ ϕi 2 η. Moreover |f(ϕ, U)| O so 1 y m/2 f(ϕ, Uy)2 f(ϕi, Uy)2 1 y m/2 |f(ϕ, Uy)2 f(ϕi, Uy)2| 1 y m/2 2 O |( ϕ| ϕi|)Uy OU y|ϕ | 2η O 2. 1 y m/2 |f(ϕ, Uy)|2 > t + 2η O 2 k=1 |f(ϕi, Uk)|2 > t (1 + 2/η)2d exp( dmt/C + m log(C )). Taking η = 1/m yields: 1 y m/2 |f(ϕ, Uy)|2 > t + 2 O 2/m (1 + 2m)2d exp( dmt/C + m log(C )). Applying the union bound, we can obtain: y m/2 ϕ|Uy Ox,y U y|ϕ 2 t + 2 Ox,y 2 4ecd(1 + 2m)2d exp( dmt/C + m log(C )). Let s take t = C log(40)+cd+2d log(1+2m)+m log(C ) dm in order to have |x|,y m/2 ϕ|Uy Ox,y U y|ϕ 2 t + 2 Ox,y 2 y m/2 ϕ|Uy Ox,y U y|ϕ 2 t + 2 Ox,y 2 Published in Transactions on Machine Learning Research (September/2023) Therefore we have the existence of {Uy}y such that for all y = z σy σz tr > ε, sup ϕ, ϕ 2 1 |x|,y m/2 ϕ|Uy Ox,y U y|ϕ 2 tr(O2 x,y) d(d + 1) + t + 2 Ox,y 2 d + 1 + C log(40) + cd + 2d log(1 + 2m) + m log(C ) Finally, we showed the existence of quantum states {σx,y}x,y such that: I(X, Y : I1, . . . , IN) = O 1 To sum up, we have shown the existence of quantum states {σx,y}x,y such that: Ω(log(m) + d) I(X, Y : X , Y ) I(X, Y : I1, . . . , IN) O 1 We conclude that N = Ω min n md log(m)ε2 , d2 C.2 Adaptive strategies It is important to see why this proof doesn t work for adaptive strategies. The lower bound on the mutual information has nothing to do with the non-adaptive/adaptive option of the algorithm so it remains true. However, upper bounding the mutual information cannot be done the same way since now the POVM used at time t depends on the previous outcomes. Let {ut}N t=1 be a sequence constituted by the outcomes of a correct adaptive algorithm. Let Mt u k and log ( ϕu 1 we obtain: t=1 ϕu 0 P (|F(U1, . . . , Uk) E (F(U1, . . . , Uk)) | t) e dt2/12L2, where U1, . . . , Uk are independent Haar-distributed unitary matrices.