# optimal_estimation_of_low_rank_density_matrices__47c4366b.pdf Journal of Machine Learning Research 16 (2015) 1757-1792 Submitted 7/15; Published 9/15 Optimal Estimation of Low Rank Density Matrices Vladimir Koltchinskii vlad@math.gatech.edu Dong Xia dxia7@math.gatech.edu School of Mathematics Georgia Institute of Technology Atlanta, GA 30332, USA. Editor: Alex Gammerman and Vladimir Vovk The density matrices are positively semi-definite Hermitian matrices of unit trace that describe the state of a quantum system. The goal of the paper is to develop minimax lower bounds on error rates of estimation of low rank density matrices in trace regression models used in quantum state tomography (in particular, in the case of Pauli measurements) with explicit dependence of the bounds on the rank and other complexity parameters. Such bounds are established for several statistically relevant distances, including quantum versions of Kullback-Leibler divergence (relative entropy distance) and of Hellinger distance (so called Bures distance), and Schatten p-norm distances. Sharp upper bounds and oracle inequalities for least squares estimator with von Neumann entropy penalization are obtained showing that minimax lower bounds are attained (up to logarithmic factors) for these distances. Keywords: quantum state tomography, low rank density matrix, minimax lower bounds 1. Introduction This paper deals with optimality properties of estimators of density matrices, describing states of quantum systems, that are based on penalized empirical risk minimization with specially designed complexity penalties such as von Neumann entropy of the state. Alexey Chervonenkis was a co-founder of the theory of empirical risk minimization that is of crucial importance in machine learning, but he also had very broad interests that included, in particular, quantum mechanics. By the choice of the topic, we would like to honor the memory of this great man and great scientist. Let Mm(C) be the set of all m m matrices with complex entries and let Hm = Hm(C) Mm(C) be the set of all Hermitian matrices: Hm = {A Mm(C) : A = A }, A denoting the adjoint matrix of A. For A Hm, tr(A) denotes the trace of A and A 0 means that A is positively semi-definite. Let Sm := {S Hm : S 0, tr(S) = 1} be the set of all positively semi-definite Hermitian matrices of unit trace called density matrices. In quantum mechanics, the state of a quantum system is usually characterized by a density matrix ρ Sm (or, more generally, by a self-adjoint positively semi-definite operator of unit trace acting in an infinite-dimensional Hilbert space, called a density operator). Often, very . Supported in part by NSF Grants DMS-1509739, DMS-1207808, CCF-1523768 and CCF-1415498 . Supported in part by NSF Grant DMS-1207808 c 2015 Vladimir Koltchinskii and Dong Xia. Koltchinskii and Xia large density matrices are needed to represent or to approximate the density operator of the state. For instance, for a quantum system consisting of b qubits, the density matrices are of the size m m with m = 2b, so the dimension of the density matrix grows exponentially with b. For instance, for a 10 qubit system, one has to deal with matrices that have 220 entries. Thus, it becomes natural in the problems of statistical estimation of density matrix ρ to take an advantage of the fact that it might be low rank, or nearly low rank (that is, it could be well approximated by low rank matrices) which reduces the complexity of the estimation problem. In quantum state tomography (QST), the goal is to estimate an unknown state ρ Sm based on a number of specially designed measurements for the system prepared in state ρ (see Gross et al. 2010, Gross 2011, Koltchinskii 2011a, Cai et al. 2015 and references therein). Given an observable A Hm with spectral representation A = Pm j=1 λj Pj, where m m, λj being the eigenvalues of A and Pj being the corresponding mutually orthogonal eigenprojectors, the outcome of a measurement of A for the system prepared in state ρ is a random variable Y taking values λj with probabilities tr(ρPj). The expectation of Y is then EρY = tr(ρA), so, Y could be viewed as a noisy observation of the value of linear functional tr(ρA) of the unknown density matrix ρ. A common approach is to choose an observable A at random, assuming that it is the value of a random variable X with some design distribution Π in the space Hm. More precisely, given a sample of n i.i.d. copies X1, . . . , Xn of X, n measurements are being performed for the system identically prepared n times in state ρ resulting in outcomes Y1, . . . , Yn. Based on the data (X1, Y1), . . . , (Xn, Yn), the goal is to estimate the target density matrix ρ. Clearly, the observations satisfy the following model Yj = tr(ρXj) + ξj, j = 1, . . . , n, (1) where {ξj} is a random noise consisting of n i.i.d. random variables satisfying the condition Eρ(ξj|Xj) = 0, j = 1, . . . , n. This is a special case of so called trace regression model intensively studied in the recent literature (see, e.g., Koltchinskii et al. 2011, Koltchinskii 2011b and references therein). 1.1 Assumptions A common choice of design distribution in this type of problems is so called uniform sampling from an orthonormal basis described in the following assumptions. Assumption 1 Let E = {E1, . . . , Em2} Hm be an orthonormal basis of Hm with respect to the Hilbert Schmidt inner product: A, B = tr(AB). Moreover, suppose that, for some U > 0, Ej U, j = 1, . . . , n, where denotes the operator norm (the spectral norm). Since Ej 2 = 1, where 2 denotes the Hilbert Schmidt (or Frobenius) norm, we can assume that U 1. Moreover, U m 1/2 since 1 = Ej 2 m1/2 Ej m1/2U. Assumption 2 Let Π be the uniform distribution in the finite set E (see Assumption 1), let X be a random variable sampled from Π and let X1, . . . , Xn be i.i.d. copies of X. Low Rank Density Matrices Estimation It will be assumed in what follows that assumptions 1 and 2 hold (unless it is stated otherwise). Under these assumptions, Y1, . . . , Yn could be viewed as noisy observations of a random sample of Fourier coefficients ρ, X1 , . . . , ρ, Xn of the target density matrix ρ in the basis E. The above model (in which X1, . . . , Xn are uniformly sampled from an orthonormal basis and Y1, . . . , Yn are the outcomes of measurements of the observables X1, . . . , Xn for the system being identically prepared n times in the same state ρ) will be called in what follows the standard QST model. It is a special case of trace regression model with bounded response: Assumption 3 (Trace regression with bounded responce) Suppose that Assumption 1 holds and let (X, Y ) be a random couple such that X is sampled from the uniform distribution Π in an orthonormal basis E Hm. Suppose also that, for some ρ Sm, E(Y |X) = ρ, X a.s. and, for some U > 0, |Y | U a.s.. The data (X1, Y1), . . . (Xn, Yn) consists of n i.i.d. copies of (X, Y ). We are also interested in the trace regression model with Gaussian noise: Assumption 4 (Trace regression with Gaussian noise) Suppose Assumption 1 holds and let (X, Y ) be a random couple such that X is sampled from the uniform distribution Π in an orthonormal basis E Hm and, for some ρ Sm, Y = ρ, X + ξ, where ξ is a normal random variable with mean 0 and variance σ2 ξ, ξ and X being independent. The data (X1, Y1), . . . (Xn, Yn) consists of n i.i.d. copies of (X, Y ). Note that this model is not directly applicable to the standard QST problem described above, where the response variable Y is discrete. However, if the measurements are repeated multiple times for each observable Xj and the resulting outcomes are averaged to reduce the variance, the noise of such averaged measurements becomes approximately Gaussian and it is of interest to characterize the estimation error in terms of the variance of the noise. An important example of an orthonormal basis used in quantum state tomography is so called Pauli basis, see, e.g., Gross et al. (2010), Gross (2011). The Pauli basis in the space H2 of 2 2 Hermitian matrices (observables in a single qubit system) consists of four matrices W1, W2, W3, W4 defined as Wi = 1 2σi, i = 1, . . . , 4, where It is easy to check that {W0, W1, W2, W3} indeed forms an orthonormal basis in H2. The Pauli basis in the space Hm for m = 2b (the space of observables for a b qubits system) is defined by tensorisation, namely, it consists of 4b tensor products Wi1 . . . Wib, (i1, . . . , ib) {1, 2, 3, 4}b. Let us write these matrices as E1, . . . , Em2 with E1 = W1 . . . W1. It is easy to see that each of them has eigenvalues 1 m and Ej = m 1/2, so, for this basis, U = m 1/2. The fact that, for the Pauli basis, the operator norms of basis matrices are as small as possible plays an important role in quantum state tomography (Gross et al., 2010; Gross, 2011; Liu, 2011). Let Ej = 1 m Q+ j 1 m Q j be the spectral representation of Ej. Then, an outcome of a measurement of Ej in state ρ is a random variable τj taking values Koltchinskii and Xia 1 m with probabilities ρ, Q t . Its expectation is Eρτj = ρ, Ej . Of course, there exists a unique representation of density matrix ρ in the Pauli basis that can be written as follows: ρ = Pm2 j=1 αj m Ej with α1 = 1. Then, we clearly have Eρτj = αj m and Pρ n τj = 1 m (for j = 1, this gives Pρ n τ1 = 1 m o = 1). As a consequence, Varρ(τj) = 1 α2 j m . Note that Pm2 j=1 α2 j m = ρ 2 2 tr2(ρ) = 1. This implies that there exists j such that α2 j 1 2 and Varρ(τj) 1 2m. In fact, the number of such j must be large, say, at least m2 2 (provided that m > 4). Thus, for most of the values of j, Varρ(τj) 1 m. A way to reduce the variance is to repeat the measurement of each observable Xj K times (for a system identically prepared in state ρ) and to average the outcomes of such K measurements. The resulting response variable is Yj = ρ, Xj + ξj, where Eρ(ξj|Xj) = 0 and Eρ(ξ2 j |Xj) = Varρ(Yj|Xj) = 1 α2 νj Km , νj being defined by the relationship Xj = Eνj. 1.2 Preliminaries and Notations Some notations will be used throughout the paper. The Euclidean norm in Cm will be denoted by and the notation , will be used for both the Euclidean inner product in Cm and for the Hilbert Schmidt inner product in Hm. p, p 1 will be used to denote the Schatten p-norm in Hm, namely A p p = m P j |λj(A)|p, A Hm, λ1(A) . . . λm(A) being the eigenvalues of A. In particular, 2 denotes the Hilbert Schmidt (or Frobenius) norm, 1 denotes the nuclear (or trace) norm and denotes the operator (or spectral) norm: A = max1 j m |λj(A)| = |λ1(A)|. The following well known interpolation inequality for Schatten p-norms will be used to extend the bounds proved for some values of p to the whole range of its values. It easily follows from similar bounds for ℓp-spaces. Lemma 1 (Interpolation inequality) For 1 p < q < r , and let µ [0, 1] be such that µ Then, for all A Hm, A q A µ p A 1 µ r . Given A Hm, define a function f A : Hm 7 R : f A(x) := A, x , x Hm. For a given random variable X in Hm with a distribution Π, we have f A 2 L2(Π) = Ef2 A(X) = E A, X 2. Sometimes, with a minor abuse of notation, we might write A 2 L2(Π) = R Hm A, x 2Π(dx) = f A 2 L2(Π). In what follows, Π will be typically the uniform distribution in an orthonormal basis E = {E1, . . . , Em2} Hm, implying that f A 2 L2(Π) = A 2 L2(Π) = m 2 A 2 2, so, the L2(Π)-norm is just a rescaled Hilbert Schmidt norm. Consider A Hm with spectral representation A = Pm j=1 λj Pj, m m with distinct non-zero eigenvalues λj. Denote by sign(A) := Pm j=1 sign(λj)Pj and by supp(A) the linear Low Rank Density Matrices Estimation span of the images of projectors Pj, j = 1, . . . , m (the subspace supp(A) Cm will be called the support of A). Given a subspace L Cm, L denotes the orthogonal complement of L and PL denotes the orthogonal projection onto L. Let PL, P L be orthogonal projection operators in the space Hm (equipped with the Hilbert Schmidt inner product), defined as follows: P L (A) = PL APL , PL(A) = A PL APL . These two operators split any Hermitian matrix A into two orthogonal parts, PL(A) and P L (A), the first one being of rank at most 2dim(L). For a convex function f : Hm 7 R, f(A) denotes the subdifferential of f at the point A Hm. It is well known that A 1 = n sign(A) + P L (M) : M Hm, M 1 o , (2) where L = supp(A) (see Koltchinskii 2011b, p. 240 and references therein). C, C1, C , c, c , etc will denote constants (that do not depend on parameters of interest such as m, n, etc) whose values could change from line to line (or, even, within the same line) without further notice. For nonnegative A and B, A B (equivalently, B A) means that A CB for some absolute constant C > 0, and A B means that A B and B A. Sometimes, symbols , and could be provided with subscripts (say, A γ B) to indicate that constant C may depend on a parameter (say, γ). In what follows, P denotes the distribution of (X, Y ) and Pn denotes the corresponding empirical distribution based on the sample (X1, Y1), . . . , (Xn, Yn) of n i.i.d. observations. Similarly, Π is the distribution of X (typically, uniform in an orthonormal basis) and Πn is the corresponding empirical distribution based on the sample (X1, . . . , Xn). We will use standard notations Pf = Ef(X, Y ), Pnf = n 1 Pn j=1 f(Xj, Yj) and Πg = Eg(X), Png = n 1 Pn j=1 g(Xj). 1.3 Estimation Methods Recall that the central problem in quantum state tomography is to estimate a large density matrix ρ based on the data (X1, Y1), . . . , (Xn, Yn) satisfying the trace regression model. Often, the goal is to develop adaptive estimators with optimal dependence of the estimation error (measured by various statistically relevant distances) on the unknown rank of the target matrix ρ under the assumption that ρ is low rank, or on other complexity parameters in the case when the target matrix ρ can be well approximated by low rank matrices. The simplest estimation procedure for density matrix ρ is the least squares estimator defined by the following convex optimization problem: ˆρ := arg min S Sm j=1 (Yj S, Xj )2 . (3) Since, for all S Sm, S 1 = tr(S) = 1, we have that ˆρ = ˆρε := arg min S Sm j=1 (Yj S, Xj )2 + ε S 1 Koltchinskii and Xia Thus, in the case of density matrices, the least squares estimator ˆρ coincides with the matrix LASSO estimator ˆρε with nuclear norm penalty and arbitrary value of regularization parameter ε. The nuclear norm penalty is used as a proxy of the rank that provides a convex relaxation for rank penalized least squares method. Matrix LASSO is a standard method of low rank estimation in trace regression models that has been intensively studied in the recent years, see, for instance, Cand es and Plan (2011), Rohde and Tsybakov (2011), Koltchinskii (2011b), Koltchinskii et al. (2011), Negahban and Wainwright (2010) and references therein. In the case of estimation of density matrices, due to their positive semidefiniteness and trace constraint, the nuclear norm penalization is present implicitly even in the case of a nonpenalized least squares estimator ˆρ (see also Koltchinskii 2013a, Kalev et al. 2015 where similar ideas were used). Note that the estimator ˆρ can be also rewritten as ˆρ := arg min S Sm S 2 L2(Πn) 2 j=1 Yj S, Xj . (5) Replacing the empirical L2(Πn)-norm with the true L2(Π)-norm (which could make sense in the case when the design distribution Π is known) yields the following modified least squares estimator studied in Koltchinskii et al. (2011), Koltchinskii (2013a): ˇρ := arg min S Sm S 2 L2(Π) 2 j=1 Yj S, Xj . (6) Another estimator was proposed in Koltchinskii (2011a) and it is based on an idea of using so called von Neumann entropy as a penalizer in least squares method. Von Neumann entropy is a canonical extension of Shannon s entropy to the quantum setting. For a density matrix S Sm, it is defined as E(S) := tr(S log S). The estimator proposed in Koltchinskii (2011a) is defined as follows ρε := arg min S Sm j=1 (Yj S, Xj )2 + εtr(S log S) . (7) Essentially, it is based on a trade-offbetween fitting the model via the least squares method in the class of all density matrices and maximizing the entropy of the quantum state. Note that (7) is also a convex optimization problem (due to concavity of von Neumann entropy, see Nielsen and Chuang 2000) and its solution ρε is a full rank matrix (see Koltchinskii 2011a, the proof of Proposition 3). It should be also mentioned that the idea of estimation of a density matrix of a quantum state by maximizing the von Neumann entropy subject to constraints based on the data has been used in quantum state tomography earlier (see Buˇzek 2004 and references therein). 1.4 Distances between Density Matrices The main purpose of this paper is to study the optimality properties of estimator ρϵ with respect to a variety of statistically meaningful distances, in the case when the underlying density matrix ρ is low rank. These distances include Schatten p-norm distances for p Low Rank Density Matrices Estimation [1, 2],1 but also quantum versions of Hellinger distance and Kullback-Leibler divergence that are of importance in quantum statistics and quantum information. A version of the (squared) Hellinger distance that will be studied is defined as H2(S1, S2) := 2 2tr for S1, S2 Sm (see also Nielsen and Chuang 2000). Clearly, 0 H2(S1, S2) 2. In quantum information literature, it is usually called Bures distance and it does not coincide with tr( S1 S2)2 (which is another possible non-commutative extension of the classical Hellinger distance). In fact, H2(S1, S2) tr( S1 S2)2, S1, S2 Sm, but the opposite inequality does not necessarily hold. The quantity tr q 1 2 1 in the right hand side of the definition of H2 is a quantum version of Hellinger affinity. The noncommutative Kullback-Leibler divergence (or relative entropy distance) K( ) is defined as (see also Nielsen and Chuang 2000): K(S1 S2) := S1, log S1 log S2 . If log S2 is not well-defined (for instance, some of the eigenvalues of S2 are equal to 0) we set K(S1 S2) = + . The symmetrized version of Kullback-Leibler divergence is defined as K(S1; S2) := K(S1 S2) + K(S2 S1) = S1 S2, log S1 log S2 . The following very useful inequality is a noncommutative extension of similar classical inequalities for total variation, Hellinger and Kullback-Leibler distances. It follows from representing the noncommutative distances involved in the inequality as suprema of the corresponding classical distances between the distributions of outcomes of measurements for two states S1, S2 over all possible measurements represented by positive operator valued measures (see, Nielsen and Chuang 2000, Klauck et al. 2007, Koltchinskii 2011a, Section 3 and references therein). Lemma 2 For all S1, S2 Sm, the following inequalities hold: 1 4 S1 S2 2 1 H2(S1, S2) K(S1 S2) S1 S2 1 . (8) 1.5 Matrix Bernstein Inequalities Non-commutative (matrix) versions of Bernstein inequality will be used in what follows. The most common version is stated (in a convenient form for our applications) in the following lemma. Lemma 3 Let X, X1, . . . , Xn Hm be i.i.d. random matrices with EX = 0, σ2 X := EX2 and X U a.s. for some U > 0. Then, for all t 0 with probability at least 1 e t, 1 n t + log(2m) _ U t + log(2m) 1. Similar problems for estimators ˆρ, ˇρ and for Schatten p-norm distances with p (2, + ] are studied in a related paper by Koltchinskii and Xia (2015+) Koltchinskii and Xia The proof of such bounds could be found, e.g., in Tropp (2012). Other versions on matrix Bernstein type inequalities for not necessarily bounded random matrices will be also used in what follows and they could be found in Koltchinskii (2011b), Koltchinskii (2013a). A simple consequence of the inequality of Lemma 3 is the following expectation bound: _ U log(2m) It follows from the exponential bound by integrating the tail probabilities. The paper is organized as follows. In Section 2, minimax lower bounds on estimation error of low rank density matrices are provided in Schatten p-norm, Hellinger (Bures) and Kullback-Leibler distances. In Section 3.1, sharp low rank oracle inequalities for von Neumann entropy penalized least squares estimator are derived in the case of trace regression model with bounded response. In Section 3.2, low rank oracle inequalities are established in the case of trace regression with Gaussian noise. In addition to this, in these two sections, upper bounds on estimation error with respect to Kullback-Leibler distance are obtained. In Section 3.3, they are further developed and extended to other distances (Hellinger distance, Schatten p-norm distances for p [1, 2]) showing the minimax optimality (up to logarithmic factors) of the error rates of the least squares estimator with von Neumann entropy penalization. 2. Minimax Lower Bounds In this section, we provide main results on the minimax lower bounds on the risk of estimation of density matrices with respect to Schatten p-norm (or, rather q-norm in the notations used below) distances as well as Hellinger-Bures distance and Kullback-Leibler divergence. Minimax lower bounds will be derived for the class Sr,m := {S Sm : rank(S) r} consisting of all density matrices of rank at most r (the low rank case). We will start with the case of trace regression with Gaussian noise. Given that the sample (X1, Y1), . . . , (Xn, Yn) satisfies Assumption 4 with the target density matrix ρ Sm and noise variance σ2 ξ, let Pρ denote the corresponding probability distribution. Note that Ma and Wu (2013) developed a method of deriving minimax lower bounds for distances based on unitary invariant norms, including Schatten p-norms in matrix problems, and obtained such lower bounds, in particular, in matrix completion problem. The approach used in our paper is somewhat different and the aim is to develop such bounds under an additional constraint that the target matrix is a density matrix. The resulting bounds are also somewhat different, they involve an additional term that does not depend on the rank, but does depend on q. Essentially, it means that the complexity of the problem is controlled by a truncated rank r 1 τ , where τ = σξm3/2 n rather than by the actual rank r. The upper bounds of Section 3.3 show that such a structure of the bound is, indeed, necessary. It should be also mentioned that minimax lower bounds on the nuclear norm error of estimation of density matrices have been obtained earlier in Flammia et al. (2012) (see Remark 11 below). Low Rank Density Matrices Estimation Theorem 4 For all q [1, + ], there exist constants c, c > 0 such that, the following bounds hold: inf ˆρ sup ρ Sr,m Pρ ˆρ ρ q c σξm 3 2 r1/q n q 1 c , (9) inf ˆρ sup ρ Sr,m Pρ H2(ˆρ, ρ) c σξm 3 2 r n inf ˆρ sup ρ Sr,m Pρ K(ρ ˆρ) c σξm 3 2 r n where inf ˆρ denotes the infimum over all estimators ˆρ in Sm based on the data (X1, Y1), . . . , (Xn, Yn) satisfying the Gaussian trace regression model with noise variance σ2 ξ. Proof A couple of preliminary facts will be needed in the proof. We start with bounds on the packing numbers of Grassmann manifold Gk,l, which is the set of all k-dimensional subspaces L of the l-dimensional space Rl. Given such a subspace L Rl with dim(L) = k, let PL be the orthogonal projection onto L and let Pk,l := {PL : L Gk,l}. The set of all k-dimensional projectors Pk,l will be equipped with Schatten q-norm distances for all q [1, + ] (which also could be viewed as distances on the Grassmannian itself): dq(Q1, Q2) := Q1 Q2 q, Q1, Q2 Pk,l. Recall that the ε-packing number of a metric space (T, d) is defined as D(T, d, ε) = max n n : there are t1, . . . , tn T, such that min i =j d(ti, tj) > ε o . The following lemma (see Pajor 1998, Proposition 8) will be used to control the packing numbers of Pk,l with respect to Schatten distances dq. Lemma 5 For all integer 1 k l such that k l k, and all 1 q , the following bounds hold c d D(Pk,l, dq, εk1/q) C d , ε > 0 (12) with d = k(l k) and universal positive constants c, C. In addition to this, we need the following well known information-theoretic bound frequently used in derivation of minimax lower bounds (see Tsybakov 2008, Theorem 2.5). Let Θ = {θ0, θ1, . . . , θM} be a finite parameter space equipped with a metric d and let P := {Pθ : θ Θ} be a family of probability distributions in some sample space. Given P, Q P, let K(P Q) := EP log d P d Q be the Kullback-Leibler divergence between P and Q. Proposition 6 Suppose that the following conditions hold: (i) for some s > 0, d(θj, θk) 2s > 0, 0 j < k M; (ii) for some 0 < α < 1/8, 1 M j=1 K(Pθj Pθ0) α log M Koltchinskii and Xia Then, for a positive constant cα, inf ˆθ sup θ Θ Pθ{d(ˆθ, θ) s} cα, where the infimum is taken over all estimators ˆθ Θ based on an observation sampled from Pθ. We now turn to the actual proof of Theorem 4. Under Assumption 4, the following computation is well known: for ρ1, ρ2 Sr,m, K(Pρ1 Pρ2) = EPρ1 log Pρ1 X1, Y1, . . . , Xn, Yn h (Yj ρ1, Xj )2 2σ2 ξ + (Yj ρ2, Xj )2 ρ1 ρ2, Xj 2 2σ2 ξ = n 2σ2 ξ ρ1 ρ2 2 L2(Π). It is enough to prove the bounds for 2 r m/2. The proof in the case r = 1 is simpler and the case r > m/2 easily reduces to the case r m/2. We will use Lemma 5 to construct a well separated (with respect to dq) subset of density matrices in Sr,m. To this end, first choose a subset Dq Pr 1,m 1 such that card(Dq) 2(r 1)(m r) and, for some constant c , Q1 Q2 q c (r 1)1/q, Q1, Q2 Pr 1,m 1, Q1 = Q2. Such a choice is possible due to the lower bound on the packing numbers of Lemma 5. For Q Dq (note that Q can be viewed as an (m 1) (m 1) matrix with real entries) and κ (0, 1), consider the following m m matrix Note that S is symmetric positively-semidefinite real matrix of unit trace. It is straightforward to check that it defines a Hermitian positively-semidefinite operator in Cm of unit trace, and it can be identified with a density matrix S Sm. Clearly, S is of rank r, so, S Sr,m. We will take κ := c1 σξm3/2(r 1) n with a small enough absolute constant c1 > 0 and first assume that κ < 1 (as it is needed in definition Equation 14). Let S q := {SQ : Q Dq} and consider a family of M + 1 = card(Dq) 2(r 1)(m r) distributions {PS : S S q}. It is immediate that for S1 = SQ1, S2 = SQ2, Q1, Q2 Dq, Q1 = Q2, we have S1 S2 q = κ r 1 Q1 Q2 q c κ(r 1)1/q 1 c c1 σξm3/2(r 1)1/q n cσξm3/2r1/q with some constant c > 0, implying condition (i) of Proposition 6 with s = c 2 σξm3/2r1/q Low Rank Density Matrices Estimation We will now check its condition (ii) . In view of (13), we have, for all S1 = SQ1, S2 = SQ2 S q, K(PS1 PS2) = n 2σ2 ξ S1 S2 2 L2(Π) = n 2σ2 ξm2 S1 S2 2 2 2σ2 ξm2(r 1)2 Q1 Q2 2 2 4n(r 1)κ2 2σ2 ξm2(r 1)2 = 2c2 1m(r 1) αm(r 1)/ log(2)/4 α 2 (r 1)(m r) log(2) α log M, provided that constant c1 is small enough, so, condition (ii) of Proposition 6 is also satisfied. Proposition 6 implies that, under the assumption κ = c1 σξm3/2(r 1) n < 1, the following minimax lower bound holds for some c, c > 0 : inf ˆρ sup ρ Sr,m Pρ ˆρ ρ q cσξm 3 2 r1/q n In the case when n < 1 c1 σξm3/2(r 1) n , one can choose 2 r < r 1 such that, for some constant c2 > 0, c2 < c1 σξm3/2(r 1) n < 1. For such a choice of r , it follows from (17) that inf ˆρ sup ρ Sr ,m Pρ ˆρ ρ q cσξm 3 2 (r )1/q n The definition of r implies that r r 1 σξm3/2 Therefore, σξm 3 2 (r )1/q n σξm3/2 and, since Sr ,m Sr,m, bound (18) yields inf ˆρ sup ρ Sr,m Pρ ˆρ ρ q c σξm3/2 1 1/q inf ˆρ sup ρ Sr ,m Pρ ˆρ ρ q c σξm3/2 (19) for some constants c, c > 0. This allows us to recover the second term in the minimum in bound (9). Finally, in the case when c1 σξm3/2 n > 1, the minimax lower bound becomes a Koltchinskii and Xia constant (and the proof is based on a simplified version of the above argument that could be done for r = 1). This completes the proof of bound (9) for Schatten q-norms. The proof of bound (10) for the Hellinger distance is similar. In the case r 2, we will use a well separated set of density matrices S q Sr,m for q = 1 constructed above. We still use κ := c1 σξm3/2(r 1) n assuming first that κ (0, 1). For SQ1, SQ2 S q with Q1 = Q2, it follows by a simple computation and using bound (8) that, for some c > 0, H2(SQ1, SQ2) = κH2 Q1 4 κ (r 1)2 Q1 Q2 2 1 (c )2 4 κ c σξm3/2(r 1) n . Repeating the argument based on Proposition 6 yields bound (10) in the case when κ = c1 σξm3/2(r 1) n < 1, and in the opposite case it is easy to see that the lower bound is a constant. Finally, bound (11) for the Kullback Leibler divergence follows from (10) and the inequality K(ρ ˆρ) H2(ˆρ, ρ) (see inequality 8). Next we state similar results in the case of trace regression model with bounded response (see Assumption 3). Denote by Pr,m( U) the class of all distributions P of (X, Y ) such that Assumption 3 holds for some U and E(Y |X) = ρP , X for some ρP Sr,m. Given P, PP denotes the corresponding probability measure (such that (X1, Y1), . . . , (Xn, Yn) are i.i.d. copies of (X, Y ) sampled from P). Theorem 7 Suppose U 2U. For all q [1, + ], there exist absolute constants c, c > 0 such that the following bounds hold: inf ˆρ sup P Pr,m( U) PP ˆρ ρP q c Um 3 2 r1/q n q 1 c , (20) inf ˆρ sup P Pr,m( U) PP H2(ˆρ, ρP ) c Um 3 2 r n inf ˆρ sup P Pr,m( U) PP K(ρP ˆρ) c Um 3 2 r n where inf ˆρ denotes the infimum over all estimators ˆρ in Sm based on the data (X1, Y1), . . . , (Xn, Yn). Proof The proof relies on an idea already used in a context of matrix completion by Koltchinskii et al. (2011) (see their Theorem 7). We need the same family S q Sr,m of well separated density matrices of rank r as in the proof of Theorem 4. For a density matrix ρ, let (X, Y ) be a random couple such that X is sampled from the uniform distribution Π in E and, conditionally on X, Y takes value + U with probability pρ(X) := 1 2 U and value Low Rank Density Matrices Estimation U with probability qρ(X) := 1 2 U . Since U 2U and | ρ, X | ρ 1 X U, we have pρ(X), qρ(X) [1/4, 3/4] (so, they are bounded away from 0 and from 1). Clearly, Eρ(Y |X) = ρ, X . Let Pρ denote the distribution of such a couple and Pρ denote the corresponding distribution of the data (X1, Y1), . . . , (Xn, Yn). Then, for all ρ Sr,m, Pρ Pr,m( U). The only difference with the proof of Theorem 4 is in the bound on Kullback Leibler divergence K(Pρ1 Pρ2) (see Equation 13). It is easy to see that K(Pρ1 Pρ2) = n E pρ1(X) log pρ1(X) pρ2(X) + qρ1(X) log qρ1(X) The following simple inequality will be used: for all a, b [1/4, 3/4], b + (1 a) log 1 a 1 b 12(a b)2. It implies that K(Pρ1 Pρ2) 3n E ρ1 ρ2, X 2 U2 ρ1 ρ2 2 L2(Π). This bound is used instead of identity (13) from the proof of Theorem 4. The rest of the proof is the same. Note that the proof requires the possible range [ U, U] of response variable Y to be larger than the possible range [ U, U] of Fourier coefficients ρ, Ej , j = 1, . . . , m2. This is not the case for standard QST model described in the introduction (see also the example of Pauli measurements) and it is of interest to prove a version of minimax lower bounds without this constraint, including the case when U = U. The following theorem is a result in this direction. Theorem 8 Suppose Assumption 1 is satisfied and, moreover, for some constant γ (0, 1), tr(Ek) (1 γ)Um, k = 1, . . . , m2. (24) Then, for all q [1, + ], there exist constants cγ, c γ > 0 such that the following bounds hold: inf ˆρ sup P Pr,m(U) PP Um 3 2 r1/q n q 1 c γ, (25) inf ˆρ sup P Pr,m(U) PP H2(ˆρ, ρP ) cγ 1 c γ, (26) inf ˆρ sup P Pr,m(U) PP K(ρP ˆρ) cγ 1 c γ, (27) where inf ˆρ denotes the infimum over all estimators ˆρ in Sm based on the data (X1, Y1), . . . , (Xn, Yn). Koltchinskii and Xia Proof The proof is based on the following lemma: Lemma 9 Suppose assumption (24) holds. Let K be a sufficiently large absolute constant (to be chosen later) and let m satisfy the condition K log m m γ 2 (which means that m Aγ for some constant Aγ). Then there exists v Cm with v = 1 such that Ekv, v (1 γ/2)U, k = 1, . . . , m2. (28) Proof We will prove this fact by a probabilistic argument. Namely, set v := m 1/2(ε1, . . . , εm), where εj = 1. We will show that there is a random choice of signs εj such that (28) holds. Assume that εj, j = 1, . . . , m are i.i.d. and take values 1 with probability 1/2 each. Let Ek := (a(k) ij )i,j=1,...,m. For simplicity, assume that (a(k) ij )i,j=1,...,m is a symmetric real matrix (in the complex case, the proof can be easily modified). We have i=1 a(k) ii ε2 i + 1 i =j a(k) ij εiεj = tr(Ek) i =j a(k) ij εiεj. It is well known that i =j a(k) ij εiεj i =j a(k) ij εiεj a(k) ij 2 2 X a(k) ij 2 = 2 Ek 2 2 = 2. Moreover, it follows from exponential inequalities for Rademacher chaos (see, e.g., Corollary 3.2.6 in de la Pe na and Gin e 1999) that for some absolute constant K > 0 and for all t > 0, with probability at least 1 e t Ekv, v tr(Ek) i =j a(k) ij εiεj Taking t = 2 log m and using the union bound, we conclude that with probability at least 1 me 2 log m = 1 1 Ekv, v tr(Ek) m K log m m U γ where we also used the fact that U m 1/2. Thus, there exists a choice of signs εj such that Ekv, v max 1 k m which, under condition (24), implies (28). We set e1 := v (where v is the unit vector introduced in Lemma 9) and construct an orthonormal basis e1, . . . , em. Assume that matrices SQ defined by (14) represent linear transformations in basis e1, . . . , em. Then we have SQ, Ek = (1 κ) Ekv, v + κ r 1 Q, Ek . Low Rank Density Matrices Estimation Therefore, SQ, Ek (1 κ) Ekv, v + κ r 1 Ek Q 1 (1 κ)(1 γ/2)U+κU = (1 (1 κ)(γ/2))U. Assuming that κ 1/2, we get SQ, Ek (1 γ/4)U, k = 1, . . . , m2. (29) The rest of the proof becomes similar to the proof of Theorem 7 (with U = U). Namely, bound (29) implies that, for ρ = SQ and X being sampled from the orthonormal basis {E1, . . . , Em2}, probabilities pρ(X) and qρ(X) are bounded away from 0 and from 1 : pρ(X), qρ(X) [γ/8, 1 γ/8]. This allows us to complete the argument of the proof of Theorem 7. Theorem 8 does not apply directly to the Pauli basis since condition (24) fails in this case. Indeed, by the definition of Pauli basis, U = m 1/2 and tr(E1) = m = Um > (1 γ)Um. Note also that tr(Ej) = 0, j = 2, . . . , m2. Thus, for Pauli basis, E1 is the only matrix for which condition (24) fails. However, for this matrix ρ, E1 = m 1/2tr(ρ) = m 1/2 = U for all density matrices ρ Sm. This immediately implies that pρ(E1) = 1 and qρ(E1) = 0 for all ρ Sm and, as a result, the value X = E1 does not have an impact on the computation of Kullback-Leibler divergence in (23). For the rest of the matrices in the Pauli basis, condition (24) holds implying also bound (28). Therefore, if X = E1, we still have that, for ρ = SQ, pρ(X), qρ(X) [γ/8, 1 γ/8], and the proof of Theorem 7 can be completed in this case, too. Note also that, given X sampled from the Pauli basis, the binary random variable Y taking values U = 1 m with probabilities pρ(X) and qρ(X), respectively (this is exactly the random variable used in the construction of the proof of Theorem 7) coincides with an outcome of a Pauli measurement for the system prepared in state ρ. These considerations yield the following minimax lower bounds for Pauli measurements. Theorem 10 Let {E1, . . . , Em2} be the Pauli basis in the space Hm of m m Hermitian matrices and let X1, . . . , Xn be i.i.d. random variables sampled from the uniform distribution in {E1, . . . , Em2}. Let Y1, . . . , Yn be outcomes of measurements of observables X1, . . . , Xn for the system being identically prepared n times in state ρ. The corresponding distribution of the data (X1, Y1), . . . , (Xn, Yn) will be denoted by Pρ. Then, for all q [1, + ], there exist constants c, c > 0 such that the following bounds hold: inf ˆρ sup ρ Sr,m Pρ ˆρ ρ q c mr1/q q 1 c , (30) inf ˆρ sup ρ Sr,m Pρ H2(ˆρ, ρ) c mr n inf ˆρ sup ρ Sr,m Pρ K(ρ ˆρ) c mr n where inf ˆρ denotes the infimum over all estimators ˆρ in Sm based on the data (X1, Y1), . . . , (Xn, Yn). Koltchinskii and Xia Remark 11 Minimax lower bounds on nuclear norm error of density matrix estimation close to bound (30) for q = 1 (but for a somewhat different estimation protocol and stated in a different form) were obtained earlier in Flammia et al. (2012). This paper also contains upper bounds on the errors of matrix LASSO and Dantzig selector estimators in the nuclear norm matching the lower bounds up to log-factors. Remark 12 It is easy to see that, if constant γ (0, 1) is small enough (namely, γ < 1 1 2), then, in an arbitrary orthonormal basis {E1, . . . , Em2}, there is at most one matrix Ej such that |tr(Ej)| > (1 γ)Um. Indeed, note that tr(Ej) = Ej, Im . Since j=1 Ej, Im 2 = Im 2 2 = m and U2m 1, we have card n j : | Ej, Im | > (1 γ)Um o 1 (1 γ)2U2m2 j=1 Ej, Im 2 m (1 γ)2U2m2 = 1 (1 γ)2U2m 1 (1 γ)2 < 2, provided that γ < 1 1 Remark 13 It will be shown in Section 3.3 that the minimax rates of theorems 4, 7, 8 and 10 are attained up to logarithmic factors for the von Neumann entropy penalized least squares estimator. Remark 14 Similar minimax lower bounds could be proved in certain classes of nearly low rank density matrices. Consider, for instance, the following class Bp(d; m) := S Sm : j=1 |λj(S)|p d (33) for some d > 0 and p [0, 1], where λ1(S) λm(S) denote the eigenvalues of S. This set consists of density matrices with the eigenvalues decaying at a certain rate (nearly low rank case) and, for p = 0, d = r it coincides with Sr,m. It turns out that minimax lower bounds of theorems 4 and 7 hold for the class Bp(d; m) (instead of Sr,m) with r replaced by r := r(τ, d, m, p) = dτ p m, where τ := σξm3/2 n in the case of trace regression with Gaussian noise and τ := Um3/2 n in the case of trace regression with bounded response. These minimax bounds are attained up to logarithmic factors for a slightly modified von Neumann entropy penalized least squares estimator. Note that, for ρ Bp(d, m) with eigenvalues λ1(ρ) λm(ρ), we have λj(ρ) d1/p j1/p , j = 1, . . . , m. Therefore, for j r, λj(ρ) τ. Note also that τ characterizes the Low Rank Density Matrices Estimation minimax rate of estimation of ρ Sr,m in the operator norm for any value of the rank r (see bound (9) for q = + ; the corresponding upper bound also holds for the least squares estimator up to a logarithmic factor, see Koltchinskii and Xia 2015+). Roughly speaking, τ is a threshold below which the estimation of eigenvalues λj(ρ) becomes impossible and r can be viewed as an effective rank of nearly low rank density matrices in the class Bp(d, m). 3. Von Neumann Entropy Penalization: Optimality and Oracle Inequalities The goal of this section is to study optimality properties of von Neumann entropy penalized least squares estimator ρε defined by (7). In particular, we establish oracle inequalities for such estimators in the cases of trace regression with bounded response (Subsection 3.1) and trace regression with Gaussian noise (Subsection 3.2), and prove upper bounds on their estimation errors measured by Schatten q-norm distances for q [1, 2] and also by Hellinger and Kullback-Leibler distances (Subsection 3.3). 3.1 Oracle Inequalities for Trace Regression with Bounded Response In this subsection, we prove a sharp low rank oracle inequality for estimator ρε defined by (7). It is done in the case of trace regression model with bounded response (that is, under Assumption 3). The results of this type show some form of optimality of the estimation method, namely, that the estimator provides an optimal trade-offbetween the approximation error of the target density matrix by a low rank oracle and the estimation error of the oracle that is proportional to its rank. Sharp oracle inequalities (in which the leading constant in front of the approximation error is equal to 1, so that the bound mimics precisely the approximation by the oracle) are usually harder to prove. In the case of low rank matrix completion, the first result of this type was proved by Koltchinskii et al. (2011) for a modified least squares estimator with nuclear norm penalty. A version of such inequality for empirical risk minimization with nuclear norm penalty (that includes matrix LASSO) was first proved by Koltchinskii (2013b). Low rank oracle inequalities for von Neumann entropy penalized least squares method with the leading constant larger than 1 were proved by Koltchinskii (2011a). The main result of this section refines these previous bounds by proving a sharp oracle inequality, improving the logarithmic factors and removing superfluous assumptions, but also by establishing the inequality in the whole range of values of regularization parameter ε 0 (including the value ε = 0, for which ρε coincides with the least squares estimator ˆρ). In addition to this, for a special choice of regularization parameter ε, the theorem below also provides an upper bound on the Kullback-Leibler error K(ρ ρε) of ρε that matches the minimax lower bound (22) up to log-factors (and second order terms ). It turns out that, for this choice of ε, the estimator satisfies exactly the same low rank oracle inequality as the best inequalities known for LASSO estimator and minimax optimal error rates are attained for ρε also with respect to Hellinger distance and Schatten q-norm distances for all q [1, 2] (see Section 3.3). For simplicity, it will be assumed that constants U in Assumption 1 and U in Assumption 3 coincide (in the upper bounds, one can always replace U and U by U U). Koltchinskii and Xia Theorem 15 Suppose Assumption 3 holds with constant U = U and let ε [0, 1]. Then, there exists a constant C > 0 such that for all t 1 with probability at least 1 e t f ρε fρ 2 L2(Π) inf S Sm f S fρ 2 L2(Π) + C rank(S)m2ε2 log2(mn) +U2 rank(S)m log(2m) n + U2 t+log log2(2n) In particular, this implies that f ρε fρ 2 L2(Π) C rank(ρ)m2ε2 log2(mn) +U2 rank(ρ)m log(2m) n + U2 t+log log2(2n) Moreover, if ε := 1 log(mn) _ U2 log(2m) then, with some constant C and with probability at least 1 e t f ρε fρ 2 L2(Π) C U2 rank(ρ)m log(2m) 1 W U2 m log(2m) +U2 t+log log2(2n) K(ρ ρε) CU rank(ρ)m3/2 log(2m) log(mn) n n (t+log log2(2n)) log(mn) Proof The following notations will be used in the proof. Let ℓ(y, u) := (u y)2, y, u R be the quadratic loss function. For f : Hm 7 R, denote (ℓ f)(x, y) = (f(x) y)2, (ℓ f)(x, y) = 2(f(x) y) P(ℓ f) = E(Y f(X))2, Pn(ℓ f) = n 1 n X j=1 (Yj f(Xj))2. For A Hm, let f A(x) = A, x , x Hm. Since for density matrices S Sm, S 1 = tr(S) = 1, the estimator ρ = ρε can be equivalently defined by the following convex optimization problem: ρ = argmin S Sm Ln(S), Ln(S) := h Pn(ℓ f S) + εtr(S log S) + ε S 1 i for an arbitrary ε > 0. The following lemma will be crucial in the proofs of Theorem 15 as well Theorem 19 in the following subsection. Note that it does not rely on Assumption 3, only Assumptions 1 and 2 are needed. Low Rank Density Matrices Estimation Lemma 16 Suppose Assumptions 1 and 2 hold. Let δ (0, 1) and S := (1 δ)S + δ Im m , where S Sm, rank(S ) = r and Im is the m m identity matrix. Then the following bound holds: f ρ fρ 2 L2(Π) + 1 2 f ρ f S 2 L2(Π) + εK( ρ; S) + ε P L ( ρ) 1 f S fρ 2 L2(Π) + rm2ε2 log2(m/δ) + rm2 ε2 (38) +4 εδ + (P Pn)(ℓ f ρ)(f ρ f S). Lemma 16 will be often used together with the following simple bound: f S fρ 2 L2(Π) = 1 m2 S ρ 2 2 1 m2 S ρ 2 2 + 2 m2 S ρ 2 S S 2 + 1 m2 S S 2 2 (39) f S fρ 2 L2(Π) + 8δ m2 f S fρ 2 L2(Π) + 12δ Together, they imply that f ρ fρ 2 L2(Π) + 1 2 f ρ f S 2 L2(Π) + εK( ρ; S) + ε P L ( ρ) 1 f S fρ 2 L2(Π) + rm2ε2 log2(m/δ) + rm2 ε2 (40) +4 εδ + 12δ m2 + (P Pn)(ℓ f ρ)(f ρ f S). We will now give the proof of Lemma 16. Proof By standard necessary conditions of extremum in convex problems, we get that, for all S Sm and for some V ρ 1, Pn(ℓ f ρ)(f ρ f S) + ε log ρ, ρ S + ε V , ρ S 0 (see, e.g., Aubin and Ekeland 2006, Chapter 2, Corollary 6; see also Koltchinskii 2011b, pp. 198 199; for the computation of derivative of the function tr(S log S), see Lemma 1 in Koltchinskii 2011a). Replacing in the left hand side P by Pn, we get P(ℓ f ρ)(f ρ f S) + ε log ρ, ρ S + ε V , ρ S (P Pn)(ℓ f ρ)(f ρ f S). It is easy to check that for the quadratic loss P(ℓ f ρ)(f ρ f S) = P(ℓ f ρ) P(ℓ f S) + f ρ f S 2 L2(Π), implying that P(ℓ f ρ) P(ℓ f S) + f ρ f S 2 L2(Π) + ε log ρ, ρ S + ε V , ρ S (P Pn)(ℓ f ρ)(f ρ f S). Also, for the quadratic loss, P(ℓ f) P(ℓ fρ) = f fρ 2 L2(Π). Koltchinskii and Xia f ρ fρ 2 L2(Π) + f ρ f S 2 L2(Π) + ε log ρ, ρ S + ε V , ρ S f S fρ 2 L2(Π) + (P Pn)(ℓ f ρ)(f ρ f S). Recall that we have set S = (1 δ)S + δ Im m , where S Sm, rank(S ) = r, δ (0, 1). Clearly, V , S S V S S 1 S S 1 = δ S Im where we used the fact that V 1 for V ρ 1. This implies f ρ fρ 2 L2(Π) + f ρ f S 2 L2(Π) + ε log ρ, ρ S + ε V , ρ S (41) f S fρ 2 L2(Π) + 2 εδ + (P Pn)(ℓ f ρ)(f ρ f S). Recall formula (2) for the subdifferential of nuclear norm. Let L = supp(S ). By the duality between the operator and nuclear norms, there exists M Hm with M 1 such that P L (M), ρ S = M, P L ( ρ S ) = P L ( ρ S ) 1 = P L ( ρ) 1. With V = sign(S ) + P L (M) S 1, by monotonicity of subdifferential, we get that sign(S ), ρ S + P L ( ρ) 1 = V, ρ S V , ρ S . (42) In addition to this, we have log ρ, ρ S = log ρ log S, ρ S + log S, ρ S = K( ρ; S) + log S, ρ S . (43) Substituting (42) and (43) into (41), we get f ρ fρ 2 L2(Π) + f ρ f S 2 L2(Π) + εK( ρ; S) + ε P L ( ρ) 1 f S fρ 2 L2(Π) + ε log S, S ρ + ε sign(S ), S ρ (44) +2 εδ + (P Pn)(ℓ f ρ)(f ρ f S). The following bound on ε sign(S ), S ρ is straightforward: ε sign(S ), S ρ ε sign(S ), S ρ + ε sign(S ) S S 1 ε sign(S ) 2 S ρ 2 + 2 εδ ε rm f S f ρ L2(Π) + 2 εδ (45) 4 f S f ρ 2 L2(Π) + 2 εδ. A similar bound on ε log S, S ρ is only slightly more complicated. Suppose S has the following spectral representation: S = Pr k=1 λk Pk with eigenvalues λk (0, 1] (repeated with their multiplicities) and one-dimensional orthogonal eigenprojectors Pk. We will extend Pj, j = 1, . . . , r to the complete orthogonal resolution of the identity Pj, j = 1, . . . , m. Then log S = log (1 δ)S + δIm j=1 log (1 δ)λj + δ/m Pj + j=r+1 log(δ/m)Pj Low Rank Density Matrices Estimation j=1 log 1 + (1 δ)mλj/δ Pj + log(δ/m)Im log S, S ρ = r X j=1 log 1 + (1 δ)mλj/δ Pj, S ρ + log(δ/m) Im, S ρ j=1 log 1 + (1 δ)mλj/δ Pj, S ρ where we used the fact that Im, S ρ = tr(S) tr( ρ) = 0. Therefore, ε log S, S ρ ε Pr j=1 log 1 + (1 δ)mλj/δ Pj 2 S ρ 2 (46) = εm Pr j=1 log2 1 + (1 δ)mλj/δ 1/2 f S f ρ L2(Π) ε rm log(m/δ) f S f ρ L2(Π) rm2ε2 log2(m/δ) + 1 4 f S f ρ 2 L2(Π), where it was used that for λj [0, 1] log 1 + (1 δ)mλj/δ log δ + (1 δ)m Substituting bounds (45) and (46) in (44) we easily get bound (38), as claimed in the lemma. We will also need the following simple lemma that provides a bound on K(S ρ) in terms of K(S ρ). Let h(δ) := δ log 1 δ + (1 δ) log 1 1 δ. Observe that h(δ) = δ log 1 δ + (1 δ) log 1 + δ 1 δ δ + (1 δ) δ 1 δ δ log e (this bound will be used in what follows). Lemma 17 Let δ (0, 1), S Sm with rank(S ) = r and S = (1 δ)S + δ Im m . Then, for any U Sm, K(S U) K(S U) + h(δ) Proof The following identities are straightforward: K(S U) = tr(S(log S log U)) = (1 δ)tr(S (log S log U)) + δtr((Im/m)(log S log U)) = (1 δ)tr(S (log S log U)) + (1 δ)tr(S (log S log S )) +δtr((Im/m)(log S log(Im/m))) + δtr((Im/m)(log(Im/m) log U)) = (1 δ)K(S U) (1 δ)K(S S) + δK(Im/m U) δK(Im/m S). Koltchinskii and Xia Since K(Im/m U) 0, it follows that K(S U) K(S U) 1 δ + K(S S) + δ 1 δK(Im/m S). (47) Assuming that S has spectral representation S = Pr j=1 λj Pj with eigenvalues λj > 0 and one-dimensional projectors Pj, we get j=1 λj log (1 δ)λj + δ/m j=1 λj log 1 δ + δ mλj j=1 λj = log(1 δ), implying that K(S S) log 1 1 δ. On the other hand, K(Im/m S) = 1 j=1 log 1/m (1 δ)λj + δ/m 1 Substituting these bounds in (47) yields the result. To complete the proof of Theorem 15, we need to control the empirical process (P Pn)(ℓ f ρ)(f ρ f S) in the right hand side of bound (38). Our approach is based on the following empirical processes bound that is a slight modification of Lemma 1 in Koltchinskii (2013b). As before, we assume that S = (1 δ)S + δ Im m with S Sm, rank(S ) = r. We will set δ := 1 m2n2 . Let Ξε := n 1 Pn j=1 εj Xj, where εj are i.i.d. Rademacher random variables (that is, εj takes values +1 and 1 with probability 1/2 each) and {εj}, {Xj} are independent. Lemma 18 Given δ1, δ2 > 0, denote αn(δ1, δ2) := sup (Pn P)(ℓ f A)(f A f S) : A Sm, f A f S L2(Π) δ1, P L A 1 δ2 Let 0 < δ 1 < δ+ 1 , 0 < δ 2 < δ+ 2 . For t 1, denote t := t + log [log2(δ+ 1 /δ 1 )] + 2 + log [log2(δ+ 2 /δ 2 )] + 2 + log 3. Then, with probability at least 1 e t, for all δ1 [δ 1 , δ+ 1 ], δ2 [δ 2 , δ+ 2 ], αn(δ1, δ2) C1UE Ξε rmδ1 + δ2 + δ + C2Uδ1 t n + C3U2 t where C1, C2, C3 > 0 are constants. Low Rank Density Matrices Estimation We will use this lemma to control the term (P Pn)(ℓ f ρ)(f ρ f S) in bound (38). Let δ1 := f ρ f S L2(Π) and δ2 := P L ρ 1. Define also m, δ+ 2 := 1, δ 1 = δ 2 := 1 mn, so that t t + 2 log(log2(mn) + 3) + log 3. It is easy to see that δ1 δ+ 1 and δ2 δ+ 2 . If, in addition, δ1 δ 1 , δ2 δ 2 , the bound of Lemma 18 implies that with probability at least 1 e t (P Pn)(ℓ f ρ)(f ρ f S) αn(δ1, δ2) C1UE Ξε rmδ1 + δ2 + δ + C2Uδ1 t n + C3U2 t n If ε C1UE Ξε , the last bound implies that (P Pn)(ℓ f ρ)(f ρ f S) 4 f ρ f S 2 L2(Π) + rm2 ε2 + ε P L ρ 1 + εδ (48) 4 f ρ f S 2 L2(Π) + (C2 2 + C3)U2 t Substituting this bound in the right hand side of (40), we get f ρ fρ 2 L2(Π) + εK( ρ; S) f S fρ 2 L2(Π) + rm2ε2 log2(m/δ) + 2rm2 ε2 (49) +5 εδ + CU2 t where C := C2 2 + C3. In the case when δ1 = f ρ f S L2(Π) δ 1 = 1 mn or δ2 = P L ρ 1 δ 2 = 1 mn, we can replace the terms 1 4 f ρ f S 2 L2(Π) or P L ρ 1 in bound (48) by their respective upper bounds 4(δ 1 )2 = 1 4m2n2 , or δ 2 = 1 mn), which would be smaller than CU2 t n for large enough C > 0, so bound (49) still holds (recall that U m 1/2). Note also that 12δ m2 = 12 1 m4n2 12U2 t n. Thus, increasing the value of constant C, one can rewrite (49) in a simpler form as f ρ fρ 2 L2(Π) + εK( ρ; S) f S fρ 2 L2(Π) + rm2ε2 log2(m/δ) + 2rm2 ε2 (50) +5 εδ + CU2 t The following expectation bound is a consequence of a matrix version of Bernstein inequality for Ξε (it follows by integrating out its exponential tails): _ U log(2m) (it is also used in this computation that, in the case of uniform sampling from an orthonormal basis, σ2 εX = EX2 = 1 m, a simple fact often used in the literature; see, e.g., Koltchinskii 2011a, Section 5). Let Koltchinskii and Xia for some constant D . If D is sufficiently large and then the condition ε C1UE Ξε is satisfied and bound (50) holds with probability at least 1 e t. Moreover, εδ D δ D U2 t n, implying that the term 5 εδ in (50) can be dropped at a price of further increasing the value of constant C. If (51) does not hold, we still have that f ρ fρ 2 L2(Π) = ρ ρ 2 2 m2 2 Recalling that t t + 2 log(log2(mn) + 3) and log(m/δ) log(mn), we deduce from (50) that with some constant C and with probability at least 1 e t f ρ fρ 2 L2(Π) f S fρ 2 L2(Π) + C rm2ε2 log2(mn) +U2 rm log(2m) n + U2 t+log(log2(mn)+3) Note that, for n 2, log(log2(mn) + 3) = log log2(4m) + log2(2n) log log2(4m) + log log2(2n), (53) since log2(4m) + log2(2n) log2(4m) log2(2n). Since also, for r 1, U2 t + log log2(4m) n U2 rm log(2m) we can replace in bound (52) the term U2 t+log(log2(mn)+3) n with the term U2 t+log log2(2n) n (increasing the value of the constant C accordingly). This yields bound (34) of the theorem. For S = ρ, it yields bound (35), and, moreover, for S = ρ and S = (1 δ)ρ + δ Im m with δ = 1 m2n2 , bound (50) also implies that εK( ρ; S) rank(ρ)m2ε2 log2(m/δ) + 2rank(ρ)m2 ε2 (55) +5 εδ + CU2 t We will now take _ U2 log(2m) for a large enough constant D so that ε C1UE Ξε . Assume that ε := 1 log(mn) _ U2 log(2m) Low Rank Density Matrices Estimation As before, the term εδ in bound (55) will be absorbed by the term CU2 t n with a larger value of C and also rank(ρ)m2ε2 log2(m/δ) D rank(ρ)m2 ε2 D U2 rank(ρ)m log(2m) 1 _ U2 m log(2m) As a result, taking into account (53), (54), bound (55) can be rewritten as follows: εK( ρ; S) CU2 rank(ρ)m log(2m) 1 W U2 m log(2m) +t+log log2(2n) Using the bound of Lemma 17 along with the bound h(δ) δ log(e/δ) = 1 m2n2 log(em2n2) U rm n (t + log log2(2n)) log(mn) p we easily get that (37) holds. 3.2 Oracle Inequalities for Trace Regression with Gaussian Noise In this subsection, we establish oracle inequalities for the von Neumann entropy penalized least squares estimator ρε in the case of trace regression model with Gaussian noise (Assumption 4). Unlike in the case of Theorem 15 of the previous section, our aim is not to obtain sharp oracle inequality, but rather to get a clean main term of the random error bound part of the inequality, namely, the term σ2 ξ rank(S)m(t+log(2m)) n in inequality (58) below. Note that this term depends only on the variance of the noise σ2 ξ, but not on the constant U from Assumption 1 (the constant U is involved only in the higher order O(n 2) terms of the bound). Note also that there are no constraints on the variance σ2 ξ that could be arbitrarily small, or even equal to 0 (in which case only higher order terms are present in the bound). This improvement comes at a price of having the leading constant 2 in the oracle inequality and also of imposing assumption (57) that requires the regularization parameter ε to be bounded away from 0 (again, unlike Theorem 15, where it could be arbitrarily small). As in the previous section, we also obtain a bound on Kullback Leibler divergence K(ρ ρε). Theorem 19 Let t 1. Suppose ε DU2 t + log3 m log2 n n , D1σξ log(mn) t + log(2m) _ DU2 t + log3 m log2 n with large enough constants D, D1 > 0. There exists a constant C > 0 such that with probability at least 1 e t f ρε fρ 2 L2(Π) inf S Sm 2 f S fρ 2 L2(Π) + C σ2 ξ rank(S)m(t + log(2m)) + σ2 ξU2 rank(S)m2(t + log(2m))2 log(2m) n2 + U4 rank(S)m2(t + log3 m log2 n)2 log2(mn) Koltchinskii and Xia In particular, f ρε fρ 2 L2(Π) C σ2 ξ rank(ρ)m(t+log(2m)) +σ2 ξU2 rank(ρ)m2(t+log(2m))2 log(2m) n2 + U4 rank(ρ)m2(t+log3 m log2 n)2 log2(mn) Moreover, if ε := D1σξ log(mn) t + log(2m) _ DU2 t + log3 m log2 n for large enough constants D, D1, then with some constant C and with the same probability both (59) and the following bound hold: K(ρ ρε) C σξ rank(ρ)m3/2(t+log(2m))1/2 log(mn) n (60) +σ2 ξ rank(ρ)m2(t+log(2m)) log(2m) n + U2 rank(ρ)m2(t+log3 m log2 n) log2(mn) Proof As in in the proof of Theorem 15, we rely on Lemma 16, but we use a different approach to bounding the empirical process (P Pn)(ℓ f ρ)(f ρ f S). The following identity follows from the definition of quadratic loss ℓ (ℓ f)(x, y)(f(x) f S(x)) = 2(f(x) f S(x))2 + 2(f S(x) y)(f(x) f S(x)) and it implies that (P Pn)(ℓ f ρ)(f ρ f S) = 2(Pn P)(f ρ f S)2 2 Ξ, ρ S (61) Ξ := n 1 n X j=1 (f S(Xj) Yj)Xj E(f S(X) Y )X. We will bound (Pn P)(f ρ f S)2 in representation (61) as follows: (Pn P)(f ρ f S)2 ρ S 2 1βn f ρ f S L2(Π) βn( ) := sup (Pn P)(f2 A) : A Hm, A 1 1, f A L2(Π) . The next lemma provides a bound on βn( ). Its proof is somewhat involved and it will not be given here. It is based on Rudelson s L (Pn) generic chaining bound for empirical processes indexed by squares of functions and on the ideas of the paper by Gu edon et al. (2008) combined with Talagrand s concentration inequality (see also Aubrun 2009, Liu 2011 and Theorem 3.16, Lemma 9.8 and Proposition 9.2 in Koltchinskii 2011b for similar arguments). Low Rank Density Matrices Estimation Lemma 20 Given 0 < δ < δ+ and t 1, let t := t + log log2(δ+/δ ) + 3 . Then, with some constant C and with probability at least 1 e t, the following bound holds for all [δ , δ+] : βn( ) C U log3/2 m log n n + U2 log3 m log2 n We will use Lemma 20 to control βn( ) for := f ρ f S L2(Π) ρ S 1 . Let δ+ := 1 m and δ := 1 mn. With this choice, t t + log(log2 n + 3). Note that for A = ρ S ρ S 1 , f A L2(Π) = m = m 1 = δ+. If also f A L2(Π) δ , then we can substitute bound (63) on βn( ) into (62) that yields: (Pn P)(f ρ f S)2 C f ρ f S L2(Π) ρ S 1U log3/2 m log n n + ρ S 2 1U2 log3 m log2 n n + f ρ f S L2(Π) ρ S 1U q + ρ S 2 1U2 t 32 f ρ f S 2 L2(Π) + 8(C2 + C/8)U2 log3 m log2 n n ρ S 2 1 (64) 32 f ρ f S 2 L2(Π) + 8(C2 + C/8)U2 t 16 f ρ f S 2 L2(Π) + C U2 log3 m log2 n+ t where C := 8(C2 + C/8). If, on the other hand, f A L2(Π) δ = 1 mn, then f ρ f S L2(Π) in the above bound can be replaced by 1 mn ρ S 1 and the proof that follows only simplifies since 1 16 f ρ f S 2 L2(Π) 1 16 1 m2n2 ρ S 2 1 1 16U2 log3 m log2 n + t Another term in the right hand side of representation (61) to be controlled is Ξ, ρ S . Note that Ξ = Ξ1 + Ξ2, where Ξ1 := n 1 n X Ξ2 := n 1 n X j=1 (f S(Xj) fρ(Xj))Xj E(f S(X) fρ(X))X. Recall that S = (1 δ)S +δ Im m with S Sm, rank(S ) = r, supp(S ) = L and δ = 1 m2n2 . Koltchinskii and Xia The term with Ξ1 is controlled as follows: Ξ1, ρ S PL(Ξ1), ρ S + Ξ1, P L ( ρ S ) + P L (Ξ1), S S PL(Ξ1) 2 ρ S 2 + Ξ1 P L ( ρ) 1 + P L (Ξ1) S S 1 2rm Ξ1 f ρ f S L2(Π) + Ξ1 P L ( ρ) 1 + 4δ Ξ1 (65) 32rm2 Ξ1 2 + 1 16 f ρ f S 2 L2(Π) + Ξ1 P L ( ρ) 1 + 4δ Ξ1 . We also have Ξ2, ρ S Ξ2 ρ S 1 Ξ2 ρ S 1 + Ξ2 S S 1 Ξ2 ρ S 1 + 2δ Ξ2 . (66) Thus, Ξ, ρ S 32rm2 Ξ1 2 + 1 16 f ρ f S 2 L2(Π) + Ξ1 P L ( ρ) 1 + 4δ Ξ1 + Ξ2 ρ S 1 + 2δ Ξ2 . (67) It follows from (61), (64) and (67) that with some constant C (P Pn)(ℓ f ρ)(f ρ f S) 1 4 f ρ f S 2 L2(Π) + C U2 log3 m log2 n+ t n ρ S 2 1 (68) +64rm2 Ξ1 2 + 2 Ξ1 P L ( ρ) 1 + 8δ Ξ1 +2 Ξ2 ρ S 1 + 4δ Ξ2 . This bound will be substituted in (38). Note that, if assumption (57) on ε holds with a sufficiently large constant D, then we have ε 8C U2 log3 m log2 n + t (this follows from the fact that t t+log(log2 n+3) t+c log3 m log2 n for some constant c > 0). Assume also that ε 4 Ξ1 and recall that K( ρ; S) 1 4 ρ S 2 1 (see inequality 8). Taking all this into account, (38) implies that f ρ fρ 2 L2(Π) + 1 4 f ρ f S 2 L2(Π) + ε 2K( ρ; S) + ε f S fρ 2 L2(Π) + rm2ε2 log2(m/δ) + 5rm2 ε2 + 6 εδ (69) +2 Ξ2 ρ S 1 + 4 Ξ2 δ. It remains to control Ξ1 and Ξ2 . To this end, we use matrix versions of Bernstein inequality. To bound Ξ2 , we use its standard version which yields that with probability Low Rank Density Matrices Estimation at least 1 e t Ξ2 2 E(f S(X) fρ(X))2X2 1/2 W (f S(X) fρ(X)) X L where L denotes the essential supremum norm in the space of random variables. Since E(f S(X) fρ(X))2X2 U2 f S fρ 2 L2(Π) and (f S(X) fρ(X)) X L 2U2, Ξ2 4 f S fρ L2(Π)U q n + U2 t+log(2m) This implies that 2 Ξ2 ρ S 1 f S fρ 2 L2(Π) + 16U2 t+log(2m) n ρ S 2 1 (71) +8U2 t+log(2m) 16U2 t+log(2m) n ρ S 2 1 16U2 t+log(2m) n ρ S 2 1 + 16U2 t+log(2m) n (4δ + δ2) (72) 8U2 t+log(2m) 8U2 t+log(2m) n P L ρ 1 + 8U2 t+log(2m) n PL( ρ S ) 1 (73) 8U2 t+log(2m) n P L ρ 1 + 8U2 t+log(2m) n PL( ρ S) 1 + 16U2 t+log(2m) Since, for some constant C > 0, 8U2 t+log(2m) n PL( ρ S) 1 8 2U2 t+log(2m) n r PL( ρ S) 2 2U2 t+log(2m) n rm f ρ f S L2(Π) 1 4 f ρ f S 2 L2(Π) + C U4 rm2(t+log(2m))2 it follows from (71), (72) and (73) that 2 Ξ2 ρ S 1 f S fρ 2 L2(Π) + +16U2 t+log(2m) n ρ S 2 1 + 16U2 t+log(2m) n (4δ + δ2) (74) +8U2 t+log(2m) n P L ρ 1 + 16U2 t+log(2m) 4 f ρ f S 2 L2(Π) + C U4 rm2(t+log(2m))2 Koltchinskii and Xia Note that (70) also implies that n + U2 t+log(2m) (since f S fρ L2(Π) m 1 S ρ 2 2m 1). Let us substitute (74) and (75) in the last line of (69). Assume that ε 16U2 t + log(2m) n and that constant D in assumption (57) is large enough so that 16U2 t + log(2m) n ρ S 2 1 ε (recall inequality 8). It easily follows that with some constants C1, C2, f ρ fρ 2 L2(Π) + ε 2 f S fρ 2 L2(Π) + C1rm2ε2 log2(m/δ) + 5rm2 ε2 (76) +C2 εδ + 32 U (note that the term C U4 rm2(t+log(2m))2 n2 of bound (74) is absorbed by the term C1rm2ε2 log2(m/δ) of bound (76) provided that constant C1 is large enough). Since δ = 1 m2n2 U2 t + log(2m) (recall that U2 m 1), we have εδ ε2. Also, since U m 1/2, t + log(2m) t + log(2m) n 1 m3n2 U4 t + log(2m) Therefore, (76) implies that with some constant C f ρ fρ 2 L2(Π) + ε 2 f S fρ 2 L2(Π) + C rm2ε2 log2(m/δ) + rm2 ε2 . (77) To bound Ξ1 , we use a version of matrix Bernstein type inequality due to Koltchinskii (2011b) (see bound (2.7) of Theorem 2.7). Its version for α = 2 (with U(α) Uσξ) implies that for some constant K > 0 with probability at least 1 e t t + log(2m) _ σξU (t + log(2m)) log1/2(2Um1/2) t + log(2m) _ (σξ U)U (t + log(2m)) log1/2(2m) Low Rank Density Matrices Estimation with a sufficiently large constant D2 to satisfy the condition Ξ1 4 ε with probability at least 1 e t (the rest of the assumptions we made on ε are also satisfied with this choice). Bound (77) then implies that with some constant C and with probability at least 1 3e t the following inequality holds: f ρε fρ 2 L2(Π) 2 f S fρ 2 L2(Π) + C σ2 ξ rm(t + log(2m)) n + σ2 ξU2 rm2(t + log(2m))2 log(2m) + U4 rm2(t + log3 m log2 n)2 log2(mn) Using bound (39) to replace S in f S fρ 2 L2(Π) with S and adjusting the value of constant C to rewrite the probability bound as 1 e t, it is easy to complete the proof of (58). If S = ρ, this also yields bound (59). Moreover, with a larger value of regularization parameter ε := D1σξ log(mn) t + log(2m) _ DU2 t + log3 m log2 n bound (77) and Lemma 17 easily imply bound (60). 3.3 Optimality Properties of von Neumann Entropy Penalized Estimator ρϵ We start with upper bounds on the error of estimator ρϵ (von Neumann entropy penalized least squares estimator defined by (7)) in Hellinger, Kullback-Leibler and Schatten q-norm distances for q [1, 2] for the trace regression model with Gaussian noise (Assumption 4). To avoid the impact of second order terms on the upper bounds, we will make the following simplifying assumptions: n log m 1 and U2 rm n log5/2 m log2 n log(mn) σξ. (80) Recall that, for the Pauli basis, U = m 1/2, so, the above assumptions hold if n log2 m and σξ is larger than 1 mn (times a logarithmic factor). We will choose regularization parameter ε as follows: ε := D1σξ log(mn) with a sufficiently large constant D1 > 0. The next result shows that minimax rates of Theorem 4 are attained up to logarithmic factors for the estimator ρε. Theorem 21 There exists a constant C > 0 such that the following bounds hold for all r = 1, . . . , m, for all ρ Sr,m and for all q [1, 2] with probability at least 1 m 2 : ρε ρ q C σξm 3 2 r1/q n log m log(2 q)/q(mn) σξm3/2 q (log m) 1 2 1 Koltchinskii and Xia H2( ρε, ρ) C σξm 3 2 r n log m log(mn) 2 (83) K(ρ ρε) C σξm 3 2 r n log m log(mn). (84) Proof We will need the following simple lemma. Lemma 22 For all ρ Sm and all l = 1, . . . , m, there exists ρ Sl,m such that Proof Suppose that ρ = Pm j=1 λj Pj, where λj are the eigenvalues of ρ repeated with their multiplicities and Pj are orthogonal one-dimensional projectors. Note that {λj : j = 1, . . . , m} is a probability distribution on the set {1, . . . , m}. Let ν be a random variable sampled from this distribution and ν1, . . . , νl be its i.i.d. copies. Then EPν = ρ and 2 = E Pν ρ 2 2 l = E Pν 2 2 ρ 2 2 l = 1 ρ 2 2 l 1 Therefore, there exists a realization ν1 = k1, . . . , νl = kl of r.v. ν1, . . . , νl such that Denote ρ := l 1 Pl j=1 Pkj. Then, ρ Sl,m and ρ ρ 2 2 1 First, we will prove bound (82) for q = 2. To this end, we use oracle inequality (58) with t = 2 log m + log 2 and with oracle S = ρ Sl,m such that ρ ρ 2 2 1 l . Under simplifying assumptions (80) it yields that with probability at least 1 1 ρε ρ 2 2 = m2 f ρε fρ 2 L2(Π) 1 l + τ 2l log m , where τ := σξm3/2 n . On the other hand, using the same inequality with S = ρ Sr,m yields the bound ρε ρ 2 2 τ 2r log m that also holds with probability at least 1 1 2m 2. Therefore, with probability at least 1 m 2 l + τ 2l log m τ 2r log m. (85) Let l = 1 τ log m. If l [1, m], set l := [ l]. Otherwise, if l > m, set l := m and, if l < 1, set l := 1. An easy computation shows that with such a choice of l bound (85) implies (82) for q = 2. Low Rank Density Matrices Estimation Next we use bound (60) that, for t = 2 log m, implies under assumptions (80) that with some constant C and with probability at least 1 m 2 K(ρ ρε) Cσξ rm3/2 log m log(mn) n , (86) which is bound (84). Bound (83) also holds in view of inequality (8). Now, we prove bound (82) for q = 1 (the bound for q [1, 2] will then follow by interpolation). To this end, we will use the following lemma (see Proposition 1 in Koltchinskii 2011a) that shows that if two density matrices are close in Hellinger distance and one of them is concentrated around a subspace L, then another one is also concentrated around L. Lemma 23 For any L Cm and all S1, S2 Sm, P L S1 1 2 P L S2 1 + 2H2(S1, S2). We apply this lemma to S1 = ρε, S2 = ρ and L = supp(ρ) so that P L ρ = 0. It yields that P L ρε 1 2H2( ρε, ρ). ρε ρ 1 PL( ρε ρ) 1+ P L ( ρε ρ) 1 2r ρε ρ 2+ P L ρε 1 2r ρε ρ 2+2H2( ρε, ρ). (87) Using bounds (82) for q = 2 and (83), we get from (87) that ρε ρ 1 C σξm 3 2 r n log m log(mn) 2, (88) which is equivalent to (82) for q = 1. Note that by choosing t = 2 log m + log 2 + 2 (which might have an impact only on the constant), we could make probability bounds in (82) for q = 2 and (83) to be at least 1 1 2m 2 implying that (88) holds with probability at least 1 m 2, as it is claimed in the theorem. To complete the proof, it is enough to use the interpolation inequality of Lemma 1. It follows that, for q (1, 2), ρε ρ q ρε ρ 2 q 1 1 ρε ρ 2 2 Substituting bound (82) for q = 1 and q = 2 into the last inequality yields the result for an arbitrary q (1, 2). Similarly, in the case of trace regression with bounded response (see Assumption 3), minimax rates of Theorem 7 are also attained for the estimator ρε (up to log factors). In this case, assume that Assumption 3 holds with U = U and, in addition, let us make the following simplifying assumptions: n 1 and log log2 n m log m. (89) Koltchinskii and Xia For the Pauli basis (U = m 1/2), the first assumption holds if n log m. The second assumption does hold unless n is extremely large (n 2exp{m log m}). Under these assumptions, we will use the following value of regularization parameter ε : ε := U log(mn) The following version of Theorem 21 holds in the bounded regression case (with a similar proof). Theorem 24 There exists a constant C > 0 such that the following bounds hold for all r = 1, . . . , m, for all ρ Sr,m and for all q [1, 2] with probability at least 1 m 2 : ρε ρ q C Um 3 2 r1/q n log m log(2 q)/q(mn) Um3/2 q (log m) 1 2 1 H2( ρε, ρ) C Um 3 2 r n log m log(mn) 2 (91) K(ρ ρε) C Um 3 2 r n log m log(mn). (92) Remark 25 In the case of Pauli basis, the minimax optimal rates (up to constants and logarithmic factors) are: mr1/q n ( m n)1 1 q 2 for Schatten q-norm distances for q [1, 2]; mr n for nuclear norm, squared Hellinger and Kullback-Leibler distances (provided the mr n). Jean-Pierre Aubin and Ivar Ekeland. Applied Nonlinear Analysis. Courier Corporation, 2006. Guillaume Aubrun. On almost randomizing channels with a short Kraus decomposition. Communications in Mathematical Physics, 288(3):1103 1116, 2009. Vladim ır Buˇzek. Quantum tomography from incomplete data via maxent principle. In Quantum State Estimation, pages 189 234. Springer, 2004. Tony Cai, Donggyu Kim, Yazhen Wang, Ming Yuan, and Harrison H Zhou. Optimal largescale quantum state tomography with Pauli measurements. http://www-stat.wharton. upenn.edu/ tcai/paper/Estimating-Density-Matrix-Pauli.pdf, 2015. Emmanuel J Cand es and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342 2359, 2011. Victor H. de la Pe na and Evarist Gin e. Decoupling. From Dependence to Independence. Springer, 1999. Low Rank Density Matrices Estimation Steven T Flammia, David Gross, Yi-Kai Liu, and Jens Eisert. Quantum tomography via compressed sensing: error bounds, sample complexity and efficient estimators. New Journal of Physics, 14(9):095022, 2012. David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548 1566, 2011. David Gross, Yi-Kai Liu, Steven T Flammia, Stephen Becker, and Jens Eisert. Quantum state tomography via compressed sensing. Physical Review Letters, 105(15):150401, 2010. Olivier Gu edon, Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Majorizing measures and proportional subsets of bounded orthonormal systems. Revista Matem atica Iberoamericana, 24(3):1075 1095, 2008. Amir Kalev, Robert L Kosut, and Ivan H Deutsch. Informationally complete measurements from compressed sensing methodology. ar Xiv preprint ar Xiv:1502.00536, 2015. Hartmut Klauck, Ashwin Nayak, Amnon Ta-Shma, and David Zuckerman. Interaction in quantum communication. IEEE Transactions on Information Theory, 53(6):1970 1982, 2007. Vladimir Koltchinskii. von Neumann entropy penalization and low-rank matrix estimation. The Annals of Statistics, 39(6):2936 2973, 2011a. Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d Et e de Probabilit es de Saint-Flour XXXVIII-2008. Springer, 2011b. Vladimir Koltchinskii. A remark on low rank matrix recovery and noncommutative Bernstein type inequalities. In From Probability to Statistics and Back: High-Dimensional Models and Processes A Festschrift in Honor of Jon A. Wellner, pages 213 226. Institute of Mathematical Statistics, 2013a. Vladimir Koltchinskii. Sharp oracle inequalities in low rank estimation. In Empirical Inference, pages 217 230. Springer, 2013b. Vladimir Koltchinskii and Dong Xia. Schatten p-norm distances in low rank density matrix estimation. 2015+. Vladimir Koltchinskii, Karim Lounici, and Alexandre B Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302 2329, 2011. Yi-Kai Liu. Universal low-rank matrix recovery from Pauli measurements. In Advances in Neural Information Processing Systems, pages 1638 1646, 2011. Zongming Ma and Yihong Wu. Volume ratio, sparsity, and minimaxity under unitarily invariant norms. In Information Theory Proceedings (ISIT), 2013 IEEE International Symposium, pages 1027 1031. IEEE, 2013. Koltchinskii and Xia Sahand Negahban and Martin J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 2010. M.A. Nielsen and I.L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2000. Alain Pajor. Metric entropy of the Grassmann manifold. Convex Geometric Analysis, 34: 181 188, 1998. Angelika Rohde and Alexandre B Tsybakov. Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887 930, 2011. Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389 434, 2012. Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.