# The Recht–Ré Noncommutative Arithmetic-Geometric Mean Conjecture is False

Zehua Lai 1, Lek-Heng Lim 1

*Equal contribution. 1 Computational and Applied Mathematics Initiative, University of Chicago, Chicago, IL 60637, USA. Correspondence to: Zehua Lai. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Stochastic optimization algorithms have become indispensable in modern machine learning. An unresolved foundational question in this area is the difference between with-replacement sampling and without-replacement sampling: does the latter have a superior convergence rate compared to the former? A groundbreaking result of Recht and Ré reduces the problem to a noncommutative analogue of the arithmetic-geometric mean inequality where n positive numbers are replaced by n positive definite matrices. If this inequality holds for all n, then without-replacement sampling (also known as random reshuffling) indeed outperforms with-replacement sampling in some important optimization problems. The conjectured Recht–Ré inequality has so far only been established for n = 2 and a special case of n = 3. We will show that the Recht–Ré conjecture is false for general n. Our approach relies on the noncommutative Positivstellensatz, which allows us to reduce the conjectured inequality to a semidefinite program and the validity of the conjecture to certain bounds for the optimum values, which we show are false as soon as n = 5.

1. Introduction

The breathtaking reach of deep learning, permeating every area of science and technology, has led to an outsize role for randomized optimization algorithms. It is probably fair to say that in the absence of randomized algorithms, deep learning would not have achieved its spectacular level of success. Fitting an exceedingly high-dimensional model with an exceedingly large training set would have been prohibitively expensive without some form of random sampling, which in addition provides other crucial benefits such as saddle-point avoidance (Fang et al., 2019; Jin et al., 2017). As such, in machine learning computations, stochastic variants of gradient descent (Bottou, 2010; Johnson & Zhang, 2013; Nemirovski et al., 2008), alternating projections (Strohmer & Vershynin, 2009), coordinate descent (Nesterov, 2012), and other algorithms have largely overtaken their classical deterministic counterparts in relevance and utility.

There are numerous random sampling strategies, but the most fundamental question, before all other considerations, is deciding between sampling with replacement and sampling without replacement. In the vast majority of randomized algorithms, a random sample is selected or a random action is performed with replacement from a pool, making the randomness in each iteration independent and thus easier (often much easier) to analyze. However, when it comes to practical realizations of these algorithms, one invariably samples without replacement, since this is easier (often much easier) to implement. Take the ubiquitous stochastic gradient descent for example: many if not most implementations pass through each item exactly once in a random order; this is sampling without replacement, also called random reshuffling. Likewise, in implementations of randomized coordinate descent, coordinates are usually just chosen in a random order, which is again sampling without replacement.
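To make the distinction concrete, here is a minimal sketch (ours, not from the paper) of the two index-sampling schemes inside a stochastic gradient loop for a generic finite-sum objective; the function names, step size, and the toy least-squares instance are illustrative assumptions.

```python
import numpy as np

def sgd(grad, x0, n, step, epochs, scheme="reshuffle", seed=0):
    """Minimal SGD loop on a finite sum of n terms.

    grad(i, x) returns the gradient of the i-th summand at x.
    scheme="with_replacement": an independent index is drawn at every step.
    scheme="reshuffle": each epoch passes once through a fresh random
    permutation of the indices, i.e., sampling without replacement.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(epochs):
        if scheme == "with_replacement":
            order = rng.integers(0, n, size=n)   # repeats allowed
        else:
            order = rng.permutation(n)           # each index exactly once
        for i in order:
            x -= step * grad(i, x)
    return x

# Toy least-squares instance: minimize (1/2) sum_i (c_i^T x - b_i)^2.
rng = np.random.default_rng(1)
C, b = rng.standard_normal((50, 5)), rng.standard_normal(50)
g = lambda i, x: (C[i] @ x - b[i]) * C[i]
x_rr = sgd(g, np.zeros(5), 50, 0.01, 200, "reshuffle")
x_wr = sgd(g, np.zeros(5), 50, 0.01, 200, "with_replacement")
```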
Apart from its ease of implementation, there are other reasons for favoring without-replacement sampling. Empirical evidence (Bottou, 2009; Beck & Tetruashvili, 2013; Wright, 2015) suggests that in stochastic gradient descent, without-replacement sampling regularly outperforms with-replacement sampling. Theoretical results also point towards without-replacement sampling: under standard convexity assumptions, the convergence rate of a without-replacement sampling algorithm typically beats that of a with-replacement sampling one by a factor of $O(1/n)$ (Shamir, 2016; Nagaraj et al., 2019; Gürbüzbalaban et al., 2019).

Recht & Ré (2012) proposed a matrix-theoretic approach to compare the efficacy of with- and without-replacement sampling methods. Since nearly every common optimization algorithm, deterministic or randomized, works with a linear or quadratic approximation of the objective function locally, it suffices to examine the two sampling strategies on linear or quadratic functions to understand their local convergence behaviors. In this case, the iteration reduces to matrix multiplication and both sampling procedures are linearly convergent (often called exponentially convergent in machine learning). The question of which is better then reduces to comparing their linear convergence rates. In this context, Recht & Ré (2012) showed that without-replacement sampling outperforms with-replacement sampling provided the following noncommutative version of the arithmetic-geometric mean inequality holds.

Conjecture 1 (Recht & Ré, 2012). Let n be a positive integer, $A_1, \dots, A_n$ be symmetric positive semidefinite matrices, and $\|\cdot\|$ be the spectral norm. Then for any $m \le n$,
$$\biggl\|\frac{1}{n^m}\sum_{1\le j_1,\dots,j_m\le n} A_{j_1}\cdots A_{j_m}\biggr\| \;\ge\; \frac{(n-m)!}{n!}\,\biggl\|\sum_{\substack{1\le j_1,\dots,j_m\le n\\ j_1,\dots,j_m\ \text{distinct}}} A_{j_1}\cdots A_{j_m}\biggr\|. \tag{1}$$

While one may also ask if (1) holds for other norms, the most natural and basic choice is the spectral norm, i.e., the operator 2-norm. Unless specified otherwise, $\|\cdot\|$ will always denote the spectral norm in this article.

To give an inkling of how (1) arises, consider the Kaczmarz algorithm (Strohmer & Vershynin, 2009), where we attempt to solve an overdetermined linear system $Cx = b$, $C \in \mathbb{R}^{p\times d}$, $p > d$, with $i$th row vector $c_i^{\mathsf T}$, where $c_i \in \mathbb{R}^d$. (We adopt the standard convention that any vector $x \in \mathbb{R}^d$ is a column vector; a row vector will always be denoted $x^{\mathsf T}$.) For $k = 1, 2, \dots$, the $(k+1)$th iterate is formed with a randomly chosen $i$ and
$$x^{(k+1)} = x^{(k)} + \frac{b_i - \langle c_i, x^{(k)}\rangle}{\|c_i\|^2}\, c_i.$$
The $k$th error $e^{(k)} = x^{(k)} - x$ is then
$$e^{(k+1)} = \Bigl(I - \frac{c_i c_i^{\mathsf T}}{\|c_i\|^2}\Bigr) e^{(k)} =: P_{c_i}\, e^{(k)},$$
where $P_c \in \mathbb{R}^{d\times d}$, the orthogonal projector onto $\operatorname{span}\{c\}^{\perp}$, is clearly symmetric positive semidefinite. A careful analysis would show that the relative efficacy of with- and without-replacement sampling depends on a multitude of inequalities like
$$\|A^4 + B^4 + AB^2A + BA^2B\| \;\ge\; 2\,\|AB^2A + BA^2B\|,$$
which are difficult to analyze on a case-by-case basis. Nevertheless, more general heuristics (Recht & Ré, 2012) would lead to (1): if it holds, then without-replacement sampling is expected to outperform with-replacement sampling. In fact, the gap can be significant: for random Wishart matrices, Recht & Ré (2012) showed that the ratio between the two sides of (1) increases exponentially with m.

Current status: Recht & Ré (2012) applied results from random matrix theory to show that Conjecture 1 holds with high probability for (i) independent Wishart matrices, and (ii) the incremental gradient method.
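For readers who wish to see what (1) compares, the following brute-force evaluation of both sides for random positive semidefinite matrices is a sketch of ours (not the authors' code); the matrix sizes and random seed are arbitrary choices. On such random instances the inequality is expected to hold, and the eventual refutation at m = n = 5 below is nonconstructive.

```python
import itertools, math
from functools import reduce
import numpy as np

def word(As, t):
    """Matrix product A_{t[0]} A_{t[1]} ... A_{t[-1]} for an index tuple t."""
    return reduce(np.matmul, (As[j] for j in t))

def agm_sides(As, m):
    """Left- and right-hand sides of inequality (1) in the spectral norm."""
    n = len(As)
    spec = lambda M: np.linalg.norm(M, 2)
    lhs = spec(sum(word(As, t) for t in itertools.product(range(n), repeat=m))) / n**m
    rhs = spec(sum(word(As, t) for t in itertools.permutations(range(n), m))) \
          * math.factorial(n - m) / math.factorial(n)
    return lhs, rhs   # Conjecture 1 asserts lhs >= rhs

rng = np.random.default_rng(0)
As = [(lambda B: B @ B.T)(rng.standard_normal((4, 4))) for _ in range(5)]
print(agm_sides(As, 2))
print(agm_sides(As, 5))
```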
To date, extensive numerical simulations have produced no counterexample. Conjecture 1 has been rigorously established only in very special cases, notably for (m, n) = (2, 2) (Recht & Ré, 2012) and a special case of (m, n) = (3, 3) (Zhang, 2018).

Our contributions: We show how to transform Conjecture 1 into a form where the noncommutative Positivstellensatz applies, which implies in particular that for any specific values of m and n, the conjecture can be checked via two semidefinite programs. This allows us to show in Section 3 that the conjecture is false as soon as m = n = 5. We also establish in Section 2 that the conjecture holds for m = 2 and 3 with arbitrary n by extending the approach in Zhang (2018). While the conjectured inequality (1) is clearly sharp (as we may choose all $A_i$'s to be equal) whenever it is true, we show in Section 5 that the m = 2 case may nonetheless be improved in a different sense, and we do likewise for m = 3 in Section 2. The m = 4 case remains open, but our noncommutative Positivstellensatz approach permits us to at least check that it holds for n = 4 and 5 in Section 3.

Over the next two sections, we will transform Recht and Ré's Conjecture 1 into a Loewner form (Conjecture 1A), a sum-of-squares form (Conjecture 1B), and finally a semidefinite program form (Conjecture 1C). All four conjectures are equivalent, but the correctness of the last one for any m, n can be readily checked as a semidefinite program.

2. Recht–Ré Inequality for m = 2 and 3

Our goal here is to establish (1) for a pair and a triple of matrices. In so doing, we take Conjecture 1 a step closer to a form where the noncommutative Positivstellensatz applies. There is independent value in establishing these two special cases, given that the classical noncommutative arithmetic-geometric-harmonic mean inequality (Bhatia & Holbrook, 2006) is only known for a pair of matrices but has nonetheless attracted a lot of interest from linear algebraists. These special cases also have implications for randomized algorithms. Take the Kaczmarz algorithm for example: the fact that Conjecture 1 holds for m = 2 and 3 implies that if we randomly choose two or three distinct samples, perform the iterations, and sample again, then this "replacing after every two or three samples" strategy will converge faster than a "replacing after every sample" strategy.

We begin by providing some context for the inequality (1). The usual arithmetic-geometric mean inequality for n nonnegative real numbers $a_1, \dots, a_n$, i.e., $(a_1 + \cdots + a_n)/n \ge (a_1 \cdots a_n)^{1/n}$, is a special case of Maclaurin's inequality (Hardy et al., 1988): if we define
$$s_m := \binom{n}{m}^{-1}\sum_{1\le j_1<\cdots<j_m\le n} a_{j_1}\cdots a_{j_m},$$
then $s_1 \ge s_2^{1/2} \ge s_3^{1/3} \ge \cdots \ge s_n^{1/n}$.

Since the sum of $A_{j_1}\cdots A_{j_m}$ over all index tuples equals $(A_1 + \cdots + A_n)^m$, and both sides of (1) are homogeneous of degree m in $(A_1, \dots, A_n)$, we may normalize so that $A_1 + \cdots + A_n \preceq nI$; the inequality (1) then becomes a two-sided eigenvalue bound on the distinct-index sum.

Conjecture 1A (Loewner form). Let $m \le n$ be positive integers. If $A_1, \dots, A_n$ are symmetric positive semidefinite matrices with $A_1 + \cdots + A_n \preceq nI$, then
$$-\frac{n!}{(n-m)!}\, I \;\preceq \sum_{\substack{1\le j_1,\dots,j_m\le n\\ j_1,\dots,j_m\ \text{distinct}}} A_{j_1}\cdots A_{j_m} \;\preceq\; \frac{n!}{(n-m)!}\, I. \tag{2}$$

3. The Noncommutative Positivstellensatz

Let $\mathbb{R}\langle X_1, \dots, X_n\rangle$ denote the ring of polynomials with real coefficients in the noncommuting symmetric variables $X_1, \dots, X_n$, and let $\mathbb{R}\langle X_1, \dots, X_n\rangle_d$ comprise those of degree at most d. For a set of symmetric linear polynomials $L = \{\ell_1, \dots, \ell_k\} \subseteq \mathbb{R}\langle X_1, \dots, X_n\rangle_1$, its feasible set $B_L$ consists of all tuples $(A_1, \dots, A_n)$ of symmetric matrices with $\ell_1(A_1, \dots, A_n) \succeq 0, \dots, \ell_k(A_1, \dots, A_n) \succeq 0$. We say that $B_L$ is bounded if there exists $r > 0$ such that all $(A_1, \dots, A_n) \in B_L$ satisfy $\|A_1\| \le r, \dots, \|A_n\| \le r$. Let $d \in \mathbb{N}$. We write
$$\Sigma_d(L) := \Bigl\{\, \sum_{i=1}^{k}\sum_{j=1}^{p_i} f_{ij}^{\mathsf T}\,\ell_i\, f_{ij} \;:\; f_{ij} \in \mathbb{R}\langle X_1,\dots,X_n\rangle_d,\ p_1,\dots,p_k \in \mathbb{N} \,\Bigr\}$$
for the set of noncommutative sums of squares generated by L. The following theorem is a simplified version of the noncommutative Positivstellensatz, i.e., Theorem 1.1 in Helton et al. (2012), that will be enough for our purpose.

Theorem 3 (Noncommutative Positivstellensatz). Let f be a symmetric polynomial with $\deg(f) \le 2d + 1$ and let the feasible set $B_L$ be bounded with nonempty interior. Then $f(A_1, \dots, A_n) \succeq 0$ for all $(A_1, \dots, A_n) \in B_L$ if and only if $f \in \Sigma_d(L)$.

Readers familiar with the commutative Positivstellensatz (Lasserre, 2015) would see that the noncommutative version is, surprisingly, much simpler and neater.
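As a small illustration of Theorem 3 that we add here (it is not an example from the paper), take n = 1 and $L = \{\ell_1, \ell_2\}$ with $\ell_1 = X$ and $\ell_2 = 1 - X$, so that $B_L = \{A : 0 \preceq A \preceq I\}$. The symmetric polynomial $f = X - X^2$ is positive semidefinite on $B_L$, and Theorem 3 then guarantees a certificate $f \in \Sigma_1(L)$; one can check directly that
$$X - X^2 = (1 - X)\,X\,(1 - X) + X\,(1 - X)\,X,$$
which exhibits f as a sum of terms of the form $f_j^{\mathsf T}\,\ell_i\, f_j$ with $f_1 = 1 - X$ and $f_2 = X$ in $\mathbb{R}\langle X\rangle_1$.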
To avoid notational clutter, we introduce the shorthand
$$\sum_{j_i \ne j_k} := \sum_{\substack{1\le j_1,\dots,j_m\le n\\ j_1,\dots,j_m\ \text{distinct}}}$$
for the sum over distinct indices. Applying Theorem 3 with the linear constraints $X_1 \succeq 0, \dots, X_n \succeq 0$, $X_1 + \cdots + X_n \preceq nI$, Conjecture 1A becomes the following.

Conjecture 1B (Sum-of-squares form). Let $m \le n$ be positive integers and $d = \lfloor m/2\rfloor$. For the linear constraints $\ell_1 = X_1, \dots, \ell_n = X_n$, $\ell_{n+1} = n - X_1 - \cdots - X_n$, let
$$\lambda_1 = \min\Bigl\{\lambda\in\mathbb{R} : \lambda - \sum_{j_i\ne j_k} X_{j_1}\cdots X_{j_m} \in \Sigma_d(L)\Bigr\},\qquad
\lambda_2 = \min\Bigl\{\lambda\in\mathbb{R} : \lambda + \sum_{j_i\ne j_k} X_{j_1}\cdots X_{j_m} \in \Sigma_d(L)\Bigr\}.$$
Then both $\lambda_1$ and $\lambda_2 \le n!/(n-m)!$.

In polynomial optimization (Lasserre, 2015), the commutative Positivstellensatz is used to transform a constrained optimization problem into a sum-of-squares problem that can in turn be transformed into a semidefinite programming (SDP) problem. Helton (2002) has observed that this also applies to noncommutative polynomial optimization problems, i.e., we may further transform Conjecture 1B into an SDP form.

The vector space $\mathbb{R}\langle X_1, \dots, X_n\rangle_d$ has dimension $q := 1 + n + n^2 + \cdots + n^d$ and a basis comprising all q monomials of degree at most d. We will assemble all basis elements into a q-tuple of monomials that we denote by $\beta$. With respect to this basis, any $f \in \mathbb{R}\langle X_1, \dots, X_n\rangle_d$ may be represented uniquely as $f = \beta^{\mathsf T} u$ for some $u \in \mathbb{R}^q$. Therefore a noncommutative sum of squares may be expressed as
$$\sum_{j=1}^{p} f_j^{\mathsf T} f_j = \sum_{j=1}^{p} u_j^{\mathsf T}\,\beta\beta^{\mathsf T} u_j = \operatorname{tr}\Bigl(\beta\beta^{\mathsf T}\sum_{j=1}^{p} u_j u_j^{\mathsf T}\Bigr)$$
by simply writing $f_j = \beta^{\mathsf T} u_j$, $u_j \in \mathbb{R}^q$, $j = 1, \dots, p$. Since a symmetric matrix Y is positive semidefinite iff it can be written as $Y = \sum_{j=1}^{p} u_j u_j^{\mathsf T}$, we obtain the following one-to-one correspondence between noncommutative sums of squares and positive semidefinite matrices:
$$\sum_{j=1}^{p} f_j^{\mathsf T} f_j \in \mathbb{R}\langle X_1,\dots,X_n\rangle_{2d+1},\ f_j \in \mathbb{R}\langle X_1,\dots,X_n\rangle_d
\quad\longleftrightarrow\quad
\sum_{j=1}^{p} u_j u_j^{\mathsf T} \in \mathbb{R}^{q\times q},\ u_j \in \mathbb{R}^{q}.$$
With this correspondence, the two minimization problems in Conjecture 1B become two SDPs.

Conjecture 1C (Semidefinite program form). Let $m \le n$ be positive integers and $d = \lfloor m/2\rfloor$. Let $\beta$ be a monomial basis of $\mathbb{R}\langle X_1, \dots, X_n\rangle_d$ and let $X_{n+1} = n - X_1 - \cdots - X_n$. Let $\lambda_1$ be the minimum value of the SDP
$$\begin{aligned}
&\text{minimize} && \lambda\\
&\text{subject to} && \lambda - \sum_{j_i\ne j_k} X_{j_1}\cdots X_{j_m} = \sum_{i=1}^{n+1}\operatorname{tr}\bigl(\beta X_i \beta^{\mathsf T} Y_i\bigr),\\
& && Y_1 \succeq 0,\ \dots,\ Y_{n+1} \succeq 0;
\end{aligned}\tag{5}$$
and $\lambda_2$ be that of the SDP
$$\begin{aligned}
&\text{minimize} && \lambda\\
&\text{subject to} && \lambda + \sum_{j_i\ne j_k} X_{j_1}\cdots X_{j_m} = \sum_{i=1}^{n+1}\operatorname{tr}\bigl(\beta X_i \beta^{\mathsf T} Y_i\bigr),\\
& && Y_1 \succeq 0,\ \dots,\ Y_{n+1} \succeq 0.
\end{aligned}\tag{6}$$
Then both $\lambda_1$ and $\lambda_2 \le n!/(n-m)!$.

Note that the minimization is over the scalar variable $\lambda$ and the matrix variables $Y_1, \dots, Y_{n+1}$; the equality constraint equating two noncommutative polynomials simply says that the coefficients on both sides are equal, i.e., for each monomial we get a linear constraint involving $\lambda, Y_1, \dots, Y_{n+1}$; the $X_i$'s play no role other than to serve as placeholders for these linear constraints. We may express (5) and (6) as SDPs in standard form with a single matrix variable $Y := \operatorname{diag}(\lambda, Y_1, \dots, Y_{n+1})$; see (7) for an example.

Readers acquainted with (commutative) polynomial optimization (Lasserre, 2015) would be familiar with the above discussion. In fact, the only difference between the commutative and noncommutative cases is that $\sum_{i=0}^{d} n^i$, the size of a noncommutative monomial basis, is much larger than $\binom{d+n}{d}$, the size of a commutative monomial basis.

For any fixed values of m and n, Conjecture 1C is in a form that can be checked by standard SDP solvers. The dimension of the SDP grows exponentially with m, and without access to significant computing resources, only small values of m, n are within reach. Fortuitously, m = n = 5 already yields the required violation 144.6488 > 120, showing that Conjecture 1C, and thus Conjecture 1, is false in general.
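To make the reduction concrete, the following is a small-scale rendering of the SDPs (5) and (6) that we wrote in Python with CVXPY; it is a sketch under our own conventions (tuple representation of monomials, default solver), not the authors' SeDuMi/MATLAB program, and is only practical for small m and n.

```python
import itertools
from collections import defaultdict
import cvxpy as cp

def conjecture_1c(m, n, sign):
    """Optimal value lambda_1 (sign=-1, SDP (5)) or lambda_2 (sign=+1, SDP (6))."""
    d = m // 2
    basis = [w for k in range(d + 1) for w in itertools.product(range(n), repeat=k)]
    q = len(basis)                                   # q = 1 + n + ... + n^d
    lam = cp.Variable()
    Ys = [cp.Variable((q, q), PSD=True) for _ in range(n + 1)]

    # Coefficient of each word in  sum_i tr(beta ell_i beta^T Y_i),  where
    # ell_i = X_i for the first n constraints and ell_{n+1} = n - X_1 - ... - X_n.
    coeff = defaultdict(lambda: 0)
    for a, wa in enumerate(basis):
        for b, wb in enumerate(basis):
            ra = wa[::-1]                            # transpose of a word = its reversal
            for i in range(n):                       # terms f^T X_i f
                coeff[ra + (i,) + wb] += Ys[i][a, b]
            coeff[ra + wb] += n * Ys[n][a, b]        # terms f^T (n - sum_j X_j) f
            for j in range(n):
                coeff[ra + (j,) + wb] -= Ys[n][a, b]

    # Target polynomial: lam + sign * (sum over distinct j_1,...,j_m of X_{j_1}...X_{j_m}).
    target = defaultdict(lambda: 0)
    target[()] = lam
    for t in itertools.permutations(range(n), m):
        target[t] = target[t] + sign

    words = set(coeff) | set(target)
    constraints = [coeff[w] == target[w] for w in words]
    prob = cp.Problem(cp.Minimize(lam), constraints)
    prob.solve()
    return prob.value

print(conjecture_1c(2, 2, -1), conjecture_1c(2, 2, +1))
```

For m = n = 2 this should reproduce, up to solver accuracy, the first row of Table 1 below, namely lambda_1 close to 2 and lambda_2 close to 0.5.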
We tabulate our results for m ≤ n ≤ 5 in Table 1.

| | $\lambda_1$ | $\lambda_2$ | $n!/(n-m)!$ |
|---|---|---|---|
| m = 2, n = 2 | 2.0000 | 0.5000 | 2 |
| m = 2, n = 3 | 6.0000 | 1.5000 | 6 |
| m = 2, n = 4 | 12.0000 | 3.0000 | 12 |
| m = 2, n = 5 | 20.0000 | 5.0000 | 20 |
| m = 3, n = 3 | 6.0000 | 3.4113 | 6 |
| m = 3, n = 4 | 24.0000 | 8.5367 | 24 |
| m = 3, n = 5 | 60.0000 | 17.3611 | 60 |
| m = 4, n = 4 | 24.0000 | 22.4746 | 24 |
| m = 4, n = 5 | 120.0000 | 80.2349 | 120 |
| m = 5, n = 5 | 120.0000 | **144.6488** | 120 |

Table 1. Results from the SDPs in Conjecture 1C for m ≤ n ≤ 5. The bold entry for $\lambda_2$ shows that the Recht–Ré conjecture is false for m = n = 5 since 144.6488 > 120.

The fact that the SDP in (6) for m = n = 5 has a minimum $\lambda_2 > 144 > 120 = 5!$ shows that there are uncountably many instances with $A_1 \succeq 0, \dots, A_5 \succeq 0$ and $A_1 + A_2 + A_3 + A_4 + A_5 \preceq 5I$ such that the matrix
$$\sum_{\sigma\in S_5} A_{\sigma(1)}A_{\sigma(2)}A_{\sigma(3)}A_{\sigma(4)}A_{\sigma(5)}$$
has an eigenvalue that is less than $-144 < -120 = -5!$. Here $S_n$ is the symmetric group on n elements. We emphasize that neither (6) nor its dual would give us five such matrices explicitly, although the dual does provide another way to verify our result, as we will see in Section 4. Indeed, the beauty of the noncommutative Positivstellensatz approach is that it allows us to show that Conjecture 1 is false for m = n = 5 without actually having to produce five positive semidefinite matrices $A_1, \dots, A_5$ that violate the inequality (1). It would be difficult to find $A_1, \dots, A_5$ explicitly, as one does not even know the smallest dimensions required for these matrices to give a counterexample to (1). Our approach essentially circumvents the issue by replacing them with noncommutative variables $X_1, \dots, X_5$; the reader may have observed that the dimensions of the matrices $A_1, \dots, A_5$ did not make an appearance anywhere in this article.

4. Verification via Farkas

We take a closer look at the m = n = 5 case that provided a refutation of the Recht–Ré conjecture. In this case, the basis $\beta$ has $1 + 5 + 5^2 = 31$ monomials; the SDP in (6) has $1 + 5 + 5^2 + 5^3 + 5^4 + 5^5 = 3906$ linear constraints and $31^2 \times 6 + 1 = 5767$ variables, and takes the form
$$\begin{aligned}
&\text{minimize} && \operatorname{tr}(C_0 Y)\\
&\text{subject to} && \operatorname{tr}(C_i Y) = b_i,\quad i = 1,\dots,3906,\\
& && Y = \operatorname{diag}(\lambda, Y_1,\dots,Y_6) \succeq 0.
\end{aligned}\tag{7}$$
Here $C_0, C_1, \dots, C_{3906} \in \mathsf{S}^{187}$, $b \in \mathbb{R}^{3906}$, $\lambda$ is a scalar variable, and $Y_1, \dots, Y_6$ are 31-by-31 symmetric matrix variables. To put (7) into standard form, the block diagonal structure of Y may be further encoded as linear constraints requiring that the off-diagonal blocks be zero.

The output of our program gives a minimizer of the form
$$Y^* = \operatorname{diag}(\lambda^*, Y_1^*, \dots, Y_6^*) \in \mathsf{S}^{187},\qquad \lambda^* = 144.6488,\quad Y_1^*, \dots, Y_6^* \in \mathsf{S}^{31}_{++}. \tag{8}$$
The actual numerical entries of the matrices appearing in (7) and (8) are omitted due to space constraints, but they can be found in the output of our program (code in supplement). The values in (8) are of course approximate because of the inherent errors in numerical computations. In our opinion, the gap between the computed 144.6488 and the conjectured 120 is large enough to override any concerns of a mistaken conclusion resulting from numerical errors. Nevertheless, to put to rest any lingering doubts, we will directly show that the conjectured value $\lambda = 120$ is infeasible by producing a Farkas certificate. Consider the feasibility problem
$$\begin{aligned}
&\text{minimize} && 0\\
&\text{subject to} && \operatorname{tr}(C_i Y) = b_i,\quad i = 1,\dots,3906,\\
& && \operatorname{tr}(C_0 Y) = 120,\qquad Y \succeq 0,
\end{aligned}\tag{9}$$
with $C_0, C_1, \dots, C_{3906} \in \mathsf{S}^{187}$ and $b \in \mathbb{R}^{3906}$ as in (7). Note that $C_0 = e_1 e_1^{\mathsf T}$ is the matrix with a one in the (1, 1)th entry and zeros everywhere else.
So (9) is the feasibility problem of the optimization problem (7) with the additional linear constraint $y_{11} = 120$ and where we have disregarded the block diagonal constraints on Y (if (9) is already infeasible, then adding these block diagonal constraints just makes it even more infeasible). The dual of (9) is
$$\begin{aligned}
&\text{maximize} && 120\,y_0 + b^{\mathsf T} y\\
&\text{subject to} && y_0 C_0 + y_1 C_1 + \cdots + y_{3906} C_{3906} \preceq 0.
\end{aligned}$$
Our program produces a Farkas certificate $(y_0, y) \in \mathbb{R}\times\mathbb{R}^{3906}$ with $120\,y_0 + b^{\mathsf T} y \approx 47.3 > 0$, implying that (9) is infeasible. While this is a consequence of the Farkas lemma for SDP (Lasserre, 1995), all we need is the following trivial version.

Lemma 1. Let $m, n \in \mathbb{N}$. Let $C_0, C_1, \dots, C_m \in \mathsf{S}^n$ and $b \in \mathbb{R}^{m+1}$. If there exists a $y \in \mathbb{R}^{m+1}$ with
$$y_0 C_0 + \cdots + y_m C_m \preceq 0,\qquad b^{\mathsf T} y > 0,$$
then there does not exist a $Y \in \mathsf{S}^n$ with
$$\operatorname{tr}(C_0 Y) = b_0,\ \dots,\ \operatorname{tr}(C_m Y) = b_m,\qquad Y \succeq 0.$$

Proof. If such a Y exists, then
$$0 \;\ge\; \operatorname{tr}\bigl((y_0 C_0 + \cdots + y_m C_m)\,Y\bigr) = y_0 b_0 + \cdots + y_m b_m > 0,$$
a contradiction.

Hence a matrix of the form $Y = \operatorname{diag}(120, Y_1, \dots, Y_6) \in \mathsf{S}^{187}$ is infeasible for (7), providing another refutation of Conjecture 1C and thus Conjecture 1. In particular, showing that $\lambda = 120$ is infeasible for (7) does not require any of the values computed in (8). Of course, aside from being the conjectured value of $\lambda_2$, there is nothing special about $\lambda = 120$: for any $\lambda < 144.6488$, we may similarly compute a Farkas certificate y to show that such a value of $\lambda$ is infeasible for (7).
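In code, the check supplied by Lemma 1 is elementary; the sketch below is ours and uses placeholder data, since the actual $C_i$, $b$, and the certificate $(y_0, y)$ are generated by the authors' program (see the supplement) and are not reproduced in the paper.

```python
import numpy as np

def verify_farkas(C0, Cs, b, y0, y, value=120.0, tol=1e-8):
    """Check the hypotheses of Lemma 1 for the feasibility problem (9):
    (i)  y0*C0 + sum_i y[i]*Cs[i] is negative semidefinite (up to tol);
    (ii) value*y0 + b @ y > 0.
    If both hold, no PSD matrix Y can satisfy tr(C0 Y) = value and
    tr(Ci Y) = b_i for all i, so lambda = value is infeasible for (7).
    """
    M = y0 * C0 + sum(yi * Ci for yi, Ci in zip(y, Cs))
    max_eig = np.linalg.eigvalsh((M + M.T) / 2).max()
    margin = value * y0 + b @ y          # reported as roughly 47.3 in the text
    return max_eig <= tol and margin > 0

# C0 = e_1 e_1^T in S^187 is known explicitly; Cs, b, y0, y are placeholders
# standing in for the program output.
C0 = np.zeros((187, 187))
C0[0, 0] = 1.0
```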
We conclude with a few words on the computational costs of the SDPs in this and the last section. Our resulting dense linear system for m = n = 5 requires $3906 \times 5767 \approx 22$ million floating point numbers of storage. Using a personal computer with an Intel Core i7-9700K processor and 16GB of RAM, our SeDuMi (Sturm, 1999) program in MATLAB takes 150 seconds. For m = n = 6, storage alone would have taken 26 billion floating point numbers, beyond our modest computing resources.

5. Improving the Recht–Ré Inequality

An unexpected benefit of the noncommutative Positivstellensatz approach is that it leads to better bounds for the m = 2 and 3 cases that we know are true. Observe that the values for $\lambda_2$ in Table 1 for m = 2 are exactly 1/4 times the corresponding values of $n!/(n-m)!$. This suggests that the Recht–Ré inequality (3) for m = 2 in Theorem 1 may be improved to
$$-\frac{1}{4}\,n(n-1)\, I \;\preceq\; \sum_{i\ne j} A_i A_j \;\preceq\; n(n-1)\, I.$$
Table 1 only shows this for n = 2, 3, 4, 5, but in this section we will give a proof for arbitrary n ≥ 2. Although our proof below does not depend on the SDP formulation in (6), the correct coefficients in (11) for arbitrary n would have been impossible to guess without solving (6) for m = 2 and some small values of n.

So far we have not explored the symmetry evident in our formulations of the Recht–Ré inequality: In Conjecture 1A, the matrix expression
$$\lambda I - \sum_{j_i\ne j_k} A_{j_1}\cdots A_{j_m}$$
and the constraints $A_1 \succeq 0, \dots, A_n \succeq 0$, $A_1 + \cdots + A_n \preceq nI$ are clearly invariant under any permutation $\sigma \in S_n$. In Conjecture 1C, the noncommutative sum of squares
$$\lambda + \sum_{j_i\ne j_k} X_{j_1}\cdots X_{j_m} = \sum_{i=1}^{n+1}\operatorname{tr}\bigl(\beta X_i \beta^{\mathsf T} Y_i\bigr), \tag{10}$$
where $X_{n+1} = n - X_1 - \cdots - X_n$, is also invariant under $S_n$, and so we may average over all permutations to get a symmetrized sum of squares. For commutative polynomials, results from classical invariant theory are often used to take advantage of symmetry (Gatermann & Parrilo, 2004). We will see next that such symmetry may also be exploited for noncommutative polynomials.

Consider the case m = 2, n = 3. The monomial basis of $\mathbb{R}\langle X_1, X_2, X_3\rangle_1$ is $\beta = (1, X_1, X_2, X_3)$. The symmetry imposes linear constraints on the matrix variables $Y_1, Y_2, Y_3, Y_4$ in (6), requiring them to take the following forms:
$$Y_1 = \begin{bmatrix} a & b & c & c\\ b & d & e & e\\ c & e & f & g\\ c & e & g & f\end{bmatrix},\quad
Y_2 = \begin{bmatrix} a & c & b & c\\ c & f & e & g\\ b & e & d & e\\ c & g & e & f\end{bmatrix},\quad
Y_3 = \begin{bmatrix} a & c & c & b\\ c & f & g & e\\ c & g & f & e\\ b & e & e & d\end{bmatrix},\quad
Y_4 = \begin{bmatrix} x & y & y & y\\ y & z & w & w\\ y & w & z & w\\ y & w & w & z\end{bmatrix}.$$
These symmetries allow us to drastically reduce the degrees of freedom in our SDP: for m = 2 and any n ≥ 2, the matrices $Y_1, \dots, Y_{n+1}$ are always determined by precisely 11 variables that we label a, b, c, d, e, f, g, x, y, z, w. We computed their values explicitly for n = 2, 3, 4. For n = 2,
$$Y_1 = \begin{bmatrix} \tfrac{5}{4} & -\tfrac{3}{4} & \tfrac{1}{4}\\[2pt] -\tfrac{3}{4} & \tfrac{1}{2} & 0\\[2pt] \tfrac{1}{4} & 0 & \tfrac{1}{2}\end{bmatrix},\qquad
Y_3 = \begin{bmatrix} \tfrac{1}{4} & -\tfrac{1}{4} & -\tfrac{1}{4}\\[2pt] -\tfrac{1}{4} & \tfrac{1}{2} & 0\\[2pt] -\tfrac{1}{4} & 0 & \tfrac{1}{2}\end{bmatrix},$$
and $Y_2$ can be determined from $Y_1$. For n = 3,
$$Y_1 = \begin{bmatrix} \tfrac{5}{2} & -1 & 0 & 0\\[2pt] -1 & \tfrac{4}{9} & \tfrac{1}{9} & \tfrac{1}{9}\\[2pt] 0 & \tfrac{1}{9} & \tfrac{4}{9} & \tfrac{1}{9}\\[2pt] 0 & \tfrac{1}{9} & \tfrac{1}{9} & \tfrac{4}{9}\end{bmatrix},\qquad
Y_4 = \begin{bmatrix} \tfrac{1}{2} & -\tfrac{1}{3} & -\tfrac{1}{3} & -\tfrac{1}{3}\\[2pt] -\tfrac{1}{3} & \tfrac{4}{9} & \tfrac{1}{9} & \tfrac{1}{9}\\[2pt] -\tfrac{1}{3} & \tfrac{1}{9} & \tfrac{4}{9} & \tfrac{1}{9}\\[2pt] -\tfrac{1}{3} & \tfrac{1}{9} & \tfrac{1}{9} & \tfrac{4}{9}\end{bmatrix},$$
and $Y_2, Y_3$ can be determined from $Y_1$. For n = 4,
$$Y_1 = \begin{bmatrix} \tfrac{15}{4} & -\tfrac{9}{8} & -\tfrac{1}{8} & -\tfrac{1}{8} & -\tfrac{1}{8}\\[2pt] -\tfrac{9}{8} & \tfrac{3}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8}\\[2pt] -\tfrac{1}{8} & \tfrac{1}{8} & \tfrac{3}{8} & \tfrac{1}{8} & \tfrac{1}{8}\\[2pt] -\tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{3}{8} & \tfrac{1}{8}\\[2pt] -\tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{3}{8}\end{bmatrix},\qquad
Y_5 = \begin{bmatrix} \tfrac{3}{4} & -\tfrac{3}{8} & -\tfrac{3}{8} & -\tfrac{3}{8} & -\tfrac{3}{8}\\[2pt] -\tfrac{3}{8} & \tfrac{3}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8}\\[2pt] -\tfrac{3}{8} & \tfrac{1}{8} & \tfrac{3}{8} & \tfrac{1}{8} & \tfrac{1}{8}\\[2pt] -\tfrac{3}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{3}{8} & \tfrac{1}{8}\\[2pt] -\tfrac{3}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{3}{8}\end{bmatrix},$$
and $Y_2, Y_3, Y_4$ can be determined from $Y_1$. The rational numbers above are all chosen by observing the floating point output of the SDP (6). The values of the matrices $Y_i$ for n = 2, 3, 4 allow us to guess that the variables a, b, c, d, e, f, g, x, y, z, w are
$$a = \frac{5(n-1)}{4},\quad b = -\frac{3(n-1)}{2n},\quad c = \frac{3-n}{2n},\quad d = f = z = \frac{2(n-1)}{n^2},\quad e = g = w = \frac{n-2}{n^2},\quad x = \frac{n-1}{4},\quad y = -\frac{n-1}{2n}. \tag{11}$$
The proof of our next theorem will ascertain that these choices are indeed correct: they yield the sum-of-squares decomposition in (10) for m = 2.

Theorem 4 (Better Recht–Ré for m = 2). Let $A_1, \dots, A_n$ be positive semidefinite matrices. If $A_1 + \cdots + A_n \preceq nI$, then
$$-\frac{1}{4}\,n(n-1)\, I \;\preceq\; \sum_{i\ne j} A_i A_j \;\preceq\; n(n-1)\, I.$$

Proof. The upper bound has already been established in Theorem 1. It remains to establish the lower bound. We start from the following readily verifiable inequalities:
$$\frac{1}{2n(n-1)}\,(A_j - A_k)\,A_i\,(A_j - A_k) \succeq 0, \tag{12}$$
$$\frac{5(n-1)}{4}\Bigl(I - \frac{6}{5n}A_i + \frac{2(3-n)}{5n(n-1)}\sum_{j\ne i}A_j\Bigr)\,A_i\,\Bigl(I - \frac{6}{5n}A_i + \frac{2(3-n)}{5n(n-1)}\sum_{j\ne i}A_j\Bigr) \succeq 0, \tag{13}$$
$$\frac{n-1}{5n^2}\Bigl(A_i + \frac{2n-1}{n-1}\sum_{j\ne i}A_j\Bigr)\,A_i\,\Bigl(A_i + \frac{2n-1}{n-1}\sum_{j\ne i}A_j\Bigr) \succeq 0, \tag{14}$$
$$\frac{1}{2n^2}\,(A_j - A_k)\Bigl(nI - \sum_{i}A_i\Bigr)(A_j - A_k) \succeq 0, \tag{15}$$
$$\frac{n-1}{4}\Bigl(I - \frac{2}{n}\sum_{i}A_i\Bigr)\Bigl(nI - \sum_{i}A_i\Bigr)\Bigl(I - \frac{2}{n}\sum_{i}A_i\Bigr) \succeq 0. \tag{16}$$
Sum (12) over all distinct i, j, k; sum (13) over all i; sum (14) over all i; sum (15) over all distinct j, k; and add all results to (16). The final inequality is our required lower bound.

For n = m = 2, the new lower bound is sharp. Take
$$A_1 = \begin{bmatrix} \tfrac{3}{2} & 0\\[2pt] 0 & 0\end{bmatrix},\qquad A_2 = \begin{bmatrix} \tfrac{1}{6} & \tfrac{\sqrt{2}}{3}\\[2pt] \tfrac{\sqrt{2}}{3} & \tfrac{4}{3}\end{bmatrix};$$
then $\|A_1 + A_2\| = 2$ and the smallest eigenvalue of $A_1A_2 + A_2A_1$ is $-1/2$. We conjecture that this bound is sharp for all m = 2, n ≥ 2.

The method in this section also extends to higher m. For example, we may impose symmetry constraints for m = n = 3 and see if the $Y_1, Y_2, Y_3, Y_4$ obtained have rational values, and if so write down a sum-of-squares proof by factoring the $Y_i$'s.
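The sharpness example and the improved lower bound are easy to confirm numerically; the following sketch is ours (the random instance, dimensions, and seed are arbitrary choices and not from the paper).

```python
import numpy as np

# Sharpness example for n = m = 2: ||A1 + A2|| = 2 and the smallest
# eigenvalue of A1*A2 + A2*A1 equals -1/2 = -n(n-1)/4.
A1 = np.array([[1.5, 0.0], [0.0, 0.0]])
A2 = np.array([[1/6, np.sqrt(2)/3], [np.sqrt(2)/3, 4/3]])
print(np.linalg.norm(A1 + A2, 2))                    # 2.0
print(np.linalg.eigvalsh(A1 @ A2 + A2 @ A1).min())   # -0.5

# Random check of Theorem 4's lower bound for m = 2: draw PSD matrices,
# rescale so that A_1 + ... + A_n <= n*I, and compare the smallest
# eigenvalue of sum_{i != j} A_i A_j with -n(n-1)/4.
rng = np.random.default_rng(0)
n, dim = 5, 6
As = [(lambda B: B @ B.T)(rng.standard_normal((dim, dim))) for _ in range(n)]
scale = n / np.linalg.norm(sum(As), 2)
As = [scale * A for A in As]
M = sum(As[i] @ As[j] for i in range(n) for j in range(n) if i != j)
print(np.linalg.eigvalsh(M).min(), -n * (n - 1) / 4)  # first value >= second
```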
6. Conclusion and Open Problems

We conclude our article with a discussion of some open problems and of why we think the Recht–Ré conjecture, while false as currently stated, only needs to be refined. An immediate open question is whether the conjecture is true for m = 4: Table 1 shows that it holds for (m, n) = (4, 4) and (4, 5); we suspect that it is true for all n ≥ 4.

As we pointed out after Conjecture 1A, the Recht–Ré inequality as stated in (1) conceals an asymmetry: it actually contains two inequalities, as shown in (2). What we have seen is that the lower bound is never attained in any of the cases we have examined. For m = 2 and 3, the lower bound is too large, i.e., larger in magnitude than necessary, and we improved it in Theorem 4 and the proof of Theorem 2 respectively. For m = 5, the lower bound is too small to be valid, which is why the Recht–Ré inequality is false. A natural follow-up question is then: what is the correct lower bound? On the other hand, we conjecture that the remaining half of the Recht–Ré inequality, i.e., the upper bound in (2), holds true for all m ≤ n ∈ N.

Duchi (2012) has another conjecture similar to Conjecture 1, but with the norms appearing after the summation, i.e., applied to each summand.

Conjecture 2 (Duchi, 2012). Let $A_1, \dots, A_n$ be positive semidefinite matrices. Then for any m ≤ n,
$$\frac{1}{n^m}\sum_{1\le j_1,\dots,j_m\le n} \bigl\|A_{j_1}\cdots A_{j_m}\bigr\| \;\ge\; \frac{(n-m)!}{n!}\sum_{\substack{1\le j_1,\dots,j_m\le n\\ j_1,\dots,j_m\ \text{distinct}}} \bigl\|A_{j_1}\cdots A_{j_m}\bigr\|.$$
Israel et al. (2016) have established this for m = 2 and 3 for any unitarily invariant norm. It is not clear to us if the noncommutative Positivstellensatz might perhaps also shed light on Conjecture 2.

Lastly, if our intention is to analyze the relative efficacies of with- and without-replacement sampling strategies in randomized algorithms, then it is more pertinent to study these inequalities for random matrices, i.e., we do not just assume that the indices are random variables but also the entries of the matrices. For example, if we want to analyze the Kaczmarz algorithm, then we ought to take expectations not only with respect to all permutations but also with respect to how we generate the entries of the matrices. This would provide a more realistic platform for comparing different sampling strategies.

References

Beck, A. and Tetruashvili, L. On the convergence of block coordinate descent type methods. SIAM J. Optim., 23(4):2037–2060, 2013.

Bhatia, R. and Holbrook, J. Noncommutative geometric means. Math. Intelligencer, 28(1):32–39, 2006.

Bhatia, R. and Kittaneh, F. On the singular values of a product of operators. SIAM J. Matrix Anal. Appl., 11(2):272–277, 1990.

Bhatia, R. and Kittaneh, F. Notes on matrix arithmetic-geometric mean inequalities. Linear Algebra Appl., 308(1-3):203–211, 2000.

Bhatia, R. and Kittaneh, F. The matrix arithmetic-geometric mean inequality revisited. Linear Algebra Appl., 428(8-9):2177–2191, 2008.

Bottou, L. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the Symposium on Learning and Data Science, Paris, 2009.

Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010, pp. 177–186. Physica-Verlag/Springer, Heidelberg, 2010.

Duchi, J. C. Commentary on "Towards a noncommutative arithmetic-geometric mean inequality" by B. Recht and C. Ré. JMLR: Workshop and Conference Proceedings, 23:11.25–11.27, 2012.

Fang, C., Lin, Z., and Zhang, T. Sharp analysis for nonconvex SGD escaping from saddle points. In Beygelzimer, A. and Hsu, D. (eds.), Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 1192–1234, Phoenix, USA, 2019.

Gatermann, K. and Parrilo, P. A. Symmetry groups, semidefinite programs, and sums of squares. J. Pure Appl. Algebra, 192(1-3):95–128, 2004.

Gürbüzbalaban, M., Ozdaglar, A., and Parrilo, P. A. Why random reshuffling beats stochastic gradient descent. Math. Program., 2019.

Hardy, G. H., Littlewood, J. E., and Pólya, G. Inequalities. Cambridge Mathematical Library. Cambridge University Press, Cambridge, 1988. Reprint of the 1952 edition.
Helton, J. W. Positive noncommutative polynomials are sums of squares. Ann. of Math. (2), 156(2):675–694, 2002.

Helton, J. W., Klep, I., and McCullough, S. The convex Positivstellensatz in a free algebra. Adv. Math., 231(1):516–534, 2012.

Israel, A., Krahmer, F., and Ward, R. An arithmetic-geometric mean inequality for products of three matrices. Linear Algebra Appl., 488:1–12, 2016.

Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1724–1732, International Convention Centre, Sydney, Australia, 2017.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.

Lasserre, J. B. A new Farkas lemma for positive semidefinite matrices. IEEE Trans. Automat. Control, 40(6):1131–1133, 1995.

Lasserre, J. B. An introduction to polynomial and semi-algebraic optimization, volume 52. Cambridge University Press, 2015.

Nagaraj, D., Jain, P., and Netrapalli, P. SGD without replacement: Sharper rates for general smooth convex functions. In International Conference on Machine Learning, pp. 4703–4711, 2019.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2008.

Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim., 22(2):341–362, 2012.

Pascoe, J. E. Positivstellensätze for noncommutative rational expressions. Proc. Amer. Math. Soc., 146(3):933–937, 2018.

Recht, B. and Ré, C. Toward a noncommutative arithmetic-geometric mean inequality: Conjectures, case-studies, and consequences. In Mannor, S., Srebro, N., and Williamson, R. C. (eds.), Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pp. 11.1–11.24, Edinburgh, Scotland, 2012.

Shamir, O. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pp. 46–54, 2016.

Strohmer, T. and Vershynin, R. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15(2):262–278, 2009.

Sturm, J. F. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11(1-4):625–653, 1999.

Wright, S. J. Coordinate descent algorithms. Math. Program., 151(1, Ser. B):3–34, 2015.

Zhang, T. A note on the matrix arithmetic-geometric mean inequality. Electron. J. Linear Algebra, 34:283–287, 2018.