# Decentralized sketching of low-rank matrices

Rakshith S. Srinivasa, Dept. of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30318, rsrinivasa6@gatech.edu
Kiryung Lee, Dept. of Electrical and Computer Engineering, Ohio State University, Columbus, OH 43210, lee.8763@osu.edu
Marius Junge, Dept. of Mathematics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, mjunge@illinois.edu
Justin Romberg, Dept. of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30318, jrom@ece.gatech.edu

*This work was supported in part by NSF CCF-1718771 and NSF DMS 18-00872, and in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.*

**Abstract.** We address a low-rank matrix recovery problem where each column of a rank-$r$ matrix $X \in \mathbb{R}^{d_1 \times d_2}$ is compressed beyond the point of individual recovery to $\mathbb{R}^L$ with $L \ll d_1$. Leveraging the joint structure among the columns, we propose a method to recover the matrix to within an $\epsilon$ relative error in the Frobenius norm from a total of $O(r(d_1+d_2)\log^6(d_1+d_2)/\epsilon^2)$ observations. This guarantee holds uniformly for all incoherent matrices of rank $r$. Our method uses a novel matrix norm, which we call the mixed-norm, together with the maximum $\ell_2$-norm of the columns to design a new convex relaxation for low-rank recovery that is tailored to our observation model. We also show that the proposed mixed-norm, the standard nuclear norm, and the max-norm are particular instances of convex regularization of low-rankness via tensor norms. Finally, we provide a scalable ADMM algorithm for the mixed-norm-based method and demonstrate its empirical performance via large-scale simulations.

## 1 Introduction

A fundamental structural model for data is that the data points lie close to an unknown subspace, meaning that the matrix created by concatenating the data vectors has low rank. We address a particular low-rank matrix recovery problem where we wish to recover a set of vectors from a low-dimensional subspace after they have been individually compressed (or *sketched*). More concretely, let $x_1, \ldots, x_{d_2}$ be vectors from an unknown $r$-dimensional subspace of $\mathbb{R}^{d_1}$. We observe the vectors indirectly via linear sketches through corresponding sensing matrices $B_1, \ldots, B_{d_2} \in \mathbb{R}^{d_1 \times L}$, where $L < d_1$; that is, the observed measurement vectors are written as

$$y_i = B_i^\top x_i + z_i, \qquad i = 1, \ldots, d_2. \tag{1}$$

Although individual recovery of each vector is ill-posed, it is still possible to recover $x_1, \ldots, x_{d_2}$ jointly by leveraging their mutual structure, without knowing the underlying subspace a priori. This indeed results in a low-rank matrix recovery problem with a column-wise observation model.

We are motivated mainly by large-scale inference problems where data is collected over a distributed network or in a streaming setting. In both cases, it is desirable to compress the data to lower the communication overhead. In the first scenario, the data is partitioned according to the network structure and each data point must be compressed without access to the others. In the second scenario, memory or computational constraints may limit access to a relatively small number of recent data points. Such compressive and distributed acquisition schemes arise frequently in numerous real-world applications.
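To make the column-wise sketching model (1) concrete, the following minimal Python sketch simulates it end to end. The dimensions, random seed, and noise level below are illustrative choices of ours, not the settings used in the experiments of Section 3.

```python
# Minimal simulation of the column-wise sketching model (1); all parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r, L = 200, 300, 5, 20        # ambient dimension, number of columns, rank, sketch size (L < d1)
sigma = 0.01                          # noise standard deviation

# Rank-r ground truth: the columns x_i lie in an unknown r-dimensional subspace of R^{d1}.
U = rng.standard_normal((d1, r))
C = rng.standard_normal((r, d2))
X0 = U @ C                            # rank-r matrix (with probability one)

# Each column is compressed independently: y_i = B_i^T x_i + z_i with B_i in R^{d1 x L}.
B = rng.standard_normal((d2, d1, L))  # B[i] plays the role of B_i
Y = np.empty((L, d2))
for i in range(d2):
    Y[:, i] = B[i].T @ X0[:, i] + sigma * rng.standard_normal(L)

print(Y.shape)                        # (L, d2): L measurements per column, far fewer than d1
```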
In next-generation high-resolution astronomical imaging systems, an antenna array may be distributed across a wide geographical area to collect data points that are high-dimensional but also heavily correlated (and so belong to a low-dimensional subspace). Compression at the node level relieves the overhead of transmitting data to a central processing unit [1]. In scientific computing, it is common to generate large-scale simulation data with redundancies that manifest as low-rank structure. For example, simulations of a fluid dynamical system generate large state vectors that follow low-rank dynamics [2]. Our observation model describes a kind of on-the-fly compression, where the states are compressed as the system evolves, resulting in efficient communication and storage.

In each of these applications, if the underlying low-dimensional subspace were known a priori, then the projection onto that subspace would implement an optimal distortion-free linear compression. Alternatively, if the uncompressed data were available, standard Principal Component Analysis (PCA) could be used to discover the subspace. Unfortunately, neither is the case. We therefore approach the recovery as sketching without knowing the latent subspace a priori. It can also be interpreted as a blind compressed sensing problem that recovers the data points and the underlying subspace simultaneously from compressed measurements.

The measurement model in (1) is equivalently rewritten as follows. Let $X_0 \in \mathbb{R}^{d_1 \times d_2}$ be the matrix obtained by concatenating $x_1, \ldots, x_{d_2}$, so that the rank of $X_0$ is at most $r$. The entries of $y_1, \ldots, y_{d_2}$ then correspond to noisy linear measurements of $X_0$; that is, for $l = 1, \ldots, L$ and $i = 1, \ldots, d_2$, the $l$th entry of $y_i$, denoted by $y_{l,i}$, is written as

$$y_{l,i} = \langle A_{l,i}, X_0\rangle + z_{l,i} \quad \text{with} \quad A_{l,i} = \frac{1}{\sqrt{L}}\, b_{l,i}\, e_i^\top, \tag{2}$$

where $z_{l,i}$, $b_{l,i}$, and $e_i$ respectively denote the $l$th entry of $z_i$, the $l$th column of $B_i$, and the $i$th column of the identity matrix of size $d_2$. We propose a convex optimization method to recover $X_0$ from $\{y_{l,i}\}$ and provide a theoretical analysis when the $b_{l,i}$ and $z_{l,i}$ are independent copies of random variables drawn according to $\mathcal{N}(0, I_{d_1})$ and $\mathcal{N}(0, \sigma^2)$, respectively.

### 1.1 Mixed-norm-based low-rank recovery

Low-rank matrix recovery has been extensively studied (e.g., see [3]). One popular approach is to formulate the recovery as a convex program with various matrix norms such as the nuclear norm [4, 5, 6] and the max-norm [7]. As we show in Section 2, these two norms, together with the new norm we propose below, are all particular instances of a unified perspective on low-rank regularization.

We propose a convex relaxation of low-rankness by a matrix norm suited to the recovery from the measurements in (2). For a matrix $X$, the maximum $\ell_2$ column norm is defined as

$$\|X\|_{1\to2} = \max_{j = 1, \ldots, d_2} \|X e_j\|_2, \tag{3}$$

where $e_j$ is the $j$th standard basis vector. This can be interpreted as the operator norm from $\ell_1^{d_2}$ to $\ell_2^{d_1}$. We define the "mixed-norm" of a matrix $X$ as

$$\|X\|_{\mathrm{mixed}} = \inf_{U, V:\, UV = X} \|U\|_F \, \|V\|_{1\to2}. \tag{4}$$

Together, these two norms provide a convex relaxation suited to the observation model in (2) through the interlacing property in the following lemma, whose proof is given in the supplementary material.

**Lemma 1.** Let $X \in \mathbb{R}^{d_1 \times d_2}$ satisfy $\operatorname{rank}(X) \le r$. Then

$$\|X\|_{1\to2} \;\le\; \|X\|_{\mathrm{mixed}} \;\le\; \sqrt{r}\, \|X\|_{1\to2}. \tag{5}$$

By Lemma 1, the set $\kappa(\alpha, R)$ defined by

$$\kappa(\alpha, R) = \{X : \|X\|_{1\to2} \le \alpha,\ \|X\|_{\mathrm{mixed}} \le R\} \tag{6}$$

contains the set of rank-$r$ matrices with column norms bounded by $\alpha$ (taking $R = \alpha\sqrt{r}$).
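Lemma 1 can be checked numerically without computing the mixed-norm exactly: any factorization $X = UV$ yields the upper bound $\|U\|_F\|V\|_{1\to2}$ on $\|X\|_{\mathrm{mixed}}$, and the SVD-based factorization certifies the $\sqrt{r}$ factor in (5). The short sketch below is ours and only uses such factorization bounds; computing the exact mixed-norm would require the semidefinite characterization given later in Section 3.

```python
# Factorization-based illustration of Lemma 1 (illustrative sketch, not the paper's code).
import numpy as np

rng = np.random.default_rng(1)
d1, d2, r = 60, 80, 4
X = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))  # rank-r test matrix

col_norm_max = np.linalg.norm(X, axis=0).max()                   # ||X||_{1->2}

# Any factorization X = U V gives ||X||_mixed <= ||U||_F * ||V||_{1->2}.
# The SVD factorization U = U_r, V = Sigma_r V_r^T certifies the sqrt(r) bound in (5),
# since ||U_r||_F = sqrt(r) and the columns of Sigma_r V_r^T have the same norms as those of X.
Ur, s, Vt = np.linalg.svd(X, full_matrices=False)
Ur, V = Ur[:, :r], s[:r, None] * Vt[:r, :]
svd_bound = np.linalg.norm(Ur, 'fro') * np.linalg.norm(V, axis=0).max()

print(col_norm_max, "<= ||X||_mixed <=", svd_bound)              # sandwich from Lemma 1
print("sqrt(r) * ||X||_{1->2} =", np.sqrt(r) * col_norm_max)     # equals the SVD-based bound
```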
We show that the observation model in (2) results in an $\epsilon$-embedding of the set $\kappa(\alpha, R)$ once the total number of measurements satisfies $Ld_2 \gtrsim r(d_1 + d_2)\log^6(d_1 + d_2)/\epsilon^2$. We consider the estimate $\hat{X}$ of $X_0$ given by

$$\hat{X} \in \operatorname*{argmin}_{X \in \kappa(\alpha, R)} \sum_{l,i} \big|y_{l,i} - \langle A_{l,i}, X\rangle\big|^2. \tag{7}$$

We have attempted to use the nuclear norm instead of the mixed-norm, but that approach did not yield a guarantee at a near-optimal sample complexity. Furthermore, it also demonstrates worse empirical performance than our approach, as shown in Section 3.

Another appealing property of the mixed-norm is that it can be computed in polynomial time through a semidefinite formulation. This renders the proposed estimator readily implementable using general-purpose convex solvers. To address scalability, however, we propose an ADMM-based framework. We defer further details on efficient computation to Section 3.

### 1.2 Main result

Our main result, stated in Theorem 1, provides an upper bound on the Frobenius norm of the error between the estimate $\hat{X}$ obtained by solving (7) and the ground-truth matrix $X_0$ that holds simultaneously for all matrices $X_0 \in \kappa(\alpha, R)$, rather than for a fixed arbitrary matrix $X_0$. En route to proving our guarantee, we show that $\sum_{l,i}\langle A_{l,i}, X\rangle^2$ concentrates around its expectation $\|X\|_F^2$ for all $X \in \kappa(\alpha, R)$, and hence that the measurements provide an embedding of the set $\kappa(\alpha, R)$ into a low-dimensional space.

**Theorem 1.** Let $\kappa(\alpha, R)$ be defined as in (6). Suppose that the $b_{l,i}$ are drawn independently from $\mathcal{N}(0, I_{d_1})$, the $z_{l,i}$ are i.i.d. following $\mathcal{N}(0, \sigma^2)$, $d = d_1 + d_2$, and $d_2 \le Ld_2 \le d_1 d_2$. Then, for $R \le \alpha\sqrt{r}$, there exist numerical constants $c_1, c_2$ such that the estimate $\hat{X}$ satisfies

$$\frac{\|\hat{X} - X_0\|_F^2}{\|X_0\|_F^2} \;\le\; c_1\,\frac{\alpha^2}{\|X_0\|_F^2/d_2}\,\max\!\left(1, \frac{\sigma\sqrt{L}}{\alpha}\right)\sqrt{\frac{r(d_1+d_2)\log^6 d}{Ld_2}} \tag{8}$$

with probability at least $1 - 2\exp(-c_2 R^2 d/\alpha^2)$, for all $X_0 \in \kappa(\alpha, R)$.

A few remarks are in order:

- The factor $\alpha^2 d_2/\|X_0\|_F^2$ is the ratio between the maximum and the average of the squared column $\ell_2$-norms of the ground-truth matrix $X_0$ and represents its degree of incoherence. A ratio close to 1 indicates that the columns have similar $\ell_2$-norms and results in a lower sample complexity than when the ratio is much larger than 1. This is similar to the dependence on the relative magnitude of each entry in the max-norm-based estimator [7] and to the dependence on incoherence in matrix completion problems.
- The second factor is written as $\max(1, \eta)$, where $\eta = \sigma\sqrt{L}/\alpha$ accounts for the noise level in the measurements. Since we take $L$ measurements per column and the measurement operator is isotropic, the per-column signal energy $\alpha^2$ is compared against the corresponding noise variance $\sigma^2 L$.
- If the incoherence term is upper-bounded by a constant and the normalized noise level satisfies $\eta = \Omega(1)$, then $\hat{X}$ obtained from $O(\eta^2 r d \log^6(d)\,\epsilon^{-2})$ measurements satisfies $\|\hat{X} - X_0\|_F^2 \le \epsilon\,\|X_0\|_F^2$ with high probability.
- We conjecture that the corresponding minimax lower bound coincides with (8) up to the maximum of $\eta$ with 1 and the logarithmic factor. In particular, if $\eta = \Omega(1)$, then the sample complexity in (8) is near optimal.

### 1.3 Related work

The model in (2) has been studied in the context of compressed principal component estimation [8, 9, 10]. These works studied a specific method that computes the underlying subspace through an empirical covariance estimate. While this approach is guaranteed at a near-optimal sample complexity, it is inherently limited to the linear observation model. Our method, on the other hand, is more flexible in terms of its potential extension to nonlinear observation models.
Negahban and Wainwright [11] considered the multivariate linear regression problem, where a model similar to (2) arises but with a fixed sensing matrix $A$, i.e., $A_i = A$ for all $i = 1, \ldots, d_2$. They showed that nuclear-norm-penalized least squares provides robust recovery at a near-optimal sample complexity, within a logarithmic factor of the degrees of freedom of rank-$r$ matrices. However, their guarantee applies to an arbitrary fixed ground-truth matrix and not to all matrices within the model simultaneously. Our aim is to work with an embedding of the model set $\kappa(\alpha, R)$, and we obtain a uniform theoretical guarantee over the entire model set at the cost of using different sensing matrices $B_i$ and of a dependence on the incoherence of the matrices.

Our solution approach is partly inspired by earlier work on low-rank matrix completion using the max-norm [12, 13, 7]. There, the pair of max-norm and $\ell_\infty$-norm is used to relax the set of low-rank matrices to a convex model. We generalize this approach to one using tensor norms (see Section 2) as a proxy for low-rank regularization and show that the max-norm and the mixed-norm are particular instances of this general framework. In particular, we choose a specific pair of tensor norms in accordance with the structure of the observation model. This leads to a new convex relaxation of low-rankness, a corresponding optimization formulation, an algorithm, and a performance guarantee. Finally, we point out that our method of proof and the technical tools we use to establish our results are significantly different from those of [7].

## 2 Properties of tensor norms on low-rank matrices

We interpret a matrix $X \in \mathbb{R}^{d_1 \times d_2}$ as a linear operator from the vector space $\mathbb{R}^{d_2}$ to the vector space $\mathbb{R}^{d_1}$, and endow the domain and range spaces with the $\ell_p$-norm and the $\ell_q$-norm, respectively. The vector space of all $d_1 \times d_2$ matrices is then identified with the tensor product of the two Banach spaces, denoted $\ell_{p'}^{d_2} \otimes \ell_q^{d_1}$ (e.g., [14]), where $1/p + 1/p' = 1$. A tensor norm is a norm on the algebraic tensor product of two Banach spaces that satisfies the operator ideal property (see, e.g., [14, 15]). The main insight driving the unified perspective is that, when we restrict linear operators to those of rank at most $r$, certain tensor norms become equivalent up to a function of $r$. In particular, we consider the injective and projective tensor norms, defined respectively as

$$\|X\|_{\vee} = \sup_{u \in \mathbb{R}^{d_2},\ \|u\|_p = 1} \|Xu\|_q \tag{9}$$

and

$$\|X\|_{\wedge} = \inf\Big\{\sum_k \|u_k\|_{p'}\,\|v_k\|_q \;:\; X = \sum_k v_k u_k^\top\Big\}. \tag{10}$$

The pair of injective and projective norms characterizes the set of low-rank matrices through an interlacing property between them. For example, when $p = q = 2$, it is easily verified that $\|X\|_\vee = \|X\|_2$ (the spectral norm) and $\|X\|_\wedge = \|X\|_*$ (the nuclear norm). It follows from the singular value decomposition that $\|X\|_2 \le \|X\|_* \le r\,\|X\|_2$ for matrices of rank at most $r$. In another example, where $p = 1$ and $q = \infty$, we have $\|X\|_\vee = \|X\|_\infty$, the largest entry in magnitude. Linial et al. [13] showed that Grothendieck's inequality implies

$$\|X\|_\infty \;\le\; \|X\|_{\max} \;\le\; \sqrt{r}\,\|X\|_\infty, \tag{11}$$

where $\|X\|_{\max} = \inf_{U,V:\,UV = X} \|U^\top\|_{1\to2}\,\|V\|_{1\to2}$, i.e., the product of the largest row $\ell_2$-norm of $U$ and the largest column $\ell_2$-norm of $V$, minimized over factorizations. In this case, the max-norm has been shown to be equivalent, up to a constant, to the projective norm. Finally, by letting $p = 1$ and $q = 2$, we obtain $\|X\|_\vee = \|X\|_{1\to2}$, the projective norm is equivalent (up to a constant factor) to the mixed-norm, and the relationship in Lemma 1 holds. Further, it is interesting that, unlike many tensor norms, the mixed-norm and the max-norm can be computed in polynomial time, similar to the nuclear norm. As we note in the next section, this enables efficient implementation of mixed-norm-based low-rank recovery programs.
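These interlacing relations can be illustrated numerically at small scale. In the sketch below (ours, illustrative only), the nuclear norm is computed directly, while the mixed-norm and the max-norm are computed through semidefinite programs: the mixed-norm via the characterization (12) given in Section 3, and the max-norm via the analogous semidefinite characterization that is standard in the max-norm literature (see, e.g., [12]). Whatever SDP-capable solver CVXPY has installed (e.g., SCS) is used, so the reported values are accurate only up to solver tolerance.

```python
# Small-scale numerical check of the interlacing relations in Section 2 (illustrative sketch).
import numpy as np
import cvxpy as cp

def _sdp_norm(X, head):
    """head = 'trace' gives the mixed-norm via (12); head = 'diag' gives the max-norm."""
    m, n = X.shape
    W = cp.Variable((m + n, m + n), PSD=True)   # W = [[W11, X], [X^T, W22]] >= 0
    t = cp.Variable()
    top = cp.trace(W[:m, :m]) if head == "trace" else cp.diag(W[:m, :m])
    cons = [W[:m, m:] == X, top <= t, cp.diag(W[m:, m:]) <= t]
    cp.Problem(cp.Minimize(t), cons).solve()
    return t.value

rng = np.random.default_rng(2)
d1, d2, r = 15, 20, 3
X = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))     # rank-r test matrix

spec, nuc = np.linalg.norm(X, 2), np.linalg.norm(X, 'nuc')
ent, col = np.abs(X).max(), np.linalg.norm(X, axis=0).max()
print(spec, "<=", nuc, "<=", r * spec)                               # p = q = 2: spectral / nuclear
print(ent, "<=", _sdp_norm(X, "diag"), "<=", np.sqrt(r) * ent)       # (11): entrywise / max-norm
print(col, "<=", _sdp_norm(X, "trace"), "<=", np.sqrt(r) * col)      # Lemma 1: 1->2 / mixed-norm
```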
## 3 Fast algorithm for mixed-norm-based optimization

The mixed-norm of any matrix $X$ can be computed in polynomial time through the semidefinite characterization

$$\|X\|_{\mathrm{mixed}} = \min_{W_{11},\, W_{22}} \left\{ \max\big(\operatorname{trace}(W_{11}),\ \|\operatorname{diag}(W_{22})\|_\infty\big) \;:\; W = \begin{bmatrix} W_{11} & X \\ X^\top & W_{22} \end{bmatrix} \succeq 0 \right\}, \tag{12}$$

where $\operatorname{diag}(W_{22})$ denotes the vector of the diagonal entries of $W_{22}$. The optimization routine in (7) can then be written as

$$\begin{aligned}
\underset{W_{11},\,W_{22},\,X}{\text{minimize}} \quad & \sum_{l,i} \big|y_{l,i} - \langle A_{l,i}, X\rangle\big|^2 \\
\text{subject to} \quad & \operatorname{trace}(W_{11}) \le R, \quad \operatorname{diag}(W_{22}) \le R, \quad \|X\|_{1\to2} \le \alpha, \\
& W = \begin{bmatrix} W_{11} & X \\ X^\top & W_{22} \end{bmatrix} \succeq 0.
\end{aligned} \tag{13}$$

The program in (13) is a constrained convex optimization problem over the cone of positive semidefinite (PSD) matrices.

### 3.1 ADMM-based fast algorithm

The program in (13) can be solved using standard convex optimization solvers like SeDuMi [16]. However, this can lead to scaling issues, as run times become prohibitive in higher dimensions. To address this, we use an ADMM-based algorithm [17], which breaks the optimization problem into smaller subproblems that can be solved efficiently. Our approach is similar to [18], where the positive semidefinite constraint on $W$ in (13) is treated separately from the other constraints. We provide an algorithm for the norm-penalized version of (13); by Lagrangian duality, the penalized and constrained versions are equivalent when the Lagrange multipliers $\lambda_1$ and $\lambda_2$ are chosen appropriately. By introducing an auxiliary variable $T$, it is straightforward to show that the optimization problem (13) is equivalent to

$$\begin{aligned}
\underset{W,\,T}{\text{minimize}} \quad & \sum_{l,i}\big|y_{l,i} - \langle A_{l,i}, W_{12}\rangle\big|^2 + \lambda_1\operatorname{trace}(T_{11}) + \lambda_2\,\|\operatorname{diag}(W_{22})\|_\infty \\
\text{subject to} \quad & \|W_{12}\|_{1\to2} \le \alpha, \quad T = W, \quad T \succeq 0,
\end{aligned} \tag{14}$$

where $W_{12}$ denotes the upper-right $d_1 \times d_2$ block of $W$, which plays the role of $X$. In (14), the constraints on $\operatorname{trace}(T_{11})$ and $\operatorname{diag}(W_{22})$ are carried to the objective function through the Lagrangian formulation. Other variations are possible, with more or fewer constraints carried over to the objective. The formulation in (14) is amenable to the ADMM algorithm. Writing $f(W) = \sum_{l,i}|y_{l,i} - \langle A_{l,i}, W_{12}\rangle|^2$, the augmented Lagrangian of (14) is given by

$$\mathcal{L}(T, W, Z) = f(W) + \lambda_1\operatorname{trace}(T_{11}) + \lambda_2\,\|\operatorname{diag}(W_{22})\|_\infty + \langle Z, T - W\rangle + \frac{\rho}{2}\,\|T - W\|_F^2 + \chi_{\{T \succeq 0\}} + \chi_{\{\|W_{12}\|_{1\to2} \le \alpha\}},$$

where $Z$ is the dual variable and $\chi_S$ is the indicator function of the set $S$, given by $\chi_S(t) = 0$ if $t \in S$ and $\chi_S(t) = \infty$ otherwise. The ADMM algorithm then alternates among $T$, $W$, and $Z$, as shown in Algorithm 1. While we leave the finer details of the algorithm to the supplementary material, it is worth noting that each step in Algorithm 1 has a unique closed-form solution, which allows the method to scale to high dimensions.

**Algorithm 1: ADMM algorithm**
1. Initialize $T^0, W^0, Z^0$.
2. While not converged:
   - $T^{k+1} = \operatorname{argmin}_{T \succeq 0} \mathcal{L}(T, W^k, Z^k)$
   - $W^{k+1} = \operatorname{argmin}_{\|W_{12}\|_{1\to2} \le \alpha} \mathcal{L}(T^{k+1}, W, Z^k)$
   - $Z^{k+1} = Z^k + \rho\,(T^{k+1} - W^{k+1})$

### 3.2 Experiments

To complement our theoretical results, we examine the empirical performance of the mixed-norm-based method in a set of Monte Carlo simulations. The matrices are of size $1{,}000 \times 1{,}000$ and of rank 5. In our experiments, we normalize the columns to have the same energy. We observe the estimation error while varying the degree of compression and the signal-to-noise ratio (SNR). We compare the proposed method to the popular matrix LASSO, which minimizes the least-squares loss with a nuclear-norm regularizer. We used Algorithm 1 to implement the mixed-norm-based method.
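As noted above, at small scale the constrained program (13) can also be handed directly to a general-purpose conic solver. The following CVXPY sketch is ours and only illustrative: the data layout matches the simulation sketch in Section 1, and a generic SDP solver is far slower than Algorithm 1 for the problem sizes used in the experiments.

```python
# Illustrative small-scale solver for the constrained program (13) using CVXPY.
import numpy as np
import cvxpy as cp

def mixed_norm_estimate(Y, B, alpha, R):
    """Y: (L, d2) observations; B: (d2, d1, L) sensing vectors, with B[i][:, l] = b_{l,i}."""
    d2, d1, L = B.shape
    W = cp.Variable((d1 + d2, d1 + d2), PSD=True)   # W = [[W11, X], [X^T, W22]] >= 0
    X = W[:d1, d1:]                                  # the estimate is the off-diagonal block
    # data-fit term: sum_{l,i} (y_{l,i} - <A_{l,i}, X>)^2 with A_{l,i} = b_{l,i} e_i^T / sqrt(L)
    loss = sum(cp.sum_squares(Y[:, i] - (B[i].T @ X[:, i]) / np.sqrt(L)) for i in range(d2))
    cons = [cp.trace(W[:d1, :d1]) <= R,              # trace(W11) <= R
            cp.diag(W[d1:, d1:]) <= R,               # diag(W22) <= R entrywise
            cp.norm(X, axis=0) <= alpha]             # ||X||_{1->2} <= alpha
    cp.Problem(cp.Minimize(loss), cons).solve()
    return X.value
```

With `Y` and `B` generated as in the Section 1 sketch, calling `mixed_norm_estimate(Y, B, alpha, R)` with $\alpha$ set to the largest column norm of $X_0$ and $R = \alpha\sqrt{r}$ returns an estimate that can be compared against $X_0$ in Frobenius norm.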
The nuclear-norm minimization approach was implemented using the algorithm provided in [19]. Figure 1 shows the obtained simulation results; the estimation error is averaged over 5 trials. The results indicate that the mixed-norm-based estimator outperforms the nuclear-norm-based estimator at both SNR levels considered.

*Figure 1: Simulation results comparing the proposed mixed-norm-based estimator and the nuclear-norm-based estimator. The test matrices are of size $1{,}000 \times 1{,}000$ with rank 5. Each data point is an average over 5 trials. The mixed-norm estimator achieves much lower errors with fewer measurements than the nuclear-norm estimator.*

## 4 Proof sketch

We state the key lemmas involved in proving our result, point to the tools we use, and defer finer details to the supplementary material. We begin with the basic optimality condition that relates the estimate $\hat{X}$ to the ground truth $X_0$. Let $M = \hat{X} - X_0$. By the triangle inequality, $M \in \kappa(2\alpha, 2R)$. For notational brevity, we assume from now on that $M \in \kappa(\alpha, R)$. (Neither the main result nor the proofs are affected by this, since it only changes numerical constants.)

We adapt the first step of the analysis framework for the analogous matrix completion problem [7]. By optimality of the solution and (2), we have

$$\sum_{l,i}\big(y_{l,i} - \langle A_{l,i}, \hat{X}\rangle\big)^2 \;\le\; \sum_{l,i}\big(y_{l,i} - \langle A_{l,i}, X_0\rangle\big)^2. \tag{15}$$

Substituting $\hat{X} - X_0$ by $M$ and rearranging terms, we obtain

$$\sum_{l,i}\langle A_{l,i}, M\rangle^2 \;\le\; 2\sum_{l,i}\langle A_{l,i}, M\rangle\, z_{l,i}. \tag{16}$$

As in [7], we rely on the stochastic nature of the noise. The proof also relies on the norm-constrained (rather than norm-penalized) formulation. Our strategy is to obtain a lower bound on the quadratic form $\sum_{l,i}\langle A_{l,i}, M\rangle^2$ in terms of $\|M\|_F^2$ and a uniform upper bound on the linear form $\sum_{l,i}\langle A_{l,i}, M\rangle z_{l,i}$ over the set $\kappa(\alpha, R)$. We can then bound $\|M\|_F^2$ uniformly over the set.

### 4.1 Lower bound on the quadratic form

We observe that $\sum_{l,i}\langle A_{l,i}, M\rangle^2$ can be reformulated as a quadratic form in standard Gaussian random variables. Define

$$\xi = \begin{bmatrix} b_{1,1} \\ \vdots \\ b_{L,d_2} \end{bmatrix} \in \mathbb{R}^{Ld_1d_2}. \tag{17}$$

Then $\xi \sim \mathcal{N}(0, I_{Ld_1d_2})$, and the left-hand side of (16) can be rewritten as

$$\sum_{l,i}\langle A_{l,i}, M\rangle^2 = \|Q_M\,\xi\|_2^2, \quad\text{where}\quad Q_M = \frac{1}{\sqrt{L}}\begin{bmatrix} \widetilde{M}_1 & 0 & \cdots & 0 \\ 0 & \widetilde{M}_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \widetilde{M}_{d_2} \end{bmatrix}, \qquad \widetilde{M}_j = I_L \otimes (Me_j)^\top \in \mathbb{R}^{L \times Ld_1}. \tag{18}$$

We also have $\mathbb{E}\,\|Q_M\xi\|_2^2 = \|M\|_F^2$. We compute a tail estimate on $\sup_{M \in \kappa(\alpha,R)} \big|\,\|Q_M\xi\|_2^2 - \mathbb{E}\|Q_M\xi\|_2^2\,\big|$ using results on suprema of chaos processes [20], which provide, via a chaining argument, a sharp tail estimate on the supremum over a given set of matrices $\mathcal{A}$ of the deviation $\big|\,\|A\xi\|_2^2 - \mathbb{E}\|A\xi\|_2^2\,\big|$. By adapting their framework, we obtain the following lemma.

**Lemma 2.** Under the assumptions of Theorem 1, if $Q_M$ and $\xi$ are as defined in (18) and (17), then

$$\sup_{M \in \kappa(\alpha,R)} \frac{1}{d_2}\Big|\,\|Q_M\xi\|_2^2 - \|M\|_F^2\,\Big| \;\le\; c\left(R\alpha\sqrt{\frac{d}{Ld_2}} + R^2\,\frac{d}{Ld_2}\right)\log^3 d$$

with probability at least $1 - 2\exp(-cR^2d/\alpha^2)$.

From Lemma 2, in the regime where $Ld_2 > R^2 d/\alpha^2$, we obtain

$$\frac{1}{d_2}\sum_{l,i}\langle A_{l,i}, M\rangle^2 \;\ge\; \frac{\|M\|_F^2}{d_2} - c\,R\alpha\sqrt{\frac{d}{Ld_2}}\,\log^3 d. \tag{19}$$

### 4.2 Upper bound on the right-hand side of (16)

We obtain the following uniform upper bound on $\sum_{l,i}\langle A_{l,i}, M\rangle z_{l,i}$.

**Lemma 3.** Under the assumptions of Theorem 1, with probability at least $1 - 2\exp(-cR^2d/\alpha^2)$,

$$\sup_{M \in \kappa(\alpha,R)} \frac{1}{d_2}\sum_{l,i}\langle A_{l,i}, M\rangle\, z_{l,i} \;\le\; c\,\sigma\sqrt{L}\,R\,\sqrt{\frac{d}{Ld_2}}\,\log^3 d. \tag{20}$$

To derive Lemma 3, we first express the left-hand side of (20) using a matrix norm. Define

$$|||M||| := \max\left(\frac{\|M\|_{1\to2}}{\alpha},\ \frac{\|M\|_{\mathrm{mixed}}}{R}\right).$$

By the definition of $\kappa(\alpha, R)$ in (6), the unit ball $B := \{M : |||M||| \le 1\}$ with respect to $|||\cdot|||$ coincides with $\kappa(\alpha, R)$. Therefore, via Banach space duality, we obtain

$$\sup_{M \in \kappa(\alpha,R)} \sum_{l,i}\langle A_{l,i}, M\rangle z_{l,i} \;=\; \sup_{M \in \kappa(\alpha,R)} \Big\langle \sum_{l,i} z_{l,i}A_{l,i},\ M\Big\rangle \;=\; \Big|\Big|\Big|\sum_{l,i} z_{l,i}A_{l,i}\Big|\Big|\Big|^*,$$

where $|||\cdot|||^*$ denotes the dual norm. Then, conditioned on the $A_{l,i}$, it follows from Theorem 4.7 in [21] that with probability $1 - \delta$,

$$\Big|\Big|\Big|\sum_{l,i} z_{l,i}A_{l,i}\Big|\Big|\Big|^* \;\le\; \underbrace{\mathbb{E}_z\,\Big|\Big|\Big|\sum_{l,i} z_{l,i}A_{l,i}\Big|\Big|\Big|^*}_{T_1} \;+\; \underbrace{\sigma\sqrt{\frac{\log(2/\delta)}{2}\,\sup_{M \in \kappa(\alpha,R)}\sum_{l,i}\langle A_{l,i}, M\rangle^2}}_{T_2}.$$

The first term $T_1$ is the Gaussian complexity of the sample set $\{A_{l,i}\}$ over the function class $\{\langle M, \cdot\rangle : M \in \kappa(\alpha, R)\}$.
This can be upper-bounded (up to a logarithmic factor in the number of terms in the summation) by the corresponding Rademacher complexity ([22], Equation (4.9)):

$$T_1 \;\le\; c\,\sigma\sqrt{\log(Ld_2 + 1)}\;\mathbb{E}_{(r_{l,i})}\Big|\Big|\Big|\sum_{l,i} r_{l,i}A_{l,i}\Big|\Big|\Big|^*, \tag{22}$$

where $(r_{l,i})$ is a Rademacher sequence and the expectation is conditioned on the $A_{l,i}$. By the definition of $A_{l,i}$ and the dual norm,

$$\mathbb{E}_{(r_{l,i})}\Big|\Big|\Big|\sum_{l,i} r_{l,i}A_{l,i}\Big|\Big|\Big|^* = \frac{1}{\sqrt{L}}\,\mathbb{E}_{(r_{l,i})}\sup_{M \in \kappa(\alpha,R)}\sum_{l,i} r_{l,i}\langle b_{l,i}, Me_i\rangle, \tag{23}$$

and, by the symmetry of the standard Gaussian distribution, the supremum inside the expectation has the same distribution as

$$(\star) := \frac{1}{\sqrt{L}}\sup_{M \in \kappa(\alpha,R)}\sum_{l,i}\langle b_{l,i}, Me_i\rangle.$$

Note that $(\star)$ is the supremum of linear combinations of Gaussian variables, so an upper bound can be obtained using Dudley's inequality [22]. Once we have a tail estimate for $(\star)$, since $(\star)$ no longer depends on the Rademacher sequence $(r_{l,i})$, it can be used to upper-bound $T_1$ through (22) and (23). An upper bound on $T_2$ has already been derived in Lemma 2. Combining these estimates on $T_1$ and $T_2$ yields Lemma 3.

From the lower bound (19) on $\sum_{l,i}\langle A_{l,i}, M\rangle^2$ and the basic inequality (16), we have

$$\frac{\|M\|_F^2}{d_2} - c\,\alpha R\sqrt{\frac{d}{Ld_2}}\,\log^3 d \;\le\; \frac{1}{d_2}\sum_{l,i}\langle A_{l,i}, M\rangle^2 \;\le\; \frac{2}{d_2}\,\sup_{M \in \kappa(\alpha,R)}\sum_{l,i}\langle A_{l,i}, M\rangle z_{l,i}.$$

Applying Lemma 3 to the right-hand side gives

$$\frac{\|M\|_F^2}{d_2} \;\le\; c\,R\big(\alpha + \sigma\sqrt{L}\big)\sqrt{\frac{d}{Ld_2}}\,\log^3 d,$$

which then leads to the final result.

### 4.3 Entropy estimate

Part of the proofs of Lemmas 2 and 3 has been deferred to the supplementary material. Both proofs rely on a key quantity that captures the complexity of the set $\kappa(\alpha, R)$. In particular, applying Dudley's inequality requires an estimate of the entropy numbers of the set $\kappa(\alpha, R)$, which is given by the following lemma.

**Lemma 4.** Let $\kappa(\alpha, R)$ be as in (6) and let $B_{1\to2}$ be the unit ball with respect to $\|\cdot\|_{1\to2}$. Then there exists a numerical constant $c$ such that

$$\int_0^\infty \sqrt{\log N\big(\kappa(\alpha, R),\ \eta B_{1\to2}\big)}\;d\eta \;\le\; c\,R\,\sqrt{d}\,\log^{3/2}(d_1 + d_2). \tag{24}$$

Here $N(\kappa(\alpha, R), \eta B_{1\to2})$ denotes the covering number of $\kappa(\alpha, R)$ with respect to the scaled unit ball $\eta B_{1\to2}$.

In Section 2 we introduced the projective tensor norm $\|\cdot\|_\wedge$. Let $B_\wedge$ denote the unit ball with respect to the projective tensor norm on $\ell_\infty^{d_2} \otimes \ell_2^{d_1}$; the injective tensor norm on $\ell_\infty^{d_2} \otimes \ell_2^{d_1}$ reduces to $\|\cdot\|_{1\to2}$. By its construction, $\kappa(\alpha, R)$ is the intersection of the two norm balls $\alpha B_{1\to2}$ and $R\,B_{\mathrm{mixed}}$, and the mixed-norm ball is equivalent, up to a constant, to $B_\wedge$. The proof of Lemma 4 thus reduces to computing the entropy numbers of the identity map on $\ell_\infty^{d_2} \otimes \ell_2^{d_1}$ from the Banach space equipped with the projective tensor norm to the one equipped with the injective tensor norm. This proof, along with a study of the machinery for computing such entropy numbers, can be found in a complementary paper [23].

## 5 Discussion

Low-rank modeling is a widely used approach in many machine learning and signal processing tasks. By interpreting low-rankness as a property expressed by tensor norms, we are able to design a practical and sample-efficient regularization method that is tailored to the observation model. The proposed method comes with theoretical guarantees and also performs well empirically. It can be implemented efficiently in high dimensions, making it a viable option for performing PCA or low-rank recovery in big-data scenarios.

## References

[1] R. Spencer. The square kilometre array: The ultimate challenge for processing big data. In IET Seminar on Data Analytics: Deriving Intelligence and Value from Big Data.

[2] J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher. Streaming low-rank matrix approximation with an application to scientific simulation. arXiv:1902.08651, 2019.

[3] M. A. Davenport and J. Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE J. Sel. Topics Signal Process., 10(4):608–622, June 2016.
[4] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

[5] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717, 2009.

[6] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inf. Theory, 57(3):1548–1566, 2011.

[7] T. T. Cai and W. Zhou. Matrix completion via max-norm constrained optimization. Electronic J. Stat., 10(1):1493–1525, 2016.

[8] Farhad P. A. and Shannon H. Memory and computation efficient PCA via very sparse random projections. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 1341–1349, Beijing, China, June 2014.

[9] M. Azizyan, A. Krishnamurthy, and A. Singh. Extreme compressive sampling for covariance estimation. arXiv preprint arXiv:1506.00898, 2015.

[10] H. Qi and S. M. Hughes. Invariance of principal components under low-dimensional random projection of the data. In 19th IEEE Int. Conf. Image Process., pages 937–940, Sep. 2012.

[11] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist., 39(2):1069–1097, 2011.

[12] N. Srebro, J. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1329–1336. MIT Press, 2005.

[13] N. Linial, S. Mendelson, G. Schechtman, and A. Shraibman. Complexity measures of sign matrices. Combinatorica, 27(4):439–463, 2007.

[14] G. Jameson. Summing and Nuclear Norms in Banach Space Theory, volume 8. Cambridge University Press, 1987.

[15] R. A. Ryan. Introduction to Tensor Products of Banach Spaces. Springer Science & Business Media, 2013.

[16] J. F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11(1-4):625–653, 1999.

[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[18] E. X. Fang, H. Liu, K. Toh, and W. Zhou. Max-norm optimization for robust matrix recovery. Mathematical Programming, 167(1):5–35, 2018.

[19] K. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 2010.

[20] F. Krahmer, S. Mendelson, and H. Rauhut. Suprema of chaos processes and the restricted isometry property. Communications on Pure and Applied Mathematics, 67(11):1877–1904, 2014.

[21] G. Pisier. The Volume of Convex Bodies and Banach Space Geometry, volume 94. Cambridge University Press, 1999.

[22] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

[23] K. Lee, R. S. Srinivasa, M. Junge, and J. Romberg. Entropy estimates on tensor products of Banach spaces and applications to low-rank recovery. In Proceedings of the 13th International Conference on Sampling Theory and Applications (SampTA), Bordeaux, France, July 2019.