# Learning Low-Dimensional Metrics

Lalit Jain, University of Michigan, Ann Arbor, MI 48109, lalitj@umich.edu
Blake Mason, University of Wisconsin, Madison, WI 53706, bmason3@wisc.edu
Robert Nowak, University of Wisconsin, Madison, WI 53706, rdnowak@wisc.edu

This paper investigates the theoretical foundations of metric learning, focused on three key questions that are not fully addressed in prior work: 1) we consider learning general low-dimensional (low-rank) metrics as well as sparse metrics; 2) we develop upper and lower (minimax) bounds on the generalization error; 3) we quantify the sample complexity of metric learning in terms of the dimension of the feature space and the dimension/rank of the underlying metric; 4) we also bound the accuracy of the learned metric relative to the underlying true generative metric. All the results involve novel mathematical approaches to the metric learning problem, and also shed new light on the special case of ordinal embedding (aka non-metric multidimensional scaling).

## 1 Low-Dimensional Metric Learning

This paper studies the problem of learning a low-dimensional Euclidean metric from comparative judgments. Specifically, consider a set of $n$ items with high-dimensional features $x_i \in \mathbb{R}^p$ and suppose we are given a set of (possibly noisy) distance comparisons of the form $\mathrm{sign}(\mathrm{dist}(x_i, x_j) - \mathrm{dist}(x_i, x_k))$, for a subset of all possible triplets of the items. Here we have in mind comparative judgments made by humans, with the distance function implicitly defined according to human perceptions of similarities and differences. For example, the items could be images and the $x_i$ could be visual features automatically extracted by a machine. Accordingly, our goal is to learn a $p \times p$ symmetric positive semi-definite (psd) matrix $K$ such that the metric $d_K(x_i, x_j) := (x_i - x_j)^T K (x_i - x_j)$, where $d_K(x_i, x_j)$ denotes the squared distance between items $i$ and $j$ with respect to a matrix $K$, predicts the given distance comparisons as well as possible. Furthermore, it is often desired that the metric is low-dimensional relative to the original high-dimensional feature representation (i.e., $\mathrm{rank}(K) \le d < p$). There are several motivations for this:

- Learning a high-dimensional metric may be infeasible from a limited number of comparative judgments, and encouraging a low-dimensional solution is a natural regularization.
- Cognitive scientists are often interested in visualizing human perceptual judgments (e.g., in a two-dimensional representation) and determining which features most strongly influence human perceptions. For example, educational psychologists in [1] collected comparisons between visual representations of chemical molecules in order to identify a small set of visual features that most significantly influence the judgments of beginning chemistry students.
- It is sometimes reasonable to hypothesize that a small subset of the high-dimensional features dominate the underlying metric (i.e., many irrelevant features).
- Downstream applications of the learned metric (e.g., for classification purposes) may benefit from robust, low-dimensional metrics.

Authors contributed equally to this paper and are listed alphabetically.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Examples of $K$ for $p = 20$ and $d = 7$. (a) A general low-rank psd matrix. (b) A sparse and low-rank psd matrix. The sparse case depicts a situation in which only some of the features are relevant to the metric.
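As a concrete illustration of the squared metric $d_K$ defined above, the following minimal numpy sketch (our own illustrative code, not from the paper; all variable names are hypothetical) builds a rank-$d$ psd matrix $K = LL^T$ and checks that $d_K(x_i, x_j)$ equals the squared Euclidean distance between the projected points $L^T x_i$ and $L^T x_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 20, 7                      # feature dimension and metric rank (illustrative values)
L = rng.standard_normal((p, d))   # K = L L^T is psd with rank <= d
K = L @ L.T

def d_K(xi, xj, K):
    """Squared distance (x_i - x_j)^T K (x_i - x_j)."""
    diff = xi - xj
    return diff @ K @ diff

xi, xj = rng.standard_normal(p), rng.standard_normal(p)
# Equivalent view: ordinary squared Euclidean distance after the linear map L^T.
assert np.isclose(d_K(xi, xj, K), np.sum((L.T @ (xi - xj)) ** 2))
```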
With this in mind, several authors have proposed nuclear norm and $\ell_{1,2}$ group lasso norm regularization to encourage low-dimensional and sparse metrics as in Fig. 1b (see [2] for a review). Relative to such prior work, the contributions of this paper are three-fold:

1. We develop novel upper bounds on the generalization error and sample complexity of learning low-dimensional metrics from triplet distance comparisons. Notably, unlike previous generalization bounds, our bounds allow one to easily quantify how the feature space dimension $p$ and the rank or sparsity $d < p$ of the underlying metric impact the sample complexity.
2. We establish minimax lower bounds for learning low-rank and sparse metrics that match the upper bounds up to polylogarithmic factors, demonstrating the optimality of learning algorithms for the first time. Moreover, the upper and lower bounds demonstrate that learning sparse (and low-rank) metrics is essentially as difficult as learning a general low-rank metric. This suggests that nuclear norm regularization may be preferable in practice, since it places less restrictive assumptions on the problem.
3. We use the generalization error bounds to obtain model identification error bounds that quantify the accuracy of the learned $K$ matrix. This problem has received very little, if any, attention in the past and is crucial for interpreting the learned metrics (e.g., in cognitive science applications). This is a bit surprising, since the term metric learning strongly suggests accurately determining a metric, not simply learning a predictor that is parameterized by a metric.

### 1.1 Comparison with Previous Work

There is a fairly large body of work on metric learning, which is nicely reviewed and summarized in the monograph [2], and we refer the reader to it for a comprehensive summary of the field. Here we discuss a few recent works most closely connected to this paper. Several authors have developed generalization error bounds for metric learning, as well as bounds for downstream applications, such as classification, based on learned metrics. To use the terminology of [2], most of the focus has been on must-link/cannot-link constraints and less on relative constraints (i.e., the triplet constraints considered in this paper). Generalization bounds based on algorithmic robustness are studied in [3], but the generality of this framework makes it difficult to quantify the sample complexity of specific cases, such as low-rank or sparse metric learning. Rademacher complexities are used to establish generalization error bounds in the must-link/cannot-link situation in [4, 5, 6], but these works do not consider the case of relative/triplet constraints. The sparse compositional metric learning framework of [7] does focus on relative/triplet constraints and provides generalization error bounds in terms of covering numbers. However, this work does not provide bounds on the covering numbers, making it difficult to quantify the sample complexity. To sum up, prior work does not quantify the sample complexity of metric learning based on relative/triplet constraints in terms of the intrinsic problem dimensions (i.e., the dimension $p$ of the high-dimensional feature space and the dimension $d$ of the underlying metric), there is no prior work on lower bounds, and there is no prior work quantifying the accuracy of the learned metrics themselves (i.e., only bounds on prediction errors, not model identification errors).
Finally, we mention that Fazel et al. [8] also consider the recovery of sparse and low-rank matrices from linear observations. Our situation is very different: our matrices are low rank because they are sparse, not simultaneously sparse and low rank as in their case.

## 2 The Metric Learning Problem

Consider $n$ known points $X := [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n}$. We are interested in learning a symmetric positive semidefinite matrix $K$ that specifies a metric on $\mathbb{R}^p$ given ordinal constraints on distances between the known points. Let $S$ denote a set of triplets, where each $t = (i, j, k) \in S$ is drawn uniformly at random from the full set of triplets $\mathcal{T} := \{(i, j, k) : 1 \le i \ne j \ne k \le n,\ j < k\}$. For each triplet, we observe a $y_t \in \{\pm 1\}$ which is a noisy indication of the triplet constraint $d_K(x_i, x_j) < d_K(x_i, x_k)$. Specifically, we assume that each $t$ has an associated probability $q_t$ of $y_t = 1$, and all $y_t$ are statistically independent.

Objective 1: Compute an estimate $\widehat{K}$ from $S$ that predicts triplets as well as possible.

In many instances, our triplet measurements are noisy observations of triplets from a true positive semi-definite matrix $K^*$. In particular, we assume $q_t > 1/2 \iff d_{K^*}(x_i, x_j) < d_{K^*}(x_i, x_k)$. We can also assume an explicit known link function $f : \mathbb{R} \to [0, 1]$, so that $q_t = f\bigl(d_{K^*}(x_i, x_j) - d_{K^*}(x_i, x_k)\bigr)$.

Objective 2: Assuming an explicit known link function $f$, estimate $K^*$ from $S$.

### 2.1 Definitions and Notation

Our triplet observations are nonlinear transformations of a linear function of the Gram matrix $G := X^T K X$. Indeed, for any triplet $t = (i, j, k)$, define
$$M_t(K) := d_K(x_i, x_j) - d_K(x_i, x_k) = 2x_i^T K x_k - 2x_i^T K x_j + x_j^T K x_j - x_k^T K x_k.$$
So for every $t \in S$, $y_t$ is a noisy measurement of $\mathrm{sign}(M_t(K))$. This linear operator may also be expressed as a matrix
$$M_t := x_i x_k^T + x_k x_i^T - x_i x_j^T - x_j x_i^T + x_j x_j^T - x_k x_k^T,$$
so that $M_t(K) = \langle M_t, K \rangle = \mathrm{Trace}(M_t^T K)$. We will use $M_t$ to denote the operator and the associated matrix interchangeably. Ordering the elements of $\mathcal{T}$ lexicographically, we let $\mathcal{M}$ denote the linear map
$$\mathcal{M}(K) = \bigl(M_t(K) \mid t \in \mathcal{T}\bigr) \in \mathbb{R}^{|\mathcal{T}|}.$$
Given a psd matrix $K$ and a sample $t \in S$, we let $\ell(y_t \langle M_t, K \rangle)$ denote the loss of $K$ with respect to $t$; e.g., the 0-1 loss $\mathbf{1}_{\{\mathrm{sign}(y_t \langle M_t, K \rangle) \ne 1\}}$, the hinge loss $\max\{0, 1 - y_t \langle M_t, K \rangle\}$, or the logistic loss $\log(1 + \exp(-y_t \langle M_t, K \rangle))$. Note that we insist that our losses be functions of the triplet differences $\langle M_t, K \rangle$. Further, note that this makes our losses invariant to rigid motions of the points $x_i$. Other models proposed for metric learning use scale-invariant loss functions [9]. For a given loss $\ell$, we then define the empirical risk with respect to our set of observations $S$ to be
$$\widehat{R}_S(K) := \frac{1}{|S|} \sum_{t \in S} \ell(y_t \langle M_t, K \rangle).$$
This is an unbiased estimator of the true risk $R(K) := \mathbb{E}\bigl[\ell(y_t \langle M_t, K \rangle)\bigr]$, where the expectation is taken with respect to a triplet $t$ selected uniformly at random and the random value of $y_t$. Finally, we let $I_n$ denote the identity matrix in $\mathbb{R}^{n \times n}$, $\mathbf{1}_n$ the $n$-dimensional vector of all ones, and $V := I_n - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^T$ the centering matrix. In particular, if $X \in \mathbb{R}^{p \times n}$ is a set of points, $XV$ subtracts the mean of the columns of $X$ from each column. We say that $X$ is centered if $XV = X$, or equivalently $X\mathbf{1}_n = 0$. If $G$ is the Gram matrix of the set of points $X$, i.e. $G = X^T X$, then we say that $G$ is centered if $X$ is centered, or equivalently if $G\mathbf{1}_n = 0$. Furthermore, we use $\|\cdot\|_*$ to denote the nuclear norm and $\|\cdot\|_{1,2}$ to denote the mixed $\ell_{1,2}$ norm of a matrix, the sum of the $\ell_2$ norms of its rows. Unless otherwise specified, we take $\|\cdot\|$ to be the standard operator norm when applied to matrices and the standard Euclidean norm when applied to vectors.
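To make the operators $M_t$ and the empirical risk concrete, here is a short numpy sketch (hypothetical helper names; a sketch under the definitions above, not code from the paper) that forms the matrix $M_t$ for a triplet $t = (i, j, k)$ and evaluates the logistic empirical risk $\widehat{R}_S(K)$:

```python
import numpy as np

def M_t(X, t):
    """Matrix M_t with <M_t, K> = d_K(x_i, x_j) - d_K(x_i, x_k)."""
    i, j, k = t
    xi, xj, xk = X[:, i], X[:, j], X[:, k]
    return (np.outer(xi, xk) + np.outer(xk, xi)
            - np.outer(xi, xj) - np.outer(xj, xi)
            + np.outer(xj, xj) - np.outer(xk, xk))

def empirical_risk(K, X, S, y):
    """Logistic empirical risk (1/|S|) sum_t log(1 + exp(-y_t <M_t, K>))."""
    losses = [np.log1p(np.exp(-yt * np.sum(M_t(X, t) * K))) for t, yt in zip(S, y)]
    return np.mean(losses)

# Tiny usage example with random data (illustrative only).
rng = np.random.default_rng(1)
p, n = 5, 8
X = rng.standard_normal((p, n))
K = np.eye(p)
S = [(0, 1, 2), (3, 4, 5)]
y = [1, -1]
print(empirical_risk(K, X, S, y))
```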
Finally, we define the $K$-norm of a vector as $\|x\|_K^2 := x^T K x$.

### 2.2 Sample Complexity of Learning Metrics

In most applications, we are interested in learning a matrix $K$ that is low-rank and positive semidefinite. Furthermore, as we will show in Theorem 2.1, such matrices can be learned using fewer samples than general psd matrices. As is common in machine learning applications, we relax the rank constraint to a nuclear norm constraint. In particular, let our constraint set be
$$\mathcal{K}_{\lambda,\gamma} = \bigl\{K \in \mathbb{R}^{p \times p} \mid K \text{ positive semidefinite},\ \|K\|_* \le \lambda,\ \max_{t \in \mathcal{T}} |\langle M_t, K \rangle| \le \gamma\bigr\}.$$
Up to constants, a bound on $\max_{t \in \mathcal{T}} |\langle M_t, K \rangle|$ is a bound on $x_i^T K x_i$. This bound, along with the assumption that our loss function is Lipschitz, leads to a tighter bound on the deviation of $\widehat{R}_S(K)$ from $R(K)$, which is crucial in our upper bound theorem. Let $K^* := \arg\min_{K \in \mathcal{K}_{\lambda,\gamma}} R(K)$ be the true risk minimizer in this class, and let $\widehat{K} := \arg\min_{K \in \mathcal{K}_{\lambda,\gamma}} \widehat{R}_S(K)$ be the empirical risk minimizer. We achieve the following prediction error bound for the empirical risk minimizer.

Theorem 2.1. Fix $\lambda, \gamma, \delta > 0$. In addition, assume that $\max_{1 \le i \le n} \|x_i\|^2 = 1$. If the loss function is $L$-Lipschitz, then with probability at least $1 - \delta$,
$$R(\widehat{K}) - R(K^*) \;\le\; 4L\sqrt{\frac{140\,\lambda^2\,\|XX^T\|\,\log p}{n\,|S|}} + \frac{2\log p}{|S|} + \sqrt{\frac{2L^2\gamma^2\log(2/\delta)}{|S|}}.$$

Note that past generalization error bounds in the metric learning literature have failed to quantify the precise dependence on observation noise, dimension, rank, and the features $X$. Consider the fact that a $p \times p$ matrix with rank $d$ has $O(dp)$ degrees of freedom. With that in mind, one expects the sample complexity to also be roughly $O(dp)$. We next show that this intuition is correct if the original representation $X$ is isotropic (i.e., has no preferred direction).

The Isotropic Case. Suppose that $x_1, \ldots, x_n$, $n > p$, are drawn independently from the isotropic Gaussian $N(0, \frac{1}{p} I)$. Furthermore, suppose that $K^* = \frac{p}{\sqrt{d}} U U^T$, where $U \in \mathbb{R}^{p \times d}$ is a generic (dense) orthogonal matrix with unit norm columns. The factor $\frac{p}{\sqrt{d}}$ is simply the scaling needed so that the average magnitude of the entries of $K^*$ is a constant, independent of the dimensions $p$ and $d$. In this case, $\mathrm{rank}(K^*) = d$ and $\|K^*\|_F = \frac{p}{\sqrt{d}}\sqrt{\mathrm{trace}(U^T U)} = p$. These two facts imply that the tightest bound on the nuclear norm of $K^*$ is $\|K^*\|_* \le p\sqrt{d}$. Thus, we take $\lambda = p\sqrt{d}$ for the nuclear norm constraint. Now let $z_i = \sqrt{p}\,U^T x_i \sim N(0, I_d)$ and note that $\|x_i\|_{K^*}^2 = \frac{1}{\sqrt{d}}\|z_i\|^2$, where $\|z_i\|^2 \sim \chi_d^2$. Therefore $\mathbb{E}\,\|x_i\|_{K^*}^2 \le d$, and it follows from standard concentration bounds that with large probability $\max_i \|x_i\|_{K^*}^2 \le 5d\log n =: \gamma$; see [10]. Also, because the $x_i \sim N(0, \frac{1}{p} I)$, it follows that if $n > p\log p$, say, then with large probability $\|XX^T\| \le 5n/p$. We now plug these calculations into Theorem 2.1 to obtain the following corollary.

Corollary 2.1.1 (Sample complexity for isotropic points). Fix $\delta > 0$, set $\lambda = p\sqrt{d}$, and assume that $\|XX^T\| = O(n/p)$ and $\gamma := \max_i \|x_i\|_{K^*}^2 = O(d\log n)$. Then for a generic $K^* \in \mathcal{K}_{\lambda,\gamma}$, as constructed above, with probability at least $1 - \delta$,
$$R(\widehat{K}) - R(K^*) = O\!\left(\sqrt{\frac{dp\,(\log p + \log^2 n)}{|S|}}\right).$$

This bound agrees with the intuition that the sample complexity should grow roughly like $dp$, the degrees of freedom of $K^*$. Moreover, our minimax lower bound in Theorem 2.3 below shows that, ignoring logarithmic factors, the general upper bound in Theorem 2.1 is unimprovable in general. Beyond low-rank metrics, in many applications it is reasonable to assume that only a few of the features are salient and should be given nonzero weight. Such a metric may be learned by insisting that $K$ be row sparse in addition to being low rank.
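A natural way to compute the empirical risk minimizer $\widehat{K}$ over $\mathcal{K}_{\lambda,\gamma}$ in practice is projected gradient descent on the psd nuclear-norm ball. The paper does not specify an optimization algorithm, so the following is only a plausible sketch under the logistic loss (it ignores the $\gamma$ constraint for simplicity); `M_t` refers to the helper from the earlier snippet and all other names are ours:

```python
import numpy as np

def project_simplex(v, radius):
    """Euclidean projection of v onto {w >= 0, sum(w) <= radius}."""
    w = np.clip(v, 0, None)
    if w.sum() <= radius:
        return w
    u = np.sort(w)[::-1]                         # project onto the scaled simplex
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.clip(w - theta, 0, None)

def project_psd_nuclear(K, lam):
    """Project a symmetric matrix onto {K psd, ||K||_* <= lam}."""
    evals, evecs = np.linalg.eigh((K + K.T) / 2)
    evals = project_simplex(evals, lam)
    return (evecs * evals) @ evecs.T

def erm_nuclear(X, S, y, lam, step=0.1, iters=500):
    """Projected gradient descent on the logistic triplet loss (sketch)."""
    p = X.shape[0]
    K = np.zeros((p, p))
    for _ in range(iters):
        grad = np.zeros_like(K)
        for t, yt in zip(S, y):
            Mt = M_t(X, t)
            margin = yt * np.sum(Mt * K)
            grad += -yt / (1.0 + np.exp(margin)) * Mt   # d/dK log(1+exp(-margin))
        K = project_psd_nuclear(K - step * grad / len(S), lam)
    return K
```

The same template applies to the row-sparse setting discussed next, with the projection replaced by one onto the $\ell_{1,2}$ ball.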
Whereas learning a low-rank $K$ assumes that distance is well represented in a low-dimensional subspace, a row-sparse (and hence low-rank) $K$ defines a metric using only a subset of the features. Figure 1 gives a comparison of a low-rank versus a low-rank and sparse matrix $K$. Analogous to the convex relaxation of rank by the nuclear norm, it is common to relax row sparsity by using the mixed $\ell_{1,2}$ norm. In fact, the geometries of the $\ell_{1,2}$ and nuclear norm balls are tightly related, as the following lemma shows.

Lemma 2.2. For a symmetric positive semi-definite matrix $K \in \mathbb{R}^{p \times p}$, $\|K\|_* \le \|K\|_{1,2}$.

Proof. $\|K\|_{1,2} = \sum_{i=1}^p \|K_{i,\cdot}\|_2 \ge \sum_{i=1}^p K_{i,i} = \mathrm{Trace}(K) = \sum_{i=1}^p \lambda_i(K) = \|K\|_*$.

This implies that the $\ell_{1,2}$ ball of a given radius is contained inside the nuclear norm ball of the same radius. In particular, it is reasonable to assume that it is easier to learn a $K$ that is sparse in addition to being low rank. Surprisingly, however, the following minimax bound shows that this is not necessarily the case. To make this more precise, we will consider optimization over the set
$$\mathcal{K}^{1,2}_{\lambda,\gamma} = \bigl\{K \in \mathbb{R}^{p \times p} \mid K \text{ positive semidefinite},\ \|K\|_{1,2} \le \lambda,\ \max_{t \in \mathcal{T}} |\langle M_t, K \rangle| \le \gamma\bigr\}.$$
Furthermore, we must specify the way in which our data could be generated from noisy triplet observations of a fixed $K^*$. To this end, assume the existence of a link function $f : \mathbb{R} \to [0, 1]$ so that $q_t = P(y_t = 1) = f(M_t(K^*))$ governs the observations. There is a natural associated logarithmic loss function $\ell_f$ corresponding to the log-likelihood, where the loss of an arbitrary $K$ is
$$\ell_f(y_t \langle M_t, K \rangle) = \mathbf{1}_{\{y_t = 1\}} \log\frac{1}{f(\langle M_t, K \rangle)} + \mathbf{1}_{\{y_t = -1\}} \log\frac{1}{1 - f(\langle M_t, K \rangle)}.$$

Theorem 2.3. Choose a link function $f$ and let $\ell_f$ be the associated logarithmic loss. For $p$ sufficiently large, there exists a choice of $\gamma$, $\lambda$, $X$, and $|S|$ such that
$$\inf_{\widehat{K}}\ \sup_{K^* \in \mathcal{K}^{1,2}_{\lambda,\gamma}}\ \mathbb{E}\bigl[R(\widehat{K})\bigr] - R(K^*) \;\ge\; C_1 \sqrt{C\,\frac{\lambda^2\,\|XX^T\|}{n\,|S|}}, \qquad C := \frac{\inf_{|x| \le \gamma} f(x)\bigl(1 - f(x)\bigr)}{\sup_{|\xi| \le \gamma} f'(\xi)^2},$$
where $C_1$ is an absolute constant and the infimum is taken over all estimators $\widehat{K}$ of $K^*$ from $|S|$ samples.

Importantly, up to polylogarithmic factors and constants, our minimax lower bound over the $\ell_{1,2}$ ball matches the upper bound over the nuclear norm ball given in Theorem 2.1. In particular, in the worst case, learning a sparse and low-rank matrix $K$ is no easier than learning a $K$ that is simply low rank. However, in many realistic cases a slight performance gain is seen from optimizing over the $\ell_{1,2}$ ball when $K^*$ is row sparse, while optimizing over the nuclear norm ball does better when $K^*$ is dense. We show examples of this in Section 3. The proof is given in the supplementary materials. Note that if $\gamma$ is in a bounded range, then the constant $C$ has little effect. For the case that $f$ is the logistic function, $f'(x) = f(x)(1 - f(x))$, so $C_f := \inf_{|x| \le \gamma} f'(x) \ge \frac{1}{4}e^{-\gamma}$. Likewise, the term under the root will also be bounded for $\gamma$ in a constant range. The terms in the constant $C$ arise when translating from risk and a KL-divergence to squared distance, and they reflect the noise in the problem.

### 2.3 Sample Complexity Bounds for Identification

Under a general loss function and arbitrary $K^*$, we cannot hope to convert our prediction error bounds into a recovery statement. However, in this section we will show that as long as $K^*$ is low rank, and if we choose the loss function to be the log loss $\ell_f$ of a given link function $f$ as defined prior to the statement of Theorem 2.3, recovery is possible. Firstly, note that under these assumptions we have an explicit formula for the risk,
$$R(K) = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} f(\langle M_t, K^* \rangle) \log\frac{1}{f(\langle M_t, K \rangle)} + \bigl(1 - f(\langle M_t, K^* \rangle)\bigr) \log\frac{1}{1 - f(\langle M_t, K \rangle)},$$
and consequently
$$R(K) - R(K^*) = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} KL\bigl(f(\langle M_t, K^* \rangle)\,\|\,f(\langle M_t, K \rangle)\bigr),$$
where $KL(p\|q)$ denotes the Kullback-Leibler divergence between Bernoulli distributions with means $p$ and $q$.
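Lemma 2.2 is easy to sanity-check numerically; the following snippet (ours, illustrative only) draws a random psd matrix and verifies $\|K\|_* \le \|K\|_{1,2}$:

```python
import numpy as np

rng = np.random.default_rng(2)
p, d = 30, 5
L = rng.standard_normal((p, d))
K = L @ L.T                                   # random psd matrix of rank d

nuclear = np.linalg.eigvalsh(K).sum()         # ||K||_* = trace(K) for psd K
mixed_12 = np.linalg.norm(K, axis=1).sum()    # ||K||_{1,2}: sum of row 2-norms
assert nuclear <= mixed_12 + 1e-9             # Lemma 2.2: ||K||_* <= ||K||_{1,2}
```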
The following lemma shows that if the excess risk is small, i.e. $R(\widehat{K})$ approximates $R(K^*)$ well, then $\mathcal{M}(\widehat{K})$ approximates $\mathcal{M}(K^*)$ well. The proof, given in the supplementary materials, uses standard Taylor series arguments to show that the KL-divergence is bounded below by a squared distance.

Lemma 2.4. Let $C_f = \inf_{|x| \le \gamma} f'(x)$. Then for any $K \in \mathcal{K}_{\lambda,\gamma}$,
$$\frac{C_f}{|\mathcal{T}|}\,\|\mathcal{M}(K) - \mathcal{M}(K^*)\|^2 \;\le\; R(K) - R(K^*).$$

The preceding lemma may give us hope that recovering $K^*$ from $\mathcal{M}(K^*)$ is trivial, but the linear operator $\mathcal{M}$ is non-invertible in general, as we discuss next. To see why, we must consider a more general class of operators defined on Gram matrices. Given a symmetric matrix $G$, define the operator $L_t$ by
$$L_t(G) = 2G_{ik} - 2G_{ij} + G_{jj} - G_{kk}.$$
If $G = X^T K X$, then $L_t(G) = M_t(K)$, and moreover $M_t = X L_t X^T$. Analogous to $\mathcal{M}$, we combine the $L_t$ operators into a single operator $\mathcal{L}$,
$$\mathcal{L}(G) = \bigl(L_t(G) \mid t \in \mathcal{T}\bigr) \in \mathbb{R}^{|\mathcal{T}|}.$$

Lemma 2.5. The null space of $\mathcal{L}$ is one dimensional, spanned by $V = I_n - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^T$.

The proof is contained in the supplementary materials. In particular, we see that $\mathcal{M}$ is not invertible in general, adding a serious complication to our argument. However, $\mathcal{L}$ is still invertible on the subset of centered symmetric matrices orthogonal to $V$, a fact that we will now exploit. We can decompose $G$ into $V$ and a component orthogonal to $V$ denoted $H$,
$$G = H + \sigma_G V, \qquad \sigma_G := \frac{\langle G, V \rangle_F}{\|V\|_F^2},$$
and under the assumption that $G$ is centered, $\sigma_G = \frac{\|G\|_*}{n - 1}$. Remarkably, the following lemma tells us that a non-linear function of $H$ uniquely determines $G$.

Lemma 2.6. If $n > d + 1$ and $G$ is rank $d$ and centered, then $-\sigma_G$ is an eigenvalue of $H$ with multiplicity $n - d - 1$. In addition, given another centered Gram matrix $G'$ of rank $d'$, $\sigma_{G'} - \sigma_G$ is an eigenvalue of $H - H'$ with multiplicity at least $n - d - d' - 1$.

Proof. Since $G$ is centered, $\mathbf{1}_n \in \ker G$, and in particular $\dim(\mathbf{1}_n^\perp \cap \ker G) = n - d - 1$. If $x \in \mathbf{1}_n^\perp \cap \ker G$, then $0 = Gx = Hx + \sigma_G Vx = Hx + \sigma_G x$, so $Hx = -\sigma_G x$. For the second statement, notice that $\dim(\mathbf{1}_n^\perp \cap \ker G \cap \ker G') \ge n - d - d' - 1$. A similar argument then applies.

If $n > 2d$, then the multiplicity of the eigenvalue $-\sigma_G$ is roughly $n/2$ or more, so we can trivially identify it from the spectrum of $H$. This gives us a non-linear way to recover $G$ from $H$. Now we can return to the task of recovering $K^*$ from $\mathcal{M}(\widehat{K})$. Indeed, the above lemma implies that $G^*$ (and hence $K^*$ if $X$ is full rank) can be recovered from $H^*$ by computing an eigenvalue of $H^*$. However, $H^*$ is recoverable from $\mathcal{L}(H^*)$, which is itself well approximated by $\mathcal{L}(\widehat{H}) = \mathcal{M}(\widehat{K})$. The proof of the following theorem makes this argument precise.

Theorem 2.7. Assume that $K^*$ is rank $d$, $\widehat{K}$ is rank $d'$, $n > d + d' + 1$, $X$ is rank $p$, and $X^T K^* X$ and $X^T \widehat{K} X$ are both centered. Let $C_{d,d'} = 1 + \frac{n-1}{n - d - d' - 1}$. Then with probability at least $1 - \delta$,
$$\frac{n\,\sigma_{\min}(XX^T)^2}{|\mathcal{T}|}\,\|\widehat{K} - K^*\|_F^2 \;\le\; \frac{C_{d,d'}}{C_f}\left(4L\sqrt{\frac{140\,\lambda^2\,\|XX^T\|\,\log p}{n\,|S|}} + \frac{2\log p}{|S|} + \sqrt{\frac{2L^2\gamma^2\log(2/\delta)}{|S|}}\right),$$
where $\sigma_{\min}(XX^T)$ is the smallest eigenvalue of $XX^T$.

The proof, given in the supplementary materials, relies on two key components: Lemma 2.6 and a type of restricted isometry property for $\mathcal{M}$ on $V^\perp$. Our proof technique is a streamlined and more general approach similar to that used in the special case of ordinal embedding. In fact, our new bound improves on the recovery bound given in [11] for ordinal embedding. We have several remarks about the bound in the theorem. If $X$ is well conditioned, e.g. isotropic, then $\sigma_{\min}(XX^T) \approx \frac{n}{p}$. In that case $\frac{n\,\sigma_{\min}(XX^T)^2}{|\mathcal{T}|} \approx \frac{1}{p^2}$, so the left-hand side is the average squared error of the recovery. In most applications the rank of the empirical risk minimizer $\widehat{K}$ is approximately equal to the rank of $K^*$, i.e. $d \approx d'$. Note that if $d + d' \le \frac{1}{2}(n-1)$, then $C_{d,d'} \le 3$.
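The eigenvalue trick of Lemma 2.6 translates directly into code. The sketch below (our own; `decompose` and `recover_G` are hypothetical names) splits a centered, rank-$d$ Gram matrix as $G = H + \sigma_G V$ and then recovers $G$ from $H$ alone by reading $-\sigma_G$ off the spectrum of $H$; we additionally use the fact that, because $G$ is psd, $-\sigma_G$ is the smallest eigenvalue of $H$:

```python
import numpy as np

def decompose(G):
    """Split a centered Gram matrix as G = H + sigma_G * V, with H orthogonal to V."""
    n = G.shape[0]
    V = np.eye(n) - np.ones((n, n)) / n
    sigma = np.trace(G) / (n - 1)          # sigma_G = <G, V>_F / ||V||_F^2 for centered G
    return G - sigma * V, sigma

def recover_G(H, d):
    """Recover G from H via Lemma 2.6: -sigma_G is an eigenvalue of H with
    multiplicity n - d - 1; since G is psd it is the smallest one."""
    n = H.shape[0]
    V = np.eye(n) - np.ones((n, n)) / n
    evals = np.linalg.eigvalsh(H)          # ascending order
    sigma = -evals[: n - d - 1].mean()     # average the repeated eigenvalue
    return H + sigma * V

# Sanity check on a random rank-d centered Gram matrix (illustrative).
rng = np.random.default_rng(3)
n, d = 40, 3
Z = rng.standard_normal((n, d))
Z -= Z.mean(axis=0)                        # center the points
G = Z @ Z.T                                # centered Gram matrix of rank d
H, sigma = decompose(G)
assert np.allclose(recover_G(H, d), G, atol=1e-8)
```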
Finally, the assumption that $X^T K^* X$ and $X^T \widehat{K} X$ are centered can be guaranteed by centering $X$, which has no impact on the triplet differences $\langle M_t, K^* \rangle$, or by insisting that $K$ is centered. As mentioned above, $C_f$ has little effect assuming that our measurements $\langle M_t, K \rangle$ are bounded.

### 2.4 Applications to Ordinal Embedding

In the ordinal embedding setting, there is a set of items with unknown locations $z_1, \ldots, z_n \in \mathbb{R}^d$ and a set of triplet observations $S$ where, as in the metric learning case, observing $y_t = 1$ for a triplet $t = (i, j, k)$ is indicative of $\|z_i - z_j\|^2 \le \|z_i - z_k\|^2$, i.e. item $i$ is closer to $j$ than to $k$. The goal is to recover the $z_i$'s, up to rigid motions, by recovering their Gram matrix $G^*$ from these comparisons. The ordinal embedding case reduces to metric learning through the following observation. Consider the case when $n = p$ and $X = I_p$, i.e. the $x_i$ are standard basis vectors. Letting $K^* = G^*$, we see that $\|x_i - x_j\|_{K^*}^2 = \|z_i - z_j\|^2$. So in particular, $L_t = M_t$ for each triplet $t$, and observations are exactly comparative distance judgments. Our results then apply, and extend previous work on sample complexity in the ordinal embedding setting given in [11]. In particular, though Theorem 5 in [11] provides a consistency guarantee that the empirical risk minimizer $\widehat{G}$ will converge to $G^*$, it does not provide a convergence rate. We resolve this issue now. In that work, it is assumed that $\|z_i\|^2 \le \gamma$ and $\|G^*\|_* \le dn\gamma$. In particular, sample complexity results of the form $O(dn\gamma \log n)$ are obtained. However, these results are trivial in the following sense: if $\|z_i\|^2 \le \gamma$, then $\|G^*\|_* \le \gamma n$, and their results (as well as our upper bound) imply that the true sample complexity is significantly smaller, namely $O(\gamma n \log n)$, which is independent of the ambient dimension $d$. As before, assume an explicit link function $f$ with Lipschitz constant $L$, so the samples are noisy observations governed by $G^*$, and take the loss to be the logarithmic loss associated to $f$. We obtain the following improved recovery bound in this case. The proof is immediate from Theorem 2.7.

Corollary 2.7.1. Let $G^*$ be the Gram matrix of $n$ centered points in $d$ dimensions with $\max_i \|z_i\|^2 \le \gamma$. Let $\widehat{G} = \arg\min_{\|G\|_* \le \gamma n,\ \|G\|_\infty \le \gamma} \widehat{R}_S(G)$ and assume that $\widehat{G}$ is rank $d$, with $n > 2d + 1$. Then the recovery bound of Theorem 2.7 holds with $X = I_n$, $p = n$, and $\lambda = \gamma n$.

## 3 Experiments

To validate our complexity and recovery guarantees, we ran the following simulations. We generate $x_1, \ldots, x_n \overset{iid}{\sim} N(0, \frac{1}{p} I)$, with $n = 200$, and $K^* = \frac{p}{\sqrt{d}} U U^T$ for a random orthogonal matrix $U \in \mathbb{R}^{p \times d}$ with unit norm columns. In Figure 2a, $K^*$ has $d$ nonzero rows/columns. In Figure 2b, $K^*$ is a dense rank-$d$ matrix. We compare the performance of nuclear norm and $\ell_{1,2}$ regularization in each setting against an unconstrained baseline where we only enforce that $K$ be psd. Given a fixed number of samples, each method is compared in terms of the relative excess risk, $\frac{R(\widehat{K}) - R(K^*)}{R(K^*)}$, and the relative squared recovery error, $\frac{\|\widehat{K} - K^*\|_F^2}{\|K^*\|_F^2}$, averaged over 20 trials. The y-axes of both plots have been trimmed for readability. In the case that $K^*$ is sparse, $\ell_{1,2}$ regularization outperforms nuclear norm regularization. However, in the case of dense low-rank matrices, nuclear norm regularization is superior. Notably, as expected from our upper and lower bounds, the performances of the two approaches seem to be within constant factors of each other. Therefore, unless there is strong reason to believe that the underlying $K^*$ is sparse, nuclear norm regularization achieves comparable performance with a less restrictive modeling assumption.
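For reference, here is a sketch of the synthetic setup used in the dense low-rank case (our own code; the logistic link, the sample size, and the helper names are our choices, and `M_t` is the helper from Section 2.1):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, d = 200, 100, 10
num_samples = 20000                            # |S|, illustrative

# Features x_1,...,x_n ~ N(0, I/p) and a dense rank-d K* = (p/sqrt(d)) U U^T.
X = rng.standard_normal((p, n)) / np.sqrt(p)
U, _ = np.linalg.qr(rng.standard_normal((p, d)))   # orthonormal columns
K_star = (p / np.sqrt(d)) * U @ U.T

def sample_triplets(n, m, rng):
    """Draw m triplets (i, j, k) uniformly with i, j, k distinct and j < k."""
    triplets = []
    while len(triplets) < m:
        i, j, k = rng.choice(n, size=3, replace=False)
        triplets.append((i, min(j, k), max(j, k)))
    return triplets

def noisy_labels(X, K, S, rng):
    """y_t = +1 with probability f(<M_t, K>) under a logistic link f."""
    y = []
    for t in S:
        m = np.sum(M_t(X, t) * K)              # M_t from the earlier sketch
        y.append(1 if rng.random() < 1.0 / (1.0 + np.exp(-m)) else -1)
    return y

S = sample_triplets(n, num_samples, rng)
y = noisy_labels(X, K_star, S, rng)
```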
Furthermore, in the two settings, both the nuclear norm and $\ell_{1,2}$ constrained methods outperform the unconstrained baseline, especially in the case where $K^*$ is low rank and sparse. To empirically validate our sample complexity results, we compute the number of samples, averaged over 20 runs, needed to achieve a relative excess risk of less than 0.1 in Figure 3. First, we fix $p = 100$ and increment $d$ from 1 to 10. Then we fix $d = 10$ and increment $p$ from 10 to 100 to clearly show the linear dependence of the sample complexity on $d$ and $p$, as demonstrated in Corollary 2.1.1. To our knowledge, these are the first results quantifying the sample complexity in terms of the number of features, $p$, and the embedding dimension, $d$.

Figure 2: $\ell_{1,2}$ and nuclear norm regularization performance. (a) Sparse low-rank metric. (b) Dense low-rank metric.

Figure 3: Number of samples to achieve relative excess risk < 0.1. (a) $d$ varying. (b) $p$ varying.

## Acknowledgments

This work was partially supported by the NSF grants CCF-1218189 and IIS-1623605.

## References

[1] Martina A. Rau, Blake Mason, and Robert D. Nowak. How to model implicit knowledge? Similarity learning methods to assess perceptions of visual representations. In Proceedings of the 9th International Conference on Educational Data Mining, pages 199-206, 2016.
[2] Aurélien Bellet, Amaury Habrard, and Marc Sebban. Metric learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9(1):1-151, 2015.
[3] Aurélien Bellet and Amaury Habrard. Robustness and generalization for metric learning. Neurocomputing, 151:259-267, 2015.
[4] Zheng-Chu Guo and Yiming Ying. Guaranteed classification via regularized similarity learning. Neural Computation, 26(3):497-522, 2014.
[5] Yiming Ying, Kaizhu Huang, and Colin Campbell. Sparse metric learning via smooth optimization. In Advances in Neural Information Processing Systems, pages 2214-2222, 2009.
[6] Wei Bian and Dacheng Tao. Constrained empirical risk minimization framework for distance metric learning. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1194-1205, 2012.
[7] Yuan Shi, Aurélien Bellet, and Fei Sha. Sparse compositional metric learning. arXiv preprint arXiv:1404.4105, 2014.
[8] Samet Oymak, Amin Jalali, Maryam Fazel, Yonina C. Eldar, and Babak Hassibi. Simultaneously structured models with application to sparse and low-rank matrices. IEEE Transactions on Information Theory, 61(5):2886-2908, 2015.
[9] Eric Heim, Matthew Berger, Lee Seversky, and Milos Hauskrecht. Active perceptual similarity modeling with auxiliary information. arXiv preprint arXiv:1511.02254, 2015.
[10] Kenneth R. Davidson and Stanislaw J. Szarek. Local operator theory, random matrices and Banach spaces. Handbook of the Geometry of Banach Spaces, 1:317-366, 2001.
[11] Lalit Jain, Kevin G. Jamieson, and Rob Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Advances in Neural Information Processing Systems, pages 2703-2711, 2016.
[12] Mark A. Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189-223, 2014.
[13] Joel A. Tropp. An introduction to matrix concentration inequalities, 2015.
[14] Felix Abramovich and Vadim Grinshtein. Model selection and minimax estimation in generalized linear models. IEEE Transactions on Information Theory, 62(6):3721-3730, 2016.
[15] Florentina Bunea, Alexandre B. Tsybakov, Marten H. Wegkamp, et al. Aggregation for Gaussian regression. The Annals of Statistics, 35(4):1674-1697, 2007.
[16] Philippe Rigollet and Alexandre Tsybakov. Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, pages 731-771, 2011.
[17] Jon Dattorro. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing USA.