# Exponential Family Matrix Completion under Structural Constraints

Suriya Gunasekar (SURIYA@UTEXAS.EDU), Pradeep Ravikumar (PRADEEPR@CS.UTEXAS.EDU), Joydeep Ghosh (GHOSH@ECE.UTEXAS.EDU)
The University of Texas at Austin, Texas, USA

We consider the matrix completion problem of recovering a structured matrix from noisy and partial measurements. Recent works have proposed tractable estimators with strong statistical guarantees for the case where the underlying matrix is low rank, and the measurements consist of a subset, either of the exact individual entries, or of the entries perturbed by additive Gaussian noise, which is thus implicitly suited for thin-tailed continuous data. Arguably, common applications of matrix completion require estimators for (a) heterogeneous data types, such as skewed continuous, count, binary, etc., (b) heterogeneous noise models (beyond Gaussian), which capture varied uncertainty in the measurements, and (c) heterogeneous structural constraints beyond low rank, such as block sparsity, or a superposition structure of low rank plus elementwise sparseness, among others. In this paper, we provide a vastly unified framework for generalized matrix completion by considering a matrix completion setting wherein the matrix entries are sampled from any member of the rich family of exponential family distributions, and by imposing general structural constraints on the underlying matrix, as captured by a general regularizer $R(\cdot)$. We propose a simple convex regularized M-estimator for the generalized framework, and provide a unified and novel statistical analysis for this general class of estimators. We finally corroborate our theoretical results on simulated datasets.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. Introduction

In the general problem of matrix completion, we seek to recover a structured matrix from noisy and partial measurements. This problem class encompasses a wide range of practically important applications such as recommendation systems, recovering gene-protein interactions, and analyzing document collections in language processing, among others. In recent years, leveraging developments in sparse estimation and compressed sensing, there has been a surge of work on computationally tractable estimators with strong statistical guarantees, specifically for the setting where a subset of entries of a low rank matrix are observed either deterministically, or perturbed by additive noise that is Gaussian (Candes & Plan, 2010), or more generally sub-Gaussian (Keshavan et al., 2010b; Negahban & Wainwright, 2012). While such a Gaussian noise model is amenable to the subtle statistical analyses required for the ill-posed problem of matrix completion, it is not always practically suitable for all data settings encountered in matrix completion problems. For instance, such a Gaussian error model might not be appropriate in recommender systems based on movie ratings that are either binary (likes or dislikes), or range over the integers one through five. The first question we ask in this paper is whether we can generalize the statistical estimators for matrix completion, as well as their analyses, to general noise models.
Note that a noise model captures the uncertainty underlying the matrix measurements, and is thus an important component of the problem specification in any application; it is therefore vital for the broad applicability of this class of matrix completion estimators that they extend to general noise models. Though this might seem like a narrow, albeit important, technical question, it is related to a broader issue. A Gaussian observation model implicitly assumes the matrix values are continuous valued (and that they are thin-tail distributed). But in modern applications, matrix data span the gamut of heterogeneous data types, for instance, skewed continuous, categorical discrete including binary, and count valued, among others. This gives rise to the second question of whether we can generalize the standard matrix completion estimators and statistical analyses, suited for thin-tailed continuous data, to more heterogeneous data types. Note that there has been some recent work for the specific case of binary data by Davenport et al. (2012), but generalizations to other data types and distributions are largely unexplored.

The recent line of work on matrix completion, moreover, enforces the constraint that the underlying matrix be either exactly or approximately low rank. Aside from the low rank constraints, further assumptions to eliminate overly spiky matrices are required for well-posed recovery under partial measurements (Candes & Recht, 2009). Early work provided generalization error bounds for various low rank matrix completion algorithms, including algorithms based on nuclear norm minimization (Candes & Recht, 2009; Candes & Tao, 2010; Candes & Plan, 2010; Recht, 2011), max-margin matrix factorization (Srebro et al., 2004), spectral algorithms (Keshavan et al., 2010a;b), and alternating minimization (Jain et al., 2013). These works made stringent matrix incoherence assumptions to avoid spiky matrices. These assumptions have been made less stringent in more recent results (Negahban & Wainwright, 2012), which moreover extend the guarantees to approximately low rank matrices.

Such (approximate) low rank structure is one instance of the general structural constraints which are now understood to be necessary for consistent statistical estimation under high dimensional settings (with a very large number of parameters and very few observations). Note that the high dimensional matrix completion problem is particularly ill-posed, since the measurements are typically both very local (e.g. individual matrix entries), and partial (e.g. covering a decaying fraction of entries of the entire matrix). However, the specific (approximately) low rank structural constraint imposed in the past work on matrix completion does not capture the rich variety of other qualitatively different structural constraints such as row sparseness, column sparseness, or a superposition structure of low rank plus elementwise sparseness, among others. For instance, in the classical introductory survey on matrix completion (Laurent, 2009), the author discusses structural constraints of a contraction matrix and a Euclidean distance matrix. Thus, the third question we ask in this paper is whether we can generalize the recent line of work on low rank matrix completion to the more general structurally constrained case.

In this paper, we answer all of the three questions above in the affirmative, and provide a vastly unified framework for generalized matrix completion.
We address the first two questions by considering a general matrix completion setting wherein observed matrix entries are sampled from any member of a rich family of natural exponential family distributions. Note that this family of distributions encompasses a wide variety of popular distributions including Gaussian, Poisson, binomial, negative binomial, Bernoulli, etc. Moreover, the choice of the exponential family distribution can be made depending on the form of the data. For instance, thin-tailed continuous data is typically modeled using the Gaussian distribution; count data is modeled through an appropriate distribution over integers (Poisson, binomial, etc.), binary data through Bernoulli, categorical discrete data through multinomial, etc. We address the last question by considering general structural constraints upon the underlying matrix, as captured by a general regularization function $R(\cdot)$. Our general matrix completion setting thus captures heterogeneous noise channels, for heterogeneous data types, and heterogeneous structural constraints.

In a key contribution, we propose a simple regularized convex M-estimator for recovering the structurally constrained underlying matrix in this general setting, and moreover provide a unified and novel statistical analysis for our general matrix completion problem. Following a standard approach (Negahban, 2012), we (a) first show that the negative log-likelihood of the subset of observed entries satisfies a form of Restricted Strong Convexity (RSC) (Definition 4); and (b) show that, under this RSC condition, our proposed M-estimator satisfies strong statistical guarantees. We note that proving these individual components for our general matrix completion problem under general structural constraints required a fairly delicate and novel analysis, particularly the first component of showing the RSC condition, which we believe would be of independent interest. A key corollary of our general framework is matrix completion under sub-Gaussian samples and low rank constraints, where we show that our theorem recovers results comparable to the existing literature (Candes & Plan, 2010; Keshavan et al., 2010b; Negahban & Wainwright, 2012). Finally, we corroborate our theoretical findings via simulated experiments.

1.1. Notations and Preliminaries

In this subsection we describe the notations and definitions frequently used throughout the paper. Matrices are denoted by capital letters, $X$, $\Theta$, $M$, etc. For a matrix $M$, $M_j$ and $M^{(i)}$ are the $j$-th column and $i$-th row of $M$, respectively, and $M_{ij}$ denotes the $(i,j)$-th entry of $M$. The transpose, trace, and rank of a matrix $M$ are denoted by $M^\top$, $\mathrm{tr}(M)$, and $\mathrm{rk}(M)$, respectively. The inner product between two matrices is given by $\langle X, Y\rangle = \mathrm{tr}(X^\top Y) = \sum_{(i,j)} X_{ij}Y_{ij}$. For a matrix $M \in \mathbb{R}^{m\times n}$ of rank $r$, with singular values $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$, commonly used matrix norms include the nuclear norm $\|M\|_* = \sum_i \sigma_i$, the spectral norm $\|M\|_2 = \sigma_1$, the Frobenius norm $\|M\|_F = \sqrt{\sum_i \sigma_i^2}$, and the maximum norm $\|M\|_{\max} = \max_{(i,j)} |M_{ij}|$. Given any matrix norm $\|\cdot\|$, the dual norm $\|\cdot\|^*$ is given by $\|X\|^* = \sup_{\|Y\|\le 1}\langle X, Y\rangle$.

Definition 1 (Natural Exponential Family). A distribution of a random variable $X$, in a normed vector space, is said to belong to the natural exponential family if its probability density function, characterized by the parameter $\Theta$ in the dual vector space, is given by
$$P(X\mid\Theta) = h(X)\exp\big(\langle X, \Theta\rangle - G(\Theta)\big),$$
where $G(\Theta) = \log \int_X h(X)\, e^{\langle X, \Theta\rangle}\, dX$, called the log-partition function, is strictly convex and analytic.
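As a concrete illustration of Definition 1 (a small sketch of our own, not from the original text), the Poisson distribution is a natural exponential family member with base measure $h(x) = 1/x!$, natural parameter $\theta = \log(\text{rate})$, and log-partition function $G(\theta) = e^\theta$; the snippet below checks that the two parameterizations of the pmf agree.

```python
# Numerical illustration (ours) of Definition 1: the Poisson pmf written in
# natural exponential family form, P(x | theta) = h(x) exp(x*theta - G(theta)),
# with h(x) = 1/x!, theta = log(rate), and G(theta) = exp(theta).
import math

def poisson_pmf_standard(x, rate):
    return rate**x * math.exp(-rate) / math.factorial(x)

def poisson_pmf_expfam(x, theta):
    h = 1.0 / math.factorial(x)          # base measure h(x)
    G = math.exp(theta)                  # log-partition function G(theta)
    return h * math.exp(x * theta - G)   # h(x) exp(<x, theta> - G(theta))

theta = 0.4                              # corresponds to rate = exp(0.4)
for x in range(6):
    assert math.isclose(poisson_pmf_expfam(x, theta),
                        poisson_pmf_standard(x, math.exp(theta)))
```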
Definition 2 (Bregman Divergence). Let $\phi : \mathrm{dom}(\phi) \to \mathbb{R}$ be a strictly convex function differentiable in the relative interior of $\mathrm{dom}(\phi)$. The Bregman divergence (associated with $\phi$) between $x \in \mathrm{dom}(\phi)$ and $y \in \mathrm{ri}(\mathrm{dom}(\phi))$ is defined as
$$B_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y\rangle.$$

Definition 3 (Subspace compatibility constants). Given a matrix norm $R(\cdot)$, we define the following maximum and minimum subspace compatibility constants of $R(\cdot)$ w.r.t. the subspace $\mathcal{M}$:
$$\Psi(\mathcal{M}; R) = \sup_{\Theta \in \mathcal{M}\setminus\{0\}} \frac{R(\Theta)}{\|\Theta\|_F}, \qquad \Psi_{\min}(R) = \inf_{\Theta \neq 0} \frac{R(\Theta)}{\|\Theta\|_F}.$$
Thus, for $\Theta \in \mathcal{M}$, $\Psi_{\min}(R)\|\Theta\|_F \le R(\Theta) \le \Psi(\mathcal{M}; R)\|\Theta\|_F$.

Definition 4 (Restricted Strong Convexity). A loss function $\mathcal{L}$ is said to satisfy Restricted Strong Convexity with respect to a subspace $\mathcal{S}$ if, for some $\kappa_{\mathcal{L}} > 0$,
$$\mathcal{L}(\Theta + \Delta) - \mathcal{L}(\Theta) - \langle \nabla\mathcal{L}(\Theta), \Delta\rangle \ge \frac{\kappa_{\mathcal{L}}}{2}\|\Delta\|_F^2, \qquad \forall\, \Delta \in \mathcal{S}.$$

Definition 5 (Sub-Gaussian Distributions). A random variable $X$ is said to have a sub-Gaussian distribution with parameter $b$ if, $\forall\, s > 0$, the distribution satisfies $\mathbb{E}[e^{sX}] \le e^{s^2 b^2/2}$. Further, if $X$ is sub-Gaussian with parameter $b$, then $\mathbb{E}[X] = 0$ and $\mathrm{Var}(X) \le b^2$ (see Vershynin (2010)).

2. Exponential Family Matrix Completion

Denote the underlying target matrix by $\Theta^* \in \mathbb{R}^{m\times n}$. We then assume that individual entries $\Theta^*_{ij}$ are observed indirectly via a noisy channel: specifically, via a sample drawn from the corresponding member of the natural exponential family (see Definition 1):
$$P(X_{ij}\mid\Theta^*_{ij}) = h(X_{ij})\exp\big(X_{ij}\Theta^*_{ij} - G(\Theta^*_{ij})\big), \qquad (1)$$
where $G : \mathbb{R} \to \mathbb{R}$ is a strictly convex, analytic function called the log-partition function. Consider the random matrix $X \in \mathbb{R}^{m\times n}$, where each entry $X_{ij}$ is drawn independently from the corresponding distribution in (1); it can be seen that
$$P(X\mid\Theta^*) = \prod_{ij} h(X_{ij})\exp\big(X_{ij}\Theta^*_{ij} - G(\Theta^*_{ij})\big) = h(X)\exp\big(\langle X, \Theta^*\rangle - G(\Theta^*)\big), \qquad (2)$$
where we overload the notation to denote $G : \mathbb{R}^{m\times n} \to \mathbb{R}$ as $G(\Theta) = \sum_{ij} G(\Theta_{ij})$, and the base measure $h(X)$ as $h(X) = \prod_{ij} h(X_{ij})$.

Uniformly Sampled Observations: In a fully observed setting, we would observe all the entries of the observation matrix $X \in \mathbb{R}^{m\times n}$. However, we consider a partially observed setting, where we observe entries over a subset of indices $\Omega \subseteq [m]\times[n]$. We assume a uniform sampling model, so that
$$\forall\, (i,j) \in \Omega, \quad i \sim \mathrm{uniform}([m]), \; j \sim \mathrm{uniform}([n]). \qquad (3)$$
Given $\Omega$, we define the matrix $P_\Omega(X)$ by $[P_\Omega(X)]_{ij} = X_{ij}$ if $(i,j) \in \Omega$, and $0$ otherwise.

The matrix completion task can then be stated as the estimation of $\Theta^*$ from the partially observed matrix $P_\Omega(X)$, where $X \sim P(X\mid\Theta^*)$. As noted earlier, this problem is ill-posed in general. However, as we will show, under structural constraints imposed on the parameter matrix $\Theta^*$, we are able to design an M-estimator with a near optimal deviation from $\Theta^*$.

2.1. Applications

Gaussian (fixed $\sigma^2$) is typically used to model continuous data, $x \in \mathbb{R}$, such as measurements with additive errors and affinity datasets. Here, $G(\theta) = \frac{1}{2}\sigma^2\theta^2$.

Bernoulli is a popular distribution of choice to model binary data, $x \in \{0, 1\}$, with $G(\theta) = \log(1 + e^\theta)$. Some examples of data suitable for the Bernoulli model include social networks, gene-protein interactions, etc.

Binomial (fixed $N$) is used to model the number of successes in $N$ trials. Here, $x \in \{0, 1, 2, \ldots, N\}$, and $G(\theta) = N\log(1 + e^\theta)$. Some applications include predicting success/failure rates, survey outcomes, etc.

Poisson is used to model count data $x \in \{0, 1, 2, \ldots\}$, such as arrival times, events per unit time, and click-throughs, among others. Here, $G(\theta) = e^\theta$.

Exponential is often used to model positive valued continuous data $x \in \mathbb{R}_+$, especially inter-arrival times between events. Here, $G(\theta) = -\log(-\theta)$.
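To make the observation model (1)-(3) concrete, the following NumPy sketch (our own illustration; the function names and the particular distributions shown are our choices) draws each entry $X_{ij}$ from an exponential family member parameterized by $\Theta^*_{ij}$ and forms the partially observed matrix $P_\Omega(X)$ under the uniform sampling model.

```python
# A minimal NumPy sketch of the observation model (1)-(3): entries of X are drawn
# independently from an exponential family member parameterized by Theta*, and a
# uniformly sampled index set Omega masks the observations.  Names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sample_observations(theta_star, dist="bernoulli"):
    """Draw X_ij ~ P(. | theta*_ij) entrywise, as in (1)-(2)."""
    if dist == "gaussian":      # mean g(theta) = sigma^2 * theta; take sigma = 1
        return rng.normal(loc=theta_star, scale=1.0)
    if dist == "bernoulli":     # mean g(theta) = 1 / (1 + exp(-theta))
        return rng.binomial(1, 1.0 / (1.0 + np.exp(-theta_star))).astype(float)
    if dist == "poisson":       # mean g(theta) = exp(theta)
        return rng.poisson(np.exp(theta_star)).astype(float)
    raise ValueError(dist)

def uniform_mask(m, n, num_samples):
    """Sample Omega with i ~ uniform([m]), j ~ uniform([n]) as in (3), with replacement."""
    rows = rng.integers(0, m, size=num_samples)
    cols = rng.integers(0, n, size=num_samples)
    mask = np.zeros((m, n), dtype=bool)
    mask[rows, cols] = True
    return mask

m, n = 30, 40
theta_star = rng.normal(scale=0.5, size=(m, n))
X = sample_observations(theta_star, dist="poisson")
omega = uniform_mask(m, n, num_samples=4 * n * int(np.log(n)))
P_omega_X = np.where(omega, X, 0.0)     # the partially observed matrix P_Omega(X)
```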
2.2. Log-likelihood

Denote the gradient map $g(\Theta) := \nabla G(\Theta) \in \mathbb{R}^{m\times n}$, where $[g(\Theta)]_{ij} = \frac{\partial G(\Theta)}{\partial \Theta_{ij}} = G'(\Theta_{ij})$. It can then be verified that the mean and variance of the distribution $P(X\mid\Theta^*)$ are given by $\mathbb{E}[X] = g(\Theta^*)$ and $\mathrm{Var}(X) = \nabla^2 G(\Theta^*)$, respectively. The Fenchel conjugate of the log-partition function $G$ is denoted by $F(X) := \sup_\Theta\, \langle X, \Theta\rangle - G(\Theta)$.

A useful consequence of the exponential family is that the negative log-likelihood is convex in the natural parameters $\Theta$, and moreover has a bijection with a large class of Bregman divergences (Definition 2). The following relationship was first noted by Forster & Warmuth (2002), and later established by Banerjee et al. (2005, Theorem 4):
$$-\log P(X\mid\Theta) = B_F\big(X, g(\Theta)\big) + c(X), \qquad \forall\, X \in \mathrm{dom}(F), \qquad (4)$$
where $c(X)$ collects terms that do not depend on $\Theta$.

2.3. Discussion and directions for future work

We consider the standard matrix completion setting where the distribution of the observation matrix $X$ in (2) corresponds to its entries $X_{ij}$ being drawn independently of the other entries. Further, the probability of observing a specific entry $X_{ij}$ under uniform sampling is independent of the noise channel or the distribution $P(X_{ij}\mid\Theta^*_{ij})$. However, in some applications, it might be beneficial to have a sampling scheme involving dependencies; for instance, when a user gives a movie a bad rating, we might want to vary the sampling scheme to sample an entirely different region of the matrix. It would be interesting to extend the analysis of this paper to such a dependent sampling setting.

The form of the observation random matrix distribution in (2), given the individual observations in (1), can be seen to have connotations of multi-task learning: here recovering each individual matrix entry $\Theta^*_{ij}$, together with its corresponding noise model, forms a single task, and all these tasks can be performed jointly given the shared structural constraint on $\Theta^*$. It would be interesting to generalize this to form a more general statistical framework for partial multi-task learning.

We use the general class of exponential family distributions as the underlying probabilistic model capturing the measurement uncertainties. The richness of the class of exponential family distributions has been used in other settings to provide general statistical frameworks. Kakade et al. (2010) provide a generalization of the compressed sensing problem to general exponential family distributions. Note however that results from compressed sensing cannot be immediately extended to the matrix completion case, since the sampling operator $P_\Omega$ does not satisfy the typical assumptions (restricted isometry or restricted eigenvalues) made in such settings; see (Candes & Recht, 2009) for additional discussion. There have been extensions of classical probabilistic PCA (Tipping & Bishop, 1999) from Gaussian noise models to exponential family distributions (Collins et al., 2001; Mohamed et al., 2008; Gordon, 2002). There have also been recent extensions of probabilistic graphical model classes, beyond Gaussian and Ising models, to multivariate extensions of exponential family distributions (Yang et al., 2012; 2013). More complicated probabilistic models have also been proposed in the context of collaborative filtering (Mnih & Salakhutdinov, 2007; Salakhutdinov & Mnih, 2008), but these typically involve non-convex optimization, and it is difficult to extend the rigorous statistical analyses of the form in this paper (and in the matrix completion literature) to these models.
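Before moving on to the main results, the identity (4) from Section 2.2 can be verified numerically in the Bernoulli case, where $h(x) = 1$ and $F(x) = 0$ for $x \in \{0, 1\}$, so the $\Theta$-independent term $c(X)$ vanishes; the helper functions below are an illustrative sketch of our own.

```python
# Numerical check of (4) for the Bernoulli case: -log P(x|theta) equals
# B_F(x, g(theta)) exactly, because h(x) = 1 and F(x) = 0 for x in {0, 1}.
import numpy as np

def G(theta):                 # Bernoulli log-partition function
    return np.log1p(np.exp(theta))

def g(theta):                 # mean map g(theta) = G'(theta)
    return 1.0 / (1.0 + np.exp(-theta))

def F(mu):                    # Fenchel conjugate of G (negative entropy); F(0) = F(1) = 0
    mu = np.clip(mu, 1e-12, 1 - 1e-12)
    return mu * np.log(mu) + (1 - mu) * np.log(1 - mu)

def bregman_F(x, mu):         # B_F(x, mu) = F(x) - F(mu) - F'(mu)(x - mu)
    dF = np.log(mu) - np.log(1 - mu)
    return F(x) - F(mu) - dF * (x - mu)

theta = 0.7
for x in (0.0, 1.0):
    nll = G(theta) - x * theta              # -log P(x | theta), since h(x) = 1
    assert np.isclose(nll, bregman_F(x, g(theta)))
```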
3. Main Result and Consequences

As noted in the introduction, we consider the matrix completion setting with general structural constraints on the underlying target matrix $\Theta^*$. To formalize the notion of such structural constraints, we follow (Negahban, 2012) and assume that $\Theta^*$ satisfies $\Theta^* \in \mathcal{M} \subseteq \bar{\mathcal{M}} \subseteq \mathbb{R}^{m\times n}$, for some model subspace $\mathcal{M}$ which contains parameter matrices that are structured similarly to the target (under the corresponding structural constraints such as low rank, low rank plus sparsity, etc.); we also allow the flexibility of working with a superset $\bar{\mathcal{M}}$ of the model subspace that is potentially easier to analyze. Moreover, we use their definition of a decomposable norm regularization function, $R(\cdot) : \mathbb{R}^{m\times n} \to \mathbb{R}_+$, which suitably captures these structural constraints:

A 1 (Decomposable Norm Regularizer). We assume that $R(\cdot)$ is a matrix norm, and is decomposable over $(\mathcal{M}, \bar{\mathcal{M}}^\perp)$, i.e. if $X \in \mathcal{M}$ and $Y \in \bar{\mathcal{M}}^\perp$, then $R(X + Y) = R(X) + R(Y)$.

We provide some examples of such decomposable regularizers and structural constraint subspaces, and refer to (Negahban, 2012) for more examples and discussion.

Example 1 (Low rank). This is a common structure assumed in numerous matrix estimation problems, particularly those in collaborative filtering, PCA, spectral clustering, etc. The corresponding structural constraint subspaces $(\mathcal{M}, \bar{\mathcal{M}}^\perp)$ in this case correspond to a linear span of specific rank-one matrices; we discuss these in further detail in Section 3.3, where we derive a corollary of our general theorem for the specific case of recovery guarantees for low rank constrained matrix completion. The nuclear norm $R(\Theta) = \|\Theta\|_* = \sum_k \sigma_k$ has been shown to be decomposable with respect to these constraint subspaces (Fazel et al., 2001).

Example 2 (Block sparsity). Another important structural constraint for a matrix is block sparsity, where each row is either all zeros or mostly non-zero, and the number of non-zero rows is small. The structural constraint subspaces in this case correspond to a linear span of specific Frobenius-norm-one matrices that are non-zero in a small subset of the rows (dependent on $\Theta^*$); it has been shown that $\ell_1/\ell_q$ ($q > 1$) norms (Negahban & Wainwright, 2008; Obozinski et al., 2011) are decomposable with respect to such structural constraint subspaces. Recalling that $\Theta^{(i)}$ is the $i$-th row of $\Theta$, the $\ell_1/\ell_q$ norm is defined as
$$\sum_{i=1}^m \|\Theta^{(i)}\|_q = \sum_{i=1}^m \Big(\sum_{j=1}^n |\Theta_{ij}|^q\Big)^{1/q}.$$

Example 3 (Low rank plus sparse). This structure is often used to model low rank matrices which are corrupted by a sparse outlier noise matrix. The structural constraint subspaces corresponding to these consist of the linear span of a weighted sum of specific rank-one matrices and sparse matrices with non-zero components in specified positions; an appropriate regularization function decomposable with respect to such structural constraints is the infimal convolution of the weighted nuclear norm with the weighted elementwise $\ell_1$ norm, $\|M\|_{1,1} = \sum_{ij}|M_{ij}|$ (Candes et al., 2011; Yang & Ravikumar, 2013):
$$R(\Theta) = \inf\{\lambda_1\|S\|_{1,1} + \lambda_2\|L\|_* : \Theta = S + L\}.$$

The second assumption we make is on the curvature of the log-partition function. This is required to establish a form of RSC (Definition 4) for the loss function.

A 2. The second derivative of the log-partition function $G : \mathbb{R} \to \mathbb{R}$ has at most an exponential decay, i.e.
$$\nabla^2 G(u) \ge e^{-\eta|u|}, \quad \forall\, u \in \mathbb{R}, \text{ for some } \eta > 0.$$
It can be verified that commonly used members of the natural exponential family obey this assumption.
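The decomposability property in A1 is easy to sanity-check numerically for Example 1; the construction below (our own illustration, with hypothetical variable names) builds $X$ with row and column spaces inside fixed $r$-dimensional subspaces and $Y$ supported on their orthogonal complements, confirms that the nuclear norm splits additively over the pair, and also evaluates the $\ell_1/\ell_2$ norm of Example 2.

```python
# A small NumPy check (our construction, not from the paper) of the decomposability
# in A1 for the nuclear norm: if X has row/column spaces inside (V*, U*) and Y has
# them inside the orthogonal complements, then ||X + Y||_* = ||X||_* + ||Y||_*.
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 20, 30, 3

# Fix orthonormal bases U* (first r columns) and V*, and their complements.
U_full, _ = np.linalg.qr(rng.normal(size=(m, m)))
V_full, _ = np.linalg.qr(rng.normal(size=(n, n)))
U_star, U_perp = U_full[:, :r], U_full[:, r:]
V_star, V_perp = V_full[:, :r], V_full[:, r:]

X = U_star @ rng.normal(size=(r, r)) @ V_star.T          # element of the model subspace M
Y = U_perp @ rng.normal(size=(m - r, n - r)) @ V_perp.T  # element of M-bar-perp

def nuc(A):            # nuclear norm (Example 1)
    return np.linalg.norm(A, ord="nuc")

def l1_l2(A):          # l1/l2 block-sparsity norm: sum of row-wise Euclidean norms (Example 2)
    return np.linalg.norm(A, axis=1).sum()

assert np.isclose(nuc(X + Y), nuc(X) + nuc(Y))
print(nuc(X), l1_l2(X))
```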
Finally, we make an assumption to avoid spiky target matrices. As Candes & Recht (2009) show with numerous examples, low rank and presumably other such structural constraints as above are, by themselves, not sufficient for accurate recovery, in part due to the infeasibility of recovering overly spiky matrices with very few large entries. Early work (Candes & Plan, 2010; Keshavan et al., 2010a;b) assumed stringent matrix incoherence conditions to preclude such matrices, while more recent work (Davenport et al., 2012; Negahban & Wainwright, 2012) relaxes these assumptions to restricting the spikiness ratio, defined as
$$\alpha_{\mathrm{sp}}(\Theta) = \frac{\sqrt{mn}\,\|\Theta\|_{\max}}{\|\Theta\|_F}.$$

A 3. There exists a known $\alpha^* > 0$ such that $\|\Theta^*\|_{\max} = \frac{\alpha_{\mathrm{sp}}(\Theta^*)}{\sqrt{mn}}\|\Theta^*\|_F \le \frac{\alpha^*}{\sqrt{mn}}$.

3.1. M-estimator for Generalized Matrix Completion

We propose a regularized M-estimate as our candidate parameter matrix $\hat\Theta$. The norm regularizer $R(\cdot)$ used is a convex surrogate for the structural constraints, and is assumed to satisfy A1. For a suitable $\lambda > 0$,
$$\hat\Theta = \mathop{\mathrm{argmin}}_{\|\Theta\|_{\max}\le \frac{\alpha^*}{\sqrt{mn}}} \Big[-\frac{mn}{|\Omega|}\sum_{(i,j)\in\Omega}\log P(X_{ij}\mid\Theta_{ij})\Big] + \lambda R(\Theta) = \mathop{\mathrm{argmin}}_{\|\Theta\|_{\max}\le \frac{\alpha^*}{\sqrt{mn}}} \Big[\frac{mn}{|\Omega|}\sum_{(i,j)\in\Omega}\big(G(\Theta_{ij}) - X_{ij}\Theta_{ij}\big)\Big] + \lambda R(\Theta). \qquad (6)$$
The above optimization problem is a convex program, and can be solved by any off-the-shelf convex solver.

3.2. Main Results

Without loss of generality, suppose that $m \le n$. Let $R^*(\cdot) = \sup_{R(X)\le 1}\langle X, \cdot\rangle$ be the dual norm of the regularizer $R(\cdot)$. Further, let $\Psi(\bar{\mathcal{M}})$ and $\Psi_{\min}$ be the maximum and minimum subspace compatibility constants of $R$ w.r.t. the subspace $\bar{\mathcal{M}}$ (Definition 3). We next define the following quantity:
$$\kappa_R(n, |\Omega|) := \mathbb{E}\Big[R^*\Big(\frac{\sqrt{mn}}{|\Omega|}\sum_{(i,j)\in\Omega}\epsilon_{ij}\, e_i e_j^\top\Big)\Big],$$
where the expectation is over the random sampling index set $\Omega$, and over the Rademacher sequence $\{\epsilon_{ij} : (i,j)\in\Omega\}$; here $\{e_i \in \mathbb{R}^m\}$, $\{e_j \in \mathbb{R}^n\}$ are the standard basis vectors. This quantity $\kappa_R(n, |\Omega|)$ captures the interaction between the sampling scheme and the structural constraint as captured by the regularizer (specifically its dual $R^*$). Note that it depends only on $n$ ($n \ge m$), and on the size $|\Omega|$ of $\Omega$.

Theorem 1. Let $\hat\Theta$ be the estimate from (6) with $\lambda \ge \frac{2mn}{|\Omega|}R^*\big(P_\Omega(X - g(\Theta^*))\big)$. Under the assumptions A1-A3, if $|\Omega| = \Omega(\Psi^2(\bar{\mathcal{M}})\, n\log n)$, then, given a constant $c_0$, there exist constants $C, C_1, C_2$, and $K_1$ such that, with probability $> 1 - C_1 e^{-C_2\Psi_{\min} n\log n}$,
$$\|\hat\Theta - \Theta^*\|_F^2 \le C\max\{\alpha^{*2}, 1\}\,\Psi^2(\bar{\mathcal{M}})\,\max\Big\{\frac{\lambda^2}{\kappa_L^2},\; \frac{c_0^2\, n\log n}{|\Omega|}\Big\},$$
provided $\kappa_L := e^{-2\eta\alpha^*}\Big(1 - \frac{64}{c_0}\sqrt{\frac{|\Omega|\,\kappa_R^2(n,|\Omega|)}{n\log n}}\Big) > 0$.

In the above theorem, $\eta$ and $\alpha^* \ge \alpha_{\mathrm{sp}}(\Theta^*)\|\Theta^*\|_F$ are the constants from Assumptions A2 and A3, respectively. (We suppress the dependence of $\Psi$ and $\Psi_{\min}$ on $R$ in our notation to avoid notational clutter. Here $f(n) = \Omega(g(n))$ if $f(n) > k\, g(n)$, $\forall\, n > \hat n$, for some constants $k$ and $\hat n$.)

Our proof uses elements from Negahban (2012), as well as from Negahban & Wainwright (2012), where they analyze the case of low rank structure and additive noise, and establish a form of restricted strong convexity (RSC) for the squared loss over a subset of matrix entries (which closely relates to the special case where the exponential family distribution assumed in (2) is Gaussian). However, showing such an RSC condition for our general loss function over a subset of structured matrix entries involved some delicate arguments. Further, we provide a much simpler proof that moreover only requires a low spikiness condition rather than a multiplicative spikiness and structural constraint. Moreover, we are able to provide an RSC condition broadly for general structures, and for the negative log-likelihood loss associated with a general exponential family distribution.
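The convex program (6) is directly amenable to first-order methods. The sketch below is our own, assuming Bernoulli observations and the nuclear norm as $R(\cdot)$; the step size, iteration count, and the entrywise clipping used for the max-norm constraint (applied after singular-value soft-thresholding, which is a heuristic rather than the exact joint proximal map) are illustrative choices, not prescriptions from the text.

```python
# A hedged proximal-gradient sketch for the estimator (6) with nuclear-norm
# regularization and Bernoulli observations.  Names and tuning choices are ours.
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def fit_bernoulli(X, omega, lam, alpha_star, n_iters=500):
    m, n = X.shape
    scale = (m * n) / omega.sum()            # the mn/|Omega| factor in (6)
    box = alpha_star / np.sqrt(m * n)        # max-norm bound from A3
    step = 1.0 / scale                       # conservative step; G'' <= 1/4 for Bernoulli
    theta = np.zeros((m, n))
    for _ in range(n_iters):
        mean = 1.0 / (1.0 + np.exp(-theta))  # g(theta) for Bernoulli
        grad = scale * np.where(omega, mean - X, 0.0)
        theta = svt(theta - step * grad, step * lam)
        theta = np.clip(theta, -box, box)    # heuristic projection onto the max-norm ball
    return theta
```

For other exponential family members, one only swaps the mean map $g(\cdot)$ (and, if a different regularizer is used, its proximal operator).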
3.3. Corollary

An important special case of the problem is when the parameter matrix $\Theta^*$ is assumed to be of low rank $r \ll m$. The most commonly used convex regularizer to enforce low rank is the nuclear norm. The intuition behind the low rank assumption on the parameter matrix is as follows: the parameter $\Theta^*_{ij}$ can be thought of as the true measure of affinity between the entities corresponding to row $i$ and column $j$, respectively, and the data $X_{ij}$ is a sample from a noisy channel parametrized by $\Theta^*_{ij}$. It is hypothesized that the $\{\Theta^*_{ij}\}$ are obtained from a weighted interaction of a small number of latent variables as $\Theta^*_{ij} = \sum_{k=1}^r \sigma_k u_{ik} v_{jk}$.

Let $\{u_k \in \mathbb{R}^m\}$ and $\{v_k \in \mathbb{R}^n\}$, $k \in [r]$, be the left and right singular vectors, respectively, of $\Theta^*$. Let the column and row spans of $\Theta^*$ be $U^* = \mathrm{col}(\Theta^*) = \mathrm{span}\{u_k\}$ and $V^* = \mathrm{row}(\Theta^*) = \mathrm{span}\{v_k\}$, respectively. Define:
$$\mathcal{M} := \{\Theta : \mathrm{row}(\Theta) \subseteq V^*,\ \mathrm{col}(\Theta) \subseteq U^*\}, \quad \bar{\mathcal{M}}^\perp := \{\Theta : \mathrm{row}(\Theta) \subseteq V^{*\perp},\ \mathrm{col}(\Theta) \subseteq U^{*\perp}\}. \qquad (7)$$
It can be verified that $\mathcal{M} \subseteq \bar{\mathcal{M}}$; however, $\bar{\mathcal{M}} \neq \mathcal{M}$.

Corollary 1. Let $\Theta^*$ be a low rank matrix of rank at most $r \ll m$. If, further, $\forall (i,j)$, $(X_{ij} - g(\Theta^*_{ij}))$ is sub-Gaussian (Definition 5) with parameter $b$, and $|\Omega| = \Omega(r\, n\log n)$, then using $R(\cdot) = \|\cdot\|_*$ and $\frac{\lambda}{2} := C\sqrt{mn}\, b\sqrt{\frac{n\log n}{|\Omega|}}$ in (6), w.p. $> 1 - C_1' e^{-C_2'\log n}$,
$$\frac{1}{mn}\|\hat\Theta - \Theta^*\|_F^2 \le C\max\{\alpha^{*2}, 1\}\, b^2\, \frac{r\, n\log n}{\kappa_L'^2\, |\Omega|},$$
where $\kappa_L' = K_1 e^{-2\eta\alpha^*}$.

Remark 1: Note that the above results hold for the minimizer $\hat\Theta$ of the convex program in (6) optimized with any $\alpha^* \ge \alpha_{\mathrm{sp}}(\Theta^*)\|\Theta^*\|_F$; in particular they hold with $\alpha^* = \alpha_{\mathrm{sp}}(\Theta^*)\|\Theta^*\|_F$, where $1 \le \alpha_{\mathrm{sp}}(\Theta^*) \le \sqrt{mn}$. While in practice $\alpha^*$ is chosen through cross validation, the theoretical bound in Corollary 1 can be tightened to the following (if $\|\Theta^*\|_F \ge 1$):
$$\frac{1}{mn}\,\frac{\|\hat\Theta - \Theta^*\|_F^2}{\|\Theta^*\|_F^2} \le C\,\alpha_{\mathrm{sp}}^2(\Theta^*)\, b^2\, \frac{r\, n\log n}{\kappa_L'^2\, |\Omega|}.$$
A similar bound can be obtained for Theorem 1.

Remark 2: The parameter $b^2$ is a measure of the noise per entry in the system: $\forall\, ij$, $\mathrm{Var}(X_{ij} - g(\Theta^*_{ij})) \le b^2$.

4. Proofs

In this section, we provide key steps in the proofs of the main results (Sec. 3.2-3.3).

4.1. Proof of Theorem 1

The proof of our main theorem involves two key steps: we first show that, under assumptions A1-A3, RSC of the form in Definition 4 holds for the loss function in (6) over a large subset of the solution space; when the RSC condition holds, the result follows from a few simple calculations, and we handle the case where RSC does not hold separately. Let $\hat\Delta = \hat\Theta - \Theta^*$. We state two results of interest.

Lemma 1. We define the following subset:
$$\mathcal{V} = \{\Theta \in \mathbb{R}^{m\times n} : R(\Theta_{\bar{\mathcal{M}}^\perp}) \le 3R(\Theta_{\bar{\mathcal{M}}})\},$$
where recall that $\Theta_{\bar{\mathcal{M}}}$ is the projection of $\Theta$ onto the subspace $\bar{\mathcal{M}}$. If $\hat\Theta$ is the minimizer of (6), and $\lambda \ge \frac{2mn}{|\Omega|}R^*\big(P_\Omega(X - g(\Theta^*))\big)$, then $\hat\Delta = \hat\Theta - \Theta^* \in \mathcal{V}$.

The proof follows from Lemma 1 of Negahban (2012).

Lemma 2. Let $\hat\Theta$ be the minimizer of (6). If $\lambda \ge \frac{2mn}{|\Omega|}R^*\big(P_\Omega(X - g(\Theta^*))\big)$, then
$$\frac{mn}{|\Omega|}\sum_{(i,j)\in\Omega} B_G(\hat\Theta_{ij}, \Theta^*_{ij}) \le 3\lambda\,\Psi(\bar{\mathcal{M}})\,\|\hat\Delta\|_F.$$
The proof is provided in Appendix A.2.

Recall the notation $\alpha_{\mathrm{sp}}(\Delta) = \sqrt{mn}\,\|\Delta\|_{\max}/\|\Delta\|_F$. We now consider two cases, depending on whether the following condition holds for some constant $c_0 > 0$:
$$\alpha_{\mathrm{sp}}(\hat\Delta) \le \frac{1}{c_0\Psi(\bar{\mathcal{M}})}\sqrt{\frac{|\Omega|}{n\log n}}. \qquad (9)$$

Case 1: Suppose the condition in (9) does not hold, so that $\alpha_{\mathrm{sp}}(\hat\Delta) > \frac{1}{c_0\Psi(\bar{\mathcal{M}})}\sqrt{\frac{|\Omega|}{n\log n}}$. From the constraints of the optimization problem (6), we have that $\|\hat\Delta\|_{\max} \le \|\hat\Theta\|_{\max} + \|\Theta^*\|_{\max} \le 2\alpha^*/\sqrt{mn}$. Thus,
$$\|\hat\Delta\|_F = \frac{\sqrt{mn}\,\|\hat\Delta\|_{\max}}{\alpha_{\mathrm{sp}}(\hat\Delta)} \le 2c_0\alpha^*\sqrt{\frac{\Psi^2(\bar{\mathcal{M}})\, n\log n}{|\Omega|}}. \qquad (10)$$

Case 2: Suppose the condition in (9) does hold. Then, the following theorem shows that an RSC condition of the form in Definition 4 holds.

Theorem 2 (Restricted Strong Convexity). If, for a given $c_0$, $\alpha_{\mathrm{sp}}(\hat\Delta) \le \frac{1}{c_0\Psi(\bar{\mathcal{M}})}\sqrt{\frac{|\Omega|}{n\log n}}$, then, under the assumptions and notation of Theorem 1, w.p. $> 1 - C_1 e^{-C_2\Psi_{\min} n\log n}$,
$$\frac{mn}{|\Omega|}\sum_{(i,j)\in\Omega} B_G(\hat\Theta_{ij}, \Theta^*_{ij}) \ge \kappa_L\,\|\hat\Delta\|_F^2,$$
where $\kappa_L = e^{-2\eta\alpha^*}\Big(1 - \frac{64}{c_0}\sqrt{\frac{|\Omega|\,\kappa_R^2(n,|\Omega|)}{n\log n}}\Big)$.

As noted earlier, such an RSC result for the special case of squared loss under low rank constraints was shown in Negahban & Wainwright (2012).
However, proving the RSC condition for our general setting required a different and more involved proof technique. We prove this theorem in Section 4.3.

Remaining steps of the proof of Theorem 1: Thus, if $\alpha_{\mathrm{sp}}(\hat\Delta) \le \frac{1}{c_0\Psi(\bar{\mathcal{M}})}\sqrt{\frac{|\Omega|}{n\log n}}$ and $\kappa_L > 0$, from Theorem 2 and Lemma 2, w.h.p.,
$$\kappa_L\|\hat\Delta\|_F^2 \le \frac{mn}{|\Omega|}\sum_{(i,j)\in\Omega} B_G(\hat\Theta_{ij}, \Theta^*_{ij}) \le 3\lambda\,\Psi(\bar{\mathcal{M}})\,\|\hat\Delta\|_F. \qquad (11)$$
From (10) and (11), under the assumptions of Theorem 1, we have that for an appropriate constant $C$, with probability higher than $1 - C_1 e^{-C_2\Psi_{\min} n\log n}$,
$$\|\hat\Delta\|_F^2 \le C\max\{\alpha^{*2}, 1\}\,\Psi^2(\bar{\mathcal{M}})\,\max\Big\{\frac{\lambda^2}{\kappa_L^2},\; \frac{c_0^2\, n\log n}{|\Omega|}\Big\}.$$

4.2. Proof of Corollary 1

From the definition of $\bar{\mathcal{M}}$ in (7), we have $\bar{\mathcal{M}} = \mathrm{span}\{u_i x^\top,\ y v_j^\top : x \in \mathbb{R}^n,\ y \in \mathbb{R}^m\}$. Let $P_{U^*} \in \mathbb{R}^{m\times m}$ and $P_{V^*} \in \mathbb{R}^{n\times n}$ be the projection matrices onto the column and row spaces $(U^*, V^*)$ of $\Theta^*$, respectively. We have, for $X \in \mathbb{R}^{m\times n}$, $X_{\bar{\mathcal{M}}} = P_{U^*}X + XP_{V^*} - P_{U^*}XP_{V^*}$. Also, $\mathrm{rk}(P_{U^*}) = \mathrm{rk}(P_{V^*}) = \mathrm{rk}(\Theta^*) = r$. Thus, any $\Phi \in \bar{\mathcal{M}}$ has $\mathrm{rk}(\Phi) \le 2r$; hence $\Psi(\bar{\mathcal{M}}) = \sup_{\Phi\in\bar{\mathcal{M}}\setminus\{0\}} \|\Phi\|_*/\|\Phi\|_F \le \sqrt{2r}$. Further, $\Psi_{\min} \ge 1$.

Next, we use the following proposition by Negahban & Wainwright (2012) to bound $\kappa_R(n, |\Omega|)$ in Theorem 1.

Lemma 3 (Lemma 6 of Negahban & Wainwright (2012)). If $\Omega \subseteq [m]\times[n]$ is sampled using uniform sampling and $|\Omega| > n\log n$, then for a Rademacher sequence $\{\epsilon_{ij}, (i,j)\in\Omega\}$,
$$\mathbb{E}\Big[\Big\|\frac{1}{|\Omega|}\sum_{(i,j)\in\Omega}\sqrt{mn}\,\epsilon_{ij}\, e_i e_j^\top\Big\|_2\Big] \le 10\sqrt{\frac{n\log n}{|\Omega|}}.$$

Thus, for large enough $c_0 > 640$, using $\kappa_R(n,|\Omega|) \le 10\sqrt{\frac{n\log n}{|\Omega|}}$ in Theorem 2, for some $K_1 > 0$ we have
$$\kappa_L' = K_1 e^{-2\eta\alpha^*}. \qquad (12)$$

Finally, to prove the corollary, we derive a bound on $\|P_\Omega(X - g(\Theta^*))\|_2$ using the Ahlswede-Winter matrix bound (Appendix A.3). Let $\phi(x) = e^{x^2} - 1$, and let $Y^{(ij)} := \sqrt{mn}\,(X_{ij} - g(\Theta^*_{ij}))\, e_i e_j^\top$, so that
$$\frac{\sqrt{mn}}{|\Omega|}\|P_\Omega(X - g(\Theta^*))\|_2 = \Big\|\frac{1}{|\Omega|}\sum_{(i,j)\in\Omega} Y^{(ij)}\Big\|_2.$$
From the equivalence of sub-Gaussian definitions in Lemma 5.5 of Vershynin (2010), $\|X_{ij} - g(\Theta^*_{ij})\|_\phi \le c_0 b$, $\forall\, ij$. Since $Y^{(ij)}$ has a single non-zero element, $\sqrt{mn}\,(X_{ij} - g(\Theta^*_{ij}))$, we have $\|Y^{(ij)}\|_\phi \le c_0\sqrt{mn}\, b$. Further,
$$\mathbb{E}[Y^{(ij)\top}Y^{(ij)}] = \mathbb{E}[mn(X_{ij} - g(\Theta^*_{ij}))^2\, e_j e_j^\top] \overset{(a)}{=} mn\,\mathbb{E}_{(ij)\in\Omega}\big[\mathbb{E}_X[(X_{ij} - g(\Theta^*_{ij}))^2]\, e_j e_j^\top\big] \overset{(b)}{\preceq} mn\, b^2\,\mathbb{E}_{(ij)\in\Omega}[e_j e_j^\top] \overset{(c)}{=} mn\, b^2\,\frac{1}{n}\, I_{n\times n}, \qquad (13)$$
where (a) follows from Fubini's theorem, (b) follows as $(X_{ij} - g(\Theta^*_{ij}))$ is sub-Gaussian, and (c) follows from the uniform sampling model. Similarly, $\mathbb{E}[Y^{(ij)}Y^{(ij)\top}] \preceq mn\, b^2\,\frac{1}{m}\, I_{m\times m}$. Define $\sigma_{ij}^2 := \max\{\|\mathbb{E}[Y^{(ij)\top}Y^{(ij)}]\|_2,\ \|\mathbb{E}[Y^{(ij)}Y^{(ij)\top}]\|_2\}$. In Lemma A.3, using $\sigma_{ij}^2 \le nb^2$, $\sigma^2 := \sum_{(i,j)\in\Omega}\sigma_{ij}^2 = n|\Omega|b^2$, $M = c_0\sqrt{mn}\, b \le c_0 n b$, and $t = |\Omega|\delta$, we have
$$\mathbb{P}\Big[\Big\|\frac{1}{|\Omega|}\sum_{(i,j)\in\Omega} Y^{(ij)}\Big\|_2 \ge \delta\Big] \le 2n\,\max\Big\{e^{-\frac{\delta^2|\Omega|}{4nb^2}},\; e^{-\frac{\delta|\Omega|}{2c_0 nb}}\Big\}.$$
Further, for all $C$, using $\delta = Cb\sqrt{\frac{n\log n}{|\Omega|}}$, there exists a large enough $C_1$ such that, if $|\Omega| > C_1 n\log n$, then
$$\mathbb{P}\Big[\frac{\sqrt{mn}}{|\Omega|}\|P_\Omega(X - g(\Theta^*))\|_2 \ge Cb\sqrt{\frac{n\log n}{|\Omega|}}\Big] \le e^{-C_1'\log n}.$$
Using $\Psi_{\min} \ge 1$, $\kappa_L' = K_1 e^{-2\eta\alpha^*}$ (from (12)), and the above choice of $\lambda$ in Theorem 1 leads to the corollary, since w.h.p. $\frac{\lambda}{2} = C\sqrt{mn}\, b\sqrt{\frac{n\log n}{|\Omega|}} \ge \frac{mn}{|\Omega|}\|P_\Omega(X - g(\Theta^*))\|_2$.

4.3. Proof of Theorem 2

This proof uses symmetrization arguments and contraction (Ledoux & Talagrand (1991), Ch. 4 and 6). We observe that, $\forall\, (i,j)\in\Omega$, $\exists\, v_{ij} \in [0, 1]$ such that
$$B_G(\hat\Theta_{ij}, \Theta^*_{ij}) = G(\hat\Theta_{ij}) - G(\Theta^*_{ij}) - g(\Theta^*_{ij})(\hat\Theta_{ij} - \Theta^*_{ij}) = \nabla^2 G\big((1 - v_{ij})\Theta^*_{ij} + v_{ij}\hat\Theta_{ij}\big)\,\frac{\hat\Delta_{ij}^2}{2} \overset{(a)}{\ge} e^{-\frac{2\eta\alpha^*}{\sqrt{mn}}}\,\frac{\hat\Delta_{ij}^2}{2}, \qquad (15)$$
where (a) holds as $|(1 - v_{ij})\Theta^*_{ij} + v_{ij}\hat\Theta_{ij}| \le \|\Theta^*\|_{\max} + \|\hat\Theta\|_{\max} \le 2\alpha^*/\sqrt{mn}$, and $\nabla^2 G(u) \ge e^{-\eta|u|}$ (A2).

Lemma 4. Under the conditions of Theorem 2, consider the subset
$$\mathcal{E} = \Big\{\Delta \in \mathcal{V} : \alpha_{\mathrm{sp}}(\Delta) \le \frac{1}{c_0\Psi(\bar{\mathcal{M}})}\sqrt{\frac{|\Omega|}{n\log n}},\ \|\Delta\|_F = 1\Big\}.$$
Then, w.p. $> 1 - C_1 e^{-C_2\Psi_{\min} n\log n}$, $\forall\, \Delta \in \mathcal{E}$,
$$\frac{mn}{|\Omega|}\sum_{(i,j)\in\Omega}\Delta_{ij}^2 \ge 1 - \frac{16\, R(\Delta)}{c_0\,\Psi(\bar{\mathcal{M}})}\sqrt{\frac{|\Omega|\,\kappa_R^2(n,|\Omega|)}{n\log n}}.$$
The proof is provided in Appendix A.1.

[Figure 1: an appropriate error metric between the observation matrix $X$ and the MLE estimate $\hat X$ from (6), plotted against the normalized sample size $|\Omega|/(rn\log n)$, when $X$ is generated from (a) Gaussian, (b) Bernoulli, and (c) binomial distributions; curves are shown for $n = 50, 100, 150, 200$.]

From the assumptions in Thm.
2 and Prop. 1, $\hat\Delta/\|\hat\Delta\|_F \in \mathcal{E}$. Further, as $\hat\Delta \in \mathcal{V}$, $R(\hat\Delta) \le R(\hat\Delta_{\bar{\mathcal{M}}}) + R(\hat\Delta_{\bar{\mathcal{M}}^\perp}) \le 4R(\hat\Delta_{\bar{\mathcal{M}}}) \le 4\Psi(\bar{\mathcal{M}})\|\hat\Delta\|_F$. Using Lemma 4 and (15), and choosing $|\Omega| = c\,\Psi^2(\bar{\mathcal{M}})\, n\log n$ for a large enough $c$, we have $K_1 := 1 - 4k_1\sqrt{\frac{\Psi^2(\bar{\mathcal{M}})\, n\log n}{|\Omega|}} > 0$; thus, using $\kappa_L := e^{-2\eta\alpha^*}\Big(1 - \frac{64}{c_0}\sqrt{\frac{|\Omega|\,\kappa_R^2(n,|\Omega|)}{n\log n}}\Big)$, w.h.p.,
$$\frac{mn}{|\Omega|}\sum_{(i,j)\in\Omega} B_G(\hat\Theta_{ij}, \Theta^*_{ij}) \ge \kappa_L\,\|\hat\Delta\|_F^2. \qquad (16)$$

5. Experiments

We provide simulated experiments to corroborate our theoretical guarantees, focusing specifically on Corollary 1, where we consider the special case where the underlying parameter matrix is low rank, but the underlying noise model for the matrix elements could be any of the general class of exponential family distributions. We look at three well known members of the exponential family suitable for different data types, namely Gaussian, Bernoulli, and binomial, which are popular choices for modeling continuous valued, binary, and count valued data, respectively.

5.1. Experimental Setup

We create low rank ground truth parameter matrices $\Theta^* \in \mathbb{R}^{m\times n}$ of sizes $n \in \{50, 100, 150, 200\}$ (for simplicity we considered square matrices, $m = n$); we set the rank to $r = 2\log n$. The observation matrices $X$ are then sampled from the different members of exponential family distributions parameterized by $\Theta^*$. For each $n$, we uniformly sample a subset $\Omega$ of entries of the observation matrix $X$, and estimate $\hat\Theta$ from (6).

Evaluation: For each member of the exponential family of distributions considered, we can measure the performance of our M-estimator in parameter space as $\frac{\|\hat\Theta - \Theta^*\|_F^2}{\|\Theta^*\|_F^2}$, or in observation space using an appropriate error metric $\mathrm{err}(\hat X, X)$, where $\hat X$ is the maximum likelihood estimate of the recovered distribution, $\hat X = \mathrm{argmax}_X\, P(X\mid\hat\Theta)$ (we use RMSE and MAE in our plots). From our corollary, we require $|\Omega| = O(rn\log n)$ samples for consistent recovery, so we plot the error metric against the normalized sample size $\frac{|\Omega|}{rn\log n}$. For reasons of space, we only provide results for the error metric in observation space plotted against the normalized sample size; the remainder of the results are provided in Appendix B. It can be seen from the plots that the error decays with increasing sample size, corroborating our consistency results; indeed $|\Omega| > 1.5\, rn\log n$ samples suffice for the errors to decay to a very small value. Further, the aligning of the curves (for different $n$) given the normalized sample size corroborates the convergence rates as well.
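To reproduce this kind of experiment at a small scale, the rough harness below (our own simplification of the setup in Section 5.1) generates a low-rank $\Theta^*$, samples Bernoulli observations on a uniformly drawn $\Omega$, and reports the relative error in parameter space against the normalized sample size; the estimator call is a stand-in so the snippet runs on its own.

```python
# A rough harness (ours, simplified) mirroring Section 5.1: a low-rank Theta* with
# rank ~ 2 log n, Bernoulli observations on a uniformly sampled Omega, and the
# relative error in parameter space versus the normalized sample size.
# `estimate_theta` is a placeholder where the M-estimator (6) (e.g. the
# proximal-gradient sketch above) would be plugged in.
import numpy as np

rng = np.random.default_rng(2)

def make_low_rank(n, r):
    theta = rng.normal(size=(n, r)) @ rng.normal(size=(n, r)).T
    return theta / np.abs(theta).max()                 # keep entries non-spiky

def estimate_theta(X, omega):                          # placeholder for the estimator (6)
    return np.zeros_like(X)

n = 100
r = int(2 * np.log(n))
theta_star = make_low_rank(n, r)
p_star = 1.0 / (1.0 + np.exp(-theta_star))
for c in (0.25, 0.5, 1.0, 2.0):                        # normalized sample size |Omega| / (r n log n)
    num = int(c * r * n * np.log(n))
    omega = np.zeros((n, n), dtype=bool)
    omega[rng.integers(0, n, num), rng.integers(0, n, num)] = True
    X = rng.binomial(1, p_star).astype(float)
    theta_hat = estimate_theta(X, omega)
    rel_err = np.linalg.norm(theta_hat - theta_star) ** 2 / np.linalg.norm(theta_star) ** 2
    print(f"|Omega|/(r n log n) = {c:.2f}, relative error = {rel_err:.3f}")
```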
Acknowledgement

The research was funded by NSF Grants IIS-0713142 and IIS-1017614. Pradeep Ravikumar acknowledges the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1320894, and DMS-1264033.

References

Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. Clustering with Bregman divergences. JMLR, 6:1705-1749, 2005.

Candes, E. J. and Plan, Y. Matrix completion with noise. Proceedings of the IEEE, 2010.

Candes, E. J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 2009.

Candes, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 2010.

Candes, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

Collins, M., Dasgupta, S., and Schapire, R. E. A generalization of principal components analysis to the exponential family. In NIPS, pp. 617-624, 2001.

Davenport, M. A., Plan, Y., Berg, E., and Wootters, M. 1-bit matrix completion. arXiv preprint arXiv:1209.3672, 2012.

Fazel, M., Hindi, H., and Boyd, S. P. A rank minimization heuristic with application to minimum order system approximation. In American Control Conference, pp. 4734-4739, 2001.

Forster, J. and Warmuth, M. Relative expected instantaneous loss bounds. Journal of Computer and System Sciences, 2002.

Gordon, G. J. Generalized^2 linear^2 models. In NIPS, pp. 577-584, 2002.

Jain, P., Netrapalli, P., and Sanghavi, S. Low-rank matrix completion using alternating minimization. In STOC, 2013.

Kakade, S. M., Shamir, O., Sridharan, K., and Tewari, A. Learning exponential families in high-dimensions: Strong convexity and sparsity. In AISTATS, JMLR Workshop and Conference Proceedings, pp. 381-388, 2010.

Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from a few entries. IEEE Transactions on Information Theory, 2010a.

Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from noisy entries. JMLR, 2010b.

Laurent, M. Matrix completion problems. Encyclopedia of Optimization, 3:221-229, 2009.

Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer, 1991.

Mnih, A. and Salakhutdinov, R. Probabilistic matrix factorization. In NIPS, pp. 1257-1264, 2007.

Mohamed, S., Ghahramani, Z., and Heller, K. A. Bayesian exponential family PCA. In NIPS, pp. 1089-1096, 2008.

Negahban, S. Structured Estimation in High-Dimensions. PhD thesis, EECS Department, University of California, Berkeley, May 2012.

Negahban, S. and Wainwright, M. J. Joint support recovery under high-dimensional scaling: Benefits and perils of l1,inf-regularization. NIPS, 21:1161-1168, 2008.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. JMLR, 13:1665-1697, 2012.

Obozinski, G., Wainwright, M. J., and Jordan, M. I. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1):1-47, 2011.

Recht, B. A simpler approach to matrix completion. JMLR, 12:3413-3430, 2011.

Salakhutdinov, R. and Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, pp. 880-887. ACM, 2008.

Srebro, N., Rennie, J., and Jaakkola, T. S. Maximum-margin matrix factorization. In NIPS, pp. 1329-1336, 2004.

Tipping, M. E. and Bishop, C. M. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pp. 611-622, 1999.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

Yang, E. and Ravikumar, P. Dirty statistical models. In NIPS, 2013.

Yang, E., Allen, G., Liu, Z., and Ravikumar, P. Graphical models via generalized linear models. In NIPS, 2012.

Yang, E., Ravikumar, P., Allen, G. I., and Liu, Z. Conditional random fields via univariate exponential families. In NIPS, pp. 683-691, 2013.