Provable Gaussian Embedding with One Observation

Ming Yu, Zhuoran Yang, Tuo Zhao, Mladen Kolar, Zhaoran Wang

Booth School of Business, University of Chicago, Chicago, IL. Email: ming93@uchicago.edu
Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ.
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA.
Booth School of Business, University of Chicago, Chicago, IL.
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Abstract

The success of machine learning methods heavily relies on having an appropriate representation for the data at hand. Traditionally, machine learning approaches relied on user-defined heuristics to extract features encoding structural information about the data. Recently, however, there has been a surge of approaches that learn how to encode the data automatically in a low-dimensional space. Exponential family embedding provides a probabilistic framework for learning a low-dimensional representation for various types of high-dimensional data [20]. Though successful in practice, the theoretical underpinnings of exponential family embeddings have not been established. In this paper, we study the Gaussian embedding model and develop the first theoretical results for exponential family embedding models. First, we show that, under mild conditions, the embedding structure can be learned from a single observation by leveraging the parameter sharing between different contexts, even though the data are dependent on each other. Second, we study two algorithms for learning the embedding structure and establish convergence results for each of them. The first algorithm is based on a convex relaxation, while the other solves the non-convex formulation of the problem directly. Experiments demonstrate the effectiveness of our approach.

1 Introduction

Exponential family embedding is a powerful technique for learning a low-dimensional representation of high-dimensional data [20]. The exponential family embedding framework comprises a known graph $G = (V, E)$ and a conditional exponential family. The graph $G$ has $m$ vertices, and with each vertex we observe a $p$-dimensional vector $x_j$, $j = 1, \ldots, m$, representing an observation for which we would like to learn a low-dimensional embedding. The exponential family distribution is used to model the conditional distribution of $x_j$ given the context $\{x_k : (k, j) \in E\}$ specified by the neighborhood of the node $j$ in the graph $G$. In order for learning of the embedding to be possible, one furthermore assumes how the parameters of the conditional distributions are shared across different nodes in the graph. The graph structure, the conditional exponential family, and the way parameters are shared across the nodes are modeling choices and are application specific. For example, in the context of word embeddings [1, 11], a word in a document corresponds to a node in a graph, with the corresponding vector $x_j$ being a one-hot vector (the indicator of that word); the context of the word $j$ is given by the surrounding words, so the neighbors of the node $j$ in the graph are the nodes corresponding to those words; and the conditional distribution of $x_j$ is a multivariate categorical distribution. As another example, arising in computational neuroscience, consider embedding the activities of neurons.
Here the graph representing the context encodes spatial proximity of neurons, and the Gaussian distribution is used to model the distribution of a neuron's activation given the activations of nearby neurons.

While exponential family embeddings have been successful in practice, theoretical underpinnings have been lacking. This paper is a step towards providing a rigorous understanding of exponential family embedding in the case of Gaussian embedding. We view the framework of exponential family embeddings through the lens of probabilistic graphical models [6], with the context graph specifying the conditional independencies between nodes and the conditional exponential family specifying the distribution locally. We make several contributions. 1) First, since the exponential family embedding specifies the distribution of each object conditionally on its context, there is no guarantee that a joint distribution exists that is consistent with all the conditional models. The probabilistic graphical models view allows us to provide conditions under which the conditional distributions define a valid joint distribution over all the nodes. 2) Second, the probabilistic graphical model view allows us to learn the embedding from one observation, that is, we get to see only one vector $x_j$ for each node $j \in V$, by exploiting the shared parameter representation across different nodes of the graph. One might then mistakenly think that we in fact have $m$ observations with which to learn the embedding. The difficulty, however, lies in the fact that these observations are not independent, and their dependence is intricately tied to the graph structure. Not every graph structure can be learned from one observation; here we provide sufficient conditions on the graph that allow us to learn the Gaussian embedding from one observation. 3) Finally, we develop two methods for learning the embedding. Our first algorithm is based on a convex relaxation, while the second directly solves a non-convex optimization problem. Both provably recover the underlying embedding, but in practice the non-convex approach may lead to a faster algorithm.

1.1 Related Work

Exponential family embedding. Exponential family embedding originates from word embedding, where words or phrases from the vocabulary are mapped to embedding vectors [1]. Many variants and extensions of word embedding have been developed since [12, 9, 31, 10]. [20] develop a probabilistic framework based on general exponential families that is suitable for a variety of high-dimensional distributions, including Gaussian, Poisson, and Bernoulli embedding. This generalizes the embedding idea to a wider range of applications and types of data, such as real-valued data, count data, and binary data [13, 18, 19]. In this paper, we contribute to this literature by developing theoretical results on Gaussian embedding, which complement the existing empirical results.

Graphical models. The exponential family embedding is naturally related to the literature on probabilistic graphical models, as the context structure forms a conditional dependence graph among the nodes. The two models are related, but the goals and estimation procedures are very different. Much of the research effort on graphical models focuses on learning the graph structure, and hence the conditional dependencies among nodes [8, 25, 29, 22]. In contrast, in this paper the graph structure is known, and we focus on learning the embedding.
Low rank matrix estimation. As we will see in Section 2, the conditional distribution in exponential family embedding takes the form $f(VV^\top)$ for the embedding parameter $V \in \mathbb{R}^{p \times r}$, which embeds the $p$-dimensional vector $x_j$ into an $r$-dimensional space. Hence this is a low rank matrix estimation problem. Traditional methods focus on convex relaxations with nuclear norm regularization [14, 3, 17]. However, when the dimensionality is large, solving the convex relaxation is usually time consuming. Recently there has been a lot of research on non-convex optimization formulations, from both theoretical and empirical perspectives [24, 26, 21, 27, 30]. Non-convex optimization has been found to be computationally more tractable while giving comparable or better results. In this paper we consider both convex relaxation and non-convex optimization approaches.

Figure 1: Some commonly used context structures: (a) chain structure; (b) $\omega$-nearest neighbor structure; (c) lattice structure.

2 Background

In this section, we briefly review the exponential family embedding framework. Let $X = (x_1, \ldots, x_m) \in \mathbb{R}^{p \times m}$ be the data matrix, where a column $x_j \in \mathbb{R}^p$ corresponds to the vector observed at node $j$. For example, in word embedding, $X$ represents a document consisting of $m$ words, $x_j$ is a one-hot vector representation of the $j$-th word, and $p$ is the size of the dictionary. For each $j$, let $c_j \subseteq \{1, \ldots, m\}$ be the context of $j$, which is assumed to be known and is given by the graph $G$; in particular, $c_j = \{k \in V : (j, k) \in E\}$. Some commonly used context structures are shown in Figure 1. Figure 1(a) shows the chain structure. Note that this is different from the vector autoregressive model, where the chain structure is directed. Figure 1(b) shows the $\omega$-nearest neighbor structure, where each node is connected with its preceding and subsequent $\omega$ nodes. This structure is common in word embedding, where the preceding and subsequent $\omega$ words are the context. When $\omega = 1$ it reduces to the chain structure. Finally, Figure 1(c) shows the lattice structure that is widely used in the Ising model.

The exponential family embedding model assumes that $x_j$ conditioned on $x_{c_j}$ follows an exponential family distribution
\[
x_j \mid x_{c_j} \sim \mathrm{ExponentialFamily}\big(\eta_j(x_{c_j}),\, t(x_j)\big), \tag{2.1}
\]
where $t(x_j)$ is the sufficient statistic and $\eta_j(x_{c_j}) \in \mathbb{R}^p$ is the natural parameter. For the linear embedding, we assume that $\eta_j$ in (2.1) takes the form
\[
\eta_j(x_{c_j}) = f_j\Big( V_j \sum_{k \in c_j} V_k^\top x_k \Big),
\]
where the link function $f_j$ is applied elementwise and $V_j \in \mathbb{R}^{p \times r}$. The low dimensional matrix $V_k$ embeds the vector $x_k \in \mathbb{R}^p$ into a lower $r$-dimensional space, with $V_k^\top x_k \in \mathbb{R}^r$ being the embedding of $x_k$. For example, in word embedding each row of $V_k$ is the embedding rule for a word. Since $x_k$ is a one-hot vector, $V_k^\top x_k$ selects the row of $V_k$ that corresponds to the word at node $k$. A common simplifying assumption is that the embedding structure is shared across the nodes, that is, $V_j = V$ for all $j \in V$. In word embedding, this makes the embedding rule independent of the position of the word in the document. We next summarize some commonly used exponential family distributions and show how they define an exponential family embedding model.

Gaussian embedding. In Gaussian embedding it is assumed that the conditional distribution is
\[
x_j \mid x_{c_j} \sim N\Big( \Sigma_j M \sum_{k \in c_j} x_k,\; \Sigma_j \Big), \tag{2.3}
\]
where $M = VV^\top$ and $\Sigma_j$ is the conditional covariance matrix for each node $j$. We will prove in Section 3 that, under mild conditions, these conditional distributions define a valid joint Gaussian distribution.
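The context graphs in Figure 1 can be encoded as adjacency matrices, which is essentially all that the estimation procedures in Section 3 use from the graph. Below is a short NumPy sketch that builds the three structures; the function names are ours, chosen for illustration, and are not part of the paper.

```python
import numpy as np

def chain_adjacency(m):
    """Chain structure: node j is connected to nodes j - 1 and j + 1."""
    A = np.zeros((m, m))
    idx = np.arange(m - 1)
    A[idx, idx + 1] = A[idx + 1, idx] = 1
    return A

def nearest_neighbor_adjacency(m, w):
    """omega-nearest neighbor structure: node j is connected to the
    preceding and subsequent w nodes (w = 1 recovers the chain)."""
    A = np.zeros((m, m))
    for j in range(m):
        lo, hi = max(0, j - w), min(m, j + w + 1)
        A[j, lo:hi] = 1
    np.fill_diagonal(A, 0)          # no self-loops
    return A

def lattice_adjacency(side):
    """Lattice structure on a side x side grid (m = side**2 nodes)."""
    m = side * side
    A = np.zeros((m, m))
    for row in range(side):
        for col in range(side):
            j = row * side + col
            if col + 1 < side:      # right neighbor
                A[j, j + 1] = A[j + 1, j] = 1
            if row + 1 < side:      # bottom neighbor
                A[j, j + side] = A[j + side, j] = 1
    return A
```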
The link function for Gaussian embedding is the identity function, but one may choose the link function to be $f(\cdot) = \log(\cdot)$ in order to constrain the parameters to be non-negative. Gaussian embedding is commonly applied to real-valued observations.

Word embedding (CBOW [11]). In the word embedding setting, $x_j$ is an indicator of the $j$-th word in a document and the dimension of $x_j$ is equal to the size of the vocabulary. The context of the $j$-th word, $c_j$, is the window of size $\omega$ around $x_j$, that is, $c_j = \{k \in \{1, \ldots, m\} : k \neq j,\ |k - j| \le \omega\}$. CBOW is a special case of exponential family embedding in which $p(x_j \mid x_{c_j})$ is a multivariate categorical distribution over the vocabulary.

Poisson embedding. In Poisson embedding, the sufficient statistic is the identity and the natural parameter is the logarithm of the rate, so the conditional distribution is given as $x_j \mid x_{c_j} \sim \mathrm{Poisson}\big(\lambda_j(x_{c_j})\big)$. Poisson embedding can be applied to count data.

3 Gaussian Embedding Model

In this paper, we consider the case of Gaussian embedding, where the conditional distribution of $x_j$ given its context $x_{c_j}$ is given in (2.3) with the conditional covariance matrix $\Sigma_j$ unknown. The parameter matrix $M = VV^\top$ with $V \in \mathbb{R}^{p \times r}$ will be learned from the data matrix $X \in \mathbb{R}^{p \times m}$, and $V^\top x_k$ is the embedding of $x_k$. Let $X_{\mathrm{col}} = [x_1^\top, \ldots, x_m^\top]^\top \in \mathbb{R}^{pm \times 1}$ be the column vector obtained by stacking the columns of $X$, and let $[x_j]_\gamma$ denote the $\gamma$-th coordinate of $x_j$. We first restate a definition on compatibility from [23].

Definition 3.1. A non-negative function $g$ is capable of generating a conditional density function $p(y \mid x)$ if
\[
p(y \mid x) = \frac{g(y, x)}{\int g(y, x)\, dy}. \tag{3.1}
\]
Two conditional densities are said to be compatible if there exists a function $g$ that can generate both conditional densities. When $g$ is a density, the conditional densities are called strongly compatible.

Since $M$ is symmetric, according to Proposition 1 in [4], we have the following proposition.

Proposition 3.2. The conditional distributions (2.3) are compatible and the joint distribution of $X_{\mathrm{col}}$ is of the form
\[
p(x_{\mathrm{col}}) \propto \exp\Big( -\tfrac{1}{2}\, x_{\mathrm{col}}^\top \Sigma_{\mathrm{col}}^{-1} x_{\mathrm{col}} \Big)
\]
for some $\Sigma_{\mathrm{col}} \in \mathbb{R}^{pm \times pm}$. When the choice of $M$ and $\Sigma_j$ is such that $\Sigma_{\mathrm{col}} \succ 0$, the conditional distributions are strongly compatible and we have $X_{\mathrm{col}} \sim N(0, \Sigma_{\mathrm{col}})$.

The explicit expression of $\Sigma_{\mathrm{col}}$ can be derived from (2.3), but in general it is quite complicated. The following example provides an explicit formula in the case where $\Sigma_j = I$.

Example 3.3. Suppose that $\Sigma_j = I$ for all $j = 1, \ldots, m$. Let $A \in \mathbb{R}^{m \times m}$ denote the adjacency matrix of the graph $G$, with $a_{j,k} = 1$ when there is an edge between nodes $j$ and $k$ and $0$ otherwise. Denoting $\gamma^c = \{1, \ldots, \gamma - 1, \gamma + 1, \ldots, p\}$, the conditional distribution of $[x_j]_\gamma$ is given by
\[
[x_j]_\gamma \,\Big|\, x_{c_j}, [x_j]_{\gamma^c} \;\sim\; N\Big( \Big[ M \sum_{k \in c_j} x_k \Big]_\gamma,\; 1 \Big).
\]
Moreover, there exists a joint distribution $X_{\mathrm{col}} \sim N(0, \Sigma_{\mathrm{col}})$, where $\Sigma_{\mathrm{col}} \in \mathbb{R}^{pm \times pm}$ satisfies
\[
\Sigma_{\mathrm{col}} = (I - A \otimes M)^{-1}, \tag{3.2}
\]
and $A \otimes M$ denotes the Kronecker product between $A$ and $M$.

Clearly, we need $\Sigma_{\mathrm{col}} \succ 0$, which imposes implicit restrictions on $A$ and $M$. To ensure that $\Sigma_{\mathrm{col}}$ is positive definite, we need all the eigenvalues of $A \otimes M$ to be smaller than $1$. One sufficient condition for this is $\|A\|_2 \|M\|_2 < 1$. For example, consider a chain graph with
\[
A = \begin{pmatrix} 0 & 1 & & \\ 1 & 0 & \ddots & \\ & \ddots & \ddots & 1 \\ & & 1 & 0 \end{pmatrix} \in \mathbb{R}^{m \times m}
\quad \text{and} \quad
\Sigma_{\mathrm{col}}^{-1} = I - A \otimes M = \begin{pmatrix} I & -M & & \\ -M & I & \ddots & \\ & \ddots & \ddots & -M \\ & & -M & I \end{pmatrix} \in \mathbb{R}^{pm \times pm}. \tag{3.3}
\]
Then it suffices to have $\|M\|_2 < 1/2$. Similarly, for the $\omega$-nearest neighbor structure it suffices to have $\|M\|_2 < 1/(2\omega)$, and for the lattice structure $\|M\|_2 < 1/4$.

3.1 Estimation Procedures

Since $\Sigma_j$ is unknown, we propose to minimize the following loss function based on the conditional log-likelihood:
\[
L(M) = \frac{1}{m} \sum_{j=1}^m L_j(M), \qquad \text{where} \quad L_j(M) := \frac{1}{2} \Big\| x_j - M \sum_{k \in c_j} x_k \Big\|_2^2. \tag{3.4}
\]
Let $M^* = V^* V^{*\top}$ denote the true rank-$r$ matrix with $V^* \in \mathbb{R}^{p \times r}$. Note that $V^*$ is not unique, but $M^*$ is.
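The following small numerical sketch illustrates Example 3.3 and the loss (3.4): it builds a valid joint Gaussian (checking that $\|A\|_2\|M\|_2 < 1$), draws the single observation $X$, and evaluates the loss in its matrix form $L(M) = \frac{1}{2m}\|X - MXA\|_F^2$. The sizes and scaling constants are illustrative choices, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r, m = 10, 3, 100                       # illustrative sizes

# Chain adjacency matrix A.
A = np.zeros((m, m))
i = np.arange(m - 1)
A[i, i + 1] = A[i + 1, i] = 1

# Low-rank M = V V^T, rescaled so that ||A||_2 ||M||_2 < 1 (Example 3.3).
V = rng.standard_normal((p, r))
M = V @ V.T
M *= 0.4 / (np.linalg.norm(A, 2) * np.linalg.norm(M, 2))

# Joint distribution: X_col ~ N(0, Sigma_col) with Sigma_col = (I - A kron M)^{-1}.
Theta = np.eye(p * m) - np.kron(A, M)      # joint precision matrix
assert np.all(np.linalg.eigvalsh(Theta) > 0)
x_col = rng.multivariate_normal(np.zeros(p * m), np.linalg.inv(Theta))
X = x_col.reshape(m, p).T                  # column j of X is x_j

def loss(M, X, A):
    """Matrix form of (3.4): L(M) = (1 / (2m)) * ||X - M X A||_F^2."""
    R = X - M @ X @ A
    return np.sum(R ** 2) / (2 * X.shape[1])

print(loss(M, X, A))                       # loss evaluated at the true parameter
```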
Observe that minimizing (3.4) leads to a consistent estimator even though the loss ignores the unknown conditional covariance matrices $\Sigma_j$; Section 5 illustrates this point empirically. In order to find a low rank solution $\widehat{M}$ that approximates $M^*$, we consider the following two procedures.

Convex Relaxation. We solve the problem
\[
\min_{M \in \mathbb{R}^{p \times p},\ M = M^\top,\ M \succeq 0} \; L(M) + \lambda \|M\|_*, \tag{3.5}
\]
where $\|\cdot\|_*$ is the nuclear norm of a matrix and $\lambda$ is the regularization parameter to be specified in the next section. The problem (3.5) is convex and hence can be solved by the proximal gradient descent method [15] with any initialization point.

Non-convex Optimization. Although solving the convex relaxation (3.5) is guaranteed to find the global minimum, in practice it may be slow. In our problem, since $M$ is low rank and positive semidefinite, we can always write $M = VV^\top$ for some $V \in \mathbb{R}^{p \times r}$ and solve the following non-convex problem
\[
\min_{V \in \mathbb{R}^{p \times r}} \; L(VV^\top). \tag{3.6}
\]
With an appropriate initialization $V^{(0)}$, in each iteration we update $V$ by gradient descent,
\[
V^{(t+1)} = V^{(t)} - \eta\, \nabla_V L\big(V^{(t)} V^{(t)\top}\big),
\]
where $\eta$ is the step size. The choice of the initialization $V^{(0)}$ and the step size $\eta$ will be specified in detail in the next section. The unknown rank $r$ can be estimated as in [2].

4 Theoretical Results

We establish convergence rates for the two estimation procedures.

4.1 Convex Relaxation

In order to show that a minimizer of (3.5) gives a good estimator of $M^*$, we first show that the objective function $L(\cdot)$ is strongly convex, under the assumption that the data are distributed according to (2.3) with the true parameter $M^* = V^* V^{*\top}$, $V^* \in \mathbb{R}^{p \times r}$. Let
\[
\delta L(\Delta) = L(M^* + \Delta) - L(M^*) - \langle \nabla L(M^*), \Delta \rangle,
\]
where $\langle A, B \rangle = \mathrm{tr}(A^\top B)$ and $\Delta$ is a symmetric matrix. Let $\Delta_i$ denote the $i$-th column of $\Delta$ and let $\Delta_{\mathrm{col}} = [\Delta_1^\top, \ldots, \Delta_p^\top]^\top \in \mathbb{R}^{p^2 \times 1}$ be the vector obtained by stacking the columns of $\Delta$. A simple calculation shows that $\delta L(\Delta)$ has a quadratic form in each $\Delta_i$ with the same Hessian matrix $H$. Let $\widetilde{X} = XA \in \mathbb{R}^{p \times m}$, where $A$ is the adjacency matrix of the graph $G$. Then the Hessian matrix is given by
\[
H = \frac{1}{m} \widetilde{X} \widetilde{X}^\top = \frac{1}{m} X A A^\top X^\top \in \mathbb{R}^{p \times p}, \tag{4.1}
\]
and therefore we can succinctly write $\delta L(\Delta) = \tfrac{1}{2}\, \Delta_{\mathrm{col}}^\top H_{\mathrm{col}} \Delta_{\mathrm{col}}$, where the total Hessian matrix $H_{\mathrm{col}} = \mathrm{diag}(H, H, \ldots, H) \in \mathbb{R}^{p^2 \times p^2}$ is a block diagonal matrix.

To show that $L(\cdot)$ is strongly convex, it suffices to lower bound the minimum eigenvalue of $H$ defined in (4.1). If the columns of $\widetilde{X}$ were independent, the minimum eigenvalue of $H$ would be bounded away from zero with overwhelming probability for a large enough $m$ [16]. However, in our setting the columns of $\widetilde{X}$ are dependent, and we need to prove this lower bound using different tools. As the distribution of $X$ depends on the unknown conditional covariance matrices $\Sigma_j$, $j = 1, \ldots, m$, in a complicated way, we impose the following assumption on the expected version of $H$.

Assumption EC. The minimum and maximum eigenvalues of $\mathbb{E}H$ are bounded from below and from above: $0 < c_{\min} \le \sigma_{\min}(\mathbb{E}H) \le \sigma_{\max}(\mathbb{E}H) \le c_{\max} < \infty$.

Assumption (EC) puts restrictions on the conditional covariance matrices $\Sigma_j$ and can be verified in specific instances of the problem. In the context of Example 3.3, where $\Sigma_j = I$, $j = 1, \ldots, m$, and the graph is a chain, we have the adjacency matrix $A$ and the covariance matrix $\Sigma_{\mathrm{col}}$ given in (3.3). Then simple linear algebra [5] gives us that $\mathbb{E}H = m^{-1} \mathbb{E}[X A A^\top X^\top] = 2I + c M^2 + o(M^2)$, which guarantees that $\sigma_{\min}(\mathbb{E}H) \ge 1$ and $\sigma_{\max}(\mathbb{E}H) \le c + 3$ for large enough $m$. The following assumption requires that the spectral norms of $A$ and $\Sigma_{\mathrm{col}}$ do not scale with $m$.

Assumption SC. There exists a constant $\kappa_0$ such that $\max\big\{ \|A\|_2,\ \|\Sigma_{\mathrm{col}}^{1/2}\|_2 \big\} \le \kappa_0$.

Assumption (SC) gives a sufficient condition on the graph structure, and it requires that the dependency among the nodes is weak.
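Assumptions (EC) and (SC) can be checked numerically in the isotropic chain setting of Example 3.3. The sketch below computes the quantities entering (SC), namely $\|A\|_2$ and the bound on $\|\Sigma_{\mathrm{col}}\|_2$ implied by $\|A\|_2\|M\|_2 < 1$, together with the extreme eigenvalues of the sample Hessian $H$ in (4.1) from one draw of the model; the Gibbs sweep and all constants are illustrative choices of ours, not the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r, m = 10, 3, 2000

# Chain adjacency and a low-rank M with ||A||_2 ||M||_2 bounded away from 1.
A = np.zeros((m, m))
i = np.arange(m - 1)
A[i, i + 1] = A[i + 1, i] = 1
V = rng.standard_normal((p, r))
M = V @ V.T
M *= 0.5 / (np.linalg.norm(A, 2) * np.linalg.norm(M, 2))

# Quantities entering assumption (SC), with Sigma_j = I:
norm_A, norm_M = np.linalg.norm(A, 2), np.linalg.norm(M, 2)
sigma_col_bound = 1.0 / (1.0 - norm_A * norm_M)      # bound on ||Sigma_col||_2
print(norm_A, norm_M, sigma_col_bound)

# One draw from the model via a few Gibbs sweeps over the conditionals (2.3).
X = rng.standard_normal((p, m))
for _ in range(100):
    for j in range(m):
        ctx = X @ A[:, j]                            # sum of x_k over the context of node j
        X[:, j] = M @ ctx + rng.standard_normal(p)

# Extreme eigenvalues of the sample Hessian H = (1/m) X A A^T X^T (assumption (EC)).
H = (X @ A) @ (X @ A).T / m
evals = np.linalg.eigvalsh(H)
print(evals.min(), evals.max())                      # expected to stay bounded away from 0 and infinity
```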
In fact, assumption (SC) can be relaxed to $\kappa_0 = o(m^{1/4})$, which allows the spectral norms to scale slowly with $m$. In that case the minimum and maximum eigenvalues in assumption (EC) also scale with $m$, which results in a much larger sample complexity in $m$. However, if $\kappa_0$ grows even faster, then there is no way to guarantee a reasonable estimate. We see that $\kappa_0 \asymp m^{1/4}$ is the critical condition, and we have a phase transition at this boundary.

It is useful to point out that these assumptions are not restrictive. For example, under the simplification that $\Sigma_j = I$, we have $\|\Sigma_{\mathrm{col}}\|_2 \le 1/(1 - \|A\|_2 \|M\|_2)$. The condition $\|A\|_2 \|M\|_2 < 1$ is satisfied naturally for a valid Gaussian embedding model. Therefore, in order to have $\|\Sigma_{\mathrm{col}}^{1/2}\|_2 \le \kappa_0$, we only need $\|A\|_2 \|M\|_2 \le 1 - 1/\kappa_0^2$, that is, the product must be bounded away from $1$ by a constant distance. It is straightforward to verify that assumption (SC) holds for the chain structure in Example 3.3. If the graph is fully connected, we have $\|A\|_2 = m - 1$, which violates the assumption. In general, assumption (SC) gives a sufficient condition on the graph structure so that the embedding is learnable.

With these assumptions, the following lemma shows that the minimum and maximum eigenvalues of the sample Hessian matrix $H$ are also bounded from below and above with high probability.

Lemma 4.1. Suppose that assumptions (EC) and (SC) hold. Then for $m \ge c_0 p$ we have $\sigma_{\min}(H) \ge \tfrac{1}{2} c_{\min}$ and $\sigma_{\max}(H) \le 2 c_{\max}$ with probability at least $1 - c_1 \exp(-c_2 m)$, where $c_0, c_1, c_2$ are absolute constants. Therefore
\[
\mu \|\Delta\|_F^2 \;\le\; \delta L(\Delta) \;\le\; L \|\Delta\|_F^2
\]
for any symmetric $\Delta \in \mathbb{R}^{p \times p}$, where the constants $\mu$ and $L$ are proportional to $c_{\min}$ and $c_{\max}$, respectively.

Lemma 4.1 is the key technical result; it shows that although all the $x_j$ are dependent, the objective function $L(\cdot)$ is still strongly convex and smooth. Since the loss function $L(\cdot)$ is strongly convex, an application of Theorem 1 in [14] gives the following result on the performance of the convex relaxation approach proposed in the previous section.

Theorem 4.2. Suppose that assumptions (SC) and (EC) are satisfied. The minimizer $\widehat{M}$ of (3.5) with $\lambda \ge 2\|\nabla L(M^*)\|_2$ satisfies
\[
\|\widehat{M} - M^*\|_F \;\le\; \frac{32 \sqrt{r}\, \lambda}{\mu}.
\]

The following lemma gives us a way to set the regularization parameter $\lambda$.

Lemma 4.3. Let $G = \frac{1}{m} \sum_{j=1}^m \Sigma_j$. Assume that the maximum eigenvalue of $G$ is bounded from above as $\sigma_{\max}(G) \le \tau_{\max}$ for some constant $\tau_{\max}$. Then there exist constants $c_0, c_1, c_2, c_3 > 0$ such that for $m \ge c_0 p$, we have $\|\nabla L(M^*)\|_2 \le c_1 \sqrt{p/m}$ with probability at least $1 - c_2 \exp(-c_3 m)$.

Combining the result of Lemma 4.3 with Theorem 4.2, we see that $\lambda$ should be chosen as $\lambda = O\big(\sqrt{p/m}\big)$, which leads to the error rate $\|\widehat{M} - M^*\|_F = O_P\big(\sqrt{rp/m}\big)$.

4.2 Non-convex Optimization

Next, we consider the convergence rate for the non-convex method obtained by minimizing (3.6) over $V$. Since the factorization of $M^*$ is not unique, we measure the subspace distance between $V$ and $V^*$.

Subspace distance. Let $V^*$ be such that $V^* V^{*\top} = M^*$. Define the subspace distance between $V$ and $V^*$ as
\[
d^2(V, V^*) = \min_{O \in \mathcal{O}(r)} \|V - V^* O\|_F^2,
\quad \text{where} \quad \mathcal{O}(r) = \{ O \in \mathbb{R}^{r \times r} : OO^\top = O^\top O = I \}.
\]

Next, we introduce the notion of the statistical error. Denote
\[
\Omega = \big\{ \Delta \in \mathbb{R}^{p \times p} : \Delta = \Delta^\top,\ \mathrm{rank}(\Delta) \le 2r,\ \|\Delta\|_F \le 1 \big\}.
\]
The statistical error is defined as
\[
e_{\mathrm{stat}} = \sup_{\Delta \in \Omega} \ \langle \nabla L(M^*), \Delta \rangle.
\]
Intuitively, the statistical error quantifies how close the estimator can be to the true value. Specifically, if $V$ is within $c \cdot e_{\mathrm{stat}}$ distance from $V^*$, then it is already optimal.
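The subspace distance defined above can be computed in closed form: the minimizing rotation is the solution of an orthogonal Procrustes problem. A short NumPy sketch (the function name is ours, chosen for illustration):

```python
import numpy as np

def subspace_distance(V, V_star):
    """d(V, V*) = min over orthogonal O of ||V - V* O||_F, via orthogonal
    Procrustes: the optimal rotation is O = U W^T, where U S W^T = svd(V*^T V)."""
    U, _, Wt = np.linalg.svd(V_star.T @ V, full_matrices=False)
    O = U @ Wt
    return np.linalg.norm(V - V_star @ O)

# Quick check: the distance is invariant to rotating the factor.
rng = np.random.default_rng(2)
p, r = 10, 3
V_star = rng.standard_normal((p, r))
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))   # a random r x r orthogonal matrix
print(subspace_distance(V_star @ Q, V_star))        # ~ 0 up to numerical error
```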
For any $\Delta \in \Omega$, we have the factorization $\Delta = U_\Delta V_\Delta^\top$, where $U_\Delta, V_\Delta \in \mathbb{R}^{p \times 2r}$ and $\|U_\Delta\|_2 = \|V_\Delta\|_F = 1$. We then have
\[
\langle \nabla L(M^*) V_\Delta, U_\Delta \rangle \;\le\; \|\nabla L(M^*) V_\Delta\|_F \|U_\Delta\|_F \;\le\; \|\nabla L(M^*)\|_2 \|V_\Delta\|_F \|U_\Delta\|_F,
\]
where the final bound is controlled using Lemma 4.3. In particular, we see that both the convex relaxation and the non-convex optimization give the same statistical rate.

Initialization. In order to prove a linear rate of convergence for the procedure, we need to initialize it properly. Since the loss function $L(M)$ is quadratic in $M$, we can ignore all the constraints on $M$ and obtain the closed form solution
\[
M^{(0)} = X (XA)^\top \big[ (XA)(XA)^\top \big]^{-1}.
\]
We then apply a rank-$r$ eigenvalue decomposition to
\[
\widetilde{M}^{(0)} = \tfrac{1}{2} \big( M^{(0)} + M^{(0)\top} \big)
\]
and obtain $[\widetilde{V}, \widetilde{S}, \widetilde{V}]$ as the rank-$r$ SVD of $\widetilde{M}^{(0)}$. Then $V^{(0)} = \widetilde{V} \widetilde{S}^{1/2}$ is the initial point for the gradient descent. The following lemma quantifies the accuracy of this initialization.

Lemma 4.4. The initialization $M^{(0)}$ and $V^{(0)}$ satisfies $\|M^{(0)} - M^*\|_F \le 2\sqrt{p}\, \lambda$, and consequently $d(V^{(0)}, V^*) \lesssim \|M^{(0)} - M^*\|_F / \sqrt{\sigma_r(M^*)}$, where $\sigma_r(M^*)$ is the minimum non-zero singular value of $M^*$.

With this initialization, we obtain the following main result for the non-convex optimization approach, which establishes a linear rate of convergence to a point that has the same statistical error rate as the convex relaxation approach studied in Theorem 4.2.

Theorem 4.5. Suppose that assumptions (EC) and (SC) are satisfied, and suppose the step size $\eta$ satisfies $32\, \eta\, \|M^{(0)}\|_2 \le 1$. For large enough $m$, after $T$ iterations we have
\[
d^2\big(V^{(T)}, V^*\big) \;\le\; \beta^T d^2\big(V^{(0)}, V^*\big) + C\, e_{\mathrm{stat}}^2, \tag{4.8}
\]
for some constant $\beta < 1$ and a constant $C$.

5 Experiments

In this section, we evaluate our methods through experiments. We first verify that, although $\Sigma_j$ is unknown, minimizing (3.4) still leads to a consistent estimator. We compare the estimation accuracy with known and unknown covariance matrices $\Sigma_j$. We set $\Sigma_j = \sigma_j \cdot \mathrm{Toeplitz}(\rho_j)$, where $\mathrm{Toeplitz}(\rho_j)$ denotes the Toeplitz matrix with parameter $\rho_j$, and we draw $\rho_j \sim U[0, 0.3]$ and $\sigma_j \sim U[0.4, 1.6]$ to make the covariance matrices non-isotropic. The estimation accuracies with known and unknown $\Sigma_j$ are given in Table 1. We can see that although knowing $\Sigma_j$ gives slightly better accuracy, the difference is tiny. Therefore, even if the covariance matrices are not isotropic, ignoring them still gives a consistent estimator.

Table 1: Comparison of estimation accuracy with known and unknown covariance matrix

              m = 1000   m = 2500   m = 5000   m = 8000   m = 15000
    unknown    0.8184     0.4432     0.3210     0.2472     0.1723
    known      0.7142     0.3990     0.2908     0.2288     0.1649

We then consider the three kinds of graph structures given in Figure 1: the chain structure, the $\omega$-nearest neighbor structure, and the lattice structure. We generate the data according to the conditional distribution (2.3) using Gibbs sampling. We set $p = 100$, $r = 5$ and vary the number of nodes $m$. For each $j$, we set $\Sigma_j = \Sigma$ to be a Toeplitz matrix with $\Sigma_{i,i'} = \rho^{|i - i'|}$ and $\rho = 0.3$. We generate independent train, validation, and test sets. For the convex relaxation, the regularization parameter is selected using the validation set. We consider two metrics: one is the estimation accuracy $\|\widehat{M} - M^*\|_F / \|M^*\|_F$, and the other is the loss $L(\widehat{M})$ on the test set. The simulation results for estimation accuracy for the three graph structures are shown in Figure 2, and the results for the loss on the test sets are shown in Figure 3.

Figure 2: Estimation accuracy for three context structures ((a) chain, (b) $\omega$-nearest neighbor, (c) lattice): estimation error of the convex relaxation and the non-convex method as the number of nodes grows.

Figure 3: Testing loss for three context structures ((a) chain, (b) $\omega$-nearest neighbor, (c) lattice): test loss of the convex relaxation and the non-convex method as the number of nodes grows.
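For completeness, a compact sketch of the non-convex procedure from Sections 3.1 and 4.2 (closed-form initialization followed by gradient descent on the factor $V$) is given below. The function name, step-size heuristic, and iteration count are illustrative choices of ours, not the authors' implementation.

```python
import numpy as np

def fit_gaussian_embedding(X, A, r, step=None, iters=500):
    """Sketch of the non-convex estimator: unconstrained closed-form M^(0),
    rank-r eigendecomposition of its symmetrization for V^(0), then gradient
    descent on V for L(V V^T) = (1/(2m)) ||X - V V^T X A||_F^2."""
    p, m = X.shape
    C = X @ A                                  # column j holds the sum of x_k over the context of j
    # Unconstrained minimizer of the quadratic loss: M^(0) = X C^T (C C^T)^{-1}.
    M0 = np.linalg.solve(C @ C.T, C @ X.T).T
    M0_sym = (M0 + M0.T) / 2
    # Rank-r spectral initialization V^(0) = V_r * sqrt(max(eigenvalues, 0)).
    evals, evecs = np.linalg.eigh(M0_sym)
    top = np.argsort(evals)[::-1][:r]
    V = evecs[:, top] * np.sqrt(np.clip(evals[top], 0.0, None))
    if step is None:
        # Step size of order 1 / ||M^(0)||_2, as suggested by the theory (up to constants).
        step = 1.0 / (32.0 * max(np.linalg.norm(M0_sym, 2), 1e-8))
    for _ in range(iters):
        M = V @ V.T
        G = -(X - M @ C) @ C.T / m             # gradient of L with respect to M
        V = V - step * (G + G.T) @ V           # chain rule for the factored objective
    return V

# Usage, e.g. with X and A generated as in the sketch after Example 3.3:
# V_hat = fit_gaussian_embedding(X, A, r=3)
# M_hat = V_hat @ V_hat.T
```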
Each result is based on 20 replicates. For the estimation accuracy, we see that when the number of nodes is small, neither method gives an accurate estimate; for reasonably large $m$, the non-convex method gives better estimation accuracy since it does not introduce bias; for large enough $m$, both methods give accurate and similar estimates. For the loss on the test sets, we see that in general both methods give a smaller loss as $m$ increases, with the non-convex method giving a marginally better loss. This demonstrates the effectiveness of our methods.

6 Conclusion

In this paper, we focus on Gaussian embedding and develop the first theoretical results for exponential family embedding models. We show that for various kinds of context structures, we are able to learn the embedding structure with only one observation. Although all the data we observe are dependent, we show that the objective function is still well behaved, and therefore we can learn the embedding structure reasonably well. It is useful to point out that the theoretical framework we propose applies to general exponential family embedding models: as long as similar conditions are satisfied, the framework and theoretical results carry over to other exponential family embedding models as well. Verifying these conditions for other distributions is, however, quite challenging from the probabilistic perspective; the framework itself still holds, but more involved probabilistic tools are needed. Extending the results to other embedding models, for example the Ising model, is work in progress.

References

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137-1155, 2003.

[2] Florentina Bunea, Yiyuan She, and Marten H. Wegkamp. Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics, pages 1282-1309, 2011.

[3] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[4] Shizhe Chen, Daniela M. Witten, and Ali Shojaie. Selection and estimation for mixed graphical models. Biometrika, 102(1):47-64, 2014.

[5] G. Y. Hu and Robert F. O'Connell. Analytical inversion of symmetric tridiagonal matrices. Journal of Physics A: Mathematical and General, 29(7):1511, 1996.

[6] Steffen L. Lauritzen. Graphical Models, volume 17. Clarendon Press, 1996.

[7] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

[8] Jason D. Lee and Trevor J. Hastie. Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics, 24(1):230-253, 2015.

[9] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177-2185, 2014.

[10] Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211-225, 2015.

[11] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[12] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746-751, 2013.
[13] Tanmoy Mukherjee and Timothy Hospedales. Gaussian visual-linguistic embedding for zero-shot recognition. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 912-918, 2016.

[14] Sahand Negahban and Martin J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, pages 1069-1097, 2011.

[15] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127-239, 2014.

[16] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11(Aug):2241-2259, 2010.

[17] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.

[18] Maja Rudolph and David Blei. Dynamic Bernoulli embeddings for language evolution. arXiv preprint arXiv:1703.08052, 2017.

[19] Maja Rudolph, Francisco Ruiz, Susan Athey, and David Blei. Structured embedding models for grouped data. In Advances in Neural Information Processing Systems, pages 250-260, 2017.

[20] Maja Rudolph, Francisco Ruiz, Stephan Mandt, and David Blei. Exponential family embeddings. In Advances in Neural Information Processing Systems, pages 478-486, 2016.

[21] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535-6579, 2016.

[22] Jialei Wang and Mladen Kolar. Inference for high-dimensional exponential family graphical models. In Artificial Intelligence and Statistics, pages 1042-1050, 2016.

[23] Yuchung J. Wang and Edward H. Ip. Conditionally specified continuous distributions. Biometrika, 95(3):735-746, 2008.

[24] Zhaoran Wang, Huanran Lu, and Han Liu. Nonconvex statistical optimization: Minimax-optimal sparse PCA in polynomial time. arXiv preprint arXiv:1408.5352, 2014.

[25] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, and Zhandong Liu. Graphical models via univariate exponential family distributions. Journal of Machine Learning Research, 16(1):3813-3847, 2015.

[26] Ming Yu, Varun Gupta, and Mladen Kolar. An influence-receptivity model for topic based information cascades. In 2017 IEEE International Conference on Data Mining (ICDM), pages 1141-1146, 2017.

[27] Ming Yu, Varun Gupta, and Mladen Kolar. Learning influence-receptivity network structure with guarantee. arXiv preprint arXiv:1806.05730, 2018.

[28] Ming Yu, Varun Gupta, and Mladen Kolar. Recovery of simultaneous low rank and two-way sparse coefficient matrices, a nonconvex approach. arXiv preprint arXiv:1802.06967, 2018.

[29] Ming Yu, Mladen Kolar, and Varun Gupta. Statistical inference for pairwise graphical models using score matching. In Advances in Neural Information Processing Systems, pages 2829-2837, 2016.

[30] Tuo Zhao, Zhaoran Wang, and Han Liu. Nonconvex low rank matrix factorization via inexact first order oracle. In Advances in Neural Information Processing Systems, 2015.

[31] Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1393-1398, 2013.