# On Estimation in Latent Variable Models

Guanhua Fang, Ping Li
Cognitive Computing Lab, Baidu Research
10900 NE 8th St, Bellevue, WA 98004, USA
{guanhuafang, liping11}@baidu.com

Abstract. Latent variable models have been playing a central role in statistics, econometrics, and machine learning, with applications to repeated observation studies, panel data inference, user behavior analysis, etc. In many modern applications, inference based on latent variable models involves one or several of the following features: the presence of complex latent structure, observed and latent variables that may be continuous or discrete, constraints on parameters, and large data size. Therefore, solving an estimation problem for general latent variable models is highly non-trivial. In this paper, we consider a gradient-based method that uses a variance reduction technique to accelerate the estimation procedure. Theoretically, we show convergence results for the proposed method under general and mild model assumptions. The algorithm has better computational complexity than classical gradient methods and maintains nice statistical properties. Various numerical results corroborate our theory.

1. Introduction

A latent variable model, as the name suggests, is a statistical model that contains latent/unobserved variables. Their roots trace back to Spearman's 1904 seminal work (Spearman, 1904) on factor analysis, which is arguably the first well-articulated latent variable model. In past years, latent variable models have been playing an important role in machine learning, statistics, econometrics, psychometrics, and the social sciences, with applications to repeated observation studies, panel data inference, user behavior analysis, etc. (Aigner et al., 1984; Bishop, 1998; Bartholomew et al., 2011; Ahmed et al., 2012; Loehlin and Beaujean, 2016). Latent variables serve to reduce the dimensionality of data. Many observable variables can be aggregated in a model to represent an underlying concept, making it easier to do data analysis.

In many applications, inference based on latent variable models involves one or several of the following features: 1) the presence of complex latent structure, 2) observed and latent variables that may be either continuous or discrete, 3) constraints on parameters, 4) large data size. Compared with models without latent variables (e.g., linear regression and generalized linear regression), the estimation problem for latent variable models is typically more involved. In general, the estimation problem can be approached from three perspectives: both latent variables and parameters are viewed as fixed quantities; both latent variables and parameters are viewed as random quantities; and latent variables are random while parameters are fixed. The first perspective (i.e., fixed latent variables and parameters) leads to the joint maximum likelihood estimator. This estimator can usually be efficiently computed by alternating minimization-type algorithms (Birnbaum, 1968; Chen et al., 2020). However, such an estimator can be statistically inconsistent and may lead to biased statistical inference. The second perspective (i.e., random latent variables and parameters) leads to Markov Chain Monte Carlo (MCMC) methods (Metropolis et al., 1953).
Metropolis-Hastings methods (Hastings, 1970), Gibbs samplers (Gilks and Wild, 1992; Gilks et al., 1995), and other Bayesian methods (Gelman et al., 2013) have been developed to solve the estimation problem. However, as data size increases, these methods can be computationally inefficient and need a long time for the Markov chain to become stable. In this paper, we adopt the third perspective (i.e., fixed parameters and random latent variables) and develop a gradient-based computational method that incorporates a variance reduction technique to accelerate the estimation procedure. Due to the latent structure, we need to estimate the gradient by sampling the latent variables according to posterior distributions. When an analytical formula for the posterior distribution is not obtainable, the Q-function (i.e., the objective in the M-step) of the expectation-maximization (EM) method is hard to compute, whereas our method allows latent variables to be sampled from approximate distributions (which can easily be constructed using quasi-Monte Carlo or MCMC methods).

On the theoretical side, we provide a convergence analysis of the proposed method and show that it has better computational complexity. Notably, our results are established under very general statistical model assumptions, covering both independent and identically distributed (i.i.d.) cases and network (non-i.i.d.) cases. Moreover, we also consider optimization settings with regularization terms. For both smooth and non-smooth cases, our estimator is shown to have nice statistical properties and achieves efficiency.

The rest of the paper is organized as follows. In Section 2, we give an introduction to latent variable models. In Section 3, we propose a variance-reduced method for parameter estimation in latent variable models. In Section 4, we provide theoretical analysis of the proposed method for both smooth and non-smooth cases. We further extend our results to the network setting in Section 5. Numerical results are presented in Section 6 and validate our theory. Concluding remarks are given in Section 7.

2. Preliminaries

2.1. Notation and Setting

We consider the estimation of a parametric latent variable model. Let $Y$ be a random response representing the observed data and $Z$ be a random variable representing the latent unobserved trait. We use $y$ and $z$ to denote their realizations, respectively. Let $\theta$ represent the generic vector of parameters to be estimated. We assume $\theta \in B \subset \mathbb{R}^p$, where $p$ is the dimension of the parameter and $B$ is the domain. We assume the latent variable $Z$ follows a prior $p(z)$ (the prior may depend on nuisance parameters which we are not interested in). Given $Z = z$, we assume $Y$ follows a distribution parameterized by $\theta$ with density $f_\theta(y|Z = z)$. We may write $f_\theta(y|Z = z) = f_\theta(y|z)$ for notational simplicity. Therefore, the joint density is $f_\theta(y, z) = f_\theta(y|z)p(z)$ and the marginal density is $f_\theta(y) = \int f_\theta(y, z)\,dz$. Throughout the paper, we use the subscript $i$ to indicate each individual sample.

In the sequel, we further adopt the following notation. We use $\|x\|$ and $\|x\|_1$ to represent the $\ell_2$- and $\ell_1$-norms of a vector $x$. $B(\theta, \delta)$ denotes an $\ell_2$ ball centered at $\theta$ with radius $\delta$. We use $[n]$ to denote the set $\{1, \dots, n\}$ and $[n] \times [n]$ to denote the set $\{(i, j) : i \in [n], j \in [n]\}$. We use $\theta^*$ to denote the true model parameter and $\hat\theta$ to denote the optimizer of (4).
Moreover, $a = O(b)$ means there exists a constant $K$ such that $a \le Kb$; $a = \Omega(b)$ means there exists a sufficiently large constant $K$ such that $a \ge Kb$. We use $\nabla f$ ($\nabla^2 f$) to represent the first (second) derivative of $f$ with respect to $\theta$.

2.2. Examples

Latent variable models can be broadly divided into two categories: factor analysis (latent variables are continuous) and latent class analysis (latent variables are discrete). In this section we provide several illustrative examples to give readers a better understanding of the structure of latent variable models.

Latent Factor Models. In latent factor analysis (LFA), the latent variable $Z := (\xi_1, \xi_2, \dots, \xi_K)$ is assumed to follow a multivariate Gaussian distribution, e.g., $N(0, \Sigma)$ with covariance matrix $\Sigma$. The response $Y$ is also assumed to be multivariate, that is, $Y = (Y_1, \dots, Y_J)$. In the linear bifactor model, the response $Y$ takes the form

$$Y_j = a_{j0} + a_j^T Z + \epsilon_j, \qquad (1)$$

where the $\epsilon_j$'s are i.i.d. $N(0, \sigma^2)$. In item response settings, the response $Y$ takes discrete values and is usually binary:

$$P(Y_j = 1|Z) = \Phi(a_{j0} + a_j^T Z), \qquad P(Y_j = 0|Z) = 1 - P(Y_j = 1|Z),$$

with $\Phi(\cdot)$ being a certain link function. For example, $\Phi(x) = \exp\{x\}/(1 + \exp\{x\})$ when $\Phi(\cdot)$ is the logit link, and $\Phi(x)$ is the cumulative distribution function of the standard normal distribution when $\Phi(\cdot)$ is the probit link. Here, the $a_{j0}$'s are intercept parameters and the $a_j$'s are loading vectors. Suppose there are $n$ individuals; then the likelihood function can be written as

$$L(\theta) = \prod_{i=1}^n L_i(\theta) = \prod_{i=1}^n \int \Big\{\prod_{j=1}^J f_\theta(y_{ij}|z_i)\Big\} p(z_i)\,dz_i,$$

where $\theta = \{(a_{j0}, a_j)\}_{j=1}^J$ and $p(z)$ is the density function of $N(0, \Sigma)$.

Latent Class Models. In statistics, a latent class model (LCM) relates a set of observed (usually discrete) variables to a set of latent variables. A class is characterized by a pattern of conditional probabilities that indicate the chance that the latent variables take on certain values. One of the most typical examples is the Gaussian mixture model, in which the response $Y$ follows $\sum_{c=1}^C p_c f(y|\mu_c, \sigma_c^2)$, with $p_c$ being the latent class probability and $f(y|\mu_c, \sigma_c^2)$ the density of a normal distribution with mean $\mu_c$ and variance $\sigma_c^2$.

Furthermore, restricted latent class models are also widely used in the social and behavioral sciences. For example, they are commonly used in education for cognitive diagnosis (von Davier and Lee, 2019). These models give the latent variables specific meanings and hence make it easier to produce diagnoses of individuals. Here, we consider a setting where both the response and the latent variables are binary. Let $Z = (\alpha_1, \dots, \alpha_K) \in \{0,1\}^K$, with $\alpha_k \in \{0,1\}$ indicating mastery of the $k$-th latent trait/attribute. The response is $Y = (Y_1, \dots, Y_J)$, where $Y_j$ follows a Bernoulli distribution satisfying

$$P(Y_j = 1|Z = z) = \frac{\exp\{\theta_{jz}\}}{1 + \exp\{\theta_{jz}\}}. \qquad (2)$$

Furthermore, for each fixed $j$, the $\theta_{jz}$'s satisfy a partial order relationship. That is, $\theta_{jz_1} \le \theta_{jz_2}$ if $z_1 \preceq z_2$, representing that an examinee is more likely to answer the $j$-th item correctly if he is more capable (i.e., $z_2$ has more 1's). Under the restricted LCM setting with $n$ individuals, the likelihood function can be written as

$$L(\theta) = \prod_{i=1}^n L_i(\theta) = \prod_{i=1}^n \sum_{z_i \in \{0,1\}^K} p_{z_i} \prod_{j=1}^J f_\theta(y_{ij}|z_i),$$

with $\theta = (\theta_{jz};\ j = 1, \dots, J,\ z \in \{0,1\}^K)$.
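To make the marginal likelihood $f_\theta(y) = \int f_\theta(y|z)p(z)\,dz$ in these examples concrete, here is a minimal Python sketch (our illustration, not code from the paper) that approximates the per-individual marginal log-likelihood of the binary item-response factor model above by Monte Carlo integration over the prior; the item counts, seed, and function name are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def marginal_loglik_mc(y, a0, A, n_mc=2000):
    """Monte Carlo estimate of log f_theta(y) = log E_{z~p(z)}[prod_j f_theta(y_j|z)]
    for a binary item-response factor model with logit link.
    y: (J,) binary responses; a0: (J,) intercepts; A: (J, K) loading matrix."""
    n_items, K = A.shape
    z = rng.standard_normal((n_mc, K))    # draws from the prior N(0, I_K)
    logits = a0 + z @ A.T                 # (n_mc, J)
    p = 1.0 / (1.0 + np.exp(-logits))     # P(Y_j = 1 | z)
    # complete-data log-likelihood of y under each prior draw of z
    ll = (y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=1)
    m = ll.max()                          # log-mean-exp for numerical stability
    return m + np.log(np.exp(ll - m).mean())

# toy usage: J = 5 items, K = 2 latent factors
a0 = rng.normal(size=5)
A = rng.normal(size=(5, 2))
y = rng.integers(0, 2, size=5).astype(float)
print(marginal_loglik_mc(y, a0, A))
```

The log-mean-exp step guards against underflow when $J$ is large; the same pattern applies to the latent class likelihoods above, with the integral replaced by a finite sum over $z \in \{0,1\}^K$.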
Latent Network Models. A latent network model is a generative model often used to characterize block structure in social networks. Assume there are $n$ nodes and an edge list $A \subset [n] \times [n]$. Each node $i$ is assigned a latent membership $z_i \in [K]$. For each pair of nodes $(i, j) \in A$, we observe data $y_{ij} \sim f_\theta(y|z_i, z_j)$, which are conditionally independent given the $z_i$'s. For example, in the stochastic block model (SBM), $A = [n] \times [n]$ and each $z_i$ belongs to one of $K$ communities. In addition, $y_{ij} \sim \mathrm{Bernoulli}(\theta_{z_i z_j})$ indicates whether there is an observed link between nodes $i$ and $j$. Usually, $\theta_{z_i z_j}$ takes a larger value when $z_i$ and $z_j$ belong to the same community. Thus the parameter $\theta = (\theta_{kl},\ k, l \in [K])$ can be viewed as a $K$ by $K$ symmetric edge probability matrix.

On an online social platform, different users are observed to interact over a period of time. Then $A$ can be viewed as a friendship list: only friends can send messages to each other. For each pair $(i, j)$, the observed data $y_{ij}$ is the sequence of events that happen between this pair of users. A continuous-time counting process is usually well suited to capture the event sequence $y_{ij}$. In the literature, Poisson process models and Hawkes process models are widely used to capture such event dynamics. By assuming conditional independence between different node pairs, we can write the likelihood function as

$$L(\theta) = \sum_{z} \prod_{i=1}^n p(z_i) \Big\{\prod_{(i,j) \in A} f_\theta(y_{ij}|z_i, z_j)\Big\}, \qquad (3)$$

where $z = (z_1, \dots, z_n)$, $p(z_i)$ is the prior probability of latent class $z_i$, and the formula of $f_\theta(y_{ij}|z_i, z_j)$ depends on the model specification (e.g., Poisson or Hawkes process).

In modern applications, the statistical model may admit certain structure (e.g., sparsity or monotonicity), so regularization terms can be imposed on the parameter $\theta$. We use $R(\theta)$ to represent a general regularization term. The regularized maximum likelihood estimator is then defined as

$$\hat\theta = \arg\min_\theta\ \{-\log L(\theta) + R(\theta)\}. \qquad (4)$$

When $R(\theta) \equiv 0$, $\hat\theta$ becomes the maximum likelihood estimator (MLE). For simplicity, we write $F(\theta) := -\log L(\theta)$ and $P(\theta) := -\log L(\theta) + R(\theta)$ in the rest of the paper. Then $\hat\theta$ is the minimizer of $P(\theta)$.

3. Algorithms

We consider a gradient-based method to estimate the parameters in the latent variable model setting when individuals are assumed to be mutually independent. To compute the gradient, we use a variance reduction technique to accelerate the optimization procedure. We first recall Fisher's identity,

$$\nabla \log L_i(\theta) = \int \{\nabla \log f_\theta(y_i, z_i)\}\, p_\theta(z_i|y_i)\, dz_i,$$

which is a useful tool in maximum likelihood parameter estimation problems. In other words, $H_i(\theta, z_i) := -\nabla \log f_\theta(y_i, z_i)$ with $z_i \sim p_\theta(z|y_i)$ is an unbiased estimator of $-\nabla \log L_i(\theta)$. Thus, the negative gradient $-\nabla \log L(\theta)$ can be approximated by $\sum_i H_i(\theta, z_i)$. As the data sample grows large, it can be time-consuming to compute this full gradient. A naive acceleration is stochastic gradient descent (SGD), which uses a single $H_i(\theta, z_i)$ as the approximate gradient in each iteration. However, the variance of $H_i(\theta, z_i)$ does not vanish as we increase the sample size $n$. Here, we consider an alternative variance-reduced gradient to avoid this issue.
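Fisher's identity is straightforward to operationalize when the posterior is tractable. Below is a self-contained sketch, assuming a two-component Gaussian mixture with known weight and variance (our illustrative choice, not the paper's): draw $z_i \sim p_\theta(z|y_i)$, then evaluate the complete-data negative score.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_posterior_z(y, mu, p=0.5, sigma=1.0):
    """Draw z in {0, 1} from the exact posterior p_theta(z|y) of a
    two-component Gaussian mixture with means mu[0], mu[1]."""
    w = np.array([p * np.exp(-0.5 * (y - mu[0]) ** 2 / sigma ** 2),
                  (1 - p) * np.exp(-0.5 * (y - mu[1]) ** 2 / sigma ** 2)])
    return rng.choice(2, p=w / w.sum())

def H_i(theta, y, z, sigma=1.0):
    """H_i(theta, z) = -d/dtheta log f_theta(y, z), the complete-data
    negative score with respect to the component means theta = (mu_0, mu_1)."""
    g = np.zeros(2)
    g[z] = -(y - theta[z]) / sigma ** 2
    return g

# By Fisher's identity, averaging H_i over z ~ p_theta(z|y_i) and over i
# yields an unbiased estimate of the (per-sample) gradient of F = -log L.
theta = np.array([-1.0, 2.0])
ys = np.concatenate([rng.normal(-1, 1, 500), rng.normal(2, 1, 500)])
grad = np.mean([H_i(theta, y, sample_posterior_z(y, theta)) for y in ys], axis=0)
print(grad)  # close to zero when theta is near a stationary point
```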
The main procedure is described as follows. We first randomly choose an initial point $\theta^0$ and set the snapshot parameter $\tilde\theta^0 = \theta^0$. In iteration $s$, we first sample a batch set of data $B_s$ with size $n_1 < n$. For each data point $i \in B_s$, we sample its corresponding latent variable $z_i$ according to the posterior distribution $p_{\tilde\theta^s}(z_i|y_i)$ (or any approximate distribution). We then compute a snapshot stochastic gradient, $f^{s+1} = \frac{1}{n_1}\sum_{i \in B_s} H_i(\tilde\theta^s, z_i)$. Within iteration $s$, we further sequentially update the parameter $m$ times. For each $t \in [m]$, we denote the current parameter value as $\theta^{s+1}_t$. We randomly pick a data sample $i^{s+1}_t$ from $\{1, \dots, n\}$ and sample the corresponding latent variable $z_{i^{s+1}_t}$ according to the posterior $p_{\theta^{s+1}_t}(z|y_i)$. We then define the variance-reduced gradient as

$$v^{s+1}_t = H_{i^{s+1}_t}(\theta^{s+1}_t, z_{i^{s+1}_t}) - H_{i^{s+1}_t}(\tilde\theta^s, z_{i^{s+1}_t}) + f^{s+1}.$$

The scalar $\gamma$ represents the step size/learning rate. See Algorithm 1 for the complete procedure.

Algorithm 1 Latent Stochastic Gradient Algorithm
1: Input: observations $(y_i,\ i \in [n])$.
2: Output: estimated parameter $\hat\theta$.
3: Set initial parameter $\theta^0$ and let $\theta^0_0 = \tilde\theta^0 = \theta^0$.
4: for $s = 0$ to $S - 1$ do
5: $\theta^{s+1}_0 = \theta^s_m$
6: Sample a subset $B_s \subset \{1, \dots, n\}$ with size $n_1$.
7: Sample $z_i$ according to the posterior/approximate distribution for $i \in B_s$.
8: Compute $f^{s+1} = \frac{1}{n_1}\sum_{i \in B_s} H_i(\tilde\theta^s, z_i)$.
9: for $t = 0$ to $m - 1$ do
10: Uniformly randomly pick $i^{s+1}_t$ from $\{1, \dots, n\}$.
11: Sample $z_{i^{s+1}_t}$ according to the posterior distribution $p_{\theta^{s+1}_t}(z|y_i)$ or an approximate distribution.
12: Compute $v^{s+1}_t = H_{i^{s+1}_t}(\theta^{s+1}_t, z_{i^{s+1}_t}) - H_{i^{s+1}_t}(\tilde\theta^s, z_{i^{s+1}_t}) + f^{s+1}$.
13: Smooth case: update $\theta^{s+1}_{t+1} = \theta^{s+1}_t - \gamma v^{s+1}_t$. Non-smooth case: update $\theta^{s+1}_{t+1} = \mathrm{prox}_{\gamma R}(\theta^{s+1}_t - \gamma v^{s+1}_t)$.
14: end for
15: Set $\tilde\theta^{s+1} = \theta^{s+1}_m$.
16: end for

In contrast with the classical stochastic variance-reduced gradient algorithm (SVRG), some important features of Algorithm 1 are discussed as follows (a runnable sketch of the whole loop is given after this list).

1. Algorithm 1 is constructed under the assumption that the data are independent and identically distributed and follow a parameterized statistical model, while the usual SVRG does not require any model assumption. This is due to the latent structure and the construction of the snapshot gradient: $f^{s+1}$ depends on the batch set $B_s$ (instead of $[n]$).

2. In the variance reduction step, the index $i^{s+1}_t$ is sampled from $\{1, \dots, n\}$ instead of $B_s$ (the latter also works). This is helpful since it allows the algorithm to explore the whole data set faster.

3. Note that the same $z_{i^{s+1}_t}$ is plugged into both $H_{i_t}(\theta^{s+1}_t, z)$ and $H_{i_t}(\tilde\theta^s, z)$ in the variance-reduced gradient. This is also crucial, since $H_{i_t}(\theta^{s+1}_t, z_{i^{s+1}_t}) - H_{i_t}(\tilde\theta^s, z_{i^{s+1}_t})$ has smaller variance than $H_{i_t}(\theta^{s+1}_t, z_{i^{s+1}_t}) - H_{i_t}(\tilde\theta^s, z^s_{i_t})$, where $z^s_{i_t}$ is the latent variable used in the previous iteration.

4. In fact, we do not need to compute the exact posterior distribution $p_{\theta^{s+1}_t}(z|y_i)$. It suffices to find an approximate distribution $p^{s+1}_t(z)$ such that $\|p^{s+1}_t(z) - p_{\theta^{s+1}_t}(z|y_i)\|_{TV} = O(\frac{1}{n_1})$. This largely reduces the computational burden when the explicit form of $p_{\theta^{s+1}_t}(z|y_i)$ is not easily obtained. In such cases, we can use either quasi-Monte Carlo (Caflisch et al., 1998; Owen and Glynn, 2016) or Markov Chain Monte Carlo methods (Andrieu et al., 2003; Liu, 2008; Robert and Casella, 2013) with the ergodicity property to construct the approximate distribution.
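Putting the pieces together, the following sketch mirrors the smooth-case loop of Algorithm 1; `H` and `sample_posterior` stand in for model-specific routines (for instance, the Gaussian-mixture helpers sketched earlier), and all loop sizes are illustrative defaults rather than the paper's settings.

```python
import numpy as np

def latent_svrg(ys, theta0, H, sample_posterior, S=20, m=50, n1=50, gamma=0.01):
    """Smooth-case sketch of Algorithm 1: SVRG-style updates with latent
    variables drawn from (approximate) posterior distributions."""
    rng = np.random.default_rng(2)
    n = len(ys)
    theta = np.asarray(theta0, dtype=float).copy()
    for s in range(S):
        theta_snap = theta.copy()                      # snapshot parameter
        batch = rng.choice(n, size=n1, replace=False)  # batch B_s
        zs = [sample_posterior(ys[i], theta_snap) for i in batch]
        # snapshot gradient f^{s+1} evaluated at the snapshot parameter
        f_snap = np.mean([H(theta_snap, ys[i], z) for i, z in zip(batch, zs)],
                         axis=0)
        for t in range(m):
            i = rng.integers(n)                        # uniform index i_t
            z = sample_posterior(ys[i], theta)         # z ~ p_theta(z|y_i)
            # the same z enters both H terms (the key variance-reduction step)
            v = H(theta, ys[i], z) - H(theta_snap, ys[i], z) + f_snap
            theta = theta - gamma * v
    return theta

# usage with the Gaussian-mixture helpers sketched earlier:
# theta_hat = latent_svrg(ys, [-0.5, 1.0], H_i, sample_posterior_z)
```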
Connections to stochastic variance-reduced methods. Variance reduction (Owen and Zhou, 2000) has proved useful in integration approximation, and the technique is also widely adopted in optimization. Johnson and Zhang (2013) proposed the stochastic variance-reduced gradient method (SVRG), which empirically outperforms SGD and batch methods. Later on, a group of researchers (Reddi et al., 2016; Allen-Zhu and Hazan, 2016) established new theory for SVRG and showed that the variance-reduced method converges faster than SGD and the full gradient method. In recent years, stochastic variance-reduced gradient Hamiltonian Monte Carlo and Langevin dynamics methods (Dubey et al., 2016; Zou et al., 2018; Xie et al., 2021; Zhao et al., 2021) have been proposed to solve the optimization problem from a Bayesian view. Stochastic variance-reduced expectation-maximization methods (Zhu et al., 2017; Karimi et al., 2019; Karimi and Li, 2021) have also been developed for solving optimization problems under exponential family models. Our method can be viewed as an extension of SVRG to settings with latent structure. It is designed for general latent variable models in which both the observed responses and the latent variables may be either discrete or continuous. Additionally, we do not require computing the exact posterior distributions. Hence it is more widely applicable.

Connections to stochastic approximation. The proposed method is also closely related to the stochastic approximation approach, first proposed in Robbins and Monro (1951) and Kiefer and Wolfowitz (1952), and its variants in Gu and Kong (1998) and Cai (2010) that are specially designed for latent variable model estimation. Both methods (Gu and Kong, 1998; Cai, 2010) approximate the original Robbins-Monro method by using MCMC sampling to generate an approximate stochastic gradient in each iteration, when an unbiased stochastic gradient is difficult to obtain. However, these methods do not handle non-smooth objective functions. In addition, the computational cost is high if we compute the posterior distribution based on the entire dataset, and the convergence rate becomes slow.

Connections to perturbed gradient algorithms. Our method is also related to perturbed gradient algorithms, since we do not require computing the posterior distribution exactly. The perturbed proximal gradient algorithm (Atchadé et al., 2017; Zhang and Chen, 2020) solves a similar optimization problem when the gradient is intractable and is approximated by Monte Carlo methods. Their method combines stochastic approximation, proximal gradient descent, and Polyak-Ruppert averaging (Polyak, 1990; Ruppert, 1988). The theoretical analysis of Atchadé et al.'s work focuses on convex optimization, while we consider the more general setting of non-convex optimization, which includes a wide range of latent variable model estimation problems as special cases.

Comparison with classical methods. The gradients computed by different methods are summarized as follows.

Proposed method: $H_{i_t}(\theta^{s+1}_t, z_{i^{s+1}_t}) - H_{i_t}(\tilde\theta^s, z_{i^{s+1}_t}) + f^{s+1}$.
Stochastic gradient descent: $H_{i_t}(\theta^{s+1}_t, z_{i^{s+1}_t})$.
Stochastic batch method: $f^{s+1}$.

When $\theta^s_t$ becomes stable (i.e., $\|\theta^{s+1}_t - \theta^s_t\| = o(1)$), the gradient of SGD is not necessarily close to zero due to the randomness of $i^{s+1}_t$, while the gradients of the proposed method and the batch method are guaranteed to be $o(1)$ under mild statistical assumptions. On the other hand, the batch method may fail to converge when the step size $\gamma$ is set to a large value in practice, whereas the proposed method can still maintain relatively good performance. For methods of other types, the comparison is less straightforward. For example, expectation-maximization (EM) requires computing the exact posterior distribution, and variational inference-based methods (Attias, 1999) give a biased estimator if an unfavorable variational family is adopted.

4. Theoretical Analysis

In this section, we provide theoretical analysis for the proposed algorithm.
Before introducing the main results, we first present several assumptions.

A1 The variables $(y_i, z_i)$'s are independent and identically distributed with density $f_\theta(y, z) = f_\theta(y|z)p(z)$.
A2 The parameter $\theta$ lies in a compact set $B \subset \mathbb{R}^p$.
A3 The density $f_\theta(y|z)$ is a smooth function of $\theta$ for all $z$. (Here, smooth means the function can be differentiated arbitrarily many times; this assumption is satisfied by most statistical models.)
A4 $\mathbb{E}_{y \sim f_{\theta^*}}[\log f_\theta(y)]$ is a strictly concave function of $\theta$ over the set $B$.

4.1. Smooth Case

We consider the situation with no regularization term, that is, $R(\theta) \equiv 0$. We define the stopping time $T(\epsilon) := \min\{s : \min_t \mathbb{E}\|\nabla F(\theta^s_t)\|^2 \le \epsilon\}$, where $\mathbb{E}$ is the conditional expectation given the data and $\{B_s\}$. (Note that lines 10 and 11 in Algorithm 1 contain the random index $i^{s+1}_t$ and the latent variable $z_{i^{s+1}_t}$; we take the expectation with respect to these random variables.) In other words, $T(\epsilon)$ indicates the first time that $\mathbb{E}\|\nabla F(\theta^s_t)\|^2$ falls below the threshold $\epsilon$.

Theorem 1 Under Assumptions A1 - A4,
$$T(\epsilon) \le C_1\, \frac{F(\theta^0) - F(\hat\theta)}{m\gamma\epsilon}$$
holds for any $\epsilon = \Omega(\max\{\tfrac{1}{n_1}, (m\gamma)^2\})$, with probability going to 1 as $n \to \infty$. Here, $C_1$ is a universal constant.

Theorem 1 gives a theoretical guarantee for the convergence of Algorithm 1. The quantity $\mathbb{E}\|\nabla F(\theta^s_t)\|^2$ eventually drops below the threshold $\epsilon$ when $\epsilon$ is at least of order $\frac{1}{n_1}$ and $(m\gamma)^2$. Note that the estimator does not necessarily converge to the global optimal solution; it may instead converge to a stationary point. The first term $\frac{1}{n_1}$ comes from the sampling noise of the batch $B_s$, and the second term $(m\gamma)^2$ appears because gaps arise from computing the posterior distributions under different $\theta^s_t$'s within each iteration. In each iteration, we need to compute $n_1$ gradients to construct the snapshot and $m$ variance-reduced gradients. Therefore, the total computational complexity is $O((n_1 + m)T(\epsilon))$. With a special choice of $n_1$, $m$, and $\gamma$, we have the following corollary.

Corollary 1 Let $n_1 = n^{\frac{2}{3}\alpha}$, $m = n^{\frac{2}{3}\alpha}$, and $\gamma = n^{-\alpha}$. Then the total computational complexity is $O(n^{\alpha}/\epsilon)$ for any $\epsilon = \Omega(n^{-2\alpha/3})$.

From Corollary 1, we know that the total computational cost decreases as $\alpha$ decreases, while the error term $\epsilon$ becomes larger. Hence, we should avoid choosing too small an $\alpha$, to prevent a large bias.

Refinement of the estimator. The proposed algorithm has better computational complexity; however, it can incur a larger estimation error when $\alpha$ is small. We now refine the estimator returned by Algorithm 1 so that the refined estimator is root-$n$ consistent. Specifically, we use second-order information to correct $\hat\theta$. Recall Louis's identity (Louis, 1982),

$$\frac{\partial^2 \log f_\theta(y)}{\partial\theta\,\partial\theta^T} = \mathbb{E}\Big[\frac{\partial^2 \log f_\theta(y, z)}{\partial\theta\,\partial\theta^T} + \frac{\partial \log f_\theta(y, z)}{\partial\theta}\Big(\frac{\partial \log f_\theta(y, z)}{\partial\theta}\Big)^T \,\Big|\, y, \theta\Big] - \mathbb{E}\Big[\frac{\partial \log f_\theta(y, z)}{\partial\theta} \,\Big|\, y, \theta\Big]\,\mathbb{E}\Big[\frac{\partial \log f_\theta(y, z)}{\partial\theta} \,\Big|\, y, \theta\Big]^T.$$

This identity implies that we can compute the Hessian matrix via posterior approximation: plugging posterior draws $z_i$ into the right-hand side and averaging over $i$ yields a Monte Carlo approximation, denoted $\hat H(\theta)$, of the Hessian $\nabla^2 F(\theta)$. Our two-step refinement is specified as follows. Let $\theta^{r1}$ be the Newton-type correction $\theta^{r1} := \hat\theta - \hat H(\hat\theta)^{-1} H(\hat\theta)$, where $\hat\theta$ is the estimator obtained from Algorithm 1 and lies in an $n^{-\alpha/3}$-neighborhood of $\theta^*$, and $H(\theta)$ is the Monte Carlo gradient built from the $H_i$'s. We further define $\theta^{r2} := \theta^{r1} - \hat H(\theta^{r1})^{-1} H(\theta^{r1})$. By this construction, we can show that $\theta^{r2}$ is root-$n$ consistent and asymptotically normal when $3/4 < \alpha \le 3/2$.

Theorem 2 When $3/4 < \alpha \le 3/2$, $\theta^{r1}$ is a $\sqrt{n}$-consistent estimator and $\sqrt{n}(\theta^{r2} - \theta^*) \rightarrow N(0, I^{-1}(\theta^*)V(\theta^*)I^{-1}(\theta^*))$, where $I(\theta^*)$ is the Fisher information matrix and $V(\theta^*) := \mathbb{E}[H(\theta^*)H(\theta^*)^T]$. Here, the variance $I^{-1}(\theta^*)V(\theta^*)I^{-1}(\theta^*)$ is larger than the Cramér-Rao lower bound, since we need to sample the latent variables $z_i$'s.
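A schematic of the two-step refinement follows, assuming hypothetical helpers `mc_grad` and `mc_hess` that return Monte Carlo approximations of $\nabla F$ (via Fisher's identity) and $\nabla^2 F$ (via Louis's identity); read this as a sketch of a Newton-type correction consistent with the text above, not the paper's exact implementation.

```python
import numpy as np

def refine(theta_hat, mc_grad, mc_hess):
    """Two-step Newton-type refinement of the Algorithm 1 output.
    mc_grad(theta): Monte Carlo estimate of the gradient of F (Fisher's identity).
    mc_hess(theta): Monte Carlo estimate of the Hessian of F (Louis's identity)."""
    theta_r1 = theta_hat - np.linalg.solve(mc_hess(theta_hat), mc_grad(theta_hat))
    theta_r2 = theta_r1 - np.linalg.solve(mc_hess(theta_r1), mc_grad(theta_r1))
    return theta_r2
```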
4.2. Non-smooth Case

Next we consider the situation where $R(\theta) \neq 0$. We define the stopping time $T(\epsilon) := \min\{s : \min_t \mathbb{E}[P(\theta^s_t)] - P(\hat\theta) \le \epsilon\}$, where $\mathbb{E}$ is still the conditional expectation given the data and $\{B_s\}$. The quantity $T(\epsilon)$ is the first time that $\mathbb{E}[P(\theta^s_t)]$ drops below $P(\hat\theta)$ plus the threshold $\epsilon$.

A5 Assume that $P(\theta)$ is strongly convex on $B_1 = \{\beta \in B : \|\beta - \hat\theta\| \le \delta_0\}$, where $\delta_0 := \|\hat\theta - \theta^0\|$.

Theorem 3 Under Assumptions A1 - A5,
$$T(\epsilon) \le C_2\, \frac{P(\theta^0) - P(\hat\theta)}{m\gamma\epsilon}$$
holds for any $\epsilon = \Omega(\max\{\tfrac{1}{n_1}, m\gamma\})$, with probability going to 1 as $n \to \infty$. Here, $C_2$ is a universal constant.

Theorem 3 gives the convergence guarantee when the initial point lies in a region where the objective is strongly convex. With a special choice of $m$, $n_1$, and $\gamma$, we have the following corollary.

Corollary 2 Let $n_1 = n^{\frac{2}{3}\alpha}$, $m = n^{\frac{2}{3}\alpha}$, and $\gamma = n^{-\alpha}$. Then the total computational complexity is $O(n^{\alpha}/\epsilon)$.

Application to sparse learning. In particular, we take $R(\theta)$ to be the $\ell_1$ norm $\tau\|\theta\|_1$ to enforce a sparse solution. That is, we aim to solve

$$\hat\theta = \arg\min_\theta\ -\log L(\theta) + \tau\|\theta\|_1, \qquad (5)$$

where $\tau$ is a tuning parameter controlling the penalty level. The proposed algorithm can then recover the true support set of $\theta^*$.
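For $R(\theta) = \tau\|\theta\|_1$, the proximal map in line 13 of Algorithm 1 reduces to coordinatewise soft-thresholding; a one-line sketch (the toy values are ours):

```python
import numpy as np

def prox_l1(theta, gamma, tau):
    """prox_{gamma * tau * ||.||_1}(theta): coordinatewise soft-thresholding."""
    return np.sign(theta) * np.maximum(np.abs(theta) - gamma * tau, 0.0)

print(prox_l1(np.array([0.3, -0.02, 1.0]), gamma=0.1, tau=0.5))
# -> [0.25, -0.0, 0.95]; coordinates within gamma*tau of zero are zeroed out
```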
We define $S^* := \mathrm{supp}(\theta^*)$ as the support of $\theta^*$ and $\hat S := \mathrm{supp}(\hat\theta)$ as the estimated support. We further let $S^*_1$ be the set of indices where the true value of $\theta^*$ is non-zero and $S^*_0$ be the set of indices where the true value is zero. For notational simplicity, we define $\theta_{(1)} = \theta[S^*_1]$ and $\theta_{(0)} = \theta[S^*_0]$. We can then write the gradient and Hessian matrix in block form:

$$\nabla F(\theta) = \begin{pmatrix} \nabla_1 F(\theta) \\ \nabla_0 F(\theta) \end{pmatrix}, \qquad \nabla^2 F(\theta) = \begin{pmatrix} \nabla^2_{11} F(\theta) & \nabla^2_{10} F(\theta) \\ \nabla^2_{01} F(\theta) & \nabla^2_{00} F(\theta) \end{pmatrix},$$

where $\nabla_1 F(\theta)$ is the subvector of the gradient corresponding to $\theta_{(1)}$ and $\nabla^2_{11} F(\theta)$ is the block of the Hessian matrix corresponding to $\theta_{(1)}$; $\nabla_0 F(\theta)$, $\nabla^2_{10} F(\theta)$, $\nabla^2_{01} F(\theta)$, and $\nabla^2_{00} F(\theta)$ are defined in the same fashion. We then introduce the following irrepresentable condition (Zhao and Yu, 2006).

A6 Assume there exists a positive constant $\eta$ such that $|\nabla^2_{01} F(\theta^*)\,(\nabla^2_{11} F(\theta^*))^{-1}\,\mathrm{sign}(\theta^*_{(1)})| \preceq 1 - \eta$. Here $\preceq$ means that the inequality holds elementwise.

Given this irrepresentable condition, we then have the following result.

Theorem 4 Under Assumptions A1 - A6, with $\theta^0$ in a neighborhood of $\theta^*$, $\hat S = S^*$ holds with probability tending to 1 as $n \to \infty$, if we set $\tau = n^{\kappa}$ for a suitable $\kappa$ depending on $\alpha$.

5. Analysis in the Network Case

The analysis and algorithm in the previous sections rely on the assumption that individuals are independent and identically distributed. In latent network models, however, the individuals are no longer mutually independent. In this section, we aim to solve the optimization problem in the network setting.

5.1. Algorithm

Similar to Algorithm 1, we consider a gradient-based method with a variance reduction technique. The main procedure is summarized as follows. We first randomly choose an initial point $\theta^0$ and sample an initial latent vector $z^0$. In iteration $s$, we first randomly sample a batch set of data $B_s$ with size $n_1 < n$. For each $i \in B_s$, we sample its latent variable $z_i$ according to the posterior distribution $p_\theta(z_i|y_i, z_{-i})$ and update the latent vector to get $z^s$. We then compute a snapshot stochastic gradient, $H^{s+1} = \frac{1}{n_1}\sum_{i \in B_s} H_i(\tilde\theta^s, z^s)$, where $H_i(\theta, z) := -\sum_{j:(i,j) \in A} \nabla \log f_\theta(y_{ij}|z_i, z_j)$. Within iteration $s$, we further sequentially update the parameter $m$ times. For each $t \in [m]$, we compute the variance-reduced gradient by a procedure similar to that in Algorithm 1. Again, the scalar $\gamma$ represents the step size/learning rate. See Algorithm 2 for the complete procedure.

Algorithm 2 Latent Network Stochastic Gradient Algorithm
1: Input: observations.
2: Output: estimated parameter.
3: Initialization: set initial parameter $\theta^0$ and let $\theta^0_0 = \tilde\theta^0 = \theta^0$. For $i = 1, \dots, n$, sample $z^0_i$ according to the initial prior distribution. Denote $z^0 = (z^0_1, \dots, z^0_n)$.
4: for $s = 0$ to $S - 1$ do
5: $\theta^{s+1}_0 = \theta^s_m$
6: Sample a subset $B_s \subset \{1, \dots, n\}$ with size $n_1$.
7: Sample new $z_i$ according to the approximate posterior distribution for $i \in B_s$ and replace the old values to get $z^s$.
8: Compute $H^{s+1} = \frac{1}{n_1}\sum_{i \in B_s} H_i(\tilde\theta^s, z^s)$.
9: for $t = 0$ to $m - 1$ do
10: Randomly pick $i^{s+1}_t$ from $\{1, \dots, n\}$ according to the distribution $p_i \propto d_i$.
11: Sample the latent variable $z_{i^{s+1}_t}$ according to the current posterior distribution to replace its old value and get $z^{s+1}_t$.
12: Compute $v^{s+1}_t = \frac{1}{d_{i^{s+1}_t}}\{H_{i^{s+1}_t}(\theta^{s+1}_t, z^{s+1}_t) - H_{i^{s+1}_t}(\tilde\theta^s, z^{s+1}_t)\} + H^{s+1}$.
13: Update $\theta^{s+1}_{t+1} = \theta^{s+1}_t - \gamma v^{s+1}_t$.
14: end for
15: Set $\tilde\theta^{s+1} = \theta^{s+1}_m$.
16: end for

Several distinct features are described here, and a schematic implementation is given below. 1) We need to maintain the vector of latent memberships, since the individuals are no longer independent of each other; the conditional posterior distribution of $z_i$ depends on the node(s) that $i$ connects to. 2) In the outer loop $s$, each node has equal probability of being included in the batch set, so that every node has a chance to get its latent membership updated even if it has small degree. 3) In the inner loop, the node is sampled according to its degree. Therefore, a node with larger degree is more likely to impact the gradient.
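The sketch below mirrors Algorithm 2; `H`, `sample_posterior_i`, and the adjacency-list representation are our own stand-ins for model-specific routines, and the code assumes every node has at least one edge.

```python
import numpy as np

def latent_network_svrg(neighbors, H, sample_posterior_i, theta0, z0,
                        S=20, m=50, n1=50, gamma=0.01):
    """Sketch of Algorithm 2. neighbors[i] lists the nodes adjacent to i.
    H(theta, i, z) is the edge-summed negative score of node i;
    sample_posterior_i draws a new z_i given theta and the current
    memberships of i's neighbors."""
    rng = np.random.default_rng(3)
    n = len(neighbors)
    deg = np.array([len(nb) for nb in neighbors], dtype=float)
    p_deg = deg / deg.sum()                  # inner-loop sampling prop. to degree
    theta, z = np.asarray(theta0, dtype=float).copy(), list(z0)
    for s in range(S):
        theta_snap = theta.copy()
        batch = rng.choice(n, size=n1, replace=False)   # uniform over nodes
        for i in batch:                                 # refresh z_i for i in B_s
            z[i] = sample_posterior_i(i, theta_snap, z)
        H_snap = np.mean([H(theta_snap, i, z) for i in batch], axis=0)
        for t in range(m):
            i = rng.choice(n, p=p_deg)                  # degree-weighted index
            z[i] = sample_posterior_i(i, theta, z)
            # same latent vector in both terms, scaled by the node degree
            v = (H(theta, i, z) - H(theta_snap, i, z)) / deg[i] + H_snap
            theta = theta - gamma * v
    return theta
```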
5.2. Analysis

We establish the convergence property of Algorithm 2 in this section. We first introduce several assumptions.

B1 The parameter $\theta$ lies in a compact set $B \subset \mathbb{R}^p$.
B2 The density $f_\theta(y_{ij}|z_i, z_j)$ is a smooth function of $\theta$ for all $z_i, z_j$'s.
B3 $\mathbb{E}_{y \sim f_{\theta^*}}[\log f_\theta(y)]$ is a strictly concave function of $\theta$ over the set $B$.
B4 Let $d_{\min} := \min_{i \in [n]} d_i$ be the minimal degree. Assume $d_{\min} = n^{\alpha_0}$ ($\alpha_0 > 0$).

We define $T(\epsilon) := \min\{s : \min_t \mathbb{E}\|\nabla F(\theta^s_t)\|^2 \le \epsilon\}$. We then have the following local convergence result.

Theorem 5 Under Assumptions B1 - B4, there exists $\delta$ such that
$$T(\epsilon) \le C\, \frac{F(\theta^0) - F(\hat\theta)}{m\gamma\epsilon}$$
holds for any $\epsilon = \Omega(\max\{\tfrac{1}{n_1 n^{\alpha_0}}, (m\gamma)^2\})$ and $\theta^0 \in B(\theta^*, \delta)$, with probability tending to 1 as $n \to \infty$.

Since optimization over a graph is in general an NP-hard problem, Theorem 5 only guarantees that the estimator converges to the global optimal solution when the initial point is not far from the true one. Given a special choice of $m$, $n_1$, and $\gamma$, we have the following corollary.

Corollary 3 Additionally, assume $d_{\min} \asymp d_{\max} = n^{\alpha_0}$, where $d_{\max} := \max_{i \in [n]} d_i$. Let $m = n^{\alpha_1}$, $\gamma = n^{-\alpha}$, $n_1 = n^{2(\alpha - \alpha_1)}/d_0$ with $d_0 := n^{\alpha_0}$ the common order of the degrees, and $\alpha_1 = \frac{2}{3}(\alpha - \alpha_0)$; then the total computational complexity is $O(n^{\alpha + \alpha_0}/\epsilon)$ for $\alpha_0/2 < \alpha < 1$.

To end this section, we provide a brief discussion of gradient stability. It is known that nodes with larger degrees impact the gradient more. For networks with a few high-degree nodes and many low-degree nodes, this can make the gradient computation unstable. Let $N_{\mathrm{small}}$ be the set of nodes whose degree is $n^{o(1)}$. If we additionally assume that $(\sum_{i \in N_{\mathrm{small}}} d_i)/|A| \to 0$, then the gradient is not affected too much by low-degree nodes and thus remains stable. It remains an open question how unstable the gradient is when the graph becomes very sparse, i.e., $\liminf_n (\sum_{i \in N_{\mathrm{small}}} d_i)/|A| > 0$.

6. Numerical Experiments

DINA model. We consider the deterministic-input, noisy-and-gate (DINA; Rupp and Templin, 2008) model, a special restricted latent class model widely applied in psychometrics and educational testing to diagnose examinees. Suppose there are $J$ items and let $Y_j$ be the response to the $j$-th item. $Y_j$ takes values in $\{0, 1\}$: 1 means correct and 0 means incorrect. The model adopts the following formulation. Each examinee is associated with a latent vector $Z = (\alpha_1, \dots, \alpha_K) \in \{0,1\}^K$, where $\alpha_k$ is interpreted as the $k$-th skill/attribute and $K$ is the total number of attributes. Then

$$P(Y_j = 1|Z) = (1 - s_j)^{\xi_j(Z, Q)}\, g_j^{1 - \xi_j(Z, Q)}, \qquad (6)$$

where $Q$ is a binary matrix specifying the relationship between items and attributes, and $\xi_j(Z, Q) = \prod_k 1\{Z_k \ge Q_{jk}\}$. In other words, if the examinee has all attributes required by the $j$-th item, he/she will answer the $j$-th item correctly with the higher probability $(1 - s_j)$; otherwise, he/she answers correctly only with probability $g_j$. Thus $s_j$ and $g_j$ can be interpreted as the slipping and guessing parameters.
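The DINA response probability (6) is simple to compute; a minimal sketch (ours, with toy values for $Q$, $s$, and $g$):

```python
import numpy as np

def dina_prob(Z, Q, s, g):
    """P(Y_j = 1 | Z) under the DINA model (6).
    Z: (K,) binary attribute profile; Q: (J, K) binary Q-matrix;
    s, g: (J,) slipping and guessing parameters."""
    xi = np.all(Z >= Q, axis=1)          # xi_j = prod_k 1{Z_k >= Q_jk}
    # (1 - s)^xi * g^(1 - xi) reduces to a selection for binary xi
    return np.where(xi, 1.0 - s, g)

# toy usage: an examinee mastering only the first of K = 2 attributes
Q = np.array([[1, 0], [0, 1], [1, 1]])
Z = np.array([1, 0])
print(dina_prob(Z, Q, s=np.full(3, 0.1), g=np.full(3, 0.2)))
# -> [0.9 0.2 0.2]: only the first item's requirements are fully mastered
```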
We generate data from the DINA model with $n = 2000$ individuals and $J = 30$ items. The true $s_j$'s and $g_j$'s are generated within $[0, 0.2]$. We compare the proposed method with the batch method (the gradient is computed based on the batch set instead of using the variance-reduced gradient) and the full batch method (the gradient is computed based on the full data set). We set $m = n_1 = n^{2\alpha/3}$ and $\gamma = n^{-\alpha}$ ($\alpha = 1.2$). For the batch and full batch methods, we scale the step sizes so that they roughly have the same magnitude per sample. The results are reported in Figure 1. The proposed method converges faster than the other two methods. We can also see that the proposed method achieves a faster convergence rate as $\alpha$ decreases, while the error becomes larger. This is consistent with our theoretical results.

[Figure 1. Simulation results for the DINA, bifactor, and latent network models. The plots in the left column compare the three methods (Proposed, Batch, Full); the plots in the right column show the solution paths under different $\alpha$'s, which take values in $\{0.7, 0.8, 0.9, 1.0\}$. Here "# Data Passes" is the cumulative number of samples divided by $n$.]

Bifactor model. The bifactor model (Reise, 2012) is a latent factor model whose loading matrix admits a special structure. The first column of the loading matrix is known as the main factor/dimension; the remaining columns are known as sub-domain factors/dimensions. Unlike in the DINA model, the latent variable $Z$ is continuous rather than discrete. The bifactor model postulates

$$P(Y_j = 1|Z) = \frac{\exp\{a_{j0} + a_j^T Z\}}{1 + \exp\{a_{j0} + a_j^T Z\}}, \qquad (7)$$

where $Z \in \mathbb{R}^{G+1}$ follows $N(0, I_{(G+1) \times (G+1)})$. For each $j$, the loading $a_j$ has only one non-zero entry excluding the main dimension. Thus the model parameters are identifiable, and there is no rotational indeterminacy issue.

Knowing the positions of the non-zero entries of the loading matrix, we generate data from the bifactor model with $n = 2000$ and $J = 15$. The main factor loadings $a_{j0}$'s are sampled from $N(0, 2)$, and the non-zero entries of the testlet factor loadings $a_j$ are set to 0.5. We use a strategy similar to the DINA setting to choose $m$, $n_1$, and $\gamma$, with $\alpha = 0.9$. Note that it is prohibitively hard to compute the exact posterior distribution here; we instead use MCMC to sample the latent variables. From Figure 1, we observe that the proposed method again converges faster than the other two methods.

Latent network model. We consider a latent network model with $n = 400$ nodes and $K = 3$ latent classes. The edge list $A$ is constructed by randomly generating a subset of $[n] \times [n]$. For each pair $(i, j) \in A$, $y_{ij}$ is a homogeneous Poisson process over the time period $[0, T]$ ($T = 10$) with intensity parameter $\theta_{z_i z_j}$. Here $\theta_{kl} = \theta_{lk}$, and the true values are generated from the interval $[1, 5]$. We again compare the three methods and set $m = n_1 = n^{2\alpha/3}$ and $\gamma = n^{-\alpha}$ ($\alpha = 0.9$). The results are shown in Figure 1. From the solution paths, we can see that the numerical results match our theoretical findings: a smaller $\alpha$ leads to faster convergence and a slightly larger bias.

Refinement and support recovery. The performance of the refined estimator under the DINA model is shown in Figure 2. Here, the sample size varies from $n = 500$ to $n = 8000$. Based on the curve, the decay rate of the estimation error is almost $n^{-1/2}$, which is consistent with our theory. For the bifactor model, we further consider the situation where the positions of the non-zero entries are unknown. That is, we need to impose a regularization term to make the loading matrix sparse. A minimax concave penalty (MCP; Zhang, 2010) is considered:

$$R(\theta) = \begin{cases} \lambda|\theta| - \dfrac{\theta^2}{2a} & \text{if } |\theta| \le a\lambda, \\[4pt] \dfrac{a\lambda^2}{2} & \text{if } |\theta| > a\lambda. \end{cases}$$

With this choice, the formula for the proximal operator simplifies to

$$\mathrm{prox}_{\gamma R}(\theta) = \mathrm{sgn}(\theta)\,\frac{a}{a - 1}\,\max\{|\theta| - \lambda\gamma,\ 0\},$$

where $\mathrm{sgn}(x)$ represents the sign of the scalar $x$ and the tuning parameter $a$ is larger than 1.
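The simplified MCP proximal operator above is a one-liner in code; the following sketch applies it coordinatewise (toy values are ours; as stated in the text, the simplified formula targets the $|\theta| \le a\lambda$ regime):

```python
import numpy as np

def prox_mcp(theta, gamma, lam, a=3.0):
    """Coordinatewise proximal operator of the MCP penalty, using the
    simplified formula: sgn(theta) * a/(a-1) * max(|theta| - lam*gamma, 0)."""
    shrunk = np.maximum(np.abs(theta) - lam * gamma, 0.0)
    return np.sign(theta) * (a / (a - 1.0)) * shrunk

print(prox_mcp(np.array([-0.04, 0.2, 1.0]), gamma=0.1, lam=0.5))
# small coordinates are zeroed; surviving ones are rescaled by a/(a-1)
```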
The recovery results for the support of the loading matrix are shown in Figure 2. The proposed method identifies the non-zero positions well when a suitable penalty level is chosen.

[Figure 2. The left plot shows the estimation error of the refined estimator under different sample sizes. The right plot shows the support recovery of the loading matrix under different sample sizes ($\lambda = \log n/(n^{1/2}\gamma)$, $a = 3$).]

NESARC data. The data, extracted from the National Epidemiological Survey on Alcohol and Related Conditions (NESARC) and concerning social phobia, contain the responses of 728 respondents to 13 questions. We apply the DINA model to fit this data set. We compare the proposed method with stochastic gradient descent (SGD); the results are shown in Figure 3. Here we set $m = n^{\frac{2}{3}\alpha}$ and $\gamma = n^{-\frac{2}{3}\alpha}$ with $\alpha = 1.4$ for both methods. The solution of SGD has a larger variance, while the proposed method is more stable.

PISA data. The data were collected in the collaborative problem solving (CPS) test of the 2015 Programme for International Student Assessment (PISA). Students were chosen from all the OECD countries and regions where the English version of the exam was administered. The dataset contains 8856 students in total, each with responses to 29 questions. We use the bifactor model to fit the data set. The solution paths of the proposed method and SGD are given in Figure 3. In particular, we set $m = n^{\frac{2}{3}\alpha}$ and $\gamma = n^{-0.3\alpha}$ with $\alpha = 1.0$ for both methods. We can see that SGD has a relatively larger bias compared with the proposed method.

[Figure 3. The right plot is for the NESARC data and the left plot is for the PISA data. The error represents the difference between the current estimate and the optimal parameter (the optimal parameter is computed using the full batch method).]

7. Conclusion

In this paper, we consider an optimization problem for general latent variable models. Our proposed algorithm is a gradient-based method that adopts a variance reduction technique. The method does not require computing the exact posterior distribution, which increases computational efficiency. The theoretical analysis accommodates both smooth and non-smooth settings, and it covers different types of statistical assumptions (i.e., data that are independent and identically distributed, or that follow a network model). The numerical results match our theoretical findings. In future work, it is of interest to study model-specific algorithms, since the structures of different latent variable models/simple neural networks vary from one to another. In addition, one may also consider Adam (Kingma and Ba, 2015) or Nesterov's accelerated method (Nesterov, 1983) for computing the gradient; the variance-reduced step might then be further improved to accelerate the algorithm.

References

Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan M. Narayanamurthy, and Alexander J. Smola. Scalable inference in latent variable models. In Proceedings of the Fifth International Conference on Web Search and Web Data Mining (WSDM), pages 123-132, Seattle, WA, 2012.

Dennis J Aigner, Cheng Hsiao, Arie Kapteyn, and Tom Wansbeek. Latent variable models in econometrics. Handbook of Econometrics, 2:1321-1393, 1984.

Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 699-707, New York City, NY, 2016.

Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1):5-43, 2003.

Yves F Atchadé, Gersende Fort, and Eric Moulines. On perturbed proximal gradient algorithms. The Journal of Machine Learning Research, 18(1):310-342, 2017.

Hagai Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 21-30, Stockholm, Sweden, 1999.

David J Bartholomew, Martin Knott, and Irini Moustaki. Latent Variable Models and Factor Analysis: A Unified Approach, volume 904. John Wiley & Sons, 2011.

Zygmund William Birnbaum. On the importance of different components in a multicomponent system. Technical report, Washington Univ Seattle Lab of Statistical Research, 1968.

Christopher M. Bishop. Latent variable models. In Learning in Graphical Models, pages 371-403. 1998.

Russel E Caflisch et al. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, 1998:1-49, 1998.

Li Cai. High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins-Monro algorithm. Psychometrika, 75(1):33-57, 2010.

Yunxiao Chen, Xiaoou Li, and Siliang Zhang.
Structured latent factor analysis for large-scale data: Identifiability, estimability, and their implications. Journal of the American Statistical Association, 115(532):1756-1770, 2020.

Kumar Avinava Dubey, Sashank J. Reddi, Sinead A. Williamson, Barnabás Póczos, Alexander J. Smola, and Eric P. Xing. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems (NIPS), pages 1154-1162, 2016.

Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. CRC Press, 2013.

Wally R Gilks, Nicky G Best, and KKC Tan. Adaptive rejection Metropolis sampling within Gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 44(4):455-472, 1995.

Walter R Gilks and Pascal Wild. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):337-348, 1992.

Ming Gao Gu and Fan Hui Kong. A stochastic approximation algorithm with Markov chain Monte-Carlo method for incomplete data estimation problems. Proceedings of the National Academy of Sciences, 95(13):7270-7274, 1998.

W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109, 1970.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), pages 315-323, Lake Tahoe, NV, 2013.

Belhal Karimi and Ping Li. Two-timescale stochastic EM algorithms. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), 2021.

Belhal Karimi, Hoi-To Wai, Eric Moulines, and Marc Lavielle. On the global convergence of (fast) incremental expectation maximization methods. In Advances in Neural Information Processing Systems (NeurIPS), pages 2833-2843, Vancouver, Canada, 2019.

Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462-466, 1952.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, 2015.

Jun S Liu. Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media, 2008.

John C Loehlin and A Alexander Beaujean. Latent Variable Models: An Introduction to Factor, Path, and Structural Equation Analysis. Taylor & Francis, 2016.

Thomas A Louis. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 44(2):226-233, 1982.

Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087-1092, 1953.

Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). In Doklady AN USSR, volume 269, pages 543-547, 1983.

Art Owen and Yi Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135-143, 2000.

Art B Owen and Peter W Glynn. Monte Carlo and Quasi-Monte Carlo Methods. Springer, 2016.

Boris T Polyak. New stochastic approximation type procedures. Automat. i Telemekh, 7(98-107):2, 1990.

Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. Stochastic variance reduction for nonconvex optimization.
In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 314-323, New York City, NY, 2016.

Steven P Reise. The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5):667-696, 2012.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.

Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer Science & Business Media, 2013.

Andre A Rupp and Jonathan Templin. The effects of Q-matrix misspecification on parameter estimates and classification accuracy in the DINA model. Educational and Psychological Measurement, 68(1):78-96, 2008.

David Ruppert. Efficient estimators from a slowly convergent Robbins-Monro procedure. Cornell University Technical Report, 1988.

Charles Spearman. "General intelligence," objectively determined and measured. American Journal of Psychology, 15:201-292, 1904.

Matthias von Davier and Young-Sun Lee. Handbook of Diagnostic Classification Models. Cham: Springer International Publishing, 2019.

Jianwen Xie, Zilong Zheng, and Ping Li. Learning energy-based model with variational auto-encoder as amortized sampler. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), pages 10441-10451, Virtual Event, 2021.

Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894-942, 2010.

Siliang Zhang and Yunxiao Chen. Computation for latent variable model estimation: A unified stochastic proximal framework. arXiv preprint arXiv:2008.07214, 2020.

Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541-2563, 2006.

Yang Zhao, Jianwen Xie, and Ping Li. Learning energy-based generative models via coarse-to-fine expanding and sampling. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021.

Rongda Zhu, Lingxiao Wang, Chengxiang Zhai, and Quanquan Gu. High-dimensional variance-reduced stochastic gradient expectation-maximization algorithm. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 4180-4188, Sydney, Australia, 2017.

Difan Zou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced Hamilton Monte Carlo methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 6028-6037, Stockholm, Sweden, 2018.