# On Variational Inference in Biclustering Models

Guanhua Fang, Ping Li
Cognitive Computing Lab, Baidu Research
10900 NE 8th St, Bellevue, WA 98004, USA
{guanhuafang, liping11}@baidu.com

Abstract

Biclustering structures exist ubiquitously in data matrices, and the biclustering problem was first formalized by John Hartigan (1972) to cluster rows and columns simultaneously. In this paper, we develop a theory for the estimation of general biclustering models, where the data are assumed to follow a statistical distribution with an underlying biclustering structure. Due to the existence of latent variables, directly computing the maximum likelihood estimator is prohibitively difficult in practice, and we instead consider the variational inference (VI) approach to solve the parameter estimation problem. Although variational inference generally has good empirical performance, there are very few theoretical results around VI. In this paper, we obtain a precise estimation bound for the variational estimator and show that it matches the minimax rate of the estimation error under mild assumptions in the biclustering setting. Furthermore, we study the convergence of the coordinate ascent variational inference algorithm, providing both local and global convergence results. Numerical results validate our new theories.

1. Introduction

In a wide range of data analytic scenarios, we encounter two-mode matrices with biclustering structures (Hartigan, 1972), and we may be interested in modeling the effects of both rows and columns on the data matrix. Consider the following specific situations. In educational testing (Templin and Henson, 2010; Matechou et al., 2016), the two modes could be test takers and question items. We expect that test takers with the same skills will form a group and that similar questions may also form groups. In gene expression studies (Prelić et al., 2006; Gu and Liu, 2008), one can organize the data matrix such that each row corresponds to a cancer patient and each column corresponds to a transcript. The patients then form groups according to different cancer subtypes, and the genes are also expected to exhibit clustering effects according to the pathways they belong to. In online e-commerce services (Dolnicar et al., 2012), researchers may wish to study user behavior through purchasing and navigation history. Users and items can be viewed as the two modes. Users with similar shopping preferences can be clustered into a group, and items with the same functionality can also be grouped together.

To capture the aforementioned group effects, statistical models are often used to describe the structure of a two-mode data matrix. Latent variables are introduced to capture underlying block effects (Govaert and Nadif, 2010). However, in two-mode block mixture models, the maximum likelihood estimator (MLE) is prohibitively hard to compute, since evaluating the marginal likelihood requires summation over exponentially many terms. Quite a few computational methods have been developed to estimate the parameters in biclustering models, e.g., the double k-means method (Maurizio, 2001), the model-based expectation-maximization (EM) method (Pledger and Arnold, 2014), Gibbs-sampling-based methods (Meeds and Roweis, 2007; Gu and Liu, 2008), and nonparametric Bayesian methods (Niu et al., 2012).
Unfortunately, these methods have no corresponding theoretical results. We aim to fill this gap in the literature.

The variational inference (VI) approach (Jordan et al., 1999; Hoffman et al., 2013) is a powerful tool for parameter estimation in complicated hierarchical latent variable models. When the posterior distribution of the latent variables cannot be computed in closed form, VI seeks a good candidate distribution to approximate the true posterior and thereby reduces the computational complexity. VI has also become popular for biclustering models in the recent literature (Guan et al., 2010; Vu and Aitkin, 2015).

In this paper, we take a theoretical perspective and consider the estimation problem for biclustering models via variational inference, developing the corresponding theory. Specifically, we establish both upper and lower estimation bounds for the variational estimator in the biclustering setting, which bridges a gap in the literature on VI theory. Moreover, we study the coordinate ascent variational inference (CAVI) algorithm for parameter estimation and establish its local convergence property. Furthermore, we provide global convergence results that describe the situations under which CAVI may or may not return a consistent estimator. To the best of our knowledge, this is the first time that relatively complete theoretical results have been provided on variational inference estimation under general biclustering settings.

The rest of the paper is organized as follows. In Section 2, we give preliminaries on biclustering models and variational inference methods. In Section 3, we study the theoretical properties of the variational estimator and give detailed estimation bounds. In Section 4, we study the convergence of the CAVI algorithm; both local and global convergence results are provided. Numerical results supporting our theoretical findings are presented in Section 5. Concluding remarks are given in Section 6.

Notation. For a positive integer $m$, we use $[m]$ to denote the set $\{1, \ldots, m\}$ and $[m_1] \times [m_2]$ to denote the set $\{(i, j) : i \in [m_1], j \in [m_2]\}$. For two positive sequences $\{a_n\}, \{b_n\}$, $a_n \lesssim b_n$ means that $a_n \le C b_n$ for some large constant $C$ independent of $n$, and $a_n \asymp b_n$ means that $a_n \lesssim b_n$ and $b_n \lesssim a_n$. The symbols $\mathbb{E}$ and $\mathbb{P}(\cdot)$ denote generic expectation and probability whose distribution is determined from the context. Additionally, $\|x\|$ / $\|x\|_1$ denotes the $\ell_2$- / $\ell_1$-norm of a vector $x$, and $\|X\|_F$ denotes the Frobenius norm of a matrix $X$. We use $x[i]$ to represent the $i$-th entry of vector $x$ and $X[i, j]$ to represent the entry of matrix $X$ in the $i$-th row and $j$-th column. We use $f'$ to represent the derivative of a function $f$ with respect to $\theta$.

2. Preliminary

In this paper, we consider the following general biclustering model. We assume
$$Y_{ij} \sim f_\theta(y \mid z_{1i}, z_{2j}), \quad i \in [m_1],\ j \in [m_2],$$
where the $z_{1i}$'s and $z_{2j}$'s are unobserved latent memberships with $z_{1i} \sim \pi_1$ and $z_{2j} \sim \pi_2$; $\pi_1$ is a discrete probability distribution over $J_1$ latent classes and $\pi_2$ is a discrete probability distribution over $J_2$ latent classes. In other words, $\mathbb{P}(z_{1i} = k) = \pi_1[k]$ ($k \in [J_1]$) and $\mathbb{P}(z_{2j} = l) = \pi_2[l]$ ($l \in [J_2]$). The density function $f_\theta(y \mid z_1, z_2)$ is parameterized by $\theta$. Furthermore, we assume that $f_\theta(y \mid z_1, z_2)$ reduces to $f_{\theta_{z_1 z_2}}(y)$, that is, the observation only depends on a latent class-specific parameter. Given observations $(y_{ij}, i \in [m_1], j \in [m_2])$, the likelihood function is
$$L(\theta) = \sum_{z_1, z_2} \Big\{ \prod_i \pi_1[z_{1i}] \prod_j \pi_2[z_{2j}] \prod_{i,j} f_\theta(y_{ij} \mid z_{1i}, z_{2j}) \Big\},$$
where $z_1 = (z_{11}, \ldots, z_{1m_1})$ and $z_2 = (z_{21}, \ldots, z_{2m_2})$.
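To make the model concrete, the following minimal sketch (ours, not from the paper) simulates one data matrix from the Gaussian special case discussed below, i.e., $Y_{ij}\sim N(\theta_{z_{1i}z_{2j}}, 1)$ with $z_{1i}\sim\pi_1$ and $z_{2j}\sim\pi_2$; the function name and parameter values are illustrative only, and the other response families only change the sampling step for $Y_{ij}$.

```python
import numpy as np

def simulate_biclustering(m1, m2, pi1, pi2, theta, rng=None):
    """Draw one data matrix from the Gaussian biclustering model
    Y_ij ~ N(theta[z1_i, z2_j], 1) with z1_i ~ pi1 and z2_j ~ pi2."""
    rng = np.random.default_rng(rng)
    z1 = rng.choice(len(pi1), size=m1, p=pi1)   # row memberships
    z2 = rng.choice(len(pi2), size=m2, p=pi2)   # column memberships
    means = theta[np.ix_(z1, z2)]               # m1 x m2 matrix of block means
    Y = means + rng.standard_normal((m1, m2))   # unit-variance Gaussian noise
    return Y, z1, z2

# Example: J1 = J2 = 2 blocks
pi1, pi2 = np.array([0.4, 0.6]), np.array([0.5, 0.5])
theta = np.array([[1.0, -1.0], [-1.0, 1.0]])
Y, z1, z2 = simulate_biclustering(200, 300, pi1, pi2, theta, rng=0)
```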
Additionally, if only a subset $\Omega \subset [m_1]\times[m_2]$ can be observed, then the corresponding likelihood function can be written as
$$L(\theta) = \sum_{z_1, z_2}\Big\{\prod_i \pi_1[z_{1i}] \prod_j \pi_2[z_{2j}] \prod_{(i,j)\in\Omega} f_\theta(y_{ij}\mid z_{1i}, z_{2j})\Big\}.$$

In particular, when $Y_{ij}$ follows the Gaussian distribution with mean $\theta_{z_{1i}z_{2j}}$ and variance 1, then $f_\theta(y_{ij}\mid z_{1i}, z_{2j}) \propto \exp\{-(y_{ij}-\theta_{z_{1i}z_{2j}})^2/2\}$. When $Y_{ij}$ is a binary response, that is, $Y_{ij}\sim \mathrm{Bernoulli}(\theta_{z_{1i}z_{2j}})$ with $0 < \theta_{z_{1i}z_{2j}} < 1$, then $f_\theta(y_{ij}\mid z_{1i},z_{2j}) = \theta_{z_{1i}z_{2j}}^{y_{ij}}(1-\theta_{z_{1i}z_{2j}})^{1-y_{ij}}$. This is also known as the stochastic block model (SBM; Holland et al., 1983; Abbe, 2017; Zhou and Li, 2020). When $Y_{ij}$ is an ordinal response, that is, $Y_{ij}\sim \mathrm{Multinom}(\theta_{z_{1i}z_{2j}}[1],\ldots,\theta_{z_{1i}z_{2j}}[K])$ ($K$ is the number of categories), then $f_\theta(y_{ij}\mid z_{1i},z_{2j}) = \prod_{k=1}^{K}(\theta_{z_{1i}z_{2j}}[k])^{1\{y_{ij}=k\}}$. Furthermore, $Y_{ij}$ can be an event process (i.e., $y_{ij} = (t_{ij,1},\ldots,t_{ij,n})$ is a sequence of event times in $[0, T]$) that follows a counting process model with intensity function $\theta_{z_{1i}z_{2j}}(t)$. Then
$$f_\theta(y_{ij}\mid z_{1i}, z_{2j}) = \Big\{\prod_{l=1}^{n}\theta_{z_{1i}z_{2j}}(t_{ij,l})\Big\}\exp\Big\{-\int_{0}^{T}\theta_{z_{1i}z_{2j}}(t)\,dt\Big\}.$$
In particular, when $\theta_{z_{1i}z_{2j}}(t)\equiv\theta_{z_{1i}z_{2j}}$, this reduces to the homogeneous Poisson point process, and therefore $f_\theta(y_{ij}\mid z_{1i},z_{2j}) = \theta_{z_{1i}z_{2j}}^{n}\exp\{-T\theta_{z_{1i}z_{2j}}\}$.

Approximation via variational inference. A close look at the formula for $L(\theta)$ shows that computing the likelihood directly is difficult, since it requires summation over exponentially many terms. An alternative approach is to use variational inference (Jordan et al., 1999; Hoffman et al., 2013) and optimize the evidence lower bound (ELBO) instead of the log-likelihood. In general, the ELBO is defined as
$$\mathrm{ELBO} = \mathbb{E}_{z\sim q(z)}\, l(\theta, z) - \mathbb{E}_{z\sim q(z)} \log q(z),$$
where the expectation is taken with respect to the latent variables $z$ and $q(z)$ is an approximate distribution for the posterior of $z$. Under our setting, $z = (z_1, z_2)$ and
$$l(\theta, z) = \sum_i \log \pi_1[z_{1i}] + \sum_j \log \pi_2[z_{2j}] + \sum_{i,j}\log f_\theta(y_{ij}\mid z_{1i}, z_{2j}).$$
For computational feasibility, we consider a mean-field family (Blei et al., 2017) for the choice of $q(z)$. More precisely, we take $q(z) := \prod_i q_i(z_{1i}) \prod_j q_j(z_{2j})$, with $q_i(z_{1i}) = \mathrm{multinom}(\phi_{1i})$, $\phi_{1i} = (\phi_{1i}[1],\ldots,\phi_{1i}[J_1])$, and $q_j(z_{2j}) = \mathrm{multinom}(\phi_{2j})$, $\phi_{2j} = (\phi_{2j}[1],\ldots,\phi_{2j}[J_2])$. Here $\mathrm{multinom}(\phi)$ represents a multinomial distribution with parameter $\phi$. By calculation, the ELBO can be written as
$$\mathrm{ELBO} = \sum_{i,j}\sum_{k,l}\phi_{1i}[k]\phi_{2j}[l]\log f_\theta(y_{ij}\mid k,l) + \sum_i\sum_k \phi_{1i}[k]\log\frac{\pi_1[k]}{\phi_{1i}[k]} + \sum_j\sum_l \phi_{2j}[l]\log\frac{\pi_2[l]}{\phi_{2j}[l]}. \quad (1)$$
Although variational inference is a powerful optimization tool for complex statistical models, VI theory has not yet been fully explored in the literature.

3. Estimation Bound of Variational Estimator

It is known that the ELBO is a lower bound of the log-likelihood, that is,
$$\log L(\theta) - \mathrm{ELBO} = \mathrm{KL}\big(q(z)\,\|\,p_\theta(z\mid y)\big).$$
Since the KL divergence is always non-negative, the maximizer of the ELBO may not be an unbiased estimator of the true parameter. Under the biclustering model, we have
$$\mathrm{KL}\big(q(z)\,\|\,p_\theta(z\mid y)\big) = \mathrm{KL}\Big(\prod_i q_i(z_{1i})\prod_j q_j(z_{2j})\,\Big\|\,p_\theta(z_1,z_2\mid y)\Big) = \sum_{z_1,z_2}\Big(\prod_i \phi_{1i}[z_{1i}]\prod_j \phi_{2j}[z_{2j}]\Big)\Big\{\log\Big(\prod_i \phi_{1i}[z_{1i}]\prod_j \phi_{2j}[z_{2j}]\Big) - \log p_\theta(z_1,z_2\mid y)\Big\}.$$
As the sample sizes $m_1$ and $m_2$ go to infinity, we are able to show that $\mathrm{KL}(q(z)\,\|\,p_\theta(z\mid y))$ goes to 0 in probability. This guarantees that the VI estimator is asymptotically consistent. Before going to the main results, we first introduce some additional notation and assumptions.

Assumption A1: For every $k \ne k' \in [J_1]$, there exists $l \in [J_2]$ such that $\theta_{kl} \ne \theta_{k'l}$. (2) For every $l \ne l' \in [J_2]$, there exists $k \in [J_1]$ such that $\theta_{kl} \ne \theta_{kl'}$. (3)

Assumption A1 is an identifiability assumption: the matrix $\theta$ cannot have two identical rows or two identical columns.
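As a quick illustration (ours, under the assumption that $\theta$ is stored as a $J_1\times J_2$ array), Assumption A1 can be checked numerically by verifying that no two rows and no two columns of $\theta$ coincide; the second example below has two identical columns, which is exactly the degenerate situation discussed next.

```python
import numpy as np

def satisfies_A1(theta, tol=1e-12):
    """Check Assumption A1: no two rows and no two columns of theta coincide."""
    J1, J2 = theta.shape
    rows_ok = all(np.max(np.abs(theta[k] - theta[kp])) > tol
                  for k in range(J1) for kp in range(k + 1, J1))
    cols_ok = all(np.max(np.abs(theta[:, l] - theta[:, lp])) > tol
                  for l in range(J2) for lp in range(l + 1, J2))
    return rows_ok and cols_ok

print(satisfies_A1(np.array([[0.7, 0.3], [0.2, 0.6]])))  # True: identifiable
print(satisfies_A1(np.array([[0.7, 0.7], [0.3, 0.3]])))  # False: identical columns
```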
This constraint ensures that individuals from different classes have different structural properties. For example, when $J_1 = J_2 = 2$ and
$$\theta = \begin{pmatrix}\theta_a & \theta_a\\ \theta_b & \theta_b\end{pmatrix},$$
we can easily differentiate objects from different classes in the first mode, while we fail to classify the objects in the second mode. This can lead to mis-classification of the $z_{2j}$'s.

Assumption A2: There exists a bounded and compact set $B$ such that $\theta_{kl} \in B$ for $k \in [J_1]$, $l \in [J_2]$.

Assumption A2 is a compactness assumption to ensure that the objective function has nice continuity properties. For example, when $Y_{ij}$ is binary with distribution $\mathrm{Bernoulli}(\theta_{z_{1i}z_{2j}})$, we require $\theta_{kl} \in [\xi, 1-\xi]$ for a positive constant $\xi$. This is also a usual assumption in the SBM literature (Celisse et al., 2012).

Assumption A3: There exist positive constants $\gamma_1$ and $\gamma_2$ such that $\pi_1[k] \in [\gamma_1, 1-\gamma_1]$ for $k \in [J_1]$ and $\pi_2[l] \in [\gamma_2, 1-\gamma_2]$ for $l \in [J_2]$.

The above assumption implies that no class has vanishing probability. In this paper, $\gamma_1$ and $\gamma_2$ are assumed to be free of the sample sizes $m_1$ and $m_2$. Moreover, the $z_{1i}$'s and $z_{2j}$'s are realizations of multinomial random variables with parameters $\pi_1$ and $\pi_2$, respectively. Define the empirical latent class probabilities $\tilde\pi_1[k] := N_1[k]/m_1$ and $\tilde\pi_2[l] := N_2[l]/m_2$, where $N_1[k] := |\{i \in [m_1] : z_{1i} = k\}|$ and $N_2[l] := |\{j \in [m_2] : z_{2j} = l\}|$. Therefore, by the law of large numbers, $\tilde\pi_1[k] \in [\gamma_1/2, 1-\gamma_1/2]$ and $\tilde\pi_2[l] \in [\gamma_2/2, 1-\gamma_2/2]$ with high probability. Unfortunately, we do not have access to the $z_{1i}$'s and $z_{2j}$'s; they have to be estimated as well.

Assumption A4: It is assumed that $f_\theta(y\mid z_1, z_2) \equiv f_{\theta_{z_1 z_2}}(y)$ for $z_1 \in [J_1]$ and $z_2 \in [J_2]$, and that it is a log-concave, twice-differentiable function. We define
$$h_{kl}(\theta) = \mathbb{E}_{y\sim f_{\theta_{kl}}}\log f_\theta(y) \quad (4)$$
and assume that $h_{kl}(\theta)$ is a strictly concave function of $\theta$ over $B$ for any $k \in [J_1]$, $l \in [J_2]$.

Assumption A4 places a requirement on the density of the biclustering model. The smooth function $f_\theta(y\mid z_1, z_2)$ is assumed to depend on $\theta_{kl}$ only. The function $h_{kl}(\theta)$ is the expectation of the log-density with respect to the true distribution. In most cases, $h_{kl}(\theta)$ is naturally a strictly concave function. For example, when $f_\theta(y)$ is the density of a Gaussian distribution with variance 1, we have $h_{kl}(\theta) = -\tfrac{1}{2}(\theta - \theta_{kl})^2 + \text{const}$, which is obviously strictly concave. When $f_\theta(y)$ is the density of $\mathrm{Bernoulli}(\theta)$, then $h_{kl}(\theta) = \theta_{kl}\log\theta + (1-\theta_{kl})\log(1-\theta)$, which is also strictly concave on its domain.

Next, we define the variational estimators $\hat\theta_{kl}$'s, $\hat\pi_1$, $\hat\pi_2$, $\hat\phi_{1i}$'s and $\hat\phi_{2j}$'s as
$$(\hat\theta, \hat\pi_1, \hat\pi_2, \hat\phi_{1i}, \hat\phi_{2j}) := \arg\max_{\theta, \pi_1, \pi_2, \phi_{1i}, \phi_{2j}} \mathrm{ELBO}.$$
The domain for the original parameters is $\theta_{kl} \in B$, $\pi_1[k] \in [\gamma_1, 1-\gamma_1]$, $\pi_2[l] \in [\gamma_2, 1-\gamma_2]$. The domain for the variational parameters is $\phi_{1i} \in S_1$ and $\phi_{2j} \in S_2$, where $S_1 = \{\phi : \sum_{k=1}^{J_1}\phi[k] = 1,\ 0 \le \phi[k] \le 1\}$ and $S_2 = \{\phi : \sum_{l=1}^{J_2}\phi[l] = 1,\ 0 \le \phi[l] \le 1\}$. In summary, the total number of parameters is $J_1J_2 + (J_1-1)m_1 + (J_2-1)m_2 + J_1 + J_2 - 2$.

3.1. Theoretical Results

Next, we provide theoretical estimation bounds for the variational estimator, including classification consistency, population consistency and parameter consistency. We use the superscript $*$ to denote true values in the rest of the paper.

Classification consistency. We use $\delta_{1k}$ to denote the probability mass function on the discrete set $\{1,\ldots,J_1\}$ that assigns all probability mass to $k \in [J_1]$, and $\delta_{2l}$ to denote the probability mass function on the discrete set $\{1,\ldots,J_2\}$ that assigns all probability mass to $l \in [J_2]$. Let $\hat q_{1i} = \mathrm{multinom}(\hat\phi_{1i})$ and $\hat q_{2j} = \mathrm{multinom}(\hat\phi_{2j})$ be the estimated approximate posterior distributions for $z_{1i}$ and $z_{2j}$, respectively. Note that the model is invariant with respect to permutations of the class labels.
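Because of this label-permutation invariance, any numerical comparison of estimated and true memberships should first align the labels. A minimal helper (ours, brute force over the $J!$ permutations, which is cheap for the small $J$ considered here):

```python
import numpy as np
from itertools import permutations

def align_labels(z_hat, z_true, J):
    """Relabel estimated classes to best match the true ones, since the model
    is only identified up to a permutation of class labels."""
    best_perm, best_acc = None, -1.0
    for perm in permutations(range(J)):
        relabeled = np.asarray(perm)[z_hat]          # apply candidate permutation
        acc = float(np.mean(relabeled == z_true))
        if acc > best_acc:
            best_perm, best_acc = perm, acc
    return np.asarray(best_perm)[z_hat], best_acc

# Estimated labels that are a relabeling of the truth align perfectly:
z_true = np.array([0, 0, 1, 2, 1])
z_hat  = np.array([2, 2, 0, 1, 0])
print(align_labels(z_hat, z_true, 3))   # (array([0, 0, 1, 2, 1]), 1.0)
```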
Without loss of generality, we always assume that the estimated class labels have been permuted to match the true labels when the estimator is consistent. We can then show that $\hat q_{1i}$ converges to $\delta_{1z^*_{1i}}$ and $\hat q_{2j}$ converges to $\delta_{2z^*_{2j}}$. The result is stated in Theorem 1.

Theorem 1 Under Assumptions A1-A4, there exist constants $c_0$ and $C$ such that
$$d_{TV}(\hat q_{1i}, \delta_{1z^*_{1i}}) \le J_1\exp\{-c_0 m_2\}, \qquad d_{TV}(\hat q_{2j}, \delta_{2z^*_{2j}}) \le J_2\exp\{-c_0 m_1\}$$
hold with probability at least $1 - (m_1+m_2)\max\{J_1, J_2\}\exp\{-C\min\{m_1, m_2\}\}$ for $i \in [m_1]$ and $j \in [m_2]$. Here, $d_{TV}$ is the total variation distance between two distributions.

Theorem 1 tells us that we can classify the $z_{1i}$'s and $z_{2j}$'s into the correct latent classes with high probability. To be more specific, we define the estimated labels as $\hat z_{1i} := \arg\max_{k\in[J_1]} \hat q_{1i}[k]$, $\hat z_{2j} := \arg\max_{l\in[J_2]} \hat q_{2j}[l]$, and $\hat z_1 = (\hat z_{11}, \ldots, \hat z_{1m_1})$, $\hat z_2 = (\hat z_{21}, \ldots, \hat z_{2m_2})$. We then have $\mathbb{P}(\hat z_1 = z^*_1, \hat z_2 = z^*_2) \to 1$, which is known as strong consistency in the SBM literature (Abbe et al., 2015; Mossel et al., 2014; Gao et al., 2017). Moreover, the mis-classification errors decrease exponentially fast in the sample sizes.

Population consistency. By the definition of the variational estimator, we can obtain the relation between $\hat\pi_1, \hat\pi_2$ and the $\hat\phi_{1i}$'s, $\hat\phi_{2j}$'s, namely
$$\hat\pi_1[k] = \frac{\sum_{i\in[m_1]}\hat\phi_{1i}[k]}{m_1} \ \text{ for } k\in[J_1], \qquad \hat\pi_2[l] = \frac{\sum_{j\in[m_2]}\hat\phi_{2j}[l]}{m_2} \ \text{ for } l\in[J_2],$$
by simplifying the optimality conditions. By Theorem 1 and the law of large numbers, we can obtain the consistency of $\hat\pi_1$ and $\hat\pi_2$.

Theorem 2 Under Assumptions A1-A4, there exist constants $c_0$-$c_2$ and $C$ such that
$$|\hat\pi_1[k] - \pi^*_1[k]| \le J_1\exp\{-c_0 m_2\} + \frac{c_1}{\sqrt{m_1}}, \qquad |\hat\pi_2[l] - \pi^*_2[l]| \le J_2\exp\{-c_0 m_1\} + \frac{c_2}{\sqrt{m_2}}$$
hold with probability at least $1 - 2(m_1+m_2)\max\{J_1, J_2\}\exp\{-C\min\{m_1, m_2\}\}$ for $k \in [J_1]$ and $l \in [J_2]$.

Here the estimation errors of $\pi_1$ and $\pi_2$ come from two sources: variational approximation and sampling noise (i.e., the deviation between the empirical distribution of the $z_{1i}$'s / $z_{2j}$'s and the true priors $\pi_1$ / $\pi_2$). When both $m_1$ and $m_2$ go to infinity, the sampling noise becomes the dominant term. Hence the estimation errors of $\hat\pi_1$ and $\hat\pi_2$ achieve the optimal rates, i.e., $O(1/\sqrt{m_1})$ and $O(1/\sqrt{m_2})$.

Parameter consistency. Next we move on to the estimation of $\theta$, which is the key parameter that differentiates the classes. The upper bound is given in the following theorem.

Theorem 3 Under Assumptions A1-A4, there exist constants $C_1$-$C_3$ such that
$$\frac{\|\hat\theta - \theta^*\|_F^2}{J_1 J_2} \le \frac{C_1(J_1J_2 + m_1\log J_1 + m_2\log J_2)}{m_1 m_2} + \max\{J_1, J_2\}\exp\{-C_2\min\{m_1, m_2\}\} \quad (5)$$
holds with probability at least $1 - \exp\{-C_3(J_1J_2 + m_1\log J_1 + m_2\log J_2)\} - (m_1+m_2)\max\{J_1, J_2\}\exp\{-C_3\min\{m_1, m_2\}\}$.

Again, the upper bound for $\frac{1}{J_1J_2}\|\hat\theta-\theta^*\|_F^2$ consists of two main parts (sampling noise and variational approximation error). When both $m_1$ and $m_2$ go to infinity at the same order, the first term on the right-hand side of (5) dominates the second term. In fact, $\frac{J_1J_2 + m_1\log J_1 + m_2\log J_2}{m_1 m_2}$ is the optimal error rate for any estimator of the biclustering model; see the explanation in the next part. Moreover, the part involving $J_1J_2$ reflects the number of parameters in $\theta$, while the part involving $m_1\log J_1 + m_2\log J_2$ comes from the complexity of estimating the latent memberships of both modes. It is the price we need to pay for not knowing the true clustering information.

Lower bound of parameter estimation. To show that our upper bound is tight, we first reformulate the parameter setting. We define an augmented parameter space
$$\mathcal{B}_\Theta := \{\Theta \in \mathbb{R}^{m_1\times m_2} : \Theta_{ij} = \theta_{z_{1i}z_{2j}},\ z_{1i}\in[J_1],\ z_{2j}\in[J_2],\ \theta_{kl}\in B\}. \quad (6)$$
In other words, the parameter $\Theta$ is constructed from $\theta$ by letting $\Theta_{ij} = \theta_{z_{1i}z_{2j}}$ through the latent memberships. Let $\hat z_{1i} = \arg\max_k \hat\phi_{1i}[k]$ and $\hat z_{2j} = \arg\max_l \hat\phi_{2j}[l]$ be the estimated latent memberships. We can then construct $\hat\Theta$ as $\hat\Theta_{ij} = \hat\theta_{\hat z_{1i}\hat z_{2j}}$. Similarly, we can define $\Theta^*$ in the same way. We then have
$$\frac{1}{m_1 m_2}\|\hat\Theta - \Theta^*\|_F^2 \lesssim \frac{1}{J_1 J_2}\|\hat\theta - \theta^*\|_F^2.$$
Therefore, we only need to work on the lower bound of $\frac{1}{m_1 m_2}\|\hat\Theta - \Theta^*\|_F^2$.

Theorem 4 Under Assumptions A1-A4, there exist constants $C, c > 0$ such that
$$\inf_{\check\Theta}\ \sup_{\Theta^*\in\mathcal{B}_\Theta}\ \mathbb{P}\Big(\frac{1}{m_1 m_2}\|\check\Theta - \Theta^*\|_F^2 > \frac{C(J_1J_2 + m_1\log J_1 + m_2\log J_2)}{m_1 m_2}\Big) \ge c.$$

By Theorem 3 and Theorem 4, we know that the variational estimator achieves the minimax rate when the sample sizes $m_1$ and $m_2$ tend to infinity at the same order. Compared with the error bound of the MLE (Gao et al., 2016), we can see that the bias induced by the variational approximation is only of order $\exp\{-C\min\{m_1, m_2\}\}$, which is negligible when both $m_1$ and $m_2$ go to infinity.

Partial observation case. Furthermore, we consider the situation where the data are only partially observed. That is, we observe $(y_{ij} : (i,j)\in\Omega)$ instead of $(y_{ij} : (i,j)\in[m_1]\times[m_2])$, where $\Omega$ is a subset of $[m_1]\times[m_2]$. In addition, we assume that each pair $(i,j)$ is observed completely at random with a fixed observation rate $p$ ($0 < p < 1$). Thus, the size of $\Omega$ is approximately $p\,m_1 m_2$. Under this setting, we can generalize our theoretical results as follows.

Theorem 5 Under the partial observation setting and Assumptions A1-A4, there exist constants $C'_0$-$C'_6$ such that
$$d_{TV}(\hat q_{1i}, \delta_{1z^*_{1i}}) \le J_1\exp\{-C'_0 p m_2\}, \qquad d_{TV}(\hat q_{2j}, \delta_{2z^*_{2j}}) \le J_2\exp\{-C'_0 p m_1\},$$
$$|\hat\pi_1[k] - \pi^*_1[k]| \le J_1\exp\{-C'_0 p m_2\} + \frac{C'_1}{\sqrt{m_1}}, \qquad |\hat\pi_2[l] - \pi^*_2[l]| \le J_2\exp\{-C'_0 p m_1\} + \frac{C'_2}{\sqrt{m_2}},$$
$$\frac{1}{J_1J_2}\|\hat\theta - \theta^*\|_F^2 \le \frac{C'_3(J_1J_2 + m_1\log J_1 + m_2\log J_2)}{p\,m_1 m_2} + \max\{J_1, J_2\}\exp\{-C'_4\,p\min\{m_1, m_2\}\}$$
hold with probability at least $1 - \max\{J_1, J_2\}(m_1+m_2)\exp\{-C'_6\,p(1-p)\min\{m_1, m_2\}\} - \exp\{-C'_5(J_1J_2 + m_1\log J_1 + m_2\log J_2)\}$.

To ensure estimation consistency, $p$ should not be too small. From the bounds in Theorem 5, it is required that $p\,m_1 m_2/\max\{J_1J_2, m_1\log J_1, m_2\log J_2\} \to \infty$ and $\max\{J_1, J_2\}(m_1+m_2)\exp\{-C'_6\,p(1-p)\min\{m_1, m_2\}\} \to 0$ as $m_1, m_2 \to \infty$. After simplification, we find that the observation rate $p$ should be at least of order $\frac{\max\{\log m_1, \log m_2\}}{\min\{m_1, m_2\}}$. This requirement on $p$ is nearly optimal, since $p = \Omega\big(\max\{\frac{\log J_1}{m_2}, \frac{\log J_2}{m_1}\}\big)$ is required for the consistency of the maximum likelihood estimator (Gao et al., 2016).

Connection to the literature. The theoretical properties of the maximum likelihood estimator in the biclustering setting were considered by Gao et al. (2016), who proved the minimax estimation rate of the MLE under Gaussian biclustering models. The biclustering model is related to the stochastic block model; the latter assumes a symmetric relationship between the two modes. Celisse et al. (2012) established classification and estimation consistency for variational inference in the SBM, but they do not provide precise error bounds for the estimators. The biclustering model is also related to the graphon model, where both observations and latent variables are assumed to take continuous values in $[0, 1]$. Theory on graphon estimation includes Airoldi et al., 2013; Olhede and Wolfe, 2014; Choi, 2017; Gao et al., 2015; Klopp et al., 2017 and the references therein. Graphon models are estimated by nonparametric methods rather than VI. The biclustering model also has a connection to the matrix completion problem; the latter imposes a low-rank constraint on the parameter $\Theta$ as opposed to a latent class structure.
The theory for the matrix completion problem can be found in Candès and Recht, 2009; Keshavan et al., 2010; Cai et al., 2010; Koltchinskii et al., 2011; Liu and Li, 2016; Chi et al., 2019; Cai and Li, 2020, among others.

When the number of classes is unknown. A natural question is how the variational estimator performs when $J_1$ and $J_2$ are misspecified. If $J_1$ or $J_2$ is over-specified, it is expected that the estimation is still consistent in the sense that some classes may be split into two (or more) sub-classes. If $J_1$ or $J_2$ is under-specified, different classes may merge together, and the VI estimator should converge to a local stationary point. The precise characterization of such local optima is of great interest for future work.

4. Convergence of Variational Algorithm

Although the variational estimator has nice theoretical properties, a natural question is how to compute such an estimator in practice. In this section, we consider a coordinate ascent variational inference (CAVI) algorithm, which alternately optimizes the model parameters and the variational parameters, and discuss its convergence. The procedure for estimating the biclustering model is summarized as follows. We use $t\in\{0,\ldots,T\}$ to denote the iteration index and $T$ the total number of iterations. We first set the initial model parameters $\theta^{(0)}, \pi_1^{(0)}, \pi_2^{(0)}$ and initialize the variational parameters $\phi_{1i}$'s and $\phi_{2j}$'s such that $\sum_k \phi^{(0)}_{1i}[k] = 1$ and $\sum_l \phi^{(0)}_{2j}[l] = 1$ for all $i\in[m_1]$, $j\in[m_2]$. Next, we update the model parameters $\theta, \pi_1, \pi_2$ and the variational parameters $\phi_{1i}$'s, $\phi_{2j}$'s alternately. That is, we update $\theta^{(t)}$ by maximizing equation (1) with $\phi_{1i}, \phi_{2j}$ fixed at $\phi^{(t-1)}_{1i}$ and $\phi^{(t-1)}_{2j}$. When the density function $f_\theta(y\mid z_{1i}, z_{2j})$ belongs to certain distributional families, $\theta^{(t)}$ admits an explicit form. For example,
$$\theta^{(t)}_{kl} = \frac{\sum_{i,j}\phi^{(t-1)}_{1i}[k]\,\phi^{(t-1)}_{2j}[l]\,y_{ij}}{\sum_{i,j}\phi^{(t-1)}_{1i}[k]\,\phi^{(t-1)}_{2j}[l]}$$
when $f_\theta(y\mid z_{1i}, z_{2j})$ is the density of a Bernoulli, Gaussian or Poisson distribution. We update $\phi^{(t)}_{1i}$ by maximizing equation (1) with the model parameters fixed at their values from the $t$-th iteration and $\phi_{2j}$ fixed at $\phi^{(t-1)}_{2j}$. We update $\phi^{(t)}_{2j}$ in the same fashion. We then update $\pi_1^{(t)}$ and $\pi_2^{(t)}$ by averaging the variational probabilities, as in the relation given under population consistency (lines 9-10 of Algorithm 1). The detailed mathematical formulas are presented in Algorithm 1.

Remark. We want to point out that we are not trying to propose a new algorithm in the current paper. The CAVI-type algorithm is a standard computational scheme for VI optimization problems (Blei et al., 2017). Our goal is to analyze the behavior of CAVI in general biclustering models. Below, we establish local and global convergence results, which may benefit the understanding of the landscape of biclustering models.

4.1. Local Convergence

There is a literature on convergence analysis of the EM algorithm under mixture models. Local convergence properties of EM are studied by Jin et al., 2016; Yan et al., 2017; Zhao et al., 2020. The global convergence property of EM is considered in Xu et al., 2016. The landscape of the stochastic block model is described in Mukherjee et al., 2018. Local convergence theory of CAVI for the SBM has also been studied in Zhang and Zhou, 2020. However, there is no literature on VI algorithms for biclustering models; we fill this gap. We first study the local convergence of Algorithm 1. In this section, we show that the algorithm returns a consistent estimator of the model parameters when the initialization is good enough.
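As a concrete illustration of the alternating updates just described (stated formally in Algorithm 1 below), here is a minimal CAVI sketch for the unit-variance Gaussian model. This is our own simplified implementation, not the authors' code: it uses the closed-form weighted-mean update for $\theta$, the standard mean-field updates for $\phi_{1i}, \phi_{2j}$ (row-wise softmax of expected log-likelihoods plus $\log\pi$), and the averaging updates for $\pi_1, \pi_2$; convergence monitoring via the ELBO (1) is omitted for brevity. A typical call would be `theta_hat, phi1, phi2, pi1, pi2 = cavi_gaussian(Y, J1=2, J2=2)` on a matrix simulated as in Section 2.

```python
import numpy as np

def cavi_gaussian(Y, J1, J2, n_iter=100, seed=0):
    """Minimal CAVI sketch for the unit-variance Gaussian biclustering model
    (illustration of the alternating updates of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    m1, m2 = Y.shape
    phi1 = rng.dirichlet(np.ones(J1), size=m1)      # variational probs for rows
    phi2 = rng.dirichlet(np.ones(J2), size=m2)      # variational probs for columns
    pi1, pi2 = np.full(J1, 1.0 / J1), np.full(J2, 1.0 / J2)
    theta = rng.standard_normal((J1, J2))
    for _ in range(n_iter):
        # theta update: weighted block means (closed form for Gaussian responses)
        w = phi1.sum(0)[:, None] * phi2.sum(0)[None, :]          # (J1, J2) total weights
        theta = (phi1.T @ Y @ phi2) / np.maximum(w, 1e-12)
        # log f_theta(y_ij | k, l) up to the additive constant -log(2*pi)/2
        logf = -0.5 * (Y[:, :, None, None] - theta[None, None, :, :]) ** 2
        # phi1 update: phi1[i, k] ∝ pi1[k] * exp( sum_{j,l} phi2[j, l] log f(y_ij | k, l) )
        s1 = np.einsum('ijkl,jl->ik', logf, phi2) + np.log(pi1)[None, :]
        phi1 = np.exp(s1 - s1.max(1, keepdims=True))
        phi1 /= phi1.sum(1, keepdims=True)
        # phi2 update (symmetric, using the freshly updated phi1)
        s2 = np.einsum('ijkl,ik->jl', logf, phi1) + np.log(pi2)[None, :]
        phi2 = np.exp(s2 - s2.max(1, keepdims=True))
        phi2 /= phi2.sum(1, keepdims=True)
        # pi updates: averages of the variational probabilities
        pi1, pi2 = phi1.mean(0), phi2.mean(0)
    return theta, phi1, phi2, pi1, pi2
```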
Assumption A5: We assume that there exist constants $D_1$ and $D_2$ such that the $\phi^{(0)}_{1i}$'s satisfy
$$\frac{\sum_{i: z^*_{1i}=k}\phi^{(0)}_{1i}[k]}{\sum_{i: z^*_{1i}=k}\phi^{(0)}_{1i}[k']} > D_1 \quad (7)$$
for all $k \in [J_1]$ and $k' \ne k$, and the $\phi^{(0)}_{2j}$'s satisfy
$$\frac{\sum_{j: z^*_{2j}=l}\phi^{(0)}_{2j}[l]}{\sum_{j: z^*_{2j}=l}\phi^{(0)}_{2j}[l']} > D_2 \quad (8)$$
for all $l \in [J_2]$ and $l' \ne l$.

Here, (7) and (8) guarantee that, for each latent class, the initial distributions put relatively large mass on the true label. In other words, the initialization of the variational distribution should be good enough to concentrate locally around the true labels.

Algorithm 1 CAVI for the Biclustering Model.
1: Input. Observations $\{y_{ij}\}$.
2: Output. Estimated parameter $\hat\theta$.
3: Initialization. Randomly sample $\theta^{(0)}$ from the parameter space $B$ and choose $\pi_1^{(0)} = (\frac{1}{J_1},\ldots,\frac{1}{J_1})$ and $\pi_2^{(0)} = (\frac{1}{J_2},\ldots,\frac{1}{J_2})$. Sample the variational parameters $\phi^{(0)}_{1i}$ independently so that $\sum_{k\in[J_1]}\phi^{(0)}_{1i}[k] = 1$. Sample $\phi^{(0)}_{2j}$ independently so that $\sum_{l\in[J_2]}\phi^{(0)}_{2j}[l] = 1$.
4: while not converged do
5: Increase the time index: $t = t + 1$.
6: For each $k\in[J_1]$ and $l\in[J_2]$, update $\theta_{kl}$ by $\theta^{(t)}_{kl} = \arg\max_\theta \sum_{i,j}\phi^{(t-1)}_{1i}[k]\,\phi^{(t-1)}_{2j}[l]\log f_\theta(y_{ij}\mid k,l)$.
7: For each $i\in[m_1]$, update $\phi_{1i}$ by $\phi^{(t)}_{1i} = \arg\max_\phi \sum_{j}\sum_{k\in[J_1], l\in[J_2]}\phi[k]\,\phi^{(t-1)}_{2j}[l]\log f_{\theta^{(t)}}(y_{ij}\mid k,l) - \sum_{k\in[J_1]}\phi[k]\big(\log\phi[k] - \log\pi_1^{(t-1)}[k]\big)$.
8: For each $j\in[m_2]$, update $\phi_{2j}$ by $\phi^{(t)}_{2j} = \arg\max_\phi \sum_{i}\sum_{k\in[J_1], l\in[J_2]}\phi[l]\,\phi^{(t)}_{1i}[k]\log f_{\theta^{(t)}}(y_{ij}\mid k,l) - \sum_{l\in[J_2]}\phi[l]\big(\log\phi[l] - \log\pi_2^{(t-1)}[l]\big)$.
9: For $k\in[J_1]$, update $\pi_1^{(t)}[k] = \frac{1}{m_1}\sum_{i\in[m_1]}\phi^{(t)}_{1i}[k]$.
10: For $l\in[J_2]$, update $\pi_2^{(t)}[l] = \frac{1}{m_2}\sum_{j\in[m_2]}\phi^{(t)}_{2j}[l]$.
11: end while
12: Set $\hat\theta = \theta^{(T_c)}$, where $T_c$ is the time index at which the algorithm converges.

Theorem 6 Under Assumptions A1-A4, if (7) and (8) hold with sufficiently large $D_1$ and $D_2$, then Algorithm 1 returns a consistent estimator with probability tending to 1 as $m_1, m_2 \to \infty$.

Note that the VI estimator is biased in finite samples. Here, $m_1, m_2 \to \infty$ means that $m_1$ and $m_2$ go to infinity at the same rate. In Algorithm 1, the order of updating the model parameter $\theta$ and the variational parameters $\phi$'s can be exchanged (i.e., we first implement lines 7-8 and then line 6 in each iteration). We can still show local convergence if we additionally assume that $\theta^{(0)}$ is in a neighborhood of $\theta^*$.

Theorem 7 Under Assumptions A1-A4, if (8) holds with sufficiently large $D_2$ and $\theta^{(0)} \in B(\theta^*, \delta)$ for a small radius $\delta$, then Algorithm 1 returns a consistent estimator with probability tending to 1 as $m_1, m_2 \to \infty$.

4.2. Global Convergence

In practice, however, we do not know the true model parameters or the true latent memberships, so initialization conditions such as (7) and (8) cannot be guaranteed. Can we obtain global convergence of Algorithm 1? In the following, we give a partial answer to this question. We first observe that there exist saddle points in optimizing (1). For example, if we let $\phi^{(0)}_{1i} = (\frac{1}{J_1},\ldots,\frac{1}{J_1})$ and $\phi^{(0)}_{2j} = (\frac{1}{J_2},\ldots,\frac{1}{J_2})$, then whatever the true parameter is, the algorithm always returns $\theta^{(t)} \equiv \bar\theta$, $\pi_1^{(t)} \equiv (\frac{1}{J_1},\ldots,\frac{1}{J_1})$, $\pi_2^{(t)} \equiv (\frac{1}{J_2},\ldots,\frac{1}{J_2})$, where $\bar\theta$ is a matrix with all entries equal to $\arg\max_\theta \sum_{i,j}\log f_\theta(y_{ij})$. Hence, we need to add random noise to the initialization of the variational parameters to escape this saddle point. We consider the following random initialization. Specifically, we sample
$$\phi^{(0)}_{1i} \sim \mathrm{Dir}(\alpha_1) \quad\text{and}\quad \phi^{(0)}_{2j} \sim \mathrm{Dir}(\alpha_2) \quad (9)$$
for $i\in[m_1]$ and $j\in[m_2]$ independently.
Here $\mathrm{Dir}(\alpha)$ represents the Dirichlet distribution with parameter $\alpha$; $\alpha_1$ is a vector of length $J_1$ with all entries equal to 1 and $\alpha_2$ is a vector of length $J_2$ with all entries equal to 1. In other words, $\mathrm{Dir}(\alpha_1)$ and $\mathrm{Dir}(\alpha_2)$ are non-informative priors.

Degenerate case. We first consider a degenerate situation where $J_2 = 1$; the biclustering model then reduces to a latent class model. Under this simplified setting, we aim to work out the global convergence properties. We first additionally assume that $J_1 = 2$. Then we are able to show that the algorithm always gives a consistent estimator as long as the model is identifiable (i.e., the model satisfies Assumption A1, $\theta^*_{11} \ne \theta^*_{21}$).

Theorem 8 Under Assumptions A1-A4 and the setting $J_1 = 2$ and $J_2 = 1$, Algorithm 1 returns a consistent estimator with probability tending to 1 as both $m_1, m_2 \to \infty$ when the initialization satisfies (9).

Actually, we have the stronger result that the algorithm only fails when the $\phi_{1i}$'s lie on a certain measure-zero set. A precise statement is as follows.

Theorem 9 Under Assumptions A1-A4 and the setting $J_1 = 2$ and $J_2 = 1$, Algorithm 1 fails to return a consistent estimator if and only if
$$\arg\max_\theta \sum_{i,j}\phi^{(0)}_{1i}[1]\log f_\theta(y_{ij}) = \arg\max_\theta \sum_{i,j}\phi^{(0)}_{1i}[2]\log f_\theta(y_{ij}). \quad (10)$$

We can easily see that naive random initialization makes (10) hold with probability zero. This explains the usefulness of random initialization. This result is related to the EM method for estimating Gaussian mixture models: Xu et al. (2016) fully characterize the global convergence of the EM algorithm for a mixture of two equal-proportion Gaussians.

When $J_1 \ge 3$, the story is different: the algorithm might be trapped at local optimizers. When $J_1 = 3$, we can relabel the latent classes such that $\theta^*_{11} > \theta^*_{21} > \theta^*_{31}$. We define $\bar\theta$ as
$$\bar\theta := \arg\max_\theta \Big\{\sum_{k=1}^{3}\pi^*_1[k]\,\mathbb{E}_{y\sim f_{\theta^*_{k1}}}\log f_\theta(y)\Big\},$$
which can be viewed as a population mean. We similarly define
$$\bar\theta_{k_1 k_2} := \arg\max_\theta \Big\{\pi^*_1[k_1]\,\mathbb{E}_{y\sim f_{\theta^*_{k_1 1}}}\log f_\theta(y) + \pi^*_1[k_2]\,\mathbb{E}_{y\sim f_{\theta^*_{k_2 1}}}\log f_\theta(y)\Big\},$$
which can be viewed as the population parameter of groups $k_1$ and $k_2$. We further define
$$g_k = \mathbb{E}_{y\sim f_{\theta^*_{k1}}}\Big[\tfrac{\partial}{\partial\theta}\log f_\theta(y)\Big|_{\theta=\bar\theta}\Big], \quad k\in\{1,2,3\}. \quad (11)$$
The slope $g_k$ quantifies the gap between $\theta^*_{k1}$ and $\bar\theta$: when $\theta^*_{k1} = \bar\theta$, we have $g_k = 0$. In addition, we define several variance quantities, $v_1 = v_2 = \mathbb{E}[(\phi^{(0)}_{1i}[1] - 1/3)^2]$, $v_{12} = \mathbb{E}[(\phi^{(0)}_{1i}[1] - 1/3)(\phi^{(0)}_{1i}[2] - 1/3)]$, and let
$$V = \begin{pmatrix} v_1 & v_{12}\\ v_{12} & v_2 \end{pmatrix}.$$

Theorem 10 Under Assumptions A1-A4 and the setting $J_1 = 3$, $J_2 = 1$, $\theta^*_{31} < \theta^*_{21} < \theta^*_{11}$ and
$$\mathbb{E}_{y\sim f_{\theta^*_{21}}}\log f_{\bar\theta_{12}}(y) > \mathbb{E}_{y\sim f_{\theta^*_{21}}}\log f_{\theta^*_{31}}(y), \quad (12)$$
the probability that Algorithm 1 fails to return a consistent estimator is at least $\mathbb{P}(g_2 Z_1 + g_3 Z_2 > 0,\ Z_1 > 0,\ Z_2 > 0)$ with $(Z_1, Z_2) \sim N(0, V)$, when both $m_1, m_2 \to \infty$.

Here condition (12) implies that Group 2 is closer to Group 1 than to Group 3. When $\bar\theta$ is close to $\theta^*_{21}$, it is easier for the algorithm to find the global optimum. By contrast, when $\bar\theta$ is close to $\theta^*_{31}$, Class 3 becomes the dominating group. The algorithm may then be stuck at a local optimum in which Classes 1 and 2 are merged into one group and the dominating Class 3 is split into two groups; the algorithm can never jump out of this local optimum due to constraint (12). By Theorem 10, we know that the algorithm may not always converge to the global maximizer and hence may give an inconsistent estimator when $J_1 \ge 3$.
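The lower bound in Theorem 10 is easy to evaluate numerically. The sketch below (ours) estimates $\mathbb{P}(g_2 Z_1 + g_3 Z_2 > 0,\ Z_1 > 0,\ Z_2 > 0)$ by Monte Carlo, with $V$ obtained from draws of the Dirichlet initialization (9) (so $v_1, v_2, v_{12}$ are estimated rather than computed in closed form) and with $g_2, g_3$ set to hypothetical values, since the true slopes in (11) are model-dependent.

```python
import numpy as np

# Monte Carlo evaluation of the lower bound in Theorem 10:
# P(g2*Z1 + g3*Z2 > 0, Z1 > 0, Z2 > 0) with (Z1, Z2) ~ N(0, V).
rng = np.random.default_rng(0)

# V from the Dirichlet(1, 1, 1) initialization (9): covariance of
# (phi[0] - 1/3, phi[1] - 1/3), estimated here by simulation.
phi0 = rng.dirichlet(np.ones(3), size=200_000)
V = np.cov(phi0[:, 0] - 1/3, phi0[:, 1] - 1/3)

g2, g3 = -0.4, 1.5   # hypothetical slopes g_k from (11); model-dependent in practice
Z = rng.multivariate_normal(np.zeros(2), V, size=200_000)
fail_lb = np.mean((g2 * Z[:, 0] + g3 * Z[:, 1] > 0) & (Z[:, 0] > 0) & (Z[:, 1] > 0))
print(f"lower bound on failure probability ~ {fail_lb:.3f}")
```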
Case $J_1 = 2$ and $J_2 = 2$.

Theorem 11 Under the Gaussian/Bernoulli/Poisson model satisfying Assumptions A1-A3 with $J_1 = J_2 = 2$ and $\pi_1 = \pi_2 = (\frac{1}{2}, \frac{1}{2})$, Algorithm 1 returns a consistent estimator with probability tending to 1 as $m_1, m_2 \to \infty$ when the initialization satisfies (9) and
$$(\theta^*_{11} - \theta^*_{21})(\theta^*_{12} - \theta^*_{22}) > 0, \qquad (\theta^*_{11} - \theta^*_{12})(\theta^*_{21} - \theta^*_{22}) > 0. \quad (13)$$

Theorem 11 guarantees global convergence when the true parameters are well separated in the sense of (13). For example, consider a Bernoulli model with $\pi_1 = \pi_2 = (0.5, 0.5)$ and
$$\theta^* = \begin{pmatrix} 0.7 & 0.3\\ 0.3 & 0.7 \end{pmatrix},$$
which violates (13); in this case the algorithm can be trapped at a local optimum. Condition (13) is only sufficient; it is of interest to obtain necessary conditions in the setting $J_1 = J_2 = 2$ in future work.

Technical challenges. 1) To establish the consistency results, different types of concentration inequalities are needed for the variational parameters and the model parameters separately. 2) To compute the estimation bound under general response distribution functions, the calculation is more involved and tedious. 3) To study the global convergence of CAVI, we need to carefully calculate the differences between class-specific model parameters; the computation is overwhelming.

Figure 1. Plots of estimation errors against the sample size $m$ under different settings (panels: Binary with $J_1 = J_2 = 2$, Binary with $J_1 = J_2 = 3$, Poisson with $J_1 = J_2 = 2$, Poisson with $J_1 = J_2 = 3$). The upper two plots are for Bernoulli models and the bottom two are for Poisson models. Standard error bars are also plotted.

5. Numerical Results

In this section, we collect several numerical experiments to support our theory, i.e., to validate the tightness of the estimation bound and the global convergence of the CAVI algorithm.

Estimation bound. We consider the Bernoulli biclustering model (i.e., $Y_{ij}\sim\mathrm{Bernoulli}(\theta_{z_{1i}z_{2j}})$) and the Poisson biclustering model (i.e., $Y_{ij}\sim\mathrm{Poisson}(\theta_{z_{1i}z_{2j}})$). We set the sample sizes $m_1 = m_2 = m$, where $m$ takes values in $\{100, 200, 300, 400, 500\}$. We set the number of classes $J_1 = J_2 = J = 2$ or 3. The true parameter $\theta^*$ is randomly generated, and $\pi_1, \pi_2$ are set to the uniform prior. For each setting, we run 100 replications. Estimation errors ($\frac{1}{J_1J_2}\|\hat\theta - \theta^*\|_F$) with corresponding standard errors are shown in Figure 1. Based on the curves, we can see that the estimation error decreases as the sample size increases. When the number of classes increases, the error becomes larger. In addition, when $J$ is fixed, the estimation error decays at the rate $1/\sqrt{m}$ (equivalently, the squared error decays at the rate $1/m$), with the variational approximation error vanishing (less than 1e-9). This matches the results stated in Theorem 3.

Global convergence. We take $J_1 = 3$, $J_2 = 1$ and consider the mixture Bernoulli model and the mixture Gaussian model. We fix the parameters $\theta_{11}, \theta_{21}$ and the population prior $\pi_1$, but let $\theta_{31}$ vary. We want to study the probability of global convergence as $\theta_{31}$ changes. The detailed settings and corresponding results under different sample sizes can be found in the caption of Figure 2. Each setting is replicated 100 times. It is straightforward to see that group 3 is the dominating class (i.e., $\pi_1[3]$ has the largest value). From Figure 2, when $\theta_{31}$ is much smaller than $\theta_{11}, \theta_{21}$, it becomes harder to find the global optimum. There is a non-zero probability of failure in recovering the true parameter. As $\theta_{31}$ increases, the recovery probability increases up to 1. This phenomenon matches our theory.

Figure 2. Probability of global convergence (recovery %) against $\theta_{31}$ for binary-response models (upper two panels) and Gaussian-response models (bottom two panels) under different settings. The parameter choices are as follows. Upper left: $\pi_1 = (0.1, 0.2, 0.7)$, $\theta_{11} = 0.9$, $\theta_{21} = 0.7$. Upper right: $\pi_1 = (0.1, 0.2, 0.7)$, $\theta_{11} = 0.9$, $\theta_{21} = 0.6$. Bottom left: $\pi_1 = (0.2, 0.2, 0.6)$, $\theta_{11} = 1$, $\theta_{21} = 0$. Bottom right: $\pi_1 = (0.3, 0.1, 0.6)$, $\theta_{11} = 1$, $\theta_{21} = 0$. For all four cases, $m_1 = 500$ and $m_2 \in \{250, 500, 750, 1000\}$.
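For completeness, the short driver below (ours) reproduces the flavor of the estimation-bound experiment using the `simulate_biclustering` and `cavi_gaussian` sketches defined earlier in this paper's code illustrations; it uses Gaussian responses as a stand-in for the Bernoulli/Poisson models of Figure 1 and reports the scaled squared error $\frac{1}{J_1J_2}\|\hat\theta-\theta^*\|_F^2$ (as in Theorem 3) after aligning block labels.

```python
import numpy as np

# Requires simulate_biclustering and cavi_gaussian from the earlier sketches.
J = 2
pi1 = pi2 = np.full(J, 1.0 / J)
theta_star = np.array([[1.0, -1.0], [-1.0, 1.0]])
for m in (100, 200, 300):
    errs = []
    for rep in range(20):
        Y, z1, z2 = simulate_biclustering(m, m, pi1, pi2, theta_star, rng=rep)
        theta_hat, phi1, phi2, _, _ = cavi_gaussian(Y, J, J, n_iter=50, seed=rep)
        # align the estimated blocks to the truth over row/column label permutations
        best = min(np.linalg.norm(theta_hat[np.ix_(p, q)] - theta_star) ** 2
                   for p in ([0, 1], [1, 0]) for q in ([0, 1], [1, 0]))
        errs.append(best / (J * J))
    print(m, np.mean(errs))
```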
6. Conclusion

In this paper, we develop a theory of variational inference estimation in biclustering models. Our results are general in the sense that we do not assume any specific response function. The assumptions are mild and are satisfied by most probabilistic models, including but not limited to the SBM, the mixture Gaussian model and the mixture Poisson model. We establish classification consistency, population consistency and parameter consistency. Both upper and lower bounds on the estimation error are also obtained. Our theory answers the question of why variational inference works well for the biclustering problem. In addition, we consider a coordinate ascent variational inference (CAVI) algorithm for parameter estimation. The algorithm is shown to have a local convergence property under a reasonable initialization requirement. On the other hand, with random initialization, global convergence results are also established under several important special settings. This work not only bridges a gap in the literature on VI theory but also gives a deeper understanding of the landscape of biclustering models.

Acknowledgement

The authors sincerely thank the anonymous reviewers and area chair for their constructive comments.

References

Emmanuel Abbe. Community detection and stochastic block models: recent developments. Journal of Machine Learning Research, 18(1):6446-6531, 2017.

Emmanuel Abbe, Afonso S. Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62(1):471-487, 2015.

Edoardo M. Airoldi, Thiago B. Costa, and Stanley H. Chan. Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In Advances in Neural Information Processing Systems (NIPS), pages 692-700, Lake Tahoe, NV, 2013.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956-1982, 2010.

Yunfeng Cai and Ping Li. Solving the robust matrix completion problem via a system of nonlinear equations. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 4162-4172, Online [Palermo, Sicily, Italy], 2020.
Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.

Alain Celisse, Jean-Jacques Daudin, and Laurent Pierre. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electronic Journal of Statistics, 6:1847-1899, 2012.

Yuejie Chi, Yue M. Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239-5269, 2019.

David Choi. Co-clustering of nonsmooth graphons. Annals of Statistics, 45(4):1488-1515, 2017.

Sara Dolnicar, Sebastian Kaiser, Katie Lazarevski, and Friedrich Leisch. Biclustering: Overcoming data dimensionality problems in market segmentation. Journal of Travel Research, 51(1):41-49, 2012.

Chao Gao, Yu Lu, and Harrison H. Zhou. Rate-optimal graphon estimation. Annals of Statistics, 43(6):2624-2652, 2015.

Chao Gao, Yu Lu, Zongming Ma, and Harrison H. Zhou. Optimal estimation and completion of matrices with biclustering structures. Journal of Machine Learning Research, 17(1):5602-5630, 2016.

Chao Gao, Zongming Ma, Anderson Y. Zhang, and Harrison H. Zhou. Achieving optimal misclassification proportion in stochastic block models. Journal of Machine Learning Research, 18(1):1980-2024, 2017.

Gérard Govaert and Mohamed Nadif. Latent block model for contingency table. Communications in Statistics - Theory and Methods, 39(3):416-425, 2010.

Jiajun Gu and Jun S. Liu. Bayesian biclustering of gene expression data. BMC Genomics, 9(1):1-10, 2008.

Yue Guan, Jennifer G. Dy, Donglin Niu, and Zoubin Ghahramani. Variational inference for nonparametric multiple clustering. In MultiClust Workshop, KDD-2010. Citeseer, 2010.

John A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123-129, 1972.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(5), 2013.

Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109-137, 1983.

Chi Jin, Yuchen Zhang, Sivaraman Balakrishnan, Martin J. Wainwright, and Michael I. Jordan. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems (NIPS), pages 4116-4124, Barcelona, Spain, 2016.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.

Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980-2998, 2010.

Olga Klopp, Alexandre B. Tsybakov, and Nicolas Verzelen. Oracle inequalities for network models and sparse graphon estimation. Annals of Statistics, 45(1):316-354, 2017.

Vladimir Koltchinskii, Karim Lounici, and Alexandre B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Annals of Statistics, 39(5):2302-2329, 2011.

Guangcan Liu and Ping Li. Low-rank matrix completion in the presence of high coherence. IEEE Transactions on Signal Processing, 64(21):5623-5633, 2016.

Eleni Matechou, Ivy Liu, Daniel Fernández, Miguel Farias, and Bergljot Gjelsvik. Biclustering models for two-mode ordinal data. Psychometrika, 81(3):611-624, 2016.
Vichi Maurizio. Double k-means clustering for simultaneous classification of objects and variables. In Advances in Classification and Data Analysis, pages 43-52. Springer, 2001.

Ted Meeds and Sam Roweis. Nonparametric Bayesian biclustering. Technical Report UTML TR 2007-001, January 2007.

Elchanan Mossel, Joe Neeman, and Allan Sly. Consistency thresholds for binary symmetric block models. arXiv preprint arXiv:1407.1591, 2014.

Soumendu Sundar Mukherjee, Purnamrita Sarkar, Y. X. Rachel Wang, and Bowei Yan. Mean field for the stochastic blockmodel: Optimization landscape and convergence issues. In Advances in Neural Information Processing Systems (NeurIPS), pages 10717-10727, Montréal, Canada, 2018.

Donglin Niu, Jennifer G. Dy, and Zoubin Ghahramani. A nonparametric Bayesian model for multiple clustering with overlapping feature views. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 814-822, La Palma, Canary Islands, Spain, 2012.

Sofia C. Olhede and Patrick J. Wolfe. Network histograms and universality of blockmodel approximation. Proceedings of the National Academy of Sciences, 111(41):14722-14727, 2014.

Shirley Pledger and Richard Arnold. Multivariate methods using mixtures: Correspondence analysis, scaling and pattern-detection. Computational Statistics & Data Analysis, 71:241-261, 2014.

Amela Prelić, Stefan Bleuler, Philip Zimmermann, Anja Wille, Peter Bühlmann, Wilhelm Gruissem, Lars Hennig, Lothar Thiele, and Eckart Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122-1129, 2006.

Jonathan Templin and Robert A. Henson. Diagnostic Measurement: Theory, Methods, and Applications. Guilford Press, 2010.

Duy Vu and Murray Aitkin. Variational algorithms for biclustering models. Computational Statistics & Data Analysis, 89:12-24, 2015.

Ji Xu, Daniel J. Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems (NIPS), pages 2676-2684, Barcelona, Spain, 2016.

Bowei Yan, Mingzhang Yin, and Purnamrita Sarkar. Convergence of gradient EM on multi-component mixture of Gaussians. In Advances in Neural Information Processing Systems (NIPS), pages 6956-6966, Long Beach, CA, 2017.

Anderson Y. Zhang and Harrison H. Zhou. Theoretical and computational guarantees of mean field variational inference for community detection. Annals of Statistics, 48(5):2575-2598, 2020.

Ruofei Zhao, Yuanzhi Li, and Yuekai Sun. Statistical convergence of the EM algorithm on Gaussian mixture models. Electronic Journal of Statistics, 14(1):632-660, 2020.

Zhixin Zhou and Ping Li. Rate optimal Chernoff bound and application to community detection in the stochastic block models. Electronic Journal of Statistics, 14(1):1302-1347, 2020.