Variational Inference in high-dimensional linear regression

Journal of Machine Learning Research 23 (2022) 1-56. Submitted 6/21; Revised 9/22; Published 10/22.

Sumit Mukherjee (sm3949@columbia.edu), Department of Statistics, Columbia University, New York, NY 10027, USA
Subhabrata Sen (subhabratasen@fas.harvard.edu), Department of Statistics, Harvard University, Cambridge, MA 02138, USA

Editor: Pierre Alquier

Abstract

We study high-dimensional Bayesian linear regression with product priors. Using the nascent theory of non-linear large deviations (Chatterjee and Dembo, 2016), we derive sufficient conditions for the leading-order correctness of the naive mean-field approximation to the log-normalizing constant of the posterior distribution. Subsequently, assuming a true linear model for the observed data, we derive a limiting infinite-dimensional variational formula for the log-normalizing constant of the posterior. Furthermore, we establish that under an additional "separation" condition, the variational problem has a unique optimizer, and this optimizer governs the probabilistic properties of the posterior distribution. We provide intuitive sufficient conditions for the validity of this separation condition. Finally, we illustrate our results on concrete examples with specific design matrices.

Keywords: Variational Inference, Linear Regression, Naive Mean-Field Approximation

©2022 Sumit Mukherjee and Subhabrata Sen. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0696.html.

1. Introduction

In this age of big data, statisticians routinely analyze large, high-dimensional datasets arising from applications in genomics, finance, public policy, etc., with the goal of discovering relationships between the response variable and the observed features. The linear regression model is arguably the most common framework for this task when the response is continuous. Under a Bayesian formalism, the statistician posits a prior distribution $\pi$ for the coefficient vector $\beta$, and constructs the corresponding posterior. Subsequent inference is based solely on this posterior distribution. Two questions are naturally relevant in this setting:

1. What are the statistical properties of Bayesian procedures, particularly in high dimensions?
2. Are these procedures computationally tractable for large datasets with high-dimensional features?

The theoretical performance of high-dimensional Bayesian methods has been examined extensively in recent times. In Bayesian asymptotic theory, given data $(y_i, x_i)_{i=1}^n$ with $x_i \in \mathbb{R}^p$, one assumes the correctness of a frequentist model $y_i = x_i^T \beta_0 + \varepsilon_i$, and studies frequentist inference of the coefficient vector $\beta_0$ under Bayesian procedures. Ghosal (1999) established a version of the traditional Bernstein-von Mises theorem as long as $p^4 \log p / n \to 0$. A lot of recent attention has been focused on the high-dimensional regime $p \gg n$ with an additional assumption on the sparsity of $\beta_0$. In this context, spike-and-slab based approaches (Mitchell and Beauchamp, 1988; Ishwaran and Rao, 2005) have been established to exhibit optimal frequentist properties (Castillo et al., 2015). In particular, these posterior distributions contract to the underlying truth $\beta_0$ at the minimax optimal estimation rate (we refer the interested reader to Banerjee et al.
(2021) for a formal definition of posterior contraction, and a survey of recent breakthroughs in this area). Despite their superior theoretical properties, spike-and-slab based approaches suffer from a computational bottleneck. In the special case of linear regression, continuous shrinkage priors (see e.g. Carvalho et al. (2010)) provide a comparatively more tractable alternative from the computational perspective (Bhattacharya et al., 2016). Contraction properties of the corresponding posteriors were characterized in Song and Liang (2017). Unfortunately, this strategy is specific to the linear model, and does not generalize beyond this setting.

In general, computationally tractable Bayesian inference for high-dimensional models presents significant challenges. MCMC based strategies have been explored extensively for this purpose. Despite rapid progress in MCMC methodology and supporting theory, these methods are still slower than competing frequentist methodology, e.g. those based on convex optimization. Variational methods (Wainwright and Jordan, 2008) provide an attractive general option to the Bayesian statistician. The simplest form of variational inference approximates the true posterior distribution using a product distribution; this version is often referred to as naive mean-field Variational Bayes (nVB). Computing the best approximating product distribution is computationally fast, and thus these methods provide a practical option for modern datasets. We refer the interested reader to Blei et al. (2017) for an introduction to variational Bayes methods in statistics and machine learning.

Although variational methods provide a computationally feasible strategy, supporting theoretical evidence has been relatively scarce. In Wang and Titterington (2006); Wang and Blei (2019b), the authors established the correctness of this approach in parametric models in the classical fixed-$p$ setting. Subsequently, Wang and Blei (2019a) studied variational Bayes in misspecified models. In the context of the linear model, early work by Neville et al. (2014); Ormerod et al. (2017) focused on variational inference in the low-dimensional linear model, while Carbonetto and Stephens (2012) provides an early variational approximation for variable selection. The past two years have witnessed rapid progress in the analysis of variational methods for high-dimensional models (Alquier et al., 2016; Alquier and Ridgway, 2020; Han and Yang, 2019; Ray and Szabó, 2021; Ray et al., 2020; Yang et al., 2020). These results focus on high-dimensional models with a sparse underlying truth $\beta_0$, and study the contraction properties of the variational posterior. In particular, they derive sufficient conditions for the variational posterior to contract at the minimax optimal rate. Correctness of variational methods has also been established for community detection (Bickel et al., 2013; Zhang and Zhou, 2020), a Poisson mixed model (Hall et al., 2011), frequentist models (Westling and McCormick, 2019) and mixture models (Chérief-Abdellatif and Alquier, 2018).

In sharp contrast, nVB methods can fail in truly high-dimensional settings, e.g. in versions of topic modeling (Ghorbani et al., 2019). In this specific situation, the correct variational approximation is provided by the TAP approximation from spin-glass theory (Mézard et al., 1987). Fan et al.
(2018) establishes rigorous guarantees regarding the validity of the TAP approach in the context of the simpler Z2 synchronization problem. Evidence regarding the inadequacy of the nVB approximation also arises in the work of Fasano et al. (2019). The authors analyze the nVB approximation for probit regression in the regime $n$ fixed and $p \to \infty$. In this regime, the variational posterior exhibits an over-shrinking phenomenon; to remedy this issue, the authors propose an alternative, locally factored approximation, which resolves this over-shrinking problem in this setting.

There is thus an immediate need to understand general properties of statistical problems which ensure the correctness of the nVB approximation. In this paper, we study the accuracy of this approximation in Bayesian linear regression with product priors. We provide easily verifiable conditions which ensure the asymptotic accuracy of the nVB approximation. Further, we illustrate that the approximation can yield detailed information regarding the statistical properties of this model, which might be unavailable using other techniques. We elaborate on our specific contributions below.

1.1 Contributions

Our main contributions in this paper are as follows:

(i) In Theorem 7, we provide sufficient conditions for the asymptotic tightness of the nVB approximation. The conditions are easily verifiable, and can be explicitly checked in specific applications. This provides rigorous theoretical support for widely used mean-field approximation based methodology.

(ii) Assuming a true frequentist linear model for the data, we derive a limiting variational formula for the log-normalizing constant in Theorem 17. We emphasize that in contrast to existing results, we do not assume any sparsity on the true regression coefficients. We refer the interested reader to Lee et al. (2020) for a motivating discussion on the importance of such scenarios in scientific applications.

(iii) Under an additional separation condition (11), we establish that the limiting variational problem has a unique optimizer (see Theorem 20 for a precise statement). Under this separation condition, empirical averages of samples drawn from the posterior distribution concentrate around explicit deterministic limits; in addition, these limits are characterized by the optimizer (Corollary 22). We also provide interpretable sufficient conditions which enforce this separation condition (Lemma 25). We emphasize that in existing analyses of high-dimensional Bayesian linear regression, properties of the posterior are directly established (Castillo et al., 2015), and the variational posterior is analyzed independently (Ray and Szabó, 2021). Our approach is inherently different: we first establish the correctness of the mean-field approximation, and then study the posterior through the lens of the mean-field approximation formula.

(iv) We further illustrate our general results by applying them to three specific examples: a two-factor ANOVA model, a Gaussian design setting with spiked covariance, and a sparse Bernoulli design. In each case, we identify the specific limiting functional which determines the limiting log-normalizing constant.

(v) From a theoretical perspective, our results crucially utilize recent advances in the theory of non-linear large deviations.
Initiated in the seminal paper Chatterjee and Dembo (2016), the theory of non-linear large deviations was originally conceived to answer some deep questions concerning large deviations of sub-graph counts in sparse random graphs. In Basak and Mukherjee (2017), one of the authors successfully utilized this framework to establish the tightness of the naive mean-field approximation for the log-partition function in a family of Potts models. To the best of our knowledge, these developments have not been utilized previously for other statistical models. In this paper, we demonstrate the usefulness of these tools in the context of high-dimensional statistics; we hope that this spurs an in-depth study of their applicability for other high-dimensional problems. We consider this to be a key conceptual contribution of this paper.

(vi) Our results are intimately related to local asymptotic scaling regimes from classical statistics (Van der Vaart, 2000). In fact, for high-dimensional linear regression with iid Gaussian design, our results can be viewed as natural extensions of classical local-asymptotic results under the iterated limit $p \to \infty$ following $n \to \infty$. We elaborate on this connection in Section 3.

1.2 Non-linear large deviations and related results

In a breakthrough paper, Chatterjee and Dembo (Chatterjee and Dembo, 2016) introduced the theory of non-linear large deviations with the goal of studying large deviations for non-linear functions of Bernoulli random variables. As an application of this general machinery, they characterized sharp deviation probabilities for sub-graph counts in sparse Erdős-Rényi random graphs. Subsequent extensions by Eldan (Eldan, 2018) and Augeri (Augeri, 2019) allow one to track a wider regime of sparsity for the binary variables. In a different direction, Yan (Yan, 2020) extended the Chatterjee-Dembo framework to general bounded Banach-space valued variables. See also Austin (2019) for related decompositions of general Gibbs measures using information theoretic ideas. These results have galvanized the study of large deviations for sub-graph counts on sparse random graphs, and the past three years have witnessed rapid progress in this direction. We refer the interested reader to Cook et al. (2021) and references therein for a survey of recent progress in this area.

At the heart of the Chatterjee-Dembo framework lies a tight approximation bound for the log-normalizing constant of general Gibbs type distributions in terms of the naive mean-field approximation formula. This framework was utilized by one of the authors (Basak and Mukherjee, 2017) to derive asymptotic limits for the log-normalizing constant of Potts models on several sequences of graphs. Similar results were independently derived by (Jain et al., 2018, 2019) using different techniques.

In this paper, we study Bayesian linear regression through the lens of non-linear large deviations, and uncover precise statistical properties of these models by analyzing the mean-field variational problem. We observe $\{(y_i, x_i) : 1 \le i \le n\}$, $y_i \in \mathbb{R}$, $x_i \in \mathbb{R}^p$. We set $y = (y_i) \in \mathbb{R}^n$ and $X^T = [x_1, \ldots, x_n]$. Throughout, we work in an asymptotic setting where $p$ and $n = n(p)$ are both going to $\infty$. A natural Bayesian model for such a dataset assumes

\[ \beta_1, \ldots, \beta_p \overset{iid}{\sim} \pi, \quad \beta^T = (\beta_1, \ldots, \beta_p), \quad y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n). \quad (1) \]

In the display above, $\pi$ is a probability distribution supported on $[-1, 1]$, and we consider an iid prior on the regression coefficients.
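For concreteness, here is a minimal simulation sketch of the model (1). It is illustrative only: the uniform prior on $[-1,1]$, the Gaussian design, and all dimensions are assumptions made for the example, not choices prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 500, 100, 1.0

# Coefficients drawn iid from a prior pi supported on [-1, 1];
# a uniform prior is used here purely for illustration.
beta = rng.uniform(-1.0, 1.0, size=p)

# Model (1): y = X beta + eps with eps ~ N(0, sigma^2 I_n).
X = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))
y = X @ beta + rng.normal(0.0, sigma, size=n)

# Objects entering the posterior: z = X^T y, and the split X^T X = A_p + D_p
# into off-diagonal and diagonal parts, as defined below.
z = X.T @ y
XtX = X.T @ X
D_p = np.diag(np.diag(XtX))
A_p = XtX - D_p
```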
For our subsequent discussion, the precise interval $[-1, 1]$ for the prior support is not crucial: our arguments go through unchanged as long as the prior has bounded support. Note that in the context of Bayesian inference for linear regression, the regression parameters are often drawn from a parametric family, and the parameters specifying the prior are, in turn, sampled from a hyper-prior. In our discussions, we will restrict ourselves to the simpler setting where the prior $\pi$ is fixed; however, we do not make any assumptions on $\pi$ apart from the bounded support assumption. Throughout, we assume that the noise variance $\sigma^2 > 0$ is fixed and known to the statistician.

Given the Bayesian model (1), one naturally constructs the posterior distribution

\[ d\mu_{y,X}(\beta) \propto \exp\Big( -\frac{1}{2\sigma^2} \| y - X\beta \|_2^2 \Big) \propto \exp\Big( -\frac{1}{2\sigma^2} \big[ \beta^T A_p \beta + \beta^T D_p \beta - 2 z^T \beta \big] \Big), \]

where $z = X^T y$, and $D_p$, $A_p$ are the diagonal and off-diagonal matrices obtained from $X^T X$. More precisely,

\[ A_p(i, i) := 0, \quad A_p(i, j) := (X^T X)_{ij} \text{ for } i \neq j, \qquad D_p(i, i) := (X^T X)_{ii}, \quad D_p(i, j) := 0 \text{ for } i \neq j. \]

Since the term $\beta^T D_p \beta$ is additive in the components of $\beta$, we can absorb it into the base measure $\pi^{\otimes p}$, via the following definition:

Definition 1 (Exponential Family) For any $\gamma := (\gamma_1, \gamma_2) \in \mathbb{R}^2$, define a probability measure $\pi_\gamma$ on $[-1, 1]$ as

\[ d\pi_\gamma(z) := \exp\Big( \gamma_1 z - \frac{\gamma_2}{2} z^2 - c(\gamma) \Big)\, d\pi(z), \quad c(\gamma) := \log \int_{[-1,1]} \exp\Big( \gamma_1 z - \frac{\gamma_2}{2} z^2 \Big)\, d\pi(z). \]

Using this definition, we can write the posterior distribution $\mu$ as

\[ \mu_{y,X}(d\beta) \propto \exp\Big( -\frac{1}{2\sigma^2} \big[ \beta^T A \beta - 2 z^T \beta \big] \Big) \prod_{i=1}^p \pi_i(d\beta_i), \quad \text{where } \pi_i := \pi_{(0, d_i)}, \; d_i := \frac{D_p(i,i)}{\sigma^2}. \]

A central object in the theoretical study of these posterior distributions is the normalizing constant, also referred to as the partition function in statistical physics parlance. Formally, we define

\[ Z_p(y, X) = \int_{[-1,1]^p} \exp\Big( -\frac{1}{2\sigma^2} \big[ \beta^T A \beta - 2 z^T \beta \big] \Big) \prod_{i=1}^p \pi_i(d\beta_i). \quad (2) \]

The partition function is intractable for most priors, unless special conjugacy properties are satisfied between the prior and the likelihood. Henceforth, we suppress the dependence of $\mu$, $Z_p$ on $y$, $X$ whenever it is clear from the context. The classical Gibbs variational principle characterizes the partition function as

\[ \log Z_p = \sup_{Q} \Big\{ \mathbb{E}_Q\Big[ -\frac{1}{2\sigma^2} \big( \beta^T A \beta - 2 z^T \beta \big) \Big] - D_{KL}\Big( Q \,\Big\|\, \prod_{i=1}^p \pi_i \Big) \Big\}, \]

where the supremum ranges over probability distributions $Q$ on $[-1,1]^p$ (see e.g. Wainwright and Jordan (2008)). In fact, the supremum in the above variational problem is attained by $Q = \mu$. The naive mean-field approximation to $Z_p$ restricts the supremum to product distributions, and thus obtains a universal lower bound. Formally, we have

\[ \log Z_p \geq \sup_{Q = \prod_{i=1}^p Q_i} \Big\{ \mathbb{E}_Q\Big[ -\frac{1}{2\sigma^2} \big( \beta^T A \beta - 2 z^T \beta \big) \Big] - D_{KL}\Big( Q \,\Big\|\, \prod_{i=1}^p \pi_i \Big) \Big\}. \]

The optimizer in the display above provides the best approximation to $\mu$ among product distributions under KL divergence. This paper focuses on the tightness of the naive mean-field lower bound in the context of linear regression. For an in-depth survey of variational inference, we refer the interested reader to Bishop (2006); Blei et al. (2017); Wainwright and Jordan (2008).

Outline: The rest of the paper is structured as follows. We collect our results in Section 2. We discuss some directions for future enquiry in Section 3. The main results are established in Section 4. We defer some proofs to the Appendix.

2. Main results

We collect our main results in this section. To this end, we first discuss some elementary facts regarding exponential families. The next result collects some analytic properties of the cumulant generating function $c(\cdot)$, which will be relevant for our subsequent discussion. For the sake of completeness, we provide a proof in the Appendix.
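Before stating the lemma, we illustrate Definition 1 numerically. The sketch below evaluates $c(\gamma)$, its first partial derivative $\dot c$ in $\gamma_1$, and the inverse map $h(t, \gamma_2)$ of Lemma 2(ii) by quadrature and bisection. The uniform base measure $\pi$ and the bracket width in the root-finder are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def c(g1, g2):
    """c(gamma) = log of the integral of exp(g1*z - (g2/2) z^2) dpi(z) over [-1, 1];
    here pi = Uniform[-1, 1] (density 1/2), an illustrative choice of base measure."""
    val, _ = quad(lambda z: 0.5 * np.exp(g1 * z - 0.5 * g2 * z ** 2), -1.0, 1.0)
    return np.log(val)

def c_dot(g1, g2):
    """Partial derivative of c in gamma_1, i.e. the mean of pi_gamma (cf. Lemma 30(i))."""
    logZ = c(g1, g2)
    val, _ = quad(lambda z: z * 0.5 * np.exp(g1 * z - 0.5 * g2 * z ** 2 - logZ), -1.0, 1.0)
    return val

def h(t, g2, bracket=50.0):
    """Inverse of c_dot in its first argument (Lemma 2(ii)), found by bisection;
    valid for t well inside (-1, 1) given the finite bracket."""
    return brentq(lambda g1: c_dot(g1, g2) - t, -bracket, bracket)
```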
Lemma 2 Let $c(\cdot) : \mathbb{R}^2 \to \mathbb{R}$ be as in Definition 1. Assume that both $-1$ and $1$ belong to the support of $\pi$. Then the following conclusions hold.

(i) $\dot{c}(\gamma_1, \gamma_2) := \frac{\partial c(\gamma_1, \gamma_2)}{\partial \gamma_1}$ is strictly increasing in $\gamma_1$, with $\lim_{\gamma_1 \to \infty} \dot{c}(\gamma_1, \gamma_2) = 1$ for every $\gamma_2 \in \mathbb{R}$.

(ii) For any $t \in (-1, 1)$, there exists a unique $h(t, \gamma_2)$ such that $\dot{c}(h(t, \gamma_2), \gamma_2) = t$. Further, $\lim_{t \to 1} h(t, \gamma_2) = \infty$ for every $\gamma_2 \in \mathbb{R}$.

Armed with these basic facts, we can formally state the naive mean-field approximation to the log-normalizing constant (2).

Definition 3 Define a possibly extended real valued function $G$ on $[-1, 1] \times \mathbb{R}$ by setting

\[ G(u, d) := u\, h(u, d) - c(h(u, d), d) + c(0, d) \quad \text{if } u \in (-1, 1),\ d \in \mathbb{R}, \]
\[ G(1, d) := D_{KL}(\pi_+ \| \pi_{(0,d)}), \qquad G(-1, d) := D_{KL}(\pi_- \| \pi_{(0,d)}), \quad d \in \mathbb{R}, \]

where $\pi_+$ and $\pi_-$ are the degenerate distributions which put mass 1 at $1$ and $-1$ respectively.

We will need the following facts about the derivatives of $G$. We defer the proof of Lemma 4 to the Appendix.

Lemma 4 We have, for $u \in (-1, 1)$ and $d \in \mathbb{R}$,

\[ \frac{\partial G}{\partial u}(u, d) = h(u, d), \qquad \frac{\partial G}{\partial d}(u, d) = \frac{1}{2} \int_{[-1,1]} z^2\, d\pi_{(h(u,d), d)}(z) - \frac{1}{2} \int_{[-1,1]} z^2\, d\pi_{(0,d)}(z), \]
\[ \frac{\partial^2 G}{\partial u^2}(u, d) = \frac{1}{\ddot{c}(h(u, d), d)} > 0. \]

Consequently, we have $\sup_{u \in (-1,1), d \in \mathbb{R}} \big| \frac{\partial G}{\partial d}(u, d) \big| \leq 1$.

Definition 5 Define $M_p : [-1, 1]^p \to \mathbb{R}$ as

\[ M_p(u) := -\frac{1}{2\sigma^2} \big[ u^T A u - 2 z^T u \big] - \sum_{i=1}^p G(u_i, d_i). \quad (3) \]

Lemma 6 With $M_p(\cdot)$ as in (3), we have

\[ \sup_{Q = \prod_{i=1}^p Q_i} \Big\{ \mathbb{E}_Q\Big[ -\frac{1}{2\sigma^2} \big( \beta^T A \beta - 2 z^T \beta \big) \Big] - D_{KL}\Big( Q \,\Big\|\, \prod_{i=1}^p \pi_i \Big) \Big\} = \sup_{u \in [-1,1]^p} M_p(u). \quad (4) \]

This result follows immediately from elementary facts about exponential families. We refer the interested reader to (Wainwright and Jordan, 2008, Section 5.3).

2.1 Validity of the naive mean-field approximation

Throughout the paper, we use the usual Landau notation $O(\cdot)$ and $o(\cdot)$ for deterministic sequences dependent on $p$.

Theorem 7 Assume that the matrix $A_p$ satisfies the two conditions

\[ \mathrm{tr}(A_p^2) = o(p), \quad (5) \]
\[ \sup_{u \in [-1,1]^p} \sum_{i=1}^p \Big| \sum_{j=1}^p A_p(i, j) u_j \Big| = O(p). \quad (6) \]

(i) We have, setting $R_p := \sup_{u \in [-1,1]^p} M_p(u)$, as $p \to \infty$,

\[ |\log Z_p - R_p| = o(p). \quad (7) \]

(ii) If $b_i := \mathbb{E}_\mu(\beta_i \mid \beta_k, k \neq i)$, then the vector $b := (b_1, \ldots, b_p)$ satisfies $M_p(b) - R_p = o(p)$ in $\mu$-probability.

(iii) Suppose there exists $\hat{u} \in [-1, 1]^p$ which satisfies, for every $\eta > 0$,

\[ \limsup_{p \to \infty} \frac{1}{p} \Big[ \sup_{u \in [-1,1]^p : \| u - \hat{u} \|_2^2 \geq p\eta} M_p(u) - R_p \Big] < 0. \quad (8) \]

Assume further that the empirical measure $\frac{1}{p} \sum_i \delta_{D_p(i,i)}$ is uniformly integrable. Then, for any $\varepsilon > 0$ and for any continuous function $\zeta : [-1, 1] \times [0, 1] \to \mathbb{R}$ we have

\[ \mu\Big( \Big| \frac{1}{p} \sum_{i=1}^p \zeta\Big(\beta_i, \frac{i}{p}\Big) - \frac{1}{p} \sum_{i=1}^p \int_{[-1,1]} \zeta\Big(z, \frac{i}{p}\Big)\, \pi_{(h(\hat{u}_i, d_i), d_i)}(dz) \Big| > \varepsilon \Big) \to 0. \]

Remark 8 The careful reader would have already noticed that Theorem 7 is stated for deterministic data $(y, X)$. In practical applications, it is often more natural to assume that the data $(y, X)$ is, in turn, sampled from some underlying distribution $\mathbb{P}$. The conclusions of Theorem 7 continue to hold as long as the sufficient conditions hold asymptotically with high probability under $\mathbb{P}$.

The condition (6) is extremely mild, and specifies that the log-normalizing constant is asymptotically of order $p$. This condition is satisfied, for example, whenever the matrix $A_p = X^T X - D_p$ has spectral norm $O(1)$. The assumption (5) is the non-trivial assumption in the statement above, and effectively guarantees the accuracy of the naive mean-field approximation to leading exponential order. Note that if $\lambda_1 \geq \cdots \geq \lambda_p$ denote the eigenvalues of $A_p$, then $\mathrm{tr}(A_p^2) = \sum_{i=1}^p \lambda_i^2$. Thus the requirement $\mathrm{tr}(A_p^2) = o(p)$ can be qualitatively interpreted to mean that the eigenstructure of $A$ is dominated by a few top eigenvalues. Similar results were derived in the context of the naive mean-field approximation for Potts models by one of the authors in Basak and Mukherjee (2017). Covariance matrices with an approximately low rank are ubiquitous in modern datasets, and we believe these conditions are satisfied in diverse applications of practical interest. Our result provides formal evidence for the correctness of widely used mean-field approximations in these settings.
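Both conditions of Theorem 7 can be probed numerically for a given design. The sketch below computes $\mathrm{tr}(A_p^2)/p$ for (5), together with the spectral norm of $A_p$ (by the remark above, $\|A_p\|_2 = O(1)$ implies (6)) and a crude upper bound for the left side of (6). The spiked design used in the example is an illustrative assumption.

```python
import numpy as np

def mean_field_diagnostics(X):
    """Finite-p diagnostics for conditions (5) and (6) of Theorem 7."""
    XtX = X.T @ X
    A = XtX - np.diag(np.diag(XtX))       # off-diagonal part A_p
    p = A.shape[0]
    return {
        "tr(A^2)/p": np.trace(A @ A) / p,      # condition (5): should be small
        "||A||_2": np.linalg.norm(A, 2),       # O(1) spectral norm implies (6)
        "sum|A_ij|/p": np.abs(A).sum() / p,    # crude upper bound for (6), divided by p
    }

# Example: spiked Gaussian design in the spirit of Corollary 10 (sizes illustrative).
rng = np.random.default_rng(1)
n, p = 2000, 200
v = rng.normal(size=p); v /= np.linalg.norm(v)
Gamma_half = np.eye(p) + 0.5 * np.outer(v, v)  # square root of a spiked covariance
X = rng.normal(size=(n, p)) @ Gamma_half / np.sqrt(n)
print(mean_field_diagnostics(X))
```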
We prove Theorem 7 in Section 4.1. The third part of our theorem provides further insights into the posterior distribution, assuming that the naive mean-field approximation is dominated by a unique factorized distribution. In the language of statistical physics, such distributions are said to be in a "pure phase".

We now demonstrate a few applications of Theorem 7 to concrete examples, which cover both deterministic and random design matrices. We defer the proofs of these corollaries to Appendix C.

Corollary 9 Let $X$ be any sequence of deterministic design matrices with $X^T X = A_p + D_p$, where $A_p$ and $D_p$ represent the off-diagonal and diagonal parts of $X^T X$ as before. Suppose $A_p$ satisfies (5) and (6), and the empirical measure $\frac{1}{p} \sum_{i=1}^p \delta_{D_p(i,i)}$ is uniformly integrable. Then the conclusion of Theorem 7 holds.

Corollary 10 Suppose that the $i$th row of the design matrix $X$ equals $n^{-1/2} x_i$, where $\{x_i\}_{1 \leq i \leq n} \overset{iid}{\sim} N(0, \Gamma_p)$. Assume that $p = o(n)$, and the following conditions hold:

(a) The off-diagonal part $\Gamma_{p,\mathrm{off}}$ of the covariance matrix $\Gamma_p$ satisfies $\mathrm{tr}(\Gamma_{p,\mathrm{off}}^2) = o(p)$.
(b) $\| \Gamma_p \|_2 = O(1)$.

Then the conclusion of Theorem 7 holds.

Sparse design matrices arise routinely in coding theory (Gallager, 1962; MacKay, 1999) and genomics (Wu et al., 2011; Tang et al., 2014). Our next corollary discusses the accuracy of the nVB approximation in the context of a linear regression problem with a sparse Bernoulli design. We assume that the entries of the design are independent, but not necessarily identical.

Corollary 11 Suppose that the $(i, j)$th entry of the design matrix $X$ equals $\sqrt{\frac{p}{n}}\, B(i, j)$, where $\{B(i, j)\}_{1 \leq i \leq n, 1 \leq j \leq p}$ are mutually independent Bernoullis with $P(B(i, j) = 1) \leq \frac{\lambda}{p}$, for some $\lambda > 0$ free of $p$. If $p = o(n)$, the conclusion of Theorem 7 applies.
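The sparse Bernoulli design of Corollary 11 can be probed with the same diagnostic; a minimal sketch (all constants illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 5000, 200, 3.0

# Corollary 11: X_ij = sqrt(p/n) B_ij with independent Bernoullis, P(B_ij = 1) <= lam/p.
B = (rng.random((n, p)) < lam / p).astype(float)
X = np.sqrt(p / n) * B

XtX = X.T @ X
A = XtX - np.diag(np.diag(XtX))
print("tr(A^2)/p =", np.trace(A @ A) / p)   # small when p = o(n), in line with (5)
```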
2.2 Scaling limit for the log-normalizing constant

Under the asymptotic validity of the naive mean-field approximation, we derive an asymptotic scaling limit for the log-normalizing constant. The limiting description is in terms of an infinite-dimensional variational problem. We will subsequently illustrate that under an additional separation condition, the NMF variational problem at finite $n, p$ has a unique near optimizer, which can be approximated in terms of the optimizer of the associated infinite-dimensional limiting problem. As a concrete takeaway, the optimizer of the limiting problem characterizes the typical behavior of empirical averages under the posterior. To this end, we will require some notation.

The theory of dense graph limits was developed in Borgs et al. (2008, 2012); Lovász and Szegedy (2007), and has received tremendous attention over the last decade in probability, combinatorics, computer science and statistics. We refer the interested reader to Lovász (2012) for an in-depth survey of this area. Follow-up work of Borgs et al. (Borgs et al., 2019, 2018) has extended this theory significantly beyond the regime of dense graphs; the resulting $L^p$-convergence theory can handle sparse graphs and weighted matrices. Utilizing this set-up, our next result will show that under the assumption that the off-diagonal matrix $A_p$ obtained from $X^T X$ converges in cut norm to a suitable graphon (which need not be bounded), the corresponding log-normalizing constant converges in probability to the value of a deterministic optimization problem. We note that the use of cut norms on matrices precedes the development of graph limit theory (see e.g. Frieze and Kannan (1999) and references therein).

Definition 12 A function $W : [0,1]^2 \to \mathbb{R}$ is called symmetric if $W(x, y) = W(y, x)$ for all $x, y \in [0, 1]$. Any symmetric function $W : [0,1]^2 \to \mathbb{R}$ which is $L^1$ integrable, i.e. $\|W\|_1 := \int_{[0,1]^2} |W(x,y)|\, dx\, dy < \infty$, is called a graphon. Let $\mathcal{W}$ denote the space of all graphons. The cut norm of a graphon $W$ is given by

\[ \|W\|_\square = \sup_{S, T \subseteq [0,1]} \Big| \int_{S \times T} W(x, y)\, dx\, dy \Big|. \]

The cut norm is equivalent to the $L^\infty \to L^1$ operator norm defined by

\[ \|W\|_{\infty \to 1} := \sup_{f, g : \|f\|_\infty, \|g\|_\infty \leq 1} \Big| \int_{[0,1]^2} W(x, y) f(x) g(y)\, dx\, dy \Big|. \quad (9) \]

More precisely, we have $\|W\|_\square \leq \|W\|_{\infty \to 1} \leq 4 \|W\|_\square$. It also follows from (9) that the cut norm is weaker than the $L^1$ norm, i.e. convergence in $L^1$ implies convergence in cut norm.

Definition 13 Given a symmetric $p \times p$ matrix $B$ with real entries, define a piecewise constant function on $[0,1]^2$ by dividing $[0,1]^2$ into $p^2$ smaller squares each of side length $1/p$, and setting

\[ W_B(x, y) := B(i, j) \text{ if } \lceil px \rceil = i,\ \lceil py \rceil = j \text{ with } i \neq j, \qquad W_B(x, y) := 0 \text{ otherwise.} \]

We will also need the following notion for embedding vectors into functions.

Definition 14 (Vector to function) Given $t := (t_1, \ldots, t_p) \in \mathbb{R}^p$, define the piecewise constant function $w_{t,p}$ on $[0, 1]$ by dividing $[0, 1]$ into the $p$ intervals $\cup_{i=1}^p \frac{1}{p}(i-1, i]$ of equal length $1/p$, and setting $w_t(x) := t_i$ if $x$ is in the $i$th interval, i.e. $\lceil px \rceil = i$.

To derive the scaling limit, we will assume an underlying model $y = X\beta_0 + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2 I)$. We say that a random variable $f := f(y, X)$ satisfies $f \xrightarrow{P|X} 0$ if for any $\varepsilon > 0$, $P[|f| > \varepsilon \mid X] \to 0$ as $p \to \infty$. Thus this convergence is conditional on the sequence of design matrices. In our subsequent analysis, the following representation lemma will be crucial.

Lemma 15 Suppose we are in the setting of Theorem 7. Assume further that $\sum_{i=1}^p D_p(i, i) = O(p)$. Then there exist $(\xi_1, \ldots, \xi_p) \overset{iid}{\sim} N(0, 1)$ such that

\[ \frac{1}{p} \sup_{u \in [-1,1]^p} \big| M_p(u) - \widetilde{M}_p(u) \big| \xrightarrow{P|X} 0, \]
\[ \widetilde{M}_p(u) := \frac{1}{2\sigma^2} \Big[ -u^T A_p u + 2\sigma \sum_{i=1}^p \sqrt{D_p(i,i)}\, \xi_i u_i + 2 \beta_0^T X^T X u \Big] - \sum_{i=1}^p G(u_i, d_i). \]

We will derive a limiting formula for the log-normalizing constant (2) in terms of a variational problem on a space of probability distributions. This requires the following definition.

Definition 16 Let $\mathcal{F}$ denote the space of all bounded measurable functions from $[0, 1] \times \mathbb{R}$ to $[-1, 1]$. Fixing $W \in \mathcal{W}$ and $g, \psi \in L^1[0,1]$, define a functional $\mathbb{G}_{W,g,\psi}(\cdot) : \mathcal{F} \to \mathbb{R}$ by setting

\[ \mathbb{G}_{W,g,\psi}(F) := -\frac{1}{2\sigma^2} \mathbb{E}[W(X, X') F(X, Z) F(X', Z')] + \frac{1}{\sigma^2} \mathbb{E}[g(X) F(X, Z)] + \frac{1}{\sigma} \mathbb{E}\big[ \sqrt{\psi(X)}\, F(X, Z) Z \big] - \mathbb{E}\Big[ G\Big( F(X, Z), \frac{\psi(X)}{\sigma^2} \Big) \Big], \]

where $(X, X') \overset{iid}{\sim} U[0,1]$ and $(Z, Z') \overset{iid}{\sim} N(0, 1)$ are mutually independent.

Theorem 17 Suppose $y \sim N(X\beta_0, \sigma^2 I)$. Writing $X^T X = A_p + D_p$, set $D$ to denote the vector in $\mathbb{R}^p$ containing the diagonal entries of $D_p$. Assume the following:

(i) $d_\square(W_{p A_p}, W) \to 0$ for some $W \in \mathcal{W}$.
(ii) $w_{\beta_0} \xrightarrow{L^1} \phi(\cdot)$, and $w_D \xrightarrow{L^1} \psi(\cdot)$.
(iii) $\mathrm{tr}(A_p^2) = o(p)$.

Define a function $g \in L^1[0,1]$ by

\[ g(x) := \int_{[0,1]} W(x, y) \phi(y)\, dy + \psi(x) \phi(x). \]

Then we have

\[ \frac{1}{p} \sup_{u \in [-1,1]^p} M_p(u) \xrightarrow{P|X} \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F). \quad (10) \]

Remark 18 For the convenience of the reader, we point out the analogues between the discrete objects (vectors, matrices) and their continuum versions (functions) used in the above theorem, in the list below:

\[ p A_p \leftrightarrow W, \qquad \beta_0 \leftrightarrow \phi, \qquad D \leftrightarrow \psi. \]

Lemma 15 and Theorem 17 are established in Section 4.2.
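To make the limiting object in (10) concrete, the following sketch estimates $\mathbb{G}_{W,g,\psi}(F)$ of Definition 16 by Monte Carlo for a candidate $F$, reusing the quadrature helpers c, c_dot and h from the sketch after Definition 1. The inputs ($W = 0$, $\psi \equiv 1$, $g = \phi$ with $\phi(x) = x - 1/2$, and $F$ taken from Remark 27 below) are illustrative assumptions.

```python
import numpy as np

def G_fn(u, d):
    """G(u, d) = u h(u,d) - c(h(u,d), d) + c(0, d) for u in (-1, 1) (Definition 3)."""
    hu = h(u, d)
    return u * hu - c(hu, d) + c(0.0, d)

def gibbs_functional(F, W, g, psi, sigma2=1.0, m=2000, seed=0):
    """Monte Carlo estimate of G_{W,g,psi}(F), with (X, X') iid U[0,1] and
    (Z, Z') iid N(0,1), following Definition 16."""
    rng = np.random.default_rng(seed)
    X, Xp = rng.random(m), rng.random(m)
    Z, Zp = rng.normal(size=m), rng.normal(size=m)
    FX = np.array([F(x, z) for x, z in zip(X, Z)])
    FXp = np.array([F(x, z) for x, z in zip(Xp, Zp)])
    WX = np.array([W(x, xp) for x, xp in zip(X, Xp)])
    quad_term = -np.mean(WX * FX * FXp) / (2.0 * sigma2)
    lin_term = np.mean(np.array([g(x) for x in X]) * FX) / sigma2
    noise_term = np.mean(np.sqrt(np.array([psi(x) for x in X])) * FX * Z) / np.sqrt(sigma2)
    ent_term = np.mean([G_fn(F(x, z), psi(x) / sigma2) for x, z in zip(X, Z)])
    return quad_term + lin_term + noise_term - ent_term

phi = lambda x: x - 0.5
F_cand = lambda x, z: c_dot(phi(x) + z, 1.0)   # candidate optimizer from Remark 27
print(gibbs_functional(F_cand, W=lambda x, y: 0.0, g=phi, psi=lambda x: 1.0))
```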
2.3 Uniqueness of the optimizer

Theorem 7 identifies general conditions for the asymptotic tightness of the naive mean-field lower bound to the log-normalizing constant. The Gibbs variational principle (Wainwright and Jordan, 2008) establishes that under these settings, the distribution $\mu$ can be approximated, to the leading order, by a product distribution. However, this does not specify whether the best approximation is unique. Indeed, the ferromagnetic Ising model on the complete graph, henceforth referred to as the Curie-Weiss model, provides a classical example where the mean-field lower bound is tight, but without an external magnetization, the model has two distinct optimizers at low temperature. In this section, we identify some conditions which guarantee the uniqueness of the optimizers. From a statistical perspective, these conditions allow us to conclude that the posterior distribution roughly behaves like a product distribution. To this end, our next lemma identifies a set of sufficient conditions.

Definition 19 Let $(\xi_1, \ldots, \xi_p) \overset{iid}{\sim} N(0, 1)$ be as constructed in Lemma 15. For any $u \in [-1, 1]^p$, define a random empirical measure $L^{(u)}_p$ on $\mathbb{R}^3$ by setting

\[ L^{(u)}_p := \frac{1}{p} \sum_{i=1}^p \delta_{(i/p,\, \xi_i,\, u_i)}. \]

Theorem 20 Suppose all assumptions of Theorem 17 hold. Assume further that there exists $u^*_p \in [-1, 1]^p$ such that for all $\varepsilon > 0$ there exists $\delta > 0$ such that

\[ P\Big[ \sup_{u : \| u - u^*_p \|_2^2 > p\varepsilon} \frac{1}{p} \{ M_p(u) - M_p(u^*_p) \} < -\delta \,\Big|\, X \Big] = 1 - o(1). \quad (11) \]

Then the following conclusions hold.

(i) The limiting variational problem (10) has a unique optimizer $F^* \in \mathcal{F}$.

(ii) $L^{(u^*_p)}_p \xrightarrow{P} \mu^*$, where $\mu^*$ is the law of $(X, Z, F^*(X, Z))$ with $X \sim U[0,1]$ and $Z \sim N(0,1)$ mutually independent.

(iii) $F^*$ satisfies the fixed point equation

\[ F^*(x, z) \overset{a.s.}{=} \dot{c}\Big( \frac{1}{\sigma^2} \Big( -\mathbb{E}[W(x, X) F^*(X, Z)] + g(x) + \sigma \sqrt{\psi(x)}\, z \Big),\ \frac{\psi(x)}{\sigma^2} \Big), \]

where $X \sim U[0,1]$ and $Z \sim N(0,1)$ are mutually independent.

Remark 21 Similar to Remark 18, we point out the new discrete and continuum analogues appearing in Theorem 20: $u^*_p \leftrightarrow F^*$, and $L^{(u^*_p)}_p \leftrightarrow \mu^*$.

Combining Theorem 7 and Theorem 20, we get the following corollary, which deduces a law of large numbers under the posterior distribution.

Corollary 22 Suppose (6), (11), and the three conditions (i), (ii), (iii) of Theorem 17 hold. Then the optimization in Theorem 17 has a unique solution $F^* \in \mathcal{F}$ that satisfies the fixed point equation displayed in Theorem 20 (iii). Further, for any continuous function $\zeta : [-1, 1]^2 \to \mathbb{R}$ we have

\[ \mu\Big( \Big| \frac{1}{p} \sum_{i=1}^p \zeta(\beta_i, \beta_{0,i}) - \mathbb{E}\Big[ \int_{[-1,1]} \zeta(w, \phi(X))\, d\pi_{(h(F^*(X,Z), \psi(X)/\sigma^2),\, \psi(X)/\sigma^2)}(w) \Big] \Big| > \varepsilon \Big) \xrightarrow{P|X} 0, \]

where $X \sim U([0,1])$ and $Z \sim N(0,1)$ are independent.

Corollary 22 illustrates how one can extract properties of the high-dimensional posterior using the NMF approximation, and is one of the main takeaways of this paper. The result has two parts, and should be interpreted as follows: first, assume a non-trivial scaling of the problem (6), and that the finite-dimensional NMF variational problem has a unique approximate optimizer (11). In this setting, provided the problem parameters $W, \beta_0, D$ have appropriate continuum limits and the problem is approximately NMF, the infinite-dimensional variational problem in Theorem 17 has a unique optimizer $F^*$. In turn, this optimizer $F^*$ satisfies a functional fixed point equation. Note that in principle, this fixed point equation can still have multiple solutions; we simply claim that the global optimizer is one of the solutions to this fixed point relation.
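When $W$ has rank one, say $W(x, y) = G(x) G(y)$ as in the spiked example of Section 2.4 below, the expectation $\mathbb{E}[W(x, X) F^*(X, Z)]$ collapses to $G(x)\, m$ with the scalar $m = \mathbb{E}[G(X) F^*(X, Z)]$, and the fixed point equation can be attacked by damped scalar iteration. The sketch below is a heuristic for locating one solution of the fixed point relation (not necessarily the global optimizer); it reuses c_dot from the earlier sketch, and all tuning constants are illustrative.

```python
import numpy as np

def solve_fixed_point_rank_one(Gfun, g, psi, sigma2=1.0, n_mc=500, iters=40, damp=0.5, seed=3):
    """Damped iteration for m = E[Gfun(X) F(X,Z)] when W(x,y) = Gfun(x)Gfun(y), so that
    F(x,z) = c_dot((-Gfun(x)*m + g(x) + sqrt(sigma2*psi(x))*z)/sigma2, psi(x)/sigma2)."""
    rng = np.random.default_rng(seed)
    Xs, Zs = rng.random(n_mc), rng.normal(size=n_mc)
    m = 0.0
    for _ in range(iters):
        args = (-Gfun(Xs) * m + g(Xs) + np.sqrt(sigma2 * psi(Xs)) * Zs) / sigma2
        Fv = np.array([c_dot(a, d) for a, d in zip(args, psi(Xs) / sigma2)])
        m = (1.0 - damp) * m + damp * np.mean(Gfun(Xs) * Fv)
    return m
```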
This optimizer governs the law of large numbers (LLN) behavior under the posterior distribution; this is the main takeaway from the second part of the corollary above. Formally, if $\beta$ is a sample from the posterior $\mu_{y,X}$ and $\zeta$ is any continuous test function, the empirical average $\frac{1}{p} \sum_{i=1}^p \zeta(\beta_i, \beta_{0,i})$ concentrates around an explicit, deterministic limit characterized in terms of the optimizer $F^*$. Note that there is an additional technical subtlety: due to the randomness in the distribution of $y$ given $X$, the posterior distribution $\mu_{y,X}(\cdot)$ is actually random! Thus the aforementioned behavior of the empirical averages (under the posterior) is only valid with high probability under the distribution $P|X$. To further elucidate the usefulness of this corollary, we describe two specific applications in the next remark.

Remark 23 We highlight two specific applications of Corollary 22. First, setting $\zeta(x, y) = (x - y)^2$, we have

\[ \frac{1}{p} \sum_{i=1}^p (\beta_i - \beta_{0,i})^2 \xrightarrow{P|X} \mathbb{E}\Big[ \int_{[-1,1]} (w - \phi(X))^2\, d\pi_{(h(F^*(X,Z), \psi(X)/\sigma^2),\, \psi(X)/\sigma^2)}(w) \Big]. \]

Thus although we do not have posterior contraction in this regime, the normalized $L^2$ distance between the true vector $\beta_0$ and a sample $\beta$ from the posterior $\mu_{y,X}$ converges to the explicit value characterized above. In other words, the posterior is concentrated on a thin shell of a fixed radius around the true vector $\beta_0$. As a second application, we consider a prior $\pi$ such that zero is an isolated point in its support. In this case, $\zeta(x, y) = \mathbf{1}(x = 0)$ is almost surely continuous under $\pi$, and thus

\[ \frac{1}{p} \sum_{i=1}^p \mathbf{1}(\beta_i = 0) \xrightarrow{P|X} \mathbb{E}\Big[ \pi_{(h(F^*(X,Z), \psi(X)/\sigma^2),\, \psi(X)/\sigma^2)}(\{0\}) \Big]. \]

In this case, the proportion of zero coordinates in a sample from the posterior distribution can be tracked using our framework. These results are particularly useful for spike and slab type priors $\pi$.

Remark 24 To use Corollary 22 in concrete examples, one needs to compute the functions $(W, g, \psi)$, and verify the conditions (5), (6) and (11). Of these, the separation condition (11) is somewhat implicit, while the other two conditions are relatively easy to verify directly. The following lemma provides two sufficient conditions to this end. In particular, it shows that (11) holds in the so-called high temperature regime, or if $\pi$ has a density (with respect to Lebesgue measure) which is log-concave.

Lemma 25 (i) Suppose there exist $\lambda > 0$ and $u^*_p \in [-1, 1]^p$ such that for all $u \in [-1, 1]^p$,

\[ P\Big[ M_p(u^*_p) - M_p(u) \geq \lambda \| u - u^*_p \|_2^2 \,\Big|\, X \Big] = 1 - o(1), \quad (13) \]

for some $\lambda > 0$ free of $u$ and $p$. Then the condition (11) in Theorem 20 holds.

(ii) In particular, (13) holds under either of the following conditions:

(a) $\limsup_{p \to \infty} \sup_{1 \leq i \leq p} \sum_{j \neq i} |A_p(i, j)| < \sigma^2$.

(b) The prior $\pi$ has a density with respect to Lebesgue measure on $[-1, 1]$ proportional to $\exp(-V(x))$, where $V$ is even, $V : [0, 1] \to \mathbb{R}$ is increasing, and $V$ is convex on $[0, 1)$. Further, we assume that $\liminf_{p \to \infty} \lambda_{\min}(X^T X) > 0$.

We collect the proofs of Theorem 20 and Lemma 25 in Section 4.3.
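Condition (ii)(a) of Lemma 25 is directly checkable from the design matrix; a one-function diagnostic (illustrative):

```python
import numpy as np

def high_temperature_check(X, sigma2=1.0):
    """Checks condition (ii)(a) of Lemma 25: sup_i sum_{j != i} |A_p(i,j)| < sigma^2,
    a sufficient condition for the separation condition (11)."""
    XtX = X.T @ X
    A = XtX - np.diag(np.diag(XtX))
    return np.abs(A).sum(axis=1).max() < sigma2
```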
2.4 Applications

To illustrate the utility of Theorem 17 and Corollary 22, we apply our results to specific examples in this section. The proofs are deferred to Appendix C.

Spiked Covariance Matrix: We consider linear regression with mean-zero Gaussian features. We assume a spiked covariance structure (Johnstone and Lu, 2009) on the features.

Corollary 26 Suppose that the $i$th row of the design matrix $X$ equals $n^{-1/2} x_i$, where $\{x_i\}_{1 \leq i \leq n} \overset{iid}{\sim} N(0, \Gamma_p)$ with $\Gamma_p = I + v v^T$, $v_i := \frac{1}{\sqrt{p}} G(i/p)$, and $G : [0, 1] \to \mathbb{R}$ is continuous almost surely. Further assume that $p = o(n)$, and the true regression coefficient $\beta_0$ satisfies $w_{\beta_0} \xrightarrow{L^1} \phi$, for some $\phi \in L^1[0,1]$.

(a) Then

\[ \frac{1}{p} \log Z_p(y, X) \xrightarrow{P|X} \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F), \quad W(x, y) = G(x) G(y), \quad \psi(x) = 1, \quad g(x) = G(x) \int_{[0,1]} G(y) \phi(y)\, dy + \phi(x). \]

(b) Consider the following two cases: either (i) $\lambda^* := \max_{1 \leq i \leq p} \sum_{j \neq i} |\Gamma_p(i, j)| < \sigma^2$, or (ii) the conditions of Lemma 25, Part (ii)(b), hold. Then the conclusions of Corollary 22 hold.

Remark 27 We now consider a very special case to illustrate our results. Suppose $X$ has iid entries distributed as $N(0, \frac{1}{n})$, and $\sigma = 1$. In this case the assumptions of Corollary 26 are satisfied with $W \equiv 0$, $G \equiv 0$, $g = \phi$, $\psi \equiv 1$, and so the optimizer $F^*(x, z)$ in Corollary 22 simplifies to $F^*(x, z) = \dot{c}(\phi(x) + z, 1)$.

The next example re-visits the sparse Bernoulli design setting introduced in Corollary 11.

Corollary 28 Suppose that the $(i, j)$th entry of the design matrix $X$ equals $\sqrt{\frac{p}{n}}\, B(i, j)$, where $\{B(i, j)\}_{1 \leq i \leq n, 1 \leq j \leq p}$ are mutually independent Bernoulli random variables with $P(B(i, j) = 1) = \frac{1}{p} G(i/n, j/p)$, where $G$ is a function on $[0,1]^2$ which is continuous almost surely. Further assume that $p = o(n)$, and the true regression coefficient $\beta_0$ satisfies $w_{\beta_0} \xrightarrow{L^1} \phi$, for some function $\phi \in L^1[0,1]$.

(a) Then

\[ \frac{1}{p} \log Z_p(y, X) \xrightarrow{P} \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F), \]
\[ W(x, y) = \int_{[0,1]} G(t, x) G(t, y)\, dt, \quad \psi(x) = \int_{[0,1]} G(t, x)\, dt, \quad g(x) = \int_{[0,1]} W(x, y) \phi(y)\, dy + \phi(x) \psi(x). \]

(b) Consider the following two cases: (i) setting

\[ S(x) := \int_{[0,1]} W(x, y)\, dy = \int_{[0,1]^2} G(t, x) G(t, y)\, dt\, dy, \]

suppose $\sigma^2 > \mathrm{ess}\sup S(X)$, where $X \sim U[0,1]$; or (ii) the conditions of Lemma 25, Part (ii)(b), hold. Then the conclusions of Corollary 22 hold.

As our last example, we consider a sequence of design matrices arising in the study of two-way ANOVA designs.

Corollary 29 Suppose that we have a two-factor ANOVA model of the form

\[ y_{ij} = \frac{1}{\sqrt{\bar{p}}} (\tau_i + \gamma_j) + \xi_{ij}, \quad 1 \leq i, j \leq \bar{p}. \]

Here $(\tau_1, \ldots, \tau_{\bar{p}}) \in [-1, 1]^{\bar{p}}$ are the levels of the first factor, $(\gamma_1, \ldots, \gamma_{\bar{p}}) \in [-1, 1]^{\bar{p}}$ are the levels of the second factor, and $\xi_{ij} \overset{iid}{\sim} N(0, \sigma^2)$. Setting $n = \bar{p}^2$ and $p = 2\bar{p}$, let $y \in \mathbb{R}^n$ denote the vector obtained by linearizing the matrix $((y_{ij}))$ row-wise, let $X$ denote the corresponding $n \times p$ design matrix, and let $\beta := (\tau, \gamma) \in [-1, 1]^p$ be the unknown parameter. Assume that the true regression coefficient $\beta_0 \in [-1, 1]^p$ satisfies $w_{\beta_0} \xrightarrow{L^1} \phi$, for some function $\phi \in L^1[0,1]$.

(a) Then

\[ \frac{1}{p} \log Z_p(y, X) \xrightarrow{P} \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F), \]

where $\psi \equiv 1$,

\[ W(x, y) := 0 \text{ if } (x, y) \in [0, .5)^2 \cup (.5, 1]^2, \qquad W(x, y) := 1 \text{ if } (x, y) \in [0, .5) \times (.5, 1] \cup (.5, 1] \times [0, .5), \]

and $g(x) = \int_{[0,1]} W(x, y) \phi(y)\, dy + \phi(x)$.

(b) Consider the following two cases: (i) the condition of Lemma 25, Part (ii)(a), holds; or (ii) the conditions of Lemma 25, Part (ii)(b), hold. Then the conclusions of Corollary 22 hold.

3. Discussions

We discuss some limitations of our current results, and collect some questions for future enquiry.

(i) The bounded support assumption on the prior: A vital technical assumption in our analysis concerns the bounded support of the prior. We note that for a general prior with unbounded support, the posterior $\mu_{y,X}$ might not even be a proper probability distribution. One intuitively expects that under appropriate tail-decay conditions on the prior, the results in this paper should generalize. Going beyond the bounded support assumption requires extending the theory of non-linear large deviations to probability measures on unbounded spaces, and is thus beyond the scope of this paper.

(ii) Connection with classical asymptotics: The curious reader might wonder about the connections between the results in this paper and classical low-dimensional Bayesian asymptotic theory.
To explicate this connection, recall the model (1) with the covariates being drawn iid from a $N(0, I_p)$ distribution. Consider the classical asymptotic regime with fixed dimension $p$ and $n \to \infty$, and let $\beta_0 / \sqrt{n} \in \mathbb{R}^p$ be a sequence of coefficients on the local scale. Using the theory of equivalence of statistical experiments (Van der Vaart, 2000), the likelihood is equivalent to an observation $\tilde{\beta} \sim N(\beta_0, (\frac{X^T X}{n})^{-1})$. Further, the Fisher information matrix $X^T X / n$ is concentrated around the identity matrix, and thus the data is asymptotically equivalent to $\tilde{\beta} \sim N(\beta_0, I_p)$. In turn, the posterior, constructed from an iid product prior, is equivalent to

\[ \mu(d\beta) \propto \prod_{i=1}^p \exp\Big( -\frac{1}{2} (\beta_i - \tilde{\beta}_i)^2 \Big)\, d\pi(\beta_i). \]

Finally, consider an additional limit $p \to \infty$. Assuming $w_{\beta_0} \xrightarrow{L^1} \phi$, by the weak law of large numbers, for any continuous function $\zeta : [-1, 1]^2 \to \mathbb{R}$,

\[ \frac{1}{p} \sum_{i=1}^p \zeta(\beta_i, \beta_{0,i}) \xrightarrow{P} \mathbb{E}\Big[ \int_{[-1,1]} \zeta(w, \phi(X))\, d\pi_{(\phi(X) + Z, 1)}(w) \Big], \]

where $X \sim U([0,1])$ and $Z \sim N(0,1)$ are independent. We emphasize that this is an iterated limit, and $p$ is not a function of $n$ in this discussion. Note that this behavior is consistent with Corollary 22 upon setting $W = 0$ and $\psi = 1$. Thus our results are naturally compatible with the local asymptotics regime; in fact, for Gaussian designs, our results establish the validity of this behavior as long as $p = o(n)$. Of course, our framework is significantly more general, and can handle situations where the design distribution changes with $n, p$, e.g. sparse Bernoulli designs.

(iii) Extensions to Gibbs posteriors and fractional posteriors: A careful study of our proof reveals that our main results do not depend strongly on the correct model specification. As a result, we expect similar techniques to be broadly useful in the study of Gibbs and fractional posteriors (Alquier et al., 2016; Yang et al., 2020).

(iv) Extensions to other GLMs: Another natural question of interest concerns the applicability of these ideas to more general models, e.g. logistic regression. We consider this to be an extremely interesting question, and plan to explore this in the future.

(v) Extensions to models with latent characteristics: Variational methods are ubiquitous in applications with latent characteristics, e.g. topic modeling, community detection etc. In contrast, the relevant variables are all observed in the linear model framework. It will be interesting to explore the applicability of our techniques to models with latent features.

(vi) Connections with benign overfitting: A recent series of papers has explored the benign overfitting phenomenon in the context of high-dimensional linear regression (Bartlett et al., 2020; Hastie et al., 2022; Tsigler and Bartlett, 2020). The main takeaway from this line of research is that if the population covariance matrix has a long and fat tail, the min-norm interpolant continues to have excellent prediction performance in high dimensions. There are some high-level conceptual similarities between this phenomenon and the results reported in this paper. Indeed, both phenomena are driven by the eigenvalues of the covariates: in our case, the condition $\mathrm{tr}(A^2) = o(p)$ essentially demands that the matrix $A \in \mathbb{R}^{p \times p}$ has only $o(p)$ many eigenvalues of constant order. Broadly, this highlights the importance of covariate geometry in governing the practical performance of ML methods in high dimensions.
(vii) Connections with Bayesian deep learning: Some recent papers in Bayesian deep learning (Farquhar et al., 2020; Foong et al., 2020) experimentally establish that the nVB approximation might not be accurate in Bayesian neural networks, and that increasing the depth might help ameliorate some of the issues. Our results, on the other hand, derive sufficient conditions for the accuracy of the nVB approximation in the simple case of the linear model with product priors. While one should definitely not expect the nVB approximation to be uniformly accurate (especially in high dimensions), it would be interesting to identify conditions on the covariates which guarantee its accuracy in Bayesian neural networks.

4. Proofs of the main results

We prove the main results in this section. Theorem 7 is established in Section 4.1, Theorem 17 is established in Section 4.2, while Theorem 20 is proved in Section 4.3.

4.1 Proof of Theorem 7

Our first lemma collects some basic facts about the exponential family $\pi_\gamma$. The proof is deferred to the Appendix.

Lemma 30 In the setting of Lemma 2, we have the following conclusions.

(i) Set $\gamma = (\gamma_1, \gamma_2) \in \mathbb{R}^2$. The derivatives of $c(\gamma_1, \gamma_2)$ with respect to $\gamma_1$ are given by

\[ \dot{c}(\gamma) := \frac{\partial c(\gamma_1, \gamma_2)}{\partial \gamma_1} = \int_{[-1,1]} z\, d\pi_\gamma(z), \qquad \ddot{c}(\gamma) := \frac{\partial^2 c(\gamma_1, \gamma_2)}{\partial \gamma_1^2} = \int_{[-1,1]} (z - \dot{c}(\gamma))^2\, d\pi_\gamma(z). \]

(ii) We have, for $\gamma = (\gamma_1, \gamma_2) \in \mathbb{R}^2$,

\[ D_{KL}(\pi_\gamma \| \pi_{(0,\gamma_2)}) = \gamma_1 \dot{c}(\gamma) - c(\gamma) + c(0, \gamma_2). \]

(iii) Consider a sequence $\gamma_k = (\gamma_{1,k}, \gamma_{2,k}) \in \mathbb{R}^2$ such that $\gamma_{1,k} \to \infty$ and $\limsup_k |\gamma_{2,k}| < \infty$. Then $\pi_{\gamma_k} \xrightarrow{w} \pi_+$, where $\pi_+$ is the degenerate distribution which puts mass 1 at $1$.

(iv) For any $d > 0$, the function $G(\cdot, d) : [-1, 1] \to \mathbb{R}_+$ is lower semicontinuous.

(v) For $x \in [-1, 1]$, $y \in \mathbb{R}$ and $r \in \mathbb{N}$, define

\[ H(x, y) = \int_{[-1,1]} z^r\, d\pi_{(h(x,y), y)}(z). \quad (14) \]

Then we have

\[ \sup_{x \in (-1,1), y \in \mathbb{R}} \Big| \frac{\partial H}{\partial x}(x, y) \Big| < \infty. \]

We now turn to the proof of Theorem 7. To this end, set

\[ m_i(\beta) := \sum_{j=1}^p A_{ij} \beta_j, \qquad \theta_i := \frac{z_i - m_i(\beta)}{\sigma^2}, \quad (16) \]

where we recall that $\sigma^2$ denotes the noise variance in our linear regression model (1), and $z = X^T y$. Observe that $A_{ii} = 0$ for all $1 \leq i \leq p$ implies that $\mu(\cdot \mid (\beta_j)_{j \neq i}) = \pi_{(\theta_i, d_i)}$, and thus $\mathbb{E}_\mu[\beta_i \mid (\beta_j)_{j \neq i}] = \dot{c}(\theta_i, d_i)$. We define

\[ b = (\dot{c}(\theta_i, d_i))_{1 \leq i \leq p}. \quad (17) \]

We will prove that certain statistics under the posterior distribution $\mu$ can be well-approximated by the vector of conditional means. To this end, we establish the following results.

Lemma 31 Under the conditions of Theorem 7, setting $f(\beta) = -\frac{1}{2\sigma^2} \beta^T A \beta + \frac{1}{\sigma^2} z^T \beta$, we have

\[ \log Z_p = \log \Big[ \int_{[-1,1]^p} e^{f(\beta)} \prod_{i=1}^p \pi_i(d\beta_i) \Big] = \sup_{u \in [-1,1]^p} M_p(u) + o(p), \quad (18) \]
\[ \mathbb{E}_\mu\big[ \beta^T A \beta - b^T A b \big]^2 = o(p^2), \quad (19) \]
\[ \mathbb{E}_\mu\Big[ \sum_{i=1}^p m_i(\beta)(\beta_i - b_i) \Big]^2 = o(p^2), \quad (20) \]
\[ \mathbb{E}_\mu\Big[ f(\beta) - f(b) - \sum_{i=1}^p \theta_i (\beta_i - b_i) \Big]^2 = o(p^2). \quad (21) \]

In the above displays, the random vectors $\theta := (\theta_1, \ldots, \theta_p)^T$ and $b$ are as defined in (16) and (17) respectively.

Lemma 32 Suppose $\phi : [-1, 1] \to \mathbb{R}$ is a bounded measurable function, and $A_p$ satisfies (5) and (6). Then for any $c \in [-1, 1]^p$ we have

\[ \limsup_{p \to \infty} \frac{1}{p}\, \mathbb{E}_\mu\Big[ \sum_{i=1}^p c_i \big\{ \phi(\beta_i) - \mathbb{E}_\mu[\phi(\beta_i) \mid \beta_k, k \neq i] \big\} \Big]^2 < \infty. \]
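Before turning to the proofs, we note that the conditional mean vector $b$ of (17) is precisely what a naive mean-field fixed-point scheme tracks: the stationarity condition of $M_p$ reads $u_i = \dot c((z_i - (Au)_i)/\sigma^2, d_i)$. The sketch below iterates this map, reusing c_dot from the earlier quadrature sketch; it is a heuristic illustration suggested by (16)-(17), not an algorithm analyzed in the paper.

```python
import numpy as np

def nmf_fixed_point(X, y, sigma2=1.0, iters=200, tol=1e-8):
    """Iterates u_i <- c_dot((z_i - (A u)_i) / sigma^2, d_i), the stationarity
    condition of M_p, starting from u = 0. Heuristic sketch."""
    XtX = X.T @ X
    z = X.T @ y
    d = np.diag(XtX) / sigma2            # tilts d_i = D_p(i,i) / sigma^2
    A = XtX - np.diag(np.diag(XtX))      # off-diagonal part A_p
    u = np.zeros(X.shape[1])
    for _ in range(iters):
        u_new = np.array([c_dot((z[i] - A[i] @ u) / sigma2, d[i]) for i in range(len(u))])
        if np.max(np.abs(u_new - u)) < tol:
            break
        u = u_new
    return u
```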
We defer the proofs of these lemmas to the end of this section, and establish Theorem 7 given these lemmas.

Proof of Theorem 7

Proof of Part (i). This is exactly (18) of Lemma 31.

Proof of Part (ii). With $f(\beta) = -\frac{1}{2\sigma^2} \beta^T A \beta + \frac{1}{\sigma^2} z^T \beta$ as in Lemma 31, define

\[ C_p(\varepsilon) = \Big\{ u \in [-1, 1]^p : f(u) - \sum_{i=1}^p \big[ u_i h(u_i, d_i) - c(h(u_i, d_i), d_i) + c(0, d_i) \big] < \sup_{u \in [-1,1]^p} M_p(u) - p\varepsilon \Big\}, \]

i.e. the set of $u$ with $M_p(u) < R_p - p\varepsilon$. Note that it suffices to establish that for all $\varepsilon > 0$, $\mu(b \in C_p(\varepsilon)) \to 0$ as $p \to \infty$; with a slight abuse of notation, we also write $C_p(\varepsilon)$ for the event $\{\beta : b(\beta) \in C_p(\varepsilon)\}$, recalling that $b = b(\beta)$ from (17). To this end, define the event

\[ E_p(\varepsilon/2) = \Big\{ \beta \in [-1, 1]^p : \Big| f(\beta) - f(b) - \sum_{i=1}^p \theta_i (\beta_i - b_i) \Big| \leq \frac{p\varepsilon}{2} \Big\}. \]

Note that (21), in combination with Chebychev's inequality, implies that $\mu(E_p(\varepsilon/2)^c) \to 0$ as $p \to \infty$. Therefore,

\[ \mu(C_p(\varepsilon)) \leq \mu(C_p(\varepsilon) \cap E_p(\varepsilon/2)) + \mu(E_p(\varepsilon/2)^c) = \mu(C_p(\varepsilon) \cap E_p(\varepsilon/2)) + o(1). \quad (22) \]

Now, we have,

\[ \mu(C_p(\varepsilon) \cap E_p(\varepsilon/2)) = \frac{1}{Z_p} \int_{C_p(\varepsilon) \cap E_p(\varepsilon/2)} \exp(f(\beta)) \prod_{i=1}^p \pi_i(d\beta_i) \]
\[ \leq \frac{e^{p\varepsilon/2}}{Z_p} \int_{C_p(\varepsilon) \cap E_p(\varepsilon/2)} \exp\Big[ f(b) + \sum_{i=1}^p \theta_i (\beta_i - b_i) \Big] \prod_{i=1}^p \pi_i(d\beta_i) \]
\[ \leq \frac{\exp(-p\varepsilon/2 + \sup_{u \in [-1,1]^p} M_p(u))}{Z_p} \int_{C_p(\varepsilon) \cap E_p(\varepsilon/2)} \exp\Big[ \sum_{i=1}^p \big( \theta_i \beta_i - c(\theta_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i), \]

where the first inequality follows using the definition of $E_p(\varepsilon/2)$, while the last inequality follows from the definition of $C_p(\varepsilon)$. Using the display above in combination with (7), it suffices to establish that

\[ \log \int_{C_p(\varepsilon) \cap E_p(\varepsilon/2)} \exp\Big[ \sum_{i=1}^p \big( \theta_i \beta_i - c(\theta_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i) = o(p). \]

This argument would be relatively straightforward if the $\theta_i$ were fixed constants independent of $\beta$, as then

\[ \int_{C_p(\varepsilon) \cap E_p(\varepsilon/2)} \exp\Big[ \sum_{i=1}^p \big( \theta_i \beta_i - c(\theta_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i) \leq \int_{[-1,1]^p} \exp\Big[ \sum_{i=1}^p \big( \theta_i \beta_i - c(\theta_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i) = 1. \]

We caution the reader that this is not the case: the $\theta_i$ are themselves functions of $\beta$, as specified in (16). To overcome this issue, we proceed as follows. Fix $\delta > 0$, and let $D_p(\delta)$ be a $\delta\sqrt{p}$-net of the set $\{ A\beta : \beta \in [-1,1]^p \}$ in the Euclidean metric, satisfying $\lim_p \frac{1}{p} \log |D_p(\delta)| = 0$. Under the assumption $\mathrm{tr}(A^2) = o(p)$, such a net was constructed in (Basak and Mukherjee, 2017, Lemma 3.4). Thus for every $\beta \in [-1,1]^p$, there exists $\mathbf{m} \in D_p(\delta)$ such that $\| A\beta - \mathbf{m} \|_2 \leq \delta \sqrt{p}$. For any $\mathbf{m} \in D_p(\delta)$, we set $P(\mathbf{m}) = \{ \beta \in [-1,1]^p : \| A\beta - \mathbf{m} \|_2 \leq \delta \sqrt{p} \}$. For $\beta \in P(\mathbf{m})$, setting $\theta^{\mathbf{m}}_i := \frac{z_i - \mathbf{m}_i}{\sigma^2}$, we have

\[ \frac{1}{p} \sum_{i=1}^p (\theta_i - \theta^{\mathbf{m}}_i)^2 = \frac{1}{p\sigma^4} \sum_{i=1}^p (m_i(\beta) - \mathbf{m}_i)^2 = \frac{1}{p\sigma^4} \| A\beta - \mathbf{m} \|_2^2 \leq \frac{\delta^2}{\sigma^4}, \]

where the last inequality uses the definition of $P(\mathbf{m})$. This gives

\[ \Big| \sum_{i=1}^p (\theta_i - \theta^{\mathbf{m}}_i) \beta_i \Big| + \sum_{i=1}^p \big| c(\theta_i, d_i) - c(\theta^{\mathbf{m}}_i, d_i) \big| \leq 2 \sum_{i=1}^p |\theta_i - \theta^{\mathbf{m}}_i| \leq \frac{2p\delta}{\sigma^2}, \]

where the first inequality follows from $|\beta_i| \leq 1$ and the smoothness of $c(\cdot, \cdot)$ (recall $|\dot{c}| \leq 1$), and the second follows from Cauchy-Schwarz. Thus we have

\[ \int_{C_p(\varepsilon) \cap E_p(\varepsilon/2)} \exp\Big[ \sum_{i=1}^p \big( \theta_i \beta_i - c(\theta_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i) \leq \sum_{\mathbf{m} \in D_p(\delta)} \int_{P(\mathbf{m})} \exp\Big[ \sum_{i=1}^p \big( \theta_i \beta_i - c(\theta_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i) \]
\[ \leq |D_p(\delta)|\, e^{2p\delta/\sigma^2} \max_{\mathbf{m} \in D_p(\delta)} \int_{[-1,1]^p} \exp\Big[ \sum_{i=1}^p \big( \theta^{\mathbf{m}}_i \beta_i - c(\theta^{\mathbf{m}}_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i) = |D_p(\delta)|\, e^{2p\delta/\sigma^2}. \]

This concludes the proof, as $\delta > 0$ is arbitrary.

Proof of Part (iii). First, note that it suffices to show the result with $\zeta(x, y) = x^r y^s$ for any $r, s \in \mathbb{N}$. Thus we fix $r, s \in \mathbb{N}$ in the rest of the proof. Using Lemma 32 with $\phi(x) = x^r$ and $c_i = (\frac{i}{p})^s$, it follows that for any $\varepsilon > 0$,

\[ \mu\Big( \Big| \frac{1}{p} \sum_{i=1}^p \Big(\frac{i}{p}\Big)^s \beta_i^r - \frac{1}{p} \sum_{i=1}^p H(b_i, d_i) \Big(\frac{i}{p}\Big)^s \Big| > \varepsilon \Big) \to 0, \]

where $H$ is defined as in (14). To complete the proof, it thus suffices to show that

\[ \mu\Big( \Big| \frac{1}{p} \sum_{i=1}^p H(b_i, d_i) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(\hat{u}_i, d_i) \Big(\frac{i}{p}\Big)^s \Big| > \varepsilon \Big) \to 0. \quad (23) \]

Fixing $K < \infty$ and setting $d^K(i) := d_i \mathbf{1}\{|d_i| \leq K\}$, using Lemma 30 Part (v) we have

\[ \Big| \frac{1}{p} \sum_{i=1}^p H(\hat{u}_i, d_i) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(\hat{u}_i, d^K(i)) \Big(\frac{i}{p}\Big)^s \Big| \lesssim \frac{1}{p} \sum_{i=1}^p |d_i|\, \mathbf{1}\{|d_i| > K\}, \quad (24) \]

which goes to $0$ as $p \to \infty$ followed by $K \to \infty$, using the uniform integrability assumption on $\frac{1}{p} \sum_{i=1}^p \delta_{d_i}$. Also, with $\delta > 0$ and setting

\[ \hat{u}^\delta(i) := \hat{u}_i \text{ if } |\hat{u}_i| \leq 1 - \delta, \qquad := 1 - \delta \text{ if } \hat{u}_i > 1 - \delta, \qquad := -1 + \delta \text{ if } \hat{u}_i < -1 + \delta, \]

we have

\[ \Big| \frac{1}{p} \sum_{i=1}^p H(\hat{u}_i, d^K(i)) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(\hat{u}^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s \Big| \]
\[ \leq \sup_{x \in [1-\delta, 1], |y| \leq K} |H(x, y) - H(1 - \delta, y)| + \sup_{x \in [-1, -1+\delta], |y| \leq K} |H(x, y) - H(-1 + \delta, y)|. \quad (25) \]

Fixing $K > 0$, we now claim that the RHS of (25) converges to $0$ as $\delta \to 0$. Given this claim, it follows from (24) and (25) that

\[ \limsup_{K \to \infty} \limsup_{\delta \to 0} \limsup_{p \to \infty} \Big| \frac{1}{p} \sum_{i=1}^p H(\hat{u}_i, d_i) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(\hat{u}^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s \Big| = 0. \]

A similar argument with $\hat{u}_i$ replaced by $b_i$ gives

\[ \limsup_{K \to \infty} \limsup_{\delta \to 0} \limsup_{p \to \infty} \Big| \frac{1}{p} \sum_{i=1}^p H(b_i, d_i) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(b^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s \Big| = 0, \]

where $b^\delta(i)$ denotes the analogous truncation of $b_i$. It thus suffices to show that for every fixed $\delta, K$ we have

\[ \mu\Big( \Big| \frac{1}{p} \sum_{i=1}^p H(\hat{u}^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(b^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s \Big| > \varepsilon \Big) \to 0. \]

But this follows on noting that, with $M := \sup_{|x| \leq 1-\delta, |y| \leq K} |\frac{\partial H}{\partial x}(x, y)|$,

\[ \Big| \frac{1}{p} \sum_{i=1}^p H(\hat{u}^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(b^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s \Big| \leq \frac{M}{p} \sum_{i=1}^p |\hat{u}^\delta(i) - b^\delta(i)| \leq M \sqrt{\frac{1}{p} \sum_{i=1}^p \big[ \hat{u}_i - b_i \big]^2},\]
To complete the argument, it thus remains to verify the claim involving (25), for which it suffices to verify continuity of the function (x, y) 7 H(x, y) at x = 1, uniformly for y [ K, K]. By symmetry, it suffices to show that lim δ 0 sup x [1 δ,1],|y| K |H(x, y) 1| 0. Suppose this is not true. Then there exists sequences {xk}k 1, {yk}k 1 with limk xk = 1, and |yk| K, such that |H(xk, yk) 1| > ε for all k, for some ε > 0. Without loss of generality, by passing to a subsequence, we can assume that yk converges to y [ K, K]. Using Lemma 30 Part (iii) we have that π(h(xk,yk),yk) converges weakly to δ1, the point mass at 1. Consequently, using DCT we have H(xk, yk) = Z [ 1,1] xrdπ(h(xk,yk),yk)(z) 1, a contradiction. This verifies the claim, and hence completes the proof of part (iii). Next we turn to the proof of Lemmas 31 and 32. Proof of Lemma 31 Recall the notation 2σ2 βT Aβ + 1 σ2 z T β = 1 i,j=1 Ap(i, j)βiβj + 1 from the proof of Theorem 7 part (ii). Set 2σ2 βT Aβ = 1 i,j=1 Ap(i, j)βiβj, and note that we are in the setting of proof of (Yan, 2020, Thm 4), via the following connections: X β, ˆX b, N 1, JN N 1 2σ2 , n p, An Ap, efi(x) mi(β), h z. The only disparity in the above connection is that Yan (2020) works with h which is free of i, whereas in our setup we have hi = zi which varies with i. However, the proof of (Yan, 2020, Theorem 4) goes through verbatim under this more general choice of the linear term, as long as we have the mean field assumption tr(A2) = o(p). Thus, (Yan, 2020, Theorem 4) gives log Zp sup Q=Qp i=1 Qi h 1 2σ2 (EQ[X])TA(EQ[X]) + 1 σ2 z TEQ[X] i=1 D(Qi πi) i = o(p), Mukherjee and Sen where EQ[X] is the mean vector of X Q = Qp i=1 Qi. The first conclusion (18) then follows upon using Lemma 6. Also (19) and (20) follow from (Yan, 2020, (4.40)) and (Yan, 2020, (4.41)) respectively, by using the connections mentioned above. It thus suffices to prove (21). But this follows upon observing that i=1 θi(βi bi) 1 i=1 mi(β)(βi bi) + 1 2σ2 |βTAβ b TAb|, and using (19) and (20). Proof of Lemma 32 For any i, j [p] with i = j, setting ti := E[φ(βi)|βk, k = i] = Z [ 1,1] φ(z)dπ(θi,di)(z), θi := zi Pp k=1 Aikβk σ2 Eµ h φ(βi) ti ih φ(βj) tj i = Eµ h φ(βi) ti ih φ(βj) t(i) j i + Eµ h φ(βi) ti ih ti j tj i (26) t(i) j := Z [ 1,1] φ(z)dπ(θi j,dj)(z), θ(i) j := zj P k =i Ajkβk σ2 . Since t(i) j does not depend on βi, we have Eµ h φ(βi) ti ih φ(βj) t(i) j i = Eµ h φ(βj) t(i) j i Eµ h φ(βi) ti βk, k = i i = 0. For estimating the second term in the RHS of (26), setting [ 1,1] φ(z)dπ(x,di)(z) a Taylor s series expansion gives t(i) j tj = ℓ(θ(i) j ) ℓ(θj) = Aij σ2 βiℓ (θj) + A2 ij 2σ4 β2 i ℓ (ξ(i) j ). Using the last two displays and summing over i, j give i =j cicj Eµ h φ(βi) ti ih φ(βj) tj i i =j cicj Aij σ2 βiℓ (θj) h φ(βi) ti i i,j=1 cicj A2 ij 2σ4 β2 i ℓ (ξ(i) j ) σ2 sup u [ 1,1]p j=1 Aijuj + 1 2σ4 i,j=1 A2 ij, Variational Inference in linear regression where the last estimate uses the fact that ℓ(r) 1 for r = 1, 2. The desired conclusion follows upon using the given hypothesis on Ap. 4.2 Proof of Theorem 17 We start with a proof of Lemma 15. Proof of Lemma 15 Without loss of generality assume that the samples (Z1, , Zp) are arranged in increasing order of variance, i.e. increasing order of {Dp(i, i)}p i=1. Let {δp : p 1} be a positive sequence converging to 0, such that i,j=1 Ap(i, j)2 = o(p). The existence of such a sequence follows from the assumption (5). Let q := arg min{i [p] : Dp(i, i) δp}, r := arg max{i [p] : Dp(i, i) 1 Recall that if Z N(0, 1), E|Z| = q 2 π. 
This implies

\[ \mathbb{E} \sup_{u \in [-1,1]^p} \Big| \sum_{i=1}^{q-1} Z_i u_i \Big| \leq \sum_{i=1}^{q-1} \mathbb{E}|Z_i| = \sqrt{\frac{2}{\pi}} \sum_{i=1}^{q-1} \sqrt{D_p(i, i)} \leq \sqrt{\frac{2\delta_p}{\pi}}\, p = o(p), \quad (27) \]
\[ \mathbb{E} \sup_{u \in [-1,1]^p} \Big| \sum_{i=r+1}^{p} Z_i u_i \Big| \leq \sqrt{\frac{2}{\pi}} \sum_{i=r+1}^p \sqrt{D_p(i, i)}\, \mathbf{1}\{D_p(i,i) \geq 1/\delta_p\} \leq \sqrt{\frac{2\delta_p}{\pi}} \sum_{i=1}^p D_p(i, i) = o(p), \]

where we use $\sum_{i=1}^p D_p(i, i) = O(p)$. For $q \leq i \leq r$, setting $Y_i := \frac{Z_i}{\sqrt{D_p(i,i)}}$, let $\Gamma_p$ denote the $(r - q + 1) \times (r - q + 1)$ dimensional covariance matrix of the vector $Y := (Y_q, \ldots, Y_r)^T$. Define $E = (\xi_q, \ldots, \xi_r)^T := \Gamma_p^{-1/2} Y$, and note that $(\xi_q, \ldots, \xi_r) \overset{iid}{\sim} N(0, 1)$. Setting $v_i := u_i \sqrt{D_p(i, i)}$ and $v := (v_q, \ldots, v_r)^T$, we have

\[ \sum_{i=q}^r u_i Z_i = \sum_{i=q}^r v_i Y_i = v^T \Gamma_p^{1/2} E = v^T E + v^T B E = \sum_{i=q}^r \sqrt{D_p(i, i)}\, u_i \xi_i + v^T B E, \quad (28) \]

where $B := \Gamma_p^{1/2} - I_{r-q+1}$. We now claim that

\[ \frac{1}{p} \sup_{v \in \delta_p^{-1/2} [-1,1]^{r-q+1}} \big| v^T B E \big| \xrightarrow{P} 0. \quad (29) \]

Given (29), it follows from (28) that

\[ \frac{1}{p} \sup_{u \in [-1,1]^p} \Big| \sum_{i=q}^r u_i Z_i - \sum_{i=q}^r \sqrt{D_p(i, i)}\, u_i \xi_i \Big| \xrightarrow{P} 0. \quad (30) \]

Finally, with $\{\xi_i\}_{i \in [p] \setminus [q, r]} \overset{iid}{\sim} N(0,1)$ independent of $\{\xi_i\}_{i=q}^r$, using arguments similar to the derivation of (27) we get

\[ \mathbb{E} \sup_{u \in [-1,1]^p} \Big| \sum_{i \notin [q,r]} \big( Z_i - \sqrt{D_p(i,i)}\, \xi_i \big) u_i \Big| \leq 2\sqrt{\frac{2}{\pi}} \sum_{i \notin [q,r]} \sqrt{D_p(i, i)} = o(p). \quad (31) \]

Combining (27), (30) and (31), the desired conclusion follows. It thus remains to verify (29). To this end, note that $\Gamma_p(i, i) = 1$, and $\Gamma_p(i, j) = \frac{A_p(i,j)}{\sqrt{D_p(i,i) D_p(j,j)}}$ for $i \neq j$. Setting $C_\delta(i, j) := \Gamma_p(i, j) \mathbf{1}\{i \neq j\}$ and using the spectral theorem, we write

\[ C_\delta = \sum_{\ell=q}^r \lambda_\ell \psi_\ell \psi_\ell^T = \Psi \Lambda \Psi^T. \]

Observe that $\Gamma_p = I + C_\delta$ is positive semidefinite, and has eigenvalues $1 + \lambda_\ell$ with $\lambda_\ell \geq -1$. Then we have

\[ \sum_{\ell=q}^r \lambda_\ell^2 = \sum_{i,j} C_\delta(i, j)^2 \leq \frac{1}{\delta_p^2} \sum_{i,j=1}^p A_p(i, j)^2, \quad (32) \]

where we use $D_p(i,i) \geq \delta_p$ for $q \leq i \leq r$. Finally, the matrix $B$ can be expressed as

\[ B = (I_{r-q+1} + C_\delta)^{1/2} - I_{r-q+1} = \sum_{\ell=q}^r \mu_\ell \psi_\ell \psi_\ell^T =: \Psi \widetilde{\Lambda} \Psi^T, \]

where $\mu_\ell := \sqrt{1 + \lambda_\ell} - 1$, and $\widetilde{\Lambda}$ is a diagonal matrix with entries $\{\mu_\ell\}_{\ell=q}^r$. Consequently,

\[ \mathbb{E} \sup_{v \in \delta_p^{-1/2}[-1,1]^{r-q+1}} |v^T B E| \leq \frac{1}{\sqrt{\delta_p}}\, \mathbb{E} \| B E \|_1 \leq \sqrt{\frac{p}{\delta_p}}\, \mathbb{E} \| B E \|_2 \leq \sqrt{\frac{p}{\delta_p}} \sqrt{\mathbb{E} \| \Psi \widetilde{\Lambda} F \|_2^2} = \sqrt{\frac{p}{\delta_p}} \sqrt{\sum_{\ell=q}^r \mu_\ell^2}, \]

where $F_\ell := \psi_\ell^T E$ are iid $N(0, 1)$ for $q \leq \ell \leq r$, and we have used Jensen's inequality. The last expression is bounded by

\[ \sqrt{\frac{p}{\delta_p}} \sqrt{C \sum_{\ell=q}^r \lambda_\ell^2} \leq \sqrt{\frac{C p}{\delta_p^3} \sum_{i,j=1}^p A_p(i, j)^2} = o(p), \]

where $C := \sup_{x \geq -1} \big( \frac{\sqrt{1+x} - 1}{x} \big)^2 < \infty$, the middle bound uses (32), and the final estimate follows from the choice of $\delta_p$. This verifies (29), and completes the proof of the lemma.

To establish Theorem 17, we need the following definitions.

Definition 33 Let $\xi_1, \ldots, \xi_p \overset{iid}{\sim} N(0, 1)$ be the random variables obtained from Lemma 15. Let $X \sim U([0,1])$ be independent of the $\xi_i$ variables. Given $u \in [-1,1]^p$, set $Z = \xi_i$ and $U = u_i$ if $X \in (\frac{i-1}{p}, \frac{i}{p}]$. Define $\bar{L}^{(u)}_p$ to be the joint distribution of $(X, Z, U)$.

Fix $W \in \mathcal{W}$, and let $\phi, \psi$ be $L^1$ functions on $[0, 1]$. Let $\mathcal{F}_{2,4}$ denote the space of all joint distributions of triplets $(X, Z, U) \sim \nu$ such that $X \sim U([0,1])$, $\mathbb{E}_\nu[Z^2] \leq 4$, and $|U| \leq 1$. Define the functional $\mathbb{G}_{W,\phi,\psi} : \mathcal{F}_{2,4} \to \mathbb{R} \cup \{-\infty\}$ by

\[ \mathbb{G}_{W,\phi,\psi}(\nu) = -\frac{1}{2\sigma^2} \mathbb{E}[W(X_1, X_2) U_1 U_2] + \frac{1}{\sigma^2} \mathbb{E}[\phi(X) U] + \frac{1}{\sigma} \mathbb{E}\big[ \sqrt{\psi(X)}\, U Z \big] - \mathbb{E}\Big[ G\Big( U, \frac{\psi(X)}{\sigma^2} \Big) \Big], \quad (33) \]

where $(X_1, Z_1, U_1), (X_2, Z_2, U_2)$ are iid copies from $\nu$. Finally, let $\bar{\mathcal{F}}$ denote the space of all probability measures $\nu$ on $\mathbb{R}^3$ such that if $(X, Z, U) \sim \nu$, then

\[ X \sim U([0,1]), \quad Z \sim N(0,1), \quad X \perp Z, \quad |U| \leq 1 \text{ a.s.} \quad (34) \]

We note that $\bar{\mathcal{F}} \subseteq \mathcal{F}_{2,4}$, an observation that will be helpful in our subsequent analysis. The following stability estimates will be crucial in our proof of Theorem 17. The proof is deferred to the Appendix.

Lemma 34 (i) We have, for $W, W' \in \mathcal{W}$ and $\phi, \psi, \phi', \psi' \in L^1([0,1])$,

\[ \sup_{\nu \in \mathcal{F}_{2,4}} \big| \mathbb{G}_{W,\phi,\psi}(\nu) - \mathbb{G}_{W',\phi',\psi'}(\nu) \big| \lesssim \| W - W' \|_\square + \| \phi - \phi' \|_1 + \| \psi - \psi' \|_1^{1/2} + \| \psi - \psi' \|_1. \]

(ii) Suppose the following assumptions hold:

(a) $W_k, W \in \mathcal{W}$ are such that $d_\square(W_k, W) \to 0$.
(b) $\phi_k, \phi \in L^\infty$ are such that $\int_{[0,1]} |\phi_k(x) - \phi(x)|\, dx \to 0$.
(c) $\psi_k, \psi \in L^1[0,1]$ are such that $\int_{[0,1]} |\psi_k(x) - \psi(x)|\, dx \to 0$.

Then, writing $(W * \phi)(x) := \int_{[0,1]} W(x, y) \phi(y)\, dy$, we have

\[ \sup_{\nu \in \mathcal{F}_{2,4}} \big| \mathbb{G}_{W_k,\, W_k * \phi_k + \phi_k \psi_k,\, \psi_k}(\nu) - \mathbb{G}_{W,\, W * \phi + \phi \psi,\, \psi}(\nu) \big| \to 0. \quad (35) \]

Lemma 35 We have, for any $W \in \mathcal{W}$ and $g, \psi \in L^1([0,1])$,

\[ \sup_{\nu \in \bar{\mathcal{F}}} \mathbb{G}_{W,g,\psi}(\nu) = \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F). \]

Further, both suprema are attained.

Remark 36 Depending on whether $D_{KL}(\pi_+ \| \pi)$ and $D_{KL}(\pi_- \| \pi)$ are infinite or not, there are four possible cases.
In our subsequent proofs, we consider the case $D_{KL}(\pi_+ \| \pi) < \infty$ and $D_{KL}(\pi_- \| \pi) = \infty$, noting that the other cases follow by natural modifications.

Lemma 37 Let $d(\cdot, \cdot)$ be the 2-Wasserstein distance on $\mathrm{Pr}([0,1] \times \mathbb{R} \times [-1,1])$. For any $\delta > 0$, set

\[ B_d(\bar{\mathcal{F}}, \delta) = \{ \nu \in \mathrm{Pr}([0,1] \times \mathbb{R} \times [-1,1]) : \inf_{\nu' \in \bar{\mathcal{F}}} d(\nu, \nu') < \delta \}. \]

Then there exists a sequence $\delta_p \to 0$ as $p \to \infty$ such that, setting

\[ F_p = \{ \forall u \in [-1,1]^p,\ \bar{L}^{(u)}_p \in B_d(\bar{\mathcal{F}}, \delta_p) \}, \]

we have $P(F_p | X) = 1 - o(1)$. The proof of Lemma 37 is straightforward, and is thus omitted.

Lemma 38 Let $\{u_p : p \geq 1\}$, $u_p \in [-1,1]^p$, be such that $\bar{L}^{(u_p)}_p$ converges weakly to $\nu_0 \in \bar{\mathcal{F}}$. Then for any $W \in \mathcal{W}$ and $g, \psi \in L^1$,

\[ \limsup_{p \to \infty} \mathbb{G}_{W,g,\psi}(\bar{L}^{(u_p)}_p) \leq \mathbb{G}_{W,g,\psi}(\nu_0). \]

Lemma 39 Suppose we are in the setting of Theorem 17. Fix $F \in \mathcal{F}$, $\delta, K > 0$ and $p \geq 1$. Let $V_i \sim U([(i-1)/p, i/p])$ be independent of the $\xi_i$ arising in Lemma 15. Set $d^K_i = d_i \mathbf{1}(|d_i| \leq K)$ and define

\[ \bar{u}_i := \begin{cases} F(V_i, \xi_i) & \text{if } F(V_i, \xi_i) \geq -1 + \delta, \\ \dot{c}(0, d^K_i) & \text{otherwise.} \end{cases} \quad (36) \]

Then we have, setting $\bar{u} = (\bar{u}_i)$, for any $\varepsilon > 0$,

\[ \limsup_{K \to \infty} \limsup_{\delta \to 0} \limsup_{p \to \infty} P\Big[ \Big| \frac{1}{p} \widetilde{M}_p(\bar{u}) - \mathbb{G}_{W,g,\psi}(F) \Big| > \varepsilon \,\Big|\, X \Big] = 0. \]

We prove Theorem 17 assuming Lemmas 35, 38 and 39. The corresponding proofs are deferred to the end of the section.

Proof of Theorem 17 Using Lemma 15, it suffices to prove that

\[ \frac{1}{p} \sup_{u \in [-1,1]^p} \widetilde{M}_p(u) \xrightarrow{P|X} \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F). \]

Upper bound: Note that

\[ \frac{1}{p} \widetilde{M}_p(u) \overset{(b)}{=} \mathbb{G}_{W_{pA_p},\, W_{pA_p} * w_{\beta_0} + w_{\beta_0} w_D,\, w_D}(\bar{L}^{(u)}_p) \overset{(c)}{=} \mathbb{G}_{W,g,\psi}(\bar{L}^{(u)}_p) + E^{(2)}_p(u), \]

where $\sup_{u \in [-1,1]^p} |E^{(2)}_p(u)| \xrightarrow{P|X} 0$ as $p \to \infty$. Here, (b) uses the definition (33), and (c) uses Lemma 34 part (ii). To invoke Lemma 34, we use the fact that

\[ P\big[ \forall u \in [-1,1]^p,\ \bar{L}^{(u)}_p \in \mathcal{F}_{2,4} \big] = P\Big[ \sum_{i=1}^p \xi_i^2 \leq 4p \Big] \to 1 \quad (37) \]

as $p \to \infty$. To complete the upper bound, invoking Lemma 35, it suffices to establish that for all $\varepsilon > 0$,

\[ P\Big[ \sup_{u \in [-1,1]^p} \mathbb{G}_{W,g,\psi}(\bar{L}^{(u)}_p) < \sup_{\nu \in \bar{\mathcal{F}}} \mathbb{G}_{W,g,\psi}(\nu) + \varepsilon \,\Big|\, X \Big] = 1 - o(1). \quad (38) \]

We prove (38) by contradiction. Suppose there exists $\varepsilon > 0$ such that, setting

\[ E_p = \Big\{ \sup_{u \in [-1,1]^p} \mathbb{G}_{W,g,\psi}(\bar{L}^{(u)}_p) > \sup_{\nu \in \bar{\mathcal{F}}} \mathbb{G}_{W,g,\psi}(\nu) + \varepsilon \Big\}, \]

we have $\limsup_p P[E_p | X] > 0$. Using Lemma 37, we have $\limsup_p P[F_p \cap E_p] > 0$. Therefore, there exists a subsequence $p_k$ along which $E_{p_k} \cap F_{p_k} \neq \emptyset$. Consequently, there exist $\xi^{p_k} := (\xi_{1,p_k}, \ldots, \xi_{p_k,p_k}) \in \mathbb{R}^{p_k}$, $\xi^{p_k} \in E_{p_k} \cap F_{p_k}$, and $u^{p_k} \in [-1,1]^{p_k}$ such that

\[ \mathbb{G}_{W,g,\psi}(\bar{L}^{(u^{p_k})}_{p_k}) > \sup_{\nu \in \bar{\mathcal{F}}} \mathbb{G}_{W,g,\psi}(\nu) + \varepsilon, \qquad \bar{L}^{(u^{p_k})}_{p_k} \in B_d(\bar{\mathcal{F}}, \delta_{p_k}). \]

This in turn implies that the sequence $\{\bar{L}^{(u^{p_k})}_{p_k} : k \geq 1\}$ is tight, and hence has a subsequence converging weakly to some $\nu_0 \in \bar{\mathcal{F}}$. Assuming without loss of generality that $\bar{L}^{(u^{p_k})}_{p_k} \xrightarrow{d} \nu_0$, the desired contradiction follows once we show that

\[ \limsup_{k \to \infty} \mathbb{G}_{W,g,\psi}(\bar{L}^{(u^{p_k})}_{p_k}) \leq \mathbb{G}_{W,g,\psi}(\nu_0). \quad (39) \]

But this follows from Lemma 38.

Lower Bound: It suffices to show that for all $\varepsilon > 0$,

\[ P\Big[ \frac{1}{p} \sup_{u \in [-1,1]^p} \widetilde{M}_p(u) > \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F) - \varepsilon \,\Big|\, X \Big] = 1 - o(1). \]

To this end, let $F \in \mathcal{F}$ satisfy

\[ \mathbb{G}_{W,g,\psi}(F) > \sup_{F' \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F') - \varepsilon. \]

Note that if $\mathbb{E}[G(F(X, Z), \psi(X)/\sigma^2)] = \infty$, then the lower bound is trivial. Thus we assume, without loss of generality, that $\mathbb{E}[G(F(X, Z), \psi(X)/\sigma^2)] < \infty$. Next, fixing $K > 0$, we have

\[ \mathbb{G}_{W,g,\psi_K}(F) = \mathbb{G}_{W,g,\psi}(F) + o_K(1), \quad (40) \]

where $\psi_K(x) := \psi(x) \mathbf{1}(\psi(x) \leq K)$, and we have used Lemma 34 Part (i). Further, setting $d^K_i = \min\{d_i, K\}$ and writing $D^K$ for the corresponding truncated vector, a similar argument gives

\[ \mathbb{G}_{W_{pA_p},\, W_{pA_p} * w_{\beta_0} + w_{\beta_0} w_D,\, w_D}(\bar{L}^{(u)}_p) = \mathbb{G}_{W_{pA_p},\, W_{pA_p} * w_{\beta_0} + w_{\beta_0} w_{D^K},\, w_{D^K}}(\bar{L}^{(u)}_p) + o_K(1). \quad (41) \]

For any $p \geq 1$, let $\xi_1, \ldots, \xi_p$ denote the iid $N(0, 1)$ random variables arising in the optimization problem $\widetilde{M}_p$. For each $i \in [p]$, generate $V_i \sim U([(i-1)/p, i/p])$ independent of each other and of the $\xi_i$ variables. Fixing $\delta > 0$, we set

\[ \bar{u}_i := \begin{cases} F(V_i, \xi_i) & \text{if } F(V_i, \xi_i) \geq -1 + \delta, \\ \dot{c}(0, d^K_i) & \text{otherwise.} \end{cases} \quad (42) \]
(42) Using Lemma 39, for any ε > 0, there exists δ, K > 0 (depending on ε), such that with probability at least 1 ε, 1 p sup u [ 1,1]p f Mp(u) 1 p f Mp( u) GW,g,ψ(F) ε sup F F GW,g,ψ(F) 2ε. As ε > 0 is arbitrary, this completes the proof of the lower bound. It remains to prove Lemma 35, 38 and 39. We establish each result in turn. Proof of Lemma 35 To begin, note that for F F, (X, Z, F(X, Z)) ν F. Therefore, sup ν F GW,g,ψ(ν) sup F F GW,g,ψ(F). We now turn to the reverse inequality. Fix (X, Z, U) ν F. Setting V := Φ(Z) note that X, V are U([0, 1]) iid. Define F(X, V ) = E[U|X, V ]. Note that E[W(X, X )UU ] = E[W(X, X )F(X, V )F(X , V )] E[g(X)U] = E[g(X)F(X, V )], E h G U, ψ(X) i E h G F(X, V ), ψ(X) where the final step follows from Jensen s inequality and the observation that 2 u2 G(u, d) 0 from Lemma 4. This implies GW,g,ψ(ν) 1 2σ2 E[W(X, X )F(X, V )F(X, V )] + 1 σ2 E[g(X)F(X, V )] ψ(X)F(X, V )Φ 1(V )] i E h G F(X, V ), ψ(X) Finally, noting that Φ 1 is strictly increasing, we have, ψ(X)F(X, V )Φ 1(V )] E[ p ψ(X)F (X, V )Φ 1(V )], Variational Inference in linear regression where for each x [0, 1], F (x, ) is the monotone rearrangment of F(x, ). This last step follows from Hardy-Littlewood inequality. Observe that as the Uniform distribution is invariant under measure preserving transformations, the other terms in the functional above are unchanged by this rearrangement. Thus GW,g,ψ(ν) GW,g,ψ(F ). Since F F, the proof is complete. Finally, we show that both suprema are attained. To this end, observe that by Lemma 30 Part (iv), GW,g,ψ( ) is upper semi-continuous. Moreover, F is compact under weak topology this follows as the collection F is tight, and thus pre-compact by Prokhorov s theorem. Thus a maximum exists. To see that GW,φ,ψ( ) attains it s maximum, start with a maximizer ν0 of GW,φ,ψ( ), and take F 0 as obtained in the proof above. Proof of Lemma 38 Observe that we can approximate W in L1 by a continuous function W (ℓ). This implies E L(u) p [W(X1, X2)U1U2] = E L(u) p [W (ℓ)(X1, X2)U1U2] + oℓ(1) p Eν0[W (ℓ)(X1, X2)U1U2] + oℓ(1) = Eν0[W(X1, X2)U1U2] + oℓ(1). Upon sending ℓ , we conclude that E L(u) p [W(X1, X2)U1U2] Eν0[W(X1, X2)U1U2]. A similar argument shows that E L(u) p [g(X)U] p Eν0[g(X)U]. Next, we note that we can approximate ψ in L1 by a sequence of continuous functions {ψ(ℓ) : ℓ 1}, and thus, by Cauchy-Schwarz |E L(u) p [ p ψ(X)UZ] E L(u) p [ q ψ(ℓ)(X)UZ]| E L(u) p [| p E L(u) p [( q ψ(X))2]E L(u) p [Z2] 0 |ψ(ℓ)(x) ψ(x)|dx = oℓ(1). The last inequality uses that L(up) p F2,4. This implies E L(u) p [ p ψ(X)UZ] = E L(u) p [ q ψ(ℓ)(X)UZ] + oℓ(1) ψ(ℓ)(X)UZ] + oℓ(1) ψ(X)UZ] + oℓ(1). Sending ℓ , we have E L(u) p [ p ψ(X)UZ] Eν0[ p ψ(X)UZ] as p . Mukherjee and Sen The desired conclusion follows once we establish that lim inf p E L(u) p h G U, ψ(X) i Eν0 h G U, ψ(X) But this follows on using Fatou s lemma and the lower semicontinuity of G , ψ(X) σ2 (Lemma 30 part (iv)). Proof of Lemma 39 The definition of u in (36) implies G(ui, d K i ) = ( G(F(Vi, ξi), d K i ) if F(Vi, ξi) 1 + δ, 0 o.w. Observe that i=1 G(ui, d K i ) =1 i=1 G(F(Vi, ξi), d K i )1(F(Vi, ξi) 1 + δ) (43) Now, the function G : [ 1 + δ, 1] [0, K] R+ is bounded, and thus by Chebychev inequality, i=1 G(F(Vi, ξi), d K i )1(F(Vi, ξi) 1 + δ) i=1 E h G(F(Vi, ξi), d K i )1(F(Vi, ξi) 1 + δ) i P|X 0. (44) Also, note that i=1 E h G(F(Vi, ξi), d K i )1(F(Vi, ξi) 1 + δ) i = E[G(F(X, Z), wd K,p(X))1(F(X, Z) 1 + δ)], = E h G F(X, Z), ψK(X) 1(F(X, Z) 1 + δ) i + o(1) (45) where X U([0, 1]) and Z N(0, 1) are independent. 
The last display uses Lemma 4 along with the fact that wd K,p ψK/σ2 in L1 for almost every K > 0. Now, the assumptions E[G(F(X, Z), ψ(X)/σ2)] < and DKL(π π) = together imply P[F(X, Z) = 1] = 0. In combination with Dominated Convergence Theorem, this implies lim δ 0 E h G F(X, Z), ψK(X) 1(F(X, Z) 1 + δ) i = E h G F(X, Z), ψK(X) Combining (43), (44), (45) and (46), for any ε > 0, we have, lim sup δ 0 lim sup p P h 1 i=1 G(ui, d K i ) E h G F(X, Z), ψK(X) i > ε|X i = 0. (47) Variational Inference in linear regression We now claim that for any ε > 0, lim sup δ 0 lim sup p P h 1 i,j=1 Ap(i, j)uiuj E[W(X1, X2)F(X1, Z1)F(X2, Z2)] > ε|X i = 0. lim sup δ 0 lim sup p P h u TApβ0 + u TDpβ0 E[g(X)F(X, Z)] > ε|X i = 0. lim sup δ 0 lim sup p P h Dp(i, i)ξi E[ p ψ(X)UZ] > ε|X i = 0. We first turn to the quadratic form. We have, ij=1 Ap(i, j)uiuj = E L(u) p [Wp Ap(X1, X2)U1U2] (a) = E L(u) p [W(X1, X2)U1U2] + o(1) (b) = E L(u) p [W (K)(X1, X2)U1U2] + o K(1). (49) Here, the step (a) uses Assumption (a), and step (b) uses W (K) := W1(|W| K) L1 W. We have, setting Mp(i, j) := R [(i 1)/p,i/p] R [(j 1)/p,j/p] W (K)(x, y)dxdy, E L(u) p [W (K)(X1, X2)U1U2] = i,j=1 Mp(i, j)uiuj = i,j=1 Mp(i, j)E[ui]E[uj] + o P (1) (50) by the weak law of large numbers, and the observation |Mp(i, j)| K/p2. Here, the o P (1) term converges to zero in probability under the joint distribution of {(Vi, ξi) : 1 i p}. Further, we have, i,j=1 Mp(i, j)E[ui]E[uj] i,j=1 Mp(i, j)E[F(Vi, Z)]E[F(Vj, Z)] 2KP[F(X, Z) 1 + δ], (51) where X U([0, 1]) and Z N(0, 1) are independent. Finally, i,j=1 Mp(i, j)E[F(Vi, Z)]E[F(Vj, Z)] = E[WMp(X, X )F(X, Z)F(X , Z )] (a) = E[W (K)(X, X )F(X, Z)F(X , Z )] + o(1) (b) = E[W(X, X )F(X, Z)F(X , Z )] + o K(1) (52) where (a) uses WMp W (K) 1 0 as p and (b) uses W W (K) 1 0 as K . Combining (49), (50), (51), (52), the first conclusion of (48) follows upon sending p , δ 0 and K . The other conclusions of (48) follow using analogous arguments, and are thus omitted. Mukherjee and Sen 4.3 Proof of Theorem 20 and related results The following continuity statement will be critical for the proof of Theorem 20. The proof is deferred to the Appendix. Lemma 40 Define a map m : Pr [0, 1] R [ 1, 1] W [0, 1] R, (µ, W, x) 7 E(X,Z,U) µ[W(x, X )U ]. (i) Fix x [0, 1], µ Pr [0, 1] R [ 1, 1] , and W, W W. For any ε > 0, set S(ε) = {x : |m(µ, W, x) m(µ, W , x)| > ε}. For X U([0, 1]), we have, P[X S(ε)] 2 (ii) For any W W, if νp w ν, then sup x [0,1] m(νp, W, x) m(ν, W, x) 0 Proof of Theorem 20 We break the proof into several steps. Note that using Lemma 35, both variational problems attain their suprema. (i) We establish uniqueness of the maximizer assuming (11). If possible, let F1 and F2 be two distinct optimizers of GW,g,ψ in F. Therefore, using Theorem 17, 1 p sup u [ 1,1]p Mp(u) P|X sup F F GW,g,ψ(F) = GW,g,ψ(F1) = GW,g,ψ(F2). Let ξ1, , ξp be iid N(0, 1) random variables arising in the functional f Mp( ). Further, let Vi U([(i 1)/p, i/p]) be independent of the ξi variables. Fixing δ > 0, following the recipe in (36), we define two sequences u(1) p,δ,K, u(2) p,δ,K in [ 1, 1]p corresponding to F1, F2 respectively. Using Lemma 39, for i = 1, 2, for δ, K > 0 we have p Mp(u(i) p,δ,K) > GW,g,ψ(Fi) δ X i 1 E(δ, K), where lim sup K lim supδ 0 E(δ, K) = 0. Moreover, using Theorem 17 and the assumption that F1 and F2 are optimizers of the limiting variational problem, we have, for i = 1, 2, p Mp(u p) < GW,g,ψ(Fi) + δ X i = 1 o(1). 
Variational Inference in linear regression On the event { u(i) p,δ,K u p 2 2 > pε, i = 1, 2} { 1 p Mp(u p) < GW,g,ψ(Fi)+ δ 4 }, for i = 1, 2 we have, 1 p Mp(u(i) p,δ,K) 1 p Mp(u p) δ GW,g,ψ(Fi) + δ 4 δ < GW,g,ψ(Fi) δ Using (11), this implies, for all p sufficiently large, P[ u(i) p,δ,K u p 2 2 > pε, i = 1, 2|X] 3E(δ, K). By triangle inequality, P[ u(1) p,δ,K u(2) p,δ,K 2 2 > 2pε|X] 3E(δ, K). As u(i) p,δ,K [ 1, 1]p, we have, 1 p E[ u(1) p,δ,K u(2) p,δ,K 2 2] 2ε + 12E(δ, K). Thus, with X U([0, 1]), Z N(0, 1) independent, we have, E[(F1(X, Z) F2(X, Z))2] i=1 E[(F1(Vi, ξi) F2(Vi, ξi))2] p E[ u(1) p,δ,K u(2) p,δ,K 2 2] + X a=1,2 P[Fa(X, Z) 1 + δ] 2ε + 12E(δ, K) + X a=1,2 P[Fa(X, Z) 1 + δ]. Setting δ 0 followed by K , we obtain that E[(F1(X, Z) F2(X, Z))2] 2ε. Here we use that F1, F2 are optimizers; thus without loss of generality we can assume E[G(Fa(X, Z), ψ(X)] < for a = 1, 2, which along with Remark 36 implies that P[Fa(X, Z) = 1] = 0. Finally, note that ε > 0 is arbitrary, and thus F1 = F2 a.s.. This establishes the desired uniqueness. (ii) As L (u p) p and L (u p) p are close in weak topology, it suffices to work with L (u p) p . With up,δ,K as in (36), for any ε > 0, Lemma 39 and (11) imply lim sup K lim sup δ 0 lim sup p P[ up,δ,K u p 2 2 > pε|X] = 0. Recalling the 2-Wasserstein distance d( , ) from Lemma 37, we have, d( L (u p) p , L(up,δ,K) p ) 1 p up,δ,K u p 2 2, and so it suffices to look at L(up,δ,K) p . Let η : R3 R be bounded continuous. Then E L (up,δ,K ) p [η(X, Z, U)] = 1 i=1 η(Vi, ξi, up,δ,K(i)) = 1 i=1 E[η(Vi, ξi, up,δ,K(i))] + o P (1), Mukherjee and Sen where the last equality follows from the law of large numbers and the boundedness of η. Here, the o P (1) term converges to zero in probability under the joint distribution of {(Vi, ξi) : 1 i p}. Finally, i=1 E[η(Vi, ξi, up,δ,K(i))] E[η(X, Z, F(X, Z)1(F(X, Z) > 1 + δ))] P[F(X, Z) 1 + δ]. This gives, for any ε > 0, lim sup K lim sup δ 0 lim sup p P h E L (up,δ,K ) p [η(X, Z, U)] E[η(X, Z, F(X, Z))] > ε|X i = 0, where we again use that P[F(X, Z) = 1] = 0. This completes the proof of Part (ii). (iii) For any p 1, the function f Mp( ) is upper semicontinuous on [ 1, 1]p, and thus attains its maximum. Let {ˆup : p 1} denote any sequence of global maximizers of f Mp( ). Lemma 15 and the separation condition (11) together imply that 1 p u p ˆup 2 2 P|X 0. In turn, this implies d( L (u p) p , L(ˆup) p ) P|X 0, (53) where d( , ) denotes the 2-Wasserstein distance. Thus L(ˆup) p converges weakly in probability to ν := (X, Z, F (X, Z)). Next, differentiating f Mp( ) at u ( 1, 1)p, we obtain f Mp(u) = h 1 Apu + Apβ0 + Dpβ0 + diag( q Dp(i, i))ξ h(u, d) i , where ξ = (ξ1, , ξp)T, and h( , ) is the vector obtained upon applying h coordinatewise to the two vectors. Note that if ui +1, then h(ui, di) , and thus ui f Mp(u, d) . Consequently, ˆui < 1. A similar argument implies ˆui > 1, and thus ˆup ( 1, 1)p. This immediately implies that ˆup is a critical point of f Mp, and satisfies the fixed point equation Apˆup + Apβ0 + Dpβ0 + diag( q Dp(i, i))ξ , d . (54) Let f1 : [0, 1] R and f2 : R R be bounded continuous functions. 
Recalling the function m from Lemma 40 and setting gp = Wp Ap wβ0 + wβ0w D, we have, E L (ˆup) p [f1(X)f2(Z)U] =E L (ˆup) p h f1(X)f2(Z) c 1 m( L(ˆup) p , Wp Ap, X) + gp(X) + p w D(X)Z , wd(X) i =E L (ˆup) p h f1(X)f2(Z) c 1 m( L(ˆup) p , Wp Ap, X) + g(X) + p ψ(X)Z , wd(X) i + Ep, (55) Variational Inference in linear regression where we use c(θ, d) 1, and observe that |Ep| E L (ˆup) p [|( p ψ(X))Z|] + gp g 1 w D ψ 1 + gp g 1 = o(1). (56) Define the good event E (ε) = {|m( L(ˆup) p , Wp Ap, X) m( L(ˆup) p , W, X)| ε} {|w D(X) ψ(X)| ε}. Then, upon observing that supγ1,γ2 | 2 γ1 γ2 c(γ1, γ2)| 1 and supγ1,γ2 | 2 γ2 1 c(γ1, γ2)| 1, we have, on the event E (ε), c 1 m( L(ˆup) p , Wp Ap, X) + g(X) + E[ p ψ(X)Z] , wd(X) m( L(ˆup) p , W, X) + g(X) + E[ p ψ(X)Z] , ψ(X) Thus we have, E L (ˆup) p h f1(X)f2(Z) c 1 m( L(ˆup) p , Wp Ap, X) + g(X) + p ψ(X)Z , wd(X) i E L (ˆup) p h f1(X)f2(Z) c 1 m( L(ˆup) p , W, X) + g(X) + p ψ(X)Z , ψ(X) ε + P[E (ε)c] ε W Wp Ap + P h |wd(X) ψ(X) σ2 | > ε i = O(ε) + o(1), where the last inequality uses Lemma 40 part (i). Finally, Lemma 40 part (ii) implies that E L (ˆup) p h f1(X)f2(Z) c 1 m( L(ˆup) p , W, X) + g(X) + p ψ(X)Z , ψ(X) =E L (ˆup) p h f1(X)f2(Z) c 1 m(ν , W, X) + g(X) + p ψ(X)Z , ψ(X) i + o(1). (58) Combining (55), (56), (57), (58), we have, E L (ˆup) p [f1(X)f2(Z)U] =E L (ˆup) p h f1(X)f2(Z) c 1 m(ν , W, X) + g(X) + p ψ(X)Z , ψ(X) i + o(1) + O(ε). Sending p , and then ε 0, we obtain Eν [f1(X)f2(Z)U] =Eν h f1(X)f2(Z) c 1 m(ν , W, X) + g(X) + p ψ(X)Z , ψ(X) The desired conclusion follows upon recalling that (X, Z, F (X, Z)) ν . Mukherjee and Sen We next turn to the proof of Lemma 25. To this end, we require the following auxiliary lemma. The proof is deferred to the Appendix. Lemma 41 Suppose rp : [ 1, 1]p R { } is upper semi-continuous on [ 1, 1]p, finite and differentiable on ( 1, 1)p, and sup u ( 1,1)p λmin(Hp(x)) η, where Hp( ) is the Hessian of rp( ). Then there exists a unique global maximizer x0 [ 1, 1]p, and further for any x [ 1, 1]p we have rp(x0) rp(x) η 2 x x0 2 2. Proof of Lemma 25 Proof of (i) The equation (13) implies that the event sup u [ 1,1]p: u u p 2 2>pε 1 p{Mp(u) Mp(u p)} < ελ occurs with probability 1 o(1). Proof of (ii)(a) Differentiating Mp( ), twice, the Hessian is given by σ2 Ap p, p(i, i) = 1 c(h(ui, di), di). (59) Note that c(h(u, d), d) is the variance of a random variable supported on [ 1, 1], and thus c(h(u, d), d) 1. Under the assumption of (i), there exists δ > 0 such that for all p large enough and all x ( 1, 1)p, x THp(u)x = 1 σ2 x TApx x px δ x 2 2. (60) It then follows from Lemma 41 that there exists u [ 1, 1]p which is a unique optimizer of Mp( ), and further, for any u [ 1, 1]p we have Mp(u ) Mp(u) δ This verifies (13). Proof of (ii)(b) To begin, note that the prior is absolutely continuous with respect to the Lebesgue measure on [ 1, 1] and thus DKL(π π) = DKL(π π) = . Thus to maximize Mp( ), it suffices to restrict to ( 1, 1)p. As in (60), it suffices to bound the lower eigenvalue of the Hessian, uniformly in ( 1, 1)p, for which using (59) and the fact that λmin(XTX) is bounded away from 0 it suffices to show that p(i, i) di. By GHS inequality (Ellis et al., Variational Inference in linear regression 1976), c(h(u, di), di) c(0, di), and thus the desired inequality follows, once we establish that sup d 0 d c(0, d) 1. (61) Consider now a scale family of parametric distributions {Pθ : θ 0} with dx exp( θV (x)) exp( dx2/2). Since V ( ) is even, it follows that the first moment under Pθ is 0 for all θ. 
Also, since V ( ) is an increasing, this family has monotone likelihood ratio in T(x) = |x|, and thus the second moment under the law Pθ is decreasing in θ for θ > 0, and so c(0, d) = Varθ=1(Z) Varθ=0(Z). Proceeding to bound the RHS of the above display, for d > 0 we have d Varθ=0(Z) = R 1 1 dz2 exp( dz2/2) dz R 1 1 exp( dz2/2) dz . We substitute dz = t, so that d Varθ=0(Z) = d t2 exp( t2/2) dt R d exp( t2/2) dt . This is exactly the variance of a truncated standard Gaussian distribution, truncated to the interval [ d]. Indeed, the truncated variance is 1 2 d) 1 < 1 (Johnson and Kotz, 1971), where φ( ) and Φ( ) represent the pdf and cdf of the standard Gaussian distribution. Acknowledgments The authors thank Pragya Sur for discussions on high-dimensional regression. SM gratefully thanks NSF (DMS 1712037) for support during this research. The authors gratefully thank the Associate Editor and two anonymous referees for their helpful comments, which significantly improved the exposition. Appendix A. Some technical lemmas We collect some basic results about exponential families in the first subsection. We prove Lemma 41 in the next subsection. Mukherjee and Sen A.1 Results on exponential families We prove Lemmas 2, 4 and 30 in this section. Proof of Lemma 30 (i) This follows by direct computation (see e.g. Lehmann and Casella (2006)). (ii) This follows by direct calculation. (iii) Recall from Definition 1 that dπ (z) = exp γ1z γ2 2 z2 c(γ) . Without loss of generality, we consider the case γ1,k . The case γ1,k follows using the same argument, with obvious modifications. Observe that as γ1,k and lim supk |γ2,k| < , for k sufficiently large, the function γ1,kz γ2,k 2 z2 is increasing on [ 1, 1]. This implies, for any ε > 0, Z 1 ε 1 exp γ1,kz γ2,k 2 z2 dπ(z) exp γ1,k(1 ε) γ2,k 2 (1 ε)2 π([ 1, 1 ε]). On the other hand, exp(c(γk)) can be lower bounded by Z 1 2 exp γ1,kz γ2,k 2 z2 dπ(z) exp γ1,k(1 ε On taking ratios of the above two displays we get πγk([ 1, 1 ε]) exp ε 2γ2 2,k π([ 1, 1 ε]) Also note that the ratio of probabilities under π in the RHS is positive, as 1 is in the support of π. The desired conclusion now follows upon letting k , upon noting that γ1,k , and γ2,k stays bounded. (iv) The proof follows directly from the lower semicontinuity of KL divergence under weak convergence (Posner, 1975, Theorem 1). (v) By definition, H(x, y) = Z [ 1,1] zr exp h(x, y)z y 2z2 c(h(x, y), y) dπ(z). For x { 1, 1}, H(x, ) is independent of y, and thus H(x,y) y = 0. We assume henceforth that x ( 1, 1). Differentiating, we obtain, [ 1,1] zrh z h(x, y) 2 c(h(x, y), y) h(x, y) i dπ(h(x,y),y)(z) [ 1,1] zrh (z x) h(x, y) i dπ(h(x,y),y)(z), Variational Inference in linear regression where the last equality uses c(h(x, y), y)) = x. Therefore, Eπ(h(x,y),y)[|Z x|] + 1 where we use Eπ(h(x,y),y)[|Z|r] 1 for all r 1. Now, differentiating c(h(x, y), y) = x in y, we have, 2c γ1 γ2 (h(x, y), y) c(h(x, y), y) Covπ(h(x,y),y)(Z, Z2) Varπ(h(x,y),y)(Z) 2 E(h(x,y),y)[(Z x)(Z2 x2)] E(h(x,y),y)[(Z x)2] 2 E(h(x,y),y)[(Z x)2(Z + x)] E(h(x,y),y)[(Z x)2] . This implies supx ( 1,1),y R h(x,y) y 1, where we use the trivial bound |Z + x| 2. The desired conclusion follows by plugging this back into (62). Proof of Lemma 2 (a) Lemma 30 Part (a) implies that c(γ1, γ2) = Varπγ(Z) > 0 for all γ1, γ2. Thus c(γ1, γ2) is strictly increasing in γ1. For every γ2 R, as γ1 , πγ w π using Lemma 30 Part (c). The desired conclusion follows on noting that c(γ1, γ2) = Eπγ[Z] γ1 1 by the Dominated Convergence Theorem. 
(b) Using Part (a), we know that c(γ1, γ2) is strictly increasing in γ1, and continuous. Further, if γ1 , c(γ1, γ2) 1. Thus for any t ( 1, 1), the existence of h(t, γ2) follows by the Intermediate Value Theorem. Finally, we argue that for any fixed γ2 R, as t 1, h(t, γ2) . The case t 1 is similar, and thus omitted. We complete the proof by contradiction. Suppose, if possible, that M := lim t 1 h(t, γ2) < . Mukherjee and Sen In this case, as 1 is in the support of π, π([ 1, 0]) > 0. This implies that t = c(h(t, γ2), γ2) = Z [ 1,1] z exp h(t, γ2)z γ2 2 z2 c(h(t, γ2), γ2) dπ(z) [ 1,0] z exp h(t, γ2)z γ2 2 z2 c(h(t, γ2), γ2) dπ(z)+ Z (0,1] z exp h(t, γ2)z γ2 2 z2 c(h(t, γ2), γ2) dπ(z) π(h(t,γ2),γ2)((0, 1]). To complete the argument, it suffices to show that lim inft 1 π(h(t,γ2),γ2)([ 1, 0]) > 0. But this follows on noting that the function t c(h(t, γ2), γ2) is non-decreasing on [0, 1], and thus π(h(t,γ2),γ2)([ 1, 0]) exp( M γ2 2 c(M, γ2))π([ 1, 0]). Proof of Lemma 4 The lemma follows by direct computation. First, note that u = h(u, d) + u h u(u, d) c(h(u, d), d) h u(u, d) = h(u, d), where the last equality follows upon noting that c(h(u, d), d) = u. For the second derivative, note that d =u h(u, d) d c(h(u, d), d) h(u, d) d γ2 c(γ1, γ2) (h(u,d),d) + γ2 c(γ1, γ2) (0,d) = γ2 c(γ1, γ2) (h(u,d),d) + γ2 c(γ1, γ2) (0,d) [ 1,1] z2dπ(h(u,d),d)(z)dz 1 [ 1,1] z2dπ(0,d)(z)dz. Finally, note that 2u = h(u, d) u = 1 c(h(u, d), d) > 0 where the last equality follows from differentiating the equation c(h(u, d), d) = u. A.2 Proof of Lemma 41 Proof Since rp( ) is upper semi-continuous there exists a global maximizer in [ 1, 1]p, say ex0. Fixing any x in ( 1, 1)p, consider the function f(t) := rp((1 t)ex0 + tx). Then f is twice differentiable on (0, 1), and f (t) = (ex0 x)THp((1 t)ex0 + tx)(ex0 x) η ex0 x 2 2. Variational Inference in linear regression Consequently, for any ε (0, 1) Taylor s expansion implies f(1) f(ε) + (1 ε)f (ε) 1 2η(1 ε)2 ex0 x 2 2. Taking limits as ε 0 we get f(1) f(0) + f (0+) 1 2η ex0 x 2 2 f(0) 1 2η ex0 x 2 2, where the last inequality uses the fact that f (0+) 0, as t = 0 is a global maximizer of f. The last display is equivalent to rp(x) rp(ex0) 1 2η ex0 x 2 2. This inequality then extends to x [ 1, 1]p by upper semi-continuity of rp. Appendix B. Stability Estimates for functionals We prove Lemma 34 and 40 in this section. B.1 Proof of Lemma 34 We start with the proof of Part (i). Define the functional TW,φ : F2,4 R TW,φ,ψ(ν) := 1 2E[W(X1, X2)U1U2] + E[U1φ(X1)] + E[ p ψ(X1)U1Z1]. where (X1, Z1, U1), (X2, Z2, U2) ν are iid. Similarly define Iψ : F2,4 R { } Iψ(ν) = E h G U, ψ(X) We observe that GW,φ,ψ(ν) = 1 σ2 TW,φ,ψ(ν) Iψ(ν). Further, sup ν F2,4 |TW,φ,ψ(ν) T W, φ,ψ(ν)| 1 2 W W + φ φ 1, (63) sup ν F2,4 |TW,φ,ψ(ν) TW,φ, ψ(ν)| 2 ψ ψ 1, (64) sup ν F2,4 |Iψ(ν) I ψ(ν)| 1 2σ2 ψ ψ 1. (65) Indeed, (63) is immediate from the definition of TW,φ,ψ( ), and (65) follows from Lemma 4 on noting that supu ( 1,1),d R | G d (u, d)| 1 2. Finally, (64) follows on noting that by Cauchy-Schwarz inequality, Eν h Z2 i Eν h p ψ(X) 2i 2 ψ ψ 1, Mukherjee and Sen where the last inequality uses ( a b)2 |a b| and Eν[Z2] 4. This completes the proof of Part (i). Next, we turn to the proof of Part (ii). Using the definition of cut norm (as in Definition 12) and Part (i), we have, sup ν F2,4 | GWk,Wk.φk+φkψk,ψk(ν) GW,W.φk+φkψ,ψ(ν)| Wk W + ψk ψ 1 0. Using Part (i) again, it suffices to show that W.φk + φkψ L1 W.φ + φψ. 
To this end, note that |W.φk + φk(x)ψ(x) W.φ φ(x)ψ(x)| converges to 0 in measure, and |W.φk(x) + φk(x)ψ(x) W.φ(x) φ(x)ψ(x)| 2 Z [0,1] |W(x, y)|dy + 2ψ, which is an integrable function. This completes the argument using DCT. B.2 Proof of Lemma 40 Proof of Lemma 40 (i) For any ε > 0, let S+(ε) = {x : m(µ, W, x) m(µ, W , x) > ε}. For X U([0, 1]), we have, P(X S+(ε)) 1 εE h (m(µ, W, X) m(µ, W , X))1S+(ε)(X) i εEX,X h (W(X, X ) W (X, X ))1S+(ε)(X)E[U |X ] i Setting S (ε) = {x : m(µ, W, x) m(µ, W , x) < ε}, the same argument now yields that P[X S (ε)] 1 (ii) Let (Xp, Zp, Up) νp and (X, Z, U) ν. Using Skorokhod Embedding Theorem, we assume that (Xp, Zp, Up) (X, Z, U) a.s. as p . Since R |W(x, y)|dxdy < , for any ε > 0 there exists W continuous such that W W 1 ε. Then we have, E[|m(µ, W, X) m(µ, W , X)|] W W 1 ε. |m(νp, W , x) m(ν, W , x)| = E[W (x, Xp)Up] E[W (x, X)U] E[W (x, Xp)Up] E[W (x, Xp)U] + E[W (x, Xp)U] E[W (x, X)U] W E[|Up U|] + sup x,y,z [0,1],|y z| |X Xp| W (x, y) W (x, z) = o(1) Variational Inference in linear regression using the uniform continuity of W . The desired conclusion follows upon combining the two displays above. Appendix C. Proofs of Examples We establish Corollaries 9-11 and 26-29 in this section. Throughout this section o P (1) terms converge to zero in probability under the marginal distribution of the design matrix X. C.1 Accuracy of mean-field approximation Proof of Corollary 9 This is immediate from Theorem 7. Proof of Corollary 10 With Ap and Dp denoting the off-diagonal and diagonal parts of the matrix XTX, to invoke Theorem 7 we need to verify that Ap satisfies (5) and (6), and the empirical measure 1 p Pp i=1 δDp(i,i) is uniformly integrable. We verify these conditions below: Since {xi}1 i n i.i.d. N(0, Γp), and i=1 xix T i , invoking (Vershynin, 2012, Proposition 2.1) gives XTX Γp 2 = OP r p = o P (1), (66) where the last equality uses p = o(n). Noting that Dp Γp,diag 2 XTX Γp 2 Ap Γp,off 2 = o P (1). (67) Consequently, Ap satisfies (5) with high probability, as tr(Γ2 p,off) = o(p). Further, since Γp 2 = O(1), it follows from (66) gives that Ap 2 = OP (1). Finally, noting that max 1 i p |Dp(i, i)| XTX 2 = Γp 2 + o P (1) p Pp i=1 δDp(i,i) is uniformly integrable with high probability. This completes the proof of the corollary. Mukherjee and Sen Proof of Corollary 11 It suffices to verify the same conditions on the matrices (Ap, Dp) as in Corollary 10. To this end, note that for any i = j we have Ap(i, j) = p k=1 B(k, i)B(k, j), (68) i =j Ap(i, j)2 = p2 k,ℓ=1 E[B(k, i)B(k, j)B(ℓ, i)B(ℓ, j)] k =ℓ E[B(k, i)B(k, j)B(ℓ, i)B(ℓ, j)] + p2 k=1 E[B(k, i)B(k, j)] p2 = λ4 + λ2p2 which is o(p) as p = o(n). This verifies (5). Also, (68) gives i,j=1 |Ap(i, j)| = 1 i =j E[B(k, i)B(k, j)] λ2, and so (6) holds with high probability. Finally we have i=1 |Dp(i, i)|2 = 1 k=1 B(k, i) 2 p2 i = λ2 + o(1), p Pp i=1 Dp(i, i) is uniformly integrable. C.2 Limiting variational formula Proof of Corollary 26 (a) The desired conclusion follows from Theorem 17, once we can verify d L1(w D, 1) = o P (1), d (Wp Ap, W) = o P (1). Here D = (Dp(1, 1), , Dp(p, p)) denotes the diagonal entries of XTX, and Ap denotes the offdiagonal part of XTX. Proceeding to verify the above display, invoking (66) we have Dp(i, i) = 1 + 1 p G(i/p)2 + o P (1), and so w D L1 1. Also, since Γp(i, j) = 1 p G(i/p)G(j/p) for all i = j, it follows that d (WpΓp, W) 0, where we use the almost sure continuity of G(., .). 
Since (67) gives d (Wp Ap, WpΓp) = o P (1), combining we get d (Wp Ap, W) d (Wp Ap, WpΓp) + d (WpΓp, W) = o P (1), This completes the proof of part (a). Variational Inference in linear regression (b) We begin by verifying condition (11). To this effect, set h u TΓpu 2z Tu i i=1 G(ui, di), and use the fact that maxi [p] P j =i Γp(i, j) λ < σ2 along with part (a) of Lemma 25 to get the existence of u p such that Mp(u p) Mp(u) λ u p u 2 2. This in turn shows that for any ε > 0 we have lim sup p sup u: u u p 2 2>pδ Mp(u p) Mp(u) o < 0. (69) Using (66) gives 1 p sup u [ 1,1]p 1 p Mp(u) f Mp(u) i = o P (1). (70) Given (69), and (70), it follows that lim sup p sup u: u u p 2 2>pδ n Mp(u p) Mp(u) o < 0. Thus we have verified (11). The desired conclusion then follows from Corollary 22. Part (b)(ii) follows from part (b)(ii) of Lemma 25 and Corollary 22. Proof of Corollary 28 (a) The desired conclusion follows from Theorem 17, once we can verify d L1(w D, ψ) = o P (1), d (Wp Ap, W) = o P (1). Proceeding to verify the above display, note that Dp(i, i) := (XTX)ii = p nσ2 Thus, setting e Di := 1 n Pn k=1 G(k/n, i/p), Using Chernoffbounds it follows that P max 1 i p |Dp(i, i) e Di| > δ pe cδ2 n p pe cδ2 p 0, and so it follows that d L1(w D, ψ) d L1(w D, w e D,p) + d L1(w e D,p, ψ) = o P (1) Mukherjee and Sen where the last equality uses the almost sure continuity of G(., .). It thus suffices to show that d (Wp Ap, W) 0. To this end, note that for any i = j we have Ap(i, j) = p k=1 B(k, i)B(k, j). Using Lemma 42, setting e Ap(i, j) := E[Ap(i, j)] = 1 np Pn k=1 G( k P max S,T [p] P i S,j T p Ap(i, j) P i S,j T p e Ap(i, j) From this, using the condition p = o( n) gives d (Wp Ap, Wp e Ap) = o P (1), which in turn gives d (Wp Ap, W) d (Wp Ap, Wp e Ap) + d (Wp e Ap, W) = o P (1), where the last equality again uses the almost sure continuity of G( , ). (b) (i) Once again, the desired conclusion of part (b) follows from Corollary 22, once we verify condition (11). To this effect, fixing δ > 0 and setting ϑi := Pp j =i Ap(i, j), define a p p matrix Bp,δ by setting Bp,δ(i, j) := Ap(i, j)1{ϑi σ2(1 δ), ϑj σ2(1 δ)}, and note that sup u [ 1,1]p u Apup u Bp,δu 2 i=1 ϑi1{ϑi > σ2(1 δ)}. We now claim that there exists δ > 0 such that lim sup p 1 p i=1 ϑi1{ϑi > σ2(1 δ)} = 0. (71) Given (71), it suffices to show that (11) holds for Mp : [ 1, 1]p 7 R defined by h u TBp,δu 2z Tu i i=1 G(ui, di). Variational Inference in linear regression But this is immediate from Lemma 25 part (a), on noting that j =i Bp,δ(i, j) σ2(1 δ). It thus remains to verify (71). To this effect, recall from part (a) that Wp Ap converges to W in the cut metric, which in turn implies 1 p Pp i=1 δϑi converges weakly in probability to the law of S(X), where X U[0, 1] (Borgs et al., 2015, Theorem 2.16). This gives lim sup p 1 p i=1 ϑi1{ϑi > σ2(1 δ)} ES(X)1{S(X) σ2(1 δ)}. It thus suffices to show that there exists δ > 0 such that the RHS above is 0. But this follows on noting that esssup(S(X)) < σ2. This completes the proof of part (i). Part (b)(ii) follows from part (b)(ii) of Lemma 25 and Corollary 22, as before. Proof of Corollary 29 (a) A direct calculation gives XTX = 1 2I + Ap, where Ap(i, j) =1 =0 otherwise . It then follows that d (Wp Ap, W) W, and d L1(w D, ψ) 0. The desired conclusion then follows from Theorem 17 as before. (b) Since lim sup p maxi [p] P j =i |Ap(i, j)| = 1 2, the result is immediate from Corollary 22 and Lemma 25. Appendix D. 
Relevant concentration inequalities Lemma 42 Let Bik Ber(G(i/n, j/p)/p) are independent random variables with G(x, y) λ. For any δ > 0, there exists C > 0 (depending on δ) such that P max S,T [p] k S,l T (Ap(k, l) E[Ap(k, l)]) > pδ 2p exp C n . To prove Lemma 42, we first obsere that if X is sub-exponential, then X2 is sub-Weibull Kuchibhotla and Chakrabortty (2018). Mukherjee and Sen Lemma 43 Let Y1, , Yn be independent sub-exponential random variables with common sub-exponential parameter c > 0. For any δ > 0, there exists C := C(δ) > 0 such that i=1 (Y 2 i E[Y 2 i ]) > nδ i exp C n . Proof We first claim that Y 2 i are sub-Weibull, i.e., there exists a constant C > 0 (depending only on c) such that P[|Y 2 i E[Y 2 i ]| > t] 2 exp( C We establish the upper tail deviation bound. A similar argument works for the lower tail, and is thus omitted. As the variables Yi have a common sub-exponential constant, maxi n E[Y 2 i ] is uniformly bounded in n. For any t > 0 with 2 1)2 maxi n E2[Yi], P[Y 2 i > E[Y 2 i ] + t] = P[Yi E[Yi] > q t + E[Y 2 i ] E[Yi]] P[Yi E[Yi] > p t + E2[Yi] E[Yi]]. We now claim that t + E2[Yi] E[Yi] > Using (73), along with the deviation bound above, we have, P[Y 2 i > E[Y 2 i ] + t] P h Yi E[Yi] > for some constant C > 0. The last inequality uses that Yi is sub-exponential. (73) can be verified by direct computation. Using (72) and (Kuchibhotla and Chakrabortty, 2018, Proposition A.3), maxi n Y 2 i ψ1/2 is uniformly bounded in n. Consequently, using (Kuchibhotla and Chakrabortty, 2018, Theorem 3.1), for any t > 0, we have, i=1 (Y 2 i E[Y 2 i ]) > C1 b 2( t + Lnt2) i exp( t), where C1 is a constant free of n, and Ln = C2 b / b 2 for some constant C2 > 0 independent of n. Here, b = ( Y 2 1 ψ1/2, , Y 2 n ψ1/2). We set C1 b 2( t + Lnt2) = nδ for some ε > 0. Direct calculation yields that t(δ) = q nδ C1C2 b (1 + o(1)), so that i=1 (Y 2 i E[Y 2 i ]) > nδ i exp C3 n , where C3 > 0 depends on δ. This concludes the proof. Variational Inference in linear regression Proof of Lemma 42 Recall that Bik Ber(G(i/n, j/p)/p) are independent random variables. For S [p], define Yi(S) = P k S Bik. By Chernoffbound, the collection {Yi(S) : 1 i n, S p} are sub-exponential with a common sub-exponential parameter (depending only on λ). By Chernoffbounds, for any fixed S [p], Yi = P k S Bik is a sub-exponential random variable. Using Lemma 43, k,l S (Ap(k, l) E[Ap(k, l)]) > pδ = P i=1 (Yi(S)2 E[Yi(S)2]) > nδ exp C n . A union bound then concludes P max S [p] k,l S (Ap(k, l) E[Ap(k, l)]) > pδ 2p exp C n . (74) We set εkl := Ap(k, l) E[Ap(k, l)] and claim that max S,T [p] | X k S,l T εkl| 5 2 max S [p] | X k,l S εkl| (75) The required conclusion follows from (75) and the deviation bound (74). It thus remains to prove (75). To this effect, note that k S,l T εkl = X k S\T,l T\S εkl + X k S\T,l S T εkl + X k S T,l T\S εkl + X k S T,l S T εkl. k,l S T εkl = X k,l S\T εkl + X k,l T\S εkl + X k,l S T εkl k S\T,l T\S εkl + X k S\T,l S T εkl + X k S T,l T\S εkl . k S,l T εkl = 1 k,l S T εkl X k,l S\T εkl + X k,l T\S εkl + X k,l S T εkl i + X k S T,l S T εkl which on using triangle inequality gives (75). Pierre Alquier and James Ridgway. Concentration of tempered posteriors and of their variational approximations. Annals of Statistics, 48(3):1475 1497, 2020. Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of variational approximations of gibbs posteriors. The Journal of Machine Learning Research, 17(1): 8374 8414, 2016. Mukherjee and Sen Fanny Augeri. 
A transportation approach to the mean-field approximation. ar Xiv preprint ar Xiv:1903.08021, 2019. Tim Austin. The structure of low-complexity gibbs measures on product spaces. The Annals of Probability, 47(6):4002 4023, 2019. Sayantan Banerjee, Isma el Castillo, and Subhashis Ghosal. Bayesian inference in highdimensional models. ar Xiv preprint ar Xiv:2101.04491, 2021. Peter L Bartlett, Philip M Long, G abor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063 30070, 2020. Anirban Basak and Sumit Mukherjee. Universality of the mean-field for the potts model. Probability Theory and Related Fields, 168(3):557 600, 2017. Anirban Bhattacharya, Antik Chakraborty, and Bani K Mallick. Fast sampling with gaussian scale mixture priors in high-dimensional regression. Biometrika, page asw042, 2016. Peter Bickel, David Choi, Xiangyu Chang, and Hai Zhang. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics, 41(4):1922 1943, 2013. Christopher M Bishop. Pattern recognition and machine learning. springer, 2006. David M Blei, Alp Kucukelbir, and Jon D Mc Auliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859 877, 2017. Christian Borgs, Jennifer T Chayes, L aszl o Lov asz, Vera T S os, and Katalin Vesztergombi. Convergent sequences of dense graphs i: Subgraph frequencies, metric properties and testing. Advances in Mathematics, 219(6):1801 1851, 2008. Christian Borgs, Jennifer T Chayes, L aszl o Lov asz, Vera T S os, and Katalin Vesztergombi. Convergent sequences of dense graphs ii. multiway cuts and statistical physics. Annals of Mathematics, pages 151 219, 2012. Christian Borgs, Jennifer T Chayes, Henry Cohn, and Shirshendu Ganguly. Consistent nonparametric estimation for heavy-tailed sparse graphs. ar Xiv preprint ar Xiv:1508.06675, 2015. Christian Borgs, Jennifer T Chayes, Henry Cohn, and Yufei Zhao. An lp theory of sparse graph convergence ii: Ld convergence, quotients and right convergence. The Annals of Probability, 46(1):337 396, 2018. Christian Borgs, Jennifer Chayes, Henry Cohn, and Yufei Zhao. An lp theory of sparse graph convergence i: Limits, sparse random graph models, and power law distributions. Transactions of the American Mathematical Society, 372(5):3019 3062, 2019. Peter Carbonetto and Matthew Stephens. Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian analysis, 7(1):73 108, 2012. Variational Inference in linear regression Carlos M Carvalho, Nicholas G Polson, and James G Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465 480, 2010. Isma el Castillo, Johannes Schmidt-Hieber, and Aad Van der Vaart. Bayesian linear regression with sparse priors. The Annals of Statistics, 43(5):1986 2018, 2015. Sourav Chatterjee and Amir Dembo. Nonlinear large deviations. Advances in Mathematics, 299:396 450, 2016. Badr-Eddine Ch erief-Abdellatif and Pierre Alquier. Consistency of variational bayes inference for estimation and model selection in mixtures. Electronic Journal of Statistics, 12 (2):2995 3035, 2018. Nicholas A Cook, Amir Dembo, and Huy Tuan Pham. Regularity method and large deviation principles for the erdos-renyi hypergraph. ar Xiv preprint ar Xiv:2102.09100, 2021. Ronen Eldan. 
Gaussian-width gradient complexity, reverse log-sobolev inequalities and nonlinear large deviations. Geometric and Functional Analysis, 28(6):1548 1596, 2018. Richard S Ellis, James L Monroe, and Charles M Newman. The ghs and other correlation inequalities for a class of even ferromagnets. Communications in Mathematical Physics, 46(2):167 182, 1976. Zhou Fan, Song Mei, and Andrea Montanari. Tap free energy, spin glasses, and variational inference. ar Xiv preprint ar Xiv:1808.07890, 2018. Sebastian Farquhar, Lewis Smith, and Yarin Gal. Liberty or depth: Deep bayesian neural nets do not need complex weight posterior approximations. Advances in Neural Information Processing Systems, 33:4346 4357, 2020. Augusto Fasano, Daniele Durante, and Giacomo Zanella. Scalable and accurate variational bayes for high-dimensional binary regression models. ar Xiv preprint ar Xiv:1911.06743, 2019. Andrew Foong, David Burt, Yingzhen Li, and Richard Turner. On the expressiveness of approximate inference in bayesian neural networks. Advances in Neural Information Processing Systems, 33:15897 15908, 2020. Alan Frieze and Ravi Kannan. Quick approximation to matrices and applications. Combinatorica, 19(2):175 220, 1999. Robert Gallager. Low-density parity-check codes. IRE Transactions on information theory, 8(1):21 28, 1962. Behrooz Ghorbani, Hamid Javadi, and Andrea Montanari. An instability in variational inference for topic models. In International conference on machine learning, pages 2221 2231. PMLR, 2019. Subhashis Ghosal. Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli, 5(2):315 331, 1999. Mukherjee and Sen Peter Hall, Tung Pham, Matt P Wand, and Shen SJ Wang. Asymptotic normality and valid inference for gaussian variational approximation. The Annals of Statistics, 39(5): 2502 2532, 2011. Wei Han and Yun Yang. Statistical inference in mean-field variational bayes. ar Xiv preprint ar Xiv:1911.01525, 2019. Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2): 949 986, 2022. Hemant Ishwaran and J Sunil Rao. Spike and slab variable selection: frequentist and bayesian strategies. Annals of statistics, 33(2):730 773, 2005. Vishesh Jain, Frederic Koehler, and Elchanan Mossel. The mean-field approximation: Information inequalities, algorithms, and complexity. ar Xiv preprint ar Xiv:1802.06126, 2018. Vishesh Jain, Frederic Koehler, and Andrej Risteski. Mean-field approximation, convex hierarchies, and the optimality of correlation rounding: a unified perspective. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1226 1236, 2019. Norman L Johnson and Samuel Kotz. Continuous univariate distributions: Distributions in statistics. John Wiley and Sons, 1971. Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486): 682 693, 2009. Arun Kumar Kuchibhotla and Abhishek Chakrabortty. Moving beyond sub-gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. ar Xiv preprint ar Xiv:1804.02605, 2018. Se Yoon Lee, Debdeep Pati, and Bani K Mallick. Continuous shrinkage prior revisited: a collapsing behavior and remedy. ar Xiv preprint ar Xiv:2007.02192, 2020. Erich L Lehmann and George Casella. Theory of point estimation. Springer Science & Business Media, 2006. 
L aszl o Lov asz. Large networks and graph limits, volume 60. American Mathematical Soc., 2012. L aszl o Lov asz and Bal azs Szegedy. Szemer edi s lemma for the analyst. GAFA Geometric And Functional Analysis, 17(1):252 270, 2007. David JC Mac Kay. Good error-correcting codes based on very sparse matrices. IEEE transactions on Information Theory, 45(2):399 431, 1999. Variational Inference in linear regression Marc M ezard, Giorgio Parisi, and Miguel Angel Virasoro. Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing Company, 1987. Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear regression. Journal of the american statistical association, 83(404):1023 1032, 1988. Sarah E Neville, John T Ormerod, and MP Wand. Mean field variational bayes for continuous sparse signal shrinkage: pitfalls and remedies. Electronic Journal of Statistics, 8(1): 1113 1151, 2014. John T Ormerod, Chong You, and Samuel M uller. A variational bayes approach to variable selection. Electronic Journal of Statistics, 11(2):3549 3594, 2017. Edward Posner. Random coding strategies for minimum entropy. IEEE Transactions on Information Theory, 21(4):388 391, 1975. Kolyan Ray and Botond Szab o. Variational bayes for high-dimensional linear regression with sparse priors. Journal of the American Statistical Association, pages 1 12, 2021. Kolyan Ray, Botond Szabo, and Gabriel Clara. Spike and slab variational bayes for high dimensional logistic regression. ar Xiv preprint ar Xiv:2010.11665, 2020. Qifan Song and Faming Liang. Nearly optimal bayesian shrinkage for high dimensional regression. ar Xiv preprint ar Xiv:1712.08964, 2017. Huayang Tang, Xin Jin, Yang Li, Hui Jiang, Xianfa Tang, Xu Yang, Hui Cheng, Ying Qiu, Gang Chen, Junpu Mei, et al. A large-scale screen for coding variants predisposing to psoriasis. Nature genetics, 46(1):45 50, 2014. Alexander Tsigler and Peter L Bartlett. Benign overfitting in ridge regression. ar Xiv preprint ar Xiv:2009.14286, 2020. Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000. Roman Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655 686, 2012. Martin J Wainwright and Michael Irwin Jordan. Graphical models, exponential families, and variational inference. Now Publishers Inc, 2008. Bo Wang and D Michael Titterington. Convergence properties of a general algorithm for calculating variational bayesian estimates for a normal mixture model. Bayesian Analysis, 1(3):625 650, 2006. Yixin Wang and David Blei. Variational bayes under model misspecification. In Advances in Neural Information Processing Systems, pages 13357 13367, 2019a. Yixin Wang and David M Blei. Frequentist consistency of variational bayes. Journal of the American Statistical Association, 114(527):1147 1161, 2019b. Mukherjee and Sen T Westling and TH Mc Cormick. Beyond prediction: A framework for inference with variational approximations in mixture models. Journal of Computational and Graphical Statistics, 28(4):778 789, 2019. Michael C Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1):82 93, 2011. Jun Yan. Nonlinear large deviations: Beyond the hypercube. Annals of Applied Probability, 30(2):812 846, 2020. Yun Yang, Debdeep Pati, and Anirban Bhattacharya. 
α-variational inference with statistical guarantees. Annals of Statistics, 48(2):886 905, 2020. Anderson Y Zhang and Harrison H Zhou. Theoretical and computational guarantees of mean field variational inference for community detection. Annals of Statistics, 48(5): 2575 2598, 2020.