Variational Inference in high-dimensional linear regression

Journal of Machine Learning Research 23 (2022) 1-56. Submitted 6/21; Revised 9/22; Published 10/22.

Sumit Mukherjee (sm3949@columbia.edu), Department of Statistics, Columbia University, New York, NY 10027, USA
Subhabrata Sen (subhabratasen@fas.harvard.edu), Department of Statistics, Harvard University, Cambridge, MA 02138, USA

Editor: Pierre Alquier

Abstract

We study high-dimensional Bayesian linear regression with product priors. Using the nascent theory of non-linear large deviations (Chatterjee and Dembo, 2016), we derive sufficient conditions for the leading-order correctness of the naive mean-field approximation to the log-normalizing constant of the posterior distribution. Subsequently, assuming a true linear model for the observed data, we derive a limiting infinite-dimensional variational formula for the log-normalizing constant of the posterior. Furthermore, we establish that under an additional "separation" condition, the variational problem has a unique optimizer, and this optimizer governs the probabilistic properties of the posterior distribution. We provide intuitive sufficient conditions for the validity of this separation condition. Finally, we illustrate our results on concrete examples with specific design matrices.

Keywords: Variational Inference, Linear Regression, Naive Mean-Field Approximation

©2022 Sumit Mukherjee and Subhabrata Sen. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0696.html.

1. Introduction

In this age of big data, statisticians routinely analyze large, high-dimensional datasets arising from applications in genomics, finance, public policy, etc., with the goal of discovering relationships between the response variable and the observed features. The linear regression model is arguably the most common framework for this task when the response is continuous. Under a Bayesian formalism, the statistician posits a prior distribution $\pi$ for the coefficient vector $\beta$, and constructs the corresponding posterior. Subsequent inference is based solely on this posterior distribution. Two questions are naturally relevant in this setting:

1. What are the statistical properties of Bayesian procedures, particularly in high dimensions?
2. Are these procedures computationally tractable for large datasets with high-dimensional features?

The theoretical performance of high-dimensional Bayesian methods has been examined extensively in recent times. In Bayesian asymptotic theory, given data $(y_i, x_i)_{i=1}^n$ with $x_i \in \mathbb{R}^p$, one assumes the correctness of a frequentist model $y_i = x_i^T \beta_0 + \varepsilon_i$, and studies frequentist inference of the coefficient vector $\beta_0$ under Bayesian procedures. Ghosal (1999) established a version of the traditional Bernstein-von Mises theorem as long as $p^4 \log p / n \to 0$. A lot of recent attention has been focused on the high-dimensional regime $p \gg n$ with an additional assumption on the sparsity of $\beta_0$. In this context, spike-and-slab based approaches (Mitchell and Beauchamp, 1988; Ishwaran and Rao, 2005) have been established to exhibit optimal frequentist properties (Castillo et al., 2015). In particular, these posterior distributions contract to the underlying truth $\beta_0$ at the minimax optimal estimation rate (we refer the interested reader to Banerjee et al.
(2021) for a formal definition of posterior contraction, and a survey of recent breakthroughs in this area). Despite their superior theoretical properties, spike-and-slab based approaches suffer from a computational bottleneck. In the special case of linear regression, continuous shrinkage priors (see e.g. Carvalho et al. (2010)) provide a comparatively more tractable alternative from the computational perspective (Bhattacharya et al., 2016). Contraction properties of the corresponding posteriors were characterized in Song and Liang (2017). Unfortunately, this strategy is specific to the linear model, and does not generalize beyond this setting.

In general, computationally tractable Bayesian inference for high-dimensional models presents significant challenges. MCMC based strategies have been explored extensively for this purpose. Despite rapid progress in MCMC methodology and supporting theory, these methods are still slower than competing frequentist methodology, e.g. those based on convex optimization. Variational methods (Wainwright and Jordan, 2008) provide an attractive general option to the Bayesian statistician. The simplest form of variational inference approximates the true posterior distribution using a product distribution; this version is often referred to as naive mean-field Variational Bayes (nVB). Computing the best approximating product distribution is computationally fast, and thus these methods provide a practical option for modern datasets. We refer the interested reader to Blei et al. (2017) for an introduction to variational Bayes methods in statistics and machine learning.

Although variational methods provide a computationally feasible strategy, supporting theoretical evidence has been relatively scarce. In Wang and Titterington (2006); Wang and Blei (2019b), the authors established the correctness of this approach in parametric models in the classical fixed-$p$ setting. Subsequently, Wang and Blei (2019a) studied variational Bayes in misspecified models. In the context of the linear model, early work by Neville et al. (2014); Ormerod et al. (2017) focused on variational inference in the low-dimensional linear model, while Carbonetto and Stephens (2012) provides an early variational approximation for variable selection. The past two years have witnessed rapid progress in the analysis of variational methods for high-dimensional models (Alquier et al., 2016; Alquier and Ridgway, 2020; Han and Yang, 2019; Ray and Szabó, 2021; Ray et al., 2020; Yang et al., 2020). These results focus on high-dimensional models with a sparse underlying truth $\beta_0$, and study the contraction properties of the variational posterior. In particular, they derive sufficient conditions for the variational posterior to contract at the minimax optimal rate. Correctness of variational methods has also been established for community detection (Bickel et al., 2013; Zhang and Zhou, 2020), a Poisson mixed model (Hall et al., 2011), frequentist models (Westling and McCormick, 2019) and mixture models (Chérief-Abdellatif and Alquier, 2018).

In sharp contrast, nVB methods can fail in truly high-dimensional settings, e.g. in versions of topic modeling (Ghorbani et al., 2019). In this specific situation, the correct variational approximation is provided by the TAP approximation from spin-glass theory (Mézard et al., 1987). Fan et al.
(2018) establishes rigorous guarantees regarding the validity of the TAP approach in the context of the simpler Z2 synchronization problem. Evidence regarding the inadequacy of the nVB approximation also arises in the work of Fasano et al. (2019). The authors analyze the nVB approximation for probit regression in the regime $n$ fixed and $p \to \infty$. In this regime, the variational posterior exhibits an over-shrinking phenomenon; to remedy this issue, the authors propose an alternative, locally factored approximation, which resolves this over-shrinking problem in this setting.

There is thus an immediate need to understand general properties of statistical problems which ensure the correctness of the nVB approximation. In this paper, we study the accuracy of this approximation in Bayesian linear regression with product priors. We provide easily verifiable conditions which ensure the asymptotic accuracy of the nVB approximation. Further, we illustrate that the approximation can yield detailed information regarding the statistical properties of this model, which might be unavailable using other techniques. We elaborate on our specific contributions below.

1.1 Contributions

Our main contributions in this paper are as follows:

(i) In Theorem 7, we provide sufficient conditions for the asymptotic tightness of the nVB approximation. The conditions are easily verifiable, and can be explicitly checked in specific applications. This provides rigorous theoretical support for widely used mean-field approximation based methodology.

(ii) Assuming a true frequentist linear model for the data, we derive a limiting variational formula for the log-normalizing constant in Theorem 17. We emphasize that in contrast to existing results, we do not assume any sparsity on the true regression coefficients. We refer the interested reader to Lee et al. (2020) for a motivating discussion on the importance of such scenarios in scientific applications.

(iii) Under an additional separation condition (11), we establish that the limiting variational problem has a unique optimizer (see Theorem 20 for a precise statement). Under this separation condition, empirical averages of samples drawn from the posterior distribution concentrate around explicit deterministic limits; in addition, these limits are characterized by the optimizer (Corollary 22). We also provide interpretable sufficient conditions which enforce this separation condition (Lemma 25). We emphasize that in existing analyses of high-dimensional Bayesian linear regression, properties of the posterior are directly established (Castillo et al., 2015), and the variational posterior is analyzed independently (Ray and Szabó, 2021). Our approach is inherently different: we first establish the correctness of the mean-field approximation, and then study the posterior through the lens of the mean-field approximation formula.

(iv) We further illustrate our general results by applying them to three specific examples: a two-factor ANOVA model, a Gaussian design setting with spiked covariance, and a sparse Bernoulli design. In each case, we identify the specific limiting functional which determines the limiting log-normalizing constant.

(v) From a theoretical perspective, our results crucially utilize recent advances in the theory of non-linear large deviations.
Initiated in the seminal paper Chatterjee and Dembo (2016), the theory of non-linear large deviations was originally conceived to answer some deep questions concerning large deviations of sub-graph counts in sparse random graphs. In Basak and Mukherjee (2017), one of the authors successfully utilized this framework to establish the tightness of the naive mean-field approximation for the log-partition function in a family of Potts models. To the best of our knowledge, these developments have not been utilized previously for other statistical models. In this paper, we demonstrate the usefulness of these tools in the context of high-dimensional statistics; we hope that this spurs an in-depth study of their applicability for other high-dimensional problems. We consider this to be a key conceptual contribution of this paper.

(vi) Our results are intimately related to local asymptotic scaling regimes from classical statistics (Van der Vaart, 2000). In fact, for high-dimensional linear regression with iid Gaussian design, our results can be viewed as natural extensions of classical local-asymptotic results under the iterated limit $p \to \infty$ following $n \to \infty$. We elaborate on this connection in Section 3.

1.2 Non-linear large deviations and related results

In a breakthrough paper, Chatterjee and Dembo (Chatterjee and Dembo, 2016) introduced the theory of non-linear large deviations with the goal of studying large deviations for non-linear functions of Bernoulli random variables. As an application of this general machinery, they characterized sharp deviation probabilities for sub-graph counts in sparse Erdős-Rényi random graphs. Subsequent extensions by Eldan (Eldan, 2018) and Augeri (Augeri, 2019) allow one to track a wider regime of sparsity for the binary variables. In a different direction, Yan (Yan, 2020) extended the Chatterjee-Dembo framework to general bounded Banach-space valued variables. See also Austin (2019) for related decompositions of general Gibbs measures using information theoretic ideas. These results have galvanized the study of large deviations for sub-graph counts on sparse random graphs, and the past three years have witnessed rapid progress in this direction. We refer the interested reader to Cook et al. (2021) and references therein for a survey of recent progress in this area.

At the heart of the Chatterjee-Dembo framework lies a tight approximation bound for the log-normalizing constant of general Gibbs type distributions in terms of the naive mean-field approximation formula. This framework was utilized by one of the authors (Basak and Mukherjee, 2017) to derive asymptotic limits for the log-normalizing constant of Potts models on several sequences of graphs. Similar results were independently derived by (Jain et al., 2018, 2019) using different techniques.

In this paper, we study Bayesian linear regression through the lens of non-linear large deviations, and uncover precise statistical properties of these models by analyzing the mean-field variational problem. We observe $\{(y_i, x_i) : 1 \le i \le n\}$, $y_i \in \mathbb{R}$, $x_i \in \mathbb{R}^p$. We set $y = (y_i) \in \mathbb{R}^n$ and $X^T = [x_1, \ldots, x_n]$. Throughout, we work in an asymptotic setting where $p$ and $n = n(p)$ are both going to $\infty$. A natural Bayesian model for such a dataset assumes

\[ \beta_1, \ldots, \beta_p \overset{iid}{\sim} \pi, \quad \beta^T = (\beta_1, \ldots, \beta_p), \quad y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n). \quad (1) \]

In the display above, $\pi$ is a probability distribution supported on $[-1, 1]$, and we consider an iid prior on the regression coefficients.
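For concreteness, here is a minimal simulation sketch of the model (1). It is illustrative only: the uniform prior on $[-1,1]$, the Gaussian design, and all dimensions are assumptions made for the example, not choices prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 500, 100, 1.0

# Coefficients drawn iid from a prior pi supported on [-1, 1];
# a uniform prior is used here purely for illustration.
beta = rng.uniform(-1.0, 1.0, size=p)

# Model (1): y = X beta + eps with eps ~ N(0, sigma^2 I_n).
X = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))
y = X @ beta + rng.normal(0.0, sigma, size=n)

# Objects entering the posterior: z = X^T y, and the split X^T X = A_p + D_p
# into off-diagonal and diagonal parts, as defined below.
z = X.T @ y
XtX = X.T @ X
D_p = np.diag(np.diag(XtX))
A_p = XtX - D_p
```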
For our subsequent discussion, the precise interval $[-1, 1]$ for the prior support is not crucial: our arguments go through unchanged as long as the prior has bounded support. Note that in the context of Bayesian inference for linear regression, the regression parameters are often drawn from a parametric family, and the parameters specifying the prior are, in turn, sampled from a hyper-prior. In our discussions, we will restrict ourselves to the simpler setting where the prior $\pi$ is fixed; however, we do not make any assumptions on $\pi$ apart from the bounded support assumption. Throughout, we assume that the noise variance $\sigma^2 > 0$ is fixed and known to the statistician.

Given the Bayesian model (1), one naturally constructs the posterior distribution

\[ d\mu_{y,X}(\beta) \propto \exp\Big( -\frac{1}{2\sigma^2} \| y - X\beta \|_2^2 \Big) \propto \exp\Big( -\frac{1}{2\sigma^2} \big[ \beta^T A_p \beta + \beta^T D_p \beta - 2 z^T \beta \big] \Big), \]

where $z = X^T y$, and $D_p$, $A_p$ are the diagonal and off-diagonal matrices obtained from $X^T X$. More precisely,

\[ A_p(i, i) := 0, \quad A_p(i, j) := (X^T X)_{ij} \text{ for } i \neq j, \qquad D_p(i, i) := (X^T X)_{ii}, \quad D_p(i, j) := 0 \text{ for } i \neq j. \]

Since the term $\beta^T D_p \beta$ is additive in the components of $\beta$, we can absorb it into the base measure $\pi^{\otimes p}$, via the following definition:

Definition 1 (Exponential Family) For any $\gamma := (\gamma_1, \gamma_2) \in \mathbb{R}^2$, define a probability measure $\pi_\gamma$ on $[-1, 1]$ as

\[ d\pi_\gamma(z) := \exp\Big( \gamma_1 z - \frac{\gamma_2}{2} z^2 - c(\gamma) \Big)\, d\pi(z), \quad c(\gamma) := \log \int_{[-1,1]} \exp\Big( \gamma_1 z - \frac{\gamma_2}{2} z^2 \Big)\, d\pi(z). \]

Using this definition, we can write the posterior distribution $\mu$ as

\[ \mu_{y,X}(d\beta) \propto \exp\Big( -\frac{1}{2\sigma^2} \big[ \beta^T A \beta - 2 z^T \beta \big] \Big) \prod_{i=1}^p \pi_i(d\beta_i), \quad \text{where } \pi_i := \pi_{(0, d_i)}, \; d_i := \frac{D_p(i,i)}{\sigma^2}. \]

A central object in the theoretical study of these posterior distributions is the normalizing constant, also referred to as the partition function in statistical physics parlance. Formally, we define

\[ Z_p(y, X) = \int_{[-1,1]^p} \exp\Big( -\frac{1}{2\sigma^2} \big[ \beta^T A \beta - 2 z^T \beta \big] \Big) \prod_{i=1}^p \pi_i(d\beta_i). \quad (2) \]

The partition function is intractable for most priors, unless special conjugacy properties are satisfied between the prior and the likelihood. Henceforth, we suppress the dependence of $\mu$, $Z_p$ on $y$, $X$ whenever it is clear from the context. The classical Gibbs variational principle characterizes the partition function as

\[ \log Z_p = \sup_{Q} \Big\{ \mathbb{E}_Q\Big[ -\frac{1}{2\sigma^2} \big( \beta^T A \beta - 2 z^T \beta \big) \Big] - D_{KL}\Big( Q \,\Big\|\, \prod_{i=1}^p \pi_i \Big) \Big\}, \]

where the supremum ranges over probability distributions $Q$ on $[-1,1]^p$ (see e.g. Wainwright and Jordan (2008)). In fact, the supremum in the above variational problem is attained by $Q = \mu$. The naive mean-field approximation to $Z_p$ restricts the supremum to product distributions, and thus obtains a universal lower bound. Formally, we have

\[ \log Z_p \geq \sup_{Q = \prod_{i=1}^p Q_i} \Big\{ \mathbb{E}_Q\Big[ -\frac{1}{2\sigma^2} \big( \beta^T A \beta - 2 z^T \beta \big) \Big] - D_{KL}\Big( Q \,\Big\|\, \prod_{i=1}^p \pi_i \Big) \Big\}. \]

The optimizer in the display above provides the best approximation to $\mu$ among product distributions under KL divergence. This paper focuses on the tightness of the naive mean-field lower bound in the context of linear regression. For an in-depth survey of variational inference, we refer the interested reader to Bishop (2006); Blei et al. (2017); Wainwright and Jordan (2008).

Outline: The rest of the paper is structured as follows. We collect our results in Section 2. We discuss some directions for future enquiry in Section 3. The main results are established in Section 4. We defer some proofs to the Appendix.

2. Main results

We collect our main results in this section. To this end, we first discuss some elementary facts regarding exponential families. The next result collects some analytic properties of the cumulant generating function $c(\cdot)$, which will be relevant for our subsequent discussion. For the sake of completeness, we provide a proof in the Appendix.
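Before stating the lemma, we illustrate Definition 1 numerically. The sketch below evaluates $c(\gamma)$, its first partial derivative $\dot c$ in $\gamma_1$, and the inverse map $h(t, \gamma_2)$ of Lemma 2(ii) by quadrature and bisection. The uniform base measure $\pi$ and the bracket width in the root-finder are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def c(g1, g2):
    """c(gamma) = log of the integral of exp(g1*z - (g2/2) z^2) dpi(z) over [-1, 1];
    here pi = Uniform[-1, 1] (density 1/2), an illustrative choice of base measure."""
    val, _ = quad(lambda z: 0.5 * np.exp(g1 * z - 0.5 * g2 * z ** 2), -1.0, 1.0)
    return np.log(val)

def c_dot(g1, g2):
    """Partial derivative of c in gamma_1, i.e. the mean of pi_gamma (cf. Lemma 30(i))."""
    logZ = c(g1, g2)
    val, _ = quad(lambda z: z * 0.5 * np.exp(g1 * z - 0.5 * g2 * z ** 2 - logZ), -1.0, 1.0)
    return val

def h(t, g2, bracket=50.0):
    """Inverse of c_dot in its first argument (Lemma 2(ii)), found by bisection;
    valid for t well inside (-1, 1) given the finite bracket."""
    return brentq(lambda g1: c_dot(g1, g2) - t, -bracket, bracket)
```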
Lemma 2 Let $c(\cdot) : \mathbb{R}^2 \to \mathbb{R}$ be as in Definition 1. Assume that both $-1$ and $1$ belong to the support of $\pi$. Then the following conclusions hold.

(i) $\dot{c}(\gamma_1, \gamma_2) := \frac{\partial c(\gamma_1, \gamma_2)}{\partial \gamma_1}$ is strictly increasing in $\gamma_1$, with $\lim_{\gamma_1 \to \infty} \dot{c}(\gamma_1, \gamma_2) = 1$ for every $\gamma_2 \in \mathbb{R}$.

(ii) For any $t \in (-1, 1)$, there exists a unique $h(t, \gamma_2)$ such that $\dot{c}(h(t, \gamma_2), \gamma_2) = t$. Further, $\lim_{t \to 1} h(t, \gamma_2) = \infty$ for every $\gamma_2 \in \mathbb{R}$.

Armed with these basic facts, we can formally state the naive mean-field approximation to the log-normalizing constant (2).

Definition 3 Define a possibly extended real valued function $G$ on $[-1, 1] \times \mathbb{R}$ by setting

\[ G(u, d) := u\, h(u, d) - c(h(u, d), d) + c(0, d) \quad \text{if } u \in (-1, 1),\ d \in \mathbb{R}, \]
\[ G(1, d) := D_{KL}(\pi_+ \| \pi_{(0,d)}), \qquad G(-1, d) := D_{KL}(\pi_- \| \pi_{(0,d)}), \quad d \in \mathbb{R}, \]

where $\pi_+$ and $\pi_-$ are the degenerate distributions which put mass 1 at $1$ and $-1$ respectively.

We will need the following facts about the derivatives of $G$. We defer the proof of Lemma 4 to the Appendix.

Lemma 4 We have, for $u \in (-1, 1)$ and $d \in \mathbb{R}$,

\[ \frac{\partial G}{\partial u}(u, d) = h(u, d), \qquad \frac{\partial G}{\partial d}(u, d) = \frac{1}{2} \int_{[-1,1]} z^2\, d\pi_{(h(u,d), d)}(z) - \frac{1}{2} \int_{[-1,1]} z^2\, d\pi_{(0,d)}(z), \]
\[ \frac{\partial^2 G}{\partial u^2}(u, d) = \frac{1}{\ddot{c}(h(u, d), d)} > 0. \]

Consequently, we have $\sup_{u \in (-1,1), d \in \mathbb{R}} \big| \frac{\partial G}{\partial d}(u, d) \big| \leq 1$.

Definition 5 Define $M_p : [-1, 1]^p \to \mathbb{R}$ as

\[ M_p(u) := -\frac{1}{2\sigma^2} \big[ u^T A u - 2 z^T u \big] - \sum_{i=1}^p G(u_i, d_i). \quad (3) \]

Lemma 6 With $M_p(\cdot)$ as in (3), we have

\[ \sup_{Q = \prod_{i=1}^p Q_i} \Big\{ \mathbb{E}_Q\Big[ -\frac{1}{2\sigma^2} \big( \beta^T A \beta - 2 z^T \beta \big) \Big] - D_{KL}\Big( Q \,\Big\|\, \prod_{i=1}^p \pi_i \Big) \Big\} = \sup_{u \in [-1,1]^p} M_p(u). \quad (4) \]

This result follows immediately from elementary facts about exponential families. We refer the interested reader to (Wainwright and Jordan, 2008, Section 5.3).

2.1 Validity of the naive mean-field approximation

Throughout the paper, we use the usual Landau notation $O(\cdot)$ and $o(\cdot)$ for deterministic sequences dependent on $p$.

Theorem 7 Assume that the matrix $A_p$ satisfies the two conditions

\[ \mathrm{tr}(A_p^2) = o(p), \quad (5) \]
\[ \sup_{u \in [-1,1]^p} \sum_{i=1}^p \Big| \sum_{j=1}^p A_p(i, j) u_j \Big| = O(p). \quad (6) \]

(i) We have, setting $R_p := \sup_{u \in [-1,1]^p} M_p(u)$, as $p \to \infty$,

\[ |\log Z_p - R_p| = o(p). \quad (7) \]

(ii) If $b_i := \mathbb{E}_\mu(\beta_i \mid \beta_k, k \neq i)$, then the vector $b := (b_1, \ldots, b_p)$ satisfies $M_p(b) - R_p = o(p)$ in $\mu$-probability.

(iii) Suppose there exists $\hat{u} \in [-1, 1]^p$ which satisfies, for every $\eta > 0$,

\[ \limsup_{p \to \infty} \frac{1}{p} \Big[ \sup_{u \in [-1,1]^p : \| u - \hat{u} \|_2^2 \geq p\eta} M_p(u) - R_p \Big] < 0. \quad (8) \]

Assume further that the empirical measure $\frac{1}{p} \sum_i \delta_{D_p(i,i)}$ is uniformly integrable. Then, for any $\varepsilon > 0$ and for any continuous function $\zeta : [-1, 1] \times [0, 1] \to \mathbb{R}$ we have

\[ \mu\Big( \Big| \frac{1}{p} \sum_{i=1}^p \zeta\Big(\beta_i, \frac{i}{p}\Big) - \frac{1}{p} \sum_{i=1}^p \int_{[-1,1]} \zeta\Big(z, \frac{i}{p}\Big)\, \pi_{(h(\hat{u}_i, d_i), d_i)}(dz) \Big| > \varepsilon \Big) \to 0. \]

Remark 8 The careful reader would have already noticed that Theorem 7 is stated for deterministic data $(y, X)$. In practical applications, it is often more natural to assume that the data $(y, X)$ is, in turn, sampled from some underlying distribution $\mathbb{P}$. The conclusions of Theorem 7 continue to hold as long as the sufficient conditions hold asymptotically with high probability under $\mathbb{P}$.

The condition (6) is extremely mild, and specifies that the log-normalizing constant is asymptotically of order $p$. This condition is satisfied, for example, whenever the matrix $A_p = X^T X - D_p$ has spectral norm $O(1)$. The assumption (5) is the non-trivial assumption in the statement above, and effectively guarantees the accuracy of the naive mean-field approximation to leading exponential order. Note that if $\lambda_1 \geq \cdots \geq \lambda_p$ denote the eigenvalues of $A_p$, then $\mathrm{tr}(A_p^2) = \sum_{i=1}^p \lambda_i^2$. Thus the requirement $\mathrm{tr}(A_p^2) = o(p)$ can be qualitatively interpreted to mean that the eigenstructure of $A$ is dominated by a few top eigenvalues. Similar results were derived in the context of the naive mean-field approximation for Potts models by one of the authors in Basak and Mukherjee (2017). Covariance matrices with an approximately low rank are ubiquitous in modern datasets, and we believe these conditions are satisfied in diverse applications of practical interest. Our result provides formal evidence for the correctness of widely used mean-field approximations in these settings.
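Both conditions of Theorem 7 can be probed numerically for a given design. The sketch below computes $\mathrm{tr}(A_p^2)/p$ for (5), together with the spectral norm of $A_p$ (by the remark above, $\|A_p\|_2 = O(1)$ implies (6)) and a crude upper bound for the left side of (6). The spiked design used in the example is an illustrative assumption.

```python
import numpy as np

def mean_field_diagnostics(X):
    """Finite-p diagnostics for conditions (5) and (6) of Theorem 7."""
    XtX = X.T @ X
    A = XtX - np.diag(np.diag(XtX))       # off-diagonal part A_p
    p = A.shape[0]
    return {
        "tr(A^2)/p": np.trace(A @ A) / p,      # condition (5): should be small
        "||A||_2": np.linalg.norm(A, 2),       # O(1) spectral norm implies (6)
        "sum|A_ij|/p": np.abs(A).sum() / p,    # crude upper bound for (6), divided by p
    }

# Example: spiked Gaussian design in the spirit of Corollary 10 (sizes illustrative).
rng = np.random.default_rng(1)
n, p = 2000, 200
v = rng.normal(size=p); v /= np.linalg.norm(v)
Gamma_half = np.eye(p) + 0.5 * np.outer(v, v)  # square root of a spiked covariance
X = rng.normal(size=(n, p)) @ Gamma_half / np.sqrt(n)
print(mean_field_diagnostics(X))
```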
We prove Theorem 7 in Section 4.1. The third part of our theorem provides further insights into the posterior distribution, assuming that the naive mean-field approximation is dominated by a unique factorized distribution. In the language of statistical physics, such distributions are said to be in a "pure phase".

We now demonstrate a few applications of Theorem 7 to concrete examples, which cover both deterministic and random design matrices. We defer the proofs of these corollaries to Appendix C.

Corollary 9 Let $X$ be any sequence of deterministic design matrices with $X^T X = A_p + D_p$, where $A_p$ and $D_p$ represent the off-diagonal and diagonal parts of $X^T X$ as before. Suppose $A_p$ satisfies (5) and (6), and the empirical measure $\frac{1}{p} \sum_{i=1}^p \delta_{D_p(i,i)}$ is uniformly integrable. Then the conclusion of Theorem 7 holds.

Corollary 10 Suppose that the $i$th row of the design matrix $X$ equals $n^{-1/2} x_i$, where $\{x_i\}_{1 \leq i \leq n} \overset{iid}{\sim} N(0, \Gamma_p)$. Assume that $p = o(n)$, and the following conditions hold:

(a) The off-diagonal part $\Gamma_{p,\mathrm{off}}$ of the covariance matrix $\Gamma_p$ satisfies $\mathrm{tr}(\Gamma_{p,\mathrm{off}}^2) = o(p)$.
(b) $\| \Gamma_p \|_2 = O(1)$.

Then the conclusion of Theorem 7 holds.

Sparse design matrices arise routinely in coding theory (Gallager, 1962; MacKay, 1999) and genomics (Wu et al., 2011; Tang et al., 2014). Our next corollary discusses the accuracy of the nVB approximation in the context of a linear regression problem with a sparse Bernoulli design. We assume that the entries of the design are independent, but not necessarily identical.

Corollary 11 Suppose that the $(i, j)$th entry of the design matrix $X$ equals $\sqrt{\frac{p}{n}}\, B(i, j)$, where $\{B(i, j)\}_{1 \leq i \leq n, 1 \leq j \leq p}$ are mutually independent Bernoullis with $P(B(i, j) = 1) \leq \frac{\lambda}{p}$, for some $\lambda > 0$ free of $p$. If $p = o(n)$, the conclusion of Theorem 7 applies.
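The sparse Bernoulli design of Corollary 11 can be probed with the same diagnostic; a minimal sketch (all constants illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 5000, 200, 3.0

# Corollary 11: X_ij = sqrt(p/n) B_ij with independent Bernoullis, P(B_ij = 1) <= lam/p.
B = (rng.random((n, p)) < lam / p).astype(float)
X = np.sqrt(p / n) * B

XtX = X.T @ X
A = XtX - np.diag(np.diag(XtX))
print("tr(A^2)/p =", np.trace(A @ A) / p)   # small when p = o(n), in line with (5)
```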
2.2 Scaling limit for the log-normalizing constant

Under the asymptotic validity of the naive mean-field approximation, we derive an asymptotic scaling limit for the log-normalizing constant. The limiting description is in terms of an infinite-dimensional variational problem. We will subsequently illustrate that under an additional separation condition, the NMF variational problem at finite $n, p$ has a unique near optimizer, which can be approximated in terms of the optimizer of the associated infinite-dimensional limiting problem. As a concrete takeaway, the optimizer of the limiting problem characterizes the typical behavior of empirical averages under the posterior. To this end, we will require some notation.

The theory of dense graph limits was developed in Borgs et al. (2008, 2012); Lovász and Szegedy (2007), and has received tremendous attention over the last decade in probability, combinatorics, computer science and statistics. We refer the interested reader to Lovász (2012) for an in-depth survey of this area. Follow-up work of Borgs et al. (Borgs et al., 2019, 2018) has extended this theory significantly beyond the regime of dense graphs; the resulting $L^p$-convergence theory can handle sparse graphs and weighted matrices. Utilizing this set-up, our next result will show that under the assumption that the off-diagonal matrix $A_p$ obtained from $X^T X$ converges in cut norm to a suitable graphon (which need not be bounded), the corresponding log-normalizing constant converges in probability to the value of a deterministic optimization problem. We note that the use of cut norms on matrices precedes the development of graph limit theory (see e.g. Frieze and Kannan (1999) and references therein).

Definition 12 A function $W : [0,1]^2 \to \mathbb{R}$ is called symmetric if $W(x, y) = W(y, x)$ for all $x, y \in [0, 1]$. Any symmetric function $W : [0,1]^2 \to \mathbb{R}$ which is $L^1$ integrable, i.e. $\|W\|_1 := \int_{[0,1]^2} |W(x,y)|\, dx\, dy < \infty$, is called a graphon. Let $\mathcal{W}$ denote the space of all graphons. The cut norm of a graphon $W$ is given by

\[ \|W\|_\square = \sup_{S, T \subseteq [0,1]} \Big| \int_{S \times T} W(x, y)\, dx\, dy \Big|. \]

The cut norm is equivalent to the $L^\infty \to L^1$ operator norm defined by

\[ \|W\|_{\infty \to 1} := \sup_{f, g : \|f\|_\infty, \|g\|_\infty \leq 1} \Big| \int_{[0,1]^2} W(x, y) f(x) g(y)\, dx\, dy \Big|. \quad (9) \]

More precisely, we have $\|W\|_\square \leq \|W\|_{\infty \to 1} \leq 4 \|W\|_\square$. It also follows from (9) that the cut norm is weaker than the $L^1$ norm, i.e. convergence in $L^1$ implies convergence in cut norm.

Definition 13 Given a symmetric $p \times p$ matrix $B$ with real entries, define a piecewise constant function on $[0,1]^2$ by dividing $[0,1]^2$ into $p^2$ smaller squares each of side length $1/p$, and setting

\[ W_B(x, y) := B(i, j) \text{ if } \lceil px \rceil = i,\ \lceil py \rceil = j \text{ with } i \neq j, \qquad W_B(x, y) := 0 \text{ otherwise.} \]

We will also need the following notion for embedding vectors into functions.

Definition 14 (Vector to function) Given $t := (t_1, \ldots, t_p) \in \mathbb{R}^p$, define the piecewise constant function $w_{t,p}$ on $[0, 1]$ by dividing $[0, 1]$ into the $p$ intervals $\cup_{i=1}^p \frac{1}{p}(i-1, i]$ of equal length $1/p$, and setting $w_t(x) := t_i$ if $x$ is in the $i$th interval, i.e. $\lceil px \rceil = i$.

To derive the scaling limit, we will assume an underlying model $y = X\beta_0 + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2 I)$. We say that a random variable $f := f(y, X)$ satisfies $f \xrightarrow{P|X} 0$ if for any $\varepsilon > 0$, $P[|f| > \varepsilon \mid X] \to 0$ as $p \to \infty$. Thus this convergence is conditional on the sequence of design matrices. In our subsequent analysis, the following representation lemma will be crucial.

Lemma 15 Suppose we are in the setting of Theorem 7. Assume further that $\sum_{i=1}^p D_p(i, i) = O(p)$. Then there exist $(\xi_1, \ldots, \xi_p) \overset{iid}{\sim} N(0, 1)$ such that

\[ \frac{1}{p} \sup_{u \in [-1,1]^p} \big| M_p(u) - \widetilde{M}_p(u) \big| \xrightarrow{P|X} 0, \]
\[ \widetilde{M}_p(u) := \frac{1}{2\sigma^2} \Big[ -u^T A_p u + 2\sigma \sum_{i=1}^p \sqrt{D_p(i,i)}\, \xi_i u_i + 2 \beta_0^T X^T X u \Big] - \sum_{i=1}^p G(u_i, d_i). \]

We will derive a limiting formula for the log-normalizing constant (2) in terms of a variational problem on a space of probability distributions. This requires the following definition.

Definition 16 Let $\mathcal{F}$ denote the space of all bounded measurable functions from $[0, 1] \times \mathbb{R}$ to $[-1, 1]$. Fixing $W \in \mathcal{W}$ and $g, \psi \in L^1[0,1]$, define a functional $\mathbb{G}_{W,g,\psi}(\cdot) : \mathcal{F} \to \mathbb{R}$ by setting

\[ \mathbb{G}_{W,g,\psi}(F) := -\frac{1}{2\sigma^2} \mathbb{E}[W(X, X') F(X, Z) F(X', Z')] + \frac{1}{\sigma^2} \mathbb{E}[g(X) F(X, Z)] + \frac{1}{\sigma} \mathbb{E}\big[ \sqrt{\psi(X)}\, F(X, Z) Z \big] - \mathbb{E}\Big[ G\Big( F(X, Z), \frac{\psi(X)}{\sigma^2} \Big) \Big], \]

where $(X, X') \overset{iid}{\sim} U[0,1]$ and $(Z, Z') \overset{iid}{\sim} N(0, 1)$ are mutually independent.

Theorem 17 Suppose $y \sim N(X\beta_0, \sigma^2 I)$. Writing $X^T X = A_p + D_p$, set $D$ to denote the vector in $\mathbb{R}^p$ containing the diagonal entries of $D_p$. Assume the following:

(i) $d_\square(W_{p A_p}, W) \to 0$ for some $W \in \mathcal{W}$.
(ii) $w_{\beta_0} \xrightarrow{L^1} \phi(\cdot)$, and $w_D \xrightarrow{L^1} \psi(\cdot)$.
(iii) $\mathrm{tr}(A_p^2) = o(p)$.

Define a function $g \in L^1[0,1]$ by

\[ g(x) := \int_{[0,1]} W(x, y) \phi(y)\, dy + \psi(x) \phi(x). \]

Then we have

\[ \frac{1}{p} \sup_{u \in [-1,1]^p} M_p(u) \xrightarrow{P|X} \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F). \quad (10) \]

Remark 18 For the convenience of the reader, we point out the analogues between the discrete objects (vectors, matrices) and their continuum versions (functions) used in the above theorem, in the list below:

\[ p A_p \leftrightarrow W, \qquad \beta_0 \leftrightarrow \phi, \qquad D \leftrightarrow \psi. \]

Lemma 15 and Theorem 17 are established in Section 4.2.
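To make the limiting object in (10) concrete, the following sketch estimates $\mathbb{G}_{W,g,\psi}(F)$ of Definition 16 by Monte Carlo for a candidate $F$, reusing the quadrature helpers c, c_dot and h from the sketch after Definition 1. The inputs ($W = 0$, $\psi \equiv 1$, $g = \phi$ with $\phi(x) = x - 1/2$, and $F$ taken from Remark 27 below) are illustrative assumptions.

```python
import numpy as np

def G_fn(u, d):
    """G(u, d) = u h(u,d) - c(h(u,d), d) + c(0, d) for u in (-1, 1) (Definition 3)."""
    hu = h(u, d)
    return u * hu - c(hu, d) + c(0.0, d)

def gibbs_functional(F, W, g, psi, sigma2=1.0, m=2000, seed=0):
    """Monte Carlo estimate of G_{W,g,psi}(F), with (X, X') iid U[0,1] and
    (Z, Z') iid N(0,1), following Definition 16."""
    rng = np.random.default_rng(seed)
    X, Xp = rng.random(m), rng.random(m)
    Z, Zp = rng.normal(size=m), rng.normal(size=m)
    FX = np.array([F(x, z) for x, z in zip(X, Z)])
    FXp = np.array([F(x, z) for x, z in zip(Xp, Zp)])
    WX = np.array([W(x, xp) for x, xp in zip(X, Xp)])
    quad_term = -np.mean(WX * FX * FXp) / (2.0 * sigma2)
    lin_term = np.mean(np.array([g(x) for x in X]) * FX) / sigma2
    noise_term = np.mean(np.sqrt(np.array([psi(x) for x in X])) * FX * Z) / np.sqrt(sigma2)
    ent_term = np.mean([G_fn(F(x, z), psi(x) / sigma2) for x, z in zip(X, Z)])
    return quad_term + lin_term + noise_term - ent_term

phi = lambda x: x - 0.5
F_cand = lambda x, z: c_dot(phi(x) + z, 1.0)   # candidate optimizer from Remark 27
print(gibbs_functional(F_cand, W=lambda x, y: 0.0, g=phi, psi=lambda x: 1.0))
```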
2.3 Uniqueness of the optimizer

Theorem 7 identifies general conditions for the asymptotic tightness of the naive mean-field lower bound to the log-normalizing constant. The Gibbs variational principle (Wainwright and Jordan, 2008) establishes that under these settings, the distribution $\mu$ can be approximated, to the leading order, by a product distribution. However, this does not specify whether the best approximation is unique. Indeed, the ferromagnetic Ising model on the complete graph, henceforth referred to as the Curie-Weiss model, provides a classical example where the mean-field lower bound is tight, but without an external magnetization, the model has two distinct optimizers at low temperature. In this section, we identify some conditions which guarantee the uniqueness of the optimizers. From a statistical perspective, these conditions allow us to conclude that the posterior distribution roughly behaves like a product distribution. To this end, our next lemma identifies a set of sufficient conditions.

Definition 19 Let $(\xi_1, \ldots, \xi_p) \overset{iid}{\sim} N(0, 1)$ be as constructed in Lemma 15. For any $u \in [-1, 1]^p$, define a random empirical measure $L^{(u)}_p$ on $\mathbb{R}^3$ by setting

\[ L^{(u)}_p := \frac{1}{p} \sum_{i=1}^p \delta_{(i/p,\, \xi_i,\, u_i)}. \]

Theorem 20 Suppose all assumptions of Theorem 17 hold. Assume further that there exists $u^*_p \in [-1, 1]^p$ such that for all $\varepsilon > 0$ there exists $\delta > 0$ such that

\[ P\Big[ \sup_{u : \| u - u^*_p \|_2^2 > p\varepsilon} \frac{1}{p} \{ M_p(u) - M_p(u^*_p) \} < -\delta \,\Big|\, X \Big] = 1 - o(1). \quad (11) \]

Then the following conclusions hold.

(i) The limiting variational problem (10) has a unique optimizer $F^* \in \mathcal{F}$.

(ii) $L^{(u^*_p)}_p \xrightarrow{P} \mu^*$, where $\mu^*$ is the law of $(X, Z, F^*(X, Z))$ with $X \sim U[0,1]$ and $Z \sim N(0,1)$ mutually independent.

(iii) $F^*$ satisfies the fixed point equation

\[ F^*(x, z) \overset{a.s.}{=} \dot{c}\Big( \frac{1}{\sigma^2} \Big( -\mathbb{E}[W(x, X) F^*(X, Z)] + g(x) + \sigma \sqrt{\psi(x)}\, z \Big),\ \frac{\psi(x)}{\sigma^2} \Big), \]

where $X \sim U[0,1]$ and $Z \sim N(0,1)$ are mutually independent.

Remark 21 Similar to Remark 18, we point out the new discrete and continuum analogues appearing in Theorem 20: $u^*_p \leftrightarrow F^*$, and $L^{(u^*_p)}_p \leftrightarrow \mu^*$.

Combining Theorem 7 and Theorem 20, we get the following corollary, which deduces a law of large numbers under the posterior distribution.

Corollary 22 Suppose (6), (11), and the three conditions (i), (ii), (iii) of Theorem 17 hold. Then the optimization in Theorem 17 has a unique solution $F^* \in \mathcal{F}$ that satisfies the fixed point equation displayed in Theorem 20 (iii). Further, for any continuous function $\zeta : [-1, 1]^2 \to \mathbb{R}$ we have

\[ \mu\Big( \Big| \frac{1}{p} \sum_{i=1}^p \zeta(\beta_i, \beta_{0,i}) - \mathbb{E}\Big[ \int_{[-1,1]} \zeta(w, \phi(X))\, d\pi_{(h(F^*(X,Z), \psi(X)/\sigma^2),\, \psi(X)/\sigma^2)}(w) \Big] \Big| > \varepsilon \Big) \xrightarrow{P|X} 0, \]

where $X \sim U([0,1])$ and $Z \sim N(0,1)$ are independent.

Corollary 22 illustrates how one can extract properties of the high-dimensional posterior using the NMF approximation, and is one of the main takeaways of this paper. The result has two parts, and should be interpreted as follows: first, assume a non-trivial scaling of the problem (6), and that the finite-dimensional NMF variational problem has a unique approximate optimizer (11). In this setting, provided the problem parameters $W, \beta_0, D$ have appropriate continuum limits and the problem is approximately NMF, the infinite-dimensional variational problem in Theorem 17 has a unique optimizer $F^*$. In turn, this optimizer $F^*$ satisfies a functional fixed point equation. Note that in principle, this fixed point equation can still have multiple solutions; we simply claim that the global optimizer is one of the solutions to this fixed point relation.
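When $W$ has rank one, say $W(x, y) = G(x) G(y)$ as in the spiked example of Section 2.4 below, the expectation $\mathbb{E}[W(x, X) F^*(X, Z)]$ collapses to $G(x)\, m$ with the scalar $m = \mathbb{E}[G(X) F^*(X, Z)]$, and the fixed point equation can be attacked by damped scalar iteration. The sketch below is a heuristic for locating one solution of the fixed point relation (not necessarily the global optimizer); it reuses c_dot from the earlier sketch, and all tuning constants are illustrative.

```python
import numpy as np

def solve_fixed_point_rank_one(Gfun, g, psi, sigma2=1.0, n_mc=500, iters=40, damp=0.5, seed=3):
    """Damped iteration for m = E[Gfun(X) F(X,Z)] when W(x,y) = Gfun(x)Gfun(y), so that
    F(x,z) = c_dot((-Gfun(x)*m + g(x) + sqrt(sigma2*psi(x))*z)/sigma2, psi(x)/sigma2)."""
    rng = np.random.default_rng(seed)
    Xs, Zs = rng.random(n_mc), rng.normal(size=n_mc)
    m = 0.0
    for _ in range(iters):
        args = (-Gfun(Xs) * m + g(Xs) + np.sqrt(sigma2 * psi(Xs)) * Zs) / sigma2
        Fv = np.array([c_dot(a, d) for a, d in zip(args, psi(Xs) / sigma2)])
        m = (1.0 - damp) * m + damp * np.mean(Gfun(Xs) * Fv)
    return m
```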
This optimizer governs the law of large numbers (LLN) behavior under the posterior distribution; this is the main takeaway from the second part of the corollary above. Formally, if $\beta$ is a sample from the posterior $\mu_{y,X}$ and $\zeta$ is any continuous test function, the empirical average $\frac{1}{p} \sum_{i=1}^p \zeta(\beta_i, \beta_{0,i})$ concentrates around an explicit, deterministic limit characterized in terms of the optimizer $F^*$. Note that there is an additional technical subtlety: due to the randomness in the distribution of $y$ given $X$, the posterior distribution $\mu_{y,X}(\cdot)$ is actually random! Thus the aforementioned behavior of the empirical averages (under the posterior) is only valid with high probability under the distribution $P|X$. To further elucidate the usefulness of this corollary, we describe two specific applications in the next remark.

Remark 23 We highlight two specific applications of Corollary 22. First, setting $\zeta(x, y) = (x - y)^2$, we have

\[ \frac{1}{p} \sum_{i=1}^p (\beta_i - \beta_{0,i})^2 \xrightarrow{P|X} \mathbb{E}\Big[ \int_{[-1,1]} (w - \phi(X))^2\, d\pi_{(h(F^*(X,Z), \psi(X)/\sigma^2),\, \psi(X)/\sigma^2)}(w) \Big]. \]

Thus although we do not have posterior contraction in this regime, the normalized $L^2$ distance between the true vector $\beta_0$ and a sample $\beta$ from the posterior $\mu_{y,X}$ converges to the explicit value characterized above. In other words, the posterior is concentrated on a thin shell of a fixed radius around the true vector $\beta_0$. As a second application, we consider a prior $\pi$ such that zero is an isolated point in its support. In this case, $\zeta(x, y) = \mathbf{1}(x = 0)$ is almost surely continuous under $\pi$, and thus

\[ \frac{1}{p} \sum_{i=1}^p \mathbf{1}(\beta_i = 0) \xrightarrow{P|X} \mathbb{E}\Big[ \pi_{(h(F^*(X,Z), \psi(X)/\sigma^2),\, \psi(X)/\sigma^2)}(\{0\}) \Big]. \]

In this case, the proportion of zero coordinates in a sample from the posterior distribution can be tracked using our framework. These results are particularly useful for spike and slab type priors $\pi$.

Remark 24 To use Corollary 22 in concrete examples, one needs to compute the functions $(W, g, \psi)$, and verify the conditions (5), (6) and (11). Of these, the separation condition (11) is somewhat implicit, while the other two conditions are relatively easy to verify directly. The following lemma provides two sufficient conditions to this end. In particular, it shows that (11) holds in the so-called high temperature regime, or if $\pi$ has a density (with respect to Lebesgue measure) which is log-concave.

Lemma 25 (i) Suppose there exist $\lambda > 0$ and $u^*_p \in [-1, 1]^p$ such that for all $u \in [-1, 1]^p$,

\[ P\Big[ M_p(u^*_p) - M_p(u) \geq \lambda \| u - u^*_p \|_2^2 \,\Big|\, X \Big] = 1 - o(1), \quad (13) \]

for some $\lambda > 0$ free of $u$ and $p$. Then the condition (11) in Theorem 20 holds.

(ii) In particular, (13) holds under either of the following conditions:

(a) $\limsup_{p \to \infty} \sup_{1 \leq i \leq p} \sum_{j \neq i} |A_p(i, j)| < \sigma^2$.

(b) The prior $\pi$ has a density with respect to Lebesgue measure on $[-1, 1]$ proportional to $\exp(-V(x))$, where $V$ is even, $V : [0, 1] \to \mathbb{R}$ is increasing, and $V$ is convex on $[0, 1)$. Further, we assume that $\liminf_{p \to \infty} \lambda_{\min}(X^T X) > 0$.

We collect the proofs of Theorem 20 and Lemma 25 in Section 4.3.
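Condition (ii)(a) of Lemma 25 is directly checkable from the design matrix; a one-function diagnostic (illustrative):

```python
import numpy as np

def high_temperature_check(X, sigma2=1.0):
    """Checks condition (ii)(a) of Lemma 25: sup_i sum_{j != i} |A_p(i,j)| < sigma^2,
    a sufficient condition for the separation condition (11)."""
    XtX = X.T @ X
    A = XtX - np.diag(np.diag(XtX))
    return np.abs(A).sum(axis=1).max() < sigma2
```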
2.4 Applications

To illustrate the utility of Theorem 17 and Corollary 22, we apply our results to specific examples in this section. The proofs are deferred to Appendix C.

Spiked Covariance Matrix: We consider linear regression with mean-zero Gaussian features. We assume a spiked covariance structure (Johnstone and Lu, 2009) on the features.

Corollary 26 Suppose that the $i$th row of the design matrix $X$ equals $n^{-1/2} x_i$, where $\{x_i\}_{1 \leq i \leq n} \overset{iid}{\sim} N(0, \Gamma_p)$ with $\Gamma_p = I + v v^T$, $v_i := \frac{1}{\sqrt{p}} G(i/p)$, and $G : [0, 1] \to \mathbb{R}$ is continuous almost surely. Further assume that $p = o(n)$, and the true regression coefficient $\beta_0$ satisfies $w_{\beta_0} \xrightarrow{L^1} \phi$, for some $\phi \in L^1[0,1]$.

(a) Then

\[ \frac{1}{p} \log Z_p(y, X) \xrightarrow{P|X} \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F), \quad W(x, y) = G(x) G(y), \quad \psi(x) = 1, \quad g(x) = G(x) \int_{[0,1]} G(y) \phi(y)\, dy + \phi(x). \]

(b) Consider the following two cases: either (i) $\lambda^* := \max_{1 \leq i \leq p} \sum_{j \neq i} |\Gamma_p(i, j)| < \sigma^2$, or (ii) the conditions of Lemma 25, Part (ii)(b), hold. Then the conclusions of Corollary 22 hold.

Remark 27 We now consider a very special case to illustrate our results. Suppose $X$ has iid entries distributed as $N(0, \frac{1}{n})$, and $\sigma = 1$. In this case the assumptions of Corollary 26 are satisfied with $W \equiv 0$, $G \equiv 0$, $g = \phi$, $\psi \equiv 1$, and so the optimizer $F^*(x, z)$ in Corollary 22 simplifies to $F^*(x, z) = \dot{c}(\phi(x) + z, 1)$.

The next example re-visits the sparse Bernoulli design setting introduced in Corollary 11.

Corollary 28 Suppose that the $(i, j)$th entry of the design matrix $X$ equals $\sqrt{\frac{p}{n}}\, B(i, j)$, where $\{B(i, j)\}_{1 \leq i \leq n, 1 \leq j \leq p}$ are mutually independent Bernoulli random variables with $P(B(i, j) = 1) = \frac{1}{p} G(i/n, j/p)$, where $G$ is a function on $[0,1]^2$ which is continuous almost surely. Further assume that $p = o(n)$, and the true regression coefficient $\beta_0$ satisfies $w_{\beta_0} \xrightarrow{L^1} \phi$, for some function $\phi \in L^1[0,1]$.

(a) Then

\[ \frac{1}{p} \log Z_p(y, X) \xrightarrow{P} \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F), \]
\[ W(x, y) = \int_{[0,1]} G(t, x) G(t, y)\, dt, \quad \psi(x) = \int_{[0,1]} G(t, x)\, dt, \quad g(x) = \int_{[0,1]} W(x, y) \phi(y)\, dy + \phi(x) \psi(x). \]

(b) Consider the following two cases: (i) setting

\[ S(x) := \int_{[0,1]} W(x, y)\, dy = \int_{[0,1]^2} G(t, x) G(t, y)\, dt\, dy, \]

suppose $\sigma^2 > \mathrm{ess}\sup S(X)$, where $X \sim U[0,1]$; or (ii) the conditions of Lemma 25, Part (ii)(b), hold. Then the conclusions of Corollary 22 hold.

As our last example, we consider a sequence of design matrices arising in the study of two-way ANOVA designs.

Corollary 29 Suppose that we have a two-factor ANOVA model of the form

\[ y_{ij} = \frac{1}{\sqrt{\bar{p}}} (\tau_i + \gamma_j) + \xi_{ij}, \quad 1 \leq i, j \leq \bar{p}. \]

Here $(\tau_1, \ldots, \tau_{\bar{p}}) \in [-1, 1]^{\bar{p}}$ are the levels of the first factor, $(\gamma_1, \ldots, \gamma_{\bar{p}}) \in [-1, 1]^{\bar{p}}$ are the levels of the second factor, and $\xi_{ij} \overset{iid}{\sim} N(0, \sigma^2)$. Setting $n = \bar{p}^2$ and $p = 2\bar{p}$, let $y \in \mathbb{R}^n$ denote the vector obtained by linearizing the matrix $((y_{ij}))$ row-wise, let $X$ denote the corresponding $n \times p$ design matrix, and let $\beta := (\tau, \gamma) \in [-1, 1]^p$ be the unknown parameter. Assume that the true regression coefficient $\beta_0 \in [-1, 1]^p$ satisfies $w_{\beta_0} \xrightarrow{L^1} \phi$, for some function $\phi \in L^1[0,1]$.

(a) Then

\[ \frac{1}{p} \log Z_p(y, X) \xrightarrow{P} \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F), \]

where $\psi \equiv 1$,

\[ W(x, y) := 0 \text{ if } (x, y) \in [0, .5)^2 \cup (.5, 1]^2, \qquad W(x, y) := 1 \text{ if } (x, y) \in [0, .5) \times (.5, 1] \cup (.5, 1] \times [0, .5), \]

and $g(x) = \int_{[0,1]} W(x, y) \phi(y)\, dy + \phi(x)$.

(b) Consider the following two cases: (i) the condition of Lemma 25, Part (ii)(a), holds; or (ii) the conditions of Lemma 25, Part (ii)(b), hold. Then the conclusions of Corollary 22 hold.

3. Discussions

We discuss some limitations of our current results, and collect some questions for future enquiry.

(i) The bounded support assumption on the prior: A vital technical assumption in our analysis concerns the bounded support of the prior. We note that for a general prior with unbounded support, the posterior $\mu_{y,X}$ might not even be a proper probability distribution. One intuitively expects that under appropriate tail-decay conditions on the prior, the results in this paper should generalize. Going beyond the bounded support assumption requires extending the theory of non-linear large deviations to probability measures on unbounded spaces, and is thus beyond the scope of this paper.

(ii) Connection with classical asymptotics: The curious reader might wonder about the connections between the results in this paper and classical low-dimensional Bayesian asymptotic theory.
To explicate this connection, recall the model (1) with the covariates being drawn iid from a $N(0, I_p)$ distribution. Consider the classical asymptotic regime with fixed dimension $p$ and $n \to \infty$, and let $\beta_0 / \sqrt{n} \in \mathbb{R}^p$ be a sequence of coefficients on the local scale. Using the theory of equivalence of statistical experiments (Van der Vaart, 2000), the likelihood is equivalent to an observation $\tilde{\beta} \sim N(\beta_0, (\frac{X^T X}{n})^{-1})$. Further, the Fisher information matrix $X^T X / n$ is concentrated around the identity matrix, and thus the data is asymptotically equivalent to $\tilde{\beta} \sim N(\beta_0, I_p)$. In turn, the posterior, constructed from an iid product prior, is equivalent to

\[ \mu(d\beta) \propto \prod_{i=1}^p \exp\Big( -\frac{1}{2} (\beta_i - \tilde{\beta}_i)^2 \Big)\, d\pi(\beta_i). \]

Finally, consider an additional limit $p \to \infty$. Assuming $w_{\beta_0} \xrightarrow{L^1} \phi$, by the weak law of large numbers, for any continuous function $\zeta : [-1, 1]^2 \to \mathbb{R}$,

\[ \frac{1}{p} \sum_{i=1}^p \zeta(\beta_i, \beta_{0,i}) \xrightarrow{P} \mathbb{E}\Big[ \int_{[-1,1]} \zeta(w, \phi(X))\, d\pi_{(\phi(X) + Z, 1)}(w) \Big], \]

where $X \sim U([0,1])$ and $Z \sim N(0,1)$ are independent. We emphasize that this is an iterated limit, and $p$ is not a function of $n$ in this discussion. Note that this behavior is consistent with Corollary 22 upon setting $W = 0$ and $\psi = 1$. Thus our results are naturally compatible with the local asymptotics regime; in fact, for Gaussian designs, our results establish the validity of this behavior as long as $p = o(n)$. Of course, our framework is significantly more general, and can handle situations where the design distribution changes with $n, p$, e.g. sparse Bernoulli designs.

(iii) Extensions to Gibbs posteriors and fractional posteriors: A careful study of our proof reveals that our main results do not depend strongly on the correct model specification. As a result, we expect similar techniques to be broadly useful in the study of Gibbs and fractional posteriors (Alquier et al., 2016; Yang et al., 2020).

(iv) Extensions to other GLMs: Another natural question of interest concerns the applicability of these ideas to more general models, e.g. logistic regression. We consider this to be an extremely interesting question, and plan to explore this in the future.

(v) Extensions to models with latent characteristics: Variational methods are ubiquitous in applications with latent characteristics, e.g. topic modeling, community detection etc. In contrast, the relevant variables are all observed in the linear model framework. It will be interesting to explore the applicability of our techniques to models with latent features.

(vi) Connections with benign overfitting: A recent series of papers has explored the benign overfitting phenomenon in the context of high-dimensional linear regression (Bartlett et al., 2020; Hastie et al., 2022; Tsigler and Bartlett, 2020). The main takeaway from this line of research is that if the population covariance matrix has a long and fat tail, the min-norm interpolant continues to have excellent prediction performance in high dimensions. There are some high-level conceptual similarities between this phenomenon and the results reported in this paper. Indeed, both phenomena are driven by the eigenvalues of the covariates: in our case, the condition $\mathrm{tr}(A^2) = o(p)$ essentially demands that the matrix $A \in \mathbb{R}^{p \times p}$ has only $o(p)$ many eigenvalues of constant order. Broadly, this highlights the importance of covariate geometry in governing the practical performance of ML methods in high dimensions.
(vii) Connections with Bayesian deep learning: Some recent papers in Bayesian deep learning (Farquhar et al., 2020; Foong et al., 2020) experimentally establish that the nVB approximation might not be accurate in Bayesian neural networks, and that increasing the depth might help ameliorate some of the issues. Our results, on the other hand, derive sufficient conditions for the accuracy of the nVB approximation in the simple case of the linear model with product priors. While one should definitely not expect the nVB approximation to be uniformly accurate (especially in high dimensions), it would be interesting to identify conditions on the covariates which guarantee its accuracy in Bayesian neural networks.

4. Proofs of the main results

We prove the main results in this section. Theorem 7 is established in Section 4.1, Theorem 17 is established in Section 4.2, while Theorem 20 is proved in Section 4.3.

4.1 Proof of Theorem 7

Our first lemma collects some basic facts about the exponential family $\pi_\gamma$. The proof is deferred to the Appendix.

Lemma 30 In the setting of Lemma 2, we have the following conclusions.

(i) Set $\gamma = (\gamma_1, \gamma_2) \in \mathbb{R}^2$. The derivatives of $c(\gamma_1, \gamma_2)$ with respect to $\gamma_1$ are given by

\[ \dot{c}(\gamma) := \frac{\partial c(\gamma_1, \gamma_2)}{\partial \gamma_1} = \int_{[-1,1]} z\, d\pi_\gamma(z), \qquad \ddot{c}(\gamma) := \frac{\partial^2 c(\gamma_1, \gamma_2)}{\partial \gamma_1^2} = \int_{[-1,1]} (z - \dot{c}(\gamma))^2\, d\pi_\gamma(z). \]

(ii) We have, for $\gamma = (\gamma_1, \gamma_2) \in \mathbb{R}^2$,

\[ D_{KL}(\pi_\gamma \| \pi_{(0,\gamma_2)}) = \gamma_1 \dot{c}(\gamma) - c(\gamma) + c(0, \gamma_2). \]

(iii) Consider a sequence $\gamma_k = (\gamma_{1,k}, \gamma_{2,k}) \in \mathbb{R}^2$ such that $\gamma_{1,k} \to \infty$ and $\limsup_k |\gamma_{2,k}| < \infty$. Then $\pi_{\gamma_k} \xrightarrow{w} \pi_+$, where $\pi_+$ is the degenerate distribution which puts mass 1 at $1$.

(iv) For any $d > 0$, the function $G(\cdot, d) : [-1, 1] \to \mathbb{R}_+$ is lower semicontinuous.

(v) For $x \in [-1, 1]$, $y \in \mathbb{R}$ and $r \in \mathbb{N}$, define

\[ H(x, y) = \int_{[-1,1]} z^r\, d\pi_{(h(x,y), y)}(z). \quad (14) \]

Then we have

\[ \sup_{x \in (-1,1), y \in \mathbb{R}} \Big| \frac{\partial H}{\partial x}(x, y) \Big| < \infty. \]

We now turn to the proof of Theorem 7. To this end, set

\[ m_i(\beta) := \sum_{j=1}^p A_{ij} \beta_j, \qquad \theta_i := \frac{z_i - m_i(\beta)}{\sigma^2}, \quad (16) \]

where we recall that $\sigma^2$ denotes the noise variance in our linear regression model (1), and $z = X^T y$. Observe that $A_{ii} = 0$ for all $1 \leq i \leq p$ implies that $\mu(\cdot \mid (\beta_j)_{j \neq i}) = \pi_{(\theta_i, d_i)}$, and thus $\mathbb{E}_\mu[\beta_i \mid (\beta_j)_{j \neq i}] = \dot{c}(\theta_i, d_i)$. We define

\[ b = (\dot{c}(\theta_i, d_i))_{1 \leq i \leq p}. \quad (17) \]

We will prove that certain statistics under the posterior distribution $\mu$ can be well-approximated by the vector of conditional means. To this end, we establish the following results.

Lemma 31 Under the conditions of Theorem 7, setting $f(\beta) = -\frac{1}{2\sigma^2} \beta^T A \beta + \frac{1}{\sigma^2} z^T \beta$, we have

\[ \log Z_p = \log \Big[ \int_{[-1,1]^p} e^{f(\beta)} \prod_{i=1}^p \pi_i(d\beta_i) \Big] = \sup_{u \in [-1,1]^p} M_p(u) + o(p), \quad (18) \]
\[ \mathbb{E}_\mu\big[ \beta^T A \beta - b^T A b \big]^2 = o(p^2), \quad (19) \]
\[ \mathbb{E}_\mu\Big[ \sum_{i=1}^p m_i(\beta)(\beta_i - b_i) \Big]^2 = o(p^2), \quad (20) \]
\[ \mathbb{E}_\mu\Big[ f(\beta) - f(b) - \sum_{i=1}^p \theta_i (\beta_i - b_i) \Big]^2 = o(p^2). \quad (21) \]

In the above displays, the random vectors $\theta := (\theta_1, \ldots, \theta_p)^T$ and $b$ are as defined in (16) and (17) respectively.

Lemma 32 Suppose $\phi : [-1, 1] \to \mathbb{R}$ is a bounded measurable function, and $A_p$ satisfies (5) and (6). Then for any $c \in [-1, 1]^p$ we have

\[ \limsup_{p \to \infty} \frac{1}{p}\, \mathbb{E}_\mu\Big[ \sum_{i=1}^p c_i \big\{ \phi(\beta_i) - \mathbb{E}_\mu[\phi(\beta_i) \mid \beta_k, k \neq i] \big\} \Big]^2 < \infty. \]
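Before turning to the proofs, we note that the conditional mean vector $b$ of (17) is precisely what a naive mean-field fixed-point scheme tracks: the stationarity condition of $M_p$ reads $u_i = \dot c((z_i - (Au)_i)/\sigma^2, d_i)$. The sketch below iterates this map, reusing c_dot from the earlier quadrature sketch; it is a heuristic illustration suggested by (16)-(17), not an algorithm analyzed in the paper.

```python
import numpy as np

def nmf_fixed_point(X, y, sigma2=1.0, iters=200, tol=1e-8):
    """Iterates u_i <- c_dot((z_i - (A u)_i) / sigma^2, d_i), the stationarity
    condition of M_p, starting from u = 0. Heuristic sketch."""
    XtX = X.T @ X
    z = X.T @ y
    d = np.diag(XtX) / sigma2            # tilts d_i = D_p(i,i) / sigma^2
    A = XtX - np.diag(np.diag(XtX))      # off-diagonal part A_p
    u = np.zeros(X.shape[1])
    for _ in range(iters):
        u_new = np.array([c_dot((z[i] - A[i] @ u) / sigma2, d[i]) for i in range(len(u))])
        if np.max(np.abs(u_new - u)) < tol:
            break
        u = u_new
    return u
```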
We defer the proofs of these lemmas to the end of this section, and establish Theorem 7 given these lemmas.

Proof of Theorem 7

Proof of Part (i). This is exactly (18) of Lemma 31.

Proof of Part (ii). With $f(\beta) = -\frac{1}{2\sigma^2} \beta^T A \beta + \frac{1}{\sigma^2} z^T \beta$ as in Lemma 31, define

\[ C_p(\varepsilon) = \Big\{ u \in [-1, 1]^p : f(u) - \sum_{i=1}^p \big[ u_i h(u_i, d_i) - c(h(u_i, d_i), d_i) + c(0, d_i) \big] < \sup_{u \in [-1,1]^p} M_p(u) - p\varepsilon \Big\}, \]

i.e. the set of $u$ with $M_p(u) < R_p - p\varepsilon$. Note that it suffices to establish that for all $\varepsilon > 0$, $\mu(b \in C_p(\varepsilon)) \to 0$ as $p \to \infty$; with a slight abuse of notation, we also write $C_p(\varepsilon)$ for the event $\{\beta : b(\beta) \in C_p(\varepsilon)\}$, recalling that $b = b(\beta)$ from (17). To this end, define the event

\[ E_p(\varepsilon/2) = \Big\{ \beta \in [-1, 1]^p : \Big| f(\beta) - f(b) - \sum_{i=1}^p \theta_i (\beta_i - b_i) \Big| \leq \frac{p\varepsilon}{2} \Big\}. \]

Note that (21), in combination with Chebychev's inequality, implies that $\mu(E_p(\varepsilon/2)^c) \to 0$ as $p \to \infty$. Therefore,

\[ \mu(C_p(\varepsilon)) \leq \mu(C_p(\varepsilon) \cap E_p(\varepsilon/2)) + \mu(E_p(\varepsilon/2)^c) = \mu(C_p(\varepsilon) \cap E_p(\varepsilon/2)) + o(1). \quad (22) \]

Now, we have,

\[ \mu(C_p(\varepsilon) \cap E_p(\varepsilon/2)) = \frac{1}{Z_p} \int_{C_p(\varepsilon) \cap E_p(\varepsilon/2)} \exp(f(\beta)) \prod_{i=1}^p \pi_i(d\beta_i) \]
\[ \leq \frac{e^{p\varepsilon/2}}{Z_p} \int_{C_p(\varepsilon) \cap E_p(\varepsilon/2)} \exp\Big[ f(b) + \sum_{i=1}^p \theta_i (\beta_i - b_i) \Big] \prod_{i=1}^p \pi_i(d\beta_i) \]
\[ \leq \frac{\exp(-p\varepsilon/2 + \sup_{u \in [-1,1]^p} M_p(u))}{Z_p} \int_{C_p(\varepsilon) \cap E_p(\varepsilon/2)} \exp\Big[ \sum_{i=1}^p \big( \theta_i \beta_i - c(\theta_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i), \]

where the first inequality follows using the definition of $E_p(\varepsilon/2)$, while the last inequality follows from the definition of $C_p(\varepsilon)$. Using the display above in combination with (7), it suffices to establish that

\[ \log \int_{C_p(\varepsilon) \cap E_p(\varepsilon/2)} \exp\Big[ \sum_{i=1}^p \big( \theta_i \beta_i - c(\theta_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i) = o(p). \]

This argument would be relatively straightforward if the $\theta_i$ were fixed constants independent of $\beta$, as then

\[ \int_{C_p(\varepsilon) \cap E_p(\varepsilon/2)} \exp\Big[ \sum_{i=1}^p \big( \theta_i \beta_i - c(\theta_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i) \leq \int_{[-1,1]^p} \exp\Big[ \sum_{i=1}^p \big( \theta_i \beta_i - c(\theta_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i) = 1. \]

We caution the reader that this is not the case: the $\theta_i$ are themselves functions of $\beta$, as specified in (16). To overcome this issue, we proceed as follows. Fix $\delta > 0$, and let $D_p(\delta)$ be a $\delta\sqrt{p}$-net of the set $\{ A\beta : \beta \in [-1,1]^p \}$ in the Euclidean metric, satisfying $\lim_p \frac{1}{p} \log |D_p(\delta)| = 0$. Under the assumption $\mathrm{tr}(A^2) = o(p)$, such a net was constructed in (Basak and Mukherjee, 2017, Lemma 3.4). Thus for every $\beta \in [-1,1]^p$, there exists $\mathbf{m} \in D_p(\delta)$ such that $\| A\beta - \mathbf{m} \|_2 \leq \delta \sqrt{p}$. For any $\mathbf{m} \in D_p(\delta)$, we set $P(\mathbf{m}) = \{ \beta \in [-1,1]^p : \| A\beta - \mathbf{m} \|_2 \leq \delta \sqrt{p} \}$. For $\beta \in P(\mathbf{m})$, setting $\theta^{\mathbf{m}}_i := \frac{z_i - \mathbf{m}_i}{\sigma^2}$, we have

\[ \frac{1}{p} \sum_{i=1}^p (\theta_i - \theta^{\mathbf{m}}_i)^2 = \frac{1}{p\sigma^4} \sum_{i=1}^p (m_i(\beta) - \mathbf{m}_i)^2 = \frac{1}{p\sigma^4} \| A\beta - \mathbf{m} \|_2^2 \leq \frac{\delta^2}{\sigma^4}, \]

where the last inequality uses the definition of $P(\mathbf{m})$. This gives

\[ \Big| \sum_{i=1}^p (\theta_i - \theta^{\mathbf{m}}_i) \beta_i \Big| + \sum_{i=1}^p \big| c(\theta_i, d_i) - c(\theta^{\mathbf{m}}_i, d_i) \big| \leq 2 \sum_{i=1}^p |\theta_i - \theta^{\mathbf{m}}_i| \leq \frac{2p\delta}{\sigma^2}, \]

where the first inequality follows from $|\beta_i| \leq 1$ and the smoothness of $c(\cdot, \cdot)$ (recall $|\dot{c}| \leq 1$), and the second follows from Cauchy-Schwarz. Thus we have

\[ \int_{C_p(\varepsilon) \cap E_p(\varepsilon/2)} \exp\Big[ \sum_{i=1}^p \big( \theta_i \beta_i - c(\theta_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i) \leq \sum_{\mathbf{m} \in D_p(\delta)} \int_{P(\mathbf{m})} \exp\Big[ \sum_{i=1}^p \big( \theta_i \beta_i - c(\theta_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i) \]
\[ \leq |D_p(\delta)|\, e^{2p\delta/\sigma^2} \max_{\mathbf{m} \in D_p(\delta)} \int_{[-1,1]^p} \exp\Big[ \sum_{i=1}^p \big( \theta^{\mathbf{m}}_i \beta_i - c(\theta^{\mathbf{m}}_i, d_i) + c(0, d_i) \big) \Big] \prod_{i=1}^p \pi_i(d\beta_i) = |D_p(\delta)|\, e^{2p\delta/\sigma^2}. \]

This concludes the proof, as $\delta > 0$ is arbitrary.

Proof of Part (iii). First, note that it suffices to show the result with $\zeta(x, y) = x^r y^s$ for any $r, s \in \mathbb{N}$. Thus we fix $r, s \in \mathbb{N}$ in the rest of the proof. Using Lemma 32 with $\phi(x) = x^r$ and $c_i = (\frac{i}{p})^s$, it follows that for any $\varepsilon > 0$,

\[ \mu\Big( \Big| \frac{1}{p} \sum_{i=1}^p \Big(\frac{i}{p}\Big)^s \beta_i^r - \frac{1}{p} \sum_{i=1}^p H(b_i, d_i) \Big(\frac{i}{p}\Big)^s \Big| > \varepsilon \Big) \to 0, \]

where $H$ is defined as in (14). To complete the proof, it thus suffices to show that

\[ \mu\Big( \Big| \frac{1}{p} \sum_{i=1}^p H(b_i, d_i) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(\hat{u}_i, d_i) \Big(\frac{i}{p}\Big)^s \Big| > \varepsilon \Big) \to 0. \quad (23) \]

Fixing $K < \infty$ and setting $d^K(i) := d_i \mathbf{1}\{|d_i| \leq K\}$, using Lemma 30 Part (v) we have

\[ \Big| \frac{1}{p} \sum_{i=1}^p H(\hat{u}_i, d_i) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(\hat{u}_i, d^K(i)) \Big(\frac{i}{p}\Big)^s \Big| \lesssim \frac{1}{p} \sum_{i=1}^p |d_i|\, \mathbf{1}\{|d_i| > K\}, \quad (24) \]

which goes to $0$ as $p \to \infty$ followed by $K \to \infty$, using the uniform integrability assumption on $\frac{1}{p} \sum_{i=1}^p \delta_{d_i}$. Also, with $\delta > 0$ and setting

\[ \hat{u}^\delta(i) := \hat{u}_i \text{ if } |\hat{u}_i| \leq 1 - \delta, \qquad := 1 - \delta \text{ if } \hat{u}_i > 1 - \delta, \qquad := -1 + \delta \text{ if } \hat{u}_i < -1 + \delta, \]

we have

\[ \Big| \frac{1}{p} \sum_{i=1}^p H(\hat{u}_i, d^K(i)) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(\hat{u}^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s \Big| \]
\[ \leq \sup_{x \in [1-\delta, 1], |y| \leq K} |H(x, y) - H(1 - \delta, y)| + \sup_{x \in [-1, -1+\delta], |y| \leq K} |H(x, y) - H(-1 + \delta, y)|. \quad (25) \]

Fixing $K > 0$, we now claim that the RHS of (25) converges to $0$ as $\delta \to 0$. Given this claim, it follows from (24) and (25) that

\[ \limsup_{K \to \infty} \limsup_{\delta \to 0} \limsup_{p \to \infty} \Big| \frac{1}{p} \sum_{i=1}^p H(\hat{u}_i, d_i) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(\hat{u}^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s \Big| = 0. \]

A similar argument with $\hat{u}_i$ replaced by $b_i$ gives

\[ \limsup_{K \to \infty} \limsup_{\delta \to 0} \limsup_{p \to \infty} \Big| \frac{1}{p} \sum_{i=1}^p H(b_i, d_i) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(b^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s \Big| = 0, \]

where $b^\delta(i)$ denotes the analogous truncation of $b_i$. It thus suffices to show that for every fixed $\delta, K$ we have

\[ \mu\Big( \Big| \frac{1}{p} \sum_{i=1}^p H(\hat{u}^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(b^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s \Big| > \varepsilon \Big) \to 0. \]

But this follows on noting that, with $M := \sup_{|x| \leq 1-\delta, |y| \leq K} |\frac{\partial H}{\partial x}(x, y)|$,

\[ \Big| \frac{1}{p} \sum_{i=1}^p H(\hat{u}^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s - \frac{1}{p} \sum_{i=1}^p H(b^\delta(i), d^K(i)) \Big(\frac{i}{p}\Big)^s \Big| \leq \frac{M}{p} \sum_{i=1}^p |\hat{u}^\delta(i) - b^\delta(i)| \leq M \sqrt{\frac{1}{p} \sum_{i=1}^p \big[ \hat{u}_i - b_i \big]^2},\]
To complete the argument, it thus remains to verify the claim involving (25), for which it suffices to verify continuity of the function (x, y) 7 H(x, y) at x = 1, uniformly for y [ K, K]. By symmetry, it suffices to show that lim δ 0 sup x [1 δ,1],|y| K |H(x, y) 1| 0. Suppose this is not true. Then there exists sequences {xk}k 1, {yk}k 1 with limk xk = 1, and |yk| K, such that |H(xk, yk) 1| > ε for all k, for some ε > 0. Without loss of generality, by passing to a subsequence, we can assume that yk converges to y [ K, K]. Using Lemma 30 Part (iii) we have that π(h(xk,yk),yk) converges weakly to δ1, the point mass at 1. Consequently, using DCT we have H(xk, yk) = Z [ 1,1] xrdπ(h(xk,yk),yk)(z) 1, a contradiction. This verifies the claim, and hence completes the proof of part (iii). Next we turn to the proof of Lemmas 31 and 32. Proof of Lemma 31 Recall the notation 2σ2 βT Aβ + 1 σ2 z T β = 1 i,j=1 Ap(i, j)βiβj + 1 from the proof of Theorem 7 part (ii). Set 2σ2 βT Aβ = 1 i,j=1 Ap(i, j)βiβj, and note that we are in the setting of proof of (Yan, 2020, Thm 4), via the following connections: X β, ˆX b, N 1, JN N 1 2σ2 , n p, An Ap, efi(x) mi(β), h z. The only disparity in the above connection is that Yan (2020) works with h which is free of i, whereas in our setup we have hi = zi which varies with i. However, the proof of (Yan, 2020, Theorem 4) goes through verbatim under this more general choice of the linear term, as long as we have the mean field assumption tr(A2) = o(p). Thus, (Yan, 2020, Theorem 4) gives log Zp sup Q=Qp i=1 Qi h 1 2σ2 (EQ[X])TA(EQ[X]) + 1 σ2 z TEQ[X] i=1 D(Qi πi) i = o(p), Mukherjee and Sen where EQ[X] is the mean vector of X Q = Qp i=1 Qi. The first conclusion (18) then follows upon using Lemma 6. Also (19) and (20) follow from (Yan, 2020, (4.40)) and (Yan, 2020, (4.41)) respectively, by using the connections mentioned above. It thus suffices to prove (21). But this follows upon observing that i=1 θi(βi bi) 1 i=1 mi(β)(βi bi) + 1 2σ2 |βTAβ b TAb|, and using (19) and (20). Proof of Lemma 32 For any i, j [p] with i = j, setting ti := E[φ(βi)|βk, k = i] = Z [ 1,1] φ(z)dπ(θi,di)(z), θi := zi Pp k=1 Aikβk σ2 Eµ h φ(βi) ti ih φ(βj) tj i = Eµ h φ(βi) ti ih φ(βj) t(i) j i + Eµ h φ(βi) ti ih ti j tj i (26) t(i) j := Z [ 1,1] φ(z)dπ(θi j,dj)(z), θ(i) j := zj P k =i Ajkβk σ2 . Since t(i) j does not depend on βi, we have Eµ h φ(βi) ti ih φ(βj) t(i) j i = Eµ h φ(βj) t(i) j i Eµ h φ(βi) ti βk, k = i i = 0. For estimating the second term in the RHS of (26), setting [ 1,1] φ(z)dπ(x,di)(z) a Taylor s series expansion gives t(i) j tj = ℓ(θ(i) j ) ℓ(θj) = Aij σ2 βiℓ (θj) + A2 ij 2σ4 β2 i ℓ (ξ(i) j ). Using the last two displays and summing over i, j give i =j cicj Eµ h φ(βi) ti ih φ(βj) tj i i =j cicj Aij σ2 βiℓ (θj) h φ(βi) ti i i,j=1 cicj A2 ij 2σ4 β2 i ℓ (ξ(i) j ) σ2 sup u [ 1,1]p j=1 Aijuj + 1 2σ4 i,j=1 A2 ij, Variational Inference in linear regression where the last estimate uses the fact that ℓ(r) 1 for r = 1, 2. The desired conclusion follows upon using the given hypothesis on Ap. 4.2 Proof of Theorem 17 We start with a proof of Lemma 15. Proof of Lemma 15 Without loss of generality assume that the samples (Z1, , Zp) are arranged in increasing order of variance, i.e. increasing order of {Dp(i, i)}p i=1. Let {δp : p 1} be a positive sequence converging to 0, such that i,j=1 Ap(i, j)2 = o(p). The existence of such a sequence follows from the assumption (5). Let q := arg min{i [p] : Dp(i, i) δp}, r := arg max{i [p] : Dp(i, i) 1 Recall that if Z N(0, 1), E|Z| = q 2 π. 
This implies

\[ \mathbb{E} \sup_{u \in [-1,1]^p} \Big| \sum_{i=1}^{q-1} Z_i u_i \Big| \leq \sum_{i=1}^{q-1} \mathbb{E}|Z_i| = \sqrt{\frac{2}{\pi}} \sum_{i=1}^{q-1} \sqrt{D_p(i, i)} \leq \sqrt{\frac{2\delta_p}{\pi}}\, p = o(p), \quad (27) \]
\[ \mathbb{E} \sup_{u \in [-1,1]^p} \Big| \sum_{i=r+1}^{p} Z_i u_i \Big| \leq \sqrt{\frac{2}{\pi}} \sum_{i=r+1}^p \sqrt{D_p(i, i)}\, \mathbf{1}\{D_p(i,i) \geq 1/\delta_p\} \leq \sqrt{\frac{2\delta_p}{\pi}} \sum_{i=1}^p D_p(i, i) = o(p), \]

where we use $\sum_{i=1}^p D_p(i, i) = O(p)$. For $q \leq i \leq r$, setting $Y_i := \frac{Z_i}{\sqrt{D_p(i,i)}}$, let $\Gamma_p$ denote the $(r - q + 1) \times (r - q + 1)$ dimensional covariance matrix of the vector $Y := (Y_q, \ldots, Y_r)^T$. Define $E = (\xi_q, \ldots, \xi_r)^T := \Gamma_p^{-1/2} Y$, and note that $(\xi_q, \ldots, \xi_r) \overset{iid}{\sim} N(0, 1)$. Setting $v_i := u_i \sqrt{D_p(i, i)}$ and $v := (v_q, \ldots, v_r)^T$, we have

\[ \sum_{i=q}^r u_i Z_i = \sum_{i=q}^r v_i Y_i = v^T \Gamma_p^{1/2} E = v^T E + v^T B E = \sum_{i=q}^r \sqrt{D_p(i, i)}\, u_i \xi_i + v^T B E, \quad (28) \]

where $B := \Gamma_p^{1/2} - I_{r-q+1}$. We now claim that

\[ \frac{1}{p} \sup_{v \in \delta_p^{-1/2} [-1,1]^{r-q+1}} \big| v^T B E \big| \xrightarrow{P} 0. \quad (29) \]

Given (29), it follows from (28) that

\[ \frac{1}{p} \sup_{u \in [-1,1]^p} \Big| \sum_{i=q}^r u_i Z_i - \sum_{i=q}^r \sqrt{D_p(i, i)}\, u_i \xi_i \Big| \xrightarrow{P} 0. \quad (30) \]

Finally, with $\{\xi_i\}_{i \in [p] \setminus [q, r]} \overset{iid}{\sim} N(0,1)$ independent of $\{\xi_i\}_{i=q}^r$, using arguments similar to the derivation of (27) we get

\[ \mathbb{E} \sup_{u \in [-1,1]^p} \Big| \sum_{i \notin [q,r]} \big( Z_i - \sqrt{D_p(i,i)}\, \xi_i \big) u_i \Big| \leq 2\sqrt{\frac{2}{\pi}} \sum_{i \notin [q,r]} \sqrt{D_p(i, i)} = o(p). \quad (31) \]

Combining (27), (30) and (31), the desired conclusion follows. It thus remains to verify (29). To this end, note that $\Gamma_p(i, i) = 1$, and $\Gamma_p(i, j) = \frac{A_p(i,j)}{\sqrt{D_p(i,i) D_p(j,j)}}$ for $i \neq j$. Setting $C_\delta(i, j) := \Gamma_p(i, j) \mathbf{1}\{i \neq j\}$ and using the spectral theorem, we write

\[ C_\delta = \sum_{\ell=q}^r \lambda_\ell \psi_\ell \psi_\ell^T = \Psi \Lambda \Psi^T. \]

Observe that $\Gamma_p = I + C_\delta$ is positive semidefinite, and has eigenvalues $1 + \lambda_\ell$ with $\lambda_\ell \geq -1$. Then we have

\[ \sum_{\ell=q}^r \lambda_\ell^2 = \sum_{i,j} C_\delta(i, j)^2 \leq \frac{1}{\delta_p^2} \sum_{i,j=1}^p A_p(i, j)^2, \quad (32) \]

where we use $D_p(i,i) \geq \delta_p$ for $q \leq i \leq r$. Finally, the matrix $B$ can be expressed as

\[ B = (I_{r-q+1} + C_\delta)^{1/2} - I_{r-q+1} = \sum_{\ell=q}^r \mu_\ell \psi_\ell \psi_\ell^T =: \Psi \widetilde{\Lambda} \Psi^T, \]

where $\mu_\ell := \sqrt{1 + \lambda_\ell} - 1$, and $\widetilde{\Lambda}$ is a diagonal matrix with entries $\{\mu_\ell\}_{\ell=q}^r$. Consequently,

\[ \mathbb{E} \sup_{v \in \delta_p^{-1/2}[-1,1]^{r-q+1}} |v^T B E| \leq \frac{1}{\sqrt{\delta_p}}\, \mathbb{E} \| B E \|_1 \leq \sqrt{\frac{p}{\delta_p}}\, \mathbb{E} \| B E \|_2 \leq \sqrt{\frac{p}{\delta_p}} \sqrt{\mathbb{E} \| \Psi \widetilde{\Lambda} F \|_2^2} = \sqrt{\frac{p}{\delta_p}} \sqrt{\sum_{\ell=q}^r \mu_\ell^2}, \]

where $F_\ell := \psi_\ell^T E$ are iid $N(0, 1)$ for $q \leq \ell \leq r$, and we have used Jensen's inequality. The last expression is bounded by

\[ \sqrt{\frac{p}{\delta_p}} \sqrt{C \sum_{\ell=q}^r \lambda_\ell^2} \leq \sqrt{\frac{C p}{\delta_p^3} \sum_{i,j=1}^p A_p(i, j)^2} = o(p), \]

where $C := \sup_{x \geq -1} \big( \frac{\sqrt{1+x} - 1}{x} \big)^2 < \infty$, the middle bound uses (32), and the final estimate follows from the choice of $\delta_p$. This verifies (29), and completes the proof of the lemma.

To establish Theorem 17, we need the following definitions.

Definition 33 Let $\xi_1, \ldots, \xi_p \overset{iid}{\sim} N(0, 1)$ be the random variables obtained from Lemma 15. Let $X \sim U([0,1])$ be independent of the $\xi_i$ variables. Given $u \in [-1,1]^p$, set $Z = \xi_i$ and $U = u_i$ if $X \in (\frac{i-1}{p}, \frac{i}{p}]$. Define $\bar{L}^{(u)}_p$ to be the joint distribution of $(X, Z, U)$.

Fix $W \in \mathcal{W}$, and let $\phi, \psi$ be $L^1$ functions on $[0, 1]$. Let $\mathcal{F}_{2,4}$ denote the space of all joint distributions of triplets $(X, Z, U) \sim \nu$ such that $X \sim U([0,1])$, $\mathbb{E}_\nu[Z^2] \leq 4$, and $|U| \leq 1$. Define the functional $\mathbb{G}_{W,\phi,\psi} : \mathcal{F}_{2,4} \to \mathbb{R} \cup \{-\infty\}$ by

\[ \mathbb{G}_{W,\phi,\psi}(\nu) = -\frac{1}{2\sigma^2} \mathbb{E}[W(X_1, X_2) U_1 U_2] + \frac{1}{\sigma^2} \mathbb{E}[\phi(X) U] + \frac{1}{\sigma} \mathbb{E}\big[ \sqrt{\psi(X)}\, U Z \big] - \mathbb{E}\Big[ G\Big( U, \frac{\psi(X)}{\sigma^2} \Big) \Big], \quad (33) \]

where $(X_1, Z_1, U_1), (X_2, Z_2, U_2)$ are iid copies from $\nu$. Finally, let $\bar{\mathcal{F}}$ denote the space of all probability measures $\nu$ on $\mathbb{R}^3$ such that if $(X, Z, U) \sim \nu$, then

\[ X \sim U([0,1]), \quad Z \sim N(0,1), \quad X \perp Z, \quad |U| \leq 1 \text{ a.s.} \quad (34) \]

We note that $\bar{\mathcal{F}} \subseteq \mathcal{F}_{2,4}$, an observation that will be helpful in our subsequent analysis. The following stability estimates will be crucial in our proof of Theorem 17. The proof is deferred to the Appendix.

Lemma 34 (i) We have, for $W, W' \in \mathcal{W}$ and $\phi, \psi, \phi', \psi' \in L^1([0,1])$,

\[ \sup_{\nu \in \mathcal{F}_{2,4}} \big| \mathbb{G}_{W,\phi,\psi}(\nu) - \mathbb{G}_{W',\phi',\psi'}(\nu) \big| \lesssim \| W - W' \|_\square + \| \phi - \phi' \|_1 + \| \psi - \psi' \|_1^{1/2} + \| \psi - \psi' \|_1. \]

(ii) Suppose the following assumptions hold:

(a) $W_k, W \in \mathcal{W}$ are such that $d_\square(W_k, W) \to 0$.
(b) $\phi_k, \phi \in L^\infty$ are such that $\int_{[0,1]} |\phi_k(x) - \phi(x)|\, dx \to 0$.
(c) $\psi_k, \psi \in L^1[0,1]$ are such that $\int_{[0,1]} |\psi_k(x) - \psi(x)|\, dx \to 0$.

Then, writing $(W * \phi)(x) := \int_{[0,1]} W(x, y) \phi(y)\, dy$, we have

\[ \sup_{\nu \in \mathcal{F}_{2,4}} \big| \mathbb{G}_{W_k,\, W_k * \phi_k + \phi_k \psi_k,\, \psi_k}(\nu) - \mathbb{G}_{W,\, W * \phi + \phi \psi,\, \psi}(\nu) \big| \to 0. \quad (35) \]

Lemma 35 We have, for any $W \in \mathcal{W}$ and $g, \psi \in L^1([0,1])$,

\[ \sup_{\nu \in \bar{\mathcal{F}}} \mathbb{G}_{W,g,\psi}(\nu) = \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F). \]

Further, both suprema are attained.

Remark 36 Depending on whether $D_{KL}(\pi_+ \| \pi)$ and $D_{KL}(\pi_- \| \pi)$ are infinite or not, there are four possible cases.
In our subsequent proofs, we consider the case $D_{KL}(\pi_+ \| \pi) < \infty$ and $D_{KL}(\pi_- \| \pi) = \infty$, noting that the other cases follow by natural modifications.

Lemma 37 Let $d(\cdot, \cdot)$ be the 2-Wasserstein distance on $\mathrm{Pr}([0,1] \times \mathbb{R} \times [-1,1])$. For any $\delta > 0$, set

\[ B_d(\bar{\mathcal{F}}, \delta) = \{ \nu \in \mathrm{Pr}([0,1] \times \mathbb{R} \times [-1,1]) : \inf_{\nu' \in \bar{\mathcal{F}}} d(\nu, \nu') < \delta \}. \]

Then there exists a sequence $\delta_p \to 0$ as $p \to \infty$ such that, setting

\[ F_p = \{ \forall u \in [-1,1]^p,\ \bar{L}^{(u)}_p \in B_d(\bar{\mathcal{F}}, \delta_p) \}, \]

we have $P(F_p | X) = 1 - o(1)$. The proof of Lemma 37 is straightforward, and is thus omitted.

Lemma 38 Let $\{u_p : p \geq 1\}$, $u_p \in [-1,1]^p$, be such that $\bar{L}^{(u_p)}_p$ converges weakly to $\nu_0 \in \bar{\mathcal{F}}$. Then for any $W \in \mathcal{W}$ and $g, \psi \in L^1$,

\[ \limsup_{p \to \infty} \mathbb{G}_{W,g,\psi}(\bar{L}^{(u_p)}_p) \leq \mathbb{G}_{W,g,\psi}(\nu_0). \]

Lemma 39 Suppose we are in the setting of Theorem 17. Fix $F \in \mathcal{F}$, $\delta, K > 0$ and $p \geq 1$. Let $V_i \sim U([(i-1)/p, i/p])$ be independent of the $\xi_i$ arising in Lemma 15. Set $d^K_i = d_i \mathbf{1}(|d_i| \leq K)$ and define

\[ \bar{u}_i := \begin{cases} F(V_i, \xi_i) & \text{if } F(V_i, \xi_i) \geq -1 + \delta, \\ \dot{c}(0, d^K_i) & \text{otherwise.} \end{cases} \quad (36) \]

Then we have, setting $\bar{u} = (\bar{u}_i)$, for any $\varepsilon > 0$,

\[ \limsup_{K \to \infty} \limsup_{\delta \to 0} \limsup_{p \to \infty} P\Big[ \Big| \frac{1}{p} \widetilde{M}_p(\bar{u}) - \mathbb{G}_{W,g,\psi}(F) \Big| > \varepsilon \,\Big|\, X \Big] = 0. \]

We prove Theorem 17 assuming Lemmas 35, 38 and 39. The corresponding proofs are deferred to the end of the section.

Proof of Theorem 17 Using Lemma 15, it suffices to prove that

\[ \frac{1}{p} \sup_{u \in [-1,1]^p} \widetilde{M}_p(u) \xrightarrow{P|X} \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F). \]

Upper bound: Note that

\[ \frac{1}{p} \widetilde{M}_p(u) \overset{(b)}{=} \mathbb{G}_{W_{pA_p},\, W_{pA_p} * w_{\beta_0} + w_{\beta_0} w_D,\, w_D}(\bar{L}^{(u)}_p) \overset{(c)}{=} \mathbb{G}_{W,g,\psi}(\bar{L}^{(u)}_p) + E^{(2)}_p(u), \]

where $\sup_{u \in [-1,1]^p} |E^{(2)}_p(u)| \xrightarrow{P|X} 0$ as $p \to \infty$. Here, (b) uses the definition (33), and (c) uses Lemma 34 part (ii). To invoke Lemma 34, we use the fact that

\[ P\big[ \forall u \in [-1,1]^p,\ \bar{L}^{(u)}_p \in \mathcal{F}_{2,4} \big] = P\Big[ \sum_{i=1}^p \xi_i^2 \leq 4p \Big] \to 1 \quad (37) \]

as $p \to \infty$. To complete the upper bound, invoking Lemma 35, it suffices to establish that for all $\varepsilon > 0$,

\[ P\Big[ \sup_{u \in [-1,1]^p} \mathbb{G}_{W,g,\psi}(\bar{L}^{(u)}_p) < \sup_{\nu \in \bar{\mathcal{F}}} \mathbb{G}_{W,g,\psi}(\nu) + \varepsilon \,\Big|\, X \Big] = 1 - o(1). \quad (38) \]

We prove (38) by contradiction. Suppose there exists $\varepsilon > 0$ such that, setting

\[ E_p = \Big\{ \sup_{u \in [-1,1]^p} \mathbb{G}_{W,g,\psi}(\bar{L}^{(u)}_p) > \sup_{\nu \in \bar{\mathcal{F}}} \mathbb{G}_{W,g,\psi}(\nu) + \varepsilon \Big\}, \]

we have $\limsup_p P[E_p | X] > 0$. Using Lemma 37, we have $\limsup_p P[F_p \cap E_p] > 0$. Therefore, there exists a subsequence $p_k$ along which $E_{p_k} \cap F_{p_k} \neq \emptyset$. Consequently, there exist $\xi^{p_k} := (\xi_{1,p_k}, \ldots, \xi_{p_k,p_k}) \in \mathbb{R}^{p_k}$, $\xi^{p_k} \in E_{p_k} \cap F_{p_k}$, and $u^{p_k} \in [-1,1]^{p_k}$ such that

\[ \mathbb{G}_{W,g,\psi}(\bar{L}^{(u^{p_k})}_{p_k}) > \sup_{\nu \in \bar{\mathcal{F}}} \mathbb{G}_{W,g,\psi}(\nu) + \varepsilon, \qquad \bar{L}^{(u^{p_k})}_{p_k} \in B_d(\bar{\mathcal{F}}, \delta_{p_k}). \]

This in turn implies that the sequence $\{\bar{L}^{(u^{p_k})}_{p_k} : k \geq 1\}$ is tight, and hence has a subsequence converging weakly to some $\nu_0 \in \bar{\mathcal{F}}$. Assuming without loss of generality that $\bar{L}^{(u^{p_k})}_{p_k} \xrightarrow{d} \nu_0$, the desired contradiction follows once we show that

\[ \limsup_{k \to \infty} \mathbb{G}_{W,g,\psi}(\bar{L}^{(u^{p_k})}_{p_k}) \leq \mathbb{G}_{W,g,\psi}(\nu_0). \quad (39) \]

But this follows from Lemma 38.

Lower Bound: It suffices to show that for all $\varepsilon > 0$,

\[ P\Big[ \frac{1}{p} \sup_{u \in [-1,1]^p} \widetilde{M}_p(u) > \sup_{F \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F) - \varepsilon \,\Big|\, X \Big] = 1 - o(1). \]

To this end, let $F \in \mathcal{F}$ satisfy

\[ \mathbb{G}_{W,g,\psi}(F) > \sup_{F' \in \mathcal{F}} \mathbb{G}_{W,g,\psi}(F') - \varepsilon. \]

Note that if $\mathbb{E}[G(F(X, Z), \psi(X)/\sigma^2)] = \infty$, then the lower bound is trivial. Thus we assume, without loss of generality, that $\mathbb{E}[G(F(X, Z), \psi(X)/\sigma^2)] < \infty$. Next, fixing $K > 0$, we have

\[ \mathbb{G}_{W,g,\psi_K}(F) = \mathbb{G}_{W,g,\psi}(F) + o_K(1), \quad (40) \]

where $\psi_K(x) := \psi(x) \mathbf{1}(\psi(x) \leq K)$, and we have used Lemma 34 Part (i). Further, setting $d^K_i = \min\{d_i, K\}$ and writing $D^K$ for the corresponding truncated vector, a similar argument gives

\[ \mathbb{G}_{W_{pA_p},\, W_{pA_p} * w_{\beta_0} + w_{\beta_0} w_D,\, w_D}(\bar{L}^{(u)}_p) = \mathbb{G}_{W_{pA_p},\, W_{pA_p} * w_{\beta_0} + w_{\beta_0} w_{D^K},\, w_{D^K}}(\bar{L}^{(u)}_p) + o_K(1). \quad (41) \]

For any $p \geq 1$, let $\xi_1, \ldots, \xi_p$ denote the iid $N(0, 1)$ random variables arising in the optimization problem $\widetilde{M}_p$. For each $i \in [p]$, generate $V_i \sim U([(i-1)/p, i/p])$ independent of each other and of the $\xi_i$ variables. Fixing $\delta > 0$, we set

\[ \bar{u}_i := \begin{cases} F(V_i, \xi_i) & \text{if } F(V_i, \xi_i) \geq -1 + \delta, \\ \dot{c}(0, d^K_i) & \text{otherwise.} \end{cases} \quad (42) \]
(42) Using Lemma 39, for any ε > 0, there exists δ, K > 0 (depending on ε), such that with probability at least 1 ε, 1 p sup u [ 1,1]p f Mp(u) 1 p f Mp( u) GW,g,ψ(F) ε sup F F GW,g,ψ(F) 2ε. As ε > 0 is arbitrary, this completes the proof of the lower bound. It remains to prove Lemma 35, 38 and 39. We establish each result in turn. Proof of Lemma 35 To begin, note that for F F, (X, Z, F(X, Z)) ν F. Therefore, sup ν F GW,g,ψ(ν) sup F F GW,g,ψ(F). We now turn to the reverse inequality. Fix (X, Z, U) ν F. Setting V := Φ(Z) note that X, V are U([0, 1]) iid. Define F(X, V ) = E[U|X, V ]. Note that E[W(X, X )UU ] = E[W(X, X )F(X, V )F(X , V )] E[g(X)U] = E[g(X)F(X, V )], E h G U, ψ(X) i E h G F(X, V ), ψ(X) where the final step follows from Jensen s inequality and the observation that 2 u2 G(u, d) 0 from Lemma 4. This implies GW,g,ψ(ν) 1 2σ2 E[W(X, X )F(X, V )F(X, V )] + 1 σ2 E[g(X)F(X, V )] ψ(X)F(X, V )Φ 1(V )] i E h G F(X, V ), ψ(X) Finally, noting that Φ 1 is strictly increasing, we have, ψ(X)F(X, V )Φ 1(V )] E[ p ψ(X)F (X, V )Φ 1(V )], Variational Inference in linear regression where for each x [0, 1], F (x, ) is the monotone rearrangment of F(x, ). This last step follows from Hardy-Littlewood inequality. Observe that as the Uniform distribution is invariant under measure preserving transformations, the other terms in the functional above are unchanged by this rearrangement. Thus GW,g,ψ(ν) GW,g,ψ(F ). Since F F, the proof is complete. Finally, we show that both suprema are attained. To this end, observe that by Lemma 30 Part (iv), GW,g,ψ( ) is upper semi-continuous. Moreover, F is compact under weak topology this follows as the collection F is tight, and thus pre-compact by Prokhorov s theorem. Thus a maximum exists. To see that GW,φ,ψ( ) attains it s maximum, start with a maximizer ν0 of GW,φ,ψ( ), and take F 0 as obtained in the proof above. Proof of Lemma 38 Observe that we can approximate W in L1 by a continuous function W (ℓ). This implies E L(u) p [W(X1, X2)U1U2] = E L(u) p [W (ℓ)(X1, X2)U1U2] + oℓ(1) p Eν0[W (ℓ)(X1, X2)U1U2] + oℓ(1) = Eν0[W(X1, X2)U1U2] + oℓ(1). Upon sending ℓ , we conclude that E L(u) p [W(X1, X2)U1U2] Eν0[W(X1, X2)U1U2]. A similar argument shows that E L(u) p [g(X)U] p Eν0[g(X)U]. Next, we note that we can approximate ψ in L1 by a sequence of continuous functions {ψ(ℓ) : ℓ 1}, and thus, by Cauchy-Schwarz |E L(u) p [ p ψ(X)UZ] E L(u) p [ q ψ(ℓ)(X)UZ]| E L(u) p [| p E L(u) p [( q ψ(X))2]E L(u) p [Z2] 0 |ψ(ℓ)(x) ψ(x)|dx = oℓ(1). The last inequality uses that L(up) p F2,4. This implies E L(u) p [ p ψ(X)UZ] = E L(u) p [ q ψ(ℓ)(X)UZ] + oℓ(1) ψ(ℓ)(X)UZ] + oℓ(1) ψ(X)UZ] + oℓ(1). Sending ℓ , we have E L(u) p [ p ψ(X)UZ] Eν0[ p ψ(X)UZ] as p . Mukherjee and Sen The desired conclusion follows once we establish that lim inf p E L(u) p h G U, ψ(X) i Eν0 h G U, ψ(X) But this follows on using Fatou s lemma and the lower semicontinuity of G , ψ(X) σ2 (Lemma 30 part (iv)). Proof of Lemma 39 The definition of u in (36) implies G(ui, d K i ) = ( G(F(Vi, ξi), d K i ) if F(Vi, ξi) 1 + δ, 0 o.w. Observe that i=1 G(ui, d K i ) =1 i=1 G(F(Vi, ξi), d K i )1(F(Vi, ξi) 1 + δ) (43) Now, the function G : [ 1 + δ, 1] [0, K] R+ is bounded, and thus by Chebychev inequality, i=1 G(F(Vi, ξi), d K i )1(F(Vi, ξi) 1 + δ) i=1 E h G(F(Vi, ξi), d K i )1(F(Vi, ξi) 1 + δ) i P|X 0. (44) Also, note that i=1 E h G(F(Vi, ξi), d K i )1(F(Vi, ξi) 1 + δ) i = E[G(F(X, Z), wd K,p(X))1(F(X, Z) 1 + δ)], = E h G F(X, Z), ψK(X) 1(F(X, Z) 1 + δ) i + o(1) (45) where X U([0, 1]) and Z N(0, 1) are independent. 
The last display uses Lemma 4 along with the fact that wd K,p ψK/σ2 in L1 for almost every K > 0. Now, the assumptions E[G(F(X, Z), ψ(X)/σ2)] < and DKL(π π) = together imply P[F(X, Z) = 1] = 0. In combination with Dominated Convergence Theorem, this implies lim δ 0 E h G F(X, Z), ψK(X) 1(F(X, Z) 1 + δ) i = E h G F(X, Z), ψK(X) Combining (43), (44), (45) and (46), for any ε > 0, we have, lim sup δ 0 lim sup p P h 1 i=1 G(ui, d K i ) E h G F(X, Z), ψK(X) i > ε|X i = 0. (47) Variational Inference in linear regression We now claim that for any ε > 0, lim sup δ 0 lim sup p P h 1 i,j=1 Ap(i, j)uiuj E[W(X1, X2)F(X1, Z1)F(X2, Z2)] > ε|X i = 0. lim sup δ 0 lim sup p P h u TApβ0 + u TDpβ0 E[g(X)F(X, Z)] > ε|X i = 0. lim sup δ 0 lim sup p P h Dp(i, i)ξi E[ p ψ(X)UZ] > ε|X i = 0. We first turn to the quadratic form. We have, ij=1 Ap(i, j)uiuj = E L(u) p [Wp Ap(X1, X2)U1U2] (a) = E L(u) p [W(X1, X2)U1U2] + o(1) (b) = E L(u) p [W (K)(X1, X2)U1U2] + o K(1). (49) Here, the step (a) uses Assumption (a), and step (b) uses W (K) := W1(|W| K) L1 W. We have, setting Mp(i, j) := R [(i 1)/p,i/p] R [(j 1)/p,j/p] W (K)(x, y)dxdy, E L(u) p [W (K)(X1, X2)U1U2] = i,j=1 Mp(i, j)uiuj = i,j=1 Mp(i, j)E[ui]E[uj] + o P (1) (50) by the weak law of large numbers, and the observation |Mp(i, j)| K/p2. Here, the o P (1) term converges to zero in probability under the joint distribution of {(Vi, ξi) : 1 i p}. Further, we have, i,j=1 Mp(i, j)E[ui]E[uj] i,j=1 Mp(i, j)E[F(Vi, Z)]E[F(Vj, Z)] 2KP[F(X, Z) 1 + δ], (51) where X U([0, 1]) and Z N(0, 1) are independent. Finally, i,j=1 Mp(i, j)E[F(Vi, Z)]E[F(Vj, Z)] = E[WMp(X, X )F(X, Z)F(X , Z )] (a) = E[W (K)(X, X )F(X, Z)F(X , Z )] + o(1) (b) = E[W(X, X )F(X, Z)F(X , Z )] + o K(1) (52) where (a) uses WMp W (K) 1 0 as p and (b) uses W W (K) 1 0 as K . Combining (49), (50), (51), (52), the first conclusion of (48) follows upon sending p , δ 0 and K . The other conclusions of (48) follow using analogous arguments, and are thus omitted. Mukherjee and Sen 4.3 Proof of Theorem 20 and related results The following continuity statement will be critical for the proof of Theorem 20. The proof is deferred to the Appendix. Lemma 40 Define a map m : Pr [0, 1] R [ 1, 1] W [0, 1] R, (µ, W, x) 7 E(X,Z,U) µ[W(x, X )U ]. (i) Fix x [0, 1], µ Pr [0, 1] R [ 1, 1] , and W, W W. For any ε > 0, set S(ε) = {x : |m(µ, W, x) m(µ, W , x)| > ε}. For X U([0, 1]), we have, P[X S(ε)] 2 (ii) For any W W, if νp w ν, then sup x [0,1] m(νp, W, x) m(ν, W, x) 0 Proof of Theorem 20 We break the proof into several steps. Note that using Lemma 35, both variational problems attain their suprema. (i) We establish uniqueness of the maximizer assuming (11). If possible, let F1 and F2 be two distinct optimizers of GW,g,ψ in F. Therefore, using Theorem 17, 1 p sup u [ 1,1]p Mp(u) P|X sup F F GW,g,ψ(F) = GW,g,ψ(F1) = GW,g,ψ(F2). Let ξ1, , ξp be iid N(0, 1) random variables arising in the functional f Mp( ). Further, let Vi U([(i 1)/p, i/p]) be independent of the ξi variables. Fixing δ > 0, following the recipe in (36), we define two sequences u(1) p,δ,K, u(2) p,δ,K in [ 1, 1]p corresponding to F1, F2 respectively. Using Lemma 39, for i = 1, 2, for δ, K > 0 we have p Mp(u(i) p,δ,K) > GW,g,ψ(Fi) δ X i 1 E(δ, K), where lim sup K lim supδ 0 E(δ, K) = 0. Moreover, using Theorem 17 and the assumption that F1 and F2 are optimizers of the limiting variational problem, we have, for i = 1, 2, p Mp(u p) < GW,g,ψ(Fi) + δ X i = 1 o(1). 
Variational Inference in linear regression On the event { u(i) p,δ,K u p 2 2 > pε, i = 1, 2} { 1 p Mp(u p) < GW,g,ψ(Fi)+ δ 4 }, for i = 1, 2 we have, 1 p Mp(u(i) p,δ,K) 1 p Mp(u p) δ GW,g,ψ(Fi) + δ 4 δ < GW,g,ψ(Fi) δ Using (11), this implies, for all p sufficiently large, P[ u(i) p,δ,K u p 2 2 > pε, i = 1, 2|X] 3E(δ, K). By triangle inequality, P[ u(1) p,δ,K u(2) p,δ,K 2 2 > 2pε|X] 3E(δ, K). As u(i) p,δ,K [ 1, 1]p, we have, 1 p E[ u(1) p,δ,K u(2) p,δ,K 2 2] 2ε + 12E(δ, K). Thus, with X U([0, 1]), Z N(0, 1) independent, we have, E[(F1(X, Z) F2(X, Z))2] i=1 E[(F1(Vi, ξi) F2(Vi, ξi))2] p E[ u(1) p,δ,K u(2) p,δ,K 2 2] + X a=1,2 P[Fa(X, Z) 1 + δ] 2ε + 12E(δ, K) + X a=1,2 P[Fa(X, Z) 1 + δ]. Setting δ 0 followed by K , we obtain that E[(F1(X, Z) F2(X, Z))2] 2ε. Here we use that F1, F2 are optimizers; thus without loss of generality we can assume E[G(Fa(X, Z), ψ(X)] < for a = 1, 2, which along with Remark 36 implies that P[Fa(X, Z) = 1] = 0. Finally, note that ε > 0 is arbitrary, and thus F1 = F2 a.s.. This establishes the desired uniqueness. (ii) As L (u p) p and L (u p) p are close in weak topology, it suffices to work with L (u p) p . With up,δ,K as in (36), for any ε > 0, Lemma 39 and (11) imply lim sup K lim sup δ 0 lim sup p P[ up,δ,K u p 2 2 > pε|X] = 0. Recalling the 2-Wasserstein distance d( , ) from Lemma 37, we have, d( L (u p) p , L(up,δ,K) p ) 1 p up,δ,K u p 2 2, and so it suffices to look at L(up,δ,K) p . Let η : R3 R be bounded continuous. Then E L (up,δ,K ) p [η(X, Z, U)] = 1 i=1 η(Vi, ξi, up,δ,K(i)) = 1 i=1 E[η(Vi, ξi, up,δ,K(i))] + o P (1), Mukherjee and Sen where the last equality follows from the law of large numbers and the boundedness of η. Here, the o P (1) term converges to zero in probability under the joint distribution of {(Vi, ξi) : 1 i p}. Finally, i=1 E[η(Vi, ξi, up,δ,K(i))] E[η(X, Z, F(X, Z)1(F(X, Z) > 1 + δ))] P[F(X, Z) 1 + δ]. This gives, for any ε > 0, lim sup K lim sup δ 0 lim sup p P h E L (up,δ,K ) p [η(X, Z, U)] E[η(X, Z, F(X, Z))] > ε|X i = 0, where we again use that P[F(X, Z) = 1] = 0. This completes the proof of Part (ii). (iii) For any p 1, the function f Mp( ) is upper semicontinuous on [ 1, 1]p, and thus attains its maximum. Let {ˆup : p 1} denote any sequence of global maximizers of f Mp( ). Lemma 15 and the separation condition (11) together imply that 1 p u p ˆup 2 2 P|X 0. In turn, this implies d( L (u p) p , L(ˆup) p ) P|X 0, (53) where d( , ) denotes the 2-Wasserstein distance. Thus L(ˆup) p converges weakly in probability to ν := (X, Z, F (X, Z)). Next, differentiating f Mp( ) at u ( 1, 1)p, we obtain f Mp(u) = h 1 Apu + Apβ0 + Dpβ0 + diag( q Dp(i, i))ξ h(u, d) i , where ξ = (ξ1, , ξp)T, and h( , ) is the vector obtained upon applying h coordinatewise to the two vectors. Note that if ui +1, then h(ui, di) , and thus ui f Mp(u, d) . Consequently, ˆui < 1. A similar argument implies ˆui > 1, and thus ˆup ( 1, 1)p. This immediately implies that ˆup is a critical point of f Mp, and satisfies the fixed point equation Apˆup + Apβ0 + Dpβ0 + diag( q Dp(i, i))ξ , d . (54) Let f1 : [0, 1] R and f2 : R R be bounded continuous functions. 
Recalling the function m from Lemma 40 and setting gp = Wp Ap wβ0 + wβ0w D, we have, E L (ˆup) p [f1(X)f2(Z)U] =E L (ˆup) p h f1(X)f2(Z) c 1 m( L(ˆup) p , Wp Ap, X) + gp(X) + p w D(X)Z , wd(X) i =E L (ˆup) p h f1(X)f2(Z) c 1 m( L(ˆup) p , Wp Ap, X) + g(X) + p ψ(X)Z , wd(X) i + Ep, (55) Variational Inference in linear regression where we use c(θ, d) 1, and observe that |Ep| E L (ˆup) p [|( p ψ(X))Z|] + gp g 1 w D ψ 1 + gp g 1 = o(1). (56) Define the good event E (ε) = {|m( L(ˆup) p , Wp Ap, X) m( L(ˆup) p , W, X)| ε} {|w D(X) ψ(X)| ε}. Then, upon observing that supγ1,γ2 | 2 γ1 γ2 c(γ1, γ2)| 1 and supγ1,γ2 | 2 γ2 1 c(γ1, γ2)| 1, we have, on the event E (ε), c 1 m( L(ˆup) p , Wp Ap, X) + g(X) + E[ p ψ(X)Z] , wd(X) m( L(ˆup) p , W, X) + g(X) + E[ p ψ(X)Z] , ψ(X) Thus we have, E L (ˆup) p h f1(X)f2(Z) c 1 m( L(ˆup) p , Wp Ap, X) + g(X) + p ψ(X)Z , wd(X) i E L (ˆup) p h f1(X)f2(Z) c 1 m( L(ˆup) p , W, X) + g(X) + p ψ(X)Z , ψ(X) ε + P[E (ε)c] ε W Wp Ap + P h |wd(X) ψ(X) σ2 | > ε i = O(ε) + o(1), where the last inequality uses Lemma 40 part (i). Finally, Lemma 40 part (ii) implies that E L (ˆup) p h f1(X)f2(Z) c 1 m( L(ˆup) p , W, X) + g(X) + p ψ(X)Z , ψ(X) =E L (ˆup) p h f1(X)f2(Z) c 1 m(ν , W, X) + g(X) + p ψ(X)Z , ψ(X) i + o(1). (58) Combining (55), (56), (57), (58), we have, E L (ˆup) p [f1(X)f2(Z)U] =E L (ˆup) p h f1(X)f2(Z) c 1 m(ν , W, X) + g(X) + p ψ(X)Z , ψ(X) i + o(1) + O(ε). Sending p , and then ε 0, we obtain Eν [f1(X)f2(Z)U] =Eν h f1(X)f2(Z) c 1 m(ν , W, X) + g(X) + p ψ(X)Z , ψ(X) The desired conclusion follows upon recalling that (X, Z, F (X, Z)) ν . Mukherjee and Sen We next turn to the proof of Lemma 25. To this end, we require the following auxiliary lemma. The proof is deferred to the Appendix. Lemma 41 Suppose rp : [ 1, 1]p R { } is upper semi-continuous on [ 1, 1]p, finite and differentiable on ( 1, 1)p, and sup u ( 1,1)p λmin(Hp(x)) η, where Hp( ) is the Hessian of rp( ). Then there exists a unique global maximizer x0 [ 1, 1]p, and further for any x [ 1, 1]p we have rp(x0) rp(x) η 2 x x0 2 2. Proof of Lemma 25 Proof of (i) The equation (13) implies that the event sup u [ 1,1]p: u u p 2 2>pε 1 p{Mp(u) Mp(u p)} < ελ occurs with probability 1 o(1). Proof of (ii)(a) Differentiating Mp( ), twice, the Hessian is given by σ2 Ap p, p(i, i) = 1 c(h(ui, di), di). (59) Note that c(h(u, d), d) is the variance of a random variable supported on [ 1, 1], and thus c(h(u, d), d) 1. Under the assumption of (i), there exists δ > 0 such that for all p large enough and all x ( 1, 1)p, x THp(u)x = 1 σ2 x TApx x px δ x 2 2. (60) It then follows from Lemma 41 that there exists u [ 1, 1]p which is a unique optimizer of Mp( ), and further, for any u [ 1, 1]p we have Mp(u ) Mp(u) δ This verifies (13). Proof of (ii)(b) To begin, note that the prior is absolutely continuous with respect to the Lebesgue measure on [ 1, 1] and thus DKL(π π) = DKL(π π) = . Thus to maximize Mp( ), it suffices to restrict to ( 1, 1)p. As in (60), it suffices to bound the lower eigenvalue of the Hessian, uniformly in ( 1, 1)p, for which using (59) and the fact that λmin(XTX) is bounded away from 0 it suffices to show that p(i, i) di. By GHS inequality (Ellis et al., Variational Inference in linear regression 1976), c(h(u, di), di) c(0, di), and thus the desired inequality follows, once we establish that sup d 0 d c(0, d) 1. (61) Consider now a scale family of parametric distributions {Pθ : θ 0} with dx exp( θV (x)) exp( dx2/2). Since V ( ) is even, it follows that the first moment under Pθ is 0 for all θ. 
Also, since V ( ) is an increasing, this family has monotone likelihood ratio in T(x) = |x|, and thus the second moment under the law Pθ is decreasing in θ for θ > 0, and so c(0, d) = Varθ=1(Z) Varθ=0(Z). Proceeding to bound the RHS of the above display, for d > 0 we have d Varθ=0(Z) = R 1 1 dz2 exp( dz2/2) dz R 1 1 exp( dz2/2) dz . We substitute dz = t, so that d Varθ=0(Z) = d t2 exp( t2/2) dt R d exp( t2/2) dt . This is exactly the variance of a truncated standard Gaussian distribution, truncated to the interval [ d]. Indeed, the truncated variance is 1 2 d) 1 < 1 (Johnson and Kotz, 1971), where φ( ) and Φ( ) represent the pdf and cdf of the standard Gaussian distribution. Acknowledgments The authors thank Pragya Sur for discussions on high-dimensional regression. SM gratefully thanks NSF (DMS 1712037) for support during this research. The authors gratefully thank the Associate Editor and two anonymous referees for their helpful comments, which significantly improved the exposition. Appendix A. Some technical lemmas We collect some basic results about exponential families in the first subsection. We prove Lemma 41 in the next subsection. Mukherjee and Sen A.1 Results on exponential families We prove Lemmas 2, 4 and 30 in this section. Proof of Lemma 30 (i) This follows by direct computation (see e.g. Lehmann and Casella (2006)). (ii) This follows by direct calculation. (iii) Recall from Definition 1 that dπ (z) = exp γ1z γ2 2 z2 c(γ) . Without loss of generality, we consider the case γ1,k . The case γ1,k follows using the same argument, with obvious modifications. Observe that as γ1,k and lim supk |γ2,k| < , for k sufficiently large, the function γ1,kz γ2,k 2 z2 is increasing on [ 1, 1]. This implies, for any ε > 0, Z 1 ε 1 exp γ1,kz γ2,k 2 z2 dπ(z) exp γ1,k(1 ε) γ2,k 2 (1 ε)2 π([ 1, 1 ε]). On the other hand, exp(c(γk)) can be lower bounded by Z 1 2 exp γ1,kz γ2,k 2 z2 dπ(z) exp γ1,k(1 ε On taking ratios of the above two displays we get πγk([ 1, 1 ε]) exp ε 2γ2 2,k π([ 1, 1 ε]) Also note that the ratio of probabilities under π in the RHS is positive, as 1 is in the support of π. The desired conclusion now follows upon letting k , upon noting that γ1,k , and γ2,k stays bounded. (iv) The proof follows directly from the lower semicontinuity of KL divergence under weak convergence (Posner, 1975, Theorem 1). (v) By definition, H(x, y) = Z [ 1,1] zr exp h(x, y)z y 2z2 c(h(x, y), y) dπ(z). For x { 1, 1}, H(x, ) is independent of y, and thus H(x,y) y = 0. We assume henceforth that x ( 1, 1). Differentiating, we obtain, [ 1,1] zrh z h(x, y) 2 c(h(x, y), y) h(x, y) i dπ(h(x,y),y)(z) [ 1,1] zrh (z x) h(x, y) i dπ(h(x,y),y)(z), Variational Inference in linear regression where the last equality uses c(h(x, y), y)) = x. Therefore, Eπ(h(x,y),y)[|Z x|] + 1 where we use Eπ(h(x,y),y)[|Z|r] 1 for all r 1. Now, differentiating c(h(x, y), y) = x in y, we have, 2c γ1 γ2 (h(x, y), y) c(h(x, y), y) Covπ(h(x,y),y)(Z, Z2) Varπ(h(x,y),y)(Z) 2 E(h(x,y),y)[(Z x)(Z2 x2)] E(h(x,y),y)[(Z x)2] 2 E(h(x,y),y)[(Z x)2(Z + x)] E(h(x,y),y)[(Z x)2] . This implies supx ( 1,1),y R h(x,y) y 1, where we use the trivial bound |Z + x| 2. The desired conclusion follows by plugging this back into (62). Proof of Lemma 2 (a) Lemma 30 Part (a) implies that c(γ1, γ2) = Varπγ(Z) > 0 for all γ1, γ2. Thus c(γ1, γ2) is strictly increasing in γ1. For every γ2 R, as γ1 , πγ w π using Lemma 30 Part (c). The desired conclusion follows on noting that c(γ1, γ2) = Eπγ[Z] γ1 1 by the Dominated Convergence Theorem. 
(b) Using Part (a), we know that c(γ1, γ2) is strictly increasing in γ1, and continuous. Further, if γ1 , c(γ1, γ2) 1. Thus for any t ( 1, 1), the existence of h(t, γ2) follows by the Intermediate Value Theorem. Finally, we argue that for any fixed γ2 R, as t 1, h(t, γ2) . The case t 1 is similar, and thus omitted. We complete the proof by contradiction. Suppose, if possible, that M := lim t 1 h(t, γ2) < . Mukherjee and Sen In this case, as 1 is in the support of π, π([ 1, 0]) > 0. This implies that t = c(h(t, γ2), γ2) = Z [ 1,1] z exp h(t, γ2)z γ2 2 z2 c(h(t, γ2), γ2) dπ(z) [ 1,0] z exp h(t, γ2)z γ2 2 z2 c(h(t, γ2), γ2) dπ(z)+ Z (0,1] z exp h(t, γ2)z γ2 2 z2 c(h(t, γ2), γ2) dπ(z) π(h(t,γ2),γ2)((0, 1]). To complete the argument, it suffices to show that lim inft 1 π(h(t,γ2),γ2)([ 1, 0]) > 0. But this follows on noting that the function t c(h(t, γ2), γ2) is non-decreasing on [0, 1], and thus π(h(t,γ2),γ2)([ 1, 0]) exp( M γ2 2 c(M, γ2))π([ 1, 0]). Proof of Lemma 4 The lemma follows by direct computation. First, note that u = h(u, d) + u h u(u, d) c(h(u, d), d) h u(u, d) = h(u, d), where the last equality follows upon noting that c(h(u, d), d) = u. For the second derivative, note that d =u h(u, d) d c(h(u, d), d) h(u, d) d γ2 c(γ1, γ2) (h(u,d),d) + γ2 c(γ1, γ2) (0,d) = γ2 c(γ1, γ2) (h(u,d),d) + γ2 c(γ1, γ2) (0,d) [ 1,1] z2dπ(h(u,d),d)(z)dz 1 [ 1,1] z2dπ(0,d)(z)dz. Finally, note that 2u = h(u, d) u = 1 c(h(u, d), d) > 0 where the last equality follows from differentiating the equation c(h(u, d), d) = u. A.2 Proof of Lemma 41 Proof Since rp( ) is upper semi-continuous there exists a global maximizer in [ 1, 1]p, say ex0. Fixing any x in ( 1, 1)p, consider the function f(t) := rp((1 t)ex0 + tx). Then f is twice differentiable on (0, 1), and f (t) = (ex0 x)THp((1 t)ex0 + tx)(ex0 x) η ex0 x 2 2. Variational Inference in linear regression Consequently, for any ε (0, 1) Taylor s expansion implies f(1) f(ε) + (1 ε)f (ε) 1 2η(1 ε)2 ex0 x 2 2. Taking limits as ε 0 we get f(1) f(0) + f (0+) 1 2η ex0 x 2 2 f(0) 1 2η ex0 x 2 2, where the last inequality uses the fact that f (0+) 0, as t = 0 is a global maximizer of f. The last display is equivalent to rp(x) rp(ex0) 1 2η ex0 x 2 2. This inequality then extends to x [ 1, 1]p by upper semi-continuity of rp. Appendix B. Stability Estimates for functionals We prove Lemma 34 and 40 in this section. B.1 Proof of Lemma 34 We start with the proof of Part (i). Define the functional TW,φ : F2,4 R TW,φ,ψ(ν) := 1 2E[W(X1, X2)U1U2] + E[U1φ(X1)] + E[ p ψ(X1)U1Z1]. where (X1, Z1, U1), (X2, Z2, U2) ν are iid. Similarly define Iψ : F2,4 R { } Iψ(ν) = E h G U, ψ(X) We observe that GW,φ,ψ(ν) = 1 σ2 TW,φ,ψ(ν) Iψ(ν). Further, sup ν F2,4 |TW,φ,ψ(ν) T W, φ,ψ(ν)| 1 2 W W + φ φ 1, (63) sup ν F2,4 |TW,φ,ψ(ν) TW,φ, ψ(ν)| 2 ψ ψ 1, (64) sup ν F2,4 |Iψ(ν) I ψ(ν)| 1 2σ2 ψ ψ 1. (65) Indeed, (63) is immediate from the definition of TW,φ,ψ( ), and (65) follows from Lemma 4 on noting that supu ( 1,1),d R | G d (u, d)| 1 2. Finally, (64) follows on noting that by Cauchy-Schwarz inequality, Eν h Z2 i Eν h p ψ(X) 2i 2 ψ ψ 1, Mukherjee and Sen where the last inequality uses ( a b)2 |a b| and Eν[Z2] 4. This completes the proof of Part (i). Next, we turn to the proof of Part (ii). Using the definition of cut norm (as in Definition 12) and Part (i), we have, sup ν F2,4 | GWk,Wk.φk+φkψk,ψk(ν) GW,W.φk+φkψ,ψ(ν)| Wk W + ψk ψ 1 0. Using Part (i) again, it suffices to show that W.φk + φkψ L1 W.φ + φψ. 
To this end, note that |W.φk + φk(x)ψ(x) W.φ φ(x)ψ(x)| converges to 0 in measure, and |W.φk(x) + φk(x)ψ(x) W.φ(x) φ(x)ψ(x)| 2 Z [0,1] |W(x, y)|dy + 2ψ, which is an integrable function. This completes the argument using DCT. B.2 Proof of Lemma 40 Proof of Lemma 40 (i) For any ε > 0, let S+(ε) = {x : m(µ, W, x) m(µ, W , x) > ε}. For X U([0, 1]), we have, P(X S+(ε)) 1 εE h (m(µ, W, X) m(µ, W , X))1S+(ε)(X) i εEX,X h (W(X, X ) W (X, X ))1S+(ε)(X)E[U |X ] i Setting S (ε) = {x : m(µ, W, x) m(µ, W , x) < ε}, the same argument now yields that P[X S (ε)] 1 (ii) Let (Xp, Zp, Up) νp and (X, Z, U) ν. Using Skorokhod Embedding Theorem, we assume that (Xp, Zp, Up) (X, Z, U) a.s. as p . Since R |W(x, y)|dxdy < , for any ε > 0 there exists W continuous such that W W 1 ε. Then we have, E[|m(µ, W, X) m(µ, W , X)|] W W 1 ε. |m(νp, W , x) m(ν, W , x)| = E[W (x, Xp)Up] E[W (x, X)U] E[W (x, Xp)Up] E[W (x, Xp)U] + E[W (x, Xp)U] E[W (x, X)U] W E[|Up U|] + sup x,y,z [0,1],|y z| |X Xp| W (x, y) W (x, z) = o(1) Variational Inference in linear regression using the uniform continuity of W . The desired conclusion follows upon combining the two displays above. Appendix C. Proofs of Examples We establish Corollaries 9-11 and 26-29 in this section. Throughout this section o P (1) terms converge to zero in probability under the marginal distribution of the design matrix X. C.1 Accuracy of mean-field approximation Proof of Corollary 9 This is immediate from Theorem 7. Proof of Corollary 10 With Ap and Dp denoting the off-diagonal and diagonal parts of the matrix XTX, to invoke Theorem 7 we need to verify that Ap satisfies (5) and (6), and the empirical measure 1 p Pp i=1 δDp(i,i) is uniformly integrable. We verify these conditions below: Since {xi}1 i n i.i.d. N(0, Γp), and i=1 xix T i , invoking (Vershynin, 2012, Proposition 2.1) gives XTX Γp 2 = OP r p = o P (1), (66) where the last equality uses p = o(n). Noting that Dp Γp,diag 2 XTX Γp 2 Ap Γp,off 2 = o P (1). (67) Consequently, Ap satisfies (5) with high probability, as tr(Γ2 p,off) = o(p). Further, since Γp 2 = O(1), it follows from (66) gives that Ap 2 = OP (1). Finally, noting that max 1 i p |Dp(i, i)| XTX 2 = Γp 2 + o P (1) p Pp i=1 δDp(i,i) is uniformly integrable with high probability. This completes the proof of the corollary. Mukherjee and Sen Proof of Corollary 11 It suffices to verify the same conditions on the matrices (Ap, Dp) as in Corollary 10. To this end, note that for any i = j we have Ap(i, j) = p k=1 B(k, i)B(k, j), (68) i =j Ap(i, j)2 = p2 k,ℓ=1 E[B(k, i)B(k, j)B(ℓ, i)B(ℓ, j)] k =ℓ E[B(k, i)B(k, j)B(ℓ, i)B(ℓ, j)] + p2 k=1 E[B(k, i)B(k, j)] p2 = λ4 + λ2p2 which is o(p) as p = o(n). This verifies (5). Also, (68) gives i,j=1 |Ap(i, j)| = 1 i =j E[B(k, i)B(k, j)] λ2, and so (6) holds with high probability. Finally we have i=1 |Dp(i, i)|2 = 1 k=1 B(k, i) 2 p2 i = λ2 + o(1), p Pp i=1 Dp(i, i) is uniformly integrable. C.2 Limiting variational formula Proof of Corollary 26 (a) The desired conclusion follows from Theorem 17, once we can verify d L1(w D, 1) = o P (1), d (Wp Ap, W) = o P (1). Here D = (Dp(1, 1), , Dp(p, p)) denotes the diagonal entries of XTX, and Ap denotes the offdiagonal part of XTX. Proceeding to verify the above display, invoking (66) we have Dp(i, i) = 1 + 1 p G(i/p)2 + o P (1), and so w D L1 1. Also, since Γp(i, j) = 1 p G(i/p)G(j/p) for all i = j, it follows that d (WpΓp, W) 0, where we use the almost sure continuity of G(., .). 
Since (67) gives d (Wp Ap, WpΓp) = o P (1), combining we get d (Wp Ap, W) d (Wp Ap, WpΓp) + d (WpΓp, W) = o P (1), This completes the proof of part (a). Variational Inference in linear regression (b) We begin by verifying condition (11). To this effect, set h u TΓpu 2z Tu i i=1 G(ui, di), and use the fact that maxi [p] P j =i Γp(i, j) λ < σ2 along with part (a) of Lemma 25 to get the existence of u p such that Mp(u p) Mp(u) λ u p u 2 2. This in turn shows that for any ε > 0 we have lim sup p sup u: u u p 2 2>pδ Mp(u p) Mp(u) o < 0. (69) Using (66) gives 1 p sup u [ 1,1]p 1 p Mp(u) f Mp(u) i = o P (1). (70) Given (69), and (70), it follows that lim sup p sup u: u u p 2 2>pδ n Mp(u p) Mp(u) o < 0. Thus we have verified (11). The desired conclusion then follows from Corollary 22. Part (b)(ii) follows from part (b)(ii) of Lemma 25 and Corollary 22. Proof of Corollary 28 (a) The desired conclusion follows from Theorem 17, once we can verify d L1(w D, ψ) = o P (1), d (Wp Ap, W) = o P (1). Proceeding to verify the above display, note that Dp(i, i) := (XTX)ii = p nσ2 Thus, setting e Di := 1 n Pn k=1 G(k/n, i/p), Using Chernoffbounds it follows that P max 1 i p |Dp(i, i) e Di| > δ pe cδ2 n p pe cδ2 p 0, and so it follows that d L1(w D, ψ) d L1(w D, w e D,p) + d L1(w e D,p, ψ) = o P (1) Mukherjee and Sen where the last equality uses the almost sure continuity of G(., .). It thus suffices to show that d (Wp Ap, W) 0. To this end, note that for any i = j we have Ap(i, j) = p k=1 B(k, i)B(k, j). Using Lemma 42, setting e Ap(i, j) := E[Ap(i, j)] = 1 np Pn k=1 G( k P max S,T [p] P i S,j T p Ap(i, j) P i S,j T p e Ap(i, j) From this, using the condition p = o( n) gives d (Wp Ap, Wp e Ap) = o P (1), which in turn gives d (Wp Ap, W) d (Wp Ap, Wp e Ap) + d (Wp e Ap, W) = o P (1), where the last equality again uses the almost sure continuity of G( , ). (b) (i) Once again, the desired conclusion of part (b) follows from Corollary 22, once we verify condition (11). To this effect, fixing δ > 0 and setting ϑi := Pp j =i Ap(i, j), define a p p matrix Bp,δ by setting Bp,δ(i, j) := Ap(i, j)1{ϑi σ2(1 δ), ϑj σ2(1 δ)}, and note that sup u [ 1,1]p u Apup u Bp,δu 2 i=1 ϑi1{ϑi > σ2(1 δ)}. We now claim that there exists δ > 0 such that lim sup p 1 p i=1 ϑi1{ϑi > σ2(1 δ)} = 0. (71) Given (71), it suffices to show that (11) holds for Mp : [ 1, 1]p 7 R defined by h u TBp,δu 2z Tu i i=1 G(ui, di). Variational Inference in linear regression But this is immediate from Lemma 25 part (a), on noting that j =i Bp,δ(i, j) σ2(1 δ). It thus remains to verify (71). To this effect, recall from part (a) that Wp Ap converges to W in the cut metric, which in turn implies 1 p Pp i=1 δϑi converges weakly in probability to the law of S(X), where X U[0, 1] (Borgs et al., 2015, Theorem 2.16). This gives lim sup p 1 p i=1 ϑi1{ϑi > σ2(1 δ)} ES(X)1{S(X) σ2(1 δ)}. It thus suffices to show that there exists δ > 0 such that the RHS above is 0. But this follows on noting that esssup(S(X)) < σ2. This completes the proof of part (i). Part (b)(ii) follows from part (b)(ii) of Lemma 25 and Corollary 22, as before. Proof of Corollary 29 (a) A direct calculation gives XTX = 1 2I + Ap, where Ap(i, j) =1 =0 otherwise . It then follows that d (Wp Ap, W) W, and d L1(w D, ψ) 0. The desired conclusion then follows from Theorem 17 as before. (b) Since lim sup p maxi [p] P j =i |Ap(i, j)| = 1 2, the result is immediate from Corollary 22 and Lemma 25. Appendix D. 
Relevant concentration inequalities Lemma 42 Let Bik Ber(G(i/n, j/p)/p) are independent random variables with G(x, y) λ. For any δ > 0, there exists C > 0 (depending on δ) such that P max S,T [p] k S,l T (Ap(k, l) E[Ap(k, l)]) > pδ 2p exp C n . To prove Lemma 42, we first obsere that if X is sub-exponential, then X2 is sub-Weibull Kuchibhotla and Chakrabortty (2018). Mukherjee and Sen Lemma 43 Let Y1, , Yn be independent sub-exponential random variables with common sub-exponential parameter c > 0. For any δ > 0, there exists C := C(δ) > 0 such that i=1 (Y 2 i E[Y 2 i ]) > nδ i exp C n . Proof We first claim that Y 2 i are sub-Weibull, i.e., there exists a constant C > 0 (depending only on c) such that P[|Y 2 i E[Y 2 i ]| > t] 2 exp( C We establish the upper tail deviation bound. A similar argument works for the lower tail, and is thus omitted. As the variables Yi have a common sub-exponential constant, maxi n E[Y 2 i ] is uniformly bounded in n. For any t > 0 with 2 1)2 maxi n E2[Yi], P[Y 2 i > E[Y 2 i ] + t] = P[Yi E[Yi] > q t + E[Y 2 i ] E[Yi]] P[Yi E[Yi] > p t + E2[Yi] E[Yi]]. We now claim that t + E2[Yi] E[Yi] > Using (73), along with the deviation bound above, we have, P[Y 2 i > E[Y 2 i ] + t] P h Yi E[Yi] > for some constant C > 0. The last inequality uses that Yi is sub-exponential. (73) can be verified by direct computation. Using (72) and (Kuchibhotla and Chakrabortty, 2018, Proposition A.3), maxi n Y 2 i ψ1/2 is uniformly bounded in n. Consequently, using (Kuchibhotla and Chakrabortty, 2018, Theorem 3.1), for any t > 0, we have, i=1 (Y 2 i E[Y 2 i ]) > C1 b 2( t + Lnt2) i exp( t), where C1 is a constant free of n, and Ln = C2 b / b 2 for some constant C2 > 0 independent of n. Here, b = ( Y 2 1 ψ1/2, , Y 2 n ψ1/2). We set C1 b 2( t + Lnt2) = nδ for some ε > 0. Direct calculation yields that t(δ) = q nδ C1C2 b (1 + o(1)), so that i=1 (Y 2 i E[Y 2 i ]) > nδ i exp C3 n , where C3 > 0 depends on δ. This concludes the proof. Variational Inference in linear regression Proof of Lemma 42 Recall that Bik Ber(G(i/n, j/p)/p) are independent random variables. For S [p], define Yi(S) = P k S Bik. By Chernoffbound, the collection {Yi(S) : 1 i n, S p} are sub-exponential with a common sub-exponential parameter (depending only on λ). By Chernoffbounds, for any fixed S [p], Yi = P k S Bik is a sub-exponential random variable. Using Lemma 43, k,l S (Ap(k, l) E[Ap(k, l)]) > pδ = P i=1 (Yi(S)2 E[Yi(S)2]) > nδ exp C n . A union bound then concludes P max S [p] k,l S (Ap(k, l) E[Ap(k, l)]) > pδ 2p exp C n . (74) We set εkl := Ap(k, l) E[Ap(k, l)] and claim that max S,T [p] | X k S,l T εkl| 5 2 max S [p] | X k,l S εkl| (75) The required conclusion follows from (75) and the deviation bound (74). It thus remains to prove (75). To this effect, note that k S,l T εkl = X k S\T,l T\S εkl + X k S\T,l S T εkl + X k S T,l T\S εkl + X k S T,l S T εkl. k,l S T εkl = X k,l S\T εkl + X k,l T\S εkl + X k,l S T εkl k S\T,l T\S εkl + X k S\T,l S T εkl + X k S T,l T\S εkl . k S,l T εkl = 1 k,l S T εkl X k,l S\T εkl + X k,l T\S εkl + X k,l S T εkl i + X k S T,l S T εkl which on using triangle inequality gives (75). Pierre Alquier and James Ridgway. Concentration of tempered posteriors and of their variational approximations. Annals of Statistics, 48(3):1475 1497, 2020. Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of variational approximations of gibbs posteriors. The Journal of Machine Learning Research, 17(1): 8374 8414, 2016. Mukherjee and Sen Fanny Augeri. 
A transportation approach to the mean-field approximation. ar Xiv preprint ar Xiv:1903.08021, 2019. Tim Austin. The structure of low-complexity gibbs measures on product spaces. The Annals of Probability, 47(6):4002 4023, 2019. Sayantan Banerjee, Isma el Castillo, and Subhashis Ghosal. Bayesian inference in highdimensional models. ar Xiv preprint ar Xiv:2101.04491, 2021. Peter L Bartlett, Philip M Long, G abor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063 30070, 2020. Anirban Basak and Sumit Mukherjee. Universality of the mean-field for the potts model. Probability Theory and Related Fields, 168(3):557 600, 2017. Anirban Bhattacharya, Antik Chakraborty, and Bani K Mallick. Fast sampling with gaussian scale mixture priors in high-dimensional regression. Biometrika, page asw042, 2016. Peter Bickel, David Choi, Xiangyu Chang, and Hai Zhang. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics, 41(4):1922 1943, 2013. Christopher M Bishop. Pattern recognition and machine learning. springer, 2006. David M Blei, Alp Kucukelbir, and Jon D Mc Auliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859 877, 2017. Christian Borgs, Jennifer T Chayes, L aszl o Lov asz, Vera T S os, and Katalin Vesztergombi. Convergent sequences of dense graphs i: Subgraph frequencies, metric properties and testing. Advances in Mathematics, 219(6):1801 1851, 2008. Christian Borgs, Jennifer T Chayes, L aszl o Lov asz, Vera T S os, and Katalin Vesztergombi. Convergent sequences of dense graphs ii. multiway cuts and statistical physics. Annals of Mathematics, pages 151 219, 2012. Christian Borgs, Jennifer T Chayes, Henry Cohn, and Shirshendu Ganguly. Consistent nonparametric estimation for heavy-tailed sparse graphs. ar Xiv preprint ar Xiv:1508.06675, 2015. Christian Borgs, Jennifer T Chayes, Henry Cohn, and Yufei Zhao. An lp theory of sparse graph convergence ii: Ld convergence, quotients and right convergence. The Annals of Probability, 46(1):337 396, 2018. Christian Borgs, Jennifer Chayes, Henry Cohn, and Yufei Zhao. An lp theory of sparse graph convergence i: Limits, sparse random graph models, and power law distributions. Transactions of the American Mathematical Society, 372(5):3019 3062, 2019. Peter Carbonetto and Matthew Stephens. Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian analysis, 7(1):73 108, 2012. Variational Inference in linear regression Carlos M Carvalho, Nicholas G Polson, and James G Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465 480, 2010. Isma el Castillo, Johannes Schmidt-Hieber, and Aad Van der Vaart. Bayesian linear regression with sparse priors. The Annals of Statistics, 43(5):1986 2018, 2015. Sourav Chatterjee and Amir Dembo. Nonlinear large deviations. Advances in Mathematics, 299:396 450, 2016. Badr-Eddine Ch erief-Abdellatif and Pierre Alquier. Consistency of variational bayes inference for estimation and model selection in mixtures. Electronic Journal of Statistics, 12 (2):2995 3035, 2018. Nicholas A Cook, Amir Dembo, and Huy Tuan Pham. Regularity method and large deviation principles for the erdos-renyi hypergraph. ar Xiv preprint ar Xiv:2102.09100, 2021. Ronen Eldan. 
Gaussian-width gradient complexity, reverse log-sobolev inequalities and nonlinear large deviations. Geometric and Functional Analysis, 28(6):1548 1596, 2018. Richard S Ellis, James L Monroe, and Charles M Newman. The ghs and other correlation inequalities for a class of even ferromagnets. Communications in Mathematical Physics, 46(2):167 182, 1976. Zhou Fan, Song Mei, and Andrea Montanari. Tap free energy, spin glasses, and variational inference. ar Xiv preprint ar Xiv:1808.07890, 2018. Sebastian Farquhar, Lewis Smith, and Yarin Gal. Liberty or depth: Deep bayesian neural nets do not need complex weight posterior approximations. Advances in Neural Information Processing Systems, 33:4346 4357, 2020. Augusto Fasano, Daniele Durante, and Giacomo Zanella. Scalable and accurate variational bayes for high-dimensional binary regression models. ar Xiv preprint ar Xiv:1911.06743, 2019. Andrew Foong, David Burt, Yingzhen Li, and Richard Turner. On the expressiveness of approximate inference in bayesian neural networks. Advances in Neural Information Processing Systems, 33:15897 15908, 2020. Alan Frieze and Ravi Kannan. Quick approximation to matrices and applications. Combinatorica, 19(2):175 220, 1999. Robert Gallager. Low-density parity-check codes. IRE Transactions on information theory, 8(1):21 28, 1962. Behrooz Ghorbani, Hamid Javadi, and Andrea Montanari. An instability in variational inference for topic models. In International conference on machine learning, pages 2221 2231. PMLR, 2019. Subhashis Ghosal. Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli, 5(2):315 331, 1999. Mukherjee and Sen Peter Hall, Tung Pham, Matt P Wand, and Shen SJ Wang. Asymptotic normality and valid inference for gaussian variational approximation. The Annals of Statistics, 39(5): 2502 2532, 2011. Wei Han and Yun Yang. Statistical inference in mean-field variational bayes. ar Xiv preprint ar Xiv:1911.01525, 2019. Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2): 949 986, 2022. Hemant Ishwaran and J Sunil Rao. Spike and slab variable selection: frequentist and bayesian strategies. Annals of statistics, 33(2):730 773, 2005. Vishesh Jain, Frederic Koehler, and Elchanan Mossel. The mean-field approximation: Information inequalities, algorithms, and complexity. ar Xiv preprint ar Xiv:1802.06126, 2018. Vishesh Jain, Frederic Koehler, and Andrej Risteski. Mean-field approximation, convex hierarchies, and the optimality of correlation rounding: a unified perspective. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1226 1236, 2019. Norman L Johnson and Samuel Kotz. Continuous univariate distributions: Distributions in statistics. John Wiley and Sons, 1971. Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486): 682 693, 2009. Arun Kumar Kuchibhotla and Abhishek Chakrabortty. Moving beyond sub-gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. ar Xiv preprint ar Xiv:1804.02605, 2018. Se Yoon Lee, Debdeep Pati, and Bani K Mallick. Continuous shrinkage prior revisited: a collapsing behavior and remedy. ar Xiv preprint ar Xiv:2007.02192, 2020. Erich L Lehmann and George Casella. Theory of point estimation. Springer Science & Business Media, 2006. 
L aszl o Lov asz. Large networks and graph limits, volume 60. American Mathematical Soc., 2012. L aszl o Lov asz and Bal azs Szegedy. Szemer edi s lemma for the analyst. GAFA Geometric And Functional Analysis, 17(1):252 270, 2007. David JC Mac Kay. Good error-correcting codes based on very sparse matrices. IEEE transactions on Information Theory, 45(2):399 431, 1999. Variational Inference in linear regression Marc M ezard, Giorgio Parisi, and Miguel Angel Virasoro. Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing Company, 1987. Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear regression. Journal of the american statistical association, 83(404):1023 1032, 1988. Sarah E Neville, John T Ormerod, and MP Wand. Mean field variational bayes for continuous sparse signal shrinkage: pitfalls and remedies. Electronic Journal of Statistics, 8(1): 1113 1151, 2014. John T Ormerod, Chong You, and Samuel M uller. A variational bayes approach to variable selection. Electronic Journal of Statistics, 11(2):3549 3594, 2017. Edward Posner. Random coding strategies for minimum entropy. IEEE Transactions on Information Theory, 21(4):388 391, 1975. Kolyan Ray and Botond Szab o. Variational bayes for high-dimensional linear regression with sparse priors. Journal of the American Statistical Association, pages 1 12, 2021. Kolyan Ray, Botond Szabo, and Gabriel Clara. Spike and slab variational bayes for high dimensional logistic regression. ar Xiv preprint ar Xiv:2010.11665, 2020. Qifan Song and Faming Liang. Nearly optimal bayesian shrinkage for high dimensional regression. ar Xiv preprint ar Xiv:1712.08964, 2017. Huayang Tang, Xin Jin, Yang Li, Hui Jiang, Xianfa Tang, Xu Yang, Hui Cheng, Ying Qiu, Gang Chen, Junpu Mei, et al. A large-scale screen for coding variants predisposing to psoriasis. Nature genetics, 46(1):45 50, 2014. Alexander Tsigler and Peter L Bartlett. Benign overfitting in ridge regression. ar Xiv preprint ar Xiv:2009.14286, 2020. Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000. Roman Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655 686, 2012. Martin J Wainwright and Michael Irwin Jordan. Graphical models, exponential families, and variational inference. Now Publishers Inc, 2008. Bo Wang and D Michael Titterington. Convergence properties of a general algorithm for calculating variational bayesian estimates for a normal mixture model. Bayesian Analysis, 1(3):625 650, 2006. Yixin Wang and David Blei. Variational bayes under model misspecification. In Advances in Neural Information Processing Systems, pages 13357 13367, 2019a. Yixin Wang and David M Blei. Frequentist consistency of variational bayes. Journal of the American Statistical Association, 114(527):1147 1161, 2019b. Mukherjee and Sen T Westling and TH Mc Cormick. Beyond prediction: A framework for inference with variational approximations in mixture models. Journal of Computational and Graphical Statistics, 28(4):778 789, 2019. Michael C Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1):82 93, 2011. Jun Yan. Nonlinear large deviations: Beyond the hypercube. Annals of Applied Probability, 30(2):812 846, 2020. Yun Yang, Debdeep Pati, and Anirban Bhattacharya. 
α-variational inference with statistical guarantees. Annals of Statistics, 48(2):886 905, 2020. Anderson Y Zhang and Harrison H Zhou. Theoretical and computational guarantees of mean field variational inference for community detection. Annals of Statistics, 48(5): 2575 2598, 2020.