# Minimax Bounds for Generalized Linear Models

Kuan-Yun Lee and Thomas A. Courtade
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
{timkylee,courtade}@berkeley.edu

*34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.*

We establish a new class of minimax prediction error bounds for generalized linear models. Our bounds significantly improve previous results when the design matrix is poorly structured, including natural cases where the matrix is wide or does not have full column rank. Apart from the typical $L_2$ risks, we study a class of entropic risks which recovers the usual $L_2$ prediction and estimation risks, and demonstrate that a tight analysis of Fisher information can uncover underlying structural dependency in terms of the spectrum of the design matrix. The minimax approach we take differs from the traditional metric entropy approach, and can be applied to many other settings.

## 1 Introduction

Throughout, we consider a parametric framework where observations $X \in \mathbb{R}^n$ are generated according to $X \sim P_\theta$, where $P_\theta$ denotes a probability measure on a measurable space $(\mathcal{X} \subseteq \mathbb{R}^n, \mathcal{F})$ indexed by an underlying parameter $\theta \in \Theta \subseteq \mathbb{R}^d$. For each $P_\theta$, we associate a density $f(\cdot\,;\theta)$ with respect to an underlying measure $\lambda$ on $(\mathcal{X}, \mathcal{F})$ according to $dP_\theta(x) = f(x;\theta)\,d\lambda(x)$. This setup contains a vast array of fundamental applications in machine learning, engineering, neuroscience, finance, statistics and information theory [1-10]. As examples, mean estimation [1], covariance and precision matrix estimation [2], phase retrieval [3, 4], group or membership testing [5], and pairwise ranking [10] can all be modeled in terms of parametric statistics. The central question to address in all of these problems is essentially the same: how accurately can we infer the parameter $\theta$ given the observation $X$?

One of the most popular parametric families is the exponential family, which captures a rich variety of parametric models such as binomial, Gaussian, Poisson, etc. Given a parameter $\eta \in \mathbb{R}$, a density $f(\cdot\,;\eta)$ is said to belong to the exponential family if it can be written as
$$f(x;\eta) = g(x)\exp\!\left(\frac{\eta x - \Phi(\eta)}{s(\sigma)}\right). \tag{1}$$
Here, the parameter $\eta$ is the natural parameter, $g : \mathcal{X} \subseteq \mathbb{R} \to [0,\infty)$ is the base measure, $\Phi : \mathbb{R} \to \mathbb{R}$ is the cumulant function, and $s(\sigma) > 0$ is a variance parameter. The density $f(\cdot\,;\eta)$ is understood to be on a probability space $(\mathcal{X} \subseteq \mathbb{R}, \mathcal{F})$ with respect to a dominating $\sigma$-finite measure $\lambda$.

In this work, we are interested in the following generalized linear model (GLM), where the observation $X \in \mathbb{R}^n$ is generated according to an exponential family with natural parameter equal to a linear transformation of the underlying parameter $\theta$. In other words,
$$f(x;\theta) = \prod_{i=1}^{n} g(x_i)\exp\!\left(\frac{x_i\langle m_i,\theta\rangle - \Phi(\langle m_i,\theta\rangle)}{s(\sigma)}\right) \tag{2}$$
for a real parameter $\theta := (\theta_1, \theta_2, \ldots, \theta_d) \in \mathbb{R}^d$ and a fixed design matrix $M \in \mathbb{R}^{n\times d}$, with rows given by the vectors $\{m_i\}_{i=1}^{n} \subseteq \mathbb{R}^d$. The above model assumes each $X_i$ is drawn from its own exponential family, with respective natural parameters $\langle m_i,\theta\rangle$, $i = 1, 2, \ldots, n$. Evidently, this captures the classical (Gaussian) linear model $X = M\theta + Z$, where $f(\cdot\,;\theta)$ is taken to be the usual Gaussian density, and also captures a much broader class of problems including phase retrieval, matrix recovery and logistic regression. See [11-13] for history and theory of the generalized linear model.
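As a concrete illustration (a minimal sketch of our own, not from the paper), the following Python snippet simulates observations from a Bernoulli (logistic) instance of the GLM (2): each $X_i$ is Bernoulli with natural parameter $\langle m_i,\theta\rangle$, i.e., $\Phi(\eta) = \log(1+e^\eta)$ and $s(\sigma) = 1$. Since $\Phi''(\eta) \le 1/4$ for this choice, it also satisfies the bounded-curvature assumption used later (with $L = 1/4$).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_logistic_glm(M, theta):
    """Draw X from the GLM (2) with Bernoulli components: natural parameter
    eta_i = <m_i, theta>, cumulant Phi(eta) = log(1 + exp(eta))."""
    eta = M @ theta                      # natural parameters, one per row of M
    p = 1.0 / (1.0 + np.exp(-eta))       # mean Phi'(eta) = sigmoid(eta)
    return rng.binomial(1, p)

# Example: a wide design (d > n), where classical full-column-rank assumptions fail.
n, d = 50, 200
M = rng.standard_normal((n, d))
theta = rng.standard_normal(d)
X = sample_logistic_glm(M, theta)
print(X[:10], X.mean())
```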
In order to evaluate the performance of an estimator $\hat\theta$ (i.e., a measurable function of $X$), it is common to define a loss function $L(\cdot,\cdot) : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and analyze the loss $L(\theta, \hat\theta)$. A typical figure of merit is the constrained minimax risk $R(M, \Theta)$, defined as
$$R(M, \Theta) := \inf_{\hat\theta}\sup_{\theta\in\Theta} L(\theta, \hat\theta).$$
In words, the minimax risk characterizes the worst-case risk under the specified loss $L(\cdot,\cdot)$ achieved by the best estimator, with a constraint that $\theta$ belongs to a specified parameter space $\Theta$. Two choices of the loss function $L(\cdot,\cdot)$ give rise to the usual variants of $L_2$ loss:

1. Estimation loss, where the loss function is defined as
$$L_1(\theta, \hat\theta) = \mathbb{E}\,\|\theta - \hat\theta\|^2 \quad \text{for all } \theta, \hat\theta \in \mathbb{R}^d. \tag{3}$$
2. Prediction loss, where the loss function is defined as
$$L_2(\theta, \hat\theta) = \frac{1}{n}\,\mathbb{E}\,\|M\theta - M\hat\theta\|^2 \quad \text{for all } \theta, \hat\theta \in \mathbb{R}^d. \tag{4}$$

In this work, we shall approach things from an information theoretic viewpoint. In particular, we will bound minimax risk under entropic loss (closely connected to logarithmic loss in the statistical learning and information literature, see, e.g., [14-16]), from which $L_2$ estimates will follow. To start, let us review some of the key definitions in information theory. Suppose the parameter $\theta \in \mathbb{R}^d$ follows a prior $\pi$, a probability measure on $\mathbb{R}^d$ having density $\psi$ with respect to Lebesgue measure. The differential entropy $h(\theta)$ corresponding to the random variable $\theta$ is defined as
$$h(\theta) := -\int_{\mathbb{R}^d} \psi(u)\log\psi(u)\,du.$$
Here and throughout, we will take logarithms with respect to the natural base, and assume all entropies exist (i.e., their defining integrals exist in the Lebesgue sense). The mutual information $I(\theta; X)$ between parameter $\theta \sim \pi$ and observation $X \sim P_\theta$ is defined as
$$I(\theta; X) := \int_{\mathbb{R}^d}\int_{\mathcal{X}} f(x;\theta)\,\log\frac{f(x;\theta)}{\int_{\mathbb{R}^d} f(x;\theta')\,d\pi(\theta')}\,d\lambda(x)\,d\pi(\theta).$$
The conditional entropy is defined as $h(\theta\,|\,X) := h(\theta) - I(\theta; X)$. The entropy power of a random variable $U$ is defined as $\exp(2h(U))$, and for any two random variables $U$ and $V$ with well-defined conditional entropy, the conditional entropy power is defined similarly as $\exp(2h(U\,|\,V))$.

Lower bounds on conditional entropy power can be translated into lower bounds on other losses, via tools in rate distortion theory [17]. To illustrate this, let us consider the following two Bayes risks, with suprema taken over all priors $\pi$ on the parameter space $\Theta \subseteq \mathbb{R}^d$, and infima taken over all valid estimators $\hat\theta$ (i.e., measurable functions of $X$).

1. Entropic estimation loss, where the Bayes risk is defined as
$$R_e(M, \Theta) := \inf_{\hat\theta}\sup_{\pi}\sum_{i=1}^{d}\exp\big(2h(\theta_i\,|\,\hat\theta_i)\big). \tag{5}$$
2. Entropic prediction loss, where the Bayes risk is defined as
$$R_p(M, \Theta) := \inf_{\hat\theta}\sup_{\pi}\frac{1}{n}\sum_{i=1}^{n}\exp\big(2h(m_i^\top\theta\,|\,m_i^\top\hat\theta)\big). \tag{6}$$

The following simple observation shows that any lower bound derived for the entropic Bayes risks implies a lower bound on the minimax $L_2$ risks.

**Lemma 1.** We have $\inf_{\hat\theta}\sup_{\theta\in\Theta} L_1(\theta, \hat\theta) \gtrsim R_e(M, \Theta)$ and $\inf_{\hat\theta}\sup_{\theta\in\Theta} L_2(\theta, \hat\theta) \gtrsim R_p(M, \Theta)$.

*Proof.* This follows since Gaussians maximize entropy subject to second moment constraints and conditioning reduces entropy:
$$\mathbb{E}(\theta_i - \hat\theta_i)^2 \ge \operatorname{Var}(\theta_i - \hat\theta_i) \gtrsim \exp\big(2h(\theta_i - \hat\theta_i)\big) \ge \exp\big(2h(\theta_i\,|\,\hat\theta_i)\big).$$

Here and onwards, we use $\gtrsim$ (also $\lesssim$ and $\asymp$) to refer to $\ge$ (and $\le$, $=$, respectively) up to constants that do not depend on parameters. Although we focus on $L_2$ loss in the present work, we remark that minimax bounds on entropic loss directly yield corresponding estimates on $L_p$ loss using standard arguments involving covering and packing numbers of $L_p$ spaces. See, for example, the work by Raskutti et al. [18].
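As a quick numerical illustration of the chain in the proof of Lemma 1 (a toy example of our own, not from the paper), consider a scalar Gaussian channel $X = \theta + Z$ with $\theta \sim N(0,\beta^2)$ and $Z \sim N(0,\sigma^2)$: the mean-squared error of the Bayes estimator coincides, up to the $2\pi e$ constant hidden by $\gtrsim$, with the conditional entropy power $\exp(2h(\theta\,|\,X))$.

```python
import numpy as np

rng = np.random.default_rng(1)
beta2, sigma2, N = 4.0, 1.0, 200_000

theta = rng.normal(0, np.sqrt(beta2), N)
X = theta + rng.normal(0, np.sqrt(sigma2), N)

# Bayes (posterior-mean) estimator for the Gaussian prior/noise pair.
theta_hat = (beta2 / (beta2 + sigma2)) * X
mse = np.mean((theta - theta_hat) ** 2)

# Closed-form conditional entropy of a jointly Gaussian pair:
# h(theta | X) = 0.5 * log(2*pi*e*Var(theta | X)).
var_post = beta2 * sigma2 / (beta2 + sigma2)
cond_entropy_power = np.exp(2 * (0.5 * np.log(2 * np.pi * np.e * var_post)))

# Both quantities are ~ beta2*sigma2/(beta2+sigma2); equality holds in the Gaussian case.
print(mse, cond_entropy_power / (2 * np.pi * np.e))
```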
Despite its universal nature, there is relatively limited work on deriving minimax bounds for the entropic loss. This is the focus of the present work, and as a consequence, we obtain bounds on $L_2$ loss that significantly improve on prior results when the matrix $M$ is poorly structured.

### 1.1 Contributions

In this paper, we make three main contributions.

1. First, we establish $L_2$ minimax risk and entropic Bayes risk bounds for the generalized linear model (2). The generality of the GLM allows us to extend our results to specific instances of the GLM such as the Gaussian linear model, phase retrieval and matrix recovery.
2. Second, we establish $L_2$ minimax risk and entropic Bayes risk bounds for the Gaussian linear model. In particular, our bounds are nontrivial for many instances where previous results fail (for example when $M \in \mathbb{R}^{n\times d}$ does not have full column rank, including cases with $d > n$), and can be naturally applied to the sparse problem where $\|\theta\|_0 \le k$. Further, we show that both our minimax risk and entropic Bayes risk bounds are tight up to constants and log factors when $M$ is sampled from a Gaussian ensemble.
3. Third, we investigate the $L_2$ minimax risk via the lens of the entropic Bayes risk, and provide evidence that information theoretic minimax methods can naturally extract dependencies on the structure of the design matrix $M$ via analysis of Fisher information. The techniques we develop are general and can be used to establish minimax results for other problems.

## 2 Main Results and Discussion

The following notation is used throughout: upper-case letters (e.g., $X$, $Y$) denote random variables or matrices, and lower-case letters (e.g., $x$, $y$) denote realizations of random variables or vectors. We use subscript notation $v_i$ to denote the $i$-th component of a vector $v = (v_1, v_2, \ldots, v_d)$. We let $[k]$ denote the set $\{1, 2, \ldots, k\}$. We will be making the following assumption.

**Assumption.** The second derivative of the cumulant function $\Phi$ is bounded uniformly by a constant $L > 0$: $\Phi''(\cdot) \le L$.

The following lemma characterizes the mean and variance of densities in the exponential family.

**Lemma 2** (Page 29, [11]). Any observation $X$ generated according to the exponential family (1) has mean $\Phi'(\eta)$ and variance $s(\sigma)\,\Phi''(\eta)$.

In other words, our assumption is equivalent to saying that the variance of each observation $X_1, \ldots, X_n$ is bounded. This is a common assumption made in the literature; see, for example, [19-22].

Our first main result establishes a minimax prediction lower bound corresponding to the generalized linear model (2). Let us first make a few definitions. For an $n \times k$ matrix $A$, we define the vector $\Lambda_A := (\lambda_1, \ldots, \lambda_k) \in \mathbb{R}^k$, where the $\lambda_i$'s denote the eigenvalues of the $k \times k$ symmetric matrix $A^\top A$ in descending order. $\|\Lambda_A\|_p$ denotes the usual $L_p$ norm of the vector $\Lambda_A$ for $p \ge 1$. Finally, we define
$$\Gamma(A) := \max\left\{\frac{\|\Lambda_A\|_1^2}{\|\Lambda_A\|_2^2},\ \lambda_{\min}(A^\top A)\,\|\Lambda_A^{-1}\|_1\right\}, \tag{7}$$
where $\Lambda_A^{-1} := (\lambda_1^{-1}, \ldots, \lambda_k^{-1})$, with the convention that $\lambda_{\min}(A^\top A)\,\|\Lambda_A^{-1}\|_1 = 0$ when $\lambda_{\min}(A^\top A) = 0$.

**Theorem 3.** For observations $X \in \mathbb{R}^n$ generated via the generalized linear model (2) with a fixed design matrix $M \in \mathbb{R}^{n\times d}$, the minimax $L_2$ prediction risk and the entropic Bayes prediction risk are lower bounded by
$$\frac{1}{n}\inf_{\hat\theta}\sup_{\theta\in\mathbb{R}^d}\mathbb{E}\,\|M\hat\theta - M\theta\|^2 \;\gtrsim\; \frac{1}{n}\inf_{\hat\theta}\sup_{\pi}\sum_{i=1}^{n}\exp\big(2h(m_i^\top\theta\,|\,m_i^\top\hat\theta)\big) \;\gtrsim\; \frac{s(\sigma)}{nL}\,\frac{\|\Lambda_M\|_1^2}{\|\Lambda_M\|_2^2}.$$
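The spectral quantities appearing in Theorem 3 and in the definition (7) of $\Gamma(\cdot)$ are easy to compute in practice. The helper below is a sketch of our own (the explicit prefactor $s(\sigma)/(nL)$ and the function names are our choices, not the paper's); it remains well defined for rank-deficient designs thanks to the convention that the second term of $\Gamma$ vanishes when $\lambda_{\min}(A^\top A) = 0$.

```python
import numpy as np

def spectrum(A):
    """Eigenvalues of A^T A in descending order (the vector Lambda_A)."""
    return np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

def gamma(A, tol=1e-10):
    """Gamma(A) = max{ ||Lambda||_1^2/||Lambda||_2^2 , lambda_min * ||Lambda^{-1}||_1 },
    with the second term set to 0 when lambda_min vanishes."""
    lam = spectrum(A)
    ratio = lam.sum() ** 2 / np.sum(lam ** 2)
    lam_min = lam[-1]
    second = 0.0 if lam_min < tol else lam_min * np.sum(1.0 / lam)
    return max(ratio, second)

def entropic_prediction_lower_bound(M, s_sigma=1.0, L=1.0):
    """Spectral lower bound of Theorem 3: (s(sigma)/(n L)) * ||Lambda_M||_1^2/||Lambda_M||_2^2."""
    lam = spectrum(M)
    n = M.shape[0]
    return (s_sigma / (n * L)) * lam.sum() ** 2 / np.sum(lam ** 2)

# A wide, rank-deficient design still yields a nontrivial bound.
rng = np.random.default_rng(2)
M = rng.standard_normal((40, 100))
M[:, 1] = M[:, 0]                      # duplicate column: M^T M is singular
print(gamma(M), entropic_prediction_lower_bound(M))
```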
Bounds on minimax risk under an additional sparsity constraint $\|\theta\|_0 \le k$ (i.e., the true parameter $\theta$ has at most $k$ non-zero entries) can be derived as a corollary.

**Corollary 4** (Sparse Version of Theorem 3). For observations $X \in \mathbb{R}^n$ generated via the generalized linear model (2), with the additional constraint that $\|\theta\|_0 \le k$ (i.e., $\Theta := \{\theta \in \mathbb{R}^d : \|\theta\|_0 \le k\}$), the minimax prediction error is lower bounded by
$$\frac{1}{n}\inf_{\hat\theta}\sup_{\theta\in\Theta}\mathbb{E}\,\|M\hat\theta - M\theta\|^2 \;\gtrsim\; \frac{s(\sigma)}{nL}\,\max_{Q\in\mathcal{M}_k}\Gamma(Q),$$
$$\frac{1}{n}\inf_{\hat\theta}\sup_{\pi}\sum_{i=1}^{n}\exp\big(2h(m_i^\top\theta\,|\,m_i^\top\hat\theta)\big) \;\gtrsim\; \frac{s(\sigma)}{nL}\,\max_{Q\in\mathcal{M}_k}\frac{\|\Lambda_Q\|_1^2}{\|\Lambda_Q\|_2^2}.$$
Here, the maximum is taken over $\mathcal{M}_k$, the set of all $n \times k'$ submatrices of $M$, with $k' \le k$.

We now note an important specialization of Corollary 4. In particular, consider the Gaussian linear model with observations $X \in \mathbb{R}^n$ generated according to
$$X = M\theta + Z, \tag{8}$$
with $Z \sim N(0, \sigma^2 I_n)$ the standard Gaussian vector. This corresponds to the GLM of (2) when the functions are taken to be $g(x) \propto e^{-x^2/(2\sigma^2)}$, $s(\sigma) = \sigma^2$, and $\Phi(t) = t^2/2$ (hence, $L = 1$). This is a particularly important instance worth highlighting because of the ubiquity of the Gaussian linear model in applications.

**Theorem 5.** For observations $X \in \mathbb{R}^n$ generated via the Gaussian linear model (8), with the sparsity constraint $\|\theta\|_0 \le k$ (i.e., $\Theta := \{\theta \in \mathbb{R}^d : \|\theta\|_0 \le k\}$), the minimax prediction error is lower bounded by
$$\frac{1}{n}\inf_{\hat\theta}\sup_{\theta\in\Theta}\mathbb{E}\,\|M\hat\theta - M\theta\|^2 \;\gtrsim\; \frac{\sigma^2}{n}\,\max_{Q\in\mathcal{M}_k}\Gamma(Q),$$
$$\frac{1}{n}\inf_{\hat\theta}\sup_{\pi}\sum_{i=1}^{n}\exp\big(2h(m_i^\top\theta\,|\,m_i^\top\hat\theta)\big) \;\gtrsim\; \frac{\sigma^2}{n}\,\max_{Q\in\mathcal{M}_k}\frac{\|\Lambda_Q\|_1^2}{\|\Lambda_Q\|_2^2}.$$
Here, the maximum is taken over $\mathcal{M}_k$, the set of all $n \times k'$ submatrices of $M$, with $k' \le k$.

**Remark 6.** In the above results, the function $\Gamma(\cdot)$ can in fact be replaced with
$$\tilde\Gamma(M) := \max\left\{\sum_{i=1}^{n}\frac{\|m_i\|_2^4}{\|Mm_i\|_2^2},\ \lambda_{\min}(M^\top M)\,\|\Lambda_M^{-1}\|_1\right\},$$
which is stronger than the original statements. However, the chosen statements above highlight the simple dependence on the spectrum $\Lambda_M$.

### 2.1 Related Work

Most relevant to our results is the following lower bound on minimax $L_2$ estimation risk and entropic Bayes estimation risk, developed in a recent work by Lee and Courtade [23]. We note that [23] does not bound prediction loss (which is often of primary interest), as we have done in the present paper.

**Theorem 7** (Theorem 3, [23]). Let the observation $X$ be generated via the generalized linear model defined in (2), with the additional structural constraint $\Theta = B_2^d(R) := \{v : \|v\|_2^2 \le R^2\}$. Suppose the cumulant function $\Phi$ satisfies $\Phi'' \le L$ for some constant $L$. Then, the minimax estimation error is lower bounded by
$$\inf_{\hat\theta}\sup_{\theta\in\Theta}\mathbb{E}\,\|\hat\theta - \theta\|^2 \;\gtrsim\; \inf_{\hat\theta}\sup_{\pi}\sum_{i=1}^{d}\exp\big(2h(\theta_i\,|\,\hat\theta_i)\big) \;\gtrsim\; \min\left\{R^2,\ \frac{s(\sigma)}{L}\operatorname{Tr}\big((M^\top M)^{-1}\big)\right\}. \tag{9}$$

The bound of (9) is tight when $X$ is generated by the Gaussian linear model, showing that (Gaussian) linear models are most favorable in the sense of minimax estimation error amongst the class of GLMs considered here. Lee and Courtade extracted the dependence on the $\operatorname{Tr}((M^\top M)^{-1})$ term by analyzing a Fisher information term in the class of Bayesian Cramér-Rao-type bounds from [24]. Earlier work (see, e.g., [25]) yielded bounds on the order of $d/\lambda_{\max}(M^\top M)$, which is loose compared to (9).

There is a large body of work that establishes minimax lower bounds on prediction error for specific models of the generalized linear model. Typically, these analyses depend on methods involving metric entropy (see, for example, [4, 18, 19, 26-28]). A popular minimax result is due to Raskutti et al. [18], who consider the sparse Gaussian linear model, where for a fixed design matrix $M$ with an additional sparsity constraint $\|\theta\|_0 \le k$,
$$\sigma^2\,\frac{\Phi_{2k,-}^2(M)}{\Phi_{2k,+}^2(M)}\,\frac{k}{n}\log\frac{ed}{k} \;\lesssim\; \inf_{\hat\theta}\sup_{\|\theta\|_0\le k}\frac{1}{n}\mathbb{E}\,\|M\hat\theta - M\theta\|_2^2 \;\lesssim\; \sigma^2\min\left\{\frac{k}{n}\log\frac{ed}{k},\ 1\right\}. \tag{10}$$
Here the terms $\Phi_{r,-}(M)$ and $\Phi_{r,+}(M)$ correspond to the constrained eigenvalues,
$$\Phi_{r,-}(M) := \inf_{0\ne\|\theta\|_0\le r}\frac{\|M\theta\|_2}{\|\theta\|_2}, \qquad \Phi_{r,+}(M) := \sup_{0\ne\|\theta\|_0\le r}\frac{\|M\theta\|_2}{\|\theta\|_2}.$$
The upper bound of (10) is achieved by classical methods such as aggregation [29-32].
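To make the comparison with (10) concrete, the following sketch (our own hypothetical example; the brute-force computation of $\Phi_{2k,\pm}$ is only feasible for small $d$) evaluates the spectral factor $\|\Lambda_M\|_1^2/\|\Lambda_M\|_2^2$ from Theorem 5 alongside the restricted-eigenvalue ratio entering the lower bound of (10), for a design with a repeated column: the latter collapses to zero while the former remains of order $\min(n,d)$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)

def restricted_extremes(M, r):
    """Phi_{r,-}(M) and Phi_{r,+}(M): extreme singular values over all
    column subsets of size r (brute force; fine for small d)."""
    d = M.shape[1]
    lo, hi = np.inf, 0.0
    for S in combinations(range(d), r):
        s = np.linalg.svd(M[:, list(S)], compute_uv=False)
        lo, hi = min(lo, s[-1]), max(hi, s[0])
    return lo, hi

n, d, k = 20, 10, 2
M = rng.standard_normal((n, d))
M[:, 1] = M[:, 0]                      # repeated column => Phi_{2k,-} is (numerically) zero

lam = np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]
spectral_factor = lam.sum() ** 2 / np.sum(lam ** 2)    # factor in Theorem 5
phi_lo, phi_hi = restricted_extremes(M, 2 * k)
eigen_ratio = (phi_lo / phi_hi) ** 2                    # factor in the lower bound of (10)

print(f"||Lambda||_1^2 / ||Lambda||_2^2 = {spectral_factor:.2f}")
print(f"(Phi_2k,- / Phi_2k,+)^2         = {eigen_ratio:.2e}")
```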
One can readily observe that the lower bound of (10) becomes degenerate for even mildly ill-structured design matrices $M$. For example, in the case where $M$ has repeated columns, the above result gives a lower bound of $0$, which is not very interesting. This suggests that the metric entropy approach does not easily capture the dependence on the structure of the design matrix $M$ at the resolution of the complete spectrum of $M^\top M$, as our results do. In fact, it can be shown that Corollary 4 uniformly improves upon (10) up to logarithmic factors; see Section 4.1 of the supplementary. Further, the lower bound of Raskutti et al. does not hold for $k > n$, which is a disadvantage for high dimensional problems where $d \gg n$. Verzelen [30] discusses the regime where $2 \le k \le \max(d^{1/3}, n/5)$ and provides bounds for the worst-case matrix $M$, which is a different setting from ours.

There are also lines of work on specific settings of the generalized linear model. For example, Candes et al. [28] discuss low-rank matrix recovery, and Cai et al. [4] consider phase retrieval. There are, however, fewer results that directly look at the generalized linear model of our setting. The closest related work is that of Abramovich and Grinshtein [19], where they consider estimating the entire vector $M\theta$, as opposed to our setting where we estimate $\theta$ first with $\hat\theta$, then evaluate $M\hat\theta$. Their result also depends on the ratio between (constrained) minimum and maximum eigenvalues as in (10), and hence fails when $M$ is not full rank or otherwise has divergent maximum and minimum (constrained) eigenvalues.

Comparing Theorems 3 and 5 with the results surveyed above raises several points (illustrated in Table 1):

- **Nontriviality when $M$ is not full rank.** Unlike the lower bound in (10), the ratio $\|\Lambda_M\|_1^2/\|\Lambda_M\|_2^2$ does not vanish when $M$ is not full rank; see Case (d) in Table 1 (the quantities in Table 1 are also verified numerically in the sketch following this list). This is particularly important when the dimension of the parameter is large relative to the number of observed samples.

Table 1: Values of the quantities in $\Gamma(M)$ (defined in (7)) for different scenarios of $\Lambda_M = (\lambda_1, \ldots, \lambda_d)$ for fixed $M \in \mathbb{R}^{n\times d}$. The value $t$ satisfies $t \ge 1$. In each row, the bold item marks the largest value.

| Case | $\Lambda_M$ | $\|\Lambda_M\|_1^2/\|\Lambda_M\|_2^2$ | $\lambda_d\,\|\Lambda_M^{-1}\|_1$ | $d(\lambda_d/\lambda_1)$ |
|------|-------------|----------------------------------------|------------------------------------|---------------------------|
| (a) | $(1, 1, \ldots, 1, 1)$ | $\mathbf{d}$ | $\mathbf{d}$ | $\mathbf{d}$ |
| (b) | $(t, 1, 1, \ldots, 1)$ | $(t+d)^2/(t^2+d)$ | $\mathbf{d}$ | $d/t$ |
| (c) | $(1, 1, \ldots, 1, 1/t)$ | $\mathbf{d}$ | $d/t$ | $d/t$ |
| (d) | $(1, 1, \ldots, 1, 0)$ | $\mathbf{d}$ | $0$ | $0$ |
| (e) | $(t, 1, \ldots, 1, 1/t)$ | $\mathbf{(t+d)^2/(t^2+d)}$ | $(t+d)/t$ | $d/t^2$ |

- **Insensitivity to extreme values in the spectrum $\Lambda_M$.** Unlike the ratio between the largest and smallest (restricted) eigenvalues in $\Lambda_M$, the ratio $\|\Lambda_M\|_1^2/\|\Lambda_M\|_2^2$ is less sensitive to the setting where the maximum and minimum eigenvalues in $\Lambda_M$ diverge. Table 1 provides several examples in rows (b)-(e).
- **Sharpness.** Comparing with the upper bound (10), we observe that Theorem 3 (or, more specifically, Theorem 5) is sharp if either the largest (constant fraction of $d$) eigenvalues are of the same order, or (for the $L_2$ risk) if the smallest (constant fraction of $d$) eigenvalues are non-zero and of the same order. This can be seen by considering $\|\Lambda_M\|_1^2/\|\Lambda_M\|_2^2$ and $\lambda_{\min}(M^\top M)\,\|\Lambda_M^{-1}\|_1$ in the former and latter cases, respectively. Moreover, as shown in the following section, when $M$ is sampled from a Gaussian ensemble, i.e., all components of $M$ are sampled from a standard Gaussian, our bounds are optimal up to log factors with high probability.
- **Logarithmic term for sparse linear regression.** In many cases, the log factor is insignificant, and the improved spectral dependence of Theorem 5 can yield substantial improvement. For example, when $k$ is not very sparse, say $k = \Theta(d^c)$ for some $c \in (0, 1]$, the log factor is not significant and our results can be significantly better than (10) when $M$ is mildly ill-conditioned. In the very sparse case, say $k = O(\log d)$, our results still provide meaningful bounds for $M$ with minimum constrained eigenvalue close to $0$.
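The entries of Table 1 can be reproduced numerically. The sketch below (our own check; the function name is ours) evaluates the three tabulated quantities for spectra of types (d) and (e).

```python
import numpy as np

def table1_row(lam):
    """The three quantities compared in Table 1 for a given spectrum Lambda_M."""
    lam = np.asarray(lam, dtype=float)
    d = lam.size
    ratio = lam.sum() ** 2 / np.sum(lam ** 2)                        # ||Lambda||_1^2/||Lambda||_2^2
    second = 0.0 if lam.min() == 0 else lam.min() * np.sum(1 / lam)  # lambda_d * ||Lambda^{-1}||_1
    third = d * lam.min() / lam.max()                                # d * (lambda_d / lambda_1)
    return ratio, second, third

t, d = 100.0, 50
case_e = np.r_[t, np.ones(d - 2), 1 / t]    # spectrum (t, 1, ..., 1, 1/t)
print(table1_row(case_e))                   # ~ ((t+d)^2/(t^2+d), (t+d)/t, d/t^2) = (2.2, 1.5, 0.005)

case_d = np.r_[np.ones(d - 1), 0.0]         # rank-deficient spectrum (1, ..., 1, 0)
print(table1_row(case_d))                   # ~ (49, 0, 0): only the first term of Gamma survives
```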
**Remark 8.** In some cases, (10) can be improved by ignoring certain components of $\theta \in \mathbb{R}^d$ via dimensionality reduction. For example, if the first two columns of $M$ are the same, then it is possible to ignore the first component of $\theta$ and simply look at the remaining $d-1$ components. We remark that even with this reduction, (10) still depends on the ratio between the minimum and maximum constrained eigenvalues of the new effective matrix, and leads to a poor lower bound when the minimum and maximum constrained eigenvalues are of different orders. We remark that other dimensionality reduction methods (such as rotations) may be limited by the sparsity constraint $\|\theta\|_0 \le k$. Moreover, in general when the spectrum of $M$ is all positive (with divergent large/small eigenvalues), one cannot use dimensionality reduction to improve the result of (10).

### 2.2 Application to Gaussian Designs

Gaussian designs are frequently adopted in machine learning and compressed sensing (see, for example, [18, 33-35]). The following proposition provides a concentration bound for the ratio $\|\Lambda_M\|_1^2/\|\Lambda_M\|_2^2$ when $M$ is sampled from the standard Gaussian ensemble (i.e., where each component of $M$ is sampled i.i.d. according to a standard Gaussian).

**Proposition 9.** Let the design matrix $M \in \mathbb{R}^{n\times k}$ be sampled from the Gaussian ensemble. There exist universal constants $c_1, c_2, c_3 > 0$ such that
$$\frac{\|\Lambda_M\|_1^2}{\|\Lambda_M\|_2^2} \ge c_1\min(n, k)$$
with probability at least $1 - c_2\exp(-c_3\min(n, k))$.

Proposition 9 implies that, with high probability, the lower bound of Theorem 5 (and therefore the corresponding estimate in Theorem 3) is sharp up to a logarithmic term that is negligible when $d \asymp k$. In particular, under the assumptions of Theorem 5, we obtain with the help of (10) that
$$\sigma^2\min\left\{\frac{k}{n},\ 1\right\} \;\lesssim\; \inf_{\hat\theta}\sup_{\|\theta\|_0\le k}\frac{1}{n}\mathbb{E}\,\|M\hat\theta - M\theta\|_2^2 \;\lesssim\; \sigma^2\min\left\{\frac{k}{n}\log\frac{ed}{k},\ 1\right\},$$
with the lower bound holding with high probability in $\min(n, k)$. This can significantly improve on the lower bound (10); consider, for example, the case where $s := \min(2k, d) = \alpha n$ for some fixed $\alpha < 1$. Note that any $n \times s$ submatrix $M'$ of $M$ satisfies $\Phi_{2k,-}^2(M)/\Phi_{2k,+}^2(M) \le \lambda_{\min}(M'^\top M')/\lambda_{\max}(M'^\top M')$. An asymptotic result by Bai and Yin [36] implies that if $\alpha$ is fixed then this latter ratio converges to $(1-\sqrt{\alpha})^2/(1+\sqrt{\alpha})^2$ almost surely as $n, k, d \to \infty$. Hence, asymptotically speaking, the result of (10) is tight at most up to constants depending on $\alpha$, while our result of Corollary 4 is tight (up to log factors) without dependence on $\alpha$.

Interestingly, Proposition 9 also holds for square matrices, where the minimum eigenvalue is close to zero (more precisely, for a square Gaussian matrix $M \in \mathbb{R}^{n\times n}$, $\lambda_{\min}(M^\top M)$ is of the order $n^{-1}$, as shown in the work of Rudelson and Vershynin [37]). Proposition 9 follows from Szarek's work [38] on concentration of the largest $n/2$ singular values of a square Gaussian matrix $M \in \mathbb{R}^{n\times n}$, concentration of singular values of rectangular subgaussian matrices [26], and an application of interlacing inequalities for singular values of submatrices [39]. Similar results can be shown for subgaussian matrices under additional assumptions using tools from [40].
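A quick simulation (our own sanity check, not part of the paper) illustrates the concentration in Proposition 9: for standard Gaussian designs across several aspect ratios, the ratio $\|\Lambda_M\|_1^2/\|\Lambda_M\|_2^2$ stays within a constant factor of $\min(n,k)$.

```python
import numpy as np

rng = np.random.default_rng(4)

def spectral_ratio(M):
    lam = np.linalg.eigvalsh(M.T @ M)
    return lam.sum() ** 2 / np.sum(lam ** 2)

for (n, k) in [(200, 50), (200, 200), (50, 200)]:
    ratios = [spectral_ratio(rng.standard_normal((n, k))) for _ in range(20)]
    # The worst observed ratio, normalized by min(n, k), stays bounded away from 0.
    print(n, k, min(ratios) / min(n, k))
```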
## 3 Key Points of Proofs of Main Theorems

In our approach, we will be using classical information theory tools inspired by the techniques developed by Lee and Courtade [23].

### 3.1 Preliminaries

We say that a measure $\mu$ is log-concave if $d\mu(x) = e^{-V(x)}dx$ for some convex function $V(\cdot)$. The Fisher information $I_X(\theta)$ given $\theta \in \mathbb{R}^d$ corresponding to the map $\theta \mapsto P_\theta$ is defined as
$$I_X(\theta) = \mathbb{E}_X\,\|\nabla_\theta\log f(X;\theta)\|_2^2,$$
where the gradient is taken with respect to $\theta$, and the expectation is taken with respect to $X \sim P_\theta$. If the parameter $\theta$ has a prior $\pi$ that is log-concave, the following lemma gives an upper bound on the mutual information $I(\theta; X)$, which depends on the covariance matrix of $\theta$, denoted $\operatorname{Cov}(\theta)$.

**Lemma 10** (Theorem 2, [24]). Suppose the prior $\pi$ of $\theta \in \mathbb{R}^d$ is log-concave. Then, under mild regularity conditions on the map $\theta \mapsto P_\theta$, we have
$$I(\theta; X) \le d\,\phi\!\left(\frac{\operatorname{Tr}(\operatorname{Cov}(\theta))\,\mathbb{E}\,I_X(\theta)}{d^2}\right), \tag{13}$$
where the function $\phi(\cdot)$ is defined as
$$\phi(x) := \begin{cases} x & \text{if } 0 \le x < 1, \\ 1 + \tfrac{1}{2}\log x & \text{if } x \ge 1. \end{cases}$$

We note that the regularity condition in Lemma 10 requires that each member of the parametric family $P_\theta$ has density $f(\cdot\,;\theta)$ smooth enough to permit the following interchange of integration and differentiation:
$$\int_{\mathcal{X}}\nabla_\theta f(x;\theta)\,d\lambda(x) = 0 \quad \text{for almost every } \theta. \tag{14}$$
In our case, since we are working with the GLM of (2), the regularity condition is automatically satisfied. When $\theta$ is a one-dimensional (i.e., $d = 1$) log-concave random variable, the bound of (13) is sharp up to a (modest) multiplicative constant when $\operatorname{Var}(\theta)\,\mathbb{E}\,I_X(\theta)$ is bounded away from zero. There exists a tighter version of Lemma 10 when $\pi$ is uniformly log-concave; however, Lemma 10 is enough for our purposes. We direct the interested reader to the paper [24].

### 3.2 Proof Sketch of Theorem 3

We start off by noting that we can lower bound the entropic Bayes risk of (6) by taking a specific prior $\pi$. For our purposes, we will let $\theta$ have a multivariate Gaussian prior $\pi = N(0, \beta^2 I_d)$. We continue with a bound on the sum of conditional entropy powers,
$$\sum_{i=1}^{n}\exp\big(2h(m_i^\top\theta\,|\,m_i^\top\hat\theta)\big) \ge \sum_{i=1}^{n}\exp\big(2h(m_i^\top\theta) - 2I(m_i^\top\theta; X)\big), \tag{15}$$
which follows from the data-processing inequality $I(m_i^\top\theta; m_i^\top\hat\theta) \le I(m_i^\top\theta; X)$, since $m_i^\top\theta \to X \to m_i^\top\hat\theta$ forms a Markov chain. When $m_i \in \mathbb{R}^d$ is a zero vector, $\exp(2h(m_i^\top\theta\,|\,m_i^\top\hat\theta)) = \exp(2h(m_i^\top\theta) - 2I(m_i^\top\theta; X)) = 0$, and hence it does not contribute to the summations within (15). This implies that removing zero-vector rows from $M$ does not affect the proof following (15). Hence, in the following we will assume that the matrix $M$ does not have rows that are zero vectors. By our choice of the prior $\pi$, the density of $m_i^\top\theta$ is Gaussian and hence log-concave, which allows us to invoke Lemma 10, implying
$$\sum_{i=1}^{n}\exp\big(2h(m_i^\top\theta\,|\,m_i^\top\hat\theta)\big) \ge \sum_{i=1}^{n}\exp\Big(2h(m_i^\top\theta) - 2\phi\big(\operatorname{Var}(m_i^\top\theta)\,\mathbb{E}\,I_X(m_i^\top\theta)\big)\Big). \tag{16}$$
Here, the expectation is taken with respect to the marginal density of $m_i^\top\theta$. The primary task is now to obtain a reasonable bound on the expected Fisher information term $\mathbb{E}\,I_X(m_i^\top\theta)$. To do this, we introduce the following lemma, which provides an upper bound for the expected Fisher information $\mathbb{E}\,I_X(m_i^\top\theta)$.

**Lemma 11.** Fix $M \in \mathbb{R}^{n\times d}$. If the parameter $\theta$ has a prior $\pi = N(0, \beta^2 I_d)$ and $X \in \mathbb{R}^n$ is sampled according to the generalized linear model defined in (2), then
$$\mathbb{E}\,I_X(m_i^\top\theta) \le \frac{L}{s(\sigma)}\,\frac{\|Mm_i\|_2^2}{\|m_i\|_2^4} + \frac{1}{\beta^2}\,\Psi_i(M) \quad \text{for all } i = 1, 2, \ldots, n. \tag{17}$$
The function $\Psi_i(M)$ depends only on $M$ and is finite. The expectation is taken with respect to the marginal density of $m_i^\top\theta$.

The functions $\Psi_i(\cdot)$ are not explicitly stated here because later we will be taking $\beta$ large enough so that $\Psi_i(\cdot)/\beta^2$ in (17) can be ignored. A proof of Lemma 11 and more details about the functions $\Psi_i(\cdot)$ are included in the supplementary.
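Before continuing, it is worth recording the two elementary identities behind the Cauchy-Schwarz step that follows: $\sum_{i=1}^{n}\|m_i\|_2^2 = \operatorname{Tr}(M^\top M) = \|\Lambda_M\|_1$ and $\sum_{i=1}^{n}\|Mm_i\|_2^2 = \operatorname{Tr}\big((M^\top M)^2\big) = \|\Lambda_M\|_2^2$. The short sketch below (our own numerical check) verifies both and also evaluates the per-row factor $\|Mm_i\|_2^2/\|m_i\|_2^4$ appearing in Lemma 11.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 12
M = rng.standard_normal((n, d))
rows = list(M)                                   # the m_i are the rows of M

lam = np.linalg.eigvalsh(M.T @ M)                # the spectrum Lambda_M

# sum_i ||m_i||^2 = Tr(M^T M) = ||Lambda_M||_1
print(sum(np.dot(m, m) for m in rows), lam.sum())

# sum_i ||M m_i||^2 = Tr((M^T M)^2) = ||Lambda_M||_2^2
print(sum(np.dot(M @ m, M @ m) for m in rows), np.sum(lam ** 2))

# Per-row factor in Lemmas 11 and 12: ||M m_i||^2 / ||m_i||^4
factors = np.array([np.dot(M @ m, M @ m) / np.dot(m, m) ** 2 for m in rows])
print(factors[:3])
```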
We can continue from (16) and see that
$$\sum_{i=1}^{n}\exp\big(2h(m_i^\top\theta\,|\,m_i^\top\hat\theta)\big) \;\overset{(a)}{\gtrsim}\; \sum_{i=1}^{n}\beta^2\|m_i\|_2^2\,\exp\!\left(-2\phi\!\left(\beta^2\|m_i\|_2^2\left[\frac{L}{s(\sigma)}\frac{\|Mm_i\|_2^2}{\|m_i\|_2^4} + \frac{1}{\beta^2}\Psi_i(M)\right]\right)\right) \;\overset{(b)}{\gtrsim}\; (1-\epsilon)\,\frac{s(\sigma)}{L}\sum_{i=1}^{n}\frac{\|m_i\|_2^4}{\|Mm_i\|_2^2}. \tag{18}$$
In the above, both (a) and (b) require a selection of $\beta^2$ that is large enough. In particular, in (a), $\beta^2 \gtrsim s(\sigma)/L$ would guarantee that the function $\phi$ behaves logarithmically (recall from Lemma 10 that $\phi(t)$ behaves logarithmically if $t \ge 1$). In (b), the variable $\epsilon$ depends on the selection of $\beta$. Since the function $\Psi_i(M)$ is finite for all $i = 1, \ldots, n$, by taking $\beta^2$ a constant large enough, we can force $\epsilon$ to be as close to zero as possible. Hence, we can say that the inequality holds with $\epsilon = 0$. A direct application of the Cauchy-Schwarz inequality then yields
$$\sum_{i=1}^{n}\exp\big(2h(m_i^\top\theta\,|\,m_i^\top\hat\theta)\big) \;\gtrsim\; \frac{s(\sigma)}{L}\,\frac{\big(\sum_{i=1}^{n}\|m_i\|_2^2\big)^2}{\sum_{i=1}^{n}\|Mm_i\|_2^2} = \frac{s(\sigma)}{L}\,\frac{\|\Lambda_M\|_1^2}{\|\Lambda_M\|_2^2}. \tag{19}$$
On the other hand, from Theorem 7 and the matrix inequality $\|Mv\|_2^2 \ge \lambda_{\min}(M^\top M)\,\|v\|_2^2$,
$$\inf_{\hat\theta}\sup_{\theta\in\mathbb{R}^d}\mathbb{E}\,\|M\hat\theta - M\theta\|_2^2 \;\gtrsim\; \frac{s(\sigma)}{L}\,\lambda_{\min}(M^\top M)\operatorname{Tr}\big((M^\top M)^{-1}\big) = \frac{s(\sigma)}{L}\,\lambda_d\,\|\Lambda_M^{-1}\|_1. \tag{20}$$
Combining (19) and (20) with Lemma 1 finishes the proof.

### 3.3 An Alternative Proof of Theorem 5

For the Gaussian linear model, we have the following tighter version of Lemma 11.

**Lemma 12.** Fix $M \in \mathbb{R}^{n\times d}$. If $\theta \sim N(0, \beta^2 I_d)$ and $X \in \mathbb{R}^n$ is sampled according to the Gaussian linear model defined in (8), then
$$\mathbb{E}\,I_X(m_i^\top\theta) \le \frac{1}{\sigma^2}\,\frac{\|Mm_i\|_2^2}{\|m_i\|_2^4} \quad \text{for } 1 \le i \le n. \tag{21}$$

By taking any $\beta^2 \ge \sigma^2\max_i \|m_i\|_2^2/\|Mm_i\|_2^2$, the function $\phi(\cdot)$ in (16) will again behave logarithmically, directly implying (18) with $\epsilon = 0$. The remaining proof follows similarly as before.

**Remark 13.** The functions $\Psi_i(\cdot)$ can be difficult to bound directly (see the supplementary for more details). Hence, the improved tightness and simplicity of Lemma 12 over Lemma 11 for the Gaussian linear model provide more flexibility in the selection of $\beta$. This can be helpful when dealing with problem settings where there are other constraints on the parameter space $\Theta$.

**Remark 14.** There is a subtle but crucial difference in the proof techniques employed here compared to those in [23]. The key step in [23] requires bounding the Fisher information $I_X(\theta_i)$ with diagonal terms in the Fisher information matrix $I_X(\theta)$, i.e., Lemma 9 of [23]. In our case, we need to bound the Fisher information $I_X(m_i^\top\theta)$ (e.g., Lemma 11), and here, the terms $m_i^\top\theta$ are not necessarily mutually independent as required in Lemma 9 of [23], which prevents a direct application. Instead, we choose $\theta$ to have a Gaussian prior and bound $I_X(m_i^\top\theta)$ directly. This is facilitated by properties of the Gaussian distribution; see Section 4.3 in the appendix for more details.
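As a closing illustration (our own experiment, not from the paper), the sketch below compares the spectral lower bound of Theorem 5 (taken with $k = d$, i.e., no sparsity, so the best submatrix is $M$ itself) against the prediction risk of a concrete estimator: ordinary least squares computed via the pseudoinverse, which also handles rank-deficient designs and has prediction risk $\sigma^2\operatorname{rank}(M)/n$. The latter indeed dominates the bound, consistent with Theorem 5 up to the absolute constants hidden by $\gtrsim$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, sigma = 100, 60, 1.0

M = rng.standard_normal((n, d))
M[:, 1] = M[:, 0]                         # make M rank-deficient on purpose
theta = rng.standard_normal(d)
Mp = np.linalg.pinv(M)                    # pseudoinverse handles the rank deficiency

# Monte Carlo prediction risk of least squares: ~ sigma^2 * rank(M) / n.
risks = []
for _ in range(200):
    X = M @ theta + sigma * rng.standard_normal(n)
    theta_hat = Mp @ X
    risks.append(np.sum((M @ theta_hat - M @ theta) ** 2) / n)
ls_risk = np.mean(risks)

lam = np.linalg.eigvalsh(M.T @ M)
lower_bound = (sigma ** 2 / n) * lam.sum() ** 2 / np.sum(lam ** 2)

print(f"least-squares prediction risk ~ {ls_risk:.3f}")
print(f"Theorem 5 spectral lower bound  {lower_bound:.3f}")
```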
## Broader Impact

The generalized linear model (GLM) is a broad class of statistical models that has extensive applications in machine learning, electrical engineering, finance, biology, and many areas not stated here. Many algorithms have been proposed for inference, prediction and classification tasks under the umbrella of the GLM, such as the Lasso algorithm, the EM algorithm, Dantzig selectors, etc., but often it is hard to confidently assess optimality. Lower bounds for minimax and Bayes risks play a key role here by providing theoretical benchmarks with which one can evaluate the performance of algorithms. While many previous approaches have focused on the Gaussian linear model, in this paper we establish minimax and Bayes risk lower bounds that hold uniformly over all statistical models within the GLM. Our arguments demonstrate a set of information-theoretic techniques that are general and applicable to setups other than the GLM. As a result, many applications stand to potentially benefit from our work.

## Acknowledgments

This work was supported in part by NSF grants CCF-1704967, CCF-1750430, and CCF-0939370.

## References

[1] C. M. Stein, Estimation of the Mean of a Multivariate Normal Distribution, The Annals of Statistics, pp. 1135-1151, 1981.
[2] J. Friedman, T. Hastie, and R. Tibshirani, Sparse Inverse Covariance Estimation with the Graphical Lasso, Biostatistics, vol. 9, no. 3, pp. 432-441, 2008.
[3] G. Lecué and S. Mendelson, Minimax Rate of Convergence and the Performance of ERM in Phase Recovery, arXiv preprint arXiv:1311.5024, 2013.
[4] T. T. Cai, X. Li, Z. Ma, et al., Optimal Rates of Convergence for Noisy Sparse Phase Retrieval via Thresholded Wirtinger Flow, The Annals of Statistics, vol. 44, no. 5, pp. 2221-2251, 2016.
[5] D. Du, F. K. Hwang, and F. Hwang, Combinatorial Group Testing and Its Applications, vol. 12, World Scientific, 2000.
[6] B. Hajek, S. Oh, and J. Xu, Minimax-Optimal Inference from Partial Rankings, in Advances in Neural Information Processing Systems, pp. 1475-1483, 2014.
[7] P. Tichavsky, C. H. Muravchik, and A. Nehorai, Posterior Cramér-Rao Bounds for Discrete-Time Nonlinear Filtering, IEEE Transactions on Signal Processing, vol. 46, no. 5, pp. 1386-1396, 1998.
[8] L. Paninski, Convergence Properties of Some Spike-Triggered Analysis Techniques, in Advances in Neural Information Processing Systems, pp. 189-196, 2003.
[9] J. Broder and P. Rusmevichientong, Dynamic Pricing Under a General Parametric Choice Model, Operations Research, vol. 60, no. 4, pp. 965-980, 2012.
[10] N. B. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchandran, and M. J. Wainwright, Estimation from Pairwise Comparisons: Sharp Minimax Bounds with Topology Dependence, The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2049-2095, 2016.
[11] P. McCullagh, Generalized Linear Models, Routledge, 2019.
[12] J. A. Nelder and R. W. Wedderburn, Generalized Linear Models, Journal of the Royal Statistical Society: Series A (General), vol. 135, no. 3, pp. 370-384, 1972.
[13] A. J. Dobson and A. G. Barnett, An Introduction to Generalized Linear Models, CRC Press, 2018.
[14] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, 2006.
[15] T. A. Courtade and R. D. Wesel, Multiterminal Source Coding with an Entropy-Based Distortion Measure, in 2011 IEEE International Symposium on Information Theory Proceedings, pp. 2040-2044, IEEE, 2011.
[16] J. Jiao, T. A. Courtade, K. Venkat, and T. Weissman, Justification of Logarithmic Loss via the Benefit of Side Information, IEEE Transactions on Information Theory, vol. 61, no. 10, pp. 5357-5365, 2015.
[17] Y. Wu, Lecture Notes for Information-Theoretic Methods for High-Dimensional Statistics, Lecture Notes for ECE598YW (UIUC), vol. 16, 2017.
[18] G. Raskutti, M. J. Wainwright, and B. Yu, Minimax Rates of Estimation for High-Dimensional Linear Regression over $\ell_q$-Balls, IEEE Transactions on Information Theory, vol. 57, no. 10, pp. 6976-6994, 2011.
[19] F. Abramovich and V. Grinshtein, Model Selection and Minimax Estimation in Generalized Linear Models, IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3721-3730, 2016.
[20] H.-G. Müller and U. Stadtmüller, Generalized Functional Linear Models, The Annals of Statistics, vol. 33, no. 2, pp. 774-805, 2005.
[21] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu, A Unified Framework for High-Dimensional Analysis of M-estimators with Decomposable Regularizers, Statistical Science, vol. 27, no. 4, pp. 538-557, 2012.
[22] P.-L. Loh and M. J. Wainwright, Regularized M-estimators with Nonconvexity: Statistical and Algorithmic Theory for Local Optima, The Journal of Machine Learning Research, vol. 16, no. 1, pp. 559-616, 2015.
[23] K.-Y. Lee and T. A. Courtade, Linear Models are Most Favorable among Generalized Linear Models, arXiv preprint, to appear in ISIT 2020.
[24] E. Aras, K.-Y. Lee, A. Pananjady, and T. A. Courtade, A Family of Bayesian Cramér-Rao Bounds, and Consequences for Log-Concave Priors, in 2019 IEEE International Symposium on Information Theory (ISIT), pp. 2699-2703, IEEE, 2019.
[25] X. Chen, A. Guntuboyina, and Y. Zhang, On Bayes Risk Lower Bounds, The Journal of Machine Learning Research, vol. 17, no. 1, pp. 7687-7744, 2016.
[26] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48, Cambridge University Press, 2019.
[27] T. T. Cai, C.-H. Zhang, H. H. Zhou, et al., Optimal Rates of Convergence for Covariance Matrix Estimation, The Annals of Statistics, vol. 38, no. 4, pp. 2118-2144, 2010.
[28] E. J. Candes and Y. Plan, Tight Oracle Inequalities for Low-Rank Matrix Recovery from a Minimal Number of Noisy Random Measurements, IEEE Transactions on Information Theory, vol. 57, no. 4, pp. 2342-2359, 2011.
[29] F. Bunea, A. B. Tsybakov, M. H. Wegkamp, et al., Aggregation for Gaussian Regression, The Annals of Statistics, vol. 35, no. 4, pp. 1674-1697, 2007.
[30] N. Verzelen, Minimax Risks for Sparse Regressions: Ultra-High Dimensional Phenomenons, Electronic Journal of Statistics, vol. 6, pp. 38-90, 2012.
[31] L. Birgé and P. Massart, Gaussian Model Selection, Journal of the European Mathematical Society, vol. 3, no. 3, pp. 203-268, 2001.
[32] L. Birgé and P. Massart, Minimal Penalties for Gaussian Model Selection, Probability Theory and Related Fields, vol. 138, no. 1-2, pp. 33-73, 2007.
[33] M. F. Duarte and Y. C. Eldar, Structured Compressed Sensing: From Theory to Applications, IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4053-4085, 2011.
[34] G. Raskutti, M. J. Wainwright, and B. Yu, Restricted Eigenvalue Properties for Correlated Gaussian Designs, Journal of Machine Learning Research, vol. 11, pp. 2241-2259, 2010.
[35] A. Javanmard, A. Montanari, et al., Debiasing the Lasso: Optimal Sample Size for Gaussian Designs, The Annals of Statistics, vol. 46, no. 6A, pp. 2593-2622, 2018.
[36] Z.-D. Bai and Y.-Q. Yin, Limit of the Smallest Eigenvalue of a Large Dimensional Sample Covariance Matrix, in Advances in Statistics, pp. 108-127, World Scientific, 2008.
[37] M. Rudelson and R. Vershynin, The Least Singular Value of a Random Square Matrix is $O(n^{-1/2})$, arXiv preprint arXiv:0805.3407, 2008.
[38] S. J. Szarek, Spaces with Large Distance to $\ell_\infty^n$ and Random Matrices, American Journal of Mathematics, vol. 112, no. 6, pp. 899-942, 1990.
[39] R. C. Thompson, Principal Submatrices IX: Interlacing Inequalities for Singular Values of Submatrices, Linear Algebra and its Applications, vol. 5, no. 1, pp. 1-12, 1972.
[40] F. Wei, Upper Bound for Intermediate Singular Values of Random Matrices, Journal of Mathematical Analysis and Applications, vol. 445, no. 2, pp. 1530-1547, 2017.
[41] R. J. Hanson and C. L. Lawson, Extensions and Applications of the Householder Algorithm for Solving Linear Least Squares Problems, Mathematics of Computation, pp. 787-812, 1969.