# Risk and Regret of Hierarchical Bayesian Learners

**Jonathan H. Huggins** (JHUGGINS@MIT.EDU), Computer Science and Artificial Intelligence Laboratory, MIT
**Joshua B. Tenenbaum** (JBT@MIT.EDU), Brain and Cognitive Science Department, MIT

*Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37.*

## Abstract

Common statistical practice has shown that the full power of Bayesian methods is not realized until hierarchical priors are used, as these allow for greater robustness and the ability to share statistical strength. Yet it is an ongoing challenge to provide a learning-theoretically sound formalism of such notions that: offers practical guidance concerning when and how best to utilize hierarchical models; provides insights into what makes for a good hierarchical prior; and, when the form of the prior has been chosen, can guide the choice of hyperparameter settings. We present a set of analytical tools for understanding hierarchical priors in both the online and batch learning settings. We provide regret bounds under log-loss, which show how certain hierarchical models compare, in retrospect, to the best single model in the model class. We also show how to convert a Bayesian log-loss regret bound into a Bayesian risk bound for any bounded loss, a result which may be of independent interest. Risk and regret bounds for Student's t and hierarchical Gaussian priors allow us to formalize the concepts of robustness and sharing statistical strength. Priors for feature selection are investigated as well. Our results suggest that the learning-theoretic benefits of using hierarchical priors can often come at little cost on practical problems.

## 1. Introduction

There are two standard justifications for the use of hierarchical models. The first is that they allow for the representation of greater uncertainty by placing hyperpriors on the hyperparameters of the prior distribution (Berger, 1985; Bernardo & Smith, 2000; Gelman et al., 2013). By explicitly modeling the additional uncertainty, there is greater robustness to misspecification and unexpected data. The second is that hierarchical models permit the sharing of statistical strength between related observations or cohorts (Gelman et al., 2013). For example, take the recent "Big Bayes Stories" special issue of the journal Statistical Science, which was comprised of short articles describing successful applications of Bayesian models to a diverse range of problems, including political science, astronomy, and public health (Mengersen & Robert, 2014). Most of the Bayesian models were hierarchical, and the need for robustness and the sharing of statistical strength because of limited data were commonly cited by the practitioners as reasons for choosing a hierarchical Bayesian approach. Gelman & Hill (2006) and Gelman et al. (2013) both contain further examples of problems in which hierarchical modeling is critical to obtaining high-quality inferences.

Within the machine learning and vision literature, Salakhutdinov et al. (2011) offers an illustrative case study in the benefits and the pitfalls of employing a hierarchical model. The motivation of Salakhutdinov et al. (2011) was that, for image classification tasks, some categories of objects (e.g., "car" or "dog") have many labeled positive and negative examples while other, visually related, categories (e.g., "bus" or "anteater") have only a few labeled examples.
Fig. 1(right, a) shows the distribution of training examples for the 200 object categories used, while Fig. 1(right, b) shows the same distribution, but now with objects grouped with those of similar appearance. In both cases, the distributions are fat-tailed: there are a few categories with many training examples and many categories with only a few training examples. It was hypothesized that, by using a hierarchical Bayesian model, the classes with large amounts of labeled data could be used to construct better classifiers for the classes with small amounts of labeled data.

*Figure 1. Left: Improvement in classification accuracy of hierarchical models compared to the flat model; object categories are sorted by improvement. Right: (a) Distribution of training examples per object class. (b) Same as (a), but with objects grouped by visual appearance. Reproduced and reconstructed from Salakhutdinov et al. (2011).*

The model used by Salakhutdinov et al. (2011), which we will analyze in Section 4.2, consisted of a hierarchical Gaussian prior with a logistic regression likelihood. Two-level, one-level, and flat priors were all tested. The advantage of the two-level prior was that it could encode information about which object classes contain visually similar objects (e.g., "car" and "truck", "dog" and "horse"). Fig. 1(left) compares the predictive performance of this two-level hierarchical prior with the two more impoverished priors. Observe that the one-level and two-level priors both improve performance on most object classes compared to the flat prior, but not all. Furthermore, the two-level prior always leads to greater improvement than the one-level prior on object classes where a hierarchical model helps, but it also almost always leads to a greater degradation in performance on object classes where the hierarchical models decrease performance. Why the different performance characteristics for the two hierarchical models? Why do some categories have improved accuracy while others have decreased accuracy? In a post-hoc analysis, Salakhutdinov et al. (2011) note that the objects with the largest improvement "borrow visual appearance from other frequent objects," while objects with the largest decrease (such as "umbrella" and "merchandise") "are abstract, and their visual appearance is very different from other object categories."

The results just described lead to numerous theoretical questions of practical consequence:

- **Q1** Can we formalize why, for some object classes, there was a beneficial sharing of statistical strength, while for other classes the sharing was detrimental?
- **Q2** Can we understand when a flat model should be preferred to a hierarchical one to avoid unfavorable sharing?
- **Q3** More generally, can we obtain guidance on the best type of prior for the problem at hand? Perhaps a different hierarchical prior would have been better suited to learning the image classifiers. For example, could placing hyperpriors on the variance parameters lead to greater robustness for object categories such as "umbrella" and "merchandise", whose visual appearance differs from other object categories?
- **Q4** Once the form of the prior has been chosen, how should hyperparameters be set to maximize learning? The settings of the variance hyperparameters were left unspecified by Salakhutdinov et al. (2011), and it is not clear a priori how they should be set, or how much effect their choice will have on learning.

While we have primarily framed these questions in terms of a single model from one paper, this focus was simply for concreteness. Similar results leading to the same types of questions can be found in the numerous articles that make
use of hierarchical Bayesian methods. For example, one might instead consider the hierarchical models that have been used in political science for analyzing polling and census data to predict election outcomes (Ghitza & Gelman, 2013) and in demography for predicting population growth, life expectancy, and fertility rates (Raftery et al., 2013; 2012; Alkema et al., 2011).

In this paper we seek to answer the questions just posed in terms of two learning-theoretic quantities: regret (in online learning) and statistical risk (in batch learning). The online learning setting applies, for example, to the demography applications and election prediction, while the batch setting is relevant to the image classification problem as well as election prediction (whether the online or batch analysis applies to election prediction depends on how the problem is formulated). In the online learning framework (Dawid & Vovk, 1999; Cesa-Bianchi & Lugosi, 2006), no assumptions are made about the data-generating mechanism. Inputs are presented to the learner one by one. After receiving each input, the learner predicts the output, then suffers a loss after observing the true output. The goal of the learner is to not do much worse (i.e., not have large regret) compared to a fixed class of predictors. Online learning guarantees are attractive for the analysis of hierarchical Bayesian models because such models are so often used in exactly those circumstances when orthodox Bayesian justifications do not apply: typically the modeler does not think that her model reflects the true data-generating process, but is instead employing hierarchical methods either to increase robustness against a poor choice of hyperparameters or to speed learning by allowing for the sharing of statistical strength between populations.

Regret bounds, however, do not themselves give any generalization guarantees about how the learner will perform on future data. Statistical risk bounds provide guarantees about the learner's expected loss on unseen examples by making assumptions about how the data are generated, for example from an i.i.d. or strongly mixing process. Although there is a stochastic assumption, risk bounds also do not assume that the data are generated according to the model used. We derive a general result for transferring Bayesian regret bounds to risk bounds for bounded losses.

Regret bounds for a number of Bayesian models have previously been developed (Vovk, 2001; Kakade & Ng, 2004; Kakade et al., 2005; Banerjee, 2006; 2007; Seeger et al., 2008), with a particular focus on regression and simple priors such as independent Gaussian distributions for each regression coefficient. For a discussion of more general (but asymptotic) Bayesian regret bounds for exponential families and other sufficiently regular model classes, see Grünwald (2007, Chapter 8). We follow the approach originally taken in Kakade & Ng (2004), and further explored in Kakade et al. (2005), Banerjee (2006), and Seeger et al. (2008), which applies to a large class of Bayesian generalized linear models (GLMs). We extend the technique to apply to certain non-GLM likelihoods as well, including multi-class logistic regression. All proofs are deferred to the Supplementary Material.
We answer Questions Q1–Q4 in some important cases by deriving regret bounds for three types of hierarchical priors. First, we consider the use of an inverse gamma hyperprior on the Gaussian prior's variance parameter and demonstrate that the hyperprior leads to greater robustness to data that is well explained by setting the GLM parameter vector to have very large $\ell_2$ norm. Next, we analyze hierarchical Gaussian models that allow for the sharing of statistical strength. Our results, which complement existing work on transfer and multitask learning theory (Baxter, 1997; Ben-David & Schuller, 2003; Pentina & Lampert, 2014), show that when the parameters with small regret for a collection of related tasks are either (a) similar or (b) not unexpected under the prior, then the hierarchical model has a smaller regret bound than assuming the tasks are independent. Finally, we show that spike-and-slab priors can exploit sparse parameters with small regret.

## 2. Bayesian Online Learning

In online learning, the learner must predict (a distribution over) $y \in \mathcal{Y} \subseteq \mathbb{R}$ after observing $x \in \mathcal{X} \subseteq \mathbb{R}^n$. In this paper, we assume the prediction is made according to a generalized linear model (GLM) $p(y \mid x, \theta) = p(y \mid \theta^\top x)$, where $\theta \in \Theta \subseteq \mathbb{R}^n$ is a parameter vector to be chosen. GLMs provide significant modeling flexibility, and the class of GLM models and priors we analyze includes a range of models used in real-world scientific applications (Gelman & Hill, 2006; Gelman et al., 2013). Two widely used GLMs are the logistic regression likelihood

$$p(y \mid \theta, x) = \frac{1}{1 + \exp(-y\,\theta^\top x)}, \qquad y \in \{-1, 1\}, \tag{1}$$

and the Gaussian linear regression likelihood $p(y \mid \theta, x) = \mathcal{N}(y \mid \theta^\top x, \sigma^2)$, $y \in \mathbb{R}$. Since we are taking a Bayesian approach, we place a prior density $p_0(\theta)$ on $\Theta$, with corresponding distribution $P_0$.¹

At time step $t$, the learner observes $x_t$, outputs a distribution over $\mathcal{Y}$, then observes $y_t \in \mathcal{Y}$. The Bayesian (model average) learner predicts $p(y \mid x_t, Z_{t-1})$, where $Z_t \triangleq \{(x_1, y_1), \dots, (x_t, y_t)\}$, and then suffers the log-loss $-\ln p(y_t \mid x_t, Z_{t-1})$. Hence, the cumulative loss incurred is $L_{\mathrm{Bayes}}(Z_T) \triangleq -\sum_{t=1}^T \ln p(y_t \mid x_t, Z_{t-1})$. If $Q$ is a distribution over $\theta$, then using $Q$ for prediction leads to loss on example $t$ of $\ell_t(Q) \triangleq \mathbb{E}_Q[-\ln p(y_t \mid x_t, \theta)]$ and hence cumulative loss $L_Q(Z_T) \triangleq \mathbb{E}_Q\big[-\sum_{t=1}^T \ln p(y_t \mid x_t, \theta)\big]$. If $Q = \delta_\theta$, then we write $L_\theta$ instead of $L_Q$, so $L_Q(Z_T) = \mathbb{E}_Q[L_\theta(Z_T)]$. Our objective is to derive regret bounds of the form

$$R(Z_T, \theta) \triangleq L_{\mathrm{Bayes}}(Z_T) - L_\theta(Z_T) \le B(\theta) + C(T), \tag{2}$$

where $R(Z_T, \theta)$ is the regret and $B(\theta) + C(T)$ is a regret bound depending on the choice of prior $P_0$. We aim for $C(T) = o(T)$, so that for a fixed $\theta$, the average loss $T^{-1}L_{\mathrm{Bayes}}(Z_T)$ is bounded by $T^{-1}L_\theta(Z_T) + o(1)$.

Our approach to bounding $L_{\mathrm{Bayes}}(Z_T)$ follows that of previous work on Bayesian GLM regret bounds with log-loss (Kakade & Ng, 2004; Kakade et al., 2005; Seeger et al., 2008), relying on the following well-known result:

**Proposition 2.1** (Kakade & Ng (2004); Banerjee (2006)). *The Bayesian cumulative loss is bounded as*

$$L_{\mathrm{Bayes}}(Z_T) \le L_Q(Z_T) + \mathrm{KL}(Q \| P_0). \tag{3}$$

For the GLM model $p(y \mid \theta^\top x)$, define $f_y(z) \triangleq -\ln p(y \mid \theta^\top x = z)$. We make two assumptions throughout the remainder of the paper (they will usually not be stated explicitly):

$$|f_y''(z)| \le c \ \text{ for all } y, z \tag{A1}$$
$$\|x_t\|_2 \le 1 \ \text{ for all } t. \tag{A2}$$

The first assumption can be understood as requiring the likelihood to be sufficiently smooth. The second assumption sets the scale of the problem, which is necessary since scaling $x_t$ up requires scaling $\theta$ down, and vice versa: $p(y \mid (C^{-1}\theta)^\top (Cx)) = p(y \mid \theta^\top x)$ for any $C \neq 0$. Note that for the Gaussian linear regression model with variance $\sigma^2$ and the logistic regression model, (A1) holds with $c = 1/\sigma^2$ and $c = 1/2$, respectively.

¹ Throughout, we use lowercase letters for densities and uppercase letters to denote the corresponding measures.
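To make the quantities just defined concrete, here is a minimal sketch (ours, not from the paper) of the Bayesian model-average learner for one-dimensional logistic regression with a $\mathcal{N}(0, \sigma^2)$ prior. The posterior is maintained on a grid of $\theta$ values purely for illustration; the function returns $L_{\mathrm{Bayes}}(Z_T)$, the loss $L_\theta(Z_T)$ of the best fixed $\theta$ on the grid, and their difference, the regret.

```python
import numpy as np

def bayes_online_logistic_regret(xs, ys, sigma2=1.0, grid=None):
    """Cumulative log-loss of the Bayesian model-average learner for 1-D
    logistic regression with a N(0, sigma2) prior, versus the best fixed theta.
    The posterior is tracked on a discrete grid, so this is illustrative only."""
    if grid is None:
        grid = np.linspace(-10.0, 10.0, 2001)
    log_post = -0.5 * grid**2 / sigma2         # unnormalized log prior
    loglik = np.zeros_like(grid)               # cumulative log-likelihood per theta
    loss_bayes = 0.0
    for x, y in zip(xs, ys):
        ll = -np.logaddexp(0.0, -y * grid * x)       # ln p(y | theta, x), eq. (1)
        w = np.exp(log_post - log_post.max())
        w /= w.sum()                                 # current posterior weights
        loss_bayes -= np.log(np.dot(w, np.exp(ll)))  # -ln p(y_t | x_t, Z_{t-1})
        log_post += ll                               # Bayesian update
        loglik += ll
    loss_best = -loglik.max()                        # L_theta for the best fixed theta
    return loss_bayes, loss_best, loss_bayes - loss_best

rng = np.random.default_rng(0)
T, theta_true = 200, 2.0
xs = rng.uniform(-1.0, 1.0, T)                       # satisfies (A2)
ys = np.where(rng.random(T) < 1 / (1 + np.exp(-theta_true * xs)), 1, -1)
print(bayes_online_logistic_regret(xs, ys))
```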
Proposition 2.1 leads to the following theorem for obtaining regret bounds for the Bayesian model average learner.

**Theorem 2.2** (Bayesian regret meta-theorem). *Let $Q_{\theta^*,\phi}$ be a distribution with parameter $\phi \in \Phi \subseteq \mathbb{R}^d$ (written $\phi$ if $d = 1$) and mean $\theta^*$. If (A1) and (A2) hold, then for all $\phi$,*

$$R(Z, \theta^*) \le \frac{Tc}{2}\,\big\|\mathrm{Var}_{Q_{\theta^*,\phi}}[\theta]\big\| + \mathrm{KL}(Q_{\theta^*,\phi} \| P_0),$$

*where $\|\cdot\|$ is the spectral norm. In particular, if the components of $\theta$ are uncorrelated, then $\|\mathrm{Var}_{Q_{\theta^*,\phi}}[\theta]\| = \sup_i \mathrm{Var}_{Q_{\theta^*,\phi}}[\theta_i]$.*

Theorem 2.2 is our first main result and will be repeatedly applied in Section 4 by choosing an appropriate $Q_{\theta^*,\phi}$ and then optimizing $\phi$. Although the bound appears to be linear in $T$, typically $\phi$ can be chosen such that $\|\mathrm{Var}_{Q_{\theta^*,\phi}}[\theta]\| = \Theta(T^{-1})$ and $\mathrm{KL}(Q_{\theta^*,\phi} \| P_0) = \Theta(\ln T)$, leading to a logarithmic regret bound. Theorem 2.2 provides an attractive approach to deriving regret bounds because there is no need to work directly with the posterior, which is often analytically intractable. For example, there is no closed-form expression for the posterior of the Bayesian logistic regression model. The theorem generalizes the approach originally taken in Kakade & Ng (2004), in which a Gaussian prior for $\theta$ was considered:

**Theorem 2.3** (Gaussian regret (Kakade & Ng, 2004)). *If $\theta \sim \mathcal{N}(0, \sigma^2 I)$, then $R(Z, \theta^*)$ is bounded by*

$$R^{\mathrm{G}}_{\mathrm{Bayes}}(Z, \theta^*) \le \frac{1}{2\sigma^2}\|\theta^*\|^2 + \frac{n}{2}\ln\!\left(1 + \frac{Tc\sigma^2}{n}\right).$$

### 2.1. Beyond GLMs

Theorem 2.2 follows from a more general result, Theorem 2.4, which allows for non-GLM likelihoods. Specifically, instead of the likelihood being a GLM, we assume the likelihood can be written in the form $p(y \mid x, \xi, \psi) = p(y \mid \xi x, \psi)$, where $\xi \in \mathbb{R}^{n' \times n}$ is a matrix and $\psi \in \mathbb{R}^{n''}$. The full parameter vector is $\theta = (\xi, \psi) \in \mathbb{R}^N$, $N \triangleq n'n + n''$ (implicitly flattening the matrix $\xi$). Let $f_y(z) \triangleq -\ln p(y \mid (\xi x, \psi) = z)$. We require the following assumption in place of (A1):

$$\|f_y''(z)\| \le c \ \text{ for all } y, z, \tag{A1'}$$

where $f_y''(z)$ denotes the matrix of second partial derivatives (Hessian).

**Theorem 2.4** (Generalized Bayesian regret meta-theorem). *Let $Q_{\theta^*,\phi}$ be a distribution with parameter $\phi \in \Phi \subseteq \mathbb{R}^d$ and mean $\theta^*$. If (A1') and (A2) hold, then for all $\phi$,*

$$R(Z, \theta^*) \le \frac{Tc(n' + n'')}{2}\,\big\|\mathrm{Var}_{Q_{\theta^*,\phi}}[\theta]\big\| + \mathrm{KL}(Q_{\theta^*,\phi} \| P_0).$$

Of particular interest is that Theorem 2.4 can handle multi-class logistic regression (MLR). In multi-class regression, each example $x_t$ has one of $K$ labels $y_t \in \{1, \dots, K\}$, indicating which class the example belongs to. For MLR, each class $k$ has an associated parameter $\theta^{(k)}$. The parameters are combined into a single likelihood:

$$p(y_t \mid \theta, x_t) = \frac{\exp(\theta^{(y_t)\top} x_t)}{\sum_{k=1}^K \exp(\theta^{(k)\top} x_t)}. \tag{4}$$

**Theorem 2.5** (MLR Gaussian regret). *If $\theta^{(k)} \sim \mathcal{N}(0, \sigma^2 I)$, $k = 1, \dots, K$, then using the MLR likelihood guarantees that $R(Z, \theta^*)$ is bounded by*

$$R^{\mathrm{mlr\text{-}G}}_{\mathrm{Bayes}}(\theta^*, Z) \le \frac{1}{2\sigma^2}\|\theta^*\|^2 + \frac{nK}{2}\ln\!\left(1 + \frac{TKc\sigma^2}{nK}\right).$$
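To illustrate how Theorem 2.2 is applied, take $Q_{\theta^*,\phi} = \mathcal{N}(\theta^*, \phi I)$ with prior $P_0 = \mathcal{N}(0, \sigma^2 I)$ and minimize the right-hand side over $\phi$; the minimum matches the closed-form bound of Theorem 2.3. The sketch below (our code, with hypothetical parameter values) checks this numerically.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def meta_bound(phi, T, c, sigma2, theta_norm2, n):
    """RHS of Theorem 2.2 with Q = N(theta*, phi I) and prior P0 = N(0, sigma2 I):
    (T c / 2) * phi + KL(Q || P0)."""
    kl = 0.5 * (n * phi / sigma2 + theta_norm2 / sigma2 - n + n * np.log(sigma2 / phi))
    return 0.5 * T * c * phi + kl

T, c, sigma2, n, theta_norm2 = 1000, 0.5, 1.0, 20, 9.0
res = minimize_scalar(meta_bound, bounds=(1e-8, sigma2), method="bounded",
                      args=(T, c, sigma2, theta_norm2, n))
closed_form = theta_norm2 / (2 * sigma2) + 0.5 * n * np.log(1 + T * c * sigma2 / n)
print(res.fun, closed_form)  # the two agree: Theorem 2.3 is the optimized meta-bound
```

Note that the minimizer is $\phi = \sigma^2/(1 + Tc\sigma^2/n) = \Theta(T^{-1})$, matching the discussion after Theorem 2.2.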
## 3. Risk Bounds

While online regret bounds are attractive because they make no assumptions about the data-generating process, it is also desirable to have risk bounds in the batch setting, since risk bounds provide generalization guarantees for unseen data. We now develop a connection between regret and risk bounds via a PAC-Bayesian analysis (McAllester, 2003; Audibert & Bousquet, 2007; Catoni, 2007). Such bounds also have the benefit of applying to any bounded loss (e.g., the 0-1 loss for binary classification), which may be more task-relevant than the log-loss.

In the batch setting, the data $Z_T$ are received all at once by the learner and are assumed to be distributed i.i.d. according to some distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$: $(x_t, y_t) \overset{\text{i.i.d.}}{\sim} \mathcal{D}$, $t = 1, \dots, T$. Let $\ell$ be a bounded loss function taking a probability distribution over $\mathcal{Y}$ and an element of $\mathcal{Y}$ as arguments. Without loss of generality assume $\ell \in [0, 1]$. Writing $\ell_\theta(x, y) \triangleq \ell(p(\cdot \mid x, \theta), y)$, for any distribution $Q$ over $\Theta$, let

$$L(Q) \triangleq \mathbb{E}_{(x,y)\sim\mathcal{D}}\,\mathbb{E}_{\theta\sim Q}[\ell_\theta(x, y)], \qquad \hat{L}(Q, Z_T) \triangleq T^{-1}\sum_{t=1}^T \mathbb{E}_{\theta\sim Q}[\ell_\theta(x_t, y_t)]$$

be, respectively, the expected and empirical losses under $Q$. PAC-Bayesian analyses consider the risk of the Gibbs predictor for the distribution $Q$ (i.e., sample $\theta \sim Q$, predict with $p(\cdot \mid x, \theta)$), not the model average over $Q$ (i.e., predict with $\int p(\cdot \mid x, \theta)\,Q(\mathrm{d}\theta)$). A typical bound (specialized to the Bayesian setting) is the following (here $p_T(\theta) \triangleq p(\theta \mid Z_T)$):

**Theorem 3.1** (Audibert & Bousquet (2007)). *Fix $\kappa > 1/2$ and write $\kappa' \triangleq 2\kappa/(2\kappa - 1)$. For any distribution $\mathcal{D}$, with probability at least $1 - \delta$ over samples $(x_t, y_t) \overset{\text{i.i.d.}}{\sim} \mathcal{D}$,*

$$|L(P_T) - \hat{L}(P_T, Z_T)| \le \sqrt{\frac{\kappa\big(\mathrm{KL}(P_T \| P_0) + \ln(\kappa'/\delta)\big)}{T}}. \tag{5}$$

Combining the PAC-Bayesian risk bound with Bayesian regret bounds leads to our second main result:

**Theorem 3.2.** *Assume that (2) holds and fix $\kappa > 1/2$. For any distribution $\mathcal{D}$, with probability at least $1 - \delta$ over samples $(x_t, y_t) \overset{\text{i.i.d.}}{\sim} \mathcal{D}$,*

$$|L(P_T) - \hat{L}(P_T, Z_T)| \le \sqrt{\frac{\kappa\big(B(\hat{\theta}) + C(T) + \ln(\kappa'/\delta)\big)}{T}}, \tag{6}$$

*where $\hat{\theta} \triangleq \arg\min_\theta L_\theta(Z_T)$.*

An attractive feature of Theorem 3.2 is that the bound does not rely on understanding the posterior $P_T$, as is required by a direct application of a PAC-Bayesian bound such as that given in Theorem 3.1, which requires calculating $\mathrm{KL}(P_T \| P_0)$. Yet the PAC-Bayesian regret bound remains data-dependent due to its dependence on the empirical risk minimizer (ERM) $p(y \mid x, \hat{\theta})$. Examining the proof of Theorem 3.2, it is easily seen that in fact any $\theta'$ such that $L_{\theta'}(Z_T) < L_{P_T}(Z_T)$ can be chosen in place of $\hat{\theta}$. Such alternative choices may lead to significantly tighter bounds and are particularly important, for example, in the application of the theorem to the spike-and-slab prior (cf. Section 4.3), as the ERM parameter will in almost all circumstances satisfy $\|\hat{\theta}\|_0 = n$, which would lead to a poor generalization bound when $n$ is large.

In words, Theorem 3.2 can be understood as stating that if the Bayesian (model average) learner has small log-loss regret compared to the ERM, then with high probability the Bayesian Gibbs predictor will generalize well if the loss function is bounded. Or, as a slogan, the theorem shows that small regret in the online learning setting implies good generalization bounds in the batch setting. The theorem thus connects PAC-Bayesian bounds, Bayesian regret bounds, and empirical risk minimization.
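As a small worked example, the following converts the Gaussian regret bound of Theorem 2.3 into the generalization gap of Theorem 3.2 (our sketch, assuming the $\sqrt{(\cdot)/T}$ form of (6) as reconstructed above, with hypothetical parameter values):

```python
import numpy as np

def generalization_gap(B, C, T, delta=0.05, kappa=1.0):
    """RHS of (6): high-probability bound on |L(P_T) - Lhat(P_T, Z_T)| implied
    by a log-loss regret bound B(theta_hat) + C(T); requires kappa > 1/2."""
    kappa_prime = 2 * kappa / (2 * kappa - 1)
    return np.sqrt(kappa * (B + C + np.log(kappa_prime / delta)) / T)

# Gaussian prior (Theorem 2.3), hypothetical values:
T, c, sigma2, n, theta_hat_norm2 = 10_000, 0.5, 1.0, 20, 9.0
B = theta_hat_norm2 / (2 * sigma2)
C = 0.5 * n * np.log(1 + T * c * sigma2 / n)
print(generalization_gap(B, C, T))  # the gap shrinks as O(sqrt(ln(T) / T))
```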
## 4. Applications

We now use Theorem 2.2 to analyze hierarchical priors for robustness, sharing of statistical strength, and feature selection.

### 4.1. Hierarchical Priors for Robustness

In this section we answer questions Q2–Q4 as they relate to hierarchical priors for robust inference, demonstrating how, with a proper choice of hyperparameters, a hierarchical prior can lead to increased robustness compared to a flat prior. Specifically, we analyze a canonical use of a hierarchical prior to capture greater uncertainty in the value of a parameter by placing a hyperprior on the variance of the Gaussian prior on that parameter (Berger, 1985; Bishop, 2006; Gelman et al., 2013):

$$\sigma_0^2 \mid \alpha, \beta \sim \Gamma^{-1}(\alpha, \beta) \quad\text{and}\quad \theta_i \mid \mu_0, \sigma_0^2 \sim \mathcal{N}(\mu_0, \sigma_0^2),$$

where $\Gamma^{-1}(\alpha, \beta)$ is the inverse gamma distribution with shape $\alpha$ and scale $\beta$. Let $\nu \triangleq 2\alpha$ and $\sigma^2 \triangleq \beta/\alpha$. Then the marginal distribution of $\theta$ follows the multivariate t-distribution with location $\mu_0\mathbf{1}$, scale matrix $\sigma^2 I$, and $\nu$ degrees of freedom: $\theta \mid \mu_0, \sigma^2, \nu \sim \mathcal{T}_\nu(\mu_0\mathbf{1}, \sigma^2 I)$, where $\mathbf{1}$ is the all-ones vector. The multivariate t-distribution density is

$$p_{\mathcal{T}}(\theta \mid \mu, \Sigma, \nu) = \frac{\Gamma\!\big(\tfrac{\nu+n}{2}\big)}{\Gamma\!\big(\tfrac{\nu}{2}\big)\,\pi^{n/2}\nu^{n/2}|\Sigma|^{1/2}}\left(1 + \frac{(\theta - \mu)^\top \Sigma^{-1}(\theta - \mu)}{\nu}\right)^{-\frac{\nu+n}{2}}.$$

When $\nu$ is finite, the multivariate t-distribution is heavy-tailed: the probability of $\theta$ decreases at a polynomial rate as $\|\theta\| \to \infty$, compared to the exponential rate for a multivariate Gaussian. For example, $\nu = 1$ and $\Sigma = \sigma^2 I$ gives the multivariate Cauchy distribution. A multivariate Gaussian with covariance matrix $\Sigma$ is recovered by taking $\nu \to \infty$. Placing a multivariate t-distribution prior on $\theta$ yields the following regret bound:

**Theorem 4.1** (Multivariate t-distribution regret). *If $\theta \sim \mathcal{T}_\nu(0, \sigma^2 I)$, then $R(Z, \theta^*)$ is bounded by*

$$R^{\mathrm{mvt}}_{\mathrm{Bayes}}(Z, \theta^*) \le \frac{\nu + n}{2}\ln\!\left(1 + \frac{\|\theta^*\|^2}{\nu\sigma^2}\right) + \frac{n}{2}\ln\!\left(\frac{(\nu+1)(\nu+n)}{\nu^2} + \frac{Tc(\nu+1)\sigma^2}{\nu n}\right).$$

Theorem 2.3 can be obtained as a special case of Theorem 4.1 by taking $\nu \to \infty$. Assume $\nu \ge 1$. If $\|\theta^*\|^2/(\nu\sigma^2)$ is small, then

$$F(\|\theta^*\|) \triangleq \frac{\nu + n}{2}\ln\!\left(1 + \frac{\|\theta^*\|^2}{\nu\sigma^2}\right) \approx \frac{\nu + n}{2\nu\sigma^2}\,\|\theta^*\|^2,$$

so for small values of $\|\theta^*\|^2$ (relative to $\nu\sigma^2$) the regret bound behaves similarly to having a Gaussian prior on $\theta$. However, if $\|\theta^*\|^2/(\nu\sigma^2) \gg 1$, then the regret bound grows only logarithmically with $\|\theta^*\|$, as we would expect given that the multivariate t-distribution has heavy tails. Roughly speaking, $F(x)$ can be thought of as switching from quadratic to logarithmic behavior when $x^2 = \nu\sigma^2$, since this is the value at which $F$ switches from being convex to concave.

In general, the regret bound is large when the choice of $\theta$ with small loss has large magnitude. If a Gaussian prior is used, the possibility of $\|\theta^*\|$ being large can be ameliorated by choosing $\sigma^2$ large, since there is only a logarithmic regret penalty in $\sigma$. However, without a priori knowledge of how large the optimal $\theta$ might be, choosing a multivariate t-distribution prior with a small value for $\nu$ and a moderate value for $\sigma^2$ allows for guaranteed logarithmic regret in the magnitude of $\theta^*$ no matter how large $\|\theta^*\|$ is. Hence, the use of the hierarchical (multivariate t-distribution) prior does in fact yield greater robustness than the non-hierarchical (Gaussian) prior.²

² A more rigorous version of this statement can be obtained for the Gaussian regression likelihood by using the fact that there is a matching lower bound on the regret for the Gaussian prior/Gaussian regression model (Kakade & Ng, 2004).

We can, in fact, develop more specific guidance on the choice of hyperparameters for the t-distribution. Our goal is to choose $\nu$ such that we obtain a t-distribution regret bound that is essentially as good as the Gaussian prior regret bound $R^{\mathrm{G}}_{\mathrm{Bayes}} \le \frac{\|\theta^*\|^2}{2\sigma^2} + \frac{n}{2}\ln\big(1 + \frac{Tc\sigma^2}{n}\big)$ for small $\|\theta^*\|$ and better when $\|\theta^*\|$ is large. If we choose $\nu$ equal to a constant, then for $n$ much larger than $\nu$, we have

$$R^{\mathrm{mvt}}_{\mathrm{Bayes}} \lessapprox \frac{n}{2}\ln\!\left(1 + \frac{\|\theta^*\|^2}{\nu\sigma^2}\right) + \frac{n}{2}\ln\!\left(\frac{n}{\nu} + \frac{Tc\sigma^2}{n}\right).$$

In the case of $\|\theta^*\|$ small, we therefore have that the first term of $R^{\mathrm{mvt}}_{\mathrm{Bayes}}$ is approximately $\frac{n\|\theta^*\|^2}{2\nu\sigma^2}$, and thus larger than the first term of $R^{\mathrm{G}}_{\mathrm{Bayes}}$ by a factor of $n/\nu$. Furthermore, for small $T \ll \frac{n}{c\sigma^2}$ and any $\theta^*$, the second term of $R^{\mathrm{mvt}}_{\mathrm{Bayes}}$ is approximately $\frac{n}{2}\ln(n/\nu)$, whereas the second term of $R^{\mathrm{G}}_{\mathrm{Bayes}}$ is approximately $\frac{Tc\sigma^2}{2}$. Thus, $R^{\mathrm{mvt}}_{\mathrm{Bayes}}$ with constant $\nu$ is not competitive with $R^{\mathrm{G}}_{\mathrm{Bayes}}$ in the large-$n$ and small-$T$ regimes.
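The trade-off just described is easy to see numerically. The sketch below (our code, using the form of Theorem 4.1 as reconstructed above) evaluates both bounds as $\|\theta^*\|$ grows, for a constant $\nu$ and for the choice $\nu = Cn$ discussed next:

```python
import numpy as np

def gaussian_bound(theta_norm2, T, c, sigma2, n):
    """Theorem 2.3."""
    return theta_norm2 / (2 * sigma2) + 0.5 * n * np.log(1 + T * c * sigma2 / n)

def mvt_bound(theta_norm2, T, c, sigma2, n, nu):
    """Theorem 4.1 (as reconstructed above)."""
    return (0.5 * (nu + n) * np.log(1 + theta_norm2 / (nu * sigma2))
            + 0.5 * n * np.log((nu + 1) * (nu + n) / nu**2
                               + T * c * (nu + 1) * sigma2 / (nu * n)))

T, c, sigma2, n = 1000, 0.5, 1.0, 100
for norm in (1.0, 10.0, 1000.0):
    t2 = norm**2
    print(norm,
          gaussian_bound(t2, T, c, sigma2, n),   # grows quadratically in ||theta*||
          mvt_bound(t2, T, c, sigma2, n, nu=1),  # constant nu: poor when n is large
          mvt_bound(t2, T, c, sigma2, n, nu=n))  # nu = Cn with C = 1: competitive
```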
Instead consider the choice $\nu = Cn$ for constant $C > 0$, so that (up to lower-order terms in $n$)

$$R^{\mathrm{mvt}}_{\mathrm{Bayes}} \lessapprox \frac{(C+1)n}{2}\ln\!\left(1 + \frac{\|\theta^*\|^2}{Cn\sigma^2}\right) + \frac{n}{2}\ln\!\left(\frac{C+1}{C} + \frac{Tc\sigma^2}{n}\right).$$

In this case, by choosing a moderate value of $C$, we see that a multivariate t-distribution prior with $\nu = Cn$ has a competitive regret bound with a Gaussian prior in the small-$\|\theta^*\|$ regime, and an exponentially smaller regret bound as $\|\theta^*\|$ becomes large. Furthermore, the t-distribution prior remains competitive with the Gaussian prior when $T$ is small.

### 4.2. Hierarchical Priors for Sharing Statistical Strength

#### 4.2.1. Background

We next consider hierarchical priors that allow for the sharing of statistical strength, providing answers to Q1 and Q2: we specify some conditions under which sharing of statistical strength can be achieved and others in which a non-hierarchical prior is preferable.³ In the machine learning literature, the goal of sharing statistical strength has been formalized via multitask learning (MTL) and learning-to-learn (LTL) frameworks. A number of theoretical investigations of MTL and LTL have been carried out, beginning with a series of papers by Baxter (cf. Baxter, 1997; 2000). Generically, such MTL and LTL frameworks involve two or more learning problems that are related to each other in some manner. The learning properties are investigated as the number of tasks and/or the number of examples from each task is increased. Baxter (2000) and Ben-David & Schuller (2003) give sample complexity bounds based on classical ideas from statistical and PAC learning theory. Baxter (1997) examines the asymptotic learning properties of hierarchical Bayesian models. Pentina & Lampert (2014) take a PAC-Bayesian approach, while Hassan Mahmud & Ray (2007), Hassan Mahmud (2009), and Juba (2006) develop notions of task-relatedness from an (algorithmic) information-theoretic perspective.

³ For simplicity, results are for Gaussian priors, though the extension to multivariate t-distribution priors is straightforward.

Typically, tasks are equated with probability distributions over examples (e.g., $(x, y)$ pairs). It is assumed that the tasks are drawn i.i.d. from an unknown task distribution. The goal is to learn the individual tasks and to learn about the task distribution. Alternatively, a notion of similarity can be used to relate the tasks: the more similar the tasks, the greater the advantage of learning them using multitask algorithms. In the online learning framework no assumptions are made about the distribution of examples, so we consider two MTL scenarios in line with the latter setting. In the first scenario, one example from one task is received at each time step. In the second, which is described in the Supplementary Material, at each time step an example for each task is received simultaneously.

#### 4.2.2. Sequential Observations from Multiple Sources

The sequential observation setting is relevant to the image classification example given in the introduction, in which there are many observations from some data sources and only a small number of observations from numerous other data sources. To model this situation, at time step $t$, an input $x_t$ from source $z_t$ is observed, where $z_t \in \{1, \dots, K\}$. The learner predicts $y_t$ according to the posterior of $\theta^{(z_t)}$ given $Z_{t-1}$. An equivalent formulation is that the Bayesian learner observes $x_t = (0, \dots, x_t^{(k)}, 0, \dots)$ (if $z_t = k$) at each time step, then receives $y_t$.
Instead of using independent Gaussian priors on $\theta^{(1)}, \dots, \theta^{(K)}$, place a prior over the means of the $K$ priors. For each dimension $j = 1, \dots, n$, let

$$\mu_j \mid \sigma_0^2 \sim \mathcal{N}(0, \sigma_0^2) \quad\text{and}\quad \theta_j^{(k)} \mid \mu_j, \sigma^2 \sim \mathcal{N}(\mu_j, \sigma^2), \quad k = 1, \dots, K,$$

and write $\theta_j^{(1:K)} \triangleq (\theta_j^{(1)}, \dots, \theta_j^{(K)})$. Integrating out $\mu_j$ yields

$$\theta_j^{(1:K)} \mid \sigma_0^2, \sigma^2 \sim \mathcal{N}(0, \Sigma), \tag{8}$$

where, with $\mathbf{1}_K$ denoting the $K \times K$ all-ones matrix,

$$\Sigma \triangleq s^2\rho\,\mathbf{1}_K + s^2(1 - \rho)I, \tag{9}$$
$$s^2 \triangleq \sigma_0^2 + \sigma^2, \qquad \rho \triangleq \sigma_0^2/(\sigma_0^2 + \sigma^2). \tag{10}$$

This prior corresponds to the one-level prior in Salakhutdinov et al. (2011). Similar results, which will be discussed qualitatively below, can be obtained for the two-level prior at the cost of a significantly more complicated bound. Define $T^{(k)} \triangleq \sum_{t=1}^T \delta_k(z_t)$, $Z^{(k)} \triangleq \{(x_t, y_t) \in Z : z_t = k\}$, and $\gamma^2 \triangleq K\sigma_0^2 + \sigma^2$.

**Theorem 4.2** (Hierarchical Gaussian regret, sequential observations). *If $\theta_j^{(1:K)} \sim \mathcal{N}(0, \Sigma)$, $j = 1, \dots, n$, then $R(Z, \theta^*)$ is bounded by*

$$R^{\mathrm{HG\text{-}seq}}_{\mathrm{Bayes}}(Z, \theta^*) \le \frac{1}{2\gamma^2}\sum_{k=1}^K \|\theta^{*(k)}\|^2 + \frac{\sigma_0^2}{\sigma^2\gamma^2}\sum_{k<\ell} \|\theta^{*(k)} - \theta^{*(\ell)}\|^2 + \frac{n}{2}\ln\!\left(1 + \frac{K\sigma_0^2}{\sigma^2}\right) + \frac{n}{2}\sum_{k=1}^K \ln\!\left(1 - \frac{\sigma_0^2}{\gamma^2} + \frac{T^{(k)}c\sigma^2}{n}\right). \tag{11}$$

It is instructive to compare the upper bound given in (11) to $\sum_k R^{\mathrm{G}}_{\mathrm{Bayes}}(Z^{(k)}, \theta^{*(k)})$ with prior variance $s^2 = \sigma_0^2 + \sigma^2$. Setting $\sigma_0 = \sigma$ (and taking $K = 2$) yields a condition for the hierarchical model to have a smaller regret bound than the non-hierarchical model:

$$4\|\theta^{*(1)} - \theta^{*(2)}\|^2 + 3s^2 n \sum_{k=1}^2 \ln\frac{4n + 3T^{(k)}cs^2}{3n + 3T^{(k)}cs^2} \le \|\theta^{*(1)}\|^2 + \|\theta^{*(2)}\|^2 + 0.863\,s^2 n. \tag{12}$$

Of particular interest is the one-shot learning scenario, in which only one observation (or a small number of observations) from a source is made while many observations are made from some other sources. This setting is exactly that of the image classification problem of Salakhutdinov et al. (2011). For concreteness, consider a large-data task with $T^{(1)} \gg \frac{n}{cs^2}$ and a small-data task with $T^{(2)} = 2$, so that $\ln\frac{4n + 3T^{(1)}cs^2}{3n + 3T^{(1)}cs^2} \approx 0$ and (12) becomes (approximately)

$$4\|\theta^{*(1)} - \theta^{*(2)}\|^2 + 3s^2 n \ln\frac{4n + 6cs^2}{3n + 6cs^2} \le \|\theta^{*(1)}\|^2 + \|\theta^{*(2)}\|^2 + 0.863\,s^2 n.$$

But even for $n = 1$, $3\ln\frac{4n + 6cs^2}{3n + 6cs^2} < 0.863$, so the hierarchical model has a smaller regret bound as long as $4\|\theta^{*(1)} - \theta^{*(2)}\|^2 \le \|\theta^{*(1)}\|^2 + \|\theta^{*(2)}\|^2 + Cs^2 n$ for some constant $0 < C < 0.863$.

Hierarchical models for one-shot learning are designed with the goal of providing good predictive power on the new problem (the second data source) even with a small number of examples from that problem. To see if this is in fact the case for the hierarchical prior considered here, we can investigate how much greater the regret bound is for $T^{(2)} > 0$ than for $T^{(2)} = 0$, with $T^{(1)} \gg \frac{n}{cs^2}$ fixed. With $\sigma_0^2 = \sigma^2$, (11) is greater in the former ($K = 2$) than the latter ($K = 1$) scenario by at most

$$-\frac{\|\theta^{*(1)}\|^2}{6s^2} + \frac{\|\theta^{*(2)}\|^2}{3s^2} + \frac{2\|\theta^{*(1)} - \theta^{*(2)}\|^2}{3s^2} + \frac{n}{2}\ln\!\left(1 + \frac{3T^{(2)}cs^2}{4n}\right).$$

So if $\|\theta^{*(2)}\|$ is small and $\theta^{*(2)}$ and $\theta^{*(1)}$ are close in $\ell_2$ distance, the regret bound for the second source is small.
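To see the sharing condition in action, here is a sketch (our code, using the reconstruction of (12) above, with synthetic task parameters) that checks whether the hierarchical prior has the smaller regret bound for a pair of tasks:

```python
import numpy as np

def hierarchical_wins(theta1, theta2, s2, c, T1, T2):
    """Condition (12) with sigma_0 = sigma and K = 2: True when the hierarchical
    Gaussian prior has the smaller regret bound (note 0.863 ~ 3 ln(4/3))."""
    n = theta1.size
    lhs = 4 * np.sum((theta1 - theta2) ** 2)
    lhs += 3 * s2 * n * sum(np.log((4 * n + 3 * T * c * s2) /
                                   (3 * n + 3 * T * c * s2)) for T in (T1, T2))
    rhs = np.sum(theta1**2) + np.sum(theta2**2) + 3 * np.log(4 / 3) * s2 * n
    return lhs <= rhs

rng = np.random.default_rng(1)
base = rng.normal(size=50)
print(hierarchical_wins(base, base + 0.1 * rng.normal(size=50),
                        2.0, 0.5, 1000, 2))                 # similar tasks: True
print(hierarchical_wins(base, -base, 2.0, 0.5, 1000, 2))    # dissimilar tasks: False
```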
The regret bound for the two-level prior in Salakhutdinov et al. (2011) is quite similar to that for the one-level prior. Let $s_k \in \{1, \dots, S\}$ denote the superclass of class $k$. In the case of image classification, object classes that have similar visual appearance would have a common superclass. The two-level prior consists of an overall parameter prior $\beta \sim \mathcal{N}(0, \sigma_0^2 I)$, superclass parameter priors $\mu^{(s)} \sim \mathcal{N}(\beta, \sigma_1^2 I)$, and class parameter priors $\theta^{(k)} \sim \mathcal{N}(\mu^{(s_k)}, \sigma_2^2 I)$. The regret bound for the two-level prior is

$$c_0 \sum_{k=1}^K \|\theta^{*(k)}\|^2 + \sum_{k<\ell} c_{k\ell}\,\|\theta^{*(k)} - \theta^{*(\ell)}\|^2 + \sum_{k=1}^K O\big(\ln(c_1 + c_2 T^{(k)})\big) + O(1), \tag{13}$$

where $c_0, c_1, c_2 > 0$ are constants, $c_{k\ell} = c_{s_k}$ if $s_k = s_\ell$, and $c_{k\ell} = c_{s_k s_\ell}$ if $s_k \neq s_\ell$. Furthermore, $c_s > c_{ss'}$ for all $s, s' \in \{1, \dots, S\}$. Hence, the regret bound's being small depends more on the parameter vectors in the same superclass being close to each other than on parameter vectors from different superclasses being close to each other. See the Supplementary Material for details.

The two-level regret bound explains the results of Salakhutdinov et al. (2011) well. The poor performance on image classes with very different visual appearance from the other classes is unsurprising, since the parameter vectors that predict these classes well are going to have large $\ell_2$ distance from the parameter vectors of other object classes.

### 4.3. Hierarchical Priors for Feature Selection

A shortcoming of the priors investigated so far is the poor dependence on the feature-space dimension $n$. For example, the Gaussian prior regret bound is (approximately)

$$L_{\mathrm{Bayes}}(Z) \le \inf_\theta\left(L_\theta(Z) + \frac{1}{2\sigma^2}\|\theta\|^2\right) + \frac{Tc\sigma^2}{2}$$

when $Tc\sigma^2 \ll n$, so the regret may grow linearly in this regime.⁴ In the infinite-dimensional case, Gaussian processes can be used while still obtaining meaningful regret bounds (Kakade et al., 2005; Seeger et al., 2008). However, methods that are applicable to high-dimensional problems for which $n \gg T$ but $n$ is still finite are of great general interest. For example, in the image classification example from the introduction, the feature vector has $n \approx 5000$, whereas most object classes have fewer than 200 training examples.

⁴ Dimension-independent regret bounds for the priors already considered can be obtained, but at the price of a constant greater than one in front of the $L_{\theta^*}(Z)$ term. See, e.g., Banerjee (2007).

In high-dimensional problems it is desirable to use feature selection or sparse methods to reduce the effective dimension of the problem, with the aim of achieving better generalization performance and increasing the interpretability of the model. A popular non-Bayesian approach for inducing sparsity is $\ell_1$ regularization, such as the lasso for linear regression (Tibshirani, 1996). A Bayesian approach is the Bayesian lasso: the $\ell_1$ regularizer of the lasso is converted into a prior, which amounts to placing a Laplace prior on $\theta$ (Park & Casella, 2008). However, the Bayesian lasso still seems to lead to a linear dependence on the dimension because the model puts zero prior mass on a component being exactly zero. A regret bound for the Bayesian lasso can be found in the Supplementary Material (we suspect that our bound is essentially tight, though we have been unable to obtain a matching lower bound).

Another common Bayesian approach to inducing sparsity is to use a hierarchical spike-and-slab prior, which places positive probability on a component being exactly zero (Ishwaran & Rao, 2005; Narisetty & He, 2014). One version of the spike-and-slab prior is

$$z_i \mid p \sim \mathrm{Bern}(p) \quad\text{and}\quad \theta_i \mid z_i \sim z_i\delta_0 + (1 - z_i)\mathcal{N}(0, \sigma^2).$$

So with probability $p$ component $i$ is zero, and with probability $1 - p$ it is Gaussian-distributed. Integrating out $z_i$ yields the prior density $p_0(\theta_i) = p\,\delta_0(\theta_i) + (1 - p)\mathcal{N}(\theta_i \mid 0, \sigma^2)$. Let $\|v\|_0$ denote the $\ell_0$ norm of the vector $v$.

**Theorem 4.3.** *For the spike-and-slab prior, if $m = \|\theta^*\|_0$, then $R(Z, \theta^*)$ is bounded by*

$$R^{\mathrm{SS}}_{\mathrm{Bayes}}(Z, \theta^*) \le \frac{1}{2\sigma^2}\|\theta^*\|^2 + m\ln\frac{1}{1 - p} + (n - m)\ln\frac{1}{p} + \frac{m}{2}\ln\!\left(1 + \frac{Tc\sigma^2}{m}\right). \tag{14}$$

*In particular, if $p = q^{1/n}$ for some constant $0 < q < 1$, then $R^{\mathrm{SS}}_{\mathrm{Bayes}}(Z, \theta^*)$ is at most*

$$\frac{1}{2\sigma^2}\|\theta^*\|^2 + m\ln\frac{n}{1 - q} + \ln\frac{1}{q} + \frac{m}{2}\ln\!\left(1 + \frac{Tc\sigma^2}{m}\right).$$

The theorem shows the importance of properly scaling $p$ with the dimension of the problem. If $p$ is kept fixed, then the regret has a linear dependence on $n$. However, by scaling $p$ as $q^{1/n}$, we increase the probability of a component being zero as the dimension increases, and thus are able to ensure that the regret is only logarithmic in $n$ while simultaneously maintaining the appropriate linear dependence on $m$.
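The effect of the scaling is easy to check. The sketch below (our code, following the bound (14) as reconstructed above) compares a fixed $p$ with $p = q^{1/n}$ as $n$ grows:

```python
import numpy as np

def spike_slab_bound(theta_norm2, m, n, p, T, c, sigma2):
    """The spike-and-slab regret bound (14), with m = ||theta*||_0."""
    return (theta_norm2 / (2 * sigma2) + m * np.log(1 / (1 - p))
            + (n - m) * np.log(1 / p) + 0.5 * m * np.log(1 + T * c * sigma2 / m))

theta_norm2, m, T, c, sigma2, q = 9.0, 10, 1000, 0.5, 1.0, 0.5
for n in (100, 1000, 10_000):
    print(n,
          spike_slab_bound(theta_norm2, m, n, 0.5, T, c, sigma2),           # fixed p: ~linear in n
          spike_slab_bound(theta_norm2, m, n, q ** (1 / n), T, c, sigma2))  # p = q^{1/n}: ~m ln n
```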
The constant $q$ turns out to be the prior probability that all of the components are set to zero. More generally,

$$\binom{n}{k}\, q^{(n-k)/n}\big(1 - q^{1/n}\big)^k \xrightarrow{\,n \to \infty\,} q\,\frac{\ln^k q^{-1}}{k!},$$

the limiting prior probability of choosing exactly $k$ components to be non-zero. Hence, as $n \to \infty$, the prior over the number of non-zero components converges to a Poisson distribution with rate parameter $\ln q^{-1}$. So when $n$ is large, the expected number of non-zero components is $\ln q^{-1}$. The choice of $p$ close to 1 for large $n$ is in notable contrast to the common practice of setting $p = 1/2$ or some other constant independent of $n$ (Schneider & Corcoran, 2004; Ishwaran & Rao, 2005). Our results strongly recommend against this practice. See Narisetty & He (2014) for a discussion of purely statistical reasons to scale $p$ with the dimension.

## 5. Conclusion

In this paper we set out to understand and quantify the learning-theoretic benefits of Bayesian hierarchical modeling. In Section 4, we first used our main result, Theorem 2.2, to analyze three specific hierarchical priors that, particularly when combined with a logistic or Gaussian regression likelihood, are widely used in practice. Indeed, these prior-likelihood combinations have often been used with substantial success even in situations when they are known to be rather poor models of the data-generating mechanism. Our analysis offers an explanation for this success. The priors we analyzed are representative of the variety of ways in which hierarchical models are employed: representing uncertainty in hyperparameters, tying together related groups of observations, and creating more complicated distributions from simpler ones. Thus, our results answer Questions Q1–Q4 in some important cases and exemplify a learning-theoretic analysis technique that can be applied to other hierarchical models. In addition, using our second main result, Theorem 3.2, all of the insights gained in Section 4 for the log-loss regret setting apply equally well to the batch setting of statistical risk with bounded loss, further extending the applicability of our conclusions.

## Acknowledgments

Thanks to Peter Grünwald, Sham Kakade, Peter Krafft, and Daniel Roy for helpful discussions and comments. Thanks also to the anonymous reviewers, whose constructive comments improved the presentation of the results. JHH was supported by the U.S. Government under FA9550-11-C-0028, awarded by the DoD, Air Force Office of Scientific Research, as a National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.

## References

Alkema, L., Raftery, A. E., Gerland, P., Clark, S. J., Pelletier, F., Buettner, T., and Heilig, G. K. Probabilistic projections of the total fertility rate for all countries. Demography, 48(3):815–839, July 2011.

Audibert, J.-Y. and Bousquet, O. Combining PAC-Bayesian and generic chaining bounds. Journal of Machine Learning Research, 8:863–889, 2007.

Banerjee, A. On Bayesian bounds. In International Conference on Machine Learning, pp. 81–88. ACM, 2006.

Banerjee, A. An analysis of logistic models: Exponential family connections and online performance. In International Conference on Data Mining, pp. 204–215, 2007.

Baxter, J. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39, 1997.
Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

Ben-David, S. and Schuller, R. Exploiting task relatedness for multiple task learning. In Conference on Learning Theory, pp. 567–580. Springer, 2003.

Berger, J. O. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York, second edition, 1985.

Bernardo, J. M. and Smith, A. F. M. Bayesian Theory. Wiley, New York, 2000.

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.

Catoni, O. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56 of Lecture Notes - Monograph Series. Institute of Mathematical Statistics, 2007.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, New York, 2006.

Dawid, A. P. and Vovk, V. G. Prequential probability: Principles and properties. Bernoulli, 5(1):125–162, 1999.

Gelman, A. and Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.

Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. B. Bayesian Data Analysis. Chapman and Hall/CRC, third edition, 2013.

Ghitza, Y. and Gelman, A. Deep interactions with MRP: Election turnout and voting patterns among small electoral subgroups. American Journal of Political Science, 57(3):762–776, February 2013.

Grünwald, P. D. The Minimum Description Length Principle. MIT Press, Cambridge, MA, 2007.

Hassan Mahmud, M. M. On universal transfer learning. Theoretical Computer Science, 410(19):1826–1846, 2009.

Hassan Mahmud, M. M. and Ray, S. R. Transfer learning using Kolmogorov complexity: Basic theory and empirical evaluations. In Advances in Neural Information Processing Systems, pp. 361–368, 2007.

Ishwaran, H. and Rao, J. S. Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, April 2005.

Juba, B. Estimating relatedness via data compression. In International Conference on Machine Learning, pp. 441–448, 2006.

Kakade, S. M. and Ng, A. Y. Online bounds for Bayesian algorithms. In Advances in Neural Information Processing Systems, 2004.

Kakade, S. M., Seeger, M., and Foster, D. P. Worst-case bounds for Gaussian process models. In Advances in Neural Information Processing Systems, pp. 193–200, 2005.

McAllester, D. A. Simplified PAC-Bayesian margin bounds. In Conference on Learning Theory, pp. 203–215, 2003.

Mengersen, K. L. and Robert, C. P. Big Bayes stories: Foreword. Statistical Science, 29(1):1, 2014.

Narisetty, N. N. and He, X. Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics, 42(2):789–817, April 2014.

Park, T. and Casella, G. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, June 2008.

Pentina, A. and Lampert, C. H. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, 2014.

Raftery, A. E., Li, N., Ševčíková, H., Gerland, P., and Heilig, G. K. Bayesian probabilistic population projections for all countries. Proceedings of the National Academy of Sciences, 109(35):13915–13921, 2012.

Raftery, A. E., Chunn, J. L., Gerland, P., and Ševčíková, H. Bayesian probabilistic projections of life expectancy for all countries. Demography, 50(3):777–801, March 2013.
Salakhutdinov, R., Torralba, A., and Tenenbaum, J. B. Learning to share visual appearance for multiclass object detection. In Conference on Computer Vision and Pattern Recognition, pp. 1481–1488. IEEE, 2011.

Schneider, U. and Corcoran, J. N. Perfect sampling for Bayesian variable selection in a linear regression model. Journal of Statistical Planning and Inference, 126(1):153–171, November 2004.

Seeger, M. W., Kakade, S. M., and Foster, D. P. Information consistency of nonparametric Gaussian process methods. IEEE Transactions on Information Theory, 54(5):2376–2382, 2008.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pp. 267–288, 1996.

Vovk, V. Competitive online statistics. International Statistical Review, 69(2):213–248, 2001.