# deep_hierarchy_in_bandits__37ff7624.pdf Deep Hierarchy in Bandits Joey Hong 1 Branislav Kveton 2 Sumeet Katariya 2 Manzil Zaheer 3 Mohammad Ghavamzadeh 4 Mean rewards of actions are often correlated. The form of these correlations may be complex and unknown a priori, such as the preferences of users for recommended products and their categories. To maximize statistical efficiency, it is important to leverage these correlations when learning. We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model with latent variables. Since the hierarchy can have multiple layers, we call it deep. We propose a hierarchical Thompson sampling algorithm (Hier TS) for this problem and show how to implement it efficiently for Gaussian hierarchies. The efficient implementation is possible due to a novel exact hierarchical representation of the posterior, which itself is of independent interest. We use this exact posterior to analyze the Bayes regret of Hier TS. Our regret bounds reflect the structure of the problem, that the regret decreases with more informative priors, and can be recast to highlight reduced dependence on the number of actions. We confirm these theoretical findings empirically, in both synthetic and real-world experiments. 1. Introduction A contextual bandit (Li et al., 2010; Chu et al., 2011) is a sequential decision-making problem where a learning agent sequentially interacts with an environment over n rounds. In each round, the agent observes a context, chooses one of K possible actions, and then receives a reward for the taken action. The agent aims to maximize its expected cumulative reward over n rounds. It does not know the mean rewards of the actions a priori and learns them by taking the actions. This forces the agent to choose between exploring actions to learn about them and exploiting the action with the highest 1University of California, Berkeley 2Amazon 3Deep Mind 4Google Research. Correspondence to: Joey Hong . Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). estimated reward. As an example, in online shopping, the context can be a user s query, the actions are recommended products, and the reward is the indicator of a purchase (Yue & Guestrin, 2011; Kveton et al., 2015; Combes et al., 2015; Zong et al., 2016; Li et al., 2016). In many practical problems, the action space is large and cannot be explored naively. However, the mean rewards of actions are correlated due to some underlying structure. As a result, exploration of one action can teach the agent about other actions, depending on how correlated they are, and this increases statistical efficiency of exploration. As an example, in online shopping, many products are semantically similar, and their relationships can be captured by a graphical model: both a keyboard and monitor are computer accessories; and both computer accessories and home theatre systems are electronic devices. Another example is classification with a bandit feedback, where the labels are clustered. For instance, the car and truck are vehicles, while the monkey and tiger are animals. We experiment with this kind of problems in Section 6.2. Unfortunately, the aforementioned structure is not easy to represent in traditional bandit algorithms (Auer et al., 2002; Chapelle & Li, 2012; Kawale et al., 2015; Sen et al., 2017). 
As an example, Thompson sampling (TS) (Thompson, 1933; Chapelle & Li, 2012; Agrawal & Goyal, 2012; Russo & Van Roy, 2014) is a natural and elegant strategy for exploration in bandit problems. However, in structured action spaces, TS would need to model correlations between all actions using their joint posterior. This could be computationally costly in practice, as observing the reward of a single action may require recomputing the joint posterior of all actions. Therefore, it is not immediately obvious how to design an efficient TS algorithm when the mean rewards of actions are correlated. In this work, we study a structured bandit problem where the action space has a hierarchical structure. This structure is represented using a hierarchical Bayesian model (Lindley & Smith, 1972; Zhang & Yang, 2017), where each action is a leaf node in a tree. Each node is associated with a node parameter, which is drawn i.i.d. from a distribution parameterized by its parent s parameter. This hierarchical structure not only models many practical bandit tasks, such as online shopping, but admits an efficient exploration because the Deep Hierarchy in Bandits joint posterior can be factorized and efficiently updated. We make the following contributions. First, we formalize a hierarchical Bayesian model T of our environment. Second, we propose a Thompson sampling algorithm Hier TS. The main novelty in the design of Hier TS is that the posterior factors along T , which permits exact posterior sampling and computationally-efficient updates. The posterior has a closed form in multi-armed and contextual linear bandits with Gaussian rewards. We derive Bayes regret bounds for Hier TS that capture the structure of T and the impact of priors. The bounds show increased statistical efficiency due to the hierarchy and can improve upon classical results in non-constant factors. We validate these theoretical results empirically and also apply Hier TS to a real-world classification problem with label hierarchy. We use the following notation. Random variables are capitalized. For any positive integer n, we denote by [n] the set {1, . . . , n}. We let 1{ } be the indicator function. The i-th entry of vector v is vi. If the vector is already indexed, such as vj, we write vj,i. We use O for the big O notation up to logarithmic factors. We consider a learning agent that interacts with a contextual bandit over n rounds (Li et al., 2010; Chu et al., 2011). In round t [n], the agent observes context Xt X Rd, takes an action At from an action set A of size K, and then observes a stochastic reward Yt = r(Xt, At) + εt, where r : Rd A R is a reward function and εt is independent σ2-sub-Gaussian noise. Our problem is structured. In particular, the action set A progressively breaks into finer clusters of actions with similar rewards. This decomposition is represented by a tree T (Figure 1) over nodes V N. Without loss of generality, the root has index 1. Each leaf of T corresponds to an action a A and we call it an action node. Each internal node of T has at least two and at most b children, where b is the branching factor. The height of the tree is h and the height of node i V is hi h. The height of the leaves is 0 and the height of the root is h. For any node i V, we denote its parent by pa(i) and its children by ch(i). An ancestor of node i is any node on a direct path from node i to the root, and node i is a descendant of any node on that path. 
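To make the tree notation concrete, the following minimal Python sketch builds a balanced b-ary tree in exactly this representation: node 1 is the root, every internal node has b children, the leaves are the action nodes, and the maps pa(i), ch(i), and the node heights are stored explicitly. The class and function names are ours and are introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Tree:
    """Hierarchy T: node 1 is the root, leaves are the action nodes."""
    parent: Dict[int, int] = field(default_factory=dict)          # pa(i)
    children: Dict[int, List[int]] = field(default_factory=dict)  # ch(i)
    height: Dict[int, int] = field(default_factory=dict)          # h_i


def balanced_tree(b: int, h: int) -> Tree:
    """Balanced b-ary tree of height h with K = b^h action nodes (leaves)."""
    tree = Tree()
    tree.height[1] = h
    tree.children[1] = []
    frontier, next_id = [1], 2
    for level in range(h, 0, -1):          # create the children of nodes at this height
        new_frontier = []
        for node in frontier:
            for _ in range(b):
                child = next_id
                next_id += 1
                tree.parent[child] = node
                tree.height[child] = level - 1
                tree.children[child] = []
                tree.children[node].append(child)
                new_frontier.append(child)
        frontier = new_frontier
    return tree


T = balanced_tree(b=2, h=3)
actions = [i for i, hi in T.height.items() if hi == 0]  # action nodes A (leaves)
assert len(actions) == 2 ** 3
```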
With a slight abuse of notation, we use A V to refer to both the action set and leaves of T , and sometimes index action nodes by a to stress their role. The reward function is parameterized by model parameters Θ = (θi)i V, where θi is the parameter of node i. The true model parameters are Θ = (θ ,i)i V and we assume that Figure 1. Graphical model of our environment. The drawing depicts our notation: children ch(i) of node i, action nodes A, and updated nodes ψt after action At is taken. The height of the tree is h = 3 and that of node i is hi = 2. they are generated as θ ,1 P0,1 , (1) θ ,i | θ ,pa(i) P0,i( | θ ,pa(i)) , i V \ {1} , Yt | Xt, θ ,At P( | Xt; θ ,At) , t [n] . Here P0,1 is the prior distribution of the root node, called a hyper-prior; P0,i( | θ ,pa(i)) is the conditional prior distribution of node i, parameterized by the sampled value of its parent θ ,pa(i); and P( | x; θ ,a) is the reward distribution of action a in context x. We use r(x, a; Θ) to denote the mean reward of action a in context x under model parameters Θ and define it as r(x, a; Θ) = EY P ( |x;θa)[Y ]. Thus r(x, a; Θ) depends only on one parameter in Θ. The generative process in (1) relates any two node parameters to each other, through the lowest common ancestor. This induces complex correlations that can be used for efficient exploration. We discuss motivating examples for this setting in Section 1. The goal is to minimize the n-round regret defined as R(n; Θ ) = E t=1 r(Xt, At, ; Θ ) r(Xt, At; Θ ) where At, = arg max a A r(Xt, a; Θ ) is the optimal action in round t given context Xt. In this work, we assume that the parameters Θ are also random. We define the nround Bayes regret as BR(n) = E [R(n; Θ )], which takes an additional expectation over Θ . While weaker than the traditional frequentist regret R(n; Θ ), the Bayes regret is a practical performance measure, when the average performance across multiple instances of model parameters is of interest (Russo & Van Roy, 2014; Hong et al., 2020). Deep Hierarchy in Bandits Algorithm 1 Hier TS: Hierarchical Thompson sampling. Input: Tree T with height h, all priors P0, in (1) Initialize all posteriors P1, P0, for t = 1, . . . , n do Sample θt,1 Pt,1 for ℓ= h 1, . . . , 0 do Sample θt,i Pt,i( | θt,pa(i)) end for end for Θt (θt,i)i V Take action At arg max a A r(Xt, a; Θt) Observe Yt P( | Xt; θ ,At) Compute new posteriors Pt+1, end for 3. Algorithm Since our environment is a graphical model (Figure 1), we explore using Thompson sampling (TS) (Thompson, 1933; Chapelle & Li, 2012; Agrawal & Goyal, 2012; Russo & Van Roy, 2014). The main challenge in our algorithm design are latent variables. Specifically, the rewards of action nodes are observed and permit direct learning of their parameters, θ ,a for a A. In contrast, parameters θ ,i of the internal nodes i V \ A are only indirectly observed through their descendant action nodes, and thus are latent. It is unclear if modeling of the latent variables is necessary. To discuss alternatives, we need to introduce some notation. Let Ht = (Xℓ, Aℓ, Yℓ)ℓ [t 1] be the history of all interactions of the agent until round t and Θ ,A = (θ ,a)a A be the true model parameters of the action nodes. Because the mean rewards of actions depend only on Θ ,A, the most natural solution to our problem is TS over a joint posterior Θ ,A | Ht. This has two challenges. First, the exact posterior involves complex correlations, due to the dependencies in θ ,a induced by the generative process in (1). 
These correlations remain when the latent variables are marginalized out, and may not allow computationally-efficient sampling from Θ ,A | Ht. Second, the uncertainty of each θ ,a | Ht could be modeled individually. While computationally efficient, this may not be sound and would not be statistically efficient. We propose exact sampling from Θ ,A | Ht that is both computationally and statistically efficient. 3.1. Hierarchical Sampling Our approach is based on hierarchical sampling (Andrieu et al., 2003; Doucet et al., 2001), where the model parameters Θ are sampled similarly to the generative process in (1). To explain it, we introduce the following notation. For the root node, Pt,1(θ) = P (θ ,1 = θ | Ht) denotes the posterior distribution of its parameter in round t, which we also call a hyper-posterior. For any other node i, Pt,i(θ | θp) = P θ ,i = θ | θ ,pa(i) = θp, Ht,i (2) is the posterior distribution of its parameter conditioned on θ ,pa(i) = θp in round t, where Ht,i is a subset of interactions in Ht where Aℓis a descendant of node i. In (2), Ht could be replaced by Ht,i because the posterior of θ ,i is independent of the other observations given the value of the parent parameter θp. This structure is critical to the computational efficiency of our approach and is also used in the regret analysis (Section 5). Assuming that all posteriors Pt,i can be computed efficiently, it is trivial to propose a hierarchical Thompson sampling algorithm for our problem. We call it Hier TS and present its pseudo-code in Algorithm 1. In round t, Hier TS works as follows. First, we sample the root parameter θt,1. After that, we iterate over all nodes and sample node parameters whose parents are already sampled. Specifically, we define Vℓ= {i V : hi = ℓ} as the subset of nodes at height ℓ and then sample θt,i for i Vℓ, starting from the children of the root at height ℓ= h 1 all the way to the leaves at height ℓ= 0. By design, Θt = (θt,i)i V is a valid posterior sample, generated hierarchically. Finally, Hier TS takes an optimistic action with respect to Θt, observes Yt, and then updates its posterior. Note that Hier TS samples parameters at all action nodes. It is possible to leverage the tree structure to prune sub-trees with actions that are unlikely to have high mean rewards. For example, Sen et al. (2021) propose beam search over a tree to only evaluate a subset of actions. Such computational improvements can be easily incorporated into Hier TS. We view them as orthogonal to our main contribution, which is a statistically-efficient exploration using the tree structure. 3.2. Efficient Posterior Computation The main technical novelty in Hier TS is that the posteriors Pt,i can be maintained efficiently. We show it as follows. Fix any node i, its value θ, and the value of its parent θp. By Bayes rule, we have Pt,i(θ | θp) Lt,i(θ)P0,i(θ | θp) , (3) where Lt,i(θ) = P (Ht,i | θ ,i = θ) is the likelihood of observations Ht,i whose ancestor is node i with value θ. Note that Lt,i(θ) can be further decomposed as Lt,i(θ) = Q j ch(i) Lt,j(θ) , (4) where Lt,j(θ) = P Ht,j | θ ,pa(j) = θ is the likelihood of observations Ht,j whose ancestor is child node j and the value of its parent is θ. This identity follows from two facts. First, θ ,j are conditionally independent of each other given Deep Hierarchy in Bandits Algorithm 2 Updated statistics in round t. The dot notation means that the likelihoods are updated for all parameter values, which is possible in Gaussian models (Section 4). 
Initialize Lt+1, Lt, i At Lt+1,i( ) P(Yt | Xt; )Lt,i( ) Lt+1,i( ) R θ Lt+1,i(θ)P0,i(θ | ) dθ repeat i pa(i) Lt+1,i( ) Q j ch(i) Lt+1,j( ) if i > 1 then Lt+1,i( ) R θ Lt+1,i(θ)P0,i(θ | ) dθ end if until i = 1 θ ,i = θ. Second, Ht,j is independent of θ ,i given θ ,j. At a high level, each Lt,j(θ) can be viewed as the likelihood of an aggregate observation at node j, from all leaves that descend from node j, under the assumption that θ ,i = θ (Section 4.1). Finally, each Lt,j(θ) can be computed as Lt,j(θ) = Z θ Lt,j(θ )P0,j(θ | θ) dθ , (5) where Lt,j(θ ) is the likelihood of observations Ht,j whose ancestor is node j with value θ . Note that Lt,j(θ ) can be further rewritten as in (4), which gives rise to our recursive computation of the posterior. The pseudo-code for updating Lt,i and Lt,i after round t is shown in Algorithm 2. After this, (3) has to be recomputed for all nodes i on the path from At to the root. In general, (5) is hard to compute due to the integral over θ . However, in Gaussian graphical models (Section 4), this can be done in a closed form. In practice, (5) can be approximated for arbitrary distributions using approximate inference, either variational or MCMC (Doucet et al., 2001). 4. Gaussian Hierarchy In this section, we instantiate the environment in (1) as a hierarchical Gaussian model (Koller & Friedman, 2009) and derive its posterior. The model is defined as θ ,1 N(µ1, Σ0,1) , (6) θ ,i | θ ,pa(i) N(θ ,pa(i), Σ0,i) , i V \ {1} , Yt | Xt, θ ,At N(X t θ ,At, σ2) , t [n] , where θ ,i Rd are the node parameters and Σ0,i are the covariance matrices that control the closeness of θ ,i and θ ,pa(i). The mean reward is defined as r(x, a; Θ) = x θa. The hierarchical structure is motivated by multi-label classification (Prabhu et al., 2018; Yu et al., 2020a), where Xt is a feature vector, At is its predicted label, and Yt indicates if the label is correct. We return to this application in Section 6.2. When d = 1 and Xt = 1, we recover a K-armed Gaussian bandit, where θ ,a is the mean reward of action a. We assume that the agent knows the hyper-prior mean µ1, all covariances Σ0,i, and reward noise σ. This is only used in the regret analysis, where we study exact posterior sampling. In our experiments (Section 6.2), we learn these quantities from data. The special case of multi-armed bandits also shows computational gains over naive posterior sampling. Specifically, due to the dependencies in (1), Θ ,A | Ht is a multivariate Gaussian with K dimensions. To sample from it, we can compute the the root of the posterior covariance and multiply it by a K-dimensional standard normal vector, which takes O(K3) time. In contrast, sampling in Hier TS takes O(|V|) time. When each internal node of T has at least 2 children, which is without loss of generality, |V| 2K and our computational gain is O(K2). Now we present closedform posteriors for hierarchies of Gaussian and contextual linear models. 4.1. Multi-Armed Bandit We start with a K-armed Gaussian bandit. In this setting, each node i V is associated with a single scalar parameter θ ,i R, and its initial uncertainty is described by conditional prior variance Σ0,i = σ2 0,i R. The posteriors for this model are derived in Appendix A.1 and stated below. For any node i, the posterior of θ ,i given θ ,pa(i) = θp is Pt,i(θ | θp) = N(θ; ˆθt,i, ˆσ2 t,i). If node i is an internal node, ˆσ 2 t,i = σ 2 0,i + X j ch(i) σ 2 t,j , (7) ˆθt,i = ˆσ2 t,i σ 2 0,i θp + X j ch(i) σ 2 t,j θt,j At the root (i = 1), we set θp = µ1. 
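To make the generative model concrete, here is a minimal sketch that samples one problem instance from the hierarchy in (6) in the K-armed case (d = 1, X_t = 1): the root parameter is drawn from the hyper-prior, every other node parameter from a Gaussian centred at its parent, and rewards are Gaussian around the leaf parameters. The small fixed tree and all variable names are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small fixed tree: node 1 is the root, nodes 2-3 are internal, 4-7 are actions.
parent = {2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
mu_1 = 0.0                                 # hyper-prior mean
sigma2_0 = {i: 1.0 for i in range(1, 8)}   # conditional prior variances
sigma = 1.0                                # reward noise

# Generative process (6): sample node parameters top-down, children after parents.
theta_star = {1: rng.normal(mu_1, np.sqrt(sigma2_0[1]))}
for i in sorted(parent):                   # node ids increase with depth here
    theta_star[i] = rng.normal(theta_star[parent[i]], np.sqrt(sigma2_0[i]))


def pull(action: int) -> float:
    """Reward of an action node: Gaussian around its sampled parameter."""
    return rng.normal(theta_star[action], sigma)


print({a: round(theta_star[a], 3) for a in (4, 5, 6, 7)}, round(pull(4), 3))
```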
The child parameters θt,j and σt,j are computed recursively as follows. If node j is an action node, then σ2 t,j = σ2 0,j + σ2 |St,j| , (8) θt,j = 1 |St,j| where St,j = {ℓ< t : Aℓ= j} are the rounds where action j is taken before round t. If node j is an internal node, σ2 t,j = σ2 0,j + M 1 , (9) θt,j = M 1 X k ch(j) σ 2 t,k θt,k , Deep Hierarchy in Bandits where M = P k ch(j) σ 2 t,k. The new child parameters θt,k and σt,k are computed recursively, using either (8) or (9). Finally, if node i is action node, its posterior has a standard form of ˆσ 2 t,i = σ 2 0,i + σ 2 |St,i| , ˆθt,i = ˆσ2 t,i σ 2 0,i θp + σ 2 X The recursive update in (9) can be derived from the observation that Lt,j(θ) exp h 1 2 σ 2 t,j (θ θt,j)2i holds for any node j and the value of its parent θ. The closed-form of the posterior Pt,i(θ | θp) is a direct combination of this result and the derivations in Section 3.2. The update in (9) also has an intuitive interpretation. Although we get Gaussian observations at action nodes, as in (8), they propagate to higher nodes in the tree through (9). These nodes then act as noisy observations of their parents with mean θt,j and variance σ2 t,j. This allows us to overcome the problem of latent variables in our model. The posterior in (7) is just a function of higher-level observations in all children of node i. To have closed forms of these quantities, we rely heavily on the properties of Gaussian random variables. 4.2. Contextual Linear Bandit We now consider the general case in (6). This model can be viewed as a hierarchy of linear models (Yue & Guestrin, 2011; Abbasi-Yadkori et al., 2011) indexed by actions. The posteriors for this model are derived in Appendix A.2 and stated below. For any node i, the posterior of θ ,i given θ ,pa(i) = θp is Pt,i(θ | θp) = N(θ; ˆθt,i, ˆΣt,i). If node i is an internal node, ˆΣ 1 t,i = Σ 1 0,i + X j ch(i) Σ 1 t,j , (10) ˆθt,i = ˆΣt,i Σ 1 0,i θp + X j ch(i) Σ 1 t,j θt,j At the root (i = 1), we set θp = µ1. The child parameters θt,j and Σt,j are computed recursively as follows. If node j is an action node, then Σt,j = Σ0,j + G 1 t,j , (11) θt,j = σ 2G 1 t,j X ℓ St,j XℓYℓ, where St,j = {ℓ< t : Aℓ= j} are the rounds where action j is taken before round t and Gt,j = σ 2 P ℓ St,j X ℓXℓ is the outer product of the corresponding feature vectors. If node j is an internal node, then Σt,j = Σ0,j + M 1 , (12) θt,j = M 1 X k ch(j) Σ 1 t,j θt,k , where M = P k ch(j) Σ 1 t,k. The new child parameters θt,k and Σt,k are computed recursively, depending on whether k is an action node or not. Finally, if node i is action node, its posterior has a standard form of ˆΣ 1 t,i = Σ 1 0,i + Gt,i , ˆθt,i = ˆΣt,i Σ 1 0,i θp + σ 2 X ℓ St,i XℓYℓ The recursive update in (12) can be derived from the observation that Lt,j(θ) exp h 1 2(θ θt,j) Σ 1 t,j (θ θt,j) i for any node j and the value of its parent θ. The posterior in Pt,i(θ | θp) is a direct combination of this result and the derivations in Section 3.2. As in Section 4.1, our recursive update can be viewed as propagation of observations from action nodes to higher nodes in the tree. 5. Analysis This section is primarily devoted to the Gaussian bandit in Section 4.1. We present the key lemmas, the main result, and discuss them. All proofs are deferred to Appendix B. In Section 5.4, we state a regret bound for the contextual bandit in Section 4.2. Its proof is deferred to Appendix C. 5.1. 
Key Steps in the Analysis We start with the observation that the hierarchical posterior sampling in Section 3.1 is just an efficient implementation of joint posterior sampling over the action node parameters ΘA. Since our model is a Gaussian graphical model, this posterior is a multivariate Gaussian (Koller & Friedman, 2009). This is because any conditioning or marginalization does not change the model class. This observation allows us to prove the following lemma. Lemma 1. For any δ > 0, the Bayes regret of Hier TS is bounded as 2n G(n) log(1/δ) + p 2/πσmax Knδ , where G(n) = E Pn t=1 σ2 t,At is a statistical complexity term, σ2 t,At = var [θt,At | Ht] is the marginal posterior variance of the mean reward of action At in round t, and σmax is the maximum marginal prior width at an action node. The second term in Lemma 1 is constant in n for δ = 1/n. Therefore, we focus on the first O( n) term. Also note that Deep Hierarchy in Bandits σ2 t,At in G(n) is none of the conditional posterior variances derived in Section 4.1. We show how to decompose it into these variances next. To relate the marginal posterior variance, which bounds the expected regret, to conditional posterior variances, which represent our model uncertainty, we adopt the following update-centric notation. We denote the list of nodes from the root to the action node At in round t by ψt. The length of ψt is Lt. To illustrate the notation, ψt(1) = 1 is the root, ψt(Lt) = At is the action node, and ψt(Lt 1) = pa(At) is its parent. Figure 1 visualizes ψt. Now we are ready to relate the two quantities. Lemma 2. In any round t, the marginal posterior variance in action node At decomposes as ˆσ4 t,ψt(j) σ4 0,ψt(j) ˆσ2 t,ψt(i) . The last piece is a lower bound, which shows that each term in Lemma 2 can be bounded by the posterior update of the corresponding node i, representing our information gain. Lemma 3. Fix any round t and i [Lt]. Then ˆσ 2 t+1,ψt(i) ˆσ 2 t,ψt(i) σ 2ci Lt Lt Y ˆσ4 t,ψt(j) σ4 0,ψt(j) where c = 1+σ2 0,max/σ2 and σ2 0,max = maxi V σ2 0,i is the maximum prior variance. For σ σ0,max, we have c = 2. The constant c in Lemma 3 is small when the reward noise is higher than all prior widths in T . This property can be always attained by initial forced exploration of all actions. 5.2. Regret Bound Now we are ready to present our main result. Recall that h is the height of T , hi is the height of node i, and that the action nodes have height 0 (Section 2). Theorem 4. For any δ > 0, the Bayes regret of Hier TS is bounded as 2n G(n) log(1/δ) + p 2/πσmax Knδ , where G(n) = P i V chiwi and c is a constant defined in Lemma 3. For an action node i, hi = 0 and wi = σ2 0,i log 1 + σ2 0,i σ2 log 1 + σ2 0,in σ2 For an internal node i, hi > 0 and wi = σ2 0,i log 1 + σ2 0,i σ2 log 1 + σ2 0,i X j ch(i) σ 2 0,j When δ = 1/n, the above bound is O( p n |V|), where n is the horizon and |V| is the number of nodes, which is also the number of learned parameters. The dependence on the horizon n is standard. As wi = O(σ2 0,i), the contribution of each node i to the regret is proportional to its prior width. Thus the regret decreases when the model is more certain. One notable dependence in G(n) is exponential scaling with height chi. This is not problematic, as the number of nodes with high hi is exponentially smaller than those with lower hi (Section 5.3). Theorem 4 also recovers a well-known Bayes regret bound for K-armed bandits (Russo & Van Roy, 2014). 
The reason is that a K-armed bandit can be viewed as a tree with height h = 1, where the root parameter θ1 is the prior mean of the actions. Because θ1 is certain, w1 = 0 and P i V chiwi = PK+1 i=2 wi = O(K). 5.3. Lower Regret Due to Hierarchy Now we give examples of how the hierarchy can help with reducing regret. To simplify the discussion, we ignore logarithmic factors in the definitions of wi in Theorem 4. We assume that T is a balanced b-ary tree with height h; with K = bh action nodes and bh ℓnodes at height ℓ. Our discussion is under the assumption that c = 2, as derived in Lemma 3. More gains are possible when c < 2. We compare the regret of Hier TS to classical Thompson sampling (TS), which maintains an independent posterior of θ ,a for each action a A. To have a fair comparison, we set the marginal prior variances of all actions in TS as in Hier TS. Specifically, let ψa be the path in T from action node a to the root. Then the marginal prior of action a is N(µ1, σ2 0,a), where σ2 0,a = P i ψa σ2 0,i and µ1 denotes the hyper-prior mean in (6). The regret of this algorithm can be bounded using Theorem 4, with the only difference that the complexity term becomes GTS(n) P a A σ2 0,a. Problem 1. We start with a problem where all prior variances are identical, σ2 0,i = 1 for any i V. In this case, all prior variances in TS are σ2 0,a = h + 1 and its complexity term is GTS(n) = (h + 1)bh. In Hier TS, we aggregate the nodes by height and get ℓ=0 bh ℓcℓ= bh h X ℓ=0 (2/b)ℓ 1 1 2/bbh . Thus Hier TS improves G(n) by Ω(h) when b > 2. Since h = logb bh = logb K, we get GTS(n)/G(n) logb K, and Hier TS reduces the Bayes regret by a multiplicative factor p logb K. This argument can be adjusted for b = 2 to get a comparable regret to TS. Problem 2. Now we consider a problem where the conditional prior variances in T double with height, σ2 0,i = 2hi, Deep Hierarchy in Bandits where hi is the height of node i. This setting is motivated in Section 1. We expect higher statistical gains because the uncertainty of highly-uncertain nodes at higher levels of T is reduced jointly by all actions. In this problem, all prior variances in TS are σ2 0,a = 2h+1 and its complexity term is GTS(n) = 2h+1bh. In comparison, Hier TS yields ℓ=0 bh ℓcℓ2ℓ ℓ=0 bh ℓ4ℓ 1 1 4/bbh , where the last step is analogous to Problem 1. So Hier TS improves G(n) by Ω(2h+1) if b > 4. Since 2h+1 = 2 2logb bh = 2 2log2 bh/ log2 b = 2K we get GTS(n)/G(n) K 1 log2 b , and Hier TS reduces the Bayes regret by a multiplicative factor p 1 log2 b . For b = 5, this factor would be close to K 1 4 . Therefore, the regret is reduced by a polynomial factor in K. 5.4. Contextual Regret Bound This section presents a contextual generalization of the regret bound in Theorem 4. We make two assumptions in its derivation. First, the length of the context vector is bounded, and we assume that Xt 2 1 without loss of generality. Second, Σ0,i = σ2 0,i Id. This latter assumption allows us to utilize λ1(Σ0,i) = λd(Σ0,i) = σ2 0,i, which is required by our matrix generalizations of Lemmas 2 and 3. Our regret bound is stated below. Theorem 5. For any δ > 0, the Bayes regret of Hier TS is bounded as 2dn G(n) log(1/δ) + p 2/πσmax Knδ , where G(n) = P i V chiwi and c is a constant defined in Lemma 3. For an action node i, hi = 0 and wi = σ2 0,i log 1 + σ2 0,i σ2 log 1 + σ2 0,in σ2d For an internal node i, hi > 0 and wi = σ2 0,i log 1 + σ2 0,i σ2 log 1 + σ2 0,i X j ch(i) σ 2 0,j The constants are similar to Theorem 4. 
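Returning to the comparisons of Section 5.3, the short script below evaluates the complexity terms G(n) of Hier TS and G_TS(n) of flat TS for Problems 1 and 2, with each w_i replaced by σ²_{0,i} (that is, ignoring logarithmic factors) and c = 2. The printed ratios grow roughly like log_b K and 2^{h+1}, as argued above; the numbers are only illustrative, not part of the formal bounds.

```python
from math import log


def complexity_ratios(b: int, h: int, c: float = 2.0):
    """G_TS(n) / G(n) for Problems 1 and 2 of Section 5.3, ignoring log factors."""
    # Problem 1: sigma^2_{0,i} = 1 for every node; there are b^(h - l) nodes at height l.
    g_hier_1 = sum(b ** (h - l) * c ** l for l in range(h + 1))
    g_flat_1 = (h + 1) * b ** h
    # Problem 2: sigma^2_{0,i} = 2^{h_i}, doubling with height.
    g_hier_2 = sum(b ** (h - l) * c ** l * 2.0 ** l for l in range(h + 1))
    g_flat_2 = 2.0 ** (h + 1) * b ** h
    return g_flat_1 / g_hier_1, g_flat_2 / g_hier_2


for b, h in [(8, 2), (8, 4), (8, 6)]:
    r1, r2 = complexity_ratios(b, h)
    print(f"b={b}, h={h}, K={b ** h}: Problem 1 ratio {r1:.2f} (log_b K = {log(b ** h, b):.0f}), "
          f"Problem 2 ratio {r2:.2f} (2^(h+1) = {2 ** (h + 1)})")
```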
The main difference is the extra factor of d, because each node is now associated with a learned d-dimensional parameter. The key insight in the proof is that the context X_t in round t is known and fixed. Because all conditional posteriors are Gaussian, posterior sampling in contextual Hier TS can be implemented using a tree with scalar nodes, where the conditional posterior of node i is N(X_t^T θ̂_{t,i}, X_t^T Σ̂_{t,i} X_t). As a result, the non-contextual analysis in Theorem 4 generalizes easily. Lemma 1 changes in that the history H_t includes context X_t. Moreover, the marginal posterior variance of the mean reward of action A_t, σ²_{t,A_t}, becomes the variance in the direction of context X_t. The main technical challenges in the proof are generalizing the variance decomposition in Lemma 2 to covariances (Lemma 8) and extending the posterior-update lower bound in Lemma 3 to matrices (Lemma 9). The rest of the proof follows the same outline as that of Theorem 4.

6. Experiments

We compare Hier TS to three baselines that either neglect or only partially use the tree T. The first baseline is Thompson sampling (TS), which treats each action independently and is introduced in Section 5.3. The second baseline uses only a 2-level hierarchy, namely the root and action nodes, and we call it Flat TS. Its hyper-prior for the root, P_{0,1}, is the same as in Hier TS. For any action a ∈ A, the conditional prior is P_{0,a}(· | θ_{*,1}) = N(· ; θ_{*,1}, σ²_{0,a} − σ²_{0,1}), where σ²_{0,a} is the marginal prior variance in TS. This baseline mimics existing algorithms for 2-level Gaussian hierarchies with a common root (Kveton et al., 2021; Basu et al., 2021; Hong et al., 2022), and is similar to structured bandits where the actions share a latent parameter (Gupta et al., 2018). The final baseline is the zooming algorithm of Kleinberg et al. (2008; 2013), which we call Zooming. This is a UCB algorithm where actions are embedded in a metric space, and it can be viewed as a standard approach to handling large structured action spaces in bandits (Bubeck et al., 2008; Kleinberg et al., 2013). At a high level, Zooming maintains a set of active actions, which are sufficiently far apart from each other given the history, and explores only these. The key step in implementing Zooming is constructing a metric space in which actions with similar mean rewards are close. We design it as follows. We compute the graph Laplacian of T, represented as an undirected graph, and extract its d eigenvectors (v_i)_{i=1}^d with the smallest eigenvalues. Let v_{i,j} be the j-th entry of v_i. Then, for any action node a ∈ A, its embedding in the metric space is (v_{1,a}, ..., v_{d,a}). This is a standard approach to deriving a metric space in spectral clustering (Yan et al., 2009).

6.1. Synthetic Experiments

Our first experiments are on a synthetic Gaussian bandit, where we validate the theoretical findings from Section 5.3. We experiment with both problems in Section 5.3, which are b-ary trees with height h and K = b^h actions. In Problem 1, the prior variances are constant. In Problem 2, the prior variances double with height. In both problems, the mean of the hyper-prior is µ_1 = 0, and the reward of action a is Gaussian with mean θ_{*,a} and variance σ² = 1.

[Figure 2. Regret of Hier TS on synthetic bandit problems with varying branching factor b. Panels: b = 4, 8, 16 with σ = 1; x-axis: round t; compared methods: TS, Flat TS, Zooming, Hier TS.]
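To tie Sections 3 and 4.1 together, here is a self-contained sketch of Hier TS for the K-armed Gaussian bandit on a Problem 2-style instance (prior variances doubling with height, σ = 1). It follows Algorithm 1 and the recursions (7)-(9); for clarity, the upward messages (θ̄_{t,j}, σ̄²_{t,j}) are recomputed from scratch in every round rather than maintained incrementally as in Algorithm 2, and an unobserved subtree simply contributes zero precision. All function and variable names are ours, introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def balanced_tree(b, h):
    """Balanced b-ary tree of height h; node 1 is the root, leaves are actions."""
    parent, children, by_height = {}, {1: []}, {h: [1]}
    nid = 2
    for level in range(h, 0, -1):
        by_height[level - 1] = []
        for node in by_height[level]:
            for _ in range(b):
                parent[nid], children[nid] = node, []
                children[node].append(nid)
                by_height[level - 1].append(nid)
                nid += 1
    return parent, children, by_height


def message(j, children, n_obs, sum_y, sigma2_0, sigma2):
    """Upward message (theta_bar_j, precision) of node j, eqs. (8)-(9).
    Precision 0 means no descendant action of j has been observed yet."""
    if not children[j]:                              # action node, eq. (8)
        if n_obs[j] == 0:
            return 0.0, 0.0
        return sum_y[j] / n_obs[j], 1.0 / (sigma2_0[j] + sigma2 / n_obs[j])
    m = mean_acc = 0.0                               # internal node, eq. (9)
    for k in children[j]:
        mu_k, prec_k = message(k, children, n_obs, sum_y, sigma2_0, sigma2)
        m += prec_k
        mean_acc += prec_k * mu_k
    if m == 0.0:
        return 0.0, 0.0
    return mean_acc / m, 1.0 / (sigma2_0[j] + 1.0 / m)


def sample_posterior(parent, children, by_height, h, mu_1, sigma2_0, sigma2, n_obs, sum_y):
    """One round of hierarchical posterior sampling (Algorithm 1), using eq. (7)."""
    theta = {}
    for level in range(h, -1, -1):                   # root first, then down the tree
        for i in by_height[level]:
            theta_p = mu_1 if level == h else theta[parent[i]]
            prec = 1.0 / sigma2_0[i]
            mean_acc = theta_p / sigma2_0[i]
            if children[i]:                          # internal node: add child messages
                for j in children[i]:
                    mu_j, prec_j = message(j, children, n_obs, sum_y, sigma2_0, sigma2)
                    prec += prec_j
                    mean_acc += prec_j * mu_j
            else:                                    # action node: standard Gaussian update
                prec += n_obs[i] / sigma2
                mean_acc += sum_y[i] / sigma2
            theta[i] = rng.normal(mean_acc / prec, np.sqrt(1.0 / prec))
    return theta


# Problem 2-style instance: prior variances double with height, reward noise sigma = 1.
b, h, n, sigma2, mu_1 = 4, 2, 500, 1.0, 0.0
parent, children, by_height = balanced_tree(b, h)
height = {i: lvl for lvl, nodes in by_height.items() for i in nodes}
sigma2_0 = {i: 2.0 ** height[i] for i in height}

# Sample the true parameters Theta_* from the hierarchy (6), top-down.
theta_star = {1: rng.normal(mu_1, np.sqrt(sigma2_0[1]))}
for level in range(h - 1, -1, -1):
    for i in by_height[level]:
        theta_star[i] = rng.normal(theta_star[parent[i]], np.sqrt(sigma2_0[i]))

actions = by_height[0]
n_obs = {a: 0 for a in actions}
sum_y = {a: 0.0 for a in actions}
best = max(theta_star[a] for a in actions)
regret = 0.0
for t in range(n):
    theta_t = sample_posterior(parent, children, by_height, h, mu_1, sigma2_0, sigma2,
                               n_obs, sum_y)
    a = max(actions, key=lambda x: theta_t[x])       # best action under the sampled model
    y = rng.normal(theta_star[a], np.sqrt(sigma2))   # observe the reward
    n_obs[a] += 1
    sum_y[a] += y
    regret += best - theta_star[a]
print(f"Hier TS pseudo-regret over {n} rounds with K = {len(actions)} actions: {regret:.1f}")
```

Flat TS is recovered by the same sketch with h = 1, and independent TS by dropping the root node entirely, which is how the baselines below are implemented up to the choice of prior variances.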
We start with Problem 2, where the tree height is h = 2 and we vary the branching factor b. All algorithms are run for n = 500 rounds and evaluated by the Bayes regret on 100 independent samples of Θ . Its mean estimate and standard error are reported in Figure 2. For all values of b, Hier TS outperforms all baselines. Zooming is not competitive with the other baselines when the branching factor b is low. The poor performance of Zooming is because of its generality. Specifically, since it can be applied to any smooth reward function, its regret in d dimensions scales as O(n1 1 2+d ). In Figure 2, we only show the results for Zooming where d = K. We experimented with d < K and observed linear regret, because some action nodes with a common parent had identical features. In the next experiment, we consider both Problems 1 and 2. The branching factor is fixed at b = 2 and we vary the tree height h. All algorithms are run for n = 500 rounds on 100 independent samples of Θ . The reduction in the Bayes regret of Hier TS and Flat TS is measured as the ratio of the TS regret over the regret in question. In Figure 3a, we plot the ratios for Problem 1. Section 5.3 suggests that the regret decreases by Ω( h). Our plot confirms this. In Figure 3b, we plot the ratios for Problem 2. Section 5.3 suggests that the regret decreases by Ω(2h/2). We confirm this trend. 6.2. Multi-Label Image Classification The last experiment is on a multi-label image classification problem with linear rewards. All compared algorithms are implemented analogously to Section 6.1, except for variances being replaced with covariances. We use the CIFAR100 dataset (Krizhevsky, 2009), with 60 000 images of size 32 32. There are 50 000 training and 10 000 test images. Each image belongs to one of 100 classes (labels) and 20 super-classes, each consisting of 5 classes. Each image is represented by a d = 10 dimensional feature vector, which we obtain by downsampling a 100-dimensional feature vector. That feature vector is an embedding computed by an Efficient Net-L2 network applied to the image (Tan & Le, 2019; Xie et al., 2020; Foret et al., 2021). The network is a convolutional neural network pretrained on both Image Net (Russakovsky et al., 2015) and unlabeled JFT-300M (Sun et al., 2017), and fine-tuned on the CIFAR-100 training set. We randomly select 5 super-classes and their corresponding K = 25 classes become actions. The test and training sets are restricted to these classes. Our bandit problem is defined as follows. For each action a, θ ,a is the mean feature vector of test images in class a. In round t, the context Xt is the feature vector of a random image from the test set and the reward of action At is Yt N(X t θ ,At, 0.52). Therefore, the mean reward is maximized whenever the true class is chosen. Finally, we build a 3-level hierarchy T as follows. The hyper-prior of the root is a multivariate Gaussian fitted to all training images, P0,1 = N(µ1, Σ0,1). The nodes at height 1 are the 5 super-classes and their conditional priors are P0,i( | θ ,1) = N( ; θ ,1, Σ0,i), where Σ0,i is fitted to the training images of super-class i. Finally, the nodes at height 0 correspond to actions and their conditional priors are P0,a( | θ ,pa(a)) = N( ; θ ,pa(a), Σ0,a), where Σ0,a is fitted to the training images of class a. We report the mean and standard error of the regret over 10 runs in Figure 4. We observe again that Hier TS performs better than all baselines. We do not consider Zooming due to its poor performance earlier (Figure 2). 
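The hierarchy above can be assembled directly from labeled training features. The sketch below shows one plausible way to do this, on synthetic stand-ins for the embeddings: `features`, `labels`, and `super_labels` are hypothetical arrays we introduce for illustration, and fitting each conditional prior covariance Σ_{0,i} to the empirical covariance of the features under the corresponding node is one reasonable reading of the construction, not necessarily the exact recipe used in the paper.

```python
import numpy as np

# Hypothetical stand-ins for the d = 10 image embeddings and their labels.
rng = np.random.default_rng(0)
d, n_classes, n_super, n_images = 10, 25, 5, 5000
features = rng.normal(size=(n_images, d))
labels = rng.integers(0, n_classes, size=n_images)
super_of = {c: c // (n_classes // n_super) for c in range(n_classes)}  # 5 classes per super-class
super_labels = np.array([super_of[c] for c in labels])

# Root hyper-prior P_{0,1} = N(mu_1, Sigma_{0,1}), fitted to all training features.
mu_1 = features.mean(axis=0)
Sigma_0 = {1: np.cov(features, rowvar=False)}

# Height-1 nodes are the super-classes; height-0 nodes are the classes (actions).
for s in range(n_super):
    Sigma_0[("super", s)] = np.cov(features[super_labels == s], rowvar=False)
for c in range(n_classes):
    Sigma_0[("class", c)] = np.cov(features[labels == c], rowvar=False)

# True action parameters of the bandit: the mean feature vector of each class
# (in the paper, theta_star uses the test split and the priors use the training split).
theta_star = {c: features[labels == c].mean(axis=0) for c in range(n_classes)}
print(Sigma_0[("class", 0)].shape, theta_star[0].shape)  # (10, 10) (10,)
```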
Note that the true model parameters of T, namely µ_1 and Σ_{0,i}, are unknown, and we estimate them from training images. Therefore, this experiment shows that even when we relax the assumption that these parameters are known, it is beneficial to estimate them and use the structure of T.

[Figure 3. Improvement in the Hier TS regret when the prior variance (a) is constant or (b) increases with height, as a function of the tree height h. y-axis: TS regret / regret; compared methods: Flat TS, Hier TS.]

[Figure 4. Regret of Hier TS on a CIFAR-100 image classification problem. x-axis: round t; compared methods: Lin TS, Flat Lin TS, Hier Lin TS.]

7. Related Work

Thompson sampling algorithms have been widely applied to contextual bandits because of their computational efficiency and strong empirical performance (Chu et al., 2011; Chapelle & Li, 2012; Abbasi-Yadkori et al., 2011). Russo & Van Roy (2014) proved the first Bayes regret bounds for TS. Our proposed algorithm Hier TS extends TS to tree hierarchies. TS with a 2-level hierarchy over tasks was applied and analyzed in both meta-learning and multi-task learning (Kveton et al., 2021; Basu et al., 2021; Wan et al., 2021; Hong et al., 2022). The main difference in our work is that we move from 2-level hierarchies to an arbitrary depth, and develop both algorithmic and theoretical foundations for this setting. Our analysis extends the variance decompositions proposed in Hong et al. (2022) to trees. Alternatively, information theory could be used to derive Bayes regret bounds (Russo & Van Roy, 2016; Lu & Van Roy, 2019; Basu et al., 2021), but we are unaware of any such bounds for trees.

Our problem has a structured action space. One structure studied by prior works is a latent parameter shared among all actions (Tirinzoni et al., 2020; Lattimore & Munos, 2014; Gupta et al., 2018). This can be viewed as an instance of our setting with a 2-level hierarchy. In latent bandits, the parameter is a discrete variable (Maillard & Mannor, 2014; Hong et al., 2020). Recent works also applied approximate TS to more complex structures (Gopalan et al., 2014; Yu et al., 2020b). Such algorithms are general, but can only be analyzed in limited settings under strong assumptions. We consider a special tree structure, where we can derive and analyze an exact algorithm. Another recent work that applies trees to bandits is Majzoubi et al. (2020), who proposed a general reduction-based algorithm for contextual bandits with a continuous action space, where the policies are represented by a tree.

The most related work is Sen et al. (2021), who also studied contextual bandits with a tree hierarchy over actions. Both works rely on a hierarchy of regressors, motivated by multi-label classification (Prabhu et al., 2018; Yu et al., 2020a). However, the works differ in several key aspects. First, we study a stochastic variant of the problem, where the tree nodes are associated with prior distributions rather than fixed centers and radii, as in Sen et al. (2021). In fact, Sen et al. (2021) did not model the statistical uncertainty at all. Second, we propose a TS algorithm using novel recursive derivations of the posterior. Sen et al. (2021) proposed a greedy strategy and beam search to avoid evaluating all actions. In Section 3, we discuss how similar improvements can be incorporated in Hier TS.
Finally, our Bayesian analysis reveals structural properties that imply low regret. The low regret in Sen et al. (2021) is attained by making an assumption on the regression oracle, and can grow linearly when the oracle is imperfect. 8. Conclusions In many practical problems, the action space is large and a good generalization over actions is not obvious. Motivated by this, we study a contextual bandit problem with a deep hierarchy over actions. To solve the problem, we propose hierarchical Thompson sampling (Hier TS), which can be implemented exactly and efficiently in Gaussian models. We prove Bayes regret bounds for Hier TS that quantify its increased statistical efficiency over vanilla TS and validate this experimentally. We also apply Hier TS to a challenging classification problem with label hierarchy. Our work is a major step towards bandit algorithms with rich graphical models. Its limitations can be addressed by future works. For instance, a frequentist regret analysis is possible and would only differ in Lemma 1. The rest of the proof, which captures the structure of our problem, is a worst-case argument. Second, we believe that our method can be generalized beyond Gaussian trees. As discussed in Section 3.2, exact posterior sampling is challenging under the constraint of computational efficiency; but many tractable approximations exist. When exact sampling is possible, we believe that our proofs can be extended to general exponential-family distributions. Another direction for future work is an extension to directed acyclic graphs (DAGs). The nodes in DAGs can be ordered, and therefore similar recursions to Sections 4 and 5 can be established. Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pp. 2312 2320, 2011. Agrawal, S. and Goyal, N. Analysis of Thompson sampling Deep Hierarchy in Bandits for the multi-armed bandit problem. In Proceeding of the 25th Annual Conference on Learning Theory, pp. 39.1 39.26, 2012. Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M. An introduction to MCMC for machine learning. Machine Learning, 50:5 43, 2003. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235 256, 2002. Basu, S., Kveton, B., Zaheer, M., and Szepesvari, C. No regrets for learning the prior in bandits. In Advances in Neural Information Processing Systems 34, 2021. Bubeck, S., Munos, R., Stoltz, G., and Szepesvari, C. Online optimization in X-armed bandits. In Advances in Neural Information Processing Systems 21, pp. 201 208, 2008. Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pp. 2249 2257, 2012. Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 208 214, 2011. Combes, R., Magureanu, S., Proutiere, A., and Laroche, C. Learning to rank: Regret lower bounds and efficient algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2015. Doucet, A., de Freitas, N., and Gordon, N. Sequential Monte Carlo Methods in Practice. Springer, New York, NY, 2001. Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. 
In Proceedings of the 9th International Conference on Learning Representations, 2021. Gopalan, A., Mannor, S., and Mansour, Y. Thompson sampling for complex online problems. In Proceedings of the 31st International Conference on Machine Learning, pp. 100 108, 2014. Gupta, S., Chaudhari, S., Mukherjee, S., Joshi, G., and Yagan, O. A unified approach to translate classical bandit algorithms to the structured bandit setting. Co RR, abs/1810.08164, 2018. URL https://arxiv.org/ abs/1810.08164. Hong, J., Kveton, B., Zaheer, M., Chow, Y., Ahmed, A., and Boutilier, C. Latent bandits revisited. In Advances in Neural Information Processing Systems 33, 2020. Hong, J., Kveton, B., Zaheer, M., and Ghavamzadeh, M. Hierarchical Bayesian bandits. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 2022. Kawale, J., Bui, H., Kveton, B., Tran-Thanh, L., and Chawla, S. Efficient Thompson sampling for online matrix-factorization recommendation. In Advances in Neural Information Processing Systems 28, pp. 1297 1305, 2015. Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pp. 681 690, 2008. Kleinberg, R., Slivkins, A., and Upfal, E. Bandits and experts in metric spaces. Co RR, abs/1312.1277, 2013. URL https://arxiv.org/abs/1312.1277. Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009. Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Kveton, B., Szepesvari, C., Wen, Z., and Ashkan, A. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, 2015. Kveton, B., Konobeev, M., Zaheer, M., Hsu, C.-W., Mladenov, M., Boutilier, C., and Szepesvari, C. Meta Thompson sampling. In Proceedings of the 38th International Conference on Machine Learning, 2021. Lattimore, T. and Munos, R. Bounded regret for finitearmed structured bandits. In Advances in Neural Information Processing Systems 27, pp. 550 558, 2014. Li, L., Chu, W., Langford, J., and Schapire, R. A contextualbandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 2010. Li, S., Wang, B., Zhang, S., and Chen, W. Contextual combinatorial cascading bandits. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1245 1253, 2016. Lindley, D. and Smith, A. Bayes estimates for the linear model. Journal of the Royal Statistical Society: Series B (Methodological), 34(1):1 18, 1972. Lu, X. and Van Roy, B. Information-theoretic confidence bounds for reinforcement learning. In Advances in Neural Information Processing Systems 32, 2019. Deep Hierarchy in Bandits Maillard, O.-A. and Mannor, S. Latent bandits. In Proceedings of the 31st International Conference on Machine Learning, pp. 136 144, 2014. Majzoubi, M., Zhang, C., Chari, R., Krishnamurthy, A., Langford, J., and Slivkins, A. Efficient contextual bandits with continuous actions. In Advances in Neural Information Processing Systems 33, 2020. Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., and Varma, M. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, 2018. 
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. Image Net large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211 252, 2015. Russo, D. and Van Roy, B. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39 (4):1221 1243, 2014. Russo, D. and Van Roy, B. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(68):1 30, 2016. Sen, R., Shanmugam, K., Kocaoglu, M., Dimakis, A., and Shakkottai, S. Contextual bandits with latent confounders: An NMF approach. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017. Sen, R., Rakhlin, A., Ying, L., Kidambi, R., Foster, D., Hill, D., and Dhillon, I. Top-k extreme contextual bandits with arm hierarchy. In Proceedings of the 38th International Conference on Machine Learning, 2021. Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In IEEE International Conference on Computer Vision, 2017. Tan, M. and Le, Q. Efficient Net: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, pp. 6105 6114, 2019. Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933. Tirinzoni, A., Lazaric, A., and Restelli, M. A novel confidence-based algorithm for structured bandits. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020. Wan, R., Ge, L., and Song, R. Metadata-based multi-task bandits with Bayesian hierarchical models. In Advances in Neural Information Processing Systems 34, 2021. Xie, Q., Luong, M.-T., Hovy, E., and Le, Q. Self-training with noisy student improves Image Net classification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. Yan, D., Huang, L., and Jordan, M. Fast approximate spectral clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009. Yu, H.-F., Zhong, K., Zhang, J., Chang, W.-C., and Dhillon, I. PECOS: Prediction for enormous and correlated output spaces. Co RR, abs/2010.05878, 2020a. URL http: //arxiv.org/abs/2010.05878. Yu, T., Kveton, B., Wen, Z., Zhang, R., and Mengshoel, O. Graphical models meet bandits: A variational Thompson sampling approach. In Proceedings of the 37th International Conference on Machine Learning, 2020b. Yue, Y. and Guestrin, C. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems 24, pp. 2483 2491, 2011. Zhang, Y. and Yang, Q. A survey on multi-task learning. Co RR, abs/1707.08114, 2017. URL http://arxiv. org/abs/1707.08114. Zong, S., Ni, H., Sung, K., Ke, N. R., Wen, Z., and Kveton, B. Cascading bandits for large-scale recommendation problems. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, 2016. Deep Hierarchy in Bandits A. Posterior Derivations This section contains our posterior derivations. We adopt the convention that Λ = Σ 1, where Λ is the precision matrix for covariance Σ. A.1. Multi-Armed Bandit Posterior The proof is by induction. We start with the inductive step. Lemma 6. Fix an internal node j. Let pa(j) = i and C = ch(j). 
Let P (Ht,j, θ ,j = θ | θ ,i = θi) exp σ 2 0 (θ θi)2 + X k C σ 2 k (θ θk)2 !# where θk, σk are the parameters of k C. Then P (Ht,j | θ ,i = θi) exp 1 2 σ 2 j (θi θj)2 for σ2 j = σ2 0 + (P k C σ 2 k ) 1 and θj = (P k C σ 2 k ) 1 P k C σ 2 k θk. Proof. Let s = σ 2 0 + P k C σ 2 k and v = σ 2 0 θi + P k C σ 2 k θk. We start with completing the square of θ, log P (Ht,j, θ ,j = θ | θ ,i = θi) σ 2 0 (θ θi)2 + X k C σ 2 k (θ θk)2 σ 2 0 θi + X k C σ 2 k θk + σ 2 0 θ2 i = s(θ2 2θs 1v + s 2v2) + σ 2 0 θ2 i s 1v2 = s(θ s 1v)2 + σ 2 0 θ2 i s 1v2 . In the second step, we omit constants in θ and θi. Since we got a quadratic form in θ, we know that Z θ P (Ht,j, θ ,j = θ | θ ,i = θi) dθ exp 1 2(σ 2 0 θ2 i s 1v2) . Let ˆs = σ 2 0 σ 4 0 s 1. Now we complete the square of θi, σ 2 0 θ2 i s 1v2 = σ 2 0 θ2 i s 1 σ 2 0 θi + X k C σ 2 k θk = σ 2 0 θ2 i σ 4 0 s 1 θi + σ2 0 X k C σ 2 k θk θ2 i 2θiˆs 1σ 2 0 s 1 X k C σ 2 k θk θi ˆs 1σ 2 0 s 1 X k C σ 2 k θk In the last two steps, we omit constants in θi. Finally, note that ˆs = σ 2 0 (s σ 2 0 ) s = (σ2 0 + (s σ 2 0 ) 1) 1 , ˆs 1σ 2 0 s 1 = s σ 2 0 (s σ 2 0 )σ 2 0 s 1 = (s σ 2 0 ) 1 . This completes the proof, for θj = (s σ 2 0 ) 1 P k C σ 2 k θk and σ2 j = σ2 0 + (s σ 2 0 ) 1. At an action node j, (13) holds for σk = σ and θk = Yk, where Yk is an observation k of node j and σ is observation noise. This is the base case of our proof by induction. Deep Hierarchy in Bandits A.2. Linear Bandit Posterior The proof is by induction. We start with the inductive step. Lemma 7. Fix an internal node j. Let pa(j) = i and C = ch(j). Let P (Ht,j, θ ,j = θ | θ ,i = θi) exp (θ θi) Λ0(θ θi) + X k C (θ θk) Λk(θ θk) where θk, Λk are the parameters of k C. Then P (Ht,j | θ ,i = θi) exp 1 2(θi θj) Λj(θi θj) for Λ 1 j = Λ 1 0 + (P k C Λk) 1 and θj = (P k C Λk) 1 P k C Λkµk. Proof. Let S = Λ0 + P k C Λk and V = Λ0θi + P k C Λkθk. We start with completing the square of θ, log P (Ht,j, θ ,j = θ | θ ,i = θi) (θ θi) Λ0(θ θi) + X k C (θ θk) Λk(θ θk) = θ S(θ 2S 1V ) + θ i Λ0θi = (θ S 1V ) S(θ S 1V ) + θ i Λ0θi V S 1V . In the second step, we omit constants in θ and θi. Since we got a quadratic form in θ, we know that Z θ P (Ht,j, θ ,j = θ | θ ,i = θi) dθ exp 1 2(θ i Λ0θi V S 1V ) . Let ˆS = Λ0 Λ0S 1Λ0. Now we complete the square of θi, θ i Λ0θi V S 1V = θ i Λ0θi θi 2 ˆS 1Λ0S 1 X θi ˆS 1Λ0S 1 X θi ˆS 1Λ0S 1 X In the last two steps, we omit constants in θi. Finally, by the Woodbury matrix identity, we have ˆS = Λ0 Λ0S 1Λ0 = (Λ 1 0 + (S Λ0) 1) 1 , ˆS 1Λ0S 1 = (Λ0 Λ0S 1Λ0) 1Λ0S 1 = (S Λ0) 1 . This completes the proof, for θj = (S Λ0) 1 P k C Λkµk and Λj = (Λ 1 0 + (S Λ0) 1) 1. For any node j, note that (14) can be written as P (Ht,j, θ ,j = θ | θ ,i = θi) exp (θ θi) Λ0(θ θi) + X k C θ Λkθ 2 X k C θ Λkθk) when constants in θ and θi are omitted. Then, at an action node, Λk = σ 2X k Xk and Λkθk = σ 2Xk Yk, where Yk is an observation k of node j at feature vector Xk and σ is observation noise. This is the base case of our proof by induction. Deep Hierarchy in Bandits B. Regret Bound Proofs This section contains proofs of our regret bound and supporting lemmas. B.1. Proof of Lemma 1 Fix round t. Let P (Θ | Ht) = N(Θ; Θt, Σt) be the joint posterior distribution of all action node parameters Θ RK, with mean Θt RK and covariance Σt RK K. Let At {0, 1}K and A {0, 1}K be indicator vectors of the taken action in round t and the optimal action, respectively. Each action is associated with one leaf node. Since Θt is deterministic given Ht, and A and At are i.i.d. 
given Ht, we have E A Θ A t Θ = E E A (Θ Θt) Ht + E E A t ( Θt Θ ) Ht . Moreover, Θ Θt is a zero-mean random vector independent of At, and thus E A t ( Θt Θ ) Ht = 0. So we only need to bound the first term above. Let Et = n a A : |a (Θ Θt)| p 2 log(1/δ) a Σt o be the event that all high-probability confidence intervals hold. Fix history Ht. Then by the Cauchy-Schwarz inequality, E A (Θ Θt) Ht p 2 log(1/δ) E A Σt Ht + E A (Θ Θt)1 Et Ht . Now note that for any action a, a (Θ Θt)/ a Σt is a standard normal variable. It follows that E A (Θ Θt)1 Et Ht 2 X 2 log(1/δ) u exp u2 2 π σmax Kδ , where we use that Et implies Σ 1 2 t (Θ Θt) p 2 log(1/δ). Now we combine all inequalities and have E A (Θ Θt) Ht p 2 log(1/δ) E At Σt Ht + 2 π σmax Kδ . We also used that At and A are i.i.d. given Ht. Since the above bound holds for any history Ht, we combine everything and get t=1 A Θ A t Θ 2 log(1/δ) E 2 π σmax Knδ 2n log(1/δ) E t=1 At 2 Σt 2 π σmax Knδ 2n log(1/δ) t=1 At 2 Σt 2 π σmax Knδ . The second step uses the Cauchy-Schwarz inequality and the third step uses the concavity of the square root. Since a is an indicator vector, At 2 Σt = σ2 t,At is the marginal posterior variance of the mean reward of action At in round t. Likewise, σmax is the maximum marginal prior width of the mean reward at an action node. This concludes the proof. B.2. Proof of Lemma 2 Since round t is fixed, we write L instead of Lt and refer to node ψt(i) by i for any i [Lt]. Fix any node i. By the total variance decomposition, where we introduce a parent parameter θt,i 1, var [θt,i | Ht] = E ˆσ2 t,i Ht + var h ˆθt,i Ht i . Deep Hierarchy in Bandits For Gaussian random variables, ˆσ2 t,i is independent of θt,i 1, as shown in (7). Therefore, E ˆσ2 t,i Ht = ˆσ2 t,i . For the second term, as shown in (7), ˆθt,i = ˆσ2 t,i(σ 2 0,i θt,i 1 + c), where c is a constant conditioned on Ht. Therefore, var h ˆθt,i Ht i = ˆσ4 t,i σ4 0,i var [θt,i 1 | Ht] . Now we chain all identities for node i and get var [θt,i | Ht] = ˆσ2 t,i + ˆσ4 t,i σ4 0,i var [θt,i 1 | Ht] . Finally, we apply the above identity recursively, from node L up to the root, and get our claim, var [θt,L | Ht] = ˆσ2 t,L + ˆσ4 t,L σ4 0,L var [θt,L 1 | Ht] = ˆσ2 t,L + ˆσ4 t,L σ4 0,L ˆσ2 t,L 1 + ˆσ4 t,L σ4 0,L ˆσ4 t,L 1 σ4 0,L 1 var [θt,L 2 | Ht] ˆσ4 t,j σ4 0,j This completes the proof. B.3. Proof of Lemma 3 Since round t is fixed, we write L instead of Lt and refer to node ψt(i) by i for any i [Lt]. For an action node i, the claim holds trivially since ˆσ 2 t+1,i ˆσ 2 t,i = σ 2. The rest of the proof is for internal nodes. We start with proving that σ 2 σ 2 t+1,i σ 2 t,i σ 2 ˆσ4 t,i σ4 0,i 1 1 + σ 2σ2 0,i holds for any node i. The proof is by induction. The basis of the induction is that the claim holds for action nodes. Let node i = L be an action node, with n observations by round t. We apply (8) and get σ 2 t+1,i σ 2 t,i = (σ2 0,i + σ2/(n + 1)) 1 (σ2 0,i + σ2/n) 1 = σ 4 0,i [(σ 2 0,i + σ 2n) 1 (σ 2 0,i + σ 2(n + 1)) 1] = σ 4 0,i (σ 2 0,i + σ 2n) 2 σ 2 1 + σ 2(σ 2 0,i + σ 2n) 1 = ˆσ4 t,i σ4 0,i 1 + σ 2ˆσ2 t,i . The second and third equalities are by the Woodbury and Sherman-Morrison formulas, respectively, applied to scalars. The lower bound follows from ˆσ2 t,i σ2 0,i. The upper bound uses that ˆσ4 t,i/σ4 0,i 1 and 1 + σ 2ˆσ2 t,i 1. In the inductive step, we assume that (15) holds for node i + 1 and prove it for node i. Note that node i + 1 is the only child of node i where the posterior between rounds t and t + 1 changes. Let s = P j ch(i) σ 2 t,j and ε = σ 2 t+1,i+1 σ 2 t,i+1. 
Now we apply (9) and get σ 2 t+1,i σ 2 t,i = (σ2 0,i + (s + ε) 1) 1 (σ2 0,i + s 1) 1 = σ 4 0,i [(σ 2 0,i + s) 1 (σ 2 0,i + s + ε) 1] = σ 4 0,i (σ 2 0,i + s) 2 ε 1 + (σ 2 0,i + s) 1ε = ˆσ4 t,i σ4 0,i ε 1 + ˆσ2 t,iε . As in the proof for the action node, the second and third equalities are by the Woodbury and Sherman-Morrison formulas, respectively, applied to scalars. The lower bound follows from ˆσ2 t,i σ2 0,i and ε σ 2, where the latter is by the inductive argument. The upper bound uses that ˆσ4 t,i/σ4 0,i 1, 1 + ˆσ2 t,iε 1, and ε σ 2. Deep Hierarchy in Bandits To complete the proof, we have from (7) that ˆσ 2 t+1,i ˆσ 2 t,i = σ 2 t+1,i+1 σ 2 t,i+1 holds for any internal node i. Finally, note that 1 + σ 2σ2 0,i 1 + σ 2σ2 0,max = c. B.4. Proof of Theorem 4 First, we apply Lemma 1 and get that 2n V(n) log(1/δ) + p 2/πσmax Knδ , where V(n) = E Pn t=1 σ2 t,At and σ2 t,At is the marginal posterior variance in node At in round t. We derive a worst-case upper bound on V(n) next. We start with a worst-case upper in any round t. Since round t is fixed, we write L instead of Lt and refer to node ψt(i) by i for any i [Lt]. Let si = ˆσ2 t,j σ2 0,j . Then by Lemma 2, σ2 t,At = σ2 σ2 t,At σ2 = σ2 L X i=1 s2 i ˆσ2 t,i σ2 σ2 L X 1 + s2 i ˆσ2 t,i σ2 where ci is an upper bound at node i defined as s2 i ˆσ2 t,i σ2 log 1 + s2 i ˆσ2 t,i σ2 ˆσ2 t,i σ2 log 1 + ˆσ2 t,i σ2 σ2 0,i σ2 log 1 + σ2 0,i σ2 = ci . The first inequality holds because si 1, and ax/ log(1 + ax) x/ log(1 + x) for any a [0, 1] and x > 0. The second inequality holds because x/ log(1 + x) is maximized when x is, which happens at ˆσt,i = σ0,i. For any c 1 and node i [L], 1 + s2 i ˆσ2 t,i σ2 = c L ici L log 1 + s2 i ˆσ2 t,i σ2 1 + ci Ls2 i ˆσ2 t,i σ2 where the inequality holds because a log(1 + x) log(1 + ax) for any a [0, 1] and x > 0. Moreover, 1 + ci Ls2 i ˆσ2 t,i σ2 = log ˆσ 2 t,i + ci Ls2 i σ2 log(ˆσ 2 t,i ) log(ˆσ 2 t+1,i) log(ˆσ 2 t,i ) , where the last step is by Lemma 3. Now we chain all inequalities, switch to the full notation, and get σ2 t,At σ2 Lt X i=1 c Lt icψt(i)[log(ˆσ 2 t+1,ψt(i)) log(ˆσ 2 t,ψt(i))] . Finally, we sum up the above upper bound over all rounds t. Let hi be the maximum length of any path from node i to its descendant. Due to c 1 and telescoping in the above decomposition, we get i N chici[log(ˆσ 2 n+1,i) log(σ 2 0,i )] = σ2 X i N chici log(σ2 0,iˆσ 2 n+1,i) . For an action node i, ˆσ 2 n+1,i σ 2 0,i + σ 2n, and thus log(σ2 0,iˆσ 2 n+1,i) log 1 + σ2 0,in σ2 Deep Hierarchy in Bandits For an internal node i, ˆσ 2 n+1,i σ 2 0,i + P j ch(i) σ 2 0,j, and thus log(σ2 0,iˆσ 2 n+1,i) log 1 + σ2 0,i X j ch(i) σ 2 0,j This completes the proof. C. Contextual Regret Bound Proofs This section contains proofs of our contextual regret bound and supporting lemmas. C.1. Proof of Theorem 5 First, we apply Lemma 1 and get that 2n V(n) log(1/δ) + p 2/πσmax Knδ , where V(n) = E Pn t=1 σ2 t,At and σ2 t,At = Xt 2 Σt,At denotes the marginal posterior variance in node At in round t, in the direction of context Xt. We derive a worst-case upper bound on V(n) next. We start with a worst-case upper in any round t. Since round t is fixed, we write L instead of Lt and refer to node ψt(i) by i for any i [Lt]. Let Si = QL j=i+1 Σ 1 0,j ˆΣt,j. Then by Lemma 8, σ2 t,At = X t Σt,At Xt = σ2 L X i=1 σ 2X t S i ˆΣt,i Si Xt σ2 L X i=1 ci log(1 + σ 2X t S i ˆΣt,i Si Xt) , where ci is an upper bound at node i defined as X t S i ˆΣt,i Si Xt σ2 log(1 + σ 2X t S i ˆΣt,i Si Xt) λ1(ˆΣt,i) σ2 log(1 + σ 2λd(ˆΣt,i)) σ2 0,i σ2 log 1 + σ2 0,i σ2 = ci . 
The first inequality holds because λ1(Si) 1 and Xt 2 1, where the former follows from the definitions of ˆΣt,j and Σ0,j = σ2 0,j Id. We also use the fact that ax/ log(1 + ax) x/ log(1 + x) holds for any a [0, 1] and x > 0. The second inequality holds because λ1(ˆΣt,i) λ1(Σ0,i) = σ2 0,i. We also use that x/ log(1 + x) is maximized when x is. For any c 1 and node i [L], log(1 + σ 2X t S i ˆΣt,i Si Xt) = c L ici L log(1 + σ 2X t S i ˆΣt,i Si Xt) c L i log(1 + σ 2ci LX t S i ˆΣt,i Si Xt) , where the inequality holds because a log(1 + x) log(1 + ax) for any a [0, 1] and x > 0. Moreover, log(1 + σ 2ci LX t S i ˆΣt,i Si Xt) = log det(Id + σ 2ci L ˆΣ 1 2 t,i Si Xt X t S i ˆΣ = log det(ˆΣ 1 t,i + σ 2ci LSi Xt X t S i ) log det(ˆΣ 1 t,i ) log det(ˆΣ 1 t+1,i) log(ˆΣ 1 t,i ) , where the last step is by Lemma 9. Now we chain all inequalities, switch to the full notation, and get σ2 t,At σ2 Lt X i=1 c Lt icψt(i)[log det(ˆΣ 1 t+1,ψt(i)) log det(ˆΣ 1 t,ψt(i))] . Finally, we sum up the above upper bound over all rounds t. Let hi be the maximum length of any path from node i to its descendant. Due to c 1 and telescoping in the above decomposition, we get i N chici[log det(ˆΣ 1 n+1,i) log det(Σ 1 0,i )] = σ2 X i N chici log det(Σ 1 2 0,i ˆΣ 1 n+1,iΣ Deep Hierarchy in Bandits For an action node i, 1 2 0,i ˆΣ 1 n+1,iΣ 1 2 0,i) d log 1 1 2 0,i ˆΣ 1 n+1,iΣ 1 2 0,i) d log 1 + σ2 0,in σ2d For an internal node i, 1 2 0,i ˆΣ 1 n+1,iΣ 1 2 0,i) d log 1 1 2 0,i ˆΣ 1 n+1,iΣ 1 2 0,i) d log 1 + σ2 0,i X j ch(i) σ 2 0,j This completes the proof. C.2. Supporting Lemmas The first claim is a matrix generalization of Lemma 2. Lemma 8. In any round t, the marginal posterior covariance in action node At decomposes as i=1 S i ˆΣt,ψt(i)Si , where Si = QLt j=i+1 Σ 1 0,ψt(j) ˆΣt,ψt(j). Proof. The proof is exactly the same as in Appendix B.2, except that the variances are replaced by covariances. The second claim is a matrix generalization of Lemma 3. Lemma 9. Fix any round t and i [Lt]. Then ˆΣ 1 t+1,ψt(i) ˆΣ 1 t,ψt(i) σ 2ci Lt Si Xt X t S i , where c = 1 + σ2 0,max/σ2 and Si is defined in Lemma 8. Proof. Since round t is fixed, we write L instead of Lt and refer to node ψt(i) by i for any i [Lt]. For an action node i, the claim holds trivially since ˆΣ 1 t+1,i ˆΣ 1 t,i = σ 2Xt X t . The rest of the proof is for internal nodes. We start with proving that for any node i, Σ 1 t+1,i Σ 1 t,i = vv is a rank-1 matrix for some v Rd, such that vv σ 2ci L 1Si 1Xt X t Si 1 , v v σ 2 . (16) The proof is by induction. The basis of the induction is that the claim holds for action nodes. Let node i = L be an action node. We apply (11) and get Σ 1 t+1,i Σ 1 t,i = (Σ0,i + (Gt,i + σ 2Xt X t ) 1) 1 (Σ0,i + G 1 t,i ) 1 = Σ 1 0,i [(Σ 1 0,i + Gt,i) 1 (Σ 1 0,i + Gt,i + σ 2Xt X t ) 1]Σ 1 0,i = Σ 1 0,i (Σ 1 0,i + Gt,i) 1 σ 2Xt X t 1 + σ 2X t (Σ 1 0,i + Gt,i) 1Xt (Σ 1 0,i + Gt,i) 1Σ 1 0,i = Σ 1 0,i ˆΣt,i σ 2Xt X t 1 + σ 2X t ˆΣt,i Xt ˆΣt,iΣ 1 0,i . The second and third equalities are by the Woodbury and Sherman-Morrison formulas, respectively. The matrix is rank 1. The lower bound follows from Σ0,i = σ2 0,i Id and Xt 2 1, which can be used to derive 1 + σ 2X t ˆΣt,i Xt 1 + σ 2λ1(ˆΣt,i) 1 + σ 2σ2 0,i c . Deep Hierarchy in Bandits The upper bound uses that Σ 1 0,i ˆΣt,i Xt 2 1 and 1 + x 1 for any x 0. In the inductive step, we assume that (16) holds for node i + 1 and prove it for node i. Note that node i + 1 is the only child of node i where the posterior between rounds t and t + 1 changes. Let M = P j ch(i) Σ 1 t,j and ε = Σ 1 t+1,i+1 Σ 1 t,i+1. 
Now we apply (12) and get Σ 1 t+1,i Σ 1 t,i = (Σ0,i + (M + ε) 1) 1 (Σ0,i + M 1) 1 = Σ 1 0,i [(Σ 1 0,i + M) 1 (Σ 1 0,i + M + ε) 1]Σ 1 0,i = Σ 1 0,i (Σ 1 0,i + M) 1 ε 1 + v (Σ 1 0,i + M) 1v (Σ 1 0,i + M) 1Σ 1 0,i = Σ 1 0,i ˆΣt,i vv 1 + v ˆΣt,iv ˆΣt,iΣ 1 0,i , where ε = vv . The matrix is rank 1, because by the inductive argument ε is rank 1. As in the proof for the action node, the second and third equalities are by the Woodbury and Sherman-Morrison formulas. The lower bound is derived using 1 + v ˆΣt,iv 1 + v 2 2λ1(ˆΣt,i) 1 + σ 2σ2 0,i c . This follows from Σ0,i = σ2 0,i Id and v v σ 2, where the latter is by the inductive argument. The upper bound uses that Σ 1 0,i ˆΣt,iv 2 σ 1 and 1 + x 1 for any x 0. To complete the proof, we have from (10) that ˆΣ 1 t+1,i ˆΣ 1 t,i = Σ 1 t+1,i+1 Σ 1 t,i+1 holds for any internal node i.
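As a sanity check on the Woodbury and Sherman-Morrison steps used in the proofs of Lemma 3 and Lemma 9, the following NumPy snippet verifies numerically, on random positive definite matrices, that a rank-1 increase vv^T in the aggregated child precision M changes the message precision Σ̄^{-1}_{t,i} by exactly the rank-1 matrix appearing in the internal-node step of the proof of Lemma 9. The matrix sizes and variable names are ours, chosen only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d)); Sigma0 = A @ A.T + np.eye(d)  # conditional prior covariance
B = rng.normal(size=(d, d)); M = B @ B.T + np.eye(d)       # aggregated child precision
v = rng.normal(size=(d, 1))                                # rank-1 precision increase

Sigma_bar_old = Sigma0 + np.linalg.inv(M)                  # message covariance before the update
Sigma_bar_new = Sigma0 + np.linalg.inv(M + v @ v.T)        # and after the update
lhs = np.linalg.inv(Sigma_bar_new) - np.linalg.inv(Sigma_bar_old)

Sigma_hat = np.linalg.inv(np.linalg.inv(Sigma0) + M)       # conditional posterior covariance
S0inv = np.linalg.inv(Sigma0)
rhs = S0inv @ Sigma_hat @ (v @ v.T) @ Sigma_hat @ S0inv / (1.0 + (v.T @ Sigma_hat @ v).item())

assert np.allclose(lhs, rhs)                               # the identity from the proof holds
print("max abs difference:", float(np.abs(lhs - rhs).max()))
```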