# Factored-Reward Bandits with Intermediate Observations

Marco Mussi *1, Simone Drago *1, Marcello Restelli 1, Alberto Maria Metelli 1

*Equal contribution. 1 Politecnico di Milano, Milan, Italy. Correspondence to: Marco Mussi.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

In several real-world sequential decision problems, at every step, the learner is required to select different actions. Every action affects a specific part of the system and generates an observable intermediate effect. In this paper, we introduce the Factored-Reward Bandits (FRBs), a novel setting able to effectively capture and exploit the structure of this class of scenarios, where the reward is computed as the product of the action intermediate observations. We characterize the statistical complexity of the learning problem in the FRBs by deriving worst-case and asymptotic instance-dependent regret lower bounds. Then, we devise and analyze two regret minimization algorithms. The former, F-UCB, is an anytime optimistic approach matching the worst-case lower bound (up to logarithmic factors) but failing to perform optimally from the instance-dependent perspective. The latter, F-Track, is a bound-tracking approach that enjoys optimal asymptotic instance-dependent regret guarantees.

1. Introduction

In several real-world sequential decision-making problems, the learner is required to select, at every interaction, different actions, i.e., an action vector, acting on different portions of the system, each producing an intermediate observation. In such scenarios, the reward is often a combination of these observations. Consider, for instance, the case in which we want to sell a product on an e-commerce website. Our goal is to maximize the overall revenue derived from the sales of a given item. In this business process, we have to choose (i) the price at which to sell the product and (ii) how much budget to invest in advertising. On the one hand, the price we set determines the propensity of the users to buy a given item, i.e., the conversion rate, representing, for each price, the fraction of the customers that will buy the item (Broder & Rusmevichientong, 2012; Den Boer, 2015). On the other hand, the advertising budget we invest influences the number of potential customers that will be exposed to such an item, i.e., the number of impressions we are able to generate with the advertisement campaign (Feldman et al., 2007). Thus, every time we select a price-budget pair (i.e., an action vector), we observe a noisy realization of the conversion rate, which depends on the price, and a noisy realization of the expected number of impressions, which depends on the budget we invest in advertising (i.e., intermediate observations). Our objective is to maximize the revenue (i.e., reward), computed as the product between the price, the conversion rate, and the impressions (which will give us our income), subtracting the invested advertising budget.1 This scenario can, in principle, be addressed as a standard Multi-Armed Bandit (MAB, Lattimore & Szepesvári, 2020) by looking at the reward (i.e., revenue) only and considering price-budget couples as actions.
However, with such an approach, the intermediate observations (i.e., the conversion rate, a consequence of the price we set, and the impressions we generate, a consequence of the advertising budget we invest), which could provide useful information, would be ignored, with a possibly detrimental effect on the learning process. Indeed, if we look just at the reward and disregard this factored structure, the learning problem will: (i) present an unnecessarily large action space, including all the possible combinations of action components (e.g., price and budget pairs), and (ii) suffer a possibly amplified effect of the noise in the reward due to the product of the noisy intermediate observations (e.g., impressions times conversion rate). A notion of factored bandits has been studied in (Zimmert & Seldin, 2018), in which the expected reward is a general function of the action components. No intermediate observations are considered, and the noise is applied to the final reward only. Thus, this setting ultimately fails to model the real-world scenarios we are interested in, where the intermediate observations play a crucial role and are combined with a specific function (i.e., the product). As we shall see later in the paper, this specificity, motivated by the considered real-world scenarios, will allow us to obtain tighter and more detailed performance guarantees.

1 The formalization of this example and an additional motivating example are reported in Appendix A.

Contributions. In this paper, we propose the novel setting of the Factored-Reward Bandits (FRBs) to model sequential decision-making problems in which the agent is required to play an action vector a = (a_1, ..., a_d)^⊤ consisting of d action components. Each action component a_i provides a noisy intermediate observation x_i, whose product forms the reward r = x_1 x_2 ⋯ x_d. We study this setting from computational and statistical perspectives and propose two regret minimization algorithms endowed with theoretical guarantees. The contributions are summarized as follows.

In Section 2, we introduce the FRB setting, describe the feedback and noise models, and present the learning problem.

In Section 3, we study the statistical complexity of the learning problem in the FRB setting by deriving regret lower bounds. First, in Theorem 3.1, we present the worst-case regret lower bound of order Ω(σ d √(kT)), where σ is the subgaussian proxy, d the number of action components, k the number of possible choices for each action component, and T the learning horizon.2 This result highlights how the complexity of the problem scales linearly with d, and its derivation makes use of technical tools from the multitask bandits literature. In Theorem 3.2, we show that the dependence on σ^d (exponential in d) is unavoidable when intermediate observations are not present, motivating their crucial role. Second, we present the instance-dependent asymptotic regret lower bound, which is first formulated as a linear program with O(k^d) variables (Theorem 3.3) and, subsequently, elaborated in a more explicit form (Theorem 3.4), whose derivation makes use of the rearrangement inequalities (Hardy et al., 1952) and which enjoys a computational complexity of O(dk log k). Qualitatively, this result shows how the choices of the different action components need to coordinate to match the lower bound.
In Section 4, we provide a novel, intuitive, optimistic anytime regret minimization algorithm, Factored Upper Confidence Bound (F-UCB), in which optimism is applied to every action component independently. Then, we characterize its worst-case regret, which has order Õ(σ d √(kT)), matching the lower bound up to logarithmic factors (Theorem 4.1). Then, we study its instance-dependent regret, revealing that it does not match the lower bound (Corollary 4.3). This confirms how coordination between action components is necessary.

In Section 5, we design and analyze a novel algorithm, Factored Track (F-Track). F-Track is based on tracking the lower bound (Lattimore & Szepesvári, 2017) and succeeds in matching the instance-dependent lower bound in the asymptotic regime (Theorem 5.1). Its analysis reveals, once more, the need for coordinating the action components to achieve the optimal performance.

Appendix B discusses additional related works. Numerical simulations are provided in Appendix E. The proofs of all the statements are reported in Appendix C.

2 In the following, we provide more general results in which each action component i can have a different number k_i of choices.

2. Factored Reward Bandits

In this section, we introduce the Factored-Reward Bandits (FRBs), the learner-environment interaction, and the assumptions, and we present the learning problem.3

Problem Formulation. Let T ∈ ℕ be the learning horizon. In an FRB, at every round t ∈ ⟦T⟧, the learner chooses an action vector a(t) = (a_1(t), ..., a_d(t))^⊤ in the action space 𝒜 := ⟦k_1⟧ × ⋯ × ⟦k_d⟧, where, for every i ∈ ⟦d⟧, k_i ∈ ℕ_{≥2} is the number of options of the ith action component a_i(t) of the vector, and d ∈ ℕ_{≥1} is the action vector dimension (i.e., the number of components that the learner must select at every round t). As an effect of the action, the learner observes a vector of d intermediate observations x(t) = (x_1(t), ..., x_d(t))^⊤ and receives as reward the product of the intermediate observations r(t) = ∏_{i∈⟦d⟧} x_i(t). The ith component x_i(t) of the intermediate observation vector x(t) is the effect of the ith action component a_i(t) in the action vector a(t). Specifically, every component i ∈ ⟦d⟧ of the intermediate observation vector x(t) is independent of the others and sampled from a distribution x_i(t) ∼ ν_{i,a_i(t)}, so that x(t) ∼ ν_{a(t)} := ⨂_{i∈⟦d⟧} ν_{i,a_i(t)}. Thus, we will denote an FRB as ν := ⨂_{i∈⟦d⟧} ⨂_{a_i∈⟦k_i⟧} ν_{i,a_i}. Furthermore, we can write x_i(t) = µ_{i,a_i(t)} + ε_i(t), where µ_{i,a_i(t)} is the expected intermediate observation of the ith action component a_i(t), and ε_i(t) is a σ²-subgaussian random noise, independent conditioned on the past and on the other noise realizations ε_j(t) for j ∈ ⟦d⟧∖{i}. As customary, we assume bounded expected values for the intermediate observations, i.e., µ_{i,a_i} ∈ [0, 1] for every i ∈ ⟦d⟧ and a_i ∈ ⟦k_i⟧, and all intermediate observation components x_i(t) are characterized by the same known subgaussian proxy σ.4

Learning Problem. An optimal action vector is a* = (a*_1, ..., a*_d)^⊤ ∈ argmax_{a=(a_1,...,a_d)^⊤ ∈ 𝒜} ∏_{i∈⟦d⟧} µ_{i,a_i} and, since all expected intermediate observations are non-negative, we can factorize the optimization problem, observing that a*_i ∈ argmax_{a_i∈⟦k_i⟧} µ_{i,a_i} for every i ∈ ⟦d⟧. We denote with µ*_i := µ_{i,a*_i} the expected intermediate observation of the optimal ith action component. We define the suboptimality gap related to the ith action component as Δ_{i,a_i} := µ*_i − µ_{i,a_i} for a_i ∈ ⟦k_i⟧, and the suboptimality gap related to the action vector a = (a_1, ..., a_d)^⊤ ∈ 𝒜 as Δ_a := ∏_{i∈⟦d⟧} µ*_i − ∏_{i∈⟦d⟧} µ_{i,a_i}.

3 Let a, b ∈ ℕ with a ≤ b; we introduce the symbols ⟦a, b⟧ := {a, a+1, ..., b−1, b} and ⟦b⟧ := ⟦1, b⟧. A zero-mean random variable ξ is σ²-subgaussian if E[e^{λξ}] ≤ e^{λ²σ²/2} for every λ ∈ ℝ.

4 The extension to different known subgaussian proxies σ_i for every component i ∈ ⟦d⟧ is straightforward.
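Before formalizing the regret, the following minimal simulator sketch (ours, not from the paper) makes the interaction protocol above concrete. It assumes Gaussian noise for the intermediate observations, which is one particular σ²-subgaussian choice, and all class and variable names are illustrative.

```python
import numpy as np

class FactoredRewardBandit:
    """Minimal FRB simulator: d action components, k_i options each.

    Intermediate observations are x_i(t) = mu[i][a_i] + eps_i(t), with
    independent Gaussian noise (a particular sigma^2-subgaussian choice),
    and the reward is the product of the intermediate observations.
    """

    def __init__(self, mu, sigma, seed=0):
        self.mu = [np.asarray(m, dtype=float) for m in mu]  # mu[i][j] in [0, 1]
        self.sigma = float(sigma)
        self.rng = np.random.default_rng(seed)
        self.d = len(self.mu)

    def step(self, action):
        """action: tuple (a_1, ..., a_d) with a_i in {0, ..., k_i - 1} (0-based)."""
        x = np.array([self.mu[i][action[i]] + self.sigma * self.rng.normal()
                      for i in range(self.d)])
        reward = float(np.prod(x))  # r(t) = x_1(t) * ... * x_d(t)
        return x, reward            # intermediate observations and reward

    def optimal_expected_reward(self):
        # The objective factorizes: each component can be optimized separately.
        return float(np.prod([m.max() for m in self.mu]))

# Example: d = 2 components with k_1 = 3 and k_2 = 2 options.
env = FactoredRewardBandit(mu=[[0.2, 0.5, 0.9], [0.4, 0.8]], sigma=0.1)
x, r = env.step((2, 1))  # play a = (3, 2) in the paper's 1-based notation
```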
Let ν be an FRB, 𝔄 a learning algorithm, and T ∈ ℕ the learning horizon; we define its cumulative regret as:

R_T(𝔄, ν) := T ∏_{i∈⟦d⟧} µ*_i − ∑_{t∈⟦T⟧} ∏_{i∈⟦d⟧} µ_{i,a_i(t)} = ∑_{t∈⟦T⟧} Δ_{a(t)}.   (1)

The goal of the learner consists in minimizing the expected cumulative regret E[R_T(𝔄, ν)], where the expectation is taken w.r.t. the randomness of the observations and the possible randomness of the algorithm 𝔄.

3. Regret Lower Bounds

In this section, we provide lower bounds to the expected regret that any learning algorithm suffers when addressing the learning problem in an FRB, both in the minimax (Section 3.1) and in the instance-dependent (Section 3.2) cases.

3.1. Worst-Case Lower Bound

We present the worst-case lower bound that every algorithm suffers and discuss the role of the structure of the FRB.

Theorem 3.1 (Worst-Case Lower Bound). For every algorithm 𝔄, there exists an FRB ν such that, for:

T ≥ (1 − 2^{−1/(d−1)})^{−2} 2σ² max_{i∈⟦d⟧} k_i = O(σ²d²k),   (2)

𝔄 suffers an expected cumulative regret of at least:

E[R_T(𝔄, ν)] ≥ (σ/4) ∑_{i∈⟦d⟧} √(k_i T).

In particular, if k_i = k for every i ∈ ⟦d⟧, we have E[R_T(𝔄, ν)] ≥ Ω(σ d √(kT)).

Proof Sketch. The challenge is the structure of the regret in an FRB. We lower bound the regret R_T(𝔄, ν) with a sum of the regrets R^(i)_T(𝔄, ν) := ∑_{t∈⟦T⟧} Δ_{i,a_i(t)} that an algorithm 𝔄 would have suffered by playing d parallel MABs. Choosing µ*_i = 1 for every i ∈ ⟦d⟧, the regret R_T(𝔄, ν) = ∑_{t∈⟦T⟧} Δ_{a(t)} is lower bounded, up to a constant factor, by ∑_{i∈⟦d⟧} R^(i)_T(𝔄, ν). This derivation leverages an ad-hoc technical Lemma C.2, which holds for sufficiently small suboptimality gaps, i.e., Δ_{i,a_i} ≤ 1 − 2^{−1/(d−1)}. This condition gives rise to the constraint on the minimum time horizon (Equation 2), since the suboptimality gaps will be chosen ∝ T^{−1/2}. Indeed, intuitively, if the suboptimality gaps Δ_{i,a_i} are too large (depending on d), we will have 1 − ∏_{i∈⟦d⟧}(1 − Δ_{i,a_i(t)}) ≪ ∑_{i∈⟦d⟧} Δ_{i,a_i}, making the instances more distinguishable and, consequently, reducing the regret. The result is obtained by showing that each regret component satisfies R^(i)_T(𝔄, ν) ≥ Ω(σ√(k_i T)), redesigning for the subgaussian case the solution designed for Bernoulli rewards from the multitask bandit literature (Wang et al., 2021, Theorem 10).

To understand the beneficial effect of (i) the factored structure and (ii) the intermediate observations, it is worth comparing the result of Theorem 3.1 with the regret lower bounds of common settings. If we remove (i), we are in the presence of a MAB with 𝒜 = ⟦k_1⟧ × ⋯ × ⟦k_d⟧ as action space.5 It is worth noting that, even in this case, the reward r(t) = ∏_{i∈⟦d⟧} x_i(t) is the product of d subgaussian random variables, which is not, in general, subgaussian (see Lemma D.1). Nevertheless, r(t) is guaranteed to preserve a finite variance σ̄² of order at least σ^{2d} (see Lemma D.3). Thus, we can look at the setting as a heavy-tailed MAB with finite variance (Bubeck et al., 2013) with ∏_{i∈⟦d⟧} k_i actions, leading to a regret of order Ω(σ̄ √(∏_{i∈⟦d⟧} k_i T)), which becomes Ω(σ^d √(k^d T)) when k_i = k for every i ∈ ⟦d⟧. It is natural to wonder if (i) alone is enough to break the exponential dependence on d (on both σ and k).
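As a quick numerical illustration of this comparison (our own sanity check, not an experiment from the paper), the sketch below estimates the dispersion of the product reward as d grows when every expected intermediate observation equals 1. In this configuration the variance of the product is exactly (1 + σ²)^d − 1, which matches the variance bound of Lemma D.2 discussed in Section 4.1, while each intermediate observation alone only carries noise at level σ.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_samples = 1.5, 200_000

for d in (1, 2, 3, 5):
    # Configuration used in the discussion: all mu_i equal to 1, with
    # independent Gaussian noise of standard deviation sigma per component.
    x = 1.0 + sigma * rng.normal(size=(n_samples, d))
    reward = x.prod(axis=1)
    print(f"d={d}:  empirical std of product reward = {reward.std():8.2f}   "
          f"sqrt((1 + sigma^2)^d - 1) = {np.sqrt((1.0 + sigma**2) ** d - 1.0):8.2f}   "
          f"per-component noise sigma = {sigma}")
```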
This setting is similar, but not exactly overlapping, to that of Zimmert & Seldin (2018), in which a general factored structure is considered without intermediate observations and assuming that the subgaussian noise is applied to the reward directly. Nevertheless, (Zimmert & Seldin, 2018) provide neither worst-case lower bound nor worst-case regret analysis of the proposed algorithm. The following result shows that (i) only is enough to remove the exponential dependence in d on k but not on σ, which remains unavoidable without (ii). Theorem 3.2 (Worst-Case Lower Bound without Intermediate Observations). For every algorithm A: that ignores the intermediate observations xptq and observes the reward rptq only, there exists an FRB ν such that for: T ě 4pmin i PJd K ki 1q{d, A: suffers an expected cumulative regret of at least: E RT p A:, νq ě σd pmini PJd K ki 1q T In particular, if ki : k for every i P Jd K, we have E RT p A:, νq ě Ωpσda Thus, Theorem 3.2 shows that the exponential dependence of d on σ is maintained even with the factored structure. This is particularly significant when σ ą 1, a regime in which the function σd{ ? d is exponentially increasing in d. This motivates the interest in studying this setting combining factored structure (i) and intermediate observations (ii). Remark 3.1 (About the independence of the intermediate observations). The formulation of the FRB in Section 2 assumes that the components xiptq of the observation vector xptq are independent. This is necessary to treat the problem with appropriate advantages over standard MABs on the combinatorial action space A. Indeed, if we rule out the independence assumption, we can always define a FRB in which xptq pyptq, 1, . . . , 1q T, where yptq ν1,aptq. This 5Note that makes no sense to consider (ii) without (iq. Factored-Reward Bandits with Intermediate Observations corresponds to a standard σ2-subgaussian MAB with A as action space and arm distributions ν1,a. Nevertheless, it is possible to relax the independence assumption, by requiring non-correlation among the intermediate observations. 3.2. Instance-Dependent Lower Bound We present the instance-dependent lower bound that every algorithm suffers on a specific instance ν of the FRB setting. Theorem 3.3 (Instance-Dependent Lower Bound). For every consistent6 algorithm A and FRB ν with unique optimal arm a P A it holds that: lim inf T Ñ 8 E r RT p A, νqs log T ě Cpνq, (3) where Cpνq is defined as the solution to the following optimization problem: min p Laqa PAzta u a PAzta u La a (4) s.t. Li,j ÿ a PAzta u ai j La, @i PJd K, j PJki Kzta i u (5) 2 i,j , @i PJd K,j PJki Kzta i u (6) La ě0, @a PAzta u. (7) Proof Sketch. Here we provide an informal derivation that captures the intuition, although the formal proof requires some additional technical effort (see Appendix C.1). Thanks to the factored structure, we can show, as for stochastic bandits, that for every j P Jki Kzta i u and i P Jd K the expected number of pulls Er Ni,jp Tqs is lower bounded by (Constraint 6): Li,j : Er Ni,jp Tqs log T ě 2σ2 2 i,j for T Ñ 8 We now want to find the arrangements of the number of pulls of action vectors Nap Tq, for every a P Azta u, to minimize the cumulative regret. Recalling that Ni,jp Tq ř a PA : ai j Nap Tq, we define Li,j ř a PAzta u : ai j La (Constraint 5). Finally, by recalling the decomposition of the regret Er RT p A,νqs a PA La a we get the objective function in Equation (4) to be minimized. 
Notice that to make the proof fully formal we need to properly manage the asymptotic behavior of the sequences E[N_{i,j}(T)] and E[N_a(T)] when T → ∞.

The optimization problem in Theorem 3.3 is a Linear Program (LP) with ∏_{i∈⟦d⟧} k_i + ∑_{i∈⟦d⟧} k_i − d − 1 variables and ∏_{i∈⟦d⟧} k_i + 2∑_{i∈⟦d⟧} k_i − 2d − 1 constraints. Constraint (5) establishes the relation between the number of pulls of the action vectors L_a and the number of pulls of the action components L_{i,j}. This captures the information sharing of the setting, in which we obtain a sample for the action component (i, j) whenever we pull an action vector a such that a_i = j. Being a minimization problem, Constraint (6) will be satisfied with equality, allowing the removal of the variables L_{i,j} and the relative constraints. Thus, the LP can be solved in polynomial time w.r.t. ∏_{i∈⟦d⟧} k_i (Vaidya, 1989).

6 An algorithm 𝔄 is consistent if, for every FRB ν and p > 0, it holds that lim sup_{T→∞} E[R_T(𝔄, ν)]/T^p = 0.

Explicit Solution of the LP. We now illustrate how to solve the LP with a smaller time complexity of order O(∑_{i∈⟦d⟧} k_i log k_i). We first provide the intuition and, then, the formal argument. The minimum proportion with which the action component (i, j) is to be pulled (Constraint 6) can be accomplished by pulling different sequences of action vectors a such that a_i = j. How should we arrange the pulls of the action vectors to satisfy Constraint (6) and minimize the regret? To start capturing the intuition, consider the simplest setting with d = 2, k_1 = k_2 = 2, a*_1 = a*_2 = 1, µ_{1,1} = µ_{2,1} = 1, and µ_{1,2} = µ_{2,2} = y ∈ (0, 1). To satisfy Constraint (6), we have to guarantee L_{1,2} = L_{2,2} = 2σ²(1 − y)^{−2} (in the solution the constraint is satisfied with equality), and we have at our disposal 4 action vectors 𝒜 = {(1, 1), (1, 2), (2, 1), (2, 2)}. We can satisfy the constraint in two ways:7 (i) playing action (2, 2) (i.e., with both suboptimal components) for a proportion of 2σ²(1 − y)^{−2} times, suffering 1 − y² instantaneous regret; (ii) playing actions (1, 2) and (2, 1) (i.e., with one suboptimal component) for a proportion of 2σ²(1 − y)^{−2} each, suffering 1 − y instantaneous regret. It is simple to convince oneself that (i) is the choice that minimizes the cumulative regret. Indeed, for y ∈ (0, 1), we have:

2σ²(1 − y)^{−2}(1 − y²)  [regret of choice (i)]  ≤  4σ²(1 − y)^{−2}(1 − y)  [regret of choice (ii)].   (8)

This intuitive reasoning can be extended to the general case. To this end, let us define the sorting functions π_i : ⟦k_i⟧ → ⟦k_i⟧, for every i ∈ ⟦d⟧, as any bijective function such that: µ_{i,π_i(1)} ≤ ⋯ ≤ µ_{i,π_i(k_i−1)} ≤ µ_{i,π_i(k_i)} = µ*_i. We claim that, in the optimal arrangement, the action components need to coordinate as illustrated in Figure 1. For every dimension i ∈ ⟦d⟧ (row), we sort the action components in non-decreasing order of µ_{i,j} according to the sorting function π_i. To every j ∈ ⟦k_i − 1⟧, an interval of length L_{i,j} is associated, corresponding to the proportion of pulls. Now, we combine the different rows to obtain the active action vector (represented by different colors) made by the corresponding action components. Each active action vector will be pulled for a proportion (the colored vertical slices) depending on the L_{i,j} values of the corresponding components. Notice that we can have at most ∑_{i∈⟦d⟧}(k_i − 1) active action vectors, and the total proportion of pulls (the width of the full table in Figure 1) is given by M := max_{i∈⟦d⟧} ∑_{j∈⟦k_i−1⟧} L_{i,j}.

7 Any mix between (i) and (ii) is clearly suboptimal.

[Figure 1 omitted: each row i displays the sorted expected observations µ_{i,π_i(1)}, ..., µ*_i with the associated interval lengths L_{i,π_i(j)}; the vertical slices identify the active action vectors.]
Figure 1. Efficient solution to the LP presented in Theorem 3.3.
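For small instances, the value C(ν) defined in Theorem 3.3 can also be checked numerically. The sketch below (our illustrative code, not the authors') builds the LP for a toy instance and solves it with scipy.optimize.linprog, taking Constraint (6) with equality as discussed above; on the 2×2 example it concentrates all the pulls on the doubly suboptimal action vector, in line with the coordination argument.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def lower_bound_constant(mu, sigma):
    """Solve the LP of Theorem 3.3: minimize sum_a L_a * Delta_a subject to
    sum_{a : a_i = j} L_a >= 2 * sigma^2 / Delta_{i,j}^2 for every suboptimal
    action component (i, j), with L_a >= 0 (0-based component indices)."""
    mu = [np.asarray(m, dtype=float) for m in mu]
    d = len(mu)
    a_star = [int(m.argmax()) for m in mu]
    actions = [a for a in itertools.product(*[range(len(m)) for m in mu])
               if list(a) != a_star]
    opt_value = float(np.prod([m.max() for m in mu]))
    gaps = np.array([opt_value - np.prod([mu[i][a[i]] for i in range(d)])
                     for a in actions])
    # One ">=" constraint per suboptimal action component, written as "<=".
    A_ub, b_ub = [], []
    for i in range(d):
        for j in range(len(mu[i])):
            if j == a_star[i]:
                continue
            A_ub.append([-1.0 if a[i] == j else 0.0 for a in actions])
            b_ub.append(-2.0 * sigma**2 / (mu[i].max() - mu[i][j]) ** 2)
    res = linprog(c=gaps, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0.0, None)] * len(actions))
    return res.fun, dict(zip(actions, res.x))

# Toy 2x2 instance from the discussion above: mu_{i,1} = 1, mu_{i,2} = y.
y, sigma = 0.5, 1.0
c_nu, pulls = lower_bound_constant([[1.0, y], [1.0, y]], sigma)
# All the proportion is assigned to the action vector (2, 2) (index (1, 1) here),
# and c_nu = 2 * sigma^2 * (1 - y^2) / (1 - y)^2, matching choice (i) above.
```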
To formally characterize the solution, we introduce, for every i ∈ ⟦d⟧ and l ∈ ⟦k_i − 1⟧, the variables M_{i,l} := ∑_{l'∈⟦l⟧} L_{i,π_i(l')} and M_{i,k_i} = +∞, representing the cumulative proportion of pulls of the action components more suboptimal than (i, π_i(l)), i.e., fixing a row i, the positions of the black vertical lines in Figure 1 sorted from left to right. Let us define the sorting function π : ⟦K⟧ → ∪_{i∈⟦d⟧}({i} × ⟦k_i⟧), where K = ∑_{i∈⟦d⟧} k_i, as any bijection such that: M_{π(1)} ≤ ⋯ ≤ M_{π(K−d)}, with the convention M_{π(0)} = 0, i.e., the positions in which we move from one vertical slice to the next one in Figure 1, sorted from left to right. For every ℓ ∈ ⟦K⟧, we define the active action vector as α_ℓ = (j_{1,ℓ}, ..., j_{d,ℓ})^⊤ ∈ 𝒜, where: j_{i,ℓ} := π_i(min{l ∈ ⟦k_i⟧ : M_{i,l} ≥ M_{π(ℓ)}}). This allows us to prove the following result.

Theorem 3.4 (Instance-Dependent Lower Bound (Explicit)). Let C(ν) be the solution of the optimization problem of Theorem 3.3. It holds that:

C(ν) = ∑_{ℓ∈⟦K−d⟧} (M_{π(ℓ)} − M_{π(ℓ−1)}) Δ_{α_ℓ},

which can be computed in O(∑_{i∈⟦d⟧} k_i log k_i).

Proof Sketch. We generalize Equation (8) with the rearrangement inequality for integrals (Luttinger & Friedberg, 1976), the continuous version of the better-known rearrangement inequality for sequences (Hardy et al., 1952).

4. A Worst-Case Optimal Algorithm

In this section, we present an optimistic anytime regret minimization algorithm for the FRB setting. Factored Upper Confidence Bound (F-UCB), whose pseudocode is reported in Algorithm 1, is based on the idea of running a UCB-like exploration (Auer et al., 2002) independently for every dimension i ∈ ⟦d⟧ and estimating the expected observation µ_{i,a_i} for every action component a_i ∈ ⟦k_i⟧.

Algorithm 1: F-UCB.
Input: Exploration parameter α, subgaussian proxy σ, action component sizes k_i, ∀i ∈ ⟦d⟧
1  Initialize N_{i,a_i}(0) ← 0, µ̂_{i,a_i}(0) ← 0, ∀a_i ∈ ⟦k_i⟧, i ∈ ⟦d⟧
2  for t ∈ ⟦T⟧ do
3    Select a(t) ∈ argmax_{a=(a_1,...,a_d)^⊤ ∈ 𝒜} ∏_{i∈⟦d⟧} UCB_{i,a_i}(t), where UCB_{i,a_i}(t) = µ̂_{i,a_i}(t−1) + σ √(α log t / N_{i,a_i}(t−1))
4    Play a(t) and observe x(t) = (x_1(t), ..., x_d(t))^⊤
5    Update µ̂_{i,a_i(t)}(t) and N_{i,a_i(t)}(t) for every i ∈ ⟦d⟧

The algorithm requires as input the number of action components k_i for every i ∈ ⟦d⟧, the exploration parameter α > 2, and the subgaussian proxy σ. After initializing the variables that keep track of the number of pulls N_{i,a_i}(t) and of the sample means µ̂_{i,a_i}(t) for all action components (line 1), the algorithm starts the learner-environment interaction. At every round t ∈ ⟦T⟧, F-UCB computes the optimistic action, i.e., the action a(t) maximizing the optimistic index:

a(t) ∈ argmax_{a=(a_1,...,a_d)^⊤ ∈ 𝒜} ∏_{i∈⟦d⟧} UCB_{i,a_i}(t),

where µ̂_{i,a_i}(t) is the empirical mean of the observations for the ith component of the observation vector determined by the action component a_i, and N_{i,a_i}(t) is the number of times the corresponding component of the action vector has been played (line 3). Then, the algorithm plays it and observes the d-dimensional observation vector x(t) = (x_1(t), ..., x_d(t))^⊤ (line 4). The observation vector is used to incrementally update the sample means of all the action components involved and the related counters (line 5). Finally, the algorithm reduces to UCB1 when d = 1. F-UCB enjoys a time complexity of O(T ∑_{i∈⟦d⟧} k_i) and a space complexity of O(∑_{i∈⟦d⟧} k_i).
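The following is a compact, illustrative implementation of Algorithm 1 (ours, not the authors' code). It pairs with the simulator sketched in Section 2, assumes the known subgaussian proxy σ is passed in, and forces one pull of every option before the index is used, a standard convention that the pseudocode leaves implicit.

```python
import numpy as np

class FUCB:
    """Sketch of F-UCB (Algorithm 1): one independent UCB index per component.

    The optimistic action vector is obtained component-wise, so no search over
    the combinatorial action space A = [k_1] x ... x [k_d] is needed.
    """

    def __init__(self, ks, sigma, alpha=2.1):
        self.ks, self.sigma, self.alpha = list(ks), float(sigma), float(alpha)
        self.counts = [np.zeros(k) for k in self.ks]   # N_{i, a_i}
        self.means = [np.zeros(k) for k in self.ks]    # hat{mu}_{i, a_i}
        self.t = 0

    def select(self):
        self.t += 1
        action = []
        for i in range(len(self.ks)):
            if (self.counts[i] == 0).any():
                # Force one pull of every option so the index is well defined.
                action.append(int(np.argmin(self.counts[i])))
                continue
            bonus = self.sigma * np.sqrt(self.alpha * np.log(self.t) / self.counts[i])
            action.append(int(np.argmax(self.means[i] + bonus)))  # line 3
        return tuple(action)

    def update(self, action, x):
        # Incremental update of sample means and counters (line 5).
        for i, (a_i, x_i) in enumerate(zip(action, x)):
            self.counts[i][a_i] += 1
            self.means[i][a_i] += (x_i - self.means[i][a_i]) / self.counts[i][a_i]

# Usage with the FactoredRewardBandit simulator sketched earlier (names are ours):
# env = FactoredRewardBandit(mu=[[0.2, 0.5, 0.9], [0.4, 0.8]], sigma=0.1)
# agent = FUCB(ks=[3, 2], sigma=0.1)
# for _ in range(10_000):
#     a = agent.select()
#     x, r = env.step(a)
#     agent.update(a, x)
```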
Indeed, at every round t P JTK, we need to recompute the index UCBi,aiptq for all ř i PJd K ki action components (at least the bonus changes at every round). Note that the computation of the optimistic action is not combinatorial since the optimization can be performed independently for every dimension i P Jd K. 4.1. Worst-Case Regret Analysis In this section, we provide the worst-case regret analysis of F-UCB as summarized in the following result. Theorem 4.1 (Worst-Case Upper Bound for F-UCB). For any FRB ν, F-UCB with α ą 2 suffers an expected regret bounded as: E r RT p F-UCB, νqs ď 4σ ÿ αki T log T gpαq ÿ i PJd K ki, where gpαq r O pα 2q 2 .8 In particular, if ki : k, for every i P Jd K, we have E r RT p F-UCB, νqs ď r Opσd ? 8The complete expression is reported in the proof. Factored-Reward Bandits with Intermediate Observations Proof Sketch. Under a suitable good event , we have that µi,ai ď UCBi,aiptq for every i P Jd K, ai P Jki K, and t P JTK. Thus, the instantaneous regret is bounded as: i PJd K µ i ź i PJd K µi,aiptq i PJl 1K µ i lomon µ l µl,alptq looooooomooooooon ďUCBi,aiptqptq µl,alptq i PJl 1,d K µi,aiptq loomoon UCBl,alptqptq µl,al , where the first line is obtained by summing and subtracting all mixed terms ś i PJl K µ i ś i PJl 1,d K µi,aiptq and the second by optimism µ l ď UCBl,a l ptq ď UCBl,alptqptq. Comparing the upper bound of Theorem 4.1 with the lower bound in Theorem 3.1, we realize that the dependence on the learning horizon T is tight up to logarithmic factors (just like UCB1) and the dependence on the number of action components ki, the number of dimensions d, and the subgaussian proxy σ are tight up to constant factors. It is worth comparing our results with the ones that could be obtained by applying literature algorithms to our FRB setting. As already mentioned in Section 3, although each intermediate observation xiptq is σ2-subgaussian, their product rptq, i.e., the reward, is not in general. This prevents, for instance, the application of UCB1 which assumes subgaussian (or bounded) reward. Precisely, for d 2, the reward rptq x1ptqx2ptq is a subexponential random variable, a scenario that can be still approached with the standard sample mean estimator but leveraging the Bernstein s concentration bound (Boucheron et al., 2013). However, for d ě 3, as shown in Lemma D.1, the reward rptq does not admit a moment-generating function and, consequently, displays a heavy-tailed behavior (Bubeck et al., 2013). Nevertheless, the reward rptq random variable maintains a finite variance bounded by σ2 1 σ2 d 1 (see Lemma D.2). This enables the application of algorithms designed for heavytailed bandits, such as Robust-UCB (Bubeck et al., 2013), able to handle generic distributions with finite variance, by resorting to estimators other than the sample mean. It is easy to verify that by considering the Median of Means estimator (Bubeck et al., 2013), we obtain a regret upper bound in the order of r O σ bś i PJd K ki T . This result is in line with the discussion in Section 3 and, clearly, not optimal. Indeed, the dependence on the product ś i PJd K ki " ř i PJd K ki is because Robust-UCB does not exploit the factored property of the FRB setting. Furthermore, the dependence on σ a p1 σ2qd 1 ě σ is justified by the fact that the intermediate observations are ignored. Finally, the analysis of Factored Bandit TEA (Zimmert & Seldin, 2018) cannot be adapted to our setting since, as already mentioned, the subgaussian noise is applied to the final reward only. 4.2. 
Instance-Dependent Upper Bound In this section, we provide the analysis of the instancedependent regret upper bound for the F-UCB algorithm. The following theorem summarizes the result. Theorem 4.2 (Instance-Dependent Upper Bound for F-UCB). For a given FRB ν, F-UCB with α ą 2 suffers an expected regret bounded as: E r RT p F-UCB, νqs ď Cp F-UCB, νq, where Cp F-UCB, νq is defined as the solution to the following optimization problem (where gpαq r Oppα 2q 2q): max p Naqa PA a PAzta u Na a (9) s.t. Ni,j ÿ a PAzta u ai j Na, @i PJd K, j PJki Kzta i u (10) Ni,j ď 4ασ2log T 2 i,j gpαq, @i PJd K, j PJki Kzta i u (11) a PA Na T (12) Naě0, @a PA (13) The derivation of the LP in Theorem 4.2 follows a similar rationale as that of the instance-dependent lower bound of Theorem 3.3. Since F-UCB runs an optimistic UCB strategy independent for every action component, we can derive an upper bound on the expected number of pulls for every i P Jd K and j P Jki Kzta i u (denoted with Ni,j in the LP): Er Ni,jp Tqs ď 4ασ2 log T 2 i,j gpαq, generating Constraint (11), that, since the problem involves a maximization, will be satisfied with equality. To relate the expected number of pulls Er Nap Tqs of the action vectors a P Azta u (denoted with Na in the LP) with the ones of the action components Er Ni,jp Tqs, we use the same argument of Theorem 3.3, producing Constraint (10). Similarly to the LP in Theorem 3.3, the problem is made of ś i PJd K ki ř i PJd K ki d variables and 1 ś i PJd K ki 2 ř i PJd K ki 2d constraints. We now provide an explicit solution to a relaxation of the LP of Theorem 4.2. Corollary 4.3 (Explicit Instance-Dependent Upper Bound for F-UCB). For a given FRB ν, F-UCB with α ą 2 suffers an expected regret bounded by: E r RT p F-UCB, νqs ď Cp F-UCB, νq ď 4ασ2 log T ÿ i PJd K µ i ÿ j PJki Kzta i u 1 i,j gpαq ÿ i PJd K ki, where µ i ś l PJd Kztiu µ l ď 1 for every i P Jd K. Proof Sketch. The result is based on providing a relaxation of the objective function of the optimization problem in Theorem 4.2, which is based on the following bound on the suboptimality gaps of the action vector a pa1, . . . , adq T Factored-Reward Bandits with Intermediate Observations in terms of the suboptimality gaps of the action components: i PJd K i,aiµ i. This allows to upper bound the objective function as: ÿ a PAzta u Na a ď ÿ i PJd K µ i ÿ j PJki Kzta i u Ni,j i,j. By Constraint (11) to upper bound Ni,ai, we get the result. Alternatively, we can drop the constraint ř a PAzta u Na T and use a rearrangement inequality (Hardy et al., 1952) to upper bound the objective function. It is worth comparing this instance-dependent regret upper bound of F-UCB with the one achievable with an algorithm for heavy-tailed bandits, such as Robust-UCB (Bubeck et al., 2013). Our result of Corollary 4.3 is of order (neglecting the dependence on α and on constants): i PJd K µ i ÿ j PJki Kzta i u Instead, Robust-UCB, for instance with the Median of Means estimator, is characterized by the following instancedependent regret of order (neglecting constants): where σ2 p1 σ2qd 1 ě σ2. It is simple to observe that Equation (15) is larger than Equation (14). Indeed, consider the subset of action vectors in which exactly one component is not optimal, i.e., A Ť i PJd K A i where A i : ta P A : ai a i , aj a j , j P Jd Kztiuu. We observe that for every a P A i , the action vector suboptimality gap is related with equality to that of the suboptimal component: l PJd K µ l µi,ai ź l PJd Kztiu µ l µ i i,ai. 
This allows the conclusion of the following as desired: ÿ i PJd K µ i ÿ j PJki Kzta i u Finally, let us compare Corollary 4.3 with the instancedependent regret upper bound of the Factored Bandit TEA algorithm (Zimmert & Seldin, 2018), although the noise model is different. Theorem 2 of (Zimmert & Seldin, 2018) provides a bound of order (neglecting constants): j PJki Kzta i u logp T log Tq log logp T log T q where κ is such that a ď κ ř i PJd K i,ai. Thus, we can set κ maxi PJd K µ i. This result is slightly worse than ours because of the presence of the larger κ and the additional log log T and logp1{ 2 i,jq terms. Remark 4.1 (About Instance-Dependent Optimality of F-UCB). We argue about the instance-dependent optimality of F-UCB. To this end, we focus on a specific FRB instance with generic d ą 1 and k1 kd 2. We consider Gaussian intermediate observations with expected values µi,1 1 and µi,2 1 where P p0, 1q for every i P Jd K. By applying Theorems 3.3 and 4.2, we deduce that for T Ñ 8, we have the lower bound (left) and the F-UCB upper bound (right) on the number of pulls of each suboptimal action component i P Jd K bounded as: Er Ni,2p Tqs log T ě 2σ2 2 and Er Ni,2p Tqs log T ď 4ασ2 Thanks to Theorem 3.4 and Corollary 4.3, we can compute Cpνq and upper bound Cp F-UCB, νq: Cpνq 2σ2p1 p1 qdq 2 and Cp F-UCB,νq log T ď 4dασ2 It is immediate to realize the following extreme behaviors: Cp F-UCB,νq Cpνqlog T ď 2dα 1 p1 qd Ñ # 2α Ñ0 2αd Ñ1 . (16) This suggests that for sufficiently large 1, F-UCB can perform significantly worse than the lower bound, introducing an additional dependence on d. Instead, for sufficiently small 0, F-UCB can match the lower bound up to constant factors.9 Clearly, we conducted this analysis employing an upper bound to the expected regret of F-UCB, which might, in principle, be affected by some analysis artifacts, making it not tight. In Figure 2, we compare the ratio between the actual regret obtained by running F-UCB (5 runs) on the proposed FRB example and the instancedependent lower bound (left) with the ratio between the upper bound and the instance-dependent lower bound computed in Equation (16) (right). We clearly observe that, although the y-scales are different, the behavior confirms a linear dependence of the actual regret of F-UCB on the number of dimensions of the action vector d. 5. Optimal Asymptotic Instance-Dependent Algorithm In this section, we provide an algorithm that matches the derived instance-dependent lower bound (Theorem 3.3) in the asymptotic regime. The algorithm, named Factored Track (F-Track), whose pseudocode is reported in Algorithm 2, is based on the idea of tracking the lower bound (Lattimore & Szepesvari, 2017). The rationale behind the algorithm is that if we want to match the instancedependent lower bound, we need to properly coordinate the choice of the action vectors a P A, given that we have a 9Indeed, when the suboptimality gaps are close to 0, the instantaneous regret ś i PJd K µ i ś i PJd K µi,aiptq approaches the sum of the regrets on each action component ř i PJd Kpµ i µi,aiptqq. Factored-Reward Bandits with Intermediate Observations 0 5 10 15 20 0 Action vector dimension d Regret Ratio = 0.2 = 0.5 = 1 0 5 10 15 20 0 Action vector dimension d Figure 2. Ratio between the actual regret of F-UCB and the instance-dependent lower bound (left) and ratio between the regret upper bound and the instance-dependent lower bound (Equation 16) (right), for different values of d (5 runs, mean 2std). 
lower bound on the minimum number of pulls for the action components pi, jq (Theorem 3.3). To impose such a structure we must plan in advance our sequence of action vector choices. We devise an algorithm composed of three phases: warm-up, success, and recovery. In the warm-up phase, the algorithm pulls some action vectors in such a way that each action component is pulled at least N0 times, i.e., Ni,j ě N0 (line 3). This can be achieved by roundrobing the action components values j of each component i, leading to a number of pulls in the warm-up phase equal to Twarm-up N0 maxi PJd K ki. We use these samples to estimate the expected values pµi,jp Twarm-upq and define the confidence interval threshold ϵT . Then, we use these values as if they were the true ones µi,j to compute the suboptimality gaps p i,j : maxj1PJki K pµi,j1p Twarm-upq pµi,jp Twarm-upq (line 6) and, using them, the number of pulls (line 7): p Ni,j 2σ2f T p1{Tq p 2 i,j , @j P Jki K, i P Jd K where for every δ P p0, 1q: f T pδq : ˆ 1 1 log T ˆ c log log T log ˆ1 where c is a universal constant and, with them, we compute the number of pulls for every action vector p Na by solving the optimization problem in Theorem 3.3 (line 8). It is worth noting that f T p1{Tq log T and this form is needed for technical reasons to guarantee that the confidence bounds hold. In the success phase, until we run out of the rounds t ď T, we track the lower bound by pulling in a roundrobin fashion all arms whose number of pulls Naptq ă p Na (line 10). If we realize that the estimated expected reward pµi,jpt 1q are too far from the ones estimated at the end of the warm-up phase pµi,aip Twarm-upq based on the threshold ϵT , we move to the recovery phase (line 9). In this phase, we play F-UCB until the end of the rounds discarding all the data collected so far (line 12). The following result shows that F-Track asymptotically matches the lower bound for a proper choice of N0 and ϵT . Algorithm 2: F-Track. Input :Warm-up sample size N0, Threshold ϵT , Action component size ki, @i P Jd K, 2 while mini PJd K minj PJki K Ni,jptq ă N0 do 3 Pull action vector aptq with aiptq pt 1q mod ki 1 for all i P Jd K, t Ð t 1 5 Twarm-up Ð t 1 6 Estimate the suboptimality gaps @i P Jd K, j P Jki K : p i,j : maxj1PJki K pµi,j1p Twarm-upq pµi,jp Twarm-upq 7 Compute the number of pulls p Ni,j 2σ2f T p1{Tq p 2 i,j for every action component i P Jd K and j P Jki K 8 Compute the number of pulls p Na for every action vector a P A by solving the LP in Theorem 3.3 9 while t ď T and maxi PJd K,j PJki K |pµi,jp Twarm-upq pµi,jpt 1q| ď 2ϵT do 10 Pull action vector aptq P arg mint Naptq : a P A and Naptq ď p Nau, t Ð t 1 12 Discard all data and play F-UCB until t T Theorem 5.1 (Instance-Dependent Upper Bound for F-Track). For any FRB ν, F-Track run with: log T U and ϵT 2σ2f T p1{ log Tq suffers an expected regret of: lim sup T Ñ 8 E r RT p F-Track, νqs log T Cpνq. 6. Discussion and Conclusions In this paper, we introduced the Factored-Reward Bandits, a novel setting to represent decision-making problems in which the learner is required to perform a set of actions, whose effects can be observed, and the reward is the product of those effects. We characterized the inherent complexity through worst-case and instance-dependent lower bounds, and we discussed the performances of current solutions. To address the regret minimization problem, we proposed two algorithms using the intermediate observations to reduce the complexity of learning in this setting. 
The first F-UCB is an optimistic solution that we proved minimax optimal (up to logarithmic factors). Such a solution deals with action components independently of the others and we have illustrated how, without coordination, we cannot reach instancedependent optimality. To overcome this issue, we propose F-Track, an algorithm able to perform planning on the action components, and we proved its asymptotically instancedependent optimality. As future lines of research, we plan to investigate the possibility of developing an algorithm able to guarantee both non-asymptotic instance-dependent optimality and to consider functions for aggregating intermediate observations different from the product. Factored-Reward Bandits with Intermediate Observations Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Acknowledgments Funded by the European Union Next Generation EU within the project NRPP M4C2, Investment 1.3 DD. 341 - 15 March 2022 FAIR Future Artificial Intelligence Research Spoke 4 - PE00000013 - D53C22002380006. Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (Neur IPS), pp. 2312 2320, 2011. Agrawal, R. The continuum-armed bandit problem. SIAM journal on control and optimization, 33(6):1926 1951, 1995. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235 256, 2002. Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013. Broder, J. and Rusmevichientong, P. Dynamic pricing under a general parametric choice model. Operations Research, 60(4):965 980, 2012. Bubeck, S. Bandits games and clustering foundations. Ph D thesis, Universit e des Sciences et Technologie de Lille, 2010. Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. Bandits with heavy tail. IEEE Transactions on Information Theory, 59 (11):7711 7717, 2013. Cesa-Bianchi, N. and Lugosi, G. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404 1422, 2012. Combes, R., Talebi, M. S., Prouti ere, A., and Lelarge, M. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems (Neur IPS), pp. 2116 2124, 2015. Combes, R., Magureanu, S., and Prouti ere, A. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems (Neur IPS), pp. 1763 1771, 2017. Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In Proceedings of the Conference on Learning Theory (COLT), pp. 355 366, 2008. Den Boer, A. V. Dynamic pricing and learning: historical origins, current research, and new directions. Surveys in operations research and management science, 20(1): 1 18, 2015. Feldman, J., Muthukrishnan, S., Pal, M., and Stein, C. Budget optimization in search-based advertising auctions. In Proceedings of the ACM Conference on Electronic Commerce (EC), pp. 40 49, 2007. Hardy, G. H., Littlewood, J. E., and P olya, G. Inequalities. Cambridge University Press, 1952. Katariya, S., Kveton, B., Szepesv ari, C., Vernade, C., and Wen, Z. Stochastic rank-1 bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pp. 
392 401. PMLR, 2017. Kveton, B., Wen, Z., Ashkan, A., and Szepesvari, C. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 535 543. PMLR, 2015. Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1): 4 22, 1985. Lattimore, T. and Szepesvari, C. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 728 737. PMLR, 2017. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Luttinger, J. and Friedberg, R. A new rearrangement inequality for multiple integrals. Archive for Rational Mechanics and Analysis, 61:45 64, 1976. Magureanu, S., Combes, R., and Prouti ere, A. Lipschitz bandits: Regret lower bound and optimal algorithms. In Proceedings of the Conference on Learning Theory (COLT), volume 35, pp. 975 999. JMLR, 2014. Pinelis, I. Product of three or more independent subgaussian varibles. Math Overflow, 2021. Vaidya, P. M. Speeding-up linear programming using fast matrix multiplication. In Annual symposium on foundations of computer science, pp. 332 337. IEEE Computer Society, 1989. Factored-Reward Bandits with Intermediate Observations Wang, Z., Zhang, C., Singh, M. K., Riek, L. D., and Chaudhuri, K. Multitask bandit learning through heterogeneous feedback aggregation. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130, pp. 1531 1539. PMLR, 2021. Yu, J. Y. and Mannor, S. Unimodal bandits. In Proceedings of the International Conference on Machine Learning (ICML), pp. 41 48. Omnipress, 2011. Zimmert, J. and Seldin, Y. Factored bandits. In Advances in Neural Information Processing Systems (Neur IPS), pp. 2840 2849, 2018. Factored-Reward Bandits with Intermediate Observations A. Examples In this appendix, we first formalize the example described in Section 1 using the formalism of the FRB setting (Appendix A.1). Then, we present an additional example of a higher dimensional problem that can be generalized by the FRB setting (Appendix A.2). A.1. Formalization of the Example of Section 1 Consider the case of joint pricing and advertising described in Section 1. In this scenario, at every round t P JTK, we must select a vector of dimension d 2. Suppose that the first action component is the advertising budget, and the second action component is the selling price. We have k1 advertising budgets over which we want to choose and k2 prices at which we can sell our item. At every round t, we select the budget a1ptq and the price a2ptq. Then, we observe a realization of the impressions we generate due to the budget a1ptq we invested: x1ptq µ1,a1ptq ϵ1ptq, and a realization of the conversion rate due to the price a2ptq we set: x2ptq µ2,a2ptq ϵ2ptq. The reward is equal to rptq a2ptqx1ptqx2ptq a1ptq, corresponding to the return for each sales (the price, considering the turnover as target), multiplied by the fraction of users willing to buy and by the number of customers exposed to the price (i.e., the impressions), minus the budget invested in advertising. Note that the operations of multiplying by the selling price and subtracting the advertising budget do not increase the statistical complexity of the learning problem, as after we select an action, such quantities are deterministic. 
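To make this formalization concrete, the short sketch below (with made-up numbers of ours, not data from the paper) evaluates the expected revenue a_2 · µ_{1,a_1} · µ_{2,a_2} − a_1 over a small grid of advertising budgets and prices and recovers the best pair by enumeration, anticipating the choice of the optimal action discussed next.

```python
import numpy as np

# Hypothetical instance: k_1 = 3 advertising budgets and k_2 = 3 prices.
budgets = np.array([100.0, 250.0, 500.0])            # a_1: invested advertising budget
prices = np.array([5.0, 10.0, 20.0])                 # a_2: selling price
mu_impressions = np.array([400.0, 800.0, 1200.0])    # mu_{1, a_1}: expected impressions
mu_conversion = np.array([0.12, 0.07, 0.03])         # mu_{2, a_2}: expected conversion rate

# Expected reward of every (budget, price) pair: a_2 * mu_{1,a_1} * mu_{2,a_2} - a_1.
expected_reward = (prices[None, :] * mu_impressions[:, None] * mu_conversion[None, :]
                   - budgets[:, None])

i, j = np.unravel_index(int(expected_reward.argmax()), expected_reward.shape)
print(f"optimal budget = {budgets[i]:.0f}, optimal price = {prices[j]:.0f}, "
      f"expected revenue = {expected_reward[i, j]:.2f}")
```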
However, to deal with this more elaborated formulation, we have to take care of it in the choice of the optimal action a : a P arg max a pa1, a2q TPA a2 ź i PJ2K µi,ai a1. (17) Run this problem on F-UCB Moving to the F-UCB, we can easily adapt the formulation of Equation (17) to the one required by the algorithm: aptq P arg max a pa1, a2q TPA a2 ź i PJ2K UCBi,aiptq a1. In practice, as we have done in Section 4, we can replace the real value with our optimistic estimator. Clearly, the analysis of the regret continues to hold with a multiplicative factor maxa2PJk2K |a2|. A.2. Additional Example We present an additional example of problems that can be generalized through the FRB setting related to manufacturing processes. Consider the problem in which we run a manufacturing firm that has to set up the production line for a product. The goal in this scenario is to optimize the following trade-off: maximize the production yield (i.e., the number of items that come out of the production line undamaged) while minimizing the production cost. Considering the item we want to manufacture, let us define a batch size B and a production line consisting of d stages. Assume that each stage has a 1 : 1 production rate (i.e., 1 input corresponds to 1 output). For each stage i P Jd K, we have to select a method to fulfill the stage among a set of ki available alternatives. Each alternative will have an aleatoric impact on the percentage of faulty outputs, and a deterministic cost of production. As such, at every round t, we select an action vector aptq pa1ptq, a2ptq, . . . , adptqq, with aiptq P Jki K, @i P Jd K. At every stage i, we then observe a percentage of undamaged outputs defined as: xiptq µi,aiptq ϵiptq, µi,aiptq P r0, 1s is the expected percentage of faultless products due to selecting action aiptq, Factored-Reward Bandits with Intermediate Observations ϵiptq is a σ2-subgaussian random noise, independent conditioned to the past and the other noise realizations ϵjptq for j P Jd Kztiu. We can model the reward function as: i 1 cipaiptqq, where cipaiptqq is the (deterministic and known) cost associated with the selection of action aiptq. Observe that B is a known and fixed quantity, and cipaiptqq are deterministic and known to the learner. As such, they do not increase the complexity of the learning problem. For this reason, this scenario can be generalized through the F-UCB setting. B. Additional Related Works In this section, we discuss the related works from the action structure perspective and the works that present a notion of factored structure.Then, we compare the most significant related algorithms with our work from the theoretical perspective. Action Structure Originally, multi-armed bandit frameworks focused on independent arms with no inherent structure (Lai & Robbins, 1985). However, in recent decades, various bandit models with several kinds of structure have emerged, such as linear (Dani et al., 2008; Abbasi-Yadkori et al., 2011), Lipschitz (Agrawal, 1995; Magureanu et al., 2014) and unimodal (Yu & Mannor, 2011) bandits. These contributions aim to incorporate diverse forms of structure into the arms being considered. Combes et al. (2017) introduced a generalization of structured bandits, accommodating a wide range of structural concepts among arms. Their work offers a statistically efficient (at least in the general case) algorithm for handling generic structures, at the expense of solving a semi-infinite linear program at each time step. 
The necessity of choosing several actions at a time in a structured manner has been widely studied in the field of combinatorial bandits (Cesa-Bianchi & Lugosi, 2012; Kveton et al., 2015; Combes et al., 2015). Notions of Factored Bandits Among the several kinds of structure, Zimmert & Seldin (2018) is the most similar to the work we propose from the point of view of the action structure, although the two works differ from the feedback perspective. Both works employ an action structure in which an action component ai is selected for each problem dimension i P Jd K. The action components are combined with a general function that obeys a uniform identifiability assumption under which the performance of each action vector can only improve when any action component is switched with the optimal one. However, in the work of Zimmert & Seldin (2018) the feedback comprises a single observation of the subgaussian reward rpatq applied to the aggregated expected reward, whereas, in our work, the feedback comprises one noisy observation for every action component. This peculiarity of our work implies that the reward obtained as the product over all the dimensions is not subgaussian anymore (Lemma D.1). (Zimmert & Seldin, 2018) generalizes (Katariya et al., 2017) to the case of more than two dimensions. B.1. Comparison of the Theoretical Results In Table 1, we summarize our setting with the one of Heavy-Tails Bandits (Bubeck et al., 2013) and the Factored Bandits (Zimmert & Seldin, 2018). We also analyze and compare both our solutions with Robust-UCB (Bubeck et al., 2013) and TEA (Zimmert & Seldin, 2018) from the instance-dependent point of view. Then, in Table 2 we compare worst-case lower and upper bounds from the worst-case perspective. Factored-Reward Bandits with Intermediate Observations Table 1. Comparison with the instance-dependent guarantees of (Bubeck et al., 2013) and (Zimmert & Seldin, 2018). :This result holds for T Ñ 8. ;The authors consider σ 1. Setting Characteristics Lower Bound Upper Bound Match Structure Intermediate Feedback σ d k T (Bubeck et al., 2013) Ω (Zimmert & Seldin, 2018) Ω j PJki Kzta i u j PJki Kzta i u logp T log Tq log logp T log T q This Work F-UCB Theorem 3.3: Theorem 4.2 F-Track Theorem 5.1: Table 2. Comparison with the worst-case guarantees of (Bubeck et al., 2013) (Zimmert & Seldin 2018 do not provide worst-case bounds). Setting Characteristics Lower Bound Upper Bound Match Structure Intermediate Feedback σ d k T (Bubeck et al., 2013) Ω σ ? This Work (F-UCB) Theorem 3.1 Theorem 4.1 Factored-Reward Bandits with Intermediate Observations C. Proofs and Derivations In this section, we provide proofs of the statements discussed in the main paper (Section C.1) and some technical lemmas needed in order to prove them (Section C.2). C.1. Proofs of the Theorems Theorem 3.1 (Worst-Case Lower Bound). For every algorithm A, there exists an FRB ν such that for: T ě 2 1 2 1 d 1 2σ2 max i PJd K ki O σ2d2k , (2) A suffers an expected cumulative regret of at least: E r RT p A, νqs ě σ 4 ? In particular, if ki : k for every i P Jd K, we have E r RT p A, νqs ě Ωpσd ? Proof. Consider an scenario in which µa 1 and i,j ď 1 2 1{pd 1q, @i P Jd K, j P Jki K, then Lemma C.3 allow us to rewrite the expected regret as: E r RT p A, νqs E i PJd K i,aiptq t PJT K i,aiptq i PJd K E Rpiq T p A, νq ı , (18) where Rpiq T p A, νq is the expected regret generated by pulling suboptimal arms on the component i P Jd K. 
This fact implies that if we take sufficiently small i,j ă , @i P Jd K, j P Jki K, we can analyze the expected regret Rpiq T p A, νq we pay for each action component i P Jd K independently and then summing up the regret we pay as shown above. We will see how the condition of the sufficiently small i,j implies that we have to add a condition on the minimum time budget T for which this lower bound holds. We can define a set of ś i PJd K ki FRB base instances as follows. Given a vector ph1, . . . , hdq T P Jk1Kˆ ˆJkd K identifying an instance, we define the expected rewards of such an instance as follows, for P p0, 1{2q: # 1 if j hi 1 if j P Jki Kzthiu , @i P Jd K. (19) We refer as νph1,...,hdq to the instance in which expected values are characterized by the vector ph1, . . . , hdq T P Jk1K ˆ ˆ Jkd K as in Equation (19). We now focus on bounding the regret of a single component i P Jd K. In particular, we focus on component i 1 for the sake of simplicity in the presentation. Then, we can extend the same reasoning to all the others. Let us define a set of helper instances which are needed for the analysis. For all the components different from the first, we consider as before a vector ph2, . . . , hdq T P Jk2K ˆ ˆ Jkd K which characterize the instance νp0,h2,...,hdq defined as follows: µ1,j 1 , @j P Jk1K µi,j # 1 if j hi 1 if j P Jki Kzthiu , @i P J2, d K. (20) We now need to introduce some additional objects. Given a vector ph1, h2, . . . , hdq T P pt0u Y Jk1Kq ˆ Jk2K ˆ ˆ Jkd K, we call Pph1,h2,...,hdq the distribution induced by the history of the pulls and the related rewards for the d components over Factored-Reward Bandits with Intermediate Observations time horizon T in instance νph1,h2,...,hdq. We denote with Ph for h P t0u Y Jk1K the distribution induced by the history averaged over the other dimensions, formally: Ph 1 ś i PJ2,d K ki ř ph2,h3,...,hdq PJk2Kˆ ˆJkd K Pph,h2,...,hdq, and with Eh the expectation over Ph. Coming back to the proof, given the definition of the base instances (Equation 19), the expected regret E Rp1q T p A, νph1,...,hdqq ı related to the first component is given by: E Rp1q T p A, νph1,...,hdqq ı ÿ j PJk1Kzth1u E r N1,jp Tqs p T E r N1,h1p Tqsq . We now want to use Lemma C.4 in order to obtain the following condition: 1 k1 h PJk1K Ehr T N1,hp Tqs ě T To apply Lemma C.4, we need an upper bound on the total variation d TV that we can compute @h P Jk1K as follows: 1 ś i PJ2,d K ki ph2,h3,...,hdq PJk2Kˆ ˆJkd K Pp0,h2,...,hdq Pph,h2,...,hdq 1 i PJ2,d K ki ph2,h3,...,hdq PJk2Kˆ ˆJkd K Pp0,h2,...,hdq Pph,h2,...,hdq 1 (22) i PJ2,d K ki ph2,h3,...,hdq PJk2Kˆ ˆJkd K 1 2DKL Pp0,h2,...,hdq ˇˇˇˇPph,h2,...,hdq (23) i PJ2,d K ki ph2,h3,...,hdq PJk2Kˆ ˆJkd K 1 2Ep0,h2,...,hdqr N1,hp Tqs DKL p0 ph (24) i PJ2,d K ki ph2,h3,...,hdq PJk2Kˆ ˆJkd K 1 2Ep0,h2,...,hdqr N1,hp Tqs 2 g f f e 1 ś i PJ2,d K ki ph2,h3,...,hdq PJk2Kˆ ˆJkd K 1 2Ep0,h2,...,hdqr N1,hp Tqs 2 2σ2 E0r N1,hp Tqs, (27) where line (22) is the triangle inequality for norms, line (23) is due the Pinsker s inequality, line (24) is due to the divergence decomposition lemma (Lattimore & Szepesv ari, 2020, Lemma 15.1) considering that all the component different from the first are equal, line (25) is derived by the expression of DKL between Gaussian distributions, line (26) is due to the Jensen s inequality, and line (27) is obtained by marginalizing w.r.t. the first component. Given this upper bound to the total variation, we can finally apply Lemma C.4 considering m k1 and B 2σ2k1 2 . 
What we get is: 2 N1,hp Tq ȷ ě σ2k1 We can now select the value of in order to have in Equation (28) a bound on T: Factored-Reward Bandits with Intermediate Observations This implies a choice of in the form of: Given such a choice of and the bound given by Equation (21), we get that the regret of the first action component can be bounded as: E Rp1q T p A, νq ı ě p T E r N1,h1p Tqsq The same reasoning can be done for all the others d 1 action components and the bound of Equation (18): E r RT p A, νqs ě 1 i PJd K E Rpiq T p A, νq ı The last point needed is to check that the condition of the choices we made on the is compliant for all the dimensions i P Jd K with the one of Lemma C.3, i.e., all the s are less than defined as: 2σ2 maxi PJd K ki This implies a lower bound on the T for which this bound holds: c 2σ2 maxi PJd K ki T ď 1 2 1{pd 1q. Isolating T we get: T ě 2σ2 maxi PJd K ki 1 2 1{pd 1q 2 . We highlight that the lower bound on the horizon T is quadratic in d. Indeed, for d ě 2 we have: 1 2 1 d 1 2 1 e log 2 d 1 2 ď ˆ 1 2pd 1q 2 4pd 1q2 Opd2q, having exploited the fact that 1 e x log 2 ě x{2 as x P r0, 1s, having set x 1 d 1 P r0, 1s for d ě 2. Thus, we require a mild (quadratic) condition on T ě Opd2 maxi PJd K kiq. We remark that even for standard bandits, the minimax lower bound requires the constraint T ě Opkq, being k the number of arms (Theorem 15.3, Lattimore & Szepesv ari, 2020). This concludes the proof. Theorem 3.2 (Worst-Case Lower Bound without Intermediate Observations). For every algorithm A: that ignores the intermediate observations xptq and observes the reward rptq only, there exists an FRB ν such that for: T ě 4pmin i PJd K ki 1q{d, A: suffers an expected cumulative regret of at least: E RT p A:, νq ě σd pmini PJd K ki 1q T In particular, if ki : k for every i P Jd K, we have E RT p A:, νq ě Ωpσda Proof. For simplicity, we consider d even. We consider the following base instance ν, parametrized by σ ą 1 and Factored-Reward Bandits with Intermediate Observations P r0, 1{4s with ď σd, defined for all i P Jd K and j P Jki Kzt1u: 2σ σ w.p. 1 It is clear that µi,1 1{d and µi,j 0. Consequently, the optimal arm is p1, . . . , 1q J with performance µ and all the other arms have performance 0. Furthermore, the variance of the suboptimal arm components is given by σ2 which is also the subgaussian proxy, while for the optimal arm components, the variance is smaller. Consider now for every i P Jd K: j i P arg min j PJki Kzt1u E νr Ni,jp Tqs ùñ E νr Ni,j i p Tqs ď T ki 1. (30) We construct the alternative instance ν which is equal to ν1 except for the the components pi, j i q for i P Jd K: 2σ σ w.p. 1 enforcing ď σd{2. In this alternative instance, the optimal arm is pj 1 , . . . j d q J, with performance pµ q1 2 . We are considering algorithms that do not observe individual components. Therefore, the distribution of the product of the individual components has to be computed. Since they will be used in the computation of the KL-divergence, we just consider the two most dissimilar ones: # σd w.p. 1 σd σd w.p. 1 # σd w.p. 1 2 σd w.p. 1 where the probability of the first case in which we play, for instance, p1, . . . , 1q J in the base instance is obtained by the following reasoning: we get σd if the number of σ realizations is even (being d even). Thus, we have: l 0 1tl is evenu ˆd j The KL divergence becomes, using reverse Pinsker inequality: DKLpνb : , νb ; q ď 1 1 2 σd DTVpνb : , νb ; q 4 ˆ requiring ď σd{4. 
Let us now lower bound the regret with Bretagnolle-Huber s inequality: maxt Er RT p A, νqs, Er RT p A, ν1qsu ě T t 1 1t Di P Jd K : aiptq j i u DKLpνb aptq}pν1qb aptqq i PJd K E νr Ni,j i p Tqs4 2 4 exp ˆ 4d T 2 being k mini PJd K ki. We set b 4d T with T ě 4pk 1q{d. Theorem 3.3 (Instance-Dependent Lower Bound). For every consistent10 algorithm A and FRB ν with unique optimal arm a P A it holds that: lim inf T Ñ 8 E r RT p A, νqs log T ě Cpνq, (3) where Cpνq is defined as the solution to the following optimization problem: min p Laqa PAzta u a PAzta u La a (4) 10An algorithm A is consistent if for every FRB ν and p ą 0, it holds that lim sup T Ñ 8 Er RT p A, νqs{T p 0. Factored-Reward Bandits with Intermediate Observations s.t. Li,j ÿ a PAzta u ai j La, @i PJd K, j PJki Kzta i u (5) 2 i,j , @i PJd K,j PJki Kzta i u (6) La ě0, @a PAzta u. (7) Proof. The proof of this statement is divided into two parts. Part one is dedicated to finding a lower bound on the expected number of pulls of every action component Ni,jp Tq for each action component i P Jd K, j P Jki Kzta i u. Part two is dedicated to understanding how these pulls of the action components can be combined in action vectors in the best way possible. Part 1: Lower bounding the expected number of pulls for each action component The proof of the expected number of pulls of a sub-optimal action j P Jki Kzta i u of action component i P Jd K is inspired by the proof of the asymptotic number of pulls of sub-optimal arms presented in Theorem 16.2 of (Lattimore & Szepesv ari, 2020). We call Mmn the set of distributions referring to the mth component (m P Jd K) and the nth arm (n P Jkm K). Then, consider Pmn as a specific distribution taken from Mmn to model the reward observations of arm n of component m in a given instance of the setting. Let ν be an instance of the FRB setting with d components and ki actions for every i P Jd K. We start by selecting a component i and a sub-optimal arm j. Let ε ą 0 P R be arbitrary constant. We define a new instance of the FRB setting ν1 such that P 1 ij Pij, @i P Jd Kztiu, @j P Jki K, and P 1 ij Pij, @j P Jki Kztju, and P 1 i,j P Mi,j be such that DKLp Pi,j, P 1 i,jq ď di,j ε and µ1 i,j ą µ i . dmn represents the KL divergence between Pmn and P m. The newly defined instance ν1 is then identical to ν for every arm of every component different from i, and in the ith component every arm is identical except for arm j, which is sub-optimal in ν and is optimal in ν1. Following the original proof, we can define, for any event E: Pνp Ei,jq Pν1p EA i,jq ě 1 Ni,jp Tq ı di,j ε . Now, let Ei,j t Ni,jp Tq ą T{2u, and let RT RT p A, νq and R1 T RT p A, ν1q. Then: RT R1 T ě T Pνp Ei,jqfipµq i,j Pν1p EA i,jqfipµqpµ1 i,j µ i q , where fipµq is obtained by the following observation. Since at every round t P JTK, in which we pull pi, jq we suffer the instantaneous regret in the base instance: ź i PJd K µ i µi,j i PJd Kztiu µi,jptq ě pµ i µi,jq ź i PJd Kztiu µ i i,j i PJd Kztiu µ i (38) and in the alternative instance: i PJd Kztiu µ i ź i PJd K µi,jptq ě pµ1 i,j µ i q ź i PJd Kztiu µ i , (39) i PJd K tiu µ i . (40) Since the term fipµq multiplies both i,j and pµ1 i,j µ i q, it is straightforward to continue the original proof and write: RT R1 T ě T 4 fipµq mint i,j, pµ1 i,j µ i qu exp Eν Ni,jp Tq ı di,j ε . 
Rearranging and dividing by log T, we obtain: Eνr Ni,jp Tqs logp Tq ě logp Tq log fipµq 4 mint i,j, pµ1 i,j µ i qu logp RT R1 T q pdi,j εq logp Tq (41) Factored-Reward Bandits with Intermediate Observations 1 di,j ε log fipµq 4 mint i,j, pµ1 i,j µ i qu logp RT R1 T q pdi,j εq logp Tq (42) hi,jp Tq, (43) by letting ε Ñ 0, having exploited the expression of KL-divergence between Gaussians and having set: hi,jp Tq : max %0, log fipµq 4 mint i,j, pµ1 i,j µ i qu logp RT R1 T q Notice that lim sup T Ñ 8 hi,jp Tq 0 under consistency. Now, iterating this reasoning over i P Jd K and over j P Jki K, we get the lower bound on the expected number of pulls for all the arms of all the action components. Part 2: Understanding how the pulls we have to perform on the action components can be combined From Part 1 of this proof, we have a result on the expectation of the minimum number of pulls. We can now define the quantity: Li,jp Tq : Er Ni,jp Tqs log T , @i P Jd K, j P Jki K. This quantity can be lower bounded as: Li,jp Tq ě 2σ2 2 ij hi,jp Tq, @i P Jd K, j P Jki Kzta i u. Now, we want to understand how these pulls of the action s suboptimal components influence the regret. We chose to look at the asymptotic expected regret, defined as follows: E r RT p A, νqs E r Nap Tqs and we denote: Lap Tq : Er Nap Tqs log T , @a P A. The regret becomes defined as: E r RT p A, νqs a PA Lap Tq a, Now, we want to look at how the pulls of the action vectors La and the ones of the action components are related. We can easily observe that the following relation occurs: a PA:ai j Lap Tq, @i P Jd K, j P Jki K. Given that, we can write an optimization problem in which we search for the best combination of pulls of the action vector satisfying the constraints on the minimum number of pulls of the action components. min Lap T q,Li,jp T q a PAzta u Lap Tq a (45) s.t. Li,jp Tq ÿ a PA:ai j Lap Tq, @i P Jd K, j P Jki Kzta i u (46) Li,jp Tq ě 2σ2 2 i,j hi,jp Tq, @i P Jd K, j P Jki Kzta i u (47) Lap Tq ě 0, @a P Azta u. (48) Now, to simplify notation, we define xpaq Lap Tq, remove the variables Li,j since constraint (47) will be satisfied with Factored-Reward Bandits with Intermediate Observations equality, and reformulate in the unconstrained form using the indicator function IX pxq # 0 if x P X 8 otherwise: inf xpaq f T pxq : ÿ a PAzta u xpaq a ÿ j PJki Kzta i u IRě0 a PA : ai j xpaq 2σ2 2 i,j hi,jp Tq a PA IRě0pxpaqq. (49) With this notation, we want to characterize the value of the optimization problem as the horizon T grows to infinity, i.e., lim inf T Ñ 8 infxpaq f T pxq. Notice that this is exactly what we need to obtain a lower bound to lim inf T Ñ 8 Er RT p A,νqs In the following, we show that: lim inf T Ñ 8 inf xpaq f T pxq inf xpaq f8pxq, (50) where f8 is defined as follows: a PA xpaq a ÿ j PJki Kzta i u IRě0 a PA : ai j xpaq 2σ2 a PA IRě0pxpaqq, (51) corresponding to the optimization problem in which we remove the hi,jp Tq function from the right-hand side of the constraint. First of all, we observe that for every x and T, we have that f T pxq ď f8pxq. It follows that infxpaq f T pxq ď infxpaq f8pxq and, consequently, lim inf T Ñ 8 infxpaq f T pxq ď infxpaq f8pxq. Thus, it remains to prove that lim inf T Ñ 8 infxpaq f T pxq ě infxpaq f8pxq. Since the optimization problem is linear and feasible (for sufficiently large T), there must exist x T such that infxpaq f T pxq f T px T q for every finite T, but also for T 8. 
Now, consider for a fixed x: lim inf T Ñ 8 f T pxq ÿ a PA xpaq a ÿ a PA IRě0pxpaqq lim inf T Ñ 8 j PJki Kzta i u IRě0 a PA : ai j xpaq 2σ2 2 i,j hi,jp Tq a PA xpaq a ÿ a PA IRě0pxpaqq ÿ j PJki Kzta i u lim inf T Ñ 8 IRě0 a PA : ai j xpaq 2σ2 2 i,j hi,jp Tq f8pxq, (54) uniformly since lim sup T Ñ 8 hi,jp Tq 0 and IRě0 is a decreasing function in its argument, having also exploited that lim infnpan bnq ě lim infn an lim infn bn. Indeed, let c ř a PA : ai j xpaq 2σ2 2 i,j and y T hi,jp Tq, we have to compute lim inf T Ñ 8 IRě0pc y T q. Since 0 ď y T and lim sup T Ñ 8 y T 0, we have lim T Ñ 8 y T 0. If c 0, there exists Tpcq such that for T ě Tpcq, we have that y T ď |c|{2. Consequently, lim inf T Ñ 8 IRě0pc y T q IRě0pcq. If, instead, c 0, we have to compute lim T Ñ 8 IRě0py T q; being IRě0 right continuous and y T ě 0 we have that lim T Ñ 8 IRě0py T q 0. This, combined with the fact f T pxq ď f8pxq leads to lim inf T Ñ 8 f T pxq f8pxq, uniformly. Thus, we have that for every ε ą 0 there exists Tpεq ą 0 such that for every T ě T0pεq we have uniformly: ˇˇˇˇ inf T 1ěT f T 1pxq f8pxq ˇˇˇˇ ď ε. (55) Consequently, we have: inf T 1ěT inf xpaq f T 1pxq inf T 1ěT f T 1px T 1q ě f8px T 1q ε ě f8px 8q ε inf xpaq f8pxpaqq ε. (56) This concludes the proof. Theorem 3.4 (Instance-Dependent Lower Bound (Explicit)). Let Cpνq be the solution of the optimization problem of Theorem 3.3. It holds that: Mπpℓq Mπpℓ 1q αℓ, that can be computed in Opř i PJd K ki log kiq. Factored-Reward Bandits with Intermediate Observations Proof. Let M maxi PJd K Mi,ki 1. For every i P Jd K, let us define a non-negative function function fi : R Ñ tµi,juj PJki KY t0u such that: ż R 1tfipxq µi,judx Li,j @j P Jki Kzta i u, (57) ż R 1tfipxq µi,a i udx M Mi,ki 1. (58) Clearly, fi is not uniquely defined. Any function fi satisfying these conditions is measurable (by definition, since the pre-image of any Y Ď tµi,juj PJki K Y t0u is measurable) and correspond to a possible arrangement of a proportion of pulls of the arm components of dimension i. Specifically, all functions satisfying these conditions are called equimesurable meaning that for every fi, gi fulfilling the conditions, we have that tx : fipxq ě yu tx : gipxq ě yu for every y P R. We call this set of functions Fi. A possible arrangement of the proportion of the pulls for component i P Jd K, corresponds to a function fi P Fi such that fipxq 0 for x ă 0 or x ą M. Thus, to minimize the regret as in the optimization problem of Theorem 3.3, we maximize the reward as follows: sup fi PFi, fipxq 0 for x ă 0 or x ą M, i PJd K i PJd K fipxiqdxi ď sup fi PFi, i PJd K i PJd K fipxiqdxi. (59) Let f i be the symmetric decreasing rearrangement of fi for every i P Jd K, which, in our specific case, is a piecewise constant symmetric function. Define x0 0, xi,1 p M Mi,ki 1q{2, xi,l 1 xi,l Li,πipki lq{2 for l P Jki K, we have: l PJki K µi,πipki l 1q1t|x| P rxi,l 1, xi,lqu. (60) From the rearrangement inequality for multiple integrals (Luttinger & Friedberg, 1976), we have: sup fi PFi, i PJd K i PJd K fipxiqdxi ż i PJd K f i pxiqdxi. (61) Let us observe that the product of ş i PJd K f i pxiqdxi actually leads to the solution depicted in the statement of the theorem. Concerning the computational complexity, we observe that it is dominated by the sorting in each dimension i P Jd K. Theorem 4.1 (Worst-Case Upper Bound for F-UCB). 
For any FRB ν, F-UCB with α ą 2 suffers an expected regret bounded as: E r RT p F-UCB, νqs ď 4σ ÿ αki T log T gpαq ÿ i PJd K ki, where gpαq r O pα 2q 2 .11 In particular, if ki : k, for every i P Jd K, we have E r RT p F-UCB, νqs ď r Opσd ? Proof. The proof is composed of two parts. In the first part, we define the probability, given the chosen confidence bounds, that the good event holds, i.e., the probability that all the confidence bounds are valid. The goal is to find an upper bound on the probability that the good event does not hold along the whole time horizon T. In the second part, we aim to characterize the regret under the good event for a specific round t P JTK. Finally, we join the two parts to find an upper bound on the expected cumulative regret. Part 1: Upper bounding the bad event over time horizon T We start by defining our good event Et at round t P JTK, which implies that all the confidence bounds of interest hold, i.e., we are not making a severe underestimate of the expected value of the optimal action components, and severely overestimating the expected values of the suboptimal ones. Formally: @i P Jd K, @ai P Jki Kzta i u : pµi,aiptq µi,ai ď σ α log t Ni,aiptq 11The complete expression is reported in the proof. Factored-Reward Bandits with Intermediate Observations @i P Jd K : µi,a i pµi,a i ptq ď σ α log t Ni,a i ptq We now want to find an upper bound of the probability of the bad event EA t : Di P Jd K, Dai P Jki Kzta i u : pµi,aiptq µi,ai ą σ α log t Ni,aiptq Di P Jd K : µi,a i pµi,a i ptq ą σ α log t Ni,a i ptq Di P Jd K, Dai P Jki Kzta i u, Ds P Jt K : pµi,airss µi,aiptq ą σ s looooooooooooooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooooooooooooon Di P Jd K, Ds P Jt K : µi,a i pµi,a i rss ą σ s looooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooon having highlighted with the symbols pµi,airss and pµi,a i rss the dependence of the estimators on the number of pulls s. We now bound (A) and (B) separately. Similar to the proof of Theorem 2.2 proposed by Bubeck (2010), we use a peeling argument together with Hoeffding s maximal inequality. We apply the peeling argument with a geometric grid over the time interval r1, ts to bound the probability of term (A). Given β P p0, 1q, we note that if s P t1, . . . , tu, then Dj P ! 0, . . . , log t log 1{β ) : βj 1t ă s ď βjt. As such, we obtain: Di P Jd K, Dai P Jki Kzta i u, Ds P Jt K : pµi,airss µi,ai ą σ Di P Jd K, Dai P Jki Kzta i u, Ds P Jt K : xi,airls µi,aiptq ą σ a log t log 1{β ÿ Di P Jd K, Dai P Jki Kzta i u, Ds : βj 1t ă s ď βjt, xi,airls µi,aiptq ą σ a log t log 1{β ÿ Di P Jd K, Dai P Jki Kzta i u, Ds : βj 1t ă s ď βjt, xi,airls µi,aiptq ą σ a αβj 1t log t having denoted with xi,airls the l-sample used to compute the sample mean pµi,airss. 
Applying a union bound on the summations on i and ai, and Hoeffding s maximal inequality, we obtain: P p(A)q ď ÿ ai PJki Kzta i u log t log 1{β ÿ σ2αβj 1t log t 2 ai PJki Kzta i u log t log 1{β ÿ j 0 exp ˆ αβ log t ai PJki Kzta i u log t log 1{β ÿ ai PJki Kzta i u log t log 1 Factored-Reward Bandits with Intermediate Observations Applying the same procedure, we can bound the probability of term (B) in Equation (62) to obtain: P p(B)q ď ÿ log t log 1 As such, we can write the upper bound of the probability of the bad event as: P EA t P p(A)q P p(B)q ď ÿ log t log 1 Let us now bound the sum of the probabilities of the bad event over the horizon T: t PJT K P EA t ď ÿ i PJd K ki ÿ log t log 1 log t log 1 log 1{β 1 ˆ 2 2 αβ t1 αβ 1 4 p2 αβq log 1{β ˆ 2 2 αβ 4 p2 αβq2 logp1{βq ˆ 2 2 αβ 4 p2 αβq2 logp1{βq where line (63) is obtained by bounding the summation with the integral, line (64) is obtained via integration by parts, and the first term of line (65) is obtained by imposing αβ ą 2. Substituting now β 4 α 2, which verifies β P p0, 1q if α ą 2, we obtain: ÿ t PJT K P EA t ď α 2 α 2 pα 2q2 pα 2q2 1 log α 2 i PJd K ki r O pα 2q2 ÿ i PJd K ki. Part 2: Upper bounding the instantaneous regret at time t under the good event We can now bound the instantaneous regret at time t supposing the good event holds. We define the regret Rt at time t as the difference in expectation between the optimal action and the one performed by F-UCB, formally: i PJd K µ i ź i PJd K µi,aiptq (67) i PJl 1K µ l loooomoooon µ l µl,alptq ź i PJl 1,d K µi,aiptq loooooooomoooooooon µ l µl,alptq (69) µ l µl,alptq UCBl,alptqptq (70) UCBl,alptqptq µl,alptq (71) pµl,alptqptq βl,alptqptq µl,alptq (72) l PJd K βl,alptqptq, (73) where line (68) is obtained by summing and subtracting all mixed terms, line (69) follows from bounding the left and right Factored-Reward Bandits with Intermediate Observations products with 1 being all factors (including the middle one) made of non-negative terms, line (71) comes from the optimism under the good event, having denoted with βl,alptq the exploration bonus. Upper bound of the expected cumulative regret Rp F-UCB, Tq Recalling that we call Rt the instantaneous regret under the good event, can now compute an upper bound on the expected cumulative regret as: E r RT p F-UCB, νqs ď ÿ 1 P EA t Rt P Et t PJT K P EA t ÿ t PJT K Rt P ET t PJT K P EA t ÿ t PJT K P EA t ÿ t PJT K 2 ÿ i PJd K βi,aiptqptq t PJT K P EA t 2 ÿ α log t Ni,aiptq t PJT K P EA t 2σ a t PJT K P EA t 2σ a j PJNi,aip T q K t PJT K P EA t 2σ a j PJT {ki K t PJT K P EA t 2σ a 1 j dj (76) t PJT K P EA t 2σ a t PJT K P EA t 2σ a ai PJki K 2 t PJT K P EA t 4σ a α 2 α 2 pα 2q2 pα 2q2 1 log α 2 i PJd K ki 4σ a where line (74) is obtained by rewriting the series over the arms and the number of pulls for each arm, line (75) is derived by considering the worst case, i.e., when all the arms are pulled equally (this is the worst case because we are looking at a concave function), and line (76) is obtained by bounding the summation with the corresponding integral. This concludes the proof. Theorem 4.2 (Instance-Dependent Upper Bound for F-UCB). For a given FRB ν, F-UCB with α ą 2 suffers an expected regret bounded as: E r RT p F-UCB, νqs ď Cp F-UCB, νq, Factored-Reward Bandits with Intermediate Observations where Cp F-UCB, νq is defined as the solution to the following optimization problem (where gpαq r Oppα 2q 2q): max p Naqa PA a PAzta u Na a (9) s.t. 
Ni,j ÿ a PAzta u ai j Na, @i PJd K, j PJki Kzta i u (10) Ni,j ď 4ασ2log T 2 i,j gpαq, @i PJd K, j PJki Kzta i u (11) a PA Na T (12) Naě0, @a PA (13) Proof. The proof of this statement is divided into two parts. The first part is dedicated to finding an upper bound on the expected number of pulls for each action component Nij. The second part is dedicated to understanding how these pulls can be combined to find an upper bound on the regret. Part 1: Upper bounding the expected number of pulls for each action component The proof of the expected number of pulls for σ2-subgaussian variables comprises three parts, extending and following the proof of Theorem 2.2 proposed by Bubeck (2010). Given an instance ν of FRB, consider a component i P Jd K, and a suboptimal action ai P Jki Kzta i u, which suffers a suboptimality gap of i,ai. In this part, we show that if Ii,t ai (i.e., the action selected for component i at time t is ai), then one of the three following equations is true: UCBi,a i ptq ď µ i , (77) ˆµi,aipt 1q ą µi,ai σ α log t Ni,aipt 1q, (78) Ni,aipt 1q ă 4σ2α log T 2 i,ai , (79) where: UCBi,a i ptq is the confidence bound of the optimal arm for component i at time t, having pulled such an arm for Ni,a i pt 1q times in the previous rounds, and ˆµi,ai,Ni,aipt 1q is the estimated value of the mean of arm ai of component i after Ni,aipt 1q pulls. For absurd, if we assume that the three equations are false, then we have: UCBi,a i ptq ą µ i σ2α log t Ni,aipt 1q ě ˆµi,ai,Ni,aipt 1q σ2α log t Ni,aipt 1q UCBi,aipt 1q, which implies that aiptq ai. Now, we bound the probability that Equation (77) or Equation (78) hold true. Similar to the original proof, we use a peeling argument together with Hoeffding s maximal inequality, which is a consequence of Azuma-Hoeffding inequality. Note that: Pp Eq. (77) is trueq ď P Ds P t1, . . . , tu : ˆµi,a i rss Factored-Reward Bandits with Intermediate Observations Ds P t1, . . . , tu : l 1 pxi,a i rls µ i q ď a We now apply the peeling argument with a geometric grid over the time interval r1, ts. More precisely, given β P p0, 1q, we note that if s P t1, . . . , tu, then Dj P ! 0, . . . , log t log 1{β ) : βj 1t ă s ď βjt. As such, we get: Pp Eq. (77) is trueq ď log t log 1{β ÿ Ds : βj 1t ă s ď βjt, l 1 pxi,a i rls µ i q ď a log t log 1{β ÿ Ds : βj 1t ă s ď βjt, l 1 pxi,a i rls µ i q ď a σ2αβj 1t log t We now bound this last term using Hoeffding s maximal inequality, which gives: Pp Eq. (77) is trueq ď log t log 1{β ÿ σ2αβj 1t log t 2 log t log 1{β ÿ j 0 exp ˆ αβ log t log 1{β 1 1 Using the same arguments, it can be proven that: Pp Eq. (78) is trueq ď ˆ log t log 1{β 1 1 We can now write: E r Ni,aip Tqs E t 1 1t Ii,t aiu t u 1 1t Ii,t ai and Eq. (79) is falseu t u 1 1t Eq. (77) or Eq. (78) is trueu t u 1 p Pp Eq. (77) is trueq Pp Eq. (78) is trueqq , where u r 4σ2α log T We can now upper bound the probability of Equations (77) and (78) holds: t u 1 p Pp Eq. (77) is trueq Pp Eq. (78) is trueqq log 1{β 1 1 Factored-Reward Bandits with Intermediate Observations log 1{β 1 1 log 1{β 1 ˆ 2 2 αβ t1 αβ 1 4 p2 αβq log 1{β 4 2 αβ 8 p2 αβq2 log 1{β 4 2 αβ 8 p2 αβq2 log 1{β , where line (80) is obtained via integration by parts and the first term of line (81) is obtained imposing αβ ą 2. Substituting now β 4 α 2, which verifies β P p0, 1q if α ą 2, we obtain: t u 1 p Pp Eq. (77) is trueq Pp Eq. 
(78) is trueqq ď 4 2 4α α 2 8 2 4α α 2 2 1 log α 2 2 α 2pα 2q2 p2 αq2 1 log α 2 α 2 2 log α 2 Rearranging the upper bound on the expected number of pulls given the three cases presented above, we get: Er Ni,jp Tqs ď 4ασ2 log T 2 i,j 2pα 2q α 2 2 log α 2 We set gpαq 2pα 2q α 2 2 logp α 2 α 2 α 2 2 r O pα 2q 2 . Part 2: Upper bounding the expected cumulative regret We now have to understand how the pulls defined in part 1 can be combined. We want to look at the worst combination in which we can pull the suboptimal action components. We recall that regret can be defined by highlighting the dependence on the pulls of the action vectors: Er RT p F-UCB, νqs ÿ As before, we can bind the pulls of the action components Nij and the action vectors Na as follows: Er Ni,jp Tqs ÿ a PA:ai j Na, @i P Jd K, j P Jki K. We know that the pulls cannot be negative, and that the total number of pulls of the action vectors sums to T, so we impose these additional constraints. Now, acting on the number of pulls Na, @a P A we want to find the worst-case in which we can combine action components in action vectors. So, we solve a maximization problem on the regret defined as a function of the number of pulls, given the constraints defined above, and the upper bound on the expected number of pulls of the action components Nij, @i P Jd K, j P Jki Kzta i u defined in Part 1 of this proof. Corollary 4.3 (Explicit Instance-Dependent Upper Bound for F-UCB). For a given FRB ν, F-UCB with α ą 2 suffers an expected regret bounded by: E r RT p F-UCB, νqs ď Cp F-UCB, νq ď 4ασ2 log T ÿ i PJd K µ i ÿ j PJki Kzta i u 1 i,j gpαq ÿ i PJd K ki, where µ i ś l PJd Kztiu µ l ď 1 for every i P Jd K. Factored-Reward Bandits with Intermediate Observations Proof. In order to obtain a relaxed solution of the optimization problem in Theorem 4.2, we first derive the following upper bound to the suboptimality gaps of the action vector a pa1, . . . , adq T: i PJd K µ i ź i PJd K µi,ai ź i PJd K µ i i PJd K µ i ˆ 1 min i PJd K µi,ai i PJd K µ i max i PJd K i PJd K µ i ÿ i PJd K pµ i µi,aiq ź j PJd Kztju µ j (86) i PJd K i,aiµ i, (87) where line (83) follows from observing that ś i PJd K µi,ai µ i ď mini PJd K µi,ai µ i since µi,ai µ i P r0, 1q, line (86) comes from defining j PJd Kztju µ j ď 1. Thus, by considering the objective function in the optimization problem of Theorem 4.2, we have: ÿ a PAzta u Na a ď ÿ a PAzta u Na ÿ i PJd K i,aiµ i (88) i PJd K µ i ÿ a PA : ai j Na i,ai (89) i PJd K µ i ÿ ai PJki Kzta i u Ni,ai i,ai. (90) By using the Constraint (11) to upper bound Ni,ai and recalling that i,j ď 1, we get the result. Theorem 5.1 (Instance-Dependent Upper Bound for F-Track). For any FRB ν, F-Track run with: log T U and ϵT 2σ2f T p1{ log Tq suffers an expected regret of: lim sup T Ñ 8 E r RT p F-Track, νqs log T Cpνq. Proof. Preliminary Results Let us introduce the symbol: ϵi,jpt, δq : Ni,jptq . (91) Consider the event Epδq : t Di P Jd K, Dj P Jki K, Dt P JTwarm-up, TK ě 1 : |pµi,jptq µi,j| ą ϵi,jpt, δqu and let us bound its probability: Pp Epδqq ď ÿ j PJki K P p Dt P JTwarm-up, TK : |pµi,jptq µi,j| ą ϵi,jpt, δqq (92) Ds P JTK : |pµi,jrss µi,j| ą Factored-Reward Bandits with Intermediate Observations j PJki K δ kδ, (94) where line (92) follows from a union bound over the values of i and j, line (93) follows by rewriting the probability by highlighting the dependence of the estimator on the number of samples s, and line (94) follows from Lemma C.1, recalling that sppµi,jrss µi,jq is a martingale difference sequence and it is σ2-subgaussian. 
We will make use of the following two instantiations of event Epδq: E1 : Ep1{ log Tq and E2 : Ep1{Tq. (95) Clearly, from the previous result, we have that Pp E1q ď k{ log T and Pp E2q ď k{T. We start decomposing the regret over the phases of the algorithm: E νr Rp F-Track, Tqs E ν t Pwarm-up aptq loooooooooomoooooooooon :Eνr Rwarm-upp T qs t Psuccess aptq looooooooomooooooooon :Eνr Rsuccessp T qs t Precovery aptq loooooooooomoooooooooon :Eνr Rrecoveryp T qs where, with little abuse of notation, we denoted with t P phase denotes the rounds in which phase phase is active. We proceed to analyze the three components separately. Part 1: Regret in Warm-Up Phase Eνr Rwarm-upp Tqs We start by analyzing the regret in the warm-up phase, whose duration is given by Twarm-up N0 maxi PJd K ki r?log Ts maxi PJd K ki. Thus, the corresponding expected cumulative regret can be bounded as follows: E νr Rwarm-upp Tqs ď max Qa log T U max i PJd K ki O a log T , (97) where max maxa PA a and the Big-O notation retains the dependence on T only. Thus, its contribution to the regret is asymptotically negligible: lim sup T Ñ 8 Eνr Rwarm-upp Tqs log T 0. (98) Part 2: Regret in the Recovery Phase Eνr Rrecoveryp Tqs We move to the analysis of the regret in the recovery phase. We start by showing that if event E1 does not hold, then, the recovery phase never activates. Indeed, under EA 1 simultaneously for all i P Jd K, j P Jki K, and t P JTwarm-up, TK we have that: |pµi,jptq µi,j| ď ϵi,jpt, 1{ log Tq, (99) which implies simultaneously for all i P Jd K, j P Jki K, and t P JTwarm-up, TK that: |pµi,jp Twarm-upq pµi,jpt 1q| ď |pµi,jp Twarm-upq µi,j| |pµi,jpt 1q µi,j| (100) ď ϵi,jp Twarm-up, 1{ log Tq ϵi,jpt 1, 1{ log Tq (101) ď 2ϵi,jp Twarm-up, 1{ log Tq, (102) being ϵi,jpt, 1{ log Tq a decreasing in t. Recalling that Ni,jp Twarm-upq ě N0, we have: 2ϵi,jp Twarm-up, 1{ log Tq 2 2σ2f T p1{ log Tq Ni,jp Twarm-upq ď 2 2σ2f T p1{ log Tq N0 2ϵT . (103) Thus, we conclude that the termination condition of the while loop never activates and, consequently, the recovery phase activates only when E1 holds, i.e., with probability at most 1{ log T. In the recovery phase, our F-Track algorithm plays F-UCB that, from Corollary 4.3, is proved to suffer logarithmic regret of the form: ρp Tq : 4ασ2 log T ÿ i PJd K µ i ÿ j PJki Kzta i u 1 i,j gpαq ÿ i PJd K ki Oplog Tq. (104) Thus, we have that the cumulative regret of the recovery phase is bounded by: E νr Rrecoveryp Tqs E νr Rrecoveryp Tq|EA 1s Pp EA 1q E νr Rrecoveryp Tq|E1s Pp E1q ď 0 ρp Tq log T Op1q. (105) Factored-Reward Bandits with Intermediate Observations Consequently, its contribution to the expected cumulative regret is asymptotically negligible. Indeed: lim sup T Ñ 8 Eνr Rrecoveryp Tqs log T 0. (106) Part 3: Regret in the Success Phase Eνr Rsuccessp Tqs We conclude with the most challenging part consisting of bounding the regret in the success phase. The cumulative regret in the success phase needs to be further decomposed as follows: E νr Rsuccessp Tqs E ν t Psuccess aptq 1t E1 EA 2u ÿ t Psuccess aptq t Psuccess aptq We analyze each term separately. Part 3.1: Regret under EA 1 In what follows, all estimated quantities are estimated with the samples available at the end of the warm-up phase and, thus, we will omit the dependence on Twarm-up. We show that asymptotically, during the success phase and under event EA 1, the algorithm suffers the optimal regret. To this end, we need to introduce some auxiliary tools. 
For every i P Jd K, let us define a sorting function as any bijective function πi : Jki K Ñ Jki K such that: µi,πip1q ď ď µi,πipkiq. (108) If all µi,j are different, the sorting function is unique. Furthermore, for every i P Jd K and j P Jki Kztπipkiqu (i.e., excluding the action component with maximum expected reward), let us denote: Ni,j 2σ2f T p1{Tq 2 i,j , (109) where i,j µi,πipkiq µi,j. Let us notice that Ni,j corresponds approximately to the minimum number of pulls of component pi, jq prescribed by the lower bound in Theorem 3.3 and denoted with Li,j 2σ2 log T 2 i,j . Given the definition of f T p1{Tq, we have that Li,j{Ni,j Ñ 1 as T Ñ 8. Given the sorting function, it is clear that also: Ni,πip1q ď ď Ni,πipkiq. (110) Let us define: βi : f T p1{Tq 1 min l,l1PJki K : Ni,πiplq Ni,πipl1q ˇˇNi,πiplq Ni,πipl1q ˇˇ . (111) It is clear that if for every i P Jb K and j P Jki K we have we have | p Ni,j Ni,j| ď βif T p1{Tq{4, then, for any sorting function pπi of the estimated quantities N i,j, there exist a sorting function πi of the true quantities Ni,j such that pπi πi. Let us define for every i P Jd K and j P Jki K: l 1 Ni,πiplq. (112) We define now a sorting function π : Jk K Ñ Ť i PJd Kptiu ˆ Jki Kq as any bijection such that: Mπp1q ď ď Mπpkq, (113) and convene (with a little abuse of notation) that Mπp0q 0. It is clear that Mπpkq Mπpk 1q Mπpk d 1q T. Let l P Jk K, we define the active action as: αplq : pj1, . . . , jdq where ji s.t. πpl1q pi, jiq and l1 mintl2 ě l and πpl2q pi, qu with i P Jd K. (114) We can now rewrite the regret with this notation: Mπplq Mπpl 1q αplq, (115) having observed that for the k d 1 terms we play the optimal action and the successive ones are zero. Furthermore, given the relation between Li,j and Ni,j, we have that: ř a a Na f T p1{Tq C and lim sup T Ñ 8 log T C. (116) Factored-Reward Bandits with Intermediate Observations Let us now define: β : f T p1{Tq 1 min l,l1PJk K : Mπplq Mπpl1q ˇˇMπplq Mπpl1q ˇˇ . (117) It is clear that if for every i P Jb K and j P Jki K we have |x Mi,j Mi,j| ď βf T p1{Tq{4, for every sorting function pπ of the estimated quantities x Mi,j, there exist a sorting function π of the true quantities Mi,j such that pπ π. If this is the case, then, the active action pαplq induced by pπ must be the same as αplq since the active action depends on the sorting function only. We now show that we can always guarantee | p Ni,j Ni,j| ď pβif T p1{Tqq{4 and |x Mi,j Mi,j| ď pβf T p1{Tqq{4 for sufficiently large T. First of all, let us ensure that we identify the optimal component for every i P Jd K. This is guaranteed whenever for every j P Jki K we have: |pµi,j µi,j| ď ϵi,jp Twarm-up, 1{ log Tq ď ϵT ď min{4, (118) where min mini PJd K minj PJki Kztπipkiqu µi,πipkiq µi,j. The inequality is satisfied for sufficiently large T since: 2σ2f T p1{ log Tq P?log T T O σ2 log log T ?log T Ñ 0 as T Ñ 8. (119) Under this condition, we have that πipkiq pπipkiq and, consequently: p i,j pµi,πpkiq pµi,j and i,j µi,πpkiq µi,j. (120) Thus, under event EA 1, we have |p i,j i,j| ď 2ϵT . Let us now consider i P Jk K and j P Jki Kztπipkiqu, we have: ˇˇˇ p Ni,j Ni,j ˇˇˇ ˇˇˇˇˇ 2σ2f T p1{Tq p 2 i,j 2σ2f T p1{Tq ˇˇˇˇˇ (121) 2σ2f T p1{Tqp i,j p i,jq| i,j p i,j| 2 i,j p 2 i,j (122) ď 8σ2f T p1{Tqp2 max min{2q 4 min ϵT , (123) where max maxi PJd K maxj,j1PJki K |µi,j µi,j1| and having observed that p i,j ě i,j 2ϵT ě min min{2 min{2 and p i,j ď i,j 2ϵT ď max min{2 min{2. Thus, the difference can go below βif T p1{Tq for sufficiently large T. Let us now move to the Mi,j variables. 
For sufficiently large T such that the sorting function πi coincide with their estimated counterparts pπi, we have that for i P Jd K and j P Jki K: ˇˇˇMi,j x Mi,j ˇˇˇ l 1 Ni,πiplq l 1 p Ni,pπiplq ˇˇˇˇˇ (124) ˇˇˇN i,πiplq p Ni,πiplq ˇˇˇ (125) ď 8σ2jf T p1{Tqp2 max min{2q 4 min ϵT . (126) Similarly, as before, we can conclude that this difference can be made smaller than β for sufficiently large T, and, consequently, make the estimated sorting function pπ equal the true counterpart π. Under these conditions, we can bound the cumulative regret under EA 1: ÿ t Psuccess aptq ÿ a a p Na a (127) x Mpπplq x Mpπpl 1q pαplq (128) x Mπplq x Mπpl 1q αplq (129) Factored-Reward Bandits with Intermediate Observations x Mπplq Mπplq Mπpl 1q x Mπpl 1q αplq Mπplq Mπpl 1q αplq (130) ˇˇˇx Mπplq Mπplq ˇˇˇ Cf T p1{Tq (131) ď 8σ2pk dq max i PJd K kif T p1{Tqp2 max min{2q 4 min ϵT Cf T p1{Tq (132) OpϵT f T p1{Tqq Cf T p1{Tq, (133) where we used Equation 126. Thus, recalling that ϵT Ñ 0 for T Ñ 8, we have: lim sup T Ñ 8 E 1t EA 1u ř t Psuccess aptq log T C. (134) Consequently, its contribution to the asymptotic regret is exactly C. Part 3.2: Regret under E1 EA 2 In this case, we have to prove that the regret remains logarithmic. We consider two cases: Case 1 We perform the analysis in the first case under the following conditions: @i P Jd K : πipkiq pπipkiq and @j P Jki Kztπipkiqu : p i,j ě min{4. (135) In such a case, it is simple to show that the regret is at most logarithmic. Indeed, being the optimal arm correctly identified (πipkiq pπipkiq) we have: a a p Na a ď 2 max l 1 x Mpπplq (136) j PJki Kztπipkiqu p Ni,πipjq (137) ď 4σ2f T p1{Tq max ÿ j PJki Kztπipkiqu p 2 i,πipjq (138) ď 64kσ2f T p1{Tq max 2 min Oplog Tq, (139) where we observed that since the optimal arm is correctly identified, the following inequality holds: řk d l 1 x Mpπplq ď ř j PJki Kztπipkiqu p Ni,πipjq. Case 2 If the condition in Equation (135) is violated, we show that the success phase stops after a logarithmic number of rounds. Consider the smallest round ti,j in which for a given i P Jk K and j P Jki Kztpπipkiqu, it holds that: Ni,jpti,jq ě min # 2σ2f T p1{Tq p 2 i,j , 128σ2f T p1{Tq Since the F-Track algorithm in the success phase proceeds with the round robin of at most k arms, we have that: ti,j ď k min # 2σ2f T p1{Tq p 2 i,j , 128σ2f T p1{Tq ď 128kσ2f T p1{Tq 2 min : t Oplog Tq. (141) Now, we consider two sub-cases. Case 2.1 In the first sub-case, we deal with the case in which some optimal components are not correctly identified: Di P Jd K : πipkiq pπipkiq (142) In such a case, at most at round t , we have that: pµi,πipkiqptq ě µi,πipkiqptq 2σ2f T p1{Tq Ni,πipkiqptq (143) ě µi,πipkiqptq max ! p i,πipkiq, min{8 ) (144) ě µi,πipkiqptq p i,πipkiq min{8 (145) Factored-Reward Bandits with Intermediate Observations ě µi,pπipkiqptq i,pπipkiq min{8 p i,πipkiq (146) ě pµi,pπipkiqptq 2σ2f T p1{Tq Ni,pπipkiqptq i,pπipkiq min{8 p i,πipkiq (147) ě pµi,pπipkiqptq maxt0, min{8u i,pπipkiq min{8 p i,πipkiq (148) ě pµi,pπipkiqptq 3{4 min pµi,πipkiqp Twarm-upq pµi,pπipkiqp Twarm-upq. (149) where line (143) follows from the fact that event E2 does not hold, line (144) follows from Equation (140) with j πipkiq, line (145) is obtained with max a, b ď a b for a, b ě 0, line (146) is obtained from the definition of i,pπipkiq, line (147) follows from the fact that event E2 does not hold, line (148) follows from Equation (140) with j pπipkiq (whose estimated p i,pπipkiq 0, and line (149) is obtained from the definition of p i,πipkiq and from i,pπipkiq ě min. 
This implies that at this round: pµi,πipkiqptq pµi,πipkiqp Twarm-upq pµi,pπipkiqp Twarm-upq pµi,pπipkiqptq ě 3{4 min ě 4ϵT , (150) where the latter holds for sufficiently large T. Thus, we have that the success phase stops after at most t rounds, leading to a regret of: ÿ t Psuccess aptq ď max 32kσ2f T p1{Tq 2 min Oplog Tq. (151) Case 2.2 In the first sub-case, we deal with the case holding under the condition: @i P Jd K : πipkiq pπipkiq and Di P Jd K : Dj P Jki Kztπipkiqu : p i,j ă min{4 (152) At round t , for the pi, jq fulfilling the second part of the condition: pµi,πipkiqptq pµi,πipkiqp Twarm-upq pµi,jp Twarm-upq pµi,jptq (153) ě pµi,πipkiqptq pµi,jptq p i,j (154) ě µi,πipkiqptq 2σ2f T p1{Tq Ni,πipkiqptq µi,jptq 2σ2f T p1{Tq Ni,jptq p i,j (155) ě maxt0, min{8u maxtp i,j, min{8u i,j p i,j (156) ě min{4, (157) having exploited p i,j ď min{4 and i,j ě min. Thus, for sufficiently large T, we have that 4ϵT ď min{4 and, consequently, the success phase ends. Part 3.3: Regret under E2 We conclude by bounding the regret under event E2, In this case, we proceed with the following trivial bound, recalling that Prp E2q ď 1{T. t Psuccess aptq ď max T Pp E2q ď max Op1q. (158) Consequently, its contribution to the asymptotic regret is negligible. C.2. Technical Lemmas Lemma C.1. Let T P N, ϵ ą 0. Let X1, . . . , XT be a martingale difference sequence adapted to the filtration F0, F1, . . . , such that for every t P JTK, it holds that EreλXts ď epσ2λ2q{2 a.s. for every λ P R. Then, for every δ P p0, 1q it holds that: 2 p1 plog Tq 1q max tϵ, tσ2u ˆ log ˆ 1 R logp Tσ2{ϵq logp1 plog Tq 1q Furthermore, for sufficiently large T, it holds that: 2σ2tf T pδq Factored-Reward Bandits with Intermediate Observations f T pδq : ˆ 1 1 log T ˆ c log log T log ˆ1 and c ą 0 is a universal constant. Proof. The first statement is obtained from Lemma 14 of (Lattimore & Szepesvari, 2017) considering that the inequality employed in Equation (19) of that proof applies for σ2-subgaussian random variables and not for Gaussian variables only. The second statement is obtained by setting ϵ σ2 and bounding 1 logp1 plog T q 1q ď log T and logp1 rplog Tq2sq ď c log log T for some universal constant c ( 2). Lemma C.2. Let x P r0, 1q, d P N, then if xi P r0, xq , @i P Jd K, it holds: i PJd K p1 xiq ě p1 xqd 1 ÿ i PJd K xi. Proof. We prove this statement by induction. First, we can observe how for d 1 this result trivially holds: 1 p1 x1q x1. We can now make the inductive step on d: i PJd K p1 xiq 1 p1 xdq ź i PJd 1K p1 xiq i PJd 1K p1 xiq xd i PJd 1K p1 xiq i PJd 1K xi ě p1 xqd 1 ÿ i PJd K xi, where line (162) is the inductive step on d. Lemma C.3. In a FRB, considering µa 1, if i,j ď 1 1 21{pd 1q , @i P Jd K, j P Jki K, the regret can be bounded as: RT p A, νq ÿ i PJd K i,aiptq. Proof. We prove this statement by looking at a single time t. We can rewrite Lemma C.2 as: i PJd K p1 i,aiptqq ě p1 qd 1 ÿ i PJd K i,aiptq, if i,j ď P r0, 1q, @i P Jd K, j P Jki K. We make a choice we want to transform this result in order to have: i PJd K p1 i,aiptqq ě 1 i PJd K i,aiptq. This can be done by imposing: 1 2 ď p1 qd 1 1 21{pd 1q ď p1 q Factored-Reward Bandits with Intermediate Observations ď 1 1 21{pd 1q . Lemma C.4 (Wang et al. 2021). Suppose m, B are positive integers and m ě 2; there are m 1 probability distributions P0, P1, . . . Pm, and m random variables N1, . . . , Nm, such that: (i) Under any of the Pi s, N1, . . . , Nm are non-negative and ř i PJm K Ni ď B with probability 1; (ii) @i P Jm K, d TV ď 1 B E0r Nis. 
Then: i PJm K Eir B Nis ě B Proof. For the proof of this Lemma, we refer the reader to Lemma 24 of (Wang et al., 2021). D. Additional Theorems and Lemmas In this section, we provide additional Theorems and Lemmas useful in the discussion of the work. Lemma D.1. The product X1X2 Xn of n ě 3 independent σ2-subgaussian random variables is not subgaussian. Proof. The proof follows the one proposed by (Pinelis, 2021). The proof of this statement can be done by verifying that the moment-generating function of the product of n independent Gaussian distributions with unit variance (Xi Np0, 1q, @i P Jn K) is unbounded: fl 8, @c ą 0. Let us call X the vector composed of our random variables X : p X1, X2, . . . , Xnq and let p U1, U2, . . . Unq be a uniformly distributed unit random vector. For some real Cn ą 0: 1 " Xi ą ||X||2 2?n , @i P Jn K *fi c 1 p2?nqn rn looooomooooon rn 1 exp ˆ r2 looooooooomooooooooon dr P ˆ Ui ą 1 2?n, @i P Jn K looooooooooooooomooooooooooooooon 0 exp ˆ c 1 p2?nqn rn cn p2?nqn rn 1 loooooooooooooooooooomoooooooooooooooooooon loooooomoooooon dr P ˆ Ui ą 1 2?n, @i P Jn K exp ˆ c 1 p2?nqn rn exp ˆ r2 c 1 p2?nqn rn looooomooooon P ˆ Ui ą 1 2?n, @i P Jn K (165) ě Cn p2?nqn ˆ r8 0s ż 8 r dr P ˆ Ui ą 1 2?n, @i P Jn K (166) r8 0s exp ˆ r2 P ˆ Ui ą 1 2?n, @i P Jn K Cną0 ně3 cą0 8. Factored-Reward Bandits with Intermediate Observations The inequality in Equation (163) follows from the fact that the event inside the indicator function happens with a probability ď 1. Equation (164) is a rewriting of the previous line under the assumption that the indicator function evaluates to 1. We can rewrite the expected value as an integral over the positive real numbers since, according to the indicator function, every random variable Xi must be greater than ||X||2 2?n , which is a positive quantity. Term (A) is a substitution of ś i PJn K Xi with r 2?n repeated n times, which comes from the indicator function. r is the integration variable and represents the Euclidean norm of vector X. Term (B) represents the probability density of the Euclidean norm of a Gaussian vector X Np0, Inq. Finally, term (C) represents the probability of the indicator function evaluating to 1. Considering the vector Y whose elements are Yi Xi{||X||2, then ||Y ||2 1. The probability that Yi ą 1 2?n, @i P Jn K can be thought of as the probability that the point defined by Y in the n-dimensional space is located on the surface of the n-dimensional hyper-sphere of radius 1 in the region induced by the condition Yi ą 1 2?n. Equation (165) is an integration by parts of the two functions fprq and g1prq identified in the line above. Equation (166) holds under the assumption that n ě 3 and c ą 0. First, the term: exp ˆ c 1 p2?nqn rn exp ˆ r2 ně3 cą0 8 0 under such an assumption. Second, we can write: exp ˆ c 1 p2?nqn rn r2 0 exp ˆ c 1 p2?nqn rn r2 The final result then holds under the further assumption that Cn ą 0. Lemma D.2 (Variance of the product of independent random variables). Let X1, X2, . . . Xn independent random variables. The variance of their product is: Varr X1X2 Xns ź Varr Xis p Er Xisq2 ź i PJn K p Er Xisq2 Varr X1X2 Xns Erp X1X2 Xnq2s p Er X1X2 Xnsq2 Er X2 1X2 2 X2 ns p Er X1sq2p Er X2sq2 p Er Xnsq2 Er X2 1s Er X2 2s Er X2 ns p Er X1sq2p Er X2sq2 p Er Xnsq2 Varr Xis p Er Xisq2 ź i PJn K p Er Xisq2 Lemma D.3. Let X1, X2, . . . , Xn independent subgaussian random variables with expected value µi P r0, 1s and subgaussianity parameter σi P r0, 8q. 
The variance of the product X_1 X_2 ⋯ X_n is bounded by:

∏_{l ∈ ⟦n⟧} σ_l² ≤ Var[X_1 X_2 ⋯ X_n] ≤ ∏_{l ∈ ⟦n⟧} (σ_l² + 1) − 1.

Proof. We want to find the worst combination of the µ_i, i ∈ ⟦n⟧, i.e., the combination of expected values which maximizes the variance of the product of such random variables. To do so, we fix a single index i ∈ ⟦n⟧ and look at the behavior of the variance as we change µ_i ∈ [0, 1]. We recall from Lemma D.2 that:

Var[X_1 X_2 ⋯ X_n] = ∏_{l ∈ ⟦n⟧} (Var[X_l] + E[X_l]²) − ∏_{l ∈ ⟦n⟧} E[X_l]² = ∏_{l ∈ ⟦n⟧} (σ_l² + µ_l²) − ∏_{l ∈ ⟦n⟧} µ_l²
= (σ_i² + µ_i²) ∏_{l ∈ ⟦n⟧∖{i}} (σ_l² + µ_l²) − µ_i² ∏_{l ∈ ⟦n⟧∖{i}} µ_l² (167)
= µ_i² ∏_{l ∈ ⟦n⟧∖{i}} (σ_l² + µ_l²) − µ_i² ∏_{l ∈ ⟦n⟧∖{i}} µ_l² + σ_i² ∏_{l ∈ ⟦n⟧∖{i}} (σ_l² + µ_l²) (168)
= µ_i² ( ∏_{l ∈ ⟦n⟧∖{i}} (σ_l² + µ_l²) − ∏_{l ∈ ⟦n⟧∖{i}} µ_l² ) + σ_i² ∏_{l ∈ ⟦n⟧∖{i}} (σ_l² + µ_l²), (169)
      (A) = ∏_{l ∈ ⟦n⟧∖{i}} (σ_l² + µ_l²),  (B) = ∏_{l ∈ ⟦n⟧∖{i}} µ_l²,  (C) = σ_i² ∏_{l ∈ ⟦n⟧∖{i}} (σ_l² + µ_l²),

where lines (167), (168), and (169) are nothing but algebraic steps to make the dependence on µ_i explicit in the products. Now we look at the worst-case scenario for the variance, i.e., the value of µ_i that maximizes it. Recalling that µ_i is assumed to be bounded in [0, 1] and that σ_l² is defined over [0, ∞), it is immediate to see that term (A) dominates term (B), so the expression is nondecreasing in µ_i and the worst case for component i is µ_i = 1, no matter the values of µ_l for l ∈ ⟦n⟧∖{i}. Term (C) is not relevant, as µ_i does not appear in it. This reasoning applies to every i ∈ ⟦n⟧, and so the worst-case variance is attained when µ_i = 1 for all components i ∈ ⟦n⟧. Given that, the variance of the product of independent random variables with expected values µ_i ∈ [0, 1] and variances σ_i² can be bounded as:

Var[X_1 X_2 ⋯ X_n] ≤ ∏_{l ∈ ⟦n⟧} (σ_l² + 1) − 1.

A symmetric reasoning (the variance is minimized when µ_i = 0 for every i ∈ ⟦n⟧) leads to the lower bound ∏_{l ∈ ⟦n⟧} σ_l². This concludes the proof.

E. Numerical Validation

In this appendix, we provide numerical simulations to validate the proposed solutions. First, in Appendix E.1, we validate F-UCB against bandit baselines in several scenarios. Then, in Appendix E.2, we compare the two algorithms we propose (i.e., F-UCB and F-Track) in different scenarios to highlight their peculiarities. Finally, in Appendix E.3, we evaluate the proposed algorithms' behavior in the case in which the noise affecting the intermediate observations is partially correlated. The code of the experiments can be found at https://github.com/marcomussi/FRB.

E.1. Comparison of F-UCB against Bandit Baselines

In this part, we show the effectiveness of F-UCB against bandit baselines.

Baselines The first baseline we consider is UCB1 (Auer et al., 2002), which is designed for stochastic bandits. We consider the anytime version of the algorithm, proposed by Bubeck (2010). Due to its characteristics, we expect it to perform comparably to F-UCB for d = 1, with its performance degrading as the dimensionality grows. As an additional baseline, we consider a robust version of the UCB algorithm designed for heavy-tailed (HT) distributions (Bubeck et al., 2013), based on the Median-of-Means estimator (RUCB-MoM). Due to the capability of this algorithm to handle non-subgaussian noise, we expect it to converge for any problem dimensionality, although at a slower rate. Finally, we consider the TEA algorithm, proposed by Zimmert & Seldin (2018). Since this algorithm provides theoretical guarantees only for subgaussian noise applied to the reward, we expect its performance to degrade when d > 1. For all the baselines, we consider the values of the hyperparameters prescribed in the respective original papers.
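To fix ideas before describing the experimental settings, the following is a minimal, self-contained Python sketch of the kind of simulation used in this appendix: a synthetic FRB instance with Gaussian intermediate observations whose product forms the reward, and a per-component optimistic selection rule of the form µ̂_{i,j}(t) + σ √(α log t / N_{i,j}(t)), as it appears in the analysis of Theorem 4.1. It is an illustrative sketch under these assumptions (variable names, the forced exploration of unpulled arms, and the pseudo-regret bookkeeping are ours), not the implementation released in the repository above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma, alpha, T = 2, 3, 0.1, 2.1, 10_000

mu = rng.uniform(0.7, 1.0, size=(d, k))       # expected intermediate observations mu_{i,j}
opt_reward = mu.max(axis=1).prod()            # expected reward of the optimal action vector

counts = np.zeros((d, k))                     # N_{i,j}(t): pulls of arm j in component i
means = np.zeros((d, k))                      # empirical means mu_hat_{i,j}(t)
cum_regret = 0.0
rows = np.arange(d)

for t in range(1, T + 1):
    # Optimistic index per component; unpulled arms get an infinite index (forced exploration).
    with np.errstate(divide="ignore", invalid="ignore"):
        bonus = sigma * np.sqrt(alpha * np.log(t) / counts)
    index = np.where(counts > 0, means + bonus, np.inf)
    action = index.argmax(axis=1)             # one independent choice per component

    x = rng.normal(mu[rows, action], sigma)   # noisy intermediate observations x_i(t)
    cum_regret += opt_reward - mu[rows, action].prod()

    counts[rows, action] += 1
    means[rows, action] += (x - means[rows, action]) / counts[rows, action]

print(f"expected cumulative regret after T={T}: {cum_regret:.2f}")
```

For d = 1 this selection rule collapses to a standard anytime UCB1-like strategy, which is consistent with the behavior reported in the results below.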
Setting For the sake of simplicity in the presentation of the results, we consider the scenario in which all the problem dimensions present the same number of actions (i.e., k_1 = ⋯ = k_d =: k). Moreover, we consider the setting in which the intermediate observations are drawn from Gaussian distributions with mean µ_{i,a_i(t)} for every action component a_i(t) in position i of the action vector a, formally x_i(t) ∼ N(µ_{i,a_i(t)}, σ²), ∀i ∈ ⟦d⟧. We consider values of k ∈ ⟦3, 5⟧ and values of d ∈ ⟦4⟧. We draw the expected values µ_{i,j} for i ∈ ⟦d⟧ and j ∈ ⟦k⟧ from a uniform distribution in the range [0.7, 1]. We fix σ = 0.1. It is worth noting that the results presented in the following are not comparable across different values of k and d, especially across different values of d. We evaluate the performance in terms of cumulative regret with T = 10^4, averaged over 50 trials.

Results In Figure 3, we present the cumulative regret for the F-UCB algorithm and the other bandit baselines. The value of k increases with the columns, and the value of d increases with the rows of the figure. The following comments are valid for all the considered values of k, as no unexpected or relevant behaviors emerge when we increase the number of actions for each action component. We observe that for d = 1, F-UCB achieves a cumulative regret that matches that of UCB1. This is expected, as F-UCB collapses to UCB1 for d = 1. RUCB-MoM achieves a sublinear regret, although higher than the previous algorithms, whereas TEA suffers a cumulative regret that is linear in the considered time horizon. The behavior changes for d = 2. F-UCB achieves a low cumulative regret. The cumulative regret of UCB1, instead, constantly increases over the time horizon. RUCB-MoM continues to achieve a sublinear regret, which is however higher, due to the increased cardinality of the equivalent action space and the amplified effect of the noise. The behavior of TEA remains the same as for d = 1. For d ≥ 3, we observe a stabilization of the behavior. F-UCB manages to achieve a cumulative regret that scales well as d and k increase. UCB1 now suffers a linear regret, RUCB-MoM a sublinear regret that worsens as d increases, and TEA behaves as in the previous cases.

E.2. Comparison of F-UCB and F-Track

In this part, we provide numerical simulations intended to compare F-UCB and F-Track in different scenarios. As discussed in Remark 4.1 and shown in Figure 2, the performance of F-UCB degrades when the number d of dimensions increases and when the suboptimality gaps are large. The goal of this part is to (i) verify once again this fact and (ii) observe whether F-Track is able to mitigate such a phenomenon.

Setting We consider the scenario in which the number of arms is constant across all dimensions, i.e., k_i = k, ∀i ∈ ⟦d⟧. Given our goal of verifying the algorithms' behavior w.r.t. the action vector dimensionality d and the magnitude of the suboptimality gaps, we fix the other parameters. We consider a scenario in which we have k = 2 and observations affected by i.i.d. Gaussian noise with σ = 0.5. We evaluate the two algorithms for d ∈ {2, 5, 10, 20, 30}. For what concerns the expected values, for all the dimensions, we enforce the first arm to be the best one, with expected value µ_{i,1} = µ*_i = 1, ∀i ∈ ⟦d⟧. The suboptimal arms all have the same expected value µ_{i,2} = 1 − ∆_{i,2}, ∀i ∈ ⟦d⟧. Such a value ∆_{i,2} is tested in the set {0.5, 0.7, 0.9}.
We evaluate the performance in terms of cumulative regret, averaged over 10 runs, with target time horizons T ∈ [10^4, 10^5]. We remark that F-UCB is an anytime algorithm and can be run once to obtain the entire curve of the cumulative regret. Instead, F-Track requires the knowledge of the horizon to compute the correct values of N_0 and ε_T. As such, we repeated the experiment for F-Track several times, each with a different time horizon up to the maximum T.

Results In Figure 4, we present the cumulative regret for F-UCB and F-Track in the above-mentioned setting. First, we observe that for small values of d (i.e., d ∈ {2, 5}), F-UCB outperforms F-Track for all the values of ∆_{i,2}. This behavior is less evident when we move to d = 10, where the performances become comparable, with an advantage for F-UCB for smaller values of ∆_{i,2}, while for larger values of the suboptimality gap, F-Track is better. The results turn in favor of F-Track when d becomes larger (i.e., d ∈ {20, 30}), and such an advantage further increases when ∆_{i,2} is large.

E.3. Robustness to Correlated Noise

In this part, we provide numerical simulations intended to compare F-UCB and F-Track when there is a correlation between the noises affecting the different dimensions. As discussed in Remark 3.1, in our setting, we require the observations to be non-correlated. Otherwise, the problem cannot be factored properly, given that, in general, if there is a correlation between the noises, we have:

E[ ∏_{i ∈ ⟦d⟧} x_i(t) ] ≠ ∏_{i ∈ ⟦d⟧} E[x_i(t)]. (170)

Setting We consider the scenario in which the number of arms is constant across all dimensions, i.e., k_i = k, ∀i ∈ ⟦d⟧. We consider k = 2 and d = 10. For what concerns the expected values, for all the dimensions, we enforce the first arm to be the best one, with expected value µ_{i,1} = µ*_i = 1, ∀i ∈ ⟦d⟧. The suboptimal arms all have the same expected value µ_{i,2} = 0.5, ∀i ∈ ⟦d⟧. In order to evaluate the behavior of the algorithms in the presence of correlation in the noise of the intermediate observations, we introduce a term α ∈ [0, 1] to control the interdependence of the intermediate observations.

Figure 3. Performance of F-UCB, UCB1, RUCB-MoM and TEA considering k ∈ ⟦3, 5⟧ and d ∈ ⟦4⟧ (50 runs, mean ± std).
Figure 4. Cumulative regret of F-UCB and F-Track considering k = 2, σ = 0.5, d ∈ {2, 5, 10, 20, 30}, and ∆_{i,2} ∈ {0.5, 0.7, 0.9}, ∀i ∈ ⟦d⟧ (10 runs, mean ± 2 std).

Figure 5. Monte Carlo estimates of the expected reward of the action vectors (1, …, 1) and (2, …, 2) for the tested values of the correlation parameter α ∈ {0, 0.2, 0.4, 0.6, 0.8, 1} (10^6 Monte Carlo simulations).

The additive noise applied to the observations x_i(t) is defined as α·η(t) + (1 − α)·ε_i(t), where η(t), ε_i(t) ∼ N(0, σ²). The noise term η(t) is applied to all the dimensions, whereas the ε_i(t) terms are individual and applied to the single dimensions i ∈ ⟦d⟧. Given this formulation, if α = 0 the intermediate observations are independent, while if α = 1 the intermediate observations are fully correlated. For values of α ∈ (0, 1), the noise in the intermediate observations comprises both a correlated term and an independent term. We consider the case in which the Gaussian noise with σ = 0.5 (for both the independent and correlated components) affects only the action components a_i = 2 (i.e., those with expected value µ_{i,2} = 0.5) for i ∈ ⟦d⟧. We consider values of α ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}. We evaluate the performance in terms of cumulative regret, averaged over 10 runs, with target time horizons T ∈ [10^4, 10^5].

Results Before commenting on the results, we observe that the presence of correlated noise over the action components a_i = 2 has the effect of changing the optimal action vector depending on the value of α. In Figure 5, we plot the expected reward of the action vectors (1, …, 1) and (2, …, 2), estimated using 10^6 Monte Carlo simulations, for the values of α under analysis. We consider just the two action vectors (1, …, 1) and (2, …, 2), given that all the other combinations of action components give intermediate (and suboptimal) results. We first observe that, since the observations of the action vector (1, …, 1) are not influenced by any noise, its expected reward is stable over α. On the other hand, for the action vector (2, …, 2), which is affected by noise, the expected reward increases as the correlation increases and eventually overtakes that of the action vector (1, …, 1). Moving to the simulations, Figure 6 shows a comparison of the performance of F-UCB and F-Track when we vary the correlation parameter α. First, we observe that the two algorithms present a consistent behavior over the different values of α. They are able to achieve satisfactory performance (i.e., sublinear regret) up to α = 0.6. Then, the regret degenerates to linear.
This is consistent with what we observed in Figure 5: these algorithms look at the expected values of the single action components, but in this case the noise correlation alters the optimal arm, which is no longer the one with the highest product of the expected observations.

Figure 6. Cumulative regret of F-UCB and F-Track considering k = 2, σ = 0.5, d = 5, ∆_{i,2} = 0.5, ∀i ∈ ⟦d⟧, and correlation parameter α ∈ {0, 0.2, 0.4, 0.6, 0.8, 1} (10 runs, mean ± 2 std).
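To make the effect behind Figure 5 concrete, the following minimal Python sketch estimates E[∏_{i ∈ ⟦d⟧} x_i(t)] via Monte Carlo under the correlated-noise model α·η(t) + (1 − α)·ε_i(t) described above, for the two action vectors (1, …, 1) and (2, …, 2). It is our illustrative code under the values reported in the Setting paragraph of Appendix E.3 (k = 2, d = 10, σ = 0.5, µ_{i,1} = 1 noiseless, µ_{i,2} = 0.5 noisy), not the authors' released scripts.

```python
import numpy as np

def expected_product(mu, noise_sigma, alpha, n_samples=10**6, seed=0):
    """Monte Carlo estimate of E[prod_i x_i] with x_i = mu_i + alpha*eta + (1-alpha)*eps_i."""
    rng = np.random.default_rng(seed)
    d = len(mu)
    eta = rng.normal(0.0, noise_sigma, size=(n_samples, 1))   # shared (correlated) noise term
    eps = rng.normal(0.0, noise_sigma, size=(n_samples, d))   # independent per-component noise
    x = np.asarray(mu) + alpha * eta + (1.0 - alpha) * eps    # intermediate observations
    return x.prod(axis=1).mean()

d, sigma = 10, 0.5
for alpha in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    r_best = 1.0                                              # a = (1,...,1) is noiseless: reward 1
    r_noisy = expected_product([0.5] * d, sigma, alpha)       # a = (2,...,2) carries the noise
    print(f"alpha={alpha:.1f}  a=(1,...,1): {r_best:.3f}  a=(2,...,2): {r_noisy:.3f}")
```

As α grows, the shared noise term η(t) makes the factors of a = (2, …, 2) positively dependent, so the expected product rises above ∏_i µ_{i,2} = 0.5^d and eventually exceeds 1, which is the overtaking behavior described in the Results paragraph.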